/*
 * CDDL HEADER START
 *
 * The contents of this file are subject to the terms of the
 * Common Development and Distribution License (the "License").
 * You may not use this file except in compliance with the License.
 *
 * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
 * or https://opensource.org/licenses/CDDL-1.0.
 * See the License for the specific language governing permissions
 * and limitations under the License.
 *
 * When distributing Covered Code, include this CDDL HEADER in each
 * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
 * If applicable, add the following below this CDDL HEADER, with the
 * fields enclosed by brackets "[]" replaced with your own identifying
 * information: Portions Copyright [yyyy] [name of copyright owner]
 *
 * CDDL HEADER END
 */
/*
 * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
 * Copyright (c) 2018, Joyent, Inc.
 * Copyright (c) 2011, 2020, Delphix. All rights reserved.
 * Copyright (c) 2014, Saso Kiselkov. All rights reserved.
 * Copyright (c) 2017, Nexenta Systems, Inc. All rights reserved.
 * Copyright (c) 2019, loli10K <ezomori.nozomu@gmail.com>. All rights reserved.
 * Copyright (c) 2020, George Amanakis. All rights reserved.
 * Copyright (c) 2019, 2023, Klara Inc.
 * Copyright (c) 2019, Allan Jude
 * Copyright (c) 2020, The FreeBSD Foundation [1]
 *
 * [1] Portions of this software were developed by Allan Jude
 *     under sponsorship from the FreeBSD Foundation.
 */

/*
 * DVA-based Adjustable Replacement Cache
 *
 * While much of the theory of operation used here is
 * based on the self-tuning, low overhead replacement cache
 * presented by Megiddo and Modha at FAST 2003, there are some
 * significant differences:
 *
 * 1. The Megiddo and Modha model assumes any page is evictable.
 * Pages in its cache cannot be "locked" into memory. This makes
 * the eviction algorithm simple: evict the last page in the list.
 * This also makes the performance characteristics easy to reason
 * about. Our cache is not so simple. At any given moment, some
 * subset of the blocks in the cache are un-evictable because we
 * have handed out a reference to them. Blocks are only evictable
 * when there are no external references active. This makes
 * eviction far more problematic: we choose to evict the evictable
 * blocks that are the "lowest" in the list.
 *
 * There are times when it is not possible to evict the requested
 * space. In these circumstances we are unable to adjust the cache
 * size. To prevent the cache growing unbounded at these times we
 * implement a "cache throttle" that slows the flow of new data
 * into the cache until we can make space available.
 *
 * 2. The Megiddo and Modha model assumes a fixed cache size.
 * Pages are evicted when the cache is full and there is a cache
 * miss. Our model has a variable sized cache. It grows with
 * high use, but also tries to react to memory pressure from the
 * operating system: decreasing its size when system memory is
 * tight.
 *
 * 3. The Megiddo and Modha model assumes a fixed page size. All
 * elements of the cache are therefore exactly the same size. So
 * when adjusting the cache size following a cache miss, it's simply
 * a matter of choosing a single page to evict. In our model, we
 * have variable sized cache blocks (ranging from 512 bytes to
 * 128K bytes). We therefore choose a set of blocks to evict to make
 * space for a cache miss that approximates as closely as possible
 * the space used by the new block.
 *
 * See also: "ARC: A Self-Tuning, Low Overhead Replacement Cache"
 * by N. Megiddo & D. Modha, FAST 2003
 */

/*
 * The locking model:
 *
 * A new reference to a cache buffer can be obtained in two
 * ways: 1) via a hash table lookup using the DVA as a key,
 * or 2) via one of the ARC lists. The arc_read() interface
 * uses method 1, while the internal ARC algorithms for
 * adjusting the cache use method 2. We therefore provide two
 * types of locks: 1) the hash table lock array, and 2) the
 * ARC list locks.
 *
 * Buffers do not have their own mutexes, rather they rely on the
 * hash table mutexes for the bulk of their protection (i.e. most
 * fields in the arc_buf_hdr_t are protected by these mutexes).
 *
 * buf_hash_find() returns the appropriate mutex (held) when it
 * locates the requested buffer in the hash table. It returns
 * NULL for the mutex if the buffer was not in the table.
 *
 * buf_hash_remove() expects the appropriate hash mutex to be
 * already held before it is invoked.
 *
 * Each ARC state also has a mutex which is used to protect the
 * buffer list associated with the state. When attempting to
 * obtain a hash table lock while holding an ARC list lock you
 * must use mutex_tryenter() to avoid deadlock. Also note that
 * the active state mutex must be held before the ghost state mutex.
 *
 * It is also possible to register a callback which is run when the
 * metadata limit is reached and no buffers can be safely evicted. In
 * this case the arc user should drop a reference on some arc buffers so
 * they can be reclaimed. For example, when using the ZPL each dentry
 * holds a reference on a znode. These dentries must be pruned before
 * the arc buffer holding the znode can be safely evicted.
 *
 * Note that the majority of the performance stats are manipulated
 * with atomic operations.
 *
 * The L2ARC uses the l2ad_mtx on each vdev for the following:
 *
 *	- L2ARC buflist creation
 *	- L2ARC buflist eviction
 *	- L2ARC write completion, which walks L2ARC buflists
 *	- ARC header destruction, as it removes from L2ARC buflists
 *	- ARC header release, as it removes from L2ARC buflists
 */
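
/*
 * For illustration only, a minimal sketch of the lookup pattern described
 * above (hypothetical caller and variable names; error handling omitted).
 * buf_hash_find() returns with the per-bucket hash lock held on success,
 * and the caller is responsible for dropping it:
 *
 *	kmutex_t *hash_lock;
 *	arc_buf_hdr_t *hdr = buf_hash_find(spa_guid, bp, &hash_lock);
 *	if (hdr != NULL) {
 *		... most arc_buf_hdr_t fields may be inspected here ...
 *		mutex_exit(hash_lock);
 *	}
 */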

/*
 * ARC operation:
 *
 * Every block that is in the ARC is tracked by an arc_buf_hdr_t structure.
 * This structure can point either to a block that is still in the cache or to
 * one that is only accessible in an L2 ARC device, or it can provide
 * information about a block that was recently evicted. If a block is
 * only accessible in the L2ARC, then the arc_buf_hdr_t only has enough
 * information to retrieve it from the L2ARC device. This information is
 * stored in the l2arc_buf_hdr_t sub-structure of the arc_buf_hdr_t. A block
 * that is in this state cannot access the data directly.
 *
 * Blocks that are actively being referenced or have not been evicted
 * are cached in the L1ARC. The L1ARC (l1arc_buf_hdr_t) is a structure within
 * the arc_buf_hdr_t that will point to the data block in memory. A block can
 * only be read by a consumer if it has an l1arc_buf_hdr_t. The L1ARC
 * caches data in two ways -- in a list of ARC buffers (arc_buf_t) and
 * also in the arc_buf_hdr_t's private physical data block pointer (b_pabd).
 *
 * The L1ARC's data pointer may or may not be uncompressed. The ARC has the
 * ability to store the physical data (b_pabd) associated with the DVA of the
 * arc_buf_hdr_t. Since the b_pabd is a copy of the on-disk physical block,
 * it will match its on-disk compression characteristics. This behavior can be
 * disabled by setting 'zfs_compressed_arc_enabled' to B_FALSE. When the
 * compressed ARC functionality is disabled, the b_pabd will point to an
 * uncompressed version of the on-disk data.
 *
 * Data in the L1ARC is not accessed by consumers of the ARC directly. Each
 * arc_buf_hdr_t can have multiple ARC buffers (arc_buf_t) which reference it.
 * Each ARC buffer (arc_buf_t) is being actively accessed by a specific ARC
 * consumer. The ARC will provide references to this data and will keep it
 * cached until it is no longer in use. The ARC caches only the L1ARC's physical
 * data block and will evict any arc_buf_t that is no longer referenced. The
 * amount of memory consumed by the arc_buf_ts' data buffers can be seen via the
 * "overhead_size" kstat.
 *
 * Depending on the consumer, an arc_buf_t can be requested in uncompressed or
 * compressed form. The typical case is that consumers will want uncompressed
 * data, and when that happens a new data buffer is allocated where the data is
 * decompressed for them to use. Currently the only consumer who wants
 * compressed arc_buf_t's is "zfs send", when it streams data exactly as it
 * exists on disk. When this happens, the arc_buf_t's data buffer is shared
 * with the arc_buf_hdr_t.
 *
 * Here is a diagram showing an arc_buf_hdr_t referenced by two arc_buf_t's. The
 * first one is owned by a compressed send consumer (and therefore references
 * the same compressed data buffer as the arc_buf_hdr_t) and the second could be
 * used by any other consumer (and has its own uncompressed copy of the data
 * buffer).
 *
 *   arc_buf_hdr_t
 *   +-----------+
 *   | fields    |
 *   | common to |
 *   | L1- and   |
 *   | L2ARC     |
 *   +-----------+
 *   | l2arc_buf_hdr_t
 *   |           |
 *   +-----------+
 *   | l1arc_buf_hdr_t
 *   |           |              arc_buf_t
 *   | b_buf     +------------>+-----------+      arc_buf_t
 *   | b_pabd    +-+           |b_next     +---->+-----------+
 *   +-----------+ |           |-----------|     |b_next     +-->NULL
 *                 |           |b_comp = T |     +-----------+
 *                 |           |b_data     +-+   |b_comp = F |
 *                 |           +-----------+ |   |b_data     +-+
 *                 +->+------+               |   +-----------+ |
 *        compressed  |      |               |                 |
 *           data     |      |<--------------+                 | uncompressed
 *                    +------+  compressed,                    |     data
 *                                shared                       +-->+------+
 *                                 data                            |      |
 *                                                                 |      |
 *                                                                 +------+
 *
 * When a consumer reads a block, the ARC must first look to see if the
 * arc_buf_hdr_t is cached. If the hdr is cached then the ARC allocates a new
 * arc_buf_t and either copies uncompressed data into a new data buffer from an
 * existing uncompressed arc_buf_t, decompresses the hdr's b_pabd buffer into a
 * new data buffer, or shares the hdr's b_pabd buffer, depending on whether the
 * hdr is compressed and the desired compression characteristics of the
 * arc_buf_t consumer. If the arc_buf_t ends up sharing data with the
 * arc_buf_hdr_t and both of them are uncompressed then the arc_buf_t must be
 * the last buffer in the hdr's b_buf list, however a shared compressed buf can
 * be anywhere in the hdr's list.
 *
 * The diagram below shows an example of an uncompressed ARC hdr that is
 * sharing its data with an arc_buf_t (note that the shared uncompressed buf is
 * the last element in the buf list):
 *
 *                arc_buf_hdr_t
 *                +-----------+
 *                |           |
 *                |           |
 *                |           |
 *                +-----------+
 * l2arc_buf_hdr_t|           |
 *                |           |
 *                +-----------+
 * l1arc_buf_hdr_t|           |
 *                |           |                 arc_buf_t    (shared)
 *                |    b_buf  +------------>+---------+      arc_buf_t
 *                |           |             |b_next   +---->+---------+
 *                |  b_pabd   +-+           |---------|     |b_next   +-->NULL
 *                +-----------+ |           |         |     +---------+
 *                              |           |b_data   +-+   |         |
 *                              |           +---------+ |   |b_data   +-+
 *                              +->+------+             |   +---------+ |
 *                                 |      |             |               |
 *                   uncompressed  |      |             |               |
 *                        data     +------+             |               |
 *                                    ^                 +->+------+     |
 *                                    |    uncompressed    |      |     |
 *                                    |        data        |      |     |
 *                                    |                    +------+     |
 *                                    +---------------------------------+
 *
 * Writing to the ARC requires that the ARC first discard the hdr's b_pabd
 * since the physical block is about to be rewritten. The new data contents
 * will be contained in the arc_buf_t. As the I/O pipeline performs the write,
 * it may compress the data before writing it to disk. The ARC will be called
 * with the transformed data and will memcpy the transformed on-disk block into
 * a newly allocated b_pabd. Writes are always done into buffers which have
 * either been loaned (and hence are new and don't have other readers) or
 * buffers which have been released (and hence have their own hdr, if there
 * were originally other readers of the buf's original hdr). This ensures that
 * the ARC only needs to update a single buf and its hdr after a write occurs.
 *
 * When the L2ARC is in use, it will also take advantage of the b_pabd. The
 * L2ARC will always write the contents of b_pabd to the L2ARC. This means
 * that when compressed ARC is enabled, the L2ARC blocks are identical
 * to the on-disk block in the main data pool. This provides a significant
 * advantage since the ARC can leverage the bp's checksum when reading from the
 * L2ARC to determine if the contents are valid. However, if the compressed
 * ARC is disabled, then the L2ARC's block must be transformed to look
 * like the physical block in the main data pool before comparing the
 * checksum and determining its validity.
 *
 * The L1ARC has a slightly different system for storing encrypted data.
 * Raw (encrypted + possibly compressed) data has a few subtle differences from
 * data that is just compressed. The biggest difference is that it is not
 * possible to decrypt encrypted data (or vice-versa) if the keys aren't loaded.
 * The other difference is that encryption cannot be treated as a suggestion.
 * If a caller would prefer compressed data, but they actually wind up with
 * uncompressed data, the worst thing that could happen is there might be a
 * performance hit. If the caller requests encrypted data, however, we must be
 * sure they actually get it or else secret information could be leaked. Raw
 * data is stored in hdr->b_crypt_hdr.b_rabd. An encrypted header, therefore,
 * may have both an encrypted version and a decrypted version of its data at
 * once. When a caller needs a raw arc_buf_t, it is allocated and the data is
 * copied out of this header. To avoid complications with b_pabd, raw buffers
 * cannot be shared.
 */
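
/*
 * To make the diagrams above concrete: each arc_buf_hdr_t anchors a singly
 * linked list of its consumers' buffers. A sketch of walking that list (the
 * same pattern several helpers later in this file use; illustrative only):
 *
 *	for (arc_buf_t *buf = hdr->b_l1hdr.b_buf; buf != NULL;
 *	    buf = buf->b_next) {
 *		if (ARC_BUF_SHARED(buf))
 *			continue;	(this buf shares b_pabd with the hdr)
 *		...
 *	}
 */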

#include <sys/spa.h>
#include <sys/zio.h>
#include <sys/spa_impl.h>
#include <sys/zio_compress.h>
#include <sys/zio_checksum.h>
#include <sys/zfs_context.h>
#include <sys/arc.h>
#include <sys/zfs_refcount.h>
#include <sys/vdev.h>
#include <sys/vdev_impl.h>
#include <sys/dsl_pool.h>
#include <sys/multilist.h>
#include <sys/abd.h>
#include <sys/zil.h>
#include <sys/fm/fs/zfs.h>
#include <sys/callb.h>
#include <sys/kstat.h>
#include <sys/zthr.h>
#include <zfs_fletcher.h>
#include <sys/arc_impl.h>
#include <sys/trace_zfs.h>
#include <sys/aggsum.h>
#include <sys/wmsum.h>
#include <cityhash.h>
#include <sys/vdev_trim.h>
#include <sys/zfs_racct.h>
#include <sys/zstd/zstd.h>

#ifndef _KERNEL
/* set with ZFS_DEBUG=watch, to enable watchpoints on frozen buffers */
boolean_t arc_watch = B_FALSE;
#endif

/*
 * This thread's job is to keep enough free memory in the system, by
 * calling arc_kmem_reap_soon() plus arc_reduce_target_size(), which improves
 * arc_available_memory().
 */
static zthr_t *arc_reap_zthr;

/*
 * This thread's job is to keep arc_size under arc_c, by calling
 * arc_evict(), which improves arc_is_overflowing().
 */
static zthr_t *arc_evict_zthr;
static arc_buf_hdr_t **arc_state_evict_markers;
static int arc_state_evict_marker_count;

static kmutex_t arc_evict_lock;
static boolean_t arc_evict_needed = B_FALSE;
static clock_t arc_last_uncached_flush;

/*
 * Count of bytes evicted since boot.
 */
static uint64_t arc_evict_count;

/*
 * List of arc_evict_waiter_t's, representing threads waiting for the
 * arc_evict_count to reach specific values.
 */
static list_t arc_evict_waiters;

/*
 * When arc_is_overflowing(), arc_get_data_impl() waits for this percent of
 * the requested amount of data to be evicted. For example, by default for
 * every 2KB that's evicted, 1KB of it may be "reused" by a new allocation.
 * Since this is above 100%, it ensures that progress is made towards getting
 * arc_size under arc_c. Since this is finite, it ensures that allocations
 * can still happen, even during the potentially long time that arc_size is
 * more than arc_c.
 */
static uint_t zfs_arc_eviction_pct = 200;
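
/*
 * Worked example (illustrative only): with the default zfs_arc_eviction_pct
 * of 200, a thread that needs 64KB while the ARC is overflowing waits until
 * 64KB * 200 / 100 = 128KB has been evicted, so at most half of the evicted
 * space can be consumed again by new allocations.
 */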

/*
 * The number of headers to evict in arc_evict_state_impl() before
 * dropping the sublist lock and evicting from another sublist. A lower
 * value means we're more likely to evict the "correct" header (i.e. the
 * oldest header in the arc state), but comes with higher overhead
 * (i.e. more invocations of arc_evict_state_impl()).
 */
static uint_t zfs_arc_evict_batch_limit = 10;

/* number of seconds before growing cache again */
uint_t arc_grow_retry = 5;

/*
 * Minimum time between calls to arc_kmem_reap_soon().
 */
static const int arc_kmem_cache_reap_retry_ms = 1000;

/* shift of arc_c for calculating overflow limit in arc_get_data_impl */
static int zfs_arc_overflow_shift = 8;

/* log2(fraction of arc to reclaim) */
uint_t arc_shrink_shift = 7;

/* percent of pagecache to reclaim arc to */
#ifdef _KERNEL
uint_t zfs_arc_pc_percent = 0;
#endif

/*
 * log2(fraction of ARC which must be free to allow growing).
 * I.e. if there is less than arc_c >> arc_no_grow_shift free memory,
 * when reading a new block into the ARC, we will evict an equal-sized block
 * from the ARC.
 *
 * This must be less than arc_shrink_shift, so that when we shrink the ARC,
 * we will still not allow it to grow.
 */
uint_t arc_no_grow_shift = 5;


/*
 * minimum lifespan of a prefetch block in clock ticks
 * (initialized in arc_init())
 */
static uint_t arc_min_prefetch_ms;
static uint_t arc_min_prescient_prefetch_ms;

/*
 * If this percent of memory is free, don't throttle.
 */
uint_t arc_lotsfree_percent = 10;

/*
 * The arc has filled available memory and has now warmed up.
 */
boolean_t arc_warm;

/*
 * These tunables are for performance analysis.
 */
uint64_t zfs_arc_max = 0;
uint64_t zfs_arc_min = 0;
static uint64_t zfs_arc_dnode_limit = 0;
static uint_t zfs_arc_dnode_reduce_percent = 10;
static uint_t zfs_arc_grow_retry = 0;
static uint_t zfs_arc_shrink_shift = 0;
uint_t zfs_arc_average_blocksize = 8 * 1024; /* 8KB */

/*
 * ARC dirty data constraints for arc_tempreserve_space() throttle:
 * * total dirty data limit
 * * anon block dirty limit
 * * each pool's anon allowance
 */
static const unsigned long zfs_arc_dirty_limit_percent = 50;
static const unsigned long zfs_arc_anon_limit_percent = 25;
static const unsigned long zfs_arc_pool_dirty_percent = 20;

/*
 * Enable or disable compressed arc buffers.
 */
int zfs_compressed_arc_enabled = B_TRUE;

/*
 * Balance between metadata and data on ghost hits. Values above 100
 * increase metadata caching by proportionally reducing effect of ghost
 * data hits on target data/metadata rate.
 */
static uint_t zfs_arc_meta_balance = 500;

/*
 * Percentage that can be consumed by dnodes of ARC meta buffers.
 */
static uint_t zfs_arc_dnode_limit_percent = 10;

/*
 * These tunables are Linux-specific
 */
static uint64_t zfs_arc_sys_free = 0;
static uint_t zfs_arc_min_prefetch_ms = 0;
static uint_t zfs_arc_min_prescient_prefetch_ms = 0;
static uint_t zfs_arc_lotsfree_percent = 10;

/*
 * Number of arc_prune threads
 */
static int zfs_arc_prune_task_threads = 1;

/* The 7 states: */
arc_state_t ARC_anon;
arc_state_t ARC_mru;
arc_state_t ARC_mru_ghost;
arc_state_t ARC_mfu;
arc_state_t ARC_mfu_ghost;
arc_state_t ARC_l2c_only;
arc_state_t ARC_uncached;

arc_stats_t arc_stats = {
	{ "hits", KSTAT_DATA_UINT64 },
	{ "iohits", KSTAT_DATA_UINT64 },
	{ "misses", KSTAT_DATA_UINT64 },
	{ "demand_data_hits", KSTAT_DATA_UINT64 },
	{ "demand_data_iohits", KSTAT_DATA_UINT64 },
	{ "demand_data_misses", KSTAT_DATA_UINT64 },
	{ "demand_metadata_hits", KSTAT_DATA_UINT64 },
	{ "demand_metadata_iohits", KSTAT_DATA_UINT64 },
	{ "demand_metadata_misses", KSTAT_DATA_UINT64 },
	{ "prefetch_data_hits", KSTAT_DATA_UINT64 },
	{ "prefetch_data_iohits", KSTAT_DATA_UINT64 },
	{ "prefetch_data_misses", KSTAT_DATA_UINT64 },
	{ "prefetch_metadata_hits", KSTAT_DATA_UINT64 },
	{ "prefetch_metadata_iohits", KSTAT_DATA_UINT64 },
	{ "prefetch_metadata_misses", KSTAT_DATA_UINT64 },
	{ "mru_hits", KSTAT_DATA_UINT64 },
	{ "mru_ghost_hits", KSTAT_DATA_UINT64 },
	{ "mfu_hits", KSTAT_DATA_UINT64 },
	{ "mfu_ghost_hits", KSTAT_DATA_UINT64 },
	{ "uncached_hits", KSTAT_DATA_UINT64 },
	{ "deleted", KSTAT_DATA_UINT64 },
	{ "mutex_miss", KSTAT_DATA_UINT64 },
	{ "access_skip", KSTAT_DATA_UINT64 },
	{ "evict_skip", KSTAT_DATA_UINT64 },
	{ "evict_not_enough", KSTAT_DATA_UINT64 },
	{ "evict_l2_cached", KSTAT_DATA_UINT64 },
	{ "evict_l2_eligible", KSTAT_DATA_UINT64 },
	{ "evict_l2_eligible_mfu", KSTAT_DATA_UINT64 },
	{ "evict_l2_eligible_mru", KSTAT_DATA_UINT64 },
	{ "evict_l2_ineligible", KSTAT_DATA_UINT64 },
	{ "evict_l2_skip", KSTAT_DATA_UINT64 },
	{ "hash_elements", KSTAT_DATA_UINT64 },
	{ "hash_elements_max", KSTAT_DATA_UINT64 },
	{ "hash_collisions", KSTAT_DATA_UINT64 },
	{ "hash_chains", KSTAT_DATA_UINT64 },
	{ "hash_chain_max", KSTAT_DATA_UINT64 },
	{ "meta", KSTAT_DATA_UINT64 },
	{ "pd", KSTAT_DATA_UINT64 },
	{ "pm", KSTAT_DATA_UINT64 },
	{ "c", KSTAT_DATA_UINT64 },
	{ "c_min", KSTAT_DATA_UINT64 },
	{ "c_max", KSTAT_DATA_UINT64 },
	{ "size", KSTAT_DATA_UINT64 },
	{ "compressed_size", KSTAT_DATA_UINT64 },
	{ "uncompressed_size", KSTAT_DATA_UINT64 },
	{ "overhead_size", KSTAT_DATA_UINT64 },
	{ "hdr_size", KSTAT_DATA_UINT64 },
	{ "data_size", KSTAT_DATA_UINT64 },
	{ "metadata_size", KSTAT_DATA_UINT64 },
	{ "dbuf_size", KSTAT_DATA_UINT64 },
	{ "dnode_size", KSTAT_DATA_UINT64 },
	{ "bonus_size", KSTAT_DATA_UINT64 },
#if defined(COMPAT_FREEBSD11)
	{ "other_size", KSTAT_DATA_UINT64 },
#endif
	{ "anon_size", KSTAT_DATA_UINT64 },
	{ "anon_data", KSTAT_DATA_UINT64 },
	{ "anon_metadata", KSTAT_DATA_UINT64 },
	{ "anon_evictable_data", KSTAT_DATA_UINT64 },
	{ "anon_evictable_metadata", KSTAT_DATA_UINT64 },
	{ "mru_size", KSTAT_DATA_UINT64 },
	{ "mru_data", KSTAT_DATA_UINT64 },
	{ "mru_metadata", KSTAT_DATA_UINT64 },
	{ "mru_evictable_data", KSTAT_DATA_UINT64 },
	{ "mru_evictable_metadata", KSTAT_DATA_UINT64 },
	{ "mru_ghost_size", KSTAT_DATA_UINT64 },
	{ "mru_ghost_data", KSTAT_DATA_UINT64 },
	{ "mru_ghost_metadata", KSTAT_DATA_UINT64 },
	{ "mru_ghost_evictable_data", KSTAT_DATA_UINT64 },
	{ "mru_ghost_evictable_metadata", KSTAT_DATA_UINT64 },
	{ "mfu_size", KSTAT_DATA_UINT64 },
	{ "mfu_data", KSTAT_DATA_UINT64 },
	{ "mfu_metadata", KSTAT_DATA_UINT64 },
	{ "mfu_evictable_data", KSTAT_DATA_UINT64 },
	{ "mfu_evictable_metadata", KSTAT_DATA_UINT64 },
	{ "mfu_ghost_size", KSTAT_DATA_UINT64 },
	{ "mfu_ghost_data", KSTAT_DATA_UINT64 },
	{ "mfu_ghost_metadata", KSTAT_DATA_UINT64 },
	{ "mfu_ghost_evictable_data", KSTAT_DATA_UINT64 },
	{ "mfu_ghost_evictable_metadata", KSTAT_DATA_UINT64 },
	{ "uncached_size", KSTAT_DATA_UINT64 },
	{ "uncached_data", KSTAT_DATA_UINT64 },
	{ "uncached_metadata", KSTAT_DATA_UINT64 },
	{ "uncached_evictable_data", KSTAT_DATA_UINT64 },
	{ "uncached_evictable_metadata", KSTAT_DATA_UINT64 },
	{ "l2_hits", KSTAT_DATA_UINT64 },
	{ "l2_misses", KSTAT_DATA_UINT64 },
	{ "l2_prefetch_asize", KSTAT_DATA_UINT64 },
	{ "l2_mru_asize", KSTAT_DATA_UINT64 },
	{ "l2_mfu_asize", KSTAT_DATA_UINT64 },
	{ "l2_bufc_data_asize", KSTAT_DATA_UINT64 },
	{ "l2_bufc_metadata_asize", KSTAT_DATA_UINT64 },
	{ "l2_feeds", KSTAT_DATA_UINT64 },
	{ "l2_rw_clash", KSTAT_DATA_UINT64 },
	{ "l2_read_bytes", KSTAT_DATA_UINT64 },
	{ "l2_write_bytes", KSTAT_DATA_UINT64 },
	{ "l2_writes_sent", KSTAT_DATA_UINT64 },
	{ "l2_writes_done", KSTAT_DATA_UINT64 },
	{ "l2_writes_error", KSTAT_DATA_UINT64 },
	{ "l2_writes_lock_retry", KSTAT_DATA_UINT64 },
	{ "l2_evict_lock_retry", KSTAT_DATA_UINT64 },
	{ "l2_evict_reading", KSTAT_DATA_UINT64 },
	{ "l2_evict_l1cached", KSTAT_DATA_UINT64 },
	{ "l2_free_on_write", KSTAT_DATA_UINT64 },
	{ "l2_abort_lowmem", KSTAT_DATA_UINT64 },
	{ "l2_cksum_bad", KSTAT_DATA_UINT64 },
	{ "l2_io_error", KSTAT_DATA_UINT64 },
	{ "l2_size", KSTAT_DATA_UINT64 },
	{ "l2_asize", KSTAT_DATA_UINT64 },
	{ "l2_hdr_size", KSTAT_DATA_UINT64 },
	{ "l2_log_blk_writes", KSTAT_DATA_UINT64 },
	{ "l2_log_blk_avg_asize", KSTAT_DATA_UINT64 },
	{ "l2_log_blk_asize", KSTAT_DATA_UINT64 },
	{ "l2_log_blk_count", KSTAT_DATA_UINT64 },
	{ "l2_data_to_meta_ratio", KSTAT_DATA_UINT64 },
	{ "l2_rebuild_success", KSTAT_DATA_UINT64 },
	{ "l2_rebuild_unsupported", KSTAT_DATA_UINT64 },
	{ "l2_rebuild_io_errors", KSTAT_DATA_UINT64 },
	{ "l2_rebuild_dh_errors", KSTAT_DATA_UINT64 },
	{ "l2_rebuild_cksum_lb_errors", KSTAT_DATA_UINT64 },
	{ "l2_rebuild_lowmem", KSTAT_DATA_UINT64 },
	{ "l2_rebuild_size", KSTAT_DATA_UINT64 },
	{ "l2_rebuild_asize", KSTAT_DATA_UINT64 },
	{ "l2_rebuild_bufs", KSTAT_DATA_UINT64 },
	{ "l2_rebuild_bufs_precached", KSTAT_DATA_UINT64 },
	{ "l2_rebuild_log_blks", KSTAT_DATA_UINT64 },
	{ "memory_throttle_count", KSTAT_DATA_UINT64 },
	{ "memory_direct_count", KSTAT_DATA_UINT64 },
	{ "memory_indirect_count", KSTAT_DATA_UINT64 },
	{ "memory_all_bytes", KSTAT_DATA_UINT64 },
	{ "memory_free_bytes", KSTAT_DATA_UINT64 },
	{ "memory_available_bytes", KSTAT_DATA_INT64 },
	{ "arc_no_grow", KSTAT_DATA_UINT64 },
	{ "arc_tempreserve", KSTAT_DATA_UINT64 },
	{ "arc_loaned_bytes", KSTAT_DATA_UINT64 },
	{ "arc_prune", KSTAT_DATA_UINT64 },
	{ "arc_meta_used", KSTAT_DATA_UINT64 },
	{ "arc_dnode_limit", KSTAT_DATA_UINT64 },
	{ "async_upgrade_sync", KSTAT_DATA_UINT64 },
	{ "predictive_prefetch", KSTAT_DATA_UINT64 },
	{ "demand_hit_predictive_prefetch", KSTAT_DATA_UINT64 },
	{ "demand_iohit_predictive_prefetch", KSTAT_DATA_UINT64 },
	{ "prescient_prefetch", KSTAT_DATA_UINT64 },
	{ "demand_hit_prescient_prefetch", KSTAT_DATA_UINT64 },
	{ "demand_iohit_prescient_prefetch", KSTAT_DATA_UINT64 },
	{ "arc_need_free", KSTAT_DATA_UINT64 },
	{ "arc_sys_free", KSTAT_DATA_UINT64 },
	{ "arc_raw_size", KSTAT_DATA_UINT64 },
	{ "cached_only_in_progress", KSTAT_DATA_UINT64 },
	{ "abd_chunk_waste_size", KSTAT_DATA_UINT64 },
};

arc_sums_t arc_sums;

#define ARCSTAT_MAX(stat, val) { \
	uint64_t m; \
	while ((val) > (m = arc_stats.stat.value.ui64) && \
	    (m != atomic_cas_64(&arc_stats.stat.value.ui64, m, (val)))) \
		continue; \
}

/*
 * We define a macro to allow ARC hits/misses to be easily broken down by
 * two separate conditions, giving a total of four different subtypes for
 * each of hits and misses (so eight statistics total).
 */
#define ARCSTAT_CONDSTAT(cond1, stat1, notstat1, cond2, stat2, notstat2, stat) \
	if (cond1) { \
		if (cond2) { \
			ARCSTAT_BUMP(arcstat_##stat1##_##stat2##_##stat); \
		} else { \
			ARCSTAT_BUMP(arcstat_##stat1##_##notstat2##_##stat); \
		} \
	} else { \
		if (cond2) { \
			ARCSTAT_BUMP(arcstat_##notstat1##_##stat2##_##stat); \
		} else { \
			ARCSTAT_BUMP(arcstat_##notstat1##_##notstat2##_##stat);\
		} \
	}

/*
 * This macro allows us to use kstats as floating averages. Each time we
 * update this kstat, we first factor it and the update value by
 * ARCSTAT_AVG_FACTOR to shrink the new value's contribution to the overall
 * average. This macro assumes that integer loads and stores are atomic, but
 * is not safe for multiple writers updating the kstat in parallel (only the
 * last writer's update will remain).
 */
#define ARCSTAT_F_AVG_FACTOR	3
#define ARCSTAT_F_AVG(stat, value) \
	do { \
		uint64_t x = ARCSTAT(stat); \
		x = x - x / ARCSTAT_F_AVG_FACTOR + \
		    (value) / ARCSTAT_F_AVG_FACTOR; \
		ARCSTAT(stat) = x; \
	} while (0)
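
/*
 * Illustrative only: a call such as
 *
 *	ARCSTAT_CONDSTAT(is_demand, demand, prefetch,
 *	    is_metadata, metadata, data, hits);
 *
 * bumps exactly one of arcstat_demand_metadata_hits,
 * arcstat_demand_data_hits, arcstat_prefetch_metadata_hits or
 * arcstat_prefetch_data_hits, matching the kstat names in arc_stats above
 * (is_demand and is_metadata are hypothetical condition names). Likewise,
 * with ARCSTAT_F_AVG_FACTOR of 3, ARCSTAT_F_AVG(stat, v) moves the stored
 * average one third of the way toward v on each update (x' = x - x/3 + v/3).
 */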

static kstat_t *arc_ksp;

/*
 * There are several ARC variables that are critical to export as kstats --
 * but we don't want to have to grovel around in the kstat whenever we wish to
 * manipulate them. For these variables, we therefore define them to be in
 * terms of the statistic variable. This assures that we are not introducing
 * the possibility of inconsistency by having shadow copies of the variables,
 * while still allowing the code to be readable.
 */
#define arc_tempreserve	ARCSTAT(arcstat_tempreserve)
#define arc_loaned_bytes	ARCSTAT(arcstat_loaned_bytes)
#define arc_dnode_limit	ARCSTAT(arcstat_dnode_limit) /* max size for dnodes */
#define arc_need_free	ARCSTAT(arcstat_need_free) /* waiting to be evicted */

hrtime_t arc_growtime;
list_t arc_prune_list;
kmutex_t arc_prune_mtx;
taskq_t *arc_prune_taskq;

#define GHOST_STATE(state) \
	((state) == arc_mru_ghost || (state) == arc_mfu_ghost || \
	(state) == arc_l2c_only)

#define HDR_IN_HASH_TABLE(hdr)	((hdr)->b_flags & ARC_FLAG_IN_HASH_TABLE)
#define HDR_IO_IN_PROGRESS(hdr)	((hdr)->b_flags & ARC_FLAG_IO_IN_PROGRESS)
#define HDR_IO_ERROR(hdr)	((hdr)->b_flags & ARC_FLAG_IO_ERROR)
#define HDR_PREFETCH(hdr)	((hdr)->b_flags & ARC_FLAG_PREFETCH)
#define HDR_PRESCIENT_PREFETCH(hdr) \
	((hdr)->b_flags & ARC_FLAG_PRESCIENT_PREFETCH)
#define HDR_COMPRESSION_ENABLED(hdr) \
	((hdr)->b_flags & ARC_FLAG_COMPRESSED_ARC)

#define HDR_L2CACHE(hdr)	((hdr)->b_flags & ARC_FLAG_L2CACHE)
#define HDR_UNCACHED(hdr)	((hdr)->b_flags & ARC_FLAG_UNCACHED)
#define HDR_L2_READING(hdr) \
	(((hdr)->b_flags & ARC_FLAG_IO_IN_PROGRESS) && \
	((hdr)->b_flags & ARC_FLAG_HAS_L2HDR))
#define HDR_L2_WRITING(hdr)	((hdr)->b_flags & ARC_FLAG_L2_WRITING)
#define HDR_L2_EVICTED(hdr)	((hdr)->b_flags & ARC_FLAG_L2_EVICTED)
#define HDR_L2_WRITE_HEAD(hdr)	((hdr)->b_flags & ARC_FLAG_L2_WRITE_HEAD)
#define HDR_PROTECTED(hdr)	((hdr)->b_flags & ARC_FLAG_PROTECTED)
#define HDR_NOAUTH(hdr)		((hdr)->b_flags & ARC_FLAG_NOAUTH)
#define HDR_SHARED_DATA(hdr)	((hdr)->b_flags & ARC_FLAG_SHARED_DATA)

#define HDR_ISTYPE_METADATA(hdr) \
	((hdr)->b_flags & ARC_FLAG_BUFC_METADATA)
#define HDR_ISTYPE_DATA(hdr)	(!HDR_ISTYPE_METADATA(hdr))

#define HDR_HAS_L1HDR(hdr)	((hdr)->b_flags & ARC_FLAG_HAS_L1HDR)
#define HDR_HAS_L2HDR(hdr)	((hdr)->b_flags & ARC_FLAG_HAS_L2HDR)
#define HDR_HAS_RABD(hdr) \
	(HDR_HAS_L1HDR(hdr) && HDR_PROTECTED(hdr) && \
	(hdr)->b_crypt_hdr.b_rabd != NULL)
#define HDR_ENCRYPTED(hdr) \
	(HDR_PROTECTED(hdr) && DMU_OT_IS_ENCRYPTED((hdr)->b_crypt_hdr.b_ot))
#define HDR_AUTHENTICATED(hdr) \
	(HDR_PROTECTED(hdr) && !DMU_OT_IS_ENCRYPTED((hdr)->b_crypt_hdr.b_ot))

/* For storing compression mode in b_flags */
#define HDR_COMPRESS_OFFSET	(highbit64(ARC_FLAG_COMPRESS_0) - 1)

#define HDR_GET_COMPRESS(hdr)	((enum zio_compress)BF32_GET((hdr)->b_flags, \
	HDR_COMPRESS_OFFSET, SPA_COMPRESSBITS))
#define HDR_SET_COMPRESS(hdr, cmp)	BF32_SET((hdr)->b_flags, \
	HDR_COMPRESS_OFFSET, SPA_COMPRESSBITS, (cmp));

#define ARC_BUF_LAST(buf)	((buf)->b_next == NULL)
#define ARC_BUF_SHARED(buf)	((buf)->b_flags & ARC_BUF_FLAG_SHARED)
#define ARC_BUF_COMPRESSED(buf)	((buf)->b_flags & ARC_BUF_FLAG_COMPRESSED)
#define ARC_BUF_ENCRYPTED(buf)	((buf)->b_flags & ARC_BUF_FLAG_ENCRYPTED)

/*
 * Other sizes
 */

#define HDR_FULL_SIZE ((int64_t)sizeof (arc_buf_hdr_t))
#define HDR_L2ONLY_SIZE ((int64_t)offsetof(arc_buf_hdr_t, b_l1hdr))

/*
 * Hash table routines
 */

#define BUF_LOCKS 2048
typedef struct buf_hash_table {
	uint64_t ht_mask;
	arc_buf_hdr_t **ht_table;
	kmutex_t ht_locks[BUF_LOCKS] ____cacheline_aligned;
} buf_hash_table_t;

static buf_hash_table_t buf_hash_table;

#define BUF_HASH_INDEX(spa, dva, birth) \
	(buf_hash(spa, dva, birth) & buf_hash_table.ht_mask)
#define BUF_HASH_LOCK(idx) (&buf_hash_table.ht_locks[idx & (BUF_LOCKS-1)])
#define HDR_LOCK(hdr) \
	(BUF_HASH_LOCK(BUF_HASH_INDEX(hdr->b_spa, &hdr->b_dva, hdr->b_birth)))
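
/*
 * Illustrative only: given a header whose identity (b_spa, b_dva, b_birth)
 * is already known, the macros above locate the per-bucket lock that guards
 * it, e.g.:
 *
 *	kmutex_t *hash_lock = HDR_LOCK(hdr);
 *	mutex_enter(hash_lock);
 *	... hdr may be examined or have its flags changed here ...
 *	mutex_exit(hash_lock);
 */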

uint64_t zfs_crc64_table[256];

/*
 * Level 2 ARC
 */

#define L2ARC_WRITE_SIZE	(32 * 1024 * 1024)	/* initial write max */
#define L2ARC_HEADROOM		8			/* num of writes */

/*
 * If we discover during ARC scan any buffers to be compressed, we boost
 * our headroom for the next scanning cycle by this percentage multiple.
 */
#define L2ARC_HEADROOM_BOOST	200
#define L2ARC_FEED_SECS		1		/* caching interval secs */
#define L2ARC_FEED_MIN_MS	200		/* min caching interval ms */

/*
 * We can feed L2ARC from two states of ARC buffers, mru and mfu,
 * and each of these states has two types: data and metadata.
 */
#define L2ARC_FEED_TYPES	4

/* L2ARC Performance Tunables */
uint64_t l2arc_write_max = L2ARC_WRITE_SIZE;	/* def max write size */
uint64_t l2arc_write_boost = L2ARC_WRITE_SIZE;	/* extra warmup write */
uint64_t l2arc_headroom = L2ARC_HEADROOM;	/* # of dev writes */
uint64_t l2arc_headroom_boost = L2ARC_HEADROOM_BOOST;
uint64_t l2arc_feed_secs = L2ARC_FEED_SECS;	/* interval seconds */
uint64_t l2arc_feed_min_ms = L2ARC_FEED_MIN_MS;	/* min interval msecs */
int l2arc_noprefetch = B_TRUE;			/* don't cache prefetch bufs */
int l2arc_feed_again = B_TRUE;			/* turbo warmup */
int l2arc_norw = B_FALSE;			/* no reads during writes */
static uint_t l2arc_meta_percent = 33;		/* limit on headers size */
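
/*
 * Worked example (illustrative only; exact scan depth is an assumption):
 * the headroom tunables above are expressed in units of device writes, so
 * with the defaults a feed cycle scans roughly l2arc_write_max *
 * l2arc_headroom = 32MB * 8 = 256MB from the tail of each eligible ARC
 * list, doubled (L2ARC_HEADROOM_BOOST = 200%) for the next cycle when
 * compressed buffers were found.
 */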

/*
 * L2ARC Internals
 */
static list_t L2ARC_dev_list;			/* device list */
static list_t *l2arc_dev_list;			/* device list pointer */
static kmutex_t l2arc_dev_mtx;			/* device list mutex */
static l2arc_dev_t *l2arc_dev_last;		/* last device used */
static list_t L2ARC_free_on_write;		/* free after write buf list */
static list_t *l2arc_free_on_write;		/* free after write list ptr */
static kmutex_t l2arc_free_on_write_mtx;	/* mutex for list */
static uint64_t l2arc_ndev;			/* number of devices */

typedef struct l2arc_read_callback {
	arc_buf_hdr_t		*l2rcb_hdr;	/* read header */
	blkptr_t		l2rcb_bp;	/* original blkptr */
	zbookmark_phys_t	l2rcb_zb;	/* original bookmark */
	int			l2rcb_flags;	/* original flags */
	abd_t			*l2rcb_abd;	/* temporary buffer */
} l2arc_read_callback_t;

typedef struct l2arc_data_free {
	/* protected by l2arc_free_on_write_mtx */
	abd_t		*l2df_abd;
	size_t		l2df_size;
	arc_buf_contents_t l2df_type;
	list_node_t	l2df_list_node;
} l2arc_data_free_t;

typedef enum arc_fill_flags {
	ARC_FILL_LOCKED		= 1 << 0, /* hdr lock is held */
	ARC_FILL_COMPRESSED	= 1 << 1, /* fill with compressed data */
	ARC_FILL_ENCRYPTED	= 1 << 2, /* fill with encrypted data */
	ARC_FILL_NOAUTH		= 1 << 3, /* don't attempt to authenticate */
	ARC_FILL_IN_PLACE	= 1 << 4  /* fill in place (special case) */
} arc_fill_flags_t;

typedef enum arc_ovf_level {
	ARC_OVF_NONE,			/* ARC within target size. */
	ARC_OVF_SOME,			/* ARC is slightly overflowed. */
	ARC_OVF_SEVERE			/* ARC is severely overflowed. */
} arc_ovf_level_t;

static kmutex_t l2arc_feed_thr_lock;
static kcondvar_t l2arc_feed_thr_cv;
static uint8_t l2arc_thread_exit;

static kmutex_t l2arc_rebuild_thr_lock;
static kcondvar_t l2arc_rebuild_thr_cv;

enum arc_hdr_alloc_flags {
	ARC_HDR_ALLOC_RDATA	= 0x1,
	ARC_HDR_USE_RESERVE	= 0x4,
	ARC_HDR_ALLOC_LINEAR	= 0x8,
};


static abd_t *arc_get_data_abd(arc_buf_hdr_t *, uint64_t, const void *, int);
static void *arc_get_data_buf(arc_buf_hdr_t *, uint64_t, const void *);
static void arc_get_data_impl(arc_buf_hdr_t *, uint64_t, const void *, int);
static void arc_free_data_abd(arc_buf_hdr_t *, abd_t *, uint64_t, const void *);
static void arc_free_data_buf(arc_buf_hdr_t *, void *, uint64_t, const void *);
static void arc_free_data_impl(arc_buf_hdr_t *hdr, uint64_t size,
    const void *tag);
static void arc_hdr_free_abd(arc_buf_hdr_t *, boolean_t);
static void arc_hdr_alloc_abd(arc_buf_hdr_t *, int);
static void arc_hdr_destroy(arc_buf_hdr_t *);
static void arc_access(arc_buf_hdr_t *, arc_flags_t, boolean_t);
static void arc_buf_watch(arc_buf_t *);
static void arc_change_state(arc_state_t *, arc_buf_hdr_t *);

static arc_buf_contents_t arc_buf_type(arc_buf_hdr_t *);
static uint32_t arc_bufc_to_flags(arc_buf_contents_t);
static inline void arc_hdr_set_flags(arc_buf_hdr_t *hdr, arc_flags_t flags);
static inline void arc_hdr_clear_flags(arc_buf_hdr_t *hdr, arc_flags_t flags);

static boolean_t l2arc_write_eligible(uint64_t, arc_buf_hdr_t *);
static void l2arc_read_done(zio_t *);
static void l2arc_do_free_on_write(void);
static void l2arc_hdr_arcstats_update(arc_buf_hdr_t *hdr, boolean_t incr,
    boolean_t state_only);

static void arc_prune_async(uint64_t adjust);

#define l2arc_hdr_arcstats_increment(hdr) \
	l2arc_hdr_arcstats_update((hdr), B_TRUE, B_FALSE)
#define l2arc_hdr_arcstats_decrement(hdr) \
	l2arc_hdr_arcstats_update((hdr), B_FALSE, B_FALSE)
#define l2arc_hdr_arcstats_increment_state(hdr) \
	l2arc_hdr_arcstats_update((hdr), B_TRUE, B_TRUE)
#define l2arc_hdr_arcstats_decrement_state(hdr) \
	l2arc_hdr_arcstats_update((hdr), B_FALSE, B_TRUE)

/*
 * l2arc_exclude_special : A zfs module parameter that controls whether buffers
 *		present on special vdevs are eligible for caching in L2ARC. If
 *		set to 1, exclude dbufs on special vdevs from being cached to
 *		L2ARC.
 */
int l2arc_exclude_special = 0;

/*
 * l2arc_mfuonly : A ZFS module parameter that controls whether only MFU
 *		metadata and data are cached from ARC into L2ARC.
 */
static int l2arc_mfuonly = 0;

/*
 * L2ARC TRIM
 * l2arc_trim_ahead : A ZFS module parameter that controls how much ahead of
 *		the current write size (l2arc_write_max) we should TRIM if we
 *		have filled the device. It is defined as a percentage of the
 *		write size. If set to 100 we trim twice the space required to
 *		accommodate upcoming writes. A minimum of 64MB will be trimmed.
 *		It also enables TRIM of the whole L2ARC device upon creation or
 *		addition to an existing pool or if the header of the device is
 *		invalid upon importing a pool or onlining a cache device. The
 *		default is 0, which disables TRIM on L2ARC altogether as it can
 *		put significant stress on the underlying storage devices. This
 *		will vary depending on how well the specific device handles
 *		these commands.
 */
static uint64_t l2arc_trim_ahead = 0;

/*
 * Performance tuning of L2ARC persistence:
 *
 * l2arc_rebuild_enabled : A ZFS module parameter that controls whether adding
 *		an L2ARC device (either at pool import or later) will attempt
 *		to rebuild L2ARC buffer contents.
 * l2arc_rebuild_blocks_min_l2size : A ZFS module parameter that controls
 *		whether log blocks are written to the L2ARC device. If the L2ARC
 *		device is less than 1GB, the amount of data l2arc_evict()
 *		evicts is significant compared to the amount of restored L2ARC
 *		data. In this case do not write log blocks in L2ARC in order
 *		not to waste space.
 */
static int l2arc_rebuild_enabled = B_TRUE;
static uint64_t l2arc_rebuild_blocks_min_l2size = 1024 * 1024 * 1024;

/* L2ARC persistence rebuild control routines. */
void l2arc_rebuild_vdev(vdev_t *vd, boolean_t reopen);
static __attribute__((noreturn)) void l2arc_dev_rebuild_thread(void *arg);
static int l2arc_rebuild(l2arc_dev_t *dev);

/* L2ARC persistence read I/O routines. */
static int l2arc_dev_hdr_read(l2arc_dev_t *dev);
static int l2arc_log_blk_read(l2arc_dev_t *dev,
    const l2arc_log_blkptr_t *this_lp, const l2arc_log_blkptr_t *next_lp,
    l2arc_log_blk_phys_t *this_lb, l2arc_log_blk_phys_t *next_lb,
    zio_t *this_io, zio_t **next_io);
static zio_t *l2arc_log_blk_fetch(vdev_t *vd,
    const l2arc_log_blkptr_t *lp, l2arc_log_blk_phys_t *lb);
static void l2arc_log_blk_fetch_abort(zio_t *zio);

/* L2ARC persistence block restoration routines. */
static void l2arc_log_blk_restore(l2arc_dev_t *dev,
    const l2arc_log_blk_phys_t *lb, uint64_t lb_asize);
static void l2arc_hdr_restore(const l2arc_log_ent_phys_t *le,
    l2arc_dev_t *dev);

/* L2ARC persistence write I/O routines. */
static uint64_t l2arc_log_blk_commit(l2arc_dev_t *dev, zio_t *pio,
    l2arc_write_callback_t *cb);

/* L2ARC persistence auxiliary routines. */
boolean_t l2arc_log_blkptr_valid(l2arc_dev_t *dev,
    const l2arc_log_blkptr_t *lbp);
static boolean_t l2arc_log_blk_insert(l2arc_dev_t *dev,
    const arc_buf_hdr_t *ab);
boolean_t l2arc_range_check_overlap(uint64_t bottom,
    uint64_t top, uint64_t check);
static void l2arc_blk_fetch_done(zio_t *zio);
static inline uint64_t
    l2arc_log_blk_overhead(uint64_t write_sz, l2arc_dev_t *dev);

/*
 * We use Cityhash for this. It's fast, and has good hash properties without
 * requiring any large static buffers.
 */
static uint64_t
buf_hash(uint64_t spa, const dva_t *dva, uint64_t birth)
{
	return (cityhash4(spa, dva->dva_word[0], dva->dva_word[1], birth));
}

#define HDR_EMPTY(hdr) \
	((hdr)->b_dva.dva_word[0] == 0 && \
	(hdr)->b_dva.dva_word[1] == 0)

#define HDR_EMPTY_OR_LOCKED(hdr) \
	(HDR_EMPTY(hdr) || MUTEX_HELD(HDR_LOCK(hdr)))

#define HDR_EQUAL(spa, dva, birth, hdr) \
	((hdr)->b_dva.dva_word[0] == (dva)->dva_word[0]) && \
	((hdr)->b_dva.dva_word[1] == (dva)->dva_word[1]) && \
	((hdr)->b_birth == birth) && ((hdr)->b_spa == spa)

static void
buf_discard_identity(arc_buf_hdr_t *hdr)
{
	hdr->b_dva.dva_word[0] = 0;
	hdr->b_dva.dva_word[1] = 0;
	hdr->b_birth = 0;
}

static arc_buf_hdr_t *
buf_hash_find(uint64_t spa, const blkptr_t *bp, kmutex_t **lockp)
{
	const dva_t *dva = BP_IDENTITY(bp);
	uint64_t birth = BP_GET_BIRTH(bp);
	uint64_t idx = BUF_HASH_INDEX(spa, dva, birth);
	kmutex_t *hash_lock = BUF_HASH_LOCK(idx);
	arc_buf_hdr_t *hdr;

	mutex_enter(hash_lock);
	for (hdr = buf_hash_table.ht_table[idx]; hdr != NULL;
	    hdr = hdr->b_hash_next) {
		if (HDR_EQUAL(spa, dva, birth, hdr)) {
			*lockp = hash_lock;
			return (hdr);
		}
	}
	mutex_exit(hash_lock);
	*lockp = NULL;
	return (NULL);
}

/*
 * Insert an entry into the hash table. If there is already an element
 * equal to elem in the hash table, then the already existing element
 * will be returned and the new element will not be inserted.
 * Otherwise returns NULL.
 * If lockp == NULL, the caller is assumed to already hold the hash lock.
 */
static arc_buf_hdr_t *
buf_hash_insert(arc_buf_hdr_t *hdr, kmutex_t **lockp)
{
	uint64_t idx = BUF_HASH_INDEX(hdr->b_spa, &hdr->b_dva, hdr->b_birth);
	kmutex_t *hash_lock = BUF_HASH_LOCK(idx);
	arc_buf_hdr_t *fhdr;
	uint32_t i;

	ASSERT(!DVA_IS_EMPTY(&hdr->b_dva));
	ASSERT(hdr->b_birth != 0);
	ASSERT(!HDR_IN_HASH_TABLE(hdr));

	if (lockp != NULL) {
		*lockp = hash_lock;
		mutex_enter(hash_lock);
	} else {
		ASSERT(MUTEX_HELD(hash_lock));
	}

	for (fhdr = buf_hash_table.ht_table[idx], i = 0; fhdr != NULL;
	    fhdr = fhdr->b_hash_next, i++) {
		if (HDR_EQUAL(hdr->b_spa, &hdr->b_dva, hdr->b_birth, fhdr))
			return (fhdr);
	}

	hdr->b_hash_next = buf_hash_table.ht_table[idx];
	buf_hash_table.ht_table[idx] = hdr;
	arc_hdr_set_flags(hdr, ARC_FLAG_IN_HASH_TABLE);

	/* collect some hash table performance data */
	if (i > 0) {
		ARCSTAT_BUMP(arcstat_hash_collisions);
		if (i == 1)
			ARCSTAT_BUMP(arcstat_hash_chains);

		ARCSTAT_MAX(arcstat_hash_chain_max, i);
	}
	uint64_t he = atomic_inc_64_nv(
	    &arc_stats.arcstat_hash_elements.value.ui64);
	ARCSTAT_MAX(arcstat_hash_elements_max, he);

	return (NULL);
}
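
/*
 * Illustrative only: the usual caller pattern for buf_hash_insert(), in
 * which an already-present header for the same (spa, DVA, birth) identity
 * wins and the new header is discarded by the caller:
 *
 *	kmutex_t *hash_lock;
 *	arc_buf_hdr_t *exists = buf_hash_insert(hdr, &hash_lock);
 *	if (exists != NULL) {
 *		... drop 'hdr' and use 'exists' instead ...
 *	}
 *	mutex_exit(hash_lock);
 */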

static void
buf_hash_remove(arc_buf_hdr_t *hdr)
{
	arc_buf_hdr_t *fhdr, **hdrp;
	uint64_t idx = BUF_HASH_INDEX(hdr->b_spa, &hdr->b_dva, hdr->b_birth);

	ASSERT(MUTEX_HELD(BUF_HASH_LOCK(idx)));
	ASSERT(HDR_IN_HASH_TABLE(hdr));

	hdrp = &buf_hash_table.ht_table[idx];
	while ((fhdr = *hdrp) != hdr) {
		ASSERT3P(fhdr, !=, NULL);
		hdrp = &fhdr->b_hash_next;
	}
	*hdrp = hdr->b_hash_next;
	hdr->b_hash_next = NULL;
	arc_hdr_clear_flags(hdr, ARC_FLAG_IN_HASH_TABLE);

	/* collect some hash table performance data */
	atomic_dec_64(&arc_stats.arcstat_hash_elements.value.ui64);

	if (buf_hash_table.ht_table[idx] &&
	    buf_hash_table.ht_table[idx]->b_hash_next == NULL)
		ARCSTAT_BUMPDOWN(arcstat_hash_chains);
}

/*
 * Global data structures and functions for the buf kmem cache.
 */

static kmem_cache_t *hdr_full_cache;
static kmem_cache_t *hdr_l2only_cache;
static kmem_cache_t *buf_cache;

static void
buf_fini(void)
{
#if defined(_KERNEL)
	/*
	 * Large allocations which do not require contiguous pages
	 * should be using vmem_free() in the linux kernel.
	 */
	vmem_free(buf_hash_table.ht_table,
	    (buf_hash_table.ht_mask + 1) * sizeof (void *));
#else
	kmem_free(buf_hash_table.ht_table,
	    (buf_hash_table.ht_mask + 1) * sizeof (void *));
#endif
	for (int i = 0; i < BUF_LOCKS; i++)
		mutex_destroy(BUF_HASH_LOCK(i));
	kmem_cache_destroy(hdr_full_cache);
	kmem_cache_destroy(hdr_l2only_cache);
	kmem_cache_destroy(buf_cache);
}

/*
 * Constructor callback - called when the cache is empty
 * and a new buf is requested.
 */
static int
hdr_full_cons(void *vbuf, void *unused, int kmflag)
{
	(void) unused, (void) kmflag;
	arc_buf_hdr_t *hdr = vbuf;

	memset(hdr, 0, HDR_FULL_SIZE);
	hdr->b_l1hdr.b_byteswap = DMU_BSWAP_NUMFUNCS;
	zfs_refcount_create(&hdr->b_l1hdr.b_refcnt);
#ifdef ZFS_DEBUG
	mutex_init(&hdr->b_l1hdr.b_freeze_lock, NULL, MUTEX_DEFAULT, NULL);
#endif
	multilist_link_init(&hdr->b_l1hdr.b_arc_node);
	list_link_init(&hdr->b_l2hdr.b_l2node);
	arc_space_consume(HDR_FULL_SIZE, ARC_SPACE_HDRS);

	return (0);
}

static int
hdr_l2only_cons(void *vbuf, void *unused, int kmflag)
{
	(void) unused, (void) kmflag;
	arc_buf_hdr_t *hdr = vbuf;

	memset(hdr, 0, HDR_L2ONLY_SIZE);
	arc_space_consume(HDR_L2ONLY_SIZE, ARC_SPACE_L2HDRS);

	return (0);
}

static int
buf_cons(void *vbuf, void *unused, int kmflag)
{
	(void) unused, (void) kmflag;
	arc_buf_t *buf = vbuf;

	memset(buf, 0, sizeof (arc_buf_t));
	arc_space_consume(sizeof (arc_buf_t), ARC_SPACE_HDRS);

	return (0);
}

/*
 * Destructor callback - called when a cached buf is
 * no longer required.
 */
static void
hdr_full_dest(void *vbuf, void *unused)
{
	(void) unused;
	arc_buf_hdr_t *hdr = vbuf;

	ASSERT(HDR_EMPTY(hdr));
	zfs_refcount_destroy(&hdr->b_l1hdr.b_refcnt);
#ifdef ZFS_DEBUG
	mutex_destroy(&hdr->b_l1hdr.b_freeze_lock);
#endif
	ASSERT(!multilist_link_active(&hdr->b_l1hdr.b_arc_node));
	arc_space_return(HDR_FULL_SIZE, ARC_SPACE_HDRS);
}

static void
hdr_l2only_dest(void *vbuf, void *unused)
{
	(void) unused;
	arc_buf_hdr_t *hdr = vbuf;

	ASSERT(HDR_EMPTY(hdr));
	arc_space_return(HDR_L2ONLY_SIZE, ARC_SPACE_L2HDRS);
}

static void
buf_dest(void *vbuf, void *unused)
{
	(void) unused;
	(void) vbuf;

	arc_space_return(sizeof (arc_buf_t), ARC_SPACE_HDRS);
}

static void
buf_init(void)
{
	uint64_t *ct = NULL;
	uint64_t hsize = 1ULL << 12;
	int i, j;

	/*
	 * The hash table is big enough to fill all of physical memory
	 * with an average block size of zfs_arc_average_blocksize (default 8K).
	 * By default, the table will take up
	 * totalmem * sizeof(void*) / 8K (1MB per GB with 8-byte pointers).
	 */
	while (hsize * zfs_arc_average_blocksize < arc_all_memory())
		hsize <<= 1;
retry:
	buf_hash_table.ht_mask = hsize - 1;
#if defined(_KERNEL)
	/*
	 * Large allocations which do not require contiguous pages
	 * should be using vmem_alloc() in the linux kernel
	 */
	buf_hash_table.ht_table =
	    vmem_zalloc(hsize * sizeof (void*), KM_SLEEP);
#else
	buf_hash_table.ht_table =
	    kmem_zalloc(hsize * sizeof (void*), KM_NOSLEEP);
#endif
	if (buf_hash_table.ht_table == NULL) {
		ASSERT(hsize > (1ULL << 8));
		hsize >>= 1;
		goto retry;
	}

	hdr_full_cache = kmem_cache_create("arc_buf_hdr_t_full", HDR_FULL_SIZE,
	    0, hdr_full_cons, hdr_full_dest, NULL, NULL, NULL, KMC_RECLAIMABLE);
	hdr_l2only_cache = kmem_cache_create("arc_buf_hdr_t_l2only",
	    HDR_L2ONLY_SIZE, 0, hdr_l2only_cons, hdr_l2only_dest, NULL,
	    NULL, NULL, 0);
	buf_cache = kmem_cache_create("arc_buf_t", sizeof (arc_buf_t),
	    0, buf_cons, buf_dest, NULL, NULL, NULL, 0);

	for (i = 0; i < 256; i++)
		for (ct = zfs_crc64_table + i, *ct = i, j = 8; j > 0; j--)
			*ct = (*ct >> 1) ^ (-(*ct & 1) & ZFS_CRC64_POLY);

	for (i = 0; i < BUF_LOCKS; i++)
		mutex_init(BUF_HASH_LOCK(i), NULL, MUTEX_DEFAULT, NULL);
}

#define ARC_MINTIME	(hz>>4) /* 62 ms */

/*
 * This is the size that the buf occupies in memory. If the buf is compressed,
 * it will correspond to the compressed size. You should use this method of
 * getting the buf size unless you explicitly need the logical size.
 */
uint64_t
arc_buf_size(arc_buf_t *buf)
{
	return (ARC_BUF_COMPRESSED(buf) ?
	    HDR_GET_PSIZE(buf->b_hdr) : HDR_GET_LSIZE(buf->b_hdr));
}

uint64_t
arc_buf_lsize(arc_buf_t *buf)
{
	return (HDR_GET_LSIZE(buf->b_hdr));
}
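
/*
 * Illustrative only: for a compressed arc_buf_t the two sizes differ, e.g.
 * a 128K logical block stored compressed as 32K reports arc_buf_lsize() of
 * 128K and arc_buf_size() of 32K. In general:
 *
 *	if (ARC_BUF_COMPRESSED(buf))
 *		ASSERT3U(arc_buf_size(buf), ==, HDR_GET_PSIZE(buf->b_hdr));
 *	else
 *		ASSERT3U(arc_buf_size(buf), ==, arc_buf_lsize(buf));
 */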

/*
 * This function will return B_TRUE if the buffer is encrypted in memory.
 * This buffer can be decrypted by calling arc_untransform().
 */
boolean_t
arc_is_encrypted(arc_buf_t *buf)
{
	return (ARC_BUF_ENCRYPTED(buf) != 0);
}

/*
 * Returns B_TRUE if the buffer represents data that has not had its MAC
 * verified yet.
 */
boolean_t
arc_is_unauthenticated(arc_buf_t *buf)
{
	return (HDR_NOAUTH(buf->b_hdr) != 0);
}

void
arc_get_raw_params(arc_buf_t *buf, boolean_t *byteorder, uint8_t *salt,
    uint8_t *iv, uint8_t *mac)
{
	arc_buf_hdr_t *hdr = buf->b_hdr;

	ASSERT(HDR_PROTECTED(hdr));

	memcpy(salt, hdr->b_crypt_hdr.b_salt, ZIO_DATA_SALT_LEN);
	memcpy(iv, hdr->b_crypt_hdr.b_iv, ZIO_DATA_IV_LEN);
	memcpy(mac, hdr->b_crypt_hdr.b_mac, ZIO_DATA_MAC_LEN);
	*byteorder = (hdr->b_l1hdr.b_byteswap == DMU_BSWAP_NUMFUNCS) ?
	    ZFS_HOST_BYTEORDER : !ZFS_HOST_BYTEORDER;
}

/*
 * Indicates how this buffer is compressed in memory. If it is not compressed
 * the value will be ZIO_COMPRESS_OFF. It can be made normally readable with
 * arc_untransform() as long as it is also unencrypted.
 */
enum zio_compress
arc_get_compression(arc_buf_t *buf)
{
	return (ARC_BUF_COMPRESSED(buf) ?
	    HDR_GET_COMPRESS(buf->b_hdr) : ZIO_COMPRESS_OFF);
}

/*
 * Return the compression algorithm used to store this data in the ARC. If ARC
 * compression is enabled or this is an encrypted block, this will be the same
 * as what's used to store it on-disk. Otherwise, this will be ZIO_COMPRESS_OFF.
 */
static inline enum zio_compress
arc_hdr_get_compress(arc_buf_hdr_t *hdr)
{
	return (HDR_COMPRESSION_ENABLED(hdr) ?
	    HDR_GET_COMPRESS(hdr) : ZIO_COMPRESS_OFF);
}

uint8_t
arc_get_complevel(arc_buf_t *buf)
{
	return (buf->b_hdr->b_complevel);
}

static inline boolean_t
arc_buf_is_shared(arc_buf_t *buf)
{
	boolean_t shared = (buf->b_data != NULL &&
	    buf->b_hdr->b_l1hdr.b_pabd != NULL &&
	    abd_is_linear(buf->b_hdr->b_l1hdr.b_pabd) &&
	    buf->b_data == abd_to_buf(buf->b_hdr->b_l1hdr.b_pabd));
	IMPLY(shared, HDR_SHARED_DATA(buf->b_hdr));
	EQUIV(shared, ARC_BUF_SHARED(buf));
	IMPLY(shared, ARC_BUF_COMPRESSED(buf) || ARC_BUF_LAST(buf));

	/*
	 * It would be nice to assert arc_can_share() too, but the "hdr isn't
	 * already being shared" requirement prevents us from doing that.
	 */

	return (shared);
}

/*
 * Free the checksum associated with this header. If there is no checksum, this
 * is a no-op.
 */
static inline void
arc_cksum_free(arc_buf_hdr_t *hdr)
{
#ifdef ZFS_DEBUG
	ASSERT(HDR_HAS_L1HDR(hdr));

	mutex_enter(&hdr->b_l1hdr.b_freeze_lock);
	if (hdr->b_l1hdr.b_freeze_cksum != NULL) {
		kmem_free(hdr->b_l1hdr.b_freeze_cksum, sizeof (zio_cksum_t));
		hdr->b_l1hdr.b_freeze_cksum = NULL;
	}
	mutex_exit(&hdr->b_l1hdr.b_freeze_lock);
#endif
}

/*
 * Return true iff at least one of the bufs on hdr is not compressed.
 * Encrypted buffers count as compressed.
 */
static boolean_t
arc_hdr_has_uncompressed_buf(arc_buf_hdr_t *hdr)
{
	ASSERT(hdr->b_l1hdr.b_state == arc_anon || HDR_EMPTY_OR_LOCKED(hdr));

	for (arc_buf_t *b = hdr->b_l1hdr.b_buf; b != NULL; b = b->b_next) {
		if (!ARC_BUF_COMPRESSED(b)) {
			return (B_TRUE);
		}
	}
	return (B_FALSE);
}


/*
 * If we've turned on the ZFS_DEBUG_MODIFY flag, verify that the buf's data
 * matches the checksum that is stored in the hdr. If there is no checksum,
 * or if the buf is compressed, this is a no-op.
 */
static void
arc_cksum_verify(arc_buf_t *buf)
{
#ifdef ZFS_DEBUG
	arc_buf_hdr_t *hdr = buf->b_hdr;
	zio_cksum_t zc;

	if (!(zfs_flags & ZFS_DEBUG_MODIFY))
		return;

	if (ARC_BUF_COMPRESSED(buf))
		return;

	ASSERT(HDR_HAS_L1HDR(hdr));

	mutex_enter(&hdr->b_l1hdr.b_freeze_lock);

	if (hdr->b_l1hdr.b_freeze_cksum == NULL || HDR_IO_ERROR(hdr)) {
		mutex_exit(&hdr->b_l1hdr.b_freeze_lock);
		return;
	}

	fletcher_2_native(buf->b_data, arc_buf_size(buf), NULL, &zc);
	if (!ZIO_CHECKSUM_EQUAL(*hdr->b_l1hdr.b_freeze_cksum, zc))
		panic("buffer modified while frozen!");
	mutex_exit(&hdr->b_l1hdr.b_freeze_lock);
#endif
}

/*
 * This function makes the assumption that data stored in the L2ARC
 * will be transformed exactly as it is in the main pool. Because of
 * this we can verify the checksum against the reading process's bp.
 */
static boolean_t
arc_cksum_is_equal(arc_buf_hdr_t *hdr, zio_t *zio)
{
	ASSERT(!BP_IS_EMBEDDED(zio->io_bp));
	VERIFY3U(BP_GET_PSIZE(zio->io_bp), ==, HDR_GET_PSIZE(hdr));

	/*
	 * Block pointers always store the checksum for the logical data.
	 * If the block pointer has the gang bit set, then the checksum
	 * it represents is for the reconstituted data and not for an
	 * individual gang member. The zio pipeline, however, must be able to
	 * determine the checksum of each of the gang constituents so it
	 * treats the checksum comparison differently than what we need
	 * for l2arc blocks. This prevents us from using the
	 * zio_checksum_error() interface directly. Instead we must call the
	 * zio_checksum_error_impl() so that we can ensure the checksum is
	 * generated using the correct checksum algorithm and accounts for the
	 * logical I/O size and not just a gang fragment.
	 */
	return (zio_checksum_error_impl(zio->io_spa, zio->io_bp,
	    BP_GET_CHECKSUM(zio->io_bp), zio->io_abd, zio->io_size,
	    zio->io_offset, NULL) == 0);
}

/*
 * Given a buf full of data, if ZFS_DEBUG_MODIFY is enabled this computes a
 * checksum and attaches it to the buf's hdr so that we can ensure that the buf
 * isn't modified later on. If buf is compressed or there is already a checksum
 * on the hdr, this is a no-op (we only checksum uncompressed bufs).
1485 */ 1486 static void 1487 arc_cksum_compute(arc_buf_t *buf) 1488 { 1489 if (!(zfs_flags & ZFS_DEBUG_MODIFY)) 1490 return; 1491 1492 #ifdef ZFS_DEBUG 1493 arc_buf_hdr_t *hdr = buf->b_hdr; 1494 ASSERT(HDR_HAS_L1HDR(hdr)); 1495 mutex_enter(&hdr->b_l1hdr.b_freeze_lock); 1496 if (hdr->b_l1hdr.b_freeze_cksum != NULL || ARC_BUF_COMPRESSED(buf)) { 1497 mutex_exit(&hdr->b_l1hdr.b_freeze_lock); 1498 return; 1499 } 1500 1501 ASSERT(!ARC_BUF_ENCRYPTED(buf)); 1502 ASSERT(!ARC_BUF_COMPRESSED(buf)); 1503 hdr->b_l1hdr.b_freeze_cksum = kmem_alloc(sizeof (zio_cksum_t), 1504 KM_SLEEP); 1505 fletcher_2_native(buf->b_data, arc_buf_size(buf), NULL, 1506 hdr->b_l1hdr.b_freeze_cksum); 1507 mutex_exit(&hdr->b_l1hdr.b_freeze_lock); 1508 #endif 1509 arc_buf_watch(buf); 1510 } 1511 1512 #ifndef _KERNEL 1513 void 1514 arc_buf_sigsegv(int sig, siginfo_t *si, void *unused) 1515 { 1516 (void) sig, (void) unused; 1517 panic("Got SIGSEGV at address: 0x%lx\n", (long)si->si_addr); 1518 } 1519 #endif 1520 1521 static void 1522 arc_buf_unwatch(arc_buf_t *buf) 1523 { 1524 #ifndef _KERNEL 1525 if (arc_watch) { 1526 ASSERT0(mprotect(buf->b_data, arc_buf_size(buf), 1527 PROT_READ | PROT_WRITE)); 1528 } 1529 #else 1530 (void) buf; 1531 #endif 1532 } 1533 1534 static void 1535 arc_buf_watch(arc_buf_t *buf) 1536 { 1537 #ifndef _KERNEL 1538 if (arc_watch) 1539 ASSERT0(mprotect(buf->b_data, arc_buf_size(buf), 1540 PROT_READ)); 1541 #else 1542 (void) buf; 1543 #endif 1544 } 1545 1546 static arc_buf_contents_t 1547 arc_buf_type(arc_buf_hdr_t *hdr) 1548 { 1549 arc_buf_contents_t type; 1550 if (HDR_ISTYPE_METADATA(hdr)) { 1551 type = ARC_BUFC_METADATA; 1552 } else { 1553 type = ARC_BUFC_DATA; 1554 } 1555 VERIFY3U(hdr->b_type, ==, type); 1556 return (type); 1557 } 1558 1559 boolean_t 1560 arc_is_metadata(arc_buf_t *buf) 1561 { 1562 return (HDR_ISTYPE_METADATA(buf->b_hdr) != 0); 1563 } 1564 1565 static uint32_t 1566 arc_bufc_to_flags(arc_buf_contents_t type) 1567 { 1568 switch (type) { 1569 case ARC_BUFC_DATA: 1570 /* metadata field is 0 if buffer contains normal data */ 1571 return (0); 1572 case ARC_BUFC_METADATA: 1573 return (ARC_FLAG_BUFC_METADATA); 1574 default: 1575 break; 1576 } 1577 panic("undefined ARC buffer type!"); 1578 return ((uint32_t)-1); 1579 } 1580 1581 void 1582 arc_buf_thaw(arc_buf_t *buf) 1583 { 1584 arc_buf_hdr_t *hdr = buf->b_hdr; 1585 1586 ASSERT3P(hdr->b_l1hdr.b_state, ==, arc_anon); 1587 ASSERT(!HDR_IO_IN_PROGRESS(hdr)); 1588 1589 arc_cksum_verify(buf); 1590 1591 /* 1592 * Compressed buffers do not manipulate the b_freeze_cksum. 1593 */ 1594 if (ARC_BUF_COMPRESSED(buf)) 1595 return; 1596 1597 ASSERT(HDR_HAS_L1HDR(hdr)); 1598 arc_cksum_free(hdr); 1599 arc_buf_unwatch(buf); 1600 } 1601 1602 void 1603 arc_buf_freeze(arc_buf_t *buf) 1604 { 1605 if (!(zfs_flags & ZFS_DEBUG_MODIFY)) 1606 return; 1607 1608 if (ARC_BUF_COMPRESSED(buf)) 1609 return; 1610 1611 ASSERT(HDR_HAS_L1HDR(buf->b_hdr)); 1612 arc_cksum_compute(buf); 1613 } 1614 1615 /* 1616 * The arc_buf_hdr_t's b_flags should never be modified directly. Instead, 1617 * the following functions should be used to ensure that the flags are 1618 * updated in a thread-safe way. When manipulating the flags either 1619 * the hash_lock must be held or the hdr must be undiscoverable. This 1620 * ensures that we're not racing with any other threads when updating 1621 * the flags. 
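 *
 * As a minimal sketch of the expected pattern (mirroring what callers such
 * as arc_buf_fill() do when marking an I/O error):
 *
 *	kmutex_t *hash_lock = HDR_LOCK(hdr);
 *	mutex_enter(hash_lock);
 *	arc_hdr_set_flags(hdr, ARC_FLAG_IO_ERROR);
 *	mutex_exit(hash_lock);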
1622 */ 1623 static inline void 1624 arc_hdr_set_flags(arc_buf_hdr_t *hdr, arc_flags_t flags) 1625 { 1626 ASSERT(HDR_EMPTY_OR_LOCKED(hdr)); 1627 hdr->b_flags |= flags; 1628 } 1629 1630 static inline void 1631 arc_hdr_clear_flags(arc_buf_hdr_t *hdr, arc_flags_t flags) 1632 { 1633 ASSERT(HDR_EMPTY_OR_LOCKED(hdr)); 1634 hdr->b_flags &= ~flags; 1635 } 1636 1637 /* 1638 * Setting the compression bits in the arc_buf_hdr_t's b_flags is 1639 * done in a special way since we have to clear and set bits 1640 * at the same time. Consumers that wish to set the compression bits 1641 * must use this function to ensure that the flags are updated in 1642 * a thread-safe manner. 1643 */ 1644 static void 1645 arc_hdr_set_compress(arc_buf_hdr_t *hdr, enum zio_compress cmp) 1646 { 1647 ASSERT(HDR_EMPTY_OR_LOCKED(hdr)); 1648 1649 /* 1650 * Holes and embedded blocks will always have a psize = 0, so 1651 * we ignore the compression of the blkptr and mark them 1652 * as uncompressed. 1653 */ 1654 if (!zfs_compressed_arc_enabled || HDR_GET_PSIZE(hdr) == 0) { 1655 arc_hdr_clear_flags(hdr, ARC_FLAG_COMPRESSED_ARC); 1656 ASSERT(!HDR_COMPRESSION_ENABLED(hdr)); 1657 } else { 1658 arc_hdr_set_flags(hdr, ARC_FLAG_COMPRESSED_ARC); 1659 ASSERT(HDR_COMPRESSION_ENABLED(hdr)); 1660 } 1661 1662 HDR_SET_COMPRESS(hdr, cmp); 1663 ASSERT3U(HDR_GET_COMPRESS(hdr), ==, cmp); 1664 } 1665 1666 /* 1667 * Looks for another buf on the same hdr which has the data decompressed, copies 1668 * from it, and returns true. If no such buf exists, returns false. 1669 */ 1670 static boolean_t 1671 arc_buf_try_copy_decompressed_data(arc_buf_t *buf) 1672 { 1673 arc_buf_hdr_t *hdr = buf->b_hdr; 1674 boolean_t copied = B_FALSE; 1675 1676 ASSERT(HDR_HAS_L1HDR(hdr)); 1677 ASSERT3P(buf->b_data, !=, NULL); 1678 ASSERT(!ARC_BUF_COMPRESSED(buf)); 1679 1680 for (arc_buf_t *from = hdr->b_l1hdr.b_buf; from != NULL; 1681 from = from->b_next) { 1682 /* can't use our own data buffer */ 1683 if (from == buf) { 1684 continue; 1685 } 1686 1687 if (!ARC_BUF_COMPRESSED(from)) { 1688 memcpy(buf->b_data, from->b_data, arc_buf_size(buf)); 1689 copied = B_TRUE; 1690 break; 1691 } 1692 } 1693 1694 #ifdef ZFS_DEBUG 1695 /* 1696 * There were no decompressed bufs, so there should not be a 1697 * checksum on the hdr either. 1698 */ 1699 if (zfs_flags & ZFS_DEBUG_MODIFY) 1700 EQUIV(!copied, hdr->b_l1hdr.b_freeze_cksum == NULL); 1701 #endif 1702 1703 return (copied); 1704 } 1705 1706 /* 1707 * Allocates an ARC buf header that's in an evicted & L2-cached state. 1708 * This is used during l2arc reconstruction to make empty ARC buffers 1709 * which circumvent the regular disk->arc->l2arc path and instead come 1710 * into being in the reverse order, i.e. l2arc->arc.
1711 */ 1712 static arc_buf_hdr_t * 1713 arc_buf_alloc_l2only(size_t size, arc_buf_contents_t type, l2arc_dev_t *dev, 1714 dva_t dva, uint64_t daddr, int32_t psize, uint64_t birth, 1715 enum zio_compress compress, uint8_t complevel, boolean_t protected, 1716 boolean_t prefetch, arc_state_type_t arcs_state) 1717 { 1718 arc_buf_hdr_t *hdr; 1719 1720 ASSERT(size != 0); 1721 hdr = kmem_cache_alloc(hdr_l2only_cache, KM_SLEEP); 1722 hdr->b_birth = birth; 1723 hdr->b_type = type; 1724 hdr->b_flags = 0; 1725 arc_hdr_set_flags(hdr, arc_bufc_to_flags(type) | ARC_FLAG_HAS_L2HDR); 1726 HDR_SET_LSIZE(hdr, size); 1727 HDR_SET_PSIZE(hdr, psize); 1728 arc_hdr_set_compress(hdr, compress); 1729 hdr->b_complevel = complevel; 1730 if (protected) 1731 arc_hdr_set_flags(hdr, ARC_FLAG_PROTECTED); 1732 if (prefetch) 1733 arc_hdr_set_flags(hdr, ARC_FLAG_PREFETCH); 1734 hdr->b_spa = spa_load_guid(dev->l2ad_vdev->vdev_spa); 1735 1736 hdr->b_dva = dva; 1737 1738 hdr->b_l2hdr.b_dev = dev; 1739 hdr->b_l2hdr.b_daddr = daddr; 1740 hdr->b_l2hdr.b_arcs_state = arcs_state; 1741 1742 return (hdr); 1743 } 1744 1745 /* 1746 * Return the size of the block, b_pabd, that is stored in the arc_buf_hdr_t. 1747 */ 1748 static uint64_t 1749 arc_hdr_size(arc_buf_hdr_t *hdr) 1750 { 1751 uint64_t size; 1752 1753 if (arc_hdr_get_compress(hdr) != ZIO_COMPRESS_OFF && 1754 HDR_GET_PSIZE(hdr) > 0) { 1755 size = HDR_GET_PSIZE(hdr); 1756 } else { 1757 ASSERT3U(HDR_GET_LSIZE(hdr), !=, 0); 1758 size = HDR_GET_LSIZE(hdr); 1759 } 1760 return (size); 1761 } 1762 1763 static int 1764 arc_hdr_authenticate(arc_buf_hdr_t *hdr, spa_t *spa, uint64_t dsobj) 1765 { 1766 int ret; 1767 uint64_t csize; 1768 uint64_t lsize = HDR_GET_LSIZE(hdr); 1769 uint64_t psize = HDR_GET_PSIZE(hdr); 1770 abd_t *abd = hdr->b_l1hdr.b_pabd; 1771 boolean_t free_abd = B_FALSE; 1772 1773 ASSERT(HDR_EMPTY_OR_LOCKED(hdr)); 1774 ASSERT(HDR_AUTHENTICATED(hdr)); 1775 ASSERT3P(abd, !=, NULL); 1776 1777 /* 1778 * The MAC is calculated on the compressed data that is stored on disk. 1779 * However, if compressed arc is disabled we will only have the 1780 * decompressed data available to us now. Compress it into a temporary 1781 * abd so we can verify the MAC. The performance overhead of this will 1782 * be relatively low, since most objects in an encrypted objset will 1783 * be encrypted (instead of authenticated) anyway. 1784 */ 1785 if (HDR_GET_COMPRESS(hdr) != ZIO_COMPRESS_OFF && 1786 !HDR_COMPRESSION_ENABLED(hdr)) { 1787 abd = NULL; 1788 csize = zio_compress_data(HDR_GET_COMPRESS(hdr), 1789 hdr->b_l1hdr.b_pabd, &abd, lsize, hdr->b_complevel); 1790 ASSERT3P(abd, !=, NULL); 1791 ASSERT3U(csize, <=, psize); 1792 abd_zero_off(abd, csize, psize - csize); 1793 free_abd = B_TRUE; 1794 } 1795 1796 /* 1797 * Authentication is best effort. We authenticate whenever the key is 1798 * available. If we succeed we clear ARC_FLAG_NOAUTH. 
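 * In particular, a missing key is reported as ENOENT by the crypto call;
 * that case is mapped to success below while ARC_FLAG_NOAUTH remains set,
 * so authentication can be retried on a later fill.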
1799 */ 1800 if (hdr->b_crypt_hdr.b_ot == DMU_OT_OBJSET) { 1801 ASSERT3U(HDR_GET_COMPRESS(hdr), ==, ZIO_COMPRESS_OFF); 1802 ASSERT3U(lsize, ==, psize); 1803 ret = spa_do_crypt_objset_mac_abd(B_FALSE, spa, dsobj, abd, 1804 psize, hdr->b_l1hdr.b_byteswap != DMU_BSWAP_NUMFUNCS); 1805 } else { 1806 ret = spa_do_crypt_mac_abd(B_FALSE, spa, dsobj, abd, psize, 1807 hdr->b_crypt_hdr.b_mac); 1808 } 1809 1810 if (ret == 0) 1811 arc_hdr_clear_flags(hdr, ARC_FLAG_NOAUTH); 1812 else if (ret == ENOENT) 1813 ret = 0; 1814 1815 if (free_abd) 1816 abd_free(abd); 1817 1818 return (ret); 1819 } 1820 1821 /* 1822 * This function will take a header that only has raw encrypted data in 1823 * b_crypt_hdr.b_rabd and decrypt it into a new buffer which is stored in 1824 * b_l1hdr.b_pabd. If designated in the header flags, this function will 1825 * also decompress the data. 1826 */ 1827 static int 1828 arc_hdr_decrypt(arc_buf_hdr_t *hdr, spa_t *spa, const zbookmark_phys_t *zb) 1829 { 1830 int ret; 1831 abd_t *cabd = NULL; 1832 boolean_t no_crypt = B_FALSE; 1833 boolean_t bswap = (hdr->b_l1hdr.b_byteswap != DMU_BSWAP_NUMFUNCS); 1834 1835 ASSERT(HDR_EMPTY_OR_LOCKED(hdr)); 1836 ASSERT(HDR_ENCRYPTED(hdr)); 1837 1838 arc_hdr_alloc_abd(hdr, 0); 1839 1840 ret = spa_do_crypt_abd(B_FALSE, spa, zb, hdr->b_crypt_hdr.b_ot, 1841 B_FALSE, bswap, hdr->b_crypt_hdr.b_salt, hdr->b_crypt_hdr.b_iv, 1842 hdr->b_crypt_hdr.b_mac, HDR_GET_PSIZE(hdr), hdr->b_l1hdr.b_pabd, 1843 hdr->b_crypt_hdr.b_rabd, &no_crypt); 1844 if (ret != 0) 1845 goto error; 1846 1847 if (no_crypt) { 1848 abd_copy(hdr->b_l1hdr.b_pabd, hdr->b_crypt_hdr.b_rabd, 1849 HDR_GET_PSIZE(hdr)); 1850 } 1851 1852 /* 1853 * If this header has disabled arc compression but the b_pabd is 1854 * compressed after decrypting it, we need to decompress the newly 1855 * decrypted data. 1856 */ 1857 if (HDR_GET_COMPRESS(hdr) != ZIO_COMPRESS_OFF && 1858 !HDR_COMPRESSION_ENABLED(hdr)) { 1859 /* 1860 * We want to make sure that we are correctly honoring the 1861 * zfs_abd_scatter_enabled setting, so we allocate an abd here 1862 * and then loan a buffer from it, rather than allocating a 1863 * linear buffer and wrapping it in an abd later. 1864 */ 1865 cabd = arc_get_data_abd(hdr, arc_hdr_size(hdr), hdr, 0); 1866 1867 ret = zio_decompress_data(HDR_GET_COMPRESS(hdr), 1868 hdr->b_l1hdr.b_pabd, cabd, HDR_GET_PSIZE(hdr), 1869 HDR_GET_LSIZE(hdr), &hdr->b_complevel); 1870 if (ret != 0) { 1871 goto error; 1872 } 1873 1874 arc_free_data_abd(hdr, hdr->b_l1hdr.b_pabd, 1875 arc_hdr_size(hdr), hdr); 1876 hdr->b_l1hdr.b_pabd = cabd; 1877 } 1878 1879 return (0); 1880 1881 error: 1882 arc_hdr_free_abd(hdr, B_FALSE); 1883 if (cabd != NULL) 1884 arc_free_data_buf(hdr, cabd, arc_hdr_size(hdr), hdr); 1885 1886 return (ret); 1887 } 1888 1889 /* 1890 * This function is called during arc_buf_fill() to prepare the header's 1891 * abd plaintext pointer for use. This involves authenticated protected 1892 * data and decrypting encrypted data into the plaintext abd. 1893 */ 1894 static int 1895 arc_fill_hdr_crypt(arc_buf_hdr_t *hdr, kmutex_t *hash_lock, spa_t *spa, 1896 const zbookmark_phys_t *zb, boolean_t noauth) 1897 { 1898 int ret; 1899 1900 ASSERT(HDR_PROTECTED(hdr)); 1901 1902 if (hash_lock != NULL) 1903 mutex_enter(hash_lock); 1904 1905 if (HDR_NOAUTH(hdr) && !noauth) { 1906 /* 1907 * The caller requested authenticated data but our data has 1908 * not been authenticated yet. Verify the MAC now if we can. 
1909 */ 1910 ret = arc_hdr_authenticate(hdr, spa, zb->zb_objset); 1911 if (ret != 0) 1912 goto error; 1913 } else if (HDR_HAS_RABD(hdr) && hdr->b_l1hdr.b_pabd == NULL) { 1914 /* 1915 * If we only have the encrypted version of the data, but the 1916 * unencrypted version was requested we take this opportunity 1917 * to store the decrypted version in the header for future use. 1918 */ 1919 ret = arc_hdr_decrypt(hdr, spa, zb); 1920 if (ret != 0) 1921 goto error; 1922 } 1923 1924 ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL); 1925 1926 if (hash_lock != NULL) 1927 mutex_exit(hash_lock); 1928 1929 return (0); 1930 1931 error: 1932 if (hash_lock != NULL) 1933 mutex_exit(hash_lock); 1934 1935 return (ret); 1936 } 1937 1938 /* 1939 * This function is used by the dbuf code to decrypt bonus buffers in place. 1940 * The dbuf code itself doesn't have any locking for decrypting a shared dnode 1941 * block, so we use the hash lock here to protect against concurrent calls to 1942 * arc_buf_fill(). 1943 */ 1944 static void 1945 arc_buf_untransform_in_place(arc_buf_t *buf) 1946 { 1947 arc_buf_hdr_t *hdr = buf->b_hdr; 1948 1949 ASSERT(HDR_ENCRYPTED(hdr)); 1950 ASSERT3U(hdr->b_crypt_hdr.b_ot, ==, DMU_OT_DNODE); 1951 ASSERT(HDR_EMPTY_OR_LOCKED(hdr)); 1952 ASSERT3PF(hdr->b_l1hdr.b_pabd, !=, NULL, "hdr %px buf %px", hdr, buf); 1953 1954 zio_crypt_copy_dnode_bonus(hdr->b_l1hdr.b_pabd, buf->b_data, 1955 arc_buf_size(buf)); 1956 buf->b_flags &= ~ARC_BUF_FLAG_ENCRYPTED; 1957 buf->b_flags &= ~ARC_BUF_FLAG_COMPRESSED; 1958 } 1959 1960 /* 1961 * Given a buf that has a data buffer attached to it, this function will 1962 * efficiently fill the buf with data of the specified compression setting from 1963 * the hdr and update the hdr's b_freeze_cksum if necessary. If the buf and hdr 1964 * are already sharing a data buf, no copy is performed. 1965 * 1966 * If the buf is marked as compressed but uncompressed data was requested, this 1967 * will allocate a new data buffer for the buf, remove that flag, and fill the 1968 * buf with uncompressed data. You can't request a compressed buf on a hdr with 1969 * uncompressed data, and (since we haven't added support for it yet) if you 1970 * want compressed data your buf must already be marked as compressed and have 1971 * the correct-sized data buffer. 1972 */ 1973 static int 1974 arc_buf_fill(arc_buf_t *buf, spa_t *spa, const zbookmark_phys_t *zb, 1975 arc_fill_flags_t flags) 1976 { 1977 int error = 0; 1978 arc_buf_hdr_t *hdr = buf->b_hdr; 1979 boolean_t hdr_compressed = 1980 (arc_hdr_get_compress(hdr) != ZIO_COMPRESS_OFF); 1981 boolean_t compressed = (flags & ARC_FILL_COMPRESSED) != 0; 1982 boolean_t encrypted = (flags & ARC_FILL_ENCRYPTED) != 0; 1983 dmu_object_byteswap_t bswap = hdr->b_l1hdr.b_byteswap; 1984 kmutex_t *hash_lock = (flags & ARC_FILL_LOCKED) ? NULL : HDR_LOCK(hdr); 1985 1986 ASSERT3P(buf->b_data, !=, NULL); 1987 IMPLY(compressed, hdr_compressed || ARC_BUF_ENCRYPTED(buf)); 1988 IMPLY(compressed, ARC_BUF_COMPRESSED(buf)); 1989 IMPLY(encrypted, HDR_ENCRYPTED(hdr)); 1990 IMPLY(encrypted, ARC_BUF_ENCRYPTED(buf)); 1991 IMPLY(encrypted, ARC_BUF_COMPRESSED(buf)); 1992 IMPLY(encrypted, !arc_buf_is_shared(buf)); 1993 1994 /* 1995 * If the caller wanted encrypted data we just need to copy it from 1996 * b_rabd and potentially byteswap it. We won't be able to do any 1997 * further transforms on it. 
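 * The copy below is HDR_GET_PSIZE(hdr) bytes taken straight from b_rabd;
 * control then jumps to the byteswap label at the end of the function for
 * any byte order conversion that is still required.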
1998 */ 1999 if (encrypted) { 2000 ASSERT(HDR_HAS_RABD(hdr)); 2001 abd_copy_to_buf(buf->b_data, hdr->b_crypt_hdr.b_rabd, 2002 HDR_GET_PSIZE(hdr)); 2003 goto byteswap; 2004 } 2005 2006 /* 2007 * Adjust encrypted and authenticated headers to accommodate 2008 * the request if needed. Dnode blocks (ARC_FILL_IN_PLACE) are 2009 * allowed to fail decryption due to keys not being loaded 2010 * without being marked as an IO error. 2011 */ 2012 if (HDR_PROTECTED(hdr)) { 2013 error = arc_fill_hdr_crypt(hdr, hash_lock, spa, 2014 zb, !!(flags & ARC_FILL_NOAUTH)); 2015 if (error == EACCES && (flags & ARC_FILL_IN_PLACE) != 0) { 2016 return (error); 2017 } else if (error != 0) { 2018 if (hash_lock != NULL) 2019 mutex_enter(hash_lock); 2020 arc_hdr_set_flags(hdr, ARC_FLAG_IO_ERROR); 2021 if (hash_lock != NULL) 2022 mutex_exit(hash_lock); 2023 return (error); 2024 } 2025 } 2026 2027 /* 2028 * There is a special case here for dnode blocks which are 2029 * decrypting their bonus buffers. These blocks may request to 2030 * be decrypted in-place. This is necessary because there may 2031 * be many dnodes pointing into this buffer and there is 2032 * currently no method to synchronize replacing the backing 2033 * b_data buffer and updating all of the pointers. Here we use 2034 * the hash lock to ensure there are no races. If the need 2035 * arises for other types to be decrypted in-place, they must 2036 * add handling here as well. 2037 */ 2038 if ((flags & ARC_FILL_IN_PLACE) != 0) { 2039 ASSERT(!hdr_compressed); 2040 ASSERT(!compressed); 2041 ASSERT(!encrypted); 2042 2043 if (HDR_ENCRYPTED(hdr) && ARC_BUF_ENCRYPTED(buf)) { 2044 ASSERT3U(hdr->b_crypt_hdr.b_ot, ==, DMU_OT_DNODE); 2045 2046 if (hash_lock != NULL) 2047 mutex_enter(hash_lock); 2048 arc_buf_untransform_in_place(buf); 2049 if (hash_lock != NULL) 2050 mutex_exit(hash_lock); 2051 2052 /* Compute the hdr's checksum if necessary */ 2053 arc_cksum_compute(buf); 2054 } 2055 2056 return (0); 2057 } 2058 2059 if (hdr_compressed == compressed) { 2060 if (ARC_BUF_SHARED(buf)) { 2061 ASSERT(arc_buf_is_shared(buf)); 2062 } else { 2063 abd_copy_to_buf(buf->b_data, hdr->b_l1hdr.b_pabd, 2064 arc_buf_size(buf)); 2065 } 2066 } else { 2067 ASSERT(hdr_compressed); 2068 ASSERT(!compressed); 2069 2070 /* 2071 * If the buf is sharing its data with the hdr, unlink it and 2072 * allocate a new data buffer for the buf. 2073 */ 2074 if (ARC_BUF_SHARED(buf)) { 2075 ASSERTF(ARC_BUF_COMPRESSED(buf), 2076 "buf %p was uncompressed", buf); 2077 2078 /* We need to give the buf its own b_data */ 2079 buf->b_flags &= ~ARC_BUF_FLAG_SHARED; 2080 buf->b_data = 2081 arc_get_data_buf(hdr, HDR_GET_LSIZE(hdr), buf); 2082 arc_hdr_clear_flags(hdr, ARC_FLAG_SHARED_DATA); 2083 2084 /* Previously overhead was 0; just add new overhead */ 2085 ARCSTAT_INCR(arcstat_overhead_size, HDR_GET_LSIZE(hdr)); 2086 } else if (ARC_BUF_COMPRESSED(buf)) { 2087 ASSERT(!arc_buf_is_shared(buf)); 2088 2089 /* We need to reallocate the buf's b_data */ 2090 arc_free_data_buf(hdr, buf->b_data, HDR_GET_PSIZE(hdr), 2091 buf); 2092 buf->b_data = 2093 arc_get_data_buf(hdr, HDR_GET_LSIZE(hdr), buf); 2094 2095 /* We increased the size of b_data; update overhead */ 2096 ARCSTAT_INCR(arcstat_overhead_size, 2097 HDR_GET_LSIZE(hdr) - HDR_GET_PSIZE(hdr)); 2098 } 2099 2100 /* 2101 * Regardless of the buf's previous compression settings, it 2102 * should not be compressed at the end of this function. 
2103 */ 2104 buf->b_flags &= ~ARC_BUF_FLAG_COMPRESSED; 2105 2106 /* 2107 * Try copying the data from another buf which already has a 2108 * decompressed version. If that's not possible, it's time to 2109 * bite the bullet and decompress the data from the hdr. 2110 */ 2111 if (arc_buf_try_copy_decompressed_data(buf)) { 2112 /* Skip byteswapping and checksumming (already done) */ 2113 return (0); 2114 } else { 2115 abd_t dabd; 2116 abd_get_from_buf_struct(&dabd, buf->b_data, 2117 HDR_GET_LSIZE(hdr)); 2118 error = zio_decompress_data(HDR_GET_COMPRESS(hdr), 2119 hdr->b_l1hdr.b_pabd, &dabd, 2120 HDR_GET_PSIZE(hdr), HDR_GET_LSIZE(hdr), 2121 &hdr->b_complevel); 2122 abd_free(&dabd); 2123 2124 /* 2125 * Absent hardware errors or software bugs, this should 2126 * be impossible, but log it anyway so we can debug it. 2127 */ 2128 if (error != 0) { 2129 zfs_dbgmsg( 2130 "hdr %px, compress %d, psize %d, lsize %d", 2131 hdr, arc_hdr_get_compress(hdr), 2132 HDR_GET_PSIZE(hdr), HDR_GET_LSIZE(hdr)); 2133 if (hash_lock != NULL) 2134 mutex_enter(hash_lock); 2135 arc_hdr_set_flags(hdr, ARC_FLAG_IO_ERROR); 2136 if (hash_lock != NULL) 2137 mutex_exit(hash_lock); 2138 return (SET_ERROR(EIO)); 2139 } 2140 } 2141 } 2142 2143 byteswap: 2144 /* Byteswap the buf's data if necessary */ 2145 if (bswap != DMU_BSWAP_NUMFUNCS) { 2146 ASSERT(!HDR_SHARED_DATA(hdr)); 2147 ASSERT3U(bswap, <, DMU_BSWAP_NUMFUNCS); 2148 dmu_ot_byteswap[bswap].ob_func(buf->b_data, HDR_GET_LSIZE(hdr)); 2149 } 2150 2151 /* Compute the hdr's checksum if necessary */ 2152 arc_cksum_compute(buf); 2153 2154 return (0); 2155 } 2156 2157 /* 2158 * If this function is being called to decrypt an encrypted buffer or verify an 2159 * authenticated one, the key must be loaded and a mapping must be made 2160 * available in the keystore via spa_keystore_create_mapping() or one of its 2161 * callers. 2162 */ 2163 int 2164 arc_untransform(arc_buf_t *buf, spa_t *spa, const zbookmark_phys_t *zb, 2165 boolean_t in_place) 2166 { 2167 int ret; 2168 arc_fill_flags_t flags = 0; 2169 2170 if (in_place) 2171 flags |= ARC_FILL_IN_PLACE; 2172 2173 ret = arc_buf_fill(buf, spa, zb, flags); 2174 if (ret == ECKSUM) { 2175 /* 2176 * Convert authentication and decryption errors to EIO 2177 * (and generate an ereport) before leaving the ARC. 2178 */ 2179 ret = SET_ERROR(EIO); 2180 spa_log_error(spa, zb, buf->b_hdr->b_birth); 2181 (void) zfs_ereport_post(FM_EREPORT_ZFS_AUTHENTICATION, 2182 spa, NULL, zb, NULL, 0); 2183 } 2184 2185 return (ret); 2186 } 2187 2188 /* 2189 * Increment the amount of evictable space in the arc_state_t's refcount. 2190 * We account for the space used by the hdr and the arc buf individually 2191 * so that we can add and remove them from the refcount individually. 
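 * For ghost states only the logical size (lsize) is counted, since no data
 * is resident. Otherwise we count the hdr's own ABD (and its raw ABD, if
 * present) plus every arc_buf_t that is not sharing the hdr's data.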
2192 */ 2193 static void 2194 arc_evictable_space_increment(arc_buf_hdr_t *hdr, arc_state_t *state) 2195 { 2196 arc_buf_contents_t type = arc_buf_type(hdr); 2197 2198 ASSERT(HDR_HAS_L1HDR(hdr)); 2199 2200 if (GHOST_STATE(state)) { 2201 ASSERT3P(hdr->b_l1hdr.b_buf, ==, NULL); 2202 ASSERT3P(hdr->b_l1hdr.b_pabd, ==, NULL); 2203 ASSERT(!HDR_HAS_RABD(hdr)); 2204 (void) zfs_refcount_add_many(&state->arcs_esize[type], 2205 HDR_GET_LSIZE(hdr), hdr); 2206 return; 2207 } 2208 2209 if (hdr->b_l1hdr.b_pabd != NULL) { 2210 (void) zfs_refcount_add_many(&state->arcs_esize[type], 2211 arc_hdr_size(hdr), hdr); 2212 } 2213 if (HDR_HAS_RABD(hdr)) { 2214 (void) zfs_refcount_add_many(&state->arcs_esize[type], 2215 HDR_GET_PSIZE(hdr), hdr); 2216 } 2217 2218 for (arc_buf_t *buf = hdr->b_l1hdr.b_buf; buf != NULL; 2219 buf = buf->b_next) { 2220 if (ARC_BUF_SHARED(buf)) 2221 continue; 2222 (void) zfs_refcount_add_many(&state->arcs_esize[type], 2223 arc_buf_size(buf), buf); 2224 } 2225 } 2226 2227 /* 2228 * Decrement the amount of evictable space in the arc_state_t's refcount. 2229 * We account for the space used by the hdr and the arc buf individually 2230 * so that we can add and remove them from the refcount individually. 2231 */ 2232 static void 2233 arc_evictable_space_decrement(arc_buf_hdr_t *hdr, arc_state_t *state) 2234 { 2235 arc_buf_contents_t type = arc_buf_type(hdr); 2236 2237 ASSERT(HDR_HAS_L1HDR(hdr)); 2238 2239 if (GHOST_STATE(state)) { 2240 ASSERT3P(hdr->b_l1hdr.b_buf, ==, NULL); 2241 ASSERT3P(hdr->b_l1hdr.b_pabd, ==, NULL); 2242 ASSERT(!HDR_HAS_RABD(hdr)); 2243 (void) zfs_refcount_remove_many(&state->arcs_esize[type], 2244 HDR_GET_LSIZE(hdr), hdr); 2245 return; 2246 } 2247 2248 if (hdr->b_l1hdr.b_pabd != NULL) { 2249 (void) zfs_refcount_remove_many(&state->arcs_esize[type], 2250 arc_hdr_size(hdr), hdr); 2251 } 2252 if (HDR_HAS_RABD(hdr)) { 2253 (void) zfs_refcount_remove_many(&state->arcs_esize[type], 2254 HDR_GET_PSIZE(hdr), hdr); 2255 } 2256 2257 for (arc_buf_t *buf = hdr->b_l1hdr.b_buf; buf != NULL; 2258 buf = buf->b_next) { 2259 if (ARC_BUF_SHARED(buf)) 2260 continue; 2261 (void) zfs_refcount_remove_many(&state->arcs_esize[type], 2262 arc_buf_size(buf), buf); 2263 } 2264 } 2265 2266 /* 2267 * Add a reference to this hdr indicating that someone is actively 2268 * referencing that memory. When the refcount transitions from 0 to 1, 2269 * we remove it from the respective arc_state_t list to indicate that 2270 * it is not evictable. 2271 */ 2272 static void 2273 add_reference(arc_buf_hdr_t *hdr, const void *tag) 2274 { 2275 arc_state_t *state = hdr->b_l1hdr.b_state; 2276 2277 ASSERT(HDR_HAS_L1HDR(hdr)); 2278 if (!HDR_EMPTY(hdr) && !MUTEX_HELD(HDR_LOCK(hdr))) { 2279 ASSERT(state == arc_anon); 2280 ASSERT(zfs_refcount_is_zero(&hdr->b_l1hdr.b_refcnt)); 2281 ASSERT3P(hdr->b_l1hdr.b_buf, ==, NULL); 2282 } 2283 2284 if ((zfs_refcount_add(&hdr->b_l1hdr.b_refcnt, tag) == 1) && 2285 state != arc_anon && state != arc_l2c_only) { 2286 /* We don't use the L2-only state list. */ 2287 multilist_remove(&state->arcs_list[arc_buf_type(hdr)], hdr); 2288 arc_evictable_space_decrement(hdr, state); 2289 } 2290 } 2291 2292 /* 2293 * Remove a reference from this hdr. When the reference transitions from 2294 * 1 to 0 and we're not anonymous, then we add this hdr to the arc_state_t's 2295 * list making it eligible for eviction. 
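 * Anonymous headers, and uncached headers that are not prefetches, are
 * instead destroyed outright once their last reference is dropped.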
2296 */ 2297 static int 2298 remove_reference(arc_buf_hdr_t *hdr, const void *tag) 2299 { 2300 int cnt; 2301 arc_state_t *state = hdr->b_l1hdr.b_state; 2302 2303 ASSERT(HDR_HAS_L1HDR(hdr)); 2304 ASSERT(state == arc_anon || MUTEX_HELD(HDR_LOCK(hdr))); 2305 ASSERT(!GHOST_STATE(state)); /* arc_l2c_only counts as a ghost. */ 2306 2307 if ((cnt = zfs_refcount_remove(&hdr->b_l1hdr.b_refcnt, tag)) != 0) 2308 return (cnt); 2309 2310 if (state == arc_anon) { 2311 arc_hdr_destroy(hdr); 2312 return (0); 2313 } 2314 if (state == arc_uncached && !HDR_PREFETCH(hdr)) { 2315 arc_change_state(arc_anon, hdr); 2316 arc_hdr_destroy(hdr); 2317 return (0); 2318 } 2319 multilist_insert(&state->arcs_list[arc_buf_type(hdr)], hdr); 2320 arc_evictable_space_increment(hdr, state); 2321 return (0); 2322 } 2323 2324 /* 2325 * Returns detailed information about a specific arc buffer. When the 2326 * state_index argument is set the function will calculate the arc header 2327 * list position for its arc state. Since this requires a linear traversal 2328 * callers are strongly encourage not to do this. However, it can be helpful 2329 * for targeted analysis so the functionality is provided. 2330 */ 2331 void 2332 arc_buf_info(arc_buf_t *ab, arc_buf_info_t *abi, int state_index) 2333 { 2334 (void) state_index; 2335 arc_buf_hdr_t *hdr = ab->b_hdr; 2336 l1arc_buf_hdr_t *l1hdr = NULL; 2337 l2arc_buf_hdr_t *l2hdr = NULL; 2338 arc_state_t *state = NULL; 2339 2340 memset(abi, 0, sizeof (arc_buf_info_t)); 2341 2342 if (hdr == NULL) 2343 return; 2344 2345 abi->abi_flags = hdr->b_flags; 2346 2347 if (HDR_HAS_L1HDR(hdr)) { 2348 l1hdr = &hdr->b_l1hdr; 2349 state = l1hdr->b_state; 2350 } 2351 if (HDR_HAS_L2HDR(hdr)) 2352 l2hdr = &hdr->b_l2hdr; 2353 2354 if (l1hdr) { 2355 abi->abi_bufcnt = 0; 2356 for (arc_buf_t *buf = l1hdr->b_buf; buf; buf = buf->b_next) 2357 abi->abi_bufcnt++; 2358 abi->abi_access = l1hdr->b_arc_access; 2359 abi->abi_mru_hits = l1hdr->b_mru_hits; 2360 abi->abi_mru_ghost_hits = l1hdr->b_mru_ghost_hits; 2361 abi->abi_mfu_hits = l1hdr->b_mfu_hits; 2362 abi->abi_mfu_ghost_hits = l1hdr->b_mfu_ghost_hits; 2363 abi->abi_holds = zfs_refcount_count(&l1hdr->b_refcnt); 2364 } 2365 2366 if (l2hdr) { 2367 abi->abi_l2arc_dattr = l2hdr->b_daddr; 2368 abi->abi_l2arc_hits = l2hdr->b_hits; 2369 } 2370 2371 abi->abi_state_type = state ? state->arcs_state : ARC_STATE_ANON; 2372 abi->abi_state_contents = arc_buf_type(hdr); 2373 abi->abi_size = arc_hdr_size(hdr); 2374 } 2375 2376 /* 2377 * Move the supplied buffer to the indicated state. The hash lock 2378 * for the buffer must be held by the caller. 2379 */ 2380 static void 2381 arc_change_state(arc_state_t *new_state, arc_buf_hdr_t *hdr) 2382 { 2383 arc_state_t *old_state; 2384 int64_t refcnt; 2385 boolean_t update_old, update_new; 2386 arc_buf_contents_t type = arc_buf_type(hdr); 2387 2388 /* 2389 * We almost always have an L1 hdr here, since we call arc_hdr_realloc() 2390 * in arc_read() when bringing a buffer out of the L2ARC. However, the 2391 * L1 hdr doesn't always exist when we change state to arc_anon before 2392 * destroying a header, in which case reallocating to add the L1 hdr is 2393 * pointless. 
2394 */ 2395 if (HDR_HAS_L1HDR(hdr)) { 2396 old_state = hdr->b_l1hdr.b_state; 2397 refcnt = zfs_refcount_count(&hdr->b_l1hdr.b_refcnt); 2398 update_old = (hdr->b_l1hdr.b_buf != NULL || 2399 hdr->b_l1hdr.b_pabd != NULL || HDR_HAS_RABD(hdr)); 2400 2401 IMPLY(GHOST_STATE(old_state), hdr->b_l1hdr.b_buf == NULL); 2402 IMPLY(GHOST_STATE(new_state), hdr->b_l1hdr.b_buf == NULL); 2403 IMPLY(old_state == arc_anon, hdr->b_l1hdr.b_buf == NULL || 2404 ARC_BUF_LAST(hdr->b_l1hdr.b_buf)); 2405 } else { 2406 old_state = arc_l2c_only; 2407 refcnt = 0; 2408 update_old = B_FALSE; 2409 } 2410 update_new = update_old; 2411 if (GHOST_STATE(old_state)) 2412 update_old = B_TRUE; 2413 if (GHOST_STATE(new_state)) 2414 update_new = B_TRUE; 2415 2416 ASSERT(MUTEX_HELD(HDR_LOCK(hdr))); 2417 ASSERT3P(new_state, !=, old_state); 2418 2419 /* 2420 * If this buffer is evictable, transfer it from the 2421 * old state list to the new state list. 2422 */ 2423 if (refcnt == 0) { 2424 if (old_state != arc_anon && old_state != arc_l2c_only) { 2425 ASSERT(HDR_HAS_L1HDR(hdr)); 2426 /* remove_reference() saves on insert. */ 2427 if (multilist_link_active(&hdr->b_l1hdr.b_arc_node)) { 2428 multilist_remove(&old_state->arcs_list[type], 2429 hdr); 2430 arc_evictable_space_decrement(hdr, old_state); 2431 } 2432 } 2433 if (new_state != arc_anon && new_state != arc_l2c_only) { 2434 /* 2435 * An L1 header always exists here, since if we're 2436 * moving to some L1-cached state (i.e. not l2c_only or 2437 * anonymous), we realloc the header to add an L1hdr 2438 * beforehand. 2439 */ 2440 ASSERT(HDR_HAS_L1HDR(hdr)); 2441 multilist_insert(&new_state->arcs_list[type], hdr); 2442 arc_evictable_space_increment(hdr, new_state); 2443 } 2444 } 2445 2446 ASSERT(!HDR_EMPTY(hdr)); 2447 if (new_state == arc_anon && HDR_IN_HASH_TABLE(hdr)) 2448 buf_hash_remove(hdr); 2449 2450 /* adjust state sizes (ignore arc_l2c_only) */ 2451 2452 if (update_new && new_state != arc_l2c_only) { 2453 ASSERT(HDR_HAS_L1HDR(hdr)); 2454 if (GHOST_STATE(new_state)) { 2455 2456 /* 2457 * When moving a header to a ghost state, we first 2458 * remove all arc buffers. Thus, we'll have no arc 2459 * buffer to use for the reference. As a result, we 2460 * use the arc header pointer for the reference. 2461 */ 2462 (void) zfs_refcount_add_many( 2463 &new_state->arcs_size[type], 2464 HDR_GET_LSIZE(hdr), hdr); 2465 ASSERT3P(hdr->b_l1hdr.b_pabd, ==, NULL); 2466 ASSERT(!HDR_HAS_RABD(hdr)); 2467 } else { 2468 2469 /* 2470 * Each individual buffer holds a unique reference, 2471 * thus we must remove each of these references one 2472 * at a time. 2473 */ 2474 for (arc_buf_t *buf = hdr->b_l1hdr.b_buf; buf != NULL; 2475 buf = buf->b_next) { 2476 2477 /* 2478 * When the arc_buf_t is sharing the data 2479 * block with the hdr, the owner of the 2480 * reference belongs to the hdr. Only 2481 * add to the refcount if the arc_buf_t is 2482 * not shared. 
2483 */ 2484 if (ARC_BUF_SHARED(buf)) 2485 continue; 2486 2487 (void) zfs_refcount_add_many( 2488 &new_state->arcs_size[type], 2489 arc_buf_size(buf), buf); 2490 } 2491 2492 if (hdr->b_l1hdr.b_pabd != NULL) { 2493 (void) zfs_refcount_add_many( 2494 &new_state->arcs_size[type], 2495 arc_hdr_size(hdr), hdr); 2496 } 2497 2498 if (HDR_HAS_RABD(hdr)) { 2499 (void) zfs_refcount_add_many( 2500 &new_state->arcs_size[type], 2501 HDR_GET_PSIZE(hdr), hdr); 2502 } 2503 } 2504 } 2505 2506 if (update_old && old_state != arc_l2c_only) { 2507 ASSERT(HDR_HAS_L1HDR(hdr)); 2508 if (GHOST_STATE(old_state)) { 2509 ASSERT3P(hdr->b_l1hdr.b_pabd, ==, NULL); 2510 ASSERT(!HDR_HAS_RABD(hdr)); 2511 2512 /* 2513 * When moving a header off of a ghost state, 2514 * the header will not contain any arc buffers. 2515 * We use the arc header pointer for the reference 2516 * which is exactly what we did when we put the 2517 * header on the ghost state. 2518 */ 2519 2520 (void) zfs_refcount_remove_many( 2521 &old_state->arcs_size[type], 2522 HDR_GET_LSIZE(hdr), hdr); 2523 } else { 2524 2525 /* 2526 * Each individual buffer holds a unique reference, 2527 * thus we must remove each of these references one 2528 * at a time. 2529 */ 2530 for (arc_buf_t *buf = hdr->b_l1hdr.b_buf; buf != NULL; 2531 buf = buf->b_next) { 2532 2533 /* 2534 * When the arc_buf_t is sharing the data 2535 * block with the hdr, the owner of the 2536 * reference belongs to the hdr. Only 2537 * add to the refcount if the arc_buf_t is 2538 * not shared. 2539 */ 2540 if (ARC_BUF_SHARED(buf)) 2541 continue; 2542 2543 (void) zfs_refcount_remove_many( 2544 &old_state->arcs_size[type], 2545 arc_buf_size(buf), buf); 2546 } 2547 ASSERT(hdr->b_l1hdr.b_pabd != NULL || 2548 HDR_HAS_RABD(hdr)); 2549 2550 if (hdr->b_l1hdr.b_pabd != NULL) { 2551 (void) zfs_refcount_remove_many( 2552 &old_state->arcs_size[type], 2553 arc_hdr_size(hdr), hdr); 2554 } 2555 2556 if (HDR_HAS_RABD(hdr)) { 2557 (void) zfs_refcount_remove_many( 2558 &old_state->arcs_size[type], 2559 HDR_GET_PSIZE(hdr), hdr); 2560 } 2561 } 2562 } 2563 2564 if (HDR_HAS_L1HDR(hdr)) { 2565 hdr->b_l1hdr.b_state = new_state; 2566 2567 if (HDR_HAS_L2HDR(hdr) && new_state != arc_l2c_only) { 2568 l2arc_hdr_arcstats_decrement_state(hdr); 2569 hdr->b_l2hdr.b_arcs_state = new_state->arcs_state; 2570 l2arc_hdr_arcstats_increment_state(hdr); 2571 } 2572 } 2573 } 2574 2575 void 2576 arc_space_consume(uint64_t space, arc_space_type_t type) 2577 { 2578 ASSERT(type >= 0 && type < ARC_SPACE_NUMTYPES); 2579 2580 switch (type) { 2581 default: 2582 break; 2583 case ARC_SPACE_DATA: 2584 ARCSTAT_INCR(arcstat_data_size, space); 2585 break; 2586 case ARC_SPACE_META: 2587 ARCSTAT_INCR(arcstat_metadata_size, space); 2588 break; 2589 case ARC_SPACE_BONUS: 2590 ARCSTAT_INCR(arcstat_bonus_size, space); 2591 break; 2592 case ARC_SPACE_DNODE: 2593 ARCSTAT_INCR(arcstat_dnode_size, space); 2594 break; 2595 case ARC_SPACE_DBUF: 2596 ARCSTAT_INCR(arcstat_dbuf_size, space); 2597 break; 2598 case ARC_SPACE_HDRS: 2599 ARCSTAT_INCR(arcstat_hdr_size, space); 2600 break; 2601 case ARC_SPACE_L2HDRS: 2602 aggsum_add(&arc_sums.arcstat_l2_hdr_size, space); 2603 break; 2604 case ARC_SPACE_ABD_CHUNK_WASTE: 2605 /* 2606 * Note: this includes space wasted by all scatter ABD's, not 2607 * just those allocated by the ARC. But the vast majority of 2608 * scatter ABD's come from the ARC, because other users are 2609 * very short-lived. 
2610 */ 2611 ARCSTAT_INCR(arcstat_abd_chunk_waste_size, space); 2612 break; 2613 } 2614 2615 if (type != ARC_SPACE_DATA && type != ARC_SPACE_ABD_CHUNK_WASTE) 2616 ARCSTAT_INCR(arcstat_meta_used, space); 2617 2618 aggsum_add(&arc_sums.arcstat_size, space); 2619 } 2620 2621 void 2622 arc_space_return(uint64_t space, arc_space_type_t type) 2623 { 2624 ASSERT(type >= 0 && type < ARC_SPACE_NUMTYPES); 2625 2626 switch (type) { 2627 default: 2628 break; 2629 case ARC_SPACE_DATA: 2630 ARCSTAT_INCR(arcstat_data_size, -space); 2631 break; 2632 case ARC_SPACE_META: 2633 ARCSTAT_INCR(arcstat_metadata_size, -space); 2634 break; 2635 case ARC_SPACE_BONUS: 2636 ARCSTAT_INCR(arcstat_bonus_size, -space); 2637 break; 2638 case ARC_SPACE_DNODE: 2639 ARCSTAT_INCR(arcstat_dnode_size, -space); 2640 break; 2641 case ARC_SPACE_DBUF: 2642 ARCSTAT_INCR(arcstat_dbuf_size, -space); 2643 break; 2644 case ARC_SPACE_HDRS: 2645 ARCSTAT_INCR(arcstat_hdr_size, -space); 2646 break; 2647 case ARC_SPACE_L2HDRS: 2648 aggsum_add(&arc_sums.arcstat_l2_hdr_size, -space); 2649 break; 2650 case ARC_SPACE_ABD_CHUNK_WASTE: 2651 ARCSTAT_INCR(arcstat_abd_chunk_waste_size, -space); 2652 break; 2653 } 2654 2655 if (type != ARC_SPACE_DATA && type != ARC_SPACE_ABD_CHUNK_WASTE) 2656 ARCSTAT_INCR(arcstat_meta_used, -space); 2657 2658 ASSERT(aggsum_compare(&arc_sums.arcstat_size, space) >= 0); 2659 aggsum_add(&arc_sums.arcstat_size, -space); 2660 } 2661 2662 /* 2663 * Given a hdr and a buf, returns whether that buf can share its b_data buffer 2664 * with the hdr's b_pabd. 2665 */ 2666 static boolean_t 2667 arc_can_share(arc_buf_hdr_t *hdr, arc_buf_t *buf) 2668 { 2669 /* 2670 * The criteria for sharing a hdr's data are: 2671 * 1. the buffer is not encrypted 2672 * 2. the hdr's compression matches the buf's compression 2673 * 3. the hdr doesn't need to be byteswapped 2674 * 4. the hdr isn't already being shared 2675 * 5. the buf is either compressed or it is the last buf in the hdr list 2676 * 2677 * Criterion #5 maintains the invariant that shared uncompressed 2678 * bufs must be the final buf in the hdr's b_buf list. Reading this, you 2679 * might ask, "if a compressed buf is allocated first, won't that be the 2680 * last thing in the list?", but in that case it's impossible to create 2681 * a shared uncompressed buf anyway (because the hdr must be compressed 2682 * to have the compressed buf). You might also think that #3 is 2683 * sufficient to make this guarantee, however it's possible 2684 * (specifically in the rare L2ARC write race mentioned in 2685 * arc_buf_alloc_impl()) there will be an existing uncompressed buf that 2686 * is shareable, but wasn't at the time of its allocation. Rather than 2687 * allow a new shared uncompressed buf to be created and then shuffle 2688 * the list around to make it the last element, this simply disallows 2689 * sharing if the new buf isn't the first to be added. 2690 */ 2691 ASSERT3P(buf->b_hdr, ==, hdr); 2692 boolean_t hdr_compressed = 2693 arc_hdr_get_compress(hdr) != ZIO_COMPRESS_OFF; 2694 boolean_t buf_compressed = ARC_BUF_COMPRESSED(buf) != 0; 2695 return (!ARC_BUF_ENCRYPTED(buf) && 2696 buf_compressed == hdr_compressed && 2697 hdr->b_l1hdr.b_byteswap == DMU_BSWAP_NUMFUNCS && 2698 !HDR_SHARED_DATA(hdr) && 2699 (ARC_BUF_LAST(buf) || ARC_BUF_COMPRESSED(buf))); 2700 } 2701 2702 /* 2703 * Allocate a buf for this hdr. If you care about the data that's in the hdr, 2704 * or if you want a compressed buffer, pass those flags in. 
Returns 0 if the 2705 * copy was made successfully, or an error code otherwise. 2706 */ 2707 static int 2708 arc_buf_alloc_impl(arc_buf_hdr_t *hdr, spa_t *spa, const zbookmark_phys_t *zb, 2709 const void *tag, boolean_t encrypted, boolean_t compressed, 2710 boolean_t noauth, boolean_t fill, arc_buf_t **ret) 2711 { 2712 arc_buf_t *buf; 2713 arc_fill_flags_t flags = ARC_FILL_LOCKED; 2714 2715 ASSERT(HDR_HAS_L1HDR(hdr)); 2716 ASSERT3U(HDR_GET_LSIZE(hdr), >, 0); 2717 VERIFY(hdr->b_type == ARC_BUFC_DATA || 2718 hdr->b_type == ARC_BUFC_METADATA); 2719 ASSERT3P(ret, !=, NULL); 2720 ASSERT3P(*ret, ==, NULL); 2721 IMPLY(encrypted, compressed); 2722 2723 buf = *ret = kmem_cache_alloc(buf_cache, KM_PUSHPAGE); 2724 buf->b_hdr = hdr; 2725 buf->b_data = NULL; 2726 buf->b_next = hdr->b_l1hdr.b_buf; 2727 buf->b_flags = 0; 2728 2729 add_reference(hdr, tag); 2730 2731 /* 2732 * We're about to change the hdr's b_flags. We must either 2733 * hold the hash_lock or be undiscoverable. 2734 */ 2735 ASSERT(HDR_EMPTY_OR_LOCKED(hdr)); 2736 2737 /* 2738 * Only honor requests for compressed bufs if the hdr is actually 2739 * compressed. This must be overridden if the buffer is encrypted since 2740 * encrypted buffers cannot be decompressed. 2741 */ 2742 if (encrypted) { 2743 buf->b_flags |= ARC_BUF_FLAG_COMPRESSED; 2744 buf->b_flags |= ARC_BUF_FLAG_ENCRYPTED; 2745 flags |= ARC_FILL_COMPRESSED | ARC_FILL_ENCRYPTED; 2746 } else if (compressed && 2747 arc_hdr_get_compress(hdr) != ZIO_COMPRESS_OFF) { 2748 buf->b_flags |= ARC_BUF_FLAG_COMPRESSED; 2749 flags |= ARC_FILL_COMPRESSED; 2750 } 2751 2752 if (noauth) { 2753 ASSERT0(encrypted); 2754 flags |= ARC_FILL_NOAUTH; 2755 } 2756 2757 /* 2758 * If the hdr's data can be shared then we share the data buffer and 2759 * set the appropriate bit in the hdr's b_flags to indicate the hdr is 2760 * sharing it's b_pabd with the arc_buf_t. Otherwise, we allocate a new 2761 * buffer to store the buf's data. 2762 * 2763 * There are two additional restrictions here because we're sharing 2764 * hdr -> buf instead of the usual buf -> hdr. First, the hdr can't be 2765 * actively involved in an L2ARC write, because if this buf is used by 2766 * an arc_write() then the hdr's data buffer will be released when the 2767 * write completes, even though the L2ARC write might still be using it. 2768 * Second, the hdr's ABD must be linear so that the buf's user doesn't 2769 * need to be ABD-aware. It must be allocated via 2770 * zio_[data_]buf_alloc(), not as a page, because we need to be able 2771 * to abd_release_ownership_of_buf(), which isn't allowed on "linear 2772 * page" buffers because the ABD code needs to handle freeing them 2773 * specially. 2774 */ 2775 boolean_t can_share = arc_can_share(hdr, buf) && 2776 !HDR_L2_WRITING(hdr) && 2777 hdr->b_l1hdr.b_pabd != NULL && 2778 abd_is_linear(hdr->b_l1hdr.b_pabd) && 2779 !abd_is_linear_page(hdr->b_l1hdr.b_pabd); 2780 2781 /* Set up b_data and sharing */ 2782 if (can_share) { 2783 buf->b_data = abd_to_buf(hdr->b_l1hdr.b_pabd); 2784 buf->b_flags |= ARC_BUF_FLAG_SHARED; 2785 arc_hdr_set_flags(hdr, ARC_FLAG_SHARED_DATA); 2786 } else { 2787 buf->b_data = 2788 arc_get_data_buf(hdr, arc_buf_size(buf), buf); 2789 ARCSTAT_INCR(arcstat_overhead_size, arc_buf_size(buf)); 2790 } 2791 VERIFY3P(buf->b_data, !=, NULL); 2792 2793 hdr->b_l1hdr.b_buf = buf; 2794 2795 /* 2796 * If the user wants the data from the hdr, we need to either copy or 2797 * decompress the data. 
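 * (arc_alloc_buf() and arc_alloc_compressed_buf() below are typical callers;
 * they pass fill = B_FALSE and thaw the returned buf themselves.)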
2798 */ 2799 if (fill) { 2800 ASSERT3P(zb, !=, NULL); 2801 return (arc_buf_fill(buf, spa, zb, flags)); 2802 } 2803 2804 return (0); 2805 } 2806 2807 static const char *arc_onloan_tag = "onloan"; 2808 2809 static inline void 2810 arc_loaned_bytes_update(int64_t delta) 2811 { 2812 atomic_add_64(&arc_loaned_bytes, delta); 2813 2814 /* assert that it did not wrap around */ 2815 ASSERT3S(atomic_add_64_nv(&arc_loaned_bytes, 0), >=, 0); 2816 } 2817 2818 /* 2819 * Loan out an anonymous arc buffer. Loaned buffers are not counted as in 2820 * flight data by arc_tempreserve_space() until they are "returned". Loaned 2821 * buffers must be returned to the arc before they can be used by the DMU or 2822 * freed. 2823 */ 2824 arc_buf_t * 2825 arc_loan_buf(spa_t *spa, boolean_t is_metadata, int size) 2826 { 2827 arc_buf_t *buf = arc_alloc_buf(spa, arc_onloan_tag, 2828 is_metadata ? ARC_BUFC_METADATA : ARC_BUFC_DATA, size); 2829 2830 arc_loaned_bytes_update(arc_buf_size(buf)); 2831 2832 return (buf); 2833 } 2834 2835 arc_buf_t * 2836 arc_loan_compressed_buf(spa_t *spa, uint64_t psize, uint64_t lsize, 2837 enum zio_compress compression_type, uint8_t complevel) 2838 { 2839 arc_buf_t *buf = arc_alloc_compressed_buf(spa, arc_onloan_tag, 2840 psize, lsize, compression_type, complevel); 2841 2842 arc_loaned_bytes_update(arc_buf_size(buf)); 2843 2844 return (buf); 2845 } 2846 2847 arc_buf_t * 2848 arc_loan_raw_buf(spa_t *spa, uint64_t dsobj, boolean_t byteorder, 2849 const uint8_t *salt, const uint8_t *iv, const uint8_t *mac, 2850 dmu_object_type_t ot, uint64_t psize, uint64_t lsize, 2851 enum zio_compress compression_type, uint8_t complevel) 2852 { 2853 arc_buf_t *buf = arc_alloc_raw_buf(spa, arc_onloan_tag, dsobj, 2854 byteorder, salt, iv, mac, ot, psize, lsize, compression_type, 2855 complevel); 2856 2857 atomic_add_64(&arc_loaned_bytes, psize); 2858 return (buf); 2859 } 2860 2861 2862 /* 2863 * Return a loaned arc buffer to the arc. 2864 */ 2865 void 2866 arc_return_buf(arc_buf_t *buf, const void *tag) 2867 { 2868 arc_buf_hdr_t *hdr = buf->b_hdr; 2869 2870 ASSERT3P(buf->b_data, !=, NULL); 2871 ASSERT(HDR_HAS_L1HDR(hdr)); 2872 (void) zfs_refcount_add(&hdr->b_l1hdr.b_refcnt, tag); 2873 (void) zfs_refcount_remove(&hdr->b_l1hdr.b_refcnt, arc_onloan_tag); 2874 2875 arc_loaned_bytes_update(-arc_buf_size(buf)); 2876 } 2877 2878 /* Detach an arc_buf from a dbuf (tag) */ 2879 void 2880 arc_loan_inuse_buf(arc_buf_t *buf, const void *tag) 2881 { 2882 arc_buf_hdr_t *hdr = buf->b_hdr; 2883 2884 ASSERT3P(buf->b_data, !=, NULL); 2885 ASSERT(HDR_HAS_L1HDR(hdr)); 2886 (void) zfs_refcount_add(&hdr->b_l1hdr.b_refcnt, arc_onloan_tag); 2887 (void) zfs_refcount_remove(&hdr->b_l1hdr.b_refcnt, tag); 2888 2889 arc_loaned_bytes_update(arc_buf_size(buf)); 2890 } 2891 2892 static void 2893 l2arc_free_abd_on_write(abd_t *abd, size_t size, arc_buf_contents_t type) 2894 { 2895 l2arc_data_free_t *df = kmem_alloc(sizeof (*df), KM_SLEEP); 2896 2897 df->l2df_abd = abd; 2898 df->l2df_size = size; 2899 df->l2df_type = type; 2900 mutex_enter(&l2arc_free_on_write_mtx); 2901 list_insert_head(l2arc_free_on_write, df); 2902 mutex_exit(&l2arc_free_on_write_mtx); 2903 } 2904 2905 static void 2906 arc_hdr_free_on_write(arc_buf_hdr_t *hdr, boolean_t free_rdata) 2907 { 2908 arc_state_t *state = hdr->b_l1hdr.b_state; 2909 arc_buf_contents_t type = arc_buf_type(hdr); 2910 uint64_t size = (free_rdata) ? 
HDR_GET_PSIZE(hdr) : arc_hdr_size(hdr); 2911 2912 /* protected by hash lock, if in the hash table */ 2913 if (multilist_link_active(&hdr->b_l1hdr.b_arc_node)) { 2914 ASSERT(zfs_refcount_is_zero(&hdr->b_l1hdr.b_refcnt)); 2915 ASSERT(state != arc_anon && state != arc_l2c_only); 2916 2917 (void) zfs_refcount_remove_many(&state->arcs_esize[type], 2918 size, hdr); 2919 } 2920 (void) zfs_refcount_remove_many(&state->arcs_size[type], size, hdr); 2921 if (type == ARC_BUFC_METADATA) { 2922 arc_space_return(size, ARC_SPACE_META); 2923 } else { 2924 ASSERT(type == ARC_BUFC_DATA); 2925 arc_space_return(size, ARC_SPACE_DATA); 2926 } 2927 2928 if (free_rdata) { 2929 l2arc_free_abd_on_write(hdr->b_crypt_hdr.b_rabd, size, type); 2930 } else { 2931 l2arc_free_abd_on_write(hdr->b_l1hdr.b_pabd, size, type); 2932 } 2933 } 2934 2935 /* 2936 * Share the arc_buf_t's data with the hdr. Whenever we are sharing the 2937 * data buffer, we transfer the refcount ownership to the hdr and update 2938 * the appropriate kstats. 2939 */ 2940 static void 2941 arc_share_buf(arc_buf_hdr_t *hdr, arc_buf_t *buf) 2942 { 2943 ASSERT(arc_can_share(hdr, buf)); 2944 ASSERT3P(hdr->b_l1hdr.b_pabd, ==, NULL); 2945 ASSERT(!ARC_BUF_ENCRYPTED(buf)); 2946 ASSERT(HDR_EMPTY_OR_LOCKED(hdr)); 2947 2948 /* 2949 * Start sharing the data buffer. We transfer the 2950 * refcount ownership to the hdr since it always owns 2951 * the refcount whenever an arc_buf_t is shared. 2952 */ 2953 zfs_refcount_transfer_ownership_many( 2954 &hdr->b_l1hdr.b_state->arcs_size[arc_buf_type(hdr)], 2955 arc_hdr_size(hdr), buf, hdr); 2956 hdr->b_l1hdr.b_pabd = abd_get_from_buf(buf->b_data, arc_buf_size(buf)); 2957 abd_take_ownership_of_buf(hdr->b_l1hdr.b_pabd, 2958 HDR_ISTYPE_METADATA(hdr)); 2959 arc_hdr_set_flags(hdr, ARC_FLAG_SHARED_DATA); 2960 buf->b_flags |= ARC_BUF_FLAG_SHARED; 2961 2962 /* 2963 * Since we've transferred ownership to the hdr we need 2964 * to increment its compressed and uncompressed kstats and 2965 * decrement the overhead size. 2966 */ 2967 ARCSTAT_INCR(arcstat_compressed_size, arc_hdr_size(hdr)); 2968 ARCSTAT_INCR(arcstat_uncompressed_size, HDR_GET_LSIZE(hdr)); 2969 ARCSTAT_INCR(arcstat_overhead_size, -arc_buf_size(buf)); 2970 } 2971 2972 static void 2973 arc_unshare_buf(arc_buf_hdr_t *hdr, arc_buf_t *buf) 2974 { 2975 ASSERT(arc_buf_is_shared(buf)); 2976 ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL); 2977 ASSERT(HDR_EMPTY_OR_LOCKED(hdr)); 2978 2979 /* 2980 * We are no longer sharing this buffer so we need 2981 * to transfer its ownership to the rightful owner. 2982 */ 2983 zfs_refcount_transfer_ownership_many( 2984 &hdr->b_l1hdr.b_state->arcs_size[arc_buf_type(hdr)], 2985 arc_hdr_size(hdr), hdr, buf); 2986 arc_hdr_clear_flags(hdr, ARC_FLAG_SHARED_DATA); 2987 abd_release_ownership_of_buf(hdr->b_l1hdr.b_pabd); 2988 abd_free(hdr->b_l1hdr.b_pabd); 2989 hdr->b_l1hdr.b_pabd = NULL; 2990 buf->b_flags &= ~ARC_BUF_FLAG_SHARED; 2991 2992 /* 2993 * Since the buffer is no longer shared between 2994 * the arc buf and the hdr, count it as overhead. 2995 */ 2996 ARCSTAT_INCR(arcstat_compressed_size, -arc_hdr_size(hdr)); 2997 ARCSTAT_INCR(arcstat_uncompressed_size, -HDR_GET_LSIZE(hdr)); 2998 ARCSTAT_INCR(arcstat_overhead_size, arc_buf_size(buf)); 2999 } 3000 3001 /* 3002 * Remove an arc_buf_t from the hdr's buf list and return the last 3003 * arc_buf_t on the list. If no buffers remain on the list then return 3004 * NULL. 
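 * The returned buffer, if any, lets arc_buf_destroy_impl() decide whether
 * the hdr's shared data needs to be handed over to a remaining buf.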
3005 */ 3006 static arc_buf_t * 3007 arc_buf_remove(arc_buf_hdr_t *hdr, arc_buf_t *buf) 3008 { 3009 ASSERT(HDR_HAS_L1HDR(hdr)); 3010 ASSERT(HDR_EMPTY_OR_LOCKED(hdr)); 3011 3012 arc_buf_t **bufp = &hdr->b_l1hdr.b_buf; 3013 arc_buf_t *lastbuf = NULL; 3014 3015 /* 3016 * Remove the buf from the hdr list and locate the last 3017 * remaining buffer on the list. 3018 */ 3019 while (*bufp != NULL) { 3020 if (*bufp == buf) 3021 *bufp = buf->b_next; 3022 3023 /* 3024 * If we've removed a buffer in the middle of 3025 * the list then update the lastbuf and update 3026 * bufp. 3027 */ 3028 if (*bufp != NULL) { 3029 lastbuf = *bufp; 3030 bufp = &(*bufp)->b_next; 3031 } 3032 } 3033 buf->b_next = NULL; 3034 ASSERT3P(lastbuf, !=, buf); 3035 IMPLY(lastbuf != NULL, ARC_BUF_LAST(lastbuf)); 3036 3037 return (lastbuf); 3038 } 3039 3040 /* 3041 * Free up buf->b_data and pull the arc_buf_t off of the arc_buf_hdr_t's 3042 * list and free it. 3043 */ 3044 static void 3045 arc_buf_destroy_impl(arc_buf_t *buf) 3046 { 3047 arc_buf_hdr_t *hdr = buf->b_hdr; 3048 3049 /* 3050 * Free up the data associated with the buf but only if we're not 3051 * sharing this with the hdr. If we are sharing it with the hdr, the 3052 * hdr is responsible for doing the free. 3053 */ 3054 if (buf->b_data != NULL) { 3055 /* 3056 * We're about to change the hdr's b_flags. We must either 3057 * hold the hash_lock or be undiscoverable. 3058 */ 3059 ASSERT(HDR_EMPTY_OR_LOCKED(hdr)); 3060 3061 arc_cksum_verify(buf); 3062 arc_buf_unwatch(buf); 3063 3064 if (ARC_BUF_SHARED(buf)) { 3065 arc_hdr_clear_flags(hdr, ARC_FLAG_SHARED_DATA); 3066 } else { 3067 ASSERT(!arc_buf_is_shared(buf)); 3068 uint64_t size = arc_buf_size(buf); 3069 arc_free_data_buf(hdr, buf->b_data, size, buf); 3070 ARCSTAT_INCR(arcstat_overhead_size, -size); 3071 } 3072 buf->b_data = NULL; 3073 3074 /* 3075 * If we have no more encrypted buffers and we've already 3076 * gotten a copy of the decrypted data we can free b_rabd 3077 * to save some space. 3078 */ 3079 if (ARC_BUF_ENCRYPTED(buf) && HDR_HAS_RABD(hdr) && 3080 hdr->b_l1hdr.b_pabd != NULL && !HDR_IO_IN_PROGRESS(hdr)) { 3081 arc_buf_t *b; 3082 for (b = hdr->b_l1hdr.b_buf; b; b = b->b_next) { 3083 if (b != buf && ARC_BUF_ENCRYPTED(b)) 3084 break; 3085 } 3086 if (b == NULL) 3087 arc_hdr_free_abd(hdr, B_TRUE); 3088 } 3089 } 3090 3091 arc_buf_t *lastbuf = arc_buf_remove(hdr, buf); 3092 3093 if (ARC_BUF_SHARED(buf) && !ARC_BUF_COMPRESSED(buf)) { 3094 /* 3095 * If the current arc_buf_t is sharing its data buffer with the 3096 * hdr, then reassign the hdr's b_pabd to share it with the new 3097 * buffer at the end of the list. The shared buffer is always 3098 * the last one on the hdr's buffer list. 3099 * 3100 * There is an equivalent case for compressed bufs, but since 3101 * they aren't guaranteed to be the last buf in the list and 3102 * that is an exceedingly rare case, we just allow that space be 3103 * wasted temporarily. We must also be careful not to share 3104 * encrypted buffers, since they cannot be shared. 3105 */ 3106 if (lastbuf != NULL && !ARC_BUF_ENCRYPTED(lastbuf)) { 3107 /* Only one buf can be shared at once */ 3108 ASSERT(!arc_buf_is_shared(lastbuf)); 3109 /* hdr is uncompressed so can't have compressed buf */ 3110 ASSERT(!ARC_BUF_COMPRESSED(lastbuf)); 3111 3112 ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL); 3113 arc_hdr_free_abd(hdr, B_FALSE); 3114 3115 /* 3116 * We must setup a new shared block between the 3117 * last buffer and the hdr. 
The data would have 3118 * been allocated by the arc buf so we need to transfer 3119 * ownership to the hdr since it's now being shared. 3120 */ 3121 arc_share_buf(hdr, lastbuf); 3122 } 3123 } else if (HDR_SHARED_DATA(hdr)) { 3124 /* 3125 * Uncompressed shared buffers are always at the end 3126 * of the list. Compressed buffers don't have the 3127 * same requirements. This makes it hard to 3128 * simply assert that the lastbuf is shared so 3129 * we rely on the hdr's compression flags to determine 3130 * if we have a compressed, shared buffer. 3131 */ 3132 ASSERT3P(lastbuf, !=, NULL); 3133 ASSERT(arc_buf_is_shared(lastbuf) || 3134 arc_hdr_get_compress(hdr) != ZIO_COMPRESS_OFF); 3135 } 3136 3137 /* 3138 * Free the checksum if we're removing the last uncompressed buf from 3139 * this hdr. 3140 */ 3141 if (!arc_hdr_has_uncompressed_buf(hdr)) { 3142 arc_cksum_free(hdr); 3143 } 3144 3145 /* clean up the buf */ 3146 buf->b_hdr = NULL; 3147 kmem_cache_free(buf_cache, buf); 3148 } 3149 3150 static void 3151 arc_hdr_alloc_abd(arc_buf_hdr_t *hdr, int alloc_flags) 3152 { 3153 uint64_t size; 3154 boolean_t alloc_rdata = ((alloc_flags & ARC_HDR_ALLOC_RDATA) != 0); 3155 3156 ASSERT3U(HDR_GET_LSIZE(hdr), >, 0); 3157 ASSERT(HDR_HAS_L1HDR(hdr)); 3158 ASSERT(!HDR_SHARED_DATA(hdr) || alloc_rdata); 3159 IMPLY(alloc_rdata, HDR_PROTECTED(hdr)); 3160 3161 if (alloc_rdata) { 3162 size = HDR_GET_PSIZE(hdr); 3163 ASSERT3P(hdr->b_crypt_hdr.b_rabd, ==, NULL); 3164 hdr->b_crypt_hdr.b_rabd = arc_get_data_abd(hdr, size, hdr, 3165 alloc_flags); 3166 ASSERT3P(hdr->b_crypt_hdr.b_rabd, !=, NULL); 3167 ARCSTAT_INCR(arcstat_raw_size, size); 3168 } else { 3169 size = arc_hdr_size(hdr); 3170 ASSERT3P(hdr->b_l1hdr.b_pabd, ==, NULL); 3171 hdr->b_l1hdr.b_pabd = arc_get_data_abd(hdr, size, hdr, 3172 alloc_flags); 3173 ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL); 3174 } 3175 3176 ARCSTAT_INCR(arcstat_compressed_size, size); 3177 ARCSTAT_INCR(arcstat_uncompressed_size, HDR_GET_LSIZE(hdr)); 3178 } 3179 3180 static void 3181 arc_hdr_free_abd(arc_buf_hdr_t *hdr, boolean_t free_rdata) 3182 { 3183 uint64_t size = (free_rdata) ? HDR_GET_PSIZE(hdr) : arc_hdr_size(hdr); 3184 3185 ASSERT(HDR_HAS_L1HDR(hdr)); 3186 ASSERT(hdr->b_l1hdr.b_pabd != NULL || HDR_HAS_RABD(hdr)); 3187 IMPLY(free_rdata, HDR_HAS_RABD(hdr)); 3188 3189 /* 3190 * If the hdr is currently being written to the l2arc then 3191 * we defer freeing the data by adding it to the l2arc_free_on_write 3192 * list. The l2arc will free the data once it's finished 3193 * writing it to the l2arc device. 3194 */ 3195 if (HDR_L2_WRITING(hdr)) { 3196 arc_hdr_free_on_write(hdr, free_rdata); 3197 ARCSTAT_BUMP(arcstat_l2_free_on_write); 3198 } else if (free_rdata) { 3199 arc_free_data_abd(hdr, hdr->b_crypt_hdr.b_rabd, size, hdr); 3200 } else { 3201 arc_free_data_abd(hdr, hdr->b_l1hdr.b_pabd, size, hdr); 3202 } 3203 3204 if (free_rdata) { 3205 hdr->b_crypt_hdr.b_rabd = NULL; 3206 ARCSTAT_INCR(arcstat_raw_size, -size); 3207 } else { 3208 hdr->b_l1hdr.b_pabd = NULL; 3209 } 3210 3211 if (hdr->b_l1hdr.b_pabd == NULL && !HDR_HAS_RABD(hdr)) 3212 hdr->b_l1hdr.b_byteswap = DMU_BSWAP_NUMFUNCS; 3213 3214 ARCSTAT_INCR(arcstat_compressed_size, -size); 3215 ARCSTAT_INCR(arcstat_uncompressed_size, -HDR_GET_LSIZE(hdr)); 3216 } 3217 3218 /* 3219 * Allocate empty anonymous ARC header. The header will get its identity 3220 * assigned and buffers attached later as part of read or write operations. 
3221 * 3222 * In the case of a read, arc_read() assigns the header its identity (b_dva + b_birth), 3223 * inserts it into the ARC hash to become globally visible, and allocates a physical 3224 * (b_pabd) or raw (b_rabd) ABD buffer to read into from disk. On disk read 3225 * completion, arc_read_done() allocates ARC buffer(s) as needed, potentially 3226 * sharing one of them with the physical ABD buffer. 3227 * 3228 * In the case of a write, arc_alloc_buf() allocates an ARC buffer to be filled with 3229 * data. Then, after compression and/or encryption, arc_write_ready() allocates 3230 * and fills (or potentially shares) the physical (b_pabd) or raw (b_rabd) ABD 3231 * buffer. On disk write completion, arc_write_done() assigns the header its 3232 * new identity (b_dva + b_birth) and inserts it into the ARC hash. 3233 * 3234 * In the case of a partial overwrite, the old data is read first as described. Then 3235 * arc_release() either allocates a new anonymous ARC header and moves the ARC 3236 * buffer to it, or reuses the old ARC header by discarding its identity and 3237 * removing it from the ARC hash. After the buffer is modified, the normal write 3238 * process follows as described. 3239 */ 3240 static arc_buf_hdr_t * 3241 arc_hdr_alloc(uint64_t spa, int32_t psize, int32_t lsize, 3242 boolean_t protected, enum zio_compress compression_type, uint8_t complevel, 3243 arc_buf_contents_t type) 3244 { 3245 arc_buf_hdr_t *hdr; 3246 3247 VERIFY(type == ARC_BUFC_DATA || type == ARC_BUFC_METADATA); 3248 hdr = kmem_cache_alloc(hdr_full_cache, KM_PUSHPAGE); 3249 3250 ASSERT(HDR_EMPTY(hdr)); 3251 #ifdef ZFS_DEBUG 3252 ASSERT3P(hdr->b_l1hdr.b_freeze_cksum, ==, NULL); 3253 #endif 3254 HDR_SET_PSIZE(hdr, psize); 3255 HDR_SET_LSIZE(hdr, lsize); 3256 hdr->b_spa = spa; 3257 hdr->b_type = type; 3258 hdr->b_flags = 0; 3259 arc_hdr_set_flags(hdr, arc_bufc_to_flags(type) | ARC_FLAG_HAS_L1HDR); 3260 arc_hdr_set_compress(hdr, compression_type); 3261 hdr->b_complevel = complevel; 3262 if (protected) 3263 arc_hdr_set_flags(hdr, ARC_FLAG_PROTECTED); 3264 3265 hdr->b_l1hdr.b_state = arc_anon; 3266 hdr->b_l1hdr.b_arc_access = 0; 3267 hdr->b_l1hdr.b_mru_hits = 0; 3268 hdr->b_l1hdr.b_mru_ghost_hits = 0; 3269 hdr->b_l1hdr.b_mfu_hits = 0; 3270 hdr->b_l1hdr.b_mfu_ghost_hits = 0; 3271 hdr->b_l1hdr.b_buf = NULL; 3272 3273 ASSERT(zfs_refcount_is_zero(&hdr->b_l1hdr.b_refcnt)); 3274 3275 return (hdr); 3276 } 3277 3278 /* 3279 * Transition between the two allocation states for the arc_buf_hdr struct. 3280 * The arc_buf_hdr struct can be allocated with (hdr_full_cache) or without 3281 * (hdr_l2only_cache) the fields necessary for the L1 cache - the smaller 3282 * version is used when a cache buffer is only in the L2ARC in order to reduce 3283 * memory usage. 3284 */ 3285 static arc_buf_hdr_t * 3286 arc_hdr_realloc(arc_buf_hdr_t *hdr, kmem_cache_t *old, kmem_cache_t *new) 3287 { 3288 ASSERT(HDR_HAS_L2HDR(hdr)); 3289 3290 arc_buf_hdr_t *nhdr; 3291 l2arc_dev_t *dev = hdr->b_l2hdr.b_dev; 3292 3293 ASSERT((old == hdr_full_cache && new == hdr_l2only_cache) || 3294 (old == hdr_l2only_cache && new == hdr_full_cache)); 3295 3296 nhdr = kmem_cache_alloc(new, KM_PUSHPAGE); 3297 3298 ASSERT(MUTEX_HELD(HDR_LOCK(hdr))); 3299 buf_hash_remove(hdr); 3300 3301 memcpy(nhdr, hdr, HDR_L2ONLY_SIZE); 3302 3303 if (new == hdr_full_cache) { 3304 arc_hdr_set_flags(nhdr, ARC_FLAG_HAS_L1HDR); 3305 /* 3306 * arc_access and arc_change_state need to be aware that a 3307 * header has just come out of L2ARC, so we set its state to 3308 * l2c_only even though it's about to change.
3309 */ 3310 nhdr->b_l1hdr.b_state = arc_l2c_only; 3311 3312 /* Verify previous threads set to NULL before freeing */ 3313 ASSERT3P(nhdr->b_l1hdr.b_pabd, ==, NULL); 3314 ASSERT(!HDR_HAS_RABD(hdr)); 3315 } else { 3316 ASSERT3P(hdr->b_l1hdr.b_buf, ==, NULL); 3317 #ifdef ZFS_DEBUG 3318 ASSERT3P(hdr->b_l1hdr.b_freeze_cksum, ==, NULL); 3319 #endif 3320 3321 /* 3322 * If we've reached here, We must have been called from 3323 * arc_evict_hdr(), as such we should have already been 3324 * removed from any ghost list we were previously on 3325 * (which protects us from racing with arc_evict_state), 3326 * thus no locking is needed during this check. 3327 */ 3328 ASSERT(!multilist_link_active(&hdr->b_l1hdr.b_arc_node)); 3329 3330 /* 3331 * A buffer must not be moved into the arc_l2c_only 3332 * state if it's not finished being written out to the 3333 * l2arc device. Otherwise, the b_l1hdr.b_pabd field 3334 * might try to be accessed, even though it was removed. 3335 */ 3336 VERIFY(!HDR_L2_WRITING(hdr)); 3337 VERIFY3P(hdr->b_l1hdr.b_pabd, ==, NULL); 3338 ASSERT(!HDR_HAS_RABD(hdr)); 3339 3340 arc_hdr_clear_flags(nhdr, ARC_FLAG_HAS_L1HDR); 3341 } 3342 /* 3343 * The header has been reallocated so we need to re-insert it into any 3344 * lists it was on. 3345 */ 3346 (void) buf_hash_insert(nhdr, NULL); 3347 3348 ASSERT(list_link_active(&hdr->b_l2hdr.b_l2node)); 3349 3350 mutex_enter(&dev->l2ad_mtx); 3351 3352 /* 3353 * We must place the realloc'ed header back into the list at 3354 * the same spot. Otherwise, if it's placed earlier in the list, 3355 * l2arc_write_buffers() could find it during the function's 3356 * write phase, and try to write it out to the l2arc. 3357 */ 3358 list_insert_after(&dev->l2ad_buflist, hdr, nhdr); 3359 list_remove(&dev->l2ad_buflist, hdr); 3360 3361 mutex_exit(&dev->l2ad_mtx); 3362 3363 /* 3364 * Since we're using the pointer address as the tag when 3365 * incrementing and decrementing the l2ad_alloc refcount, we 3366 * must remove the old pointer (that we're about to destroy) and 3367 * add the new pointer to the refcount. Otherwise we'd remove 3368 * the wrong pointer address when calling arc_hdr_destroy() later. 3369 */ 3370 3371 (void) zfs_refcount_remove_many(&dev->l2ad_alloc, 3372 arc_hdr_size(hdr), hdr); 3373 (void) zfs_refcount_add_many(&dev->l2ad_alloc, 3374 arc_hdr_size(nhdr), nhdr); 3375 3376 buf_discard_identity(hdr); 3377 kmem_cache_free(old, hdr); 3378 3379 return (nhdr); 3380 } 3381 3382 /* 3383 * This function is used by the send / receive code to convert a newly 3384 * allocated arc_buf_t to one that is suitable for a raw encrypted write. It 3385 * is also used to allow the root objset block to be updated without altering 3386 * its embedded MACs. Both block types will always be uncompressed so we do not 3387 * have to worry about compression type or psize. 3388 */ 3389 void 3390 arc_convert_to_raw(arc_buf_t *buf, uint64_t dsobj, boolean_t byteorder, 3391 dmu_object_type_t ot, const uint8_t *salt, const uint8_t *iv, 3392 const uint8_t *mac) 3393 { 3394 arc_buf_hdr_t *hdr = buf->b_hdr; 3395 3396 ASSERT(ot == DMU_OT_DNODE || ot == DMU_OT_OBJSET); 3397 ASSERT(HDR_HAS_L1HDR(hdr)); 3398 ASSERT3P(hdr->b_l1hdr.b_state, ==, arc_anon); 3399 3400 buf->b_flags |= (ARC_BUF_FLAG_COMPRESSED | ARC_BUF_FLAG_ENCRYPTED); 3401 arc_hdr_set_flags(hdr, ARC_FLAG_PROTECTED); 3402 hdr->b_crypt_hdr.b_dsobj = dsobj; 3403 hdr->b_crypt_hdr.b_ot = ot; 3404 hdr->b_l1hdr.b_byteswap = (byteorder == ZFS_HOST_BYTEORDER) ? 
3405 DMU_BSWAP_NUMFUNCS : DMU_OT_BYTESWAP(ot); 3406 if (!arc_hdr_has_uncompressed_buf(hdr)) 3407 arc_cksum_free(hdr); 3408 3409 if (salt != NULL) 3410 memcpy(hdr->b_crypt_hdr.b_salt, salt, ZIO_DATA_SALT_LEN); 3411 if (iv != NULL) 3412 memcpy(hdr->b_crypt_hdr.b_iv, iv, ZIO_DATA_IV_LEN); 3413 if (mac != NULL) 3414 memcpy(hdr->b_crypt_hdr.b_mac, mac, ZIO_DATA_MAC_LEN); 3415 } 3416 3417 /* 3418 * Allocate a new arc_buf_hdr_t and arc_buf_t and return the buf to the caller. 3419 * The buf is returned thawed since we expect the consumer to modify it. 3420 */ 3421 arc_buf_t * 3422 arc_alloc_buf(spa_t *spa, const void *tag, arc_buf_contents_t type, 3423 int32_t size) 3424 { 3425 arc_buf_hdr_t *hdr = arc_hdr_alloc(spa_load_guid(spa), size, size, 3426 B_FALSE, ZIO_COMPRESS_OFF, 0, type); 3427 3428 arc_buf_t *buf = NULL; 3429 VERIFY0(arc_buf_alloc_impl(hdr, spa, NULL, tag, B_FALSE, B_FALSE, 3430 B_FALSE, B_FALSE, &buf)); 3431 arc_buf_thaw(buf); 3432 3433 return (buf); 3434 } 3435 3436 /* 3437 * Allocate a compressed buf in the same manner as arc_alloc_buf. Don't use this 3438 * for bufs containing metadata. 3439 */ 3440 arc_buf_t * 3441 arc_alloc_compressed_buf(spa_t *spa, const void *tag, uint64_t psize, 3442 uint64_t lsize, enum zio_compress compression_type, uint8_t complevel) 3443 { 3444 ASSERT3U(lsize, >, 0); 3445 ASSERT3U(lsize, >=, psize); 3446 ASSERT3U(compression_type, >, ZIO_COMPRESS_OFF); 3447 ASSERT3U(compression_type, <, ZIO_COMPRESS_FUNCTIONS); 3448 3449 arc_buf_hdr_t *hdr = arc_hdr_alloc(spa_load_guid(spa), psize, lsize, 3450 B_FALSE, compression_type, complevel, ARC_BUFC_DATA); 3451 3452 arc_buf_t *buf = NULL; 3453 VERIFY0(arc_buf_alloc_impl(hdr, spa, NULL, tag, B_FALSE, 3454 B_TRUE, B_FALSE, B_FALSE, &buf)); 3455 arc_buf_thaw(buf); 3456 3457 /* 3458 * To ensure that the hdr has the correct data in it if we call 3459 * arc_untransform() on this buf before it's been written to disk, 3460 * it's easiest if we just set up sharing between the buf and the hdr. 3461 */ 3462 arc_share_buf(hdr, buf); 3463 3464 return (buf); 3465 } 3466 3467 arc_buf_t * 3468 arc_alloc_raw_buf(spa_t *spa, const void *tag, uint64_t dsobj, 3469 boolean_t byteorder, const uint8_t *salt, const uint8_t *iv, 3470 const uint8_t *mac, dmu_object_type_t ot, uint64_t psize, uint64_t lsize, 3471 enum zio_compress compression_type, uint8_t complevel) 3472 { 3473 arc_buf_hdr_t *hdr; 3474 arc_buf_t *buf; 3475 arc_buf_contents_t type = DMU_OT_IS_METADATA(ot) ? 3476 ARC_BUFC_METADATA : ARC_BUFC_DATA; 3477 3478 ASSERT3U(lsize, >, 0); 3479 ASSERT3U(lsize, >=, psize); 3480 ASSERT3U(compression_type, >=, ZIO_COMPRESS_OFF); 3481 ASSERT3U(compression_type, <, ZIO_COMPRESS_FUNCTIONS); 3482 3483 hdr = arc_hdr_alloc(spa_load_guid(spa), psize, lsize, B_TRUE, 3484 compression_type, complevel, type); 3485 3486 hdr->b_crypt_hdr.b_dsobj = dsobj; 3487 hdr->b_crypt_hdr.b_ot = ot; 3488 hdr->b_l1hdr.b_byteswap = (byteorder == ZFS_HOST_BYTEORDER) ? 3489 DMU_BSWAP_NUMFUNCS : DMU_OT_BYTESWAP(ot); 3490 memcpy(hdr->b_crypt_hdr.b_salt, salt, ZIO_DATA_SALT_LEN); 3491 memcpy(hdr->b_crypt_hdr.b_iv, iv, ZIO_DATA_IV_LEN); 3492 memcpy(hdr->b_crypt_hdr.b_mac, mac, ZIO_DATA_MAC_LEN); 3493 3494 /* 3495 * This buffer will be considered encrypted even if the ot is not an 3496 * encrypted type. It will become authenticated instead in 3497 * arc_write_ready(). 
3498 */ 3499 buf = NULL; 3500 VERIFY0(arc_buf_alloc_impl(hdr, spa, NULL, tag, B_TRUE, B_TRUE, 3501 B_FALSE, B_FALSE, &buf)); 3502 arc_buf_thaw(buf); 3503 3504 return (buf); 3505 } 3506 3507 static void 3508 l2arc_hdr_arcstats_update(arc_buf_hdr_t *hdr, boolean_t incr, 3509 boolean_t state_only) 3510 { 3511 l2arc_buf_hdr_t *l2hdr = &hdr->b_l2hdr; 3512 l2arc_dev_t *dev = l2hdr->b_dev; 3513 uint64_t lsize = HDR_GET_LSIZE(hdr); 3514 uint64_t psize = HDR_GET_PSIZE(hdr); 3515 uint64_t asize = vdev_psize_to_asize(dev->l2ad_vdev, psize); 3516 arc_buf_contents_t type = hdr->b_type; 3517 int64_t lsize_s; 3518 int64_t psize_s; 3519 int64_t asize_s; 3520 3521 if (incr) { 3522 lsize_s = lsize; 3523 psize_s = psize; 3524 asize_s = asize; 3525 } else { 3526 lsize_s = -lsize; 3527 psize_s = -psize; 3528 asize_s = -asize; 3529 } 3530 3531 /* If the buffer is a prefetch, count it as such. */ 3532 if (HDR_PREFETCH(hdr)) { 3533 ARCSTAT_INCR(arcstat_l2_prefetch_asize, asize_s); 3534 } else { 3535 /* 3536 * We use the value stored in the L2 header upon initial 3537 * caching in L2ARC. This value will be updated in case 3538 * an MRU/MRU_ghost buffer transitions to MFU but the L2ARC 3539 * metadata (log entry) cannot currently be updated. Having 3540 * the ARC state in the L2 header solves the problem of a 3541 * possibly absent L1 header (apparent in buffers restored 3542 * from persistent L2ARC). 3543 */ 3544 switch (hdr->b_l2hdr.b_arcs_state) { 3545 case ARC_STATE_MRU_GHOST: 3546 case ARC_STATE_MRU: 3547 ARCSTAT_INCR(arcstat_l2_mru_asize, asize_s); 3548 break; 3549 case ARC_STATE_MFU_GHOST: 3550 case ARC_STATE_MFU: 3551 ARCSTAT_INCR(arcstat_l2_mfu_asize, asize_s); 3552 break; 3553 default: 3554 break; 3555 } 3556 } 3557 3558 if (state_only) 3559 return; 3560 3561 ARCSTAT_INCR(arcstat_l2_psize, psize_s); 3562 ARCSTAT_INCR(arcstat_l2_lsize, lsize_s); 3563 3564 switch (type) { 3565 case ARC_BUFC_DATA: 3566 ARCSTAT_INCR(arcstat_l2_bufc_data_asize, asize_s); 3567 break; 3568 case ARC_BUFC_METADATA: 3569 ARCSTAT_INCR(arcstat_l2_bufc_metadata_asize, asize_s); 3570 break; 3571 default: 3572 break; 3573 } 3574 } 3575 3576 3577 static void 3578 arc_hdr_l2hdr_destroy(arc_buf_hdr_t *hdr) 3579 { 3580 l2arc_buf_hdr_t *l2hdr = &hdr->b_l2hdr; 3581 l2arc_dev_t *dev = l2hdr->b_dev; 3582 uint64_t psize = HDR_GET_PSIZE(hdr); 3583 uint64_t asize = vdev_psize_to_asize(dev->l2ad_vdev, psize); 3584 3585 ASSERT(MUTEX_HELD(&dev->l2ad_mtx)); 3586 ASSERT(HDR_HAS_L2HDR(hdr)); 3587 3588 list_remove(&dev->l2ad_buflist, hdr); 3589 3590 l2arc_hdr_arcstats_decrement(hdr); 3591 vdev_space_update(dev->l2ad_vdev, -asize, 0, 0); 3592 3593 (void) zfs_refcount_remove_many(&dev->l2ad_alloc, arc_hdr_size(hdr), 3594 hdr); 3595 arc_hdr_clear_flags(hdr, ARC_FLAG_HAS_L2HDR); 3596 } 3597 3598 static void 3599 arc_hdr_destroy(arc_buf_hdr_t *hdr) 3600 { 3601 if (HDR_HAS_L1HDR(hdr)) { 3602 ASSERT(zfs_refcount_is_zero(&hdr->b_l1hdr.b_refcnt)); 3603 ASSERT3P(hdr->b_l1hdr.b_state, ==, arc_anon); 3604 } 3605 ASSERT(!HDR_IO_IN_PROGRESS(hdr)); 3606 ASSERT(!HDR_IN_HASH_TABLE(hdr)); 3607 3608 if (HDR_HAS_L2HDR(hdr)) { 3609 l2arc_dev_t *dev = hdr->b_l2hdr.b_dev; 3610 boolean_t buflist_held = MUTEX_HELD(&dev->l2ad_mtx); 3611 3612 if (!buflist_held) 3613 mutex_enter(&dev->l2ad_mtx); 3614 3615 /* 3616 * Even though we checked this conditional above, we 3617 * need to check this again now that we have the 3618 * l2ad_mtx. 
This is because we could be racing with 3619 * another thread calling l2arc_evict() which might have 3620 * destroyed this header's L2 portion as we were waiting 3621 * to acquire the l2ad_mtx. If that happens, we don't 3622 * want to re-destroy the header's L2 portion. 3623 */ 3624 if (HDR_HAS_L2HDR(hdr)) { 3625 3626 if (!HDR_EMPTY(hdr)) 3627 buf_discard_identity(hdr); 3628 3629 arc_hdr_l2hdr_destroy(hdr); 3630 } 3631 3632 if (!buflist_held) 3633 mutex_exit(&dev->l2ad_mtx); 3634 } 3635 3636 /* 3637 * The header's identify can only be safely discarded once it is no 3638 * longer discoverable. This requires removing it from the hash table 3639 * and the l2arc header list. After this point the hash lock can not 3640 * be used to protect the header. 3641 */ 3642 if (!HDR_EMPTY(hdr)) 3643 buf_discard_identity(hdr); 3644 3645 if (HDR_HAS_L1HDR(hdr)) { 3646 arc_cksum_free(hdr); 3647 3648 while (hdr->b_l1hdr.b_buf != NULL) 3649 arc_buf_destroy_impl(hdr->b_l1hdr.b_buf); 3650 3651 if (hdr->b_l1hdr.b_pabd != NULL) 3652 arc_hdr_free_abd(hdr, B_FALSE); 3653 3654 if (HDR_HAS_RABD(hdr)) 3655 arc_hdr_free_abd(hdr, B_TRUE); 3656 } 3657 3658 ASSERT3P(hdr->b_hash_next, ==, NULL); 3659 if (HDR_HAS_L1HDR(hdr)) { 3660 ASSERT(!multilist_link_active(&hdr->b_l1hdr.b_arc_node)); 3661 ASSERT3P(hdr->b_l1hdr.b_acb, ==, NULL); 3662 #ifdef ZFS_DEBUG 3663 ASSERT3P(hdr->b_l1hdr.b_freeze_cksum, ==, NULL); 3664 #endif 3665 kmem_cache_free(hdr_full_cache, hdr); 3666 } else { 3667 kmem_cache_free(hdr_l2only_cache, hdr); 3668 } 3669 } 3670 3671 void 3672 arc_buf_destroy(arc_buf_t *buf, const void *tag) 3673 { 3674 arc_buf_hdr_t *hdr = buf->b_hdr; 3675 3676 if (hdr->b_l1hdr.b_state == arc_anon) { 3677 ASSERT3P(hdr->b_l1hdr.b_buf, ==, buf); 3678 ASSERT(ARC_BUF_LAST(buf)); 3679 ASSERT(!HDR_IO_IN_PROGRESS(hdr)); 3680 VERIFY0(remove_reference(hdr, tag)); 3681 return; 3682 } 3683 3684 kmutex_t *hash_lock = HDR_LOCK(hdr); 3685 mutex_enter(hash_lock); 3686 3687 ASSERT3P(hdr, ==, buf->b_hdr); 3688 ASSERT3P(hdr->b_l1hdr.b_buf, !=, NULL); 3689 ASSERT3P(hash_lock, ==, HDR_LOCK(hdr)); 3690 ASSERT3P(hdr->b_l1hdr.b_state, !=, arc_anon); 3691 ASSERT3P(buf->b_data, !=, NULL); 3692 3693 arc_buf_destroy_impl(buf); 3694 (void) remove_reference(hdr, tag); 3695 mutex_exit(hash_lock); 3696 } 3697 3698 /* 3699 * Evict the arc_buf_hdr that is provided as a parameter. The resultant 3700 * state of the header is dependent on its state prior to entering this 3701 * function. The following transitions are possible: 3702 * 3703 * - arc_mru -> arc_mru_ghost 3704 * - arc_mfu -> arc_mfu_ghost 3705 * - arc_mru_ghost -> arc_l2c_only 3706 * - arc_mru_ghost -> deleted 3707 * - arc_mfu_ghost -> arc_l2c_only 3708 * - arc_mfu_ghost -> deleted 3709 * - arc_uncached -> deleted 3710 * 3711 * Return total size of evicted data buffers for eviction progress tracking. 3712 * When evicting from ghost states return logical buffer size to make eviction 3713 * progress at the same (or at least comparable) rate as from non-ghost states. 3714 * 3715 * Return *real_evicted for actual ARC size reduction to wake up threads 3716 * waiting for it. For non-ghost states it includes size of evicted data 3717 * buffers (the headers are not freed there). For ghost states it includes 3718 * only the evicted headers size. 3719 */ 3720 static int64_t 3721 arc_evict_hdr(arc_buf_hdr_t *hdr, uint64_t *real_evicted) 3722 { 3723 arc_state_t *evicted_state, *state; 3724 int64_t bytes_evicted = 0; 3725 uint_t min_lifetime = HDR_PRESCIENT_PREFETCH(hdr) ? 
3726 arc_min_prescient_prefetch_ms : arc_min_prefetch_ms; 3727 3728 ASSERT(MUTEX_HELD(HDR_LOCK(hdr))); 3729 ASSERT(HDR_HAS_L1HDR(hdr)); 3730 ASSERT(!HDR_IO_IN_PROGRESS(hdr)); 3731 ASSERT3P(hdr->b_l1hdr.b_buf, ==, NULL); 3732 ASSERT0(zfs_refcount_count(&hdr->b_l1hdr.b_refcnt)); 3733 3734 *real_evicted = 0; 3735 state = hdr->b_l1hdr.b_state; 3736 if (GHOST_STATE(state)) { 3737 3738 /* 3739 * l2arc_write_buffers() relies on a header's L1 portion 3740 * (i.e. its b_pabd field) during it's write phase. 3741 * Thus, we cannot push a header onto the arc_l2c_only 3742 * state (removing its L1 piece) until the header is 3743 * done being written to the l2arc. 3744 */ 3745 if (HDR_HAS_L2HDR(hdr) && HDR_L2_WRITING(hdr)) { 3746 ARCSTAT_BUMP(arcstat_evict_l2_skip); 3747 return (bytes_evicted); 3748 } 3749 3750 ARCSTAT_BUMP(arcstat_deleted); 3751 bytes_evicted += HDR_GET_LSIZE(hdr); 3752 3753 DTRACE_PROBE1(arc__delete, arc_buf_hdr_t *, hdr); 3754 3755 if (HDR_HAS_L2HDR(hdr)) { 3756 ASSERT(hdr->b_l1hdr.b_pabd == NULL); 3757 ASSERT(!HDR_HAS_RABD(hdr)); 3758 /* 3759 * This buffer is cached on the 2nd Level ARC; 3760 * don't destroy the header. 3761 */ 3762 arc_change_state(arc_l2c_only, hdr); 3763 /* 3764 * dropping from L1+L2 cached to L2-only, 3765 * realloc to remove the L1 header. 3766 */ 3767 (void) arc_hdr_realloc(hdr, hdr_full_cache, 3768 hdr_l2only_cache); 3769 *real_evicted += HDR_FULL_SIZE - HDR_L2ONLY_SIZE; 3770 } else { 3771 arc_change_state(arc_anon, hdr); 3772 arc_hdr_destroy(hdr); 3773 *real_evicted += HDR_FULL_SIZE; 3774 } 3775 return (bytes_evicted); 3776 } 3777 3778 ASSERT(state == arc_mru || state == arc_mfu || state == arc_uncached); 3779 evicted_state = (state == arc_uncached) ? arc_anon : 3780 ((state == arc_mru) ? arc_mru_ghost : arc_mfu_ghost); 3781 3782 /* prefetch buffers have a minimum lifespan */ 3783 if ((hdr->b_flags & (ARC_FLAG_PREFETCH | ARC_FLAG_INDIRECT)) && 3784 ddi_get_lbolt() - hdr->b_l1hdr.b_arc_access < 3785 MSEC_TO_TICK(min_lifetime)) { 3786 ARCSTAT_BUMP(arcstat_evict_skip); 3787 return (bytes_evicted); 3788 } 3789 3790 if (HDR_HAS_L2HDR(hdr)) { 3791 ARCSTAT_INCR(arcstat_evict_l2_cached, HDR_GET_LSIZE(hdr)); 3792 } else { 3793 if (l2arc_write_eligible(hdr->b_spa, hdr)) { 3794 ARCSTAT_INCR(arcstat_evict_l2_eligible, 3795 HDR_GET_LSIZE(hdr)); 3796 3797 switch (state->arcs_state) { 3798 case ARC_STATE_MRU: 3799 ARCSTAT_INCR( 3800 arcstat_evict_l2_eligible_mru, 3801 HDR_GET_LSIZE(hdr)); 3802 break; 3803 case ARC_STATE_MFU: 3804 ARCSTAT_INCR( 3805 arcstat_evict_l2_eligible_mfu, 3806 HDR_GET_LSIZE(hdr)); 3807 break; 3808 default: 3809 break; 3810 } 3811 } else { 3812 ARCSTAT_INCR(arcstat_evict_l2_ineligible, 3813 HDR_GET_LSIZE(hdr)); 3814 } 3815 } 3816 3817 bytes_evicted += arc_hdr_size(hdr); 3818 *real_evicted += arc_hdr_size(hdr); 3819 3820 /* 3821 * If this hdr is being evicted and has a compressed buffer then we 3822 * discard it here before we change states. This ensures that the 3823 * accounting is updated correctly in arc_free_data_impl(). 
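 *
 * In other words, the ordering below is deliberate: free the b_pabd
 * (and any raw ABD) first, so the space is debited from the state the
 * header is still in, then call arc_change_state(), and only call
 * arc_hdr_destroy() when the target state is arc_anon; roughly:
 *
 *	arc_hdr_free_abd(hdr, B_FALSE);
 *	arc_change_state(evicted_state, hdr);
 *	if (evicted_state == arc_anon)
 *		arc_hdr_destroy(hdr);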
3824 */ 3825 if (hdr->b_l1hdr.b_pabd != NULL) 3826 arc_hdr_free_abd(hdr, B_FALSE); 3827 3828 if (HDR_HAS_RABD(hdr)) 3829 arc_hdr_free_abd(hdr, B_TRUE); 3830 3831 arc_change_state(evicted_state, hdr); 3832 DTRACE_PROBE1(arc__evict, arc_buf_hdr_t *, hdr); 3833 if (evicted_state == arc_anon) { 3834 arc_hdr_destroy(hdr); 3835 *real_evicted += HDR_FULL_SIZE; 3836 } else { 3837 ASSERT(HDR_IN_HASH_TABLE(hdr)); 3838 } 3839 3840 return (bytes_evicted); 3841 } 3842 3843 static void 3844 arc_set_need_free(void) 3845 { 3846 ASSERT(MUTEX_HELD(&arc_evict_lock)); 3847 int64_t remaining = arc_free_memory() - arc_sys_free / 2; 3848 arc_evict_waiter_t *aw = list_tail(&arc_evict_waiters); 3849 if (aw == NULL) { 3850 arc_need_free = MAX(-remaining, 0); 3851 } else { 3852 arc_need_free = 3853 MAX(-remaining, (int64_t)(aw->aew_count - arc_evict_count)); 3854 } 3855 } 3856 3857 static uint64_t 3858 arc_evict_state_impl(multilist_t *ml, int idx, arc_buf_hdr_t *marker, 3859 uint64_t spa, uint64_t bytes) 3860 { 3861 multilist_sublist_t *mls; 3862 uint64_t bytes_evicted = 0, real_evicted = 0; 3863 arc_buf_hdr_t *hdr; 3864 kmutex_t *hash_lock; 3865 uint_t evict_count = zfs_arc_evict_batch_limit; 3866 3867 ASSERT3P(marker, !=, NULL); 3868 3869 mls = multilist_sublist_lock_idx(ml, idx); 3870 3871 for (hdr = multilist_sublist_prev(mls, marker); likely(hdr != NULL); 3872 hdr = multilist_sublist_prev(mls, marker)) { 3873 if ((evict_count == 0) || (bytes_evicted >= bytes)) 3874 break; 3875 3876 /* 3877 * To keep our iteration location, move the marker 3878 * forward. Since we're not holding hdr's hash lock, we 3879 * must be very careful and not remove 'hdr' from the 3880 * sublist. Otherwise, other consumers might mistake the 3881 * 'hdr' as not being on a sublist when they call the 3882 * multilist_link_active() function (they all rely on 3883 * the hash lock protecting concurrent insertions and 3884 * removals). multilist_sublist_move_forward() was 3885 * specifically implemented to ensure this is the case 3886 * (only 'marker' will be removed and re-inserted). 3887 */ 3888 multilist_sublist_move_forward(mls, marker); 3889 3890 /* 3891 * The only case where the b_spa field should ever be 3892 * zero, is the marker headers inserted by 3893 * arc_evict_state(). It's possible for multiple threads 3894 * to be calling arc_evict_state() concurrently (e.g. 3895 * dsl_pool_close() and zio_inject_fault()), so we must 3896 * skip any markers we see from these other threads. 3897 */ 3898 if (hdr->b_spa == 0) 3899 continue; 3900 3901 /* we're only interested in evicting buffers of a certain spa */ 3902 if (spa != 0 && hdr->b_spa != spa) { 3903 ARCSTAT_BUMP(arcstat_evict_skip); 3904 continue; 3905 } 3906 3907 hash_lock = HDR_LOCK(hdr); 3908 3909 /* 3910 * We aren't calling this function from any code path 3911 * that would already be holding a hash lock, so we're 3912 * asserting on this assumption to be defensive in case 3913 * this ever changes. Without this check, it would be 3914 * possible to incorrectly increment arcstat_mutex_miss 3915 * below (e.g. if the code changed such that we called 3916 * this function with a hash lock held). 
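 *
 * The mutex_tryenter() below is also what satisfies the
 * MUTEX_HELD(HDR_LOCK(hdr)) assertion at the top of arc_evict_hdr();
 * a tryenter failure simply bumps arcstat_mutex_miss and moves on to
 * the next header.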
3917 */ 3918 ASSERT(!MUTEX_HELD(hash_lock)); 3919 3920 if (mutex_tryenter(hash_lock)) { 3921 uint64_t revicted; 3922 uint64_t evicted = arc_evict_hdr(hdr, &revicted); 3923 mutex_exit(hash_lock); 3924 3925 bytes_evicted += evicted; 3926 real_evicted += revicted; 3927 3928 /* 3929 * If evicted is zero, arc_evict_hdr() must have 3930 * decided to skip this header, don't increment 3931 * evict_count in this case. 3932 */ 3933 if (evicted != 0) 3934 evict_count--; 3935 3936 } else { 3937 ARCSTAT_BUMP(arcstat_mutex_miss); 3938 } 3939 } 3940 3941 multilist_sublist_unlock(mls); 3942 3943 /* 3944 * Increment the count of evicted bytes, and wake up any threads that 3945 * are waiting for the count to reach this value. Since the list is 3946 * ordered by ascending aew_count, we pop off the beginning of the 3947 * list until we reach the end, or a waiter that's past the current 3948 * "count". Doing this outside the loop reduces the number of times 3949 * we need to acquire the global arc_evict_lock. 3950 * 3951 * Only wake when there's sufficient free memory in the system 3952 * (specifically, arc_sys_free/2, which by default is a bit more than 3953 * 1/64th of RAM). See the comments in arc_wait_for_eviction(). 3954 */ 3955 mutex_enter(&arc_evict_lock); 3956 arc_evict_count += real_evicted; 3957 3958 if (arc_free_memory() > arc_sys_free / 2) { 3959 arc_evict_waiter_t *aw; 3960 while ((aw = list_head(&arc_evict_waiters)) != NULL && 3961 aw->aew_count <= arc_evict_count) { 3962 list_remove(&arc_evict_waiters, aw); 3963 cv_broadcast(&aw->aew_cv); 3964 } 3965 } 3966 arc_set_need_free(); 3967 mutex_exit(&arc_evict_lock); 3968 3969 /* 3970 * If the ARC size is reduced from arc_c_max to arc_c_min (especially 3971 * if the average cached block is small), eviction can be on-CPU for 3972 * many seconds. To ensure that other threads that may be bound to 3973 * this CPU are able to make progress, make a voluntary preemption 3974 * call here. 3975 */ 3976 kpreempt(KPREEMPT_SYNC); 3977 3978 return (bytes_evicted); 3979 } 3980 3981 static arc_buf_hdr_t * 3982 arc_state_alloc_marker(void) 3983 { 3984 arc_buf_hdr_t *marker = kmem_cache_alloc(hdr_full_cache, KM_SLEEP); 3985 3986 /* 3987 * A b_spa of 0 is used to indicate that this header is 3988 * a marker. This fact is used in arc_evict_state_impl(). 3989 */ 3990 marker->b_spa = 0; 3991 3992 return (marker); 3993 } 3994 3995 static void 3996 arc_state_free_marker(arc_buf_hdr_t *marker) 3997 { 3998 kmem_cache_free(hdr_full_cache, marker); 3999 } 4000 4001 /* 4002 * Allocate an array of buffer headers used as placeholders during arc state 4003 * eviction. 4004 */ 4005 static arc_buf_hdr_t ** 4006 arc_state_alloc_markers(int count) 4007 { 4008 arc_buf_hdr_t **markers; 4009 4010 markers = kmem_zalloc(sizeof (*markers) * count, KM_SLEEP); 4011 for (int i = 0; i < count; i++) 4012 markers[i] = arc_state_alloc_marker(); 4013 return (markers); 4014 } 4015 4016 static void 4017 arc_state_free_markers(arc_buf_hdr_t **markers, int count) 4018 { 4019 for (int i = 0; i < count; i++) 4020 arc_state_free_marker(markers[i]); 4021 kmem_free(markers, sizeof (*markers) * count); 4022 } 4023 4024 /* 4025 * Evict buffers from the given arc state, until we've removed the 4026 * specified number of bytes. Move the removed buffers to the 4027 * appropriate evict state. 4028 * 4029 * This function makes a "best effort". It skips over any buffers 4030 * it can't get a hash_lock on, and so, may not catch all candidates. 4031 * It may also return without evicting as much space as requested. 
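 *
 * For example, arc_evict_impl() below trims one state/type pair by a
 * bounded byte count, along the lines of (arc_mru and ARC_BUFC_DATA
 * shown here as one possible pair):
 *
 *	(void) arc_evict_state(arc_mru, ARC_BUFC_DATA, 0, delta);
 *
 * while arc_flush_state() passes a spa guid and ARC_EVICT_ALL,
 * described next.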
4032 * 4033 * If bytes is specified using the special value ARC_EVICT_ALL, this 4034 * will evict all available (i.e. unlocked and evictable) buffers from 4035 * the given arc state; which is used by arc_flush(). 4036 */ 4037 static uint64_t 4038 arc_evict_state(arc_state_t *state, arc_buf_contents_t type, uint64_t spa, 4039 uint64_t bytes) 4040 { 4041 uint64_t total_evicted = 0; 4042 multilist_t *ml = &state->arcs_list[type]; 4043 int num_sublists; 4044 arc_buf_hdr_t **markers; 4045 4046 num_sublists = multilist_get_num_sublists(ml); 4047 4048 /* 4049 * If we've tried to evict from each sublist, made some 4050 * progress, but still have not hit the target number of bytes 4051 * to evict, we want to keep trying. The markers allow us to 4052 * pick up where we left off for each individual sublist, rather 4053 * than starting from the tail each time. 4054 */ 4055 if (zthr_iscurthread(arc_evict_zthr)) { 4056 markers = arc_state_evict_markers; 4057 ASSERT3S(num_sublists, <=, arc_state_evict_marker_count); 4058 } else { 4059 markers = arc_state_alloc_markers(num_sublists); 4060 } 4061 for (int i = 0; i < num_sublists; i++) { 4062 multilist_sublist_t *mls; 4063 4064 mls = multilist_sublist_lock_idx(ml, i); 4065 multilist_sublist_insert_tail(mls, markers[i]); 4066 multilist_sublist_unlock(mls); 4067 } 4068 4069 /* 4070 * While we haven't hit our target number of bytes to evict, or 4071 * we're evicting all available buffers. 4072 */ 4073 while (total_evicted < bytes) { 4074 int sublist_idx = multilist_get_random_index(ml); 4075 uint64_t scan_evicted = 0; 4076 4077 /* 4078 * Start eviction using a randomly selected sublist, 4079 * this is to try and evenly balance eviction across all 4080 * sublists. Always starting at the same sublist 4081 * (e.g. index 0) would cause evictions to favor certain 4082 * sublists over others. 4083 */ 4084 for (int i = 0; i < num_sublists; i++) { 4085 uint64_t bytes_remaining; 4086 uint64_t bytes_evicted; 4087 4088 if (total_evicted < bytes) 4089 bytes_remaining = bytes - total_evicted; 4090 else 4091 break; 4092 4093 bytes_evicted = arc_evict_state_impl(ml, sublist_idx, 4094 markers[sublist_idx], spa, bytes_remaining); 4095 4096 scan_evicted += bytes_evicted; 4097 total_evicted += bytes_evicted; 4098 4099 /* we've reached the end, wrap to the beginning */ 4100 if (++sublist_idx >= num_sublists) 4101 sublist_idx = 0; 4102 } 4103 4104 /* 4105 * If we didn't evict anything during this scan, we have 4106 * no reason to believe we'll evict more during another 4107 * scan, so break the loop. 4108 */ 4109 if (scan_evicted == 0) { 4110 /* This isn't possible, let's make that obvious */ 4111 ASSERT3S(bytes, !=, 0); 4112 4113 /* 4114 * When bytes is ARC_EVICT_ALL, the only way to 4115 * break the loop is when scan_evicted is zero. 4116 * In that case, we actually have evicted enough, 4117 * so we don't want to increment the kstat. 4118 */ 4119 if (bytes != ARC_EVICT_ALL) { 4120 ASSERT3S(total_evicted, <, bytes); 4121 ARCSTAT_BUMP(arcstat_evict_not_enough); 4122 } 4123 4124 break; 4125 } 4126 } 4127 4128 for (int i = 0; i < num_sublists; i++) { 4129 multilist_sublist_t *mls = multilist_sublist_lock_idx(ml, i); 4130 multilist_sublist_remove(mls, markers[i]); 4131 multilist_sublist_unlock(mls); 4132 } 4133 if (markers != arc_state_evict_markers) 4134 arc_state_free_markers(markers, num_sublists); 4135 4136 return (total_evicted); 4137 } 4138 4139 /* 4140 * Flush all "evictable" data of the given type from the arc state 4141 * specified. 
This will not evict any "active" buffers (i.e. referenced). 4142 * 4143 * When 'retry' is set to B_FALSE, the function will make a single pass 4144 * over the state and evict any buffers that it can. Since it doesn't 4145 * continually retry the eviction, it might end up leaving some buffers 4146 * in the ARC due to lock misses. 4147 * 4148 * When 'retry' is set to B_TRUE, the function will continually retry the 4149 * eviction until *all* evictable buffers have been removed from the 4150 * state. As a result, if concurrent insertions into the state are 4151 * allowed (e.g. if the ARC isn't shutting down), this function might 4152 * wind up in an infinite loop, continually trying to evict buffers. 4153 */ 4154 static uint64_t 4155 arc_flush_state(arc_state_t *state, uint64_t spa, arc_buf_contents_t type, 4156 boolean_t retry) 4157 { 4158 uint64_t evicted = 0; 4159 4160 while (zfs_refcount_count(&state->arcs_esize[type]) != 0) { 4161 evicted += arc_evict_state(state, type, spa, ARC_EVICT_ALL); 4162 4163 if (!retry) 4164 break; 4165 } 4166 4167 return (evicted); 4168 } 4169 4170 /* 4171 * Evict the specified number of bytes from the state specified. This 4172 * function prevents us from trying to evict more from a state's list 4173 * than is "evictable", and to skip evicting altogether when passed a 4174 * negative value for "bytes". In contrast, arc_evict_state() will 4175 * evict everything it can, when passed a negative value for "bytes". 4176 */ 4177 static uint64_t 4178 arc_evict_impl(arc_state_t *state, arc_buf_contents_t type, int64_t bytes) 4179 { 4180 uint64_t delta; 4181 4182 if (bytes > 0 && zfs_refcount_count(&state->arcs_esize[type]) > 0) { 4183 delta = MIN(zfs_refcount_count(&state->arcs_esize[type]), 4184 bytes); 4185 return (arc_evict_state(state, type, 0, delta)); 4186 } 4187 4188 return (0); 4189 } 4190 4191 /* 4192 * Adjust specified fraction, taking into account initial ghost state(s) size, 4193 * ghost hit bytes towards increasing the fraction, ghost hit bytes towards 4194 * decreasing it, plus a balance factor, controlling the decrease rate, used 4195 * to balance metadata vs data. 4196 */ 4197 static uint64_t 4198 arc_evict_adj(uint64_t frac, uint64_t total, uint64_t up, uint64_t down, 4199 uint_t balance) 4200 { 4201 if (total < 8 || up + down == 0) 4202 return (frac); 4203 4204 /* 4205 * We should not have more ghost hits than ghost size, but they 4206 * may get close. Restrict maximum adjustment in that case. 4207 */ 4208 if (up + down >= total / 4) { 4209 uint64_t scale = (up + down) / (total / 8); 4210 up /= scale; 4211 down /= scale; 4212 } 4213 4214 /* Get maximal dynamic range by choosing optimal shifts. */ 4215 int s = highbit64(total); 4216 s = MIN(64 - s, 32); 4217 4218 uint64_t ofrac = (1ULL << 32) - frac; 4219 4220 if (frac >= 4 * ofrac) 4221 up /= frac / (2 * ofrac + 1); 4222 up = (up << s) / (total >> (32 - s)); 4223 if (ofrac >= 4 * frac) 4224 down /= ofrac / (2 * frac + 1); 4225 down = (down << s) / (total >> (32 - s)); 4226 down = down * 100 / balance; 4227 4228 return (frac + up - down); 4229 } 4230 4231 /* 4232 * Calculate (x * multiplier / divisor) without unnecesary overflows. 4233 */ 4234 static uint64_t 4235 arc_mf(uint64_t x, uint64_t multiplier, uint64_t divisor) 4236 { 4237 uint64_t q = (x / divisor); 4238 uint64_t r = (x % divisor); 4239 4240 return ((q * multiplier) + ((r * multiplier) / divisor)); 4241 } 4242 4243 /* 4244 * Evict buffers from the cache, such that arcstat_size is capped by arc_c. 
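 *
 * The per-state targets computed below use 32.32 fixed point, with
 * arc_meta, arc_pd and arc_pm being fractions of 2^32; for instance the
 * wanted amount of MRU metadata is, in effect,
 *
 *	w = wt * (arc_meta / 2^32) * (arc_pm / 2^32)
 *
 * which the code expresses as
 * "wt * (int64_t)(arc_meta * arc_pm >> 48) >> 16".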
4245 */ 4246 static uint64_t 4247 arc_evict(void) 4248 { 4249 uint64_t asize, bytes, total_evicted = 0; 4250 int64_t e, mrud, mrum, mfud, mfum, w; 4251 static uint64_t ogrd, ogrm, ogfd, ogfm; 4252 static uint64_t gsrd, gsrm, gsfd, gsfm; 4253 uint64_t ngrd, ngrm, ngfd, ngfm; 4254 4255 /* Get current size of ARC states we can evict from. */ 4256 mrud = zfs_refcount_count(&arc_mru->arcs_size[ARC_BUFC_DATA]) + 4257 zfs_refcount_count(&arc_anon->arcs_size[ARC_BUFC_DATA]); 4258 mrum = zfs_refcount_count(&arc_mru->arcs_size[ARC_BUFC_METADATA]) + 4259 zfs_refcount_count(&arc_anon->arcs_size[ARC_BUFC_METADATA]); 4260 mfud = zfs_refcount_count(&arc_mfu->arcs_size[ARC_BUFC_DATA]); 4261 mfum = zfs_refcount_count(&arc_mfu->arcs_size[ARC_BUFC_METADATA]); 4262 uint64_t d = mrud + mfud; 4263 uint64_t m = mrum + mfum; 4264 uint64_t t = d + m; 4265 4266 /* Get ARC ghost hits since last eviction. */ 4267 ngrd = wmsum_value(&arc_mru_ghost->arcs_hits[ARC_BUFC_DATA]); 4268 uint64_t grd = ngrd - ogrd; 4269 ogrd = ngrd; 4270 ngrm = wmsum_value(&arc_mru_ghost->arcs_hits[ARC_BUFC_METADATA]); 4271 uint64_t grm = ngrm - ogrm; 4272 ogrm = ngrm; 4273 ngfd = wmsum_value(&arc_mfu_ghost->arcs_hits[ARC_BUFC_DATA]); 4274 uint64_t gfd = ngfd - ogfd; 4275 ogfd = ngfd; 4276 ngfm = wmsum_value(&arc_mfu_ghost->arcs_hits[ARC_BUFC_METADATA]); 4277 uint64_t gfm = ngfm - ogfm; 4278 ogfm = ngfm; 4279 4280 /* Adjust ARC states balance based on ghost hits. */ 4281 arc_meta = arc_evict_adj(arc_meta, gsrd + gsrm + gsfd + gsfm, 4282 grm + gfm, grd + gfd, zfs_arc_meta_balance); 4283 arc_pd = arc_evict_adj(arc_pd, gsrd + gsfd, grd, gfd, 100); 4284 arc_pm = arc_evict_adj(arc_pm, gsrm + gsfm, grm, gfm, 100); 4285 4286 asize = aggsum_value(&arc_sums.arcstat_size); 4287 int64_t wt = t - (asize - arc_c); 4288 4289 /* 4290 * Try to reduce pinned dnodes if more than 3/4 of wanted metadata 4291 * target is not evictable or if they go over arc_dnode_limit. 4292 */ 4293 int64_t prune = 0; 4294 int64_t dn = wmsum_value(&arc_sums.arcstat_dnode_size); 4295 int64_t nem = zfs_refcount_count(&arc_mru->arcs_size[ARC_BUFC_METADATA]) 4296 + zfs_refcount_count(&arc_mfu->arcs_size[ARC_BUFC_METADATA]) 4297 - zfs_refcount_count(&arc_mru->arcs_esize[ARC_BUFC_METADATA]) 4298 - zfs_refcount_count(&arc_mfu->arcs_esize[ARC_BUFC_METADATA]); 4299 w = wt * (int64_t)(arc_meta >> 16) >> 16; 4300 if (nem > w * 3 / 4) { 4301 prune = dn / sizeof (dnode_t) * 4302 zfs_arc_dnode_reduce_percent / 100; 4303 if (nem < w && w > 4) 4304 prune = arc_mf(prune, nem - w * 3 / 4, w / 4); 4305 } 4306 if (dn > arc_dnode_limit) { 4307 prune = MAX(prune, (dn - arc_dnode_limit) / sizeof (dnode_t) * 4308 zfs_arc_dnode_reduce_percent / 100); 4309 } 4310 if (prune > 0) 4311 arc_prune_async(prune); 4312 4313 /* Evict MRU metadata. */ 4314 w = wt * (int64_t)(arc_meta * arc_pm >> 48) >> 16; 4315 e = MIN((int64_t)(asize - arc_c), (int64_t)(mrum - w)); 4316 bytes = arc_evict_impl(arc_mru, ARC_BUFC_METADATA, e); 4317 total_evicted += bytes; 4318 mrum -= bytes; 4319 asize -= bytes; 4320 4321 /* Evict MFU metadata. */ 4322 w = wt * (int64_t)(arc_meta >> 16) >> 16; 4323 e = MIN((int64_t)(asize - arc_c), (int64_t)(m - w)); 4324 bytes = arc_evict_impl(arc_mfu, ARC_BUFC_METADATA, e); 4325 total_evicted += bytes; 4326 mfum -= bytes; 4327 asize -= bytes; 4328 4329 /* Evict MRU data. 
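 * The wanted size is first reduced by the metadata still cached after
 * the two passes above, so that "wt" now roughly approximates the space
 * wanted for data; the arc_pd fraction then gives the MRU share of it,
 * with MFU data eviction below absorbing whatever overflow remains.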
*/ 4330 wt -= m - total_evicted; 4331 w = wt * (int64_t)(arc_pd >> 16) >> 16; 4332 e = MIN((int64_t)(asize - arc_c), (int64_t)(mrud - w)); 4333 bytes = arc_evict_impl(arc_mru, ARC_BUFC_DATA, e); 4334 total_evicted += bytes; 4335 mrud -= bytes; 4336 asize -= bytes; 4337 4338 /* Evict MFU data. */ 4339 e = asize - arc_c; 4340 bytes = arc_evict_impl(arc_mfu, ARC_BUFC_DATA, e); 4341 mfud -= bytes; 4342 total_evicted += bytes; 4343 4344 /* 4345 * Evict ghost lists 4346 * 4347 * Size of each state's ghost list represents how much that state 4348 * may grow by shrinking the other states. Would it need to shrink 4349 * other states to zero (that is unlikely), its ghost size would be 4350 * equal to sum of other three state sizes. But excessive ghost 4351 * size may result in false ghost hits (too far back), that may 4352 * never result in real cache hits if several states are competing. 4353 * So choose some arbitraty point of 1/2 of other state sizes. 4354 */ 4355 gsrd = (mrum + mfud + mfum) / 2; 4356 e = zfs_refcount_count(&arc_mru_ghost->arcs_size[ARC_BUFC_DATA]) - 4357 gsrd; 4358 (void) arc_evict_impl(arc_mru_ghost, ARC_BUFC_DATA, e); 4359 4360 gsrm = (mrud + mfud + mfum) / 2; 4361 e = zfs_refcount_count(&arc_mru_ghost->arcs_size[ARC_BUFC_METADATA]) - 4362 gsrm; 4363 (void) arc_evict_impl(arc_mru_ghost, ARC_BUFC_METADATA, e); 4364 4365 gsfd = (mrud + mrum + mfum) / 2; 4366 e = zfs_refcount_count(&arc_mfu_ghost->arcs_size[ARC_BUFC_DATA]) - 4367 gsfd; 4368 (void) arc_evict_impl(arc_mfu_ghost, ARC_BUFC_DATA, e); 4369 4370 gsfm = (mrud + mrum + mfud) / 2; 4371 e = zfs_refcount_count(&arc_mfu_ghost->arcs_size[ARC_BUFC_METADATA]) - 4372 gsfm; 4373 (void) arc_evict_impl(arc_mfu_ghost, ARC_BUFC_METADATA, e); 4374 4375 return (total_evicted); 4376 } 4377 4378 void 4379 arc_flush(spa_t *spa, boolean_t retry) 4380 { 4381 uint64_t guid = 0; 4382 4383 /* 4384 * If retry is B_TRUE, a spa must not be specified since we have 4385 * no good way to determine if all of a spa's buffers have been 4386 * evicted from an arc state. 4387 */ 4388 ASSERT(!retry || spa == NULL); 4389 4390 if (spa != NULL) 4391 guid = spa_load_guid(spa); 4392 4393 (void) arc_flush_state(arc_mru, guid, ARC_BUFC_DATA, retry); 4394 (void) arc_flush_state(arc_mru, guid, ARC_BUFC_METADATA, retry); 4395 4396 (void) arc_flush_state(arc_mfu, guid, ARC_BUFC_DATA, retry); 4397 (void) arc_flush_state(arc_mfu, guid, ARC_BUFC_METADATA, retry); 4398 4399 (void) arc_flush_state(arc_mru_ghost, guid, ARC_BUFC_DATA, retry); 4400 (void) arc_flush_state(arc_mru_ghost, guid, ARC_BUFC_METADATA, retry); 4401 4402 (void) arc_flush_state(arc_mfu_ghost, guid, ARC_BUFC_DATA, retry); 4403 (void) arc_flush_state(arc_mfu_ghost, guid, ARC_BUFC_METADATA, retry); 4404 4405 (void) arc_flush_state(arc_uncached, guid, ARC_BUFC_DATA, retry); 4406 (void) arc_flush_state(arc_uncached, guid, ARC_BUFC_METADATA, retry); 4407 } 4408 4409 uint64_t 4410 arc_reduce_target_size(uint64_t to_free) 4411 { 4412 /* 4413 * Get the actual arc size. Even if we don't need it, this updates 4414 * the aggsum lower bound estimate for arc_is_overflowing(). 4415 */ 4416 uint64_t asize = aggsum_value(&arc_sums.arcstat_size); 4417 4418 /* 4419 * All callers want the ARC to actually evict (at least) this much 4420 * memory. Therefore we reduce from the lower of the current size and 4421 * the target size. 
This way, even if arc_c is much higher than 4422 * arc_size (as can be the case after many calls to arc_freed(), we will 4423 * immediately have arc_c < arc_size and therefore the arc_evict_zthr 4424 * will evict. 4425 */ 4426 uint64_t c = arc_c; 4427 if (c > arc_c_min) { 4428 c = MIN(c, MAX(asize, arc_c_min)); 4429 to_free = MIN(to_free, c - arc_c_min); 4430 arc_c = c - to_free; 4431 } else { 4432 to_free = 0; 4433 } 4434 4435 /* 4436 * Whether or not we reduced the target size, request eviction if the 4437 * current size is over it now, since caller obviously wants some RAM. 4438 */ 4439 if (asize > arc_c) { 4440 /* See comment in arc_evict_cb_check() on why lock+flag */ 4441 mutex_enter(&arc_evict_lock); 4442 arc_evict_needed = B_TRUE; 4443 mutex_exit(&arc_evict_lock); 4444 zthr_wakeup(arc_evict_zthr); 4445 } 4446 4447 return (to_free); 4448 } 4449 4450 /* 4451 * Determine if the system is under memory pressure and is asking 4452 * to reclaim memory. A return value of B_TRUE indicates that the system 4453 * is under memory pressure and that the arc should adjust accordingly. 4454 */ 4455 boolean_t 4456 arc_reclaim_needed(void) 4457 { 4458 return (arc_available_memory() < 0); 4459 } 4460 4461 void 4462 arc_kmem_reap_soon(void) 4463 { 4464 size_t i; 4465 kmem_cache_t *prev_cache = NULL; 4466 kmem_cache_t *prev_data_cache = NULL; 4467 4468 #ifdef _KERNEL 4469 #if defined(_ILP32) 4470 /* 4471 * Reclaim unused memory from all kmem caches. 4472 */ 4473 kmem_reap(); 4474 #endif 4475 #endif 4476 4477 for (i = 0; i < SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT; i++) { 4478 #if defined(_ILP32) 4479 /* reach upper limit of cache size on 32-bit */ 4480 if (zio_buf_cache[i] == NULL) 4481 break; 4482 #endif 4483 if (zio_buf_cache[i] != prev_cache) { 4484 prev_cache = zio_buf_cache[i]; 4485 kmem_cache_reap_now(zio_buf_cache[i]); 4486 } 4487 if (zio_data_buf_cache[i] != prev_data_cache) { 4488 prev_data_cache = zio_data_buf_cache[i]; 4489 kmem_cache_reap_now(zio_data_buf_cache[i]); 4490 } 4491 } 4492 kmem_cache_reap_now(buf_cache); 4493 kmem_cache_reap_now(hdr_full_cache); 4494 kmem_cache_reap_now(hdr_l2only_cache); 4495 kmem_cache_reap_now(zfs_btree_leaf_cache); 4496 abd_cache_reap_now(); 4497 } 4498 4499 static boolean_t 4500 arc_evict_cb_check(void *arg, zthr_t *zthr) 4501 { 4502 (void) arg, (void) zthr; 4503 4504 #ifdef ZFS_DEBUG 4505 /* 4506 * This is necessary in order to keep the kstat information 4507 * up to date for tools that display kstat data such as the 4508 * mdb ::arc dcmd and the Linux crash utility. These tools 4509 * typically do not call kstat's update function, but simply 4510 * dump out stats from the most recent update. Without 4511 * this call, these commands may show stale stats for the 4512 * anon, mru, mru_ghost, mfu, and mfu_ghost lists. Even 4513 * with this call, the data might be out of date if the 4514 * evict thread hasn't been woken recently; but that should 4515 * suffice. The arc_state_t structures can be queried 4516 * directly if more accurate information is needed. 4517 */ 4518 if (arc_ksp != NULL) 4519 arc_ksp->ks_update(arc_ksp, KSTAT_READ); 4520 #endif 4521 4522 /* 4523 * We have to rely on arc_wait_for_eviction() to tell us when to 4524 * evict, rather than checking if we are overflowing here, so that we 4525 * are sure to not leave arc_wait_for_eviction() waiting on aew_cv. 4526 * If we have become "not overflowing" since arc_wait_for_eviction() 4527 * checked, we need to wake it up. 
We could broadcast the CV here, 4528 * but arc_wait_for_eviction() may have not yet gone to sleep. We 4529 * would need to use a mutex to ensure that this function doesn't 4530 * broadcast until arc_wait_for_eviction() has gone to sleep (e.g. 4531 * the arc_evict_lock). However, the lock ordering of such a lock 4532 * would necessarily be incorrect with respect to the zthr_lock, 4533 * which is held before this function is called, and is held by 4534 * arc_wait_for_eviction() when it calls zthr_wakeup(). 4535 */ 4536 if (arc_evict_needed) 4537 return (B_TRUE); 4538 4539 /* 4540 * If we have buffers in uncached state, evict them periodically. 4541 */ 4542 return ((zfs_refcount_count(&arc_uncached->arcs_esize[ARC_BUFC_DATA]) + 4543 zfs_refcount_count(&arc_uncached->arcs_esize[ARC_BUFC_METADATA]) && 4544 ddi_get_lbolt() - arc_last_uncached_flush > 4545 MSEC_TO_TICK(arc_min_prefetch_ms / 2))); 4546 } 4547 4548 /* 4549 * Keep arc_size under arc_c by running arc_evict which evicts data 4550 * from the ARC. 4551 */ 4552 static void 4553 arc_evict_cb(void *arg, zthr_t *zthr) 4554 { 4555 (void) arg; 4556 4557 uint64_t evicted = 0; 4558 fstrans_cookie_t cookie = spl_fstrans_mark(); 4559 4560 /* Always try to evict from uncached state. */ 4561 arc_last_uncached_flush = ddi_get_lbolt(); 4562 evicted += arc_flush_state(arc_uncached, 0, ARC_BUFC_DATA, B_FALSE); 4563 evicted += arc_flush_state(arc_uncached, 0, ARC_BUFC_METADATA, B_FALSE); 4564 4565 /* Evict from other states only if told to. */ 4566 if (arc_evict_needed) 4567 evicted += arc_evict(); 4568 4569 /* 4570 * If evicted is zero, we couldn't evict anything 4571 * via arc_evict(). This could be due to hash lock 4572 * collisions, but more likely due to the majority of 4573 * arc buffers being unevictable. Therefore, even if 4574 * arc_size is above arc_c, another pass is unlikely to 4575 * be helpful and could potentially cause us to enter an 4576 * infinite loop. Additionally, zthr_iscancelled() is 4577 * checked here so that if the arc is shutting down, the 4578 * broadcast will wake any remaining arc evict waiters. 4579 * 4580 * Note we cancel using zthr instead of arc_evict_zthr 4581 * because the latter may not yet be initializd when the 4582 * callback is first invoked. 4583 */ 4584 mutex_enter(&arc_evict_lock); 4585 arc_evict_needed = !zthr_iscancelled(zthr) && 4586 evicted > 0 && aggsum_compare(&arc_sums.arcstat_size, arc_c) > 0; 4587 if (!arc_evict_needed) { 4588 /* 4589 * We're either no longer overflowing, or we 4590 * can't evict anything more, so we should wake 4591 * arc_get_data_impl() sooner. 4592 */ 4593 arc_evict_waiter_t *aw; 4594 while ((aw = list_remove_head(&arc_evict_waiters)) != NULL) { 4595 cv_broadcast(&aw->aew_cv); 4596 } 4597 arc_set_need_free(); 4598 } 4599 mutex_exit(&arc_evict_lock); 4600 spl_fstrans_unmark(cookie); 4601 } 4602 4603 static boolean_t 4604 arc_reap_cb_check(void *arg, zthr_t *zthr) 4605 { 4606 (void) arg, (void) zthr; 4607 4608 int64_t free_memory = arc_available_memory(); 4609 static int reap_cb_check_counter = 0; 4610 4611 /* 4612 * If a kmem reap is already active, don't schedule more. We must 4613 * check for this because kmem_cache_reap_soon() won't actually 4614 * block on the cache being reaped (this is to prevent callers from 4615 * becoming implicitly blocked by a system-wide kmem reap -- which, 4616 * on a system with many, many full magazines, can take minutes). 
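 *
 * For the same reason, arc_reap_cb() below waits at least
 * arc_kmem_cache_reap_retry_ms after kicking off a reap before it
 * re-checks free memory and considers reducing the target size.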
4617 */ 4618 if (!kmem_cache_reap_active() && free_memory < 0) { 4619 4620 arc_no_grow = B_TRUE; 4621 arc_warm = B_TRUE; 4622 /* 4623 * Wait at least zfs_grow_retry (default 5) seconds 4624 * before considering growing. 4625 */ 4626 arc_growtime = gethrtime() + SEC2NSEC(arc_grow_retry); 4627 return (B_TRUE); 4628 } else if (free_memory < arc_c >> arc_no_grow_shift) { 4629 arc_no_grow = B_TRUE; 4630 } else if (gethrtime() >= arc_growtime) { 4631 arc_no_grow = B_FALSE; 4632 } 4633 4634 /* 4635 * Called unconditionally every 60 seconds to reclaim unused 4636 * zstd compression and decompression context. This is done 4637 * here to avoid the need for an independent thread. 4638 */ 4639 if (!((reap_cb_check_counter++) % 60)) 4640 zfs_zstd_cache_reap_now(); 4641 4642 return (B_FALSE); 4643 } 4644 4645 /* 4646 * Keep enough free memory in the system by reaping the ARC's kmem 4647 * caches. To cause more slabs to be reapable, we may reduce the 4648 * target size of the cache (arc_c), causing the arc_evict_cb() 4649 * to free more buffers. 4650 */ 4651 static void 4652 arc_reap_cb(void *arg, zthr_t *zthr) 4653 { 4654 int64_t can_free, free_memory, to_free; 4655 4656 (void) arg, (void) zthr; 4657 fstrans_cookie_t cookie = spl_fstrans_mark(); 4658 4659 /* 4660 * Kick off asynchronous kmem_reap()'s of all our caches. 4661 */ 4662 arc_kmem_reap_soon(); 4663 4664 /* 4665 * Wait at least arc_kmem_cache_reap_retry_ms between 4666 * arc_kmem_reap_soon() calls. Without this check it is possible to 4667 * end up in a situation where we spend lots of time reaping 4668 * caches, while we're near arc_c_min. Waiting here also gives the 4669 * subsequent free memory check a chance of finding that the 4670 * asynchronous reap has already freed enough memory, and we don't 4671 * need to call arc_reduce_target_size(). 4672 */ 4673 delay((hz * arc_kmem_cache_reap_retry_ms + 999) / 1000); 4674 4675 /* 4676 * Reduce the target size as needed to maintain the amount of free 4677 * memory in the system at a fraction of the arc_size (1/128th by 4678 * default). If oversubscribed (free_memory < 0) then reduce the 4679 * target arc_size by the deficit amount plus the fractional 4680 * amount. If free memory is positive but less than the fractional 4681 * amount, reduce by what is needed to hit the fractional amount. 4682 */ 4683 free_memory = arc_available_memory(); 4684 can_free = arc_c - arc_c_min; 4685 to_free = (MAX(can_free, 0) >> arc_shrink_shift) - free_memory; 4686 if (to_free > 0) 4687 arc_reduce_target_size(to_free); 4688 spl_fstrans_unmark(cookie); 4689 } 4690 4691 #ifdef _KERNEL 4692 /* 4693 * Determine the amount of memory eligible for eviction contained in the 4694 * ARC. All clean data reported by the ghost lists can always be safely 4695 * evicted. Due to arc_c_min, the same does not hold for all clean data 4696 * contained by the regular mru and mfu lists. 4697 * 4698 * In the case of the regular mru and mfu lists, we need to report as 4699 * much clean data as possible, such that evicting that same reported 4700 * data will not bring arc_size below arc_c_min. Thus, in certain 4701 * circumstances, the total amount of clean data in the mru and mfu 4702 * lists might not actually be evictable. 4703 * 4704 * The following two distinct cases are accounted for: 4705 * 4706 * 1. The sum of the amount of dirty data contained by both the mru and 4707 * mfu lists, plus the ARC's other accounting (e.g. the anon list), 4708 * is greater than or equal to arc_c_min. 4709 * (i.e. 
amount of dirty data >= arc_c_min) 4710 * 4711 * This is the easy case; all clean data contained by the mru and mfu 4712 * lists is evictable. Evicting all clean data can only drop arc_size 4713 * to the amount of dirty data, which is greater than arc_c_min. 4714 * 4715 * 2. The sum of the amount of dirty data contained by both the mru and 4716 * mfu lists, plus the ARC's other accounting (e.g. the anon list), 4717 * is less than arc_c_min. 4718 * (i.e. arc_c_min > amount of dirty data) 4719 * 4720 * 2.1. arc_size is greater than or equal arc_c_min. 4721 * (i.e. arc_size >= arc_c_min > amount of dirty data) 4722 * 4723 * In this case, not all clean data from the regular mru and mfu 4724 * lists is actually evictable; we must leave enough clean data 4725 * to keep arc_size above arc_c_min. Thus, the maximum amount of 4726 * evictable data from the two lists combined, is exactly the 4727 * difference between arc_size and arc_c_min. 4728 * 4729 * 2.2. arc_size is less than arc_c_min 4730 * (i.e. arc_c_min > arc_size > amount of dirty data) 4731 * 4732 * In this case, none of the data contained in the mru and mfu 4733 * lists is evictable, even if it's clean. Since arc_size is 4734 * already below arc_c_min, evicting any more would only 4735 * increase this negative difference. 4736 */ 4737 4738 #endif /* _KERNEL */ 4739 4740 /* 4741 * Adapt arc info given the number of bytes we are trying to add and 4742 * the state that we are coming from. This function is only called 4743 * when we are adding new content to the cache. 4744 */ 4745 static void 4746 arc_adapt(uint64_t bytes) 4747 { 4748 /* 4749 * Wake reap thread if we do not have any available memory 4750 */ 4751 if (arc_reclaim_needed()) { 4752 zthr_wakeup(arc_reap_zthr); 4753 return; 4754 } 4755 4756 if (arc_no_grow) 4757 return; 4758 4759 if (arc_c >= arc_c_max) 4760 return; 4761 4762 /* 4763 * If we're within (2 * maxblocksize) bytes of the target 4764 * cache size, increment the target cache size 4765 */ 4766 if (aggsum_upper_bound(&arc_sums.arcstat_size) + 4767 2 * SPA_MAXBLOCKSIZE >= arc_c) { 4768 uint64_t dc = MAX(bytes, SPA_OLD_MAXBLOCKSIZE); 4769 if (atomic_add_64_nv(&arc_c, dc) > arc_c_max) 4770 arc_c = arc_c_max; 4771 } 4772 } 4773 4774 /* 4775 * Check if ARC current size has grown past our upper thresholds. 4776 */ 4777 static arc_ovf_level_t 4778 arc_is_overflowing(boolean_t lax, boolean_t use_reserve) 4779 { 4780 /* 4781 * We just compare the lower bound here for performance reasons. Our 4782 * primary goals are to make sure that the arc never grows without 4783 * bound, and that it can reach its maximum size. This check 4784 * accomplishes both goals. The maximum amount we could run over by is 4785 * 2 * aggsum_borrow_multiplier * NUM_CPUS * the average size of a block 4786 * in the ARC. In practice, that's in the tens of MB, which is low 4787 * enough to be safe. 4788 */ 4789 int64_t over = aggsum_lower_bound(&arc_sums.arcstat_size) - arc_c - 4790 zfs_max_recordsize; 4791 4792 /* Always allow at least one block of overflow. */ 4793 if (over < 0) 4794 return (ARC_OVF_NONE); 4795 4796 /* If we are under memory pressure, report severe overflow. */ 4797 if (!lax) 4798 return (ARC_OVF_SEVERE); 4799 4800 /* We are not under pressure, so be more or less relaxed. */ 4801 int64_t overflow = (arc_c >> zfs_arc_overflow_shift) / 2; 4802 if (use_reserve) 4803 overflow *= 3; 4804 return (over < overflow ? 
ARC_OVF_SOME : ARC_OVF_SEVERE); 4805 } 4806 4807 static abd_t * 4808 arc_get_data_abd(arc_buf_hdr_t *hdr, uint64_t size, const void *tag, 4809 int alloc_flags) 4810 { 4811 arc_buf_contents_t type = arc_buf_type(hdr); 4812 4813 arc_get_data_impl(hdr, size, tag, alloc_flags); 4814 if (alloc_flags & ARC_HDR_ALLOC_LINEAR) 4815 return (abd_alloc_linear(size, type == ARC_BUFC_METADATA)); 4816 else 4817 return (abd_alloc(size, type == ARC_BUFC_METADATA)); 4818 } 4819 4820 static void * 4821 arc_get_data_buf(arc_buf_hdr_t *hdr, uint64_t size, const void *tag) 4822 { 4823 arc_buf_contents_t type = arc_buf_type(hdr); 4824 4825 arc_get_data_impl(hdr, size, tag, 0); 4826 if (type == ARC_BUFC_METADATA) { 4827 return (zio_buf_alloc(size)); 4828 } else { 4829 ASSERT(type == ARC_BUFC_DATA); 4830 return (zio_data_buf_alloc(size)); 4831 } 4832 } 4833 4834 /* 4835 * Wait for the specified amount of data (in bytes) to be evicted from the 4836 * ARC, and for there to be sufficient free memory in the system. 4837 * The lax argument specifies that caller does not have a specific reason 4838 * to wait, not aware of any memory pressure. Low memory handlers though 4839 * should set it to B_FALSE to wait for all required evictions to complete. 4840 * The use_reserve argument allows some callers to wait less than others 4841 * to not block critical code paths, possibly blocking other resources. 4842 */ 4843 void 4844 arc_wait_for_eviction(uint64_t amount, boolean_t lax, boolean_t use_reserve) 4845 { 4846 switch (arc_is_overflowing(lax, use_reserve)) { 4847 case ARC_OVF_NONE: 4848 return; 4849 case ARC_OVF_SOME: 4850 /* 4851 * This is a bit racy without taking arc_evict_lock, but the 4852 * worst that can happen is we either call zthr_wakeup() extra 4853 * time due to race with other thread here, or the set flag 4854 * get cleared by arc_evict_cb(), which is unlikely due to 4855 * big hysteresis, but also not important since at this level 4856 * of overflow the eviction is purely advisory. Same time 4857 * taking the global lock here every time without waiting for 4858 * the actual eviction creates a significant lock contention. 4859 */ 4860 if (!arc_evict_needed) { 4861 arc_evict_needed = B_TRUE; 4862 zthr_wakeup(arc_evict_zthr); 4863 } 4864 return; 4865 case ARC_OVF_SEVERE: 4866 default: 4867 { 4868 arc_evict_waiter_t aw; 4869 list_link_init(&aw.aew_node); 4870 cv_init(&aw.aew_cv, NULL, CV_DEFAULT, NULL); 4871 4872 uint64_t last_count = 0; 4873 mutex_enter(&arc_evict_lock); 4874 if (!list_is_empty(&arc_evict_waiters)) { 4875 arc_evict_waiter_t *last = 4876 list_tail(&arc_evict_waiters); 4877 last_count = last->aew_count; 4878 } else if (!arc_evict_needed) { 4879 arc_evict_needed = B_TRUE; 4880 zthr_wakeup(arc_evict_zthr); 4881 } 4882 /* 4883 * Note, the last waiter's count may be less than 4884 * arc_evict_count if we are low on memory in which 4885 * case arc_evict_state_impl() may have deferred 4886 * wakeups (but still incremented arc_evict_count). 4887 */ 4888 aw.aew_count = MAX(last_count, arc_evict_count) + amount; 4889 4890 list_insert_tail(&arc_evict_waiters, &aw); 4891 4892 arc_set_need_free(); 4893 4894 DTRACE_PROBE3(arc__wait__for__eviction, 4895 uint64_t, amount, 4896 uint64_t, arc_evict_count, 4897 uint64_t, aw.aew_count); 4898 4899 /* 4900 * We will be woken up either when arc_evict_count reaches 4901 * aew_count, or when the ARC is no longer overflowing and 4902 * eviction completes. 4903 * In case of "false" wakeup, we will still be on the list. 
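 * Both wakeup paths remove the waiter from arc_evict_waiters before
 * broadcasting its CV, so list membership, checked below via
 * list_link_active(), is the authoritative "still waiting" test.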
4904 */ 4905 do { 4906 cv_wait(&aw.aew_cv, &arc_evict_lock); 4907 } while (list_link_active(&aw.aew_node)); 4908 mutex_exit(&arc_evict_lock); 4909 4910 cv_destroy(&aw.aew_cv); 4911 } 4912 } 4913 } 4914 4915 /* 4916 * Allocate a block and return it to the caller. If we are hitting the 4917 * hard limit for the cache size, we must sleep, waiting for the eviction 4918 * thread to catch up. If we're past the target size but below the hard 4919 * limit, we'll only signal the reclaim thread and continue on. 4920 */ 4921 static void 4922 arc_get_data_impl(arc_buf_hdr_t *hdr, uint64_t size, const void *tag, 4923 int alloc_flags) 4924 { 4925 arc_adapt(size); 4926 4927 /* 4928 * If arc_size is currently overflowing, we must be adding data 4929 * faster than we are evicting. To ensure we don't compound the 4930 * problem by adding more data and forcing arc_size to grow even 4931 * further past it's target size, we wait for the eviction thread to 4932 * make some progress. We also wait for there to be sufficient free 4933 * memory in the system, as measured by arc_free_memory(). 4934 * 4935 * Specifically, we wait for zfs_arc_eviction_pct percent of the 4936 * requested size to be evicted. This should be more than 100%, to 4937 * ensure that that progress is also made towards getting arc_size 4938 * under arc_c. See the comment above zfs_arc_eviction_pct. 4939 */ 4940 arc_wait_for_eviction(size * zfs_arc_eviction_pct / 100, 4941 B_TRUE, alloc_flags & ARC_HDR_USE_RESERVE); 4942 4943 arc_buf_contents_t type = arc_buf_type(hdr); 4944 if (type == ARC_BUFC_METADATA) { 4945 arc_space_consume(size, ARC_SPACE_META); 4946 } else { 4947 arc_space_consume(size, ARC_SPACE_DATA); 4948 } 4949 4950 /* 4951 * Update the state size. Note that ghost states have a 4952 * "ghost size" and so don't need to be updated. 4953 */ 4954 arc_state_t *state = hdr->b_l1hdr.b_state; 4955 if (!GHOST_STATE(state)) { 4956 4957 (void) zfs_refcount_add_many(&state->arcs_size[type], size, 4958 tag); 4959 4960 /* 4961 * If this is reached via arc_read, the link is 4962 * protected by the hash lock. If reached via 4963 * arc_buf_alloc, the header should not be accessed by 4964 * any other thread. And, if reached via arc_read_done, 4965 * the hash lock will protect it if it's found in the 4966 * hash table; otherwise no other thread should be 4967 * trying to [add|remove]_reference it. 4968 */ 4969 if (multilist_link_active(&hdr->b_l1hdr.b_arc_node)) { 4970 ASSERT(zfs_refcount_is_zero(&hdr->b_l1hdr.b_refcnt)); 4971 (void) zfs_refcount_add_many(&state->arcs_esize[type], 4972 size, tag); 4973 } 4974 } 4975 } 4976 4977 static void 4978 arc_free_data_abd(arc_buf_hdr_t *hdr, abd_t *abd, uint64_t size, 4979 const void *tag) 4980 { 4981 arc_free_data_impl(hdr, size, tag); 4982 abd_free(abd); 4983 } 4984 4985 static void 4986 arc_free_data_buf(arc_buf_hdr_t *hdr, void *buf, uint64_t size, const void *tag) 4987 { 4988 arc_buf_contents_t type = arc_buf_type(hdr); 4989 4990 arc_free_data_impl(hdr, size, tag); 4991 if (type == ARC_BUFC_METADATA) { 4992 zio_buf_free(buf, size); 4993 } else { 4994 ASSERT(type == ARC_BUFC_DATA); 4995 zio_data_buf_free(buf, size); 4996 } 4997 } 4998 4999 /* 5000 * Free the arc data buffer. 
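 * This is the accounting counterpart of arc_get_data_impl() above: a
 * buffer obtained with, for example,
 *
 *	abd = arc_get_data_abd(hdr, size, tag, alloc_flags);
 *
 * is eventually released through
 *
 *	arc_free_data_abd(hdr, abd, size, tag);
 *
 * which lands here to return the arcs_size/arcs_esize and arc_space
 * accounting taken at allocation time.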
5001 */ 5002 static void 5003 arc_free_data_impl(arc_buf_hdr_t *hdr, uint64_t size, const void *tag) 5004 { 5005 arc_state_t *state = hdr->b_l1hdr.b_state; 5006 arc_buf_contents_t type = arc_buf_type(hdr); 5007 5008 /* protected by hash lock, if in the hash table */ 5009 if (multilist_link_active(&hdr->b_l1hdr.b_arc_node)) { 5010 ASSERT(zfs_refcount_is_zero(&hdr->b_l1hdr.b_refcnt)); 5011 ASSERT(state != arc_anon && state != arc_l2c_only); 5012 5013 (void) zfs_refcount_remove_many(&state->arcs_esize[type], 5014 size, tag); 5015 } 5016 (void) zfs_refcount_remove_many(&state->arcs_size[type], size, tag); 5017 5018 VERIFY3U(hdr->b_type, ==, type); 5019 if (type == ARC_BUFC_METADATA) { 5020 arc_space_return(size, ARC_SPACE_META); 5021 } else { 5022 ASSERT(type == ARC_BUFC_DATA); 5023 arc_space_return(size, ARC_SPACE_DATA); 5024 } 5025 } 5026 5027 /* 5028 * This routine is called whenever a buffer is accessed. 5029 */ 5030 static void 5031 arc_access(arc_buf_hdr_t *hdr, arc_flags_t arc_flags, boolean_t hit) 5032 { 5033 ASSERT(MUTEX_HELD(HDR_LOCK(hdr))); 5034 ASSERT(HDR_HAS_L1HDR(hdr)); 5035 5036 /* 5037 * Update buffer prefetch status. 5038 */ 5039 boolean_t was_prefetch = HDR_PREFETCH(hdr); 5040 boolean_t now_prefetch = arc_flags & ARC_FLAG_PREFETCH; 5041 if (was_prefetch != now_prefetch) { 5042 if (was_prefetch) { 5043 ARCSTAT_CONDSTAT(hit, demand_hit, demand_iohit, 5044 HDR_PRESCIENT_PREFETCH(hdr), prescient, predictive, 5045 prefetch); 5046 } 5047 if (HDR_HAS_L2HDR(hdr)) 5048 l2arc_hdr_arcstats_decrement_state(hdr); 5049 if (was_prefetch) { 5050 arc_hdr_clear_flags(hdr, 5051 ARC_FLAG_PREFETCH | ARC_FLAG_PRESCIENT_PREFETCH); 5052 } else { 5053 arc_hdr_set_flags(hdr, ARC_FLAG_PREFETCH); 5054 } 5055 if (HDR_HAS_L2HDR(hdr)) 5056 l2arc_hdr_arcstats_increment_state(hdr); 5057 } 5058 if (now_prefetch) { 5059 if (arc_flags & ARC_FLAG_PRESCIENT_PREFETCH) { 5060 arc_hdr_set_flags(hdr, ARC_FLAG_PRESCIENT_PREFETCH); 5061 ARCSTAT_BUMP(arcstat_prescient_prefetch); 5062 } else { 5063 ARCSTAT_BUMP(arcstat_predictive_prefetch); 5064 } 5065 } 5066 if (arc_flags & ARC_FLAG_L2CACHE) 5067 arc_hdr_set_flags(hdr, ARC_FLAG_L2CACHE); 5068 5069 clock_t now = ddi_get_lbolt(); 5070 if (hdr->b_l1hdr.b_state == arc_anon) { 5071 arc_state_t *new_state; 5072 /* 5073 * This buffer is not in the cache, and does not appear in 5074 * our "ghost" lists. Add it to the MRU or uncached state. 5075 */ 5076 ASSERT0(hdr->b_l1hdr.b_arc_access); 5077 hdr->b_l1hdr.b_arc_access = now; 5078 if (HDR_UNCACHED(hdr)) { 5079 new_state = arc_uncached; 5080 DTRACE_PROBE1(new_state__uncached, arc_buf_hdr_t *, 5081 hdr); 5082 } else { 5083 new_state = arc_mru; 5084 DTRACE_PROBE1(new_state__mru, arc_buf_hdr_t *, hdr); 5085 } 5086 arc_change_state(new_state, hdr); 5087 } else if (hdr->b_l1hdr.b_state == arc_mru) { 5088 /* 5089 * This buffer has been accessed once recently and either 5090 * its read is still in progress or it is in the cache. 5091 */ 5092 if (HDR_IO_IN_PROGRESS(hdr)) { 5093 hdr->b_l1hdr.b_arc_access = now; 5094 return; 5095 } 5096 hdr->b_l1hdr.b_mru_hits++; 5097 ARCSTAT_BUMP(arcstat_mru_hits); 5098 5099 /* 5100 * If the previous access was a prefetch, then it already 5101 * handled possible promotion, so nothing more to do for now. 5102 */ 5103 if (was_prefetch) { 5104 hdr->b_l1hdr.b_arc_access = now; 5105 return; 5106 } 5107 5108 /* 5109 * If more than ARC_MINTIME have passed from the previous 5110 * hit, promote the buffer to the MFU state. 
5111 */ 5112 if (ddi_time_after(now, hdr->b_l1hdr.b_arc_access + 5113 ARC_MINTIME)) { 5114 hdr->b_l1hdr.b_arc_access = now; 5115 DTRACE_PROBE1(new_state__mfu, arc_buf_hdr_t *, hdr); 5116 arc_change_state(arc_mfu, hdr); 5117 } 5118 } else if (hdr->b_l1hdr.b_state == arc_mru_ghost) { 5119 arc_state_t *new_state; 5120 /* 5121 * This buffer has been accessed once recently, but was 5122 * evicted from the cache. Would we have bigger MRU, it 5123 * would be an MRU hit, so handle it the same way, except 5124 * we don't need to check the previous access time. 5125 */ 5126 hdr->b_l1hdr.b_mru_ghost_hits++; 5127 ARCSTAT_BUMP(arcstat_mru_ghost_hits); 5128 hdr->b_l1hdr.b_arc_access = now; 5129 wmsum_add(&arc_mru_ghost->arcs_hits[arc_buf_type(hdr)], 5130 arc_hdr_size(hdr)); 5131 if (was_prefetch) { 5132 new_state = arc_mru; 5133 DTRACE_PROBE1(new_state__mru, arc_buf_hdr_t *, hdr); 5134 } else { 5135 new_state = arc_mfu; 5136 DTRACE_PROBE1(new_state__mfu, arc_buf_hdr_t *, hdr); 5137 } 5138 arc_change_state(new_state, hdr); 5139 } else if (hdr->b_l1hdr.b_state == arc_mfu) { 5140 /* 5141 * This buffer has been accessed more than once and either 5142 * still in the cache or being restored from one of ghosts. 5143 */ 5144 if (!HDR_IO_IN_PROGRESS(hdr)) { 5145 hdr->b_l1hdr.b_mfu_hits++; 5146 ARCSTAT_BUMP(arcstat_mfu_hits); 5147 } 5148 hdr->b_l1hdr.b_arc_access = now; 5149 } else if (hdr->b_l1hdr.b_state == arc_mfu_ghost) { 5150 /* 5151 * This buffer has been accessed more than once recently, but 5152 * has been evicted from the cache. Would we have bigger MFU 5153 * it would stay in cache, so move it back to MFU state. 5154 */ 5155 hdr->b_l1hdr.b_mfu_ghost_hits++; 5156 ARCSTAT_BUMP(arcstat_mfu_ghost_hits); 5157 hdr->b_l1hdr.b_arc_access = now; 5158 wmsum_add(&arc_mfu_ghost->arcs_hits[arc_buf_type(hdr)], 5159 arc_hdr_size(hdr)); 5160 DTRACE_PROBE1(new_state__mfu, arc_buf_hdr_t *, hdr); 5161 arc_change_state(arc_mfu, hdr); 5162 } else if (hdr->b_l1hdr.b_state == arc_uncached) { 5163 /* 5164 * This buffer is uncacheable, but we got a hit. Probably 5165 * a demand read after prefetch. Nothing more to do here. 5166 */ 5167 if (!HDR_IO_IN_PROGRESS(hdr)) 5168 ARCSTAT_BUMP(arcstat_uncached_hits); 5169 hdr->b_l1hdr.b_arc_access = now; 5170 } else if (hdr->b_l1hdr.b_state == arc_l2c_only) { 5171 /* 5172 * This buffer is on the 2nd Level ARC and was not accessed 5173 * for a long time, so treat it as new and put into MRU. 5174 */ 5175 hdr->b_l1hdr.b_arc_access = now; 5176 DTRACE_PROBE1(new_state__mru, arc_buf_hdr_t *, hdr); 5177 arc_change_state(arc_mru, hdr); 5178 } else { 5179 cmn_err(CE_PANIC, "invalid arc state 0x%p", 5180 hdr->b_l1hdr.b_state); 5181 } 5182 } 5183 5184 /* 5185 * This routine is called by dbuf_hold() to update the arc_access() state 5186 * which otherwise would be skipped for entries in the dbuf cache. 5187 */ 5188 void 5189 arc_buf_access(arc_buf_t *buf) 5190 { 5191 arc_buf_hdr_t *hdr = buf->b_hdr; 5192 5193 /* 5194 * Avoid taking the hash_lock when possible as an optimization. 5195 * The header must be checked again under the hash_lock in order 5196 * to handle the case where it is concurrently being released. 
5197 */ 5198 if (hdr->b_l1hdr.b_state == arc_anon || HDR_EMPTY(hdr)) 5199 return; 5200 5201 kmutex_t *hash_lock = HDR_LOCK(hdr); 5202 mutex_enter(hash_lock); 5203 5204 if (hdr->b_l1hdr.b_state == arc_anon || HDR_EMPTY(hdr)) { 5205 mutex_exit(hash_lock); 5206 ARCSTAT_BUMP(arcstat_access_skip); 5207 return; 5208 } 5209 5210 ASSERT(hdr->b_l1hdr.b_state == arc_mru || 5211 hdr->b_l1hdr.b_state == arc_mfu || 5212 hdr->b_l1hdr.b_state == arc_uncached); 5213 5214 DTRACE_PROBE1(arc__hit, arc_buf_hdr_t *, hdr); 5215 arc_access(hdr, 0, B_TRUE); 5216 mutex_exit(hash_lock); 5217 5218 ARCSTAT_BUMP(arcstat_hits); 5219 ARCSTAT_CONDSTAT(B_TRUE /* demand */, demand, prefetch, 5220 !HDR_ISTYPE_METADATA(hdr), data, metadata, hits); 5221 } 5222 5223 /* a generic arc_read_done_func_t which you can use */ 5224 void 5225 arc_bcopy_func(zio_t *zio, const zbookmark_phys_t *zb, const blkptr_t *bp, 5226 arc_buf_t *buf, void *arg) 5227 { 5228 (void) zio, (void) zb, (void) bp; 5229 5230 if (buf == NULL) 5231 return; 5232 5233 memcpy(arg, buf->b_data, arc_buf_size(buf)); 5234 arc_buf_destroy(buf, arg); 5235 } 5236 5237 /* a generic arc_read_done_func_t */ 5238 void 5239 arc_getbuf_func(zio_t *zio, const zbookmark_phys_t *zb, const blkptr_t *bp, 5240 arc_buf_t *buf, void *arg) 5241 { 5242 (void) zb, (void) bp; 5243 arc_buf_t **bufp = arg; 5244 5245 if (buf == NULL) { 5246 ASSERT(zio == NULL || zio->io_error != 0); 5247 *bufp = NULL; 5248 } else { 5249 ASSERT(zio == NULL || zio->io_error == 0); 5250 *bufp = buf; 5251 ASSERT(buf->b_data != NULL); 5252 } 5253 } 5254 5255 static void 5256 arc_hdr_verify(arc_buf_hdr_t *hdr, blkptr_t *bp) 5257 { 5258 if (BP_IS_HOLE(bp) || BP_IS_EMBEDDED(bp)) { 5259 ASSERT3U(HDR_GET_PSIZE(hdr), ==, 0); 5260 ASSERT3U(arc_hdr_get_compress(hdr), ==, ZIO_COMPRESS_OFF); 5261 } else { 5262 if (HDR_COMPRESSION_ENABLED(hdr)) { 5263 ASSERT3U(arc_hdr_get_compress(hdr), ==, 5264 BP_GET_COMPRESS(bp)); 5265 } 5266 ASSERT3U(HDR_GET_LSIZE(hdr), ==, BP_GET_LSIZE(bp)); 5267 ASSERT3U(HDR_GET_PSIZE(hdr), ==, BP_GET_PSIZE(bp)); 5268 ASSERT3U(!!HDR_PROTECTED(hdr), ==, BP_IS_PROTECTED(bp)); 5269 } 5270 } 5271 5272 static void 5273 arc_read_done(zio_t *zio) 5274 { 5275 blkptr_t *bp = zio->io_bp; 5276 arc_buf_hdr_t *hdr = zio->io_private; 5277 kmutex_t *hash_lock = NULL; 5278 arc_callback_t *callback_list; 5279 arc_callback_t *acb; 5280 5281 /* 5282 * The hdr was inserted into hash-table and removed from lists 5283 * prior to starting I/O. We should find this header, since 5284 * it's in the hash table, and it should be legit since it's 5285 * not possible to evict it during the I/O. The only possible 5286 * reason for it not to be found is if we were freed during the 5287 * read. 
5288 */ 5289 if (HDR_IN_HASH_TABLE(hdr)) { 5290 arc_buf_hdr_t *found; 5291 5292 ASSERT3U(hdr->b_birth, ==, BP_GET_BIRTH(zio->io_bp)); 5293 ASSERT3U(hdr->b_dva.dva_word[0], ==, 5294 BP_IDENTITY(zio->io_bp)->dva_word[0]); 5295 ASSERT3U(hdr->b_dva.dva_word[1], ==, 5296 BP_IDENTITY(zio->io_bp)->dva_word[1]); 5297 5298 found = buf_hash_find(hdr->b_spa, zio->io_bp, &hash_lock); 5299 5300 ASSERT((found == hdr && 5301 DVA_EQUAL(&hdr->b_dva, BP_IDENTITY(zio->io_bp))) || 5302 (found == hdr && HDR_L2_READING(hdr))); 5303 ASSERT3P(hash_lock, !=, NULL); 5304 } 5305 5306 if (BP_IS_PROTECTED(bp)) { 5307 hdr->b_crypt_hdr.b_ot = BP_GET_TYPE(bp); 5308 hdr->b_crypt_hdr.b_dsobj = zio->io_bookmark.zb_objset; 5309 zio_crypt_decode_params_bp(bp, hdr->b_crypt_hdr.b_salt, 5310 hdr->b_crypt_hdr.b_iv); 5311 5312 if (zio->io_error == 0) { 5313 if (BP_GET_TYPE(bp) == DMU_OT_INTENT_LOG) { 5314 void *tmpbuf; 5315 5316 tmpbuf = abd_borrow_buf_copy(zio->io_abd, 5317 sizeof (zil_chain_t)); 5318 zio_crypt_decode_mac_zil(tmpbuf, 5319 hdr->b_crypt_hdr.b_mac); 5320 abd_return_buf(zio->io_abd, tmpbuf, 5321 sizeof (zil_chain_t)); 5322 } else { 5323 zio_crypt_decode_mac_bp(bp, 5324 hdr->b_crypt_hdr.b_mac); 5325 } 5326 } 5327 } 5328 5329 if (zio->io_error == 0) { 5330 /* byteswap if necessary */ 5331 if (BP_SHOULD_BYTESWAP(zio->io_bp)) { 5332 if (BP_GET_LEVEL(zio->io_bp) > 0) { 5333 hdr->b_l1hdr.b_byteswap = DMU_BSWAP_UINT64; 5334 } else { 5335 hdr->b_l1hdr.b_byteswap = 5336 DMU_OT_BYTESWAP(BP_GET_TYPE(zio->io_bp)); 5337 } 5338 } else { 5339 hdr->b_l1hdr.b_byteswap = DMU_BSWAP_NUMFUNCS; 5340 } 5341 if (!HDR_L2_READING(hdr)) { 5342 hdr->b_complevel = zio->io_prop.zp_complevel; 5343 } 5344 } 5345 5346 arc_hdr_clear_flags(hdr, ARC_FLAG_L2_EVICTED); 5347 if (l2arc_noprefetch && HDR_PREFETCH(hdr)) 5348 arc_hdr_clear_flags(hdr, ARC_FLAG_L2CACHE); 5349 5350 callback_list = hdr->b_l1hdr.b_acb; 5351 ASSERT3P(callback_list, !=, NULL); 5352 hdr->b_l1hdr.b_acb = NULL; 5353 5354 /* 5355 * If a read request has a callback (i.e. acb_done is not NULL), then we 5356 * make a buf containing the data according to the parameters which were 5357 * passed in. The implementation of arc_buf_alloc_impl() ensures that we 5358 * aren't needlessly decompressing the data multiple times. 5359 */ 5360 int callback_cnt = 0; 5361 for (acb = callback_list; acb != NULL; acb = acb->acb_next) { 5362 5363 /* We need the last one to call below in original order. */ 5364 callback_list = acb; 5365 5366 if (!acb->acb_done || acb->acb_nobuf) 5367 continue; 5368 5369 callback_cnt++; 5370 5371 if (zio->io_error != 0) 5372 continue; 5373 5374 int error = arc_buf_alloc_impl(hdr, zio->io_spa, 5375 &acb->acb_zb, acb->acb_private, acb->acb_encrypted, 5376 acb->acb_compressed, acb->acb_noauth, B_TRUE, 5377 &acb->acb_buf); 5378 5379 /* 5380 * Assert non-speculative zios didn't fail because an 5381 * encryption key wasn't loaded 5382 */ 5383 ASSERT((zio->io_flags & ZIO_FLAG_SPECULATIVE) || 5384 error != EACCES); 5385 5386 /* 5387 * If we failed to decrypt, report an error now (as the zio 5388 * layer would have done if it had done the transforms). 
5389 */ 5390 if (error == ECKSUM) { 5391 ASSERT(BP_IS_PROTECTED(bp)); 5392 error = SET_ERROR(EIO); 5393 if ((zio->io_flags & ZIO_FLAG_SPECULATIVE) == 0) { 5394 spa_log_error(zio->io_spa, &acb->acb_zb, 5395 BP_GET_LOGICAL_BIRTH(zio->io_bp)); 5396 (void) zfs_ereport_post( 5397 FM_EREPORT_ZFS_AUTHENTICATION, 5398 zio->io_spa, NULL, &acb->acb_zb, zio, 0); 5399 } 5400 } 5401 5402 if (error != 0) { 5403 /* 5404 * Decompression or decryption failed. Set 5405 * io_error so that when we call acb_done 5406 * (below), we will indicate that the read 5407 * failed. Note that in the unusual case 5408 * where one callback is compressed and another 5409 * uncompressed, we will mark all of them 5410 * as failed, even though the uncompressed 5411 * one can't actually fail. In this case, 5412 * the hdr will not be anonymous, because 5413 * if there are multiple callbacks, it's 5414 * because multiple threads found the same 5415 * arc buf in the hash table. 5416 */ 5417 zio->io_error = error; 5418 } 5419 } 5420 5421 /* 5422 * If there are multiple callbacks, we must have the hash lock, 5423 * because the only way for multiple threads to find this hdr is 5424 * in the hash table. This ensures that if there are multiple 5425 * callbacks, the hdr is not anonymous. If it were anonymous, 5426 * we couldn't use arc_buf_destroy() in the error case below. 5427 */ 5428 ASSERT(callback_cnt < 2 || hash_lock != NULL); 5429 5430 if (zio->io_error == 0) { 5431 arc_hdr_verify(hdr, zio->io_bp); 5432 } else { 5433 arc_hdr_set_flags(hdr, ARC_FLAG_IO_ERROR); 5434 if (hdr->b_l1hdr.b_state != arc_anon) 5435 arc_change_state(arc_anon, hdr); 5436 if (HDR_IN_HASH_TABLE(hdr)) 5437 buf_hash_remove(hdr); 5438 } 5439 5440 arc_hdr_clear_flags(hdr, ARC_FLAG_IO_IN_PROGRESS); 5441 (void) remove_reference(hdr, hdr); 5442 5443 if (hash_lock != NULL) 5444 mutex_exit(hash_lock); 5445 5446 /* execute each callback and free its structure */ 5447 while ((acb = callback_list) != NULL) { 5448 if (acb->acb_done != NULL) { 5449 if (zio->io_error != 0 && acb->acb_buf != NULL) { 5450 /* 5451 * If arc_buf_alloc_impl() fails during 5452 * decompression, the buf will still be 5453 * allocated, and needs to be freed here. 5454 */ 5455 arc_buf_destroy(acb->acb_buf, 5456 acb->acb_private); 5457 acb->acb_buf = NULL; 5458 } 5459 acb->acb_done(zio, &zio->io_bookmark, zio->io_bp, 5460 acb->acb_buf, acb->acb_private); 5461 } 5462 5463 if (acb->acb_zio_dummy != NULL) { 5464 acb->acb_zio_dummy->io_error = zio->io_error; 5465 zio_nowait(acb->acb_zio_dummy); 5466 } 5467 5468 callback_list = acb->acb_prev; 5469 if (acb->acb_wait) { 5470 mutex_enter(&acb->acb_wait_lock); 5471 acb->acb_wait_error = zio->io_error; 5472 acb->acb_wait = B_FALSE; 5473 cv_signal(&acb->acb_wait_cv); 5474 mutex_exit(&acb->acb_wait_lock); 5475 /* acb will be freed by the waiting thread. */ 5476 } else { 5477 kmem_free(acb, sizeof (arc_callback_t)); 5478 } 5479 } 5480 } 5481 5482 /* 5483 * Lookup the block at the specified DVA (in bp), and return the manner in 5484 * which the block is cached. A zero return indicates not cached. 
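 *
 * Added usage sketch (illustrative only; the caller shown is
 * hypothetical):
 *
 *      int c = arc_cached(spa, bp);
 *      if (c == 0)
 *              ...  block is not cached at all  ...
 *      else if (c & ARC_CACHED_IN_L1)
 *              ...  data can be served from main memory  ...
 *      else if (c & ARC_CACHED_IN_L2)
 *              ...  only an L2ARC copy (plus header) exists  ...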
5485 */ 5486 int 5487 arc_cached(spa_t *spa, const blkptr_t *bp) 5488 { 5489 arc_buf_hdr_t *hdr = NULL; 5490 kmutex_t *hash_lock = NULL; 5491 uint64_t guid = spa_load_guid(spa); 5492 int flags = 0; 5493 5494 if (BP_IS_EMBEDDED(bp)) 5495 return (ARC_CACHED_EMBEDDED); 5496 5497 hdr = buf_hash_find(guid, bp, &hash_lock); 5498 if (hdr == NULL) 5499 return (0); 5500 5501 if (HDR_HAS_L1HDR(hdr)) { 5502 arc_state_t *state = hdr->b_l1hdr.b_state; 5503 /* 5504 * We switch to ensure that any future arc_state_type_t 5505 * changes are handled. This is just a shift to promote 5506 * more compile-time checking. 5507 */ 5508 switch (state->arcs_state) { 5509 case ARC_STATE_ANON: 5510 break; 5511 case ARC_STATE_MRU: 5512 flags |= ARC_CACHED_IN_MRU | ARC_CACHED_IN_L1; 5513 break; 5514 case ARC_STATE_MFU: 5515 flags |= ARC_CACHED_IN_MFU | ARC_CACHED_IN_L1; 5516 break; 5517 case ARC_STATE_UNCACHED: 5518 /* The header is still in L1, probably not for long */ 5519 flags |= ARC_CACHED_IN_L1; 5520 break; 5521 default: 5522 break; 5523 } 5524 } 5525 if (HDR_HAS_L2HDR(hdr)) 5526 flags |= ARC_CACHED_IN_L2; 5527 5528 mutex_exit(hash_lock); 5529 5530 return (flags); 5531 } 5532 5533 /* 5534 * "Read" the block at the specified DVA (in bp) via the 5535 * cache. If the block is found in the cache, invoke the provided 5536 * callback immediately and return. Note that the `zio' parameter 5537 * in the callback will be NULL in this case, since no IO was 5538 * required. If the block is not in the cache pass the read request 5539 * on to the spa with a substitute callback function, so that the 5540 * requested block will be added to the cache. 5541 * 5542 * If a read request arrives for a block that has a read in-progress, 5543 * either wait for the in-progress read to complete (and return the 5544 * results); or, if this is a read with a "done" func, add a record 5545 * to the read to invoke the "done" func when the read completes, 5546 * and return; or just return. 5547 * 5548 * arc_read_done() will invoke all the requested "done" functions 5549 * for readers of this block. 5550 */ 5551 int 5552 arc_read(zio_t *pio, spa_t *spa, const blkptr_t *bp, 5553 arc_read_done_func_t *done, void *private, zio_priority_t priority, 5554 int zio_flags, arc_flags_t *arc_flags, const zbookmark_phys_t *zb) 5555 { 5556 arc_buf_hdr_t *hdr = NULL; 5557 kmutex_t *hash_lock = NULL; 5558 zio_t *rzio; 5559 uint64_t guid = spa_load_guid(spa); 5560 boolean_t compressed_read = (zio_flags & ZIO_FLAG_RAW_COMPRESS) != 0; 5561 boolean_t encrypted_read = BP_IS_ENCRYPTED(bp) && 5562 (zio_flags & ZIO_FLAG_RAW_ENCRYPT) != 0; 5563 boolean_t noauth_read = BP_IS_AUTHENTICATED(bp) && 5564 (zio_flags & ZIO_FLAG_RAW_ENCRYPT) != 0; 5565 boolean_t embedded_bp = !!BP_IS_EMBEDDED(bp); 5566 boolean_t no_buf = *arc_flags & ARC_FLAG_NO_BUF; 5567 arc_buf_t *buf = NULL; 5568 int rc = 0; 5569 5570 ASSERT(!embedded_bp || 5571 BPE_GET_ETYPE(bp) == BP_EMBEDDED_TYPE_DATA); 5572 ASSERT(!BP_IS_HOLE(bp)); 5573 ASSERT(!BP_IS_REDACTED(bp)); 5574 5575 /* 5576 * Normally SPL_FSTRANS will already be set since kernel threads which 5577 * expect to call the DMU interfaces will set it when created. System 5578 * calls are similarly handled by setting/cleaning the bit in the 5579 * registered callback (module/os/.../zfs/zpl_*). 5580 * 5581 * External consumers such as Lustre which call the exported DMU 5582 * interfaces may not have set SPL_FSTRANS. To avoid a deadlock 5583 * on the hash_lock always set and clear the bit. 
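 *
 * Added caller sketch (illustrative only; the flags and priority shown
 * are one plausible combination, not a required one).  A simple
 * synchronous consumer looks roughly like:
 *
 *      arc_buf_t *abuf = NULL;
 *      arc_flags_t aflags = ARC_FLAG_WAIT;
 *      int err = arc_read(NULL, spa, bp, arc_getbuf_func, &abuf,
 *          ZIO_PRIORITY_SYNC_READ, ZIO_FLAG_CANFAIL, &aflags, zb);
 *      if (err == 0 && abuf != NULL) {
 *              ...  use abuf->b_data  ...
 *              arc_buf_destroy(abuf, &abuf);
 *      }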
5584 */ 5585 fstrans_cookie_t cookie = spl_fstrans_mark(); 5586 top: 5587 if (!embedded_bp) { 5588 /* 5589 * Embedded BP's have no DVA and require no I/O to "read". 5590 * Create an anonymous arc buf to back it. 5591 */ 5592 hdr = buf_hash_find(guid, bp, &hash_lock); 5593 } 5594 5595 /* 5596 * Determine if we have an L1 cache hit or a cache miss. For simplicity 5597 * we maintain encrypted data separately from compressed / uncompressed 5598 * data. If the user is requesting raw encrypted data and we don't have 5599 * that in the header we will read from disk to guarantee that we can 5600 * get it even if the encryption keys aren't loaded. 5601 */ 5602 if (hdr != NULL && HDR_HAS_L1HDR(hdr) && (HDR_HAS_RABD(hdr) || 5603 (hdr->b_l1hdr.b_pabd != NULL && !encrypted_read))) { 5604 boolean_t is_data = !HDR_ISTYPE_METADATA(hdr); 5605 5606 /* 5607 * Verify the block pointer contents are reasonable. This 5608 * should always be the case since the blkptr is protected by 5609 * a checksum. 5610 */ 5611 if (!zfs_blkptr_verify(spa, bp, BLK_CONFIG_SKIP, 5612 BLK_VERIFY_LOG)) { 5613 mutex_exit(hash_lock); 5614 rc = SET_ERROR(ECKSUM); 5615 goto done; 5616 } 5617 5618 if (HDR_IO_IN_PROGRESS(hdr)) { 5619 if (*arc_flags & ARC_FLAG_CACHED_ONLY) { 5620 mutex_exit(hash_lock); 5621 ARCSTAT_BUMP(arcstat_cached_only_in_progress); 5622 rc = SET_ERROR(ENOENT); 5623 goto done; 5624 } 5625 5626 zio_t *head_zio = hdr->b_l1hdr.b_acb->acb_zio_head; 5627 ASSERT3P(head_zio, !=, NULL); 5628 if ((hdr->b_flags & ARC_FLAG_PRIO_ASYNC_READ) && 5629 priority == ZIO_PRIORITY_SYNC_READ) { 5630 /* 5631 * This is a sync read that needs to wait for 5632 * an in-flight async read. Request that the 5633 * zio have its priority upgraded. 5634 */ 5635 zio_change_priority(head_zio, priority); 5636 DTRACE_PROBE1(arc__async__upgrade__sync, 5637 arc_buf_hdr_t *, hdr); 5638 ARCSTAT_BUMP(arcstat_async_upgrade_sync); 5639 } 5640 5641 DTRACE_PROBE1(arc__iohit, arc_buf_hdr_t *, hdr); 5642 arc_access(hdr, *arc_flags, B_FALSE); 5643 5644 /* 5645 * If there are multiple threads reading the same block 5646 * and that block is not yet in the ARC, then only one 5647 * thread will do the physical I/O and all other 5648 * threads will wait until that I/O completes. 5649 * Synchronous reads use the acb_wait_cv whereas nowait 5650 * reads register a callback. Both are signalled/called 5651 * in arc_read_done. 5652 * 5653 * Errors of the physical I/O may need to be propagated. 5654 * Synchronous read errors are returned here from 5655 * arc_read_done via acb_wait_error. Nowait reads 5656 * attach the acb_zio_dummy zio to pio and 5657 * arc_read_done propagates the physical I/O's io_error 5658 * to acb_zio_dummy, and thereby to pio. 
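 *
 * Added illustration (not part of the original comment): a nowait
 * caller typically passes its parent zio and a done callback, roughly
 *
 *      (void) arc_read(pio, spa, bp, my_done_cb, my_arg,
 *          ZIO_PRIORITY_ASYNC_READ, ZIO_FLAG_CANFAIL, &aflags, zb);
 *
 * (my_done_cb and my_arg are hypothetical; aflags includes
 * ARC_FLAG_NOWAIT), and a physical read error then reaches pio via the
 * acb_zio_dummy child created below rather than via the return value
 * of arc_read().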
5659 */ 5660 arc_callback_t *acb = NULL; 5661 if (done || pio || *arc_flags & ARC_FLAG_WAIT) { 5662 acb = kmem_zalloc(sizeof (arc_callback_t), 5663 KM_SLEEP); 5664 acb->acb_done = done; 5665 acb->acb_private = private; 5666 acb->acb_compressed = compressed_read; 5667 acb->acb_encrypted = encrypted_read; 5668 acb->acb_noauth = noauth_read; 5669 acb->acb_nobuf = no_buf; 5670 if (*arc_flags & ARC_FLAG_WAIT) { 5671 acb->acb_wait = B_TRUE; 5672 mutex_init(&acb->acb_wait_lock, NULL, 5673 MUTEX_DEFAULT, NULL); 5674 cv_init(&acb->acb_wait_cv, NULL, 5675 CV_DEFAULT, NULL); 5676 } 5677 acb->acb_zb = *zb; 5678 if (pio != NULL) { 5679 acb->acb_zio_dummy = zio_null(pio, 5680 spa, NULL, NULL, NULL, zio_flags); 5681 } 5682 acb->acb_zio_head = head_zio; 5683 acb->acb_next = hdr->b_l1hdr.b_acb; 5684 hdr->b_l1hdr.b_acb->acb_prev = acb; 5685 hdr->b_l1hdr.b_acb = acb; 5686 } 5687 mutex_exit(hash_lock); 5688 5689 ARCSTAT_BUMP(arcstat_iohits); 5690 ARCSTAT_CONDSTAT(!(*arc_flags & ARC_FLAG_PREFETCH), 5691 demand, prefetch, is_data, data, metadata, iohits); 5692 5693 if (*arc_flags & ARC_FLAG_WAIT) { 5694 mutex_enter(&acb->acb_wait_lock); 5695 while (acb->acb_wait) { 5696 cv_wait(&acb->acb_wait_cv, 5697 &acb->acb_wait_lock); 5698 } 5699 rc = acb->acb_wait_error; 5700 mutex_exit(&acb->acb_wait_lock); 5701 mutex_destroy(&acb->acb_wait_lock); 5702 cv_destroy(&acb->acb_wait_cv); 5703 kmem_free(acb, sizeof (arc_callback_t)); 5704 } 5705 goto out; 5706 } 5707 5708 ASSERT(hdr->b_l1hdr.b_state == arc_mru || 5709 hdr->b_l1hdr.b_state == arc_mfu || 5710 hdr->b_l1hdr.b_state == arc_uncached); 5711 5712 DTRACE_PROBE1(arc__hit, arc_buf_hdr_t *, hdr); 5713 arc_access(hdr, *arc_flags, B_TRUE); 5714 5715 if (done && !no_buf) { 5716 ASSERT(!embedded_bp || !BP_IS_HOLE(bp)); 5717 5718 /* Get a buf with the desired data in it. */ 5719 rc = arc_buf_alloc_impl(hdr, spa, zb, private, 5720 encrypted_read, compressed_read, noauth_read, 5721 B_TRUE, &buf); 5722 if (rc == ECKSUM) { 5723 /* 5724 * Convert authentication and decryption errors 5725 * to EIO (and generate an ereport if needed) 5726 * before leaving the ARC. 5727 */ 5728 rc = SET_ERROR(EIO); 5729 if ((zio_flags & ZIO_FLAG_SPECULATIVE) == 0) { 5730 spa_log_error(spa, zb, hdr->b_birth); 5731 (void) zfs_ereport_post( 5732 FM_EREPORT_ZFS_AUTHENTICATION, 5733 spa, NULL, zb, NULL, 0); 5734 } 5735 } 5736 if (rc != 0) { 5737 arc_buf_destroy_impl(buf); 5738 buf = NULL; 5739 (void) remove_reference(hdr, private); 5740 } 5741 5742 /* assert any errors weren't due to unloaded keys */ 5743 ASSERT((zio_flags & ZIO_FLAG_SPECULATIVE) || 5744 rc != EACCES); 5745 } 5746 mutex_exit(hash_lock); 5747 ARCSTAT_BUMP(arcstat_hits); 5748 ARCSTAT_CONDSTAT(!(*arc_flags & ARC_FLAG_PREFETCH), 5749 demand, prefetch, is_data, data, metadata, hits); 5750 *arc_flags |= ARC_FLAG_CACHED; 5751 goto done; 5752 } else { 5753 uint64_t lsize = BP_GET_LSIZE(bp); 5754 uint64_t psize = BP_GET_PSIZE(bp); 5755 arc_callback_t *acb; 5756 vdev_t *vd = NULL; 5757 uint64_t addr = 0; 5758 boolean_t devw = B_FALSE; 5759 uint64_t size; 5760 abd_t *hdr_abd; 5761 int alloc_flags = encrypted_read ? ARC_HDR_ALLOC_RDATA : 0; 5762 arc_buf_contents_t type = BP_GET_BUFC_TYPE(bp); 5763 5764 if (*arc_flags & ARC_FLAG_CACHED_ONLY) { 5765 if (hash_lock != NULL) 5766 mutex_exit(hash_lock); 5767 rc = SET_ERROR(ENOENT); 5768 goto done; 5769 } 5770 5771 /* 5772 * Verify the block pointer contents are reasonable. This 5773 * should always be the case since the blkptr is protected by 5774 * a checksum. 
5775 */ 5776 if (!zfs_blkptr_verify(spa, bp, 5777 (zio_flags & ZIO_FLAG_CONFIG_WRITER) ? 5778 BLK_CONFIG_HELD : BLK_CONFIG_NEEDED, BLK_VERIFY_LOG)) { 5779 if (hash_lock != NULL) 5780 mutex_exit(hash_lock); 5781 rc = SET_ERROR(ECKSUM); 5782 goto done; 5783 } 5784 5785 if (hdr == NULL) { 5786 /* 5787 * This block is not in the cache or it has 5788 * embedded data. 5789 */ 5790 arc_buf_hdr_t *exists = NULL; 5791 hdr = arc_hdr_alloc(guid, psize, lsize, 5792 BP_IS_PROTECTED(bp), BP_GET_COMPRESS(bp), 0, type); 5793 5794 if (!embedded_bp) { 5795 hdr->b_dva = *BP_IDENTITY(bp); 5796 hdr->b_birth = BP_GET_BIRTH(bp); 5797 exists = buf_hash_insert(hdr, &hash_lock); 5798 } 5799 if (exists != NULL) { 5800 /* somebody beat us to the hash insert */ 5801 mutex_exit(hash_lock); 5802 buf_discard_identity(hdr); 5803 arc_hdr_destroy(hdr); 5804 goto top; /* restart the IO request */ 5805 } 5806 } else { 5807 /* 5808 * This block is in the ghost cache or encrypted data 5809 * was requested and we didn't have it. If it was 5810 * L2-only (and thus didn't have an L1 hdr), 5811 * we realloc the header to add an L1 hdr. 5812 */ 5813 if (!HDR_HAS_L1HDR(hdr)) { 5814 hdr = arc_hdr_realloc(hdr, hdr_l2only_cache, 5815 hdr_full_cache); 5816 } 5817 5818 if (GHOST_STATE(hdr->b_l1hdr.b_state)) { 5819 ASSERT3P(hdr->b_l1hdr.b_pabd, ==, NULL); 5820 ASSERT(!HDR_HAS_RABD(hdr)); 5821 ASSERT(!HDR_IO_IN_PROGRESS(hdr)); 5822 ASSERT0(zfs_refcount_count( 5823 &hdr->b_l1hdr.b_refcnt)); 5824 ASSERT3P(hdr->b_l1hdr.b_buf, ==, NULL); 5825 #ifdef ZFS_DEBUG 5826 ASSERT3P(hdr->b_l1hdr.b_freeze_cksum, ==, NULL); 5827 #endif 5828 } else if (HDR_IO_IN_PROGRESS(hdr)) { 5829 /* 5830 * If this header already had an IO in progress 5831 * and we are performing another IO to fetch 5832 * encrypted data we must wait until the first 5833 * IO completes so as not to confuse 5834 * arc_read_done(). This should be very rare 5835 * and so the performance impact shouldn't 5836 * matter. 5837 */ 5838 arc_callback_t *acb = kmem_zalloc( 5839 sizeof (arc_callback_t), KM_SLEEP); 5840 acb->acb_wait = B_TRUE; 5841 mutex_init(&acb->acb_wait_lock, NULL, 5842 MUTEX_DEFAULT, NULL); 5843 cv_init(&acb->acb_wait_cv, NULL, CV_DEFAULT, 5844 NULL); 5845 acb->acb_zio_head = 5846 hdr->b_l1hdr.b_acb->acb_zio_head; 5847 acb->acb_next = hdr->b_l1hdr.b_acb; 5848 hdr->b_l1hdr.b_acb->acb_prev = acb; 5849 hdr->b_l1hdr.b_acb = acb; 5850 mutex_exit(hash_lock); 5851 mutex_enter(&acb->acb_wait_lock); 5852 while (acb->acb_wait) { 5853 cv_wait(&acb->acb_wait_cv, 5854 &acb->acb_wait_lock); 5855 } 5856 mutex_exit(&acb->acb_wait_lock); 5857 mutex_destroy(&acb->acb_wait_lock); 5858 cv_destroy(&acb->acb_wait_cv); 5859 kmem_free(acb, sizeof (arc_callback_t)); 5860 goto top; 5861 } 5862 } 5863 if (*arc_flags & ARC_FLAG_UNCACHED) { 5864 arc_hdr_set_flags(hdr, ARC_FLAG_UNCACHED); 5865 if (!encrypted_read) 5866 alloc_flags |= ARC_HDR_ALLOC_LINEAR; 5867 } 5868 5869 /* 5870 * Take additional reference for IO_IN_PROGRESS. It stops 5871 * arc_access() from putting this header without any buffers 5872 * and so other references but obviously nonevictable onto 5873 * the evictable list of MRU or MFU state. 
5874 */ 5875 add_reference(hdr, hdr); 5876 if (!embedded_bp) 5877 arc_access(hdr, *arc_flags, B_FALSE); 5878 arc_hdr_set_flags(hdr, ARC_FLAG_IO_IN_PROGRESS); 5879 arc_hdr_alloc_abd(hdr, alloc_flags); 5880 if (encrypted_read) { 5881 ASSERT(HDR_HAS_RABD(hdr)); 5882 size = HDR_GET_PSIZE(hdr); 5883 hdr_abd = hdr->b_crypt_hdr.b_rabd; 5884 zio_flags |= ZIO_FLAG_RAW; 5885 } else { 5886 ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL); 5887 size = arc_hdr_size(hdr); 5888 hdr_abd = hdr->b_l1hdr.b_pabd; 5889 5890 if (arc_hdr_get_compress(hdr) != ZIO_COMPRESS_OFF) { 5891 zio_flags |= ZIO_FLAG_RAW_COMPRESS; 5892 } 5893 5894 /* 5895 * For authenticated bp's, we do not ask the ZIO layer 5896 * to authenticate them since this will cause the entire 5897 * IO to fail if the key isn't loaded. Instead, we 5898 * defer authentication until arc_buf_fill(), which will 5899 * verify the data when the key is available. 5900 */ 5901 if (BP_IS_AUTHENTICATED(bp)) 5902 zio_flags |= ZIO_FLAG_RAW_ENCRYPT; 5903 } 5904 5905 if (BP_IS_AUTHENTICATED(bp)) 5906 arc_hdr_set_flags(hdr, ARC_FLAG_NOAUTH); 5907 if (BP_GET_LEVEL(bp) > 0) 5908 arc_hdr_set_flags(hdr, ARC_FLAG_INDIRECT); 5909 ASSERT(!GHOST_STATE(hdr->b_l1hdr.b_state)); 5910 5911 acb = kmem_zalloc(sizeof (arc_callback_t), KM_SLEEP); 5912 acb->acb_done = done; 5913 acb->acb_private = private; 5914 acb->acb_compressed = compressed_read; 5915 acb->acb_encrypted = encrypted_read; 5916 acb->acb_noauth = noauth_read; 5917 acb->acb_zb = *zb; 5918 5919 ASSERT3P(hdr->b_l1hdr.b_acb, ==, NULL); 5920 hdr->b_l1hdr.b_acb = acb; 5921 5922 if (HDR_HAS_L2HDR(hdr) && 5923 (vd = hdr->b_l2hdr.b_dev->l2ad_vdev) != NULL) { 5924 devw = hdr->b_l2hdr.b_dev->l2ad_writing; 5925 addr = hdr->b_l2hdr.b_daddr; 5926 /* 5927 * Lock out L2ARC device removal. 5928 */ 5929 if (vdev_is_dead(vd) || 5930 !spa_config_tryenter(spa, SCL_L2ARC, vd, RW_READER)) 5931 vd = NULL; 5932 } 5933 5934 /* 5935 * We count both async reads and scrub IOs as asynchronous so 5936 * that both can be upgraded in the event of a cache hit while 5937 * the read IO is still in-flight. 5938 */ 5939 if (priority == ZIO_PRIORITY_ASYNC_READ || 5940 priority == ZIO_PRIORITY_SCRUB) 5941 arc_hdr_set_flags(hdr, ARC_FLAG_PRIO_ASYNC_READ); 5942 else 5943 arc_hdr_clear_flags(hdr, ARC_FLAG_PRIO_ASYNC_READ); 5944 5945 /* 5946 * At this point, we have a level 1 cache miss or a blkptr 5947 * with embedded data. Try again in L2ARC if possible. 5948 */ 5949 ASSERT3U(HDR_GET_LSIZE(hdr), ==, lsize); 5950 5951 /* 5952 * Skip ARC stat bump for block pointers with embedded 5953 * data. The data are read from the blkptr itself via 5954 * decode_embedded_bp_compressed(). 5955 */ 5956 if (!embedded_bp) { 5957 DTRACE_PROBE4(arc__miss, arc_buf_hdr_t *, hdr, 5958 blkptr_t *, bp, uint64_t, lsize, 5959 zbookmark_phys_t *, zb); 5960 ARCSTAT_BUMP(arcstat_misses); 5961 ARCSTAT_CONDSTAT(!(*arc_flags & ARC_FLAG_PREFETCH), 5962 demand, prefetch, !HDR_ISTYPE_METADATA(hdr), data, 5963 metadata, misses); 5964 zfs_racct_read(size, 1); 5965 } 5966 5967 /* Check if the spa even has l2 configured */ 5968 const boolean_t spa_has_l2 = l2arc_ndev != 0 && 5969 spa->spa_l2cache.sav_count > 0; 5970 5971 if (vd != NULL && spa_has_l2 && !(l2arc_norw && devw)) { 5972 /* 5973 * Read from the L2ARC if the following are true: 5974 * 1. The L2ARC vdev was previously cached. 5975 * 2. This buffer still has L2ARC metadata. 5976 * 3. This buffer isn't currently writing to the L2ARC. 5977 * 4. The L2ARC entry wasn't evicted, which may 5978 * also have invalidated the vdev. 
5979 */ 5980 if (HDR_HAS_L2HDR(hdr) && 5981 !HDR_L2_WRITING(hdr) && !HDR_L2_EVICTED(hdr)) { 5982 l2arc_read_callback_t *cb; 5983 abd_t *abd; 5984 uint64_t asize; 5985 5986 DTRACE_PROBE1(l2arc__hit, arc_buf_hdr_t *, hdr); 5987 ARCSTAT_BUMP(arcstat_l2_hits); 5988 hdr->b_l2hdr.b_hits++; 5989 5990 cb = kmem_zalloc(sizeof (l2arc_read_callback_t), 5991 KM_SLEEP); 5992 cb->l2rcb_hdr = hdr; 5993 cb->l2rcb_bp = *bp; 5994 cb->l2rcb_zb = *zb; 5995 cb->l2rcb_flags = zio_flags; 5996 5997 /* 5998 * When Compressed ARC is disabled, but the 5999 * L2ARC block is compressed, arc_hdr_size() 6000 * will have returned LSIZE rather than PSIZE. 6001 */ 6002 if (HDR_GET_COMPRESS(hdr) != ZIO_COMPRESS_OFF && 6003 !HDR_COMPRESSION_ENABLED(hdr) && 6004 HDR_GET_PSIZE(hdr) != 0) { 6005 size = HDR_GET_PSIZE(hdr); 6006 } 6007 6008 asize = vdev_psize_to_asize(vd, size); 6009 if (asize != size) { 6010 abd = abd_alloc_for_io(asize, 6011 HDR_ISTYPE_METADATA(hdr)); 6012 cb->l2rcb_abd = abd; 6013 } else { 6014 abd = hdr_abd; 6015 } 6016 6017 ASSERT(addr >= VDEV_LABEL_START_SIZE && 6018 addr + asize <= vd->vdev_psize - 6019 VDEV_LABEL_END_SIZE); 6020 6021 /* 6022 * l2arc read. The SCL_L2ARC lock will be 6023 * released by l2arc_read_done(). 6024 * Issue a null zio if the underlying buffer 6025 * was squashed to zero size by compression. 6026 */ 6027 ASSERT3U(arc_hdr_get_compress(hdr), !=, 6028 ZIO_COMPRESS_EMPTY); 6029 rzio = zio_read_phys(pio, vd, addr, 6030 asize, abd, 6031 ZIO_CHECKSUM_OFF, 6032 l2arc_read_done, cb, priority, 6033 zio_flags | ZIO_FLAG_CANFAIL | 6034 ZIO_FLAG_DONT_PROPAGATE | 6035 ZIO_FLAG_DONT_RETRY, B_FALSE); 6036 acb->acb_zio_head = rzio; 6037 6038 if (hash_lock != NULL) 6039 mutex_exit(hash_lock); 6040 6041 DTRACE_PROBE2(l2arc__read, vdev_t *, vd, 6042 zio_t *, rzio); 6043 ARCSTAT_INCR(arcstat_l2_read_bytes, 6044 HDR_GET_PSIZE(hdr)); 6045 6046 if (*arc_flags & ARC_FLAG_NOWAIT) { 6047 zio_nowait(rzio); 6048 goto out; 6049 } 6050 6051 ASSERT(*arc_flags & ARC_FLAG_WAIT); 6052 if (zio_wait(rzio) == 0) 6053 goto out; 6054 6055 /* l2arc read error; goto zio_read() */ 6056 if (hash_lock != NULL) 6057 mutex_enter(hash_lock); 6058 } else { 6059 DTRACE_PROBE1(l2arc__miss, 6060 arc_buf_hdr_t *, hdr); 6061 ARCSTAT_BUMP(arcstat_l2_misses); 6062 if (HDR_L2_WRITING(hdr)) 6063 ARCSTAT_BUMP(arcstat_l2_rw_clash); 6064 spa_config_exit(spa, SCL_L2ARC, vd); 6065 } 6066 } else { 6067 if (vd != NULL) 6068 spa_config_exit(spa, SCL_L2ARC, vd); 6069 6070 /* 6071 * Only a spa with l2 should contribute to l2 6072 * miss stats. (Including the case of having a 6073 * faulted cache device - that's also a miss.) 6074 */ 6075 if (spa_has_l2) { 6076 /* 6077 * Skip ARC stat bump for block pointers with 6078 * embedded data. The data are read from the 6079 * blkptr itself via 6080 * decode_embedded_bp_compressed(). 
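 *
 * Added note (not part of the original comment, and only a rough
 * description): an embedded bp carries its small, typically compressed
 * payload in the words of the block pointer that would normally hold
 * DVAs and checksum (up to BPE_PAYLOAD_SIZE bytes), so the zio_read()
 * issued further below never touches a vdev; the payload is decoded in
 * memory via decode_embedded_bp_compressed().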
6081 */ 6082 if (!embedded_bp) { 6083 DTRACE_PROBE1(l2arc__miss, 6084 arc_buf_hdr_t *, hdr); 6085 ARCSTAT_BUMP(arcstat_l2_misses); 6086 } 6087 } 6088 } 6089 6090 rzio = zio_read(pio, spa, bp, hdr_abd, size, 6091 arc_read_done, hdr, priority, zio_flags, zb); 6092 acb->acb_zio_head = rzio; 6093 6094 if (hash_lock != NULL) 6095 mutex_exit(hash_lock); 6096 6097 if (*arc_flags & ARC_FLAG_WAIT) { 6098 rc = zio_wait(rzio); 6099 goto out; 6100 } 6101 6102 ASSERT(*arc_flags & ARC_FLAG_NOWAIT); 6103 zio_nowait(rzio); 6104 } 6105 6106 out: 6107 /* embedded bps don't actually go to disk */ 6108 if (!embedded_bp) 6109 spa_read_history_add(spa, zb, *arc_flags); 6110 spl_fstrans_unmark(cookie); 6111 return (rc); 6112 6113 done: 6114 if (done) 6115 done(NULL, zb, bp, buf, private); 6116 if (pio && rc != 0) { 6117 zio_t *zio = zio_null(pio, spa, NULL, NULL, NULL, zio_flags); 6118 zio->io_error = rc; 6119 zio_nowait(zio); 6120 } 6121 goto out; 6122 } 6123 6124 arc_prune_t * 6125 arc_add_prune_callback(arc_prune_func_t *func, void *private) 6126 { 6127 arc_prune_t *p; 6128 6129 p = kmem_alloc(sizeof (*p), KM_SLEEP); 6130 p->p_pfunc = func; 6131 p->p_private = private; 6132 list_link_init(&p->p_node); 6133 zfs_refcount_create(&p->p_refcnt); 6134 6135 mutex_enter(&arc_prune_mtx); 6136 zfs_refcount_add(&p->p_refcnt, &arc_prune_list); 6137 list_insert_head(&arc_prune_list, p); 6138 mutex_exit(&arc_prune_mtx); 6139 6140 return (p); 6141 } 6142 6143 void 6144 arc_remove_prune_callback(arc_prune_t *p) 6145 { 6146 boolean_t wait = B_FALSE; 6147 mutex_enter(&arc_prune_mtx); 6148 list_remove(&arc_prune_list, p); 6149 if (zfs_refcount_remove(&p->p_refcnt, &arc_prune_list) > 0) 6150 wait = B_TRUE; 6151 mutex_exit(&arc_prune_mtx); 6152 6153 /* wait for arc_prune_task to finish */ 6154 if (wait) 6155 taskq_wait_outstanding(arc_prune_taskq, 0); 6156 ASSERT0(zfs_refcount_count(&p->p_refcnt)); 6157 zfs_refcount_destroy(&p->p_refcnt); 6158 kmem_free(p, sizeof (*p)); 6159 } 6160 6161 /* 6162 * Helper function for arc_prune_async() it is responsible for safely 6163 * handling the execution of a registered arc_prune_func_t. 6164 */ 6165 static void 6166 arc_prune_task(void *ptr) 6167 { 6168 arc_prune_t *ap = (arc_prune_t *)ptr; 6169 arc_prune_func_t *func = ap->p_pfunc; 6170 6171 if (func != NULL) 6172 func(ap->p_adjust, ap->p_private); 6173 6174 (void) zfs_refcount_remove(&ap->p_refcnt, func); 6175 } 6176 6177 /* 6178 * Notify registered consumers they must drop holds on a portion of the ARC 6179 * buffers they reference. This provides a mechanism to ensure the ARC can 6180 * honor the metadata limit and reclaim otherwise pinned ARC buffers. 6181 * 6182 * This operation is performed asynchronously so it may be safely called 6183 * in the context of the arc_reclaim_thread(). A reference is taken here 6184 * for each registered arc_prune_t and the arc_prune_task() is responsible 6185 * for releasing it once the registered arc_prune_func_t has completed. 
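 *
 * Added usage sketch (illustrative only; the callback name is
 * hypothetical and the exact prototype is given by arc_prune_func_t):
 *
 *      static void
 *      my_prune_cb(uint64_t nr_to_scan, void *arg)
 *      {
 *              ...  drop up to nr_to_scan cached holds on ARC-backed
 *                   objects  ...
 *      }
 *
 *      arc_prune_t *p = arc_add_prune_callback(my_prune_cb, my_arg);
 *      ...
 *      arc_remove_prune_callback(p);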
6186 */ 6187 static void 6188 arc_prune_async(uint64_t adjust) 6189 { 6190 arc_prune_t *ap; 6191 6192 mutex_enter(&arc_prune_mtx); 6193 for (ap = list_head(&arc_prune_list); ap != NULL; 6194 ap = list_next(&arc_prune_list, ap)) { 6195 6196 if (zfs_refcount_count(&ap->p_refcnt) >= 2) 6197 continue; 6198 6199 zfs_refcount_add(&ap->p_refcnt, ap->p_pfunc); 6200 ap->p_adjust = adjust; 6201 if (taskq_dispatch(arc_prune_taskq, arc_prune_task, 6202 ap, TQ_SLEEP) == TASKQID_INVALID) { 6203 (void) zfs_refcount_remove(&ap->p_refcnt, ap->p_pfunc); 6204 continue; 6205 } 6206 ARCSTAT_BUMP(arcstat_prune); 6207 } 6208 mutex_exit(&arc_prune_mtx); 6209 } 6210 6211 /* 6212 * Notify the arc that a block was freed, and thus will never be used again. 6213 */ 6214 void 6215 arc_freed(spa_t *spa, const blkptr_t *bp) 6216 { 6217 arc_buf_hdr_t *hdr; 6218 kmutex_t *hash_lock; 6219 uint64_t guid = spa_load_guid(spa); 6220 6221 ASSERT(!BP_IS_EMBEDDED(bp)); 6222 6223 hdr = buf_hash_find(guid, bp, &hash_lock); 6224 if (hdr == NULL) 6225 return; 6226 6227 /* 6228 * We might be trying to free a block that is still doing I/O 6229 * (i.e. prefetch) or has some other reference (i.e. a dedup-ed, 6230 * dmu_sync-ed block). A block may also have a reference if it is 6231 * part of a dedup-ed, dmu_synced write. The dmu_sync() function would 6232 * have written the new block to its final resting place on disk but 6233 * without the dedup flag set. This would have left the hdr in the MRU 6234 * state and discoverable. When the txg finally syncs it detects that 6235 * the block was overridden in open context and issues an override I/O. 6236 * Since this is a dedup block, the override I/O will determine if the 6237 * block is already in the DDT. If so, then it will replace the io_bp 6238 * with the bp from the DDT and allow the I/O to finish. When the I/O 6239 * reaches the done callback, dbuf_write_override_done, it will 6240 * check to see if the io_bp and io_bp_override are identical. 6241 * If they are not, then it indicates that the bp was replaced with 6242 * the bp in the DDT and the override bp is freed. This allows 6243 * us to arrive here with a reference on a block that is being 6244 * freed. So if we have an I/O in progress, or a reference to 6245 * this hdr, then we don't destroy the hdr. 6246 */ 6247 if (!HDR_HAS_L1HDR(hdr) || 6248 zfs_refcount_is_zero(&hdr->b_l1hdr.b_refcnt)) { 6249 arc_change_state(arc_anon, hdr); 6250 arc_hdr_destroy(hdr); 6251 mutex_exit(hash_lock); 6252 } else { 6253 mutex_exit(hash_lock); 6254 } 6255 6256 } 6257 6258 /* 6259 * Release this buffer from the cache, making it an anonymous buffer. This 6260 * must be done after a read and prior to modifying the buffer contents. 6261 * If the buffer has more than one reference, we must make 6262 * a new hdr for the buffer. 6263 */ 6264 void 6265 arc_release(arc_buf_t *buf, const void *tag) 6266 { 6267 arc_buf_hdr_t *hdr = buf->b_hdr; 6268 6269 /* 6270 * It would be nice to assert that if its DMU metadata (level > 6271 * 0 || it's the dnode file), then it must be syncing context. 6272 * But we don't know that information at this level. 6273 */ 6274 6275 ASSERT(HDR_HAS_L1HDR(hdr)); 6276 6277 /* 6278 * We don't grab the hash lock prior to this check, because if 6279 * the buffer's header is in the arc_anon state, it won't be 6280 * linked into the hash table. 
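 *
 * Added caller sketch (illustrative only): the dbuf layer conceptually
 * does
 *
 *      arc_release(db->db_buf, db);
 *
 * before modifying a buffer in place; afterwards the buffer is
 * anonymous, may be written to freely, and is re-inserted into the
 * hash table only when a later arc_write() assigns it a new identity.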
6281 */ 6282 if (hdr->b_l1hdr.b_state == arc_anon) { 6283 ASSERT(!HDR_IO_IN_PROGRESS(hdr)); 6284 ASSERT(!HDR_IN_HASH_TABLE(hdr)); 6285 ASSERT(!HDR_HAS_L2HDR(hdr)); 6286 6287 ASSERT3P(hdr->b_l1hdr.b_buf, ==, buf); 6288 ASSERT(ARC_BUF_LAST(buf)); 6289 ASSERT3S(zfs_refcount_count(&hdr->b_l1hdr.b_refcnt), ==, 1); 6290 ASSERT(!multilist_link_active(&hdr->b_l1hdr.b_arc_node)); 6291 6292 hdr->b_l1hdr.b_arc_access = 0; 6293 6294 /* 6295 * If the buf is being overridden then it may already 6296 * have a hdr that is not empty. 6297 */ 6298 buf_discard_identity(hdr); 6299 arc_buf_thaw(buf); 6300 6301 return; 6302 } 6303 6304 kmutex_t *hash_lock = HDR_LOCK(hdr); 6305 mutex_enter(hash_lock); 6306 6307 /* 6308 * This assignment is only valid as long as the hash_lock is 6309 * held, we must be careful not to reference state or the 6310 * b_state field after dropping the lock. 6311 */ 6312 arc_state_t *state = hdr->b_l1hdr.b_state; 6313 ASSERT3P(hash_lock, ==, HDR_LOCK(hdr)); 6314 ASSERT3P(state, !=, arc_anon); 6315 6316 /* this buffer is not on any list */ 6317 ASSERT3S(zfs_refcount_count(&hdr->b_l1hdr.b_refcnt), >, 0); 6318 6319 if (HDR_HAS_L2HDR(hdr)) { 6320 mutex_enter(&hdr->b_l2hdr.b_dev->l2ad_mtx); 6321 6322 /* 6323 * We have to recheck this conditional again now that 6324 * we're holding the l2ad_mtx to prevent a race with 6325 * another thread which might be concurrently calling 6326 * l2arc_evict(). In that case, l2arc_evict() might have 6327 * destroyed the header's L2 portion as we were waiting 6328 * to acquire the l2ad_mtx. 6329 */ 6330 if (HDR_HAS_L2HDR(hdr)) 6331 arc_hdr_l2hdr_destroy(hdr); 6332 6333 mutex_exit(&hdr->b_l2hdr.b_dev->l2ad_mtx); 6334 } 6335 6336 /* 6337 * Do we have more than one buf? 6338 */ 6339 if (hdr->b_l1hdr.b_buf != buf || !ARC_BUF_LAST(buf)) { 6340 arc_buf_hdr_t *nhdr; 6341 uint64_t spa = hdr->b_spa; 6342 uint64_t psize = HDR_GET_PSIZE(hdr); 6343 uint64_t lsize = HDR_GET_LSIZE(hdr); 6344 boolean_t protected = HDR_PROTECTED(hdr); 6345 enum zio_compress compress = arc_hdr_get_compress(hdr); 6346 arc_buf_contents_t type = arc_buf_type(hdr); 6347 VERIFY3U(hdr->b_type, ==, type); 6348 6349 ASSERT(hdr->b_l1hdr.b_buf != buf || buf->b_next != NULL); 6350 VERIFY3S(remove_reference(hdr, tag), >, 0); 6351 6352 if (ARC_BUF_SHARED(buf) && !ARC_BUF_COMPRESSED(buf)) { 6353 ASSERT3P(hdr->b_l1hdr.b_buf, !=, buf); 6354 ASSERT(ARC_BUF_LAST(buf)); 6355 } 6356 6357 /* 6358 * Pull the data off of this hdr and attach it to 6359 * a new anonymous hdr. Also find the last buffer 6360 * in the hdr's buffer list. 6361 */ 6362 arc_buf_t *lastbuf = arc_buf_remove(hdr, buf); 6363 ASSERT3P(lastbuf, !=, NULL); 6364 6365 /* 6366 * If the current arc_buf_t and the hdr are sharing their data 6367 * buffer, then we must stop sharing that block. 6368 */ 6369 if (ARC_BUF_SHARED(buf)) { 6370 ASSERT3P(hdr->b_l1hdr.b_buf, !=, buf); 6371 ASSERT(!arc_buf_is_shared(lastbuf)); 6372 6373 /* 6374 * First, sever the block sharing relationship between 6375 * buf and the arc_buf_hdr_t. 6376 */ 6377 arc_unshare_buf(hdr, buf); 6378 6379 /* 6380 * Now we need to recreate the hdr's b_pabd. Since we 6381 * have lastbuf handy, we try to share with it, but if 6382 * we can't then we allocate a new b_pabd and copy the 6383 * data from buf into it. 
6384 */ 6385 if (arc_can_share(hdr, lastbuf)) { 6386 arc_share_buf(hdr, lastbuf); 6387 } else { 6388 arc_hdr_alloc_abd(hdr, 0); 6389 abd_copy_from_buf(hdr->b_l1hdr.b_pabd, 6390 buf->b_data, psize); 6391 } 6392 VERIFY3P(lastbuf->b_data, !=, NULL); 6393 } else if (HDR_SHARED_DATA(hdr)) { 6394 /* 6395 * Uncompressed shared buffers are always at the end 6396 * of the list. Compressed buffers don't have the 6397 * same requirements. This makes it hard to 6398 * simply assert that the lastbuf is shared so 6399 * we rely on the hdr's compression flags to determine 6400 * if we have a compressed, shared buffer. 6401 */ 6402 ASSERT(arc_buf_is_shared(lastbuf) || 6403 arc_hdr_get_compress(hdr) != ZIO_COMPRESS_OFF); 6404 ASSERT(!arc_buf_is_shared(buf)); 6405 } 6406 6407 ASSERT(hdr->b_l1hdr.b_pabd != NULL || HDR_HAS_RABD(hdr)); 6408 ASSERT3P(state, !=, arc_l2c_only); 6409 6410 (void) zfs_refcount_remove_many(&state->arcs_size[type], 6411 arc_buf_size(buf), buf); 6412 6413 if (zfs_refcount_is_zero(&hdr->b_l1hdr.b_refcnt)) { 6414 ASSERT3P(state, !=, arc_l2c_only); 6415 (void) zfs_refcount_remove_many( 6416 &state->arcs_esize[type], 6417 arc_buf_size(buf), buf); 6418 } 6419 6420 arc_cksum_verify(buf); 6421 arc_buf_unwatch(buf); 6422 6423 /* if this is the last uncompressed buf free the checksum */ 6424 if (!arc_hdr_has_uncompressed_buf(hdr)) 6425 arc_cksum_free(hdr); 6426 6427 mutex_exit(hash_lock); 6428 6429 nhdr = arc_hdr_alloc(spa, psize, lsize, protected, 6430 compress, hdr->b_complevel, type); 6431 ASSERT3P(nhdr->b_l1hdr.b_buf, ==, NULL); 6432 ASSERT0(zfs_refcount_count(&nhdr->b_l1hdr.b_refcnt)); 6433 VERIFY3U(nhdr->b_type, ==, type); 6434 ASSERT(!HDR_SHARED_DATA(nhdr)); 6435 6436 nhdr->b_l1hdr.b_buf = buf; 6437 (void) zfs_refcount_add(&nhdr->b_l1hdr.b_refcnt, tag); 6438 buf->b_hdr = nhdr; 6439 6440 (void) zfs_refcount_add_many(&arc_anon->arcs_size[type], 6441 arc_buf_size(buf), buf); 6442 } else { 6443 ASSERT(zfs_refcount_count(&hdr->b_l1hdr.b_refcnt) == 1); 6444 /* protected by hash lock, or hdr is on arc_anon */ 6445 ASSERT(!multilist_link_active(&hdr->b_l1hdr.b_arc_node)); 6446 ASSERT(!HDR_IO_IN_PROGRESS(hdr)); 6447 hdr->b_l1hdr.b_mru_hits = 0; 6448 hdr->b_l1hdr.b_mru_ghost_hits = 0; 6449 hdr->b_l1hdr.b_mfu_hits = 0; 6450 hdr->b_l1hdr.b_mfu_ghost_hits = 0; 6451 arc_change_state(arc_anon, hdr); 6452 hdr->b_l1hdr.b_arc_access = 0; 6453 6454 mutex_exit(hash_lock); 6455 buf_discard_identity(hdr); 6456 arc_buf_thaw(buf); 6457 } 6458 } 6459 6460 int 6461 arc_released(arc_buf_t *buf) 6462 { 6463 return (buf->b_data != NULL && 6464 buf->b_hdr->b_l1hdr.b_state == arc_anon); 6465 } 6466 6467 #ifdef ZFS_DEBUG 6468 int 6469 arc_referenced(arc_buf_t *buf) 6470 { 6471 return (zfs_refcount_count(&buf->b_hdr->b_l1hdr.b_refcnt)); 6472 } 6473 #endif 6474 6475 static void 6476 arc_write_ready(zio_t *zio) 6477 { 6478 arc_write_callback_t *callback = zio->io_private; 6479 arc_buf_t *buf = callback->awcb_buf; 6480 arc_buf_hdr_t *hdr = buf->b_hdr; 6481 blkptr_t *bp = zio->io_bp; 6482 uint64_t psize = BP_IS_HOLE(bp) ? 0 : BP_GET_PSIZE(bp); 6483 fstrans_cookie_t cookie = spl_fstrans_mark(); 6484 6485 ASSERT(HDR_HAS_L1HDR(hdr)); 6486 ASSERT(!zfs_refcount_is_zero(&buf->b_hdr->b_l1hdr.b_refcnt)); 6487 ASSERT3P(hdr->b_l1hdr.b_buf, !=, NULL); 6488 6489 /* 6490 * If we're reexecuting this zio because the pool suspended, then 6491 * cleanup any state that was previously set the first time the 6492 * callback was invoked. 
6493 */ 6494 if (zio->io_flags & ZIO_FLAG_REEXECUTED) { 6495 arc_cksum_free(hdr); 6496 arc_buf_unwatch(buf); 6497 if (hdr->b_l1hdr.b_pabd != NULL) { 6498 if (ARC_BUF_SHARED(buf)) { 6499 arc_unshare_buf(hdr, buf); 6500 } else { 6501 ASSERT(!arc_buf_is_shared(buf)); 6502 arc_hdr_free_abd(hdr, B_FALSE); 6503 } 6504 } 6505 6506 if (HDR_HAS_RABD(hdr)) 6507 arc_hdr_free_abd(hdr, B_TRUE); 6508 } 6509 ASSERT3P(hdr->b_l1hdr.b_pabd, ==, NULL); 6510 ASSERT(!HDR_HAS_RABD(hdr)); 6511 ASSERT(!HDR_SHARED_DATA(hdr)); 6512 ASSERT(!arc_buf_is_shared(buf)); 6513 6514 callback->awcb_ready(zio, buf, callback->awcb_private); 6515 6516 if (HDR_IO_IN_PROGRESS(hdr)) { 6517 ASSERT(zio->io_flags & ZIO_FLAG_REEXECUTED); 6518 } else { 6519 arc_hdr_set_flags(hdr, ARC_FLAG_IO_IN_PROGRESS); 6520 add_reference(hdr, hdr); /* For IO_IN_PROGRESS. */ 6521 } 6522 6523 if (BP_IS_PROTECTED(bp)) { 6524 /* ZIL blocks are written through zio_rewrite */ 6525 ASSERT3U(BP_GET_TYPE(bp), !=, DMU_OT_INTENT_LOG); 6526 6527 if (BP_SHOULD_BYTESWAP(bp)) { 6528 if (BP_GET_LEVEL(bp) > 0) { 6529 hdr->b_l1hdr.b_byteswap = DMU_BSWAP_UINT64; 6530 } else { 6531 hdr->b_l1hdr.b_byteswap = 6532 DMU_OT_BYTESWAP(BP_GET_TYPE(bp)); 6533 } 6534 } else { 6535 hdr->b_l1hdr.b_byteswap = DMU_BSWAP_NUMFUNCS; 6536 } 6537 6538 arc_hdr_set_flags(hdr, ARC_FLAG_PROTECTED); 6539 hdr->b_crypt_hdr.b_ot = BP_GET_TYPE(bp); 6540 hdr->b_crypt_hdr.b_dsobj = zio->io_bookmark.zb_objset; 6541 zio_crypt_decode_params_bp(bp, hdr->b_crypt_hdr.b_salt, 6542 hdr->b_crypt_hdr.b_iv); 6543 zio_crypt_decode_mac_bp(bp, hdr->b_crypt_hdr.b_mac); 6544 } else { 6545 arc_hdr_clear_flags(hdr, ARC_FLAG_PROTECTED); 6546 } 6547 6548 /* 6549 * If this block was written for raw encryption but the zio layer 6550 * ended up only authenticating it, adjust the buffer flags now. 6551 */ 6552 if (BP_IS_AUTHENTICATED(bp) && ARC_BUF_ENCRYPTED(buf)) { 6553 arc_hdr_set_flags(hdr, ARC_FLAG_NOAUTH); 6554 buf->b_flags &= ~ARC_BUF_FLAG_ENCRYPTED; 6555 if (BP_GET_COMPRESS(bp) == ZIO_COMPRESS_OFF) 6556 buf->b_flags &= ~ARC_BUF_FLAG_COMPRESSED; 6557 } else if (BP_IS_HOLE(bp) && ARC_BUF_ENCRYPTED(buf)) { 6558 buf->b_flags &= ~ARC_BUF_FLAG_ENCRYPTED; 6559 buf->b_flags &= ~ARC_BUF_FLAG_COMPRESSED; 6560 } 6561 6562 /* this must be done after the buffer flags are adjusted */ 6563 arc_cksum_compute(buf); 6564 6565 enum zio_compress compress; 6566 if (BP_IS_HOLE(bp) || BP_IS_EMBEDDED(bp)) { 6567 compress = ZIO_COMPRESS_OFF; 6568 } else { 6569 ASSERT3U(HDR_GET_LSIZE(hdr), ==, BP_GET_LSIZE(bp)); 6570 compress = BP_GET_COMPRESS(bp); 6571 } 6572 HDR_SET_PSIZE(hdr, psize); 6573 arc_hdr_set_compress(hdr, compress); 6574 hdr->b_complevel = zio->io_prop.zp_complevel; 6575 6576 if (zio->io_error != 0 || psize == 0) 6577 goto out; 6578 6579 /* 6580 * Fill the hdr with data. If the buffer is encrypted we have no choice 6581 * but to copy the data into b_radb. If the hdr is compressed, the data 6582 * we want is available from the zio, otherwise we can take it from 6583 * the buf. 6584 * 6585 * We might be able to share the buf's data with the hdr here. However, 6586 * doing so would cause the ARC to be full of linear ABDs if we write a 6587 * lot of shareable data. As a compromise, we check whether scattered 6588 * ABDs are allowed, and assume that if they are then the user wants 6589 * the ARC to be primarily filled with them regardless of the data being 6590 * written. Therefore, if they're allowed then we allocate one and copy 6591 * the data into it; otherwise, we share the data directly if we can. 
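 *
 * Added summary (not part of the original comment) of the branches
 * below:
 *
 *      - encrypted buf: copy zio->io_abd into a freshly allocated
 *        b_rabd
 *      - cannot (or should not) share with the buf: allocate b_rabd or
 *        b_pabd and copy, sourcing from zio->io_abd when the bp is
 *        encrypted or the hdr is compressed, else from buf->b_data
 *      - otherwise: share the buf's data block with the hdr (no copy)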
6592 */ 6593 if (ARC_BUF_ENCRYPTED(buf)) { 6594 ASSERT3U(psize, >, 0); 6595 ASSERT(ARC_BUF_COMPRESSED(buf)); 6596 arc_hdr_alloc_abd(hdr, ARC_HDR_ALLOC_RDATA | 6597 ARC_HDR_USE_RESERVE); 6598 abd_copy(hdr->b_crypt_hdr.b_rabd, zio->io_abd, psize); 6599 } else if (!(HDR_UNCACHED(hdr) || 6600 abd_size_alloc_linear(arc_buf_size(buf))) || 6601 !arc_can_share(hdr, buf)) { 6602 /* 6603 * Ideally, we would always copy the io_abd into b_pabd, but the 6604 * user may have disabled compressed ARC, thus we must check the 6605 * hdr's compression setting rather than the io_bp's. 6606 */ 6607 if (BP_IS_ENCRYPTED(bp)) { 6608 ASSERT3U(psize, >, 0); 6609 arc_hdr_alloc_abd(hdr, ARC_HDR_ALLOC_RDATA | 6610 ARC_HDR_USE_RESERVE); 6611 abd_copy(hdr->b_crypt_hdr.b_rabd, zio->io_abd, psize); 6612 } else if (arc_hdr_get_compress(hdr) != ZIO_COMPRESS_OFF && 6613 !ARC_BUF_COMPRESSED(buf)) { 6614 ASSERT3U(psize, >, 0); 6615 arc_hdr_alloc_abd(hdr, ARC_HDR_USE_RESERVE); 6616 abd_copy(hdr->b_l1hdr.b_pabd, zio->io_abd, psize); 6617 } else { 6618 ASSERT3U(zio->io_orig_size, ==, arc_hdr_size(hdr)); 6619 arc_hdr_alloc_abd(hdr, ARC_HDR_USE_RESERVE); 6620 abd_copy_from_buf(hdr->b_l1hdr.b_pabd, buf->b_data, 6621 arc_buf_size(buf)); 6622 } 6623 } else { 6624 ASSERT3P(buf->b_data, ==, abd_to_buf(zio->io_orig_abd)); 6625 ASSERT3U(zio->io_orig_size, ==, arc_buf_size(buf)); 6626 ASSERT3P(hdr->b_l1hdr.b_buf, ==, buf); 6627 ASSERT(ARC_BUF_LAST(buf)); 6628 6629 arc_share_buf(hdr, buf); 6630 } 6631 6632 out: 6633 arc_hdr_verify(hdr, bp); 6634 spl_fstrans_unmark(cookie); 6635 } 6636 6637 static void 6638 arc_write_children_ready(zio_t *zio) 6639 { 6640 arc_write_callback_t *callback = zio->io_private; 6641 arc_buf_t *buf = callback->awcb_buf; 6642 6643 callback->awcb_children_ready(zio, buf, callback->awcb_private); 6644 } 6645 6646 static void 6647 arc_write_done(zio_t *zio) 6648 { 6649 arc_write_callback_t *callback = zio->io_private; 6650 arc_buf_t *buf = callback->awcb_buf; 6651 arc_buf_hdr_t *hdr = buf->b_hdr; 6652 6653 ASSERT3P(hdr->b_l1hdr.b_acb, ==, NULL); 6654 6655 if (zio->io_error == 0) { 6656 arc_hdr_verify(hdr, zio->io_bp); 6657 6658 if (BP_IS_HOLE(zio->io_bp) || BP_IS_EMBEDDED(zio->io_bp)) { 6659 buf_discard_identity(hdr); 6660 } else { 6661 hdr->b_dva = *BP_IDENTITY(zio->io_bp); 6662 hdr->b_birth = BP_GET_BIRTH(zio->io_bp); 6663 } 6664 } else { 6665 ASSERT(HDR_EMPTY(hdr)); 6666 } 6667 6668 /* 6669 * If the block to be written was all-zero or compressed enough to be 6670 * embedded in the BP, no write was performed so there will be no 6671 * dva/birth/checksum. The buffer must therefore remain anonymous 6672 * (and uncached). 6673 */ 6674 if (!HDR_EMPTY(hdr)) { 6675 arc_buf_hdr_t *exists; 6676 kmutex_t *hash_lock; 6677 6678 ASSERT3U(zio->io_error, ==, 0); 6679 6680 arc_cksum_verify(buf); 6681 6682 exists = buf_hash_insert(hdr, &hash_lock); 6683 if (exists != NULL) { 6684 /* 6685 * This can only happen if we overwrite for 6686 * sync-to-convergence, because we remove 6687 * buffers from the hash table when we arc_free(). 
6688 */ 6689 if (zio->io_flags & ZIO_FLAG_IO_REWRITE) { 6690 if (!BP_EQUAL(&zio->io_bp_orig, zio->io_bp)) 6691 panic("bad overwrite, hdr=%p exists=%p", 6692 (void *)hdr, (void *)exists); 6693 ASSERT(zfs_refcount_is_zero( 6694 &exists->b_l1hdr.b_refcnt)); 6695 arc_change_state(arc_anon, exists); 6696 arc_hdr_destroy(exists); 6697 mutex_exit(hash_lock); 6698 exists = buf_hash_insert(hdr, &hash_lock); 6699 ASSERT3P(exists, ==, NULL); 6700 } else if (zio->io_flags & ZIO_FLAG_NOPWRITE) { 6701 /* nopwrite */ 6702 ASSERT(zio->io_prop.zp_nopwrite); 6703 if (!BP_EQUAL(&zio->io_bp_orig, zio->io_bp)) 6704 panic("bad nopwrite, hdr=%p exists=%p", 6705 (void *)hdr, (void *)exists); 6706 } else { 6707 /* Dedup */ 6708 ASSERT3P(hdr->b_l1hdr.b_buf, !=, NULL); 6709 ASSERT(ARC_BUF_LAST(hdr->b_l1hdr.b_buf)); 6710 ASSERT(hdr->b_l1hdr.b_state == arc_anon); 6711 ASSERT(BP_GET_DEDUP(zio->io_bp)); 6712 ASSERT(BP_GET_LEVEL(zio->io_bp) == 0); 6713 } 6714 } 6715 arc_hdr_clear_flags(hdr, ARC_FLAG_IO_IN_PROGRESS); 6716 VERIFY3S(remove_reference(hdr, hdr), >, 0); 6717 /* if it's not anon, we are doing a scrub */ 6718 if (exists == NULL && hdr->b_l1hdr.b_state == arc_anon) 6719 arc_access(hdr, 0, B_FALSE); 6720 mutex_exit(hash_lock); 6721 } else { 6722 arc_hdr_clear_flags(hdr, ARC_FLAG_IO_IN_PROGRESS); 6723 VERIFY3S(remove_reference(hdr, hdr), >, 0); 6724 } 6725 6726 callback->awcb_done(zio, buf, callback->awcb_private); 6727 6728 abd_free(zio->io_abd); 6729 kmem_free(callback, sizeof (arc_write_callback_t)); 6730 } 6731 6732 zio_t * 6733 arc_write(zio_t *pio, spa_t *spa, uint64_t txg, 6734 blkptr_t *bp, arc_buf_t *buf, boolean_t uncached, boolean_t l2arc, 6735 const zio_prop_t *zp, arc_write_done_func_t *ready, 6736 arc_write_done_func_t *children_ready, arc_write_done_func_t *done, 6737 void *private, zio_priority_t priority, int zio_flags, 6738 const zbookmark_phys_t *zb) 6739 { 6740 arc_buf_hdr_t *hdr = buf->b_hdr; 6741 arc_write_callback_t *callback; 6742 zio_t *zio; 6743 zio_prop_t localprop = *zp; 6744 6745 ASSERT3P(ready, !=, NULL); 6746 ASSERT3P(done, !=, NULL); 6747 ASSERT(!HDR_IO_ERROR(hdr)); 6748 ASSERT(!HDR_IO_IN_PROGRESS(hdr)); 6749 ASSERT3P(hdr->b_l1hdr.b_acb, ==, NULL); 6750 ASSERT3P(hdr->b_l1hdr.b_buf, !=, NULL); 6751 if (uncached) 6752 arc_hdr_set_flags(hdr, ARC_FLAG_UNCACHED); 6753 else if (l2arc) 6754 arc_hdr_set_flags(hdr, ARC_FLAG_L2CACHE); 6755 6756 if (ARC_BUF_ENCRYPTED(buf)) { 6757 ASSERT(ARC_BUF_COMPRESSED(buf)); 6758 localprop.zp_encrypt = B_TRUE; 6759 localprop.zp_compress = HDR_GET_COMPRESS(hdr); 6760 localprop.zp_complevel = hdr->b_complevel; 6761 localprop.zp_byteorder = 6762 (hdr->b_l1hdr.b_byteswap == DMU_BSWAP_NUMFUNCS) ? 
6763 ZFS_HOST_BYTEORDER : !ZFS_HOST_BYTEORDER; 6764 memcpy(localprop.zp_salt, hdr->b_crypt_hdr.b_salt, 6765 ZIO_DATA_SALT_LEN); 6766 memcpy(localprop.zp_iv, hdr->b_crypt_hdr.b_iv, 6767 ZIO_DATA_IV_LEN); 6768 memcpy(localprop.zp_mac, hdr->b_crypt_hdr.b_mac, 6769 ZIO_DATA_MAC_LEN); 6770 if (DMU_OT_IS_ENCRYPTED(localprop.zp_type)) { 6771 localprop.zp_nopwrite = B_FALSE; 6772 localprop.zp_copies = 6773 MIN(localprop.zp_copies, SPA_DVAS_PER_BP - 1); 6774 } 6775 zio_flags |= ZIO_FLAG_RAW; 6776 } else if (ARC_BUF_COMPRESSED(buf)) { 6777 ASSERT3U(HDR_GET_LSIZE(hdr), !=, arc_buf_size(buf)); 6778 localprop.zp_compress = HDR_GET_COMPRESS(hdr); 6779 localprop.zp_complevel = hdr->b_complevel; 6780 zio_flags |= ZIO_FLAG_RAW_COMPRESS; 6781 } 6782 callback = kmem_zalloc(sizeof (arc_write_callback_t), KM_SLEEP); 6783 callback->awcb_ready = ready; 6784 callback->awcb_children_ready = children_ready; 6785 callback->awcb_done = done; 6786 callback->awcb_private = private; 6787 callback->awcb_buf = buf; 6788 6789 /* 6790 * The hdr's b_pabd is now stale, free it now. A new data block 6791 * will be allocated when the zio pipeline calls arc_write_ready(). 6792 */ 6793 if (hdr->b_l1hdr.b_pabd != NULL) { 6794 /* 6795 * If the buf is currently sharing the data block with 6796 * the hdr then we need to break that relationship here. 6797 * The hdr will remain with a NULL data pointer and the 6798 * buf will take sole ownership of the block. 6799 */ 6800 if (ARC_BUF_SHARED(buf)) { 6801 arc_unshare_buf(hdr, buf); 6802 } else { 6803 ASSERT(!arc_buf_is_shared(buf)); 6804 arc_hdr_free_abd(hdr, B_FALSE); 6805 } 6806 VERIFY3P(buf->b_data, !=, NULL); 6807 } 6808 6809 if (HDR_HAS_RABD(hdr)) 6810 arc_hdr_free_abd(hdr, B_TRUE); 6811 6812 if (!(zio_flags & ZIO_FLAG_RAW)) 6813 arc_hdr_set_compress(hdr, ZIO_COMPRESS_OFF); 6814 6815 ASSERT(!arc_buf_is_shared(buf)); 6816 ASSERT3P(hdr->b_l1hdr.b_pabd, ==, NULL); 6817 6818 zio = zio_write(pio, spa, txg, bp, 6819 abd_get_from_buf(buf->b_data, HDR_GET_LSIZE(hdr)), 6820 HDR_GET_LSIZE(hdr), arc_buf_size(buf), &localprop, arc_write_ready, 6821 (children_ready != NULL) ? arc_write_children_ready : NULL, 6822 arc_write_done, callback, priority, zio_flags, zb); 6823 6824 return (zio); 6825 } 6826 6827 void 6828 arc_tempreserve_clear(uint64_t reserve) 6829 { 6830 atomic_add_64(&arc_tempreserve, -reserve); 6831 ASSERT((int64_t)arc_tempreserve >= 0); 6832 } 6833 6834 int 6835 arc_tempreserve_space(spa_t *spa, uint64_t reserve, uint64_t txg) 6836 { 6837 int error; 6838 uint64_t anon_size; 6839 6840 if (!arc_no_grow && 6841 reserve > arc_c/4 && 6842 reserve * 4 > (2ULL << SPA_MAXBLOCKSHIFT)) 6843 arc_c = MIN(arc_c_max, reserve * 4); 6844 6845 /* 6846 * Throttle when the calculated memory footprint for the TXG 6847 * exceeds the target ARC size. 6848 */ 6849 if (reserve > arc_c) { 6850 DMU_TX_STAT_BUMP(dmu_tx_memory_reserve); 6851 return (SET_ERROR(ERESTART)); 6852 } 6853 6854 /* 6855 * Don't count loaned bufs as in flight dirty data to prevent long 6856 * network delays from blocking transactions that are ready to be 6857 * assigned to a txg. 6858 */ 6859 6860 /* assert that it has not wrapped around */ 6861 ASSERT3S(atomic_add_64_nv(&arc_loaned_bytes, 0), >=, 0); 6862 6863 anon_size = MAX((int64_t) 6864 (zfs_refcount_count(&arc_anon->arcs_size[ARC_BUFC_DATA]) + 6865 zfs_refcount_count(&arc_anon->arcs_size[ARC_BUFC_METADATA]) - 6866 arc_loaned_bytes), 0); 6867 6868 /* 6869 * Writes will, almost always, require additional memory allocations 6870 * in order to compress/encrypt/etc the data. 
We therefore need to 6871 * make sure that there is sufficient available memory for this. 6872 */ 6873 error = arc_memory_throttle(spa, reserve, txg); 6874 if (error != 0) 6875 return (error); 6876 6877 /* 6878 * Throttle writes when the amount of dirty data in the cache 6879 * gets too large. We try to keep the cache less than half full 6880 * of dirty blocks so that our sync times don't grow too large. 6881 * 6882 * In the case of one pool being built on another pool, we want 6883 * to make sure we don't end up throttling the lower (backing) 6884 * pool when the upper pool is the majority contributor to dirty 6885 * data. To insure we make forward progress during throttling, we 6886 * also check the current pool's net dirty data and only throttle 6887 * if it exceeds zfs_arc_pool_dirty_percent of the anonymous dirty 6888 * data in the cache. 6889 * 6890 * Note: if two requests come in concurrently, we might let them 6891 * both succeed, when one of them should fail. Not a huge deal. 6892 */ 6893 uint64_t total_dirty = reserve + arc_tempreserve + anon_size; 6894 uint64_t spa_dirty_anon = spa_dirty_data(spa); 6895 uint64_t rarc_c = arc_warm ? arc_c : arc_c_max; 6896 if (total_dirty > rarc_c * zfs_arc_dirty_limit_percent / 100 && 6897 anon_size > rarc_c * zfs_arc_anon_limit_percent / 100 && 6898 spa_dirty_anon > anon_size * zfs_arc_pool_dirty_percent / 100) { 6899 #ifdef ZFS_DEBUG 6900 uint64_t meta_esize = zfs_refcount_count( 6901 &arc_anon->arcs_esize[ARC_BUFC_METADATA]); 6902 uint64_t data_esize = 6903 zfs_refcount_count(&arc_anon->arcs_esize[ARC_BUFC_DATA]); 6904 dprintf("failing, arc_tempreserve=%lluK anon_meta=%lluK " 6905 "anon_data=%lluK tempreserve=%lluK rarc_c=%lluK\n", 6906 (u_longlong_t)arc_tempreserve >> 10, 6907 (u_longlong_t)meta_esize >> 10, 6908 (u_longlong_t)data_esize >> 10, 6909 (u_longlong_t)reserve >> 10, 6910 (u_longlong_t)rarc_c >> 10); 6911 #endif 6912 DMU_TX_STAT_BUMP(dmu_tx_dirty_throttle); 6913 return (SET_ERROR(ERESTART)); 6914 } 6915 atomic_add_64(&arc_tempreserve, reserve); 6916 return (0); 6917 } 6918 6919 static void 6920 arc_kstat_update_state(arc_state_t *state, kstat_named_t *size, 6921 kstat_named_t *data, kstat_named_t *metadata, 6922 kstat_named_t *evict_data, kstat_named_t *evict_metadata) 6923 { 6924 data->value.ui64 = 6925 zfs_refcount_count(&state->arcs_size[ARC_BUFC_DATA]); 6926 metadata->value.ui64 = 6927 zfs_refcount_count(&state->arcs_size[ARC_BUFC_METADATA]); 6928 size->value.ui64 = data->value.ui64 + metadata->value.ui64; 6929 evict_data->value.ui64 = 6930 zfs_refcount_count(&state->arcs_esize[ARC_BUFC_DATA]); 6931 evict_metadata->value.ui64 = 6932 zfs_refcount_count(&state->arcs_esize[ARC_BUFC_METADATA]); 6933 } 6934 6935 static int 6936 arc_kstat_update(kstat_t *ksp, int rw) 6937 { 6938 arc_stats_t *as = ksp->ks_data; 6939 6940 if (rw == KSTAT_WRITE) 6941 return (SET_ERROR(EACCES)); 6942 6943 as->arcstat_hits.value.ui64 = 6944 wmsum_value(&arc_sums.arcstat_hits); 6945 as->arcstat_iohits.value.ui64 = 6946 wmsum_value(&arc_sums.arcstat_iohits); 6947 as->arcstat_misses.value.ui64 = 6948 wmsum_value(&arc_sums.arcstat_misses); 6949 as->arcstat_demand_data_hits.value.ui64 = 6950 wmsum_value(&arc_sums.arcstat_demand_data_hits); 6951 as->arcstat_demand_data_iohits.value.ui64 = 6952 wmsum_value(&arc_sums.arcstat_demand_data_iohits); 6953 as->arcstat_demand_data_misses.value.ui64 = 6954 wmsum_value(&arc_sums.arcstat_demand_data_misses); 6955 as->arcstat_demand_metadata_hits.value.ui64 = 6956 wmsum_value(&arc_sums.arcstat_demand_metadata_hits); 6957 
as->arcstat_demand_metadata_iohits.value.ui64 = 6958 wmsum_value(&arc_sums.arcstat_demand_metadata_iohits); 6959 as->arcstat_demand_metadata_misses.value.ui64 = 6960 wmsum_value(&arc_sums.arcstat_demand_metadata_misses); 6961 as->arcstat_prefetch_data_hits.value.ui64 = 6962 wmsum_value(&arc_sums.arcstat_prefetch_data_hits); 6963 as->arcstat_prefetch_data_iohits.value.ui64 = 6964 wmsum_value(&arc_sums.arcstat_prefetch_data_iohits); 6965 as->arcstat_prefetch_data_misses.value.ui64 = 6966 wmsum_value(&arc_sums.arcstat_prefetch_data_misses); 6967 as->arcstat_prefetch_metadata_hits.value.ui64 = 6968 wmsum_value(&arc_sums.arcstat_prefetch_metadata_hits); 6969 as->arcstat_prefetch_metadata_iohits.value.ui64 = 6970 wmsum_value(&arc_sums.arcstat_prefetch_metadata_iohits); 6971 as->arcstat_prefetch_metadata_misses.value.ui64 = 6972 wmsum_value(&arc_sums.arcstat_prefetch_metadata_misses); 6973 as->arcstat_mru_hits.value.ui64 = 6974 wmsum_value(&arc_sums.arcstat_mru_hits); 6975 as->arcstat_mru_ghost_hits.value.ui64 = 6976 wmsum_value(&arc_sums.arcstat_mru_ghost_hits); 6977 as->arcstat_mfu_hits.value.ui64 = 6978 wmsum_value(&arc_sums.arcstat_mfu_hits); 6979 as->arcstat_mfu_ghost_hits.value.ui64 = 6980 wmsum_value(&arc_sums.arcstat_mfu_ghost_hits); 6981 as->arcstat_uncached_hits.value.ui64 = 6982 wmsum_value(&arc_sums.arcstat_uncached_hits); 6983 as->arcstat_deleted.value.ui64 = 6984 wmsum_value(&arc_sums.arcstat_deleted); 6985 as->arcstat_mutex_miss.value.ui64 = 6986 wmsum_value(&arc_sums.arcstat_mutex_miss); 6987 as->arcstat_access_skip.value.ui64 = 6988 wmsum_value(&arc_sums.arcstat_access_skip); 6989 as->arcstat_evict_skip.value.ui64 = 6990 wmsum_value(&arc_sums.arcstat_evict_skip); 6991 as->arcstat_evict_not_enough.value.ui64 = 6992 wmsum_value(&arc_sums.arcstat_evict_not_enough); 6993 as->arcstat_evict_l2_cached.value.ui64 = 6994 wmsum_value(&arc_sums.arcstat_evict_l2_cached); 6995 as->arcstat_evict_l2_eligible.value.ui64 = 6996 wmsum_value(&arc_sums.arcstat_evict_l2_eligible); 6997 as->arcstat_evict_l2_eligible_mfu.value.ui64 = 6998 wmsum_value(&arc_sums.arcstat_evict_l2_eligible_mfu); 6999 as->arcstat_evict_l2_eligible_mru.value.ui64 = 7000 wmsum_value(&arc_sums.arcstat_evict_l2_eligible_mru); 7001 as->arcstat_evict_l2_ineligible.value.ui64 = 7002 wmsum_value(&arc_sums.arcstat_evict_l2_ineligible); 7003 as->arcstat_evict_l2_skip.value.ui64 = 7004 wmsum_value(&arc_sums.arcstat_evict_l2_skip); 7005 as->arcstat_hash_collisions.value.ui64 = 7006 wmsum_value(&arc_sums.arcstat_hash_collisions); 7007 as->arcstat_hash_chains.value.ui64 = 7008 wmsum_value(&arc_sums.arcstat_hash_chains); 7009 as->arcstat_size.value.ui64 = 7010 aggsum_value(&arc_sums.arcstat_size); 7011 as->arcstat_compressed_size.value.ui64 = 7012 wmsum_value(&arc_sums.arcstat_compressed_size); 7013 as->arcstat_uncompressed_size.value.ui64 = 7014 wmsum_value(&arc_sums.arcstat_uncompressed_size); 7015 as->arcstat_overhead_size.value.ui64 = 7016 wmsum_value(&arc_sums.arcstat_overhead_size); 7017 as->arcstat_hdr_size.value.ui64 = 7018 wmsum_value(&arc_sums.arcstat_hdr_size); 7019 as->arcstat_data_size.value.ui64 = 7020 wmsum_value(&arc_sums.arcstat_data_size); 7021 as->arcstat_metadata_size.value.ui64 = 7022 wmsum_value(&arc_sums.arcstat_metadata_size); 7023 as->arcstat_dbuf_size.value.ui64 = 7024 wmsum_value(&arc_sums.arcstat_dbuf_size); 7025 #if defined(COMPAT_FREEBSD11) 7026 as->arcstat_other_size.value.ui64 = 7027 wmsum_value(&arc_sums.arcstat_bonus_size) + 7028 wmsum_value(&arc_sums.arcstat_dnode_size) + 7029 
wmsum_value(&arc_sums.arcstat_dbuf_size); 7030 #endif 7031 7032 arc_kstat_update_state(arc_anon, 7033 &as->arcstat_anon_size, 7034 &as->arcstat_anon_data, 7035 &as->arcstat_anon_metadata, 7036 &as->arcstat_anon_evictable_data, 7037 &as->arcstat_anon_evictable_metadata); 7038 arc_kstat_update_state(arc_mru, 7039 &as->arcstat_mru_size, 7040 &as->arcstat_mru_data, 7041 &as->arcstat_mru_metadata, 7042 &as->arcstat_mru_evictable_data, 7043 &as->arcstat_mru_evictable_metadata); 7044 arc_kstat_update_state(arc_mru_ghost, 7045 &as->arcstat_mru_ghost_size, 7046 &as->arcstat_mru_ghost_data, 7047 &as->arcstat_mru_ghost_metadata, 7048 &as->arcstat_mru_ghost_evictable_data, 7049 &as->arcstat_mru_ghost_evictable_metadata); 7050 arc_kstat_update_state(arc_mfu, 7051 &as->arcstat_mfu_size, 7052 &as->arcstat_mfu_data, 7053 &as->arcstat_mfu_metadata, 7054 &as->arcstat_mfu_evictable_data, 7055 &as->arcstat_mfu_evictable_metadata); 7056 arc_kstat_update_state(arc_mfu_ghost, 7057 &as->arcstat_mfu_ghost_size, 7058 &as->arcstat_mfu_ghost_data, 7059 &as->arcstat_mfu_ghost_metadata, 7060 &as->arcstat_mfu_ghost_evictable_data, 7061 &as->arcstat_mfu_ghost_evictable_metadata); 7062 arc_kstat_update_state(arc_uncached, 7063 &as->arcstat_uncached_size, 7064 &as->arcstat_uncached_data, 7065 &as->arcstat_uncached_metadata, 7066 &as->arcstat_uncached_evictable_data, 7067 &as->arcstat_uncached_evictable_metadata); 7068 7069 as->arcstat_dnode_size.value.ui64 = 7070 wmsum_value(&arc_sums.arcstat_dnode_size); 7071 as->arcstat_bonus_size.value.ui64 = 7072 wmsum_value(&arc_sums.arcstat_bonus_size); 7073 as->arcstat_l2_hits.value.ui64 = 7074 wmsum_value(&arc_sums.arcstat_l2_hits); 7075 as->arcstat_l2_misses.value.ui64 = 7076 wmsum_value(&arc_sums.arcstat_l2_misses); 7077 as->arcstat_l2_prefetch_asize.value.ui64 = 7078 wmsum_value(&arc_sums.arcstat_l2_prefetch_asize); 7079 as->arcstat_l2_mru_asize.value.ui64 = 7080 wmsum_value(&arc_sums.arcstat_l2_mru_asize); 7081 as->arcstat_l2_mfu_asize.value.ui64 = 7082 wmsum_value(&arc_sums.arcstat_l2_mfu_asize); 7083 as->arcstat_l2_bufc_data_asize.value.ui64 = 7084 wmsum_value(&arc_sums.arcstat_l2_bufc_data_asize); 7085 as->arcstat_l2_bufc_metadata_asize.value.ui64 = 7086 wmsum_value(&arc_sums.arcstat_l2_bufc_metadata_asize); 7087 as->arcstat_l2_feeds.value.ui64 = 7088 wmsum_value(&arc_sums.arcstat_l2_feeds); 7089 as->arcstat_l2_rw_clash.value.ui64 = 7090 wmsum_value(&arc_sums.arcstat_l2_rw_clash); 7091 as->arcstat_l2_read_bytes.value.ui64 = 7092 wmsum_value(&arc_sums.arcstat_l2_read_bytes); 7093 as->arcstat_l2_write_bytes.value.ui64 = 7094 wmsum_value(&arc_sums.arcstat_l2_write_bytes); 7095 as->arcstat_l2_writes_sent.value.ui64 = 7096 wmsum_value(&arc_sums.arcstat_l2_writes_sent); 7097 as->arcstat_l2_writes_done.value.ui64 = 7098 wmsum_value(&arc_sums.arcstat_l2_writes_done); 7099 as->arcstat_l2_writes_error.value.ui64 = 7100 wmsum_value(&arc_sums.arcstat_l2_writes_error); 7101 as->arcstat_l2_writes_lock_retry.value.ui64 = 7102 wmsum_value(&arc_sums.arcstat_l2_writes_lock_retry); 7103 as->arcstat_l2_evict_lock_retry.value.ui64 = 7104 wmsum_value(&arc_sums.arcstat_l2_evict_lock_retry); 7105 as->arcstat_l2_evict_reading.value.ui64 = 7106 wmsum_value(&arc_sums.arcstat_l2_evict_reading); 7107 as->arcstat_l2_evict_l1cached.value.ui64 = 7108 wmsum_value(&arc_sums.arcstat_l2_evict_l1cached); 7109 as->arcstat_l2_free_on_write.value.ui64 = 7110 wmsum_value(&arc_sums.arcstat_l2_free_on_write); 7111 as->arcstat_l2_abort_lowmem.value.ui64 = 7112 wmsum_value(&arc_sums.arcstat_l2_abort_lowmem); 7113 
as->arcstat_l2_cksum_bad.value.ui64 = 7114 wmsum_value(&arc_sums.arcstat_l2_cksum_bad); 7115 as->arcstat_l2_io_error.value.ui64 = 7116 wmsum_value(&arc_sums.arcstat_l2_io_error); 7117 as->arcstat_l2_lsize.value.ui64 = 7118 wmsum_value(&arc_sums.arcstat_l2_lsize); 7119 as->arcstat_l2_psize.value.ui64 = 7120 wmsum_value(&arc_sums.arcstat_l2_psize); 7121 as->arcstat_l2_hdr_size.value.ui64 = 7122 aggsum_value(&arc_sums.arcstat_l2_hdr_size); 7123 as->arcstat_l2_log_blk_writes.value.ui64 = 7124 wmsum_value(&arc_sums.arcstat_l2_log_blk_writes); 7125 as->arcstat_l2_log_blk_asize.value.ui64 = 7126 wmsum_value(&arc_sums.arcstat_l2_log_blk_asize); 7127 as->arcstat_l2_log_blk_count.value.ui64 = 7128 wmsum_value(&arc_sums.arcstat_l2_log_blk_count); 7129 as->arcstat_l2_rebuild_success.value.ui64 = 7130 wmsum_value(&arc_sums.arcstat_l2_rebuild_success); 7131 as->arcstat_l2_rebuild_abort_unsupported.value.ui64 = 7132 wmsum_value(&arc_sums.arcstat_l2_rebuild_abort_unsupported); 7133 as->arcstat_l2_rebuild_abort_io_errors.value.ui64 = 7134 wmsum_value(&arc_sums.arcstat_l2_rebuild_abort_io_errors); 7135 as->arcstat_l2_rebuild_abort_dh_errors.value.ui64 = 7136 wmsum_value(&arc_sums.arcstat_l2_rebuild_abort_dh_errors); 7137 as->arcstat_l2_rebuild_abort_cksum_lb_errors.value.ui64 = 7138 wmsum_value(&arc_sums.arcstat_l2_rebuild_abort_cksum_lb_errors); 7139 as->arcstat_l2_rebuild_abort_lowmem.value.ui64 = 7140 wmsum_value(&arc_sums.arcstat_l2_rebuild_abort_lowmem); 7141 as->arcstat_l2_rebuild_size.value.ui64 = 7142 wmsum_value(&arc_sums.arcstat_l2_rebuild_size); 7143 as->arcstat_l2_rebuild_asize.value.ui64 = 7144 wmsum_value(&arc_sums.arcstat_l2_rebuild_asize); 7145 as->arcstat_l2_rebuild_bufs.value.ui64 = 7146 wmsum_value(&arc_sums.arcstat_l2_rebuild_bufs); 7147 as->arcstat_l2_rebuild_bufs_precached.value.ui64 = 7148 wmsum_value(&arc_sums.arcstat_l2_rebuild_bufs_precached); 7149 as->arcstat_l2_rebuild_log_blks.value.ui64 = 7150 wmsum_value(&arc_sums.arcstat_l2_rebuild_log_blks); 7151 as->arcstat_memory_throttle_count.value.ui64 = 7152 wmsum_value(&arc_sums.arcstat_memory_throttle_count); 7153 as->arcstat_memory_direct_count.value.ui64 = 7154 wmsum_value(&arc_sums.arcstat_memory_direct_count); 7155 as->arcstat_memory_indirect_count.value.ui64 = 7156 wmsum_value(&arc_sums.arcstat_memory_indirect_count); 7157 7158 as->arcstat_memory_all_bytes.value.ui64 = 7159 arc_all_memory(); 7160 as->arcstat_memory_free_bytes.value.ui64 = 7161 arc_free_memory(); 7162 as->arcstat_memory_available_bytes.value.i64 = 7163 arc_available_memory(); 7164 7165 as->arcstat_prune.value.ui64 = 7166 wmsum_value(&arc_sums.arcstat_prune); 7167 as->arcstat_meta_used.value.ui64 = 7168 wmsum_value(&arc_sums.arcstat_meta_used); 7169 as->arcstat_async_upgrade_sync.value.ui64 = 7170 wmsum_value(&arc_sums.arcstat_async_upgrade_sync); 7171 as->arcstat_predictive_prefetch.value.ui64 = 7172 wmsum_value(&arc_sums.arcstat_predictive_prefetch); 7173 as->arcstat_demand_hit_predictive_prefetch.value.ui64 = 7174 wmsum_value(&arc_sums.arcstat_demand_hit_predictive_prefetch); 7175 as->arcstat_demand_iohit_predictive_prefetch.value.ui64 = 7176 wmsum_value(&arc_sums.arcstat_demand_iohit_predictive_prefetch); 7177 as->arcstat_prescient_prefetch.value.ui64 = 7178 wmsum_value(&arc_sums.arcstat_prescient_prefetch); 7179 as->arcstat_demand_hit_prescient_prefetch.value.ui64 = 7180 wmsum_value(&arc_sums.arcstat_demand_hit_prescient_prefetch); 7181 as->arcstat_demand_iohit_prescient_prefetch.value.ui64 = 7182 
	    wmsum_value(&arc_sums.arcstat_demand_iohit_prescient_prefetch);
	as->arcstat_raw_size.value.ui64 =
	    wmsum_value(&arc_sums.arcstat_raw_size);
	as->arcstat_cached_only_in_progress.value.ui64 =
	    wmsum_value(&arc_sums.arcstat_cached_only_in_progress);
	as->arcstat_abd_chunk_waste_size.value.ui64 =
	    wmsum_value(&arc_sums.arcstat_abd_chunk_waste_size);

	return (0);
}

/*
 * This function *must* return indices evenly distributed between all
 * sublists of the multilist. This is needed due to how the ARC eviction
 * code is laid out; arc_evict_state() assumes ARC buffers are evenly
 * distributed between all sublists and uses this assumption when
 * deciding which sublist to evict from and how much to evict from it.
 */
static unsigned int
arc_state_multilist_index_func(multilist_t *ml, void *obj)
{
	arc_buf_hdr_t *hdr = obj;

	/*
	 * We rely on b_dva to generate evenly distributed index
	 * numbers using buf_hash below. So, as an added precaution,
	 * let's make sure we never add empty buffers to the arc lists.
	 */
	ASSERT(!HDR_EMPTY(hdr));

	/*
	 * The assumption here is that the hash value for a given
	 * arc_buf_hdr_t will remain constant throughout its lifetime
	 * (i.e. its b_spa, b_dva, and b_birth fields don't change).
	 * Thus, we don't need to store the header's sublist index
	 * on insertion, as this index can be recalculated on removal.
	 *
	 * Also, the low order bits of the hash value are thought to be
	 * distributed evenly. Otherwise, in the case that the multilist
	 * has a power of two number of sublists, each sublist's usage
	 * would not be evenly distributed. In this context full 64-bit
	 * division would be a waste of time, so limit it to 32 bits.
	 */
	return ((unsigned int)buf_hash(hdr->b_spa, &hdr->b_dva, hdr->b_birth) %
	    multilist_get_num_sublists(ml));
}

static unsigned int
arc_state_l2c_multilist_index_func(multilist_t *ml, void *obj)
{
	panic("Header %p insert into arc_l2c_only %p", obj, ml);
}

#define	WARN_IF_TUNING_IGNORED(tuning, value, do_warn) do {	\
	if ((do_warn) && (tuning) && ((tuning) != (value))) {	\
		cmn_err(CE_WARN,				\
		    "ignoring tunable %s (using %llu instead)",	\
		    (#tuning), (u_longlong_t)(value));		\
	}							\
} while (0)

/*
 * Called during module initialization and periodically thereafter to
 * apply reasonable changes to the exposed performance tunings. Can also be
 * called explicitly by param_set_arc_*() functions when ARC tunables are
 * updated manually. Non-zero zfs_* values that differ from the currently
 * set values will be applied.
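 *
 * As an illustrative example (values chosen only for illustration): if
 * zfs_arc_min is set to 16M, it falls below the 32M floor
 * (2ULL << SPA_MAXBLOCKSHIFT) checked below, so arc_c_min is left
 * unchanged and, when verbose, WARN_IF_TUNING_IGNORED() logs
 *
 *	ignoring tunable zfs_arc_min (using <current arc_c_min> instead)
 *
 * while a zfs_arc_min between 32M and arc_c_max is applied directly.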
7249 */ 7250 void 7251 arc_tuning_update(boolean_t verbose) 7252 { 7253 uint64_t allmem = arc_all_memory(); 7254 7255 /* Valid range: 32M - <arc_c_max> */ 7256 if ((zfs_arc_min) && (zfs_arc_min != arc_c_min) && 7257 (zfs_arc_min >= 2ULL << SPA_MAXBLOCKSHIFT) && 7258 (zfs_arc_min <= arc_c_max)) { 7259 arc_c_min = zfs_arc_min; 7260 arc_c = MAX(arc_c, arc_c_min); 7261 } 7262 WARN_IF_TUNING_IGNORED(zfs_arc_min, arc_c_min, verbose); 7263 7264 /* Valid range: 64M - <all physical memory> */ 7265 if ((zfs_arc_max) && (zfs_arc_max != arc_c_max) && 7266 (zfs_arc_max >= MIN_ARC_MAX) && (zfs_arc_max < allmem) && 7267 (zfs_arc_max > arc_c_min)) { 7268 arc_c_max = zfs_arc_max; 7269 arc_c = MIN(arc_c, arc_c_max); 7270 if (arc_dnode_limit > arc_c_max) 7271 arc_dnode_limit = arc_c_max; 7272 } 7273 WARN_IF_TUNING_IGNORED(zfs_arc_max, arc_c_max, verbose); 7274 7275 /* Valid range: 0 - <all physical memory> */ 7276 arc_dnode_limit = zfs_arc_dnode_limit ? zfs_arc_dnode_limit : 7277 MIN(zfs_arc_dnode_limit_percent, 100) * arc_c_max / 100; 7278 WARN_IF_TUNING_IGNORED(zfs_arc_dnode_limit, arc_dnode_limit, verbose); 7279 7280 /* Valid range: 1 - N */ 7281 if (zfs_arc_grow_retry) 7282 arc_grow_retry = zfs_arc_grow_retry; 7283 7284 /* Valid range: 1 - N */ 7285 if (zfs_arc_shrink_shift) { 7286 arc_shrink_shift = zfs_arc_shrink_shift; 7287 arc_no_grow_shift = MIN(arc_no_grow_shift, arc_shrink_shift -1); 7288 } 7289 7290 /* Valid range: 1 - N ms */ 7291 if (zfs_arc_min_prefetch_ms) 7292 arc_min_prefetch_ms = zfs_arc_min_prefetch_ms; 7293 7294 /* Valid range: 1 - N ms */ 7295 if (zfs_arc_min_prescient_prefetch_ms) { 7296 arc_min_prescient_prefetch_ms = 7297 zfs_arc_min_prescient_prefetch_ms; 7298 } 7299 7300 /* Valid range: 0 - 100 */ 7301 if (zfs_arc_lotsfree_percent <= 100) 7302 arc_lotsfree_percent = zfs_arc_lotsfree_percent; 7303 WARN_IF_TUNING_IGNORED(zfs_arc_lotsfree_percent, arc_lotsfree_percent, 7304 verbose); 7305 7306 /* Valid range: 0 - <all physical memory> */ 7307 if ((zfs_arc_sys_free) && (zfs_arc_sys_free != arc_sys_free)) 7308 arc_sys_free = MIN(zfs_arc_sys_free, allmem); 7309 WARN_IF_TUNING_IGNORED(zfs_arc_sys_free, arc_sys_free, verbose); 7310 } 7311 7312 static void 7313 arc_state_multilist_init(multilist_t *ml, 7314 multilist_sublist_index_func_t *index_func, int *maxcountp) 7315 { 7316 multilist_create(ml, sizeof (arc_buf_hdr_t), 7317 offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node), index_func); 7318 *maxcountp = MAX(*maxcountp, multilist_get_num_sublists(ml)); 7319 } 7320 7321 static void 7322 arc_state_init(void) 7323 { 7324 int num_sublists = 0; 7325 7326 arc_state_multilist_init(&arc_mru->arcs_list[ARC_BUFC_METADATA], 7327 arc_state_multilist_index_func, &num_sublists); 7328 arc_state_multilist_init(&arc_mru->arcs_list[ARC_BUFC_DATA], 7329 arc_state_multilist_index_func, &num_sublists); 7330 arc_state_multilist_init(&arc_mru_ghost->arcs_list[ARC_BUFC_METADATA], 7331 arc_state_multilist_index_func, &num_sublists); 7332 arc_state_multilist_init(&arc_mru_ghost->arcs_list[ARC_BUFC_DATA], 7333 arc_state_multilist_index_func, &num_sublists); 7334 arc_state_multilist_init(&arc_mfu->arcs_list[ARC_BUFC_METADATA], 7335 arc_state_multilist_index_func, &num_sublists); 7336 arc_state_multilist_init(&arc_mfu->arcs_list[ARC_BUFC_DATA], 7337 arc_state_multilist_index_func, &num_sublists); 7338 arc_state_multilist_init(&arc_mfu_ghost->arcs_list[ARC_BUFC_METADATA], 7339 arc_state_multilist_index_func, &num_sublists); 7340 arc_state_multilist_init(&arc_mfu_ghost->arcs_list[ARC_BUFC_DATA], 7341 
arc_state_multilist_index_func, &num_sublists); 7342 arc_state_multilist_init(&arc_uncached->arcs_list[ARC_BUFC_METADATA], 7343 arc_state_multilist_index_func, &num_sublists); 7344 arc_state_multilist_init(&arc_uncached->arcs_list[ARC_BUFC_DATA], 7345 arc_state_multilist_index_func, &num_sublists); 7346 7347 /* 7348 * L2 headers should never be on the L2 state list since they don't 7349 * have L1 headers allocated. Special index function asserts that. 7350 */ 7351 arc_state_multilist_init(&arc_l2c_only->arcs_list[ARC_BUFC_METADATA], 7352 arc_state_l2c_multilist_index_func, &num_sublists); 7353 arc_state_multilist_init(&arc_l2c_only->arcs_list[ARC_BUFC_DATA], 7354 arc_state_l2c_multilist_index_func, &num_sublists); 7355 7356 /* 7357 * Keep track of the number of markers needed to reclaim buffers from 7358 * any ARC state. The markers will be pre-allocated so as to minimize 7359 * the number of memory allocations performed by the eviction thread. 7360 */ 7361 arc_state_evict_marker_count = num_sublists; 7362 7363 zfs_refcount_create(&arc_anon->arcs_esize[ARC_BUFC_METADATA]); 7364 zfs_refcount_create(&arc_anon->arcs_esize[ARC_BUFC_DATA]); 7365 zfs_refcount_create(&arc_mru->arcs_esize[ARC_BUFC_METADATA]); 7366 zfs_refcount_create(&arc_mru->arcs_esize[ARC_BUFC_DATA]); 7367 zfs_refcount_create(&arc_mru_ghost->arcs_esize[ARC_BUFC_METADATA]); 7368 zfs_refcount_create(&arc_mru_ghost->arcs_esize[ARC_BUFC_DATA]); 7369 zfs_refcount_create(&arc_mfu->arcs_esize[ARC_BUFC_METADATA]); 7370 zfs_refcount_create(&arc_mfu->arcs_esize[ARC_BUFC_DATA]); 7371 zfs_refcount_create(&arc_mfu_ghost->arcs_esize[ARC_BUFC_METADATA]); 7372 zfs_refcount_create(&arc_mfu_ghost->arcs_esize[ARC_BUFC_DATA]); 7373 zfs_refcount_create(&arc_l2c_only->arcs_esize[ARC_BUFC_METADATA]); 7374 zfs_refcount_create(&arc_l2c_only->arcs_esize[ARC_BUFC_DATA]); 7375 zfs_refcount_create(&arc_uncached->arcs_esize[ARC_BUFC_METADATA]); 7376 zfs_refcount_create(&arc_uncached->arcs_esize[ARC_BUFC_DATA]); 7377 7378 zfs_refcount_create(&arc_anon->arcs_size[ARC_BUFC_DATA]); 7379 zfs_refcount_create(&arc_anon->arcs_size[ARC_BUFC_METADATA]); 7380 zfs_refcount_create(&arc_mru->arcs_size[ARC_BUFC_DATA]); 7381 zfs_refcount_create(&arc_mru->arcs_size[ARC_BUFC_METADATA]); 7382 zfs_refcount_create(&arc_mru_ghost->arcs_size[ARC_BUFC_DATA]); 7383 zfs_refcount_create(&arc_mru_ghost->arcs_size[ARC_BUFC_METADATA]); 7384 zfs_refcount_create(&arc_mfu->arcs_size[ARC_BUFC_DATA]); 7385 zfs_refcount_create(&arc_mfu->arcs_size[ARC_BUFC_METADATA]); 7386 zfs_refcount_create(&arc_mfu_ghost->arcs_size[ARC_BUFC_DATA]); 7387 zfs_refcount_create(&arc_mfu_ghost->arcs_size[ARC_BUFC_METADATA]); 7388 zfs_refcount_create(&arc_l2c_only->arcs_size[ARC_BUFC_DATA]); 7389 zfs_refcount_create(&arc_l2c_only->arcs_size[ARC_BUFC_METADATA]); 7390 zfs_refcount_create(&arc_uncached->arcs_size[ARC_BUFC_DATA]); 7391 zfs_refcount_create(&arc_uncached->arcs_size[ARC_BUFC_METADATA]); 7392 7393 wmsum_init(&arc_mru_ghost->arcs_hits[ARC_BUFC_DATA], 0); 7394 wmsum_init(&arc_mru_ghost->arcs_hits[ARC_BUFC_METADATA], 0); 7395 wmsum_init(&arc_mfu_ghost->arcs_hits[ARC_BUFC_DATA], 0); 7396 wmsum_init(&arc_mfu_ghost->arcs_hits[ARC_BUFC_METADATA], 0); 7397 7398 wmsum_init(&arc_sums.arcstat_hits, 0); 7399 wmsum_init(&arc_sums.arcstat_iohits, 0); 7400 wmsum_init(&arc_sums.arcstat_misses, 0); 7401 wmsum_init(&arc_sums.arcstat_demand_data_hits, 0); 7402 wmsum_init(&arc_sums.arcstat_demand_data_iohits, 0); 7403 wmsum_init(&arc_sums.arcstat_demand_data_misses, 0); 7404 wmsum_init(&arc_sums.arcstat_demand_metadata_hits, 
0); 7405 wmsum_init(&arc_sums.arcstat_demand_metadata_iohits, 0); 7406 wmsum_init(&arc_sums.arcstat_demand_metadata_misses, 0); 7407 wmsum_init(&arc_sums.arcstat_prefetch_data_hits, 0); 7408 wmsum_init(&arc_sums.arcstat_prefetch_data_iohits, 0); 7409 wmsum_init(&arc_sums.arcstat_prefetch_data_misses, 0); 7410 wmsum_init(&arc_sums.arcstat_prefetch_metadata_hits, 0); 7411 wmsum_init(&arc_sums.arcstat_prefetch_metadata_iohits, 0); 7412 wmsum_init(&arc_sums.arcstat_prefetch_metadata_misses, 0); 7413 wmsum_init(&arc_sums.arcstat_mru_hits, 0); 7414 wmsum_init(&arc_sums.arcstat_mru_ghost_hits, 0); 7415 wmsum_init(&arc_sums.arcstat_mfu_hits, 0); 7416 wmsum_init(&arc_sums.arcstat_mfu_ghost_hits, 0); 7417 wmsum_init(&arc_sums.arcstat_uncached_hits, 0); 7418 wmsum_init(&arc_sums.arcstat_deleted, 0); 7419 wmsum_init(&arc_sums.arcstat_mutex_miss, 0); 7420 wmsum_init(&arc_sums.arcstat_access_skip, 0); 7421 wmsum_init(&arc_sums.arcstat_evict_skip, 0); 7422 wmsum_init(&arc_sums.arcstat_evict_not_enough, 0); 7423 wmsum_init(&arc_sums.arcstat_evict_l2_cached, 0); 7424 wmsum_init(&arc_sums.arcstat_evict_l2_eligible, 0); 7425 wmsum_init(&arc_sums.arcstat_evict_l2_eligible_mfu, 0); 7426 wmsum_init(&arc_sums.arcstat_evict_l2_eligible_mru, 0); 7427 wmsum_init(&arc_sums.arcstat_evict_l2_ineligible, 0); 7428 wmsum_init(&arc_sums.arcstat_evict_l2_skip, 0); 7429 wmsum_init(&arc_sums.arcstat_hash_collisions, 0); 7430 wmsum_init(&arc_sums.arcstat_hash_chains, 0); 7431 aggsum_init(&arc_sums.arcstat_size, 0); 7432 wmsum_init(&arc_sums.arcstat_compressed_size, 0); 7433 wmsum_init(&arc_sums.arcstat_uncompressed_size, 0); 7434 wmsum_init(&arc_sums.arcstat_overhead_size, 0); 7435 wmsum_init(&arc_sums.arcstat_hdr_size, 0); 7436 wmsum_init(&arc_sums.arcstat_data_size, 0); 7437 wmsum_init(&arc_sums.arcstat_metadata_size, 0); 7438 wmsum_init(&arc_sums.arcstat_dbuf_size, 0); 7439 wmsum_init(&arc_sums.arcstat_dnode_size, 0); 7440 wmsum_init(&arc_sums.arcstat_bonus_size, 0); 7441 wmsum_init(&arc_sums.arcstat_l2_hits, 0); 7442 wmsum_init(&arc_sums.arcstat_l2_misses, 0); 7443 wmsum_init(&arc_sums.arcstat_l2_prefetch_asize, 0); 7444 wmsum_init(&arc_sums.arcstat_l2_mru_asize, 0); 7445 wmsum_init(&arc_sums.arcstat_l2_mfu_asize, 0); 7446 wmsum_init(&arc_sums.arcstat_l2_bufc_data_asize, 0); 7447 wmsum_init(&arc_sums.arcstat_l2_bufc_metadata_asize, 0); 7448 wmsum_init(&arc_sums.arcstat_l2_feeds, 0); 7449 wmsum_init(&arc_sums.arcstat_l2_rw_clash, 0); 7450 wmsum_init(&arc_sums.arcstat_l2_read_bytes, 0); 7451 wmsum_init(&arc_sums.arcstat_l2_write_bytes, 0); 7452 wmsum_init(&arc_sums.arcstat_l2_writes_sent, 0); 7453 wmsum_init(&arc_sums.arcstat_l2_writes_done, 0); 7454 wmsum_init(&arc_sums.arcstat_l2_writes_error, 0); 7455 wmsum_init(&arc_sums.arcstat_l2_writes_lock_retry, 0); 7456 wmsum_init(&arc_sums.arcstat_l2_evict_lock_retry, 0); 7457 wmsum_init(&arc_sums.arcstat_l2_evict_reading, 0); 7458 wmsum_init(&arc_sums.arcstat_l2_evict_l1cached, 0); 7459 wmsum_init(&arc_sums.arcstat_l2_free_on_write, 0); 7460 wmsum_init(&arc_sums.arcstat_l2_abort_lowmem, 0); 7461 wmsum_init(&arc_sums.arcstat_l2_cksum_bad, 0); 7462 wmsum_init(&arc_sums.arcstat_l2_io_error, 0); 7463 wmsum_init(&arc_sums.arcstat_l2_lsize, 0); 7464 wmsum_init(&arc_sums.arcstat_l2_psize, 0); 7465 aggsum_init(&arc_sums.arcstat_l2_hdr_size, 0); 7466 wmsum_init(&arc_sums.arcstat_l2_log_blk_writes, 0); 7467 wmsum_init(&arc_sums.arcstat_l2_log_blk_asize, 0); 7468 wmsum_init(&arc_sums.arcstat_l2_log_blk_count, 0); 7469 wmsum_init(&arc_sums.arcstat_l2_rebuild_success, 0); 7470 
wmsum_init(&arc_sums.arcstat_l2_rebuild_abort_unsupported, 0); 7471 wmsum_init(&arc_sums.arcstat_l2_rebuild_abort_io_errors, 0); 7472 wmsum_init(&arc_sums.arcstat_l2_rebuild_abort_dh_errors, 0); 7473 wmsum_init(&arc_sums.arcstat_l2_rebuild_abort_cksum_lb_errors, 0); 7474 wmsum_init(&arc_sums.arcstat_l2_rebuild_abort_lowmem, 0); 7475 wmsum_init(&arc_sums.arcstat_l2_rebuild_size, 0); 7476 wmsum_init(&arc_sums.arcstat_l2_rebuild_asize, 0); 7477 wmsum_init(&arc_sums.arcstat_l2_rebuild_bufs, 0); 7478 wmsum_init(&arc_sums.arcstat_l2_rebuild_bufs_precached, 0); 7479 wmsum_init(&arc_sums.arcstat_l2_rebuild_log_blks, 0); 7480 wmsum_init(&arc_sums.arcstat_memory_throttle_count, 0); 7481 wmsum_init(&arc_sums.arcstat_memory_direct_count, 0); 7482 wmsum_init(&arc_sums.arcstat_memory_indirect_count, 0); 7483 wmsum_init(&arc_sums.arcstat_prune, 0); 7484 wmsum_init(&arc_sums.arcstat_meta_used, 0); 7485 wmsum_init(&arc_sums.arcstat_async_upgrade_sync, 0); 7486 wmsum_init(&arc_sums.arcstat_predictive_prefetch, 0); 7487 wmsum_init(&arc_sums.arcstat_demand_hit_predictive_prefetch, 0); 7488 wmsum_init(&arc_sums.arcstat_demand_iohit_predictive_prefetch, 0); 7489 wmsum_init(&arc_sums.arcstat_prescient_prefetch, 0); 7490 wmsum_init(&arc_sums.arcstat_demand_hit_prescient_prefetch, 0); 7491 wmsum_init(&arc_sums.arcstat_demand_iohit_prescient_prefetch, 0); 7492 wmsum_init(&arc_sums.arcstat_raw_size, 0); 7493 wmsum_init(&arc_sums.arcstat_cached_only_in_progress, 0); 7494 wmsum_init(&arc_sums.arcstat_abd_chunk_waste_size, 0); 7495 7496 arc_anon->arcs_state = ARC_STATE_ANON; 7497 arc_mru->arcs_state = ARC_STATE_MRU; 7498 arc_mru_ghost->arcs_state = ARC_STATE_MRU_GHOST; 7499 arc_mfu->arcs_state = ARC_STATE_MFU; 7500 arc_mfu_ghost->arcs_state = ARC_STATE_MFU_GHOST; 7501 arc_l2c_only->arcs_state = ARC_STATE_L2C_ONLY; 7502 arc_uncached->arcs_state = ARC_STATE_UNCACHED; 7503 } 7504 7505 static void 7506 arc_state_fini(void) 7507 { 7508 zfs_refcount_destroy(&arc_anon->arcs_esize[ARC_BUFC_METADATA]); 7509 zfs_refcount_destroy(&arc_anon->arcs_esize[ARC_BUFC_DATA]); 7510 zfs_refcount_destroy(&arc_mru->arcs_esize[ARC_BUFC_METADATA]); 7511 zfs_refcount_destroy(&arc_mru->arcs_esize[ARC_BUFC_DATA]); 7512 zfs_refcount_destroy(&arc_mru_ghost->arcs_esize[ARC_BUFC_METADATA]); 7513 zfs_refcount_destroy(&arc_mru_ghost->arcs_esize[ARC_BUFC_DATA]); 7514 zfs_refcount_destroy(&arc_mfu->arcs_esize[ARC_BUFC_METADATA]); 7515 zfs_refcount_destroy(&arc_mfu->arcs_esize[ARC_BUFC_DATA]); 7516 zfs_refcount_destroy(&arc_mfu_ghost->arcs_esize[ARC_BUFC_METADATA]); 7517 zfs_refcount_destroy(&arc_mfu_ghost->arcs_esize[ARC_BUFC_DATA]); 7518 zfs_refcount_destroy(&arc_l2c_only->arcs_esize[ARC_BUFC_METADATA]); 7519 zfs_refcount_destroy(&arc_l2c_only->arcs_esize[ARC_BUFC_DATA]); 7520 zfs_refcount_destroy(&arc_uncached->arcs_esize[ARC_BUFC_METADATA]); 7521 zfs_refcount_destroy(&arc_uncached->arcs_esize[ARC_BUFC_DATA]); 7522 7523 zfs_refcount_destroy(&arc_anon->arcs_size[ARC_BUFC_DATA]); 7524 zfs_refcount_destroy(&arc_anon->arcs_size[ARC_BUFC_METADATA]); 7525 zfs_refcount_destroy(&arc_mru->arcs_size[ARC_BUFC_DATA]); 7526 zfs_refcount_destroy(&arc_mru->arcs_size[ARC_BUFC_METADATA]); 7527 zfs_refcount_destroy(&arc_mru_ghost->arcs_size[ARC_BUFC_DATA]); 7528 zfs_refcount_destroy(&arc_mru_ghost->arcs_size[ARC_BUFC_METADATA]); 7529 zfs_refcount_destroy(&arc_mfu->arcs_size[ARC_BUFC_DATA]); 7530 zfs_refcount_destroy(&arc_mfu->arcs_size[ARC_BUFC_METADATA]); 7531 zfs_refcount_destroy(&arc_mfu_ghost->arcs_size[ARC_BUFC_DATA]); 7532 
zfs_refcount_destroy(&arc_mfu_ghost->arcs_size[ARC_BUFC_METADATA]); 7533 zfs_refcount_destroy(&arc_l2c_only->arcs_size[ARC_BUFC_DATA]); 7534 zfs_refcount_destroy(&arc_l2c_only->arcs_size[ARC_BUFC_METADATA]); 7535 zfs_refcount_destroy(&arc_uncached->arcs_size[ARC_BUFC_DATA]); 7536 zfs_refcount_destroy(&arc_uncached->arcs_size[ARC_BUFC_METADATA]); 7537 7538 multilist_destroy(&arc_mru->arcs_list[ARC_BUFC_METADATA]); 7539 multilist_destroy(&arc_mru_ghost->arcs_list[ARC_BUFC_METADATA]); 7540 multilist_destroy(&arc_mfu->arcs_list[ARC_BUFC_METADATA]); 7541 multilist_destroy(&arc_mfu_ghost->arcs_list[ARC_BUFC_METADATA]); 7542 multilist_destroy(&arc_mru->arcs_list[ARC_BUFC_DATA]); 7543 multilist_destroy(&arc_mru_ghost->arcs_list[ARC_BUFC_DATA]); 7544 multilist_destroy(&arc_mfu->arcs_list[ARC_BUFC_DATA]); 7545 multilist_destroy(&arc_mfu_ghost->arcs_list[ARC_BUFC_DATA]); 7546 multilist_destroy(&arc_l2c_only->arcs_list[ARC_BUFC_METADATA]); 7547 multilist_destroy(&arc_l2c_only->arcs_list[ARC_BUFC_DATA]); 7548 multilist_destroy(&arc_uncached->arcs_list[ARC_BUFC_METADATA]); 7549 multilist_destroy(&arc_uncached->arcs_list[ARC_BUFC_DATA]); 7550 7551 wmsum_fini(&arc_mru_ghost->arcs_hits[ARC_BUFC_DATA]); 7552 wmsum_fini(&arc_mru_ghost->arcs_hits[ARC_BUFC_METADATA]); 7553 wmsum_fini(&arc_mfu_ghost->arcs_hits[ARC_BUFC_DATA]); 7554 wmsum_fini(&arc_mfu_ghost->arcs_hits[ARC_BUFC_METADATA]); 7555 7556 wmsum_fini(&arc_sums.arcstat_hits); 7557 wmsum_fini(&arc_sums.arcstat_iohits); 7558 wmsum_fini(&arc_sums.arcstat_misses); 7559 wmsum_fini(&arc_sums.arcstat_demand_data_hits); 7560 wmsum_fini(&arc_sums.arcstat_demand_data_iohits); 7561 wmsum_fini(&arc_sums.arcstat_demand_data_misses); 7562 wmsum_fini(&arc_sums.arcstat_demand_metadata_hits); 7563 wmsum_fini(&arc_sums.arcstat_demand_metadata_iohits); 7564 wmsum_fini(&arc_sums.arcstat_demand_metadata_misses); 7565 wmsum_fini(&arc_sums.arcstat_prefetch_data_hits); 7566 wmsum_fini(&arc_sums.arcstat_prefetch_data_iohits); 7567 wmsum_fini(&arc_sums.arcstat_prefetch_data_misses); 7568 wmsum_fini(&arc_sums.arcstat_prefetch_metadata_hits); 7569 wmsum_fini(&arc_sums.arcstat_prefetch_metadata_iohits); 7570 wmsum_fini(&arc_sums.arcstat_prefetch_metadata_misses); 7571 wmsum_fini(&arc_sums.arcstat_mru_hits); 7572 wmsum_fini(&arc_sums.arcstat_mru_ghost_hits); 7573 wmsum_fini(&arc_sums.arcstat_mfu_hits); 7574 wmsum_fini(&arc_sums.arcstat_mfu_ghost_hits); 7575 wmsum_fini(&arc_sums.arcstat_uncached_hits); 7576 wmsum_fini(&arc_sums.arcstat_deleted); 7577 wmsum_fini(&arc_sums.arcstat_mutex_miss); 7578 wmsum_fini(&arc_sums.arcstat_access_skip); 7579 wmsum_fini(&arc_sums.arcstat_evict_skip); 7580 wmsum_fini(&arc_sums.arcstat_evict_not_enough); 7581 wmsum_fini(&arc_sums.arcstat_evict_l2_cached); 7582 wmsum_fini(&arc_sums.arcstat_evict_l2_eligible); 7583 wmsum_fini(&arc_sums.arcstat_evict_l2_eligible_mfu); 7584 wmsum_fini(&arc_sums.arcstat_evict_l2_eligible_mru); 7585 wmsum_fini(&arc_sums.arcstat_evict_l2_ineligible); 7586 wmsum_fini(&arc_sums.arcstat_evict_l2_skip); 7587 wmsum_fini(&arc_sums.arcstat_hash_collisions); 7588 wmsum_fini(&arc_sums.arcstat_hash_chains); 7589 aggsum_fini(&arc_sums.arcstat_size); 7590 wmsum_fini(&arc_sums.arcstat_compressed_size); 7591 wmsum_fini(&arc_sums.arcstat_uncompressed_size); 7592 wmsum_fini(&arc_sums.arcstat_overhead_size); 7593 wmsum_fini(&arc_sums.arcstat_hdr_size); 7594 wmsum_fini(&arc_sums.arcstat_data_size); 7595 wmsum_fini(&arc_sums.arcstat_metadata_size); 7596 wmsum_fini(&arc_sums.arcstat_dbuf_size); 7597 wmsum_fini(&arc_sums.arcstat_dnode_size); 
7598 wmsum_fini(&arc_sums.arcstat_bonus_size); 7599 wmsum_fini(&arc_sums.arcstat_l2_hits); 7600 wmsum_fini(&arc_sums.arcstat_l2_misses); 7601 wmsum_fini(&arc_sums.arcstat_l2_prefetch_asize); 7602 wmsum_fini(&arc_sums.arcstat_l2_mru_asize); 7603 wmsum_fini(&arc_sums.arcstat_l2_mfu_asize); 7604 wmsum_fini(&arc_sums.arcstat_l2_bufc_data_asize); 7605 wmsum_fini(&arc_sums.arcstat_l2_bufc_metadata_asize); 7606 wmsum_fini(&arc_sums.arcstat_l2_feeds); 7607 wmsum_fini(&arc_sums.arcstat_l2_rw_clash); 7608 wmsum_fini(&arc_sums.arcstat_l2_read_bytes); 7609 wmsum_fini(&arc_sums.arcstat_l2_write_bytes); 7610 wmsum_fini(&arc_sums.arcstat_l2_writes_sent); 7611 wmsum_fini(&arc_sums.arcstat_l2_writes_done); 7612 wmsum_fini(&arc_sums.arcstat_l2_writes_error); 7613 wmsum_fini(&arc_sums.arcstat_l2_writes_lock_retry); 7614 wmsum_fini(&arc_sums.arcstat_l2_evict_lock_retry); 7615 wmsum_fini(&arc_sums.arcstat_l2_evict_reading); 7616 wmsum_fini(&arc_sums.arcstat_l2_evict_l1cached); 7617 wmsum_fini(&arc_sums.arcstat_l2_free_on_write); 7618 wmsum_fini(&arc_sums.arcstat_l2_abort_lowmem); 7619 wmsum_fini(&arc_sums.arcstat_l2_cksum_bad); 7620 wmsum_fini(&arc_sums.arcstat_l2_io_error); 7621 wmsum_fini(&arc_sums.arcstat_l2_lsize); 7622 wmsum_fini(&arc_sums.arcstat_l2_psize); 7623 aggsum_fini(&arc_sums.arcstat_l2_hdr_size); 7624 wmsum_fini(&arc_sums.arcstat_l2_log_blk_writes); 7625 wmsum_fini(&arc_sums.arcstat_l2_log_blk_asize); 7626 wmsum_fini(&arc_sums.arcstat_l2_log_blk_count); 7627 wmsum_fini(&arc_sums.arcstat_l2_rebuild_success); 7628 wmsum_fini(&arc_sums.arcstat_l2_rebuild_abort_unsupported); 7629 wmsum_fini(&arc_sums.arcstat_l2_rebuild_abort_io_errors); 7630 wmsum_fini(&arc_sums.arcstat_l2_rebuild_abort_dh_errors); 7631 wmsum_fini(&arc_sums.arcstat_l2_rebuild_abort_cksum_lb_errors); 7632 wmsum_fini(&arc_sums.arcstat_l2_rebuild_abort_lowmem); 7633 wmsum_fini(&arc_sums.arcstat_l2_rebuild_size); 7634 wmsum_fini(&arc_sums.arcstat_l2_rebuild_asize); 7635 wmsum_fini(&arc_sums.arcstat_l2_rebuild_bufs); 7636 wmsum_fini(&arc_sums.arcstat_l2_rebuild_bufs_precached); 7637 wmsum_fini(&arc_sums.arcstat_l2_rebuild_log_blks); 7638 wmsum_fini(&arc_sums.arcstat_memory_throttle_count); 7639 wmsum_fini(&arc_sums.arcstat_memory_direct_count); 7640 wmsum_fini(&arc_sums.arcstat_memory_indirect_count); 7641 wmsum_fini(&arc_sums.arcstat_prune); 7642 wmsum_fini(&arc_sums.arcstat_meta_used); 7643 wmsum_fini(&arc_sums.arcstat_async_upgrade_sync); 7644 wmsum_fini(&arc_sums.arcstat_predictive_prefetch); 7645 wmsum_fini(&arc_sums.arcstat_demand_hit_predictive_prefetch); 7646 wmsum_fini(&arc_sums.arcstat_demand_iohit_predictive_prefetch); 7647 wmsum_fini(&arc_sums.arcstat_prescient_prefetch); 7648 wmsum_fini(&arc_sums.arcstat_demand_hit_prescient_prefetch); 7649 wmsum_fini(&arc_sums.arcstat_demand_iohit_prescient_prefetch); 7650 wmsum_fini(&arc_sums.arcstat_raw_size); 7651 wmsum_fini(&arc_sums.arcstat_cached_only_in_progress); 7652 wmsum_fini(&arc_sums.arcstat_abd_chunk_waste_size); 7653 } 7654 7655 uint64_t 7656 arc_target_bytes(void) 7657 { 7658 return (arc_c); 7659 } 7660 7661 void 7662 arc_set_limits(uint64_t allmem) 7663 { 7664 /* Set min cache to 1/32 of all memory, or 32MB, whichever is more. */ 7665 arc_c_min = MAX(allmem / 32, 2ULL << SPA_MAXBLOCKSHIFT); 7666 7667 /* How to set default max varies by platform. 
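	 * As an illustrative example: with allmem = 64 GiB, the assignment
	 * above gives arc_c_min = MAX(64 GiB / 32, 32 MiB) = 2 GiB, and
	 * arc_c_max then comes from the platform-specific arc_default_max()
	 * below.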
*/ 7668 arc_c_max = arc_default_max(arc_c_min, allmem); 7669 } 7670 void 7671 arc_init(void) 7672 { 7673 uint64_t percent, allmem = arc_all_memory(); 7674 mutex_init(&arc_evict_lock, NULL, MUTEX_DEFAULT, NULL); 7675 list_create(&arc_evict_waiters, sizeof (arc_evict_waiter_t), 7676 offsetof(arc_evict_waiter_t, aew_node)); 7677 7678 arc_min_prefetch_ms = 1000; 7679 arc_min_prescient_prefetch_ms = 6000; 7680 7681 #if defined(_KERNEL) 7682 arc_lowmem_init(); 7683 #endif 7684 7685 arc_set_limits(allmem); 7686 7687 #ifdef _KERNEL 7688 /* 7689 * If zfs_arc_max is non-zero at init, meaning it was set in the kernel 7690 * environment before the module was loaded, don't block setting the 7691 * maximum because it is less than arc_c_min, instead, reset arc_c_min 7692 * to a lower value. 7693 * zfs_arc_min will be handled by arc_tuning_update(). 7694 */ 7695 if (zfs_arc_max != 0 && zfs_arc_max >= MIN_ARC_MAX && 7696 zfs_arc_max < allmem) { 7697 arc_c_max = zfs_arc_max; 7698 if (arc_c_min >= arc_c_max) { 7699 arc_c_min = MAX(zfs_arc_max / 2, 7700 2ULL << SPA_MAXBLOCKSHIFT); 7701 } 7702 } 7703 #else 7704 /* 7705 * In userland, there's only the memory pressure that we artificially 7706 * create (see arc_available_memory()). Don't let arc_c get too 7707 * small, because it can cause transactions to be larger than 7708 * arc_c, causing arc_tempreserve_space() to fail. 7709 */ 7710 arc_c_min = MAX(arc_c_max / 2, 2ULL << SPA_MAXBLOCKSHIFT); 7711 #endif 7712 7713 arc_c = arc_c_min; 7714 /* 7715 * 32-bit fixed point fractions of metadata from total ARC size, 7716 * MRU data from all data and MRU metadata from all metadata. 7717 */ 7718 arc_meta = (1ULL << 32) / 4; /* Metadata is 25% of arc_c. */ 7719 arc_pd = (1ULL << 32) / 2; /* Data MRU is 50% of data. */ 7720 arc_pm = (1ULL << 32) / 2; /* Metadata MRU is 50% of metadata. */ 7721 7722 percent = MIN(zfs_arc_dnode_limit_percent, 100); 7723 arc_dnode_limit = arc_c_max * percent / 100; 7724 7725 /* Apply user specified tunings */ 7726 arc_tuning_update(B_TRUE); 7727 7728 /* if kmem_flags are set, lets try to use less memory */ 7729 if (kmem_debugging()) 7730 arc_c = arc_c / 2; 7731 if (arc_c < arc_c_min) 7732 arc_c = arc_c_min; 7733 7734 arc_register_hotplug(); 7735 7736 arc_state_init(); 7737 7738 buf_init(); 7739 7740 list_create(&arc_prune_list, sizeof (arc_prune_t), 7741 offsetof(arc_prune_t, p_node)); 7742 mutex_init(&arc_prune_mtx, NULL, MUTEX_DEFAULT, NULL); 7743 7744 arc_prune_taskq = taskq_create("arc_prune", zfs_arc_prune_task_threads, 7745 defclsyspri, 100, INT_MAX, TASKQ_PREPOPULATE | TASKQ_DYNAMIC); 7746 7747 arc_ksp = kstat_create("zfs", 0, "arcstats", "misc", KSTAT_TYPE_NAMED, 7748 sizeof (arc_stats) / sizeof (kstat_named_t), KSTAT_FLAG_VIRTUAL); 7749 7750 if (arc_ksp != NULL) { 7751 arc_ksp->ks_data = &arc_stats; 7752 arc_ksp->ks_update = arc_kstat_update; 7753 kstat_install(arc_ksp); 7754 } 7755 7756 arc_state_evict_markers = 7757 arc_state_alloc_markers(arc_state_evict_marker_count); 7758 arc_evict_zthr = zthr_create_timer("arc_evict", 7759 arc_evict_cb_check, arc_evict_cb, NULL, SEC2NSEC(1), defclsyspri); 7760 arc_reap_zthr = zthr_create_timer("arc_reap", 7761 arc_reap_cb_check, arc_reap_cb, NULL, SEC2NSEC(1), minclsyspri); 7762 7763 arc_warm = B_FALSE; 7764 7765 /* 7766 * Calculate maximum amount of dirty data per pool. 7767 * 7768 * If it has been set by a module parameter, take that. 
 * Otherwise, use a percentage of physical memory defined by
 * zfs_dirty_data_max_percent (default 10%) with a cap at
 * zfs_dirty_data_max_max (default 4G or 25% of physical memory).
 */
#ifdef __LP64__
	if (zfs_dirty_data_max_max == 0)
		zfs_dirty_data_max_max = MIN(4ULL * 1024 * 1024 * 1024,
		    allmem * zfs_dirty_data_max_max_percent / 100);
#else
	if (zfs_dirty_data_max_max == 0)
		zfs_dirty_data_max_max = MIN(1ULL * 1024 * 1024 * 1024,
		    allmem * zfs_dirty_data_max_max_percent / 100);
#endif

	if (zfs_dirty_data_max == 0) {
		zfs_dirty_data_max = allmem *
		    zfs_dirty_data_max_percent / 100;
		zfs_dirty_data_max = MIN(zfs_dirty_data_max,
		    zfs_dirty_data_max_max);
	}

	if (zfs_wrlog_data_max == 0) {

		/*
		 * dp_wrlog_total is reduced for each txg at the end of
		 * spa_sync(). However, dp_dirty_total is reduced every time
		 * a block is written out. Thus under normal operation,
		 * dp_wrlog_total could grow 2 times as big as
		 * zfs_dirty_data_max.
		 */
		zfs_wrlog_data_max = zfs_dirty_data_max * 2;
	}
}

void
arc_fini(void)
{
	arc_prune_t *p;

#ifdef _KERNEL
	arc_lowmem_fini();
#endif /* _KERNEL */

	/* Use B_TRUE to ensure *all* buffers are evicted */
	arc_flush(NULL, B_TRUE);

	if (arc_ksp != NULL) {
		kstat_delete(arc_ksp);
		arc_ksp = NULL;
	}

	taskq_wait(arc_prune_taskq);
	taskq_destroy(arc_prune_taskq);

	mutex_enter(&arc_prune_mtx);
	while ((p = list_remove_head(&arc_prune_list)) != NULL) {
		(void) zfs_refcount_remove(&p->p_refcnt, &arc_prune_list);
		zfs_refcount_destroy(&p->p_refcnt);
		kmem_free(p, sizeof (*p));
	}
	mutex_exit(&arc_prune_mtx);

	list_destroy(&arc_prune_list);
	mutex_destroy(&arc_prune_mtx);

	(void) zthr_cancel(arc_evict_zthr);
	(void) zthr_cancel(arc_reap_zthr);
	arc_state_free_markers(arc_state_evict_markers,
	    arc_state_evict_marker_count);

	mutex_destroy(&arc_evict_lock);
	list_destroy(&arc_evict_waiters);

	/*
	 * Free any buffers that were tagged for destruction. This needs
	 * to occur before arc_state_fini() runs and destroys the aggsum
	 * values which are updated when freeing scatter ABDs.
	 */
	l2arc_do_free_on_write();

	/*
	 * buf_fini() must precede arc_state_fini() because buf_fini() may
	 * trigger the release of kmem magazines, which can call back to
	 * arc_space_return(), which accesses aggsums freed in
	 * arc_state_fini().
	 */
	buf_fini();
	arc_state_fini();

	arc_unregister_hotplug();

	/*
	 * We destroy the zthrs after all the ARC state has been
	 * torn down to avoid the case of them receiving any
	 * wakeup() signals after they are destroyed.
	 */
	zthr_destroy(arc_evict_zthr);
	zthr_destroy(arc_reap_zthr);

	ASSERT0(arc_loaned_bytes);
}

/*
 * Level 2 ARC
 *
 * The level 2 ARC (L2ARC) is a cache layer in-between main memory and disk.
 * It uses dedicated storage devices to hold cached data, which are populated
 * using large infrequent writes. The main role of this cache is to boost
 * the performance of random read workloads. The intended L2ARC devices
 * include short-stroked disks, solid state disks, and other media with
 * substantially faster read latency than disk.
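 *
 * (For example, a pool of spinning disks may be fronted by one or more
 * SSD or NVMe cache devices added with "zpool add <pool> cache <device>";
 * each such device becomes an independent L2ARC vdev.)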
7879 * 7880 * +-----------------------+ 7881 * | ARC | 7882 * +-----------------------+ 7883 * | ^ ^ 7884 * | | | 7885 * l2arc_feed_thread() arc_read() 7886 * | | | 7887 * | l2arc read | 7888 * V | | 7889 * +---------------+ | 7890 * | L2ARC | | 7891 * +---------------+ | 7892 * | ^ | 7893 * l2arc_write() | | 7894 * | | | 7895 * V | | 7896 * +-------+ +-------+ 7897 * | vdev | | vdev | 7898 * | cache | | cache | 7899 * +-------+ +-------+ 7900 * +=========+ .-----. 7901 * : L2ARC : |-_____-| 7902 * : devices : | Disks | 7903 * +=========+ `-_____-' 7904 * 7905 * Read requests are satisfied from the following sources, in order: 7906 * 7907 * 1) ARC 7908 * 2) vdev cache of L2ARC devices 7909 * 3) L2ARC devices 7910 * 4) vdev cache of disks 7911 * 5) disks 7912 * 7913 * Some L2ARC device types exhibit extremely slow write performance. 7914 * To accommodate for this there are some significant differences between 7915 * the L2ARC and traditional cache design: 7916 * 7917 * 1. There is no eviction path from the ARC to the L2ARC. Evictions from 7918 * the ARC behave as usual, freeing buffers and placing headers on ghost 7919 * lists. The ARC does not send buffers to the L2ARC during eviction as 7920 * this would add inflated write latencies for all ARC memory pressure. 7921 * 7922 * 2. The L2ARC attempts to cache data from the ARC before it is evicted. 7923 * It does this by periodically scanning buffers from the eviction-end of 7924 * the MFU and MRU ARC lists, copying them to the L2ARC devices if they are 7925 * not already there. It scans until a headroom of buffers is satisfied, 7926 * which itself is a buffer for ARC eviction. If a compressible buffer is 7927 * found during scanning and selected for writing to an L2ARC device, we 7928 * temporarily boost scanning headroom during the next scan cycle to make 7929 * sure we adapt to compression effects (which might significantly reduce 7930 * the data volume we write to L2ARC). The thread that does this is 7931 * l2arc_feed_thread(), illustrated below; example sizes are included to 7932 * provide a better sense of ratio than this diagram: 7933 * 7934 * head --> tail 7935 * +---------------------+----------+ 7936 * ARC_mfu |:::::#:::::::::::::::|o#o###o###|-->. # already on L2ARC 7937 * +---------------------+----------+ | o L2ARC eligible 7938 * ARC_mru |:#:::::::::::::::::::|#o#ooo####|-->| : ARC buffer 7939 * +---------------------+----------+ | 7940 * 15.9 Gbytes ^ 32 Mbytes | 7941 * headroom | 7942 * l2arc_feed_thread() 7943 * | 7944 * l2arc write hand <--[oooo]--' 7945 * | 8 Mbyte 7946 * | write max 7947 * V 7948 * +==============================+ 7949 * L2ARC dev |####|#|###|###| |####| ... | 7950 * +==============================+ 7951 * 32 Gbytes 7952 * 7953 * 3. If an ARC buffer is copied to the L2ARC but then hit instead of 7954 * evicted, then the L2ARC has cached a buffer much sooner than it probably 7955 * needed to, potentially wasting L2ARC device bandwidth and storage. It is 7956 * safe to say that this is an uncommon case, since buffers at the end of 7957 * the ARC lists have moved there due to inactivity. 7958 * 7959 * 4. If the ARC evicts faster than the L2ARC can maintain a headroom, 7960 * then the L2ARC simply misses copying some buffers. This serves as a 7961 * pressure valve to prevent heavy read workloads from both stalling the ARC 7962 * with waits and clogging the L2ARC with writes. 
This also helps prevent 7963 * the potential for the L2ARC to churn if it attempts to cache content too 7964 * quickly, such as during backups of the entire pool. 7965 * 7966 * 5. After system boot and before the ARC has filled main memory, there are 7967 * no evictions from the ARC and so the tails of the ARC_mfu and ARC_mru 7968 * lists can remain mostly static. Instead of searching from tail of these 7969 * lists as pictured, the l2arc_feed_thread() will search from the list heads 7970 * for eligible buffers, greatly increasing its chance of finding them. 7971 * 7972 * The L2ARC device write speed is also boosted during this time so that 7973 * the L2ARC warms up faster. Since there have been no ARC evictions yet, 7974 * there are no L2ARC reads, and no fear of degrading read performance 7975 * through increased writes. 7976 * 7977 * 6. Writes to the L2ARC devices are grouped and sent in-sequence, so that 7978 * the vdev queue can aggregate them into larger and fewer writes. Each 7979 * device is written to in a rotor fashion, sweeping writes through 7980 * available space then repeating. 7981 * 7982 * 7. The L2ARC does not store dirty content. It never needs to flush 7983 * write buffers back to disk based storage. 7984 * 7985 * 8. If an ARC buffer is written (and dirtied) which also exists in the 7986 * L2ARC, the now stale L2ARC buffer is immediately dropped. 7987 * 7988 * The performance of the L2ARC can be tweaked by a number of tunables, which 7989 * may be necessary for different workloads: 7990 * 7991 * l2arc_write_max max write bytes per interval 7992 * l2arc_write_boost extra write bytes during device warmup 7993 * l2arc_noprefetch skip caching prefetched buffers 7994 * l2arc_headroom number of max device writes to precache 7995 * l2arc_headroom_boost when we find compressed buffers during ARC 7996 * scanning, we multiply headroom by this 7997 * percentage factor for the next scan cycle, 7998 * since more compressed buffers are likely to 7999 * be present 8000 * l2arc_feed_secs seconds between L2ARC writing 8001 * 8002 * Tunables may be removed or added as future performance improvements are 8003 * integrated, and also may become zpool properties. 8004 * 8005 * There are three key functions that control how the L2ARC warms up: 8006 * 8007 * l2arc_write_eligible() check if a buffer is eligible to cache 8008 * l2arc_write_size() calculate how much to write 8009 * l2arc_write_interval() calculate sleep delay between writes 8010 * 8011 * These three functions determine what to write, how much, and how quickly 8012 * to send writes. 8013 * 8014 * L2ARC persistence: 8015 * 8016 * When writing buffers to L2ARC, we periodically add some metadata to 8017 * make sure we can pick them up after reboot, thus dramatically reducing 8018 * the impact that any downtime has on the performance of storage systems 8019 * with large caches. 8020 * 8021 * The implementation works fairly simply by integrating the following two 8022 * modifications: 8023 * 8024 * *) When writing to the L2ARC, we occasionally write a "l2arc log block", 8025 * which is an additional piece of metadata which describes what's been 8026 * written. This allows us to rebuild the arc_buf_hdr_t structures of the 8027 * main ARC buffers. There are 2 linked-lists of log blocks headed by 8028 * dh_start_lbps[2]. We alternate which chain we append to, so they are 8029 * time-wise and offset-wise interleaved, but that is an optimization rather 8030 * than for correctness. 
The log block also includes a pointer to the 8031 * previous block in its chain. 8032 * 8033 * *) We reserve SPA_MINBLOCKSIZE of space at the start of each L2ARC device 8034 * for our header bookkeeping purposes. This contains a device header, 8035 * which contains our top-level reference structures. We update it each 8036 * time we write a new log block, so that we're able to locate it in the 8037 * L2ARC device. If this write results in an inconsistent device header 8038 * (e.g. due to power failure), we detect this by verifying the header's 8039 * checksum and simply fail to reconstruct the L2ARC after reboot. 8040 * 8041 * Implementation diagram: 8042 * 8043 * +=== L2ARC device (not to scale) ======================================+ 8044 * | ___two newest log block pointers__.__________ | 8045 * | / \dh_start_lbps[1] | 8046 * | / \ \dh_start_lbps[0]| 8047 * |.___/__. V V | 8048 * ||L2 dev|....|lb |bufs |lb |bufs |lb |bufs |lb |bufs |lb |---(empty)---| 8049 * || hdr| ^ /^ /^ / / | 8050 * |+------+ ...--\-------/ \-----/--\------/ / | 8051 * | \--------------/ \--------------/ | 8052 * +======================================================================+ 8053 * 8054 * As can be seen on the diagram, rather than using a simple linked list, 8055 * we use a pair of linked lists with alternating elements. This is a 8056 * performance enhancement due to the fact that we only find out the 8057 * address of the next log block access once the current block has been 8058 * completely read in. Obviously, this hurts performance, because we'd be 8059 * keeping the device's I/O queue at only a 1 operation deep, thus 8060 * incurring a large amount of I/O round-trip latency. Having two lists 8061 * allows us to fetch two log blocks ahead of where we are currently 8062 * rebuilding L2ARC buffers. 8063 * 8064 * On-device data structures: 8065 * 8066 * L2ARC device header: l2arc_dev_hdr_phys_t 8067 * L2ARC log block: l2arc_log_blk_phys_t 8068 * 8069 * L2ARC reconstruction: 8070 * 8071 * When writing data, we simply write in the standard rotary fashion, 8072 * evicting buffers as we go and simply writing new data over them (writing 8073 * a new log block every now and then). This obviously means that once we 8074 * loop around the end of the device, we will start cutting into an already 8075 * committed log block (and its referenced data buffers), like so: 8076 * 8077 * current write head__ __old tail 8078 * \ / 8079 * V V 8080 * <--|bufs |lb |bufs |lb | |bufs |lb |bufs |lb |--> 8081 * ^ ^^^^^^^^^___________________________________ 8082 * | \ 8083 * <<nextwrite>> may overwrite this blk and/or its bufs --' 8084 * 8085 * When importing the pool, we detect this situation and use it to stop 8086 * our scanning process (see l2arc_rebuild). 8087 * 8088 * There is one significant caveat to consider when rebuilding ARC contents 8089 * from an L2ARC device: what about invalidated buffers? Given the above 8090 * construction, we cannot update blocks which we've already written to amend 8091 * them to remove buffers which were invalidated. Thus, during reconstruction, 8092 * we might be populating the cache with buffers for data that's not on the 8093 * main pool anymore, or may have been overwritten! 8094 * 8095 * As it turns out, this isn't a problem. Every arc_read request includes 8096 * both the DVA and, crucially, the birth TXG of the BP the caller is 8097 * looking for. 
So even if the cache were populated by completely rotten 8098 * blocks for data that had been long deleted and/or overwritten, we'll 8099 * never actually return bad data from the cache, since the DVA with the 8100 * birth TXG uniquely identify a block in space and time - once created, 8101 * a block is immutable on disk. The worst thing we have done is wasted 8102 * some time and memory at l2arc rebuild to reconstruct outdated ARC 8103 * entries that will get dropped from the l2arc as it is being updated 8104 * with new blocks. 8105 * 8106 * L2ARC buffers that have been evicted by l2arc_evict() ahead of the write 8107 * hand are not restored. This is done by saving the offset (in bytes) 8108 * l2arc_evict() has evicted to in the L2ARC device header and taking it 8109 * into account when restoring buffers. 8110 */ 8111 8112 static boolean_t 8113 l2arc_write_eligible(uint64_t spa_guid, arc_buf_hdr_t *hdr) 8114 { 8115 /* 8116 * A buffer is *not* eligible for the L2ARC if it: 8117 * 1. belongs to a different spa. 8118 * 2. is already cached on the L2ARC. 8119 * 3. has an I/O in progress (it may be an incomplete read). 8120 * 4. is flagged not eligible (zfs property). 8121 */ 8122 if (hdr->b_spa != spa_guid || HDR_HAS_L2HDR(hdr) || 8123 HDR_IO_IN_PROGRESS(hdr) || !HDR_L2CACHE(hdr)) 8124 return (B_FALSE); 8125 8126 return (B_TRUE); 8127 } 8128 8129 static uint64_t 8130 l2arc_write_size(l2arc_dev_t *dev) 8131 { 8132 uint64_t size; 8133 8134 /* 8135 * Make sure our globals have meaningful values in case the user 8136 * altered them. 8137 */ 8138 size = l2arc_write_max; 8139 if (size == 0) { 8140 cmn_err(CE_NOTE, "l2arc_write_max must be greater than zero, " 8141 "resetting it to the default (%d)", L2ARC_WRITE_SIZE); 8142 size = l2arc_write_max = L2ARC_WRITE_SIZE; 8143 } 8144 8145 if (arc_warm == B_FALSE) 8146 size += l2arc_write_boost; 8147 8148 /* We need to add in the worst case scenario of log block overhead. */ 8149 size += l2arc_log_blk_overhead(size, dev); 8150 if (dev->l2ad_vdev->vdev_has_trim && l2arc_trim_ahead > 0) { 8151 /* 8152 * Trim ahead of the write size 64MB or (l2arc_trim_ahead/100) 8153 * times the writesize, whichever is greater. 8154 */ 8155 size += MAX(64 * 1024 * 1024, 8156 (size * l2arc_trim_ahead) / 100); 8157 } 8158 8159 /* 8160 * Make sure the write size does not exceed the size of the cache 8161 * device. This is important in l2arc_evict(), otherwise infinite 8162 * iteration can occur. 8163 */ 8164 size = MIN(size, (dev->l2ad_end - dev->l2ad_start) / 4); 8165 8166 size = P2ROUNDUP(size, 1ULL << dev->l2ad_vdev->vdev_ashift); 8167 8168 return (size); 8169 8170 } 8171 8172 static clock_t 8173 l2arc_write_interval(clock_t began, uint64_t wanted, uint64_t wrote) 8174 { 8175 clock_t interval, next, now; 8176 8177 /* 8178 * If the ARC lists are busy, increase our write rate; if the 8179 * lists are stale, idle back. This is achieved by checking 8180 * how much we previously wrote - if it was more than half of 8181 * what we wanted, schedule the next write much sooner. 8182 */ 8183 if (l2arc_feed_again && wrote > (wanted / 2)) 8184 interval = (hz * l2arc_feed_min_ms) / 1000; 8185 else 8186 interval = hz * l2arc_feed_secs; 8187 8188 now = ddi_get_lbolt(); 8189 next = MAX(now, MIN(now + interval, began + interval)); 8190 8191 return (next); 8192 } 8193 8194 /* 8195 * Cycle through L2ARC devices. This is how L2ARC load balances. 8196 * If a device is returned, this also returns holding the spa config lock. 
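 *
 * As an illustration only (mirroring how l2arc_feed_thread() below uses
 * this function), a caller is expected to do its work and then drop the
 * spa config lock that was taken on its behalf:
 *
 *	if ((dev = l2arc_dev_get_next()) != NULL) {
 *		spa_t *spa = dev->l2ad_spa;
 *		uint64_t size = l2arc_write_size(dev);
 *		l2arc_evict(dev, size, B_FALSE);
 *		(void) l2arc_write_buffers(spa, dev, size);
 *		spa_config_exit(spa, SCL_L2ARC, dev);
 *	}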
8197 */ 8198 static l2arc_dev_t * 8199 l2arc_dev_get_next(void) 8200 { 8201 l2arc_dev_t *first, *next = NULL; 8202 8203 /* 8204 * Lock out the removal of spas (spa_namespace_lock), then removal 8205 * of cache devices (l2arc_dev_mtx). Once a device has been selected, 8206 * both locks will be dropped and a spa config lock held instead. 8207 */ 8208 mutex_enter(&spa_namespace_lock); 8209 mutex_enter(&l2arc_dev_mtx); 8210 8211 /* if there are no vdevs, there is nothing to do */ 8212 if (l2arc_ndev == 0) 8213 goto out; 8214 8215 first = NULL; 8216 next = l2arc_dev_last; 8217 do { 8218 /* loop around the list looking for a non-faulted vdev */ 8219 if (next == NULL) { 8220 next = list_head(l2arc_dev_list); 8221 } else { 8222 next = list_next(l2arc_dev_list, next); 8223 if (next == NULL) 8224 next = list_head(l2arc_dev_list); 8225 } 8226 8227 /* if we have come back to the start, bail out */ 8228 if (first == NULL) 8229 first = next; 8230 else if (next == first) 8231 break; 8232 8233 ASSERT3P(next, !=, NULL); 8234 } while (vdev_is_dead(next->l2ad_vdev) || next->l2ad_rebuild || 8235 next->l2ad_trim_all || next->l2ad_spa->spa_is_exporting); 8236 8237 /* if we were unable to find any usable vdevs, return NULL */ 8238 if (vdev_is_dead(next->l2ad_vdev) || next->l2ad_rebuild || 8239 next->l2ad_trim_all || next->l2ad_spa->spa_is_exporting) 8240 next = NULL; 8241 8242 l2arc_dev_last = next; 8243 8244 out: 8245 mutex_exit(&l2arc_dev_mtx); 8246 8247 /* 8248 * Grab the config lock to prevent the 'next' device from being 8249 * removed while we are writing to it. 8250 */ 8251 if (next != NULL) 8252 spa_config_enter(next->l2ad_spa, SCL_L2ARC, next, RW_READER); 8253 mutex_exit(&spa_namespace_lock); 8254 8255 return (next); 8256 } 8257 8258 /* 8259 * Free buffers that were tagged for destruction. 8260 */ 8261 static void 8262 l2arc_do_free_on_write(void) 8263 { 8264 l2arc_data_free_t *df; 8265 8266 mutex_enter(&l2arc_free_on_write_mtx); 8267 while ((df = list_remove_head(l2arc_free_on_write)) != NULL) { 8268 ASSERT3P(df->l2df_abd, !=, NULL); 8269 abd_free(df->l2df_abd); 8270 kmem_free(df, sizeof (l2arc_data_free_t)); 8271 } 8272 mutex_exit(&l2arc_free_on_write_mtx); 8273 } 8274 8275 /* 8276 * A write to a cache device has completed. Update all headers to allow 8277 * reads from these buffers to begin. 8278 */ 8279 static void 8280 l2arc_write_done(zio_t *zio) 8281 { 8282 l2arc_write_callback_t *cb; 8283 l2arc_lb_abd_buf_t *abd_buf; 8284 l2arc_lb_ptr_buf_t *lb_ptr_buf; 8285 l2arc_dev_t *dev; 8286 l2arc_dev_hdr_phys_t *l2dhdr; 8287 list_t *buflist; 8288 arc_buf_hdr_t *head, *hdr, *hdr_prev; 8289 kmutex_t *hash_lock; 8290 int64_t bytes_dropped = 0; 8291 8292 cb = zio->io_private; 8293 ASSERT3P(cb, !=, NULL); 8294 dev = cb->l2wcb_dev; 8295 l2dhdr = dev->l2ad_dev_hdr; 8296 ASSERT3P(dev, !=, NULL); 8297 head = cb->l2wcb_head; 8298 ASSERT3P(head, !=, NULL); 8299 buflist = &dev->l2ad_buflist; 8300 ASSERT3P(buflist, !=, NULL); 8301 DTRACE_PROBE2(l2arc__iodone, zio_t *, zio, 8302 l2arc_write_callback_t *, cb); 8303 8304 /* 8305 * All writes completed, or an error was hit. 8306 */ 8307 top: 8308 mutex_enter(&dev->l2ad_mtx); 8309 for (hdr = list_prev(buflist, head); hdr; hdr = hdr_prev) { 8310 hdr_prev = list_prev(buflist, hdr); 8311 8312 hash_lock = HDR_LOCK(hdr); 8313 8314 /* 8315 * We cannot use mutex_enter or else we can deadlock 8316 * with l2arc_write_buffers (due to swapping the order 8317 * the hash lock and l2ad_mtx are taken). 8318 */ 8319 if (!mutex_tryenter(hash_lock)) { 8320 /* 8321 * Missed the hash lock. 
We must retry so we 8322 * don't leave the ARC_FLAG_L2_WRITING bit set. 8323 */ 8324 ARCSTAT_BUMP(arcstat_l2_writes_lock_retry); 8325 8326 /* 8327 * We don't want to rescan the headers we've 8328 * already marked as having been written out, so 8329 * we reinsert the head node so we can pick up 8330 * where we left off. 8331 */ 8332 list_remove(buflist, head); 8333 list_insert_after(buflist, hdr, head); 8334 8335 mutex_exit(&dev->l2ad_mtx); 8336 8337 /* 8338 * We wait for the hash lock to become available 8339 * to try and prevent busy waiting, and increase 8340 * the chance we'll be able to acquire the lock 8341 * the next time around. 8342 */ 8343 mutex_enter(hash_lock); 8344 mutex_exit(hash_lock); 8345 goto top; 8346 } 8347 8348 /* 8349 * We could not have been moved into the arc_l2c_only 8350 * state while in-flight due to our ARC_FLAG_L2_WRITING 8351 * bit being set. Let's just ensure that's being enforced. 8352 */ 8353 ASSERT(HDR_HAS_L1HDR(hdr)); 8354 8355 /* 8356 * Skipped - drop L2ARC entry and mark the header as no 8357 * longer L2 eligible. 8358 */ 8359 if (zio->io_error != 0) { 8360 /* 8361 * Error - drop L2ARC entry. 8362 */ 8363 list_remove(buflist, hdr); 8364 arc_hdr_clear_flags(hdr, ARC_FLAG_HAS_L2HDR); 8365 8366 uint64_t psize = HDR_GET_PSIZE(hdr); 8367 l2arc_hdr_arcstats_decrement(hdr); 8368 8369 bytes_dropped += 8370 vdev_psize_to_asize(dev->l2ad_vdev, psize); 8371 (void) zfs_refcount_remove_many(&dev->l2ad_alloc, 8372 arc_hdr_size(hdr), hdr); 8373 } 8374 8375 /* 8376 * Allow ARC to begin reads and ghost list evictions to 8377 * this L2ARC entry. 8378 */ 8379 arc_hdr_clear_flags(hdr, ARC_FLAG_L2_WRITING); 8380 8381 mutex_exit(hash_lock); 8382 } 8383 8384 /* 8385 * Free the allocated abd buffers for writing the log blocks. 8386 * If the zio failed, reclaim the allocated space and remove the 8387 * pointers to these log blocks from the log block pointer list 8388 * of the L2ARC device. 8389 */ 8390 while ((abd_buf = list_remove_tail(&cb->l2wcb_abd_list)) != NULL) { 8391 abd_free(abd_buf->abd); 8392 zio_buf_free(abd_buf, sizeof (*abd_buf)); 8393 if (zio->io_error != 0) { 8394 lb_ptr_buf = list_remove_head(&dev->l2ad_lbptr_list); 8395 /* 8396 * L2BLK_GET_PSIZE returns aligned size for log 8397 * blocks. 8398 */ 8399 uint64_t asize = 8400 L2BLK_GET_PSIZE((lb_ptr_buf->lb_ptr)->lbp_prop); 8401 bytes_dropped += asize; 8402 ARCSTAT_INCR(arcstat_l2_log_blk_asize, -asize); 8403 ARCSTAT_BUMPDOWN(arcstat_l2_log_blk_count); 8404 zfs_refcount_remove_many(&dev->l2ad_lb_asize, asize, 8405 lb_ptr_buf); 8406 (void) zfs_refcount_remove(&dev->l2ad_lb_count, 8407 lb_ptr_buf); 8408 kmem_free(lb_ptr_buf->lb_ptr, 8409 sizeof (l2arc_log_blkptr_t)); 8410 kmem_free(lb_ptr_buf, sizeof (l2arc_lb_ptr_buf_t)); 8411 } 8412 } 8413 list_destroy(&cb->l2wcb_abd_list); 8414 8415 if (zio->io_error != 0) { 8416 ARCSTAT_BUMP(arcstat_l2_writes_error); 8417 8418 /* 8419 * Restore the lbps array in the header to its previous state. 8420 * If the list of log block pointers is empty, zero out the 8421 * log block pointers in the device header. 8422 */ 8423 lb_ptr_buf = list_head(&dev->l2ad_lbptr_list); 8424 for (int i = 0; i < 2; i++) { 8425 if (lb_ptr_buf == NULL) { 8426 /* 8427 * If the list is empty, zero out the device 8428 * header. Otherwise zero out the second log 8429 * block pointer in the header.
8430 */ 8431 if (i == 0) { 8432 memset(l2dhdr, 0, 8433 dev->l2ad_dev_hdr_asize); 8434 } else { 8435 memset(&l2dhdr->dh_start_lbps[i], 0, 8436 sizeof (l2arc_log_blkptr_t)); 8437 } 8438 break; 8439 } 8440 memcpy(&l2dhdr->dh_start_lbps[i], lb_ptr_buf->lb_ptr, 8441 sizeof (l2arc_log_blkptr_t)); 8442 lb_ptr_buf = list_next(&dev->l2ad_lbptr_list, 8443 lb_ptr_buf); 8444 } 8445 } 8446 8447 ARCSTAT_BUMP(arcstat_l2_writes_done); 8448 list_remove(buflist, head); 8449 ASSERT(!HDR_HAS_L1HDR(head)); 8450 kmem_cache_free(hdr_l2only_cache, head); 8451 mutex_exit(&dev->l2ad_mtx); 8452 8453 ASSERT(dev->l2ad_vdev != NULL); 8454 vdev_space_update(dev->l2ad_vdev, -bytes_dropped, 0, 0); 8455 8456 l2arc_do_free_on_write(); 8457 8458 kmem_free(cb, sizeof (l2arc_write_callback_t)); 8459 } 8460 8461 static int 8462 l2arc_untransform(zio_t *zio, l2arc_read_callback_t *cb) 8463 { 8464 int ret; 8465 spa_t *spa = zio->io_spa; 8466 arc_buf_hdr_t *hdr = cb->l2rcb_hdr; 8467 blkptr_t *bp = zio->io_bp; 8468 uint8_t salt[ZIO_DATA_SALT_LEN]; 8469 uint8_t iv[ZIO_DATA_IV_LEN]; 8470 uint8_t mac[ZIO_DATA_MAC_LEN]; 8471 boolean_t no_crypt = B_FALSE; 8472 8473 /* 8474 * ZIL data is never written to the L2ARC, so we don't need 8475 * special handling for its unique MAC storage. 8476 */ 8477 ASSERT3U(BP_GET_TYPE(bp), !=, DMU_OT_INTENT_LOG); 8478 ASSERT(MUTEX_HELD(HDR_LOCK(hdr))); 8479 ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL); 8480 8481 /* 8482 * If the data was encrypted, decrypt it now. Note that 8483 * we must check the bp here and not the hdr, since the 8484 * hdr does not have its encryption parameters updated 8485 * until arc_read_done(). 8486 */ 8487 if (BP_IS_ENCRYPTED(bp)) { 8488 abd_t *eabd = arc_get_data_abd(hdr, arc_hdr_size(hdr), hdr, 8489 ARC_HDR_USE_RESERVE); 8490 8491 zio_crypt_decode_params_bp(bp, salt, iv); 8492 zio_crypt_decode_mac_bp(bp, mac); 8493 8494 ret = spa_do_crypt_abd(B_FALSE, spa, &cb->l2rcb_zb, 8495 BP_GET_TYPE(bp), BP_GET_DEDUP(bp), BP_SHOULD_BYTESWAP(bp), 8496 salt, iv, mac, HDR_GET_PSIZE(hdr), eabd, 8497 hdr->b_l1hdr.b_pabd, &no_crypt); 8498 if (ret != 0) { 8499 arc_free_data_abd(hdr, eabd, arc_hdr_size(hdr), hdr); 8500 goto error; 8501 } 8502 8503 /* 8504 * If we actually performed decryption, replace b_pabd 8505 * with the decrypted data. Otherwise we can just throw 8506 * our decryption buffer away. 8507 */ 8508 if (!no_crypt) { 8509 arc_free_data_abd(hdr, hdr->b_l1hdr.b_pabd, 8510 arc_hdr_size(hdr), hdr); 8511 hdr->b_l1hdr.b_pabd = eabd; 8512 zio->io_abd = eabd; 8513 } else { 8514 arc_free_data_abd(hdr, eabd, arc_hdr_size(hdr), hdr); 8515 } 8516 } 8517 8518 /* 8519 * If the L2ARC block was compressed, but ARC compression 8520 * is disabled, we decompress the data into a new buffer and 8521 * replace the existing data. 8522 */ 8523 if (HDR_GET_COMPRESS(hdr) != ZIO_COMPRESS_OFF && 8524 !HDR_COMPRESSION_ENABLED(hdr)) { 8525 abd_t *cabd = arc_get_data_abd(hdr, arc_hdr_size(hdr), hdr, 8526 ARC_HDR_USE_RESERVE); 8527 8528 ret = zio_decompress_data(HDR_GET_COMPRESS(hdr), 8529 hdr->b_l1hdr.b_pabd, cabd, HDR_GET_PSIZE(hdr), 8530 HDR_GET_LSIZE(hdr), &hdr->b_complevel); 8531 if (ret != 0) { 8532 arc_free_data_abd(hdr, cabd, arc_hdr_size(hdr), hdr); 8533 goto error; 8534 } 8535 8536 arc_free_data_abd(hdr, hdr->b_l1hdr.b_pabd, 8537 arc_hdr_size(hdr), hdr); 8538 hdr->b_l1hdr.b_pabd = cabd; 8539 zio->io_abd = cabd; 8540 zio->io_size = HDR_GET_LSIZE(hdr); 8541 } 8542 8543 return (0); 8544 8545 error: 8546 return (ret); 8547 } 8548 8549 8550 /* 8551 * A read to a cache device completed.
Validate buffer contents before 8552 * handing over to the regular ARC routines. 8553 */ 8554 static void 8555 l2arc_read_done(zio_t *zio) 8556 { 8557 int tfm_error = 0; 8558 l2arc_read_callback_t *cb = zio->io_private; 8559 arc_buf_hdr_t *hdr; 8560 kmutex_t *hash_lock; 8561 boolean_t valid_cksum; 8562 boolean_t using_rdata = (BP_IS_ENCRYPTED(&cb->l2rcb_bp) && 8563 (cb->l2rcb_flags & ZIO_FLAG_RAW_ENCRYPT)); 8564 8565 ASSERT3P(zio->io_vd, !=, NULL); 8566 ASSERT(zio->io_flags & ZIO_FLAG_DONT_PROPAGATE); 8567 8568 spa_config_exit(zio->io_spa, SCL_L2ARC, zio->io_vd); 8569 8570 ASSERT3P(cb, !=, NULL); 8571 hdr = cb->l2rcb_hdr; 8572 ASSERT3P(hdr, !=, NULL); 8573 8574 hash_lock = HDR_LOCK(hdr); 8575 mutex_enter(hash_lock); 8576 ASSERT3P(hash_lock, ==, HDR_LOCK(hdr)); 8577 8578 /* 8579 * If the data was read into a temporary buffer, 8580 * move it and free the buffer. 8581 */ 8582 if (cb->l2rcb_abd != NULL) { 8583 ASSERT3U(arc_hdr_size(hdr), <, zio->io_size); 8584 if (zio->io_error == 0) { 8585 if (using_rdata) { 8586 abd_copy(hdr->b_crypt_hdr.b_rabd, 8587 cb->l2rcb_abd, arc_hdr_size(hdr)); 8588 } else { 8589 abd_copy(hdr->b_l1hdr.b_pabd, 8590 cb->l2rcb_abd, arc_hdr_size(hdr)); 8591 } 8592 } 8593 8594 /* 8595 * The following must be done regardless of whether 8596 * there was an error: 8597 * - free the temporary buffer 8598 * - point zio to the real ARC buffer 8599 * - set zio size accordingly 8600 * These are required because zio is either re-used for 8601 * an I/O of the block in the case of the error 8602 * or the zio is passed to arc_read_done() and it 8603 * needs real data. 8604 */ 8605 abd_free(cb->l2rcb_abd); 8606 zio->io_size = zio->io_orig_size = arc_hdr_size(hdr); 8607 8608 if (using_rdata) { 8609 ASSERT(HDR_HAS_RABD(hdr)); 8610 zio->io_abd = zio->io_orig_abd = 8611 hdr->b_crypt_hdr.b_rabd; 8612 } else { 8613 ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL); 8614 zio->io_abd = zio->io_orig_abd = hdr->b_l1hdr.b_pabd; 8615 } 8616 } 8617 8618 ASSERT3P(zio->io_abd, !=, NULL); 8619 8620 /* 8621 * Check this survived the L2ARC journey. 8622 */ 8623 ASSERT(zio->io_abd == hdr->b_l1hdr.b_pabd || 8624 (HDR_HAS_RABD(hdr) && zio->io_abd == hdr->b_crypt_hdr.b_rabd)); 8625 zio->io_bp_copy = cb->l2rcb_bp; /* XXX fix in L2ARC 2.0 */ 8626 zio->io_bp = &zio->io_bp_copy; /* XXX fix in L2ARC 2.0 */ 8627 zio->io_prop.zp_complevel = hdr->b_complevel; 8628 8629 valid_cksum = arc_cksum_is_equal(hdr, zio); 8630 8631 /* 8632 * b_rabd will always match the data as it exists on disk if it is 8633 * being used. Therefore if we are reading into b_rabd we do not 8634 * attempt to untransform the data. 8635 */ 8636 if (valid_cksum && !using_rdata) 8637 tfm_error = l2arc_untransform(zio, cb); 8638 8639 if (valid_cksum && tfm_error == 0 && zio->io_error == 0 && 8640 !HDR_L2_EVICTED(hdr)) { 8641 mutex_exit(hash_lock); 8642 zio->io_private = hdr; 8643 arc_read_done(zio); 8644 } else { 8645 /* 8646 * Buffer didn't survive caching. Increment stats and 8647 * reissue to the original storage device. 8648 */ 8649 if (zio->io_error != 0) { 8650 ARCSTAT_BUMP(arcstat_l2_io_error); 8651 } else { 8652 zio->io_error = SET_ERROR(EIO); 8653 } 8654 if (!valid_cksum || tfm_error != 0) 8655 ARCSTAT_BUMP(arcstat_l2_cksum_bad); 8656 8657 /* 8658 * If there's no waiter, issue an async i/o to the primary 8659 * storage now. If there *is* a waiter, the caller must 8660 * issue the i/o in a context where it's OK to block. 8661 */ 8662 if (zio->io_waiter == NULL) { 8663 zio_t *pio = zio_unique_parent(zio); 8664 void *abd = (using_rdata) ? 
8665 hdr->b_crypt_hdr.b_rabd : hdr->b_l1hdr.b_pabd; 8666 8667 ASSERT(!pio || pio->io_child_type == ZIO_CHILD_LOGICAL); 8668 8669 zio = zio_read(pio, zio->io_spa, zio->io_bp, 8670 abd, zio->io_size, arc_read_done, 8671 hdr, zio->io_priority, cb->l2rcb_flags, 8672 &cb->l2rcb_zb); 8673 8674 /* 8675 * Original ZIO will be freed, so we need to update 8676 * ARC header with the new ZIO pointer to be used 8677 * by zio_change_priority() in arc_read(). 8678 */ 8679 for (struct arc_callback *acb = hdr->b_l1hdr.b_acb; 8680 acb != NULL; acb = acb->acb_next) 8681 acb->acb_zio_head = zio; 8682 8683 mutex_exit(hash_lock); 8684 zio_nowait(zio); 8685 } else { 8686 mutex_exit(hash_lock); 8687 } 8688 } 8689 8690 kmem_free(cb, sizeof (l2arc_read_callback_t)); 8691 } 8692 8693 /* 8694 * This is the list priority from which the L2ARC will search for pages to 8695 * cache. This is used within loops (0..3) to cycle through lists in the 8696 * desired order. This order can have a significant effect on cache 8697 * performance. 8698 * 8699 * Currently the metadata lists are hit first, MFU then MRU, followed by 8700 * the data lists. This function returns a locked list, and also returns 8701 * the lock pointer. 8702 */ 8703 static multilist_sublist_t * 8704 l2arc_sublist_lock(int list_num) 8705 { 8706 multilist_t *ml = NULL; 8707 unsigned int idx; 8708 8709 ASSERT(list_num >= 0 && list_num < L2ARC_FEED_TYPES); 8710 8711 switch (list_num) { 8712 case 0: 8713 ml = &arc_mfu->arcs_list[ARC_BUFC_METADATA]; 8714 break; 8715 case 1: 8716 ml = &arc_mru->arcs_list[ARC_BUFC_METADATA]; 8717 break; 8718 case 2: 8719 ml = &arc_mfu->arcs_list[ARC_BUFC_DATA]; 8720 break; 8721 case 3: 8722 ml = &arc_mru->arcs_list[ARC_BUFC_DATA]; 8723 break; 8724 default: 8725 return (NULL); 8726 } 8727 8728 /* 8729 * Return a randomly-selected sublist. This is acceptable 8730 * because the caller feeds only a little bit of data for each 8731 * call (8MB). Subsequent calls will result in different 8732 * sublists being selected. 8733 */ 8734 idx = multilist_get_random_index(ml); 8735 return (multilist_sublist_lock_idx(ml, idx)); 8736 } 8737 8738 /* 8739 * Calculates the maximum overhead of L2ARC metadata log blocks for a given 8740 * L2ARC write size. l2arc_evict and l2arc_write_size need to include this 8741 * overhead in processing to make sure there is enough headroom available 8742 * when writing buffers. 8743 */ 8744 static inline uint64_t 8745 l2arc_log_blk_overhead(uint64_t write_sz, l2arc_dev_t *dev) 8746 { 8747 if (dev->l2ad_log_entries == 0) { 8748 return (0); 8749 } else { 8750 uint64_t log_entries = write_sz >> SPA_MINBLOCKSHIFT; 8751 8752 uint64_t log_blocks = (log_entries + 8753 dev->l2ad_log_entries - 1) / 8754 dev->l2ad_log_entries; 8755 8756 return (vdev_psize_to_asize(dev->l2ad_vdev, 8757 sizeof (l2arc_log_blk_phys_t)) * log_blocks); 8758 } 8759 } 8760 8761 /* 8762 * Evict buffers from the device write hand to the distance specified in 8763 * bytes. This distance may span populated buffers, it may span nothing. 8764 * This is clearing a region on the L2ARC device ready for writing. 8765 * If the 'all' boolean is set, every buffer is evicted. 
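 *
 * In sketch form (a simplification of the logic below, not a second
 * implementation), the target address for the common !all case is
 *
 *	taddr = MIN(dev->l2ad_hand + distance, dev->l2ad_end);
 *
 * and if the requested distance would run past l2ad_end, the function
 * evicts to the end of the device, resets both the write and evict hands
 * to l2ad_start, and iterates once more.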
8766 */ 8767 static void 8768 l2arc_evict(l2arc_dev_t *dev, uint64_t distance, boolean_t all) 8769 { 8770 list_t *buflist; 8771 arc_buf_hdr_t *hdr, *hdr_prev; 8772 kmutex_t *hash_lock; 8773 uint64_t taddr; 8774 l2arc_lb_ptr_buf_t *lb_ptr_buf, *lb_ptr_buf_prev; 8775 vdev_t *vd = dev->l2ad_vdev; 8776 boolean_t rerun; 8777 8778 buflist = &dev->l2ad_buflist; 8779 8780 top: 8781 rerun = B_FALSE; 8782 if (dev->l2ad_hand + distance > dev->l2ad_end) { 8783 /* 8784 * When there is no space to accommodate upcoming writes, 8785 * evict to the end. Then bump the write and evict hands 8786 * to the start and iterate. This iteration does not 8787 * happen indefinitely as we make sure in 8788 * l2arc_write_size() that when the write hand is reset, 8789 * the write size does not exceed the end of the device. 8790 */ 8791 rerun = B_TRUE; 8792 taddr = dev->l2ad_end; 8793 } else { 8794 taddr = dev->l2ad_hand + distance; 8795 } 8796 DTRACE_PROBE4(l2arc__evict, l2arc_dev_t *, dev, list_t *, buflist, 8797 uint64_t, taddr, boolean_t, all); 8798 8799 if (!all) { 8800 /* 8801 * This check has to be placed after deciding whether to 8802 * iterate (rerun). 8803 */ 8804 if (dev->l2ad_first) { 8805 /* 8806 * This is the first sweep through the device. There is 8807 * nothing to evict. We have already trimmed the 8808 * whole device. 8809 */ 8810 goto out; 8811 } else { 8812 /* 8813 * Trim the space to be evicted. 8814 */ 8815 if (vd->vdev_has_trim && dev->l2ad_evict < taddr && 8816 l2arc_trim_ahead > 0) { 8817 /* 8818 * We have to drop the spa_config lock because 8819 * vdev_trim_range() will acquire it. 8820 * l2ad_evict already accounts for the label 8821 * size. To prevent vdev_trim_ranges() from 8822 * adding it again, we subtract it from 8823 * l2ad_evict. 8824 */ 8825 spa_config_exit(dev->l2ad_spa, SCL_L2ARC, dev); 8826 vdev_trim_simple(vd, 8827 dev->l2ad_evict - VDEV_LABEL_START_SIZE, 8828 taddr - dev->l2ad_evict); 8829 spa_config_enter(dev->l2ad_spa, SCL_L2ARC, dev, 8830 RW_READER); 8831 } 8832 8833 /* 8834 * When rebuilding L2ARC we retrieve the evict hand 8835 * from the header of the device. Of note, l2arc_evict() 8836 * does not actually delete buffers from the cache 8837 * device, but trimming may do so depending on the 8838 * hardware implementation. Thus keeping track of the 8839 * evict hand is useful. 8840 */ 8841 dev->l2ad_evict = MAX(dev->l2ad_evict, taddr); 8842 } 8843 } 8844 8845 retry: 8846 mutex_enter(&dev->l2ad_mtx); 8847 /* 8848 * We have to account for evicted log blocks. Run vdev_space_update() 8849 * on log blocks whose offset (in bytes) is before the evicted offset 8850 * (in bytes) by searching in the list of pointers to log blocks 8851 * present in the L2ARC device. 8852 */ 8853 for (lb_ptr_buf = list_tail(&dev->l2ad_lbptr_list); lb_ptr_buf; 8854 lb_ptr_buf = lb_ptr_buf_prev) { 8855 8856 lb_ptr_buf_prev = list_prev(&dev->l2ad_lbptr_list, lb_ptr_buf); 8857 8858 /* L2BLK_GET_PSIZE returns aligned size for log blocks */ 8859 uint64_t asize = L2BLK_GET_PSIZE( 8860 (lb_ptr_buf->lb_ptr)->lbp_prop); 8861 8862 /* 8863 * We don't worry about log blocks left behind (i.e. 8864 * lbp_payload_start < l2ad_hand) because l2arc_write_buffers() 8865 * will never write more than l2arc_evict() evicts.
8866 */ 8867 if (!all && l2arc_log_blkptr_valid(dev, lb_ptr_buf->lb_ptr)) { 8868 break; 8869 } else { 8870 vdev_space_update(vd, -asize, 0, 0); 8871 ARCSTAT_INCR(arcstat_l2_log_blk_asize, -asize); 8872 ARCSTAT_BUMPDOWN(arcstat_l2_log_blk_count); 8873 zfs_refcount_remove_many(&dev->l2ad_lb_asize, asize, 8874 lb_ptr_buf); 8875 (void) zfs_refcount_remove(&dev->l2ad_lb_count, 8876 lb_ptr_buf); 8877 list_remove(&dev->l2ad_lbptr_list, lb_ptr_buf); 8878 kmem_free(lb_ptr_buf->lb_ptr, 8879 sizeof (l2arc_log_blkptr_t)); 8880 kmem_free(lb_ptr_buf, sizeof (l2arc_lb_ptr_buf_t)); 8881 } 8882 } 8883 8884 for (hdr = list_tail(buflist); hdr; hdr = hdr_prev) { 8885 hdr_prev = list_prev(buflist, hdr); 8886 8887 ASSERT(!HDR_EMPTY(hdr)); 8888 hash_lock = HDR_LOCK(hdr); 8889 8890 /* 8891 * We cannot use mutex_enter or else we can deadlock 8892 * with l2arc_write_buffers (due to swapping the order 8893 * the hash lock and l2ad_mtx are taken). 8894 */ 8895 if (!mutex_tryenter(hash_lock)) { 8896 /* 8897 * Missed the hash lock. Retry. 8898 */ 8899 ARCSTAT_BUMP(arcstat_l2_evict_lock_retry); 8900 mutex_exit(&dev->l2ad_mtx); 8901 mutex_enter(hash_lock); 8902 mutex_exit(hash_lock); 8903 goto retry; 8904 } 8905 8906 /* 8907 * A header can't be on this list if it doesn't have L2 header. 8908 */ 8909 ASSERT(HDR_HAS_L2HDR(hdr)); 8910 8911 /* Ensure this header has finished being written. */ 8912 ASSERT(!HDR_L2_WRITING(hdr)); 8913 ASSERT(!HDR_L2_WRITE_HEAD(hdr)); 8914 8915 if (!all && (hdr->b_l2hdr.b_daddr >= dev->l2ad_evict || 8916 hdr->b_l2hdr.b_daddr < dev->l2ad_hand)) { 8917 /* 8918 * We've evicted to the target address, 8919 * or the end of the device. 8920 */ 8921 mutex_exit(hash_lock); 8922 break; 8923 } 8924 8925 if (!HDR_HAS_L1HDR(hdr)) { 8926 ASSERT(!HDR_L2_READING(hdr)); 8927 /* 8928 * This doesn't exist in the ARC. Destroy. 8929 * arc_hdr_destroy() will call list_remove() 8930 * and decrement arcstat_l2_lsize. 8931 */ 8932 arc_change_state(arc_anon, hdr); 8933 arc_hdr_destroy(hdr); 8934 } else { 8935 ASSERT(hdr->b_l1hdr.b_state != arc_l2c_only); 8936 ARCSTAT_BUMP(arcstat_l2_evict_l1cached); 8937 /* 8938 * Invalidate issued or about to be issued 8939 * reads, since we may be about to write 8940 * over this location. 8941 */ 8942 if (HDR_L2_READING(hdr)) { 8943 ARCSTAT_BUMP(arcstat_l2_evict_reading); 8944 arc_hdr_set_flags(hdr, ARC_FLAG_L2_EVICTED); 8945 } 8946 8947 arc_hdr_l2hdr_destroy(hdr); 8948 } 8949 mutex_exit(hash_lock); 8950 } 8951 mutex_exit(&dev->l2ad_mtx); 8952 8953 out: 8954 /* 8955 * We need to check if we evict all buffers, otherwise we may iterate 8956 * unnecessarily. 8957 */ 8958 if (!all && rerun) { 8959 /* 8960 * Bump device hand to the device start if it is approaching the 8961 * end. l2arc_evict() has already evicted ahead for this case. 8962 */ 8963 dev->l2ad_hand = dev->l2ad_start; 8964 dev->l2ad_evict = dev->l2ad_start; 8965 dev->l2ad_first = B_FALSE; 8966 goto top; 8967 } 8968 8969 if (!all) { 8970 /* 8971 * In case of cache device removal (all) the following 8972 * assertions may be violated without functional consequences 8973 * as the device is about to be removed. 8974 */ 8975 ASSERT3U(dev->l2ad_hand + distance, <=, dev->l2ad_end); 8976 if (!dev->l2ad_first) 8977 ASSERT3U(dev->l2ad_hand, <=, dev->l2ad_evict); 8978 } 8979 } 8980 8981 /* 8982 * Handle any abd transforms that might be required for writing to the L2ARC. 8983 * If successful, this function will always return an abd with the data 8984 * transformed as it is on disk in a new abd of asize bytes. 
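 *
 * As a hedged usage sketch (mirroring the caller in l2arc_write_buffers()
 * further below), the returned abd is queued for free-on-write so that it
 * outlives the write zio:
 *
 *	abd_t *to_write;
 *	if (l2arc_apply_transforms(spa, hdr, asize, &to_write) != 0) {
 *		arc_hdr_clear_flags(hdr, ARC_FLAG_L2CACHE);
 *		(skip this header)
 *	} else {
 *		l2arc_free_abd_on_write(to_write, asize, arc_buf_type(hdr));
 *	}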
8985 */ 8986 static int 8987 l2arc_apply_transforms(spa_t *spa, arc_buf_hdr_t *hdr, uint64_t asize, 8988 abd_t **abd_out) 8989 { 8990 int ret; 8991 abd_t *cabd = NULL, *eabd = NULL, *to_write = hdr->b_l1hdr.b_pabd; 8992 enum zio_compress compress = HDR_GET_COMPRESS(hdr); 8993 uint64_t psize = HDR_GET_PSIZE(hdr); 8994 uint64_t size = arc_hdr_size(hdr); 8995 boolean_t ismd = HDR_ISTYPE_METADATA(hdr); 8996 boolean_t bswap = (hdr->b_l1hdr.b_byteswap != DMU_BSWAP_NUMFUNCS); 8997 dsl_crypto_key_t *dck = NULL; 8998 uint8_t mac[ZIO_DATA_MAC_LEN] = { 0 }; 8999 boolean_t no_crypt = B_FALSE; 9000 9001 ASSERT((HDR_GET_COMPRESS(hdr) != ZIO_COMPRESS_OFF && 9002 !HDR_COMPRESSION_ENABLED(hdr)) || 9003 HDR_ENCRYPTED(hdr) || HDR_SHARED_DATA(hdr) || psize != asize); 9004 ASSERT3U(psize, <=, asize); 9005 9006 /* 9007 * If this data simply needs its own buffer, we simply allocate it 9008 * and copy the data. This may be done to eliminate a dependency on a 9009 * shared buffer or to reallocate the buffer to match asize. 9010 */ 9011 if (HDR_HAS_RABD(hdr)) { 9012 ASSERT3U(asize, >, psize); 9013 to_write = abd_alloc_for_io(asize, ismd); 9014 abd_copy(to_write, hdr->b_crypt_hdr.b_rabd, psize); 9015 abd_zero_off(to_write, psize, asize - psize); 9016 goto out; 9017 } 9018 9019 if ((compress == ZIO_COMPRESS_OFF || HDR_COMPRESSION_ENABLED(hdr)) && 9020 !HDR_ENCRYPTED(hdr)) { 9021 ASSERT3U(size, ==, psize); 9022 to_write = abd_alloc_for_io(asize, ismd); 9023 abd_copy(to_write, hdr->b_l1hdr.b_pabd, size); 9024 if (asize > size) 9025 abd_zero_off(to_write, size, asize - size); 9026 goto out; 9027 } 9028 9029 if (compress != ZIO_COMPRESS_OFF && !HDR_COMPRESSION_ENABLED(hdr)) { 9030 cabd = abd_alloc_for_io(MAX(size, asize), ismd); 9031 uint64_t csize = zio_compress_data(compress, to_write, &cabd, 9032 size, hdr->b_complevel); 9033 if (csize > psize) { 9034 /* 9035 * We can't re-compress the block into the original 9036 * psize. Even if it fits into asize, it does not 9037 * matter, since checksum will never match on read. 9038 */ 9039 abd_free(cabd); 9040 return (SET_ERROR(EIO)); 9041 } 9042 if (asize > csize) 9043 abd_zero_off(cabd, csize, asize - csize); 9044 to_write = cabd; 9045 } 9046 9047 if (HDR_ENCRYPTED(hdr)) { 9048 eabd = abd_alloc_for_io(asize, ismd); 9049 9050 /* 9051 * If the dataset was disowned before the buffer 9052 * made it to this point, the key to re-encrypt 9053 * it won't be available. In this case we simply 9054 * won't write the buffer to the L2ARC. 
9055 */ 9056 ret = spa_keystore_lookup_key(spa, hdr->b_crypt_hdr.b_dsobj, 9057 FTAG, &dck); 9058 if (ret != 0) 9059 goto error; 9060 9061 ret = zio_do_crypt_abd(B_TRUE, &dck->dck_key, 9062 hdr->b_crypt_hdr.b_ot, bswap, hdr->b_crypt_hdr.b_salt, 9063 hdr->b_crypt_hdr.b_iv, mac, psize, to_write, eabd, 9064 &no_crypt); 9065 if (ret != 0) 9066 goto error; 9067 9068 if (no_crypt) 9069 abd_copy(eabd, to_write, psize); 9070 9071 if (psize != asize) 9072 abd_zero_off(eabd, psize, asize - psize); 9073 9074 /* assert that the MAC we got here matches the one we saved */ 9075 ASSERT0(memcmp(mac, hdr->b_crypt_hdr.b_mac, ZIO_DATA_MAC_LEN)); 9076 spa_keystore_dsl_key_rele(spa, dck, FTAG); 9077 9078 if (to_write == cabd) 9079 abd_free(cabd); 9080 9081 to_write = eabd; 9082 } 9083 9084 out: 9085 ASSERT3P(to_write, !=, hdr->b_l1hdr.b_pabd); 9086 *abd_out = to_write; 9087 return (0); 9088 9089 error: 9090 if (dck != NULL) 9091 spa_keystore_dsl_key_rele(spa, dck, FTAG); 9092 if (cabd != NULL) 9093 abd_free(cabd); 9094 if (eabd != NULL) 9095 abd_free(eabd); 9096 9097 *abd_out = NULL; 9098 return (ret); 9099 } 9100 9101 static void 9102 l2arc_blk_fetch_done(zio_t *zio) 9103 { 9104 l2arc_read_callback_t *cb; 9105 9106 cb = zio->io_private; 9107 if (cb->l2rcb_abd != NULL) 9108 abd_free(cb->l2rcb_abd); 9109 kmem_free(cb, sizeof (l2arc_read_callback_t)); 9110 } 9111 9112 /* 9113 * Find and write ARC buffers to the L2ARC device. 9114 * 9115 * An ARC_FLAG_L2_WRITING flag is set so that the L2ARC buffers are not valid 9116 * for reading until they have completed writing. 9117 * The headroom_boost is an in-out parameter used to maintain headroom boost 9118 * state between calls to this function. 9119 * 9120 * Returns the number of bytes actually written (which may be smaller than 9121 * the delta by which the device hand has changed due to alignment and the 9122 * writing of log blocks). 9123 */ 9124 static uint64_t 9125 l2arc_write_buffers(spa_t *spa, l2arc_dev_t *dev, uint64_t target_sz) 9126 { 9127 arc_buf_hdr_t *hdr, *head, *marker; 9128 uint64_t write_asize, write_psize, headroom; 9129 boolean_t full, from_head = !arc_warm; 9130 l2arc_write_callback_t *cb = NULL; 9131 zio_t *pio, *wzio; 9132 uint64_t guid = spa_load_guid(spa); 9133 l2arc_dev_hdr_phys_t *l2dhdr = dev->l2ad_dev_hdr; 9134 9135 ASSERT3P(dev->l2ad_vdev, !=, NULL); 9136 9137 pio = NULL; 9138 write_asize = write_psize = 0; 9139 full = B_FALSE; 9140 head = kmem_cache_alloc(hdr_l2only_cache, KM_PUSHPAGE); 9141 arc_hdr_set_flags(head, ARC_FLAG_L2_WRITE_HEAD | ARC_FLAG_HAS_L2HDR); 9142 marker = arc_state_alloc_marker(); 9143 9144 /* 9145 * Copy buffers for L2ARC writing. 9146 */ 9147 for (int pass = 0; pass < L2ARC_FEED_TYPES; pass++) { 9148 /* 9149 * pass == 0: MFU meta 9150 * pass == 1: MRU meta 9151 * pass == 2: MFU data 9152 * pass == 3: MRU data 9153 */ 9154 if (l2arc_mfuonly == 1) { 9155 if (pass == 1 || pass == 3) 9156 continue; 9157 } else if (l2arc_mfuonly > 1) { 9158 if (pass == 3) 9159 continue; 9160 } 9161 9162 uint64_t passed_sz = 0; 9163 headroom = target_sz * l2arc_headroom; 9164 if (zfs_compressed_arc_enabled) 9165 headroom = (headroom * l2arc_headroom_boost) / 100; 9166 9167 /* 9168 * Until the ARC is warm and starts to evict, read from the 9169 * head of the ARC lists rather than the tail. 
9170 */ 9171 multilist_sublist_t *mls = l2arc_sublist_lock(pass); 9172 ASSERT3P(mls, !=, NULL); 9173 if (from_head) 9174 hdr = multilist_sublist_head(mls); 9175 else 9176 hdr = multilist_sublist_tail(mls); 9177 9178 while (hdr != NULL) { 9179 kmutex_t *hash_lock; 9180 abd_t *to_write = NULL; 9181 9182 hash_lock = HDR_LOCK(hdr); 9183 if (!mutex_tryenter(hash_lock)) { 9184 skip: 9185 /* Skip this buffer rather than waiting. */ 9186 if (from_head) 9187 hdr = multilist_sublist_next(mls, hdr); 9188 else 9189 hdr = multilist_sublist_prev(mls, hdr); 9190 continue; 9191 } 9192 9193 passed_sz += HDR_GET_LSIZE(hdr); 9194 if (l2arc_headroom != 0 && passed_sz > headroom) { 9195 /* 9196 * Searched too far. 9197 */ 9198 mutex_exit(hash_lock); 9199 break; 9200 } 9201 9202 if (!l2arc_write_eligible(guid, hdr)) { 9203 mutex_exit(hash_lock); 9204 goto skip; 9205 } 9206 9207 ASSERT(HDR_HAS_L1HDR(hdr)); 9208 ASSERT3U(HDR_GET_PSIZE(hdr), >, 0); 9209 ASSERT3U(arc_hdr_size(hdr), >, 0); 9210 ASSERT(hdr->b_l1hdr.b_pabd != NULL || 9211 HDR_HAS_RABD(hdr)); 9212 uint64_t psize = HDR_GET_PSIZE(hdr); 9213 uint64_t asize = vdev_psize_to_asize(dev->l2ad_vdev, 9214 psize); 9215 9216 /* 9217 * If the allocated size of this buffer plus the max 9218 * size for the pending log block exceeds the evicted 9219 * target size, terminate writing buffers for this run. 9220 */ 9221 if (write_asize + asize + 9222 sizeof (l2arc_log_blk_phys_t) > target_sz) { 9223 full = B_TRUE; 9224 mutex_exit(hash_lock); 9225 break; 9226 } 9227 9228 /* 9229 * We should not sleep with sublist lock held or it 9230 * may block ARC eviction. Insert a marker to save 9231 * the position and drop the lock. 9232 */ 9233 if (from_head) { 9234 multilist_sublist_insert_after(mls, hdr, 9235 marker); 9236 } else { 9237 multilist_sublist_insert_before(mls, hdr, 9238 marker); 9239 } 9240 multilist_sublist_unlock(mls); 9241 9242 /* 9243 * If this header has b_rabd, we can use this since it 9244 * must always match the data exactly as it exists on 9245 * disk. Otherwise, the L2ARC can normally use the 9246 * hdr's data, but if we're sharing data between the 9247 * hdr and one of its bufs, L2ARC needs its own copy of 9248 * the data so that the ZIO below can't race with the 9249 * buf consumer. To ensure that this copy will be 9250 * available for the lifetime of the ZIO and be cleaned 9251 * up afterwards, we add it to the l2arc_free_on_write 9252 * queue. If we need to apply any transforms to the 9253 * data (compression, encryption) we will also need the 9254 * extra buffer. 
9255 */ 9256 if (HDR_HAS_RABD(hdr) && psize == asize) { 9257 to_write = hdr->b_crypt_hdr.b_rabd; 9258 } else if ((HDR_COMPRESSION_ENABLED(hdr) || 9259 HDR_GET_COMPRESS(hdr) == ZIO_COMPRESS_OFF) && 9260 !HDR_ENCRYPTED(hdr) && !HDR_SHARED_DATA(hdr) && 9261 psize == asize) { 9262 to_write = hdr->b_l1hdr.b_pabd; 9263 } else { 9264 int ret; 9265 arc_buf_contents_t type = arc_buf_type(hdr); 9266 9267 ret = l2arc_apply_transforms(spa, hdr, asize, 9268 &to_write); 9269 if (ret != 0) { 9270 arc_hdr_clear_flags(hdr, 9271 ARC_FLAG_L2CACHE); 9272 mutex_exit(hash_lock); 9273 goto next; 9274 } 9275 9276 l2arc_free_abd_on_write(to_write, asize, type); 9277 } 9278 9279 hdr->b_l2hdr.b_dev = dev; 9280 hdr->b_l2hdr.b_daddr = dev->l2ad_hand; 9281 hdr->b_l2hdr.b_hits = 0; 9282 hdr->b_l2hdr.b_arcs_state = 9283 hdr->b_l1hdr.b_state->arcs_state; 9284 mutex_enter(&dev->l2ad_mtx); 9285 if (pio == NULL) { 9286 /* 9287 * Insert a dummy header on the buflist so 9288 * l2arc_write_done() can find where the 9289 * write buffers begin without searching. 9290 */ 9291 list_insert_head(&dev->l2ad_buflist, head); 9292 } 9293 list_insert_head(&dev->l2ad_buflist, hdr); 9294 mutex_exit(&dev->l2ad_mtx); 9295 arc_hdr_set_flags(hdr, ARC_FLAG_HAS_L2HDR | 9296 ARC_FLAG_L2_WRITING); 9297 9298 (void) zfs_refcount_add_many(&dev->l2ad_alloc, 9299 arc_hdr_size(hdr), hdr); 9300 l2arc_hdr_arcstats_increment(hdr); 9301 9302 boolean_t commit = l2arc_log_blk_insert(dev, hdr); 9303 mutex_exit(hash_lock); 9304 9305 if (pio == NULL) { 9306 cb = kmem_alloc( 9307 sizeof (l2arc_write_callback_t), KM_SLEEP); 9308 cb->l2wcb_dev = dev; 9309 cb->l2wcb_head = head; 9310 list_create(&cb->l2wcb_abd_list, 9311 sizeof (l2arc_lb_abd_buf_t), 9312 offsetof(l2arc_lb_abd_buf_t, node)); 9313 pio = zio_root(spa, l2arc_write_done, cb, 9314 ZIO_FLAG_CANFAIL); 9315 } 9316 9317 wzio = zio_write_phys(pio, dev->l2ad_vdev, 9318 dev->l2ad_hand, asize, to_write, 9319 ZIO_CHECKSUM_OFF, NULL, hdr, 9320 ZIO_PRIORITY_ASYNC_WRITE, 9321 ZIO_FLAG_CANFAIL, B_FALSE); 9322 9323 DTRACE_PROBE2(l2arc__write, vdev_t *, dev->l2ad_vdev, 9324 zio_t *, wzio); 9325 zio_nowait(wzio); 9326 9327 write_psize += psize; 9328 write_asize += asize; 9329 dev->l2ad_hand += asize; 9330 vdev_space_update(dev->l2ad_vdev, asize, 0, 0); 9331 9332 if (commit) { 9333 /* l2ad_hand will be adjusted inside. */ 9334 write_asize += 9335 l2arc_log_blk_commit(dev, pio, cb); 9336 } 9337 9338 next: 9339 multilist_sublist_lock(mls); 9340 if (from_head) 9341 hdr = multilist_sublist_next(mls, marker); 9342 else 9343 hdr = multilist_sublist_prev(mls, marker); 9344 multilist_sublist_remove(mls, marker); 9345 } 9346 9347 multilist_sublist_unlock(mls); 9348 9349 if (full == B_TRUE) 9350 break; 9351 } 9352 9353 arc_state_free_marker(marker); 9354 9355 /* No buffers selected for writing? */ 9356 if (pio == NULL) { 9357 ASSERT0(write_psize); 9358 ASSERT(!HDR_HAS_L1HDR(head)); 9359 kmem_cache_free(hdr_l2only_cache, head); 9360 9361 /* 9362 * Although we did not write any buffers l2ad_evict may 9363 * have advanced. 
9364 */ 9365 if (dev->l2ad_evict != l2dhdr->dh_evict) 9366 l2arc_dev_hdr_update(dev); 9367 9368 return (0); 9369 } 9370 9371 if (!dev->l2ad_first) 9372 ASSERT3U(dev->l2ad_hand, <=, dev->l2ad_evict); 9373 9374 ASSERT3U(write_asize, <=, target_sz); 9375 ARCSTAT_BUMP(arcstat_l2_writes_sent); 9376 ARCSTAT_INCR(arcstat_l2_write_bytes, write_psize); 9377 9378 dev->l2ad_writing = B_TRUE; 9379 (void) zio_wait(pio); 9380 dev->l2ad_writing = B_FALSE; 9381 9382 /* 9383 * Update the device header after the zio completes as 9384 * l2arc_write_done() may have updated the memory holding the log block 9385 * pointers in the device header. 9386 */ 9387 l2arc_dev_hdr_update(dev); 9388 9389 return (write_asize); 9390 } 9391 9392 static boolean_t 9393 l2arc_hdr_limit_reached(void) 9394 { 9395 int64_t s = aggsum_upper_bound(&arc_sums.arcstat_l2_hdr_size); 9396 9397 return (arc_reclaim_needed() || 9398 (s > (arc_warm ? arc_c : arc_c_max) * l2arc_meta_percent / 100)); 9399 } 9400 9401 /* 9402 * This thread feeds the L2ARC at regular intervals. This is the beating 9403 * heart of the L2ARC. 9404 */ 9405 static __attribute__((noreturn)) void 9406 l2arc_feed_thread(void *unused) 9407 { 9408 (void) unused; 9409 callb_cpr_t cpr; 9410 l2arc_dev_t *dev; 9411 spa_t *spa; 9412 uint64_t size, wrote; 9413 clock_t begin, next = ddi_get_lbolt(); 9414 fstrans_cookie_t cookie; 9415 9416 CALLB_CPR_INIT(&cpr, &l2arc_feed_thr_lock, callb_generic_cpr, FTAG); 9417 9418 mutex_enter(&l2arc_feed_thr_lock); 9419 9420 cookie = spl_fstrans_mark(); 9421 while (l2arc_thread_exit == 0) { 9422 CALLB_CPR_SAFE_BEGIN(&cpr); 9423 (void) cv_timedwait_idle(&l2arc_feed_thr_cv, 9424 &l2arc_feed_thr_lock, next); 9425 CALLB_CPR_SAFE_END(&cpr, &l2arc_feed_thr_lock); 9426 next = ddi_get_lbolt() + hz; 9427 9428 /* 9429 * Quick check for L2ARC devices. 9430 */ 9431 mutex_enter(&l2arc_dev_mtx); 9432 if (l2arc_ndev == 0) { 9433 mutex_exit(&l2arc_dev_mtx); 9434 continue; 9435 } 9436 mutex_exit(&l2arc_dev_mtx); 9437 begin = ddi_get_lbolt(); 9438 9439 /* 9440 * This selects the next l2arc device to write to, and in 9441 * doing so the next spa to feed from: dev->l2ad_spa. This 9442 * will return NULL if there are now no l2arc devices or if 9443 * they are all faulted. 9444 * 9445 * If a device is returned, its spa's config lock is also 9446 * held to prevent device removal. l2arc_dev_get_next() 9447 * will grab and release l2arc_dev_mtx. 9448 */ 9449 if ((dev = l2arc_dev_get_next()) == NULL) 9450 continue; 9451 9452 spa = dev->l2ad_spa; 9453 ASSERT3P(spa, !=, NULL); 9454 9455 /* 9456 * If the pool is read-only then force the feed thread to 9457 * sleep a little longer. 9458 */ 9459 if (!spa_writeable(spa)) { 9460 next = ddi_get_lbolt() + 5 * l2arc_feed_secs * hz; 9461 spa_config_exit(spa, SCL_L2ARC, dev); 9462 continue; 9463 } 9464 9465 /* 9466 * Avoid contributing to memory pressure. 9467 */ 9468 if (l2arc_hdr_limit_reached()) { 9469 ARCSTAT_BUMP(arcstat_l2_abort_lowmem); 9470 spa_config_exit(spa, SCL_L2ARC, dev); 9471 continue; 9472 } 9473 9474 ARCSTAT_BUMP(arcstat_l2_feeds); 9475 9476 size = l2arc_write_size(dev); 9477 9478 /* 9479 * Evict L2ARC buffers that will be overwritten. 9480 */ 9481 l2arc_evict(dev, size, B_FALSE); 9482 9483 /* 9484 * Write ARC buffers. 9485 */ 9486 wrote = l2arc_write_buffers(spa, dev, size); 9487 9488 /* 9489 * Calculate interval between writes. 
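 * Worked example (illustrative values; see l2arc_write_interval()
 * above): with hz = 1000, l2arc_feed_secs = 1 and l2arc_feed_min_ms = 200,
 * a pass that wrote more than half of what was wanted (and with
 * l2arc_feed_again set) is rescheduled roughly 200ms after it began,
 * while a pass that wrote little waits the full second; in both cases
 * next = MAX(now, MIN(now + interval, began + interval)), so the wakeup
 * is never scheduled in the past.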
9490 */ 9491 next = l2arc_write_interval(begin, size, wrote); 9492 spa_config_exit(spa, SCL_L2ARC, dev); 9493 } 9494 spl_fstrans_unmark(cookie); 9495 9496 l2arc_thread_exit = 0; 9497 cv_broadcast(&l2arc_feed_thr_cv); 9498 CALLB_CPR_EXIT(&cpr); /* drops l2arc_feed_thr_lock */ 9499 thread_exit(); 9500 } 9501 9502 boolean_t 9503 l2arc_vdev_present(vdev_t *vd) 9504 { 9505 return (l2arc_vdev_get(vd) != NULL); 9506 } 9507 9508 /* 9509 * Returns the l2arc_dev_t associated with a particular vdev_t or NULL if 9510 * the vdev_t isn't an L2ARC device. 9511 */ 9512 l2arc_dev_t * 9513 l2arc_vdev_get(vdev_t *vd) 9514 { 9515 l2arc_dev_t *dev; 9516 9517 mutex_enter(&l2arc_dev_mtx); 9518 for (dev = list_head(l2arc_dev_list); dev != NULL; 9519 dev = list_next(l2arc_dev_list, dev)) { 9520 if (dev->l2ad_vdev == vd) 9521 break; 9522 } 9523 mutex_exit(&l2arc_dev_mtx); 9524 9525 return (dev); 9526 } 9527 9528 static void 9529 l2arc_rebuild_dev(l2arc_dev_t *dev, boolean_t reopen) 9530 { 9531 l2arc_dev_hdr_phys_t *l2dhdr = dev->l2ad_dev_hdr; 9532 uint64_t l2dhdr_asize = dev->l2ad_dev_hdr_asize; 9533 spa_t *spa = dev->l2ad_spa; 9534 9535 /* 9536 * The L2ARC has to hold at least the payload of one log block for 9537 * them to be restored (persistent L2ARC). The payload of a log block 9538 * depends on the amount of its log entries. We always write log blocks 9539 * with 1022 entries. How many of them are committed or restored depends 9540 * on the size of the L2ARC device. Thus the maximum payload of 9541 * one log block is 1022 * SPA_MAXBLOCKSIZE = 16GB. If the L2ARC device 9542 * is less than that, we reduce the amount of committed and restored 9543 * log entries per block so as to enable persistence. 9544 */ 9545 if (dev->l2ad_end < l2arc_rebuild_blocks_min_l2size) { 9546 dev->l2ad_log_entries = 0; 9547 } else { 9548 dev->l2ad_log_entries = MIN((dev->l2ad_end - 9549 dev->l2ad_start) >> SPA_MAXBLOCKSHIFT, 9550 L2ARC_LOG_BLK_MAX_ENTRIES); 9551 } 9552 9553 /* 9554 * Read the device header, if an error is returned do not rebuild L2ARC. 9555 */ 9556 if (l2arc_dev_hdr_read(dev) == 0 && dev->l2ad_log_entries > 0) { 9557 /* 9558 * If we are onlining a cache device (vdev_reopen) that was 9559 * still present (l2arc_vdev_present()) and rebuild is enabled, 9560 * we should evict all ARC buffers and pointers to log blocks 9561 * and reclaim their space before restoring its contents to 9562 * L2ARC. 9563 */ 9564 if (reopen) { 9565 if (!l2arc_rebuild_enabled) { 9566 return; 9567 } else { 9568 l2arc_evict(dev, 0, B_TRUE); 9569 /* start a new log block */ 9570 dev->l2ad_log_ent_idx = 0; 9571 dev->l2ad_log_blk_payload_asize = 0; 9572 dev->l2ad_log_blk_payload_start = 0; 9573 } 9574 } 9575 /* 9576 * Just mark the device as pending for a rebuild. We won't 9577 * be starting a rebuild in line here as it would block pool 9578 * import. Instead spa_load_impl will hand that off to an 9579 * async task which will call l2arc_spa_rebuild_start. 9580 */ 9581 dev->l2ad_rebuild = B_TRUE; 9582 } else if (spa_writeable(spa)) { 9583 /* 9584 * In this case TRIM the whole device if l2arc_trim_ahead > 0, 9585 * otherwise create a new header. We zero out the memory holding 9586 * the header to reset dh_start_lbps. If we TRIM the whole 9587 * device the new header will be written by 9588 * vdev_trim_l2arc_thread() at the end of the TRIM to update the 9589 * trim_state in the header too. When reading the header, if 9590 * trim_state is not VDEV_TRIM_COMPLETE and l2arc_trim_ahead > 0 9591 * we opt to TRIM the whole device again. 
9592 */ 9593 if (l2arc_trim_ahead > 0) { 9594 dev->l2ad_trim_all = B_TRUE; 9595 } else { 9596 memset(l2dhdr, 0, l2dhdr_asize); 9597 l2arc_dev_hdr_update(dev); 9598 } 9599 } 9600 } 9601 9602 /* 9603 * Add a vdev for use by the L2ARC. By this point the spa has already 9604 * validated the vdev and opened it. 9605 */ 9606 void 9607 l2arc_add_vdev(spa_t *spa, vdev_t *vd) 9608 { 9609 l2arc_dev_t *adddev; 9610 uint64_t l2dhdr_asize; 9611 9612 ASSERT(!l2arc_vdev_present(vd)); 9613 9614 /* 9615 * Create a new l2arc device entry. 9616 */ 9617 adddev = vmem_zalloc(sizeof (l2arc_dev_t), KM_SLEEP); 9618 adddev->l2ad_spa = spa; 9619 adddev->l2ad_vdev = vd; 9620 /* leave extra size for an l2arc device header */ 9621 l2dhdr_asize = adddev->l2ad_dev_hdr_asize = 9622 MAX(sizeof (*adddev->l2ad_dev_hdr), 1 << vd->vdev_ashift); 9623 adddev->l2ad_start = VDEV_LABEL_START_SIZE + l2dhdr_asize; 9624 adddev->l2ad_end = VDEV_LABEL_START_SIZE + vdev_get_min_asize(vd); 9625 ASSERT3U(adddev->l2ad_start, <, adddev->l2ad_end); 9626 adddev->l2ad_hand = adddev->l2ad_start; 9627 adddev->l2ad_evict = adddev->l2ad_start; 9628 adddev->l2ad_first = B_TRUE; 9629 adddev->l2ad_writing = B_FALSE; 9630 adddev->l2ad_trim_all = B_FALSE; 9631 list_link_init(&adddev->l2ad_node); 9632 adddev->l2ad_dev_hdr = kmem_zalloc(l2dhdr_asize, KM_SLEEP); 9633 9634 mutex_init(&adddev->l2ad_mtx, NULL, MUTEX_DEFAULT, NULL); 9635 /* 9636 * This is a list of all ARC buffers that are still valid on the 9637 * device. 9638 */ 9639 list_create(&adddev->l2ad_buflist, sizeof (arc_buf_hdr_t), 9640 offsetof(arc_buf_hdr_t, b_l2hdr.b_l2node)); 9641 9642 /* 9643 * This is a list of pointers to log blocks that are still present 9644 * on the device. 9645 */ 9646 list_create(&adddev->l2ad_lbptr_list, sizeof (l2arc_lb_ptr_buf_t), 9647 offsetof(l2arc_lb_ptr_buf_t, node)); 9648 9649 vdev_space_update(vd, 0, 0, adddev->l2ad_end - adddev->l2ad_hand); 9650 zfs_refcount_create(&adddev->l2ad_alloc); 9651 zfs_refcount_create(&adddev->l2ad_lb_asize); 9652 zfs_refcount_create(&adddev->l2ad_lb_count); 9653 9654 /* 9655 * Decide if dev is eligible for L2ARC rebuild or whole device 9656 * trimming. This has to happen before the device is added in the 9657 * cache device list and l2arc_dev_mtx is released. Otherwise 9658 * l2arc_feed_thread() might already start writing on the 9659 * device. 9660 */ 9661 l2arc_rebuild_dev(adddev, B_FALSE); 9662 9663 /* 9664 * Add device to global list 9665 */ 9666 mutex_enter(&l2arc_dev_mtx); 9667 list_insert_head(l2arc_dev_list, adddev); 9668 atomic_inc_64(&l2arc_ndev); 9669 mutex_exit(&l2arc_dev_mtx); 9670 } 9671 9672 /* 9673 * Decide if a vdev is eligible for L2ARC rebuild, called from vdev_reopen() 9674 * in case of onlining a cache device. 9675 */ 9676 void 9677 l2arc_rebuild_vdev(vdev_t *vd, boolean_t reopen) 9678 { 9679 l2arc_dev_t *dev = NULL; 9680 9681 dev = l2arc_vdev_get(vd); 9682 ASSERT3P(dev, !=, NULL); 9683 9684 /* 9685 * In contrast to l2arc_add_vdev() we do not have to worry about 9686 * l2arc_feed_thread() invalidating previous content when onlining a 9687 * cache device. The device parameters (l2ad*) are not cleared when 9688 * offlining the device and writing new buffers will not invalidate 9689 * all previous content. In worst case only buffers that have not had 9690 * their log block written to the device will be lost. 
9691 * When onlining the cache device (ie offline->online without exporting 9692 * the pool in between) this happens: 9693 * vdev_reopen() -> vdev_open() -> l2arc_rebuild_vdev() 9694 * | | 9695 * vdev_is_dead() = B_FALSE l2ad_rebuild = B_TRUE 9696 * During the time where vdev_is_dead = B_FALSE and until l2ad_rebuild 9697 * is set to B_TRUE we might write additional buffers to the device. 9698 */ 9699 l2arc_rebuild_dev(dev, reopen); 9700 } 9701 9702 /* 9703 * Remove a vdev from the L2ARC. 9704 */ 9705 void 9706 l2arc_remove_vdev(vdev_t *vd) 9707 { 9708 l2arc_dev_t *remdev = NULL; 9709 9710 /* 9711 * Find the device by vdev 9712 */ 9713 remdev = l2arc_vdev_get(vd); 9714 ASSERT3P(remdev, !=, NULL); 9715 9716 /* 9717 * Cancel any ongoing or scheduled rebuild. 9718 */ 9719 mutex_enter(&l2arc_rebuild_thr_lock); 9720 if (remdev->l2ad_rebuild_began == B_TRUE) { 9721 remdev->l2ad_rebuild_cancel = B_TRUE; 9722 while (remdev->l2ad_rebuild == B_TRUE) 9723 cv_wait(&l2arc_rebuild_thr_cv, &l2arc_rebuild_thr_lock); 9724 } 9725 mutex_exit(&l2arc_rebuild_thr_lock); 9726 9727 /* 9728 * Remove device from global list 9729 */ 9730 mutex_enter(&l2arc_dev_mtx); 9731 list_remove(l2arc_dev_list, remdev); 9732 l2arc_dev_last = NULL; /* may have been invalidated */ 9733 atomic_dec_64(&l2arc_ndev); 9734 mutex_exit(&l2arc_dev_mtx); 9735 9736 /* 9737 * Clear all buflists and ARC references. L2ARC device flush. 9738 */ 9739 l2arc_evict(remdev, 0, B_TRUE); 9740 list_destroy(&remdev->l2ad_buflist); 9741 ASSERT(list_is_empty(&remdev->l2ad_lbptr_list)); 9742 list_destroy(&remdev->l2ad_lbptr_list); 9743 mutex_destroy(&remdev->l2ad_mtx); 9744 zfs_refcount_destroy(&remdev->l2ad_alloc); 9745 zfs_refcount_destroy(&remdev->l2ad_lb_asize); 9746 zfs_refcount_destroy(&remdev->l2ad_lb_count); 9747 kmem_free(remdev->l2ad_dev_hdr, remdev->l2ad_dev_hdr_asize); 9748 vmem_free(remdev, sizeof (l2arc_dev_t)); 9749 } 9750 9751 void 9752 l2arc_init(void) 9753 { 9754 l2arc_thread_exit = 0; 9755 l2arc_ndev = 0; 9756 9757 mutex_init(&l2arc_feed_thr_lock, NULL, MUTEX_DEFAULT, NULL); 9758 cv_init(&l2arc_feed_thr_cv, NULL, CV_DEFAULT, NULL); 9759 mutex_init(&l2arc_rebuild_thr_lock, NULL, MUTEX_DEFAULT, NULL); 9760 cv_init(&l2arc_rebuild_thr_cv, NULL, CV_DEFAULT, NULL); 9761 mutex_init(&l2arc_dev_mtx, NULL, MUTEX_DEFAULT, NULL); 9762 mutex_init(&l2arc_free_on_write_mtx, NULL, MUTEX_DEFAULT, NULL); 9763 9764 l2arc_dev_list = &L2ARC_dev_list; 9765 l2arc_free_on_write = &L2ARC_free_on_write; 9766 list_create(l2arc_dev_list, sizeof (l2arc_dev_t), 9767 offsetof(l2arc_dev_t, l2ad_node)); 9768 list_create(l2arc_free_on_write, sizeof (l2arc_data_free_t), 9769 offsetof(l2arc_data_free_t, l2df_list_node)); 9770 } 9771 9772 void 9773 l2arc_fini(void) 9774 { 9775 mutex_destroy(&l2arc_feed_thr_lock); 9776 cv_destroy(&l2arc_feed_thr_cv); 9777 mutex_destroy(&l2arc_rebuild_thr_lock); 9778 cv_destroy(&l2arc_rebuild_thr_cv); 9779 mutex_destroy(&l2arc_dev_mtx); 9780 mutex_destroy(&l2arc_free_on_write_mtx); 9781 9782 list_destroy(l2arc_dev_list); 9783 list_destroy(l2arc_free_on_write); 9784 } 9785 9786 void 9787 l2arc_start(void) 9788 { 9789 if (!(spa_mode_global & SPA_MODE_WRITE)) 9790 return; 9791 9792 (void) thread_create(NULL, 0, l2arc_feed_thread, NULL, 0, &p0, 9793 TS_RUN, defclsyspri); 9794 } 9795 9796 void 9797 l2arc_stop(void) 9798 { 9799 if (!(spa_mode_global & SPA_MODE_WRITE)) 9800 return; 9801 9802 mutex_enter(&l2arc_feed_thr_lock); 9803 cv_signal(&l2arc_feed_thr_cv); /* kick thread out of startup */ 9804 l2arc_thread_exit = 1; 9805 while (l2arc_thread_exit 
!= 0) 9806 cv_wait(&l2arc_feed_thr_cv, &l2arc_feed_thr_lock); 9807 mutex_exit(&l2arc_feed_thr_lock); 9808 } 9809 9810 /* 9811 * Punches out rebuild threads for the L2ARC devices in a spa. This should 9812 * be called after pool import from the spa async thread, since starting 9813 * these threads directly from spa_import() will make them part of the 9814 * "zpool import" context and delay process exit (and thus pool import). 9815 */ 9816 void 9817 l2arc_spa_rebuild_start(spa_t *spa) 9818 { 9819 ASSERT(MUTEX_HELD(&spa_namespace_lock)); 9820 9821 /* 9822 * Locate the spa's l2arc devices and kick off rebuild threads. 9823 */ 9824 for (int i = 0; i < spa->spa_l2cache.sav_count; i++) { 9825 l2arc_dev_t *dev = 9826 l2arc_vdev_get(spa->spa_l2cache.sav_vdevs[i]); 9827 if (dev == NULL) { 9828 /* Don't attempt a rebuild if the vdev is UNAVAIL */ 9829 continue; 9830 } 9831 mutex_enter(&l2arc_rebuild_thr_lock); 9832 if (dev->l2ad_rebuild && !dev->l2ad_rebuild_cancel) { 9833 dev->l2ad_rebuild_began = B_TRUE; 9834 (void) thread_create(NULL, 0, l2arc_dev_rebuild_thread, 9835 dev, 0, &p0, TS_RUN, minclsyspri); 9836 } 9837 mutex_exit(&l2arc_rebuild_thr_lock); 9838 } 9839 } 9840 9841 /* 9842 * Main entry point for L2ARC rebuilding. 9843 */ 9844 static __attribute__((noreturn)) void 9845 l2arc_dev_rebuild_thread(void *arg) 9846 { 9847 l2arc_dev_t *dev = arg; 9848 9849 VERIFY(!dev->l2ad_rebuild_cancel); 9850 VERIFY(dev->l2ad_rebuild); 9851 (void) l2arc_rebuild(dev); 9852 mutex_enter(&l2arc_rebuild_thr_lock); 9853 dev->l2ad_rebuild_began = B_FALSE; 9854 dev->l2ad_rebuild = B_FALSE; 9855 mutex_exit(&l2arc_rebuild_thr_lock); 9856 9857 thread_exit(); 9858 } 9859 9860 /* 9861 * This function implements the actual L2ARC metadata rebuild. It: 9862 * starts reading the log block chain and restores each block's contents 9863 * to memory (reconstructing arc_buf_hdr_t's). 9864 * 9865 * Operation stops under any of the following conditions: 9866 * 9867 * 1) We reach the end of the log block chain. 9868 * 2) We encounter *any* error condition (cksum errors, io errors) 9869 */ 9870 static int 9871 l2arc_rebuild(l2arc_dev_t *dev) 9872 { 9873 vdev_t *vd = dev->l2ad_vdev; 9874 spa_t *spa = vd->vdev_spa; 9875 int err = 0; 9876 l2arc_dev_hdr_phys_t *l2dhdr = dev->l2ad_dev_hdr; 9877 l2arc_log_blk_phys_t *this_lb, *next_lb; 9878 zio_t *this_io = NULL, *next_io = NULL; 9879 l2arc_log_blkptr_t lbps[2]; 9880 l2arc_lb_ptr_buf_t *lb_ptr_buf; 9881 boolean_t lock_held; 9882 9883 this_lb = vmem_zalloc(sizeof (*this_lb), KM_SLEEP); 9884 next_lb = vmem_zalloc(sizeof (*next_lb), KM_SLEEP); 9885 9886 /* 9887 * We prevent device removal while issuing reads to the device, 9888 * then during the rebuilding phases we drop this lock again so 9889 * that a spa_unload or device remove can be initiated - this is 9890 * safe, because the spa will signal us to stop before removing 9891 * our device and wait for us to stop. 9892 */ 9893 spa_config_enter(spa, SCL_L2ARC, vd, RW_READER); 9894 lock_held = B_TRUE; 9895 9896 /* 9897 * Retrieve the persistent L2ARC device state. 9898 * L2BLK_GET_PSIZE returns aligned size for log blocks. 
9899 */ 9900 dev->l2ad_evict = MAX(l2dhdr->dh_evict, dev->l2ad_start); 9901 dev->l2ad_hand = MAX(l2dhdr->dh_start_lbps[0].lbp_daddr + 9902 L2BLK_GET_PSIZE((&l2dhdr->dh_start_lbps[0])->lbp_prop), 9903 dev->l2ad_start); 9904 dev->l2ad_first = !!(l2dhdr->dh_flags & L2ARC_DEV_HDR_EVICT_FIRST); 9905 9906 vd->vdev_trim_action_time = l2dhdr->dh_trim_action_time; 9907 vd->vdev_trim_state = l2dhdr->dh_trim_state; 9908 9909 /* 9910 * In case the zfs module parameter l2arc_rebuild_enabled is false 9911 * we do not start the rebuild process. 9912 */ 9913 if (!l2arc_rebuild_enabled) 9914 goto out; 9915 9916 /* Prepare the rebuild process */ 9917 memcpy(lbps, l2dhdr->dh_start_lbps, sizeof (lbps)); 9918 9919 /* Start the rebuild process */ 9920 for (;;) { 9921 if (!l2arc_log_blkptr_valid(dev, &lbps[0])) 9922 break; 9923 9924 if ((err = l2arc_log_blk_read(dev, &lbps[0], &lbps[1], 9925 this_lb, next_lb, this_io, &next_io)) != 0) 9926 goto out; 9927 9928 /* 9929 * Our memory pressure valve. If the system is running low 9930 * on memory, rather than swamping memory with new ARC buf 9931 * hdrs, we opt not to rebuild the L2ARC. At this point, 9932 * however, we have already set up our L2ARC dev to chain in 9933 * new metadata log blocks, so the user may choose to offline/ 9934 * online the L2ARC dev at a later time (or re-import the pool) 9935 * to reconstruct it (when there's less memory pressure). 9936 */ 9937 if (l2arc_hdr_limit_reached()) { 9938 ARCSTAT_BUMP(arcstat_l2_rebuild_abort_lowmem); 9939 cmn_err(CE_NOTE, "System running low on memory, " 9940 "aborting L2ARC rebuild."); 9941 err = SET_ERROR(ENOMEM); 9942 goto out; 9943 } 9944 9945 spa_config_exit(spa, SCL_L2ARC, vd); 9946 lock_held = B_FALSE; 9947 9948 /* 9949 * Now that we know that the next_lb checks out alright, we 9950 * can start reconstruction from this log block. 9951 * L2BLK_GET_PSIZE returns aligned size for log blocks. 9952 */ 9953 uint64_t asize = L2BLK_GET_PSIZE((&lbps[0])->lbp_prop); 9954 l2arc_log_blk_restore(dev, this_lb, asize); 9955 9956 /* 9957 * log block restored, include its pointer in the list of 9958 * pointers to log blocks present in the L2ARC device. 9959 */ 9960 lb_ptr_buf = kmem_zalloc(sizeof (l2arc_lb_ptr_buf_t), KM_SLEEP); 9961 lb_ptr_buf->lb_ptr = kmem_zalloc(sizeof (l2arc_log_blkptr_t), 9962 KM_SLEEP); 9963 memcpy(lb_ptr_buf->lb_ptr, &lbps[0], 9964 sizeof (l2arc_log_blkptr_t)); 9965 mutex_enter(&dev->l2ad_mtx); 9966 list_insert_tail(&dev->l2ad_lbptr_list, lb_ptr_buf); 9967 ARCSTAT_INCR(arcstat_l2_log_blk_asize, asize); 9968 ARCSTAT_BUMP(arcstat_l2_log_blk_count); 9969 zfs_refcount_add_many(&dev->l2ad_lb_asize, asize, lb_ptr_buf); 9970 zfs_refcount_add(&dev->l2ad_lb_count, lb_ptr_buf); 9971 mutex_exit(&dev->l2ad_mtx); 9972 vdev_space_update(vd, asize, 0, 0); 9973 9974 /* 9975 * Protection against loops of log blocks: 9976 * 9977 * l2ad_hand l2ad_evict 9978 * V V 9979 * l2ad_start |=======================================| l2ad_end 9980 * -----|||----|||---|||----||| 9981 * (3) (2) (1) (0) 9982 * ---|||---|||----|||---||| 9983 * (7) (6) (5) (4) 9984 * 9985 * In this situation the pointer of log block (4) passes 9986 * l2arc_log_blkptr_valid() but the log block should not be 9987 * restored as it is overwritten by the payload of log block 9988 * (0). Only log blocks (0)-(3) should be restored. 
We check 9989 * whether l2ad_evict lies in between the payload starting 9990 * offset of the next log block (lbps[1].lbp_payload_start) 9991 * and the payload starting offset of the present log block 9992 * (lbps[0].lbp_payload_start). If true and this isn't the 9993 * first pass, we are looping from the beginning and we should 9994 * stop. 9995 */ 9996 if (l2arc_range_check_overlap(lbps[1].lbp_payload_start, 9997 lbps[0].lbp_payload_start, dev->l2ad_evict) && 9998 !dev->l2ad_first) 9999 goto out; 10000 10001 kpreempt(KPREEMPT_SYNC); 10002 for (;;) { 10003 mutex_enter(&l2arc_rebuild_thr_lock); 10004 if (dev->l2ad_rebuild_cancel) { 10005 dev->l2ad_rebuild = B_FALSE; 10006 cv_signal(&l2arc_rebuild_thr_cv); 10007 mutex_exit(&l2arc_rebuild_thr_lock); 10008 err = SET_ERROR(ECANCELED); 10009 goto out; 10010 } 10011 mutex_exit(&l2arc_rebuild_thr_lock); 10012 if (spa_config_tryenter(spa, SCL_L2ARC, vd, 10013 RW_READER)) { 10014 lock_held = B_TRUE; 10015 break; 10016 } 10017 /* 10018 * L2ARC config lock held by somebody in writer, 10019 * possibly due to them trying to remove us. They'll 10020 * likely to want us to shut down, so after a little 10021 * delay, we check l2ad_rebuild_cancel and retry 10022 * the lock again. 10023 */ 10024 delay(1); 10025 } 10026 10027 /* 10028 * Continue with the next log block. 10029 */ 10030 lbps[0] = lbps[1]; 10031 lbps[1] = this_lb->lb_prev_lbp; 10032 PTR_SWAP(this_lb, next_lb); 10033 this_io = next_io; 10034 next_io = NULL; 10035 } 10036 10037 if (this_io != NULL) 10038 l2arc_log_blk_fetch_abort(this_io); 10039 out: 10040 if (next_io != NULL) 10041 l2arc_log_blk_fetch_abort(next_io); 10042 vmem_free(this_lb, sizeof (*this_lb)); 10043 vmem_free(next_lb, sizeof (*next_lb)); 10044 10045 if (!l2arc_rebuild_enabled) { 10046 spa_history_log_internal(spa, "L2ARC rebuild", NULL, 10047 "disabled"); 10048 } else if (err == 0 && zfs_refcount_count(&dev->l2ad_lb_count) > 0) { 10049 ARCSTAT_BUMP(arcstat_l2_rebuild_success); 10050 spa_history_log_internal(spa, "L2ARC rebuild", NULL, 10051 "successful, restored %llu blocks", 10052 (u_longlong_t)zfs_refcount_count(&dev->l2ad_lb_count)); 10053 } else if (err == 0 && zfs_refcount_count(&dev->l2ad_lb_count) == 0) { 10054 /* 10055 * No error but also nothing restored, meaning the lbps array 10056 * in the device header points to invalid/non-present log 10057 * blocks. Reset the header. 10058 */ 10059 spa_history_log_internal(spa, "L2ARC rebuild", NULL, 10060 "no valid log blocks"); 10061 memset(l2dhdr, 0, dev->l2ad_dev_hdr_asize); 10062 l2arc_dev_hdr_update(dev); 10063 } else if (err == ECANCELED) { 10064 /* 10065 * In case the rebuild was canceled do not log to spa history 10066 * log as the pool may be in the process of being removed. 10067 */ 10068 zfs_dbgmsg("L2ARC rebuild aborted, restored %llu blocks", 10069 (u_longlong_t)zfs_refcount_count(&dev->l2ad_lb_count)); 10070 } else if (err != 0) { 10071 spa_history_log_internal(spa, "L2ARC rebuild", NULL, 10072 "aborted, restored %llu blocks", 10073 (u_longlong_t)zfs_refcount_count(&dev->l2ad_lb_count)); 10074 } 10075 10076 if (lock_held) 10077 spa_config_exit(spa, SCL_L2ARC, vd); 10078 10079 return (err); 10080 } 10081 10082 /* 10083 * Attempts to read the device header on the provided L2ARC device and writes 10084 * it to `hdr'. On success, this function returns 0, otherwise the appropriate 10085 * error code is returned. 
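 *
 * For illustration, the endianness handling in this function follows the
 * usual on-disk magic-number idiom. A minimal standalone sketch (the
 * struct and MAGIC below are made up for the example; the helpers are the
 * real ones used here):
 *
 *	struct ondisk { uint64_t magic; uint64_t payload[8]; } *h;
 *
 *	if (h->magic == BSWAP_64(MAGIC))
 *		byteswap_uint64_array(h, sizeof (*h));
 *	if (h->magic != MAGIC)
 *		return (SET_ERROR(ENOTSUP));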
10086 */ 10087 static int 10088 l2arc_dev_hdr_read(l2arc_dev_t *dev) 10089 { 10090 int err; 10091 uint64_t guid; 10092 l2arc_dev_hdr_phys_t *l2dhdr = dev->l2ad_dev_hdr; 10093 const uint64_t l2dhdr_asize = dev->l2ad_dev_hdr_asize; 10094 abd_t *abd; 10095 10096 guid = spa_guid(dev->l2ad_vdev->vdev_spa); 10097 10098 abd = abd_get_from_buf(l2dhdr, l2dhdr_asize); 10099 10100 err = zio_wait(zio_read_phys(NULL, dev->l2ad_vdev, 10101 VDEV_LABEL_START_SIZE, l2dhdr_asize, abd, 10102 ZIO_CHECKSUM_LABEL, NULL, NULL, ZIO_PRIORITY_SYNC_READ, 10103 ZIO_FLAG_CANFAIL | ZIO_FLAG_DONT_PROPAGATE | ZIO_FLAG_DONT_RETRY | 10104 ZIO_FLAG_SPECULATIVE, B_FALSE)); 10105 10106 abd_free(abd); 10107 10108 if (err != 0) { 10109 ARCSTAT_BUMP(arcstat_l2_rebuild_abort_dh_errors); 10110 zfs_dbgmsg("L2ARC IO error (%d) while reading device header, " 10111 "vdev guid: %llu", err, 10112 (u_longlong_t)dev->l2ad_vdev->vdev_guid); 10113 return (err); 10114 } 10115 10116 if (l2dhdr->dh_magic == BSWAP_64(L2ARC_DEV_HDR_MAGIC)) 10117 byteswap_uint64_array(l2dhdr, sizeof (*l2dhdr)); 10118 10119 if (l2dhdr->dh_magic != L2ARC_DEV_HDR_MAGIC || 10120 l2dhdr->dh_spa_guid != guid || 10121 l2dhdr->dh_vdev_guid != dev->l2ad_vdev->vdev_guid || 10122 l2dhdr->dh_version != L2ARC_PERSISTENT_VERSION || 10123 l2dhdr->dh_log_entries != dev->l2ad_log_entries || 10124 l2dhdr->dh_end != dev->l2ad_end || 10125 !l2arc_range_check_overlap(dev->l2ad_start, dev->l2ad_end, 10126 l2dhdr->dh_evict) || 10127 (l2dhdr->dh_trim_state != VDEV_TRIM_COMPLETE && 10128 l2arc_trim_ahead > 0)) { 10129 /* 10130 * Attempt to rebuild a device containing no actual dev hdr 10131 * or containing a header from some other pool or from another 10132 * version of persistent L2ARC. 10133 */ 10134 ARCSTAT_BUMP(arcstat_l2_rebuild_abort_unsupported); 10135 return (SET_ERROR(ENOTSUP)); 10136 } 10137 10138 return (0); 10139 } 10140 10141 /* 10142 * Reads L2ARC log blocks from storage and validates their contents. 10143 * 10144 * This function implements a simple fetcher to make sure that while 10145 * we're processing one buffer the L2ARC is already fetching the next 10146 * one in the chain. 10147 * 10148 * The arguments this_lp and next_lp point to the current and next log block 10149 * address in the block chain. Similarly, this_lb and next_lb hold the 10150 * l2arc_log_blk_phys_t's of the current and next L2ARC blk. 10151 * 10152 * The `this_io' and `next_io' arguments are used for block fetching. 10153 * When issuing the first blk IO during rebuild, you should pass NULL for 10154 * `this_io'. This function will then issue a sync IO to read the block and 10155 * also issue an async IO to fetch the next block in the block chain. The 10156 * fetched IO is returned in `next_io'. On subsequent calls to this 10157 * function, pass the value returned in `next_io' from the previous call 10158 * as `this_io' and a fresh `next_io' pointer to hold the next fetch IO. 10159 * Prior to the call, you should initialize your `next_io' pointer to be 10160 * NULL. If no fetch IO was issued, the pointer is left set at NULL. 10161 * 10162 * On success, this function returns 0, otherwise it returns an appropriate 10163 * error code. On error the fetching IO is aborted and cleared before 10164 * returning from this function. Therefore, if we return `success', the 10165 * caller can assume that we have taken care of cleanup of fetch IOs. 
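 *
 * To make that calling convention concrete, here is a condensed sketch of
 * the caller loop (this is what l2arc_rebuild() above already does; memory
 * pressure checks, locking and statistics are omitted):
 *
 *	zio_t *this_io = NULL, *next_io = NULL;
 *
 *	for (;;) {
 *		if (!l2arc_log_blkptr_valid(dev, &lbps[0]))
 *			break;
 *		if (l2arc_log_blk_read(dev, &lbps[0], &lbps[1],
 *		    this_lb, next_lb, this_io, &next_io) != 0)
 *			return;
 *		uint64_t asize = L2BLK_GET_PSIZE((&lbps[0])->lbp_prop);
 *		l2arc_log_blk_restore(dev, this_lb, asize);
 *		lbps[0] = lbps[1];
 *		lbps[1] = this_lb->lb_prev_lbp;
 *		PTR_SWAP(this_lb, next_lb);
 *		this_io = next_io;
 *		next_io = NULL;
 *	}
 *	if (this_io != NULL)
 *		l2arc_log_blk_fetch_abort(this_io);
 *
 * On the error return no further cleanup is needed, per the contract
 * described above.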
10166 */ 10167 static int 10168 l2arc_log_blk_read(l2arc_dev_t *dev, 10169 const l2arc_log_blkptr_t *this_lbp, const l2arc_log_blkptr_t *next_lbp, 10170 l2arc_log_blk_phys_t *this_lb, l2arc_log_blk_phys_t *next_lb, 10171 zio_t *this_io, zio_t **next_io) 10172 { 10173 int err = 0; 10174 zio_cksum_t cksum; 10175 uint64_t asize; 10176 10177 ASSERT(this_lbp != NULL && next_lbp != NULL); 10178 ASSERT(this_lb != NULL && next_lb != NULL); 10179 ASSERT(next_io != NULL && *next_io == NULL); 10180 ASSERT(l2arc_log_blkptr_valid(dev, this_lbp)); 10181 10182 /* 10183 * Check to see if we have issued the IO for this log block in a 10184 * previous run. If not, this is the first call, so issue it now. 10185 */ 10186 if (this_io == NULL) { 10187 this_io = l2arc_log_blk_fetch(dev->l2ad_vdev, this_lbp, 10188 this_lb); 10189 } 10190 10191 /* 10192 * Peek to see if we can start issuing the next IO immediately. 10193 */ 10194 if (l2arc_log_blkptr_valid(dev, next_lbp)) { 10195 /* 10196 * Start issuing IO for the next log block early - this 10197 * should help keep the L2ARC device busy while we 10198 * decompress and restore this log block. 10199 */ 10200 *next_io = l2arc_log_blk_fetch(dev->l2ad_vdev, next_lbp, 10201 next_lb); 10202 } 10203 10204 /* Wait for the IO to read this log block to complete */ 10205 if ((err = zio_wait(this_io)) != 0) { 10206 ARCSTAT_BUMP(arcstat_l2_rebuild_abort_io_errors); 10207 zfs_dbgmsg("L2ARC IO error (%d) while reading log block, " 10208 "offset: %llu, vdev guid: %llu", err, 10209 (u_longlong_t)this_lbp->lbp_daddr, 10210 (u_longlong_t)dev->l2ad_vdev->vdev_guid); 10211 goto cleanup; 10212 } 10213 10214 /* 10215 * Make sure the buffer checks out. 10216 * L2BLK_GET_PSIZE returns aligned size for log blocks. 10217 */ 10218 asize = L2BLK_GET_PSIZE((this_lbp)->lbp_prop); 10219 fletcher_4_native(this_lb, asize, NULL, &cksum); 10220 if (!ZIO_CHECKSUM_EQUAL(cksum, this_lbp->lbp_cksum)) { 10221 ARCSTAT_BUMP(arcstat_l2_rebuild_abort_cksum_lb_errors); 10222 zfs_dbgmsg("L2ARC log block cksum failed, offset: %llu, " 10223 "vdev guid: %llu, l2ad_hand: %llu, l2ad_evict: %llu", 10224 (u_longlong_t)this_lbp->lbp_daddr, 10225 (u_longlong_t)dev->l2ad_vdev->vdev_guid, 10226 (u_longlong_t)dev->l2ad_hand, 10227 (u_longlong_t)dev->l2ad_evict); 10228 err = SET_ERROR(ECKSUM); 10229 goto cleanup; 10230 } 10231 10232 /* Now we can take our time decoding this buffer */ 10233 switch (L2BLK_GET_COMPRESS((this_lbp)->lbp_prop)) { 10234 case ZIO_COMPRESS_OFF: 10235 break; 10236 case ZIO_COMPRESS_LZ4: { 10237 abd_t *abd = abd_alloc_linear(asize, B_TRUE); 10238 abd_copy_from_buf_off(abd, this_lb, 0, asize); 10239 abd_t dabd; 10240 abd_get_from_buf_struct(&dabd, this_lb, sizeof (*this_lb)); 10241 err = zio_decompress_data( 10242 L2BLK_GET_COMPRESS((this_lbp)->lbp_prop), 10243 abd, &dabd, asize, sizeof (*this_lb), NULL); 10244 abd_free(&dabd); 10245 abd_free(abd); 10246 if (err != 0) { 10247 err = SET_ERROR(EINVAL); 10248 goto cleanup; 10249 } 10250 break; 10251 } 10252 default: 10253 err = SET_ERROR(EINVAL); 10254 goto cleanup; 10255 } 10256 if (this_lb->lb_magic == BSWAP_64(L2ARC_LOG_BLK_MAGIC)) 10257 byteswap_uint64_array(this_lb, sizeof (*this_lb)); 10258 if (this_lb->lb_magic != L2ARC_LOG_BLK_MAGIC) { 10259 err = SET_ERROR(EINVAL); 10260 goto cleanup; 10261 } 10262 cleanup: 10263 /* Abort an in-flight fetch I/O in case of error */ 10264 if (err != 0 && *next_io != NULL) { 10265 l2arc_log_blk_fetch_abort(*next_io); 10266 *next_io = NULL; 10267 } 10268 return (err); 10269 } 10270 10271 /* 10272 * Restores the 
payload of a log block to ARC. This creates empty ARC hdr 10273 * entries which only contain an l2arc hdr, essentially restoring the 10274 * buffers to their L2ARC evicted state. This function also updates space 10275 * usage on the L2ARC vdev to make sure it tracks restored buffers. 10276 */ 10277 static void 10278 l2arc_log_blk_restore(l2arc_dev_t *dev, const l2arc_log_blk_phys_t *lb, 10279 uint64_t lb_asize) 10280 { 10281 uint64_t size = 0, asize = 0; 10282 uint64_t log_entries = dev->l2ad_log_entries; 10283 10284 /* 10285 * Usually arc_adapt() is called only for data, not headers, but 10286 * since we may allocate significant amount of memory here, let ARC 10287 * grow its arc_c. 10288 */ 10289 arc_adapt(log_entries * HDR_L2ONLY_SIZE); 10290 10291 for (int i = log_entries - 1; i >= 0; i--) { 10292 /* 10293 * Restore goes in the reverse temporal direction to preserve 10294 * correct temporal ordering of buffers in the l2ad_buflist. 10295 * l2arc_hdr_restore also does a list_insert_tail instead of 10296 * list_insert_head on the l2ad_buflist: 10297 * 10298 * LIST l2ad_buflist LIST 10299 * HEAD <------ (time) ------ TAIL 10300 * direction +-----+-----+-----+-----+-----+ direction 10301 * of l2arc <== | buf | buf | buf | buf | buf | ===> of rebuild 10302 * fill +-----+-----+-----+-----+-----+ 10303 * ^ ^ 10304 * | | 10305 * | | 10306 * l2arc_feed_thread l2arc_rebuild 10307 * will place new bufs here restores bufs here 10308 * 10309 * During l2arc_rebuild() the device is not used by 10310 * l2arc_feed_thread() as dev->l2ad_rebuild is set to true. 10311 */ 10312 size += L2BLK_GET_LSIZE((&lb->lb_entries[i])->le_prop); 10313 asize += vdev_psize_to_asize(dev->l2ad_vdev, 10314 L2BLK_GET_PSIZE((&lb->lb_entries[i])->le_prop)); 10315 l2arc_hdr_restore(&lb->lb_entries[i], dev); 10316 } 10317 10318 /* 10319 * Record rebuild stats: 10320 * size Logical size of restored buffers in the L2ARC 10321 * asize Aligned size of restored buffers in the L2ARC 10322 */ 10323 ARCSTAT_INCR(arcstat_l2_rebuild_size, size); 10324 ARCSTAT_INCR(arcstat_l2_rebuild_asize, asize); 10325 ARCSTAT_INCR(arcstat_l2_rebuild_bufs, log_entries); 10326 ARCSTAT_F_AVG(arcstat_l2_log_blk_avg_asize, lb_asize); 10327 ARCSTAT_F_AVG(arcstat_l2_data_to_meta_ratio, asize / lb_asize); 10328 ARCSTAT_BUMP(arcstat_l2_rebuild_log_blks); 10329 } 10330 10331 /* 10332 * Restores a single ARC buf hdr from a log entry. The ARC buffer is put 10333 * into a state indicating that it has been evicted to L2ARC. 10334 */ 10335 static void 10336 l2arc_hdr_restore(const l2arc_log_ent_phys_t *le, l2arc_dev_t *dev) 10337 { 10338 arc_buf_hdr_t *hdr, *exists; 10339 kmutex_t *hash_lock; 10340 arc_buf_contents_t type = L2BLK_GET_TYPE((le)->le_prop); 10341 uint64_t asize; 10342 10343 /* 10344 * Do all the allocation before grabbing any locks, this lets us 10345 * sleep if memory is full and we don't have to deal with failed 10346 * allocations. 10347 */ 10348 hdr = arc_buf_alloc_l2only(L2BLK_GET_LSIZE((le)->le_prop), type, 10349 dev, le->le_dva, le->le_daddr, 10350 L2BLK_GET_PSIZE((le)->le_prop), le->le_birth, 10351 L2BLK_GET_COMPRESS((le)->le_prop), le->le_complevel, 10352 L2BLK_GET_PROTECTED((le)->le_prop), 10353 L2BLK_GET_PREFETCH((le)->le_prop), 10354 L2BLK_GET_STATE((le)->le_prop)); 10355 asize = vdev_psize_to_asize(dev->l2ad_vdev, 10356 L2BLK_GET_PSIZE((le)->le_prop)); 10357 10358 /* 10359 * vdev_space_update() has to be called before arc_hdr_destroy() to 10360 * avoid underflow since the latter also calls vdev_space_update(). 
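 *
 * Condensed sketch of that ordering (intermediate steps omitted; the real
 * calls follow below): the vdev space accounting is credited first, so
 * that a later arc_hdr_destroy() of the duplicate header can safely debit
 * it again.
 *
 *	vdev_space_update(dev->l2ad_vdev, asize, 0, 0);
 *	exists = buf_hash_insert(hdr, &hash_lock);
 *	if (exists != NULL)
 *		arc_hdr_destroy(hdr);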
10361 */ 10362 l2arc_hdr_arcstats_increment(hdr); 10363 vdev_space_update(dev->l2ad_vdev, asize, 0, 0); 10364 10365 mutex_enter(&dev->l2ad_mtx); 10366 list_insert_tail(&dev->l2ad_buflist, hdr); 10367 (void) zfs_refcount_add_many(&dev->l2ad_alloc, arc_hdr_size(hdr), hdr); 10368 mutex_exit(&dev->l2ad_mtx); 10369 10370 exists = buf_hash_insert(hdr, &hash_lock); 10371 if (exists) { 10372 /* Buffer was already cached, no need to restore it. */ 10373 arc_hdr_destroy(hdr); 10374 /* 10375 * If the buffer is already cached, check whether it has 10376 * L2ARC metadata. If not, enter them and update the flag. 10377 * This is important is case of onlining a cache device, since 10378 * we previously evicted all L2ARC metadata from ARC. 10379 */ 10380 if (!HDR_HAS_L2HDR(exists)) { 10381 arc_hdr_set_flags(exists, ARC_FLAG_HAS_L2HDR); 10382 exists->b_l2hdr.b_dev = dev; 10383 exists->b_l2hdr.b_daddr = le->le_daddr; 10384 exists->b_l2hdr.b_arcs_state = 10385 L2BLK_GET_STATE((le)->le_prop); 10386 mutex_enter(&dev->l2ad_mtx); 10387 list_insert_tail(&dev->l2ad_buflist, exists); 10388 (void) zfs_refcount_add_many(&dev->l2ad_alloc, 10389 arc_hdr_size(exists), exists); 10390 mutex_exit(&dev->l2ad_mtx); 10391 l2arc_hdr_arcstats_increment(exists); 10392 vdev_space_update(dev->l2ad_vdev, asize, 0, 0); 10393 } 10394 ARCSTAT_BUMP(arcstat_l2_rebuild_bufs_precached); 10395 } 10396 10397 mutex_exit(hash_lock); 10398 } 10399 10400 /* 10401 * Starts an asynchronous read IO to read a log block. This is used in log 10402 * block reconstruction to start reading the next block before we are done 10403 * decoding and reconstructing the current block, to keep the l2arc device 10404 * nice and hot with read IO to process. 10405 * The returned zio will contain a newly allocated memory buffers for the IO 10406 * data which should then be freed by the caller once the zio is no longer 10407 * needed (i.e. due to it having completed). If you wish to abort this 10408 * zio, you should do so using l2arc_log_blk_fetch_abort, which takes 10409 * care of disposing of the allocated buffers correctly. 10410 */ 10411 static zio_t * 10412 l2arc_log_blk_fetch(vdev_t *vd, const l2arc_log_blkptr_t *lbp, 10413 l2arc_log_blk_phys_t *lb) 10414 { 10415 uint32_t asize; 10416 zio_t *pio; 10417 l2arc_read_callback_t *cb; 10418 10419 /* L2BLK_GET_PSIZE returns aligned size for log blocks */ 10420 asize = L2BLK_GET_PSIZE((lbp)->lbp_prop); 10421 ASSERT(asize <= sizeof (l2arc_log_blk_phys_t)); 10422 10423 cb = kmem_zalloc(sizeof (l2arc_read_callback_t), KM_SLEEP); 10424 cb->l2rcb_abd = abd_get_from_buf(lb, asize); 10425 pio = zio_root(vd->vdev_spa, l2arc_blk_fetch_done, cb, 10426 ZIO_FLAG_CANFAIL | ZIO_FLAG_DONT_PROPAGATE | ZIO_FLAG_DONT_RETRY); 10427 (void) zio_nowait(zio_read_phys(pio, vd, lbp->lbp_daddr, asize, 10428 cb->l2rcb_abd, ZIO_CHECKSUM_OFF, NULL, NULL, 10429 ZIO_PRIORITY_ASYNC_READ, ZIO_FLAG_CANFAIL | 10430 ZIO_FLAG_DONT_PROPAGATE | ZIO_FLAG_DONT_RETRY, B_FALSE)); 10431 10432 return (pio); 10433 } 10434 10435 /* 10436 * Aborts a zio returned from l2arc_log_blk_fetch and frees the data 10437 * buffers allocated for it. 10438 */ 10439 static void 10440 l2arc_log_blk_fetch_abort(zio_t *zio) 10441 { 10442 (void) zio_wait(zio); 10443 } 10444 10445 /* 10446 * Creates a zio to update the device header on an l2arc device. 
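 *
 * One caller pattern visible above, in l2arc_rebuild(): when no valid log
 * blocks were found, the in-core header is cleared first and then
 * persisted, effectively resetting the on-disk log block chain:
 *
 *	memset(l2dhdr, 0, dev->l2ad_dev_hdr_asize);
 *	l2arc_dev_hdr_update(dev);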
10447 */ 10448 void 10449 l2arc_dev_hdr_update(l2arc_dev_t *dev) 10450 { 10451 l2arc_dev_hdr_phys_t *l2dhdr = dev->l2ad_dev_hdr; 10452 const uint64_t l2dhdr_asize = dev->l2ad_dev_hdr_asize; 10453 abd_t *abd; 10454 int err; 10455 10456 VERIFY(spa_config_held(dev->l2ad_spa, SCL_STATE_ALL, RW_READER)); 10457 10458 l2dhdr->dh_magic = L2ARC_DEV_HDR_MAGIC; 10459 l2dhdr->dh_version = L2ARC_PERSISTENT_VERSION; 10460 l2dhdr->dh_spa_guid = spa_guid(dev->l2ad_vdev->vdev_spa); 10461 l2dhdr->dh_vdev_guid = dev->l2ad_vdev->vdev_guid; 10462 l2dhdr->dh_log_entries = dev->l2ad_log_entries; 10463 l2dhdr->dh_evict = dev->l2ad_evict; 10464 l2dhdr->dh_start = dev->l2ad_start; 10465 l2dhdr->dh_end = dev->l2ad_end; 10466 l2dhdr->dh_lb_asize = zfs_refcount_count(&dev->l2ad_lb_asize); 10467 l2dhdr->dh_lb_count = zfs_refcount_count(&dev->l2ad_lb_count); 10468 l2dhdr->dh_flags = 0; 10469 l2dhdr->dh_trim_action_time = dev->l2ad_vdev->vdev_trim_action_time; 10470 l2dhdr->dh_trim_state = dev->l2ad_vdev->vdev_trim_state; 10471 if (dev->l2ad_first) 10472 l2dhdr->dh_flags |= L2ARC_DEV_HDR_EVICT_FIRST; 10473 10474 abd = abd_get_from_buf(l2dhdr, l2dhdr_asize); 10475 10476 err = zio_wait(zio_write_phys(NULL, dev->l2ad_vdev, 10477 VDEV_LABEL_START_SIZE, l2dhdr_asize, abd, ZIO_CHECKSUM_LABEL, NULL, 10478 NULL, ZIO_PRIORITY_ASYNC_WRITE, ZIO_FLAG_CANFAIL, B_FALSE)); 10479 10480 abd_free(abd); 10481 10482 if (err != 0) { 10483 zfs_dbgmsg("L2ARC IO error (%d) while writing device header, " 10484 "vdev guid: %llu", err, 10485 (u_longlong_t)dev->l2ad_vdev->vdev_guid); 10486 } 10487 } 10488 10489 /* 10490 * Commits a log block to the L2ARC device. This routine is invoked from 10491 * l2arc_write_buffers when the log block fills up. 10492 * This function allocates some memory to temporarily hold the serialized 10493 * buffer to be written. This is then released in l2arc_write_done. 10494 */ 10495 static uint64_t 10496 l2arc_log_blk_commit(l2arc_dev_t *dev, zio_t *pio, l2arc_write_callback_t *cb) 10497 { 10498 l2arc_log_blk_phys_t *lb = &dev->l2ad_log_blk; 10499 l2arc_dev_hdr_phys_t *l2dhdr = dev->l2ad_dev_hdr; 10500 uint64_t psize, asize; 10501 zio_t *wzio; 10502 l2arc_lb_abd_buf_t *abd_buf; 10503 abd_t *abd = NULL; 10504 l2arc_lb_ptr_buf_t *lb_ptr_buf; 10505 10506 VERIFY3S(dev->l2ad_log_ent_idx, ==, dev->l2ad_log_entries); 10507 10508 abd_buf = zio_buf_alloc(sizeof (*abd_buf)); 10509 abd_buf->abd = abd_get_from_buf(lb, sizeof (*lb)); 10510 lb_ptr_buf = kmem_zalloc(sizeof (l2arc_lb_ptr_buf_t), KM_SLEEP); 10511 lb_ptr_buf->lb_ptr = kmem_zalloc(sizeof (l2arc_log_blkptr_t), KM_SLEEP); 10512 10513 /* link the buffer into the block chain */ 10514 lb->lb_prev_lbp = l2dhdr->dh_start_lbps[1]; 10515 lb->lb_magic = L2ARC_LOG_BLK_MAGIC; 10516 10517 /* 10518 * l2arc_log_blk_commit() may be called multiple times during a single 10519 * l2arc_write_buffers() call. Save the allocated abd buffers in a list 10520 * so we can free them in l2arc_write_done() later on. 10521 */ 10522 list_insert_tail(&cb->l2wcb_abd_list, abd_buf); 10523 10524 /* try to compress the buffer */ 10525 psize = zio_compress_data(ZIO_COMPRESS_LZ4, 10526 abd_buf->abd, &abd, sizeof (*lb), 0); 10527 10528 /* a log block is never entirely zero */ 10529 ASSERT(psize != 0); 10530 asize = vdev_psize_to_asize(dev->l2ad_vdev, psize); 10531 ASSERT(asize <= sizeof (*lb)); 10532 10533 /* 10534 * Update the start log block pointer in the device header to point 10535 * to the log block we're about to write. 
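 *
 * Worked example of the two-slot shift performed below (log blocks A, B
 * and C are hypothetical, committed in that order, with C being the block
 * committed by this call):
 *
 *	at entry:           dh_start_lbps[0] = B, dh_start_lbps[1] = A
 *	chain link (above): lb->lb_prev_lbp = dh_start_lbps[1] = A
 *	after the shift:    dh_start_lbps[0] = C, dh_start_lbps[1] = B
 *
 * The device header therefore always names the two newest log blocks,
 * while each block's lb_prev_lbp points two blocks back in time; this is
 * what lets l2arc_rebuild() hold a current pointer and a prefetchable
 * next pointer at every step of the chain walk.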
10536 */ 10537 l2dhdr->dh_start_lbps[1] = l2dhdr->dh_start_lbps[0]; 10538 l2dhdr->dh_start_lbps[0].lbp_daddr = dev->l2ad_hand; 10539 l2dhdr->dh_start_lbps[0].lbp_payload_asize = 10540 dev->l2ad_log_blk_payload_asize; 10541 l2dhdr->dh_start_lbps[0].lbp_payload_start = 10542 dev->l2ad_log_blk_payload_start; 10543 L2BLK_SET_LSIZE( 10544 (&l2dhdr->dh_start_lbps[0])->lbp_prop, sizeof (*lb)); 10545 L2BLK_SET_PSIZE( 10546 (&l2dhdr->dh_start_lbps[0])->lbp_prop, asize); 10547 L2BLK_SET_CHECKSUM( 10548 (&l2dhdr->dh_start_lbps[0])->lbp_prop, 10549 ZIO_CHECKSUM_FLETCHER_4); 10550 if (asize < sizeof (*lb)) { 10551 /* compression succeeded */ 10552 abd_zero_off(abd, psize, asize - psize); 10553 L2BLK_SET_COMPRESS( 10554 (&l2dhdr->dh_start_lbps[0])->lbp_prop, 10555 ZIO_COMPRESS_LZ4); 10556 } else { 10557 /* compression failed */ 10558 abd_copy_from_buf_off(abd, lb, 0, sizeof (*lb)); 10559 L2BLK_SET_COMPRESS( 10560 (&l2dhdr->dh_start_lbps[0])->lbp_prop, 10561 ZIO_COMPRESS_OFF); 10562 } 10563 10564 /* checksum what we're about to write */ 10565 abd_fletcher_4_native(abd, asize, NULL, 10566 &l2dhdr->dh_start_lbps[0].lbp_cksum); 10567 10568 abd_free(abd_buf->abd); 10569 10570 /* perform the write itself */ 10571 abd_buf->abd = abd; 10572 wzio = zio_write_phys(pio, dev->l2ad_vdev, dev->l2ad_hand, 10573 asize, abd_buf->abd, ZIO_CHECKSUM_OFF, NULL, NULL, 10574 ZIO_PRIORITY_ASYNC_WRITE, ZIO_FLAG_CANFAIL, B_FALSE); 10575 DTRACE_PROBE2(l2arc__write, vdev_t *, dev->l2ad_vdev, zio_t *, wzio); 10576 (void) zio_nowait(wzio); 10577 10578 dev->l2ad_hand += asize; 10579 /* 10580 * Include the committed log block's pointer in the list of pointers 10581 * to log blocks present in the L2ARC device. 10582 */ 10583 memcpy(lb_ptr_buf->lb_ptr, &l2dhdr->dh_start_lbps[0], 10584 sizeof (l2arc_log_blkptr_t)); 10585 mutex_enter(&dev->l2ad_mtx); 10586 list_insert_head(&dev->l2ad_lbptr_list, lb_ptr_buf); 10587 ARCSTAT_INCR(arcstat_l2_log_blk_asize, asize); 10588 ARCSTAT_BUMP(arcstat_l2_log_blk_count); 10589 zfs_refcount_add_many(&dev->l2ad_lb_asize, asize, lb_ptr_buf); 10590 zfs_refcount_add(&dev->l2ad_lb_count, lb_ptr_buf); 10591 mutex_exit(&dev->l2ad_mtx); 10592 vdev_space_update(dev->l2ad_vdev, asize, 0, 0); 10593 10594 /* bump the kstats */ 10595 ARCSTAT_INCR(arcstat_l2_write_bytes, asize); 10596 ARCSTAT_BUMP(arcstat_l2_log_blk_writes); 10597 ARCSTAT_F_AVG(arcstat_l2_log_blk_avg_asize, asize); 10598 ARCSTAT_F_AVG(arcstat_l2_data_to_meta_ratio, 10599 dev->l2ad_log_blk_payload_asize / asize); 10600 10601 /* start a new log block */ 10602 dev->l2ad_log_ent_idx = 0; 10603 dev->l2ad_log_blk_payload_asize = 0; 10604 dev->l2ad_log_blk_payload_start = 0; 10605 10606 return (asize); 10607 } 10608 10609 /* 10610 * Validates an L2ARC log block address to make sure that it can be read 10611 * from the provided L2ARC device. 
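 *
 * A worked numeric example (all offsets are made up; the precise
 * conditions and a diagram follow in the function body):
 *
 *	l2ad_start = 100, l2ad_end = 200, l2ad_hand = 140, l2ad_evict = 160
 *	lbp_payload_start = 150, block end (lbp_daddr + asize - 1) = 155
 *
 * The range [150, 155] falls inside [l2ad_hand, l2ad_evict] = [140, 160],
 * i.e. it has already been evicted ahead of the write hand, so the
 * pointer is rejected, unless this is still the device's first pass
 * (l2ad_first), in which case nothing has actually been overwritten yet.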
10612 */ 10613 boolean_t 10614 l2arc_log_blkptr_valid(l2arc_dev_t *dev, const l2arc_log_blkptr_t *lbp) 10615 { 10616 /* L2BLK_GET_PSIZE returns aligned size for log blocks */ 10617 uint64_t asize = L2BLK_GET_PSIZE((lbp)->lbp_prop); 10618 uint64_t end = lbp->lbp_daddr + asize - 1; 10619 uint64_t start = lbp->lbp_payload_start; 10620 boolean_t evicted = B_FALSE; 10621 10622 /* 10623 * A log block is valid if all of the following conditions are true: 10624 * - it fits entirely (including its payload) between l2ad_start and 10625 * l2ad_end 10626 * - it has a valid size 10627 * - neither the log block itself nor part of its payload was evicted 10628 * by l2arc_evict(): 10629 * 10630 * l2ad_hand l2ad_evict 10631 * | | lbp_daddr 10632 * | start | | end 10633 * | | | | | 10634 * V V V V V 10635 * l2ad_start ============================================ l2ad_end 10636 * --------------------------|||| 10637 * ^ ^ 10638 * | log block 10639 * payload 10640 */ 10641 10642 evicted = 10643 l2arc_range_check_overlap(start, end, dev->l2ad_hand) || 10644 l2arc_range_check_overlap(start, end, dev->l2ad_evict) || 10645 l2arc_range_check_overlap(dev->l2ad_hand, dev->l2ad_evict, start) || 10646 l2arc_range_check_overlap(dev->l2ad_hand, dev->l2ad_evict, end); 10647 10648 return (start >= dev->l2ad_start && end <= dev->l2ad_end && 10649 asize > 0 && asize <= sizeof (l2arc_log_blk_phys_t) && 10650 (!evicted || dev->l2ad_first)); 10651 } 10652 10653 /* 10654 * Inserts ARC buffer header `hdr' into the current L2ARC log block on 10655 * the device. The buffer being inserted must be present in L2ARC. 10656 * Returns B_TRUE if the L2ARC log block is full and needs to be committed 10657 * to L2ARC, or B_FALSE if it still has room for more ARC buffers. 10658 */ 10659 static boolean_t 10660 l2arc_log_blk_insert(l2arc_dev_t *dev, const arc_buf_hdr_t *hdr) 10661 { 10662 l2arc_log_blk_phys_t *lb = &dev->l2ad_log_blk; 10663 l2arc_log_ent_phys_t *le; 10664 10665 if (dev->l2ad_log_entries == 0) 10666 return (B_FALSE); 10667 10668 int index = dev->l2ad_log_ent_idx++; 10669 10670 ASSERT3S(index, <, dev->l2ad_log_entries); 10671 ASSERT(HDR_HAS_L2HDR(hdr)); 10672 10673 le = &lb->lb_entries[index]; 10674 memset(le, 0, sizeof (*le)); 10675 le->le_dva = hdr->b_dva; 10676 le->le_birth = hdr->b_birth; 10677 le->le_daddr = hdr->b_l2hdr.b_daddr; 10678 if (index == 0) 10679 dev->l2ad_log_blk_payload_start = le->le_daddr; 10680 L2BLK_SET_LSIZE((le)->le_prop, HDR_GET_LSIZE(hdr)); 10681 L2BLK_SET_PSIZE((le)->le_prop, HDR_GET_PSIZE(hdr)); 10682 L2BLK_SET_COMPRESS((le)->le_prop, HDR_GET_COMPRESS(hdr)); 10683 le->le_complevel = hdr->b_complevel; 10684 L2BLK_SET_TYPE((le)->le_prop, hdr->b_type); 10685 L2BLK_SET_PROTECTED((le)->le_prop, !!(HDR_PROTECTED(hdr))); 10686 L2BLK_SET_PREFETCH((le)->le_prop, !!(HDR_PREFETCH(hdr))); 10687 L2BLK_SET_STATE((le)->le_prop, hdr->b_l2hdr.b_arcs_state); 10688 10689 dev->l2ad_log_blk_payload_asize += vdev_psize_to_asize(dev->l2ad_vdev, 10690 HDR_GET_PSIZE(hdr)); 10691 10692 return (dev->l2ad_log_ent_idx == dev->l2ad_log_entries); 10693 } 10694 10695 /* 10696 * Checks whether a given L2ARC device address sits in a time-sequential 10697 * range. The trick here is that the L2ARC is a rotary buffer, so we can't 10698 * just do a range comparison, we need to handle the situation in which the 10699 * range wraps around the end of the L2ARC device. Arguments: 10700 * bottom -- Lower end of the range to check (written to earlier). 10701 * top -- Upper end of the range to check (written to later). 
10702 * check -- The address for which we want to determine if it sits in 10703 * between the top and bottom. 10704 * 10705 * The 3-way conditional below represents the following cases: 10706 * 10707 * bottom < top : Sequentially ordered case: 10708 * <check>--------+-------------------+ 10709 * | (overlap here?) | 10710 * L2ARC dev V V 10711 * |---------------<bottom>============<top>--------------| 10712 * 10713 * bottom > top: Looped-around case: 10714 * <check>--------+------------------+ 10715 * | (overlap here?) | 10716 * L2ARC dev V V 10717 * |===============<top>---------------<bottom>===========| 10718 * ^ ^ 10719 * | (or here?) | 10720 * +---------------+---------<check> 10721 * 10722 * top == bottom : Just a single address comparison. 10723 */ 10724 boolean_t 10725 l2arc_range_check_overlap(uint64_t bottom, uint64_t top, uint64_t check) 10726 { 10727 if (bottom < top) 10728 return (bottom <= check && check <= top); 10729 else if (bottom > top) 10730 return (check <= top || bottom <= check); 10731 else 10732 return (check == top); 10733 } 10734 10735 EXPORT_SYMBOL(arc_buf_size); 10736 EXPORT_SYMBOL(arc_write); 10737 EXPORT_SYMBOL(arc_read); 10738 EXPORT_SYMBOL(arc_buf_info); 10739 EXPORT_SYMBOL(arc_getbuf_func); 10740 EXPORT_SYMBOL(arc_add_prune_callback); 10741 EXPORT_SYMBOL(arc_remove_prune_callback); 10742 10743 ZFS_MODULE_PARAM_CALL(zfs_arc, zfs_arc_, min, param_set_arc_min, 10744 spl_param_get_u64, ZMOD_RW, "Minimum ARC size in bytes"); 10745 10746 ZFS_MODULE_PARAM_CALL(zfs_arc, zfs_arc_, max, param_set_arc_max, 10747 spl_param_get_u64, ZMOD_RW, "Maximum ARC size in bytes"); 10748 10749 ZFS_MODULE_PARAM(zfs_arc, zfs_arc_, meta_balance, UINT, ZMOD_RW, 10750 "Balance between metadata and data on ghost hits."); 10751 10752 ZFS_MODULE_PARAM_CALL(zfs_arc, zfs_arc_, grow_retry, param_set_arc_int, 10753 param_get_uint, ZMOD_RW, "Seconds before growing ARC size"); 10754 10755 ZFS_MODULE_PARAM_CALL(zfs_arc, zfs_arc_, shrink_shift, param_set_arc_int, 10756 param_get_uint, ZMOD_RW, "log2(fraction of ARC to reclaim)"); 10757 10758 ZFS_MODULE_PARAM(zfs_arc, zfs_arc_, pc_percent, UINT, ZMOD_RW, 10759 "Percent of pagecache to reclaim ARC to"); 10760 10761 ZFS_MODULE_PARAM(zfs_arc, zfs_arc_, average_blocksize, UINT, ZMOD_RD, 10762 "Target average block size"); 10763 10764 ZFS_MODULE_PARAM(zfs, zfs_, compressed_arc_enabled, INT, ZMOD_RW, 10765 "Disable compressed ARC buffers"); 10766 10767 ZFS_MODULE_PARAM_CALL(zfs_arc, zfs_arc_, min_prefetch_ms, param_set_arc_int, 10768 param_get_uint, ZMOD_RW, "Min life of prefetch block in ms"); 10769 10770 ZFS_MODULE_PARAM_CALL(zfs_arc, zfs_arc_, min_prescient_prefetch_ms, 10771 param_set_arc_int, param_get_uint, ZMOD_RW, 10772 "Min life of prescient prefetched block in ms"); 10773 10774 ZFS_MODULE_PARAM(zfs_l2arc, l2arc_, write_max, U64, ZMOD_RW, 10775 "Max write bytes per interval"); 10776 10777 ZFS_MODULE_PARAM(zfs_l2arc, l2arc_, write_boost, U64, ZMOD_RW, 10778 "Extra write bytes during device warmup"); 10779 10780 ZFS_MODULE_PARAM(zfs_l2arc, l2arc_, headroom, U64, ZMOD_RW, 10781 "Number of max device writes to precache"); 10782 10783 ZFS_MODULE_PARAM(zfs_l2arc, l2arc_, headroom_boost, U64, ZMOD_RW, 10784 "Compressed l2arc_headroom multiplier"); 10785 10786 ZFS_MODULE_PARAM(zfs_l2arc, l2arc_, trim_ahead, U64, ZMOD_RW, 10787 "TRIM ahead L2ARC write size multiplier"); 10788 10789 ZFS_MODULE_PARAM(zfs_l2arc, l2arc_, feed_secs, U64, ZMOD_RW, 10790 "Seconds between L2ARC writing"); 10791 10792 ZFS_MODULE_PARAM(zfs_l2arc, l2arc_, feed_min_ms, U64, ZMOD_RW, 
10793 "Min feed interval in milliseconds"); 10794 10795 ZFS_MODULE_PARAM(zfs_l2arc, l2arc_, noprefetch, INT, ZMOD_RW, 10796 "Skip caching prefetched buffers"); 10797 10798 ZFS_MODULE_PARAM(zfs_l2arc, l2arc_, feed_again, INT, ZMOD_RW, 10799 "Turbo L2ARC warmup"); 10800 10801 ZFS_MODULE_PARAM(zfs_l2arc, l2arc_, norw, INT, ZMOD_RW, 10802 "No reads during writes"); 10803 10804 ZFS_MODULE_PARAM(zfs_l2arc, l2arc_, meta_percent, UINT, ZMOD_RW, 10805 "Percent of ARC size allowed for L2ARC-only headers"); 10806 10807 ZFS_MODULE_PARAM(zfs_l2arc, l2arc_, rebuild_enabled, INT, ZMOD_RW, 10808 "Rebuild the L2ARC when importing a pool"); 10809 10810 ZFS_MODULE_PARAM(zfs_l2arc, l2arc_, rebuild_blocks_min_l2size, U64, ZMOD_RW, 10811 "Min size in bytes to write rebuild log blocks in L2ARC"); 10812 10813 ZFS_MODULE_PARAM(zfs_l2arc, l2arc_, mfuonly, INT, ZMOD_RW, 10814 "Cache only MFU data from ARC into L2ARC"); 10815 10816 ZFS_MODULE_PARAM(zfs_l2arc, l2arc_, exclude_special, INT, ZMOD_RW, 10817 "Exclude dbufs on special vdevs from being cached to L2ARC if set."); 10818 10819 ZFS_MODULE_PARAM_CALL(zfs_arc, zfs_arc_, lotsfree_percent, param_set_arc_int, 10820 param_get_uint, ZMOD_RW, "System free memory I/O throttle in bytes"); 10821 10822 ZFS_MODULE_PARAM_CALL(zfs_arc, zfs_arc_, sys_free, param_set_arc_u64, 10823 spl_param_get_u64, ZMOD_RW, "System free memory target size in bytes"); 10824 10825 ZFS_MODULE_PARAM_CALL(zfs_arc, zfs_arc_, dnode_limit, param_set_arc_u64, 10826 spl_param_get_u64, ZMOD_RW, "Minimum bytes of dnodes in ARC"); 10827 10828 ZFS_MODULE_PARAM_CALL(zfs_arc, zfs_arc_, dnode_limit_percent, 10829 param_set_arc_int, param_get_uint, ZMOD_RW, 10830 "Percent of ARC meta buffers for dnodes"); 10831 10832 ZFS_MODULE_PARAM(zfs_arc, zfs_arc_, dnode_reduce_percent, UINT, ZMOD_RW, 10833 "Percentage of excess dnodes to try to unpin"); 10834 10835 ZFS_MODULE_PARAM(zfs_arc, zfs_arc_, eviction_pct, UINT, ZMOD_RW, 10836 "When full, ARC allocation waits for eviction of this % of alloc size"); 10837 10838 ZFS_MODULE_PARAM(zfs_arc, zfs_arc_, evict_batch_limit, UINT, ZMOD_RW, 10839 "The number of headers to evict per sublist before moving to the next"); 10840 10841 ZFS_MODULE_PARAM(zfs_arc, zfs_arc_, prune_task_threads, INT, ZMOD_RW, 10842 "Number of arc_prune threads"); 10843