/*
 * CDDL HEADER START
 *
 * The contents of this file are subject to the terms of the
 * Common Development and Distribution License (the "License").
 * You may not use this file except in compliance with the License.
 *
 * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
 * or http://www.opensolaris.org/os/licensing.
 * See the License for the specific language governing permissions
 * and limitations under the License.
 *
 * When distributing Covered Code, include this CDDL HEADER in each
 * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
 * If applicable, add the following below this CDDL HEADER, with the
 * fields enclosed by brackets "[]" replaced with your own identifying
 * information: Portions Copyright [yyyy] [name of copyright owner]
 *
 * CDDL HEADER END
 */
/*
 * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
 * Copyright (c) 2019, Joyent, Inc.
 * Copyright (c) 2011, 2018 by Delphix. All rights reserved.
 * Copyright (c) 2014 by Saso Kiselkov. All rights reserved.
 * Copyright 2017 Nexenta Systems, Inc. All rights reserved.
 * Copyright (c) 2011, 2019, Delphix. All rights reserved.
 * Copyright (c) 2020, George Amanakis. All rights reserved.
 */

/*
 * DVA-based Adjustable Replacement Cache
 *
 * While much of the theory of operation used here is
 * based on the self-tuning, low overhead replacement cache
 * presented by Megiddo and Modha at FAST 2003, there are some
 * significant differences:
 *
 * 1. The Megiddo and Modha model assumes any page is evictable.
 * Pages in its cache cannot be "locked" into memory. This makes
 * the eviction algorithm simple: evict the last page in the list.
 * This also makes the performance characteristics easy to reason
 * about. Our cache is not so simple.
 * At any given moment, some
 * subset of the blocks in the cache are un-evictable because we
 * have handed out a reference to them. Blocks are only evictable
 * when there are no external references active. This makes
 * eviction far more problematic: we choose to evict the evictable
 * blocks that are the "lowest" in the list.
 *
 * There are times when it is not possible to evict the requested
 * space. In these circumstances we are unable to adjust the cache
 * size. To prevent the cache growing unbounded at these times we
 * implement a "cache throttle" that slows the flow of new data
 * into the cache until we can make space available.
 *
 * 2. The Megiddo and Modha model assumes a fixed cache size.
 * Pages are evicted when the cache is full and there is a cache
 * miss. Our model has a variable sized cache. It grows with
 * high use, but also tries to react to memory pressure from the
 * operating system: decreasing its size when system memory is
 * tight.
 *
 * 3. The Megiddo and Modha model assumes a fixed page size. All
 * elements of the cache are therefore exactly the same size. So
 * when adjusting the cache size following a cache miss, it's simply
 * a matter of choosing a single page to evict. In our model, we
 * have variable sized cache blocks (ranging from 512 bytes to
 * 128K bytes). We therefore choose a set of blocks to evict to make
 * space for a cache miss that approximates as closely as possible
 * the space used by the new block.
 *
 * See also: "ARC: A Self-Tuning, Low Overhead Replacement Cache"
 * by N. Megiddo & D. Modha, FAST 2003
 */

/*
 * The locking model:
 *
 * A new reference to a cache buffer can be obtained in two
 * ways: 1) via a hash table lookup using the DVA as a key,
 * or 2) via one of the ARC lists. The arc_read() interface
 * uses method 1, while the internal ARC algorithms for
 * adjusting the cache use method 2.
 * We therefore provide two
 * types of locks: 1) the hash table lock array, and 2) the
 * ARC list locks.
 *
 * Buffers do not have their own mutexes, rather they rely on the
 * hash table mutexes for the bulk of their protection (i.e. most
 * fields in the arc_buf_hdr_t are protected by these mutexes).
 *
 * buf_hash_find() returns the appropriate mutex (held) when it
 * locates the requested buffer in the hash table. It returns
 * NULL for the mutex if the buffer was not in the table.
 *
 * buf_hash_remove() expects the appropriate hash mutex to be
 * already held before it is invoked.
 *
 * Each ARC state also has a mutex which is used to protect the
 * buffer list associated with the state. When attempting to
 * obtain a hash table lock while holding an ARC list lock you
 * must use: mutex_tryenter() to avoid deadlock. Also note that
 * the active state mutex must be held before the ghost state mutex.
 *
 * Note that the majority of the performance stats are manipulated
 * with atomic operations.
 *
 * The L2ARC uses the l2ad_mtx on each vdev for the following:
 *
 *	- L2ARC buflist creation
 *	- L2ARC buflist eviction
 *	- L2ARC write completion, which walks L2ARC buflists
 *	- ARC header destruction, as it removes from L2ARC buflists
 *	- ARC header release, as it removes from L2ARC buflists
 */

/*
 * ARC operation:
 *
 * Every block that is in the ARC is tracked by an arc_buf_hdr_t structure.
 * This structure can point either to a block that is still in the cache or to
 * one that is only accessible in an L2 ARC device, or it can provide
 * information about a block that was recently evicted. If a block is
 * only accessible in the L2ARC, then the arc_buf_hdr_t only has enough
 * information to retrieve it from the L2ARC device. This information is
 * stored in the l2arc_buf_hdr_t sub-structure of the arc_buf_hdr_t.
 * A block
 * that is in this state cannot access the data directly.
 *
 * Blocks that are actively being referenced or have not been evicted
 * are cached in the L1ARC. The L1ARC (l1arc_buf_hdr_t) is a structure within
 * the arc_buf_hdr_t that will point to the data block in memory. A block can
 * only be read by a consumer if it has an l1arc_buf_hdr_t. The L1ARC
 * caches data in two ways -- in a list of ARC buffers (arc_buf_t) and
 * also in the arc_buf_hdr_t's private physical data block pointer (b_pabd).
 *
 * The L1ARC's data pointer may or may not be uncompressed. The ARC has the
 * ability to store the physical data (b_pabd) associated with the DVA of the
 * arc_buf_hdr_t. Since the b_pabd is a copy of the on-disk physical block,
 * it will match its on-disk compression characteristics. This behavior can be
 * disabled by setting 'zfs_compressed_arc_enabled' to B_FALSE. When the
 * compressed ARC functionality is disabled, the b_pabd will point to an
 * uncompressed version of the on-disk data.
 *
 * Data in the L1ARC is not accessed by consumers of the ARC directly. Each
 * arc_buf_hdr_t can have multiple ARC buffers (arc_buf_t) which reference it.
 * Each ARC buffer (arc_buf_t) is being actively accessed by a specific ARC
 * consumer. The ARC will provide references to this data and will keep it
 * cached until it is no longer in use. The ARC caches only the L1ARC's physical
 * data block and will evict any arc_buf_t that is no longer referenced. The
 * amount of memory consumed by the arc_buf_ts' data buffers can be seen via the
 * "overhead_size" kstat.
 *
 * Depending on the consumer, an arc_buf_t can be requested in uncompressed or
 * compressed form. The typical case is that consumers will want uncompressed
 * data, and when that happens a new data buffer is allocated where the data is
 * decompressed for them to use.
 * Currently the only consumer who wants
 * compressed arc_buf_t's is "zfs send", when it streams data exactly as it
 * exists on disk. When this happens, the arc_buf_t's data buffer is shared
 * with the arc_buf_hdr_t.
 *
 * Here is a diagram showing an arc_buf_hdr_t referenced by two arc_buf_t's. The
 * first one is owned by a compressed send consumer (and therefore references
 * the same compressed data buffer as the arc_buf_hdr_t) and the second could be
 * used by any other consumer (and has its own uncompressed copy of the data
 * buffer).
 *
 *                arc_buf_hdr_t
 *                +-----------+
 *                | fields    |
 *                | common to |
 *                | L1- and   |
 *                | L2ARC     |
 *                +-----------+
 *                | l2arc_buf_hdr_t
 *                |           |
 *                +-----------+
 *                | l1arc_buf_hdr_t
 *                |           |              arc_buf_t
 *                | b_buf     +------------>+-----------+      arc_buf_t
 *                | b_pabd    +-+           |b_next     +---->+-----------+
 *                +-----------+ |           |-----------|     |b_next     +-->NULL
 *                              |           |b_comp = T |     +-----------+
 *                              |           |b_data     +-+   |b_comp = F |
 *                              |           +-----------+ |   |b_data     +-+
 *                              +->+------+               |   +-----------+ |
 *                   compressed    |      |               |                 |
 *                      data       |      |<--------------+                 | uncompressed
 *                                 +------+          compressed,            | data
 *                                                     shared               +-->+------+
 *                                                      data                    |      |
 *                                                                              |      |
 *                                                                              +------+
 *
 * When a consumer reads a block, the ARC must first look to see if the
 * arc_buf_hdr_t is cached. If the hdr is cached then the ARC allocates a new
 * arc_buf_t and either copies uncompressed data into a new data buffer from an
 * existing uncompressed arc_buf_t, decompresses the hdr's b_pabd buffer into a
 * new data buffer, or shares the hdr's b_pabd buffer, depending on whether the
 * hdr is compressed and the desired compression characteristics of the
 * arc_buf_t consumer.
 * If the arc_buf_t ends up sharing data with the
 * arc_buf_hdr_t and both of them are uncompressed then the arc_buf_t must be
 * the last buffer in the hdr's b_buf list, however a shared compressed buf can
 * be anywhere in the hdr's list.
 *
 * The diagram below shows an example of an uncompressed ARC hdr that is
 * sharing its data with an arc_buf_t (note that the shared uncompressed buf is
 * the last element in the buf list):
 *
 *                arc_buf_hdr_t
 *                +-----------+
 *                |           |
 *                |           |
 *                |           |
 *                +-----------+
 * l2arc_buf_hdr_t|           |
 *                |           |
 *                +-----------+
 * l1arc_buf_hdr_t|           |
 *                |           |              arc_buf_t (shared)
 *                |    b_buf  +------------>+---------+      arc_buf_t
 *                |           |             |b_next   +---->+---------+
 *                |  b_pabd   +-+           |---------|     |b_next   +-->NULL
 *                +-----------+ |           |         |     +---------+
 *                              |           |b_data   +-+   |         |
 *                              |           +---------+ |   |b_data   +-+
 *                              +->+------+             |   +---------+ |
 *                                 |      |             |               |
 *                   uncompressed  |      |             |               |
 *                      data       +------+             |               |
 *                                    ^                 +->+------+     |
 *                                    |       uncompressed |      |     |
 *                                    |           data     |      |     |
 *                                    |                    +------+     |
 *                                    +---------------------------------+
 *
 * Writing to the ARC requires that the ARC first discard the hdr's b_pabd
 * since the physical block is about to be rewritten. The new data contents
 * will be contained in the arc_buf_t. As the I/O pipeline performs the write,
 * it may compress the data before writing it to disk. The ARC will be called
 * with the transformed data and will bcopy the transformed on-disk block into
 * a newly allocated b_pabd. Writes are always done into buffers which have
 * either been loaned (and hence are new and don't have other readers) or
 * buffers which have been released (and hence have their own hdr, if there
 * were originally other readers of the buf's original hdr). This ensures that
 * the ARC only needs to update a single buf and its hdr after a write occurs.
 *
 * When the L2ARC is in use, it will also take advantage of the b_pabd. The
 * L2ARC will always write the contents of b_pabd to the L2ARC. This means
 * that when compressed ARC is enabled that the L2ARC blocks are identical
 * to the on-disk block in the main data pool. This provides a significant
 * advantage since the ARC can leverage the bp's checksum when reading from the
 * L2ARC to determine if the contents are valid. However, if the compressed
 * ARC is disabled, then the L2ARC's block must be transformed to look
 * like the physical block in the main data pool before comparing the
 * checksum and determining its validity.
 *
 * The L1ARC has a slightly different system for storing encrypted data.
 * Raw (encrypted + possibly compressed) data has a few subtle differences from
 * data that is just compressed. The biggest difference is that it is not
 * possible to decrypt encrypted data (or vice versa) if the keys aren't loaded.
 * The other difference is that encryption cannot be treated as a suggestion.
 * If a caller would prefer compressed data, but they actually wind up with
 * uncompressed data the worst thing that could happen is there might be a
 * performance hit. If the caller requests encrypted data, however, we must be
 * sure they actually get it or else secret information could be leaked. Raw
 * data is stored in hdr->b_crypt_hdr.b_rabd. An encrypted header, therefore,
 * may have both an encrypted version and a decrypted version of its data at
 * once. When a caller needs a raw arc_buf_t, it is allocated and the data is
 * copied out of this header. To avoid complications with b_pabd, raw buffers
 * cannot be shared.
 */

#include <sys/spa.h>
#include <sys/zio.h>
#include <sys/spa_impl.h>
#include <sys/zio_compress.h>
#include <sys/zio_checksum.h>
#include <sys/zfs_context.h>
#include <sys/arc.h>
#include <sys/refcount.h>
#include <sys/vdev.h>
#include <sys/vdev_impl.h>
#include <sys/dsl_pool.h>
#include <sys/zio_checksum.h>
#include <sys/multilist.h>
#include <sys/abd.h>
#include <sys/zil.h>
#include <sys/fm/fs/zfs.h>
#ifdef _KERNEL
#include <sys/vmsystm.h>
#include <vm/anon.h>
#include <sys/fs/swapnode.h>
#include <sys/dnlc.h>
#endif
#include <sys/callb.h>
#include <sys/kstat.h>
#include <sys/zthr.h>
#include <zfs_fletcher.h>
#include <sys/arc_impl.h>
#include <sys/aggsum.h>
#include <sys/cityhash.h>
#include <sys/param.h>

#ifndef _KERNEL
/* set with ZFS_DEBUG=watch, to enable watchpoints on frozen buffers */
boolean_t arc_watch = B_FALSE;
int arc_procfd;
#endif

/*
 * This thread's job is to keep enough free memory in the system, by
 * calling arc_kmem_reap_now() plus arc_shrink(), which improves
 * arc_available_memory().
 */
static zthr_t		*arc_reap_zthr;

/*
 * This thread's job is to keep arc_size under arc_c, by calling
 * arc_adjust(), which improves arc_is_overflowing().
 */
static zthr_t		*arc_adjust_zthr;

static kmutex_t		arc_adjust_lock;
static kcondvar_t	arc_adjust_waiters_cv;
static boolean_t	arc_adjust_needed = B_FALSE;

uint_t arc_reduce_dnlc_percent = 3;

/*
 * The number of headers to evict in arc_evict_state_impl() before
 * dropping the sublist lock and evicting from another sublist. A lower
 * value means we're more likely to evict the "correct" header (i.e. the
 * oldest header in the arc state), but comes with higher overhead
 * (i.e. more invocations of arc_evict_state_impl()).
 */
int zfs_arc_evict_batch_limit = 10;

/* number of seconds before growing cache again */
int arc_grow_retry = 60;

/*
 * Minimum time between calls to arc_kmem_reap_soon(). Note that this will
 * be converted to ticks, so with the default hz=100, a setting of 15 ms
 * will actually wait 2 ticks, or 20ms.
 */
int arc_kmem_cache_reap_retry_ms = 1000;

/* shift of arc_c for calculating overflow limit in arc_get_data_impl */
int zfs_arc_overflow_shift = 8;

/* shift of arc_c for calculating both min and max arc_p */
int arc_p_min_shift = 4;

/* log2(fraction of arc to reclaim) */
int arc_shrink_shift = 7;

/*
 * log2(fraction of ARC which must be free to allow growing).
 * I.e. If there is less than arc_c >> arc_no_grow_shift free memory,
 * when reading a new block into the ARC, we will evict an equal-sized block
 * from the ARC.
 *
 * This must be less than arc_shrink_shift, so that when we shrink the ARC,
 * we will still not allow it to grow.
 */
int arc_no_grow_shift = 5;


/*
 * minimum lifespan of a prefetch block in clock ticks
 * (initialized in arc_init())
 */
static int zfs_arc_min_prefetch_ms = 1;
static int zfs_arc_min_prescient_prefetch_ms = 6;

/*
 * If this percent of memory is free, don't throttle.
 */
int arc_lotsfree_percent = 10;

static boolean_t arc_initialized;

/*
 * The arc has filled available memory and has now warmed up.
 */
static boolean_t arc_warm;

/*
 * log2 fraction of the zio arena to keep free.
 */
int arc_zio_arena_free_shift = 2;

/*
 * These tunables are for performance analysis.
 */
uint64_t zfs_arc_max;
uint64_t zfs_arc_min;
uint64_t zfs_arc_meta_limit = 0;
uint64_t zfs_arc_meta_min = 0;
int zfs_arc_grow_retry = 0;
int zfs_arc_shrink_shift = 0;
int zfs_arc_p_min_shift = 0;
int zfs_arc_average_blocksize = 8 * 1024; /* 8KB */

/*
 * ARC dirty data constraints for arc_tempreserve_space() throttle
 */
uint_t zfs_arc_dirty_limit_percent = 50;	/* total dirty data limit */
uint_t zfs_arc_anon_limit_percent = 25;		/* anon block dirty limit */
uint_t zfs_arc_pool_dirty_percent = 20;		/* each pool's anon allowance */

boolean_t zfs_compressed_arc_enabled = B_TRUE;

/* The 6 states: */
static arc_state_t ARC_anon;
static arc_state_t ARC_mru;
static arc_state_t ARC_mru_ghost;
static arc_state_t ARC_mfu;
static arc_state_t ARC_mfu_ghost;
static arc_state_t ARC_l2c_only;

arc_stats_t arc_stats = {
	{ "hits", KSTAT_DATA_UINT64 },
	{ "misses", KSTAT_DATA_UINT64 },
	{ "demand_data_hits", KSTAT_DATA_UINT64 },
	{ "demand_data_misses", KSTAT_DATA_UINT64 },
	{ "demand_metadata_hits", KSTAT_DATA_UINT64 },
	{ "demand_metadata_misses", KSTAT_DATA_UINT64 },
	{ "prefetch_data_hits", KSTAT_DATA_UINT64 },
	{ "prefetch_data_misses", KSTAT_DATA_UINT64 },
	{ "prefetch_metadata_hits", KSTAT_DATA_UINT64 },
	{ "prefetch_metadata_misses", KSTAT_DATA_UINT64 },
	{ "mru_hits", KSTAT_DATA_UINT64 },
	{ "mru_ghost_hits", KSTAT_DATA_UINT64 },
	{ "mfu_hits", KSTAT_DATA_UINT64 },
	{ "mfu_ghost_hits", KSTAT_DATA_UINT64 },
	{ "deleted", KSTAT_DATA_UINT64 },
	{ "mutex_miss", KSTAT_DATA_UINT64 },
	{ "access_skip", KSTAT_DATA_UINT64 },
	{ "evict_skip", KSTAT_DATA_UINT64 },
	{ "evict_not_enough", KSTAT_DATA_UINT64 },
	{ "evict_l2_cached", KSTAT_DATA_UINT64 },
	{ "evict_l2_eligible", KSTAT_DATA_UINT64 },
	{ "evict_l2_ineligible", KSTAT_DATA_UINT64 },
	{ "evict_l2_skip", KSTAT_DATA_UINT64 },
	{ "hash_elements",
	    KSTAT_DATA_UINT64 },
	{ "hash_elements_max", KSTAT_DATA_UINT64 },
	{ "hash_collisions", KSTAT_DATA_UINT64 },
	{ "hash_chains", KSTAT_DATA_UINT64 },
	{ "hash_chain_max", KSTAT_DATA_UINT64 },
	{ "p", KSTAT_DATA_UINT64 },
	{ "c", KSTAT_DATA_UINT64 },
	{ "c_min", KSTAT_DATA_UINT64 },
	{ "c_max", KSTAT_DATA_UINT64 },
	{ "size", KSTAT_DATA_UINT64 },
	{ "compressed_size", KSTAT_DATA_UINT64 },
	{ "uncompressed_size", KSTAT_DATA_UINT64 },
	{ "overhead_size", KSTAT_DATA_UINT64 },
	{ "hdr_size", KSTAT_DATA_UINT64 },
	{ "data_size", KSTAT_DATA_UINT64 },
	{ "metadata_size", KSTAT_DATA_UINT64 },
	{ "other_size", KSTAT_DATA_UINT64 },
	{ "anon_size", KSTAT_DATA_UINT64 },
	{ "anon_evictable_data", KSTAT_DATA_UINT64 },
	{ "anon_evictable_metadata", KSTAT_DATA_UINT64 },
	{ "mru_size", KSTAT_DATA_UINT64 },
	{ "mru_evictable_data", KSTAT_DATA_UINT64 },
	{ "mru_evictable_metadata", KSTAT_DATA_UINT64 },
	{ "mru_ghost_size", KSTAT_DATA_UINT64 },
	{ "mru_ghost_evictable_data", KSTAT_DATA_UINT64 },
	{ "mru_ghost_evictable_metadata", KSTAT_DATA_UINT64 },
	{ "mfu_size", KSTAT_DATA_UINT64 },
	{ "mfu_evictable_data", KSTAT_DATA_UINT64 },
	{ "mfu_evictable_metadata", KSTAT_DATA_UINT64 },
	{ "mfu_ghost_size", KSTAT_DATA_UINT64 },
	{ "mfu_ghost_evictable_data", KSTAT_DATA_UINT64 },
	{ "mfu_ghost_evictable_metadata", KSTAT_DATA_UINT64 },
	{ "l2_hits", KSTAT_DATA_UINT64 },
	{ "l2_misses", KSTAT_DATA_UINT64 },
	{ "l2_feeds", KSTAT_DATA_UINT64 },
	{ "l2_rw_clash", KSTAT_DATA_UINT64 },
	{ "l2_read_bytes", KSTAT_DATA_UINT64 },
	{ "l2_write_bytes", KSTAT_DATA_UINT64 },
	{ "l2_writes_sent", KSTAT_DATA_UINT64 },
	{ "l2_writes_done", KSTAT_DATA_UINT64 },
	{ "l2_writes_error", KSTAT_DATA_UINT64 },
	{ "l2_writes_lock_retry", KSTAT_DATA_UINT64 },
	{ "l2_evict_lock_retry", KSTAT_DATA_UINT64 },
	{ "l2_evict_reading", KSTAT_DATA_UINT64 },
	{ "l2_evict_l1cached", KSTAT_DATA_UINT64 },
	{ "l2_free_on_write", KSTAT_DATA_UINT64 },
	{ "l2_abort_lowmem", KSTAT_DATA_UINT64 },
	{ "l2_cksum_bad", KSTAT_DATA_UINT64 },
	{ "l2_io_error", KSTAT_DATA_UINT64 },
	{ "l2_size", KSTAT_DATA_UINT64 },
	{ "l2_asize", KSTAT_DATA_UINT64 },
	{ "l2_hdr_size", KSTAT_DATA_UINT64 },
	{ "l2_log_blk_writes", KSTAT_DATA_UINT64 },
	{ "l2_log_blk_avg_asize", KSTAT_DATA_UINT64 },
	{ "l2_log_blk_asize", KSTAT_DATA_UINT64 },
	{ "l2_log_blk_count", KSTAT_DATA_UINT64 },
	{ "l2_data_to_meta_ratio", KSTAT_DATA_UINT64 },
	{ "l2_rebuild_success", KSTAT_DATA_UINT64 },
	{ "l2_rebuild_unsupported", KSTAT_DATA_UINT64 },
	{ "l2_rebuild_io_errors", KSTAT_DATA_UINT64 },
	{ "l2_rebuild_dh_errors", KSTAT_DATA_UINT64 },
	{ "l2_rebuild_cksum_lb_errors", KSTAT_DATA_UINT64 },
	{ "l2_rebuild_lowmem", KSTAT_DATA_UINT64 },
	{ "l2_rebuild_size", KSTAT_DATA_UINT64 },
	{ "l2_rebuild_asize", KSTAT_DATA_UINT64 },
	{ "l2_rebuild_bufs", KSTAT_DATA_UINT64 },
	{ "l2_rebuild_bufs_precached", KSTAT_DATA_UINT64 },
	{ "l2_rebuild_log_blks", KSTAT_DATA_UINT64 },
	{ "memory_throttle_count", KSTAT_DATA_UINT64 },
	{ "arc_meta_used", KSTAT_DATA_UINT64 },
	{ "arc_meta_limit", KSTAT_DATA_UINT64 },
	{ "arc_meta_max", KSTAT_DATA_UINT64 },
	{ "arc_meta_min", KSTAT_DATA_UINT64 },
	{ "async_upgrade_sync", KSTAT_DATA_UINT64 },
	{ "demand_hit_predictive_prefetch", KSTAT_DATA_UINT64 },
	{ "demand_hit_prescient_prefetch", KSTAT_DATA_UINT64 },
};

#define	ARCSTAT_MAX(stat, val) { \
	uint64_t m; \
	while ((val) > (m = arc_stats.stat.value.ui64) && \
	    (m != atomic_cas_64(&arc_stats.stat.value.ui64, m, (val)))) \
		continue; \
}

#define	ARCSTAT_MAXSTAT(stat) \
	ARCSTAT_MAX(stat##_max, arc_stats.stat.value.ui64)

/*
 * We define a macro to allow ARC hits/misses to be easily broken down by
 * two separate conditions, giving a total of four different subtypes for
 * each of hits and misses (so
 * eight statistics total).
 */
#define	ARCSTAT_CONDSTAT(cond1, stat1, notstat1, cond2, stat2, notstat2, stat) \
	if (cond1) { \
		if (cond2) { \
			ARCSTAT_BUMP(arcstat_##stat1##_##stat2##_##stat); \
		} else { \
			ARCSTAT_BUMP(arcstat_##stat1##_##notstat2##_##stat); \
		} \
	} else { \
		if (cond2) { \
			ARCSTAT_BUMP(arcstat_##notstat1##_##stat2##_##stat); \
		} else { \
			ARCSTAT_BUMP(arcstat_##notstat1##_##notstat2##_##stat);\
		} \
	}

/*
 * This macro allows us to use kstats as floating averages. Each time we
 * update this kstat, we first factor it and the update value by
 * ARCSTAT_F_AVG_FACTOR to shrink the new value's contribution to the overall
 * average. This macro assumes that integer loads and stores are atomic, but
 * is not safe for multiple writers updating the kstat in parallel (only the
 * last writer's update will remain).
 */
#define	ARCSTAT_F_AVG_FACTOR	3
#define	ARCSTAT_F_AVG(stat, value) \
	do { \
		uint64_t x = ARCSTAT(stat); \
		x = x - x / ARCSTAT_F_AVG_FACTOR + \
		    (value) / ARCSTAT_F_AVG_FACTOR; \
		ARCSTAT(stat) = x; \
		_NOTE(CONSTCOND) \
	} while (0)

kstat_t			*arc_ksp;
static arc_state_t	*arc_anon;
static arc_state_t	*arc_mru;
static arc_state_t	*arc_mru_ghost;
static arc_state_t	*arc_mfu;
static arc_state_t	*arc_mfu_ghost;
static arc_state_t	*arc_l2c_only;

/*
 * There are also some ARC variables that we want to export, but that are
 * updated so often that having the canonical representation be the statistic
 * variable causes a performance bottleneck. We want to use aggsum_t's for these
 * instead, but still be able to export the kstat in the same way as before.
 * The solution is to always use the aggsum version, except in the kstat update
 * callback.
 */
aggsum_t arc_size;
aggsum_t arc_meta_used;
aggsum_t astat_data_size;
aggsum_t astat_metadata_size;
aggsum_t astat_hdr_size;
aggsum_t astat_other_size;
aggsum_t astat_l2_hdr_size;

static int		arc_no_grow;	/* Don't try to grow cache size */
static hrtime_t		arc_growtime;
static uint64_t		arc_tempreserve;
static uint64_t		arc_loaned_bytes;

#define	GHOST_STATE(state) \
	((state) == arc_mru_ghost || (state) == arc_mfu_ghost || \
	(state) == arc_l2c_only)

#define	HDR_IN_HASH_TABLE(hdr)	((hdr)->b_flags & ARC_FLAG_IN_HASH_TABLE)
#define	HDR_IO_IN_PROGRESS(hdr)	((hdr)->b_flags & ARC_FLAG_IO_IN_PROGRESS)
#define	HDR_IO_ERROR(hdr)	((hdr)->b_flags & ARC_FLAG_IO_ERROR)
#define	HDR_PREFETCH(hdr)	((hdr)->b_flags & ARC_FLAG_PREFETCH)
#define	HDR_PRESCIENT_PREFETCH(hdr) \
	((hdr)->b_flags & ARC_FLAG_PRESCIENT_PREFETCH)
#define	HDR_COMPRESSION_ENABLED(hdr) \
	((hdr)->b_flags & ARC_FLAG_COMPRESSED_ARC)

#define	HDR_L2CACHE(hdr)	((hdr)->b_flags & ARC_FLAG_L2CACHE)
#define	HDR_L2_READING(hdr) \
	(((hdr)->b_flags & ARC_FLAG_IO_IN_PROGRESS) && \
	((hdr)->b_flags & ARC_FLAG_HAS_L2HDR))
#define	HDR_L2_WRITING(hdr)	((hdr)->b_flags & ARC_FLAG_L2_WRITING)
#define	HDR_L2_EVICTED(hdr)	((hdr)->b_flags & ARC_FLAG_L2_EVICTED)
#define	HDR_L2_WRITE_HEAD(hdr)	((hdr)->b_flags & ARC_FLAG_L2_WRITE_HEAD)
#define	HDR_PROTECTED(hdr)	((hdr)->b_flags & ARC_FLAG_PROTECTED)
#define	HDR_NOAUTH(hdr)		((hdr)->b_flags & ARC_FLAG_NOAUTH)
#define	HDR_SHARED_DATA(hdr)	((hdr)->b_flags & ARC_FLAG_SHARED_DATA)

#define	HDR_ISTYPE_METADATA(hdr) \
	((hdr)->b_flags & ARC_FLAG_BUFC_METADATA)
#define	HDR_ISTYPE_DATA(hdr)	(!HDR_ISTYPE_METADATA(hdr))

#define	HDR_HAS_L1HDR(hdr)	((hdr)->b_flags & ARC_FLAG_HAS_L1HDR)
#define	HDR_HAS_L2HDR(hdr)	((hdr)->b_flags & ARC_FLAG_HAS_L2HDR)
#define	HDR_HAS_RABD(hdr) \
	(HDR_HAS_L1HDR(hdr) && HDR_PROTECTED(hdr) && \
	(hdr)->b_crypt_hdr.b_rabd != NULL)
#define	HDR_ENCRYPTED(hdr) \
	(HDR_PROTECTED(hdr) && DMU_OT_IS_ENCRYPTED((hdr)->b_crypt_hdr.b_ot))
#define	HDR_AUTHENTICATED(hdr) \
	(HDR_PROTECTED(hdr) && !DMU_OT_IS_ENCRYPTED((hdr)->b_crypt_hdr.b_ot))

/* For storing compression mode in b_flags */
#define	HDR_COMPRESS_OFFSET	(highbit64(ARC_FLAG_COMPRESS_0) - 1)

#define	HDR_GET_COMPRESS(hdr)	((enum zio_compress)BF32_GET((hdr)->b_flags, \
	HDR_COMPRESS_OFFSET, SPA_COMPRESSBITS))
#define	HDR_SET_COMPRESS(hdr, cmp) BF32_SET((hdr)->b_flags, \
	HDR_COMPRESS_OFFSET, SPA_COMPRESSBITS, (cmp));

#define	ARC_BUF_LAST(buf)	((buf)->b_next == NULL)
#define	ARC_BUF_SHARED(buf)	((buf)->b_flags & ARC_BUF_FLAG_SHARED)
#define	ARC_BUF_COMPRESSED(buf)	((buf)->b_flags & ARC_BUF_FLAG_COMPRESSED)
#define	ARC_BUF_ENCRYPTED(buf)	((buf)->b_flags & ARC_BUF_FLAG_ENCRYPTED)

/*
 * Other sizes
 */

#define	HDR_FULL_CRYPT_SIZE ((int64_t)sizeof (arc_buf_hdr_t))
#define	HDR_FULL_SIZE ((int64_t)offsetof(arc_buf_hdr_t, b_crypt_hdr))
#define	HDR_L2ONLY_SIZE ((int64_t)offsetof(arc_buf_hdr_t, b_l1hdr))

/*
 * Hash table routines
 */

#define	HT_LOCK_PAD	64

struct ht_lock {
	kmutex_t	ht_lock;
#ifdef _KERNEL
	unsigned char	pad[(HT_LOCK_PAD - sizeof (kmutex_t))];
#endif
};

#define	BUF_LOCKS 256
typedef struct buf_hash_table {
	uint64_t ht_mask;
	arc_buf_hdr_t **ht_table;
	struct ht_lock ht_locks[BUF_LOCKS];
} buf_hash_table_t;

static buf_hash_table_t buf_hash_table;

#define	BUF_HASH_INDEX(spa, dva, birth) \
	(buf_hash(spa, dva, birth) & buf_hash_table.ht_mask)
#define	BUF_HASH_LOCK_NTRY(idx) (buf_hash_table.ht_locks[idx & (BUF_LOCKS-1)])
#define	BUF_HASH_LOCK(idx)	(&(BUF_HASH_LOCK_NTRY(idx).ht_lock))
#define	HDR_LOCK(hdr) \
	(BUF_HASH_LOCK(BUF_HASH_INDEX(hdr->b_spa, &hdr->b_dva, hdr->b_birth)))

uint64_t
zfs_crc64_table[256];

/*
 * Level 2 ARC
 */

#define	L2ARC_WRITE_SIZE	(8 * 1024 * 1024)	/* initial write max */
#define	L2ARC_HEADROOM		2			/* num of writes */
/*
 * If we discover during ARC scan any buffers to be compressed, we boost
 * our headroom for the next scanning cycle by this percentage multiple.
 */
#define	L2ARC_HEADROOM_BOOST	200
#define	L2ARC_FEED_SECS		1		/* caching interval secs */
#define	L2ARC_FEED_MIN_MS	200		/* min caching interval ms */

#define	l2arc_writes_sent	ARCSTAT(arcstat_l2_writes_sent)
#define	l2arc_writes_done	ARCSTAT(arcstat_l2_writes_done)

/* L2ARC Performance Tunables */
uint64_t l2arc_write_max = L2ARC_WRITE_SIZE;	/* default max write size */
uint64_t l2arc_write_boost = L2ARC_WRITE_SIZE;	/* extra write during warmup */
uint64_t l2arc_headroom = L2ARC_HEADROOM;	/* number of dev writes */
uint64_t l2arc_headroom_boost = L2ARC_HEADROOM_BOOST;
uint64_t l2arc_feed_secs = L2ARC_FEED_SECS;	/* interval seconds */
uint64_t l2arc_feed_min_ms = L2ARC_FEED_MIN_MS;	/* min interval milliseconds */
boolean_t l2arc_noprefetch = B_TRUE;		/* don't cache prefetch bufs */
boolean_t l2arc_feed_again = B_TRUE;		/* turbo warmup */
boolean_t l2arc_norw = B_TRUE;			/* no reads during writes */

/*
 * L2ARC Internals
 */
static list_t L2ARC_dev_list;			/* device list */
static list_t *l2arc_dev_list;			/* device list pointer */
static kmutex_t l2arc_dev_mtx;			/* device list mutex */
static l2arc_dev_t *l2arc_dev_last;		/* last device used */
static list_t L2ARC_free_on_write;		/* free after write buf list */
static list_t *l2arc_free_on_write;		/* free after write list ptr */
static kmutex_t l2arc_free_on_write_mtx;	/* mutex for list */
static uint64_t l2arc_ndev;			/* number of devices */

typedef struct l2arc_read_callback {
	arc_buf_hdr_t		*l2rcb_hdr;		/* read header */
	blkptr_t		l2rcb_bp;		/* original blkptr */
	zbookmark_phys_t	l2rcb_zb;		/* original bookmark */
	int			l2rcb_flags;		/* original flags */
	abd_t			*l2rcb_abd;		/* temporary buffer */
} l2arc_read_callback_t;

typedef struct l2arc_data_free {
	/* protected by l2arc_free_on_write_mtx */
	abd_t		*l2df_abd;
	size_t		l2df_size;
	arc_buf_contents_t l2df_type;
	list_node_t	l2df_list_node;
} l2arc_data_free_t;

static kmutex_t l2arc_feed_thr_lock;
static kcondvar_t l2arc_feed_thr_cv;
static uint8_t l2arc_thread_exit;

static kmutex_t l2arc_rebuild_thr_lock;
static kcondvar_t l2arc_rebuild_thr_cv;

static abd_t *arc_get_data_abd(arc_buf_hdr_t *, uint64_t, void *);
typedef enum arc_fill_flags {
	ARC_FILL_LOCKED		= 1 << 0, /* hdr lock is held */
	ARC_FILL_COMPRESSED	= 1 << 1, /* fill with compressed data */
	ARC_FILL_ENCRYPTED	= 1 << 2, /* fill with encrypted data */
	ARC_FILL_NOAUTH		= 1 << 3, /* don't attempt to authenticate */
	ARC_FILL_IN_PLACE	= 1 << 4  /* fill in place (special case) */
} arc_fill_flags_t;

static void *arc_get_data_buf(arc_buf_hdr_t *, uint64_t, void *);
static void arc_get_data_impl(arc_buf_hdr_t *, uint64_t, void *);
static void arc_free_data_abd(arc_buf_hdr_t *, abd_t *, uint64_t, void *);
static void arc_free_data_buf(arc_buf_hdr_t *, void *, uint64_t, void *);
static void arc_free_data_impl(arc_buf_hdr_t *hdr, uint64_t size, void *tag);
static void arc_hdr_free_pabd(arc_buf_hdr_t *, boolean_t);
static void arc_hdr_alloc_pabd(arc_buf_hdr_t *, boolean_t);
static void arc_access(arc_buf_hdr_t *, kmutex_t *);
static boolean_t arc_is_overflowing(void);
static void arc_buf_watch(arc_buf_t *);
static l2arc_dev_t *l2arc_vdev_get(vdev_t *vd);

static arc_buf_contents_t arc_buf_type(arc_buf_hdr_t *);
static uint32_t arc_bufc_to_flags(arc_buf_contents_t);
static inline void arc_hdr_set_flags(arc_buf_hdr_t *hdr, arc_flags_t flags);
static inline void
arc_hdr_clear_flags(arc_buf_hdr_t *hdr, arc_flags_t flags);

static boolean_t l2arc_write_eligible(uint64_t, arc_buf_hdr_t *);
static void l2arc_read_done(zio_t *);

/*
 * The arc_all_memory function is a ZoL enhancement that lives in their OSL
 * code. In user-space code, which is used primarily for testing, we return
 * half of all memory.
 */
uint64_t
arc_all_memory(void)
{
#ifdef _KERNEL
	return (ptob(physmem));
#else
	return ((sysconf(_SC_PAGESIZE) * sysconf(_SC_PHYS_PAGES)) / 2);
#endif
}

/*
 * We use Cityhash for this. It's fast, and has good hash properties without
 * requiring any large static buffers.
 */
static uint64_t
buf_hash(uint64_t spa, const dva_t *dva, uint64_t birth)
{
	return (cityhash4(spa, dva->dva_word[0], dva->dva_word[1], birth));
}

#define	HDR_EMPTY(hdr)						\
	((hdr)->b_dva.dva_word[0] == 0 &&			\
	(hdr)->b_dva.dva_word[1] == 0)

#define	HDR_EMPTY_OR_LOCKED(hdr)				\
	(HDR_EMPTY(hdr) || MUTEX_HELD(HDR_LOCK(hdr)))

/*
 * The whole expansion and each argument are parenthesized so the macro
 * behaves as a single expression wherever it is used.
 */
#define	HDR_EQUAL(spa, dva, birth, hdr)				\
	(((hdr)->b_dva.dva_word[0] == (dva)->dva_word[0]) &&	\
	((hdr)->b_dva.dva_word[1] == (dva)->dva_word[1]) &&	\
	((hdr)->b_birth == (birth)) && ((hdr)->b_spa == (spa)))

static void
buf_discard_identity(arc_buf_hdr_t *hdr)
{
	hdr->b_dva.dva_word[0] = 0;
	hdr->b_dva.dva_word[1] = 0;
	hdr->b_birth = 0;
}

static arc_buf_hdr_t *
buf_hash_find(uint64_t spa, const blkptr_t *bp, kmutex_t **lockp)
{
	const dva_t *dva = BP_IDENTITY(bp);
	uint64_t birth = BP_PHYSICAL_BIRTH(bp);
	uint64_t idx = BUF_HASH_INDEX(spa, dva, birth);
	kmutex_t *hash_lock = BUF_HASH_LOCK(idx);
	arc_buf_hdr_t *hdr;

	mutex_enter(hash_lock);
	for (hdr = buf_hash_table.ht_table[idx]; hdr != NULL;
	    hdr = hdr->b_hash_next) {
		if (HDR_EQUAL(spa, dva, birth, hdr)) {
			*lockp = hash_lock;
			return (hdr);
		}
	}
	mutex_exit(hash_lock);
	*lockp = NULL;
	return (NULL);
}

/*
 * Insert an entry into the hash table.  If there is already an element
 * equal to elem in the hash table, then the already existing element
 * will be returned and the new element will not be inserted.
 * Otherwise returns NULL.
 * If lockp == NULL, the caller is assumed to already hold the hash lock.
 */
static arc_buf_hdr_t *
buf_hash_insert(arc_buf_hdr_t *hdr, kmutex_t **lockp)
{
	uint64_t idx = BUF_HASH_INDEX(hdr->b_spa, &hdr->b_dva, hdr->b_birth);
	kmutex_t *hash_lock = BUF_HASH_LOCK(idx);
	arc_buf_hdr_t *fhdr;
	uint32_t i;

	ASSERT(!DVA_IS_EMPTY(&hdr->b_dva));
	ASSERT(hdr->b_birth != 0);
	ASSERT(!HDR_IN_HASH_TABLE(hdr));

	if (lockp != NULL) {
		*lockp = hash_lock;
		mutex_enter(hash_lock);
	} else {
		ASSERT(MUTEX_HELD(hash_lock));
	}

	for (fhdr = buf_hash_table.ht_table[idx], i = 0; fhdr != NULL;
	    fhdr = fhdr->b_hash_next, i++) {
		if (HDR_EQUAL(hdr->b_spa, &hdr->b_dva, hdr->b_birth, fhdr))
			return (fhdr);
	}

	hdr->b_hash_next = buf_hash_table.ht_table[idx];
	buf_hash_table.ht_table[idx] = hdr;
	arc_hdr_set_flags(hdr, ARC_FLAG_IN_HASH_TABLE);

	/* collect some hash table performance data */
	if (i > 0) {
		ARCSTAT_BUMP(arcstat_hash_collisions);
		if (i == 1)
			ARCSTAT_BUMP(arcstat_hash_chains);

		ARCSTAT_MAX(arcstat_hash_chain_max, i);
	}

	ARCSTAT_BUMP(arcstat_hash_elements);
	ARCSTAT_MAXSTAT(arcstat_hash_elements);

	return (NULL);
}

static void
buf_hash_remove(arc_buf_hdr_t *hdr)
{
	arc_buf_hdr_t *fhdr, **hdrp;
	uint64_t idx = BUF_HASH_INDEX(hdr->b_spa, &hdr->b_dva, hdr->b_birth);

	ASSERT(MUTEX_HELD(BUF_HASH_LOCK(idx)));
	ASSERT(HDR_IN_HASH_TABLE(hdr));

	hdrp = &buf_hash_table.ht_table[idx];
	while ((fhdr = *hdrp) != hdr) {
		ASSERT3P(fhdr, !=, NULL);
		hdrp = &fhdr->b_hash_next;
	}
	*hdrp = hdr->b_hash_next;
	hdr->b_hash_next = NULL;
	arc_hdr_clear_flags(hdr, ARC_FLAG_IN_HASH_TABLE);

	/* collect some hash table performance data */
	ARCSTAT_BUMPDOWN(arcstat_hash_elements);

	if (buf_hash_table.ht_table[idx] &&
	    buf_hash_table.ht_table[idx]->b_hash_next == NULL)
		ARCSTAT_BUMPDOWN(arcstat_hash_chains);
}

/*
 * Global data structures and functions for the buf kmem cache.
 */

static kmem_cache_t *hdr_full_cache;
static kmem_cache_t *hdr_full_crypt_cache;
static kmem_cache_t *hdr_l2only_cache;
static kmem_cache_t *buf_cache;

static void
buf_fini(void)
{
	int i;

	kmem_free(buf_hash_table.ht_table,
	    (buf_hash_table.ht_mask + 1) * sizeof (void *));
	for (i = 0; i < BUF_LOCKS; i++)
		mutex_destroy(&buf_hash_table.ht_locks[i].ht_lock);
	kmem_cache_destroy(hdr_full_cache);
	kmem_cache_destroy(hdr_full_crypt_cache);
	kmem_cache_destroy(hdr_l2only_cache);
	kmem_cache_destroy(buf_cache);
}

/*
 * Constructor callback - called when the cache is empty
 * and a new buf is requested.
 */
/* ARGSUSED */
static int
hdr_full_cons(void *vbuf, void *unused, int kmflag)
{
	arc_buf_hdr_t *hdr = vbuf;

	bzero(hdr, HDR_FULL_SIZE);
	hdr->b_l1hdr.b_byteswap = DMU_BSWAP_NUMFUNCS;
	cv_init(&hdr->b_l1hdr.b_cv, NULL, CV_DEFAULT, NULL);
	zfs_refcount_create(&hdr->b_l1hdr.b_refcnt);
	mutex_init(&hdr->b_l1hdr.b_freeze_lock, NULL, MUTEX_DEFAULT, NULL);
	multilist_link_init(&hdr->b_l1hdr.b_arc_node);
	arc_space_consume(HDR_FULL_SIZE, ARC_SPACE_HDRS);

	return (0);
}

/* ARGSUSED */
static int
hdr_full_crypt_cons(void *vbuf, void *unused, int kmflag)
{
	arc_buf_hdr_t *hdr = vbuf;

	(void) hdr_full_cons(vbuf, unused, kmflag);
	bzero(&hdr->b_crypt_hdr, sizeof (hdr->b_crypt_hdr));
	arc_space_consume(sizeof (hdr->b_crypt_hdr), ARC_SPACE_HDRS);

	return (0);
}

/* ARGSUSED */
static int
hdr_l2only_cons(void *vbuf, void *unused, int kmflag)
{
	arc_buf_hdr_t *hdr = vbuf;

	bzero(hdr, HDR_L2ONLY_SIZE);
	arc_space_consume(HDR_L2ONLY_SIZE, ARC_SPACE_L2HDRS);

	return (0);
}

/* ARGSUSED */
static int
buf_cons(void *vbuf, void *unused, int kmflag)
{
	arc_buf_t *buf = vbuf;

	bzero(buf, sizeof (arc_buf_t));
	mutex_init(&buf->b_evict_lock, NULL, MUTEX_DEFAULT, NULL);
	arc_space_consume(sizeof (arc_buf_t), ARC_SPACE_HDRS);

	return (0);
}

/*
 * Destructor callback - called when a cached buf is
 * no longer required.
 */
/* ARGSUSED */
static void
hdr_full_dest(void *vbuf, void *unused)
{
	arc_buf_hdr_t *hdr = vbuf;

	ASSERT(HDR_EMPTY(hdr));
	cv_destroy(&hdr->b_l1hdr.b_cv);
	zfs_refcount_destroy(&hdr->b_l1hdr.b_refcnt);
	mutex_destroy(&hdr->b_l1hdr.b_freeze_lock);
	ASSERT(!multilist_link_active(&hdr->b_l1hdr.b_arc_node));
	arc_space_return(HDR_FULL_SIZE, ARC_SPACE_HDRS);
}

/* ARGSUSED */
static void
hdr_full_crypt_dest(void *vbuf, void *unused)
{
	arc_buf_hdr_t *hdr = vbuf;

	hdr_full_dest(hdr, unused);
	arc_space_return(sizeof (hdr->b_crypt_hdr), ARC_SPACE_HDRS);
}

/* ARGSUSED */
static void
hdr_l2only_dest(void *vbuf, void *unused)
{
	arc_buf_hdr_t *hdr = vbuf;

	ASSERT(HDR_EMPTY(hdr));
	arc_space_return(HDR_L2ONLY_SIZE, ARC_SPACE_L2HDRS);
}

/* ARGSUSED */
static void
buf_dest(void *vbuf, void *unused)
{
	arc_buf_t *buf = vbuf;

	mutex_destroy(&buf->b_evict_lock);
	arc_space_return(sizeof (arc_buf_t), ARC_SPACE_HDRS);
}

/*
 * Reclaim callback -- invoked when memory is low.
 */
/* ARGSUSED */
static void
hdr_recl(void *unused)
{
	dprintf("hdr_recl called\n");
	/*
	 * umem calls the reclaim func when we destroy the buf cache,
	 * which is after we do arc_fini().
	 */
	if (arc_initialized)
		zthr_wakeup(arc_reap_zthr);
}

static void
buf_init(void)
{
	uint64_t *ct;
	uint64_t hsize = 1ULL << 12;
	int i, j;

	/*
	 * The hash table is big enough to fill all of physical memory
	 * with an average block size of zfs_arc_average_blocksize (default 8K).
	 * By default, the table will take up
	 * totalmem * sizeof(void*) / 8K (1MB per GB with 8-byte pointers).
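	 *
	 * As a worked example (illustrative figures, assuming exactly 16GB
	 * of physical memory, the default 8K average block size and 8-byte
	 * pointers): the sizing loop in this function stops at
	 * 16GB / 8K = 2^21 buckets, so the table occupies
	 * 2^21 * 8 bytes = 16MB.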
	 */
	while (hsize * zfs_arc_average_blocksize < physmem * PAGESIZE)
		hsize <<= 1;
retry:
	buf_hash_table.ht_mask = hsize - 1;
	buf_hash_table.ht_table =
	    kmem_zalloc(hsize * sizeof (void*), KM_NOSLEEP);
	if (buf_hash_table.ht_table == NULL) {
		ASSERT(hsize > (1ULL << 8));
		hsize >>= 1;
		goto retry;
	}

	hdr_full_cache = kmem_cache_create("arc_buf_hdr_t_full", HDR_FULL_SIZE,
	    0, hdr_full_cons, hdr_full_dest, hdr_recl, NULL, NULL, 0);
	hdr_full_crypt_cache = kmem_cache_create("arc_buf_hdr_t_full_crypt",
	    HDR_FULL_CRYPT_SIZE, 0, hdr_full_crypt_cons, hdr_full_crypt_dest,
	    hdr_recl, NULL, NULL, 0);
	hdr_l2only_cache = kmem_cache_create("arc_buf_hdr_t_l2only",
	    HDR_L2ONLY_SIZE, 0, hdr_l2only_cons, hdr_l2only_dest, hdr_recl,
	    NULL, NULL, 0);
	buf_cache = kmem_cache_create("arc_buf_t", sizeof (arc_buf_t),
	    0, buf_cons, buf_dest, NULL, NULL, NULL, 0);

	for (i = 0; i < 256; i++)
		for (ct = zfs_crc64_table + i, *ct = i, j = 8; j > 0; j--)
			*ct = (*ct >> 1) ^ (-(*ct & 1) & ZFS_CRC64_POLY);

	for (i = 0; i < BUF_LOCKS; i++) {
		mutex_init(&buf_hash_table.ht_locks[i].ht_lock,
		    NULL, MUTEX_DEFAULT, NULL);
	}
}

/*
 * This is the size that the buf occupies in memory. If the buf is compressed,
 * it will correspond to the compressed size. You should use this method of
 * getting the buf size unless you explicitly need the logical size.
 */
int32_t
arc_buf_size(arc_buf_t *buf)
{
	return (ARC_BUF_COMPRESSED(buf) ?
	    HDR_GET_PSIZE(buf->b_hdr) : HDR_GET_LSIZE(buf->b_hdr));
}

int32_t
arc_buf_lsize(arc_buf_t *buf)
{
	return (HDR_GET_LSIZE(buf->b_hdr));
}

/*
 * This function will return B_TRUE if the buffer is encrypted in memory.
 * This buffer can be decrypted by calling arc_untransform().
 */
boolean_t
arc_is_encrypted(arc_buf_t *buf)
{
	return (ARC_BUF_ENCRYPTED(buf) != 0);
}

/*
 * Returns B_TRUE if the buffer represents data that has not had its MAC
 * verified yet.
 */
boolean_t
arc_is_unauthenticated(arc_buf_t *buf)
{
	return (HDR_NOAUTH(buf->b_hdr) != 0);
}

void
arc_get_raw_params(arc_buf_t *buf, boolean_t *byteorder, uint8_t *salt,
    uint8_t *iv, uint8_t *mac)
{
	arc_buf_hdr_t *hdr = buf->b_hdr;

	ASSERT(HDR_PROTECTED(hdr));

	bcopy(hdr->b_crypt_hdr.b_salt, salt, ZIO_DATA_SALT_LEN);
	bcopy(hdr->b_crypt_hdr.b_iv, iv, ZIO_DATA_IV_LEN);
	bcopy(hdr->b_crypt_hdr.b_mac, mac, ZIO_DATA_MAC_LEN);
	*byteorder = (hdr->b_l1hdr.b_byteswap == DMU_BSWAP_NUMFUNCS) ?
	    /* CONSTCOND */
	    ZFS_HOST_BYTEORDER : !ZFS_HOST_BYTEORDER;
}

/*
 * Indicates how this buffer is compressed in memory. If it is not compressed
 * the value will be ZIO_COMPRESS_OFF. It can be made normally readable with
 * arc_untransform() as long as it is also unencrypted.
 */
enum zio_compress
arc_get_compression(arc_buf_t *buf)
{
	return (ARC_BUF_COMPRESSED(buf) ?
	    HDR_GET_COMPRESS(buf->b_hdr) : ZIO_COMPRESS_OFF);
}

#define	ARC_MINTIME	(hz>>4) /* 62 ms */

/*
 * Return the compression algorithm used to store this data in the ARC. If ARC
 * compression is enabled or this is an encrypted block, this will be the same
 * as what's used to store it on-disk. Otherwise, this will be ZIO_COMPRESS_OFF.
 */
static inline enum zio_compress
arc_hdr_get_compress(arc_buf_hdr_t *hdr)
{
	return (HDR_COMPRESSION_ENABLED(hdr) ?
	    HDR_GET_COMPRESS(hdr) : ZIO_COMPRESS_OFF);
}

static inline boolean_t
arc_buf_is_shared(arc_buf_t *buf)
{
	boolean_t shared = (buf->b_data != NULL &&
	    buf->b_hdr->b_l1hdr.b_pabd != NULL &&
	    abd_is_linear(buf->b_hdr->b_l1hdr.b_pabd) &&
	    buf->b_data == abd_to_buf(buf->b_hdr->b_l1hdr.b_pabd));
	IMPLY(shared, HDR_SHARED_DATA(buf->b_hdr));
	IMPLY(shared, ARC_BUF_SHARED(buf));
	IMPLY(shared, ARC_BUF_COMPRESSED(buf) || ARC_BUF_LAST(buf));

	/*
	 * It would be nice to assert arc_can_share() too, but the "hdr isn't
	 * already being shared" requirement prevents us from doing that.
	 */

	return (shared);
}

/*
 * Free the checksum associated with this header. If there is no checksum, this
 * is a no-op.
 */
static inline void
arc_cksum_free(arc_buf_hdr_t *hdr)
{
	ASSERT(HDR_HAS_L1HDR(hdr));

	mutex_enter(&hdr->b_l1hdr.b_freeze_lock);
	if (hdr->b_l1hdr.b_freeze_cksum != NULL) {
		kmem_free(hdr->b_l1hdr.b_freeze_cksum, sizeof (zio_cksum_t));
		hdr->b_l1hdr.b_freeze_cksum = NULL;
	}
	mutex_exit(&hdr->b_l1hdr.b_freeze_lock);
}

/*
 * Return true iff at least one of the bufs on hdr is not compressed.
 * Encrypted buffers count as compressed.
 */
static boolean_t
arc_hdr_has_uncompressed_buf(arc_buf_hdr_t *hdr)
{
	ASSERT(hdr->b_l1hdr.b_state == arc_anon || HDR_EMPTY_OR_LOCKED(hdr));

	for (arc_buf_t *b = hdr->b_l1hdr.b_buf; b != NULL; b = b->b_next) {
		if (!ARC_BUF_COMPRESSED(b)) {
			return (B_TRUE);
		}
	}
	return (B_FALSE);
}

/*
 * If we've turned on the ZFS_DEBUG_MODIFY flag, verify that the buf's data
 * matches the checksum that is stored in the hdr. If there is no checksum,
 * or if the buf is compressed, this is a no-op.
 */
static void
arc_cksum_verify(arc_buf_t *buf)
{
	arc_buf_hdr_t *hdr = buf->b_hdr;
	zio_cksum_t zc;

	if (!(zfs_flags & ZFS_DEBUG_MODIFY))
		return;

	if (ARC_BUF_COMPRESSED(buf))
		return;

	ASSERT(HDR_HAS_L1HDR(hdr));

	mutex_enter(&hdr->b_l1hdr.b_freeze_lock);

	if (hdr->b_l1hdr.b_freeze_cksum == NULL || HDR_IO_ERROR(hdr)) {
		mutex_exit(&hdr->b_l1hdr.b_freeze_lock);
		return;
	}

	fletcher_2_native(buf->b_data, arc_buf_size(buf), NULL, &zc);
	if (!ZIO_CHECKSUM_EQUAL(*hdr->b_l1hdr.b_freeze_cksum, zc))
		panic("buffer modified while frozen!");
	mutex_exit(&hdr->b_l1hdr.b_freeze_lock);
}

/*
 * This function makes the assumption that data stored in the L2ARC
 * will be transformed exactly as it is in the main pool. Because of
 * this we can verify the checksum against the reading process's bp.
 */
static boolean_t
arc_cksum_is_equal(arc_buf_hdr_t *hdr, zio_t *zio)
{
	enum zio_compress compress = BP_GET_COMPRESS(zio->io_bp);
	boolean_t valid_cksum;

	ASSERT(!BP_IS_EMBEDDED(zio->io_bp));
	VERIFY3U(BP_GET_PSIZE(zio->io_bp), ==, HDR_GET_PSIZE(hdr));

	/*
	 * We rely on the blkptr's checksum to determine if the block
	 * is valid or not. When compressed arc is enabled, the l2arc
	 * writes the block to the l2arc just as it appears in the pool.
	 * This allows us to use the blkptr's checksum to validate the
	 * data that we just read off of the l2arc without having to store
	 * a separate checksum in the arc_buf_hdr_t. However, if compressed
	 * arc is disabled, then the data written to the l2arc is always
	 * uncompressed and won't match the block as it exists in the main
	 * pool. When this is the case, we must first compress it if it is
	 * compressed on the main pool before we can validate the checksum.
	 */
	if (!HDR_COMPRESSION_ENABLED(hdr) && compress != ZIO_COMPRESS_OFF) {
		ASSERT3U(HDR_GET_COMPRESS(hdr), ==, ZIO_COMPRESS_OFF);
		uint64_t lsize = HDR_GET_LSIZE(hdr);
		uint64_t csize;

		abd_t *cdata = abd_alloc_linear(HDR_GET_PSIZE(hdr), B_TRUE);
		csize = zio_compress_data(compress, zio->io_abd,
		    abd_to_buf(cdata), lsize);

		ASSERT3U(csize, <=, HDR_GET_PSIZE(hdr));
		if (csize < HDR_GET_PSIZE(hdr)) {
			/*
			 * Compressed blocks are always a multiple of the
			 * smallest ashift in the pool. Ideally, we would
			 * like to round up the csize to the next
			 * spa_min_ashift but that value may have changed
			 * since the block was last written. Instead,
			 * we rely on the fact that the hdr's psize
			 * was set to the psize of the block when it was
			 * last written. We set the csize to that value
			 * and zero out any part that should not contain
			 * data.
			 */
			abd_zero_off(cdata, csize, HDR_GET_PSIZE(hdr) - csize);
			csize = HDR_GET_PSIZE(hdr);
		}
		zio_push_transform(zio, cdata, csize, HDR_GET_PSIZE(hdr), NULL);
	}

	/*
	 * Block pointers always store the checksum for the logical data.
	 * If the block pointer has the gang bit set, then the checksum
	 * it represents is for the reconstituted data and not for an
	 * individual gang member. The zio pipeline, however, must be able to
	 * determine the checksum of each of the gang constituents so it
	 * treats the checksum comparison differently than what we need
	 * for l2arc blocks. This prevents us from using the
	 * zio_checksum_error() interface directly. Instead we must call the
	 * zio_checksum_error_impl() so that we can ensure the checksum is
	 * generated using the correct checksum algorithm and accounts for the
	 * logical I/O size and not just a gang fragment.
	 */
	valid_cksum = (zio_checksum_error_impl(zio->io_spa, zio->io_bp,
	    BP_GET_CHECKSUM(zio->io_bp), zio->io_abd, zio->io_size,
	    zio->io_offset, NULL) == 0);
	zio_pop_transforms(zio);
	return (valid_cksum);
}

/*
 * Given a buf full of data, if ZFS_DEBUG_MODIFY is enabled this computes a
 * checksum and attaches it to the buf's hdr so that we can ensure that the buf
 * isn't modified later on. If buf is compressed or there is already a checksum
 * on the hdr, this is a no-op (we only checksum uncompressed bufs).
 */
static void
arc_cksum_compute(arc_buf_t *buf)
{
	arc_buf_hdr_t *hdr = buf->b_hdr;

	if (!(zfs_flags & ZFS_DEBUG_MODIFY))
		return;

	ASSERT(HDR_HAS_L1HDR(hdr));

	mutex_enter(&buf->b_hdr->b_l1hdr.b_freeze_lock);
	if (hdr->b_l1hdr.b_freeze_cksum != NULL || ARC_BUF_COMPRESSED(buf)) {
		mutex_exit(&hdr->b_l1hdr.b_freeze_lock);
		return;
	}

	ASSERT(!ARC_BUF_ENCRYPTED(buf));
	ASSERT(!ARC_BUF_COMPRESSED(buf));
	hdr->b_l1hdr.b_freeze_cksum = kmem_alloc(sizeof (zio_cksum_t),
	    KM_SLEEP);
	fletcher_2_native(buf->b_data, arc_buf_size(buf), NULL,
	    hdr->b_l1hdr.b_freeze_cksum);
	mutex_exit(&hdr->b_l1hdr.b_freeze_lock);
	arc_buf_watch(buf);
}

#ifndef _KERNEL
typedef struct procctl {
	long cmd;
	prwatch_t prwatch;
} procctl_t;
#endif

/* ARGSUSED */
static void
arc_buf_unwatch(arc_buf_t *buf)
{
#ifndef _KERNEL
	if (arc_watch) {
		int result;
		procctl_t ctl;
		ctl.cmd = PCWATCH;
		ctl.prwatch.pr_vaddr = (uintptr_t)buf->b_data;
		ctl.prwatch.pr_size = 0;
		ctl.prwatch.pr_wflags = 0;
		result = write(arc_procfd, &ctl, sizeof (ctl));
		ASSERT3U(result, ==, sizeof (ctl));
	}
#endif
}

/* ARGSUSED */
static void
arc_buf_watch(arc_buf_t *buf)
{
#ifndef _KERNEL
	if (arc_watch) {
		int result;
		procctl_t ctl;
		ctl.cmd = PCWATCH;
		ctl.prwatch.pr_vaddr = (uintptr_t)buf->b_data;
		ctl.prwatch.pr_size = arc_buf_size(buf);
		ctl.prwatch.pr_wflags = WA_WRITE;
		result = write(arc_procfd, &ctl, sizeof (ctl));
		ASSERT3U(result, ==, sizeof (ctl));
	}
#endif
}

static arc_buf_contents_t
arc_buf_type(arc_buf_hdr_t *hdr)
{
	arc_buf_contents_t type;
	if (HDR_ISTYPE_METADATA(hdr)) {
		type = ARC_BUFC_METADATA;
	} else {
		type = ARC_BUFC_DATA;
	}
	VERIFY3U(hdr->b_type, ==, type);
	return (type);
}

boolean_t
arc_is_metadata(arc_buf_t *buf)
{
	return (HDR_ISTYPE_METADATA(buf->b_hdr) != 0);
}

static uint32_t
arc_bufc_to_flags(arc_buf_contents_t type)
{
	switch (type) {
	case ARC_BUFC_DATA:
		/* metadata field is 0 if buffer contains normal data */
		return (0);
	case ARC_BUFC_METADATA:
		return (ARC_FLAG_BUFC_METADATA);
	default:
		break;
	}
	panic("undefined ARC buffer type!");
	return ((uint32_t)-1);
}

void
arc_buf_thaw(arc_buf_t *buf)
{
	arc_buf_hdr_t *hdr = buf->b_hdr;

	ASSERT3P(hdr->b_l1hdr.b_state, ==, arc_anon);
	ASSERT(!HDR_IO_IN_PROGRESS(hdr));

	arc_cksum_verify(buf);

	/*
	 * Compressed buffers do not manipulate the b_freeze_cksum.
	 */
	if (ARC_BUF_COMPRESSED(buf))
		return;

	ASSERT(HDR_HAS_L1HDR(hdr));
	arc_cksum_free(hdr);

	mutex_enter(&hdr->b_l1hdr.b_freeze_lock);
#ifdef ZFS_DEBUG
	if (zfs_flags & ZFS_DEBUG_MODIFY) {
		if (hdr->b_l1hdr.b_thawed != NULL)
			kmem_free(hdr->b_l1hdr.b_thawed, 1);
		hdr->b_l1hdr.b_thawed = kmem_alloc(1, KM_SLEEP);
	}
#endif

	mutex_exit(&hdr->b_l1hdr.b_freeze_lock);

	arc_buf_unwatch(buf);
}

void
arc_buf_freeze(arc_buf_t *buf)
{
	if (!(zfs_flags & ZFS_DEBUG_MODIFY))
		return;

	if (ARC_BUF_COMPRESSED(buf))
		return;

	ASSERT(HDR_HAS_L1HDR(buf->b_hdr));
	arc_cksum_compute(buf);
}

/*
 * The arc_buf_hdr_t's b_flags should never be modified directly. Instead,
 * the following functions should be used to ensure that the flags are
 * updated in a thread-safe way. When manipulating the flags either
 * the hash_lock must be held or the hdr must be undiscoverable. This
 * ensures that we're not racing with any other threads when updating
 * the flags.
 */
static inline void
arc_hdr_set_flags(arc_buf_hdr_t *hdr, arc_flags_t flags)
{
	ASSERT(HDR_EMPTY_OR_LOCKED(hdr));
	hdr->b_flags |= flags;
}

static inline void
arc_hdr_clear_flags(arc_buf_hdr_t *hdr, arc_flags_t flags)
{
	ASSERT(HDR_EMPTY_OR_LOCKED(hdr));
	hdr->b_flags &= ~flags;
}

/*
 * Setting the compression bits in the arc_buf_hdr_t's b_flags is
 * done in a special way since we have to clear and set bits
 * at the same time. Consumers that wish to set the compression bits
 * must use this function to ensure that the flags are updated in
 * a thread-safe manner.
 */
static void
arc_hdr_set_compress(arc_buf_hdr_t *hdr, enum zio_compress cmp)
{
	ASSERT(HDR_EMPTY_OR_LOCKED(hdr));

	/*
	 * Holes and embedded blocks will always have a psize = 0 so
	 * we ignore the compression of the blkptr and set the
	 * arc_buf_hdr_t's compression to ZIO_COMPRESS_OFF.
	 * Holes and embedded blocks remain anonymous so we don't
	 * want to uncompress them. Mark them as uncompressed.
	 */
	if (!zfs_compressed_arc_enabled || HDR_GET_PSIZE(hdr) == 0) {
		arc_hdr_clear_flags(hdr, ARC_FLAG_COMPRESSED_ARC);
		ASSERT(!HDR_COMPRESSION_ENABLED(hdr));
	} else {
		arc_hdr_set_flags(hdr, ARC_FLAG_COMPRESSED_ARC);
		ASSERT(HDR_COMPRESSION_ENABLED(hdr));
	}

	HDR_SET_COMPRESS(hdr, cmp);
	ASSERT3U(HDR_GET_COMPRESS(hdr), ==, cmp);
}

/*
 * Looks for another buf on the same hdr which has the data decompressed, copies
 * from it, and returns true. If no such buf exists, returns false.
 */
static boolean_t
arc_buf_try_copy_decompressed_data(arc_buf_t *buf)
{
	arc_buf_hdr_t *hdr = buf->b_hdr;
	boolean_t copied = B_FALSE;

	ASSERT(HDR_HAS_L1HDR(hdr));
	ASSERT3P(buf->b_data, !=, NULL);
	ASSERT(!ARC_BUF_COMPRESSED(buf));

	for (arc_buf_t *from = hdr->b_l1hdr.b_buf; from != NULL;
	    from = from->b_next) {
		/* can't use our own data buffer */
		if (from == buf) {
			continue;
		}

		if (!ARC_BUF_COMPRESSED(from)) {
			bcopy(from->b_data, buf->b_data, arc_buf_size(buf));
			copied = B_TRUE;
			break;
		}
	}

	/*
	 * Note: With encryption support, the following assertion is no longer
	 * necessarily valid. If we receive two back to back raw snapshots
	 * (send -w), the second receive can use a hdr with a cksum already
	 * calculated.
	 * This happens via:
	 *    dmu_recv_stream() -> receive_read_record() -> arc_loan_raw_buf()
	 * The rsend/send_mixed_raw test case exercises this code path.
	 *
	 * There were no decompressed bufs, so there should not be a
	 * checksum on the hdr either.
	 * EQUIV(!copied, hdr->b_l1hdr.b_freeze_cksum == NULL);
	 */

	return (copied);
}

/*
 * Return the size of the block, b_pabd, that is stored in the arc_buf_hdr_t.
 */
static uint64_t
arc_hdr_size(arc_buf_hdr_t *hdr)
{
	uint64_t size;

	if (arc_hdr_get_compress(hdr) != ZIO_COMPRESS_OFF &&
	    HDR_GET_PSIZE(hdr) > 0) {
		size = HDR_GET_PSIZE(hdr);
	} else {
		ASSERT3U(HDR_GET_LSIZE(hdr), !=, 0);
		size = HDR_GET_LSIZE(hdr);
	}
	return (size);
}

static int
arc_hdr_authenticate(arc_buf_hdr_t *hdr, spa_t *spa, uint64_t dsobj)
{
	int ret;
	uint64_t csize;
	uint64_t lsize = HDR_GET_LSIZE(hdr);
	uint64_t psize = HDR_GET_PSIZE(hdr);
	void *tmpbuf = NULL;
	abd_t *abd = hdr->b_l1hdr.b_pabd;

	ASSERT(HDR_EMPTY_OR_LOCKED(hdr));
	ASSERT(HDR_AUTHENTICATED(hdr));
	ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL);

	/*
	 * The MAC is calculated on the compressed data that is stored on disk.
	 * However, if compressed arc is disabled we will only have the
	 * decompressed data available to us now. Compress it into a temporary
	 * abd so we can verify the MAC. The performance overhead of this will
	 * be relatively low, since most objects in an encrypted objset will
	 * be encrypted (instead of authenticated) anyway.
	 */
	if (HDR_GET_COMPRESS(hdr) != ZIO_COMPRESS_OFF &&
	    !HDR_COMPRESSION_ENABLED(hdr)) {
		tmpbuf = zio_buf_alloc(lsize);
		abd = abd_get_from_buf(tmpbuf, lsize);
		abd_take_ownership_of_buf(abd, B_TRUE);

		csize = zio_compress_data(HDR_GET_COMPRESS(hdr),
		    hdr->b_l1hdr.b_pabd, tmpbuf, lsize);
		ASSERT3U(csize, <=, psize);
		abd_zero_off(abd, csize, psize - csize);
	}

	/*
	 * Authentication is best effort. We authenticate whenever the key is
	 * available. If we succeed we clear ARC_FLAG_NOAUTH.
	 */
	if (hdr->b_crypt_hdr.b_ot == DMU_OT_OBJSET) {
		ASSERT3U(HDR_GET_COMPRESS(hdr), ==, ZIO_COMPRESS_OFF);
		ASSERT3U(lsize, ==, psize);
		ret = spa_do_crypt_objset_mac_abd(B_FALSE, spa, dsobj, abd,
		    psize, hdr->b_l1hdr.b_byteswap != DMU_BSWAP_NUMFUNCS);
	} else {
		ret = spa_do_crypt_mac_abd(B_FALSE, spa, dsobj, abd, psize,
		    hdr->b_crypt_hdr.b_mac);
	}

	if (ret == 0)
		arc_hdr_clear_flags(hdr, ARC_FLAG_NOAUTH);
	else if (ret != ENOENT)
		goto error;

	if (tmpbuf != NULL)
		abd_free(abd);

	return (0);

error:
	if (tmpbuf != NULL)
		abd_free(abd);

	return (ret);
}

/*
 * This function will take a header that only has raw encrypted data in
 * b_crypt_hdr.b_rabd and decrypt it into a new buffer which is stored in
 * b_l1hdr.b_pabd. If designated in the header flags, this function will
 * also decompress the data.
 */
static int
arc_hdr_decrypt(arc_buf_hdr_t *hdr, spa_t *spa, const zbookmark_phys_t *zb)
{
	int ret;
	abd_t *cabd = NULL;
	void *tmp = NULL;
	boolean_t no_crypt = B_FALSE;
	boolean_t bswap = (hdr->b_l1hdr.b_byteswap != DMU_BSWAP_NUMFUNCS);

	ASSERT(HDR_EMPTY_OR_LOCKED(hdr));
	ASSERT(HDR_ENCRYPTED(hdr));

	arc_hdr_alloc_pabd(hdr, B_FALSE);

	ret = spa_do_crypt_abd(B_FALSE, spa, zb, hdr->b_crypt_hdr.b_ot,
	    B_FALSE, bswap, hdr->b_crypt_hdr.b_salt, hdr->b_crypt_hdr.b_iv,
	    hdr->b_crypt_hdr.b_mac, HDR_GET_PSIZE(hdr), hdr->b_l1hdr.b_pabd,
	    hdr->b_crypt_hdr.b_rabd, &no_crypt);
	if (ret != 0)
		goto error;

	if (no_crypt) {
		abd_copy(hdr->b_l1hdr.b_pabd, hdr->b_crypt_hdr.b_rabd,
		    HDR_GET_PSIZE(hdr));
	}

	/*
	 * If this header has disabled arc compression but the b_pabd is
	 * compressed after decrypting it, we need to decompress the newly
	 * decrypted data.
	 */
	if (HDR_GET_COMPRESS(hdr) != ZIO_COMPRESS_OFF &&
	    !HDR_COMPRESSION_ENABLED(hdr)) {
		/*
		 * We want to make sure that we are correctly honoring the
		 * zfs_abd_scatter_enabled setting, so we allocate an abd here
		 * and then loan a buffer from it, rather than allocating a
		 * linear buffer and wrapping it in an abd later.
		 */
		cabd = arc_get_data_abd(hdr, arc_hdr_size(hdr), hdr);
		tmp = abd_borrow_buf(cabd, arc_hdr_size(hdr));

		ret = zio_decompress_data(HDR_GET_COMPRESS(hdr),
		    hdr->b_l1hdr.b_pabd, tmp, HDR_GET_PSIZE(hdr),
		    HDR_GET_LSIZE(hdr));
		if (ret != 0) {
			abd_return_buf(cabd, tmp, arc_hdr_size(hdr));
			goto error;
		}

		abd_return_buf_copy(cabd, tmp, arc_hdr_size(hdr));
		arc_free_data_abd(hdr, hdr->b_l1hdr.b_pabd,
		    arc_hdr_size(hdr), hdr);
		hdr->b_l1hdr.b_pabd = cabd;
	}

	return (0);

error:
	arc_hdr_free_pabd(hdr, B_FALSE);
	/* cabd came from arc_get_data_abd(), so free it as an abd */
	if (cabd != NULL)
		arc_free_data_abd(hdr, cabd, arc_hdr_size(hdr), hdr);

	return (ret);
}

/*
 * This function is called during arc_buf_fill() to prepare the header's
 * abd plaintext pointer for use. This involves authenticating protected
 * data and decrypting encrypted data into the plaintext abd.
 */
static int
arc_fill_hdr_crypt(arc_buf_hdr_t *hdr, kmutex_t *hash_lock, spa_t *spa,
    const zbookmark_phys_t *zb, boolean_t noauth)
{
	int ret;

	ASSERT(HDR_PROTECTED(hdr));

	if (hash_lock != NULL)
		mutex_enter(hash_lock);

	if (HDR_NOAUTH(hdr) && !noauth) {
		/*
		 * The caller requested authenticated data but our data has
		 * not been authenticated yet. Verify the MAC now if we can.
		 */
		ret = arc_hdr_authenticate(hdr, spa, zb->zb_objset);
		if (ret != 0)
			goto error;
	} else if (HDR_HAS_RABD(hdr) && hdr->b_l1hdr.b_pabd == NULL) {
		/*
		 * If we only have the encrypted version of the data, but the
		 * unencrypted version was requested we take this opportunity
		 * to store the decrypted version in the header for future use.
		 */
		ret = arc_hdr_decrypt(hdr, spa, zb);
		if (ret != 0)
			goto error;
	}

	ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL);

	if (hash_lock != NULL)
		mutex_exit(hash_lock);

	return (0);

error:
	if (hash_lock != NULL)
		mutex_exit(hash_lock);

	return (ret);
}

/*
 * This function is used by the dbuf code to decrypt bonus buffers in place.
 * The dbuf code itself doesn't have any locking for decrypting a shared dnode
 * block, so we use the hash lock here to protect against concurrent calls to
 * arc_buf_fill().
 */
/* ARGSUSED */
static void
arc_buf_untransform_in_place(arc_buf_t *buf, kmutex_t *hash_lock)
{
	arc_buf_hdr_t *hdr = buf->b_hdr;

	ASSERT(HDR_ENCRYPTED(hdr));
	ASSERT3U(hdr->b_crypt_hdr.b_ot, ==, DMU_OT_DNODE);
	ASSERT(HDR_EMPTY_OR_LOCKED(hdr));
	ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL);

	zio_crypt_copy_dnode_bonus(hdr->b_l1hdr.b_pabd, buf->b_data,
	    arc_buf_size(buf));
	buf->b_flags &= ~ARC_BUF_FLAG_ENCRYPTED;
	buf->b_flags &= ~ARC_BUF_FLAG_COMPRESSED;
	hdr->b_crypt_hdr.b_ebufcnt -= 1;
}

/*
 * Given a buf that has a data buffer attached to it, this function will
 * efficiently fill the buf with data of the specified compression setting from
 * the hdr and update the hdr's b_freeze_cksum if necessary. If the buf and hdr
 * are already sharing a data buf, no copy is performed.
 *
 * If the buf is marked as compressed but uncompressed data was requested, this
 * will allocate a new data buffer for the buf, remove that flag, and fill the
 * buf with uncompressed data. You can't request a compressed buf on a hdr with
 * uncompressed data, and (since we haven't added support for it yet) if you
 * want compressed data your buf must already be marked as compressed and have
 * the correct-sized data buffer.
 */
static int
arc_buf_fill(arc_buf_t *buf, spa_t *spa, const zbookmark_phys_t *zb,
    arc_fill_flags_t flags)
{
	int error = 0;
	arc_buf_hdr_t *hdr = buf->b_hdr;
	boolean_t hdr_compressed =
	    (arc_hdr_get_compress(hdr) != ZIO_COMPRESS_OFF);
	boolean_t compressed = (flags & ARC_FILL_COMPRESSED) != 0;
	boolean_t encrypted = (flags & ARC_FILL_ENCRYPTED) != 0;
	dmu_object_byteswap_t bswap = hdr->b_l1hdr.b_byteswap;
	kmutex_t *hash_lock = (flags & ARC_FILL_LOCKED) ? NULL : HDR_LOCK(hdr);

	ASSERT3P(buf->b_data, !=, NULL);
	IMPLY(compressed, hdr_compressed || ARC_BUF_ENCRYPTED(buf));
	IMPLY(compressed, ARC_BUF_COMPRESSED(buf));
	IMPLY(encrypted, HDR_ENCRYPTED(hdr));
	IMPLY(encrypted, ARC_BUF_ENCRYPTED(buf));
	IMPLY(encrypted, ARC_BUF_COMPRESSED(buf));
	IMPLY(encrypted, !ARC_BUF_SHARED(buf));

	/*
	 * If the caller wanted encrypted data we just need to copy it from
	 * b_rabd and potentially byteswap it. We won't be able to do any
	 * further transforms on it.
	 */
	if (encrypted) {
		ASSERT(HDR_HAS_RABD(hdr));
		abd_copy_to_buf(buf->b_data, hdr->b_crypt_hdr.b_rabd,
		    HDR_GET_PSIZE(hdr));
		goto byteswap;
	}

	/*
	 * Adjust encrypted and authenticated headers to accommodate
	 * the request if needed. Dnode blocks (ARC_FILL_IN_PLACE) are
	 * allowed to fail decryption due to keys not being loaded
	 * without being marked as an IO error.
	 */
	if (HDR_PROTECTED(hdr)) {
		error = arc_fill_hdr_crypt(hdr, hash_lock, spa,
		    zb, !!(flags & ARC_FILL_NOAUTH));
		if (error == EACCES && (flags & ARC_FILL_IN_PLACE) != 0) {
			return (error);
		} else if (error != 0) {
			if (hash_lock != NULL)
				mutex_enter(hash_lock);
			arc_hdr_set_flags(hdr, ARC_FLAG_IO_ERROR);
			if (hash_lock != NULL)
				mutex_exit(hash_lock);
			return (error);
		}
	}

	/*
	 * There is a special case here for dnode blocks which are
	 * decrypting their bonus buffers. These blocks may request to
	 * be decrypted in-place. This is necessary because there may
	 * be many dnodes pointing into this buffer and there is
	 * currently no method to synchronize replacing the backing
	 * b_data buffer and updating all of the pointers. Here we use
	 * the hash lock to ensure there are no races. If the need
	 * arises for other types to be decrypted in-place, they must
	 * add handling here as well.
	 */
	if ((flags & ARC_FILL_IN_PLACE) != 0) {
		ASSERT(!hdr_compressed);
		ASSERT(!compressed);
		ASSERT(!encrypted);

		if (HDR_ENCRYPTED(hdr) && ARC_BUF_ENCRYPTED(buf)) {
			ASSERT3U(hdr->b_crypt_hdr.b_ot, ==, DMU_OT_DNODE);

			if (hash_lock != NULL)
				mutex_enter(hash_lock);
			arc_buf_untransform_in_place(buf, hash_lock);
			if (hash_lock != NULL)
				mutex_exit(hash_lock);

			/* Compute the hdr's checksum if necessary */
			arc_cksum_compute(buf);
		}

		return (0);
	}

	if (hdr_compressed == compressed) {
		if (!arc_buf_is_shared(buf)) {
			abd_copy_to_buf(buf->b_data, hdr->b_l1hdr.b_pabd,
			    arc_buf_size(buf));
		}
	} else {
		ASSERT(hdr_compressed);
		ASSERT(!compressed);
		ASSERT3U(HDR_GET_LSIZE(hdr), !=, HDR_GET_PSIZE(hdr));

		/*
		 * If the buf is sharing its data with the hdr, unlink it and
		 * allocate a new data buffer for the buf.
		 */
		if (arc_buf_is_shared(buf)) {
			ASSERT(ARC_BUF_COMPRESSED(buf));

			/* We need to give the buf its own b_data */
			buf->b_flags &= ~ARC_BUF_FLAG_SHARED;
			buf->b_data =
			    arc_get_data_buf(hdr, HDR_GET_LSIZE(hdr), buf);
			arc_hdr_clear_flags(hdr, ARC_FLAG_SHARED_DATA);

			/* Previously overhead was 0; just add new overhead */
			ARCSTAT_INCR(arcstat_overhead_size, HDR_GET_LSIZE(hdr));
		} else if (ARC_BUF_COMPRESSED(buf)) {
			/* We need to reallocate the buf's b_data */
			arc_free_data_buf(hdr, buf->b_data, HDR_GET_PSIZE(hdr),
			    buf);
			buf->b_data =
			    arc_get_data_buf(hdr, HDR_GET_LSIZE(hdr), buf);

			/* We increased the size of b_data; update overhead */
			ARCSTAT_INCR(arcstat_overhead_size,
			    HDR_GET_LSIZE(hdr) - HDR_GET_PSIZE(hdr));
		}

		/*
		 * Regardless of the buf's previous compression settings, it
		 * should not be compressed at the end of this function.
		 */
		buf->b_flags &= ~ARC_BUF_FLAG_COMPRESSED;

		/*
		 * Try copying the data from another buf which already has a
		 * decompressed version. If that's not possible, it's time to
		 * bite the bullet and decompress the data from the hdr.
		 */
		if (arc_buf_try_copy_decompressed_data(buf)) {
			/* Skip byteswapping and checksumming (already done) */
			ASSERT3P(hdr->b_l1hdr.b_freeze_cksum, !=, NULL);
			return (0);
		} else {
			error = zio_decompress_data(HDR_GET_COMPRESS(hdr),
			    hdr->b_l1hdr.b_pabd, buf->b_data,
			    HDR_GET_PSIZE(hdr), HDR_GET_LSIZE(hdr));

			/*
			 * Absent hardware errors or software bugs, this should
			 * be impossible, but log it anyway so we can debug it.
			 */
			if (error != 0) {
				zfs_dbgmsg(
				    "hdr %p, compress %d, psize %d, lsize %d",
				    hdr, arc_hdr_get_compress(hdr),
				    HDR_GET_PSIZE(hdr), HDR_GET_LSIZE(hdr));
				if (hash_lock != NULL)
					mutex_enter(hash_lock);
				arc_hdr_set_flags(hdr, ARC_FLAG_IO_ERROR);
				if (hash_lock != NULL)
					mutex_exit(hash_lock);
				return (SET_ERROR(EIO));
			}
		}
	}

byteswap:
	/* Byteswap the buf's data if necessary */
	if (bswap != DMU_BSWAP_NUMFUNCS) {
		ASSERT(!HDR_SHARED_DATA(hdr));
		ASSERT3U(bswap, <, DMU_BSWAP_NUMFUNCS);
		dmu_ot_byteswap[bswap].ob_func(buf->b_data, HDR_GET_LSIZE(hdr));
	}

	/* Compute the hdr's checksum if necessary */
	arc_cksum_compute(buf);

	return (0);
}

/*
 * If this function is being called to decrypt an encrypted buffer or verify an
 * authenticated one, the key must be loaded and a mapping must be made
 * available in the keystore via spa_keystore_create_mapping() or one of its
 * callers.
 */
int
arc_untransform(arc_buf_t *buf, spa_t *spa, const zbookmark_phys_t *zb,
    boolean_t in_place)
{
	int ret;
	arc_fill_flags_t flags = 0;

	if (in_place)
		flags |= ARC_FILL_IN_PLACE;

	ret = arc_buf_fill(buf, spa, zb, flags);
	if (ret == ECKSUM) {
		/*
		 * Convert authentication and decryption errors to EIO
		 * (and generate an ereport) before leaving the ARC.
		 */
		ret = SET_ERROR(EIO);
		spa_log_error(spa, zb);
		(void) zfs_ereport_post(FM_EREPORT_ZFS_AUTHENTICATION,
		    spa, NULL, zb, NULL, 0, 0);
	}

	return (ret);
}

/*
 * Increment the amount of evictable space in the arc_state_t's refcount.
 * We account for the space used by the hdr and the arc buf individually
 * so that we can add and remove them from the refcount individually.
 */
static void
arc_evictable_space_increment(arc_buf_hdr_t *hdr, arc_state_t *state)
{
	arc_buf_contents_t type = arc_buf_type(hdr);

	ASSERT(HDR_HAS_L1HDR(hdr));

	if (GHOST_STATE(state)) {
		ASSERT0(hdr->b_l1hdr.b_bufcnt);
		ASSERT3P(hdr->b_l1hdr.b_buf, ==, NULL);
		ASSERT3P(hdr->b_l1hdr.b_pabd, ==, NULL);
		ASSERT(!HDR_HAS_RABD(hdr));
		(void) zfs_refcount_add_many(&state->arcs_esize[type],
		    HDR_GET_LSIZE(hdr), hdr);
		return;
	}

	ASSERT(!GHOST_STATE(state));
	if (hdr->b_l1hdr.b_pabd != NULL) {
		(void) zfs_refcount_add_many(&state->arcs_esize[type],
		    arc_hdr_size(hdr), hdr);
	}
	if (HDR_HAS_RABD(hdr)) {
		(void) zfs_refcount_add_many(&state->arcs_esize[type],
		    HDR_GET_PSIZE(hdr), hdr);
	}
	for (arc_buf_t *buf = hdr->b_l1hdr.b_buf; buf != NULL;
	    buf = buf->b_next) {
		if (arc_buf_is_shared(buf))
			continue;
		(void) zfs_refcount_add_many(&state->arcs_esize[type],
		    arc_buf_size(buf), buf);
	}
}

/*
 * Decrement the amount of evictable space in the arc_state_t's refcount.
 * We account for the space used by the hdr and the arc buf individually
 * so that we can add and remove them from the refcount individually.
 */
static void
arc_evictable_space_decrement(arc_buf_hdr_t *hdr, arc_state_t *state)
{
	arc_buf_contents_t type = arc_buf_type(hdr);

	ASSERT(HDR_HAS_L1HDR(hdr));

	if (GHOST_STATE(state)) {
		ASSERT0(hdr->b_l1hdr.b_bufcnt);
		ASSERT3P(hdr->b_l1hdr.b_buf, ==, NULL);
		ASSERT3P(hdr->b_l1hdr.b_pabd, ==, NULL);
		ASSERT(!HDR_HAS_RABD(hdr));
		(void) zfs_refcount_remove_many(&state->arcs_esize[type],
		    HDR_GET_LSIZE(hdr), hdr);
		return;
	}

	ASSERT(!GHOST_STATE(state));
	if (hdr->b_l1hdr.b_pabd != NULL) {
		(void) zfs_refcount_remove_many(&state->arcs_esize[type],
		    arc_hdr_size(hdr), hdr);
	}
	if (HDR_HAS_RABD(hdr)) {
		(void) zfs_refcount_remove_many(&state->arcs_esize[type],
		    HDR_GET_PSIZE(hdr), hdr);
	}
	for (arc_buf_t *buf = hdr->b_l1hdr.b_buf; buf != NULL;
	    buf = buf->b_next) {
		if (arc_buf_is_shared(buf))
			continue;
		(void) zfs_refcount_remove_many(&state->arcs_esize[type],
		    arc_buf_size(buf), buf);
	}
}

/*
 * Add a reference to this hdr indicating that someone is actively
 * referencing that memory. When the refcount transitions from 0 to 1,
 * we remove it from the respective arc_state_t list to indicate that
 * it is not evictable.
 */
static void
add_reference(arc_buf_hdr_t *hdr, void *tag)
{
	ASSERT(HDR_HAS_L1HDR(hdr));
	if (!HDR_EMPTY(hdr) && !MUTEX_HELD(HDR_LOCK(hdr))) {
		ASSERT(hdr->b_l1hdr.b_state == arc_anon);
		ASSERT(zfs_refcount_is_zero(&hdr->b_l1hdr.b_refcnt));
		ASSERT3P(hdr->b_l1hdr.b_buf, ==, NULL);
	}

	arc_state_t *state = hdr->b_l1hdr.b_state;

	if ((zfs_refcount_add(&hdr->b_l1hdr.b_refcnt, tag) == 1) &&
	    (state != arc_anon)) {
		/* We don't use the L2-only state list.
		 */
		if (state != arc_l2c_only) {
			multilist_remove(state->arcs_list[arc_buf_type(hdr)],
			    hdr);
			arc_evictable_space_decrement(hdr, state);
		}
		/* remove the prefetch flag if we get a reference */
		arc_hdr_clear_flags(hdr, ARC_FLAG_PREFETCH);
	}
}

/*
 * Remove a reference from this hdr. When the reference transitions from
 * 1 to 0 and we're not anonymous, then we add this hdr to the arc_state_t's
 * list making it eligible for eviction.
 */
static int
remove_reference(arc_buf_hdr_t *hdr, kmutex_t *hash_lock, void *tag)
{
	int cnt;
	arc_state_t *state = hdr->b_l1hdr.b_state;

	ASSERT(HDR_HAS_L1HDR(hdr));
	ASSERT(state == arc_anon || MUTEX_HELD(hash_lock));
	ASSERT(!GHOST_STATE(state));

	/*
	 * arc_l2c_only counts as a ghost state so we don't need to explicitly
	 * check to prevent usage of the arc_l2c_only list.
	 */
	if (((cnt = zfs_refcount_remove(&hdr->b_l1hdr.b_refcnt, tag)) == 0) &&
	    (state != arc_anon)) {
		multilist_insert(state->arcs_list[arc_buf_type(hdr)], hdr);
		ASSERT3U(hdr->b_l1hdr.b_bufcnt, >, 0);
		arc_evictable_space_increment(hdr, state);
	}
	return (cnt);
}

/*
 * Move the supplied buffer to the indicated state. The hash lock
 * for the buffer must be held by the caller.
 */
static void
arc_change_state(arc_state_t *new_state, arc_buf_hdr_t *hdr,
    kmutex_t *hash_lock)
{
	arc_state_t *old_state;
	int64_t refcnt;
	uint32_t bufcnt;
	boolean_t update_old, update_new;
	arc_buf_contents_t buftype = arc_buf_type(hdr);

	/*
	 * We almost always have an L1 hdr here, since we call arc_hdr_realloc()
	 * in arc_read() when bringing a buffer out of the L2ARC. However, the
	 * L1 hdr doesn't always exist when we change state to arc_anon before
	 * destroying a header, in which case reallocating to add the L1 hdr is
	 * pointless.
	 */
	if (HDR_HAS_L1HDR(hdr)) {
		old_state = hdr->b_l1hdr.b_state;
		refcnt = zfs_refcount_count(&hdr->b_l1hdr.b_refcnt);
		bufcnt = hdr->b_l1hdr.b_bufcnt;

		update_old = (bufcnt > 0 || hdr->b_l1hdr.b_pabd != NULL ||
		    HDR_HAS_RABD(hdr));
	} else {
		old_state = arc_l2c_only;
		refcnt = 0;
		bufcnt = 0;
		update_old = B_FALSE;
	}
	update_new = update_old;

	ASSERT(MUTEX_HELD(hash_lock));
	ASSERT3P(new_state, !=, old_state);
	ASSERT(!GHOST_STATE(new_state) || bufcnt == 0);
	ASSERT(old_state != arc_anon || bufcnt <= 1);

	/*
	 * If this buffer is evictable, transfer it from the
	 * old state list to the new state list.
	 */
	if (refcnt == 0) {
		if (old_state != arc_anon && old_state != arc_l2c_only) {
			ASSERT(HDR_HAS_L1HDR(hdr));
			multilist_remove(old_state->arcs_list[buftype], hdr);

			if (GHOST_STATE(old_state)) {
				ASSERT0(bufcnt);
				ASSERT3P(hdr->b_l1hdr.b_buf, ==, NULL);
				update_old = B_TRUE;
			}
			arc_evictable_space_decrement(hdr, old_state);
		}
		if (new_state != arc_anon && new_state != arc_l2c_only) {

			/*
			 * An L1 header always exists here, since if we're
			 * moving to some L1-cached state (i.e. not l2c_only or
			 * anonymous), we realloc the header to add an L1hdr
			 * beforehand.
			 */
			ASSERT(HDR_HAS_L1HDR(hdr));
			multilist_insert(new_state->arcs_list[buftype], hdr);

			if (GHOST_STATE(new_state)) {
				ASSERT0(bufcnt);
				ASSERT3P(hdr->b_l1hdr.b_buf, ==, NULL);
				update_new = B_TRUE;
			}
			arc_evictable_space_increment(hdr, new_state);
		}
	}

	ASSERT(!HDR_EMPTY(hdr));
	if (new_state == arc_anon && HDR_IN_HASH_TABLE(hdr))
		buf_hash_remove(hdr);

	/* adjust state sizes (ignore arc_l2c_only) */

	if (update_new && new_state != arc_l2c_only) {
		ASSERT(HDR_HAS_L1HDR(hdr));
		if (GHOST_STATE(new_state)) {
			ASSERT0(bufcnt);

			/*
			 * When moving a header to a ghost state, we first
			 * remove all arc buffers. Thus, we'll have a
			 * bufcnt of zero, and no arc buffer to use for
			 * the reference. As a result, we use the arc
			 * header pointer for the reference.
			 */
			(void) zfs_refcount_add_many(&new_state->arcs_size,
			    HDR_GET_LSIZE(hdr), hdr);
			ASSERT3P(hdr->b_l1hdr.b_pabd, ==, NULL);
			ASSERT(!HDR_HAS_RABD(hdr));
		} else {
			uint32_t buffers = 0;

			/*
			 * Each individual buffer holds a unique reference,
			 * thus we must add each of these references one
			 * at a time.
			 */
			for (arc_buf_t *buf = hdr->b_l1hdr.b_buf; buf != NULL;
			    buf = buf->b_next) {
				ASSERT3U(bufcnt, !=, 0);
				buffers++;

				/*
				 * When the arc_buf_t is sharing the data
				 * block with the hdr, the owner of the
				 * reference belongs to the hdr. Only
				 * add to the refcount if the arc_buf_t is
				 * not shared.
				 */
				if (arc_buf_is_shared(buf))
					continue;

				(void) zfs_refcount_add_many(
				    &new_state->arcs_size,
				    arc_buf_size(buf), buf);
			}
			ASSERT3U(bufcnt, ==, buffers);

			if (hdr->b_l1hdr.b_pabd != NULL) {
				(void) zfs_refcount_add_many(
				    &new_state->arcs_size,
				    arc_hdr_size(hdr), hdr);
			}

			if (HDR_HAS_RABD(hdr)) {
				(void) zfs_refcount_add_many(
				    &new_state->arcs_size,
				    HDR_GET_PSIZE(hdr), hdr);
			}
		}
	}

	if (update_old && old_state != arc_l2c_only) {
		ASSERT(HDR_HAS_L1HDR(hdr));
		if (GHOST_STATE(old_state)) {
			ASSERT0(bufcnt);
			ASSERT3P(hdr->b_l1hdr.b_pabd, ==, NULL);
			ASSERT(!HDR_HAS_RABD(hdr));

			/*
			 * When moving a header off of a ghost state,
			 * the header will not contain any arc buffers.
			 * We use the arc header pointer for the reference
			 * which is exactly what we did when we put the
			 * header on the ghost state.
			 */

			(void) zfs_refcount_remove_many(&old_state->arcs_size,
			    HDR_GET_LSIZE(hdr), hdr);
		} else {
			uint32_t buffers = 0;

			/*
			 * Each individual buffer holds a unique reference,
			 * thus we must remove each of these references one
			 * at a time.
			 */
			for (arc_buf_t *buf = hdr->b_l1hdr.b_buf; buf != NULL;
			    buf = buf->b_next) {
				ASSERT3U(bufcnt, !=, 0);
				buffers++;

				/*
				 * When the arc_buf_t is sharing the data
				 * block with the hdr, the owner of the
				 * reference belongs to the hdr. Only
				 * remove from the refcount if the arc_buf_t
				 * is not shared.
				 */
				if (arc_buf_is_shared(buf))
					continue;

				(void) zfs_refcount_remove_many(
				    &old_state->arcs_size, arc_buf_size(buf),
				    buf);
			}
			ASSERT3U(bufcnt, ==, buffers);
			ASSERT(hdr->b_l1hdr.b_pabd != NULL ||
			    HDR_HAS_RABD(hdr));

			if (hdr->b_l1hdr.b_pabd != NULL) {
				(void) zfs_refcount_remove_many(
				    &old_state->arcs_size, arc_hdr_size(hdr),
				    hdr);
			}

			if (HDR_HAS_RABD(hdr)) {
				(void) zfs_refcount_remove_many(
				    &old_state->arcs_size, HDR_GET_PSIZE(hdr),
				    hdr);
			}
		}
	}

	if (HDR_HAS_L1HDR(hdr))
		hdr->b_l1hdr.b_state = new_state;

	/*
	 * L2 headers should never be on the L2 state list since they don't
	 * have L1 headers allocated.
	 */
	ASSERT(multilist_is_empty(arc_l2c_only->arcs_list[ARC_BUFC_DATA]) &&
	    multilist_is_empty(arc_l2c_only->arcs_list[ARC_BUFC_METADATA]));
}

void
arc_space_consume(uint64_t space, arc_space_type_t type)
{
	ASSERT(type >= 0 && type < ARC_SPACE_NUMTYPES);

	switch (type) {
	case ARC_SPACE_DATA:
		aggsum_add(&astat_data_size, space);
		break;
	case ARC_SPACE_META:
		aggsum_add(&astat_metadata_size, space);
		break;
	case ARC_SPACE_OTHER:
		aggsum_add(&astat_other_size, space);
		break;
	case ARC_SPACE_HDRS:
		aggsum_add(&astat_hdr_size, space);
		break;
	case ARC_SPACE_L2HDRS:
		aggsum_add(&astat_l2_hdr_size, space);
		break;
	}

	if (type != ARC_SPACE_DATA)
		aggsum_add(&arc_meta_used, space);

	aggsum_add(&arc_size, space);
}

void
arc_space_return(uint64_t space, arc_space_type_t type)
{
	ASSERT(type >= 0 && type < ARC_SPACE_NUMTYPES);

	switch (type) {
	case ARC_SPACE_DATA:
		aggsum_add(&astat_data_size, -space);
		break;
	case ARC_SPACE_META:
		aggsum_add(&astat_metadata_size, -space);
		break;
	case ARC_SPACE_OTHER:
		aggsum_add(&astat_other_size, -space);
		break;
	case ARC_SPACE_HDRS:
		aggsum_add(&astat_hdr_size, -space);
		break;
	case ARC_SPACE_L2HDRS:
		aggsum_add(&astat_l2_hdr_size, -space);
		break;
	}

	if (type != ARC_SPACE_DATA) {
		ASSERT(aggsum_compare(&arc_meta_used, space) >= 0);
		/*
		 * We use the upper bound here rather than the precise value
		 * because the arc_meta_max value doesn't need to be
		 * precise. It's only consumed by humans via arcstats.
		 */
		if (arc_meta_max < aggsum_upper_bound(&arc_meta_used))
			arc_meta_max = aggsum_upper_bound(&arc_meta_used);
		aggsum_add(&arc_meta_used, -space);
	}

	ASSERT(aggsum_compare(&arc_size, space) >= 0);
	aggsum_add(&arc_size, -space);
}

/*
 * Given a hdr and a buf, returns whether that buf can share its b_data buffer
 * with the hdr's b_pabd.
 */
static boolean_t
arc_can_share(arc_buf_hdr_t *hdr, arc_buf_t *buf)
{
	/*
	 * The criteria for sharing a hdr's data are:
	 * 1. the buffer is not encrypted
	 * 2. the hdr's compression matches the buf's compression
	 * 3. the hdr doesn't need to be byteswapped
	 * 4. the hdr isn't already being shared
	 * 5. the buf is either compressed or it is the last buf in the hdr list
	 *
	 * Criterion #5 maintains the invariant that shared uncompressed
	 * bufs must be the final buf in the hdr's b_buf list. Reading this, you
	 * might ask, "if a compressed buf is allocated first, won't that be the
	 * last thing in the list?", but in that case it's impossible to create
	 * a shared uncompressed buf anyway (because the hdr must be compressed
	 * to have the compressed buf).
	 * You might also think that #4 is
	 * sufficient to make this guarantee, however it's possible
	 * (specifically in the rare L2ARC write race mentioned in
	 * arc_buf_alloc_impl()) there will be an existing uncompressed buf that
	 * is sharable, but wasn't at the time of its allocation. Rather than
	 * allow a new shared uncompressed buf to be created and then shuffle
	 * the list around to make it the last element, this simply disallows
	 * sharing if the new buf isn't the first to be added.
	 */
	ASSERT3P(buf->b_hdr, ==, hdr);
	boolean_t hdr_compressed = arc_hdr_get_compress(hdr) !=
	    ZIO_COMPRESS_OFF;
	boolean_t buf_compressed = ARC_BUF_COMPRESSED(buf) != 0;
	return (!ARC_BUF_ENCRYPTED(buf) &&
	    buf_compressed == hdr_compressed &&
	    hdr->b_l1hdr.b_byteswap == DMU_BSWAP_NUMFUNCS &&
	    !HDR_SHARED_DATA(hdr) &&
	    (ARC_BUF_LAST(buf) || ARC_BUF_COMPRESSED(buf)));
}

/*
 * Allocate a buf for this hdr. If you care about the data that's in the hdr,
 * or if you want a compressed buffer, pass those flags in. Returns 0 if the
 * copy was made successfully, or an error code otherwise.
 */
static int
arc_buf_alloc_impl(arc_buf_hdr_t *hdr, spa_t *spa, const zbookmark_phys_t *zb,
    void *tag, boolean_t encrypted, boolean_t compressed, boolean_t noauth,
    boolean_t fill, arc_buf_t **ret)
{
	arc_buf_t *buf;
	arc_fill_flags_t flags = ARC_FILL_LOCKED;

	ASSERT(HDR_HAS_L1HDR(hdr));
	ASSERT3U(HDR_GET_LSIZE(hdr), >, 0);
	VERIFY(hdr->b_type == ARC_BUFC_DATA ||
	    hdr->b_type == ARC_BUFC_METADATA);
	ASSERT3P(ret, !=, NULL);
	ASSERT3P(*ret, ==, NULL);
	IMPLY(encrypted, compressed);

	buf = *ret = kmem_cache_alloc(buf_cache, KM_PUSHPAGE);
	buf->b_hdr = hdr;
	buf->b_data = NULL;
	buf->b_next = hdr->b_l1hdr.b_buf;
	buf->b_flags = 0;

	add_reference(hdr, tag);

	/*
	 * We're about to change the hdr's b_flags.
	 * We must either hold the hash_lock or be undiscoverable.
	 */
	ASSERT(HDR_EMPTY_OR_LOCKED(hdr));

	/*
	 * Only honor requests for compressed bufs if the hdr is actually
	 * compressed. This must be overridden if the buffer is encrypted since
	 * encrypted buffers cannot be decompressed.
	 */
	if (encrypted) {
		buf->b_flags |= ARC_BUF_FLAG_COMPRESSED;
		buf->b_flags |= ARC_BUF_FLAG_ENCRYPTED;
		flags |= ARC_FILL_COMPRESSED | ARC_FILL_ENCRYPTED;
	} else if (compressed &&
	    arc_hdr_get_compress(hdr) != ZIO_COMPRESS_OFF) {
		buf->b_flags |= ARC_BUF_FLAG_COMPRESSED;
		flags |= ARC_FILL_COMPRESSED;
	}

	if (noauth) {
		ASSERT0(encrypted);
		flags |= ARC_FILL_NOAUTH;
	}

	/*
	 * If the hdr's data can be shared then we share the data buffer and
	 * set the appropriate bit in the hdr's b_flags to indicate the hdr is
	 * sharing its b_pabd with the arc_buf_t. Otherwise, we allocate a new
	 * buffer to store the buf's data.
	 *
	 * There are two additional restrictions here because we're sharing
	 * hdr -> buf instead of the usual buf -> hdr. First, the hdr can't be
	 * actively involved in an L2ARC write, because if this buf is used by
	 * an arc_write() then the hdr's data buffer will be released when the
	 * write completes, even though the L2ARC write might still be using it.
	 * Second, the hdr's ABD must be linear so that the buf's user doesn't
	 * need to be ABD-aware.
	 */
	boolean_t can_share = arc_can_share(hdr, buf) && !HDR_L2_WRITING(hdr) &&
	    hdr->b_l1hdr.b_pabd != NULL && abd_is_linear(hdr->b_l1hdr.b_pabd);

	/* Set up b_data and sharing */
	if (can_share) {
		buf->b_data = abd_to_buf(hdr->b_l1hdr.b_pabd);
		buf->b_flags |= ARC_BUF_FLAG_SHARED;
		arc_hdr_set_flags(hdr, ARC_FLAG_SHARED_DATA);
	} else {
		buf->b_data =
		    arc_get_data_buf(hdr, arc_buf_size(buf), buf);
		ARCSTAT_INCR(arcstat_overhead_size, arc_buf_size(buf));
	}
	VERIFY3P(buf->b_data, !=, NULL);

	hdr->b_l1hdr.b_buf = buf;
	hdr->b_l1hdr.b_bufcnt += 1;
	if (encrypted)
		hdr->b_crypt_hdr.b_ebufcnt += 1;

	/*
	 * If the user wants the data from the hdr, we need to either copy or
	 * decompress the data.
	 */
	if (fill) {
		ASSERT3P(zb, !=, NULL);
		return (arc_buf_fill(buf, spa, zb, flags));
	}

	return (0);
}

static char *arc_onloan_tag = "onloan";

static inline void
arc_loaned_bytes_update(int64_t delta)
{
	atomic_add_64(&arc_loaned_bytes, delta);

	/* assert that it did not wrap around */
	ASSERT3S(atomic_add_64_nv(&arc_loaned_bytes, 0), >=, 0);
}

/*
 * Loan out an anonymous arc buffer. Loaned buffers are not counted as in
 * flight data by arc_tempreserve_space() until they are "returned". Loaned
 * buffers must be returned to the arc before they can be used by the DMU or
 * freed.
 */
arc_buf_t *
arc_loan_buf(spa_t *spa, boolean_t is_metadata, int size)
{
	arc_buf_t *buf = arc_alloc_buf(spa, arc_onloan_tag,
	    is_metadata ?
	    ARC_BUFC_METADATA : ARC_BUFC_DATA, size);

	arc_loaned_bytes_update(arc_buf_size(buf));

	return (buf);
}

arc_buf_t *
arc_loan_compressed_buf(spa_t *spa, uint64_t psize, uint64_t lsize,
    enum zio_compress compression_type)
{
	arc_buf_t *buf = arc_alloc_compressed_buf(spa, arc_onloan_tag,
	    psize, lsize, compression_type);

	arc_loaned_bytes_update(arc_buf_size(buf));

	return (buf);
}

arc_buf_t *
arc_loan_raw_buf(spa_t *spa, uint64_t dsobj, boolean_t byteorder,
    const uint8_t *salt, const uint8_t *iv, const uint8_t *mac,
    dmu_object_type_t ot, uint64_t psize, uint64_t lsize,
    enum zio_compress compression_type)
{
	arc_buf_t *buf = arc_alloc_raw_buf(spa, arc_onloan_tag, dsobj,
	    byteorder, salt, iv, mac, ot, psize, lsize, compression_type);

	atomic_add_64(&arc_loaned_bytes, psize);
	return (buf);
}

/*
 * Performance tuning of L2ARC persistence:
 *
 * l2arc_rebuild_enabled : A ZFS module parameter that controls whether adding
 *		an L2ARC device (either at pool import or later) will attempt
 *		to rebuild L2ARC buffer contents.
 * l2arc_rebuild_blocks_min_l2size : A ZFS module parameter that controls
 *		whether log blocks are written to the L2ARC device. If the L2ARC
 *		device is less than 1GB, the amount of data l2arc_evict()
 *		evicts is significant compared to the amount of restored L2ARC
 *		data. In this case do not write log blocks in L2ARC in order
 *		not to waste space.
 */
int l2arc_rebuild_enabled = B_TRUE;
unsigned long l2arc_rebuild_blocks_min_l2size = 1024 * 1024 * 1024;

/* L2ARC persistence rebuild control routines. */
void l2arc_rebuild_vdev(vdev_t *vd, boolean_t reopen);
static void l2arc_dev_rebuild_start(l2arc_dev_t *dev);
static int l2arc_rebuild(l2arc_dev_t *dev);

/* L2ARC persistence read I/O routines.
 */
static int l2arc_dev_hdr_read(l2arc_dev_t *dev);
static int l2arc_log_blk_read(l2arc_dev_t *dev,
    const l2arc_log_blkptr_t *this_lp, const l2arc_log_blkptr_t *next_lp,
    l2arc_log_blk_phys_t *this_lb, l2arc_log_blk_phys_t *next_lb,
    zio_t *this_io, zio_t **next_io);
static zio_t *l2arc_log_blk_fetch(vdev_t *vd,
    const l2arc_log_blkptr_t *lp, l2arc_log_blk_phys_t *lb);
static void l2arc_log_blk_fetch_abort(zio_t *zio);

/* L2ARC persistence block restoration routines. */
static void l2arc_log_blk_restore(l2arc_dev_t *dev,
    const l2arc_log_blk_phys_t *lb, uint64_t lb_asize, uint64_t lb_daddr);
static void l2arc_hdr_restore(const l2arc_log_ent_phys_t *le,
    l2arc_dev_t *dev);

/* L2ARC persistence write I/O routines. */
static void l2arc_dev_hdr_update(l2arc_dev_t *dev);
static void l2arc_log_blk_commit(l2arc_dev_t *dev, zio_t *pio,
    l2arc_write_callback_t *cb);

/* L2ARC persistence auxiliary routines. */
boolean_t l2arc_log_blkptr_valid(l2arc_dev_t *dev,
    const l2arc_log_blkptr_t *lbp);
static boolean_t l2arc_log_blk_insert(l2arc_dev_t *dev,
    const arc_buf_hdr_t *ab);
boolean_t l2arc_range_check_overlap(uint64_t bottom,
    uint64_t top, uint64_t check);
static void l2arc_blk_fetch_done(zio_t *zio);
static inline uint64_t
    l2arc_log_blk_overhead(uint64_t write_sz, l2arc_dev_t *dev);

/*
 * Return a loaned arc buffer to the arc.
 */
void
arc_return_buf(arc_buf_t *buf, void *tag)
{
	arc_buf_hdr_t *hdr = buf->b_hdr;

	ASSERT3P(buf->b_data, !=, NULL);
	ASSERT(HDR_HAS_L1HDR(hdr));
	(void) zfs_refcount_add(&hdr->b_l1hdr.b_refcnt, tag);
	(void) zfs_refcount_remove(&hdr->b_l1hdr.b_refcnt, arc_onloan_tag);

	arc_loaned_bytes_update(-arc_buf_size(buf));
}

/* Detach an arc_buf from a dbuf (tag) */
void
arc_loan_inuse_buf(arc_buf_t *buf, void *tag)
{
	arc_buf_hdr_t *hdr = buf->b_hdr;

	ASSERT3P(buf->b_data, !=, NULL);
	ASSERT(HDR_HAS_L1HDR(hdr));
	(void) zfs_refcount_add(&hdr->b_l1hdr.b_refcnt, arc_onloan_tag);
	(void) zfs_refcount_remove(&hdr->b_l1hdr.b_refcnt, tag);

	arc_loaned_bytes_update(arc_buf_size(buf));
}

static void
l2arc_free_abd_on_write(abd_t *abd, size_t size, arc_buf_contents_t type)
{
	l2arc_data_free_t *df = kmem_alloc(sizeof (*df), KM_SLEEP);

	df->l2df_abd = abd;
	df->l2df_size = size;
	df->l2df_type = type;
	mutex_enter(&l2arc_free_on_write_mtx);
	list_insert_head(l2arc_free_on_write, df);
	mutex_exit(&l2arc_free_on_write_mtx);
}

static void
arc_hdr_free_on_write(arc_buf_hdr_t *hdr, boolean_t free_rdata)
{
	arc_state_t *state = hdr->b_l1hdr.b_state;
	arc_buf_contents_t type = arc_buf_type(hdr);
	uint64_t size = (free_rdata) ?
	    HDR_GET_PSIZE(hdr) : arc_hdr_size(hdr);

	/* protected by hash lock, if in the hash table */
	if (multilist_link_active(&hdr->b_l1hdr.b_arc_node)) {
		ASSERT(zfs_refcount_is_zero(&hdr->b_l1hdr.b_refcnt));
		ASSERT(state != arc_anon && state != arc_l2c_only);

		(void) zfs_refcount_remove_many(&state->arcs_esize[type],
		    size, hdr);
	}
	(void) zfs_refcount_remove_many(&state->arcs_size, size, hdr);
	if (type == ARC_BUFC_METADATA) {
		arc_space_return(size, ARC_SPACE_META);
	} else {
		ASSERT(type == ARC_BUFC_DATA);
		arc_space_return(size, ARC_SPACE_DATA);
	}

	if (free_rdata) {
		l2arc_free_abd_on_write(hdr->b_crypt_hdr.b_rabd, size, type);
	} else {
		l2arc_free_abd_on_write(hdr->b_l1hdr.b_pabd, size, type);
	}
}

/*
 * Share the arc_buf_t's data with the hdr. Whenever we are sharing the
 * data buffer, we transfer the refcount ownership to the hdr and update
 * the appropriate kstats.
 */
static void
arc_share_buf(arc_buf_hdr_t *hdr, arc_buf_t *buf)
{
	/* LINTED */
	arc_state_t *state = hdr->b_l1hdr.b_state;

	ASSERT(arc_can_share(hdr, buf));
	ASSERT3P(hdr->b_l1hdr.b_pabd, ==, NULL);
	ASSERT(!ARC_BUF_ENCRYPTED(buf));
	ASSERT(HDR_EMPTY_OR_LOCKED(hdr));

	/*
	 * Start sharing the data buffer. We transfer the
	 * refcount ownership to the hdr since it always owns
	 * the refcount whenever an arc_buf_t is shared.
	 */
	zfs_refcount_transfer_ownership_many(&hdr->b_l1hdr.b_state->arcs_size,
	    arc_hdr_size(hdr), buf, hdr);
	hdr->b_l1hdr.b_pabd = abd_get_from_buf(buf->b_data, arc_buf_size(buf));
	abd_take_ownership_of_buf(hdr->b_l1hdr.b_pabd,
	    HDR_ISTYPE_METADATA(hdr));
	arc_hdr_set_flags(hdr, ARC_FLAG_SHARED_DATA);
	buf->b_flags |= ARC_BUF_FLAG_SHARED;

	/*
	 * Since we've transferred ownership to the hdr we need
	 * to increment its compressed and uncompressed kstats and
	 * decrement the overhead size.
	 */
	ARCSTAT_INCR(arcstat_compressed_size, arc_hdr_size(hdr));
	ARCSTAT_INCR(arcstat_uncompressed_size, HDR_GET_LSIZE(hdr));
	ARCSTAT_INCR(arcstat_overhead_size, -arc_buf_size(buf));
}

static void
arc_unshare_buf(arc_buf_hdr_t *hdr, arc_buf_t *buf)
{
	/* LINTED */
	arc_state_t *state = hdr->b_l1hdr.b_state;

	ASSERT(arc_buf_is_shared(buf));
	ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL);
	ASSERT(HDR_EMPTY_OR_LOCKED(hdr));

	/*
	 * We are no longer sharing this buffer so we need
	 * to transfer its ownership to the rightful owner.
	 */
	zfs_refcount_transfer_ownership_many(&hdr->b_l1hdr.b_state->arcs_size,
	    arc_hdr_size(hdr), hdr, buf);
	arc_hdr_clear_flags(hdr, ARC_FLAG_SHARED_DATA);
	abd_release_ownership_of_buf(hdr->b_l1hdr.b_pabd);
	abd_put(hdr->b_l1hdr.b_pabd);
	hdr->b_l1hdr.b_pabd = NULL;
	buf->b_flags &= ~ARC_BUF_FLAG_SHARED;

	/*
	 * Since the buffer is no longer shared between
	 * the arc buf and the hdr, count it as overhead.
	 */
	ARCSTAT_INCR(arcstat_compressed_size, -arc_hdr_size(hdr));
	ARCSTAT_INCR(arcstat_uncompressed_size, -HDR_GET_LSIZE(hdr));
	ARCSTAT_INCR(arcstat_overhead_size, arc_buf_size(buf));
}

/*
 * Remove an arc_buf_t from the hdr's buf list and return the last
 * arc_buf_t on the list. If no buffers remain on the list then return
 * NULL.
 */
static arc_buf_t *
arc_buf_remove(arc_buf_hdr_t *hdr, arc_buf_t *buf)
{
	arc_buf_t **bufp = &hdr->b_l1hdr.b_buf;
	arc_buf_t *lastbuf = NULL;

	ASSERT(HDR_HAS_L1HDR(hdr));
	ASSERT(HDR_EMPTY_OR_LOCKED(hdr));

	/*
	 * Remove the buf from the hdr list and locate the last
	 * remaining buffer on the list.
	 */
	while (*bufp != NULL) {
		if (*bufp == buf)
			*bufp = buf->b_next;

		/*
		 * If we've removed a buffer in the middle of
		 * the list then update the lastbuf and update
		 * bufp.
		 */
		if (*bufp != NULL) {
			lastbuf = *bufp;
			bufp = &(*bufp)->b_next;
		}
	}
	buf->b_next = NULL;
	ASSERT3P(lastbuf, !=, buf);
	IMPLY(hdr->b_l1hdr.b_bufcnt > 0, lastbuf != NULL);
	IMPLY(hdr->b_l1hdr.b_bufcnt > 0, hdr->b_l1hdr.b_buf != NULL);
	IMPLY(lastbuf != NULL, ARC_BUF_LAST(lastbuf));

	return (lastbuf);
}

/*
 * Free up buf->b_data and pull the arc_buf_t off of the arc_buf_hdr_t's
 * list and free it.
 */
static void
arc_buf_destroy_impl(arc_buf_t *buf)
{
	arc_buf_hdr_t *hdr = buf->b_hdr;

	/*
	 * Free up the data associated with the buf but only if we're not
	 * sharing this with the hdr. If we are sharing it with the hdr, the
	 * hdr is responsible for doing the free.
	 */
	if (buf->b_data != NULL) {
		/*
		 * We're about to change the hdr's b_flags. We must either
		 * hold the hash_lock or be undiscoverable.
		 */
		ASSERT(HDR_EMPTY_OR_LOCKED(hdr));

		arc_cksum_verify(buf);
		arc_buf_unwatch(buf);

		if (arc_buf_is_shared(buf)) {
			arc_hdr_clear_flags(hdr, ARC_FLAG_SHARED_DATA);
		} else {
			uint64_t size = arc_buf_size(buf);
			arc_free_data_buf(hdr, buf->b_data, size, buf);
			ARCSTAT_INCR(arcstat_overhead_size, -size);
		}
		buf->b_data = NULL;

		ASSERT(hdr->b_l1hdr.b_bufcnt > 0);
		hdr->b_l1hdr.b_bufcnt -= 1;

		if (ARC_BUF_ENCRYPTED(buf)) {
			hdr->b_crypt_hdr.b_ebufcnt -= 1;

			/*
			 * If we have no more encrypted buffers and we've
			 * already gotten a copy of the decrypted data we can
			 * free b_rabd to save some space.
			 */
			if (hdr->b_crypt_hdr.b_ebufcnt == 0 &&
			    HDR_HAS_RABD(hdr) && hdr->b_l1hdr.b_pabd != NULL &&
			    !HDR_IO_IN_PROGRESS(hdr)) {
				arc_hdr_free_pabd(hdr, B_TRUE);
			}
		}
	}

	arc_buf_t *lastbuf = arc_buf_remove(hdr, buf);

	if (ARC_BUF_SHARED(buf) && !ARC_BUF_COMPRESSED(buf)) {
		/*
		 * If the current arc_buf_t is sharing its data buffer with the
		 * hdr, then reassign the hdr's b_pabd to share it with the new
		 * buffer at the end of the list. The shared buffer is always
		 * the last one on the hdr's buffer list.
		 *
		 * There is an equivalent case for compressed bufs, but since
		 * they aren't guaranteed to be the last buf in the list and
		 * that is an exceedingly rare case, we just allow that space to
		 * be wasted temporarily. We must also be careful not to share
		 * encrypted buffers, since they cannot be shared.
		 */
		if (lastbuf != NULL && !ARC_BUF_ENCRYPTED(lastbuf)) {
			/* Only one buf can be shared at once */
			VERIFY(!arc_buf_is_shared(lastbuf));
			/* hdr is uncompressed so can't have compressed buf */
			VERIFY(!ARC_BUF_COMPRESSED(lastbuf));

			ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL);
			arc_hdr_free_pabd(hdr, B_FALSE);

			/*
			 * We must set up a new shared block between the
			 * last buffer and the hdr. The data would have
			 * been allocated by the arc buf so we need to transfer
			 * ownership to the hdr since it's now being shared.
			 */
			arc_share_buf(hdr, lastbuf);
		}
	} else if (HDR_SHARED_DATA(hdr)) {
		/*
		 * Uncompressed shared buffers are always at the end
		 * of the list. Compressed buffers don't have the
		 * same requirements. This makes it hard to
		 * simply assert that the lastbuf is shared so
		 * we rely on the hdr's compression flags to determine
		 * if we have a compressed, shared buffer.
		 */
		ASSERT3P(lastbuf, !=, NULL);
		ASSERT(arc_buf_is_shared(lastbuf) ||
		    arc_hdr_get_compress(hdr) != ZIO_COMPRESS_OFF);
	}

	/*
	 * Free the checksum if we're removing the last uncompressed buf from
	 * this hdr.
	 */
	if (!arc_hdr_has_uncompressed_buf(hdr)) {
		arc_cksum_free(hdr);
	}

	/* clean up the buf */
	buf->b_hdr = NULL;
	kmem_cache_free(buf_cache, buf);
}

static void
arc_hdr_alloc_pabd(arc_buf_hdr_t *hdr, boolean_t alloc_rdata)
{
	uint64_t size;

	ASSERT3U(HDR_GET_LSIZE(hdr), >, 0);
	ASSERT(HDR_HAS_L1HDR(hdr));
	ASSERT(!HDR_SHARED_DATA(hdr) || alloc_rdata);
	IMPLY(alloc_rdata, HDR_PROTECTED(hdr));

	if (alloc_rdata) {
		size = HDR_GET_PSIZE(hdr);
		ASSERT3P(hdr->b_crypt_hdr.b_rabd, ==, NULL);
		hdr->b_crypt_hdr.b_rabd = arc_get_data_abd(hdr, size, hdr);
		ASSERT3P(hdr->b_crypt_hdr.b_rabd, !=, NULL);
	} else {
		size = arc_hdr_size(hdr);
		ASSERT3P(hdr->b_l1hdr.b_pabd, ==, NULL);
		hdr->b_l1hdr.b_pabd = arc_get_data_abd(hdr, size, hdr);
		ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL);
	}

	ARCSTAT_INCR(arcstat_compressed_size, size);
	ARCSTAT_INCR(arcstat_uncompressed_size, HDR_GET_LSIZE(hdr));
}

static void
arc_hdr_free_pabd(arc_buf_hdr_t *hdr, boolean_t free_rdata)
{
	uint64_t size = (free_rdata) ? HDR_GET_PSIZE(hdr) : arc_hdr_size(hdr);

	ASSERT(HDR_HAS_L1HDR(hdr));
	ASSERT(hdr->b_l1hdr.b_pabd != NULL || HDR_HAS_RABD(hdr));
	IMPLY(free_rdata, HDR_HAS_RABD(hdr));

	/*
	 * If the hdr is currently being written to the l2arc then
	 * we defer freeing the data by adding it to the l2arc_free_on_write
	 * list. The l2arc will free the data once it's finished
	 * writing it to the l2arc device.
	 */
	if (HDR_L2_WRITING(hdr)) {
		arc_hdr_free_on_write(hdr, free_rdata);
		ARCSTAT_BUMP(arcstat_l2_free_on_write);
	} else if (free_rdata) {
		arc_free_data_abd(hdr, hdr->b_crypt_hdr.b_rabd, size, hdr);
	} else {
		arc_free_data_abd(hdr, hdr->b_l1hdr.b_pabd,
		    size, hdr);
	}

	if (free_rdata) {
		hdr->b_crypt_hdr.b_rabd = NULL;
	} else {
		hdr->b_l1hdr.b_pabd = NULL;
	}

	if (hdr->b_l1hdr.b_pabd == NULL && !HDR_HAS_RABD(hdr))
		hdr->b_l1hdr.b_byteswap = DMU_BSWAP_NUMFUNCS;

	ARCSTAT_INCR(arcstat_compressed_size, -size);
	ARCSTAT_INCR(arcstat_uncompressed_size, -HDR_GET_LSIZE(hdr));
}

static arc_buf_hdr_t *
arc_hdr_alloc(uint64_t spa, int32_t psize, int32_t lsize,
    boolean_t protected, enum zio_compress compression_type,
    arc_buf_contents_t type, boolean_t alloc_rdata)
{
	arc_buf_hdr_t *hdr;

	VERIFY(type == ARC_BUFC_DATA || type == ARC_BUFC_METADATA);
	if (protected) {
		hdr = kmem_cache_alloc(hdr_full_crypt_cache, KM_PUSHPAGE);
	} else {
		hdr = kmem_cache_alloc(hdr_full_cache, KM_PUSHPAGE);
	}
	ASSERT(HDR_EMPTY(hdr));
	ASSERT3P(hdr->b_l1hdr.b_freeze_cksum, ==, NULL);
	ASSERT3P(hdr->b_l1hdr.b_thawed, ==, NULL);
	HDR_SET_PSIZE(hdr, psize);
	HDR_SET_LSIZE(hdr, lsize);
	hdr->b_spa = spa;
	hdr->b_type = type;
	hdr->b_flags = 0;
	arc_hdr_set_flags(hdr, arc_bufc_to_flags(type) | ARC_FLAG_HAS_L1HDR);
	arc_hdr_set_compress(hdr, compression_type);
	if (protected)
		arc_hdr_set_flags(hdr, ARC_FLAG_PROTECTED);

	hdr->b_l1hdr.b_state = arc_anon;
	hdr->b_l1hdr.b_arc_access = 0;
	hdr->b_l1hdr.b_bufcnt = 0;
	hdr->b_l1hdr.b_buf = NULL;

	/*
	 * Allocate the hdr's buffer. This will contain either
	 * the compressed or uncompressed data depending on the block
	 * it references and compressed arc enablement.
	 */
	arc_hdr_alloc_pabd(hdr, alloc_rdata);
	ASSERT(zfs_refcount_is_zero(&hdr->b_l1hdr.b_refcnt));

	return (hdr);
}

/*
 * Transition between the two allocation states for the arc_buf_hdr struct.
 * The arc_buf_hdr struct can be allocated with (hdr_full_cache) or without
 * (hdr_l2only_cache) the fields necessary for the L1 cache - the smaller
 * version is used when a cache buffer is only in the L2ARC in order to reduce
 * memory usage.
 */
static arc_buf_hdr_t *
arc_hdr_realloc(arc_buf_hdr_t *hdr, kmem_cache_t *old, kmem_cache_t *new)
{
	ASSERT(HDR_HAS_L2HDR(hdr));

	arc_buf_hdr_t *nhdr;
	l2arc_dev_t *dev = hdr->b_l2hdr.b_dev;

	ASSERT((old == hdr_full_cache && new == hdr_l2only_cache) ||
	    (old == hdr_l2only_cache && new == hdr_full_cache));

	/*
	 * If the caller wanted a new full header and the header is to be
	 * encrypted we will actually allocate the header from the full crypt
	 * cache instead. The same applies to freeing from the old cache.
	 */
	if (HDR_PROTECTED(hdr) && new == hdr_full_cache)
		new = hdr_full_crypt_cache;
	if (HDR_PROTECTED(hdr) && old == hdr_full_cache)
		old = hdr_full_crypt_cache;

	nhdr = kmem_cache_alloc(new, KM_PUSHPAGE);

	ASSERT(MUTEX_HELD(HDR_LOCK(hdr)));
	buf_hash_remove(hdr);

	bcopy(hdr, nhdr, HDR_L2ONLY_SIZE);

	if (new == hdr_full_cache || new == hdr_full_crypt_cache) {
		arc_hdr_set_flags(nhdr, ARC_FLAG_HAS_L1HDR);
		/*
		 * arc_access and arc_change_state need to be aware that a
		 * header has just come out of L2ARC, so we set its state to
		 * l2c_only even though it's about to change.
		 */
		nhdr->b_l1hdr.b_state = arc_l2c_only;

		/* Verify previous threads set to NULL before freeing */
		ASSERT3P(nhdr->b_l1hdr.b_pabd, ==, NULL);
		ASSERT(!HDR_HAS_RABD(hdr));
	} else {
		ASSERT3P(hdr->b_l1hdr.b_buf, ==, NULL);
		ASSERT0(hdr->b_l1hdr.b_bufcnt);
		ASSERT3P(hdr->b_l1hdr.b_freeze_cksum, ==, NULL);

		/*
		 * If we've reached here, we must have been called from
		 * arc_evict_hdr(); as such, we should have already been
		 * removed from any ghost list we were previously on
		 * (which protects us from racing with arc_evict_state),
		 * thus no locking is needed during this check.
		 */
		ASSERT(!multilist_link_active(&hdr->b_l1hdr.b_arc_node));

		/*
		 * A buffer must not be moved into the arc_l2c_only
		 * state if it's not finished being written out to the
		 * l2arc device. Otherwise, the b_l1hdr.b_pabd field
		 * might be accessed, even though it was removed.
		 */
		VERIFY(!HDR_L2_WRITING(hdr));
		VERIFY3P(hdr->b_l1hdr.b_pabd, ==, NULL);
		ASSERT(!HDR_HAS_RABD(hdr));

#ifdef ZFS_DEBUG
		if (hdr->b_l1hdr.b_thawed != NULL) {
			kmem_free(hdr->b_l1hdr.b_thawed, 1);
			hdr->b_l1hdr.b_thawed = NULL;
		}
#endif

		arc_hdr_clear_flags(nhdr, ARC_FLAG_HAS_L1HDR);
	}
	/*
	 * The header has been reallocated so we need to re-insert it into any
	 * lists it was on.
	 */
	(void) buf_hash_insert(nhdr, NULL);

	ASSERT(list_link_active(&hdr->b_l2hdr.b_l2node));

	mutex_enter(&dev->l2ad_mtx);

	/*
	 * We must place the realloc'ed header back into the list at
	 * the same spot. Otherwise, if it's placed earlier in the list,
	 * l2arc_write_buffers() could find it during the function's
	 * write phase, and try to write it out to the l2arc.
	 */
	list_insert_after(&dev->l2ad_buflist, hdr, nhdr);
	list_remove(&dev->l2ad_buflist, hdr);

	mutex_exit(&dev->l2ad_mtx);

	/*
	 * Since we're using the pointer address as the tag when
	 * incrementing and decrementing the l2ad_alloc refcount, we
	 * must remove the old pointer (that we're about to destroy) and
	 * add the new pointer to the refcount. Otherwise we'd remove
	 * the wrong pointer address when calling arc_hdr_destroy() later.
	 */

	(void) zfs_refcount_remove_many(&dev->l2ad_alloc, arc_hdr_size(hdr),
	    hdr);
	(void) zfs_refcount_add_many(&dev->l2ad_alloc, arc_hdr_size(nhdr),
	    nhdr);

	buf_discard_identity(hdr);
	kmem_cache_free(old, hdr);

	return (nhdr);
}

/*
 * This function allows an L1 header to be reallocated as a crypt
 * header and vice versa. If we are going to a crypt header, the
 * new fields will be zeroed out.
 */
static arc_buf_hdr_t *
arc_hdr_realloc_crypt(arc_buf_hdr_t *hdr, boolean_t need_crypt)
{
	arc_buf_hdr_t *nhdr;
	arc_buf_t *buf;
	kmem_cache_t *ncache, *ocache;

	ASSERT(HDR_HAS_L1HDR(hdr));
	ASSERT3U(!!HDR_PROTECTED(hdr), !=, need_crypt);
	ASSERT3P(hdr->b_l1hdr.b_state, ==, arc_anon);
	ASSERT(!multilist_link_active(&hdr->b_l1hdr.b_arc_node));
	ASSERT(!list_link_active(&hdr->b_l2hdr.b_l2node));
	ASSERT3P(hdr->b_hash_next, ==, NULL);

	if (need_crypt) {
		ncache = hdr_full_crypt_cache;
		ocache = hdr_full_cache;
	} else {
		ncache = hdr_full_cache;
		ocache = hdr_full_crypt_cache;
	}

	nhdr = kmem_cache_alloc(ncache, KM_PUSHPAGE);

	/*
	 * Copy all members that aren't locks or condvars to the new header.
	 * No lists are pointing to us (as we asserted above), so we don't
	 * need to worry about the list nodes.
	 */
	nhdr->b_dva = hdr->b_dva;
	nhdr->b_birth = hdr->b_birth;
	nhdr->b_type = hdr->b_type;
	nhdr->b_flags = hdr->b_flags;
	nhdr->b_psize = hdr->b_psize;
	nhdr->b_lsize = hdr->b_lsize;
	nhdr->b_spa = hdr->b_spa;
	nhdr->b_l2hdr.b_dev = hdr->b_l2hdr.b_dev;
	nhdr->b_l2hdr.b_daddr = hdr->b_l2hdr.b_daddr;
	nhdr->b_l1hdr.b_freeze_cksum = hdr->b_l1hdr.b_freeze_cksum;
	nhdr->b_l1hdr.b_bufcnt = hdr->b_l1hdr.b_bufcnt;
	nhdr->b_l1hdr.b_byteswap = hdr->b_l1hdr.b_byteswap;
	nhdr->b_l1hdr.b_state = hdr->b_l1hdr.b_state;
	nhdr->b_l1hdr.b_arc_access = hdr->b_l1hdr.b_arc_access;
	nhdr->b_l1hdr.b_acb = hdr->b_l1hdr.b_acb;
	nhdr->b_l1hdr.b_pabd = hdr->b_l1hdr.b_pabd;
#ifdef ZFS_DEBUG
	if (hdr->b_l1hdr.b_thawed != NULL) {
		nhdr->b_l1hdr.b_thawed = hdr->b_l1hdr.b_thawed;
		hdr->b_l1hdr.b_thawed = NULL;
	}
#endif

	/*
	 * This refcount_add() exists only to ensure that the individual
	 * arc buffers always point to a header that is referenced, avoiding
	 * a small race condition that could trigger ASSERTs.
	 */
	(void) zfs_refcount_add(&nhdr->b_l1hdr.b_refcnt, FTAG);
	nhdr->b_l1hdr.b_buf = hdr->b_l1hdr.b_buf;
	for (buf = nhdr->b_l1hdr.b_buf; buf != NULL; buf = buf->b_next) {
		mutex_enter(&buf->b_evict_lock);
		buf->b_hdr = nhdr;
		mutex_exit(&buf->b_evict_lock);
	}
	zfs_refcount_transfer(&nhdr->b_l1hdr.b_refcnt, &hdr->b_l1hdr.b_refcnt);
	(void) zfs_refcount_remove(&nhdr->b_l1hdr.b_refcnt, FTAG);
	ASSERT0(zfs_refcount_count(&hdr->b_l1hdr.b_refcnt));

	if (need_crypt) {
		arc_hdr_set_flags(nhdr, ARC_FLAG_PROTECTED);
	} else {
		arc_hdr_clear_flags(nhdr, ARC_FLAG_PROTECTED);
	}

	/* unset all members of the original hdr */
	bzero(&hdr->b_dva, sizeof (dva_t));
	hdr->b_birth = 0;
	hdr->b_type = ARC_BUFC_INVALID;
	hdr->b_flags = 0;
	hdr->b_psize = 0;
	hdr->b_lsize = 0;
	hdr->b_spa = 0;
	hdr->b_l2hdr.b_dev = NULL;
	hdr->b_l2hdr.b_daddr = 0;
	hdr->b_l1hdr.b_freeze_cksum = NULL;
	hdr->b_l1hdr.b_buf = NULL;
	hdr->b_l1hdr.b_bufcnt = 0;
	hdr->b_l1hdr.b_byteswap = 0;
	hdr->b_l1hdr.b_state = NULL;
	hdr->b_l1hdr.b_arc_access = 0;
	hdr->b_l1hdr.b_acb = NULL;
	hdr->b_l1hdr.b_pabd = NULL;

	if (ocache == hdr_full_crypt_cache) {
		ASSERT(!HDR_HAS_RABD(hdr));
		hdr->b_crypt_hdr.b_ot = DMU_OT_NONE;
		hdr->b_crypt_hdr.b_ebufcnt = 0;
		hdr->b_crypt_hdr.b_dsobj = 0;
		bzero(hdr->b_crypt_hdr.b_salt, ZIO_DATA_SALT_LEN);
		bzero(hdr->b_crypt_hdr.b_iv, ZIO_DATA_IV_LEN);
		bzero(hdr->b_crypt_hdr.b_mac, ZIO_DATA_MAC_LEN);
	}

	buf_discard_identity(hdr);
	kmem_cache_free(ocache, hdr);

	return (nhdr);
}

/*
 * This function is used by the send / receive code to convert a newly
 * allocated arc_buf_t to one that is suitable for a raw encrypted write. It
 * is also used to allow the root objset block to be updated without altering
 * its embedded MACs.
 * Both block types will always be uncompressed so we do not
 * have to worry about compression type or psize.
 */
void
arc_convert_to_raw(arc_buf_t *buf, uint64_t dsobj, boolean_t byteorder,
    dmu_object_type_t ot, const uint8_t *salt, const uint8_t *iv,
    const uint8_t *mac)
{
	arc_buf_hdr_t *hdr = buf->b_hdr;

	ASSERT(ot == DMU_OT_DNODE || ot == DMU_OT_OBJSET);
	ASSERT(HDR_HAS_L1HDR(hdr));
	ASSERT3P(hdr->b_l1hdr.b_state, ==, arc_anon);

	buf->b_flags |= (ARC_BUF_FLAG_COMPRESSED | ARC_BUF_FLAG_ENCRYPTED);
	if (!HDR_PROTECTED(hdr))
		hdr = arc_hdr_realloc_crypt(hdr, B_TRUE);
	hdr->b_crypt_hdr.b_dsobj = dsobj;
	hdr->b_crypt_hdr.b_ot = ot;
	hdr->b_l1hdr.b_byteswap = (byteorder == ZFS_HOST_BYTEORDER) ?
	    DMU_BSWAP_NUMFUNCS : DMU_OT_BYTESWAP(ot);
	if (!arc_hdr_has_uncompressed_buf(hdr))
		arc_cksum_free(hdr);

	if (salt != NULL)
		bcopy(salt, hdr->b_crypt_hdr.b_salt, ZIO_DATA_SALT_LEN);
	if (iv != NULL)
		bcopy(iv, hdr->b_crypt_hdr.b_iv, ZIO_DATA_IV_LEN);
	if (mac != NULL)
		bcopy(mac, hdr->b_crypt_hdr.b_mac, ZIO_DATA_MAC_LEN);
}

/*
 * Allocate a new arc_buf_hdr_t and arc_buf_t and return the buf to the caller.
 * The buf is returned thawed since we expect the consumer to modify it.
 */
arc_buf_t *
arc_alloc_buf(spa_t *spa, void *tag, arc_buf_contents_t type, int32_t size)
{
	arc_buf_hdr_t *hdr = arc_hdr_alloc(spa_load_guid(spa), size, size,
	    B_FALSE, ZIO_COMPRESS_OFF, type, B_FALSE);

	arc_buf_t *buf = NULL;
	VERIFY0(arc_buf_alloc_impl(hdr, spa, NULL, tag, B_FALSE, B_FALSE,
	    B_FALSE, B_FALSE, &buf));
	arc_buf_thaw(buf);

	return (buf);
}

/*
 * Allocates an ARC buf header that's in an evicted & L2-cached state.
 * This is used during l2arc reconstruction to make empty ARC buffers
 * which circumvent the regular disk->arc->l2arc path and instead come
 * into being in the reverse order, i.e. l2arc->arc.
 */
arc_buf_hdr_t *
arc_buf_alloc_l2only(size_t size, arc_buf_contents_t type, l2arc_dev_t *dev,
    dva_t dva, uint64_t daddr, int32_t psize, uint64_t birth,
    enum zio_compress compress, boolean_t protected, boolean_t prefetch)
{
	arc_buf_hdr_t *hdr;

	ASSERT(size != 0);
	hdr = kmem_cache_alloc(hdr_l2only_cache, KM_SLEEP);
	hdr->b_birth = birth;
	hdr->b_type = type;
	hdr->b_flags = 0;
	arc_hdr_set_flags(hdr, arc_bufc_to_flags(type) | ARC_FLAG_HAS_L2HDR);
	HDR_SET_LSIZE(hdr, size);
	HDR_SET_PSIZE(hdr, psize);
	arc_hdr_set_compress(hdr, compress);
	if (protected)
		arc_hdr_set_flags(hdr, ARC_FLAG_PROTECTED);
	if (prefetch)
		arc_hdr_set_flags(hdr, ARC_FLAG_PREFETCH);
	hdr->b_spa = spa_load_guid(dev->l2ad_vdev->vdev_spa);

	hdr->b_dva = dva;

	hdr->b_l2hdr.b_dev = dev;
	hdr->b_l2hdr.b_daddr = daddr;

	return (hdr);
}

/*
 * Allocate a compressed buf in the same manner as arc_alloc_buf. Don't use this
 * for bufs containing metadata.
 */
arc_buf_t *
arc_alloc_compressed_buf(spa_t *spa, void *tag, uint64_t psize, uint64_t lsize,
    enum zio_compress compression_type)
{
	ASSERT3U(lsize, >, 0);
	ASSERT3U(lsize, >=, psize);
	ASSERT3U(compression_type, >, ZIO_COMPRESS_OFF);
	ASSERT3U(compression_type, <, ZIO_COMPRESS_FUNCTIONS);

	arc_buf_hdr_t *hdr = arc_hdr_alloc(spa_load_guid(spa), psize, lsize,
	    B_FALSE, compression_type, ARC_BUFC_DATA, B_FALSE);

	arc_buf_t *buf = NULL;
	VERIFY0(arc_buf_alloc_impl(hdr, spa, NULL, tag, B_FALSE,
	    B_TRUE, B_FALSE, B_FALSE, &buf));
	arc_buf_thaw(buf);
	ASSERT3P(hdr->b_l1hdr.b_freeze_cksum, ==, NULL);

	if (!arc_buf_is_shared(buf)) {
		/*
		 * To ensure that the hdr has the correct data in it if we call
		 * arc_untransform() on this buf before it's been written to
		 * disk, it's easiest if we just set up sharing between the
		 * buf and the hdr.
		 */
		ASSERT(!abd_is_linear(hdr->b_l1hdr.b_pabd));
		arc_hdr_free_pabd(hdr, B_FALSE);
		arc_share_buf(hdr, buf);
	}

	return (buf);
}

arc_buf_t *
arc_alloc_raw_buf(spa_t *spa, void *tag, uint64_t dsobj, boolean_t byteorder,
    const uint8_t *salt, const uint8_t *iv, const uint8_t *mac,
    dmu_object_type_t ot, uint64_t psize, uint64_t lsize,
    enum zio_compress compression_type)
{
	arc_buf_hdr_t *hdr;
	arc_buf_t *buf;
	arc_buf_contents_t type = DMU_OT_IS_METADATA(ot) ?
	    ARC_BUFC_METADATA : ARC_BUFC_DATA;

	ASSERT3U(lsize, >, 0);
	ASSERT3U(lsize, >=, psize);
	ASSERT3U(compression_type, >=, ZIO_COMPRESS_OFF);
	ASSERT3U(compression_type, <, ZIO_COMPRESS_FUNCTIONS);

	hdr = arc_hdr_alloc(spa_load_guid(spa), psize, lsize, B_TRUE,
	    compression_type, type, B_TRUE);

	hdr->b_crypt_hdr.b_dsobj = dsobj;
	hdr->b_crypt_hdr.b_ot = ot;
	hdr->b_l1hdr.b_byteswap = (byteorder == ZFS_HOST_BYTEORDER) ?
	    DMU_BSWAP_NUMFUNCS : DMU_OT_BYTESWAP(ot);
	bcopy(salt, hdr->b_crypt_hdr.b_salt, ZIO_DATA_SALT_LEN);
	bcopy(iv, hdr->b_crypt_hdr.b_iv, ZIO_DATA_IV_LEN);
	bcopy(mac, hdr->b_crypt_hdr.b_mac, ZIO_DATA_MAC_LEN);

	/*
	 * This buffer will be considered encrypted even if the ot is not an
	 * encrypted type. It will become authenticated instead in
	 * arc_write_ready().
	 */
	buf = NULL;
	VERIFY0(arc_buf_alloc_impl(hdr, spa, NULL, tag, B_TRUE, B_TRUE,
	    B_FALSE, B_FALSE, &buf));
	arc_buf_thaw(buf);
	ASSERT3P(hdr->b_l1hdr.b_freeze_cksum, ==, NULL);

	return (buf);
}

static void
arc_hdr_l2hdr_destroy(arc_buf_hdr_t *hdr)
{
	l2arc_buf_hdr_t *l2hdr = &hdr->b_l2hdr;
	l2arc_dev_t *dev = l2hdr->b_dev;
	uint64_t psize = HDR_GET_PSIZE(hdr);
	uint64_t asize = vdev_psize_to_asize(dev->l2ad_vdev, psize);

	ASSERT(MUTEX_HELD(&dev->l2ad_mtx));
	ASSERT(HDR_HAS_L2HDR(hdr));

	list_remove(&dev->l2ad_buflist, hdr);

	ARCSTAT_INCR(arcstat_l2_psize, -psize);
	ARCSTAT_INCR(arcstat_l2_lsize, -HDR_GET_LSIZE(hdr));

	vdev_space_update(dev->l2ad_vdev, -asize, 0, 0);

	(void) zfs_refcount_remove_many(&dev->l2ad_alloc, arc_hdr_size(hdr),
	    hdr);
	arc_hdr_clear_flags(hdr, ARC_FLAG_HAS_L2HDR);
}

static void
arc_hdr_destroy(arc_buf_hdr_t *hdr)
{
	if (HDR_HAS_L1HDR(hdr)) {
		ASSERT(hdr->b_l1hdr.b_buf == NULL ||
		    hdr->b_l1hdr.b_bufcnt > 0);
		ASSERT(zfs_refcount_is_zero(&hdr->b_l1hdr.b_refcnt));
		ASSERT3P(hdr->b_l1hdr.b_state, ==, arc_anon);
	}
	ASSERT(!HDR_IO_IN_PROGRESS(hdr));
	ASSERT(!HDR_IN_HASH_TABLE(hdr));

	if (HDR_HAS_L2HDR(hdr)) {
		l2arc_dev_t *dev = hdr->b_l2hdr.b_dev;
		boolean_t buflist_held = MUTEX_HELD(&dev->l2ad_mtx);

		if (!buflist_held)
			mutex_enter(&dev->l2ad_mtx);

		/*
		 * Even though we checked this conditional above, we need to
		 * check this again now that we have the
		 * l2ad_mtx. This is because we could be racing with
		 * another thread calling l2arc_evict() which might have
		 * destroyed this header's L2 portion as we were waiting
		 * to acquire the l2ad_mtx. If that happens, we don't
		 * want to re-destroy the header's L2 portion.
		 */
		if (HDR_HAS_L2HDR(hdr))
			arc_hdr_l2hdr_destroy(hdr);

		if (!buflist_held)
			mutex_exit(&dev->l2ad_mtx);
	}

	/*
	 * The header's identity can only be safely discarded once it is no
	 * longer discoverable. This requires removing it from the hash table
	 * and the l2arc header list. After this point the hash lock can not
	 * be used to protect the header.
	 */
	if (!HDR_EMPTY(hdr))
		buf_discard_identity(hdr);

	if (HDR_HAS_L1HDR(hdr)) {
		arc_cksum_free(hdr);

		while (hdr->b_l1hdr.b_buf != NULL)
			arc_buf_destroy_impl(hdr->b_l1hdr.b_buf);

#ifdef ZFS_DEBUG
		if (hdr->b_l1hdr.b_thawed != NULL) {
			kmem_free(hdr->b_l1hdr.b_thawed, 1);
			hdr->b_l1hdr.b_thawed = NULL;
		}
#endif

		if (hdr->b_l1hdr.b_pabd != NULL)
			arc_hdr_free_pabd(hdr, B_FALSE);

		if (HDR_HAS_RABD(hdr))
			arc_hdr_free_pabd(hdr, B_TRUE);
	}

	ASSERT3P(hdr->b_hash_next, ==, NULL);
	if (HDR_HAS_L1HDR(hdr)) {
		ASSERT(!multilist_link_active(&hdr->b_l1hdr.b_arc_node));
		ASSERT3P(hdr->b_l1hdr.b_acb, ==, NULL);

		if (!HDR_PROTECTED(hdr)) {
			kmem_cache_free(hdr_full_cache, hdr);
		} else {
			kmem_cache_free(hdr_full_crypt_cache, hdr);
		}
	} else {
		kmem_cache_free(hdr_l2only_cache, hdr);
	}
}

void
arc_buf_destroy(arc_buf_t *buf, void *tag)
{
	arc_buf_hdr_t *hdr = buf->b_hdr;

	if (hdr->b_l1hdr.b_state == arc_anon) {
		ASSERT3U(hdr->b_l1hdr.b_bufcnt, ==, 1);
		ASSERT(!HDR_IO_IN_PROGRESS(hdr));
		VERIFY0(remove_reference(hdr, NULL, tag));
		arc_hdr_destroy(hdr);
		return;
	}

	kmutex_t *hash_lock = HDR_LOCK(hdr);
	mutex_enter(hash_lock);

	ASSERT3P(hdr, ==, buf->b_hdr);
	ASSERT(hdr->b_l1hdr.b_bufcnt > 0);
	ASSERT3P(hash_lock, ==, HDR_LOCK(hdr));
	ASSERT3P(hdr->b_l1hdr.b_state, !=, arc_anon);
	ASSERT3P(buf->b_data, !=, NULL);

	(void) remove_reference(hdr, hash_lock, tag);
	arc_buf_destroy_impl(buf);
	mutex_exit(hash_lock);
}

/*
 * Evict the arc_buf_hdr that is provided as a parameter. The resultant
 * state of the header is dependent on its state prior to entering this
 * function. The following transitions are possible:
 *
 *    - arc_mru -> arc_mru_ghost
 *    - arc_mfu -> arc_mfu_ghost
 *    - arc_mru_ghost -> arc_l2c_only
 *    - arc_mru_ghost -> deleted
 *    - arc_mfu_ghost -> arc_l2c_only
 *    - arc_mfu_ghost -> deleted
 */
static int64_t
arc_evict_hdr(arc_buf_hdr_t *hdr, kmutex_t *hash_lock)
{
	arc_state_t *evicted_state, *state;
	int64_t bytes_evicted = 0;
	int min_lifetime = HDR_PRESCIENT_PREFETCH(hdr) ?
	    zfs_arc_min_prescient_prefetch_ms : zfs_arc_min_prefetch_ms;

	ASSERT(MUTEX_HELD(hash_lock));
	ASSERT(HDR_HAS_L1HDR(hdr));

	state = hdr->b_l1hdr.b_state;
	if (GHOST_STATE(state)) {
		ASSERT(!HDR_IO_IN_PROGRESS(hdr));
		ASSERT3P(hdr->b_l1hdr.b_buf, ==, NULL);

		/*
		 * l2arc_write_buffers() relies on a header's L1 portion
		 * (i.e. its b_pabd field) during its write phase.
		 * Thus, we cannot push a header onto the arc_l2c_only
		 * state (removing its L1 piece) until the header is
		 * done being written to the l2arc.
		 */
		if (HDR_HAS_L2HDR(hdr) && HDR_L2_WRITING(hdr)) {
			ARCSTAT_BUMP(arcstat_evict_l2_skip);
			return (bytes_evicted);
		}

		ARCSTAT_BUMP(arcstat_deleted);
		bytes_evicted += HDR_GET_LSIZE(hdr);

		DTRACE_PROBE1(arc__delete, arc_buf_hdr_t *, hdr);

		if (HDR_HAS_L2HDR(hdr)) {
			ASSERT(hdr->b_l1hdr.b_pabd == NULL);
			ASSERT(!HDR_HAS_RABD(hdr));
			/*
			 * This buffer is cached on the 2nd Level ARC;
			 * don't destroy the header.
			 */
			arc_change_state(arc_l2c_only, hdr, hash_lock);
			/*
			 * dropping from L1+L2 cached to L2-only,
			 * realloc to remove the L1 header.
			 */
			hdr = arc_hdr_realloc(hdr, hdr_full_cache,
			    hdr_l2only_cache);
		} else {
			arc_change_state(arc_anon, hdr, hash_lock);
			arc_hdr_destroy(hdr);
		}
		return (bytes_evicted);
	}

	ASSERT(state == arc_mru || state == arc_mfu);
	evicted_state = (state == arc_mru) ? arc_mru_ghost : arc_mfu_ghost;

	/* prefetch buffers have a minimum lifespan */
	if (HDR_IO_IN_PROGRESS(hdr) ||
	    ((hdr->b_flags & (ARC_FLAG_PREFETCH | ARC_FLAG_INDIRECT)) &&
	    ddi_get_lbolt() - hdr->b_l1hdr.b_arc_access < min_lifetime * hz)) {
		ARCSTAT_BUMP(arcstat_evict_skip);
		return (bytes_evicted);
	}

	ASSERT0(zfs_refcount_count(&hdr->b_l1hdr.b_refcnt));
	while (hdr->b_l1hdr.b_buf) {
		arc_buf_t *buf = hdr->b_l1hdr.b_buf;
		if (!mutex_tryenter(&buf->b_evict_lock)) {
			ARCSTAT_BUMP(arcstat_mutex_miss);
			break;
		}
		if (buf->b_data != NULL)
			bytes_evicted += HDR_GET_LSIZE(hdr);
		mutex_exit(&buf->b_evict_lock);
		arc_buf_destroy_impl(buf);
	}

	if (HDR_HAS_L2HDR(hdr)) {
		ARCSTAT_INCR(arcstat_evict_l2_cached, HDR_GET_LSIZE(hdr));
	} else {
		if (l2arc_write_eligible(hdr->b_spa, hdr)) {
			ARCSTAT_INCR(arcstat_evict_l2_eligible,
			    HDR_GET_LSIZE(hdr));
		} else {
			ARCSTAT_INCR(arcstat_evict_l2_ineligible,
			    HDR_GET_LSIZE(hdr));
		}
	}

	if (hdr->b_l1hdr.b_bufcnt == 0) {
		arc_cksum_free(hdr);

		bytes_evicted += arc_hdr_size(hdr);

		/*
		 * If this hdr is being evicted and has a compressed
		 * buffer then we discard it here before we change states.
		 * This ensures that the accounting is updated correctly
		 * in arc_free_data_impl().
		 */
		if (hdr->b_l1hdr.b_pabd != NULL)
			arc_hdr_free_pabd(hdr, B_FALSE);

		if (HDR_HAS_RABD(hdr))
			arc_hdr_free_pabd(hdr, B_TRUE);

		arc_change_state(evicted_state, hdr, hash_lock);
		ASSERT(HDR_IN_HASH_TABLE(hdr));
		arc_hdr_set_flags(hdr, ARC_FLAG_IN_HASH_TABLE);
		DTRACE_PROBE1(arc__evict, arc_buf_hdr_t *, hdr);
	}

	return (bytes_evicted);
}

static uint64_t
arc_evict_state_impl(multilist_t *ml, int idx, arc_buf_hdr_t *marker,
    uint64_t spa, int64_t bytes)
{
	multilist_sublist_t *mls;
	uint64_t bytes_evicted = 0;
	arc_buf_hdr_t *hdr;
	kmutex_t *hash_lock;
	int evict_count = 0;

	ASSERT3P(marker, !=, NULL);
	IMPLY(bytes < 0, bytes == ARC_EVICT_ALL);

	mls = multilist_sublist_lock(ml, idx);

	for (hdr = multilist_sublist_prev(mls, marker); hdr != NULL;
	    hdr = multilist_sublist_prev(mls, marker)) {
		if ((bytes != ARC_EVICT_ALL && bytes_evicted >= bytes) ||
		    (evict_count >= zfs_arc_evict_batch_limit))
			break;

		/*
		 * To keep our iteration location, move the marker
		 * forward. Since we're not holding hdr's hash lock, we
		 * must be very careful and not remove 'hdr' from the
		 * sublist. Otherwise, other consumers might mistake the
		 * 'hdr' as not being on a sublist when they call the
		 * multilist_link_active() function (they all rely on
		 * the hash lock protecting concurrent insertions and
		 * removals).
		 * multilist_sublist_move_forward() was
		 * specifically implemented to ensure this is the case
		 * (only 'marker' will be removed and re-inserted).
		 */
		multilist_sublist_move_forward(mls, marker);

		/*
		 * The only case where the b_spa field should ever be
		 * zero, is the marker headers inserted by
		 * arc_evict_state(). It's possible for multiple threads
		 * to be calling arc_evict_state() concurrently (e.g.
		 * dsl_pool_close() and zio_inject_fault()), so we must
		 * skip any markers we see from these other threads.
		 */
		if (hdr->b_spa == 0)
			continue;

		/* we're only interested in evicting buffers of a certain spa */
		if (spa != 0 && hdr->b_spa != spa) {
			ARCSTAT_BUMP(arcstat_evict_skip);
			continue;
		}

		hash_lock = HDR_LOCK(hdr);

		/*
		 * We aren't calling this function from any code path
		 * that would already be holding a hash lock, so we're
		 * asserting on this assumption to be defensive in case
		 * this ever changes. Without this check, it would be
		 * possible to incorrectly increment arcstat_mutex_miss
		 * below (e.g. if the code changed such that we called
		 * this function with a hash lock held).
		 */
		ASSERT(!MUTEX_HELD(hash_lock));

		if (mutex_tryenter(hash_lock)) {
			uint64_t evicted = arc_evict_hdr(hdr, hash_lock);
			mutex_exit(hash_lock);

			bytes_evicted += evicted;

			/*
			 * If evicted is zero, arc_evict_hdr() must have
			 * decided to skip this header, don't increment
			 * evict_count in this case.
			 */
			if (evicted != 0)
				evict_count++;

			/*
			 * If arc_size isn't overflowing, signal any
			 * threads that might happen to be waiting.
			 *
			 * For each header evicted, we wake up a single
			 * thread.
			 * If we used cv_broadcast, we could
			 * wake up "too many" threads causing arc_size
			 * to significantly overflow arc_c; since
			 * arc_get_data_impl() doesn't check for overflow
			 * when it's woken up (it doesn't because it's
			 * possible for the ARC to be overflowing while
			 * full of un-evictable buffers, and the
			 * function should proceed in this case).
			 *
			 * If threads are left sleeping, due to not
			 * using cv_broadcast here, they will be woken
			 * up via cv_broadcast in arc_adjust_cb() just
			 * before arc_adjust_zthr sleeps.
			 */
			mutex_enter(&arc_adjust_lock);
			if (!arc_is_overflowing())
				cv_signal(&arc_adjust_waiters_cv);
			mutex_exit(&arc_adjust_lock);
		} else {
			ARCSTAT_BUMP(arcstat_mutex_miss);
		}
	}

	multilist_sublist_unlock(mls);

	return (bytes_evicted);
}

/*
 * Evict buffers from the given arc state, until we've removed the
 * specified number of bytes. Move the removed buffers to the
 * appropriate evict state.
 *
 * This function makes a "best effort". It skips over any buffers
 * it can't get a hash_lock on, and so, may not catch all candidates.
 * It may also return without evicting as much space as requested.
 *
 * If bytes is specified using the special value ARC_EVICT_ALL, this
 * will evict all available (i.e. unlocked and evictable) buffers from
 * the given arc state; which is used by arc_flush().
 */
static uint64_t
arc_evict_state(arc_state_t *state, uint64_t spa, int64_t bytes,
    arc_buf_contents_t type)
{
	uint64_t total_evicted = 0;
	multilist_t *ml = state->arcs_list[type];
	int num_sublists;
	arc_buf_hdr_t **markers;

	IMPLY(bytes < 0, bytes == ARC_EVICT_ALL);

	num_sublists = multilist_get_num_sublists(ml);

	/*
	 * If we've tried to evict from each sublist, made some
	 * progress, but still have not hit the target number of bytes
	 * to evict, we want to keep trying. The markers allow us to
	 * pick up where we left off for each individual sublist, rather
	 * than starting from the tail each time.
	 */
	markers = kmem_zalloc(sizeof (*markers) * num_sublists, KM_SLEEP);
	for (int i = 0; i < num_sublists; i++) {
		markers[i] = kmem_cache_alloc(hdr_full_cache, KM_SLEEP);

		/*
		 * A b_spa of 0 is used to indicate that this header is
		 * a marker. This fact is used in arc_adjust_type() and
		 * arc_evict_state_impl().
		 */
		markers[i]->b_spa = 0;

		multilist_sublist_t *mls = multilist_sublist_lock(ml, i);
		multilist_sublist_insert_tail(mls, markers[i]);
		multilist_sublist_unlock(mls);
	}

	/*
	 * While we haven't hit our target number of bytes to evict, or
	 * we're evicting all available buffers.
	 */
	while (total_evicted < bytes || bytes == ARC_EVICT_ALL) {
		/*
		 * Start eviction using a randomly selected sublist,
		 * this is to try and evenly balance eviction across all
		 * sublists. Always starting at the same sublist
		 * (e.g. index 0) would cause evictions to favor certain
		 * sublists over others.
		 */
		int sublist_idx = multilist_get_random_index(ml);
		uint64_t scan_evicted = 0;

		for (int i = 0; i < num_sublists; i++) {
			uint64_t bytes_remaining;
			uint64_t bytes_evicted;

			if (bytes == ARC_EVICT_ALL)
				bytes_remaining = ARC_EVICT_ALL;
			else if (total_evicted < bytes)
				bytes_remaining = bytes - total_evicted;
			else
				break;

			bytes_evicted = arc_evict_state_impl(ml, sublist_idx,
			    markers[sublist_idx], spa, bytes_remaining);

			scan_evicted += bytes_evicted;
			total_evicted += bytes_evicted;

			/* we've reached the end, wrap to the beginning */
			if (++sublist_idx >= num_sublists)
				sublist_idx = 0;
		}

		/*
		 * If we didn't evict anything during this scan, we have
		 * no reason to believe we'll evict more during another
		 * scan, so break the loop.
		 */
		if (scan_evicted == 0) {
			/* This isn't possible, let's make that obvious */
			ASSERT3S(bytes, !=, 0);

			/*
			 * When bytes is ARC_EVICT_ALL, the only way to
			 * break the loop is when scan_evicted is zero.
			 * In that case, we actually have evicted enough,
			 * so we don't want to increment the kstat.
			 */
			if (bytes != ARC_EVICT_ALL) {
				ASSERT3S(total_evicted, <, bytes);
				ARCSTAT_BUMP(arcstat_evict_not_enough);
			}

			break;
		}
	}

	for (int i = 0; i < num_sublists; i++) {
		multilist_sublist_t *mls = multilist_sublist_lock(ml, i);
		multilist_sublist_remove(mls, markers[i]);
		multilist_sublist_unlock(mls);

		kmem_cache_free(hdr_full_cache, markers[i]);
	}
	kmem_free(markers, sizeof (*markers) * num_sublists);

	return (total_evicted);
}

/*
 * Flush all "evictable" data of the given type from the arc state
 * specified. This will not evict any "active" buffers (i.e. referenced).
 *
 * When 'retry' is set to B_FALSE, the function will make a single pass
 * over the state and evict any buffers that it can. Since it doesn't
 * continually retry the eviction, it might end up leaving some buffers
 * in the ARC due to lock misses.
 *
 * When 'retry' is set to B_TRUE, the function will continually retry the
 * eviction until *all* evictable buffers have been removed from the
 * state. As a result, if concurrent insertions into the state are
 * allowed (e.g. if the ARC isn't shutting down), this function might
 * wind up in an infinite loop, continually trying to evict buffers.
 */
static uint64_t
arc_flush_state(arc_state_t *state, uint64_t spa, arc_buf_contents_t type,
    boolean_t retry)
{
	uint64_t evicted = 0;

	while (zfs_refcount_count(&state->arcs_esize[type]) != 0) {
		evicted += arc_evict_state(state, spa, ARC_EVICT_ALL, type);

		if (!retry)
			break;
	}

	return (evicted);
}

/*
 * Evict the specified number of bytes from the state specified,
 * restricting eviction to the spa and type given. This function
 * prevents us from trying to evict more from a state's list than
 * is "evictable", and skips evicting altogether when passed a
 * negative value for "bytes". In contrast, arc_evict_state() will
 * evict everything it can, when passed a negative value for "bytes".
 */
static uint64_t
arc_adjust_impl(arc_state_t *state, uint64_t spa, int64_t bytes,
    arc_buf_contents_t type)
{
	int64_t delta;

	if (bytes > 0 && zfs_refcount_count(&state->arcs_esize[type]) > 0) {
		delta = MIN(zfs_refcount_count(&state->arcs_esize[type]),
		    bytes);
		return (arc_evict_state(state, spa, delta, type));
	}

	return (0);
}

/*
 * Evict metadata buffers from the cache, such that arc_meta_used is
 * capped by the arc_meta_limit tunable.
 */
static uint64_t
arc_adjust_meta(uint64_t meta_used)
{
	uint64_t total_evicted = 0;
	int64_t target;

	/*
	 * If we're over the meta limit, we want to evict enough
	 * metadata to get back under the meta limit. We don't want to
	 * evict so much that we drop the MRU below arc_p, though. If
	 * we're over the meta limit more than we're over arc_p, we
	 * evict some from the MRU here, and some from the MFU below.
	 */
	target = MIN((int64_t)(meta_used - arc_meta_limit),
	    (int64_t)(zfs_refcount_count(&arc_anon->arcs_size) +
	    zfs_refcount_count(&arc_mru->arcs_size) - arc_p));

	total_evicted += arc_adjust_impl(arc_mru, 0, target, ARC_BUFC_METADATA);

	/*
	 * Similar to the above, we want to evict enough bytes to get us
	 * below the meta limit, but not so much as to drop us below the
	 * space allotted to the MFU (which is defined as arc_c - arc_p).
	 */
	target = MIN((int64_t)(meta_used - arc_meta_limit),
	    (int64_t)(zfs_refcount_count(&arc_mfu->arcs_size) -
	    (arc_c - arc_p)));

	total_evicted += arc_adjust_impl(arc_mfu, 0, target, ARC_BUFC_METADATA);

	return (total_evicted);
}

/*
 * Return the type of the oldest buffer in the given arc state
 *
 * This function will select a random sublist of type ARC_BUFC_DATA and
 * a random sublist of type ARC_BUFC_METADATA. The tail of each sublist
 * is compared, and the type which contains the "older" buffer will be
 * returned.
 */
static arc_buf_contents_t
arc_adjust_type(arc_state_t *state)
{
	multilist_t *data_ml = state->arcs_list[ARC_BUFC_DATA];
	multilist_t *meta_ml = state->arcs_list[ARC_BUFC_METADATA];
	int data_idx = multilist_get_random_index(data_ml);
	int meta_idx = multilist_get_random_index(meta_ml);
	multilist_sublist_t *data_mls;
	multilist_sublist_t *meta_mls;
	arc_buf_contents_t type;
	arc_buf_hdr_t *data_hdr;
	arc_buf_hdr_t *meta_hdr;

	/*
	 * We keep the sublist lock until we're finished, to prevent
	 * the headers from being destroyed via arc_evict_state().
	 */
	data_mls = multilist_sublist_lock(data_ml, data_idx);
	meta_mls = multilist_sublist_lock(meta_ml, meta_idx);

	/*
	 * These two loops are to ensure we skip any markers that
	 * might be at the tail of the lists due to arc_evict_state().
	 */

	for (data_hdr = multilist_sublist_tail(data_mls); data_hdr != NULL;
	    data_hdr = multilist_sublist_prev(data_mls, data_hdr)) {
		if (data_hdr->b_spa != 0)
			break;
	}

	for (meta_hdr = multilist_sublist_tail(meta_mls); meta_hdr != NULL;
	    meta_hdr = multilist_sublist_prev(meta_mls, meta_hdr)) {
		if (meta_hdr->b_spa != 0)
			break;
	}

	if (data_hdr == NULL && meta_hdr == NULL) {
		type = ARC_BUFC_DATA;
	} else if (data_hdr == NULL) {
		ASSERT3P(meta_hdr, !=, NULL);
		type = ARC_BUFC_METADATA;
	} else if (meta_hdr == NULL) {
		ASSERT3P(data_hdr, !=, NULL);
		type = ARC_BUFC_DATA;
	} else {
		ASSERT3P(data_hdr, !=, NULL);
		ASSERT3P(meta_hdr, !=, NULL);

		/* The headers can't be on the sublist without an L1 header */
		ASSERT(HDR_HAS_L1HDR(data_hdr));
		ASSERT(HDR_HAS_L1HDR(meta_hdr));

		if (data_hdr->b_l1hdr.b_arc_access <
		    meta_hdr->b_l1hdr.b_arc_access) {
			type = ARC_BUFC_DATA;
		} else {
			type = ARC_BUFC_METADATA;
		}
	}

	multilist_sublist_unlock(meta_mls);
	multilist_sublist_unlock(data_mls);

	return (type);
}

/*
 * Evict buffers from the cache, such that arc_size is capped by arc_c.
 */
static uint64_t
arc_adjust(void)
{
	uint64_t total_evicted = 0;
	uint64_t bytes;
	int64_t target;
	uint64_t asize = aggsum_value(&arc_size);
	uint64_t ameta = aggsum_value(&arc_meta_used);

	/*
	 * If we're over arc_meta_limit, we want to correct that before
	 * potentially evicting data buffers below.
	 */
	total_evicted += arc_adjust_meta(ameta);

	/*
	 * Adjust MRU size
	 *
	 * If we're over the target cache size, we want to evict enough
	 * from the list to get back to our target size. We don't want
	 * to evict too much from the MRU, such that it drops below
	 * arc_p. So, if we're over our target cache size more than
	 * the MRU is over arc_p, we'll evict enough to get back to
	 * arc_p here, and then evict more from the MFU below.
	 */
	target = MIN((int64_t)(asize - arc_c),
	    (int64_t)(zfs_refcount_count(&arc_anon->arcs_size) +
	    zfs_refcount_count(&arc_mru->arcs_size) + ameta - arc_p));

	/*
	 * If we're below arc_meta_min, always prefer to evict data.
	 * Otherwise, try to satisfy the requested number of bytes to
	 * evict from the type which contains older buffers; in an
	 * effort to keep newer buffers in the cache regardless of their
	 * type. If we cannot satisfy the number of bytes from this
	 * type, spill over into the next type.
	 */
	if (arc_adjust_type(arc_mru) == ARC_BUFC_METADATA &&
	    ameta > arc_meta_min) {
		bytes = arc_adjust_impl(arc_mru, 0, target, ARC_BUFC_METADATA);
		total_evicted += bytes;

		/*
		 * If we couldn't evict our target number of bytes from
		 * metadata, we try to get the rest from data.
		 */
		target -= bytes;

		total_evicted +=
		    arc_adjust_impl(arc_mru, 0, target, ARC_BUFC_DATA);
	} else {
		bytes = arc_adjust_impl(arc_mru, 0, target, ARC_BUFC_DATA);
		total_evicted += bytes;

		/*
		 * If we couldn't evict our target number of bytes from
		 * data, we try to get the rest from metadata.
		 */
		target -= bytes;

		total_evicted +=
		    arc_adjust_impl(arc_mru, 0, target, ARC_BUFC_METADATA);
	}

	/*
	 * Adjust MFU size
	 *
	 * Now that we've tried to evict enough from the MRU to get its
	 * size back to arc_p, if we're still above the target cache
	 * size, we evict the rest from the MFU.
	 */
	target = asize - arc_c;

	if (arc_adjust_type(arc_mfu) == ARC_BUFC_METADATA &&
	    ameta > arc_meta_min) {
		bytes = arc_adjust_impl(arc_mfu, 0, target, ARC_BUFC_METADATA);
		total_evicted += bytes;

		/*
		 * If we couldn't evict our target number of bytes from
		 * metadata, we try to get the rest from data.
		 */
		target -= bytes;

		total_evicted +=
		    arc_adjust_impl(arc_mfu, 0, target, ARC_BUFC_DATA);
	} else {
		bytes = arc_adjust_impl(arc_mfu, 0, target, ARC_BUFC_DATA);
		total_evicted += bytes;

		/*
		 * If we couldn't evict our target number of bytes from
		 * data, we try to get the rest from metadata.
		 */
		target -= bytes;

		total_evicted +=
		    arc_adjust_impl(arc_mfu, 0, target, ARC_BUFC_METADATA);
	}

	/*
	 * Adjust ghost lists
	 *
	 * In addition to the above, the ARC also defines target values
	 * for the ghost lists. The sum of the mru list and mru ghost
	 * list should never exceed the target size of the cache, and
	 * the sum of the mru list, mfu list, mru ghost list, and mfu
	 * ghost list should never exceed twice the target size of the
	 * cache.
	 * The following logic enforces these limits on the ghost
	 * caches, and evicts from them as needed.
	 */
	target = zfs_refcount_count(&arc_mru->arcs_size) +
	    zfs_refcount_count(&arc_mru_ghost->arcs_size) - arc_c;

	bytes = arc_adjust_impl(arc_mru_ghost, 0, target, ARC_BUFC_DATA);
	total_evicted += bytes;

	target -= bytes;

	total_evicted +=
	    arc_adjust_impl(arc_mru_ghost, 0, target, ARC_BUFC_METADATA);

	/*
	 * We assume the sum of the mru list and mfu list is less than
	 * or equal to arc_c (we enforced this above), which means we
	 * can use the simpler of the two equations below:
	 *
	 *	mru + mfu + mru ghost + mfu ghost <= 2 * arc_c
	 *		    mru ghost + mfu ghost <= arc_c
	 */
	target = zfs_refcount_count(&arc_mru_ghost->arcs_size) +
	    zfs_refcount_count(&arc_mfu_ghost->arcs_size) - arc_c;

	bytes = arc_adjust_impl(arc_mfu_ghost, 0, target, ARC_BUFC_DATA);
	total_evicted += bytes;

	target -= bytes;

	total_evicted +=
	    arc_adjust_impl(arc_mfu_ghost, 0, target, ARC_BUFC_METADATA);

	return (total_evicted);
}

void
arc_flush(spa_t *spa, boolean_t retry)
{
	uint64_t guid = 0;

	/*
	 * If retry is B_TRUE, a spa must not be specified since we have
	 * no good way to determine if all of a spa's buffers have been
	 * evicted from an arc state.
	 */
	ASSERT(!retry || spa == 0);

	if (spa != NULL)
		guid = spa_load_guid(spa);

	(void) arc_flush_state(arc_mru, guid, ARC_BUFC_DATA, retry);
	(void) arc_flush_state(arc_mru, guid, ARC_BUFC_METADATA, retry);

	(void) arc_flush_state(arc_mfu, guid, ARC_BUFC_DATA, retry);
	(void) arc_flush_state(arc_mfu, guid, ARC_BUFC_METADATA, retry);

	(void) arc_flush_state(arc_mru_ghost, guid, ARC_BUFC_DATA, retry);
	(void) arc_flush_state(arc_mru_ghost, guid, ARC_BUFC_METADATA, retry);

	(void) arc_flush_state(arc_mfu_ghost, guid, ARC_BUFC_DATA, retry);
	(void) arc_flush_state(arc_mfu_ghost, guid, ARC_BUFC_METADATA, retry);
}

static void
arc_reduce_target_size(int64_t to_free)
{
	uint64_t asize = aggsum_value(&arc_size);
	if (arc_c > arc_c_min) {

		if (arc_c > arc_c_min + to_free)
			atomic_add_64(&arc_c, -to_free);
		else
			arc_c = arc_c_min;

		atomic_add_64(&arc_p, -(arc_p >> arc_shrink_shift));
		if (asize < arc_c)
			arc_c = MAX(asize, arc_c_min);
		if (arc_p > arc_c)
			arc_p = (arc_c >> 1);
		ASSERT(arc_c >= arc_c_min);
		ASSERT((int64_t)arc_p >= 0);
	}

	if (asize > arc_c) {
		/* See comment in arc_adjust_cb_check() on why lock+flag */
		mutex_enter(&arc_adjust_lock);
		arc_adjust_needed = B_TRUE;
		mutex_exit(&arc_adjust_lock);
		zthr_wakeup(arc_adjust_zthr);
	}
}

typedef enum free_memory_reason_t {
	FMR_UNKNOWN,
	FMR_NEEDFREE,
	FMR_LOTSFREE,
	FMR_SWAPFS_MINFREE,
	FMR_PAGES_PP_MAXIMUM,
	FMR_HEAP_ARENA,
	FMR_ZIO_ARENA,
} free_memory_reason_t;

int64_t last_free_memory;
free_memory_reason_t last_free_reason;

/*
 * Additional reserve of pages for pp_reserve.
 */
int64_t arc_pages_pp_reserve = 64;

/*
 * Additional reserve of pages for swapfs.
 */
int64_t arc_swapfs_reserve = 64;

/*
 * Return the amount of memory that can be consumed before reclaim will be
 * needed. Positive if there is sufficient free memory, negative indicates
 * the amount of memory that needs to be freed up.
 */
static int64_t
arc_available_memory(void)
{
	int64_t lowest = INT64_MAX;
	int64_t n;
	free_memory_reason_t r = FMR_UNKNOWN;

#ifdef _KERNEL
	if (needfree > 0) {
		n = PAGESIZE * (-needfree);
		if (n < lowest) {
			lowest = n;
			r = FMR_NEEDFREE;
		}
	}

	/*
	 * check that we're out of range of the pageout scanner. It starts to
	 * schedule paging if freemem is less than lotsfree and needfree.
	 * lotsfree is the high-water mark for pageout, and needfree is the
	 * number of needed free pages. We add extra pages here to make sure
	 * the scanner doesn't start up while we're freeing memory.
	 */
	n = PAGESIZE * (freemem - lotsfree - needfree - desfree);
	if (n < lowest) {
		lowest = n;
		r = FMR_LOTSFREE;
	}

	/*
	 * check to make sure that swapfs has enough space so that anon
	 * reservations can still succeed. anon_resvmem() checks that the
	 * availrmem is greater than swapfs_minfree, and the number of reserved
	 * swap pages. We also add a bit of extra here just to prevent
	 * circumstances from getting really dire.
	 */
	n = PAGESIZE * (availrmem - swapfs_minfree - swapfs_reserve -
	    desfree - arc_swapfs_reserve);
	if (n < lowest) {
		lowest = n;
		r = FMR_SWAPFS_MINFREE;
	}


	/*
	 * Check that we have enough availrmem that memory locking (e.g., via
	 * mlock(3C) or memcntl(2)) can still succeed. (pages_pp_maximum
	 * stores the number of pages that cannot be locked; when availrmem
	 * drops below pages_pp_maximum, page locking mechanisms such as
	 * page_pp_lock() will fail.)
	 */
	n = PAGESIZE * (availrmem - pages_pp_maximum -
	    arc_pages_pp_reserve);
	if (n < lowest) {
		lowest = n;
		r = FMR_PAGES_PP_MAXIMUM;
	}

#if defined(__i386)
	/*
	 * If we're on an i386 platform, it's possible that we'll exhaust the
	 * kernel heap space before we ever run out of available physical
	 * memory. Most checks of the size of the heap_arena compare against
	 * tune.t_minarmem, which is the minimum available real memory that we
	 * can have in the system. However, this is generally fixed at 25 pages
	 * which is so low that it's useless. In this comparison, we seek to
	 * calculate the total heap-size, and reclaim if more than 3/4ths of the
	 * heap is allocated. (Or, in the calculation, if less than 1/4th is
	 * free)
	 */
	n = (int64_t)vmem_size(heap_arena, VMEM_FREE) -
	    (vmem_size(heap_arena, VMEM_FREE | VMEM_ALLOC) >> 2);
	if (n < lowest) {
		lowest = n;
		r = FMR_HEAP_ARENA;
	}
#endif

	/*
	 * If zio data pages are being allocated out of a separate heap segment,
	 * then enforce that the size of available vmem for this arena remains
	 * above about 1/4th (1/(2^arc_zio_arena_free_shift)) free.
	 *
	 * Note that reducing the arc_zio_arena_free_shift keeps more virtual
	 * memory (in the zio_arena) free, which can avoid memory
	 * fragmentation issues.
	 */
	if (zio_arena != NULL) {
		n = (int64_t)vmem_size(zio_arena, VMEM_FREE) -
		    (vmem_size(zio_arena, VMEM_ALLOC) >>
		    arc_zio_arena_free_shift);
		if (n < lowest) {
			lowest = n;
			r = FMR_ZIO_ARENA;
		}
	}
#else
	/* Every 100 calls, free a small amount */
	if (spa_get_random(100) == 0)
		lowest = -1024;
#endif

	last_free_memory = lowest;
	last_free_reason = r;

	return (lowest);
}


/*
 * Determine if the system is under memory pressure and is asking
 * to reclaim memory.
 * A return value of B_TRUE indicates that the system
 * is under memory pressure and that the arc should adjust accordingly.
 */
static boolean_t
arc_reclaim_needed(void)
{
	return (arc_available_memory() < 0);
}

static void
arc_kmem_reap_soon(void)
{
	size_t i;
	kmem_cache_t *prev_cache = NULL;
	kmem_cache_t *prev_data_cache = NULL;
	extern kmem_cache_t *zio_buf_cache[];
	extern kmem_cache_t *zio_data_buf_cache[];
	extern kmem_cache_t *zfs_btree_leaf_cache;
	extern kmem_cache_t *abd_chunk_cache;

#ifdef _KERNEL
	if (aggsum_compare(&arc_meta_used, arc_meta_limit) >= 0) {
		/*
		 * We are exceeding our meta-data cache limit.
		 * Purge some DNLC entries to release holds on meta-data.
		 */
		dnlc_reduce_cache((void *)(uintptr_t)arc_reduce_dnlc_percent);
	}
#if defined(__i386)
	/*
	 * Reclaim unused memory from all kmem caches.
	 */
	kmem_reap();
#endif
#endif

	for (i = 0; i < SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT; i++) {
		if (zio_buf_cache[i] != prev_cache) {
			prev_cache = zio_buf_cache[i];
			kmem_cache_reap_soon(zio_buf_cache[i]);
		}
		if (zio_data_buf_cache[i] != prev_data_cache) {
			prev_data_cache = zio_data_buf_cache[i];
			kmem_cache_reap_soon(zio_data_buf_cache[i]);
		}
	}
	kmem_cache_reap_soon(abd_chunk_cache);
	kmem_cache_reap_soon(buf_cache);
	kmem_cache_reap_soon(hdr_full_cache);
	kmem_cache_reap_soon(hdr_l2only_cache);
	kmem_cache_reap_soon(zfs_btree_leaf_cache);

	if (zio_arena != NULL) {
		/*
		 * Ask the vmem arena to reclaim unused memory from its
		 * quantum caches.
		 */
		vmem_qcache_reap(zio_arena);
	}
}

/* ARGSUSED */
static boolean_t
arc_adjust_cb_check(void *arg, zthr_t *zthr)
{
	/*
	 * This is necessary in order for the mdb ::arc dcmd to
	 * show up to date information.
	 * Since the ::arc command
	 * does not call the kstat's update function, without
	 * this call, the command may show stale stats for the
	 * anon, mru, mru_ghost, mfu, and mfu_ghost lists. Even
	 * with this change, the data might be up to 1 second
	 * out of date (the arc_adjust_zthr has a maximum sleep
	 * time of 1 second); but that should suffice. The
	 * arc_state_t structures can be queried directly if more
	 * accurate information is needed.
	 */
	if (arc_ksp != NULL)
		arc_ksp->ks_update(arc_ksp, KSTAT_READ);

	/*
	 * We have to rely on arc_get_data_impl() to tell us when to adjust,
	 * rather than checking if we are overflowing here, so that we are
	 * sure to not leave arc_get_data_impl() waiting on
	 * arc_adjust_waiters_cv. If we have become "not overflowing" since
	 * arc_get_data_impl() checked, we need to wake it up. We could
	 * broadcast the CV here, but arc_get_data_impl() may have not yet
	 * gone to sleep. We would need to use a mutex to ensure that this
	 * function doesn't broadcast until arc_get_data_impl() has gone to
	 * sleep (e.g. the arc_adjust_lock). However, the lock ordering of
	 * such a lock would necessarily be incorrect with respect to the
	 * zthr_lock, which is held before this function is called, and is
	 * held by arc_get_data_impl() when it calls zthr_wakeup().
	 */
	return (arc_adjust_needed);
}

/*
 * Keep arc_size under arc_c by running arc_adjust which evicts data
 * from the ARC.
 */
/* ARGSUSED */
static void
arc_adjust_cb(void *arg, zthr_t *zthr)
{
	uint64_t evicted = 0;

	/* Evict from cache */
	evicted = arc_adjust();

	/*
	 * If evicted is zero, we couldn't evict anything
	 * via arc_adjust(). This could be due to hash lock
	 * collisions, but more likely due to the majority of
	 * arc buffers being unevictable.
	 * Therefore, even if
	 * arc_size is above arc_c, another pass is unlikely to
	 * be helpful and could potentially cause us to enter an
	 * infinite loop. Additionally, zthr_iscancelled() is
	 * checked here so that if the arc is shutting down, the
	 * broadcast will wake any remaining arc adjust waiters.
	 */
	mutex_enter(&arc_adjust_lock);
	arc_adjust_needed = !zthr_iscancelled(arc_adjust_zthr) &&
	    evicted > 0 && aggsum_compare(&arc_size, arc_c) > 0;
	if (!arc_adjust_needed) {
		/*
		 * We're either no longer overflowing, or we
		 * can't evict anything more, so we should wake
		 * up any waiters.
		 */
		cv_broadcast(&arc_adjust_waiters_cv);
	}
	mutex_exit(&arc_adjust_lock);
}

/* ARGSUSED */
static boolean_t
arc_reap_cb_check(void *arg, zthr_t *zthr)
{
	int64_t free_memory = arc_available_memory();

	/*
	 * If a kmem reap is already active, don't schedule more. We must
	 * check for this because kmem_cache_reap_soon() won't actually
	 * block on the cache being reaped (this is to prevent callers from
	 * becoming implicitly blocked by a system-wide kmem reap -- which,
	 * on a system with many, many full magazines, can take minutes).
	 */
	if (!kmem_cache_reap_active() &&
	    free_memory < 0) {
		arc_no_grow = B_TRUE;
		arc_warm = B_TRUE;
		/*
		 * Wait at least arc_grow_retry (default 60) seconds
		 * before considering growing.
		 */
		arc_growtime = gethrtime() + SEC2NSEC(arc_grow_retry);
		return (B_TRUE);
	} else if (free_memory < arc_c >> arc_no_grow_shift) {
		arc_no_grow = B_TRUE;
	} else if (gethrtime() >= arc_growtime) {
		arc_no_grow = B_FALSE;
	}

	return (B_FALSE);
}

/*
 * Keep enough free memory in the system by reaping the ARC's kmem
 * caches.
 * To cause more slabs to be reapable, we may reduce the
 * target size of the cache (arc_c), causing the arc_adjust_cb()
 * to free more buffers.
 */
/* ARGSUSED */
static void
arc_reap_cb(void *arg, zthr_t *zthr)
{
	int64_t free_memory;

	/*
	 * Kick off asynchronous kmem_reap()'s of all our caches.
	 */
	arc_kmem_reap_soon();

	/*
	 * Wait at least arc_kmem_cache_reap_retry_ms between
	 * arc_kmem_reap_soon() calls. Without this check it is possible to
	 * end up in a situation where we spend lots of time reaping
	 * caches, while we're near arc_c_min. Waiting here also gives the
	 * subsequent free memory check a chance of finding that the
	 * asynchronous reap has already freed enough memory, and we don't
	 * need to call arc_reduce_target_size().
	 */
	delay((hz * arc_kmem_cache_reap_retry_ms + 999) / 1000);

	/*
	 * Reduce the target size as needed to maintain the amount of free
	 * memory in the system at a fraction of the arc_size (1/128th by
	 * default). If oversubscribed (free_memory < 0) then reduce the
	 * target arc_size by the deficit amount plus the fractional
	 * amount. If free memory is positive but less than the fractional
	 * amount, reduce by what is needed to hit the fractional amount.
	 */
	free_memory = arc_available_memory();

	int64_t to_free =
	    (arc_c >> arc_shrink_shift) - free_memory;
	if (to_free > 0) {
#ifdef _KERNEL
		to_free = MAX(to_free, ptob(needfree));
#endif
		arc_reduce_target_size(to_free);
	}
}

/*
 * Adapt arc info given the number of bytes we are trying to add and
 * the state that we are coming from. This function is only called
 * when we are adding new content to the cache.
 */
static void
arc_adapt(int bytes, arc_state_t *state)
{
	int mult;
	uint64_t arc_p_min = (arc_c >> arc_p_min_shift);
	int64_t mrug_size = zfs_refcount_count(&arc_mru_ghost->arcs_size);
	int64_t mfug_size = zfs_refcount_count(&arc_mfu_ghost->arcs_size);

	if (state == arc_l2c_only)
		return;

	ASSERT(bytes > 0);
	/*
	 * Adapt the target size of the MRU list:
	 *	- if we just hit in the MRU ghost list, then increase
	 *	  the target size of the MRU list.
	 *	- if we just hit in the MFU ghost list, then increase
	 *	  the target size of the MFU list by decreasing the
	 *	  target size of the MRU list.
	 */
	if (state == arc_mru_ghost) {
		mult = (mrug_size >= mfug_size) ? 1 : (mfug_size / mrug_size);
		mult = MIN(mult, 10); /* avoid wild arc_p adjustment */

		arc_p = MIN(arc_c - arc_p_min, arc_p + bytes * mult);
	} else if (state == arc_mfu_ghost) {
		uint64_t delta;

		mult = (mfug_size >= mrug_size) ?
		    1 : (mrug_size / mfug_size);
		mult = MIN(mult, 10);

		delta = MIN(bytes * mult, arc_p);
		arc_p = MAX(arc_p_min, arc_p - delta);
	}
	ASSERT((int64_t)arc_p >= 0);

	/*
	 * Wake reap thread if we do not have any available memory
	 */
	if (arc_reclaim_needed()) {
		zthr_wakeup(arc_reap_zthr);
		return;
	}

	if (arc_no_grow)
		return;

	if (arc_c >= arc_c_max)
		return;

	/*
	 * If we're within (2 * maxblocksize) bytes of the target
	 * cache size, increment the target cache size
	 */
	if (aggsum_compare(&arc_size, arc_c - (2ULL << SPA_MAXBLOCKSHIFT)) >
	    0) {
		atomic_add_64(&arc_c, (int64_t)bytes);
		if (arc_c > arc_c_max)
			arc_c = arc_c_max;
		else if (state == arc_anon)
			atomic_add_64(&arc_p, (int64_t)bytes);
		if (arc_p > arc_c)
			arc_p = arc_c;
	}
	ASSERT((int64_t)arc_p >= 0);
}

/*
 * Check if arc_size has grown past our upper threshold, determined by
 * zfs_arc_overflow_shift.
 */
static boolean_t
arc_is_overflowing(void)
{
	/* Always allow at least one block of overflow */
	uint64_t overflow = MAX(SPA_MAXBLOCKSIZE,
	    arc_c >> zfs_arc_overflow_shift);

	/*
	 * We just compare the lower bound here for performance reasons. Our
	 * primary goals are to make sure that the arc never grows without
	 * bound, and that it can reach its maximum size. This check
	 * accomplishes both goals. The maximum amount we could run over by is
	 * 2 * aggsum_borrow_multiplier * NUM_CPUS * the average size of a block
	 * in the ARC. In practice, that's in the tens of MB, which is low
	 * enough to be safe.
	 */
	return (aggsum_lower_bound(&arc_size) >= arc_c + overflow);
}

static abd_t *
arc_get_data_abd(arc_buf_hdr_t *hdr, uint64_t size, void *tag)
{
	arc_buf_contents_t type = arc_buf_type(hdr);

	arc_get_data_impl(hdr, size, tag);
	if (type == ARC_BUFC_METADATA) {
		return (abd_alloc(size, B_TRUE));
	} else {
		ASSERT(type == ARC_BUFC_DATA);
		return (abd_alloc(size, B_FALSE));
	}
}

static void *
arc_get_data_buf(arc_buf_hdr_t *hdr, uint64_t size, void *tag)
{
	arc_buf_contents_t type = arc_buf_type(hdr);

	arc_get_data_impl(hdr, size, tag);
	if (type == ARC_BUFC_METADATA) {
		return (zio_buf_alloc(size));
	} else {
		ASSERT(type == ARC_BUFC_DATA);
		return (zio_data_buf_alloc(size));
	}
}

/*
 * Allocate a block and return it to the caller. If we are hitting the
 * hard limit for the cache size, we must sleep, waiting for the eviction
 * thread to catch up. If we're past the target size but below the hard
 * limit, we'll only signal the reclaim thread and continue on.
 */
static void
arc_get_data_impl(arc_buf_hdr_t *hdr, uint64_t size, void *tag)
{
	arc_state_t *state = hdr->b_l1hdr.b_state;
	arc_buf_contents_t type = arc_buf_type(hdr);

	arc_adapt(size, state);

	/*
	 * If arc_size is currently overflowing, and has grown past our
	 * upper limit, we must be adding data faster than the evict
	 * thread can evict. Thus, to ensure we don't compound the
	 * problem by adding more data and forcing arc_size to grow even
	 * further past its target size, we halt and wait for the
	 * eviction thread to catch up.
	 *
	 * It's also possible that the reclaim thread is unable to evict
	 * enough buffers to get arc_size below the overflow limit (e.g.
	 * due to buffers being un-evictable, or hash lock collisions).
	 * In this case, we want to proceed regardless of whether we're
	 * overflowing; thus we don't use a while loop here.
	 */
	if (arc_is_overflowing()) {
		mutex_enter(&arc_adjust_lock);

		/*
		 * Now that we've acquired the lock, we may no longer be
		 * over the overflow limit, so let's check.
		 *
		 * We're ignoring the case of spurious wake ups. If that
		 * were to happen, it'd let this thread consume an ARC
		 * buffer before it should have (i.e. before we're under
		 * the overflow limit and were signalled by the reclaim
		 * thread). As long as that is a rare occurrence, it
		 * shouldn't cause any harm.
		 */
		if (arc_is_overflowing()) {
			arc_adjust_needed = B_TRUE;
			zthr_wakeup(arc_adjust_zthr);
			(void) cv_wait(&arc_adjust_waiters_cv,
			    &arc_adjust_lock);
		}
		mutex_exit(&arc_adjust_lock);
	}

	VERIFY3U(hdr->b_type, ==, type);
	if (type == ARC_BUFC_METADATA) {
		arc_space_consume(size, ARC_SPACE_META);
	} else {
		arc_space_consume(size, ARC_SPACE_DATA);
	}

	/*
	 * Update the state size.  Note that ghost states have a
	 * "ghost size" and so don't need to be updated.
	 */
	if (!GHOST_STATE(state)) {

		(void) zfs_refcount_add_many(&state->arcs_size, size, tag);

		/*
		 * If this is reached via arc_read, the link is
		 * protected by the hash lock. If reached via
		 * arc_buf_alloc, the header should not be accessed by
		 * any other thread. And, if reached via arc_read_done,
		 * the hash lock will protect it if it's found in the
		 * hash table; otherwise no other thread should be
		 * trying to [add|remove]_reference it.
		 */
		if (multilist_link_active(&hdr->b_l1hdr.b_arc_node)) {
			ASSERT(zfs_refcount_is_zero(&hdr->b_l1hdr.b_refcnt));
			(void) zfs_refcount_add_many(&state->arcs_esize[type],
			    size, tag);
		}

		/*
		 * If we are growing the cache, and we are adding anonymous
		 * data, and we have outgrown arc_p, update arc_p
		 */
		if (aggsum_compare(&arc_size, arc_c) < 0 &&
		    hdr->b_l1hdr.b_state == arc_anon &&
		    (zfs_refcount_count(&arc_anon->arcs_size) +
		    zfs_refcount_count(&arc_mru->arcs_size) > arc_p))
			arc_p = MIN(arc_c, arc_p + size);
	}
}

static void
arc_free_data_abd(arc_buf_hdr_t *hdr, abd_t *abd, uint64_t size, void *tag)
{
	arc_free_data_impl(hdr, size, tag);
	abd_free(abd);
}

static void
arc_free_data_buf(arc_buf_hdr_t *hdr, void *buf, uint64_t size, void *tag)
{
	arc_buf_contents_t type = arc_buf_type(hdr);

	arc_free_data_impl(hdr, size, tag);
	if (type == ARC_BUFC_METADATA) {
		zio_buf_free(buf, size);
	} else {
		ASSERT(type == ARC_BUFC_DATA);
		zio_data_buf_free(buf, size);
	}
}

/*
 * Free the arc data buffer.
 */
static void
arc_free_data_impl(arc_buf_hdr_t *hdr, uint64_t size, void *tag)
{
	arc_state_t *state = hdr->b_l1hdr.b_state;
	arc_buf_contents_t type = arc_buf_type(hdr);

	/* protected by hash lock, if in the hash table */
	if (multilist_link_active(&hdr->b_l1hdr.b_arc_node)) {
		ASSERT(zfs_refcount_is_zero(&hdr->b_l1hdr.b_refcnt));
		ASSERT(state != arc_anon && state != arc_l2c_only);

		(void) zfs_refcount_remove_many(&state->arcs_esize[type],
		    size, tag);
	}
	(void) zfs_refcount_remove_many(&state->arcs_size, size, tag);

	VERIFY3U(hdr->b_type, ==, type);
	if (type == ARC_BUFC_METADATA) {
		arc_space_return(size, ARC_SPACE_META);
	} else {
		ASSERT(type == ARC_BUFC_DATA);
		arc_space_return(size, ARC_SPACE_DATA);
	}
}

/*
 * This routine is called whenever a buffer is accessed.
 * NOTE: the hash lock is dropped in this function.
 */
static void
arc_access(arc_buf_hdr_t *hdr, kmutex_t *hash_lock)
{
	clock_t now;

	ASSERT(MUTEX_HELD(hash_lock));
	ASSERT(HDR_HAS_L1HDR(hdr));

	if (hdr->b_l1hdr.b_state == arc_anon) {
		/*
		 * This buffer is not in the cache, and does not
		 * appear in our "ghost" list.  Add the new buffer
		 * to the MRU state.
		 */

		ASSERT0(hdr->b_l1hdr.b_arc_access);
		hdr->b_l1hdr.b_arc_access = ddi_get_lbolt();
		DTRACE_PROBE1(new_state__mru, arc_buf_hdr_t *, hdr);
		arc_change_state(arc_mru, hdr, hash_lock);

	} else if (hdr->b_l1hdr.b_state == arc_mru) {
		now = ddi_get_lbolt();

		/*
		 * If this buffer is here because of a prefetch, then either:
		 * - clear the flag if this is a "referencing" read
		 *   (any subsequent access will bump this into the MFU state).
		 * or
		 * - move the buffer to the head of the list if this is
		 *   another prefetch (to make it less likely to be evicted).
		 */
		if (HDR_PREFETCH(hdr) || HDR_PRESCIENT_PREFETCH(hdr)) {
			if (zfs_refcount_count(&hdr->b_l1hdr.b_refcnt) == 0) {
				/* link protected by hash lock */
				ASSERT(multilist_link_active(
				    &hdr->b_l1hdr.b_arc_node));
			} else {
				arc_hdr_clear_flags(hdr,
				    ARC_FLAG_PREFETCH |
				    ARC_FLAG_PRESCIENT_PREFETCH);
				ARCSTAT_BUMP(arcstat_mru_hits);
			}
			hdr->b_l1hdr.b_arc_access = now;
			return;
		}

		/*
		 * This buffer has been "accessed" only once so far,
		 * but it is still in the cache. Move it to the MFU
		 * state.
		 */
		if (now > hdr->b_l1hdr.b_arc_access + ARC_MINTIME) {
			/*
			 * More than 125ms have passed since we
			 * instantiated this buffer.  Move it to the
			 * most frequently used state.
			 */
			hdr->b_l1hdr.b_arc_access = now;
			DTRACE_PROBE1(new_state__mfu, arc_buf_hdr_t *, hdr);
			arc_change_state(arc_mfu, hdr, hash_lock);
		}
		ARCSTAT_BUMP(arcstat_mru_hits);
	} else if (hdr->b_l1hdr.b_state == arc_mru_ghost) {
		arc_state_t *new_state;
		/*
		 * This buffer has been "accessed" recently, but
		 * was evicted from the cache.  Move it to the
		 * MFU state.
		 */

		if (HDR_PREFETCH(hdr) || HDR_PRESCIENT_PREFETCH(hdr)) {
			new_state = arc_mru;
			if (zfs_refcount_count(&hdr->b_l1hdr.b_refcnt) > 0) {
				arc_hdr_clear_flags(hdr,
				    ARC_FLAG_PREFETCH |
				    ARC_FLAG_PRESCIENT_PREFETCH);
			}
			DTRACE_PROBE1(new_state__mru, arc_buf_hdr_t *, hdr);
		} else {
			new_state = arc_mfu;
			DTRACE_PROBE1(new_state__mfu, arc_buf_hdr_t *, hdr);
		}

		hdr->b_l1hdr.b_arc_access = ddi_get_lbolt();
		arc_change_state(new_state, hdr, hash_lock);

		ARCSTAT_BUMP(arcstat_mru_ghost_hits);
	} else if (hdr->b_l1hdr.b_state == arc_mfu) {
		/*
		 * This buffer has been accessed more than once and is
		 * still in the cache.  Keep it in the MFU state.
		 *
		 * NOTE: an add_reference() that occurred when we did
		 * the arc_read() will have kicked this off the list.
		 * If it was a prefetch, we will explicitly move it to
		 * the head of the list now.
		 */
		ARCSTAT_BUMP(arcstat_mfu_hits);
		hdr->b_l1hdr.b_arc_access = ddi_get_lbolt();
	} else if (hdr->b_l1hdr.b_state == arc_mfu_ghost) {
		arc_state_t *new_state = arc_mfu;
		/*
		 * This buffer has been accessed more than once but has
		 * been evicted from the cache.  Move it back to the
		 * MFU state.
		 */

		if (HDR_PREFETCH(hdr) || HDR_PRESCIENT_PREFETCH(hdr)) {
			/*
			 * This is a prefetch access...
			 * move this block back to the MRU state.
			 */
			new_state = arc_mru;
		}

		hdr->b_l1hdr.b_arc_access = ddi_get_lbolt();
		DTRACE_PROBE1(new_state__mfu, arc_buf_hdr_t *, hdr);
		arc_change_state(new_state, hdr, hash_lock);

		ARCSTAT_BUMP(arcstat_mfu_ghost_hits);
	} else if (hdr->b_l1hdr.b_state == arc_l2c_only) {
		/*
		 * This buffer is on the 2nd Level ARC.
		 */

		hdr->b_l1hdr.b_arc_access = ddi_get_lbolt();
		DTRACE_PROBE1(new_state__mfu, arc_buf_hdr_t *, hdr);
		arc_change_state(arc_mfu, hdr, hash_lock);
	} else {
		ASSERT(!"invalid arc state");
	}
}

/*
 * This routine is called by dbuf_hold() to update the arc_access() state
 * which otherwise would be skipped for entries in the dbuf cache.
 */
void
arc_buf_access(arc_buf_t *buf)
{
	mutex_enter(&buf->b_evict_lock);
	arc_buf_hdr_t *hdr = buf->b_hdr;

	/*
	 * Avoid taking the hash_lock when possible as an optimization.
	 * The header must be checked again under the hash_lock in order
	 * to handle the case where it is concurrently being released.
	 */
	if (hdr->b_l1hdr.b_state == arc_anon || HDR_EMPTY(hdr)) {
		mutex_exit(&buf->b_evict_lock);
		return;
	}

	kmutex_t *hash_lock = HDR_LOCK(hdr);
	mutex_enter(hash_lock);

	if (hdr->b_l1hdr.b_state == arc_anon || HDR_EMPTY(hdr)) {
		mutex_exit(hash_lock);
		mutex_exit(&buf->b_evict_lock);
		ARCSTAT_BUMP(arcstat_access_skip);
		return;
	}

	mutex_exit(&buf->b_evict_lock);

	ASSERT(hdr->b_l1hdr.b_state == arc_mru ||
	    hdr->b_l1hdr.b_state == arc_mfu);

	DTRACE_PROBE1(arc__hit, arc_buf_hdr_t *, hdr);
	arc_access(hdr, hash_lock);
	mutex_exit(hash_lock);

	ARCSTAT_BUMP(arcstat_hits);
	ARCSTAT_CONDSTAT(!HDR_PREFETCH(hdr),
	    demand, prefetch, !HDR_ISTYPE_METADATA(hdr), data, metadata, hits);
}

/* a generic arc_read_done_func_t which you can use */
/* ARGSUSED */
void
arc_bcopy_func(zio_t *zio, const zbookmark_phys_t *zb, const blkptr_t *bp,
    arc_buf_t *buf, void *arg)
{
	if (buf == NULL)
		return;

	bcopy(buf->b_data, arg, arc_buf_size(buf));
	arc_buf_destroy(buf, arg);
}

/* a generic arc_read_done_func_t */
void
arc_getbuf_func(zio_t *zio, const zbookmark_phys_t *zb, const blkptr_t *bp,
    arc_buf_t *buf, void *arg)
{
	arc_buf_t **bufp = arg;

	if (buf == NULL) {
		ASSERT(zio == NULL || zio->io_error != 0);
		*bufp = NULL;
	} else {
		ASSERT(zio == NULL || zio->io_error == 0);
		*bufp = buf;
		ASSERT(buf->b_data != NULL);
	}
}

static void
arc_hdr_verify(arc_buf_hdr_t *hdr, const blkptr_t *bp)
{
	if (BP_IS_HOLE(bp) || BP_IS_EMBEDDED(bp)) {
		ASSERT3U(HDR_GET_PSIZE(hdr), ==, 0);
		ASSERT3U(arc_hdr_get_compress(hdr), ==, ZIO_COMPRESS_OFF);
	} else {
		if (HDR_COMPRESSION_ENABLED(hdr)) {
			ASSERT3U(arc_hdr_get_compress(hdr), ==,
			    BP_GET_COMPRESS(bp));
		}
		ASSERT3U(HDR_GET_LSIZE(hdr), ==,
		    BP_GET_LSIZE(bp));
		ASSERT3U(HDR_GET_PSIZE(hdr), ==, BP_GET_PSIZE(bp));
		ASSERT3U(!!HDR_PROTECTED(hdr), ==, BP_IS_PROTECTED(bp));
	}
}

/*
 * XXX this should be changed to return an error, and callers
 * re-read from disk on failure (on nondebug bits).
 */
static void
arc_hdr_verify_checksum(spa_t *spa, arc_buf_hdr_t *hdr, const blkptr_t *bp)
{
	arc_hdr_verify(hdr, bp);
	if (BP_IS_HOLE(bp) || BP_IS_EMBEDDED(bp))
		return;
	int err = 0;
	abd_t *abd = NULL;
	if (BP_IS_ENCRYPTED(bp)) {
		if (HDR_HAS_RABD(hdr)) {
			abd = hdr->b_crypt_hdr.b_rabd;
		}
	} else if (HDR_COMPRESSION_ENABLED(hdr)) {
		abd = hdr->b_l1hdr.b_pabd;
	}
	if (abd != NULL) {
		/*
		 * The offset is only used for labels, which are not
		 * cached in the ARC, so it doesn't matter what we
		 * pass for the offset parameter.
		 */
		int psize = HDR_GET_PSIZE(hdr);
		err = zio_checksum_error_impl(spa, bp,
		    BP_GET_CHECKSUM(bp), abd, psize, 0, NULL);
		if (err != 0) {
			/*
			 * Use abd_copy_to_buf() rather than
			 * abd_borrow_buf_copy() so that we are sure to
			 * include the buf in crash dumps.
			 */
			void *buf = kmem_alloc(psize, KM_SLEEP);
			abd_copy_to_buf(buf, abd, psize);
			panic("checksum of cached data doesn't match BP "
			    "err=%u hdr=%p bp=%p abd=%p buf=%p",
			    err, (void *)hdr, (void *)bp, (void *)abd, buf);
		}
	}
}

static void
arc_read_done(zio_t *zio)
{
	blkptr_t *bp = zio->io_bp;
	arc_buf_hdr_t *hdr = zio->io_private;
	kmutex_t *hash_lock = NULL;
	arc_callback_t *callback_list;
	arc_callback_t *acb;
	boolean_t freeable = B_FALSE;

	/*
	 * The hdr was inserted into hash-table and removed from lists
	 * prior to starting I/O.  We should find this header, since
	 * it's in the hash table, and it should be legit since it's
	 * not possible to evict it during the I/O.
	 * The only possible
	 * reason for it not to be found is if we were freed during the
	 * read.
	 */
	if (HDR_IN_HASH_TABLE(hdr)) {
		ASSERT3U(hdr->b_birth, ==, BP_PHYSICAL_BIRTH(zio->io_bp));
		ASSERT3U(hdr->b_dva.dva_word[0], ==,
		    BP_IDENTITY(zio->io_bp)->dva_word[0]);
		ASSERT3U(hdr->b_dva.dva_word[1], ==,
		    BP_IDENTITY(zio->io_bp)->dva_word[1]);

		arc_buf_hdr_t *found = buf_hash_find(hdr->b_spa, zio->io_bp,
		    &hash_lock);

		ASSERT((found == hdr &&
		    DVA_EQUAL(&hdr->b_dva, BP_IDENTITY(zio->io_bp))) ||
		    (found == hdr && HDR_L2_READING(hdr)));
		ASSERT3P(hash_lock, !=, NULL);
	}

	if (BP_IS_PROTECTED(bp)) {
		hdr->b_crypt_hdr.b_ot = BP_GET_TYPE(bp);
		hdr->b_crypt_hdr.b_dsobj = zio->io_bookmark.zb_objset;
		zio_crypt_decode_params_bp(bp, hdr->b_crypt_hdr.b_salt,
		    hdr->b_crypt_hdr.b_iv);

		if (BP_GET_TYPE(bp) == DMU_OT_INTENT_LOG) {
			void *tmpbuf;

			tmpbuf = abd_borrow_buf_copy(zio->io_abd,
			    sizeof (zil_chain_t));
			zio_crypt_decode_mac_zil(tmpbuf,
			    hdr->b_crypt_hdr.b_mac);
			abd_return_buf(zio->io_abd, tmpbuf,
			    sizeof (zil_chain_t));
		} else {
			zio_crypt_decode_mac_bp(bp, hdr->b_crypt_hdr.b_mac);
		}
	}

	if (zio->io_error == 0) {
		/* byteswap if necessary */
		if (BP_SHOULD_BYTESWAP(zio->io_bp)) {
			if (BP_GET_LEVEL(zio->io_bp) > 0) {
				hdr->b_l1hdr.b_byteswap = DMU_BSWAP_UINT64;
			} else {
				hdr->b_l1hdr.b_byteswap =
				    DMU_OT_BYTESWAP(BP_GET_TYPE(zio->io_bp));
			}
		} else {
			hdr->b_l1hdr.b_byteswap = DMU_BSWAP_NUMFUNCS;
		}
	}

	arc_hdr_clear_flags(hdr, ARC_FLAG_L2_EVICTED);
	if (l2arc_noprefetch && HDR_PREFETCH(hdr))
		arc_hdr_clear_flags(hdr, ARC_FLAG_L2CACHE);

	callback_list = hdr->b_l1hdr.b_acb;
	ASSERT3P(callback_list, !=, NULL);

	if (hash_lock && zio->io_error == 0 &&
	    hdr->b_l1hdr.b_state == arc_anon) {
		/*
		 * Only call
		 * arc_access on anonymous buffers.  This is because
		 * if we've issued an I/O for an evicted buffer, we've already
		 * called arc_access (to prevent any simultaneous readers from
		 * getting confused).
		 */
		arc_access(hdr, hash_lock);
	}

	/*
	 * If a read request has a callback (i.e. acb_done is not NULL), then we
	 * make a buf containing the data according to the parameters which were
	 * passed in. The implementation of arc_buf_alloc_impl() ensures that we
	 * aren't needlessly decompressing the data multiple times.
	 */
	int callback_cnt = 0;
	for (acb = callback_list; acb != NULL; acb = acb->acb_next) {
		if (!acb->acb_done)
			continue;

		callback_cnt++;

		if (zio->io_error != 0)
			continue;

		int error = arc_buf_alloc_impl(hdr, zio->io_spa,
		    &acb->acb_zb, acb->acb_private, acb->acb_encrypted,
		    acb->acb_compressed, acb->acb_noauth, B_TRUE,
		    &acb->acb_buf);

		/*
		 * Assert non-speculative zios didn't fail because an
		 * encryption key wasn't loaded
		 */
		ASSERT((zio->io_flags & ZIO_FLAG_SPECULATIVE) ||
		    error != EACCES);

		/*
		 * If we failed to decrypt, report an error now (as the zio
		 * layer would have done if it had done the transforms).
		 */
		if (error == ECKSUM) {
			ASSERT(BP_IS_PROTECTED(bp));
			error = SET_ERROR(EIO);
			if ((zio->io_flags & ZIO_FLAG_SPECULATIVE) == 0) {
				spa_log_error(zio->io_spa, &acb->acb_zb);
				(void) zfs_ereport_post(
				    FM_EREPORT_ZFS_AUTHENTICATION,
				    zio->io_spa, NULL, &acb->acb_zb, zio, 0, 0);
			}
		}

		if (error != 0) {
			/*
			 * Decompression failed.  Set io_error
			 * so that when we call acb_done (below),
			 * we will indicate that the read failed.
			 * Note that in the unusual case where one
			 * callback is compressed and another
			 * uncompressed, we will mark all of them
			 * as failed, even though the uncompressed
			 * one can't actually fail.  In this case,
			 * the hdr will not be anonymous, because
			 * if there are multiple callbacks, it's
			 * because multiple threads found the same
			 * arc buf in the hash table.
			 */
			zio->io_error = error;
		}
	}

	/*
	 * If there are multiple callbacks, we must have the hash lock,
	 * because the only way for multiple threads to find this hdr is
	 * in the hash table.  This ensures that if there are multiple
	 * callbacks, the hdr is not anonymous.  If it were anonymous,
	 * we couldn't use arc_buf_destroy() in the error case below.
	 */
	ASSERT(callback_cnt < 2 || hash_lock != NULL);

	hdr->b_l1hdr.b_acb = NULL;
	arc_hdr_clear_flags(hdr, ARC_FLAG_IO_IN_PROGRESS);
	if (callback_cnt == 0)
		ASSERT(hdr->b_l1hdr.b_pabd != NULL || HDR_HAS_RABD(hdr));

	ASSERT(zfs_refcount_is_zero(&hdr->b_l1hdr.b_refcnt) ||
	    callback_list != NULL);

	if (zio->io_error == 0) {
		arc_hdr_verify(hdr, zio->io_bp);
	} else {
		arc_hdr_set_flags(hdr, ARC_FLAG_IO_ERROR);
		if (hdr->b_l1hdr.b_state != arc_anon)
			arc_change_state(arc_anon, hdr, hash_lock);
		if (HDR_IN_HASH_TABLE(hdr))
			buf_hash_remove(hdr);
		freeable = zfs_refcount_is_zero(&hdr->b_l1hdr.b_refcnt);
	}

	/*
	 * Broadcast before we drop the hash_lock to avoid the possibility
	 * that the hdr (and hence the cv) might be freed before we get to
	 * the cv_broadcast().
	 */
	cv_broadcast(&hdr->b_l1hdr.b_cv);

	if (hash_lock != NULL) {
		mutex_exit(hash_lock);
	} else {
		/*
		 * This block was freed while we waited for the read to
		 * complete.
		 * It has been removed from the hash table and
		 * moved to the anonymous state (so that it won't show up
		 * in the cache).
		 */
		ASSERT3P(hdr->b_l1hdr.b_state, ==, arc_anon);
		freeable = zfs_refcount_is_zero(&hdr->b_l1hdr.b_refcnt);
	}

	/* execute each callback and free its structure */
	while ((acb = callback_list) != NULL) {

		if (acb->acb_done != NULL) {
			if (zio->io_error != 0 && acb->acb_buf != NULL) {
				/*
				 * If arc_buf_alloc_impl() fails during
				 * decompression, the buf will still be
				 * allocated, and needs to be freed here.
				 */
				arc_buf_destroy(acb->acb_buf, acb->acb_private);
				acb->acb_buf = NULL;
			}
			acb->acb_done(zio, &zio->io_bookmark, zio->io_bp,
			    acb->acb_buf, acb->acb_private);
		}

		if (acb->acb_zio_dummy != NULL) {
			acb->acb_zio_dummy->io_error = zio->io_error;
			zio_nowait(acb->acb_zio_dummy);
		}

		callback_list = acb->acb_next;
		kmem_free(acb, sizeof (arc_callback_t));
	}

	if (freeable)
		arc_hdr_destroy(hdr);
}

/*
 * "Read" the block at the specified DVA (in bp) via the
 * cache.  If the block is found in the cache, invoke the provided
 * callback immediately and return.  Note that the `zio' parameter
 * in the callback will be NULL in this case, since no IO was
 * required.  If the block is not in the cache pass the read request
 * on to the spa with a substitute callback function, so that the
 * requested block will be added to the cache.
 *
 * If a read request arrives for a block that has a read in-progress,
 * either wait for the in-progress read to complete (and return the
 * results); or, if this is a read with a "done" func, add a record
 * to the read to invoke the "done" func when the read completes,
 * and return; or just return.
 *
 * arc_read_done() will invoke all the requested "done" functions
 * for readers of this block.
 */
int
arc_read(zio_t *pio, spa_t *spa, const blkptr_t *bp, arc_read_done_func_t *done,
    void *private, zio_priority_t priority, int zio_flags,
    arc_flags_t *arc_flags, const zbookmark_phys_t *zb)
{
	arc_buf_hdr_t *hdr = NULL;
	kmutex_t *hash_lock = NULL;
	zio_t *rzio;
	uint64_t guid = spa_load_guid(spa);
	boolean_t compressed_read = (zio_flags & ZIO_FLAG_RAW_COMPRESS) != 0;
	boolean_t encrypted_read = BP_IS_ENCRYPTED(bp) &&
	    (zio_flags & ZIO_FLAG_RAW_ENCRYPT) != 0;
	boolean_t noauth_read = BP_IS_AUTHENTICATED(bp) &&
	    (zio_flags & ZIO_FLAG_RAW_ENCRYPT) != 0;
	int rc = 0;

	ASSERT(!BP_IS_EMBEDDED(bp) ||
	    BPE_GET_ETYPE(bp) == BP_EMBEDDED_TYPE_DATA);

top:
	if (!BP_IS_EMBEDDED(bp)) {
		/*
		 * Embedded BP's have no DVA and require no I/O to "read".
		 * Create an anonymous arc buf to back it.
		 */
		hdr = buf_hash_find(guid, bp, &hash_lock);
	}

	/*
	 * Determine if we have an L1 cache hit or a cache miss. For simplicity
	 * we maintain encrypted data separately from compressed / uncompressed
	 * data. If the user is requesting raw encrypted data and we don't have
	 * that in the header we will read from disk to guarantee that we can
	 * get it even if the encryption keys aren't loaded.
	 */
	if (hdr != NULL && HDR_HAS_L1HDR(hdr) && (HDR_HAS_RABD(hdr) ||
	    (hdr->b_l1hdr.b_pabd != NULL && !encrypted_read))) {
		arc_buf_t *buf = NULL;
		*arc_flags |= ARC_FLAG_CACHED;

		if (HDR_IO_IN_PROGRESS(hdr)) {
			zio_t *head_zio = hdr->b_l1hdr.b_acb->acb_zio_head;

			ASSERT3P(head_zio, !=, NULL);
			if ((hdr->b_flags & ARC_FLAG_PRIO_ASYNC_READ) &&
			    priority == ZIO_PRIORITY_SYNC_READ) {
				/*
				 * This is a sync read that needs to wait for
				 * an in-flight async read. Request that the
				 * zio have its priority upgraded.
				 */
				zio_change_priority(head_zio, priority);
				DTRACE_PROBE1(arc__async__upgrade__sync,
				    arc_buf_hdr_t *, hdr);
				ARCSTAT_BUMP(arcstat_async_upgrade_sync);
			}
			if (hdr->b_flags & ARC_FLAG_PREDICTIVE_PREFETCH) {
				arc_hdr_clear_flags(hdr,
				    ARC_FLAG_PREDICTIVE_PREFETCH);
			}

			if (*arc_flags & ARC_FLAG_WAIT) {
				cv_wait(&hdr->b_l1hdr.b_cv, hash_lock);
				mutex_exit(hash_lock);
				goto top;
			}
			ASSERT(*arc_flags & ARC_FLAG_NOWAIT);

			if (done) {
				arc_callback_t *acb = NULL;

				acb = kmem_zalloc(sizeof (arc_callback_t),
				    KM_SLEEP);
				acb->acb_done = done;
				acb->acb_private = private;
				acb->acb_compressed = compressed_read;
				acb->acb_encrypted = encrypted_read;
				acb->acb_noauth = noauth_read;
				acb->acb_zb = *zb;
				if (pio != NULL)
					acb->acb_zio_dummy = zio_null(pio,
					    spa, NULL, NULL, NULL, zio_flags);

				ASSERT3P(acb->acb_done, !=, NULL);
				acb->acb_zio_head = head_zio;
				acb->acb_next = hdr->b_l1hdr.b_acb;
				hdr->b_l1hdr.b_acb = acb;
				mutex_exit(hash_lock);
				return (0);
			}
			mutex_exit(hash_lock);
			return (0);
		}

		ASSERT(hdr->b_l1hdr.b_state == arc_mru ||
		    hdr->b_l1hdr.b_state == arc_mfu);

		if (done) {
			if (hdr->b_flags & ARC_FLAG_PREDICTIVE_PREFETCH) {
				/*
				 * This is a demand read which does not have to
				 * wait for i/o because we did a predictive
				 * prefetch i/o for it, which has completed.
				 */
				DTRACE_PROBE1(
				    arc__demand__hit__predictive__prefetch,
				    arc_buf_hdr_t *, hdr);
				ARCSTAT_BUMP(
				    arcstat_demand_hit_predictive_prefetch);
				arc_hdr_clear_flags(hdr,
				    ARC_FLAG_PREDICTIVE_PREFETCH);
			}

			if (hdr->b_flags & ARC_FLAG_PRESCIENT_PREFETCH) {
				ARCSTAT_BUMP(
				    arcstat_demand_hit_prescient_prefetch);
				arc_hdr_clear_flags(hdr,
				    ARC_FLAG_PRESCIENT_PREFETCH);
			}

			ASSERT(!BP_IS_EMBEDDED(bp) || !BP_IS_HOLE(bp));

			arc_hdr_verify_checksum(spa, hdr, bp);

			/* Get a buf with the desired data in it. */
			rc = arc_buf_alloc_impl(hdr, spa, zb, private,
			    encrypted_read, compressed_read, noauth_read,
			    B_TRUE, &buf);
			if (rc == ECKSUM) {
				/*
				 * Convert authentication and decryption errors
				 * to EIO (and generate an ereport if needed)
				 * before leaving the ARC.
				 */
				rc = SET_ERROR(EIO);
				if ((zio_flags & ZIO_FLAG_SPECULATIVE) == 0) {
					spa_log_error(spa, zb);
					(void) zfs_ereport_post(
					    FM_EREPORT_ZFS_AUTHENTICATION,
					    spa, NULL, zb, NULL, 0, 0);
				}
			}
			if (rc != 0) {
				(void) remove_reference(hdr, hash_lock,
				    private);
				arc_buf_destroy_impl(buf);
				buf = NULL;
			}
			/* assert any errors weren't due to unloaded keys */
			ASSERT((zio_flags & ZIO_FLAG_SPECULATIVE) ||
			    rc != EACCES);
		} else if (*arc_flags & ARC_FLAG_PREFETCH &&
		    zfs_refcount_count(&hdr->b_l1hdr.b_refcnt) == 0) {
			arc_hdr_set_flags(hdr, ARC_FLAG_PREFETCH);
		}
		DTRACE_PROBE1(arc__hit, arc_buf_hdr_t *, hdr);
		arc_access(hdr, hash_lock);
		if (*arc_flags & ARC_FLAG_PRESCIENT_PREFETCH)
			arc_hdr_set_flags(hdr, ARC_FLAG_PRESCIENT_PREFETCH);
		if (*arc_flags & ARC_FLAG_L2CACHE)
			arc_hdr_set_flags(hdr, ARC_FLAG_L2CACHE);
		mutex_exit(hash_lock);
		ARCSTAT_BUMP(arcstat_hits);
		ARCSTAT_CONDSTAT(!HDR_PREFETCH(hdr),
		    demand, prefetch, !HDR_ISTYPE_METADATA(hdr),
		    data, metadata,
		    hits);

		if (done)
			done(NULL, zb, bp, buf, private);
	} else {
		uint64_t lsize = BP_GET_LSIZE(bp);
		uint64_t psize = BP_GET_PSIZE(bp);
		arc_callback_t *acb;
		vdev_t *vd = NULL;
		uint64_t addr = 0;
		boolean_t devw = B_FALSE;
		uint64_t size;
		abd_t *hdr_abd;

		if (hdr == NULL) {
			/* this block is not in the cache */
			arc_buf_hdr_t *exists = NULL;
			arc_buf_contents_t type = BP_GET_BUFC_TYPE(bp);
			hdr = arc_hdr_alloc(spa_load_guid(spa), psize, lsize,
			    BP_IS_PROTECTED(bp), BP_GET_COMPRESS(bp), type,
			    encrypted_read);

			if (!BP_IS_EMBEDDED(bp)) {
				hdr->b_dva = *BP_IDENTITY(bp);
				hdr->b_birth = BP_PHYSICAL_BIRTH(bp);
				exists = buf_hash_insert(hdr, &hash_lock);
			}
			if (exists != NULL) {
				/* somebody beat us to the hash insert */
				mutex_exit(hash_lock);
				buf_discard_identity(hdr);
				arc_hdr_destroy(hdr);
				goto top; /* restart the IO request */
			}
		} else {
			/*
			 * This block is in the ghost cache or encrypted data
			 * was requested and we didn't have it. If it was
			 * L2-only (and thus didn't have an L1 hdr),
			 * we realloc the header to add an L1 hdr.
			 */
			if (!HDR_HAS_L1HDR(hdr)) {
				hdr = arc_hdr_realloc(hdr, hdr_l2only_cache,
				    hdr_full_cache);
			}

			if (GHOST_STATE(hdr->b_l1hdr.b_state)) {
				ASSERT3P(hdr->b_l1hdr.b_pabd, ==, NULL);
				ASSERT(!HDR_HAS_RABD(hdr));
				ASSERT(!HDR_IO_IN_PROGRESS(hdr));
				ASSERT0(zfs_refcount_count(
				    &hdr->b_l1hdr.b_refcnt));
				ASSERT3P(hdr->b_l1hdr.b_buf, ==, NULL);
				ASSERT3P(hdr->b_l1hdr.b_freeze_cksum, ==, NULL);
			} else if (HDR_IO_IN_PROGRESS(hdr)) {
				/*
				 * If this header already had an IO in progress
				 * and we are performing another IO to fetch
				 * encrypted data we must wait until the first
				 * IO completes so as not to confuse
				 * arc_read_done().
				 * This should be very rare
				 * and so the performance impact shouldn't
				 * matter.
				 */
				cv_wait(&hdr->b_l1hdr.b_cv, hash_lock);
				mutex_exit(hash_lock);
				goto top;
			}

			/*
			 * This is a delicate dance that we play here.
			 * This hdr might be in the ghost list so we access
			 * it to move it out of the ghost list before we
			 * initiate the read. If it's a prefetch then
			 * it won't have a callback so we'll remove the
			 * reference that arc_buf_alloc_impl() created. We
			 * do this after we've called arc_access() to
			 * avoid hitting an assert in remove_reference().
			 */
			arc_access(hdr, hash_lock);
			arc_hdr_alloc_pabd(hdr, encrypted_read);
		}

		if (encrypted_read) {
			ASSERT(HDR_HAS_RABD(hdr));
			size = HDR_GET_PSIZE(hdr);
			hdr_abd = hdr->b_crypt_hdr.b_rabd;
			zio_flags |= ZIO_FLAG_RAW;
		} else {
			ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL);
			size = arc_hdr_size(hdr);
			hdr_abd = hdr->b_l1hdr.b_pabd;

			if (arc_hdr_get_compress(hdr) != ZIO_COMPRESS_OFF) {
				zio_flags |= ZIO_FLAG_RAW_COMPRESS;
			}

			/*
			 * For authenticated bp's, we do not ask the ZIO layer
			 * to authenticate them since this will cause the entire
			 * IO to fail if the key isn't loaded. Instead, we
			 * defer authentication until arc_buf_fill(), which will
			 * verify the data when the key is available.
			 */
			if (BP_IS_AUTHENTICATED(bp))
				zio_flags |= ZIO_FLAG_RAW_ENCRYPT;
		}

		if (*arc_flags & ARC_FLAG_PREFETCH &&
		    zfs_refcount_is_zero(&hdr->b_l1hdr.b_refcnt))
			arc_hdr_set_flags(hdr, ARC_FLAG_PREFETCH);
		if (*arc_flags & ARC_FLAG_PRESCIENT_PREFETCH)
			arc_hdr_set_flags(hdr, ARC_FLAG_PRESCIENT_PREFETCH);

		if (*arc_flags & ARC_FLAG_L2CACHE)
			arc_hdr_set_flags(hdr, ARC_FLAG_L2CACHE);
		if (BP_IS_AUTHENTICATED(bp))
			arc_hdr_set_flags(hdr, ARC_FLAG_NOAUTH);
		if (BP_GET_LEVEL(bp) > 0)
			arc_hdr_set_flags(hdr, ARC_FLAG_INDIRECT);
		if (*arc_flags & ARC_FLAG_PREDICTIVE_PREFETCH)
			arc_hdr_set_flags(hdr, ARC_FLAG_PREDICTIVE_PREFETCH);
		ASSERT(!GHOST_STATE(hdr->b_l1hdr.b_state));

		acb = kmem_zalloc(sizeof (arc_callback_t), KM_SLEEP);
		acb->acb_done = done;
		acb->acb_private = private;
		acb->acb_compressed = compressed_read;
		acb->acb_encrypted = encrypted_read;
		acb->acb_noauth = noauth_read;
		acb->acb_zb = *zb;

		ASSERT3P(hdr->b_l1hdr.b_acb, ==, NULL);
		hdr->b_l1hdr.b_acb = acb;
		arc_hdr_set_flags(hdr, ARC_FLAG_IO_IN_PROGRESS);

		if (HDR_HAS_L2HDR(hdr) &&
		    (vd = hdr->b_l2hdr.b_dev->l2ad_vdev) != NULL) {
			devw = hdr->b_l2hdr.b_dev->l2ad_writing;
			addr = hdr->b_l2hdr.b_daddr;
			/*
			 * Lock out L2ARC device removal.
			 */
			if (vdev_is_dead(vd) ||
			    !spa_config_tryenter(spa, SCL_L2ARC, vd, RW_READER))
				vd = NULL;
		}

		/*
		 * We count both async reads and scrub IOs as asynchronous so
		 * that both can be upgraded in the event of a cache hit while
		 * the read IO is still in-flight.
		 */
		if (priority == ZIO_PRIORITY_ASYNC_READ ||
		    priority == ZIO_PRIORITY_SCRUB)
			arc_hdr_set_flags(hdr, ARC_FLAG_PRIO_ASYNC_READ);
		else
			arc_hdr_clear_flags(hdr, ARC_FLAG_PRIO_ASYNC_READ);

		/*
		 * At this point, we have a level 1 cache miss. Try again in
		 * L2ARC if possible.
		 */
		ASSERT3U(HDR_GET_LSIZE(hdr), ==, lsize);

		DTRACE_PROBE4(arc__miss, arc_buf_hdr_t *, hdr, blkptr_t *, bp,
		    uint64_t, lsize, zbookmark_phys_t *, zb);
		ARCSTAT_BUMP(arcstat_misses);
		ARCSTAT_CONDSTAT(!HDR_PREFETCH(hdr),
		    demand, prefetch, !HDR_ISTYPE_METADATA(hdr),
		    data, metadata, misses);

		if (vd != NULL && l2arc_ndev != 0 && !(l2arc_norw && devw)) {
			/*
			 * Read from the L2ARC if the following are true:
			 * 1. The L2ARC vdev was previously cached.
			 * 2. This buffer still has L2ARC metadata.
			 * 3. This buffer isn't currently writing to the L2ARC.
			 * 4. The L2ARC entry wasn't evicted, which may
			 *    also have invalidated the vdev.
			 * 5. This isn't a prefetch, or l2arc_noprefetch is
			 *    not set.
			 */
			if (HDR_HAS_L2HDR(hdr) &&
			    !HDR_L2_WRITING(hdr) && !HDR_L2_EVICTED(hdr) &&
			    !(l2arc_noprefetch && HDR_PREFETCH(hdr))) {
				l2arc_read_callback_t *cb;
				abd_t *abd;
				uint64_t asize;

				DTRACE_PROBE1(l2arc__hit, arc_buf_hdr_t *, hdr);
				ARCSTAT_BUMP(arcstat_l2_hits);

				cb = kmem_zalloc(sizeof (l2arc_read_callback_t),
				    KM_SLEEP);
				cb->l2rcb_hdr = hdr;
				cb->l2rcb_bp = *bp;
				cb->l2rcb_zb = *zb;
				cb->l2rcb_flags = zio_flags;

				asize = vdev_psize_to_asize(vd, size);
				if (asize != size) {
					abd = abd_alloc_for_io(asize,
					    HDR_ISTYPE_METADATA(hdr));
					cb->l2rcb_abd = abd;
				} else {
					abd = hdr_abd;
				}

				ASSERT(addr >= VDEV_LABEL_START_SIZE &&
				    addr + asize <= vd->vdev_psize -
				    VDEV_LABEL_END_SIZE);

				/*
				 * l2arc read. The SCL_L2ARC lock will be
				 * released by l2arc_read_done().
				 * Issue a null zio if the underlying buffer
				 * was squashed to zero size by compression.
				 */
				ASSERT3U(arc_hdr_get_compress(hdr), !=,
				    ZIO_COMPRESS_EMPTY);
				rzio = zio_read_phys(pio, vd, addr,
				    asize, abd,
				    ZIO_CHECKSUM_OFF,
				    l2arc_read_done, cb, priority,
				    zio_flags | ZIO_FLAG_DONT_CACHE |
				    ZIO_FLAG_CANFAIL |
				    ZIO_FLAG_DONT_PROPAGATE |
				    ZIO_FLAG_DONT_RETRY, B_FALSE);
				acb->acb_zio_head = rzio;

				if (hash_lock != NULL)
					mutex_exit(hash_lock);

				DTRACE_PROBE2(l2arc__read, vdev_t *, vd,
				    zio_t *, rzio);
				ARCSTAT_INCR(arcstat_l2_read_bytes,
				    HDR_GET_PSIZE(hdr));

				if (*arc_flags & ARC_FLAG_NOWAIT) {
					zio_nowait(rzio);
					return (0);
				}

				ASSERT(*arc_flags & ARC_FLAG_WAIT);
				if (zio_wait(rzio) == 0)
					return (0);

				/* l2arc read error; goto zio_read() */
				if (hash_lock != NULL)
					mutex_enter(hash_lock);
			} else {
				DTRACE_PROBE1(l2arc__miss,
				    arc_buf_hdr_t *, hdr);
				ARCSTAT_BUMP(arcstat_l2_misses);
				if (HDR_L2_WRITING(hdr))
					ARCSTAT_BUMP(arcstat_l2_rw_clash);
				spa_config_exit(spa, SCL_L2ARC, vd);
			}
		} else {
			if (vd != NULL)
				spa_config_exit(spa, SCL_L2ARC, vd);
			if (l2arc_ndev != 0) {
				DTRACE_PROBE1(l2arc__miss,
				    arc_buf_hdr_t *, hdr);
				ARCSTAT_BUMP(arcstat_l2_misses);
			}
		}

		rzio = zio_read(pio, spa, bp, hdr_abd, size,
		    arc_read_done, hdr, priority, zio_flags, zb);
		acb->acb_zio_head = rzio;

		if (hash_lock != NULL)
			mutex_exit(hash_lock);

		if (*arc_flags & ARC_FLAG_WAIT)
			return (zio_wait(rzio));

		ASSERT(*arc_flags & ARC_FLAG_NOWAIT);
		zio_nowait(rzio);
	}
	return (rc);
}

/*
 * Notify the arc that a block was freed, and thus will never be used again.
 */
void
arc_freed(spa_t *spa, const blkptr_t *bp)
{
	arc_buf_hdr_t *hdr;
	kmutex_t *hash_lock;
	uint64_t guid = spa_load_guid(spa);

	ASSERT(!BP_IS_EMBEDDED(bp));

	hdr = buf_hash_find(guid, bp, &hash_lock);
	if (hdr == NULL)
		return;

	/*
	 * We might be trying to free a block that is still doing I/O
	 * (i.e. prefetch) or has a reference (i.e. a dedup-ed,
	 * dmu_sync-ed block). If this block is being prefetched, then it
	 * would still have the ARC_FLAG_IO_IN_PROGRESS flag set on the hdr
	 * until the I/O completes. A block may also have a reference if it is
	 * part of a dedup-ed, dmu_synced write. The dmu_sync() function would
	 * have written the new block to its final resting place on disk but
	 * without the dedup flag set. This would have left the hdr in the MRU
	 * state and discoverable. When the txg finally syncs it detects that
	 * the block was overridden in open context and issues an override I/O.
	 * Since this is a dedup block, the override I/O will determine if the
	 * block is already in the DDT. If so, then it will replace the io_bp
	 * with the bp from the DDT and allow the I/O to finish. When the I/O
	 * reaches the done callback, dbuf_write_override_done, it will
	 * check to see if the io_bp and io_bp_override are identical.
	 * If they are not, then it indicates that the bp was replaced with
	 * the bp in the DDT and the override bp is freed. This allows
	 * us to arrive here with a reference on a block that is being
	 * freed. So if we have an I/O in progress, or a reference to
	 * this hdr, then we don't destroy the hdr.
	 */
	if (!HDR_HAS_L1HDR(hdr) || (!HDR_IO_IN_PROGRESS(hdr) &&
	    zfs_refcount_is_zero(&hdr->b_l1hdr.b_refcnt))) {
		arc_change_state(arc_anon, hdr, hash_lock);
		arc_hdr_destroy(hdr);
		mutex_exit(hash_lock);
	} else {
		mutex_exit(hash_lock);
	}

}

/*
 * Release this buffer from the cache, making it an anonymous buffer. This
 * must be done after a read and prior to modifying the buffer contents.
 * If the buffer has more than one reference, we must make
 * a new hdr for the buffer.
 */
void
arc_release(arc_buf_t *buf, void *tag)
{
	arc_buf_hdr_t *hdr = buf->b_hdr;

	/*
	 * It would be nice to assert that if it's DMU metadata (level >
	 * 0 || it's the dnode file), then it must be syncing context.
	 * But we don't know that information at this level.
	 */

	mutex_enter(&buf->b_evict_lock);

	ASSERT(HDR_HAS_L1HDR(hdr));

	/*
	 * We don't grab the hash lock prior to this check, because if
	 * the buffer's header is in the arc_anon state, it won't be
	 * linked into the hash table.
	 */
	if (hdr->b_l1hdr.b_state == arc_anon) {
		mutex_exit(&buf->b_evict_lock);
		/*
		 * If we are called from dmu_convert_mdn_block_to_raw(),
		 * a write might be in progress. This is OK because
		 * the caller won't change the content of this buffer,
		 * only the flags (via arc_convert_to_raw()).
		 */
		/* ASSERT(!HDR_IO_IN_PROGRESS(hdr)); */
		ASSERT(!HDR_IN_HASH_TABLE(hdr));
		ASSERT(!HDR_HAS_L2HDR(hdr));
		ASSERT(HDR_EMPTY(hdr));

		ASSERT3U(hdr->b_l1hdr.b_bufcnt, ==, 1);
		ASSERT3S(zfs_refcount_count(&hdr->b_l1hdr.b_refcnt), ==, 1);
		ASSERT(!list_link_active(&hdr->b_l1hdr.b_arc_node));

		hdr->b_l1hdr.b_arc_access = 0;

		/*
		 * If the buf is being overridden then it may already
		 * have a hdr that is not empty.
		 */
		buf_discard_identity(hdr);
		arc_buf_thaw(buf);

		return;
	}

	kmutex_t *hash_lock = HDR_LOCK(hdr);
	mutex_enter(hash_lock);

	/*
	 * This assignment is only valid as long as the hash_lock is
	 * held; we must be careful not to reference state or the
	 * b_state field after dropping the lock.
	 */
	arc_state_t *state = hdr->b_l1hdr.b_state;
	ASSERT3P(hash_lock, ==, HDR_LOCK(hdr));
	ASSERT3P(state, !=, arc_anon);

	/* this buffer is not on any list */
	ASSERT3S(zfs_refcount_count(&hdr->b_l1hdr.b_refcnt), >, 0);

	if (HDR_HAS_L2HDR(hdr)) {
		mutex_enter(&hdr->b_l2hdr.b_dev->l2ad_mtx);

		/*
		 * We have to recheck this conditional now that
		 * we're holding the l2ad_mtx to prevent a race with
		 * another thread which might be concurrently calling
		 * l2arc_evict(). In that case, l2arc_evict() might have
		 * destroyed the header's L2 portion as we were waiting
		 * to acquire the l2ad_mtx.
		 */
		if (HDR_HAS_L2HDR(hdr))
			arc_hdr_l2hdr_destroy(hdr);

		mutex_exit(&hdr->b_l2hdr.b_dev->l2ad_mtx);
	}

	/*
	 * Do we have more than one buf?
	 */
	if (hdr->b_l1hdr.b_bufcnt > 1) {
		arc_buf_hdr_t *nhdr;
		uint64_t spa = hdr->b_spa;
		uint64_t psize = HDR_GET_PSIZE(hdr);
		uint64_t lsize = HDR_GET_LSIZE(hdr);
		boolean_t protected = HDR_PROTECTED(hdr);
		enum zio_compress compress = arc_hdr_get_compress(hdr);
		arc_buf_contents_t type = arc_buf_type(hdr);
		VERIFY3U(hdr->b_type, ==, type);

		ASSERT(hdr->b_l1hdr.b_buf != buf || buf->b_next != NULL);
		(void) remove_reference(hdr, hash_lock, tag);

		if (arc_buf_is_shared(buf) && !ARC_BUF_COMPRESSED(buf)) {
			ASSERT3P(hdr->b_l1hdr.b_buf, !=, buf);
			ASSERT(ARC_BUF_LAST(buf));
		}

		/*
		 * Pull the data off of this hdr and attach it to
		 * a new anonymous hdr.
		 * Also find the last buffer
		 * in the hdr's buffer list.
		 */
		arc_buf_t *lastbuf = arc_buf_remove(hdr, buf);
		ASSERT3P(lastbuf, !=, NULL);

		/*
		 * If the current arc_buf_t and the hdr are sharing their data
		 * buffer, then we must stop sharing that block.
		 */
		if (arc_buf_is_shared(buf)) {
			ASSERT3P(hdr->b_l1hdr.b_buf, !=, buf);
			VERIFY(!arc_buf_is_shared(lastbuf));

			/*
			 * First, sever the block sharing relationship between
			 * buf and the arc_buf_hdr_t.
			 */
			arc_unshare_buf(hdr, buf);

			/*
			 * Now we need to recreate the hdr's b_pabd. Since we
			 * have lastbuf handy, we try to share with it, but if
			 * we can't then we allocate a new b_pabd and copy the
			 * data from buf into it.
			 */
			if (arc_can_share(hdr, lastbuf)) {
				arc_share_buf(hdr, lastbuf);
			} else {
				arc_hdr_alloc_pabd(hdr, B_FALSE);
				abd_copy_from_buf(hdr->b_l1hdr.b_pabd,
				    buf->b_data, psize);
			}
			VERIFY3P(lastbuf->b_data, !=, NULL);
		} else if (HDR_SHARED_DATA(hdr)) {
			/*
			 * Uncompressed shared buffers are always at the end
			 * of the list. Compressed buffers don't have the
			 * same requirements. This makes it hard to
			 * simply assert that the lastbuf is shared so
			 * we rely on the hdr's compression flags to determine
			 * if we have a compressed, shared buffer.
			 */
			ASSERT(arc_buf_is_shared(lastbuf) ||
			    arc_hdr_get_compress(hdr) != ZIO_COMPRESS_OFF);
			ASSERT(!ARC_BUF_SHARED(buf));
		}
		ASSERT(hdr->b_l1hdr.b_pabd != NULL || HDR_HAS_RABD(hdr));
		ASSERT3P(state, !=, arc_l2c_only);

		(void) zfs_refcount_remove_many(&state->arcs_size,
		    arc_buf_size(buf), buf);

		if (zfs_refcount_is_zero(&hdr->b_l1hdr.b_refcnt)) {
			ASSERT3P(state, !=, arc_l2c_only);
			(void) zfs_refcount_remove_many(
			    &state->arcs_esize[type],
			    arc_buf_size(buf), buf);
		}

		hdr->b_l1hdr.b_bufcnt -= 1;
		if (ARC_BUF_ENCRYPTED(buf))
			hdr->b_crypt_hdr.b_ebufcnt -= 1;

		arc_cksum_verify(buf);
		arc_buf_unwatch(buf);

		/* if this is the last uncompressed buf free the checksum */
		if (!arc_hdr_has_uncompressed_buf(hdr))
			arc_cksum_free(hdr);

		mutex_exit(hash_lock);

		/*
		 * Allocate a new hdr. The new hdr will contain a b_pabd
		 * buffer which will be freed in arc_write().
		 */
		nhdr = arc_hdr_alloc(spa, psize, lsize, protected,
		    compress, type, HDR_HAS_RABD(hdr));
		ASSERT3P(nhdr->b_l1hdr.b_buf, ==, NULL);
		ASSERT0(nhdr->b_l1hdr.b_bufcnt);
		ASSERT0(zfs_refcount_count(&nhdr->b_l1hdr.b_refcnt));
		VERIFY3U(nhdr->b_type, ==, type);
		ASSERT(!HDR_SHARED_DATA(nhdr));

		nhdr->b_l1hdr.b_buf = buf;
		nhdr->b_l1hdr.b_bufcnt = 1;
		if (ARC_BUF_ENCRYPTED(buf))
			nhdr->b_crypt_hdr.b_ebufcnt = 1;
		(void) zfs_refcount_add(&nhdr->b_l1hdr.b_refcnt, tag);
		buf->b_hdr = nhdr;

		mutex_exit(&buf->b_evict_lock);
		(void) zfs_refcount_add_many(&arc_anon->arcs_size,
		    arc_buf_size(buf), buf);
	} else {
		mutex_exit(&buf->b_evict_lock);
		ASSERT(zfs_refcount_count(&hdr->b_l1hdr.b_refcnt) == 1);
		/* protected by hash lock, or hdr is on arc_anon */
		ASSERT(!multilist_link_active(&hdr->b_l1hdr.b_arc_node));
		ASSERT(!HDR_IO_IN_PROGRESS(hdr));
		arc_change_state(arc_anon, hdr, hash_lock);
		hdr->b_l1hdr.b_arc_access = 0;

		mutex_exit(hash_lock);
		buf_discard_identity(hdr);
		arc_buf_thaw(buf);
	}
}

int
arc_released(arc_buf_t *buf)
{
	int released;

	mutex_enter(&buf->b_evict_lock);
	released = (buf->b_data != NULL &&
	    buf->b_hdr->b_l1hdr.b_state == arc_anon);
	mutex_exit(&buf->b_evict_lock);
	return (released);
}

#ifdef ZFS_DEBUG
int
arc_referenced(arc_buf_t *buf)
{
	int referenced;

	mutex_enter(&buf->b_evict_lock);
	referenced = (zfs_refcount_count(&buf->b_hdr->b_l1hdr.b_refcnt));
	mutex_exit(&buf->b_evict_lock);
	return (referenced);
}
#endif

static void
arc_write_ready(zio_t *zio)
{
	arc_write_callback_t *callback = zio->io_private;
	arc_buf_t *buf = callback->awcb_buf;
	arc_buf_hdr_t *hdr = buf->b_hdr;
	blkptr_t *bp = zio->io_bp;
	uint64_t psize = BP_IS_HOLE(bp) ?
	    0 : BP_GET_PSIZE(bp);

	ASSERT(HDR_HAS_L1HDR(hdr));
	ASSERT(!zfs_refcount_is_zero(&buf->b_hdr->b_l1hdr.b_refcnt));
	ASSERT(hdr->b_l1hdr.b_bufcnt > 0);

	/*
	 * If we're reexecuting this zio because the pool suspended, then
	 * clean up any state that was previously set the first time the
	 * callback was invoked.
	 */
	if (zio->io_flags & ZIO_FLAG_REEXECUTED) {
		arc_cksum_free(hdr);
		arc_buf_unwatch(buf);
		if (hdr->b_l1hdr.b_pabd != NULL) {
			if (arc_buf_is_shared(buf)) {
				arc_unshare_buf(hdr, buf);
			} else {
				arc_hdr_free_pabd(hdr, B_FALSE);
			}
		}

		if (HDR_HAS_RABD(hdr))
			arc_hdr_free_pabd(hdr, B_TRUE);
	}
	ASSERT3P(hdr->b_l1hdr.b_pabd, ==, NULL);
	ASSERT(!HDR_HAS_RABD(hdr));
	ASSERT(!HDR_SHARED_DATA(hdr));
	ASSERT(!arc_buf_is_shared(buf));

	callback->awcb_ready(zio, buf, callback->awcb_private);

	if (HDR_IO_IN_PROGRESS(hdr))
		ASSERT(zio->io_flags & ZIO_FLAG_REEXECUTED);

	arc_hdr_set_flags(hdr, ARC_FLAG_IO_IN_PROGRESS);

	if (BP_IS_PROTECTED(bp) != !!HDR_PROTECTED(hdr))
		hdr = arc_hdr_realloc_crypt(hdr, BP_IS_PROTECTED(bp));

	if (BP_IS_PROTECTED(bp)) {
		/* ZIL blocks are written through zio_rewrite */
		ASSERT3U(BP_GET_TYPE(bp), !=, DMU_OT_INTENT_LOG);
		ASSERT(HDR_PROTECTED(hdr));

		if (BP_SHOULD_BYTESWAP(bp)) {
			if (BP_GET_LEVEL(bp) > 0) {
				hdr->b_l1hdr.b_byteswap = DMU_BSWAP_UINT64;
			} else {
				hdr->b_l1hdr.b_byteswap =
				    DMU_OT_BYTESWAP(BP_GET_TYPE(bp));
			}
		} else {
			hdr->b_l1hdr.b_byteswap = DMU_BSWAP_NUMFUNCS;
		}

		hdr->b_crypt_hdr.b_ot = BP_GET_TYPE(bp);
		hdr->b_crypt_hdr.b_dsobj = zio->io_bookmark.zb_objset;
		zio_crypt_decode_params_bp(bp, hdr->b_crypt_hdr.b_salt,
		    hdr->b_crypt_hdr.b_iv);
		zio_crypt_decode_mac_bp(bp, hdr->b_crypt_hdr.b_mac);
	}

	/*
	 * If this block was written for raw encryption but the zio
	 * layer ended up only authenticating it, adjust the buffer flags
	 * now.
	 */
	if (BP_IS_AUTHENTICATED(bp) && ARC_BUF_ENCRYPTED(buf)) {
		arc_hdr_set_flags(hdr, ARC_FLAG_NOAUTH);
		buf->b_flags &= ~ARC_BUF_FLAG_ENCRYPTED;
		if (BP_GET_COMPRESS(bp) == ZIO_COMPRESS_OFF)
			buf->b_flags &= ~ARC_BUF_FLAG_COMPRESSED;
	} else if (BP_IS_HOLE(bp) && ARC_BUF_ENCRYPTED(buf)) {
		buf->b_flags &= ~ARC_BUF_FLAG_ENCRYPTED;
		buf->b_flags &= ~ARC_BUF_FLAG_COMPRESSED;
	}

	/* this must be done after the buffer flags are adjusted */
	arc_cksum_compute(buf);

	enum zio_compress compress;
	if (BP_IS_HOLE(bp) || BP_IS_EMBEDDED(bp)) {
		compress = ZIO_COMPRESS_OFF;
	} else {
		ASSERT3U(HDR_GET_LSIZE(hdr), ==, BP_GET_LSIZE(bp));
		compress = BP_GET_COMPRESS(bp);
	}
	HDR_SET_PSIZE(hdr, psize);
	arc_hdr_set_compress(hdr, compress);

	if (zio->io_error != 0 || psize == 0)
		goto out;

	/*
	 * Fill the hdr with data. If the buffer is encrypted we have no choice
	 * but to copy the data into b_rabd. If the hdr is compressed, the data
	 * we want is available from the zio, otherwise we can take it from
	 * the buf.
	 *
	 * We might be able to share the buf's data with the hdr here. However,
	 * doing so would cause the ARC to be full of linear ABDs if we write a
	 * lot of shareable data. As a compromise, we check whether scattered
	 * ABDs are allowed, and assume that if they are then the user wants
	 * the ARC to be primarily filled with them regardless of the data being
	 * written. Therefore, if they're allowed then we allocate one and copy
	 * the data into it; otherwise, we share the data directly if we can.
	 */
	if (ARC_BUF_ENCRYPTED(buf)) {
		ASSERT3U(psize, >, 0);
		ASSERT(ARC_BUF_COMPRESSED(buf));
		arc_hdr_alloc_pabd(hdr, B_TRUE);
		abd_copy(hdr->b_crypt_hdr.b_rabd, zio->io_abd, psize);
	} else if (zfs_abd_scatter_enabled || !arc_can_share(hdr, buf)) {
		/*
		 * Ideally, we would always copy the io_abd into b_pabd, but the
		 * user may have disabled compressed ARC, thus we must check the
		 * hdr's compression setting rather than the io_bp's.
		 */
		if (BP_IS_ENCRYPTED(bp)) {
			ASSERT3U(psize, >, 0);
			arc_hdr_alloc_pabd(hdr, B_TRUE);
			abd_copy(hdr->b_crypt_hdr.b_rabd, zio->io_abd, psize);
		} else if (arc_hdr_get_compress(hdr) != ZIO_COMPRESS_OFF &&
		    !ARC_BUF_COMPRESSED(buf)) {
			ASSERT3U(psize, >, 0);
			arc_hdr_alloc_pabd(hdr, B_FALSE);
			abd_copy(hdr->b_l1hdr.b_pabd, zio->io_abd, psize);
		} else {
			ASSERT3U(zio->io_orig_size, ==, arc_hdr_size(hdr));
			arc_hdr_alloc_pabd(hdr, B_FALSE);
			abd_copy_from_buf(hdr->b_l1hdr.b_pabd, buf->b_data,
			    arc_buf_size(buf));
		}
	} else {
		ASSERT3P(buf->b_data, ==, abd_to_buf(zio->io_orig_abd));
		ASSERT3U(zio->io_orig_size, ==, arc_buf_size(buf));
		ASSERT3U(hdr->b_l1hdr.b_bufcnt, ==, 1);
		arc_share_buf(hdr, buf);
	}

out:
	arc_hdr_verify(hdr, bp);
}

static void
arc_write_children_ready(zio_t *zio)
{
	arc_write_callback_t *callback = zio->io_private;
	arc_buf_t *buf = callback->awcb_buf;

	callback->awcb_children_ready(zio, buf, callback->awcb_private);
}

/*
 * The SPA calls this callback for each physical write that happens on behalf
 * of a logical write. See the comment in dbuf_write_physdone() for details.
 */
static void
arc_write_physdone(zio_t *zio)
{
	arc_write_callback_t *cb = zio->io_private;
	if (cb->awcb_physdone != NULL)
		cb->awcb_physdone(zio, cb->awcb_buf, cb->awcb_private);
}

static void
arc_write_done(zio_t *zio)
{
	arc_write_callback_t *callback = zio->io_private;
	arc_buf_t *buf = callback->awcb_buf;
	arc_buf_hdr_t *hdr = buf->b_hdr;

	ASSERT3P(hdr->b_l1hdr.b_acb, ==, NULL);

	if (zio->io_error == 0) {
		arc_hdr_verify(hdr, zio->io_bp);

		if (BP_IS_HOLE(zio->io_bp) || BP_IS_EMBEDDED(zio->io_bp)) {
			buf_discard_identity(hdr);
		} else {
			hdr->b_dva = *BP_IDENTITY(zio->io_bp);
			hdr->b_birth = BP_PHYSICAL_BIRTH(zio->io_bp);
		}
	} else {
		ASSERT(HDR_EMPTY(hdr));
	}

	/*
	 * If the block to be written was all-zero or compressed enough to be
	 * embedded in the BP, no write was performed so there will be no
	 * dva/birth/checksum. The buffer must therefore remain anonymous
	 * (and uncached).
	 */
	if (!HDR_EMPTY(hdr)) {
		arc_buf_hdr_t *exists;
		kmutex_t *hash_lock;

		ASSERT3U(zio->io_error, ==, 0);

		arc_cksum_verify(buf);

		exists = buf_hash_insert(hdr, &hash_lock);
		if (exists != NULL) {
			/*
			 * This can only happen if we overwrite for
			 * sync-to-convergence, because we remove
			 * buffers from the hash table when we arc_free().
			 */
			if (zio->io_flags & ZIO_FLAG_IO_REWRITE) {
				if (!BP_EQUAL(&zio->io_bp_orig, zio->io_bp))
					panic("bad overwrite, hdr=%p exists=%p",
					    (void *)hdr, (void *)exists);
				ASSERT(zfs_refcount_is_zero(
				    &exists->b_l1hdr.b_refcnt));
				arc_change_state(arc_anon, exists, hash_lock);
				arc_hdr_destroy(exists);
				mutex_exit(hash_lock);
				exists = buf_hash_insert(hdr, &hash_lock);
				ASSERT3P(exists, ==, NULL);
			} else if (zio->io_flags & ZIO_FLAG_NOPWRITE) {
				/* nopwrite */
				ASSERT(zio->io_prop.zp_nopwrite);
				if (!BP_EQUAL(&zio->io_bp_orig, zio->io_bp))
					panic("bad nopwrite, hdr=%p exists=%p",
					    (void *)hdr, (void *)exists);
			} else {
				/* Dedup */
				ASSERT(hdr->b_l1hdr.b_bufcnt == 1);
				ASSERT(hdr->b_l1hdr.b_state == arc_anon);
				ASSERT(BP_GET_DEDUP(zio->io_bp));
				ASSERT(BP_GET_LEVEL(zio->io_bp) == 0);
			}
		}
		arc_hdr_clear_flags(hdr, ARC_FLAG_IO_IN_PROGRESS);
		/* if it's not anon, we are doing a scrub */
		if (exists == NULL && hdr->b_l1hdr.b_state == arc_anon)
			arc_access(hdr, hash_lock);
		mutex_exit(hash_lock);
	} else {
		arc_hdr_clear_flags(hdr, ARC_FLAG_IO_IN_PROGRESS);
	}

	ASSERT(!zfs_refcount_is_zero(&hdr->b_l1hdr.b_refcnt));
	callback->awcb_done(zio, buf, callback->awcb_private);

	abd_put(zio->io_abd);
	kmem_free(callback, sizeof (arc_write_callback_t));
}

zio_t *
arc_write(zio_t *pio, spa_t *spa, uint64_t txg, blkptr_t *bp, arc_buf_t *buf,
    boolean_t l2arc, const zio_prop_t *zp, arc_write_done_func_t *ready,
    arc_write_done_func_t *children_ready, arc_write_done_func_t *physdone,
    arc_write_done_func_t *done, void *private, zio_priority_t priority,
    int zio_flags, const zbookmark_phys_t *zb)
{
	arc_buf_hdr_t *hdr = buf->b_hdr;
	arc_write_callback_t *callback;
	zio_t *zio;
	zio_prop_t localprop = *zp;

	ASSERT3P(ready, !=, NULL);
	ASSERT3P(done, !=,
	    NULL);
	ASSERT(!HDR_IO_ERROR(hdr));
	ASSERT(!HDR_IO_IN_PROGRESS(hdr));
	ASSERT3P(hdr->b_l1hdr.b_acb, ==, NULL);
	ASSERT3U(hdr->b_l1hdr.b_bufcnt, >, 0);
	if (l2arc)
		arc_hdr_set_flags(hdr, ARC_FLAG_L2CACHE);

	if (ARC_BUF_ENCRYPTED(buf)) {
		ASSERT(ARC_BUF_COMPRESSED(buf));
		localprop.zp_encrypt = B_TRUE;
		localprop.zp_compress = HDR_GET_COMPRESS(hdr);
		/* CONSTCOND */
		localprop.zp_byteorder =
		    (hdr->b_l1hdr.b_byteswap == DMU_BSWAP_NUMFUNCS) ?
		    ZFS_HOST_BYTEORDER : !ZFS_HOST_BYTEORDER;
		bcopy(hdr->b_crypt_hdr.b_salt, localprop.zp_salt,
		    ZIO_DATA_SALT_LEN);
		bcopy(hdr->b_crypt_hdr.b_iv, localprop.zp_iv,
		    ZIO_DATA_IV_LEN);
		bcopy(hdr->b_crypt_hdr.b_mac, localprop.zp_mac,
		    ZIO_DATA_MAC_LEN);
		if (DMU_OT_IS_ENCRYPTED(localprop.zp_type)) {
			localprop.zp_nopwrite = B_FALSE;
			localprop.zp_copies =
			    MIN(localprop.zp_copies, SPA_DVAS_PER_BP - 1);
		}
		zio_flags |= ZIO_FLAG_RAW;
	} else if (ARC_BUF_COMPRESSED(buf)) {
		ASSERT3U(HDR_GET_LSIZE(hdr), !=, arc_buf_size(buf));
		localprop.zp_compress = HDR_GET_COMPRESS(hdr);
		zio_flags |= ZIO_FLAG_RAW_COMPRESS;
	}

	callback = kmem_zalloc(sizeof (arc_write_callback_t), KM_SLEEP);
	callback->awcb_ready = ready;
	callback->awcb_children_ready = children_ready;
	callback->awcb_physdone = physdone;
	callback->awcb_done = done;
	callback->awcb_private = private;
	callback->awcb_buf = buf;

	/*
	 * The hdr's b_pabd is now stale, free it now. A new data block
	 * will be allocated when the zio pipeline calls arc_write_ready().
	 */
	if (hdr->b_l1hdr.b_pabd != NULL) {
		/*
		 * If the buf is currently sharing the data block with
		 * the hdr then we need to break that relationship here.
		 * The hdr will remain with a NULL data pointer and the
		 * buf will take sole ownership of the block.
		 */
		if (arc_buf_is_shared(buf)) {
			arc_unshare_buf(hdr, buf);
		} else {
			arc_hdr_free_pabd(hdr, B_FALSE);
		}
		VERIFY3P(buf->b_data, !=, NULL);
	}

	if (HDR_HAS_RABD(hdr))
		arc_hdr_free_pabd(hdr, B_TRUE);

	if (!(zio_flags & ZIO_FLAG_RAW))
		arc_hdr_set_compress(hdr, ZIO_COMPRESS_OFF);

	ASSERT(!arc_buf_is_shared(buf));
	ASSERT3P(hdr->b_l1hdr.b_pabd, ==, NULL);

	zio = zio_write(pio, spa, txg, bp,
	    abd_get_from_buf(buf->b_data, HDR_GET_LSIZE(hdr)),
	    HDR_GET_LSIZE(hdr), arc_buf_size(buf), &localprop, arc_write_ready,
	    (children_ready != NULL) ? arc_write_children_ready : NULL,
	    arc_write_physdone, arc_write_done, callback,
	    priority, zio_flags, zb);

	return (zio);
}

static int
arc_memory_throttle(spa_t *spa, uint64_t reserve, uint64_t txg)
{
#ifdef _KERNEL
	uint64_t available_memory = ptob(freemem);

#if defined(__i386)
	available_memory =
	    MIN(available_memory, vmem_size(heap_arena, VMEM_FREE));
#endif

	if (freemem > physmem * arc_lotsfree_percent / 100)
		return (0);

	if (txg > spa->spa_lowmem_last_txg) {
		spa->spa_lowmem_last_txg = txg;
		spa->spa_lowmem_page_load = 0;
	}
	/*
	 * If we are in pageout, we know that memory is already tight,
	 * the arc is already going to be evicting, so we just want to
	 * continue to let page writes occur as quickly as possible.
	 */
	if (curproc == proc_pageout) {
		if (spa->spa_lowmem_page_load >
		    MAX(ptob(minfree), available_memory) / 4)
			return (SET_ERROR(ERESTART));
		/* Note: reserve is inflated, so we deflate */
		atomic_add_64(&spa->spa_lowmem_page_load, reserve / 8);
		return (0);
	} else if (spa->spa_lowmem_page_load > 0 && arc_reclaim_needed()) {
		/* memory is low, delay before restarting */
		ARCSTAT_INCR(arcstat_memory_throttle_count, 1);
		return (SET_ERROR(EAGAIN));
	}
	spa->spa_lowmem_page_load = 0;
#endif /* _KERNEL */
	return (0);
}

void
arc_tempreserve_clear(uint64_t reserve)
{
	atomic_add_64(&arc_tempreserve, -reserve);
	ASSERT((int64_t)arc_tempreserve >= 0);
}

int
arc_tempreserve_space(spa_t *spa, uint64_t reserve, uint64_t txg)
{
	int error;
	uint64_t anon_size;

	if (reserve > arc_c/4 && !arc_no_grow)
		arc_c = MIN(arc_c_max, reserve * 4);
	if (reserve > arc_c)
		return (SET_ERROR(ENOMEM));

	/*
	 * Don't count loaned bufs as in flight dirty data to prevent long
	 * network delays from blocking transactions that are ready to be
	 * assigned to a txg.
	 */

	/* assert that it has not wrapped around */
	ASSERT3S(atomic_add_64_nv(&arc_loaned_bytes, 0), >=, 0);

	anon_size = MAX((int64_t)(zfs_refcount_count(&arc_anon->arcs_size) -
	    arc_loaned_bytes), 0);

	/*
	 * Writes will, almost always, require additional memory allocations
	 * in order to compress/encrypt/etc the data. We therefore need to
	 * make sure that there is sufficient available memory for this.
	 */
	error = arc_memory_throttle(spa, reserve, txg);
	if (error != 0)
		return (error);

	/*
	 * Throttle writes when the amount of dirty data in the cache
	 * gets too large. We try to keep the cache less than half full
	 * of dirty blocks so that our sync times don't grow too large.
	 *
	 * In the case of one pool being built on another pool, we want
	 * to make sure we don't end up throttling the lower (backing)
	 * pool when the upper pool is the majority contributor to dirty
	 * data. To ensure we make forward progress during throttling, we
	 * also check the current pool's net dirty data and only throttle
	 * if it exceeds zfs_arc_pool_dirty_percent of the anonymous dirty
	 * data in the cache.
	 *
	 * Note: if two requests come in concurrently, we might let them
	 * both succeed, when one of them should fail. Not a huge deal.
	 */
	uint64_t total_dirty = reserve + arc_tempreserve + anon_size;
	uint64_t spa_dirty_anon = spa_dirty_data(spa);

	if (total_dirty > arc_c * zfs_arc_dirty_limit_percent / 100 &&
	    anon_size > arc_c * zfs_arc_anon_limit_percent / 100 &&
	    spa_dirty_anon > anon_size * zfs_arc_pool_dirty_percent / 100) {
		uint64_t meta_esize =
		    zfs_refcount_count(
		    &arc_anon->arcs_esize[ARC_BUFC_METADATA]);
		uint64_t data_esize =
		    zfs_refcount_count(&arc_anon->arcs_esize[ARC_BUFC_DATA]);
		dprintf("failing, arc_tempreserve=%lluK anon_meta=%lluK "
		    "anon_data=%lluK tempreserve=%lluK arc_c=%lluK\n",
		    arc_tempreserve >> 10, meta_esize >> 10,
		    data_esize >> 10, reserve >> 10, arc_c >> 10);
		return (SET_ERROR(ERESTART));
	}
	atomic_add_64(&arc_tempreserve, reserve);
	return (0);
}

static void
arc_kstat_update_state(arc_state_t *state, kstat_named_t *size,
    kstat_named_t *evict_data, kstat_named_t *evict_metadata)
{
	size->value.ui64 = zfs_refcount_count(&state->arcs_size);
	evict_data->value.ui64 =
	    zfs_refcount_count(&state->arcs_esize[ARC_BUFC_DATA]);
	evict_metadata->value.ui64 =
	    zfs_refcount_count(&state->arcs_esize[ARC_BUFC_METADATA]);
}

static int
arc_kstat_update(kstat_t *ksp, int rw)
{
	arc_stats_t *as = ksp->ks_data;

	if (rw ==
	    KSTAT_WRITE) {
		return (EACCES);
	} else {
		arc_kstat_update_state(arc_anon,
		    &as->arcstat_anon_size,
		    &as->arcstat_anon_evictable_data,
		    &as->arcstat_anon_evictable_metadata);
		arc_kstat_update_state(arc_mru,
		    &as->arcstat_mru_size,
		    &as->arcstat_mru_evictable_data,
		    &as->arcstat_mru_evictable_metadata);
		arc_kstat_update_state(arc_mru_ghost,
		    &as->arcstat_mru_ghost_size,
		    &as->arcstat_mru_ghost_evictable_data,
		    &as->arcstat_mru_ghost_evictable_metadata);
		arc_kstat_update_state(arc_mfu,
		    &as->arcstat_mfu_size,
		    &as->arcstat_mfu_evictable_data,
		    &as->arcstat_mfu_evictable_metadata);
		arc_kstat_update_state(arc_mfu_ghost,
		    &as->arcstat_mfu_ghost_size,
		    &as->arcstat_mfu_ghost_evictable_data,
		    &as->arcstat_mfu_ghost_evictable_metadata);

		ARCSTAT(arcstat_size) = aggsum_value(&arc_size);
		ARCSTAT(arcstat_meta_used) = aggsum_value(&arc_meta_used);
		ARCSTAT(arcstat_data_size) = aggsum_value(&astat_data_size);
		ARCSTAT(arcstat_metadata_size) =
		    aggsum_value(&astat_metadata_size);
		ARCSTAT(arcstat_hdr_size) = aggsum_value(&astat_hdr_size);
		ARCSTAT(arcstat_other_size) = aggsum_value(&astat_other_size);
		ARCSTAT(arcstat_l2_hdr_size) = aggsum_value(&astat_l2_hdr_size);
	}

	return (0);
}

/*
 * This function *must* return indices evenly distributed between all
 * sublists of the multilist. This is needed due to how the ARC eviction
 * code is laid out; arc_evict_state() assumes ARC buffers are evenly
 * distributed between all sublists and uses this assumption when
 * deciding which sublist to evict from and how much to evict from it.
 */
unsigned int
arc_state_multilist_index_func(multilist_t *ml, void *obj)
{
	arc_buf_hdr_t *hdr = obj;

	/*
	 * We rely on b_dva to generate evenly distributed index
	 * numbers using buf_hash below.
	 * So, as an added precaution, let's make sure we never add
	 * empty buffers to the arc lists.
	 */
	ASSERT(!HDR_EMPTY(hdr));

	/*
	 * The assumption here is that the hash value for a given
	 * arc_buf_hdr_t will remain constant throughout its lifetime
	 * (i.e. its b_spa, b_dva, and b_birth fields don't change).
	 * Thus, we don't need to store the header's sublist index
	 * on insertion, as this index can be recalculated on removal.
	 *
	 * Also, the low order bits of the hash value are thought to be
	 * distributed evenly. Otherwise, in the case that the multilist
	 * has a power of two number of sublists, each sublist's usage
	 * would not be evenly distributed.
	 */
	return (buf_hash(hdr->b_spa, &hdr->b_dva, hdr->b_birth) %
	    multilist_get_num_sublists(ml));
}

static void
arc_state_init(void)
{
	arc_anon = &ARC_anon;
	arc_mru = &ARC_mru;
	arc_mru_ghost = &ARC_mru_ghost;
	arc_mfu = &ARC_mfu;
	arc_mfu_ghost = &ARC_mfu_ghost;
	arc_l2c_only = &ARC_l2c_only;

	arc_mru->arcs_list[ARC_BUFC_METADATA] =
	    multilist_create(sizeof (arc_buf_hdr_t),
	    offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
	    arc_state_multilist_index_func);
	arc_mru->arcs_list[ARC_BUFC_DATA] =
	    multilist_create(sizeof (arc_buf_hdr_t),
	    offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
	    arc_state_multilist_index_func);
	arc_mru_ghost->arcs_list[ARC_BUFC_METADATA] =
	    multilist_create(sizeof (arc_buf_hdr_t),
	    offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
	    arc_state_multilist_index_func);
	arc_mru_ghost->arcs_list[ARC_BUFC_DATA] =
	    multilist_create(sizeof (arc_buf_hdr_t),
	    offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
	    arc_state_multilist_index_func);
	arc_mfu->arcs_list[ARC_BUFC_METADATA] =
	    multilist_create(sizeof (arc_buf_hdr_t),
	    offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
	    arc_state_multilist_index_func);
	arc_mfu->arcs_list[ARC_BUFC_DATA] =
	    multilist_create(sizeof (arc_buf_hdr_t),
	    offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
	    arc_state_multilist_index_func);
	arc_mfu_ghost->arcs_list[ARC_BUFC_METADATA] =
	    multilist_create(sizeof (arc_buf_hdr_t),
	    offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
	    arc_state_multilist_index_func);
	arc_mfu_ghost->arcs_list[ARC_BUFC_DATA] =
	    multilist_create(sizeof (arc_buf_hdr_t),
	    offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
	    arc_state_multilist_index_func);
	arc_l2c_only->arcs_list[ARC_BUFC_METADATA] =
	    multilist_create(sizeof (arc_buf_hdr_t),
	    offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
	    arc_state_multilist_index_func);
	arc_l2c_only->arcs_list[ARC_BUFC_DATA] =
	    multilist_create(sizeof (arc_buf_hdr_t),
	    offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
	    arc_state_multilist_index_func);

	zfs_refcount_create(&arc_anon->arcs_esize[ARC_BUFC_METADATA]);
	zfs_refcount_create(&arc_anon->arcs_esize[ARC_BUFC_DATA]);
	zfs_refcount_create(&arc_mru->arcs_esize[ARC_BUFC_METADATA]);
	zfs_refcount_create(&arc_mru->arcs_esize[ARC_BUFC_DATA]);
	zfs_refcount_create(&arc_mru_ghost->arcs_esize[ARC_BUFC_METADATA]);
	zfs_refcount_create(&arc_mru_ghost->arcs_esize[ARC_BUFC_DATA]);
	zfs_refcount_create(&arc_mfu->arcs_esize[ARC_BUFC_METADATA]);
	zfs_refcount_create(&arc_mfu->arcs_esize[ARC_BUFC_DATA]);
	zfs_refcount_create(&arc_mfu_ghost->arcs_esize[ARC_BUFC_METADATA]);
	zfs_refcount_create(&arc_mfu_ghost->arcs_esize[ARC_BUFC_DATA]);
	zfs_refcount_create(&arc_l2c_only->arcs_esize[ARC_BUFC_METADATA]);
	zfs_refcount_create(&arc_l2c_only->arcs_esize[ARC_BUFC_DATA]);

	zfs_refcount_create(&arc_anon->arcs_size);
	zfs_refcount_create(&arc_mru->arcs_size);
	zfs_refcount_create(&arc_mru_ghost->arcs_size);
	zfs_refcount_create(&arc_mfu->arcs_size);
	zfs_refcount_create(&arc_mfu_ghost->arcs_size);
	zfs_refcount_create(&arc_l2c_only->arcs_size);

	aggsum_init(&arc_meta_used, 0);
	aggsum_init(&arc_size, 0);
	aggsum_init(&astat_data_size, 0);
	aggsum_init(&astat_metadata_size, 0);
	aggsum_init(&astat_hdr_size, 0);
	aggsum_init(&astat_other_size, 0);
	aggsum_init(&astat_l2_hdr_size, 0);
}

static void
arc_state_fini(void)
{
	zfs_refcount_destroy(&arc_anon->arcs_esize[ARC_BUFC_METADATA]);
	zfs_refcount_destroy(&arc_anon->arcs_esize[ARC_BUFC_DATA]);
	zfs_refcount_destroy(&arc_mru->arcs_esize[ARC_BUFC_METADATA]);
	zfs_refcount_destroy(&arc_mru->arcs_esize[ARC_BUFC_DATA]);
	zfs_refcount_destroy(&arc_mru_ghost->arcs_esize[ARC_BUFC_METADATA]);
	zfs_refcount_destroy(&arc_mru_ghost->arcs_esize[ARC_BUFC_DATA]);
	zfs_refcount_destroy(&arc_mfu->arcs_esize[ARC_BUFC_METADATA]);
	zfs_refcount_destroy(&arc_mfu->arcs_esize[ARC_BUFC_DATA]);
	zfs_refcount_destroy(&arc_mfu_ghost->arcs_esize[ARC_BUFC_METADATA]);
	zfs_refcount_destroy(&arc_mfu_ghost->arcs_esize[ARC_BUFC_DATA]);
	zfs_refcount_destroy(&arc_l2c_only->arcs_esize[ARC_BUFC_METADATA]);
	zfs_refcount_destroy(&arc_l2c_only->arcs_esize[ARC_BUFC_DATA]);

	zfs_refcount_destroy(&arc_anon->arcs_size);
	zfs_refcount_destroy(&arc_mru->arcs_size);
	zfs_refcount_destroy(&arc_mru_ghost->arcs_size);
	zfs_refcount_destroy(&arc_mfu->arcs_size);
	zfs_refcount_destroy(&arc_mfu_ghost->arcs_size);
	zfs_refcount_destroy(&arc_l2c_only->arcs_size);

	multilist_destroy(arc_mru->arcs_list[ARC_BUFC_METADATA]);
	multilist_destroy(arc_mru_ghost->arcs_list[ARC_BUFC_METADATA]);
	multilist_destroy(arc_mfu->arcs_list[ARC_BUFC_METADATA]);
	multilist_destroy(arc_mfu_ghost->arcs_list[ARC_BUFC_METADATA]);
	multilist_destroy(arc_mru->arcs_list[ARC_BUFC_DATA]);
	multilist_destroy(arc_mru_ghost->arcs_list[ARC_BUFC_DATA]);
	multilist_destroy(arc_mfu->arcs_list[ARC_BUFC_DATA]);
	multilist_destroy(arc_mfu_ghost->arcs_list[ARC_BUFC_DATA]);
	multilist_destroy(arc_l2c_only->arcs_list[ARC_BUFC_METADATA]);
	multilist_destroy(arc_l2c_only->arcs_list[ARC_BUFC_DATA]);

	aggsum_fini(&arc_meta_used);
	aggsum_fini(&arc_size);
	aggsum_fini(&astat_data_size);
	aggsum_fini(&astat_metadata_size);
	aggsum_fini(&astat_hdr_size);
	aggsum_fini(&astat_other_size);
	aggsum_fini(&astat_l2_hdr_size);
}

uint64_t
arc_max_bytes(void)
{
	return (arc_c_max);
}

void
arc_init(void)
{
	/*
	 * allmem is "all memory that we could possibly use".
	 */
#ifdef _KERNEL
	uint64_t allmem = ptob(physmem - swapfs_minfree);
#else
	uint64_t allmem = (physmem * PAGESIZE) / 2;
#endif
	mutex_init(&arc_adjust_lock, NULL, MUTEX_DEFAULT, NULL);
	cv_init(&arc_adjust_waiters_cv, NULL, CV_DEFAULT, NULL);

	/* set min cache to 1/32 of all memory, or 64MB, whichever is more */
	arc_c_min = MAX(allmem / 32, 64 << 20);
	/* set max to 3/4 of all memory, or all but 1GB, whichever is more */
	if (allmem >= 1 << 30)
		arc_c_max = allmem - (1 << 30);
	else
		arc_c_max = arc_c_min;
	arc_c_max = MAX(allmem * 3 / 4, arc_c_max);

	/*
	 * In userland, there's only the memory pressure that we artificially
	 * create (see arc_available_memory()). Don't let arc_c get too
	 * small, because it can cause transactions to be larger than
	 * arc_c, causing arc_tempreserve_space() to fail.
	 */
#ifndef _KERNEL
	arc_c_min = arc_c_max / 2;
#endif

	/*
	 * Allow the tunables to override our calculations if they are
	 * reasonable (i.e.
	 * over 64MB)
	 */
	if (zfs_arc_max > 64 << 20 && zfs_arc_max < allmem) {
		arc_c_max = zfs_arc_max;
		arc_c_min = MIN(arc_c_min, arc_c_max);
	}
	if (zfs_arc_min > 64 << 20 && zfs_arc_min <= arc_c_max)
		arc_c_min = zfs_arc_min;

	arc_c = arc_c_max;
	arc_p = (arc_c >> 1);

	/* limit meta-data to 1/4 of the arc capacity */
	arc_meta_limit = arc_c_max / 4;

#ifdef _KERNEL
	/*
	 * Metadata is stored in the kernel's heap. Don't let us
	 * use more than half the heap for the ARC.
	 */
	arc_meta_limit = MIN(arc_meta_limit,
	    vmem_size(heap_arena, VMEM_ALLOC | VMEM_FREE) / 2);
#endif

	/* Allow the tunable to override if it is reasonable */
	if (zfs_arc_meta_limit > 0 && zfs_arc_meta_limit <= arc_c_max)
		arc_meta_limit = zfs_arc_meta_limit;

	if (arc_c_min < arc_meta_limit / 2 && zfs_arc_min == 0)
		arc_c_min = arc_meta_limit / 2;

	if (zfs_arc_meta_min > 0) {
		arc_meta_min = zfs_arc_meta_min;
	} else {
		arc_meta_min = arc_c_min / 2;
	}

	if (zfs_arc_grow_retry > 0)
		arc_grow_retry = zfs_arc_grow_retry;

	if (zfs_arc_shrink_shift > 0)
		arc_shrink_shift = zfs_arc_shrink_shift;

	/*
	 * Ensure that arc_no_grow_shift is less than arc_shrink_shift.
	 */
	if (arc_no_grow_shift >= arc_shrink_shift)
		arc_no_grow_shift = arc_shrink_shift - 1;

	if (zfs_arc_p_min_shift > 0)
		arc_p_min_shift = zfs_arc_p_min_shift;

	/* if kmem_flags are set, let's try to use less memory */
	if (kmem_debugging())
		arc_c = arc_c / 2;
	if (arc_c < arc_c_min)
		arc_c = arc_c_min;

	arc_state_init();

	/*
	 * The arc must be "uninitialized", so that hdr_recl() (which is
	 * registered by buf_init()) will not access arc_reap_zthr before
	 * it is created.
	 */
	ASSERT(!arc_initialized);
	buf_init();

	arc_ksp = kstat_create("zfs", 0, "arcstats", "misc", KSTAT_TYPE_NAMED,
	    sizeof (arc_stats) / sizeof (kstat_named_t), KSTAT_FLAG_VIRTUAL);

	if (arc_ksp != NULL) {
		arc_ksp->ks_data = &arc_stats;
		arc_ksp->ks_update = arc_kstat_update;
		kstat_install(arc_ksp);
	}

	arc_adjust_zthr = zthr_create(arc_adjust_cb_check,
	    arc_adjust_cb, NULL);
	arc_reap_zthr = zthr_create_timer(arc_reap_cb_check,
	    arc_reap_cb, NULL, SEC2NSEC(1));

	arc_initialized = B_TRUE;
	arc_warm = B_FALSE;

	/*
	 * Calculate maximum amount of dirty data per pool.
	 *
	 * If it has been set by /etc/system, take that.
	 * Otherwise, use a percentage of physical memory defined by
	 * zfs_dirty_data_max_percent (default 10%) with a cap at
	 * zfs_dirty_data_max_max (default 4GB).
	 */
	if (zfs_dirty_data_max == 0) {
		zfs_dirty_data_max = physmem * PAGESIZE *
		    zfs_dirty_data_max_percent / 100;
		zfs_dirty_data_max = MIN(zfs_dirty_data_max,
		    zfs_dirty_data_max_max);
	}
}

void
arc_fini(void)
{
	/* Use B_TRUE to ensure *all* buffers are evicted */
	arc_flush(NULL, B_TRUE);

	arc_initialized = B_FALSE;

	if (arc_ksp != NULL) {
		kstat_delete(arc_ksp);
		arc_ksp = NULL;
	}

	(void) zthr_cancel(arc_adjust_zthr);
	zthr_destroy(arc_adjust_zthr);

	(void) zthr_cancel(arc_reap_zthr);
	zthr_destroy(arc_reap_zthr);

	mutex_destroy(&arc_adjust_lock);
	cv_destroy(&arc_adjust_waiters_cv);

	/*
	 * buf_fini() must precede arc_state_fini() because buf_fini() may
	 * trigger the release of kmem magazines, which can call back to
	 * arc_space_return(), which accesses aggsums freed in
	 * arc_state_fini().
	 */
	buf_fini();
	arc_state_fini();

	ASSERT0(arc_loaned_bytes);
}

/*
 * Level 2 ARC
 *
 * The level 2 ARC (L2ARC) is a cache layer in-between main memory and disk.
 * It uses dedicated storage devices to hold cached data, which are populated
 * using large infrequent writes. The main role of this cache is to boost
 * the performance of random read workloads. The intended L2ARC devices
 * include short-stroked disks, solid state disks, and other media with
 * substantially faster read latency than disk.
 *
 *             +-----------------------+
 *             |         ARC           |
 *             +-----------------------+
 *                |         ^     ^
 *                |         |     |
 *      l2arc_feed_thread()    arc_read()
 *                |         |     |
 *                |  l2arc read   |
 *                V         |     |
 *             +---------------+  |
 *             |     L2ARC     |  |
 *             +---------------+  |
 *                 |    ^         |
 *        l2arc_write() |         |
 *                 |    |         |
 *                 V    |         |
 *               +-------+      +-------+
 *               | vdev  |      | vdev  |
 *               | cache |      | cache |
 *               +-------+      +-------+
 *               +=========+     .-----.
 *               : L2ARC   :    |-_____-|
 *               : devices :    | Disks |
 *               +=========+    `-_____-'
 *
 * Read requests are satisfied from the following sources, in order:
 *
 *	1) ARC
 *	2) vdev cache of L2ARC devices
 *	3) L2ARC devices
 *	4) vdev cache of disks
 *	5) disks
 *
 * Some L2ARC device types exhibit extremely slow write performance.
 * To accommodate this there are some significant differences between
 * the L2ARC and traditional cache design:
 *
 * 1. There is no eviction path from the ARC to the L2ARC. Evictions from
 * the ARC behave as usual, freeing buffers and placing headers on ghost
 * lists. The ARC does not send buffers to the L2ARC during eviction as
 * this would add inflated write latencies for all ARC memory pressure.
 *
 * 2. The L2ARC attempts to cache data from the ARC before it is evicted.
 * It does this by periodically scanning buffers from the eviction-end of
 * the MFU and MRU ARC lists, copying them to the L2ARC devices if they are
 * not already there. It scans until a headroom of buffers is satisfied,
 * which itself is a buffer for ARC eviction. If a compressible buffer is
 * found during scanning and selected for writing to an L2ARC device, we
 * temporarily boost scanning headroom during the next scan cycle to make
 * sure we adapt to compression effects (which might significantly reduce
 * the data volume we write to L2ARC). The thread that does this is
 * l2arc_feed_thread(), illustrated below; example sizes are included to
 * provide a better sense of ratio than this diagram:
 *
 *                  head -->              tail
 *          +---------------------+----------+
 * ARC_mfu  |:::::#:::::::::::::::|o#o###o###|-->.   # already on L2ARC
 *          +---------------------+----------+   |   o L2ARC eligible
 * ARC_mru  |:#:::::::::::::::::::|#o#ooo####|-->|   : ARC buffer
 *          +---------------------+----------+   |
 *               15.9 Gbytes      ^ 32 Mbytes    |
 *                             headroom          |
 *                                      l2arc_feed_thread()
 *                                               |
 *                   l2arc write hand <--[oooo]--'
 *                       |              8 Mbyte
 *                       |             write max
 *                       V
 *           +==============================+
 * L2ARC dev |####|#|###|###|    |####| ... |
 *           +==============================+
 *                      32 Gbytes
 *
 * 3. If an ARC buffer is copied to the L2ARC but then hit instead of
 * evicted, then the L2ARC has cached a buffer much sooner than it probably
 * needed to, potentially wasting L2ARC device bandwidth and storage. It is
 * safe to say that this is an uncommon case, since buffers at the end of
 * the ARC lists have moved there due to inactivity.
 *
 * 4. If the ARC evicts faster than the L2ARC can maintain a headroom,
 * then the L2ARC simply misses copying some buffers.
 * This serves as a pressure valve to prevent heavy read workloads from
 * both stalling the ARC with waits and clogging the L2ARC with writes.
 * This also helps prevent the potential for the L2ARC to churn if it
 * attempts to cache content too quickly, such as during backups of the
 * entire pool.
 *
 * 5. After system boot and before the ARC has filled main memory, there are
 * no evictions from the ARC and so the tails of the ARC_mfu and ARC_mru
 * lists can remain mostly static. Instead of searching from the tail of
 * these lists as pictured, the l2arc_feed_thread() will search from the
 * list heads for eligible buffers, greatly increasing its chance of
 * finding them.
 *
 * The L2ARC device write speed is also boosted during this time so that
 * the L2ARC warms up faster. Since there have been no ARC evictions yet,
 * there are no L2ARC reads, and no fear of degrading read performance
 * through increased writes.
 *
 * 6. Writes to the L2ARC devices are grouped and sent in-sequence, so that
 * the vdev queue can aggregate them into larger and fewer writes. Each
 * device is written to in a rotor fashion, sweeping writes through
 * available space then repeating.
 *
 * 7. The L2ARC does not store dirty content. It never needs to flush
 * write buffers back to disk based storage.
 *
 * 8. If an ARC buffer is written (and dirtied) which also exists in the
 * L2ARC, the now stale L2ARC buffer is immediately dropped.
 *
 * The performance of the L2ARC can be tweaked by a number of tunables, which
 * may be necessary for different workloads:
 *
 *	l2arc_write_max		max write bytes per interval
 *	l2arc_write_boost	extra write bytes during device warmup
 *	l2arc_noprefetch	skip caching prefetched buffers
 *	l2arc_headroom		number of max device writes to precache
 *	l2arc_headroom_boost	when we find compressed buffers during ARC
 *				scanning, we multiply headroom by this
 *				percentage factor for the next scan cycle,
 *				since more compressed buffers are likely to
 *				be present
 *	l2arc_feed_secs		seconds between L2ARC writing
 *
 * Tunables may be removed or added as future performance improvements are
 * integrated, and also may become zpool properties.
 *
 * There are three key functions that control how the L2ARC warms up:
 *
 *	l2arc_write_eligible()	check if a buffer is eligible to cache
 *	l2arc_write_size()	calculate how much to write
 *	l2arc_write_interval()	calculate sleep delay between writes
 *
 * These three functions determine what to write, how much, and how quickly
 * to send writes.
 *
 * L2ARC persistence:
 *
 * When writing buffers to L2ARC, we periodically add some metadata to
 * make sure we can pick them up after reboot, thus dramatically reducing
 * the impact that any downtime has on the performance of storage systems
 * with large caches.
 *
 * The implementation works fairly simply by integrating the following two
 * modifications:
 *
 * *) When writing to the L2ARC, we occasionally write a "l2arc log block",
 *    which is an additional piece of metadata which describes what's been
 *    written. This allows us to rebuild the arc_buf_hdr_t structures of the
 *    main ARC buffers. There are 2 linked-lists of log blocks headed by
 *    dh_start_lbps[2].
 *    We alternate which chain we append to, so they are time-wise and
 *    offset-wise interleaved, but that is an optimization rather than for
 *    correctness. The log block also includes a pointer to the previous
 *    block in its chain.
 *
 * *) We reserve SPA_MINBLOCKSIZE of space at the start of each L2ARC device
 *    for our header bookkeeping purposes. This contains a device header,
 *    which contains our top-level reference structures. We update it each
 *    time we write a new log block, so that we're able to locate it in the
 *    L2ARC device. If this write results in an inconsistent device header
 *    (e.g. due to power failure), we detect this by verifying the header's
 *    checksum and simply fail to reconstruct the L2ARC after reboot.
 *
 * Implementation diagram:
 *
 * +=== L2ARC device (not to scale) ======================================+
 * |       ___two newest log block pointers__.__________                  |
 * |      /                                   \dh_start_lbps[1]           |
 * |     /                                     \         \dh_start_lbps[0]|
 * |.___/__.                                    V         V               |
 * ||L2 dev|....|lb |bufs |lb |bufs |lb |bufs |lb |bufs |lb |---(empty)---|
 * ||   hdr|      ^         /^       /^        /         /                |
 * |+------+  ...--\-------/  \-----/--\------/         /                 |
 * |                \--------------/    \--------------/                  |
 * +======================================================================+
 *
 * As can be seen on the diagram, rather than using a simple linked list,
 * we use a pair of linked lists with alternating elements. This is a
 * performance enhancement due to the fact that we only find out the
 * address of the next log block access once the current block has been
 * completely read in. Obviously, this hurts performance, because we'd be
 * keeping the device's I/O queue at a depth of only 1 operation, thus
 * incurring a large amount of I/O round-trip latency. Having two lists
 * allows us to fetch two log blocks ahead of where we are currently
 * rebuilding L2ARC buffers.
 *
 * On-device data structures:
 *
 * L2ARC device header:	l2arc_dev_hdr_phys_t
 * L2ARC log block:	l2arc_log_blk_phys_t
 *
 * L2ARC reconstruction:
 *
 * When writing data, we simply write in the standard rotary fashion,
 * evicting buffers as we go and simply writing new data over them (writing
 * a new log block every now and then). This obviously means that once we
 * loop around the end of the device, we will start cutting into an already
 * committed log block (and its referenced data buffers), like so:
 *
 *    current write head__       __old tail
 *                        \     /
 *                        V    V
 * <--|bufs |lb |bufs |lb |    |bufs |lb |bufs |lb |-->
 *                         ^    ^^^^^^^^^___________________________________
 *                         |                                                \
 *                   <<nextwrite>> may overwrite this blk and/or its bufs --'
 *
 * When importing the pool, we detect this situation and use it to stop
 * our scanning process (see l2arc_rebuild).
 *
 * There is one significant caveat to consider when rebuilding ARC contents
 * from an L2ARC device: what about invalidated buffers? Given the above
 * construction, we cannot update blocks which we've already written to amend
 * them to remove buffers which were invalidated. Thus, during reconstruction,
 * we might be populating the cache with buffers for data that's not on the
 * main pool anymore, or may have been overwritten!
 *
 * As it turns out, this isn't a problem. Every arc_read request includes
 * both the DVA and, crucially, the birth TXG of the BP the caller is
 * looking for. So even if the cache were populated by completely rotten
 * blocks for data that had been long deleted and/or overwritten, we'll
 * never actually return bad data from the cache, since the DVA with the
 * birth TXG uniquely identify a block in space and time - once created,
 * a block is immutable on disk.
 * The worst thing we have done is wasted some time and memory at l2arc
 * rebuild to reconstruct outdated ARC entries that will get dropped from
 * the l2arc as it is being updated with new blocks.
 *
 * L2ARC buffers that have been evicted by l2arc_evict() ahead of the write
 * hand are not restored. This is done by saving the offset (in bytes)
 * l2arc_evict() has evicted to in the L2ARC device header and taking it
 * into account when restoring buffers.
 */

static boolean_t
l2arc_write_eligible(uint64_t spa_guid, arc_buf_hdr_t *hdr)
{
	/*
	 * A buffer is *not* eligible for the L2ARC if it:
	 * 1. belongs to a different spa.
	 * 2. is already cached on the L2ARC.
	 * 3. has an I/O in progress (it may be an incomplete read).
	 * 4. is flagged not eligible (zfs property).
	 */
	if (hdr->b_spa != spa_guid || HDR_HAS_L2HDR(hdr) ||
	    HDR_IO_IN_PROGRESS(hdr) || !HDR_L2CACHE(hdr))
		return (B_FALSE);

	return (B_TRUE);
}

static uint64_t
l2arc_write_size(l2arc_dev_t *dev)
{
	uint64_t size, dev_size;

	/*
	 * Make sure our globals have meaningful values in case the user
	 * altered them.
	 */
	size = l2arc_write_max;
	if (size == 0) {
		cmn_err(CE_NOTE, "Bad value for l2arc_write_max, value must "
		    "be greater than zero, resetting it to the default (%d)",
		    L2ARC_WRITE_SIZE);
		size = l2arc_write_max = L2ARC_WRITE_SIZE;
	}

	if (arc_warm == B_FALSE)
		size += l2arc_write_boost;

	/*
	 * Make sure the write size does not exceed the size of the cache
	 * device. This is important in l2arc_evict(), otherwise infinite
	 * iteration can occur.
	 */
	dev_size = dev->l2ad_end - dev->l2ad_start;
	if ((size + l2arc_log_blk_overhead(size, dev)) >= dev_size) {
		cmn_err(CE_NOTE, "l2arc_write_max or l2arc_write_boost "
		    "plus the overhead of log blocks (persistent L2ARC, "
		    "%" PRIu64 " bytes) exceeds the size of the cache device "
		    "(guid %" PRIu64 "), resetting them to the default (%d)",
		    l2arc_log_blk_overhead(size, dev),
		    dev->l2ad_vdev->vdev_guid, L2ARC_WRITE_SIZE);
		size = l2arc_write_max = l2arc_write_boost = L2ARC_WRITE_SIZE;

		if (arc_warm == B_FALSE)
			size += l2arc_write_boost;
	}

	return (size);
}

static clock_t
l2arc_write_interval(clock_t began, uint64_t wanted, uint64_t wrote)
{
	clock_t interval, next, now;

	/*
	 * If the ARC lists are busy, increase our write rate; if the
	 * lists are stale, idle back. This is achieved by checking
	 * how much we previously wrote - if it was more than half of
	 * what we wanted, schedule the next write much sooner.
	 */
	if (l2arc_feed_again && wrote > (wanted / 2))
		interval = (hz * l2arc_feed_min_ms) / 1000;
	else
		interval = hz * l2arc_feed_secs;

	now = ddi_get_lbolt();
	next = MAX(now, MIN(now + interval, began + interval));

	return (next);
}

/*
 * Cycle through L2ARC devices. This is how L2ARC load balances.
 * If a device is returned, this also returns holding the spa config lock.
 */
static l2arc_dev_t *
l2arc_dev_get_next(void)
{
	l2arc_dev_t *first, *next = NULL;

	/*
	 * Lock out the removal of spas (spa_namespace_lock), then removal
	 * of cache devices (l2arc_dev_mtx). Once a device has been selected,
	 * both locks will be dropped and a spa config lock held instead.
	 */
	mutex_enter(&spa_namespace_lock);
	mutex_enter(&l2arc_dev_mtx);

	/* if there are no vdevs, there is nothing to do */
	if (l2arc_ndev == 0)
		goto out;

	first = NULL;
	next = l2arc_dev_last;
	do {
		/* loop around the list looking for a non-faulted vdev */
		if (next == NULL) {
			next = list_head(l2arc_dev_list);
		} else {
			next = list_next(l2arc_dev_list, next);
			if (next == NULL)
				next = list_head(l2arc_dev_list);
		}

		/* if we have come back to the start, bail out */
		if (first == NULL)
			first = next;
		else if (next == first)
			break;

	} while (vdev_is_dead(next->l2ad_vdev) || next->l2ad_rebuild);

	/* if we were unable to find any usable vdevs, return NULL */
	if (vdev_is_dead(next->l2ad_vdev) || next->l2ad_rebuild)
		next = NULL;

	l2arc_dev_last = next;

out:
	mutex_exit(&l2arc_dev_mtx);

	/*
	 * Grab the config lock to prevent the 'next' device from being
	 * removed while we are writing to it.
	 */
	if (next != NULL)
		spa_config_enter(next->l2ad_spa, SCL_L2ARC, next, RW_READER);
	mutex_exit(&spa_namespace_lock);

	return (next);
}

/*
 * Free buffers that were tagged for destruction.
 */
static void
l2arc_do_free_on_write(void)
{
	list_t *buflist;
	l2arc_data_free_t *df, *df_prev;

	mutex_enter(&l2arc_free_on_write_mtx);
	buflist = l2arc_free_on_write;

	for (df = list_tail(buflist); df; df = df_prev) {
		df_prev = list_prev(buflist, df);
		ASSERT3P(df->l2df_abd, !=, NULL);
		abd_free(df->l2df_abd);
		list_remove(buflist, df);
		kmem_free(df, sizeof (l2arc_data_free_t));
	}

	mutex_exit(&l2arc_free_on_write_mtx);
}

/*
 * A write to a cache device has completed. Update all headers to allow
 * reads from these buffers to begin.
 */
static void
l2arc_write_done(zio_t *zio)
{
	l2arc_write_callback_t *cb;
	l2arc_lb_abd_buf_t *abd_buf;
	l2arc_lb_ptr_buf_t *lb_ptr_buf;
	l2arc_dev_t *dev;
	l2arc_dev_hdr_phys_t *l2dhdr;
	list_t *buflist;
	arc_buf_hdr_t *head, *hdr, *hdr_prev;
	kmutex_t *hash_lock;
	int64_t bytes_dropped = 0;

	cb = zio->io_private;
	ASSERT3P(cb, !=, NULL);
	dev = cb->l2wcb_dev;
	l2dhdr = dev->l2ad_dev_hdr;
	ASSERT3P(dev, !=, NULL);
	head = cb->l2wcb_head;
	ASSERT3P(head, !=, NULL);
	buflist = &dev->l2ad_buflist;
	ASSERT3P(buflist, !=, NULL);
	DTRACE_PROBE2(l2arc__iodone, zio_t *, zio,
	    l2arc_write_callback_t *, cb);

	if (zio->io_error != 0)
		ARCSTAT_BUMP(arcstat_l2_writes_error);

	/*
	 * All writes completed, or an error was hit.
	 */
top:
	mutex_enter(&dev->l2ad_mtx);
	for (hdr = list_prev(buflist, head); hdr; hdr = hdr_prev) {
		hdr_prev = list_prev(buflist, hdr);

		hash_lock = HDR_LOCK(hdr);

		/*
		 * We cannot use mutex_enter or else we can deadlock
		 * with l2arc_write_buffers (due to swapping the order
		 * the hash lock and l2ad_mtx are taken).
		 */
		if (!mutex_tryenter(hash_lock)) {
			/*
			 * Missed the hash lock. We must retry so we
			 * don't leave the ARC_FLAG_L2_WRITING bit set.
			 */
			ARCSTAT_BUMP(arcstat_l2_writes_lock_retry);

			/*
			 * We don't want to rescan the headers we've
			 * already marked as having been written out, so
			 * we reinsert the head node so we can pick up
			 * where we left off.
			 */
			list_remove(buflist, head);
			list_insert_after(buflist, hdr, head);

			mutex_exit(&dev->l2ad_mtx);

			/*
			 * We wait for the hash lock to become available
			 * to try and prevent busy waiting, and increase
			 * the chance we'll be able to acquire the lock
			 * the next time around.
			 */
			mutex_enter(hash_lock);
			mutex_exit(hash_lock);
			goto top;
		}

		/*
		 * We could not have been moved into the arc_l2c_only
		 * state while in-flight due to our ARC_FLAG_L2_WRITING
		 * bit being set. Let's just ensure that's being enforced.
		 */
		ASSERT(HDR_HAS_L1HDR(hdr));

		if (zio->io_error != 0) {
			/*
			 * Error - drop L2ARC entry.
			 */
			list_remove(buflist, hdr);
			arc_hdr_clear_flags(hdr, ARC_FLAG_HAS_L2HDR);

			uint64_t psize = HDR_GET_PSIZE(hdr);
			ARCSTAT_INCR(arcstat_l2_psize, -psize);
			ARCSTAT_INCR(arcstat_l2_lsize, -HDR_GET_LSIZE(hdr));

			bytes_dropped +=
			    vdev_psize_to_asize(dev->l2ad_vdev, psize);
			(void) zfs_refcount_remove_many(&dev->l2ad_alloc,
			    arc_hdr_size(hdr), hdr);
		}

		/*
		 * Allow ARC to begin reads and ghost list evictions to
		 * this L2ARC entry.
		 */
		arc_hdr_clear_flags(hdr, ARC_FLAG_L2_WRITING);

		mutex_exit(hash_lock);
	}

	/*
	 * Free the allocated abd buffers for writing the log blocks.
	 * If the zio failed, reclaim the allocated space and remove the
	 * pointers to these log blocks from the log block pointer list
	 * of the L2ARC device.
	 */
	while ((abd_buf = list_remove_tail(&cb->l2wcb_abd_list)) != NULL) {
		abd_free(abd_buf->abd);
		zio_buf_free(abd_buf, sizeof (*abd_buf));
		if (zio->io_error != 0) {
			lb_ptr_buf = list_remove_head(&dev->l2ad_lbptr_list);
			/*
			 * L2BLK_GET_PSIZE returns aligned size for log
			 * blocks.
			 */
			uint64_t asize =
			    L2BLK_GET_PSIZE((lb_ptr_buf->lb_ptr)->lbp_prop);
			bytes_dropped += asize;
			ARCSTAT_INCR(arcstat_l2_log_blk_asize, -asize);
			ARCSTAT_BUMPDOWN(arcstat_l2_log_blk_count);
			zfs_refcount_remove_many(&dev->l2ad_lb_asize, asize,
			    lb_ptr_buf);
			zfs_refcount_remove(&dev->l2ad_lb_count, lb_ptr_buf);
			kmem_free(lb_ptr_buf->lb_ptr,
			    sizeof (l2arc_log_blkptr_t));
			kmem_free(lb_ptr_buf, sizeof (l2arc_lb_ptr_buf_t));
		}
	}
	list_destroy(&cb->l2wcb_abd_list);

	if (zio->io_error != 0) {
		/*
		 * Restore the lbps array in the header to its previous state.
		 * If the list of log block pointers is empty, zero out the
		 * log block pointers in the device header.
		 */
		lb_ptr_buf = list_head(&dev->l2ad_lbptr_list);
		for (int i = 0; i < 2; i++) {
			if (lb_ptr_buf == NULL) {
				/*
				 * If the list is empty zero out the device
				 * header. Otherwise zero out the second log
				 * block pointer in the header.
7683 */ 7684 if (i == 0) { 7685 bzero(l2dhdr, dev->l2ad_dev_hdr_asize); 7686 } else { 7687 bzero(&l2dhdr->dh_start_lbps[i], 7688 sizeof (l2arc_log_blkptr_t)); 7689 } 7690 break; 7691 } 7692 bcopy(lb_ptr_buf->lb_ptr, &l2dhdr->dh_start_lbps[i], 7693 sizeof (l2arc_log_blkptr_t)); 7694 lb_ptr_buf = list_next(&dev->l2ad_lbptr_list, 7695 lb_ptr_buf); 7696 } 7697 } 7698 7699 atomic_inc_64(&l2arc_writes_done); 7700 list_remove(buflist, head); 7701 ASSERT(!HDR_HAS_L1HDR(head)); 7702 kmem_cache_free(hdr_l2only_cache, head); 7703 mutex_exit(&dev->l2ad_mtx); 7704 7705 ASSERT(dev->l2ad_vdev != NULL); 7706 vdev_space_update(dev->l2ad_vdev, -bytes_dropped, 0, 0); 7707 7708 l2arc_do_free_on_write(); 7709 7710 kmem_free(cb, sizeof (l2arc_write_callback_t)); 7711 } 7712 7713 static int 7714 l2arc_untransform(zio_t *zio, l2arc_read_callback_t *cb) 7715 { 7716 int ret; 7717 spa_t *spa = zio->io_spa; 7718 arc_buf_hdr_t *hdr = cb->l2rcb_hdr; 7719 blkptr_t *bp = zio->io_bp; 7720 uint8_t salt[ZIO_DATA_SALT_LEN]; 7721 uint8_t iv[ZIO_DATA_IV_LEN]; 7722 uint8_t mac[ZIO_DATA_MAC_LEN]; 7723 boolean_t no_crypt = B_FALSE; 7724 7725 /* 7726 * ZIL data is never written to the L2ARC, so we don't need 7727 * special handling for its unique MAC storage. 7728 */ 7729 ASSERT3U(BP_GET_TYPE(bp), !=, DMU_OT_INTENT_LOG); 7730 ASSERT(MUTEX_HELD(HDR_LOCK(hdr))); 7731 ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL); 7732 7733 /* 7734 * If the data was encrypted, decrypt it now. Note that 7735 * we must check the bp here and not the hdr, since the 7736 * hdr does not have its encryption parameters updated 7737 * until arc_read_done().
7738 */ 7739 if (BP_IS_ENCRYPTED(bp)) { 7740 abd_t *eabd = arc_get_data_abd(hdr, arc_hdr_size(hdr), hdr); 7741 7742 zio_crypt_decode_params_bp(bp, salt, iv); 7743 zio_crypt_decode_mac_bp(bp, mac); 7744 7745 ret = spa_do_crypt_abd(B_FALSE, spa, &cb->l2rcb_zb, 7746 BP_GET_TYPE(bp), BP_GET_DEDUP(bp), BP_SHOULD_BYTESWAP(bp), 7747 salt, iv, mac, HDR_GET_PSIZE(hdr), eabd, 7748 hdr->b_l1hdr.b_pabd, &no_crypt); 7749 if (ret != 0) { 7750 arc_free_data_abd(hdr, eabd, arc_hdr_size(hdr), hdr); 7751 goto error; 7752 } 7753 7754 /* 7755 * If we actually performed decryption, replace b_pabd 7756 * with the decrypted data. Otherwise we can just throw 7757 * our decryption buffer away. 7758 */ 7759 if (!no_crypt) { 7760 arc_free_data_abd(hdr, hdr->b_l1hdr.b_pabd, 7761 arc_hdr_size(hdr), hdr); 7762 hdr->b_l1hdr.b_pabd = eabd; 7763 zio->io_abd = eabd; 7764 } else { 7765 arc_free_data_abd(hdr, eabd, arc_hdr_size(hdr), hdr); 7766 } 7767 } 7768 7769 /* 7770 * If the L2ARC block was compressed, but ARC compression 7771 * is disabled we decompress the data into a new buffer and 7772 * replace the existing data. 7773 */ 7774 if (HDR_GET_COMPRESS(hdr) != ZIO_COMPRESS_OFF && 7775 !HDR_COMPRESSION_ENABLED(hdr)) { 7776 abd_t *cabd = arc_get_data_abd(hdr, arc_hdr_size(hdr), hdr); 7777 void *tmp = abd_borrow_buf(cabd, arc_hdr_size(hdr)); 7778 7779 ret = zio_decompress_data(HDR_GET_COMPRESS(hdr), 7780 hdr->b_l1hdr.b_pabd, tmp, HDR_GET_PSIZE(hdr), 7781 HDR_GET_LSIZE(hdr)); 7782 if (ret != 0) { 7783 abd_return_buf_copy(cabd, tmp, arc_hdr_size(hdr)); 7784 arc_free_data_abd(hdr, cabd, arc_hdr_size(hdr), hdr); 7785 goto error; 7786 } 7787 7788 abd_return_buf_copy(cabd, tmp, arc_hdr_size(hdr)); 7789 arc_free_data_abd(hdr, hdr->b_l1hdr.b_pabd, 7790 arc_hdr_size(hdr), hdr); 7791 hdr->b_l1hdr.b_pabd = cabd; 7792 zio->io_abd = cabd; 7793 zio->io_size = HDR_GET_LSIZE(hdr); 7794 } 7795 7796 return (0); 7797 7798 error: 7799 return (ret); 7800 } 7801 7802 7803 /* 7804 * A read to a cache device completed. 
Validate buffer contents before 7805 * handing over to the regular ARC routines. 7806 */ 7807 static void 7808 l2arc_read_done(zio_t *zio) 7809 { 7810 int tfm_error = 0; 7811 l2arc_read_callback_t *cb = zio->io_private; 7812 arc_buf_hdr_t *hdr; 7813 kmutex_t *hash_lock; 7814 boolean_t valid_cksum; 7815 boolean_t using_rdata = (BP_IS_ENCRYPTED(&cb->l2rcb_bp) && 7816 (cb->l2rcb_flags & ZIO_FLAG_RAW_ENCRYPT)); 7817 7818 ASSERT3P(zio->io_vd, !=, NULL); 7819 ASSERT(zio->io_flags & ZIO_FLAG_DONT_PROPAGATE); 7820 7821 spa_config_exit(zio->io_spa, SCL_L2ARC, zio->io_vd); 7822 7823 ASSERT3P(cb, !=, NULL); 7824 hdr = cb->l2rcb_hdr; 7825 ASSERT3P(hdr, !=, NULL); 7826 7827 hash_lock = HDR_LOCK(hdr); 7828 mutex_enter(hash_lock); 7829 ASSERT3P(hash_lock, ==, HDR_LOCK(hdr)); 7830 7831 /* 7832 * If the data was read into a temporary buffer, 7833 * move it and free the buffer. 7834 */ 7835 if (cb->l2rcb_abd != NULL) { 7836 ASSERT3U(arc_hdr_size(hdr), <, zio->io_size); 7837 if (zio->io_error == 0) { 7838 if (using_rdata) { 7839 abd_copy(hdr->b_crypt_hdr.b_rabd, 7840 cb->l2rcb_abd, arc_hdr_size(hdr)); 7841 } else { 7842 abd_copy(hdr->b_l1hdr.b_pabd, 7843 cb->l2rcb_abd, arc_hdr_size(hdr)); 7844 } 7845 } 7846 7847 /* 7848 * The following must be done regardless of whether 7849 * there was an error: 7850 * - free the temporary buffer 7851 * - point zio to the real ARC buffer 7852 * - set zio size accordingly 7853 * These are required because zio is either re-used for 7854 * an I/O of the block in the case of the error 7855 * or the zio is passed to arc_read_done() and it 7856 * needs real data. 
7857 */ 7858 abd_free(cb->l2rcb_abd); 7859 zio->io_size = zio->io_orig_size = arc_hdr_size(hdr); 7860 7861 if (using_rdata) { 7862 ASSERT(HDR_HAS_RABD(hdr)); 7863 zio->io_abd = zio->io_orig_abd = 7864 hdr->b_crypt_hdr.b_rabd; 7865 } else { 7866 ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL); 7867 zio->io_abd = zio->io_orig_abd = hdr->b_l1hdr.b_pabd; 7868 } 7869 } 7870 7871 ASSERT3P(zio->io_abd, !=, NULL); 7872 7873 /* 7874 * Check this survived the L2ARC journey. 7875 */ 7876 ASSERT(zio->io_abd == hdr->b_l1hdr.b_pabd || 7877 (HDR_HAS_RABD(hdr) && zio->io_abd == hdr->b_crypt_hdr.b_rabd)); 7878 zio->io_bp_copy = cb->l2rcb_bp; /* XXX fix in L2ARC 2.0 */ 7879 zio->io_bp = &zio->io_bp_copy; /* XXX fix in L2ARC 2.0 */ 7880 7881 valid_cksum = arc_cksum_is_equal(hdr, zio); 7882 7883 /* 7884 * b_rabd will always match the data as it exists on disk if it is 7885 * being used. Therefore if we are reading into b_rabd we do not 7886 * attempt to untransform the data. 7887 */ 7888 if (valid_cksum && !using_rdata) 7889 tfm_error = l2arc_untransform(zio, cb); 7890 7891 if (valid_cksum && tfm_error == 0 && zio->io_error == 0 && 7892 !HDR_L2_EVICTED(hdr)) { 7893 mutex_exit(hash_lock); 7894 zio->io_private = hdr; 7895 arc_read_done(zio); 7896 } else { 7897 /* 7898 * Buffer didn't survive caching. Increment stats and 7899 * reissue to the original storage device. 7900 */ 7901 if (zio->io_error != 0) { 7902 ARCSTAT_BUMP(arcstat_l2_io_error); 7903 } else { 7904 zio->io_error = SET_ERROR(EIO); 7905 } 7906 if (!valid_cksum || tfm_error != 0) 7907 ARCSTAT_BUMP(arcstat_l2_cksum_bad); 7908 7909 /* 7910 * If there's no waiter, issue an async i/o to the primary 7911 * storage now. If there *is* a waiter, the caller must 7912 * issue the i/o in a context where it's OK to block. 7913 */ 7914 if (zio->io_waiter == NULL) { 7915 zio_t *pio = zio_unique_parent(zio); 7916 void *abd = (using_rdata) ? 
7917 hdr->b_crypt_hdr.b_rabd : hdr->b_l1hdr.b_pabd; 7918 7919 ASSERT(!pio || pio->io_child_type == ZIO_CHILD_LOGICAL); 7920 7921 zio = zio_read(pio, zio->io_spa, zio->io_bp, 7922 abd, zio->io_size, arc_read_done, 7923 hdr, zio->io_priority, cb->l2rcb_flags, 7924 &cb->l2rcb_zb); 7925 7926 /* 7927 * Original ZIO will be freed, so we need to update 7928 * ARC header with the new ZIO pointer to be used 7929 * by zio_change_priority() in arc_read(). 7930 */ 7931 for (struct arc_callback *acb = hdr->b_l1hdr.b_acb; 7932 acb != NULL; acb = acb->acb_next) 7933 acb->acb_zio_head = zio; 7934 7935 mutex_exit(hash_lock); 7936 zio_nowait(zio); 7937 } else { 7938 mutex_exit(hash_lock); 7939 } 7940 } 7941 7942 kmem_free(cb, sizeof (l2arc_read_callback_t)); 7943 } 7944 7945 /* 7946 * This is the list priority from which the L2ARC will search for pages to 7947 * cache. This is used within loops (0..3) to cycle through lists in the 7948 * desired order. This order can have a significant effect on cache 7949 * performance. 7950 * 7951 * Currently the metadata lists are hit first, MFU then MRU, followed by 7952 * the data lists. This function returns a locked list, and also returns 7953 * the lock pointer. 7954 */ 7955 static multilist_sublist_t * 7956 l2arc_sublist_lock(int list_num) 7957 { 7958 multilist_t *ml = NULL; 7959 unsigned int idx; 7960 7961 ASSERT(list_num >= 0 && list_num <= 3); 7962 7963 switch (list_num) { 7964 case 0: 7965 ml = arc_mfu->arcs_list[ARC_BUFC_METADATA]; 7966 break; 7967 case 1: 7968 ml = arc_mru->arcs_list[ARC_BUFC_METADATA]; 7969 break; 7970 case 2: 7971 ml = arc_mfu->arcs_list[ARC_BUFC_DATA]; 7972 break; 7973 case 3: 7974 ml = arc_mru->arcs_list[ARC_BUFC_DATA]; 7975 break; 7976 } 7977 7978 /* 7979 * Return a randomly-selected sublist. This is acceptable 7980 * because the caller feeds only a little bit of data for each 7981 * call (8MB). Subsequent calls will result in different 7982 * sublists being selected. 
7983 */ 7984 idx = multilist_get_random_index(ml); 7985 return (multilist_sublist_lock(ml, idx)); 7986 } 7987 7988 /* 7989 * Calculates the maximum overhead of L2ARC metadata log blocks for a given 7990 * L2ARC write size. l2arc_evict and l2arc_write_size need to include this 7991 * overhead in processing to make sure there is enough headroom available 7992 * when writing buffers. 7993 */ 7994 static inline uint64_t 7995 l2arc_log_blk_overhead(uint64_t write_sz, l2arc_dev_t *dev) 7996 { 7997 if (dev->l2ad_log_entries == 0) { 7998 return (0); 7999 } else { 8000 uint64_t log_entries = write_sz >> SPA_MINBLOCKSHIFT; 8001 8002 uint64_t log_blocks = (log_entries + 8003 dev->l2ad_log_entries - 1) / 8004 dev->l2ad_log_entries; 8005 8006 return (vdev_psize_to_asize(dev->l2ad_vdev, 8007 sizeof (l2arc_log_blk_phys_t)) * log_blocks); 8008 } 8009 } 8010 8011 /* 8012 * Evict buffers from the device write hand to the distance specified in 8013 * bytes. This distance may span populated buffers, it may span nothing. 8014 * This is clearing a region on the L2ARC device ready for writing. 8015 * If the 'all' boolean is set, every buffer is evicted. 8016 */ 8017 static void 8018 l2arc_evict(l2arc_dev_t *dev, uint64_t distance, boolean_t all) 8019 { 8020 list_t *buflist; 8021 arc_buf_hdr_t *hdr, *hdr_prev; 8022 kmutex_t *hash_lock; 8023 uint64_t taddr; 8024 l2arc_lb_ptr_buf_t *lb_ptr_buf, *lb_ptr_buf_prev; 8025 boolean_t rerun; 8026 8027 buflist = &dev->l2ad_buflist; 8028 8029 /* 8030 * We need to add in the worst case scenario of log block overhead. 8031 */ 8032 distance += l2arc_log_blk_overhead(distance, dev); 8033 8034 top: 8035 rerun = B_FALSE; 8036 if (dev->l2ad_hand >= (dev->l2ad_end - distance)) { 8037 /* 8038 * When there is no space to accommodate upcoming writes, 8039 * evict to the end. Then bump the write and evict hands 8040 * to the start and iterate. 
This iteration does not 8041 * happen indefinitely as we make sure in 8042 * l2arc_write_size() that when the write hand is reset, 8043 * the write size does not exceed the end of the device. 8044 */ 8045 rerun = B_TRUE; 8046 taddr = dev->l2ad_end; 8047 } else { 8048 taddr = dev->l2ad_hand + distance; 8049 } 8050 DTRACE_PROBE4(l2arc__evict, l2arc_dev_t *, dev, list_t *, buflist, 8051 uint64_t, taddr, boolean_t, all); 8052 8053 /* 8054 * This check has to be placed after deciding whether to iterate 8055 * (rerun). 8056 */ 8057 if (!all && dev->l2ad_first) { 8058 /* 8059 * This is the first sweep through the device. There is 8060 * nothing to evict. 8061 */ 8062 goto out; 8063 } 8064 8065 /* 8066 * When rebuilding L2ARC we retrieve the evict hand from the header of 8067 * the device. Of note, l2arc_evict() does not actually delete buffers 8068 * from the cache device, but keeping track of the evict hand will be 8069 * useful when TRIM is implemented. 8070 */ 8071 dev->l2ad_evict = MAX(dev->l2ad_evict, taddr); 8072 8073 retry: 8074 mutex_enter(&dev->l2ad_mtx); 8075 /* 8076 * We have to account for evicted log blocks. Run vdev_space_update() 8077 * on log blocks whose offset (in bytes) is before the evicted offset 8078 * (in bytes) by searching in the list of pointers to log blocks 8079 * present in the L2ARC device. 8080 */ 8081 for (lb_ptr_buf = list_tail(&dev->l2ad_lbptr_list); lb_ptr_buf; 8082 lb_ptr_buf = lb_ptr_buf_prev) { 8083 8084 lb_ptr_buf_prev = list_prev(&dev->l2ad_lbptr_list, lb_ptr_buf); 8085 8086 /* L2BLK_GET_PSIZE returns aligned size for log blocks */ 8087 uint64_t asize = L2BLK_GET_PSIZE( 8088 (lb_ptr_buf->lb_ptr)->lbp_prop); 8089 8090 /* 8091 * We don't worry about log blocks left behind (ie 8092 * lbp_payload_start < l2ad_hand) because l2arc_write_buffers() 8093 * will never write more than l2arc_evict() evicts. 
8094 */ 8095 if (!all && l2arc_log_blkptr_valid(dev, lb_ptr_buf->lb_ptr)) { 8096 break; 8097 } else { 8098 vdev_space_update(dev->l2ad_vdev, -asize, 0, 0); 8099 ARCSTAT_INCR(arcstat_l2_log_blk_asize, -asize); 8100 ARCSTAT_BUMPDOWN(arcstat_l2_log_blk_count); 8101 zfs_refcount_remove_many(&dev->l2ad_lb_asize, asize, 8102 lb_ptr_buf); 8103 zfs_refcount_remove(&dev->l2ad_lb_count, lb_ptr_buf); 8104 list_remove(&dev->l2ad_lbptr_list, lb_ptr_buf); 8105 kmem_free(lb_ptr_buf->lb_ptr, 8106 sizeof (l2arc_log_blkptr_t)); 8107 kmem_free(lb_ptr_buf, sizeof (l2arc_lb_ptr_buf_t)); 8108 } 8109 } 8110 8111 for (hdr = list_tail(buflist); hdr; hdr = hdr_prev) { 8112 hdr_prev = list_prev(buflist, hdr); 8113 8114 ASSERT(!HDR_EMPTY(hdr)); 8115 hash_lock = HDR_LOCK(hdr); 8116 8117 /* 8118 * We cannot use mutex_enter or else we can deadlock 8119 * with l2arc_write_buffers (due to swapping the order in 8120 * which the hash lock and l2ad_mtx are taken). 8121 */ 8122 if (!mutex_tryenter(hash_lock)) { 8123 /* 8124 * Missed the hash lock. Retry. 8125 */ 8126 ARCSTAT_BUMP(arcstat_l2_evict_lock_retry); 8127 mutex_exit(&dev->l2ad_mtx); 8128 mutex_enter(hash_lock); 8129 mutex_exit(hash_lock); 8130 goto retry; 8131 } 8132 8133 /* 8134 * A header can't be on this list if it doesn't have an L2 header. 8135 */ 8136 ASSERT(HDR_HAS_L2HDR(hdr)); 8137 8138 /* Ensure this header has finished being written. */ 8139 ASSERT(!HDR_L2_WRITING(hdr)); 8140 ASSERT(!HDR_L2_WRITE_HEAD(hdr)); 8141 8142 if (!all && (hdr->b_l2hdr.b_daddr >= dev->l2ad_evict || 8143 hdr->b_l2hdr.b_daddr < dev->l2ad_hand)) { 8144 /* 8145 * We've evicted to the target address, 8146 * or the end of the device. 8147 */ 8148 mutex_exit(hash_lock); 8149 break; 8150 } 8151 8152 if (!HDR_HAS_L1HDR(hdr)) { 8153 ASSERT(!HDR_L2_READING(hdr)); 8154 /* 8155 * This doesn't exist in the ARC. Destroy. 8156 * arc_hdr_destroy() will call list_remove() 8157 * and decrement arcstat_l2_lsize.
8158 */ 8159 arc_change_state(arc_anon, hdr, hash_lock); 8160 arc_hdr_destroy(hdr); 8161 } else { 8162 ASSERT(hdr->b_l1hdr.b_state != arc_l2c_only); 8163 ARCSTAT_BUMP(arcstat_l2_evict_l1cached); 8164 /* 8165 * Invalidate issued or about to be issued 8166 * reads, since we may be about to write 8167 * over this location. 8168 */ 8169 if (HDR_L2_READING(hdr)) { 8170 ARCSTAT_BUMP(arcstat_l2_evict_reading); 8171 arc_hdr_set_flags(hdr, ARC_FLAG_L2_EVICTED); 8172 } 8173 8174 arc_hdr_l2hdr_destroy(hdr); 8175 } 8176 mutex_exit(hash_lock); 8177 } 8178 mutex_exit(&dev->l2ad_mtx); 8179 8180 out: 8181 /* 8182 * We need to check if we evict all buffers, otherwise we may iterate 8183 * unnecessarily. 8184 */ 8185 if (!all && rerun) { 8186 /* 8187 * Bump device hand to the device start if it is approaching the 8188 * end. l2arc_evict() has already evicted ahead for this case. 8189 */ 8190 dev->l2ad_hand = dev->l2ad_start; 8191 dev->l2ad_evict = dev->l2ad_start; 8192 dev->l2ad_first = B_FALSE; 8193 goto top; 8194 } 8195 8196 ASSERT3U(dev->l2ad_hand + distance, <, dev->l2ad_end); 8197 if (!dev->l2ad_first) 8198 ASSERT3U(dev->l2ad_hand, <, dev->l2ad_evict); 8199 } 8200 8201 /* 8202 * Handle any abd transforms that might be required for writing to the L2ARC. 8203 * If successful, this function will always return an abd with the data 8204 * transformed as it is on disk in a new abd of asize bytes. 
8205 */ 8206 static int 8207 l2arc_apply_transforms(spa_t *spa, arc_buf_hdr_t *hdr, uint64_t asize, 8208 abd_t **abd_out) 8209 { 8210 int ret; 8211 void *tmp = NULL; 8212 abd_t *cabd = NULL, *eabd = NULL, *to_write = hdr->b_l1hdr.b_pabd; 8213 enum zio_compress compress = HDR_GET_COMPRESS(hdr); 8214 uint64_t psize = HDR_GET_PSIZE(hdr); 8215 uint64_t size = arc_hdr_size(hdr); 8216 boolean_t ismd = HDR_ISTYPE_METADATA(hdr); 8217 boolean_t bswap = (hdr->b_l1hdr.b_byteswap != DMU_BSWAP_NUMFUNCS); 8218 dsl_crypto_key_t *dck = NULL; 8219 uint8_t mac[ZIO_DATA_MAC_LEN] = { 0 }; 8220 boolean_t no_crypt = B_FALSE; 8221 8222 ASSERT((HDR_GET_COMPRESS(hdr) != ZIO_COMPRESS_OFF && 8223 !HDR_COMPRESSION_ENABLED(hdr)) || 8224 HDR_ENCRYPTED(hdr) || HDR_SHARED_DATA(hdr) || psize != asize); 8225 ASSERT3U(psize, <=, asize); 8226 8227 /* 8228 * If this data simply needs its own buffer, we simply allocate it 8229 * and copy the data. This may be done to eliminate a dependency on a 8230 * shared buffer or to reallocate the buffer to match asize. 
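 *
 * A minimal standalone sketch of this copy-and-pad step, assuming plain
 * byte buffers in place of abd_t. The name ex_copy_padded is hypothetical
 * and not part of this file: the psize payload bytes are copied into a
 * fresh buffer and the tail up to asize is zero-filled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

// Hypothetical helper for illustration only; the real code works on
// abd_t buffers via abd_alloc_for_io(), abd_copy() and abd_zero_off().
// Returns a new asize-byte buffer holding src[0..psize) followed by
// zero padding, or NULL if allocation fails. The caller frees it.
static unsigned char *
ex_copy_padded(const unsigned char *src, size_t psize, size_t asize)
{
	assert(psize <= asize);

	unsigned char *dst = malloc(asize);
	if (dst == NULL)
		return (NULL);

	memcpy(dst, src, psize);
	if (psize != asize)
		memset(dst + psize, 0, asize - psize);
	return (dst);
}
```

 * The source buffer is left untouched, which is what removes the
 * dependency on a shared buffer.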
8231 */ 8232 if (HDR_HAS_RABD(hdr) && asize != psize) { 8233 ASSERT3U(asize, >=, psize); 8234 to_write = abd_alloc_for_io(asize, ismd); 8235 abd_copy(to_write, hdr->b_crypt_hdr.b_rabd, psize); 8236 if (psize != asize) 8237 abd_zero_off(to_write, psize, asize - psize); 8238 goto out; 8239 } 8240 8241 if ((compress == ZIO_COMPRESS_OFF || HDR_COMPRESSION_ENABLED(hdr)) && 8242 !HDR_ENCRYPTED(hdr)) { 8243 ASSERT3U(size, ==, psize); 8244 to_write = abd_alloc_for_io(asize, ismd); 8245 abd_copy(to_write, hdr->b_l1hdr.b_pabd, size); 8246 if (size != asize) 8247 abd_zero_off(to_write, size, asize - size); 8248 goto out; 8249 } 8250 8251 if (compress != ZIO_COMPRESS_OFF && !HDR_COMPRESSION_ENABLED(hdr)) { 8252 cabd = abd_alloc_for_io(asize, ismd); 8253 tmp = abd_borrow_buf(cabd, asize); 8254 8255 psize = zio_compress_data(compress, to_write, tmp, size); 8256 ASSERT3U(psize, <=, HDR_GET_PSIZE(hdr)); 8257 if (psize < asize) 8258 bzero((char *)tmp + psize, asize - psize); 8259 psize = HDR_GET_PSIZE(hdr); 8260 abd_return_buf_copy(cabd, tmp, asize); 8261 to_write = cabd; 8262 } 8263 8264 if (HDR_ENCRYPTED(hdr)) { 8265 eabd = abd_alloc_for_io(asize, ismd); 8266 8267 /* 8268 * If the dataset was disowned before the buffer 8269 * made it to this point, the key to re-encrypt 8270 * it won't be available. In this case we simply 8271 * won't write the buffer to the L2ARC. 
8272 */ 8273 ret = spa_keystore_lookup_key(spa, hdr->b_crypt_hdr.b_dsobj, 8274 FTAG, &dck); 8275 if (ret != 0) 8276 goto error; 8277 8278 ret = zio_do_crypt_abd(B_TRUE, &dck->dck_key, 8279 hdr->b_crypt_hdr.b_ot, bswap, hdr->b_crypt_hdr.b_salt, 8280 hdr->b_crypt_hdr.b_iv, mac, psize, to_write, eabd, 8281 &no_crypt); 8282 if (ret != 0) 8283 goto error; 8284 8285 if (no_crypt) 8286 abd_copy(eabd, to_write, psize); 8287 8288 if (psize != asize) 8289 abd_zero_off(eabd, psize, asize - psize); 8290 8291 /* assert that the MAC we got here matches the one we saved */ 8292 ASSERT0(bcmp(mac, hdr->b_crypt_hdr.b_mac, ZIO_DATA_MAC_LEN)); 8293 spa_keystore_dsl_key_rele(spa, dck, FTAG); 8294 8295 if (to_write == cabd) 8296 abd_free(cabd); 8297 8298 to_write = eabd; 8299 } 8300 8301 out: 8302 ASSERT3P(to_write, !=, hdr->b_l1hdr.b_pabd); 8303 *abd_out = to_write; 8304 return (0); 8305 8306 error: 8307 if (dck != NULL) 8308 spa_keystore_dsl_key_rele(spa, dck, FTAG); 8309 if (cabd != NULL) 8310 abd_free(cabd); 8311 if (eabd != NULL) 8312 abd_free(eabd); 8313 8314 *abd_out = NULL; 8315 return (ret); 8316 } 8317 8318 static void 8319 l2arc_blk_fetch_done(zio_t *zio) 8320 { 8321 l2arc_read_callback_t *cb; 8322 8323 cb = zio->io_private; 8324 if (cb->l2rcb_abd != NULL) 8325 abd_put(cb->l2rcb_abd); 8326 kmem_free(cb, sizeof (l2arc_read_callback_t)); 8327 } 8328 8329 /* 8330 * Find and write ARC buffers to the L2ARC device. 8331 * 8332 * An ARC_FLAG_L2_WRITING flag is set so that the L2ARC buffers are not valid 8333 * for reading until they have completed writing. 8334 * The headroom_boost is an in-out parameter used to maintain headroom boost 8335 * state between calls to this function. 8336 * 8337 * Returns the number of bytes actually written (which may be smaller than 8338 * the delta by which the device hand has changed due to alignment and the 8339 * writing of log blocks). 
8340 */ 8341 static uint64_t 8342 l2arc_write_buffers(spa_t *spa, l2arc_dev_t *dev, uint64_t target_sz) 8343 { 8344 arc_buf_hdr_t *hdr, *hdr_prev, *head; 8345 uint64_t write_asize, write_psize, write_lsize, headroom; 8346 boolean_t full; 8347 l2arc_write_callback_t *cb = NULL; 8348 zio_t *pio, *wzio; 8349 uint64_t guid = spa_load_guid(spa); 8350 8351 ASSERT3P(dev->l2ad_vdev, !=, NULL); 8352 8353 pio = NULL; 8354 write_lsize = write_asize = write_psize = 0; 8355 full = B_FALSE; 8356 head = kmem_cache_alloc(hdr_l2only_cache, KM_PUSHPAGE); 8357 arc_hdr_set_flags(head, ARC_FLAG_L2_WRITE_HEAD | ARC_FLAG_HAS_L2HDR); 8358 8359 /* 8360 * Copy buffers for L2ARC writing. 8361 */ 8362 for (int try = 0; try <= 3; try++) { 8363 multilist_sublist_t *mls = l2arc_sublist_lock(try); 8364 uint64_t passed_sz = 0; 8365 8366 VERIFY3P(mls, !=, NULL); 8367 8368 /* 8369 * L2ARC fast warmup. 8370 * 8371 * Until the ARC is warm and starts to evict, read from the 8372 * head of the ARC lists rather than the tail. 8373 */ 8374 if (arc_warm == B_FALSE) 8375 hdr = multilist_sublist_head(mls); 8376 else 8377 hdr = multilist_sublist_tail(mls); 8378 8379 headroom = target_sz * l2arc_headroom; 8380 if (zfs_compressed_arc_enabled) 8381 headroom = (headroom * l2arc_headroom_boost) / 100; 8382 8383 for (; hdr; hdr = hdr_prev) { 8384 kmutex_t *hash_lock; 8385 abd_t *to_write = NULL; 8386 8387 if (arc_warm == B_FALSE) 8388 hdr_prev = multilist_sublist_next(mls, hdr); 8389 else 8390 hdr_prev = multilist_sublist_prev(mls, hdr); 8391 8392 hash_lock = HDR_LOCK(hdr); 8393 if (!mutex_tryenter(hash_lock)) { 8394 /* 8395 * Skip this buffer rather than waiting. 8396 */ 8397 continue; 8398 } 8399 8400 passed_sz += HDR_GET_LSIZE(hdr); 8401 if (l2arc_headroom != 0 && passed_sz > headroom) { 8402 /* 8403 * Searched too far. 
8404 */ 8405 mutex_exit(hash_lock); 8406 break; 8407 } 8408 8409 if (!l2arc_write_eligible(guid, hdr)) { 8410 mutex_exit(hash_lock); 8411 continue; 8412 } 8413 8414 /* 8415 * We rely on the L1 portion of the header below, so 8416 * it's invalid for this header to have been evicted out 8417 * of the ghost cache, prior to being written out. The 8418 * ARC_FLAG_L2_WRITING bit ensures this won't happen. 8419 */ 8420 ASSERT(HDR_HAS_L1HDR(hdr)); 8421 8422 ASSERT3U(HDR_GET_PSIZE(hdr), >, 0); 8423 ASSERT3U(arc_hdr_size(hdr), >, 0); 8424 ASSERT(hdr->b_l1hdr.b_pabd != NULL || 8425 HDR_HAS_RABD(hdr)); 8426 uint64_t psize = HDR_GET_PSIZE(hdr); 8427 uint64_t asize = vdev_psize_to_asize(dev->l2ad_vdev, 8428 psize); 8429 8430 if ((write_asize + asize) > target_sz) { 8431 full = B_TRUE; 8432 mutex_exit(hash_lock); 8433 break; 8434 } 8435 8436 /* 8437 * We rely on the L1 portion of the header below, so 8438 * it's invalid for this header to have been evicted out 8439 * of the ghost cache, prior to being written out. The 8440 * ARC_FLAG_L2_WRITING bit ensures this won't happen. 8441 */ 8442 arc_hdr_set_flags(hdr, ARC_FLAG_L2_WRITING); 8443 ASSERT(HDR_HAS_L1HDR(hdr)); 8444 8445 ASSERT3U(HDR_GET_PSIZE(hdr), >, 0); 8446 ASSERT(hdr->b_l1hdr.b_pabd != NULL || 8447 HDR_HAS_RABD(hdr)); 8448 ASSERT3U(arc_hdr_size(hdr), >, 0); 8449 8450 /* 8451 * If this header has b_rabd, we can use this since it 8452 * must always match the data exactly as it exists on 8453 * disk. Otherwise, the L2ARC can normally use the 8454 * hdr's data, but if we're sharing data between the 8455 * hdr and one of its bufs, L2ARC needs its own copy of 8456 * the data so that the ZIO below can't race with the 8457 * buf consumer. To ensure that this copy will be 8458 * available for the lifetime of the ZIO and be cleaned 8459 * up afterwards, we add it to the l2arc_free_on_write 8460 * queue. If we need to apply any transforms to the 8461 * data (compression, encryption) we will also need the 8462 * extra buffer. 
8463 */ 8464 if (HDR_HAS_RABD(hdr) && psize == asize) { 8465 to_write = hdr->b_crypt_hdr.b_rabd; 8466 } else if ((HDR_COMPRESSION_ENABLED(hdr) || 8467 HDR_GET_COMPRESS(hdr) == ZIO_COMPRESS_OFF) && 8468 !HDR_ENCRYPTED(hdr) && !HDR_SHARED_DATA(hdr) && 8469 psize == asize) { 8470 to_write = hdr->b_l1hdr.b_pabd; 8471 } else { 8472 int ret; 8473 arc_buf_contents_t type = arc_buf_type(hdr); 8474 8475 ret = l2arc_apply_transforms(spa, hdr, asize, 8476 &to_write); 8477 if (ret != 0) { 8478 arc_hdr_clear_flags(hdr, 8479 ARC_FLAG_L2_WRITING); 8480 mutex_exit(hash_lock); 8481 continue; 8482 } 8483 8484 l2arc_free_abd_on_write(to_write, asize, type); 8485 } 8486 8487 if (pio == NULL) { 8488 /* 8489 * Insert a dummy header on the buflist so 8490 * l2arc_write_done() can find where the 8491 * write buffers begin without searching. 8492 */ 8493 mutex_enter(&dev->l2ad_mtx); 8494 list_insert_head(&dev->l2ad_buflist, head); 8495 mutex_exit(&dev->l2ad_mtx); 8496 8497 cb = kmem_alloc( 8498 sizeof (l2arc_write_callback_t), KM_SLEEP); 8499 cb->l2wcb_dev = dev; 8500 cb->l2wcb_head = head; 8501 /* 8502 * Create a list to save allocated abd buffers 8503 * for l2arc_log_blk_commit(). 
8504 */ 8505 list_create(&cb->l2wcb_abd_list, 8506 sizeof (l2arc_lb_abd_buf_t), 8507 offsetof(l2arc_lb_abd_buf_t, node)); 8508 pio = zio_root(spa, l2arc_write_done, cb, 8509 ZIO_FLAG_CANFAIL); 8510 } 8511 8512 hdr->b_l2hdr.b_dev = dev; 8513 hdr->b_l2hdr.b_daddr = dev->l2ad_hand; 8514 arc_hdr_set_flags(hdr, 8515 ARC_FLAG_L2_WRITING | ARC_FLAG_HAS_L2HDR); 8516 8517 mutex_enter(&dev->l2ad_mtx); 8518 list_insert_head(&dev->l2ad_buflist, hdr); 8519 mutex_exit(&dev->l2ad_mtx); 8520 8521 (void) zfs_refcount_add_many(&dev->l2ad_alloc, 8522 arc_hdr_size(hdr), hdr); 8523 8524 wzio = zio_write_phys(pio, dev->l2ad_vdev, 8525 hdr->b_l2hdr.b_daddr, asize, to_write, 8526 ZIO_CHECKSUM_OFF, NULL, hdr, 8527 ZIO_PRIORITY_ASYNC_WRITE, 8528 ZIO_FLAG_CANFAIL, B_FALSE); 8529 8530 write_lsize += HDR_GET_LSIZE(hdr); 8531 DTRACE_PROBE2(l2arc__write, vdev_t *, dev->l2ad_vdev, 8532 zio_t *, wzio); 8533 8534 write_psize += psize; 8535 write_asize += asize; 8536 dev->l2ad_hand += asize; 8537 vdev_space_update(dev->l2ad_vdev, asize, 0, 0); 8538 8539 mutex_exit(hash_lock); 8540 8541 /* 8542 * Append buf info to current log and commit if full. 8543 * arcstat_l2_{size,asize} kstats are updated 8544 * internally. 8545 */ 8546 if (l2arc_log_blk_insert(dev, hdr)) 8547 l2arc_log_blk_commit(dev, pio, cb); 8548 8549 (void) zio_nowait(wzio); 8550 } 8551 8552 multilist_sublist_unlock(mls); 8553 8554 if (full == B_TRUE) 8555 break; 8556 } 8557 8558 /* No buffers selected for writing? */ 8559 if (pio == NULL) { 8560 ASSERT0(write_lsize); 8561 ASSERT(!HDR_HAS_L1HDR(head)); 8562 kmem_cache_free(hdr_l2only_cache, head); 8563 8564 /* 8565 * Although we did not write any buffers l2ad_evict may 8566 * have advanced. 
8567 */ 8568 l2arc_dev_hdr_update(dev); 8569 8570 return (0); 8571 } 8572 8573 if (!dev->l2ad_first) 8574 ASSERT3U(dev->l2ad_hand, <=, dev->l2ad_evict); 8575 8576 ASSERT3U(write_asize, <=, target_sz); 8577 ARCSTAT_BUMP(arcstat_l2_writes_sent); 8578 ARCSTAT_INCR(arcstat_l2_write_bytes, write_psize); 8579 ARCSTAT_INCR(arcstat_l2_lsize, write_lsize); 8580 ARCSTAT_INCR(arcstat_l2_psize, write_psize); 8581 8582 dev->l2ad_writing = B_TRUE; 8583 (void) zio_wait(pio); 8584 dev->l2ad_writing = B_FALSE; 8585 8586 /* 8587 * Update the device header after the zio completes as 8588 * l2arc_write_done() may have updated the memory holding the log block 8589 * pointers in the device header. 8590 */ 8591 l2arc_dev_hdr_update(dev); 8592 8593 return (write_asize); 8594 } 8595 8596 /* 8597 * This thread feeds the L2ARC at regular intervals. This is the beating 8598 * heart of the L2ARC. 8599 */ 8600 /* ARGSUSED */ 8601 static void 8602 l2arc_feed_thread(void *unused) 8603 { 8604 callb_cpr_t cpr; 8605 l2arc_dev_t *dev; 8606 spa_t *spa; 8607 uint64_t size, wrote; 8608 clock_t begin, next = ddi_get_lbolt(); 8609 8610 CALLB_CPR_INIT(&cpr, &l2arc_feed_thr_lock, callb_generic_cpr, FTAG); 8611 8612 mutex_enter(&l2arc_feed_thr_lock); 8613 8614 while (l2arc_thread_exit == 0) { 8615 CALLB_CPR_SAFE_BEGIN(&cpr); 8616 (void) cv_timedwait(&l2arc_feed_thr_cv, &l2arc_feed_thr_lock, 8617 next); 8618 CALLB_CPR_SAFE_END(&cpr, &l2arc_feed_thr_lock); 8619 next = ddi_get_lbolt() + hz; 8620 8621 /* 8622 * Quick check for L2ARC devices. 8623 */ 8624 mutex_enter(&l2arc_dev_mtx); 8625 if (l2arc_ndev == 0) { 8626 mutex_exit(&l2arc_dev_mtx); 8627 continue; 8628 } 8629 mutex_exit(&l2arc_dev_mtx); 8630 begin = ddi_get_lbolt(); 8631 8632 /* 8633 * This selects the next l2arc device to write to, and in 8634 * doing so the next spa to feed from: dev->l2ad_spa. This 8635 * will return NULL if there are now no l2arc devices or if 8636 * they are all faulted. 
8637 * 8638 * If a device is returned, its spa's config lock is also 8639 * held to prevent device removal. l2arc_dev_get_next() 8640 * will grab and release l2arc_dev_mtx. 8641 */ 8642 if ((dev = l2arc_dev_get_next()) == NULL) 8643 continue; 8644 8645 spa = dev->l2ad_spa; 8646 ASSERT3P(spa, !=, NULL); 8647 8648 /* 8649 * If the pool is read-only then force the feed thread to 8650 * sleep a little longer. 8651 */ 8652 if (!spa_writeable(spa)) { 8653 next = ddi_get_lbolt() + 5 * l2arc_feed_secs * hz; 8654 spa_config_exit(spa, SCL_L2ARC, dev); 8655 continue; 8656 } 8657 8658 /* 8659 * Avoid contributing to memory pressure. 8660 */ 8661 if (arc_reclaim_needed()) { 8662 ARCSTAT_BUMP(arcstat_l2_abort_lowmem); 8663 spa_config_exit(spa, SCL_L2ARC, dev); 8664 continue; 8665 } 8666 8667 ARCSTAT_BUMP(arcstat_l2_feeds); 8668 8669 size = l2arc_write_size(dev); 8670 8671 /* 8672 * Evict L2ARC buffers that will be overwritten. 8673 */ 8674 l2arc_evict(dev, size, B_FALSE); 8675 8676 /* 8677 * Write ARC buffers. 8678 */ 8679 wrote = l2arc_write_buffers(spa, dev, size); 8680 8681 /* 8682 * Calculate interval between writes. 8683 */ 8684 next = l2arc_write_interval(begin, size, wrote); 8685 spa_config_exit(spa, SCL_L2ARC, dev); 8686 } 8687 8688 l2arc_thread_exit = 0; 8689 cv_broadcast(&l2arc_feed_thr_cv); 8690 CALLB_CPR_EXIT(&cpr); /* drops l2arc_feed_thr_lock */ 8691 thread_exit(); 8692 } 8693 8694 boolean_t 8695 l2arc_vdev_present(vdev_t *vd) 8696 { 8697 return (l2arc_vdev_get(vd) != NULL); 8698 } 8699 8700 /* 8701 * Returns the l2arc_dev_t associated with a particular vdev_t or NULL if 8702 * the vdev_t isn't an L2ARC device. 
8703 */ 8704 static l2arc_dev_t * 8705 l2arc_vdev_get(vdev_t *vd) 8706 { 8707 l2arc_dev_t *dev; 8708 8709 mutex_enter(&l2arc_dev_mtx); 8710 for (dev = list_head(l2arc_dev_list); dev != NULL; 8711 dev = list_next(l2arc_dev_list, dev)) { 8712 if (dev->l2ad_vdev == vd) 8713 break; 8714 } 8715 mutex_exit(&l2arc_dev_mtx); 8716 8717 return (dev); 8718 } 8719 8720 /* 8721 * Add a vdev for use by the L2ARC. By this point the spa has already 8722 * validated the vdev and opened it. 8723 */ 8724 void 8725 l2arc_add_vdev(spa_t *spa, vdev_t *vd) 8726 { 8727 l2arc_dev_t *adddev; 8728 uint64_t l2dhdr_asize; 8729 8730 ASSERT(!l2arc_vdev_present(vd)); 8731 8732 /* 8733 * Create a new l2arc device entry. 8734 */ 8735 adddev = kmem_zalloc(sizeof (l2arc_dev_t), KM_SLEEP); 8736 adddev->l2ad_spa = spa; 8737 adddev->l2ad_vdev = vd; 8738 /* leave extra size for an l2arc device header */ 8739 l2dhdr_asize = adddev->l2ad_dev_hdr_asize = 8740 MAX(sizeof (*adddev->l2ad_dev_hdr), 1 << vd->vdev_ashift); 8741 adddev->l2ad_start = VDEV_LABEL_START_SIZE + l2dhdr_asize; 8742 adddev->l2ad_end = VDEV_LABEL_START_SIZE + vdev_get_min_asize(vd); 8743 ASSERT3U(adddev->l2ad_start, <, adddev->l2ad_end); 8744 adddev->l2ad_hand = adddev->l2ad_start; 8745 adddev->l2ad_evict = adddev->l2ad_start; 8746 adddev->l2ad_first = B_TRUE; 8747 adddev->l2ad_writing = B_FALSE; 8748 adddev->l2ad_dev_hdr = kmem_zalloc(l2dhdr_asize, KM_SLEEP); 8749 8750 mutex_init(&adddev->l2ad_mtx, NULL, MUTEX_DEFAULT, NULL); 8751 /* 8752 * This is a list of all ARC buffers that are still valid on the 8753 * device. 8754 */ 8755 list_create(&adddev->l2ad_buflist, sizeof (arc_buf_hdr_t), 8756 offsetof(arc_buf_hdr_t, b_l2hdr.b_l2node)); 8757 8758 /* 8759 * This is a list of pointers to log blocks that are still present 8760 * on the device. 
	 */
	list_create(&adddev->l2ad_lbptr_list, sizeof (l2arc_lb_ptr_buf_t),
	    offsetof(l2arc_lb_ptr_buf_t, node));

	vdev_space_update(vd, 0, 0, adddev->l2ad_end - adddev->l2ad_hand);
	zfs_refcount_create(&adddev->l2ad_alloc);
	zfs_refcount_create(&adddev->l2ad_lb_asize);
	zfs_refcount_create(&adddev->l2ad_lb_count);

	/*
	 * Add device to global list
	 */
	mutex_enter(&l2arc_dev_mtx);
	list_insert_head(l2arc_dev_list, adddev);
	atomic_inc_64(&l2arc_ndev);
	mutex_exit(&l2arc_dev_mtx);

	/*
	 * Decide if vdev is eligible for L2ARC rebuild
	 */
	l2arc_rebuild_vdev(adddev->l2ad_vdev, B_FALSE);
}

/*
 * Decide whether a cache vdev is eligible for a persistent L2ARC rebuild.
 * If its device header is valid and the device can hold log blocks, mark
 * it as pending rebuild; otherwise, if the pool is writeable, write a
 * fresh device header.
 */
void
l2arc_rebuild_vdev(vdev_t *vd, boolean_t reopen)
{
	l2arc_dev_t *dev = NULL;
	l2arc_dev_hdr_phys_t *l2dhdr;
	uint64_t l2dhdr_asize;
	spa_t *spa;
	int err;
	boolean_t l2dhdr_valid = B_TRUE;

	dev = l2arc_vdev_get(vd);
	ASSERT3P(dev, !=, NULL);
	spa = dev->l2ad_spa;
	l2dhdr = dev->l2ad_dev_hdr;
	l2dhdr_asize = dev->l2ad_dev_hdr_asize;

	/*
	 * The L2ARC has to hold at least the payload of one log block for
	 * them to be restored (persistent L2ARC). The payload of a log block
	 * depends on the number of its log entries. We always write log blocks
	 * with 1022 entries. How many of them are committed or restored depends
	 * on the size of the L2ARC device. Thus the maximum payload of
	 * one log block is 1022 * SPA_MAXBLOCKSIZE = 16GB. If the L2ARC device
	 * is less than that, we reduce the number of committed and restored
	 * log entries per block so as to enable persistence.
	 */
	if (dev->l2ad_end < l2arc_rebuild_blocks_min_l2size) {
		dev->l2ad_log_entries = 0;
	} else {
		dev->l2ad_log_entries = MIN((dev->l2ad_end -
		    dev->l2ad_start) >> SPA_MAXBLOCKSHIFT,
		    L2ARC_LOG_BLK_MAX_ENTRIES);
	}

	/*
	 * Read the device header, if an error is returned do not rebuild L2ARC.
	 */
	if ((err = l2arc_dev_hdr_read(dev)) != 0)
		l2dhdr_valid = B_FALSE;

	if (l2dhdr_valid && dev->l2ad_log_entries > 0) {
		/*
		 * If we are onlining a cache device (vdev_reopen) that was
		 * still present (l2arc_vdev_present()) and rebuild is enabled,
		 * we should evict all ARC buffers and pointers to log blocks
		 * and reclaim their space before restoring its contents to
		 * L2ARC.
		 */
		if (reopen) {
			if (!l2arc_rebuild_enabled) {
				return;
			} else {
				l2arc_evict(dev, 0, B_TRUE);
				/* start a new log block */
				dev->l2ad_log_ent_idx = 0;
				dev->l2ad_log_blk_payload_asize = 0;
				dev->l2ad_log_blk_payload_start = 0;
			}
		}
		/*
		 * Just mark the device as pending for a rebuild. We won't
		 * be starting a rebuild in line here as it would block pool
		 * import. Instead spa_load_impl will hand that off to an
		 * async task which will call l2arc_spa_rebuild_start.
		 */
		dev->l2ad_rebuild = B_TRUE;
	} else if (spa_writeable(spa)) {
		/*
		 * In this case create a new header. We zero out the memory
		 * holding the header to reset dh_start_lbps.
		 */
		bzero(l2dhdr, l2dhdr_asize);
		l2arc_dev_hdr_update(dev);
	}
}

/*
 * Remove a vdev from the L2ARC.
 */
void
l2arc_remove_vdev(vdev_t *vd)
{
	l2arc_dev_t *remdev = NULL;

	/*
	 * Find the device by vdev
	 */
	remdev = l2arc_vdev_get(vd);
	ASSERT3P(remdev, !=, NULL);

	/*
	 * Cancel any ongoing or scheduled rebuild.
	 */
	mutex_enter(&l2arc_rebuild_thr_lock);
	if (remdev->l2ad_rebuild_began == B_TRUE) {
		remdev->l2ad_rebuild_cancel = B_TRUE;
		while (remdev->l2ad_rebuild == B_TRUE)
			cv_wait(&l2arc_rebuild_thr_cv, &l2arc_rebuild_thr_lock);
	}
	mutex_exit(&l2arc_rebuild_thr_lock);

	/*
	 * Remove device from global list
	 */
	mutex_enter(&l2arc_dev_mtx);
	list_remove(l2arc_dev_list, remdev);
	l2arc_dev_last = NULL;		/* may have been invalidated */
	atomic_dec_64(&l2arc_ndev);
	mutex_exit(&l2arc_dev_mtx);

	/*
	 * Clear all buflists and ARC references.  L2ARC device flush.
	 */
	l2arc_evict(remdev, 0, B_TRUE);
	list_destroy(&remdev->l2ad_buflist);
	ASSERT(list_is_empty(&remdev->l2ad_lbptr_list));
	list_destroy(&remdev->l2ad_lbptr_list);
	mutex_destroy(&remdev->l2ad_mtx);
	zfs_refcount_destroy(&remdev->l2ad_alloc);
	zfs_refcount_destroy(&remdev->l2ad_lb_asize);
	zfs_refcount_destroy(&remdev->l2ad_lb_count);
	kmem_free(remdev->l2ad_dev_hdr, remdev->l2ad_dev_hdr_asize);
	kmem_free(remdev, sizeof (l2arc_dev_t));
}

/*
 * Initialize global L2ARC state: counters, locks, condition variables and
 * the device and free-on-write lists.
 */
void
l2arc_init(void)
{
	l2arc_thread_exit = 0;
	l2arc_ndev = 0;
	l2arc_writes_sent = 0;
	l2arc_writes_done = 0;

	mutex_init(&l2arc_feed_thr_lock, NULL, MUTEX_DEFAULT, NULL);
	cv_init(&l2arc_feed_thr_cv, NULL, CV_DEFAULT, NULL);
	mutex_init(&l2arc_rebuild_thr_lock, NULL, MUTEX_DEFAULT, NULL);
	cv_init(&l2arc_rebuild_thr_cv, NULL, CV_DEFAULT, NULL);
	mutex_init(&l2arc_dev_mtx, NULL, MUTEX_DEFAULT, NULL);
	mutex_init(&l2arc_free_on_write_mtx, NULL, MUTEX_DEFAULT, NULL);

	l2arc_dev_list = &L2ARC_dev_list;
	l2arc_free_on_write = &L2ARC_free_on_write;
	list_create(l2arc_dev_list, sizeof (l2arc_dev_t),
	    offsetof(l2arc_dev_t, l2ad_node));
	list_create(l2arc_free_on_write, sizeof (l2arc_data_free_t),
	    offsetof(l2arc_data_free_t, l2df_list_node));
}

void
l2arc_fini(void)
{
	/*
	 * This is called from dmu_fini(), which is called from spa_fini();
	 * Because of this, we can assume that all l2arc devices have
	 * already been removed when the pools themselves were removed.
	 */

	l2arc_do_free_on_write();

	mutex_destroy(&l2arc_feed_thr_lock);
	cv_destroy(&l2arc_feed_thr_cv);
	mutex_destroy(&l2arc_rebuild_thr_lock);
	cv_destroy(&l2arc_rebuild_thr_cv);
	mutex_destroy(&l2arc_dev_mtx);
	mutex_destroy(&l2arc_free_on_write_mtx);

	list_destroy(l2arc_dev_list);
	list_destroy(l2arc_free_on_write);
}

/*
 * Start the L2ARC feed thread.  Does nothing unless spa_mode_global
 * includes FWRITE.
 */
void
l2arc_start(void)
{
	if (!(spa_mode_global & FWRITE))
		return;

	(void) thread_create(NULL, 0, l2arc_feed_thread, NULL, 0, &p0,
	    TS_RUN, minclsyspri);
}

/*
 * Ask the L2ARC feed thread to exit and wait until it has done so.  Does
 * nothing unless spa_mode_global includes FWRITE.
 */
void
l2arc_stop(void)
{
	if (!(spa_mode_global & FWRITE))
		return;

	mutex_enter(&l2arc_feed_thr_lock);
	cv_signal(&l2arc_feed_thr_cv);	/* kick thread out of startup */
	l2arc_thread_exit = 1;
	while (l2arc_thread_exit != 0)
		cv_wait(&l2arc_feed_thr_cv, &l2arc_feed_thr_lock);
	mutex_exit(&l2arc_feed_thr_lock);
}

/*
 * Punches out rebuild threads for the L2ARC devices in a spa. This should
 * be called after pool import from the spa async thread, since starting
 * these threads directly from spa_import() will make them part of the
 * "zpool import" context and delay process exit (and thus pool import).
 */
void
l2arc_spa_rebuild_start(spa_t *spa)
{
	ASSERT(MUTEX_HELD(&spa_namespace_lock));

	/*
	 * Locate the spa's l2arc devices and kick off rebuild threads.
	 */
	for (int i = 0; i < spa->spa_l2cache.sav_count; i++) {
		l2arc_dev_t *dev =
		    l2arc_vdev_get(spa->spa_l2cache.sav_vdevs[i]);
		if (dev == NULL) {
			/* Don't attempt a rebuild if the vdev is UNAVAIL */
			continue;
		}
		mutex_enter(&l2arc_rebuild_thr_lock);
		if (dev->l2ad_rebuild && !dev->l2ad_rebuild_cancel) {
			dev->l2ad_rebuild_began = B_TRUE;
			(void) thread_create(NULL, 0,
			    (void (*)(void *))l2arc_dev_rebuild_start,
			    dev, 0, &p0, TS_RUN, minclsyspri);
		}
		mutex_exit(&l2arc_rebuild_thr_lock);
	}
}

/*
 * Main entry point for L2ARC rebuilding.
 */
static void
l2arc_dev_rebuild_start(l2arc_dev_t *dev)
{
	VERIFY(!dev->l2ad_rebuild_cancel);
	VERIFY(dev->l2ad_rebuild);
	(void) l2arc_rebuild(dev);
	mutex_enter(&l2arc_rebuild_thr_lock);
	dev->l2ad_rebuild_began = B_FALSE;
	dev->l2ad_rebuild = B_FALSE;
	mutex_exit(&l2arc_rebuild_thr_lock);

	thread_exit();
}

/*
 * This function implements the actual L2ARC metadata rebuild. It:
 * starts reading the log block chain and restores each block's contents
 * to memory (reconstructing arc_buf_hdr_t's).
 *
 * Operation stops under any of the following conditions:
 *
 * 1) We reach the end of the log block chain.
 * 2) We encounter *any* error condition (cksum errors, io errors)
 */
static int
l2arc_rebuild(l2arc_dev_t *dev)
{
	vdev_t *vd = dev->l2ad_vdev;
	spa_t *spa = vd->vdev_spa;
	int err = 0;
	l2arc_dev_hdr_phys_t *l2dhdr = dev->l2ad_dev_hdr;
	l2arc_log_blk_phys_t *this_lb, *next_lb;
	zio_t *this_io = NULL, *next_io = NULL;
	l2arc_log_blkptr_t lbps[2];
	l2arc_lb_ptr_buf_t *lb_ptr_buf;
	boolean_t lock_held;

	this_lb = kmem_zalloc(sizeof (*this_lb), KM_SLEEP);
	next_lb = kmem_zalloc(sizeof (*next_lb), KM_SLEEP);

	/*
	 * We prevent device removal while issuing reads to the device,
	 * then during the rebuilding phases we drop this lock again so
	 * that a spa_unload or device remove can be initiated - this is
	 * safe, because the spa will signal us to stop before removing
	 * our device and wait for us to stop.
	 */
	spa_config_enter(spa, SCL_L2ARC, vd, RW_READER);
	lock_held = B_TRUE;

	/*
	 * Retrieve the persistent L2ARC device state.
	 * L2BLK_GET_PSIZE returns aligned size for log blocks.
	 */
	dev->l2ad_evict = MAX(l2dhdr->dh_evict, dev->l2ad_start);
	dev->l2ad_hand = MAX(l2dhdr->dh_start_lbps[0].lbp_daddr +
	    L2BLK_GET_PSIZE((&l2dhdr->dh_start_lbps[0])->lbp_prop),
	    dev->l2ad_start);
	dev->l2ad_first = !!(l2dhdr->dh_flags & L2ARC_DEV_HDR_EVICT_FIRST);

	/*
	 * In case the zfs module parameter l2arc_rebuild_enabled is false
	 * we do not start the rebuild process.
	 */
	if (!l2arc_rebuild_enabled)
		goto out;

	/* Prepare the rebuild process */
	bcopy(l2dhdr->dh_start_lbps, lbps, sizeof (lbps));

	/* Start the rebuild process */
	for (;;) {
		if (!l2arc_log_blkptr_valid(dev, &lbps[0]))
			break;

		if ((err = l2arc_log_blk_read(dev, &lbps[0], &lbps[1],
		    this_lb, next_lb, this_io, &next_io)) != 0)
			goto out;

		/*
		 * Our memory pressure valve.
		 * If the system is running low
		 * on memory, rather than swamping memory with new ARC buf
		 * hdrs, we opt not to rebuild the L2ARC. At this point,
		 * however, we have already set up our L2ARC dev to chain in
		 * new metadata log blocks, so the user may choose to offline/
		 * online the L2ARC dev at a later time (or re-import the pool)
		 * to reconstruct it (when there's less memory pressure).
		 */
		if (arc_reclaim_needed()) {
			ARCSTAT_BUMP(arcstat_l2_rebuild_abort_lowmem);
			cmn_err(CE_NOTE, "System running low on memory, "
			    "aborting L2ARC rebuild.");
			err = SET_ERROR(ENOMEM);
			goto out;
		}

		spa_config_exit(spa, SCL_L2ARC, vd);
		lock_held = B_FALSE;

		/*
		 * Now that we know that the next_lb checks out alright, we
		 * can start reconstruction from this log block.
		 * L2BLK_GET_PSIZE returns aligned size for log blocks.
		 */
		uint64_t asize = L2BLK_GET_PSIZE((&lbps[0])->lbp_prop);
		l2arc_log_blk_restore(dev, this_lb, asize, lbps[0].lbp_daddr);

		/*
		 * log block restored, include its pointer in the list of
		 * pointers to log blocks present in the L2ARC device.
		 */
		lb_ptr_buf = kmem_zalloc(sizeof (l2arc_lb_ptr_buf_t), KM_SLEEP);
		lb_ptr_buf->lb_ptr = kmem_zalloc(sizeof (l2arc_log_blkptr_t),
		    KM_SLEEP);
		bcopy(&lbps[0], lb_ptr_buf->lb_ptr,
		    sizeof (l2arc_log_blkptr_t));
		mutex_enter(&dev->l2ad_mtx);
		list_insert_tail(&dev->l2ad_lbptr_list, lb_ptr_buf);
		ARCSTAT_INCR(arcstat_l2_log_blk_asize, asize);
		ARCSTAT_BUMP(arcstat_l2_log_blk_count);
		zfs_refcount_add_many(&dev->l2ad_lb_asize, asize, lb_ptr_buf);
		zfs_refcount_add(&dev->l2ad_lb_count, lb_ptr_buf);
		mutex_exit(&dev->l2ad_mtx);
		vdev_space_update(vd, asize, 0, 0);

		/* BEGIN CSTYLED */
		/*
		 * Protection against loops of log blocks:
		 *
		 *                                       l2ad_hand  l2ad_evict
		 *                                         V          V
		 * l2ad_start |=======================================| l2ad_end
		 *             -----|||----|||---|||----|||
		 *                  (3)    (2)   (1)    (0)
		 *             ---|||---|||----|||---|||
		 *                (7)   (6)    (5)   (4)
		 *
		 * In this situation the pointer of log block (4) passes
		 * l2arc_log_blkptr_valid() but the log block should not be
		 * restored as it is overwritten by the payload of log block
		 * (0). Only log blocks (0)-(3) should be restored. We check
		 * whether l2ad_evict lies in between the payload starting
		 * offset of the next log block (lbps[1].lbp_payload_start)
		 * and the payload starting offset of the present log block
		 * (lbps[0].lbp_payload_start). If true and this isn't the
		 * first pass, we are looping from the beginning and we should
		 * stop.
		 */
		/* END CSTYLED */
		if (l2arc_range_check_overlap(lbps[1].lbp_payload_start,
		    lbps[0].lbp_payload_start, dev->l2ad_evict) &&
		    !dev->l2ad_first)
			goto out;

		for (;;) {
			mutex_enter(&l2arc_rebuild_thr_lock);
			if (dev->l2ad_rebuild_cancel) {
				dev->l2ad_rebuild = B_FALSE;
				cv_signal(&l2arc_rebuild_thr_cv);
				mutex_exit(&l2arc_rebuild_thr_lock);
				err = SET_ERROR(ECANCELED);
				goto out;
			}
			mutex_exit(&l2arc_rebuild_thr_lock);
			if (spa_config_tryenter(spa, SCL_L2ARC, vd,
			    RW_READER)) {
				lock_held = B_TRUE;
				break;
			}
			/*
			 * L2ARC config lock held by somebody in writer,
			 * possibly due to them trying to remove us. They'll
			 * likely want us to shut down, so after a little
			 * delay, we check l2ad_rebuild_cancel and retry
			 * the lock again.
			 */
			delay(1);
		}

		/*
		 * Continue with the next log block.
		 */
		lbps[0] = lbps[1];
		lbps[1] = this_lb->lb_prev_lbp;
		PTR_SWAP(this_lb, next_lb);
		this_io = next_io;
		next_io = NULL;
	}

	if (this_io != NULL)
		l2arc_log_blk_fetch_abort(this_io);
out:
	if (next_io != NULL)
		l2arc_log_blk_fetch_abort(next_io);
	kmem_free(this_lb, sizeof (*this_lb));
	kmem_free(next_lb, sizeof (*next_lb));

	if (!l2arc_rebuild_enabled) {
		spa_history_log_internal(spa, "L2ARC rebuild", NULL,
		    "disabled");
	} else if (err == 0 && zfs_refcount_count(&dev->l2ad_lb_count) > 0) {
		ARCSTAT_BUMP(arcstat_l2_rebuild_success);
		spa_history_log_internal(spa, "L2ARC rebuild", NULL,
		    "successful, restored %llu blocks",
		    (u_longlong_t)zfs_refcount_count(&dev->l2ad_lb_count));
	} else if (err == 0 && zfs_refcount_count(&dev->l2ad_lb_count) == 0) {
		/*
		 * No error but also nothing restored, meaning the lbps array
		 * in the device header points to invalid/non-present log
		 * blocks. Reset the header.
		 */
		spa_history_log_internal(spa, "L2ARC rebuild", NULL,
		    "no valid log blocks");
		bzero(l2dhdr, dev->l2ad_dev_hdr_asize);
		l2arc_dev_hdr_update(dev);
	} else if (err != 0) {
		spa_history_log_internal(spa, "L2ARC rebuild", NULL,
		    "aborted, restored %llu blocks",
		    (u_longlong_t)zfs_refcount_count(&dev->l2ad_lb_count));
	}

	if (lock_held)
		spa_config_exit(spa, SCL_L2ARC, vd);

	return (err);
}

/*
 * Attempts to read the device header on the provided L2ARC device and stores
 * it in dev->l2ad_dev_hdr. On success, this function returns 0, otherwise the
 * appropriate error code is returned.
 */
static int
l2arc_dev_hdr_read(l2arc_dev_t *dev)
{
	int err;
	uint64_t guid;
	l2arc_dev_hdr_phys_t *l2dhdr = dev->l2ad_dev_hdr;
	const uint64_t l2dhdr_asize = dev->l2ad_dev_hdr_asize;
	abd_t *abd;

	guid = spa_guid(dev->l2ad_vdev->vdev_spa);

	abd = abd_get_from_buf(l2dhdr, l2dhdr_asize);

	err = zio_wait(zio_read_phys(NULL, dev->l2ad_vdev,
	    VDEV_LABEL_START_SIZE, l2dhdr_asize, abd,
	    ZIO_CHECKSUM_LABEL, NULL, NULL, ZIO_PRIORITY_ASYNC_READ,
	    ZIO_FLAG_DONT_CACHE | ZIO_FLAG_CANFAIL |
	    ZIO_FLAG_DONT_PROPAGATE | ZIO_FLAG_DONT_RETRY |
	    ZIO_FLAG_SPECULATIVE, B_FALSE));

	abd_put(abd);

	if (err != 0) {
		ARCSTAT_BUMP(arcstat_l2_rebuild_abort_dh_errors);
		zfs_dbgmsg("L2ARC IO error (%d) while reading device header, "
		    "vdev guid: %llu", err, dev->l2ad_vdev->vdev_guid);
		return (err);
	}

	if (l2dhdr->dh_magic == BSWAP_64(L2ARC_DEV_HDR_MAGIC))
		byteswap_uint64_array(l2dhdr, sizeof (*l2dhdr));

	if (l2dhdr->dh_magic != L2ARC_DEV_HDR_MAGIC ||
	    l2dhdr->dh_spa_guid != guid ||
	    l2dhdr->dh_vdev_guid != dev->l2ad_vdev->vdev_guid ||
	    l2dhdr->dh_version != L2ARC_PERSISTENT_VERSION ||
	    l2dhdr->dh_log_entries != dev->l2ad_log_entries ||
	    l2dhdr->dh_end != dev->l2ad_end
	    || !l2arc_range_check_overlap(dev->l2ad_start, dev->l2ad_end,
	    l2dhdr->dh_evict)) {
		/*
		 * Attempt to rebuild a device containing no actual dev hdr
		 * or containing a header from some other pool or from another
		 * version of persistent L2ARC.
		 */
		ARCSTAT_BUMP(arcstat_l2_rebuild_abort_unsupported);
		return (SET_ERROR(ENOTSUP));
	}

	return (0);
}

/*
 * Reads L2ARC log blocks from storage and validates their contents.
 *
 * This function implements a simple fetcher to make sure that while
 * we're processing one buffer the L2ARC is already fetching the next
 * one in the chain.
 *
 * The arguments this_lbp and next_lbp point to the current and next log block
 * address in the block chain. Similarly, this_lb and next_lb hold the
 * l2arc_log_blk_phys_t's of the current and next L2ARC blk.
 *
 * The `this_io' and `next_io' arguments are used for block fetching.
 * When issuing the first blk IO during rebuild, you should pass NULL for
 * `this_io'. This function will then issue a sync IO to read the block and
 * also issue an async IO to fetch the next block in the block chain. The
 * fetched IO is returned in `next_io'. On subsequent calls to this
 * function, pass the value returned in `next_io' from the previous call
 * as `this_io' and a fresh `next_io' pointer to hold the next fetch IO.
 * Prior to the call, you should initialize your `next_io' pointer to be
 * NULL. If no fetch IO was issued, the pointer is left set at NULL.
 *
 * On success, this function returns 0, otherwise it returns an appropriate
 * error code. On error the fetching IO is aborted and cleared before
 * returning from this function. Therefore, if we return an error, the
 * caller can assume that we have taken care of cleanup of fetch IOs.
 */
static int
l2arc_log_blk_read(l2arc_dev_t *dev,
    const l2arc_log_blkptr_t *this_lbp, const l2arc_log_blkptr_t *next_lbp,
    l2arc_log_blk_phys_t *this_lb, l2arc_log_blk_phys_t *next_lb,
    zio_t *this_io, zio_t **next_io)
{
	int err = 0;
	zio_cksum_t cksum;
	abd_t *abd = NULL;
	uint64_t asize;

	ASSERT(this_lbp != NULL && next_lbp != NULL);
	ASSERT(this_lb != NULL && next_lb != NULL);
	ASSERT(next_io != NULL && *next_io == NULL);
	ASSERT(l2arc_log_blkptr_valid(dev, this_lbp));

	/*
	 * Check to see if we have issued the IO for this log block in a
	 * previous run. If not, this is the first call, so issue it now.
	 */
	if (this_io == NULL) {
		this_io = l2arc_log_blk_fetch(dev->l2ad_vdev, this_lbp,
		    this_lb);
	}

	/*
	 * Peek to see if we can start issuing the next IO immediately.
	 */
	if (l2arc_log_blkptr_valid(dev, next_lbp)) {
		/*
		 * Start issuing IO for the next log block early - this
		 * should help keep the L2ARC device busy while we
		 * decompress and restore this log block.
		 */
		*next_io = l2arc_log_blk_fetch(dev->l2ad_vdev, next_lbp,
		    next_lb);
	}

	/* Wait for the IO to read this log block to complete */
	if ((err = zio_wait(this_io)) != 0) {
		ARCSTAT_BUMP(arcstat_l2_rebuild_abort_io_errors);
		zfs_dbgmsg("L2ARC IO error (%d) while reading log block, "
		    "offset: %llu, vdev guid: %llu", err, this_lbp->lbp_daddr,
		    dev->l2ad_vdev->vdev_guid);
		goto cleanup;
	}

	/*
	 * Make sure the buffer checks out.
	 * L2BLK_GET_PSIZE returns aligned size for log blocks.
	 */
	asize = L2BLK_GET_PSIZE((this_lbp)->lbp_prop);
	fletcher_4_native(this_lb, asize, NULL, &cksum);
	if (!ZIO_CHECKSUM_EQUAL(cksum, this_lbp->lbp_cksum)) {
		ARCSTAT_BUMP(arcstat_l2_rebuild_abort_cksum_lb_errors);
		zfs_dbgmsg("L2ARC log block cksum failed, offset: %llu, "
		    "vdev guid: %llu, l2ad_hand: %llu, l2ad_evict: %llu",
		    this_lbp->lbp_daddr, dev->l2ad_vdev->vdev_guid,
		    dev->l2ad_hand, dev->l2ad_evict);
		err = SET_ERROR(ECKSUM);
		goto cleanup;
	}

	/* Now we can take our time decoding this buffer */
	switch (L2BLK_GET_COMPRESS((this_lbp)->lbp_prop)) {
	case ZIO_COMPRESS_OFF:
		break;
	case ZIO_COMPRESS_LZ4:
		abd = abd_alloc_for_io(asize, B_TRUE);
		abd_copy_from_buf_off(abd, this_lb, 0, asize);
		if ((err = zio_decompress_data(
		    L2BLK_GET_COMPRESS((this_lbp)->lbp_prop),
		    abd, this_lb, asize, sizeof (*this_lb))) != 0) {
			err = SET_ERROR(EINVAL);
			goto cleanup;
		}
		break;
	default:
		err = SET_ERROR(EINVAL);
		goto cleanup;
	}
	if (this_lb->lb_magic == BSWAP_64(L2ARC_LOG_BLK_MAGIC))
		byteswap_uint64_array(this_lb, sizeof (*this_lb));
	if (this_lb->lb_magic != L2ARC_LOG_BLK_MAGIC) {
		err = SET_ERROR(EINVAL);
		goto cleanup;
	}
cleanup:
	/* Abort an in-flight fetch I/O in case of error */
	if (err != 0 && *next_io != NULL) {
		l2arc_log_blk_fetch_abort(*next_io);
		*next_io = NULL;
	}
	if (abd != NULL)
		abd_free(abd);
	return (err);
}

/*
 * Restores the payload of a log block to ARC. This creates empty ARC hdr
 * entries which only contain an l2arc hdr, essentially restoring the
 * buffers to their L2ARC evicted state. This function also updates space
 * usage on the L2ARC vdev to make sure it tracks restored buffers.
 */
static void
l2arc_log_blk_restore(l2arc_dev_t *dev, const l2arc_log_blk_phys_t *lb,
    uint64_t lb_asize, uint64_t lb_daddr)
{
	uint64_t size = 0, asize = 0;
	uint64_t log_entries = dev->l2ad_log_entries;

	for (int i = log_entries - 1; i >= 0; i--) {
		/*
		 * Restore goes in the reverse temporal direction to preserve
		 * correct temporal ordering of buffers in the l2ad_buflist.
		 * l2arc_hdr_restore also does a list_insert_tail instead of
		 * list_insert_head on the l2ad_buflist:
		 *
		 *              LIST    l2ad_buflist            LIST
		 *              HEAD  <------ (time) ------     TAIL
		 * direction    +-----+-----+-----+-----+-----+    direction
		 * of l2arc <== | buf | buf | buf | buf | buf | ===> of rebuild
		 * fill         +-----+-----+-----+-----+-----+
		 *              ^                             ^
		 *              |                             |
		 *              |                             |
		 *      l2arc_feed_thread               l2arc_rebuild
		 *      will place new bufs here        restores bufs here
		 *
		 * During l2arc_rebuild() the device is not used by
		 * l2arc_feed_thread() as dev->l2ad_rebuild is set to true.
		 */
		size += L2BLK_GET_LSIZE((&lb->lb_entries[i])->le_prop);
		asize += vdev_psize_to_asize(dev->l2ad_vdev,
		    L2BLK_GET_PSIZE((&lb->lb_entries[i])->le_prop));
		l2arc_hdr_restore(&lb->lb_entries[i], dev);
	}

	/*
	 * Record rebuild stats:
	 *	size		Logical size of restored buffers in the L2ARC
	 *	asize		Aligned size of restored buffers in the L2ARC
	 */
	ARCSTAT_INCR(arcstat_l2_rebuild_size, size);
	ARCSTAT_INCR(arcstat_l2_rebuild_asize, asize);
	ARCSTAT_INCR(arcstat_l2_rebuild_bufs, log_entries);
	ARCSTAT_F_AVG(arcstat_l2_log_blk_avg_asize, lb_asize);
	ARCSTAT_F_AVG(arcstat_l2_data_to_meta_ratio, asize / lb_asize);
	ARCSTAT_BUMP(arcstat_l2_rebuild_log_blks);
}

/*
 * Restores a single ARC buf hdr from a log entry. The ARC buffer is put
 * into a state indicating that it has been evicted to L2ARC.
 */
static void
l2arc_hdr_restore(const l2arc_log_ent_phys_t *le, l2arc_dev_t *dev)
{
	arc_buf_hdr_t *hdr, *exists;
	kmutex_t *hash_lock;
	arc_buf_contents_t type = L2BLK_GET_TYPE((le)->le_prop);
	uint64_t asize;

	/*
	 * Do all the allocation before grabbing any locks, this lets us
	 * sleep if memory is full and we don't have to deal with failed
	 * allocations.
	 */
	hdr = arc_buf_alloc_l2only(L2BLK_GET_LSIZE((le)->le_prop), type,
	    dev, le->le_dva, le->le_daddr,
	    L2BLK_GET_PSIZE((le)->le_prop), le->le_birth,
	    L2BLK_GET_COMPRESS((le)->le_prop),
	    L2BLK_GET_PROTECTED((le)->le_prop),
	    L2BLK_GET_PREFETCH((le)->le_prop));
	asize = vdev_psize_to_asize(dev->l2ad_vdev,
	    L2BLK_GET_PSIZE((le)->le_prop));

	/*
	 * vdev_space_update() has to be called before arc_hdr_destroy() to
	 * avoid underflow since the latter also calls the former.
	 */
	vdev_space_update(dev->l2ad_vdev, asize, 0, 0);

	ARCSTAT_INCR(arcstat_l2_lsize, HDR_GET_LSIZE(hdr));
	ARCSTAT_INCR(arcstat_l2_psize, HDR_GET_PSIZE(hdr));

	mutex_enter(&dev->l2ad_mtx);
	list_insert_tail(&dev->l2ad_buflist, hdr);
	(void) zfs_refcount_add_many(&dev->l2ad_alloc, arc_hdr_size(hdr), hdr);
	mutex_exit(&dev->l2ad_mtx);

	exists = buf_hash_insert(hdr, &hash_lock);
	if (exists) {
		/* Buffer was already cached, no need to restore it. */
		arc_hdr_destroy(hdr);
		/*
		 * If the buffer is already cached, check whether it has
		 * L2ARC metadata. If not, enter them and update the flag.
		 * This is important in case of onlining a cache device, since
		 * we previously evicted all L2ARC metadata from ARC.
		 */
		if (!HDR_HAS_L2HDR(exists)) {
			arc_hdr_set_flags(exists, ARC_FLAG_HAS_L2HDR);
			exists->b_l2hdr.b_dev = dev;
			exists->b_l2hdr.b_daddr = le->le_daddr;
			mutex_enter(&dev->l2ad_mtx);
			list_insert_tail(&dev->l2ad_buflist, exists);
			(void) zfs_refcount_add_many(&dev->l2ad_alloc,
			    arc_hdr_size(exists), exists);
			mutex_exit(&dev->l2ad_mtx);
			vdev_space_update(dev->l2ad_vdev, asize, 0, 0);
			ARCSTAT_INCR(arcstat_l2_lsize, HDR_GET_LSIZE(exists));
			ARCSTAT_INCR(arcstat_l2_psize, HDR_GET_PSIZE(exists));
		}
		ARCSTAT_BUMP(arcstat_l2_rebuild_bufs_precached);
	}

	mutex_exit(hash_lock);
}

/*
 * Starts an asynchronous read IO to read a log block. This is used in log
 * block reconstruction to start reading the next block before we are done
 * decoding and reconstructing the current block, to keep the l2arc device
 * nice and hot with read IO to process.
 * The returned zio will contain newly allocated memory buffers for the IO
 * data which should then be freed by the caller once the zio is no longer
 * needed (i.e. due to it having completed). If you wish to abort this
 * zio, you should do so using l2arc_log_blk_fetch_abort, which takes
 * care of disposing of the allocated buffers correctly.
 */
static zio_t *
l2arc_log_blk_fetch(vdev_t *vd, const l2arc_log_blkptr_t *lbp,
    l2arc_log_blk_phys_t *lb)
{
	uint32_t asize;
	zio_t *pio;
	l2arc_read_callback_t *cb;

	/* L2BLK_GET_PSIZE returns aligned size for log blocks */
	asize = L2BLK_GET_PSIZE((lbp)->lbp_prop);
	ASSERT(asize <= sizeof (l2arc_log_blk_phys_t));

	cb = kmem_zalloc(sizeof (l2arc_read_callback_t), KM_SLEEP);
	cb->l2rcb_abd = abd_get_from_buf(lb, asize);
	pio = zio_root(vd->vdev_spa, l2arc_blk_fetch_done, cb,
	    ZIO_FLAG_DONT_CACHE | ZIO_FLAG_CANFAIL | ZIO_FLAG_DONT_PROPAGATE |
	    ZIO_FLAG_DONT_RETRY);
	(void) zio_nowait(zio_read_phys(pio, vd, lbp->lbp_daddr, asize,
	    cb->l2rcb_abd, ZIO_CHECKSUM_OFF, NULL, NULL,
	    ZIO_PRIORITY_ASYNC_READ, ZIO_FLAG_DONT_CACHE | ZIO_FLAG_CANFAIL |
	    ZIO_FLAG_DONT_PROPAGATE | ZIO_FLAG_DONT_RETRY, B_FALSE));

	return (pio);
}

/*
 * Aborts a zio returned from l2arc_log_blk_fetch and frees the data
 * buffers allocated for it.
 */
static void
l2arc_log_blk_fetch_abort(zio_t *zio)
{
	(void) zio_wait(zio);
}

/*
 * Creates a zio to update the device header on an l2arc device.
 */
static void
l2arc_dev_hdr_update(l2arc_dev_t *dev)
{
	l2arc_dev_hdr_phys_t *l2dhdr = dev->l2ad_dev_hdr;
	const uint64_t l2dhdr_asize = dev->l2ad_dev_hdr_asize;
	abd_t *abd;
	int err;

	VERIFY(spa_config_held(dev->l2ad_spa, SCL_STATE_ALL, RW_READER));

	l2dhdr->dh_magic = L2ARC_DEV_HDR_MAGIC;
	l2dhdr->dh_version = L2ARC_PERSISTENT_VERSION;
	l2dhdr->dh_spa_guid = spa_guid(dev->l2ad_vdev->vdev_spa);
	l2dhdr->dh_vdev_guid = dev->l2ad_vdev->vdev_guid;
	l2dhdr->dh_log_entries = dev->l2ad_log_entries;
	l2dhdr->dh_evict = dev->l2ad_evict;
	l2dhdr->dh_start = dev->l2ad_start;
	l2dhdr->dh_end = dev->l2ad_end;
	l2dhdr->dh_lb_asize = zfs_refcount_count(&dev->l2ad_lb_asize);
	l2dhdr->dh_lb_count = zfs_refcount_count(&dev->l2ad_lb_count);
	l2dhdr->dh_flags = 0;
	if (dev->l2ad_first)
		l2dhdr->dh_flags |= L2ARC_DEV_HDR_EVICT_FIRST;

	abd = abd_get_from_buf(l2dhdr, l2dhdr_asize);

	err = zio_wait(zio_write_phys(NULL, dev->l2ad_vdev,
	    VDEV_LABEL_START_SIZE, l2dhdr_asize, abd, ZIO_CHECKSUM_LABEL, NULL,
	    NULL, ZIO_PRIORITY_ASYNC_WRITE, ZIO_FLAG_CANFAIL, B_FALSE));

	abd_put(abd);

	if (err != 0) {
		zfs_dbgmsg("L2ARC IO error (%d) while writing device header, "
		    "vdev guid: %llu", err, dev->l2ad_vdev->vdev_guid);
	}
}

/*
 * Commits a log block to the L2ARC device. This routine is invoked from
 * l2arc_write_buffers when the log block fills up.
 * This function allocates some memory to temporarily hold the serialized
 * buffer to be written. This is then released in l2arc_write_done.
 */
static void
l2arc_log_blk_commit(l2arc_dev_t *dev, zio_t *pio, l2arc_write_callback_t *cb)
{
	l2arc_log_blk_phys_t *lb = &dev->l2ad_log_blk;
	l2arc_dev_hdr_phys_t *l2dhdr = dev->l2ad_dev_hdr;
	uint64_t psize, asize;
	zio_t *wzio;
	l2arc_lb_abd_buf_t *abd_buf;
	uint8_t *tmpbuf;
	l2arc_lb_ptr_buf_t *lb_ptr_buf;

	VERIFY3S(dev->l2ad_log_ent_idx, ==, dev->l2ad_log_entries);

	tmpbuf = zio_buf_alloc(sizeof (*lb));
	abd_buf = zio_buf_alloc(sizeof (*abd_buf));
	abd_buf->abd = abd_get_from_buf(lb, sizeof (*lb));
	lb_ptr_buf = kmem_zalloc(sizeof (l2arc_lb_ptr_buf_t), KM_SLEEP);
	lb_ptr_buf->lb_ptr = kmem_zalloc(sizeof (l2arc_log_blkptr_t), KM_SLEEP);

	/* link the buffer into the block chain */
	lb->lb_prev_lbp = l2dhdr->dh_start_lbps[1];
	lb->lb_magic = L2ARC_LOG_BLK_MAGIC;

	/*
	 * l2arc_log_blk_commit() may be called multiple times during a single
	 * l2arc_write_buffers() call. Save the allocated abd buffers in a list
	 * so we can free them in l2arc_write_done() later on.
	 */
	list_insert_tail(&cb->l2wcb_abd_list, abd_buf);

	/* try to compress the buffer */
	psize = zio_compress_data(ZIO_COMPRESS_LZ4,
	    abd_buf->abd, tmpbuf, sizeof (*lb));

	/* a log block is never entirely zero */
	ASSERT(psize != 0);
	asize = vdev_psize_to_asize(dev->l2ad_vdev, psize);
	ASSERT(asize <= sizeof (*lb));

	/*
	 * Update the start log block pointer in the device header to point
	 * to the log block we're about to write.
	 */
	l2dhdr->dh_start_lbps[1] = l2dhdr->dh_start_lbps[0];
	l2dhdr->dh_start_lbps[0].lbp_daddr = dev->l2ad_hand;
	l2dhdr->dh_start_lbps[0].lbp_payload_asize =
	    dev->l2ad_log_blk_payload_asize;
	l2dhdr->dh_start_lbps[0].lbp_payload_start =
	    dev->l2ad_log_blk_payload_start;
	_NOTE(CONSTCOND)
	L2BLK_SET_LSIZE(
	    (&l2dhdr->dh_start_lbps[0])->lbp_prop, sizeof (*lb));
	L2BLK_SET_PSIZE(
	    (&l2dhdr->dh_start_lbps[0])->lbp_prop, asize);
	L2BLK_SET_CHECKSUM(
	    (&l2dhdr->dh_start_lbps[0])->lbp_prop,
	    ZIO_CHECKSUM_FLETCHER_4);
	if (asize < sizeof (*lb)) {
		/* compression succeeded */
		bzero(tmpbuf + psize, asize - psize);
		L2BLK_SET_COMPRESS(
		    (&l2dhdr->dh_start_lbps[0])->lbp_prop,
		    ZIO_COMPRESS_LZ4);
	} else {
		/* compression failed */
		bcopy(lb, tmpbuf, sizeof (*lb));
		L2BLK_SET_COMPRESS(
		    (&l2dhdr->dh_start_lbps[0])->lbp_prop,
		    ZIO_COMPRESS_OFF);
	}

	/* checksum what we're about to write */
	fletcher_4_native(tmpbuf, asize, NULL,
	    &l2dhdr->dh_start_lbps[0].lbp_cksum);

	abd_put(abd_buf->abd);

	/* perform the write itself */
	abd_buf->abd = abd_get_from_buf(tmpbuf, sizeof (*lb));
	abd_take_ownership_of_buf(abd_buf->abd, B_TRUE);
	wzio = zio_write_phys(pio, dev->l2ad_vdev, dev->l2ad_hand,
	    asize, abd_buf->abd, ZIO_CHECKSUM_OFF, NULL, NULL,
	    ZIO_PRIORITY_ASYNC_WRITE, ZIO_FLAG_CANFAIL, B_FALSE);
	DTRACE_PROBE2(l2arc__write, vdev_t *, dev->l2ad_vdev, zio_t *, wzio);
	(void) zio_nowait(wzio);

	dev->l2ad_hand += asize;
	/*
	 * Include the committed log block's pointer in the list of pointers
	 * to log blocks present in the L2ARC device.
	 */
	bcopy(&l2dhdr->dh_start_lbps[0], lb_ptr_buf->lb_ptr,
	    sizeof (l2arc_log_blkptr_t));
	mutex_enter(&dev->l2ad_mtx);
	list_insert_head(&dev->l2ad_lbptr_list, lb_ptr_buf);
	ARCSTAT_INCR(arcstat_l2_log_blk_asize, asize);
	ARCSTAT_BUMP(arcstat_l2_log_blk_count);
	zfs_refcount_add_many(&dev->l2ad_lb_asize, asize, lb_ptr_buf);
	zfs_refcount_add(&dev->l2ad_lb_count, lb_ptr_buf);
	mutex_exit(&dev->l2ad_mtx);
	vdev_space_update(dev->l2ad_vdev, asize, 0, 0);

	/* bump the kstats */
	ARCSTAT_INCR(arcstat_l2_write_bytes, asize);
	ARCSTAT_BUMP(arcstat_l2_log_blk_writes);
	ARCSTAT_F_AVG(arcstat_l2_log_blk_avg_asize, asize);
	ARCSTAT_F_AVG(arcstat_l2_data_to_meta_ratio,
	    dev->l2ad_log_blk_payload_asize / asize);

	/* start a new log block */
	dev->l2ad_log_ent_idx = 0;
	dev->l2ad_log_blk_payload_asize = 0;
	dev->l2ad_log_blk_payload_start = 0;
}

/*
 * Validates an L2ARC log block address to make sure that it can be read
 * from the provided L2ARC device.
 */
boolean_t
l2arc_log_blkptr_valid(l2arc_dev_t *dev, const l2arc_log_blkptr_t *lbp)
{
	/* L2BLK_GET_PSIZE returns aligned size for log blocks */
	uint64_t asize = L2BLK_GET_PSIZE((lbp)->lbp_prop);
	uint64_t end = lbp->lbp_daddr + asize - 1;
	uint64_t start = lbp->lbp_payload_start;
	boolean_t evicted = B_FALSE;

	/* BEGIN CSTYLED */
	/*
	 * A log block is valid if all of the following conditions are true:
	 * - it fits entirely (including its payload) between l2ad_start and
	 *   l2ad_end
	 * - it has a valid size
	 * - neither the log block itself nor part of its payload was evicted
	 *   by l2arc_evict():
	 *
	 *              l2ad_hand                l2ad_evict
	 *              |                        |      lbp_daddr
	 *              |     start              |      |  end
	 *              |     |                  |      |  |
	 *              V     V                  V      V  V
	 *   l2ad_start ============================================ l2ad_end
	 *                    --------------------------||||
	 *                              ^                ^
	 *                              |               log block
	 *                              payload
	 */
	/* END CSTYLED */
	evicted =
	    l2arc_range_check_overlap(start, end, dev->l2ad_hand) ||
	    l2arc_range_check_overlap(start, end, dev->l2ad_evict) ||
	    l2arc_range_check_overlap(dev->l2ad_hand, dev->l2ad_evict, start) ||
	    l2arc_range_check_overlap(dev->l2ad_hand, dev->l2ad_evict, end);

	return (start >= dev->l2ad_start && end <= dev->l2ad_end &&
	    asize > 0 && asize <= sizeof (l2arc_log_blk_phys_t) &&
	    (!evicted || dev->l2ad_first));
}

/*
 * Inserts ARC buffer header `hdr' into the current L2ARC log block on
 * the device. The buffer being inserted must be present in L2ARC.
 * Returns B_TRUE if the L2ARC log block is full and needs to be committed
 * to L2ARC, or B_FALSE if it still has room for more ARC buffers.
 */
static boolean_t
l2arc_log_blk_insert(l2arc_dev_t *dev, const arc_buf_hdr_t *hdr)
{
	l2arc_log_blk_phys_t *lb = &dev->l2ad_log_blk;
	l2arc_log_ent_phys_t *le;

	if (dev->l2ad_log_entries == 0)
		return (B_FALSE);

	int index = dev->l2ad_log_ent_idx++;

	ASSERT3S(index, <, dev->l2ad_log_entries);
	ASSERT(HDR_HAS_L2HDR(hdr));

	le = &lb->lb_entries[index];
	bzero(le, sizeof (*le));
	le->le_dva = hdr->b_dva;
	le->le_birth = hdr->b_birth;
	le->le_daddr = hdr->b_l2hdr.b_daddr;
	if (index == 0)
		dev->l2ad_log_blk_payload_start = le->le_daddr;
	L2BLK_SET_LSIZE((le)->le_prop, HDR_GET_LSIZE(hdr));
	L2BLK_SET_PSIZE((le)->le_prop, HDR_GET_PSIZE(hdr));
	L2BLK_SET_COMPRESS((le)->le_prop, HDR_GET_COMPRESS(hdr));
	L2BLK_SET_TYPE((le)->le_prop, hdr->b_type);
	L2BLK_SET_PROTECTED((le)->le_prop, !!(HDR_PROTECTED(hdr)));
	L2BLK_SET_PREFETCH((le)->le_prop, !!(HDR_PREFETCH(hdr)));

	dev->l2ad_log_blk_payload_asize += vdev_psize_to_asize(dev->l2ad_vdev,
	    HDR_GET_PSIZE(hdr));

	return (dev->l2ad_log_ent_idx == dev->l2ad_log_entries);
}

/*
 * Checks whether a given L2ARC device address sits in a time-sequential
 * range. The trick here is that the L2ARC is a rotary buffer, so we can't
 * just do a range comparison, we need to handle the situation in which the
 * range wraps around the end of the L2ARC device. Arguments:
 *	bottom -- Lower end of the range to check (written to earlier).
 *	top    -- Upper end of the range to check (written to later).
 *	check  -- The address for which we want to determine if it sits in
 *		  between the top and bottom.
 *
 * The 3-way conditional below represents the following cases:
 *
 *	bottom < top : Sequentially ordered case:
 *	  <check>--------+-------------------+
 *	                 |  (overlap here?)  |
 *	 L2ARC dev       V                   V
 *	 |---------------<bottom>============<top>--------------|
 *
 *	bottom > top: Looped-around case:
 *	                      <check>--------+------------------+
 *	                                     |  (overlap here?) |
 *	 L2ARC dev                           V                  V
 *	 |===============<top>---------------<bottom>===========|
 *	 ^               ^
 *	 |  (or here?)   |
 *	 +---------------+---------<check>
 *
 *	top == bottom : Just a single address comparison.
 */
boolean_t
l2arc_range_check_overlap(uint64_t bottom, uint64_t top, uint64_t check)
{
	if (bottom < top)
		return (bottom <= check && check <= top);
	else if (bottom > top)
		return (check <= top || bottom <= check);
	else
		return (check == top);
}

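/*
 * Illustrative sketch only, not part of the ARC: a user-space copy of the
 * three-way comparison used by l2arc_range_check_overlap(), exercised on an
 * imaginary device whose addresses run from 0 to 100. The names
 * range_overlap_demo() and range_overlap_demo_selfcheck() are hypothetical,
 * and bool stands in for boolean_t so the block builds outside the kernel.
 */

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-alone copy of the wrap-around range test. */
bool
range_overlap_demo(uint64_t bottom, uint64_t top, uint64_t check)
{
	if (bottom < top)
		return (bottom <= check && check <= top);
	else if (bottom > top)
		return (check <= top || bottom <= check);
	else
		return (check == top);
}

/* Walks through the three cases with made-up addresses. */
void
range_overlap_demo_selfcheck(void)
{
	/* bottom < top: plain inclusive range [20, 60] */
	assert(range_overlap_demo(20, 60, 40));
	assert(!range_overlap_demo(20, 60, 10));
	assert(!range_overlap_demo(20, 60, 61));

	/* bottom > top: range wraps, covering [80, end] and [start, 10] */
	assert(range_overlap_demo(80, 10, 90));
	assert(range_overlap_demo(80, 10, 5));
	assert(!range_overlap_demo(80, 10, 50));

	/* bottom == top: single address comparison */
	assert(range_overlap_demo(30, 30, 30));
	assert(!range_overlap_demo(30, 30, 31));
}
```

/*
 * Both ends of the range are inclusive in every case, which is why
 * l2arc_log_blkptr_valid() computes `end' as lbp_daddr + asize - 1.
 */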