/*
 * CDDL HEADER START
 *
 * The contents of this file are subject to the terms of the
 * Common Development and Distribution License (the "License").
 * You may not use this file except in compliance with the License.
 *
 * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
 * or http://www.opensolaris.org/os/licensing.
 * See the License for the specific language governing permissions
 * and limitations under the License.
 *
 * When distributing Covered Code, include this CDDL HEADER in each
 * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
 * If applicable, add the following below this CDDL HEADER, with the
 * fields enclosed by brackets "[]" replaced with your own identifying
 * information: Portions Copyright [yyyy] [name of copyright owner]
 *
 * CDDL HEADER END
 */
/*
 * Copyright 2008 Sun Microsystems, Inc.  All rights reserved.
 * Use is subject to license terms.
 */

/*
 * Page Retire - Big Theory Statement.
 *
 * This file handles removing sections of faulty memory from use when the
 * user land FMA Diagnosis Engine requests that a page be removed or when
 * a CE or UE is detected by the hardware.
 *
 * In the bad old days, the kernel side of Page Retire did a lot of the work
 * on its own. Now, with the DE keeping track of errors, the kernel side is
 * rather simple minded on most platforms.
 *
 * Errors are all reflected to the DE, and after digesting the error and
 * looking at all previously reported errors, the DE decides what should
 * be done about the current error. If the DE wants a particular page to
 * be retired, then the kernel page retire code is invoked via an ioctl.
 * On non-FMA platforms, the ue_drain and ce_drain paths end up calling
 * page retire to handle the error. Since page retire is just a simple
 * mechanism it doesn't need to differentiate between the different callers.
 *
 * The p_toxic field in the page_t is used to indicate which errors have
 * occurred and what action has been taken on a given page. Because errors are
 * reported without regard to the locked state of a page, no locks are used
 * to SET the error bits in p_toxic. However, in order to clear the error
 * bits, the page_t must be held exclusively locked.
 *
 * When page_retire() is called, it must be able to acquire locks, sleep, etc.
 * It must not be called from high-level interrupt context.
 *
 * Depending on how the requested page is being used at the time of the retire
 * request (and on the availability of sufficient system resources), the page
 * may be retired immediately, or just marked for retirement later. For
 * example, locked pages are marked, while free pages are retired. Multiple
 * requests may be made to retire the same page, although there is no need
 * to: once the p_toxic flags are set, the page will be retired as soon as it
 * can be exclusively locked.
 *
 * The retire mechanism is driven centrally out of page_unlock(). To expedite
 * the retirement of pages, further requests for SE_SHARED locks are denied
 * as long as a page retirement is pending. In addition, as long as pages are
 * pending retirement a background thread runs periodically trying to retire
 * those pages. Pages which could not be retired while the system is running
 * are scrubbed prior to rebooting to avoid latent errors on the next boot.
 *
 * UE pages without persistent errors are scrubbed and returned to service.
 * Recidivist pages, as well as FMA-directed requests for retirement, result
 * in the page being taken out of service. Once the decision is made to take
 * a page out of service, the page is cleared, hashed onto the retired_pages
 * vnode, marked as retired, and unlocked. No other requesters (except
 * for unretire) are allowed to lock retired pages.
 *
 * The public routines return (sadly) 0 if they worked and a non-zero error
 * value if something went wrong. This is done for the ioctl side of the
 * world to allow errors to be reflected all the way out to user land. The
 * non-zero values are explained in comments atop each function.
 */
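
/*
 * For illustration only: the p_toxic protocol described above boils down
 * to the following sketch (a hypothetical caller, not a code path in this
 * file). Error bits may be set without any page lock held, but clearing
 * them requires the exclusive lock:
 *
 *	page_settoxic(pp, PR_UE);		/ * no lock needed to set * /
 *	if (page_trylock(pp, SE_EXCL)) {
 *		page_clrtoxic(pp, PR_UE);	/ * EXCL lock needed to clear * /
 *		page_unlock(pp);
 *	}
 */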

/*
 * Things to fix:
 *
 *	1. Trying to retire non-relocatable kvp pages may result in a
 *	quagmire. This is because seg_kmem() no longer keeps its pages locked,
 *	and calls page_lookup() in the free path; since kvp pages are modified
 *	and don't have a usable backing store, page_retire() can't do anything
 *	with them, and we'll keep denying the lock to seg_kmem_free() in a
 *	vicious cycle. To prevent that, we don't deny locks to kvp pages, and
 *	hence only try to retire a page from page_unlock() in the free path.
 *	Since most kernel pages are indefinitely held anyway, and don't
 *	participate in I/O, this is of little consequence.
 *
 *	2. Low memory situations will be interesting. If we don't have
 *	enough memory for page_relocate() to succeed, we won't be able to
 *	retire dirty pages; nobody will be able to push them out to disk
 *	either, since we aggressively deny the page lock. We could change
 *	fsflush so it can recognize this situation, grab the lock, and push
 *	the page out, where we'll catch it in the free path and retire it.
 *
 *	3. Beware of places that have code like this in them:
 *
 *		if (! page_tryupgrade(pp)) {
 *			page_unlock(pp);
 *			while (! page_lock(pp, SE_EXCL, NULL, P_RECLAIM)) {
 *				/ *NOTHING* /
 *			}
 *		}
 *		page_free(pp);
 *
 *	The problem is that pp can change identity right after the
 *	page_unlock() call. In particular, page_retire() can step in
 *	there, change pp's identity, and hash pp onto the retired_pages
 *	vnode.
 *
 *	Of course, other functions besides page_retire() can have the
 *	same effect. A kmem reader can waltz by, set up a mapping to the
 *	page, and then unlock the page. page_free() will then go castors
 *	up. So if anybody is doing this, it's already a bug.
 *
 *	4. mdboot()'s call into page_retire_mdboot() should probably be
 *	moved lower. Where the call is made now, we can get into trouble
 *	by scrubbing a kernel page that is then accessed later.
 */

#include <sys/types.h>
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/mman.h>
#include <sys/vnode.h>
#include <sys/vfs_opreg.h>
#include <sys/cmn_err.h>
#include <sys/ksynch.h>
#include <sys/thread.h>
#include <sys/disp.h>
#include <sys/ontrap.h>
#include <sys/vmsystm.h>
#include <sys/mem_config.h>
#include <sys/atomic.h>
#include <sys/callb.h>
#include <vm/page.h>
#include <vm/vm_dep.h>
#include <vm/as.h>
#include <vm/hat.h>

/*
 * vnode for all pages which are retired from the VM system.
 */
vnode_t *retired_pages;

static int page_retire_pp_finish(page_t *, void *, uint_t);

/*
 * Make a list of all of the pages that have been marked for retirement
 * but are not yet retired. At system shutdown, we will scrub all of the
 * pages in the list in case there are outstanding UEs. Then, we
 * cross-check this list against the number of pages that are yet to be
 * retired, and if we find inconsistencies, we scan every page_t in the
 * whole system looking for any pages that need to be scrubbed for UEs.
 * The background thread also uses this queue to determine which pages
 * it should keep trying to retire.
 */
#ifdef	DEBUG
#define	PR_PENDING_QMAX	32
#else	/* DEBUG */
#define	PR_PENDING_QMAX	256
#endif	/* DEBUG */
page_t		*pr_pending_q[PR_PENDING_QMAX];
kmutex_t	pr_q_mutex;

/*
 * Page retire global kstats
 */
struct page_retire_kstat {
	kstat_named_t	pr_retired;
	kstat_named_t	pr_requested;
	kstat_named_t	pr_requested_free;
	kstat_named_t	pr_enqueue_fail;
	kstat_named_t	pr_dequeue_fail;
	kstat_named_t	pr_pending;
	kstat_named_t	pr_failed;
	kstat_named_t	pr_failed_kernel;
	kstat_named_t	pr_limit;
	kstat_named_t	pr_limit_exceeded;
	kstat_named_t	pr_fma;
	kstat_named_t	pr_mce;
	kstat_named_t	pr_ue;
	kstat_named_t	pr_ue_cleared_retire;
	kstat_named_t	pr_ue_cleared_free;
	kstat_named_t	pr_ue_persistent;
	kstat_named_t	pr_unretired;
};

static struct page_retire_kstat page_retire_kstat = {
	{ "pages_retired",		KSTAT_DATA_UINT64},
	{ "pages_retire_request",	KSTAT_DATA_UINT64},
	{ "pages_retire_request_free",	KSTAT_DATA_UINT64},
	{ "pages_notenqueued",		KSTAT_DATA_UINT64},
	{ "pages_notdequeued",		KSTAT_DATA_UINT64},
	{ "pages_pending",		KSTAT_DATA_UINT64},
	{ "pages_deferred",		KSTAT_DATA_UINT64},
	{ "pages_deferred_kernel",	KSTAT_DATA_UINT64},
	{ "pages_limit",		KSTAT_DATA_UINT64},
	{ "pages_limit_exceeded",	KSTAT_DATA_UINT64},
	{ "pages_fma",			KSTAT_DATA_UINT64},
	{ "pages_multiple_ce",		KSTAT_DATA_UINT64},
	{ "pages_ue",			KSTAT_DATA_UINT64},
	{ "pages_ue_cleared_retired",	KSTAT_DATA_UINT64},
	{ "pages_ue_cleared_freed",	KSTAT_DATA_UINT64},
	{ "pages_ue_persistent",	KSTAT_DATA_UINT64},
	{ "pages_unretired",		KSTAT_DATA_UINT64},
};

static kstat_t	*page_retire_ksp = NULL;

#define	PR_INCR_KSTAT(stat)	\
	atomic_add_64(&(page_retire_kstat.stat.value.ui64), 1)
#define	PR_DECR_KSTAT(stat)	\
	atomic_add_64(&(page_retire_kstat.stat.value.ui64), -1)

#define	PR_KSTAT_RETIRED_CE	(page_retire_kstat.pr_mce.value.ui64)
#define	PR_KSTAT_RETIRED_FMA	(page_retire_kstat.pr_fma.value.ui64)
#define	PR_KSTAT_RETIRED_NOTUE	(PR_KSTAT_RETIRED_CE + PR_KSTAT_RETIRED_FMA)
#define	PR_KSTAT_PENDING	(page_retire_kstat.pr_pending.value.ui64)
#define	PR_KSTAT_EQFAIL		(page_retire_kstat.pr_enqueue_fail.value.ui64)
#define	PR_KSTAT_DQFAIL		(page_retire_kstat.pr_dequeue_fail.value.ui64)

/*
 * page retire kstats to list all retired pages
 */
static int pr_list_kstat_update(kstat_t *ksp, int rw);
static int pr_list_kstat_snapshot(kstat_t *ksp, void *buf, int rw);
kmutex_t pr_list_kstat_mutex;

/*
 * Limit the number of multiple CE page retires.
 * The default is 0.1% of physmem, or 1 in 1000 pages. This is set in
 * basis points, where 100 basis points equals one percent.
 */
#define	MCE_BPT	10
uint64_t	max_pages_retired_bps = MCE_BPT;
#define	PAGE_RETIRE_LIMIT	((physmem * max_pages_retired_bps) / 10000)
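
/*
 * Worked example (assuming 8K pages): on a system with 1GB of physical
 * memory, physmem is 131072 pages, so the default limit works out to
 * (131072 * 10) / 10000 = 131 pages that may be retired for multiple CEs.
 */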

/*
 * Control over the verbosity of page retirement.
 *
 * When set to zero (the default), no messages will be printed.
 * When set to one, summary messages will be printed.
 * When set > one, all messages will be printed.
 *
 * A value of one will trigger detailed messages for retirement operations,
 * and is intended as a platform tunable for processors where FMA's DE does
 * not run (e.g., spitfire). Values > one are intended for debugging only.
 */
int page_retire_messages = 0;

/*
 * Control whether or not we return scrubbed UE pages to service.
 * By default we do not since FMA wants to run its diagnostics first
 * and then ask us to unretire the page if it passes. Non-FMA platforms
 * may set this to zero so we will only retire recidivist pages. It should
 * not be changed by the user.
 */
int page_retire_first_ue = 1;

/*
 * Master enable for page retire. This prevents a CE or UE early in boot
 * from trying to retire a page before page_retire_init() has finished
 * setting things up. This is internal only and is not a tunable!
 */
static int pr_enable = 0;

extern struct vnode kvp;

#ifdef	DEBUG
struct page_retire_debug {
	int prd_dup1;
	int prd_dup2;
	int prd_qdup;
	int prd_noaction;
	int prd_queued;
	int prd_notqueued;
	int prd_dequeue;
	int prd_top;
	int prd_locked;
	int prd_reloc;
	int prd_relocfail;
	int prd_mod;
	int prd_mod_late;
	int prd_kern;
	int prd_free;
	int prd_noreclaim;
	int prd_hashout;
	int prd_fma;
	int prd_uescrubbed;
	int prd_uenotscrubbed;
	int prd_mce;
	int prd_prlocked;
	int prd_prnotlocked;
	int prd_prretired;
	int prd_ulocked;
	int prd_unotretired;
	int prd_udestroy;
	int prd_uhashout;
	int prd_uunretired;
	int prd_unotlocked;
	int prd_checkhit;
	int prd_checkmiss_pend;
	int prd_checkmiss_noerr;
	int prd_tctop;
	int prd_tclocked;
	int prd_hunt;
	int prd_dohunt;
	int prd_earlyhunt;
	int prd_latehunt;
	int prd_nofreedemote;
	int prd_nodemote;
	int prd_demoted;
} pr_debug;

#define	PR_DEBUG(foo)	((pr_debug.foo)++)
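
/*
 * The counters above exist only in DEBUG kernels and are most easily
 * examined with a debugger; for example, from mdb -k (illustrative):
 *
 *	> pr_debug::print struct page_retire_debug
 */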

/*
 * A type histogram. We record the incidence of the various toxic
 * flag combinations along with the interesting page attributes. The
 * goal is to get as many combinations as we can while driving all
 * pr_debug values nonzero (indicating we've exercised all possible
 * code paths across all possible page types). Not all combinations
 * will make sense -- e.g. PRT_MOD|PRT_KERNEL.
 *
 * pr_type offset bit encoding (when examining with a debugger). These
 * are the byte offsets of each flag's counter within the pr_types array
 * (i.e., flag * sizeof (int)), not the flag values themselves:
 *
 *	PRT_NAMED - 0x4
 *	PRT_KERNEL - 0x8
 *	PRT_FREE - 0x10
 *	PRT_MOD - 0x20
 *	PRT_FMA - 0x0
 *	PRT_MCE - 0x40
 *	PRT_UE - 0x80
 */

#define	PRT_NAMED	0x01
#define	PRT_KERNEL	0x02
#define	PRT_FREE	0x04
#define	PRT_MOD		0x08
#define	PRT_FMA		0x00	/* yes, this is not a mistake */
#define	PRT_MCE		0x10
#define	PRT_UE		0x20
#define	PRT_ALL		0x3F

int	pr_types[PRT_ALL+1];

#define	PR_TYPES(pp)	{			\
	int whichtype = 0;			\
	if (pp->p_vnode)			\
		whichtype |= PRT_NAMED;		\
	if (PP_ISKAS(pp))			\
		whichtype |= PRT_KERNEL;	\
	if (PP_ISFREE(pp))			\
		whichtype |= PRT_FREE;		\
	if (hat_ismod(pp))			\
		whichtype |= PRT_MOD;		\
	if (pp->p_toxic & PR_UE)		\
		whichtype |= PRT_UE;		\
	if (pp->p_toxic & PR_MCE)		\
		whichtype |= PRT_MCE;		\
	pr_types[whichtype]++;			\
}

int	recl_calls;
int	recl_mtbf = 3;
int	reloc_calls;
int	reloc_mtbf = 7;
int	pr_calls;
int	pr_mtbf = 15;

#define	MTBF(v, f)	(((++(v)) & (f)) != (f))
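
/*
 * MTBF() is a simple fault injector for DEBUG kernels: it evaluates to
 * false once every (f + 1) calls, when the low bits of the incremented
 * counter are all ones. For example, with pr_mtbf = 15, one call to
 * page_retire() in every 16 skips the capture attempt, exercising the
 * failure and retry paths.
 */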

#else	/* DEBUG */

#define	PR_DEBUG(foo)	/* nothing */
#define	PR_TYPES(foo)	/* nothing */
#define	MTBF(v, f)	(1)

#endif	/* DEBUG */

/*
 * page_retire_done() - completion processing
 *
 * Used by the page_retire code for common completion processing.
 * It keeps track of how many times a given result has happened,
 * and writes out an occasional message.
 *
 * May be called with a NULL pp (PRD_INVALID_PA case).
 */
#define	PRD_INVALID_KEY		-1
#define	PRD_SUCCESS		0
#define	PRD_PENDING		1
#define	PRD_FAILED		2
#define	PRD_DUPLICATE		3
#define	PRD_INVALID_PA		4
#define	PRD_LIMIT		5
#define	PRD_UE_SCRUBBED		6
#define	PRD_UNR_SUCCESS		7
#define	PRD_UNR_CANTLOCK	8
#define	PRD_UNR_NOT		9

typedef struct page_retire_op {
	int	pr_key;		/* one of the PRD_* defines from above */
	int	pr_count;	/* How many times this has happened */
	int	pr_retval;	/* return value */
	int	pr_msglvl;	/* message level - when to print */
	char	*pr_message;	/* Cryptic message for field service */
} page_retire_op_t;

static page_retire_op_t page_retire_ops[] = {
	/* key			count	retval	msglvl	message */
	{PRD_SUCCESS,		0,	0,	1,
		"Page 0x%08x.%08x removed from service"},
	{PRD_PENDING,		0,	EAGAIN,	2,
		"Page 0x%08x.%08x will be retired on free"},
	{PRD_FAILED,		0,	EAGAIN,	0, NULL},
	{PRD_DUPLICATE,		0,	EIO,	2,
		"Page 0x%08x.%08x already retired or pending"},
	{PRD_INVALID_PA,	0,	EINVAL,	2,
		"PA 0x%08x.%08x is not a relocatable page"},
	{PRD_LIMIT,		0,	0,	1,
		"Page 0x%08x.%08x not retired due to limit exceeded"},
	{PRD_UE_SCRUBBED,	0,	0,	1,
		"Previously reported error on page 0x%08x.%08x cleared"},
	{PRD_UNR_SUCCESS,	0,	0,	1,
		"Page 0x%08x.%08x returned to service"},
	{PRD_UNR_CANTLOCK,	0,	EAGAIN,	2,
		"Page 0x%08x.%08x could not be unretired"},
	{PRD_UNR_NOT,		0,	EIO,	2,
		"Page 0x%08x.%08x is not retired"},
	{PRD_INVALID_KEY,	0,	0,	0, NULL} /* MUST BE LAST! */
};

/*
 * Print a message if page_retire_messages is at least the requested
 * message level.
 */
#define	PR_MESSAGE(debuglvl, msglvl, msg, pa)				\
{									\
	uint64_t p = (uint64_t)pa;					\
	if (page_retire_messages >= msglvl && msg != NULL) {		\
		cmn_err(debuglvl, msg,					\
		    (uint32_t)(p >> 32), (uint32_t)p);			\
	}								\
}

/*
 * Note that multiple bits may be set in a single settoxic operation.
 * May be called without the page locked.
 */
void
page_settoxic(page_t *pp, uchar_t bits)
{
	atomic_or_8(&pp->p_toxic, bits);
}

/*
 * Note that multiple bits may be cleared in a single clrtoxic operation.
 * Must be called with the page exclusively locked to prevent races which
 * may attempt to retire a page without any toxic bits set.
 * Note that the PR_CAPTURE bit can be cleared without the exclusive lock
 * being held as there is a separate mutex which protects that bit.
 */
void
page_clrtoxic(page_t *pp, uchar_t bits)
{
	ASSERT((bits & PR_CAPTURE) || PAGE_EXCL(pp));
	atomic_and_8(&pp->p_toxic, ~bits);
}

/*
 * Prints any page retire messages to the user, and decides what
 * error code is appropriate for the condition reported.
 */
static int
page_retire_done(page_t *pp, int code)
{
	page_retire_op_t *prop;
	uint64_t	pa = 0;
	int		i;

	if (pp != NULL) {
		pa = mmu_ptob((uint64_t)pp->p_pagenum);
	}

	prop = NULL;
	for (i = 0; page_retire_ops[i].pr_key != PRD_INVALID_KEY; i++) {
		if (page_retire_ops[i].pr_key == code) {
			prop = &page_retire_ops[i];
			break;
		}
	}

#ifdef	DEBUG
	if (page_retire_ops[i].pr_key == PRD_INVALID_KEY) {
		cmn_err(CE_PANIC, "page_retire_done: Invalid opcode %d", code);
	}
#endif

	ASSERT(prop->pr_key == code);

	prop->pr_count++;

	PR_MESSAGE(CE_NOTE, prop->pr_msglvl, prop->pr_message, pa);
	if (pp != NULL) {
		page_settoxic(pp, PR_MSG);
	}

	return (prop->pr_retval);
}

/*
 * Act like page_destroy(), but instead of freeing the page, hash it onto
 * the retired_pages vnode, and mark it retired.
 *
 * For fun, we try to scrub the page until it's squeaky clean.
 * availrmem is adjusted here.
 */
static void
page_retire_destroy(page_t *pp)
{
	/* the page_t's own address serves as its unique vnode offset */
	u_offset_t off = (u_offset_t)((uintptr_t)pp);

	ASSERT(PAGE_EXCL(pp));
	ASSERT(!PP_ISFREE(pp));
	ASSERT(pp->p_szc == 0);
	ASSERT(!hat_page_is_mapped(pp));
	ASSERT(!pp->p_vnode);

	page_clr_all_props(pp);
	pagescrub(pp, 0, MMU_PAGESIZE);

	pp->p_next = NULL;
	pp->p_prev = NULL;
	if (page_hashin(pp, retired_pages, off, NULL) == 0) {
		cmn_err(CE_PANIC, "retired page %p hashin failed", (void *)pp);
	}

	page_settoxic(pp, PR_RETIRED);
	PR_INCR_KSTAT(pr_retired);

	if (pp->p_toxic & PR_FMA) {
		PR_INCR_KSTAT(pr_fma);
	} else if (pp->p_toxic & PR_UE) {
		PR_INCR_KSTAT(pr_ue);
	} else {
		PR_INCR_KSTAT(pr_mce);
	}

	mutex_enter(&freemem_lock);
	availrmem--;
	mutex_exit(&freemem_lock);

	page_unlock(pp);
}

/*
 * Check whether the number of pages which have been retired already exceeds
 * the maximum allowable percentage of memory which may be retired.
 *
 * Returns 1 if the limit has been exceeded.
 */
static int
page_retire_limit(void)
{
	if (PR_KSTAT_RETIRED_NOTUE >= (uint64_t)PAGE_RETIRE_LIMIT) {
		PR_INCR_KSTAT(pr_limit_exceeded);
		return (1);
	}

	return (0);
}

#define	MSG_DM	"Data Mismatch occurred at PA 0x%08x.%08x "		\
	"[ 0x%x != 0x%x ] while attempting to clear previously "	\
	"reported error; page removed from service"

#define	MSG_UE	"Uncorrectable Error occurred at PA 0x%08x.%08x while "	\
	"attempting to clear previously reported error; page removed "	\
	"from service"

/*
 * Attempt to clear a UE from a page.
 * Returns 1 if the error has been successfully cleared.
 */
static int
page_clear_transient_ue(page_t *pp)
{
	caddr_t		kaddr;
	uint8_t		rb, wb;
	uint64_t	pa;
	uint32_t	pa_hi, pa_lo;
	on_trap_data_t	otd;
	int		errors = 0;
	int		i;

	ASSERT(PAGE_EXCL(pp));
	ASSERT(PP_PR_REQ(pp));
	ASSERT(pp->p_szc == 0);
	ASSERT(!hat_page_is_mapped(pp));

	/*
	 * Clear the page and attempt to clear the UE. If we trap
	 * on the next access to the page, we know the UE has recurred.
	 */
	pagescrub(pp, 0, PAGESIZE);

	/*
	 * Map the page and write a bunch of bit patterns to compare
	 * what we wrote with what we read back. This isn't a perfect
	 * test but it should be good enough to catch most of the
	 * recurring UEs. If this fails to catch a recurrent UE, we'll
	 * retire the page the next time we see a UE on the page.
	 */
	kaddr = ppmapin(pp, PROT_READ|PROT_WRITE, (caddr_t)-1);

	pa = ptob((uint64_t)page_pptonum(pp));
	pa_hi = (uint32_t)(pa >> 32);
	pa_lo = (uint32_t)pa;

	/*
	 * Disable preemption to prevent the off chance that
	 * we migrate while in the middle of running through
	 * the bit pattern and run on a different processor
	 * than what we started on.
	 */
	kpreempt_disable();

	/*
	 * Fill the page with each (0x00 - 0xFF] bit pattern, flushing
	 * the cache in between reading and writing. We do this under
	 * on_trap() protection to avoid recursion.
	 */
	if (on_trap(&otd, OT_DATA_EC)) {
		PR_MESSAGE(CE_WARN, 1, MSG_UE, pa);
		errors = 1;
	} else {
		for (wb = 0xff; wb > 0; wb--) {
			for (i = 0; i < PAGESIZE; i++) {
				kaddr[i] = wb;
			}

			sync_data_memory(kaddr, PAGESIZE);

			for (i = 0; i < PAGESIZE; i++) {
				rb = kaddr[i];
				if (rb != wb) {
					/*
					 * We had a mismatch without a trap.
					 * Uh-oh. Something is really wrong
					 * with this system.
					 */
					if (page_retire_messages) {
						cmn_err(CE_WARN, MSG_DM,
						    pa_hi, pa_lo, rb, wb);
					}
					errors = 1;
					goto out;	/* double break */
				}
			}
		}
	}
out:
	no_trap();
	kpreempt_enable();
	ppmapout(kaddr);

	return (errors ? 0 : 1);
}
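
/*
 * Note on coverage: the (0x00 - 0xFF] sweep above writes every nonzero
 * byte pattern, so each bit position in the page is written as both a 0
 * and a 1 (e.g. bit 0 is set by pattern 0x01 and clear in pattern 0x02).
 * A bit stuck at either value therefore mismatches on at least one
 * pattern, even if the fault never triggers a trap.
 */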

/*
 * Try to clear a page_t with a single UE. If the UE was transient, it is
 * returned to service, and we return 1. Otherwise we return 0 meaning
 * that further processing is required to retire the page.
 */
static int
page_retire_transient_ue(page_t *pp)
{
	ASSERT(PAGE_EXCL(pp));
	ASSERT(!hat_page_is_mapped(pp));

	/*
	 * If this page is a repeat offender, retire it under the
	 * "two strikes and you're out" rule. The caller is responsible
	 * for scrubbing the page to try to clear the error.
	 */
	if (pp->p_toxic & PR_UE_SCRUBBED) {
		PR_INCR_KSTAT(pr_ue_persistent);
		return (0);
	}

	if (page_clear_transient_ue(pp)) {
		/*
		 * We set the PR_UE_SCRUBBED bit; if we ever see this
		 * page again, we will retire it, no questions asked.
		 */
		page_settoxic(pp, PR_UE_SCRUBBED);

		if (page_retire_first_ue) {
			PR_INCR_KSTAT(pr_ue_cleared_retire);
			return (0);
		} else {
			PR_INCR_KSTAT(pr_ue_cleared_free);

			page_clrtoxic(pp, PR_UE | PR_MCE | PR_MSG);

			/* LINTED: CONSTCOND */
			VN_DISPOSE(pp, B_FREE, 1, kcred);
			return (1);
		}
	}

	PR_INCR_KSTAT(pr_ue_persistent);
	return (0);
}

/*
 * Update the statistics dynamically when our kstat is read.
 */
static int
page_retire_kstat_update(kstat_t *ksp, int rw)
{
	struct page_retire_kstat *pr;

	if (ksp == NULL)
		return (EINVAL);

	switch (rw) {

	case KSTAT_READ:
		pr = (struct page_retire_kstat *)ksp->ks_data;
		ASSERT(pr == &page_retire_kstat);
		pr->pr_limit.value.ui64 = PAGE_RETIRE_LIMIT;
		return (0);

	case KSTAT_WRITE:
		return (EACCES);

	default:
		return (EINVAL);
	}
	/*NOTREACHED*/
}
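
/*
 * The named kstat maintained above can be inspected from user land,
 * for example:
 *
 *	$ kstat -m unix -n page_retire
 */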

static int
pr_list_kstat_update(kstat_t *ksp, int rw)
{
	uint_t count;
	page_t *pp;
	kmutex_t *vphm;

	if (rw == KSTAT_WRITE)
		return (EACCES);

	vphm = page_vnode_mutex(retired_pages);
	mutex_enter(vphm);
	/* Needs to be under a lock so that for loop will work right */
	if (retired_pages->v_pages == NULL) {
		mutex_exit(vphm);
		ksp->ks_ndata = 0;
		ksp->ks_data_size = 0;
		return (0);
	}

	count = 1;
	for (pp = retired_pages->v_pages->p_vpnext;
	    pp != retired_pages->v_pages; pp = pp->p_vpnext) {
		count++;
	}
	mutex_exit(vphm);

	ksp->ks_ndata = count;
	ksp->ks_data_size = count * 2 * sizeof (uint64_t);

	return (0);
}

/*
 * All spans will be pagesize and no coalescing will be done with the
 * list produced.
 */
static int
pr_list_kstat_snapshot(kstat_t *ksp, void *buf, int rw)
{
	kmutex_t *vphm;
	page_t *pp;
	struct memunit {
		uint64_t address;
		uint64_t size;
	} *kspmem;

	if (rw == KSTAT_WRITE)
		return (EACCES);

	ksp->ks_snaptime = gethrtime();

	kspmem = (struct memunit *)buf;

	vphm = page_vnode_mutex(retired_pages);
	mutex_enter(vphm);
	pp = retired_pages->v_pages;
	if (((caddr_t)kspmem >= (caddr_t)buf + ksp->ks_data_size) ||
	    (pp == NULL)) {
		mutex_exit(vphm);
		return (0);
	}
	kspmem->address = ptob(pp->p_pagenum);
	kspmem->size = PAGESIZE;
	kspmem++;
	for (pp = pp->p_vpnext; pp != retired_pages->v_pages;
	    pp = pp->p_vpnext, kspmem++) {
		if ((caddr_t)kspmem >= (caddr_t)buf + ksp->ks_data_size)
			break;
		kspmem->address = ptob(pp->p_pagenum);
		kspmem->size = PAGESIZE;
	}
	mutex_exit(vphm);

	return (0);
}

/*
 * page_retire_pend_count -- helper function for page_capture_thread,
 * returns the number of pages pending retirement.
 */
uint64_t
page_retire_pend_count(void)
{
	return (PR_KSTAT_PENDING);
}

void
page_retire_incr_pend_count(void)
{
	PR_INCR_KSTAT(pr_pending);
}

void
page_retire_decr_pend_count(void)
{
	PR_DECR_KSTAT(pr_pending);
}

/*
 * Initialize the page retire mechanism:
 *
 *	- Establish the correctable error retire limit.
 *	- Initialize locks.
 *	- Build the retired_pages vnode.
 *	- Set up the kstats.
 *	- Fire off the background thread.
 *	- Tell page_retire() it's OK to start retiring pages.
 */
void
page_retire_init(void)
{
	const fs_operation_def_t retired_vnodeops_template[] = {
		{ NULL, NULL }
	};
	struct vnodeops *vops;
	kstat_t *ksp;

	const uint_t page_retire_ndata =
	    sizeof (page_retire_kstat) / sizeof (kstat_named_t);

	ASSERT(page_retire_ksp == NULL);

	if (max_pages_retired_bps <= 0) {
		max_pages_retired_bps = MCE_BPT;
	}

	mutex_init(&pr_q_mutex, NULL, MUTEX_DEFAULT, NULL);

	retired_pages = vn_alloc(KM_SLEEP);
	if (vn_make_ops("retired_pages", retired_vnodeops_template, &vops)) {
		cmn_err(CE_PANIC,
		    "page_retire_init: can't make retired vnodeops");
	}
	vn_setops(retired_pages, vops);

	if ((page_retire_ksp = kstat_create("unix", 0, "page_retire",
	    "misc", KSTAT_TYPE_NAMED, page_retire_ndata,
	    KSTAT_FLAG_VIRTUAL)) == NULL) {
		cmn_err(CE_WARN, "kstat_create for page_retire failed");
	} else {
		page_retire_ksp->ks_data = (void *)&page_retire_kstat;
		page_retire_ksp->ks_update = page_retire_kstat_update;
		kstat_install(page_retire_ksp);
	}

	mutex_init(&pr_list_kstat_mutex, NULL, MUTEX_DEFAULT, NULL);
	ksp = kstat_create("unix", 0, "page_retire_list", "misc",
	    KSTAT_TYPE_RAW, 0, KSTAT_FLAG_VAR_SIZE | KSTAT_FLAG_VIRTUAL);
	if (ksp != NULL) {
		ksp->ks_update = pr_list_kstat_update;
		ksp->ks_snapshot = pr_list_kstat_snapshot;
		ksp->ks_lock = &pr_list_kstat_mutex;
		kstat_install(ksp);
	}

	page_capture_register_callback(PC_RETIRE, -1, page_retire_pp_finish);
	pr_enable = 1;
}

/*
 * page_retire_hunt() callback for the retire thread.
 */
static void
page_retire_thread_cb(page_t *pp)
{
	PR_DEBUG(prd_tctop);
	if (!PP_ISKAS(pp) && page_trylock(pp, SE_EXCL)) {
		PR_DEBUG(prd_tclocked);
		page_unlock(pp);
	}
}

/*
 * page_retire_hunt() callback for mdboot().
 *
 * It is necessary to scrub any failing pages prior to reboot in order to
 * prevent a latent error trap from occurring on the next boot.
 */
void
page_retire_mdboot_cb(page_t *pp)
{
	/*
	 * Don't scrub the kernel, since we might still need it, unless
	 * we have UEs on the page, in which case we have nothing to lose.
	 */
	if (!PP_ISKAS(pp) || PP_TOXIC(pp)) {
		pp->p_selock = -1;	/* pacify ASSERTs */
		PP_CLRFREE(pp);
		pagescrub(pp, 0, PAGESIZE);
		pp->p_selock = 0;
	}
	pp->p_toxic = 0;
}

/*
 * Callback used by page_trycapture() to finish off retiring a page.
 * The page has already been cleaned and we've been given sole access to
 * it.
 *
 * Always returns 0 to indicate that the callback succeeded, as the
 * callback never fails to finish retiring the given page.
 */
/*ARGSUSED*/
static int
page_retire_pp_finish(page_t *pp, void *notused, uint_t flags)
{
	int toxic;

	ASSERT(PAGE_EXCL(pp));
	ASSERT(pp->p_iolock_state == 0);
	ASSERT(pp->p_szc == 0);

	toxic = pp->p_toxic;

	/*
	 * The problem page is locked, demoted, unmapped, not free,
	 * hashed out, and not COW or mlocked (whew!).
	 *
	 * Now we select our ammunition, take it around back, and shoot it.
	 */
	if (toxic & PR_UE) {
ue_error:
		if (page_retire_transient_ue(pp)) {
			PR_DEBUG(prd_uescrubbed);
			(void) page_retire_done(pp, PRD_UE_SCRUBBED);
		} else {
			PR_DEBUG(prd_uenotscrubbed);
			page_retire_destroy(pp);
			(void) page_retire_done(pp, PRD_SUCCESS);
		}
		return (0);
	} else if (toxic & PR_FMA) {
		PR_DEBUG(prd_fma);
		page_retire_destroy(pp);
		(void) page_retire_done(pp, PRD_SUCCESS);
		return (0);
	} else if (toxic & PR_MCE) {
		PR_DEBUG(prd_mce);
		page_retire_destroy(pp);
		(void) page_retire_done(pp, PRD_SUCCESS);
		return (0);
	}

	/*
	 * When page_retire_first_ue is set to zero and a UE occurs which is
	 * transient, it's possible that we clear some flags set by a second
	 * UE error on the page which occurs while the first is currently
	 * being handled and thus we need to handle the case where none of
	 * the above are set. In this instance, PR_UE_SCRUBBED should be set
	 * and thus we should execute the UE code above.
	 */
	if (toxic & PR_UE_SCRUBBED) {
		goto ue_error;
	}

	/*
	 * It's impossible to get here.
	 */
	panic("bad toxic flags 0x%x in page_retire_pp_finish\n", toxic);
	return (0);
}
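
/*
 * For illustration (a sketch, not a code path in this file): on a
 * non-FMA platform, a hypothetical UE drain routine would funnel into
 * the front door below roughly as
 *
 *	static void
 *	my_ue_drain(uint64_t pa)
 *	{
 *		(void) page_retire(pa, PR_UE);
 *	}
 *
 * while the FMA DE reaches the same function through the page retire
 * ioctl, passing PR_FMA as the reason.
 */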

/*
 * page_retire() - the front door in to retire a page.
 *
 * Ideally, page_retire() would instantly retire the requested page.
 * Unfortunately, some pages are locked or otherwise tied up and cannot be
 * retired right away. We use the page capture logic to deal with this
 * situation as it will continuously try to retire the page in the background
 * if the first attempt fails. Success is determined by looking to see whether
 * the page has been retired after the page_trycapture() attempt.
 *
 * Returns:
 *
 *	- 0 on success,
 *	- EINVAL when the PA is whacko,
 *	- EIO if the page is already retired or already pending retirement, or
 *	- EAGAIN if the page could not be _immediately_ retired but is pending.
 */
int
page_retire(uint64_t pa, uchar_t reason)
{
	page_t	*pp;

	ASSERT(reason & PR_REASONS);		/* there must be a reason */
	ASSERT(!(reason & ~PR_REASONS));	/* but no other bits */

	pp = page_numtopp_nolock(mmu_btop(pa));
	if (pp == NULL) {
		PR_MESSAGE(CE_WARN, 1, "Cannot schedule clearing of error on"
		    " page 0x%08x.%08x; page is not relocatable memory", pa);
		return (page_retire_done(pp, PRD_INVALID_PA));
	}
	if (PP_RETIRED(pp)) {
		PR_DEBUG(prd_dup1);
		return (page_retire_done(pp, PRD_DUPLICATE));
	}

	if ((reason & PR_UE) && !PP_TOXIC(pp)) {
		PR_MESSAGE(CE_NOTE, 1, "Scheduling clearing of error on"
		    " page 0x%08x.%08x", pa);
	} else if (PP_PR_REQ(pp)) {
		PR_DEBUG(prd_dup2);
		return (page_retire_done(pp, PRD_DUPLICATE));
	} else {
		PR_MESSAGE(CE_NOTE, 1, "Scheduling removal of"
		    " page 0x%08x.%08x", pa);
	}

	/* Avoid setting toxic bits in the first place */
	if ((reason & (PR_FMA | PR_MCE)) && !(reason & PR_UE) &&
	    page_retire_limit()) {
		return (page_retire_done(pp, PRD_LIMIT));
	}

	if (MTBF(pr_calls, pr_mtbf)) {
		page_settoxic(pp, reason);
		if (page_trycapture(pp, 0, CAPTURE_RETIRE, NULL) == 0) {
			PR_DEBUG(prd_prlocked);
		} else {
			PR_DEBUG(prd_prnotlocked);
		}
	} else {
		PR_DEBUG(prd_prnotlocked);
	}

	if (PP_RETIRED(pp)) {
		PR_DEBUG(prd_prretired);
		return (0);
	} else {
		cv_signal(&pc_cv);
		PR_INCR_KSTAT(pr_failed);

		if (pp->p_toxic & PR_MSG) {
			return (page_retire_done(pp, PRD_FAILED));
		} else {
			return (page_retire_done(pp, PRD_PENDING));
		}
	}
}

/*
 * Take a retired page off the retired-pages vnode and clear the toxic flags.
 * Depending on "flags", the page is either freed, or handed back to the
 * caller with its toxic state intact. Any unretire messages are printed
 * from this routine.
 *
 * Returns 0 if page pp was unretired; else an error code.
 *
 * If flags is:
 *	PR_UNR_FREE - lock the page, clear the toxic flags and free it
 *	    to the freelist.
 *	PR_UNR_TEMP - lock the page, unretire it, leave the toxic
 *	    bits set as is and return it to the caller.
 *	PR_UNR_CLEAN - page is SE_EXCL locked, unretire it, clear the
 *	    toxic flags and return it to caller as is.
 */
int
page_unretire_pp(page_t *pp, int flags)
{
	/*
	 * To be retired, a page has to be hashed onto the retired_pages vnode
	 * and have PR_RETIRED set in p_toxic.
	 */
	if (flags == PR_UNR_CLEAN ||
	    page_try_reclaim_lock(pp, SE_EXCL, SE_RETIRED)) {
		ASSERT(PAGE_EXCL(pp));
		PR_DEBUG(prd_ulocked);
		if (!PP_RETIRED(pp)) {
			PR_DEBUG(prd_unotretired);
			page_unlock(pp);
			return (page_retire_done(pp, PRD_UNR_NOT));
		}

		PR_MESSAGE(CE_NOTE, 1, "unretiring retired"
		    " page 0x%08x.%08x", mmu_ptob((uint64_t)pp->p_pagenum));
		if (pp->p_toxic & PR_FMA) {
			PR_DECR_KSTAT(pr_fma);
		} else if (pp->p_toxic & PR_UE) {
			PR_DECR_KSTAT(pr_ue);
		} else {
			PR_DECR_KSTAT(pr_mce);
		}

		if (flags == PR_UNR_TEMP)
			page_clrtoxic(pp, PR_RETIRED);
		else
			page_clrtoxic(pp, PR_TOXICFLAGS);

		if (flags == PR_UNR_FREE) {
			PR_DEBUG(prd_udestroy);
			page_destroy(pp, 0);
		} else {
			PR_DEBUG(prd_uhashout);
			page_hashout(pp, NULL);
		}

		mutex_enter(&freemem_lock);
		availrmem++;
		mutex_exit(&freemem_lock);

		PR_DEBUG(prd_uunretired);
		PR_DECR_KSTAT(pr_retired);
		PR_INCR_KSTAT(pr_unretired);
		return (page_retire_done(pp, PRD_UNR_SUCCESS));
	}
	PR_DEBUG(prd_unotlocked);
	return (page_retire_done(pp, PRD_UNR_CANTLOCK));
}

/*
 * Return a page to service by moving it from the retired_pages vnode
 * onto the freelist.
 *
 * Called from mmioctl_page_retire() on behalf of the FMA DE.
 *
 * Returns:
 *
 *	- 0 if the page is unretired,
 *	- EAGAIN if the pp can not be locked,
 *	- EINVAL if the PA is whacko, and
 *	- EIO if the pp is not retired.
 */
int
page_unretire(uint64_t pa)
{
	page_t	*pp;

	pp = page_numtopp_nolock(mmu_btop(pa));
	if (pp == NULL) {
		return (page_retire_done(pp, PRD_INVALID_PA));
	}

	return (page_unretire_pp(pp, PR_UNR_FREE));
}

/*
 * Test a page to see if it is retired. If errors is non-NULL, the toxic
 * bits of the page are returned. Returns 0 on success, error code on failure.
 */
int
page_retire_check_pp(page_t *pp, uint64_t *errors)
{
	int rc;

	if (PP_RETIRED(pp)) {
		PR_DEBUG(prd_checkhit);
		rc = 0;
	} else if (PP_PR_REQ(pp)) {
		PR_DEBUG(prd_checkmiss_pend);
		rc = EAGAIN;
	} else {
		PR_DEBUG(prd_checkmiss_noerr);
		rc = EIO;
	}

	/*
	 * We have magically arranged the bit values returned to fmd(1M)
	 * to line up with the FMA, MCE, and UE bits of the page_t.
	 */
	if (errors) {
		uint64_t toxic = (uint64_t)(pp->p_toxic & PR_ERRMASK);
		if (toxic & PR_UE_SCRUBBED) {
			toxic &= ~PR_UE_SCRUBBED;
			toxic |= PR_UE;
		}
		*errors = toxic;
	}

	return (rc);
}
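
/*
 * For example, a page whose first UE was scrubbed away carries
 * PR_UE_SCRUBBED in p_toxic rather than PR_UE; the translation above
 * folds that back into PR_UE so that fmd(1M) sees a single UE bit.
 */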

/*
 * Test to see if the page_t for a given PA is retired, and return the
 * hardware errors we have seen on the page if requested.
 *
 * Called from mmioctl_page_retire() on behalf of the FMA DE.
 *
 * Returns:
 *
 *	- 0 if the page is retired,
 *	- EIO if the page is not retired and has no errors,
 *	- EAGAIN if the page is not retired but is pending; and
 *	- EINVAL if the PA is whacko.
 */
int
page_retire_check(uint64_t pa, uint64_t *errors)
{
	page_t	*pp;

	if (errors) {
		*errors = 0;
	}

	pp = page_numtopp_nolock(mmu_btop(pa));
	if (pp == NULL) {
		return (page_retire_done(pp, PRD_INVALID_PA));
	}

	return (page_retire_check_pp(pp, errors));
}

/*
 * Page retire self-test. For now, it always returns 0.
 */
int
page_retire_test(void)
{
	page_t *first, *pp, *cpp, *cpp2, *lpp;

	/*
	 * Tests the corner case where a large page can't be retired
	 * because one of the constituent pages is locked. We mark
	 * one page to be retired and try to retire it, and mark the
	 * other page to be retired but don't try to retire it, so
	 * that page_unlock() in the failure path will recurse and try
	 * to retire THAT page. This is the worst possible situation
	 * we can get ourselves into.
	 */
	memsegs_lock(0);
	pp = first = page_first();
	do {
		if (pp->p_szc && PP_PAGEROOT(pp) == pp) {
			cpp = pp + 1;
			lpp = PP_ISFREE(pp)? pp : pp + 2;
			cpp2 = pp + 3;
			if (!page_trylock(lpp, pp == lpp? SE_EXCL : SE_SHARED))
				continue;
			if (!page_trylock(cpp, SE_EXCL)) {
				page_unlock(lpp);
				continue;
			}

			/* fails */
			(void) page_retire(ptob(cpp->p_pagenum), PR_FMA);

			page_unlock(lpp);
			page_unlock(cpp);
			(void) page_retire(ptob(cpp->p_pagenum), PR_FMA);
			(void) page_retire(ptob(cpp2->p_pagenum), PR_FMA);
		}
	} while ((pp = page_next(pp)) != first);
	memsegs_unlock(0);

	return (0);
}