/*
 * CDDL HEADER START
 *
 * The contents of this file are subject to the terms of the
 * Common Development and Distribution License (the "License").
 * You may not use this file except in compliance with the License.
 *
 * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
 * or http://www.opensolaris.org/os/licensing.
 * See the License for the specific language governing permissions
 * and limitations under the License.
 *
 * When distributing Covered Code, include this CDDL HEADER in each
 * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
 * If applicable, add the following below this CDDL HEADER, with the
 * fields enclosed by brackets "[]" replaced with your own identifying
 * information: Portions Copyright [yyyy] [name of copyright owner]
 *
 * CDDL HEADER END
 */
/*
 * Copyright 2006 Sun Microsystems, Inc.  All rights reserved.
 * Use is subject to license terms.
 */

#pragma ident	"%Z%%M%	%I%	%E% SMI"

/*
 * Page Retire - Big Theory Statement.
 *
 * This file handles removing sections of faulty memory from use when the
 * user land FMA Diagnosis Engine requests that a page be removed or when
 * a CE or UE is detected by the hardware.
 *
 * In the bad old days, the kernel side of Page Retire did a lot of the work
 * on its own. Now, with the DE keeping track of errors, the kernel side is
 * rather simple minded on most platforms.
 *
 * Errors are all reflected to the DE, and after digesting the error and
 * looking at all previously reported errors, the DE decides what should
 * be done about the current error. If the DE wants a particular page to
 * be retired, then the kernel page retire code is invoked via an ioctl.
 * On non-FMA platforms, the ue_drain and ce_drain paths end up calling
 * page retire to handle the error. Since page retire is just a simple
 * mechanism it doesn't need to differentiate between the different callers.
 *
 * The p_toxic field in the page_t is used to indicate which errors have
 * occurred and what action has been taken on a given page. Because errors are
 * reported without regard to the locked state of a page, no locks are used
 * to SET the error bits in p_toxic. However, in order to clear the error
 * bits, the page_t must be held exclusively locked.
 *
 * When page_retire() is called, it must be able to acquire locks, sleep, etc.
 * It must not be called from high-level interrupt context.
 *
 * Depending on how the requested page is being used at the time of the retire
 * request (and on the availability of sufficient system resources), the page
 * may be retired immediately, or just marked for retirement later. For
 * example, locked pages are marked, while free pages are retired. Multiple
 * requests may be made to retire the same page, although there is no need
 * to: once the p_toxic flags are set, the page will be retired as soon as it
 * can be exclusively locked.
 *
 * The retire mechanism is driven centrally out of page_unlock(). To expedite
 * the retirement of pages, further requests for SE_SHARED locks are denied
 * as long as a page retirement is pending. In addition, as long as pages are
 * pending retirement a background thread runs periodically trying to retire
 * those pages. Pages which could not be retired while the system is running
 * are scrubbed prior to rebooting to avoid latent errors on the next boot.
 *
 * UE pages without persistent errors are scrubbed and returned to service.
 * Recidivist pages, as well as FMA-directed requests for retirement, result
 * in the page being taken out of service. Once the decision is made to take
 * a page out of service, the page is cleared, hashed onto the retired_pages
 * vnode, marked as retired, and it is unlocked.  No other requesters (except
 * for unretire) are allowed to lock retired pages.
 *
 * The public routines return (sadly) 0 if they worked and a non-zero error
 * value if something went wrong. This is done for the ioctl side of the
 * world to allow errors to be reflected all the way out to user land. The
 * non-zero values are explained in comments atop each function.
 */

/*
 * Things to fix:
 *
 * 1. Trying to retire non-relocatable kvp pages may result in a
 *    quagmire. This is because seg_kmem() no longer keeps its pages locked,
 *    and calls page_lookup() in the free path; since kvp pages are modified
 *    and don't have a usable backing store, page_retire() can't do anything
 *    with them, and we'll keep denying the lock to seg_kmem_free() in a
 *    vicious cycle. To prevent that, we don't deny locks to kvp pages, and
 *    hence only try to retire a page from page_unlock() in the free path.
 *    Since most kernel pages are indefinitely held anyway, and don't
 *    participate in I/O, this is of little consequence.
 *
 * 2. Low memory situations will be interesting. If we don't have
 *    enough memory for page_relocate() to succeed, we won't be able to
 *    retire dirty pages; nobody will be able to push them out to disk
 *    either, since we aggressively deny the page lock. We could change
 *    fsflush so it can recognize this situation, grab the lock, and push
 *    the page out, where we'll catch it in the free path and retire it.
 *
 * 3. Beware of places that have code like this in them:
 *
 *	if (! page_tryupgrade(pp)) {
 *		page_unlock(pp);
 *		while (! page_lock(pp, SE_EXCL, NULL, P_RECLAIM)) {
 *			/ *NOTHING* /
 *		}
 *	}
 *	page_free(pp);
 *
 *    The problem is that pp can change identity right after the
 *    page_unlock() call.  In particular, page_retire() can step in
 *    there, change pp's identity, and hash pp onto the retired_pages
 *    vnode.
 *
 *    Of course, other functions besides page_retire() can have the
 *    same effect.
 *    A kmem reader can waltz by, set up a mapping to the
 *    page, and then unlock the page. Page_free() will then go castors
 *    up. So if anybody is doing this, it's already a bug.
 *
 * 4. mdboot()'s call into page_retire_mdboot() should probably be
 *    moved lower. Where the call is made now, we can get into trouble
 *    by scrubbing a kernel page that is then accessed later.
 */

#include <sys/types.h>
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/mman.h>
#include <sys/vnode.h>
#include <sys/cmn_err.h>
#include <sys/ksynch.h>
#include <sys/thread.h>
#include <sys/disp.h>
#include <sys/ontrap.h>
#include <sys/vmsystm.h>
#include <sys/mem_config.h>
#include <sys/atomic.h>
#include <sys/callb.h>
#include <vm/page.h>
#include <vm/vm_dep.h>
#include <vm/as.h>
#include <vm/hat.h>

/*
 * vnode for all pages which are retired from the VM system;
 */
vnode_t *retired_pages;

static int page_retire_pp_finish(page_t *, void *, uint_t);

/*
 * Make a list of all of the pages that have been marked for retirement
 * but are not yet retired.  At system shutdown, we will scrub all of the
 * pages in the list in case there are outstanding UEs.  Then, we
 * cross-check this list against the number of pages that are yet to be
 * retired, and if we find inconsistencies, we scan every page_t in the
 * whole system looking for any pages that need to be scrubbed for UEs.
 * The background thread also uses this queue to determine which pages
 * it should keep trying to retire.
 */
#ifdef	DEBUG
#define	PR_PENDING_QMAX	32
#else	/* DEBUG */
#define	PR_PENDING_QMAX	256
#endif	/* DEBUG */
page_t		*pr_pending_q[PR_PENDING_QMAX];
kmutex_t	pr_q_mutex;

/*
 * Page retire global kstats
 */
struct page_retire_kstat {
	kstat_named_t	pr_retired;
	kstat_named_t	pr_requested;
	kstat_named_t	pr_requested_free;
	kstat_named_t	pr_enqueue_fail;
	kstat_named_t	pr_dequeue_fail;
	kstat_named_t	pr_pending;
	kstat_named_t	pr_failed;
	kstat_named_t	pr_failed_kernel;
	kstat_named_t	pr_limit;
	kstat_named_t	pr_limit_exceeded;
	kstat_named_t	pr_fma;
	kstat_named_t	pr_mce;
	kstat_named_t	pr_ue;
	kstat_named_t	pr_ue_cleared_retire;
	kstat_named_t	pr_ue_cleared_free;
	kstat_named_t	pr_ue_persistent;
	kstat_named_t	pr_unretired;
};

static struct page_retire_kstat page_retire_kstat = {
	{ "pages_retired",		KSTAT_DATA_UINT64},
	{ "pages_retire_request",	KSTAT_DATA_UINT64},
	{ "pages_retire_request_free",	KSTAT_DATA_UINT64},
	{ "pages_notenqueued", 		KSTAT_DATA_UINT64},
	{ "pages_notdequeued", 		KSTAT_DATA_UINT64},
	{ "pages_pending", 		KSTAT_DATA_UINT64},
	{ "pages_deferred",		KSTAT_DATA_UINT64},
	{ "pages_deferred_kernel",	KSTAT_DATA_UINT64},
	{ "pages_limit",		KSTAT_DATA_UINT64},
	{ "pages_limit_exceeded",	KSTAT_DATA_UINT64},
	{ "pages_fma",			KSTAT_DATA_UINT64},
	{ "pages_multiple_ce",		KSTAT_DATA_UINT64},
	{ "pages_ue",			KSTAT_DATA_UINT64},
	{ "pages_ue_cleared_retired",	KSTAT_DATA_UINT64},
	{ "pages_ue_cleared_freed",	KSTAT_DATA_UINT64},
	{ "pages_ue_persistent",	KSTAT_DATA_UINT64},
	{ "pages_unretired",		KSTAT_DATA_UINT64},
};

static kstat_t  *page_retire_ksp = NULL;

#define	PR_INCR_KSTAT(stat)	\
	atomic_add_64(&(page_retire_kstat.stat.value.ui64), 1)
#define	PR_DECR_KSTAT(stat)	\
	atomic_add_64(&(page_retire_kstat.stat.value.ui64), -1)

#define	PR_KSTAT_RETIRED_CE	(page_retire_kstat.pr_mce.value.ui64)
#define	PR_KSTAT_RETIRED_FMA	(page_retire_kstat.pr_fma.value.ui64)
#define	PR_KSTAT_RETIRED_NOTUE	(PR_KSTAT_RETIRED_CE + PR_KSTAT_RETIRED_FMA)
#define	PR_KSTAT_PENDING	(page_retire_kstat.pr_pending.value.ui64)
#define	PR_KSTAT_EQFAIL		(page_retire_kstat.pr_enqueue_fail.value.ui64)
#define	PR_KSTAT_DQFAIL		(page_retire_kstat.pr_dequeue_fail.value.ui64)

/*
 * page retire kstats to list all retired pages
 */
static int pr_list_kstat_update(kstat_t *ksp, int rw);
static int pr_list_kstat_snapshot(kstat_t *ksp, void *buf, int rw);
kmutex_t pr_list_kstat_mutex;

/*
 * Limit the number of multiple CE page retires.
 * The default is 0.1% of physmem, or 1 in 1000 pages. This is set in
 * basis points, where 100 basis points equals one percent.
 */
#define	MCE_BPT	10
uint64_t	max_pages_retired_bps = MCE_BPT;
#define	PAGE_RETIRE_LIMIT	((physmem * max_pages_retired_bps) / 10000)

/*
 * Control over the verbosity of page retirement.
 *
 * When set to zero (the default), no messages will be printed.
 * When set to one, summary messages will be printed.
 * When set > one, all messages will be printed.
 *
 * A value of one will trigger detailed messages for retirement operations,
 * and is intended as a platform tunable for processors where FMA's DE does
 * not run (e.g., spitfire). Values > one are intended for debugging only.
 */
int page_retire_messages = 0;

/*
 * Control whether or not we return scrubbed UE pages to service.
 * By default we do not since FMA wants to run its diagnostics first
 * and then ask us to unretire the page if it passes. Non-FMA platforms
 * may set this to zero so we will only retire recidivist pages. It should
 * not be changed by the user.
 */
int page_retire_first_ue = 1;

/*
 * Master enable for page retire.
 * This prevents a CE or UE early in boot
 * from trying to retire a page before page_retire_init() has finished
 * setting things up. This is internal only and is not a tunable!
 */
static int pr_enable = 0;

extern struct vnode kvp;

#ifdef	DEBUG
struct page_retire_debug {
	int prd_dup1;
	int prd_dup2;
	int prd_qdup;
	int prd_noaction;
	int prd_queued;
	int prd_notqueued;
	int prd_dequeue;
	int prd_top;
	int prd_locked;
	int prd_reloc;
	int prd_relocfail;
	int prd_mod;
	int prd_mod_late;
	int prd_kern;
	int prd_free;
	int prd_noreclaim;
	int prd_hashout;
	int prd_fma;
	int prd_uescrubbed;
	int prd_uenotscrubbed;
	int prd_mce;
	int prd_prlocked;
	int prd_prnotlocked;
	int prd_prretired;
	int prd_ulocked;
	int prd_unotretired;
	int prd_udestroy;
	int prd_uhashout;
	int prd_uunretired;
	int prd_unotlocked;
	int prd_checkhit;
	int prd_checkmiss_pend;
	int prd_checkmiss_noerr;
	int prd_tctop;
	int prd_tclocked;
	int prd_hunt;
	int prd_dohunt;
	int prd_earlyhunt;
	int prd_latehunt;
	int prd_nofreedemote;
	int prd_nodemote;
	int prd_demoted;
} pr_debug;

#define	PR_DEBUG(foo)	((pr_debug.foo)++)

/*
 * A type histogram. We record the incidence of the various toxic
 * flag combinations along with the interesting page attributes. The
 * goal is to get as many combinations as we can while driving all
 * pr_debug values nonzero (indicating we've exercised all possible
 * code paths across all possible page types). Not all combinations
 * will make sense -- e.g. PRT_MOD|PRT_KERNEL.
 *
 * pr_type offset bit encoding (when examining with a debugger):
 *
 *    PRT_NAMED  - 0x4
 *    PRT_KERNEL - 0x8
 *    PRT_FREE   - 0x10
 *    PRT_MOD    - 0x20
 *    PRT_FMA    - 0x0
 *    PRT_MCE    - 0x40
 *    PRT_UE     - 0x80
 */

#define	PRT_NAMED	0x01
#define	PRT_KERNEL	0x02
#define	PRT_FREE	0x04
#define	PRT_MOD		0x08
#define	PRT_FMA		0x00	/* yes, this is not a mistake */
#define	PRT_MCE		0x10
#define	PRT_UE		0x20
#define	PRT_ALL		0x3F

int pr_types[PRT_ALL+1];

#define	PR_TYPES(pp)	{			\
	int whichtype = 0;			\
	if (pp->p_vnode)			\
		whichtype |= PRT_NAMED;		\
	if (PP_ISKVP(pp))			\
		whichtype |= PRT_KERNEL;	\
	if (PP_ISFREE(pp))			\
		whichtype |= PRT_FREE;		\
	if (hat_ismod(pp))			\
		whichtype |= PRT_MOD;		\
	if (pp->p_toxic & PR_UE)		\
		whichtype |= PRT_UE;		\
	if (pp->p_toxic & PR_MCE)		\
		whichtype |= PRT_MCE;		\
	pr_types[whichtype]++;			\
}

int recl_calls;
int recl_mtbf = 3;
int reloc_calls;
int reloc_mtbf = 7;
int pr_calls;
int pr_mtbf = 15;

#define	MTBF(v, f)	(((++(v)) & (f)) != (f))

#else	/* DEBUG */

#define	PR_DEBUG(foo)	/* nothing */
#define	PR_TYPES(foo)	/* nothing */
#define	MTBF(v, f)	(1)

#endif	/* DEBUG */

/*
 * page_retire_done() - completion processing
 *
 * Used by the page_retire code for common completion processing.
 * It keeps track of how many times a given result has happened,
 * and writes out an occasional message.
 *
 * May be called with a NULL pp (PRD_INVALID_PA case).
 */
#define	PRD_INVALID_KEY		-1
#define	PRD_SUCCESS		0
#define	PRD_PENDING		1
#define	PRD_FAILED		2
#define	PRD_DUPLICATE		3
#define	PRD_INVALID_PA		4
#define	PRD_LIMIT		5
#define	PRD_UE_SCRUBBED		6
#define	PRD_UNR_SUCCESS		7
#define	PRD_UNR_CANTLOCK	8
#define	PRD_UNR_NOT		9

typedef struct page_retire_op {
	int	pr_key;		/* one of the PRD_* defines from above */
	int	pr_count;	/* How many times this has happened */
	int	pr_retval;	/* return value */
	int	pr_msglvl;	/* message level - when to print */
	char	*pr_message;	/* Cryptic message for field service */
} page_retire_op_t;

static page_retire_op_t page_retire_ops[] = {
	/* key			count	retval	msglvl	message */
	{PRD_SUCCESS,		0,	0,	1,
		"Page 0x%08x.%08x removed from service"},
	{PRD_PENDING,		0,	EAGAIN,	2,
		"Page 0x%08x.%08x will be retired on free"},
	{PRD_FAILED,		0,	EAGAIN,	0, NULL},
	{PRD_DUPLICATE,		0,	EIO,	2,
		"Page 0x%08x.%08x already retired or pending"},
	{PRD_INVALID_PA,	0,	EINVAL, 2,
		"PA 0x%08x.%08x is not a relocatable page"},
	{PRD_LIMIT,		0,	0,	1,
		"Page 0x%08x.%08x not retired due to limit exceeded"},
	{PRD_UE_SCRUBBED,	0,	0,	1,
		"Previously reported error on page 0x%08x.%08x cleared"},
	{PRD_UNR_SUCCESS,	0,	0,	1,
		"Page 0x%08x.%08x returned to service"},
	{PRD_UNR_CANTLOCK,	0,	EAGAIN,	2,
		"Page 0x%08x.%08x could not be unretired"},
	{PRD_UNR_NOT,		0,	EIO,	2,
		"Page 0x%08x.%08x is not retired"},
	{PRD_INVALID_KEY,	0,	0,	0, NULL} /* MUST BE LAST! */
};

/*
 * print a message if page_retire_messages is true.
 */
#define	PR_MESSAGE(debuglvl, msglvl, msg, pa)				\
{									\
	uint64_t p = (uint64_t)pa;					\
	if (page_retire_messages >= msglvl && msg != NULL) {		\
		cmn_err(debuglvl, msg,					\
		    (uint32_t)(p >> 32), (uint32_t)p);			\
	}								\
}

/*
 * Note that multiple bits may be set in a single settoxic operation.
 * May be called without the page locked.
 */
void
page_settoxic(page_t *pp, uchar_t bits)
{
	atomic_or_8(&pp->p_toxic, bits);
}

/*
 * Note that multiple bits may be cleared in a single clrtoxic operation.
 * Must be called with the page exclusively locked to prevent races which
 * may attempt to retire a page without any toxic bits set.
 * Note that the PR_CAPTURE bit can be cleared without the exclusive lock
 * being held as there is a separate mutex which protects that bit.
 */
void
page_clrtoxic(page_t *pp, uchar_t bits)
{
	ASSERT((bits & PR_CAPTURE) || PAGE_EXCL(pp));
	atomic_and_8(&pp->p_toxic, ~bits);
}

/*
 * Prints any page retire messages to the user, and decides what
 * error code is appropriate for the condition reported.
 */
static int
page_retire_done(page_t *pp, int code)
{
	page_retire_op_t *prop;
	uint64_t	pa = 0;
	int		i;

	if (pp != NULL) {
		pa = mmu_ptob((uint64_t)pp->p_pagenum);
	}

	prop = NULL;
	for (i = 0; page_retire_ops[i].pr_key != PRD_INVALID_KEY; i++) {
		if (page_retire_ops[i].pr_key == code) {
			prop = &page_retire_ops[i];
			break;
		}
	}

#ifdef	DEBUG
	if (page_retire_ops[i].pr_key == PRD_INVALID_KEY) {
		cmn_err(CE_PANIC, "page_retire_done: Invalid opcode %d", code);
	}
#endif

	ASSERT(prop->pr_key == code);

	prop->pr_count++;

	PR_MESSAGE(CE_NOTE, prop->pr_msglvl, prop->pr_message, pa);
	if (pp != NULL) {
		page_settoxic(pp, PR_MSG);
	}

	return (prop->pr_retval);
}

/*
 * Act like page_destroy(), but instead of freeing the page, hash it onto
 * the retired_pages vnode, and mark it retired.
 *
 * For fun, we try to scrub the page until it's squeaky clean.
 * availrmem is adjusted here.
 */
static void
page_retire_destroy(page_t *pp)
{
	u_offset_t off = (u_offset_t)((uintptr_t)pp);

	ASSERT(PAGE_EXCL(pp));
	ASSERT(!PP_ISFREE(pp));
	ASSERT(pp->p_szc == 0);
	ASSERT(!hat_page_is_mapped(pp));
	ASSERT(!pp->p_vnode);

	page_clr_all_props(pp);
	pagescrub(pp, 0, MMU_PAGESIZE);

	pp->p_next = NULL;
	pp->p_prev = NULL;
	if (page_hashin(pp, retired_pages, off, NULL) == 0) {
		cmn_err(CE_PANIC, "retired page %p hashin failed", (void *)pp);
	}

	page_settoxic(pp, PR_RETIRED);
	PR_INCR_KSTAT(pr_retired);

	if (pp->p_toxic & PR_FMA) {
		PR_INCR_KSTAT(pr_fma);
	} else if (pp->p_toxic & PR_UE) {
		PR_INCR_KSTAT(pr_ue);
	} else {
		PR_INCR_KSTAT(pr_mce);
	}

	mutex_enter(&freemem_lock);
	availrmem--;
	mutex_exit(&freemem_lock);

	page_unlock(pp);
}

/*
 * Check whether the number of pages which have been retired already exceeds
 * the maximum allowable percentage of memory which may be retired.
 *
 * Returns 1 if the limit has been exceeded.
 */
static int
page_retire_limit(void)
{
	if (PR_KSTAT_RETIRED_NOTUE >= (uint64_t)PAGE_RETIRE_LIMIT) {
		PR_INCR_KSTAT(pr_limit_exceeded);
		return (1);
	}

	return (0);
}

#define	MSG_DM	"Data Mismatch occurred at PA 0x%08x.%08x"		\
	"[ 0x%x != 0x%x ] while attempting to clear previously "	\
	"reported error; page removed from service"

#define	MSG_UE	"Uncorrectable Error occurred at PA 0x%08x.%08x while "	\
	"attempting to clear previously reported error; page removed "	\
	"from service"

/*
 * Attempt to clear a UE from a page.
 * Returns 1 if the error has been successfully cleared.
 */
static int
page_clear_transient_ue(page_t *pp)
{
	caddr_t		kaddr;
	uint8_t		rb, wb;
	uint64_t	pa;
	uint32_t	pa_hi, pa_lo;
	on_trap_data_t	otd;
	int		errors = 0;
	int		i;

	ASSERT(PAGE_EXCL(pp));
	ASSERT(PP_PR_REQ(pp));
	ASSERT(pp->p_szc == 0);
	ASSERT(!hat_page_is_mapped(pp));

	/*
	 * Clear the page and attempt to clear the UE.  If we trap
	 * on the next access to the page, we know the UE has recurred.
	 */
	pagescrub(pp, 0, PAGESIZE);

	/*
	 * Map the page and write a bunch of bit patterns to compare
	 * what we wrote with what we read back.  This isn't a perfect
	 * test but it should be good enough to catch most of the
	 * recurring UEs. If this fails to catch a recurrent UE, we'll
	 * retire the page the next time we see a UE on the page.
	 */
	kaddr = ppmapin(pp, PROT_READ|PROT_WRITE, (caddr_t)-1);

	pa = ptob((uint64_t)page_pptonum(pp));
	pa_hi = (uint32_t)(pa >> 32);
	pa_lo = (uint32_t)pa;

	/*
	 * Fill the page with each (0x00 - 0xFF] bit pattern, flushing
	 * the cache in between reading and writing.  We do this under
	 * on_trap() protection to avoid recursion.
	 */
	if (on_trap(&otd, OT_DATA_EC)) {
		PR_MESSAGE(CE_WARN, 1, MSG_UE, pa);
		errors = 1;
	} else {
		for (wb = 0xff; wb > 0; wb--) {
			for (i = 0; i < PAGESIZE; i++) {
				kaddr[i] = wb;
			}

			sync_data_memory(kaddr, PAGESIZE);

			for (i = 0; i < PAGESIZE; i++) {
				rb = kaddr[i];
				if (rb != wb) {
					/*
					 * We had a mismatch without a trap.
					 * Uh-oh. Something is really wrong
					 * with this system.
					 */
					if (page_retire_messages) {
						cmn_err(CE_WARN, MSG_DM,
						    pa_hi, pa_lo, rb, wb);
					}
					errors = 1;
					goto out;	/* double break */
				}
			}
		}
	}
out:
	no_trap();
	ppmapout(kaddr);

	return (errors ? 0 : 1);
}

/*
 * Try to clear a page_t with a single UE.
 * If the UE was transient, it is
 * returned to service, and we return 1. Otherwise we return 0 meaning
 * that further processing is required to retire the page.
 */
static int
page_retire_transient_ue(page_t *pp)
{
	ASSERT(PAGE_EXCL(pp));
	ASSERT(!hat_page_is_mapped(pp));

	/*
	 * If this page is a repeat offender, retire him under the
	 * "two strikes and you're out" rule. The caller is responsible
	 * for scrubbing the page to try to clear the error.
	 */
	if (pp->p_toxic & PR_UE_SCRUBBED) {
		PR_INCR_KSTAT(pr_ue_persistent);
		return (0);
	}

	if (page_clear_transient_ue(pp)) {
		/*
		 * We set the PR_UE_SCRUBBED bit; if we ever see this
		 * page again, we will retire it, no questions asked.
		 */
		page_settoxic(pp, PR_UE_SCRUBBED);

		if (page_retire_first_ue) {
			PR_INCR_KSTAT(pr_ue_cleared_retire);
			return (0);
		} else {
			PR_INCR_KSTAT(pr_ue_cleared_free);

			page_clrtoxic(pp, PR_UE | PR_MCE | PR_MSG);

			/* LINTED: CONSTCOND */
			VN_DISPOSE(pp, B_FREE, 1, kcred);
			return (1);
		}
	}

	PR_INCR_KSTAT(pr_ue_persistent);
	return (0);
}

/*
 * Update the statistics dynamically when our kstat is read.
 */
static int
page_retire_kstat_update(kstat_t *ksp, int rw)
{
	struct page_retire_kstat *pr;

	if (ksp == NULL)
		return (EINVAL);

	switch (rw) {

	case KSTAT_READ:
		pr = (struct page_retire_kstat *)ksp->ks_data;
		ASSERT(pr == &page_retire_kstat);
		pr->pr_limit.value.ui64 = PAGE_RETIRE_LIMIT;
		return (0);

	case KSTAT_WRITE:
		return (EACCES);

	default:
		return (EINVAL);
	}
	/*NOTREACHED*/
}

static int
pr_list_kstat_update(kstat_t *ksp, int rw)
{
	uint_t count;
	page_t *pp;
	kmutex_t *vphm;

	if (rw == KSTAT_WRITE)
		return (EACCES);

	vphm = page_vnode_mutex(retired_pages);
	mutex_enter(vphm);
	/* Needs to be under a lock so that for loop will work right */
	if (retired_pages->v_pages == NULL) {
		mutex_exit(vphm);
		ksp->ks_ndata = 0;
		ksp->ks_data_size = 0;
		return (0);
	}

	count = 1;
	for (pp = retired_pages->v_pages->p_vpnext;
	    pp != retired_pages->v_pages; pp = pp->p_vpnext) {
		count++;
	}
	mutex_exit(vphm);

	ksp->ks_ndata = count;
	ksp->ks_data_size = count * 2 * sizeof (uint64_t);

	return (0);
}

/*
 * all spans will be pagesize and no coalescing will be done with the
 * list produced.
 */
static int
pr_list_kstat_snapshot(kstat_t *ksp, void *buf, int rw)
{
	kmutex_t *vphm;
	page_t *pp;
	struct memunit {
		uint64_t address;
		uint64_t size;
	} *kspmem;

	if (rw == KSTAT_WRITE)
		return (EACCES);

	ksp->ks_snaptime = gethrtime();

	kspmem = (struct memunit *)buf;

	vphm = page_vnode_mutex(retired_pages);
	mutex_enter(vphm);
	pp = retired_pages->v_pages;
	if (((caddr_t)kspmem >= (caddr_t)buf + ksp->ks_data_size) ||
	    (pp == NULL)) {
		mutex_exit(vphm);
		return (0);
	}
	kspmem->address = ptob(pp->p_pagenum);
	kspmem->size = PAGESIZE;
	kspmem++;
	for (pp = pp->p_vpnext; pp != retired_pages->v_pages;
	    pp = pp->p_vpnext, kspmem++) {
		if ((caddr_t)kspmem >= (caddr_t)buf + ksp->ks_data_size)
			break;
		kspmem->address = ptob(pp->p_pagenum);
		kspmem->size = PAGESIZE;
	}
	mutex_exit(vphm);

	return (0);
}

/*
 * Initialize the page retire mechanism:
 *
 *   - Establish the correctable error retire limit.
 *   - Initialize locks.
 *   - Build the retired_pages vnode.
 *   - Set up the kstats.
 *   - Fire off the background thread.
 *   - Tell page_retire() it's OK to start retiring pages.
 */
void
page_retire_init(void)
{
	const fs_operation_def_t retired_vnodeops_template[] = {NULL, NULL};
	struct vnodeops *vops;
	kstat_t *ksp;

	const uint_t page_retire_ndata =
	    sizeof (page_retire_kstat) / sizeof (kstat_named_t);

	ASSERT(page_retire_ksp == NULL);

	if (max_pages_retired_bps <= 0) {
		max_pages_retired_bps = MCE_BPT;
	}

	mutex_init(&pr_q_mutex, NULL, MUTEX_DEFAULT, NULL);

	retired_pages = vn_alloc(KM_SLEEP);
	if (vn_make_ops("retired_pages", retired_vnodeops_template, &vops)) {
		cmn_err(CE_PANIC,
		    "page_retire_init: can't make retired vnodeops");
	}
	vn_setops(retired_pages, vops);

	if ((page_retire_ksp = kstat_create("unix", 0, "page_retire",
	    "misc", KSTAT_TYPE_NAMED, page_retire_ndata,
	    KSTAT_FLAG_VIRTUAL)) == NULL) {
		cmn_err(CE_WARN, "kstat_create for page_retire failed");
	} else {
		page_retire_ksp->ks_data = (void *)&page_retire_kstat;
		page_retire_ksp->ks_update = page_retire_kstat_update;
		kstat_install(page_retire_ksp);
	}

	mutex_init(&pr_list_kstat_mutex, NULL, MUTEX_DEFAULT, NULL);
	ksp = kstat_create("unix", 0, "page_retire_list", "misc",
	    KSTAT_TYPE_RAW, 0, KSTAT_FLAG_VAR_SIZE | KSTAT_FLAG_VIRTUAL);
	if (ksp != NULL) {
		ksp->ks_update = pr_list_kstat_update;
		ksp->ks_snapshot = pr_list_kstat_snapshot;
		ksp->ks_lock = &pr_list_kstat_mutex;
		kstat_install(ksp);
	}

	page_capture_register_callback(PC_RETIRE, -1, page_retire_pp_finish);
	pr_enable = 1;
}

/*
 * page_retire_hunt() callback for the retire thread.
 */
static void
page_retire_thread_cb(page_t *pp)
{
	PR_DEBUG(prd_tctop);
	if (!PP_ISKVP(pp) && page_trylock(pp, SE_EXCL)) {
		PR_DEBUG(prd_tclocked);
		page_unlock(pp);
	}
}

/*
 * page_retire_hunt() callback for mdboot().
 *
 * It is necessary to scrub any failing pages prior to reboot in order to
 * prevent a latent error trap from occurring on the next boot.
 */
void
page_retire_mdboot_cb(page_t *pp)
{
	/*
	 * Don't scrub the kernel, since we might still need it, unless
	 * we have UEs on the page, in which case we have nothing to lose.
	 */
	if (!PP_ISKVP(pp) || PP_TOXIC(pp)) {
		pp->p_selock = -1;	/* pacify ASSERTs */
		PP_CLRFREE(pp);
		pagescrub(pp, 0, PAGESIZE);
		pp->p_selock = 0;
	}
	pp->p_toxic = 0;
}


/*
 * Callback used by page_trycapture() to finish off retiring a page.
 * The page has already been cleaned and we've been given sole access to
 * it.
 * Always returns 0 to indicate that callback succeeded as the callback never
 * fails to finish retiring the given page.
 */
/*ARGSUSED*/
static int
page_retire_pp_finish(page_t *pp, void *notused, uint_t flags)
{
	int		toxic;

	ASSERT(PAGE_EXCL(pp));
	ASSERT(pp->p_iolock_state == 0);
	ASSERT(pp->p_szc == 0);

	toxic = pp->p_toxic;

	/*
	 * The problem page is locked, demoted, unmapped, not free,
	 * hashed out, and not COW or mlocked (whew!).
	 *
	 * Now we select our ammunition, take it around back, and shoot it.
	 */
	if (toxic & PR_UE) {
ue_error:
		if (page_retire_transient_ue(pp)) {
			PR_DEBUG(prd_uescrubbed);
			(void) page_retire_done(pp, PRD_UE_SCRUBBED);
		} else {
			PR_DEBUG(prd_uenotscrubbed);
			page_retire_destroy(pp);
			(void) page_retire_done(pp, PRD_SUCCESS);
		}
		return (0);
	} else if (toxic & PR_FMA) {
		PR_DEBUG(prd_fma);
		page_retire_destroy(pp);
		(void) page_retire_done(pp, PRD_SUCCESS);
		return (0);
	} else if (toxic & PR_MCE) {
		PR_DEBUG(prd_mce);
		page_retire_destroy(pp);
		(void) page_retire_done(pp, PRD_SUCCESS);
		return (0);
	}

	/*
	 * When page_retire_first_ue is set to zero and a UE occurs which is
	 * transient, it's possible that we clear some flags set by a second
	 * UE error on the page which occurs while the first is currently being
	 * handled and thus we need to handle the case where none of the above
	 * are set.  In this instance, PR_UE_SCRUBBED should be set and thus
	 * we should execute the UE code above.
	 */
	if (toxic & PR_UE_SCRUBBED) {
		goto ue_error;
	}

	/*
	 * It's impossible to get here.
	 */
	panic("bad toxic flags 0x%x in page_retire_pp_finish\n", toxic);
	return (0);
}

/*
 * page_retire() - the front door in to retire a page.
 *
 * Ideally, page_retire() would instantly retire the requested page.
 * Unfortunately, some pages are locked or otherwise tied up and cannot be
 * retired right away.  We use the page capture logic to deal with this
 * situation as it will continuously try to retire the page in the background
 * if the first attempt fails.  Success is determined by looking to see whether
 * the page has been retired after the page_trycapture() attempt.
 *
 * Returns:
 *
 *   - 0 on success,
 *   - EINVAL when the PA is whacko,
 *   - EIO if the page is already retired or already pending retirement, or
 *   - EAGAIN if the page could not be _immediately_ retired but is pending.
 */
int
page_retire(uint64_t pa, uchar_t reason)
{
	page_t	*pp;

	ASSERT(reason & PR_REASONS);		/* there must be a reason */
	ASSERT(!(reason & ~PR_REASONS));	/* but no other bits */

	pp = page_numtopp_nolock(mmu_btop(pa));
	if (pp == NULL) {
		PR_MESSAGE(CE_WARN, 1, "Cannot schedule clearing of error on"
		    " page 0x%08x.%08x; page is not relocatable memory", pa);
		return (page_retire_done(pp, PRD_INVALID_PA));
	}
	if (PP_RETIRED(pp)) {
		PR_DEBUG(prd_dup1);
		return (page_retire_done(pp, PRD_DUPLICATE));
	}

	if ((reason & PR_UE) && !PP_TOXIC(pp)) {
		PR_MESSAGE(CE_NOTE, 1, "Scheduling clearing of error on"
		    " page 0x%08x.%08x", pa);
	} else if (PP_PR_REQ(pp)) {
		PR_DEBUG(prd_dup2);
		return (page_retire_done(pp, PRD_DUPLICATE));
	} else {
		PR_MESSAGE(CE_NOTE, 1, "Scheduling removal of"
		    " page 0x%08x.%08x", pa);
	}

	/* Avoid setting toxic bits in the first place */
	if ((reason & (PR_FMA | PR_MCE)) && !(reason & PR_UE) &&
	    page_retire_limit()) {
		return (page_retire_done(pp, PRD_LIMIT));
	}

	if (MTBF(pr_calls, pr_mtbf)) {
		page_settoxic(pp, reason);
		if (page_trycapture(pp, 0, CAPTURE_RETIRE, NULL) == 0) {
			PR_DEBUG(prd_prlocked);
		} else {
			PR_DEBUG(prd_prnotlocked);
		}
	} else {
		PR_DEBUG(prd_prnotlocked);
	}

	if (PP_RETIRED(pp)) {
		PR_DEBUG(prd_prretired);
		return (0);
	} else {
		cv_signal(&pc_cv);
		PR_INCR_KSTAT(pr_failed);

		if (pp->p_toxic & PR_MSG) {
			return (page_retire_done(pp, PRD_FAILED));
		} else {
			return (page_retire_done(pp, PRD_PENDING));
		}
	}
}

/*
 * Take a retired page off the retired-pages vnode and, unless PR_UNR_TEMP is
 * given, clear the toxic flags. The flags argument, described below, selects
 * whether this routine takes the lock itself and whether the page is freed
 * or handed back to the caller.
 *
 * Any unretire messages are printed from this routine.
 *
 * Returns 0 if page pp was unretired; else an error code.
 *
 * If flags is:
 *	PR_UNR_FREE - lock the page, clear the toxic flags and free it
 *	    to the freelist.
 *	PR_UNR_TEMP - lock the page, unretire it, leave the toxic
 *	    bits set as is and return it to the caller.
 *	PR_UNR_CLEAN - page is SE_EXCL locked, unretire it, clear the
 *	    toxic flags and return it to caller as is.
 */
int
page_unretire_pp(page_t *pp, int flags)
{
	/*
	 * To be retired, a page has to be hashed onto the retired_pages vnode
	 * and have PR_RETIRED set in p_toxic.
	 */
	if (flags == PR_UNR_CLEAN ||
	    page_try_reclaim_lock(pp, SE_EXCL, SE_RETIRED)) {
		ASSERT(PAGE_EXCL(pp));
		PR_DEBUG(prd_ulocked);
		if (!PP_RETIRED(pp)) {
			PR_DEBUG(prd_unotretired);
			page_unlock(pp);
			return (page_retire_done(pp, PRD_UNR_NOT));
		}

		PR_MESSAGE(CE_NOTE, 1, "unretiring retired"
		    " page 0x%08x.%08x", mmu_ptob((uint64_t)pp->p_pagenum));
		if (pp->p_toxic & PR_FMA) {
			PR_DECR_KSTAT(pr_fma);
		} else if (pp->p_toxic & PR_UE) {
			PR_DECR_KSTAT(pr_ue);
		} else {
			PR_DECR_KSTAT(pr_mce);
		}

		if (flags == PR_UNR_TEMP)
			page_clrtoxic(pp, PR_RETIRED);
		else
			page_clrtoxic(pp, PR_TOXICFLAGS);

		if (flags == PR_UNR_FREE) {
			PR_DEBUG(prd_udestroy);
			page_destroy(pp, 0);
		} else {
			PR_DEBUG(prd_uhashout);
			page_hashout(pp, NULL);
		}

		mutex_enter(&freemem_lock);
		availrmem++;
		mutex_exit(&freemem_lock);

		PR_DEBUG(prd_uunretired);
		PR_DECR_KSTAT(pr_retired);
		PR_INCR_KSTAT(pr_unretired);
		return (page_retire_done(pp, PRD_UNR_SUCCESS));
	}
	PR_DEBUG(prd_unotlocked);
	return (page_retire_done(pp, PRD_UNR_CANTLOCK));
}

/*
 * Return a page to service by moving it from the retired_pages vnode
 * onto the freelist.
 *
 * Called from mmioctl_page_retire() on behalf of the FMA DE.
 *
 * Returns:
 *
 *   - 0 if the page is unretired,
 *   - EAGAIN if the pp cannot be locked,
 *   - EINVAL if the PA is whacko, and
 *   - EIO if the pp is not retired.
 */
int
page_unretire(uint64_t pa)
{
	page_t	*pp;

	pp = page_numtopp_nolock(mmu_btop(pa));
	if (pp == NULL) {
		return (page_retire_done(pp, PRD_INVALID_PA));
	}

	return (page_unretire_pp(pp, PR_UNR_FREE));
}

/*
 * Test a page to see if it is retired. If errors is non-NULL, the toxic
 * bits of the page are returned. Returns 0 if the page is retired, EAGAIN
 * if a retire request is pending, and EIO otherwise.
 */
int
page_retire_check_pp(page_t *pp, uint64_t *errors)
{
	int rc;

	if (PP_RETIRED(pp)) {
		PR_DEBUG(prd_checkhit);
		rc = 0;
	} else if (PP_PR_REQ(pp)) {
		PR_DEBUG(prd_checkmiss_pend);
		rc = EAGAIN;
	} else {
		PR_DEBUG(prd_checkmiss_noerr);
		rc = EIO;
	}

	/*
	 * We have magically arranged the bit values returned to fmd(1M)
	 * to line up with the FMA, MCE, and UE bits of the page_t.
	 */
	if (errors) {
		uint64_t toxic = (uint64_t)(pp->p_toxic & PR_ERRMASK);
		if (toxic & PR_UE_SCRUBBED) {
			toxic &= ~PR_UE_SCRUBBED;
			toxic |= PR_UE;
		}
		*errors = toxic;
	}

	return (rc);
}

/*
 * Test to see if the page_t for a given PA is retired, and return the
 * hardware errors we have seen on the page if requested.
 *
 * Called from mmioctl_page_retire() on behalf of the FMA DE.
 *
 * Returns:
 *
 *   - 0 if the page is retired,
 *   - EIO if the page is not retired and has no errors,
 *   - EAGAIN if the page is not retired but is pending; and
 *   - EINVAL if the PA is whacko.
 */
int
page_retire_check(uint64_t pa, uint64_t *errors)
{
	page_t	*pp;

	if (errors) {
		*errors = 0;
	}

	pp = page_numtopp_nolock(mmu_btop(pa));
	if (pp == NULL) {
		return (page_retire_done(pp, PRD_INVALID_PA));
	}

	return (page_retire_check_pp(pp, errors));
}

/*
 * Page retire self-test. For now, it always returns 0.
 */
int
page_retire_test(void)
{
	page_t *first, *pp, *cpp, *cpp2, *lpp;

	/*
	 * Tests the corner case where a large page can't be retired
	 * because one of the constituent pages is locked. We mark
	 * one page to be retired and try to retire it, and mark the
	 * other page to be retired but don't try to retire it, so
	 * that page_unlock() in the failure path will recurse and try
	 * to retire THAT page. This is the worst possible situation
	 * we can get ourselves into.
	 */
	memsegs_lock(0);
	pp = first = page_first();
	do {
		if (pp->p_szc && PP_PAGEROOT(pp) == pp) {
			cpp = pp + 1;
			lpp = PP_ISFREE(pp)? pp : pp + 2;
			cpp2 = pp + 3;
			if (!page_trylock(lpp, pp == lpp? SE_EXCL : SE_SHARED))
				continue;
			if (!page_trylock(cpp, SE_EXCL)) {
				page_unlock(lpp);
				continue;
			}

			/* fails */
			(void) page_retire(ptob(cpp->p_pagenum), PR_FMA);

			page_unlock(lpp);
			page_unlock(cpp);
			(void) page_retire(ptob(cpp->p_pagenum), PR_FMA);
			(void) page_retire(ptob(cpp2->p_pagenum), PR_FMA);
		}
	} while ((pp = page_next(pp)) != first);
	memsegs_unlock(0);

	return (0);
}
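/*
 * Illustrative sketch only; not part of the kernel build. It models, in
 * user space, how page_retire_check_pp() picks its return code and folds
 * PR_UE_SCRUBBED into PR_UE before reporting errors to the caller. Every
 * EX_* name and bit value below is a hypothetical stand-in: the real PR_*
 * flags and the PP_RETIRED()/PP_PR_REQ() macros are defined in the kernel
 * headers and may differ. In particular, PP_PR_REQ() is assumed here to
 * mean "at least one retire reason bit is set and the page is not retired".
 */

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit layout, chosen for illustration only. */
#define	EX_PR_MCE		0x01u
#define	EX_PR_UE		0x02u
#define	EX_PR_UE_SCRUBBED	0x04u
#define	EX_PR_FMA		0x08u
#define	EX_PR_RETIRED		0x80u
#define	EX_PR_REASONS		(EX_PR_UE | EX_PR_MCE | EX_PR_FMA)
#define	EX_PR_ERRMASK		(EX_PR_REASONS | EX_PR_UE_SCRUBBED)

/*
 * Model of the check: 0 if retired, EAGAIN if a retire request is pending,
 * EIO otherwise. If errors is non-NULL, the error bits are stored through
 * it, with a scrubbed UE reported as a plain UE.
 */
int
ex_retire_check(unsigned int toxic, uint64_t *errors)
{
	int rc;

	if (toxic & EX_PR_RETIRED) {
		rc = 0;
	} else if (toxic & EX_PR_REASONS) {
		rc = EAGAIN;
	} else {
		rc = EIO;
	}

	if (errors != NULL) {
		uint64_t seen = (uint64_t)(toxic & EX_PR_ERRMASK);

		if (seen & EX_PR_UE_SCRUBBED) {
			seen &= ~(uint64_t)EX_PR_UE_SCRUBBED;
			seen |= EX_PR_UE;
		}
		*errors = seen;
	}

	return (rc);
}

/* Usage example: a retired FMA page and a page with a pending MCE request. */
void
ex_self_check(void)
{
	uint64_t errs = 0;

	assert(ex_retire_check(EX_PR_RETIRED | EX_PR_FMA, &errs) == 0);
	assert(errs == EX_PR_FMA);

	assert(ex_retire_check(EX_PR_MCE, &errs) == EAGAIN);
	assert(errs == EX_PR_MCE);
}
```

/*
 * Under the assumed layout, a page whose only flag is the scrubbed-UE bit
 * is neither retired nor pending, so the model returns EIO while still
 * reporting a UE in *errors.
 */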