/*
 * CDDL HEADER START
 *
 * The contents of this file are subject to the terms of the
 * Common Development and Distribution License (the "License").
 * You may not use this file except in compliance with the License.
 *
 * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
 * or http://www.opensolaris.org/os/licensing.
 * See the License for the specific language governing permissions
 * and limitations under the License.
 *
 * When distributing Covered Code, include this CDDL HEADER in each
 * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
 * If applicable, add the following below this CDDL HEADER, with the
 * fields enclosed by brackets "[]" replaced with your own identifying
 * information: Portions Copyright [yyyy] [name of copyright owner]
 *
 * CDDL HEADER END
 */
/*
 * Copyright 2007 Sun Microsystems, Inc.  All rights reserved.
 * Use is subject to license terms.
 */

#pragma ident	"%Z%%M%	%I%	%E% SMI"

/*
 * Page Retire - Big Theory Statement.
 *
 * This file handles removing sections of faulty memory from use when the
 * user land FMA Diagnosis Engine requests that a page be removed or when
 * a CE or UE is detected by the hardware.
 *
 * In the bad old days, the kernel side of Page Retire did a lot of the work
 * on its own. Now, with the DE keeping track of errors, the kernel side is
 * rather simple-minded on most platforms.
 *
 * Errors are all reflected to the DE, and after digesting the error and
 * looking at all previously reported errors, the DE decides what should
 * be done about the current error. If the DE wants a particular page to
 * be retired, then the kernel page retire code is invoked via an ioctl.
 * On non-FMA platforms, the ue_drain and ce_drain paths end up calling
 * page retire to handle the error. Since page retire is just a simple
 * mechanism it doesn't need to differentiate between the different callers.
 *
 * The p_toxic field in the page_t is used to indicate which errors have
 * occurred and what action has been taken on a given page. Because errors are
 * reported without regard to the locked state of a page, no locks are used
 * to SET the error bits in p_toxic. However, in order to clear the error
 * bits, the page_t must be held exclusively locked.
 *
 * When page_retire() is called, it must be able to acquire locks, sleep, etc.
 * It must not be called from high-level interrupt context.
 *
 * Depending on how the requested page is being used at the time of the retire
 * request (and on the availability of sufficient system resources), the page
 * may be retired immediately, or just marked for retirement later. For
 * example, locked pages are marked, while free pages are retired. Multiple
 * requests may be made to retire the same page, although there is no need
 * to: once the p_toxic flags are set, the page will be retired as soon as it
 * can be exclusively locked.
 *
 * The retire mechanism is driven centrally out of page_unlock(). To expedite
 * the retirement of pages, further requests for SE_SHARED locks are denied
 * as long as a page retirement is pending. In addition, as long as pages are
 * pending retirement a background thread runs periodically trying to retire
 * those pages. Pages which could not be retired while the system is running
 * are scrubbed prior to rebooting to avoid latent errors on the next boot.
 *
 * UE pages without persistent errors are scrubbed and returned to service.
 * Recidivist pages, as well as FMA-directed requests for retirement, result
 * in the page being taken out of service. Once the decision is made to take
 * a page out of service, the page is cleared, hashed onto the retired_pages
 * vnode, marked as retired, and it is unlocked. No other requesters (except
 * for unretire) are allowed to lock retired pages.
 *
 * The public routines return (sadly) 0 if they worked and a non-zero error
 * value if something went wrong. This is done for the ioctl side of the
 * world to allow errors to be reflected all the way out to user land. The
 * non-zero values are explained in comments atop each function.
 */

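/*
 * Illustrative sketch (editorial addition, not code from this file's
 * callers) of the typical flow of an FMA-directed retirement.  The DE
 * identifies a faulty page and the memory ioctl path (mmioctl_page_retire)
 * ends up calling page_retire() below:
 *
 *	uint64_t pa = ...;			(PA chosen by the DE)
 *
 *	switch (page_retire(pa, PR_FMA)) {
 *	case 0:
 *		(page was retired immediately)
 *		break;
 *	case EAGAIN:
 *		(p_toxic is set; the page will be retired from
 *		 page_unlock() or by the background thread)
 *		break;
 *	case EIO:
 *		(already retired, or retirement already pending)
 *		break;
 *	case EINVAL:
 *		(pa does not correspond to relocatable memory)
 *		break;
 *	}
 *
 * The caller does not need to retry on EAGAIN: once the toxic bits are set,
 * the retire machinery keeps trying until the page can be exclusively locked.
 */
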
/*
 * Things to fix:
 *
 *	1. Trying to retire non-relocatable kvp pages may result in a
 *	quagmire. This is because seg_kmem() no longer keeps its pages locked,
 *	and calls page_lookup() in the free path; since kvp pages are modified
 *	and don't have a usable backing store, page_retire() can't do anything
 *	with them, and we'll keep denying the lock to seg_kmem_free() in a
 *	vicious cycle. To prevent that, we don't deny locks to kvp pages, and
 *	hence only try to retire a page from page_unlock() in the free path.
 *	Since most kernel pages are indefinitely held anyway, and don't
 *	participate in I/O, this is of little consequence.
 *
 *	2. Low memory situations will be interesting. If we don't have
 *	enough memory for page_relocate() to succeed, we won't be able to
 *	retire dirty pages; nobody will be able to push them out to disk
 *	either, since we aggressively deny the page lock. We could change
 *	fsflush so it can recognize this situation, grab the lock, and push
 *	the page out, where we'll catch it in the free path and retire it.
 *
 *	3. Beware of places that have code like this in them:
 *
 *		if (! page_tryupgrade(pp)) {
 *			page_unlock(pp);
 *			while (! page_lock(pp, SE_EXCL, NULL, P_RECLAIM)) {
 *				/ *NOTHING* /
 *			}
 *		}
 *		page_free(pp);
 *
 *	The problem is that pp can change identity right after the
 *	page_unlock() call.  In particular, page_retire() can step in
 *	there, change pp's identity, and hash pp onto the retired_vnode.
 *
 *	Of course, other functions besides page_retire() can have the
 *	same effect. A kmem reader can waltz by, set up a mapping to the
 *	page, and then unlock the page. Page_free() will then go castors
 *	up. So if anybody is doing this, it's already a bug.
 *
 *	4. mdboot()'s call into page_retire_mdboot() should probably be
 *	moved lower. Where the call is made now, we can get into trouble
 *	by scrubbing a kernel page that is then accessed later.
 */

#include <sys/types.h>
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/mman.h>
#include <sys/vnode.h>
#include <sys/cmn_err.h>
#include <sys/ksynch.h>
#include <sys/thread.h>
#include <sys/disp.h>
#include <sys/ontrap.h>
#include <sys/vmsystm.h>
#include <sys/mem_config.h>
#include <sys/atomic.h>
#include <sys/callb.h>
#include <vm/page.h>
#include <vm/vm_dep.h>
#include <vm/as.h>
#include <vm/hat.h>

/*
 * vnode for all pages which are retired from the VM system.
 */
vnode_t *retired_pages;

static int page_retire_pp_finish(page_t *, void *, uint_t);

/*
 * Make a list of all of the pages that have been marked for retirement
 * but are not yet retired.  At system shutdown, we will scrub all of the
 * pages in the list in case there are outstanding UEs.  Then, we
 * cross-check this list against the number of pages that are yet to be
 * retired, and if we find inconsistencies, we scan every page_t in the
 * whole system looking for any pages that need to be scrubbed for UEs.
 * The background thread also uses this queue to determine which pages
 * it should keep trying to retire.
 */
#ifdef	DEBUG
#define	PR_PENDING_QMAX	32
#else	/* DEBUG */
#define	PR_PENDING_QMAX	256
#endif	/* DEBUG */
page_t		*pr_pending_q[PR_PENDING_QMAX];
kmutex_t	pr_q_mutex;

/*
 * Page retire global kstats
 */
struct page_retire_kstat {
	kstat_named_t	pr_retired;
	kstat_named_t	pr_requested;
	kstat_named_t	pr_requested_free;
	kstat_named_t	pr_enqueue_fail;
	kstat_named_t	pr_dequeue_fail;
	kstat_named_t	pr_pending;
	kstat_named_t	pr_failed;
	kstat_named_t	pr_failed_kernel;
	kstat_named_t	pr_limit;
	kstat_named_t	pr_limit_exceeded;
	kstat_named_t	pr_fma;
	kstat_named_t	pr_mce;
	kstat_named_t	pr_ue;
	kstat_named_t	pr_ue_cleared_retire;
	kstat_named_t	pr_ue_cleared_free;
	kstat_named_t	pr_ue_persistent;
	kstat_named_t	pr_unretired;
};

static struct page_retire_kstat page_retire_kstat = {
	{ "pages_retired",		KSTAT_DATA_UINT64},
	{ "pages_retire_request",	KSTAT_DATA_UINT64},
	{ "pages_retire_request_free",	KSTAT_DATA_UINT64},
	{ "pages_notenqueued",		KSTAT_DATA_UINT64},
	{ "pages_notdequeued",		KSTAT_DATA_UINT64},
	{ "pages_pending",		KSTAT_DATA_UINT64},
	{ "pages_deferred",		KSTAT_DATA_UINT64},
	{ "pages_deferred_kernel",	KSTAT_DATA_UINT64},
	{ "pages_limit",		KSTAT_DATA_UINT64},
	{ "pages_limit_exceeded",	KSTAT_DATA_UINT64},
	{ "pages_fma",			KSTAT_DATA_UINT64},
	{ "pages_multiple_ce",		KSTAT_DATA_UINT64},
	{ "pages_ue",			KSTAT_DATA_UINT64},
	{ "pages_ue_cleared_retired",	KSTAT_DATA_UINT64},
	{ "pages_ue_cleared_freed",	KSTAT_DATA_UINT64},
	{ "pages_ue_persistent",	KSTAT_DATA_UINT64},
	{ "pages_unretired",		KSTAT_DATA_UINT64},
};

static kstat_t *page_retire_ksp = NULL;

#define	PR_INCR_KSTAT(stat)	\
	atomic_add_64(&(page_retire_kstat.stat.value.ui64), 1)
#define	PR_DECR_KSTAT(stat)	\
	atomic_add_64(&(page_retire_kstat.stat.value.ui64), -1)

#define	PR_KSTAT_RETIRED_CE	(page_retire_kstat.pr_mce.value.ui64)
#define	PR_KSTAT_RETIRED_FMA	(page_retire_kstat.pr_fma.value.ui64)
#define	PR_KSTAT_RETIRED_NOTUE	(PR_KSTAT_RETIRED_CE + PR_KSTAT_RETIRED_FMA)
#define	PR_KSTAT_PENDING	(page_retire_kstat.pr_pending.value.ui64)
#define	PR_KSTAT_EQFAIL		(page_retire_kstat.pr_enqueue_fail.value.ui64)
#define	PR_KSTAT_DQFAIL		(page_retire_kstat.pr_dequeue_fail.value.ui64)

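/*
 * Editorial note: the named counters above surface in userland as the
 * "unix:0:page_retire" kstat and can be examined with kstat(1M).  A minimal
 * libkstat(3KSTAT) reader, shown for illustration only (userland code, not
 * part of this file):
 *
 *	#include <sys/types.h>
 *	#include <stdio.h>
 *	#include <kstat.h>
 *
 *	int
 *	main(void)
 *	{
 *		kstat_ctl_t *kc;
 *		kstat_t *ksp;
 *		kstat_named_t *kn;
 *
 *		if ((kc = kstat_open()) == NULL)
 *			return (1);
 *		ksp = kstat_lookup(kc, "unix", 0, "page_retire");
 *		if (ksp == NULL || kstat_read(kc, ksp, NULL) == -1)
 *			return (1);
 *		kn = kstat_data_lookup(ksp, "pages_retired");
 *		if (kn != NULL)
 *			(void) printf("pages_retired = %llu\n",
 *			    (u_longlong_t)kn->value.ui64);
 *		(void) kstat_close(kc);
 *		return (0);
 *	}
 */
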
/*
 * page retire kstats to list all retired pages
 */
static int pr_list_kstat_update(kstat_t *ksp, int rw);
static int pr_list_kstat_snapshot(kstat_t *ksp, void *buf, int rw);
kmutex_t pr_list_kstat_mutex;

/*
 * Limit the number of multiple CE page retires.
 * The default is 0.1% of physmem, or 1 in 1000 pages. This is set in
 * basis points, where 100 basis points equals one percent.
 */
#define	MCE_BPT	10
uint64_t	max_pages_retired_bps = MCE_BPT;
#define	PAGE_RETIRE_LIMIT	((physmem * max_pages_retired_bps) / 10000)

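/*
 * Editorial example of the basis-point arithmetic above: on a hypothetical
 * machine with 8 KB pages and 16 GB of memory, physmem is roughly 2,097,152
 * pages, so with the default of 10 basis points PAGE_RETIRE_LIMIT works out
 * to (2097152 * 10) / 10000, or about 2097 pages (0.1% of memory).
 */
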
/*
 * Control over the verbosity of page retirement.
 *
 * When set to zero (the default), no messages will be printed.
 * When set to one, summary messages will be printed.
 * When set greater than one, all messages will be printed.
 *
 * A value of one is intended as a platform tunable for processors where
 * FMA's DE does not run (e.g., spitfire); values greater than one are
 * intended for debugging only.
 */
int page_retire_messages = 0;

/*
 * Control whether or not we return scrubbed UE pages to service.
 * By default we do not since FMA wants to run its diagnostics first
 * and then ask us to unretire the page if it passes. Non-FMA platforms
 * may set this to zero so we will only retire recidivist pages. It should
 * not be changed by the user.
 */
int page_retire_first_ue = 1;

/*
 * Master enable for page retire. This prevents a CE or UE early in boot
 * from trying to retire a page before page_retire_init() has finished
 * setting things up. This is internal only and is not a tunable!
 */
static int pr_enable = 0;

extern struct vnode kvp;

#ifdef	DEBUG
struct page_retire_debug {
	int prd_dup1;
	int prd_dup2;
	int prd_qdup;
	int prd_noaction;
	int prd_queued;
	int prd_notqueued;
	int prd_dequeue;
	int prd_top;
	int prd_locked;
	int prd_reloc;
	int prd_relocfail;
	int prd_mod;
	int prd_mod_late;
	int prd_kern;
	int prd_free;
	int prd_noreclaim;
	int prd_hashout;
	int prd_fma;
	int prd_uescrubbed;
	int prd_uenotscrubbed;
	int prd_mce;
	int prd_prlocked;
	int prd_prnotlocked;
	int prd_prretired;
	int prd_ulocked;
	int prd_unotretired;
	int prd_udestroy;
	int prd_uhashout;
	int prd_uunretired;
	int prd_unotlocked;
	int prd_checkhit;
	int prd_checkmiss_pend;
	int prd_checkmiss_noerr;
	int prd_tctop;
	int prd_tclocked;
	int prd_hunt;
	int prd_dohunt;
	int prd_earlyhunt;
	int prd_latehunt;
	int prd_nofreedemote;
	int prd_nodemote;
	int prd_demoted;
} pr_debug;

#define	PR_DEBUG(foo)	((pr_debug.foo)++)

/*
 * A type histogram. We record the incidence of the various toxic
 * flag combinations along with the interesting page attributes. The
 * goal is to get as many combinations as we can while driving all
 * pr_debug values nonzero (indicating we've exercised all possible
 * code paths across all possible page types). Not all combinations
 * will make sense -- e.g. PRT_MOD|PRT_KERNEL.
 *
 * pr_type offset bit encoding (when examining with a debugger):
 *
 *	PRT_NAMED  - 0x4
 *	PRT_KERNEL - 0x8
 *	PRT_FREE   - 0x10
 *	PRT_MOD    - 0x20
 *	PRT_FMA    - 0x0
 *	PRT_MCE    - 0x40
 *	PRT_UE     - 0x80
 */

#define	PRT_NAMED	0x01
#define	PRT_KERNEL	0x02
#define	PRT_FREE	0x04
#define	PRT_MOD		0x08
#define	PRT_FMA		0x00	/* yes, this is not a mistake */
#define	PRT_MCE		0x10
#define	PRT_UE		0x20
#define	PRT_ALL		0x3F

int	pr_types[PRT_ALL+1];

#define	PR_TYPES(pp)	{			\
	int whichtype = 0;			\
	if (pp->p_vnode)			\
		whichtype |= PRT_NAMED;		\
	if (PP_ISKAS(pp))			\
		whichtype |= PRT_KERNEL;	\
	if (PP_ISFREE(pp))			\
		whichtype |= PRT_FREE;		\
	if (hat_ismod(pp))			\
		whichtype |= PRT_MOD;		\
	if (pp->p_toxic & PR_UE)		\
		whichtype |= PRT_UE;		\
	if (pp->p_toxic & PR_MCE)		\
		whichtype |= PRT_MCE;		\
	pr_types[whichtype]++;			\
}

int	recl_calls;
int	recl_mtbf = 3;
int	reloc_calls;
int	reloc_mtbf = 7;
int	pr_calls;
int	pr_mtbf = 15;

#define	MTBF(v, f)	(((++(v)) & (f)) != (f))

#else	/* DEBUG */

#define	PR_DEBUG(foo)	/* nothing */
#define	PR_TYPES(foo)	/* nothing */
#define	MTBF(v, f)	(1)

#endif	/* DEBUG */

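/*
 * Editorial note on the MTBF() macro above: in DEBUG kernels it acts as a
 * crude "mean time between failures" fault injector.  Each call increments
 * its counter, and the expression is false exactly when the low-order bits
 * of the counter are all ones; with pr_mtbf = 15, for example, one call in
 * every sixteen to MTBF(pr_calls, pr_mtbf) evaluates false, and the caller
 * skips the corresponding operation to exercise its failure path.  In
 * non-DEBUG kernels MTBF() is always true and nothing is injected.
 */
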
/*
 * page_retire_done() - completion processing
 *
 * Used by the page_retire code for common completion processing.
 * It keeps track of how many times a given result has happened,
 * and writes out an occasional message.
 *
 * May be called with a NULL pp (PRD_INVALID_PA case).
 */
#define	PRD_INVALID_KEY		-1
#define	PRD_SUCCESS		0
#define	PRD_PENDING		1
#define	PRD_FAILED		2
#define	PRD_DUPLICATE		3
#define	PRD_INVALID_PA		4
#define	PRD_LIMIT		5
#define	PRD_UE_SCRUBBED		6
#define	PRD_UNR_SUCCESS		7
#define	PRD_UNR_CANTLOCK	8
#define	PRD_UNR_NOT		9

typedef struct page_retire_op {
	int	pr_key;		/* one of the PRD_* defines from above */
	int	pr_count;	/* How many times this has happened */
	int	pr_retval;	/* return value */
	int	pr_msglvl;	/* message level - when to print */
	char	*pr_message;	/* Cryptic message for field service */
} page_retire_op_t;

static page_retire_op_t page_retire_ops[] = {
	/* key			count	retval	msglvl	message */
	{PRD_SUCCESS,		0,	0,	1,
	    "Page 0x%08x.%08x removed from service"},
	{PRD_PENDING,		0,	EAGAIN,	2,
	    "Page 0x%08x.%08x will be retired on free"},
	{PRD_FAILED,		0,	EAGAIN,	0, NULL},
	{PRD_DUPLICATE,		0,	EIO,	2,
	    "Page 0x%08x.%08x already retired or pending"},
	{PRD_INVALID_PA,	0,	EINVAL,	2,
	    "PA 0x%08x.%08x is not a relocatable page"},
	{PRD_LIMIT,		0,	0,	1,
	    "Page 0x%08x.%08x not retired due to limit exceeded"},
	{PRD_UE_SCRUBBED,	0,	0,	1,
	    "Previously reported error on page 0x%08x.%08x cleared"},
	{PRD_UNR_SUCCESS,	0,	0,	1,
	    "Page 0x%08x.%08x returned to service"},
	{PRD_UNR_CANTLOCK,	0,	EAGAIN,	2,
	    "Page 0x%08x.%08x could not be unretired"},
	{PRD_UNR_NOT,		0,	EIO,	2,
	    "Page 0x%08x.%08x is not retired"},
	{PRD_INVALID_KEY,	0,	0,	0, NULL} /* MUST BE LAST! */
};

/*
 * Print a message if page_retire_messages is set.
 */
#define	PR_MESSAGE(debuglvl, msglvl, msg, pa)				\
{									\
	uint64_t p = (uint64_t)pa;					\
	if (page_retire_messages >= msglvl && msg != NULL) {		\
		cmn_err(debuglvl, msg,					\
		    (uint32_t)(p >> 32), (uint32_t)p);			\
	}								\
}

/*
 * Note that multiple bits may be set in a single settoxic operation.
 * May be called without the page locked.
 */
void
page_settoxic(page_t *pp, uchar_t bits)
{
	atomic_or_8(&pp->p_toxic, bits);
}

/*
 * Note that multiple bits may be cleared in a single clrtoxic operation.
 * Must be called with the page exclusively locked to prevent races which
 * may attempt to retire a page without any toxic bits set.
 * Note that the PR_CAPTURE bit can be cleared without the exclusive lock
 * being held as there is a separate mutex which protects that bit.
 */
void
page_clrtoxic(page_t *pp, uchar_t bits)
{
	ASSERT((bits & PR_CAPTURE) || PAGE_EXCL(pp));
	atomic_and_8(&pp->p_toxic, ~bits);
}

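/*
 * Illustrative sketch (editorial addition) of the p_toxic protocol that
 * page_settoxic() and page_clrtoxic() implement: error bits may be set with
 * no page lock held, but (PR_CAPTURE aside) they may only be cleared while
 * the page is held SE_EXCL.  The surrounding error-handler context here is
 * hypothetical.
 *
 *	page_settoxic(pp, PR_UE);		(no page lock required to set)
 *	...
 *	if (page_trylock(pp, SE_EXCL)) {
 *		page_clrtoxic(pp, PR_UE);	(SE_EXCL required to clear)
 *		page_unlock(pp);
 *	}
 */
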
/*
 * Prints any page retire messages to the user, and decides what
 * error code is appropriate for the condition reported.
 */
static int
page_retire_done(page_t *pp, int code)
{
	page_retire_op_t *prop;
	uint64_t	pa = 0;
	int		i;

	if (pp != NULL) {
		pa = mmu_ptob((uint64_t)pp->p_pagenum);
	}

	prop = NULL;
	for (i = 0; page_retire_ops[i].pr_key != PRD_INVALID_KEY; i++) {
		if (page_retire_ops[i].pr_key == code) {
			prop = &page_retire_ops[i];
			break;
		}
	}

#ifdef	DEBUG
	if (page_retire_ops[i].pr_key == PRD_INVALID_KEY) {
		cmn_err(CE_PANIC, "page_retire_done: Invalid opcode %d", code);
	}
#endif

	ASSERT(prop->pr_key == code);

	prop->pr_count++;

	PR_MESSAGE(CE_NOTE, prop->pr_msglvl, prop->pr_message, pa);
	if (pp != NULL) {
		page_settoxic(pp, PR_MSG);
	}

	return (prop->pr_retval);
}

/*
 * Act like page_destroy(), but instead of freeing the page, hash it onto
 * the retired_pages vnode, and mark it retired.
 *
 * For fun, we try to scrub the page until it's squeaky clean.
 * availrmem is adjusted here.
 */
static void
page_retire_destroy(page_t *pp)
{
	u_offset_t off = (u_offset_t)((uintptr_t)pp);

	ASSERT(PAGE_EXCL(pp));
	ASSERT(!PP_ISFREE(pp));
	ASSERT(pp->p_szc == 0);
	ASSERT(!hat_page_is_mapped(pp));
	ASSERT(!pp->p_vnode);

	page_clr_all_props(pp);
	pagescrub(pp, 0, MMU_PAGESIZE);

	pp->p_next = NULL;
	pp->p_prev = NULL;
	if (page_hashin(pp, retired_pages, off, NULL) == 0) {
		cmn_err(CE_PANIC, "retired page %p hashin failed", (void *)pp);
	}

	page_settoxic(pp, PR_RETIRED);
	PR_INCR_KSTAT(pr_retired);

	if (pp->p_toxic & PR_FMA) {
		PR_INCR_KSTAT(pr_fma);
	} else if (pp->p_toxic & PR_UE) {
		PR_INCR_KSTAT(pr_ue);
	} else {
		PR_INCR_KSTAT(pr_mce);
	}

	mutex_enter(&freemem_lock);
	availrmem--;
	mutex_exit(&freemem_lock);

	page_unlock(pp);
}

/*
 * Check whether the number of pages which have been retired already exceeds
 * the maximum allowable percentage of memory which may be retired.
 *
 * Returns 1 if the limit has been exceeded.
 */
static int
page_retire_limit(void)
{
	if (PR_KSTAT_RETIRED_NOTUE >= (uint64_t)PAGE_RETIRE_LIMIT) {
		PR_INCR_KSTAT(pr_limit_exceeded);
		return (1);
	}

	return (0);
}

#define	MSG_DM	"Data Mismatch occurred at PA 0x%08x.%08x"		\
	"[ 0x%x != 0x%x ] while attempting to clear previously "	\
	"reported error; page removed from service"

#define	MSG_UE	"Uncorrectable Error occurred at PA 0x%08x.%08x while "	\
	"attempting to clear previously reported error; page removed "	\
	"from service"

/*
 * Attempt to clear a UE from a page.
 * Returns 1 if the error has been successfully cleared.
 */
static int
page_clear_transient_ue(page_t *pp)
{
	caddr_t		kaddr;
	uint8_t		rb, wb;
	uint64_t	pa;
	uint32_t	pa_hi, pa_lo;
	on_trap_data_t	otd;
	int		errors = 0;
	int		i;

	ASSERT(PAGE_EXCL(pp));
	ASSERT(PP_PR_REQ(pp));
	ASSERT(pp->p_szc == 0);
	ASSERT(!hat_page_is_mapped(pp));

	/*
	 * Clear the page and attempt to clear the UE.  If we trap
	 * on the next access to the page, we know the UE has recurred.
	 */
	pagescrub(pp, 0, PAGESIZE);

	/*
	 * Map the page and write a bunch of bit patterns to compare
	 * what we wrote with what we read back.  This isn't a perfect
	 * test but it should be good enough to catch most of the
	 * recurring UEs. If this fails to catch a recurrent UE, we'll
	 * retire the page the next time we see a UE on the page.
	 */
	kaddr = ppmapin(pp, PROT_READ|PROT_WRITE, (caddr_t)-1);

	pa = ptob((uint64_t)page_pptonum(pp));
	pa_hi = (uint32_t)(pa >> 32);
	pa_lo = (uint32_t)pa;

	/*
	 * Fill the page with each (0x00 - 0xFF] bit pattern, flushing
	 * the cache in between reading and writing.  We do this under
	 * on_trap() protection to avoid recursion.
	 */
	if (on_trap(&otd, OT_DATA_EC)) {
		PR_MESSAGE(CE_WARN, 1, MSG_UE, pa);
		errors = 1;
	} else {
		for (wb = 0xff; wb > 0; wb--) {
			for (i = 0; i < PAGESIZE; i++) {
				kaddr[i] = wb;
			}

			sync_data_memory(kaddr, PAGESIZE);

			for (i = 0; i < PAGESIZE; i++) {
				rb = kaddr[i];
				if (rb != wb) {
					/*
					 * We had a mismatch without a trap.
					 * Uh-oh. Something is really wrong
					 * with this system.
					 */
					if (page_retire_messages) {
						cmn_err(CE_WARN, MSG_DM,
						    pa_hi, pa_lo, rb, wb);
					}
					errors = 1;
					goto out;	/* double break */
				}
			}
		}
	}
out:
	no_trap();
	ppmapout(kaddr);

	return (errors ? 0 : 1);
}

/*
 * Try to clear a page_t with a single UE. If the UE was transient, it is
 * returned to service, and we return 1.  Otherwise we return 0 meaning
 * that further processing is required to retire the page.
 */
static int
page_retire_transient_ue(page_t *pp)
{
	ASSERT(PAGE_EXCL(pp));
	ASSERT(!hat_page_is_mapped(pp));

	/*
	 * If this page is a repeat offender, retire him under the
	 * "two strikes and you're out" rule.  The caller is responsible
	 * for scrubbing the page to try to clear the error.
	 */
	if (pp->p_toxic & PR_UE_SCRUBBED) {
		PR_INCR_KSTAT(pr_ue_persistent);
		return (0);
	}

	if (page_clear_transient_ue(pp)) {
		/*
		 * We set the PR_UE_SCRUBBED bit; if we ever see this
		 * page again, we will retire it, no questions asked.
		 */
		page_settoxic(pp, PR_UE_SCRUBBED);

		if (page_retire_first_ue) {
			PR_INCR_KSTAT(pr_ue_cleared_retire);
			return (0);
		} else {
			PR_INCR_KSTAT(pr_ue_cleared_free);

			page_clrtoxic(pp, PR_UE | PR_MCE | PR_MSG);

			/* LINTED: CONSTCOND */
			VN_DISPOSE(pp, B_FREE, 1, kcred);
			return (1);
		}
	}

	PR_INCR_KSTAT(pr_ue_persistent);
	return (0);
}

/*
 * Update the statistics dynamically when our kstat is read.
 */
static int
page_retire_kstat_update(kstat_t *ksp, int rw)
{
	struct page_retire_kstat *pr;

	if (ksp == NULL)
		return (EINVAL);

	switch (rw) {

	case KSTAT_READ:
		pr = (struct page_retire_kstat *)ksp->ks_data;
		ASSERT(pr == &page_retire_kstat);
		pr->pr_limit.value.ui64 = PAGE_RETIRE_LIMIT;
		return (0);

	case KSTAT_WRITE:
		return (EACCES);

	default:
		return (EINVAL);
	}
	/*NOTREACHED*/
}

static int
pr_list_kstat_update(kstat_t *ksp, int rw)
{
	uint_t count;
	page_t *pp;
	kmutex_t *vphm;

	if (rw == KSTAT_WRITE)
		return (EACCES);

	vphm = page_vnode_mutex(retired_pages);
	mutex_enter(vphm);
	/* Needs to be under a lock so that for loop will work right */
	if (retired_pages->v_pages == NULL) {
		mutex_exit(vphm);
		ksp->ks_ndata = 0;
		ksp->ks_data_size = 0;
		return (0);
	}

	count = 1;
	for (pp = retired_pages->v_pages->p_vpnext;
	    pp != retired_pages->v_pages; pp = pp->p_vpnext) {
		count++;
	}
	mutex_exit(vphm);

	ksp->ks_ndata = count;
	ksp->ks_data_size = count * 2 * sizeof (uint64_t);

	return (0);
}

/*
 * all spans will be pagesize and no coalescing will be done with the
 * list produced.
 */
static int
pr_list_kstat_snapshot(kstat_t *ksp, void *buf, int rw)
{
	kmutex_t *vphm;
	page_t *pp;
	struct memunit {
		uint64_t address;
		uint64_t size;
	} *kspmem;

	if (rw == KSTAT_WRITE)
		return (EACCES);

	ksp->ks_snaptime = gethrtime();

	kspmem = (struct memunit *)buf;

	vphm = page_vnode_mutex(retired_pages);
	mutex_enter(vphm);
	pp = retired_pages->v_pages;
	if (((caddr_t)kspmem >= (caddr_t)buf + ksp->ks_data_size) ||
	    (pp == NULL)) {
		mutex_exit(vphm);
		return (0);
	}
	kspmem->address = ptob(pp->p_pagenum);
	kspmem->size = PAGESIZE;
	kspmem++;
	for (pp = pp->p_vpnext; pp != retired_pages->v_pages;
	    pp = pp->p_vpnext, kspmem++) {
		if ((caddr_t)kspmem >= (caddr_t)buf + ksp->ks_data_size)
			break;
		kspmem->address = ptob(pp->p_pagenum);
		kspmem->size = PAGESIZE;
	}
	mutex_exit(vphm);

	return (0);
}

/*
 * page_retire_pend_count -- helper function for page_capture_thread,
 * returns the number of pages pending retirement.
 */
uint64_t
page_retire_pend_count(void)
{
	return (PR_KSTAT_PENDING);
}

void
page_retire_incr_pend_count(void)
{
	PR_INCR_KSTAT(pr_pending);
}

void
page_retire_decr_pend_count(void)
{
	PR_DECR_KSTAT(pr_pending);
}

/*
 * Initialize the page retire mechanism:
 *
 *	- Establish the correctable error retire limit.
 *	- Initialize locks.
 *	- Build the retired_pages vnode.
 *	- Set up the kstats.
 *	- Fire off the background thread.
 *	- Tell page_retire() it's OK to start retiring pages.
 */
void
page_retire_init(void)
{
	const fs_operation_def_t retired_vnodeops_template[] = {NULL, NULL};
	struct vnodeops *vops;
	kstat_t *ksp;

	const uint_t page_retire_ndata =
	    sizeof (page_retire_kstat) / sizeof (kstat_named_t);

	ASSERT(page_retire_ksp == NULL);

	if (max_pages_retired_bps <= 0) {
		max_pages_retired_bps = MCE_BPT;
	}

	mutex_init(&pr_q_mutex, NULL, MUTEX_DEFAULT, NULL);

	retired_pages = vn_alloc(KM_SLEEP);
	if (vn_make_ops("retired_pages", retired_vnodeops_template, &vops)) {
		cmn_err(CE_PANIC,
		    "page_retire_init: can't make retired vnodeops");
	}
	vn_setops(retired_pages, vops);

	if ((page_retire_ksp = kstat_create("unix", 0, "page_retire",
	    "misc", KSTAT_TYPE_NAMED, page_retire_ndata,
	    KSTAT_FLAG_VIRTUAL)) == NULL) {
		cmn_err(CE_WARN, "kstat_create for page_retire failed");
	} else {
		page_retire_ksp->ks_data = (void *)&page_retire_kstat;
		page_retire_ksp->ks_update = page_retire_kstat_update;
		kstat_install(page_retire_ksp);
	}

	mutex_init(&pr_list_kstat_mutex, NULL, MUTEX_DEFAULT, NULL);
	ksp = kstat_create("unix", 0, "page_retire_list", "misc",
	    KSTAT_TYPE_RAW, 0, KSTAT_FLAG_VAR_SIZE | KSTAT_FLAG_VIRTUAL);
	if (ksp != NULL) {
		ksp->ks_update = pr_list_kstat_update;
		ksp->ks_snapshot = pr_list_kstat_snapshot;
		ksp->ks_lock = &pr_list_kstat_mutex;
		kstat_install(ksp);
	}

	page_capture_register_callback(PC_RETIRE, -1, page_retire_pp_finish);
	pr_enable = 1;
}

/*
 * page_retire_hunt() callback for the retire thread.
 */
static void
page_retire_thread_cb(page_t *pp)
{
	PR_DEBUG(prd_tctop);
	if (!PP_ISKAS(pp) && page_trylock(pp, SE_EXCL)) {
		PR_DEBUG(prd_tclocked);
		page_unlock(pp);
	}
}

/*
 * page_retire_hunt() callback for mdboot().
 *
 * It is necessary to scrub any failing pages prior to reboot in order to
 * prevent a latent error trap from occurring on the next boot.
 */
void
page_retire_mdboot_cb(page_t *pp)
{
	/*
	 * Don't scrub the kernel, since we might still need it, unless
	 * we have UEs on the page, in which case we have nothing to lose.
	 */
	if (!PP_ISKAS(pp) || PP_TOXIC(pp)) {
		pp->p_selock = -1;	/* pacify ASSERTs */
		PP_CLRFREE(pp);
		pagescrub(pp, 0, PAGESIZE);
		pp->p_selock = 0;
	}
	pp->p_toxic = 0;
}

/*
 * Callback used by page_trycapture() to finish off retiring a page.
 * The page has already been cleaned and we've been given sole access to
 * it.
 * Always returns 0 to indicate that the callback succeeded, as the callback
 * never fails to finish retiring the given page.
 */
/*ARGSUSED*/
static int
page_retire_pp_finish(page_t *pp, void *notused, uint_t flags)
{
	int		toxic;

	ASSERT(PAGE_EXCL(pp));
	ASSERT(pp->p_iolock_state == 0);
	ASSERT(pp->p_szc == 0);

	toxic = pp->p_toxic;

	/*
	 * The problem page is locked, demoted, unmapped, not free,
	 * hashed out, and not COW or mlocked (whew!).
	 *
	 * Now we select our ammunition, take it around back, and shoot it.
	 */
	if (toxic & PR_UE) {
ue_error:
		if (page_retire_transient_ue(pp)) {
			PR_DEBUG(prd_uescrubbed);
			(void) page_retire_done(pp, PRD_UE_SCRUBBED);
		} else {
			PR_DEBUG(prd_uenotscrubbed);
			page_retire_destroy(pp);
			(void) page_retire_done(pp, PRD_SUCCESS);
		}
		return (0);
	} else if (toxic & PR_FMA) {
		PR_DEBUG(prd_fma);
		page_retire_destroy(pp);
		(void) page_retire_done(pp, PRD_SUCCESS);
		return (0);
	} else if (toxic & PR_MCE) {
		PR_DEBUG(prd_mce);
		page_retire_destroy(pp);
		(void) page_retire_done(pp, PRD_SUCCESS);
		return (0);
	}

	/*
	 * When page_retire_first_ue is set to zero and a UE occurs which is
	 * transient, it's possible that we clear some flags set by a second
	 * UE error on the page which occurs while the first is currently being
	 * handled and thus we need to handle the case where none of the above
	 * are set.  In this instance, PR_UE_SCRUBBED should be set and thus
	 * we should execute the UE code above.
	 */
	if (toxic & PR_UE_SCRUBBED) {
		goto ue_error;
	}

	/*
	 * It's impossible to get here.
	 */
	panic("bad toxic flags 0x%x in page_retire_pp_finish\n", toxic);
	return (0);
}

/*
 * page_retire() - the front door to retire a page.
 *
 * Ideally, page_retire() would instantly retire the requested page.
 * Unfortunately, some pages are locked or otherwise tied up and cannot be
 * retired right away.  We use the page capture logic to deal with this
 * situation as it will continuously try to retire the page in the background
 * if the first attempt fails.  Success is determined by looking to see whether
 * the page has been retired after the page_trycapture() attempt.
 *
 * Returns:
 *
 *	- 0 on success,
 *	- EINVAL when the PA is whacko,
 *	- EIO if the page is already retired or already pending retirement, or
 *	- EAGAIN if the page could not be _immediately_ retired but is pending.
 */
int
page_retire(uint64_t pa, uchar_t reason)
{
	page_t	*pp;

	ASSERT(reason & PR_REASONS);		/* there must be a reason */
	ASSERT(!(reason & ~PR_REASONS));	/* but no other bits */

	pp = page_numtopp_nolock(mmu_btop(pa));
	if (pp == NULL) {
		PR_MESSAGE(CE_WARN, 1, "Cannot schedule clearing of error on"
		    " page 0x%08x.%08x; page is not relocatable memory", pa);
		return (page_retire_done(pp, PRD_INVALID_PA));
	}
	if (PP_RETIRED(pp)) {
		PR_DEBUG(prd_dup1);
		return (page_retire_done(pp, PRD_DUPLICATE));
	}

	if ((reason & PR_UE) && !PP_TOXIC(pp)) {
		PR_MESSAGE(CE_NOTE, 1, "Scheduling clearing of error on"
		    " page 0x%08x.%08x", pa);
	} else if (PP_PR_REQ(pp)) {
		PR_DEBUG(prd_dup2);
		return (page_retire_done(pp, PRD_DUPLICATE));
	} else {
		PR_MESSAGE(CE_NOTE, 1, "Scheduling removal of"
		    " page 0x%08x.%08x", pa);
	}

	/* Avoid setting toxic bits in the first place */
	if ((reason & (PR_FMA | PR_MCE)) && !(reason & PR_UE) &&
	    page_retire_limit()) {
		return (page_retire_done(pp, PRD_LIMIT));
	}

	if (MTBF(pr_calls, pr_mtbf)) {
		page_settoxic(pp, reason);
		if (page_trycapture(pp, 0, CAPTURE_RETIRE, NULL) == 0) {
			PR_DEBUG(prd_prlocked);
		} else {
			PR_DEBUG(prd_prnotlocked);
		}
	} else {
		PR_DEBUG(prd_prnotlocked);
	}

	if (PP_RETIRED(pp)) {
		PR_DEBUG(prd_prretired);
		return (0);
	} else {
		cv_signal(&pc_cv);
		PR_INCR_KSTAT(pr_failed);

		if (pp->p_toxic & PR_MSG) {
			return (page_retire_done(pp, PRD_FAILED));
		} else {
			return (page_retire_done(pp, PRD_PENDING));
		}
	}
}

/*
 * Take a retired page off the retired-pages vnode and clear the toxic flags.
 * Exactly what is done with the page depends on the flags argument,
 * described below.
 *
 * Any unretire messages are printed from this routine.
 *
 * Returns 0 if page pp was unretired; else an error code.
 *
 * If flags is:
 *	PR_UNR_FREE - lock the page, clear the toxic flags and free it
 *	    to the freelist.
 *	PR_UNR_TEMP - lock the page, unretire it, leave the toxic
 *	    bits set as is and return it to the caller.
 *	PR_UNR_CLEAN - page is SE_EXCL locked, unretire it, clear the
 *	    toxic flags and return it to the caller as is.
 */
int
page_unretire_pp(page_t *pp, int flags)
{
	/*
	 * To be retired, a page has to be hashed onto the retired_pages vnode
	 * and have PR_RETIRED set in p_toxic.
	 */
	if (flags == PR_UNR_CLEAN ||
	    page_try_reclaim_lock(pp, SE_EXCL, SE_RETIRED)) {
		ASSERT(PAGE_EXCL(pp));
		PR_DEBUG(prd_ulocked);
		if (!PP_RETIRED(pp)) {
			PR_DEBUG(prd_unotretired);
			page_unlock(pp);
			return (page_retire_done(pp, PRD_UNR_NOT));
		}

		PR_MESSAGE(CE_NOTE, 1, "unretiring retired"
		    " page 0x%08x.%08x", mmu_ptob((uint64_t)pp->p_pagenum));
		if (pp->p_toxic & PR_FMA) {
			PR_DECR_KSTAT(pr_fma);
		} else if (pp->p_toxic & PR_UE) {
			PR_DECR_KSTAT(pr_ue);
		} else {
			PR_DECR_KSTAT(pr_mce);
		}

		if (flags == PR_UNR_TEMP)
			page_clrtoxic(pp, PR_RETIRED);
		else
			page_clrtoxic(pp, PR_TOXICFLAGS);

		if (flags == PR_UNR_FREE) {
			PR_DEBUG(prd_udestroy);
			page_destroy(pp, 0);
		} else {
			PR_DEBUG(prd_uhashout);
			page_hashout(pp, NULL);
		}

		mutex_enter(&freemem_lock);
		availrmem++;
		mutex_exit(&freemem_lock);

		PR_DEBUG(prd_uunretired);
		PR_DECR_KSTAT(pr_retired);
		PR_INCR_KSTAT(pr_unretired);
		return (page_retire_done(pp, PRD_UNR_SUCCESS));
	}
	PR_DEBUG(prd_unotlocked);
	return (page_retire_done(pp, PRD_UNR_CANTLOCK));
}

/*
 * Return a page to service by moving it from the retired_pages vnode
 * onto the freelist.
 *
 * Called from mmioctl_page_retire() on behalf of the FMA DE.
 *
 * Returns:
 *
 *	- 0 if the page is unretired,
 *	- EAGAIN if the pp can not be locked,
 *	- EINVAL if the PA is whacko, and
 *	- EIO if the pp is not retired.
 */
int
page_unretire(uint64_t pa)
{
	page_t	*pp;

	pp = page_numtopp_nolock(mmu_btop(pa));
	if (pp == NULL) {
		return (page_retire_done(pp, PRD_INVALID_PA));
	}

	return (page_unretire_pp(pp, PR_UNR_FREE));
}

/*
 * Test a page to see if it is retired. If errors is non-NULL, the toxic
 * bits of the page are returned. Returns 0 on success, error code on failure.
 */
int
page_retire_check_pp(page_t *pp, uint64_t *errors)
{
	int rc;

	if (PP_RETIRED(pp)) {
		PR_DEBUG(prd_checkhit);
		rc = 0;
	} else if (PP_PR_REQ(pp)) {
		PR_DEBUG(prd_checkmiss_pend);
		rc = EAGAIN;
	} else {
		PR_DEBUG(prd_checkmiss_noerr);
		rc = EIO;
	}

	/*
	 * We have magically arranged the bit values returned to fmd(1M)
	 * to line up with the FMA, MCE, and UE bits of the page_t.
	 */
	if (errors) {
		uint64_t toxic = (uint64_t)(pp->p_toxic & PR_ERRMASK);
		if (toxic & PR_UE_SCRUBBED) {
			toxic &= ~PR_UE_SCRUBBED;
			toxic |= PR_UE;
		}
		*errors = toxic;
	}

	return (rc);
}

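/*
 * Editorial sketch of how a caller such as the FMA DE (via the
 * mmioctl_page_retire path and fmd) might interpret the result of
 * page_retire_check_pp() above, or of its PA-based wrapper
 * page_retire_check() below.  Illustrative only:
 *
 *	uint64_t errors;
 *
 *	switch (page_retire_check(pa, &errors)) {
 *	case 0:
 *		(page is retired; errors holds the FMA/MCE/UE bits seen)
 *		break;
 *	case EAGAIN:
 *		(retirement has been requested but is still pending)
 *		break;
 *	case EIO:
 *		(page is not retired and has no errors recorded)
 *		break;
 *	case EINVAL:
 *		(pa is not a valid, relocatable page)
 *		break;
 *	}
 */
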
/*
 * Test to see if the page_t for a given PA is retired, and return the
 * hardware errors we have seen on the page if requested.
 *
 * Called from mmioctl_page_retire on behalf of the FMA DE.
 *
 * Returns:
 *
 *	- 0 if the page is retired,
 *	- EIO if the page is not retired and has no errors,
 *	- EAGAIN if the page is not retired but is pending; and
 *	- EINVAL if the PA is whacko.
 */
int
page_retire_check(uint64_t pa, uint64_t *errors)
{
	page_t	*pp;

	if (errors) {
		*errors = 0;
	}

	pp = page_numtopp_nolock(mmu_btop(pa));
	if (pp == NULL) {
		return (page_retire_done(pp, PRD_INVALID_PA));
	}

	return (page_retire_check_pp(pp, errors));
}

/*
 * Page retire self-test. For now, it always returns 0.
 */
int
page_retire_test(void)
{
	page_t *first, *pp, *cpp, *cpp2, *lpp;

	/*
	 * Tests the corner case where a large page can't be retired
	 * because one of the constituent pages is locked. We mark
	 * one page to be retired and try to retire it, and mark the
	 * other page to be retired but don't try to retire it, so
	 * that page_unlock() in the failure path will recurse and try
	 * to retire THAT page. This is the worst possible situation
	 * we can get ourselves into.
	 */
	memsegs_lock(0);
	pp = first = page_first();
	do {
		if (pp->p_szc && PP_PAGEROOT(pp) == pp) {
			cpp = pp + 1;
			lpp = PP_ISFREE(pp)? pp : pp + 2;
			cpp2 = pp + 3;
			if (!page_trylock(lpp, pp == lpp? SE_EXCL : SE_SHARED))
				continue;
			if (!page_trylock(cpp, SE_EXCL)) {
				page_unlock(lpp);
				continue;
			}

			/* fails */
			(void) page_retire(ptob(cpp->p_pagenum), PR_FMA);

			page_unlock(lpp);
			page_unlock(cpp);
			(void) page_retire(ptob(cpp->p_pagenum), PR_FMA);
			(void) page_retire(ptob(cpp2->p_pagenum), PR_FMA);
		}
	} while ((pp = page_next(pp)) != first);
	memsegs_unlock(0);

	return (0);
}