memblock: add MEMBLOCK_RSRV_KERN flag

Patch series "kexec: introduce Kexec HandOver (KHO)", v8.

Kexec today considers itself purely a boot loader: When we enter the new
kernel, any state the previous kernel left behind is irrelevant and the
new kernel reinitializes the system.

However, there are use cases where this mode of operation is not what we
actually want. In virtualization hosts for example, we want to use kexec
to update the host kernel while virtual machine memory stays untouched.
When we add device assignment to the mix, we also need to ensure that
IOMMU and VFIO states are untouched. If we add PCIe peer to peer DMA, we
need to do the same for the PCI subsystem. If we want to kexec while an
SEV-SNP enabled virtual machine is running, we need to preserve the VM
context pages and physical memory. See "pkernfs: Persisting guest memory
and kernel/device state safely across kexec" Linux Plumbers Conference
2023 presentation for details:

  https://lpc.events/event/17/contributions/1485/

To start us on the journey to support all the use cases above, this patch
implements basic infrastructure to allow hand over of kernel state across
kexec (Kexec HandOver, aka KHO). As a really simple example target, we
use memblock's reserve_mem.

With this patchset applied, memory that was reserved using "reserve_mem"
command line options remains intact after kexec and it is guaranteed to
reside at the same physical address.

== Alternatives ==

There are alternative approaches to (parts of) the problems above:

  * Memory Pools [1] - preallocated persistent memory region + allocator
  * PRMEM [2] - resizable persistent memory regions with fixed metadata
    pointer on the kernel command line + allocator
  * Pkernfs [3] - preallocated file system for in-kernel data with fixed
    address location on the kernel command line
  * PKRAM [4] - handover of user space pages using a fixed metadata page
    specified via command line

All of the approaches above fundamentally have the same problem: They
require the administrator to explicitly carve out a physical memory
location because they have no mechanism outside of the kernel command line
to pass data (including memory reservations) between kexec'ing kernels.

KHO provides that base foundation. We will determine later whether we
still need any of the approaches above for fast bulk memory handover of,
for example, IOMMU page tables. But IMHO they would all be users of KHO,
with KHO providing the foundational primitive to pass metadata and bulk
memory reservations as well as provide easy versioning for data.

== Overview ==

We introduce a metadata file that the kernels pass between each other.
How they pass it is architecture specific. The file's format is a
Flattened Device Tree (fdt) which has a generator and parser already
included in Linux. KHO is enabled in the kernel command line by `kho=on`.
When the root user enables KHO through
/sys/kernel/debug/kho/out/finalize, the kernel invokes callbacks to every
KHO user to register preserved memory regions, which contain drivers'
states.

When the actual kexec happens, the fdt is part of the image set that we
boot into. In addition, we keep "scratch regions" available for kexec:
physically contiguous memory regions that are guaranteed to not have any
memory that KHO would preserve. The new kernel bootstraps itself using
the scratch regions and sets all handed over memory as in use. When
drivers that support KHO initialize, they introspect the fdt, restore
preserved memory regions, and retrieve their states stored in the
preserved memory.

== Limitations ==

Currently KHO is only implemented for file based kexec. The kernel
interfaces in the patch set are already in place to support user space
kexec as well, but it is not yet implemented inside kexec tools.

== How to Use ==

To use the code, please boot the kernel with the "kho=on" command line
parameter. KHO will automatically create scratch regions. If you want to
set the scratch size explicitly you can use the "kho_scratch=" command
line parameter. For instance, "kho_scratch=16M,512M,256M" will reserve a
16 MiB low memory scratch area, a 512 MiB global scratch region, and
256 MiB per NUMA node scratch regions on boot.

Make sure to have a reserved memory range requested with the reserve_mem
command line option, for example, "reserve_mem=64m:4k:n1".

Then before you invoke file based "kexec -l", finalize the KHO FDT:

  # echo 1 > /sys/kernel/debug/kho/out/finalize

You can preview the generated FDT using `dtc`,

  # dtc /sys/kernel/debug/kho/out/fdt
  # dtc /sys/kernel/debug/kho/out/sub_fdts/memblock

`dtc` is available on Ubuntu via `sudo apt-get install device-tree-compiler`.

Now kexec into the new kernel,

  # kexec -l Image --initrd=initrd -s
  # kexec -e

(The order of KHO finalization and "kexec -l" does not matter.)

The new kernel will boot up and contain the previous kernel's reserve_mem
contents at the same physical address as the first kernel.

You can also review the FDT passed from the old kernel,

  # dtc /sys/kernel/debug/kho/in/fdt
  # dtc /sys/kernel/debug/kho/in/sub_fdts/memblock

This patch (of 17):

Add the MEMBLOCK_RSRV_KERN flag to denote areas that were reserved for
kernel use either directly with memblock_reserve_kern() or via memblock
allocations.

Link: https://lore.kernel.org/lkml/20250424083258.2228122-1-changyuanl@google.com/
Link: https://lore.kernel.org/lkml/aAeaJ2iqkrv_ffhT@kernel.org/
Link: https://lore.kernel.org/lkml/35c58191-f774-40cf-8d66-d1e2aaf11a62@intel.com/
Link: https://lore.kernel.org/lkml/20250424093302.3894961-1-arnd@kernel.org/
Link: https://lkml.kernel.org/r/20250509074635.3187114-1-changyuanl@google.com
Link: https://lkml.kernel.org/r/20250509074635.3187114-2-changyuanl@google.com
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Co-developed-by: Changyuan Lyu <changyuanl@google.com>
Signed-off-by: Changyuan Lyu <changyuanl@google.com>
Cc: Alexander Graf <graf@amazon.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Anthony Yznaga <anthony.yznaga@oracle.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Ashish Kalra <ashish.kalra@amd.com>
Cc: Ben Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Gowans <jgowans@amazon.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Krzysztof Kozlowski <krzk@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Pratyush Yadav <ptyadav@amazon.de>
Cc: Rob Herring <robh@kernel.org>
Cc: Saravana Kannan <saravanak@google.com>
Cc: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Thomas Lendacky <thomas.lendacky@amd.com>
Cc: Will Deacon <will@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
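Note on the "kho_scratch=16M,512M,256M" example in the cover letter above:
the three comma-separated values map, in order, to the low memory, global,
and per-NUMA-node scratch sizes. The following is a minimal user-space
sketch of that mapping, not the kernel's parser. The names
`struct scratch_sizes`, `parse_size()` and `parse_kho_scratch()` are
hypothetical, and only the plain three-size form with K/M/G suffixes is
handled.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical container for the three sizes; not a kernel structure. */
struct scratch_sizes {
	uint64_t lowmem;   /* low memory scratch area */
	uint64_t global;   /* global scratch region */
	uint64_t per_node; /* scratch region per NUMA node */
};

/*
 * Parse "<number>[K|M|G]" at *s and advance *s past it.
 * Returns the size in bytes; *s is left unchanged if no digits were found.
 */
static uint64_t parse_size(const char **s)
{
	char *end;
	uint64_t val = strtoull(*s, &end, 0);

	if (end == *s)
		return 0; /* nothing parsed, caller detects this */

	switch (*end) {
	case 'G': case 'g':
		val <<= 10;
		/* fall through */
	case 'M': case 'm':
		val <<= 10;
		/* fall through */
	case 'K': case 'k':
		val <<= 10;
		end++;
		break;
	default:
		break;
	}
	*s = end;
	return val;
}

/* Returns 0 on success, -1 unless arg is exactly three comma-separated sizes. */
int parse_kho_scratch(const char *arg, struct scratch_sizes *out)
{
	uint64_t v[3];
	int i;

	for (i = 0; i < 3; i++) {
		const char *start = arg;

		v[i] = parse_size(&arg);
		if (arg == start)
			return -1; /* missing number */
		if (i < 2) {
			if (*arg != ',')
				return -1; /* missing separator */
			arg++;
		}
	}
	if (*arg != '\0')
		return -1; /* trailing garbage */

	out->lowmem = v[0];
	out->global = v[1];
	out->per_node = v[2];
	return 0;
}
```

With this sketch, "16M,512M,256M" yields 16 MiB, 512 MiB and 256 MiB for
the three fields, matching the description in the cover letter.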
memblock tests: add test for memblock_set_node

Add a test to check memblock_set_node() behavior.

Create a corner case in which the memblock.reserved array is doubled
during memblock_set_node(), and finally make sure all regions in
memblock.reserved have a valid node id.

Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
CC: Mike Rapoport <rppt@kernel.org>
CC: Yajun Deng <yajun.deng@linux.dev>
Link: https://lore.kernel.org/r/20250318071948.23854-4-richard.weiyang@gmail.com
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
memblock tests: Fix mutex related build error

Fix mutex and free_reserved_area() related build errors which have
been introduced by commit 74e2498ccf7b ("mm/memblock: Add reserved
memory release function").

Fixes: 74e2498ccf7b ("mm/memblock: Add reserved memory release function")
Reported-by: Wei Yang <richard.weiyang@gmail.com>
Closes: https://lore.kernel.org/all/20250405023018.g2ae52nrz2757b3n@master/
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Link: https://lore.kernel.org/r/174399023133.47537.7375975856054461445.stgit@devnote2
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Merge tag 'memblock-v6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock

Pull memblock updates from Mike Rapoport:

 - new memblock_estimated_nr_free_pages() helper to replace
   totalram_pages() which is less accurate when
   CONFIG_DEFERRED_STRUCT_PAGE_INIT is set

 - fixes for memblock tests

* tag 'memblock-v6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock:
  s390/mm: get estimated free pages by memblock api
  kernel/fork.c: get estimated free pages by memblock api
  mm/memblock: introduce a new helper memblock_estimated_nr_free_pages()
  memblock test: fix implicit declaration of function 'strscpy'
  memblock test: fix implicit declaration of function 'isspace'
  memblock test: fix implicit declaration of function 'memparse'
  memblock test: add the definition of __setup()
  memblock test: fix implicit declaration of function 'virt_to_phys'
  tools/testing: abstract two init.h into common include directory
  memblock tests: include export.h in linkage.h as kernel does
  memblock tests: include memory_hotplug.h in mmzone.h as kernel does
mm: rework accept memory helpers

Make accept_memory() and range_contains_unaccepted_memory() take 'start'
and 'size' arguments instead of 'start' and 'end'.

Remove accept_page(), replacing it with direct calls to accept_memory().
The accept_page() name is going to be used for a different function.

Link: https://lkml.kernel.org/r/20240809114854.3745464-6-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Suggested-by: David Hildenbrand <david@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
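Note on the 'start'/'end' to 'start'/'size' change described above: a call
site that used to pass an end address now passes the length of the range.
The sketch below illustrates that conversion in user space. All names ending
in `_sketch`, the `last_accept` record, and the 4 KiB page size are
assumptions made for illustration; they are not the kernel's definitions.

```c
#include <stdint.h>

typedef uint64_t phys_addr_t;

#define PAGE_SHIFT_SKETCH 12 /* assumption: 4 KiB pages */
#define PAGE_SIZE_SKETCH  (1ULL << PAGE_SHIFT_SKETCH)

/* Records the last request so the conversion can be inspected. */
struct accepted_range {
	phys_addr_t start;
	uint64_t size;
};
struct accepted_range last_accept;

/* New-style helper: takes a start address and a size in bytes. */
void accept_memory_sketch(phys_addr_t start, uint64_t size)
{
	last_accept.start = start;
	last_accept.size = size;
}

/* An old-style (start, end) caller converts by passing end - start. */
void accept_range_old_style(phys_addr_t start, phys_addr_t end)
{
	accept_memory_sketch(start, end - start);
}

/*
 * A per-page helper can be expressed as a direct call: the start is the
 * physical address of the first page, the size covers 2^order pages.
 */
void accept_pages_sketch(uint64_t pfn, unsigned int order)
{
	accept_memory_sketch(pfn << PAGE_SHIFT_SKETCH,
			     PAGE_SIZE_SKETCH << order);
}
```

Passing a size avoids ambiguity about whether the end address is inclusive
or exclusive at each call site.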
memblock test: fix implicit declaration of function 'isspace'

Commit 1e4c64b71c9b ("mm/memblock: Add "reserve_mem" to reserved named
memory at boot up") introduced the use of isspace().

Let's include <linux/ctype.h> in kernel.h to fix this.

Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Link: https://lore.kernel.org/r/20240806010319.29194-4-richard.weiyang@gmail.com
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
memblock test: fix implicit declaration of function 'memparse'

Commit 1e4c64b71c9b ("mm/memblock: Add "reserve_mem" to reserved named
memory at boot up") introduced the use of memparse(), which is not
defined in the memblock tests.

Add the definition and link it to fix the build.

Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Link: https://lore.kernel.org/r/20240806010319.29194-3-richard.weiyang@gmail.com
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
tools/testing: abstract two init.h into common include directory

Currently two test suites each define their own init.h. This is a little
redundant.

Let's create an init.h in the common include directory and merge the two
into it.

Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
CC: Mike Rapoport <rppt@kernel.org>
CC: Liam R. Howlett <Liam.Howlett@oracle.com>
Link: https://lore.kernel.org/r/20240712035138.24674-3-richard.weiyang@gmail.com
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
memblock tests: include export.h in linkage.h as kernel does

In kernel code, linkage.h includes export.h. Let's sync with the kernel.

This is preparation for moving init.h into the common include directory.

Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
CC: Mike Rapoport <rppt@kernel.org>
Link: https://lore.kernel.org/r/20240712035138.24674-2-richard.weiyang@gmail.com
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
memblock tests: include memory_hotplug.h in mmzone.h as kernel does

In kernel code, memory_hotplug.h is included in mmzone.h instead of in
init.h. Let's sync with the kernel.

This is preparation for moving init.h into the common include directory.

Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
CC: Mike Rapoport <rppt@kernel.org>
Link: https://lore.kernel.org/r/20240712035138.24674-1-richard.weiyang@gmail.com
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
memblock tests: add memblock_overlaps_region_checks

Add a test case for memblock_overlaps_region().

Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Link: https://lore.kernel.org/r/20240507075833.6346-5-richard.weiyang@gmail.com
Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
memblock tests: add memblock_reserve_many_may_conflict_check()

This may trigger the case fixed by commit 48c3b583bbdd ("mm/memblock:
fix overlapping allocation when doubling reserved array").

This is done by adding the 129th reserve region into memblock.memory. If
memblock_double_array() uses this reserve region for the new array, it
fails.

Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Link: https://lore.kernel.org/r/20240507075833.6346-3-richard.weiyang@gmail.com
Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
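Note on the conflict described above: when adding a region forces the
reserved array to grow, the location chosen for the larger array must not
overlap the region that is in the middle of being added. The sketch below
models only that placement rule in user space. `struct range`, `overlaps()`
and `pick_new_array_base()` are hypothetical names and the linear scan is a
simplification; it is not how memblock_double_array() is implemented.

```c
#include <stdbool.h>
#include <stddef.h>

struct range {
	unsigned long base;
	unsigned long size;
};

/* True if the half-open ranges [base, base + size) intersect. */
static bool overlaps(const struct range *a, const struct range *b)
{
	return a->base < b->base + b->size && b->base < a->base + a->size;
}

/*
 * Pick a base address for the new, larger array. Candidates are scanned in
 * steps of `need` bytes. A candidate is usable only if it overlaps neither
 * an already reserved range nor `pending`, the range whose insertion
 * triggered the growth. Returns false if no candidate fits below mem_size.
 */
bool pick_new_array_base(const struct range *reserved, size_t cnt,
			 const struct range *pending,
			 unsigned long mem_size, unsigned long need,
			 unsigned long *base_out)
{
	for (unsigned long base = 0; base + need <= mem_size; base += need) {
		struct range cand = { base, need };
		bool usable = !overlaps(&cand, pending);

		for (size_t i = 0; usable && i < cnt; i++)
			usable = !overlaps(&cand, &reserved[i]);

		if (usable) {
			*base_out = base;
			return true;
		}
	}
	return false;
}
```

Dropping the check against `pending` would let the new array land on the
very range being reserved, which is the kind of overlap the test is meant
to provoke.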
memblock tests: add memblock_reserve_all_locations_check()

Instead of adding the 129th memory block at the last position, let's try
all possible positions.

Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Link: https://lore.kernel.org/r/20240507075833.6346-2-richard.weiyang@gmail.com
Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
mm/memblock: remove empty dummy entry

The dummy entry was introduced in the initial implementation of lmb in
commit 7c8c6b9776fb ("powerpc: Merge lmb.c and make MM initialization
use it.").

As the comment says, the empty dummy entry is there to simplify the code.

	/* Create a dummy zero size LMB which will get coalesced away later.
	 * This simplifies the lmb_add() code below...
	 */

The current code was reimplemented by Tejun in commit 784656f9c680
("memblock: Reimplement memblock_add_region()"), and the empty dummy entry
no longer seems to benefit it.

Let's remove it.

Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
CC: Paul Mackerras <paulus@ozlabs.org>
CC: Tejun Heo <tj@kernel.org>
CC: Mike Rapoport <rppt@kernel.org>
Link: https://lore.kernel.org/r/20240405015821.13411-1-richard.weiyang@gmail.com
Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDER

Commit 23baf831a32c ("mm, treewide: redefine MAX_ORDER sanely") has
changed the definition of MAX_ORDER to be inclusive. This has caused
issues with code that was not yet upstream and depended on the previous
definition.

To draw attention to the altered meaning of the define, rename MAX_ORDER
to MAX_PAGE_ORDER.

Link: https://lkml.kernel.org/r/20231228144704.14033-2-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
memblock tests: fix warning ‘struct seq_file’ declared inside parameter list

Building memblock tests produces the following warning:

cc -I. -I../../include -Wall -O2 -fsanitize=address -fsanitize=undefined -D CONFIG_PHYS_ADDR_T_64BIT -c -o main.o main.c
In file included from tests/common.h:9,
                 from tests/basic_api.h:5,
                 from main.c:2:
./linux/memblock.h:601:50: warning: ‘struct seq_file’ declared inside parameter list will not be visible outside of this definition or declaration
  601 | static inline void memtest_report_meminfo(struct seq_file *m) { }
      |                                                  ^~~~~~~~

Add declaration of 'struct seq_file' to tools/include/linux/seq_file.h
to fix it.

Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
memblock tests: Fix compilation errors.

This patch fixes the following errors.

Commit 61167ad5fecd ("mm: pass nid to reserve_bootmem_region()") passes a
nid parameter to reserve_bootmem_region():

    $ make -C tools/testing/memblock/
    ...
    memblock.c: In function ‘memmap_init_reserved_pages’:
    memblock.c:2111:25: error: too many arguments to function ‘reserve_bootmem_region’
    2111 |                         reserve_bootmem_region(start, end, nid);
         |                         ^~~~~~~~~~~~~~~~~~~~~~
    ../../include/linux/mm.h:32:6: note: declared here
    32 | void reserve_bootmem_region(phys_addr_t start, phys_addr_t end);
       |      ^~~~~~~~~~~~~~~~~~~~~~
    memblock.c:2122:17: error: too many arguments to function ‘reserve_bootmem_region’
    2122 |                 reserve_bootmem_region(start, end, nid);
         |                 ^~~~~~~~~~~~~~~~~~~~~~

Commit dcdfdd40fa82 ("mm: Add support for unaccepted memory") calls
accept_memory() in memblock.c:

    $ make -C tools/testing/memblock/
    ...
    cc -fsanitize=address -fsanitize=undefined main.o memblock.o \
     lib/slab.o mmzone.o slab.o tests/alloc_nid_api.o \
     tests/alloc_helpers_api.o tests/alloc_api.o tests/basic_api.o \
     tests/common.o tests/alloc_exact_nid_api.o -o main
    /usr/bin/ld: memblock.o: in function `memblock_alloc_range_nid':
    memblock.c:(.text+0x7ae4): undefined reference to `accept_memory'

Signed-off-by: Rong Tao <rongtao@cestc.cn>
Fixes: dcdfdd40fa82 ("mm: Add support for unaccepted memory")
Fixes: 61167ad5fecd ("mm: pass nid to reserve_bootmem_region()")
Link: https://lore.kernel.org/r/tencent_6F19BC082167F15DF2A8D8BEFE8EF220F60A@qq.com
Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
Add tests for memblock_alloc_node()

This test verifies that memblock_alloc_node() works as expected, that is,
that it sets the correct NUMA node for the newly allocated region.
memblock_alloc_node() is called directly without using any stub. The core
check compares the requested NUMA node with the `nid` field inside the
memblock_region structure. The two must be equal for the test to succeed.

Signed-off-by: Claudio Migliorelli <claudio.migliorelli@mail.polimi.it>
Link: https://lore.kernel.org/r/ea5e938e-6b74-b188-af59-4b94b18bc0@mail.polimi.it
Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
mm, treewide: redefine MAX_ORDER sanely

MAX_ORDER is currently defined as the number of orders the page allocator
supports: a user can ask the buddy allocator for a page order between 0
and MAX_ORDER-1.

This definition is counter-intuitive and has led to a number of bugs all
over the kernel.

Change the definition of MAX_ORDER to be inclusive: the range of orders a
user can ask from the buddy allocator is 0..MAX_ORDER now.

[kirill@shutemov.name: fix min() warning]
  Link: https://lkml.kernel.org/r/20230315153800.32wib3n5rickolvh@box
[akpm@linux-foundation.org: fix another min_t warning]
[kirill@shutemov.name: fixups per Zi Yan]
  Link: https://lkml.kernel.org/r/20230316232144.b7ic4cif4kjiabws@box.shutemov.name
[akpm@linux-foundation.org: fix underlining in docs]
  Link: https://lore.kernel.org/oe-kbuild-all/202303191025.VRCTk6mP-lkp@intel.com/
Link: https://lkml.kernel.org/r/20230315113133.11326-11-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc]
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
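Note on the exclusive versus inclusive definition described above: the two
conventions cover the same set of orders but need different loop bounds,
which is where off-by-one bugs creep in. The sketch below contrasts them.
The values 11 and 10 and all names in it are illustrative assumptions, not
taken from any particular kernel configuration.

```c
#include <stddef.h>

#define OLD_MAX_ORDER_SKETCH 11 /* exclusive: valid orders are 0..10 */
#define NEW_MAX_ORDER_SKETCH 10 /* inclusive: valid orders are 0..10 */

/* Old convention: iterate while order < MAX_ORDER. */
size_t count_orders_old(void)
{
	size_t n = 0;

	for (int order = 0; order < OLD_MAX_ORDER_SKETCH; order++)
		n++;
	return n;
}

/* New convention: iterate while order <= MAX_ORDER. */
size_t count_orders_new(void)
{
	size_t n = 0;

	for (int order = 0; order <= NEW_MAX_ORDER_SKETCH; order++)
		n++;
	return n;
}

/* Number of base pages in one block of the given order. */
unsigned long pages_in_order(unsigned int order)
{
	return 1UL << order;
}
```

Under the inclusive definition the largest block is simply
pages_in_order(MAX_ORDER), with no "- 1" needed at the call site.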
Revert "mm: Always release pages to the buddy allocator in memblock_free_late()."

This reverts commit 115d9d77bb0f9152c60b6e8646369fa7f6167593.

The pages being freed by memblock_free_late() have already been
initialized, but if they are in the deferred init range,
__free_one_page() might access nearby uninitialized pages when trying to
coalesce buddies. This can, for example, trigger this BUG:

  BUG: unable to handle page fault for address: ffffe964c02580c8
  RIP: 0010:__list_del_entry_valid+0x3f/0x70
   <TASK>
   __free_one_page+0x139/0x410
   __free_pages_ok+0x21d/0x450
   memblock_free_late+0x8c/0xb9
   efi_free_boot_services+0x16b/0x25c
   efi_enter_virtual_mode+0x403/0x446
   start_kernel+0x678/0x714
   secondary_startup_64_no_verify+0xd2/0xdb
   </TASK>

A proper fix will be more involved so revert this change for the time
being.

Fixes: 115d9d77bb0f ("mm: Always release pages to the buddy allocator in memblock_free_late().")
Signed-off-by: Aaron Thompson <dev@aaront.org>
Link: https://lore.kernel.org/r/20230207082151.1303-1-dev@aaront.org
Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
mm: Always release pages to the buddy allocator in memblock_free_late().

If CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, memblock_free_pages()
only releases pages to the buddy allocator if they are not in the
deferred range. This is correct for free pages (as defined by
for_each_free_mem_pfn_range_in_zone()) because free pages in the
deferred range will be initialized and released as part of the deferred
init process. memblock_free_pages() is called by memblock_free_late(),
which is used to free reserved ranges after memblock_free_all() has
run. All pages in reserved ranges have been initialized at that point,
and accordingly, those pages are not touched by the deferred init
process. This means that currently, if the pages that
memblock_free_late() intends to release are in the deferred range, they
will never be released to the buddy allocator. They will forever be
reserved.

In addition, memblock_free_pages() calls kmsan_memblock_free_pages(),
which is also correct for free pages but is not correct for reserved
pages. KMSAN metadata for reserved pages is initialized by
kmsan_init_shadow(), which runs shortly before memblock_free_all().

For both of these reasons, memblock_free_pages() should only be called
for free pages, and memblock_free_late() should call __free_pages_core()
directly instead.

One case where this issue can occur in the wild is EFI boot on
x86_64. The x86 EFI code reserves all EFI boot services memory ranges
via memblock_reserve() and frees them later via memblock_free_late()
(efi_reserve_boot_services() and efi_free_boot_services(),
respectively). If any of those ranges happens to fall within the
deferred init range, the pages will not be released and that memory will
be unavailable.

For example, on an Amazon EC2 t3.micro VM (1 GB) booting via EFI:

v6.2-rc2:
  # grep -E 'Node|spanned|present|managed' /proc/zoneinfo
  Node 0, zone      DMA
          spanned  4095
          present  3999
          managed  3840
  Node 0, zone    DMA32
          spanned  246652
          present  245868
          managed  178867

v6.2-rc2 + patch:
  # grep -E 'Node|spanned|present|managed' /proc/zoneinfo
  Node 0, zone      DMA
          spanned  4095
          present  3999
          managed  3840
  Node 0, zone    DMA32
          spanned  246652
          present  245868
          managed  222816   # +43,949 pages

Fixes: 3a80a7fa7989 ("mm: meminit: initialise a subset of struct pages if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set")
Signed-off-by: Aaron Thompson <dev@aaront.org>
Link: https://lore.kernel.org/r/01010185892de53e-e379acfb-7044-4b24-b30a-e2657c1ba989-000000@us-west-2.amazonses.com
Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
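Note on the "+43,949 pages" figure in the example above: it is the
difference between the two DMA32 "managed" counts. The small sketch below
reproduces that arithmetic and converts it to a memory size. The helper
names are hypothetical, and the 4 KiB page size is an assumption (the usual
base page size on x86_64), not something stated in the commit message.

```c
#include <stdint.h>

#define PAGE_SIZE_KIB 4 /* assumption: 4 KiB base pages */

/* Difference in managed pages between two /proc/zoneinfo readings. */
uint64_t managed_delta_pages(uint64_t before, uint64_t after)
{
	return after - before;
}

/* Convert a page count to KiB under the assumed page size. */
uint64_t pages_to_kib(uint64_t pages)
{
	return pages * PAGE_SIZE_KIB;
}
```

For the values above, managed_delta_pages(178867, 222816) gives 43949
pages, and pages_to_kib(43949) gives 175796 KiB, roughly 171.7 MiB of the
VM's 1 GB that the patch makes available again.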
memblock tests: Fix compilation error.

Commit cf4694be2b2cf ("tools: Add atomic_test_and_set_bit()") changed
tools/arch/x86/include/asm/atomic.h to include <asm/asm.h>, which causes
'make -C tools/testing/memblock' to fail with:

In file included from ../../include/asm/atomic.h:6,
                 from ../../include/linux/atomic.h:5,
                 from ./linux/mmzone.h:5,
                 from ../../include/linux/mm.h:5,
                 from ../../include/linux/pfn.h:5,
                 from ./linux/memory_hotplug.h:6,
                 from ./linux/init.h:7,
                 from ./linux/memblock.h:11,
                 from tests/common.h:8,
                 from tests/basic_api.h:5,
                 from main.c:2:
../../include/asm/../../arch/x86/include/asm/atomic.h:11:10: fatal error: asm/asm.h: No such file or directory
   11 | #include <asm/asm.h>
      |          ^~~~~~~~~~~

Create a symlink to asm/asm.h in the same manner as the existing one to
asm/cmpxchg.h.

Signed-off-by: Aaron Thompson <dev@aaront.org>
Link: https://lore.kernel.org/r/010101857c402765-96e2dbc6-b82b-47e2-a437-4834dbe0b96b-000000@us-west-2.amazonses.com
Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
memblock tests: remove completed TODO item

Remove completed item from TODO list.

Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Rebecca Mckeever <remckee0@gmail.com>
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Link: https://lore.kernel.org/r/f2263abe45613b28f1583fbf04a4bffcf735bcf6.1667802195.git.remckee0@gmail.com
memblock tests: add generic NUMA tests for memblock_alloc_exact_nid_raw

Add tests for memblock_alloc_exact_nid_raw() where the simulated physical
memory is set up with multiple NUMA nodes. Additionally, all but one of
these tests set nid != NUMA_NO_NODE. All tests are run for both top-down
and bottom-up allocation directions.

The tested scenarios are:

Range unrestricted:
- region cannot be allocated:
      + there are no previously reserved regions, but requested node is
        too small
      + the requested node is fully reserved
      + the requested node is partially reserved and does not have
        enough space
      + none of the nodes have enough memory to allocate the region

Range restricted:
- region can be allocated in the specific node requested without dropping
  min_addr:
      + the range fully overlaps with the node, and there are adjacent
        reserved regions
- region cannot be allocated:
      + range partially overlaps with two different nodes, where the
        second node is the requested node
      + range overlaps with multiple nodes along node boundaries, and
        the requested node starts after max_addr
      + nid is set to NUMA_NO_NODE and the total range can fit the
        region, but the range is split between two nodes and everything
        else is reserved

Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Rebecca Mckeever <remckee0@gmail.com>
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Link: https://lore.kernel.org/r/51b14da46e6591428df3aefc5acc7dca9341a541.1667802195.git.remckee0@gmail.com
memblock tests: add bottom-up NUMA tests for memblock_alloc_exact_nid_raw

Add tests for memblock_alloc_exact_nid_raw() where the simulated physical
memory is set up with multiple NUMA nodes. Additionally, all of these
tests set nid != NUMA_NO_NODE. These tests are run with a bottom-up
allocation direction.

The tested scenarios are:

Range unrestricted:
- region can be allocated in the specific node requested:
      + there are no previously reserved regions
      + the requested node is partially reserved but has enough space

Range restricted:
- region can be allocated in the specific node requested after dropping
  min_addr:
      + range partially overlaps with two different nodes, where the
        first node is the requested node
      + range partially overlaps with two different nodes, where the
        requested node ends before min_addr
      + range overlaps with multiple nodes along node boundaries, and
        the requested node ends before min_addr

Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Rebecca Mckeever <remckee0@gmail.com>
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Link: https://lore.kernel.org/r/935f0eed5e06fd44dc67d9f49b277923d7896bd3.1667802195.git.remckee0@gmail.com