.. SPDX-License-Identifier: GPL-2.0

=======================
Linux Init (Early Boot)
=======================

Linux configuration is split into two major steps: Early-Boot and everything else.

During early boot, Linux sets up immutable resources (such as NUMA nodes), while
later operations include things like driver probe and memory hotplug.  Linux may
read EFI and ACPI information throughout this process to configure logical
representations of the devices.

During the Linux early boot stage (kernel functions marked with the
:code:`__init` annotation), the system takes the resources created by EFI/BIOS
(:doc:`ACPI tables <../platform/acpi>`) and turns them into resources that the
kernel can consume.


BIOS, Build and Boot Options
============================

Four BIOS, build, and boot options dictate how Linux manages memory during
early boot.

* EFI_MEMORY_SP

  * BIOS/EFI option that dictates whether memory is SystemRAM or
    Specific Purpose.  Specific Purpose memory is deferred to drivers to
    manage and is not immediately exposed as SystemRAM.

* CONFIG_EFI_SOFT_RESERVE

  * Linux build config option that dictates whether the kernel supports
    Specific Purpose memory.

* CONFIG_MHP_DEFAULT_ONLINE_TYPE

  * Linux build config option that dictates whether and how Specific Purpose
    memory converted to a DAX device should be managed (left as DAX or onlined
    as SystemRAM in ZONE_NORMAL or ZONE_MOVABLE).

* nosoftreserve

  * Linux kernel boot option that disables Soft Reserve support at boot.  The
    effect is similar to building with CONFIG_EFI_SOFT_RESERVE=n.

Memory Map Creation
===================

While the kernel parses the EFI memory map, if :code:`Specific Purpose` memory
is supported and detected, it will set this region aside as
:code:`SOFT_RESERVED`.

If :code:`EFI_MEMORY_SP=0`, :code:`CONFIG_EFI_SOFT_RESERVE=n`, or
:code:`nosoftreserve` is passed on the kernel command line, Linux will default
a CXL device memory region to SystemRAM.  This will expose the memory to the
kernel page allocator in :code:`ZONE_NORMAL`, making it available for most
allocations (including :code:`struct page` and page tables).
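The three conditions above can be read as a single predicate.  The following is
a minimal userspace sketch, not kernel code: the enum and the function
:code:`classify_region` are hypothetical names introduced only for
illustration, and the three boolean parameters stand in for the firmware
attribute, the build option, and the boot option.

```c
#include <stdbool.h>

/* Hypothetical labels for the two outcomes described in the text. */
enum region_kind {
    REGION_SYSTEM_RAM,    /* exposed to the page allocator during early boot */
    REGION_SOFT_RESERVED  /* set aside so a driver can manage it later */
};

/*
 * efi_memory_sp:           firmware marked the region Specific Purpose
 * config_efi_soft_reserve: kernel built with CONFIG_EFI_SOFT_RESERVE=y
 * nosoftreserve:           "nosoftreserve" is on the kernel command line
 *
 * The region is soft reserved only when all three inputs agree; any single
 * opt-out makes it default to SystemRAM.
 */
static enum region_kind classify_region(bool efi_memory_sp,
                                        bool config_efi_soft_reserve,
                                        bool nosoftreserve)
{
    if (efi_memory_sp && config_efi_soft_reserve && !nosoftreserve)
        return REGION_SOFT_RESERVED;
    return REGION_SYSTEM_RAM;
}
```

For example, :code:`classify_region(true, true, true)` yields
:code:`REGION_SYSTEM_RAM`, since the boot option overrides both the firmware
attribute and the build option.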

If :code:`Specific Purpose` is set and supported, :code:`CONFIG_MHP_DEFAULT_ONLINE_TYPE_*`
dictates whether the memory is onlined by default (:code:`_OFFLINE` or
:code:`_ONLINE_*`), and if online, which zone to online this memory to by default
(:code:`_NORMAL` or :code:`_MOVABLE`).
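The same default can also be expressed as a policy string.  The sketch below is
a userspace model, assuming the four strings accepted by the
:code:`memhp_default_state=` boot parameter (:code:`offline`, :code:`online`,
:code:`online_kernel`, :code:`online_movable`).  The enum and the function
:code:`outcome_for_policy` are hypothetical names, and mapping
:code:`online_kernel` to a kernel zone such as :code:`ZONE_NORMAL` is a
simplification.

```c
#include <string.h>

/* Hypothetical labels for what happens to a newly added memory block. */
enum online_outcome {
    OUTCOME_OFFLINE,       /* block stays offline until userspace onlines it */
    OUTCOME_AUTO,          /* block is onlined, kernel picks the zone */
    OUTCOME_KERNEL_ZONE,   /* block is onlined to a kernel zone (e.g. ZONE_NORMAL) */
    OUTCOME_ZONE_MOVABLE,  /* block is onlined to ZONE_MOVABLE */
    OUTCOME_UNKNOWN        /* string is not one of the four policies */
};

/* Translate a policy string into the outcome it selects. */
static enum online_outcome outcome_for_policy(const char *policy)
{
    if (strcmp(policy, "offline") == 0)
        return OUTCOME_OFFLINE;
    if (strcmp(policy, "online") == 0)
        return OUTCOME_AUTO;
    if (strcmp(policy, "online_kernel") == 0)
        return OUTCOME_KERNEL_ZONE;
    if (strcmp(policy, "online_movable") == 0)
        return OUTCOME_ZONE_MOVABLE;
    return OUTCOME_UNKNOWN;
}
```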

If placed in :code:`ZONE_MOVABLE`, the memory will not be available for most
kernel allocations (such as :code:`struct page` or page tables).  This may
significantly impact performance depending on the memory capacity of the system.


NUMA Node Reservation
=====================

Linux refers to the proximity domains (:code:`PXM`) defined in the :doc:`SRAT
<../platform/acpi/srat>` to create NUMA nodes in :code:`acpi_numa_init`.
Typically, there is a 1:1 relation between :code:`PXM` and NUMA node IDs.

The SRAT is the only ACPI-defined way of defining Proximity Domains. Linux
chooses to, at most, map those 1:1 with NUMA nodes.
:doc:`CEDT <../platform/acpi/cedt>` adds a description of SPA ranges which
Linux may map to one or more NUMA nodes.

If there are CXL ranges in the CFMWS but not in SRAT, then a fake :code:`PXM`
is created (as of v6.15). In the future, Linux may reject CFMWS not described
by SRAT due to the ambiguity of proximity domain association.

It is important to note that NUMA node creation cannot be done at runtime. All
possible NUMA nodes are identified at :code:`__init` time, more specifically
during :code:`mm_init`. The CEDT and SRAT must contain sufficient :code:`PXM`
data for Linux to identify NUMA nodes and their associated memory regions.
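The node assignment described above can be modelled in a few lines.  This is a
simplified userspace sketch, not the kernel implementation: the types and the
functions :code:`map_pxm_to_node` and :code:`node_for_cfmws` are hypothetical,
node IDs are handed out in first-seen order, and the caller is assumed to start
:code:`fake_pxm` one above the highest :code:`PXM` found in the SRAT.

```c
#include <stddef.h>
#include <stdint.h>

#define MODEL_MAX_NODES 8
#define MODEL_NO_NODE   (-1)

/* One SRAT memory entry: [start, end) belongs to proximity domain pxm. */
struct srat_range {
    uint64_t start;
    uint64_t end;
    int pxm;
};

/* Node table filled in first-seen order; the array index is the node ID. */
struct node_table {
    int pxm[MODEL_MAX_NODES];
    int count;
};

/* Return the node for pxm, creating it if needed (1:1 PXM to node). */
static int map_pxm_to_node(struct node_table *t, int pxm)
{
    for (int i = 0; i < t->count; i++)
        if (t->pxm[i] == pxm)
            return i;
    if (t->count == MODEL_MAX_NODES)
        return MODEL_NO_NODE;  /* no new nodes once the table is full */
    t->pxm[t->count] = pxm;
    return t->count++;
}

/*
 * Pick a node for a CFMWS window that begins at `start`.  If an SRAT range
 * covers that address, reuse its node.  Otherwise consume the next fake PXM
 * and create a node for it.
 */
static int node_for_cfmws(struct node_table *t, const struct srat_range *srat,
                          size_t nr_srat, uint64_t start, int *fake_pxm)
{
    for (size_t i = 0; i < nr_srat; i++)
        if (start >= srat[i].start && start < srat[i].end)
            return map_pxm_to_node(t, srat[i].pxm);

    int node = map_pxm_to_node(t, *fake_pxm);
    if (node != MODEL_NO_NODE)
        (*fake_pxm)++;
    return node;
}
```

In this model, with SRAT ranges for :code:`PXM` 0 and 1 and :code:`fake_pxm`
starting at 2, a window at an address outside both ranges lands on a new node
2, while a window inside either range reuses that range's node.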

The relevant code exists in: :code:`linux/drivers/acpi/numa/srat.c`.

See :doc:`Example Platform Configurations <../platform/example-configs>`
for more info.

Memory Tiers Creation
=====================
Memory tiers are a collection of NUMA nodes grouped by performance characteristics.
During :code:`__init`, Linux initializes the system with a default memory tier that
contains all nodes marked :code:`N_MEMORY`.

:code:`memory_tier_init` is called at boot for all nodes with memory online by
default. :code:`memory_tier_late_init` is called during late-init for nodes set
up during driver configuration.

Nodes are only marked :code:`N_MEMORY` if they have *online* memory.

Tier membership can be inspected in ::

  /sys/devices/virtual/memory_tiering/memory_tierN/nodelist
  0-1

If nodes that have a clear difference in performance are grouped together,
check the :doc:`HMAT <../platform/acpi/hmat>` and CDAT information for the CXL
nodes. All nodes default to the DRAM tier, unless HMAT/CDAT information is
reported to the memory_tier component via :code:`access_coordinates`.
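When checking tier membership from a tool, the :code:`nodelist` contents shown
above (for example :code:`0-1`) need to be expanded into individual node IDs.
The following is a minimal sketch of such a parser.  The function
:code:`parse_nodelist` is a hypothetical helper, and it assumes
comma-separated decimal IDs or ranges with node IDs below 64.

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * Parse a nodelist string such as "0-1" or "0,2-3" into a bitmask where
 * bit N set means node N is listed.  A trailing newline is accepted.
 * Returns 0 on success, -1 on a malformed list or a node ID above 63.
 */
static int parse_nodelist(const char *s, uint64_t *mask)
{
    *mask = 0;
    while (*s != '\0' && *s != '\n') {
        char *end;
        long lo, hi;

        if (*s < '0' || *s > '9')
            return -1;
        lo = strtol(s, &end, 10);
        hi = lo;
        if (*end == '-') {
            /* A range: the upper bound must also start with a digit. */
            s = end + 1;
            if (*s < '0' || *s > '9')
                return -1;
            hi = strtol(s, &end, 10);
        }
        if (lo > hi || hi > 63)
            return -1;
        for (long n = lo; n <= hi; n++)
            *mask |= UINT64_C(1) << n;

        if (*end == ',') {
            /* A comma must be followed by another entry. */
            end++;
            if (*end < '0' || *end > '9')
                return -1;
        } else if (*end != '\0' && *end != '\n') {
            return -1;
        }
        s = end;
    }
    return 0;
}
```

Comparing the resulting masks across tiers makes it easy to spot a CXL node
that shares a tier with DRAM nodes.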

For more, see :doc:`CXL access coordinates documentation
<../linux/access-coordinates>`.

Contiguous Memory Allocation
============================
The contiguous memory allocator (CMA) enables reservation of contiguous memory
regions on NUMA nodes during early boot.  However, CMA cannot reserve memory
on NUMA nodes that are not online during early boot. ::

  void __init hugetlb_cma_reserve(int order) {
    if (!node_online(nid))
      /* do not allow reservations */
  }

This means if users intend to defer management of CXL memory to the driver, CMA
cannot be used to guarantee huge page allocations.  If enabling CXL memory as
SystemRAM in :code:`ZONE_NORMAL` during early boot, CMA reservations per-node
can be made with the :code:`cma_pernuma` or :code:`numa_cma` kernel command line
parameters.
138