=======================
NUMA Memory Performance
=======================

NUMA Locality
=============

Some platforms may have multiple types of memory attached to a compute
node. These disparate memory ranges may share some characteristics, such
as CPU cache coherence, but may have different performance. For example,
different media types and buses affect bandwidth and latency.

A system supports such heterogeneous memory by grouping each memory type
under different domains, or "nodes", based on locality and performance
characteristics. Some memory may share the same node as a CPU, and other
memory is provided as memory-only nodes. While memory-only nodes do not
provide CPUs, they may still be local to one or more compute nodes relative
to other nodes. The following diagram shows one such example of two compute
nodes with local memory and a memory-only node for each compute node::

 +------------------+     +------------------+
 | Compute Node 0   +-----+ Compute Node 1   |
 | Local Node0 Mem  |     | Local Node1 Mem  |
 +--------+---------+     +--------+---------+
          |                        |
 +--------+---------+     +--------+---------+
 | Slower Node2 Mem |     | Slower Node3 Mem |
 +------------------+     +------------------+

A "memory initiator" is a node containing one or more devices such as
CPUs or separate memory I/O devices that can initiate memory requests.
A "memory target" is a node containing one or more physical address
ranges accessible from one or more memory initiators.

When multiple memory initiators exist, they may not all have the same
performance when accessing a given memory target. Each initiator-target
pair may be organized into different ranked access classes to represent
this relationship. The highest performing initiator to a given target
is considered to be one of that target's local initiators, and given
the highest access class, 0. Any given target may have one or more
local initiators, and any given initiator may have multiple local
memory targets.

To aid applications matching memory targets with their initiators, the
kernel provides symlinks between them. The following example lists the
relationship for the access class "0" memory initiators and targets::

	# symlinks -v /sys/devices/system/node/nodeX/access0/targets/
	relative: /sys/devices/system/node/nodeX/access0/targets/nodeY -> ../../nodeY

	# symlinks -v /sys/devices/system/node/nodeY/access0/initiators/
	relative: /sys/devices/system/node/nodeY/access0/initiators/nodeX -> ../../nodeX

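The symlink layout shown above can also be walked from a program. The
following is a minimal Python sketch, assuming the sysfs layout from the
example. The helper name "linked_nodes" and its "sysfs_root" parameter are
hypothetical and exist only for illustration; they are not part of any
kernel or library interface. The helper returns the node numbers linked
under a node's "targets" or "initiators" directory for a given access
class, and an empty list when that directory is absent.

```python
import os
import re

# Location of the node hierarchy used in the examples above.
NODE_ROOT = "/sys/devices/system/node"
_NODE_RE = re.compile(r"^node(\d+)$")


def linked_nodes(node, kind, access_class=0, sysfs_root=NODE_ROOT):
    """Return sorted node numbers linked under nodeN/access<class>/<kind>/.

    kind is "targets" or "initiators". An empty list means the directory
    does not exist, e.g. the kernel did not export this relationship.
    """
    if kind not in ("targets", "initiators"):
        raise ValueError("kind must be 'targets' or 'initiators'")
    path = os.path.join(sysfs_root, f"node{node}",
                        f"access{access_class}", kind)
    try:
        entries = os.listdir(path)
    except (FileNotFoundError, NotADirectoryError):
        return []
    ids = []
    for name in entries:
        match = _NODE_RE.match(name)
        # Skip entries that are not nodeN links, such as attribute files.
        if match:
            ids.append(int(match.group(1)))
    return sorted(ids)


if __name__ == "__main__":
    print("node0 access0 targets:   ", linked_nodes(0, "targets"))
    print("node0 access0 initiators:", linked_nodes(0, "initiators"))
```

Passing "access_class=1" queries the second access class in the same way,
provided the kernel exports it.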
A memory initiator may have multiple memory targets in the same access
class. The target memory's initiators in a given class indicate that the
nodes' access characteristics share the same performance relative to other
linked initiator nodes. The targets within an initiator's access class,
though, do not necessarily perform the same as each other.

The access class "1" is used to allow differentiation between initiators
that are CPUs and hence suitable for generic task scheduling, and
I/O initiators such as GPUs and NICs. Unlike access class 0, only
nodes containing CPUs are considered.

NUMA Performance
================

Applications may wish to consider which node they want their memory to
be allocated from based on the node's performance characteristics. If
the system provides these attributes, the kernel exports them under the
node sysfs hierarchy by appending the attributes directory under the
memory node's access class 0 initiators as follows::

	/sys/devices/system/node/nodeY/access0/initiators/

These attributes apply only when the memory is accessed from nodes that
are linked under this access class's initiators.

The performance characteristics the kernel provides for the local
initiators are exported as follows::

	# tree -P "read*|write*" /sys/devices/system/node/nodeY/access0/initiators/
	/sys/devices/system/node/nodeY/access0/initiators/
	|-- read_bandwidth
	|-- read_latency
	|-- write_bandwidth
	`-- write_latency

The bandwidth attributes are provided in MiB/second.

The latency attributes are provided in nanoseconds.

The values reported here correspond to the rated latency and bandwidth
for the platform.

Access class 1 takes the same form but only includes values for CPU to
memory activity.

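As an illustration of how an application might consume these attributes,
the following Python sketch reads the four files listed above and orders
candidate nodes by read latency. It assumes each attribute file holds a
single decimal integer. The function names and the "sysfs_root" parameter
are hypothetical, introduced only for this example.

```python
import os

NODE_ROOT = "/sys/devices/system/node"
PERF_ATTRS = ("read_bandwidth", "read_latency",
              "write_bandwidth", "write_latency")


def read_access_perf(node, access_class=0, sysfs_root=NODE_ROOT):
    """Return {attribute: int} for the attributes that are present.

    Bandwidth values are in MiB/second and latency values in nanoseconds,
    as described in the text. Missing attributes are simply left out.
    """
    base = os.path.join(sysfs_root, f"node{node}",
                        f"access{access_class}", "initiators")
    values = {}
    for attr in PERF_ATTRS:
        try:
            with open(os.path.join(base, attr)) as f:
                values[attr] = int(f.read().strip())
        except (OSError, ValueError):
            # Not exported, unreadable, or not a plain integer (assumption).
            continue
    return values


def rank_by_read_latency(nodes, access_class=0, sysfs_root=NODE_ROOT):
    """Return the nodes that report read_latency, lowest latency first."""
    scored = []
    for node in nodes:
        perf = read_access_perf(node, access_class, sysfs_root)
        if "read_latency" in perf:
            scored.append((perf["read_latency"], node))
    return [node for _, node in sorted(scored)]


if __name__ == "__main__":
    for n in (0, 1, 2, 3):
        print(f"node{n}:", read_access_perf(n))
```

Because the values are rated figures that apply only to the linked
initiators, such a ranking is a hint for placement decisions rather than a
measurement.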
NUMA Cache
==========

System memory may be constructed in a hierarchy of elements with various
performance characteristics in order to provide a large address space of
slower performing memory cached by a smaller, higher performing memory. The
system physical addresses that memory initiators are aware of are provided
by the last memory level in the hierarchy. The system meanwhile uses
higher performing memory to transparently cache access to progressively
slower levels.

The term "far memory" is used to denote the last level memory in the
hierarchy. Each increasing cache level provides higher performing
initiator access, and the term "near memory" represents the fastest
cache provided by the system.

This numbering is different than CPU caches, where the cache level (e.g.
L1, L2, L3) uses the CPU-side view in which each increased level is lower
performing. In contrast, the memory cache level is centric to the last
level memory, so the higher numbered cache level corresponds to memory
nearer to the CPU, and further from far memory.

The memory-side caches are not directly addressable by software. When
software accesses a system address, the system will return it from the
near memory cache if it is present. If it is not present, the system
accesses the next level of memory until there is either a hit in that
cache level, or it reaches far memory.

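The lookup order described above can be modelled with a few lines of
Python. This is only a toy model for illustration: the real lookup is
performed by the platform and is invisible to software, and the "lookup"
function and its dictionary-based levels are hypothetical.

```python
def lookup(address, cache_levels, far_memory):
    """Return (value, source) for a system address.

    cache_levels is a list of (name, contents) pairs ordered from near
    memory towards far memory, where contents maps addresses to values.
    far_memory maps every valid system address to its value.
    """
    for name, contents in cache_levels:
        if address in contents:
            # Hit in this cache level: no slower level is consulted.
            return contents[address], name
    # Missed every cache level, so the access reaches far memory.
    return far_memory[address], "far memory"


if __name__ == "__main__":
    far = {0x1000: "a", 0x2000: "b", 0x3000: "c"}
    levels = [
        ("near memory", {0x1000: "a"}),
        ("next level", {0x1000: "a", 0x2000: "b"}),
    ]
    for addr in (0x1000, 0x2000, 0x3000):
        print(hex(addr), lookup(addr, levels, far))
```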
An application does not need to know about caching attributes in order
to use the system. Software may optionally query the memory cache
attributes in order to maximize the performance out of such a setup.
If the system provides a way for the kernel to discover this information,
for example with ACPI HMAT (Heterogeneous Memory Attribute Table),
the kernel will append these attributes to the NUMA node memory target.

When the kernel first registers a memory cache with a node, the kernel
will create the following directory::

	/sys/devices/system/node/nodeX/memory_side_cache/

If that directory is not present, the system either does not provide
a memory-side cache, or that information is not accessible to the kernel.

The attributes for each level of cache are provided under its cache
level index::

	/sys/devices/system/node/nodeX/memory_side_cache/indexA/
	/sys/devices/system/node/nodeX/memory_side_cache/indexB/
	/sys/devices/system/node/nodeX/memory_side_cache/indexC/

Each cache level's directory provides its attributes. For example, the
following shows a single cache level and the attributes available for
software to query::

	# tree /sys/devices/system/node/node0/memory_side_cache/
	/sys/devices/system/node/node0/memory_side_cache/
	|-- index1
	|   |-- indexing
	|   |-- line_size
	|   |-- size
	|   `-- write_policy

The "indexing" will be 0 if it is a direct-mapped cache, and non-zero
for any other index-based, multi-way associativity.

The "line_size" is the number of bytes accessed from the next cache
level on a miss.

The "size" is the number of bytes provided by this cache level.

The "write_policy" will be 0 for write-back, and non-zero for
write-through caching.

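The attributes above can be collected and decoded as in the following
Python sketch. It assumes each attribute file holds a single decimal
integer and decodes "indexing" and "write_policy" exactly as described in
the text. The function names and the "sysfs_root" parameter are
hypothetical, introduced only for this example. Levels are returned with
the highest index first, which is the level nearest to the CPU.

```python
import os
import re

NODE_ROOT = "/sys/devices/system/node"
CACHE_ATTRS = ("indexing", "line_size", "size", "write_policy")
_INDEX_RE = re.compile(r"^index(\d+)$")


def read_memory_side_caches(node, sysfs_root=NODE_ROOT):
    """Return one dict per cache level, highest level (nearest CPU) first.

    An empty list means memory_side_cache/ is absent: either there is no
    memory-side cache or the kernel could not discover it.
    """
    base = os.path.join(sysfs_root, f"node{node}", "memory_side_cache")
    try:
        names = os.listdir(base)
    except (FileNotFoundError, NotADirectoryError):
        return []
    levels = []
    for name in names:
        match = _INDEX_RE.match(name)
        if not match:
            continue
        info = {"level": int(match.group(1))}
        for attr in CACHE_ATTRS:
            try:
                with open(os.path.join(base, name, attr)) as f:
                    info[attr] = int(f.read().strip())
            except (OSError, ValueError):
                continue  # attribute missing or not a plain integer
        levels.append(info)
    return sorted(levels, key=lambda d: d["level"], reverse=True)


def describe(info):
    """Render one cache level using the encodings given in the text."""
    parts = [f"index{info['level']}"]
    if "size" in info:
        parts.append(f"size={info['size']} bytes")
    if "line_size" in info:
        parts.append(f"line_size={info['line_size']} bytes")
    if "indexing" in info:
        parts.append("direct-mapped" if info["indexing"] == 0
                     else "not direct-mapped")
    if "write_policy" in info:
        parts.append("write-back" if info["write_policy"] == 0
                     else "write-through")
    return ", ".join(parts)


if __name__ == "__main__":
    for level in read_memory_side_caches(0):
        print(describe(level))
```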
See Also
========

[1] https://www.uefi.org/sites/default/files/resources/ACPI_6_2.pdf
- Section 5.2.27