=======================
NUMA Memory Performance
=======================

NUMA Locality
=============

Some platforms may have multiple types of memory attached to a compute
node. These disparate memory ranges may share some characteristics, such
as CPU cache coherence, but may have different performance. For example,
different media types and buses affect bandwidth and latency.

A system supports such heterogeneous memory by grouping each memory type
under different domains, or "nodes", based on locality and performance
characteristics. Some memory may share the same node as a CPU, and others
are provided as memory only nodes. While memory only nodes do not provide
CPUs, they may still be local to one or more compute nodes relative to
other nodes.
The following diagram shows one such example of two compute
nodes with local memory and a memory only node for each compute node::

    +------------------+     +------------------+
    | Compute Node 0   +-----+ Compute Node 1   |
    | Local Node0 Mem  |     | Local Node1 Mem  |
    +--------+---------+     +--------+---------+
             |                        |
    +--------+---------+     +--------+---------+
    | Slower Node2 Mem |     | Slower Node3 Mem |
    +------------------+     +------------------+

A "memory initiator" is a node containing one or more devices such as
CPUs or separate memory I/O devices that can initiate memory requests.
A "memory target" is a node containing one or more physical address
ranges accessible from one or more memory initiators.

When multiple memory initiators exist, they may not all have the same
performance when accessing a given memory target. Each initiator-target
pair may be organized into different ranked access classes to represent
this relationship. The highest performing initiator to a given target
is considered to be one of that target's local initiators, and given
the highest access class, 0. Any given target may have one or more
local initiators, and any given initiator may have multiple local
memory targets.

To aid applications matching memory targets with their initiators, the
kernel provides symlinks from each to the other.
The following example lists the
relationship for the access class "0" memory initiators and targets::

    # symlinks -v /sys/devices/system/node/nodeX/access0/targets/
    relative: /sys/devices/system/node/nodeX/access0/targets/nodeY -> ../../nodeY

    # symlinks -v /sys/devices/system/node/nodeY/access0/initiators/
    relative: /sys/devices/system/node/nodeY/access0/initiators/nodeX -> ../../nodeX

A memory initiator may have multiple memory targets in the same access
class. The target memory's initiators in a given class indicate that the
nodes' access characteristics share the same performance relative to other
linked initiator nodes. Each target within an initiator's access class,
though, does not necessarily perform the same as the others.

The access class "1" is used to allow differentiation between initiators
that are CPUs and hence suitable for generic task scheduling, and
IO initiators such as GPUs and NICs. Unlike access class 0, only
nodes containing CPUs are considered.

NUMA Performance
================

Applications may wish to consider which node they want their memory to
be allocated from based on the node's performance characteristics.
If the system provides these attributes, the kernel exports them under the
node sysfs hierarchy by appending the attributes directory under the
memory node's access class 0 initiators as follows::

    /sys/devices/system/node/nodeY/access0/initiators/

These attributes apply only when accessed from nodes that are linked
under this access class's initiators.

The performance characteristics the kernel provides for the local
initiators are exported as follows::

    # tree -P "read*|write*" /sys/devices/system/node/nodeY/access0/initiators/
    /sys/devices/system/node/nodeY/access0/initiators/
    |-- read_bandwidth
    |-- read_latency
    |-- write_bandwidth
    `-- write_latency

The bandwidth attributes are provided in MiB/second.

The latency attributes are provided in nanoseconds.

The values reported here correspond to the rated latency and bandwidth
for the platform.

Access class 1 takes the same form but only includes values for CPU to
memory activity.

NUMA Cache
==========

System memory may be constructed in a hierarchy of elements with various
performance characteristics in order to provide a large address space of
slower performing memory cached by a smaller higher performing memory.
The system physical addresses that memory initiators are aware of are
provided by the last memory level in the hierarchy. The system meanwhile
uses higher performing memory to transparently cache access to
progressively slower levels.

The term "far memory" is used to denote the last level memory in the
hierarchy. Each increasing cache level provides higher performing
initiator access, and the term "near memory" represents the fastest
cache provided by the system.

This numbering is different than CPU caches where the cache level (e.g.
L1, L2, L3) uses the CPU-side view where each increased level is lower
performing. In contrast, the memory cache level is centric to the last
level memory, so the higher numbered cache level corresponds to memory
nearer to the CPU, and further from far memory.

The memory-side caches are not directly addressable by software. When
software accesses a system address, the system will return it from the
near memory cache if it is present. If it is not present, the system
accesses the next level of memory until there is either a hit in that
cache level, or it reaches far memory.

An application does not need to know about caching attributes in order
to use the system. Software may optionally query the memory cache
attributes in order to maximize the performance out of such a setup.
If the system provides a way for the kernel to discover this information,
for example with ACPI HMAT (Heterogeneous Memory Attribute Table),
the kernel will append these attributes to the NUMA node memory target.

When the kernel first registers a memory cache with a node, the kernel
will create the following directory::

    /sys/devices/system/node/nodeX/memory_side_cache/

If that directory is not present, the system either does not provide
a memory-side cache, or that information is not accessible to the kernel.

The attributes for each level of cache are provided under its cache
level index::

    /sys/devices/system/node/nodeX/memory_side_cache/indexA/
    /sys/devices/system/node/nodeX/memory_side_cache/indexB/
    /sys/devices/system/node/nodeX/memory_side_cache/indexC/

Each cache level's directory provides its attributes.
For example, the
following shows a single cache level and the attributes available for
software to query::

    # tree /sys/devices/system/node/node0/memory_side_cache/
    /sys/devices/system/node/node0/memory_side_cache/
    |-- index1
    |   |-- indexing
    |   |-- line_size
    |   |-- size
    |   `-- write_policy

The "indexing" will be 0 if it is a direct-mapped cache, and non-zero
for any other index-based, multi-way associativity.

The "line_size" is the number of bytes accessed from the next cache
level on a miss.

The "size" is the number of bytes provided by this cache level.

The "write_policy" will be 0 for write-back, and non-zero for
write-through caching.

See Also
========

[1] https://www.uefi.org/sites/default/files/resources/ACPI_6_2.pdf
- Section 5.2.27
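
The sysfs attributes described in the NUMA Performance and NUMA Cache
sections can be read with ordinary file I/O. The following is a minimal
Python sketch, not part of the kernel or of any library. The function
names (``read_int_attrs``, ``node_performance``,
``node_memory_side_caches``) and the ``root`` parameter are hypothetical,
and the sketch assumes each attribute file holds a single decimal integer.

```python
import os

# Assumption: the node hierarchy lives at the path used in this document.
NODE_ROOT = "/sys/devices/system/node"

PERF_ATTRS = ("read_bandwidth", "read_latency",
              "write_bandwidth", "write_latency")
CACHE_ATTRS = ("indexing", "line_size", "size", "write_policy")


def read_int_attrs(directory, names):
    """Return {name: int} for each named attribute file found in directory."""
    values = {}
    for name in names:
        try:
            with open(os.path.join(directory, name)) as f:
                values[name] = int(f.read().strip())
        except (OSError, ValueError):
            # Missing, unreadable, or non-integer: skip this attribute.
            continue
    return values


def node_performance(node, access_class=0, root=NODE_ROOT):
    """Read bandwidth (MiB/s) and latency (ns) for one access class."""
    path = os.path.join(root, f"node{node}",
                        f"access{access_class}", "initiators")
    return read_int_attrs(path, PERF_ATTRS)


def node_memory_side_caches(node, root=NODE_ROOT):
    """Return {index_name: attrs} for each memory-side cache level.

    An empty dict means the memory_side_cache directory is absent: either
    there is no memory-side cache or the kernel has no information on it.
    """
    base = os.path.join(root, f"node{node}", "memory_side_cache")
    if not os.path.isdir(base):
        return {}
    caches = {}
    for entry in sorted(os.listdir(base)):
        if entry.startswith("index"):
            caches[entry] = read_int_attrs(os.path.join(base, entry),
                                           CACHE_ATTRS)
    return caches


if __name__ == "__main__":
    print("access0 performance:", node_performance(0))
    print("memory-side caches:", node_memory_side_caches(0))
```

On a platform that does not export these attributes, both functions
return an empty dict for the node. The ``root`` argument exists only so
the functions can be pointed at a test directory instead of the live
sysfs tree.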