.. SPDX-License-Identifier: GPL-2.0
.. include:: <isonum.txt>

=====================================================
User Interface for Resource Control feature (resctrl)
=====================================================

:Copyright: |copy| 2016 Intel Corporation
:Authors: - Fenghua Yu <fenghua.yu@intel.com>
          - Tony Luck <tony.luck@intel.com>
          - Vikas Shivappa <vikas.shivappa@intel.com>


Intel refers to this feature as Intel Resource Director Technology (Intel(R) RDT).
AMD refers to this feature as AMD Platform Quality of Service (AMD QoS).

This feature is enabled by the CONFIG_X86_CPU_RESCTRL kernel configuration
option and the following x86 /proc/cpuinfo flag bits:

=============================================== ================================
RDT (Resource Director Technology) Allocation   "rdt_a"
CAT (Cache Allocation Technology)               "cat_l3", "cat_l2"
CDP (Code and Data Prioritization)              "cdp_l3", "cdp_l2"
CQM (Cache QoS Monitoring)                      "cqm_llc", "cqm_occup_llc"
MBM (Memory Bandwidth Monitoring)               "cqm_mbm_total", "cqm_mbm_local"
MBA (Memory Bandwidth Allocation)               "mba"
SMBA (Slow Memory Bandwidth Allocation)         ""
BMEC (Bandwidth Monitoring Event Configuration) ""
=============================================== ================================
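
As a rough illustration of the table above, the following Python sketch
reports which of the listed flags appear in a "flags" line. The function
name, the short feature labels and the sample line are hypothetical; on a
real system the input would be read from /proc/cpuinfo.

```python
# Flag names come from the table above; the labels are illustrative only.
RESCTRL_FLAGS = {
    "RDT allocation": ["rdt_a"],
    "CAT": ["cat_l3", "cat_l2"],
    "CDP": ["cdp_l3", "cdp_l2"],
    "CQM": ["cqm_llc", "cqm_occup_llc"],
    "MBM": ["cqm_mbm_total", "cqm_mbm_local"],
    "MBA": ["mba"],
}


def detect_features(flags_line):
    """Return {feature: [flags present]} for a /proc/cpuinfo "flags" line."""
    # Drop the "flags :" label if present, then split on whitespace.
    _, _, values = flags_line.rpartition(":")
    present = set(values.split())
    found = {}
    for feature, names in RESCTRL_FLAGS.items():
        hits = [name for name in names if name in present]
        if hits:
            found[feature] = hits
    return found


# Hypothetical sample line, not taken from a real machine.
sample = "flags\t\t: fpu vme cat_l3 cdp_l3 cqm_llc cqm_occup_llc mba"
print(detect_features(sample))
```

SMBA and BMEC have no flag in the table, so a check like this cannot
detect them; the info directory is the place to look for those.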

Historically, new features were made visible by default in /proc/cpuinfo. This
resulted in the feature flags becoming hard to parse by humans. Adding a new
flag to /proc/cpuinfo should be avoided if user space can obtain information
about the feature from resctrl's info directory.

To use the feature, mount the file system::

 # mount -t resctrl resctrl [-o cdp[,cdpl2][,mba_MBps][,debug]] /sys/fs/resctrl

Mount options are:

"cdp":
	Enable code/data prioritization in L3 cache allocations.
"cdpl2":
	Enable code/data prioritization in L2 cache allocations.
"mba_MBps":
	Enable the MBA Software Controller (mba_sc) to specify MBA
	bandwidth in MiBps.
"debug":
	Make debug files accessible. Available debug files are annotated with
	"Available only with debug option".
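
The option syntax above can be assembled programmatically. The following
is a minimal sketch, assuming a hypothetical helper named
resctrl_mount_command(); it only builds the argument list and does not
run mount itself.

```python
def resctrl_mount_command(cdp=False, cdpl2=False, mba_mbps=False,
                          debug=False, mountpoint="/sys/fs/resctrl"):
    """Build the argv list for mounting resctrl with the chosen options."""
    options = []
    if cdp:
        options.append("cdp")
    if cdpl2:
        options.append("cdpl2")
    if mba_mbps:
        options.append("mba_MBps")
    if debug:
        options.append("debug")
    argv = ["mount", "-t", "resctrl", "resctrl"]
    if options:
        # Options are passed as one comma-separated -o argument.
        argv += ["-o", ",".join(options)]
    argv.append(mountpoint)
    return argv


print(" ".join(resctrl_mount_command(cdp=True, debug=True)))
```

The returned list could be handed to subprocess.run() by a caller with
sufficient privileges.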

L2 and L3 CDP are controlled separately.

RDT features are orthogonal. A particular system may support only
monitoring, only control, or both monitoring and control.  Cache
pseudo-locking is a unique way of using cache control to "pin" or
"lock" data in the cache. Details can be found in
"Cache Pseudo-Locking".


The mount succeeds if either allocation or monitoring is present, but
only those files and directories supported by the system will be created.
For more details on the behavior of the interface during monitoring
and allocation, see the "Resource alloc and monitor groups" section.

Info directory
==============

The 'info' directory contains information about the enabled
resources. Each resource has its own subdirectory. The subdirectory
names reflect the resource names.

The files that each subdirectory contains with respect to allocation
depend on the type of resource.

The cache resource (L3/L2) subdirectories contain the following files
related to allocation:

"num_closids":
		The number of CLOSIDs which are valid for this
		resource. The kernel uses the smallest number of
		CLOSIDs of all enabled resources as the limit.
"cbm_mask":
		The bitmask which is valid for this resource.
		This mask is equivalent to 100%.
"min_cbm_bits":
		The minimum number of consecutive bits which
		must be set when writing a mask.

"shareable_bits":
		Bitmask of the portion of the resource that is shared
		with other executing entities (e.g. I/O). The user can
		use this when setting up exclusive cache partitions.
		Note that some platforms support devices that have
		their own settings for cache use which can override
		these bits.
"bit_usage":
		Annotated capacity bitmasks showing how all
		instances of the resource are used. The legend is:

			"0":
			      Corresponding region is unused. When the system's
			      resources have been allocated and a "0" is found
			      in "bit_usage" it is a sign that resources are
			      wasted.

			"H":
			      Corresponding region is used by hardware only
			      but available for software use. If a resource
			      has bits set in "shareable_bits" but not all
			      of these bits appear in the resource groups'
			      schemata, then the bits appearing in
			      "shareable_bits" but in no resource group will
			      be marked as "H".
			"X":
			      Corresponding region is available for sharing and
			      used by hardware and software. These are the
			      bits that appear in "shareable_bits" as
			      well as a resource group's allocation.
			"S":
			      Corresponding region is used by software
			      and available for sharing.
			"E":
			      Corresponding region is used exclusively by
			      one resource group. No sharing allowed.
			"P":
			      Corresponding region is pseudo-locked. No
			      sharing allowed.
"sparse_masks":
		Indicates whether a CBM value with non-contiguous 1s is
		supported.

			"0":
			      Only a contiguous 1s value in the CBM is supported.
			"1":
			      A non-contiguous 1s value in the CBM is supported.
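
A user-space tool might pre-check a mask against these files before
writing it. The sketch below is one possible reading of "cbm_mask",
"min_cbm_bits" and "sparse_masks"; the function names and messages are
hypothetical, and the kernel's own validation may differ in detail. For
sparse masks in particular, this sketch only measures the lowest run
of 1s, which is an assumption.

```python
def is_contiguous(mask):
    """True if all 1 bits of a non-zero mask form a single block."""
    low = mask & -mask            # lowest set bit
    shifted = mask // low         # drop the trailing zeros
    # A single run now looks like 0b0111...1, so adding 1 clears every bit.
    return shifted & (shifted + 1) == 0


def lowest_run_length(mask):
    """Length of the lowest run of consecutive 1 bits in a non-zero mask."""
    shifted = mask // (mask & -mask)
    count = 0
    while shifted & 1:
        count += 1
        shifted >>= 1
    return count


def check_cbm(mask, cbm_mask, min_cbm_bits, sparse_masks):
    """Return None if the mask looks acceptable, else a short reason."""
    if mask == 0:
        return "mask is empty"
    if mask & ~cbm_mask:
        return "mask has bits outside cbm_mask"
    if not sparse_masks and not is_contiguous(mask):
        return "mask has non-consecutive 1-bits"
    if lowest_run_length(mask) < min_cbm_bits:
        return "mask needs at least %d consecutive bits" % min_cbm_bits
    return None


print(check_cbm(0xf7, cbm_mask=0xff, min_cbm_bits=1, sparse_masks=0))
```

The values passed in would come from reading the corresponding files in
the resource's info subdirectory and parsing them as integers.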

The memory bandwidth (MB) subdirectory contains the following files
with respect to allocation:

"min_bandwidth":
		The minimum memory bandwidth percentage which
		the user can request.

"bandwidth_gran":
		The granularity in which the memory bandwidth
		percentage is allocated. The allocated
		bandwidth percentage is rounded to the next
		control step available on the hardware. The
		available bandwidth control steps are:
		min_bandwidth + N * bandwidth_gran.

"delay_linear":
		Indicates if the delay scale is linear or
		non-linear. This field is purely informational.

"thread_throttle_mode":
		Indicator on Intel systems of how tasks running on threads
		of a physical core are throttled in cases where they
		request different memory bandwidth percentages:

		"max":
			the smallest percentage is applied
			to all threads
		"per-thread":
			bandwidth percentages are directly applied to
			the threads running on the core
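
The control-step formula given for "bandwidth_gran" above can be written
out as a short sketch. The upper limit of 100 (percent) and the choice to
round a request up to the next step are assumptions made for
illustration; the function names are hypothetical.

```python
def bandwidth_steps(min_bandwidth, bandwidth_gran, maximum=100):
    """List the steps min_bandwidth + N * bandwidth_gran up to maximum."""
    return list(range(min_bandwidth, maximum + 1, bandwidth_gran))


def next_step(requested, min_bandwidth, bandwidth_gran, maximum=100):
    """Smallest control step that is >= the requested percentage."""
    if requested < min_bandwidth:
        raise ValueError("request is below min_bandwidth")
    for step in bandwidth_steps(min_bandwidth, bandwidth_gran, maximum):
        if step >= requested:
            return step
    raise ValueError("request exceeds the largest control step")


print(bandwidth_steps(10, 10))
print(next_step(35, 10, 10))
```

With min_bandwidth and bandwidth_gran both equal to 10, a request for 35
lands on the 40 step under this interpretation.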

If RDT monitoring is available there will be an "L3_MON" directory
with the following files:

"num_rmids":
		The number of RMIDs available. This is the
		upper bound for how many "CTRL_MON" + "MON"
		groups can be created.

"mon_features":
		Lists the monitoring events if
		monitoring is enabled for the resource.
		Example::

			# cat /sys/fs/resctrl/info/L3_MON/mon_features
			llc_occupancy
			mbm_total_bytes
			mbm_local_bytes

		If the system supports Bandwidth Monitoring Event
		Configuration (BMEC), then the bandwidth events will
		be configurable. The output will be::

			# cat /sys/fs/resctrl/info/L3_MON/mon_features
			llc_occupancy
			mbm_total_bytes
			mbm_total_bytes_config
			mbm_local_bytes
			mbm_local_bytes_config

"mbm_total_bytes_config", "mbm_local_bytes_config":
	Read/write files containing the configuration for the mbm_total_bytes
	and mbm_local_bytes events, respectively, when the Bandwidth
	Monitoring Event Configuration (BMEC) feature is supported.
	The event configuration settings are domain specific and affect
	all the CPUs in the domain. When either event configuration is
	changed, the bandwidth counters for all RMIDs of both events
	(mbm_total_bytes as well as mbm_local_bytes) are cleared for that
	domain. The next read for every RMID will report "Unavailable"
	and subsequent reads will report the valid value.

	The following event types are supported:

	====    ========================================================
	Bits    Description
	====    ========================================================
	6       Dirty Victims from the QOS domain to all types of memory
	5       Reads to slow memory in the non-local NUMA domain
	4       Reads to slow memory in the local NUMA domain
	3       Non-temporal writes to non-local NUMA domain
	2       Non-temporal writes to local NUMA domain
	1       Reads to memory in the non-local NUMA domain
	0       Reads to memory in the local NUMA domain
	====    ========================================================

	By default, the mbm_total_bytes configuration is set to 0x7f to count
	all the event types and the mbm_local_bytes configuration is set to
	0x15 to count all the local memory events.

	Examples:

	* To view the current configuration:
	  ::

	    # cat /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config
	    0=0x7f;1=0x7f;2=0x7f;3=0x7f

	    # cat /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config
	    0=0x15;1=0x15;2=0x15;3=0x15

	* To change the mbm_total_bytes to count only reads on domain 0,
	  bits 0, 1, 4 and 5 need to be set, which is 110011b in binary
	  (in hexadecimal 0x33):
	  ::

	    # echo "0=0x33" > /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config

	    # cat /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config
	    0=0x33;1=0x7f;2=0x7f;3=0x7f

	* To change the mbm_local_bytes to count all the slow memory reads on
	  domains 0 and 1, bits 4 and 5 need to be set, which is 110000b
	  in binary (in hexadecimal 0x30):
	  ::

	    # echo "0=0x30;1=0x30" > /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config

	    # cat /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config
	    0=0x30;1=0x30;2=0x15;3=0x15

"max_threshold_occupancy":
		Read/write file that provides the largest value (in
		bytes) at which a previously used LLC_occupancy
		counter can be considered for re-use.
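
The event bit table for "mbm_total_bytes_config" and
"mbm_local_bytes_config" above lends itself to a small helper. The
following sketch derives the masks quoted in the text (0x7f, 0x15, 0x33
and 0x30) from the bit numbers and parses the "domain=value" format shown
in the examples. The short event names and function names are
hypothetical labels chosen for this illustration.

```python
# Bit numbers are taken from the table above; the names are illustrative.
BMEC_BITS = {
    "local_reads": 0,
    "remote_reads": 1,
    "local_nt_writes": 2,
    "remote_nt_writes": 3,
    "local_slow_reads": 4,
    "remote_slow_reads": 5,
    "dirty_victims": 6,
}


def bmec_mask(*events):
    """OR together the bits for the named event types."""
    mask = 0
    for event in events:
        mask |= 1 << BMEC_BITS[event]
    return mask


def parse_event_config(text):
    """Parse "0=0x7f;1=0x7f" into {0: 0x7f, 1: 0x7f}."""
    config = {}
    for entry in text.strip().split(";"):
        domain, _, value = entry.partition("=")
        config[int(domain)] = int(value, 16)
    return config


def format_event_config(config):
    """Format {domain: mask} the way the examples write it."""
    return ";".join("%d=%#x" % (d, v) for d, v in sorted(config.items()))


all_reads = bmec_mask("local_reads", "remote_reads",
                      "local_slow_reads", "remote_slow_reads")
print(hex(all_reads))
print(format_event_config({0: all_reads}))
```

The string produced by format_event_config() for a single domain has the
same shape as the value written in the "count only reads on domain 0"
example.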

Finally, in the top level of the "info" directory there is a file
named "last_cmd_status". This is reset with every "command" issued
via the file system (making new directories or writing to any of the
control files). If the command was successful, it will read as "ok".
If the command failed, it will provide more information than can be
conveyed in the error returns from file operations. E.g.
::

	# echo L3:0=f7 > schemata
	bash: echo: write error: Invalid argument
	# cat info/last_cmd_status
	mask f7 has non-consecutive 1-bits

Resource alloc and monitor groups
=================================

Resource groups are represented as directories in the resctrl file
system.  The default group is the root directory which, immediately
after mounting, owns all the tasks and CPUs in the system and can make
full use of all resources.

On a system with RDT control features additional directories can be
created in the root directory that specify different amounts of each
resource (see "schemata" below). The root and these additional top level
directories are referred to as "CTRL_MON" groups below.

On a system with RDT monitoring the root directory and other top level
directories contain a directory named "mon_groups" in which additional
directories can be created to monitor subsets of tasks in the CTRL_MON
group that is their ancestor. These are called "MON" groups in the rest
of this document.

Removing a directory will move all tasks and CPUs owned by the group it
represents to the parent. Removing one of the created CTRL_MON groups
will automatically remove all MON groups below it.

Moving MON group directories to a new parent CTRL_MON group is supported
for the purpose of changing the resource allocations of a MON group
without impacting its monitoring data or assigned tasks. This operation
is not allowed for MON groups which monitor CPUs. No other move
operation is currently allowed other than simply renaming a CTRL_MON or
MON group.

All groups contain the following files:

"tasks":
	Reading this file shows the list of all tasks that belong to
	this group. Writing a task id to the file will add a task to the
	group. Multiple tasks can be added by separating the task ids
	with commas. Tasks will be assigned sequentially. Multiple
	failures are not supported. A single failure encountered while
	attempting to assign a task will cause the operation to abort, and
	tasks added before the failure will remain in the group.
	Failures will be logged to /sys/fs/resctrl/info/last_cmd_status.

	If the group is a CTRL_MON group the task is removed from
	whichever previous CTRL_MON group owned the task and also from
	any MON group that owned the task. If the group is a MON group,
	then the task must already belong to the CTRL_MON parent of this
	group. The task is removed from any previous MON group.


"cpus":
	Reading this file shows a bitmask of the logical CPUs owned by
	this group. Writing a mask to this file will add and remove
	CPUs to/from this group. As with the tasks file a hierarchy is
	maintained where MON groups may only include CPUs owned by the
	parent CTRL_MON group.
	When the resource group is in pseudo-locked mode this file will
	only be readable, reflecting the CPUs associated with the
	pseudo-locked region.


"cpus_list":
	Just like "cpus", only using ranges of CPUs instead of bitmasks.
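
To illustrate how the "cpus" and "cpus_list" views relate, the sketch
below converts a hexadecimal CPU bitmask into a list of ranges. It
assumes the bitmask is written as hexadecimal words optionally separated
by commas, with bit 0 standing for CPU 0; the function name and sample
values are hypothetical.

```python
def cpumask_to_list(mask_text):
    """Convert a hex CPU bitmask such as "00000000,000000ff" to "0-7"."""
    mask = int(mask_text.strip().replace(",", ""), 16)
    ranges = []
    cpu = 0
    while mask:
        if mask & 1:
            start = cpu
            # Consume the whole run of consecutive set bits.
            while mask & 1:
                mask >>= 1
                cpu += 1
            end = cpu - 1
            ranges.append(str(start) if start == end
                          else "%d-%d" % (start, end))
        else:
            mask >>= 1
            cpu += 1
    return ",".join(ranges)


print(cpumask_to_list("00000000,000000ff"))
print(cpumask_to_list("f0f"))
```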


When control is enabled all CTRL_MON groups will also contain:

"schemata":
	A list of all the resources available to this group.
	Each resource has its own line and format - see below for details.

"size":
	Mirrors the display of the "schemata" file to display the size in
	bytes of each allocation instead of the bits representing the
	allocation.

"mode":
	The "mode" of the resource group dictates the sharing of its
	allocations. A "shareable" resource group allows sharing of its
	allocations while an "exclusive" resource group does not. A
	cache pseudo-locked region is created by first writing
	"pseudo-locksetup" to the "mode" file before writing the cache
	pseudo-locked region's schemata to the resource group's "schemata"
	file. On successful pseudo-locked region creation the mode will
	automatically change to "pseudo-locked".

"ctrl_hw_id":
	Available only with debug option. The identifier used by hardware
	for the control group. On x86 this is the CLOSID.

When monitoring is enabled all MON groups will also contain:

"mon_data":
	This contains a set of files organized by L3 domain and by
	RDT event. E.g. on a system with two L3 domains there will
	be subdirectories "mon_L3_00" and "mon_L3_01". Each of these
	directories has one file per event (e.g. "llc_occupancy",
	"mbm_total_bytes", and "mbm_local_bytes"). In a MON group these
	files provide a read out of the current value of the event for
	all tasks in the group. In CTRL_MON groups these files provide
	the sum for all tasks in the CTRL_MON group and all tasks in
	MON groups. See the example section for more details on usage.
	On systems with Sub-NUMA Cluster (SNC) enabled there are extra
	directories for each node (located within the "mon_L3_XX" directory
	for the L3 cache they occupy). These are named "mon_sub_L3_YY"
	where "YY" is the node number.

"mon_hw_id":
	Available only with debug option. The identifier used by hardware
	for the monitor group. On x86 this is the RMID.

When the "mba_MBps" mount option is used all CTRL_MON groups will also contain:

"mba_MBps_event":
	Reading this file shows which memory bandwidth event is used
	as input to the software feedback loop that keeps memory bandwidth
	below the value specified in the schemata file. Writing the
	name of one of the supported memory bandwidth events found in
	/sys/fs/resctrl/info/L3_MON/mon_features changes the input
	event.

Resource allocation rules
-------------------------

When a task is running the following rules define which resources are
available to it:

1) If the task is a member of a non-default group, then the schemata
   for that group is used.

2) Else if the task belongs to the default group, but is running on a
   CPU that is assigned to some specific group, then the schemata for the
   CPU's group is used.

3) Otherwise the schemata for the default group is used.
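
The three allocation rules above amount to a simple precedence check. The
sketch below models them with plain strings; the group names ("default",
"p0", "p1") and the function name are hypothetical and serve only to show
the order in which the rules apply.

```python
DEFAULT_GROUP = "default"


def schemata_group(task_group, cpu_group):
    """Pick the group whose schemata applies to a running task.

    task_group: the group the task is a member of.
    cpu_group:  the group owning the CPU the task is running on.
    """
    # Rule 1: membership in a non-default group wins.
    if task_group != DEFAULT_GROUP:
        return task_group
    # Rule 2: a default-group task inherits the group of its CPU.
    if cpu_group != DEFAULT_GROUP:
        return cpu_group
    # Rule 3: otherwise the default group applies.
    return DEFAULT_GROUP


print(schemata_group("p0", "p1"))
print(schemata_group("default", "p1"))
print(schemata_group("default", "default"))
```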

Resource monitoring rules
-------------------------

1) If a task is a member of a MON group, or non-default CTRL_MON group
   then RDT events for the task will be reported in that group.

2) If a task is a member of the default CTRL_MON group, but is running
   on a CPU that is assigned to some specific group, then the RDT events
   for the task will be reported in that group.

3) Otherwise RDT events for the task will be reported in the root level
   "mon_data" group.


Notes on cache occupancy monitoring and control
===============================================

When moving a task from one group to another you should remember that
this only affects *new* cache allocations by the task. E.g. you may have
a task in a monitor group showing 3 MB of cache occupancy. If you move
the task to a new group and immediately check the occupancy of the old
and new groups you will likely see that the old group is still showing
3 MB and the new group zero. When the task accesses locations still in
cache from before the move, the hardware does not update any counters.
On a busy system you will likely see the occupancy in the old group go
down as cache lines are evicted and re-used while the occupancy in the
new group rises as the task accesses memory and loads into the cache are
counted based on membership in the new group.

The same applies to cache allocation control. Moving a task to a group
with a smaller cache partition will not evict any cache lines. The
process may continue to use them from the old partition.

Hardware uses a CLOSID (Class of Service ID) and an RMID (Resource
Monitoring ID) to identify a control group and a monitoring group
respectively. Each of the resource groups is mapped to these IDs based
on the kind of group. The number of CLOSIDs and RMIDs is limited by the
hardware, and hence the creation of a "CTRL_MON" directory may fail if
we run out of either CLOSIDs or RMIDs, and creation of a "MON" group may
fail if we run out of RMIDs.

max_threshold_occupancy - generic concepts
------------------------------------------

Note that an RMID once freed may not be immediately available for use,
as cache lines of the previous user are still tagged with the RMID.
Hence such RMIDs are placed on a limbo list and checked again later to
see whether the cache occupancy has gone down. If at some point the
system has a lot of limbo RMIDs that are not yet ready to be used, the
user may see -EBUSY during mkdir.

max_threshold_occupancy is a user-configurable value to determine the
occupancy at which an RMID can be freed.

The mon_llc_occupancy_limbo tracepoint gives the precise occupancy in bytes
for a subset of RMIDs that are not immediately available for allocation.
This can't be relied on to produce output every second; it may be necessary
to attempt to create an empty monitor group to force an update. Output may
only be produced if creation of a control or monitor group fails.

Schemata files - general concepts
---------------------------------

Each line in the file describes one resource. The line starts with
the name of the resource, followed by specific values to be applied
in each of the instances of that resource on the system.
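
As a sketch of the line layout just described, the following function
splits a schemata line into the resource name and its per-instance
values. It assumes the "name:id=value;id=value" shape seen in the
"L3:0=f7" example earlier in this document; the function name is
hypothetical, and values are kept as strings because their meaning
depends on the resource.

```python
def parse_schemata_line(line):
    """Split "L3:0=f7;1=ff" into ("L3", {0: "f7", 1: "ff"})."""
    resource, _, rest = line.strip().partition(":")
    values = {}
    for entry in rest.split(";"):
        instance, _, value = entry.partition("=")
        # Keys are instance ids; values stay as the text that was written.
        values[int(instance)] = value.strip()
    return resource.strip(), values


print(parse_schemata_line("L3:0=f7;1=ff"))
```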

Cache IDs
---------

On current generation systems there is one L3 cache per socket and L2
caches are generally just shared by the hyperthreads on a core, but this
isn't an architectural requirement. We could have multiple separate L3
caches on a socket, or multiple cores could share an L2 cache. So instead
of using "socket" or "core" to define the set of logical CPUs sharing
a resource we use a "Cache ID". At a given cache level this will be a
unique number across the whole system (but it isn't guaranteed to be a
contiguous sequence, there may be gaps).  To find the ID for each logical
CPU look in /sys/devices/system/cpu/cpu*/cache/index*/id
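
A minimal sketch of such a lookup follows. It walks the sysfs path given
above and additionally reads a "level" file next to each "id" file to
select one cache level; relying on that "level" file, the function name
and the sysfs_cpu parameter are assumptions of this example.

```python
import glob
import os


def cache_ids(level, sysfs_cpu="/sys/devices/system/cpu"):
    """Map logical CPU number -> cache ID for caches of the given level."""
    ids = {}
    pattern = os.path.join(sysfs_cpu, "cpu[0-9]*", "cache", "index[0-9]*")
    for index_dir in glob.glob(pattern):
        try:
            with open(os.path.join(index_dir, "level")) as f:
                if int(f.read()) != level:
                    continue
            with open(os.path.join(index_dir, "id")) as f:
                cache_id = int(f.read())
        except (OSError, ValueError):
            continue  # attribute missing or unreadable on this system
        # index_dir is .../cpuN/cache/indexM; recover N from the path.
        cpu_dir = os.path.dirname(os.path.dirname(index_dir))
        cpu = int(os.path.basename(cpu_dir)[len("cpu"):])
        ids[cpu] = cache_id
    return ids
```

Calling cache_ids(3) on a machine that exposes these files would return
one entry per logical CPU; CPUs that share an L3 cache map to the same
ID. On systems without the files the result is simply empty.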

Cache Bit Masks (CBM)
---------------------

For cache resources we describe the portion of the cache that is available
for allocation using a bitmask. The maximum value of the mask is defined
by each CPU model (and may be different for different cache levels). It
is found using CPUID, but is also provided in the "info" directory of
the resctrl file system in "info/{resource}/cbm_mask". Some Intel hardware
requires that these masks have all the '1' bits in a contiguous block. So
0x3, 0x6 and 0xC are legal 4-bit masks with two bits set, but 0x5, 0x9
and 0xA are not. Check /sys/fs/resctrl/info/{resource}/sparse_masks
to see whether a non-contiguous 1s value is supported. On a system with a
20-bit mask each bit represents 5% of the capacity of the cache. You could
partition the cache into four equal parts with masks: 0x1f, 0x3e0, 0x7c00,
0xf8000.
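
The four-way split quoted above can be reproduced with a few lines of
arithmetic. This is a sketch with a hypothetical function name; it only
handles the case where the mask width divides evenly.

```python
def partition_cbm(nbits, parts):
    """Split an nbits-wide mask into `parts` equal contiguous masks."""
    if nbits % parts:
        raise ValueError("mask width is not divisible into equal parts")
    width = nbits // parts
    block = (1 << width) - 1          # `width` consecutive 1 bits
    return [block << (i * width) for i in range(parts)]


masks = partition_cbm(20, 4)
print([hex(m) for m in masks])        # ['0x1f', '0x3e0', '0x7c00', '0xf8000']
print(100 / 20)                       # share of the cache per bit: 5.0
```

Each resulting mask is a contiguous block, so the split also works on
hardware that does not support sparse masks.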

Notes on Sub-NUMA Cluster mode
==============================
When SNC mode is enabled, Linux may load balance tasks between Sub-NUMA
nodes much more readily than between regular NUMA nodes since the CPUs
on Sub-NUMA nodes share the same L3 cache and the system may report
the NUMA distance between Sub-NUMA nodes with a lower value than used
for regular NUMA nodes.

The top-level monitoring files in each "mon_L3_XX" directory provide
the sum of data across all SNC nodes sharing an L3 cache instance.
Users who bind tasks to the CPUs of a specific Sub-NUMA node can read
the "llc_occupancy", "mbm_total_bytes", and "mbm_local_bytes" in the
"mon_sub_L3_YY" directories to get node local data.

Memory bandwidth allocation is still performed at the L3 cache
level. I.e. throttling controls are applied to all SNC nodes.

L3 cache allocation bitmaps also apply to all SNC nodes. But note that
the amount of L3 cache represented by each bit is divided by the number
of SNC nodes per L3 cache. E.g. with a 100MB cache on a system with 10-bit
allocation masks each bit normally represents 10MB. With SNC mode enabled
with two SNC nodes per L3 cache, each bit only represents 5MB.
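
The per-bit arithmetic can be checked with a one-line helper. The name
"cbm_bit_size_mb" is made up for this illustration, and the division is
integer-only:

```shell
# Capacity in MB represented by one CBM bit.
# Arguments: cache size in MB, CBM width in bits, SNC nodes per L3 cache.
cbm_bit_size_mb() {
    echo $(( $1 / $2 / $3 ))
}

cbm_bit_size_mb 100 10 1   # SNC disabled: prints 10
cbm_bit_size_mb 100 10 2   # two SNC nodes per L3 cache: prints 5
```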

Memory bandwidth Allocation and monitoring
==========================================

For the memory bandwidth resource, by default the user controls the
resource by indicating the percentage of total memory bandwidth.

The minimum bandwidth percentage value for each CPU model is predefined
and can be looked up through "info/MB/min_bandwidth". The bandwidth
granularity that is allocated is also dependent on the CPU model and can
be looked up at "info/MB/bandwidth_gran". The available bandwidth
control steps are: min_bw + N * bw_gran. Intermediate values are rounded
to the next control step available on the hardware.
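
A small sketch of the step rule, reading "next" as rounding up. The function
name "mba_next_step" is hypothetical, the clamping to min_bw and to 100 is
an assumption of this sketch, and the min_bw and bw_gran arguments would be
the values read from "info/MB/min_bandwidth" and "info/MB/bandwidth_gran":

```shell
# Round a requested percentage up to the next step min_bw + N * bw_gran.
# Arguments: requested percentage, min_bw, bw_gran.
mba_next_step() {
    req=$1; min_bw=$2; gran=$3
    if [ "$req" -lt "$min_bw" ]; then req=$min_bw; fi
    n=$(( (req - min_bw + gran - 1) / gran ))   # ceiling division
    step=$(( min_bw + n * gran ))
    if [ "$step" -gt 100 ]; then step=100; fi   # assumed upper bound
    echo "$step"
}
```

With min_bw and bw_gran both 10, a request of 35 maps to 40, while 10 and
100 are already valid steps and are returned unchanged.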

The bandwidth throttling is a core specific mechanism on some Intel
SKUs. Using a high bandwidth and a low bandwidth setting on two threads
sharing a core may result in both threads being throttled to use the
low bandwidth (see "thread_throttle_mode").

The fact that Memory Bandwidth Allocation (MBA) may be a core
specific mechanism whereas Memory Bandwidth Monitoring (MBM) is done at
the package level may lead to confusion when users try to apply control
via the MBA and then monitor the bandwidth to see if the controls are
effective. Below are such scenarios:

1. User may *not* see increase in actual bandwidth when percentage
   values are increased:

This can occur when aggregate L2 external bandwidth is more than L3
external bandwidth. Consider an SKL SKU with 24 cores on a package and
where L2 external bandwidth is 10GBps (hence aggregate L2 external
bandwidth is 240GBps) and L3 external bandwidth is 100GBps. Now a workload
with '20 threads, having 50% bandwidth, each consuming 5GBps' consumes the
max L3 bandwidth of 100GBps although the percentage value specified is only
50% << 100%. Hence increasing the bandwidth percentage will not yield any
more bandwidth. This is because although the L2 external bandwidth still
has capacity, the L3 external bandwidth is fully used. Also note that
this would be dependent on the number of cores the benchmark is run on.

2. Same bandwidth percentage may mean different actual bandwidth
   depending on # of threads:

For the same SKU in #1, a 'single thread, with 10% bandwidth' and '4
threads, with 10% bandwidth' can consume up to 10GBps and 40GBps although
they have the same percentage bandwidth of 10%. This is simply because as
threads start using more cores in an rdtgroup, the actual bandwidth may
increase or vary although the user specified bandwidth percentage is the
same.

In order to mitigate this and make the interface more user friendly,
resctrl added support for specifying the bandwidth in MiBps as well. The
kernel underneath would use a software feedback mechanism or a "Software
Controller (mba_sc)" which reads the actual bandwidth using MBM counters
and adjusts the memory bandwidth percentages to ensure::

	"actual bandwidth < user specified bandwidth".

By default, the schemata would take the bandwidth percentage values
whereas the user can switch to the "MBA software controller" mode using
the mount option 'mba_MBps'. The schemata format is specified in the
sections below.
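
To make the idea of a feedback loop concrete, here is a deliberately
simplified single step of such a loop. It is an illustration only and not
the kernel's algorithm: the function name "mba_sc_step", the one-step
adjustment policy and the clamping are all assumptions of this sketch, and
it makes no attempt to avoid oscillating around the limit:

```shell
# Print the throttle percentage to use for the next interval.
# Arguments: current percentage, measured bandwidth, user specified
# bandwidth (same unit as measured), min_bw, bw_gran.
mba_sc_step() {
    pct=$1; measured=$2; target=$3; min_bw=$4; gran=$5
    if [ "$measured" -ge "$target" ]; then
        pct=$((pct - gran))      # at or over the limit: throttle harder
    else
        pct=$((pct + gran))      # under the limit: allow more bandwidth
    fi
    if [ "$pct" -lt "$min_bw" ]; then pct=$min_bw; fi
    if [ "$pct" -gt 100 ]; then pct=100; fi
    echo "$pct"
}
```

For example, "mba_sc_step 50 1200 1024 10 10" lowers the percentage to 40
because the measured bandwidth exceeds the requested 1024.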

L3 schemata file details (code and data prioritization disabled)
----------------------------------------------------------------
With CDP disabled the L3 schemata format is::

	L3:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...

L3 schemata file details (CDP enabled via mount option to resctrl)
------------------------------------------------------------------
When CDP is enabled L3 control is split into two separate resources
so you can specify independent masks for code and data like this::

	L3DATA:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
	L3CODE:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...

L2 schemata file details
------------------------
CDP is supported at L2 using the 'cdpl2' mount option. The schemata
format is either::

	L2:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...

or::

	L2DATA:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
	L2CODE:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...


Memory bandwidth Allocation (default mode)
------------------------------------------

The memory bandwidth domain is the L3 cache.
::

	MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;...

Memory bandwidth Allocation specified in MiBps
----------------------------------------------

The memory bandwidth domain is the L3 cache.
::

	MB:<cache_id0>=bw_MiBps0;<cache_id1>=bw_MiBps1;...

Slow Memory Bandwidth Allocation (SMBA)
---------------------------------------
AMD hardware supports Slow Memory Bandwidth Allocation (SMBA).
CXL.memory is the only supported "slow" memory device. With the
support of SMBA, the hardware enables bandwidth allocation on
the slow memory devices. If there are multiple such devices in
the system, the throttling logic groups all the slow sources
together and applies the limit on them as a whole.

The presence of SMBA (with CXL.memory) is independent of the presence of
slow memory devices. If there are no such devices on the system, then
configuring SMBA will have no impact on the performance of the system.

The bandwidth domain for slow memory is the L3 cache. Its schemata file
is formatted as:
::

	SMBA:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;...
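
All of the formats above share the shape "RESOURCE:id=value;id=value;...".
A hypothetical helper (the name "schemata_line" is not part of resctrl)
that assembles such a line from its pieces might look like this:

```shell
# Build one schemata line.
# Arguments: resource name, then one "<cache_id>=<value>" pair per domain.
schemata_line() {
    res=$1; shift
    line="$res:"
    sep=""
    for dom in "$@"; do
        line="$line$sep$dom"
        sep=";"
    done
    printf '%s\n' "$line"
}
```

For instance, "schemata_line L3 0=3ff 1=fffff" prints "L3:0=3ff;1=fffff",
which could then be redirected into a group's schemata file.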

Reading/writing the schemata file
---------------------------------
Reading the schemata file will show the state of all resources
on all domains. When writing you only need to specify those values
which you wish to change. E.g.
::

  # cat schemata
  L3DATA:0=fffff;1=fffff;2=fffff;3=fffff
  L3CODE:0=fffff;1=fffff;2=fffff;3=fffff
  # echo "L3DATA:2=3c0;" > schemata
  # cat schemata
  L3DATA:0=fffff;1=fffff;2=3c0;3=fffff
  L3CODE:0=fffff;1=fffff;2=fffff;3=fffff
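
When scripting against this file it is often handy to pull out the value of
a single domain. The following sketch uses only sed and tr; the function
name "schemata_get" is an invention of this example:

```shell
# Read schemata text on stdin and print the value of one domain.
# Arguments: resource name (e.g. L3DATA), domain (cache) id.
schemata_get() {
    res=$1; dom=$2
    # Keep only the line of the requested resource, split the domains onto
    # separate lines, then strip "<id>=" and any padding from the match.
    sed -n "s/^ *$res://p" | tr ';' '\n' | sed -n "s/^$dom= *//p"
}
```

Fed with the final output above, "schemata_get L3DATA 2" prints "3c0". On a
live system the input would come from the group's schemata file, e.g.
"schemata_get L3DATA 2 < schemata".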

Reading/writing the schemata file (on AMD systems)
--------------------------------------------------
Reading the schemata file will show the current bandwidth limit on all
domains. The allocated resources are in multiples of one eighth GB/s.
When writing to the file, you need to specify the cache id for which you
wish to configure the bandwidth limit.

For example, to set a 2GB/s limit (16 units of one eighth GB/s) on
cache id 1:

::

  # cat schemata
    MB:0=2048;1=2048;2=2048;3=2048
    L3:0=ffff;1=ffff;2=ffff;3=ffff

  # echo "MB:1=16" > schemata
  # cat schemata
    MB:0=2048;1=  16;2=2048;3=2048
    L3:0=ffff;1=ffff;2=ffff;3=ffff
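
Since the unit is one eighth of a GB/s, the value to write is simply the
desired limit multiplied by eight. A tiny helper (the name "amd_bw_units"
is hypothetical, and it only handles whole GB/s) makes that explicit:

```shell
# Convert a limit in whole GB/s into schemata units of 1/8 GB/s.
amd_bw_units() {
    echo $(( $1 * 8 ))
}

amd_bw_units 2   # prints 16
amd_bw_units 8   # prints 64
```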

Reading/writing the schemata file (on AMD systems) with SMBA feature
--------------------------------------------------------------------
Reading and writing the schemata file is the same as without SMBA in
the section above.

For example, to set an 8GB/s limit (64 units of one eighth GB/s) on
cache id 1:

::

  # cat schemata
    SMBA:0=2048;1=2048;2=2048;3=2048
      MB:0=2048;1=2048;2=2048;3=2048
      L3:0=ffff;1=ffff;2=ffff;3=ffff

  # echo "SMBA:1=64" > schemata
  # cat schemata
    SMBA:0=2048;1=  64;2=2048;3=2048
      MB:0=2048;1=2048;2=2048;3=2048
      L3:0=ffff;1=ffff;2=ffff;3=ffff

Cache Pseudo-Locking
====================
CAT enables a user to specify the amount of cache space that an
application can fill. Cache pseudo-locking builds on the fact that a
CPU can still read and write data pre-allocated outside its current
allocated area on a cache hit. With cache pseudo-locking, data can be
preloaded into a reserved portion of cache that no application can
fill, and from that point on will only serve cache hits. The cache
pseudo-locked memory is made accessible to user space where an
application can map it into its virtual address space and thus have
a region of memory with reduced average read latency.

The creation of a cache pseudo-locked region is triggered by a request
from the user that is accompanied by the schemata of the region
to be pseudo-locked. The cache pseudo-locked region is created as follows:

- Create a CAT allocation CLOSNEW with a CBM matching the schemata
  from the user of the cache region that will contain the pseudo-locked
  memory. This region must not overlap with any current CAT allocation/CLOS
  on the system and no future overlap with this cache region is allowed
  while the pseudo-locked region exists.
- Create a contiguous region of memory of the same size as the cache
  region.
- Flush the cache, disable hardware prefetchers, disable preemption.
- Make CLOSNEW the active CLOS and touch the allocated memory to load
  it into the cache.
- Set the previous CLOS as active.
- At this point the closid CLOSNEW can be released - the cache
  pseudo-locked region is protected as long as its CBM does not appear in
  any CAT allocation. Even though the cache pseudo-locked region will from
  this point on not appear in any CBM of any CLOS an application running with
  any CLOS will be able to access the memory in the pseudo-locked region since
  the region continues to serve cache hits.
- The contiguous region of memory loaded into the cache is exposed to
  user-space as a character device.

Cache pseudo-locking increases the probability that data will remain
in the cache via carefully configuring the CAT feature and controlling
application behavior. There is no guarantee that data is placed in
cache. Instructions like INVD, WBINVD, CLFLUSH, etc. can still evict
“locked” data from cache. Power management C-states may shrink or
power off cache. Deeper C-states will automatically be restricted on
pseudo-locked region creation.

It is required that an application using a pseudo-locked region runs
with affinity to the cores (or a subset of the cores) associated
with the cache on which the pseudo-locked region resides. A sanity check
within the code will not allow an application to map pseudo-locked memory
unless it runs with affinity to cores associated with the cache on which the
pseudo-locked region resides. The sanity check is only done during the
initial mmap() handling; there is no enforcement afterwards and the
application itself needs to ensure it remains affine to the correct cores.

Pseudo-locking is accomplished in two stages:

1) During the first stage the system administrator allocates a portion
   of cache that should be dedicated to pseudo-locking. At this time an
   equivalent portion of memory is allocated, loaded into the allocated
   cache portion, and exposed as a character device.
2) During the second stage a user-space application maps (mmap()) the
   pseudo-locked memory into its address space.

Cache Pseudo-Locking Interface
------------------------------
A pseudo-locked region is created using the resctrl interface as follows:

1) Create a new resource group by creating a new directory in /sys/fs/resctrl.
2) Change the new resource group's mode to "pseudo-locksetup" by writing
   "pseudo-locksetup" to the "mode" file.
3) Write the schemata of the pseudo-locked region to the "schemata" file. All
   bits within the schemata should be "unused" according to the "bit_usage"
   file.

On successful pseudo-locked region creation the "mode" file will contain
"pseudo-locked" and a new character device with the same name as the resource
group will exist in /dev/pseudo_lock. This character device can be mmap()'ed
by user space in order to obtain access to the pseudo-locked memory region.
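
The three steps can be wrapped in a small shell function. This is a sketch
under stated assumptions: the name "pseudo_lock_create" is made up, the
resctrl mount point is passed as a parameter, and the schemata string used
in the usage line is a placeholder that must be replaced by bits reported
as unused on the target system:

```shell
# Arguments: resctrl mount point, new group name, schemata of the region.
pseudo_lock_create() {
    root=$1; name=$2; schemata=$3
    mkdir "$root/$name" || return 1                            # step 1
    echo pseudo-locksetup > "$root/$name/mode" || return 1     # step 2
    echo "$schemata" > "$root/$name/schemata" || return 1      # step 3
    cat "$root/$name/mode"    # expected to read "pseudo-locked" on success
}
```

A call such as 'pseudo_lock_create /sys/fs/resctrl newlock "L2:1=0x3"'
would then print the resulting mode of the group.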

An example of cache pseudo-locked region creation and usage can be found below.

Cache Pseudo-Locking Debugging Interface
----------------------------------------
The pseudo-locking debugging interface is enabled by default (if
CONFIG_DEBUG_FS is enabled) and can be found in /sys/kernel/debug/resctrl.

There is no explicit way for the kernel to test if a provided memory
location is present in the cache. The pseudo-locking debugging interface uses
the tracing infrastructure to provide two ways to measure cache residency of
the pseudo-locked region:

1) Memory access latency using the pseudo_lock_mem_latency tracepoint. Data
   from these measurements are best visualized using a hist trigger (see
   example below). In this test the pseudo-locked region is traversed at
   a stride of 32 bytes while hardware prefetchers and preemption
   are disabled. This also provides a substitute visualization of cache
   hits and misses.
2) Cache hit and miss measurements using model specific precision counters if
   available. Depending on the levels of cache on the system the pseudo_lock_l2
   and pseudo_lock_l3 tracepoints are available.

When a pseudo-locked region is created a new directory is created for
it in debugfs as /sys/kernel/debug/resctrl/<newdir>. A single
write-only file, pseudo_lock_measure, is present in this directory. The
measurement of the pseudo-locked region depends on the number written to this
debugfs file:

1:
     writing "1" to the pseudo_lock_measure file will trigger the latency
     measurement captured in the pseudo_lock_mem_latency tracepoint. See
     example below.
2:
     writing "2" to the pseudo_lock_measure file will trigger the L2 cache
     residency (cache hits and misses) measurement captured in the
     pseudo_lock_l2 tracepoint. See example below.
3:
     writing "3" to the pseudo_lock_measure file will trigger the L3 cache
     residency (cache hits and misses) measurement captured in the
     pseudo_lock_l3 tracepoint.

All measurements are recorded with the tracing infrastructure. This requires
the relevant tracepoints to be enabled before the measurement is triggered.

Example of latency debugging interface
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In this example a pseudo-locked region named "newlock" was created. Here is
how we can measure the latency in cycles of reading from this region and
visualize this data with a histogram that is available if CONFIG_HIST_TRIGGERS
is set::

  # :> /sys/kernel/tracing/trace
  # echo 'hist:keys=latency' > /sys/kernel/tracing/events/resctrl/pseudo_lock_mem_latency/trigger
  # echo 1 > /sys/kernel/tracing/events/resctrl/pseudo_lock_mem_latency/enable
  # echo 1 > /sys/kernel/debug/resctrl/newlock/pseudo_lock_measure
  # echo 0 > /sys/kernel/tracing/events/resctrl/pseudo_lock_mem_latency/enable
  # cat /sys/kernel/tracing/events/resctrl/pseudo_lock_mem_latency/hist

  # event histogram
  #
  # trigger info: hist:keys=latency:vals=hitcount:sort=hitcount:size=2048 [active]
  #

  { latency:        456 } hitcount:          1
  { latency:         50 } hitcount:         83
  { latency:         36 } hitcount:         96
  { latency:         44 } hitcount:        174
  { latency:         48 } hitcount:        195
  { latency:         46 } hitcount:        262
  { latency:         42 } hitcount:        693
  { latency:         40 } hitcount:       3204
  { latency:         38 } hitcount:       3484

  Totals:
      Hits: 8192
      Entries: 9
    Dropped: 0
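
The histogram can be condensed into a single figure with awk. The sketch
below assumes the line layout shown above ("{ latency: N } hitcount: M");
the function name "hist_summary" is hypothetical:

```shell
# Read a latency histogram on stdin and print the total number of hits and
# the hit-weighted mean latency in cycles.
hist_summary() {
    awk '$2 == "latency:" && $5 == "hitcount:" { hits += $6; sum += $3 * $6 }
         END { if (hits) printf "hits=%d mean=%.1f\n", hits, sum / hits }'
}
```

Piping the output above through "hist_summary" gives "hits=8192 mean=39.9",
where the hit count matches the "Hits" total reported by the histogram.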

Example of cache hits/misses debugging
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In this example a pseudo-locked region named "newlock" was created on the L2
cache of a platform. Here is how we can obtain details of the cache hits
and misses using the platform's precision counters.
::

  # :> /sys/kernel/tracing/trace
  # echo 1 > /sys/kernel/tracing/events/resctrl/pseudo_lock_l2/enable
  # echo 2 > /sys/kernel/debug/resctrl/newlock/pseudo_lock_measure
  # echo 0 > /sys/kernel/tracing/events/resctrl/pseudo_lock_l2/enable
  # cat /sys/kernel/tracing/trace

  # tracer: nop
  #
  #                              _-----=> irqs-off
  #                             / _----=> need-resched
  #                            | / _---=> hardirq/softirq
  #                            || / _--=> preempt-depth
  #                            ||| /     delay
  #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
  #              | |       |   ||||       |         |
  pseudo_lock_mea-1672  [002] ....  3132.860500: pseudo_lock_l2: hits=4097 miss=0


Examples for RDT allocation usage
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

1) Example 1

On a two socket machine (one L3 cache per socket) with just four bits
for cache bit masks, minimum b/w of 10% with a memory bandwidth
granularity of 10%.
::

  # mount -t resctrl resctrl /sys/fs/resctrl
  # cd /sys/fs/resctrl
  # mkdir p0 p1
  # echo -e "L3:0=3;1=c\nMB:0=50;1=50" > /sys/fs/resctrl/p0/schemata
  # echo -e "L3:0=3;1=3\nMB:0=50;1=50" > /sys/fs/resctrl/p1/schemata

The default resource group is unmodified, so we have access to all parts
of all caches (its schemata file reads "L3:0=f;1=f").

Tasks that are under the control of group "p0" may only allocate from the
"lower" 50% on cache ID 0, and the "upper" 50% of cache ID 1.
Tasks in group "p1" use the "lower" 50% of cache on both sockets.
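
Whether two groups compete for the same cache portion can be checked by
AND-ing their masks. The helper name "cbm_overlap" is hypothetical and it
expects hexadecimal masks without a "0x" prefix, as they appear in the
schemata file:

```shell
# Succeed if two hexadecimal CBMs share at least one bit.
cbm_overlap() {
    [ $(( 0x$1 & 0x$2 )) -ne 0 ]
}

cbm_overlap 3 3 && echo "cache ID 0: p0 and p1 overlap"
cbm_overlap c 3 || echo "cache ID 1: p0 and p1 are disjoint"
```

With the masks from this example, "p0" and "p1" share the "lower" half of
cache ID 0 but use disjoint halves of cache ID 1.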

Similarly, tasks that are under the control of group "p0" may use a
maximum memory b/w of 50% on socket 0 and 50% on socket 1.
Tasks in group "p1" may also use 50% memory b/w on both sockets.
Note that unlike cache masks, memory b/w cannot specify whether these
allocations can overlap or not. The allocation specifies the maximum
b/w that the group may be able to use and the system admin can configure
the b/w accordingly.

If resctrl is using the software controller (mba_sc) then the user can enter
the max b/w in MiBps rather than the percentage values.
::

  # echo -e "L3:0=3;1=c\nMB:0=1024;1=500" > /sys/fs/resctrl/p0/schemata
  # echo -e "L3:0=3;1=3\nMB:0=1024;1=500" > /sys/fs/resctrl/p1/schemata

In the above example the tasks in "p1" and "p0" on socket 0 would use a max b/w
of 1024MiBps whereas on socket 1 they would use 500MiBps.

2) Example 2

Again two sockets, but this time with a more realistic 20-bit mask.

Two real-time tasks, pid=1234 running on processor 0 and pid=5678 running on
processor 1, both on socket 0 of a 2-socket, dual-core machine. To avoid noisy
neighbors, each of the two real-time tasks exclusively occupies one quarter
of L3 cache on socket 0.
::

  # mount -t resctrl resctrl /sys/fs/resctrl
  # cd /sys/fs/resctrl

First we reset the schemata for the default group so that the "upper"
50% of the L3 cache on socket 0 and 50% of memory b/w cannot be used by
ordinary tasks::

  # echo -e "L3:0=3ff;1=fffff\nMB:0=50;1=100" > schemata

Next we make a resource group for our first real time task and give
it access to the "top" 25% of the cache on socket 0.
::

  # mkdir p0
  # echo "L3:0=f8000;1=fffff" > p0/schemata

Finally we move our first real time task into this resource group. We
also use taskset(1) to ensure the task always runs on a dedicated CPU
on socket 0. Most uses of resource groups will also constrain which
processors tasks run on.
::

  # echo 1234 > p0/tasks
  # taskset -cp 0 1234

Ditto for the second real time task (with the remaining 25% of cache)::

  # mkdir p1
  # echo "L3:0=7c00;1=fffff" > p1/schemata
  # echo 5678 > p1/tasks
  # taskset -cp 1 5678

For the same 2 socket system with memory b/w resource and CAT L3 the
schemata would look like this (assuming min_bandwidth is 10 and
bandwidth_gran is 10):

For our first real time task this would request 20% memory b/w on socket 0.
::

  # echo -e "L3:0=f8000;1=fffff\nMB:0=20;1=100" > p0/schemata

For our second real time task this would request another 20% memory b/w
on socket 0.
::

  # echo -e "L3:0=7c00;1=fffff\nMB:0=20;1=100" > p1/schemata
979*7168ae33SJames Morse
980*7168ae33SJames Morse3) Example 3
981*7168ae33SJames Morse
982*7168ae33SJames MorseA single socket system which has real-time tasks running on core 4-7 and
983*7168ae33SJames Morsenon real-time workload assigned to core 0-3. The real-time tasks share text
984*7168ae33SJames Morseand data, so a per task association is not required and due to interaction
985*7168ae33SJames Morsewith the kernel it's desired that the kernel on these cores shares L3 with
986*7168ae33SJames Morsethe tasks.
987*7168ae33SJames Morse::
988*7168ae33SJames Morse
989*7168ae33SJames Morse  # mount -t resctrl resctrl /sys/fs/resctrl
990*7168ae33SJames Morse  # cd /sys/fs/resctrl
991*7168ae33SJames Morse
992*7168ae33SJames MorseFirst we reset the schemata for the default group so that the "upper"
993*7168ae33SJames Morse50% of the L3 cache on socket 0, and 50% of memory bandwidth on socket 0
994*7168ae33SJames Morsecannot be used by ordinary tasks::
995*7168ae33SJames Morse
  # echo -e "L3:0=3ff\nMB:0=50" > schemata
997*7168ae33SJames Morse
998*7168ae33SJames MorseNext we make a resource group for our real time cores and give it access
999*7168ae33SJames Morseto the "top" 50% of the cache on socket 0 and 50% of memory bandwidth on
1000*7168ae33SJames Morsesocket 0.
1001*7168ae33SJames Morse::
1002*7168ae33SJames Morse
1003*7168ae33SJames Morse  # mkdir p0
  # echo -e "L3:0=ffc00\nMB:0=50" > p0/schemata
1005*7168ae33SJames Morse
Finally we move cores 4-7 over to the new group and make sure that the
kernel and the tasks running there get 50% of the cache. They should
also get 50% of memory bandwidth, assuming that cores 4-7 are SMT
siblings and only the real time threads are scheduled on cores 4-7.
1010*7168ae33SJames Morse::
1011*7168ae33SJames Morse
1012*7168ae33SJames Morse  # echo F0 > p0/cpus
1013*7168ae33SJames Morse
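The value written to the "cpus" file is a hexadecimal bitmask in which bit N
stands for CPU N. The sketch below decodes such a mask so it can be checked
before it is written. ``mask_to_cpus`` is a hypothetical helper, and it
assumes the mask fits in the shell's integer width (fewer than 63 CPUs).

```shell
# mask_to_cpus HEXMASK: print the CPU numbers whose bits are set in HEXMASK.
mask_to_cpus() {
    mask=$(( 0x$1 ))
    cpu=0
    out=""
    while [ "$mask" -ne 0 ]; do
        if [ $(( mask & 1 )) -eq 1 ]; then
            out="$out${out:+ }$cpu"
        fi
        mask=$(( mask >> 1 ))
        cpu=$(( cpu + 1 ))
    done
    echo "$out"
}

mask_to_cpus F0    # prints: 4 5 6 7
```
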
1014*7168ae33SJames Morse4) Example 4
1015*7168ae33SJames Morse
1016*7168ae33SJames MorseThe resource groups in previous examples were all in the default "shareable"
1017*7168ae33SJames Morsemode allowing sharing of their cache allocations. If one resource group
configures a cache allocation then nothing prevents another resource group
from overlapping with that allocation.
1020*7168ae33SJames Morse
In this example a new exclusive resource group will be created on an L2 CAT
1022*7168ae33SJames Morsesystem with two L2 cache instances that can be configured with an 8-bit
1023*7168ae33SJames Morsecapacity bitmask. The new exclusive resource group will be configured to use
1024*7168ae33SJames Morse25% of each cache instance.
1025*7168ae33SJames Morse::
1026*7168ae33SJames Morse
1027*7168ae33SJames Morse  # mount -t resctrl resctrl /sys/fs/resctrl/
1028*7168ae33SJames Morse  # cd /sys/fs/resctrl
1029*7168ae33SJames Morse
1030*7168ae33SJames MorseFirst, we observe that the default group is configured to allocate to all L2
1031*7168ae33SJames Morsecache::
1032*7168ae33SJames Morse
1033*7168ae33SJames Morse  # cat schemata
1034*7168ae33SJames Morse  L2:0=ff;1=ff
1035*7168ae33SJames Morse
We could attempt to make the new resource group exclusive at this point, but
that will fail because its schemata overlaps with the schemata of the default
group::
1038*7168ae33SJames Morse
1039*7168ae33SJames Morse  # mkdir p0
1040*7168ae33SJames Morse  # echo 'L2:0=0x3;1=0x3' > p0/schemata
1041*7168ae33SJames Morse  # cat p0/mode
1042*7168ae33SJames Morse  shareable
1043*7168ae33SJames Morse  # echo exclusive > p0/mode
1044*7168ae33SJames Morse  -sh: echo: write error: Invalid argument
1045*7168ae33SJames Morse  # cat info/last_cmd_status
1046*7168ae33SJames Morse  schemata overlaps
1047*7168ae33SJames Morse
1048*7168ae33SJames MorseTo ensure that there is no overlap with another resource group the default
1049*7168ae33SJames Morseresource group's schemata has to change, making it possible for the new
1050*7168ae33SJames Morseresource group to become exclusive.
1051*7168ae33SJames Morse::
1052*7168ae33SJames Morse
1053*7168ae33SJames Morse  # echo 'L2:0=0xfc;1=0xfc' > schemata
1054*7168ae33SJames Morse  # echo exclusive > p0/mode
1055*7168ae33SJames Morse  # grep . p0/*
1056*7168ae33SJames Morse  p0/cpus:0
1057*7168ae33SJames Morse  p0/mode:exclusive
1058*7168ae33SJames Morse  p0/schemata:L2:0=03;1=03
1059*7168ae33SJames Morse  p0/size:L2:0=262144;1=262144
1060*7168ae33SJames Morse
On creation, a new resource group will not overlap with an exclusive resource
group::
1063*7168ae33SJames Morse
1064*7168ae33SJames Morse  # mkdir p1
1065*7168ae33SJames Morse  # grep . p1/*
1066*7168ae33SJames Morse  p1/cpus:0
1067*7168ae33SJames Morse  p1/mode:shareable
1068*7168ae33SJames Morse  p1/schemata:L2:0=fc;1=fc
1069*7168ae33SJames Morse  p1/size:L2:0=786432;1=786432
1070*7168ae33SJames Morse
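The "size" values shown for p0 and p1 are consistent with every bit of the
capacity bitmask standing for an equal share of the cache instance. The
sketch below reproduces them, assuming an 8-bit bitmask and a 1 MiB
(1048576 byte) L2 instance. The cache size is inferred from the output above
and ``cbm_size`` is a hypothetical helper.

```shell
# cbm_size MASK CBM_LEN CACHE_BYTES: print the number of bytes covered by
# the bits set in MASK (given with a 0x prefix).
cbm_size() {
    mask=$(( $1 ))
    bits=0
    while [ "$mask" -ne 0 ]; do
        bits=$(( bits + (mask & 1) ))
        mask=$(( mask >> 1 ))
    done
    echo $(( $3 / $2 * bits ))
}

cbm_size 0x03 8 1048576    # prints 262144
cbm_size 0xfc 8 1048576    # prints 786432
```
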
1071*7168ae33SJames MorseThe bit_usage will reflect how the cache is used::
1072*7168ae33SJames Morse
1073*7168ae33SJames Morse  # cat info/L2/bit_usage
1074*7168ae33SJames Morse  0=SSSSSSEE;1=SSSSSSEE
1075*7168ae33SJames Morse
1076*7168ae33SJames MorseA resource group cannot be forced to overlap with an exclusive resource group::
1077*7168ae33SJames Morse
1078*7168ae33SJames Morse  # echo 'L2:0=0x1;1=0x1' > p1/schemata
1079*7168ae33SJames Morse  -sh: echo: write error: Invalid argument
1080*7168ae33SJames Morse  # cat info/last_cmd_status
1081*7168ae33SJames Morse  overlaps with exclusive group
1082*7168ae33SJames Morse
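Both rejected writes in this example come down to the same test: two
capacity bitmasks overlap when their bitwise AND is non-zero. A minimal
sketch of that check follows. ``cbm_overlaps`` is a hypothetical helper, and
the check actually performed by the kernel may take more state into account.

```shell
# cbm_overlaps A B: succeed (exit status 0) if masks A and B share a bit.
# Masks are given with a 0x prefix.
cbm_overlaps() {
    [ $(( $1 & $2 )) -ne 0 ]
}

if cbm_overlaps 0x1 0x3; then echo "0x1 overlaps 0x3"; fi
if ! cbm_overlaps 0xfc 0x3; then echo "0xfc is disjoint from 0x3"; fi
```

Running such a check before writing a schemata can avoid the "Invalid
argument" errors shown above.
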
1083*7168ae33SJames MorseExample of Cache Pseudo-Locking
1084*7168ae33SJames Morse~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Lock a portion of the L2 cache with cache id 1 using CBM 0x3. The
pseudo-locked region is exposed at /dev/pseudo_lock/newlock, which an
application can open and pass to mmap().
1088*7168ae33SJames Morse::
1089*7168ae33SJames Morse
1090*7168ae33SJames Morse  # mount -t resctrl resctrl /sys/fs/resctrl/
1091*7168ae33SJames Morse  # cd /sys/fs/resctrl
1092*7168ae33SJames Morse
Ensure that there are bits available that can be pseudo-locked. Since only
unused bits can be pseudo-locked, the bits to be pseudo-locked need to be
removed from the default resource group's schemata::
1096*7168ae33SJames Morse
1097*7168ae33SJames Morse  # cat info/L2/bit_usage
1098*7168ae33SJames Morse  0=SSSSSSSS;1=SSSSSSSS
1099*7168ae33SJames Morse  # echo 'L2:1=0xfc' > schemata
1100*7168ae33SJames Morse  # cat info/L2/bit_usage
1101*7168ae33SJames Morse  0=SSSSSSSS;1=SSSSSS00
1102*7168ae33SJames Morse
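The bit_usage strings list the most significant bit first, so the two "0"
characters at the right of cache id 1 correspond to CBM 0x3. The sketch
below turns one per-instance bit_usage string into the mask of unused bits.
``unused_mask`` is a hypothetical helper for illustration.

```shell
# unused_mask BIT_USAGE: print the hex mask of the positions marked "0"
# (unused) in a per-instance bit_usage string such as SSSSSS00.
unused_mask() {
    s=$1
    mask=0
    while [ -n "$s" ]; do
        first=${s%"${s#?}"}    # first character of the remaining string
        s=${s#?}               # drop that character
        mask=$(( mask << 1 ))
        if [ "$first" = "0" ]; then
            mask=$(( mask | 1 ))
        fi
    done
    printf '%x\n' "$mask"
}

unused_mask SSSSSS00    # prints 3
```
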
1103*7168ae33SJames MorseCreate a new resource group that will be associated with the pseudo-locked
1104*7168ae33SJames Morseregion, indicate that it will be used for a pseudo-locked region, and
1105*7168ae33SJames Morseconfigure the requested pseudo-locked region capacity bitmask::
1106*7168ae33SJames Morse
1107*7168ae33SJames Morse  # mkdir newlock
1108*7168ae33SJames Morse  # echo pseudo-locksetup > newlock/mode
1109*7168ae33SJames Morse  # echo 'L2:1=0x3' > newlock/schemata
1110*7168ae33SJames Morse
1111*7168ae33SJames MorseOn success the resource group's mode will change to pseudo-locked, the
1112*7168ae33SJames Morsebit_usage will reflect the pseudo-locked region, and the character device
1113*7168ae33SJames Morseexposing the pseudo-locked region will exist::
1114*7168ae33SJames Morse
1115*7168ae33SJames Morse  # cat newlock/mode
1116*7168ae33SJames Morse  pseudo-locked
1117*7168ae33SJames Morse  # cat info/L2/bit_usage
1118*7168ae33SJames Morse  0=SSSSSSSS;1=SSSSSSPP
1119*7168ae33SJames Morse  # ls -l /dev/pseudo_lock/newlock
1120*7168ae33SJames Morse  crw------- 1 root root 243, 0 Apr  3 05:01 /dev/pseudo_lock/newlock
1121*7168ae33SJames Morse
1122*7168ae33SJames Morse::
1123*7168ae33SJames Morse
  /*
   * Example code to access one page of pseudo-locked cache region
   * from user space.
   */
  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <sched.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>
  #include <sys/mman.h>

  /*
   * It is required that the application runs with affinity to only
   * cores associated with the pseudo-locked region. Here the cpu
   * is hardcoded for convenience of example.
   */
  static int cpuid = 2;

  int main(int argc, char *argv[])
  {
    cpu_set_t cpuset;
    long page_size;
    void *mapping;
    int dev_fd;
    int ret;

    page_size = sysconf(_SC_PAGESIZE);

    CPU_ZERO(&cpuset);
    CPU_SET(cpuid, &cpuset);
    ret = sched_setaffinity(0, sizeof(cpuset), &cpuset);
    if (ret < 0) {
      perror("sched_setaffinity");
      exit(EXIT_FAILURE);
    }

    dev_fd = open("/dev/pseudo_lock/newlock", O_RDWR);
    if (dev_fd < 0) {
      perror("open");
      exit(EXIT_FAILURE);
    }

    mapping = mmap(0, page_size, PROT_READ | PROT_WRITE, MAP_SHARED,
            dev_fd, 0);
    if (mapping == MAP_FAILED) {
      perror("mmap");
      close(dev_fd);
      exit(EXIT_FAILURE);
    }

    /* Application interacts with pseudo-locked memory @mapping */

    ret = munmap(mapping, page_size);
    if (ret < 0) {
      perror("munmap");
      close(dev_fd);
      exit(EXIT_FAILURE);
    }

    close(dev_fd);
    exit(EXIT_SUCCESS);
  }
1187*7168ae33SJames Morse
1188*7168ae33SJames MorseLocking between applications
1189*7168ae33SJames Morse----------------------------
1190*7168ae33SJames Morse
1191*7168ae33SJames MorseCertain operations on the resctrl filesystem, composed of read/writes
1192*7168ae33SJames Morseto/from multiple files, must be atomic.
1193*7168ae33SJames Morse
1194*7168ae33SJames MorseAs an example, the allocation of an exclusive reservation of L3 cache
1195*7168ae33SJames Morseinvolves:
1196*7168ae33SJames Morse
  1. Read the CBMs from each directory or the per-resource "bit_usage"
  2. Find a contiguous set of bits in the global CBM bitmask that is clear
     in all of the directory CBMs
  3. Create a new directory
  4. Write the bits found in step 2 to the new directory's "schemata" file
1202*7168ae33SJames Morse
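Step 2 of the list above can be sketched as a first-fit search over the
combined mask of all existing groups. ``find_free_cbm`` is a hypothetical
helper. It assumes the caller has already ORed together the masks read in
step 1 and holds the lock recommended in this section while searching and
writing.

```shell
# find_free_cbm USED CBM_LEN NBITS: print the lowest hex mask of NBITS
# contiguous bits that are all clear in USED, or fail if none fits within
# CBM_LEN bits. USED is given with a 0x prefix.
find_free_cbm() {
    used=$(( $1 )); len=$2; n=$3
    want=$(( (1 << n) - 1 ))
    pos=0
    while [ $(( pos + n )) -le "$len" ]; do
        cand=$(( want << pos ))
        if [ $(( used & cand )) -eq 0 ]; then
            printf '%x\n' "$cand"
            return 0
        fi
        pos=$(( pos + 1 ))
    done
    return 1
}

find_free_cbm 0xfc 8 2    # prints 3
find_free_cbm 0x0f 8 2    # prints 30
```

The search itself is not atomic with respect to other applications, which
is why the whole read, search and write sequence needs to be serialized.
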
1203*7168ae33SJames MorseIf two applications attempt to allocate space concurrently then they can
1204*7168ae33SJames Morseend up allocating the same bits so the reservations are shared instead of
1205*7168ae33SJames Morseexclusive.
1206*7168ae33SJames Morse
To coordinate atomic operations on the resctrl filesystem and to avoid the
problem above, the following locking procedure is recommended.

Locking is based on flock, which is available in libc and also as a shell
command.
1212*7168ae33SJames Morse
Write lock:

 A) Take flock(LOCK_EX) on /sys/fs/resctrl
 B) Read/write the directory structure.
 C) Release the lock with flock(LOCK_UN)

Read lock:

 A) Take flock(LOCK_SH) on /sys/fs/resctrl
 B) If successful, read the directory structure.
 C) Release the lock with flock(LOCK_UN)
1224*7168ae33SJames Morse
1225*7168ae33SJames MorseExample with bash::
1226*7168ae33SJames Morse
  # Atomically read directory structure
  $ flock -s /sys/fs/resctrl/ find /sys/fs/resctrl

  # Read directory contents and create new subdirectory

  $ cat create-dir.sh
  #!/bin/bash
  find /sys/fs/resctrl/ > output.txt
  # function-of is a placeholder for the logic that picks a free mask
  mask=$(function-of output.txt)
  mkdir /sys/fs/resctrl/newres/
  echo "$mask" > /sys/fs/resctrl/newres/schemata

  $ flock /sys/fs/resctrl/ ./create-dir.sh
1239*7168ae33SJames Morse
1240*7168ae33SJames MorseExample with C::
1241*7168ae33SJames Morse
  /*
   * Example code to take advisory locks
   * before accessing resctrl filesystem
   */
  #include <fcntl.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>
  #include <sys/file.h>

  void resctrl_take_shared_lock(int fd)
  {
    int ret;

    /* take shared lock on resctrl filesystem */
    ret = flock(fd, LOCK_SH);
    if (ret) {
      perror("flock");
      exit(EXIT_FAILURE);
    }
  }

  void resctrl_take_exclusive_lock(int fd)
  {
    int ret;

    /* take exclusive lock on resctrl filesystem */
    ret = flock(fd, LOCK_EX);
    if (ret) {
      perror("flock");
      exit(EXIT_FAILURE);
    }
  }

  void resctrl_release_lock(int fd)
  {
    int ret;

    /* release lock on resctrl filesystem */
    ret = flock(fd, LOCK_UN);
    if (ret) {
      perror("flock");
      exit(EXIT_FAILURE);
    }
  }

  int main(void)
  {
    int fd;

    fd = open("/sys/fs/resctrl", O_RDONLY | O_DIRECTORY);
    if (fd == -1) {
      perror("open");
      exit(EXIT_FAILURE);
    }
    resctrl_take_shared_lock(fd);
    /* code to read directory contents */
    resctrl_release_lock(fd);

    resctrl_take_exclusive_lock(fd);
    /* code to read and write directory contents */
    resctrl_release_lock(fd);

    close(fd);
    return 0;
  }
1302*7168ae33SJames Morse
1303*7168ae33SJames MorseExamples for RDT Monitoring along with allocation usage
1304*7168ae33SJames Morse=======================================================
1305*7168ae33SJames MorseReading monitored data
1306*7168ae33SJames Morse----------------------
Reading an event file (for example mon_data/mon_L3_00/llc_occupancy) shows
the current snapshot of LLC occupancy of the corresponding MON group or
CTRL_MON group.
1310*7168ae33SJames Morse
1311*7168ae33SJames Morse
1312*7168ae33SJames MorseExample 1 (Monitor CTRL_MON group and subset of tasks in CTRL_MON group)
1313*7168ae33SJames Morse------------------------------------------------------------------------
1314*7168ae33SJames MorseOn a two socket machine (one L3 cache per socket) with just four bits
1315*7168ae33SJames Morsefor cache bit masks::
1316*7168ae33SJames Morse
1317*7168ae33SJames Morse  # mount -t resctrl resctrl /sys/fs/resctrl
1318*7168ae33SJames Morse  # cd /sys/fs/resctrl
1319*7168ae33SJames Morse  # mkdir p0 p1
1320*7168ae33SJames Morse  # echo "L3:0=3;1=c" > /sys/fs/resctrl/p0/schemata
1321*7168ae33SJames Morse  # echo "L3:0=3;1=3" > /sys/fs/resctrl/p1/schemata
1322*7168ae33SJames Morse  # echo 5678 > p1/tasks
1323*7168ae33SJames Morse  # echo 5679 > p1/tasks
1324*7168ae33SJames Morse
1325*7168ae33SJames MorseThe default resource group is unmodified, so we have access to all parts
1326*7168ae33SJames Morseof all caches (its schemata file reads "L3:0=f;1=f").
1327*7168ae33SJames Morse
1328*7168ae33SJames MorseTasks that are under the control of group "p0" may only allocate from the
1329*7168ae33SJames Morse"lower" 50% on cache ID 0, and the "upper" 50% of cache ID 1.
1330*7168ae33SJames MorseTasks in group "p1" use the "lower" 50% of cache on both sockets.
1331*7168ae33SJames Morse
1332*7168ae33SJames MorseCreate monitor groups and assign a subset of tasks to each monitor group.
1333*7168ae33SJames Morse::
1334*7168ae33SJames Morse
1335*7168ae33SJames Morse  # cd /sys/fs/resctrl/p1/mon_groups
1336*7168ae33SJames Morse  # mkdir m11 m12
1337*7168ae33SJames Morse  # echo 5678 > m11/tasks
1338*7168ae33SJames Morse  # echo 5679 > m12/tasks
1339*7168ae33SJames Morse
Fetch the data (shown in bytes).
1341*7168ae33SJames Morse::
1342*7168ae33SJames Morse
1343*7168ae33SJames Morse  # cat m11/mon_data/mon_L3_00/llc_occupancy
1344*7168ae33SJames Morse  16234000
1345*7168ae33SJames Morse  # cat m11/mon_data/mon_L3_01/llc_occupancy
1346*7168ae33SJames Morse  14789000
1347*7168ae33SJames Morse  # cat m12/mon_data/mon_L3_00/llc_occupancy
1348*7168ae33SJames Morse  16789000
1349*7168ae33SJames Morse
The parent CTRL_MON group shows the aggregated data.
::

  # cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy
1354*7168ae33SJames Morse  31234000
1355*7168ae33SJames Morse
1356*7168ae33SJames MorseExample 2 (Monitor a task from its creation)
1357*7168ae33SJames Morse--------------------------------------------
1358*7168ae33SJames MorseOn a two socket machine (one L3 cache per socket)::
1359*7168ae33SJames Morse
1360*7168ae33SJames Morse  # mount -t resctrl resctrl /sys/fs/resctrl
1361*7168ae33SJames Morse  # cd /sys/fs/resctrl
1362*7168ae33SJames Morse  # mkdir p0 p1
1363*7168ae33SJames Morse
An RMID is allocated to the group once it is created, and hence the <cmd>
below is monitored from its creation.
1366*7168ae33SJames Morse::
1367*7168ae33SJames Morse
1368*7168ae33SJames Morse  # echo $$ > /sys/fs/resctrl/p1/tasks
1369*7168ae33SJames Morse  # <cmd>
1370*7168ae33SJames Morse
1371*7168ae33SJames MorseFetch the data::
1372*7168ae33SJames Morse
  # cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy
1374*7168ae33SJames Morse  31789000
1375*7168ae33SJames Morse
1376*7168ae33SJames MorseExample 3 (Monitor without CAT support or before creating CAT groups)
1377*7168ae33SJames Morse---------------------------------------------------------------------
1378*7168ae33SJames Morse
Assume a system like HSW that has only CQM and no CAT support. In this case
resctrl will still mount, but CTRL_MON directories cannot be created.
The user can still create different MON groups within the root group and
so monitor all tasks, including kernel threads.

This can also be used to profile a job's cache footprint before allocating
it to a particular allocation group.
1386*7168ae33SJames Morse::
1387*7168ae33SJames Morse
1388*7168ae33SJames Morse  # mount -t resctrl resctrl /sys/fs/resctrl
1389*7168ae33SJames Morse  # cd /sys/fs/resctrl
1390*7168ae33SJames Morse  # mkdir mon_groups/m01
1391*7168ae33SJames Morse  # mkdir mon_groups/m02
1392*7168ae33SJames Morse
1393*7168ae33SJames Morse  # echo 3478 > /sys/fs/resctrl/mon_groups/m01/tasks
1394*7168ae33SJames Morse  # echo 2467 > /sys/fs/resctrl/mon_groups/m02/tasks
1395*7168ae33SJames Morse
Monitor the groups separately and also get per domain data. From the
output below it is apparent that the tasks are mostly doing work on
domain (socket) 0.
1399*7168ae33SJames Morse::
1400*7168ae33SJames Morse
  # cat /sys/fs/resctrl/mon_groups/m01/mon_data/mon_L3_00/llc_occupancy
  31234000
  # cat /sys/fs/resctrl/mon_groups/m01/mon_data/mon_L3_01/llc_occupancy
  34555
  # cat /sys/fs/resctrl/mon_groups/m02/mon_data/mon_L3_00/llc_occupancy
  31234000
  # cat /sys/fs/resctrl/mon_groups/m02/mon_data/mon_L3_01/llc_occupancy
  32789
1409*7168ae33SJames Morse
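When only the total footprint of a group is of interest, the per domain
readings can be added up. The sketch below assumes the event files live
under the group's mon_data directory and skips any reading that is not a
plain number. ``sum_occupancy`` is a hypothetical helper.

```shell
# sum_occupancy GROUP_DIR: add up llc_occupancy over all L3 domains found
# under GROUP_DIR/mon_data and print the total in bytes.
sum_occupancy() {
    total=0
    for f in "$1"/mon_data/mon_L3_*/llc_occupancy; do
        [ -r "$f" ] || continue
        val=$(cat "$f")
        case $val in
            ''|*[!0-9]*) continue ;;    # skip non-numeric contents
        esac
        total=$(( total + val ))
    done
    echo "$total"
}
```

For the m01 readings shown above, ``sum_occupancy
/sys/fs/resctrl/mon_groups/m01`` would print 31268555.
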
1410*7168ae33SJames Morse
1411*7168ae33SJames MorseExample 4 (Monitor real time tasks)
1412*7168ae33SJames Morse-----------------------------------
1413*7168ae33SJames Morse
1414*7168ae33SJames MorseA single socket system which has real time tasks running on cores 4-7
1415*7168ae33SJames Morseand non real time tasks on other cpus. We want to monitor the cache
1416*7168ae33SJames Morseoccupancy of the real time threads on these cores.
1417*7168ae33SJames Morse::
1418*7168ae33SJames Morse
1419*7168ae33SJames Morse  # mount -t resctrl resctrl /sys/fs/resctrl
1420*7168ae33SJames Morse  # cd /sys/fs/resctrl
1421*7168ae33SJames Morse  # mkdir p1
1422*7168ae33SJames Morse
1423*7168ae33SJames MorseMove the cpus 4-7 over to p1::
1424*7168ae33SJames Morse
1425*7168ae33SJames Morse  # echo f0 > p1/cpus
1426*7168ae33SJames Morse
1427*7168ae33SJames MorseView the llc occupancy snapshot::
1428*7168ae33SJames Morse
1429*7168ae33SJames Morse  # cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy
1430*7168ae33SJames Morse  11234000
1431*7168ae33SJames Morse
1432*7168ae33SJames MorseIntel RDT Errata
1433*7168ae33SJames Morse================
1434*7168ae33SJames Morse
1435*7168ae33SJames MorseIntel MBM Counters May Report System Memory Bandwidth Incorrectly
1436*7168ae33SJames Morse-----------------------------------------------------------------
1437*7168ae33SJames Morse
1438*7168ae33SJames MorseErrata SKX99 for Skylake server and BDF102 for Broadwell server.
1439*7168ae33SJames Morse
1440*7168ae33SJames MorseProblem: Intel Memory Bandwidth Monitoring (MBM) counters track metrics
1441*7168ae33SJames Morseaccording to the assigned Resource Monitor ID (RMID) for that logical
1442*7168ae33SJames Morsecore. The IA32_QM_CTR register (MSR 0xC8E), used to report these
1443*7168ae33SJames Morsemetrics, may report incorrect system bandwidth for certain RMID values.
1444*7168ae33SJames Morse
Implication: Due to the erratum, the reported bandwidth may not match the
actual system memory bandwidth.
1447*7168ae33SJames Morse
1448*7168ae33SJames MorseWorkaround: MBM total and local readings are corrected according to the
1449*7168ae33SJames Morsefollowing correction factor table:
1450*7168ae33SJames Morse
1451*7168ae33SJames Morse+---------------+---------------+---------------+-----------------+
1452*7168ae33SJames Morse|core count	|rmid count	|rmid threshold	|correction factor|
1453*7168ae33SJames Morse+---------------+---------------+---------------+-----------------+
1454*7168ae33SJames Morse|1		|8		|0		|1.000000	  |
1455*7168ae33SJames Morse+---------------+---------------+---------------+-----------------+
1456*7168ae33SJames Morse|2		|16		|0		|1.000000	  |
1457*7168ae33SJames Morse+---------------+---------------+---------------+-----------------+
1458*7168ae33SJames Morse|3		|24		|15		|0.969650	  |
1459*7168ae33SJames Morse+---------------+---------------+---------------+-----------------+
1460*7168ae33SJames Morse|4		|32		|0		|1.000000	  |
1461*7168ae33SJames Morse+---------------+---------------+---------------+-----------------+
1462*7168ae33SJames Morse|6		|48		|31		|0.969650	  |
1463*7168ae33SJames Morse+---------------+---------------+---------------+-----------------+
1464*7168ae33SJames Morse|7		|56		|47		|1.142857	  |
1465*7168ae33SJames Morse+---------------+---------------+---------------+-----------------+
1466*7168ae33SJames Morse|8		|64		|0		|1.000000	  |
1467*7168ae33SJames Morse+---------------+---------------+---------------+-----------------+
1468*7168ae33SJames Morse|9		|72		|63		|1.185115	  |
1469*7168ae33SJames Morse+---------------+---------------+---------------+-----------------+
1470*7168ae33SJames Morse|10		|80		|63		|1.066553	  |
1471*7168ae33SJames Morse+---------------+---------------+---------------+-----------------+
1472*7168ae33SJames Morse|11		|88		|79		|1.454545	  |
1473*7168ae33SJames Morse+---------------+---------------+---------------+-----------------+
1474*7168ae33SJames Morse|12		|96		|0		|1.000000	  |
1475*7168ae33SJames Morse+---------------+---------------+---------------+-----------------+
1476*7168ae33SJames Morse|13		|104		|95		|1.230769	  |
1477*7168ae33SJames Morse+---------------+---------------+---------------+-----------------+
1478*7168ae33SJames Morse|14		|112		|95		|1.142857	  |
1479*7168ae33SJames Morse+---------------+---------------+---------------+-----------------+
1480*7168ae33SJames Morse|15		|120		|95		|1.066667	  |
1481*7168ae33SJames Morse+---------------+---------------+---------------+-----------------+
1482*7168ae33SJames Morse|16		|128		|0		|1.000000	  |
1483*7168ae33SJames Morse+---------------+---------------+---------------+-----------------+
1484*7168ae33SJames Morse|17		|136		|127		|1.254863	  |
1485*7168ae33SJames Morse+---------------+---------------+---------------+-----------------+
1486*7168ae33SJames Morse|18		|144		|127		|1.185255	  |
1487*7168ae33SJames Morse+---------------+---------------+---------------+-----------------+
1488*7168ae33SJames Morse|19		|152		|0		|1.000000	  |
1489*7168ae33SJames Morse+---------------+---------------+---------------+-----------------+
1490*7168ae33SJames Morse|20		|160		|127		|1.066667	  |
1491*7168ae33SJames Morse+---------------+---------------+---------------+-----------------+
1492*7168ae33SJames Morse|21		|168		|0		|1.000000	  |
1493*7168ae33SJames Morse+---------------+---------------+---------------+-----------------+
1494*7168ae33SJames Morse|22		|176		|159		|1.454334	  |
1495*7168ae33SJames Morse+---------------+---------------+---------------+-----------------+
1496*7168ae33SJames Morse|23		|184		|0		|1.000000	  |
1497*7168ae33SJames Morse+---------------+---------------+---------------+-----------------+
1498*7168ae33SJames Morse|24		|192		|127		|0.969744	  |
1499*7168ae33SJames Morse+---------------+---------------+---------------+-----------------+
1500*7168ae33SJames Morse|25		|200		|191		|1.280246	  |
1501*7168ae33SJames Morse+---------------+---------------+---------------+-----------------+
1502*7168ae33SJames Morse|26		|208		|191		|1.230921	  |
1503*7168ae33SJames Morse+---------------+---------------+---------------+-----------------+
1504*7168ae33SJames Morse|27		|216		|0		|1.000000	  |
1505*7168ae33SJames Morse+---------------+---------------+---------------+-----------------+
1506*7168ae33SJames Morse|28		|224		|191		|1.143118	  |
1507*7168ae33SJames Morse+---------------+---------------+---------------+-----------------+
1508*7168ae33SJames Morse
1509*7168ae33SJames MorseIf rmid > rmid threshold, MBM total and local values should be multiplied
1510*7168ae33SJames Morseby the correction factor.
1511*7168ae33SJames Morse
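The rule above can be sketched as follows. ``mbm_correct`` is a hypothetical
helper that takes the threshold and factor from the matching table row. It
uses awk floating point for brevity, which is an assumption of this sketch
and not necessarily how the kernel computes the correction.

```shell
# mbm_correct VALUE RMID THRESHOLD FACTOR: print VALUE multiplied by FACTOR
# when RMID is above THRESHOLD, otherwise print VALUE unchanged.
mbm_correct() {
    awk -v v="$1" -v r="$2" -v t="$3" -v f="$4" 'BEGIN {
        if (r > t)
            v = v * f
        printf "%.0f\n", v
    }'
}

# Row for core count 3: rmid threshold 15, correction factor 0.969650
mbm_correct 1000000 20 15 0.969650    # prints 969650
mbm_correct 1000000 10 15 0.969650    # prints 1000000
```
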
For further information, see:

1. Erratum SKX99 in Intel Xeon Processor Scalable Family Specification Update:
http://web.archive.org/web/20200716124958/https://www.intel.com/content/www/us/en/processors/xeon/scalable/xeon-scalable-spec-update.html

2. Erratum BDF102 in Intel Xeon E5-2600 v4 Processor Product Family Specification Update:
http://web.archive.org/web/20191125200531/https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-e5-v4-spec-update.pdf

3. The errata in Intel Resource Director Technology (Intel RDT) on 2nd Generation Intel Xeon Scalable Processors Reference Manual:
https://software.intel.com/content/www/us/en/develop/articles/intel-resource-director-technology-rdt-reference-manual.html