.. SPDX-License-Identifier: GPL-2.0
.. include:: <isonum.txt>

=====================================================
User Interface for Resource Control feature (resctrl)
=====================================================

:Copyright: |copy| 2016 Intel Corporation
:Authors: - Fenghua Yu <fenghua.yu@intel.com>
          - Tony Luck <tony.luck@intel.com>
          - Vikas Shivappa <vikas.shivappa@intel.com>


Intel refers to this feature as Intel Resource Director Technology (Intel(R) RDT).
AMD refers to this feature as AMD Platform Quality of Service (AMD QoS).

This feature is enabled by the CONFIG_X86_CPU_RESCTRL kernel configuration
option and the x86 /proc/cpuinfo flag bits:

=============================================================== ================================
RDT (Resource Director Technology) Allocation			"rdt_a"
CAT (Cache Allocation Technology)				"cat_l3", "cat_l2"
CDP (Code and Data Prioritization)				"cdp_l3", "cdp_l2"
CQM (Cache QoS Monitoring)					"cqm_llc", "cqm_occup_llc"
MBM (Memory Bandwidth Monitoring)				"cqm_mbm_total", "cqm_mbm_local"
MBA (Memory Bandwidth Allocation)				"mba"
SMBA (Slow Memory Bandwidth Allocation)				""
BMEC (Bandwidth Monitoring Event Configuration)			""
ABMC (Assignable Bandwidth Monitoring Counters)			""
SDCIAE (Smart Data Cache Injection Allocation Enforcement)	""
=============================================================== ================================

Historically, new features were made visible by default in /proc/cpuinfo. This
resulted in the feature flags becoming hard to parse by humans. Adding a new
flag to /proc/cpuinfo should be avoided if user space can obtain information
about the feature from resctrl's info directory.
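
As a rough illustration of reading the flag bits listed in the table above,
the following sketch scans /proc/cpuinfo-style text for them. The helper name
``resctrl_features`` and the grouping dictionary are hypothetical and only
mirror the table; features with an empty flag column are left out because
they must be discovered through the info directory instead.

```python
# Flag names come from the table above; the grouping is illustrative only.
RESCTRL_FLAGS = {
    "RDT allocation": ["rdt_a"],
    "CAT": ["cat_l3", "cat_l2"],
    "CDP": ["cdp_l3", "cdp_l2"],
    "CQM": ["cqm_llc", "cqm_occup_llc"],
    "MBM": ["cqm_mbm_total", "cqm_mbm_local"],
    "MBA": ["mba"],
}


def resctrl_features(cpuinfo_text):
    """Return {feature: [flags found]} for every feature with a flag present."""
    present = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            present.update(value.split())
    found = {}
    for feature, flags in RESCTRL_FLAGS.items():
        hits = [flag for flag in flags if flag in present]
        if hits:
            found[feature] = hits
    return found


# Hypothetical sample input; on a real system read /proc/cpuinfo instead.
sample = "processor\t: 0\nflags\t\t: fpu vme rdt_a cat_l3 cqm_llc mba\n"
print(resctrl_features(sample))
```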

To use the feature, mount the file system::

 # mount -t resctrl resctrl [-o cdp[,cdpl2][,mba_MBps][,debug]] /sys/fs/resctrl

Mount options are:

"cdp":
	Enable code/data prioritization in L3 cache allocations.
"cdpl2":
	Enable code/data prioritization in L2 cache allocations.
"mba_MBps":
	Enable the MBA Software Controller (mba_sc) to specify MBA
	bandwidth in MiBps.
"debug":
	Make debug files accessible. Available debug files are annotated with
	"Available only with debug option".

L2 and L3 CDP are controlled separately.
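
The option syntax above can be assembled programmatically. The sketch below
only builds the argument list for the mount command shown earlier; the
function name ``resctrl_mount_argv`` and its keyword parameters are
hypothetical, and actually running the command requires root privileges and
hardware support.

```python
def resctrl_mount_argv(cdp=False, cdpl2=False, mba_mbps=False, debug=False,
                       mountpoint="/sys/fs/resctrl"):
    """Build the argv for mounting resctrl with the selected options."""
    opts = []
    if cdp:
        opts.append("cdp")
    if cdpl2:
        opts.append("cdpl2")
    if mba_mbps:
        opts.append("mba_MBps")
    if debug:
        opts.append("debug")
    argv = ["mount", "-t", "resctrl", "resctrl"]
    if opts:
        # Options are joined with commas, as in the synopsis above.
        argv += ["-o", ",".join(opts)]
    argv.append(mountpoint)
    return argv


print(" ".join(resctrl_mount_argv(cdp=True, debug=True)))
```

The resulting list could be passed to ``subprocess.run()``; it is kept as a
pure function here so that it can be inspected without mounting anything.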

RDT features are orthogonal. A particular system may support only
monitoring, only control, or both monitoring and control.  Cache
pseudo-locking is a unique way of using cache control to "pin" or
"lock" data in the cache. Details can be found in
"Cache Pseudo-Locking".


The mount succeeds if either allocation or monitoring is present, but
only those files and directories supported by the system will be created.
For more details on the behavior of the interface during monitoring
and allocation, see the "Resource alloc and monitor groups" section.

Info directory
==============

The 'info' directory contains information about the enabled
resources. Each resource has its own subdirectory. The subdirectory
names reflect the resource names.

Most of the files in the resource's subdirectory are read-only, and
describe properties of the resource. Resources that support global
configuration options also include writable files that can be used
to modify those settings.

The files related to allocation in each subdirectory depend on the
type of resource.

Cache resource (L3/L2) subdirectories contain the following files
related to allocation:

"num_closids":
		The number of CLOSIDs which are valid for this
		resource. The kernel uses the smallest number of
		CLOSIDs of all enabled resources as limit.
"cbm_mask":
		The bitmask which is valid for this resource.
		This mask is equivalent to 100%.
"min_cbm_bits":
		The minimum number of consecutive bits which
		must be set when writing a mask.

"shareable_bits":
		Bitmask of the resource that is shareable with other executing
		entities (e.g. I/O). Applies to all instances of this resource.
		User can use this when setting up exclusive cache partitions.
		Note that some platforms support devices that have their
		own settings for cache use which can override these bits.

		When "io_alloc" is enabled, a portion of each cache instance can
		be configured for shared use between hardware and software.
		"bit_usage" should be used to see which portions of each cache
		instance are configured for hardware use via the "io_alloc" feature
		because every cache instance can have its "io_alloc" bitmask
		configured independently via "io_alloc_cbm".

"bit_usage":
		Annotated capacity bitmasks showing how all
		instances of the resource are used. The legend is:

			"0":
			      Corresponding region is unused. When the system's
			      resources have been allocated and a "0" is found
			      in "bit_usage" it is a sign that resources are
			      wasted.

			"H":
			      Corresponding region is used by hardware only
			      but available for software use. If a resource
			      has bits set in "shareable_bits" or "io_alloc_cbm"
			      but not all of these bits appear in the resource
			      groups' schemata then the bits appearing in
			      "shareable_bits" or "io_alloc_cbm" but no
			      resource group will be marked as "H".
			"X":
			      Corresponding region is available for sharing and
			      used by hardware and software. These are the bits
			      that appear in "shareable_bits" or "io_alloc_cbm"
			      as well as a resource group's allocation.
			"S":
			      Corresponding region is used by software
			      and available for sharing.
			"E":
			      Corresponding region is used exclusively by
			      one resource group. No sharing allowed.
			"P":
			      Corresponding region is pseudo-locked. No
			      sharing allowed.
"sparse_masks":
		Indicates whether a CBM with non-contiguous 1s is supported.

			"0":
			      Only a CBM with contiguous 1s is supported.
			"1":
			      A CBM with non-contiguous 1s is supported.

"io_alloc":
		"io_alloc" enables system software to configure the portion of
		the cache allocated for I/O traffic. File may only exist if the
		system supports this feature on some of its cache resources.

			"disabled":
			      Resource supports "io_alloc" but the feature is disabled.
			      Portions of cache used for allocation of I/O traffic cannot
			      be configured.
			"enabled":
			      Portions of cache used for allocation of I/O traffic
			      can be configured using "io_alloc_cbm".
			"not supported":
			      Support not available for this resource.

		The feature can be modified by writing to the interface, for example:

		To enable::

			# echo 1 > /sys/fs/resctrl/info/L3/io_alloc

		To disable::

			# echo 0 > /sys/fs/resctrl/info/L3/io_alloc

		The underlying implementation may reduce resources available to
		general (CPU) cache allocation. See architecture specific notes
		below. Depending on usage requirements the feature can be enabled
		or disabled.

		On AMD systems, the io_alloc feature is supported by the L3 Smart
		Data Cache Injection Allocation Enforcement (SDCIAE). The CLOSID for
		io_alloc is the highest CLOSID supported by the resource. When
		io_alloc is enabled, the highest CLOSID is dedicated to io_alloc and
		no longer available for general (CPU) cache allocation. When CDP is
		enabled, io_alloc routes I/O traffic using the highest CLOSID allocated
		for the instruction cache (CDP_CODE), making this CLOSID no longer
		available for general (CPU) cache allocation for both the CDP_CODE
		and CDP_DATA resources.

"io_alloc_cbm":
		Capacity bitmasks that describe the portions of cache instances to
		which I/O traffic from supported I/O devices is routed when "io_alloc"
		is enabled.

		CBMs are displayed in the following format:

			<cache_id0>=<cbm>;<cache_id1>=<cbm>;...

		Example::

			# cat /sys/fs/resctrl/info/L3/io_alloc_cbm
			0=ffff;1=ffff

		CBMs can be configured by writing to the interface.

		Example::

			# echo 1=ff > /sys/fs/resctrl/info/L3/io_alloc_cbm
			# cat /sys/fs/resctrl/info/L3/io_alloc_cbm
			0=ffff;1=00ff

			# echo "0=ff;1=f" > /sys/fs/resctrl/info/L3/io_alloc_cbm
			# cat /sys/fs/resctrl/info/L3/io_alloc_cbm
			0=00ff;1=000f

		An ID of "*" configures all domains with the provided CBM.

		Example on a system that does not require a minimum number of consecutive bits in the mask::

			# echo "*=0" > /sys/fs/resctrl/info/L3/io_alloc_cbm
			# cat /sys/fs/resctrl/info/L3/io_alloc_cbm
			0=0;1=0

		When CDP is enabled "io_alloc_cbm" associated with the CDP_DATA and CDP_CODE
		resources may reflect the same values. For example, values read from and
		written to /sys/fs/resctrl/info/L3DATA/io_alloc_cbm may be reflected by
		/sys/fs/resctrl/info/L3CODE/io_alloc_cbm and vice versa.
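
The capacity bitmask rules described by "cbm_mask", "min_cbm_bits" and
"sparse_masks", and the per-cache format used by "io_alloc_cbm", can be
modelled in user space roughly as follows. This is a minimal sketch, not the
kernel's validation code: all function names are hypothetical, and treating
"min_cbm_bits" as a plain bit count when sparse masks are allowed is an
assumption.

```python
def is_valid_cbm(cbm, cbm_mask, min_cbm_bits=1, sparse_masks=False):
    """Check a CBM against limits read from the info directory."""
    if cbm & ~cbm_mask:
        return False                  # bits set outside the valid mask
    if bin(cbm).count("1") < min_cbm_bits:
        return False                  # too few bits set
    if not sparse_masks and cbm:
        # Shift out trailing zeros; a contiguous run is then 0b11...1,
        # so ANDing it with itself plus one gives zero.
        run = cbm >> ((cbm & -cbm).bit_length() - 1)
        if run & (run + 1):
            return False
    return True


def parse_domain_cbms(text):
    """Parse '<id>=<cbm>;<id>=<cbm>' into {id: int}; an id may be '*'."""
    result = {}
    for item in text.strip().split(";"):
        dom, _, cbm = item.partition("=")
        result[dom.strip()] = int(cbm, 16)
    return result


def apply_cbm_update(current, update):
    """Return the new {cache_id: cbm} map; '*' applies to every cache id."""
    merged = dict(current)
    for dom, cbm in update.items():
        if dom == "*":
            merged = {d: cbm for d in merged}
        elif dom in merged:
            merged[dom] = cbm
        else:
            raise KeyError(f"unknown cache id {dom!r}")
    return merged


current = parse_domain_cbms("0=ffff;1=ffff")
print(apply_cbm_update(current, parse_domain_cbms("1=ff")))
```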

Memory bandwidth (MB) subdirectory contains the following files
with respect to allocation:

"min_bandwidth":
		The minimum memory bandwidth percentage which
		user can request.

"bandwidth_gran":
		The granularity in which the memory bandwidth
		percentage is allocated. The allocated
		bandwidth percentage is rounded off to the next
		control step available on the hardware. The
		available bandwidth control steps are:
		min_bandwidth + N * bandwidth_gran.

"delay_linear":
		Indicates if the delay scale is linear or
		non-linear. This field is purely informational.

"thread_throttle_mode":
		Indicator on Intel systems of how tasks running on threads
		of a physical core are throttled in cases where they
		request different memory bandwidth percentages:

		"max":
			the smallest percentage is applied
			to all threads
		"per-thread":
			bandwidth percentages are directly applied to
			the threads running on the core
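
The control steps given by min_bandwidth + N * bandwidth_gran can be
enumerated with a few lines of code. The sketch below assumes that the steps
stop at 100 percent and reads "rounded off to the next control step" as
rounding up to the nearest step; both are assumptions, and the function names
are hypothetical.

```python
def bandwidth_steps(min_bandwidth, bandwidth_gran, max_percent=100):
    """List the steps min_bandwidth + N * bandwidth_gran up to max_percent."""
    if bandwidth_gran <= 0:
        raise ValueError("bandwidth_gran must be positive")
    return list(range(min_bandwidth, max_percent + 1, bandwidth_gran))


def next_step(requested, min_bandwidth, bandwidth_gran):
    """Return the smallest control step that is >= the requested percentage."""
    for step in bandwidth_steps(min_bandwidth, bandwidth_gran):
        if step >= requested:
            return step
    raise ValueError("requested percentage exceeds the largest step")


# Hypothetical values; read the real ones from info/MB/.
print(bandwidth_steps(10, 10))
print(next_step(35, 10, 10))
```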

If L3 monitoring is available there will be an "L3_MON" directory
with the following files:

"num_rmids":
		The number of RMIDs supported by hardware for
		L3 monitoring events.

"mon_features":
		Lists the monitoring events if
		monitoring is enabled for the resource.
		Example::

			# cat /sys/fs/resctrl/info/L3_MON/mon_features
			llc_occupancy
			mbm_total_bytes
			mbm_local_bytes

		If the system supports Bandwidth Monitoring Event
		Configuration (BMEC), then the bandwidth events will
		be configurable. The output will be::

			# cat /sys/fs/resctrl/info/L3_MON/mon_features
			llc_occupancy
			mbm_total_bytes
			mbm_total_bytes_config
			mbm_local_bytes
			mbm_local_bytes_config

"mbm_total_bytes_config", "mbm_local_bytes_config":
	Read/write files containing the configuration for the mbm_total_bytes
	and mbm_local_bytes events, respectively, when the Bandwidth
	Monitoring Event Configuration (BMEC) feature is supported.
	The event configuration settings are domain specific and affect
	all the CPUs in the domain. When either event configuration is
	changed, the bandwidth counters for all RMIDs of both events
	(mbm_total_bytes as well as mbm_local_bytes) are cleared for that
	domain. The next read for every RMID will report "Unavailable"
	and subsequent reads will report the valid value.

	Following are the types of events supported:

	====    ========================================================
	Bits    Description
	====    ========================================================
	6       Dirty Victims from the QOS domain to all types of memory
	5       Reads to slow memory in the non-local NUMA domain
	4       Reads to slow memory in the local NUMA domain
	3       Non-temporal writes to non-local NUMA domain
	2       Non-temporal writes to local NUMA domain
	1       Reads to memory in the non-local NUMA domain
	0       Reads to memory in the local NUMA domain
	====    ========================================================

	By default, the mbm_total_bytes configuration is set to 0x7f to count
	all the event types and the mbm_local_bytes configuration is set to
	0x15 to count all the local memory events.

	Examples:

	* To view the current configuration:
	  ::

	    # cat /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config
	    0=0x7f;1=0x7f;2=0x7f;3=0x7f

	    # cat /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config
	    0=0x15;1=0x15;3=0x15;4=0x15

	* To change the mbm_total_bytes to count only reads on domain 0,
	  the bits 0, 1, 4 and 5 need to be set, which is 110011b in binary
	  (in hexadecimal 0x33):
	  ::

	    # echo  "0=0x33" > /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config

	    # cat /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config
	    0=0x33;1=0x7f;2=0x7f;3=0x7f

	* To change the mbm_local_bytes to count all the slow memory reads on
	  domain 0 and 1, the bits 4 and 5 need to be set, which is 110000b
	  in binary (in hexadecimal 0x30):
	  ::

	    # echo  "0=0x30;1=0x30" > /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config

	    # cat /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config
	    0=0x30;1=0x30;3=0x15;4=0x15
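
The bit arithmetic used in the BMEC examples above (0x33 for all reads, 0x30
for slow memory reads, 0x15 for local events) can be checked with a small
helper. This is an illustrative sketch: the dictionary copies the bit table
above, and the function names are hypothetical.

```python
# Bit positions and descriptions copied from the table above.
BMEC_EVENT_BITS = {
    6: "Dirty Victims from the QOS domain to all types of memory",
    5: "Reads to slow memory in the non-local NUMA domain",
    4: "Reads to slow memory in the local NUMA domain",
    3: "Non-temporal writes to non-local NUMA domain",
    2: "Non-temporal writes to local NUMA domain",
    1: "Reads to memory in the non-local NUMA domain",
    0: "Reads to memory in the local NUMA domain",
}


def encode_event_config(bits):
    """Combine bit positions into the value written to a *_config file."""
    value = 0
    for bit in bits:
        if bit not in BMEC_EVENT_BITS:
            raise ValueError(f"unsupported event bit {bit}")
        value |= 1 << bit
    return value


def decode_event_config(value):
    """Return the sorted bit positions that are set in a configuration."""
    return [bit for bit in sorted(BMEC_EVENT_BITS) if value >> bit & 1]


def parse_domain_configs(text):
    """Parse '0=0x33;1=0x7f' into {domain id: configuration value}."""
    pairs = (item.split("=") for item in text.strip().split(";"))
    return {int(dom): int(val, 16) for dom, val in pairs}


print(hex(encode_event_config([0, 1, 4, 5])))
for bit in decode_event_config(0x15):
    print(bit, BMEC_EVENT_BITS[bit])
```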

"mbm_assign_mode":
	The supported counter assignment modes. The enclosed brackets indicate which mode
	is enabled. The MBM events associated with counters may reset when "mbm_assign_mode"
	is changed.
	::

	  # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
	  [mbm_event]
	  default

	"mbm_event":

	mbm_event mode allows users to assign a hardware counter to an RMID, event
	pair and monitor the bandwidth usage as long as it is assigned. The hardware
	continues to track the assigned counter until it is explicitly unassigned by
	the user. Each event within a resctrl group can be assigned independently.

	In this mode, a monitoring event can only accumulate data while it is backed
	by a hardware counter. Use "mbm_L3_assignments" found in each CTRL_MON and MON
	group to specify which of the events should have a counter assigned. The number
	of counters available is described in the "num_mbm_cntrs" file. Changing the
	mode may cause all counters on the resource to reset.

	Moving to mbm_event counter assignment mode requires users to assign the counters
	to the events. Otherwise, the MBM event counters will return 'Unassigned' when read.

	The mode is beneficial for AMD platforms that support more CTRL_MON
	and MON groups than available hardware counters. By default, this
	feature is enabled on AMD platforms with the ABMC (Assignable Bandwidth
	Monitoring Counters) capability, ensuring counters remain assigned even
	when the corresponding RMID is not actively used by any processor.

	"default":

	In default mode, resctrl assumes there is a hardware counter for each
	event within every CTRL_MON and MON group. On AMD platforms, it is
	recommended to use the mbm_event mode, if supported, to prevent reset of MBM
	events between reads resulting from hardware re-allocating counters. This can
	result in misleading values or display "Unavailable" if no counter is assigned
	to the event.

	* To enable "mbm_event" counter assignment mode:
	  ::

	    # echo "mbm_event" > /sys/fs/resctrl/info/L3_MON/mbm_assign_mode

	* To enable "default" monitoring mode:
	  ::

	    # echo "default" > /sys/fs/resctrl/info/L3_MON/mbm_assign_mode

"num_mbm_cntrs":
	The maximum number of counters (total of available and assigned counters) in
	each domain when the system supports mbm_event mode.

	For example, on a system with maximum of 32 memory bandwidth monitoring
	counters in each of its L3 domains:
	::

	  # cat /sys/fs/resctrl/info/L3_MON/num_mbm_cntrs
	  0=32;1=32

"available_mbm_cntrs":
	The number of counters available for assignment in each domain when mbm_event
	mode is enabled on the system.

	For example, on a system with 30 available [hardware] assignable counters
	in each of its L3 domains:
	::

	  # cat /sys/fs/resctrl/info/L3_MON/available_mbm_cntrs
	  0=30;1=30

"event_configs":
	Directory that exists when "mbm_event" counter assignment mode is supported.
	Contains a sub-directory for each MBM event that can be assigned to a counter.

	Two MBM events are supported by default: mbm_local_bytes and mbm_total_bytes.
	Each MBM event's sub-directory contains a file named "event_filter" that is
	used to view and (if writable) modify which memory transactions the MBM event
	is configured with. The file is accessible only when "mbm_event" counter
	assignment mode is enabled.

	List of memory transaction types supported:

	==========================  ========================================================
	Name			    Description
	==========================  ========================================================
	dirty_victim_writes_all     Dirty Victims from the QOS domain to all types of memory
	remote_reads_slow_memory    Reads to slow memory in the non-local NUMA domain
	local_reads_slow_memory     Reads to slow memory in the local NUMA domain
	remote_non_temporal_writes  Non-temporal writes to non-local NUMA domain
	local_non_temporal_writes   Non-temporal writes to local NUMA domain
	remote_reads                Reads to memory in the non-local NUMA domain
	local_reads                 Reads to memory in the local NUMA domain
	==========================  ========================================================

	For example::

	  # cat /sys/fs/resctrl/info/L3_MON/event_configs/mbm_total_bytes/event_filter
	  local_reads,remote_reads,local_non_temporal_writes,remote_non_temporal_writes,
	  local_reads_slow_memory,remote_reads_slow_memory,dirty_victim_writes_all

	  # cat /sys/fs/resctrl/info/L3_MON/event_configs/mbm_local_bytes/event_filter
	  local_reads,local_non_temporal_writes,local_reads_slow_memory

	The memory transactions the MBM event is configured with can be changed
	if "event_filter" is writable.

	For example::

	  # echo "local_reads, local_non_temporal_writes" > \
	    /sys/fs/resctrl/info/L3_MON/event_configs/mbm_total_bytes/event_filter

	  # cat /sys/fs/resctrl/info/L3_MON/event_configs/mbm_total_bytes/event_filter
	  local_reads,local_non_temporal_writes
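
A script that edits "event_filter" may want to check the transaction names
before writing them. The sketch below parses the comma-separated list shown
above and rejects names that are not in the table; the function names are
hypothetical and the check is purely a user-space convenience.

```python
# Names copied from the table of supported memory transaction types.
SUPPORTED_TRANSACTIONS = (
    "local_reads",
    "remote_reads",
    "local_non_temporal_writes",
    "remote_non_temporal_writes",
    "local_reads_slow_memory",
    "remote_reads_slow_memory",
    "dirty_victim_writes_all",
)


def parse_event_filter(text):
    """Split an event_filter listing into a validated list of names."""
    names = [name.strip() for name in text.split(",") if name.strip()]
    unknown = [name for name in names if name not in SUPPORTED_TRANSACTIONS]
    if unknown:
        raise ValueError("unknown transaction types: " + ", ".join(unknown))
    return names


def format_event_filter(names):
    """Join validated names into the string to write to event_filter."""
    return ",".join(parse_event_filter(",".join(names)))


print(format_event_filter(["local_reads", "local_non_temporal_writes"]))
```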

"mbm_assign_on_mkdir":
	Exists when "mbm_event" counter assignment mode is supported. Accessible
	only when "mbm_event" counter assignment mode is enabled.

	Determines if a counter will automatically be assigned to an RMID, MBM event
	pair when its associated monitor group is created via mkdir. Enabled by default
	on boot, also when switched from "default" mode to "mbm_event" counter assignment
	mode. Users can disable this capability by writing to the interface.

	"0":
		Auto assignment is disabled.
	"1":
		Auto assignment is enabled.

	Automatic counter assignment is done with best effort. If auto
	assignment is enabled but there are not enough available counters then
	monitor group creation could succeed while one or more events belonging
	to the group may not have a counter assigned in all domains. Consult
	mbm_L3_assignments for counter assignment states of the new groups.

	Example::

	  # echo 0 > /sys/fs/resctrl/info/L3_MON/mbm_assign_on_mkdir
	  # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_on_mkdir
	  0

"max_threshold_occupancy":
		Read/write file provides the largest value (in
		bytes) at which a previously used LLC_occupancy
		counter can be considered for reuse.

If telemetry monitoring is available there will be a "PERF_PKG_MON" directory
with the following files:

"num_rmids":
		The number of RMIDs for telemetry monitoring events.

		On Intel, resctrl will not enable telemetry events if the number of
		RMIDs that can be tracked concurrently is lower than the total number
		of RMIDs supported. Telemetry events can be force-enabled with the
		"rdt=" kernel parameter, but this may reduce the number of
		monitoring groups that can be created.

"mon_features":
		Lists the telemetry monitoring events that are enabled on this system.

The upper bound for how many "CTRL_MON" + "MON" groups can be created
is the smaller of the L3_MON and PERF_PKG_MON "num_rmids" values.

Finally, in the top level of the "info" directory there is a file
named "last_cmd_status". This is reset with every "command" issued
via the file system (making new directories or writing to any of the
control files). If the command was successful, it will read as "ok".
If the command failed, it will provide more information than can be
conveyed in the error returns from file operations. E.g.
::

	# echo L3:0=f7 > schemata
	bash: echo: write error: Invalid argument
	# cat info/last_cmd_status
	mask f7 has non-consecutive 1-bits

Resource alloc and monitor groups
=================================

Resource groups are represented as directories in the resctrl file
system.  The default group is the root directory which, immediately
after mounting, owns all the tasks and CPUs in the system and can make
full use of all resources.

On a system with RDT control features additional directories can be
created in the root directory that specify different amounts of each
resource (see "schemata" below). The root and these additional top level
directories are referred to as "CTRL_MON" groups below.

On a system with RDT monitoring the root directory and other top level
directories contain a directory named "mon_groups" in which additional
directories can be created to monitor subsets of tasks in the CTRL_MON
group that is their ancestor. These are called "MON" groups in the rest
of this document.

Removing a directory will move all tasks and CPUs owned by the group it
represents to the parent. Removing one of the created CTRL_MON groups
will automatically remove all MON groups below it.

Moving MON group directories to a new parent CTRL_MON group is supported
for the purpose of changing the resource allocations of a MON group
without impacting its monitoring data or assigned tasks. This operation
is not allowed for MON groups which monitor CPUs. No other move
operation is currently allowed other than simply renaming a CTRL_MON or
MON group.

All groups contain the following files:

"tasks":
	Reading this file shows the list of all tasks that belong to
	this group. Writing a task id to the file will add a task to the
	group. Multiple tasks can be added by separating the task ids
	with commas. Tasks will be assigned sequentially. Multiple
	failures are not supported. A single failure encountered while
	attempting to assign a task will cause the operation to abort, and
	tasks added before the failure will remain in the group.
	Failures will be logged to /sys/fs/resctrl/info/last_cmd_status.

	If the group is a CTRL_MON group the task is removed from
	whichever previous CTRL_MON group owned the task and also from
	any MON group that owned the task. If the group is a MON group,
	then the task must already belong to the CTRL_MON parent of this
	group. The task is removed from any previous MON group.

	When writing to this file, a task id of 0 is interpreted as the
	task id of the currently running task. On reading the file, a task
	id of 0 will never be shown and there is no representation of the
	idle tasks. Instead, a CPU's idle task is always considered as a
	member of the group owning the CPU.

"cpus":
	Reading this file shows a bitmask of the logical CPUs owned by
	this group. Writing a mask to this file will add and remove
	CPUs to/from this group. As with the tasks file a hierarchy is
	maintained where MON groups may only include CPUs owned by the
	parent CTRL_MON group.
	When the resource group is in pseudo-locked mode this file will
	only be readable, reflecting the CPUs associated with the
	pseudo-locked region.


"cpus_list":
	Just like "cpus", only using ranges of CPUs instead of bitmasks.
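
To illustrate how "cpus" and "cpus_list" relate, the sketch below converts a
hexadecimal CPU bitmask into a range list. It assumes that bit N of the mask
stands for logical CPU N and that commas in the mask merely separate groups
of hex digits; the exact on-disk formatting is not specified in this section,
and the function name is hypothetical.

```python
def cpus_mask_to_list(mask_text):
    """Convert a hex CPU bitmask such as 'f0' into a range list such as '4-7'."""
    # Assumption: commas only group hex digits and carry no other meaning.
    mask = int(mask_text.strip().replace(",", ""), 16)
    cpus = [cpu for cpu in range(mask.bit_length()) if mask >> cpu & 1]
    ranges = []
    for cpu in cpus:
        if ranges and cpu == ranges[-1][1] + 1:
            ranges[-1][1] = cpu          # extend the current range
        else:
            ranges.append([cpu, cpu])    # start a new range
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


print(cpus_mask_to_list("f0"))
```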


When control is enabled all CTRL_MON groups will also contain:

"schemata":
	A list of all the resources available to this group.
	Each resource has its own line and format - see below for details.

"size":
	Mirrors the display of the "schemata" file to display the size in
	bytes of each allocation instead of the bits representing the
	allocation.

"mode":
	The "mode" of the resource group dictates the sharing of its
	allocations. A "shareable" resource group allows sharing of its
	allocations while an "exclusive" resource group does not. A
	cache pseudo-locked region is created by first writing
	"pseudo-locksetup" to the "mode" file before writing the cache
	pseudo-locked region's schemata to the resource group's "schemata"
	file. On successful pseudo-locked region creation the mode will
	automatically change to "pseudo-locked".

"ctrl_hw_id":
	Available only with debug option. The identifier used by hardware
	for the control group. On x86 this is the CLOSID.

When monitoring is enabled all MON groups will also contain:

"mon_data":
	This contains directories for each monitor domain.

	If L3 monitoring is enabled, there will be a "mon_L3_XX" directory for
	each instance of an L3 cache. Each directory contains files for the enabled
	L3 events (e.g. "llc_occupancy", "mbm_total_bytes", and "mbm_local_bytes").

	If telemetry monitoring is enabled, there will be a "mon_PERF_PKG_YY"
	directory for each physical processor package. Each directory contains
	files for the enabled telemetry events (e.g. "core_energy", "activity",
	"uops_retired", etc.).

	The info/`*`/mon_features files provide the full list of enabled
	event/file names.

	"core_energy" reports a floating point number for the energy (in Joules)
	consumed by cores (registers, arithmetic units, TLB and L1/L2 caches)
	during execution of instructions summed across all logical CPUs on a
	package for the current monitoring group.

	"activity" also reports a floating point value (in Farads).  This provides
	an estimate of work done independent of the frequency that the CPUs used
	for execution.

	Note that "core_energy" and "activity" only measure energy/activity in the
	"core" of the CPU (arithmetic units, TLB, L1 and L2 caches, etc.). They
	do not include L3 cache, memory, I/O devices etc.

	All other events report decimal integer values.

	In a MON group these files provide a read out of the current value of
	the event for all tasks in the group. In CTRL_MON groups these files
	provide the sum for all tasks in the CTRL_MON group and all tasks in
	MON groups. Please see example section for more details on usage.

	On systems with Sub-NUMA Cluster (SNC) enabled there are extra
	directories for each node (located within the "mon_L3_XX" directory
	for the L3 cache they occupy). These are named "mon_sub_L3_YY"
	where "YY" is the node number.

	When the 'mbm_event' counter assignment mode is enabled, reading
	an MBM event of a MON group returns 'Unassigned' if no hardware
	counter is assigned to it. For CTRL_MON groups, 'Unassigned' is
	returned if the MBM event does not have an assigned counter in the
	CTRL_MON group nor in any of its associated MON groups.

"mon_hw_id":
	Available only with debug option. The identifier used by hardware
	for the monitor group. On x86 this is the RMID.

When monitoring is enabled all MON groups may also contain:

"mbm_L3_assignments":
	Exists when "mbm_event" counter assignment mode is supported and lists the
	counter assignment states of the group.

	The assignment list is displayed in the following format:

	<Event>:<Domain ID>=<Assignment state>;<Domain ID>=<Assignment state>

	Event: A valid MBM event in the
	       /sys/fs/resctrl/info/L3_MON/event_configs directory.

	Domain ID: A valid domain ID. When writing, '*' applies the changes
		   to all the domains.

	Assignment states:

	_ : No counter assigned.

	e : Counter assigned exclusively.

	Example:

	To display the counter assignment states for the default group:
	::

	 # cd /sys/fs/resctrl
	 # cat /sys/fs/resctrl/mbm_L3_assignments
	   mbm_total_bytes:0=e;1=e
	   mbm_local_bytes:0=e;1=e

	Assignments can be modified by writing to the interface.

	Examples:

	To unassign the counter associated with the mbm_total_bytes event on domain 0:
	::

	 # echo "mbm_total_bytes:0=_" > /sys/fs/resctrl/mbm_L3_assignments
	 # cat /sys/fs/resctrl/mbm_L3_assignments
	   mbm_total_bytes:0=_;1=e
	   mbm_local_bytes:0=e;1=e

	To unassign the counter associated with the mbm_total_bytes event on all the domains:
	::

	 # echo "mbm_total_bytes:*=_" > /sys/fs/resctrl/mbm_L3_assignments
	 # cat /sys/fs/resctrl/mbm_L3_assignments
	   mbm_total_bytes:0=_;1=_
	   mbm_local_bytes:0=e;1=e

	To assign a counter associated with the mbm_total_bytes event on all domains in
	exclusive mode:
	::

	 # echo "mbm_total_bytes:*=e" > /sys/fs/resctrl/mbm_L3_assignments
	 # cat /sys/fs/resctrl/mbm_L3_assignments
	   mbm_total_bytes:0=e;1=e
	   mbm_local_bytes:0=e;1=e
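
The "mbm_L3_assignments" format described above is simple to handle in a
script. The following sketch parses the listing into a nested dictionary and
builds the string for a write; the function names are hypothetical, and only
the two assignment states listed above ("_" and "e") are accepted.

```python
VALID_STATES = ("_", "e")  # states documented above


def parse_assignments(text):
    """Parse 'event:dom=state;dom=state' lines into {event: {dom: state}}."""
    result = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        event, _, domains = line.partition(":")
        result[event] = dict(item.split("=", 1) for item in domains.split(";"))
    return result


def format_assignment(event, states):
    """Build the string to write, e.g. 'mbm_total_bytes:*=_'."""
    for state in states.values():
        if state not in VALID_STATES:
            raise ValueError(f"unsupported assignment state {state!r}")
    return event + ":" + ";".join(f"{dom}={st}" for dom, st in states.items())


listing = "   mbm_total_bytes:0=_;1=e\n   mbm_local_bytes:0=e;1=e\n"
print(parse_assignments(listing))
print(format_assignment("mbm_total_bytes", {"*": "e"}))
```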

When the "mba_MBps" mount option is used all CTRL_MON groups will also contain:

"mba_MBps_event":
	Reading this file shows which memory bandwidth event is used
	as input to the software feedback loop that keeps memory bandwidth
	below the value specified in the schemata file. Writing the
	name of one of the supported memory bandwidth events found in
	/sys/fs/resctrl/info/L3_MON/mon_features changes the input
	event.

Resource allocation rules
-------------------------

When a task is running the following rules define which resources are
available to it:

1) If the task is a member of a non-default group, then the schemata
   for that group is used.

2) Else if the task belongs to the default group, but is running on a
   CPU that is assigned to some specific group, then the schemata for the
   CPU's group is used.

3) Otherwise the schemata for the default group is used.
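
The three allocation rules above amount to a simple precedence check. The
sketch below is a user-space model only: groups are represented by arbitrary
strings, ``DEFAULT_GROUP`` is a hypothetical marker for the root directory,
and the function name is made up for illustration.

```python
DEFAULT_GROUP = "default"  # hypothetical label for the root (default) group


def schemata_group(task_group, cpu_group):
    """Return the group whose schemata applies to a running task."""
    if task_group != DEFAULT_GROUP:
        return task_group        # rule 1: the task's own group wins
    if cpu_group != DEFAULT_GROUP:
        return cpu_group         # rule 2: fall back to the CPU's group
    return DEFAULT_GROUP         # rule 3: default group


print(schemata_group("default", "p1"))
```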

Resource monitoring rules
-------------------------

1) If a task is a member of a MON group, or non-default CTRL_MON group
   then RDT events for the task will be reported in that group.

2) If a task is a member of the default CTRL_MON group, but is running
   on a CPU that is assigned to some specific group, then the RDT events
   for the task will be reported in that group.

3) Otherwise RDT events for the task will be reported in the root level
   "mon_data" group.


Notes on cache occupancy monitoring and control
===============================================

When moving a task from one group to another you should remember that
this only affects *new* cache allocations by the task. E.g. you may have
a task in a monitor group showing 3 MB of cache occupancy. If you move
the task to a new group and immediately check the occupancy of the old and
new groups you will likely see that the old group is still showing 3 MB and
the new group zero. When the task accesses locations still in cache from
before the move, the h/w does not update any counters. On a busy system
you will likely see the occupancy in the old group go down as cache lines
are evicted and re-used while the occupancy in the new group rises as
the task accesses memory and loads into the cache are counted based on
membership in the new group.

The same applies to cache allocation control. Moving a task to a group
with a smaller cache partition will not evict any cache lines. The
process may continue to use them from the old partition.

Hardware uses a CLOSID (Class of Service ID) and an RMID (Resource Monitoring
ID) to identify a control group and a monitoring group respectively. Each
resource group is mapped to these IDs based on the kind of group. The
number of CLOSIDs and RMIDs is limited by the hardware, hence the creation of
a "CTRL_MON" directory may fail if we run out of either CLOSIDs or RMIDs
and creation of a "MON" group may fail if we run out of RMIDs.

max_threshold_occupancy - generic concepts
------------------------------------------

Note that an RMID once freed may not be immediately available for use as
cache lines of the previous user may still be tagged with that RMID.
Hence such RMIDs are placed on a limbo list and rechecked to see whether
their cache occupancy has gone down. If at some point the system has a lot
of limbo RMIDs that are not yet ready to be used, the user may see an
-EBUSY during mkdir.

max_threshold_occupancy is a user configurable value to determine the
occupancy at which an RMID can be freed.

The mon_llc_occupancy_limbo tracepoint gives the precise occupancy in bytes
for a subset of RMIDs that are not immediately available for allocation.
This can't be relied on to produce output every second; it may be necessary
to attempt to create an empty monitor group to force an update. Output may
only be produced if creation of a control or monitor group fails.

Schemata files - general concepts
---------------------------------
Each line in the file describes one resource. The line starts with
the name of the resource, followed by specific values to be applied
in each of the instances of that resource on the system.

Cache IDs
---------
On current generation systems there is one L3 cache per socket and L2
caches are generally just shared by the hyperthreads on a core, but this
isn't an architectural requirement. We could have multiple separate L3
caches on a socket, or multiple cores could share an L2 cache. So instead
of using "socket" or "core" to define the set of logical CPUs sharing
a resource we use a "Cache ID". At a given cache level this will be a
unique number across the whole system (but it isn't guaranteed to be a
contiguous sequence, there may be gaps).  To find the ID for each logical
CPU look in /sys/devices/system/cpu/cpu*/cache/index*/id
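
This lookup is easy to script. A minimal sketch, assuming each cache index
directory also provides the standard "level" file; the function name and
the optional root argument are illustrative only, the latter exists so the
sketch can be pointed at a copy of the tree::

  # list_cache_ids [CPU_ROOT]
  # Prints "cpuN L<level> id=<id>" for every cache index exposing an id.
  list_cache_ids() (
    root=${1:-/sys/devices/system/cpu}
    for idfile in "$root"/cpu[0-9]*/cache/index*/id; do
      [ -r "$idfile" ] || continue      # no match, or id not exposed
      dir=${idfile%/id}                 # .../cpuN/cache/indexM
      cpu=${dir%/cache/*}               # .../cpuN
      cpu=${cpu##*/}                    # cpuN
      echo "$cpu L$(cat "$dir/level") id=$(cat "$idfile")"
    done
  )

CPUs that print the same level and id share that cache instance and
therefore belong to the same domain in the schemata file.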

Cache Bit Masks (CBM)
---------------------
For cache resources we describe the portion of the cache that is available
for allocation using a bitmask. The maximum value of the mask is defined
by each CPU model (and may be different for different cache levels). It
is found using CPUID, but is also provided in the "info" directory of
the resctrl file system in "info/{resource}/cbm_mask". Some Intel hardware
requires that these masks have all the '1' bits in a contiguous block. So
0x3, 0x6 and 0xC are legal 4-bit masks with two bits set, but 0x5, 0x9
and 0xA are not. Check /sys/fs/resctrl/info/{resource}/sparse_masks
to see whether masks with non-contiguous 1s are supported. On a system
with a 20-bit mask each bit represents 5% of the capacity of the cache.
You could partition the cache into four equal parts with masks: 0x1f,
0x3e0, 0x7c00, 0xf8000.
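
Both properties are easy to check with shell arithmetic. The helpers below
are a sketch with hypothetical names; they only do the bit manipulation and
do not consult "cbm_mask" or "sparse_masks"::

  # cbm_is_contiguous MASK
  # Succeeds if MASK is non-zero and all its set bits form one block.
  cbm_is_contiguous() (
    m=$(($1))
    [ "$m" -ne 0 ] || exit 1
    # Drop trailing zero bits; what is left must look like 0b0..01..1.
    while [ $((m & 1)) -eq 0 ]; do m=$((m >> 1)); done
    [ $((m & (m + 1))) -eq 0 ]
  )

  # cbm_partition CBM_LEN PARTS
  # Prints PARTS equal, non-overlapping masks for a CBM_LEN-bit mask.
  cbm_partition() (
    len=$1 parts=$2
    width=$((len / parts))
    i=0
    while [ "$i" -lt "$parts" ]; do
      printf '0x%x\n' $(( ((1 << width) - 1) << (i * width) ))
      i=$((i + 1))
    done
  )

"cbm_partition 20 4" prints the four masks listed above, and
"cbm_is_contiguous 0x5" fails while "cbm_is_contiguous 0x6" succeeds.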

Notes on Sub-NUMA Cluster mode
==============================
When SNC mode is enabled, Linux may load balance tasks between Sub-NUMA
nodes much more readily than between regular NUMA nodes since the CPUs
on Sub-NUMA nodes share the same L3 cache and the system may report
the NUMA distance between Sub-NUMA nodes with a lower value than used
for regular NUMA nodes.

The top-level monitoring files in each "mon_L3_XX" directory provide
the sum of data across all SNC nodes sharing an L3 cache instance.
Users who bind tasks to the CPUs of a specific Sub-NUMA node can read
the "llc_occupancy", "mbm_total_bytes", and "mbm_local_bytes" in the
"mon_sub_L3_YY" directories to get node local data.

Memory bandwidth allocation is still performed at the L3 cache
level. I.e. throttling controls are applied to all SNC nodes.

L3 cache allocation bitmaps also apply to all SNC nodes. But note that
the amount of L3 cache represented by each bit is divided by the number
of SNC nodes per L3 cache. E.g. with a 100MB cache on a system with 10-bit
allocation masks each bit normally represents 10MB. With SNC mode enabled
with two SNC nodes per L3 cache, each bit only represents 5MB.

Memory bandwidth Allocation and monitoring
==========================================

For the memory bandwidth resource, by default the user controls the resource
by indicating the percentage of total memory bandwidth.

The minimum bandwidth percentage value for each CPU model is predefined
and can be looked up through "info/MB/min_bandwidth". The bandwidth
granularity that is allocated is also dependent on the CPU model and can
be looked up at "info/MB/bandwidth_gran". The available bandwidth
control steps are: min_bw + N * bw_gran. Intermediate values are rounded
to the next control step available on the hardware.
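
The step formula can be worked through with a small helper. This is a
sketch only: the function name is hypothetical, "next" is assumed to mean
rounding up, and treating out-of-range requests as errors is likewise an
assumption of this example::

  # mba_round_up PERCENT MIN_BW BW_GRAN
  # Prints the control step min_bw + N * bw_gran at or above PERCENT.
  mba_round_up() (
    bw=$1 min=$2 gran=$3
    # Assumption: requests outside [min_bw, 100] are rejected here.
    [ "$bw" -ge "$min" ] && [ "$bw" -le 100 ] || exit 1
    steps=$(( (bw - min + gran - 1) / gran ))   # N, rounded up
    bw=$((min + steps * gran))
    [ "$bw" -le 100 ] || bw=100                 # never exceed 100%
    echo "$bw"
  )

With min_bandwidth 10 and bandwidth_gran 10, "mba_round_up 35 10 10"
prints 40, while a value that is already a control step, such as 20, is
returned unchanged.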

The bandwidth throttling is a core specific mechanism on some of Intel
SKUs. Using a high bandwidth and a low bandwidth setting on two threads
sharing a core may result in both threads being throttled to use the
low bandwidth (see "thread_throttle_mode").

The fact that Memory bandwidth allocation (MBA) may be a core
specific mechanism whereas memory bandwidth monitoring (MBM) is done at
the package level may lead to confusion when users try to apply control
via the MBA and then monitor the bandwidth to see if the controls are
effective. Below are such scenarios:

1. User may *not* see increase in actual bandwidth when percentage
   values are increased:

This can occur when aggregate L2 external bandwidth is more than L3
external bandwidth. Consider an SKL SKU with 24 cores on a package and
where L2 external bandwidth is 10GBps (hence aggregate L2 external
bandwidth is 240GBps) and L3 external bandwidth is 100GBps. Now a
workload with '20 threads, having 50% bandwidth, each consuming 5GBps'
consumes the max L3 bandwidth of 100GBps although the percentage value
specified is only 50% << 100%. Hence increasing the bandwidth percentage
will not yield any more bandwidth. This is because although the L2
external bandwidth still has capacity, the L3 external bandwidth is fully
used. Also note that this would be dependent on the number of cores the
benchmark is run on.

2. Same bandwidth percentage may mean different actual bandwidth
   depending on # of threads:

For the same SKU in #1, a 'single thread, with 10% bandwidth' and '4
threads, with 10% bandwidth' can consume up to 10GBps and 40GBps although
they have the same percentage bandwidth of 10%. This is simply because as
threads start using more cores in an rdtgroup, the actual bandwidth may
increase or vary although the user specified bandwidth percentage is the
same.

In order to mitigate this and make the interface more user friendly,
resctrl added support for specifying the bandwidth in MiBps as well.  The
kernel underneath would use a software feedback mechanism or a "Software
Controller(mba_sc)" which reads the actual bandwidth using MBM counters
and adjusts the memory bandwidth percentages to ensure::

	"actual bandwidth < user specified bandwidth".

By default, the schemata would take the bandwidth percentage values
whereas the user can switch to the "MBA software controller" mode using
a mount option 'mba_MBps'. The schemata format is specified in the below
sections.
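
The idea of such a feedback loop can be pictured with one adjustment step.
The sketch below is deliberately simplified and is not the kernel's
algorithm: the function name, its parameters and the one-step-per-interval
policy are assumptions made for illustration, and a real controller needs
some hysteresis to avoid oscillating around the target::

  # mba_sc_step CUR_PCT MEASURED_MBPS TARGET_MBPS MIN_BW BW_GRAN
  # Prints the bandwidth percentage to program for the next interval.
  mba_sc_step() (
    pct=$1 measured=$2 target=$3 min=$4 gran=$5
    if [ "$measured" -ge "$target" ] && [ $((pct - gran)) -ge "$min" ]; then
      pct=$((pct - gran))   # at or over the limit: throttle harder
    elif [ "$measured" -lt "$target" ] && [ $((pct + gran)) -le 100 ]; then
      pct=$((pct + gran))   # under the limit: relax throttling
    fi
    echo "$pct"
  )

For a group limited to 1024 MBps and currently programmed at 50%, a
measurement of 1200 MBps yields 40 and a measurement of 800 MBps yields 60.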

L3 schemata file details (code and data prioritization disabled)
----------------------------------------------------------------
With CDP disabled the L3 schemata format is::

	L3:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...

L3 schemata file details (CDP enabled via mount option to resctrl)
------------------------------------------------------------------
When CDP is enabled L3 control is split into two separate resources
so you can specify independent masks for code and data like this::

	L3DATA:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
	L3CODE:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...

L2 schemata file details
------------------------
CDP is supported at L2 using the 'cdpl2' mount option. The schemata
format is either::

	L2:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...

or
::

	L2DATA:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
	L2CODE:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...


Memory bandwidth Allocation (default mode)
------------------------------------------

Memory b/w domain is L3 cache.
::

	MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;...

Memory bandwidth Allocation specified in MiBps
----------------------------------------------

Memory bandwidth domain is L3 cache.
::

	MB:<cache_id0>=bw_MiBps0;<cache_id1>=bw_MiBps1;...

Slow Memory Bandwidth Allocation (SMBA)
---------------------------------------
AMD hardware supports Slow Memory Bandwidth Allocation (SMBA).
CXL.memory is the only supported "slow" memory device. With the
support of SMBA, the hardware enables bandwidth allocation on
the slow memory devices. If there are multiple such devices in
the system, the throttling logic groups all the slow sources
together and applies the limit on them as a whole.

The presence of SMBA (with CXL.memory) is independent of the presence
of slow memory devices. If there are no such devices on the system, then
configuring SMBA will have no impact on the performance of the system.

The bandwidth domain for slow memory is L3 cache. Its schemata file
is formatted as:
::

	SMBA:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;...

Reading/writing the schemata file
---------------------------------
Reading the schemata file will show the state of all resources
on all domains. When writing you only need to specify those values
which you wish to change.  E.g.
::

  # cat schemata
  L3DATA:0=fffff;1=fffff;2=fffff;3=fffff
  L3CODE:0=fffff;1=fffff;2=fffff;3=fffff
  # echo "L3DATA:2=3c0;" > schemata
  # cat schemata
  L3DATA:0=fffff;1=fffff;2=3c0;3=fffff
  L3CODE:0=fffff;1=fffff;2=fffff;3=fffff
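
Scripts that only care about one domain can pick its value out of the
file. A minimal sketch (the function name is hypothetical) that reads
schemata text on standard input::

  # schemata_get RESOURCE DOMAIN_ID < schemata
  # Prints the value configured for DOMAIN_ID of RESOURCE.
  schemata_get() (
    res=$1 id=$2
    while IFS= read -r line; do
      line=${line#"${line%%[! ]*}"}     # drop leading blanks
      [ "${line%%:*}" = "$res" ] || continue
      IFS=';'                           # split "id=value" pairs
      for pair in ${line#*:}; do
        if [ "${pair%%=*}" = "$id" ]; then
          echo "${pair#*=}"
        fi
      done
    done
  )

Run as "schemata_get L3DATA 2 < schemata" after the write shown above, it
prints 3c0.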

Reading/writing the schemata file (on AMD systems)
--------------------------------------------------
Reading the schemata file will show the current bandwidth limit on all
domains. The allocated resources are in multiples of one eighth GB/s.
When writing to the file, you need to specify the cache id for which you
wish to configure the bandwidth limit.

For example, to allocate a 2GB/s limit on cache id 1:

::

  # cat schemata
    MB:0=2048;1=2048;2=2048;3=2048
    L3:0=ffff;1=ffff;2=ffff;3=ffff

  # echo "MB:1=16" > schemata
  # cat schemata
    MB:0=2048;1=  16;2=2048;3=2048
    L3:0=ffff;1=ffff;2=ffff;3=ffff
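
The value written is the limit expressed in units of one eighth GB/s. A
one-line helper (hypothetical name, whole GB/s only) makes the conversion
explicit::

  # amd_mb_units GBPS
  # Prints the schemata value for a limit of GBPS GB/s, assuming the
  # 1/8 GB/s unit described above.
  amd_mb_units() {
    echo $(( $1 * 8 ))
  }

"amd_mb_units 2" prints 16, the value used in the example above.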

Reading/writing the schemata file (on AMD systems) with SMBA feature
--------------------------------------------------------------------
Reading and writing the schemata file is the same as without SMBA in
the above section.

For example, to allocate an 8GB/s limit on cache id 1:

::

  # cat schemata
    SMBA:0=2048;1=2048;2=2048;3=2048
      MB:0=2048;1=2048;2=2048;3=2048
      L3:0=ffff;1=ffff;2=ffff;3=ffff

  # echo "SMBA:1=64" > schemata
  # cat schemata
    SMBA:0=2048;1=  64;2=2048;3=2048
      MB:0=2048;1=2048;2=2048;3=2048
      L3:0=ffff;1=ffff;2=ffff;3=ffff

Cache Pseudo-Locking
====================
CAT enables a user to specify the amount of cache space that an
application can fill. Cache pseudo-locking builds on the fact that a
CPU can still read and write data pre-allocated outside its current
allocated area on a cache hit. With cache pseudo-locking, data can be
preloaded into a reserved portion of cache that no application can
fill, and from that point on will only serve cache hits. The cache
pseudo-locked memory is made accessible to user space where an
application can map it into its virtual address space and thus have
a region of memory with reduced average read latency.

The creation of a cache pseudo-locked region is triggered by a request
from the user to do so that is accompanied by a schemata of the region
to be pseudo-locked. The cache pseudo-locked region is created as follows:

- Create a CAT allocation CLOSNEW with a CBM matching the schemata
  from the user of the cache region that will contain the pseudo-locked
  memory. This region must not overlap with any current CAT allocation/CLOS
  on the system and no future overlap with this cache region is allowed
  while the pseudo-locked region exists.
- Create a contiguous region of memory of the same size as the cache
  region.
- Flush the cache, disable hardware prefetchers, disable preemption.
- Make CLOSNEW the active CLOS and touch the allocated memory to load
  it into the cache.
- Set the previous CLOS as active.
- At this point the closid CLOSNEW can be released - the cache
  pseudo-locked region is protected as long as its CBM does not appear in
  any CAT allocation. Even though the cache pseudo-locked region will from
  this point on not appear in any CBM of any CLOS an application running with
  any CLOS will be able to access the memory in the pseudo-locked region since
  the region continues to serve cache hits.
- The contiguous region of memory loaded into the cache is exposed to
  user-space as a character device.

Cache pseudo-locking increases the probability that data will remain
in the cache via carefully configuring the CAT feature and controlling
application behavior. There is no guarantee that data is placed in
cache. Instructions like INVD, WBINVD, CLFLUSH, etc. can still evict
“locked” data from cache. Power management C-states may shrink or
power off cache. Deeper C-states will automatically be restricted on
pseudo-locked region creation.

It is required that an application using a pseudo-locked region runs
with affinity to the cores (or a subset of the cores) associated
with the cache on which the pseudo-locked region resides. A sanity check
within the code will not allow an application to map pseudo-locked memory
unless it runs with affinity to cores associated with the cache on which the
pseudo-locked region resides. The sanity check is only done during the
initial mmap() handling; there is no enforcement afterwards and the
application itself needs to ensure it remains affine to the correct cores.

Pseudo-locking is accomplished in two stages:

1) During the first stage the system administrator allocates a portion
   of cache that should be dedicated to pseudo-locking. At this time an
   equivalent portion of memory is allocated, loaded into allocated
   cache portion, and exposed as a character device.
2) During the second stage a user-space application maps (mmap()) the
   pseudo-locked memory into its address space.

Cache Pseudo-Locking Interface
------------------------------
A pseudo-locked region is created using the resctrl interface as follows:

1) Create a new resource group by creating a new directory in /sys/fs/resctrl.
2) Change the new resource group's mode to "pseudo-locksetup" by writing
   "pseudo-locksetup" to the "mode" file.
3) Write the schemata of the pseudo-locked region to the "schemata" file. All
   bits within the schemata should be "unused" according to the "bit_usage"
   file.

On successful pseudo-locked region creation the "mode" file will contain
"pseudo-locked" and a new character device with the same name as the resource
group will exist in /dev/pseudo_lock. This character device can be mmap()'ed
by user space in order to obtain access to the pseudo-locked memory region.

An example of cache pseudo-locked region creation and usage can be found below.

Cache Pseudo-Locking Debugging Interface
----------------------------------------
The pseudo-locking debugging interface is enabled by default (if
CONFIG_DEBUG_FS is enabled) and can be found in /sys/kernel/debug/resctrl.

There is no explicit way for the kernel to test if a provided memory
location is present in the cache. The pseudo-locking debugging interface uses
the tracing infrastructure to provide two ways to measure cache residency of
the pseudo-locked region:

1) Memory access latency using the pseudo_lock_mem_latency tracepoint. Data
   from these measurements are best visualized using a hist trigger (see
   example below). In this test the pseudo-locked region is traversed at
   a stride of 32 bytes while hardware prefetchers and preemption
   are disabled. This also provides a substitute visualization of cache
   hits and misses.
2) Cache hit and miss measurements using model specific precision counters if
   available. Depending on the levels of cache on the system the pseudo_lock_l2
   and pseudo_lock_l3 tracepoints are available.

When a pseudo-locked region is created a new directory is created for
it in debugfs as /sys/kernel/debug/resctrl/<newdir>. A single
write-only file, pseudo_lock_measure, is present in this directory. The
measurement of the pseudo-locked region depends on the number written to this
debugfs file:

1:
     writing "1" to the pseudo_lock_measure file will trigger the latency
     measurement captured in the pseudo_lock_mem_latency tracepoint. See
     example below.
2:
     writing "2" to the pseudo_lock_measure file will trigger the L2 cache
     residency (cache hits and misses) measurement captured in the
     pseudo_lock_l2 tracepoint. See example below.
3:
     writing "3" to the pseudo_lock_measure file will trigger the L3 cache
     residency (cache hits and misses) measurement captured in the
     pseudo_lock_l3 tracepoint.

All measurements are recorded with the tracing infrastructure. This requires
the relevant tracepoints to be enabled before the measurement is triggered.

Example of latency debugging interface
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In this example a pseudo-locked region named "newlock" was created. Here is
how we can measure the latency in cycles of reading from this region and
visualize this data with a histogram that is available if CONFIG_HIST_TRIGGERS
is set::

  # :> /sys/kernel/tracing/trace
  # echo 'hist:keys=latency' > /sys/kernel/tracing/events/resctrl/pseudo_lock_mem_latency/trigger
  # echo 1 > /sys/kernel/tracing/events/resctrl/pseudo_lock_mem_latency/enable
  # echo 1 > /sys/kernel/debug/resctrl/newlock/pseudo_lock_measure
  # echo 0 > /sys/kernel/tracing/events/resctrl/pseudo_lock_mem_latency/enable
  # cat /sys/kernel/tracing/events/resctrl/pseudo_lock_mem_latency/hist

  # event histogram
  #
  # trigger info: hist:keys=latency:vals=hitcount:sort=hitcount:size=2048 [active]
  #

  { latency:        456 } hitcount:          1
  { latency:         50 } hitcount:         83
  { latency:         36 } hitcount:         96
  { latency:         44 } hitcount:        174
  { latency:         48 } hitcount:        195
  { latency:         46 } hitcount:        262
  { latency:         42 } hitcount:        693
  { latency:         40 } hitcount:       3204
  { latency:         38 } hitcount:       3484

  Totals:
      Hits: 8192
      Entries: 9
    Dropped: 0
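
A histogram like the one above can be condensed into a single figure. The
following sketch (hypothetical function name, relying only on the line
layout shown above) sums the hit counts and computes the hit-weighted mean
latency::

  # hist_summary < .../pseudo_lock_mem_latency/hist
  # Prints "<total hits> <mean latency in cycles>".
  hist_summary() {
    awk '$2 == "latency:" && $5 == "hitcount:" {
           hits += $6          # number of accesses in this bucket
           sum  += $3 * $6     # latency weighted by hit count
         }
         END { if (hits) printf "%d %.2f\n", hits, sum / hits }'
  }

For the output above this prints "8192 39.89": every access is accounted
for and the single 456-cycle outlier barely moves the mean.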

Example of cache hits/misses debugging
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In this example a pseudo-locked region named "newlock" was created on the L2
cache of a platform. Here is how we can obtain details of the cache hits
and misses using the platform's precision counters.
::

  # :> /sys/kernel/tracing/trace
  # echo 1 > /sys/kernel/tracing/events/resctrl/pseudo_lock_l2/enable
  # echo 2 > /sys/kernel/debug/resctrl/newlock/pseudo_lock_measure
  # echo 0 > /sys/kernel/tracing/events/resctrl/pseudo_lock_l2/enable
  # cat /sys/kernel/tracing/trace

  # tracer: nop
  #
  #                              _-----=> irqs-off
  #                             / _----=> need-resched
  #                            | / _---=> hardirq/softirq
  #                            || / _--=> preempt-depth
  #                            ||| /     delay
  #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
  #              | |       |   ||||       |         |
  pseudo_lock_mea-1672  [002] ....  3132.860500: pseudo_lock_l2: hits=4097 miss=0


Examples for RDT allocation usage
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

1) Example 1

On a two socket machine (one L3 cache per socket) with just four bits
for cache bit masks, minimum b/w of 10% with a memory bandwidth
granularity of 10%.
::

  # mount -t resctrl resctrl /sys/fs/resctrl
  # cd /sys/fs/resctrl
  # mkdir p0 p1
  # echo -e "L3:0=3;1=c\nMB:0=50;1=50" > /sys/fs/resctrl/p0/schemata
  # echo -e "L3:0=3;1=3\nMB:0=50;1=50" > /sys/fs/resctrl/p1/schemata

The default resource group is unmodified, so we have access to all parts
of all caches (its schemata file reads "L3:0=f;1=f").

Tasks that are under the control of group "p0" may only allocate from the
"lower" 50% on cache ID 0, and the "upper" 50% of cache ID 1.
Tasks in group "p1" use the "lower" 50% of cache on both sockets.

Similarly, tasks that are under the control of group "p0" may use a
maximum memory b/w of 50% on socket 0 and 50% on socket 1.
Tasks in group "p1" may also use 50% memory b/w on both sockets.
Note that unlike cache masks, memory b/w cannot specify whether these
allocations can overlap or not. The allocation specifies the maximum
b/w that the group may be able to use and the system admin can configure
the b/w accordingly.

If resctrl is using the software controller (mba_sc) then the user can
enter the max b/w in MBps rather than the percentage values.
::

  # echo -e "L3:0=3;1=c\nMB:0=1024;1=500" > /sys/fs/resctrl/p0/schemata
  # echo -e "L3:0=3;1=3\nMB:0=1024;1=500" > /sys/fs/resctrl/p1/schemata

In the above example the tasks in "p1" and "p0" on socket 0 would use a max b/w
of 1024MBps whereas on socket 1 they would use 500MBps.

2) Example 2

Again two sockets, but this time with a more realistic 20-bit mask.

Two real time tasks pid=1234 running on processor 0 and pid=5678 running on
processor 1 on socket 0 on a 2-socket and dual core machine. To avoid noisy
neighbors, each of the two real-time tasks exclusively occupies one quarter
of L3 cache on socket 0.
::

  # mount -t resctrl resctrl /sys/fs/resctrl
  # cd /sys/fs/resctrl

First we reset the schemata for the default group so that the "upper"
50% of the L3 cache on socket 0 and 50% of memory b/w cannot be used by
ordinary tasks::

  # echo -e "L3:0=3ff;1=fffff\nMB:0=50;1=100" > schemata

Next we make a resource group for our first real time task and give
it access to the "top" 25% of the cache on socket 0.
::

  # mkdir p0
  # echo "L3:0=f8000;1=fffff" > p0/schemata

Finally we move our first real time task into this resource group. We
also use taskset(1) to ensure the task always runs on a dedicated CPU
on socket 0. Most uses of resource groups will also constrain which
processors tasks run on.
::

  # echo 1234 > p0/tasks
  # taskset -cp 1 1234

Ditto for the second real time task (with the remaining 25% of cache)::

  # mkdir p1
  # echo "L3:0=7c00;1=fffff" > p1/schemata
  # echo 5678 > p1/tasks
  # taskset -cp 2 5678

For the same 2 socket system with memory b/w resource and CAT L3 the
schemata would look like this (assuming min_bandwidth is 10 and
bandwidth_gran is 10):

For our first real time task this would request 20% memory b/w on socket 0.
::

  # echo -e "L3:0=f8000;1=fffff\nMB:0=20;1=100" > p0/schemata

For our second real time task this would request another 20% memory b/w
on socket 0.
::

  # echo -e "L3:0=7c00;1=fffff\nMB:0=20;1=100" > p1/schemata

3) Example 3

A single socket system which has real-time tasks running on cores 4-7 and
non real-time workload assigned to cores 0-3. The real-time tasks share text
and data, so a per task association is not required and due to interaction
with the kernel it's desired that the kernel on these cores shares L3 with
the tasks.
::

  # mount -t resctrl resctrl /sys/fs/resctrl
  # cd /sys/fs/resctrl

First we reset the schemata for the default group so that the "upper"
50% of the L3 cache on socket 0, and 50% of memory bandwidth on socket 0
cannot be used by ordinary tasks::

  # echo -e "L3:0=3ff\nMB:0=50" > schemata

Next we make a resource group for our real time cores and give it access
to the "top" 50% of the cache on socket 0 and 50% of memory bandwidth on
socket 0.
::

  # mkdir p0
  # echo -e "L3:0=ffc00\nMB:0=50" > p0/schemata

Finally we move cores 4-7 over to the new group and make sure that the
kernel and the tasks running there get 50% of the cache. They should
also get 50% of memory bandwidth assuming that the cores 4-7 are SMT
siblings and only the real time threads are scheduled on the cores 4-7.
::

  # echo F0 > p0/cpus
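
The value F0 is simply the bitmask with bits 4 to 7 set. For other CPU
ranges the mask can be derived with shell arithmetic; the helper name below
is hypothetical and the sketch only handles one contiguous range that fits
in the shell's integer width::

  # cpu_range_mask FIRST LAST
  # Prints the hex mask with bits FIRST..LAST set.
  cpu_range_mask() (
    first=$1 last=$2
    printf '%x\n' $(( ((1 << (last - first + 1)) - 1) << first ))
  )

"cpu_range_mask 4 7" prints f0 and "cpu_range_mask 0 3" prints f, the mask
of the cores left to the non real-time workload.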

4) Example 4

The resource groups in previous examples were all in the default "shareable"
mode allowing sharing of their cache allocations. If one resource group
configures a cache allocation then nothing prevents another resource group
from overlapping with that allocation.

In this example a new exclusive resource group will be created on a L2 CAT
system with two L2 cache instances that can be configured with an 8-bit
capacity bitmask. The new exclusive resource group will be configured to use
25% of each cache instance.
::

  # mount -t resctrl resctrl /sys/fs/resctrl/
  # cd /sys/fs/resctrl

First, we observe that the default group is configured to allocate to all L2
cache::

  # cat schemata
  L2:0=ff;1=ff

We could attempt to create the new resource group at this point, but it will
fail because of the overlap with the schemata of the default group::

  # mkdir p0
  # echo 'L2:0=0x3;1=0x3' > p0/schemata
  # cat p0/mode
  shareable
  # echo exclusive > p0/mode
  -sh: echo: write error: Invalid argument
  # cat info/last_cmd_status
  schemata overlaps

To ensure that there is no overlap with another resource group the default
resource group's schemata has to change, making it possible for the new
resource group to become exclusive.
::

  # echo 'L2:0=0xfc;1=0xfc' > schemata
  # echo exclusive > p0/mode
  # grep . p0/*
  p0/cpus:0
  p0/mode:exclusive
  p0/schemata:L2:0=03;1=03
  p0/size:L2:0=262144;1=262144

A new resource group will on creation not overlap with an exclusive resource
group::

  # mkdir p1
  # grep . p1/*
  p1/cpus:0
  p1/mode:shareable
  p1/schemata:L2:0=fc;1=fc
  p1/size:L2:0=786432;1=786432
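
The "size" values shown for p0 and p1 follow from the number of bits set in
each mask. The sketch below reproduces them; the function name is
hypothetical, and the 1 MiB size of each L2 instance is inferred from the
output above rather than stated by the example::

  # cbm_size_bytes MASK CBM_LEN CACHE_BYTES
  # Prints the cache capacity in bytes covered by MASK.
  cbm_size_bytes() (
    m=$(($1)) len=$2 cache=$3
    bits=0
    while [ "$m" -ne 0 ]; do      # count the set bits
      bits=$((bits + (m & 1)))
      m=$((m >> 1))
    done
    echo $((bits * cache / len))
  )

"cbm_size_bytes 0x3 8 1048576" prints 262144 and
"cbm_size_bytes 0xfc 8 1048576" prints 786432, matching p0/size and
p1/size.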

The bit_usage will reflect how the cache is used::

  # cat info/L2/bit_usage
  0=SSSSSSEE;1=SSSSSSEE

A resource group cannot be forced to overlap with an exclusive resource group::

  # echo 'L2:0=0x1;1=0x1' > p1/schemata
  -sh: echo: write error: Invalid argument
  # cat info/last_cmd_status
  overlaps with exclusive group

Example of Cache Pseudo-Locking
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Lock a portion of the L2 cache with cache id 1 using CBM 0x3. The
pseudo-locked region is exposed at /dev/pseudo_lock/newlock, which an
application can open and pass to mmap().
::

  # mount -t resctrl resctrl /sys/fs/resctrl/
  # cd /sys/fs/resctrl

Ensure that there are bits available that can be pseudo-locked. Since only
unused bits can be pseudo-locked, the bits to be pseudo-locked need to be
removed from the default resource group's schemata::

  # cat info/L2/bit_usage
  0=SSSSSSSS;1=SSSSSSSS
  # echo 'L2:1=0xfc' > schemata
  # cat info/L2/bit_usage
  0=SSSSSSSS;1=SSSSSS00

Create a new resource group that will be associated with the pseudo-locked
region, indicate that it will be used for a pseudo-locked region, and
configure the requested pseudo-locked region capacity bitmask::

  # mkdir newlock
  # echo pseudo-locksetup > newlock/mode
  # echo 'L2:1=0x3' > newlock/schemata

On success the resource group's mode will change to pseudo-locked, the
bit_usage will reflect the pseudo-locked region, and the character device
exposing the pseudo-locked region will exist::

  # cat newlock/mode
  pseudo-locked
  # cat info/L2/bit_usage
  0=SSSSSSSS;1=SSSSSSPP
  # ls -l /dev/pseudo_lock/newlock
  crw------- 1 root root 243, 0 Apr  3 05:01 /dev/pseudo_lock/newlock

::

  /*
   * Example code to access one page of pseudo-locked cache region
   * from user space.
   */
  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <sched.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>
  #include <sys/mman.h>

  /*
   * It is required that the application runs with affinity to only
   * cores associated with the pseudo-locked region. Here the cpu
   * is hardcoded for convenience of example.
   */
  static int cpuid = 2;

  int main(int argc, char *argv[])
  {
    cpu_set_t cpuset;
    long page_size;
    void *mapping;
    int dev_fd;
    int ret;

    page_size = sysconf(_SC_PAGESIZE);

    CPU_ZERO(&cpuset);
    CPU_SET(cpuid, &cpuset);
    ret = sched_setaffinity(0, sizeof(cpuset), &cpuset);
    if (ret < 0) {
      perror("sched_setaffinity");
      exit(EXIT_FAILURE);
    }

    dev_fd = open("/dev/pseudo_lock/newlock", O_RDWR);
    if (dev_fd < 0) {
      perror("open");
      exit(EXIT_FAILURE);
    }

    mapping = mmap(0, page_size, PROT_READ | PROT_WRITE, MAP_SHARED,
            dev_fd, 0);
    if (mapping == MAP_FAILED) {
      perror("mmap");
      close(dev_fd);
      exit(EXIT_FAILURE);
    }

    /* Application interacts with pseudo-locked memory @mapping */

    ret = munmap(mapping, page_size);
    if (ret < 0) {
      perror("munmap");
      close(dev_fd);
      exit(EXIT_FAILURE);
    }

    close(dev_fd);
    exit(EXIT_SUCCESS);
  }

Locking between applications
----------------------------

Certain operations on the resctrl filesystem, composed of reads and writes
of multiple files, must be atomic.

As an example, the allocation of an exclusive reservation of L3 cache
involves:

  1. Read the CBMs from each directory or the per-resource "bit_usage"
  2. Find a contiguous set of bits in the global CBM bitmask that is clear
     in all of the directory CBMs
  3. Create a new directory
  4. Write the bits found in step 2 to the new directory's "schemata" file
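
Step 2 can be sketched in POSIX shell as follows. This is only an
illustration: the helper name "find_free_cbm" is hypothetical, the "used"
argument is assumed to be the bitwise OR of every group's CBM for one cache
instance, and the width is assumed to be the number of bits set in
info/L3/cbm_mask. Hardware constraints such as info/L3/min_cbm_bits are
not checked::

  # find_free_cbm <used mask> <CBM width in bits> <bits wanted>
  # Prints the lowest contiguous free mask in hex, or fails if none fits.
  find_free_cbm() {
      ffc_used=$(( $1 ))
      ffc_width=$2
      ffc_want=$3
      ffc_run=$(( (1 << ffc_want) - 1 ))
      ffc_shift=0
      # Slide a run of <bits wanted> ones across the CBM, lowest bit first.
      while [ $(( ffc_shift + ffc_want )) -le "$ffc_width" ]; do
          ffc_cand=$(( ffc_run << ffc_shift ))
          if [ $(( ffc_cand & ffc_used )) -eq 0 ]; then
              printf '%x\n' "$ffc_cand"
              return 0
          fi
          ffc_shift=$(( ffc_shift + 1 ))
      done
      return 1
  }

For example, with an 8-bit CBM where bits 0xc3 are already in use,
find_free_cbm 0xc3 8 2 prints c. The search and the subsequent write to
"schemata" only give an exclusive result if no other application changes
the masks in between.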

If two applications attempt to allocate space concurrently then they can
end up allocating the same bits so the reservations are shared instead of
exclusive.

To coordinate atomic operations on the resctrl filesystem and to avoid the
problem above, the following locking procedure is recommended:

Locking is based on flock, which is available in libc and also as a shell
command.

Write lock:

 A) Take flock(LOCK_EX) on /sys/fs/resctrl
 B) Read/write the directory structure.
 C) Release the lock with flock(LOCK_UN)

Read lock:

 A) Take flock(LOCK_SH) on /sys/fs/resctrl
 B) On success, read the directory structure.
 C) Release the lock with flock(LOCK_UN)

Example with bash::

  # Atomically read directory structure
  $ flock -s /sys/fs/resctrl/ find /sys/fs/resctrl

  # Read directory contents and create new subdirectory

  $ cat create-dir.sh
  find /sys/fs/resctrl/ > output.txt
  # function-of is a placeholder for the logic that chooses the new mask
  mask=$(function-of output.txt)
  mkdir /sys/fs/resctrl/newres/
  echo "$mask" > /sys/fs/resctrl/newres/schemata

  $ flock /sys/fs/resctrl/ ./create-dir.sh

Example with C::

  /*
   * Example code to take advisory locks
   * before accessing resctrl filesystem
   */
  #include <fcntl.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/file.h>
  #include <unistd.h>

  void resctrl_take_shared_lock(int fd)
  {
    int ret;

    /* take shared lock on resctrl filesystem */
    ret = flock(fd, LOCK_SH);
    if (ret) {
      perror("flock");
      exit(EXIT_FAILURE);
    }
  }

  void resctrl_take_exclusive_lock(int fd)
  {
    int ret;

    /* take exclusive lock on resctrl filesystem */
    ret = flock(fd, LOCK_EX);
    if (ret) {
      perror("flock");
      exit(EXIT_FAILURE);
    }
  }

  void resctrl_release_lock(int fd)
  {
    int ret;

    /* release lock on resctrl filesystem */
    ret = flock(fd, LOCK_UN);
    if (ret) {
      perror("flock");
      exit(EXIT_FAILURE);
    }
  }

  int main(void)
  {
    int fd;

    fd = open("/sys/fs/resctrl", O_RDONLY | O_DIRECTORY);
    if (fd == -1) {
      perror("open");
      exit(EXIT_FAILURE);
    }
    resctrl_take_shared_lock(fd);
    /* code to read directory contents */
    resctrl_release_lock(fd);

    resctrl_take_exclusive_lock(fd);
    /* code to read and write directory contents */
    resctrl_release_lock(fd);

    close(fd);
    return 0;
  }

Examples for RDT Monitoring along with allocation usage
=======================================================
Reading monitored data
----------------------
Reading an event file (for example, mon_data/mon_L3_00/llc_occupancy)
shows the current snapshot of LLC occupancy of the corresponding MON
group or CTRL_MON group.


Example 1 (Monitor CTRL_MON group and subset of tasks in CTRL_MON group)
------------------------------------------------------------------------
On a two socket machine (one L3 cache per socket) with just four bits
for cache bit masks::

  # mount -t resctrl resctrl /sys/fs/resctrl
  # cd /sys/fs/resctrl
  # mkdir p0 p1
  # echo "L3:0=3;1=c" > /sys/fs/resctrl/p0/schemata
  # echo "L3:0=3;1=3" > /sys/fs/resctrl/p1/schemata
  # echo 5678 > p1/tasks
  # echo 5679 > p1/tasks

The default resource group is unmodified, so we have access to all parts
of all caches (its schemata file reads "L3:0=f;1=f").

Tasks that are under the control of group "p0" may only allocate from the
"lower" 50% on cache ID 0, and the "upper" 50% of cache ID 1.
Tasks in group "p1" use the "lower" 50% of cache on both sockets.

Create monitor groups and assign a subset of tasks to each monitor group.
::

  # cd /sys/fs/resctrl/p1/mon_groups
  # mkdir m11 m12
  # echo 5678 > m11/tasks
  # echo 5679 > m12/tasks

Fetch the data (shown in bytes).
::

  # cat m11/mon_data/mon_L3_00/llc_occupancy
  16234000
  # cat m11/mon_data/mon_L3_01/llc_occupancy
  14789000
  # cat m12/mon_data/mon_L3_00/llc_occupancy
  16789000

The parent CTRL_MON group shows the aggregated data.
::

  # cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy
  31234000
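
Occupancy is reported per L3 domain, so a group's footprint across the
whole machine is the sum over its mon_L3_* directories. A minimal sketch
in POSIX shell, where the helper name "sum_llc_occupancy" is hypothetical
and non-numeric readings such as "Unavailable" are simply skipped::

  # sum_llc_occupancy <group directory>
  # Prints the total llc_occupancy in bytes over all L3 domains.
  sum_llc_occupancy() {
      slo_total=0
      for slo_file in "$1"/mon_data/mon_L3_*/llc_occupancy; do
          [ -r "$slo_file" ] || continue
          slo_val=$(cat "$slo_file")
          case $slo_val in
              ''|*[!0-9]*) continue ;;    # not a plain byte count
          esac
          slo_total=$(( slo_total + slo_val ))
      done
      printf '%s\n' "$slo_total"
  }

For instance, sum_llc_occupancy /sys/fs/resctrl/p1/mon_groups/m11 adds the
two m11 readings shown above.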

Example 2 (Monitor a task from its creation)
--------------------------------------------
On a two socket machine (one L3 cache per socket)::

  # mount -t resctrl resctrl /sys/fs/resctrl
  # cd /sys/fs/resctrl
  # mkdir p0 p1

An RMID is allocated to the group once it is created and hence the <cmd>
below is monitored from its creation.
::

  # echo $$ > /sys/fs/resctrl/p1/tasks
  # <cmd>

Fetch the data::

  # cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy
  31789000

Example 3 (Monitor without CAT support or before creating CAT groups)
---------------------------------------------------------------------

Assume a system such as Haswell (HSW) that has only CQM and no CAT support.
In this case resctrl will still mount, but CTRL_MON directories cannot be
created. The user can still create MON groups within the root group and
thereby monitor all tasks, including kernel threads.

This can also be used to profile a job's cache footprint before
assigning it to an allocation group.
::

  # mount -t resctrl resctrl /sys/fs/resctrl
  # cd /sys/fs/resctrl
  # mkdir mon_groups/m01
  # mkdir mon_groups/m02

  # echo 3478 > /sys/fs/resctrl/mon_groups/m01/tasks
  # echo 2467 > /sys/fs/resctrl/mon_groups/m02/tasks

Monitor the groups separately and also get per domain data. From the
output below it is apparent that the tasks are mostly doing work on
domain (socket) 0.
::

  # cat /sys/fs/resctrl/mon_groups/m01/mon_data/mon_L3_00/llc_occupancy
  31234000
  # cat /sys/fs/resctrl/mon_groups/m01/mon_data/mon_L3_01/llc_occupancy
  34555
  # cat /sys/fs/resctrl/mon_groups/m02/mon_data/mon_L3_00/llc_occupancy
  31234000
  # cat /sys/fs/resctrl/mon_groups/m02/mon_data/mon_L3_01/llc_occupancy
  32789


Example 4 (Monitor real time tasks)
-----------------------------------

Consider a single socket system with real time tasks running on cores 4-7
and non real time tasks on the other CPUs. We want to monitor the cache
occupancy of the real time threads on these cores.
::

  # mount -t resctrl resctrl /sys/fs/resctrl
  # cd /sys/fs/resctrl
  # mkdir p1

Move CPUs 4-7 over to p1::

  # echo f0 > p1/cpus
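
The value f0 is the hex bitmask with bits 4 to 7 set. As a small sketch,
a contiguous CPU range can be turned into such a mask with shell
arithmetic. The helper name "cpu_range_mask" is hypothetical, and the
approach only works for CPU numbers that fit in the shell's integer width
(typically below 63)::

  # cpu_range_mask <first cpu> <last cpu>
  # Prints the hex mask with one bit set per CPU in the range.
  cpu_range_mask() {
      crm_count=$(( $2 - $1 + 1 ))
      printf '%x\n' $(( ((1 << crm_count) - 1) << $1 ))
  }

Here cpu_range_mask 4 7 prints f0. Where the group's "cpus_list" file is
available, writing the range 4-7 to it avoids the conversion altogether.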

View the LLC occupancy snapshot::

  # cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy
  11234000


Examples on working with mbm_assign_mode
========================================

a. Check if MBM counter assignment mode is supported.
::

  # mount -t resctrl resctrl /sys/fs/resctrl/

  # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
  [mbm_event]
  default

The "mbm_event" mode is detected and enabled.

b. Check how many assignable counters are supported.
::

  # cat /sys/fs/resctrl/info/L3_MON/num_mbm_cntrs
  0=32;1=32

c. Check how many assignable counters are available for assignment in each domain.
::

  # cat /sys/fs/resctrl/info/L3_MON/available_mbm_cntrs
  0=30;1=30

d. To list the default group's assign states.
::

  # cat /sys/fs/resctrl/mbm_L3_assignments
  mbm_total_bytes:0=e;1=e
  mbm_local_bytes:0=e;1=e

e. To unassign the counter associated with the mbm_total_bytes event on domain 0.
::

  # echo "mbm_total_bytes:0=_" > /sys/fs/resctrl/mbm_L3_assignments
  # cat /sys/fs/resctrl/mbm_L3_assignments
  mbm_total_bytes:0=_;1=e
  mbm_local_bytes:0=e;1=e

f. To unassign the counter associated with the mbm_total_bytes event on all domains.
::

  # echo "mbm_total_bytes:*=_" > /sys/fs/resctrl/mbm_L3_assignments
  # cat /sys/fs/resctrl/mbm_L3_assignments
  mbm_total_bytes:0=_;1=_
  mbm_local_bytes:0=e;1=e

g. To assign a counter associated with the mbm_total_bytes event on all domains in
exclusive mode.
::

  # echo "mbm_total_bytes:*=e" > /sys/fs/resctrl/mbm_L3_assignments
  # cat /sys/fs/resctrl/mbm_L3_assignments
  mbm_total_bytes:0=e;1=e
  mbm_local_bytes:0=e;1=e

h. Read the events mbm_total_bytes and mbm_local_bytes of the default group. There is
no change in reading the events with the assignment.
::

  # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
  779247936
  # cat /sys/fs/resctrl/mon_data/mon_L3_01/mbm_total_bytes
  562324232
  # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
  212122123
  # cat /sys/fs/resctrl/mon_data/mon_L3_01/mbm_local_bytes
  121212144

i. Check the event configurations.
::

  # cat /sys/fs/resctrl/info/L3_MON/event_configs/mbm_total_bytes/event_filter
  local_reads,remote_reads,local_non_temporal_writes,remote_non_temporal_writes,
  local_reads_slow_memory,remote_reads_slow_memory,dirty_victim_writes_all

  # cat /sys/fs/resctrl/info/L3_MON/event_configs/mbm_local_bytes/event_filter
  local_reads,local_non_temporal_writes,local_reads_slow_memory

j. Change the event configuration for mbm_local_bytes.
::

  # echo "local_reads, local_non_temporal_writes, local_reads_slow_memory, remote_reads" > \
    /sys/fs/resctrl/info/L3_MON/event_configs/mbm_local_bytes/event_filter

  # cat /sys/fs/resctrl/info/L3_MON/event_configs/mbm_local_bytes/event_filter
  local_reads,local_non_temporal_writes,local_reads_slow_memory,remote_reads

k. Now read the local events again. The first read may come back with "Unavailable"
status. The subsequent read of mbm_local_bytes will display the current value.
::

  # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
  Unavailable
  # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
  2252323
  # cat /sys/fs/resctrl/mon_data/mon_L3_01/mbm_local_bytes
  Unavailable
  # cat /sys/fs/resctrl/mon_data/mon_L3_01/mbm_local_bytes
  1566565

l. Users have the option to go back to 'default' mbm_assign_mode if required. This can be
done using the following command. Note that switching the mbm_assign_mode may reset all
the MBM counters (and thus all MBM events) of all the resctrl groups.
::

  # echo "default" > /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
  # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
  mbm_event
  [default]

m. Unmount the resctrl filesystem.
::

  # umount /sys/fs/resctrl/
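
In the steps above, each line of mbm_L3_assignments has the form
<event>:<domain>=<state>;..., where "e" marks an exclusively assigned
counter and "_" marks no counter. As a rough sketch, a script could list
the domains on which an event currently has no counter as shown below.
The helper name "unassigned_domains" is hypothetical and only the line
format shown in this section is assumed::

  # unassigned_domains <one line of mbm_L3_assignments>
  # Prints, one per line, the domain ids whose state is "_".
  unassigned_domains() {
      printf '%s\n' "${1#*:}" | tr ';' '\n' |
          sed -n 's/^\([0-9][0-9]*\)=_$/\1/p'
  }

For the output of step e, unassigned_domains "mbm_total_bytes:0=_;1=e"
prints 0.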

Intel RDT Errata
================

Intel MBM Counters May Report System Memory Bandwidth Incorrectly
-----------------------------------------------------------------

Errata SKX99 for Skylake server and BDF102 for Broadwell server.

Problem: Intel Memory Bandwidth Monitoring (MBM) counters track metrics
according to the assigned Resource Monitor ID (RMID) for that logical
core. The IA32_QM_CTR register (MSR 0xC8E), used to report these
metrics, may report incorrect system bandwidth for certain RMID values.

Implication: Due to the errata, system memory bandwidth may not match
what is reported.

Workaround: MBM total and local readings are corrected according to the
following correction factor table:

+---------------+---------------+---------------+-----------------+
|core count     |rmid count     |rmid threshold |correction factor|
+---------------+---------------+---------------+-----------------+
|1              |8              |0              |1.000000         |
+---------------+---------------+---------------+-----------------+
|2              |16             |0              |1.000000         |
+---------------+---------------+---------------+-----------------+
|3              |24             |15             |0.969650         |
+---------------+---------------+---------------+-----------------+
|4              |32             |0              |1.000000         |
+---------------+---------------+---------------+-----------------+
|6              |48             |31             |0.969650         |
+---------------+---------------+---------------+-----------------+
|7              |56             |47             |1.142857         |
+---------------+---------------+---------------+-----------------+
|8              |64             |0              |1.000000         |
+---------------+---------------+---------------+-----------------+
|9              |72             |63             |1.185115         |
+---------------+---------------+---------------+-----------------+
|10             |80             |63             |1.066553         |
+---------------+---------------+---------------+-----------------+
|11             |88             |79             |1.454545         |
+---------------+---------------+---------------+-----------------+
|12             |96             |0              |1.000000         |
+---------------+---------------+---------------+-----------------+
|13             |104            |95             |1.230769         |
+---------------+---------------+---------------+-----------------+
|14             |112            |95             |1.142857         |
+---------------+---------------+---------------+-----------------+
|15             |120            |95             |1.066667         |
+---------------+---------------+---------------+-----------------+
|16             |128            |0              |1.000000         |
+---------------+---------------+---------------+-----------------+
|17             |136            |127            |1.254863         |
+---------------+---------------+---------------+-----------------+
|18             |144            |127            |1.185255         |
+---------------+---------------+---------------+-----------------+
|19             |152            |0              |1.000000         |
+---------------+---------------+---------------+-----------------+
|20             |160            |127            |1.066667         |
+---------------+---------------+---------------+-----------------+
|21             |168            |0              |1.000000         |
+---------------+---------------+---------------+-----------------+
|22             |176            |159            |1.454334         |
+---------------+---------------+---------------+-----------------+
|23             |184            |0              |1.000000         |
+---------------+---------------+---------------+-----------------+
|24             |192            |127            |0.969744         |
+---------------+---------------+---------------+-----------------+
|25             |200            |191            |1.280246         |
+---------------+---------------+---------------+-----------------+
|26             |208            |191            |1.230921         |
+---------------+---------------+---------------+-----------------+
|27             |216            |0              |1.000000         |
+---------------+---------------+---------------+-----------------+
|28             |224            |191            |1.143118         |
+---------------+---------------+---------------+-----------------+

If rmid > rmid threshold, MBM total and local values should be multiplied
by the correction factor.
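
The rule can be illustrated with a short shell sketch. It only restates
the arithmetic above in user space and does not show how the kernel
applies the correction. The helper name "mbm_correct" is hypothetical,
the threshold and factor are taken from the table row for the core count,
and the result is rounded to the nearest byte using awk's floating point
arithmetic (so very large counts lose precision)::

  # mbm_correct <raw value> <rmid> <rmid threshold> <correction factor>
  # Prints the corrected value, or the raw value if rmid <= threshold.
  mbm_correct() {
      awk -v val="$1" -v rmid="$2" -v thr="$3" -v cf="$4" 'BEGIN {
          if (rmid + 0 > thr + 0)
              val = val * cf
          printf "%.0f\n", val
      }'
  }

For a 3 core part (threshold 15, factor 0.969650), a raw reading of
1000000 for RMID 20 becomes 969650, while the same reading for RMID 15
is left unchanged.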

See:

1. Erratum SKX99 in Intel Xeon Processor Scalable Family Specification Update:
http://web.archive.org/web/20200716124958/https://www.intel.com/content/www/us/en/processors/xeon/scalable/xeon-scalable-spec-update.html

2. Erratum BDF102 in Intel Xeon E5-2600 v4 Processor Product Family Specification Update:
http://web.archive.org/web/20191125200531/https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-e5-v4-spec-update.pdf

3. The errata in Intel Resource Director Technology (Intel RDT) on 2nd Generation Intel Xeon Scalable Processors Reference Manual:
https://software.intel.com/content/www/us/en/develop/articles/intel-resource-director-technology-rdt-reference-manual.html

for further information.
