.. SPDX-License-Identifier: GPL-2.0
.. include:: <isonum.txt>

=====================================================
User Interface for Resource Control feature (resctrl)
=====================================================

:Copyright: |copy| 2016 Intel Corporation
:Authors: - Fenghua Yu <fenghua.yu@intel.com>
          - Tony Luck <tony.luck@intel.com>
          - Vikas Shivappa <vikas.shivappa@intel.com>


Intel refers to this feature as Intel Resource Director Technology (Intel(R) RDT).
AMD refers to this feature as AMD Platform Quality of Service (AMD QoS).
16
17This feature is enabled by the CONFIG_X86_CPU_RESCTRL and the x86 /proc/cpuinfo
18flag bits:
19
20=============================================================== ================================
21RDT (Resource Director Technology) Allocation			"rdt_a"
22CAT (Cache Allocation Technology)				"cat_l3", "cat_l2"
23CDP (Code and Data Prioritization)				"cdp_l3", "cdp_l2"
24CQM (Cache QoS Monitoring)					"cqm_llc", "cqm_occup_llc"
25MBM (Memory Bandwidth Monitoring)				"cqm_mbm_total", "cqm_mbm_local"
26MBA (Memory Bandwidth Allocation)				"mba"
27SMBA (Slow Memory Bandwidth Allocation)				""
28BMEC (Bandwidth Monitoring Event Configuration)			""
29ABMC (Assignable Bandwidth Monitoring Counters)			""
30SDCIAE (Smart Data Cache Injection Allocation Enforcement)	""
31=============================================================== ================================
32
33Historically, new features were made visible by default in /proc/cpuinfo. This
34resulted in the feature flags becoming hard to parse by humans. Adding a new
35flag to /proc/cpuinfo should be avoided if user space can obtain information
36about the feature from resctrl's info directory.
37
To use the feature, mount the file system::
39
40 # mount -t resctrl resctrl [-o cdp[,cdpl2][,mba_MBps][,debug]] /sys/fs/resctrl
41
Mount options are:
43
44"cdp":
45	Enable code/data prioritization in L3 cache allocations.
46"cdpl2":
47	Enable code/data prioritization in L2 cache allocations.
48"mba_MBps":
	Enable the MBA Software Controller (mba_sc) to specify MBA
	bandwidth in MiBps.
51"debug":
52	Make debug files accessible. Available debug files are annotated with
53	"Available only with debug option".
54
55L2 and L3 CDP are controlled separately.
56
57RDT features are orthogonal. A particular system may support only
58monitoring, only control, or both monitoring and control.  Cache
59pseudo-locking is a unique way of using cache control to "pin" or
60"lock" data in the cache. Details can be found in
61"Cache Pseudo-Locking".
62
63
The mount succeeds if either allocation or monitoring is present, but
only those files and directories supported by the system will be created.
For more details on the behavior of the interface during monitoring
and allocation, see the "Resource alloc and monitor groups" section.
68
Info directory
==============
71
72The 'info' directory contains information about the enabled
73resources. Each resource has its own subdirectory. The subdirectory
74names reflect the resource names.
75
76Most of the files in the resource's subdirectory are read-only, and
77describe properties of the resource. Resources that support global
78configuration options also include writable files that can be used
79to modify those settings.
80
Each resource subdirectory contains files that describe allocation. The
cache resource (L3/L2) subdirectories contain the following files:
86
87"num_closids":
88		The number of CLOSIDs which are valid for this
89		resource. The kernel uses the smallest number of
90		CLOSIDs of all enabled resources as limit.
91"cbm_mask":
92		The bitmask which is valid for this resource.
93		This mask is equivalent to 100%.
94"min_cbm_bits":
95		The minimum number of consecutive bits which
96		must be set when writing a mask.
97
98"shareable_bits":
99		Bitmask of shareable resource with other executing entities
100		(e.g. I/O). Applies to all instances of this resource. User
101		can use this when setting up exclusive cache partitions.
102		Note that some platforms support devices that have their
103		own settings for cache use which can over-ride these bits.
104
		When "io_alloc" is enabled, a portion of each cache instance can
		be configured for shared use between hardware and software.
		"bit_usage" should be used to see which portions of each cache
		instance are configured for hardware use via the "io_alloc" feature
		because every cache instance can have its "io_alloc" bitmask
		configured independently via "io_alloc_cbm".
111
112"bit_usage":
113		Annotated capacity bitmasks showing how all
114		instances of the resource are used. The legend is:
115
116			"0":
117			      Corresponding region is unused. When the system's
118			      resources have been allocated and a "0" is found
119			      in "bit_usage" it is a sign that resources are
120			      wasted.
121
122			"H":
123			      Corresponding region is used by hardware only
124			      but available for software use. If a resource
125			      has bits set in "shareable_bits" or "io_alloc_cbm"
126			      but not all of these bits appear in the resource
127			      groups' schemata then the bits appearing in
128			      "shareable_bits" or "io_alloc_cbm" but no
129			      resource group will be marked as "H".
130			"X":
131			      Corresponding region is available for sharing and
132			      used by hardware and software. These are the bits
133			      that appear in "shareable_bits" or "io_alloc_cbm"
134			      as well as a resource group's allocation.
135			"S":
136			      Corresponding region is used by software
137			      and available for sharing.
138			"E":
139			      Corresponding region is used exclusively by
140			      one resource group. No sharing allowed.
141			"P":
142			      Corresponding region is pseudo-locked. No
143			      sharing allowed.
144"sparse_masks":
		Indicates if non-contiguous 1s values in a CBM are supported.

			"0":
			      Only contiguous 1s values in a CBM are supported.
			"1":
			      Non-contiguous 1s values in a CBM are supported.
151
152"io_alloc":
153		"io_alloc" enables system software to configure the portion of
154		the cache allocated for I/O traffic. File may only exist if the
155		system supports this feature on some of its cache resources.
156
157			"disabled":
158			      Resource supports "io_alloc" but the feature is disabled.
159			      Portions of cache used for allocation of I/O traffic cannot
160			      be configured.
161			"enabled":
162			      Portions of cache used for allocation of I/O traffic
163			      can be configured using "io_alloc_cbm".
164			"not supported":
165			      Support not available for this resource.
166
167		The feature can be modified by writing to the interface, for example:
168
169		To enable::
170
171			# echo 1 > /sys/fs/resctrl/info/L3/io_alloc
172
173		To disable::
174
175			# echo 0 > /sys/fs/resctrl/info/L3/io_alloc
176
177		The underlying implementation may reduce resources available to
178		general (CPU) cache allocation. See architecture specific notes
179		below. Depending on usage requirements the feature can be enabled
180		or disabled.
181
182		On AMD systems, io_alloc feature is supported by the L3 Smart
183		Data Cache Injection Allocation Enforcement (SDCIAE). The CLOSID for
184		io_alloc is the highest CLOSID supported by the resource. When
185		io_alloc is enabled, the highest CLOSID is dedicated to io_alloc and
186		no longer available for general (CPU) cache allocation. When CDP is
187		enabled, io_alloc routes I/O traffic using the highest CLOSID allocated
188		for the instruction cache (CDP_CODE), making this CLOSID no longer
189		available for general (CPU) cache allocation for both the CDP_CODE
190		and CDP_DATA resources.
191
192"io_alloc_cbm":
		Capacity bitmasks that describe the portions of cache instances to
		which I/O traffic from supported I/O devices is routed when "io_alloc"
		is enabled.
196
197		CBMs are displayed in the following format:
198
199			<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
200
201		Example::
202
203			# cat /sys/fs/resctrl/info/L3/io_alloc_cbm
204			0=ffff;1=ffff
205
206		CBMs can be configured by writing to the interface.
207
208		Example::
209
210			# echo 1=ff > /sys/fs/resctrl/info/L3/io_alloc_cbm
211			# cat /sys/fs/resctrl/info/L3/io_alloc_cbm
212			0=ffff;1=00ff
213
214			# echo "0=ff;1=f" > /sys/fs/resctrl/info/L3/io_alloc_cbm
215			# cat /sys/fs/resctrl/info/L3/io_alloc_cbm
216			0=00ff;1=000f
217
218		An ID of "*" configures all domains with the provided CBM.
219
220		Example on a system that does not require a minimum number of consecutive bits in the mask::
221
222			# echo "*=0" > /sys/fs/resctrl/info/L3/io_alloc_cbm
223			# cat /sys/fs/resctrl/info/L3/io_alloc_cbm
224			0=0;1=0
225
226		When CDP is enabled "io_alloc_cbm" associated with the CDP_DATA and CDP_CODE
227		resources may reflect the same values. For example, values read from and
228		written to /sys/fs/resctrl/info/L3DATA/io_alloc_cbm may be reflected by
229		/sys/fs/resctrl/info/L3CODE/io_alloc_cbm and vice versa.
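
The per-domain "<id>=<value>;<id>=<value>" layout used by "io_alloc_cbm" is
easy to post-process in scripts. The following is a minimal sketch, assuming a
POSIX shell. The helper name parse_domains is hypothetical and not part of
resctrl, and the sample string is taken from the example above.

```shell
# Split "<id>=<value>;<id>=<value>;..." into one "<id> <value>" pair per line.
# parse_domains is a hypothetical helper name, not a resctrl interface.
parse_domains() {
    printf '%s\n' "$1" | tr ';' '\n' | while IFS='=' read -r id value; do
        if [ -n "$id" ]; then
            printf '%s %s\n' "$id" "$value"
        fi
    done
}

# Sample string copied from the io_alloc_cbm example.
parse_domains "0=ffff;1=00ff"
```

On a live system the argument could be supplied as
"$(cat /sys/fs/resctrl/info/L3/io_alloc_cbm)".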
230
The memory bandwidth (MB) subdirectory contains the following files
with respect to allocation:
233
234"min_bandwidth":
235		The minimum memory bandwidth percentage which
236		user can request.
237
238"bandwidth_gran":
		The granularity in which the memory bandwidth
		percentage is allocated. The allocated
		bandwidth percentage is rounded off to the next
		control step available on the hardware. The
		available bandwidth control steps are:
		min_bandwidth + N * bandwidth_gran.
245
246"delay_linear":
		Indicates if the delay scale is linear or
		non-linear. This field is purely informational.
250
251"thread_throttle_mode":
252		Indicator on Intel systems of how tasks running on threads
253		of a physical core are throttled in cases where they
254		request different memory bandwidth percentages:
255
256		"max":
257			the smallest percentage is applied
258			to all threads
259		"per-thread":
260			bandwidth percentages are directly applied to
261			the threads running on the core
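
The control steps described under "bandwidth_gran" follow directly from
min_bandwidth + N * bandwidth_gran. The sketch below, assuming a POSIX shell,
lists them. The helper name mba_steps, the sample values 10 and 10, and the cap
at 100 percent are assumptions made for illustration; the real values should be
read from the "min_bandwidth" and "bandwidth_gran" files.

```shell
# Print the bandwidth control steps min_bandwidth + N * bandwidth_gran.
# mba_steps is a hypothetical helper; the cap at 100 assumes that the
# values are percentages.
mba_steps() {
    min=$1
    gran=$2
    # A zero or negative granularity would never reach the cap.
    [ "$gran" -gt 0 ] || return 1
    step=$min
    while [ "$step" -le 100 ]; do
        printf '%s\n' "$step"
        step=$((step + gran))
    done
}

# Assumed sample values, not taken from real hardware.
mba_steps 10 10
```

With these sample inputs the steps run from 10 to 100 in increments of 10.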
262
263If L3 monitoring is available there will be an "L3_MON" directory
264with the following files:
265
266"num_rmids":
267		The number of RMIDs supported by hardware for
268		L3 monitoring events.
269
270"mon_features":
271		Lists the monitoring events if
272		monitoring is enabled for the resource.
273		Example::
274
275			# cat /sys/fs/resctrl/info/L3_MON/mon_features
276			llc_occupancy
277			mbm_total_bytes
278			mbm_local_bytes
279
280		If the system supports Bandwidth Monitoring Event
281		Configuration (BMEC), then the bandwidth events will
282		be configurable. The output will be::
283
284			# cat /sys/fs/resctrl/info/L3_MON/mon_features
285			llc_occupancy
286			mbm_total_bytes
287			mbm_total_bytes_config
288			mbm_local_bytes
289			mbm_local_bytes_config
290
291"mbm_total_bytes_config", "mbm_local_bytes_config":
292	Read/write files containing the configuration for the mbm_total_bytes
293	and mbm_local_bytes events, respectively, when the Bandwidth
294	Monitoring Event Configuration (BMEC) feature is supported.
295	The event configuration settings are domain specific and affect
296	all the CPUs in the domain. When either event configuration is
297	changed, the bandwidth counters for all RMIDs of both events
298	(mbm_total_bytes as well as mbm_local_bytes) are cleared for that
299	domain. The next read for every RMID will report "Unavailable"
300	and subsequent reads will report the valid value.
301
302	Following are the types of events supported:
303
304	====    ========================================================
305	Bits    Description
306	====    ========================================================
307	6       Dirty Victims from the QOS domain to all types of memory
308	5       Reads to slow memory in the non-local NUMA domain
309	4       Reads to slow memory in the local NUMA domain
310	3       Non-temporal writes to non-local NUMA domain
311	2       Non-temporal writes to local NUMA domain
312	1       Reads to memory in the non-local NUMA domain
313	0       Reads to memory in the local NUMA domain
314	====    ========================================================
315
316	By default, the mbm_total_bytes configuration is set to 0x7f to count
317	all the event types and the mbm_local_bytes configuration is set to
318	0x15 to count all the local memory events.
319
320	Examples:
321
	* To view the current configuration:
	  ::
324
325	    # cat /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config
326	    0=0x7f;1=0x7f;2=0x7f;3=0x7f
327
328	    # cat /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config
329	    0=0x15;1=0x15;3=0x15;4=0x15
330
	* To change mbm_total_bytes to count only reads on domain 0,
	  the bits 0, 1, 4 and 5 need to be set, which is 110011b in binary
	  (in hexadecimal 0x33):
334	  ::
335
336	    # echo  "0=0x33" > /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config
337
338	    # cat /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config
339	    0=0x33;1=0x7f;2=0x7f;3=0x7f
340
	* To change mbm_local_bytes to count all the slow memory reads on
	  domain 0 and 1, the bits 4 and 5 need to be set, which is 110000b
	  in binary (in hexadecimal 0x30):
344	  ::
345
346	    # echo  "0=0x30;1=0x30" > /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config
347
348	    # cat /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config
349	    0=0x30;1=0x30;3=0x15;4=0x15
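
The hexadecimal values used in the examples above can be derived from the bit
numbers in the event table. The following is a minimal sketch, assuming a POSIX
shell. The helper name mbm_config_mask is hypothetical and not part of resctrl.

```shell
# Build an event configuration value from a list of bit numbers (0-6)
# taken from the event table. mbm_config_mask is a hypothetical helper.
mbm_config_mask() {
    mask=0
    for bit in "$@"; do
        mask=$((mask | (1 << bit)))
    done
    printf '0x%x\n' "$mask"
}

mbm_config_mask 0 1 4 5   # all reads: prints 0x33
mbm_config_mask 4 5       # slow memory reads: prints 0x30
```

The same helper reproduces the defaults: bits 0 to 6 give 0x7f and bits 0, 2
and 4 give 0x15.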
350
351"mbm_assign_mode":
352	The supported counter assignment modes. The enclosed brackets indicate which mode
353	is enabled. The MBM events associated with counters may reset when "mbm_assign_mode"
354	is changed.
355	::
356
357	  # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
358	  [mbm_event]
359	  default
360
361	"mbm_event":
362
363	mbm_event mode allows users to assign a hardware counter to an RMID, event
364	pair and monitor the bandwidth usage as long as it is assigned. The hardware
365	continues to track the assigned counter until it is explicitly unassigned by
366	the user. Each event within a resctrl group can be assigned independently.
367
368	In this mode, a monitoring event can only accumulate data while it is backed
369	by a hardware counter. Use "mbm_L3_assignments" found in each CTRL_MON and MON
370	group to specify which of the events should have a counter assigned. The number
371	of counters available is described in the "num_mbm_cntrs" file. Changing the
372	mode may cause all counters on the resource to reset.
373
374	Moving to mbm_event counter assignment mode requires users to assign the counters
375	to the events. Otherwise, the MBM event counters will return 'Unassigned' when read.
376
377	The mode is beneficial for AMD platforms that support more CTRL_MON
378	and MON groups than available hardware counters. By default, this
379	feature is enabled on AMD platforms with the ABMC (Assignable Bandwidth
380	Monitoring Counters) capability, ensuring counters remain assigned even
381	when the corresponding RMID is not actively used by any processor.
382
383	"default":
384
385	In default mode, resctrl assumes there is a hardware counter for each
386	event within every CTRL_MON and MON group. On AMD platforms, it is
387	recommended to use the mbm_event mode, if supported, to prevent reset of MBM
388	events between reads resulting from hardware re-allocating counters. This can
389	result in misleading values or display "Unavailable" if no counter is assigned
390	to the event.
391
392	* To enable "mbm_event" counter assignment mode:
393	  ::
394
395	    # echo "mbm_event" > /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
396
397	* To enable "default" monitoring mode:
398	  ::
399
400	    # echo "default" > /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
401
402"num_mbm_cntrs":
403	The maximum number of counters (total of available and assigned counters) in
404	each domain when the system supports mbm_event mode.
405
406	For example, on a system with maximum of 32 memory bandwidth monitoring
407	counters in each of its L3 domains:
408	::
409
410	  # cat /sys/fs/resctrl/info/L3_MON/num_mbm_cntrs
411	  0=32;1=32
412
413"available_mbm_cntrs":
414	The number of counters available for assignment in each domain when mbm_event
415	mode is enabled on the system.
416
417	For example, on a system with 30 available [hardware] assignable counters
418	in each of its L3 domains:
419	::
420
421	  # cat /sys/fs/resctrl/info/L3_MON/available_mbm_cntrs
422	  0=30;1=30
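
Because "num_mbm_cntrs" is the total of available and assigned counters, the
number of assigned counters per domain is the difference between the two
files. The sketch below, assuming a POSIX shell, computes it. The helper name
assigned_cntrs is hypothetical, and it assumes that both strings list the same
domains in the same order.

```shell
# Print assigned counters per domain: num_mbm_cntrs minus available_mbm_cntrs.
# $1: contents of num_mbm_cntrs, $2: contents of available_mbm_cntrs.
# assigned_cntrs is a hypothetical helper; both arguments are assumed to
# list the same domains in the same order.
assigned_cntrs() {
    total=$1
    avail=$2
    out=""
    while [ -n "$total" ]; do
        t=${total%%;*}
        a=${avail%%;*}
        out="$out${out:+;}${t%%=*}=$(( ${t#*=} - ${a#*=} ))"
        case $total in
            *\;*) total=${total#*;}; avail=${avail#*;} ;;
            *)    total="" ;;
        esac
    done
    printf '%s\n' "$out"
}

# Sample strings copied from the two examples above: prints 0=2;1=2
assigned_cntrs "0=32;1=32" "0=30;1=30"
```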
423
424"event_configs":
425	Directory that exists when "mbm_event" counter assignment mode is supported.
426	Contains a sub-directory for each MBM event that can be assigned to a counter.
427
428	Two MBM events are supported by default: mbm_local_bytes and mbm_total_bytes.
429	Each MBM event's sub-directory contains a file named "event_filter" that is
430	used to view and modify which memory transactions the MBM event is configured
431	with. The file is accessible only when "mbm_event" counter assignment mode is
432	enabled.
433
434	List of memory transaction types supported:
435
436	==========================  ========================================================
437	Name			    Description
438	==========================  ========================================================
439	dirty_victim_writes_all     Dirty Victims from the QOS domain to all types of memory
440	remote_reads_slow_memory    Reads to slow memory in the non-local NUMA domain
441	local_reads_slow_memory     Reads to slow memory in the local NUMA domain
442	remote_non_temporal_writes  Non-temporal writes to non-local NUMA domain
443	local_non_temporal_writes   Non-temporal writes to local NUMA domain
444	remote_reads                Reads to memory in the non-local NUMA domain
445	local_reads                 Reads to memory in the local NUMA domain
446	==========================  ========================================================
447
448	For example::
449
450	  # cat /sys/fs/resctrl/info/L3_MON/event_configs/mbm_total_bytes/event_filter
451	  local_reads,remote_reads,local_non_temporal_writes,remote_non_temporal_writes,
452	  local_reads_slow_memory,remote_reads_slow_memory,dirty_victim_writes_all
453
454	  # cat /sys/fs/resctrl/info/L3_MON/event_configs/mbm_local_bytes/event_filter
455	  local_reads,local_non_temporal_writes,local_reads_slow_memory
456
457	Modify the event configuration by writing to the "event_filter" file within
458	the "event_configs" directory. The read/write "event_filter" file contains the
459	configuration of the event that reflects which memory transactions are counted by it.
460
461	For example::
462
	  # echo "local_reads, local_non_temporal_writes" > \
	    /sys/fs/resctrl/info/L3_MON/event_configs/mbm_total_bytes/event_filter
465
466	  # cat /sys/fs/resctrl/info/L3_MON/event_configs/mbm_total_bytes/event_filter
467	   local_reads,local_non_temporal_writes
468
469"mbm_assign_on_mkdir":
470	Exists when "mbm_event" counter assignment mode is supported. Accessible
471	only when "mbm_event" counter assignment mode is enabled.
472
473	Determines if a counter will automatically be assigned to an RMID, MBM event
474	pair when its associated monitor group is created via mkdir. Enabled by default
475	on boot, also when switched from "default" mode to "mbm_event" counter assignment
476	mode. Users can disable this capability by writing to the interface.
477
478	"0":
479		Auto assignment is disabled.
480	"1":
481		Auto assignment is enabled.
482
483	Example::
484
485	  # echo 0 > /sys/fs/resctrl/info/L3_MON/mbm_assign_on_mkdir
486	  # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_on_mkdir
487	  0
488
489"max_threshold_occupancy":
490		Read/write file provides the largest value (in
491		bytes) at which a previously used LLC_occupancy
492		counter can be considered for reuse.
493
494If telemetry monitoring is available there will be a "PERF_PKG_MON" directory
495with the following files:
496
497"num_rmids":
498		The number of RMIDs for telemetry monitoring events.
499
		On Intel, resctrl will not enable telemetry events if the number of
		RMIDs that can be tracked concurrently is lower than the total number
		of RMIDs supported. Telemetry events can be force-enabled with the
		"rdt=" kernel parameter, but this may reduce the number of
		monitoring groups that can be created.
505
506"mon_features":
507		Lists the telemetry monitoring events that are enabled on this system.
508
The upper bound for how many "CTRL_MON" + "MON" groups can be created
is the smaller of the L3_MON and PERF_PKG_MON "num_rmids" values.
511
Finally, in the top level of the "info" directory there is a file
named "last_cmd_status". This is reset with every "command" issued
via the file system (making new directories or writing to any of the
control files). If the command was successful, it will read as "ok".
If the command failed, it will provide more information than can be
conveyed in the error returns from file operations. E.g.
::
519
520	# echo L3:0=f7 > schemata
521	bash: echo: write error: Invalid argument
522	# cat info/last_cmd_status
523	mask f7 has non-consecutive 1-bits
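
The failure above is caused by a mask whose 1-bits are not consecutive. A
script can check a mask before writing it. The following is a minimal sketch,
assuming a POSIX shell. The helper name cbm_is_contiguous is hypothetical and
not part of resctrl, and the check only applies to resources whose
"sparse_masks" file reads "0".

```shell
# Succeed if the hexadecimal mask $1 (no "0x" prefix) has one run of 1-bits.
# cbm_is_contiguous is a hypothetical helper. An all-zero mask has no run
# of 1-bits and is reported as a failure here.
cbm_is_contiguous() {
    v=$((0x$1))
    [ "$v" -ne 0 ] || return 1
    # Drop trailing zero bits so that the run of 1s starts at bit 0.
    while [ $((v & 1)) -eq 0 ]; do
        v=$((v >> 1))
    done
    # A value of the form 0...01...1 shares no bits with its successor.
    [ $((v & (v + 1))) -eq 0 ]
}

for mask in f7 ff f0; do
    if cbm_is_contiguous "$mask"; then
        echo "$mask: contiguous"
    else
        echo "$mask: non-consecutive 1-bits"
    fi
done
```

The kernel remains the authority on whether a mask is valid. For example,
"min_cbm_bits" may impose further limits, so "last_cmd_status" should still be
read after a failed write.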
524
Resource alloc and monitor groups
=================================
527
528Resource groups are represented as directories in the resctrl file
529system.  The default group is the root directory which, immediately
530after mounting, owns all the tasks and cpus in the system and can make
531full use of all resources.
532
533On a system with RDT control features additional directories can be
534created in the root directory that specify different amounts of each
535resource (see "schemata" below). The root and these additional top level
536directories are referred to as "CTRL_MON" groups below.
537
538On a system with RDT monitoring the root directory and other top level
539directories contain a directory named "mon_groups" in which additional
540directories can be created to monitor subsets of tasks in the CTRL_MON
541group that is their ancestor. These are called "MON" groups in the rest
542of this document.
543
544Removing a directory will move all tasks and cpus owned by the group it
545represents to the parent. Removing one of the created CTRL_MON groups
546will automatically remove all MON groups below it.
547
548Moving MON group directories to a new parent CTRL_MON group is supported
549for the purpose of changing the resource allocations of a MON group
550without impacting its monitoring data or assigned tasks. This operation
551is not allowed for MON groups which monitor CPUs. No other move
552operation is currently allowed other than simply renaming a CTRL_MON or
553MON group.
554
555All groups contain the following files:
556
557"tasks":
	Reading this file shows the list of all tasks that belong to
	this group. Writing a task id to the file will add a task to the
	group. Multiple tasks can be added by separating the task ids
	with commas. Tasks will be assigned sequentially. Multiple
	failures are not supported. A single failure encountered while
	attempting to assign a task will cause the operation to abort, and
	tasks added before the failure will remain in the group.
	Failures will be logged to /sys/fs/resctrl/info/last_cmd_status.
566
567	If the group is a CTRL_MON group the task is removed from
568	whichever previous CTRL_MON group owned the task and also from
569	any MON group that owned the task. If the group is a MON group,
570	then the task must already belong to the CTRL_MON parent of this
571	group. The task is removed from any previous MON group.
572
573
574"cpus":
575	Reading this file shows a bitmask of the logical CPUs owned by
576	this group. Writing a mask to this file will add and remove
577	CPUs to/from this group. As with the tasks file a hierarchy is
578	maintained where MON groups may only include CPUs owned by the
579	parent CTRL_MON group.
580	When the resource group is in pseudo-locked mode this file will
581	only be readable, reflecting the CPUs associated with the
582	pseudo-locked region.
583
584
585"cpus_list":
586	Just like "cpus", only using ranges of CPUs instead of bitmasks.
587
588
589When control is enabled all CTRL_MON groups will also contain:
590
591"schemata":
592	A list of all the resources available to this group.
593	Each resource has its own line and format - see below for details.
594
595"size":
596	Mirrors the display of the "schemata" file to display the size in
597	bytes of each allocation instead of the bits representing the
598	allocation.
599
600"mode":
601	The "mode" of the resource group dictates the sharing of its
602	allocations. A "shareable" resource group allows sharing of its
603	allocations while an "exclusive" resource group does not. A
604	cache pseudo-locked region is created by first writing
605	"pseudo-locksetup" to the "mode" file before writing the cache
606	pseudo-locked region's schemata to the resource group's "schemata"
607	file. On successful pseudo-locked region creation the mode will
608	automatically change to "pseudo-locked".
609
610"ctrl_hw_id":
611	Available only with debug option. The identifier used by hardware
612	for the control group. On x86 this is the CLOSID.
613
614When monitoring is enabled all MON groups will also contain:
615
616"mon_data":
617	This contains directories for each monitor domain.
618
619	If L3 monitoring is enabled, there will be a "mon_L3_XX" directory for
620	each instance of an L3 cache. Each directory contains files for the enabled
621	L3 events (e.g. "llc_occupancy", "mbm_total_bytes", and "mbm_local_bytes").
622
	If telemetry monitoring is enabled, there will be a "mon_PERF_PKG_YY"
	directory for each physical processor package. Each directory contains
	files for the enabled telemetry events (e.g. "core_energy", "activity",
	"uops_retired", etc.)
627
628	The info/`*`/mon_features files provide the full list of enabled
629	event/file names.
630
	"core_energy" reports a floating point number for the energy (in Joules)
	consumed by cores (registers, arithmetic units, TLB and L1/L2 caches)
	during execution of instructions summed across all logical CPUs on a
	package for the current monitoring group.

	"activity" also reports a floating point value (in Farads).  This provides
	an estimate of work done independent of the frequency that the CPUs used
	for execution.

	Note that "core_energy" and "activity" only measure energy/activity in the
	"core" of the CPU (arithmetic units, TLB, L1 and L2 caches, etc.). They
	do not include L3 cache, memory, I/O devices etc.
643
644	All other events report decimal integer values.
645
646	In a MON group these files provide a read out of the current value of
647	the event for all tasks in the group. In CTRL_MON groups these files
648	provide the sum for all tasks in the CTRL_MON group and all tasks in
649	MON groups. Please see example section for more details on usage.
650
651	On systems with Sub-NUMA Cluster (SNC) enabled there are extra
652	directories for each node (located within the "mon_L3_XX" directory
653	for the L3 cache they occupy). These are named "mon_sub_L3_YY"
654	where "YY" is the node number.
655
656	When the 'mbm_event' counter assignment mode is enabled, reading
657	an MBM event of a MON group returns 'Unassigned' if no hardware
658	counter is assigned to it. For CTRL_MON groups, 'Unassigned' is
659	returned if the MBM event does not have an assigned counter in the
660	CTRL_MON group nor in any of its associated MON groups.
661
662"mon_hw_id":
663	Available only with debug option. The identifier used by hardware
664	for the monitor group. On x86 this is the RMID.
665
666When monitoring is enabled all MON groups may also contain:
667
668"mbm_L3_assignments":
669	Exists when "mbm_event" counter assignment mode is supported and lists the
670	counter assignment states of the group.
671
672	The assignment list is displayed in the following format:
673
674	<Event>:<Domain ID>=<Assignment state>;<Domain ID>=<Assignment state>
675
676	Event: A valid MBM event in the
677	       /sys/fs/resctrl/info/L3_MON/event_configs directory.
678
679	Domain ID: A valid domain ID. When writing, '*' applies the changes
680		   to all the domains.
681
682	Assignment states:
683
684	_ : No counter assigned.
685
686	e : Counter assigned exclusively.
687
688	Example:
689
	To display the counter assignment states for the default group:
691	::
692
693	 # cd /sys/fs/resctrl
694	 # cat /sys/fs/resctrl/mbm_L3_assignments
695	   mbm_total_bytes:0=e;1=e
696	   mbm_local_bytes:0=e;1=e
697
698	Assignments can be modified by writing to the interface.
699
700	Examples:
701
702	To unassign the counter associated with the mbm_total_bytes event on domain 0:
703	::
704
705	 # echo "mbm_total_bytes:0=_" > /sys/fs/resctrl/mbm_L3_assignments
706	 # cat /sys/fs/resctrl/mbm_L3_assignments
707	   mbm_total_bytes:0=_;1=e
708	   mbm_local_bytes:0=e;1=e
709
710	To unassign the counter associated with the mbm_total_bytes event on all the domains:
711	::
712
713	 # echo "mbm_total_bytes:*=_" > /sys/fs/resctrl/mbm_L3_assignments
714	 # cat /sys/fs/resctrl/mbm_L3_assignments
715	   mbm_total_bytes:0=_;1=_
716	   mbm_local_bytes:0=e;1=e
717
718	To assign a counter associated with the mbm_total_bytes event on all domains in
719	exclusive mode:
720	::
721
722	 # echo "mbm_total_bytes:*=e" > /sys/fs/resctrl/mbm_L3_assignments
723	 # cat /sys/fs/resctrl/mbm_L3_assignments
724	   mbm_total_bytes:0=e;1=e
725	   mbm_local_bytes:0=e;1=e
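
The "<Event>:<Domain ID>=<Assignment state>" lines can also be filtered in a
script, for example to find events that would read 'Unassigned'. The following
is a minimal sketch, assuming a POSIX shell. The helper name unassigned_pairs
is hypothetical and not part of resctrl.

```shell
# Print "<event> <domain>" for every pair with no counter assigned ("_").
# unassigned_pairs is a hypothetical helper that reads lines in the
# mbm_L3_assignments format from standard input.
unassigned_pairs() {
    while IFS=':' read -r event domains; do
        # Tolerate leading whitespace as in the listings above.
        event=$(printf '%s' "$event" | tr -d ' \t')
        printf '%s\n' "$domains" | tr ';' '\n' | while IFS='=' read -r id state; do
            if [ "$state" = "_" ]; then
                printf '%s %s\n' "$event" "$id"
            fi
        done
    done
}

# Sample lines copied from the first unassign example above.
printf '%s\n' 'mbm_total_bytes:0=_;1=e' 'mbm_local_bytes:0=e;1=e' | unassigned_pairs
```

On a live system the input could come from
/sys/fs/resctrl/mbm_L3_assignments or from the same file in any other group.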
726
727When the "mba_MBps" mount option is used all CTRL_MON groups will also contain:
728
729"mba_MBps_event":
730	Reading this file shows which memory bandwidth event is used
731	as input to the software feedback loop that keeps memory bandwidth
732	below the value specified in the schemata file. Writing the
733	name of one of the supported memory bandwidth events found in
734	/sys/fs/resctrl/info/L3_MON/mon_features changes the input
735	event.
736
Resource allocation rules
-------------------------
739
740When a task is running the following rules define which resources are
741available to it:
742
7431) If the task is a member of a non-default group, then the schemata
744   for that group is used.
745
7462) Else if the task belongs to the default group, but is running on a
747   CPU that is assigned to some specific group, then the schemata for the
748   CPU's group is used.
749
7503) Otherwise the schemata for the default group is used.
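
The three rules above amount to a short decision chain. The sketch below,
assuming a POSIX shell, writes it out. The function name effective_group and
the label "default" for the root group are illustrative only and are not
resctrl identifiers.

```shell
# Print the group whose schemata applies, following rules 1-3 above.
# $1: group the task belongs to, $2: group that owns the CPU it runs on.
# "default" stands for the root group; both the label and the function
# name are illustrative assumptions.
effective_group() {
    task_group=$1
    cpu_group=$2
    if [ "$task_group" != "default" ]; then
        echo "$task_group"      # rule 1: the task's own group wins
    elif [ "$cpu_group" != "default" ]; then
        echo "$cpu_group"       # rule 2: fall back to the CPU's group
    else
        echo "default"          # rule 3: default group
    fi
}

effective_group grp_a grp_b       # prints grp_a
effective_group default grp_b     # prints grp_b
effective_group default default   # prints default
```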
751
Resource monitoring rules
-------------------------
7541) If a task is a member of a MON group, or non-default CTRL_MON group
755   then RDT events for the task will be reported in that group.
756
7572) If a task is a member of the default CTRL_MON group, but is running
758   on a CPU that is assigned to some specific group, then the RDT events
759   for the task will be reported in that group.
760
7613) Otherwise RDT events for the task will be reported in the root level
762   "mon_data" group.
763
764
Notes on cache occupancy monitoring and control
===============================================
When moving a task from one group to another you should remember that
this only affects *new* cache allocations by the task. E.g. you may have
a task in a monitor group showing 3 MB of cache occupancy. If you move
the task to a new group and immediately check the occupancy of the old
and new groups you will likely see that the old group is still showing
3 MB and the new group zero. When the task accesses locations still in
cache from before the move, the h/w does not update any counters. On a
busy system you will likely see the occupancy in the old group go down as
cache lines are evicted and re-used while the occupancy in the new group
rises as the task accesses memory and loads into the cache are counted
based on membership in the new group.
778
779The same applies to cache allocation control. Moving a task to a group
780with a smaller cache partition will not evict any cache lines. The
781process may continue to use them from the old partition.
782
Hardware uses a CLOSID (Class of Service ID) and an RMID (Resource Monitoring ID)
to identify a control group and a monitoring group respectively. Each
resource group is mapped to these IDs based on the kind of group. The
number of CLOSIDs and RMIDs is limited by the hardware, hence the creation of
a "CTRL_MON" directory may fail if we run out of either CLOSIDs or RMIDs,
and the creation of a "MON" group may fail if we run out of RMIDs.
789
790max_threshold_occupancy - generic concepts
791------------------------------------------
792
Note that an RMID, once freed, may not be immediately available for use
because cache lines of its previous user are still tagged with that RMID.
Such RMIDs are therefore placed on a limbo list and rechecked until their
cache occupancy has gone down. If at some point the system has many limbo
RMIDs that are not yet ready to be reused, the user may see -EBUSY
during mkdir.
799
800max_threshold_occupancy is a user configurable value to determine the
801occupancy at which an RMID can be freed.
802
The mon_llc_occupancy_limbo tracepoint gives the precise occupancy in bytes
for a subset of RMIDs that are not immediately available for allocation.
This can't be relied on to produce output every second; it may be necessary
to attempt to create an empty monitor group to force an update. Output may
only be produced if creation of a control or monitor group fails.
808
809Schemata files - general concepts
810---------------------------------
811Each line in the file describes one resource. The line starts with
812the name of the resource, followed by specific values to be applied
813in each of the instances of that resource on the system.
814
815Cache IDs
816---------
817On current generation systems there is one L3 cache per socket and L2
818caches are generally just shared by the hyperthreads on a core, but this
819isn't an architectural requirement. We could have multiple separate L3
820caches on a socket, multiple cores could share an L2 cache. So instead
821of using "socket" or "core" to define the set of logical cpus sharing
822a resource we use a "Cache ID". At a given cache level this will be a
823unique number across the whole system (but it isn't guaranteed to be a
824contiguous sequence, there may be gaps).  To find the ID for each logical
825CPU look in /sys/devices/system/cpu/cpu*/cache/index*/id
826
827Cache Bit Masks (CBM)
828---------------------
829For cache resources we describe the portion of the cache that is available
830for allocation using a bitmask. The maximum value of the mask is defined
831by each cpu model (and may be different for different cache levels). It
832is found using CPUID, but is also provided in the "info" directory of
833the resctrl file system in "info/{resource}/cbm_mask". Some Intel hardware
834requires that these masks have all the '1' bits in a contiguous block. So
8350x3, 0x6 and 0xC are legal 4-bit masks with two bits set, but 0x5, 0x9
836and 0xA are not. Check /sys/fs/resctrl/info/{resource}/sparse_masks
to see whether non-contiguous 1s are supported. On a system with a 20-bit mask
838each bit represents 5% of the capacity of the cache. You could partition
839the cache into four equal parts with masks: 0x1f, 0x3e0, 0x7c00, 0xf8000.
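
The contiguity rule and the equal partitioning described above can be checked
with a little shell arithmetic. The function names below are hypothetical
helpers for illustration only; masks are passed with a ``0x`` prefix.

```shell
# cbm_is_contiguous MASK
# Succeeds if MASK (e.g. 0x6) is non-zero and all of its 1 bits form a
# single contiguous block.
cbm_is_contiguous() (
    m=$(( $1 ))
    [ "$m" -ne 0 ] || exit 1
    filled=$(( m | (m - 1) ))           # set every bit below the block
    [ $(( filled & (filled + 1) )) -eq 0 ]
)

# cbm_partition NBITS PARTS
# Prints PARTS equal, non-overlapping masks covering an NBITS-wide mask.
# NBITS is assumed to be divisible by PARTS.
cbm_partition() (
    width=$(( $1 / $2 ))
    i=0
    while [ "$i" -lt "$2" ]; do
        printf '0x%x\n' $(( ((1 << width) - 1) << (i * width) ))
        i=$(( i + 1 ))
    done
)
```

``cbm_partition 20 4`` prints the four masks listed above, and
``cbm_is_contiguous 0x5`` fails because the two set bits are separated by
a zero.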
840
841Notes on Sub-NUMA Cluster mode
842==============================
843When SNC mode is enabled, Linux may load balance tasks between Sub-NUMA
844nodes much more readily than between regular NUMA nodes since the CPUs
845on Sub-NUMA nodes share the same L3 cache and the system may report
846the NUMA distance between Sub-NUMA nodes with a lower value than used
847for regular NUMA nodes.
848
849The top-level monitoring files in each "mon_L3_XX" directory provide
850the sum of data across all SNC nodes sharing an L3 cache instance.
851Users who bind tasks to the CPUs of a specific Sub-NUMA node can read
852the "llc_occupancy", "mbm_total_bytes", and "mbm_local_bytes" in the
853"mon_sub_L3_YY" directories to get node local data.
854
855Memory bandwidth allocation is still performed at the L3 cache
856level. I.e. throttling controls are applied to all SNC nodes.
857
858L3 cache allocation bitmaps also apply to all SNC nodes. But note that
859the amount of L3 cache represented by each bit is divided by the number
860of SNC nodes per L3 cache. E.g. with a 100MB cache on a system with 10-bit
861allocation masks each bit normally represents 10MB. With SNC mode enabled
862with two SNC nodes per L3 cache, each bit only represents 5MB.
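
The arithmetic in the example above can be written down as follows. The
helper name is hypothetical and the calculation uses integer division, so
it is only a sketch for sizes that divide evenly.

```shell
# cbm_bit_mb CACHE_MB NBITS SNC_NODES
# Prints the amount of L3 cache (in MB) represented by one bit of the
# allocation mask when SNC_NODES Sub-NUMA nodes share the cache.
cbm_bit_mb() (
    echo $(( $1 / $2 / $3 ))
)
```

With the numbers from the text, ``cbm_bit_mb 100 10 1`` gives 10 and
``cbm_bit_mb 100 10 2`` gives 5.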
863
864Memory bandwidth Allocation and monitoring
865==========================================
866
867For Memory bandwidth resource, by default the user controls the resource
868by indicating the percentage of total memory bandwidth.
869
870The minimum bandwidth percentage value for each cpu model is predefined
871and can be looked up through "info/MB/min_bandwidth". The bandwidth
872granularity that is allocated is also dependent on the cpu model and can
873be looked up at "info/MB/bandwidth_gran". The available bandwidth
874control steps are: min_bw + N * bw_gran. Intermediate values are rounded
875to the next control step available on the hardware.
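
A minimal sketch of the control step formula above, assuming that "rounded
to the next control step" means rounding up and that requests outside
[min_bw, 100] are rejected. Both points are assumptions of this sketch; the
helper name is hypothetical and the exact behavior depends on the kernel
and hardware.

```shell
# mba_round_pct REQUESTED MIN_BW BW_GRAN
# Prints the control step (min_bw + N * bw_gran) that a requested
# percentage would be rounded up to. Fails for out-of-range requests.
mba_round_pct() (
    req=$1 min_bw=$2 gran=$3
    if [ "$req" -lt "$min_bw" ] || [ "$req" -gt 100 ]; then
        exit 1
    fi
    steps=$(( (req - min_bw + gran - 1) / gran ))   # round up to next step
    pct=$(( min_bw + steps * gran ))
    [ "$pct" -le 100 ] || pct=100                   # never exceed 100%
    echo "$pct"
)
```

The values for MIN_BW and BW_GRAN would come from "info/MB/min_bandwidth"
and "info/MB/bandwidth_gran". With both set to 10, a request of 35 maps
to 40.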
876
Bandwidth throttling is a core-specific mechanism on some Intel
878SKUs. Using a high bandwidth and a low bandwidth setting on two threads
879sharing a core may result in both threads being throttled to use the
880low bandwidth (see "thread_throttle_mode").
881
The fact that Memory Bandwidth Allocation (MBA) may be a core-specific
mechanism whereas Memory Bandwidth Monitoring (MBM) is done at
the package level may lead to confusion when users try to apply control
via MBA and then monitor the bandwidth to see if the controls are
effective. Below are such scenarios:
887
8881. User may *not* see increase in actual bandwidth when percentage
889   values are increased:
890
891This can occur when aggregate L2 external bandwidth is more than L3
892external bandwidth. Consider an SKL SKU with 24 cores on a package and
where L2 external bandwidth is 10GBps (hence aggregate L2 external bandwidth is
894240GBps) and L3 external bandwidth is 100GBps. Now a workload with '20
895threads, having 50% bandwidth, each consuming 5GBps' consumes the max L3
896bandwidth of 100GBps although the percentage value specified is only 50%
897<< 100%. Hence increasing the bandwidth percentage will not yield any
898more bandwidth. This is because although the L2 external bandwidth still
899has capacity, the L3 external bandwidth is fully used. Also note that
900this would be dependent on number of cores the benchmark is run on.
901
9022. Same bandwidth percentage may mean different actual bandwidth
903   depending on # of threads:
904
For the same SKU in #1, a 'single thread, with 10% bandwidth' and '4
threads, with 10% bandwidth' can consume up to 10GBps and 40GBps although
they have the same percentage bandwidth of 10%. This is simply because as
threads start using more cores in an rdtgroup, the actual bandwidth may
increase or vary although the user specified bandwidth percentage is the same.
910
In order to mitigate this and make the interface more user friendly,
resctrl added support for specifying the bandwidth in MiBps as well. The
kernel underneath uses a software feedback mechanism, the "Software
Controller (mba_sc)", which reads the actual bandwidth using MBM counters
and adjusts the memory bandwidth percentages to ensure::

	"actual bandwidth < user specified bandwidth".

By default, the schemata takes bandwidth percentage values,
whereas the user can switch to the "MBA software controller" mode using
the mount option 'mba_MBps'. The schemata format is specified in the
sections below.
923
924L3 schemata file details (code and data prioritization disabled)
925----------------------------------------------------------------
926With CDP disabled the L3 schemata format is::
927
928	L3:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
929
930L3 schemata file details (CDP enabled via mount option to resctrl)
931------------------------------------------------------------------
932When CDP is enabled L3 control is split into two separate resources
933so you can specify independent masks for code and data like this::
934
935	L3DATA:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
936	L3CODE:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
937
938L2 schemata file details
939------------------------
940CDP is supported at L2 using the 'cdpl2' mount option. The schemata
941format is either::
942
943	L2:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
944
945or
946
947	L2DATA:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
948	L2CODE:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
949
950
951Memory bandwidth Allocation (default mode)
952------------------------------------------
953
954Memory b/w domain is L3 cache.
955::
956
957	MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;...
958
959Memory bandwidth Allocation specified in MiBps
960----------------------------------------------
961
962Memory bandwidth domain is L3 cache.
963::
964
965	MB:<cache_id0>=bw_MiBps0;<cache_id1>=bw_MiBps1;...
966
967Slow Memory Bandwidth Allocation (SMBA)
968---------------------------------------
969AMD hardware supports Slow Memory Bandwidth Allocation (SMBA).
970CXL.memory is the only supported "slow" memory device. With the
971support of SMBA, the hardware enables bandwidth allocation on
972the slow memory devices. If there are multiple such devices in
973the system, the throttling logic groups all the slow sources
974together and applies the limit on them as a whole.
975
SMBA (with CXL.memory) may be present whether or not slow memory
devices are present. If there are no such devices on the system, then
configuring SMBA will have no impact on the performance of the system.
979
980The bandwidth domain for slow memory is L3 cache. Its schemata file
981is formatted as:
982::
983
984	SMBA:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;...
985
986Reading/writing the schemata file
987---------------------------------
988Reading the schemata file will show the state of all resources
989on all domains. When writing you only need to specify those values
990which you wish to change.  E.g.
991::
992
993  # cat schemata
994  L3DATA:0=fffff;1=fffff;2=fffff;3=fffff
995  L3CODE:0=fffff;1=fffff;2=fffff;3=fffff
996  # echo "L3DATA:2=3c0;" > schemata
997  # cat schemata
998  L3DATA:0=fffff;1=fffff;2=3c0;3=fffff
999  L3CODE:0=fffff;1=fffff;2=fffff;3=fffff
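
When scripting against the schemata file it is often enough to extract a
single value. The following is a minimal sketch of such a lookup; the
function name is hypothetical, and it reads schemata text on standard input
so it can be tried without a mounted resctrl file system.

```shell
# schemata_get RESOURCE DOMAIN_ID
# Reads schemata text on stdin and prints the value configured for
# DOMAIN_ID on the RESOURCE line. Fails if no such entry exists.
schemata_get() {
    awk -v res="$1" -v dom="$2" '
    {
        line = $0
        gsub(/[ \t]/, "", line)          # tolerate padding in the output
        n = index(line, ":")
        if (n == 0 || substr(line, 1, n - 1) != res)
            next
        cnt = split(substr(line, n + 1), parts, ";")
        for (i = 1; i <= cnt; i++) {
            split(parts[i], kv, "=")
            if (kv[1] == dom) {
                print kv[2]
                found = 1
            }
        }
    }
    END { exit (found ? 0 : 1) }'
}
```

For the output shown above, ``schemata_get L3DATA 2 < schemata`` would
print ``3c0``.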
1000
1001Reading/writing the schemata file (on AMD systems)
1002--------------------------------------------------
1003Reading the schemata file will show the current bandwidth limit on all
1004domains. The allocated resources are in multiples of one eighth GB/s.
When writing to the file, you need to specify the cache id for which you
wish to configure the bandwidth limit.
1007
1008For example, to allocate 2GB/s limit on the first cache id:
1009
1010::
1011
1012  # cat schemata
1013    MB:0=2048;1=2048;2=2048;3=2048
1014    L3:0=ffff;1=ffff;2=ffff;3=ffff
1015
1016  # echo "MB:1=16" > schemata
1017  # cat schemata
1018    MB:0=2048;1=  16;2=2048;3=2048
1019    L3:0=ffff;1=ffff;2=ffff;3=ffff
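
Since the values are in multiples of one eighth GB/s, converting between a
limit in GB/s and the number written to the schemata file is a
multiplication or division by 8. The helper names below are hypothetical
and the sketch handles whole GB/s only.

```shell
# amd_gbps_to_units GBPS
# Prints the schemata value for a limit of GBPS GB/s (1 unit = 1/8 GB/s).
amd_gbps_to_units() (
    echo $(( $1 * 8 ))
)

# amd_units_to_gbps UNITS
# Prints the limit in whole GB/s for a schemata value (integer division).
amd_units_to_gbps() (
    echo $(( $1 / 8 ))
)
```

``amd_gbps_to_units 2`` prints 16, the value written in the example above,
and the value 2048 shown for the other domains corresponds to 256 GB/s.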
1020
1021Reading/writing the schemata file (on AMD systems) with SMBA feature
1022--------------------------------------------------------------------
1023Reading and writing the schemata file is the same as without SMBA in
the section above.
1025
1026For example, to allocate 8GB/s limit on the first cache id:
1027
1028::
1029
1030  # cat schemata
1031    SMBA:0=2048;1=2048;2=2048;3=2048
1032      MB:0=2048;1=2048;2=2048;3=2048
1033      L3:0=ffff;1=ffff;2=ffff;3=ffff
1034
1035  # echo "SMBA:1=64" > schemata
1036  # cat schemata
1037    SMBA:0=2048;1=  64;2=2048;3=2048
1038      MB:0=2048;1=2048;2=2048;3=2048
1039      L3:0=ffff;1=ffff;2=ffff;3=ffff
1040
1041Cache Pseudo-Locking
1042====================
1043CAT enables a user to specify the amount of cache space that an
1044application can fill. Cache pseudo-locking builds on the fact that a
1045CPU can still read and write data pre-allocated outside its current
1046allocated area on a cache hit. With cache pseudo-locking, data can be
1047preloaded into a reserved portion of cache that no application can
1048fill, and from that point on will only serve cache hits. The cache
1049pseudo-locked memory is made accessible to user space where an
1050application can map it into its virtual address space and thus have
1051a region of memory with reduced average read latency.
1052
1053The creation of a cache pseudo-locked region is triggered by a request
1054from the user to do so that is accompanied by a schemata of the region
1055to be pseudo-locked. The cache pseudo-locked region is created as follows:
1056
1057- Create a CAT allocation CLOSNEW with a CBM matching the schemata
1058  from the user of the cache region that will contain the pseudo-locked
1059  memory. This region must not overlap with any current CAT allocation/CLOS
1060  on the system and no future overlap with this cache region is allowed
1061  while the pseudo-locked region exists.
1062- Create a contiguous region of memory of the same size as the cache
1063  region.
1064- Flush the cache, disable hardware prefetchers, disable preemption.
1065- Make CLOSNEW the active CLOS and touch the allocated memory to load
1066  it into the cache.
1067- Set the previous CLOS as active.
1068- At this point the closid CLOSNEW can be released - the cache
1069  pseudo-locked region is protected as long as its CBM does not appear in
1070  any CAT allocation. Even though the cache pseudo-locked region will from
1071  this point on not appear in any CBM of any CLOS an application running with
1072  any CLOS will be able to access the memory in the pseudo-locked region since
1073  the region continues to serve cache hits.
1074- The contiguous region of memory loaded into the cache is exposed to
1075  user-space as a character device.
1076
1077Cache pseudo-locking increases the probability that data will remain
1078in the cache via carefully configuring the CAT feature and controlling
1079application behavior. There is no guarantee that data is placed in
1080cache. Instructions like INVD, WBINVD, CLFLUSH, etc. can still evict
1081“locked” data from cache. Power management C-states may shrink or
1082power off cache. Deeper C-states will automatically be restricted on
1083pseudo-locked region creation.
1084
1085It is required that an application using a pseudo-locked region runs
1086with affinity to the cores (or a subset of the cores) associated
1087with the cache on which the pseudo-locked region resides. A sanity check
1088within the code will not allow an application to map pseudo-locked memory
1089unless it runs with affinity to cores associated with the cache on which the
pseudo-locked region resides. The sanity check is only done during the
initial mmap() handling; there is no enforcement afterwards and the
application itself needs to ensure it remains affine to the correct cores.
1093
1094Pseudo-locking is accomplished in two stages:
1095
10961) During the first stage the system administrator allocates a portion
1097   of cache that should be dedicated to pseudo-locking. At this time an
1098   equivalent portion of memory is allocated, loaded into allocated
1099   cache portion, and exposed as a character device.
11002) During the second stage a user-space application maps (mmap()) the
1101   pseudo-locked memory into its address space.
1102
1103Cache Pseudo-Locking Interface
1104------------------------------
1105A pseudo-locked region is created using the resctrl interface as follows:
1106
11071) Create a new resource group by creating a new directory in /sys/fs/resctrl.
11082) Change the new resource group's mode to "pseudo-locksetup" by writing
1109   "pseudo-locksetup" to the "mode" file.
11103) Write the schemata of the pseudo-locked region to the "schemata" file. All
1111   bits within the schemata should be "unused" according to the "bit_usage"
1112   file.
1113
1114On successful pseudo-locked region creation the "mode" file will contain
1115"pseudo-locked" and a new character device with the same name as the resource
1116group will exist in /dev/pseudo_lock. This character device can be mmap()'ed
1117by user space in order to obtain access to the pseudo-locked memory region.
1118
1119An example of cache pseudo-locked region creation and usage can be found below.
1120
1121Cache Pseudo-Locking Debugging Interface
1122----------------------------------------
1123The pseudo-locking debugging interface is enabled by default (if
1124CONFIG_DEBUG_FS is enabled) and can be found in /sys/kernel/debug/resctrl.
1125
1126There is no explicit way for the kernel to test if a provided memory
1127location is present in the cache. The pseudo-locking debugging interface uses
1128the tracing infrastructure to provide two ways to measure cache residency of
1129the pseudo-locked region:
1130
11311) Memory access latency using the pseudo_lock_mem_latency tracepoint. Data
1132   from these measurements are best visualized using a hist trigger (see
1133   example below). In this test the pseudo-locked region is traversed at
1134   a stride of 32 bytes while hardware prefetchers and preemption
1135   are disabled. This also provides a substitute visualization of cache
1136   hits and misses.
11372) Cache hit and miss measurements using model specific precision counters if
1138   available. Depending on the levels of cache on the system the pseudo_lock_l2
1139   and pseudo_lock_l3 tracepoints are available.
1140
When a pseudo-locked region is created, a new directory is created for
it in debugfs as /sys/kernel/debug/resctrl/<newdir>. A single
1143write-only file, pseudo_lock_measure, is present in this directory. The
1144measurement of the pseudo-locked region depends on the number written to this
1145debugfs file:
1146
11471:
1148     writing "1" to the pseudo_lock_measure file will trigger the latency
1149     measurement captured in the pseudo_lock_mem_latency tracepoint. See
1150     example below.
11512:
1152     writing "2" to the pseudo_lock_measure file will trigger the L2 cache
1153     residency (cache hits and misses) measurement captured in the
1154     pseudo_lock_l2 tracepoint. See example below.
11553:
1156     writing "3" to the pseudo_lock_measure file will trigger the L3 cache
1157     residency (cache hits and misses) measurement captured in the
1158     pseudo_lock_l3 tracepoint.
1159
1160All measurements are recorded with the tracing infrastructure. This requires
1161the relevant tracepoints to be enabled before the measurement is triggered.
1162
1163Example of latency debugging interface
1164~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1165In this example a pseudo-locked region named "newlock" was created. Here is
1166how we can measure the latency in cycles of reading from this region and
1167visualize this data with a histogram that is available if CONFIG_HIST_TRIGGERS
1168is set::
1169
1170  # :> /sys/kernel/tracing/trace
1171  # echo 'hist:keys=latency' > /sys/kernel/tracing/events/resctrl/pseudo_lock_mem_latency/trigger
1172  # echo 1 > /sys/kernel/tracing/events/resctrl/pseudo_lock_mem_latency/enable
1173  # echo 1 > /sys/kernel/debug/resctrl/newlock/pseudo_lock_measure
1174  # echo 0 > /sys/kernel/tracing/events/resctrl/pseudo_lock_mem_latency/enable
1175  # cat /sys/kernel/tracing/events/resctrl/pseudo_lock_mem_latency/hist
1176
1177  # event histogram
1178  #
1179  # trigger info: hist:keys=latency:vals=hitcount:sort=hitcount:size=2048 [active]
1180  #
1181
1182  { latency:        456 } hitcount:          1
1183  { latency:         50 } hitcount:         83
1184  { latency:         36 } hitcount:         96
1185  { latency:         44 } hitcount:        174
1186  { latency:         48 } hitcount:        195
1187  { latency:         46 } hitcount:        262
1188  { latency:         42 } hitcount:        693
1189  { latency:         40 } hitcount:       3204
1190  { latency:         38 } hitcount:       3484
1191
1192  Totals:
1193      Hits: 8192
1194      Entries: 9
1195    Dropped: 0
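
To summarize such a histogram it can help to count how many accesses
completed within a given number of cycles. The following is a minimal
sketch that parses the hist output format shown above from standard input;
the function name and the choice of threshold are assumptions made for
illustration.

```shell
# hist_hits_at_or_below LIMIT
# Reads hist trigger output on stdin and prints the number of hits whose
# latency is less than or equal to LIMIT cycles.
hist_hits_at_or_below() {
    awk -v limit="$1" '
    $2 == "latency:" && $5 == "hitcount:" {
        if ($3 + 0 <= limit + 0)
            hits += $6
    }
    END { print hits + 0 }'
}
```

Applied to the output above with a limit of 50, all entries except the
single 456 cycle outlier are counted, giving 8191 of the 8192 hits.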
1196
1197Example of cache hits/misses debugging
1198~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1199In this example a pseudo-locked region named "newlock" was created on the L2
1200cache of a platform. Here is how we can obtain details of the cache hits
1201and misses using the platform's precision counters.
1202::
1203
1204  # :> /sys/kernel/tracing/trace
1205  # echo 1 > /sys/kernel/tracing/events/resctrl/pseudo_lock_l2/enable
1206  # echo 2 > /sys/kernel/debug/resctrl/newlock/pseudo_lock_measure
1207  # echo 0 > /sys/kernel/tracing/events/resctrl/pseudo_lock_l2/enable
1208  # cat /sys/kernel/tracing/trace
1209
1210  # tracer: nop
1211  #
1212  #                              _-----=> irqs-off
1213  #                             / _----=> need-resched
1214  #                            | / _---=> hardirq/softirq
1215  #                            || / _--=> preempt-depth
1216  #                            ||| /     delay
1217  #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
1218  #              | |       |   ||||       |         |
1219  pseudo_lock_mea-1672  [002] ....  3132.860500: pseudo_lock_l2: hits=4097 miss=0
1220
1221
1222Examples for RDT allocation usage
1223~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1224
12251) Example 1
1226
1227On a two socket machine (one L3 cache per socket) with just four bits
1228for cache bit masks, minimum b/w of 10% with a memory bandwidth
1229granularity of 10%.
1230::
1231
1232  # mount -t resctrl resctrl /sys/fs/resctrl
1233  # cd /sys/fs/resctrl
1234  # mkdir p0 p1
  # echo -e "L3:0=3;1=c\nMB:0=50;1=50" > /sys/fs/resctrl/p0/schemata
  # echo -e "L3:0=3;1=3\nMB:0=50;1=50" > /sys/fs/resctrl/p1/schemata
1237
1238The default resource group is unmodified, so we have access to all parts
1239of all caches (its schemata file reads "L3:0=f;1=f").
1240
1241Tasks that are under the control of group "p0" may only allocate from the
1242"lower" 50% on cache ID 0, and the "upper" 50% of cache ID 1.
1243Tasks in group "p1" use the "lower" 50% of cache on both sockets.
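
The percentages quoted for a mask follow from the number of bits set in it.
A minimal sketch of that calculation is shown below; the function name is
hypothetical and masks are passed with a ``0x`` prefix.

```shell
# cbm_share_pct MASK NBITS
# Prints the percentage of an NBITS-wide cache bit mask covered by MASK.
cbm_share_pct() (
    m=$(( $1 ))
    bits=0
    while [ "$m" -ne 0 ]; do
        bits=$(( bits + (m & 1) ))      # count the set bits
        m=$(( m >> 1 ))
    done
    echo $(( bits * 100 / $2 ))
)
```

With the four-bit masks of this example, both ``cbm_share_pct 0x3 4`` and
``cbm_share_pct 0xc 4`` print 50.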
1244
1245Similarly, tasks that are under the control of group "p0" may use a
1246maximum memory b/w of 50% on socket0 and 50% on socket 1.
1247Tasks in group "p1" may also use 50% memory b/w on both sockets.
1248Note that unlike cache masks, memory b/w cannot specify whether these
allocations can overlap or not. The allocation specifies the maximum
1250b/w that the group may be able to use and the system admin can configure
1251the b/w accordingly.
1252
If resctrl is using the software controller (mba_sc) then the user can enter
the max b/w in MBps rather than the percentage values.
::

  # echo -e "L3:0=3;1=c\nMB:0=1024;1=500" > /sys/fs/resctrl/p0/schemata
  # echo -e "L3:0=3;1=3\nMB:0=1024;1=500" > /sys/fs/resctrl/p1/schemata

In the above example the tasks in "p1" and "p0" on socket 0 would use a max b/w
of 1024MBps whereas on socket 1 they would use 500MBps.
1262
12632) Example 2
1264
1265Again two sockets, but this time with a more realistic 20-bit mask.
1266
1267Two real time tasks pid=1234 running on processor 0 and pid=5678 running on
1268processor 1 on socket 0 on a 2-socket and dual core machine. To avoid noisy
1269neighbors, each of the two real-time tasks exclusively occupies one quarter
1270of L3 cache on socket 0.
1271::
1272
1273  # mount -t resctrl resctrl /sys/fs/resctrl
1274  # cd /sys/fs/resctrl
1275
1276First we reset the schemata for the default group so that the "upper"
127750% of the L3 cache on socket 0 and 50% of memory b/w cannot be used by
1278ordinary tasks::
1279
  # echo -e "L3:0=3ff;1=fffff\nMB:0=50;1=100" > schemata
1281
1282Next we make a resource group for our first real time task and give
1283it access to the "top" 25% of the cache on socket 0.
1284::
1285
1286  # mkdir p0
1287  # echo "L3:0=f8000;1=fffff" > p0/schemata
1288
1289Finally we move our first real time task into this resource group. We
1290also use taskset(1) to ensure the task always runs on a dedicated CPU
1291on socket 0. Most uses of resource groups will also constrain which
1292processors tasks run on.
1293::
1294
1295  # echo 1234 > p0/tasks
  # taskset -cp 0 1234
1297
1298Ditto for the second real time task (with the remaining 25% of cache)::
1299
1300  # mkdir p1
1301  # echo "L3:0=7c00;1=fffff" > p1/schemata
1302  # echo 5678 > p1/tasks
  # taskset -cp 1 5678
1304
For the same 2 socket system with memory b/w resource and CAT L3 the
schemata would look like this (assuming min_bandwidth is 10 and
bandwidth_gran is 10):
1308
1309For our first real time task this would request 20% memory b/w on socket 0.
1310::
1311
1312  # echo -e "L3:0=f8000;1=fffff\nMB:0=20;1=100" > p0/schemata
1313
For our second real time task this would request another 20% memory b/w
on socket 0.
::

  # echo -e "L3:0=7c00;1=fffff\nMB:0=20;1=100" > p1/schemata
1319
13203) Example 3
1321
1322A single socket system which has real-time tasks running on core 4-7 and
1323non real-time workload assigned to core 0-3. The real-time tasks share text
1324and data, so a per task association is not required and due to interaction
1325with the kernel it's desired that the kernel on these cores shares L3 with
1326the tasks.
1327::
1328
1329  # mount -t resctrl resctrl /sys/fs/resctrl
1330  # cd /sys/fs/resctrl
1331
1332First we reset the schemata for the default group so that the "upper"
133350% of the L3 cache on socket 0, and 50% of memory bandwidth on socket 0
1334cannot be used by ordinary tasks::
1335
  # echo -e "L3:0=3ff\nMB:0=50" > schemata
1337
1338Next we make a resource group for our real time cores and give it access
1339to the "top" 50% of the cache on socket 0 and 50% of memory bandwidth on
1340socket 0.
1341::
1342
1343  # mkdir p0
  # echo -e "L3:0=ffc00\nMB:0=50" > p0/schemata
1345
1346Finally we move core 4-7 over to the new group and make sure that the
1347kernel and the tasks running there get 50% of the cache. They should
1348also get 50% of memory bandwidth assuming that the cores 4-7 are SMT
1349siblings and only the real time threads are scheduled on the cores 4-7.
1350::
1351
1352  # echo F0 > p0/cpus
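
The value written to the "cpus" file is a hexadecimal bitmask with one bit
per logical CPU. A minimal sketch of deriving it from a CPU range follows;
the function name is hypothetical and the shell arithmetic limits it to
CPU numbers below 63.

```shell
# cpu_range_to_mask FIRST LAST
# Prints the hexadecimal CPU bitmask covering CPUs FIRST..LAST inclusive.
cpu_range_to_mask() (
    first=$1 last=$2
    count=$(( last - first + 1 ))
    printf '%x\n' $(( ((1 << count) - 1) << first ))
)
```

``cpu_range_to_mask 4 7`` prints ``f0``, matching the mask used above.
Alternatively the range can be written directly to the "cpus_list" file.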
1353
13544) Example 4
1355
1356The resource groups in previous examples were all in the default "shareable"
1357mode allowing sharing of their cache allocations. If one resource group
configures a cache allocation then nothing prevents another resource group
from overlapping with that allocation.
1360
1361In this example a new exclusive resource group will be created on a L2 CAT
1362system with two L2 cache instances that can be configured with an 8-bit
1363capacity bitmask. The new exclusive resource group will be configured to use
136425% of each cache instance.
1365::
1366
1367  # mount -t resctrl resctrl /sys/fs/resctrl/
1368  # cd /sys/fs/resctrl
1369
1370First, we observe that the default group is configured to allocate to all L2
1371cache::
1372
1373  # cat schemata
1374  L2:0=ff;1=ff
1375
1376We could attempt to create the new resource group at this point, but it will
1377fail because of the overlap with the schemata of the default group::
1378
1379  # mkdir p0
1380  # echo 'L2:0=0x3;1=0x3' > p0/schemata
1381  # cat p0/mode
1382  shareable
1383  # echo exclusive > p0/mode
1384  -sh: echo: write error: Invalid argument
1385  # cat info/last_cmd_status
1386  schemata overlaps
1387
1388To ensure that there is no overlap with another resource group the default
1389resource group's schemata has to change, making it possible for the new
1390resource group to become exclusive.
1391::
1392
1393  # echo 'L2:0=0xfc;1=0xfc' > schemata
1394  # echo exclusive > p0/mode
1395  # grep . p0/*
1396  p0/cpus:0
1397  p0/mode:exclusive
1398  p0/schemata:L2:0=03;1=03
1399  p0/size:L2:0=262144;1=262144
1400
1401A new resource group will on creation not overlap with an exclusive resource
1402group::
1403
1404  # mkdir p1
1405  # grep . p1/*
1406  p1/cpus:0
1407  p1/mode:shareable
1408  p1/schemata:L2:0=fc;1=fc
1409  p1/size:L2:0=786432;1=786432
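
The "size" values above are consistent with each bit of the 8-bit mask
representing one eighth of the cache. The sketch below reproduces them
under the assumption that each L2 instance is 1 MiB (1048576 bytes); that
size is inferred from the numbers shown and the function name is
hypothetical.

```shell
# cbm_size_bytes CACHE_BYTES NBITS MASK
# Prints the number of bytes of cache covered by MASK, where the full
# NBITS-wide mask covers CACHE_BYTES.
cbm_size_bytes() (
    cache_bytes=$1 nbits=$2 m=$(( $3 ))
    bits=0
    while [ "$m" -ne 0 ]; do
        bits=$(( bits + (m & 1) ))      # count the set bits
        m=$(( m >> 1 ))
    done
    echo $(( cache_bytes / nbits * bits ))
)
```

``cbm_size_bytes 1048576 8 0x3`` prints 262144 and
``cbm_size_bytes 1048576 8 0xfc`` prints 786432.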
1410
1411The bit_usage will reflect how the cache is used::
1412
1413  # cat info/L2/bit_usage
1414  0=SSSSSSEE;1=SSSSSSEE
1415
1416A resource group cannot be forced to overlap with an exclusive resource group::
1417
1418  # echo 'L2:0=0x1;1=0x1' > p1/schemata
1419  -sh: echo: write error: Invalid argument
1420  # cat info/last_cmd_status
1421  overlaps with exclusive group
1422
1423Example of Cache Pseudo-Locking
1424~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Lock a portion of the L2 cache with cache id 1 using CBM 0x3. The
pseudo-locked region is exposed at /dev/pseudo_lock/newlock, which an
application can open() and then mmap().
1428::
1429
1430  # mount -t resctrl resctrl /sys/fs/resctrl/
1431  # cd /sys/fs/resctrl
1432
Ensure that there are bits available that can be pseudo-locked. Since only
unused bits can be pseudo-locked, the bits to be pseudo-locked need to be
removed from the default resource group's schemata::
1436
1437  # cat info/L2/bit_usage
1438  0=SSSSSSSS;1=SSSSSSSS
1439  # echo 'L2:1=0xfc' > schemata
1440  # cat info/L2/bit_usage
1441  0=SSSSSSSS;1=SSSSSS00
1442
1443Create a new resource group that will be associated with the pseudo-locked
1444region, indicate that it will be used for a pseudo-locked region, and
1445configure the requested pseudo-locked region capacity bitmask::
1446
1447  # mkdir newlock
1448  # echo pseudo-locksetup > newlock/mode
1449  # echo 'L2:1=0x3' > newlock/schemata
1450
1451On success the resource group's mode will change to pseudo-locked, the
1452bit_usage will reflect the pseudo-locked region, and the character device
1453exposing the pseudo-locked region will exist::
1454
1455  # cat newlock/mode
1456  pseudo-locked
1457  # cat info/L2/bit_usage
1458  0=SSSSSSSS;1=SSSSSSPP
1459  # ls -l /dev/pseudo_lock/newlock
1460  crw------- 1 root root 243, 0 Apr  3 05:01 /dev/pseudo_lock/newlock
1461
1462::
1463
  /*
   * Example code to access one page of pseudo-locked cache region
   * from user space.
   */
  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <sched.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>
  #include <sys/mman.h>

  /*
   * It is required that the application runs with affinity to only
   * cores associated with the pseudo-locked region. Here the cpu
   * is hardcoded for convenience of example.
   */
  static int cpuid = 2;

  int main(int argc, char *argv[])
  {
    cpu_set_t cpuset;
    long page_size;
    void *mapping;
    int dev_fd;
    int ret;

    page_size = sysconf(_SC_PAGESIZE);

    CPU_ZERO(&cpuset);
    CPU_SET(cpuid, &cpuset);
    ret = sched_setaffinity(0, sizeof(cpuset), &cpuset);
    if (ret < 0) {
      perror("sched_setaffinity");
      exit(EXIT_FAILURE);
    }

    dev_fd = open("/dev/pseudo_lock/newlock", O_RDWR);
    if (dev_fd < 0) {
      perror("open");
      exit(EXIT_FAILURE);
    }

    mapping = mmap(0, page_size, PROT_READ | PROT_WRITE, MAP_SHARED,
            dev_fd, 0);
    if (mapping == MAP_FAILED) {
      perror("mmap");
      close(dev_fd);
      exit(EXIT_FAILURE);
    }

    /* Application interacts with pseudo-locked memory @mapping */

    ret = munmap(mapping, page_size);
    if (ret < 0) {
      perror("munmap");
      close(dev_fd);
      exit(EXIT_FAILURE);
    }

    close(dev_fd);
    exit(EXIT_SUCCESS);
  }
1527
Locking between applications
----------------------------

Certain operations on the resctrl filesystem, composed of reads and writes
to multiple files, must be atomic.

As an example, the allocation of an exclusive reservation of L3 cache
involves:

  1. Read the cbmmasks from each directory or the per-resource "bit_usage"
  2. Find a contiguous set of bits in the global CBM bitmask that is clear
     in all of the directory cbmmasks
  3. Create a new directory
  4. Write the bits found in step 2 to the new directory's "schemata" file

If two applications attempt to allocate space concurrently then they can
end up allocating the same bits, so the reservations are shared instead of
exclusive.

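Step 2 of the procedure above can be sketched in C as follows. This is only
an illustration: the function name find_free_cbm() and its parameters are
hypothetical, masks are assumed to fit in 32 bits, and the caller decides
which directory masks to pass in.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Find @want contiguous bits of @full_mask that are clear in every mask
 * in @used. Returns the lowest such run as a bitmask, or 0 if none fits.
 * Hypothetical helper for illustration only.
 */
uint32_t find_free_cbm(uint32_t full_mask, const uint32_t *used,
                       size_t nr_used, unsigned int want)
{
  uint32_t busy = 0, run;
  unsigned int shift;
  size_t i;

  if (want == 0 || want > 32)
    return 0;

  /* A bit is busy if any existing group has it set. */
  for (i = 0; i < nr_used; i++)
    busy |= used[i];

  run = (want == 32) ? UINT32_MAX : ((UINT32_C(1) << want) - 1);

  /* Slide the run upwards until it fits inside the free, valid bits. */
  for (shift = 0; shift + want <= 32; shift++) {
    uint32_t candidate = run << shift;

    if ((candidate & full_mask) == candidate && !(candidate & busy))
      return candidate;
  }
  return 0;
}
```

For a four bit CBM (0xf) where one group already uses 0x3, asking for two
bits yields 0xc. Without the locking described in this section, two
applications running this search concurrently can both obtain the same
result.
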
To coordinate atomic operations on the resctrl filesystem and to avoid the
problem above, the following locking procedure is recommended:

Locking is based on flock, which is available in libc and also as a shell
command.

Write lock:

 A) Take flock(LOCK_EX) on /sys/fs/resctrl
 B) Read/write the directory structure.
 C) Release the lock with flock(LOCK_UN)

Read lock:

 A) Take flock(LOCK_SH) on /sys/fs/resctrl
 B) On success, read the directory structure.
 C) Release the lock with flock(LOCK_UN)

Example with bash::

  # Atomically read directory structure
  $ flock -s /sys/fs/resctrl/ find /sys/fs/resctrl

  # Read directory contents and create new subdirectory

  $ cat create-dir.sh
  find /sys/fs/resctrl/ > output.txt
  # "function-of" is a placeholder for the logic that selects free bits
  mask=$(function-of output.txt)
  mkdir /sys/fs/resctrl/newres/
  echo "$mask" > /sys/fs/resctrl/newres/schemata

  $ flock /sys/fs/resctrl/ ./create-dir.sh

Example with C::

  /*
   * Example code to take advisory locks
   * before accessing resctrl filesystem
   */
  #include <fcntl.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/file.h>
  #include <unistd.h>

  void resctrl_take_shared_lock(int fd)
  {
    int ret;

    /* take shared lock on resctrl filesystem */
    ret = flock(fd, LOCK_SH);
    if (ret) {
      perror("flock");
      exit(-1);
    }
  }

  void resctrl_take_exclusive_lock(int fd)
  {
    int ret;

    /* take exclusive lock on resctrl filesystem */
    ret = flock(fd, LOCK_EX);
    if (ret) {
      perror("flock");
      exit(-1);
    }
  }

  void resctrl_release_lock(int fd)
  {
    int ret;

    /* release lock on resctrl filesystem */
    ret = flock(fd, LOCK_UN);
    if (ret) {
      perror("flock");
      exit(-1);
    }
  }

  int main(void)
  {
    int fd;

    fd = open("/sys/fs/resctrl", O_RDONLY | O_DIRECTORY);
    if (fd == -1) {
      perror("open");
      exit(-1);
    }
    resctrl_take_shared_lock(fd);
    /* code to read directory contents */
    resctrl_release_lock(fd);

    resctrl_take_exclusive_lock(fd);
    /* code to read and write directory contents */
    resctrl_release_lock(fd);

    close(fd);
    return 0;
  }

Examples for RDT Monitoring along with allocation usage
=======================================================
Reading monitored data
----------------------
Reading an event file (for example, mon_data/mon_L3_00/llc_occupancy)
shows the current snapshot of LLC occupancy of the corresponding MON
group or CTRL_MON group.


Example 1 (Monitor CTRL_MON group and subset of tasks in CTRL_MON group)
------------------------------------------------------------------------
On a two socket machine (one L3 cache per socket) with just four bits
for cache bit masks::

  # mount -t resctrl resctrl /sys/fs/resctrl
  # cd /sys/fs/resctrl
  # mkdir p0 p1
  # echo "L3:0=3;1=c" > /sys/fs/resctrl/p0/schemata
  # echo "L3:0=3;1=3" > /sys/fs/resctrl/p1/schemata
  # echo 5678 > p1/tasks
  # echo 5679 > p1/tasks

The default resource group is unmodified, so we have access to all parts
of all caches (its schemata file reads "L3:0=f;1=f").

Tasks that are under the control of group "p0" may only allocate from the
"lower" 50% on cache ID 0, and the "upper" 50% of cache ID 1.
Tasks in group "p1" use the "lower" 50% of cache on both sockets.

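The percentages quoted above follow from counting the bits set in each mask.
A minimal sketch of that arithmetic, assuming every bit of the four bit mask
stands for an equal share of the cache; the helper name cbm_percent() is
hypothetical:

```c
#include <stdint.h>

/*
 * Count the bits set in @cbm and express them as a percentage of a
 * capacity bitmask that is @cbm_len bits wide. Hypothetical helper.
 */
unsigned int cbm_percent(uint32_t cbm, unsigned int cbm_len)
{
  unsigned int set = 0;

  if (cbm_len == 0)
    return 0;
  for (; cbm; cbm &= cbm - 1) /* clear the lowest set bit */
    set++;
  return set * 100 / cbm_len;
}
```

Both cbm_percent(0x3, 4) and cbm_percent(0xc, 4) give 50, while the default
group's mask 0xf gives 100.
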
Create monitor groups and assign a subset of tasks to each monitor group.
::

  # cd /sys/fs/resctrl/p1/mon_groups
  # mkdir m11 m12
  # echo 5678 > m11/tasks
  # echo 5679 > m12/tasks

Fetch data (data shown in bytes).
::

  # cat m11/mon_data/mon_L3_00/llc_occupancy
  16234000
  # cat m11/mon_data/mon_L3_01/llc_occupancy
  14789000
  # cat m12/mon_data/mon_L3_00/llc_occupancy
  16789000

The parent CTRL_MON group shows the aggregated data.
::

  # cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy
  31234000

Example 2 (Monitor a task from its creation)
--------------------------------------------
On a two socket machine (one L3 cache per socket)::

  # mount -t resctrl resctrl /sys/fs/resctrl
  # cd /sys/fs/resctrl
  # mkdir p0 p1

An RMID is allocated to the group once it is created and hence the <cmd>
below is monitored from its creation.
::

  # echo $$ > /sys/fs/resctrl/p1/tasks
  # <cmd>

Fetch the data::

  # cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy
  31789000

Example 3 (Monitor without CAT support or before creating CAT groups)
---------------------------------------------------------------------

Assume a system such as Haswell (HSW) that has only CQM and no CAT support.
In this case resctrl will still mount, but CTRL_MON directories cannot be
created. The user can still create MON groups within the root group and
thereby monitor all tasks, including kernel threads.

This can also be used to profile the cache footprint of jobs before
allocating them to different allocation groups.
::

  # mount -t resctrl resctrl /sys/fs/resctrl
  # cd /sys/fs/resctrl
  # mkdir mon_groups/m01
  # mkdir mon_groups/m02

  # echo 3478 > /sys/fs/resctrl/mon_groups/m01/tasks
  # echo 2467 > /sys/fs/resctrl/mon_groups/m02/tasks

Monitor the groups separately and also get per domain data. From the
output below it is apparent that the tasks are mostly doing work on
domain (socket) 0.
::

  # cat /sys/fs/resctrl/mon_groups/m01/mon_data/mon_L3_00/llc_occupancy
  31234000
  # cat /sys/fs/resctrl/mon_groups/m01/mon_data/mon_L3_01/llc_occupancy
  34555
  # cat /sys/fs/resctrl/mon_groups/m02/mon_data/mon_L3_00/llc_occupancy
  31234000
  # cat /sys/fs/resctrl/mon_groups/m02/mon_data/mon_L3_01/llc_occupancy
  32789


Example 4 (Monitor real time tasks)
-----------------------------------

A single socket system has real time tasks running on cores 4-7
and non real time tasks on other cpus. We want to monitor the cache
occupancy of the real time threads on these cores.
::

  # mount -t resctrl resctrl /sys/fs/resctrl
  # cd /sys/fs/resctrl
  # mkdir p1

Move the cpus 4-7 over to p1::

  # echo f0 > p1/cpus

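The value f0 is the hexadecimal bitmask with bits 4 to 7 set. A small
sketch of how such a mask can be derived from a CPU range; the helper name
cpu_range_mask() is hypothetical and, as an assumption made for brevity, it
handles only CPUs 0-63:

```c
#include <stdint.h>

/*
 * Build a mask with the bits for CPUs @first..@last (inclusive) set.
 * Returns 0 for an empty or out-of-range request. Hypothetical helper.
 */
uint64_t cpu_range_mask(unsigned int first, unsigned int last)
{
  uint64_t mask = 0;
  unsigned int cpu;

  if (first > last || last > 63)
    return 0;
  for (cpu = first; cpu <= last; cpu++)
    mask |= UINT64_C(1) << cpu;
  return mask;
}
```

cpu_range_mask(4, 7) returns 0xf0, which matches the value written to
p1/cpus above.
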
View the llc occupancy snapshot::

  # cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy
  11234000


Examples on working with mbm_assign_mode
========================================

a. Check if MBM counter assignment mode is supported.
::

  # mount -t resctrl resctrl /sys/fs/resctrl/

  # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
  [mbm_event]
  default

The "mbm_event" mode is detected and enabled.

b. Check how many assignable counters are supported.
::

  # cat /sys/fs/resctrl/info/L3_MON/num_mbm_cntrs
  0=32;1=32

c. Check how many assignable counters are available for assignment in each domain.
::

  # cat /sys/fs/resctrl/info/L3_MON/available_mbm_cntrs
  0=30;1=30

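The per-domain lists shown above use a "domain=value" layout separated by
semicolons. A minimal parsing sketch, assuming exactly that layout with
decimal values; the helper name domain_value() is hypothetical:

```c
#include <stdlib.h>

/*
 * Look up @domain in an "id=value;id=value" list and return its value,
 * or -1 if the domain is missing or the list is malformed.
 * Hypothetical helper for illustration only.
 */
long domain_value(const char *list, int domain)
{
  const char *p = list;

  while (p && *p && *p != '\n') {
    char *end;
    long id = strtol(p, &end, 10);
    long val;

    if (end == p || *end != '=')
      return -1;
    p = end + 1;
    val = strtol(p, &end, 10);
    if (end == p)
      return -1;
    if (id == domain)
      return val;
    /* Continue with the next entry only if a ';' follows. */
    p = (*end == ';') ? end + 1 : NULL;
  }
  return -1;
}
```

For the available_mbm_cntrs output above, domain_value("0=30;1=30", 1)
returns 30.
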
d. To list the default group's assignment states.
::

  # cat /sys/fs/resctrl/mbm_L3_assignments
  mbm_total_bytes:0=e;1=e
  mbm_local_bytes:0=e;1=e

e. To unassign the counter associated with the mbm_total_bytes event on domain 0.
::

  # echo "mbm_total_bytes:0=_" > /sys/fs/resctrl/mbm_L3_assignments
  # cat /sys/fs/resctrl/mbm_L3_assignments
  mbm_total_bytes:0=_;1=e
  mbm_local_bytes:0=e;1=e

f. To unassign the counter associated with the mbm_total_bytes event on all domains.
::

  # echo "mbm_total_bytes:*=_" > /sys/fs/resctrl/mbm_L3_assignments
  # cat /sys/fs/resctrl/mbm_L3_assignments
  mbm_total_bytes:0=_;1=_
  mbm_local_bytes:0=e;1=e

g. To assign a counter associated with the mbm_total_bytes event on all domains in
exclusive mode.
::

  # echo "mbm_total_bytes:*=e" > /sys/fs/resctrl/mbm_L3_assignments
  # cat /sys/fs/resctrl/mbm_L3_assignments
  mbm_total_bytes:0=e;1=e
  mbm_local_bytes:0=e;1=e

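The strings written in steps e to g share one pattern: event name, domain
(or "*" for all domains) and the state character. A sketch of building such
a string in C; the helper name format_mbm_assignment() is hypothetical, and
using a negative domain to mean "*" is an assumption of this example only:

```c
#include <stdio.h>

/*
 * Format "<event>:<domain>=<state>" into @buf. A negative @domain selects
 * all domains ("*"). Returns the snprintf() result. Hypothetical helper.
 */
int format_mbm_assignment(char *buf, size_t size, const char *event,
                          int domain, char state)
{
  if (domain < 0)
    return snprintf(buf, size, "%s:*=%c", event, state);
  return snprintf(buf, size, "%s:%d=%c", event, domain, state);
}
```

Called with ("mbm_total_bytes", 0, '_') this produces the string used in
step e, and with a negative domain and 'e' it produces the string used in
step g.
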
h. Read the events mbm_total_bytes and mbm_local_bytes of the default group. Reading
the events is unchanged by the assignment.
::

  # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
  779247936
  # cat /sys/fs/resctrl/mon_data/mon_L3_01/mbm_total_bytes
  562324232
  # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
  212122123
  # cat /sys/fs/resctrl/mon_data/mon_L3_01/mbm_local_bytes
  121212144

i. Check the event configurations.
::

  # cat /sys/fs/resctrl/info/L3_MON/event_configs/mbm_total_bytes/event_filter
  local_reads,remote_reads,local_non_temporal_writes,remote_non_temporal_writes,
  local_reads_slow_memory,remote_reads_slow_memory,dirty_victim_writes_all

  # cat /sys/fs/resctrl/info/L3_MON/event_configs/mbm_local_bytes/event_filter
  local_reads,local_non_temporal_writes,local_reads_slow_memory

j. Change the event configuration for mbm_local_bytes.
::

  # echo "local_reads, local_non_temporal_writes, local_reads_slow_memory, remote_reads" > \
    /sys/fs/resctrl/info/L3_MON/event_configs/mbm_local_bytes/event_filter

  # cat /sys/fs/resctrl/info/L3_MON/event_configs/mbm_local_bytes/event_filter
  local_reads,local_non_temporal_writes,local_reads_slow_memory,remote_reads

k. Now read the local events again. The first read may come back with "Unavailable"
status. The subsequent read of mbm_local_bytes will display the current value.
::

  # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
  Unavailable
  # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
  2252323
  # cat /sys/fs/resctrl/mon_data/mon_L3_01/mbm_local_bytes
  Unavailable
  # cat /sys/fs/resctrl/mon_data/mon_L3_01/mbm_local_bytes
  1566565

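A program reading these files has to cope with the non-numeric
"Unavailable" result. A minimal sketch of that check, where the helper name
parse_mon_value() is hypothetical and the text is assumed to have already
been read from the event file:

```c
#include <errno.h>
#include <stdlib.h>

/*
 * Parse the text read from an event file. Returns 0 and stores the counter
 * in @value on success, or -1 if the text is not a plain decimal number
 * (for example "Unavailable"). Hypothetical helper.
 */
int parse_mon_value(const char *text, unsigned long long *value)
{
  char *end;

  if (*text < '0' || *text > '9')
    return -1;
  errno = 0;
  *value = strtoull(text, &end, 10);
  if (errno || (*end != '\0' && *end != '\n'))
    return -1;
  return 0;
}
```

A caller could read the file again when -1 is returned, since the first
read after a configuration change may report "Unavailable".
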
l. Users have the option to go back to 'default' mbm_assign_mode if required. This can be
done using the following command. Note that switching the mbm_assign_mode may reset all
the MBM counters (and thus all MBM events) of all the resctrl groups.
::

  # echo "default" > /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
  # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
  mbm_event
  [default]

m. Unmount the resctrl filesystem.
::

  # umount /sys/fs/resctrl/

Intel RDT Errata
================

Intel MBM Counters May Report System Memory Bandwidth Incorrectly
-----------------------------------------------------------------

Errata SKX99 for Skylake server and BDF102 for Broadwell server.

Problem: Intel Memory Bandwidth Monitoring (MBM) counters track metrics
according to the assigned Resource Monitor ID (RMID) for that logical
core. The IA32_QM_CTR register (MSR 0xC8E), used to report these
metrics, may report incorrect system bandwidth for certain RMID values.

Implication: Due to the errata, system memory bandwidth may not match
what is reported.

Workaround: MBM total and local readings are corrected according to the
following correction factor table:

+---------------+---------------+---------------+-----------------+
|core count     |rmid count     |rmid threshold |correction factor|
+---------------+---------------+---------------+-----------------+
|1              |8              |0              |1.000000         |
+---------------+---------------+---------------+-----------------+
|2              |16             |0              |1.000000         |
+---------------+---------------+---------------+-----------------+
|3              |24             |15             |0.969650         |
+---------------+---------------+---------------+-----------------+
|4              |32             |0              |1.000000         |
+---------------+---------------+---------------+-----------------+
|6              |48             |31             |0.969650         |
+---------------+---------------+---------------+-----------------+
|7              |56             |47             |1.142857         |
+---------------+---------------+---------------+-----------------+
|8              |64             |0              |1.000000         |
+---------------+---------------+---------------+-----------------+
|9              |72             |63             |1.185115         |
+---------------+---------------+---------------+-----------------+
|10             |80             |63             |1.066553         |
+---------------+---------------+---------------+-----------------+
|11             |88             |79             |1.454545         |
+---------------+---------------+---------------+-----------------+
|12             |96             |0              |1.000000         |
+---------------+---------------+---------------+-----------------+
|13             |104            |95             |1.230769         |
+---------------+---------------+---------------+-----------------+
|14             |112            |95             |1.142857         |
+---------------+---------------+---------------+-----------------+
|15             |120            |95             |1.066667         |
+---------------+---------------+---------------+-----------------+
|16             |128            |0              |1.000000         |
+---------------+---------------+---------------+-----------------+
|17             |136            |127            |1.254863         |
+---------------+---------------+---------------+-----------------+
|18             |144            |127            |1.185255         |
+---------------+---------------+---------------+-----------------+
|19             |152            |0              |1.000000         |
+---------------+---------------+---------------+-----------------+
|20             |160            |127            |1.066667         |
+---------------+---------------+---------------+-----------------+
|21             |168            |0              |1.000000         |
+---------------+---------------+---------------+-----------------+
|22             |176            |159            |1.454334         |
+---------------+---------------+---------------+-----------------+
|23             |184            |0              |1.000000         |
+---------------+---------------+---------------+-----------------+
|24             |192            |127            |0.969744         |
+---------------+---------------+---------------+-----------------+
|25             |200            |191            |1.280246         |
+---------------+---------------+---------------+-----------------+
|26             |208            |191            |1.230921         |
+---------------+---------------+---------------+-----------------+
|27             |216            |0              |1.000000         |
+---------------+---------------+---------------+-----------------+
|28             |224            |191            |1.143118         |
+---------------+---------------+---------------+-----------------+

If rmid > rmid threshold, MBM total and local values should be multiplied
by the correction factor.

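The rule above can be illustrated with a short user space sketch. It uses
floating point for brevity and is not meant to describe how the kernel
applies the correction; the function name mbm_corrected() and its parameters
are hypothetical:

```c
/*
 * Scale @raw by @factor only when @rmid is above @threshold; otherwise
 * return @raw unchanged. Hypothetical helper for illustration only.
 */
double mbm_corrected(double raw, unsigned int rmid, unsigned int threshold,
                     double factor)
{
  return rmid > threshold ? raw * factor : raw;
}
```

Using the 3 core row of the table (rmid threshold 15, correction factor
0.969650), a raw reading of 1000000 bytes for RMID 20 is scaled to about
969650 bytes, while the same reading for RMID 15 is left unchanged.
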
See:

1. Erratum SKX99 in Intel Xeon Processor Scalable Family Specification Update:
   http://web.archive.org/web/20200716124958/https://www.intel.com/content/www/us/en/processors/xeon/scalable/xeon-scalable-spec-update.html

2. Erratum BDF102 in Intel Xeon E5-2600 v4 Processor Product Family Specification Update:
   http://web.archive.org/web/20191125200531/https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-e5-v4-spec-update.pdf

3. The errata in Intel Resource Director Technology (Intel RDT) on 2nd Generation Intel Xeon Scalable Processors Reference Manual:
   https://software.intel.com/content/www/us/en/develop/articles/intel-resource-director-technology-rdt-reference-manual.html

for further information.
