.. SPDX-License-Identifier: GPL-2.0
.. include:: <isonum.txt>

===========================================
User Interface for Resource Control feature
===========================================

:Copyright: |copy| 2016 Intel Corporation
:Authors: - Fenghua Yu <fenghua.yu@intel.com>
          - Tony Luck <tony.luck@intel.com>
          - Vikas Shivappa <vikas.shivappa@intel.com>


Intel refers to this feature as Intel Resource Director Technology (Intel(R) RDT).
AMD refers to this feature as AMD Platform Quality of Service (AMD QoS).

This feature is enabled by the CONFIG_X86_CPU_RESCTRL kernel configuration
option, and hardware support is indicated by the following x86 /proc/cpuinfo
flag bits:

===============================================	================================
RDT (Resource Director Technology) Allocation	"rdt_a"
CAT (Cache Allocation Technology)		"cat_l3", "cat_l2"
CDP (Code and Data Prioritization)		"cdp_l3", "cdp_l2"
CQM (Cache QoS Monitoring)			"cqm_llc", "cqm_occup_llc"
MBM (Memory Bandwidth Monitoring)		"cqm_mbm_total", "cqm_mbm_local"
MBA (Memory Bandwidth Allocation)		"mba"
SMBA (Slow Memory Bandwidth Allocation)         ""
BMEC (Bandwidth Monitoring Event Configuration) ""
===============================================	================================
30
31Historically, new features were made visible by default in /proc/cpuinfo. This
32resulted in the feature flags becoming hard to parse by humans. Adding a new
33flag to /proc/cpuinfo should be avoided if user space can obtain information
34about the feature from resctrl's info directory.
35
To use the feature, mount the file system::

 # mount -t resctrl resctrl [-o cdp[,cdpl2][,mba_MBps]] /sys/fs/resctrl

mount options are:

"cdp":
	Enable code/data prioritization in L3 cache allocations.
"cdpl2":
	Enable code/data prioritization in L2 cache allocations.
"mba_MBps":
	Enable the MBA Software Controller (mba_sc) to specify MBA
	bandwidth in MBps.

L2 and L3 CDP are controlled separately.
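
For example, on hardware that supports both L3 CDP and MBA, code/data
prioritization and the software controller could be enabled together (a
sketch; each option is only accepted when the corresponding feature is
supported)::

  # mount -t resctrl resctrl -o cdp,mba_MBps /sys/fs/resctrl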
51
52RDT features are orthogonal. A particular system may support only
53monitoring, only control, or both monitoring and control.  Cache
54pseudo-locking is a unique way of using cache control to "pin" or
55"lock" data in the cache. Details can be found in
56"Cache Pseudo-Locking".
57
58
The mount succeeds if either allocation or monitoring is present, but
only those files and directories supported by the system will be created.
For more details on the behavior of the interface during monitoring
and allocation, see the "Resource alloc and monitor groups" section.
63
64Info directory
65==============
66
67The 'info' directory contains information about the enabled
68resources. Each resource has its own subdirectory. The subdirectory
69names reflect the resource names.
70
Each subdirectory contains the following files with respect to
allocation:

A cache resource (L3/L2) subdirectory contains the following files
related to allocation (an example of reading them follows the list):
76
77"num_closids":
78		The number of CLOSIDs which are valid for this
79		resource. The kernel uses the smallest number of
		CLOSIDs of all enabled resources as the limit.
81"cbm_mask":
82		The bitmask which is valid for this resource.
83		This mask is equivalent to 100%.
84"min_cbm_bits":
85		The minimum number of consecutive bits which
86		must be set when writing a mask.
87
88"shareable_bits":
89		Bitmask of shareable resource with other executing
90		entities (e.g. I/O). User can use this when
91		setting up exclusive cache partitions. Note that
92		some platforms support devices that have their
93		own settings for cache use which can over-ride
94		these bits.
95"bit_usage":
96		Annotated capacity bitmasks showing how all
97		instances of the resource are used. The legend is:
98
99			"0":
100			      Corresponding region is unused. When the system's
101			      resources have been allocated and a "0" is found
102			      in "bit_usage" it is a sign that resources are
103			      wasted.
104
105			"H":
106			      Corresponding region is used by hardware only
107			      but available for software use. If a resource
108			      has bits set in "shareable_bits" but not all
109			      of these bits appear in the resource groups'
110			      schematas then the bits appearing in
			      "shareable_bits" but in no resource group will
112			      be marked as "H".
113			"X":
114			      Corresponding region is available for sharing and
115			      used by hardware and software. These are the
116			      bits that appear in "shareable_bits" as
117			      well as a resource group's allocation.
118			"S":
119			      Corresponding region is used by software
120			      and available for sharing.
121			"E":
122			      Corresponding region is used exclusively by
123			      one resource group. No sharing allowed.
124			"P":
125			      Corresponding region is pseudo-locked. No
126			      sharing allowed.
127
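
As an illustration, the allocation-related files of an L3 cache resource
might read as follows (all values are hypothetical and vary by platform)::

  # cd /sys/fs/resctrl/info/L3
  # cat num_closids
  16
  # cat cbm_mask
  fffff
  # cat min_cbm_bits
  1
  # cat shareable_bits
  0
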
Memory bandwidth (MB) subdirectory contains the following files
with respect to allocation:
130
131"min_bandwidth":
132		The minimum memory bandwidth percentage which
133		user can request.
134
135"bandwidth_gran":
136		The granularity in which the memory bandwidth
137		percentage is allocated. The allocated
138		b/w percentage is rounded off to the next
139		control step available on the hardware. The
140		available bandwidth control steps are:
141		min_bandwidth + N * bandwidth_gran.
142
143"delay_linear":
144		Indicates if the delay scale is linear or
145		non-linear. This field is purely informational
146		only.
147
148"thread_throttle_mode":
149		Indicator on Intel systems of how tasks running on threads
150		of a physical core are throttled in cases where they
151		request different memory bandwidth percentages:
152
153		"max":
154			the smallest percentage is applied
155			to all threads
156		"per-thread":
157			bandwidth percentages are directly applied to
158			the threads running on the core
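
As an illustration of the bandwidth control steps, assume the following
(hypothetical) values are reported by a system::

  # cat /sys/fs/resctrl/info/MB/min_bandwidth
  10
  # cat /sys/fs/resctrl/info/MB/bandwidth_gran
  10

The valid control steps would then be 10, 20, ..., 100, and a requested
percentage of 35 would be rounded to the next available step (40).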
159
160If RDT monitoring is available there will be an "L3_MON" directory
161with the following files:
162
163"num_rmids":
164		The number of RMIDs available. This is the
165		upper bound for how many "CTRL_MON" + "MON"
166		groups can be created.
167
168"mon_features":
169		Lists the monitoring events if
170		monitoring is enabled for the resource.
171		Example::
172
173			# cat /sys/fs/resctrl/info/L3_MON/mon_features
174			llc_occupancy
175			mbm_total_bytes
176			mbm_local_bytes
177
178		If the system supports Bandwidth Monitoring Event
179		Configuration (BMEC), then the bandwidth events will
180		be configurable. The output will be::
181
182			# cat /sys/fs/resctrl/info/L3_MON/mon_features
183			llc_occupancy
184			mbm_total_bytes
185			mbm_total_bytes_config
186			mbm_local_bytes
187			mbm_local_bytes_config
188
189"mbm_total_bytes_config", "mbm_local_bytes_config":
190	Read/write files containing the configuration for the mbm_total_bytes
191	and mbm_local_bytes events, respectively, when the Bandwidth
192	Monitoring Event Configuration (BMEC) feature is supported.
193	The event configuration settings are domain specific and affect
194	all the CPUs in the domain. When either event configuration is
195	changed, the bandwidth counters for all RMIDs of both events
196	(mbm_total_bytes as well as mbm_local_bytes) are cleared for that
197	domain. The next read for every RMID will report "Unavailable"
198	and subsequent reads will report the valid value.
199
200	Following are the types of events supported:
201
202	====    ========================================================
203	Bits    Description
204	====    ========================================================
205	6       Dirty Victims from the QOS domain to all types of memory
206	5       Reads to slow memory in the non-local NUMA domain
207	4       Reads to slow memory in the local NUMA domain
208	3       Non-temporal writes to non-local NUMA domain
209	2       Non-temporal writes to local NUMA domain
210	1       Reads to memory in the non-local NUMA domain
211	0       Reads to memory in the local NUMA domain
212	====    ========================================================
213
214	By default, the mbm_total_bytes configuration is set to 0x7f to count
215	all the event types and the mbm_local_bytes configuration is set to
216	0x15 to count all the local memory events.
217
218	Examples:
219
	* To view the current configuration:
	  ::
222
223	    # cat /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config
224	    0=0x7f;1=0x7f;2=0x7f;3=0x7f
225
226	    # cat /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config
227	    0=0x15;1=0x15;3=0x15;4=0x15
228
	* To change the mbm_total_bytes to count only reads on domain 0,
	  the bits 0, 1, 4 and 5 need to be set, which is 110011b in binary
	  (in hexadecimal 0x33):
	  ::
233
234	    # echo  "0=0x33" > /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config
235
236	    # cat /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config
237	    0=0x33;1=0x7f;2=0x7f;3=0x7f
238
	* To change the mbm_local_bytes to count all the slow memory reads on
	  domains 0 and 1, the bits 4 and 5 need to be set, which is 110000b
	  in binary (in hexadecimal 0x30):
	  ::
243
244	    # echo  "0=0x30;1=0x30" > /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config
245
246	    # cat /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config
247	    0=0x30;1=0x30;3=0x15;4=0x15
248
249"max_threshold_occupancy":
250		Read/write file provides the largest value (in
251		bytes) at which a previously used LLC_occupancy
252		counter can be considered for re-use.
253
Finally, in the top level of the "info" directory there is a file
named "last_cmd_status". This is reset with every "command" issued
via the file system (making new directories or writing to any of the
control files). If the command was successful, it will read as "ok".
If the command failed, it will provide more information than can be
conveyed in the error returns from file operations. E.g.
::
261
262	# echo L3:0=f7 > schemata
263	bash: echo: write error: Invalid argument
264	# cat info/last_cmd_status
265	mask f7 has non-consecutive 1-bits
266
267Resource alloc and monitor groups
268=================================
269
270Resource groups are represented as directories in the resctrl file
271system.  The default group is the root directory which, immediately
272after mounting, owns all the tasks and cpus in the system and can make
273full use of all resources.
274
275On a system with RDT control features additional directories can be
276created in the root directory that specify different amounts of each
277resource (see "schemata" below). The root and these additional top level
278directories are referred to as "CTRL_MON" groups below.
279
280On a system with RDT monitoring the root directory and other top level
281directories contain a directory named "mon_groups" in which additional
282directories can be created to monitor subsets of tasks in the CTRL_MON
283group that is their ancestor. These are called "MON" groups in the rest
284of this document.
285
286Removing a directory will move all tasks and cpus owned by the group it
287represents to the parent. Removing one of the created CTRL_MON groups
288will automatically remove all MON groups below it.
289
290Moving MON group directories to a new parent CTRL_MON group is supported
291for the purpose of changing the resource allocations of a MON group
292without impacting its monitoring data or assigned tasks. This operation
293is not allowed for MON groups which monitor CPUs. No other move
294operation is currently allowed other than simply renaming a CTRL_MON or
295MON group.
296
297All groups contain the following files:
298
299"tasks":
300	Reading this file shows the list of all tasks that belong to
301	this group. Writing a task id to the file will add a task to the
302	group. If the group is a CTRL_MON group the task is removed from
303	whichever previous CTRL_MON group owned the task and also from
304	any MON group that owned the task. If the group is a MON group,
305	then the task must already belong to the CTRL_MON parent of this
306	group. The task is removed from any previous MON group.
307
308
309"cpus":
310	Reading this file shows a bitmask of the logical CPUs owned by
311	this group. Writing a mask to this file will add and remove
312	CPUs to/from this group. As with the tasks file a hierarchy is
313	maintained where MON groups may only include CPUs owned by the
314	parent CTRL_MON group.
315	When the resource group is in pseudo-locked mode this file will
316	only be readable, reflecting the CPUs associated with the
317	pseudo-locked region.
318
319
320"cpus_list":
321	Just like "cpus", only using ranges of CPUs instead of bitmasks.
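
For example (a sketch, assuming a group "p0" exists on a system with at
least eight logical CPUs), the same assignment can be viewed in either
format::

  # echo f0 > p0/cpus
  # cat p0/cpus_list
  4-7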
322
323
324When control is enabled all CTRL_MON groups will also contain:
325
326"schemata":
327	A list of all the resources available to this group.
328	Each resource has its own line and format - see below for details.
329
330"size":
331	Mirrors the display of the "schemata" file to display the size in
332	bytes of each allocation instead of the bits representing the
333	allocation.
334
335"mode":
336	The "mode" of the resource group dictates the sharing of its
337	allocations. A "shareable" resource group allows sharing of its
338	allocations while an "exclusive" resource group does not. A
339	cache pseudo-locked region is created by first writing
340	"pseudo-locksetup" to the "mode" file before writing the cache
341	pseudo-locked region's schemata to the resource group's "schemata"
342	file. On successful pseudo-locked region creation the mode will
343	automatically change to "pseudo-locked".
344
When monitoring is enabled all CTRL_MON and MON groups will also contain:
346
347"mon_data":
348	This contains a set of files organized by L3 domain and by
349	RDT event. E.g. on a system with two L3 domains there will
350	be subdirectories "mon_L3_00" and "mon_L3_01".	Each of these
351	directories have one file per event (e.g. "llc_occupancy",
352	"mbm_total_bytes", and "mbm_local_bytes"). In a MON group these
353	files provide a read out of the current value of the event for
354	all tasks in the group. In CTRL_MON groups these files provide
355	the sum for all tasks in the CTRL_MON group and all tasks in
	MON groups. Please see the example section for more details on usage.
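
For example, on a system with two L3 domains and the three events listed
in "mon_features" enabled, the layout might look as follows (a sketch; the
exact set of files depends on the enabled events)::

  # ls /sys/fs/resctrl/mon_data/
  mon_L3_00  mon_L3_01
  # ls /sys/fs/resctrl/mon_data/mon_L3_00/
  llc_occupancy  mbm_local_bytes  mbm_total_bytes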
357
358Resource allocation rules
359-------------------------
360
361When a task is running the following rules define which resources are
362available to it:
363
3641) If the task is a member of a non-default group, then the schemata
365   for that group is used.
366
3672) Else if the task belongs to the default group, but is running on a
368   CPU that is assigned to some specific group, then the schemata for the
369   CPU's group is used.
370
3713) Otherwise the schemata for the default group is used.
372
373Resource monitoring rules
374-------------------------
1) If a task is a member of a MON group, or a non-default CTRL_MON group,
   then RDT events for the task will be reported in that group.
377
3782) If a task is a member of the default CTRL_MON group, but is running
379   on a CPU that is assigned to some specific group, then the RDT events
380   for the task will be reported in that group.
381
3823) Otherwise RDT events for the task will be reported in the root level
383   "mon_data" group.
384
385
386Notes on cache occupancy monitoring and control
387===============================================
388When moving a task from one group to another you should remember that
389this only affects *new* cache allocations by the task. E.g. you may have
a task in a monitor group showing 3 MB of cache occupancy. If you move
it to a new group and immediately check the occupancy of the old and new
392groups you will likely see that the old group is still showing 3 MB and
393the new group zero. When the task accesses locations still in cache from
394before the move, the h/w does not update any counters. On a busy system
395you will likely see the occupancy in the old group go down as cache lines
396are evicted and re-used while the occupancy in the new group rises as
397the task accesses memory and loads into the cache are counted based on
398membership in the new group.
399
400The same applies to cache allocation control. Moving a task to a group
401with a smaller cache partition will not evict any cache lines. The
402process may continue to use them from the old partition.
403
Hardware uses a CLOSID (Class of Service ID) and an RMID (Resource Monitoring ID)
to identify a control group and a monitoring group respectively. Each of
the resource groups is mapped to these IDs based on the kind of group. The
number of CLOSIDs and RMIDs is limited by the hardware and hence the creation
of a "CTRL_MON" directory may fail if we run out of either CLOSIDs or RMIDs
and creation of a "MON" group may fail if we run out of RMIDs.
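
Both limits can be read from the "info" directory; for example (the values
shown are hypothetical and vary by system)::

  # cat /sys/fs/resctrl/info/L3/num_closids
  16
  # cat /sys/fs/resctrl/info/L3_MON/num_rmids
  256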
410
411max_threshold_occupancy - generic concepts
412------------------------------------------
413
Note that an RMID once freed may not be immediately available for use as
the cache lines of the previous user of the RMID are still tagged with it.
Hence such RMIDs are placed on a limbo list and checked periodically to
see whether the cache occupancy has gone down. If at some point the system
has many limbo RMIDs which are not yet ready to be used, the user may see
an -EBUSY error during mkdir.
420
421max_threshold_occupancy is a user configurable value to determine the
422occupancy at which an RMID can be freed.
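
For example, to read the current threshold and lower it so that RMIDs are
recycled more aggressively (the values shown are only illustrative)::

  # cat /sys/fs/resctrl/info/L3_MON/max_threshold_occupancy
  540672
  # echo 65536 > /sys/fs/resctrl/info/L3_MON/max_threshold_occupancy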
423
424Schemata files - general concepts
425---------------------------------
426Each line in the file describes one resource. The line starts with
427the name of the resource, followed by specific values to be applied
428in each of the instances of that resource on the system.
429
430Cache IDs
431---------
On current generation systems there is one L3 cache per socket and L2
caches are generally just shared by the hyperthreads on a core, but this
isn't an architectural requirement. We could have multiple separate L3
caches on a socket, or multiple cores could share an L2 cache. So instead
of using "socket" or "core" to define the set of logical cpus sharing
a resource we use a "Cache ID". At a given cache level this will be a
unique number across the whole system (but it isn't guaranteed to be a
contiguous sequence, there may be gaps).  To find the ID for each logical
CPU look in /sys/devices/system/cpu/cpu*/cache/index*/id.
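
For example, on a system where "index3" corresponds to the L3 cache, the
L3 cache ID of CPU 0 can be found with (output is system dependent)::

  # cat /sys/devices/system/cpu/cpu0/cache/index3/id
  0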
441
442Cache Bit Masks (CBM)
443---------------------
444For cache resources we describe the portion of the cache that is available
445for allocation using a bitmask. The maximum value of the mask is defined
446by each cpu model (and may be different for different cache levels). It
447is found using CPUID, but is also provided in the "info" directory of
448the resctrl file system in "info/{resource}/cbm_mask". Intel hardware
449requires that these masks have all the '1' bits in a contiguous block. So
4500x3, 0x6 and 0xC are legal 4-bit masks with two bits set, but 0x5, 0x9
451and 0xA are not.  On a system with a 20-bit mask each bit represents 5%
452of the capacity of the cache. You could partition the cache into four
453equal parts with masks: 0x1f, 0x3e0, 0x7c00, 0xf8000.
454
455Memory bandwidth Allocation and monitoring
456==========================================
457
458For Memory bandwidth resource, by default the user controls the resource
459by indicating the percentage of total memory bandwidth.
460
461The minimum bandwidth percentage value for each cpu model is predefined
462and can be looked up through "info/MB/min_bandwidth". The bandwidth
463granularity that is allocated is also dependent on the cpu model and can
464be looked up at "info/MB/bandwidth_gran". The available bandwidth
465control steps are: min_bw + N * bw_gran. Intermediate values are rounded
466to the next control step available on the hardware.
467
The bandwidth throttling is a core specific mechanism on some Intel
SKUs. Using a high bandwidth and a low bandwidth setting on two threads
sharing a core may result in both threads being throttled to use the
low bandwidth (see "thread_throttle_mode").

The fact that Memory bandwidth allocation (MBA) may be a core
specific mechanism whereas memory bandwidth monitoring (MBM) is done at
the package level may lead to confusion when users try to apply control
via the MBA and then monitor the bandwidth to see if the controls are
effective. Below are such scenarios:
478
4791. User may *not* see increase in actual bandwidth when percentage
480   values are increased:
481
This can occur when aggregate L2 external bandwidth is more than L3
external bandwidth. Consider an SKL SKU with 24 cores on a package and
where L2 external bandwidth is 10GBps (hence aggregate L2 external
bandwidth is 240GBps) and L3 external bandwidth is 100GBps. Now a workload
with '20 threads, having 50% bandwidth, each consuming 5GBps' consumes the
max L3 bandwidth of 100GBps although the percentage value specified is only
50% << 100%. Hence increasing the bandwidth percentage will not yield any
more bandwidth. This is because although the L2 external bandwidth still
has capacity, the L3 external bandwidth is fully used. Also note that
this would be dependent on the number of cores the benchmark is run on.
492
4932. Same bandwidth percentage may mean different actual bandwidth
494   depending on # of threads:
495
For the same SKU in #1, a 'single thread, with 10% bandwidth' and '4
threads, with 10% bandwidth' can consume up to 10GBps and 40GBps although
they have the same percentage bandwidth of 10%. This is simply because as
threads start using more cores in an rdtgroup, the actual bandwidth may
increase or vary although the user specified bandwidth percentage is the same.
501
In order to mitigate this and make the interface more user friendly,
resctrl added support for specifying the bandwidth in MBps as well.  The
kernel underneath would use a software feedback mechanism or a "Software
Controller (mba_sc)" which reads the actual bandwidth using MBM counters
and adjusts the memory bandwidth percentages to ensure::

	"actual bandwidth < user specified bandwidth".

By default, the schemata would take the bandwidth percentage values
whereas the user can switch to the "MBA software controller" mode using
a mount option 'mba_MBps'. The schemata format is specified in the below
sections.
514
515L3 schemata file details (code and data prioritization disabled)
516----------------------------------------------------------------
517With CDP disabled the L3 schemata format is::
518
519	L3:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
520
521L3 schemata file details (CDP enabled via mount option to resctrl)
522------------------------------------------------------------------
523When CDP is enabled L3 control is split into two separate resources
524so you can specify independent masks for code and data like this::
525
526	L3DATA:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
527	L3CODE:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
528
529L2 schemata file details
530------------------------
531CDP is supported at L2 using the 'cdpl2' mount option. The schemata
532format is either::
533
534	L2:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
535
536or
537
538	L2DATA:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
539	L2CODE:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
540
541
542Memory bandwidth Allocation (default mode)
543------------------------------------------
544
545Memory b/w domain is L3 cache.
546::
547
548	MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;...
549
550Memory bandwidth Allocation specified in MBps
551---------------------------------------------
552
553Memory bandwidth domain is L3 cache.
554::
555
556	MB:<cache_id0>=bw_MBps0;<cache_id1>=bw_MBps1;...
557
558Slow Memory Bandwidth Allocation (SMBA)
559---------------------------------------
560AMD hardware supports Slow Memory Bandwidth Allocation (SMBA).
561CXL.memory is the only supported "slow" memory device. With the
562support of SMBA, the hardware enables bandwidth allocation on
563the slow memory devices. If there are multiple such devices in
564the system, the throttling logic groups all the slow sources
565together and applies the limit on them as a whole.
566
The presence of SMBA (with CXL.memory) is independent of the presence of
slow memory devices. If there are no such devices on the system, then
configuring SMBA will have no impact on the performance of the system.
570
571The bandwidth domain for slow memory is L3 cache. Its schemata file
572is formatted as:
573::
574
575	SMBA:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;...
576
577Reading/writing the schemata file
578---------------------------------
579Reading the schemata file will show the state of all resources
580on all domains. When writing you only need to specify those values
581which you wish to change.  E.g.
582::
583
584  # cat schemata
585  L3DATA:0=fffff;1=fffff;2=fffff;3=fffff
586  L3CODE:0=fffff;1=fffff;2=fffff;3=fffff
587  # echo "L3DATA:2=3c0;" > schemata
588  # cat schemata
589  L3DATA:0=fffff;1=fffff;2=3c0;3=fffff
590  L3CODE:0=fffff;1=fffff;2=fffff;3=fffff
591
592Reading/writing the schemata file (on AMD systems)
593--------------------------------------------------
Reading the schemata file will show the current bandwidth limit on all
domains. The allocated resources are in multiples of one eighth GB/s.
When writing to the file, you need to specify the cache id for which you
wish to configure the bandwidth limit.

For example, to allocate a 2GB/s limit on cache id 1:
600
601::
602
603  # cat schemata
604    MB:0=2048;1=2048;2=2048;3=2048
605    L3:0=ffff;1=ffff;2=ffff;3=ffff
606
607  # echo "MB:1=16" > schemata
608  # cat schemata
609    MB:0=2048;1=  16;2=2048;3=2048
610    L3:0=ffff;1=ffff;2=ffff;3=ffff
611
612Reading/writing the schemata file (on AMD systems) with SMBA feature
613--------------------------------------------------------------------
Reading and writing the schemata file is the same as described in the
section above for systems without SMBA.

For example, to allocate an 8GB/s limit on cache id 1:
618
619::
620
621  # cat schemata
622    SMBA:0=2048;1=2048;2=2048;3=2048
623      MB:0=2048;1=2048;2=2048;3=2048
624      L3:0=ffff;1=ffff;2=ffff;3=ffff
625
626  # echo "SMBA:1=64" > schemata
627  # cat schemata
628    SMBA:0=2048;1=  64;2=2048;3=2048
629      MB:0=2048;1=2048;2=2048;3=2048
630      L3:0=ffff;1=ffff;2=ffff;3=ffff
631
632Cache Pseudo-Locking
633====================
634CAT enables a user to specify the amount of cache space that an
635application can fill. Cache pseudo-locking builds on the fact that a
636CPU can still read and write data pre-allocated outside its current
637allocated area on a cache hit. With cache pseudo-locking, data can be
638preloaded into a reserved portion of cache that no application can
639fill, and from that point on will only serve cache hits. The cache
640pseudo-locked memory is made accessible to user space where an
641application can map it into its virtual address space and thus have
642a region of memory with reduced average read latency.
643
The creation of a cache pseudo-locked region is triggered by a request
from the user to do so, accompanied by a schemata of the region
to be pseudo-locked. The cache pseudo-locked region is created as follows:
647
648- Create a CAT allocation CLOSNEW with a CBM matching the schemata
649  from the user of the cache region that will contain the pseudo-locked
650  memory. This region must not overlap with any current CAT allocation/CLOS
651  on the system and no future overlap with this cache region is allowed
652  while the pseudo-locked region exists.
653- Create a contiguous region of memory of the same size as the cache
654  region.
655- Flush the cache, disable hardware prefetchers, disable preemption.
656- Make CLOSNEW the active CLOS and touch the allocated memory to load
657  it into the cache.
658- Set the previous CLOS as active.
659- At this point the closid CLOSNEW can be released - the cache
660  pseudo-locked region is protected as long as its CBM does not appear in
661  any CAT allocation. Even though the cache pseudo-locked region will from
662  this point on not appear in any CBM of any CLOS an application running with
663  any CLOS will be able to access the memory in the pseudo-locked region since
664  the region continues to serve cache hits.
665- The contiguous region of memory loaded into the cache is exposed to
666  user-space as a character device.
667
668Cache pseudo-locking increases the probability that data will remain
669in the cache via carefully configuring the CAT feature and controlling
670application behavior. There is no guarantee that data is placed in
671cache. Instructions like INVD, WBINVD, CLFLUSH, etc. can still evict
672“locked” data from cache. Power management C-states may shrink or
673power off cache. Deeper C-states will automatically be restricted on
674pseudo-locked region creation.
675
676It is required that an application using a pseudo-locked region runs
677with affinity to the cores (or a subset of the cores) associated
678with the cache on which the pseudo-locked region resides. A sanity check
679within the code will not allow an application to map pseudo-locked memory
680unless it runs with affinity to cores associated with the cache on which the
pseudo-locked region resides. The sanity check is only done during the
initial mmap() handling; there is no enforcement afterwards and the
application itself needs to ensure it remains affine to the correct cores.
684
685Pseudo-locking is accomplished in two stages:
686
6871) During the first stage the system administrator allocates a portion
688   of cache that should be dedicated to pseudo-locking. At this time an
   equivalent portion of memory is allocated, loaded into the allocated
   cache portion, and exposed as a character device.
6912) During the second stage a user-space application maps (mmap()) the
692   pseudo-locked memory into its address space.
693
694Cache Pseudo-Locking Interface
695------------------------------
696A pseudo-locked region is created using the resctrl interface as follows:
697
6981) Create a new resource group by creating a new directory in /sys/fs/resctrl.
6992) Change the new resource group's mode to "pseudo-locksetup" by writing
700   "pseudo-locksetup" to the "mode" file.
7013) Write the schemata of the pseudo-locked region to the "schemata" file. All
702   bits within the schemata should be "unused" according to the "bit_usage"
703   file.
704
705On successful pseudo-locked region creation the "mode" file will contain
706"pseudo-locked" and a new character device with the same name as the resource
707group will exist in /dev/pseudo_lock. This character device can be mmap()'ed
708by user space in order to obtain access to the pseudo-locked memory region.
709
710An example of cache pseudo-locked region creation and usage can be found below.
711
712Cache Pseudo-Locking Debugging Interface
713----------------------------------------
714The pseudo-locking debugging interface is enabled by default (if
715CONFIG_DEBUG_FS is enabled) and can be found in /sys/kernel/debug/resctrl.
716
717There is no explicit way for the kernel to test if a provided memory
718location is present in the cache. The pseudo-locking debugging interface uses
719the tracing infrastructure to provide two ways to measure cache residency of
720the pseudo-locked region:
721
7221) Memory access latency using the pseudo_lock_mem_latency tracepoint. Data
723   from these measurements are best visualized using a hist trigger (see
724   example below). In this test the pseudo-locked region is traversed at
725   a stride of 32 bytes while hardware prefetchers and preemption
726   are disabled. This also provides a substitute visualization of cache
727   hits and misses.
7282) Cache hit and miss measurements using model specific precision counters if
729   available. Depending on the levels of cache on the system the pseudo_lock_l2
730   and pseudo_lock_l3 tracepoints are available.
731
732When a pseudo-locked region is created a new debugfs directory is created for
733it in debugfs as /sys/kernel/debug/resctrl/<newdir>. A single
734write-only file, pseudo_lock_measure, is present in this directory. The
735measurement of the pseudo-locked region depends on the number written to this
736debugfs file:
737
7381:
739     writing "1" to the pseudo_lock_measure file will trigger the latency
740     measurement captured in the pseudo_lock_mem_latency tracepoint. See
741     example below.
7422:
743     writing "2" to the pseudo_lock_measure file will trigger the L2 cache
744     residency (cache hits and misses) measurement captured in the
745     pseudo_lock_l2 tracepoint. See example below.
7463:
747     writing "3" to the pseudo_lock_measure file will trigger the L3 cache
748     residency (cache hits and misses) measurement captured in the
749     pseudo_lock_l3 tracepoint.
750
751All measurements are recorded with the tracing infrastructure. This requires
752the relevant tracepoints to be enabled before the measurement is triggered.
753
754Example of latency debugging interface
755~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
756In this example a pseudo-locked region named "newlock" was created. Here is
757how we can measure the latency in cycles of reading from this region and
758visualize this data with a histogram that is available if CONFIG_HIST_TRIGGERS
759is set::
760
761  # :> /sys/kernel/tracing/trace
762  # echo 'hist:keys=latency' > /sys/kernel/tracing/events/resctrl/pseudo_lock_mem_latency/trigger
763  # echo 1 > /sys/kernel/tracing/events/resctrl/pseudo_lock_mem_latency/enable
764  # echo 1 > /sys/kernel/debug/resctrl/newlock/pseudo_lock_measure
765  # echo 0 > /sys/kernel/tracing/events/resctrl/pseudo_lock_mem_latency/enable
766  # cat /sys/kernel/tracing/events/resctrl/pseudo_lock_mem_latency/hist
767
768  # event histogram
769  #
770  # trigger info: hist:keys=latency:vals=hitcount:sort=hitcount:size=2048 [active]
771  #
772
773  { latency:        456 } hitcount:          1
774  { latency:         50 } hitcount:         83
775  { latency:         36 } hitcount:         96
776  { latency:         44 } hitcount:        174
777  { latency:         48 } hitcount:        195
778  { latency:         46 } hitcount:        262
779  { latency:         42 } hitcount:        693
780  { latency:         40 } hitcount:       3204
781  { latency:         38 } hitcount:       3484
782
783  Totals:
784      Hits: 8192
785      Entries: 9
786    Dropped: 0
787
788Example of cache hits/misses debugging
789~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
790In this example a pseudo-locked region named "newlock" was created on the L2
791cache of a platform. Here is how we can obtain details of the cache hits
792and misses using the platform's precision counters.
793::
794
795  # :> /sys/kernel/tracing/trace
796  # echo 1 > /sys/kernel/tracing/events/resctrl/pseudo_lock_l2/enable
797  # echo 2 > /sys/kernel/debug/resctrl/newlock/pseudo_lock_measure
798  # echo 0 > /sys/kernel/tracing/events/resctrl/pseudo_lock_l2/enable
799  # cat /sys/kernel/tracing/trace
800
801  # tracer: nop
802  #
803  #                              _-----=> irqs-off
804  #                             / _----=> need-resched
805  #                            | / _---=> hardirq/softirq
806  #                            || / _--=> preempt-depth
807  #                            ||| /     delay
808  #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
809  #              | |       |   ||||       |         |
810  pseudo_lock_mea-1672  [002] ....  3132.860500: pseudo_lock_l2: hits=4097 miss=0
811
812
813Examples for RDT allocation usage
814~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
815
8161) Example 1
817
818On a two socket machine (one L3 cache per socket) with just four bits
for cache bit masks, a minimum b/w of 10% and a memory bandwidth
granularity of 10%.
821::
822
823  # mount -t resctrl resctrl /sys/fs/resctrl
824  # cd /sys/fs/resctrl
825  # mkdir p0 p1
  # echo -e "L3:0=3;1=c\nMB:0=50;1=50" > /sys/fs/resctrl/p0/schemata
  # echo -e "L3:0=3;1=3\nMB:0=50;1=50" > /sys/fs/resctrl/p1/schemata
828
829The default resource group is unmodified, so we have access to all parts
830of all caches (its schemata file reads "L3:0=f;1=f").
831
832Tasks that are under the control of group "p0" may only allocate from the
833"lower" 50% on cache ID 0, and the "upper" 50% of cache ID 1.
834Tasks in group "p1" use the "lower" 50% of cache on both sockets.
835
836Similarly, tasks that are under the control of group "p0" may use a
837maximum memory b/w of 50% on socket0 and 50% on socket 1.
838Tasks in group "p1" may also use 50% memory b/w on both sockets.
Note that unlike cache masks, memory b/w cannot specify whether these
allocations can overlap or not. The allocations specify the maximum
b/w that the group may be able to use and the system admin can configure
the b/w accordingly.
843
If resctrl is using the software controller (mba_sc) then the user can
enter the max b/w in MBps rather than the percentage values.
846::
847
  # echo -e "L3:0=3;1=c\nMB:0=1024;1=500" > /sys/fs/resctrl/p0/schemata
  # echo -e "L3:0=3;1=3\nMB:0=1024;1=500" > /sys/fs/resctrl/p1/schemata
850
In the above example the tasks in "p1" and "p0" on socket 0 would use a max b/w
of 1024MBps whereas on socket 1 they would use 500MBps.
853
8542) Example 2
855
856Again two sockets, but this time with a more realistic 20-bit mask.
857
Two real time tasks pid=1234 running on processor 0 and pid=5678 running on
processor 1 on socket 0 of a 2-socket, dual-core machine. To avoid noisy
860neighbors, each of the two real-time tasks exclusively occupies one quarter
861of L3 cache on socket 0.
862::
863
864  # mount -t resctrl resctrl /sys/fs/resctrl
865  # cd /sys/fs/resctrl
866
867First we reset the schemata for the default group so that the "upper"
86850% of the L3 cache on socket 0 and 50% of memory b/w cannot be used by
869ordinary tasks::
870
  # echo -e "L3:0=3ff;1=fffff\nMB:0=50;1=100" > schemata
872
873Next we make a resource group for our first real time task and give
874it access to the "top" 25% of the cache on socket 0.
875::
876
877  # mkdir p0
878  # echo "L3:0=f8000;1=fffff" > p0/schemata
879
880Finally we move our first real time task into this resource group. We
881also use taskset(1) to ensure the task always runs on a dedicated CPU
882on socket 0. Most uses of resource groups will also constrain which
883processors tasks run on.
884::
885
886  # echo 1234 > p0/tasks
887  # taskset -cp 1 1234
888
889Ditto for the second real time task (with the remaining 25% of cache)::
890
891  # mkdir p1
892  # echo "L3:0=7c00;1=fffff" > p1/schemata
893  # echo 5678 > p1/tasks
894  # taskset -cp 2 5678
895
For the same 2 socket system with memory b/w resource and CAT L3 the
schemata would look like (assuming min_bandwidth is 10 and bandwidth_gran
is 10):
899
900For our first real time task this would request 20% memory b/w on socket 0.
901::
902
903  # echo -e "L3:0=f8000;1=fffff\nMB:0=20;1=100" > p0/schemata
904
For our second real time task this would request another 20% memory b/w
on socket 0.
::

  # echo -e "L3:0=7c00;1=fffff\nMB:0=20;1=100" > p1/schemata
910
9113) Example 3
912
A single socket system which has real-time tasks running on cores 4-7 and
a non real-time workload assigned to cores 0-3. The real-time tasks share text
915and data, so a per task association is not required and due to interaction
916with the kernel it's desired that the kernel on these cores shares L3 with
917the tasks.
918::
919
920  # mount -t resctrl resctrl /sys/fs/resctrl
921  # cd /sys/fs/resctrl
922
923First we reset the schemata for the default group so that the "upper"
92450% of the L3 cache on socket 0, and 50% of memory bandwidth on socket 0
925cannot be used by ordinary tasks::
926
  # echo -e "L3:0=3ff\nMB:0=50" > schemata
928
929Next we make a resource group for our real time cores and give it access
930to the "top" 50% of the cache on socket 0 and 50% of memory bandwidth on
931socket 0.
932::
933
934  # mkdir p0
935  # echo "L3:0=ffc00\nMB:0=50" > p0/schemata
936
Finally we move cores 4-7 over to the new group and make sure that the
938kernel and the tasks running there get 50% of the cache. They should
939also get 50% of memory bandwidth assuming that the cores 4-7 are SMT
940siblings and only the real time threads are scheduled on the cores 4-7.
941::
942
943  # echo F0 > p0/cpus
944
9454) Example 4
946
The resource groups in previous examples were all in the default "shareable"
mode allowing sharing of their cache allocations. If one resource group
configures a cache allocation then nothing prevents another resource group
from overlapping with that allocation.
951
952In this example a new exclusive resource group will be created on a L2 CAT
953system with two L2 cache instances that can be configured with an 8-bit
954capacity bitmask. The new exclusive resource group will be configured to use
95525% of each cache instance.
956::
957
958  # mount -t resctrl resctrl /sys/fs/resctrl/
959  # cd /sys/fs/resctrl
960
961First, we observe that the default group is configured to allocate to all L2
962cache::
963
964  # cat schemata
965  L2:0=ff;1=ff
966
967We could attempt to create the new resource group at this point, but it will
968fail because of the overlap with the schemata of the default group::
969
970  # mkdir p0
971  # echo 'L2:0=0x3;1=0x3' > p0/schemata
972  # cat p0/mode
973  shareable
974  # echo exclusive > p0/mode
975  -sh: echo: write error: Invalid argument
976  # cat info/last_cmd_status
977  schemata overlaps
978
979To ensure that there is no overlap with another resource group the default
980resource group's schemata has to change, making it possible for the new
981resource group to become exclusive.
982::
983
984  # echo 'L2:0=0xfc;1=0xfc' > schemata
985  # echo exclusive > p0/mode
986  # grep . p0/*
987  p0/cpus:0
988  p0/mode:exclusive
989  p0/schemata:L2:0=03;1=03
990  p0/size:L2:0=262144;1=262144
991
992A new resource group will on creation not overlap with an exclusive resource
993group::
994
995  # mkdir p1
996  # grep . p1/*
997  p1/cpus:0
998  p1/mode:shareable
999  p1/schemata:L2:0=fc;1=fc
1000  p1/size:L2:0=786432;1=786432
1001
1002The bit_usage will reflect how the cache is used::
1003
1004  # cat info/L2/bit_usage
1005  0=SSSSSSEE;1=SSSSSSEE
1006
1007A resource group cannot be forced to overlap with an exclusive resource group::
1008
1009  # echo 'L2:0=0x1;1=0x1' > p1/schemata
1010  -sh: echo: write error: Invalid argument
1011  # cat info/last_cmd_status
1012  overlaps with exclusive group
1013
1014Example of Cache Pseudo-Locking
1015~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1016Lock portion of L2 cache from cache id 1 using CBM 0x3. Pseudo-locked
1017region is exposed at /dev/pseudo_lock/newlock that can be provided to
1018application for argument to mmap().
1019::
1020
1021  # mount -t resctrl resctrl /sys/fs/resctrl/
1022  # cd /sys/fs/resctrl
1023
Ensure that there are bits available that can be pseudo-locked. Since only
unused bits can be pseudo-locked, the bits to be pseudo-locked need to be
removed from the default resource group's schemata::
1027
1028  # cat info/L2/bit_usage
1029  0=SSSSSSSS;1=SSSSSSSS
1030  # echo 'L2:1=0xfc' > schemata
1031  # cat info/L2/bit_usage
1032  0=SSSSSSSS;1=SSSSSS00
1033
1034Create a new resource group that will be associated with the pseudo-locked
1035region, indicate that it will be used for a pseudo-locked region, and
1036configure the requested pseudo-locked region capacity bitmask::
1037
1038  # mkdir newlock
1039  # echo pseudo-locksetup > newlock/mode
1040  # echo 'L2:1=0x3' > newlock/schemata
1041
1042On success the resource group's mode will change to pseudo-locked, the
1043bit_usage will reflect the pseudo-locked region, and the character device
1044exposing the pseudo-locked region will exist::
1045
1046  # cat newlock/mode
1047  pseudo-locked
1048  # cat info/L2/bit_usage
1049  0=SSSSSSSS;1=SSSSSSPP
1050  # ls -l /dev/pseudo_lock/newlock
1051  crw------- 1 root root 243, 0 Apr  3 05:01 /dev/pseudo_lock/newlock
1052
1053::
1054
1055  /*
1056  * Example code to access one page of pseudo-locked cache region
1057  * from user space.
1058  */
1059  #define _GNU_SOURCE
1060  #include <fcntl.h>
1061  #include <sched.h>
1062  #include <stdio.h>
1063  #include <stdlib.h>
1064  #include <unistd.h>
1065  #include <sys/mman.h>
1066
1067  /*
1068  * It is required that the application runs with affinity to only
1069  * cores associated with the pseudo-locked region. Here the cpu
1070  * is hardcoded for convenience of example.
1071  */
1072  static int cpuid = 2;
1073
1074  int main(int argc, char *argv[])
1075  {
1076    cpu_set_t cpuset;
1077    long page_size;
1078    void *mapping;
1079    int dev_fd;
1080    int ret;
1081
1082    page_size = sysconf(_SC_PAGESIZE);
1083
1084    CPU_ZERO(&cpuset);
1085    CPU_SET(cpuid, &cpuset);
1086    ret = sched_setaffinity(0, sizeof(cpuset), &cpuset);
1087    if (ret < 0) {
1088      perror("sched_setaffinity");
1089      exit(EXIT_FAILURE);
1090    }
1091
1092    dev_fd = open("/dev/pseudo_lock/newlock", O_RDWR);
1093    if (dev_fd < 0) {
1094      perror("open");
1095      exit(EXIT_FAILURE);
1096    }
1097
1098    mapping = mmap(0, page_size, PROT_READ | PROT_WRITE, MAP_SHARED,
1099            dev_fd, 0);
1100    if (mapping == MAP_FAILED) {
1101      perror("mmap");
1102      close(dev_fd);
1103      exit(EXIT_FAILURE);
1104    }
1105
1106    /* Application interacts with pseudo-locked memory @mapping */
1107
1108    ret = munmap(mapping, page_size);
1109    if (ret < 0) {
1110      perror("munmap");
1111      close(dev_fd);
1112      exit(EXIT_FAILURE);
1113    }
1114
1115    close(dev_fd);
1116    exit(EXIT_SUCCESS);
1117  }
1118
1119Locking between applications
1120----------------------------
1121
1122Certain operations on the resctrl filesystem, composed of read/writes
1123to/from multiple files, must be atomic.
1124
1125As an example, the allocation of an exclusive reservation of L3 cache
1126involves:
1127
1128  1. Read the cbmmasks from each directory or the per-resource "bit_usage"
1129  2. Find a contiguous set of bits in the global CBM bitmask that is clear
1130     in any of the directory cbmmasks
1131  3. Create a new directory
1132  4. Set the bits found in step 2 to the new directory "schemata" file
1133
1134If two applications attempt to allocate space concurrently then they can
1135end up allocating the same bits so the reservations are shared instead of
1136exclusive.
1137
1138To coordinate atomic operations on the resctrlfs and to avoid the problem
1139above, the following locking procedure is recommended:
1140
Locking is based on flock, which is available in libc and also as a
shell script command.
1143
Write lock:

 A) Take flock(LOCK_EX) on /sys/fs/resctrl
 B) Read/write the directory structure.
 C) Release the lock with flock(LOCK_UN)

Read lock:

 A) Take flock(LOCK_SH) on /sys/fs/resctrl
 B) If successful, read the directory structure.
 C) Release the lock with flock(LOCK_UN)
1155
1156Example with bash::
1157
1158  # Atomically read directory structure
1159  $ flock -s /sys/fs/resctrl/ find /sys/fs/resctrl
1160
1161  # Read directory contents and create new subdirectory
1162
  $ cat create-dir.sh
  find /sys/fs/resctrl/ > output.txt
  mask=$(function-of output.txt)
  mkdir /sys/fs/resctrl/newres/
  echo "$mask" > /sys/fs/resctrl/newres/schemata
1168
1169  $ flock /sys/fs/resctrl/ ./create-dir.sh
1170
1171Example with C::
1172
  /*
  * Example code to take advisory locks
  * before accessing resctrl filesystem
  */
  #include <sys/file.h>
  #include <fcntl.h>
  #include <stdio.h>
  #include <stdlib.h>
1179
1180  void resctrl_take_shared_lock(int fd)
1181  {
1182    int ret;
1183
1184    /* take shared lock on resctrl filesystem */
1185    ret = flock(fd, LOCK_SH);
1186    if (ret) {
1187      perror("flock");
1188      exit(-1);
1189    }
1190  }
1191
1192  void resctrl_take_exclusive_lock(int fd)
1193  {
1194    int ret;
1195
    /* take exclusive lock on resctrl filesystem */
1197    ret = flock(fd, LOCK_EX);
1198    if (ret) {
1199      perror("flock");
1200      exit(-1);
1201    }
1202  }
1203
1204  void resctrl_release_lock(int fd)
1205  {
1206    int ret;
1207
    /* release lock on resctrl filesystem */
1209    ret = flock(fd, LOCK_UN);
1210    if (ret) {
1211      perror("flock");
1212      exit(-1);
1213    }
1214  }
1215
  int main(void)
  {
    int fd;
1219
1220    fd = open("/sys/fs/resctrl", O_DIRECTORY);
1221    if (fd == -1) {
1222      perror("open");
1223      exit(-1);
1224    }
1225    resctrl_take_shared_lock(fd);
1226    /* code to read directory contents */
1227    resctrl_release_lock(fd);
1228
1229    resctrl_take_exclusive_lock(fd);
1230    /* code to read and write directory contents */
1231    resctrl_release_lock(fd);
1232  }
1233
1234Examples for RDT Monitoring along with allocation usage
1235=======================================================
1236Reading monitored data
1237----------------------
Reading an event file (e.g. mon_data/mon_L3_00/llc_occupancy) would
1239show the current snapshot of LLC occupancy of the corresponding MON
1240group or CTRL_MON group.
1241
1242
1243Example 1 (Monitor CTRL_MON group and subset of tasks in CTRL_MON group)
1244------------------------------------------------------------------------
1245On a two socket machine (one L3 cache per socket) with just four bits
1246for cache bit masks::
1247
1248  # mount -t resctrl resctrl /sys/fs/resctrl
1249  # cd /sys/fs/resctrl
1250  # mkdir p0 p1
1251  # echo "L3:0=3;1=c" > /sys/fs/resctrl/p0/schemata
1252  # echo "L3:0=3;1=3" > /sys/fs/resctrl/p1/schemata
1253  # echo 5678 > p1/tasks
1254  # echo 5679 > p1/tasks
1255
1256The default resource group is unmodified, so we have access to all parts
1257of all caches (its schemata file reads "L3:0=f;1=f").
1258
1259Tasks that are under the control of group "p0" may only allocate from the
1260"lower" 50% on cache ID 0, and the "upper" 50% of cache ID 1.
1261Tasks in group "p1" use the "lower" 50% of cache on both sockets.
1262
1263Create monitor groups and assign a subset of tasks to each monitor group.
1264::
1265
1266  # cd /sys/fs/resctrl/p1/mon_groups
1267  # mkdir m11 m12
1268  # echo 5678 > m11/tasks
1269  # echo 5679 > m12/tasks
1270
1271fetch data (data shown in bytes)
1272::
1273
1274  # cat m11/mon_data/mon_L3_00/llc_occupancy
1275  16234000
1276  # cat m11/mon_data/mon_L3_01/llc_occupancy
1277  14789000
1278  # cat m12/mon_data/mon_L3_00/llc_occupancy
1279  16789000
1280
The parent CTRL_MON group shows the aggregated data.
::

  # cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy
1285  31234000
1286
1287Example 2 (Monitor a task from its creation)
1288--------------------------------------------
1289On a two socket machine (one L3 cache per socket)::
1290
1291  # mount -t resctrl resctrl /sys/fs/resctrl
1292  # cd /sys/fs/resctrl
1293  # mkdir p0 p1
1294
An RMID is allocated to the group once it is created and hence the <cmd>
below is monitored from its creation.
1297::
1298
1299  # echo $$ > /sys/fs/resctrl/p1/tasks
1300  # <cmd>
1301
1302Fetch the data::
1303
  # cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy
1305  31789000
1306
1307Example 3 (Monitor without CAT support or before creating CAT groups)
1308---------------------------------------------------------------------
1309
Assume a system like HSW has only CQM and no CAT support. In this case
resctrl will still mount but cannot create CTRL_MON directories.
But the user can create different MON groups within the root group and
thereby monitor all tasks including kernel threads.

This can also be used to profile jobs' cache size footprint before
allocating them to different allocation groups.
1317::
1318
1319  # mount -t resctrl resctrl /sys/fs/resctrl
1320  # cd /sys/fs/resctrl
1321  # mkdir mon_groups/m01
1322  # mkdir mon_groups/m02
1323
1324  # echo 3478 > /sys/fs/resctrl/mon_groups/m01/tasks
1325  # echo 2467 > /sys/fs/resctrl/mon_groups/m02/tasks
1326
Monitor the groups separately and also get per domain data. From the
output below it is apparent that the tasks are mostly doing work on
domain (socket) 0.
1330::
1331
  # cat /sys/fs/resctrl/mon_groups/m01/mon_data/mon_L3_00/llc_occupancy
  31234000
  # cat /sys/fs/resctrl/mon_groups/m01/mon_data/mon_L3_01/llc_occupancy
  34555
  # cat /sys/fs/resctrl/mon_groups/m02/mon_data/mon_L3_00/llc_occupancy
  31234000
  # cat /sys/fs/resctrl/mon_groups/m02/mon_data/mon_L3_01/llc_occupancy
  32789
1340
1341
1342Example 4 (Monitor real time tasks)
1343-----------------------------------
1344
1345A single socket system which has real time tasks running on cores 4-7
1346and non real time tasks on other cpus. We want to monitor the cache
1347occupancy of the real time threads on these cores.
1348::
1349
1350  # mount -t resctrl resctrl /sys/fs/resctrl
1351  # cd /sys/fs/resctrl
1352  # mkdir p1
1353
1354Move the cpus 4-7 over to p1::
1355
1356  # echo f0 > p1/cpus
1357
1358View the llc occupancy snapshot::
1359
1360  # cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy
1361  11234000
1362
1363Intel RDT Errata
1364================
1365
1366Intel MBM Counters May Report System Memory Bandwidth Incorrectly
1367-----------------------------------------------------------------
1368
1369Errata SKX99 for Skylake server and BDF102 for Broadwell server.
1370
1371Problem: Intel Memory Bandwidth Monitoring (MBM) counters track metrics
1372according to the assigned Resource Monitor ID (RMID) for that logical
1373core. The IA32_QM_CTR register (MSR 0xC8E), used to report these
1374metrics, may report incorrect system bandwidth for certain RMID values.
1375
1376Implication: Due to the errata, system memory bandwidth may not match
1377what is reported.
1378
1379Workaround: MBM total and local readings are corrected according to the
1380following correction factor table:
1381
1382+---------------+---------------+---------------+-----------------+
1383|core count	|rmid count	|rmid threshold	|correction factor|
1384+---------------+---------------+---------------+-----------------+
1385|1		|8		|0		|1.000000	  |
1386+---------------+---------------+---------------+-----------------+
1387|2		|16		|0		|1.000000	  |
1388+---------------+---------------+---------------+-----------------+
1389|3		|24		|15		|0.969650	  |
1390+---------------+---------------+---------------+-----------------+
1391|4		|32		|0		|1.000000	  |
1392+---------------+---------------+---------------+-----------------+
1393|6		|48		|31		|0.969650	  |
1394+---------------+---------------+---------------+-----------------+
1395|7		|56		|47		|1.142857	  |
1396+---------------+---------------+---------------+-----------------+
1397|8		|64		|0		|1.000000	  |
1398+---------------+---------------+---------------+-----------------+
1399|9		|72		|63		|1.185115	  |
1400+---------------+---------------+---------------+-----------------+
1401|10		|80		|63		|1.066553	  |
1402+---------------+---------------+---------------+-----------------+
1403|11		|88		|79		|1.454545	  |
1404+---------------+---------------+---------------+-----------------+
1405|12		|96		|0		|1.000000	  |
1406+---------------+---------------+---------------+-----------------+
1407|13		|104		|95		|1.230769	  |
1408+---------------+---------------+---------------+-----------------+
1409|14		|112		|95		|1.142857	  |
1410+---------------+---------------+---------------+-----------------+
1411|15		|120		|95		|1.066667	  |
1412+---------------+---------------+---------------+-----------------+
1413|16		|128		|0		|1.000000	  |
1414+---------------+---------------+---------------+-----------------+
1415|17		|136		|127		|1.254863	  |
1416+---------------+---------------+---------------+-----------------+
1417|18		|144		|127		|1.185255	  |
1418+---------------+---------------+---------------+-----------------+
1419|19		|152		|0		|1.000000	  |
1420+---------------+---------------+---------------+-----------------+
1421|20		|160		|127		|1.066667	  |
1422+---------------+---------------+---------------+-----------------+
1423|21		|168		|0		|1.000000	  |
1424+---------------+---------------+---------------+-----------------+
1425|22		|176		|159		|1.454334	  |
1426+---------------+---------------+---------------+-----------------+
1427|23		|184		|0		|1.000000	  |
1428+---------------+---------------+---------------+-----------------+
1429|24		|192		|127		|0.969744	  |
1430+---------------+---------------+---------------+-----------------+
1431|25		|200		|191		|1.280246	  |
1432+---------------+---------------+---------------+-----------------+
1433|26		|208		|191		|1.230921	  |
1434+---------------+---------------+---------------+-----------------+
1435|27		|216		|0		|1.000000	  |
1436+---------------+---------------+---------------+-----------------+
1437|28		|224		|191		|1.143118	  |
1438+---------------+---------------+---------------+-----------------+
1439
1440If rmid > rmid threshold, MBM total and local values should be multiplied
1441by the correction factor.
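
As a worked example (using hypothetical readings), consider a part with 14
cores per socket: the rmid count is 112 and the rmid threshold is 95. A raw
MBM reading of 1000000 bytes obtained with RMID 100 would be corrected to
1000000 * 1.142857 = 1142857 bytes, while the same reading obtained with
RMID 50 would be used unchanged.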
1442
1443See:
1444
14451. Erratum SKX99 in Intel Xeon Processor Scalable Family Specification Update:
1446http://web.archive.org/web/20200716124958/https://www.intel.com/content/www/us/en/processors/xeon/scalable/xeon-scalable-spec-update.html
1447
14482. Erratum BDF102 in Intel Xeon E5-2600 v4 Processor Product Family Specification Update:
1449http://web.archive.org/web/20191125200531/https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-e5-v4-spec-update.pdf
1450
14513. The errata in Intel Resource Director Technology (Intel RDT) on 2nd Generation Intel Xeon Scalable Processors Reference Manual:
1452https://software.intel.com/content/www/us/en/develop/articles/intel-resource-director-technology-rdt-reference-manual.html
1453
1454for further information.
1455