=====================================================
Memory Resource Controller(Memcg) Implementation Memo
=====================================================

Last Updated: 2010/2

Base Kernel Version: based on 2.6.33-rc7-mm (candidate for 34).

Because the VM is getting complex (one of the reasons is memcg...), memcg's
behavior is complex too. This document describes memcg's internal behavior.
Please note that implementation details can change.

(*) Topics on the API should be in Documentation/admin-guide/cgroup-v1/memory.rst.

0. How to record usage?
=======================

   Two objects are used.

   page_cgroup ... an object per page.

	Allocated at boot or memory hotplug. Freed at memory hot removal.

   swap_cgroup ... an entry per swp_entry.

	Allocated at swapon(). Freed at swapoff().

   The page_cgroup has a USED bit, so double counting against a page_cgroup
   never occurs. swap_cgroup is used only when a charged page is swapped out.

1. Charge
=========

   A page/swp_entry may be charged (usage += PAGE_SIZE) at

	mem_cgroup_try_charge()

2. Uncharge
===========

  A page/swp_entry may be uncharged (usage -= PAGE_SIZE) by

	mem_cgroup_uncharge()
	  Called when a page's refcount goes down to 0.

	mem_cgroup_uncharge_swap()
	  Called when a swp_entry's refcnt goes down to 0. The charge against
	  swap disappears.

3. charge-commit
================

	Memcg pages are charged in two steps:

		- mem_cgroup_try_charge()
		- commit_charge()

	At try_charge(), there are no flags to say "this page is charged".
	At this point, usage += PAGE_SIZE.

	At commit(), the page is associated with the memcg.

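	The two steps can be pictured with a toy model in shell arithmetic.
	This is only a sketch, not kernel code: ``usage`` and ``owner`` are
	hypothetical variables standing in for the memcg's counter and the
	page's association, and PAGE_SIZE is assumed to be 4096.

```shell
#!/bin/sh
# Toy model of the two-step charge. Nothing here touches the kernel.
PAGE_SIZE=4096   # assumed page size
usage=0          # bytes accounted to the toy memcg
owner=""         # memcg the toy page is associated with ("" = none)

# Step 1: account the page; no "charged" marker exists yet.
try_charge()
{
	usage=$((usage + PAGE_SIZE))
}

# Step 2: associate the page with the memcg named in $1.
commit()
{
	owner=$1
}

# Reverse both steps, as happens when the refcount drops to 0.
uncharge()
{
	usage=$((usage - PAGE_SIZE))
	owner=""
}
```

	After try_charge() alone, usage has grown but owner is still empty.
	Only commit() ties the page to a memcg.
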
The explanations below assume CONFIG_SWAP=y.

4. Anonymous
============

	An anonymous page is newly allocated at
		  - a page fault into a MAP_ANONYMOUS mapping.
		  - Copy-On-Write.

	4.1 Swap-in.
	At swap-in, the page is taken from the swap cache. There are 2 cases.

	(a) If the SwapCache is newly allocated and read, it has no charges.
	(b) If the SwapCache has been mapped by processes, it has been
	    charged already.

	4.2 Swap-out.
	At swap-out, the typical state transition is as follows.

	(a) add to swap cache. (marked as SwapCache)
	    swp_entry's refcnt += 1.
	(b) fully unmapped.
	    swp_entry's refcnt += # of ptes.
	(c) write back to swap.
	(d) delete from swap cache. (remove from SwapCache)
	    swp_entry's refcnt -= 1.


	Finally, at task exit,
	(e) zap_pte() is called and swp_entry's refcnt -=1 -> 0.

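	The refcnt arithmetic in steps (a) to (e) can be replayed with a
	small shell function. This is a toy walk-through, not kernel code.
	``swap_refcnt_walk`` is a hypothetical name, its argument is the
	assumed number of ptes that mapped the page, and step (e) assumes
	each of those ptes is zapped exactly once.

```shell
#!/bin/sh
# Replay the swp_entry refcnt changes for a page mapped by $1 ptes.
swap_refcnt_walk()
{
	nptes=$1
	refcnt=0
	refcnt=$((refcnt + 1));     echo "(a) add to swap cache: $refcnt"
	refcnt=$((refcnt + nptes)); echo "(b) fully unmapped: $refcnt"
	echo "(c) write back to swap: $refcnt"
	refcnt=$((refcnt - 1));     echo "(d) delete from swap cache: $refcnt"
	# (e) one decrement per zapped pte at task exit
	i=0
	while [ "$i" -lt "$nptes" ]
	do
		refcnt=$((refcnt - 1))
		i=$((i + 1))
	done
	echo "(e) after zap_pte() for each pte: $refcnt"
}
```

	With a single pte (``swap_refcnt_walk 1``) the count goes 1, 2, 2, 1
	and finally 0, which is the point where the swap charge is dropped.
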
5. Page Cache
=============

	Page Cache is charged at
	- filemap_add_folio().

	The logic is very clear. (About migration, see below)

	Note:
	  __filemap_remove_folio() is called by filemap_remove_folio()
	  and __remove_mapping().

6. Shmem(tmpfs) Page Cache
===========================

	The best way to understand shmem's page state transitions is to read
	mm/shmem.c.

	But a brief explanation of memcg's behavior around shmem will help
	in understanding the logic.

	A shmem page (just a leaf page, not a direct/indirect block) can be on

		- the radix-tree of shmem's inode.
		- SwapCache.
		- both the radix-tree and SwapCache. This happens at swap-in
		  and swap-out.

	It's charged when...

	- A new page is added to shmem's radix-tree.
	- A swap page is read. (move a charge from swap_cgroup to page_cgroup)

7. Page Migration
=================

	mem_cgroup_migrate()

8. LRU
======

	Each memcg has its own vector of LRUs (inactive anon, active anon,
	inactive file, active file, unevictable) of pages from each node,
	each LRU handled under a single lru_lock for that memcg and node.

9. Typical Tests.
=================

 Tests for racy cases.

9.1 Small limit to memcg.
-------------------------

	When testing racy cases, it's a good idea to set memcg's limit
	very small rather than in the GB range. Many races have been found
	in tests under xKB or xxMB limits.

	(Memory behavior under a GB limit and under an MB limit is very
	different.)

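	One way to apply this advice is to rerun the same job under several
	small limits. The following is a minimal sketch: ``sweep_limits`` is
	a hypothetical helper, the three limits are arbitrary examples, and
	the first argument is assumed to be an existing v1 memory cgroup
	directory.

```shell
#!/bin/sh
# Run a command once per small limit inside the memory cgroup $1.
# The remaining arguments are the command to run.
sweep_limits()
{
	group=$1
	shift
	# Attach this shell so that the command's children are charged here.
	echo $$ > "$group/tasks" || return 1
	for limit in 512K 4M 16M
	do
		# The write can fail if usage cannot be reclaimed below the limit.
		echo "$limit" > "$group/memory.limit_in_bytes" || return 1
		echo "limit=$limit"
		"$@"
	done
}
```

	For example, ``sweep_limits /cgroup/test ./my_test`` would run a
	(hypothetical) ./my_test three times, each under a tighter or looser
	small limit.
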
9.2 Shmem
---------

	Historically, memcg's shmem handling was poor and we saw a fair amount
	of trouble here. This is because shmem is page cache but can also be
	SwapCache. Testing with shmem/tmpfs is always a good test.

9.3 Migration
-------------

	For NUMA, migration is another special case. For an easy test, cpuset
	is useful. The following is a sample script to set up migration::

		mount -t cgroup -o cpuset none /opt/cpuset

		mkdir /opt/cpuset/01
		echo 1 > /opt/cpuset/01/cpuset.cpus
		echo 0 > /opt/cpuset/01/cpuset.mems
		echo 1 > /opt/cpuset/01/cpuset.memory_migrate
		mkdir /opt/cpuset/02
		echo 1 > /opt/cpuset/02/cpuset.cpus
		echo 1 > /opt/cpuset/02/cpuset.mems
		echo 1 > /opt/cpuset/02/cpuset.memory_migrate

	With the above setup, when you move a task from 01 to 02, page
	migration from node 0 to node 1 will occur. The following script
	migrates all tasks under a cpuset::

		# G1 and G2 are the source and destination cpuset directories,
		# e.g. the 01 and 02 groups created above.
		G1=/opt/cpuset/01
		G2=/opt/cpuset/02

		# Move each pid listed in $1 into the cpuset directory $2.
		move_task()
		{
		for pid in $1
		do
			/bin/echo "$pid" >"$2/tasks" 2>/dev/null
			printf '%s ' "$pid"
		done
		echo END
		}

		G1_TASK=$(cat "${G1}/tasks")
		move_task "${G1_TASK}" "${G2}" &

9.4 Memory hotplug
------------------

	A memory hotplug test is also a good test.

	To offline memory, do the following::

		# echo offline > /sys/devices/system/memory/memoryXXX/state

	(XXX is the number of the memory block)

	This is an easy way to test page migration, too.

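	To pick a block and try to offline it, something like the following
	may help. It is a sketch under assumptions: ``MEMROOT``,
	``list_blocks`` and ``offline_block`` are hypothetical names, and
	MEMROOT only exists so the helpers can be pointed at a scratch
	directory for a dry run.

```shell
#!/bin/sh
# Defaults to the real sysfs directory; override for a dry run.
MEMROOT=${MEMROOT:-/sys/devices/system/memory}

# Print "<block> <state>" for every memory block found under MEMROOT.
list_blocks()
{
	for f in "$MEMROOT"/memory*/state
	do
		[ -f "$f" ] || continue
		block=${f%/state}
		echo "${block##*/} $(cat "$f")"
	done
}

# Try to offline the block named $1 (e.g. memory32). The kernel may
# refuse when pages cannot be migrated away, so report it and carry on.
offline_block()
{
	if echo offline > "$MEMROOT/$1/state" 2>/dev/null
	then
		echo "$1 offlined"
	else
		echo "$1 refused"
	fi
}
```

	Running list_blocks before and after offline_block shows whether
	the state really changed.
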
9.5 Nested cgroups
------------------

	Use tests like the following for testing nested cgroups::

		mkdir /opt/cgroup/01/child_a
		mkdir /opt/cgroup/01/child_b

		# set a limit on 01
		# add a limit on 01/child_b
		# run jobs under child_a and child_b

	Create/delete the following groups at random while jobs are running::

		/opt/cgroup/01/child_a/child_aa
		/opt/cgroup/01/child_b/child_bb
		/opt/cgroup/01/child_c

	Running new jobs in a new group is also good.

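	The random create/delete step could be scripted roughly as follows.
	This is a sketch only: ``TOP``, ``rand_below``, ``toggle_group`` and
	``churn`` are hypothetical names, and TOP is assumed to be the 01
	group with child_a and child_b already created.

```shell
#!/bin/sh
# Parent group from the text above; override to rehearse in a scratch dir.
TOP=${TOP:-/opt/cgroup/01}

# Print a pseudo-random integer in [0, $1), read from /dev/urandom.
rand_below()
{
	n=$(od -An -N2 -tu2 /dev/urandom | tr -dc '0-9')
	echo $((n % $1))
}

# Toggle one of the three groups: create it if absent, remove it if
# present. rmdir fails harmlessly while the group still holds tasks.
toggle_group()
{
	case $1 in
	0) g=$TOP/child_a/child_aa ;;
	1) g=$TOP/child_b/child_bb ;;
	*) g=$TOP/child_c ;;
	esac
	if [ -d "$g" ]
	then
		rmdir "$g" 2>/dev/null || :
	else
		mkdir "$g" 2>/dev/null || :
	fi
}

# Perform $1 random toggles.
churn()
{
	i=0
	while [ "$i" -lt "$1" ]
	do
		toggle_group "$(rand_below 3)"
		i=$((i + 1))
	done
}
```

	Run ``churn 1000 &`` in the background while the jobs under child_a
	and child_b are active.
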
9.6 Mount with other subsystems
-------------------------------

	Mounting with other subsystems is a good test because there are
	races and lock dependencies with other cgroup subsystems.

	Example::

		# mount -t cgroup none /cgroup -o cpuset,memory,cpu,devices

	Then do task moves, mkdir, rmdir, etc. under this.

9.7 swapoff
-----------

	Besides swap management being one of the complicated parts of memcg,
	the call path of swap-in at swapoff is not the same as the usual
	swap-in path. It's worth testing explicitly.

	For example, a test like the following is good:

	(Shell-A)::

		# mount -t cgroup none /cgroup -o memory
		# mkdir /cgroup/test
		# echo 40M > /cgroup/test/memory.limit_in_bytes
		# echo 0 > /cgroup/test/tasks

	Run a malloc(100M) program under this. You'll see 60M of swap.

	(Shell-B)::

		# move all tasks in /cgroup/test to /cgroup
		# /sbin/swapoff -a
		# rmdir /cgroup/test
		# kill malloc task.

	Of course, a tmpfs vs. swapoff test should be done, too.

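	The 60M figure is simply the allocation size minus the limit. A
	minimal sketch for checking it is shown here. It assumes swap
	accounting is enabled, so that memory.stat carries a ``swap`` line,
	and the function names are made up for illustration. The real value
	will only be roughly 60M.

```shell
#!/bin/sh
# Print the "swap" counter (in bytes) from memory.stat of the group $1.
# Prints nothing if the line is absent (swap accounting disabled).
swap_bytes()
{
	awk '$1 == "swap" { print $2 }' "$1/memory.stat"
}

# Expected swap in MB for a task touching $1 MB under a limit of $2 MB.
expected_swap_mb()
{
	echo $(($1 - $2))
}
```

	For the test above, ``expected_swap_mb 100 40`` gives 60, which can
	be compared with ``swap_bytes /cgroup/test`` divided by 1048576.
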
9.8 OOM-Killer
--------------

	Out-of-memory caused by memcg's limit will kill tasks under
	the memcg. When hierarchy is used, a task under the hierarchy
	will be killed by the kernel.

	In this case, panic_on_oom shouldn't be invoked and tasks
	in other groups shouldn't be killed.

	It's not difficult to cause OOM under memcg, as follows.

	Case A) when you can swapoff::

		#swapoff -a
		#echo 50M > /memory.limit_in_bytes

	Run 51M of malloc.

	Case B) when you use mem+swap limitation::

		#echo 50M > memory.limit_in_bytes
		#echo 50M > memory.memsw.limit_in_bytes

	Run 51M of malloc.

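	To confirm that tasks in other groups survived the OOM, a small
	helper can probe a list of pids. This is a sketch: ``still_alive``
	is a hypothetical name, and the pids to check are whatever you
	started outside the limited group.

```shell
#!/bin/sh
# Print each pid from the argument list that is still alive.
# kill -0 sends no signal; it only checks that the pid can be signalled.
still_alive()
{
	for pid in "$@"
	do
		if kill -0 "$pid" 2>/dev/null
		then
			echo "$pid"
		fi
	done
}
```

	Record the pids of a few bystander tasks before triggering the OOM,
	then pass them to still_alive afterwards. Every pid should be
	printed again.
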
9.9 Move charges at task migration
----------------------------------

	Charges associated with a task can be moved along with task migration.

	(Shell-A)::

		#mkdir /cgroup/A
		#echo $$ >/cgroup/A/tasks

	Run some programs which use some amount of memory in /cgroup/A.

	(Shell-B)::

		#mkdir /cgroup/B
		#echo 1 >/cgroup/B/memory.move_charge_at_immigrate
		#echo "pid of the program running in group A" >/cgroup/B/tasks

	You can see that charges have been moved by reading ``*.usage_in_bytes``
	or memory.stat of both A and B.

	See 8.2 of Documentation/admin-guide/cgroup-v1/memory.rst to see what
	value should be written to move_charge_at_immigrate.

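	A small helper makes the before/after comparison easier to read.
	This is a sketch: ``CGROOT`` and ``show_usage`` are hypothetical
	names, and CGROOT is assumed to be the mount point holding groups
	A and B.

```shell
#!/bin/sh
# Mount point that holds groups A and B; override if yours differs.
CGROOT=${CGROOT:-/cgroup}

# Print the current memory usage of both groups on one line.
show_usage()
{
	a=$(cat "$CGROOT/A/memory.usage_in_bytes")
	b=$(cat "$CGROOT/B/memory.usage_in_bytes")
	echo "A=$a B=$b"
}
```

	Call show_usage once before and once after writing the pid to
	/cgroup/B/tasks. A's value should drop and B's should rise.
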
9.10 Memory thresholds
----------------------

	The memory controller implements memory thresholds using the cgroups
	notification API. You can use tools/cgroup/cgroup_event_listener.c to
	test it.

	(Shell-A) Create a cgroup and run the event listener::

		# mkdir /cgroup/A
		# ./cgroup_event_listener /cgroup/A/memory.usage_in_bytes 5M

	(Shell-B) Add a task to the cgroup and try to allocate and free memory::

		# echo $$ >/cgroup/A/tasks
		# a="$(dd if=/dev/zero bs=1M count=10)"
		# a=

	You will see a message from cgroup_event_listener every time you cross
	the threshold.

	Use /cgroup/A/memory.memsw.usage_in_bytes to test memsw thresholds.

	It's a good idea to test the root cgroup as well.