xref: /linux/Documentation/admin-guide/blockdev/zram.rst (revision 4eac932103a5d8c3a1bdf6776dbc1a178c31d896)
1========================================
2zram: Compressed RAM-based block devices
3========================================
4
5Introduction
6============
7
8The zram module creates RAM-based block devices named /dev/zram<id>
9(<id> = 0, 1, ...). Pages written to these disks are compressed and stored
10in memory itself. These disks allow very fast I/O and compression provides
11good amounts of memory savings. Some of the use cases include /tmp storage,
12use as swap disks, various caches under /var and maybe many more. :)
13
14Statistics for individual zram devices are exported through sysfs nodes at
15/sys/block/zram<id>/
16
17Usage
18=====
19
20There are several ways to configure and manage zram device(-s):
21
22a) using zram and zram_control sysfs attributes
23b) using zramctl utility, provided by util-linux (util-linux@vger.kernel.org).
24
25In this document we will describe only 'manual' zram configuration steps,
26IOW, zram and zram_control sysfs attributes.
27
28In order to get a better idea about zramctl please consult util-linux
29documentation, zramctl man-page or `zramctl --help`. Please be informed
30that zram maintainers do not develop/maintain util-linux or zramctl, should
31you have any questions please contact util-linux@vger.kernel.org
32
33Following shows a typical sequence of steps for using zram.
34
35WARNING
36=======
37
38For the sake of simplicity we skip error checking parts in most of the
39examples below. However, it is your sole responsibility to handle errors.
40
41zram sysfs attributes always return negative values in case of errors.
42The list of possible return codes:
43
44========  =============================================================
45-EBUSY	  an attempt to modify an attribute that cannot be changed once
46	  the device has been initialised. Please reset device first.
47-ENOMEM	  zram was not able to allocate enough memory to fulfil your
48	  needs.
49-EINVAL	  invalid input has been provided.
50========  =============================================================
51
52If you use 'echo', the returned value is set by the 'echo' utility,
53and, in general case, something like::
54
55	echo 3 > /sys/block/zram0/max_comp_streams
56	if [ $? -ne 0 ]; then
57		handle_error
58	fi
59
60should suffice.
61
621) Load Module
63==============
64
65::
66
67	modprobe zram num_devices=4
68
69This creates 4 devices: /dev/zram{0,1,2,3}
70
71num_devices parameter is optional and tells zram how many devices should be
72pre-created. Default: 1.
73
742) Set max number of compression streams
75========================================
76
77Regardless of the value passed to this attribute, ZRAM will always
78allocate multiple compression streams - one per online CPU - thus
79allowing several concurrent compression operations. The number of
80allocated compression streams goes down when some of the CPUs
81become offline. There is no single-compression-stream mode anymore,
82unless you are running a UP system or have only 1 CPU online.
83
84To find out how many streams are currently available::
85
86	cat /sys/block/zram0/max_comp_streams
87
883) Select compression algorithm
89===============================
90
91Using comp_algorithm device attribute one can see available and
92currently selected (shown in square brackets) compression algorithms,
93or change the selected compression algorithm (once the device is initialised
94there is no way to change compression algorithm).
95
96Examples::
97
98	#show supported compression algorithms
99	cat /sys/block/zram0/comp_algorithm
100	lzo [lz4]
101
102	#select lzo compression algorithm
103	echo lzo > /sys/block/zram0/comp_algorithm
104
105For the time being, the `comp_algorithm` content shows only compression
106algorithms that are supported by zram.
107
1084) Set Disksize
109===============
110
111Set disk size by writing the value to sysfs node 'disksize'.
112The value can be either in bytes or you can use mem suffixes.
113Examples::
114
115	# Initialize /dev/zram0 with 50MB disksize
116	echo $((50*1024*1024)) > /sys/block/zram0/disksize
117
118	# Using mem suffixes
119	echo 256K > /sys/block/zram0/disksize
120	echo 512M > /sys/block/zram0/disksize
121	echo 1G > /sys/block/zram0/disksize
122
123Note:
124There is little point creating a zram of greater than twice the size of memory
125since we expect a 2:1 compression ratio. Note that zram uses about 0.1% of the
126size of the disk when not in use so a huge zram is wasteful.
127
1285) Set memory limit: Optional
129=============================
130
131Set memory limit by writing the value to sysfs node 'mem_limit'.
132The value can be either in bytes or you can use mem suffixes.
133In addition, you could change the value in runtime.
134Examples::
135
136	# limit /dev/zram0 with 50MB memory
137	echo $((50*1024*1024)) > /sys/block/zram0/mem_limit
138
139	# Using mem suffixes
140	echo 256K > /sys/block/zram0/mem_limit
141	echo 512M > /sys/block/zram0/mem_limit
142	echo 1G > /sys/block/zram0/mem_limit
143
144	# To disable memory limit
145	echo 0 > /sys/block/zram0/mem_limit
146
1476) Activate
148===========
149
150::
151
152	mkswap /dev/zram0
153	swapon /dev/zram0
154
155	mkfs.ext4 /dev/zram1
156	mount /dev/zram1 /tmp
157
1587) Add/remove zram devices
159==========================
160
161zram provides a control interface, which enables dynamic (on-demand) device
162addition and removal.
163
164In order to add a new /dev/zramX device, perform a read operation on the hot_add
165attribute. This will return either the new device's device id (meaning that you
166can use /dev/zram<id>) or an error code.
167
168Example::
169
170	cat /sys/class/zram-control/hot_add
171	1
172
173To remove the existing /dev/zramX device (where X is a device id)
174execute::
175
176	echo X > /sys/class/zram-control/hot_remove
177
1788) Stats
179========
180
181Per-device statistics are exported as various nodes under /sys/block/zram<id>/
182
183A brief description of exported device attributes follows. For more details
184please read Documentation/ABI/testing/sysfs-block-zram.
185
186======================  ======  ===============================================
187Name            	access            description
188======================  ======  ===============================================
189disksize          	RW	show and set the device's disk size
190initstate         	RO	shows the initialization state of the device
191reset             	WO	trigger device reset
192mem_used_max      	WO	reset the `mem_used_max` counter (see later)
193mem_limit         	WO	specifies the maximum amount of memory ZRAM can
194				use to store the compressed data
195writeback_limit   	WO	specifies the maximum amount of write IO zram
196				can write out to backing device as 4KB unit
197writeback_limit_enable  RW	show and set writeback_limit feature
198max_comp_streams  	RW	the number of possible concurrent compress
199				operations
200comp_algorithm    	RW	show and change the compression algorithm
201algorithm_params	WO	setup compression algorithm parameters
202compact           	WO	trigger memory compaction
203debug_stat        	RO	this file is used for zram debugging purposes
204backing_dev	  	RW	set up backend storage for zram to write out
205idle		  	WO	mark allocated slot as idle
206======================  ======  ===============================================
207
208
209User space is advised to use the following files to read the device statistics.
210
211File /sys/block/zram<id>/stat
212
213Represents block layer statistics. Read Documentation/block/stat.rst for
214details.
215
216File /sys/block/zram<id>/io_stat
217
218The stat file represents device's I/O statistics not accounted by block
219layer and, thus, not available in zram<id>/stat file. It consists of a
220single line of text and contains the following stats separated by
221whitespace:
222
223 =============    =============================================================
224 failed_reads     The number of failed reads
225 failed_writes    The number of failed writes
226 invalid_io       The number of non-page-size-aligned I/O requests
227 notify_free      Depending on device usage scenario it may account
228
229                  a) the number of pages freed because of swap slot free
230                     notifications
231                  b) the number of pages freed because of
232                     REQ_OP_DISCARD requests sent by bio. The former ones are
233                     sent to a swap block device when a swap slot is freed,
234                     which implies that this disk is being used as a swap disk.
235
236                  The latter ones are sent by filesystem mounted with
237                  discard option, whenever some data blocks are getting
238                  discarded.
239 =============    =============================================================
240
241File /sys/block/zram<id>/mm_stat
242
243The mm_stat file represents the device's mm statistics. It consists of a single
244line of text and contains the following stats separated by whitespace:
245
246 ================ =============================================================
247 orig_data_size   uncompressed size of data stored in this disk.
248                  Unit: bytes
249 compr_data_size  compressed size of data stored in this disk
250 mem_used_total   the amount of memory allocated for this disk. This
251                  includes allocator fragmentation and metadata overhead,
252                  allocated for this disk. So, allocator space efficiency
253                  can be calculated using compr_data_size and this statistic.
254                  Unit: bytes
255 mem_limit        the maximum amount of memory ZRAM can use to store
256                  the compressed data
257 mem_used_max     the maximum amount of memory zram has consumed to
258                  store the data
259 same_pages       the number of same element filled pages written to this disk.
260                  No memory is allocated for such pages.
261 pages_compacted  the number of pages freed during compaction
262 huge_pages	  the number of incompressible pages
263 huge_pages_since the number of incompressible pages since zram set up
264 ================ =============================================================
265
266File /sys/block/zram<id>/bd_stat
267
268The bd_stat file represents a device's backing device statistics. It consists of
269a single line of text and contains the following stats separated by whitespace:
270
271 ============== =============================================================
272 bd_count	size of data written in backing device.
273		Unit: 4K bytes
274 bd_reads	the number of reads from backing device
275		Unit: 4K bytes
276 bd_writes	the number of writes to backing device
277		Unit: 4K bytes
278 ============== =============================================================
279
2809) Deactivate
281=============
282
283::
284
285	swapoff /dev/zram0
286	umount /dev/zram1
287
28810) Reset
289=========
290
291	Write any positive value to 'reset' sysfs node::
292
293		echo 1 > /sys/block/zram0/reset
294		echo 1 > /sys/block/zram1/reset
295
296	This frees all the memory allocated for the given device and
297	resets the disksize to zero. You must set the disksize again
298	before reusing the device.
299
300Optional Feature
301================
302
303writeback
304---------
305
306With CONFIG_ZRAM_WRITEBACK, zram can write idle/incompressible page
307to backing storage rather than keeping it in memory.
308To use the feature, admin should set up backing device via::
309
310	echo /dev/sda5 > /sys/block/zramX/backing_dev
311
312before disksize setting. It supports only partitions at this moment.
313If admin wants to use incompressible page writeback, they could do it via::
314
315	echo huge > /sys/block/zramX/writeback
316
317To use idle page writeback, first, user need to declare zram pages
318as idle::
319
320	echo all > /sys/block/zramX/idle
321
322From now on, any pages on zram are idle pages. The idle mark
323will be removed until someone requests access of the block.
324IOW, unless there is access request, those pages are still idle pages.
325Additionally, when CONFIG_ZRAM_TRACK_ENTRY_ACTIME is enabled pages can be
326marked as idle based on how long (in seconds) it's been since they were
327last accessed::
328
329        echo 86400 > /sys/block/zramX/idle
330
331In this example all pages which haven't been accessed in more than 86400
332seconds (one day) will be marked idle.
333
334Admin can request writeback of those idle pages at right timing via::
335
336	echo idle > /sys/block/zramX/writeback
337
338With the command, zram will writeback idle pages from memory to the storage.
339
340Additionally, if a user choose to writeback only huge and idle pages
341this can be accomplished with::
342
343        echo huge_idle > /sys/block/zramX/writeback
344
345If a user chooses to writeback only incompressible pages (pages that none of
346algorithms can compress) this can be accomplished with::
347
348	echo incompressible > /sys/block/zramX/writeback
349
350If an admin wants to write a specific page in zram device to the backing device,
351they could write a page index into the interface::
352
353	echo "page_index=1251" > /sys/block/zramX/writeback
354
355If there are lots of write IO with flash device, potentially, it has
356flash wearout problem so that admin needs to design write limitation
357to guarantee storage health for entire product life.
358
359To overcome the concern, zram supports "writeback_limit" feature.
360The "writeback_limit_enable"'s default value is 0 so that it doesn't limit
361any writeback. IOW, if admin wants to apply writeback budget, they should
362enable writeback_limit_enable via::
363
364	$ echo 1 > /sys/block/zramX/writeback_limit_enable
365
366Once writeback_limit_enable is set, zram doesn't allow any writeback
367until admin sets the budget via /sys/block/zramX/writeback_limit.
368
369(If admin doesn't enable writeback_limit_enable, writeback_limit's value
370assigned via /sys/block/zramX/writeback_limit is meaningless.)
371
372If admin wants to limit writeback as per-day 400M, they could do it
373like below::
374
375	$ MB_SHIFT=20
376	$ 4K_SHIFT=12
377	$ echo $((400<<MB_SHIFT>>4K_SHIFT)) > \
378		/sys/block/zram0/writeback_limit.
379	$ echo 1 > /sys/block/zram0/writeback_limit_enable
380
381If admins want to allow further write again once the budget is exhausted,
382they could do it like below::
383
384	$ echo $((400<<MB_SHIFT>>4K_SHIFT)) > \
385		/sys/block/zram0/writeback_limit
386
387If an admin wants to see the remaining writeback budget since last set::
388
389	$ cat /sys/block/zramX/writeback_limit
390
391If an admin wants to disable writeback limit, they could do::
392
393	$ echo 0 > /sys/block/zramX/writeback_limit_enable
394
395The writeback_limit count will reset whenever you reset zram (e.g.,
396system reboot, echo 1 > /sys/block/zramX/reset) so keeping how many of
397writeback happened until you reset the zram to allocate extra writeback
398budget in next setting is user's job.
399
400If admin wants to measure writeback count in a certain period, they could
401know it via /sys/block/zram0/bd_stat's 3rd column.
402
403recompression
404-------------
405
406With CONFIG_ZRAM_MULTI_COMP, zram can recompress pages using alternative
407(secondary) compression algorithms. The basic idea is that alternative
408compression algorithm can provide better compression ratio at a price of
409(potentially) slower compression/decompression speeds. Alternative compression
410algorithm can, for example, be more successful compressing huge pages (those
411that default algorithm failed to compress). Another application is idle pages
412recompression - pages that are cold and sit in the memory can be recompressed
413using more effective algorithm and, hence, reduce zsmalloc memory usage.
414
415With CONFIG_ZRAM_MULTI_COMP, zram supports up to 4 compression algorithms:
416one primary and up to 3 secondary ones. Primary zram compressor is explained
417in "3) Select compression algorithm", secondary algorithms are configured
418using recomp_algorithm device attribute.
419
420Example:::
421
422	#show supported recompression algorithms
423	cat /sys/block/zramX/recomp_algorithm
424	#1: lzo lzo-rle lz4 lz4hc [zstd]
425	#2: lzo lzo-rle lz4 [lz4hc] zstd
426
427Alternative compression algorithms are sorted by priority. In the example
428above, zstd is used as the first alternative algorithm, which has priority
429of 1, while lz4hc is configured as a compression algorithm with priority 2.
430Alternative compression algorithm's priority is provided during algorithms
431configuration:::
432
433	#select zstd recompression algorithm, priority 1
434	echo "algo=zstd priority=1" > /sys/block/zramX/recomp_algorithm
435
436	#select deflate recompression algorithm, priority 2
437	echo "algo=deflate priority=2" > /sys/block/zramX/recomp_algorithm
438
439Another device attribute that CONFIG_ZRAM_MULTI_COMP enables is recompress,
440which controls recompression.
441
442Examples:::
443
444	#IDLE pages recompression is activated by `idle` mode
445	echo "type=idle" > /sys/block/zramX/recompress
446
447	#HUGE pages recompression is activated by `huge` mode
448	echo "type=huge" > /sys/block/zram0/recompress
449
450	#HUGE_IDLE pages recompression is activated by `huge_idle` mode
451	echo "type=huge_idle" > /sys/block/zramX/recompress
452
453The number of idle pages can be significant, so user-space can pass a size
454threshold (in bytes) to the recompress knob: zram will recompress only pages
455of equal or greater size:::
456
457	#recompress all pages larger than 3000 bytes
458	echo "threshold=3000" > /sys/block/zramX/recompress
459
460	#recompress idle pages larger than 2000 bytes
461	echo "type=idle threshold=2000" > /sys/block/zramX/recompress
462
463It is also possible to limit the number of pages zram re-compression will
464attempt to recompress:::
465
466	echo "type=huge_idle max_pages=42" > /sys/block/zramX/recompress
467
468Recompression of idle pages requires memory tracking.
469
470During re-compression for every page, that matches re-compression criteria,
471ZRAM iterates the list of registered alternative compression algorithms in
472order of their priorities. ZRAM stops either when re-compression was
473successful (re-compressed object is smaller in size than the original one)
474and matches re-compression criteria (e.g. size threshold) or when there are
475no secondary algorithms left to try. If none of the secondary algorithms can
476successfully re-compressed the page such a page is marked as incompressible,
477so ZRAM will not attempt to re-compress it in the future.
478
479This re-compression behaviour, when it iterates through the list of
480registered compression algorithms, increases our chances of finding the
481algorithm that successfully compresses a particular page. Sometimes, however,
482it is convenient (and sometimes even necessary) to limit recompression to
483only one particular algorithm so that it will not try any other algorithms.
484This can be achieved by providing a algo=NAME parameter:::
485
486	#use zstd algorithm only (if registered)
487	echo "type=huge algo=zstd" > /sys/block/zramX/recompress
488
489memory tracking
490===============
491
492With CONFIG_ZRAM_MEMORY_TRACKING, user can know information of the
493zram block. It could be useful to catch cold or incompressible
494pages of the process with*pagemap.
495
496If you enable the feature, you could see block state via
497/sys/kernel/debug/zram/zram0/block_state". The output is as follows::
498
499	  300    75.033841 .wh...
500	  301    63.806904 s.....
501	  302    63.806919 ..hi..
502	  303    62.801919 ....r.
503	  304   146.781902 ..hi.n
504
505First column
506	zram's block index.
507Second column
508	access time since the system was booted
509Third column
510	state of the block:
511
512	s:
513		same page
514	w:
515		written page to backing store
516	h:
517		huge page
518	i:
519		idle page
520	r:
521		recompressed page (secondary compression algorithm)
522	n:
523		none (including secondary) of algorithms could compress it
524
525First line of above example says 300th block is accessed at 75.033841sec
526and the block's state is huge so it is written back to the backing
527storage. It's a debugging feature so anyone shouldn't rely on it to work
528properly.
529
530Nitin Gupta
531ngupta@vflare.org
532