1======================================== 2zram: Compressed RAM-based block devices 3======================================== 4 5Introduction 6============ 7 8The zram module creates RAM-based block devices named /dev/zram<id> 9(<id> = 0, 1, ...). Pages written to these disks are compressed and stored 10in memory itself. These disks allow very fast I/O and compression provides 11good amounts of memory savings. Some of the use cases include /tmp storage, 12use as swap disks, various caches under /var and maybe many more. :) 13 14Statistics for individual zram devices are exported through sysfs nodes at 15/sys/block/zram<id>/ 16 17Usage 18===== 19 20There are several ways to configure and manage zram device(-s): 21 22a) using zram and zram_control sysfs attributes 23b) using zramctl utility, provided by util-linux (util-linux@vger.kernel.org). 24 25In this document we will describe only 'manual' zram configuration steps, 26IOW, zram and zram_control sysfs attributes. 27 28In order to get a better idea about zramctl please consult util-linux 29documentation, zramctl man-page or `zramctl --help`. Please be informed 30that zram maintainers do not develop/maintain util-linux or zramctl, should 31you have any questions please contact util-linux@vger.kernel.org 32 33Following shows a typical sequence of steps for using zram. 34 35WARNING 36======= 37 38For the sake of simplicity we skip error checking parts in most of the 39examples below. However, it is your sole responsibility to handle errors. 40 41zram sysfs attributes always return negative values in case of errors. 42The list of possible return codes: 43 44======== ============================================================= 45-EBUSY an attempt to modify an attribute that cannot be changed once 46 the device has been initialised. Please reset device first. 47-ENOMEM zram was not able to allocate enough memory to fulfil your 48 needs. 49-EINVAL invalid input has been provided. 50-EAGAIN re-try operation later (e.g. when attempting to run recompress 51 and writeback simultaneously). 52======== ============================================================= 53 54If you use 'echo', the returned value is set by the 'echo' utility, 55and, in general case, something like:: 56 57 echo 3 > /sys/block/zram0/max_comp_streams 58 if [ $? -ne 0 ]; then 59 handle_error 60 fi 61 62should suffice. 63 641) Load Module 65============== 66 67:: 68 69 modprobe zram num_devices=4 70 71This creates 4 devices: /dev/zram{0,1,2,3} 72 73num_devices parameter is optional and tells zram how many devices should be 74pre-created. Default: 1. 75 762) Set max number of compression streams 77======================================== 78 79Regardless of the value passed to this attribute, ZRAM will always 80allocate multiple compression streams - one per online CPU - thus 81allowing several concurrent compression operations. The number of 82allocated compression streams goes down when some of the CPUs 83become offline. There is no single-compression-stream mode anymore, 84unless you are running a UP system or have only 1 CPU online. 85 86To find out how many streams are currently available:: 87 88 cat /sys/block/zram0/max_comp_streams 89 903) Select compression algorithm 91=============================== 92 93Using comp_algorithm device attribute one can see available and 94currently selected (shown in square brackets) compression algorithms, 95or change the selected compression algorithm (once the device is initialised 96there is no way to change compression algorithm). 97 98Examples:: 99 100 #show supported compression algorithms 101 cat /sys/block/zram0/comp_algorithm 102 lzo [lz4] 103 104 #select lzo compression algorithm 105 echo lzo > /sys/block/zram0/comp_algorithm 106 107For the time being, the `comp_algorithm` content shows only compression 108algorithms that are supported by zram. 109 1104) Set compression algorithm parameters: Optional 111================================================= 112 113Compression algorithms may support specific parameters which can be 114tweaked for particular dataset. ZRAM has an `algorithm_params` device 115attribute which provides a per-algorithm params configuration. 116 117For example, several compression algorithms support `level` parameter. 118In addition, certain compression algorithms support pre-trained dictionaries, 119which significantly change algorithms' characteristics. In order to configure 120compression algorithm to use external pre-trained dictionary, pass full 121path to the `dict` along with other parameters:: 122 123 #pass path to pre-trained zstd dictionary 124 echo "algo=zstd dict=/etc/dictioary" > /sys/block/zram0/algorithm_params 125 126 #same, but using algorithm priority 127 echo "priority=1 dict=/etc/dictioary" > \ 128 /sys/block/zram0/algorithm_params 129 130 #pass path to pre-trained zstd dictionary and compression level 131 echo "algo=zstd level=8 dict=/etc/dictioary" > \ 132 /sys/block/zram0/algorithm_params 133 134Parameters are algorithm specific: not all algorithms support pre-trained 135dictionaries, not all algorithms support `level`. Furthermore, for certain 136algorithms `level` controls the compression level (the higher the value the 137better the compression ratio, it even can take negatives values for some 138algorithms), for other algorithms `level` is acceleration level (the higher 139the value the lower the compression ratio). 140 1415) Set Disksize 142=============== 143 144Set disk size by writing the value to sysfs node 'disksize'. 145The value can be either in bytes or you can use mem suffixes. 146Examples:: 147 148 # Initialize /dev/zram0 with 50MB disksize 149 echo $((50*1024*1024)) > /sys/block/zram0/disksize 150 151 # Using mem suffixes 152 echo 256K > /sys/block/zram0/disksize 153 echo 512M > /sys/block/zram0/disksize 154 echo 1G > /sys/block/zram0/disksize 155 156Note: 157There is little point creating a zram of greater than twice the size of memory 158since we expect a 2:1 compression ratio. Note that zram uses about 0.1% of the 159size of the disk when not in use so a huge zram is wasteful. 160 1616) Set memory limit: Optional 162============================= 163 164Set memory limit by writing the value to sysfs node 'mem_limit'. 165The value can be either in bytes or you can use mem suffixes. 166In addition, you could change the value in runtime. 167Examples:: 168 169 # limit /dev/zram0 with 50MB memory 170 echo $((50*1024*1024)) > /sys/block/zram0/mem_limit 171 172 # Using mem suffixes 173 echo 256K > /sys/block/zram0/mem_limit 174 echo 512M > /sys/block/zram0/mem_limit 175 echo 1G > /sys/block/zram0/mem_limit 176 177 # To disable memory limit 178 echo 0 > /sys/block/zram0/mem_limit 179 1807) Activate 181=========== 182 183:: 184 185 mkswap /dev/zram0 186 swapon /dev/zram0 187 188 mkfs.ext4 /dev/zram1 189 mount /dev/zram1 /tmp 190 1918) Add/remove zram devices 192========================== 193 194zram provides a control interface, which enables dynamic (on-demand) device 195addition and removal. 196 197In order to add a new /dev/zramX device, perform a read operation on the hot_add 198attribute. This will return either the new device's device id (meaning that you 199can use /dev/zram<id>) or an error code. 200 201Example:: 202 203 cat /sys/class/zram-control/hot_add 204 1 205 206To remove the existing /dev/zramX device (where X is a device id) 207execute:: 208 209 echo X > /sys/class/zram-control/hot_remove 210 2119) Stats 212======== 213 214Per-device statistics are exported as various nodes under /sys/block/zram<id>/ 215 216A brief description of exported device attributes follows. For more details 217please read Documentation/ABI/testing/sysfs-block-zram. 218 219====================== ====== =============================================== 220Name access description 221====================== ====== =============================================== 222disksize RW show and set the device's disk size 223initstate RO shows the initialization state of the device 224reset WO trigger device reset 225mem_used_max WO reset the `mem_used_max` counter (see later) 226mem_limit WO specifies the maximum amount of memory ZRAM can 227 use to store the compressed data 228writeback_limit WO specifies the maximum amount of write IO zram 229 can write out to backing device as 4KB unit 230writeback_limit_enable RW show and set writeback_limit feature 231max_comp_streams RW the number of possible concurrent compress 232 operations 233comp_algorithm RW show and change the compression algorithm 234algorithm_params WO setup compression algorithm parameters 235compact WO trigger memory compaction 236debug_stat RO this file is used for zram debugging purposes 237backing_dev RW set up backend storage for zram to write out 238idle WO mark allocated slot as idle 239====================== ====== =============================================== 240 241 242User space is advised to use the following files to read the device statistics. 243 244File /sys/block/zram<id>/stat 245 246Represents block layer statistics. Read Documentation/block/stat.rst for 247details. 248 249File /sys/block/zram<id>/io_stat 250 251The stat file represents device's I/O statistics not accounted by block 252layer and, thus, not available in zram<id>/stat file. It consists of a 253single line of text and contains the following stats separated by 254whitespace: 255 256 ============= ============================================================= 257 failed_reads The number of failed reads 258 failed_writes The number of failed writes 259 invalid_io The number of non-page-size-aligned I/O requests 260 notify_free Depending on device usage scenario it may account 261 262 a) the number of pages freed because of swap slot free 263 notifications 264 b) the number of pages freed because of 265 REQ_OP_DISCARD requests sent by bio. The former ones are 266 sent to a swap block device when a swap slot is freed, 267 which implies that this disk is being used as a swap disk. 268 269 The latter ones are sent by filesystem mounted with 270 discard option, whenever some data blocks are getting 271 discarded. 272 ============= ============================================================= 273 274File /sys/block/zram<id>/mm_stat 275 276The mm_stat file represents the device's mm statistics. It consists of a single 277line of text and contains the following stats separated by whitespace: 278 279 ================ ============================================================= 280 orig_data_size uncompressed size of data stored in this disk. 281 Unit: bytes 282 compr_data_size compressed size of data stored in this disk 283 mem_used_total the amount of memory allocated for this disk. This 284 includes allocator fragmentation and metadata overhead, 285 allocated for this disk. So, allocator space efficiency 286 can be calculated using compr_data_size and this statistic. 287 Unit: bytes 288 mem_limit the maximum amount of memory ZRAM can use to store 289 the compressed data 290 mem_used_max the maximum amount of memory zram has consumed to 291 store the data 292 same_pages the number of same element filled pages written to this disk. 293 No memory is allocated for such pages. 294 pages_compacted the number of pages freed during compaction 295 huge_pages the number of incompressible pages 296 huge_pages_since the number of incompressible pages since zram set up 297 ================ ============================================================= 298 299File /sys/block/zram<id>/bd_stat 300 301The bd_stat file represents a device's backing device statistics. It consists of 302a single line of text and contains the following stats separated by whitespace: 303 304 ============== ============================================================= 305 bd_count size of data written in backing device. 306 Unit: 4K bytes 307 bd_reads the number of reads from backing device 308 Unit: 4K bytes 309 bd_writes the number of writes to backing device 310 Unit: 4K bytes 311 ============== ============================================================= 312 31310) Deactivate 314============== 315 316:: 317 318 swapoff /dev/zram0 319 umount /dev/zram1 320 32111) Reset 322========= 323 324 Write any positive value to 'reset' sysfs node:: 325 326 echo 1 > /sys/block/zram0/reset 327 echo 1 > /sys/block/zram1/reset 328 329 This frees all the memory allocated for the given device and 330 resets the disksize to zero. You must set the disksize again 331 before reusing the device. 332 333Optional Feature 334================ 335 336writeback 337--------- 338 339With CONFIG_ZRAM_WRITEBACK, zram can write idle/incompressible page 340to backing storage rather than keeping it in memory. 341To use the feature, admin should set up backing device via:: 342 343 echo /dev/sda5 > /sys/block/zramX/backing_dev 344 345before disksize setting. It supports only partitions at this moment. 346If admin wants to use incompressible page writeback, they could do it via:: 347 348 echo huge > /sys/block/zramX/writeback 349 350To use idle page writeback, first, user need to declare zram pages 351as idle:: 352 353 echo all > /sys/block/zramX/idle 354 355From now on, any pages on zram are idle pages. The idle mark 356will be removed until someone requests access of the block. 357IOW, unless there is access request, those pages are still idle pages. 358Additionally, when CONFIG_ZRAM_TRACK_ENTRY_ACTIME is enabled pages can be 359marked as idle based on how long (in seconds) it's been since they were 360last accessed:: 361 362 echo 86400 > /sys/block/zramX/idle 363 364In this example all pages which haven't been accessed in more than 86400 365seconds (one day) will be marked idle. 366 367Admin can request writeback of those idle pages at right timing via:: 368 369 echo idle > /sys/block/zramX/writeback 370 371With the command, zram will writeback idle pages from memory to the storage. 372 373Additionally, if a user choose to writeback only huge and idle pages 374this can be accomplished with:: 375 376 echo huge_idle > /sys/block/zramX/writeback 377 378If a user chooses to writeback only incompressible pages (pages that none of 379algorithms can compress) this can be accomplished with:: 380 381 echo incompressible > /sys/block/zramX/writeback 382 383If an admin wants to write a specific page in zram device to the backing device, 384they could write a page index into the interface:: 385 386 echo "page_index=1251" > /sys/block/zramX/writeback 387 388If there are lots of write IO with flash device, potentially, it has 389flash wearout problem so that admin needs to design write limitation 390to guarantee storage health for entire product life. 391 392To overcome the concern, zram supports "writeback_limit" feature. 393The "writeback_limit_enable"'s default value is 0 so that it doesn't limit 394any writeback. IOW, if admin wants to apply writeback budget, they should 395enable writeback_limit_enable via:: 396 397 $ echo 1 > /sys/block/zramX/writeback_limit_enable 398 399Once writeback_limit_enable is set, zram doesn't allow any writeback 400until admin sets the budget via /sys/block/zramX/writeback_limit. 401 402(If admin doesn't enable writeback_limit_enable, writeback_limit's value 403assigned via /sys/block/zramX/writeback_limit is meaningless.) 404 405If admin wants to limit writeback as per-day 400M, they could do it 406like below:: 407 408 $ MB_SHIFT=20 409 $ 4K_SHIFT=12 410 $ echo $((400<<MB_SHIFT>>4K_SHIFT)) > \ 411 /sys/block/zram0/writeback_limit. 412 $ echo 1 > /sys/block/zram0/writeback_limit_enable 413 414If admins want to allow further write again once the budget is exhausted, 415they could do it like below:: 416 417 $ echo $((400<<MB_SHIFT>>4K_SHIFT)) > \ 418 /sys/block/zram0/writeback_limit 419 420If an admin wants to see the remaining writeback budget since last set:: 421 422 $ cat /sys/block/zramX/writeback_limit 423 424If an admin wants to disable writeback limit, they could do:: 425 426 $ echo 0 > /sys/block/zramX/writeback_limit_enable 427 428The writeback_limit count will reset whenever you reset zram (e.g., 429system reboot, echo 1 > /sys/block/zramX/reset) so keeping how many of 430writeback happened until you reset the zram to allocate extra writeback 431budget in next setting is user's job. 432 433If admin wants to measure writeback count in a certain period, they could 434know it via /sys/block/zram0/bd_stat's 3rd column. 435 436recompression 437------------- 438 439With CONFIG_ZRAM_MULTI_COMP, zram can recompress pages using alternative 440(secondary) compression algorithms. The basic idea is that alternative 441compression algorithm can provide better compression ratio at a price of 442(potentially) slower compression/decompression speeds. Alternative compression 443algorithm can, for example, be more successful compressing huge pages (those 444that default algorithm failed to compress). Another application is idle pages 445recompression - pages that are cold and sit in the memory can be recompressed 446using more effective algorithm and, hence, reduce zsmalloc memory usage. 447 448With CONFIG_ZRAM_MULTI_COMP, zram supports up to 4 compression algorithms: 449one primary and up to 3 secondary ones. Primary zram compressor is explained 450in "3) Select compression algorithm", secondary algorithms are configured 451using recomp_algorithm device attribute. 452 453Example::: 454 455 #show supported recompression algorithms 456 cat /sys/block/zramX/recomp_algorithm 457 #1: lzo lzo-rle lz4 lz4hc [zstd] 458 #2: lzo lzo-rle lz4 [lz4hc] zstd 459 460Alternative compression algorithms are sorted by priority. In the example 461above, zstd is used as the first alternative algorithm, which has priority 462of 1, while lz4hc is configured as a compression algorithm with priority 2. 463Alternative compression algorithm's priority is provided during algorithms 464configuration::: 465 466 #select zstd recompression algorithm, priority 1 467 echo "algo=zstd priority=1" > /sys/block/zramX/recomp_algorithm 468 469 #select deflate recompression algorithm, priority 2 470 echo "algo=deflate priority=2" > /sys/block/zramX/recomp_algorithm 471 472Another device attribute that CONFIG_ZRAM_MULTI_COMP enables is recompress, 473which controls recompression. 474 475Examples::: 476 477 #IDLE pages recompression is activated by `idle` mode 478 echo "type=idle" > /sys/block/zramX/recompress 479 480 #HUGE pages recompression is activated by `huge` mode 481 echo "type=huge" > /sys/block/zram0/recompress 482 483 #HUGE_IDLE pages recompression is activated by `huge_idle` mode 484 echo "type=huge_idle" > /sys/block/zramX/recompress 485 486The number of idle pages can be significant, so user-space can pass a size 487threshold (in bytes) to the recompress knob: zram will recompress only pages 488of equal or greater size::: 489 490 #recompress all pages larger than 3000 bytes 491 echo "threshold=3000" > /sys/block/zramX/recompress 492 493 #recompress idle pages larger than 2000 bytes 494 echo "type=idle threshold=2000" > /sys/block/zramX/recompress 495 496It is also possible to limit the number of pages zram re-compression will 497attempt to recompress::: 498 499 echo "type=huge_idle max_pages=42" > /sys/block/zramX/recompress 500 501Recompression of idle pages requires memory tracking. 502 503During re-compression for every page, that matches re-compression criteria, 504ZRAM iterates the list of registered alternative compression algorithms in 505order of their priorities. ZRAM stops either when re-compression was 506successful (re-compressed object is smaller in size than the original one) 507and matches re-compression criteria (e.g. size threshold) or when there are 508no secondary algorithms left to try. If none of the secondary algorithms can 509successfully re-compressed the page such a page is marked as incompressible, 510so ZRAM will not attempt to re-compress it in the future. 511 512This re-compression behaviour, when it iterates through the list of 513registered compression algorithms, increases our chances of finding the 514algorithm that successfully compresses a particular page. Sometimes, however, 515it is convenient (and sometimes even necessary) to limit recompression to 516only one particular algorithm so that it will not try any other algorithms. 517This can be achieved by providing a `algo` or `priority` parameter::: 518 519 #use zstd algorithm only (if registered) 520 echo "type=huge algo=zstd" > /sys/block/zramX/recompress 521 522 #use zstd algorithm only (if zstd was registered under priority 1) 523 echo "type=huge priority=1" > /sys/block/zramX/recompress 524 525memory tracking 526=============== 527 528With CONFIG_ZRAM_MEMORY_TRACKING, user can know information of the 529zram block. It could be useful to catch cold or incompressible 530pages of the process with*pagemap. 531 532If you enable the feature, you could see block state via 533/sys/kernel/debug/zram/zram0/block_state". The output is as follows:: 534 535 300 75.033841 .wh... 536 301 63.806904 s..... 537 302 63.806919 ..hi.. 538 303 62.801919 ....r. 539 304 146.781902 ..hi.n 540 541First column 542 zram's block index. 543Second column 544 access time since the system was booted 545Third column 546 state of the block: 547 548 s: 549 same page 550 w: 551 written page to backing store 552 h: 553 huge page 554 i: 555 idle page 556 r: 557 recompressed page (secondary compression algorithm) 558 n: 559 none (including secondary) of algorithms could compress it 560 561First line of above example says 300th block is accessed at 75.033841sec 562and the block's state is huge so it is written back to the backing 563storage. It's a debugging feature so anyone shouldn't rely on it to work 564properly. 565 566Nitin Gupta 567ngupta@vflare.org 568