========================================
zram: Compressed RAM-based block devices
========================================

Introduction
============

The zram module creates RAM-based block devices named /dev/zram<id>
(<id> = 0, 1, ...). Pages written to these disks are compressed and stored
in memory itself. These disks allow very fast I/O and compression provides
good amounts of memory savings. Some of the use cases include /tmp storage,
use as swap disks, various caches under /var and maybe many more. :)

Statistics for individual zram devices are exported through sysfs nodes at
/sys/block/zram<id>/

Usage
=====

There are several ways to configure and manage zram device(-s):

a) using zram and zram_control sysfs attributes
b) using zramctl utility, provided by util-linux (util-linux@vger.kernel.org).

In this document we will describe only 'manual' zram configuration steps,
IOW, zram and zram_control sysfs attributes.

In order to get a better idea about zramctl please consult util-linux
documentation, zramctl man-page or `zramctl --help`. Please be informed
that zram maintainers do not develop/maintain util-linux or zramctl; should
you have any questions, please contact util-linux@vger.kernel.org

The following shows a typical sequence of steps for using zram.

WARNING
=======

For the sake of simplicity we skip error checking parts in most of the
examples below. However, it is your sole responsibility to handle errors.

zram sysfs attributes always return negative values in case of errors.
The list of possible return codes:

========  =============================================================
-EBUSY    an attempt to modify an attribute that cannot be changed once
          the device has been initialised. Please reset device first.
-ENOMEM   zram was not able to allocate enough memory to fulfil your
          needs.
-EINVAL   invalid input has been provided.
-EAGAIN   re-try operation later (e.g. when attempting to run recompress
          and writeback simultaneously).
========  =============================================================

If you use 'echo', the returned value is set by the 'echo' utility,
and, in the general case, something like::

    echo foo > /sys/block/zram0/comp_algorithm
    if [ $? -ne 0 ]; then
        handle_error
    fi

should suffice.

1) Load Module
==============

::

    modprobe zram num_devices=4

This creates 4 devices: /dev/zram{0,1,2,3}

The num_devices parameter is optional and tells zram how many devices should
be pre-created. Default: 1.

2) Select compression algorithm
===============================

Using the comp_algorithm device attribute one can see available and
currently selected (shown in square brackets) compression algorithms,
or change the selected compression algorithm (once the device is initialised
there is no way to change the compression algorithm).

Examples::

    #show supported compression algorithms
    cat /sys/block/zram0/comp_algorithm
    lzo [lz4]

    #select lzo compression algorithm
    echo lzo > /sys/block/zram0/comp_algorithm

For the time being, the `comp_algorithm` content shows only compression
algorithms that are supported by zram.

3) Set compression algorithm parameters: Optional
=================================================

Compression algorithms may support specific parameters which can be
tweaked for a particular dataset. ZRAM has an `algorithm_params` device
attribute which provides a per-algorithm params configuration.

For example, several compression algorithms support the `level` parameter.
In addition, certain compression algorithms support pre-trained dictionaries,
which significantly change algorithms' characteristics.
In order to configure a
compression algorithm to use an external pre-trained dictionary, pass the full
path to the dictionary via `dict` along with other parameters::

    #pass path to pre-trained zstd dictionary
    echo "algo=zstd dict=/etc/dictionary" > /sys/block/zram0/algorithm_params

    #same, but using algorithm priority
    echo "priority=1 dict=/etc/dictionary" > \
        /sys/block/zram0/algorithm_params

    #pass path to pre-trained zstd dictionary and compression level
    echo "algo=zstd level=8 dict=/etc/dictionary" > \
        /sys/block/zram0/algorithm_params

Parameters are algorithm specific: not all algorithms support pre-trained
dictionaries, not all algorithms support `level`. Furthermore, for certain
algorithms `level` controls the compression level (the higher the value the
better the compression ratio; it can even take negative values for some
algorithms), for other algorithms `level` is the acceleration level (the higher
the value the lower the compression ratio).

4) Set Disksize
===============

Set disk size by writing the value to sysfs node 'disksize'.
The value can be either in bytes or you can use mem suffixes.
Examples::

    # Initialize /dev/zram0 with 50MB disksize
    echo $((50*1024*1024)) > /sys/block/zram0/disksize

    # Using mem suffixes
    echo 256K > /sys/block/zram0/disksize
    echo 512M > /sys/block/zram0/disksize
    echo 1G > /sys/block/zram0/disksize

Note:
There is little point creating a zram of greater than twice the size of memory
since we expect a 2:1 compression ratio. Note that zram uses about 0.1% of the
size of the disk when not in use so a huge zram is wasteful.

5) Set memory limit: Optional
=============================

Set memory limit by writing the value to sysfs node 'mem_limit'.
The value can be either in bytes or you can use mem suffixes.
In addition, you can change the value at runtime.
Examples::

    # limit /dev/zram0 with 50MB memory
    echo $((50*1024*1024)) > /sys/block/zram0/mem_limit

    # Using mem suffixes
    echo 256K > /sys/block/zram0/mem_limit
    echo 512M > /sys/block/zram0/mem_limit
    echo 1G > /sys/block/zram0/mem_limit

    # To disable memory limit
    echo 0 > /sys/block/zram0/mem_limit

6) Activate
===========

::

    mkswap /dev/zram0
    swapon /dev/zram0

    mkfs.ext4 /dev/zram1
    mount /dev/zram1 /tmp

7) Add/remove zram devices
==========================

zram provides a control interface, which enables dynamic (on-demand) device
addition and removal.

In order to add a new /dev/zramX device, perform a read operation on the hot_add
attribute. This will return either the new device's device id (meaning that you
can use /dev/zram<id>) or an error code.

Example::

    cat /sys/class/zram-control/hot_add
    1

To remove the existing /dev/zramX device (where X is a device id)
execute::

    echo X > /sys/class/zram-control/hot_remove

8) Stats
========

Per-device statistics are exported as various nodes under /sys/block/zram<id>/

A brief description of exported device attributes follows. For more details
please read Documentation/ABI/testing/sysfs-block-zram.

======================  ======  ===============================================
Name                    access  description
======================  ======  ===============================================
disksize                RW      show and set the device's disk size
initstate               RO      shows the initialization state of the device
reset                   WO      trigger device reset
mem_used_max            WO      reset the `mem_used_max` counter (see later)
mem_limit               WO      specifies the maximum amount of memory ZRAM can
                                use to store the compressed data
writeback_limit         RW      specifies the maximum amount of write IO zram
                                can write out to backing device as 4KB unit
writeback_limit_enable  RW      show and set writeback_limit feature
writeback_batch_size    RW      show and set maximum number of in-flight
                                writeback operations
writeback_compressed    RW      show and set compressed writeback feature
comp_algorithm          RW      show and change the compression algorithm
algorithm_params        WO      setup compression algorithm parameters
compact                 WO      trigger memory compaction
debug_stat              RO      this file is used for zram debugging purposes
backing_dev             RW      set up backend storage for zram to write out
idle                    WO      mark allocated slot as idle
======================  ======  ===============================================

User space is advised to use the following files to read the device statistics.

File /sys/block/zram<id>/stat

Represents block layer statistics. Read Documentation/block/stat.rst for
details.

File /sys/block/zram<id>/io_stat

The io_stat file represents the device's I/O statistics not accounted by the
block layer and, thus, not available in the zram<id>/stat file.
It consists of a
single line of text and contains the following stats separated by
whitespace:

=============  =============================================================
failed_reads   The number of failed reads
failed_writes  The number of failed writes
invalid_io     The number of non-page-size-aligned I/O requests
notify_free    Depending on device usage scenario it may account

               a) the number of pages freed because of swap slot free
                  notifications
               b) the number of pages freed because of
                  REQ_OP_DISCARD requests sent by bio.

               The former are sent to a swap block device when a swap
               slot is freed, which implies that this disk is being used
               as a swap disk. The latter are sent by a filesystem mounted
               with the discard option whenever some data blocks are
               discarded.
=============  =============================================================

File /sys/block/zram<id>/mm_stat

The mm_stat file represents the device's mm statistics. It consists of a single
line of text and contains the following stats separated by whitespace:

================  =============================================================
orig_data_size    uncompressed size of data stored in this disk.
                  Unit: bytes
compr_data_size   compressed size of data stored in this disk
mem_used_total    the amount of memory allocated for this disk. This
                  includes allocator fragmentation and metadata overhead,
                  allocated for this disk. So, allocator space efficiency
                  can be calculated using compr_data_size and this statistic.
                  Unit: bytes
mem_limit         the maximum amount of memory ZRAM can use to store
                  the compressed data
mem_used_max      the maximum amount of memory zram has consumed to
                  store the data
same_pages        the number of same element filled pages written to this disk.
                  No memory is allocated for such pages.
pages_compacted   the number of pages freed during compaction
huge_pages        the number of incompressible pages
huge_pages_since  the number of incompressible pages since zram set up
================  =============================================================

File /sys/block/zram<id>/bd_stat

The bd_stat file represents a device's backing device statistics. It consists of
a single line of text and contains the following stats separated by whitespace:

==============  =============================================================
bd_count        size of data written in backing device.
                Unit: 4K bytes
bd_reads        the number of reads from backing device
                Unit: 4K bytes
bd_writes       the number of writes to backing device
                Unit: 4K bytes
==============  =============================================================

9) Deactivate
=============

::

    swapoff /dev/zram0
    umount /dev/zram1

10) Reset
=========

Write any positive value to 'reset' sysfs node::

    echo 1 > /sys/block/zram0/reset
    echo 1 > /sys/block/zram1/reset

This frees all the memory allocated for the given device and
resets the disksize to zero. You must set the disksize again
before reusing the device.

Optional Feature
================

IDLE pages tracking
-------------------

zram has built-in support for idle pages tracking (that is, allocated but
not used pages). This feature is useful for e.g. zram writeback and
recompression. In order to mark pages as idle, execute the following command::

    echo all > /sys/block/zramX/idle

This will mark all allocated zram pages as idle. The idle mark will be
removed only when the page (block) is accessed (e.g. overwritten or freed).
Additionally, when CONFIG_ZRAM_TRACK_ENTRY_ACTIME is enabled, pages can be
marked as idle based on how many seconds have passed since the last access to
a particular zram page::

    echo 86400 > /sys/block/zramX/idle

In this example, all pages which haven't been accessed in more than 86400
seconds (one day) will be marked idle.

writeback
---------

With CONFIG_ZRAM_WRITEBACK, zram can write idle/incompressible pages
to backing storage rather than keeping them in memory.
To use the feature, an admin should set up the backing device via::

    echo /dev/sda5 > /sys/block/zramX/backing_dev

before setting disksize. It supports only partitions at this moment.
If an admin wants to use incompressible page writeback, they could do it via::

    echo huge > /sys/block/zramX/writeback

An admin can request writeback of idle pages at the right time via::

    echo idle > /sys/block/zramX/writeback

With this command, zram will write back idle pages from memory to the storage.

Additionally, if a user chooses to write back only pages that are both huge
and idle, this can be accomplished with::

    echo huge_idle > /sys/block/zramX/writeback

If a user chooses to write back only incompressible pages (pages that none of
the algorithms can compress), this can be accomplished with::

    echo incompressible > /sys/block/zramX/writeback

If an admin wants to write a specific page in the zram device to the backing
device, they could write a page index into the interface::

    echo "page_index=1251" > /sys/block/zramX/writeback

In Linux 6.16 this interface underwent some rework. First, the interface
now supports the `key=value` format for all of its parameters
(`type=huge_idle`, etc.). Second, support for `page_indexes` was introduced,
which specifies a `LOW-HIGH` range (or ranges) of pages to be written back.
This reduces the
number of syscalls, but more importantly this enables an optimal
post-processing target selection strategy. Usage example::

    echo "type=idle" > /sys/block/zramX/writeback
    echo "page_indexes=1-100 page_indexes=200-300" > \
        /sys/block/zramX/writeback

We also now permit multiple page_index params per call and a mix of
single pages and page ranges::

    echo page_index=42 page_index=99 page_indexes=100-200 \
        page_indexes=500-700 > /sys/block/zramX/writeback

Heavy write IO to a flash device can cause flash wearout, so an admin
needs to limit writes in order to guarantee storage health for the
entire product life.

To overcome the concern, zram supports the "writeback_limit" feature.
The default value of "writeback_limit_enable" is 0, so that it doesn't limit
any writeback. IOW, if an admin wants to apply a writeback budget, they should
enable writeback_limit_enable via::

    $ echo 1 > /sys/block/zramX/writeback_limit_enable

Once writeback_limit_enable is set, zram doesn't allow any writeback
until the admin sets the budget via /sys/block/zramX/writeback_limit.

(If the admin doesn't enable writeback_limit_enable, the value assigned
via /sys/block/zramX/writeback_limit is meaningless.)

If an admin wants to limit writeback to 400M per day, they could do it
like below (the budget is expressed in 4KB units)::

    $ MB_SHIFT=20
    $ SHIFT_4K=12
    $ echo $((400<<MB_SHIFT>>SHIFT_4K)) > \
        /sys/block/zram0/writeback_limit
    $ echo 1 > /sys/block/zram0/writeback_limit_enable

If admins want to allow further writes once the budget is exhausted,
they could do it like below::

    $ echo $((400<<MB_SHIFT>>SHIFT_4K)) > \
        /sys/block/zram0/writeback_limit

If an admin wants to see the remaining writeback budget since it was last set::

    $ cat /sys/block/zramX/writeback_limit

If an admin wants to disable the writeback limit, they could do::

    $ echo 0 > /sys/block/zramX/writeback_limit_enable

The writeback_limit count is reset whenever zram is reset (e.g.,
system reboot, echo 1 > /sys/block/zramX/reset), so it is the user's job
to keep track of how much writeback happened before the reset in order to
allocate the remaining writeback budget in the next setting.

By default zram stores written back pages in decompressed (raw) form, which
means that the writeback operation involves decompression of the page before
writing it to the backing device. This behavior can be changed by enabling
the `writeback_compressed` feature, which causes zram to write compressed pages
to the backing device, thus avoiding decompression overhead. To enable
this feature, execute::

    $ echo yes > /sys/block/zramX/writeback_compressed

Note that this feature should be configured before the `zramX` device is
initialized.

Depending on the backing device storage type, the writeback operation may
benefit from a higher number of in-flight write requests (batched writes). The
maximum number of in-flight writeback operations can be configured via the
`writeback_batch_size` attribute. To change the default value (which is 32),
execute::

    $ echo 64 > /sys/block/zramX/writeback_batch_size

If an admin wants to measure the writeback count in a certain period, they
can read it from the 3rd column of /sys/block/zram0/bd_stat.

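The budget arithmetic and the bd_stat bookkeeping described above can be sketched as a few small shell helpers. This is only an illustration: the function names `mib_to_4k_units`, `bd_writes_units` and `writeback_delta` are hypothetical and not part of zram or util-linux, and the helpers take their input as arguments or on stdin so that they can be tried without a zram device.

```shell
#!/bin/sh
# Hypothetical helpers (not provided by zram); a sketch only.

# Convert a budget given in MiB into the 4KB units used by writeback_limit.
mib_to_4k_units() {
    echo $(( $1 << 20 >> 12 ))
}

# Print the 3rd field (bd_writes, in 4KB units) of a bd_stat line on stdin.
bd_writes_units() {
    awk '{ print $3 }'
}

# Writeback volume, in 4KB units, between two bd_stat samples
# passed as "$1" (earlier sample) and "$2" (later sample).
writeback_delta() {
    before=$(printf '%s\n' "$1" | bd_writes_units)
    after=$(printf '%s\n' "$2" | bd_writes_units)
    echo $(( after - before ))
}
```

On a real system one might run `mib_to_4k_units 400 > /sys/block/zram0/writeback_limit`, and sample the device with `cat /sys/block/zram0/bd_stat` at the start and end of the period of interest before passing both samples to `writeback_delta`.
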

recompression
-------------

With CONFIG_ZRAM_MULTI_COMP, zram can recompress pages using alternative
(secondary) compression algorithms. The basic idea is that an alternative
compression algorithm can provide a better compression ratio at a price of
(potentially) slower compression/decompression speeds. An alternative
compression algorithm can, for example, be more successful compressing huge
pages (those that the default algorithm failed to compress). Another
application is idle pages recompression - pages that are cold and sit in
memory can be recompressed using a more effective algorithm and, hence, reduce
zsmalloc memory usage.

With CONFIG_ZRAM_MULTI_COMP, zram supports up to 4 compression algorithms:
one primary and up to 3 secondary ones. The primary zram compressor is explained
in "2) Select compression algorithm"; secondary algorithms are configured
using the recomp_algorithm device attribute.

Example::

    #show supported recompression algorithms
    cat /sys/block/zramX/recomp_algorithm
    #1: lzo lzo-rle lz4 lz4hc [zstd]
    #2: lzo lzo-rle lz4 [lz4hc] zstd

Alternative compression algorithms are sorted by priority. In the example
above, zstd is used as the first alternative algorithm, which has priority
of 1, while lz4hc is configured as a compression algorithm with priority 2.
An alternative compression algorithm's priority is provided during algorithm
configuration::

    #select zstd recompression algorithm, priority 1
    echo "algo=zstd priority=1" > /sys/block/zramX/recomp_algorithm

    #select deflate recompression algorithm, priority 2
    echo "algo=deflate priority=2" > /sys/block/zramX/recomp_algorithm

Another device attribute that CONFIG_ZRAM_MULTI_COMP enables is recompress,
which controls recompression.

Examples::

    #IDLE pages recompression is activated by `idle` mode
    echo "type=idle" > /sys/block/zramX/recompress

    #HUGE pages recompression is activated by `huge` mode
    echo "type=huge" > /sys/block/zramX/recompress

    #HUGE_IDLE pages recompression is activated by `huge_idle` mode
    echo "type=huge_idle" > /sys/block/zramX/recompress

The number of idle pages can be significant, so user-space can pass a size
threshold (in bytes) to the recompress knob: zram will recompress only pages
of equal or greater size::

    #recompress all pages larger than 3000 bytes
    echo "threshold=3000" > /sys/block/zramX/recompress

    #recompress idle pages larger than 2000 bytes
    echo "type=idle threshold=2000" > /sys/block/zramX/recompress

It is also possible to limit the number of pages zram will attempt to
recompress::

    echo "type=huge_idle max_pages=42" > /sys/block/zramX/recompress

During re-compression, for every page that matches the re-compression criteria,
ZRAM iterates the list of registered alternative compression algorithms in
order of their priorities. ZRAM stops either when re-compression was
successful (the re-compressed object is smaller in size than the original one)
and matches the re-compression criteria (e.g. size threshold) or when there are
no secondary algorithms left to try. If none of the secondary algorithms can
successfully re-compress the page, such a page is marked as incompressible,
so ZRAM will not attempt to re-compress it in the future.

This re-compression behaviour, when it iterates through the list of
registered compression algorithms, increases our chances of finding the
algorithm that successfully compresses a particular page. Sometimes, however,
it is convenient (and sometimes even necessary) to limit recompression to
only one particular algorithm so that it will not try any other algorithms.
This can be achieved by providing an `algo` or `priority` parameter::

    #use zstd algorithm only (if registered)
    echo "type=huge algo=zstd" > /sys/block/zramX/recompress

    #use zstd algorithm only (if zstd was registered under priority 1)
    echo "type=huge priority=1" > /sys/block/zramX/recompress

memory tracking
===============

With CONFIG_ZRAM_MEMORY_TRACKING, the user can obtain information about
zram blocks. It could be useful to catch cold or incompressible
pages of the process with pagemap.

If you enable the feature, you can see the block state via
/sys/kernel/debug/zram/zram0/block_state. The output is as follows::

      300    75.033841 .wh...
      301    63.806904 s.....
      302    63.806919 ..hi..
      303    62.801919 ....r.
      304   146.781902 ..hi.n

First column
    zram's block index.
Second column
    access time since the system was booted
Third column
    state of the block:

    s:
        same page
    w:
        written page to backing store
    h:
        huge page
    i:
        idle page
    r:
        recompressed page (secondary compression algorithm)
    n:
        none (including secondary) of algorithms could compress it

The first line of the example above says that block 300 was accessed at
75.033841 seconds, and that it is a huge page which has been written back to
the backing storage. This is a debugging feature, so nobody should rely on it
to work properly.

Nitin Gupta
ngupta@vflare.org
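
To summarise the block_state output described in the memory tracking section, one could count how many blocks carry a given state flag. The helper below is a minimal sketch: the name `count_blocks_with_flag` is hypothetical, and it assumes the three-column layout shown in that section, read from stdin.

```shell
#!/bin/sh
# Hypothetical helper (not provided by zram); a sketch only.
# Count the block_state lines on stdin whose third column contains the
# given flag character (one of s, w, h, i, r, n).
count_blocks_with_flag() {
    awk -v flag="$1" 'index($3, flag) > 0 { n++ } END { print n + 0 }'
}
```

On a system with CONFIG_ZRAM_MEMORY_TRACKING and debugfs mounted, `count_blocks_with_flag h < /sys/kernel/debug/zram/zram0/block_state` would report the number of huge blocks. Since this is a debugging interface, the result should be treated as indicative only.
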