========================================
zram: Compressed RAM-based block devices
========================================

Introduction
============

The zram module creates RAM-based block devices named /dev/zram<id>
(<id> = 0, 1, ...). Pages written to these disks are compressed and stored
in memory itself. These disks allow very fast I/O and compression provides
good amounts of memory savings. Some of the use cases include /tmp storage,
use as swap disks, various caches under /var and maybe many more. :)

Statistics for individual zram devices are exported through sysfs nodes at
/sys/block/zram<id>/

Usage
=====

There are several ways to configure and manage zram device(-s):

a) using zram and zram_control sysfs attributes
b) using zramctl utility, provided by util-linux (util-linux@vger.kernel.org).

In this document we will describe only 'manual' zram configuration steps,
IOW, zram and zram_control sysfs attributes.

In order to get a better idea about zramctl please consult util-linux
documentation, zramctl man-page or `zramctl --help`. Please be informed
that zram maintainers do not develop/maintain util-linux or zramctl; should
you have any questions, please contact util-linux@vger.kernel.org

The following shows a typical sequence of steps for using zram.

WARNING
=======

For the sake of simplicity we skip error checking parts in most of the
examples below. However, it is your sole responsibility to handle errors.

zram sysfs attributes always return negative values in case of errors.
The list of possible return codes:

========  =============================================================
-EBUSY    an attempt to modify an attribute that cannot be changed once
          the device has been initialised. Please reset device first.
-ENOMEM   zram was not able to allocate enough memory to fulfil your
          needs.
-EINVAL   invalid input has been provided.
-EAGAIN   re-try operation later (e.g.
          when attempting to run recompress
          and writeback simultaneously).
========  =============================================================

If you use 'echo', the returned value is set by the 'echo' utility,
and, in the general case, something like::

    echo foo > /sys/block/zram0/comp_algorithm
    if [ $? -ne 0 ]; then
        handle_error
    fi

should suffice.

1) Load Module
==============

::

    modprobe zram num_devices=4

This creates 4 devices: /dev/zram{0,1,2,3}

The num_devices parameter is optional and tells zram how many devices should
be pre-created. Default: 1.

2) Select compression algorithm
===============================

Using the comp_algorithm device attribute one can see available and
currently selected (shown in square brackets) compression algorithms,
or change the selected compression algorithm (once the device is initialised
there is no way to change the compression algorithm).

Examples::

    #show supported compression algorithms
    cat /sys/block/zram0/comp_algorithm
    lzo [lz4]

    #select lzo compression algorithm
    echo lzo > /sys/block/zram0/comp_algorithm

For the time being, the `comp_algorithm` content shows only compression
algorithms that are supported by zram.

3) Set compression algorithm parameters: Optional
=================================================

Compression algorithms may support specific parameters which can be
tweaked for a particular dataset. ZRAM has an `algorithm_params` device
attribute which provides a per-algorithm params configuration.

For example, several compression algorithms support the `level` parameter.
In addition, certain compression algorithms support pre-trained dictionaries,
which significantly change algorithms' characteristics.
In order to configure a
compression algorithm to use an external pre-trained dictionary, pass the
full path to the dictionary as `dict` along with the other parameters::

    #pass path to pre-trained zstd dictionary
    echo "algo=zstd dict=/etc/dictionary" > /sys/block/zram0/algorithm_params

    #same, but using algorithm priority
    echo "priority=1 dict=/etc/dictionary" > \
        /sys/block/zram0/algorithm_params

    #pass path to pre-trained zstd dictionary and compression level
    echo "algo=zstd level=8 dict=/etc/dictionary" > \
        /sys/block/zram0/algorithm_params

Parameters are algorithm specific: not all algorithms support pre-trained
dictionaries, not all algorithms support `level`. Furthermore, for certain
algorithms `level` controls the compression level (the higher the value the
better the compression ratio; it can even take negative values for some
algorithms), for other algorithms `level` is the acceleration level (the
higher the value the lower the compression ratio).

4) Set Disksize
===============

Set disk size by writing the value to sysfs node 'disksize'.
The value can be either in bytes or you can use mem suffixes.
Examples::

    # Initialize /dev/zram0 with 50MB disksize
    echo $((50*1024*1024)) > /sys/block/zram0/disksize

    # Using mem suffixes
    echo 256K > /sys/block/zram0/disksize
    echo 512M > /sys/block/zram0/disksize
    echo 1G > /sys/block/zram0/disksize

Note:
There is little point creating a zram of greater than twice the size of memory
since we expect a 2:1 compression ratio. Note that zram uses about 0.1% of the
size of the disk when not in use so a huge zram is wasteful.

5) Set memory limit: Optional
=============================

Set memory limit by writing the value to sysfs node 'mem_limit'.
The value can be either in bytes or you can use mem suffixes.
In addition, you can change the value at runtime.
Examples::

    # limit /dev/zram0 with 50MB memory
    echo $((50*1024*1024)) > /sys/block/zram0/mem_limit

    # Using mem suffixes
    echo 256K > /sys/block/zram0/mem_limit
    echo 512M > /sys/block/zram0/mem_limit
    echo 1G > /sys/block/zram0/mem_limit

    # To disable memory limit
    echo 0 > /sys/block/zram0/mem_limit

6) Activate
===========

::

    mkswap /dev/zram0
    swapon /dev/zram0

    mkfs.ext4 /dev/zram1
    mount /dev/zram1 /tmp

7) Add/remove zram devices
==========================

zram provides a control interface, which enables dynamic (on-demand) device
addition and removal.

In order to add a new /dev/zramX device, perform a read operation on the hot_add
attribute. This will return either the new device's device id (meaning that you
can use /dev/zram<id>) or an error code.

Example::

    cat /sys/class/zram-control/hot_add
    1

To remove the existing /dev/zramX device (where X is a device id)
execute::

    echo X > /sys/class/zram-control/hot_remove

8) Stats
========

Per-device statistics are exported as various nodes under /sys/block/zram<id>/

A brief description of exported device attributes follows. For more details
please read Documentation/ABI/testing/sysfs-block-zram.


======================  ======  ===============================================
Name                    access  description
======================  ======  ===============================================
disksize                RW      show and set the device's disk size
initstate               RO      shows the initialization state of the device
reset                   WO      trigger device reset
mem_used_max            WO      reset the `mem_used_max` counter (see later)
mem_limit               WO      specifies the maximum amount of memory ZRAM can
                                use to store the compressed data
writeback_limit         WO      specifies the maximum amount of write IO zram
                                can write out to backing device as 4KB unit
writeback_limit_enable  RW      show and set writeback_limit feature
comp_algorithm          RW      show and change the compression algorithm
algorithm_params        WO      setup compression algorithm parameters
compact                 WO      trigger memory compaction
debug_stat              RO      this file is used for zram debugging purposes
backing_dev             RW      set up backend storage for zram to write out
idle                    WO      mark allocated slot as idle
======================  ======  ===============================================


User space is advised to use the following files to read the device statistics.

File /sys/block/zram<id>/stat

Represents block layer statistics. Read Documentation/block/stat.rst for
details.

File /sys/block/zram<id>/io_stat

The io_stat file represents the device's I/O statistics not accounted by the
block layer and, thus, not available in the zram<id>/stat file.
It consists of a
single line of text and contains the following stats separated by
whitespace:

 =============    =============================================================
 failed_reads     The number of failed reads
 failed_writes    The number of failed writes
 invalid_io       The number of non-page-size-aligned I/O requests
 notify_free      Depending on device usage scenario it may account

                  a) the number of pages freed because of swap slot free
                     notifications
                  b) the number of pages freed because of
                     REQ_OP_DISCARD requests sent by bio. The former ones are
                     sent to a swap block device when a swap slot is freed,
                     which implies that this disk is being used as a swap disk.

                  The latter ones are sent by filesystem mounted with
                  discard option, whenever some data blocks are getting
                  discarded.
 =============    =============================================================

File /sys/block/zram<id>/mm_stat

The mm_stat file represents the device's mm statistics. It consists of a single
line of text and contains the following stats separated by whitespace:

 ================ =============================================================
 orig_data_size   uncompressed size of data stored in this disk.
                  Unit: bytes
 compr_data_size  compressed size of data stored in this disk
 mem_used_total   the amount of memory allocated for this disk. This
                  includes allocator fragmentation and metadata overhead,
                  allocated for this disk. So, allocator space efficiency
                  can be calculated using compr_data_size and this statistic.
                  Unit: bytes
 mem_limit        the maximum amount of memory ZRAM can use to store
                  the compressed data
 mem_used_max     the maximum amount of memory zram has consumed to
                  store the data
 same_pages       the number of same element filled pages written to this disk.
                  No memory is allocated for such pages.
 pages_compacted  the number of pages freed during compaction
 huge_pages       the number of incompressible pages
 huge_pages_since the number of incompressible pages since zram set up
 ================ =============================================================

File /sys/block/zram<id>/bd_stat

The bd_stat file represents a device's backing device statistics. It consists of
a single line of text and contains the following stats separated by whitespace:

 ============== =============================================================
 bd_count       size of data written in backing device.
                Unit: 4K bytes
 bd_reads       the number of reads from backing device
                Unit: 4K bytes
 bd_writes      the number of writes to backing device
                Unit: 4K bytes
 ============== =============================================================

9) Deactivate
=============

::

    swapoff /dev/zram0
    umount /dev/zram1

10) Reset
=========

Write any positive value to 'reset' sysfs node::

    echo 1 > /sys/block/zram0/reset
    echo 1 > /sys/block/zram1/reset

This frees all the memory allocated for the given device and
resets the disksize to zero. You must set the disksize again
before reusing the device.

Optional Feature
================

IDLE pages tracking
-------------------

zram has built-in support for idle pages tracking (that is, allocated but
not used pages). This feature is useful for e.g. zram writeback and
recompression. In order to mark pages as idle, execute the following command::

    echo all > /sys/block/zramX/idle

This will mark all allocated zram pages as idle. The idle mark will be
removed only when the page (block) is accessed (e.g. overwritten or freed).
Additionally, when CONFIG_ZRAM_TRACK_ENTRY_ACTIME is enabled, pages can be
marked as idle based on how many seconds have passed since the last access to
a particular zram page::

    echo 86400 > /sys/block/zramX/idle

In this example, all pages which haven't been accessed in more than 86400
seconds (one day) will be marked idle.

writeback
---------

With CONFIG_ZRAM_WRITEBACK, zram can write idle/incompressible pages
to backing storage rather than keeping them in memory.
To use the feature, the admin should set up the backing device via::

    echo /dev/sda5 > /sys/block/zramX/backing_dev

before setting the disksize. It supports only partitions at this moment.
If the admin wants to use incompressible page writeback, they can do it via::

    echo huge > /sys/block/zramX/writeback

The admin can request writeback of idle pages at the right time via::

    echo idle > /sys/block/zramX/writeback

With this command, zram will write idle pages back from memory to the storage.

Additionally, if a user chooses to write back only huge and idle pages,
this can be accomplished with::

    echo huge_idle > /sys/block/zramX/writeback

If a user chooses to write back only incompressible pages (pages that none of
the algorithms can compress), this can be accomplished with::

    echo incompressible > /sys/block/zramX/writeback

If an admin wants to write a specific page in the zram device to the backing
device, they can write a page index into the interface::

    echo "page_index=1251" > /sys/block/zramX/writeback

In Linux 6.16 this interface underwent some rework. First, the interface
now supports the `key=value` format for all of its parameters
(`type=huge_idle`, etc.). Second, support for `page_indexes` was introduced,
which specifies a `LOW-HIGH` range (or ranges) of pages to be written back.
This reduces the
number of syscalls, but more importantly this enables optimal post-processing
target selection strategy. Usage example::

    echo "type=idle" > /sys/block/zramX/writeback
    echo "page_indexes=1-100 page_indexes=200-300" > \
        /sys/block/zramX/writeback

We also now permit multiple page_index params per call and a mix of
single pages and page ranges::

    echo page_index=42 page_index=99 page_indexes=100-200 \
        page_indexes=500-700 > /sys/block/zramX/writeback

Heavy write IO to a flash device can cause flash wearout, so the admin
needs to design a write limit that guarantees storage health for the
entire product life.

To overcome the concern, zram supports the "writeback_limit" feature.
The default value of "writeback_limit_enable" is 0, so writeback is not
limited. IOW, if the admin wants to apply a writeback budget, they should
enable writeback_limit_enable via::

    $ echo 1 > /sys/block/zramX/writeback_limit_enable

Once writeback_limit_enable is set, zram doesn't allow any writeback
until the admin sets the budget via /sys/block/zramX/writeback_limit.

(If the admin doesn't enable writeback_limit_enable, the value assigned
via /sys/block/zramX/writeback_limit is meaningless.)

If the admin wants to limit writeback to 400M per day, they can do it
like below (the shift variable is named SHIFT_4K because a shell variable
name cannot start with a digit)::

    $ MB_SHIFT=20
    $ SHIFT_4K=12
    $ echo $((400<<MB_SHIFT>>SHIFT_4K)) > \
        /sys/block/zram0/writeback_limit
    $ echo 1 > /sys/block/zram0/writeback_limit_enable

If the admin wants to allow further writes once the budget is exhausted,
they can set the budget again (400M expressed in 4K units)::

    $ echo $(( (400 << 20) >> 12 )) > \
        /sys/block/zram0/writeback_limit

If an admin wants to see the remaining writeback budget since it was last set::

    $ cat /sys/block/zramX/writeback_limit

If an admin wants to disable the writeback limit, they can do::

    $ echo 0 > /sys/block/zramX/writeback_limit_enable

The writeback_limit count is reset whenever zram is reset (e.g.,
system reboot, echo 1 > /sys/block/zramX/reset), so it is the user's job
to keep track of how much writeback happened before the reset in order to
allocate the remaining writeback budget in the next setting.

If the admin wants to measure the writeback count in a certain period, they
can read it from the 3rd column of /sys/block/zram0/bd_stat.

recompression
-------------

With CONFIG_ZRAM_MULTI_COMP, zram can recompress pages using alternative
(secondary) compression algorithms. The basic idea is that an alternative
compression algorithm can provide a better compression ratio at the price of
(potentially) slower compression/decompression speeds. An alternative
compression algorithm can, for example, be more successful compressing huge
pages (those that the default algorithm failed to compress). Another
application is idle pages recompression - pages that are cold and sit in
memory can be recompressed using a more effective algorithm and, hence, reduce
zsmalloc memory usage.

With CONFIG_ZRAM_MULTI_COMP, zram supports up to 4 compression algorithms:
one primary and up to 3 secondary ones. The primary zram compressor is
explained in "2) Select compression algorithm"; secondary algorithms are
configured using the recomp_algorithm device attribute.


Example::

    #show supported recompression algorithms
    cat /sys/block/zramX/recomp_algorithm
    #1: lzo lzo-rle lz4 lz4hc [zstd]
    #2: lzo lzo-rle lz4 [lz4hc] zstd

Alternative compression algorithms are sorted by priority. In the example
above, zstd is used as the first alternative algorithm, which has priority
of 1, while lz4hc is configured as a compression algorithm with priority 2.
An alternative compression algorithm's priority is provided during algorithm
configuration::

    #select zstd recompression algorithm, priority 1
    echo "algo=zstd priority=1" > /sys/block/zramX/recomp_algorithm

    #select deflate recompression algorithm, priority 2
    echo "algo=deflate priority=2" > /sys/block/zramX/recomp_algorithm

Another device attribute that CONFIG_ZRAM_MULTI_COMP enables is recompress,
which controls recompression.

Examples::

    #IDLE pages recompression is activated by `idle` mode
    echo "type=idle" > /sys/block/zramX/recompress

    #HUGE pages recompression is activated by `huge` mode
    echo "type=huge" > /sys/block/zramX/recompress

    #HUGE_IDLE pages recompression is activated by `huge_idle` mode
    echo "type=huge_idle" > /sys/block/zramX/recompress

The number of idle pages can be significant, so user-space can pass a size
threshold (in bytes) to the recompress knob: zram will recompress only pages
of equal or greater size::

    #recompress all pages larger than 3000 bytes
    echo "threshold=3000" > /sys/block/zramX/recompress

    #recompress idle pages larger than 2000 bytes
    echo "type=idle threshold=2000" > /sys/block/zramX/recompress

It is also possible to limit the number of pages zram re-compression will
attempt to recompress::

    echo "type=huge_idle max_pages=42" > /sys/block/zramX/recompress

During re-compression, for every page that matches the re-compression
criteria, ZRAM iterates the list of registered
alternative compression algorithms in
order of their priorities. ZRAM stops either when re-compression was
successful (re-compressed object is smaller in size than the original one)
and matches re-compression criteria (e.g. size threshold) or when there are
no secondary algorithms left to try. If none of the secondary algorithms can
successfully re-compress the page, such a page is marked as incompressible,
so ZRAM will not attempt to re-compress it in the future.

This re-compression behaviour, when it iterates through the list of
registered compression algorithms, increases our chances of finding the
algorithm that successfully compresses a particular page. Sometimes, however,
it is convenient (and sometimes even necessary) to limit recompression to
only one particular algorithm so that it will not try any other algorithms.
This can be achieved by providing an `algo` or `priority` parameter::

    #use zstd algorithm only (if registered)
    echo "type=huge algo=zstd" > /sys/block/zramX/recompress

    #use zstd algorithm only (if zstd was registered under priority 1)
    echo "type=huge priority=1" > /sys/block/zramX/recompress

memory tracking
===============

With CONFIG_ZRAM_MEMORY_TRACKING, the user can get information about
zram blocks. It could be useful to catch cold or incompressible
pages of the process with pagemap.

If you enable the feature, you can see the block state via
/sys/kernel/debug/zram/zram0/block_state. The output is as follows::

      300    75.033841 .wh...
      301    63.806904 s.....
      302    63.806919 ..hi..
      303    62.801919 ....r.
      304   146.781902 ..hi.n

First column
    zram's block index.
Second column
    access time since the system was booted
Third column
    state of the block:

    s:
        same page
    w:
        written page to backing store
    h:
        huge page
    i:
        idle page
    r:
        recompressed page (secondary compression algorithm)
    n:
        none (including secondary) of algorithms could compress it

The first line of the example above says that block 300 was accessed at
75.033841 seconds and that the block is huge and has been written back to
the backing storage. This is a debugging feature, so nobody should rely on
it to work properly.

Nitin Gupta
ngupta@vflare.org
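
As a rough illustration of the memory tracking output described above, the
following sketch lists the indices of blocks whose state column contains a
given set of flags. The function name zram_blocks_with_flags and its two
arguments are hypothetical and not part of zram. It assumes the three-column
block_state layout shown in the memory tracking section and a POSIX shell
with awk:

```shell
#!/bin/sh
# Hypothetical helper, not part of zram: print the index (first column) of
# every block whose state (third column) contains all requested flags.
#   $1: flags to require, e.g. "hi" for blocks that are huge and idle
#   $2: block_state file to read (assumption: defaults to the zram0 debugfs
#       path used in this document)
zram_blocks_with_flags() {
    flags=$1
    file=${2:-/sys/kernel/debug/zram/zram0/block_state}
    awk -v flags="$flags" '
        {
            ok = 1
            # every requested flag character must appear in the state column
            for (i = 1; i <= length(flags); i++)
                if (index($3, substr(flags, i, 1)) == 0)
                    ok = 0
            if (ok)
                print $1
        }
    ' "$file"
}
```

For instance, `zram_blocks_with_flags hi` would list blocks that are both
huge and idle. Such indices could then be passed to the writeback attribute
as `page_index=` parameters.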