1======================================== 2zram: Compressed RAM-based block devices 3======================================== 4 5Introduction 6============ 7 8The zram module creates RAM-based block devices named /dev/zram<id> 9(<id> = 0, 1, ...). Pages written to these disks are compressed and stored 10in memory itself. These disks allow very fast I/O and compression provides 11good amounts of memory savings. Some of the use cases include /tmp storage, 12use as swap disks, various caches under /var and maybe many more. :) 13 14Statistics for individual zram devices are exported through sysfs nodes at 15/sys/block/zram<id>/ 16 17Usage 18===== 19 20There are several ways to configure and manage zram device(-s): 21 22a) using zram and zram_control sysfs attributes 23b) using zramctl utility, provided by util-linux (util-linux@vger.kernel.org). 24 25In this document we will describe only 'manual' zram configuration steps, 26IOW, zram and zram_control sysfs attributes. 27 28In order to get a better idea about zramctl please consult util-linux 29documentation, zramctl man-page or `zramctl --help`. Please be informed 30that zram maintainers do not develop/maintain util-linux or zramctl, should 31you have any questions please contact util-linux@vger.kernel.org 32 33Following shows a typical sequence of steps for using zram. 34 35WARNING 36======= 37 38For the sake of simplicity we skip error checking parts in most of the 39examples below. However, it is your sole responsibility to handle errors. 40 41zram sysfs attributes always return negative values in case of errors. 42The list of possible return codes: 43 44======== ============================================================= 45-EBUSY an attempt to modify an attribute that cannot be changed once 46 the device has been initialised. Please reset device first. 47-ENOMEM zram was not able to allocate enough memory to fulfil your 48 needs. 49-EINVAL invalid input has been provided. 50-EAGAIN re-try operation later (e.g. when attempting to run recompress 51 and writeback simultaneously). 52======== ============================================================= 53 54If you use 'echo', the returned value is set by the 'echo' utility, 55and, in general case, something like:: 56 57 echo foo > /sys/block/zram0/comp_algorithm 58 if [ $? -ne 0 ]; then 59 handle_error 60 fi 61 62should suffice. 63 641) Load Module 65============== 66 67:: 68 69 modprobe zram num_devices=4 70 71This creates 4 devices: /dev/zram{0,1,2,3} 72 73num_devices parameter is optional and tells zram how many devices should be 74pre-created. Default: 1. 75 762) Select compression algorithm 77=============================== 78 79Using comp_algorithm device attribute one can see available and 80currently selected (shown in square brackets) compression algorithms, 81or change the selected compression algorithm (once the device is initialised 82there is no way to change compression algorithm). 83 84Examples:: 85 86 #show supported compression algorithms 87 cat /sys/block/zram0/comp_algorithm 88 lzo [lz4] 89 90 #select lzo compression algorithm 91 echo lzo > /sys/block/zram0/comp_algorithm 92 93For the time being, the `comp_algorithm` content shows only compression 94algorithms that are supported by zram. 95 963) Set compression algorithm parameters: Optional 97================================================= 98 99Compression algorithms may support specific parameters which can be 100tweaked for particular dataset. ZRAM has an `algorithm_params` device 101attribute which provides a per-algorithm params configuration. 102 103For example, several compression algorithms support `level` parameter. 104In addition, certain compression algorithms support pre-trained dictionaries, 105which significantly change algorithms' characteristics. In order to configure 106compression algorithm to use external pre-trained dictionary, pass full 107path to the `dict` along with other parameters:: 108 109 #pass path to pre-trained zstd dictionary 110 echo "algo=zstd dict=/etc/dictionary" > /sys/block/zram0/algorithm_params 111 112 #same, but using algorithm priority 113 echo "priority=1 dict=/etc/dictionary" > \ 114 /sys/block/zram0/algorithm_params 115 116 #pass path to pre-trained zstd dictionary and compression level 117 echo "algo=zstd level=8 dict=/etc/dictionary" > \ 118 /sys/block/zram0/algorithm_params 119 120Parameters are algorithm specific: not all algorithms support pre-trained 121dictionaries, not all algorithms support `level`. Furthermore, for certain 122algorithms `level` controls the compression level (the higher the value the 123better the compression ratio, it even can take negatives values for some 124algorithms), for other algorithms `level` is acceleration level (the higher 125the value the lower the compression ratio). 126 1274) Set Disksize 128=============== 129 130Set disk size by writing the value to sysfs node 'disksize'. 131The value can be either in bytes or you can use mem suffixes. 132Examples:: 133 134 # Initialize /dev/zram0 with 50MB disksize 135 echo $((50*1024*1024)) > /sys/block/zram0/disksize 136 137 # Using mem suffixes 138 echo 256K > /sys/block/zram0/disksize 139 echo 512M > /sys/block/zram0/disksize 140 echo 1G > /sys/block/zram0/disksize 141 142Note: 143There is little point creating a zram of greater than twice the size of memory 144since we expect a 2:1 compression ratio. Note that zram uses about 0.1% of the 145size of the disk when not in use so a huge zram is wasteful. 146 1475) Set memory limit: Optional 148============================= 149 150Set memory limit by writing the value to sysfs node 'mem_limit'. 151The value can be either in bytes or you can use mem suffixes. 152In addition, you could change the value in runtime. 153Examples:: 154 155 # limit /dev/zram0 with 50MB memory 156 echo $((50*1024*1024)) > /sys/block/zram0/mem_limit 157 158 # Using mem suffixes 159 echo 256K > /sys/block/zram0/mem_limit 160 echo 512M > /sys/block/zram0/mem_limit 161 echo 1G > /sys/block/zram0/mem_limit 162 163 # To disable memory limit 164 echo 0 > /sys/block/zram0/mem_limit 165 1666) Activate 167=========== 168 169:: 170 171 mkswap /dev/zram0 172 swapon /dev/zram0 173 174 mkfs.ext4 /dev/zram1 175 mount /dev/zram1 /tmp 176 1777) Add/remove zram devices 178========================== 179 180zram provides a control interface, which enables dynamic (on-demand) device 181addition and removal. 182 183In order to add a new /dev/zramX device, perform a read operation on the hot_add 184attribute. This will return either the new device's device id (meaning that you 185can use /dev/zram<id>) or an error code. 186 187Example:: 188 189 cat /sys/class/zram-control/hot_add 190 1 191 192To remove the existing /dev/zramX device (where X is a device id) 193execute:: 194 195 echo X > /sys/class/zram-control/hot_remove 196 1978) Stats 198======== 199 200Per-device statistics are exported as various nodes under /sys/block/zram<id>/ 201 202A brief description of exported device attributes follows. For more details 203please read Documentation/ABI/testing/sysfs-block-zram. 204 205====================== ====== =============================================== 206Name access description 207====================== ====== =============================================== 208disksize RW show and set the device's disk size 209initstate RO shows the initialization state of the device 210reset WO trigger device reset 211mem_used_max WO reset the `mem_used_max` counter (see later) 212mem_limit WO specifies the maximum amount of memory ZRAM can 213 use to store the compressed data 214writeback_limit WO specifies the maximum amount of write IO zram 215 can write out to backing device as 4KB unit 216writeback_limit_enable RW show and set writeback_limit feature 217comp_algorithm RW show and change the compression algorithm 218algorithm_params WO setup compression algorithm parameters 219compact WO trigger memory compaction 220debug_stat RO this file is used for zram debugging purposes 221backing_dev RW set up backend storage for zram to write out 222idle WO mark allocated slot as idle 223====================== ====== =============================================== 224 225 226User space is advised to use the following files to read the device statistics. 227 228File /sys/block/zram<id>/stat 229 230Represents block layer statistics. Read Documentation/block/stat.rst for 231details. 232 233File /sys/block/zram<id>/io_stat 234 235The stat file represents device's I/O statistics not accounted by block 236layer and, thus, not available in zram<id>/stat file. It consists of a 237single line of text and contains the following stats separated by 238whitespace: 239 240 ============= ============================================================= 241 failed_reads The number of failed reads 242 failed_writes The number of failed writes 243 invalid_io The number of non-page-size-aligned I/O requests 244 notify_free Depending on device usage scenario it may account 245 246 a) the number of pages freed because of swap slot free 247 notifications 248 b) the number of pages freed because of 249 REQ_OP_DISCARD requests sent by bio. The former ones are 250 sent to a swap block device when a swap slot is freed, 251 which implies that this disk is being used as a swap disk. 252 253 The latter ones are sent by filesystem mounted with 254 discard option, whenever some data blocks are getting 255 discarded. 256 ============= ============================================================= 257 258File /sys/block/zram<id>/mm_stat 259 260The mm_stat file represents the device's mm statistics. It consists of a single 261line of text and contains the following stats separated by whitespace: 262 263 ================ ============================================================= 264 orig_data_size uncompressed size of data stored in this disk. 265 Unit: bytes 266 compr_data_size compressed size of data stored in this disk 267 mem_used_total the amount of memory allocated for this disk. This 268 includes allocator fragmentation and metadata overhead, 269 allocated for this disk. So, allocator space efficiency 270 can be calculated using compr_data_size and this statistic. 271 Unit: bytes 272 mem_limit the maximum amount of memory ZRAM can use to store 273 the compressed data 274 mem_used_max the maximum amount of memory zram has consumed to 275 store the data 276 same_pages the number of same element filled pages written to this disk. 277 No memory is allocated for such pages. 278 pages_compacted the number of pages freed during compaction 279 huge_pages the number of incompressible pages 280 huge_pages_since the number of incompressible pages since zram set up 281 ================ ============================================================= 282 283File /sys/block/zram<id>/bd_stat 284 285The bd_stat file represents a device's backing device statistics. It consists of 286a single line of text and contains the following stats separated by whitespace: 287 288 ============== ============================================================= 289 bd_count size of data written in backing device. 290 Unit: 4K bytes 291 bd_reads the number of reads from backing device 292 Unit: 4K bytes 293 bd_writes the number of writes to backing device 294 Unit: 4K bytes 295 ============== ============================================================= 296 2979) Deactivate 298============== 299 300:: 301 302 swapoff /dev/zram0 303 umount /dev/zram1 304 30510) Reset 306========= 307 308 Write any positive value to 'reset' sysfs node:: 309 310 echo 1 > /sys/block/zram0/reset 311 echo 1 > /sys/block/zram1/reset 312 313 This frees all the memory allocated for the given device and 314 resets the disksize to zero. You must set the disksize again 315 before reusing the device. 316 317Optional Feature 318================ 319 320writeback 321--------- 322 323With CONFIG_ZRAM_WRITEBACK, zram can write idle/incompressible page 324to backing storage rather than keeping it in memory. 325To use the feature, admin should set up backing device via:: 326 327 echo /dev/sda5 > /sys/block/zramX/backing_dev 328 329before disksize setting. It supports only partitions at this moment. 330If admin wants to use incompressible page writeback, they could do it via:: 331 332 echo huge > /sys/block/zramX/writeback 333 334To use idle page writeback, first, user need to declare zram pages 335as idle:: 336 337 echo all > /sys/block/zramX/idle 338 339From now on, any pages on zram are idle pages. The idle mark 340will be removed until someone requests access of the block. 341IOW, unless there is access request, those pages are still idle pages. 342Additionally, when CONFIG_ZRAM_TRACK_ENTRY_ACTIME is enabled pages can be 343marked as idle based on how long (in seconds) it's been since they were 344last accessed:: 345 346 echo 86400 > /sys/block/zramX/idle 347 348In this example all pages which haven't been accessed in more than 86400 349seconds (one day) will be marked idle. 350 351Admin can request writeback of those idle pages at right timing via:: 352 353 echo idle > /sys/block/zramX/writeback 354 355With the command, zram will writeback idle pages from memory to the storage. 356 357Additionally, if a user choose to writeback only huge and idle pages 358this can be accomplished with:: 359 360 echo huge_idle > /sys/block/zramX/writeback 361 362If a user chooses to writeback only incompressible pages (pages that none of 363algorithms can compress) this can be accomplished with:: 364 365 echo incompressible > /sys/block/zramX/writeback 366 367If an admin wants to write a specific page in zram device to the backing device, 368they could write a page index into the interface:: 369 370 echo "page_index=1251" > /sys/block/zramX/writeback 371 372In Linux 6.16 this interface underwent some rework. First, the interface 373now supports `key=value` format for all of its parameters (`type=huge_idle`, 374etc.) Second, the support for `page_indexes` was introduced, which specify 375`LOW-HIGH` range (or ranges) of pages to be written-back. This reduces the 376number of syscalls, but more importantly this enables optimal post-processing 377target selection strategy. Usage example:: 378 379 echo "type=idle" > /sys/block/zramX/writeback 380 echo "page_indexes=1-100 page_indexes=200-300" > \ 381 /sys/block/zramX/writeback 382 383We also now permit multiple page_index params per call and a mix of 384single pages and page ranges:: 385 386 echo page_index=42 page_index=99 page_indexes=100-200 \ 387 page_indexes=500-700 > /sys/block/zramX/writeback 388 389If there are lots of write IO with flash device, potentially, it has 390flash wearout problem so that admin needs to design write limitation 391to guarantee storage health for entire product life. 392 393To overcome the concern, zram supports "writeback_limit" feature. 394The "writeback_limit_enable"'s default value is 0 so that it doesn't limit 395any writeback. IOW, if admin wants to apply writeback budget, they should 396enable writeback_limit_enable via:: 397 398 $ echo 1 > /sys/block/zramX/writeback_limit_enable 399 400Once writeback_limit_enable is set, zram doesn't allow any writeback 401until admin sets the budget via /sys/block/zramX/writeback_limit. 402 403(If admin doesn't enable writeback_limit_enable, writeback_limit's value 404assigned via /sys/block/zramX/writeback_limit is meaningless.) 405 406If admin wants to limit writeback as per-day 400M, they could do it 407like below:: 408 409 $ MB_SHIFT=20 410 $ 4K_SHIFT=12 411 $ echo $((400<<MB_SHIFT>>4K_SHIFT)) > \ 412 /sys/block/zram0/writeback_limit. 413 $ echo 1 > /sys/block/zram0/writeback_limit_enable 414 415If admins want to allow further write again once the budget is exhausted, 416they could do it like below:: 417 418 $ echo $((400<<MB_SHIFT>>4K_SHIFT)) > \ 419 /sys/block/zram0/writeback_limit 420 421If an admin wants to see the remaining writeback budget since last set:: 422 423 $ cat /sys/block/zramX/writeback_limit 424 425If an admin wants to disable writeback limit, they could do:: 426 427 $ echo 0 > /sys/block/zramX/writeback_limit_enable 428 429The writeback_limit count will reset whenever you reset zram (e.g., 430system reboot, echo 1 > /sys/block/zramX/reset) so keeping how many of 431writeback happened until you reset the zram to allocate extra writeback 432budget in next setting is user's job. 433 434If admin wants to measure writeback count in a certain period, they could 435know it via /sys/block/zram0/bd_stat's 3rd column. 436 437recompression 438------------- 439 440With CONFIG_ZRAM_MULTI_COMP, zram can recompress pages using alternative 441(secondary) compression algorithms. The basic idea is that alternative 442compression algorithm can provide better compression ratio at a price of 443(potentially) slower compression/decompression speeds. Alternative compression 444algorithm can, for example, be more successful compressing huge pages (those 445that default algorithm failed to compress). Another application is idle pages 446recompression - pages that are cold and sit in the memory can be recompressed 447using more effective algorithm and, hence, reduce zsmalloc memory usage. 448 449With CONFIG_ZRAM_MULTI_COMP, zram supports up to 4 compression algorithms: 450one primary and up to 3 secondary ones. Primary zram compressor is explained 451in "3) Select compression algorithm", secondary algorithms are configured 452using recomp_algorithm device attribute. 453 454Example::: 455 456 #show supported recompression algorithms 457 cat /sys/block/zramX/recomp_algorithm 458 #1: lzo lzo-rle lz4 lz4hc [zstd] 459 #2: lzo lzo-rle lz4 [lz4hc] zstd 460 461Alternative compression algorithms are sorted by priority. In the example 462above, zstd is used as the first alternative algorithm, which has priority 463of 1, while lz4hc is configured as a compression algorithm with priority 2. 464Alternative compression algorithm's priority is provided during algorithms 465configuration::: 466 467 #select zstd recompression algorithm, priority 1 468 echo "algo=zstd priority=1" > /sys/block/zramX/recomp_algorithm 469 470 #select deflate recompression algorithm, priority 2 471 echo "algo=deflate priority=2" > /sys/block/zramX/recomp_algorithm 472 473Another device attribute that CONFIG_ZRAM_MULTI_COMP enables is recompress, 474which controls recompression. 475 476Examples::: 477 478 #IDLE pages recompression is activated by `idle` mode 479 echo "type=idle" > /sys/block/zramX/recompress 480 481 #HUGE pages recompression is activated by `huge` mode 482 echo "type=huge" > /sys/block/zram0/recompress 483 484 #HUGE_IDLE pages recompression is activated by `huge_idle` mode 485 echo "type=huge_idle" > /sys/block/zramX/recompress 486 487The number of idle pages can be significant, so user-space can pass a size 488threshold (in bytes) to the recompress knob: zram will recompress only pages 489of equal or greater size::: 490 491 #recompress all pages larger than 3000 bytes 492 echo "threshold=3000" > /sys/block/zramX/recompress 493 494 #recompress idle pages larger than 2000 bytes 495 echo "type=idle threshold=2000" > /sys/block/zramX/recompress 496 497It is also possible to limit the number of pages zram re-compression will 498attempt to recompress::: 499 500 echo "type=huge_idle max_pages=42" > /sys/block/zramX/recompress 501 502Recompression of idle pages requires memory tracking. 503 504During re-compression for every page, that matches re-compression criteria, 505ZRAM iterates the list of registered alternative compression algorithms in 506order of their priorities. ZRAM stops either when re-compression was 507successful (re-compressed object is smaller in size than the original one) 508and matches re-compression criteria (e.g. size threshold) or when there are 509no secondary algorithms left to try. If none of the secondary algorithms can 510successfully re-compressed the page such a page is marked as incompressible, 511so ZRAM will not attempt to re-compress it in the future. 512 513This re-compression behaviour, when it iterates through the list of 514registered compression algorithms, increases our chances of finding the 515algorithm that successfully compresses a particular page. Sometimes, however, 516it is convenient (and sometimes even necessary) to limit recompression to 517only one particular algorithm so that it will not try any other algorithms. 518This can be achieved by providing a `algo` or `priority` parameter::: 519 520 #use zstd algorithm only (if registered) 521 echo "type=huge algo=zstd" > /sys/block/zramX/recompress 522 523 #use zstd algorithm only (if zstd was registered under priority 1) 524 echo "type=huge priority=1" > /sys/block/zramX/recompress 525 526memory tracking 527=============== 528 529With CONFIG_ZRAM_MEMORY_TRACKING, user can know information of the 530zram block. It could be useful to catch cold or incompressible 531pages of the process with*pagemap. 532 533If you enable the feature, you could see block state via 534/sys/kernel/debug/zram/zram0/block_state". The output is as follows:: 535 536 300 75.033841 .wh... 537 301 63.806904 s..... 538 302 63.806919 ..hi.. 539 303 62.801919 ....r. 540 304 146.781902 ..hi.n 541 542First column 543 zram's block index. 544Second column 545 access time since the system was booted 546Third column 547 state of the block: 548 549 s: 550 same page 551 w: 552 written page to backing store 553 h: 554 huge page 555 i: 556 idle page 557 r: 558 recompressed page (secondary compression algorithm) 559 n: 560 none (including secondary) of algorithms could compress it 561 562First line of above example says 300th block is accessed at 75.033841sec 563and the block's state is huge so it is written back to the backing 564storage. It's a debugging feature so anyone shouldn't rely on it to work 565properly. 566 567Nitin Gupta 568ngupta@vflare.org 569