.. SPDX-License-Identifier: GPL-2.0

================================================
ZoneFS - Zone filesystem for Zoned block devices
================================================

Introduction
============

zonefs is a very simple file system exposing each zone of a zoned block device
as a file. Unlike a regular POSIX-compliant file system with native zoned block
device support (e.g. f2fs), zonefs does not hide the sequential write
constraint of zoned block devices from the user. Files representing sequential
write zones of the device must be written sequentially starting from the end
of the file (append only writes).

As such, zonefs is in essence closer to a raw block device access interface
than to a full-featured POSIX file system. The goal of zonefs is to simplify
the implementation of zoned block device support in applications by replacing
raw block device file accesses with a richer file API, avoiding relying on
direct block device file ioctls which may be more obscure to developers.
One example of this approach is the implementation of LSM (log-structured merge)
tree structures (such as used in RocksDB and LevelDB) on zoned block devices
by allowing SSTables to be stored in a zone file similarly to a regular file
system rather than as a range of sectors of the entire disk. The introduction
of the higher level construct "one file is one zone" can help reduce the
amount of changes needed in the application as well as introduce support for
different application programming languages.

Zoned block devices
-------------------

Zoned storage devices belong to a class of storage devices with an address
space that is divided into zones. A zone is a group of consecutive LBAs and all
zones are contiguous (there are no LBA gaps). Zones may have different types.

* Conventional zones: there are no access constraints to LBAs belonging to
  conventional zones. Any read or write access can be executed, similarly to a
  regular block device.
* Sequential zones: these zones accept random reads but must be written
  sequentially. Each sequential zone has a write pointer maintained by the
  device that keeps track of the mandatory start LBA position of the next write
  to the device. As a result of this write constraint, LBAs in a sequential zone
  cannot be overwritten. Sequential zones must first be erased using a special
  command (zone reset) before rewriting.

Zoned storage devices can be implemented using various recording and media
technologies. The most common form of zoned storage today uses the SCSI Zoned
Block Commands (ZBC) and Zoned ATA Commands (ZAC) interfaces on Shingled
Magnetic Recording (SMR) HDDs.

Solid State Disk (SSD) storage devices can also implement a zoned interface
to, for instance, reduce internal write amplification due to garbage collection.
The NVMe Zoned NameSpace (ZNS) is a technical proposal of the NVMe standard
committee aiming at adding a zoned storage interface to the NVMe protocol.

Zonefs Overview
===============

Zonefs exposes the zones of a zoned block device as files. The files
representing zones are grouped by zone type, and each zone type is represented
by a sub-directory. This file structure is built entirely using zone information
provided by the device and so does not require any complex on-disk metadata
structure.

On-disk metadata
----------------

zonefs on-disk metadata is reduced to an immutable super block which
persistently stores a magic number and optional feature flags and values. On
mount, zonefs uses blkdev_report_zones() to obtain the device zone configuration
and populates the mount point with a static file tree solely based on this
information. File sizes come from the device zone type and write pointer
position managed by the device itself.

The super block is always written on disk at sector 0. The first zone of the
device storing the super block is never exposed as a zone file by zonefs. If
the zone containing the super block is a sequential zone, the mkzonefs format
tool always "finishes" the zone, that is, it transitions the zone to a full
state to make it read-only, preventing any data write.

Zone type sub-directories
-------------------------

Files representing zones of the same type are grouped together under the same
sub-directory automatically created on mount.

For conventional zones, the sub-directory "cnv" is used. This directory is
however created if and only if the device has usable conventional zones.
If the device only has a single conventional zone at sector 0, the zone will not
be exposed as a file as it will be used to store the zonefs super block. For
such devices, the "cnv" sub-directory will not be created.

For sequential write zones, the sub-directory "seq" is used.

These two directories are the only directories that exist in zonefs. Users
cannot create other directories and cannot rename or delete the "cnv" and
"seq" sub-directories.

The size of each directory, as reported in the st_size field of struct stat
obtained with the stat() or fstat() system calls, indicates the number of files
existing under the directory.

Zone files
----------

Zone files are named using the number of the zone they represent within the set
of zones of a particular type. That is, both the "cnv" and "seq" directories
contain files named "0", "1", "2", ... The file numbers also represent
increasing zone start sector on the device.

Read and write operations on zone files are not allowed beyond the file
maximum size, that is, beyond the zone capacity. Any access exceeding the zone
capacity fails with the -EFBIG error.
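
As a minimal sketch of the directory size convention described in the previous
section, the number of zone files of a given type can be read from st_size
without listing the directory. The following shell function is hypothetical and
only for illustration; it assumes the GNU coreutils "stat -c" syntax::

    # Print the number of zone files in a zonefs sub-directory.
    # On zonefs, st_size of "cnv" and "seq" is the number of files they hold.
    # On other file systems the value printed has a different meaning.
    zonefs_nr_files() {
        stat -c %s "$1"
    }

With the drive used in the Examples section and a /mnt mount point,
"zonefs_nr_files /mnt/seq" would be expected to print 55356.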

Creating, deleting, renaming or modifying any attribute of files and
sub-directories is not allowed.

The number of blocks of a file as reported by stat() and fstat() indicates the
capacity of the zone file, or in other words, the maximum file size.

Conventional zone files
-----------------------

The size of conventional zone files is fixed to the size of the zone they
represent. Conventional zone files cannot be truncated.

These files can be randomly read and written using any type of I/O operation:
buffered I/Os, direct I/Os, memory mapped I/Os (mmap), etc. There are no I/O
constraints for these files beyond the file size limit mentioned above.

Sequential zone files
---------------------

The size of sequential zone files grouped in the "seq" sub-directory represents
the file's zone write pointer position relative to the zone start sector.

Sequential zone files can only be written sequentially, starting from the file
end, that is, write operations can only be append writes.
Zonefs makes no
attempt at accepting random writes and will fail any write request that has a
start offset not corresponding to the end of the file, or to the end of the last
write issued and still in-flight (for asynchronous I/O operations).

Since dirty page writeback by the page cache does not guarantee a sequential
write pattern, zonefs prevents buffered writes and writable shared mappings
on sequential files. Only direct I/O writes are accepted for these files.
zonefs relies on the sequential delivery of write I/O requests to the device
implemented by the block layer elevator. An elevator implementing the sequential
write feature for zoned block devices (ELEVATOR_F_ZBD_SEQ_WRITE elevator
feature) must be used. This type of elevator (e.g. mq-deadline) is set by
default for zoned block devices on device initialization.

There are no restrictions on the type of I/O used for read operations in
sequential zone files. Buffered I/Os, direct I/Os and shared read mappings are
all accepted.

Truncating sequential zone files is allowed only down to 0, in which case the
zone is reset to rewind the file zone write pointer position to the start of
the zone, or up to the zone capacity, in which case the file's zone is
transitioned to the FULL state (finish zone operation).
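
The write offset and truncation rules of this section can be summarized with
the following shell sketch. It is a simplified model for illustration only: the
function names are hypothetical, in-flight asynchronous writes and direct I/O
alignment requirements are ignored, and this is not the check implemented by
the kernel::

    # Succeed if a write of LEN bytes at OFFSET is acceptable for a sequential
    # zone file of size SIZE and capacity CAP (all values in bytes): the write
    # must start at the end of the file and must not exceed the zone capacity.
    seq_write_ok() {
        offset=$1 len=$2 size=$3 cap=$4
        [ "$offset" -eq "$size" ] && [ $((offset + len)) -le "$cap" ]
    }

    # Succeed if SIZE is an allowed truncation size for a sequential zone
    # file of capacity CAP: 0 (zone reset) or CAP (zone finish).
    seq_truncate_ok() {
        size=$1 cap=$2
        [ "$size" -eq 0 ] || [ "$size" -eq "$cap" ]
    }

For a 256 MB zone file currently holding 4096 bytes,
"seq_write_ok 4096 4096 4096 268435456" succeeds, while a write at offset 0 or
a truncation to 4096 bytes is rejected by these checks.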

Format options
--------------

Several optional features of zonefs can be enabled at format time.

* Conventional zone aggregation: ranges of contiguous conventional zones can be
  aggregated into a single larger file instead of the default one file per zone.
* File ownership: The owner UID and GID of zone files is by default 0 (root)
  but can be changed to any valid UID/GID.
* File access permissions: the default 640 access permissions can be changed.

IO error handling
-----------------

Zoned block devices may fail I/O requests for reasons similar to regular block
devices, e.g. due to bad sectors. However, in addition to such known I/O
failure patterns, the standards governing zoned block device behavior define
additional conditions that result in I/O errors.

* A zone may transition to the read-only condition (BLK_ZONE_COND_READONLY):
  While the data already written in the zone is still readable, the zone can
  no longer be written. No user action on the zone (zone management command or
  read/write access) can change the zone condition back to a normal read/write
  state. While the reasons for the device to transition a zone to read-only
  state are not defined by the standards, a typical cause for such a transition
  would be a defective write head on an HDD (all zones under this head are
  changed to read-only).

* A zone may transition to the offline condition (BLK_ZONE_COND_OFFLINE):
  An offline zone can be neither read nor written. No user action can
  transition an offline zone back to an operational good state. Similarly to
  zone read-only transitions, the reasons for a drive to transition a zone to
  the offline condition are undefined. A typical cause would be a defective
  read-write head on an HDD causing all zones on the platter under the broken
  head to be inaccessible.

* Unaligned write errors: These errors result from the host issuing write
  requests with a start sector that does not correspond to a zone write pointer
  position when the write request is executed by the device. Even though zonefs
  enforces sequential file write for sequential zones, unaligned write errors
  may still happen in the case of a partial failure of a very large direct I/O
  operation split into multiple BIOs/requests or asynchronous I/O operations.
  If one of the write requests within the set of sequential write requests
  issued to the device fails, all write requests queued after it will
  become unaligned and fail.

* Delayed write errors: similarly to regular block devices, if the device side
  write cache is enabled, write errors may occur in ranges of previously
  completed writes when the device write cache is flushed, e.g. on fsync().
  Similarly to the previous immediate unaligned write error case, delayed write
  errors can propagate through a stream of cached sequential data for a zone
  causing all data to be dropped after the sector that caused the error.

All I/O errors detected by zonefs are reported to the user with an error code
returned by the system call that triggered or detected the error. The recovery
actions taken by zonefs in response to I/O errors depend on the I/O type (read
vs write) and on the reason for the error (bad sector, unaligned writes or zone
condition change).

* For read I/O errors, zonefs does not execute any particular recovery action,
  but only if the file zone is still in a good condition and there is no
  inconsistency between the file inode size and its zone write pointer position.
  If a problem is detected, I/O error recovery is executed (see the table
  below).

* For write I/O errors, zonefs I/O error recovery is always executed.

* A zone condition change to read-only or offline also always triggers zonefs
  I/O error recovery.

Zonefs minimal I/O error recovery may change a file size and file access
permissions.

* File size changes:
  Immediate or delayed write errors in a sequential zone file may cause the file
  inode size to be inconsistent with the amount of data successfully written in
  the file zone. For instance, the partial failure of a multi-BIO large write
  operation will cause the zone write pointer to advance partially, even though
  the entire write operation will be reported as failed to the user. In such a
  case, the file inode size must be advanced to reflect the zone write pointer
  change and eventually allow the user to restart writing at the end of the
  file.
  A file size may also be reduced to reflect a delayed write error detected on
  fsync(): in this case, the amount of data effectively written in the zone may
  be less than originally indicated by the file inode size. After such an I/O
  error, zonefs always fixes the file inode size to reflect the amount of data
  persistently stored in the file zone.

* Access permission changes:
  A zone condition change to read-only is indicated with a change in the file
  access permissions to render the file read-only. This disables changes to the
  file attributes and data modification. For offline zones, all permissions
  (read and write) to the file are disabled.

Further action taken by zonefs I/O error recovery can be controlled by the user
with the "errors=xxx" mount option. The table below summarizes the result of
zonefs I/O error processing depending on the mount option and on the zone
conditions::

    +--------------+-----------+-----------------------------------------+
    |              |           |            Post error state             |
    | "errors=xxx" |  device   |                 access permissions      |
    |    mount     |   zone    | file         file          device zone  |
    |    option    | condition | size     read    write    read    write |
    +--------------+-----------+-----------------------------------------+
    |              | good      | fixed    yes     no       yes     yes   |
    | remount-ro   | read-only | as is    yes     no       yes     no    |
    | (default)    | offline   |   0      no      no       no      no    |
    +--------------+-----------+-----------------------------------------+
    |              | good      | fixed    yes     no       yes     yes   |
    | zone-ro      | read-only | as is    yes     no       yes     no    |
    |              | offline   |   0      no      no       no      no    |
    +--------------+-----------+-----------------------------------------+
    |              | good      |   0      no      no       yes     yes   |
    | zone-offline | read-only |   0      no      no       yes     no    |
    |              | offline   |   0      no      no       no      no    |
    +--------------+-----------+-----------------------------------------+
    |              | good      | fixed    yes     yes      yes     yes   |
    | repair       | read-only | as is    yes     no       yes     no    |
    |              | offline   |   0      no      no       no      no    |
    +--------------+-----------+-----------------------------------------+

Further notes:

* The "errors=remount-ro" mount option is the default behavior of zonefs I/O
  error processing if no errors mount option is specified.
* With the "errors=remount-ro" mount option, the change of the file access
  permissions to read-only applies to all files. The file system is remounted
  read-only.
* Access permission and file size changes due to the device transitioning zones
  to the offline condition are permanent. Remounting or reformatting the device
  with mkfs.zonefs (mkzonefs) will not change back offline zone files to a good
  state.
* File access permission changes to read-only due to the device transitioning
  zones to the read-only condition are permanent. Remounting or reformatting
  the device will not re-enable file write access.
* File access permission changes implied by the remount-ro, zone-ro and
  zone-offline mount options are temporary for zones in a good condition.
  Unmounting and remounting the file system will restore the previous default
  (format time values) access rights to the files affected.
* The repair mount option triggers only the minimal set of I/O error recovery
  actions, that is, file size fixes for zones in a good condition. Zones
  indicated as being read-only or offline by the device still imply changes to
  the zone file access permissions as noted in the table above.

Mount options
-------------

zonefs defines several mount options:

* errors=<behavior>
* explicit-open

"errors=<behavior>" option
~~~~~~~~~~~~~~~~~~~~~~~~~~

The "errors=<behavior>" mount option allows the user to specify zonefs
behavior in response to I/O errors, inode size inconsistencies or zone
condition changes.
The defined behaviors are as follows:

* remount-ro (default)
* zone-ro
* zone-offline
* repair

The run-time I/O error actions defined for each behavior are detailed in the
previous section. Mount time I/O errors will cause the mount operation to fail.
The handling of read-only zones also differs between mount-time and run-time.
If a read-only zone is found at mount time, the zone is always treated in the
same manner as offline zones, that is, all accesses are disabled and the zone
file size set to 0. This is necessary as the write pointer of read-only zones
is defined as invalid by the ZBC and ZAC standards, making it impossible to
discover the amount of data that has been written to the zone. In the case of a
read-only zone discovered at run-time, as indicated in the previous section,
the size of the zone file is left unchanged from its last updated value.

"explicit-open" option
~~~~~~~~~~~~~~~~~~~~~~

A zoned block device (e.g. an NVMe Zoned Namespace device) may have limits on
the number of zones that can be active, that is, zones that are in the
implicit open, explicit open or closed conditions.
This potential limitation
translates into a risk for applications to see write IO errors due to this
limit being exceeded if the zone of a file is not already active when a write
request is issued by the user.

To avoid these potential errors, the "explicit-open" mount option forces zones
to be made active using an open zone command when a file is opened for writing
for the first time. If the zone open command succeeds, the application is then
guaranteed that write requests can be processed. Conversely, the
"explicit-open" mount option will result in a zone close command being issued
to the device on the last close() of a zone file if the zone is neither full
nor empty.

Zonefs User Space Tools
=======================

The mkzonefs tool is used to format zoned block devices for use with zonefs.
This tool is available on GitHub at:

https://github.com/damien-lemoal/zonefs-tools

zonefs-tools also includes a test suite which can be run against any zoned
block device, including a null_blk block device created in zoned mode.
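
For testing without real hardware, a zoned null_blk device can be created,
formatted and mounted along the following lines. This is a sketch under stated
assumptions: the "run" and "setup_zoned_nullb" helpers are hypothetical, the
device size, zone size and number of conventional zones are arbitrary values,
and the commands require root privileges::

    # Hypothetical helper: print the command instead of running it when the
    # DRY_RUN variable is set, otherwise execute it.
    run() {
        if [ -n "${DRY_RUN:-}" ]; then
            echo "$*"
        else
            "$@"
        fi
    }

    # Create a zoned null_blk device (assumed here: 4 GB, 64 MB zones, 4
    # conventional zones), format it with mkzonefs and mount it on $1.
    setup_zoned_nullb() {
        mnt="${1:-/mnt}"
        run modprobe null_blk nr_devices=1 zoned=1 gb=4 zone_size=64 zone_nr_conv=4 &&
        run mkzonefs /dev/nullb0 &&
        run mount -t zonefs /dev/nullb0 "$mnt"
    }

Running "DRY_RUN=1 setup_zoned_nullb /mnt" only prints the three commands,
which makes it possible to review them before executing anything as root.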

Examples
--------

The following formats a 15TB host-managed SMR HDD with 256 MB zones
with the conventional zones aggregation feature enabled::

    # mkzonefs -o aggr_cnv /dev/sdX
    # mount -t zonefs /dev/sdX /mnt
    # ls -l /mnt/
    total 0
    dr-xr-xr-x 2 root root     1 Nov 25 13:23 cnv
    dr-xr-xr-x 2 root root 55356 Nov 25 13:23 seq

The size of each zone file sub-directory indicates the number of files
existing for each type of zone. In this example, there is only one
conventional zone file (all conventional zones are aggregated under a single
file)::

    # ls -l /mnt/cnv
    total 137101312
    -rw-r----- 1 root root 140391743488 Nov 25 13:23 0

This aggregated conventional zone file can be used as a regular file::

    # mkfs.ext4 /mnt/cnv/0
    # mount -o loop /mnt/cnv/0 /data

The "seq" sub-directory grouping files for sequential write zones has in this
example 55356 zones::

    # ls -lv /mnt/seq
    total 14511243264
    -rw-r----- 1 root root 0 Nov 25 13:23 0
    -rw-r----- 1 root root 0 Nov 25 13:23 1
    -rw-r----- 1 root root 0 Nov 25 13:23 2
    ...
    -rw-r----- 1 root root 0 Nov 25 13:23 55354
    -rw-r----- 1 root root 0 Nov 25 13:23 55355

For sequential write zone files, the file size changes as data is appended at
the end of the file, similarly to any regular file system::

    # dd if=/dev/zero of=/mnt/seq/0 bs=4096 count=1 conv=notrunc oflag=direct
    1+0 records in
    1+0 records out
    4096 bytes (4.1 kB, 4.0 KiB) copied, 0.00044121 s, 9.3 MB/s

    # ls -l /mnt/seq/0
    -rw-r----- 1 root root 4096 Nov 25 13:23 /mnt/seq/0

The written file can be truncated to the zone size, preventing any further
write operation::

    # truncate -s 268435456 /mnt/seq/0
    # ls -l /mnt/seq/0
    -rw-r----- 1 root root 268435456 Nov 25 13:49 /mnt/seq/0

Truncation to 0 size allows freeing the file zone storage space and restarting
append-writes to the file::

    # truncate -s 0 /mnt/seq/0
    # ls -l /mnt/seq/0
    -rw-r----- 1 root root 0 Nov 25 13:49 /mnt/seq/0

Since files are statically mapped to zones on the disk, the number of blocks
of a file as reported by stat() and fstat() indicates the capacity of the file
zone::

    # stat /mnt/seq/0
      File: /mnt/seq/0
      Size: 0               Blocks: 524288     IO Block: 4096   regular empty file
    Device: 870h/2160d      Inode: 50431       Links: 1
    Access: (0640/-rw-r-----)  Uid: (    0/    root)   Gid: (    0/    root)
    Access: 2019-11-25 13:23:57.048971997 +0900
    Modify: 2019-11-25 13:52:25.553805765 +0900
    Change: 2019-11-25 13:52:25.553805765 +0900
     Birth: -

The number of blocks of the file ("Blocks") in units of 512B blocks gives the
maximum file size of 524288 * 512 B = 256 MB, corresponding to the device zone
capacity in this example. Of note is that the "IO Block" field always
indicates the minimum I/O size for writes and corresponds to the device
physical sector size.
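
The capacity arithmetic above can be scripted as follows. This is a minimal
sketch for illustration: the function names are hypothetical, "stat -c" is the
GNU coreutils syntax, and the 512 B block unit is the one stated above::

    # Convert a block count, reported by stat() in 512 B units, to bytes.
    blocks_to_bytes() {
        echo $(( $1 * 512 ))
    }

    # Print the capacity in bytes of a zone file (%b is the block count).
    zone_capacity() {
        blocks_to_bytes "$(stat -c %b "$1")"
    }

    # Print the number of bytes that can still be appended to a sequential
    # zone file: its capacity minus its current size (%s).
    zone_remaining() {
        echo $(( $(zone_capacity "$1") - $(stat -c %s "$1") ))
    }

For the freshly reset /mnt/seq/0 file of this example (size 0, 524288 blocks),
both "zone_capacity /mnt/seq/0" and "zone_remaining /mnt/seq/0" would print
268435456.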