
/*
 * Encapsulation of namei parameters.
 * One of these is located in the u. area to
 * minimize space allocated on the kernel stack
 * and to retain per-process context.
 */
struct nameidata {
 /* arguments to namei and related context: */
 caddr_t ni_dirp; /* pathname pointer */
 enum uio_seg ni_seg; /* location of pathname */
 short ni_nameiop; /* see below */
 struct vnode *ni_cdir; /* current directory */
 struct vnode *ni_rdir; /* root directory, if not normal root */
 struct ucred *ni_cred; /* credentials */

 /* shared between namei, lookup routines and commit routines: */
 caddr_t ni_pnbuf; /* pathname buffer */
 char *ni_ptr; /* current location in pathname */
 int ni_pathlen; /* remaining chars in path */
 short ni_more; /* more left to translate in pathname */
 short ni_loopcnt; /* count of symlinks encountered */

 /* results: */
 struct vnode *ni_vp; /* vnode of result */
 struct vnode *ni_dvp; /* vnode of intermediate directory */

/* BEGIN UFS SPECIFIC */
 struct diroffcache { /* last successful directory search */
 struct vnode *nc_prevdir; /* terminal directory */
 long nc_id; /* directory's unique id */
 off_t nc_prevoffset; /* where last entry found */
 } ni_nc;
/* END UFS SPECIFIC */
};


/*
 * namei operations and modifiers
 */
#define LOOKUP 0 /* perform name lookup only */
#define CREATE 1 /* setup for file creation */
#define DELETE 2 /* setup for file deletion */
#define WANTPARENT 0x10 /* return parent directory vnode also */
#define NOCACHE 0x20 /* name must not be left in cache */
#define FOLLOW 0x40 /* follow symbolic links */
#define NOFOLLOW 0x0 /* don't follow symbolic links (pseudo) */

As in current systems other than Sun's VFS, namei is called
with an operation request, one of LOOKUP, CREATE or DELETE.
For a LOOKUP, the operation is exactly like the lookup in VFS.
CREATE and DELETE allow the filesystem to ensure consistency
by locking the parent inode (private to the filesystem),
and (for the local filesystem) to avoid duplicate directory scans
by storing the new directory entry and its offset in the directory
in the filesystem-specific portion of the nameidata structure.
This is intended to be opaque to the filesystem-independent levels.
Not all lookups for creation or deletion are actually followed
by the intended operation; permission may be denied, the filesystem
may be read-only, etc.
Therefore, an entry point to the filesystem is provided
to abort a creation or deletion operation
and allow release of any locked internal data.
After a namei with a CREATE or DELETE flag, the pathname pointer
is set to point to the last filename component.
Filesystems that choose to implement creation or deletion entirely
within the subsequent call to a create or delete entry
are thus free to do so.
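
The operation codes and modifier flags listed above share a single word, ni_nameiop: the operations take the small values 0 through 2 and the modifiers begin at 0x10. The following is a minimal sketch of how a caller might separate the two. The mask OPMASK and the helper names nameiop_op and nameiop_has are assumptions introduced for illustration; they are not part of the interface described here.

```c
#include <assert.h>

/* Values copied from the definitions above. */
#define LOOKUP      0
#define CREATE      1
#define DELETE      2
#define WANTPARENT  0x10
#define NOCACHE     0x20
#define FOLLOW      0x40

/* Hypothetical mask: the bits below the first modifier carry the operation. */
#define OPMASK      0x0f

/* Return the operation (LOOKUP, CREATE or DELETE) held in a nameiop word. */
int nameiop_op(short nameiop)
{
    int op = nameiop & OPMASK;

    assert(op <= DELETE);   /* only three operations are defined */
    return op;
}

/* Return nonzero if the given modifier flag is set in a nameiop word. */
int nameiop_has(short nameiop, int flag)
{
    return (nameiop & flag) != 0;
}
```

Because NOFOLLOW is defined as 0, it cannot be tested with nameiop_has; the absence of FOLLOW is what signals it.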

The nameidata is used to store context used during name translation.
The current and root directories for the translation are stored here.
For the local filesystem, the per-process directory offset cache
is also kept here.
A file server could leave the directory offset cache empty,
could use a single cache for all clients,
or could hold caches for several recent clients.

Several other data structures are used in the filesystem operations.
One is the ucred structure which describes a client's credentials
to the filesystem.
This is modified slightly from the Sun structure;
the ``accounting'' group ID has been merged into the groups array.
The actual number of groups in the array is given explicitly
to avoid use of a reserved group ID as a terminator.
Also, typedefs introduced in 4.3BSD for user and group IDs have been used.
The ucred structure is thus:

/*
 * Credentials.
 */
struct ucred {
 u_short cr_ref; /* reference count */
 uid_t cr_uid; /* effective user id */
 short cr_ngroups; /* number of groups */
 gid_t cr_groups[NGROUPS]; /* groups */
 /*
 * The following either should not be here,
 * or should be treated as opaque.
 */
 uid_t cr_ruid; /* real user id */
 gid_t cr_svgid; /* saved set-group id */
};
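
Because the group count is explicit, a membership test needs no terminator value, and group ID 0 can be an ordinary member. The sketch below illustrates this under stated assumptions: the sk_ prefixed types are stand-ins for the real typedefs, SK_NGROUPS is an arbitrary limit chosen for the example, and sk_groupmember is a hypothetical helper, not an entry in the proposed interface.

```c
#include <assert.h>

#define SK_NGROUPS 16                /* assumed limit, for this sketch only */
typedef unsigned short sk_uid_t;     /* stand-in for uid_t */
typedef unsigned short sk_gid_t;     /* stand-in for gid_t */

/* Reduced copy of the ucred fields that matter for group checks. */
struct sk_ucred {
    unsigned short cr_ref;               /* reference count */
    sk_uid_t cr_uid;                     /* effective user id */
    short cr_ngroups;                    /* number of groups */
    sk_gid_t cr_groups[SK_NGROUPS];      /* groups */
};

/* Return 1 if gid is among the first cr_ngroups entries, else 0. */
int sk_groupmember(sk_gid_t gid, const struct sk_ucred *cred)
{
    int i;

    assert(cred->cr_ngroups >= 0 && cred->cr_ngroups <= SK_NGROUPS);
    for (i = 0; i < cred->cr_ngroups; i++)
        if (cred->cr_groups[i] == gid)
            return 1;
    return 0;
}
```

Entries beyond cr_ngroups are never examined, so stale or zero-filled slots in the array cannot be mistaken for memberships.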


A final structure used by the filesystem interface is the uio
structure mentioned earlier.
This structure describes the source or destination of an I/O
operation, with provision for scatter/gather I/O.
It is used in the read and write entries to the filesystem.
The uio structure presented here is modified from the one
used in 4.2BSD to specify the location of each vector of the operation
(user or kernel space)
and to allow an alternate function to be used to implement the data movement.
The alternate function might perform page remapping rather than a copy,
for example.

/*
 * Description of an I/O operation which potentially
 * involves scatter-gather, with individual sections
 * described by iovec, below. uio_resid is initially
 * set to the total size of the operation, and is
 * decremented as the operation proceeds. uio_offset
 * is incremented by the amount of each operation.
 * uio_iov is incremented and uio_iovcnt is decremented
 * after each vector is processed.
 */
struct uio {
 struct iovec *uio_iov;
 int uio_iovcnt;
 off_t uio_offset;
 int uio_resid;
 enum uio_rw uio_rw;
};

enum uio_rw { UIO_READ, UIO_WRITE };


/*
 * Description of a contiguous section of an I/O operation.
 * If iov_op is non-null, it is called to implement the copy
 * operation, possibly by remapping, with the call
 * (*iov_op)(from, to, count);
 * where from and to are caddr_t and count is int.
 * Otherwise, the copy is done in the normal way,
 * treating base as a user or kernel virtual address
 * according to iov_segflg.
 */
struct iovec {
 caddr_t iov_base;
 int iov_len;
 enum uio_seg iov_segflg;
 int (*iov_op)();
};


/*
 * Segment flag values.
 */
enum uio_seg {
 UIO_USERSPACE, /* from user data space */
 UIO_SYSSPACE, /* from system space */
};
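
The bookkeeping described in the uio and iovec comments can be sketched as a single data-movement loop. This is an illustration only, not the kernel routine: the sk_ names are hypothetical, off_t and caddr_t are replaced by long and char *, and the segment flag is ignored because the sketch treats every address as ordinary memory (a kernel would choose a user or kernel copy primitive according to iov_segflg).

```c
#include <assert.h>
#include <string.h>

enum sk_uio_rw  { SK_UIO_READ, SK_UIO_WRITE };
enum sk_uio_seg { SK_UIO_USERSPACE, SK_UIO_SYSSPACE };

struct sk_iovec {
    char *iov_base;
    int iov_len;
    enum sk_uio_seg iov_segflg;   /* ignored in this sketch */
    int (*iov_op)(char *from, char *to, int count);
};

struct sk_uio {
    struct sk_iovec *uio_iov;
    int uio_iovcnt;
    long uio_offset;
    int uio_resid;
    enum sk_uio_rw uio_rw;
};

/* Move up to n bytes between buf and the vectors described by uio. */
int sk_uiomove(char *buf, int n, struct sk_uio *uio)
{
    assert(n >= 0);
    while (n > 0 && uio->uio_resid > 0 && uio->uio_iovcnt > 0) {
        struct sk_iovec *iov = uio->uio_iov;
        int cnt = iov->iov_len < n ? iov->iov_len : n;
        /* A read delivers data to the vector; a write takes data from it. */
        char *from = uio->uio_rw == SK_UIO_READ ? buf : iov->iov_base;
        char *to = uio->uio_rw == SK_UIO_READ ? iov->iov_base : buf;

        if (cnt > 0 && iov->iov_op != NULL) {
            /* Alternate mover, e.g. one that remaps pages. */
            int error = (*iov->iov_op)(from, to, cnt);
            if (error != 0)
                return error;
        } else if (cnt > 0) {
            memcpy(to, from, (size_t)cnt);
        }
        iov->iov_base += cnt;
        iov->iov_len -= cnt;
        uio->uio_resid -= cnt;
        uio->uio_offset += cnt;
        buf += cnt;
        n -= cnt;
        if (iov->iov_len == 0) {      /* vector finished: step to the next */
            uio->uio_iov++;
            uio->uio_iovcnt--;
        }
    }
    return 0;
}
```

A caller that supplies two vectors of 3 and 4 bytes and requests a 7-byte read would see the first vector filled, then the second, with uio_resid reaching 0 and uio_offset advanced by 7.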

File and filesystem operations

Now that the data structures used by the filesystem operations
have been introduced, the complete set of filesystem entry points can be given.
As noted, they derive mostly from the Sun VFS interface.
Lines marked with + are additions to the Sun definitions;
lines marked with ! are modified from VFS.

The structure describing the externally-visible features of a mounted
filesystem, vfs, is:

/*
 * Structure per mounted file system.
 * Each mounted file system has an array of
 * operations and an instance record.
 * The file systems are put on a doubly linked list.
 */
struct vfs {
 struct vfs *vfs_next; /* next vfs in vfs list */
+ struct vfs *vfs_prev; /* prev vfs in vfs list */
 struct vfsops *vfs_op; /* operations on vfs */
 struct vnode *vfs_vnodecovered; /* vnode we mounted on */
 int vfs_flag; /* flags */
! int vfs_fsize; /* fundamental block size */
+ int vfs_bsize; /* optimal transfer size */
! uid_t vfs_exroot; /* exported fs uid 0 mapping */
 short vfs_exflags; /* exported fs flags */
 caddr_t vfs_data; /* private data */
};


 /*
 * vfs flags.
 * VFS_MLOCK lock the vfs so that name lookup cannot proceed past the vfs.
 * This keeps the subtree stable during mounts and unmounts.
 */
 #define VFS_RDONLY 0x01 /* read only vfs */
+ #define VFS_NOEXEC 0x02 /* can't exec from filesystem */
 #define VFS_MLOCK 0x04 /* lock vfs so that subtree is stable */
 #define VFS_MWAIT 0x08 /* someone is waiting for lock */
 #define VFS_NOSUID 0x10 /* don't honor setuid bits on vfs */
 #define VFS_EXPORTED 0x20 /* file system is exported (NFS) */

 /*
 * exported vfs flags.
 */
 #define EX_RDONLY 0x01 /* exported read only */


The operations supported by the filesystem-specific layer
on an individual filesystem are:

/*
 * Operations supported on virtual file system.
 */
struct vfsops {
! int (*vfs_mount)( /* vfs, path, data, datalen */ );
! int (*vfs_unmount)( /* vfs, forcibly */ );
+ int (*vfs_mountroot)();
 int (*vfs_root)( /* vfs, vpp */ );
! int (*vfs_statfs)( /* vfs, vp, sbp */ );
! int (*vfs_sync)( /* vfs, waitfor */ );
+ int (*vfs_fhtovp)( /* vfs, fhp, vpp */ );
+ int (*vfs_vptofh)( /* vp, fhp */ );
};


The vfs_statfs entry returns a structure of the form:

/*
 * file system statistics
 */
struct statfs {
! short f_type; /* type of filesystem */
+ short f_flags; /* copy of vfs (mount) flags */
! long f_fsize; /* fundamental file system block size */
+ long f_bsize; /* optimal transfer block size */
 long f_blocks; /* total data blocks in file system */
 long f_bfree; /* free blocks in fs */
 long f_bavail; /* free blocks avail to non-superuser */
 long f_files; /* total file nodes in file system */
 long f_ffree; /* free file nodes in fs */
 fsid_t f_fsid; /* file system id */
+ char *f_mntonname; /* directory on which mounted */
+ char *f_mntfromname; /* mounted filesystem */
 long f_spare[7]; /* spare for later */
};

typedef long fsid_t[2]; /* file system id type */


The modifications to Sun's interface at this level are minor.
Additional arguments are present for the vfs_mount and vfs_unmount
entries.
vfs_statfs accepts a vnode as well as filesystem identifier,
as the information may not be uniform throughout a filesystem.
For example,
if a client may mount a file tree that spans multiple physical
filesystems on a server, different sections may have different amounts
of free space.
(NFS does not allow remotely-mounted file trees to span physical filesystems
on the server.)
The final additions are the entries that support file handles.
vfs_vptofh is provided for the use of file servers,
which need to obtain an opaque
file handle to represent the current vnode for transmission to clients.
This file handle may later be used to relocate the vnode using vfs_fhtovp
without requiring the vnode to remain in memory.
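
One way a filesystem might satisfy the file handle entries is sketched below. The contents of a file handle are opaque to the interface, so everything here is an assumption made for illustration: the handle layout (a file id plus a generation number), the sk_files table standing in for on-disk state, and the use of ESTALE for a handle that no longer matches. The vfs argument of vfs_fhtovp is omitted because the sketch has only one filesystem.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define SK_NFILES 4

struct sk_vnode   { long fileid; long gen; int in_use; };
struct sk_fhandle { long fh_fileid; long fh_gen; };   /* hypothetical layout */

/* Stand-in for the filesystem's persistent file table. */
struct sk_vnode sk_files[SK_NFILES];

/* Build a handle that identifies vp without holding a reference to it. */
int sk_vptofh(const struct sk_vnode *vp, struct sk_fhandle *fhp)
{
    assert(vp != NULL && fhp != NULL);
    fhp->fh_fileid = vp->fileid;
    fhp->fh_gen = vp->gen;
    return 0;
}

/* Relocate the vnode named by a handle, or report that it is stale. */
int sk_fhtovp(const struct sk_fhandle *fhp, struct sk_vnode **vpp)
{
    struct sk_vnode *vp;

    *vpp = NULL;
    if (fhp->fh_fileid < 0 || fhp->fh_fileid >= SK_NFILES)
        return ESTALE;
    vp = &sk_files[fhp->fh_fileid];
    if (!vp->in_use || vp->gen != fhp->fh_gen)
        return ESTALE;              /* file was removed or its slot reused */
    *vpp = vp;
    return 0;
}
```

The generation check is what lets the server discard the vnode between requests: a handle held by a client either finds the same file again or fails cleanly.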

Finally, the external form of a filesystem object, the vnode, is:

/*
 * vnode types. VNON means no type.
 */
enum vtype { VNON, VREG, VDIR, VBLK, VCHR, VLNK, VSOCK };

struct vnode {
 u_short v_flag; /* vnode flags (see below) */
 u_short v_count; /* reference count */
 u_short v_shlockc; /* count of shared locks */
 u_short v_exlockc; /* count of exclusive locks */
 struct vfs *v_vfsmountedhere; /* ptr to vfs mounted here */
 struct vfs *v_vfsp; /* ptr to vfs we are in */
 struct vnodeops *v_op; /* vnode operations */
+ struct text *v_text; /* text/mapped region */
 enum vtype v_type; /* vnode type */
 caddr_t v_data; /* private data for fs */
};


/*
 * vnode flags.
 */
#define VROOT 0x01 /* root of its file system */
#define VTEXT 0x02 /* vnode is a pure text prototype */
#define VEXLOCK 0x10 /* exclusive lock */
#define VSHLOCK 0x20 /* shared lock */
#define VLWAIT 0x40 /* proc is waiting on shared or excl. lock */


The operations supported by the filesystems on individual vnodes
are:

/*
 * Operations on vnodes.
 */
struct vnodeops {
! int (*vn_lookup)( /* ndp */ );
! int (*vn_create)( /* ndp, vap, fflags */ );
+ int (*vn_mknod)( /* ndp, vap, fflags */ );
! int (*vn_open)( /* vp, fflags, cred */ );
 int (*vn_close)( /* vp, fflags, cred */ );
 int (*vn_access)( /* vp, fflags, cred */ );
 int (*vn_getattr)( /* vp, vap, cred */ );
 int (*vn_setattr)( /* vp, vap, cred */ );

+ int (*vn_read)( /* vp, uiop, offp, ioflag, cred */ );
+ int (*vn_write)( /* vp, uiop, offp, ioflag, cred */ );
! int (*vn_ioctl)( /* vp, com, data, fflag, cred */ );
 int (*vn_select)( /* vp, which, cred */ );
+ int (*vn_mmap)( /* vp, ..., cred */ );
 int (*vn_fsync)( /* vp, cred */ );
+ int (*vn_seek)( /* vp, offp, off, whence */ );

! int (*vn_remove)( /* ndp */ );
! int (*vn_link)( /* vp, ndp */ );
! int (*vn_rename)( /* src ndp, target ndp */ );
! int (*vn_mkdir)( /* ndp, vap */ );
! int (*vn_rmdir)( /* ndp */ );
! int (*vn_symlink)( /* ndp, vap, nm */ );
 int (*vn_readdir)( /* vp, uiop, offp, ioflag, cred */ );
 int (*vn_readlink)( /* vp, uiop, ioflag, cred */ );

+ int (*vn_abortop)( /* ndp */ );
+ int (*vn_lock)( /* vp */ );
+ int (*vn_unlock)( /* vp */ );
! int (*vn_inactive)( /* vp */ );
};


/*
 * flags for ioflag
 */
#define IO_UNIT 0x01 /* do io as atomic unit for VOP_RDWR */
#define IO_APPEND 0x02 /* append write for VOP_RDWR */
#define IO_SYNC 0x04 /* sync io for VOP_RDWR */


The argument types listed in the comments following each operation are:
 ndp
A pointer to a nameidata structure.
 vap
A pointer to a vattr structure (vnode attributes; see below).
 fflags
File open flags, possibly including O_APPEND, O_CREAT, O_TRUNC and O_EXCL.
 vp
A pointer to a vnode previously obtained with vn_lookup.
 cred
A pointer to a ucred credentials structure.
 uiop
A pointer to a uio structure.
 ioflag
Any of the IO flags defined above.
 com
An ioctl command, with type unsigned long.
 data
A pointer to a character buffer used to pass data to or from an ioctl.
 which
One of FREAD, FWRITE or 0 (select for exceptional conditions).
 off
A file offset of type off_t.
 offp
A pointer to a file offset of type off_t.
 whence
One of L_SET, L_INCR, or L_XTND.
 fhp
A pointer to a file handle buffer.

Several changes have been made to Sun's set of vnode operations.
Most obviously, the vn_lookup receives a nameidata structure
containing its arguments and context as described.
The same structure is also passed to one of the creation or deletion
entries if the lookup operation is for CREATE or DELETE to complete
an operation, or to the vn_abortop entry if no operation
is undertaken.
For filesystems that perform no locking between lookup for creation
or deletion and the call to implement that action,
the final pathname component may be left untranslated by the lookup
routine.
In any case, the pathname pointer points at the final name component,
and the nameidata contains a reference to the vnode of the parent
directory.
The interface is thus flexible enough to accommodate filesystems
that are fully stateful or fully stateless, while avoiding redundant
operations whenever possible.
One operation remains problematic: the vn_rename call.
It is tempting to look up the source of the rename for deletion
and the target for creation.
However, filesystems that lock directories during such lookups must avoid
deadlock if the two paths cross.
For that reason, the source is translated for LOOKUP only,
with the WANTPARENT flag set;
the target is then translated with an operation of CREATE.

In addition to the changes concerned with the nameidata interface,
several other changes were made in the vnode operations.
The vn_rdwr entry was split into vn_read and vn_write;
frequently, the read/write entry amounts to a routine that checks
the direction flag, then calls either a read routine or a write routine.
The two entries may be identical for any given filesystem;
the direction flag is contained in the uio given as an argument.
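
A filesystem that prefers a single combined routine can install it in both slots and branch on the direction held in the uio. The sketch below shows that arrangement with simplified, hypothetical types: the sk_uio carries one flat buffer instead of an iovec array, the vnode is a 64-byte in-memory file, and the offp, ioflag and cred arguments of the real entries are left out.

```c
#include <assert.h>
#include <string.h>

enum sk_uio_rw { SK_UIO_READ, SK_UIO_WRITE };

struct sk_uio {
    char *buf;                 /* single buffer, in place of an iovec array */
    int uio_resid;
    long uio_offset;
    enum sk_uio_rw uio_rw;
};

struct sk_vnode { char data[64]; };   /* hypothetical in-memory file */

struct sk_vnodeops {
    int (*vn_read)(struct sk_vnode *vp, struct sk_uio *uiop);
    int (*vn_write)(struct sk_vnode *vp, struct sk_uio *uiop);
};

/* One routine serves both entries; the uio says which way data moves. */
int sk_rdwr(struct sk_vnode *vp, struct sk_uio *uiop)
{
    long room = (long)sizeof(vp->data) - uiop->uio_offset;
    int n = uiop->uio_resid;

    assert(uiop->uio_offset >= 0 && n >= 0);
    if (room < 0)
        room = 0;
    if (n > room)
        n = (int)room;          /* clip the transfer to the file's extent */
    if (n > 0) {
        if (uiop->uio_rw == SK_UIO_READ)
            memcpy(uiop->buf, vp->data + uiop->uio_offset, (size_t)n);
        else
            memcpy(vp->data + uiop->uio_offset, uiop->buf, (size_t)n);
    }
    uiop->buf += n;
    uiop->uio_resid -= n;
    uiop->uio_offset += n;
    return 0;
}

/* Both entries point at the same function. */
struct sk_vnodeops sk_ops = { sk_rdwr, sk_rdwr };
```

A filesystem with distinct read and write paths would instead install two routines and never inspect uio_rw, which avoids the extra dispatch the combined entry imposed.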

All of the read and write operations use a uio to describe
the file offset and buffer locations.
All of these fields must be updated before return.
In particular, the vn_readdir entry uses this
to return a new file offset token for its current location.

Several new operations have been added.
The first, vn_seek, is a concession to record-oriented files
such as directories.
It allows the filesystem to verify that a seek leaves a file at a sensible
offset, or to return a new offset token relative to an earlier one.
For most filesystems and files, this operation amounts to performing
simple arithmetic.
Another new entry point is vn_mmap, for use in mapping device memory
into a user process address space.
Its semantics are not yet decided.
The final additions are the vn_lock and vn_unlock entries.
These are used to request that the underlying file be locked against
changes for short periods of time if the filesystem implementation allows it.
They are used to maintain consistency
during internal operations such as exec,
and may not be used to construct atomic operations from other filesystem
operations.
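
For an ordinary byte-addressed file, the arithmetic performed by a seek entry might look like the sketch below. It rests on several assumptions: offp is taken to point at the current offset and to receive the new one, off is the requested displacement, the file size is kept in a hypothetical sk_vnode, and the whence values follow the 4.3BSD definitions of L_SET, L_INCR and L_XTND. A record-oriented file such as a directory would replace the arithmetic with its own validation.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define L_SET   0   /* absolute offset */
#define L_INCR  1   /* relative to current offset */
#define L_XTND  2   /* relative to end of file */

struct sk_vnode { long size; };   /* hypothetical: only the size matters here */

/* Compute a new offset; leave *offp untouched if the request is invalid. */
int sk_seek(const struct sk_vnode *vp, long *offp, long off, int whence)
{
    long newoff;

    assert(vp != NULL && offp != NULL);
    switch (whence) {
    case L_SET:
        newoff = off;
        break;
    case L_INCR:
        newoff = *offp + off;
        break;
    case L_XTND:
        newoff = vp->size + off;
        break;
    default:
        return EINVAL;
    }
    if (newoff < 0)
        return EINVAL;      /* a sensible offset cannot be negative */
    *offp = newoff;
    return 0;
}
```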

The attributes of a vnode are not stored in the vnode,
as they might change with time and may need to be read from a remote
source.
Attributes have the form:

/*
 * Vnode attributes. A field value of -1
 * represents a field whose value is unavailable
 * (getattr) or which is not to be changed (setattr).
 */
struct vattr {
 enum vtype va_type; /* vnode type (for create) */
 u_short va_mode; /* file's access mode and type */
! uid_t va_uid; /* owner user id */
! gid_t va_gid; /* owner group id */
 long va_fsid; /* file system id (dev for now) */
! long va_fileid; /* file id */
 short va_nlink; /* number of references to file */
 u_long va_size; /* file size in bytes (quad?) */
+ u_long va_size1; /* reserved if not quad */
 long va_blocksize; /* blocksize preferred for i/o */
 struct timeval va_atime; /* time of last access */
 struct timeval va_mtime; /* time of last modification */
 struct timeval va_ctime; /* time file changed */
 dev_t va_rdev; /* device the file represents */
 u_long va_bytes; /* bytes of disk space held by file */
+ u_long va_bytes1; /* reserved if va_bytes not a quad */
};
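
The -1 convention lets a setattr request name only the fields it wants changed. A minimal sketch of that convention follows, using a reduced attribute structure; the sk_ names, the SK_NOVAL macros and the two helper routines are assumptions made for the example and are not part of the proposed interface.

```c
#include <assert.h>
#include <stddef.h>

/* Reduced stand-in for struct vattr. */
struct sk_vattr {
    unsigned short va_mode;
    unsigned short va_uid;
    unsigned short va_gid;
    unsigned long va_size;
};

/* -1 converted to each field's type marks "do not change". */
#define SK_NOVAL_SHORT ((unsigned short)-1)
#define SK_NOVAL_LONG  ((unsigned long)-1)

/* Prepare a request in which every field is marked unchanged. */
void sk_vattr_null(struct sk_vattr *vap)
{
    assert(vap != NULL);
    vap->va_mode = SK_NOVAL_SHORT;
    vap->va_uid = SK_NOVAL_SHORT;
    vap->va_gid = SK_NOVAL_SHORT;
    vap->va_size = SK_NOVAL_LONG;
}

/* Apply only the fields the caller actually filled in. */
void sk_setattr(struct sk_vattr *cur, const struct sk_vattr *vap)
{
    assert(cur != NULL && vap != NULL);
    if (vap->va_mode != SK_NOVAL_SHORT)
        cur->va_mode = vap->va_mode;
    if (vap->va_uid != SK_NOVAL_SHORT)
        cur->va_uid = vap->va_uid;
    if (vap->va_gid != SK_NOVAL_SHORT)
        cur->va_gid = vap->va_gid;
    if (vap->va_size != SK_NOVAL_LONG)
        cur->va_size = vap->va_size;
}
```

A caller changing only the mode would call sk_vattr_null, set va_mode, and pass the result to sk_setattr; the owner, group and size would be left as they were.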

Conclusions

The Sun VFS filesystem interface is the most widely used generic
filesystem interface.
Of the interfaces examined, it creates the cleanest separation
between the filesystem-independent and -dependent layers and data structures.
It has several flaws, but it is felt that certain changes in the interface
can ameliorate most of them.
The interface proposed here includes those changes.
The proposed interface is now being implemented by the Computer Systems
Research Group at Berkeley.
If the design succeeds in improving the flexibility and performance
of the filesystem layering, it will be advanced as a model interface.

Acknowledgements

The filesystem interface described here is derived from Sun's VFS interface.
It also includes features similar to those of DEC's GFS interface.
We are indebted to members of the Sun and DEC system groups
for long discussions of the issues involved.

References

 Brownbridge82
Brownbridge, D.R., L.F. Marshall, B. Randell,
``The Newcastle Connection, or UNIXes of the World Unite!,''
Software: Practice and Experience, Vol. 12, pp. 1147-1162, 1982.

 Cole85
Cole, C.T., P.B. Flinn, A.B. Atlas,
``An Implementation of an Extended File System for UNIX,''
Usenix Conference Proceedings,
pp. 131-150, June, 1985.

 Kleiman86
Kleiman, S.R.,
``Vnodes: An Architecture for Multiple File System Types in Sun UNIX,''
Usenix Conference Proceedings,
pp. 238-247, June, 1986.

 Leffler84
Leffler, S., M.K. McKusick, M. Karels,
``Measuring and Improving the Performance of 4.2BSD,''
Usenix Conference Proceedings, pp. 237-252, June, 1984.

 McKusick84
McKusick, M.K., W.N. Joy, S.J. Leffler, R.S. Fabry,
``A Fast File System for UNIX,'' Transactions on Computer Systems,
Vol. 2, pp. 181-197,
ACM, August, 1984.

 McKusick85
McKusick, M.K., M. Karels, S. Leffler,
``Performance Improvements and Functional Enhancements in 4.3BSD,''
Usenix Conference Proceedings, pp. 519-531, June, 1985.

 Rifkin86
Rifkin, A.P., M.P. Forbes, R.L. Hamilton, M. Sabrio, S. Shah, and K. Yueh,
``RFS Architectural Overview,'' Usenix Conference Proceedings,
pp. 248-259, June, 1986.

 Ritchie74
Ritchie, D.M. and K. Thompson, ``The Unix Time-Sharing System,''
Communications of the ACM, Vol. 17, pp. 365-375, July, 1974.

 Rodriguez86
Rodriguez, R., M. Koehler, R. Hyde,
``The Generic File System,'' Usenix Conference Proceedings,
pp. 260-269, June, 1986.

 Sandberg85
Sandberg, R., D. Goldberg, S. Kleiman, D. Walsh, B. Lyon,
``Design and Implementation of the Sun Network Filesystem,''
Usenix Conference Proceedings,
pp. 119-130, June, 1985.

 Satyanarayanan85
Satyanarayanan, M., et al.,
``The ITC Distributed File System: Principles and Design,''
Proc. 10th Symposium on Operating Systems Principles, pp. 35-50,
ACM, December, 1985.

 Walker85
Walker, B.J. and S.H. Kiser, ``The LOCUS Distributed Filesystem,''
The LOCUS Distributed System Architecture,
G.J. Popek and B.J. Walker, ed., The MIT Press, Cambridge, MA, 1985.

 Weinberger84
Weinberger, P.J., ``The Version 8 Network File System,''
Usenix Conference presentation,
June, 1984.