smm/05.fastfs/3.t

         Copyright (c) 1986, 1993
 The Regents of the University of California. All rights reserved.

 Redistribution and use in source and binary forms, with or without
 modification, are permitted provided that the following conditions
 are met:
 1. Redistributions of source code must retain the above copyright
 notice, this list of conditions and the following disclaimer.
 2. Redistributions in binary form must reproduce the above copyright
 notice, this list of conditions and the following disclaimer in the
 documentation and/or other materials provided with the distribution.
 3. Neither the name of the University nor the names of its contributors
 may be used to endorse or promote products derived from this software
 without specific prior written permission.

 THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
 ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
 IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
 ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
 FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
 DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
 OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
 HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
 LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
 OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
 SUCH DAMAGE.

 @(#)3.t 8.1 (Berkeley) 6/8/93

 $FreeBSD$

New file system organization

In the new file system organization (as in the
old file system organization),
each disk drive contains one or more file systems.
A file system is described by its super-block,
located at the beginning of the file system's disk partition.
Because the super-block contains critical data,
it is replicated to protect against catastrophic loss.
This is done when the file system is created;
since the super-block data does not change,
the copies need not be referenced unless a head crash
or other hard disk error causes the default super-block
to be unusable.

To insure that it is possible to create files as large as
bytes with only two levels of indirection,
the minimum size of a file system block is 4096 bytes.
The size of file system blocks can be any power of two
greater than or equal to 4096.
The block size of a file system is recorded in the
file system's super-block
so it is possible for file systems with different block sizes
to be simultaneously accessible on the same system.
The block size must be decided at the time that
the file system is created;
it cannot be subsequently changed without rebuilding the file system.

The new file system organization divides a disk partition
into one or more areas called
 "cylinder groups". A cylinder group is comprised of one or more consecutive
cylinders on a disk.
Associated with each cylinder group is some bookkeeping information
that includes a redundant copy of the super-block,
space for inodes,
a bit map describing available blocks in the cylinder group,
and summary information describing the usage of data blocks
within the cylinder group.
The bit map of available blocks in the cylinder group replaces
the traditional file system's free list.
For each cylinder group a static number of inodes
is allocated at file system creation time.
The default policy is to allocate one inode for each 2048
bytes of space in the cylinder group, expecting this
to be far more than will ever be needed.

All the cylinder group bookkeeping information could be
placed at the beginning of each cylinder group.
However if this approach were used,
all the redundant information would be on the top platter.
A single hardware failure that destroyed the top platter
could cause the loss of all redundant copies of the super-block.
Thus the cylinder group bookkeeping information
begins at a varying offset from the beginning of the cylinder group.
The offset for each successive cylinder group is calculated to be
about one track further from the beginning of the cylinder group
than the preceding cylinder group.
In this way the redundant
information spirals down into the pack so that any single track, cylinder,
or platter can be lost without losing all copies of the super-block.
Except for the first cylinder group,
the space between the beginning of the cylinder group
and the beginning of the cylinder group information
is used for data blocks.\(dg
.FS
\(dg While it appears that the first cylinder group could be laid
out with its super-block at the ``known'' location,
this would not work for file systems
with blocks sizes of 16 kilobytes or greater.
This is because of a requirement that the first 8 kilobytes of the disk
be reserved for a bootstrap program and a separate requirement that
the cylinder group information begin on a file system block boundary.
To start the cylinder group on a file system block boundary,
file systems with block sizes larger than 8 kilobytes
would have to leave an empty space between the end of
the boot block and the beginning of the cylinder group.
Without knowing the size of the file system blocks,
the system would not know what roundup function to use
to find the beginning of the first cylinder group.
.FE

Optimizing storage utilization

Data is laid out so that larger blocks can be transferred
in a single disk transaction, greatly increasing file system throughput.
As an example, consider a file in the new file system
composed of 4096 byte data blocks.
In the old file system this file would be composed of 1024 byte blocks.
By increasing the block size, disk accesses in the new file
system may transfer up to four times as much information per
disk transaction.
In large files, several
4096 byte blocks may be allocated from the same cylinder so that
even larger data transfers are possible before requiring a seek.

The main problem with
larger blocks is that most UNIX
file systems are composed of many small files.
A uniformly large block size wastes space.
Table 1 shows the effect of file system
block size on the amount of wasted space in the file system.
The files measured to obtain these figures reside on
one of our time sharing
systems that has roughly 1.2 gigabytes of on-line storage.
The measurements are based on the active user file systems containing
about 920 megabytes of formatted space.
.KF
 B


Space used % waste Organization
775.2 Mb 0.0 Data only, no separation between files
807.8 Mb 4.2 Data only, each file starts on 512 byte boundary
828.7 Mb 6.9 Data + inodes, 512 byte block UNIX file system
866.5 Mb 11.8 Data + inodes, 1024 byte block UNIX file system
948.5 Mb 22.4 Data + inodes, 2048 byte block UNIX file system
1128.3 Mb 45.6 Data + inodes, 4096 byte block UNIX file system

Table 1 - Amount of wasted space as a function of block size.

.KE
The space wasted is calculated to be the percentage of space
on the disk not containing user data.
As the block size on the disk
increases, the waste rises quickly, to an intolerable
45.6% waste with 4096 byte file system blocks.

To be able to use large blocks without undue waste,
small files must be stored in a more efficient way.
The new file system accomplishes this goal by allowing the division
of a single file system block into one or more
 "fragments". The file system fragment size is specified
at the time that the file system is created;
each file system block can optionally be broken into
2, 4, or 8 fragments, each of which is addressable.
The lower bound on the size of these fragments is constrained
by the disk sector size,
typically 512 bytes.
The block map associated with each cylinder group
records the space available in a cylinder group
at the fragment level;
to determine if a block is available, aligned fragments are examined.
Figure 1 shows a piece of a map from a 4096/1024 file system.
.KF
 B


Bits in map XXXX XXOO OOXX OOOO
Fragment numbers 0-3 4-7 8-11 12-15
Block numbers 0 1 2 3

Figure 1 - Example layout of blocks and fragments in a 4096/1024 file system.

.KE
Each bit in the map records the status of a fragment;
an ``X'' shows that the fragment is in use,
while an ``O'' shows that the fragment is available for allocation.
In this example,
fragments 0-5, 10, and 11 are in use,
while fragments 6-9, and 12-15 are free.
Fragments of adjoining blocks cannot be used as a full block,
even if they are large enough.
In this example,
fragments 6-9 cannot be allocated as a full block;
only fragments 12-15 can be coalesced into a full block.

On a file system with a block size of 4096 bytes
and a fragment size of 1024 bytes,
a file is represented by zero or more 4096 byte blocks of data,
and possibly a single fragmented block.
If a file system block must be fragmented to obtain
space for a small amount of data,
the remaining fragments of the block are made
available for allocation to other files.
As an example consider an 11000 byte file stored on
a 4096/1024 byte file system.
This file would uses two full size blocks and one
three fragment portion of another block.
If no block with three aligned fragments is
available at the time the file is created,
a full size block is split yielding the necessary
fragments and a single unused fragment.
This remaining fragment can be allocated to another file as needed.

Space is allocated to a file when a program does a write
system call.
Each time data is written to a file, the system checks to see if
the size of the file has increased*.
.FS
* A program may be overwriting data in the middle of an existing file
in which case space would already have been allocated.
.FE
If the file needs to be expanded to hold the new data,
one of three conditions exists:
 1)
There is enough space left in an already allocated
block or fragment to hold the new data.
The new data is written into the available space.
 2)
The file contains no fragmented blocks (and the last
block in the file
contains insufficient space to hold the new data).
If space exists in a block already allocated,
the space is filled with new data.
If the remainder of the new data contains more than
a full block of data, a full block is allocated and
the first full block of new data is written there.
This process is repeated until less than a full block
of new data remains.
If the remaining new data to be written will
fit in less than a full block,
a block with the necessary fragments is located,
otherwise a full block is located.
The remaining new data is written into the located space.
 3)
The file contains one or more fragments (and the
fragments contain insufficient space to hold the new data).
If the size of the new data plus the size of the data
already in the fragments exceeds the size of a full block,
a new block is allocated.
The contents of the fragments are copied
to the beginning of the block
and the remainder of the block is filled with new data.
The process then continues as in (2) above.
Otherwise, if the new data to be written will
fit in less than a full block,
a block with the necessary fragments is located,
otherwise a full block is located.
The contents of the existing fragments
appended with the new data
are written into the allocated space.

The problem with expanding a file one fragment at a
a time is that data may be copied many times as a
fragmented block expands to a full block.
Fragment reallocation can be minimized
if the user program writes a full block at a time,
except for a partial block at the end of the file.
Since file systems with different block sizes may reside on
the same system,
the file system interface has been extended to provide
application programs the optimal size for a read or write.
For files the optimal size is the block size of the file system
on which the file is being accessed.
For other objects, such as pipes and sockets,
the optimal size is the underlying buffer size.
This feature is used by the Standard
Input/Output Library,
a package used by most user programs.
This feature is also used by
certain system utilities such as archivers and loaders
that do their own input and output management
and need the highest possible file system bandwidth.

The amount of wasted space in the 4096/1024 byte new file system
organization is empirically observed to be about the same as in the
1024 byte old file system organization.
A file system with 4096 byte blocks and 512 byte fragments
has about the same amount of wasted space as the 512 byte
block UNIX file system.
The new file system uses less space
than the 512 byte or 1024 byte
file systems for indexing information for
large files and the same amount of space
for small files.
These savings are offset by the need to use
more space for keeping track of available free blocks.
The net result is about the same disk utilization
when a new file system's fragment size
equals an old file system's block size.

In order for the layout policies to be effective,
a file system cannot be kept completely full.
For each file system there is a parameter, termed
the free space reserve, that
gives the minimum acceptable percentage of file system
blocks that should be free.
If the number of free blocks drops below this level
only the system administrator can continue to allocate blocks.
The value of this parameter may be changed at any time,
even when the file system is mounted and active.
The transfer rates that appear in section 4 were measured on file
systems kept less than 90% full (a reserve of 10%).
If the number of free blocks falls to zero,
the file system throughput tends to be cut in half,
because of the inability of the file system to localize
blocks in a file.
If a file system's performance degrades because
of overfilling, it may be restored by removing
files until the amount of free space once again
reaches the minimum acceptable level.
Access rates for files created during periods of little
free space may be restored by moving their data once enough
space is available.
The free space reserve must be added to the
percentage of waste when comparing the organizations given
in Table 1.
Thus, the percentage of waste in
an old 1024 byte UNIX file system is roughly
comparable to a new 4096/512 byte file system
with the free space reserve set at 5%.
(Compare 11.8% wasted with the old file system
to 6.9% waste + 5% reserved space in the
new file system.)

File system parameterization

Except for the initial creation of the free list,
the old file system ignores the parameters of the underlying hardware.
It has no information about either the physical characteristics
of the mass storage device,
or the hardware that interacts with it.
A goal of the new file system is to parameterize the
processor capabilities and
mass storage characteristics
so that blocks can be allocated in an
optimum configuration-dependent way.
Parameters used include the speed of the processor,
the hardware support for mass storage transfers,
and the characteristics of the mass storage devices.
Disk technology is constantly improving and
a given installation can have several different disk technologies
running on a single processor.
Each file system is parameterized so that it can be
adapted to the characteristics of the disk on which
it is placed.

For mass storage devices such as disks,
the new file system tries to allocate new blocks
on the same cylinder as the previous block in the same file.
Optimally, these new blocks will also be
rotationally well positioned.
The distance between ``rotationally optimal'' blocks varies greatly;
it can be a consecutive block
or a rotationally delayed block
depending on system characteristics.
On a processor with an input/output channel that does not require
any processor intervention between mass storage transfer requests,
two consecutive disk blocks can often be accessed
without suffering lost time because of an intervening disk revolution.
For processors without input/output channels,
the main processor must field an interrupt and
prepare for a new disk transfer.
The expected time to service this interrupt and
schedule a new disk transfer depends on the
speed of the main processor.

The physical characteristics of each disk include
the number of blocks per track and the rate at which
the disk spins.
The allocation routines use this information to calculate
the number of milliseconds required to skip over a block.
The characteristics of the processor include
the expected time to service an interrupt and schedule a
new disk transfer.
Given a block allocated to a file,
the allocation routines calculate the number of blocks to
skip over so that the next block in the file will
come into position under the disk head in the expected
amount of time that it takes to start a new
disk transfer operation.
For programs that sequentially access large amounts of data,
this strategy minimizes the amount of time spent waiting for
the disk to position itself.

To ease the calculation of finding rotationally optimal blocks,
the cylinder group summary information includes
a count of the available blocks in a cylinder
group at different rotational positions.
Eight rotational positions are distinguished,
so the resolution of the
summary information is 2 milliseconds for a typical 3600
revolution per minute drive.
The super-block contains a vector of lists called
 "rotational layout tables". The vector is indexed by rotational position.
Each component of the vector
lists the index into the block map for every data block contained
in its rotational position.
When looking for an allocatable block,
the system first looks through the summary counts for a rotational
position with a non-zero block count.
It then uses the index of the rotational position to find the appropriate
list to use to index through
only the relevant parts of the block map to find a free block.

The parameter that defines the
minimum number of milliseconds between the completion of a data
transfer and the initiation of
another data transfer on the same cylinder
can be changed at any time,
even when the file system is mounted and active.
If a file system is parameterized to lay out blocks with
a rotational separation of 2 milliseconds,
and the disk pack is then moved to a system that has a
processor requiring 4 milliseconds to schedule a disk operation,
the throughput will drop precipitously because of lost disk revolutions
on nearly every block.
If the eventual target machine is known,
the file system can be parameterized for it
even though it is initially created on a different processor.
Even if the move is not known in advance,
the rotational layout delay can be reconfigured after the disk is moved
so that all further allocation is done based on the
characteristics of the new host.

Layout policies

The file system layout policies are divided into two distinct parts.
At the top level are global policies that use file system
wide summary information to make decisions regarding
the placement of new inodes and data blocks.
These routines are responsible for deciding the
placement of new directories and files.
They also calculate rotationally optimal block layouts,
and decide when to force a long seek to a new cylinder group
because there are insufficient blocks left
in the current cylinder group to do reasonable layouts.
Below the global policy routines are
the local allocation routines that use a locally optimal scheme to
lay out data blocks.

Two methods for improving file system performance are to increase
the locality of reference to minimize seek latency
as described by [Trivedi80], and
to improve the layout of data to make larger transfers possible
as described by [Nevalainen77].
The global layout policies try to improve performance
by clustering related information.
They cannot attempt to localize all data references,
but must also try to spread unrelated data
among different cylinder groups.
If too much localization is attempted,
the local cylinder group may run out of space
forcing the data to be scattered to non-local cylinder groups.
Taken to an extreme,
total localization can result in a single huge cluster of data
resembling the old file system.
The global policies try to balance the two conflicting
goals of localizing data that is concurrently accessed
while spreading out unrelated data.

One allocatable resource is inodes.
Inodes are used to describe both files and directories.
Inodes of files in the same directory are frequently accessed together.
For example, the ``list directory'' command often accesses
the inode for each file in a directory.
The layout policy tries to place all the inodes of
files in a directory in the same cylinder group.
To ensure that files are distributed throughout the disk,
a different policy is used for directory allocation.
A new directory is placed in a cylinder group that has a greater
than average number of free inodes,
and the smallest number of directories already in it.
The intent of this policy is to allow the inode clustering policy
to succeed most of the time.
The allocation of inodes within a cylinder group is done using a
next free strategy.
Although this allocates the inodes randomly within a cylinder group,
all the inodes for a particular cylinder group can be read with
8 to 16 disk transfers.
(At most 16 disk transfers are required because a cylinder
group may have no more than 2048 inodes.)
This puts a small and constant upper bound on the number of
disk transfers required to access the inodes
for all the files in a directory.
In contrast, the old file system typically requires
one disk transfer to fetch the inode for each file in a directory.

The other major resource is data blocks.
Since data blocks for a file are typically accessed together,
the policy routines try to place all data
blocks for a file in the same cylinder group,
preferably at rotationally optimal positions in the same cylinder.
The problem with allocating all the data blocks
in the same cylinder group is that large files will
quickly use up available space in the cylinder group,
forcing a spill over to other areas.
Further, using all the space in a cylinder group
causes future allocations for any file in the cylinder group
to also spill to other areas.
Ideally none of the cylinder groups should ever become completely full.
The heuristic solution chosen is to
redirect block allocation
to a different cylinder group
when a file exceeds 48 kilobytes,
and at every megabyte thereafter.*
.FS
* The first spill over point at 48 kilobytes is the point
at which a file on a 4096 byte block file system first
requires a single indirect block. This appears to be
a natural first point at which to redirect block allocation.
The other spillover points are chosen with the intent of
forcing block allocation to be redirected when a
file has used about 25% of the data blocks in a cylinder group.
In observing the new file system in day to day use, the heuristics appear
to work well in minimizing the number of completely filled
cylinder groups.
.FE
The newly chosen cylinder group is selected from those cylinder
groups that have a greater than average number of free blocks left.
Although big files tend to be spread out over the disk,
a megabyte of data is typically accessible before
a long seek must be performed,
and the cost of one long seek per megabyte is small.

The global policy routines call local allocation routines with
requests for specific blocks.
The local allocation routines will
always allocate the requested block
if it is free, otherwise it
allocates a free block of the requested size that is
rotationally closest to the requested block.
If the global layout policies had complete information,
they could always request unused blocks and
the allocation routines would be reduced to simple bookkeeping.
However, maintaining complete information is costly;
thus the implementation of the global layout policy
uses heuristics that employ only partial information.

If a requested block is not available, the local allocator uses
a four level allocation strategy:
 1)
Use the next available block rotationally closest
to the requested block on the same cylinder. It is assumed
here that head switching time is zero. On disk
controllers where this is not the case, it may be possible
to incorporate the time required to switch between disk platters
when constructing the rotational layout tables. This, however,
has not yet been tried.
 2)
If there are no blocks available on the same cylinder,
use a block within the same cylinder group.
 3)
If that cylinder group is entirely full,
quadratically hash the cylinder group number to choose
another cylinder group to look for a free block.
 4)
Finally if the hash fails, apply an exhaustive search
to all cylinder groups.

Quadratic hash is used because of its speed in finding
unused slots in nearly full hash tables [Knuth75].
File systems that are parameterized to maintain at least
10% free space rarely use this strategy.
File systems that are run without maintaining any free
space typically have so few free blocks that almost any
allocation is random;
the most important characteristic of
the strategy used under such conditions is that the strategy be fast.