papers/jail/implementation.ms

Implementation jail in the FreeBSD kernel.

The jail(2) system call, allocation, refcounting and deallocation of
struct prison.

The jail(2) system call is implemented as a non-optional system call
in FreeBSD. Other system calls are controlled by compile time options
in the kernel configuration file, but due to the minute footprint of
the jail implementation, it was decided to make it a standard
facility in FreeBSD.

The implementation of the system call is straightforward: a data structure
is allocated and populated with the arguments provided. The data structure
is attached to the current process' struct proc, its reference count
set to one and a call to the
chroot(2) syscall implementation completes the task.

Hooks in the code implementing process creation and destruction maintains
the reference count on the data structure and free it when the last reference
is lost.
Any new process created by a process in a jail will inherit a reference
to the jail, which effectively puts the new process in the same jail.

There is no way to modify the contents of the data structure describing
the jail after its creation, and no way to attach a process to an existing
jail if it was not created from the inside that jail.

Fortification of the chroot(2) facility for filesystem name scoping.

A number of ways to escape the confines of a chroot(2)-created subscope
of the filesystem view have been identified over the years.
chroot(2) was never intended to be security mechanism as such, but even
then the ftp daemon largely depended on the security provided by
chroot(2) to provide the ``anonymous ftp'' access method.

Three classes of escape routes existed: recursive chroot(2) escapes,
``..'' based escapes and fchdir(2) based escapes.
All of these exploited the fact that chroot(2) didn't try sufficiently
hard to enforce the new root directory.

New code were added to detect and thwart these escapes, amongst
other things by tracking the directory of the first level of chroot(2)
experienced by a process and refusing backwards traversal across
this directory, as well as additional code to refuse chroot(2) if
file-descriptors were open referencing directories.

Restriction of process visibility and interaction.

A macro was already in available in the kernel to determine if one process
could affect another process. This macro did the rather complex checking
of uid and gid values. It was felt that the complexity of the macro were
approaching the lower edge of IOCCC entrance criteria, and it was therefore
converted to a proper function named p_trespass(p1, p2) which does
all the previous checks and additionally checks the jail aspect of the access.
The check is implemented such that access fails if the origin process is jailed
but the target process is not in the same jail.

Process visibility is provided through two mechanisms in FreeBSD,
the procfs file system and a sub-tree of the sysctl tree.
Both of these were modified to report only the processes in the same
jail to a jailed process.

Restriction to one IP number.

Restricting TCP and UDP access to just one IP number was done almost
entirely in the code which manages ``protocol control blocks''.
When a jailed process binds to a socket, the IP number provided by
the process will not be used, instead the pre-configured IP number of
the jail is used.

BSD based TCP/IP network stacks sport a special interface, the loop-back
interface, which has the ``magic'' IP number 127.0.0.1.
This is often used by processes to contact servers on the local machine,
and consequently special handling for jails were needed.
To handle this case it was necessary to also intercept and modify the
behaviour of connection establishment, and when the 127.0.0.1 address
were seen from a jailed process, substitute the jails configured IP number.

Finally the APIs through which the network configuration and connection
state may be queried were modified to report only information relevant
to the configured IP number of a jailed process.

Adding jail awareness to selected device drivers.

A couple of device drivers needed to be taught about jails, the ``pty''
driver is one of them. The pty driver provides ``virtual terminals'' to
services like telnet, ssh, rlogin and X11 terminal window programs.
Therefore jails need access to the pty driver, and code had to be added
to enforce that a particular virtual terminal were not accessed from more
than one jail at the same time.

General restriction of super-users powers for jailed super-users.

This item proved to be the simplest but most tedious to implement.
Tedious because a manual review of all places where the kernel allowed
the super user special powers were called for,
simple because very few places were required to let a jailed root through.
Of the approximately 260 checks in the FreeBSD 4.0 kernel, only
about 35 will let a jailed root through.

Since the default is for jailed roots to not receive privilege, new
code or drivers in the FreeBSD kernel are automatically jail-aware: they
will refuse jailed roots privilege.
The other part of this protection comes from the fact that a jailed
root cannot create new device nodes with the mknod(2) systemcall, so
unless the machine administrator creates device nodes for a particular
device inside the jails filesystem tree, the driver in effect does
not exist in the jail.

As a side-effect of this work the suser(9) API were cleaned up and
extended to cater for not only the jail facility, but also to make room
for future partitioning facilities.

Implementation statistics

The change of the suser(9) API modified approx 350 source lines
distributed over approx. 100 source files. The vast majority of
these changes were generated automatically with a script.

The implementation of the jail facility added approx 200 lines of
code in total, distributed over approx. 50 files. and about 200 lines
in two new kernel files.