1*18ccb223SMauro Carvalho Chehab.. SPDX-License-Identifier: GPL-2.0
2*18ccb223SMauro Carvalho Chehab
3*18ccb223SMauro Carvalho Chehab========
4*18ccb223SMauro Carvalho ChehabORANGEFS
5*18ccb223SMauro Carvalho Chehab========
6*18ccb223SMauro Carvalho Chehab
OrangeFS is an LGPL userspace scale-out parallel storage system. It is ideal
for large storage problems faced by HPC, BigData, Streaming Video,
Genomics, and Bioinformatics.
10*18ccb223SMauro Carvalho Chehab
11*18ccb223SMauro Carvalho ChehabOrangefs, originally called PVFS, was first developed in 1993 by
12*18ccb223SMauro Carvalho ChehabWalt Ligon and Eric Blumer as a parallel file system for Parallel
13*18ccb223SMauro Carvalho ChehabVirtual Machine (PVM) as part of a NASA grant to study the I/O patterns
14*18ccb223SMauro Carvalho Chehabof parallel programs.
15*18ccb223SMauro Carvalho Chehab
16*18ccb223SMauro Carvalho ChehabOrangefs features include:
17*18ccb223SMauro Carvalho Chehab
18*18ccb223SMauro Carvalho Chehab  * Distributes file data among multiple file servers
19*18ccb223SMauro Carvalho Chehab  * Supports simultaneous access by multiple clients
20*18ccb223SMauro Carvalho Chehab  * Stores file data and metadata on servers using local file system
21*18ccb223SMauro Carvalho Chehab    and access methods
22*18ccb223SMauro Carvalho Chehab  * Userspace implementation is easy to install and maintain
23*18ccb223SMauro Carvalho Chehab  * Direct MPI support
24*18ccb223SMauro Carvalho Chehab  * Stateless
25*18ccb223SMauro Carvalho Chehab
26*18ccb223SMauro Carvalho Chehab
27*18ccb223SMauro Carvalho ChehabMailing List Archives
28*18ccb223SMauro Carvalho Chehab=====================
29*18ccb223SMauro Carvalho Chehab
30*18ccb223SMauro Carvalho Chehabhttp://lists.orangefs.org/pipermail/devel_lists.orangefs.org/
31*18ccb223SMauro Carvalho Chehab
32*18ccb223SMauro Carvalho Chehab
33*18ccb223SMauro Carvalho ChehabMailing List Submissions
34*18ccb223SMauro Carvalho Chehab========================
35*18ccb223SMauro Carvalho Chehab
36*18ccb223SMauro Carvalho Chehabdevel@lists.orangefs.org
37*18ccb223SMauro Carvalho Chehab
38*18ccb223SMauro Carvalho Chehab
39*18ccb223SMauro Carvalho ChehabDocumentation
40*18ccb223SMauro Carvalho Chehab=============
41*18ccb223SMauro Carvalho Chehab
42*18ccb223SMauro Carvalho Chehabhttp://www.orangefs.org/documentation/
43*18ccb223SMauro Carvalho Chehab
44*18ccb223SMauro Carvalho Chehab
45*18ccb223SMauro Carvalho ChehabUserspace Filesystem Source
46*18ccb223SMauro Carvalho Chehab===========================
47*18ccb223SMauro Carvalho Chehab
48*18ccb223SMauro Carvalho Chehabhttp://www.orangefs.org/download
49*18ccb223SMauro Carvalho Chehab
Orangefs versions prior to 2.9.3 are not compatible with the
upstream version of the kernel client.
52*18ccb223SMauro Carvalho Chehab
53*18ccb223SMauro Carvalho Chehab
54*18ccb223SMauro Carvalho ChehabRunning ORANGEFS On a Single Server
55*18ccb223SMauro Carvalho Chehab===================================
56*18ccb223SMauro Carvalho Chehab
57*18ccb223SMauro Carvalho ChehabOrangeFS is usually run in large installations with multiple servers and
58*18ccb223SMauro Carvalho Chehabclients, but a complete filesystem can be run on a single machine for
59*18ccb223SMauro Carvalho Chehabdevelopment and testing.
60*18ccb223SMauro Carvalho Chehab
61*18ccb223SMauro Carvalho ChehabOn Fedora, install orangefs and orangefs-server::
62*18ccb223SMauro Carvalho Chehab
63*18ccb223SMauro Carvalho Chehab    dnf -y install orangefs orangefs-server
64*18ccb223SMauro Carvalho Chehab
65*18ccb223SMauro Carvalho ChehabThere is an example server configuration file in
66*18ccb223SMauro Carvalho Chehab/etc/orangefs/orangefs.conf.  Change localhost to your hostname if
67*18ccb223SMauro Carvalho Chehabnecessary.
68*18ccb223SMauro Carvalho Chehab
69*18ccb223SMauro Carvalho ChehabTo generate a filesystem to run xfstests against, see below.
70*18ccb223SMauro Carvalho Chehab
71*18ccb223SMauro Carvalho ChehabThere is an example client configuration file in /etc/pvfs2tab.  It is a
72*18ccb223SMauro Carvalho Chehabsingle line.  Uncomment it and change the hostname if necessary.  This
73*18ccb223SMauro Carvalho Chehabcontrols clients which use libpvfs2.  This does not control the
74*18ccb223SMauro Carvalho Chehabpvfs2-client-core.
75*18ccb223SMauro Carvalho Chehab
76*18ccb223SMauro Carvalho ChehabCreate the filesystem::
77*18ccb223SMauro Carvalho Chehab
78*18ccb223SMauro Carvalho Chehab    pvfs2-server -f /etc/orangefs/orangefs.conf
79*18ccb223SMauro Carvalho Chehab
80*18ccb223SMauro Carvalho ChehabStart the server::
81*18ccb223SMauro Carvalho Chehab
82*18ccb223SMauro Carvalho Chehab    systemctl start orangefs-server
83*18ccb223SMauro Carvalho Chehab
84*18ccb223SMauro Carvalho ChehabTest the server::
85*18ccb223SMauro Carvalho Chehab
86*18ccb223SMauro Carvalho Chehab    pvfs2-ping -m /pvfsmnt
87*18ccb223SMauro Carvalho Chehab
88*18ccb223SMauro Carvalho ChehabStart the client.  The module must be compiled in or loaded before this
89*18ccb223SMauro Carvalho Chehabpoint::
90*18ccb223SMauro Carvalho Chehab
91*18ccb223SMauro Carvalho Chehab    systemctl start orangefs-client
92*18ccb223SMauro Carvalho Chehab
93*18ccb223SMauro Carvalho ChehabMount the filesystem::
94*18ccb223SMauro Carvalho Chehab
95*18ccb223SMauro Carvalho Chehab    mount -t pvfs2 tcp://localhost:3334/orangefs /pvfsmnt
96*18ccb223SMauro Carvalho Chehab
97*18ccb223SMauro Carvalho Chehab
98*18ccb223SMauro Carvalho ChehabBuilding ORANGEFS on a Single Server
99*18ccb223SMauro Carvalho Chehab====================================
100*18ccb223SMauro Carvalho Chehab
101*18ccb223SMauro Carvalho ChehabWhere OrangeFS cannot be installed from distribution packages, it may be
102*18ccb223SMauro Carvalho Chehabbuilt from source.
103*18ccb223SMauro Carvalho Chehab
You can omit --prefix if you don't care that things are sprinkled around
in /usr/local.  As of version 2.9.6, OrangeFS uses Berkeley DB by
default; we will probably be changing the default to LMDB soon.
107*18ccb223SMauro Carvalho Chehab
108*18ccb223SMauro Carvalho Chehab::
109*18ccb223SMauro Carvalho Chehab
110*18ccb223SMauro Carvalho Chehab    ./configure --prefix=/opt/ofs --with-db-backend=lmdb
111*18ccb223SMauro Carvalho Chehab
112*18ccb223SMauro Carvalho Chehab    make
113*18ccb223SMauro Carvalho Chehab
114*18ccb223SMauro Carvalho Chehab    make install
115*18ccb223SMauro Carvalho Chehab
116*18ccb223SMauro Carvalho ChehabCreate an orangefs config file::
117*18ccb223SMauro Carvalho Chehab
118*18ccb223SMauro Carvalho Chehab    /opt/ofs/bin/pvfs2-genconfig /etc/pvfs2.conf
119*18ccb223SMauro Carvalho Chehab
120*18ccb223SMauro Carvalho ChehabCreate an /etc/pvfs2tab file::
121*18ccb223SMauro Carvalho Chehab
122*18ccb223SMauro Carvalho Chehab    echo tcp://localhost:3334/orangefs /pvfsmnt pvfs2 defaults,noauto 0 0 > \
123*18ccb223SMauro Carvalho Chehab	/etc/pvfs2tab
124*18ccb223SMauro Carvalho Chehab
125*18ccb223SMauro Carvalho ChehabCreate the mount point you specified in the tab file if needed::
126*18ccb223SMauro Carvalho Chehab
127*18ccb223SMauro Carvalho Chehab    mkdir /pvfsmnt
128*18ccb223SMauro Carvalho Chehab
129*18ccb223SMauro Carvalho ChehabBootstrap the server::
130*18ccb223SMauro Carvalho Chehab
131*18ccb223SMauro Carvalho Chehab    /opt/ofs/sbin/pvfs2-server -f /etc/pvfs2.conf
132*18ccb223SMauro Carvalho Chehab
133*18ccb223SMauro Carvalho ChehabStart the server::
134*18ccb223SMauro Carvalho Chehab
    /opt/ofs/sbin/pvfs2-server /etc/pvfs2.conf
136*18ccb223SMauro Carvalho Chehab
Now the server should be running. pvfs2-ls is a simple
test to verify that the server is running::
139*18ccb223SMauro Carvalho Chehab
140*18ccb223SMauro Carvalho Chehab    /opt/ofs/bin/pvfs2-ls /pvfsmnt
141*18ccb223SMauro Carvalho Chehab
142*18ccb223SMauro Carvalho ChehabIf stuff seems to be working, load the kernel module and
143*18ccb223SMauro Carvalho Chehabturn on the client core::
144*18ccb223SMauro Carvalho Chehab
    /opt/ofs/sbin/pvfs2-client -p /opt/ofs/sbin/pvfs2-client-core
146*18ccb223SMauro Carvalho Chehab
147*18ccb223SMauro Carvalho ChehabMount your filesystem::
148*18ccb223SMauro Carvalho Chehab
149*18ccb223SMauro Carvalho Chehab    mount -t pvfs2 tcp://localhost:3334/orangefs /pvfsmnt
150*18ccb223SMauro Carvalho Chehab
151*18ccb223SMauro Carvalho Chehab
152*18ccb223SMauro Carvalho ChehabRunning xfstests
153*18ccb223SMauro Carvalho Chehab================
154*18ccb223SMauro Carvalho Chehab
155*18ccb223SMauro Carvalho ChehabIt is useful to use a scratch filesystem with xfstests.  This can be
156*18ccb223SMauro Carvalho Chehabdone with only one server.
157*18ccb223SMauro Carvalho Chehab
158*18ccb223SMauro Carvalho ChehabMake a second copy of the FileSystem section in the server configuration
159*18ccb223SMauro Carvalho Chehabfile, which is /etc/orangefs/orangefs.conf.  Change the Name to scratch.
160*18ccb223SMauro Carvalho ChehabChange the ID to something other than the ID of the first FileSystem
161*18ccb223SMauro Carvalho Chehabsection (2 is usually a good choice).
162*18ccb223SMauro Carvalho Chehab
163*18ccb223SMauro Carvalho ChehabThen there are two FileSystem sections: orangefs and scratch.
164*18ccb223SMauro Carvalho Chehab
165*18ccb223SMauro Carvalho ChehabThis change should be made before creating the filesystem.
166*18ccb223SMauro Carvalho Chehab
167*18ccb223SMauro Carvalho Chehab::
168*18ccb223SMauro Carvalho Chehab
169*18ccb223SMauro Carvalho Chehab    pvfs2-server -f /etc/orangefs/orangefs.conf
170*18ccb223SMauro Carvalho Chehab
171*18ccb223SMauro Carvalho ChehabTo run xfstests, create /etc/xfsqa.config::
172*18ccb223SMauro Carvalho Chehab
173*18ccb223SMauro Carvalho Chehab    TEST_DIR=/orangefs
174*18ccb223SMauro Carvalho Chehab    TEST_DEV=tcp://localhost:3334/orangefs
175*18ccb223SMauro Carvalho Chehab    SCRATCH_MNT=/scratch
176*18ccb223SMauro Carvalho Chehab    SCRATCH_DEV=tcp://localhost:3334/scratch
177*18ccb223SMauro Carvalho Chehab
178*18ccb223SMauro Carvalho ChehabThen xfstests can be run::
179*18ccb223SMauro Carvalho Chehab
180*18ccb223SMauro Carvalho Chehab    ./check -pvfs2
181*18ccb223SMauro Carvalho Chehab
182*18ccb223SMauro Carvalho Chehab
183*18ccb223SMauro Carvalho ChehabOptions
184*18ccb223SMauro Carvalho Chehab=======
185*18ccb223SMauro Carvalho Chehab
186*18ccb223SMauro Carvalho ChehabThe following mount options are accepted:
187*18ccb223SMauro Carvalho Chehab
188*18ccb223SMauro Carvalho Chehab  acl
189*18ccb223SMauro Carvalho Chehab    Allow the use of Access Control Lists on files and directories.
190*18ccb223SMauro Carvalho Chehab
191*18ccb223SMauro Carvalho Chehab  intr
192*18ccb223SMauro Carvalho Chehab    Some operations between the kernel client and the user space
193*18ccb223SMauro Carvalho Chehab    filesystem can be interruptible, such as changes in debug levels
194*18ccb223SMauro Carvalho Chehab    and the setting of tunable parameters.
195*18ccb223SMauro Carvalho Chehab
196*18ccb223SMauro Carvalho Chehab  local_lock
197*18ccb223SMauro Carvalho Chehab    Enable posix locking from the perspective of "this" kernel. The
198*18ccb223SMauro Carvalho Chehab    default file_operations lock action is to return ENOSYS. Posix
199*18ccb223SMauro Carvalho Chehab    locking kicks in if the filesystem is mounted with -o local_lock.
200*18ccb223SMauro Carvalho Chehab    Distributed locking is being worked on for the future.
201*18ccb223SMauro Carvalho Chehab
202*18ccb223SMauro Carvalho Chehab
203*18ccb223SMauro Carvalho ChehabDebugging
204*18ccb223SMauro Carvalho Chehab=========
205*18ccb223SMauro Carvalho Chehab
If you want the debug (GOSSIP) statements in a particular
source file (inode.c for example) to go to syslog::
208*18ccb223SMauro Carvalho Chehab
209*18ccb223SMauro Carvalho Chehab  echo inode > /sys/kernel/debug/orangefs/kernel-debug
210*18ccb223SMauro Carvalho Chehab
211*18ccb223SMauro Carvalho ChehabNo debugging (the default)::
212*18ccb223SMauro Carvalho Chehab
213*18ccb223SMauro Carvalho Chehab  echo none > /sys/kernel/debug/orangefs/kernel-debug
214*18ccb223SMauro Carvalho Chehab
215*18ccb223SMauro Carvalho ChehabDebugging from several source files::
216*18ccb223SMauro Carvalho Chehab
217*18ccb223SMauro Carvalho Chehab  echo inode,dir > /sys/kernel/debug/orangefs/kernel-debug
218*18ccb223SMauro Carvalho Chehab
219*18ccb223SMauro Carvalho ChehabAll debugging::
220*18ccb223SMauro Carvalho Chehab
221*18ccb223SMauro Carvalho Chehab  echo all > /sys/kernel/debug/orangefs/kernel-debug
222*18ccb223SMauro Carvalho Chehab
223*18ccb223SMauro Carvalho ChehabGet a list of all debugging keywords::
224*18ccb223SMauro Carvalho Chehab
225*18ccb223SMauro Carvalho Chehab  cat /sys/kernel/debug/orangefs/debug-help
226*18ccb223SMauro Carvalho Chehab
227*18ccb223SMauro Carvalho Chehab
228*18ccb223SMauro Carvalho ChehabProtocol between Kernel Module and Userspace
229*18ccb223SMauro Carvalho Chehab============================================
230*18ccb223SMauro Carvalho Chehab
231*18ccb223SMauro Carvalho ChehabOrangefs is a user space filesystem and an associated kernel module.
232*18ccb223SMauro Carvalho ChehabWe'll just refer to the user space part of Orangefs as "userspace"
233*18ccb223SMauro Carvalho Chehabfrom here on out. Orangefs descends from PVFS, and userspace code
234*18ccb223SMauro Carvalho Chehabstill uses PVFS for function and variable names. Userspace typedefs
235*18ccb223SMauro Carvalho Chehabmany of the important structures. Function and variable names in
236*18ccb223SMauro Carvalho Chehabthe kernel module have been transitioned to "orangefs", and The Linux
237*18ccb223SMauro Carvalho ChehabCoding Style avoids typedefs, so kernel module structures that
238*18ccb223SMauro Carvalho Chehabcorrespond to userspace structures are not typedefed.
239*18ccb223SMauro Carvalho Chehab
240*18ccb223SMauro Carvalho ChehabThe kernel module implements a pseudo device that userspace
241*18ccb223SMauro Carvalho Chehabcan read from and write to. Userspace can also manipulate the
242*18ccb223SMauro Carvalho Chehabkernel module through the pseudo device with ioctl.
243*18ccb223SMauro Carvalho Chehab
244*18ccb223SMauro Carvalho ChehabThe Bufmap
245*18ccb223SMauro Carvalho Chehab----------
246*18ccb223SMauro Carvalho Chehab
At startup userspace allocates two page-size-aligned (posix_memalign)
mlocked memory buffers: one is used for IO and one is used for readdir
249*18ccb223SMauro Carvalho Chehaboperations. The IO buffer is 41943040 bytes and the readdir buffer is
250*18ccb223SMauro Carvalho Chehab4194304 bytes. Each buffer contains logical chunks, or partitions, and
251*18ccb223SMauro Carvalho Chehaba pointer to each buffer is added to its own PVFS_dev_map_desc structure
252*18ccb223SMauro Carvalho Chehabwhich also describes its total size, as well as the size and number of
253*18ccb223SMauro Carvalho Chehabthe partitions.
254*18ccb223SMauro Carvalho Chehab
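As a rough illustration of the allocation step described above, the
following userspace sketch obtains a page-aligned buffer with
``posix_memalign`` and then tries to ``mlock`` it. The helper name
``alloc_locked_buffer`` and its ``locked`` out-parameter are hypothetical
and are not taken from the OrangeFS sources; only the two sizes come from
the text.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L
#endif

#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Buffer sizes quoted in the text above. */
#define IO_BUFFER_SIZE      41943040UL
#define READDIR_BUFFER_SIZE 4194304UL

/*
 * Hypothetical helper: return a page-aligned buffer of "size" bytes, or
 * NULL if the allocation fails.  *locked is set to 1 if mlock() succeeded
 * and to 0 if it did not (for example when RLIMIT_MEMLOCK is too small).
 */
void *alloc_locked_buffer(size_t size, int *locked)
{
	long page_size = sysconf(_SC_PAGESIZE);
	void *buf = NULL;

	assert(locked != NULL);
	assert(page_size > 0);

	if (posix_memalign(&buf, (size_t)page_size, size) != 0)
		return NULL;

	*locked = (mlock(buf, size) == 0);
	return buf;
}
```

A caller would presumably invoke the helper once with ``IO_BUFFER_SIZE``
and once with ``READDIR_BUFFER_SIZE``, and treat a failed ``mlock`` as an
error or a warning depending on policy.
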
255*18ccb223SMauro Carvalho ChehabA pointer to the IO buffer's PVFS_dev_map_desc structure is sent to a
256*18ccb223SMauro Carvalho Chehabmapping routine in the kernel module with an ioctl. The structure is
257*18ccb223SMauro Carvalho Chehabcopied from user space to kernel space with copy_from_user and is used
258*18ccb223SMauro Carvalho Chehabto initialize the kernel module's "bufmap" (struct orangefs_bufmap), which
259*18ccb223SMauro Carvalho Chehabthen contains:
260*18ccb223SMauro Carvalho Chehab
  * refcnt - a reference counter
263*18ccb223SMauro Carvalho Chehab  * desc_size - PVFS2_BUFMAP_DEFAULT_DESC_SIZE (4194304) - the IO buffer's
264*18ccb223SMauro Carvalho Chehab    partition size, which represents the filesystem's block size and
265*18ccb223SMauro Carvalho Chehab    is used for s_blocksize in super blocks.
266*18ccb223SMauro Carvalho Chehab  * desc_count - PVFS2_BUFMAP_DEFAULT_DESC_COUNT (10) - the number of
267*18ccb223SMauro Carvalho Chehab    partitions in the IO buffer.
268*18ccb223SMauro Carvalho Chehab  * desc_shift - log2(desc_size), used for s_blocksize_bits in super blocks.
269*18ccb223SMauro Carvalho Chehab  * total_size - the total size of the IO buffer.
270*18ccb223SMauro Carvalho Chehab  * page_count - the number of 4096 byte pages in the IO buffer.
271*18ccb223SMauro Carvalho Chehab  * page_array - a pointer to ``page_count * (sizeof(struct page*))`` bytes
272*18ccb223SMauro Carvalho Chehab    of kcalloced memory. This memory is used as an array of pointers
273*18ccb223SMauro Carvalho Chehab    to each of the pages in the IO buffer through a call to get_user_pages.
274*18ccb223SMauro Carvalho Chehab  * desc_array - a pointer to ``desc_count * (sizeof(struct orangefs_bufmap_desc))``
    bytes of kcalloced memory. This memory is further initialized:
276*18ccb223SMauro Carvalho Chehab
277*18ccb223SMauro Carvalho Chehab      user_desc is the kernel's copy of the IO buffer's ORANGEFS_dev_map_desc
278*18ccb223SMauro Carvalho Chehab      structure. user_desc->ptr points to the IO buffer.
279*18ccb223SMauro Carvalho Chehab
280*18ccb223SMauro Carvalho Chehab      ::
281*18ccb223SMauro Carvalho Chehab
282*18ccb223SMauro Carvalho Chehab	pages_per_desc = bufmap->desc_size / PAGE_SIZE
283*18ccb223SMauro Carvalho Chehab	offset = 0
284*18ccb223SMauro Carvalho Chehab
285*18ccb223SMauro Carvalho Chehab        bufmap->desc_array[0].page_array = &bufmap->page_array[offset]
286*18ccb223SMauro Carvalho Chehab        bufmap->desc_array[0].array_count = pages_per_desc = 1024
287*18ccb223SMauro Carvalho Chehab        bufmap->desc_array[0].uaddr = (user_desc->ptr) + (0 * 1024 * 4096)
288*18ccb223SMauro Carvalho Chehab        offset += 1024
289*18ccb223SMauro Carvalho Chehab                           .
290*18ccb223SMauro Carvalho Chehab                           .
291*18ccb223SMauro Carvalho Chehab                           .
292*18ccb223SMauro Carvalho Chehab        bufmap->desc_array[9].page_array = &bufmap->page_array[offset]
293*18ccb223SMauro Carvalho Chehab        bufmap->desc_array[9].array_count = pages_per_desc = 1024
294*18ccb223SMauro Carvalho Chehab        bufmap->desc_array[9].uaddr = (user_desc->ptr) +
295*18ccb223SMauro Carvalho Chehab                                               (9 * 1024 * 4096)
296*18ccb223SMauro Carvalho Chehab        offset += 1024
297*18ccb223SMauro Carvalho Chehab
298*18ccb223SMauro Carvalho Chehab  * buffer_index_array - a desc_count sized array of ints, used to
299*18ccb223SMauro Carvalho Chehab    indicate which of the IO buffer's partitions are available to use.
300*18ccb223SMauro Carvalho Chehab  * buffer_index_lock - a spinlock to protect buffer_index_array during update.
301*18ccb223SMauro Carvalho Chehab  * readdir_index_array - a five (ORANGEFS_READDIR_DEFAULT_DESC_COUNT) element
302*18ccb223SMauro Carvalho Chehab    int array used to indicate which of the readdir buffer's partitions are
303*18ccb223SMauro Carvalho Chehab    available to use.
304*18ccb223SMauro Carvalho Chehab  * readdir_index_lock - a spinlock to protect readdir_index_array during
305*18ccb223SMauro Carvalho Chehab    update.
306*18ccb223SMauro Carvalho Chehab
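The arithmetic behind the bufmap fields can be checked with a small
standalone sketch. It is not kernel code: ``struct sketch_desc``,
``sketch_log2`` and ``sketch_init_descs`` are hypothetical names, and the
page array is represented by an index (``first_page``) rather than by real
``struct page`` pointers. The constants are the values quoted above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DESC_SIZE   4194304UL  /* partition size quoted above */
#define DESC_COUNT  10         /* number of IO partitions quoted above */
#define PAGE_BYTES  4096UL     /* page size assumed by the text */

/* Hypothetical, simplified stand-in for the per-partition descriptor. */
struct sketch_desc {
	size_t first_page;   /* index into the flat page array */
	size_t array_count;  /* pages in this partition */
	uintptr_t uaddr;     /* userspace address of the partition */
};

/* log2 of a power of two, i.e. what desc_shift holds for desc_size. */
unsigned int sketch_log2(unsigned long value)
{
	unsigned int shift = 0;

	assert(value != 0 && (value & (value - 1)) == 0);
	while ((1UL << shift) < value)
		shift++;
	return shift;
}

/* Fill descs[] as in the walk-through above; returns the page count. */
size_t sketch_init_descs(struct sketch_desc *descs, uintptr_t user_ptr)
{
	size_t pages_per_desc = DESC_SIZE / PAGE_BYTES;
	size_t offset = 0;
	int i;

	assert(descs != NULL);
	for (i = 0; i < DESC_COUNT; i++) {
		descs[i].first_page = offset;
		descs[i].array_count = pages_per_desc;
		descs[i].uaddr = user_ptr +
			(uintptr_t)i * pages_per_desc * PAGE_BYTES;
		offset += pages_per_desc;
	}
	return offset;
}
```

With these constants each partition covers 1024 pages, the whole IO buffer
covers 10240 pages, and ``sketch_log2(DESC_SIZE)`` evaluates to 22.
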
307*18ccb223SMauro Carvalho ChehabOperations
308*18ccb223SMauro Carvalho Chehab----------
309*18ccb223SMauro Carvalho Chehab
310*18ccb223SMauro Carvalho ChehabThe kernel module builds an "op" (struct orangefs_kernel_op_s) when it
311*18ccb223SMauro Carvalho Chehabneeds to communicate with userspace. Part of the op contains the "upcall"
312*18ccb223SMauro Carvalho Chehabwhich expresses the request to userspace. Part of the op eventually
313*18ccb223SMauro Carvalho Chehabcontains the "downcall" which expresses the results of the request.
314*18ccb223SMauro Carvalho Chehab
315*18ccb223SMauro Carvalho ChehabThe slab allocator is used to keep a cache of op structures handy.
316*18ccb223SMauro Carvalho Chehab
317*18ccb223SMauro Carvalho ChehabAt init time the kernel module defines and initializes a request list
318*18ccb223SMauro Carvalho Chehaband an in_progress hash table to keep track of all the ops that are
319*18ccb223SMauro Carvalho Chehabin flight at any given time.
320*18ccb223SMauro Carvalho Chehab
321*18ccb223SMauro Carvalho ChehabOps are stateful:
322*18ccb223SMauro Carvalho Chehab
323*18ccb223SMauro Carvalho Chehab * unknown
324*18ccb223SMauro Carvalho Chehab	    - op was just initialized
325*18ccb223SMauro Carvalho Chehab * waiting
326*18ccb223SMauro Carvalho Chehab	    - op is on request_list (upward bound)
327*18ccb223SMauro Carvalho Chehab * inprogr
328*18ccb223SMauro Carvalho Chehab	    - op is in progress (waiting for downcall)
329*18ccb223SMauro Carvalho Chehab * serviced
330*18ccb223SMauro Carvalho Chehab	    - op has matching downcall; ok
331*18ccb223SMauro Carvalho Chehab * purged
332*18ccb223SMauro Carvalho Chehab	    - op has to start a timer since client-core
333*18ccb223SMauro Carvalho Chehab              exited uncleanly before servicing op
334*18ccb223SMauro Carvalho Chehab * given up
335*18ccb223SMauro Carvalho Chehab	    - submitter has given up waiting for it
336*18ccb223SMauro Carvalho Chehab
337*18ccb223SMauro Carvalho ChehabWhen some arbitrary userspace program needs to perform a
338*18ccb223SMauro Carvalho Chehabfilesystem operation on Orangefs (readdir, I/O, create, whatever)
339*18ccb223SMauro Carvalho Chehaban op structure is initialized and tagged with a distinguishing ID
340*18ccb223SMauro Carvalho Chehabnumber. The upcall part of the op is filled out, and the op is
341*18ccb223SMauro Carvalho Chehabpassed to the "service_operation" function.
342*18ccb223SMauro Carvalho Chehab
343*18ccb223SMauro Carvalho ChehabService_operation changes the op's state to "waiting", puts
344*18ccb223SMauro Carvalho Chehabit on the request list, and signals the Orangefs file_operations.poll
345*18ccb223SMauro Carvalho Chehabfunction through a wait queue. Userspace is polling the pseudo-device
346*18ccb223SMauro Carvalho Chehaband thus becomes aware of the upcall request that needs to be read.
347*18ccb223SMauro Carvalho Chehab
348*18ccb223SMauro Carvalho ChehabWhen the Orangefs file_operations.read function is triggered, the
349*18ccb223SMauro Carvalho Chehabrequest list is searched for an op that seems ready-to-process.
350*18ccb223SMauro Carvalho ChehabThe op is removed from the request list. The tag from the op and
351*18ccb223SMauro Carvalho Chehabthe filled-out upcall struct are copy_to_user'ed back to userspace.
352*18ccb223SMauro Carvalho Chehab
353*18ccb223SMauro Carvalho ChehabIf any of these (and some additional protocol) copy_to_users fail,
354*18ccb223SMauro Carvalho Chehabthe op's state is set to "waiting" and the op is added back to
355*18ccb223SMauro Carvalho Chehabthe request list. Otherwise, the op's state is changed to "in progress",
356*18ccb223SMauro Carvalho Chehaband the op is hashed on its tag and put onto the end of a list in the
357*18ccb223SMauro Carvalho Chehabin_progress hash table at the index the tag hashed to.
358*18ccb223SMauro Carvalho Chehab
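The bookkeeping described above (hash the op on its tag, append it to the
end of that bucket's list, later look it up by tag and unlink it) can be
sketched in plain C. Everything here is illustrative: the bucket count, the
modulo hash and all ``sketch_`` names are assumptions, and the kernel module
uses its own list and locking primitives, which are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_TABLE_SIZE 16  /* assumed bucket count, for illustration */

/* Hypothetical, stripped-down op: only the fields this sketch needs. */
struct sketch_op {
	uint64_t tag;
	struct sketch_op *next;
};

struct sketch_bucket {
	struct sketch_op *head;
	struct sketch_op *tail;
};

/* Assumed hash: a plain modulo on the tag. */
size_t sketch_hash(uint64_t tag)
{
	return (size_t)(tag % SKETCH_TABLE_SIZE);
}

/* Append op to the end of the list in the bucket its tag hashes to. */
void sketch_insert(struct sketch_bucket *table, struct sketch_op *op)
{
	struct sketch_bucket *b;

	assert(table != NULL && op != NULL);
	b = &table[sketch_hash(op->tag)];
	op->next = NULL;
	if (b->tail)
		b->tail->next = op;
	else
		b->head = op;
	b->tail = op;
}

/* Find the op with the given tag, unlink it and return it (or NULL). */
struct sketch_op *sketch_remove(struct sketch_bucket *table, uint64_t tag)
{
	struct sketch_bucket *b = &table[sketch_hash(tag)];
	struct sketch_op *prev = NULL, *op;

	for (op = b->head; op; prev = op, op = op->next) {
		if (op->tag != tag)
			continue;
		if (prev)
			prev->next = op->next;
		else
			b->head = op->next;
		if (b->tail == op)
			b->tail = prev;
		op->next = NULL;
		return op;
	}
	return NULL;
}
```

Appending at the tail keeps ops in each bucket in arrival order, which
matches the "put onto the end of a list" wording above.
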
359*18ccb223SMauro Carvalho ChehabWhen userspace has assembled the response to the upcall, it
360*18ccb223SMauro Carvalho Chehabwrites the response, which includes the distinguishing tag, back to
361*18ccb223SMauro Carvalho Chehabthe pseudo device in a series of io_vecs. This triggers the Orangefs
362*18ccb223SMauro Carvalho Chehabfile_operations.write_iter function to find the op with the associated
363*18ccb223SMauro Carvalho Chehabtag and remove it from the in_progress hash table. As long as the op's
364*18ccb223SMauro Carvalho Chehabstate is not "canceled" or "given up", its state is set to "serviced".
365*18ccb223SMauro Carvalho ChehabThe file_operations.write_iter function returns to the waiting vfs,
366*18ccb223SMauro Carvalho Chehaband back to service_operation through wait_for_matching_downcall.
367*18ccb223SMauro Carvalho Chehab
Service_operation returns to its caller with the op's downcall
part (the response to the upcall) filled out.
370*18ccb223SMauro Carvalho Chehab
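The state changes along the successful path can be summarized as a small
state machine. This is only a model of the prose above: the enum and
function names are hypothetical, the "canceled" state mentioned in the text
is not modelled because it does not appear in the state list, and waiting,
locking and timeouts are omitted.

```c
#include <assert.h>

/* Hypothetical names for the states listed above. */
enum sketch_op_state {
	SKETCH_UNKNOWN,
	SKETCH_WAITING,
	SKETCH_INPROGR,
	SKETCH_SERVICED,
	SKETCH_PURGED,
	SKETCH_GIVEN_UP,
};

/* service_operation queues a freshly initialized op for userspace. */
enum sketch_op_state sketch_submit(enum sketch_op_state s)
{
	assert(s == SKETCH_UNKNOWN);
	return SKETCH_WAITING;
}

/* Device read: a failed copy to userspace leaves the op "waiting". */
enum sketch_op_state sketch_read(enum sketch_op_state s, int copy_ok)
{
	assert(s == SKETCH_WAITING);
	return copy_ok ? SKETCH_INPROGR : SKETCH_WAITING;
}

/* Device write: mark serviced unless the submitter has given up. */
enum sketch_op_state sketch_write(enum sketch_op_state s)
{
	return s == SKETCH_GIVEN_UP ? s : SKETCH_SERVICED;
}
```

A normal request therefore passes through unknown, waiting, in progress and
serviced, in that order.
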
371*18ccb223SMauro Carvalho ChehabThe "client-core" is the bridge between the kernel module and
372*18ccb223SMauro Carvalho Chehabuserspace. The client-core is a daemon. The client-core has an
373*18ccb223SMauro Carvalho Chehabassociated watchdog daemon. If the client-core is ever signaled
374*18ccb223SMauro Carvalho Chehabto die, the watchdog daemon restarts the client-core. Even though
375*18ccb223SMauro Carvalho Chehabthe client-core is restarted "right away", there is a period of
376*18ccb223SMauro Carvalho Chehabtime during such an event that the client-core is dead. A dead client-core
377*18ccb223SMauro Carvalho Chehabcan't be triggered by the Orangefs file_operations.poll function.
Ops that pass through service_operation during a "dead spell" can time out
379*18ccb223SMauro Carvalho Chehabon the wait queue and one attempt is made to recycle them. Obviously,
380*18ccb223SMauro Carvalho Chehabif the client-core stays dead too long, the arbitrary userspace processes
381*18ccb223SMauro Carvalho Chehabtrying to use Orangefs will be negatively affected. Waiting ops
382*18ccb223SMauro Carvalho Chehabthat can't be serviced will be removed from the request list and
383*18ccb223SMauro Carvalho Chehabhave their states set to "given up". In-progress ops that can't
384*18ccb223SMauro Carvalho Chehabbe serviced will be removed from the in_progress hash table and
385*18ccb223SMauro Carvalho Chehabhave their states set to "given up".
386*18ccb223SMauro Carvalho Chehab
387*18ccb223SMauro Carvalho ChehabReaddir and I/O ops are atypical with respect to their payloads.
388*18ccb223SMauro Carvalho Chehab
389*18ccb223SMauro Carvalho Chehab  - readdir ops use the smaller of the two pre-allocated pre-partitioned
390*18ccb223SMauro Carvalho Chehab    memory buffers. The readdir buffer is only available to userspace.
391*18ccb223SMauro Carvalho Chehab    The kernel module obtains an index to a free partition before launching
392*18ccb223SMauro Carvalho Chehab    a readdir op. Userspace deposits the results into the indexed partition
    and then writes them back to the pvfs device.
394*18ccb223SMauro Carvalho Chehab
395*18ccb223SMauro Carvalho Chehab  - io (read and write) ops use the larger of the two pre-allocated
396*18ccb223SMauro Carvalho Chehab    pre-partitioned memory buffers. The IO buffer is accessible from
397*18ccb223SMauro Carvalho Chehab    both userspace and the kernel module. The kernel module obtains an
398*18ccb223SMauro Carvalho Chehab    index to a free partition before launching an io op. The kernel module
399*18ccb223SMauro Carvalho Chehab    deposits write data into the indexed partition, to be consumed
400*18ccb223SMauro Carvalho Chehab    directly by userspace. Userspace deposits the results of read
401*18ccb223SMauro Carvalho Chehab    requests into the indexed partition, to be consumed directly
402*18ccb223SMauro Carvalho Chehab    by the kernel module.
403*18ccb223SMauro Carvalho Chehab
404*18ccb223SMauro Carvalho ChehabResponses to kernel requests are all packaged in pvfs2_downcall_t
405*18ccb223SMauro Carvalho Chehabstructs. Besides a few other members, pvfs2_downcall_t contains a
406*18ccb223SMauro Carvalho Chehabunion of structs, each of which is associated with a particular
407*18ccb223SMauro Carvalho Chehabresponse type.
408*18ccb223SMauro Carvalho Chehab
409*18ccb223SMauro Carvalho ChehabThe several members outside of the union are:
410*18ccb223SMauro Carvalho Chehab
411*18ccb223SMauro Carvalho Chehab ``int32_t type``
412*18ccb223SMauro Carvalho Chehab    - type of operation.
413*18ccb223SMauro Carvalho Chehab ``int32_t status``
414*18ccb223SMauro Carvalho Chehab    - return code for the operation.
415*18ccb223SMauro Carvalho Chehab ``int64_t trailer_size``
416*18ccb223SMauro Carvalho Chehab    - 0 unless readdir operation.
417*18ccb223SMauro Carvalho Chehab ``char *trailer_buf``
418*18ccb223SMauro Carvalho Chehab    - initialized to NULL, used during readdir operations.
419*18ccb223SMauro Carvalho Chehab
420*18ccb223SMauro Carvalho ChehabThe appropriate member inside the union is filled out for any
421*18ccb223SMauro Carvalho Chehabparticular response.
422*18ccb223SMauro Carvalho Chehab
423*18ccb223SMauro Carvalho Chehab  PVFS2_VFS_OP_FILE_IO
424*18ccb223SMauro Carvalho Chehab    fill a pvfs2_io_response_t
425*18ccb223SMauro Carvalho Chehab
426*18ccb223SMauro Carvalho Chehab  PVFS2_VFS_OP_LOOKUP
427*18ccb223SMauro Carvalho Chehab    fill a PVFS_object_kref
428*18ccb223SMauro Carvalho Chehab
429*18ccb223SMauro Carvalho Chehab  PVFS2_VFS_OP_CREATE
430*18ccb223SMauro Carvalho Chehab    fill a PVFS_object_kref
431*18ccb223SMauro Carvalho Chehab
432*18ccb223SMauro Carvalho Chehab  PVFS2_VFS_OP_SYMLINK
433*18ccb223SMauro Carvalho Chehab    fill a PVFS_object_kref
434*18ccb223SMauro Carvalho Chehab
435*18ccb223SMauro Carvalho Chehab  PVFS2_VFS_OP_GETATTR
    fill in a PVFS_sys_attr_s (tons of stuff the kernel doesn't need)
    and, when the object is a symlink, a string with the link target.
438*18ccb223SMauro Carvalho Chehab
439*18ccb223SMauro Carvalho Chehab  PVFS2_VFS_OP_MKDIR
440*18ccb223SMauro Carvalho Chehab    fill a PVFS_object_kref
441*18ccb223SMauro Carvalho Chehab
442*18ccb223SMauro Carvalho Chehab  PVFS2_VFS_OP_STATFS
443*18ccb223SMauro Carvalho Chehab    fill a pvfs2_statfs_response_t with useless info <g>. It is hard for
444*18ccb223SMauro Carvalho Chehab    us to know, in a timely fashion, these statistics about our
445*18ccb223SMauro Carvalho Chehab    distributed network filesystem.
446*18ccb223SMauro Carvalho Chehab
447*18ccb223SMauro Carvalho Chehab  PVFS2_VFS_OP_FS_MOUNT
448*18ccb223SMauro Carvalho Chehab    fill a pvfs2_fs_mount_response_t which is just like a PVFS_object_kref
449*18ccb223SMauro Carvalho Chehab    except its members are in a different order and "__pad1" is replaced
450*18ccb223SMauro Carvalho Chehab    with "id".
451*18ccb223SMauro Carvalho Chehab
452*18ccb223SMauro Carvalho Chehab  PVFS2_VFS_OP_GETXATTR
453*18ccb223SMauro Carvalho Chehab    fill a pvfs2_getxattr_response_t
454*18ccb223SMauro Carvalho Chehab
455*18ccb223SMauro Carvalho Chehab  PVFS2_VFS_OP_LISTXATTR
456*18ccb223SMauro Carvalho Chehab    fill a pvfs2_listxattr_response_t
457*18ccb223SMauro Carvalho Chehab
458*18ccb223SMauro Carvalho Chehab  PVFS2_VFS_OP_PARAM
459*18ccb223SMauro Carvalho Chehab    fill a pvfs2_param_response_t
460*18ccb223SMauro Carvalho Chehab
461*18ccb223SMauro Carvalho Chehab  PVFS2_VFS_OP_PERF_COUNT
462*18ccb223SMauro Carvalho Chehab    fill a pvfs2_perf_count_response_t
463*18ccb223SMauro Carvalho Chehab
464*18ccb223SMauro Carvalho Chehab  PVFS2_VFS_OP_FSKEY
    fill a pvfs2_fs_key_response_t
466*18ccb223SMauro Carvalho Chehab
467*18ccb223SMauro Carvalho Chehab  PVFS2_VFS_OP_READDIR
    jam everything needed to represent a pvfs2_readdir_response_t into
    the readdir buffer descriptor specified in the upcall.
470*18ccb223SMauro Carvalho Chehab
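The overall shape of a downcall (a few common members followed by a union
of per-operation responses) might look roughly like the sketch below. The
four common members follow the list above; the operation codes, the payload
structs and their member names are placeholders invented for illustration
and do not reproduce the real ``pvfs2_downcall_t`` layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical operation codes; the real values are not given above. */
enum sketch_op_type {
	SKETCH_OP_FILE_IO,
	SKETCH_OP_LOOKUP,
	SKETCH_OP_READDIR,
};

/* Placeholder payloads; the real response structs have other members. */
struct sketch_io_response { int64_t amt_complete; };
struct sketch_object_ref  { uint64_t handle; int32_t fs_id; int32_t pad; };

/* Same shape as the description: common members plus a union. */
struct sketch_downcall {
	int32_t type;          /* type of operation */
	int32_t status;        /* return code for the operation */
	int64_t trailer_size;  /* 0 unless readdir operation */
	char *trailer_buf;     /* NULL unless readdir operation */
	union {
		struct sketch_io_response io;
		struct sketch_object_ref lookup;
	} resp;
};

/* Reset a downcall and set the members every response carries. */
void sketch_downcall_init(struct sketch_downcall *dc, int32_t type,
			  int32_t status)
{
	assert(dc != NULL);
	memset(dc, 0, sizeof(*dc));
	dc->type = type;
	dc->status = status;
	dc->trailer_buf = NULL;
}

/* Readdir is the only response that attaches a trailer. */
void sketch_downcall_set_trailer(struct sketch_downcall *dc, char *buf,
				 int64_t size)
{
	assert(dc != NULL && dc->type == SKETCH_OP_READDIR);
	dc->trailer_buf = buf;
	dc->trailer_size = size;
}
```

Only the union member that matches ``type`` is meaningful for a given
response; the others share the same storage.
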
Userspace uses writev() on /dev/pvfs2-req to pass responses to the requests
made by the kernel side.

A buffer_list containing:

  - a pointer to the prepared response to the request from the
    kernel (struct pvfs2_downcall_t).
  - and also, in the case of a readdir request, a pointer to a
    buffer containing descriptors for the objects in the target
    directory.

... is sent to the function (PINT_dev_write_list) which performs
the writev.

PINT_dev_write_list has a local iovec array: struct iovec io_array[10];

The first four elements of io_array are initialized like this for all
responses::

  io_array[0].iov_base = address of local variable "proto_ver" (int32_t)
  io_array[0].iov_len = sizeof(int32_t)

  io_array[1].iov_base = address of global variable "pdev_magic" (int32_t)
  io_array[1].iov_len = sizeof(int32_t)

  io_array[2].iov_base = address of parameter "tag" (PVFS_id_gen_t)
  io_array[2].iov_len = sizeof(int64_t)

  io_array[3].iov_base = address of out_downcall member (pvfs2_downcall_t)
                         of global variable vfs_request (vfs_request_t)
  io_array[3].iov_len = sizeof(pvfs2_downcall_t)

Readdir responses initialize the fifth element of io_array like this::

  io_array[4].iov_base = contents of member trailer_buf (char *)
                         from out_downcall member of global variable
                         vfs_request
  io_array[4].iov_len = contents of member trailer_size (PVFS_size)
                        from out_downcall member of global variable
                        vfs_request

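The gather-write described above can be sketched in plain userspace C.
This is only an illustration under stated assumptions: struct
demo_downcall and demo_write_response() are hypothetical stand-ins, not
the real pvfs2_downcall_t or PINT_dev_write_list, and the descriptor
passed in can be any writable fd rather than /dev/pvfs2-req.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical stand-in for pvfs2_downcall_t; the real layout differs. */
struct demo_downcall {
    int32_t type;
    int32_t status;
    char   *trailer_buf;   /* only set for readdir-style responses */
    int64_t trailer_size;
};

/*
 * Gather one response into an iovec array and pass it to writev().
 * Elements 0..3 are always filled; element 4 is added only when the
 * downcall carries a trailer. Returns bytes written, or -1 on error.
 */
static ssize_t demo_write_response(int fd, int32_t proto_ver, int32_t magic,
                                   int64_t tag, struct demo_downcall *dc)
{
    struct iovec io_array[10];
    int count = 4;

    assert(dc != NULL);

    io_array[0].iov_base = &proto_ver;
    io_array[0].iov_len  = sizeof(proto_ver);

    io_array[1].iov_base = &magic;
    io_array[1].iov_len  = sizeof(magic);

    io_array[2].iov_base = &tag;
    io_array[2].iov_len  = sizeof(tag);

    io_array[3].iov_base = dc;
    io_array[3].iov_len  = sizeof(*dc);

    if (dc->trailer_buf != NULL && dc->trailer_size > 0) {
        io_array[4].iov_base = dc->trailer_buf;
        io_array[4].iov_len  = (size_t)dc->trailer_size;
        count = 5;
    }

    return writev(fd, io_array, count);
}
```

A single writev() call hands all of the pieces to the receiving side as
one write, so the header fields, the downcall and the optional trailer
do not have to be copied into a contiguous buffer first.
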
Orangefs exploits the dcache in order to avoid sending redundant
requests to userspace. We keep object inode attributes up-to-date with
orangefs_inode_getattr. This function uses two arguments to help it
decide whether or not to update an inode: "new" and "bypass".
Orangefs keeps private data in an object's inode that includes a short
timeout value, getattr_time, which allows any call to
orangefs_inode_getattr to know how long it has been since the inode was
updated. When the object is not new (new == 0) and the bypass flag is not
set (bypass == 0), orangefs_inode_getattr returns without updating the
inode if getattr_time has not timed out. The getattr_time value is
updated each time the inode is updated.

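The decision described above can be modelled in a few lines of userspace
C. This is a sketch, not the kernel code: the struct and function names
are hypothetical, and it assumes getattr_time stores the tick value at
which the cached attributes expire.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the private per-inode data. */
struct demo_inode_info {
    unsigned long getattr_time;  /* assumed: tick at which attrs expire */
};

/* True when tick a is later than tick b, tolerating counter wrap. */
static bool demo_tick_after(unsigned long a, unsigned long b)
{
    return (long)(b - a) < 0;
}

/*
 * Decide whether the attributes must be fetched again.
 * A new object or a set bypass flag always forces a refresh;
 * otherwise a refresh happens only once the timeout has expired.
 */
static bool demo_must_refresh(const struct demo_inode_info *info,
                              unsigned long now, int is_new, int bypass)
{
    assert(info != NULL);

    if (is_new || bypass)
        return true;
    return demo_tick_after(now, info->getattr_time);
}

/* Record a successful refresh by pushing the expiry into the future. */
static void demo_mark_refreshed(struct demo_inode_info *info,
                                unsigned long now, unsigned long timeout)
{
    assert(info != NULL);
    info->getattr_time = now + timeout;
}
```

With a timeout of 50 ticks set at tick 1000, a lookup at tick 1010 with
both flags clear is served from the cached attributes, while the same
lookup with either flag set, or at tick 1051, goes back for fresh data.
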
Creation of a new object (file, dir, sym-link) includes the evaluation of
its pathname, resulting in a negative directory entry for the object.
A new inode is allocated and associated with the dentry, turning it from
a negative dentry into a "productive full member of society". Orangefs
obtains the new inode from Linux with new_inode() and associates
the inode with the dentry by sending the pair back to Linux with
d_instantiate().

The evaluation of a pathname for an object resolves to its corresponding
dentry. If there is no corresponding dentry, one is created for it in
the dcache. Whenever a dentry is modified or verified, Orangefs stores a
short timeout value in the dentry's d_time, and the dentry will be trusted
for that amount of time. Orangefs is a network filesystem, and objects
can potentially change out-of-band with any particular Orangefs kernel module
instance, so trusting a dentry is risky. The alternative to trusting
dentries is to always obtain the needed information from userspace - at
least a trip to the client-core, maybe to the servers. Obtaining information
from a dentry is cheap; obtaining it from userspace is relatively expensive,
hence the motivation to use the dentry when possible.

The timeout values d_time and getattr_time are jiffy-based, and the
code is designed to avoid the jiffy-wrap problem::

    "In general, if the clock may have wrapped around more than once, there
    is no way to tell how much time has elapsed. However, if the times t1
    and t2 are known to be fairly close, we can reliably compute the
    difference in a way that takes into account the possibility that the
    clock may have wrapped between times."

from course notes by instructor Andy Wang

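The wrap-safe arithmetic the quote refers to can be illustrated in
userspace C. This is a minimal sketch with hypothetical helper names; it
uses an unsigned long tick counter as a stand-in for jiffies and assumes
the two times are less than half the counter range apart.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/*
 * Ticks elapsed from t1 to t2. Unsigned subtraction is performed
 * modulo ULONG_MAX + 1, so the result is correct even when the
 * counter wrapped once between the two samples.
 */
static unsigned long demo_ticks_elapsed(unsigned long t1, unsigned long t2)
{
    return t2 - t1;
}

/*
 * True when "now" is at or beyond "deadline". Interpreting the
 * unsigned difference as signed keeps the comparison correct across
 * a single wrap, as long as the two values are fairly close.
 */
static bool demo_deadline_passed(unsigned long now, unsigned long deadline)
{
    return (long)(now - deadline) >= 0;
}
```

A naive comparison such as ``now >= deadline`` gives the wrong answer
when the deadline was computed just before the counter wrapped. Inside
the kernel, the time_after() and time_before() family of macros in
<linux/jiffies.h> provide the same kind of wrap-tolerant comparison.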