.. SPDX-License-Identifier: GPL-2.0

========
ORANGEFS
========

OrangeFS is an LGPL userspace scale-out parallel storage system. It is well
suited to the large storage problems faced by HPC, big data, streaming video,
genomics, and bioinformatics.

OrangeFS, originally called PVFS, was first developed in 1993 by
Walt Ligon and Eric Blumer as a parallel file system for Parallel
Virtual Machine (PVM) as part of a NASA grant to study the I/O patterns
of parallel programs.

OrangeFS features include:

  * Distributes file data among multiple file servers
  * Supports simultaneous access by multiple clients
  * Stores file data and metadata on servers using the local file system
    and access methods
  * Userspace implementation is easy to install and maintain
  * Direct MPI support
  * Stateless


Mailing List Archives
=====================

http://lists.orangefs.org/pipermail/devel_lists.orangefs.org/


Mailing List Submissions
========================

devel@lists.orangefs.org


Documentation
=============

http://www.orangefs.org/documentation/


Userspace Filesystem Source
===========================

http://www.orangefs.org/download

OrangeFS versions prior to 2.9.3 are not compatible with the
upstream version of the kernel client.


Running ORANGEFS On a Single Server
===================================

OrangeFS is usually run in large installations with multiple servers and
clients, but a complete filesystem can be run on a single machine for
development and testing.

On Fedora, install orangefs and orangefs-server::

    dnf -y install orangefs orangefs-server

There is an example server configuration file in
/etc/orangefs/orangefs.conf. Change localhost to your hostname if
necessary.

To generate a filesystem to run xfstests against, see below.

There is an example client configuration file in /etc/pvfs2tab. It is a
single line. Uncomment it and change the hostname if necessary. This
controls clients which use libpvfs2. This does not control the
pvfs2-client-core.

Create the filesystem::

    pvfs2-server -f /etc/orangefs/orangefs.conf

Start the server::

    systemctl start orangefs-server

Test the server::

    pvfs2-ping -m /pvfsmnt

Start the client. The module must be compiled in or loaded before this
point::

    systemctl start orangefs-client

Mount the filesystem::

    mount -t pvfs2 tcp://localhost:3334/orangefs /pvfsmnt


Building ORANGEFS on a Single Server
====================================

Where OrangeFS cannot be installed from distribution packages, it may be
built from source.

You can omit --prefix if you don't care that things are sprinkled around
in /usr/local. As of version 2.9.6, OrangeFS uses Berkeley DB by
default; the default will probably change to LMDB soon.

::

    ./configure --prefix=/opt/ofs --with-db-backend=lmdb

    make

    make install

Create an orangefs config file::

    /opt/ofs/bin/pvfs2-genconfig /etc/pvfs2.conf

Create an /etc/pvfs2tab file::

    echo tcp://localhost:3334/orangefs /pvfsmnt pvfs2 defaults,noauto 0 0 > \
        /etc/pvfs2tab

Create the mount point you specified in the tab file if needed::

    mkdir /pvfsmnt

Bootstrap the server::

    /opt/ofs/sbin/pvfs2-server -f /etc/pvfs2.conf

Start the server::

    /opt/ofs/sbin/pvfs2-server /etc/pvfs2.conf

Now the server should be running.
pvfs2-ls is a simple
test to verify that the server is running::

    /opt/ofs/bin/pvfs2-ls /pvfsmnt

If everything seems to be working, load the kernel module and
turn on the client core::

    /opt/ofs/sbin/pvfs2-client -p /opt/ofs/sbin/pvfs2-client-core

Mount your filesystem::

    mount -t pvfs2 tcp://localhost:3334/orangefs /pvfsmnt


Running xfstests
================

It is useful to use a scratch filesystem with xfstests. This can be
done with only one server.

Make a second copy of the FileSystem section in the server configuration
file, which is /etc/orangefs/orangefs.conf. Change the Name to scratch.
Change the ID to something other than the ID of the first FileSystem
section (2 is usually a good choice).

Then there are two FileSystem sections: orangefs and scratch.

This change should be made before creating the filesystem.

::

    pvfs2-server -f /etc/orangefs/orangefs.conf

To run xfstests, create /etc/xfsqa.config::

    TEST_DIR=/orangefs
    TEST_DEV=tcp://localhost:3334/orangefs
    SCRATCH_MNT=/scratch
    SCRATCH_DEV=tcp://localhost:3334/scratch

Then xfstests can be run::

    ./check -pvfs2


Options
=======

The following mount options are accepted:

  acl
    Allow the use of Access Control Lists on files and directories.

  intr
    Some operations between the kernel client and the user space
    filesystem can be interruptible, such as changes in debug levels
    and the setting of tunable parameters.

  local_lock
    Enable POSIX locking from the perspective of "this" kernel. The
    default file_operations lock action is to return ENOSYS. POSIX
    locking kicks in if the filesystem is mounted with -o local_lock.
    Distributed locking is being worked on for the future.


Debugging
=========

To send the debug (GOSSIP) statements from a particular
source file (inode.c, for example) to syslog::

  echo inode > /sys/kernel/debug/orangefs/kernel-debug

No debugging (the default)::

  echo none > /sys/kernel/debug/orangefs/kernel-debug

Debugging from several source files::

  echo inode,dir > /sys/kernel/debug/orangefs/kernel-debug

All debugging::

  echo all > /sys/kernel/debug/orangefs/kernel-debug

Get a list of all debugging keywords::

  cat /sys/kernel/debug/orangefs/debug-help


Protocol between Kernel Module and Userspace
============================================

OrangeFS is a user space filesystem and an associated kernel module.
We'll just refer to the user space part of OrangeFS as "userspace"
from here on out.
OrangeFS descends from PVFS, and userspace code
still uses PVFS for function and variable names. Userspace typedefs
many of the important structures. Function and variable names in
the kernel module have been transitioned to "orangefs", and the Linux
coding style avoids typedefs, so kernel module structures that
correspond to userspace structures are not typedefed.

The kernel module implements a pseudo device that userspace
can read from and write to. Userspace can also manipulate the
kernel module through the pseudo device with ioctl.

The Bufmap
----------

At startup userspace allocates two page-size-aligned (posix_memalign)
mlocked memory buffers: one is used for IO and one is used for readdir
operations. The IO buffer is 41943040 bytes and the readdir buffer is
4194304 bytes. Each buffer contains logical chunks, or partitions, and
a pointer to each buffer is added to its own PVFS_dev_map_desc structure
which also describes its total size, as well as the size and number of
the partitions.

A pointer to the IO buffer's PVFS_dev_map_desc structure is sent to a
mapping routine in the kernel module with an ioctl.
The structure is
copied from user space to kernel space with copy_from_user and is used
to initialize the kernel module's "bufmap" (struct orangefs_bufmap), which
then contains:

  * refcnt - a reference counter.
  * desc_size - PVFS2_BUFMAP_DEFAULT_DESC_SIZE (4194304) - the IO buffer's
    partition size, which represents the filesystem's block size and
    is used for s_blocksize in super blocks.
  * desc_count - PVFS2_BUFMAP_DEFAULT_DESC_COUNT (10) - the number of
    partitions in the IO buffer.
  * desc_shift - log2(desc_size), used for s_blocksize_bits in super blocks.
  * total_size - the total size of the IO buffer.
  * page_count - the number of 4096 byte pages in the IO buffer.
  * page_array - a pointer to ``page_count * (sizeof(struct page*))`` bytes
    of kcalloced memory. This memory is used as an array of pointers
    to each of the pages in the IO buffer through a call to get_user_pages.
  * desc_array - a pointer to ``desc_count * (sizeof(struct orangefs_bufmap_desc))``
    bytes of kcalloced memory. This memory is further initialized:

      user_desc is the kernel's copy of the IO buffer's ORANGEFS_dev_map_desc
      structure. user_desc->ptr points to the IO buffer.

      ::

        pages_per_desc = bufmap->desc_size / PAGE_SIZE
        offset = 0

        bufmap->desc_array[0].page_array = &bufmap->page_array[offset]
        bufmap->desc_array[0].array_count = pages_per_desc = 1024
        bufmap->desc_array[0].uaddr = (user_desc->ptr) + (0 * 1024 * 4096)
        offset += 1024
                           .
                           .
                           .
        bufmap->desc_array[9].page_array = &bufmap->page_array[offset]
        bufmap->desc_array[9].array_count = pages_per_desc = 1024
        bufmap->desc_array[9].uaddr = (user_desc->ptr) +
                                               (9 * 1024 * 4096)
        offset += 1024

  * buffer_index_array - a desc_count sized array of ints, used to
    indicate which of the IO buffer's partitions are available to use.
  * buffer_index_lock - a spinlock to protect buffer_index_array during update.
  * readdir_index_array - a five (ORANGEFS_READDIR_DEFAULT_DESC_COUNT) element
    int array used to indicate which of the readdir buffer's partitions are
    available to use.
  * readdir_index_lock - a spinlock to protect readdir_index_array during
    update.

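
The partition arithmetic described for desc_array can be checked with a small
userspace sketch. This is not the kernel module's code: ``sketch_desc``,
``sketch_partition`` and ``sketch_shift`` are hypothetical names, the page
array is reduced to an index, and only the sizes quoted above (4096-byte
pages, 4194304-byte partitions, 10 partitions) come from the text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sizes quoted in the text; the macro names are illustrative only. */
#define SKETCH_PAGE_SIZE   4096u
#define SKETCH_DESC_SIZE   4194304u   /* partition size in bytes */
#define SKETCH_DESC_COUNT  10u        /* partitions in the IO buffer */

/* Hypothetical stand-in for one desc_array element. */
struct sketch_desc {
    size_t    first_page;   /* index of the partition's first page */
    size_t    array_count;  /* number of pages in the partition */
    uintptr_t uaddr;        /* user address where the partition starts */
};

/* log2 of a power-of-two size, i.e. the desc_shift value. */
unsigned int sketch_shift(size_t size)
{
    unsigned int shift = 0;

    assert(size != 0 && (size & (size - 1)) == 0);
    while ((size >> shift) != 1)
        shift++;
    return shift;
}

/* Fill descs[0..count-1] by walking the buffer one partition at a time. */
void sketch_partition(struct sketch_desc *descs, size_t count,
                      size_t desc_size, uintptr_t user_ptr)
{
    size_t pages_per_desc = desc_size / SKETCH_PAGE_SIZE;
    size_t offset = 0;
    size_t i;

    for (i = 0; i < count; i++) {
        descs[i].first_page = offset;
        descs[i].array_count = pages_per_desc;
        descs[i].uaddr = user_ptr + i * pages_per_desc * SKETCH_PAGE_SIZE;
        offset += pages_per_desc;
    }
}
```

With the default sizes each partition spans 1024 pages, the last partition
starts at page 9216, and desc_shift works out to 22.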

Operations
----------

The kernel module builds an "op" (struct orangefs_kernel_op_s) when it
needs to communicate with userspace. Part of the op contains the "upcall"
which expresses the request to userspace. Part of the op eventually
contains the "downcall" which expresses the results of the request.

The slab allocator is used to keep a cache of op structures handy.

At init time the kernel module defines and initializes a request list
and an in_progress hash table to keep track of all the ops that are
in flight at any given time.

Ops are stateful:

 * unknown - op was just initialized
 * waiting - op is on request_list (upward bound)
 * inprogr - op is in progress (waiting for downcall)
 * serviced - op has matching downcall; ok
 * purged - op has to start a timer since client-core
   exited uncleanly before servicing op
 * given up - submitter has given up waiting for it

When some arbitrary userspace program needs to perform a
filesystem operation on OrangeFS (readdir, I/O, create, and so on)
an op structure is initialized and tagged with a distinguishing ID
number. The upcall part of the op is filled out, and the op is
passed to the "service_operation" function.

service_operation changes the op's state to "waiting", puts
it on the request list, and signals the OrangeFS file_operations.poll
function through a wait queue. Userspace is polling the pseudo-device
and thus becomes aware of the upcall request that needs to be read.

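
The submit side of this flow can be modelled in a few lines of plain C. The
sketch below is an illustration under simplifying assumptions, not the
module's implementation: all ``sketch_*`` names are hypothetical, there is no
locking, and a counter stands in for waking the poll wait queue.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* State names follow the list above; the numeric values are arbitrary. */
enum sketch_op_state {
    SKETCH_OP_UNKNOWN,   /* op was just initialized */
    SKETCH_OP_WAITING,   /* op is on the request list */
    SKETCH_OP_INPROGR,   /* upcall was read, downcall still pending */
    SKETCH_OP_SERVICED,  /* matching downcall arrived */
    SKETCH_OP_PURGED,    /* client-core exited before servicing the op */
    SKETCH_OP_GIVEN_UP   /* submitter stopped waiting */
};

struct sketch_op {
    uint64_t tag;                /* distinguishing ID number */
    enum sketch_op_state state;
    struct sketch_op *next;      /* link on the request list */
};

struct sketch_request_list {
    struct sketch_op *head;
    struct sketch_op *tail;
    uint64_t next_tag;           /* source of unique tags */
    int pending;                 /* stand-in for the poll wait queue */
};

/* Initialize an op and give it a tag that no other live op has. */
void sketch_op_init(struct sketch_request_list *rl, struct sketch_op *op)
{
    op->tag = rl->next_tag++;
    op->state = SKETCH_OP_UNKNOWN;
    op->next = NULL;
}

/* Mark the op "waiting", append it to the list, and flag the poller. */
void sketch_submit(struct sketch_request_list *rl, struct sketch_op *op)
{
    assert(op->state == SKETCH_OP_UNKNOWN);
    op->state = SKETCH_OP_WAITING;
    op->next = NULL;
    if (rl->tail)
        rl->tail->next = op;
    else
        rl->head = op;
    rl->tail = op;
    rl->pending++;
}

/* What a poll handler would report: nonzero when an upcall can be read. */
int sketch_poll_readable(const struct sketch_request_list *rl)
{
    return rl->pending > 0;
}
```

A real implementation would protect the list with a lock and block the
submitter until the downcall arrives; both are left out here to keep the
state transition visible.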

When the OrangeFS file_operations.read function is triggered, the
request list is searched for an op that seems ready-to-process.
The op is removed from the request list. The tag from the op and
the filled-out upcall struct are copy_to_user'ed back to userspace.

If any of these copy_to_user calls (or those for some additional protocol
data) fail, the op's state is set to "waiting" and the op is added back to
the request list. Otherwise, the op's state is changed to "in progress",
and the op is hashed on its tag and put onto the end of a list in the
in_progress hash table at the index the tag hashed to.

When userspace has assembled the response to the upcall, it
writes the response, which includes the distinguishing tag, back to
the pseudo device in a series of io_vecs. This triggers the OrangeFS
file_operations.write_iter function to find the op with the associated
tag and remove it from the in_progress hash table. As long as the op's
state is not "canceled" or "given up", its state is set to "serviced".
The file_operations.write_iter function returns to the waiting vfs,
and back to service_operation through wait_for_matching_downcall.

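
The in_progress bookkeeping can be sketched as a small chained hash table
keyed on the tag. The names, the bucket count and the modulo hash below are
assumptions made for illustration; the sketch also models only the
"given up" exception, has no locking, and omits the copy_to_user failure path.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_HASH_SIZE 16u   /* arbitrary bucket count for the sketch */

enum sketch_state { SK_WAITING, SK_INPROGR, SK_SERVICED, SK_GIVEN_UP };

struct sketch_op {
    uint64_t tag;
    enum sketch_state state;
    struct sketch_op *next;    /* link within one hash bucket */
};

struct sketch_in_progress {
    struct sketch_op *bucket[SKETCH_HASH_SIZE];
};

/* Hypothetical hash: the bucket index is the tag modulo the table size. */
size_t sketch_hash(uint64_t tag)
{
    return (size_t)(tag % SKETCH_HASH_SIZE);
}

/* Read path, success case: put the op at the end of its bucket's list. */
void sketch_mark_in_progress(struct sketch_in_progress *t,
                             struct sketch_op *op)
{
    struct sketch_op **link = &t->bucket[sketch_hash(op->tag)];

    while (*link)
        link = &(*link)->next;
    op->next = NULL;
    op->state = SK_INPROGR;
    *link = op;
}

/*
 * Write path: find the op with a matching tag, unlink it, and mark it
 * serviced unless the submitter already gave up. Returns NULL when no
 * op with that tag is in progress.
 */
struct sketch_op *sketch_complete(struct sketch_in_progress *t, uint64_t tag)
{
    struct sketch_op **link = &t->bucket[sketch_hash(tag)];

    for (; *link; link = &(*link)->next) {
        struct sketch_op *op = *link;

        if (op->tag != tag)
            continue;
        *link = op->next;
        op->next = NULL;
        if (op->state != SK_GIVEN_UP)
            op->state = SK_SERVICED;
        return op;
    }
    return NULL;
}
```

Keying the table on the tag lets the write path locate the op from the
response alone, which is why the tag travels with both the upcall and the
downcall.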

service_operation returns to its caller with the op's downcall
part (the response to the upcall) filled out.

The "client-core" is the bridge between the kernel module and
userspace. The client-core is a daemon. The client-core has an
associated watchdog daemon. If the client-core is ever signaled
to die, the watchdog daemon restarts the client-core. Even though
the client-core is restarted "right away", there is a period of
time during such an event that the client-core is dead. A dead client-core
can't be triggered by the OrangeFS file_operations.poll function.
Ops that pass through service_operation during a "dead spell" can time out
on the wait queue and one attempt is made to recycle them. Obviously,
if the client-core stays dead too long, the arbitrary userspace processes
trying to use OrangeFS will be negatively affected. Waiting ops
that can't be serviced will be removed from the request list and
have their states set to "given up". In-progress ops that can't
be serviced will be removed from the in_progress hash table and
have their states set to "given up".

Readdir and I/O ops are atypical with respect to their payloads.

  - readdir ops use the smaller of the two pre-allocated pre-partitioned
    memory buffers. The readdir buffer is only available to userspace.
    The kernel module obtains an index to a free partition before launching
    a readdir op. Userspace deposits the results into the indexed partition
    and then writes them back to the pvfs device.

  - io (read and write) ops use the larger of the two pre-allocated
    pre-partitioned memory buffers. The IO buffer is accessible from
    both userspace and the kernel module. The kernel module obtains an
    index to a free partition before launching an io op. The kernel module
    deposits write data into the indexed partition, to be consumed
    directly by userspace. Userspace deposits the results of read
    requests into the indexed partition, to be consumed directly
    by the kernel module.

Responses to kernel requests are all packaged in pvfs2_downcall_t
structs. Besides a few other members, pvfs2_downcall_t contains a
union of structs, each of which is associated with a particular
response type.

The several members outside of the union are:

 ``int32_t type``
    - type of operation.
 ``int32_t status``
    - return code for the operation.
 ``int64_t trailer_size``
    - 0 unless readdir operation.
 ``char *trailer_buf``
    - initialized to NULL, used during readdir operations.

The appropriate member inside the union is filled out for any
particular response.

  PVFS2_VFS_OP_FILE_IO
    fill a pvfs2_io_response_t

  PVFS2_VFS_OP_LOOKUP
    fill a PVFS_object_kref

  PVFS2_VFS_OP_CREATE
    fill a PVFS_object_kref

  PVFS2_VFS_OP_SYMLINK
    fill a PVFS_object_kref

  PVFS2_VFS_OP_GETATTR
    fill in a PVFS_sys_attr_s (tons of stuff the kernel doesn't need)
    fill in a string with the link target when the object is a symlink.

  PVFS2_VFS_OP_MKDIR
    fill a PVFS_object_kref

  PVFS2_VFS_OP_STATFS
    fill a pvfs2_statfs_response_t with useless info <g>. It is hard for
    us to know, in a timely fashion, these statistics about our
    distributed network filesystem.

  PVFS2_VFS_OP_FS_MOUNT
    fill a pvfs2_fs_mount_response_t which is just like a PVFS_object_kref
    except its members are in a different order and "__pad1" is replaced
    with "id".

  PVFS2_VFS_OP_GETXATTR
    fill a pvfs2_getxattr_response_t

  PVFS2_VFS_OP_LISTXATTR
    fill a pvfs2_listxattr_response_t

  PVFS2_VFS_OP_PARAM
    fill a pvfs2_param_response_t

  PVFS2_VFS_OP_PERF_COUNT
    fill a pvfs2_perf_count_response_t

  PVFS2_VFS_OP_FSKEY
    fill a pvfs2_fs_key_response_t

  PVFS2_VFS_OP_READDIR
    pack everything needed to represent a pvfs2_readdir_response_t into
    the readdir buffer descriptor specified in the upcall.

Userspace uses writev() on /dev/pvfs2-req to pass responses to the requests
made by the kernel side.

A buffer_list containing:

  - a pointer to the prepared response to the request from the
    kernel (struct pvfs2_downcall_t).
  - and also, in the case of a readdir request, a pointer to a
    buffer containing descriptors for the objects in the target
    directory.

... is sent to the function (PINT_dev_write_list) which performs
the writev.

PINT_dev_write_list has a local iovec array: struct iovec io_array[10];

The first four elements of io_array are initialized like this for all
responses::

  io_array[0].iov_base = address of local variable "proto_ver" (int32_t)
  io_array[0].iov_len = sizeof(int32_t)

  io_array[1].iov_base = address of global variable "pdev_magic" (int32_t)
  io_array[1].iov_len = sizeof(int32_t)

  io_array[2].iov_base = address of parameter "tag" (PVFS_id_gen_t)
  io_array[2].iov_len = sizeof(int64_t)

  io_array[3].iov_base = address of out_downcall member (pvfs2_downcall_t)
                         of global variable vfs_request (vfs_request_t)
  io_array[3].iov_len = sizeof(pvfs2_downcall_t)

Readdir responses initialize the fifth element of io_array like this::

  io_array[4].iov_base = contents of member trailer_buf (char *)
                         from
out_downcall member of global variable 507*18ccb223SMauro Carvalho Chehab vfs_request 508*18ccb223SMauro Carvalho Chehab io_array[4].iov_len = contents of member trailer_size (PVFS_size) 509*18ccb223SMauro Carvalho Chehab from out_downcall member of global variable 510*18ccb223SMauro Carvalho Chehab vfs_request 511*18ccb223SMauro Carvalho Chehab 512*18ccb223SMauro Carvalho ChehabOrangefs exploits the dcache in order to avoid sending redundant 513*18ccb223SMauro Carvalho Chehabrequests to userspace. We keep object inode attributes up-to-date with 514*18ccb223SMauro Carvalho Chehaborangefs_inode_getattr. Orangefs_inode_getattr uses two arguments to 515*18ccb223SMauro Carvalho Chehabhelp it decide whether or not to update an inode: "new" and "bypass". 516*18ccb223SMauro Carvalho ChehabOrangefs keeps private data in an object's inode that includes a short 517*18ccb223SMauro Carvalho Chehabtimeout value, getattr_time, which allows any iteration of 518*18ccb223SMauro Carvalho Chehaborangefs_inode_getattr to know how long it has been since the inode was 519*18ccb223SMauro Carvalho Chehabupdated. When the object is not new (new == 0) and the bypass flag is not 520*18ccb223SMauro Carvalho Chehabset (bypass == 0) orangefs_inode_getattr returns without updating the inode 521*18ccb223SMauro Carvalho Chehabif getattr_time has not timed out. Getattr_time is updated each time the 522*18ccb223SMauro Carvalho Chehabinode is updated. 523*18ccb223SMauro Carvalho Chehab 524*18ccb223SMauro Carvalho ChehabCreation of a new object (file, dir, sym-link) includes the evaluation of 525*18ccb223SMauro Carvalho Chehabits pathname, resulting in a negative directory entry for the object. 526*18ccb223SMauro Carvalho ChehabA new inode is allocated and associated with the dentry, turning it from 527*18ccb223SMauro Carvalho Chehaba negative dentry into a "productive full member of society". 
Orangefs 528*18ccb223SMauro Carvalho Chehabobtains the new inode from Linux with new_inode() and associates 529*18ccb223SMauro Carvalho Chehabthe inode with the dentry by sending the pair back to Linux with 530*18ccb223SMauro Carvalho Chehabd_instantiate(). 531*18ccb223SMauro Carvalho Chehab 532*18ccb223SMauro Carvalho ChehabThe evaluation of a pathname for an object resolves to its corresponding 533*18ccb223SMauro Carvalho Chehabdentry. If there is no corresponding dentry, one is created for it in 534*18ccb223SMauro Carvalho Chehabthe dcache. Whenever a dentry is modified or verified Orangefs stores a 535*18ccb223SMauro Carvalho Chehabshort timeout value in the dentry's d_time, and the dentry will be trusted 536*18ccb223SMauro Carvalho Chehabfor that amount of time. Orangefs is a network filesystem, and objects 537*18ccb223SMauro Carvalho Chehabcan potentially change out-of-band with any particular Orangefs kernel module 538*18ccb223SMauro Carvalho Chehabinstance, so trusting a dentry is risky. The alternative to trusting 539*18ccb223SMauro Carvalho Chehabdentries is to always obtain the needed information from userspace - at 540*18ccb223SMauro Carvalho Chehableast a trip to the client-core, maybe to the servers. Obtaining information 541*18ccb223SMauro Carvalho Chehabfrom a dentry is cheap, obtaining it from userspace is relatively expensive, 542*18ccb223SMauro Carvalho Chehabhence the motivation to use the dentry when possible. 543*18ccb223SMauro Carvalho Chehab 544*18ccb223SMauro Carvalho ChehabThe timeout values d_time and getattr_time are jiffy based, and the 545*18ccb223SMauro Carvalho Chehabcode is designed to avoid the jiffy-wrap problem:: 546*18ccb223SMauro Carvalho Chehab 547*18ccb223SMauro Carvalho Chehab "In general, if the clock may have wrapped around more than once, there 548*18ccb223SMauro Carvalho Chehab is no way to tell how much time has elapsed. 
However, if the times t1 549*18ccb223SMauro Carvalho Chehab and t2 are known to be fairly close, we can reliably compute the 550*18ccb223SMauro Carvalho Chehab difference in a way that takes into account the possibility that the 551*18ccb223SMauro Carvalho Chehab clock may have wrapped between times." 552*18ccb223SMauro Carvalho Chehab 553*18ccb223SMauro Carvalho Chehabfrom course notes by instructor Andy Wang 554*18ccb223SMauro Carvalho Chehab 555
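
The wrap-safe comparison described in the quotation can be illustrated
with a minimal userspace sketch. It is not the Orangefs code: the names
tick_t, tick_after, tick_expiry and cache_still_valid are hypothetical,
and a 32-bit counter is assumed in place of the kernel's jiffies. The
idea is the same one behind the kernel's time_after() family of macros
in linux/jiffies.h: subtract the two unsigned values and interpret the
result as signed, which gives the right answer as long as the two times
are less than half the counter range apart.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's jiffies counter type. */
typedef uint32_t tick_t;

/*
 * Return true if time a is strictly after time b.
 * The unsigned subtraction wraps modulo 2^32; reading the result as a
 * signed value yields the sign of the distance between the two times.
 * Only valid when a and b are less than 2^31 ticks apart.
 */
static bool tick_after(tick_t a, tick_t b)
{
	return (int32_t)(uint32_t)(b - a) < 0;
}

/* Compute an expiry time "timeout" ticks from "now"; may wrap past zero. */
static tick_t tick_expiry(tick_t now, tick_t timeout)
{
	return (tick_t)(now + timeout);
}

/* A cached value is trusted until "now" has moved past "expiry". */
static bool cache_still_valid(tick_t now, tick_t expiry)
{
	return !tick_after(now, expiry);
}
```

With this scheme, an expiry time computed shortly before the counter
wraps still compares correctly against a "now" value read shortly after
the wrap, which is the property a stored d_time or getattr_time needs.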