.\" Copyright (c) 2018 Rick Macklem
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.Dd December 20, 2019
.Dt PNFSSERVER 4
.Os
.Sh NAME
.Nm pNFSserver
.Nd NFS Version 4.1 and 4.2 Parallel NFS Protocol Server
.Sh DESCRIPTION
A set of
.Fx
servers may be configured to provide a
.Xr pnfs 4
service.
One
.Fx
system needs to be configured as a MetaData Server (MDS) and
at least one additional
.Fx
system needs to be configured as one or
more Data Servers (DSs).
.Pp
These
.Fx
systems are configured to be NFSv4.1 and NFSv4.2
servers; see
.Xr nfsd 8
and
.Xr exports 5
if you are not familiar with configuring an NFSv4.n server.
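As a rough sketch only, the
.Xr rc.conf 5
entries that enable an NFSv4 server typically look like the following.
Whether
.Dq nfsuserd_enable
is needed is an assumption that depends on how user and group names are
mapped at your site;
.Xr nfsd 8
and
.Xr exports 5
remain the authoritative references.
.Bd -literal -offset indent
nfs_server_enable="YES"
nfsv4_server_enable="YES"
# Assumption: name mapping via nfsuserd(8) is wanted on this system.
nfsuserd_enable="YES"
.Ed
.sp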
All DS(s) and the MDS should support NFSv4.2 as well as NFSv4.1.
Mixing an MDS that supports NFSv4.2 with any DS(s) that do not support
NFSv4.2 will not work correctly.
As such, all DS(s) must be upgraded from
.Fx 12
to
.Fx 13
before upgrading the MDS.
.Sh DS server configuration
The DS(s) need to be configured as NFSv4.1 and NFSv4.2 server(s),
with a top level exported
directory used for storage of data files.
This directory must be owned by
.Dq root
and would normally have a mode of
.Dq 700 .
Within this directory there need to be additional directories named
ds0,...,dsN (where N is 19 by default) also owned by
.Dq root
with mode
.Dq 700 .
These are the directories where the data files are stored.
The following command can be run by root when in the top level exported
directory to create these subdirectories.
.Bd -literal -offset indent
jot -w ds 20 0 | xargs mkdir -m 700
.Ed
.sp
Note that
.Dq 20
is the default and can be set to a larger value on the MDS as shown below.
.sp
The top level exported directory used for storage of data files must be
exported to the MDS with the
.Dq maproot=root sec=sys
export options so that the MDS can create entries in these subdirectories.
It must also be exported to all pNFS aware clients, but these clients do
not require the
.Dq maproot=root
export option and this directory should be exported to them with the same
options as used by the MDS to export file system(s) to the clients.
.Pp
It is possible to have multiple DSs on the same
.Fx
system, but each
of these DSs must have a separate top level exported directory used for storage
of data files and each
of these DSs must be mountable via a separate IP address.
Alias addresses can be set on the DS server system for a network
interface via
.Xr ifconfig 8
to create these different IP addresses.
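For instance, such an alias could be made persistent in the
.Xr rc.conf 5
of the DS system.
This is only a sketch: the interface name
.Dq em0
and the address
.Dq 192.0.2.11
are placeholders chosen for illustration and are not part of this service.
.Bd -literal -offset indent
# Hypothetical interface and address; substitute your own.
ifconfig_em0_alias0="inet 192.0.2.11 netmask 255.255.255.255"
.Ed
.sp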
Multiple DSs on the same server may be useful when data for different file systems
on the MDS are being stored on different file system volumes on the
.Fx
DS system.
.Sh MDS server configuration
The MDS must be a separate
.Fx
system from the
.Fx
DS system(s) and
NFS clients.
It is configured as an NFSv4.1 and NFSv4.2 server with
file system(s) exported to clients.
However, the
.Dq -p
command line argument for
.Xr nfsd 8
is used to indicate that it is running as the MDS for a pNFS server.
.Pp
The DS(s) must all be mounted on the MDS using the following mount options:
.Bd -literal -offset indent
nfsv4,minorversion=2,soft,retrans=2
.Ed
.sp
so that they can be defined as DSs in the
.Dq -p
option.
Normally these mounts would be entered in the
.Xr fstab 5
on the MDS.
For example, if there are four DSs named nfsv4-data[0-3], the
.Xr fstab 5
lines might look like:
.Bd -literal -offset
nfsv4-data0:/ /data0 nfs rw,nfsv4,minorversion=2,soft,retrans=2 0 0
nfsv4-data1:/ /data1 nfs rw,nfsv4,minorversion=2,soft,retrans=2 0 0
nfsv4-data2:/ /data2 nfs rw,nfsv4,minorversion=2,soft,retrans=2 0 0
nfsv4-data3:/ /data3 nfs rw,nfsv4,minorversion=2,soft,retrans=2 0 0
.Ed
.sp
The
.Xr nfsd 8
command line option
.Dq -p
indicates that the NFS server is a pNFS MDS and specifies what
DSs are to be used.
.br
For the above
.Xr fstab 5
example, the
.Xr nfsd 8
nfs_server_flags line in your
.Xr rc.conf 5
might look like:
.Bd -literal -offset
nfs_server_flags="-u -t -n 128 -p nfsv4-data0:/data0,nfsv4-data1:/data1,nfsv4-data2:/data2,nfsv4-data3:/data3"
.Ed
.sp
This example specifies that the data files should be distributed over the
four DSs and File layouts will be issued to pNFS enabled clients.
If issuing Flexible File layouts is desired for this case, setting the sysctl
.Dq vfs.nfsd.default_flexfile
to a non-zero value in your
.Xr sysctl.conf 5
file will make the
.Nm
do that.
.br
Alternatively, this variant of
.Dq nfs_server_flags
will specify that two way mirroring is to be done, via the
.Dq -m
command line option.
.Bd -literal -offset
nfs_server_flags="-u -t -n 128 -p nfsv4-data0:/data0,nfsv4-data1:/data1,nfsv4-data2:/data2,nfsv4-data3:/data3 -m 2"
.Ed
.sp
With two way mirroring, the data file for each exported file on the MDS
will be stored on two of the DSs.
When mirroring is enabled, the server will always issue Flexible File layouts.
.Pp
It is also possible to specify which DSs are to be used to store data files for
specific exported file systems on the MDS.
For example, if the MDS has exported two file systems
.Dq /export1
and
.Dq /export2
to clients, the following variant of
.Dq nfs_server_flags
will specify that data files for
.Dq /export1
will be stored on nfsv4-data0 and nfsv4-data1, whereas the data files for
.Dq /export2
will be stored on nfsv4-data2 and nfsv4-data3.
.Bd -literal -offset
nfs_server_flags="-u -t -n 128 -p nfsv4-data0:/data0#/export1,nfsv4-data1:/data1#/export1,nfsv4-data2:/data2#/export2,nfsv4-data3:/data3#/export2"
.Ed
.sp
This can be used by system administrators to control where data files are
stored and might be useful for control of storage use.
For this case, it may be convenient to co-locate more than one of the DSs
on the same
.Fx
server, using separate file systems on the DS system
for storage of the respective DS's data files.
If mirroring is desired for this case, the
.Dq -m
option also needs to be specified.
There must be enough DSs assigned to each exported file system on the MDS
to support the level of mirroring.
The above example would be fine for two way mirroring, but four way mirroring
would not work, since there are only two DSs assigned to each exported file
system on the MDS.
.Pp
The number of subdirectories in each DS is defined by the
.Dq vfs.nfs.dsdirsize
sysctl on the MDS.
This value can be increased from the default of 20, but only when the
.Xr nfsd 8
is not running and after the additional ds20,... subdirectories have been
created on all the DSs.
For a service that will store a large number of files this sysctl should be
set much larger, to keep the number of entries in a subdirectory from
getting too large.
.Sh Client mounts
Once operational, NFSv4.1 or NFSv4.2
.Fx
client mounts
done with the
.Dq pnfs
option should do I/O directly on the DSs.
The clients mounting the MDS must be running the
.Xr nfscbd 8
daemon for pNFS to work.
Set
.Bd -literal -offset indent
nfscbd_enable="YES"
.Ed
.sp
in the
.Xr rc.conf 5
on these clients.
Non-pNFS aware clients or NFSv3 mounts will do all I/O RPCs on the MDS,
which acts as a proxy for the appropriate DS(s).
.Sh Backing up a pNFS service
Since the data is separated from the metadata, the simple way to back up
a pNFS service is to do so from an NFS client that has the service mounted
on it.
If you back up the MDS exported file system(s) on the MDS, you must do it
in such a way that the
.Dq system
namespace extended attributes get backed up.
.Sh Handling of failed mirrored DSs
When a mirrored DS fails, it can be disabled in one of three ways:
.sp
1 - The MDS detects a problem when trying to do proxy
operations on the DS.
This can take a couple of minutes
after the DS failure or network partitioning occurs.
.sp
2 - A pNFS client can report an I/O error that occurred for a DS to the MDS in
the arguments for a LayoutReturn operation.
.sp
3 - The system administrator can perform the
.Xr pnfsdskill 8
command on the MDS
to disable it.
If the system administrator does a
.Xr pnfsdskill 8
and it fails with ENXIO
(Device not configured) that normally means the DS was already
disabled via #1 or #2.
Since doing this is harmless, once a system administrator knows that
there is a problem with a mirrored DS, doing the command is recommended.
.sp
Once a system administrator knows that a mirrored DS has malfunctioned
or has been network partitioned, they should do the following as root/su
on the MDS:
.Bd -literal -offset indent
# pnfsdskill <mounted-on-path-of-DS>
# umount -N <mounted-on-path-of-DS>
.Ed
.sp
Note that the <mounted-on-path-of-DS> must be the exact mounted-on path
string used when the DS was mounted on the MDS.
.Pp
Once the mirrored DS has been disabled, the pNFS service should continue to
function, but file updates will only happen on the DS(s) that have not been disabled.
Assuming two way mirroring, this means that a file stored on the disabled DS
is updated only on the other DS of the pair recorded in the
.Dq pnfsd.dsfile
extended attribute for the file on the MDS.
.Pp
The next step is to clear the IP address in the
.Dq pnfsd.dsfile
extended attribute on all files on the MDS for the failed DS.
This is done so that, when the disabled DS is repaired and brought back online,
the data files on this DS will not be used, since they may be out of date.
The command that clears the IP address is
.Xr pnfsdsfile 8
with the
.Dq -r
option.
For example:
.Bd -literal -offset
# pnfsdsfile -r nfsv4-data3 yyy.c
yyy.c: nfsv4-data2.home.rick ds0/207508569ff983350c000000ec7c0200e4c57b2e0000000000000000 0.0.0.0 ds0/207508569ff983350c000000ec7c0200e4c57b2e0000000000000000
.Ed
.sp
replaces nfsv4-data3 with an IPv4 address of 0.0.0.0, so that nfsv4-data3
will not get used.
.Pp
Normally this will be called within a
.Xr find 1
command for all regular
files in the exported directory tree and must be done on the MDS.
When used with
.Xr find 1 ,
you will probably also want the
.Dq -q
option so that it does not print the results for every file.
If the disabled/repaired DS is nfsv4-data3, the commands done on the MDS
would be:
.Bd -literal -offset
# cd <top-level-exported-dir>
# find . -type f -exec pnfsdsfile -q -r nfsv4-data3 {} \;
.Ed
.sp
There is a problem with the above command if the file found by
.Xr find 1
is renamed or unlinked before the
.Xr pnfsdsfile 8
command is done on it.
This should normally generate an error message.
A simple unlink is harmless
but a link/unlink or rename might result in the file not having been processed
under its new name.
To check that all files have their IP addresses set to 0.0.0.0 these
commands can be used (assuming the
.Xr sh 1
shell):
.Bd -literal -offset
# cd <top-level-exported-dir>
# find . -type f -exec pnfsdsfile {} \; | sed "/nfsv4-data3/!d"
.Ed
.sp
For any line(s) printed, the
.Xr pnfsdsfile 8
command with
.Dq -r
must be run again on the file.
Once this is done, the replaced/repaired DS can be brought back online.
It should have empty ds0,...,dsN directories under the top level exported
directory for storage of data files just like it did when first set up.
Mount it on the MDS exactly as you did before disabling it.
For the nfsv4-data3 example, the command would be:
.Bd -literal -offset
# mount -t nfs -o nfsv4,minorversion=2,soft,retrans=2 nfsv4-data3:/ /data3
.Ed
.sp
Then restart the
.Xr nfsd 8
to re-enable the DS.
.Bd -literal -offset
# /etc/rc.d/nfsd restart
.Ed
.sp
Now, new files can be stored on nfsv4-data3,
but files with the IP address zeroed out on the MDS will not yet use the
repaired DS (nfsv4-data3).
The next step is to go through the exported file tree on the MDS and,
for each of the
files with an IPv4 address of 0.0.0.0 in its extended attribute, copy the file
data to the repaired DS and re-enable use of this mirror for it.
The command for copying the file data for one MDS file is
.Xr pnfsdscopymr 8
and it will also normally be used within a
.Xr find 1
command.
For the example case, the commands on the MDS would be:
.Bd -literal -offset
# cd <top-level-exported-dir>
# find . -type f -exec pnfsdscopymr -r /data3 {} \;
.Ed
.sp
When this completes, the recovery should be complete or at least nearly so.
As noted above, if a link/unlink or rename occurs on a file name while the
above
.Xr find 1
is in progress, it may not get copied.
To check for any file(s) not yet copied, the commands are:
.Bd -literal -offset
# cd <top-level-exported-dir>
# find . -type f -exec pnfsdsfile {} \; | sed "/0\.0\.0\.0/!d"
.Ed
.sp
If this command prints out any file name(s), these files must
have the
.Xr pnfsdscopymr 8
command done on them to complete the recovery.
.Bd -literal -offset
# pnfsdscopymr -r /data3 <file-path-reported>
.Ed
.sp
If this command fails with the error
.br
.Dq pnfsdscopymr: Copymr failed for file <path>: Device not configured
.br
repeatedly, this may be caused by a Read/Write layout that has not
been returned.
The only way to get rid of such a layout is to restart the
.Xr nfsd 8 .
.sp
All of these commands are designed to be
done while the pNFS service is running and can be re-run safely.
.Pp
For a more detailed discussion of the setup and management of a pNFS service
see:
.Bd -literal -offset indent
https://people.freebsd.org/~rmacklem/pnfs-planb-setup.txt
.Ed
.sp
.Sh SEE ALSO
.Xr nfsv4 4 ,
.Xr pnfs 4 ,
.Xr exports 5 ,
.Xr fstab 5 ,
.Xr rc.conf 5 ,
.Xr sysctl.conf 5 ,
.Xr nfscbd 8 ,
.Xr nfsd 8 ,
.Xr nfsuserd 8 ,
.Xr pnfsdscopymr 8 ,
.Xr pnfsdsfile 8 ,
.Xr pnfsdskill 8
.Sh HISTORY
The
.Nm
service first appeared in
.Fx 12.0 .
.Sh BUGS
Since the MDS cannot be mirrored, it is a single point of failure just
as a non
.Tn pNFS
server is.
For non-mirrored configurations, all
.Fx
systems used in the service
are single points of failure.