# # CDDL HEADER START # # The contents of this file are subject to the terms of the # Common Development and Distribution License, Version 1.0 only # (the "License"). You may not use this file except in compliance # with the License. # # You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE # or http://www.opensolaris.org/os/licensing. # See the License for the specific language governing permissions # and limitations under the License. # # When distributing Covered Code, include this CDDL HEADER in each # file and include the License file at usr/src/OPENSOLARIS.LICENSE. # If applicable, add the following below this CDDL HEADER, with the # fields enclosed by brackets "[]" replaced with your own identifying # information: Portions Copyright [yyyy] [name of copyright owner] # # CDDL HEADER END # # Copyright (c) 1995 Sun Microsystems, Inc. All Rights Reserved # #ident "%W% %E% SMI" # # design notes that are likely to be of general (rather than # merely historical) interest. Table of Contents Overview what filesync does Primary Data Structures general principles why they exist key concepts what they represent data structures major structures and their contents Overview of Passes main phases of program execution Modules list and descriptions of files Studying the Code active ingredients a reading list of high points the whole thing a suggested order for everything Gross calling structure who calls whom Helpful hints good things to know Overview The purpose of this program is to compare pairs of directory trees with a baseline snapshot, to determine which files have changed, and to propagate the changes in order to bring the trees back into congruency. The baseline snapshot describes size, ownership, ... for all files that filesync is managing WHEN THEY WERE LAST IN SYNC. The files and directory trees to be compared are determined by a relatively flexible (user editable) rules file, whose format (packingrules.4) permits files and or trees to be specified, explicitly, implicitly, or with wild cards. There are also provisions for filtering out unwanted files and for running programs to generate lists of files and directories to be included or excluded. The comparisons begin by comparing the structured name spaces. For names that appear in both trees, the files are then compared on the basis of type, size, contents, ownership and protections. For files that are already in the baseline snapshot, if the sizes and modification times have not changed, we do not bother to recheck the contents. The reconciliation process (resolving the differences) will only propagate a change if it is obvious what should be done (one side has changed relative to the snapshot, while the other has not). If there are conflicting changes, the file is flagged and the user is asked to reconcile the differences manually. There are, however a few switches that can be used to constrain the analysis or reconciliation, or to force one particular side to win in case of a conflict. Primary Data Structures general principles: we will build up an in-memory tree that represents the union of the name spaces found in the baseline and on the source and destination sides. keep in mind that the baseline recalls the state of files THE LAST TIME THEY WERE IN AGREEMENT. If files have disagreed for a long time, the baseline still remembers what they were like when they agreed. If files have never agreed, the baseline has no notions of how they "used to be". key concepts: a "base pair" is a pair of directories whose contents (or a subset of whose contents) are to be syncrhonized. The "base pairs" to be managed are specified in the packing rules file. associated with each "base pair" is a set of rules that describe which files (under those directories) are to be kept in sync. Each rule is a list of: files and or directories to be included wild cards for files or directories to be included programs to generate lists of names for inclusion file names to be ignored wild cards for file names to be ignored programs to generate lists of names for ignoring as a result of the "evaluation" process we build up (under each base pair) a tree that represents all of the files that we are supposed to keep in sync, and contains everything we need to know about each one of those files. The structure of the tree mirrors the directory hierarchy ... actually the union of the three hiearchies (baseline, source and destination). for each file, we record interesting information (type, size, owner, protection, mod time) and keep separate note of what these values were: in the baseline last time two sides agreed on the source side, as we just examined it on the destination side, as we just examined it data structures: there is an ordered list of "base" structures for each base, we maintain three lists of associated "rule" descriptions: inclusion rules exclusion rules restriction rules (from the command line) a "file" tree, representing all files below the bases a list of statistics to be printed as a summary for each "rule", we maintain some flags describing the type of rule the character string that is the rule for each "file", we maintain sibling and child pointers to give them tree structure flags to describe what we have done/should do "fileinfo" information from the src, dest, and baseline in addition there are some fields that are used to add the file to a list of files requiring reconciliation and record what happened to it. a "fileinfo" structure contains a subset of the information that we obtain from a stat call: major/minor/inum type link count ownership, protection, and acls size modification time there is also, built up during analysis, a reconciliation list. This is an ordered list of "file" structures which are believed to descibe files that have changed and require reconciliation. The ordering is important both for correctness and to preserve relative modification times. Overview of passes: pass I (evaluate) stat every file that we might be interested in (on both src/dest sides). This includes walking the trees under all directories in order to find out what files exist and stating all of them. the main trick in this pass is that there may be files we don't want to evaluate (because we are limiting our attention to specific files and trees). There is a LISTED flag kept in the database that tells me whether or not I need to stat/descend any given node. all restrictions and ignores take effect during this pass. pass II (analyze) given the baseline and all of the current stat information gained during pass I, figure out what might conceivably have changed and queue it for pass III. This pass doesn't try to figure out what happened or who should win ... it merely identifies candidates for pass III. This pass ignores any nodes that were not evaluated during pass I. the queueing process, however, determines the order in which the files will be processed in pass III, and the order is very important. pass III (reconcile) process the list of candidates, figuring out what has actually changed and which versions deserve to win. If is clear what needs doing, we actually do it in this pass. Modules filesync.h defines for limits, sizes and return codes declarations for global variables (mostly cmd-line parms) defines for default file names declarations for routines of general interest database.h data-structures for recording rules data-structures for recording information about files declarations for routines that operate on/with those structures messages.h the text of all localizable messages debug.h definitions and declarations for routines for error simulation and bit-map display. acls.c routines to get, set, compare, and display Access Control Lists action.c routines to do the real work of copying, deleting, or changing ownership in order to make one side agree with the other. anal.c routines to examine the in-core list of files and determine what has changed (and therefore what is files are candidates for reconciliation). This analysis includes figuring out which files should be links rather than copies. base.c routines to read and write the baseline file routines to search and manipulate the in-core base list debug.c data structures and routines, used to sumulate errors and produce debug output, that map between bits (as found in various flag words) character string names for their meanings. eval.c routines to build up the internal tree that describes the status of all of the files that are described by the current rules. files.c routines to manipulate file name arguments, including wild cards and embedded environment variables. ignore.c routines to maintain a list of names or patterns for files to be ignored, and to check file names against that list. main.c global variables, cmd-line parameter processing, parameter validation, error reporting, and the main loop. recon.c routines to examine a list of files that appear to have changed, and figure out what the appropriate reconciliation course of action is. rename.c routines to search the tree to determine whether or not any creates/deletes are actually renames. rules.c routines to read and write the rules file routines to add rules and enumerate in-core rules filecheck.c not really a part of filesync, but rather a utility program that is used in the test suite. It extracts information about files that is not readily available from other unix commands. Comments on studying the code if you are only interested in the "active ingredients": read the above notes on data structures and then read the structure declarations in database.h read the above notes overviewing the passes in recon.c: read reconcile this routine almost makes sense on its own, and it is unquestionably the most important routine in the entire program. Everything else just gathers data for reconcile to use, or updates the books to reflect the changes. in eval.c: read evaluate, eval_file, walker, and note_info this is the main guts of pass I in anal.c: read analyze, check_file, check_changes & queue_file this is the main guts of pass II if you want to read the whole thing: the following routines do fundamentally simple things in simple ways, and can (for the most part) be understood in vaccuuo. The things they do are probably sufficiently obvious that you can probably understand the more interesting code without having read them at all. base.c rules.c files.c debug.c ignore.c acls.c the following routines constitute the real meat of the program, and while they are broken into specialized modules, they probably need to be understood as an organic whole: main.c setup and control eval.c pass I anal.c pass II recon.c pass III action.c execution and book-keeping rename.c a special case for a common situation Gross calling structure / flow of control main.c:main findfiles read_baseline read_rules if new rules add_base add_include evaluate analyze write_baseline write_summary eval.c:evaluate add_file_to_base add_glob add_run ignore_pgm ignore_file ignore_expr eval_file eval.c:eval_file note_info nftw walker note_info anal.c:analyze check_file reconcile anal.c:check_file check_changes queue_file recon.c:reconcile samedata samestuff do_copy copy do_like update_info do_like do_remove Helpful Hints the "file" structure contains a bunch of flags. Many of them just summarize what we know about the file (e.g. where it was found). Others are more subtle and control the evaluation process or the writing out of the baseline file. You can't really understand the processing unless you understand what these flags mean. F_NEW added by a new rule F_LISTED this name was generated by a rule F_SPARSE this directory is an intermediate on the way to a name generated by a rule and should not be recursively walked. F_EVALUATE this node was found in evaluation and has up-to-date stat information F_CONFLICT there is a conflict on this node so baseline should remain unchanged F_REMOVE this node should be purged from the baseline F_STAT_ERROR it was impossible to stat this file (and anything below it) the implications of these flags on processing are F_NEW, F_LISTED, F_SPARSE affect whether or not a particular node should be included in the evaluation pass. in some situations, only new rules are interpreted. listed files and directories should be evaluated and analyzed. sparse directories should not be recursively enumerated. F_EVALUATE determines whether or not a node is included in the analysis pass. Only nodes that have been evaluated will be analyzed. F_CONFLICT, F_REMOVE, F_EVALUATE affect how a node should be written back into the baseline file. if there is a conflict or we haven't evaluated a node, we won't update the baseline. if a node is marked for removal, it will be excluded from the baseline when it is written out. F_STAT_ERROR if we could not get proper status information about a file (or the tree under it) we cannot, with any confidence, determine what its state is or do anything about it. Such files are flagged as "in conflict". it is somewhat kinky that we put error flagged files on the reconciliation list. We do this because this is the easiest way to pull them out for reporting as conflicts.