xref: /illumos-gate/usr/src/cmd/filesync/README (revision 35a5a3587fd94b666239c157d3722745250ccbd7)
#
# CDDL HEADER START
#
# The contents of this file are subject to the terms of the
# Common Development and Distribution License, Version 1.0 only
# (the "License").  You may not use this file except in compliance
# with the License.
#
# You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
# or http://www.opensolaris.org/os/licensing.
# See the License for the specific language governing permissions
# and limitations under the License.
#
# When distributing Covered Code, include this CDDL HEADER in each
# file and include the License file at usr/src/OPENSOLARIS.LICENSE.
# If applicable, add the following below this CDDL HEADER, with the
# fields enclosed by brackets "[]" replaced with your own identifying
# information: Portions Copyright [yyyy] [name of copyright owner]
#
# CDDL HEADER END
#
# Copyright (c) 1995 Sun Microsystems, Inc.  All Rights Reserved
#
#ident	"%W%	%E% SMI"
#
#	design notes that are likely to be of general (rather than
#	merely historical) interest.

Table of Contents

	Overview			what filesync does

	Primary Data Structures
		general principles	why they exist
		key concepts		what they represent
		data structures		major structures and their contents

	Overview of Passes		main phases of program execution

	Modules				list and descriptions of files

	Studying the Code
		active ingredients	a reading list of high points
		the whole thing		a suggested order for everything

	Gross calling structure		who calls whom

	Helpful hints			good things to know

Overview

	The purpose of this program is to compare pairs of directory
	trees with a baseline snapshot, to determine which files have
	changed, and to propagate the changes in order to bring the
	trees back into congruency.  The baseline snapshot describes
	size, ownership, ... for all files that filesync is managing
	WHEN THEY WERE LAST IN SYNC.

	The files and directory trees to be compared are determined
	by a relatively flexible (user editable) rules file, whose
	format (packingrules.4) permits files and/or trees to be
	specified explicitly, implicitly, or with wild cards.
	There are also provisions for filtering out unwanted files
	and for running programs to generate lists of files and
	directories to be included or excluded.

	The comparisons begin by comparing the structured name
	spaces.  For names that appear in both trees, the files
	are then compared on the basis of type, size, contents,
	ownership and protections.  For files that are already
	in the baseline snapshot, if the sizes and modification
	times have not changed, we do not bother to recheck the
	contents.
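
	As a rough illustration of the shortcut just described, the
	following C sketch decides whether a file's contents need to
	be re-examined.  The structure and function names here are
	hypothetical (they are not taken from the filesync sources),
	and only size and modification time are modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>
#include <time.h>

/* Hypothetical, cut-down record of what we know about one file. */
struct snap {
	off_t	size;	/* size in bytes */
	time_t	mtime;	/* modification time */
};

/*
 * Return true if the contents must be compared.  A file with no
 * baseline entry always needs checking; otherwise we only look at
 * the contents when the size or modification time has moved.
 */
static bool
contents_need_check(const struct snap *baseline, const struct snap *current)
{
	assert(current != NULL);

	if (baseline == NULL)
		return (true);
	return (baseline->size != current->size ||
	    baseline->mtime != current->mtime);
}
```

	A caller would pass NULL for the baseline argument when the
	file has never been recorded in the baseline.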

	The reconciliation process (resolving the differences)
	will only propagate a change if it is obvious what should
	be done (one side has changed relative to the snapshot,
	while the other has not).  If there are conflicting changes,
	the file is flagged and the user is asked to reconcile the
	differences manually.  There are, however, a few switches
	that can be used to constrain the analysis or reconciliation,
	or to force one particular side to win in case of a conflict.
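
	The decision rule above can be sketched as a small C function.
	This is only an illustration under stated assumptions: the
	enum names, the "force" parameter, and the function name are
	all made up for this example, and the sketch ignores the case
	where both sides changed in the same way.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical outcome codes; the real program's names may differ. */
enum verdict {
	V_NOTHING,	/* neither side changed relative to the baseline */
	V_SRC_TO_DST,	/* only the source changed: propagate it */
	V_DST_TO_SRC,	/* only the destination changed: propagate it */
	V_CONFLICT	/* both changed: flag it for the user */
};

/* Hypothetical stand-in for the "force one side to win" switches. */
enum force { FORCE_NONE, FORCE_SRC, FORCE_DST };

static enum verdict
decide(bool src_changed, bool dst_changed, enum force f)
{
	if (!src_changed && !dst_changed)
		return (V_NOTHING);
	if (src_changed && !dst_changed)
		return (V_SRC_TO_DST);
	if (!src_changed && dst_changed)
		return (V_DST_TO_SRC);

	/* both sides changed: only a force switch can break the tie */
	if (f == FORCE_SRC)
		return (V_SRC_TO_DST);
	if (f == FORCE_DST)
		return (V_DST_TO_SRC);
	return (V_CONFLICT);
}
```

	Note that a change is propagated automatically only when
	exactly one of the two "changed" inputs is true.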


Primary Data Structures

	general principles:
		we will build up an in-memory tree that represents
		the union of the name spaces found in the baseline
		and on the source and destination sides.

		keep in mind that the baseline recalls the state of
		files THE LAST TIME THEY WERE IN AGREEMENT.  If files
		have disagreed for a long time, the baseline still
		remembers what they were like when they agreed.  If
		files have never agreed, the baseline has no notion
		of how they "used to be".

	key concepts:
		a "base pair" is a pair of directories whose
		contents (or a subset of whose contents) are to
		be synchronized.  The "base pairs" to be managed
		are specified in the packing rules file.

		associated with each "base pair" is a set of rules
		that describe which files (under those directories)
		are to be kept in sync.  Each rule is a list of:
			files and/or directories to be included
			wild cards for files or directories to be included
			programs to generate lists of names for inclusion
			file names to be ignored
			wild cards for file names to be ignored
			programs to generate lists of names for ignoring
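
		to make the include/ignore idea concrete, here is a
		minimal C sketch that selects a name if it matches
		some inclusion pattern and no ignore pattern.  It uses
		the POSIX fnmatch routine as a stand-in matcher; these
		notes do not say which matcher filesync itself uses,
		and the function names below are hypothetical.

```c
#include <assert.h>
#include <fnmatch.h>
#include <stdbool.h>
#include <stddef.h>

/* Return true if name matches at least one of the shell patterns. */
static bool
matches_any(const char *const *patterns, size_t count, const char *name)
{
	size_t i;

	for (i = 0; i < count; i++) {
		if (fnmatch(patterns[i], name, 0) == 0)
			return (true);
	}
	return (false);
}

/* A name is kept in sync if it is included and not ignored. */
static bool
name_selected(const char *const *include, size_t n_include,
    const char *const *ignore, size_t n_ignore, const char *name)
{
	assert(name != NULL);

	return (matches_any(include, n_include, name) &&
	    !matches_any(ignore, n_ignore, name));
}
```

		in this sketch an ignore pattern always overrides an
		inclusion pattern; that precedence is an assumption
		made for the example.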

		as a result of the "evaluation" process we build up
		(under each base pair) a tree that represents all of
		the files that we are supposed to keep in sync, and
		contains everything we need to know about each one
		of those files.  The structure of the tree mirrors
		the directory hierarchy ... actually the union of the
		three hierarchies (baseline, source and destination).

		for each file, we record interesting information (type,
		size, owner, protection, mod time) and keep separate
		note of what these values were:
			in the baseline, the last time the two sides agreed
			on the source side, as we just examined it
			on the destination side, as we just examined it

	data structures:

		there is an ordered list of "base" structures
		for each base, we maintain
			three lists of associated "rule" descriptions:
				inclusion rules
				exclusion rules
				restriction rules (from the command line)
			a "file" tree, representing all files below the bases
			a list of statistics to be printed as a summary

		for each "rule", we maintain
			some flags describing the type of rule
			the character string that is the rule

		for each "file", we maintain
			sibling and child pointers to give them tree structure
			flags to describe what we have done/should do
			"fileinfo" information from the src, dest, and baseline

			in addition there are some fields that are used
			to add the file to a list of files requiring
			reconciliation and record what happened to it.

		a "fileinfo" structure contains a subset of the information
		that we obtain from a stat call:
			major/minor/inum
			type
			link count
			ownership, protection, and acls
			size
			modification time

		there is also, built up during analysis, a reconciliation
		list.  This is an ordered list of "file" structures which
		are believed to describe files that have changed and require
		reconciliation.  The ordering is important both for correctness
		and to preserve relative modification times.
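
		the relationships described above might be declared
		roughly as follows.  This is a sketch for orientation
		only: every type, field, and function name in it is
		hypothetical, ACLs and the summary statistics are left
		out, and the real declarations in database.h should be
		consulted for the actual layout.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <time.h>

/* which copy of the information a fileinfo describes */
enum side { SIDE_SRC, SIDE_DST, SIDE_BASE, SIDE_COUNT };

struct fileinfo {			/* subset of what stat returns */
	long	fi_major, fi_minor;	/* device the file lives on */
	ino_t	fi_inum;		/* inode number */
	mode_t	fi_type;		/* file type */
	nlink_t	fi_nlink;		/* link count */
	uid_t	fi_uid;			/* ownership */
	gid_t	fi_gid;
	mode_t	fi_mode;		/* protection */
	off_t	fi_size;		/* size */
	time_t	fi_mtime;		/* modification time */
};

struct rule {
	int		r_flags;	/* type of rule */
	const char	*r_text;	/* the string that is the rule */
	struct rule	*r_next;
};

struct file {
	const char	*f_name;
	int		f_flags;	/* what we have done/should do */
	struct fileinfo	f_info[SIDE_COUNT];
	struct file	*f_next;	/* next sibling */
	struct file	*f_child;	/* first child */
	struct file	*f_rnext;	/* next on the reconciliation list */
};

struct base {
	const char	*b_src, *b_dst;	/* the two directories of the pair */
	struct rule	*b_includes;
	struct rule	*b_excludes;
	struct rule	*b_restrictions;
	struct file	*b_files;	/* root of the file tree */
	struct base	*b_next;	/* next base in the ordered list */
};

/* Append child to the end of parent's list of children. */
static void
file_add_child(struct file *parent, struct file *child)
{
	struct file **pp;

	assert(parent != NULL && child != NULL);
	for (pp = &parent->f_child; *pp != NULL; pp = &(*pp)->f_next)
		continue;
	child->f_next = NULL;
	*pp = child;
}
```

		keeping the three fileinfo copies side by side in each
		node is what lets later passes compare source,
		destination and baseline without further stat calls.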

Overview of Passes

	pass I (evaluate)

		stat every file that we might be interested in
		(on both src/dest sides).  This includes walking
		the trees under all directories in order to
		find out what files exist, and calling stat on
		all of them.

		the main trick in this pass is that there may be
		files we don't want to evaluate (because we are
		limiting our attention to specific files and trees).
		There is a LISTED flag kept in the database that
		tells us whether or not we need to stat/descend any
		given node.

		all restrictions and ignores take effect during this pass.

	pass II (analyze)

		given the baseline and all of the current stat information
		gained during pass I, figure out what might conceivably
		have changed and queue it for pass III.  This pass doesn't
		try to figure out what happened or who should win ... it
		merely identifies candidates for pass III.  This pass
		ignores any nodes that were not evaluated during pass I.

		the queueing process, however, determines the order in
		which the files will be processed in pass III, and the
		order is very important.

	pass III (reconcile)

		process the list of candidates, figuring out what has
		actually changed and which versions deserve to win.  If
		it is clear what needs doing, we actually do it in this
		pass.
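
		the hand-off from pass II to pass III can be pictured
		with the following C sketch.  It skips nodes that were
		not evaluated, treats any difference from the baseline
		size as "might have changed", and queues candidates
		oldest-first.  All names are hypothetical, the size
		comparison is a deliberate simplification, and ordering
		purely by modification time is an assumption made for
		illustration; the real ordering rules live in anal.c.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

#define	N_EVALUATED	0x01	/* hypothetical: pass I got stat info */

struct node {
	const char	*name;
	int		flags;
	long		base_size, src_size, dst_size;
	time_t		mtime;		/* newest change on either side */
	struct node	*q_next;	/* link on the candidate queue */
};

/* Insert n so that the queue stays sorted oldest-first. */
static void
queue_node(struct node **head, struct node *n)
{
	struct node **pp = head;

	while (*pp != NULL && (*pp)->mtime <= n->mtime)
		pp = &(*pp)->q_next;
	n->q_next = *pp;
	*pp = n;
}

/* Pass II: build the candidate queue, return how many were queued. */
static int
analyze_nodes(struct node *nodes, size_t count, struct node **head)
{
	size_t i;
	int queued = 0;

	assert(head != NULL);
	*head = NULL;
	for (i = 0; i < count; i++) {
		struct node *n = &nodes[i];

		if ((n->flags & N_EVALUATED) == 0)
			continue;	/* pass I never looked at it */
		if (n->src_size != n->base_size ||
		    n->dst_size != n->base_size) {
			queue_node(head, n);
			queued++;
		}
	}
	return (queued);
}
```

		pass III would then walk the queue from its head,
		deciding for each candidate what actually happened.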

Modules

	filesync.h
		defines for limits, sizes and return codes
		declarations for global variables (mostly cmd-line parms)
		defines for default file names
		declarations for routines of general interest

	database.h
		data-structures for recording rules
		data-structures for recording information about files
		declarations for routines that operate on/with those structures

	messages.h
		the text of all localizable messages

	debug.h
		definitions and declarations for routines for error
		simulation and bit-map display.

	acls.c
		routines to get, set, compare, and display Access Control Lists
	action.c
		routines to do the real work of copying, deleting, or
		changing ownership in order to make one side agree
		with the other.
	anal.c
		routines to examine the in-core list of files and
		determine what has changed (and therefore which
		files are candidates for reconciliation).  This
		analysis includes figuring out which files should
		be links rather than copies.
	base.c
		routines to read and write the baseline file
		routines to search and manipulate the in-core base list
	debug.c
		data structures and routines, used to simulate errors
		and produce debug output, that map between bits (as found
		in various flag words) and character string names for
		their meanings.

	eval.c
		routines to build up the internal tree that describes
		the status of all of the files that are described
		by the current rules.
	files.c
		routines to manipulate file name arguments, including
		wild cards and embedded environment variables.
	ignore.c
		routines to maintain a list of names or patterns for
		files to be ignored, and to check file names against
		that list.
	main.c
		global variables, cmd-line parameter processing,
		parameter validation, error reporting, and the
		main loop.
	recon.c
		routines to examine a list of files that appear to
		have changed, and figure out what the appropriate
		reconciliation course of action is.
	rename.c
		routines to search the tree to determine whether
		or not any creates/deletes are actually renames.
	rules.c
		routines to read and write the rules file
		routines to add rules and enumerate in-core rules

	filecheck.c
		not really a part of filesync, but rather a utility
		program that is used in the test suite.  It extracts
		information about files that is not readily available
		from other unix commands.

Studying the Code

	if you are only interested in the "active ingredients":

		read the above notes on data structures and then

		read the structure declarations in database.h

		read the above notes overviewing the passes

		in recon.c: read reconcile

			this routine almost makes sense on its own,
			and it is unquestionably the most important
			routine in the entire program.  Everything
			else just gathers data for reconcile to use,
			or updates the books to reflect the changes.

		in eval.c: read evaluate, eval_file, walker, and note_info

			this is the main guts of pass I

		in anal.c: read analyze, check_file, check_changes & queue_file

			this is the main guts of pass II

	if you want to read the whole thing:

		the following modules do fundamentally simple things
		in simple ways, and can (for the most part) be understood
		in vacuo.  The things they do are sufficiently obvious
		that you can probably understand the more interesting
		code without having read them at all.

			base.c
			rules.c
			files.c
			debug.c
			ignore.c
			acls.c

		the following modules constitute the real meat of the
		program, and while they are broken into specialized
		modules, they probably need to be understood as an
		organic whole:

			main.c		setup and control
			eval.c		pass I
			anal.c		pass II
			recon.c		pass III
			action.c	execution and book-keeping
			rename.c	a special case for a common situation


Gross calling structure / flow of control

	main.c:main
		findfiles
		read_baseline
		read_rules
		if new rules
			add_base
			add_include
		evaluate
		analyze
		write_baseline
		write_summary

	eval.c:evaluate
		add_file_to_base
		add_glob
		add_run
		ignore_pgm
		ignore_file
		ignore_expr
		eval_file

	eval.c:eval_file
		note_info
		nftw
			walker
				note_info

	anal.c:analyze
		check_file
		reconcile

	anal.c:check_file
		check_changes
		queue_file

	recon.c:reconcile
		samedata
		samestuff
		do_copy
			copy
			do_like
			update_info
		do_like
		do_remove

Helpful Hints

	the "file" structure contains a bunch of flags.  Many of them
	just summarize what we know about the file (e.g. where it was
	found).  Others are more subtle and control the evaluation
	process or the writing out of the baseline file.  You can't
	really understand the processing unless you understand what
	these flags mean.

		F_NEW		added by a new rule

		F_LISTED	this name was generated by a rule

		F_SPARSE	this directory is an intermediate on
				the way to a name generated by a rule
				and should not be recursively walked.

		F_EVALUATE	this node was found in evaluation and
				has up-to-date stat information

		F_CONFLICT	there is a conflict on this node so
				baseline should remain unchanged

		F_REMOVE	this node should be purged from the baseline

		F_STAT_ERROR	it was impossible to stat this file
				(and anything below it)

	the implications of these flags on processing are

		F_NEW, F_LISTED, F_SPARSE

			affect whether or not a particular node should
			be included in the evaluation pass.

			in some situations, only new rules are interpreted.

			listed files and directories should be evaluated
			and analyzed.  sparse directories should not be
			recursively enumerated.

		F_EVALUATE

			determines whether or not a node is included
			in the analysis pass.  Only nodes that have
			been evaluated will be analyzed.

		F_CONFLICT, F_REMOVE, F_EVALUATE

			affect how a node should be written back into
			the baseline file.

			if there is a conflict or we haven't evaluated
			a node, we won't update the baseline.

			if a node is marked for removal, it will be
			excluded from the baseline when it is written out.

		F_STAT_ERROR

			if we could not get proper status information
			about a file (or the tree under it) we cannot,
			with any confidence, determine what its state
			is or do anything about it.  Such files are
			flagged as "in conflict".

			it is somewhat kinky that we put error flagged
			files on the reconciliation list.  We do this
			because this is the easiest way to pull them
			out for reporting as conflicts.
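
	the flag implications listed above can be summarized in a
	few small predicates.  In this C sketch the flag names come
	from these notes, but the bit values, the function names, and
	the choice to test F_REMOVE before F_CONFLICT are assumptions
	made purely for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Flag names are from the notes above; the bit values are made up. */
#define	F_NEW		0x01
#define	F_LISTED	0x02
#define	F_SPARSE	0x04
#define	F_EVALUATE	0x08
#define	F_CONFLICT	0x10
#define	F_REMOVE	0x20
#define	F_STAT_ERROR	0x40

/* what to do with a node when the baseline is written out */
enum baseline_action { BL_DROP, BL_KEEP_OLD, BL_UPDATE };

/* Sparse directories are not recursively enumerated. */
static bool
walk_recursively(int flags, bool is_directory)
{
	return (is_directory && (flags & F_SPARSE) == 0);
}

/* Only nodes that were evaluated in pass I are analyzed. */
static bool
include_in_analysis(int flags)
{
	return ((flags & F_EVALUATE) != 0);
}

static enum baseline_action
baseline_writeback(int flags)
{
	if (flags & F_REMOVE)
		return (BL_DROP);	/* purge from the baseline */
	if ((flags & F_CONFLICT) || (flags & F_EVALUATE) == 0)
		return (BL_KEEP_OLD);	/* leave the old entry alone */
	return (BL_UPDATE);		/* record the new state */
}
```

	a node that was never evaluated therefore keeps whatever
	baseline entry it already had.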