xref: /freebsd/contrib/byacc/README.BTYACC (revision 4313cc83440a39bdf976f955b1d4d3f3c4d1552f)
1-- $Id: README.BTYACC,v 1.1 2014/03/25 19:21:31 Tom.Shields Exp $
2
3The original README from btyacc is below.
4
5The backtracking enhancements to byacc have been merged into Thomas Dickey's
6byacc baseline.
7
8The %include and %define/%ifdef enhancements described below are not currently
9incorporated.
10
11-------------------------------------------------------------------------------
12	     BTYACC -- backtracking yacc
13	     ===========================
14
15BTYACC was created by Chris Dodd using ideas from many
16places and lots of code from the Berkeley Yacc
17distribution, which is a public domain yacc clone put
18together by the good folks at Berkeley.  This code is
19distributed with NO WARRANTY and is public domain.
20It is certain to contain bugs, which you should
21report to: chrisd@collins.com.
22
23Vadim Maslov of Siber Systems <vadik@siber.com>
24considerably modified BTYACC to make it suitable
25for production environment.
26
27Several people have suggested bug fixes that
28were incorporated into BtYacc.
29
30See the README.BYACC files for more about
31Berkeley Yacc and other sources of info.
32
33http://www.siber.com/btyacc/ is the current home of BtYacc.
34It is provided courtesy of Siber Systems http://www.siber.com/.
35
36
37		Version 3.0 changes
38		-------------------
39		  by Vadim Maslov
40
41Changes mostly occurred in btyaccpa.ske file that
42contains the parsing shift/reduce/backtrack algorithm.
43
44Version 3.0 innovations focus on:
45- text position computation and propagation,
46- industrial-strength error processing and recovery.
47
48
49** Added mechanism for computing and propagating
50text position of tokens and non-terminals.
51
52Compilers often need to build AST trees such that every node
53in a tree can relate to the parsed program source it came from.
54The following applications are very likely to need this:
55- debuggers that show actual source of the debugged program,
56- source-to-source translators that want
57  unchanged parts of the tree to generate the unchanged code.
58
59The new YYPOSN mechanism added in this version of BtYacc
60helps you in automating the text position computation
61and in assigning the computed text positions to the AST.
62This mechanism is successfully used in commercial
63parsers and source-to-source translators.
64
65In standard Yaccs every token and every non-terminal
66has an YYSTYPE semantic value attached to it.
67In this new version every token and every non-terminal
68also has an YYPOSN text position attached to it.
69YYPOSN is a user-defined type that can be anything and
70that has a meaning of text position attached to
71token or non-terminal.
72
73In addition to semantic value stack BtYacc now maintains
74text position stack. Behavior of the text position stack
75is similar to the behavior of the semantic value stack.
76
77If using text position mechanism,
78you need to define the following:
79
80YYPOSN	Preprocessor variable that contains C/C++ type of
81	the text position attached to
82	every token and non-terminal.
83
84yyposn  Global variable of type YYPOSN.
85        The lexer must assign text position of
86	the returned token to yyposn, just like it assigns
87	semantic value of the returned token to yylval.
88
89YYREDUCEPOSNFUNC
90	Preprocessor variable that points to function that
91	is called after the grammar rule reduction
92	to reduce text positions located on the stack.
93
94        This function is called by BtYacc to reduce text
95	positions. The function is called immediately after
96	the regular rule reduction occurs.
97
98	The function has the following prototype:
99	void ReducePosn(YYPOSN  &ret,
100			YYPOSN  *terms,
101			YYSTYPE *term_vals,
102			int      term_no,
103			int      stk_pos,
104			int      yychar,
105			YYPOSN  &yyposn,
106			UserType extra);
107
108        The function arguments are:
109        - ret
110	  Reference to the text position returned by
111          the rule. The function must write the computed
112          text position returned by the rule to ret.
113          This is analogue of the $$ semantic value.
114
115        - term_posns
116          Array of the right-hand side rule components
117	  YYPOSN text positions.  These are analogues of
118	  $1, $2, ..., $N in the text position world.
119
120        - term_vals
121	  Array of the right-hand side (RHS) rule components
122	  YYSTYPE values. These are the $1,...,$N themselves.
123
124        - term_no
125          Number of the components in RHS of the reduced rule.
126          Equal to size of arrays term_posns and term_vals.
127          Also equal to N in $1,...,$N in the reduced rule.
128
129        - stk_pos
130          YYSTYPE/YYPOSN stack position before the reduction.
131
132        - yychar
133          Lookahead token that immediately follows
134 	  the reduced RHS components.
135
136        - yyposn
137          YYPOSN of the token that immediately follows
138	  the reduced RHS components.
139
140        - extra
141          User-defined extra argument passed to ReducePosn.
142
143        Typically this function extracts text positions from
144	the right-hand side rule components and either
145	assigns them to the returned $$ structure/tree or
146        if no $$ value is returned, puts them into
147	the ret text position from where
148        it will be picked up by the later reduced rules.
149
150YYREDUCEPOSNFUNCARG
151	Extra user-defined argument passed to
152	the ReducePosn function. This argument can use
153	any variables defined in btyaccpa.ske.
154
155
156** Added code to btyaccpa.ske that automatically cleans up
157semantic semantic values and text positions of tokens
158and non-terminals that are discarded and deleted as
159a result of error processing.
160
161In the previous versions the discarded token and non-terminal
162semantic values were not cleaned that caused quite severe
163leaks.  The only way to fix it was to add garbage collection
164to YYSTYPE class.
165
166Now BtYacc skeleton calls delete functions for semantic
167values and positions of the discarded tokens and
168non-terminals.
169
170You need to define the following functions that BtYacc
171calls when it needs to delete semantic value or text position.
172
173YYDELETEVAL
174	User-defined function that is called by BtYacc
175	to delete semantic value of the token or non-terminal.
176
177	The user-defined function must have the prototype:
178	void DeleteYYval(YYSTYPE v, int type);
179	v    is semantic value to delete,
180	type is one of the following:
181	0 	discarding token
182	1       discarding state
183	2       cleaning up stack when aborting
184
185YYDELETEPOSN
186	User-defined function that is called by BtYacc
187	to delete text position of the token or non-terminal.
188
189	The user-defined function must have the prototype:
190	void DeleteYYposn(YYPOSN p, int type);
191	v    is semantic value to delete,
192	type is one of the following:
193	0 	discarding token
194	1       discarding state
195	2       cleaning up stack when aborting
196
197
198** User can define "detailed" syntax error processing
199function that reports an *exact* position of
200the token that caused the error.
201
202If you define preprocessor variable YYERROR_DETAILED in
203your grammar then you need define the following
204error processing function:
205
206void yyerror_detailed(char    *text,
207		      int      errt,
208		      YYSTYPE &errt_value,
209		      YYPOSN  &errt_posn);
210
211It receives the following arguments:
212text		Error message.
213errt		Code of the token that caused the error.
214errt_value	Value of the token that caused the error.
215errt_posn	Text position of token that caused error.
216
217
218** Dropped compatibility with C.
219
220Compatibility with C became increasingly difficult
221to maintain as new features were added to btyaccpa.ske.
222So we dropped it. If anybody wants to make the new version
223compatible with C, we would gladly accept the changes.
224
225Meanwhile we expect that you use C++ to write grammar
226actions and everything else in grammar files.
227Since C is (in a sense) subset of C++, your C-based
228grammar may work if you use C++ compiler to compile it.
229
230		Version 3.0 bugs fixed
231		----------------------
232
233Matthias Meixner <meixner@mes.th-darmstadt.de> fixed a bug:
234BtYacc does not correctly handle typenames, if one typename
235is a prefix of another one and if this type is used after
236the longer one. In this case BTYacc produces invalid code.
237
238
239		Version 2.1 changes
240		-------------------
241		  by Vadim Maslov
242
243** Added preprocessor statements to BtYacc that are similar
244in function and behavior to C/C++ preprocessor statements.
245
246These statements are used to:
247
248- Introduce modularity into a grammar by breaking it
249  into several *.y files and assembling different
250  grammars from the *.y modules using %include and %ifdef.
251
252- Have several versions of the same grammar
253  by using %ifdef and $endif.
254
255- To include automatically generated grammar fragment.
256  For instance, we use %include to include
257  automatically generated list of tokens.
258
259Preprocessor statements are:
260
261%define <var-name>
262	Define preprocessor variable named <var-name>.
263
264%ifdef <var-name>
265	If preprocessor variable named <var-name>
266	is defined by %define, then process the text from
267	this %ifdef to the closing %endif.
268
269%endif
270	Closing bracket for %ifdef preprocessor statement.
271	Only one nesting level of %ifdef-%endif is allowed.
272
273%include <file-name>
274	Process contents of the file named <file-name>.
275	If <file-name> is a relative name, it is looked up
276        in a directory in which btyacc was started.
277	Only one nesting level of %include is allowed.
278
279
280		Version 2.0 changes
281		-------------------
282		  by Vadim Maslov
283
284
285** Changed 16-bit short numbers to 32-bit int numbers in
286grammar tables, so that huge grammar tables (tables that
287are larger than 32768 elements) resulting from huge
288grammars (Cobol grammar, for instance) can work correctly.
289You need to have 32-bit integer to index table bigger than
29032768 elements, 16-bit integer is not enough.
291
292The original BtYacc just generated non-working tables
293larger than 32768 elements without even notifying about
294the table overflow.
295
296
297** Make error recovery work correctly when error happens
298while processing nested conflicts. Original BtYacc could
299infinitely cycle in certain situations that involved error
300recovery while in nested conflict.
301
302More detailed explanation: when we have nested conflicts
303(conflict that happens while trial-processing another
304conflict), it leads btyacc into NP-complete searching of
305conflict tree. The ultimate goal is YYVALID operator that
306selects a particular branch of that tree as a valid one.
307
308If no YYVALID is found on the tree, then error recovery
309takes over.  The problem with this is that error recovery
310is started in the same state context that exists on the
311last surveyed branch of the conflict tree.  Sometimes this
312last branch may be of zero length and it results in
313recovering to exactly the same state as existed before
314entering the conflict. BtYacc cycles then.
315
316We solved this problem by memorizing the longest path in
317the conflict tree while browsing it. If we ever get into
318error recovery, we restore state that existed on the
319longest path.  Effectively we say: if we have an error,
320let us move forward as far as we possibly could while we
321were browsing the conflict tree.
322
323
324** Introduce YYVALID_NESTED operation in addition to
325simply YYVALID.  When we have a nested conflict (conflict
326while processing in trial mode for another conflict), we
327want to relate YYVALID to a particular level of conflict
328being in trial.
329
330Since we mostly anticipate only 2-level nested conflicts
331YYVALID_NESTED tells the parser to satisfy only the
332internal conflict.  Therefore, in 1-level conflict
333situation YYVALID_NESTED acts like a regular YYVALID, but
334in 2-level conflict it is a no-op and the other YYVALID
335for outer conflict will be searched for.
336
337
338** Improved handling of situation where /tmp directory is
339missing.  Original btyacc just died quietly when /tmp
340directory was missing.  We added code that states the
341problem explicitly. While on UNIX /tmp directory is always
342present, it may be missing on WIN32 systems, therefore
343diagnosing this situation is important.
344
345
346	Version 1.0 changes: BackTracking
347	=================================
348		by Chris Dodd
349
350BTYACC is a modified version of yacc that supports
351automatic backtracking and semantic disambiguation to
352parse ambiguous grammars, as well as syntactic sugar for
353inherited attributes (which tend to introduce conflicts).
354Whenever a btyacc generated parser runs into a
355shift-reduce or reduce-reduce error in the parse table, it
356remembers the current parse point (yacc stack and input
357stream state), and goes into trial parse mode.  It then
358continues parsing, ignoring most rule actions.  If it runs
359into an error (either through the parse table or through
360an action calling YYERROR), it backtracks to the most
361recent conflict point and tries a different alternative.
362If it finds a successful parse (reaches the end of the
363input or an action calls YYVALID), it backtracks to the
364point where it first entered trial parse mode, and
365continues with a full parse (executing all actions),
366following the path of the successful trial.
367
368Actions in btyacc come in two flavors -- {}-actions, which
369are only executed when not in trial mode, and []-actions
370which are executed regardless of mode.  There are also
371inherited attributes, which look like arguments (they are
372enclosed in "()") and act like []-actions.
373
374What this buys you:
375
376* No more lexer feedback hack.  In yacc grammars for C, a
377standard hack, know as the "lexer feedback hack" is used
378to find typedef names.  The lexer uses semantic
379information to decide if any given identifier is a
380typedef-name or not and returns a special token.  With
381btyacc, you no longer need to do this; the lexer should
382just always return an identifier.  The btyacc grammar then
383needs a rule of the form:
384
385typename: ID [ if (!IsTypeName(LookupId($1))) YYERROR; ]
386
387While the hack works adequately well for parsing C, it
388becomes a nightmare when you try to parse something like
389C++, where treating an ID as a typedef becomes heavily
390dependent on context.
391
392* Easy disambiguation via simple ordering.  Btyacc runs
393its trials via the rule "try shifting first, then try
394reducing by the order that the conflicting rules appear in
395the input file".  This means you can deal with semantic a
396disambiguation rule like:
397    [1] If it looks like a declaration it is, otherwise
398    [2] If it looks like an expression it is, otherwise
399    [3] it is a syntax error
400	[Ellis&Stroustrup, Annotated C++ Reference Manual, p93]
401
402To deal with this, you need only put all the rules for
403declarations before the rules for expressions in the
404grammar file.
405
406* No extra cost if you do not use it.  Backtracking is
407only triggered when the parse hits a shift/reduce or
408reduce/reduce conflict in the table.  If you have no
409conflicts in your grammar, there is no extra cost, other
410than some extra code which will never be invoked.
411
412* C++ and ANSI C compatible parsers.  The parsers produced
413by btyacc can be compiled with C++ correctly.  If you
414"#define" YYSTYPE to be some C++ type with constructor and
415destructor, everything will work fine.  My favorite is
416"#define YYSTYPE SmartPointer", where SmartPointer is a
417smart pointer type that does garbage collection on the
418pointed to objects.
419
420BTYACC was originally written to make it easy to write a
421C++ parser (my goal was to be able to use the grammar out
422of the back of the ARM with as few modifications as
423possible).  Anyone who has ever looked at Jim Roskind
424public domain C++ yacc grammar, or the yacc-based grammar
425used in g++ knows how difficult this is.  BTYACC is very
426useful for parsing any ambiguous grammar, particularly
427ones that come from trying to merge two (or more) complete
428grammars.
429
430Limitations of the backtracking: Currently, the generated
431parser does NO pruning of alternate parsing paths.  To
432avoid an exponential explosion of possible paths (and
433parsing time), you need to manually tell the parser when
434it can throw away saved paths using YYVALID.  In practice,
435this turns out to be fairly easy to do.  A C++ parser (for
436example) can just put a [YYVALID;] after every complete
437declaration and statement rule, corresponding to pruning
438the backtracking state after seeing a ';' or '}' -- there
439will never be a situation in which it is useful to
440backtrack past either of these.
441
442Inherited attributes in btyacc:
443
444Inherited attributes look a lot like function arguments to
445non-terminals, which is what they end up being in a
446recursive descent parser, but NOT how they are implemented
447in btyacc.  Basically they are just syntactic sugar for
448embedded semantic actions and $0, $-1, ... in normal yacc.
449btyacc gives you two big advantages besides just the
450syntax:
451    1. it does type checking on the inherited attributes,
452       so you do not have to specify $<type>0 and makes sure
453       you give the correct number of arguments (inherited
454       attributes) to every use of a non-terminal.
455    2. It "collapses" identical actions from that are produced
456       from inherited attributes.  This eliminates many
457       potential reduce-reduce conflicts arising from
458       the inherited attributes.
459
460You use inherited attributes by declaring the types of the
461attributes in the preamble with a type declaration and
462declaring names of the attributes on the lhs of the yacc
463rule.  You can of course have more than one rule with the
464same lhs, and you can even give them different names in
465each, but the type and number must be the same.
466
467Here is a small example:
468           /* lhs takes 2 inherited attributes */
469%type <t1> lhs(<t1>, <t2>)
470	   stuff(<t1>, <t2>)
471%%
472lhs($i1, $i2) : { $$ = $i1 }
473	      | lhs($i1, $i2) stuff($1,$i2) { $$ = $2; }
474
475This is roughly equivalent to the following yacc code:
476lhs :
477      { $$ = $<t1>-1; }
478    | lhs [ $<t1>$ = $-1; ] [ $<t2>$ = $<t2>0; ] stuff
479      { $$ = $4; }
480    ;
481
482See the file "test/t2.y" for a longer and more complete
483example.  At the current time, the start symbol cannot
484have any arguments.
485
486Variant parsers:
487
488Btyacc supports the -S flag to use a different parser
489skeleton, changing the way that the parser is called and
490used.  The skeleton "push.skel" is included to produce a
491"passive" parser that you feed tokens to (rather than
492having the parser call a separate yylex routine).  With
493push.skel, yyparse is defined as follows:
494
495int yyparse(int token, YYSTYPE yylval)
496
497You should call yyparse repeatedly with successive tokens
498of input.  It returns 0 if more input is needed, 1 for a
499successful parse, and -1 for an unrecoverable parse error.
500
501
502	Miscellaneous Features in ver. 1.0
503	----------------------------------
504		by Chris Dodd
505
506     The -r option has been implemented.  The -r option tells
507Yacc to put the read-only tables in y.tab.c and the code and
508variables in y.code.c.  Keith Bostic asked for this option so
509that :yyfix could be eliminated.
510
511     The -l and -t options have been implemented.  The -l
512option tells Yacc not to include #line directives in the code
513it produces.  The -t option causes debugging code to be
514included in the compiled parser.
515
516     The code for error recovery has been changed to
517implement the same algorithm as AT&T Yacc.  There will still
518be differences in the way error recovery works because AT&T
519Yacc uses more default reductions than Berkeley Yacc.
520
521     The environment variable TMPDIR determines the directory
522where temporary files will be created.  If TMPDIR is defined,
523temporary files will be created in the directory whose
524pathname is the value of TMPDIR.  By default, temporary files
525are created in /tmp.
526
527     The keywords are now case-insensitive.  For example,
528%nonassoc, %NONASSOC, %NonAssoc, and %nOnAsSoC are
529all equivalent.
530
531     Commas and semicolons that are not part of C code are
532treated as commentary.
533
534     Line-end comments, as in BCPL, are permitted.  Line-end
535comments begin with // and end at the next end-of-line.
536Line-end comments are permitted in C code; they are converted
537to C comments on output.
538
539     The form of y.output files has been changed to look more
540like those produced by AT&T Yacc.
541
542     A new kind of declaration has been added.
543The form of the declaration is
544
545	  %ident string
546
547where string is a sequence of characters beginning with a
548double quote and ending with either a double quote or the
549next end-of-line, whichever comes first.  The declaration
550will cause a #ident directive to be written near the start
551of the output file.
552
553     If a parser has been compiled with debugging code, that
554code can be enabled by setting an environment variable.
555If the environment variable YYDEBUG is set to 0, debugging
556output is suppressed.  If it is set to 1, debugging output
557is written to standard output.
558
559
560		Building BtYacc
561		---------------
562	by Chris Dodd and Vadim Maslov
563
564We used GCC and GNU make to compile BtYacc both on UNIX and
565WIN32 paltforms.  You are welcome to try different
566combinations of makes and compilers.  Most likely it will
567work, but it may require Makefile changes.
568
569There is no config script.
570Just type "make" and it should compile.
571
572AWK. If you want to change file btyaccpa.ske (backtracking
573parser skeleton), you will need awk to compile it into
574skeleton.c file. We used GNU AWK (gawk) version 3.0.
575
576It is known that using older versions of gawk
577may create problems in compilation, because older awks
578have problems with backslashes at the end of a line.
579
580For MSDOS, there a "makefile.dos" that should do the trick.
581Note: makefile.dos was not tested for a long time.
582
583The result of compilation should be a single executable called
584"btyacc" which you can install anywhere you like;
585it does not require any other files in the distribution to run.
586
587
588	       Legal Stuff
589	       -----------
590	by Chris Dodd and Vadim Maslov
591
592In English: BtYacc is freeware. BtYacc is distributed with
593no warranty whatsoever. The author and any other contributors
594take no responsibility for any and all consequences of its use.
595
596In Legalese: LIMITATION OF LIABILITY. NEITHER SIBER SYSTEMS
597NOR ANY OF ITS LICENSORS NOR ANY BTYACC CONTRIBUTOR SHALL BE
598LIABLE FOR ANY INDIRECT, INCIDENTAL, SPECIAL OR CONSEQUENTIAL
599DAMAGES, OR DAMAGES FOR LOSS OF PROFITS, REVENUE, DATA OR
600DATA USE, CAUSED BY BTYACC AND INCURRED BY CUSTOMER OR ANY
601THIRD PARTY, WHETHER IN AN ACTION IN CONTRACT OR TORT, EVEN
602IF SIBER SYSTEMS HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH
603DAMAGES.
604