xref: /freebsd/lib/libc/regex/re_format.7 (revision 4e579ad047720775ab580b74192c7de8a3386fea)
1.\" Copyright (c) 1992, 1993, 1994 Henry Spencer.
2.\" Copyright (c) 1992, 1993, 1994
3.\"	The Regents of the University of California.  All rights reserved.
4.\"
5.\" This code is derived from software contributed to Berkeley by
6.\" Henry Spencer.
7.\"
8.\" Redistribution and use in source and binary forms, with or without
9.\" modification, are permitted provided that the following conditions
10.\" are met:
11.\" 1. Redistributions of source code must retain the above copyright
12.\"    notice, this list of conditions and the following disclaimer.
13.\" 2. Redistributions in binary form must reproduce the above copyright
14.\"    notice, this list of conditions and the following disclaimer in the
15.\"    documentation and/or other materials provided with the distribution.
16.\" 3. All advertising materials mentioning features or use of this software
17.\"    must display the following acknowledgement:
18.\"	This product includes software developed by the University of
19.\"	California, Berkeley and its contributors.
20.\" 4. Neither the name of the University nor the names of its contributors
21.\"    may be used to endorse or promote products derived from this software
22.\"    without specific prior written permission.
23.\"
24.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
25.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
26.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
27.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
28.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
29.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
30.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
31.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
32.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
33.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
34.\" SUCH DAMAGE.
35.\"
36.\"	@(#)re_format.7	8.3 (Berkeley) 3/20/94
37.\"
38.Dd June 30, 2014
39.Dt RE_FORMAT 7
40.Os
41.Sh NAME
42.Nm re_format
43.Nd POSIX 1003.2 regular expressions
44.Sh DESCRIPTION
45Regular expressions
46.Pq Dq RE Ns s ,
47as defined in
48.St -p1003.2 ,
49come in two forms:
50modern REs (roughly those of
51.Xr egrep 1 ;
521003.2 calls these
53.Dq extended
54REs)
55and obsolete REs (roughly those of
56.Xr ed 1 ;
571003.2
58.Dq basic
59REs).
60Obsolete REs mostly exist for backward compatibility in some old programs;
61they will be discussed at the end.
62.St -p1003.2
63leaves some aspects of RE syntax and semantics open;
64`\(dd' marks decisions on these aspects that
65may not be fully portable to other
66.St -p1003.2
67implementations.
68.Pp
69A (modern) RE is one\(dd or more non-empty\(dd
70.Em branches ,
71separated by
72.Ql \&| .
73It matches anything that matches one of the branches.
74.Pp
75A branch is one\(dd or more
76.Em pieces ,
77concatenated.
78It matches a match for the first, followed by a match for the second, etc.
79.Pp
80A piece is an
81.Em atom
82possibly followed
83by a single\(dd
84.Ql \&* ,
85.Ql \&+ ,
86.Ql \&? ,
87or
88.Em bound .
89An atom followed by
90.Ql \&*
91matches a sequence of 0 or more matches of the atom.
92An atom followed by
93.Ql \&+
94matches a sequence of 1 or more matches of the atom.
95An atom followed by
96.Ql ?\&
97matches a sequence of 0 or 1 matches of the atom.
98.Pp
99A
100.Em bound
101is
102.Ql \&{
103followed by an unsigned decimal integer,
104possibly followed by
105.Ql \&,
106possibly followed by another unsigned decimal integer,
107always followed by
108.Ql \&} .
109The integers must lie between 0 and
110.Dv RE_DUP_MAX
111(255\(dd) inclusive,
112and if there are two of them, the first may not exceed the second.
113An atom followed by a bound containing one integer
114.Em i
115and no comma matches
116a sequence of exactly
117.Em i
118matches of the atom.
119An atom followed by a bound
120containing one integer
121.Em i
122and a comma matches
123a sequence of
124.Em i
125or more matches of the atom.
126An atom followed by a bound
127containing two integers
128.Em i
129and
130.Em j
131matches
132a sequence of
133.Em i
134through
135.Em j
136(inclusive) matches of the atom.
137.Pp
138An atom is a regular expression enclosed in
139.Ql ()
140(matching a match for the
141regular expression),
142an empty set of
143.Ql ()
144(matching the null string)\(dd,
145a
146.Em bracket expression
147(see below),
148.Ql .\&
149(matching any single character),
150.Ql \&^
151(matching the null string at the beginning of a line),
152.Ql \&$
153(matching the null string at the end of a line), a
154.Ql \e
155followed by one of the characters
156.Ql ^.[$()|*+?{\e
157(matching that character taken as an ordinary character),
158a
159.Ql \e
160followed by any other character\(dd
161(matching that character taken as an ordinary character,
162as if the
163.Ql \e
164had not been present\(dd),
165or a single character with no other significance (matching that character).
166A
167.Ql \&{
168followed by a character other than a digit is an ordinary
169character, not the beginning of a bound\(dd.
170It is illegal to end an RE with
171.Ql \e .
172.Pp
173A
174.Em bracket expression
175is a list of characters enclosed in
176.Ql [] .
177It normally matches any single character from the list (but see below).
178If the list begins with
179.Ql \&^ ,
180it matches any single character
181(but see below)
182.Em not
183from the rest of the list.
184If two characters in the list are separated by
185.Ql \&- ,
186this is shorthand
187for the full
188.Em range
189of characters between those two (inclusive) in the
190collating sequence,
191.No e.g. Ql [0-9]
192in ASCII matches any decimal digit.
193It is illegal\(dd for two ranges to share an
194endpoint,
195.No e.g. Ql a-c-e .
196Ranges are very collating-sequence-dependent,
197and portable programs should avoid relying on them.
198.Pp
199To include a literal
200.Ql \&]
201in the list, make it the first character
202(following a possible
203.Ql \&^ ) .
204To include a literal
205.Ql \&- ,
206make it the first or last character,
207or the second endpoint of a range.
208To use a literal
209.Ql \&-
210as the first endpoint of a range,
211enclose it in
212.Ql [.\&
213and
214.Ql .]\&
215to make it a collating element (see below).
216With the exception of these and some combinations using
217.Ql \&[
218(see next paragraphs), all other special characters, including
219.Ql \e ,
220lose their special significance within a bracket expression.
221.Pp
222Within a bracket expression, a collating element (a character,
223a multi-character sequence that collates as if it were a single character,
224or a collating-sequence name for either)
225enclosed in
226.Ql [.\&
227and
228.Ql .]\&
229stands for the
230sequence of characters of that collating element.
231The sequence is a single element of the bracket expression's list.
232A bracket expression containing a multi-character collating element
233can thus match more than one character,
234e.g.\& if the collating sequence includes a
235.Ql ch
236collating element,
237then the RE
238.Ql [[.ch.]]*c
239matches the first five characters
240of
241.Ql chchcc .
242.Pp
243Within a bracket expression, a collating element enclosed in
244.Ql [=
245and
246.Ql =]
247is an equivalence class, standing for the sequences of characters
248of all collating elements equivalent to that one, including itself.
249(If there are no other equivalent collating elements,
250the treatment is as if the enclosing delimiters were
251.Ql [.\&
252and
253.Ql .] . )
254For example, if
255.Ql x
256and
257.Ql y
258are the members of an equivalence class,
259then
260.Ql [[=x=]] ,
261.Ql [[=y=]] ,
262and
263.Ql [xy]
264are all synonymous.
265An equivalence class may not\(dd be an endpoint
266of a range.
267.Pp
268Within a bracket expression, the name of a
269.Em character class
270enclosed in
271.Ql [:
272and
273.Ql :]
274stands for the list of all characters belonging to that
275class.
276Standard character class names are:
277.Bl -column "alnum" "digit" "xdigit" -offset indent
278.It Em "alnum	digit	punct"
279.It Em "alpha	graph	space"
280.It Em "blank	lower	upper"
281.It Em "cntrl	print	xdigit"
282.El
283.Pp
284These stand for the character classes defined in
285.Xr ctype 3 .
286A locale may provide others.
287A character class may not be used as an endpoint of a range.
288.Pp
289A bracketed expression like
290.Ql [[:class:]]
291can be used to match a single character that belongs to a character
292class.
293The reverse, matching any character that does not belong to a specific
294class, the negation operator of bracket expressions may be used:
295.Ql [^[:class:]] .
296.Pp
297There are two special cases\(dd of bracket expressions:
298the bracket expressions
299.Ql [[:<:]]
300and
301.Ql [[:>:]]
302match the null string at the beginning and end of a word respectively.
303A word is defined as a sequence of word characters
304which is neither preceded nor followed by
305word characters.
306A word character is an
307.Em alnum
308character (as defined by
309.Xr ctype 3 )
310or an underscore.
311This is an extension,
312compatible with but not specified by
313.St -p1003.2 ,
314and should be used with
315caution in software intended to be portable to other systems.
316The additional word delimiters
317.Ql \e<
318and
319.Ql \e>
320are provided to ease compatibility with traditional
321SVR4
322systems but are not portable and should be avoided.
323.Pp
324In the event that an RE could match more than one substring of a given
325string,
326the RE matches the one starting earliest in the string.
327If the RE could match more than one substring starting at that point,
328it matches the longest.
329Subexpressions also match the longest possible substrings, subject to
330the constraint that the whole match be as long as possible,
331with subexpressions starting earlier in the RE taking priority over
332ones starting later.
333Note that higher-level subexpressions thus take priority over
334their lower-level component subexpressions.
335.Pp
336Match lengths are measured in characters, not collating elements.
337A null string is considered longer than no match at all.
338For example,
339.Ql bb*
340matches the three middle characters of
341.Ql abbbc ,
342.Ql (wee|week)(knights|nights)
343matches all ten characters of
344.Ql weeknights ,
345when
346.Ql (.*).*\&
347is matched against
348.Ql abc
349the parenthesized subexpression
350matches all three characters, and
351when
352.Ql (a*)*
353is matched against
354.Ql bc
355both the whole RE and the parenthesized
356subexpression match the null string.
357.Pp
358If case-independent matching is specified,
359the effect is much as if all case distinctions had vanished from the
360alphabet.
361When an alphabetic that exists in multiple cases appears as an
362ordinary character outside a bracket expression, it is effectively
363transformed into a bracket expression containing both cases,
364.No e.g. Ql x
365becomes
366.Ql [xX] .
367When it appears inside a bracket expression, all case counterparts
368of it are added to the bracket expression, so that (e.g.)
369.Ql [x]
370becomes
371.Ql [xX]
372and
373.Ql [^x]
374becomes
375.Ql [^xX] .
376.Pp
377No particular limit is imposed on the length of REs\(dd.
378Programs intended to be portable should not employ REs longer
379than 256 bytes,
380as an implementation can refuse to accept such REs and remain
381POSIX-compliant.
382.Pp
383Obsolete
384.Pq Dq basic
385regular expressions differ in several respects.
386.Ql \&|
387is an ordinary character and there is no equivalent
388for its functionality.
389.Ql \&+
390and
391.Ql ?\&
392are ordinary characters, and their functionality
393can be expressed using bounds
394.Po
395.Ql {1,}
396or
397.Ql {0,1}
398respectively
399.Pc .
400Also note that
401.Ql x+
402in modern REs is equivalent to
403.Ql xx* .
404The delimiters for bounds are
405.Ql \e{
406and
407.Ql \e} ,
408with
409.Ql \&{
410and
411.Ql \&}
412by themselves ordinary characters.
413The parentheses for nested subexpressions are
414.Ql \e(
415and
416.Ql \e) ,
417with
418.Ql \&(
419and
420.Ql \&)
421by themselves ordinary characters.
422.Ql \&^
423is an ordinary character except at the beginning of the
424RE or\(dd the beginning of a parenthesized subexpression,
425.Ql \&$
426is an ordinary character except at the end of the
427RE or\(dd the end of a parenthesized subexpression,
428and
429.Ql \&*
430is an ordinary character if it appears at the beginning of the
431RE or the beginning of a parenthesized subexpression
432(after a possible leading
433.Ql \&^ ) .
434Finally, there is one new type of atom, a
435.Em back reference :
436.Ql \e
437followed by a non-zero decimal digit
438.Em d
439matches the same sequence of characters
440matched by the
441.Em d Ns th
442parenthesized subexpression
443(numbering subexpressions by the positions of their opening parentheses,
444left to right),
445so that (e.g.)
446.Ql \e([bc]\e)\e1
447matches
448.Ql bb
449or
450.Ql cc
451but not
452.Ql bc .
453.Sh SEE ALSO
454.Xr regex 3
455.Rs
456.%T Regular Expression Notation
457.%R IEEE Std
458.%N 1003.2
459.%P section 2.8
460.Re
461.Sh BUGS
462Having two kinds of REs is a botch.
463.Pp
464The current
465.St -p1003.2
466spec says that
467.Ql \&)
468is an ordinary character in
469the absence of an unmatched
470.Ql \&( ;
471this was an unintentional result of a wording error,
472and change is likely.
473Avoid relying on it.
474.Pp
475Back references are a dreadful botch,
476posing major problems for efficient implementations.
477They are also somewhat vaguely defined
478(does
479.Ql a\e(\e(b\e)*\e2\e)*d
480match
481.Ql abbbd ? ) .
482Avoid using them.
483.Pp
484.St -p1003.2
485specification of case-independent matching is vague.
486The
487.Dq one case implies all cases
488definition given above
489is current consensus among implementors as to the right interpretation.
490.Pp
491The syntax for word boundaries is incredibly ugly.
492