xref: /freebsd/lib/libc/regex/regex.3 (revision a3e8fd0b7f663db7eafff527d5c3ca3bcfa8a537)
1.\" Copyright (c) 1992, 1993, 1994 Henry Spencer.
2.\" Copyright (c) 1992, 1993, 1994
3.\"	The Regents of the University of California.  All rights reserved.
4.\"
5.\" This code is derived from software contributed to Berkeley by
6.\" Henry Spencer.
7.\"
8.\" Redistribution and use in source and binary forms, with or without
9.\" modification, are permitted provided that the following conditions
10.\" are met:
11.\" 1. Redistributions of source code must retain the above copyright
12.\"    notice, this list of conditions and the following disclaimer.
13.\" 2. Redistributions in binary form must reproduce the above copyright
14.\"    notice, this list of conditions and the following disclaimer in the
15.\"    documentation and/or other materials provided with the distribution.
16.\" 3. All advertising materials mentioning features or use of this software
17.\"    must display the following acknowledgement:
18.\"	This product includes software developed by the University of
19.\"	California, Berkeley and its contributors.
20.\" 4. Neither the name of the University nor the names of its contributors
21.\"    may be used to endorse or promote products derived from this software
22.\"    without specific prior written permission.
23.\"
24.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
25.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
26.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
27.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
28.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
29.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
30.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
31.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
32.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
33.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
34.\" SUCH DAMAGE.
35.\"
36.\"	@(#)regex.3	8.4 (Berkeley) 3/20/94
37.\" $FreeBSD$
38.\"
39.Dd October 2, 2002
40.Dt REGEX 3
41.Os
42.Sh NAME
43.Nm regcomp ,
44.Nm regexec ,
45.Nm regerror ,
46.Nm regfree
47.Nd regular-expression library
48.Sh LIBRARY
49.Lb libc
50.Sh SYNOPSIS
51.In regex.h
52.Ft int
53.Fo regcomp
54.Fa "regex_t * restrict preg" "const char * restrict pattern" "int cflags"
55.Fc
56.Ft int
57.Fo regexec
58.Fa "const regex_t * restrict preg" "const char * restrict string"
59.Fa "size_t nmatch" "regmatch_t pmatch[restrict]" "int eflags"
60.Fc
61.Ft size_t
62.Fo regerror
63.Fa "int errcode" "const regex_t * restrict preg"
64.Fa "char * restrict errbuf" "size_t errbuf_size"
65.Fc
66.Ft void
67.Fn regfree "regex_t *preg"
68.Sh DESCRIPTION
69These routines implement
70.St -p1003.2
71regular expressions
72.Pq Do RE Dc Ns s ;
73see
74.Xr re_format 7 .
75.Fn Regcomp
76compiles an RE written as a string into an internal form,
77.Fn regexec
78matches that internal form against a string and reports results,
79.Fn regerror
80transforms error codes from either into human-readable messages,
81and
82.Fn regfree
83frees any dynamically-allocated storage used by the internal form
84of an RE.
85.Pp
86The header
87.Aq Pa regex.h
88declares two structure types,
89.Ft regex_t
90and
91.Ft regmatch_t ,
92the former for compiled internal forms and the latter for match reporting.
93It also declares the four functions,
94a type
95.Ft regoff_t ,
96and a number of constants with names starting with
97.Dq Dv REG_ .
98.Pp
99.Fn Regcomp
100compiles the regular expression contained in the
101.Fa pattern
102string,
103subject to the flags in
104.Fa cflags ,
105and places the results in the
106.Ft regex_t
107structure pointed to by
108.Fa preg .
109.Fa Cflags
110is the bitwise OR of zero or more of the following flags:
111.Bl -tag -width REG_EXTENDED
112.It Dv REG_EXTENDED
113Compile modern
114.Pq Dq extended
115REs,
116rather than the obsolete
117.Pq Dq basic
118REs that
119are the default.
120.It Dv REG_BASIC
121This is a synonym for 0,
122provided as a counterpart to
123.Dv REG_EXTENDED
124to improve readability.
125.It Dv REG_NOSPEC
126Compile with recognition of all special characters turned off.
127All characters are thus considered ordinary,
128so the
129.Dq RE
130is a literal string.
131This is an extension,
132compatible with but not specified by
133.St -p1003.2 ,
134and should be used with
135caution in software intended to be portable to other systems.
136.Dv REG_EXTENDED
137and
138.Dv REG_NOSPEC
139may not be used
140in the same call to
141.Fn regcomp .
142.It Dv REG_ICASE
143Compile for matching that ignores upper/lower case distinctions.
144See
145.Xr re_format 7 .
146.It Dv REG_NOSUB
147Compile for matching that need only report success or failure,
148not what was matched.
149.It Dv REG_NEWLINE
150Compile for newline-sensitive matching.
151By default, newline is a completely ordinary character with no special
152meaning in either REs or strings.
153With this flag,
154.Ql [^
155bracket expressions and
156.Ql .\&
157never match newline,
158a
159.Ql ^\&
160anchor matches the null string after any newline in the string
161in addition to its normal function,
162and the
163.Ql $\&
164anchor matches the null string before any newline in the
165string in addition to its normal function.
166.It Dv REG_PEND
167The regular expression ends,
168not at the first NUL,
169but just before the character pointed to by the
170.Va re_endp
171member of the structure pointed to by
172.Fa preg .
173The
174.Va re_endp
175member is of type
176.Ft "const char *" .
177This flag permits inclusion of NULs in the RE;
178they are considered ordinary characters.
179This is an extension,
180compatible with but not specified by
181.St -p1003.2 ,
182and should be used with
183caution in software intended to be portable to other systems.
184.El
185.Pp
186When successful,
187.Fn regcomp
188returns 0 and fills in the structure pointed to by
189.Fa preg .
190One member of that structure
191(other than
192.Va re_endp )
193is publicized:
194.Va re_nsub ,
195of type
196.Ft size_t ,
197contains the number of parenthesized subexpressions within the RE
198(except that the value of this member is undefined if the
199.Dv REG_NOSUB
200flag was used).
201If
202.Fn regcomp
203fails, it returns a non-zero error code;
204see
205.Sx DIAGNOSTICS .
206.Pp
207.Fn Regexec
208matches the compiled RE pointed to by
209.Fa preg
210against the
211.Fa string ,
212subject to the flags in
213.Fa eflags ,
214and reports results using
215.Fa nmatch ,
216.Fa pmatch ,
217and the returned value.
218The RE must have been compiled by a previous invocation of
219.Fn regcomp .
220The compiled form is not altered during execution of
221.Fn regexec ,
222so a single compiled RE can be used simultaneously by multiple threads.
223.Pp
224By default,
225the NUL-terminated string pointed to by
226.Fa string
227is considered to be the text of an entire line, minus any terminating
228newline.
229The
230.Fa eflags
231argument is the bitwise OR of zero or more of the following flags:
232.Bl -tag -width REG_STARTEND
233.It Dv REG_NOTBOL
234The first character of
235the string
236is not the beginning of a line, so the
237.Ql ^\&
238anchor should not match before it.
239This does not affect the behavior of newlines under
240.Dv REG_NEWLINE .
241.It Dv REG_NOTEOL
242The NUL terminating
243the string
244does not end a line, so the
245.Ql $\&
246anchor should not match before it.
247This does not affect the behavior of newlines under
248.Dv REG_NEWLINE .
249.It Dv REG_STARTEND
250The string is considered to start at
251.Fa string
252+
253.Fa pmatch Ns [0]. Ns Va rm_so
254and to have a terminating NUL located at
255.Fa string
256+
257.Fa pmatch Ns [0]. Ns Va rm_eo
258(there need not actually be a NUL at that location),
259regardless of the value of
260.Fa nmatch .
261See below for the definition of
262.Fa pmatch
263and
264.Fa nmatch .
265This is an extension,
266compatible with but not specified by
267.St -p1003.2 ,
268and should be used with
269caution in software intended to be portable to other systems.
270Note that a non-zero
271.Va rm_so
272does not imply
273.Dv REG_NOTBOL ;
274.Dv REG_STARTEND
275affects only the location of the string,
276not how it is matched.
277.El
278.Pp
279See
280.Xr re_format 7
281for a discussion of what is matched in situations where an RE or a
282portion thereof could match any of several substrings of
283.Fa string .
284.Pp
285Normally,
286.Fn regexec
287returns 0 for success and the non-zero code
288.Dv REG_NOMATCH
289for failure.
290Other non-zero error codes may be returned in exceptional situations;
291see
292.Sx DIAGNOSTICS .
293.Pp
294If
295.Dv REG_NOSUB
296was specified in the compilation of the RE,
297or if
298.Fa nmatch
299is 0,
300.Fn regexec
301ignores the
302.Fa pmatch
303argument (but see below for the case where
304.Dv REG_STARTEND
305is specified).
306Otherwise,
307.Fa pmatch
308points to an array of
309.Fa nmatch
310structures of type
311.Ft regmatch_t .
312Such a structure has at least the members
313.Va rm_so
314and
315.Va rm_eo ,
316both of type
317.Ft regoff_t
318(a signed arithmetic type at least as large as an
319.Ft off_t
320and a
321.Ft ssize_t ) ,
322containing respectively the offset of the first character of a substring
323and the offset of the first character after the end of the substring.
324Offsets are measured from the beginning of the
325.Fa string
326argument given to
327.Fn regexec .
328An empty substring is denoted by equal offsets,
329both indicating the character following the empty substring.
330.Pp
331The 0th member of the
332.Fa pmatch
333array is filled in to indicate what substring of
334.Fa string
335was matched by the entire RE.
336Remaining members report what substring was matched by parenthesized
337subexpressions within the RE;
338member
339.Va i
340reports subexpression
341.Va i ,
342with subexpressions counted (starting at 1) by the order of their opening
343parentheses in the RE, left to right.
344Unused entries in the array (corresponding either to subexpressions that
345did not participate in the match at all, or to subexpressions that do not
346exist in the RE (that is,
347.Va i
348>
349.Fa preg Ns -> Ns Va re_nsub ) )
350have both
351.Va rm_so
352and
353.Va rm_eo
354set to -1.
355If a subexpression participated in the match several times,
356the reported substring is the last one it matched.
357(Note, as an example in particular, that when the RE
358.Ql "(b*)+"
359matches
360.Ql bbb ,
361the parenthesized subexpression matches each of the three
362.So Li b Sc Ns s
363and then
364an infinite number of empty strings following the last
365.Ql b ,
366so the reported substring is one of the empties.)
367.Pp
368If
369.Dv REG_STARTEND
370is specified,
371.Fa pmatch
372must point to at least one
373.Ft regmatch_t
374(even if
375.Fa nmatch
376is 0 or
377.Dv REG_NOSUB
378was specified),
379to hold the input offsets for
380.Dv REG_STARTEND .
381Use for output is still entirely controlled by
382.Fa nmatch ;
383if
384.Fa nmatch
385is 0 or
386.Dv REG_NOSUB
387was specified,
388the value of
389.Fa pmatch Ns [0]
390will not be changed by a successful
391.Fn regexec .
392.Pp
393.Fn Regerror
394maps a non-zero
395.Fa errcode
396from either
397.Fn regcomp
398or
399.Fn regexec
400to a human-readable, printable message.
401If
402.Fa preg
403is
404.No non\- Ns Dv NULL ,
405the error code should have arisen from use of
406the
407.Ft regex_t
408pointed to by
409.Fa preg ,
410and if the error code came from
411.Fn regcomp ,
412it should have been the result from the most recent
413.Fn regcomp
414using that
415.Ft regex_t .
416.No ( Fn Regerror
417may be able to supply a more detailed message using information
418from the
419.Ft regex_t . )
420.Fn Regerror
421places the NUL-terminated message into the buffer pointed to by
422.Fa errbuf ,
423limiting the length (including the NUL) to at most
424.Fa errbuf_size
425bytes.
426If the whole message won't fit,
427as much of it as will fit before the terminating NUL is supplied.
428In any case,
429the returned value is the size of buffer needed to hold the whole
430message (including terminating NUL).
431If
432.Fa errbuf_size
433is 0,
434.Fa errbuf
435is ignored but the return value is still correct.
436.Pp
437If the
438.Fa errcode
439given to
440.Fn regerror
441is first ORed with
442.Dv REG_ITOA ,
443the
444.Dq message
445that results is the printable name of the error code,
446e.g.\&
447.Dq Dv REG_NOMATCH ,
448rather than an explanation thereof.
449If
450.Fa errcode
451is
452.Dv REG_ATOI ,
453then
454.Fa preg
455shall be
456.No non\- Ns Dv NULL
457and the
458.Va re_endp
459member of the structure it points to
460must point to the printable name of an error code;
461in this case, the result in
462.Fa errbuf
463is the decimal digits of
464the numeric value of the error code
465(0 if the name is not recognized).
466.Dv REG_ITOA
467and
468.Dv REG_ATOI
469are intended primarily as debugging facilities;
470they are extensions,
471compatible with but not specified by
472.St -p1003.2 ,
473and should be used with
474caution in software intended to be portable to other systems.
475Be warned also that they are considered experimental and changes are possible.
476.Pp
477.Fn Regfree
478frees any dynamically-allocated storage associated with the compiled RE
479pointed to by
480.Fa preg .
481The remaining
482.Ft regex_t
483is no longer a valid compiled RE
484and the effect of supplying it to
485.Fn regexec
486or
487.Fn regerror
488is undefined.
489.Pp
490None of these functions references global variables except for tables
491of constants;
492all are safe for use from multiple threads if the arguments are safe.
493.Sh IMPLEMENTATION CHOICES
494There are a number of decisions that
495.St -p1003.2
496leaves up to the implementor,
497either by explicitly saying
498.Dq undefined
499or by virtue of them being
500forbidden by the RE grammar.
501This implementation treats them as follows.
502.Pp
503See
504.Xr re_format 7
505for a discussion of the definition of case-independent matching.
506.Pp
507There is no particular limit on the length of REs,
508except insofar as memory is limited.
509Memory usage is approximately linear in RE size, and largely insensitive
510to RE complexity, except for bounded repetitions.
511See
512.Sx BUGS
513for one short RE using them
514that will run almost any system out of memory.
515.Pp
516A backslashed character other than one specifically given a magic meaning
517by
518.St -p1003.2
519(such magic meanings occur only in obsolete
520.Bq Dq basic
521REs)
522is taken as an ordinary character.
523.Pp
524Any unmatched
525.Ql [\&
526is a
527.Dv REG_EBRACK
528error.
529.Pp
530Equivalence classes cannot begin or end bracket-expression ranges.
531The endpoint of one range cannot begin another.
532.Pp
533.Dv RE_DUP_MAX ,
534the limit on repetition counts in bounded repetitions, is 255.
535.Pp
536A repetition operator
537.Ql ( ?\& ,
538.Ql *\& ,
539.Ql +\& ,
540or bounds)
541cannot follow another
542repetition operator.
543A repetition operator cannot begin an expression or subexpression
544or follow
545.Ql ^\&
546or
547.Ql |\& .
548.Pp
549.Ql |\&
550cannot appear first or last in a (sub)expression or after another
551.Ql |\& ,
552i.e. an operand of
553.Ql |\&
554cannot be an empty subexpression.
555An empty parenthesized subexpression,
556.Ql "()" ,
557is legal and matches an
558empty (sub)string.
559An empty string is not a legal RE.
560.Pp
561A
562.Ql {\&
563followed by a digit is considered the beginning of bounds for a
564bounded repetition, which must then follow the syntax for bounds.
565A
566.Ql {\&
567.Em not
568followed by a digit is considered an ordinary character.
569.Pp
570.Ql ^\&
571and
572.Ql $\&
573beginning and ending subexpressions in obsolete
574.Pq Dq basic
575REs are anchors, not ordinary characters.
576.Sh SEE ALSO
577.Xr grep 1 ,
578.Xr re_format 7
579.Pp
580.St -p1003.2 ,
581sections 2.8 (Regular Expression Notation)
582and
583B.5 (C Binding for Regular Expression Matching).
584.Sh DIAGNOSTICS
585Non-zero error codes from
586.Fn regcomp
587and
588.Fn regexec
589include the following:
590.Pp
591.Bl -tag -width REG_ECOLLATE -compact
592.It Dv REG_NOMATCH
593.Fn regexec
594failed to match
595.It Dv REG_BADPAT
596invalid regular expression
597.It Dv REG_ECOLLATE
598invalid collating element
599.It Dv REG_ECTYPE
600invalid character class
601.It Dv REG_EESCAPE
602.Ql \e
603applied to unescapable character
604.It Dv REG_ESUBREG
605invalid backreference number
606.It Dv REG_EBRACK
607brackets
608.Ql "[ ]"
609not balanced
610.It Dv REG_EPAREN
611parentheses
612.Ql "( )"
613not balanced
614.It Dv REG_EBRACE
615braces
616.Ql "{ }"
617not balanced
618.It Dv REG_BADBR
619invalid repetition count(s) in
620.Ql "{ }"
621.It Dv REG_ERANGE
622invalid character range in
623.Ql "[ ]"
624.It Dv REG_ESPACE
625ran out of memory
626.It Dv REG_BADRPT
627.Ql ?\& ,
628.Ql *\& ,
629or
630.Ql +\&
631operand invalid
632.It Dv REG_EMPTY
633empty (sub)expression
634.It Dv REG_ASSERT
635can't happen - you found a bug
636.It Dv REG_INVARG
637invalid argument, e.g. negative-length string
638.El
639.Sh HISTORY
640Originally written by
641.An Henry Spencer .
642Altered for inclusion in the
643.Bx 4.4
644distribution.
645.Sh BUGS
646This is an alpha release with known defects.
647Please report problems.
648.Pp
649The back-reference code is subtle and doubts linger about its correctness
650in complex cases.
651.Pp
652.Fn Regexec
653performance is poor.
654This will improve with later releases.
655.Fa Nmatch
656exceeding 0 is expensive;
657.Fa nmatch
658exceeding 1 is worse.
659.Fn Regexec
660is largely insensitive to RE complexity
661.Em except
662that back
663references are massively expensive.
664RE length does matter; in particular, there is a strong speed bonus
665for keeping RE length under about 30 characters,
666with most special characters counting roughly double.
667.Pp
668.Fn Regcomp
669implements bounded repetitions by macro expansion,
670which is costly in time and space if counts are large
671or bounded repetitions are nested.
672An RE like, say,
673.Ql "((((a{1,100}){1,100}){1,100}){1,100}){1,100}"
674will (eventually) run almost any existing machine out of swap space.
675.Pp
676There are suspected problems with response to obscure error conditions.
677Notably,
678certain kinds of internal overflow,
679produced only by truly enormous REs or by multiply nested bounded repetitions,
680are probably not handled well.
681.Pp
682Due to a mistake in
683.St -p1003.2 ,
684things like
685.Ql "a)b"
686are legal REs because
687.Ql )\&
688is
689a special character only in the presence of a previous unmatched
690.Ql (\& .
691This can't be fixed until the spec is fixed.
692.Pp
693The standard's definition of back references is vague.
694For example, does
695.Ql "a\e(\e(b\e)*\e2\e)*d"
696match
697.Ql "abbbd" ?
698Until the standard is clarified,
699behavior in such cases should not be relied on.
700.Pp
701The implementation of word-boundary matching is a bit of a kludge,
702and bugs may lurk in combinations of word-boundary matching and anchoring.
703