xref: /freebsd/lib/libc/regex/regex.3 (revision d056fa046c6a91b90cd98165face0e42a33a5173)
1.\" Copyright (c) 1992, 1993, 1994 Henry Spencer.
2.\" Copyright (c) 1992, 1993, 1994
3.\"	The Regents of the University of California.  All rights reserved.
4.\"
5.\" This code is derived from software contributed to Berkeley by
6.\" Henry Spencer.
7.\"
8.\" Redistribution and use in source and binary forms, with or without
9.\" modification, are permitted provided that the following conditions
10.\" are met:
11.\" 1. Redistributions of source code must retain the above copyright
12.\"    notice, this list of conditions and the following disclaimer.
13.\" 2. Redistributions in binary form must reproduce the above copyright
14.\"    notice, this list of conditions and the following disclaimer in the
15.\"    documentation and/or other materials provided with the distribution.
16.\" 3. All advertising materials mentioning features or use of this software
17.\"    must display the following acknowledgement:
18.\"	This product includes software developed by the University of
19.\"	California, Berkeley and its contributors.
20.\" 4. Neither the name of the University nor the names of its contributors
21.\"    may be used to endorse or promote products derived from this software
22.\"    without specific prior written permission.
23.\"
24.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
25.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
26.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
27.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
28.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
29.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
30.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
31.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
32.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
33.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
34.\" SUCH DAMAGE.
35.\"
36.\"	@(#)regex.3	8.4 (Berkeley) 3/20/94
37.\" $FreeBSD$
38.\"
39.Dd August 17, 2005
40.Dt REGEX 3
41.Os
42.Sh NAME
43.Nm regcomp ,
44.Nm regexec ,
45.Nm regerror ,
46.Nm regfree
47.Nd regular-expression library
48.Sh LIBRARY
49.Lb libc
50.Sh SYNOPSIS
51.In regex.h
52.Ft int
53.Fo regcomp
54.Fa "regex_t * restrict preg" "const char * restrict pattern" "int cflags"
55.Fc
56.Ft int
57.Fo regexec
58.Fa "const regex_t * restrict preg" "const char * restrict string"
59.Fa "size_t nmatch" "regmatch_t pmatch[restrict]" "int eflags"
60.Fc
61.Ft size_t
62.Fo regerror
63.Fa "int errcode" "const regex_t * restrict preg"
64.Fa "char * restrict errbuf" "size_t errbuf_size"
65.Fc
66.Ft void
67.Fn regfree "regex_t *preg"
68.Sh DESCRIPTION
69These routines implement
70.St -p1003.2
71regular expressions
72.Pq Do RE Dc Ns s ;
73see
74.Xr re_format 7 .
75The
76.Fn regcomp
77function
78compiles an RE written as a string into an internal form,
79.Fn regexec
80matches that internal form against a string and reports results,
81.Fn regerror
82transforms error codes from either into human-readable messages,
83and
84.Fn regfree
85frees any dynamically-allocated storage used by the internal form
86of an RE.
87.Pp
88The header
89.In regex.h
90declares two structure types,
91.Ft regex_t
92and
93.Ft regmatch_t ,
94the former for compiled internal forms and the latter for match reporting.
95It also declares the four functions,
96a type
97.Ft regoff_t ,
98and a number of constants with names starting with
99.Dq Dv REG_ .
100.Pp
101The
102.Fn regcomp
103function
104compiles the regular expression contained in the
105.Fa pattern
106string,
107subject to the flags in
108.Fa cflags ,
109and places the results in the
110.Ft regex_t
111structure pointed to by
112.Fa preg .
113The
114.Fa cflags
115argument
116is the bitwise OR of zero or more of the following flags:
117.Bl -tag -width REG_EXTENDED
118.It Dv REG_EXTENDED
119Compile modern
120.Pq Dq extended
121REs,
122rather than the obsolete
123.Pq Dq basic
124REs that
125are the default.
126.It Dv REG_BASIC
127This is a synonym for 0,
128provided as a counterpart to
129.Dv REG_EXTENDED
130to improve readability.
131.It Dv REG_NOSPEC
132Compile with recognition of all special characters turned off.
133All characters are thus considered ordinary,
134so the
135.Dq RE
136is a literal string.
137This is an extension,
138compatible with but not specified by
139.St -p1003.2 ,
140and should be used with
141caution in software intended to be portable to other systems.
142.Dv REG_EXTENDED
143and
144.Dv REG_NOSPEC
145may not be used
146in the same call to
147.Fn regcomp .
148.It Dv REG_ICASE
149Compile for matching that ignores upper/lower case distinctions.
150See
151.Xr re_format 7 .
152.It Dv REG_NOSUB
153Compile for matching that need only report success or failure,
154not what was matched.
155.It Dv REG_NEWLINE
156Compile for newline-sensitive matching.
157By default, newline is a completely ordinary character with no special
158meaning in either REs or strings.
159With this flag,
160.Ql [^
161bracket expressions and
162.Ql .\&
163never match newline,
164a
165.Ql ^\&
166anchor matches the null string after any newline in the string
167in addition to its normal function,
168and the
169.Ql $\&
170anchor matches the null string before any newline in the
171string in addition to its normal function.
172.It Dv REG_PEND
173The regular expression ends,
174not at the first NUL,
175but just before the character pointed to by the
176.Va re_endp
177member of the structure pointed to by
178.Fa preg .
179The
180.Va re_endp
181member is of type
182.Ft "const char *" .
183This flag permits inclusion of NULs in the RE;
184they are considered ordinary characters.
185This is an extension,
186compatible with but not specified by
187.St -p1003.2 ,
188and should be used with
189caution in software intended to be portable to other systems.
190.El
191.Pp
192When successful,
193.Fn regcomp
194returns 0 and fills in the structure pointed to by
195.Fa preg .
196One member of that structure
197(other than
198.Va re_endp )
199is publicized:
200.Va re_nsub ,
201of type
202.Ft size_t ,
203contains the number of parenthesized subexpressions within the RE
204(except that the value of this member is undefined if the
205.Dv REG_NOSUB
206flag was used).
207If
208.Fn regcomp
209fails, it returns a non-zero error code;
210see
211.Sx DIAGNOSTICS .
212.Pp
213The
214.Fn regexec
215function
216matches the compiled RE pointed to by
217.Fa preg
218against the
219.Fa string ,
220subject to the flags in
221.Fa eflags ,
222and reports results using
223.Fa nmatch ,
224.Fa pmatch ,
225and the returned value.
226The RE must have been compiled by a previous invocation of
227.Fn regcomp .
228The compiled form is not altered during execution of
229.Fn regexec ,
230so a single compiled RE can be used simultaneously by multiple threads.
231.Pp
232By default,
233the NUL-terminated string pointed to by
234.Fa string
235is considered to be the text of an entire line, minus any terminating
236newline.
237The
238.Fa eflags
239argument is the bitwise OR of zero or more of the following flags:
240.Bl -tag -width REG_STARTEND
241.It Dv REG_NOTBOL
242The first character of
243the string
244is not the beginning of a line, so the
245.Ql ^\&
246anchor should not match before it.
247This does not affect the behavior of newlines under
248.Dv REG_NEWLINE .
249.It Dv REG_NOTEOL
250The NUL terminating
251the string
252does not end a line, so the
253.Ql $\&
254anchor should not match before it.
255This does not affect the behavior of newlines under
256.Dv REG_NEWLINE .
257.It Dv REG_STARTEND
258The string is considered to start at
259.Fa string
260+
261.Fa pmatch Ns [0]. Ns Va rm_so
262and to have a terminating NUL located at
263.Fa string
264+
265.Fa pmatch Ns [0]. Ns Va rm_eo
266(there need not actually be a NUL at that location),
267regardless of the value of
268.Fa nmatch .
269See below for the definition of
270.Fa pmatch
271and
272.Fa nmatch .
273This is an extension,
274compatible with but not specified by
275.St -p1003.2 ,
276and should be used with
277caution in software intended to be portable to other systems.
278Note that a non-zero
279.Va rm_so
280does not imply
281.Dv REG_NOTBOL ;
282.Dv REG_STARTEND
283affects only the location of the string,
284not how it is matched.
285.El
286.Pp
287See
288.Xr re_format 7
289for a discussion of what is matched in situations where an RE or a
290portion thereof could match any of several substrings of
291.Fa string .
292.Pp
293Normally,
294.Fn regexec
295returns 0 for success and the non-zero code
296.Dv REG_NOMATCH
297for failure.
298Other non-zero error codes may be returned in exceptional situations;
299see
300.Sx DIAGNOSTICS .
301.Pp
302If
303.Dv REG_NOSUB
304was specified in the compilation of the RE,
305or if
306.Fa nmatch
307is 0,
308.Fn regexec
309ignores the
310.Fa pmatch
311argument (but see below for the case where
312.Dv REG_STARTEND
313is specified).
314Otherwise,
315.Fa pmatch
316points to an array of
317.Fa nmatch
318structures of type
319.Ft regmatch_t .
320Such a structure has at least the members
321.Va rm_so
322and
323.Va rm_eo ,
324both of type
325.Ft regoff_t
326(a signed arithmetic type at least as large as an
327.Ft off_t
328and a
329.Ft ssize_t ) ,
330containing respectively the offset of the first character of a substring
331and the offset of the first character after the end of the substring.
332Offsets are measured from the beginning of the
333.Fa string
334argument given to
335.Fn regexec .
336An empty substring is denoted by equal offsets,
337both indicating the character following the empty substring.
338.Pp
339The 0th member of the
340.Fa pmatch
341array is filled in to indicate what substring of
342.Fa string
343was matched by the entire RE.
344Remaining members report what substring was matched by parenthesized
345subexpressions within the RE;
346member
347.Va i
348reports subexpression
349.Va i ,
350with subexpressions counted (starting at 1) by the order of their opening
351parentheses in the RE, left to right.
352Unused entries in the array (corresponding either to subexpressions that
353did not participate in the match at all, or to subexpressions that do not
354exist in the RE (that is,
355.Va i
356>
357.Fa preg Ns -> Ns Va re_nsub ) )
358have both
359.Va rm_so
360and
361.Va rm_eo
362set to -1.
363If a subexpression participated in the match several times,
364the reported substring is the last one it matched.
365(Note, as an example in particular, that when the RE
366.Ql "(b*)+"
367matches
368.Ql bbb ,
369the parenthesized subexpression matches each of the three
370.So Li b Sc Ns s
371and then
372an infinite number of empty strings following the last
373.Ql b ,
374so the reported substring is one of the empties.)
375.Pp
376If
377.Dv REG_STARTEND
378is specified,
379.Fa pmatch
380must point to at least one
381.Ft regmatch_t
382(even if
383.Fa nmatch
384is 0 or
385.Dv REG_NOSUB
386was specified),
387to hold the input offsets for
388.Dv REG_STARTEND .
389Use for output is still entirely controlled by
390.Fa nmatch ;
391if
392.Fa nmatch
393is 0 or
394.Dv REG_NOSUB
395was specified,
396the value of
397.Fa pmatch Ns [0]
398will not be changed by a successful
399.Fn regexec .
400.Pp
401The
402.Fn regerror
403function
404maps a non-zero
405.Fa errcode
406from either
407.Fn regcomp
408or
409.Fn regexec
410to a human-readable, printable message.
411If
412.Fa preg
413is
414.No non\- Ns Dv NULL ,
415the error code should have arisen from use of
416the
417.Ft regex_t
418pointed to by
419.Fa preg ,
420and if the error code came from
421.Fn regcomp ,
422it should have been the result from the most recent
423.Fn regcomp
424using that
425.Ft regex_t .
426The
427.Fn ( regerror
428may be able to supply a more detailed message using information
429from the
430.Ft regex_t . )
431The
432.Fn regerror
433function
434places the NUL-terminated message into the buffer pointed to by
435.Fa errbuf ,
436limiting the length (including the NUL) to at most
437.Fa errbuf_size
438bytes.
439If the whole message will not fit,
440as much of it as will fit before the terminating NUL is supplied.
441In any case,
442the returned value is the size of buffer needed to hold the whole
443message (including terminating NUL).
444If
445.Fa errbuf_size
446is 0,
447.Fa errbuf
448is ignored but the return value is still correct.
449.Pp
450If the
451.Fa errcode
452given to
453.Fn regerror
454is first ORed with
455.Dv REG_ITOA ,
456the
457.Dq message
458that results is the printable name of the error code,
459e.g.\&
460.Dq Dv REG_NOMATCH ,
461rather than an explanation thereof.
462If
463.Fa errcode
464is
465.Dv REG_ATOI ,
466then
467.Fa preg
468shall be
469.No non\- Ns Dv NULL
470and the
471.Va re_endp
472member of the structure it points to
473must point to the printable name of an error code;
474in this case, the result in
475.Fa errbuf
476is the decimal digits of
477the numeric value of the error code
478(0 if the name is not recognized).
479.Dv REG_ITOA
480and
481.Dv REG_ATOI
482are intended primarily as debugging facilities;
483they are extensions,
484compatible with but not specified by
485.St -p1003.2 ,
486and should be used with
487caution in software intended to be portable to other systems.
488Be warned also that they are considered experimental and changes are possible.
489.Pp
490The
491.Fn regfree
492function
493frees any dynamically-allocated storage associated with the compiled RE
494pointed to by
495.Fa preg .
496The remaining
497.Ft regex_t
498is no longer a valid compiled RE
499and the effect of supplying it to
500.Fn regexec
501or
502.Fn regerror
503is undefined.
504.Pp
505None of these functions references global variables except for tables
506of constants;
507all are safe for use from multiple threads if the arguments are safe.
508.Sh IMPLEMENTATION CHOICES
509There are a number of decisions that
510.St -p1003.2
511leaves up to the implementor,
512either by explicitly saying
513.Dq undefined
514or by virtue of them being
515forbidden by the RE grammar.
516This implementation treats them as follows.
517.Pp
518See
519.Xr re_format 7
520for a discussion of the definition of case-independent matching.
521.Pp
522There is no particular limit on the length of REs,
523except insofar as memory is limited.
524Memory usage is approximately linear in RE size, and largely insensitive
525to RE complexity, except for bounded repetitions.
526See
527.Sx BUGS
528for one short RE using them
529that will run almost any system out of memory.
530.Pp
531A backslashed character other than one specifically given a magic meaning
532by
533.St -p1003.2
534(such magic meanings occur only in obsolete
535.Bq Dq basic
536REs)
537is taken as an ordinary character.
538.Pp
539Any unmatched
540.Ql [\&
541is a
542.Dv REG_EBRACK
543error.
544.Pp
545Equivalence classes cannot begin or end bracket-expression ranges.
546The endpoint of one range cannot begin another.
547.Pp
548.Dv RE_DUP_MAX ,
549the limit on repetition counts in bounded repetitions, is 255.
550.Pp
551A repetition operator
552.Ql ( ?\& ,
553.Ql *\& ,
554.Ql +\& ,
555or bounds)
556cannot follow another
557repetition operator.
558A repetition operator cannot begin an expression or subexpression
559or follow
560.Ql ^\&
561or
562.Ql |\& .
563.Pp
564.Ql |\&
565cannot appear first or last in a (sub)expression or after another
566.Ql |\& ,
567i.e., an operand of
568.Ql |\&
569cannot be an empty subexpression.
570An empty parenthesized subexpression,
571.Ql "()" ,
572is legal and matches an
573empty (sub)string.
574An empty string is not a legal RE.
575.Pp
576A
577.Ql {\&
578followed by a digit is considered the beginning of bounds for a
579bounded repetition, which must then follow the syntax for bounds.
580A
581.Ql {\&
582.Em not
583followed by a digit is considered an ordinary character.
584.Pp
585.Ql ^\&
586and
587.Ql $\&
588beginning and ending subexpressions in obsolete
589.Pq Dq basic
590REs are anchors, not ordinary characters.
591.Sh DIAGNOSTICS
592Non-zero error codes from
593.Fn regcomp
594and
595.Fn regexec
596include the following:
597.Pp
598.Bl -tag -width REG_ECOLLATE -compact
599.It Dv REG_NOMATCH
600The
601.Fn regexec
602function
603failed to match
604.It Dv REG_BADPAT
605invalid regular expression
606.It Dv REG_ECOLLATE
607invalid collating element
608.It Dv REG_ECTYPE
609invalid character class
610.It Dv REG_EESCAPE
611.Ql \e
612applied to unescapable character
613.It Dv REG_ESUBREG
614invalid backreference number
615.It Dv REG_EBRACK
616brackets
617.Ql "[ ]"
618not balanced
619.It Dv REG_EPAREN
620parentheses
621.Ql "( )"
622not balanced
623.It Dv REG_EBRACE
624braces
625.Ql "{ }"
626not balanced
627.It Dv REG_BADBR
628invalid repetition count(s) in
629.Ql "{ }"
630.It Dv REG_ERANGE
631invalid character range in
632.Ql "[ ]"
633.It Dv REG_ESPACE
634ran out of memory
635.It Dv REG_BADRPT
636.Ql ?\& ,
637.Ql *\& ,
638or
639.Ql +\&
640operand invalid
641.It Dv REG_EMPTY
642empty (sub)expression
643.It Dv REG_ASSERT
644cannot happen - you found a bug
645.It Dv REG_INVARG
646invalid argument, e.g.\& negative-length string
647.It Dv REG_ILLSEQ
648illegal byte sequence (bad multibyte character)
649.El
650.Sh SEE ALSO
651.Xr grep 1 ,
652.Xr re_format 7
653.Pp
654.St -p1003.2 ,
655sections 2.8 (Regular Expression Notation)
656and
657B.5 (C Binding for Regular Expression Matching).
658.Sh HISTORY
659Originally written by
660.An Henry Spencer .
661Altered for inclusion in the
662.Bx 4.4
663distribution.
664.Sh BUGS
665This is an alpha release with known defects.
666Please report problems.
667.Pp
668The back-reference code is subtle and doubts linger about its correctness
669in complex cases.
670.Pp
671The
672.Fn regexec
673function
674performance is poor.
675This will improve with later releases.
676The
677.Fa nmatch
678argument
679exceeding 0 is expensive;
680.Fa nmatch
681exceeding 1 is worse.
682The
683.Fn regexec
684function
685is largely insensitive to RE complexity
686.Em except
687that back
688references are massively expensive.
689RE length does matter; in particular, there is a strong speed bonus
690for keeping RE length under about 30 characters,
691with most special characters counting roughly double.
692.Pp
693The
694.Fn regcomp
695function
696implements bounded repetitions by macro expansion,
697which is costly in time and space if counts are large
698or bounded repetitions are nested.
699An RE like, say,
700.Ql "((((a{1,100}){1,100}){1,100}){1,100}){1,100}"
701will (eventually) run almost any existing machine out of swap space.
702.Pp
703There are suspected problems with response to obscure error conditions.
704Notably,
705certain kinds of internal overflow,
706produced only by truly enormous REs or by multiply nested bounded repetitions,
707are probably not handled well.
708.Pp
709Due to a mistake in
710.St -p1003.2 ,
711things like
712.Ql "a)b"
713are legal REs because
714.Ql )\&
715is
716a special character only in the presence of a previous unmatched
717.Ql (\& .
718This cannot be fixed until the spec is fixed.
719.Pp
720The standard's definition of back references is vague.
721For example, does
722.Ql "a\e(\e(b\e)*\e2\e)*d"
723match
724.Ql "abbbd" ?
725Until the standard is clarified,
726behavior in such cases should not be relied on.
727.Pp
728The implementation of word-boundary matching is a bit of a kludge,
729and bugs may lurk in combinations of word-boundary matching and anchoring.
730.Pp
731Word-boundary matching does not work properly in multibyte locales.
732