.\" Copyright (c) 1992, 1993, 1994 Henry Spencer.
.\" Copyright (c) 1992, 1993, 1994
.\"	The Regents of the University of California.  All rights reserved.
.\"
.\" This code is derived from software contributed to Berkeley by
.\" Henry Spencer.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\" 4. Neither the name of the University nor the names of its contributors
.\"    may be used to endorse or promote products derived from this software
.\"    without specific prior written permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\"	@(#)regex.3	8.4 (Berkeley) 3/20/94
.\" $FreeBSD$
.\"
.Dd August 17, 2005
.Dt REGEX 3
.Os
.Sh NAME
.Nm regcomp ,
.Nm regexec ,
.Nm regerror ,
.Nm regfree
.Nd regular-expression library
.Sh LIBRARY
.Lb libc
.Sh SYNOPSIS
.In regex.h
.Ft int
.Fo regcomp
.Fa "regex_t * restrict preg" "const char * restrict pattern" "int cflags"
.Fc
.Ft int
.Fo regexec
.Fa "const regex_t * restrict preg" "const char * restrict string"
.Fa "size_t nmatch" "regmatch_t pmatch[restrict]" "int eflags"
.Fc
.Ft size_t
.Fo regerror
.Fa "int errcode" "const regex_t * restrict preg"
.Fa "char * restrict errbuf" "size_t errbuf_size"
.Fc
.Ft void
.Fn regfree "regex_t *preg"
.Sh DESCRIPTION
These routines implement
.St -p1003.2
regular expressions
.Pq Do RE Dc Ns s ;
see
.Xr re_format 7 .
The
.Fn regcomp
function
compiles an RE written as a string into an internal form,
.Fn regexec
matches that internal form against a string and reports results,
.Fn regerror
transforms error codes from either into human-readable messages,
and
.Fn regfree
frees any dynamically-allocated storage used by the internal form
of an RE.
.Pp
The header
.In regex.h
declares two structure types,
.Ft regex_t
and
.Ft regmatch_t ,
the former for compiled internal forms and the latter for match reporting.
It also declares the four functions,
a type
.Ft regoff_t ,
and a number of constants with names starting with
.Dq Dv REG_ .
.Pp
The
.Fn regcomp
function
compiles the regular expression contained in the
.Fa pattern
string,
subject to the flags in
.Fa cflags ,
and places the results in the
.Ft regex_t
structure pointed to by
.Fa preg .
The
.Fa cflags
argument
is the bitwise OR of zero or more of the following flags:
.Bl -tag -width REG_EXTENDED
.It Dv REG_EXTENDED
Compile modern
.Pq Dq extended
REs,
rather than the obsolete
.Pq Dq basic
REs that
are the default.
.It Dv REG_BASIC
This is a synonym for 0,
provided as a counterpart to
.Dv REG_EXTENDED
to improve readability.
.It Dv REG_NOSPEC
Compile with recognition of all special characters turned off.
All characters are thus considered ordinary,
so the
.Dq RE
is a literal string.
This is an extension,
compatible with but not specified by
.St -p1003.2 ,
and should be used with
caution in software intended to be portable to other systems.
.Dv REG_EXTENDED
and
.Dv REG_NOSPEC
may not be used
in the same call to
.Fn regcomp .
.It Dv REG_ICASE
Compile for matching that ignores upper/lower case distinctions.
See
.Xr re_format 7 .
.It Dv REG_NOSUB
Compile for matching that need only report success or failure,
not what was matched.
.It Dv REG_NEWLINE
Compile for newline-sensitive matching.
By default, newline is a completely ordinary character with no special
meaning in either REs or strings.
With this flag,
.Ql [^
bracket expressions and
.Ql .\&
never match newline,
a
.Ql ^\&
anchor matches the null string after any newline in the string
in addition to its normal function,
and the
.Ql $\&
anchor matches the null string before any newline in the
string in addition to its normal function.
.It Dv REG_PEND
The regular expression ends,
not at the first NUL,
but just before the character pointed to by the
.Va re_endp
member of the structure pointed to by
.Fa preg .
The
.Va re_endp
member is of type
.Ft "const char *" .
This flag permits inclusion of NULs in the RE;
they are considered ordinary characters.
This is an extension,
compatible with but not specified by
.St -p1003.2 ,
and should be used with
caution in software intended to be portable to other systems.
.El
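.Pp
As a minimal sketch of how these flags combine,
the following function compiles an extended RE with
.Dv REG_NOSUB
and, on request,
.Dv REG_NEWLINE .
The helper name
.Fn re_search
and its return convention are illustrative assumptions,
not part of this library.
.Bd -literal -offset indent
```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: report whether the extended RE in pattern
 * matches anywhere in text.  When newline_sensitive is non-zero,
 * REG_NEWLINE is added so that ^ and $ also anchor at newlines
 * embedded in text.  Returns 1 on a match, -1 if the pattern
 * fails to compile, and 0 otherwise.
 */
int
re_search(const char *pattern, const char *text, int newline_sensitive)
{
	regex_t re;
	int cflags, rc;

	/* REG_NOSUB: only success or failure is needed here. */
	cflags = REG_EXTENDED | REG_NOSUB;
	if (newline_sensitive)
		cflags |= REG_NEWLINE;

	if (regcomp(&re, pattern, cflags) != 0)
		return (-1);
	rc = regexec(&re, text, 0, NULL, 0);
	regfree(&re);
	return (rc == 0 ? 1 : 0);
}
```
.Ed
.Pp
Given the pattern
.Ql ^bar$
and a three-line text whose second line is
.Ql bar ,
this sketch would be expected to report a match only when
.Fa newline_sensitive
is non-zero.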
.Pp
When successful,
.Fn regcomp
returns 0 and fills in the structure pointed to by
.Fa preg .
One member of that structure
(other than
.Va re_endp )
is publicized:
.Va re_nsub ,
of type
.Ft size_t ,
contains the number of parenthesized subexpressions within the RE
(except that the value of this member is undefined if the
.Dv REG_NOSUB
flag was used).
If
.Fn regcomp
fails, it returns a non-zero error code;
see
.Sx DIAGNOSTICS .
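.Pp
The return convention and the
.Va re_nsub
member can be exercised with a short sketch such as the one below.
The helper name
.Fn re_group_count
is hypothetical, and the choice of
.Dv REG_EXTENDED
is only an assumption made for the example.
.Bd -literal -offset indent
```c
#include <regex.h>

/*
 * Hypothetical helper: compile pattern as an extended RE and return
 * the number of parenthesized subexpressions, or -1 if regcomp()
 * reports an error.  REG_NOSUB is deliberately not used, because
 * re_nsub is undefined when that flag is given.
 */
long
re_group_count(const char *pattern)
{
	regex_t re;
	long n;

	if (regcomp(&re, pattern, REG_EXTENDED) != 0)
		return (-1);
	n = (long)re.re_nsub;
	regfree(&re);
	return (n);
}
```
.Ed
.Pp
A caller that wants every subexpression reported could use such a count
to size the
.Fa pmatch
array at
.Va re_nsub
+ 1 entries.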
.Pp
The
.Fn regexec
function
matches the compiled RE pointed to by
.Fa preg
against the
.Fa string ,
subject to the flags in
.Fa eflags ,
and reports results using
.Fa nmatch ,
.Fa pmatch ,
and the returned value.
The RE must have been compiled by a previous invocation of
.Fn regcomp .
The compiled form is not altered during execution of
.Fn regexec ,
so a single compiled RE can be used simultaneously by multiple threads.
.Pp
By default,
the NUL-terminated string pointed to by
.Fa string
is considered to be the text of an entire line, minus any terminating
newline.
The
.Fa eflags
argument is the bitwise OR of zero or more of the following flags:
.Bl -tag -width REG_STARTEND
.It Dv REG_NOTBOL
The first character of
the string
is not the beginning of a line, so the
.Ql ^\&
anchor should not match before it.
This does not affect the behavior of newlines under
.Dv REG_NEWLINE .
.It Dv REG_NOTEOL
The NUL terminating
the string
does not end a line, so the
.Ql $\&
anchor should not match before it.
This does not affect the behavior of newlines under
.Dv REG_NEWLINE .
.It Dv REG_STARTEND
The string is considered to start at
.Fa string
+
.Fa pmatch Ns [0]. Ns Va rm_so
and to have a terminating NUL located at
.Fa string
+
.Fa pmatch Ns [0]. Ns Va rm_eo
(there need not actually be a NUL at that location),
regardless of the value of
.Fa nmatch .
See below for the definition of
.Fa pmatch
and
.Fa nmatch .
This is an extension,
compatible with but not specified by
.St -p1003.2 ,
and should be used with
caution in software intended to be portable to other systems.
Note that a non-zero
.Va rm_so
does not imply
.Dv REG_NOTBOL ;
.Dv REG_STARTEND
affects only the location of the string,
not how it is matched.
.El
.Pp
See
.Xr re_format 7
for a discussion of what is matched in situations where an RE or a
portion thereof could match any of several substrings of
.Fa string .
.Pp
Normally,
.Fn regexec
returns 0 for success and the non-zero code
.Dv REG_NOMATCH
for failure.
Other non-zero error codes may be returned in exceptional situations;
see
.Sx DIAGNOSTICS .
.Pp
If
.Dv REG_NOSUB
was specified in the compilation of the RE,
or if
.Fa nmatch
is 0,
.Fn regexec
ignores the
.Fa pmatch
argument (but see below for the case where
.Dv REG_STARTEND
is specified).
Otherwise,
.Fa pmatch
points to an array of
.Fa nmatch
structures of type
.Ft regmatch_t .
Such a structure has at least the members
.Va rm_so
and
.Va rm_eo ,
both of type
.Ft regoff_t
(a signed arithmetic type at least as large as an
.Ft off_t
and a
.Ft ssize_t ) ,
containing respectively the offset of the first character of a substring
and the offset of the first character after the end of the substring.
Offsets are measured from the beginning of the
.Fa string
argument given to
.Fn regexec .
An empty substring is denoted by equal offsets,
both indicating the character following the empty substring.
.Pp
The 0th member of the
.Fa pmatch
array is filled in to indicate what substring of
.Fa string
was matched by the entire RE.
Remaining members report what substring was matched by parenthesized
subexpressions within the RE;
member
.Va i
reports subexpression
.Va i ,
with subexpressions counted (starting at 1) by the order of their opening
parentheses in the RE, left to right.
Unused entries in the array (corresponding either to subexpressions that
did not participate in the match at all, or to subexpressions that do not
exist in the RE (that is,
.Va i
>
.Fa preg Ns -> Ns Va re_nsub ) )
have both
.Va rm_so
and
.Va rm_eo
set to -1.
If a subexpression participated in the match several times,
the reported substring is the last one it matched.
(Note, as an example in particular, that when the RE
.Ql "(b*)+"
matches
.Ql bbb ,
the parenthesized subexpression matches each of the three
.So Li b Sc Ns s
and then
an infinite number of empty strings following the last
.Ql b ,
so the reported substring is one of the empties.)
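.Pp
The offset conventions described above can be illustrated with a small
sketch that copies the text of one subexpression out of the matched string.
The helper name
.Fn re_group ,
the
.Dv MAX_GROUPS
limit and the return codes are assumptions made for this example only.
.Bd -literal -offset indent
```c
#include <regex.h>
#include <string.h>

#define	MAX_GROUPS	10	/* illustrative limit, not part of the API */

/*
 * Hypothetical helper: match the extended RE pattern against string
 * and copy the text of subexpression number group (0 is the whole
 * match) into out, which holds outsz bytes, truncating if needed.
 * Returns the length of the copied text, -1 if there was no match
 * or an error, and -2 if the entry for group is unused.
 */
int
re_group(const char *pattern, const char *string, size_t group,
    char *out, size_t outsz)
{
	regex_t re;
	regmatch_t pm[MAX_GROUPS];
	size_t len;
	int rc;

	if (group >= MAX_GROUPS || outsz == 0)
		return (-1);
	if (regcomp(&re, pattern, REG_EXTENDED) != 0)
		return (-1);
	rc = regexec(&re, string, MAX_GROUPS, pm, 0);
	regfree(&re);
	if (rc != 0)
		return (-1);

	/* Unused entries have both offsets set to -1. */
	if (pm[group].rm_so == -1)
		return (-2);

	/* Offsets are measured from the start of string. */
	len = (size_t)(pm[group].rm_eo - pm[group].rm_so);
	if (len >= outsz)
		len = outsz - 1;
	memcpy(out, string + pm[group].rm_so, len);
	out[len] = 0;
	return ((int)len);
}
```
.Ed
.Pp
The sketch recompiles the RE on every call for brevity;
code that matches repeatedly would normally compile once and reuse the
.Ft regex_t .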
.Pp
If
.Dv REG_STARTEND
is specified,
.Fa pmatch
must point to at least one
.Ft regmatch_t
(even if
.Fa nmatch
is 0 or
.Dv REG_NOSUB
was specified),
to hold the input offsets for
.Dv REG_STARTEND .
Use for output is still entirely controlled by
.Fa nmatch ;
if
.Fa nmatch
is 0 or
.Dv REG_NOSUB
was specified,
the value of
.Fa pmatch Ns [0]
will not be changed by a successful
.Fn regexec .
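.Pp
One possible use of
.Dv REG_STARTEND
is to search a region of a buffer repeatedly, as sketched below.
The helper name
.Fn re_count_range
is hypothetical, the code assumes that the platform's
.In regex.h
defines
.Dv REG_STARTEND ,
and passing
.Dv REG_NOTBOL
on resumed searches is a design choice of the sketch, not a requirement.
.Bd -literal -offset indent
```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: count non-overlapping matches of the extended
 * RE pattern within the bytes buf[start] through buf[end - 1], using
 * the REG_STARTEND extension so that buf need not be NUL-terminated
 * at end.  Returns the count, or -1 if the pattern fails to compile.
 */
int
re_count_range(const char *pattern, const char *buf, size_t start,
    size_t end)
{
	regex_t re;
	regmatch_t pm[1];
	size_t pos;
	int count, eflags;

	if (regcomp(&re, pattern, REG_EXTENDED) != 0)
		return (-1);

	count = 0;
	pos = start;
	while (pos <= end) {
		/* Input: pm[0] delimits the region to search. */
		pm[0].rm_so = (regoff_t)pos;
		pm[0].rm_eo = (regoff_t)end;
		/* Sketch's choice: a resumed search is not a line start. */
		eflags = REG_STARTEND | (pos > start ? REG_NOTBOL : 0);
		if (regexec(&re, buf, 1, pm, eflags) != 0)
			break;
		count++;
		/* Output: offsets are still relative to buf. */
		pos = (size_t)pm[0].rm_eo;
		if (pm[0].rm_eo == pm[0].rm_so)
			pos++;		/* step past an empty match */
	}
	regfree(&re);
	return (count);
}
```
.Ed
.Pp
Because
.Fa nmatch
is 1 here,
.Fa pmatch Ns [0]
serves both as the input range and as the output of each successful call.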
.Pp
The
.Fn regerror
function
maps a non-zero
.Fa errcode
from either
.Fn regcomp
or
.Fn regexec
to a human-readable, printable message.
If
.Fa preg
is
.No non\- Ns Dv NULL ,
the error code should have arisen from use of
the
.Ft regex_t
pointed to by
.Fa preg ,
and if the error code came from
.Fn regcomp ,
it should have been the result from the most recent
.Fn regcomp
using that
.Ft regex_t .
.Po
The
.Fn regerror
function
may be able to supply a more detailed message using information
from the
.Ft regex_t .
.Pc
The
.Fn regerror
function
places the NUL-terminated message into the buffer pointed to by
.Fa errbuf ,
limiting the length (including the NUL) to at most
.Fa errbuf_size
bytes.
If the whole message will not fit,
as much of it as will fit before the terminating NUL is supplied.
In any case,
the returned value is the size of buffer needed to hold the whole
message (including terminating NUL).
If
.Fa errbuf_size
is 0,
.Fa errbuf
is ignored but the return value is still correct.
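.Pp
Because the return value is the full size required,
a message can be fetched without truncation by calling
.Fn regerror
twice, as in the following sketch.
The helper name
.Fn re_strerror
is hypothetical and the use of
.Xr malloc 3
is merely one way to obtain the buffer.
.Bd -literal -offset indent
```c
#include <regex.h>
#include <stdlib.h>

/*
 * Hypothetical helper: return a malloc()ed copy of the message for
 * errcode, or NULL if memory cannot be allocated.  The caller is
 * expected to free() the result.
 */
char *
re_strerror(int errcode, const regex_t *preg)
{
	size_t need;
	char *msg;

	/* A zero-sized buffer is ignored; only the size comes back. */
	need = regerror(errcode, preg, NULL, 0);
	if (need == 0)
		return (NULL);
	msg = malloc(need);
	if (msg == NULL)
		return (NULL);
	/* The second call fills the buffer, including the NUL. */
	(void)regerror(errcode, preg, msg, need);
	return (msg);
}
```
.Ed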
.Pp
If the
.Fa errcode
given to
.Fn regerror
is first ORed with
.Dv REG_ITOA ,
the
.Dq message
that results is the printable name of the error code,
e.g.\&
.Dq Dv REG_NOMATCH ,
rather than an explanation thereof.
If
.Fa errcode
is
.Dv REG_ATOI ,
then
.Fa preg
shall be
.No non\- Ns Dv NULL
and the
.Va re_endp
member of the structure it points to
must point to the printable name of an error code;
in this case, the result in
.Fa errbuf
is the decimal digits of
the numeric value of the error code
(0 if the name is not recognized).
.Dv REG_ITOA
and
.Dv REG_ATOI
are intended primarily as debugging facilities;
they are extensions,
compatible with but not specified by
.St -p1003.2 ,
and should be used with
caution in software intended to be portable to other systems.
Be warned also that they are considered experimental and changes are possible.
.Pp
The
.Fn regfree
function
frees any dynamically-allocated storage associated with the compiled RE
pointed to by
.Fa preg .
The remaining
.Ft regex_t
is no longer a valid compiled RE
and the effect of supplying it to
.Fn regexec
or
.Fn regerror
is undefined.
.Pp
None of these functions references global variables except for tables
of constants;
all are safe for use from multiple threads if the arguments are safe.
.Sh IMPLEMENTATION CHOICES
There are a number of decisions that
.St -p1003.2
leaves up to the implementor,
either by explicitly saying
.Dq undefined
or by virtue of them being
forbidden by the RE grammar.
This implementation treats them as follows.
.Pp
See
.Xr re_format 7
for a discussion of the definition of case-independent matching.
.Pp
There is no particular limit on the length of REs,
except insofar as memory is limited.
Memory usage is approximately linear in RE size, and largely insensitive
to RE complexity, except for bounded repetitions.
See
.Sx BUGS
for one short RE using them
that will run almost any system out of memory.
.Pp
A backslashed character other than one specifically given a magic meaning
by
.St -p1003.2
(such magic meanings occur only in obsolete
.Bq Dq basic
REs)
is taken as an ordinary character.
.Pp
Any unmatched
.Ql [\&
is a
.Dv REG_EBRACK
error.
.Pp
Equivalence classes cannot begin or end bracket-expression ranges.
The endpoint of one range cannot begin another.
.Pp
.Dv RE_DUP_MAX ,
the limit on repetition counts in bounded repetitions, is 255.
.Pp
A repetition operator
.Ql ( ?\& ,
.Ql *\& ,
.Ql +\& ,
or bounds)
cannot follow another
repetition operator.
A repetition operator cannot begin an expression or subexpression
or follow
.Ql ^\&
or
.Ql |\& .
.Pp
.Ql |\&
cannot appear first or last in a (sub)expression or after another
.Ql |\& ,
i.e., an operand of
.Ql |\&
cannot be an empty subexpression.
An empty parenthesized subexpression,
.Ql "()" ,
is legal and matches an
empty (sub)string.
An empty string is not a legal RE.
.Pp
A
.Ql {\&
followed by a digit is considered the beginning of bounds for a
bounded repetition, which must then follow the syntax for bounds.
A
.Ql {\&
.Em not
followed by a digit is considered an ordinary character.
.Pp
.Ql ^\&
and
.Ql $\&
beginning and ending subexpressions in obsolete
.Pq Dq basic
REs are anchors, not ordinary characters.
.Sh DIAGNOSTICS
Non-zero error codes from
.Fn regcomp
and
.Fn regexec
include the following:
.Pp
.Bl -tag -width REG_ECOLLATE -compact
.It Dv REG_NOMATCH
The
.Fn regexec
function
failed to match
.It Dv REG_BADPAT
invalid regular expression
.It Dv REG_ECOLLATE
invalid collating element
.It Dv REG_ECTYPE
invalid character class
.It Dv REG_EESCAPE
.Ql \e
applied to unescapable character
.It Dv REG_ESUBREG
invalid backreference number
.It Dv REG_EBRACK
brackets
.Ql "[ ]"
not balanced
.It Dv REG_EPAREN
parentheses
.Ql "( )"
not balanced
.It Dv REG_EBRACE
braces
.Ql "{ }"
not balanced
.It Dv REG_BADBR
invalid repetition count(s) in
.Ql "{ }"
.It Dv REG_ERANGE
invalid character range in
.Ql "[ ]"
.It Dv REG_ESPACE
ran out of memory
.It Dv REG_BADRPT
.Ql ?\& ,
.Ql *\& ,
or
.Ql +\&
operand invalid
.It Dv REG_EMPTY
empty (sub)expression
.It Dv REG_ASSERT
cannot happen - you found a bug
.It Dv REG_INVARG
invalid argument, e.g.\& negative-length string
.It Dv REG_ILLSEQ
illegal byte sequence (bad multibyte character)
.El
.Sh SEE ALSO
.Xr grep 1 ,
.Xr re_format 7
.Pp
.St -p1003.2 ,
sections 2.8 (Regular Expression Notation)
and
B.5 (C Binding for Regular Expression Matching).
.Sh HISTORY
Originally written by
.An Henry Spencer .
Altered for inclusion in the
.Bx 4.4
distribution.
.Sh BUGS
This is an alpha release with known defects.
Please report problems.
.Pp
The back-reference code is subtle and doubts linger about its correctness
in complex cases.
.Pp
The performance of the
.Fn regexec
function
is poor.
This will improve with later releases.
The
.Fa nmatch
argument
exceeding 0 is expensive;
.Fa nmatch
exceeding 1 is worse.
The
.Fn regexec
function
is largely insensitive to RE complexity
.Em except
that back
references are massively expensive.
RE length does matter; in particular, there is a strong speed bonus
for keeping RE length under about 30 characters,
with most special characters counting roughly double.
.Pp
The
.Fn regcomp
function
implements bounded repetitions by macro expansion,
which is costly in time and space if counts are large
or bounded repetitions are nested.
An RE like, say,
.Ql "((((a{1,100}){1,100}){1,100}){1,100}){1,100}"
will (eventually) run almost any existing machine out of swap space.
.Pp
There are suspected problems with response to obscure error conditions.
Notably,
certain kinds of internal overflow,
produced only by truly enormous REs or by multiply nested bounded repetitions,
are probably not handled well.
.Pp
Due to a mistake in
.St -p1003.2 ,
things like
.Ql "a)b"
are legal REs because
.Ql )\&
is
a special character only in the presence of a previous unmatched
.Ql (\& .
This cannot be fixed until the spec is fixed.
.Pp
The standard's definition of back references is vague.
For example, does
.Ql "a\e(\e(b\e)*\e2\e)*d"
match
.Ql "abbbd" ?
Until the standard is clarified,
behavior in such cases should not be relied on.
.Pp
The implementation of word-boundary matching is a bit of a kludge,
and bugs may lurk in combinations of word-boundary matching and anchoring.
.Pp
Word-boundary matching does not work properly in multibyte locales.
730