xref: /freebsd/lib/libc/regex/regex.3 (revision 1b6c76a2fe091c74f08427e6c870851025a9cf67)
1.\" Copyright (c) 1992, 1993, 1994 Henry Spencer.
2.\" Copyright (c) 1992, 1993, 1994
3.\"	The Regents of the University of California.  All rights reserved.
4.\"
5.\" This code is derived from software contributed to Berkeley by
6.\" Henry Spencer.
7.\"
8.\" Redistribution and use in source and binary forms, with or without
9.\" modification, are permitted provided that the following conditions
10.\" are met:
11.\" 1. Redistributions of source code must retain the above copyright
12.\"    notice, this list of conditions and the following disclaimer.
13.\" 2. Redistributions in binary form must reproduce the above copyright
14.\"    notice, this list of conditions and the following disclaimer in the
15.\"    documentation and/or other materials provided with the distribution.
16.\" 3. All advertising materials mentioning features or use of this software
17.\"    must display the following acknowledgement:
18.\"	This product includes software developed by the University of
19.\"	California, Berkeley and its contributors.
20.\" 4. Neither the name of the University nor the names of its contributors
21.\"    may be used to endorse or promote products derived from this software
22.\"    without specific prior written permission.
23.\"
24.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
25.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
26.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
27.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
28.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
29.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
30.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
31.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
32.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
33.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
34.\" SUCH DAMAGE.
35.\"
36.\"	@(#)regex.3	8.4 (Berkeley) 3/20/94
37.\" $FreeBSD$
38.\"
39.Dd March 20, 1994
40.Dt REGEX 3
41.Os
42.Sh NAME
43.Nm regcomp ,
44.Nm regexec ,
45.Nm regerror ,
46.Nm regfree
47.Nd regular-expression library
48.Sh LIBRARY
49.Lb libc
50.Sh SYNOPSIS
51.Fd "#include <sys/types.h>"
52.Fd "#include <regex.h>"
53.Ft int
54.Fn regcomp "regex_t *preg" "const char *pattern" "int cflags"
55.Ft int
56.Fo regexec
57.Fa "const regex_t *preg" "const char *string"
58.Fa "size_t nmatch" "regmatch_t pmatch[]" "int eflags"
59.Fc
60.Ft size_t
61.Fo regerror
62.Fa "int errcode" "const regex_t *preg"
63.Fa "char *errbuf" "size_t errbuf_size"
64.Fc
65.Ft void
66.Fn regfree "regex_t *preg"
67.Sh DESCRIPTION
68These routines implement
69.St -p1003.2
70regular expressions
71.Pq Do RE Dc Ns s ;
72see
73.Xr re_format 7 .
74.Fn Regcomp
75compiles an RE written as a string into an internal form,
76.Fn regexec
77matches that internal form against a string and reports results,
78.Fn regerror
79transforms error codes from either into human-readable messages,
80and
81.Fn regfree
82frees any dynamically-allocated storage used by the internal form
83of an RE.
84.Pp
85The header
86.Aq Pa regex.h
87declares two structure types,
88.Ft regex_t
89and
90.Ft regmatch_t ,
91the former for compiled internal forms and the latter for match reporting.
92It also declares the four functions,
93a type
94.Ft regoff_t ,
95and a number of constants with names starting with
96.Dq Dv REG_ .
97.Pp
98.Fn Regcomp
99compiles the regular expression contained in the
100.Fa pattern
101string,
102subject to the flags in
103.Fa cflags ,
104and places the results in the
105.Ft regex_t
106structure pointed to by
107.Fa preg .
108.Fa Cflags
109is the bitwise OR of zero or more of the following flags:
110.Bl -tag -width REG_EXTENDED
111.It Dv REG_EXTENDED
112Compile modern
113.Pq Dq extended
114REs,
115rather than the obsolete
116.Pq Dq basic
117REs that
118are the default.
119.It Dv REG_BASIC
120This is a synonym for 0,
121provided as a counterpart to
122.Dv REG_EXTENDED
123to improve readability.
124.It Dv REG_NOSPEC
125Compile with recognition of all special characters turned off.
126All characters are thus considered ordinary,
127so the
128.Dq RE
129is a literal string.
130This is an extension,
131compatible with but not specified by
132.St -p1003.2 ,
133and should be used with
134caution in software intended to be portable to other systems.
135.Dv REG_EXTENDED
136and
137.Dv REG_NOSPEC
138may not be used
139in the same call to
140.Fn regcomp .
141.It Dv REG_ICASE
142Compile for matching that ignores upper/lower case distinctions.
143See
144.Xr re_format 7 .
145.It Dv REG_NOSUB
146Compile for matching that need only report success or failure,
147not what was matched.
148.It Dv REG_NEWLINE
149Compile for newline-sensitive matching.
150By default, newline is a completely ordinary character with no special
151meaning in either REs or strings.
152With this flag,
153.Ql [^
154bracket expressions and
155.Ql .\&
156never match newline,
157a
158.Ql ^\&
159anchor matches the null string after any newline in the string
160in addition to its normal function,
161and the
162.Ql $\&
163anchor matches the null string before any newline in the
164string in addition to its normal function.
165.It Dv REG_PEND
166The regular expression ends,
167not at the first NUL,
168but just before the character pointed to by the
169.Va re_endp
170member of the structure pointed to by
171.Fa preg .
172The
173.Va re_endp
174member is of type
175.Ft "const char *" .
176This flag permits inclusion of NULs in the RE;
177they are considered ordinary characters.
178This is an extension,
179compatible with but not specified by
180.St -p1003.2 ,
181and should be used with
182caution in software intended to be portable to other systems.
183.El
184.Pp
185When successful,
186.Fn regcomp
187returns 0 and fills in the structure pointed to by
188.Fa preg .
189One member of that structure
190(other than
191.Va re_endp )
192is publicized:
193.Va re_nsub ,
194of type
195.Ft size_t ,
196contains the number of parenthesized subexpressions within the RE
197(except that the value of this member is undefined if the
198.Dv REG_NOSUB
199flag was used).
200If
201.Fn regcomp
202fails, it returns a non-zero error code;
203see
204.Sx DIAGNOSTICS .
205.Pp
206.Fn Regexec
207matches the compiled RE pointed to by
208.Fa preg
209against the
210.Fa string ,
211subject to the flags in
212.Fa eflags ,
213and reports results using
214.Fa nmatch ,
215.Fa pmatch ,
216and the returned value.
217The RE must have been compiled by a previous invocation of
218.Fn regcomp .
219The compiled form is not altered during execution of
220.Fn regexec ,
221so a single compiled RE can be used simultaneously by multiple threads.
222.Pp
223By default,
224the NUL-terminated string pointed to by
225.Fa string
226is considered to be the text of an entire line, minus any terminating
227newline.
228The
229.Fa eflags
230argument is the bitwise OR of zero or more of the following flags:
231.Bl -tag -width REG_STARTEND
232.It Dv REG_NOTBOL
233The first character of
234the string
235is not the beginning of a line, so the
236.Ql ^\&
237anchor should not match before it.
238This does not affect the behavior of newlines under
239.Dv REG_NEWLINE .
240.It Dv REG_NOTEOL
241The NUL terminating
242the string
243does not end a line, so the
244.Ql $\&
245anchor should not match before it.
246This does not affect the behavior of newlines under
247.Dv REG_NEWLINE .
248.It Dv REG_STARTEND
249The string is considered to start at
250.Fa string
251+
252.Fa pmatch Ns [0]. Ns Va rm_so
253and to have a terminating NUL located at
254.Fa string
255+
256.Fa pmatch Ns [0]. Ns Va rm_eo
257(there need not actually be a NUL at that location),
258regardless of the value of
259.Fa nmatch .
260See below for the definition of
261.Fa pmatch
262and
263.Fa nmatch .
264This is an extension,
265compatible with but not specified by
266.St -p1003.2 ,
267and should be used with
268caution in software intended to be portable to other systems.
269Note that a non-zero
270.Va rm_so
271does not imply
272.Dv REG_NOTBOL ;
273.Dv REG_STARTEND
274affects only the location of the string,
275not how it is matched.
276.El
277.Pp
278See
279.Xr re_format 7
280for a discussion of what is matched in situations where an RE or a
281portion thereof could match any of several substrings of
282.Fa string .
283.Pp
284Normally,
285.Fn regexec
286returns 0 for success and the non-zero code
287.Dv REG_NOMATCH
288for failure.
289Other non-zero error codes may be returned in exceptional situations;
290see
291.Sx DIAGNOSTICS .
292.Pp
293If
294.Dv REG_NOSUB
295was specified in the compilation of the RE,
296or if
297.Fa nmatch
298is 0,
299.Fn regexec
300ignores the
301.Fa pmatch
302argument (but see below for the case where
303.Dv REG_STARTEND
304is specified).
305Otherwise,
306.Fa pmatch
307points to an array of
308.Fa nmatch
309structures of type
310.Ft regmatch_t .
311Such a structure has at least the members
312.Va rm_so
313and
314.Va rm_eo ,
315both of type
316.Ft regoff_t
317(a signed arithmetic type at least as large as an
318.Ft off_t
319and a
320.Ft ssize_t ) ,
321containing respectively the offset of the first character of a substring
322and the offset of the first character after the end of the substring.
323Offsets are measured from the beginning of the
324.Fa string
325argument given to
326.Fn regexec .
327An empty substring is denoted by equal offsets,
328both indicating the character following the empty substring.
329.Pp
330The 0th member of the
331.Fa pmatch
332array is filled in to indicate what substring of
333.Fa string
334was matched by the entire RE.
335Remaining members report what substring was matched by parenthesized
336subexpressions within the RE;
337member
338.Va i
339reports subexpression
340.Va i ,
341with subexpressions counted (starting at 1) by the order of their opening
342parentheses in the RE, left to right.
343Unused entries in the array (corresponding either to subexpressions that
344did not participate in the match at all, or to subexpressions that do not
345exist in the RE (that is,
346.Va i
347>
348.Fa preg Ns -> Ns Va re_nsub ) )
349have both
350.Va rm_so
351and
352.Va rm_eo
353set to -1.
354If a subexpression participated in the match several times,
355the reported substring is the last one it matched.
356(Note, as an example in particular, that when the RE
357.Ql "(b*)+"
358matches
359.Ql bbb ,
360the parenthesized subexpression matches each of the three
361.So Li b Sc Ns s
362and then
363an infinite number of empty strings following the last
364.Ql b ,
365so the reported substring is one of the empties.)
366.Pp
367If
368.Dv REG_STARTEND
369is specified,
370.Fa pmatch
371must point to at least one
372.Ft regmatch_t
373(even if
374.Fa nmatch
375is 0 or
376.Dv REG_NOSUB
377was specified),
378to hold the input offsets for
379.Dv REG_STARTEND .
380Use for output is still entirely controlled by
381.Fa nmatch ;
382if
383.Fa nmatch
384is 0 or
385.Dv REG_NOSUB
386was specified,
387the value of
388.Fa pmatch Ns [0]
389will not be changed by a successful
390.Fn regexec .
391.Pp
392.Fn Regerror
393maps a non-zero
394.Fa errcode
395from either
396.Fn regcomp
397or
398.Fn regexec
399to a human-readable, printable message.
400If
401.Fa preg
402is
403.No non\- Ns Dv NULL ,
404the error code should have arisen from use of
405the
406.Ft regex_t
407pointed to by
408.Fa preg ,
409and if the error code came from
410.Fn regcomp ,
411it should have been the result from the most recent
412.Fn regcomp
413using that
414.Ft regex_t .
415.No ( Fn Regerror
416may be able to supply a more detailed message using information
417from the
418.Ft regex_t . )
419.Fn Regerror
420places the NUL-terminated message into the buffer pointed to by
421.Fa errbuf ,
422limiting the length (including the NUL) to at most
423.Fa errbuf_size
424bytes.
425If the whole message won't fit,
426as much of it as will fit before the terminating NUL is supplied.
427In any case,
428the returned value is the size of buffer needed to hold the whole
429message (including terminating NUL).
430If
431.Fa errbuf_size
432is 0,
433.Fa errbuf
434is ignored but the return value is still correct.
435.Pp
436If the
437.Fa errcode
438given to
439.Fn regerror
440is first ORed with
441.Dv REG_ITOA ,
442the
443.Dq message
444that results is the printable name of the error code,
445e.g.
446.Dq Dv REG_NOMATCH ,
447rather than an explanation thereof.
448If
449.Fa errcode
450is
451.Dv REG_ATOI ,
452then
453.Fa preg
454shall be
455.No non\- Ns Dv NULL
456and the
457.Va re_endp
458member of the structure it points to
459must point to the printable name of an error code;
460in this case, the result in
461.Fa errbuf
462is the decimal digits of
463the numeric value of the error code
464(0 if the name is not recognized).
465.Dv REG_ITOA
466and
467.Dv REG_ATOI
468are intended primarily as debugging facilities;
469they are extensions,
470compatible with but not specified by
471.St -p1003.2 ,
472and should be used with
473caution in software intended to be portable to other systems.
474Be warned also that they are considered experimental and changes are possible.
475.Pp
476.Fn Regfree
477frees any dynamically-allocated storage associated with the compiled RE
478pointed to by
479.Fa preg .
480The remaining
481.Ft regex_t
482is no longer a valid compiled RE
483and the effect of supplying it to
484.Fn regexec
485or
486.Fn regerror
487is undefined.
488.Pp
489None of these functions references global variables except for tables
490of constants;
491all are safe for use from multiple threads if the arguments are safe.
492.Sh IMPLEMENTATION CHOICES
493There are a number of decisions that
494.St -p1003.2
495leaves up to the implementor,
496either by explicitly saying
497.Dq undefined
498or by virtue of them being
499forbidden by the RE grammar.
500This implementation treats them as follows.
501.Pp
502See
503.Xr re_format 7
504for a discussion of the definition of case-independent matching.
505.Pp
506There is no particular limit on the length of REs,
507except insofar as memory is limited.
508Memory usage is approximately linear in RE size, and largely insensitive
509to RE complexity, except for bounded repetitions.
510See
511.Sx BUGS
512for one short RE using them
513that will run almost any system out of memory.
514.Pp
515A backslashed character other than one specifically given a magic meaning
516by
517.St -p1003.2
518(such magic meanings occur only in obsolete
519.Bq Dq basic
520REs)
521is taken as an ordinary character.
522.Pp
523Any unmatched
524.Ql [\&
525is a
526.Dv REG_EBRACK
527error.
528.Pp
529Equivalence classes cannot begin or end bracket-expression ranges.
530The endpoint of one range cannot begin another.
531.Pp
532.Dv RE_DUP_MAX ,
533the limit on repetition counts in bounded repetitions, is 255.
534.Pp
535A repetition operator
536.Po
537.Ql ?\& ,
538.Ql *\& ,
539.Ql +\& ,
540or bounds
541.Pc
542cannot follow another
543repetition operator.
544A repetition operator cannot begin an expression or subexpression
545or follow
546.Ql ^\&
547or
548.Ql |\& .
549.Pp
550.Ql |\&
551cannot appear first or last in a (sub)expression or after another
552.Ql |\& ,
553i.e. an operand of
554.Ql |\&
555cannot be an empty subexpression.
556An empty parenthesized subexpression,
557.Ql "()" ,
558is legal and matches an
559empty (sub)string.
560An empty string is not a legal RE.
561.Pp
562A
563.Ql {\&
564followed by a digit is considered the beginning of bounds for a
565bounded repetition, which must then follow the syntax for bounds.
566A
567.Ql {\&
568.Em not
569followed by a digit is considered an ordinary character.
570.Pp
571.Ql ^\&
572and
573.Ql $\&
574beginning and ending subexpressions in obsolete
575.Pq Dq basic
576REs are anchors, not ordinary characters.
577.Sh SEE ALSO
578.Xr grep 1 ,
579.Xr re_format 7
580.Pp
581.St -p1003.2 ,
582sections 2.8 (Regular Expression Notation)
583and
584B.5 (C Binding for Regular Expression Matching).
585.Sh DIAGNOSTICS
586Non-zero error codes from
587.Fn regcomp
588and
589.Fn regexec
590include the following:
591.Pp
592.Bl -tag -width REG_ECOLLATE -compact
593.It Dv REG_NOMATCH
594.Fn regexec
595failed to match
596.It Dv REG_BADPAT
597invalid regular expression
598.It Dv REG_ECOLLATE
599invalid collating element
600.It Dv REG_ECTYPE
601invalid character class
602.It Dv REG_EESCAPE
603.Ql \e
604applied to unescapable character
605.It Dv REG_ESUBREG
606invalid backreference number
607.It Dv REG_EBRACK
608brackets
609.Ql "[ ]"
610not balanced
611.It Dv REG_EPAREN
612parentheses
613.Ql "( )"
614not balanced
615.It Dv REG_EBRACE
616braces
617.Ql "{ }"
618not balanced
619.It Dv REG_BADBR
620invalid repetition count(s) in
621.Ql "{ }"
622.It Dv REG_ERANGE
623invalid character range in
624.Ql "[ ]"
625.It Dv REG_ESPACE
626ran out of memory
627.It Dv REG_BADRPT
628.Ql ?\& ,
629.Ql *\& ,
630or
631.Ql +\&
632operand invalid
633.It Dv REG_EMPTY
634empty (sub)expression
635.It Dv REG_ASSERT
636can't happen - you found a bug
637.It Dv REG_INVARG
638invalid argument, e.g. negative-length string
639.El
640.Sh HISTORY
641Originally written by
642.An Henry Spencer .
643Altered for inclusion in the
644.Bx 4.4
645distribution.
646.Sh BUGS
647This is an alpha release with known defects.
648Please report problems.
649.Pp
650The back-reference code is subtle and doubts linger about its correctness
651in complex cases.
652.Pp
653.Fn Regexec
654performance is poor.
655This will improve with later releases.
656.Fa Nmatch
657exceeding 0 is expensive;
658.Fa nmatch
659exceeding 1 is worse.
660.Fn Regexec
661is largely insensitive to RE complexity
662.Em except
663that back
664references are massively expensive.
665RE length does matter; in particular, there is a strong speed bonus
666for keeping RE length under about 30 characters,
667with most special characters counting roughly double.
668.Pp
669.Fn Regcomp
670implements bounded repetitions by macro expansion,
671which is costly in time and space if counts are large
672or bounded repetitions are nested.
673An RE like, say,
674.Ql "((((a{1,100}){1,100}){1,100}){1,100}){1,100}"
675will (eventually) run almost any existing machine out of swap space.
676.Pp
677There are suspected problems with response to obscure error conditions.
678Notably,
679certain kinds of internal overflow,
680produced only by truly enormous REs or by multiply nested bounded repetitions,
681are probably not handled well.
682.Pp
683Due to a mistake in
684.St -p1003.2 ,
685things like
686.Ql "a)b"
687are legal REs because
688.Ql )\&
689is
690a special character only in the presence of a previous unmatched
691.Ql (\& .
692This can't be fixed until the spec is fixed.
693.Pp
694The standard's definition of back references is vague.
695For example, does
696.Ql "a\e(\e(b\e)*\e2\e)*d"
697match
698.Ql "abbbd" ?
699Until the standard is clarified,
700behavior in such cases should not be relied on.
701.Pp
702The implementation of word-boundary matching is a bit of a kludge,
703and bugs may lurk in combinations of word-boundary matching and anchoring.
704