.\" Copyright (c) 1992, 1993, 1994 Henry Spencer.
.\" Copyright (c) 1992, 1993, 1994
.\"	The Regents of the University of California.  All rights reserved.
.\"
.\" This code is derived from software contributed to Berkeley by
.\" Henry Spencer.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\" 3. Neither the name of the University nor the names of its contributors
.\"    may be used to endorse or promote products derived from this software
.\"    without specific prior written permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\"	@(#)regex.3	8.4 (Berkeley) 3/20/94
.\"
.Dd April 15, 2017
.Dt REGEX 3
.Os
.Sh NAME
.Nm regcomp ,
.Nm regexec ,
.Nm regerror ,
.Nm regfree
.Nd regular-expression library
.Sh LIBRARY
.Lb libc
.Sh SYNOPSIS
.In regex.h
.Ft int
.Fo regcomp
.Fa "regex_t * restrict preg" "const char * restrict pattern" "int cflags"
.Fc
.Ft int
.Fo regexec
.Fa "const regex_t * restrict preg" "const char * restrict string"
.Fa "size_t nmatch" "regmatch_t pmatch[restrict]" "int eflags"
.Fc
.Ft size_t
.Fo regerror
.Fa "int errcode" "const regex_t * restrict preg"
.Fa "char * restrict errbuf" "size_t errbuf_size"
.Fc
.Ft void
.Fn regfree "regex_t *preg"
.Sh DESCRIPTION
These routines implement
.St -p1003.2
regular expressions
.Pq Do RE Dc Ns s ;
see
.Xr re_format 7 .
The
.Fn regcomp
function
compiles an RE written as a string into an internal form,
.Fn regexec
matches that internal form against a string and reports results,
.Fn regerror
transforms error codes from either into human-readable messages,
and
.Fn regfree
frees any dynamically-allocated storage used by the internal form
of an RE.
.Pp
The header
.In regex.h
declares two structure types,
.Ft regex_t
and
.Ft regmatch_t ,
the former for compiled internal forms and the latter for match reporting.
It also declares the four functions,
a type
.Ft regoff_t ,
and a number of constants with names starting with
.Dq Dv REG_ .
.Pp
The
.Fn regcomp
function
compiles the regular expression contained in the
.Fa pattern
string,
subject to the flags in
.Fa cflags ,
and places the results in the
.Ft regex_t
structure pointed to by
.Fa preg .
The
.Fa cflags
argument
is the bitwise OR of zero or more of the following flags:
.Bl -tag -width REG_EXTENDED
.It Dv REG_EXTENDED
Compile modern
.Pq Dq extended
REs,
rather than the obsolete
.Pq Dq basic
REs that
are the default.
.It Dv REG_BASIC
This is a synonym for 0,
provided as a counterpart to
.Dv REG_EXTENDED
to improve readability.
.It Dv REG_NOSPEC
Compile with recognition of all special characters turned off.
All characters are thus considered ordinary,
so the
.Dq RE
is a literal string.
This is an extension,
compatible with but not specified by
.St -p1003.2 ,
and should be used with
caution in software intended to be portable to other systems.
.Dv REG_EXTENDED
and
.Dv REG_NOSPEC
may not be used
in the same call to
.Fn regcomp .
.It Dv REG_ICASE
Compile for matching that ignores upper/lower case distinctions.
See
.Xr re_format 7 .
.It Dv REG_NOSUB
Compile for matching that need only report success or failure,
not what was matched.
.It Dv REG_NEWLINE
Compile for newline-sensitive matching.
By default, newline is a completely ordinary character with no special
meaning in either REs or strings.
With this flag,
.Ql [^
bracket expressions and
.Ql .\&
never match newline,
a
.Ql ^\&
anchor matches the null string after any newline in the string
in addition to its normal function,
and the
.Ql $\&
anchor matches the null string before any newline in the
string in addition to its normal function.
.It Dv REG_PEND
The regular expression ends,
not at the first NUL,
but just before the character pointed to by the
.Va re_endp
member of the structure pointed to by
.Fa preg .
The
.Va re_endp
member is of type
.Ft "const char *" .
This flag permits inclusion of NULs in the RE;
they are considered ordinary characters.
This is an extension,
compatible with but not specified by
.St -p1003.2 ,
and should be used with
caution in software intended to be portable to other systems.
.It Dv REG_POSIX
Compile only
.St -p1003.2
compliant expressions.
This flag has no effect unless linking against
.Nm libregex .
This is an extension,
compatible with but not specified by
.St -p1003.2 ,
and should be used with
caution in software intended to be portable to other systems.
.El
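.Pp
The effect of several of these flags can be illustrated with a small sketch.
The helper
.Fn re_matches
is hypothetical and not part of the library.
It uses only the portable flags, always adds
.Dv REG_NOSUB
because it reports nothing but success or failure,
and expects the caller to pass 0 where this page would write
.Dv REG_BASIC .

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: compile `pattern` with `cflags`, test it against
 * `text`, and release the compiled form.
 * Returns 1 on a match, 0 on no match, -1 on any other error.
 */
int
re_matches(const char *pattern, int cflags, const char *text)
{
	regex_t re;
	int rc;

	assert(pattern != NULL && text != NULL);
	/* REG_NOSUB: only success or failure is wanted, not offsets. */
	if (regcomp(&re, pattern, cflags | REG_NOSUB) != 0)
		return (-1);
	rc = regexec(&re, text, 0, NULL, 0);
	regfree(&re);
	if (rc == 0)
		return (1);
	return (rc == REG_NOMATCH ? 0 : -1);
}
```

With this helper, the RE
.Ql ^bar$
applied to the three-line string
.Ql foo\enbar\enbaz
is expected to match only when
.Dv REG_NEWLINE
is included in
.Fa cflags ,
and
.Ql a+
matches
.Ql aaa
as an extended RE but not as a basic one.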
.Pp
When successful,
.Fn regcomp
returns 0 and fills in the structure pointed to by
.Fa preg .
One member of that structure
(other than
.Va re_endp )
is publicized:
.Va re_nsub ,
of type
.Ft size_t ,
contains the number of parenthesized subexpressions within the RE
(except that the value of this member is undefined if the
.Dv REG_NOSUB
flag was used).
If
.Fn regcomp
fails, it returns a non-zero error code;
see
.Sx DIAGNOSTICS .
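.Pp
As a minimal sketch of the success and failure paths,
the hypothetical helper
.Fn count_groups
below compiles an extended RE and reads
.Va re_nsub .
It assumes that a
.Ft regex_t
left behind by a failed
.Fn regcomp
need not be passed to
.Fn regfree .

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: return the number of parenthesized subexpressions
 * in the extended RE `pattern`, or -1 if it does not compile.
 */
long
count_groups(const char *pattern)
{
	regex_t re;
	long n;

	assert(pattern != NULL);
	/* REG_NOSUB is deliberately absent: it leaves re_nsub undefined. */
	if (regcomp(&re, pattern, REG_EXTENDED) != 0)
		return (-1);	/* assumption: nothing to free on failure */
	n = (long)re.re_nsub;
	regfree(&re);
	return (n);
}
```

A caller might use the result to size the
.Fa pmatch
array, which needs one more entry than the value returned here
so that entry 0 can hold the match for the entire RE.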
.Pp
The
.Fn regexec
function
matches the compiled RE pointed to by
.Fa preg
against the
.Fa string ,
subject to the flags in
.Fa eflags ,
and reports results using
.Fa nmatch ,
.Fa pmatch ,
and the returned value.
The RE must have been compiled by a previous invocation of
.Fn regcomp .
The compiled form is not altered during execution of
.Fn regexec ,
so a single compiled RE can be used simultaneously by multiple threads.
.Pp
By default,
the NUL-terminated string pointed to by
.Fa string
is considered to be the text of an entire line, minus any terminating
newline.
The
.Fa eflags
argument is the bitwise OR of zero or more of the following flags:
.Bl -tag -width REG_STARTEND
.It Dv REG_NOTBOL
The first character of the string is treated as the continuation
of a line.
This means that the anchors
.Ql ^\& ,
.Ql [[:<:]] ,
and
.Ql \e<
do not match before it; but see
.Dv REG_STARTEND
below.
This does not affect the behavior of newlines under
.Dv REG_NEWLINE .
.It Dv REG_NOTEOL
The NUL terminating
the string
does not end a line, so the
.Ql $\&
anchor does not match before it.
This does not affect the behavior of newlines under
.Dv REG_NEWLINE .
.It Dv REG_STARTEND
The string is considered to start at
.Fa string No +
.Fa pmatch Ns [0]. Ns Fa rm_so
and to end before the byte located at
.Fa string No +
.Fa pmatch Ns [0]. Ns Fa rm_eo ,
regardless of the value of
.Fa nmatch .
See below for the definition of
.Fa pmatch
and
.Fa nmatch .
This is an extension,
compatible with but not specified by
.St -p1003.2 ,
and should be used with
caution in software intended to be portable to other systems.
.Pp
Without
.Dv REG_NOTBOL ,
the position
.Fa rm_so
is considered the beginning of a line, such that
.Ql ^
matches before it, and the beginning of a word if there is a word
character at this position, such that
.Ql [[:<:]]
and
.Ql \e<
match before it.
.Pp
With
.Dv REG_NOTBOL ,
the character at position
.Fa rm_so
is treated as the continuation of a line, and if
.Fa rm_so
is greater than 0, the preceding character is taken into consideration.
If the preceding character is a newline and the regular expression was compiled
with
.Dv REG_NEWLINE ,
.Ql ^
matches before the string; if the preceding character is not a word character
but the string starts with a word character,
.Ql [[:<:]]
and
.Ql \e<
match before the string.
.El
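.Pp
One common use of
.Dv REG_NOTBOL
is to resume a search in the middle of a string.
The sketch below counts non-overlapping matches by restarting
.Fn regexec
just past each match and setting
.Dv REG_NOTBOL
on every call after the first, so that
.Ql ^
cannot match at a restart point.
The helper name
.Fn count_matches
and the choice of extended REs are assumptions made for illustration.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: count non-overlapping matches of the extended RE
 * `pattern` in `text`.  Returns -1 if the RE does not compile.
 */
int
count_matches(const char *pattern, const char *text)
{
	regex_t re;
	regmatch_t m;
	const char *p = text;
	int count = 0;
	int eflags = 0;

	assert(pattern != NULL && text != NULL);
	if (regcomp(&re, pattern, REG_EXTENDED) != 0)
		return (-1);
	while (regexec(&re, p, 1, &m, eflags) == 0) {
		count++;
		if (m.rm_eo == m.rm_so) {
			/* Empty match: step over one character, if any. */
			if (p[m.rm_eo] == '\0')
				break;
			p += m.rm_eo + 1;
		} else {
			p += m.rm_eo;
		}
		/* Later calls no longer start at the beginning of a line. */
		eflags = REG_NOTBOL;
	}
	regfree(&re);
	return (count);
}
```

Because the restart pointer hides the preceding character from
.Fn regexec ,
this approach gives no word-boundary context at restart points;
the
.Dv REG_STARTEND
behavior described above is the alternative where that matters.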
.Pp
See
.Xr re_format 7
for a discussion of what is matched in situations where an RE or a
portion thereof could match any of several substrings of
.Fa string .
.Pp
Normally,
.Fn regexec
returns 0 for success and the non-zero code
.Dv REG_NOMATCH
for failure.
Other non-zero error codes may be returned in exceptional situations;
see
.Sx DIAGNOSTICS .
.Pp
If
.Dv REG_NOSUB
was specified in the compilation of the RE,
or if
.Fa nmatch
is 0,
.Fn regexec
ignores the
.Fa pmatch
argument (but see below for the case where
.Dv REG_STARTEND
is specified).
Otherwise,
.Fa pmatch
points to an array of
.Fa nmatch
structures of type
.Ft regmatch_t .
Such a structure has at least the members
.Va rm_so
and
.Va rm_eo ,
both of type
.Ft regoff_t
(a signed arithmetic type at least as large as an
.Ft off_t
and a
.Ft ssize_t ) ,
containing respectively the offset of the first character of a substring
and the offset of the first character after the end of the substring.
Offsets are measured from the beginning of the
.Fa string
argument given to
.Fn regexec .
An empty substring is denoted by equal offsets,
both indicating the character following the empty substring.
.Pp
The 0th member of the
.Fa pmatch
array is filled in to indicate what substring of
.Fa string
was matched by the entire RE.
Remaining members report what substring was matched by parenthesized
subexpressions within the RE;
member
.Va i
reports subexpression
.Va i ,
with subexpressions counted (starting at 1) by the order of their opening
parentheses in the RE, left to right.
Unused entries in the array (corresponding either to subexpressions that
did not participate in the match at all, or to subexpressions that do not
exist in the RE (that is,
.Va i
>
.Fa preg Ns -> Ns Va re_nsub ) )
have both
.Va rm_so
and
.Va rm_eo
set to -1.
If a subexpression participated in the match several times,
the reported substring is the last one it matched.
(Note in particular that when the RE
.Ql "(b*)+"
matches
.Ql bbb ,
the parenthesized subexpression matches each of the three
.So Li b Sc Ns s
and then
an infinite number of empty strings following the last
.Ql b ,
so the reported substring is one of the empties.)
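.Pp
The offsets in
.Fa pmatch
can be turned into substrings as in the following sketch.
The helper
.Fn extract_group
and the fixed limit
.Dv MAX_GROUPS
are hypothetical, chosen only to keep the example short;
a real caller could size the array from
.Va re_nsub
instead.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <string.h>

#define MAX_GROUPS 10	/* arbitrary cap chosen for this sketch */

/*
 * Hypothetical helper: match the extended RE `pattern` against `text` and
 * copy the substring for subexpression `group` (0 = entire RE) into `buf`,
 * truncating if needed.  Returns the number of bytes copied, or -1 if
 * there was no match, no such group, or the group did not participate.
 */
long
extract_group(const char *pattern, const char *text, size_t group,
    char *buf, size_t bufsize)
{
	regex_t re;
	regmatch_t pm[MAX_GROUPS];
	long len = -1;

	assert(buf != NULL && bufsize > 0);
	if (regcomp(&re, pattern, REG_EXTENDED) != 0)
		return (-1);
	if (group < MAX_GROUPS && group <= re.re_nsub &&
	    regexec(&re, text, MAX_GROUPS, pm, 0) == 0 &&
	    pm[group].rm_so != -1) {
		/* rm_eo is the offset just past the end of the substring. */
		size_t n = (size_t)(pm[group].rm_eo - pm[group].rm_so);

		if (n >= bufsize)
			n = bufsize - 1;
		memcpy(buf, text + pm[group].rm_so, n);
		buf[n] = '\0';
		len = (long)n;
	}
	regfree(&re);
	return (len);
}
```

For the RE
.Ql (a)|(b)
matched against
.Ql b ,
subexpression 1 does not participate, so its entry holds -1 in both members
and the helper reports failure for group 1 while succeeding for group 2.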
.Pp
If
.Dv REG_STARTEND
is specified,
.Fa pmatch
must point to at least one
.Ft regmatch_t
(even if
.Fa nmatch
is 0 or
.Dv REG_NOSUB
was specified),
to hold the input offsets for
.Dv REG_STARTEND .
Use for output is still entirely controlled by
.Fa nmatch ;
if
.Fa nmatch
is 0 or
.Dv REG_NOSUB
was specified,
the value of
.Fa pmatch Ns [0]
will not be changed by a successful
.Fn regexec .
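.Pp
The double role of
.Fa pmatch Ns [0]
under
.Dv REG_STARTEND
can be sketched as follows.
The helper
.Fn search_range
is hypothetical, and the code assumes a system whose
.In regex.h
defines the
.Dv REG_STARTEND
extension.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: search only bytes [start, end) of `buf` for the
 * extended RE `pattern`.  On success returns 0 and stores the match
 * offsets, measured from `buf`, in *so and *eo; otherwise returns the
 * non-zero code from regcomp() or regexec().
 */
int
search_range(const char *pattern, const char *buf, size_t start, size_t end,
    size_t *so, size_t *eo)
{
	regex_t re;
	regmatch_t pm[1];
	int rc;

	assert(start <= end && so != NULL && eo != NULL);
	if ((rc = regcomp(&re, pattern, REG_EXTENDED)) != 0)
		return (rc);
	/* Input: the range to search.  Output: the match, in the same slot. */
	pm[0].rm_so = (regoff_t)start;
	pm[0].rm_eo = (regoff_t)end;
	rc = regexec(&re, buf, 1, pm, REG_STARTEND);
	if (rc == 0) {
		*so = (size_t)pm[0].rm_so;
		*eo = (size_t)pm[0].rm_eo;
	}
	regfree(&re);
	return (rc);
}
```

Because
.Fa nmatch
is 1 here, a successful call overwrites the input offsets with the
offsets of the match.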
.Pp
The
.Fn regerror
function
maps a non-zero
.Fa errcode
from either
.Fn regcomp
or
.Fn regexec
to a human-readable, printable message.
If
.Fa preg
is
.No non\- Ns Dv NULL ,
the error code should have arisen from use of
the
.Ft regex_t
pointed to by
.Fa preg ,
and if the error code came from
.Fn regcomp ,
it should have been the result from the most recent
.Fn regcomp
using that
.Ft regex_t .
.Po
The
.Fn regerror
function
may be able to supply a more detailed message using information
from the
.Ft regex_t .
.Pc
The
.Fn regerror
function
places the NUL-terminated message into the buffer pointed to by
.Fa errbuf ,
limiting the length (including the NUL) to at most
.Fa errbuf_size
bytes.
If the whole message will not fit,
as much of it as will fit before the terminating NUL is supplied.
In any case,
the returned value is the size of buffer needed to hold the whole
message (including terminating NUL).
If
.Fa errbuf_size
is 0,
.Fa errbuf
is ignored but the return value is still correct.
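.Pp
The size-query behavior suggests a two-call idiom:
ask for the required size with an
.Fa errbuf_size
of 0, allocate, then call again to fetch the message.
The helper
.Fn regex_message
below is a hypothetical sketch of that idiom, not a library function.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical helper: return a malloc()ed copy of the regerror() message
 * for `errcode`, or NULL if memory is exhausted.  The caller must free()
 * the result.
 */
char *
regex_message(int errcode, const regex_t *preg)
{
	size_t need;
	char *msg;

	assert(errcode != 0);
	/* First call: errbuf is ignored, only the needed size is returned. */
	need = regerror(errcode, preg, NULL, 0);
	msg = malloc(need);
	if (msg != NULL)
		(void)regerror(errcode, preg, msg, need);	/* fills buffer */
	return (msg);
}
```

A caller would typically pass the return value of a failed
.Fn regcomp
together with the same
.Ft regex_t ,
print the message, and then free it.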
.Pp
If the
.Fa errcode
given to
.Fn regerror
is first ORed with
.Dv REG_ITOA ,
the
.Dq message
that results is the printable name of the error code,
e.g.\&
.Dq Dv REG_NOMATCH ,
rather than an explanation thereof.
If
.Fa errcode
is
.Dv REG_ATOI ,
then
.Fa preg
shall be
.No non\- Ns Dv NULL
and the
.Va re_endp
member of the structure it points to
must point to the printable name of an error code;
in this case, the result in
.Fa errbuf
is the decimal digits of
the numeric value of the error code
(0 if the name is not recognized).
.Dv REG_ITOA
and
.Dv REG_ATOI
are intended primarily as debugging facilities;
they are extensions,
compatible with but not specified by
.St -p1003.2 ,
and should be used with
caution in software intended to be portable to other systems.
Be warned also that they are considered experimental and changes are possible.
.Pp
The
.Fn regfree
function
frees any dynamically-allocated storage associated with the compiled RE
pointed to by
.Fa preg .
The remaining
.Ft regex_t
is no longer a valid compiled RE
and the effect of supplying it to
.Fn regexec
or
.Fn regerror
is undefined.
.Pp
None of these functions references global variables except for tables
of constants;
all are safe for use from multiple threads if the arguments are safe.
.Sh IMPLEMENTATION CHOICES
There are a number of decisions that
.St -p1003.2
leaves up to the implementor,
either by explicitly saying
.Dq undefined
or by virtue of them being
forbidden by the RE grammar.
This implementation treats them as follows.
.Pp
See
.Xr re_format 7
for a discussion of the definition of case-independent matching.
.Pp
There is no particular limit on the length of REs,
except insofar as memory is limited.
Memory usage is approximately linear in RE size, and largely insensitive
to RE complexity, except for bounded repetitions.
See
.Sx BUGS
for one short RE using them
that will run almost any system out of memory.
.Pp
A backslashed character other than one specifically given a magic meaning
by
.St -p1003.2
(such magic meanings occur only in obsolete
.Bq Dq basic
REs)
is taken as an ordinary character.
.Pp
Any unmatched
.Ql [\&
is a
.Dv REG_EBRACK
error.
.Pp
Equivalence classes cannot begin or end bracket-expression ranges.
The endpoint of one range cannot begin another.
.Pp
.Dv RE_DUP_MAX ,
the limit on repetition counts in bounded repetitions, is 255.
.Pp
A repetition operator
.Ql ( ?\& ,
.Ql *\& ,
.Ql +\& ,
or bounds)
cannot follow another
repetition operator.
A repetition operator cannot begin an expression or subexpression
or follow
.Ql ^\&
or
.Ql |\& .
.Pp
.Ql |\&
cannot appear first or last in a (sub)expression or after another
.Ql |\& ,
i.e., an operand of
.Ql |\&
cannot be an empty subexpression.
An empty parenthesized subexpression,
.Ql "()" ,
is legal and matches an
empty (sub)string.
An empty string is not a legal RE.
.Pp
A
.Ql {\&
followed by a digit is considered the beginning of bounds for a
bounded repetition, which must then follow the syntax for bounds.
A
.Ql {\&
.Em not
followed by a digit is considered an ordinary character.
.Pp
.Ql ^\&
and
.Ql $\&
beginning and ending subexpressions in obsolete
.Pq Dq basic
REs are anchors, not ordinary characters.
.Sh DIAGNOSTICS
Non-zero error codes from
.Fn regcomp
and
.Fn regexec
include the following:
.Pp
.Bl -tag -width REG_ECOLLATE -compact
.It Dv REG_NOMATCH
The
.Fn regexec
function
failed to match
.It Dv REG_BADPAT
invalid regular expression
.It Dv REG_ECOLLATE
invalid collating element
.It Dv REG_ECTYPE
invalid character class
.It Dv REG_EESCAPE
.Ql \e
applied to unescapable character
.It Dv REG_ESUBREG
invalid backreference number
.It Dv REG_EBRACK
brackets
.Ql "[ ]"
not balanced
.It Dv REG_EPAREN
parentheses
.Ql "( )"
not balanced
.It Dv REG_EBRACE
braces
.Ql "{ }"
not balanced
.It Dv REG_BADBR
invalid repetition count(s) in
.Ql "{ }"
.It Dv REG_ERANGE
invalid character range in
.Ql "[ ]"
.It Dv REG_ESPACE
ran out of memory
.It Dv REG_BADRPT
.Ql ?\& ,
.Ql *\& ,
or
.Ql +\&
operand invalid
.It Dv REG_EMPTY
empty (sub)expression
.It Dv REG_ASSERT
cannot happen - you found a bug
.It Dv REG_INVARG
invalid argument, e.g.\& negative-length string
.It Dv REG_ILLSEQ
illegal byte sequence (bad multibyte character)
.El
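.Pp
A program can compare the value returned by
.Fn regcomp
against these constants directly.
The hypothetical helper
.Fn compile_status
below simply exposes that value for an extended RE;
which code a given malformed RE produces may differ between implementations,
so the pairings mentioned after the code are expectations
rather than guarantees.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: return regcomp()'s status for the extended RE
 * `pattern`, releasing the compiled form when compilation succeeds.
 */
int
compile_status(const char *pattern)
{
	regex_t re;
	int rc;

	assert(pattern != NULL);
	rc = regcomp(&re, pattern, REG_EXTENDED);
	if (rc == 0)
		regfree(&re);
	return (rc);
}
```

For instance, an unterminated bracket expression such as
.Ql [a
is expected to yield
.Dv REG_EBRACK ,
an unclosed group such as
.Ql (a
.Dv REG_EPAREN ,
and an unknown class name such as
.Ql [[:nope:]]
.Dv REG_ECTYPE .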
.Sh SEE ALSO
.Xr grep 1 ,
.Xr re_format 7
.Pp
.St -p1003.2 ,
sections 2.8 (Regular Expression Notation)
and
B.5 (C Binding for Regular Expression Matching).
.Sh HISTORY
Originally written by
.An Henry Spencer .
Altered for inclusion in the
.Bx 4.4
distribution.
.Sh BUGS
This is an alpha release with known defects.
Please report problems.
.Pp
The back-reference code is subtle and doubts linger about its correctness
in complex cases.
.Pp
The performance of
.Fn regexec
is poor.
This will improve with later releases.
An
.Fa nmatch
argument
greater than 0 is expensive;
.Fa nmatch
greater than 1 is worse.
The
.Fn regexec
function
is largely insensitive to RE complexity
.Em except
that back
references are massively expensive.
RE length does matter; in particular, there is a strong speed bonus
for keeping RE length under about 30 characters,
with most special characters counting roughly double.
.Pp
The
.Fn regcomp
function
implements bounded repetitions by macro expansion,
which is costly in time and space if counts are large
or bounded repetitions are nested.
An RE like, say,
.Ql "((((a{1,100}){1,100}){1,100}){1,100}){1,100}"
will (eventually) run almost any existing machine out of swap space.
.Pp
There are suspected problems with response to obscure error conditions.
Notably,
certain kinds of internal overflow,
produced only by truly enormous REs or by multiply nested bounded repetitions,
are probably not handled well.
.Pp
Due to a mistake in
.St -p1003.2 ,
things like
.Ql "a)b"
are legal REs because
.Ql )\&
is
a special character only in the presence of a previous unmatched
.Ql (\& .
This cannot be fixed until the spec is fixed.
.Pp
The standard's definition of back references is vague.
For example, does
.Ql "a\e(\e(b\e)*\e2\e)*d"
match
.Ql "abbbd" ?
Until the standard is clarified,
behavior in such cases should not be relied on.
.Pp
The implementation of word-boundary matching is a bit of a kludge,
and bugs may lurk in combinations of word-boundary matching and anchoring.
.Pp
Word-boundary matching does not work properly in multibyte locales.
766