.\" Copyright (c) 1992, 1993, 1994 Henry Spencer.
.\" Copyright (c) 1992, 1993, 1994
.\"	The Regents of the University of California.  All rights reserved.
.\"
.\" This code is derived from software contributed to Berkeley by
.\" Henry Spencer.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\" 3. Neither the name of the University nor the names of its contributors
.\"    may be used to endorse or promote products derived from this software
.\"    without specific prior written permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\"	@(#)regex.3	8.4 (Berkeley) 3/20/94
.\" $FreeBSD$
.\"
.Dd April 15, 2017
.Dt REGEX 3
.Os
.Sh NAME
.Nm regcomp ,
.Nm regexec ,
.Nm regerror ,
.Nm regfree
.Nd regular-expression library
.Sh LIBRARY
.Lb libc
.Sh SYNOPSIS
.In regex.h
.Ft int
.Fo regcomp
.Fa "regex_t * restrict preg" "const char * restrict pattern" "int cflags"
.Fc
.Ft int
.Fo regexec
.Fa "const regex_t * restrict preg" "const char * restrict string"
.Fa "size_t nmatch" "regmatch_t pmatch[restrict]" "int eflags"
.Fc
.Ft size_t
.Fo regerror
.Fa "int errcode" "const regex_t * restrict preg"
.Fa "char * restrict errbuf" "size_t errbuf_size"
.Fc
.Ft void
.Fn regfree "regex_t *preg"
.Sh DESCRIPTION
These routines implement
.St -p1003.2
regular expressions
.Pq Do RE Dc Ns s ;
see
.Xr re_format 7 .
The
.Fn regcomp
function
compiles an RE written as a string into an internal form,
.Fn regexec
matches that internal form against a string and reports results,
.Fn regerror
transforms error codes from either into human-readable messages,
and
.Fn regfree
frees any dynamically-allocated storage used by the internal form
of an RE.
.Pp
The header
.In regex.h
declares two structure types,
.Ft regex_t
and
.Ft regmatch_t ,
the former for compiled internal forms and the latter for match reporting.
It also declares the four functions,
a type
.Ft regoff_t ,
and a number of constants with names starting with
.Dq Dv REG_ .
.Pp
The
.Fn regcomp
function
compiles the regular expression contained in the
.Fa pattern
string,
subject to the flags in
.Fa cflags ,
and places the results in the
.Ft regex_t
structure pointed to by
.Fa preg .
The
.Fa cflags
argument
is the bitwise OR of zero or more of the following flags:
.Bl -tag -width REG_EXTENDED
.It Dv REG_EXTENDED
Compile modern
.Pq Dq extended
REs,
rather than the obsolete
.Pq Dq basic
REs that
are the default.
.It Dv REG_BASIC
This is a synonym for 0,
provided as a counterpart to
.Dv REG_EXTENDED
to improve readability.
.It Dv REG_NOSPEC
Compile with recognition of all special characters turned off.
All characters are thus considered ordinary,
so the
.Dq RE
is a literal string.
This is an extension,
compatible with but not specified by
.St -p1003.2 ,
and should be used with
caution in software intended to be portable to other systems.
.Dv REG_EXTENDED
and
.Dv REG_NOSPEC
may not be used
in the same call to
.Fn regcomp .
.It Dv REG_ICASE
Compile for matching that ignores upper/lower case distinctions.
See
.Xr re_format 7 .
.It Dv REG_NOSUB
Compile for matching that need only report success or failure,
not what was matched.
.It Dv REG_NEWLINE
Compile for newline-sensitive matching.
By default, newline is a completely ordinary character with no special
meaning in either REs or strings.
With this flag,
.Ql [^
bracket expressions and
.Ql .\&
never match newline,
a
.Ql ^\&
anchor matches the null string after any newline in the string
in addition to its normal function,
and the
.Ql $\&
anchor matches the null string before any newline in the
string in addition to its normal function.
.It Dv REG_PEND
The regular expression ends,
not at the first NUL,
but just before the character pointed to by the
.Va re_endp
member of the structure pointed to by
.Fa preg .
The
.Va re_endp
member is of type
.Ft "const char *" .
This flag permits inclusion of NULs in the RE;
they are considered ordinary characters.
This is an extension,
compatible with but not specified by
.St -p1003.2 ,
and should be used with
caution in software intended to be portable to other systems.
.It Dv REG_POSIX
Compile only
.St -p1003.2
compliant expressions.
This flag has no effect unless linking against
.Nm libregex .
This is an extension,
compatible with but not specified by
.St -p1003.2 ,
and should be used with
caution in software intended to be portable to other systems.
.El
.Pp
When successful,
.Fn regcomp
returns 0 and fills in the structure pointed to by
.Fa preg .
One member of that structure
(other than
.Va re_endp )
is publicized:
.Va re_nsub ,
of type
.Ft size_t ,
contains the number of parenthesized subexpressions within the RE
(except that the value of this member is undefined if the
.Dv REG_NOSUB
flag was used).
If
.Fn regcomp
fails, it returns a non-zero error code;
see
.Sx DIAGNOSTICS .
.Pp
The
.Fn regexec
function
matches the compiled RE pointed to by
.Fa preg
against the
.Fa string ,
subject to the flags in
.Fa eflags ,
and reports results using
.Fa nmatch ,
.Fa pmatch ,
and the returned value.
The RE must have been compiled by a previous invocation of
.Fn regcomp .
The compiled form is not altered during execution of
.Fn regexec ,
so a single compiled RE can be used simultaneously by multiple threads.
.Pp
By default,
the NUL-terminated string pointed to by
.Fa string
is considered to be the text of an entire line, minus any terminating
newline.
The
.Fa eflags
argument is the bitwise OR of zero or more of the following flags:
.Bl -tag -width REG_STARTEND
.It Dv REG_NOTBOL
The first character of the string is treated as the continuation
of a line.
This means that the anchors
.Ql ^\& ,
.Ql [[:<:]] ,
and
.Ql \e<
do not match before it; but see
.Dv REG_STARTEND
below.
This does not affect the behavior of newlines under
.Dv REG_NEWLINE .
.It Dv REG_NOTEOL
The NUL terminating
the string
does not end a line, so the
.Ql $\&
anchor does not match before it.
This does not affect the behavior of newlines under
.Dv REG_NEWLINE .
.It Dv REG_STARTEND
The string is considered to start at
.Fa string No +
.Fa pmatch Ns [0]. Ns Fa rm_so
and to end before the byte located at
.Fa string No +
.Fa pmatch Ns [0]. Ns Fa rm_eo ,
regardless of the value of
.Fa nmatch .
See below for the definition of
.Fa pmatch
and
.Fa nmatch .
This is an extension,
compatible with but not specified by
.St -p1003.2 ,
and should be used with
caution in software intended to be portable to other systems.
.Pp
Without
.Dv REG_NOTBOL ,
the position
.Fa rm_so
is considered the beginning of a line, such that
.Ql ^
matches before it, and the beginning of a word if there is a word
character at this position, such that
.Ql [[:<:]]
and
.Ql \e<
match before it.
.Pp
With
.Dv REG_NOTBOL ,
the character at position
.Fa rm_so
is treated as the continuation of a line, and if
.Fa rm_so
is greater than 0, the preceding character is taken into consideration.
If the preceding character is a newline and the regular expression was compiled
with
.Dv REG_NEWLINE ,
.Ql ^
matches before the string; if the preceding character is not a word character
but the string starts with a word character,
.Ql [[:<:]]
and
.Ql \e<
match before the string.
.El
.Pp
See
.Xr re_format 7
for a discussion of what is matched in situations where an RE or a
portion thereof could match any of several substrings of
.Fa string .
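.Pp
As an illustration of the compile and execute flags described so far,
the following is a minimal sketch rather than a reference implementation.
The helper name match_once is hypothetical and not part of the library;
it compiles an extended RE with REG_NOSUB, runs it once with the
caller's eflags, and frees the compiled form.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of the library): compile `pattern` as
 * an extended RE, match it once against `text` using `eflags`, and
 * return the regexec() result: 0 on a match, REG_NOMATCH otherwise.
 * If regcomp() fails, its non-zero error code is returned instead.
 */
int
match_once(const char *pattern, const char *text, int eflags)
{
	regex_t re;
	int rc;

	/* REG_NOSUB: only success or failure is wanted, not offsets. */
	rc = regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB);
	if (rc != 0)
		return (rc);

	/* With REG_NOSUB the pmatch argument is ignored, so pass none. */
	rc = regexec(&re, text, 0, NULL, eflags);

	/* Release the storage held by the compiled form. */
	regfree(&re);
	return (rc);
}
```

Under these assumptions, match_once("^abc", "abcdef", 0) reports a match,
while the same call with REG_NOTBOL does not, because the start of the
string is then treated as the continuation of a line.
REG_NOTEOL has the corresponding effect on a trailing dollar-sign anchor.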
.Pp
Normally,
.Fn regexec
returns 0 for success and the non-zero code
.Dv REG_NOMATCH
for failure.
Other non-zero error codes may be returned in exceptional situations;
see
.Sx DIAGNOSTICS .
.Pp
If
.Dv REG_NOSUB
was specified in the compilation of the RE,
or if
.Fa nmatch
is 0,
.Fn regexec
ignores the
.Fa pmatch
argument (but see below for the case where
.Dv REG_STARTEND
is specified).
Otherwise,
.Fa pmatch
points to an array of
.Fa nmatch
structures of type
.Ft regmatch_t .
Such a structure has at least the members
.Va rm_so
and
.Va rm_eo ,
both of type
.Ft regoff_t
(a signed arithmetic type at least as large as an
.Ft off_t
and a
.Ft ssize_t ) ,
containing respectively the offset of the first character of a substring
and the offset of the first character after the end of the substring.
Offsets are measured from the beginning of the
.Fa string
argument given to
.Fn regexec .
An empty substring is denoted by equal offsets,
both indicating the character following the empty substring.
.Pp
The 0th member of the
.Fa pmatch
array is filled in to indicate what substring of
.Fa string
was matched by the entire RE.
Remaining members report what substring was matched by parenthesized
subexpressions within the RE;
member
.Va i
reports subexpression
.Va i ,
with subexpressions counted (starting at 1) by the order of their opening
parentheses in the RE, left to right.
Unused entries in the array (corresponding either to subexpressions that
did not participate in the match at all, or to subexpressions that do not
exist in the RE (that is,
.Va i
>
.Fa preg Ns -> Ns Va re_nsub ) )
have both
.Va rm_so
and
.Va rm_eo
set to -1.
If a subexpression participated in the match several times,
the reported substring is the last one it matched.
(Note, as an example in particular, that when the RE
.Ql "(b*)+"
matches
.Ql bbb ,
the parenthesized subexpression matches each of the three
.So Li b Sc Ns s
and then
an infinite number of empty strings following the last
.Ql b ,
so the reported substring is one of the empties.)
.Pp
If
.Dv REG_STARTEND
is specified,
.Fa pmatch
must point to at least one
.Ft regmatch_t
(even if
.Fa nmatch
is 0 or
.Dv REG_NOSUB
was specified),
to hold the input offsets for
.Dv REG_STARTEND .
Use for output is still entirely controlled by
.Fa nmatch ;
if
.Fa nmatch
is 0 or
.Dv REG_NOSUB
was specified,
the value of
.Fa pmatch Ns [0]
will not be changed by a successful
.Fn regexec .
.Pp
The
.Fn regerror
function
maps a non-zero
.Fa errcode
from either
.Fn regcomp
or
.Fn regexec
to a human-readable, printable message.
If
.Fa preg
is
.No non\- Ns Dv NULL ,
the error code should have arisen from use of
the
.Ft regex_t
pointed to by
.Fa preg ,
and if the error code came from
.Fn regcomp ,
it should have been the result from the most recent
.Fn regcomp
using that
.Ft regex_t .
.Po The
.Fn regerror
function
may be able to supply a more detailed message using information
from the
.Ft regex_t .
.Pc
The
.Fn regerror
function
places the NUL-terminated message into the buffer pointed to by
.Fa errbuf ,
limiting the length (including the NUL) to at most
.Fa errbuf_size
bytes.
If the whole message will not fit,
as much of it as will fit before the terminating NUL is supplied.
In any case,
the returned value is the size of the buffer needed to hold the whole
message (including terminating NUL).
If
.Fa errbuf_size
is 0,
.Fa errbuf
is ignored but the return value is still correct.
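.Pp
The match-reporting and error-reporting rules above can be sketched as
follows.
This is an illustrative sketch under stated assumptions, not the
library's own code: the helper names regex_strerror and first_group are
hypothetical.
The first sizes its buffer with a preliminary regerror call whose
errbuf_size is 0; the second reads the offsets of subexpression 1 from
a two-element pmatch array.

```c
#include <regex.h>
#include <stdlib.h>

/*
 * Hypothetical helper: return a malloc()ed message for `errcode`, or
 * NULL if allocation fails.  The caller frees the result.
 */
char *
regex_strerror(int errcode, const regex_t *preg)
{
	size_t need;
	char *msg;

	/* errbuf_size 0: errbuf is ignored, the size needed is returned. */
	need = regerror(errcode, preg, NULL, 0);
	msg = malloc(need);
	if (msg == NULL)
		return (NULL);
	(void)regerror(errcode, preg, msg, need);
	return (msg);
}

/*
 * Hypothetical helper: match the extended RE `pattern` against `text`
 * and store the offsets of subexpression 1 in *so and *eo.
 * Returns 0 on a match, REG_NOMATCH on no match, or a regcomp() error.
 */
int
first_group(const char *pattern, const char *text, regoff_t *so,
    regoff_t *eo)
{
	regex_t re;
	regmatch_t pm[2];	/* pm[0]: whole match, pm[1]: group 1 */
	int rc;

	rc = regcomp(&re, pattern, REG_EXTENDED);
	if (rc != 0)
		return (rc);
	rc = regexec(&re, text, 2, pm, 0);
	if (rc == 0) {
		/* Both are -1 if group 1 did not participate or is absent. */
		*so = pm[1].rm_so;
		*eo = pm[1].rm_eo;
	}
	regfree(&re);
	return (rc);
}
```

For instance, matching "([0-9]+)-([a-z]+)" against "id 42-abc" stores
offsets 3 and 5, the span of "42" measured from the start of the string.
The outputs are left untouched when there is no match, so callers
should check the return value before reading them.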
.Pp
If the
.Fa errcode
given to
.Fn regerror
is first ORed with
.Dv REG_ITOA ,
the
.Dq message
that results is the printable name of the error code,
e.g.\&
.Dq Dv REG_NOMATCH ,
rather than an explanation thereof.
If
.Fa errcode
is
.Dv REG_ATOI ,
then
.Fa preg
shall be
.No non\- Ns Dv NULL
and the
.Va re_endp
member of the structure it points to
must point to the printable name of an error code;
in this case, the result in
.Fa errbuf
is the decimal digits of
the numeric value of the error code
(0 if the name is not recognized).
.Dv REG_ITOA
and
.Dv REG_ATOI
are intended primarily as debugging facilities;
they are extensions,
compatible with but not specified by
.St -p1003.2 ,
and should be used with
caution in software intended to be portable to other systems.
Be warned also that they are considered experimental and changes are possible.
.Pp
The
.Fn regfree
function
frees any dynamically-allocated storage associated with the compiled RE
pointed to by
.Fa preg .
The remaining
.Ft regex_t
is no longer a valid compiled RE
and the effect of supplying it to
.Fn regexec
or
.Fn regerror
is undefined.
.Pp
None of these functions references global variables except for tables
of constants;
all are safe for use from multiple threads if the arguments are safe.
.Sh IMPLEMENTATION CHOICES
There are a number of decisions that
.St -p1003.2
leaves up to the implementor,
either by explicitly saying
.Dq undefined
or by virtue of them being
forbidden by the RE grammar.
This implementation treats them as follows.
.Pp
See
.Xr re_format 7
for a discussion of the definition of case-independent matching.
.Pp
There is no particular limit on the length of REs,
except insofar as memory is limited.
Memory usage is approximately linear in RE size, and largely insensitive
to RE complexity, except for bounded repetitions.
See
.Sx BUGS
for one short RE using them
that will run almost any system out of memory.
.Pp
A backslashed character other than one specifically given a magic meaning
by
.St -p1003.2
(such magic meanings occur only in obsolete
.Bq Dq basic
REs)
is taken as an ordinary character.
.Pp
Any unmatched
.Ql [\&
is a
.Dv REG_EBRACK
error.
.Pp
Equivalence classes cannot begin or end bracket-expression ranges.
The endpoint of one range cannot begin another.
.Pp
.Dv RE_DUP_MAX ,
the limit on repetition counts in bounded repetitions, is 255.
.Pp
A repetition operator
.Ql ( ?\& ,
.Ql *\& ,
.Ql +\& ,
or bounds)
cannot follow another
repetition operator.
A repetition operator cannot begin an expression or subexpression
or follow
.Ql ^\&
or
.Ql |\& .
.Pp
.Ql |\&
cannot appear first or last in a (sub)expression or after another
.Ql |\& ,
i.e., an operand of
.Ql |\&
cannot be an empty subexpression.
An empty parenthesized subexpression,
.Ql "()" ,
is legal and matches an
empty (sub)string.
An empty string is not a legal RE.
.Pp
A
.Ql {\&
followed by a digit is considered the beginning of bounds for a
bounded repetition, which must then follow the syntax for bounds.
A
.Ql {\&
.Em not
followed by a digit is considered an ordinary character.
.Pp
.Ql ^\&
and
.Ql $\&
beginning and ending subexpressions in obsolete
.Pq Dq basic
REs are anchors, not ordinary characters.
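.Pp
Because the choices in this section are specific to this
implementation, portable software may want to probe how the local
library treats a given RE rather than assume the answer.
The following is a minimal sketch of such a probe; the helper name
probe_re is hypothetical and not part of the library.

```c
#include <regex.h>

/*
 * Hypothetical helper: compile `pattern` with `cflags`, discard the
 * compiled form, and return regcomp()'s code (0 if the RE is accepted,
 * a non-zero error code otherwise).
 */
int
probe_re(const char *pattern, int cflags)
{
	regex_t re;
	int rc;

	rc = regcomp(&re, pattern, cflags);
	/* Only a successfully compiled RE is freed here. */
	if (rc == 0)
		regfree(&re);
	return (rc);
}
```

According to the rules above, this implementation should reject an
extended RE such as "a**" (a repetition operator following another) or
the empty string, and should report an unmatched bracket as in "[abc"
as REG_EBRACK.
Other implementations may accept some of these, which is the reason
for probing.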
.Sh DIAGNOSTICS
Non-zero error codes from
.Fn regcomp
and
.Fn regexec
include the following:
.Pp
.Bl -tag -width REG_ECOLLATE -compact
.It Dv REG_NOMATCH
The
.Fn regexec
function
failed to match
.It Dv REG_BADPAT
invalid regular expression
.It Dv REG_ECOLLATE
invalid collating element
.It Dv REG_ECTYPE
invalid character class
.It Dv REG_EESCAPE
.Ql \e
applied to unescapable character
.It Dv REG_ESUBREG
invalid backreference number
.It Dv REG_EBRACK
brackets
.Ql "[ ]"
not balanced
.It Dv REG_EPAREN
parentheses
.Ql "( )"
not balanced
.It Dv REG_EBRACE
braces
.Ql "{ }"
not balanced
.It Dv REG_BADBR
invalid repetition count(s) in
.Ql "{ }"
.It Dv REG_ERANGE
invalid character range in
.Ql "[ ]"
.It Dv REG_ESPACE
ran out of memory
.It Dv REG_BADRPT
.Ql ?\& ,
.Ql *\& ,
or
.Ql +\&
operand invalid
.It Dv REG_EMPTY
empty (sub)expression
.It Dv REG_ASSERT
cannot happen - you found a bug
.It Dv REG_INVARG
invalid argument, e.g.\& negative-length string
.It Dv REG_ILLSEQ
illegal byte sequence (bad multibyte character)
.El
.Sh SEE ALSO
.Xr grep 1 ,
.Xr re_format 7
.Pp
.St -p1003.2 ,
sections 2.8 (Regular Expression Notation)
and
B.5 (C Binding for Regular Expression Matching).
.Sh HISTORY
Originally written by
.An Henry Spencer .
Altered for inclusion in the
.Bx 4.4
distribution.
.Sh BUGS
This is an alpha release with known defects.
Please report problems.
.Pp
The back-reference code is subtle and doubts linger about its correctness
in complex cases.
.Pp
The performance of the
.Fn regexec
function
is poor.
This will improve with later releases.
An
.Fa nmatch
argument
exceeding 0 is expensive;
.Fa nmatch
exceeding 1 is worse.
The
.Fn regexec
function
is largely insensitive to RE complexity
.Em except
that back
references are massively expensive.
RE length does matter; in particular, there is a strong speed bonus
for keeping RE length under about 30 characters,
with most special characters counting roughly double.
.Pp
The
.Fn regcomp
function
implements bounded repetitions by macro expansion,
which is costly in time and space if counts are large
or bounded repetitions are nested.
An RE like, say,
.Ql "((((a{1,100}){1,100}){1,100}){1,100}){1,100}"
will (eventually) run almost any existing machine out of swap space.
.Pp
There are suspected problems with response to obscure error conditions.
Notably,
certain kinds of internal overflow,
produced only by truly enormous REs or by multiply nested bounded repetitions,
are probably not handled well.
.Pp
Due to a mistake in
.St -p1003.2 ,
things like
.Ql "a)b"
are legal REs because
.Ql )\&
is
a special character only in the presence of a previous unmatched
.Ql (\& .
This cannot be fixed until the spec is fixed.
.Pp
The standard's definition of back references is vague.
For example, does
.Ql "a\e(\e(b\e)*\e2\e)*d"
match
.Ql "abbbd" ?
Until the standard is clarified,
behavior in such cases should not be relied on.
.Pp
The implementation of word-boundary matching is a bit of a kludge,
and bugs may lurk in combinations of word-boundary matching and anchoring.
.Pp
Word-boundary matching does not work properly in multibyte locales.