.\" Copyright (c) 1992, 1993, 1994 Henry Spencer.
.\" Copyright (c) 1992, 1993, 1994
.\" The Regents of the University of California. All rights reserved.
.\"
.\" This code is derived from software contributed to Berkeley by
.\" Henry Spencer.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\" 3. Neither the name of the University nor the names of its contributors
.\"    may be used to endorse or promote products derived from this software
.\"    without specific prior written permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\" @(#)regex.3 8.4 (Berkeley) 3/20/94
.\"
.Dd April 15, 2017
.Dt REGEX 3
.Os
.Sh NAME
.Nm regcomp ,
.Nm regexec ,
.Nm regerror ,
.Nm regfree
.Nd regular-expression library
.Sh LIBRARY
.Lb libc
.Sh SYNOPSIS
.In regex.h
.Ft int
.Fo regcomp
.Fa "regex_t * restrict preg" "const char * restrict pattern" "int cflags"
.Fc
.Ft int
.Fo regexec
.Fa "const regex_t * restrict preg" "const char * restrict string"
.Fa "size_t nmatch" "regmatch_t pmatch[restrict]" "int eflags"
.Fc
.Ft size_t
.Fo regerror
.Fa "int errcode" "const regex_t * restrict preg"
.Fa "char * restrict errbuf" "size_t errbuf_size"
.Fc
.Ft void
.Fn regfree "regex_t *preg"
.Sh DESCRIPTION
These routines implement
.St -p1003.2
regular expressions
.Pq Do RE Dc Ns s ;
see
.Xr re_format 7 .
The
.Fn regcomp
function
compiles an RE written as a string into an internal form,
.Fn regexec
matches that internal form against a string and reports results,
.Fn regerror
transforms error codes from either into human-readable messages,
and
.Fn regfree
frees any dynamically-allocated storage used by the internal form
of an RE.
.Pp
The header
.In regex.h
declares two structure types,
.Ft regex_t
and
.Ft regmatch_t ,
the former for compiled internal forms and the latter for match reporting.
It also declares the four functions,
a type
.Ft regoff_t ,
and a number of constants with names starting with
.Dq Dv REG_ .
.Pp
The
.Fn regcomp
function
compiles the regular expression contained in the
.Fa pattern
string,
subject to the flags in
.Fa cflags ,
and places the results in the
.Ft regex_t
structure pointed to by
.Fa preg .
The
.Fa cflags
argument
is the bitwise OR of zero or more of the following flags:
.Bl -tag -width REG_EXTENDED
.It Dv REG_EXTENDED
Compile modern
.Pq Dq extended
REs,
rather than the obsolete
.Pq Dq basic
REs that
are the default.
.It Dv REG_BASIC
This is a synonym for 0,
provided as a counterpart to
.Dv REG_EXTENDED
to improve readability.
.It Dv REG_NOSPEC
Compile with recognition of all special characters turned off.
All characters are thus considered ordinary,
so the
.Dq RE
is a literal string.
This is an extension,
compatible with but not specified by
.St -p1003.2 ,
and should be used with
caution in software intended to be portable to other systems.
.Dv REG_EXTENDED
and
.Dv REG_NOSPEC
may not be used
in the same call to
.Fn regcomp .
.It Dv REG_ICASE
Compile for matching that ignores upper/lower case distinctions.
See
.Xr re_format 7 .
.It Dv REG_NOSUB
Compile for matching that need only report success or failure,
not what was matched.
.It Dv REG_NEWLINE
Compile for newline-sensitive matching.
By default, newline is a completely ordinary character with no special
meaning in either REs or strings.
With this flag,
.Ql [^
bracket expressions and
.Ql .\&
never match newline,
a
.Ql ^\&
anchor matches the null string after any newline in the string
in addition to its normal function,
and the
.Ql $\&
anchor matches the null string before any newline in the
string in addition to its normal function.
.It Dv REG_PEND
The regular expression ends,
not at the first NUL,
but just before the character pointed to by the
.Va re_endp
member of the structure pointed to by
.Fa preg .
The
.Va re_endp
member is of type
.Ft "const char *" .
This flag permits inclusion of NULs in the RE;
they are considered ordinary characters.
This is an extension,
compatible with but not specified by
.St -p1003.2 ,
and should be used with
caution in software intended to be portable to other systems.
.It Dv REG_POSIX
Compile only
.St -p1003.2
compliant expressions.
This flag has no effect unless linking against
.Nm libregex .
This is an extension,
compatible with but not specified by
.St -p1003.2 ,
and should be used with
caution in software intended to be portable to other systems.
.El
.Pp
When successful,
.Fn regcomp
returns 0 and fills in the structure pointed to by
.Fa preg .
One member of that structure
(other than
.Va re_endp )
is publicized:
.Va re_nsub ,
of type
.Ft size_t ,
contains the number of parenthesized subexpressions within the RE
(except that the value of this member is undefined if the
.Dv REG_NOSUB
flag was used).
If
.Fn regcomp
fails, it returns a non-zero error code;
see
.Sx DIAGNOSTICS .
.Pp
The
.Fn regexec
function
matches the compiled RE pointed to by
.Fa preg
against the
.Fa string ,
subject to the flags in
.Fa eflags ,
and reports results using
.Fa nmatch ,
.Fa pmatch ,
and the returned value.
The RE must have been compiled by a previous invocation of
.Fn regcomp .
The compiled form is not altered during execution of
.Fn regexec ,
so a single compiled RE can be used simultaneously by multiple threads.
.Pp
By default,
the NUL-terminated string pointed to by
.Fa string
is considered to be the text of an entire line, minus any terminating
newline.
The
.Fa eflags
argument is the bitwise OR of zero or more of the following flags:
.Bl -tag -width REG_STARTEND
.It Dv REG_NOTBOL
The first character of the string is treated as the continuation
of a line.
This means that the anchors
.Ql ^\& ,
.Ql [[:<:]] ,
and
.Ql \e<
do not match before it; but see
.Dv REG_STARTEND
below.
This does not affect the behavior of newlines under
.Dv REG_NEWLINE .
.It Dv REG_NOTEOL
The NUL terminating
the string
does not end a line, so the
.Ql $\&
anchor does not match before it.
This does not affect the behavior of newlines under
.Dv REG_NEWLINE .
.It Dv REG_STARTEND
The string is considered to start at
.Fa string No +
.Fa pmatch Ns [0]. Ns Fa rm_so
and to end before the byte located at
.Fa string No +
.Fa pmatch Ns [0]. Ns Fa rm_eo ,
regardless of the value of
.Fa nmatch .
See below for the definition of
.Fa pmatch
and
.Fa nmatch .
This is an extension,
compatible with but not specified by
.St -p1003.2 ,
and should be used with
caution in software intended to be portable to other systems.
.Pp
Without
.Dv REG_NOTBOL ,
the position
.Fa rm_so
is considered the beginning of a line, such that
.Ql ^
matches before it, and the beginning of a word if there is a word
character at this position, such that
.Ql [[:<:]]
and
.Ql \e<
match before it.
.Pp
With
.Dv REG_NOTBOL ,
the character at position
.Fa rm_so
is treated as the continuation of a line, and if
.Fa rm_so
is greater than 0, the preceding character is taken into consideration.
If the preceding character is a newline and the regular expression was compiled
with
.Dv REG_NEWLINE ,
.Ql ^
matches before the string; if the preceding character is not a word character
but the string starts with a word character,
.Ql [[:<:]]
and
.Ql \e<
match before the string.
.El
.Pp
See
.Xr re_format 7
for a discussion of what is matched in situations where an RE or a
portion thereof could match any of several substrings of
.Fa string .
.Pp
Normally,
.Fn regexec
returns 0 for success and the non-zero code
.Dv REG_NOMATCH
for failure.
Other non-zero error codes may be returned in exceptional situations;
see
.Sx DIAGNOSTICS .
.Pp
If
.Dv REG_NOSUB
was specified in the compilation of the RE,
or if
.Fa nmatch
is 0,
.Fn regexec
ignores the
.Fa pmatch
argument (but see below for the case where
.Dv REG_STARTEND
is specified).
Otherwise,
.Fa pmatch
points to an array of
.Fa nmatch
structures of type
.Ft regmatch_t .
Such a structure has at least the members
.Va rm_so
and
.Va rm_eo ,
both of type
.Ft regoff_t
(a signed arithmetic type at least as large as an
.Ft off_t
and a
.Ft ssize_t ) ,
containing respectively the offset of the first character of a substring
and the offset of the first character after the end of the substring.
Offsets are measured from the beginning of the
.Fa string
argument given to
.Fn regexec .
An empty substring is denoted by equal offsets,
both indicating the character following the empty substring.
.Pp
The 0th member of the
.Fa pmatch
array is filled in to indicate what substring of
.Fa string
was matched by the entire RE.
Remaining members report what substring was matched by parenthesized
subexpressions within the RE;
member
.Va i
reports subexpression
.Va i ,
with subexpressions counted (starting at 1) by the order of their opening
parentheses in the RE, left to right.
Unused entries in the array (corresponding either to subexpressions that
did not participate in the match at all, or to subexpressions that do not
exist in the RE (that is,
.Va i
>
.Fa preg Ns -> Ns Va re_nsub ) )
have both
.Va rm_so
and
.Va rm_eo
set to -1.
If a subexpression participated in the match several times,
the reported substring is the last one it matched.
(Note, as an example in particular, that when the RE
.Ql "(b*)+"
matches
.Ql bbb ,
the parenthesized subexpression matches each of the three
.So Li b Sc Ns s
and then
an infinite number of empty strings following the last
.Ql b ,
so the reported substring is one of the empties.)
.Pp
If
.Dv REG_STARTEND
is specified,
.Fa pmatch
must point to at least one
.Ft regmatch_t
(even if
.Fa nmatch
is 0 or
.Dv REG_NOSUB
was specified),
to hold the input offsets for
.Dv REG_STARTEND .
Use for output is still entirely controlled by
.Fa nmatch ;
if
.Fa nmatch
is 0 or
.Dv REG_NOSUB
was specified,
the value of
.Fa pmatch Ns [0]
will not be changed by a successful
.Fn regexec .
.Pp
The
.Fn regerror
function
maps a non-zero
.Fa errcode
from either
.Fn regcomp
or
.Fn regexec
to a human-readable, printable message.
If
.Fa preg
is
.No non\- Ns Dv NULL ,
the error code should have arisen from use of
the
.Ft regex_t
pointed to by
.Fa preg ,
and if the error code came from
.Fn regcomp ,
it should have been the result from the most recent
.Fn regcomp
using that
.Ft regex_t .
.Po
The
.Fn regerror
function
may be able to supply a more detailed message using information
from the
.Ft regex_t .
.Pc
The
.Fn regerror
function
places the NUL-terminated message into the buffer pointed to by
.Fa errbuf ,
limiting the length (including the NUL) to at most
.Fa errbuf_size
bytes.
If the whole message will not fit,
as much of it as will fit before the terminating NUL is supplied.
In any case,
the returned value is the size of buffer needed to hold the whole
message (including terminating NUL).
If
.Fa errbuf_size
is 0,
.Fa errbuf
is ignored but the return value is still correct.
.Pp
If the
.Fa errcode
given to
.Fn regerror
is first ORed with
.Dv REG_ITOA ,
the
.Dq message
that results is the printable name of the error code,
e.g.\&
.Dq Dv REG_NOMATCH ,
rather than an explanation thereof.
If
.Fa errcode
is
.Dv REG_ATOI ,
then
.Fa preg
shall be
.No non\- Ns Dv NULL
and the
.Va re_endp
member of the structure it points to
must point to the printable name of an error code;
in this case, the result in
.Fa errbuf
is the decimal digits of
the numeric value of the error code
(0 if the name is not recognized).
.Dv REG_ITOA
and
.Dv REG_ATOI
are intended primarily as debugging facilities;
they are extensions,
compatible with but not specified by
.St -p1003.2 ,
and should be used with
caution in software intended to be portable to other systems.
Be warned also that they are considered experimental and changes are possible.
.Pp
The
.Fn regfree
function
frees any dynamically-allocated storage associated with the compiled RE
pointed to by
.Fa preg .
The remaining
.Ft regex_t
is no longer a valid compiled RE
and the effect of supplying it to
.Fn regexec
or
.Fn regerror
is undefined.
.Pp
None of these functions references global variables except for tables
of constants;
all are safe for use from multiple threads if the arguments are safe.
.Sh IMPLEMENTATION CHOICES
There are a number of decisions that
.St -p1003.2
leaves up to the implementor,
either by explicitly saying
.Dq undefined
or by virtue of them being
forbidden by the RE grammar.
This implementation treats them as follows.
.Pp
See
.Xr re_format 7
for a discussion of the definition of case-independent matching.
.Pp
There is no particular limit on the length of REs,
except insofar as memory is limited.
Memory usage is approximately linear in RE size, and largely insensitive
to RE complexity, except for bounded repetitions.
See
.Sx BUGS
for one short RE using them
that will run almost any system out of memory.
.Pp
A backslashed character other than one specifically given a magic meaning
by
.St -p1003.2
(such magic meanings occur only in obsolete
.Bq Dq basic
REs)
is taken as an ordinary character.
.Pp
Any unmatched
.Ql [\&
is a
.Dv REG_EBRACK
error.
.Pp
Equivalence classes cannot begin or end bracket-expression ranges.
The endpoint of one range cannot begin another.
.Pp
.Dv RE_DUP_MAX ,
the limit on repetition counts in bounded repetitions, is 255.
.Pp
A repetition operator
.Ql ( ?\& ,
.Ql *\& ,
.Ql +\& ,
or bounds)
cannot follow another
repetition operator.
A repetition operator cannot begin an expression or subexpression
or follow
.Ql ^\&
or
.Ql |\& .
.Pp
.Ql |\&
cannot appear first or last in a (sub)expression or after another
.Ql |\& ,
i.e., an operand of
.Ql |\&
cannot be an empty subexpression.
An empty parenthesized subexpression,
.Ql "()" ,
is legal and matches an
empty (sub)string.
An empty string is not a legal RE.
.Pp
A
.Ql {\&
followed by a digit is considered the beginning of bounds for a
bounded repetition, which must then follow the syntax for bounds.
A
.Ql {\&
.Em not
followed by a digit is considered an ordinary character.
.Pp
.Ql ^\&
and
.Ql $\&
beginning and ending subexpressions in obsolete
.Pq Dq basic
REs are anchors, not ordinary characters.
.Sh DIAGNOSTICS
Non-zero error codes from
.Fn regcomp
and
.Fn regexec
include the following:
.Pp
.Bl -tag -width REG_ECOLLATE -compact
.It Dv REG_NOMATCH
The
.Fn regexec
function
failed to match
.It Dv REG_BADPAT
invalid regular expression
.It Dv REG_ECOLLATE
invalid collating element
.It Dv REG_ECTYPE
invalid character class
.It Dv REG_EESCAPE
.Ql \e
applied to unescapable character
.It Dv REG_ESUBREG
invalid backreference number
.It Dv REG_EBRACK
brackets
.Ql "[ ]"
not balanced
.It Dv REG_EPAREN
parentheses
.Ql "( )"
not balanced
.It Dv REG_EBRACE
braces
.Ql "{ }"
not balanced
.It Dv REG_BADBR
invalid repetition count(s) in
.Ql "{ }"
.It Dv REG_ERANGE
invalid character range in
.Ql "[ ]"
.It Dv REG_ESPACE
ran out of memory
.It Dv REG_BADRPT
.Ql ?\& ,
.Ql *\& ,
or
.Ql +\&
operand invalid
.It Dv REG_EMPTY
empty (sub)expression
.It Dv REG_ASSERT
cannot happen - you found a bug
.It Dv REG_INVARG
invalid argument, e.g.\& negative-length string
.It Dv REG_ILLSEQ
illegal byte sequence (bad multibyte character)
.El
.Sh SEE ALSO
.Xr grep 1 ,
.Xr re_format 7
.Pp
.St -p1003.2 ,
sections 2.8 (Regular Expression Notation)
and
B.5 (C Binding for Regular Expression Matching).
.Sh HISTORY
Originally written by
.An Henry Spencer .
Altered for inclusion in the
.Bx 4.4
distribution.
.Sh BUGS
This is an alpha release with known defects.
Please report problems.
.Pp
The back-reference code is subtle and doubts linger about its correctness
in complex cases.
.Pp
The
.Fn regexec
function
performs poorly.
This will improve with later releases.
A non-zero
.Fa nmatch
argument is expensive;
.Fa nmatch
exceeding 1 is worse.
The
.Fn regexec
function
is largely insensitive to RE complexity
.Em except
that back
references are massively expensive.
RE length does matter; in particular, there is a strong speed bonus
for keeping RE length under about 30 characters,
with most special characters counting roughly double.
.Pp
The
.Fn regcomp
function
implements bounded repetitions by macro expansion,
which is costly in time and space if counts are large
or bounded repetitions are nested.
An RE like, say,
.Ql "((((a{1,100}){1,100}){1,100}){1,100}){1,100}"
will (eventually) run almost any existing machine out of swap space.
.Pp
There are suspected problems with response to obscure error conditions.
Notably,
certain kinds of internal overflow,
produced only by truly enormous REs or by multiply nested bounded repetitions,
are probably not handled well.
.Pp
Due to a mistake in
.St -p1003.2 ,
things like
.Ql "a)b"
are legal REs because
.Ql )\&
is
a special character only in the presence of a previous unmatched
.Ql (\& .
This cannot be fixed until the spec is fixed.
.Pp
The standard's definition of back references is vague.
For example, does
.Ql "a\e(\e(b\e)*\e2\e)*d"
match
.Ql "abbbd" ?
Until the standard is clarified,
behavior in such cases should not be relied on.
.Pp
The implementation of word-boundary matching is a bit of a kludge,
and bugs may lurk in combinations of word-boundary matching and anchoring.
.Pp
Word-boundary matching does not work properly in multibyte locales.