.\" Copyright (c) 1992, 1993, 1994 Henry Spencer.
.\" Copyright (c) 1992, 1993, 1994
.\" The Regents of the University of California. All rights reserved.
.\"
.\" This code is derived from software contributed to Berkeley by
.\" Henry Spencer.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\" notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\" notice, this list of conditions and the following disclaimer in the
.\" documentation and/or other materials provided with the distribution.
.\" 3. Neither the name of the University nor the names of its contributors
.\" may be used to endorse or promote products derived from this software
.\" without specific prior written permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\" @(#)regex.3 8.4 (Berkeley) 3/20/94
.\" $FreeBSD$
.\"
.Dd May 25, 2016
.Dt REGEX 3
.Os
.Sh NAME
.Nm regcomp ,
.Nm regexec ,
.Nm regerror ,
.Nm regfree
.Nd regular-expression library
.Sh LIBRARY
.Lb libc
.Sh SYNOPSIS
.In regex.h
.Ft int
.Fo regcomp
.Fa "regex_t * restrict preg" "const char * restrict pattern" "int cflags"
.Fc
.Ft int
.Fo regexec
.Fa "const regex_t * restrict preg" "const char * restrict string"
.Fa "size_t nmatch" "regmatch_t pmatch[restrict]" "int eflags"
.Fc
.Ft size_t
.Fo regerror
.Fa "int errcode" "const regex_t * restrict preg"
.Fa "char * restrict errbuf" "size_t errbuf_size"
.Fc
.Ft void
.Fn regfree "regex_t *preg"
.Sh DESCRIPTION
These routines implement
.St -p1003.2
regular expressions
.Pq Do RE Dc Ns s ;
see
.Xr re_format 7 .
The
.Fn regcomp
function
compiles an RE written as a string into an internal form,
.Fn regexec
matches that internal form against a string and reports results,
.Fn regerror
transforms error codes from either into human-readable messages,
and
.Fn regfree
frees any dynamically-allocated storage used by the internal form
of an RE.
.Pp
The header
.In regex.h
declares two structure types,
.Ft regex_t
and
.Ft regmatch_t ,
the former for compiled internal forms and the latter for match reporting.
It also declares the four functions,
a type
.Ft regoff_t ,
and a number of constants with names starting with
.Dq Dv REG_ .
.Pp
The
.Fn regcomp
function
compiles the regular expression contained in the
.Fa pattern
string,
subject to the flags in
.Fa cflags ,
and places the results in the
.Ft regex_t
structure pointed to by
.Fa preg .
The
.Fa cflags
argument
is the bitwise OR of zero or more of the following flags:
.Bl -tag -width REG_EXTENDED
.It Dv REG_EXTENDED
Compile modern
.Pq Dq extended
REs,
rather than the obsolete
.Pq Dq basic
REs that
are the default.
.It Dv REG_BASIC
This is a synonym for 0,
provided as a counterpart to
.Dv REG_EXTENDED
to improve readability.
.It Dv REG_NOSPEC
Compile with recognition of all special characters turned off.
All characters are thus considered ordinary,
so the
.Dq RE
is a literal string.
This is an extension,
compatible with but not specified by
.St -p1003.2 ,
and should be used with
caution in software intended to be portable to other systems.
.Dv REG_EXTENDED
and
.Dv REG_NOSPEC
may not be used
in the same call to
.Fn regcomp .
.It Dv REG_ICASE
Compile for matching that ignores upper/lower case distinctions.
See
.Xr re_format 7 .
.It Dv REG_NOSUB
Compile for matching that need only report success or failure,
not what was matched.
.It Dv REG_NEWLINE
Compile for newline-sensitive matching.
By default, newline is a completely ordinary character with no special
meaning in either REs or strings.
With this flag,
.Ql [^
bracket expressions and
.Ql .\&
never match newline,
a
.Ql ^\&
anchor matches the null string after any newline in the string
in addition to its normal function,
and the
.Ql $\&
anchor matches the null string before any newline in the
string in addition to its normal function.
.It Dv REG_PEND
The regular expression ends,
not at the first NUL,
but just before the character pointed to by the
.Va re_endp
member of the structure pointed to by
.Fa preg .
The
.Va re_endp
member is of type
.Ft "const char *" .
This flag permits inclusion of NULs in the RE;
they are considered ordinary characters.
This is an extension,
compatible with but not specified by
.St -p1003.2 ,
and should be used with
caution in software intended to be portable to other systems.
.El
.Pp
When successful,
.Fn regcomp
returns 0 and fills in the structure pointed to by
.Fa preg .
One member of that structure
(other than
.Va re_endp )
is publicized:
.Va re_nsub ,
of type
.Ft size_t ,
contains the number of parenthesized subexpressions within the RE
(except that the value of this member is undefined if the
.Dv REG_NOSUB
flag was used).
If
.Fn regcomp
fails, it returns a non-zero error code;
see
.Sx DIAGNOSTICS .
.Pp
The
.Fn regexec
function
matches the compiled RE pointed to by
.Fa preg
against the
.Fa string ,
subject to the flags in
.Fa eflags ,
and reports results using
.Fa nmatch ,
.Fa pmatch ,
and the returned value.
The RE must have been compiled by a previous invocation of
.Fn regcomp .
The compiled form is not altered during execution of
.Fn regexec ,
so a single compiled RE can be used simultaneously by multiple threads.
.Pp
By default,
the NUL-terminated string pointed to by
.Fa string
is considered to be the text of an entire line, minus any terminating
newline.
The
.Fa eflags
argument is the bitwise OR of zero or more of the following flags:
.Bl -tag -width REG_STARTEND
.It Dv REG_NOTBOL
The first character of the string is treated as the continuation
of a line.
This means that the anchors
.Ql ^\& ,
.Ql [[:<:]] ,
and
.Ql \e<
do not match before it; but see
.Dv REG_STARTEND
below.
This does not affect the behavior of newlines under
.Dv REG_NEWLINE .
.It Dv REG_NOTEOL
The NUL terminating
the string
does not end a line, so the
.Ql $\&
anchor does not match before it.
This does not affect the behavior of newlines under
.Dv REG_NEWLINE .
.It Dv REG_STARTEND
The string is considered to start at
.Fa string No +
.Fa pmatch Ns [0]. Ns Fa rm_so
and to end before the byte located at
.Fa string No +
.Fa pmatch Ns [0]. Ns Fa rm_eo ,
regardless of the value of
.Fa nmatch .
See below for the definition of
.Fa pmatch
and
.Fa nmatch .
This is an extension,
compatible with but not specified by
.St -p1003.2 ,
and should be used with
caution in software intended to be portable to other systems.
.Pp
Without
.Dv REG_NOTBOL ,
the position
.Fa rm_so
is considered the beginning of a line, such that
.Ql ^
matches before it, and the beginning of a word if there is a word
character at this position, such that
.Ql [[:<:]]
and
.Ql \e<
match before it.
.Pp
With
.Dv REG_NOTBOL ,
the character at position
.Fa rm_so
is treated as the continuation of a line, and if
.Fa rm_so
is greater than 0, the preceding character is taken into consideration.
If the preceding character is a newline and the regular expression was compiled
with
.Dv REG_NEWLINE ,
.Ql ^
matches before the string; if the preceding character is not a word character
but the string starts with a word character,
.Ql [[:<:]]
and
.Ql \e<
match before the string.
.El
.Pp
See
.Xr re_format 7
for a discussion of what is matched in situations where an RE or a
portion thereof could match any of several substrings of
.Fa string .
.Pp
Normally,
.Fn regexec
returns 0 for success and the non-zero code
.Dv REG_NOMATCH
for failure.
Other non-zero error codes may be returned in exceptional situations;
see
.Sx DIAGNOSTICS .
.Pp
If
.Dv REG_NOSUB
was specified in the compilation of the RE,
or if
.Fa nmatch
is 0,
.Fn regexec
ignores the
.Fa pmatch
argument (but see below for the case where
.Dv REG_STARTEND
is specified).
Otherwise,
.Fa pmatch
points to an array of
.Fa nmatch
structures of type
.Ft regmatch_t .
Such a structure has at least the members
.Va rm_so
and
.Va rm_eo ,
both of type
.Ft regoff_t
(a signed arithmetic type at least as large as an
.Ft off_t
and a
.Ft ssize_t ) ,
containing respectively the offset of the first character of a substring
and the offset of the first character after the end of the substring.
Offsets are measured from the beginning of the
.Fa string
argument given to
.Fn regexec .
An empty substring is denoted by equal offsets,
both indicating the character following the empty substring.
.Pp
The 0th member of the
.Fa pmatch
array is filled in to indicate what substring of
.Fa string
was matched by the entire RE.
Remaining members report what substring was matched by parenthesized
subexpressions within the RE;
member
.Va i
reports subexpression
.Va i ,
with subexpressions counted (starting at 1) by the order of their opening
parentheses in the RE, left to right.
Unused entries in the array (corresponding either to subexpressions that
did not participate in the match at all, or to subexpressions that do not
exist in the RE (that is,
.Va i
>
.Fa preg Ns -> Ns Va re_nsub ) )
have both
.Va rm_so
and
.Va rm_eo
set to -1.
If a subexpression participated in the match several times,
the reported substring is the last one it matched.
(Note, as an example in particular, that when the RE
.Ql "(b*)+"
matches
.Ql bbb ,
the parenthesized subexpression matches each of the three
.So Li b Sc Ns s
and then
an infinite number of empty strings following the last
.Ql b ,
so the reported substring is one of the empties.)
.Pp
If
.Dv REG_STARTEND
is specified,
.Fa pmatch
must point to at least one
.Ft regmatch_t
(even if
.Fa nmatch
is 0 or
.Dv REG_NOSUB
was specified),
to hold the input offsets for
.Dv REG_STARTEND .
Use for output is still entirely controlled by
.Fa nmatch ;
if
.Fa nmatch
is 0 or
.Dv REG_NOSUB
was specified,
the value of
.Fa pmatch Ns [0]
will not be changed by a successful
.Fn regexec .
.Pp
The
.Fn regerror
function
maps a non-zero
.Fa errcode
from either
.Fn regcomp
or
.Fn regexec
to a human-readable, printable message.
If
.Fa preg
is
.No non\- Ns Dv NULL ,
the error code should have arisen from use of
the
.Ft regex_t
pointed to by
.Fa preg ,
and if the error code came from
.Fn regcomp ,
it should have been the result from the most recent
.Fn regcomp
using that
.Ft regex_t .
.Po
The
.Fn regerror
function
may be able to supply a more detailed message using information
from the
.Ft regex_t .
.Pc
The
.Fn regerror
function
places the NUL-terminated message into the buffer pointed to by
.Fa errbuf ,
limiting the length (including the NUL) to at most
.Fa errbuf_size
bytes.
If the whole message will not fit,
as much of it as will fit before the terminating NUL is supplied.
In any case,
the returned value is the size of buffer needed to hold the whole
message (including terminating NUL).
If
.Fa errbuf_size
is 0,
.Fa errbuf
is ignored but the return value is still correct.
.Pp
If the
.Fa errcode
given to
.Fn regerror
is first ORed with
.Dv REG_ITOA ,
the
.Dq message
that results is the printable name of the error code,
e.g.\&
.Dq Dv REG_NOMATCH ,
rather than an explanation thereof.
If
.Fa errcode
is
.Dv REG_ATOI ,
then
.Fa preg
shall be
.No non\- Ns Dv NULL
and the
.Va re_endp
member of the structure it points to
must point to the printable name of an error code;
in this case, the result in
.Fa errbuf
is the decimal digits of
the numeric value of the error code
(0 if the name is not recognized).
.Dv REG_ITOA
and
.Dv REG_ATOI
are intended primarily as debugging facilities;
they are extensions,
compatible with but not specified by
.St -p1003.2 ,
and should be used with
caution in software intended to be portable to other systems.
Be warned also that they are considered experimental and changes are possible.
.Pp
The
.Fn regfree
function
frees any dynamically-allocated storage associated with the compiled RE
pointed to by
.Fa preg .
The remaining
.Ft regex_t
is no longer a valid compiled RE
and the effect of supplying it to
.Fn regexec
or
.Fn regerror
is undefined.
.Pp
None of these functions references global variables except for tables
of constants;
all are safe for use from multiple threads if the arguments are safe.
.Sh IMPLEMENTATION CHOICES
There are a number of decisions that
.St -p1003.2
leaves up to the implementor,
either by explicitly saying
.Dq undefined
or by virtue of them being
forbidden by the RE grammar.
This implementation treats them as follows.
.Pp
See
.Xr re_format 7
for a discussion of the definition of case-independent matching.
.Pp
There is no particular limit on the length of REs,
except insofar as memory is limited.
Memory usage is approximately linear in RE size, and largely insensitive
to RE complexity, except for bounded repetitions.
See
.Sx BUGS
for one short RE using them
that will run almost any system out of memory.
.Pp
A backslashed character other than one specifically given a magic meaning
by
.St -p1003.2
(such magic meanings occur only in obsolete
.Bq Dq basic
REs)
is taken as an ordinary character.
.Pp
Any unmatched
.Ql [\&
is a
.Dv REG_EBRACK
error.
.Pp
Equivalence classes cannot begin or end bracket-expression ranges.
The endpoint of one range cannot begin another.
.Pp
.Dv RE_DUP_MAX ,
the limit on repetition counts in bounded repetitions, is 255.
.Pp
A repetition operator
.Ql ( ?\& ,
.Ql *\& ,
.Ql +\& ,
or bounds)
cannot follow another
repetition operator.
A repetition operator cannot begin an expression or subexpression
or follow
.Ql ^\&
or
.Ql |\& .
.Pp
.Ql |\&
cannot appear first or last in a (sub)expression or after another
.Ql |\& ,
i.e., an operand of
.Ql |\&
cannot be an empty subexpression.
An empty parenthesized subexpression,
.Ql "()" ,
is legal and matches an
empty (sub)string.
An empty string is not a legal RE.
.Pp
A
.Ql {\&
followed by a digit is considered the beginning of bounds for a
bounded repetition, which must then follow the syntax for bounds.
A
.Ql {\&
.Em not
followed by a digit is considered an ordinary character.
.Pp
.Ql ^\&
and
.Ql $\&
beginning and ending subexpressions in obsolete
.Pq Dq basic
REs are anchors, not ordinary characters.
.Sh DIAGNOSTICS
Non-zero error codes from
.Fn regcomp
and
.Fn regexec
include the following:
.Pp
.Bl -tag -width REG_ECOLLATE -compact
.It Dv REG_NOMATCH
The
.Fn regexec
function
failed to match
.It Dv REG_BADPAT
invalid regular expression
.It Dv REG_ECOLLATE
invalid collating element
.It Dv REG_ECTYPE
invalid character class
.It Dv REG_EESCAPE
.Ql \e
applied to unescapable character
.It Dv REG_ESUBREG
invalid backreference number
.It Dv REG_EBRACK
brackets
.Ql "[ ]"
not balanced
.It Dv REG_EPAREN
parentheses
.Ql "( )"
not balanced
.It Dv REG_EBRACE
braces
.Ql "{ }"
not balanced
.It Dv REG_BADBR
invalid repetition count(s) in
.Ql "{ }"
.It Dv REG_ERANGE
invalid character range in
.Ql "[ ]"
.It Dv REG_ESPACE
ran out of memory
.It Dv REG_BADRPT
.Ql ?\& ,
.Ql *\& ,
or
.Ql +\&
operand invalid
.It Dv REG_EMPTY
empty (sub)expression
.It Dv REG_ASSERT
cannot happen - you found a bug
.It Dv REG_INVARG
invalid argument, e.g.\& negative-length string
.It Dv REG_ILLSEQ
illegal byte sequence (bad multibyte character)
.El
.Sh SEE ALSO
.Xr grep 1 ,
.Xr re_format 7
.Pp
.St -p1003.2 ,
sections 2.8 (Regular Expression Notation)
and
B.5 (C Binding for Regular Expression Matching).
.Sh HISTORY
Originally written by
.An Henry Spencer .
Altered for inclusion in the
.Bx 4.4
distribution.
.Sh BUGS
This is an alpha release with known defects.
Please report problems.
.Pp
The back-reference code is subtle and doubts linger about its correctness
in complex cases.
.Pp
The performance of the
.Fn regexec
function
is poor.
This will improve with later releases.
The
.Fa nmatch
argument
exceeding 0 is expensive;
.Fa nmatch
exceeding 1 is worse.
The
.Fn regexec
function
is largely insensitive to RE complexity
.Em except
that back
references are massively expensive.
RE length does matter; in particular, there is a strong speed bonus
for keeping RE length under about 30 characters,
with most special characters counting roughly double.
.Pp
The
.Fn regcomp
function
implements bounded repetitions by macro expansion,
which is costly in time and space if counts are large
or bounded repetitions are nested.
An RE like, say,
.Ql "((((a{1,100}){1,100}){1,100}){1,100}){1,100}"
will (eventually) run almost any existing machine out of swap space.
.Pp
There are suspected problems with response to obscure error conditions.
Notably,
certain kinds of internal overflow,
produced only by truly enormous REs or by multiply nested bounded repetitions,
are probably not handled well.
.Pp
Due to a mistake in
.St -p1003.2 ,
things like
.Ql "a)b"
are legal REs because
.Ql )\&
is
a special character only in the presence of a previous unmatched
.Ql (\& .
This cannot be fixed until the spec is fixed.
.Pp
The standard's definition of back references is vague.
For example, does
.Ql "a\e(\e(b\e)*\e2\e)*d"
match
.Ql "abbbd" ?
Until the standard is clarified,
behavior in such cases should not be relied on.
.Pp
The implementation of word-boundary matching is a bit of a kludge,
and bugs may lurk in combinations of word-boundary matching and anchoring.
.Pp
Word-boundary matching does not work properly in multibyte locales.