1.\" Copyright (c) 1992, 1993, 1994 Henry Spencer. 2.\" Copyright (c) 1992, 1993, 1994 3.\" The Regents of the University of California. All rights reserved. 4.\" 5.\" This code is derived from software contributed to Berkeley by 6.\" Henry Spencer. 7.\" 8.\" Redistribution and use in source and binary forms, with or without 9.\" modification, are permitted provided that the following conditions 10.\" are met: 11.\" 1. Redistributions of source code must retain the above copyright 12.\" notice, this list of conditions and the following disclaimer. 13.\" 2. Redistributions in binary form must reproduce the above copyright 14.\" notice, this list of conditions and the following disclaimer in the 15.\" documentation and/or other materials provided with the distribution. 16.\" 3. All advertising materials mentioning features or use of this software 17.\" must display the following acknowledgement: 18.\" This product includes software developed by the University of 19.\" California, Berkeley and its contributors. 20.\" 4. Neither the name of the University nor the names of its contributors 21.\" may be used to endorse or promote products derived from this software 22.\" without specific prior written permission. 23.\" 24.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND 25.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 26.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE 27.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE 28.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 29.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS 30.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) 31.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT 32.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY 33.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF 34.\" SUCH DAMAGE. 35.\" 36.\" @(#)regex.3 8.4 (Berkeley) 3/20/94 37.\" $FreeBSD$ 38.\" 39.Dd August 17, 2005 40.Dt REGEX 3 41.Os 42.Sh NAME 43.Nm regcomp , 44.Nm regexec , 45.Nm regerror , 46.Nm regfree 47.Nd regular-expression library 48.Sh LIBRARY 49.Lb libc 50.Sh SYNOPSIS 51.In regex.h 52.Ft int 53.Fo regcomp 54.Fa "regex_t * restrict preg" "const char * restrict pattern" "int cflags" 55.Fc 56.Ft int 57.Fo regexec 58.Fa "const regex_t * restrict preg" "const char * restrict string" 59.Fa "size_t nmatch" "regmatch_t pmatch[restrict]" "int eflags" 60.Fc 61.Ft size_t 62.Fo regerror 63.Fa "int errcode" "const regex_t * restrict preg" 64.Fa "char * restrict errbuf" "size_t errbuf_size" 65.Fc 66.Ft void 67.Fn regfree "regex_t *preg" 68.Sh DESCRIPTION 69These routines implement 70.St -p1003.2 71regular expressions 72.Pq Do RE Dc Ns s ; 73see 74.Xr re_format 7 . 75The 76.Fn regcomp 77function 78compiles an RE written as a string into an internal form, 79.Fn regexec 80matches that internal form against a string and reports results, 81.Fn regerror 82transforms error codes from either into human-readable messages, 83and 84.Fn regfree 85frees any dynamically-allocated storage used by the internal form 86of an RE. 87.Pp 88The header 89.In regex.h 90declares two structure types, 91.Ft regex_t 92and 93.Ft regmatch_t , 94the former for compiled internal forms and the latter for match reporting. 95It also declares the four functions, 96a type 97.Ft regoff_t , 98and a number of constants with names starting with 99.Dq Dv REG_ . 100.Pp 101The 102.Fn regcomp 103function 104compiles the regular expression contained in the 105.Fa pattern 106string, 107subject to the flags in 108.Fa cflags , 109and places the results in the 110.Ft regex_t 111structure pointed to by 112.Fa preg . 113The 114.Fa cflags 115argument 116is the bitwise OR of zero or more of the following flags: 117.Bl -tag -width REG_EXTENDED 118.It Dv REG_EXTENDED 119Compile modern 120.Pq Dq extended 121REs, 122rather than the obsolete 123.Pq Dq basic 124REs that 125are the default. 126.It Dv REG_BASIC 127This is a synonym for 0, 128provided as a counterpart to 129.Dv REG_EXTENDED 130to improve readability. 131.It Dv REG_NOSPEC 132Compile with recognition of all special characters turned off. 133All characters are thus considered ordinary, 134so the 135.Dq RE 136is a literal string. 137This is an extension, 138compatible with but not specified by 139.St -p1003.2 , 140and should be used with 141caution in software intended to be portable to other systems. 142.Dv REG_EXTENDED 143and 144.Dv REG_NOSPEC 145may not be used 146in the same call to 147.Fn regcomp . 148.It Dv REG_ICASE 149Compile for matching that ignores upper/lower case distinctions. 150See 151.Xr re_format 7 . 152.It Dv REG_NOSUB 153Compile for matching that need only report success or failure, 154not what was matched. 155.It Dv REG_NEWLINE 156Compile for newline-sensitive matching. 157By default, newline is a completely ordinary character with no special 158meaning in either REs or strings. 159With this flag, 160.Ql [^ 161bracket expressions and 162.Ql .\& 163never match newline, 164a 165.Ql ^\& 166anchor matches the null string after any newline in the string 167in addition to its normal function, 168and the 169.Ql $\& 170anchor matches the null string before any newline in the 171string in addition to its normal function. 172.It Dv REG_PEND 173The regular expression ends, 174not at the first NUL, 175but just before the character pointed to by the 176.Va re_endp 177member of the structure pointed to by 178.Fa preg . 179The 180.Va re_endp 181member is of type 182.Ft "const char *" . 183This flag permits inclusion of NULs in the RE; 184they are considered ordinary characters. 185This is an extension, 186compatible with but not specified by 187.St -p1003.2 , 188and should be used with 189caution in software intended to be portable to other systems. 190.El 191.Pp 192When successful, 193.Fn regcomp 194returns 0 and fills in the structure pointed to by 195.Fa preg . 196One member of that structure 197(other than 198.Va re_endp ) 199is publicized: 200.Va re_nsub , 201of type 202.Ft size_t , 203contains the number of parenthesized subexpressions within the RE 204(except that the value of this member is undefined if the 205.Dv REG_NOSUB 206flag was used). 207If 208.Fn regcomp 209fails, it returns a non-zero error code; 210see 211.Sx DIAGNOSTICS . 212.Pp 213The 214.Fn regexec 215function 216matches the compiled RE pointed to by 217.Fa preg 218against the 219.Fa string , 220subject to the flags in 221.Fa eflags , 222and reports results using 223.Fa nmatch , 224.Fa pmatch , 225and the returned value. 226The RE must have been compiled by a previous invocation of 227.Fn regcomp . 228The compiled form is not altered during execution of 229.Fn regexec , 230so a single compiled RE can be used simultaneously by multiple threads. 231.Pp 232By default, 233the NUL-terminated string pointed to by 234.Fa string 235is considered to be the text of an entire line, minus any terminating 236newline. 237The 238.Fa eflags 239argument is the bitwise OR of zero or more of the following flags: 240.Bl -tag -width REG_STARTEND 241.It Dv REG_NOTBOL 242The first character of 243the string 244is not the beginning of a line, so the 245.Ql ^\& 246anchor should not match before it. 247This does not affect the behavior of newlines under 248.Dv REG_NEWLINE . 249.It Dv REG_NOTEOL 250The NUL terminating 251the string 252does not end a line, so the 253.Ql $\& 254anchor should not match before it. 255This does not affect the behavior of newlines under 256.Dv REG_NEWLINE . 257.It Dv REG_STARTEND 258The string is considered to start at 259.Fa string 260+ 261.Fa pmatch Ns [0]. Ns Va rm_so 262and to have a terminating NUL located at 263.Fa string 264+ 265.Fa pmatch Ns [0]. Ns Va rm_eo 266(there need not actually be a NUL at that location), 267regardless of the value of 268.Fa nmatch . 269See below for the definition of 270.Fa pmatch 271and 272.Fa nmatch . 273This is an extension, 274compatible with but not specified by 275.St -p1003.2 , 276and should be used with 277caution in software intended to be portable to other systems. 278Note that a non-zero 279.Va rm_so 280does not imply 281.Dv REG_NOTBOL ; 282.Dv REG_STARTEND 283affects only the location of the string, 284not how it is matched. 285.El 286.Pp 287See 288.Xr re_format 7 289for a discussion of what is matched in situations where an RE or a 290portion thereof could match any of several substrings of 291.Fa string . 292.Pp 293Normally, 294.Fn regexec 295returns 0 for success and the non-zero code 296.Dv REG_NOMATCH 297for failure. 298Other non-zero error codes may be returned in exceptional situations; 299see 300.Sx DIAGNOSTICS . 301.Pp 302If 303.Dv REG_NOSUB 304was specified in the compilation of the RE, 305or if 306.Fa nmatch 307is 0, 308.Fn regexec 309ignores the 310.Fa pmatch 311argument (but see below for the case where 312.Dv REG_STARTEND 313is specified). 314Otherwise, 315.Fa pmatch 316points to an array of 317.Fa nmatch 318structures of type 319.Ft regmatch_t . 320Such a structure has at least the members 321.Va rm_so 322and 323.Va rm_eo , 324both of type 325.Ft regoff_t 326(a signed arithmetic type at least as large as an 327.Ft off_t 328and a 329.Ft ssize_t ) , 330containing respectively the offset of the first character of a substring 331and the offset of the first character after the end of the substring. 332Offsets are measured from the beginning of the 333.Fa string 334argument given to 335.Fn regexec . 336An empty substring is denoted by equal offsets, 337both indicating the character following the empty substring. 338.Pp 339The 0th member of the 340.Fa pmatch 341array is filled in to indicate what substring of 342.Fa string 343was matched by the entire RE. 344Remaining members report what substring was matched by parenthesized 345subexpressions within the RE; 346member 347.Va i 348reports subexpression 349.Va i , 350with subexpressions counted (starting at 1) by the order of their opening 351parentheses in the RE, left to right. 352Unused entries in the array (corresponding either to subexpressions that 353did not participate in the match at all, or to subexpressions that do not 354exist in the RE (that is, 355.Va i 356> 357.Fa preg Ns -> Ns Va re_nsub ) ) 358have both 359.Va rm_so 360and 361.Va rm_eo 362set to -1. 363If a subexpression participated in the match several times, 364the reported substring is the last one it matched. 365(Note, as an example in particular, that when the RE 366.Ql "(b*)+" 367matches 368.Ql bbb , 369the parenthesized subexpression matches each of the three 370.So Li b Sc Ns s 371and then 372an infinite number of empty strings following the last 373.Ql b , 374so the reported substring is one of the empties.) 375.Pp 376If 377.Dv REG_STARTEND 378is specified, 379.Fa pmatch 380must point to at least one 381.Ft regmatch_t 382(even if 383.Fa nmatch 384is 0 or 385.Dv REG_NOSUB 386was specified), 387to hold the input offsets for 388.Dv REG_STARTEND . 389Use for output is still entirely controlled by 390.Fa nmatch ; 391if 392.Fa nmatch 393is 0 or 394.Dv REG_NOSUB 395was specified, 396the value of 397.Fa pmatch Ns [0] 398will not be changed by a successful 399.Fn regexec . 400.Pp 401The 402.Fn regerror 403function 404maps a non-zero 405.Fa errcode 406from either 407.Fn regcomp 408or 409.Fn regexec 410to a human-readable, printable message. 411If 412.Fa preg 413is 414.No non\- Ns Dv NULL , 415the error code should have arisen from use of 416the 417.Ft regex_t 418pointed to by 419.Fa preg , 420and if the error code came from 421.Fn regcomp , 422it should have been the result from the most recent 423.Fn regcomp 424using that 425.Ft regex_t . 426The 427.Fn ( regerror 428may be able to supply a more detailed message using information 429from the 430.Ft regex_t . ) 431The 432.Fn regerror 433function 434places the NUL-terminated message into the buffer pointed to by 435.Fa errbuf , 436limiting the length (including the NUL) to at most 437.Fa errbuf_size 438bytes. 439If the whole message will not fit, 440as much of it as will fit before the terminating NUL is supplied. 441In any case, 442the returned value is the size of buffer needed to hold the whole 443message (including terminating NUL). 444If 445.Fa errbuf_size 446is 0, 447.Fa errbuf 448is ignored but the return value is still correct. 449.Pp 450If the 451.Fa errcode 452given to 453.Fn regerror 454is first ORed with 455.Dv REG_ITOA , 456the 457.Dq message 458that results is the printable name of the error code, 459e.g.\& 460.Dq Dv REG_NOMATCH , 461rather than an explanation thereof. 462If 463.Fa errcode 464is 465.Dv REG_ATOI , 466then 467.Fa preg 468shall be 469.No non\- Ns Dv NULL 470and the 471.Va re_endp 472member of the structure it points to 473must point to the printable name of an error code; 474in this case, the result in 475.Fa errbuf 476is the decimal digits of 477the numeric value of the error code 478(0 if the name is not recognized). 479.Dv REG_ITOA 480and 481.Dv REG_ATOI 482are intended primarily as debugging facilities; 483they are extensions, 484compatible with but not specified by 485.St -p1003.2 , 486and should be used with 487caution in software intended to be portable to other systems. 488Be warned also that they are considered experimental and changes are possible. 489.Pp 490The 491.Fn regfree 492function 493frees any dynamically-allocated storage associated with the compiled RE 494pointed to by 495.Fa preg . 496The remaining 497.Ft regex_t 498is no longer a valid compiled RE 499and the effect of supplying it to 500.Fn regexec 501or 502.Fn regerror 503is undefined. 504.Pp 505None of these functions references global variables except for tables 506of constants; 507all are safe for use from multiple threads if the arguments are safe. 508.Sh IMPLEMENTATION CHOICES 509There are a number of decisions that 510.St -p1003.2 511leaves up to the implementor, 512either by explicitly saying 513.Dq undefined 514or by virtue of them being 515forbidden by the RE grammar. 516This implementation treats them as follows. 517.Pp 518See 519.Xr re_format 7 520for a discussion of the definition of case-independent matching. 521.Pp 522There is no particular limit on the length of REs, 523except insofar as memory is limited. 524Memory usage is approximately linear in RE size, and largely insensitive 525to RE complexity, except for bounded repetitions. 526See 527.Sx BUGS 528for one short RE using them 529that will run almost any system out of memory. 530.Pp 531A backslashed character other than one specifically given a magic meaning 532by 533.St -p1003.2 534(such magic meanings occur only in obsolete 535.Bq Dq basic 536REs) 537is taken as an ordinary character. 538.Pp 539Any unmatched 540.Ql [\& 541is a 542.Dv REG_EBRACK 543error. 544.Pp 545Equivalence classes cannot begin or end bracket-expression ranges. 546The endpoint of one range cannot begin another. 547.Pp 548.Dv RE_DUP_MAX , 549the limit on repetition counts in bounded repetitions, is 255. 550.Pp 551A repetition operator 552.Ql ( ?\& , 553.Ql *\& , 554.Ql +\& , 555or bounds) 556cannot follow another 557repetition operator. 558A repetition operator cannot begin an expression or subexpression 559or follow 560.Ql ^\& 561or 562.Ql |\& . 563.Pp 564.Ql |\& 565cannot appear first or last in a (sub)expression or after another 566.Ql |\& , 567i.e., an operand of 568.Ql |\& 569cannot be an empty subexpression. 570An empty parenthesized subexpression, 571.Ql "()" , 572is legal and matches an 573empty (sub)string. 574An empty string is not a legal RE. 575.Pp 576A 577.Ql {\& 578followed by a digit is considered the beginning of bounds for a 579bounded repetition, which must then follow the syntax for bounds. 580A 581.Ql {\& 582.Em not 583followed by a digit is considered an ordinary character. 584.Pp 585.Ql ^\& 586and 587.Ql $\& 588beginning and ending subexpressions in obsolete 589.Pq Dq basic 590REs are anchors, not ordinary characters. 591.Sh DIAGNOSTICS 592Non-zero error codes from 593.Fn regcomp 594and 595.Fn regexec 596include the following: 597.Pp 598.Bl -tag -width REG_ECOLLATE -compact 599.It Dv REG_NOMATCH 600The 601.Fn regexec 602function 603failed to match 604.It Dv REG_BADPAT 605invalid regular expression 606.It Dv REG_ECOLLATE 607invalid collating element 608.It Dv REG_ECTYPE 609invalid character class 610.It Dv REG_EESCAPE 611.Ql \e 612applied to unescapable character 613.It Dv REG_ESUBREG 614invalid backreference number 615.It Dv REG_EBRACK 616brackets 617.Ql "[ ]" 618not balanced 619.It Dv REG_EPAREN 620parentheses 621.Ql "( )" 622not balanced 623.It Dv REG_EBRACE 624braces 625.Ql "{ }" 626not balanced 627.It Dv REG_BADBR 628invalid repetition count(s) in 629.Ql "{ }" 630.It Dv REG_ERANGE 631invalid character range in 632.Ql "[ ]" 633.It Dv REG_ESPACE 634ran out of memory 635.It Dv REG_BADRPT 636.Ql ?\& , 637.Ql *\& , 638or 639.Ql +\& 640operand invalid 641.It Dv REG_EMPTY 642empty (sub)expression 643.It Dv REG_ASSERT 644cannot happen - you found a bug 645.It Dv REG_INVARG 646invalid argument, e.g.\& negative-length string 647.It Dv REG_ILLSEQ 648illegal byte sequence (bad multibyte character) 649.El 650.Sh SEE ALSO 651.Xr grep 1 , 652.Xr re_format 7 653.Pp 654.St -p1003.2 , 655sections 2.8 (Regular Expression Notation) 656and 657B.5 (C Binding for Regular Expression Matching). 658.Sh HISTORY 659Originally written by 660.An Henry Spencer . 661Altered for inclusion in the 662.Bx 4.4 663distribution. 664.Sh BUGS 665This is an alpha release with known defects. 666Please report problems. 667.Pp 668The back-reference code is subtle and doubts linger about its correctness 669in complex cases. 670.Pp 671The 672.Fn regexec 673function 674performance is poor. 675This will improve with later releases. 676The 677.Fa nmatch 678argument 679exceeding 0 is expensive; 680.Fa nmatch 681exceeding 1 is worse. 682The 683.Fn regexec 684function 685is largely insensitive to RE complexity 686.Em except 687that back 688references are massively expensive. 689RE length does matter; in particular, there is a strong speed bonus 690for keeping RE length under about 30 characters, 691with most special characters counting roughly double. 692.Pp 693The 694.Fn regcomp 695function 696implements bounded repetitions by macro expansion, 697which is costly in time and space if counts are large 698or bounded repetitions are nested. 699An RE like, say, 700.Ql "((((a{1,100}){1,100}){1,100}){1,100}){1,100}" 701will (eventually) run almost any existing machine out of swap space. 702.Pp 703There are suspected problems with response to obscure error conditions. 704Notably, 705certain kinds of internal overflow, 706produced only by truly enormous REs or by multiply nested bounded repetitions, 707are probably not handled well. 708.Pp 709Due to a mistake in 710.St -p1003.2 , 711things like 712.Ql "a)b" 713are legal REs because 714.Ql )\& 715is 716a special character only in the presence of a previous unmatched 717.Ql (\& . 718This cannot be fixed until the spec is fixed. 719.Pp 720The standard's definition of back references is vague. 721For example, does 722.Ql "a\e(\e(b\e)*\e2\e)*d" 723match 724.Ql "abbbd" ? 725Until the standard is clarified, 726behavior in such cases should not be relied on. 727.Pp 728The implementation of word-boundary matching is a bit of a kludge, 729and bugs may lurk in combinations of word-boundary matching and anchoring. 730.Pp 731Word-boundary matching does not work properly in multibyte locales. 732