1.\" Copyright (c) 1992, 1993, 1994 Henry Spencer. 2.\" Copyright (c) 1992, 1993, 1994 3.\" The Regents of the University of California. All rights reserved. 4.\" 5.\" This code is derived from software contributed to Berkeley by 6.\" Henry Spencer. 7.\" 8.\" Redistribution and use in source and binary forms, with or without 9.\" modification, are permitted provided that the following conditions 10.\" are met: 11.\" 1. Redistributions of source code must retain the above copyright 12.\" notice, this list of conditions and the following disclaimer. 13.\" 2. Redistributions in binary form must reproduce the above copyright 14.\" notice, this list of conditions and the following disclaimer in the 15.\" documentation and/or other materials provided with the distribution. 16.\" 3. All advertising materials mentioning features or use of this software 17.\" must display the following acknowledgement: 18.\" This product includes software developed by the University of 19.\" California, Berkeley and its contributors. 20.\" 4. Neither the name of the University nor the names of its contributors 21.\" may be used to endorse or promote products derived from this software 22.\" without specific prior written permission. 23.\" 24.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND 25.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 26.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE 27.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE 28.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 29.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS 30.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) 31.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT 32.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY 33.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF 34.\" SUCH DAMAGE. 35.\" 36.\" @(#)regex.3 8.4 (Berkeley) 3/20/94 37.\" $FreeBSD$ 38.\" 39.Dd October 2, 2002 40.Dt REGEX 3 41.Os 42.Sh NAME 43.Nm regcomp , 44.Nm regexec , 45.Nm regerror , 46.Nm regfree 47.Nd regular-expression library 48.Sh LIBRARY 49.Lb libc 50.Sh SYNOPSIS 51.In regex.h 52.Ft int 53.Fo regcomp 54.Fa "regex_t * restrict preg" "const char * restrict pattern" "int cflags" 55.Fc 56.Ft int 57.Fo regexec 58.Fa "const regex_t * restrict preg" "const char * restrict string" 59.Fa "size_t nmatch" "regmatch_t pmatch[restrict]" "int eflags" 60.Fc 61.Ft size_t 62.Fo regerror 63.Fa "int errcode" "const regex_t * restrict preg" 64.Fa "char * restrict errbuf" "size_t errbuf_size" 65.Fc 66.Ft void 67.Fn regfree "regex_t *preg" 68.Sh DESCRIPTION 69These routines implement 70.St -p1003.2 71regular expressions 72.Pq Do RE Dc Ns s ; 73see 74.Xr re_format 7 . 75.Fn Regcomp 76compiles an RE written as a string into an internal form, 77.Fn regexec 78matches that internal form against a string and reports results, 79.Fn regerror 80transforms error codes from either into human-readable messages, 81and 82.Fn regfree 83frees any dynamically-allocated storage used by the internal form 84of an RE. 85.Pp 86The header 87.Aq Pa regex.h 88declares two structure types, 89.Ft regex_t 90and 91.Ft regmatch_t , 92the former for compiled internal forms and the latter for match reporting. 93It also declares the four functions, 94a type 95.Ft regoff_t , 96and a number of constants with names starting with 97.Dq Dv REG_ . 98.Pp 99.Fn Regcomp 100compiles the regular expression contained in the 101.Fa pattern 102string, 103subject to the flags in 104.Fa cflags , 105and places the results in the 106.Ft regex_t 107structure pointed to by 108.Fa preg . 109.Fa Cflags 110is the bitwise OR of zero or more of the following flags: 111.Bl -tag -width REG_EXTENDED 112.It Dv REG_EXTENDED 113Compile modern 114.Pq Dq extended 115REs, 116rather than the obsolete 117.Pq Dq basic 118REs that 119are the default. 120.It Dv REG_BASIC 121This is a synonym for 0, 122provided as a counterpart to 123.Dv REG_EXTENDED 124to improve readability. 125.It Dv REG_NOSPEC 126Compile with recognition of all special characters turned off. 127All characters are thus considered ordinary, 128so the 129.Dq RE 130is a literal string. 131This is an extension, 132compatible with but not specified by 133.St -p1003.2 , 134and should be used with 135caution in software intended to be portable to other systems. 136.Dv REG_EXTENDED 137and 138.Dv REG_NOSPEC 139may not be used 140in the same call to 141.Fn regcomp . 142.It Dv REG_ICASE 143Compile for matching that ignores upper/lower case distinctions. 144See 145.Xr re_format 7 . 146.It Dv REG_NOSUB 147Compile for matching that need only report success or failure, 148not what was matched. 149.It Dv REG_NEWLINE 150Compile for newline-sensitive matching. 151By default, newline is a completely ordinary character with no special 152meaning in either REs or strings. 153With this flag, 154.Ql [^ 155bracket expressions and 156.Ql .\& 157never match newline, 158a 159.Ql ^\& 160anchor matches the null string after any newline in the string 161in addition to its normal function, 162and the 163.Ql $\& 164anchor matches the null string before any newline in the 165string in addition to its normal function. 166.It Dv REG_PEND 167The regular expression ends, 168not at the first NUL, 169but just before the character pointed to by the 170.Va re_endp 171member of the structure pointed to by 172.Fa preg . 173The 174.Va re_endp 175member is of type 176.Ft "const char *" . 177This flag permits inclusion of NULs in the RE; 178they are considered ordinary characters. 179This is an extension, 180compatible with but not specified by 181.St -p1003.2 , 182and should be used with 183caution in software intended to be portable to other systems. 184.El 185.Pp 186When successful, 187.Fn regcomp 188returns 0 and fills in the structure pointed to by 189.Fa preg . 190One member of that structure 191(other than 192.Va re_endp ) 193is publicized: 194.Va re_nsub , 195of type 196.Ft size_t , 197contains the number of parenthesized subexpressions within the RE 198(except that the value of this member is undefined if the 199.Dv REG_NOSUB 200flag was used). 201If 202.Fn regcomp 203fails, it returns a non-zero error code; 204see 205.Sx DIAGNOSTICS . 206.Pp 207.Fn Regexec 208matches the compiled RE pointed to by 209.Fa preg 210against the 211.Fa string , 212subject to the flags in 213.Fa eflags , 214and reports results using 215.Fa nmatch , 216.Fa pmatch , 217and the returned value. 218The RE must have been compiled by a previous invocation of 219.Fn regcomp . 220The compiled form is not altered during execution of 221.Fn regexec , 222so a single compiled RE can be used simultaneously by multiple threads. 223.Pp 224By default, 225the NUL-terminated string pointed to by 226.Fa string 227is considered to be the text of an entire line, minus any terminating 228newline. 229The 230.Fa eflags 231argument is the bitwise OR of zero or more of the following flags: 232.Bl -tag -width REG_STARTEND 233.It Dv REG_NOTBOL 234The first character of 235the string 236is not the beginning of a line, so the 237.Ql ^\& 238anchor should not match before it. 239This does not affect the behavior of newlines under 240.Dv REG_NEWLINE . 241.It Dv REG_NOTEOL 242The NUL terminating 243the string 244does not end a line, so the 245.Ql $\& 246anchor should not match before it. 247This does not affect the behavior of newlines under 248.Dv REG_NEWLINE . 249.It Dv REG_STARTEND 250The string is considered to start at 251.Fa string 252+ 253.Fa pmatch Ns [0]. Ns Va rm_so 254and to have a terminating NUL located at 255.Fa string 256+ 257.Fa pmatch Ns [0]. Ns Va rm_eo 258(there need not actually be a NUL at that location), 259regardless of the value of 260.Fa nmatch . 261See below for the definition of 262.Fa pmatch 263and 264.Fa nmatch . 265This is an extension, 266compatible with but not specified by 267.St -p1003.2 , 268and should be used with 269caution in software intended to be portable to other systems. 270Note that a non-zero 271.Va rm_so 272does not imply 273.Dv REG_NOTBOL ; 274.Dv REG_STARTEND 275affects only the location of the string, 276not how it is matched. 277.El 278.Pp 279See 280.Xr re_format 7 281for a discussion of what is matched in situations where an RE or a 282portion thereof could match any of several substrings of 283.Fa string . 284.Pp 285Normally, 286.Fn regexec 287returns 0 for success and the non-zero code 288.Dv REG_NOMATCH 289for failure. 290Other non-zero error codes may be returned in exceptional situations; 291see 292.Sx DIAGNOSTICS . 293.Pp 294If 295.Dv REG_NOSUB 296was specified in the compilation of the RE, 297or if 298.Fa nmatch 299is 0, 300.Fn regexec 301ignores the 302.Fa pmatch 303argument (but see below for the case where 304.Dv REG_STARTEND 305is specified). 306Otherwise, 307.Fa pmatch 308points to an array of 309.Fa nmatch 310structures of type 311.Ft regmatch_t . 312Such a structure has at least the members 313.Va rm_so 314and 315.Va rm_eo , 316both of type 317.Ft regoff_t 318(a signed arithmetic type at least as large as an 319.Ft off_t 320and a 321.Ft ssize_t ) , 322containing respectively the offset of the first character of a substring 323and the offset of the first character after the end of the substring. 324Offsets are measured from the beginning of the 325.Fa string 326argument given to 327.Fn regexec . 328An empty substring is denoted by equal offsets, 329both indicating the character following the empty substring. 330.Pp 331The 0th member of the 332.Fa pmatch 333array is filled in to indicate what substring of 334.Fa string 335was matched by the entire RE. 336Remaining members report what substring was matched by parenthesized 337subexpressions within the RE; 338member 339.Va i 340reports subexpression 341.Va i , 342with subexpressions counted (starting at 1) by the order of their opening 343parentheses in the RE, left to right. 344Unused entries in the array (corresponding either to subexpressions that 345did not participate in the match at all, or to subexpressions that do not 346exist in the RE (that is, 347.Va i 348> 349.Fa preg Ns -> Ns Va re_nsub ) ) 350have both 351.Va rm_so 352and 353.Va rm_eo 354set to -1. 355If a subexpression participated in the match several times, 356the reported substring is the last one it matched. 357(Note, as an example in particular, that when the RE 358.Ql "(b*)+" 359matches 360.Ql bbb , 361the parenthesized subexpression matches each of the three 362.So Li b Sc Ns s 363and then 364an infinite number of empty strings following the last 365.Ql b , 366so the reported substring is one of the empties.) 367.Pp 368If 369.Dv REG_STARTEND 370is specified, 371.Fa pmatch 372must point to at least one 373.Ft regmatch_t 374(even if 375.Fa nmatch 376is 0 or 377.Dv REG_NOSUB 378was specified), 379to hold the input offsets for 380.Dv REG_STARTEND . 381Use for output is still entirely controlled by 382.Fa nmatch ; 383if 384.Fa nmatch 385is 0 or 386.Dv REG_NOSUB 387was specified, 388the value of 389.Fa pmatch Ns [0] 390will not be changed by a successful 391.Fn regexec . 392.Pp 393.Fn Regerror 394maps a non-zero 395.Fa errcode 396from either 397.Fn regcomp 398or 399.Fn regexec 400to a human-readable, printable message. 401If 402.Fa preg 403is 404.No non\- Ns Dv NULL , 405the error code should have arisen from use of 406the 407.Ft regex_t 408pointed to by 409.Fa preg , 410and if the error code came from 411.Fn regcomp , 412it should have been the result from the most recent 413.Fn regcomp 414using that 415.Ft regex_t . 416.No ( Fn Regerror 417may be able to supply a more detailed message using information 418from the 419.Ft regex_t . ) 420.Fn Regerror 421places the NUL-terminated message into the buffer pointed to by 422.Fa errbuf , 423limiting the length (including the NUL) to at most 424.Fa errbuf_size 425bytes. 426If the whole message won't fit, 427as much of it as will fit before the terminating NUL is supplied. 428In any case, 429the returned value is the size of buffer needed to hold the whole 430message (including terminating NUL). 431If 432.Fa errbuf_size 433is 0, 434.Fa errbuf 435is ignored but the return value is still correct. 436.Pp 437If the 438.Fa errcode 439given to 440.Fn regerror 441is first ORed with 442.Dv REG_ITOA , 443the 444.Dq message 445that results is the printable name of the error code, 446e.g.\& 447.Dq Dv REG_NOMATCH , 448rather than an explanation thereof. 449If 450.Fa errcode 451is 452.Dv REG_ATOI , 453then 454.Fa preg 455shall be 456.No non\- Ns Dv NULL 457and the 458.Va re_endp 459member of the structure it points to 460must point to the printable name of an error code; 461in this case, the result in 462.Fa errbuf 463is the decimal digits of 464the numeric value of the error code 465(0 if the name is not recognized). 466.Dv REG_ITOA 467and 468.Dv REG_ATOI 469are intended primarily as debugging facilities; 470they are extensions, 471compatible with but not specified by 472.St -p1003.2 , 473and should be used with 474caution in software intended to be portable to other systems. 475Be warned also that they are considered experimental and changes are possible. 476.Pp 477.Fn Regfree 478frees any dynamically-allocated storage associated with the compiled RE 479pointed to by 480.Fa preg . 481The remaining 482.Ft regex_t 483is no longer a valid compiled RE 484and the effect of supplying it to 485.Fn regexec 486or 487.Fn regerror 488is undefined. 489.Pp 490None of these functions references global variables except for tables 491of constants; 492all are safe for use from multiple threads if the arguments are safe. 493.Sh IMPLEMENTATION CHOICES 494There are a number of decisions that 495.St -p1003.2 496leaves up to the implementor, 497either by explicitly saying 498.Dq undefined 499or by virtue of them being 500forbidden by the RE grammar. 501This implementation treats them as follows. 502.Pp 503See 504.Xr re_format 7 505for a discussion of the definition of case-independent matching. 506.Pp 507There is no particular limit on the length of REs, 508except insofar as memory is limited. 509Memory usage is approximately linear in RE size, and largely insensitive 510to RE complexity, except for bounded repetitions. 511See 512.Sx BUGS 513for one short RE using them 514that will run almost any system out of memory. 515.Pp 516A backslashed character other than one specifically given a magic meaning 517by 518.St -p1003.2 519(such magic meanings occur only in obsolete 520.Bq Dq basic 521REs) 522is taken as an ordinary character. 523.Pp 524Any unmatched 525.Ql [\& 526is a 527.Dv REG_EBRACK 528error. 529.Pp 530Equivalence classes cannot begin or end bracket-expression ranges. 531The endpoint of one range cannot begin another. 532.Pp 533.Dv RE_DUP_MAX , 534the limit on repetition counts in bounded repetitions, is 255. 535.Pp 536A repetition operator 537.Ql ( ?\& , 538.Ql *\& , 539.Ql +\& , 540or bounds) 541cannot follow another 542repetition operator. 543A repetition operator cannot begin an expression or subexpression 544or follow 545.Ql ^\& 546or 547.Ql |\& . 548.Pp 549.Ql |\& 550cannot appear first or last in a (sub)expression or after another 551.Ql |\& , 552i.e. an operand of 553.Ql |\& 554cannot be an empty subexpression. 555An empty parenthesized subexpression, 556.Ql "()" , 557is legal and matches an 558empty (sub)string. 559An empty string is not a legal RE. 560.Pp 561A 562.Ql {\& 563followed by a digit is considered the beginning of bounds for a 564bounded repetition, which must then follow the syntax for bounds. 565A 566.Ql {\& 567.Em not 568followed by a digit is considered an ordinary character. 569.Pp 570.Ql ^\& 571and 572.Ql $\& 573beginning and ending subexpressions in obsolete 574.Pq Dq basic 575REs are anchors, not ordinary characters. 576.Sh SEE ALSO 577.Xr grep 1 , 578.Xr re_format 7 579.Pp 580.St -p1003.2 , 581sections 2.8 (Regular Expression Notation) 582and 583B.5 (C Binding for Regular Expression Matching). 584.Sh DIAGNOSTICS 585Non-zero error codes from 586.Fn regcomp 587and 588.Fn regexec 589include the following: 590.Pp 591.Bl -tag -width REG_ECOLLATE -compact 592.It Dv REG_NOMATCH 593.Fn regexec 594failed to match 595.It Dv REG_BADPAT 596invalid regular expression 597.It Dv REG_ECOLLATE 598invalid collating element 599.It Dv REG_ECTYPE 600invalid character class 601.It Dv REG_EESCAPE 602.Ql \e 603applied to unescapable character 604.It Dv REG_ESUBREG 605invalid backreference number 606.It Dv REG_EBRACK 607brackets 608.Ql "[ ]" 609not balanced 610.It Dv REG_EPAREN 611parentheses 612.Ql "( )" 613not balanced 614.It Dv REG_EBRACE 615braces 616.Ql "{ }" 617not balanced 618.It Dv REG_BADBR 619invalid repetition count(s) in 620.Ql "{ }" 621.It Dv REG_ERANGE 622invalid character range in 623.Ql "[ ]" 624.It Dv REG_ESPACE 625ran out of memory 626.It Dv REG_BADRPT 627.Ql ?\& , 628.Ql *\& , 629or 630.Ql +\& 631operand invalid 632.It Dv REG_EMPTY 633empty (sub)expression 634.It Dv REG_ASSERT 635can't happen - you found a bug 636.It Dv REG_INVARG 637invalid argument, e.g. negative-length string 638.El 639.Sh HISTORY 640Originally written by 641.An Henry Spencer . 642Altered for inclusion in the 643.Bx 4.4 644distribution. 645.Sh BUGS 646This is an alpha release with known defects. 647Please report problems. 648.Pp 649The back-reference code is subtle and doubts linger about its correctness 650in complex cases. 651.Pp 652.Fn Regexec 653performance is poor. 654This will improve with later releases. 655.Fa Nmatch 656exceeding 0 is expensive; 657.Fa nmatch 658exceeding 1 is worse. 659.Fn Regexec 660is largely insensitive to RE complexity 661.Em except 662that back 663references are massively expensive. 664RE length does matter; in particular, there is a strong speed bonus 665for keeping RE length under about 30 characters, 666with most special characters counting roughly double. 667.Pp 668.Fn Regcomp 669implements bounded repetitions by macro expansion, 670which is costly in time and space if counts are large 671or bounded repetitions are nested. 672An RE like, say, 673.Ql "((((a{1,100}){1,100}){1,100}){1,100}){1,100}" 674will (eventually) run almost any existing machine out of swap space. 675.Pp 676There are suspected problems with response to obscure error conditions. 677Notably, 678certain kinds of internal overflow, 679produced only by truly enormous REs or by multiply nested bounded repetitions, 680are probably not handled well. 681.Pp 682Due to a mistake in 683.St -p1003.2 , 684things like 685.Ql "a)b" 686are legal REs because 687.Ql )\& 688is 689a special character only in the presence of a previous unmatched 690.Ql (\& . 691This can't be fixed until the spec is fixed. 692.Pp 693The standard's definition of back references is vague. 694For example, does 695.Ql "a\e(\e(b\e)*\e2\e)*d" 696match 697.Ql "abbbd" ? 698Until the standard is clarified, 699behavior in such cases should not be relied on. 700.Pp 701The implementation of word-boundary matching is a bit of a kludge, 702and bugs may lurk in combinations of word-boundary matching and anchoring. 703