1.\" Copyright (c) 1992, 1993, 1994 Henry Spencer. 2.\" Copyright (c) 1992, 1993, 1994 3.\" The Regents of the University of California. All rights reserved. 4.\" 5.\" This code is derived from software contributed to Berkeley by 6.\" Henry Spencer. 7.\" 8.\" Redistribution and use in source and binary forms, with or without 9.\" modification, are permitted provided that the following conditions 10.\" are met: 11.\" 1. Redistributions of source code must retain the above copyright 12.\" notice, this list of conditions and the following disclaimer. 13.\" 2. Redistributions in binary form must reproduce the above copyright 14.\" notice, this list of conditions and the following disclaimer in the 15.\" documentation and/or other materials provided with the distribution. 16.\" 3. All advertising materials mentioning features or use of this software 17.\" must display the following acknowledgement: 18.\" This product includes software developed by the University of 19.\" California, Berkeley and its contributors. 20.\" 4. Neither the name of the University nor the names of its contributors 21.\" may be used to endorse or promote products derived from this software 22.\" without specific prior written permission. 23.\" 24.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND 25.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 26.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE 27.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE 28.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 29.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS 30.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) 31.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT 32.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY 33.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF 34.\" SUCH DAMAGE. 35.\" 36.\" @(#)regex.3 8.4 (Berkeley) 3/20/94 37.\" $FreeBSD$ 38.\" 39.Dd March 20, 1994 40.Dt REGEX 3 41.Os 42.Sh NAME 43.Nm regcomp , 44.Nm regexec , 45.Nm regerror , 46.Nm regfree 47.Nd regular-expression library 48.Sh LIBRARY 49.Lb libc 50.Sh SYNOPSIS 51.Fd "#include <sys/types.h>" 52.Fd "#include <regex.h>" 53.Ft int 54.Fn regcomp "regex_t *preg" "const char *pattern" "int cflags" 55.Ft int 56.Fo regexec 57.Fa "const regex_t *preg" "const char *string" 58.Fa "size_t nmatch" "regmatch_t pmatch[]" "int eflags" 59.Fc 60.Ft size_t 61.Fo regerror 62.Fa "int errcode" "const regex_t *preg" 63.Fa "char *errbuf" "size_t errbuf_size" 64.Fc 65.Ft void 66.Fn regfree "regex_t *preg" 67.Sh DESCRIPTION 68These routines implement 69.St -p1003.2 70regular expressions 71.Pq Do RE Dc Ns s ; 72see 73.Xr re_format 7 . 74.Fn Regcomp 75compiles an RE written as a string into an internal form, 76.Fn regexec 77matches that internal form against a string and reports results, 78.Fn regerror 79transforms error codes from either into human-readable messages, 80and 81.Fn regfree 82frees any dynamically-allocated storage used by the internal form 83of an RE. 84.Pp 85The header 86.Aq Pa regex.h 87declares two structure types, 88.Ft regex_t 89and 90.Ft regmatch_t , 91the former for compiled internal forms and the latter for match reporting. 92It also declares the four functions, 93a type 94.Ft regoff_t , 95and a number of constants with names starting with 96.Dq Dv REG_ . 97.Pp 98.Fn Regcomp 99compiles the regular expression contained in the 100.Fa pattern 101string, 102subject to the flags in 103.Fa cflags , 104and places the results in the 105.Ft regex_t 106structure pointed to by 107.Fa preg . 108.Fa Cflags 109is the bitwise OR of zero or more of the following flags: 110.Bl -tag -width REG_EXTENDED 111.It Dv REG_EXTENDED 112Compile modern 113.Pq Dq extended 114REs, 115rather than the obsolete 116.Pq Dq basic 117REs that 118are the default. 119.It Dv REG_BASIC 120This is a synonym for 0, 121provided as a counterpart to 122.Dv REG_EXTENDED 123to improve readability. 124.It Dv REG_NOSPEC 125Compile with recognition of all special characters turned off. 126All characters are thus considered ordinary, 127so the 128.Dq RE 129is a literal string. 130This is an extension, 131compatible with but not specified by 132.St -p1003.2 , 133and should be used with 134caution in software intended to be portable to other systems. 135.Dv REG_EXTENDED 136and 137.Dv REG_NOSPEC 138may not be used 139in the same call to 140.Fn regcomp . 141.It Dv REG_ICASE 142Compile for matching that ignores upper/lower case distinctions. 143See 144.Xr re_format 7 . 145.It Dv REG_NOSUB 146Compile for matching that need only report success or failure, 147not what was matched. 148.It Dv REG_NEWLINE 149Compile for newline-sensitive matching. 150By default, newline is a completely ordinary character with no special 151meaning in either REs or strings. 152With this flag, 153.Ql [^ 154bracket expressions and 155.Ql .\& 156never match newline, 157a 158.Ql ^\& 159anchor matches the null string after any newline in the string 160in addition to its normal function, 161and the 162.Ql $\& 163anchor matches the null string before any newline in the 164string in addition to its normal function. 165.It Dv REG_PEND 166The regular expression ends, 167not at the first NUL, 168but just before the character pointed to by the 169.Va re_endp 170member of the structure pointed to by 171.Fa preg . 172The 173.Va re_endp 174member is of type 175.Ft "const char *" . 176This flag permits inclusion of NULs in the RE; 177they are considered ordinary characters. 178This is an extension, 179compatible with but not specified by 180.St -p1003.2 , 181and should be used with 182caution in software intended to be portable to other systems. 183.El 184.Pp 185When successful, 186.Fn regcomp 187returns 0 and fills in the structure pointed to by 188.Fa preg . 189One member of that structure 190(other than 191.Va re_endp ) 192is publicized: 193.Va re_nsub , 194of type 195.Ft size_t , 196contains the number of parenthesized subexpressions within the RE 197(except that the value of this member is undefined if the 198.Dv REG_NOSUB 199flag was used). 200If 201.Fn regcomp 202fails, it returns a non-zero error code; 203see 204.Sx DIAGNOSTICS . 205.Pp 206.Fn Regexec 207matches the compiled RE pointed to by 208.Fa preg 209against the 210.Fa string , 211subject to the flags in 212.Fa eflags , 213and reports results using 214.Fa nmatch , 215.Fa pmatch , 216and the returned value. 217The RE must have been compiled by a previous invocation of 218.Fn regcomp . 219The compiled form is not altered during execution of 220.Fn regexec , 221so a single compiled RE can be used simultaneously by multiple threads. 222.Pp 223By default, 224the NUL-terminated string pointed to by 225.Fa string 226is considered to be the text of an entire line, minus any terminating 227newline. 228The 229.Fa eflags 230argument is the bitwise OR of zero or more of the following flags: 231.Bl -tag -width REG_STARTEND 232.It Dv REG_NOTBOL 233The first character of 234the string 235is not the beginning of a line, so the 236.Ql ^\& 237anchor should not match before it. 238This does not affect the behavior of newlines under 239.Dv REG_NEWLINE . 240.It Dv REG_NOTEOL 241The NUL terminating 242the string 243does not end a line, so the 244.Ql $\& 245anchor should not match before it. 246This does not affect the behavior of newlines under 247.Dv REG_NEWLINE . 248.It Dv REG_STARTEND 249The string is considered to start at 250.Fa string 251+ 252.Fa pmatch Ns [0]. Ns Va rm_so 253and to have a terminating NUL located at 254.Fa string 255+ 256.Fa pmatch Ns [0]. Ns Va rm_eo 257(there need not actually be a NUL at that location), 258regardless of the value of 259.Fa nmatch . 260See below for the definition of 261.Fa pmatch 262and 263.Fa nmatch . 264This is an extension, 265compatible with but not specified by 266.St -p1003.2 , 267and should be used with 268caution in software intended to be portable to other systems. 269Note that a non-zero 270.Va rm_so 271does not imply 272.Dv REG_NOTBOL ; 273.Dv REG_STARTEND 274affects only the location of the string, 275not how it is matched. 276.El 277.Pp 278See 279.Xr re_format 7 280for a discussion of what is matched in situations where an RE or a 281portion thereof could match any of several substrings of 282.Fa string . 283.Pp 284Normally, 285.Fn regexec 286returns 0 for success and the non-zero code 287.Dv REG_NOMATCH 288for failure. 289Other non-zero error codes may be returned in exceptional situations; 290see 291.Sx DIAGNOSTICS . 292.Pp 293If 294.Dv REG_NOSUB 295was specified in the compilation of the RE, 296or if 297.Fa nmatch 298is 0, 299.Fn regexec 300ignores the 301.Fa pmatch 302argument (but see below for the case where 303.Dv REG_STARTEND 304is specified). 305Otherwise, 306.Fa pmatch 307points to an array of 308.Fa nmatch 309structures of type 310.Ft regmatch_t . 311Such a structure has at least the members 312.Va rm_so 313and 314.Va rm_eo , 315both of type 316.Ft regoff_t 317(a signed arithmetic type at least as large as an 318.Ft off_t 319and a 320.Ft ssize_t ) , 321containing respectively the offset of the first character of a substring 322and the offset of the first character after the end of the substring. 323Offsets are measured from the beginning of the 324.Fa string 325argument given to 326.Fn regexec . 327An empty substring is denoted by equal offsets, 328both indicating the character following the empty substring. 329.Pp 330The 0th member of the 331.Fa pmatch 332array is filled in to indicate what substring of 333.Fa string 334was matched by the entire RE. 335Remaining members report what substring was matched by parenthesized 336subexpressions within the RE; 337member 338.Va i 339reports subexpression 340.Va i , 341with subexpressions counted (starting at 1) by the order of their opening 342parentheses in the RE, left to right. 343Unused entries in the array (corresponding either to subexpressions that 344did not participate in the match at all, or to subexpressions that do not 345exist in the RE (that is, 346.Va i 347> 348.Fa preg Ns -> Ns Va re_nsub ) ) 349have both 350.Va rm_so 351and 352.Va rm_eo 353set to -1. 354If a subexpression participated in the match several times, 355the reported substring is the last one it matched. 356(Note, as an example in particular, that when the RE 357.Ql "(b*)+" 358matches 359.Ql bbb , 360the parenthesized subexpression matches each of the three 361.So Li b Sc Ns s 362and then 363an infinite number of empty strings following the last 364.Ql b , 365so the reported substring is one of the empties.) 366.Pp 367If 368.Dv REG_STARTEND 369is specified, 370.Fa pmatch 371must point to at least one 372.Ft regmatch_t 373(even if 374.Fa nmatch 375is 0 or 376.Dv REG_NOSUB 377was specified), 378to hold the input offsets for 379.Dv REG_STARTEND . 380Use for output is still entirely controlled by 381.Fa nmatch ; 382if 383.Fa nmatch 384is 0 or 385.Dv REG_NOSUB 386was specified, 387the value of 388.Fa pmatch Ns [0] 389will not be changed by a successful 390.Fn regexec . 391.Pp 392.Fn Regerror 393maps a non-zero 394.Fa errcode 395from either 396.Fn regcomp 397or 398.Fn regexec 399to a human-readable, printable message. 400If 401.Fa preg 402is 403.No non\- Ns Dv NULL , 404the error code should have arisen from use of 405the 406.Ft regex_t 407pointed to by 408.Fa preg , 409and if the error code came from 410.Fn regcomp , 411it should have been the result from the most recent 412.Fn regcomp 413using that 414.Ft regex_t . 415.No ( Fn Regerror 416may be able to supply a more detailed message using information 417from the 418.Ft regex_t . ) 419.Fn Regerror 420places the NUL-terminated message into the buffer pointed to by 421.Fa errbuf , 422limiting the length (including the NUL) to at most 423.Fa errbuf_size 424bytes. 425If the whole message won't fit, 426as much of it as will fit before the terminating NUL is supplied. 427In any case, 428the returned value is the size of buffer needed to hold the whole 429message (including terminating NUL). 430If 431.Fa errbuf_size 432is 0, 433.Fa errbuf 434is ignored but the return value is still correct. 435.Pp 436If the 437.Fa errcode 438given to 439.Fn regerror 440is first ORed with 441.Dv REG_ITOA , 442the 443.Dq message 444that results is the printable name of the error code, 445e.g. 446.Dq Dv REG_NOMATCH , 447rather than an explanation thereof. 448If 449.Fa errcode 450is 451.Dv REG_ATOI , 452then 453.Fa preg 454shall be 455.No non\- Ns Dv NULL 456and the 457.Va re_endp 458member of the structure it points to 459must point to the printable name of an error code; 460in this case, the result in 461.Fa errbuf 462is the decimal digits of 463the numeric value of the error code 464(0 if the name is not recognized). 465.Dv REG_ITOA 466and 467.Dv REG_ATOI 468are intended primarily as debugging facilities; 469they are extensions, 470compatible with but not specified by 471.St -p1003.2 , 472and should be used with 473caution in software intended to be portable to other systems. 474Be warned also that they are considered experimental and changes are possible. 475.Pp 476.Fn Regfree 477frees any dynamically-allocated storage associated with the compiled RE 478pointed to by 479.Fa preg . 480The remaining 481.Ft regex_t 482is no longer a valid compiled RE 483and the effect of supplying it to 484.Fn regexec 485or 486.Fn regerror 487is undefined. 488.Pp 489None of these functions references global variables except for tables 490of constants; 491all are safe for use from multiple threads if the arguments are safe. 492.Sh IMPLEMENTATION CHOICES 493There are a number of decisions that 494.St -p1003.2 495leaves up to the implementor, 496either by explicitly saying 497.Dq undefined 498or by virtue of them being 499forbidden by the RE grammar. 500This implementation treats them as follows. 501.Pp 502See 503.Xr re_format 7 504for a discussion of the definition of case-independent matching. 505.Pp 506There is no particular limit on the length of REs, 507except insofar as memory is limited. 508Memory usage is approximately linear in RE size, and largely insensitive 509to RE complexity, except for bounded repetitions. 510See 511.Sx BUGS 512for one short RE using them 513that will run almost any system out of memory. 514.Pp 515A backslashed character other than one specifically given a magic meaning 516by 517.St -p1003.2 518(such magic meanings occur only in obsolete 519.Bq Dq basic 520REs) 521is taken as an ordinary character. 522.Pp 523Any unmatched 524.Ql [\& 525is a 526.Dv REG_EBRACK 527error. 528.Pp 529Equivalence classes cannot begin or end bracket-expression ranges. 530The endpoint of one range cannot begin another. 531.Pp 532.Dv RE_DUP_MAX , 533the limit on repetition counts in bounded repetitions, is 255. 534.Pp 535A repetition operator 536.Po 537.Ql ?\& , 538.Ql *\& , 539.Ql +\& , 540or bounds 541.Pc 542cannot follow another 543repetition operator. 544A repetition operator cannot begin an expression or subexpression 545or follow 546.Ql ^\& 547or 548.Ql |\& . 549.Pp 550.Ql |\& 551cannot appear first or last in a (sub)expression or after another 552.Ql |\& , 553i.e. an operand of 554.Ql |\& 555cannot be an empty subexpression. 556An empty parenthesized subexpression, 557.Ql "()" , 558is legal and matches an 559empty (sub)string. 560An empty string is not a legal RE. 561.Pp 562A 563.Ql {\& 564followed by a digit is considered the beginning of bounds for a 565bounded repetition, which must then follow the syntax for bounds. 566A 567.Ql {\& 568.Em not 569followed by a digit is considered an ordinary character. 570.Pp 571.Ql ^\& 572and 573.Ql $\& 574beginning and ending subexpressions in obsolete 575.Pq Dq basic 576REs are anchors, not ordinary characters. 577.Sh SEE ALSO 578.Xr grep 1 , 579.Xr re_format 7 580.Pp 581.St -p1003.2 , 582sections 2.8 (Regular Expression Notation) 583and 584B.5 (C Binding for Regular Expression Matching). 585.Sh DIAGNOSTICS 586Non-zero error codes from 587.Fn regcomp 588and 589.Fn regexec 590include the following: 591.Pp 592.Bl -tag -width REG_ECOLLATE -compact 593.It Dv REG_NOMATCH 594.Fn regexec 595failed to match 596.It Dv REG_BADPAT 597invalid regular expression 598.It Dv REG_ECOLLATE 599invalid collating element 600.It Dv REG_ECTYPE 601invalid character class 602.It Dv REG_EESCAPE 603.Ql \e 604applied to unescapable character 605.It Dv REG_ESUBREG 606invalid backreference number 607.It Dv REG_EBRACK 608brackets 609.Ql "[ ]" 610not balanced 611.It Dv REG_EPAREN 612parentheses 613.Ql "( )" 614not balanced 615.It Dv REG_EBRACE 616braces 617.Ql "{ }" 618not balanced 619.It Dv REG_BADBR 620invalid repetition count(s) in 621.Ql "{ }" 622.It Dv REG_ERANGE 623invalid character range in 624.Ql "[ ]" 625.It Dv REG_ESPACE 626ran out of memory 627.It Dv REG_BADRPT 628.Ql ?\& , 629.Ql *\& , 630or 631.Ql +\& 632operand invalid 633.It Dv REG_EMPTY 634empty (sub)expression 635.It Dv REG_ASSERT 636can't happen - you found a bug 637.It Dv REG_INVARG 638invalid argument, e.g. negative-length string 639.El 640.Sh HISTORY 641Originally written by 642.An Henry Spencer . 643Altered for inclusion in the 644.Bx 4.4 645distribution. 646.Sh BUGS 647This is an alpha release with known defects. 648Please report problems. 649.Pp 650The back-reference code is subtle and doubts linger about its correctness 651in complex cases. 652.Pp 653.Fn Regexec 654performance is poor. 655This will improve with later releases. 656.Fa Nmatch 657exceeding 0 is expensive; 658.Fa nmatch 659exceeding 1 is worse. 660.Fn Regexec 661is largely insensitive to RE complexity 662.Em except 663that back 664references are massively expensive. 665RE length does matter; in particular, there is a strong speed bonus 666for keeping RE length under about 30 characters, 667with most special characters counting roughly double. 668.Pp 669.Fn Regcomp 670implements bounded repetitions by macro expansion, 671which is costly in time and space if counts are large 672or bounded repetitions are nested. 673An RE like, say, 674.Ql "((((a{1,100}){1,100}){1,100}){1,100}){1,100}" 675will (eventually) run almost any existing machine out of swap space. 676.Pp 677There are suspected problems with response to obscure error conditions. 678Notably, 679certain kinds of internal overflow, 680produced only by truly enormous REs or by multiply nested bounded repetitions, 681are probably not handled well. 682.Pp 683Due to a mistake in 684.St -p1003.2 , 685things like 686.Ql "a)b" 687are legal REs because 688.Ql )\& 689is 690a special character only in the presence of a previous unmatched 691.Ql (\& . 692This can't be fixed until the spec is fixed. 693.Pp 694The standard's definition of back references is vague. 695For example, does 696.Ql "a\e(\e(b\e)*\e2\e)*d" 697match 698.Ql "abbbd" ? 699Until the standard is clarified, 700behavior in such cases should not be relied on. 701.Pp 702The implementation of word-boundary matching is a bit of a kludge, 703and bugs may lurk in combinations of word-boundary matching and anchoring. 704