.\" Copyright (c) 1992, 1993, 1994 Henry Spencer.
.\" Copyright (c) 1992, 1993, 1994
.\"	The Regents of the University of California.  All rights reserved.
.\"
.\" This code is derived from software contributed to Berkeley by
.\" Henry Spencer.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\" 4. Neither the name of the University nor the names of its contributors
.\"    may be used to endorse or promote products derived from this software
.\"    without specific prior written permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\"	@(#)regex.3	8.4 (Berkeley) 3/20/94
.\" $FreeBSD$
.\"
.Dd August 17, 2005
.Dt REGEX 3
.Os
.Sh NAME
.Nm regcomp ,
.Nm regexec ,
.Nm regerror ,
.Nm regfree
.Nd regular-expression library
.Sh LIBRARY
.Lb libc
.Sh SYNOPSIS
.In regex.h
.Ft int
.Fo regcomp
.Fa "regex_t * restrict preg" "const char * restrict pattern" "int cflags"
.Fc
.Ft int
.Fo regexec
.Fa "const regex_t * restrict preg" "const char * restrict string"
.Fa "size_t nmatch" "regmatch_t pmatch[restrict]" "int eflags"
.Fc
.Ft size_t
.Fo regerror
.Fa "int errcode" "const regex_t * restrict preg"
.Fa "char * restrict errbuf" "size_t errbuf_size"
.Fc
.Ft void
.Fn regfree "regex_t *preg"
.Sh DESCRIPTION
These routines implement
.St -p1003.2
regular expressions
.Pq Do RE Dc Ns s ;
see
.Xr re_format 7 .
The
.Fn regcomp
function
compiles an RE written as a string into an internal form,
.Fn regexec
matches that internal form against a string and reports results,
.Fn regerror
transforms error codes from either into human-readable messages,
and
.Fn regfree
frees any dynamically-allocated storage used by the internal form
of an RE.
.Pp
The header
.In regex.h
declares two structure types,
.Ft regex_t
and
.Ft regmatch_t ,
the former for compiled internal forms and the latter for match reporting.
It also declares the four functions,
a type
.Ft regoff_t ,
and a number of constants with names starting with
.Dq Dv REG_ .
.Pp
The
.Fn regcomp
function
compiles the regular expression contained in the
.Fa pattern
string,
subject to the flags in
.Fa cflags ,
and places the results in the
.Ft regex_t
structure pointed to by
.Fa preg .
The
.Fa cflags
argument
is the bitwise OR of zero or more of the following flags:
.Bl -tag -width REG_EXTENDED
.It Dv REG_EXTENDED
Compile modern
.Pq Dq extended
REs,
rather than the obsolete
.Pq Dq basic
REs that
are the default.
.It Dv REG_BASIC
This is a synonym for 0,
provided as a counterpart to
.Dv REG_EXTENDED
to improve readability.
.It Dv REG_NOSPEC
Compile with recognition of all special characters turned off.
All characters are thus considered ordinary,
so the
.Dq RE
is a literal string.
This is an extension,
compatible with but not specified by
.St -p1003.2 ,
and should be used with
caution in software intended to be portable to other systems.
.Dv REG_EXTENDED
and
.Dv REG_NOSPEC
may not be used
in the same call to
.Fn regcomp .
.It Dv REG_ICASE
Compile for matching that ignores upper/lower case distinctions.
See
.Xr re_format 7 .
.It Dv REG_NOSUB
Compile for matching that need only report success or failure,
not what was matched.
.It Dv REG_NEWLINE
Compile for newline-sensitive matching.
By default, newline is a completely ordinary character with no special
meaning in either REs or strings.
With this flag,
.Ql [^
bracket expressions and
.Ql .\&
never match newline,
a
.Ql ^\&
anchor matches the null string after any newline in the string
in addition to its normal function,
and the
.Ql $\&
anchor matches the null string before any newline in the
string in addition to its normal function.
.It Dv REG_PEND
The regular expression ends,
not at the first NUL,
but just before the character pointed to by the
.Va re_endp
member of the structure pointed to by
.Fa preg .
The
.Va re_endp
member is of type
.Ft "const char *" .
This flag permits inclusion of NULs in the RE;
they are considered ordinary characters.
This is an extension,
compatible with but not specified by
.St -p1003.2 ,
and should be used with
caution in software intended to be portable to other systems.
.El
.Pp
When successful,
.Fn regcomp
returns 0 and fills in the structure pointed to by
.Fa preg .
One member of that structure
(other than
.Va re_endp )
is publicized:
.Va re_nsub ,
of type
.Ft size_t ,
contains the number of parenthesized subexpressions within the RE
(except that the value of this member is undefined if the
.Dv REG_NOSUB
flag was used).
If
.Fn regcomp
fails, it returns a non-zero error code;
see
.Sx DIAGNOSTICS .
.Pp
The
.Fn regexec
function
matches the compiled RE pointed to by
.Fa preg
against the
.Fa string ,
subject to the flags in
.Fa eflags ,
and reports results using
.Fa nmatch ,
.Fa pmatch ,
and the returned value.
The RE must have been compiled by a previous invocation of
.Fn regcomp .
The compiled form is not altered during execution of
.Fn regexec ,
so a single compiled RE can be used simultaneously by multiple threads.
.Pp
By default,
the NUL-terminated string pointed to by
.Fa string
is considered to be the text of an entire line, minus any terminating
newline.
The
.Fa eflags
argument is the bitwise OR of zero or more of the following flags:
.Bl -tag -width REG_STARTEND
.It Dv REG_NOTBOL
The first character of
the string
is not the beginning of a line, so the
.Ql ^\&
anchor should not match before it.
This does not affect the behavior of newlines under
.Dv REG_NEWLINE .
.It Dv REG_NOTEOL
The NUL terminating
the string
does not end a line, so the
.Ql $\&
anchor should not match before it.
This does not affect the behavior of newlines under
.Dv REG_NEWLINE .
.It Dv REG_STARTEND
The string is considered to start at
.Fa string
+
.Fa pmatch Ns [0]. Ns Va rm_so
and to have a terminating NUL located at
.Fa string
+
.Fa pmatch Ns [0]. Ns Va rm_eo
(there need not actually be a NUL at that location),
regardless of the value of
.Fa nmatch .
See below for the definition of
.Fa pmatch
and
.Fa nmatch .
This is an extension,
compatible with but not specified by
.St -p1003.2 ,
and should be used with
caution in software intended to be portable to other systems.
Note that a non-zero
.Va rm_so
does not imply
.Dv REG_NOTBOL ;
.Dv REG_STARTEND
affects only the location of the string,
not how it is matched.
.El
.Pp
See
.Xr re_format 7
for a discussion of what is matched in situations where an RE or a
portion thereof could match any of several substrings of
.Fa string .
.Pp
Normally,
.Fn regexec
returns 0 for success and the non-zero code
.Dv REG_NOMATCH
for failure.
Other non-zero error codes may be returned in exceptional situations;
see
.Sx DIAGNOSTICS .
.Pp
If
.Dv REG_NOSUB
was specified in the compilation of the RE,
or if
.Fa nmatch
is 0,
.Fn regexec
ignores the
.Fa pmatch
argument (but see below for the case where
.Dv REG_STARTEND
is specified).
Otherwise,
.Fa pmatch
points to an array of
.Fa nmatch
structures of type
.Ft regmatch_t .
Such a structure has at least the members
.Va rm_so
and
.Va rm_eo ,
both of type
.Ft regoff_t
(a signed arithmetic type at least as large as an
.Ft off_t
and a
.Ft ssize_t ) ,
containing respectively the offset of the first character of a substring
and the offset of the first character after the end of the substring.
Offsets are measured from the beginning of the
.Fa string
argument given to
.Fn regexec .
An empty substring is denoted by equal offsets,
both indicating the character following the empty substring.
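.Pp
As a rough illustration of these offsets,
the sketch below copies the substring matched by the entire RE out of
.Fa string ,
using entry 0 of
.Fa pmatch .
The helper name
.Fn first_match ,
its buffer arguments, and its return convention are assumptions made for
this example; they are not part of the library.
.Bd -literal -offset indent
```c
#include <regex.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not part of the library: copy the first substring
 * of text matched by the extended RE pattern into out (at most outsz
 * bytes, NUL-terminated whenever outsz > 0).
 * Returns 0 on a match, otherwise the non-zero regcomp() or regexec() code.
 */
int
first_match(const char *pattern, const char *text, char *out, size_t outsz)
{
	regex_t re;
	regmatch_t pm[1];	/* pm[0] reports the match for the whole RE */
	size_t len;
	int rc;

	rc = regcomp(&re, pattern, REG_EXTENDED);
	if (rc != 0)
		return (rc);	/* assumed: nothing to free after a failure */
	rc = regexec(&re, text, 1, pm, 0);
	if (rc == 0 && outsz > 0) {
		/* rm_eo is the offset just past the match. */
		len = (size_t)(pm[0].rm_eo - pm[0].rm_so);
		if (len >= outsz)
			len = outsz - 1;	/* truncate to fit the buffer */
		memcpy(out, text + pm[0].rm_so, len);
		out[len] = 0;
	}
	regfree(&re);
	return (rc);
}
```
.Ed
.Pp
For instance, calling this sketch with the pattern
.Ql [0-9]+
and the string
.Ql "id 42-abc"
would be expected to leave
.Ql 42
in the output buffer.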
.Pp
The 0th member of the
.Fa pmatch
array is filled in to indicate what substring of
.Fa string
was matched by the entire RE.
Remaining members report what substring was matched by parenthesized
subexpressions within the RE;
member
.Va i
reports subexpression
.Va i ,
with subexpressions counted (starting at 1) by the order of their opening
parentheses in the RE, left to right.
Unused entries in the array (corresponding either to subexpressions that
did not participate in the match at all, or to subexpressions that do not
exist in the RE (that is,
.Va i
>
.Fa preg Ns -> Ns Va re_nsub ) )
have both
.Va rm_so
and
.Va rm_eo
set to -1.
If a subexpression participated in the match several times,
the reported substring is the last one it matched.
(Note, as an example in particular, that when the RE
.Ql "(b*)+"
matches
.Ql bbb ,
the parenthesized subexpression matches each of the three
.So Li b Sc Ns s
and then
an infinite number of empty strings following the last
.Ql b ,
so the reported substring is one of the empties.)
.Pp
If
.Dv REG_STARTEND
is specified,
.Fa pmatch
must point to at least one
.Ft regmatch_t
(even if
.Fa nmatch
is 0 or
.Dv REG_NOSUB
was specified),
to hold the input offsets for
.Dv REG_STARTEND .
Use for output is still entirely controlled by
.Fa nmatch ;
if
.Fa nmatch
is 0 or
.Dv REG_NOSUB
was specified,
the value of
.Fa pmatch Ns [0]
will not be changed by a successful
.Fn regexec .
.Pp
The
.Fn regerror
function
maps a non-zero
.Fa errcode
from either
.Fn regcomp
or
.Fn regexec
to a human-readable, printable message.
If
.Fa preg
is
.No non\- Ns Dv NULL ,
the error code should have arisen from use of
the
.Ft regex_t
pointed to by
.Fa preg ,
and if the error code came from
.Fn regcomp ,
it should have been the result from the most recent
.Fn regcomp
using that
.Ft regex_t .
.Po
The
.Fn regerror
function
may be able to supply a more detailed message using information
from the
.Ft regex_t .
.Pc
The
.Fn regerror
function
places the NUL-terminated message into the buffer pointed to by
.Fa errbuf ,
limiting the length (including the NUL) to at most
.Fa errbuf_size
bytes.
If the whole message will not fit,
as much of it as will fit before the terminating NUL is supplied.
In any case,
the returned value is the size of buffer needed to hold the whole
message (including terminating NUL).
If
.Fa errbuf_size
is 0,
.Fa errbuf
is ignored but the return value is still correct.
.Pp
If the
.Fa errcode
given to
.Fn regerror
is first ORed with
.Dv REG_ITOA ,
the
.Dq message
that results is the printable name of the error code,
e.g.\&
.Dq Dv REG_NOMATCH ,
rather than an explanation thereof.
If
.Fa errcode
is
.Dv REG_ATOI ,
then
.Fa preg
shall be
.No non\- Ns Dv NULL
and the
.Va re_endp
member of the structure it points to
must point to the printable name of an error code;
in this case, the result in
.Fa errbuf
is the decimal digits of
the numeric value of the error code
(0 if the name is not recognized).
.Dv REG_ITOA
and
.Dv REG_ATOI
are intended primarily as debugging facilities;
they are extensions,
compatible with but not specified by
.St -p1003.2 ,
and should be used with
caution in software intended to be portable to other systems.
Be warned also that they are considered experimental and changes are possible.
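.Pp
Because the return value is the buffer size needed for the whole message,
.Fn regerror
can be called twice: once with an
.Fa errbuf_size
of 0 to learn the size, and again to fill a buffer of that size.
The following is a minimal sketch of that pattern.
The helper name
.Fn re_strerror
is hypothetical, and passing
.Dv NULL
as
.Fa errbuf
in the sizing call relies on
.Fa errbuf
being ignored when
.Fa errbuf_size
is 0.
.Bd -literal -offset indent
```c
#include <regex.h>
#include <stdlib.h>

/*
 * Hypothetical helper, not part of the library: return a newly allocated
 * copy of the message for errcode, or NULL if allocation fails.
 */
char *
re_strerror(int errcode, const regex_t *preg)
{
	size_t need;
	char *msg;

	/* Sizing call: with errbuf_size 0 the buffer argument is ignored. */
	need = regerror(errcode, preg, NULL, 0);
	msg = malloc(need);
	if (msg != NULL)
		(void)regerror(errcode, preg, msg, need);
	return (msg);
}
```
.Ed
.Pp
In this sketch the caller is responsible for releasing the returned string
with
.Xr free 3 .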
.Pp
The
.Fn regfree
function
frees any dynamically-allocated storage associated with the compiled RE
pointed to by
.Fa preg .
The remaining
.Ft regex_t
is no longer a valid compiled RE
and the effect of supplying it to
.Fn regexec
or
.Fn regerror
is undefined.
.Pp
None of these functions references global variables except for tables
of constants;
all are safe for use from multiple threads if the arguments are safe.
.Sh IMPLEMENTATION CHOICES
There are a number of decisions that
.St -p1003.2
leaves up to the implementor,
either by explicitly saying
.Dq undefined
or by virtue of them being
forbidden by the RE grammar.
This implementation treats them as follows.
.Pp
See
.Xr re_format 7
for a discussion of the definition of case-independent matching.
.Pp
There is no particular limit on the length of REs,
except insofar as memory is limited.
Memory usage is approximately linear in RE size, and largely insensitive
to RE complexity, except for bounded repetitions.
See
.Sx BUGS
for one short RE using them
that will run almost any system out of memory.
.Pp
A backslashed character other than one specifically given a magic meaning
by
.St -p1003.2
(such magic meanings occur only in obsolete
.Bq Dq basic
REs)
is taken as an ordinary character.
.Pp
Any unmatched
.Ql [\&
is a
.Dv REG_EBRACK
error.
.Pp
Equivalence classes cannot begin or end bracket-expression ranges.
The endpoint of one range cannot begin another.
.Pp
.Dv RE_DUP_MAX ,
the limit on repetition counts in bounded repetitions, is 255.
.Pp
A repetition operator
.Ql ( ?\& ,
.Ql *\& ,
.Ql +\& ,
or bounds)
cannot follow another
repetition operator.
A repetition operator cannot begin an expression or subexpression
or follow
.Ql ^\&
or
.Ql |\& .
.Pp
.Ql |\&
cannot appear first or last in a (sub)expression or after another
.Ql |\& ,
i.e., an operand of
.Ql |\&
cannot be an empty subexpression.
An empty parenthesized subexpression,
.Ql "()" ,
is legal and matches an
empty (sub)string.
An empty string is not a legal RE.
.Pp
A
.Ql {\&
followed by a digit is considered the beginning of bounds for a
bounded repetition, which must then follow the syntax for bounds.
A
.Ql {\&
.Em not
followed by a digit is considered an ordinary character.
.Pp
.Ql ^\&
and
.Ql $\&
beginning and ending subexpressions in obsolete
.Pq Dq basic
REs are anchors, not ordinary characters.
.Sh DIAGNOSTICS
Non-zero error codes from
.Fn regcomp
and
.Fn regexec
include the following:
.Pp
.Bl -tag -width REG_ECOLLATE -compact
.It Dv REG_NOMATCH
The
.Fn regexec
function
failed to match
.It Dv REG_BADPAT
invalid regular expression
.It Dv REG_ECOLLATE
invalid collating element
.It Dv REG_ECTYPE
invalid character class
.It Dv REG_EESCAPE
.Ql \e
applied to unescapable character
.It Dv REG_ESUBREG
invalid backreference number
.It Dv REG_EBRACK
brackets
.Ql "[ ]"
not balanced
.It Dv REG_EPAREN
parentheses
.Ql "( )"
not balanced
.It Dv REG_EBRACE
braces
.Ql "{ }"
not balanced
.It Dv REG_BADBR
invalid repetition count(s) in
.Ql "{ }"
.It Dv REG_ERANGE
invalid character range in
.Ql "[ ]"
.It Dv REG_ESPACE
ran out of memory
.It Dv REG_BADRPT
.Ql ?\& ,
.Ql *\& ,
or
.Ql +\&
operand invalid
.It Dv REG_EMPTY
empty (sub)expression
.It Dv REG_ASSERT
cannot happen - you found a bug
.It Dv REG_INVARG
invalid argument, e.g.\& negative-length string
.It Dv REG_ILLSEQ
illegal byte sequence (bad multibyte character)
.El
.Sh SEE ALSO
.Xr grep 1 ,
.Xr re_format 7
.Pp
.St -p1003.2 ,
sections
2.8 (Regular Expression Notation)
and
B.5 (C Binding for Regular Expression Matching).
.Sh HISTORY
Originally written by
.An Henry Spencer .
Altered for inclusion in the
.Bx 4.4
distribution.
.Sh BUGS
This is an alpha release with known defects.
Please report problems.
.Pp
The back-reference code is subtle and doubts linger about its correctness
in complex cases.
.Pp
The performance of the
.Fn regexec
function
is poor.
This will improve with later releases.
The
.Fa nmatch
argument
exceeding 0 is expensive;
.Fa nmatch
exceeding 1 is worse.
The
.Fn regexec
function
is largely insensitive to RE complexity
.Em except
that back
references are massively expensive.
RE length does matter; in particular, there is a strong speed bonus
for keeping RE length under about 30 characters,
with most special characters counting roughly double.
.Pp
The
.Fn regcomp
function
implements bounded repetitions by macro expansion,
which is costly in time and space if counts are large
or bounded repetitions are nested.
An RE like, say,
.Ql "((((a{1,100}){1,100}){1,100}){1,100}){1,100}"
will (eventually) run almost any existing machine out of swap space.
.Pp
There are suspected problems with response to obscure error conditions.
Notably,
certain kinds of internal overflow,
produced only by truly enormous REs or by multiply nested bounded repetitions,
are probably not handled well.
.Pp
Due to a mistake in
.St -p1003.2 ,
things like
.Ql "a)b"
are legal REs because
.Ql )\&
is
a special character only in the presence of a previous unmatched
.Ql (\& .
This cannot be fixed until the spec is fixed.
.Pp
The standard's definition of back references is vague.
For example, does
.Ql "a\e(\e(b\e)*\e2\e)*d"
match
.Ql "abbbd" ?
Until the standard is clarified,
behavior in such cases should not be relied on.
.Pp
The implementation of word-boundary matching is a bit of a kludge,
and bugs may lurk in combinations of word-boundary matching and anchoring.
.Pp
Word-boundary matching does not work properly in multibyte locales.