.\" Copyright (c) 1992, 1993, 1994 Henry Spencer.
.\" Copyright (c) 1992, 1993, 1994
.\"	The Regents of the University of California.  All rights reserved.
.\"
.\" This code is derived from software contributed to Berkeley by
.\" Henry Spencer.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\" 3. Neither the name of the University nor the names of its contributors
.\"    may be used to endorse or promote products derived from this software
.\"    without specific prior written permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.Dd April 15, 2017
.Dt REGEX 3
.Os
.Sh NAME
.Nm regcomp ,
.Nm regexec ,
.Nm regerror ,
.Nm regfree
.Nd regular-expression library
.Sh LIBRARY
.Lb libc
.Sh SYNOPSIS
.In regex.h
.Ft int
.Fo regcomp
.Fa "regex_t * restrict preg" "const char * restrict pattern" "int cflags"
.Fc
.Ft int
.Fo regexec
.Fa "const regex_t * restrict preg" "const char * restrict string"
.Fa "size_t nmatch" "regmatch_t pmatch[restrict]" "int eflags"
.Fc
.Ft size_t
.Fo regerror
.Fa "int errcode" "const regex_t * restrict preg"
.Fa "char * restrict errbuf" "size_t errbuf_size"
.Fc
.Ft void
.Fn regfree "regex_t *preg"
.Sh DESCRIPTION
These routines implement
.St -p1003.2
regular expressions
.Pq Do RE Dc Ns s ;
see
.Xr re_format 7 .
The
.Fn regcomp
function
compiles an RE written as a string into an internal form,
.Fn regexec
matches that internal form against a string and reports results,
.Fn regerror
transforms error codes from either into human-readable messages,
and
.Fn regfree
frees any dynamically-allocated storage used by the internal form
of an RE.
.Pp
The header
.In regex.h
declares two structure types,
.Ft regex_t
and
.Ft regmatch_t ,
the former for compiled internal forms and the latter for match reporting.
It also declares the four functions,
a type
.Ft regoff_t ,
and a number of constants with names starting with
.Dq Dv REG_ .
.Pp
The
.Fn regcomp
function
compiles the regular expression contained in the
.Fa pattern
string,
subject to the flags in
.Fa cflags ,
and places the results in the
.Ft regex_t
structure pointed to by
.Fa preg .
The
.Fa cflags
argument
is the bitwise OR of zero or more of the following flags:
.Bl -tag -width REG_EXTENDED
.It Dv REG_EXTENDED
Compile modern
.Pq Dq extended
REs,
rather than the obsolete
.Pq Dq basic
REs that
are the default.
.It Dv REG_BASIC
This is a synonym for 0,
provided as a counterpart to
.Dv REG_EXTENDED
to improve readability.
.It Dv REG_NOSPEC
Compile with recognition of all special characters turned off.
All characters are thus considered ordinary,
so the
.Dq RE
is a literal string.
This is an extension,
compatible with but not specified by
.St -p1003.2 ,
and should be used with
caution in software intended to be portable to other systems.
.Dv REG_EXTENDED
and
.Dv REG_NOSPEC
may not be used
in the same call to
.Fn regcomp .
.It Dv REG_ICASE
Compile for matching that ignores upper/lower case distinctions.
See
.Xr re_format 7 .
.It Dv REG_NOSUB
Compile for matching that need only report success or failure,
not what was matched.
.It Dv REG_NEWLINE
Compile for newline-sensitive matching.
By default, newline is a completely ordinary character with no special
meaning in either REs or strings.
With this flag,
.Ql [^
bracket expressions and
.Ql .\&
never match newline,
a
.Ql ^\&
anchor matches the null string after any newline in the string
in addition to its normal function,
and the
.Ql $\&
anchor matches the null string before any newline in the
string in addition to its normal function.
.It Dv REG_PEND
The regular expression ends,
not at the first NUL,
but just before the character pointed to by the
.Va re_endp
member of the structure pointed to by
.Fa preg .
The
.Va re_endp
member is of type
.Ft "const char *" .
This flag permits inclusion of NULs in the RE;
they are considered ordinary characters.
This is an extension,
compatible with but not specified by
.St -p1003.2 ,
and should be used with
caution in software intended to be portable to other systems.
.It Dv REG_POSIX
Compile only
.St -p1003.2
compliant expressions.
This flag has no effect unless linking against
.Nm libregex .
This is an extension,
compatible with but not specified by
.St -p1003.2 ,
and should be used with
caution in software intended to be portable to other systems.
.El
.Pp
When successful,
.Fn regcomp
returns 0 and fills in the structure pointed to by
.Fa preg .
One member of that structure
(other than
.Va re_endp )
is publicized:
.Va re_nsub ,
of type
.Ft size_t ,
contains the number of parenthesized subexpressions within the RE
(except that the value of this member is undefined if the
.Dv REG_NOSUB
flag was used).
If
.Fn regcomp
fails, it returns a non-zero error code;
see
.Sx DIAGNOSTICS .
.Pp
The
.Fn regexec
function
matches the compiled RE pointed to by
.Fa preg
against the
.Fa string ,
subject to the flags in
.Fa eflags ,
and reports results using
.Fa nmatch ,
.Fa pmatch ,
and the returned value.
The RE must have been compiled by a previous invocation of
.Fn regcomp .
The compiled form is not altered during execution of
.Fn regexec ,
so a single compiled RE can be used simultaneously by multiple threads.
.Pp
By default,
the NUL-terminated string pointed to by
.Fa string
is considered to be the text of an entire line, minus any terminating
newline.
The
.Fa eflags
argument is the bitwise OR of zero or more of the following flags:
.Bl -tag -width REG_STARTEND
.It Dv REG_NOTBOL
The first character of the string is treated as the continuation
of a line.
This means that the anchors
.Ql ^\& ,
.Ql [[:<:]] ,
and
.Ql \e<
do not match before it; but see
.Dv REG_STARTEND
below.
This does not affect the behavior of newlines under
.Dv REG_NEWLINE .
.It Dv REG_NOTEOL
The NUL terminating
the string
does not end a line, so the
.Ql $\&
anchor does not match before it.
This does not affect the behavior of newlines under
.Dv REG_NEWLINE .
.It Dv REG_STARTEND
The string is considered to start at
.Fa string No +
.Fa pmatch Ns [0]. Ns Fa rm_so
and to end before the byte located at
.Fa string No +
.Fa pmatch Ns [0]. Ns Fa rm_eo ,
regardless of the value of
.Fa nmatch .
See below for the definition of
.Fa pmatch
and
.Fa nmatch .
This is an extension,
compatible with but not specified by
.St -p1003.2 ,
and should be used with
caution in software intended to be portable to other systems.
.Pp
Without
.Dv REG_NOTBOL ,
the position
.Fa rm_so
is considered the beginning of a line, such that
.Ql ^
matches before it, and the beginning of a word if there is a word
character at this position, such that
.Ql [[:<:]]
and
.Ql \e<
match before it.
.Pp
With
.Dv REG_NOTBOL ,
the character at position
.Fa rm_so
is treated as the continuation of a line, and if
.Fa rm_so
is greater than 0, the preceding character is taken into consideration.
If the preceding character is a newline and the regular expression was compiled
with
.Dv REG_NEWLINE ,
.Ql ^
matches before the string; if the preceding character is not a word character
but the string starts with a word character,
.Ql [[:<:]]
and
.Ql \e<
match before the string.
.El
.Pp
See
.Xr re_format 7
for a discussion of what is matched in situations where an RE or a
portion thereof could match any of several substrings of
.Fa string .
.Pp
Normally,
.Fn regexec
returns 0 for success and the non-zero code
.Dv REG_NOMATCH
for failure.
Other non-zero error codes may be returned in exceptional situations;
see
.Sx DIAGNOSTICS .
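.Pp
The compile, match, and free sequence described above can be sketched
as follows.
This is a minimal illustration rather than a definitive usage pattern;
the helper name
.Fn re_matches
and its return convention are assumptions and are not part of the library.
.Bd -literal -offset indent
```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper, not part of the library: returns 1 if the
 * extended RE `pattern' matches somewhere in `string', 0 if it does
 * not, and -1 if compilation fails or regexec() reports an error
 * other than REG_NOMATCH.
 */
int
re_matches(const char *pattern, const char *string)
{
	regex_t re;
	int rc;

	assert(pattern != NULL && string != NULL);

	/* REG_NOSUB: only success or failure is wanted, no offsets. */
	if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
		return (-1);

	/* nmatch is 0 and pmatch is NULL because REG_NOSUB was used. */
	rc = regexec(&re, string, 0, NULL, 0);
	regfree(&re);

	if (rc == 0)
		return (1);
	return (rc == REG_NOMATCH ? 0 : -1);
}
```
.Ed
.Pp
Under these assumptions,
.Fn re_matches
returns 1 for the RE
.Ql ^ab+c$
and the string
.Ql abbbc ,
and 0 for the string
.Ql ac .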
.Pp
If
.Dv REG_NOSUB
was specified in the compilation of the RE,
or if
.Fa nmatch
is 0,
.Fn regexec
ignores the
.Fa pmatch
argument (but see below for the case where
.Dv REG_STARTEND
is specified).
Otherwise,
.Fa pmatch
points to an array of
.Fa nmatch
structures of type
.Ft regmatch_t .
Such a structure has at least the members
.Va rm_so
and
.Va rm_eo ,
both of type
.Ft regoff_t
(a signed arithmetic type at least as large as an
.Ft off_t
and a
.Ft ssize_t ) ,
containing respectively the offset of the first character of a substring
and the offset of the first character after the end of the substring.
Offsets are measured from the beginning of the
.Fa string
argument given to
.Fn regexec .
An empty substring is denoted by equal offsets,
both indicating the character following the empty substring.
.Pp
The 0th member of the
.Fa pmatch
array is filled in to indicate what substring of
.Fa string
was matched by the entire RE.
Remaining members report what substring was matched by parenthesized
subexpressions within the RE;
member
.Va i
reports subexpression
.Va i ,
with subexpressions counted (starting at 1) by the order of their opening
parentheses in the RE, left to right.
Unused entries in the array (corresponding either to subexpressions that
did not participate in the match at all, or to subexpressions that do not
exist in the RE (that is,
.Va i
>
.Fa preg Ns -> Ns Va re_nsub ) )
have both
.Va rm_so
and
.Va rm_eo
set to -1.
If a subexpression participated in the match several times,
the reported substring is the last one it matched.
(Note, as an example in particular, that when the RE
.Ql "(b*)+"
matches
.Ql bbb ,
the parenthesized subexpression matches each of the three
.So Li b Sc Ns s
and then
an infinite number of empty strings following the last
.Ql b ,
so the reported substring is one of the empties.)
.Pp
If
.Dv REG_STARTEND
is specified,
.Fa pmatch
must point to at least one
.Ft regmatch_t
(even if
.Fa nmatch
is 0 or
.Dv REG_NOSUB
was specified),
to hold the input offsets for
.Dv REG_STARTEND .
Use for output is still entirely controlled by
.Fa nmatch ;
if
.Fa nmatch
is 0 or
.Dv REG_NOSUB
was specified,
the value of
.Fa pmatch Ns [0]
will not be changed by a successful
.Fn regexec .
.Pp
The
.Fn regerror
function
maps a non-zero
.Fa errcode
from either
.Fn regcomp
or
.Fn regexec
to a human-readable, printable message.
If
.Fa preg
is
.No non\- Ns Dv NULL ,
the error code should have arisen from use of
the
.Ft regex_t
pointed to by
.Fa preg ,
and if the error code came from
.Fn regcomp ,
it should have been the result from the most recent
.Fn regcomp
using that
.Ft regex_t .
.Po The
.Fn regerror
function may be able to supply a more detailed message using information
from the
.Ft regex_t .
.Pc
The
.Fn regerror
function
places the NUL-terminated message into the buffer pointed to by
.Fa errbuf ,
limiting the length (including the NUL) to at most
.Fa errbuf_size
bytes.
If the whole message will not fit,
as much of it as will fit before the terminating NUL is supplied.
In any case,
the returned value is the size of buffer needed to hold the whole
message (including terminating NUL).
If
.Fa errbuf_size
is 0,
.Fa errbuf
is ignored but the return value is still correct.
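.Pp
The buffer-sizing rule for
.Fn regerror
and the offset conventions for
.Fa pmatch
can be sketched together as follows.
This is an illustration under stated assumptions, not a definitive
implementation: the helpers
.Fn re_strerror
and
.Fn re_group ,
and the constant
.Dv NGROUPS ,
are hypothetical names that are not part of the library.
.Bd -literal -offset indent
```c
#include <assert.h>
#include <regex.h>
#include <stdlib.h>
#include <string.h>

#define NGROUPS 4	/* entry 0 plus up to three subexpressions */

/*
 * Hypothetical helper: return a malloc()ed copy of the message for
 * errcode, or NULL if memory is exhausted.  The first regerror()
 * call passes errbuf_size 0 only to learn the size needed.
 */
char *
re_strerror(int errcode, const regex_t *preg)
{
	size_t need;
	char *buf;

	need = regerror(errcode, preg, NULL, 0);
	buf = malloc(need);
	if (buf != NULL)
		(void)regerror(errcode, preg, buf, need);
	return (buf);
}

/*
 * Hypothetical helper: match the extended RE `pattern' against
 * `string' and copy the text described by pmatch[idx] into out,
 * truncating to outsz bytes.  Returns the regcomp() or regexec()
 * code; REG_NOMATCH is also returned when entry idx is unused.
 */
int
re_group(const char *pattern, const char *string, size_t idx,
    char *out, size_t outsz)
{
	regex_t re;
	regmatch_t pm[NGROUPS];
	size_t len;
	int rc;

	assert(idx < NGROUPS && out != NULL && outsz > 0);

	rc = regcomp(&re, pattern, REG_EXTENDED);
	if (rc != 0)
		return (rc);	/* nothing was compiled, nothing to free */

	rc = regexec(&re, string, NGROUPS, pm, 0);
	regfree(&re);
	if (rc != 0)
		return (rc);
	if (pm[idx].rm_so == -1)	/* unused entry: both offsets -1 */
		return (REG_NOMATCH);

	/* Offsets are measured from the start of string. */
	len = (size_t)(pm[idx].rm_eo - pm[idx].rm_so);
	if (len >= outsz)
		len = outsz - 1;
	memcpy(out, string + pm[idx].rm_so, len);
	out[len] = 0;
	return (0);
}
```
.Ed
.Pp
With the RE
.Ql "([a-z]+)@([a-z]+)"
and the string
.Ql "mail bob@example now" ,
calling
.Fn re_group
with an index of 2 stores
.Ql example
in the output buffer.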
.Pp
If the
.Fa errcode
given to
.Fn regerror
is first ORed with
.Dv REG_ITOA ,
the
.Dq message
that results is the printable name of the error code,
e.g.\&
.Dq Dv REG_NOMATCH ,
rather than an explanation thereof.
If
.Fa errcode
is
.Dv REG_ATOI ,
then
.Fa preg
shall be
.No non\- Ns Dv NULL
and the
.Va re_endp
member of the structure it points to
must point to the printable name of an error code;
in this case, the result in
.Fa errbuf
is the decimal digits of
the numeric value of the error code
(0 if the name is not recognized).
.Dv REG_ITOA
and
.Dv REG_ATOI
are intended primarily as debugging facilities;
they are extensions,
compatible with but not specified by
.St -p1003.2 ,
and should be used with
caution in software intended to be portable to other systems.
Be warned also that they are considered experimental and changes are possible.
.Pp
The
.Fn regfree
function
frees any dynamically-allocated storage associated with the compiled RE
pointed to by
.Fa preg .
The remaining
.Ft regex_t
is no longer a valid compiled RE
and the effect of supplying it to
.Fn regexec
or
.Fn regerror
is undefined.
.Pp
None of these functions references global variables except for tables
of constants;
all are safe for use from multiple threads if the arguments are safe.
.Sh IMPLEMENTATION CHOICES
There are a number of decisions that
.St -p1003.2
leaves up to the implementor,
either by explicitly saying
.Dq undefined
or by virtue of them being
forbidden by the RE grammar.
This implementation treats them as follows.
.Pp
See
.Xr re_format 7
for a discussion of the definition of case-independent matching.
.Pp
There is no particular limit on the length of REs,
except insofar as memory is limited.
Memory usage is approximately linear in RE size, and largely insensitive
to RE complexity, except for bounded repetitions.
See
.Sx BUGS
for one short RE using them
that will run almost any system out of memory.
.Pp
A backslashed character other than one specifically given a magic meaning
by
.St -p1003.2
(such magic meanings occur only in obsolete
.Bq Dq basic
REs)
is taken as an ordinary character.
.Pp
Any unmatched
.Ql [\&
is a
.Dv REG_EBRACK
error.
.Pp
Equivalence classes cannot begin or end bracket-expression ranges.
The endpoint of one range cannot begin another.
.Pp
.Dv RE_DUP_MAX ,
the limit on repetition counts in bounded repetitions, is 255.
.Pp
A repetition operator
.Ql ( ?\& ,
.Ql *\& ,
.Ql +\& ,
or bounds)
cannot follow another
repetition operator.
A repetition operator cannot begin an expression or subexpression
or follow
.Ql ^\&
or
.Ql |\& .
.Pp
.Ql |\&
cannot appear first or last in a (sub)expression or after another
.Ql |\& ,
i.e., an operand of
.Ql |\&
cannot be an empty subexpression.
An empty parenthesized subexpression,
.Ql "()" ,
is legal and matches an
empty (sub)string.
An empty string is not a legal RE.
.Pp
A
.Ql {\&
followed by a digit is considered the beginning of bounds for a
bounded repetition, which must then follow the syntax for bounds.
A
.Ql {\&
.Em not
followed by a digit is considered an ordinary character.
.Pp
.Ql ^\&
and
.Ql $\&
beginning and ending subexpressions in obsolete
.Pq Dq basic
REs are anchors, not ordinary characters.
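.Pp
Because the choices above are specific to this implementation,
software intended to be portable may prefer to ask the library in use
whether it accepts a given RE instead of assuming the answer.
A minimal sketch follows; the helper
.Fn re_check
is a hypothetical name and is not part of the library.
.Bd -literal -offset indent
```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: return the regcomp() result for `pattern'
 * compiled with `cflags', releasing the compiled form on success.
 * A return of 0 means the library in use accepts the RE; any other
 * value is one of the codes listed under DIAGNOSTICS.
 */
int
re_check(const char *pattern, int cflags)
{
	regex_t re;
	int rc;

	assert(pattern != NULL);

	rc = regcomp(&re, pattern, cflags);
	if (rc == 0)
		regfree(&re);	/* only a successful compile needs freeing */
	return (rc);
}
```
.Ed
.Pp
Passing an RE such as
.Ql a**
or
.Ql "()"
to
.Fn re_check
together with
.Dv REG_EXTENDED
shows how the linked library treats the corresponding case;
other implementations may answer differently.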
.Sh DIAGNOSTICS
Non-zero error codes from
.Fn regcomp
and
.Fn regexec
include the following:
.Pp
.Bl -tag -width REG_ECOLLATE -compact
.It Dv REG_NOMATCH
The
.Fn regexec
function
failed to match
.It Dv REG_BADPAT
invalid regular expression
.It Dv REG_ECOLLATE
invalid collating element
.It Dv REG_ECTYPE
invalid character class
.It Dv REG_EESCAPE
.Ql \e
applied to unescapable character
.It Dv REG_ESUBREG
invalid backreference number
.It Dv REG_EBRACK
brackets
.Ql "[ ]"
not balanced
.It Dv REG_EPAREN
parentheses
.Ql "( )"
not balanced
.It Dv REG_EBRACE
braces
.Ql "{ }"
not balanced
.It Dv REG_BADBR
invalid repetition count(s) in
.Ql "{ }"
.It Dv REG_ERANGE
invalid character range in
.Ql "[ ]"
.It Dv REG_ESPACE
ran out of memory
.It Dv REG_BADRPT
.Ql ?\& ,
.Ql *\& ,
or
.Ql +\&
operand invalid
.It Dv REG_EMPTY
empty (sub)expression
.It Dv REG_ASSERT
cannot happen - you found a bug
.It Dv REG_INVARG
invalid argument, e.g.\& negative-length string
.It Dv REG_ILLSEQ
illegal byte sequence (bad multibyte character)
.El
.Sh SEE ALSO
.Xr grep 1 ,
.Xr re_format 7
.Pp
.St -p1003.2 ,
sections 2.8 (Regular Expression Notation)
and
B.5 (C Binding for Regular Expression Matching).
.Sh HISTORY
Originally written by
.An Henry Spencer .
Altered for inclusion in the
.Bx 4.4
distribution.
.Sh BUGS
This is an alpha release with known defects.
Please report problems.
.Pp
The back-reference code is subtle and doubts linger about its correctness
in complex cases.
.Pp
The performance of the
.Fn regexec
function is poor.
This will improve with later releases.
A value of
.Fa nmatch
greater than 0 is expensive;
.Fa nmatch
greater than 1 is worse.
The
.Fn regexec
function
is largely insensitive to RE complexity
.Em except
that back
references are massively expensive.
RE length does matter; in particular, there is a strong speed bonus
for keeping RE length under about 30 characters,
with most special characters counting roughly double.
.Pp
The
.Fn regcomp
function
implements bounded repetitions by macro expansion,
which is costly in time and space if counts are large
or bounded repetitions are nested.
An RE like, say,
.Ql "((((a{1,100}){1,100}){1,100}){1,100}){1,100}"
will (eventually) run almost any existing machine out of swap space.
.Pp
There are suspected problems with response to obscure error conditions.
Notably,
certain kinds of internal overflow,
produced only by truly enormous REs or by multiply nested bounded repetitions,
are probably not handled well.
.Pp
Due to a mistake in
.St -p1003.2 ,
things like
.Ql "a)b"
are legal REs because
.Ql )\&
is
a special character only in the presence of a previous unmatched
.Ql (\& .
This cannot be fixed until the spec is fixed.
.Pp
The standard's definition of back references is vague.
For example, does
.Ql "a\e(\e(b\e)*\e2\e)*d"
match
.Ql "abbbd" ?
Until the standard is clarified,
behavior in such cases should not be relied on.
.Pp
The implementation of word-boundary matching is a bit of a kludge,
and bugs may lurk in combinations of word-boundary matching and anchoring.
.Pp
Word-boundary matching does not work properly in multibyte locales.