1.\" Copyright (c) 1992, 1993, 1994 Henry Spencer. 2.\" Copyright (c) 1992, 1993, 1994 3.\" The Regents of the University of California. All rights reserved. 4.\" 5.\" This code is derived from software contributed to Berkeley by 6.\" Henry Spencer. 7.\" 8.\" Redistribution and use in source and binary forms, with or without 9.\" modification, are permitted provided that the following conditions 10.\" are met: 11.\" 1. Redistributions of source code must retain the above copyright 12.\" notice, this list of conditions and the following disclaimer. 13.\" 2. Redistributions in binary form must reproduce the above copyright 14.\" notice, this list of conditions and the following disclaimer in the 15.\" documentation and/or other materials provided with the distribution. 16.\" 3. All advertising materials mentioning features or use of this software 17.\" must display the following acknowledgement: 18.\" This product includes software developed by the University of 19.\" California, Berkeley and its contributors. 20.\" 4. Neither the name of the University nor the names of its contributors 21.\" may be used to endorse or promote products derived from this software 22.\" without specific prior written permission. 23.\" 24.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND 25.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 26.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE 27.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE 28.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 29.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS 30.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) 31.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT 32.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY 33.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF 34.\" SUCH DAMAGE. 35.\" 36.\" @(#)re_format.7 8.3 (Berkeley) 3/20/94 37.\" 38.Dd June 30, 2014 39.Dt RE_FORMAT 7 40.Os 41.Sh NAME 42.Nm re_format 43.Nd POSIX 1003.2 regular expressions 44.Sh DESCRIPTION 45Regular expressions 46.Pq Dq RE Ns s , 47as defined in 48.St -p1003.2 , 49come in two forms: 50modern REs (roughly those of 51.Xr egrep 1 ; 521003.2 calls these 53.Dq extended 54REs) 55and obsolete REs (roughly those of 56.Xr ed 1 ; 571003.2 58.Dq basic 59REs). 60Obsolete REs mostly exist for backward compatibility in some old programs; 61they will be discussed at the end. 62.St -p1003.2 63leaves some aspects of RE syntax and semantics open; 64`\(dd' marks decisions on these aspects that 65may not be fully portable to other 66.St -p1003.2 67implementations. 68.Pp 69A (modern) RE is one\(dd or more non-empty\(dd 70.Em branches , 71separated by 72.Ql \&| . 73It matches anything that matches one of the branches. 74.Pp 75A branch is one\(dd or more 76.Em pieces , 77concatenated. 78It matches a match for the first, followed by a match for the second, etc. 79.Pp 80A piece is an 81.Em atom 82possibly followed 83by a single\(dd 84.Ql \&* , 85.Ql \&+ , 86.Ql \&? , 87or 88.Em bound . 89An atom followed by 90.Ql \&* 91matches a sequence of 0 or more matches of the atom. 92An atom followed by 93.Ql \&+ 94matches a sequence of 1 or more matches of the atom. 95An atom followed by 96.Ql ?\& 97matches a sequence of 0 or 1 matches of the atom. 98.Pp 99A 100.Em bound 101is 102.Ql \&{ 103followed by an unsigned decimal integer, 104possibly followed by 105.Ql \&, 106possibly followed by another unsigned decimal integer, 107always followed by 108.Ql \&} . 109The integers must lie between 0 and 110.Dv RE_DUP_MAX 111(255\(dd) inclusive, 112and if there are two of them, the first may not exceed the second. 113An atom followed by a bound containing one integer 114.Em i 115and no comma matches 116a sequence of exactly 117.Em i 118matches of the atom. 119An atom followed by a bound 120containing one integer 121.Em i 122and a comma matches 123a sequence of 124.Em i 125or more matches of the atom. 126An atom followed by a bound 127containing two integers 128.Em i 129and 130.Em j 131matches 132a sequence of 133.Em i 134through 135.Em j 136(inclusive) matches of the atom. 137.Pp 138An atom is a regular expression enclosed in 139.Ql () 140(matching a match for the 141regular expression), 142an empty set of 143.Ql () 144(matching the null string)\(dd, 145a 146.Em bracket expression 147(see below), 148.Ql .\& 149(matching any single character), 150.Ql \&^ 151(matching the null string at the beginning of a line), 152.Ql \&$ 153(matching the null string at the end of a line), a 154.Ql \e 155followed by one of the characters 156.Ql ^.[$()|*+?{\e 157(matching that character taken as an ordinary character), 158a 159.Ql \e 160followed by any other character\(dd 161(matching that character taken as an ordinary character, 162as if the 163.Ql \e 164had not been present\(dd), 165or a single character with no other significance (matching that character). 166A 167.Ql \&{ 168followed by a character other than a digit is an ordinary 169character, not the beginning of a bound\(dd. 170It is illegal to end an RE with 171.Ql \e . 172.Pp 173A 174.Em bracket expression 175is a list of characters enclosed in 176.Ql [] . 177It normally matches any single character from the list (but see below). 178If the list begins with 179.Ql \&^ , 180it matches any single character 181(but see below) 182.Em not 183from the rest of the list. 184If two characters in the list are separated by 185.Ql \&- , 186this is shorthand 187for the full 188.Em range 189of characters between those two (inclusive) in the 190collating sequence, 191.No e.g. Ql [0-9] 192in ASCII matches any decimal digit. 193It is illegal\(dd for two ranges to share an 194endpoint, 195.No e.g. Ql a-c-e . 196Ranges are very collating-sequence-dependent, 197and portable programs should avoid relying on them. 198.Pp 199To include a literal 200.Ql \&] 201in the list, make it the first character 202(following a possible 203.Ql \&^ ) . 204To include a literal 205.Ql \&- , 206make it the first or last character, 207or the second endpoint of a range. 208To use a literal 209.Ql \&- 210as the first endpoint of a range, 211enclose it in 212.Ql [.\& 213and 214.Ql .]\& 215to make it a collating element (see below). 216With the exception of these and some combinations using 217.Ql \&[ 218(see next paragraphs), all other special characters, including 219.Ql \e , 220lose their special significance within a bracket expression. 221.Pp 222Within a bracket expression, a collating element (a character, 223a multi-character sequence that collates as if it were a single character, 224or a collating-sequence name for either) 225enclosed in 226.Ql [.\& 227and 228.Ql .]\& 229stands for the 230sequence of characters of that collating element. 231The sequence is a single element of the bracket expression's list. 232A bracket expression containing a multi-character collating element 233can thus match more than one character, 234e.g.\& if the collating sequence includes a 235.Ql ch 236collating element, 237then the RE 238.Ql [[.ch.]]*c 239matches the first five characters 240of 241.Ql chchcc . 242.Pp 243Within a bracket expression, a collating element enclosed in 244.Ql [= 245and 246.Ql =] 247is an equivalence class, standing for the sequences of characters 248of all collating elements equivalent to that one, including itself. 249(If there are no other equivalent collating elements, 250the treatment is as if the enclosing delimiters were 251.Ql [.\& 252and 253.Ql .] . ) 254For example, if 255.Ql x 256and 257.Ql y 258are the members of an equivalence class, 259then 260.Ql [[=x=]] , 261.Ql [[=y=]] , 262and 263.Ql [xy] 264are all synonymous. 265An equivalence class may not\(dd be an endpoint 266of a range. 267.Pp 268Within a bracket expression, the name of a 269.Em character class 270enclosed in 271.Ql [: 272and 273.Ql :] 274stands for the list of all characters belonging to that 275class. 276Standard character class names are: 277.Bl -column "alnum" "digit" "xdigit" -offset indent 278.It Em "alnum digit punct" 279.It Em "alpha graph space" 280.It Em "blank lower upper" 281.It Em "cntrl print xdigit" 282.El 283.Pp 284These stand for the character classes defined in 285.Xr ctype 3 . 286A locale may provide others. 287A character class may not be used as an endpoint of a range. 288.Pp 289A bracketed expression like 290.Ql [[:class:]] 291can be used to match a single character that belongs to a character 292class. 293The reverse, matching any character that does not belong to a specific 294class, the negation operator of bracket expressions may be used: 295.Ql [^[:class:]] . 296.Pp 297There are two special cases\(dd of bracket expressions: 298the bracket expressions 299.Ql [[:<:]] 300and 301.Ql [[:>:]] 302match the null string at the beginning and end of a word respectively. 303A word is defined as a sequence of word characters 304which is neither preceded nor followed by 305word characters. 306A word character is an 307.Em alnum 308character (as defined by 309.Xr ctype 3 ) 310or an underscore. 311This is an extension, 312compatible with but not specified by 313.St -p1003.2 , 314and should be used with 315caution in software intended to be portable to other systems. 316The additional word delimiters 317.Ql \e< 318and 319.Ql \e> 320are provided to ease compatibility with traditional 321SVR4 322systems but are not portable and should be avoided. 323.Pp 324In the event that an RE could match more than one substring of a given 325string, 326the RE matches the one starting earliest in the string. 327If the RE could match more than one substring starting at that point, 328it matches the longest. 329Subexpressions also match the longest possible substrings, subject to 330the constraint that the whole match be as long as possible, 331with subexpressions starting earlier in the RE taking priority over 332ones starting later. 333Note that higher-level subexpressions thus take priority over 334their lower-level component subexpressions. 335.Pp 336Match lengths are measured in characters, not collating elements. 337A null string is considered longer than no match at all. 338For example, 339.Ql bb* 340matches the three middle characters of 341.Ql abbbc , 342.Ql (wee|week)(knights|nights) 343matches all ten characters of 344.Ql weeknights , 345when 346.Ql (.*).*\& 347is matched against 348.Ql abc 349the parenthesized subexpression 350matches all three characters, and 351when 352.Ql (a*)* 353is matched against 354.Ql bc 355both the whole RE and the parenthesized 356subexpression match the null string. 357.Pp 358If case-independent matching is specified, 359the effect is much as if all case distinctions had vanished from the 360alphabet. 361When an alphabetic that exists in multiple cases appears as an 362ordinary character outside a bracket expression, it is effectively 363transformed into a bracket expression containing both cases, 364.No e.g. Ql x 365becomes 366.Ql [xX] . 367When it appears inside a bracket expression, all case counterparts 368of it are added to the bracket expression, so that (e.g.) 369.Ql [x] 370becomes 371.Ql [xX] 372and 373.Ql [^x] 374becomes 375.Ql [^xX] . 376.Pp 377No particular limit is imposed on the length of REs\(dd. 378Programs intended to be portable should not employ REs longer 379than 256 bytes, 380as an implementation can refuse to accept such REs and remain 381POSIX-compliant. 382.Pp 383Obsolete 384.Pq Dq basic 385regular expressions differ in several respects. 386.Ql \&| 387is an ordinary character and there is no equivalent 388for its functionality. 389.Ql \&+ 390and 391.Ql ?\& 392are ordinary characters, and their functionality 393can be expressed using bounds 394.Po 395.Ql {1,} 396or 397.Ql {0,1} 398respectively 399.Pc . 400Also note that 401.Ql x+ 402in modern REs is equivalent to 403.Ql xx* . 404The delimiters for bounds are 405.Ql \e{ 406and 407.Ql \e} , 408with 409.Ql \&{ 410and 411.Ql \&} 412by themselves ordinary characters. 413The parentheses for nested subexpressions are 414.Ql \e( 415and 416.Ql \e) , 417with 418.Ql \&( 419and 420.Ql \&) 421by themselves ordinary characters. 422.Ql \&^ 423is an ordinary character except at the beginning of the 424RE or\(dd the beginning of a parenthesized subexpression, 425.Ql \&$ 426is an ordinary character except at the end of the 427RE or\(dd the end of a parenthesized subexpression, 428and 429.Ql \&* 430is an ordinary character if it appears at the beginning of the 431RE or the beginning of a parenthesized subexpression 432(after a possible leading 433.Ql \&^ ) . 434Finally, there is one new type of atom, a 435.Em back reference : 436.Ql \e 437followed by a non-zero decimal digit 438.Em d 439matches the same sequence of characters 440matched by the 441.Em d Ns th 442parenthesized subexpression 443(numbering subexpressions by the positions of their opening parentheses, 444left to right), 445so that (e.g.) 446.Ql \e([bc]\e)\e1 447matches 448.Ql bb 449or 450.Ql cc 451but not 452.Ql bc . 453.Sh SEE ALSO 454.Xr regex 3 455.Rs 456.%T Regular Expression Notation 457.%R IEEE Std 458.%N 1003.2 459.%P section 2.8 460.Re 461.Sh BUGS 462Having two kinds of REs is a botch. 463.Pp 464The current 465.St -p1003.2 466spec says that 467.Ql \&) 468is an ordinary character in 469the absence of an unmatched 470.Ql \&( ; 471this was an unintentional result of a wording error, 472and change is likely. 473Avoid relying on it. 474.Pp 475Back references are a dreadful botch, 476posing major problems for efficient implementations. 477They are also somewhat vaguely defined 478(does 479.Ql a\e(\e(b\e)*\e2\e)*d 480match 481.Ql abbbd ? ) . 482Avoid using them. 483.Pp 484.St -p1003.2 485specification of case-independent matching is vague. 486The 487.Dq one case implies all cases 488definition given above 489is current consensus among implementors as to the right interpretation. 490.Pp 491The syntax for word boundaries is incredibly ugly. 492