1.\" Copyright (c) 1992, 1993, 1994 Henry Spencer. 2.\" Copyright (c) 1992, 1993, 1994 3.\" The Regents of the University of California. All rights reserved. 4.\" 5.\" This code is derived from software contributed to Berkeley by 6.\" Henry Spencer. 7.\" 8.\" Redistribution and use in source and binary forms, with or without 9.\" modification, are permitted provided that the following conditions 10.\" are met: 11.\" 1. Redistributions of source code must retain the above copyright 12.\" notice, this list of conditions and the following disclaimer. 13.\" 2. Redistributions in binary form must reproduce the above copyright 14.\" notice, this list of conditions and the following disclaimer in the 15.\" documentation and/or other materials provided with the distribution. 16.\" 3. All advertising materials mentioning features or use of this software 17.\" must display the following acknowledgement: 18.\" This product includes software developed by the University of 19.\" California, Berkeley and its contributors. 20.\" 4. Neither the name of the University nor the names of its contributors 21.\" may be used to endorse or promote products derived from this software 22.\" without specific prior written permission. 23.\" 24.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND 25.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 26.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE 27.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE 28.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 29.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS 30.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) 31.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT 32.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY 33.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF 34.\" SUCH DAMAGE. 35.\" 36.Dd June 30, 2014 37.Dt RE_FORMAT 7 38.Os 39.Sh NAME 40.Nm re_format 41.Nd POSIX 1003.2 regular expressions 42.Sh DESCRIPTION 43Regular expressions 44.Pq Dq RE Ns s , 45as defined in 46.St -p1003.2 , 47come in two forms: 48modern REs (roughly those of 49.Xr egrep 1 ; 501003.2 calls these 51.Dq extended 52REs) 53and obsolete REs (roughly those of 54.Xr ed 1 ; 551003.2 56.Dq basic 57REs). 58Obsolete REs mostly exist for backward compatibility in some old programs; 59they will be discussed at the end. 60.St -p1003.2 61leaves some aspects of RE syntax and semantics open; 62`\(dd' marks decisions on these aspects that 63may not be fully portable to other 64.St -p1003.2 65implementations. 66.Pp 67A (modern) RE is one\(dd or more non-empty\(dd 68.Em branches , 69separated by 70.Ql \&| . 71It matches anything that matches one of the branches. 72.Pp 73A branch is one\(dd or more 74.Em pieces , 75concatenated. 76It matches a match for the first, followed by a match for the second, etc. 77.Pp 78A piece is an 79.Em atom 80possibly followed 81by a single\(dd 82.Ql \&* , 83.Ql \&+ , 84.Ql \&? , 85or 86.Em bound . 87An atom followed by 88.Ql \&* 89matches a sequence of 0 or more matches of the atom. 90An atom followed by 91.Ql \&+ 92matches a sequence of 1 or more matches of the atom. 93An atom followed by 94.Ql ?\& 95matches a sequence of 0 or 1 matches of the atom. 96.Pp 97A 98.Em bound 99is 100.Ql \&{ 101followed by an unsigned decimal integer, 102possibly followed by 103.Ql \&, 104possibly followed by another unsigned decimal integer, 105always followed by 106.Ql \&} . 107The integers must lie between 0 and 108.Dv RE_DUP_MAX 109(255\(dd) inclusive, 110and if there are two of them, the first may not exceed the second. 111An atom followed by a bound containing one integer 112.Em i 113and no comma matches 114a sequence of exactly 115.Em i 116matches of the atom. 117An atom followed by a bound 118containing one integer 119.Em i 120and a comma matches 121a sequence of 122.Em i 123or more matches of the atom. 124An atom followed by a bound 125containing two integers 126.Em i 127and 128.Em j 129matches 130a sequence of 131.Em i 132through 133.Em j 134(inclusive) matches of the atom. 135.Pp 136An atom is a regular expression enclosed in 137.Ql () 138(matching a match for the 139regular expression), 140an empty set of 141.Ql () 142(matching the null string)\(dd, 143a 144.Em bracket expression 145(see below), 146.Ql .\& 147(matching any single character), 148.Ql \&^ 149(matching the null string at the beginning of a line), 150.Ql \&$ 151(matching the null string at the end of a line), a 152.Ql \e 153followed by one of the characters 154.Ql ^.[$()|*+?{\e 155(matching that character taken as an ordinary character), 156a 157.Ql \e 158followed by any other character\(dd 159(matching that character taken as an ordinary character, 160as if the 161.Ql \e 162had not been present\(dd), 163or a single character with no other significance (matching that character). 164A 165.Ql \&{ 166followed by a character other than a digit is an ordinary 167character, not the beginning of a bound\(dd. 168It is illegal to end an RE with 169.Ql \e . 170.Pp 171A 172.Em bracket expression 173is a list of characters enclosed in 174.Ql [] . 175It normally matches any single character from the list (but see below). 176If the list begins with 177.Ql \&^ , 178it matches any single character 179(but see below) 180.Em not 181from the rest of the list. 182If two characters in the list are separated by 183.Ql \&- , 184this is shorthand 185for the full 186.Em range 187of characters between those two (inclusive) in the 188collating sequence, 189.No e.g. Ql [0-9] 190in ASCII matches any decimal digit. 191It is illegal\(dd for two ranges to share an 192endpoint, 193.No e.g. Ql a-c-e . 194Ranges are very collating-sequence-dependent, 195and portable programs should avoid relying on them. 196.Pp 197To include a literal 198.Ql \&] 199in the list, make it the first character 200(following a possible 201.Ql \&^ ) . 202To include a literal 203.Ql \&- , 204make it the first or last character, 205or the second endpoint of a range. 206To use a literal 207.Ql \&- 208as the first endpoint of a range, 209enclose it in 210.Ql [.\& 211and 212.Ql .]\& 213to make it a collating element (see below). 214With the exception of these and some combinations using 215.Ql \&[ 216(see next paragraphs), all other special characters, including 217.Ql \e , 218lose their special significance within a bracket expression. 219.Pp 220Within a bracket expression, a collating element (a character, 221a multi-character sequence that collates as if it were a single character, 222or a collating-sequence name for either) 223enclosed in 224.Ql [.\& 225and 226.Ql .]\& 227stands for the 228sequence of characters of that collating element. 229The sequence is a single element of the bracket expression's list. 230A bracket expression containing a multi-character collating element 231can thus match more than one character, 232e.g.\& if the collating sequence includes a 233.Ql ch 234collating element, 235then the RE 236.Ql [[.ch.]]*c 237matches the first five characters 238of 239.Ql chchcc . 240.Pp 241Within a bracket expression, a collating element enclosed in 242.Ql [= 243and 244.Ql =] 245is an equivalence class, standing for the sequences of characters 246of all collating elements equivalent to that one, including itself. 247(If there are no other equivalent collating elements, 248the treatment is as if the enclosing delimiters were 249.Ql [.\& 250and 251.Ql .] . ) 252For example, if 253.Ql x 254and 255.Ql y 256are the members of an equivalence class, 257then 258.Ql [[=x=]] , 259.Ql [[=y=]] , 260and 261.Ql [xy] 262are all synonymous. 263An equivalence class may not\(dd be an endpoint 264of a range. 265.Pp 266Within a bracket expression, the name of a 267.Em character class 268enclosed in 269.Ql [: 270and 271.Ql :] 272stands for the list of all characters belonging to that 273class. 274Standard character class names are: 275.Bl -column "alnum" "digit" "xdigit" -offset indent 276.It Em "alnum digit punct" 277.It Em "alpha graph space" 278.It Em "blank lower upper" 279.It Em "cntrl print xdigit" 280.El 281.Pp 282These stand for the character classes defined in 283.Xr ctype 3 . 284A locale may provide others. 285A character class may not be used as an endpoint of a range. 286.Pp 287A bracketed expression like 288.Ql [[:class:]] 289can be used to match a single character that belongs to a character 290class. 291The reverse, matching any character that does not belong to a specific 292class, the negation operator of bracket expressions may be used: 293.Ql [^[:class:]] . 294.Pp 295There are two special cases\(dd of bracket expressions: 296the bracket expressions 297.Ql [[:<:]] 298and 299.Ql [[:>:]] 300match the null string at the beginning and end of a word respectively. 301A word is defined as a sequence of word characters 302which is neither preceded nor followed by 303word characters. 304A word character is an 305.Em alnum 306character (as defined by 307.Xr ctype 3 ) 308or an underscore. 309This is an extension, 310compatible with but not specified by 311.St -p1003.2 , 312and should be used with 313caution in software intended to be portable to other systems. 314The additional word delimiters 315.Ql \e< 316and 317.Ql \e> 318are provided to ease compatibility with traditional 319SVR4 320systems but are not portable and should be avoided. 321.Pp 322In the event that an RE could match more than one substring of a given 323string, 324the RE matches the one starting earliest in the string. 325If the RE could match more than one substring starting at that point, 326it matches the longest. 327Subexpressions also match the longest possible substrings, subject to 328the constraint that the whole match be as long as possible, 329with subexpressions starting earlier in the RE taking priority over 330ones starting later. 331Note that higher-level subexpressions thus take priority over 332their lower-level component subexpressions. 333.Pp 334Match lengths are measured in characters, not collating elements. 335A null string is considered longer than no match at all. 336For example, 337.Ql bb* 338matches the three middle characters of 339.Ql abbbc , 340.Ql (wee|week)(knights|nights) 341matches all ten characters of 342.Ql weeknights , 343when 344.Ql (.*).*\& 345is matched against 346.Ql abc 347the parenthesized subexpression 348matches all three characters, and 349when 350.Ql (a*)* 351is matched against 352.Ql bc 353both the whole RE and the parenthesized 354subexpression match the null string. 355.Pp 356If case-independent matching is specified, 357the effect is much as if all case distinctions had vanished from the 358alphabet. 359When an alphabetic that exists in multiple cases appears as an 360ordinary character outside a bracket expression, it is effectively 361transformed into a bracket expression containing both cases, 362.No e.g. Ql x 363becomes 364.Ql [xX] . 365When it appears inside a bracket expression, all case counterparts 366of it are added to the bracket expression, so that (e.g.) 367.Ql [x] 368becomes 369.Ql [xX] 370and 371.Ql [^x] 372becomes 373.Ql [^xX] . 374.Pp 375No particular limit is imposed on the length of REs\(dd. 376Programs intended to be portable should not employ REs longer 377than 256 bytes, 378as an implementation can refuse to accept such REs and remain 379POSIX-compliant. 380.Pp 381Obsolete 382.Pq Dq basic 383regular expressions differ in several respects. 384.Ql \&| 385is an ordinary character and there is no equivalent 386for its functionality. 387.Ql \&+ 388and 389.Ql ?\& 390are ordinary characters, and their functionality 391can be expressed using bounds 392.Po 393.Ql {1,} 394or 395.Ql {0,1} 396respectively 397.Pc . 398Also note that 399.Ql x+ 400in modern REs is equivalent to 401.Ql xx* . 402The delimiters for bounds are 403.Ql \e{ 404and 405.Ql \e} , 406with 407.Ql \&{ 408and 409.Ql \&} 410by themselves ordinary characters. 411The parentheses for nested subexpressions are 412.Ql \e( 413and 414.Ql \e) , 415with 416.Ql \&( 417and 418.Ql \&) 419by themselves ordinary characters. 420.Ql \&^ 421is an ordinary character except at the beginning of the 422RE or\(dd the beginning of a parenthesized subexpression, 423.Ql \&$ 424is an ordinary character except at the end of the 425RE or\(dd the end of a parenthesized subexpression, 426and 427.Ql \&* 428is an ordinary character if it appears at the beginning of the 429RE or the beginning of a parenthesized subexpression 430(after a possible leading 431.Ql \&^ ) . 432Finally, there is one new type of atom, a 433.Em back reference : 434.Ql \e 435followed by a non-zero decimal digit 436.Em d 437matches the same sequence of characters 438matched by the 439.Em d Ns th 440parenthesized subexpression 441(numbering subexpressions by the positions of their opening parentheses, 442left to right), 443so that (e.g.) 444.Ql \e([bc]\e)\e1 445matches 446.Ql bb 447or 448.Ql cc 449but not 450.Ql bc . 451.Sh SEE ALSO 452.Xr regex 3 453.Rs 454.%T Regular Expression Notation 455.%R IEEE Std 456.%N 1003.2 457.%P section 2.8 458.Re 459.Sh BUGS 460Having two kinds of REs is a botch. 461.Pp 462The current 463.St -p1003.2 464spec says that 465.Ql \&) 466is an ordinary character in 467the absence of an unmatched 468.Ql \&( ; 469this was an unintentional result of a wording error, 470and change is likely. 471Avoid relying on it. 472.Pp 473Back references are a dreadful botch, 474posing major problems for efficient implementations. 475They are also somewhat vaguely defined 476(does 477.Ql a\e(\e(b\e)*\e2\e)*d 478match 479.Ql abbbd ? ) . 480Avoid using them. 481.Pp 482.St -p1003.2 483specification of case-independent matching is vague. 484The 485.Dq one case implies all cases 486definition given above 487is current consensus among implementors as to the right interpretation. 488.Pp 489The syntax for word boundaries is incredibly ugly. 490