Remove residual blank line at start of Makefile

    This is a residual of the $FreeBSD$ removal.

    MFC After: 3 days (though I'll just run the command on the branches)
    Sponsored by: Netflix
Allow -DNO_STRICT_REGEX to restore historic regex behavior

    Allow restoring the behavior of '{' as described in regex(3),
    i.e. only treat it as the start of bounds if followed by a digit.

    If NO_STRICT_REGEX is not defined, the behavior introduced by
    commit a4a801688c909ef39cbcbc3488bc4fdbabd69d66 is retained,
    otherwise the previous behavior is restored.

    Differential Revision: https://reviews.freebsd.org/D45134
regex: fix freeing g->charjump in low memory condition

    computejumps() moves g->charjump to a position relative to the value
    of CHAR_MIN. As such, g->charjump doesn't necessarily point to the
    address actually allocated. While regfree() takes that into account,
    the low memory handling in regcomp_internal() doesn't. Fix that by
    freeing the actually allocated address, as in regfree().

    MFC After: 2 weeks
    Reviewed by: imp, jrtc27
    Pull Request: https://github.com/freebsd/freebsd-src/pull/692
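The biased-pointer pattern described above can be sketched as follows. This is a minimal illustration and not the regcomp.c code: the names NC, jump_table_alloc() and jump_table_free() are hypothetical, and the only point shown is that the pointer handed to free() must be shifted back to the address the allocator returned.

```c
#include <limits.h>
#include <stdlib.h>

/* One slot per possible char value, CHAR_MIN..CHAR_MAX inclusive. */
#define NC	(CHAR_MAX - CHAR_MIN + 1)

/*
 * Allocate a zeroed table and return a pointer biased by -CHAR_MIN so it
 * can be indexed directly with a (possibly negative) char value.
 * Returns NULL on allocation failure.
 */
static int *
jump_table_alloc(void)
{
	int *base = calloc(NC, sizeof(*base));

	if (base == NULL)
		return (NULL);
	return (&base[-(CHAR_MIN)]);
}

/*
 * Release a table obtained from jump_table_alloc(). The biased pointer is
 * not what calloc() returned, so undo the bias before calling free().
 */
static void
jump_table_free(int *biased)
{
	if (biased != NULL)
		free(&biased[CHAR_MIN]);
}
```

Passing the biased pointer straight to free() would hand the allocator an address it never returned whenever CHAR_MIN is nonzero, which is the kind of mismatch the commit describes in the error path.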
regex: mixed sets are misidentified as singletons

    Fix the "singleton" function used by regcomp() to turn character set
    matches into exact character matches if a character set has exactly
    one element.

    The underlying cset representation is complex; most critically it
    records "small" characters (codepoint less than either 128 or 256
    depending on locale) in a bit vector, and "wide" characters in a
    secondary array.

    Unfortunately the "singleton" function used to identify singleton
    sets treated a cset as a singleton if either the "small" or the
    "wide" sets had exactly one element (it would then ignore the other
    set).

    The easiest way to demonstrate this bug:

        $ export LANG=C.UTF-8
        $ echo 'a' | grep '[abà]'

    It should match (and print "a") but instead it doesn't match because
    the single accented character in the set is misinterpreted as a
    singleton.

    Reviewed by: kevans, yuripv
    Obtained from: illumos
    Differential Revision: https://reviews.freebsd.org/D43149
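The grep demonstration above can also be driven through regcomp()/regexec() directly. The helper below is a sketch under assumptions: ere_matches() is a hypothetical name, and it only wraps the standard POSIX calls. The result for a mixed set such as '[abà]' depends on the locale the caller has selected and on whether the libc in use has the fix.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Compile `pattern` as an extended regular expression and test `input`.
 * Returns 1 on match, 0 on no match, -1 if the pattern fails to compile.
 */
static int
ere_matches(const char *pattern, const char *input)
{
	regex_t re;
	int rc;

	if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
		return (-1);
	rc = regexec(&re, input, 0, NULL, 0);
	regfree(&re);
	return (rc == 0 ? 1 : 0);
}
```

After a successful setlocale(LC_ALL, "C.UTF-8"), a call such as ere_matches("[abà]", "a") is expected to report a match on a libc that carries this fix.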
lib: Remove ancient SCCS tags.

    Remove ancient SCCS tags from the tree using automated scripting,
    with two minor fixups to keep things compiling. All the common forms
    in the tree were removed with a perl script.

    Sponsored by: Netflix
libc: Remove empty comments in Symbol.map

    These were left over from $FreeBSD$ removal.

    Reviewed by: emaste
    Differential Revision: https://reviews.freebsd.org/D42612
libc: Purge unneeded cdefs.h

    These sys/cdefs.h are not needed. Purge them. They are mostly
    left-over from the $FreeBSD$ removal. A few in libc are still
    required for macros that cdefs.h defines. Keep those.

    Sponsored by: Netflix
    Differential Revision: https://reviews.freebsd.org/D42385
regcomp: use unsigned char when testing for escapes

    - cast GETNEXT to unsigned where it is being promoted to int to
      prevent sign-extension (really it would have been better for
      PEEK*() and GETNEXT() to return unsigned char; this would have
      removed a ton of (uch) casts, but it is too intrusive for now).
    - fix an isalpha that should have been iswalpha

    PR: 264275, 274032
    Reviewed by: kevans, eugen (previous version)
    Obtained from: NetBSD
    Differential Revision: https://reviews.freebsd.org/D41947
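The sign-extension hazard named in the first item can be shown in isolation. This is a sketch, not the regcomp.c change: promote_naive() and promote_cast() are hypothetical helpers, and signed char is used explicitly to model platforms where plain char is signed.

```c
/*
 * Promote a pattern byte to int without a cast. For bytes above 0x7f
 * stored in a signed char this sign-extends and yields a negative int.
 */
static int
promote_naive(signed char c)
{
	return (c);
}

/*
 * Promote through unsigned char first, so every byte maps to 0..255.
 * This is the range the <ctype.h> classification functions expect.
 */
static int
promote_cast(signed char c)
{
	return ((unsigned char)c);
}
```

A byte such as 0xE9 is stored as -23 in a signed char: promote_naive() hands back -23, while promote_cast() hands back 233, which is safe to compare against character constants or pass to isalpha().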
Remove $FreeBSD$: one-line nroff pattern

    Remove /^\.\\"\s*\$FreeBSD\$$\n/
Remove $FreeBSD$: one-line sh pattern

    Remove /^\s*#[#!]?\s*\$FreeBSD\$.*$\n/
Remove $FreeBSD$: one-line .c pattern

    Remove /^[\s*]*__FBSDID\("\$FreeBSD\$"\);?\s*\n/
Remove $FreeBSD$: one-line .h pattern

    Remove /^\s*\*+\s*\$FreeBSD\$.*$\n/
libc: drop "All rights reserved" from Foundation copyrights

    This has already been done for most files that have the Foundation
    as the only listed copyright holder. Do it now for files that list
    multiple copyright holders, but have the Foundation copyright in its
    own section.

    Sponsored by: The FreeBSD Foundation
libc: Fix regexec when sizeof(char *) > sizeof(long)

    The states macro is the type for engine.c to use, with states1 being
    a local macro for regexec to use to determine whether it can use the
    small matcher or not (by comparing nstates and 8*sizeof(states1)).
    However, macro bodies are expanded in the context of their use, and
    so when regexec uses states1 it uses the current value of states,
    which is left over as char * from the large version (or, really, the
    multi-byte one, but that reuses large's states).

    For all supported architectures in FreeBSD, the two have the same
    size, and so this confusion is harmless. However, for architectures
    like CHERI where that is not the case (or Windows's LLP64 as
    discovered by LLVM and fixed in 2010 in 2e071faed8e2) and
    sizeof(char *) is bigger than sizeof(long), regexec will erroneously
    try to use the small matcher when nstates is between
    8*sizeof(long) and 8*sizeof(char *) (i.e. between 64 and 128 on
    CHERI, or 32 and 64 on LLP64) and end up overflowing the number of
    bits in the underlying long if it ever uses those high states. On
    weirder architectures where sizeof(long) is greater than
    sizeof(char *) this also fixes it to not fall back on the large
    matcher prematurely, but such architectures are likely limited to
    the embedded space, if they exist at all.

    Fix this by swapping round states and states1, so that states1 is
    defined directly as being long and states is an alias for it for the
    small matcher case.

    Found by: CHERI
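The macro behavior at the heart of this commit can be reproduced in a few lines. This is a simplified illustration rather than the regexec.c source: the b_ and f_ prefixed macro names and the two helper functions are hypothetical, standing in for the "before" and "after" orderings of states and states1.

```c
#include <stddef.h>

/* Before: states1 is an alias for states, captured "for later use". */
#define b_states	long
#define b_states1	b_states
#undef b_states
#define b_states	char *	/* redefined for the large matcher */

/*
 * b_states1 is expanded here, at its point of use, so it follows the
 * current definition of b_states and names char *, not long.
 */
static size_t
before_states1_size(void)
{
	return (sizeof(b_states1));
}

/* After: states1 is defined directly as long; states aliases it. */
#define f_states1	long
#define f_states	f_states1
#undef f_states
#define f_states	char *	/* redefined for the large matcher */

/* f_states1 no longer depends on whatever f_states currently means. */
static size_t
after_states1_size(void)
{
	return (sizeof(f_states1));
}
```

On LP64 systems both helpers return the same number, which is why the bug went unnoticed; they differ only where sizeof(char *) and sizeof(long) differ.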
libc: regex: rework unsafe pointer arithmetic

    regcomp.c uses the "start + count < end" idiom to check that there
    are "count" bytes available in an array of char "start" and "end"
    both point to.

    This is fine, unless "start + count" goes beyond the last element of
    the array. In this case, pedantic interpretation of the C standard
    makes the comparison of such a pointer against "end" undefined, and
    optimizers from hell will happily remove as much code as possible
    because of this.

    An example of this occurs in regcomp.c's bothcases(), which defines
    bracket[3], sets "next" to "bracket" and "end" to "bracket + 2".
    Then it invokes p_bracket(), which starts with
    "if (p->next + 5 < p->end)"...

    Because bothcases() and p_bracket() are static functions in
    regcomp.c, there is a real risk of miscompilation if aggressive
    inlining happens.

    The following diff rewrites the "start + count < end" constructs
    into "end - start > count". Assuming "end" and "start" are always
    pointing in the array (such as "bracket[3]" above), "end - start" is
    well-defined and can be compared without trouble.

    As a bonus, MORE2() implies MORE() therefore SEETWO() can be
    simplified a bit.

    PR: 252403
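The rewritten bounds check can be sketched on its own. The function name more_than() is hypothetical and this is not the regcomp.c macro set. It only shows the "end - start > count" form, which never forms a pointer outside the array.

```c
#include <stddef.h>

/*
 * Return nonzero if more than `count` bytes lie between `next` and `end`.
 * Both pointers must point into (or one past the end of) the same array,
 * so the subtraction is well-defined. The older form,
 * "next + count < end", could compute an out-of-bounds pointer.
 */
static int
more_than(const char *next, const char *end, ptrdiff_t count)
{
	return (end - next > count);
}
```

With the bracket[3] case from the message, next = bracket and end = bracket + 2: more_than(next, end, 5) is simply false, and no pointer past the array is ever computed.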
libc: regex: retire internal EMPTBR ("Empty branch present")

    It was realized just a little too late that this was a hack that
    belonged in individual regex(3)-using applications. It was
    surrounded in NOTYET and not implemented in the engine, so remove
    it.
libregex: implement \b and \B (word boundary, not word boundary)

    This is the last of the needed GNU expressions before we can unleash
    bsdgrep by default. \b is effectively an agnostic equivalent of \<
    and \>, while \B will match every space that isn't making a
    transition from nonchar -> char or char -> nonchar.
libregex: implement \` and \' (begin-of-subj, end-of-subj)

    These are GNU extensions, generally equivalent to ^ and $ except
    that the new syntax will not match beginning of line after the first
    in a multi-line expression or the end of line before absolute last
    in a multi-line expression.
libc: regex: factor out ISBOW/ISEOW macros

    These will be reused for \b (word boundary, which matches both
    sides).

    No functional change.
libregex: Implement a subset of the GNU extensions

    The entire patch-set is not yet mature enough for commit, but this
    usable subset is generally enough for googletest to be happy with
    and mostly map to some existing concepts, so they're not as
    invasive.

    The specific changes included here are:

    - Branching in BREs with \|
    - \w and \W for [[:alnum:]] and [^[:alnum:]] respectively
    - \s and \S for [[:space:]] and [^[:space:]] respectively
    - Additional quantifiers in BREs, \? and \+ (self-explanatory)

    There's some #ifdef'd out work for allowing empty branches as a
    match-all. This is a feature that's under assessment... future work
    will determine how standard this behavior is and act accordingly.
regex(3): belatedly document REG_POSIX from r363734

    My original patch documented this, but it appears that I failed to
    include the manpage update. Do so now.
regex(3): Interpret many escaped ordinary characters as EESCAPE

    In IEEE 1003.1-2008 [1] and earlier revisions, BRE/ERE grammar
    allows for any character to be escaped, but "ORD_CHAR preceded by an
    unescaped <backslash> character [gives undefined results]".

    Historically, we've interpreted an escaped ordinary character as the
    ordinary character itself. This becomes problematic when some
    extensions give special meanings to an otherwise ordinary character
    (e.g. GNU's \b, \s, \w), meaning we may have two different valid
    interpretations of the same sequence.

    To make this easier to deal with and given that the standard calls
    this undefined, we should throw an error (EESCAPE) if we run into
    this scenario to ease transition into a state where some escaped
    ordinaries are blessed with a special meaning -- it will either
    error out or have extended behavior, rather than have two entirely
    different versions of undefined behavior that leave the consumer of
    regex(3) guessing as to what behavior will be used or leaving them
    with false impressions.

    This change bumps the symbol version of regcomp to FBSD_1.6 and
    provides the old escape semantics for legacy applications, just in
    case one has an older application that would immediately turn into a
    pumpkin because of an extraneous escape that's embedded or otherwise
    critical to its operation.

    This is the final piece needed before enhancing libregex with GNU
    extensions and flipping the switch on bsdgrep.

    [1] http://pubs.opengroup.org/onlinepubs/9699919799.2016edition/

    PR: 229925 (exp-run, courtesy of antoine)
    Differential Revision: https://reviews.freebsd.org/D10510
lib/libc/regex: fix build with REDEBUG defined

    Reviewed by: kevans
    Differential Revision: https://reviews.freebsd.org/D21760
regcomp: revert part of r341838 which turned out to be unrelated
and caused issues with search in less.

    PR: 234066
    Reviewed by: pfg
    Differential revision: https://reviews.freebsd.org/D18611
regcomp: reduce size of bitmap for multibyte locales

    This fixes the obscure endless loop seen with case-insensitive
    patterns containing characters in the 128-255 range; originally
    found running the GNU grep test suite.

    Our regex implementation, being kludgy, translates each character in
    a case-insensitive pattern to a bracket expression containing both
    cases for the character. It doesn't correctly handle the case when
    the original character is in the bitmap and the other case is not,
    falling into an endless loop going through p_bracket(), ordinary(),
    and bothcases().

    Reducing the bitmap to the 0-127 range for multibyte locales solves
    this as none of these characters have other case mapping outside of
    the bitmap. We are also safe in the case when the original character
    outside of the bitmap has other case mapping in the bitmap (there
    are several of those in our current ctype maps having unidirectional
    mapping into the bitmap).

    Reviewed by: bapt, kevans, pfg
    Differential revision: https://reviews.freebsd.org/D18302