# From hankedr@dms.auburn.edu Sun Jan 28 12:25:43 2001
# Received: from mail.actcom.co.il [192.114.47.13]
#     by localhost with POP3 (fetchmail-5.5.0)
#     for arnold@localhost (single-drop); Sun, 28 Jan 2001 12:25:43 +0200 (IST)
# Received: by actcom.co.il (mbox arobbins)
#     (with Cubic Circle's cucipop (v1.31 1998/05/13) Sun Jan 28 12:27:08 2001)
# X-From_: hankedr@dms.auburn.edu Sat Jan 27 15:15:57 2001
# Received: from lmail.actcom.co.il by actcom.co.il with ESMTP
#     (8.9.1a/actcom-0.2) id PAA23801 for <arobbins@actcom.co.il>;
#     Sat, 27 Jan 2001 15:15:55 +0200 (EET)
#     (rfc931-sender: lmail.actcom.co.il [192.114.47.13])
# Received: from billohost.com (www.billohost.com [209.196.35.10])
#     by lmail.actcom.co.il (8.9.3/8.9.1) with ESMTP id PAA15998
#     for <arobbins@actcom.co.il>; Sat, 27 Jan 2001 15:16:27 +0200
# Received: from yak.dms.auburn.edu (yak.dms.auburn.edu [131.204.53.2])
#     by billohost.com (8.9.3/8.9.3) with ESMTP id IAA00467
#     for <arnold@skeeve.com>; Sat, 27 Jan 2001 08:15:52 -0500
# Received: (from hankedr@localhost)
#     by yak.dms.auburn.edu (8.9.3/8.9.3/Debian/GNU) id HAA24441;
#     Sat, 27 Jan 2001 07:15:44 -0600
# Date: Sat, 27 Jan 2001 07:15:44 -0600
# Message-Id: <200101271315.HAA24441@yak.dms.auburn.edu>
# From: Darrel Hankerson <hankedr@dms.auburn.edu>
# To: arnold@skeeve.com
# Subject: [stolfi@ic.unicamp.br: Bug in [...]* matching with acute-u]
# Mime-Version: 1.0 (generated by tm-edit 7.106)
# Content-Type: message/rfc822
# Status: R
#
# From: Jorge Stolfi <stolfi@ic.unicamp.br>
# To: bug-gnu-utils@gnu.org
# Subject: Bug in [...]* matching with acute-u
# MIME-Version: 1.0
# Reply-To: stolfi@ic.unicamp.br
# X-MIME-Autoconverted: from 8bit to quoted-printable by grande.dcc.unicamp.br id GAA10716
# Sender: bug-gnu-utils-admin@gnu.org
# Errors-To: bug-gnu-utils-admin@gnu.org
# X-BeenThere: bug-gnu-utils@gnu.org
# X-Mailman-Version: 2.0
# Precedence: bulk
# List-Help: <mailto:bug-gnu-utils-request@gnu.org?subject=help>
# List-Post: <mailto:bug-gnu-utils@gnu.org>
# List-Subscribe: <http://mail.gnu.org/mailman/listinfo/bug-gnu-utils>,
#     <mailto:bug-gnu-utils-request@gnu.org?subject=subscribe>
# List-Id: Bug reports for the GNU utilities <bug-gnu-utils.gnu.org>
# List-Unsubscribe: <http://mail.gnu.org/mailman/listinfo/bug-gnu-utils>,
#     <mailto:bug-gnu-utils-request@gnu.org?subject=unsubscribe>
# List-Archive: <http://mail.gnu.org/pipermail/bug-gnu-utils/>
# Date: Sat, 27 Jan 2001 06:46:11 -0200 (EDT)
# Content-Transfer-Encoding: 8bit
# X-MIME-Autoconverted: from quoted-printable to 8bit by manatee.dms.auburn.edu id CAA14936
# Content-Type: text/plain; charset=iso-8859-1
#     <mailto:bug-gnu-utils-request@gnu.org?subject=subscribe>
#     <mailto:bug-gnu-utils-request@gnu.org?subject=uns
# Content-Length: 3137
#
#
#
# Hi,
#
# I think I have run into a bug in gawk's handling of REs of the
# form [...]* when the bracketed list includes certain 8-bit characters,
# specifically u-acute (octal \372).
#
# The problem occurs in GNU Awk 3.0.4, both under
# Linux 2.2.14-5.0 (intel i686) and SunOS 5.5 (Sun sparc).
#
# Here is a program that illustrates the bug, and its output.
# The first two lines of the output should be equal, shouldn't they?
#
# ----------------------------------------------------------------------
#! /usr/bin/gawk -f

BEGIN {
  s = "bananas and ananases in canaan";
  t = s; gsub(/[an]*n/, "AN", t); printf "%-8s %s\n", "[an]*n", t;
  t = s; gsub(/[anú]*n/, "AN", t); printf "%-8s %s\n", "[anú]*n", t;
  print "";
  t = s; gsub(/[aú]*n/, "AN", t); printf "%-8s %s\n", "[aú]*n", t;
  print "";
  t = s; gsub(/[an]n/, "AN", t); printf "%-8s %s\n", "[an]n", t;
  t = s; gsub(/[aú]n/, "AN", t); printf "%-8s %s\n", "[aú]n", t;
  t = s; gsub(/[anú]n/, "AN", t); printf "%-8s %s\n", "[anú]n", t;
  print "";
  t = s; gsub(/[an]?n/, "AN", t); printf "%-8s %s\n", "[an]?n", t;
  t = s; gsub(/[aú]?n/, "AN", t); printf "%-8s %s\n", "[aú]?n", t;
  t = s; gsub(/[anú]?n/, "AN", t); printf "%-8s %s\n", "[anú]?n", t;
  print "";
  t = s; gsub(/[an]+n/, "AN", t); printf "%-8s %s\n", "[an]+n", t;
  t = s; gsub(/[aú]+n/, "AN", t); printf "%-8s %s\n", "[aú]+n", t;
  t = s; gsub(/[anú]+n/, "AN", t); printf "%-8s %s\n", "[anú]+n", t;
}
# ----------------------------------------------------------------------
# [an]*n   bANas ANd ANases iAN cAN
# [anú]*n  bananas and ananases in canaan
#
# [aú]*n   bANANas ANd ANANases iAN cANAN
#
# [an]n    bANANas ANd ANANases in cANaAN
# [aú]n    bANANas ANd ANANases in cANaAN
# [anú]n   bANANas ANd ANANases in cANaAN
#
# [an]?n   bANANas ANd ANANases iAN cANaAN
# [aú]?n   bANANas ANd ANANases iAN cANaAN
# [anú]?n  bANANas ANd ANANases iAN cANaAN
#
# [an]+n   bANas ANd ANases in cAN
# [aú]+n   bANANas ANd ANANases in cANAN
# [anú]+n  bananas and ananases in canaan
# ----------------------------------------------------------------------
#
# Apparently the problem is specific to u-acute; I've tried several
# other 8-bit characters and they seem to behave as expected.
#
# By comparing the second and third output lines, it would seem that the
# problem involves backtracking out of a partial match of [...]* in
# order to match the next sub-expression, when the latter begins with
# one of the given characters.
#
#
# All the best,
#
# --stolfi
#
# ------------------------------------------------------------------------
# Jorge Stolfi | http://www.dcc.unicamp.br/~stolfi | stolfi@dcc.unicamp.br
# Institute of Computing (formerly DCC-IMECC) | Wrk +55 (19)3788-5858
# Universidade Estadual de Campinas (UNICAMP) | +55 (19)3788-5840
# Av. Albert Einstein 1251 - Caixa Postal 6176 | Fax +55 (19)3788-5847
# 13083-970 Campinas, SP -- Brazil | Hom +55 (19)3287-4069
# ------------------------------------------------------------------------
#
# _______________________________________________
# Bug-gnu-utils mailing list
# Bug-gnu-utils@gnu.org
# http://mail.gnu.org/mailman/listinfo/bug-gnu-utils
#
#