A Fast Method for Identifying Plain Text Files
==============================================


Introduction
------------

Given a file coming from an unknown source, it is sometimes desirable
to find out whether the format of that file is plain text.  Although
this may appear to be a simple task, a fully accurate detection of the
file type requires heavy-duty semantic analysis on the file contents.
It is, however, possible to obtain satisfactory results by employing
various heuristics.

Previous versions of PKZip and other zip-compatible compression tools
used a crude detection scheme: if more than 80% (4/5) of the bytes
found in a certain buffer are within the range [7..127], the file is
labeled as plain text; otherwise it is labeled as binary.  A prominent
limitation of this scheme is the restriction to Latin-based alphabets.
Other alphabets, like Greek, Cyrillic or Asian, make extensive use of
the bytes within the range [128..255], and texts using these alphabets
are most often misidentified by this scheme; in other words, the rate
of false negatives is sometimes too high, which means that the recall
is low.  Another weakness of this scheme is reduced precision, due to
the false positives that may occur when binary files containing large
amounts of textual characters are misidentified as plain text.
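
The counting heuristic described above can be sketched in Python as
follows.  This is only an illustration, not the actual PKZip code: the
function name, the sample buffers, and the choice to treat an empty
buffer as binary are all assumptions made for this sketch.

```python
def looks_like_text_old(buf: bytes) -> bool:
    """Old-style heuristic: text if more than 80% of bytes are in [7..127]."""
    if not buf:
        return False  # assumption: an empty buffer counts as binary
    in_range = sum(1 for b in buf if 7 <= b <= 127)
    # "more than 4/5", written with integers to avoid rounding issues
    return 5 * in_range > 4 * len(buf)


# Hypothetical sample inputs, for illustration only.
english = "The quick brown fox".encode("ascii")
# Greek text in a single-byte encoding: nearly every byte is >= 128.
greek = "ΚΑΛΗΜΕΡΑ ΚΟΣΜΕ".encode("iso8859_7")
# Mostly printable bytes, but clearly not text (contains NUL, 1, 2).
binary_with_text = b"\x00\x01\x02" + b"A" * 97

print(looks_like_text_old(english))           # accepted as text
print(looks_like_text_old(greek))             # rejected: a false negative
print(looks_like_text_old(binary_with_text))  # accepted: a false positive
```

The Greek sample shows the low recall on non-Latin text, and the last
sample shows how a buffer with a few control bytes can still pass.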

In this article we propose a new, simple detection scheme that features
a much increased precision and a near-100% recall.  This scheme is
designed to work on ASCII, Unicode and other ASCII-derived alphabets,
and it handles single-byte encodings (ISO-8859, MacRoman, KOI8, etc.)
and variable-sized encodings (ISO-2022, UTF-8, etc.).  Wider encodings
(UCS-2/UTF-16 and UCS-4/UTF-32) are not handled, however.


The Algorithm
-------------

The algorithm works by dividing the set of bytecodes [0..255] into three
categories:
- The white list of textual bytecodes:
  9 (TAB), 10 (LF), 13 (CR), 32 (SPACE) to 255.
- The gray list of tolerated bytecodes:
  7 (BEL), 8 (BS), 11 (VT), 12 (FF), 26 (SUB), 27 (ESC).
- The black list of undesired, non-textual bytecodes:
  0 (NUL) to 6, 14 to 25, 28 to 31.

If a file contains at least one byte that belongs to the white list and
no byte that belongs to the black list, then the file is categorized as
plain text; otherwise, it is categorized as binary.  (The boundary case,
when the file is empty, automatically falls into the latter category.)
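
The rule above can be sketched in a few lines of Python.  The names
`WHITE`, `GRAY`, `BLACK` and `is_plain_text` are chosen for this sketch
and are not taken from any existing implementation.  Since 26 (SUB) and
27 (ESC) are on the gray list, the black list is taken here to be
0..6, 14..25 and 28..31.

```python
WHITE = frozenset({9, 10, 13}) | frozenset(range(32, 256))
GRAY = frozenset({7, 8, 11, 12, 26, 27})
# Every remaining code is black-listed: 0..6, 14..25 and 28..31.
BLACK = frozenset(range(256)) - WHITE - GRAY


def is_plain_text(data: bytes) -> bool:
    """Text iff at least one white-listed byte and no black-listed byte."""
    seen = set(data)          # only presence matters, not counts
    if seen & BLACK:
        return False
    return bool(seen & WHITE)


print(is_plain_text(b"hello, world\r\n"))     # white-listed bytes only
print(is_plain_text(b"\x1b[1mbold\x1b[0m"))   # ESC is tolerated (gray)
print(is_plain_text(b"abc\x00def"))           # NUL is black-listed
print(is_plain_text(b"\x07\x1b"))             # gray only, no white byte
print(is_plain_text(b""))                     # empty file
```

Note that a buffer made only of gray-listed bytes is classified as
binary, because the rule requires at least one white-listed byte.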


Rationale
---------

The idea behind this algorithm relies on two observations.

The first observation is that, although the full range of 7-bit codes
[0..127] is properly specified by the ASCII standard, most control
characters in the range [0..31] are not used in practice.  The only
widely used, almost universally portable control codes are 9 (TAB),
10 (LF) and 13 (CR).  There are a few more control codes that are
recognized on a reduced range of platforms and text viewers/editors:
7 (BEL), 8 (BS), 11 (VT), 12 (FF), 26 (SUB) and 27 (ESC); but these
codes are rarely (if ever) used alone, without being accompanied by
some printable text.  Even the newer, portable text formats such as
XML avoid using control characters outside the list mentioned here.

The second observation is that most binary files tend to contain
control characters, especially 0 (NUL).  Even though the older text
detection schemes observe the presence of non-ASCII codes from the range
[128..255], the precision rarely has to suffer if this upper range is
labeled as textual, because the files that are genuinely binary tend to
contain both control characters and codes from the upper range.  On the
other hand, the upper range needs to be labeled as textual, because it
is used by virtually all ASCII extensions.  In particular, this range is
used for encoding non-Latin scripts.
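
To make the two observations concrete, the sketch below applies the
presence test to one short text in several encodings and to the first
bytes of a PNG file.  The helper name, the sample strings and the
choice of codecs are assumptions made for this illustration; the black
list is taken as 0..6, 14..25 and 28..31.

```python
BLACK = (frozenset(range(0, 7)) | frozenset(range(14, 26))
         | frozenset(range(28, 32)))
WHITE = frozenset({9, 10, 13}) | frozenset(range(32, 256))


def is_plain_text(data: bytes) -> bool:
    seen = set(data)
    return bool(seen & WHITE) and not (seen & BLACK)


samples = {
    # Cyrillic text: mostly upper-range bytes, no control codes but LF.
    "utf-8": "Привет, мир\n".encode("utf-8"),
    "koi8-r": "Привет, мир\n".encode("koi8_r"),
    # ISO-2022-JP switches character sets with ESC (27), a gray code.
    "iso-2022-jp": "日本語\n".encode("iso2022_jp"),
    # UTF-16 pads ASCII characters with NUL bytes, so it is not handled.
    "utf-16-le": "Hello\n".encode("utf-16-le"),
    # PNG signature followed by the start of the IHDR chunk.
    "png-header": b"\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR",
}

results = {name: is_plain_text(data) for name, data in samples.items()}
for name, verdict in results.items():
    print(f"{name:12s} {'text' if verdict else 'binary'}")
```

The PNG header contains an upper-range byte (0x89), a gray code (SUB)
and black-listed NUL bytes; only the NUL bytes decide the outcome.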

Since there is no counting involved, other than simply observing the
presence or the absence of some byte values, the algorithm produces
consistent results, regardless of which alphabet encoding is being used.
(If counting were involved, it would be possible to obtain different
results on a text encoded, say, using ISO-8859-16 versus UTF-8.)
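
A small Python experiment shows the effect.  It is a sketch under
stated assumptions: `old_scheme` models the 80% counting rule,
`new_scheme` models the presence test, and ISO-8859-1 stands in for
the single-byte encoding (any ISO-8859 part that contains the sample
characters would behave the same way).

```python
BLACK = (frozenset(range(0, 7)) | frozenset(range(14, 26))
         | frozenset(range(28, 32)))


def old_scheme(buf: bytes) -> bool:
    """Counting rule: more than 4/5 of the bytes lie in [7..127]."""
    in_range = sum(1 for b in buf if 7 <= b <= 127)
    return bool(buf) and 5 * in_range > 4 * len(buf)


def new_scheme(buf: bytes) -> bool:
    """Presence rule: some white-listed byte, no black-listed byte."""
    seen = set(buf)
    has_white = any(b in (9, 10, 13) or b >= 32 for b in seen)
    return has_white and not (seen & BLACK)


# "café crème ok": 11 ASCII characters and 2 accented ones
# (written with escapes so the code points are unambiguous).
text = "caf\u00e9 cr\u00e8me ok"
single_byte = text.encode("iso8859_1")   # 13 bytes, 11 of them < 128
multi_byte = text.encode("utf-8")        # 15 bytes, 11 of them < 128

for label, buf in (("ISO-8859-1", single_byte), ("UTF-8", multi_byte)):
    print(label, len(buf), old_scheme(buf), new_scheme(buf))
```

Here 11/13 is above 80% while 11/15 is below it, so the counting rule
changes its verdict with the encoding, whereas the presence rule
accepts both byte sequences.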

There is an extra category of plain text files that are "polluted" with
one or more black-listed codes, either by mistake or by peculiar design
considerations.  In such cases, a scheme that tolerates a small fraction
of black-listed codes would provide an increased recall (i.e. more true
positives).  This, however, comes at the cost of reduced precision
overall, since false positives are more likely to appear in binary files
that contain large chunks of textual data.  Furthermore, "polluted"
plain text should be regarded as binary by general-purpose text
detection schemes, because general-purpose text processing algorithms
might not be applicable.  Under this premise, it is safe to say that our
detection method provides a near-100% recall.
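
The trade-off can be illustrated with a hypothetical tolerant variant.
Everything in this sketch is an assumption made for illustration: the
function names, the 1% threshold and the two sample buffers.  It is
not a scheme proposed by this article.

```python
BLACK = (frozenset(range(0, 7)) | frozenset(range(14, 26))
         | frozenset(range(28, 32)))


def is_text_strict(buf: bytes) -> bool:
    """The presence test: no black-listed byte is allowed at all."""
    seen = set(buf)
    has_white = any(b in (9, 10, 13) or b >= 32 for b in seen)
    return has_white and not (seen & BLACK)


def is_text_tolerant(buf: bytes, max_black_fraction: float = 0.01) -> bool:
    """Hypothetical variant that allows a small share of black codes."""
    if not buf:
        return False
    black = sum(1 for b in buf if b in BLACK)
    has_white = any(b in (9, 10, 13) or b >= 32 for b in buf)
    return has_white and black <= max_black_fraction * len(buf)


# A log file with one stray NUL byte ("polluted" text).
polluted_text = b"log line one\n" * 50 + b"\x00"
# A binary blob: a short binary header followed by a long string table.
binary_blob = b"\x00\x01\x02\x03" + b"embedded strings table " * 40

for name, buf in (("polluted", polluted_text), ("blob", binary_blob)):
    print(name, is_text_strict(buf), is_text_tolerant(buf))
```

The tolerant variant recovers the polluted log, but it also accepts
the binary blob, which is exactly the loss of precision described
above.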

Experiments have been run on many files coming from various platforms
and applications.  We tried plain text files, system logs, source code,
formatted office documents, compiled object code, etc.  The results
confirm the optimistic assumptions about the capabilities of this
algorithm.


--
Cosmin Truta
Last updated: 2006-May-28