A Fast Method for Identifying Plain Text Files
==============================================


Introduction
------------

Given a file coming from an unknown source, it is sometimes desirable
to find out whether the format of that file is plain text. Although
this may appear like a simple task, a fully accurate detection of the
file type requires heavy-duty semantic analysis on the file contents.
It is, however, possible to obtain satisfactory results by employing
various heuristics.

Previous versions of PKZip and other zip-compatible compression tools
were using a crude detection scheme: if more than 80% (4/5) of the bytes
found in a certain buffer are within the range [7..127], the file is
labeled as plain text, otherwise it is labeled as binary. A prominent
limitation of this scheme is the restriction to Latin-based alphabets.
Other alphabets, like Greek, Cyrillic or Asian, make extensive use of
the bytes within the range [128..255], and texts using these alphabets
are most often misidentified by this scheme; in other words, the rate
of false negatives is sometimes too high, which means that the recall
is low. Another weakness of this scheme is a reduced precision, due to
the false positives that may occur when binary files containing large
amounts of textual characters are misidentified as plain text.
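
The older counting scheme described above can be sketched roughly as
follows. This is only an illustration, not the code of PKZip or any other
tool: the function name looks_like_text_legacy is hypothetical, and the
handling of an empty buffer is an assumption.

```python
def looks_like_text_legacy(buffer: bytes) -> bool:
    """Counting heuristic: text if more than 4/5 of the bytes are in [7..127]."""
    if not buffer:
        # Assumption: an empty buffer is not labeled as text.
        return False
    in_range = sum(1 for b in buffer if 7 <= b <= 127)
    # Integer form of in_range / len(buffer) > 4/5, avoiding floating point.
    return 5 * in_range > 4 * len(buffer)


latin = "The quick brown fox".encode("ascii")
greek = "αλφα βητα γαμμα".encode("iso8859_7")   # Greek text, single-byte encoding
mostly_text_binary = b"\x00\x01" + b"A" * 98    # binary data with a large text run

print(looks_like_text_legacy(latin))               # True
print(looks_like_text_legacy(greek))               # False (a false negative)
print(looks_like_text_legacy(mostly_text_binary))  # True  (a false positive)
```

The second and third calls illustrate the two weaknesses named above: the
Greek sample falls almost entirely in [128..255], and the binary sample
passes because only 2% of its bytes are outside [7..127].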

In this article we propose a new, simple detection scheme that features
a much increased precision and a near-100% recall. This scheme is
designed to work on ASCII, Unicode and other ASCII-derived alphabets,
and it handles single-byte encodings (ISO-8859, MacRoman, KOI8, etc.)
and variable-sized encodings (ISO-2022, UTF-8, etc.). Wider encodings
(UCS-2/UTF-16 and UCS-4/UTF-32) are not handled, however.


The Algorithm
-------------

The algorithm works by dividing the set of bytecodes [0..255] into three
categories:
- The white list of textual bytecodes:
  9 (TAB), 10 (LF), 13 (CR), 32 (SPACE) to 255.
- The gray list of tolerated bytecodes:
  7 (BEL), 8 (BS), 11 (VT), 12 (FF), 26 (SUB), 27 (ESC).
- The black list of undesired, non-textual bytecodes:
  0 (NUL) to 6, 14 to 25, 28 to 31.

If a file contains at least one byte that belongs to the white list and
no byte that belongs to the black list, then the file is categorized as
plain text; otherwise, it is categorized as binary. (The boundary case,
when the file is empty, automatically falls into the latter category.)


Rationale
---------

The idea behind this algorithm relies on two observations.

The first observation is that, although the full range of 7-bit codes
[0..127] is properly specified by the ASCII standard, most control
characters in the range [0..31] are not used in practice. The only
widely-used, almost universally-portable control codes are 9 (TAB),
10 (LF) and 13 (CR). There are a few more control codes that are
recognized on a reduced range of platforms and text viewers/editors:
7 (BEL), 8 (BS), 11 (VT), 12 (FF), 26 (SUB) and 27 (ESC); but these
codes are rarely (if ever) used alone, without being accompanied by
some printable text. Even the newer, portable text formats such as
XML avoid using control characters outside the list mentioned here.

The second observation is that most of the binary files tend to contain
control characters, especially 0 (NUL). Even though the older text
detection schemes observe the presence of non-ASCII codes from the range
[128..255], the precision rarely has to suffer if this upper range is
labeled as textual, because the files that are genuinely binary tend to
contain both control characters and codes from the upper range. On the
other hand, the upper range needs to be labeled as textual, because it
is used by virtually all ASCII extensions. In particular, this range is
used for encoding non-Latin scripts.
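
To make the role of the upper range concrete, the following small sketch
counts, for a few sample strings, how many bytes fall in [128..255] and
how many are NUL. The sample strings, the choice of encodings and the
helper name summarize are illustrative assumptions, not part of the
original experiments.

```python
from typing import Tuple


def summarize(data: bytes) -> Tuple[int, int]:
    """Return (number of bytes in [128..255], number of NUL bytes)."""
    upper = sum(1 for b in data if b >= 128)
    nuls = data.count(0)
    return upper, nuls


samples = {
    "ISO-8859-7 (Greek)": "αλφα βητα γαμμα".encode("iso8859_7"),
    "KOI8-R (Cyrillic)": "Привет, мир".encode("koi8_r"),
    "UTF-8 (Greek)": "αλφα βητα γαμμα".encode("utf-8"),
    "UTF-16-LE (Latin)": "Hello".encode("utf-16-le"),
}

for label, data in samples.items():
    upper, nuls = summarize(data)
    print(f"{label}: {len(data)} bytes, {upper} in [128..255], {nuls} NUL")
```

The single-byte and UTF-8 samples consist mostly of upper-range bytes and
contain no NUL, while the UTF-16 sample contains a NUL byte for every
Latin letter, which is consistent with wider encodings being out of scope
for this method.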

Since there is no counting involved, other than simply observing the
presence or the absence of some byte values, the algorithm produces
consistent results, regardless of what alphabet encoding is being used.
(If counting were involved, it could be possible to obtain different
results on a text encoded, say, using ISO-8859-16 versus UTF-8.)

There is an extra category of plain text files that are "polluted" with
one or more black-listed codes, either by mistake or by peculiar design
considerations. In such cases, a scheme that tolerates a small fraction
of black-listed codes would provide an increased recall (i.e. more true
positives). This, however, incurs a reduced precision overall, since
false positives are more likely to appear in binary files that contain
large chunks of textual data. Furthermore, "polluted" plain text should
be regarded as binary by general-purpose text detection schemes, because
general-purpose text processing algorithms might not be applicable.
Under this premise, it is safe to say that our detection method provides
a near-100% recall.

Experiments have been run on many files coming from various platforms
and applications. We tried plain text files, system logs, source code,
formatted office documents, compiled object code, etc. The results
confirm the optimistic assumptions about the capabilities of this
algorithm.
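
The classification rule from "The Algorithm" section can be sketched as
follows. This is a minimal illustration rather than a reference
implementation: the names WHITE, GRAY, BLACK and is_plain_text are
hypothetical, the whole input is assumed to be available in memory, and
26 (SUB) and 27 (ESC) are treated as gray-listed, not black-listed.

```python
# Byte categories, following the three lists in "The Algorithm".
WHITE = frozenset({9, 10, 13}) | frozenset(range(32, 256))
GRAY = frozenset({7, 8, 11, 12, 26, 27})
# 26 and 27 are on the gray list, so they are left out of the black list.
BLACK = (frozenset(range(0, 7))
         | frozenset(range(14, 26))
         | frozenset(range(28, 32)))


def is_plain_text(data: bytes) -> bool:
    """True if data has at least one white-listed byte and no black-listed byte."""
    seen_white = False
    for b in data:
        if b in BLACK:
            return False          # one black-listed byte is enough for "binary"
        if b in WHITE:
            seen_white = True     # gray-listed bytes are tolerated but do not count
    return seen_white             # empty or gray-only input is "binary"


print(is_plain_text(b"hello, world\n"))             # True
print(is_plain_text("Привет".encode("koi8_r")))     # True
print(is_plain_text("Привет".encode("utf-8")))      # True
print(is_plain_text(b"abc\x00def"))                 # False
print(is_plain_text(b"\x07\x1b"))                   # False (gray-listed bytes only)
print(is_plain_text(b""))                           # False (empty input)
```

Because the function only checks for the presence or absence of byte
values, the KOI8-R and UTF-8 encodings of the same text receive the same
answer, which is the consistency property argued for above.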


--
Cosmin Truta
Last updated: 2006-May-28