txtvsbin.txt (7648bc9fee8dec6cb3c4941e0165a930fbe8dcb0) | txtvsbin.txt (cd8822075a38d0734e74b1735e4b5dbef9789170) |
---|---|
1A Fast Method for Identifying Plain Text Files 2============================================== 3 4 5Introduction 6------------ 7 8Given a file coming from an unknown source, it is sometimes desirable --- 24 unchanged lines hidden --- 33(UCS-2/UTF-16 and UCS-4/UTF-32) are not handled, however. 34 35 36The Algorithm 37------------- 38 39The algorithm works by dividing the set of bytecodes [0..255] into three 40categories: | 1A Fast Method for Identifying Plain Text Files 2============================================== 3 4 5Introduction 6------------ 7 8Given a file coming from an unknown source, it is sometimes desirable --- 24 unchanged lines hidden --- 33(UCS-2/UTF-16 and UCS-4/UTF-32) are not handled, however. 34 35 36The Algorithm 37------------- 38 39The algorithm works by dividing the set of bytecodes [0..255] into three 40categories: |
41- The white list of textual bytecodes: | 41- The allow list of textual bytecodes: |
42 9 (TAB), 10 (LF), 13 (CR), 32 (SPACE) to 255. 43- The gray list of tolerated bytecodes: 44 7 (BEL), 8 (BS), 11 (VT), 12 (FF), 26 (SUB), 27 (ESC). | 42 9 (TAB), 10 (LF), 13 (CR), 32 (SPACE) to 255. 43- The gray list of tolerated bytecodes: 44 7 (BEL), 8 (BS), 11 (VT), 12 (FF), 26 (SUB), 27 (ESC). |
45- The black list of undesired, non-textual bytecodes: | 45- The block list of undesired, non-textual bytecodes: |
46 0 (NUL) to 6, 14 to 31. 47 | 46 0 (NUL) to 6, 14 to 31. 47 |
48If a file contains at least one byte that belongs to the white list and 49no byte that belongs to the black list, then the file is categorized as | 48If a file contains at least one byte that belongs to the allow list and 49no byte that belongs to the block list, then the file is categorized as |
50plain text; otherwise, it is categorized as binary. (The boundary case, 51when the file is empty, automatically falls into the latter category.) 52 53 54Rationale 55--------- 56 57The idea behind this algorithm relies on two observations. --- 21 unchanged lines hidden --- 79 80Since there is no counting involved, other than simply observing the 81presence or the absence of some byte values, the algorithm produces 82consistent results, regardless what alphabet encoding is being used. 83(If counting were involved, it could be possible to obtain different 84results on a text encoded, say, using ISO-8859-16 versus UTF-8.) 85 86There is an extra category of plain text files that are "polluted" with | 50plain text; otherwise, it is categorized as binary. (The boundary case, 51when the file is empty, automatically falls into the latter category.) 52 53 54Rationale 55--------- 56 57The idea behind this algorithm relies on two observations. --- 21 unchanged lines hidden --- 79 80Since there is no counting involved, other than simply observing the 81presence or the absence of some byte values, the algorithm produces 82consistent results, regardless what alphabet encoding is being used. 83(If counting were involved, it could be possible to obtain different 84results on a text encoded, say, using ISO-8859-16 versus UTF-8.) 85 86There is an extra category of plain text files that are "polluted" with |
87one or more black-listed codes, either by mistake or by peculiar design | 87one or more block-listed codes, either by mistake or by peculiar design |
88considerations. In such cases, a scheme that tolerates a small fraction | 88considerations. In such cases, a scheme that tolerates a small fraction |
89of black-listed codes would provide an increased recall (i.e. more true | 89of block-listed codes would provide an increased recall (i.e. more true |
90positives). This, however, incurs a reduced precision overall, since 91false positives are more likely to appear in binary files that contain 92large chunks of textual data. Furthermore, "polluted" plain text should 93be regarded as binary by general-purpose text detection schemes, because 94general-purpose text processing algorithms might not be applicable. 95Under this premise, it is safe to say that our detection method provides 96a near-100% recall. 97 98Experiments have been run on many files coming from various platforms 99and applications. We tried plain text files, system logs, source code, 100formatted office documents, compiled object code, etc. The results 101confirm the optimistic assumptions about the capabilities of this 102algorithm. 103 104 105-- 106Cosmin Truta 107Last updated: 2006-May-28 | 90positives). This, however, incurs a reduced precision overall, since 91false positives are more likely to appear in binary files that contain 92large chunks of textual data. Furthermore, "polluted" plain text should 93be regarded as binary by general-purpose text detection schemes, because 94general-purpose text processing algorithms might not be applicable. 95Under this premise, it is safe to say that our detection method provides 96a near-100% recall. 97 98Experiments have been run on many files coming from various platforms 99and applications. We tried plain text files, system logs, source code, 100formatted office documents, compiled object code, etc. The results 101confirm the optimistic assumptions about the capabilities of this 102algorithm. 103 104 105-- 106Cosmin Truta 107Last updated: 2006-May-28 |
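
For reviewers who want to sanity-check the rule described in "The Algorithm", the following is a minimal Python sketch of the three-way classification, not the implementation this file documents. The names `ALLOW`, `GRAY`, `BLOCK`, and `is_plain_text` are hypothetical. One assumption is made explicit: the stated block range "14 to 31" overlaps the gray-list codes 26 (SUB) and 27 (ESC), and the sketch treats those two as gray (tolerated), so that the three lists partition [0..255].

```python
# Byte categories as listed in the document. All names here are illustrative.
ALLOW = frozenset({9, 10, 13}) | frozenset(range(32, 256))  # TAB, LF, CR, SPACE..255
GRAY = frozenset({7, 8, 11, 12, 26, 27})                    # BEL, BS, VT, FF, SUB, ESC
# Assumption: gray-listed codes 26 and 27 are excluded from the "14 to 31" range.
BLOCK = (frozenset(range(0, 7)) | frozenset(range(14, 32))) - GRAY


def is_plain_text(data: bytes) -> bool:
    """Return True if data has at least one allow-listed byte and no
    block-listed byte; otherwise return False (binary)."""
    seen = set(data)  # only presence matters, so no counting is needed
    if seen & BLOCK:
        return False
    # Empty input and input made only of gray-listed bytes fall through
    # to "binary", matching the boundary case described in the text.
    return bool(seen & ALLOW)


if __name__ == "__main__":
    print(is_plain_text(b"hello, world\n"))    # ordinary text
    print(is_plain_text(b"\x7fELF\x02\x01"))   # contains block-listed bytes
    print(is_plain_text(b""))                  # empty input
```

Because the decision depends only on which byte values are present, the same check can be applied chunk by chunk to a large file by accumulating the set of seen values, stopping early as soon as a block-listed byte appears.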