xref: /freebsd/sys/contrib/zlib/doc/rfc1951.txt (revision 5f4c09dd85bff675e0ca63c55ea3c517e0fddfcc)
1
2
3
4
5
6
7Network Working Group                                         P. Deutsch
8Request for Comments: 1951                           Aladdin Enterprises
9Category: Informational                                         May 1996
10
11
12        DEFLATE Compressed Data Format Specification version 1.3
13
14Status of This Memo
15
16   This memo provides information for the Internet community.  This memo
17   does not specify an Internet standard of any kind.  Distribution of
18   this memo is unlimited.
19
20IESG Note:
21
22   The IESG takes no position on the validity of any Intellectual
23   Property Rights statements contained in this document.
24
25Notices
26
27   Copyright (c) 1996 L. Peter Deutsch
28
29   Permission is granted to copy and distribute this document for any
30   purpose and without charge, including translations into other
31   languages and incorporation into compilations, provided that the
32   copyright notice and this notice are preserved, and that any
33   substantive changes or deletions from the original are clearly
34   marked.
35
36   A pointer to the latest version of this and related documentation in
37   HTML format can be found at the URL
38   <ftp://ftp.uu.net/graphics/png/documents/zlib/zdoc-index.html>.
39
40Abstract
41
42   This specification defines a lossless compressed data format that
43   compresses data using a combination of the LZ77 algorithm and Huffman
44   coding, with efficiency comparable to the best currently available
45   general-purpose compression methods.  The data can be produced or
46   consumed, even for an arbitrarily long sequentially presented input
47   data stream, using only an a priori bounded amount of intermediate
48   storage.  The format can be implemented readily in a manner not
49   covered by patents.
50
51
52
53
54
55
56
57
58Deutsch                      Informational                      [Page 1]
59
60RFC 1951      DEFLATE Compressed Data Format Specification      May 1996
61
62
63Table of Contents
64
65   1. Introduction ................................................... 2
66      1.1. Purpose ................................................... 2
67      1.2. Intended audience ......................................... 3
68      1.3. Scope ..................................................... 3
69      1.4. Compliance ................................................ 3
70      1.5.  Definitions of terms and conventions used ................ 3
71      1.6. Changes from previous versions ............................ 4
72   2. Compressed representation overview ............................. 4
73   3. Detailed specification ......................................... 5
74      3.1. Overall conventions ....................................... 5
75          3.1.1. Packing into bytes .................................. 5
76      3.2. Compressed block format ................................... 6
77          3.2.1. Synopsis of prefix and Huffman coding ............... 6
78          3.2.2. Use of Huffman coding in the "deflate" format ....... 7
79          3.2.3. Details of block format ............................. 9
80          3.2.4. Non-compressed blocks (BTYPE=00) ................... 11
81          3.2.5. Compressed blocks (length and distance codes) ...... 11
82          3.2.6. Compression with fixed Huffman codes (BTYPE=01) .... 12
83          3.2.7. Compression with dynamic Huffman codes (BTYPE=10) .. 13
84      3.3. Compliance ............................................... 14
85   4. Compression algorithm details ................................. 14
86   5. References .................................................... 16
87   6. Security Considerations ....................................... 16
88   7. Source code ................................................... 16
89   8. Acknowledgements .............................................. 16
90   9. Author's Address .............................................. 17
91
921. Introduction
93
94   1.1. Purpose
95
96      The purpose of this specification is to define a lossless
97      compressed data format that:
98          * Is independent of CPU type, operating system, file system,
99            and character set, and hence can be used for interchange;
100          * Can be produced or consumed, even for an arbitrarily long
101            sequentially presented input data stream, using only an a
102            priori bounded amount of intermediate storage, and hence
103            can be used in data communications or similar structures
104            such as Unix filters;
105          * Compresses data with efficiency comparable to the best
106            currently available general-purpose compression methods,
107            and in particular considerably better than the "compress"
108            program;
109          * Can be implemented readily in a manner not covered by
110            patents, and hence can be practiced freely;
111
112
113
114Deutsch                      Informational                      [Page 2]
115
116RFC 1951      DEFLATE Compressed Data Format Specification      May 1996
117
118
119          * Is compatible with the file format produced by the current
120            widely used gzip utility, in that conforming decompressors
121            will be able to read data produced by the existing gzip
122            compressor.
123
124      The data format defined by this specification does not attempt to:
125
126          * Allow random access to compressed data;
127          * Compress specialized data (e.g., raster graphics) as well
128            as the best currently available specialized algorithms.
129
130      A simple counting argument shows that no lossless compression
131      algorithm can compress every possible input data set.  For the
132      format defined here, the worst case expansion is 5 bytes per 32K-
133      byte block, i.e., a size increase of 0.015% for large data sets.
134      English text usually compresses by a factor of 2.5 to 3;
135      executable files usually compress somewhat less; graphical data
136      such as raster images may compress much more.
137
138   1.2. Intended audience
139
140      This specification is intended for use by implementors of software
141      to compress data into "deflate" format and/or decompress data from
142      "deflate" format.
143
144      The text of the specification assumes a basic background in
145      programming at the level of bits and other primitive data
146      representations.  Familiarity with the technique of Huffman coding
147      is helpful but not required.
148
149   1.3. Scope
150
151      The specification specifies a method for representing a sequence
152      of bytes as a (usually shorter) sequence of bits, and a method for
153      packing the latter bit sequence into bytes.
154
155   1.4. Compliance
156
157      Unless otherwise indicated below, a compliant decompressor must be
158      able to accept and decompress any data set that conforms to all
159      the specifications presented here; a compliant compressor must
160      produce data sets that conform to all the specifications presented
161      here.
162
163   1.5.  Definitions of terms and conventions used
164
165      Byte: 8 bits stored or transmitted as a unit (same as an octet).
166      For this specification, a byte is exactly 8 bits, even on machines
167
168
169
170Deutsch                      Informational                      [Page 3]
171
172RFC 1951      DEFLATE Compressed Data Format Specification      May 1996
173
174
175      which store a character on a number of bits different from eight.
176      See below, for the numbering of bits within a byte.
177
178      String: a sequence of arbitrary bytes.
179
180   1.6. Changes from previous versions
181
182      There have been no technical changes to the deflate format since
183      version 1.1 of this specification.  In version 1.2, some
184      terminology was changed.  Version 1.3 is a conversion of the
185      specification to RFC style.
186
1872. Compressed representation overview
188
189   A compressed data set consists of a series of blocks, corresponding
190   to successive blocks of input data.  The block sizes are arbitrary,
191   except that non-compressible blocks are limited to 65,535 bytes.
192
193   Each block is compressed using a combination of the LZ77 algorithm
194   and Huffman coding. The Huffman trees for each block are independent
195   of those for previous or subsequent blocks; the LZ77 algorithm may
196   use a reference to a duplicated string occurring in a previous block,
197   up to 32K input bytes before.
198
199   Each block consists of two parts: a pair of Huffman code trees that
200   describe the representation of the compressed data part, and a
201   compressed data part.  (The Huffman trees themselves are compressed
202   using Huffman encoding.)  The compressed data consists of a series of
203   elements of two types: literal bytes (of strings that have not been
204   detected as duplicated within the previous 32K input bytes), and
205   pointers to duplicated strings, where a pointer is represented as a
206   pair <length, backward distance>.  The representation used in the
207   "deflate" format limits distances to 32K bytes and lengths to 258
208   bytes, but does not limit the size of a block, except for
209   uncompressible blocks, which are limited as noted above.
210
211   Each type of value (literals, distances, and lengths) in the
212   compressed data is represented using a Huffman code, using one code
213   tree for literals and lengths and a separate code tree for distances.
214   The code trees for each block appear in a compact form just before
215   the compressed data for that block.
216
217
218
219
220
221
222
223
224
225
226Deutsch                      Informational                      [Page 4]
227
228RFC 1951      DEFLATE Compressed Data Format Specification      May 1996
229
230
2313. Detailed specification
232
233   3.1. Overall conventions In the diagrams below, a box like this:
234
235         +---+
236         |   | <-- the vertical bars might be missing
237         +---+
238
239      represents one byte; a box like this:
240
241         +==============+
242         |              |
243         +==============+
244
245      represents a variable number of bytes.
246
247      Bytes stored within a computer do not have a "bit order", since
248      they are always treated as a unit.  However, a byte considered as
249      an integer between 0 and 255 does have a most- and least-
250      significant bit, and since we write numbers with the most-
251      significant digit on the left, we also write bytes with the most-
252      significant bit on the left.  In the diagrams below, we number the
253      bits of a byte so that bit 0 is the least-significant bit, i.e.,
254      the bits are numbered:
255
256         +--------+
257         |76543210|
258         +--------+
259
260      Within a computer, a number may occupy multiple bytes.  All
261      multi-byte numbers in the format described here are stored with
262      the least-significant byte first (at the lower memory address).
263      For example, the decimal number 520 is stored as:
264
265             0        1
266         +--------+--------+
267         |00001000|00000010|
268         +--------+--------+
269          ^        ^
270          |        |
271          |        + more significant byte = 2 x 256
272          + less significant byte = 8
273
274      3.1.1. Packing into bytes
275
276         This document does not address the issue of the order in which
277         bits of a byte are transmitted on a bit-sequential medium,
278         since the final data format described here is byte- rather than
279
280
281
282Deutsch                      Informational                      [Page 5]
283
284RFC 1951      DEFLATE Compressed Data Format Specification      May 1996
285
286
287         bit-oriented.  However, we describe the compressed block format
288         in below, as a sequence of data elements of various bit
289         lengths, not a sequence of bytes.  We must therefore specify
290         how to pack these data elements into bytes to form the final
291         compressed byte sequence:
292
293             * Data elements are packed into bytes in order of
294               increasing bit number within the byte, i.e., starting
295               with the least-significant bit of the byte.
296             * Data elements other than Huffman codes are packed
297               starting with the least-significant bit of the data
298               element.
299             * Huffman codes are packed starting with the most-
300               significant bit of the code.
301
302         In other words, if one were to print out the compressed data as
303         a sequence of bytes, starting with the first byte at the
304         *right* margin and proceeding to the *left*, with the most-
305         significant bit of each byte on the left as usual, one would be
306         able to parse the result from right to left, with fixed-width
307         elements in the correct MSB-to-LSB order and Huffman codes in
308         bit-reversed order (i.e., with the first bit of the code in the
309         relative LSB position).
310
311   3.2. Compressed block format
312
313      3.2.1. Synopsis of prefix and Huffman coding
314
315         Prefix coding represents symbols from an a priori known
316         alphabet by bit sequences (codes), one code for each symbol, in
317         a manner such that different symbols may be represented by bit
318         sequences of different lengths, but a parser can always parse
319         an encoded string unambiguously symbol-by-symbol.
320
321         We define a prefix code in terms of a binary tree in which the
322         two edges descending from each non-leaf node are labeled 0 and
323         1 and in which the leaf nodes correspond one-for-one with (are
324         labeled with) the symbols of the alphabet; then the code for a
325         symbol is the sequence of 0's and 1's on the edges leading from
326         the root to the leaf labeled with that symbol.  For example:
327
328
329
330
331
332
333
334
335
336
337
338Deutsch                      Informational                      [Page 6]
339
340RFC 1951      DEFLATE Compressed Data Format Specification      May 1996
341
342
343                          /\              Symbol    Code
344                         0  1             ------    ----
345                        /    \                A      00
346                       /\     B               B       1
347                      0  1                    C     011
348                     /    \                   D     010
349                    A     /\
350                         0  1
351                        /    \
352                       D      C
353
354         A parser can decode the next symbol from an encoded input
355         stream by walking down the tree from the root, at each step
356         choosing the edge corresponding to the next input bit.
357
358         Given an alphabet with known symbol frequencies, the Huffman
359         algorithm allows the construction of an optimal prefix code
360         (one which represents strings with those symbol frequencies
361         using the fewest bits of any possible prefix codes for that
362         alphabet).  Such a code is called a Huffman code.  (See
363         reference [1] in Chapter 5, references for additional
364         information on Huffman codes.)
365
366         Note that in the "deflate" format, the Huffman codes for the
367         various alphabets must not exceed certain maximum code lengths.
368         This constraint complicates the algorithm for computing code
369         lengths from symbol frequencies.  Again, see Chapter 5,
370         references for details.
371
372      3.2.2. Use of Huffman coding in the "deflate" format
373
374         The Huffman codes used for each alphabet in the "deflate"
375         format have two additional rules:
376
377             * All codes of a given bit length have lexicographically
378               consecutive values, in the same order as the symbols
379               they represent;
380
381             * Shorter codes lexicographically precede longer codes.
382
383
384
385
386
387
388
389
390
391
392
393
394Deutsch                      Informational                      [Page 7]
395
396RFC 1951      DEFLATE Compressed Data Format Specification      May 1996
397
398
399         We could recode the example above to follow this rule as
400         follows, assuming that the order of the alphabet is ABCD:
401
402            Symbol  Code
403            ------  ----
404            A       10
405            B       0
406            C       110
407            D       111
408
409         I.e., 0 precedes 10 which precedes 11x, and 110 and 111 are
410         lexicographically consecutive.
411
412         Given this rule, we can define the Huffman code for an alphabet
413         just by giving the bit lengths of the codes for each symbol of
414         the alphabet in order; this is sufficient to determine the
415         actual codes.  In our example, the code is completely defined
416         by the sequence of bit lengths (2, 1, 3, 3).  The following
417         algorithm generates the codes as integers, intended to be read
418         from most- to least-significant bit.  The code lengths are
419         initially in tree[I].Len; the codes are produced in
420         tree[I].Code.
421
422         1)  Count the number of codes for each code length.  Let
423             bl_count[N] be the number of codes of length N, N >= 1.
424
425         2)  Find the numerical value of the smallest code for each
426             code length:
427
428                code = 0;
429                bl_count[0] = 0;
430                for (bits = 1; bits <= MAX_BITS; bits++) {
431                    code = (code + bl_count[bits-1]) << 1;
432                    next_code[bits] = code;
433                }
434
435         3)  Assign numerical values to all codes, using consecutive
436             values for all codes of the same length with the base
437             values determined at step 2. Codes that are never used
438             (which have a bit length of zero) must not be assigned a
439             value.
440
441                for (n = 0;  n <= max_code; n++) {
442                    len = tree[n].Len;
443                    if (len != 0) {
444                        tree[n].Code = next_code[len];
445                        next_code[len]++;
446                    }
447
448
449
450Deutsch                      Informational                      [Page 8]
451
452RFC 1951      DEFLATE Compressed Data Format Specification      May 1996
453
454
455                }
456
457         Example:
458
459         Consider the alphabet ABCDEFGH, with bit lengths (3, 3, 3, 3,
460         3, 2, 4, 4).  After step 1, we have:
461
462            N      bl_count[N]
463            -      -----------
464            2      1
465            3      5
466            4      2
467
468         Step 2 computes the following next_code values:
469
470            N      next_code[N]
471            -      ------------
472            1      0
473            2      0
474            3      2
475            4      14
476
477         Step 3 produces the following code values:
478
479            Symbol Length   Code
480            ------ ------   ----
481            A       3        010
482            B       3        011
483            C       3        100
484            D       3        101
485            E       3        110
486            F       2         00
487            G       4       1110
488            H       4       1111
489
490      3.2.3. Details of block format
491
492         Each block of compressed data begins with 3 header bits
493         containing the following data:
494
495            first bit       BFINAL
496            next 2 bits     BTYPE
497
498         Note that the header bits do not necessarily begin on a byte
499         boundary, since a block does not necessarily occupy an integral
500         number of bytes.
501
502
503
504
505
506Deutsch                      Informational                      [Page 9]
507
508RFC 1951      DEFLATE Compressed Data Format Specification      May 1996
509
510
511         BFINAL is set if and only if this is the last block of the data
512         set.
513
514         BTYPE specifies how the data are compressed, as follows:
515
516            00 - no compression
517            01 - compressed with fixed Huffman codes
518            10 - compressed with dynamic Huffman codes
519            11 - reserved (error)
520
521         The only difference between the two compressed cases is how the
522         Huffman codes for the literal/length and distance alphabets are
523         defined.
524
525         In all cases, the decoding algorithm for the actual data is as
526         follows:
527
528            do
529               read block header from input stream.
530               if stored with no compression
531                  skip any remaining bits in current partially
532                     processed byte
533                  read LEN and NLEN (see next section)
534                  copy LEN bytes of data to output
535               otherwise
536                  if compressed with dynamic Huffman codes
537                     read representation of code trees (see
538                        subsection below)
539                  loop (until end of block code recognized)
540                     decode literal/length value from input stream
541                     if value < 256
542                        copy value (literal byte) to output stream
543                     otherwise
544                        if value = end of block (256)
545                           break from loop
546                        otherwise (value = 257..285)
547                           decode distance from input stream
548
549                           move backwards distance bytes in the output
550                           stream, and copy length bytes from this
551                           position to the output stream.
552                  end loop
553            while not last block
554
555         Note that a duplicated string reference may refer to a string
556         in a previous block; i.e., the backward distance may cross one
557         or more block boundaries.  However a distance cannot refer past
558         the beginning of the output stream.  (An application using a
559
560
561
562Deutsch                      Informational                     [Page 10]
563
564RFC 1951      DEFLATE Compressed Data Format Specification      May 1996
565
566
567         preset dictionary might discard part of the output stream; a
568         distance can refer to that part of the output stream anyway)
569         Note also that the referenced string may overlap the current
570         position; for example, if the last 2 bytes decoded have values
571         X and Y, a string reference with <length = 5, distance = 2>
572         adds X,Y,X,Y,X to the output stream.
573
574         We now specify each compression method in turn.
575
576      3.2.4. Non-compressed blocks (BTYPE=00)
577
578         Any bits of input up to the next byte boundary are ignored.
579         The rest of the block consists of the following information:
580
581              0   1   2   3   4...
582            +---+---+---+---+================================+
583            |  LEN  | NLEN  |... LEN bytes of literal data...|
584            +---+---+---+---+================================+
585
586         LEN is the number of data bytes in the block.  NLEN is the
587         one's complement of LEN.
588
589      3.2.5. Compressed blocks (length and distance codes)
590
591         As noted above, encoded data blocks in the "deflate" format
592         consist of sequences of symbols drawn from three conceptually
593         distinct alphabets: either literal bytes, from the alphabet of
594         byte values (0..255), or <length, backward distance> pairs,
595         where the length is drawn from (3..258) and the distance is
596         drawn from (1..32,768).  In fact, the literal and length
597         alphabets are merged into a single alphabet (0..285), where
598         values 0..255 represent literal bytes, the value 256 indicates
599         end-of-block, and values 257..285 represent length codes
600         (possibly in conjunction with extra bits following the symbol
601         code) as follows:
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618Deutsch                      Informational                     [Page 11]
619
620RFC 1951      DEFLATE Compressed Data Format Specification      May 1996
621
622
623                 Extra               Extra               Extra
624            Code Bits Length(s) Code Bits Lengths   Code Bits Length(s)
625            ---- ---- ------     ---- ---- -------   ---- ---- -------
626             257   0     3       267   1   15,16     277   4   67-82
627             258   0     4       268   1   17,18     278   4   83-98
628             259   0     5       269   2   19-22     279   4   99-114
629             260   0     6       270   2   23-26     280   4  115-130
630             261   0     7       271   2   27-30     281   5  131-162
631             262   0     8       272   2   31-34     282   5  163-194
632             263   0     9       273   3   35-42     283   5  195-226
633             264   0    10       274   3   43-50     284   5  227-257
634             265   1  11,12      275   3   51-58     285   0    258
635             266   1  13,14      276   3   59-66
636
637         The extra bits should be interpreted as a machine integer
638         stored with the most-significant bit first, e.g., bits 1110
639         represent the value 14.
640
641                  Extra           Extra               Extra
642             Code Bits Dist  Code Bits   Dist     Code Bits Distance
643             ---- ---- ----  ---- ----  ------    ---- ---- --------
644               0   0    1     10   4     33-48    20    9   1025-1536
645               1   0    2     11   4     49-64    21    9   1537-2048
646               2   0    3     12   5     65-96    22   10   2049-3072
647               3   0    4     13   5     97-128   23   10   3073-4096
648               4   1   5,6    14   6    129-192   24   11   4097-6144
649               5   1   7,8    15   6    193-256   25   11   6145-8192
650               6   2   9-12   16   7    257-384   26   12  8193-12288
651               7   2  13-16   17   7    385-512   27   12 12289-16384
652               8   3  17-24   18   8    513-768   28   13 16385-24576
653               9   3  25-32   19   8   769-1024   29   13 24577-32768
654
655      3.2.6. Compression with fixed Huffman codes (BTYPE=01)
656
657         The Huffman codes for the two alphabets are fixed, and are not
658         represented explicitly in the data.  The Huffman code lengths
659         for the literal/length alphabet are:
660
661                   Lit Value    Bits        Codes
662                   ---------    ----        -----
663                     0 - 143     8          00110000 through
664                                            10111111
665                   144 - 255     9          110010000 through
666                                            111111111
667                   256 - 279     7          0000000 through
668                                            0010111
669                   280 - 287     8          11000000 through
670                                            11000111
671
672
673
674Deutsch                      Informational                     [Page 12]
675
676RFC 1951      DEFLATE Compressed Data Format Specification      May 1996
677
678
679         The code lengths are sufficient to generate the actual codes,
680         as described above; we show the codes in the table for added
681         clarity.  Literal/length values 286-287 will never actually
682         occur in the compressed data, but participate in the code
683         construction.
684
685         Distance codes 0-31 are represented by (fixed-length) 5-bit
686         codes, with possible additional bits as shown in the table
687         shown in Paragraph 3.2.5, above.  Note that distance codes 30-
688         31 will never actually occur in the compressed data.
689
690      3.2.7. Compression with dynamic Huffman codes (BTYPE=10)
691
692         The Huffman codes for the two alphabets appear in the block
693         immediately after the header bits and before the actual
694         compressed data, first the literal/length code and then the
695         distance code.  Each code is defined by a sequence of code
696         lengths, as discussed in Paragraph 3.2.2, above.  For even
697         greater compactness, the code length sequences themselves are
698         compressed using a Huffman code.  The alphabet for code lengths
699         is as follows:
700
701               0 - 15: Represent code lengths of 0 - 15
702                   16: Copy the previous code length 3 - 6 times.
703                       The next 2 bits indicate repeat length
704                             (0 = 3, ... , 3 = 6)
705                          Example:  Codes 8, 16 (+2 bits 11),
706                                    16 (+2 bits 10) will expand to
707                                    12 code lengths of 8 (1 + 6 + 5)
708                   17: Repeat a code length of 0 for 3 - 10 times.
709                       (3 bits of length)
710                   18: Repeat a code length of 0 for 11 - 138 times
711                       (7 bits of length)
712
713         A code length of 0 indicates that the corresponding symbol in
714         the literal/length or distance alphabet will not occur in the
715         block, and should not participate in the Huffman code
716         construction algorithm given earlier.  If only one distance
717         code is used, it is encoded using one bit, not zero bits; in
718         this case there is a single code length of one, with one unused
719         code.  One distance code of zero bits means that there are no
720         distance codes used at all (the data is all literals).
721
722         We can now define the format of the block:
723
724               5 Bits: HLIT, # of Literal/Length codes - 257 (257 - 286)
725               5 Bits: HDIST, # of Distance codes - 1        (1 - 32)
726               4 Bits: HCLEN, # of Code Length codes - 4     (4 - 19)
727
728
729
730Deutsch                      Informational                     [Page 13]
731
732RFC 1951      DEFLATE Compressed Data Format Specification      May 1996
733
734
735               (HCLEN + 4) x 3 bits: code lengths for the code length
736                  alphabet given just above, in the order: 16, 17, 18,
737                  0, 8, 7, 9, 6, 10, 5, 11, 4, 12, 3, 13, 2, 14, 1, 15
738
739                  These code lengths are interpreted as 3-bit integers
740                  (0-7); as above, a code length of 0 means the
741                  corresponding symbol (literal/length or distance code
742                  length) is not used.
743
744               HLIT + 257 code lengths for the literal/length alphabet,
745                  encoded using the code length Huffman code
746
747               HDIST + 1 code lengths for the distance alphabet,
748                  encoded using the code length Huffman code
749
750               The actual compressed data of the block,
751                  encoded using the literal/length and distance Huffman
752                  codes
753
754               The literal/length symbol 256 (end of data),
755                  encoded using the literal/length Huffman code
756
757         The code length repeat codes can cross from HLIT + 257 to the
758         HDIST + 1 code lengths.  In other words, all code lengths form
759         a single sequence of HLIT + HDIST + 258 values.
760
761   3.3. Compliance
762
763      A compressor may limit further the ranges of values specified in
764      the previous section and still be compliant; for example, it may
765      limit the range of backward pointers to some value smaller than
766      32K.  Similarly, a compressor may limit the size of blocks so that
767      a compressible block fits in memory.
768
769      A compliant decompressor must accept the full range of possible
770      values defined in the previous section, and must accept blocks of
771      arbitrary size.
772
7734. Compression algorithm details
774
775   While it is the intent of this document to define the "deflate"
776   compressed data format without reference to any particular
777   compression algorithm, the format is related to the compressed
778   formats produced by LZ77 (Lempel-Ziv 1977, see reference [2] below);
779   since many variations of LZ77 are patented, it is strongly
780   recommended that the implementor of a compressor follow the general
781   algorithm presented here, which is known not to be patented per se.
782   The material in this section is not part of the definition of the
783
784
785
786Deutsch                      Informational                     [Page 14]
787
788RFC 1951      DEFLATE Compressed Data Format Specification      May 1996
789
790
791   specification per se, and a compressor need not follow it in order to
792   be compliant.
793
794   The compressor terminates a block when it determines that starting a
795   new block with fresh trees would be useful, or when the block size
796   fills up the compressor's block buffer.
797
798   The compressor uses a chained hash table to find duplicated strings,
799   using a hash function that operates on 3-byte sequences.  At any
800   given point during compression, let XYZ be the next 3 input bytes to
801   be examined (not necessarily all different, of course).  First, the
802   compressor examines the hash chain for XYZ.  If the chain is empty,
803   the compressor simply writes out X as a literal byte and advances one
804   byte in the input.  If the hash chain is not empty, indicating that
805   the sequence XYZ (or, if we are unlucky, some other 3 bytes with the
806   same hash function value) has occurred recently, the compressor
807   compares all strings on the XYZ hash chain with the actual input data
808   sequence starting at the current point, and selects the longest
809   match.
810
811   The compressor searches the hash chains starting with the most recent
812   strings, to favor small distances and thus take advantage of the
813   Huffman encoding.  The hash chains are singly linked. There are no
814   deletions from the hash chains; the algorithm simply discards matches
815   that are too old.  To avoid a worst-case situation, very long hash
816   chains are arbitrarily truncated at a certain length, determined by a
817   run-time parameter.
818
819   To improve overall compression, the compressor optionally defers the
820   selection of matches ("lazy matching"): after a match of length N has
821   been found, the compressor searches for a longer match starting at
822   the next input byte.  If it finds a longer match, it truncates the
823   previous match to a length of one (thus producing a single literal
824   byte) and then emits the longer match.  Otherwise, it emits the
825   original match, and, as described above, advances N bytes before
826   continuing.
827
828   Run-time parameters also control this "lazy match" procedure.  If
829   compression ratio is most important, the compressor attempts a
830   complete second search regardless of the length of the first match.
831   In the normal case, if the current match is "long enough", the
832   compressor reduces the search for a longer match, thus speeding up
833   the process.  If speed is most important, the compressor inserts new
834   strings in the hash table only when no match was found, or when the
835   match is not "too long".  This degrades the compression ratio but
836   saves time since there are both fewer insertions and fewer searches.
837
838
839
840
841
842Deutsch                      Informational                     [Page 15]
843
844RFC 1951      DEFLATE Compressed Data Format Specification      May 1996
845
846
8475. References
848
849   [1] Huffman, D. A., "A Method for the Construction of Minimum
850       Redundancy Codes", Proceedings of the Institute of Radio
851       Engineers, September 1952, Volume 40, Number 9, pp. 1098-1101.
852
853   [2] Ziv J., Lempel A., "A Universal Algorithm for Sequential Data
854       Compression", IEEE Transactions on Information Theory, Vol. 23,
855       No. 3, pp. 337-343.
856
857   [3] Gailly, J.-L., and Adler, M., ZLIB documentation and sources,
858       available in ftp://ftp.uu.net/pub/archiving/zip/doc/
859
860   [4] Gailly, J.-L., and Adler, M., GZIP documentation and sources,
861       available as gzip-*.tar in ftp://prep.ai.mit.edu/pub/gnu/
862
863   [5] Schwartz, E. S., and Kallick, B. "Generating a canonical prefix
864       encoding." Comm. ACM, 7,3 (Mar. 1964), pp. 166-169.
865
866   [6] Hirschberg and Lelewer, "Efficient decoding of prefix codes,"
867       Comm. ACM, 33,4, April 1990, pp. 449-459.
868
8696. Security Considerations
870
871   Any data compression method involves the reduction of redundancy in
872   the data.  Consequently, any corruption of the data is likely to have
873   severe effects and be difficult to correct.  Uncompressed text, on
874   the other hand, will probably still be readable despite the presence
875   of some corrupted bytes.
876
877   It is recommended that systems using this data format provide some
878   means of validating the integrity of the compressed data.  See
879   reference [3], for example.
880
8817. Source code
882
883   Source code for a C language implementation of a "deflate" compliant
884   compressor and decompressor is available within the zlib package at
885   ftp://ftp.uu.net/pub/archiving/zip/zlib/.
886
8878. Acknowledgements
888
889   Trademarks cited in this document are the property of their
890   respective owners.
891
892   Phil Katz designed the deflate format.  Jean-Loup Gailly and Mark
893   Adler wrote the related software described in this specification.
894   Glenn Randers-Pehrson converted this document to RFC and HTML format.
895
896
897
898Deutsch                      Informational                     [Page 16]
899
900RFC 1951      DEFLATE Compressed Data Format Specification      May 1996
901
902
9039. Author's Address
904
905   L. Peter Deutsch
906   Aladdin Enterprises
907   203 Santa Margarita Ave.
908   Menlo Park, CA 94025
909
910   Phone: (415) 322-0103 (AM only)
911   FAX:   (415) 322-1734
912   EMail: <ghost@aladdin.com>
913
914   Questions about the technical content of this specification can be
915   sent by email to:
916
917   Jean-Loup Gailly <gzip@prep.ai.mit.edu> and
918   Mark Adler <madler@alumni.caltech.edu>
919
920   Editorial comments on this specification can be sent by email to:
921
922   L. Peter Deutsch <ghost@aladdin.com> and
923   Glenn Randers-Pehrson <randeg@alumni.rpi.edu>
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954Deutsch                      Informational                     [Page 17]
955
956