xref: /freebsd/sys/contrib/zlib/doc/rfc1952.txt (revision da759cfa320d5076b075d15ff3f00ab3ba5634fd)
1
2
3
4
5
6
7Network Working Group                                         P. Deutsch
8Request for Comments: 1952                           Aladdin Enterprises
9Category: Informational                                         May 1996
10
11
12               GZIP file format specification version 4.3
13
14Status of This Memo
15
16   This memo provides information for the Internet community.  This memo
17   does not specify an Internet standard of any kind.  Distribution of
18   this memo is unlimited.
19
20IESG Note:
21
22   The IESG takes no position on the validity of any Intellectual
23   Property Rights statements contained in this document.
24
25Notices
26
27   Copyright (c) 1996 L. Peter Deutsch
28
29   Permission is granted to copy and distribute this document for any
30   purpose and without charge, including translations into other
31   languages and incorporation into compilations, provided that the
32   copyright notice and this notice are preserved, and that any
33   substantive changes or deletions from the original are clearly
34   marked.
35
36   A pointer to the latest version of this and related documentation in
37   HTML format can be found at the URL
38   <ftp://ftp.uu.net/graphics/png/documents/zlib/zdoc-index.html>.
39
40Abstract
41
42   This specification defines a lossless compressed data format that is
43   compatible with the widely used GZIP utility.  The format includes a
44   cyclic redundancy check value for detecting data corruption.  The
45   format presently uses the DEFLATE method of compression but can be
46   easily extended to use other compression methods.  The format can be
47   implemented readily in a manner not covered by patents.
48
49
50
51
52
53
54
55
56
57
58Deutsch                      Informational                      [Page 1]
59
60RFC 1952             GZIP File Format Specification             May 1996
61
62
63Table of Contents
64
65   1. Introduction ................................................... 2
66      1.1. Purpose ................................................... 2
67      1.2. Intended audience ......................................... 3
68      1.3. Scope ..................................................... 3
69      1.4. Compliance ................................................ 3
70      1.5. Definitions of terms and conventions used ................. 3
71      1.6. Changes from previous versions ............................ 3
72   2. Detailed specification ......................................... 4
73      2.1. Overall conventions ....................................... 4
74      2.2. File format ............................................... 5
75      2.3. Member format ............................................. 5
76          2.3.1. Member header and trailer ........................... 6
77              2.3.1.1. Extra field ................................... 8
78              2.3.1.2. Compliance .................................... 9
79      3. References .................................................. 9
80      4. Security Considerations .................................... 10
81      5. Acknowledgements ........................................... 10
82      6. Author's Address ........................................... 10
83      7. Appendix: Jean-Loup Gailly's gzip utility .................. 11
84      8. Appendix: Sample CRC Code .................................. 11
85
861. Introduction
87
88   1.1. Purpose
89
90      The purpose of this specification is to define a lossless
91      compressed data format that:
92
93          * Is independent of CPU type, operating system, file system,
94            and character set, and hence can be used for interchange;
95          * Can compress or decompress a data stream (as opposed to a
96            randomly accessible file) to produce another data stream,
97            using only an a priori bounded amount of intermediate
98            storage, and hence can be used in data communications or
99            similar structures such as Unix filters;
100          * Compresses data with efficiency comparable to the best
101            currently available general-purpose compression methods,
102            and in particular considerably better than the "compress"
103            program;
104          * Can be implemented readily in a manner not covered by
105            patents, and hence can be practiced freely;
106          * Is compatible with the file format produced by the current
107            widely used gzip utility, in that conforming decompressors
108            will be able to read data produced by the existing gzip
109            compressor.
110
111
112
113
114Deutsch                      Informational                      [Page 2]
115
116RFC 1952             GZIP File Format Specification             May 1996
117
118
119      The data format defined by this specification does not attempt to:
120
121          * Provide random access to compressed data;
122          * Compress specialized data (e.g., raster graphics) as well as
123            the best currently available specialized algorithms.
124
125   1.2. Intended audience
126
127      This specification is intended for use by implementors of software
128      to compress data into gzip format and/or decompress data from gzip
129      format.
130
131      The text of the specification assumes a basic background in
132      programming at the level of bits and other primitive data
133      representations.
134
135   1.3. Scope
136
137      The specification specifies a compression method and a file format
138      (the latter assuming only that a file can store a sequence of
139      arbitrary bytes).  It does not specify any particular interface to
140      a file system or anything about character sets or encodings
141      (except for file names and comments, which are optional).
142
143   1.4. Compliance
144
145      Unless otherwise indicated below, a compliant decompressor must be
146      able to accept and decompress any file that conforms to all the
147      specifications presented here; a compliant compressor must produce
148      files that conform to all the specifications presented here.  The
149      material in the appendices is not part of the specification per se
150      and is not relevant to compliance.
151
152   1.5. Definitions of terms and conventions used
153
154      byte: 8 bits stored or transmitted as a unit (same as an octet).
155      (For this specification, a byte is exactly 8 bits, even on
156      machines which store a character on a number of bits different
157      from 8.)  See below for the numbering of bits within a byte.
158
159   1.6. Changes from previous versions
160
161      There have been no technical changes to the gzip format since
162      version 4.1 of this specification.  In version 4.2, some
163      terminology was changed, and the sample CRC code was rewritten for
164      clarity and to eliminate the requirement for the caller to do pre-
165      and post-conditioning.  Version 4.3 is a conversion of the
166      specification to RFC style.
167
168
169
170Deutsch                      Informational                      [Page 3]
171
172RFC 1952             GZIP File Format Specification             May 1996
173
174
1752. Detailed specification
176
177   2.1. Overall conventions
178
179      In the diagrams below, a box like this:
180
181         +---+
182         |   | <-- the vertical bars might be missing
183         +---+
184
185      represents one byte; a box like this:
186
187         +==============+
188         |              |
189         +==============+
190
191      represents a variable number of bytes.
192
193      Bytes stored within a computer do not have a "bit order", since
194      they are always treated as a unit.  However, a byte considered as
195      an integer between 0 and 255 does have a most- and least-
196      significant bit, and since we write numbers with the most-
197      significant digit on the left, we also write bytes with the most-
198      significant bit on the left.  In the diagrams below, we number the
199      bits of a byte so that bit 0 is the least-significant bit, i.e.,
200      the bits are numbered:
201
202         +--------+
203         |76543210|
204         +--------+
205
206      This document does not address the issue of the order in which
207      bits of a byte are transmitted on a bit-sequential medium, since
208      the data format described here is byte- rather than bit-oriented.
209
210      Within a computer, a number may occupy multiple bytes.  All
211      multi-byte numbers in the format described here are stored with
212      the least-significant byte first (at the lower memory address).
213      For example, the decimal number 520 is stored as:
214
215             0        1
216         +--------+--------+
217         |00001000|00000010|
218         +--------+--------+
219          ^        ^
220          |        |
221          |        + more significant byte = 2 x 256
222          + less significant byte = 8
223
224
225
226Deutsch                      Informational                      [Page 4]
227
228RFC 1952             GZIP File Format Specification             May 1996
229
230
231   2.2. File format
232
233      A gzip file consists of a series of "members" (compressed data
234      sets).  The format of each member is specified in the following
235      section.  The members simply appear one after another in the file,
236      with no additional information before, between, or after them.
237
238   2.3. Member format
239
240      Each member has the following structure:
241
242         +---+---+---+---+---+---+---+---+---+---+
243         |ID1|ID2|CM |FLG|     MTIME     |XFL|OS | (more-->)
244         +---+---+---+---+---+---+---+---+---+---+
245
246      (if FLG.FEXTRA set)
247
248         +---+---+=================================+
249         | XLEN  |...XLEN bytes of "extra field"...| (more-->)
250         +---+---+=================================+
251
252      (if FLG.FNAME set)
253
254         +=========================================+
255         |...original file name, zero-terminated...| (more-->)
256         +=========================================+
257
258      (if FLG.FCOMMENT set)
259
260         +===================================+
261         |...file comment, zero-terminated...| (more-->)
262         +===================================+
263
264      (if FLG.FHCRC set)
265
266         +---+---+
267         | CRC16 |
268         +---+---+
269
270         +=======================+
271         |...compressed blocks...| (more-->)
272         +=======================+
273
274           0   1   2   3   4   5   6   7
275         +---+---+---+---+---+---+---+---+
276         |     CRC32     |     ISIZE     |
277         +---+---+---+---+---+---+---+---+
278
279
280
281
282Deutsch                      Informational                      [Page 5]
283
284RFC 1952             GZIP File Format Specification             May 1996
285
286
287      2.3.1. Member header and trailer
288
289         ID1 (IDentification 1)
290         ID2 (IDentification 2)
291            These have the fixed values ID1 = 31 (0x1f, \037), ID2 = 139
292            (0x8b, \213), to identify the file as being in gzip format.
293
294         CM (Compression Method)
295            This identifies the compression method used in the file.  CM
296            = 0-7 are reserved.  CM = 8 denotes the "deflate"
297            compression method, which is the one customarily used by
298            gzip and which is documented elsewhere.
299
300         FLG (FLaGs)
301            This flag byte is divided into individual bits as follows:
302
303               bit 0   FTEXT
304               bit 1   FHCRC
305               bit 2   FEXTRA
306               bit 3   FNAME
307               bit 4   FCOMMENT
308               bit 5   reserved
309               bit 6   reserved
310               bit 7   reserved
311
312            If FTEXT is set, the file is probably ASCII text.  This is
313            an optional indication, which the compressor may set by
314            checking a small amount of the input data to see whether any
315            non-ASCII characters are present.  In case of doubt, FTEXT
316            is cleared, indicating binary data. For systems which have
317            different file formats for ascii text and binary data, the
318            decompressor can use FTEXT to choose the appropriate format.
319            We deliberately do not specify the algorithm used to set
320            this bit, since a compressor always has the option of
321            leaving it cleared and a decompressor always has the option
322            of ignoring it and letting some other program handle issues
323            of data conversion.
324
325            If FHCRC is set, a CRC16 for the gzip header is present,
326            immediately before the compressed data. The CRC16 consists
327            of the two least significant bytes of the CRC32 for all
328            bytes of the gzip header up to and not including the CRC16.
329            [The FHCRC bit was never set by versions of gzip up to
330            1.2.4, even though it was documented with a different
331            meaning in gzip 1.2.4.]
332
333            If FEXTRA is set, optional extra fields are present, as
334            described in a following section.
335
336
337
338Deutsch                      Informational                      [Page 6]
339
340RFC 1952             GZIP File Format Specification             May 1996
341
342
343            If FNAME is set, an original file name is present,
344            terminated by a zero byte.  The name must consist of ISO
345            8859-1 (LATIN-1) characters; on operating systems using
346            EBCDIC or any other character set for file names, the name
347            must be translated to the ISO LATIN-1 character set.  This
348            is the original name of the file being compressed, with any
349            directory components removed, and, if the file being
350            compressed is on a file system with case insensitive names,
351            forced to lower case. There is no original file name if the
352            data was compressed from a source other than a named file;
353            for example, if the source was stdin on a Unix system, there
354            is no file name.
355
356            If FCOMMENT is set, a zero-terminated file comment is
357            present.  This comment is not interpreted; it is only
358            intended for human consumption.  The comment must consist of
359            ISO 8859-1 (LATIN-1) characters.  Line breaks should be
360            denoted by a single line feed character (10 decimal).
361
362            Reserved FLG bits must be zero.
363
364         MTIME (Modification TIME)
365            This gives the most recent modification time of the original
366            file being compressed.  The time is in Unix format, i.e.,
367            seconds since 00:00:00 GMT, Jan.  1, 1970.  (Note that this
368            may cause problems for MS-DOS and other systems that use
369            local rather than Universal time.)  If the compressed data
370            did not come from a file, MTIME is set to the time at which
371            compression started.  MTIME = 0 means no time stamp is
372            available.
373
374         XFL (eXtra FLags)
375            These flags are available for use by specific compression
376            methods.  The "deflate" method (CM = 8) sets these flags as
377            follows:
378
379               XFL = 2 - compressor used maximum compression,
380                         slowest algorithm
381               XFL = 4 - compressor used fastest algorithm
382
383         OS (Operating System)
384            This identifies the type of file system on which compression
385            took place.  This may be useful in determining end-of-line
386            convention for text files.  The currently defined values are
387            as follows:
388
389
390
391
392
393
394Deutsch                      Informational                      [Page 7]
395
396RFC 1952             GZIP File Format Specification             May 1996
397
398
399                 0 - FAT filesystem (MS-DOS, OS/2, NT/Win32)
400                 1 - Amiga
401                 2 - VMS (or OpenVMS)
402                 3 - Unix
403                 4 - VM/CMS
404                 5 - Atari TOS
405                 6 - HPFS filesystem (OS/2, NT)
406                 7 - Macintosh
407                 8 - Z-System
408                 9 - CP/M
409                10 - TOPS-20
410                11 - NTFS filesystem (NT)
411                12 - QDOS
412                13 - Acorn RISCOS
413               255 - unknown
414
415         XLEN (eXtra LENgth)
416            If FLG.FEXTRA is set, this gives the length of the optional
417            extra field.  See below for details.
418
419         CRC32 (CRC-32)
420            This contains a Cyclic Redundancy Check value of the
421            uncompressed data computed according to CRC-32 algorithm
422            used in the ISO 3309 standard and in section 8.1.1.6.2 of
423            ITU-T recommendation V.42.  (See http://www.iso.ch for
424            ordering ISO documents. See gopher://info.itu.ch for an
425            online version of ITU-T V.42.)
426
427         ISIZE (Input SIZE)
428            This contains the size of the original (uncompressed) input
429            data modulo 2^32.
430
431      2.3.1.1. Extra field
432
433         If the FLG.FEXTRA bit is set, an "extra field" is present in
434         the header, with total length XLEN bytes.  It consists of a
435         series of subfields, each of the form:
436
437            +---+---+---+---+==================================+
438            |SI1|SI2|  LEN  |... LEN bytes of subfield data ...|
439            +---+---+---+---+==================================+
440
441         SI1 and SI2 provide a subfield ID, typically two ASCII letters
442         with some mnemonic value.  Jean-Loup Gailly
443         <gzip@prep.ai.mit.edu> is maintaining a registry of subfield
444         IDs; please send him any subfield ID you wish to use.  Subfield
445         IDs with SI2 = 0 are reserved for future use.  The following
446         IDs are currently defined:
447
448
449
450Deutsch                      Informational                      [Page 8]
451
452RFC 1952             GZIP File Format Specification             May 1996
453
454
455            SI1         SI2         Data
456            ----------  ----------  ----
457            0x41 ('A')  0x70 ('P')  Apollo file type information
458
459         LEN gives the length of the subfield data, excluding the 4
460         initial bytes.
461
462      2.3.1.2. Compliance
463
464         A compliant compressor must produce files with correct ID1,
465         ID2, CM, CRC32, and ISIZE, but may set all the other fields in
466         the fixed-length part of the header to default values (255 for
467         OS, 0 for all others).  The compressor must set all reserved
468         bits to zero.
469
470         A compliant decompressor must check ID1, ID2, and CM, and
471         provide an error indication if any of these have incorrect
472         values.  It must examine FEXTRA/XLEN, FNAME, FCOMMENT and FHCRC
473         at least so it can skip over the optional fields if they are
474         present.  It need not examine any other part of the header or
475         trailer; in particular, a decompressor may ignore FTEXT and OS
476         and always produce binary output, and still be compliant.  A
477         compliant decompressor must give an error indication if any
478         reserved bit is non-zero, since such a bit could indicate the
479         presence of a new field that would cause subsequent data to be
480         interpreted incorrectly.
481
4823. References
483
484   [1] "Information Processing - 8-bit single-byte coded graphic
485       character sets - Part 1: Latin alphabet No.1" (ISO 8859-1:1987).
486       The ISO 8859-1 (Latin-1) character set is a superset of 7-bit
487       ASCII. Files defining this character set are available as
488       iso_8859-1.* in ftp://ftp.uu.net/graphics/png/documents/
489
490   [2] ISO 3309
491
492   [3] ITU-T recommendation V.42
493
494   [4] Deutsch, L.P.,"DEFLATE Compressed Data Format Specification",
495       available in ftp://ftp.uu.net/pub/archiving/zip/doc/
496
497   [5] Gailly, J.-L., GZIP documentation, available as gzip-*.tar in
498       ftp://prep.ai.mit.edu/pub/gnu/
499
500   [6] Sarwate, D.V., "Computation of Cyclic Redundancy Checks via Table
501       Look-Up", Communications of the ACM, 31(8), pp.1008-1013.
502
503
504
505
506Deutsch                      Informational                      [Page 9]
507
508RFC 1952             GZIP File Format Specification             May 1996
509
510
511   [7] Schwaderer, W.D., "CRC Calculation", April 85 PC Tech Journal,
512       pp.118-133.
513
514   [8] ftp://ftp.adelaide.edu.au/pub/rocksoft/papers/crc_v3.txt,
515       describing the CRC concept.
516
5174. Security Considerations
518
519   Any data compression method involves the reduction of redundancy in
520   the data.  Consequently, any corruption of the data is likely to have
521   severe effects and be difficult to correct.  Uncompressed text, on
522   the other hand, will probably still be readable despite the presence
523   of some corrupted bytes.
524
525   It is recommended that systems using this data format provide some
526   means of validating the integrity of the compressed data, such as by
527   setting and checking the CRC-32 check value.
528
5295. Acknowledgements
530
531   Trademarks cited in this document are the property of their
532   respective owners.
533
534   Jean-Loup Gailly designed the gzip format and wrote, with Mark Adler,
535   the related software described in this specification.  Glenn
536   Randers-Pehrson converted this document to RFC and HTML format.
537
5386. Author's Address
539
540   L. Peter Deutsch
541   Aladdin Enterprises
542   203 Santa Margarita Ave.
543   Menlo Park, CA 94025
544
545   Phone: (415) 322-0103 (AM only)
546   FAX:   (415) 322-1734
547   EMail: <ghost@aladdin.com>
548
549   Questions about the technical content of this specification can be
550   sent by email to:
551
552   Jean-Loup Gailly <gzip@prep.ai.mit.edu> and
553   Mark Adler <madler@alumni.caltech.edu>
554
555   Editorial comments on this specification can be sent by email to:
556
557   L. Peter Deutsch <ghost@aladdin.com> and
558   Glenn Randers-Pehrson <randeg@alumni.rpi.edu>
559
560
561
562Deutsch                      Informational                     [Page 10]
563
564RFC 1952             GZIP File Format Specification             May 1996
565
566
5677. Appendix: Jean-Loup Gailly's gzip utility
568
569   The most widely used implementation of gzip compression, and the
570   original documentation on which this specification is based, were
571   created by Jean-Loup Gailly <gzip@prep.ai.mit.edu>.  Since this
572   implementation is a de facto standard, we mention some more of its
573   features here.  Again, the material in this section is not part of
574   the specification per se, and implementations need not follow it to
575   be compliant.
576
577   When compressing or decompressing a file, gzip preserves the
578   protection, ownership, and modification time attributes on the local
579   file system, since there is no provision for representing protection
580   attributes in the gzip file format itself.  Since the file format
581   includes a modification time, the gzip decompressor provides a
582   command line switch that assigns the modification time from the file,
583   rather than the local modification time of the compressed input, to
584   the decompressed output.
585
5868. Appendix: Sample CRC Code
587
588   The following sample code represents a practical implementation of
589   the CRC (Cyclic Redundancy Check). (See also ISO 3309 and ITU-T V.42
590   for a formal specification.)
591
592   The sample code is in the ANSI C programming language. Non C users
593   may find it easier to read with these hints:
594
595      &      Bitwise AND operator.
596      ^      Bitwise exclusive-OR operator.
597      >>     Bitwise right shift operator. When applied to an
598             unsigned quantity, as here, right shift inserts zero
599             bit(s) at the left.
600      !      Logical NOT operator.
601      ++     "n++" increments the variable n.
602      0xNNN  0x introduces a hexadecimal (base 16) constant.
603             Suffix L indicates a long value (at least 32 bits).
604
605      /* Table of CRCs of all 8-bit messages. */
606      unsigned long crc_table[256];
607
608      /* Flag: has the table been computed? Initially false. */
609      int crc_table_computed = 0;
610
611      /* Make the table for a fast CRC. */
612      void make_crc_table(void)
613      {
614        unsigned long c;
615
616
617
618Deutsch                      Informational                     [Page 11]
619
620RFC 1952             GZIP File Format Specification             May 1996
621
622
623        int n, k;
624        for (n = 0; n < 256; n++) {
625          c = (unsigned long) n;
626          for (k = 0; k < 8; k++) {
627            if (c & 1) {
628              c = 0xedb88320L ^ (c >> 1);
629            } else {
630              c = c >> 1;
631            }
632          }
633          crc_table[n] = c;
634        }
635        crc_table_computed = 1;
636      }
637
638      /*
639         Update a running crc with the bytes buf[0..len-1] and return
640       the updated crc. The crc should be initialized to zero. Pre- and
641       post-conditioning (one's complement) is performed within this
642       function so it shouldn't be done by the caller. Usage example:
643
644         unsigned long crc = 0L;
645
646         while (read_buffer(buffer, length) != EOF) {
647           crc = update_crc(crc, buffer, length);
648         }
649         if (crc != original_crc) error();
650      */
651      unsigned long update_crc(unsigned long crc,
652                      unsigned char *buf, int len)
653      {
654        unsigned long c = crc ^ 0xffffffffL;
655        int n;
656
657        if (!crc_table_computed)
658          make_crc_table();
659        for (n = 0; n < len; n++) {
660          c = crc_table[(c ^ buf[n]) & 0xff] ^ (c >> 8);
661        }
662        return c ^ 0xffffffffL;
663      }
664
665      /* Return the CRC of the bytes buf[0..len-1]. */
666      unsigned long crc(unsigned char *buf, int len)
667      {
668        return update_crc(0L, buf, len);
669      }
670
671
672
673
674Deutsch                      Informational                     [Page 12]
675
676