.. SPDX-License-Identifier: GPL-2.0

Index Nodes
-----------

In a regular UNIX filesystem, the inode, not the directory entry, stores
all the metadata pertaining to the file (time stamps, block maps, extended
attributes, etc.). To find the information associated with a file, one
must traverse the directory files to find the directory entry associated
with the file, then load the inode to find the metadata for that file.
ext4 cheats a little (for performance reasons) by storing a copy of the
file type (normally stored in the inode) in the directory entry. (Compare
all this to FAT, which stores all the file information directly in the
directory entry, but does not support hard links and is in general more
seek-happy than ext4 due to its simpler block allocator and extensive use
of linked lists.)

The inode table is a linear array of ``struct ext4_inode``. The table is
sized to have enough blocks to store at least
``sb.s_inode_size * sb.s_inodes_per_group`` bytes. The number of the
block group containing an inode can be calculated as
``(inode_number - 1) / sb.s_inodes_per_group``, and the offset into the
group's table is ``(inode_number - 1) % sb.s_inodes_per_group``. There
is no inode 0.

The inode checksum is calculated against the FS UUID, the inode number,
and the inode structure itself.

The inode table entry is laid out in ``struct ext4_inode``.

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1
   :class: longtable

   * - Offset
     - Size
     - Name
     - Description
   * - 0x0
     - __le16
     - i_mode
     - File mode. See the table i_mode_ below.
   * - 0x2
     - __le16
     - i_uid
     - Lower 16-bits of Owner UID.
   * - 0x4
     - __le32
     - i_size_lo
     - Lower 32-bits of size in bytes.
   * - 0x8
     - __le32
     - i_atime
     - Last access time, in seconds since the epoch. However, if the EA_INODE
       inode flag is set, this inode stores an extended attribute value and
       this field contains the checksum of the value.
   * - 0xC
     - __le32
     - i_ctime
     - Last inode change time, in seconds since the epoch. However, if the
       EA_INODE inode flag is set, this inode stores an extended attribute
       value and this field contains the lower 32 bits of the attribute value's
       reference count.
   * - 0x10
     - __le32
     - i_mtime
     - Last data modification time, in seconds since the epoch. However, if the
       EA_INODE inode flag is set, this inode stores an extended attribute
       value and this field contains the number of the inode that owns the
       extended attribute.
   * - 0x14
     - __le32
     - i_dtime
     - Deletion Time, in seconds since the epoch.
   * - 0x18
     - __le16
     - i_gid
     - Lower 16-bits of GID.
   * - 0x1A
     - __le16
     - i_links_count
     - Hard link count. Normally, ext4 does not permit an inode to have more
       than 65,000 hard links. This applies to files as well as directories,
       which means that there cannot be more than 64,998 subdirectories in a
       directory (each subdirectory's '..' entry counts as a hard link, as does
       the '.' entry in the directory itself). With the DIR_NLINK feature
       enabled, ext4 supports more than 64,998 subdirectories by setting this
       field to 1 to indicate that the number of hard links is not known.
   * - 0x1C
     - __le32
     - i_blocks_lo
     - Lower 32-bits of “block” count. If the huge_file feature flag is not
       set on the filesystem, the file consumes ``i_blocks_lo`` 512-byte blocks
       on disk. If huge_file is set and EXT4_HUGE_FILE_FL is NOT set in
       ``inode.i_flags``, then the file consumes
       ``i_blocks_lo + (i_blocks_hi << 32)`` 512-byte blocks on disk. If
       huge_file is set and EXT4_HUGE_FILE_FL IS set in ``inode.i_flags``,
       then this file consumes ``i_blocks_lo + (i_blocks_hi << 32)``
       filesystem blocks on disk. Here ``i_blocks_hi`` is the
       ``l_i_blocks_high`` field of i_osd2_.
   * - 0x20
     - __le32
     - i_flags
     - Inode flags. See the table i_flags_ below.
   * - 0x24
     - 4 bytes
     - i_osd1
     - See the table i_osd1_ for more details.
   * - 0x28
     - 60 bytes
     - i_block[EXT4_N_BLOCKS=15]
     - Block map or extent tree. See the section “The Contents of inode.i_block”.
   * - 0x64
     - __le32
     - i_generation
     - File version (for NFS).
   * - 0x68
     - __le32
     - i_file_acl_lo
     - Lower 32-bits of extended attribute block. ACLs are only one of many
       possible extended attributes; the field name most likely dates from
       ACLs being the first use of extended attributes.
   * - 0x6C
     - __le32
     - i_size_high / i_dir_acl
     - Upper 32-bits of file/directory size. In ext2/3 this field was named
       i_dir_acl, though it was usually set to zero and never used.
   * - 0x70
     - __le32
     - i_obso_faddr
     - (Obsolete) fragment address.
   * - 0x74
     - 12 bytes
     - i_osd2
     - See the table i_osd2_ for more details.
   * - 0x80
     - __le16
     - i_extra_isize
     - Size of this inode - 128. Alternately, the size of the extended inode
       fields beyond the original ext2 inode, including this field.
   * - 0x82
     - __le16
     - i_checksum_hi
     - Upper 16-bits of the inode checksum.
   * - 0x84
     - __le32
     - i_ctime_extra
     - Extra change time bits. This provides sub-second precision. See Inode
       Timestamps section.
   * - 0x88
     - __le32
     - i_mtime_extra
     - Extra modification time bits. This provides sub-second precision.
   * - 0x8C
     - __le32
     - i_atime_extra
     - Extra access time bits. This provides sub-second precision.
   * - 0x90
     - __le32
     - i_crtime
     - File creation time, in seconds since the epoch.
   * - 0x94
     - __le32
     - i_crtime_extra
     - Extra file creation time bits. This provides sub-second precision.
   * - 0x98
     - __le32
     - i_version_hi
     - Upper 32-bits for version number.
   * - 0x9C
     - __le32
     - i_projid
     - Project ID.

.. _i_mode:

The ``i_mode`` value is a combination of the following flags:

.. list-table::
   :widths: 16 64
   :header-rows: 1

   * - Value
     - Description
   * - 0x1
     - S_IXOTH (Others may execute)
   * - 0x2
     - S_IWOTH (Others may write)
   * - 0x4
     - S_IROTH (Others may read)
   * - 0x8
     - S_IXGRP (Group members may execute)
   * - 0x10
     - S_IWGRP (Group members may write)
   * - 0x20
     - S_IRGRP (Group members may read)
   * - 0x40
     - S_IXUSR (Owner may execute)
   * - 0x80
     - S_IWUSR (Owner may write)
   * - 0x100
     - S_IRUSR (Owner may read)
   * - 0x200
     - S_ISVTX (Sticky bit)
   * - 0x400
     - S_ISGID (Set GID)
   * - 0x800
     - S_ISUID (Set UID)
   * -
     - These are mutually-exclusive file types:
   * - 0x1000
     - S_IFIFO (FIFO)
   * - 0x2000
     - S_IFCHR (Character device)
   * - 0x4000
     - S_IFDIR (Directory)
   * - 0x6000
     - S_IFBLK (Block device)
   * - 0x8000
     - S_IFREG (Regular file)
   * - 0xA000
     - S_IFLNK (Symbolic link)
   * - 0xC000
     - S_IFSOCK (Socket)
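
As a rough illustration of how the table above can be read, the following
sketch splits an ``i_mode`` value into its file type, permission bits, and
the set-UID/set-GID/sticky bits. The function name ``decode_mode`` and the
mask constants are assumptions made for this example; they are not taken
from the kernel sources.

```python
# Assumed masks, derived from the values listed in the table above.
FILE_TYPE_MASK = 0xF000   # covers all of the mutually-exclusive file types
PERMISSION_MASK = 0x01FF  # S_IXOTH (0x1) through S_IRUSR (0x100)
SPECIAL_MASK = 0x0E00     # S_ISVTX, S_ISGID, S_ISUID

FILE_TYPES = {
    0x1000: "S_IFIFO",
    0x2000: "S_IFCHR",
    0x4000: "S_IFDIR",
    0x6000: "S_IFBLK",
    0x8000: "S_IFREG",
    0xA000: "S_IFLNK",
    0xC000: "S_IFSOCK",
}


def decode_mode(i_mode):
    """Return (file type name, permission bits, special bits) for i_mode."""
    file_type = FILE_TYPES.get(i_mode & FILE_TYPE_MASK, "unknown")
    return file_type, i_mode & PERMISSION_MASK, i_mode & SPECIAL_MASK


print(decode_mode(0x81A4))  # ('S_IFREG', 420, 0), i.e. a regular file, 0644
print(decode_mode(0x43FF))  # ('S_IFDIR', 511, 512), a sticky 0777 directory
```

Because the file types are not single-bit flags (``S_IFBLK`` is 0x6000, for
instance), the sketch compares the masked value as a whole instead of
testing individual bits.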

.. _i_flags:

The ``i_flags`` field is a combination of these values:

.. list-table::
   :widths: 16 64
   :header-rows: 1

   * - Value
     - Description
   * - 0x1
     - This file requires secure deletion (EXT4_SECRM_FL). (not implemented)
   * - 0x2
     - This file should be preserved, should undeletion be desired
       (EXT4_UNRM_FL). (not implemented)
   * - 0x4
     - File is compressed (EXT4_COMPR_FL). (not really implemented)
   * - 0x8
     - All writes to the file must be synchronous (EXT4_SYNC_FL).
   * - 0x10
     - File is immutable (EXT4_IMMUTABLE_FL).
   * - 0x20
     - File can only be appended (EXT4_APPEND_FL).
   * - 0x40
     - The dump(1) utility should not dump this file (EXT4_NODUMP_FL).
   * - 0x80
     - Do not update access time (EXT4_NOATIME_FL).
   * - 0x100
     - Dirty compressed file (EXT4_DIRTY_FL). (not used)
   * - 0x200
     - File has one or more compressed clusters (EXT4_COMPRBLK_FL). (not used)
   * - 0x400
     - Do not compress file (EXT4_NOCOMPR_FL). (not used)
   * - 0x800
     - Encrypted inode (EXT4_ENCRYPT_FL). This bit value previously was
       EXT4_ECOMPR_FL (compression error), which was never used.
   * - 0x1000
     - Directory has hashed indexes (EXT4_INDEX_FL).
   * - 0x2000
     - AFS magic directory (EXT4_IMAGIC_FL).
   * - 0x4000
     - File data must always be written through the journal
       (EXT4_JOURNAL_DATA_FL).
   * - 0x8000
     - File tail should not be merged (EXT4_NOTAIL_FL). (not used by ext4)
   * - 0x10000
     - All directory entry data should be written synchronously (see
       ``dirsync``) (EXT4_DIRSYNC_FL).
   * - 0x20000
     - Top of directory hierarchy (EXT4_TOPDIR_FL).
   * - 0x40000
     - This is a huge file (EXT4_HUGE_FILE_FL).
   * - 0x80000
     - Inode uses extents (EXT4_EXTENTS_FL).
   * - 0x100000
     - Verity protected file (EXT4_VERITY_FL).
   * - 0x200000
     - Inode stores a large extended attribute value in its data blocks
       (EXT4_EA_INODE_FL).
   * - 0x400000
     - This file has blocks allocated past EOF (EXT4_EOFBLOCKS_FL).
       (deprecated)
   * - 0x01000000
     - Inode is a snapshot (``EXT4_SNAPFILE_FL``). (not in mainline)
   * - 0x04000000
     - Snapshot is being deleted (``EXT4_SNAPFILE_DELETED_FL``). (not in
       mainline)
   * - 0x08000000
     - Snapshot shrink has completed (``EXT4_SNAPFILE_SHRUNK_FL``). (not in
       mainline)
   * - 0x10000000
     - Inode has inline data (EXT4_INLINE_DATA_FL).
   * - 0x20000000
     - Create children with the same project ID (EXT4_PROJINHERIT_FL).
   * - 0x40000000
     - Use case-insensitive lookups for directory contents (EXT4_CASEFOLD_FL).
   * - 0x80000000
     - Reserved for ext4 library (EXT4_RESERVED_FL).
   * -
     - Aggregate flags:
   * - 0x705BDFFF
     - User-visible flags.
   * - 0x604BC0FF
     - User-modifiable flags. Note that while EXT4_JOURNAL_DATA_FL and
       EXT4_EXTENTS_FL can be set with setattr, they are not in the kernel's
       EXT4_FL_USER_MODIFIABLE mask, since it needs to handle the setting of
       these flags in a special manner and they are masked out of the set of
       flags that are saved directly to i_flags.
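
The flag table can be exercised with a small sketch like the one below. It
names only a handful of the flags, and ``flag_names`` and ``user_visible``
are hypothetical helpers introduced for illustration; the mask value is the
"User-visible flags" aggregate from the table.

```python
# A subset of the i_flags values from the table above.
KNOWN_FLAGS = {
    0x00000010: "EXT4_IMMUTABLE_FL",
    0x00000020: "EXT4_APPEND_FL",
    0x00001000: "EXT4_INDEX_FL",
    0x00040000: "EXT4_HUGE_FILE_FL",
    0x00080000: "EXT4_EXTENTS_FL",
    0x00200000: "EXT4_EA_INODE_FL",
    0x10000000: "EXT4_INLINE_DATA_FL",
}

USER_VISIBLE_MASK = 0x705BDFFF  # "User-visible flags" aggregate


def flag_names(i_flags):
    """List the known flag names set in i_flags, lowest bit first."""
    return [name for bit, name in sorted(KNOWN_FLAGS.items()) if i_flags & bit]


def user_visible(i_flags):
    """Keep only the bits covered by the user-visible aggregate mask."""
    return i_flags & USER_VISIBLE_MASK


print(flag_names(0x81000))        # ['EXT4_INDEX_FL', 'EXT4_EXTENTS_FL']
print(hex(user_visible(0xC0000)))  # 0x80000
```

In the second call EXT4_HUGE_FILE_FL (0x40000) is dropped because that bit
is not part of the user-visible aggregate, while EXT4_EXTENTS_FL (0x80000)
is kept.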

.. _i_osd1:

The ``osd1`` field has multiple meanings depending on the creator:

Linux:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Size
     - Name
     - Description
   * - 0x0
     - __le32
     - l_i_version
     - Inode version. However, if the EA_INODE inode flag is set, this inode
       stores an extended attribute value and this field contains the upper 32
       bits of the attribute value's reference count.

Hurd:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Size
     - Name
     - Description
   * - 0x0
     - __le32
     - h_i_translator
     - ??

Masix:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Size
     - Name
     - Description
   * - 0x0
     - __le32
     - m_i_reserved
     - ??

.. _i_osd2:

The ``osd2`` field has multiple meanings depending on the filesystem creator:

Linux:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Size
     - Name
     - Description
   * - 0x0
     - __le16
     - l_i_blocks_high
     - Upper 16-bits of the block count. Please see the note attached to
       i_blocks_lo.
   * - 0x2
     - __le16
     - l_i_file_acl_high
     - Upper 16-bits of the extended attribute block (historically, the file
       ACL location). See the Extended Attributes section below.
   * - 0x4
     - __le16
     - l_i_uid_high
     - Upper 16-bits of the Owner UID.
   * - 0x6
     - __le16
     - l_i_gid_high
     - Upper 16-bits of the GID.
   * - 0x8
     - __le16
     - l_i_checksum_lo
     - Lower 16-bits of the inode checksum.
   * - 0xA
     - __le16
     - l_i_reserved
     - Unused.
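
Several values are split between the base inode and the Linux ``osd2``
area. The sketch below shows one way to reassemble them from a raw 128-byte
inode, using the offsets from the tables in this section. The function name
``combine_linux_fields`` and its parameters are assumptions for this
example; ``huge_file`` stands for the filesystem feature flag and
``fs_block_size`` for the filesystem block size, both of which would come
from the superblock.

```python
import struct

EXT4_HUGE_FILE_FL = 0x40000


def combine_linux_fields(raw, huge_file, fs_block_size):
    """Reassemble split fields from the first 128 bytes of an inode."""
    (i_uid,) = struct.unpack_from("<H", raw, 0x2)
    (i_size_lo,) = struct.unpack_from("<I", raw, 0x4)
    (i_gid,) = struct.unpack_from("<H", raw, 0x18)
    (i_blocks_lo,) = struct.unpack_from("<I", raw, 0x1C)
    (i_flags,) = struct.unpack_from("<I", raw, 0x20)
    (i_size_high,) = struct.unpack_from("<I", raw, 0x6C)
    # First four __le16 fields of the Linux osd2 area at offset 0x74.
    blocks_high, _acl_high, uid_high, gid_high = struct.unpack_from(
        "<4H", raw, 0x74)

    if not huge_file:
        # Only i_blocks_lo is used, in units of 512 bytes.
        disk_bytes = i_blocks_lo * 512
    else:
        blocks = i_blocks_lo + (blocks_high << 32)
        unit = fs_block_size if i_flags & EXT4_HUGE_FILE_FL else 512
        disk_bytes = blocks * unit

    return {
        "uid": i_uid | (uid_high << 16),
        "gid": i_gid | (gid_high << 16),
        "size": i_size_lo | (i_size_high << 32),
        "disk_bytes": disk_bytes,
    }
```

A caller would read ``s_inode_size`` bytes at the inode's location and pass
at least the first 128 of them as ``raw``.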

Hurd:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Size
     - Name
     - Description
   * - 0x0
     - __le16
     - h_i_reserved1
     - ??
   * - 0x2
     - __u16
     - h_i_mode_high
     - Upper 16-bits of the file mode.
   * - 0x4
     - __le16
     - h_i_uid_high
     - Upper 16-bits of the Owner UID.
   * - 0x6
     - __le16
     - h_i_gid_high
     - Upper 16-bits of the GID.
   * - 0x8
     - __u32
     - h_i_author
     - Author code?

Masix:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Size
     - Name
     - Description
   * - 0x0
     - __le16
     - h_i_reserved1
     - ??
   * - 0x2
     - __u16
     - m_i_file_acl_high
     - Upper 16-bits of the extended attribute block (historically, the file
       ACL location).
   * - 0x4
     - __u32
     - m_i_reserved2[2]
     - ??

Inode Size
~~~~~~~~~~

In ext2 and ext3, the inode structure size was fixed at 128 bytes
(``EXT2_GOOD_OLD_INODE_SIZE``) and each inode had a disk record size of
128 bytes. Starting with ext4, it is possible to allocate a larger
on-disk inode at format time for all inodes in the filesystem to provide
space beyond the end of the original ext2 inode. The on-disk inode
record size is recorded in the superblock as ``s_inode_size``. The
number of bytes actually used by ``struct ext4_inode`` beyond the original
128-byte ext2 inode is recorded in the ``i_extra_isize`` field for each
inode, which allows ``struct ext4_inode`` to grow for a new kernel without
having to upgrade all of the on-disk inodes. Access to fields beyond
``EXT2_GOOD_OLD_INODE_SIZE`` should be verified to be within
``i_extra_isize``. By default, ext4 inode records are 256 bytes, and (as
of August 2019) the inode structure is 160 bytes
(``i_extra_isize = 32``). The extra space between the end of the inode
structure and the end of the inode record can be used to store extended
attributes. Each inode record can be as large as the filesystem block
size, though this is not terribly efficient.
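
The bounds check described above can be sketched as follows. The helper
name ``field_fits`` is hypothetical; the offsets and sizes in the usage
lines come from the inode table earlier in this document.

```python
GOOD_OLD_INODE_SIZE = 128  # EXT2_GOOD_OLD_INODE_SIZE


def field_fits(field_offset, field_size, i_extra_isize):
    """True if the whole field lies within the bytes this inode uses."""
    return field_offset + field_size <= GOOD_OLD_INODE_SIZE + i_extra_isize


# With i_extra_isize = 32 the structure spans 160 bytes, so the last field,
# i_projid at 0x9C, fits exactly.
print(field_fits(0x9C, 4, 32))  # True
# An older inode with i_extra_isize = 28 ends at 156 bytes: i_projid is
# out of range, but i_crtime_extra at 0x94 is still usable.
print(field_fits(0x9C, 4, 28))  # False
print(field_fits(0x94, 4, 28))  # True
```

A reader would typically treat a field that does not fit as absent and fall
back to a default, instead of reading whatever bytes follow the structure.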

Finding an Inode
~~~~~~~~~~~~~~~~

Each block group contains ``sb->s_inodes_per_group`` inodes. Because
inode 0 is defined not to exist, this formula can be used to find the
block group that an inode lives in:
``bg = (inode_num - 1) / sb->s_inodes_per_group``. The particular inode
can be found within the block group's inode table at
``index = (inode_num - 1) % sb->s_inodes_per_group``. To get the byte
address within the inode table, use
``offset = index * sb->s_inode_size``.
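
The three formulas can be combined into a short sketch. The function name
``locate_inode`` and the geometry used in the usage lines (8192 inodes per
group, 256-byte inode records) are assumptions for illustration; real
values come from the superblock.

```python
def locate_inode(inode_num, inodes_per_group, inode_size):
    """Return (block group, index in group, byte offset in inode table)."""
    if inode_num < 1:
        raise ValueError("inode numbers start at 1; there is no inode 0")
    bg = (inode_num - 1) // inodes_per_group
    index = (inode_num - 1) % inodes_per_group
    offset = index * inode_size
    return bg, index, offset


print(locate_inode(1, 8192, 256))     # (0, 0, 0)
print(locate_inode(12, 8192, 256))    # (0, 11, 2816)
print(locate_inode(8193, 8192, 256))  # (1, 0, 0)
```

The returned offset is relative to the start of that block group's inode
table; the table's location itself has to be looked up in the group
descriptor.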

Inode Timestamps
~~~~~~~~~~~~~~~~

Four timestamps are recorded in the lower 128 bytes of the inode
structure -- inode change time (ctime), access time (atime), data
modification time (mtime), and deletion time (dtime). The four fields
are 32-bit signed integers that represent seconds since the Unix epoch
(1970-01-01 00:00:00 GMT), which means that the fields will overflow in
January 2038. If the filesystem does not have the orphan_file feature,
inodes that are not linked from any directory but are still open (orphan
inodes) have the dtime field overloaded for use with the orphan list. The
superblock field ``s_last_orphan`` points to the first inode in the orphan
list; dtime is then the number of the next orphaned inode, or zero if there
are no more orphans.
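
The orphan list is therefore a singly linked list threaded through dtime.
A minimal sketch of walking it, assuming a hypothetical ``dtime_of``
mapping from inode number to that inode's raw dtime value (a real reader
would load each inode from disk instead):

```python
def walk_orphans(s_last_orphan, dtime_of):
    """Return the orphan inode numbers in list order."""
    orphans = []
    seen = set()
    ino = s_last_orphan
    while ino != 0:
        if ino in seen:
            # Guard added for this sketch: a corrupted list could loop.
            raise ValueError("cycle in orphan list at inode %d" % ino)
        seen.add(ino)
        orphans.append(ino)
        ino = dtime_of[ino]  # dtime holds the next orphan, or zero
    return orphans


print(walk_orphans(12, {12: 15, 15: 20, 20: 0}))  # [12, 15, 20]
```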

If the inode structure size ``sb->s_inode_size`` is larger than 128
bytes and the ``i_extra_isize`` field is large enough to encompass the
respective ``i_[cma]time_extra`` field, the ctime, atime, and mtime
inode fields are widened to 64 bits. Within this “extra” 32-bit field,
the lower two bits are used to extend the 32-bit seconds field to be 34
bits wide; the upper 30 bits are used to provide nanosecond timestamp
accuracy. Therefore, timestamps should not overflow until May 2446.
dtime was not widened. There is also a fifth timestamp to record inode
creation time (crtime); this field is 64 bits wide and decoded in the
same manner as 64-bit [cma]time. Neither crtime nor dtime is accessible
through the regular stat() interface, though debugfs will report them.

We use the 32-bit signed time value plus (2^32 * (extra epoch bits)).
In other words:

.. list-table::
   :widths: 20 20 20 20 20
   :header-rows: 1

   * - Extra epoch bits
     - MSB of 32-bit time
     - Adjustment for signed 32-bit to 64-bit tv_sec
     - Decoded 64-bit tv_sec
     - Valid time range
   * - 0 0
     - 1
     - 0
     - ``-0x80000000 - -0x00000001``
     - 1901-12-13 to 1969-12-31
   * - 0 0
     - 0
     - 0
     - ``0x000000000 - 0x07fffffff``
     - 1970-01-01 to 2038-01-19
   * - 0 1
     - 1
     - 0x100000000
     - ``0x080000000 - 0x0ffffffff``
     - 2038-01-19 to 2106-02-07
   * - 0 1
     - 0
     - 0x100000000
     - ``0x100000000 - 0x17fffffff``
     - 2106-02-07 to 2174-02-25
   * - 1 0
     - 1
     - 0x200000000
     - ``0x180000000 - 0x1ffffffff``
     - 2174-02-25 to 2242-03-16
   * - 1 0
     - 0
     - 0x200000000
     - ``0x200000000 - 0x27fffffff``
     - 2242-03-16 to 2310-04-04
   * - 1 1
     - 1
     - 0x300000000
     - ``0x280000000 - 0x2ffffffff``
     - 2310-04-04 to 2378-04-22
   * - 1 1
     - 0
     - 0x300000000
     - ``0x300000000 - 0x37fffffff``
     - 2378-04-22 to 2446-05-10
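
The decoding rule in the table can be sketched as below. The function name
``decode_timestamp`` is hypothetical; it takes the raw 32-bit seconds field
(for example ``i_mtime``) as an unsigned value together with the matching
``*_extra`` field, and follows the layout described above (lower two bits
extend the epoch, upper 30 bits hold nanoseconds).

```python
def decode_timestamp(seconds32, extra):
    """Return (tv_sec, tv_nsec) from a 32-bit seconds field and its extra."""
    # Reinterpret the unsigned on-disk value as a signed 32-bit integer.
    if seconds32 & 0x80000000:
        signed_seconds = seconds32 - (1 << 32)
    else:
        signed_seconds = seconds32
    epoch_bits = extra & 0x3   # lower two bits
    nanoseconds = extra >> 2   # upper 30 bits
    return signed_seconds + (epoch_bits << 32), nanoseconds


# Rows of the table: epoch bits 0 0 with MSB set is a pre-1970 time.
print(decode_timestamp(0x80000000, 0))  # (-2147483648, 0)
# Epoch bits 0 1 with MSB set decodes to 0x080000000.
print(decode_timestamp(0x80000000, 1))  # (2147483648, 0)
# Epoch bits 1 1 with MSB clear reaches the top of the range, 0x37fffffff.
print(decode_timestamp(0x7FFFFFFF, 3))  # (15032385535, 0)
```

This follows the documented encoding only; it does not attempt to
reproduce or correct the decoding bugs in older kernels mentioned below.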

This is a somewhat odd encoding since there are effectively seven times
as many positive values as negative values. There have also been
long-standing bugs decoding and encoding dates beyond 2038, which do not
seem to have been fixed as of kernel 3.12 and e2fsprogs 1.42.8. 64-bit
kernels incorrectly use the extra epoch bits 1,1 for dates between 1901
and 1970. At some point the kernel will be fixed and e2fsck will fix this
situation, assuming that it is run before 2310.