.. SPDX-License-Identifier: GPL-2.0

Block Groups
------------

Layout
~~~~~~

The layout of a standard block group is approximately as follows (each
of these fields is discussed in a separate section below):

.. list-table::
   :widths: 1 1 1 1 1 1 1 1
   :header-rows: 1

   * - Group 0 Padding
     - ext4 Super Block
     - Group Descriptors
     - Reserved GDT Blocks
     - Data Block Bitmap
     - inode Bitmap
     - inode Table
     - Data Blocks
   * - 1024 bytes
     - 1 block
     - many blocks
     - many blocks
     - 1 block
     - 1 block
     - many blocks
     - many more blocks

For the special case of block group 0, the first 1024 bytes are unused,
to allow for the installation of x86 boot sectors and other oddities.
The superblock starts at offset 1024 bytes, whichever block that
happens to be (usually 0). However, if the block size is 1024 bytes,
then block 0 is marked in use and the superblock goes in block 1.
For all other block groups, there is no padding.
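
The offset rule above can be checked with a little arithmetic. The
following is a minimal sketch, not kernel code; the helper name
``superblock_location`` is hypothetical and only the 1024-byte offset
comes from the text.

```python
SUPERBLOCK_OFFSET = 1024  # bytes from the start of group 0, per the text above


def superblock_location(block_size):
    """Return (block number, byte offset inside that block) of the
    primary superblock for a given block size in bytes."""
    return SUPERBLOCK_OFFSET // block_size, SUPERBLOCK_OFFSET % block_size


for size in (1024, 2048, 4096):
    block, offset = superblock_location(size)
    print(f"block size {size}: block {block}, offset {offset}")
```

With a 1024-byte block size the superblock occupies block 1 from its
first byte; with larger block sizes it sits inside block 0, after the
padding.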

The ext4 driver primarily works with the superblock and the group
descriptors that are found in block group 0. Redundant copies of the
superblock and group descriptors are written to some of the block groups
across the disk in case the beginning of the disk gets trashed, though
not all block groups necessarily host a redundant copy (see following
paragraph for more details). If the group does not have a redundant
copy, the block group begins with the data block bitmap. Note also that
when the filesystem is freshly formatted, mkfs will allocate “reserved
GDT block” space after the block group descriptors and before the start
of the block bitmaps to allow for future expansion of the filesystem. By
default, a filesystem is allowed to increase in size by a factor of
1024x over the original filesystem size.

The location of the inode table is given by ``grp.bg_inode_table_*``. It
is a contiguous range of blocks large enough to contain
``sb.s_inodes_per_group * sb.s_inode_size`` bytes.
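
As a rough illustration of the size formula above, the sketch below
converts the byte count into a number of blocks. The function name and
the sample values (8192 inodes per group, 256-byte inodes, 4096-byte
blocks) are assumptions chosen for the example, not values taken from
this document.

```python
def inode_table_blocks(inodes_per_group, inode_size, block_size):
    """Blocks needed to hold one group's inode table, rounded up."""
    table_bytes = inodes_per_group * inode_size
    return -(-table_bytes // block_size)  # ceiling division


# Illustrative values only: 8192 * 256 bytes = 2 MiB, i.e. 512 blocks of 4 KiB.
print(inode_table_blocks(8192, 256, 4096))
```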

As for the ordering of items in a block group, it is generally
established that the superblock and the group descriptor table, if
present, will be at the beginning of the block group. The bitmaps and
the inode table can be anywhere, and it is quite possible for the
bitmaps to come after the inode table, or for both to be in different
groups (flex_bg). Leftover space is used for file data blocks, indirect
block maps, extent tree blocks, and extended attributes.

Flexible Block Groups
~~~~~~~~~~~~~~~~~~~~~

Starting in ext4, there is a new feature called flexible block groups
(flex_bg). In a flex_bg, several block groups are tied together as one
logical block group; the bitmap spaces and the inode table space in the
first block group of the flex_bg are expanded to include the bitmaps
and inode tables of all other block groups in the flex_bg. For example,
if the flex_bg size is 4, then group 0 will contain (in order) the
superblock, group descriptors, data block bitmaps for groups 0-3, inode
bitmaps for groups 0-3, inode tables for groups 0-3, and the remaining
space in group 0 is for file data. The effect of this is to group the
block group metadata close together for faster loading, and to enable
large files to be contiguous on disk. Backup copies of the superblock
and group descriptors are always at the beginning of block groups, even
if flex_bg is enabled. The number of block groups that make up a
flex_bg is given by 2 ^ ``sb.s_log_groups_per_flex``.
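
The grouping arithmetic can be sketched as follows. This is an
illustration under the assumption that a group's flex_bg is found by
shifting the group number right by ``sb.s_log_groups_per_flex``; the
helper names are hypothetical.

```python
def groups_per_flex(s_log_groups_per_flex):
    """Number of block groups in one flex_bg: 2 ^ s_log_groups_per_flex."""
    return 1 << s_log_groups_per_flex


def flex_group_of(group, s_log_groups_per_flex):
    """Index of the flex_bg that a block group belongs to."""
    return group >> s_log_groups_per_flex


def first_group_of_flex(group, s_log_groups_per_flex):
    """First block group of the flex_bg containing `group`, which is
    where the packed bitmaps and inode tables are described as living."""
    return (group >> s_log_groups_per_flex) << s_log_groups_per_flex


log = 2  # flex_bg size 4, as in the example above
print(groups_per_flex(log), flex_group_of(3, log), first_group_of_flex(6, log))
```

With a flex_bg size of 4, groups 0-3 map to flex_bg 0 and groups 4-7 to
flex_bg 1, whose first group is group 4.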

Meta Block Groups
~~~~~~~~~~~~~~~~~

Without the META_BG option, for safety reasons, all copies of the block
group descriptors are kept in the first block group. Given the default
128MiB (2^27 bytes) block group size and 64-byte group descriptors, ext4
can have at most 2^27/64 = 2^21 block groups. This limits the entire
filesystem size to 2^21 * 2^27 = 2^48 bytes or 256TiB.
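
The limit quoted above follows directly from the two sizes given in the
text. A short sketch of the arithmetic, with hypothetical function
names:

```python
GROUP_SIZE = 2**27  # default block group size in bytes (128 MiB)
DESC_SIZE = 64      # bytes per group descriptor


def max_groups_without_meta_bg(group_size=GROUP_SIZE, desc_size=DESC_SIZE):
    """All descriptors must fit inside the first block group."""
    return group_size // desc_size


def max_fs_bytes_without_meta_bg(group_size=GROUP_SIZE, desc_size=DESC_SIZE):
    """Largest filesystem those descriptors can describe."""
    return max_groups_without_meta_bg(group_size, desc_size) * group_size


print(max_groups_without_meta_bg() == 2**21)
print(max_fs_bytes_without_meta_bg() // 2**40, "TiB")
```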

The solution to this problem is to use the metablock group feature
(META_BG), which is already in ext3 for all 2.6 releases. With the
META_BG feature, ext4 filesystems are partitioned into many metablock
groups. Each metablock group is a cluster of block groups whose group
descriptor structures can be stored in a single disk block. For ext4
filesystems with 4 KB block size, a single metablock group partition
includes 64 block groups, or 8 GiB of disk space. The metablock group
feature moves the location of the group descriptors from the congested
first block group of the whole filesystem into the first group of each
metablock group itself. The backups are in the second and last group of
each metablock group. This increases the 2^21 maximum block groups limit
to the hard limit 2^32, allowing support for a 512PiB filesystem.
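
The figures in this paragraph can be reproduced with the sketch below.
It assumes 64-byte descriptors and the default 128MiB block group size
that goes with a 4 KB block size; the function names are hypothetical.

```python
def groups_per_meta_bg(block_size, desc_size=64):
    """Block groups whose descriptors fit in a single disk block."""
    return block_size // desc_size


def meta_bg_span_bytes(block_size, group_size=2**27, desc_size=64):
    """Disk space covered by one metablock group. The default group
    size is only meaningful for a 4 KB block size."""
    return groups_per_meta_bg(block_size, desc_size) * group_size


MAX_GROUPS = 2**32  # hard limit on the number of block groups

print(groups_per_meta_bg(4096))                 # groups per metablock group
print(meta_bg_span_bytes(4096) // 2**30, "GiB")
print(MAX_GROUPS * 2**27 // 2**50, "PiB")
```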

The change in the filesystem format replaces the current scheme where
the superblock is followed by a variable-length set of block group
descriptors. Instead, the superblock and a single block group descriptor
block are placed at the beginning of the first, second, and last block
groups in a meta-block group. A meta-block group is a collection of
block groups which can be described by a single block group descriptor
block. Since the size of the block group descriptor structure is 64
bytes, a meta-block group contains 16 block groups for filesystems with
a 1KB block size, and 64 block groups for filesystems with a 4KB
block size. Filesystems can either be created using this new block group
descriptor layout, or existing filesystems can be resized on-line, and
the field ``s_first_meta_bg`` in the superblock will indicate the first
block group using this new layout.
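
The placement rule above (first, second, and last group of each
meta-block group) can be sketched as follows. This is an illustration
only: the function name is hypothetical, and it ignores both a
truncated final meta-block group and groups that precede
``s_first_meta_bg``.

```python
def meta_bg_descriptor_groups(meta_bg_index, block_size, desc_size=64):
    """Return the block groups of one meta-block group that hold a copy
    of its descriptor block: (primary, first backup, second backup)."""
    per_meta_bg = block_size // desc_size  # groups described by one block
    first = meta_bg_index * per_meta_bg
    return first, first + 1, first + per_meta_bg - 1


print(meta_bg_descriptor_groups(0, 4096))  # 4KB blocks, first meta-block group
print(meta_bg_descriptor_groups(1, 4096))  # 4KB blocks, second meta-block group
print(meta_bg_descriptor_groups(2, 1024))  # 1KB blocks, third meta-block group
```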

Please see an important note about ``BLOCK_UNINIT`` in the section about
block and inode bitmaps.

Lazy Block Group Initialization
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A new feature for ext4 is a set of three block group descriptor flags
that enable mkfs to skip initializing parts of the block group
metadata. Specifically, the INODE_UNINIT and BLOCK_UNINIT flags mean
that the inode and block bitmaps for that group can be calculated and
therefore the on-disk bitmap blocks are not initialized. This is
generally the case for an empty block group or a block group containing
only fixed-location block group metadata. The INODE_ZEROED flag means
that the inode table has been initialized; mkfs will unset this flag and
rely on the kernel to initialize the inode tables in the background.
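
A small decoder makes the meaning of the three flags concrete. The
numeric values below are not stated in this section; they are given as
understood from the kernel's ext4 headers and should be verified there
before being relied on. The function name is hypothetical.

```python
# Assumed bit values for the group descriptor flags field; verify
# against the kernel's ext4 header before relying on them.
EXT4_BG_INODE_UNINIT = 0x1  # inode bitmap not initialized on disk
EXT4_BG_BLOCK_UNINIT = 0x2  # block bitmap not initialized on disk
EXT4_BG_INODE_ZEROED = 0x4  # inode table has been zeroed

FLAG_NAMES = (
    (EXT4_BG_INODE_UNINIT, "INODE_UNINIT"),
    (EXT4_BG_BLOCK_UNINIT, "BLOCK_UNINIT"),
    (EXT4_BG_INODE_ZEROED, "INODE_ZEROED"),
)


def describe_bg_flags(bg_flags):
    """Return the names of the lazy-initialization flags set in bg_flags."""
    return [name for bit, name in FLAG_NAMES if bg_flags & bit]


# A freshly formatted, empty group as described above: both bitmaps
# uninitialized and the inode table not yet zeroed.
print(describe_bg_flags(EXT4_BG_INODE_UNINIT | EXT4_BG_BLOCK_UNINIT))
```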

By not writing zeroes to the bitmaps and inode table, mkfs time is
reduced considerably. Note that the feature flag is RO_COMPAT_GDT_CSUM,
but dumpe2fs prints it as “uninit_bg”; they are the same thing.