.. SPDX-License-Identifier: GPL-2.0

Block and Inode Allocation Policy
---------------------------------

ext4 recognizes (better than ext3, anyway) that data locality is
generally a desirable quality of a filesystem. On a spinning disk,
keeping related blocks near each other reduces the amount of movement
that the head actuator and disk must perform to access a data block,
thus speeding up disk I/O. On an SSD there are of course no moving parts,
but locality can increase the size of each transfer request while
reducing the total number of requests. This locality may also have the
effect of concentrating writes on a single erase block, which can speed
up file rewrites significantly. Therefore, it is useful to reduce
fragmentation whenever possible.

The first tool that ext4 uses to combat fragmentation is the multi-block
allocator. When a file is first created, the block allocator
speculatively allocates 8KiB of disk space to the file on the assumption
that the space will get written soon. When the file is closed, the
unused speculative allocations are of course freed, but if the
speculation is correct (typically the case for full writes of small
files) then the file data gets written out in a single multi-block
extent. A second, related trick that ext4 uses is delayed allocation.
Under this scheme, when a file needs more blocks to absorb file writes,
the filesystem defers deciding the exact placement on the disk until the
dirty buffers are being written out to disk. By not committing to a
particular placement until it's absolutely necessary (the commit timeout
is hit, or sync() is called, or the kernel runs out of memory), the hope
is that the filesystem can make better location decisions.
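
The benefit of delayed allocation can be illustrated with a toy model.
The sketch below is not ext4 code; the class ``DelayedAllocFile`` and its
methods are hypothetical names invented for this example. It assumes that
writes only mark logical blocks dirty, and that physical placement is
chosen for all dirty blocks together when a flush finally happens.

```python
class DelayedAllocFile:
    """Toy model: buffer dirty blocks, pick disk placement only at flush."""

    def __init__(self):
        self.dirty = []    # logical block numbers written but not yet placed
        self.extents = []  # (logical_start, physical_start, length) tuples

    def write(self, logical_block):
        # No physical block is chosen here; the write only dirties memory.
        if logical_block not in self.dirty:
            self.dirty.append(logical_block)

    def flush(self, next_free_physical):
        """Place all dirty blocks at once; return the next free physical block."""
        pending = sorted(self.dirty)
        self.dirty = []
        i = 0
        while i < len(pending):
            # Grow a run of logically consecutive blocks...
            j = i
            while j + 1 < len(pending) and pending[j + 1] == pending[j] + 1:
                j += 1
            length = j - i + 1
            # ...and map the whole run to one physically contiguous extent.
            self.extents.append((pending[i], next_free_physical, length))
            next_free_physical += length
            i = j + 1
        return next_free_physical
```

In this model, writing logical blocks 2, 0, 1 and 3 in that order and then
flushing yields a single four-block extent, because the placement decision
sees every dirty block at once instead of one block at a time.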

The third trick that ext4 (and ext3) uses is that it tries to keep a
file's data blocks in the same block group as its inode. This cuts down
on the seek penalty when the filesystem first has to read a file's inode
to learn where the file's data blocks live and then seek over to the
file's data blocks to begin I/O operations.
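
Whether an inode and a data block share a block group comes down to two
integer divisions. The following is a minimal sketch; the function names
are made up for illustration, and the per-group counts used in the usage
note (8192 inodes and 32768 blocks per group) are assumed example values
rather than anything this document specifies. Inode numbers start at 1,
which is why the inode calculation subtracts one.

```python
def inode_group(ino, inodes_per_group):
    """Block group holding inode number `ino` (inode numbers start at 1)."""
    return (ino - 1) // inodes_per_group


def block_group(block, blocks_per_group, first_data_block=0):
    """Block group holding filesystem block number `block`."""
    return (block - first_data_block) // blocks_per_group


def same_group(ino, block, inodes_per_group, blocks_per_group,
               first_data_block=0):
    """True if the inode and the data block live in the same block group."""
    return (inode_group(ino, inodes_per_group)
            == block_group(block, blocks_per_group, first_data_block))
```

With the assumed values, inode 8193 is the first inode of group 1 and
block 40000 also falls in group 1, so ``same_group(8193, 40000, 8192,
32768)`` is true and reading the inode and then its data needs no
cross-group seek.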

The fourth trick is that all the inodes in a directory are placed in the
same block group as the directory, when feasible. The working assumption
here is that all the files in a directory might be related; therefore it
is useful to try to keep them all together.
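
A rough sketch of the "same group as the directory, when feasible" rule
follows. It is a simplification and not the kernel's algorithm: the
function name ``pick_inode_group`` is hypothetical, and the fallback of
scanning forward with wraparound is an assumption made here only to give
"when feasible" a concrete meaning.

```python
def pick_inode_group(parent_group, free_inodes):
    """Return the group for a new file's inode, or None if all are full.

    free_inodes[g] is the number of free inodes in block group g.
    """
    ngroups = len(free_inodes)
    # Try the parent directory's group first, then walk forward,
    # wrapping around to group 0 after the last group.
    for step in range(ngroups):
        g = (parent_group + step) % ngroups
        if free_inodes[g] > 0:
            return g
    return None
```

For example, with ``free_inodes = [5, 0, 3, 1]`` a file created in a
directory living in group 2 stays in group 2, while a file whose parent
lives in the full group 1 spills over to group 2.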

The fifth trick is that the disk volume is cut up into 128MiB block
groups (with the default 4KiB block size); these mini-containers are
used as outlined above to try to maintain data locality. However, there
is a deliberate quirk: when a directory is created in the root
directory, the inode allocator scans the block groups and puts that
directory into the least heavily loaded block group that it can find.
This encourages directories to spread out over a disk; as the top-level
directory/file blobs fill up one block group, the allocators simply move
on to the next block group. Allegedly this scheme evens out the loading
on the block groups, though the author suspects that the directories
which are so unlucky as to land towards the end of a spinning drive get
a raw deal performance-wise.
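
The 128MiB figure follows from the group's block bitmap occupying a
single block: a 4KiB bitmap block has 32768 bits, one per block, and
32768 blocks of 4KiB each come to 128MiB. The sketch below checks that
arithmetic and adds a toy version of the top-level directory quirk. The
function names are hypothetical, and the load metric (most free blocks
among groups that still have a free inode) is an assumption for
illustration, not the kernel's actual heuristic.

```python
def group_size_bytes(block_size):
    """Bytes covered by one block group, assuming a one-block bitmap."""
    # A bitmap block holds block_size * 8 bits, one bit per block in the group.
    blocks_per_group = block_size * 8
    return blocks_per_group * block_size


def least_loaded_group(free_blocks, free_inodes):
    """Pick a group for a new top-level directory, or None if none fits.

    Hypothetical load metric: among groups that still have a free inode,
    prefer the one with the most free blocks (lowest index wins ties).
    """
    candidates = [g for g in range(len(free_inodes)) if free_inodes[g] > 0]
    if not candidates:
        return None
    return max(candidates, key=lambda g: free_blocks[g])
```

Under this metric, successive top-level directories land in different
groups as earlier ones consume free blocks, which is the spreading
effect described above.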

Of course, if all of these mechanisms fail, one can always use e4defrag
to defragment files.