.. SPDX-License-Identifier: GPL-2.0

Idmappings
==========

Most filesystem developers will have encountered idmappings. They are used when
reading from or writing ownership to disk, reporting ownership to userspace, or
for permission checking. This document is aimed at filesystem developers that
want to know how idmappings work.

Formal notes
------------

An idmapping is essentially a translation of a range of ids into another or the
same range of ids. The notational convention for idmappings that is widely used
in userspace is::

 u:k:r

``u`` indicates the first element in the upper idmapset ``U`` and ``k``
indicates the first element in the lower idmapset ``K``. The ``r`` parameter
indicates the range of the idmapping, i.e. how many ids are mapped. From now
on, we will always prefix ids with ``u`` or ``k`` to make it clear whether
we're talking about an id in the upper or lower idmapset.

To see what this looks like in practice, let's take the following idmapping::

 u22:k10000:r3

and write down the mappings it will generate::

 u22 -> k10000
 u23 -> k10001
 u24 -> k10002

From a mathematical viewpoint ``U`` and ``K`` are well-ordered sets and an
idmapping is an order isomorphism from ``U`` into ``K``. So ``U`` and ``K`` are
order isomorphic. In fact, ``U`` and ``K`` are always well-ordered subsets of
the set of all possible ids usable on a given system.

Looking at this mathematically briefly will help us highlight some properties
that make it easier to understand how we can translate between idmappings. For
example, we know that the inverse idmapping is an order isomorphism as well::

 k10000 -> u22
 k10001 -> u23
 k10002 -> u24

Given that we are dealing with order isomorphisms plus the fact that we're
dealing with subsets we can embed idmappings into each other, i.e. we can
sensibly translate between different idmappings.
For example, assume we've been
given the three idmappings::

 1. u0:k10000:r10000
 2. u0:k20000:r10000
 3. u0:k30000:r10000

and id ``k11000`` which has been generated by the first idmapping by mapping
``u1000`` from the upper idmapset down to ``k11000`` in the lower idmapset.

Because we're dealing with order isomorphic subsets it is meaningful to ask
what id ``k11000`` corresponds to in the second or third idmapping. The
straightforward algorithm to use is to apply the inverse of the first idmapping,
mapping ``k11000`` up to ``u1000``. Afterwards, we can map ``u1000`` down using
either the second or the third idmapping. The second idmapping would map
``u1000`` down to ``k21000``. The third idmapping would map ``u1000`` down to
``k31000``.

If we were given the same task for the following three idmappings::

 1. u0:k10000:r10000
 2. u0:k20000:r200
 3. u0:k30000:r300

we would fail to translate as the sets aren't order isomorphic over the full
range of the first idmapping anymore (however, they are order isomorphic over
the full range of the second idmapping). Neither the second nor the third
idmapping contains ``u1000`` in the upper idmapset ``U``. This is equivalent to
not having an id mapped. We can simply say that ``u1000`` is unmapped in the
second and third idmapping. The kernel will report unmapped ids as the
overflowuid ``(uid_t)-1`` or overflowgid ``(gid_t)-1`` to userspace.

The algorithm to calculate what a given id maps to is pretty simple. First, we
need to verify that the range can contain our target id. We will skip this step
for simplicity.
After that, if we want to know what ``id`` maps to, we can do
simple calculations:

- If we want to map from left to right::

   u:k:r
   id - u + k = n

- If we want to map from right to left::

   u:k:r
   id - k + u = n

Instead of "left to right" we can also say "down" and instead of "right to
left" we can also say "up". Obviously mapping down and up invert each other.

To see whether the simple formulas above work, consider the following two
idmappings::

 1. u0:k20000:r10000
 2. u500:k30000:r10000

Assume we are given ``k21000`` in the lower idmapset of the first idmapping. We
want to know what id this was mapped from in the upper idmapset of the first
idmapping. So we're mapping up in the first idmapping::

 id     - k      + u  = n
 k21000 - k20000 + u0 = u1000

Now assume we are given the id ``u1100`` in the upper idmapset of the second
idmapping and we want to know what this id maps down to in the lower idmapset
of the second idmapping. This means we're mapping down in the second
idmapping::

 id    - u    + k      = n
 u1100 - u500 + k30000 = k30600

General notes
-------------

In the context of the kernel an idmapping can be interpreted as mapping a range
of userspace ids into a range of kernel ids::

 userspace-id:kernel-id:range

A userspace id is always an element in the upper idmapset of an idmapping of
type ``uid_t`` or ``gid_t`` and a kernel id is always an element in the lower
idmapset of an idmapping of type ``kuid_t`` or ``kgid_t``. From now on
"userspace id" will be used to refer to the well known ``uid_t`` and ``gid_t``
types and "kernel id" will be used to refer to ``kuid_t`` and ``kgid_t``.

The kernel is mostly concerned with kernel ids. They are used when performing
permission checks and are stored in an inode's ``i_uid`` and ``i_gid`` fields.
A userspace id on the other hand is an id that is reported to userspace by the
kernel, or is passed by userspace to the kernel, or a raw device id that is
written to or read from disk.

Note that we are only concerned with idmappings as the kernel stores them, not
with how userspace would specify them.

For the rest of this document we will prefix all userspace ids with ``u`` and
all kernel ids with ``k``. Ranges of idmappings will be prefixed with ``r``. So
an idmapping will be written as ``u0:k10000:r10000``.

For example, within this idmapping, the id ``u1000`` is an id in the upper
idmapset or "userspace idmapset" starting with ``u0``. And it is mapped to
``k11000`` which is a kernel id in the lower idmapset or "kernel idmapset"
starting with ``k10000``.

A kernel id is always created by an idmapping. Such idmappings are associated
with user namespaces. Since we mainly care about how idmappings work we're not
going to be concerned with how idmappings are created nor how they are used
outside of the filesystem context. This is best left to an explanation of user
namespaces.

The initial user namespace is special. It always has an idmapping of the
following form::

 u0:k0:r4294967295

which is an identity idmapping over the full range of ids available on this
system.

Other user namespaces usually have non-identity idmappings such as::

 u0:k10000:r10000

When a process creates or wants to change ownership of a file, or when the
ownership of a file is read from disk by a filesystem, the userspace id is
immediately translated into a kernel id according to the idmapping associated
with the relevant user namespace.
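
The down and up formulas from the formal notes, including the range check that
was skipped there, can be sketched in plain userspace C. This is only an
illustration under simplifying assumptions: ``struct idmap_extent``,
``map_down()``, ``map_up()`` and ``ID_UNMAPPED`` are hypothetical names that do
not exist in the kernel, and a single ``u:k:r`` extent stands in for a whole
idmapping.

.. code-block:: c

   #include <stdint.h>

   /* Hypothetical single-extent idmapping u:k:r (not a kernel structure). */
   struct idmap_extent {
           uint32_t u; /* first id in the upper (userspace) idmapset */
           uint32_t k; /* first id in the lower (kernel) idmapset */
           uint32_t r; /* number of ids that are mapped */
   };

   /* Stand-in for an unmapped id, mirroring the (uid_t)-1 convention. */
   #define ID_UNMAPPED UINT32_MAX

   /* Map down (left to right): id - u + k, if id lies in [u, u + r). */
   static uint32_t map_down(const struct idmap_extent *m, uint32_t id)
   {
           if (id < m->u || id - m->u >= m->r)
                   return ID_UNMAPPED;
           return id - m->u + m->k;
   }

   /* Map up (right to left): id - k + u, if id lies in [k, k + r). */
   static uint32_t map_up(const struct idmap_extent *m, uint32_t id)
   {
           if (id < m->k || id - m->k >= m->r)
                   return ID_UNMAPPED;
           return id - m->k + m->u;
   }

With the extent ``u0:k10000:r10000``, ``map_down()`` turns 1000 into 11000 and
``map_up()`` turns 11000 back into 1000, while ``map_up()`` of 1000 yields
``ID_UNMAPPED`` because 1000 is not in the lower idmapset.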

For instance, consider a file that is stored on disk by a filesystem as being
owned by ``u1000``:

- If a filesystem were to be mounted in the initial user namespace (as most
  filesystems are) then the initial idmapping will be used. As we saw this is
  simply the identity idmapping. This would mean id ``u1000`` read from disk
  would be mapped to id ``k1000``. So an inode's ``i_uid`` and ``i_gid`` fields
  would contain ``k1000``.

- If a filesystem were to be mounted with an idmapping of ``u0:k10000:r10000``
  then ``u1000`` read from disk would be mapped to ``k11000``. So an inode's
  ``i_uid`` and ``i_gid`` fields would contain ``k11000``.

Translation algorithms
----------------------

We've already seen briefly that it is possible to translate between different
idmappings. We'll now take a closer look at how that works.

Crossmapping
~~~~~~~~~~~~

This translation algorithm is used by the kernel in quite a few places. For
example, it is used when reporting back the ownership of a file to userspace
via the ``stat()`` system call family.

If we've been given ``k11000`` from one idmapping we can map that id up in
another idmapping. In order for this to work both idmappings need to contain
the same kernel id in their kernel idmapsets. For example, consider the
following idmappings::

 1. u0:k10000:r10000
 2. u20000:k10000:r10000

and we are mapping ``u1000`` down to ``k11000`` in the first idmapping. We can
then translate ``k11000`` into a userspace id in the second idmapping using the
kernel idmapset of the second idmapping::

 /* Map the kernel id up into a userspace id in the second idmapping. */
 from_kuid(u20000:k10000:r10000, k11000) = u21000

Note how we can get back to the kernel id in the first idmapping by inverting
the algorithm::

 /* Map the userspace id down into a kernel id in the second idmapping. */
 make_kuid(u20000:k10000:r10000, u21000) = k11000

 /* Map the kernel id up into a userspace id in the first idmapping. */
 from_kuid(u0:k10000:r10000, k11000) = u1000

This algorithm allows us to answer the question of what userspace id a given
kernel id corresponds to in a given idmapping. In order to be able to answer
this question both idmappings need to contain the same kernel id in their
respective kernel idmapsets.

For example, when the kernel reads a raw userspace id from disk it maps it down
into a kernel id according to the idmapping associated with the filesystem.
Let's assume the filesystem was mounted with an idmapping of
``u0:k20000:r10000`` and it reads a file owned by ``u1000`` from disk. This
means ``u1000`` will be mapped to ``k21000`` which is what will be stored in
the inode's ``i_uid`` and ``i_gid`` fields.

When someone in userspace calls ``stat()`` or a related function to get
ownership information about the file the kernel can't simply map the id back up
according to the filesystem's idmapping as this would give the wrong owner if
the caller is using an idmapping.

So the kernel will map the id back up in the idmapping of the caller. Let's
assume the caller has the somewhat unconventional idmapping
``u3000:k20000:r10000``. Then ``k21000`` would map back up to ``u4000``.
Consequently the user would see that this file is owned by ``u4000``.

Remapping
~~~~~~~~~

It is possible to translate a kernel id from one idmapping to another one via
the userspace idmapset of the two idmappings. This is equivalent to remapping
a kernel id.

Let's look at an example. We are given the following two idmappings::

 1. u0:k10000:r10000
 2. u0:k20000:r10000

and we are given ``k11000`` in the first idmapping.
In order to translate this
kernel id in the first idmapping into a kernel id in the second idmapping we
need to perform two steps:

1. Map the kernel id up into a userspace id in the first idmapping::

    /* Map the kernel id up into a userspace id in the first idmapping. */
    from_kuid(u0:k10000:r10000, k11000) = u1000

2. Map the userspace id down into a kernel id in the second idmapping::

    /* Map the userspace id down into a kernel id in the second idmapping. */
    make_kuid(u0:k20000:r10000, u1000) = k21000

As you can see we used the userspace idmapset in both idmappings to translate
the kernel id in one idmapping to a kernel id in another idmapping.

This allows us to answer the question of what kernel id we would need to use to
get the same userspace id in another idmapping. In order to be able to answer
this question both idmappings need to contain the same userspace id in their
respective userspace idmapsets.

Note how we can easily get back to the kernel id in the first idmapping by
inverting the algorithm:

1. Map the kernel id up into a userspace id in the second idmapping::

    /* Map the kernel id up into a userspace id in the second idmapping. */
    from_kuid(u0:k20000:r10000, k21000) = u1000

2. Map the userspace id down into a kernel id in the first idmapping::

    /* Map the userspace id down into a kernel id in the first idmapping. */
    make_kuid(u0:k10000:r10000, u1000) = k11000

Another way to look at this translation is to treat it as inverting one
idmapping and applying another idmapping if both idmappings have the relevant
userspace id mapped. This will come in handy when working with idmapped mounts.

Invalid translations
~~~~~~~~~~~~~~~~~~~~

It is never valid to use an id in the kernel idmapset of one idmapping as the
id in the userspace idmapset of another or the same idmapping.
While the kernel
idmapset always indicates an idmapset in the kernel id space the userspace
idmapset indicates a userspace id. So the following translations are forbidden::

 /* Map the userspace id down into a kernel id in the first idmapping. */
 make_kuid(u0:k10000:r10000, u1000) = k11000

 /* INVALID: Map the kernel id down into a kernel id in the second idmapping. */
 make_kuid(u10000:k20000:r10000, k11000) = k21000
                                 ~~~~~~

and equally wrong::

 /* Map the kernel id up into a userspace id in the first idmapping. */
 from_kuid(u0:k10000:r10000, k11000) = u1000

 /* INVALID: Map the userspace id up into a userspace id in the second idmapping. */
 from_kuid(u20000:k0:r10000, u1000) = u21000
                             ~~~~~

Since userspace ids have type ``uid_t`` and ``gid_t`` and kernel ids have type
``kuid_t`` and ``kgid_t`` the compiler will throw an error when they are
conflated. So the two examples above would cause a compilation failure.

Idmappings when creating filesystem objects
-------------------------------------------

The concepts of mapping an id down or mapping an id up are expressed in the two
kernel functions filesystem developers are rather familiar with and which we've
already used in this document::

 /* Map the userspace id down into a kernel id. */
 make_kuid(idmapping, uid)

 /* Map the kernel id up into a userspace id. */
 from_kuid(idmapping, kuid)

We will take an abbreviated look into how idmappings figure into creating
filesystem objects. For simplicity we will only look at what happens when the
VFS has already completed path lookup right before it calls into the filesystem
itself. So we're concerned with what happens when e.g. ``vfs_mkdir()`` is
called. We will also assume that the directory we're creating filesystem
objects in is readable and writable for everyone.
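
The crossmapping and remapping algorithms from the previous sections are just
the two possible compositions of these operations, which a small userspace C
sketch can make concrete. Everything here is an assumption made for
illustration: the ``sketch_*`` names, ``struct idmap_extent`` and
``INVALID_ID`` are hypothetical and not kernel APIs, and a single ``u:k:r``
extent stands in for an idmapping. Wrapping the kernel id in a struct is
similar in spirit to how distinct types let the compiler reject the invalid
translations shown earlier.

.. code-block:: c

   #include <stdint.h>

   #define INVALID_ID UINT32_MAX

   typedef uint32_t sketch_uid_t;                  /* userspace id */
   typedef struct { uint32_t val; } sketch_kuid_t; /* kernel id, distinct type */

   /* Hypothetical single-extent idmapping u:k:r. */
   struct idmap_extent { uint32_t u, k, r; };

   /* Map a userspace id down into a kernel id. */
   static sketch_kuid_t sketch_make_kuid(const struct idmap_extent *m,
                                         sketch_uid_t uid)
   {
           sketch_kuid_t kuid = { INVALID_ID };

           if (uid != INVALID_ID && uid >= m->u && uid - m->u < m->r)
                   kuid.val = uid - m->u + m->k;
           return kuid;
   }

   /* Map a kernel id up into a userspace id. */
   static sketch_uid_t sketch_from_kuid(const struct idmap_extent *m,
                                        sketch_kuid_t kuid)
   {
           if (kuid.val != INVALID_ID && kuid.val >= m->k &&
               kuid.val - m->k < m->r)
                   return kuid.val - m->k + m->u;
           return INVALID_ID;
   }

   /* Crossmapping: map down in one idmapping, then up in another. */
   static sketch_uid_t sketch_crossmap(const struct idmap_extent *from,
                                       const struct idmap_extent *to,
                                       sketch_uid_t uid)
   {
           return sketch_from_kuid(to, sketch_make_kuid(from, uid));
   }

   /* Remapping: map up in one idmapping, then down in another. */
   static sketch_kuid_t sketch_remap(const struct idmap_extent *from,
                                     const struct idmap_extent *to,
                                     sketch_kuid_t kuid)
   {
           return sketch_make_kuid(to, sketch_from_kuid(from, kuid));
   }

Crossmapping ``u1000`` from ``u0:k10000:r10000`` into ``u20000:k10000:r10000``
gives ``u21000``, and remapping ``k11000`` from ``u0:k10000:r10000`` into
``u0:k20000:r10000`` gives ``k21000``, matching the worked examples. Passing a
``sketch_kuid_t`` where a ``sketch_uid_t`` is expected does not compile.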

When creating a filesystem object the kernel will look at the caller's
filesystem ids. These are just regular ``uid_t`` and ``gid_t`` userspace ids
but they are exclusively used when determining file ownership which is why they
are called "filesystem ids". They are usually identical to the uid and gid of
the caller but can differ. We will just assume they are always identical to not
get lost in too many details.

When the caller enters the kernel two things happen:

1. Map the caller's userspace ids down into kernel ids in the caller's
   idmapping.
   (To be precise, the kernel will simply look at the kernel ids stashed in the
   credentials of the current task but for our education we'll pretend this
   translation happens just in time.)
2. Verify that the caller's kernel ids can be mapped up to userspace ids in the
   filesystem's idmapping.

The second step is important as regular filesystems will ultimately need to map
the kernel id back up into a userspace id when writing to disk.
So with the second step the kernel guarantees that a valid userspace id can be
written to disk. If it can't the kernel will refuse the creation request to not
even remotely risk filesystem corruption.

The astute reader will have realized that this is simply a variation of the
crossmapping algorithm we mentioned above in a previous section. First, the
kernel maps the caller's userspace id down into a kernel id according to the
caller's idmapping and then maps that kernel id up according to the
filesystem's idmapping.

From an implementation point of view it's worth mentioning how idmappings are
represented. All idmappings are taken from the corresponding user namespace:

- caller's idmapping (usually taken from ``current_user_ns()``)
- filesystem's idmapping (``sb->s_user_ns``)
- mount's idmapping (``mnt_idmap(vfsmnt)``)

Let's see some examples with caller/filesystem idmapping but without mount
idmappings. This will exhibit some problems we can hit. After that we will
revisit/reconsider these examples, this time using mount idmappings, to see how
they can solve the problems we observed before.

Example 1
~~~~~~~~~

::

 caller id:            u1000
 caller idmapping:     u0:k0:r4294967295
 filesystem idmapping: u0:k0:r4294967295

Both the caller and the filesystem use the identity idmapping:

1. Map the caller's userspace ids into kernel ids in the caller's idmapping::

    make_kuid(u0:k0:r4294967295, u1000) = k1000

2. Verify that the caller's kernel ids can be mapped to userspace ids in the
   filesystem's idmapping.

   For this second step the kernel will call the function
   ``fsuidgid_has_mapping()`` which ultimately boils down to calling
   ``from_kuid()``::

    from_kuid(u0:k0:r4294967295, k1000) = u1000

In this example both idmappings are the same so there's nothing exciting going
on. Ultimately the userspace id that lands on disk will be ``u1000``.

Example 2
~~~~~~~~~

::

 caller id:            u1000
 caller idmapping:     u0:k10000:r10000
 filesystem idmapping: u0:k20000:r10000

1. Map the caller's userspace ids down into kernel ids in the caller's
   idmapping::

    make_kuid(u0:k10000:r10000, u1000) = k11000

2. Verify that the caller's kernel ids can be mapped up to userspace ids in the
   filesystem's idmapping::

    from_kuid(u0:k20000:r10000, k11000) = u-1

It's immediately clear that while the caller's userspace id could be
successfully mapped down into kernel ids in the caller's idmapping the kernel
ids could not be mapped up according to the filesystem's idmapping.
So the
kernel will deny this creation request.

Note that while this example is less common, because most filesystems can't be
mounted with non-initial idmappings, this is a general problem as we can see in
the next examples.

Example 3
~~~~~~~~~

::

 caller id:            u1000
 caller idmapping:     u0:k10000:r10000
 filesystem idmapping: u0:k0:r4294967295

1. Map the caller's userspace ids down into kernel ids in the caller's
   idmapping::

    make_kuid(u0:k10000:r10000, u1000) = k11000

2. Verify that the caller's kernel ids can be mapped up to userspace ids in the
   filesystem's idmapping::

    from_kuid(u0:k0:r4294967295, k11000) = u11000

We can see that the translation always succeeds. The userspace id that the
filesystem will ultimately put to disk will always be identical to the value of
the kernel id that was created in the caller's idmapping. This has mainly two
consequences.

First, we can't allow a caller to ultimately write to disk with another
userspace id. We could only do this if we were to mount the whole filesystem
with the caller's or another idmapping. But that solution is limited to a few
filesystems and not very flexible, even though this is a use-case that is
pretty important in containerized workloads.

Second, the caller will usually not be able to create any files or access
directories that have stricter permissions because none of the filesystem's
kernel ids map up into valid userspace ids in the caller's idmapping:

1. Map raw userspace ids down to kernel ids in the filesystem's idmapping::

    make_kuid(u0:k0:r4294967295, u1000) = k1000

2. Map kernel ids up to userspace ids in the caller's idmapping::

    from_kuid(u0:k10000:r10000, k1000) = u-1

Example 4
~~~~~~~~~

::

 file id:              u1000
 caller idmapping:     u0:k10000:r10000
 filesystem idmapping: u0:k0:r4294967295

In order to report ownership to userspace the kernel uses the crossmapping
algorithm introduced in a previous section:

1. Map the userspace id on disk down into a kernel id in the filesystem's
   idmapping::

    make_kuid(u0:k0:r4294967295, u1000) = k1000

2. Map the kernel id up into a userspace id in the caller's idmapping::

    from_kuid(u0:k10000:r10000, k1000) = u-1

The crossmapping algorithm fails in this case because the kernel id in the
filesystem idmapping cannot be mapped up to a userspace id in the caller's
idmapping. Thus, the kernel will report the ownership of this file as the
overflowid.

Example 5
~~~~~~~~~

::

 file id:              u1000
 caller idmapping:     u0:k10000:r10000
 filesystem idmapping: u0:k20000:r10000

In order to report ownership to userspace the kernel uses the crossmapping
algorithm introduced in a previous section:

1. Map the userspace id on disk down into a kernel id in the filesystem's
   idmapping::

    make_kuid(u0:k20000:r10000, u1000) = k21000

2. Map the kernel id up into a userspace id in the caller's idmapping::

    from_kuid(u0:k10000:r10000, k21000) = u-1

Again, the crossmapping algorithm fails in this case because the kernel id in
the filesystem idmapping cannot be mapped to a userspace id in the caller's
idmapping. Thus, the kernel will report the ownership of this file as the
overflowid.

Note how in the last two examples things would be simple if the caller were
using the initial idmapping. For a filesystem mounted with the initial
idmapping it would be trivial.
So we only consider a filesystem with an
idmapping of ``u0:k20000:r10000``:

1. Map the userspace id on disk down into a kernel id in the filesystem's
   idmapping::

    make_kuid(u0:k20000:r10000, u1000) = k21000

2. Map the kernel id up into a userspace id in the caller's idmapping::

    from_kuid(u0:k0:r4294967295, k21000) = u21000

Idmappings on idmapped mounts
-----------------------------

The examples we've seen in the previous section, where the caller's idmapping
and the filesystem's idmapping are incompatible, cause various issues for
workloads. For a more complex but common example, consider two containers
started on the host. To completely prevent the two containers from affecting
each other, an administrator may often use different non-overlapping idmappings
for the two containers::

 container1 idmapping: u0:k10000:r10000
 container2 idmapping: u0:k20000:r10000
 filesystem idmapping: u0:k30000:r10000

An administrator wanting to provide easy read-write access to the following set
of files::

 dir id:       u0
 dir/file1 id: u1000
 dir/file2 id: u2000

to both containers currently can't.

Of course the administrator has the option to recursively change ownership via
``chown()``. For example, they could change ownership so that ``dir`` and all
files below it can be crossmapped from the filesystem's into the container's
idmapping. Let's assume they change ownership so it is compatible with the
first container's idmapping::

 dir id:       u10000
 dir/file1 id: u11000
 dir/file2 id: u12000

This would still leave ``dir`` rather useless to the second container. In fact,
``dir`` and all files below it would continue to appear owned by the overflowid
for the second container.

Or consider another increasingly popular example. Some service managers such as
systemd implement a concept called "portable home directories".
A user may want
to use their home directories on different machines where they are assigned
different login userspace ids. Most users will have ``u1000`` as the login id
on their machine at home and all files in their home directory will usually be
owned by ``u1000``. At university or at work they may have another login id
such as ``u1125``. This makes it rather difficult to interact with their home
directory on their work machine.

In both cases changing ownership recursively has grave implications. The most
obvious one is that ownership is changed globally and permanently. In the home
directory case this change in ownership would even need to happen every time the
user switches from their home to their work machine. For really large sets of
files this becomes increasingly costly.

If the user is lucky, they are dealing with a filesystem that is mountable
inside user namespaces. But this would also change ownership globally and the
change in ownership is tied to the lifetime of the filesystem mount, i.e. the
superblock. The only way to change ownership is to completely unmount the
filesystem and mount it again in another user namespace. This is usually
impossible because it would mean that all users currently accessing the
filesystem can't anymore. And it means that ``dir`` still can't be shared
between two containers with different idmappings.
But usually the user doesn't even have this option since most filesystems
aren't mountable inside containers. And not having them mountable might be
desirable as it doesn't require the filesystem to deal with malicious
filesystem images.

But the use cases mentioned above and more can be handled by idmapped mounts.
They allow exposing the same set of dentries with different ownership at
different mounts. This is achieved by marking the mounts with a user namespace
through the ``mount_setattr()`` system call.
The idmapping associated with it
is then used to translate from the caller's idmapping to the filesystem's
idmapping and vice versa using the remapping algorithm we introduced above.

Idmapped mounts make it possible to change ownership in a temporary and
localized way. The ownership changes are restricted to a specific mount and the
ownership changes are tied to the lifetime of the mount. All other users and
locations where the filesystem is exposed are unaffected.

Filesystems that support idmapped mounts don't have any real reason to support
being mountable inside user namespaces. A filesystem could be exposed
completely under an idmapped mount to get the same effect. This has the
advantage that filesystems can leave the creation of the superblock to
privileged users in the initial user namespace.

However, it is perfectly possible to combine idmapped mounts with filesystems
mountable inside user namespaces. We will touch on this further below.

Filesystem types vs idmapped mount types
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

With the introduction of idmapped mounts we need to distinguish between
filesystem ownership and mount ownership of a VFS object such as an inode. The
owner of an inode might be different when looked at from a filesystem
perspective than when looked at from an idmapped mount. Such fundamental
conceptual distinctions should almost always be clearly expressed in the code.
So, to distinguish idmapped mount ownership from filesystem ownership separate
types have been introduced.

If a uid or gid has been generated using the filesystem or caller's idmapping
then we will use the ``kuid_t`` and ``kgid_t`` types. However, if a uid or gid
has been generated using a mount idmapping then we will be using the dedicated
``vfsuid_t`` and ``vfsgid_t`` types.

All VFS helpers that generate or take uids and gids as arguments use the
``vfsuid_t`` and ``vfsgid_t`` types and we will be able to rely on the compiler
to catch errors that originate from conflating filesystem and VFS uids and gids.

The ``vfsuid_t`` and ``vfsgid_t`` types are often mapped from and to ``kuid_t``
and ``kgid_t`` types similar to how ``kuid_t`` and ``kgid_t`` types are mapped
from and to ``uid_t`` and ``gid_t`` types::

 uid_t <--> kuid_t <--> vfsuid_t
 gid_t <--> kgid_t <--> vfsgid_t

Whenever we report ownership based on a ``vfsuid_t`` or ``vfsgid_t`` type,
e.g., during ``stat()``, or store ownership information in a shared VFS object
based on a ``vfsuid_t`` or ``vfsgid_t`` type, e.g., during ``chown()`` we can
use the ``vfsuid_into_kuid()`` and ``vfsgid_into_kgid()`` helpers.

To illustrate why these helpers currently exist, consider what happens when we
change ownership of an inode from an idmapped mount. After we generated
a ``vfsuid_t`` or ``vfsgid_t`` based on the mount idmapping we later commit to
this ``vfsuid_t`` or ``vfsgid_t`` to become the new filesystem wide ownership.
Thus, we are turning the ``vfsuid_t`` or ``vfsgid_t`` into a global ``kuid_t``
or ``kgid_t``. And this can be done by using ``vfsuid_into_kuid()`` and
``vfsgid_into_kgid()``.

Note, whenever a shared VFS object, e.g., a cached ``struct inode`` or a cached
``struct posix_acl``, stores ownership information a filesystem or "global"
``kuid_t`` and ``kgid_t`` must be used. Ownership expressed via ``vfsuid_t``
and ``vfsgid_t`` is specific to an idmapped mount.

We already noted that ``vfsuid_t`` and ``vfsgid_t`` types are generated based
on mount idmappings whereas ``kuid_t`` and ``kgid_t`` types are generated based
on filesystem idmappings.
To prevent abusing filesystem idmappings to generate
``vfsuid_t`` or ``vfsgid_t`` types or mount idmappings to generate ``kuid_t``
or ``kgid_t`` types, filesystem idmappings and mount idmappings are different
types as well.

All helpers that map to or from ``vfsuid_t`` and ``vfsgid_t`` types require
a mount idmapping to be passed which is of type ``struct mnt_idmap``. Passing
a filesystem or caller idmapping will cause a compilation error.

Similar to how we prefix all userspace ids in this document with ``u`` and all
kernel ids with ``k`` we will prefix all VFS ids with ``v``. So a mount
idmapping will be written as: ``u0:v10000:r10000``.

Remapping helpers
~~~~~~~~~~~~~~~~~

Idmapping functions were added that translate between idmappings. They make use
of the remapping algorithm we've introduced earlier. We're going to look at:

- ``i_uid_into_vfsuid()`` and ``i_gid_into_vfsgid()``

  The ``i_*id_into_vfs*id()`` functions translate filesystem's kernel ids into
  VFS ids in the mount's idmapping::

   /* Map the filesystem's kernel id up into a userspace id in the filesystem's idmapping. */
   from_kuid(filesystem, kid) = uid

   /* Map the filesystem's userspace id down into a VFS id in the mount's idmapping. */
   make_kuid(mount, uid) = kuid

- ``mapped_fsuid()`` and ``mapped_fsgid()``

  The ``mapped_fs*id()`` functions translate the caller's kernel ids into
  kernel ids in the filesystem's idmapping. This translation is achieved by
  remapping the caller's VFS ids using the mount's idmapping::

   /* Map the caller's VFS id up into a userspace id in the mount's idmapping. */
   from_kuid(mount, kid) = uid

   /* Map the mount's userspace id down into a kernel id in the filesystem's idmapping. */
   make_kuid(filesystem, uid) = kuid

- ``vfsuid_into_kuid()`` and ``vfsgid_into_kgid()``

  Whenever

Note that these two functions invert each other.
Consider the following idmappings::

    caller idmapping:     u0:k10000:r10000
    filesystem idmapping: u0:k20000:r10000
    mount idmapping:      u0:v10000:r10000

Assume a file owned by ``u1000`` is read from disk. The filesystem maps this id
to ``k21000`` according to its idmapping. This is what is stored in the
inode's ``i_uid`` and ``i_gid`` fields.

When the caller queries the ownership of this file via ``stat()`` the kernel
would usually simply use the crossmapping algorithm and map the filesystem's
kernel id up to a userspace id in the caller's idmapping.

But when the caller is accessing the file on an idmapped mount the kernel will
first call ``i_uid_into_vfsuid()`` thereby translating the filesystem's kernel
id into a VFS id in the mount's idmapping::

    i_uid_into_vfsuid(k21000):
      /* Map the filesystem's kernel id up into a userspace id. */
      from_kuid(u0:k20000:r10000, k21000) = u1000

      /* Map the filesystem's userspace id down into a VFS id in the mount's idmapping. */
      make_kuid(u0:v10000:r10000, u1000) = v11000

Finally, when the kernel reports the owner to the caller it will turn the
VFS id in the mount's idmapping into a userspace id in the caller's
idmapping::

    k11000 = vfsuid_into_kuid(v11000)
    from_kuid(u0:k10000:r10000, k11000) = u1000

We can test whether this algorithm really works by verifying what happens when
we create a new file. Let's say the user is creating a file with ``u1000``.

The kernel maps this to ``k11000`` in the caller's idmapping. Usually the
kernel would now apply the crossmapping, verifying that ``k11000`` can be
mapped to a userspace id in the filesystem's idmapping. Since ``k11000`` can't
be mapped up in the filesystem's idmapping directly this creation request
fails.
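
The failing check described in the last paragraph can be sketched in userspace
C. This is only a model under stated assumptions: ``struct idmap``,
``model_make_kuid()``, ``model_from_kuid()`` and ``model_crossmap()`` are
hypothetical names that imitate the ``u:k:r`` notation with plain integers, and
``UNMAPPED`` is a stand-in for an unmapped id. None of them are the kernel's
own definitions.

.. code-block:: c

    #include <stdint.h>

    #define UNMAPPED ((uint32_t)-1)    /* stand-in for an unmapped id */

    /* Hypothetical model of a single-extent idmapping written as u:k:r. */
    struct idmap {
        uint32_t u;    /* first id in the upper idmapset */
        uint32_t k;    /* first id in the lower idmapset */
        uint32_t r;    /* number of ids that are mapped */
    };

    /* Model of make_kuid(): map a userspace id down into a kernel id. */
    static uint32_t model_make_kuid(struct idmap m, uint32_t uid)
    {
        if (uid == UNMAPPED || uid < m.u || uid - m.u >= m.r)
            return UNMAPPED;
        return m.k + (uid - m.u);
    }

    /* Model of from_kuid(): map a kernel id up into a userspace id. */
    static uint32_t model_from_kuid(struct idmap m, uint32_t kuid)
    {
        if (kuid == UNMAPPED || kuid < m.k || kuid - m.k >= m.r)
            return UNMAPPED;
        return m.u + (kuid - m.k);
    }

    /*
     * Crossmapping without an idmapped mount: map the caller's userspace id
     * down in the caller's idmapping, then up in the filesystem's idmapping.
     * Returns the id that would land on disk, or UNMAPPED.
     */
    static uint32_t model_crossmap(struct idmap caller, struct idmap fs,
                                   uint32_t uid)
    {
        return model_from_kuid(fs, model_make_kuid(caller, uid));
    }

With the caller idmapping ``u0:k10000:r10000`` and the filesystem idmapping
``u0:k20000:r10000``, ``model_crossmap()`` returns ``UNMAPPED`` for ``u1000``
in this model, which corresponds to the failing creation request.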

But when the caller is accessing the file on an idmapped mount the kernel will
first call ``mapped_fs*id()``, thereby treating the caller's kernel id
``k11000`` as the VFS id ``v11000`` and translating it into a kernel id in the
filesystem's idmapping via the mount's idmapping::

    mapped_fsuid(v11000):
      /* Map the caller's VFS id up into a userspace id in the mount's idmapping. */
      from_kuid(u0:v10000:r10000, v11000) = u1000

      /* Map the mount's userspace id down into a kernel id in the filesystem's idmapping. */
      make_kuid(u0:k20000:r10000, u1000) = k21000

When finally writing to disk the kernel will then map ``k21000`` up into a
userspace id in the filesystem's idmapping::

    from_kuid(u0:k20000:r10000, k21000) = u1000

As we can see, we end up with an invertible and therefore
information-preserving algorithm. A file created from ``u1000`` on an idmapped
mount will also be reported as being owned by ``u1000`` and vice versa.

Let's now briefly reconsider the failing examples from earlier in the context
of idmapped mounts.

Example 2 reconsidered
~~~~~~~~~~~~~~~~~~~~~~

::

    caller id:            u1000
    caller idmapping:     u0:k10000:r10000
    filesystem idmapping: u0:k20000:r10000
    mount idmapping:      u0:v10000:r10000

When the caller is using a non-initial idmapping the common case is to attach
the same idmapping to the mount. We now perform three steps:

1. Map the caller's userspace ids into kernel ids in the caller's idmapping::

      make_kuid(u0:k10000:r10000, u1000) = k11000

2. Translate the caller's VFS id into a kernel id in the filesystem's
   idmapping::

      mapped_fsuid(v11000):
        /* Map the VFS id up into a userspace id in the mount's idmapping. */
        from_kuid(u0:v10000:r10000, v11000) = u1000

        /* Map the userspace id down into a kernel id in the filesystem's idmapping. */
        make_kuid(u0:k20000:r10000, u1000) = k21000

3. Verify that the caller's kernel ids can be mapped to userspace ids in the
   filesystem's idmapping::

      from_kuid(u0:k20000:r10000, k21000) = u1000

So the ownership that lands on disk will be ``u1000``.

Example 3 reconsidered
~~~~~~~~~~~~~~~~~~~~~~

::

    caller id:            u1000
    caller idmapping:     u0:k10000:r10000
    filesystem idmapping: u0:k0:r4294967295
    mount idmapping:      u0:v10000:r10000

The same translation algorithm works with the third example.

1. Map the caller's userspace ids into kernel ids in the caller's idmapping::

      make_kuid(u0:k10000:r10000, u1000) = k11000

2. Translate the caller's VFS id into a kernel id in the filesystem's
   idmapping::

      mapped_fsuid(v11000):
        /* Map the VFS id up into a userspace id in the mount's idmapping. */
        from_kuid(u0:v10000:r10000, v11000) = u1000

        /* Map the userspace id down into a kernel id in the filesystem's idmapping. */
        make_kuid(u0:k0:r4294967295, u1000) = k1000

3. Verify that the caller's kernel ids can be mapped to userspace ids in the
   filesystem's idmapping::

      from_kuid(u0:k0:r4294967295, k1000) = u1000

So the ownership that lands on disk will be ``u1000``.

Example 4 reconsidered
~~~~~~~~~~~~~~~~~~~~~~

::

    file id:              u1000
    caller idmapping:     u0:k10000:r10000
    filesystem idmapping: u0:k0:r4294967295
    mount idmapping:      u0:v10000:r10000

In order to report ownership to userspace the kernel now does three steps using
the translation algorithm we introduced earlier:

1. Map the userspace id on disk down into a kernel id in the filesystem's
   idmapping::

      make_kuid(u0:k0:r4294967295, u1000) = k1000

2. Translate the kernel id into a VFS id in the mount's idmapping::

      i_uid_into_vfsuid(k1000):
        /* Map the kernel id up into a userspace id in the filesystem's idmapping. */
        from_kuid(u0:k0:r4294967295, k1000) = u1000

        /* Map the userspace id down into a VFS id in the mount's idmapping. */
        make_kuid(u0:v10000:r10000, u1000) = v11000

3. Map the VFS id up into a userspace id in the caller's idmapping::

      k11000 = vfsuid_into_kuid(v11000)
      from_kuid(u0:k10000:r10000, k11000) = u1000

Earlier, the file's kernel id ``k1000`` couldn't be crossmapped into the
caller's idmapping. With the idmapped mount in place it can now be crossmapped
into the caller's idmapping via the mount's idmapping. The file will now be
reported as owned by ``u1000``.

Example 5 reconsidered
~~~~~~~~~~~~~~~~~~~~~~

::

    file id:              u1000
    caller idmapping:     u0:k10000:r10000
    filesystem idmapping: u0:k20000:r10000
    mount idmapping:      u0:v10000:r10000

Again, in order to report ownership to userspace the kernel now does three
steps using the translation algorithm we introduced earlier:

1. Map the userspace id on disk down into a kernel id in the filesystem's
   idmapping::

      make_kuid(u0:k20000:r10000, u1000) = k21000

2. Translate the kernel id into a VFS id in the mount's idmapping::

      i_uid_into_vfsuid(k21000):
        /* Map the kernel id up into a userspace id in the filesystem's idmapping. */
        from_kuid(u0:k20000:r10000, k21000) = u1000

        /* Map the userspace id down into a VFS id in the mount's idmapping. */
        make_kuid(u0:v10000:r10000, u1000) = v11000

3. Map the VFS id up into a userspace id in the caller's idmapping::

      k11000 = vfsuid_into_kuid(v11000)
      from_kuid(u0:k10000:r10000, k11000) = u1000

Earlier, the file's kernel id ``k21000`` couldn't be crossmapped into the
caller's idmapping. With the idmapped mount in place it can now be crossmapped
into the caller's idmapping via the mount's idmapping. The file is now reported
as owned by ``u1000``.
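
The two remapping helpers used in the reconsidered examples can be modelled in
a few lines of userspace C. This is a sketch under stated assumptions, not the
kernel implementation: ``struct idmap``, ``map_up()``, ``map_down()``,
``model_mapped_fsuid()`` and ``model_i_uid_into_vfsuid()`` are hypothetical
names, each idmapping is reduced to a single ``u:k:r`` extent, and the distinct
``kuid_t`` and ``vfsuid_t`` types are deliberately collapsed into ``uint32_t``.

.. code-block:: c

    #include <stdint.h>

    #define UNMAPPED ((uint32_t)-1)    /* stand-in for an unmapped id */

    /* Hypothetical model of a single-extent idmapping written as u:k:r. */
    struct idmap {
        uint32_t upper;    /* first id in the upper idmapset */
        uint32_t lower;    /* first id in the lower idmapset */
        uint32_t range;    /* number of ids that are mapped */
    };

    /* Map an id of the lower idmapset up into the upper idmapset. */
    static uint32_t map_up(const struct idmap *m, uint32_t id)
    {
        if (id == UNMAPPED || id < m->lower || id - m->lower >= m->range)
            return UNMAPPED;
        return m->upper + (id - m->lower);
    }

    /* Map an id of the upper idmapset down into the lower idmapset. */
    static uint32_t map_down(const struct idmap *m, uint32_t id)
    {
        if (id == UNMAPPED || id < m->upper || id - m->upper >= m->range)
            return UNMAPPED;
        return m->lower + (id - m->upper);
    }

    /* Model of mapped_fsuid(): up via the mount, down via the filesystem. */
    static uint32_t model_mapped_fsuid(const struct idmap *mnt,
                                       const struct idmap *fs, uint32_t vfsuid)
    {
        return map_down(fs, map_up(mnt, vfsuid));
    }

    /* Model of i_uid_into_vfsuid(): up via the filesystem, down via the mount. */
    static uint32_t model_i_uid_into_vfsuid(const struct idmap *mnt,
                                            const struct idmap *fs, uint32_t kuid)
    {
        return map_down(mnt, map_up(fs, kuid));
    }

Using the mount idmapping ``u0:v10000:r10000`` and the filesystem idmapping
``u0:k20000:r10000``, ``model_mapped_fsuid()`` turns ``v11000`` into ``k21000``
and ``model_i_uid_into_vfsuid()`` turns ``k21000`` back into ``v11000``, which
mirrors the inverse relationship between the two helpers.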

Changing ownership on a home directory
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We've seen above how idmapped mounts can be used to translate between
idmappings when either the caller, the filesystem or both use a non-initial
idmapping. A wide range of use cases exist when the caller is using
a non-initial idmapping. This mostly happens in the context of containerized
workloads. As we have seen, the consequence is that access to the filesystem
doesn't work, both for filesystems mounted with the initial idmapping and for
filesystems mounted with non-initial idmappings, because the kernel ids can't
be crossmapped between the caller's and the filesystem's idmapping.

As we've seen above idmapped mounts provide a solution to this by remapping the
caller's or filesystem's idmapping according to the mount's idmapping.

Aside from containerized workloads, idmapped mounts have the advantage that
they also work when both the caller and the filesystem use the initial
idmapping which means users on the host can change the ownership of directories
and files on a per-mount basis.

Consider our previous example where a user has their home directory on portable
storage. At home they have id ``u1000`` and all files in their home directory
are owned by ``u1000`` whereas at uni or work they have login id ``u1125``.

Taking their home directory with them becomes problematic. They can't easily
access their files, they might not be able to write to disk without applying
lax permissions or ACLs and even if they can, they will end up with an annoying
mix of files and directories owned by ``u1000`` and ``u1125``.

Idmapped mounts make it possible to solve this problem. A user can create an
idmapped mount for their home directory on their work computer or their
computer at home depending on what ownership they would prefer to end up on the
portable storage itself.

Let's assume they want all files on disk to belong to ``u1000``. When the user
plugs in their portable storage at their workstation they can set up a job that
creates an idmapped mount with the minimal idmapping ``u1000:v1125:r1``. So now
when they create a file the kernel performs the following steps we already know
from above::

    caller id:            u1125
    caller idmapping:     u0:k0:r4294967295
    filesystem idmapping: u0:k0:r4294967295
    mount idmapping:      u1000:v1125:r1

1. Map the caller's userspace ids into kernel ids in the caller's idmapping::

      make_kuid(u0:k0:r4294967295, u1125) = k1125

2. Translate the caller's VFS id into a kernel id in the filesystem's
   idmapping::

      mapped_fsuid(v1125):
        /* Map the VFS id up into a userspace id in the mount's idmapping. */
        from_kuid(u1000:v1125:r1, v1125) = u1000

        /* Map the userspace id down into a kernel id in the filesystem's idmapping. */
        make_kuid(u0:k0:r4294967295, u1000) = k1000

3. Verify that the caller's filesystem ids can be mapped to userspace ids in the
   filesystem's idmapping::

      from_kuid(u0:k0:r4294967295, k1000) = u1000

So ultimately the file will be created with ``u1000`` on disk.

Now let's briefly look at what ownership the caller with id ``u1125`` will see
on their work computer:

::

    file id:              u1000
    caller idmapping:     u0:k0:r4294967295
    filesystem idmapping: u0:k0:r4294967295
    mount idmapping:      u1000:v1125:r1

1. Map the userspace id on disk down into a kernel id in the filesystem's
   idmapping::

      make_kuid(u0:k0:r4294967295, u1000) = k1000

2. Translate the kernel id into a VFS id in the mount's idmapping::

      i_uid_into_vfsuid(k1000):
        /* Map the kernel id up into a userspace id in the filesystem's idmapping. */
        from_kuid(u0:k0:r4294967295, k1000) = u1000

        /* Map the userspace id down into a VFS id in the mount's idmapping. */
        make_kuid(u1000:v1125:r1, u1000) = v1125

3. Map the VFS id up into a userspace id in the caller's idmapping::

      k1125 = vfsuid_into_kuid(v1125)
      from_kuid(u0:k0:r4294967295, k1125) = u1125

So ultimately the kernel will report to the caller that the file belongs to
``u1125``, which is the caller's userspace id on their workstation in our
example.

The raw userspace id that is put on disk is ``u1000`` so when the user takes
their home directory back to their home computer where they are assigned
``u1000`` using the initial idmapping and mount the filesystem with the initial
idmapping they will see all those files owned by ``u1000``.

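
The home directory walkthrough can be replayed end to end with a small
userspace model. As before this is only a sketch under stated assumptions:
``struct idmap``, ``up()``, ``down()``, ``model_owner_on_disk()`` and
``model_owner_reported()`` are hypothetical names, each idmapping is a single
``u:k:r`` extent, and kernel and VFS ids are plain ``uint32_t`` values rather
than the kernel's distinct types.

.. code-block:: c

    #include <stdint.h>

    #define UNMAPPED ((uint32_t)-1)    /* stand-in for an unmapped id */

    /* Hypothetical model of a single-extent idmapping written as u:k:r. */
    struct idmap {
        uint32_t upper;
        uint32_t lower;
        uint32_t range;
    };

    /* Map an id of the upper idmapset down into the lower idmapset. */
    static uint32_t down(struct idmap m, uint32_t id)
    {
        if (id == UNMAPPED || id < m.upper || id - m.upper >= m.range)
            return UNMAPPED;
        return m.lower + (id - m.upper);
    }

    /* Map an id of the lower idmapset up into the upper idmapset. */
    static uint32_t up(struct idmap m, uint32_t id)
    {
        if (id == UNMAPPED || id < m.lower || id - m.lower >= m.range)
            return UNMAPPED;
        return m.upper + (id - m.lower);
    }

    /* Creation path: which userspace id lands on disk for the caller's uid? */
    static uint32_t model_owner_on_disk(struct idmap caller, struct idmap mnt,
                                        struct idmap fs, uint32_t uid)
    {
        uint32_t id = down(caller, uid);    /* step 1: make_kuid(caller, uid) */

        id = down(fs, up(mnt, id));         /* step 2: mapped_fsuid() */
        return up(fs, id);                  /* step 3: from_kuid(filesystem, kuid) */
    }

    /* Reporting path: which userspace id does the caller see for an on-disk uid? */
    static uint32_t model_owner_reported(struct idmap caller, struct idmap mnt,
                                         struct idmap fs, uint32_t disk_uid)
    {
        uint32_t id = down(fs, disk_uid);   /* step 1: make_kuid(filesystem, uid) */

        id = down(mnt, up(fs, id));         /* step 2: i_uid_into_vfsuid() */
        return up(caller, id);              /* step 3: from_kuid(caller, kuid) */
    }

With the initial idmapping ``u0:k0:r4294967295`` for both caller and filesystem
and the mount idmapping ``u1000:v1125:r1``, ``model_owner_on_disk()`` yields
``1000`` for the caller id ``1125`` and ``model_owner_reported()`` yields
``1125`` for the on-disk id ``1000``, matching the two walkthroughs above.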