.. SPDX-License-Identifier: GPL-2.0

#########
UML HowTo
#########

.. contents:: :local:
8
9************
10Introduction
11************
12
13Welcome to User Mode Linux
14
15User Mode Linux is the first Open Source virtualization platform (first
16release date 1991) and second virtualization platform for an x86 PC.
17
18How is UML Different from a VM using Virtualization package X?
19==============================================================
20
21We have come to assume that virtualization also means some level of
22hardware emulation. In fact, it does not. As long as a virtualization
23package provides the OS with devices which the OS can recognize and
24has a driver for, the devices do not need to emulate real hardware.
25Most OSes today have built-in support for a number of "fake"
26devices used only under virtualization.
27User Mode Linux takes this concept to the ultimate extreme - there
28is not a single real device in sight. It is 100% artificial or if
29we use the correct term 100% paravirtual. All UML devices are abstract
30concepts which map onto something provided by the host - files, sockets,
31pipes, etc.
32
33The other major difference between UML and various virtualization
34packages is that there is a distinct difference between the way the UML
35kernel and the UML programs operate.
36The UML kernel is just a process running on Linux - same as any other
37program. It can be run by an unprivileged user and it does not require
38anything in terms of special CPU features.
39The UML userspace, however, is a bit different. The Linux kernel on the
40host machine assists UML in intercepting everything the program running
41on a UML instance is trying to do and making the UML kernel handle all
42of its requests.
43This is different from other virtualization packages which do not make any
44difference between the guest kernel and guest programs. This difference
45results in a number of advantages and disadvantages of UML over let's say
46QEMU which we will cover later in this document.
47
48
49Why Would I Want User Mode Linux?
50=================================
51
52
53* If User Mode Linux kernel crashes, your host kernel is still fine. It
54  is not accelerated in any way (vhost, kvm, etc) and it is not trying to
55  access any devices directly.  It is, in fact, a process like any other.
56
57* You can run a usermode kernel as a non-root user (you may need to
58  arrange appropriate permissions for some devices).
59
60* You can run a very small VM with a minimal footprint for a specific
61  task (for example 32M or less).
62
63* You can get extremely high performance for anything which is a "kernel
64  specific task" such as forwarding, firewalling, etc while still being
65  isolated from the host kernel.
66
67* You can play with kernel concepts without breaking things.
68
69* You are not bound by "emulating" hardware, so you can try weird and
70  wonderful concepts which are very difficult to support when emulating
71  real hardware such as time travel and making your system clock
72  dependent on what UML does (very useful for things like tests).
73
74* It's fun.
75
76Why not to run UML
77==================
78
79* The syscall interception technique used by UML makes it inherently
80  slower for any userspace applications. While it can do kernel tasks
81  on par with most other virtualization packages, its userspace is
82  **slow**. The root cause is that UML has a very high cost of creating
83  new processes and threads (something most Unix/Linux applications
84  take for granted).
85
86* UML is strictly uniprocessor at present. If you want to run an
87  application which needs many CPUs to function, it is clearly the
88  wrong choice.
89
90***********************
91Building a UML instance
92***********************
93
94There is no UML installer in any distribution. While you can use off
95the shelf install media to install into a blank VM using a virtualization
96package, there is no UML equivalent. You have to use appropriate tools on
97your host to build a viable filesystem image.
98
99This is extremely easy on Debian - you can do it using debootstrap. It is
100also easy on OpenWRT - the build process can build UML images. All other
101distros - YMMV.
102
103Creating an image
104=================
105
106Create a sparse raw disk image::
107
108   # dd if=/dev/zero of=disk_image_name bs=1 count=1 seek=16G
109
110This will create a 16G disk image. The OS will initially allocate only one
111block and will allocate more as they are written by UML. As of kernel
112version 4.19 UML fully supports TRIM (as usually used by flash drives).
113Using TRIM inside the UML image by specifying discard as a mount option
114or by running ``tune2fs -o discard /dev/ubdXX`` will request UML to
115return any unused blocks to the OS.
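
The difference between the apparent size and the space actually allocated
can be checked on a small scratch file. The following is only a sketch,
assuming GNU coreutils (``dd``, ``stat -c``) and a host filesystem with
sparse file support. It uses a 16M offset in a ``mktemp`` directory rather
than the real image:

```shell
#!/bin/sh
# Scratch directory so nothing outside of it is touched.
workdir=$(mktemp -d)
img="$workdir/sparse_demo.img"

# Same technique as for the real image: write one byte at a large offset.
dd if=/dev/zero of="$img" bs=1 count=1 seek=16M 2>/dev/null

# File length in bytes.
apparent=$(stat -c %s "$img")
# Bytes actually backed by blocks (block count times block unit size).
allocated=$(( $(stat -c %b "$img") * $(stat -c %B "$img") ))

echo "apparent size:  $apparent bytes"
echo "allocated size: $allocated bytes"

rm -rf "$workdir"
```

On a filesystem with sparse file support the allocated size should be a
small fraction of the apparent size.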
116
117Create a filesystem on the disk image and mount it::
118
119   # mkfs.ext4 ./disk_image_name && mount ./disk_image_name /mnt
120
121This example uses ext4, any other filesystem such as ext3, btrfs, xfs,
122jfs, etc will work too.
123
124Create a minimal OS installation on the mounted filesystem::
125
126   # debootstrap buster /mnt http://deb.debian.org/debian
127
128debootstrap does not set up the root password, fstab, hostname or
129anything related to networking. It is up to the user to do that.
130
131Set the root password - the easiest way to do that is to chroot into the
132mounted image::
133
134   # chroot /mnt
135   # passwd
136   # exit
137
138Edit key system files
139=====================
140
141UML block devices are called ubds. The fstab created by debootstrap
142will be empty and it needs an entry for the root file system::
143
   /dev/ubd0   /   ext4    discard,errors=remount-ro  0       1
145
146The image hostname will be set to the same as the host on which you
147are creating its image. It is a good idea to change that to avoid
148"Oh, bummer, I rebooted the wrong machine".
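
As a rough sketch, both files can be written from the host while the image
is still mounted. ``ROOT`` and ``UML_HOSTNAME`` are hypothetical variable
names and ``uml-guest`` is just a placeholder. ``ROOT`` defaults to a scratch
directory so the snippet can be tried safely; use ``ROOT=/mnt`` for the
real image. The fstab line carries all six fstab(5) fields, including the
``/`` mount point:

```shell
#!/bin/sh
# ROOT is the mounted image; the default is a throwaway directory.
ROOT=${ROOT:-$(mktemp -d)}
# Placeholder hostname, pick whatever suits the guest.
UML_HOSTNAME=${UML_HOSTNAME:-uml-guest}

mkdir -p "$ROOT/etc"

# Give the guest its own hostname instead of inheriting the host's.
printf '%s\n' "$UML_HOSTNAME" > "$ROOT/etc/hostname"

# Root filesystem entry: device, mount point, type, options, dump, pass.
printf '%s\n' '/dev/ubd0   /   ext4   discard,errors=remount-ro   0   1' \
    > "$ROOT/etc/fstab"

echo "wrote $ROOT/etc/hostname and $ROOT/etc/fstab"
```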
149
UML supports two classes of network devices. The older uml_net devices,
which are scheduled for obsoletion, are called ethX. The newer vector IO
devices are significantly faster and support some standard virtual
network encapsulations like Ethernet over GRE and Ethernet over L2TPv3.
These are called vecX.
155
156Depending on which one is in use, ``/etc/network/interfaces`` will
157need entries like::
158
159   # legacy UML network devices
160   auto eth0
161   iface eth0 inet dhcp
162
163   # vector UML network devices
164   auto vec0
165   iface vec0 inet dhcp
166
167We now have a UML image which is nearly ready to run, all we need is a
168UML kernel and modules for it.
169
170Most distributions have a UML package. Even if you intend to use your own
171kernel, testing the image with a stock one is always a good start. These
172packages come with a set of modules which should be copied to the target
173filesystem. The location is distribution dependent. For Debian these
174reside under /usr/lib/uml/modules. Copy recursively the content of this
175directory to the mounted UML filesystem::
176
177   # cp -rax /usr/lib/uml/modules /mnt/lib/modules
178
179If you have compiled your own kernel, you need to use the usual "install
180modules to a location" procedure by running::
181
  # make INSTALL_MOD_PATH=/mnt modules_install
183
184This will install modules into /mnt/lib/modules/$(KERNELRELEASE).
185To specify the full module installation path, use::
186
187  # make MODLIB=/mnt/lib/modules modules_install
188
189At this point the image is ready to be brought up.
190
191*************************
192Setting Up UML Networking
193*************************
194
195UML networking is designed to emulate an Ethernet connection. This
196connection may be either point-to-point (similar to a connection
197between machines using a back-to-back cable) or a connection to a
198switch. UML supports a wide variety of means to build these
199connections to all of: local machine, remote machine(s), local and
200remote UML and other VM instances.
201
202
+-----------+--------+------------------------------------+------------+
| Transport |  Type  |        Capabilities                | Throughput |
+===========+========+====================================+============+
| tap       | vector | checksum, tso                      | > 8Gbit    |
+-----------+--------+------------------------------------+------------+
| hybrid    | vector | checksum, tso, multipacket rx      | > 6Gbit    |
+-----------+--------+------------------------------------+------------+
| raw       | vector | checksum, tso, multipacket rx, tx  | > 6Gbit    |
+-----------+--------+------------------------------------+------------+
| EoGRE     | vector | multipacket rx, tx                 | > 3Gbit    |
+-----------+--------+------------------------------------+------------+
| Eol2tpv3  | vector | multipacket rx, tx                 | > 3Gbit    |
+-----------+--------+------------------------------------+------------+
| bess      | vector | multipacket rx, tx                 | > 3Gbit    |
+-----------+--------+------------------------------------+------------+
| fd        | vector | dependent on fd type               | varies     |
+-----------+--------+------------------------------------+------------+
| vde       | vector | dep. on VDE VPN: Virt.Net Locator  | varies     |
+-----------+--------+------------------------------------+------------+
| tuntap    | legacy | none                               | ~ 500Mbit  |
+-----------+--------+------------------------------------+------------+
| daemon    | legacy | none                               | ~ 450Mbit  |
+-----------+--------+------------------------------------+------------+
| socket    | legacy | none                               | ~ 450Mbit  |
+-----------+--------+------------------------------------+------------+
| ethertap  | legacy | obsolete                           | ~ 500Mbit  |
+-----------+--------+------------------------------------+------------+
| vde       | legacy | obsolete                           | ~ 500Mbit  |
+-----------+--------+------------------------------------+------------+
232
233* All transports which have tso and checksum offloads can deliver speeds
234  approaching 10G on TCP streams.
235
* All transports which have multi-packet rx and/or tx can deliver packet
  rates of 1 Mpps or more.

* All legacy transports are generally limited to ~600-700 Mbit and 0.05 Mpps.
240
241* GRE and L2TPv3 allow connections to all of: local machine, remote
242  machines, remote network devices and remote UML instances.
243
244* Socket allows connections only between UML instances.
245
246* Daemon and bess require running a local switch. This switch may be
247  connected to the host as well.
248
249
250Network configuration privileges
251================================
252
253The majority of the supported networking modes need ``root`` privileges.
254For example, in the legacy tuntap networking mode, users were required
255to be part of the group associated with the tunnel device.
256
257For newer network drivers like the vector transports, ``root`` privilege
258is required to fire an ioctl to setup the tun interface and/or use
259raw sockets where needed.
260
This can be achieved by granting the user a particular capability instead
of running UML as root.  In case of vector transport, a user can add the
capability ``CAP_NET_ADMIN`` or ``CAP_NET_RAW`` to the uml binary.
Thenceforth, UML can be run with normal user privileges, along with
full networking.
266
267For example::
268
269   # sudo setcap cap_net_raw,cap_net_admin+ep linux
270
271Configuring vector transports
272===============================
273
274All vector transports support a similar syntax:
275
276If X is the interface number as in vec0, vec1, vec2, etc, the general
277syntax for options is::
278
279   vecX:transport="Transport Name",option=value,option=value,...,option=value
280
281Common options
282--------------
283
284These options are common for all transports:
285
286* ``depth=int`` - sets the queue depth for vector IO. This is the
287  amount of packets UML will attempt to read or write in a single
288  system call. The default number is 64 and is generally sufficient
289  for most applications that need throughput in the 2-4 Gbit range.
290  Higher speeds may require larger values.
291
* ``mac=XX:XX:XX:XX:XX:XX`` - sets the interface MAC address value.
293
294* ``gro=[0,1]`` - sets GRO off or on. Enables receive/transmit offloads.
295  The effect of this option depends on the host side support in the transport
296  which is being configured. In most cases it will enable TCP segmentation and
297  RX/TX checksumming offloads. The setting must be identical on the host side
298  and the UML side. The UML kernel will produce warnings if it is not.
299  For example, GRO is enabled by default on local machine interfaces
300  (e.g. veth pairs, bridge, etc), so it should be enabled in UML in the
301  corresponding UML transports (raw, tap, hybrid) in order for networking to
302  operate correctly.
303
304* ``mtu=int`` - sets the interface MTU
305
306* ``headroom=int`` - adjusts the default headroom (32 bytes) reserved
307  if a packet will need to be re-encapsulated into for instance VXLAN.
308
309* ``vec=0`` - disable multipacket IO and fall back to packet at a
310  time mode
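
When UML is started from scripts, it can help to assemble the ``vecX``
argument from its parts. The following is a small sketch of that;
``vec_arg`` is a hypothetical helper, not something shipped with UML, and it
does no validation of the option names:

```shell
#!/bin/sh
# vec_arg INDEX TRANSPORT [option=value ...]
# Joins its arguments into the vecX:transport=...,option=value syntax.
vec_arg() {
    idx=$1
    transport=$2
    shift 2
    spec="vec${idx}:transport=${transport}"
    for opt in "$@"; do
        spec="${spec},${opt}"
    done
    printf '%s\n' "$spec"
}

vec_arg 0 tap ifname=tap0 depth=128 gro=1
# prints: vec0:transport=tap,ifname=tap0,depth=128,gro=1
```

The resulting string is passed to the UML kernel as one command line
argument.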
311
312Shared Options
313--------------
314
315* ``ifname=str`` Transports which bind to a local network interface
316  have a shared option - the name of the interface to bind to.
317
318* ``src, dst, src_port, dst_port`` - all transports which use sockets
319  which have the notion of source and destination and/or source port
320  and destination port use these to specify them.
321
322* ``v6=[0,1]`` to specify if a v6 connection is desired for all
323  transports which operate over IP. Additionally, for transports that
324  have some differences in the way they operate over v4 and v6 (for example
325  EoL2TPv3), sets the correct mode of operation. In the absence of this
326  option, the socket type is determined based on what do the src and dst
327  arguments resolve/parse to.
328
329tap transport
330-------------
331
332Example::
333
334   vecX:transport=tap,ifname=tap0,depth=128,gro=1
335
This will connect vecX to tap0 on the host. tap0 must already exist (for
example created using tunctl) and be UP.
338
339tap0 can be configured as a point-to-point interface and given an IP
340address so that UML can talk to the host. Alternatively, it is possible
341to connect UML to a tap interface which is connected to a bridge.
342
343While tap relies on the vector infrastructure, it is not a true vector
344transport at this point, because Linux does not support multi-packet
345IO on tap file descriptors for normal userspace apps like UML. This
346is a privilege which is offered only to something which can hook up
347to it at kernel level via specialized interfaces like vhost-net. A
348vhost-net like helper for UML is planned at some point in the future.
349
350Privileges required: tap transport requires either:
351
352* tap interface to exist and be created persistent and owned by the
353  UML user using tunctl. Example ``tunctl -u uml-user -t tap0``
354
355* binary to have ``CAP_NET_ADMIN`` privilege
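
On hosts without tunctl, the same persistent, user-owned tap interface can
usually be created with iproute2. This is a minimal sketch, assuming the
``uml-user`` account from the example above and a hypothetical host address
of 192.168.10.1/24. The commands are only printed unless ``APPLY=1`` is set,
since they need root:

```shell
#!/bin/sh
# Print each command by default; execute it only when APPLY=1.
run() {
    if [ "${APPLY:-0}" = 1 ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

TAP=tap0
UML_USER=uml-user            # account that will run UML
HOST_ADDR=192.168.10.1/24    # hypothetical address for the host side

# Create a persistent tap device owned by the UML user.
run ip tuntap add dev "$TAP" mode tap user "$UML_USER"
# Give the host side an address and bring the interface up.
run ip addr add "$HOST_ADDR" dev "$TAP"
run ip link set dev "$TAP" up
```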
356
357hybrid transport
358----------------
359
360Example::
361
362   vecX:transport=hybrid,ifname=tap0,depth=128,gro=1
363
364This is an experimental/demo transport which couples tap for transmit
365and a raw socket for receive. The raw socket allows multi-packet
366receive resulting in significantly higher packet rates than normal tap.
367
368Privileges required: hybrid requires ``CAP_NET_RAW`` capability by
369the UML user as well as the requirements for the tap transport.
370
371raw socket transport
372--------------------
373
374Example::
375
376   vecX:transport=raw,ifname=p-veth0,depth=128,gro=1
377
378
379This transport uses vector IO on raw sockets. While you can bind to any
interface including a physical one, the most common use is to bind to
381the "peer" side of a veth pair with the other side configured on the
382host.
383
384Example host configuration for Debian:
385
386**/etc/network/interfaces**::
387
   auto veth0
   iface veth0 inet static
    address 192.168.4.1
    netmask 255.255.255.252
    broadcast 192.168.4.3
    pre-up ip link add veth0 type veth peer name p-veth0 && \
           ip link set p-veth0 up
395
396UML can now bind to p-veth0 like this::
397
398   vec0:transport=raw,ifname=p-veth0,depth=128,gro=1
399
400
401If the UML guest is configured with 192.168.4.2 and netmask 255.255.255.0
402it can talk to the host on 192.168.4.1
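
The host side of the veth pair can also be set up by hand with iproute2,
which takes a prefix length rather than a dotted netmask. A rough sketch
follows. ``mask_to_prefix`` is a hypothetical helper, not part of any tool,
and the ``ip`` commands are only printed because they need root:

```shell
#!/bin/sh
# mask_to_prefix DOTTED_NETMASK
# Counts the set bits per octet. It does not check that the mask is
# contiguous.
mask_to_prefix() {
    bits=0
    old_ifs=$IFS
    IFS=.
    set -- $1          # split the mask into its four octets
    IFS=$old_ifs
    for octet in "$@"; do
        case $octet in
            255) bits=$((bits + 8)) ;;
            254) bits=$((bits + 7)) ;;
            252) bits=$((bits + 6)) ;;
            248) bits=$((bits + 5)) ;;
            240) bits=$((bits + 4)) ;;
            224) bits=$((bits + 3)) ;;
            192) bits=$((bits + 2)) ;;
            128) bits=$((bits + 1)) ;;
            0)   ;;
            *)   echo "invalid netmask octet: $octet" >&2; return 1 ;;
        esac
    done
    echo "$bits"
}

prefix=$(mask_to_prefix 255.255.255.252)

# Equivalent of the ifupdown stanza above, printed rather than executed.
echo "ip link add veth0 type veth peer name p-veth0"
echo "ip addr add 192.168.4.1/${prefix} dev veth0"
echo "ip link set veth0 up"
echo "ip link set p-veth0 up"
```

255.255.255.252 is a /30, which leaves exactly two usable addresses:
192.168.4.1 for the host and 192.168.4.2 for the guest.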
403
404The raw transport also provides some support for offloading some of the
405filtering to the host. The two options to control it are:
406
407* ``bpffile=str`` filename of raw bpf code to be loaded as a socket filter
408
409* ``bpfflash=int`` 0/1 allow loading of bpf from inside User Mode Linux.
410  This option allows the use of the ethtool load firmware command to
411  load bpf code.
412
413In either case the bpf code is loaded into the host kernel. While this is
414presently limited to legacy bpf syntax (not ebpf), it is still a security
415risk. It is not recommended to allow this unless the User Mode Linux
416instance is considered trusted.
417
Privileges required: raw socket transport requires ``CAP_NET_RAW``
capability.
420
421GRE socket transport
422--------------------
423
424Example::
425
426   vecX:transport=gre,src=$src_host,dst=$dst_host
427
428
429This will configure an Ethernet over ``GRE`` (aka ``GRETAP`` or
430``GREIRB``) tunnel which will connect the UML instance to a ``GRE``
431endpoint at host dst_host. ``GRE`` supports the following additional
432options:
433
* ``rx_key=int`` - GRE 32-bit integer key for rx packets, if set,
  ``tx_key`` must be set too
436
437* ``tx_key=int`` - GRE 32-bit integer key for tx packets, if set
438  ``rx_key`` must be set too
439
440* ``sequence=[0,1]`` - enable GRE sequence
441
442* ``pin_sequence=[0,1]`` - pretend that the sequence is always reset
443  on each packet (needed to interoperate with some really broken
444  implementations)
445
446* ``v6=[0,1]`` - force IPv4 or IPv6 sockets respectively
447
448* GRE checksum is not presently supported
449
450GRE has a number of caveats:
451
452* You can use only one GRE connection per IP address. There is no way to
453  multiplex connections as each GRE tunnel is terminated directly on
454  the UML instance.
455
456* The key is not really a security feature. While it was intended as such
457  its "security" is laughable. It is, however, a useful feature to
458  ensure that the tunnel is not misconfigured.
459
460An example configuration for a Linux host with a local address of
461192.168.128.1 to connect to a UML instance at 192.168.129.1
462
463**/etc/network/interfaces**::
464
465   auto gt0
466   iface gt0 inet static
467    address 10.0.0.1
468    netmask 255.255.255.0
469    broadcast 10.0.0.255
470    mtu 1500
471    pre-up ip link add gt0 type gretap local 192.168.128.1 \
472           remote 192.168.129.1 || true
473    down ip link del gt0 || true
474
Additionally, GRE has been tested against a variety of network equipment.
476
477Privileges required: GRE requires ``CAP_NET_RAW``
478
479l2tpv3 socket transport
480-----------------------
481
482_Warning_. L2TPv3 has a "bug". It is the "bug" known as "has more
483options than GNU ls". While it has some advantages, there are usually
484easier (and less verbose) ways to connect a UML instance to something.
485For example, most devices which support L2TPv3 also support GRE.
486
487Example::
488
489    vec0:transport=l2tpv3,udp=1,src=$src_host,dst=$dst_host,srcport=$src_port,dstport=$dst_port,depth=128,rx_session=0xffffffff,tx_session=0xffff
490
491This will configure an Ethernet over L2TPv3 fixed tunnel which will
492connect the UML instance to a L2TPv3 endpoint at host $dst_host using
493the L2TPv3 UDP flavour and UDP destination port $dst_port.
494
495L2TPv3 always requires the following additional options:
496
497* ``rx_session=int`` - l2tpv3 32-bit integer session for rx packets
498
499* ``tx_session=int`` - l2tpv3 32-bit integer session for tx packets
500
501As the tunnel is fixed these are not negotiated and they are
502preconfigured on both ends.
503
504Additionally, L2TPv3 supports the following optional parameters.
505
506* ``rx_cookie=int`` - l2tpv3 32-bit integer cookie for rx packets - same
507  functionality as GRE key, more to prevent misconfiguration than provide
508  actual security
509
510* ``tx_cookie=int`` - l2tpv3 32-bit integer cookie for tx packets
511
512* ``cookie64=[0,1]`` - use 64-bit cookies instead of 32-bit.
513
514* ``counter=[0,1]`` - enable l2tpv3 counter
515
516* ``pin_counter=[0,1]`` - pretend that the counter is always reset on
517  each packet (needed to interoperate with some really broken
518  implementations)
519
520* ``v6=[0,1]`` - force v6 sockets
521
522* ``udp=[0,1]`` - use raw sockets (0) or UDP (1) version of the protocol
523
524L2TPv3 has a number of caveats:
525
526* you can use only one connection per IP address in raw mode. There is
527  no way to multiplex connections as each L2TPv3 tunnel is terminated
528  directly on the UML instance. UDP mode can use different ports for
529  this purpose.
530
531Here is an example of how to configure a Linux host to connect to UML
532via L2TPv3:
533
534**/etc/network/interfaces**::
535
536   auto l2tp1
537   iface l2tp1 inet static
538    address 192.168.126.1
539    netmask 255.255.255.0
540    broadcast 192.168.126.255
541    mtu 1500
542    pre-up ip l2tp add tunnel remote 127.0.0.1 \
543           local 127.0.0.1 encap udp tunnel_id 2 \
544           peer_tunnel_id 2 udp_sport 1706 udp_dport 1707 && \
545           ip l2tp add session name l2tp1 tunnel_id 2 \
546           session_id 0xffffffff peer_session_id 0xffffffff
547    down ip l2tp del session tunnel_id 2 session_id 0xffffffff && \
548           ip l2tp del tunnel tunnel_id 2
549
550
551Privileges required: L2TPv3 requires ``CAP_NET_RAW`` for raw IP mode and
552no special privileges for the UDP mode.
553
554BESS socket transport
555---------------------
556
557BESS is a high performance modular network switch.
558
559https://github.com/NetSys/bess
560
561It has support for a simple sequential packet socket mode which in the
562more recent versions is using vector IO for high performance.
563
564Example::
565
566   vecX:transport=bess,src=$unix_src,dst=$unix_dst
567
568This will configure a BESS transport using the unix_src Unix domain
569socket address as source and unix_dst socket address as destination.
570
571For BESS configuration and how to allocate a BESS Unix domain socket port
572please see the BESS documentation.
573
574https://github.com/NetSys/bess/wiki/Built-In-Modules-and-Ports
575
576BESS transport does not require any special privileges.
577
578VDE vector transport
579--------------------
580
581Virtual Distributed Ethernet (VDE) is a project whose main goal is to provide a
582highly flexible support for virtual networking.
583
584http://wiki.virtualsquare.org/#/tutorials/vdebasics
585
586Common usages of VDE include fast prototyping and teaching.
587
588Examples:
589
590   ``vecX:transport=vde,vnl=tap://tap0``
591
592use tap0
593
594   ``vecX:transport=vde,vnl=slirp://``
595
596use slirp
597
598   ``vec0:transport=vde,vnl=vde:///tmp/switch``
599
600connect to a vde switch
601
602   ``vecX:transport=\"vde,vnl=cmd://ssh remote.host //tmp/sshlirp\"``
603
604connect to a remote slirp (instant VPN: convert ssh to VPN, it uses sshlirp)
605https://github.com/virtualsquare/sshlirp
606
607   ``vec0:transport=vde,vnl=vxvde://234.0.0.1``
608
connect to a local area cloud (all the UML nodes using the same
multicast address running on hosts in the same multicast domain (LAN)
will be automagically connected together to a virtual LAN).
612
613Configuring Legacy transports
614=============================
615
616Legacy transports are now considered obsolete. Please use the vector
617versions.
618
619***********
620Running UML
621***********
622
623This section assumes that either the user-mode-linux package from the
624distribution or a custom built kernel has been installed on the host.
625
626These add an executable called linux to the system. This is the UML
627kernel. It can be run just like any other executable.
628It will take most normal linux kernel arguments as command line
629arguments.  Additionally, it will need some UML-specific arguments
630in order to do something useful.
631
632Arguments
633=========
634
635Mandatory Arguments:
636--------------------
637
638* ``mem=int[K,M,G]`` - amount of memory. By default in bytes. It will
639  also accept K, M or G qualifiers.
640
641* ``ubdX[s,d,c,t]=`` virtual disk specification. This is not really
642  mandatory, but it is likely to be needed in nearly all cases so we can
643  specify a root file system.
644  The simplest possible image specification is the name of the image
645  file for the filesystem (created using one of the methods described
646  in `Creating an image`_).
647
648  * UBD devices support copy on write (COW). The changes are kept in
649    a separate file which can be discarded allowing a rollback to the
650    original pristine image.  If COW is desired, the UBD image is
651    specified as: ``cow_file,master_image``.
    Example: ``ubd0=Filesystem.cow,Filesystem.img``
653
654  * UBD devices can be set to use synchronous IO. Any writes are
655    immediately flushed to disk. This is done by adding ``s`` after
656    the ``ubdX`` specification.
657
658  * UBD performs some heuristics on devices specified as a single
659    filename to make sure that a COW file has not been specified as
660    the image. To turn them off, use the ``d`` flag after ``ubdX``.
661
662  * UBD supports TRIM - asking the Host OS to reclaim any unused
663    blocks in the image. To turn it off, specify the ``t`` flag after
664    ``ubdX``.
665
666* ``root=`` root device - most likely ``/dev/ubd0`` (this is a Linux
667  filesystem image)
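
The flag and COW rules above can be summarised in a small helper. This is
only a sketch; ``ubd_arg`` is a hypothetical function and ``data.img`` is a
placeholder file name. The flags (``s``, ``d``, ``c``, ``t``) go directly
after the device number, and a COW file, if any, comes before the master
image:

```shell
#!/bin/sh
# ubd_arg INDEX FLAGS IMAGE [COW_FILE]
# Builds a ubdX[flags]=... argument. Pass "" for no flags.
ubd_arg() {
    idx=$1
    flags=$2
    image=$3
    cow=${4:-}
    if [ -n "$cow" ]; then
        # COW file first, master image second.
        printf 'ubd%s%s=%s,%s\n' "$idx" "$flags" "$cow" "$image"
    else
        printf 'ubd%s%s=%s\n' "$idx" "$flags" "$image"
    fi
}

ubd_arg 0 "" Filesystem.img                  # ubd0=Filesystem.img
ubd_arg 0 "" Filesystem.img Filesystem.cow   # ubd0=Filesystem.cow,Filesystem.img
ubd_arg 1 s data.img                         # ubd1s=data.img
```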
668
669Important Optional Arguments
670----------------------------
671
If UML is run as "linux" with no extra arguments, it will try to start an
xterm for every console configured inside the image (up to 6 in most
Linux distributions). This makes it nice and easy to use UML on a host
with a GUI. It is, however, the wrong approach if UML is to be used as a
testing harness or run in a text-only environment.
678
679In order to change this behaviour we need to specify an alternative console
680and wire it to one of the supported "line" channels. For this we need to map a
681console to use something different from the default xterm.
682
683Example which will divert console number 1 to stdin/stdout::
684
685   con1=fd:0,fd:1
686
UML supports a wide variety of serial line channels which are specified using
the following syntax::

   conX=channel_type:options[,channel_type:options]
691
692
693If the channel specification contains two parts separated by comma, the first
694one is input, the second one output.
695
696* The null channel - Discard all input or output. Example ``con=null`` will set
697  all consoles to null by default.
698
* The fd channel - use file descriptor numbers for input/output. Example:
  ``con1=fd:0,fd:1``.
701
702* The port channel - start a telnet server on TCP port number. Example:
703  ``con1=port:4321``.  The host must have /usr/sbin/in.telnetd (usually part of
704  a telnetd package) and the port-helper from the UML utilities (see the
705  information for the xterm channel below).  UML will not boot until a client
706  connects.
707
708* The pty and pts channels - use system pty/pts.
709
* The tty channel - bind to an existing system tty. Example: ``con1=tty:/dev/tty8``
  will make UML use the host 8th console (usually unused).
712
* The xterm channel - this is the default - bring up an xterm on this channel
  and direct IO to it. Note that in order for xterm to work, the host must
  have the UML distribution package installed. This usually contains the
  port-helper and other utilities needed for UML to communicate with the xterm.
  Alternatively, these need to be compiled and installed from source. All
  options applicable to consoles also apply to UML serial lines which are
  presented as ttyS inside UML.
720
721Starting UML
722============
723
724We can now run UML.
725::
726
727   # linux mem=2048M umid=TEST \
728    ubd0=Filesystem.img \
729    vec0:transport=tap,ifname=tap0,depth=128,gro=1 \
730    root=/dev/ubda con=null con0=null,fd:2 con1=fd:0,fd:1
731
732This will run an instance with ``2048M RAM`` and try to use the image file
733called ``Filesystem.img`` as root. It will connect to the host using tap0.
All consoles except ``con1`` will be disabled and console 1 will
use standard input/output making it appear in the same terminal it was started in.
736
737Logging in
738============
739
If you have not set up a password when generating the image, you will have to
shut down the UML instance, mount the image, chroot into it and set it - as
described in `Creating an image`_.  If the password is already set,
you can just log in.
744
745The UML Management Console
746============================
747
748In addition to managing the image from "the inside" using normal sysadmin tools,
749it is possible to perform a number of low-level operations using the UML
750management console. The UML management console is a low-level interface to the
751kernel on a running UML instance, somewhat like the i386 SysRq interface. Since
752there is a full-blown operating system under UML, there is much greater
753flexibility possible than with the SysRq mechanism.
754
755There are a number of things you can do with the mconsole interface:
756
757* get the kernel version
758* add and remove devices
759* halt or reboot the machine
760* Send SysRq commands
761* Pause and resume the UML
762* Inspect processes running inside UML
763* Inspect UML internal /proc state
764
You need the mconsole client (uml\_mconsole) which is a part of the UML
tools package available in most Linux distributions.
767
768You also need ``CONFIG_MCONSOLE`` (under 'General Setup') enabled in the UML
769kernel.  When you boot UML, you'll see a line like::
770
771   mconsole initialized on /home/jdike/.uml/umlNJ32yL/mconsole
772
773If you specify a unique machine id on the UML command line, i.e.
774``umid=debian``, you'll see this::
775
776   mconsole initialized on /home/jdike/.uml/debian/mconsole
777
778
779That file is the socket that uml_mconsole will use to communicate with
780UML.  Run it with either the umid or the full path as its argument::
781
782   # uml_mconsole debian
783
784or
785
786   # uml_mconsole /home/jdike/.uml/debian/mconsole
787
788
789You'll get a prompt, at which you can run one of these commands:
790
791* version
792* help
793* halt
794* reboot
795* config
796* remove
797* sysrq
799* cad
800* stop
801* go
802* proc
803* stack
804
805version
806-------
807
808This command takes no arguments.  It prints the UML version::
809
810   (mconsole)  version
811   OK Linux OpenWrt 4.14.106 #0 Tue Mar 19 08:19:41 2019 x86_64
812
813
There are a couple of actual uses for this.  It's a simple no-op which
815can be used to check that a UML is running.  It's also a way of
816sending a device interrupt to the UML. UML mconsole is treated internally as
817a UML device.
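
The "is it running" check lends itself to scripting. The sketch below polls
``version`` until the instance answers. It assumes that uml_mconsole accepts
a command after the umid on its command line and exits non-zero on failure;
check this against the version shipped by your distribution.
``wait_for_uml`` and the ``MCONSOLE`` variable are hypothetical names:

```shell
#!/bin/sh
# MCONSOLE can point at a different client binary (or a stub for testing).
MCONSOLE=${MCONSOLE:-uml_mconsole}

# wait_for_uml UMID [TRIES]
# Polls "version" once per second until it succeeds or TRIES runs out.
wait_for_uml() {
    umid=$1
    tries=${2:-30}
    while [ "$tries" -gt 0 ]; do
        if "$MCONSOLE" "$umid" version >/dev/null 2>&1; then
            return 0
        fi
        tries=$((tries - 1))
        sleep 1
    done
    return 1
}
```

A test harness could then call ``wait_for_uml debian 60`` and continue only
when it returns success.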
818
819help
820----
821
822This command takes no arguments. It prints a short help screen with the
823supported mconsole commands.
824
825
826halt and reboot
827---------------
828
829These commands take no arguments.  They shut the machine down immediately, with
830no syncing of disks and no clean shutdown of userspace.  So, they are
831pretty close to crashing the machine::
832
833   (mconsole)  halt
834   OK
835
836config
837------
838
839"config" adds a new device to the virtual machine. This is supported
840by most UML device drivers. It takes one argument, which is the
841device to add, with the same syntax as the kernel command line::
842
843   (mconsole) config ubd3=/home/jdike/incoming/roots/root_fs_debian22
844
845remove
846------
847
848"remove" deletes a device from the system.  Its argument is just the
849name of the device to be removed. The device must be idle in whatever
850sense the driver considers necessary.  In the case of the ubd driver,
851the removed block device must not be mounted, swapped on, or otherwise
852open, and in the case of the network driver, the device must be down::
853
854   (mconsole)  remove ubd3
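
The ``config`` and ``remove`` commands can be wrapped for use from host
scripts. This is a hedged sketch that assumes ``uml_mconsole`` accepts the
command on its command line. The helper names ``uml_attach_disk`` and
``uml_detach_disk`` and the ``UML_MCONSOLE`` variable are illustrative only::

   # Name of the mconsole client; override it to test without UML.
   UML_MCONSOLE=${UML_MCONSOLE:-uml_mconsole}

   # Attach a host file as a ubd device on a running instance.
   uml_attach_disk() {   # usage: uml_attach_disk <umid> <ubdN> <host-file>
       "$UML_MCONSOLE" "$1" config "$2=$3"
   }

   # Detach it again; the device must be idle inside UML.
   uml_detach_disk() {   # usage: uml_detach_disk <umid> <ubdN>
       "$UML_MCONSOLE" "$1" remove "$2"
   }

With these, ``uml_attach_disk debian ubd3 /path/to/image`` sends the same
request as typing ``config ubd3=/path/to/image`` at the mconsole prompt.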
855
856sysrq
857-----
858
859This command takes one argument, which is a single letter.  It calls the
860generic kernel's SysRq driver, which does whatever is called for by
861that argument.  See the SysRq documentation in
862Documentation/admin-guide/sysrq.rst in your favorite kernel tree to
863see what letters are valid and what they do.
864
865cad
866---
867
This invokes the ``Ctrl-Alt-Del`` action in the running image.  What exactly
this ends up doing is up to init, systemd, etc.  Normally, it reboots the
machine.
871
872stop
873----
874
875This puts the UML in a loop reading mconsole requests until a 'go'
876mconsole command is received. This is very useful as a
877debugging/snapshotting tool.
878
879go
880--
881
882This resumes a UML after being paused by a 'stop' command. Note that
883when the UML has resumed, TCP connections may have timed out and if
884the UML is paused for a long period of time, crond might go a little
885crazy, running all the jobs it didn't do earlier.
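
As an illustration of the snapshotting use case, the sketch below pauses an
instance, copies one of its disk files on the host and resumes it. It is an
assumption-laden example rather than a supported tool: ``uml_snapshot`` and
``UML_MCONSOLE`` are made-up names, and ``uml_mconsole`` is assumed to accept
the command on its command line::

   # Name of the mconsole client; override it to test without UML.
   UML_MCONSOLE=${UML_MCONSOLE:-uml_mconsole}

   # Pause the instance, copy one of its disk files, then resume it.
   uml_snapshot() {   # usage: uml_snapshot <umid> <disk-file> <copy>
       "$UML_MCONSOLE" "$1" stop >/dev/null || return 1
       cp -- "$2" "$3"
       status=$?
       # Resume even if the copy failed, so the guest is never left paused.
       "$UML_MCONSOLE" "$1" go >/dev/null || return 1
       return "$status"
   }

The copy only contains what UML has already written to the file. Data still
cached inside the guest is not included, so run ``sync`` inside UML first if
a consistent image matters.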
886
887proc
888----
889
This takes one argument - the name of a file in ``/proc``, which is printed
to the mconsole standard output.
892
893stack
894-----
895
This takes one argument - the pid number of a process. Its stack is
printed to the mconsole standard output.
898
899*******************
900Advanced UML Topics
901*******************
902
903Sharing Filesystems between Virtual Machines
904============================================
905
906Don't attempt to share filesystems simply by booting two UMLs from the
907same file.  That's the same thing as booting two physical machines
908from a shared disk.  It will result in filesystem corruption.
909
910Using layered block devices
911---------------------------
912
913The way to share a filesystem between two virtual machines is to use
914the copy-on-write (COW) layering capability of the ubd block driver.
915Any changed blocks are stored in the private COW file, while reads come
916from either device - the private one if the requested block is valid in
917it, the shared one if not.  Using this scheme, the majority of data
918which is unchanged is shared between an arbitrary number of virtual
919machines, each of which has a much smaller file containing the changes
920that it has made.  With a large number of UMLs booting from a large root
921filesystem, this leads to a huge disk space saving.
922
923Sharing file system data will also help performance, since the host will
924be able to cache the shared data using a much smaller amount of memory,
925so UML disk requests will be served from the host's memory rather than
926its disks.  There is a major caveat in doing this on multisocket NUMA
927machines.  On such hardware, running many UML instances with a shared
master image and COW changes may cause issues such as NMIs caused by
excessive inter-socket traffic.
930
931If you are running UML on high-end hardware like this, make sure to
932bind UML to a set of logical CPUs residing on the same socket using the
933``taskset`` command or have a look at the "tuning" section.
934
935To add a copy-on-write layer to an existing block device file, simply
936add the name of the COW file to the appropriate ubd switch::
937
938   ubd0=root_fs_cow,root_fs_debian_22
939
940where ``root_fs_cow`` is the private COW file and ``root_fs_debian_22`` is
941the existing shared filesystem.  The COW file need not exist.  If it
942doesn't, the driver will create and initialize it.
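
To show how several instances can share one image, the following sketch
builds a separate ``ubd0=`` argument per instance. The helper name
``uml_cow_args``, the ``root_fs_cow_N`` file names and the ``umlN`` machine
ids are assumptions made for this example::

   # Print the UML command-line arguments for one instance.
   uml_cow_args() {   # usage: uml_cow_args <shared-image> <instance-number>
       printf 'ubd0=root_fs_cow_%s,%s umid=uml%s\n' "$2" "$1" "$2"
   }

   # Show what would be run for three instances; drop "echo" to start them.
   for n in 1 2 3; do
       echo ./linux $(uml_cow_args root_fs_debian_22 "$n")
   done

Each instance gets its own private COW file while all of them read unchanged
blocks from ``root_fs_debian_22``.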
943
944Disk Usage
945----------
946
UML has TRIM support which will release any unused space in its disk
image files to the host. As a result, the apparent size of an image file
can be larger than the space it occupies, so use either ``ls -ls`` or
``du`` to verify the actual file size.
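
The difference between the two sizes can be seen with any sparse file on the
host. The sketch below assumes GNU coreutils (``truncate``, ``stat -c``) and
uses a throwaway temporary file instead of a real UML image::

   # Create a 1 MiB file without writing any data to it.
   img=$(mktemp)
   truncate -s 1M "$img"

   # Apparent size in bytes, as shown in the size column of "ls -l".
   apparent=$(stat -c %s "$img")
   # Space allocated on the host in KiB, as reported by "du".
   allocated=$(du -k "$img" | cut -f1)

   echo "apparent: $apparent bytes, allocated: $allocated KiB"
   rm -f "$img"

On most filesystems the allocated figure is far below the apparent size,
which is the same effect that TRIM has on a UML disk image.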
950
COW validity
------------
953
Any changes to the master image will invalidate all COW files. If this
happens, UML will *NOT* automatically delete any of the COW files and
will refuse to boot. In this case the only options are to restore the
old image (including its last modified timestamp) or to remove all COW
files, which will result in their recreation. In the latter case any
changes stored in the COW files will be lost.
960
961Cows can moo - uml_moo : Merging a COW file with its backing file
962-----------------------------------------------------------------
963
964Depending on how you use UML and COW devices, it may be advisable to
965merge the changes in the COW file into the backing file every once in
966a while.
967
968The utility that does this is uml_moo.  Its usage is::
969
970   uml_moo COW_file new_backing_file
971
972
973There's no need to specify the backing file since that information is
974already in the COW file header.  If you're paranoid, boot the new
975merged file, and if you're happy with it, move it over the old backing
976file.
977
978``uml_moo`` creates a new backing file by default as a safety measure.
979It also has a destructive merge option which will merge the COW file
980directly into its current backing file.  This is really only usable
981when the backing file only has one COW file associated with it.  If
982there are multiple COWs associated with a backing file, a -d merge of
983one of them will invalidate all of the others.  However, it is
984convenient if you're short of disk space, and it should also be
985noticeably faster than a non-destructive merge.
986
987``uml_moo`` is installed with the UML distribution packages and is
988available as a part of UML utilities.
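
A non-destructive merge can be wrapped so that it never writes over an
existing file. This is only a sketch: ``uml_merge`` and the ``UML_MOO``
variable are names invented for this example, and the existence check is an
extra precaution rather than documented ``uml_moo`` behaviour::

   # Name of the merge utility; override it to test without UML tools.
   UML_MOO=${UML_MOO:-uml_moo}

   # Merge a COW file into a new backing file, refusing to overwrite.
   uml_merge() {   # usage: uml_merge <cow-file> <new-backing-file>
       if [ -e "$2" ]; then
           echo "refusing to overwrite $2" >&2
           return 1
       fi
       "$UML_MOO" "$1" "$2"
   }

After ``uml_merge root_fs_cow root_fs_merged`` succeeds, boot the merged
file to check it before moving it over the old backing file.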
989
990Host file access
991==================
992
993If you want to access files on the host machine from inside UML, you
994can treat it as a separate machine and either nfs mount directories
995from the host or copy files into the virtual machine with scp.
996However, since UML is running on the host, it can access those
997files just like any other process and make them available inside the
998virtual machine without the need to use the network.
999This is possible with the hostfs virtual filesystem.  With it, you
1000can mount a host directory into the UML filesystem and access the
1001files contained in it just as you would on the host.
1002
1003*SECURITY WARNING*
1004
If no hostfs parameters are given on the UML command line, the UML
instance can mount any part of the host filesystem and write to it. Always
confine hostfs to a specific "harmless" directory (for example ``/var/tmp``)
when running UML. This is especially important if UML is being run as root.
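
Confinement is done with the ``hostfs=<root dir>`` option on the UML command
line. The sketch below builds that option and rejects values that would not
confine anything. The helper name ``uml_hostfs_arg`` is hypothetical and the
checks are a suggestion, not something UML performs itself::

   # Print a hostfs= argument that confines hostfs to one directory.
   uml_hostfs_arg() {   # usage: uml_hostfs_arg <absolute-host-directory>
       case $1 in
           /?*) ;;   # absolute path, and not "/" itself
           *) echo "need an absolute path other than /" >&2; return 1 ;;
       esac
       [ -d "$1" ] || { echo "$1 is not a directory" >&2; return 1; }
       printf 'hostfs=%s\n' "$1"
   }

It could be used as ``./linux ubd0=root_fs $(uml_hostfs_arg /var/tmp)``, after
which hostfs mounts inside UML are limited to the host's ``/var/tmp``.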
1009
1010Using hostfs
1011------------
1012
1013To begin with, make sure that hostfs is available inside the virtual
1014machine with::
1015
1016   # cat /proc/filesystems
1017
1018``hostfs`` should be listed.  If it's not, either rebuild the kernel
1019with hostfs configured into it or make sure that hostfs is built as a
1020module and available inside the virtual machine, and insmod it.
1021
1022
1023Now all you need to do is run mount::
1024
1025   # mount none /mnt/host -t hostfs
1026
1027will mount the host's ``/`` on the virtual machine's ``/mnt/host``.
1028If you don't want to mount the host root directory, then you can
1029specify a subdirectory to mount with the -o switch to mount::
1030
1031   # mount none /mnt/home -t hostfs -o /home
1032
1033will mount the host's /home on the virtual machine's /mnt/home.
1034
1035hostfs as the root filesystem
1036-----------------------------
1037
1038It's possible to boot from a directory hierarchy on the host using
1039hostfs rather than using the standard filesystem in a file.
1040To start, you need that hierarchy.  The easiest way is to loop mount
1041an existing root_fs file::
1042
1043   #  mount root_fs uml_root_dir -o loop
1044
1045
1046You need to change the filesystem type of ``/`` in ``etc/fstab`` to be
1047'hostfs', so that line looks like this::
1048
1049   /dev/ubd/0       /        hostfs      defaults          1   1
1050
1051Then you need to chown to yourself all the files in that directory
1052that are owned by root.  This worked for me::
1053
1054   #  find . -uid 0 -exec chown jdike {} \;
1055
1056Next, make sure that your UML kernel has hostfs compiled in, not as a
1057module.  Then run UML with the boot device pointing at that directory::
1058
1059   ubd0=/path/to/uml/root/directory
1060
1061UML should then boot as it does normally.
1062
1063Hostfs Caveats
1064--------------
1065
Hostfs does not track changes made to the host filesystem from outside
UML. As a result, if a file is changed on the host without UML's
knowledge, UML will not notice and its own in-memory cache of the file
may be stale or corrupt. While it is possible to fix this, it is not
something which is being worked on at present.
1071
1072Tuning UML
1073============
1074
UML at present is strictly uniprocessor. It will, however, spin up a
number of threads to handle various functions.

The UBD driver, SIGIO and the MMU emulation do that. If the system is
idle, these threads will be migrated to other processors on an SMP host.
1080This, unfortunately, will usually result in LOWER performance because of
1081all of the cache/memory synchronization traffic between cores. As a
1082result, UML will usually benefit from being pinned on a single CPU,
1083especially on a large system. This can result in performance differences
1084of 5 times or higher on some benchmarks.
1085
Similarly, on large multi-node NUMA systems UML will benefit if all of
its memory is allocated from the same NUMA node it will run on. The
OS will *NOT* do that by default. In order to do that, the sysadmin
needs to create a suitable tmpfs ramdisk bound to a particular node
and use that as the source for UML RAM allocation by specifying it
in one of the ``TMPDIR``, ``TMP`` or ``TEMP`` environment variables,
which UML checks for that purpose. If none of them is usable, it will
look for shmfs mounted under ``/dev/shm``. If everything else fails, it
will use ``/tmp/`` regardless of the filesystem type used for it::
1095
1096   mount -t tmpfs -ompol=bind:X none /mnt/tmpfs-nodeX
1097   TEMP=/mnt/tmpfs-nodeX taskset -cX linux options options options..
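
The CPU list for ``taskset`` has to match the node the tmpfs is bound to.
The following sketch looks it up in the host's sysfs NUMA topology. The
function names and the ``NODE_SYSFS`` variable are assumptions made for this
example::

   # Root of the NUMA topology in sysfs; override it for testing.
   NODE_SYSFS=${NODE_SYSFS:-/sys/devices/system/node}

   # Print the CPU list (for example "0-7") that belongs to a NUMA node.
   node_cpulist() {   # usage: node_cpulist <node-number>
       cat "$NODE_SYSFS/node$1/cpulist"
   }

   # Print the command that pins UML and its RAM file to one node.
   uml_numa_cmd() {   # usage: uml_numa_cmd <node-number> <tmpfs-mount>
       cpus=$(node_cpulist "$1") || return 1
       printf 'TEMP=%s taskset -c %s linux\n' "$2" "$cpus"
   }

To follow the single-CPU advice above, replace the printed list with one CPU
number taken from it.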
1098
1099*******************************************
1100Contributing to UML and Developing with UML
1101*******************************************
1102
1103UML is an excellent platform to develop new Linux kernel concepts -
1104filesystems, devices, virtualization, etc. It provides unrivalled
1105opportunities to create and test them without being constrained to
1106emulating specific hardware.
1107
1108Example - want to try how Linux will work with 4096 "proper" network
1109devices?
1110
1111Not an issue with UML. At the same time, this is something which
1112is difficult with other virtualization packages - they are
1113constrained by the number of devices allowed on the hardware bus
1114they are trying to emulate (for example 16 on a PCI bus in qemu).
1115
1116If you have something to contribute such as a patch, a bugfix, a
1117new feature, please send it to ``linux-um@lists.infradead.org``.
1118
Please follow all standard Linux patch guidelines such as cc-ing
relevant maintainers and running ``./scripts/checkpatch.pl`` on your patch.
For more details see ``Documentation/process/submitting-patches.rst``.
1122
1123Note - the list does not accept HTML or attachments, all emails must
1124be formatted as plain text.
1125
Developing always goes hand in hand with debugging. First of all,
you can always run UML under gdb, and a section below describes how
to do that. That, however, is not the only way to
debug a Linux kernel. Quite often adding tracing statements and/or
using UML specific approaches such as ptracing the UML kernel process
are significantly more informative.
1132
1133Tracing UML
1134=============
1135
When running, UML consists of a main kernel thread and a number of
helper threads. The threads of interest for tracing are the ones that
are NOT already ptraced by UML as a part of its MMU emulation.

These are usually the first three threads visible in a ps display.
The one with the lowest PID number and using the most CPU is usually the
kernel thread. The other two are the disk (ubd) device helper thread
and the SIGIO helper thread.
Running strace on the kernel thread usually results in the following picture::
1145
1146   host$ strace -p 16566
1147   --- SIGIO {si_signo=SIGIO, si_code=POLL_IN, si_band=65} ---
1148   epoll_wait(4, [{EPOLLIN, {u32=3721159424, u64=3721159424}}], 64, 0) = 1
1149   epoll_wait(4, [], 64, 0)                = 0
1150   rt_sigreturn({mask=[PIPE]})             = 16967
1151   ptrace(PTRACE_GETREGS, 16967, NULL, 0xd5f34f38) = 0
1152   ptrace(PTRACE_GETREGSET, 16967, NT_X86_XSTATE, [{iov_base=0xd5f35010, iov_len=832}]) = 0
1153   ptrace(PTRACE_GETSIGINFO, 16967, NULL, {si_signo=SIGTRAP, si_code=0x85, si_pid=16967, si_uid=0}) = 0
1154   ptrace(PTRACE_SETREGS, 16967, NULL, 0xd5f34f38) = 0
1155   ptrace(PTRACE_SETREGSET, 16967, NT_X86_XSTATE, [{iov_base=0xd5f35010, iov_len=2696}]) = 0
1156   ptrace(PTRACE_SYSEMU, 16967, NULL, 0)   = 0
1157   --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_TRAPPED, si_pid=16967, si_uid=0, si_status=SIGTRAP, si_utime=65, si_stime=89} ---
1158   wait4(16967, [{WIFSTOPPED(s) && WSTOPSIG(s) == SIGTRAP | 0x80}], WSTOPPED|__WALL, NULL) = 16967
1159   ptrace(PTRACE_GETREGS, 16967, NULL, 0xd5f34f38) = 0
1160   ptrace(PTRACE_GETREGSET, 16967, NT_X86_XSTATE, [{iov_base=0xd5f35010, iov_len=832}]) = 0
1161   ptrace(PTRACE_GETSIGINFO, 16967, NULL, {si_signo=SIGTRAP, si_code=0x85, si_pid=16967, si_uid=0}) = 0
1162   timer_settime(0, 0, {it_interval={tv_sec=0, tv_nsec=0}, it_value={tv_sec=0, tv_nsec=2830912}}, NULL) = 0
1163   getpid()                                = 16566
1164   clock_nanosleep(CLOCK_MONOTONIC, 0, {tv_sec=1, tv_nsec=0}, NULL) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
1165   --- SIGALRM {si_signo=SIGALRM, si_code=SI_TIMER, si_timerid=0, si_overrun=0, si_value={int=1631716592, ptr=0x614204f0}} ---
1166   rt_sigreturn({mask=[PIPE]})             = -1 EINTR (Interrupted system call)
1167
1168This is a typical picture from a mostly idle UML instance.
1169
* UML interrupt controller uses epoll - this is UML waiting for IO
  interrupts::
1172
1173   epoll_wait(4, [{EPOLLIN, {u32=3721159424, u64=3721159424}}], 64, 0) = 1
1174
1175* The sequence of ptrace calls is part of MMU emulation and running the
1176  UML userspace.
1177* ``timer_settime`` is part of the UML high res timer subsystem mapping
1178  timer requests from inside UML onto the host high resolution timers.
1179* ``clock_nanosleep`` is UML going into idle (similar to the way a PC
1180  will execute an ACPI idle).
1181
As you can see, UML will generate quite a bit of output even when idle. The
output can be very informative when observing IO. It shows the actual IO calls,
their arguments and return values.
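
When only the IO calls are of interest, the ptrace traffic can be filtered
out. The sketch below is one possible way to do it, assuming that the UML
binary is called ``linux`` and that the kernel thread has the lowest PID, as
described above. The helper names are made up for this example::

   # Pick the lowest PID from a list of PIDs, one per line, on stdin.
   lowest_pid() {
       sort -n | head -n 1
   }

   # Trace a UML kernel thread, hiding the MMU emulation calls.
   uml_trace_io() {   # usage: uml_trace_io <kernel-thread-pid>
       strace -p "$1" -e 'trace=!ptrace,wait4'
   }

For example, ``uml_trace_io "$(pgrep -x linux | lowest_pid)"`` attaches to
the first thread of a single running instance.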
1185
1186Kernel debugging
1187================
1188
1189You can run UML under gdb now, though it will not necessarily agree to
1190be started under it. If you are trying to track a runtime bug, it is
1191much better to attach gdb to a running UML instance and let UML run.
1192
1193Assuming the same PID number as in the previous example, this would be::
1194
1195   # gdb -p 16566
1196
This will STOP the UML instance, so you must enter ``cont`` at the GDB
command line to request it to continue. It may be a good idea to make
this into a gdb script and pass it to gdb as an argument.
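
A minimal sketch of such a script follows. It writes a gdb command file
containing only ``continue`` and passes it to gdb with ``-x``. The helper
names are hypothetical; breakpoints or other commands can be added to the
file before the ``continue`` line::

   # Write a gdb command file that resumes UML right after attaching.
   uml_gdb_script() {   # usage: uml_gdb_script <output-file>
       printf '%s\n' 'continue' > "$1"
   }

   # Attach to a running instance and execute the command file.
   uml_gdb_attach() {   # usage: uml_gdb_attach <pid> <script-file>
       gdb -p "$1" -x "$2"
   }

With the PID from the previous example this becomes
``uml_gdb_script uml.gdb && uml_gdb_attach 16566 uml.gdb``.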
1200
1201Developing Device Drivers
1202=========================
1203
1204Nearly all UML drivers are monolithic. While it is possible to build a
1205UML driver as a kernel module, that limits the possible functionality
1206to in-kernel only and non-UML specific.  The reason for this is that
1207in order to really leverage UML, one needs to write a piece of
1208userspace code which maps driver concepts onto actual userspace host
1209calls.
1210
1211This forms the so-called "user" portion of the driver. While it can
1212reuse a lot of kernel concepts, it is generally just another piece of
1213userspace code. This portion needs some matching "kernel" code which
1214resides inside the UML image and which implements the Linux kernel part.
1215
1216*Note: There are very few limitations in the way "kernel" and "user" interact*.
1217
1218UML does not have a strictly defined kernel-to-host API. It does not
1219try to emulate a specific architecture or bus. UML's "kernel" and
1220"user" can share memory, code and interact as needed to implement
1221whatever design the software developer has in mind. The only
1222limitations are purely technical. Due to a lot of functions and
1223variables having the same names, the developer should be careful
1224which includes and libraries they are trying to refer to.
1225
1226As a result a lot of userspace code consists of simple wrappers.
1227E.g. ``os_close_file()`` is just a wrapper around ``close()``
1228which ensures that the userspace function close does not clash
1229with similarly named function(s) in the kernel part.
1230
1231Using UML as a Test Platform
1232============================
1233
1234UML is an excellent test platform for device driver development. As
1235with most things UML, "some user assembly may be required". It is
1236up to the user to build their emulation environment. UML at present
1237provides only the kernel infrastructure.
1238
1239Part of this infrastructure is the ability to load and parse fdt
1240device tree blobs as used in Arm or Open Firmware platforms. These
1241are supplied as an optional extra argument to the kernel command
1242line::
1243
1244    dtb=filename
1245
The device tree is loaded and parsed at boot time and is accessible by
drivers which query it. At present this facility is
intended solely for development purposes. UML's own devices do not
query the device tree.
1250
1251Security Considerations
1252-----------------------
1253
Drivers or any new functionality should default to not
accepting arbitrary filenames, BPF code or other parameters
which can affect the host from inside the UML instance.
For example, specifying the socket used for IPC communication
between a driver and the host at the UML command line is OK
security-wise. Allowing it as a loadable module parameter
isn't.
1261
1262If such functionality is desirable for a particular application
1263(e.g. loading BPF "firmware" for raw socket network transports),
1264it should be off by default and should be explicitly turned on
1265as a command line parameter at startup.
1266
1267Even with this in mind, the level of isolation between UML
1268and the host is relatively weak. If the UML userspace is
1269allowed to load arbitrary kernel drivers, an attacker can
1270use this to break out of UML. Thus, if UML is used in
1271a production application, it is recommended that all modules
1272are loaded at boot and kernel module loading is disabled
1273afterwards.
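
One way to follow this recommendation is the ``kernel.modules_disabled``
sysctl, which blocks further module loading until the next boot once it has
been set to 1. The sketch below is meant to be run as root inside UML at the
end of boot. The function name and the ``MODULES_DISABLED`` variable are
assumptions made for this example::

   # Sysctl file that locks module loading; override it for testing.
   MODULES_DISABLED=${MODULES_DISABLED:-/proc/sys/kernel/modules_disabled}

   # Load the listed modules, then disable any further module loading.
   uml_lock_modules() {   # usage: uml_lock_modules [module...]
       for m in "$@"; do
           modprobe "$m" || return 1
       done
       echo 1 > "$MODULES_DISABLED"
   }

If any module fails to load, the function returns before locking, so the
problem can be fixed and the call repeated.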
1274