.. SPDX-License-Identifier: GPL-2.0

==========
Nested VMX
==========

Overview
---------

On Intel processors, KVM uses Intel's VMX (Virtual-Machine eXtensions)
to easily and efficiently run guest operating systems. Normally, these guests
*cannot* themselves be hypervisors running their own guests, because in VMX,
guests cannot use VMX instructions.

The "Nested VMX" feature adds this missing capability - of running guest
hypervisors (which use VMX) with their own nested guests. It does so by
allowing a guest to use VMX instructions, and correctly and efficiently
emulating them using the single level of VMX available in the hardware.

We describe in much greater detail the theory behind the nested VMX feature,
its implementation and its performance characteristics, in the OSDI 2010 paper
"The Turtles Project: Design and Implementation of Nested Virtualization",
available at:

	https://www.usenix.org/events/osdi10/tech/full_papers/Ben-Yehuda.pdf


Terminology
-----------

Single-level virtualization has two levels - the host (KVM) and the guests.
In nested virtualization, we have three levels: The host (KVM), which we call
L0, the guest hypervisor, which we call L1, and its nested guest, which we
call L2.


Running nested VMX
------------------

The nested VMX feature is enabled by default since Linux kernel v4.20. For
older Linux kernels, it can be enabled by giving the "nested=1" option to the
kvm-intel module.

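Whether nesting is currently enabled can be read back on the host from the
module parameter. The following is a minimal sketch, not part of KVM. It
assumes the parameter is exposed through sysfs at
/sys/module/kvm_intel/parameters/nested and reads as "Y" or "N" (or "1" or
"0"); the helper names are made up for illustration.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Interpret the text exposed by a boolean module parameter.
 * Returns 1 for "Y"/"1", 0 for "N"/"0", and -1 for anything else.
 */
static int parse_nested_param(const char *text)
{
	assert(text != NULL);
	if (text[0] == 'Y' || text[0] == 'y' || text[0] == '1')
		return 1;
	if (text[0] == 'N' || text[0] == 'n' || text[0] == '0')
		return 0;
	return -1;
}

/*
 * Read the "nested" parameter from the given sysfs path.
 * Returns 1 if enabled, 0 if disabled, and -1 if the file cannot be
 * read (for example because the kvm-intel module is not loaded).
 */
static int nested_vmx_enabled(const char *path)
{
	char buf[16] = "";
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	if (!fgets(buf, sizeof(buf), f))
		buf[0] = '\0';
	fclose(f);
	return parse_nested_param(buf);
}
```

Calling nested_vmx_enabled("/sys/module/kvm_intel/parameters/nested") on L0
would then return 1 when guests are allowed to use VMX.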

No modifications are required to user space (qemu). However, qemu's default
emulated CPU type (qemu64) does not list the "VMX" CPU feature, so it must be
explicitly enabled, by giving qemu one of the following options:

     - ``-cpu host``         (emulated CPU has all features of the real CPU)

     - ``-cpu qemu64,+vmx``  (add just the vmx feature to a named CPU type)

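To confirm from inside L1 that the emulated CPU really advertises VMX, the
guest can test the VMX feature flag, which is bit 5 of ECX in CPUID leaf 1.
The sketch below assumes a GCC or Clang toolchain (it relies on the
compiler-provided <cpuid.h> header); the helper names are illustrative only.

```c
#include <assert.h>
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* VMX is reported in CPUID leaf 1, register ECX, bit 5. */
#define CPUID_1_ECX_VMX (1u << 5)
static_assert(CPUID_1_ECX_VMX == 0x20u, "VMX is CPUID.1:ECX bit 5");

/* Returns 1 if the given CPUID.1:ECX value has the VMX bit set. */
static int ecx_has_vmx(unsigned int ecx)
{
	return (ecx & CPUID_1_ECX_VMX) != 0;
}

/*
 * Returns 1 if the current CPU advertises VMX, 0 if it does not, and
 * -1 if CPUID leaf 1 is unavailable or this is not an x86 build.
 */
static int cpu_has_vmx(void)
{
#if defined(__x86_64__) || defined(__i386__)
	unsigned int eax, ebx, ecx, edx;

	if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
		return -1;
	return ecx_has_vmx(ecx);
#else
	return -1;
#endif
}
```

With the default qemu64 CPU type, cpu_has_vmx() would be expected to return 0
in the guest. With either option above it should return 1, provided nesting
is enabled on L0.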

ABIs
----

Nested VMX aims to present a standard and (eventually) fully-functional VMX
implementation for a guest hypervisor to use. As such, the official
specification of the ABI that it provides is Intel's VMX specification,
namely volume 3B of their "Intel 64 and IA-32 Architectures Software
Developer's Manual". Not all of VMX's features are currently fully supported,
but the goal is to eventually support them all, starting with the VMX features
which are used in practice by popular hypervisors (KVM and others).

As a VMX implementation, nested VMX presents a VMCS structure to L1.
As mandated by the spec, other than the two fields revision_id and abort,
this structure is *opaque* to its user, who is not supposed to know or care
about its internal structure. Rather, the structure is accessed through the
VMREAD and VMWRITE instructions.
Still, for debugging purposes, KVM developers might be interested to know the
internals of this structure; this is struct vmcs12 from arch/x86/kvm/vmx.c.
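
To illustrate what "opaque" means in practice, the following sketch emulates a
VMREAD against a tiny, hypothetical VMCS layout. The user names a field by its
architectural encoding and never sees an offset. This is not KVM's code:
struct mini_vmcs, the lookup table and the helper names are invented for the
example. Only the three field encodings and the width bits (bits 14:13 of the
encoding) come from the Intel manual, and natural-width fields are assumed to
be 64 bits wide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Architectural VMCS field encodings (Intel SDM). */
#define GUEST_ES_SELECTOR 0x0800u
#define VM_EXIT_REASON    0x4402u
#define GUEST_RIP         0x681eu

/* Hypothetical private layout; a real VMCS has many more fields. */
struct mini_vmcs {
	uint64_t guest_rip;
	uint32_t vm_exit_reason;
	uint16_t guest_es_selector;
};

/* Maps a field encoding to its offset in the private layout. */
struct field_map {
	uint32_t encoding;
	size_t offset;
};

static const struct field_map field_table[] = {
	{ GUEST_ES_SELECTOR, offsetof(struct mini_vmcs, guest_es_selector) },
	{ VM_EXIT_REASON,    offsetof(struct mini_vmcs, vm_exit_reason) },
	{ GUEST_RIP,         offsetof(struct mini_vmcs, guest_rip) },
};

/*
 * Bits 14:13 of an encoding give the field width:
 * 0 = 16-bit, 1 = 64-bit, 2 = 32-bit, 3 = natural width.
 * Returns the width in bytes, treating natural width as 8 bytes.
 */
static size_t field_width(uint32_t encoding)
{
	switch ((encoding >> 13) & 3) {
	case 0:
		return 2;
	case 2:
		return 4;
	default:
		return 8;
	}
}

/* Emulated VMREAD: returns 0 and stores the value, or -1 if unknown. */
static int mini_vmread(const struct mini_vmcs *vmcs, uint32_t encoding,
		       uint64_t *value)
{
	size_t i;

	assert(vmcs != NULL && value != NULL);
	for (i = 0; i < sizeof(field_table) / sizeof(field_table[0]); i++) {
		const unsigned char *p;
		uint16_t v16;
		uint32_t v32;

		if (field_table[i].encoding != encoding)
			continue;
		p = (const unsigned char *)vmcs + field_table[i].offset;
		switch (field_width(encoding)) {
		case 2:
			memcpy(&v16, p, sizeof(v16));
			*value = v16;
			break;
		case 4:
			memcpy(&v32, p, sizeof(v32));
			*value = v32;
			break;
		default:
			memcpy(value, p, sizeof(*value));
			break;
		}
		return 0;
	}
	return -1;	/* unknown field */
}
```

Because callers only ever pass encodings, the private layout can be reordered
without affecting them, which is the property the spec relies on.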

The name "vmcs12" refers to the VMCS that L1 builds for L2. In the code we
also have "vmcs01", the VMCS that L0 built for L1, and "vmcs02", the VMCS
which L0 builds to actually run L2. How this is done is explained in the
aforementioned paper.

For convenience, we repeat the content of struct vmcs12 here. If the internals
of this structure change, this can break live migration across KVM versions.
VMCS12_REVISION (from vmx.c) should be changed if struct vmcs12 or its inner
struct shadow_vmcs is ever changed.

::

	typedef u64 natural_width;
	struct __packed vmcs12 {
		/* According to the Intel spec, a VMCS region must start with
		 * these two user-visible fields */
		u32 revision_id;
		u32 abort;

		u32 launch_state; /* set to 0 by VMCLEAR, to 1 by VMLAUNCH */
		u32 padding[7]; /* room for future expansion */

		u64 io_bitmap_a;
		u64 io_bitmap_b;
		u64 msr_bitmap;
		u64 vm_exit_msr_store_addr;
		u64 vm_exit_msr_load_addr;
		u64 vm_entry_msr_load_addr;
		u64 tsc_offset;
		u64 virtual_apic_page_addr;
		u64 apic_access_addr;
		u64 ept_pointer;
		u64 guest_physical_address;
		u64 vmcs_link_pointer;
		u64 guest_ia32_debugctl;
		u64 guest_ia32_pat;
		u64 guest_ia32_efer;
		u64 guest_pdptr0;
		u64 guest_pdptr1;
		u64 guest_pdptr2;
		u64 guest_pdptr3;
		u64 host_ia32_pat;
		u64 host_ia32_efer;
		u64 padding64[8]; /* room for future expansion */
		natural_width cr0_guest_host_mask;
		natural_width cr4_guest_host_mask;
		natural_width cr0_read_shadow;
		natural_width cr4_read_shadow;
		natural_width dead_space[4]; /* Last remnants of cr3_target_value[0-3]. */
		natural_width exit_qualification;
		natural_width guest_linear_address;
		natural_width guest_cr0;
		natural_width guest_cr3;
		natural_width guest_cr4;
		natural_width guest_es_base;
		natural_width guest_cs_base;
		natural_width guest_ss_base;
		natural_width guest_ds_base;
		natural_width guest_fs_base;
		natural_width guest_gs_base;
		natural_width guest_ldtr_base;
		natural_width guest_tr_base;
		natural_width guest_gdtr_base;
		natural_width guest_idtr_base;
		natural_width guest_dr7;
		natural_width guest_rsp;
		natural_width guest_rip;
		natural_width guest_rflags;
		natural_width guest_pending_dbg_exceptions;
		natural_width guest_sysenter_esp;
		natural_width guest_sysenter_eip;
		natural_width host_cr0;
		natural_width host_cr3;
		natural_width host_cr4;
		natural_width host_fs_base;
		natural_width host_gs_base;
		natural_width host_tr_base;
		natural_width host_gdtr_base;
		natural_width host_idtr_base;
		natural_width host_ia32_sysenter_esp;
		natural_width host_ia32_sysenter_eip;
		natural_width host_rsp;
		natural_width host_rip;
		natural_width paddingl[8]; /* room for future expansion */
		u32 pin_based_vm_exec_control;
		u32 cpu_based_vm_exec_control;
		u32 exception_bitmap;
		u32 page_fault_error_code_mask;
		u32 page_fault_error_code_match;
		u32 cr3_target_count;
		u32 vm_exit_controls;
		u32 vm_exit_msr_store_count;
		u32 vm_exit_msr_load_count;
		u32 vm_entry_controls;
		u32 vm_entry_msr_load_count;
		u32 vm_entry_intr_info_field;
		u32 vm_entry_exception_error_code;
		u32 vm_entry_instruction_len;
		u32 tpr_threshold;
		u32 secondary_vm_exec_control;
		u32 vm_instruction_error;
		u32 vm_exit_reason;
		u32 vm_exit_intr_info;
		u32 vm_exit_intr_error_code;
		u32 idt_vectoring_info_field;
		u32 idt_vectoring_error_code;
		u32 vm_exit_instruction_len;
		u32 vmx_instruction_info;
		u32 guest_es_limit;
		u32 guest_cs_limit;
		u32 guest_ss_limit;
		u32 guest_ds_limit;
		u32 guest_fs_limit;
		u32 guest_gs_limit;
		u32 guest_ldtr_limit;
		u32 guest_tr_limit;
		u32 guest_gdtr_limit;
		u32 guest_idtr_limit;
		u32 guest_es_ar_bytes;
		u32 guest_cs_ar_bytes;
		u32 guest_ss_ar_bytes;
		u32 guest_ds_ar_bytes;
		u32 guest_fs_ar_bytes;
		u32 guest_gs_ar_bytes;
		u32 guest_ldtr_ar_bytes;
		u32 guest_tr_ar_bytes;
		u32 guest_interruptibility_info;
		u32 guest_activity_state;
		u32 guest_sysenter_cs;
		u32 host_ia32_sysenter_cs;
		u32 padding32[8]; /* room for future expansion */
		u16 virtual_processor_id;
		u16 guest_es_selector;
		u16 guest_cs_selector;
		u16 guest_ss_selector;
		u16 guest_ds_selector;
		u16 guest_fs_selector;
		u16 guest_gs_selector;
		u16 guest_ldtr_selector;
		u16 guest_tr_selector;
		u16 host_es_selector;
		u16 host_cs_selector;
		u16 host_ss_selector;
		u16 host_ds_selector;
		u16 host_fs_selector;
		u16 host_gs_selector;
		u16 host_tr_selector;
	};

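Since a changed layout can break live migration, it can help to pin the
expected field offsets at compile time so that an accidental change fails the
build. The sketch below shows the idea on a hypothetical, cut-down copy of the
first few fields. struct mini_vmcs12 and the CHECK_FIELD_OFFSET macro are
assumptions made for this example, and the GCC/Clang packed attribute stands
in for the kernel's __packed.

```c
#include <assert.h>	/* static_assert (C11) */
#include <stddef.h>	/* offsetof */
#include <stdint.h>

/* Hypothetical stand-in for the leading fields of struct vmcs12. */
struct __attribute__((packed)) mini_vmcs12 {
	uint32_t revision_id;
	uint32_t abort;
	uint32_t launch_state;
	uint32_t padding[7];	/* room for future expansion */
	uint64_t io_bitmap_a;
	uint64_t io_bitmap_b;
};

/* Fail the build if a field is not at the offset recorded here. */
#define CHECK_FIELD_OFFSET(field, expected)				\
	static_assert(offsetof(struct mini_vmcs12, field) == (expected),	\
		      "offset of " #field " changed")

CHECK_FIELD_OFFSET(revision_id, 0);
CHECK_FIELD_OFFSET(abort, 4);
CHECK_FIELD_OFFSET(launch_state, 8);
CHECK_FIELD_OFFSET(io_bitmap_a, 40);
CHECK_FIELD_OFFSET(io_bitmap_b, 48);
```

Inserting a new member before io_bitmap_a without shrinking the padding
array would move it away from offset 40 and trip the corresponding assertion,
which is a reminder to also bump the revision.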

Authors
-------

These patches were written by:
    - Abel Gordon, abelg <at> il.ibm.com
    - Nadav Har'El, nyh <at> il.ibm.com
    - Orit Wasserman, oritw <at> il.ibm.com
    - Ben-Ami Yassour, benami <at> il.ibm.com
    - Muli Ben-Yehuda, muli <at> il.ibm.com

With contributions by:
    - Anthony Liguori, aliguori <at> us.ibm.com
    - Mike Day, mdday <at> us.ibm.com
    - Michael Factor, factor <at> il.ibm.com
    - Zvi Dubitzky, dubi <at> il.ibm.com

And valuable reviews by:
    - Avi Kivity, avi <at> redhat.com
    - Gleb Natapov, gleb <at> redhat.com
    - Marcelo Tosatti, mtosatti <at> redhat.com
    - Kevin Tian, kevin.tian <at> intel.com
    - and others.
