.. SPDX-License-Identifier: GPL-2.0

======================================================
Timekeeping Virtualization for X86-Based Architectures
======================================================

:Author: Zachary Amsden <zamsden@redhat.com>
:Copyright: (c) 2010, Red Hat.  All rights reserved.

.. Contents

   1) Overview
   2) Timing Devices
   3) TSC Hardware
   4) Virtualization Problems

1. Overview
===========

One of the most complicated parts of the X86 platform, and specifically of
its virtualization, is the plethora of timing devices available and the
complexity of emulating those devices.  In addition, virtualization of time
introduces a new set of challenges because it imposes a multiplexed division
of time beyond the control of the guest CPU.

First, we will describe the various timekeeping hardware available, then
present some of the problems which arise and solutions available, giving
specific recommendations for certain classes of KVM guests.

The purpose of this document is to collect data and information relevant to
timekeeping which may be difficult to find elsewhere, specifically,
information relevant to KVM and hardware-based virtualization.

2. Timing Devices
=================

First we discuss the basic hardware devices available.  TSC and the related
KVM clock are special enough to warrant a full exposition and are described in
the following section.

2.1. i8254 - PIT
----------------

One of the first timer devices available is the programmable interval timer,
or PIT.  The PIT has a fixed-frequency 1.193182 MHz base clock and three
channels which can be programmed to deliver periodic or one-shot interrupts.
These three channels can be configured in different modes and have individual
counters.  Channels 1 and 2 were not available for general use in the original
IBM PC, and historically were connected to control RAM refresh and the PC
speaker.  Now the PIT is typically integrated as part of an emulated chipset
and a separate physical PIT is not used.
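
As a rough illustration of how the fixed base clock limits the rates a channel
can produce, the following sketch computes the reload value for a desired
interrupt rate.  The function names are hypothetical and chosen only for this
example; the clamp to 65536 reflects the 16-bit counter, where a programmed
value of 0 is treated as 65536.

```python
PIT_BASE_HZ = 1_193_182  # fixed PIT input clock quoted in the text above


def pit_reload_value(target_hz):
    """Return the reload value that best approximates target_hz."""
    divisor = round(PIT_BASE_HZ / target_hz)
    # The counter is 16 bits wide, so the usable range is 1..65536.
    return max(1, min(divisor, 65536))


def pit_actual_hz(reload_value):
    """Return the rate actually produced by a given reload value."""
    return PIT_BASE_HZ / reload_value


reload_100hz = pit_reload_value(100)   # closest divisor for a 100 Hz tick
slowest_hz = pit_actual_hz(65536)      # slowest periodic rate, about 18.2 Hz
```

Because the divisor must be an integer, most requested rates are only
approximated; an emulated PIT has to reproduce the same rounding for the guest
to see the timing it expects.
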

The PIT uses I/O ports 0x40 - 0x43.  Access to the 16-bit counters is done
using single or multiple byte access to the I/O ports.  There are 6 modes
available, but not all modes are available to all timers, as only timer 2
has a connected gate input, required for modes 1 and 5.  The gate line is
controlled by port 61h, bit 0, as illustrated in the following diagram::

  --------------             ----------------
  |            |           |                |
  |  1.1932 MHz|---------->| CLOCK      OUT | ---------> IRQ 0
  |    Clock   |   |       |                |
  --------------   |    +->| GATE  TIMER 0  |
                   |        ----------------
                   |
                   |        ----------------
                   |       |                |
                   |------>| CLOCK      OUT | ---------> 66.3 KHZ DRAM
                   |       |                |            (aka /dev/null)
                   |    +->| GATE  TIMER 1  |
                   |        ----------------
                   |
                   |        ----------------
                   |       |                |
                   |------>| CLOCK      OUT | ---------> Port 61h, bit 5
                           |                |      |
  Port 61h, bit 0 -------->| GATE  TIMER 2  |       \_.----   ____
                            ----------------         _|    )--|LPF|---Speaker
                                                    / *----   \___/
  Port 61h, bit 1 ---------------------------------/

The timer modes are now described.

Mode 0: Single Timeout.
 This is a one-shot software timeout that counts down
 when the gate is high (always true for timers 0 and 1).  When the count
 reaches zero, the output goes high.

Mode 1: Triggered One-shot.
 The output is initially set high.  When the gate
 line is set high, a countdown is initiated (which does not stop if the gate is
 lowered), during which the output is set low.  When the count reaches zero,
 the output goes high.

Mode 2: Rate Generator.
 The output is initially set high.  When the countdown
 reaches 1, the output goes low for one count and then returns high.  The value
 is reloaded and the countdown automatically resumes.  If the gate line goes
 low, the count is halted.  If the output is low when the gate is lowered, the
 output automatically goes high (this only affects timer 2).

Mode 3: Square Wave.
 This generates a high / low square wave.  The count
 determines the length of the pulse, which alternates between high and low
 when zero is reached.  The count only proceeds when gate is high and is
 automatically reloaded on reaching zero.  The count is decremented twice at
 each clock to generate a full high / low cycle at the full periodic rate.
 If the count is even, the output remains high for N/2 counts and low for N/2
 counts; if the count is odd, the output is high for (N+1)/2 counts and low
 for (N-1)/2 counts.  Only even values are latched by the counter, so odd
 values are not observed when reading.  This is the intended mode for timer 2,
 which generates sine-like tones by low-pass filtering the square wave output.

Mode 4: Software Strobe.
 After programming this mode and loading the counter,
 the output remains high until the counter reaches zero.  Then the output
 goes low for 1 clock cycle and returns high.  The counter is not reloaded.
 Counting only occurs when gate is high.

Mode 5: Hardware Strobe.
 After programming and loading the counter, the
 output remains high.  When the gate is raised, a countdown is initiated
 (which does not stop if the gate is lowered).  When the counter reaches zero,
 the output goes low for 1 clock cycle and then returns high.  The counter is
 not reloaded.
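
The even / odd split described for mode 3 can be checked with a small sketch.
The helper names below are hypothetical, and the tone calculation simply
assumes the 1.193182 MHz base clock given earlier in this section.

```python
def square_wave_phases(count):
    """Return (high_clocks, low_clocks) for a mode 3 reload value."""
    if count % 2 == 0:
        return count // 2, count // 2
    # An odd count spends one extra input clock in the high phase.
    return (count + 1) // 2, (count - 1) // 2


def square_wave_hz(count, base_hz=1_193_182):
    """Frequency of the full high / low cycle for a reload value."""
    return base_hz / count


even_phases = square_wave_phases(10)   # symmetric wave
odd_phases = square_wave_phases(11)    # high phase one clock longer
speaker_hz = square_wave_hz(2712)      # close to a 440 Hz tone on timer 2
```

In both cases the full cycle takes exactly N input clocks, so the output
frequency depends only on the reload value and not on its parity.
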

In addition to normal binary counting, the PIT supports BCD counting.  The
command port, 0x43, is used to set the counter and mode for each of the three
timers.

PIT commands are issued to port 0x43 using the following bit encoding::

  Bit 7-4: Command (See table below)
  Bit 3-1: Mode (000 = Mode 0, 101 = Mode 5, 11X = undefined)
  Bit 0  : Binary (0) / BCD (1)

Command table::

  0000 - Latch Timer 0 count for port 0x40
        sample and hold the count to be read in port 0x40;
        additional commands ignored until counter is read;
        mode bits ignored.

  0001 - Set Timer 0 LSB mode for port 0x40
        set timer to read LSB only and force MSB to zero;
        mode bits set timer mode

  0010 - Set Timer 0 MSB mode for port 0x40
        set timer to read MSB only and force LSB to zero;
        mode bits set timer mode

  0011 - Set Timer 0 16-bit mode for port 0x40
        set timer to read / write LSB first, then MSB;
        mode bits set timer mode

  0100 - Latch Timer 1 count for port 0x41 - as described above
  0101 - Set Timer 1 LSB mode for port 0x41 - as described above
  0110 - Set Timer 1 MSB mode for port 0x41 - as described above
  0111 - Set Timer 1 16-bit mode for port 0x41 - as described above

  1000 - Latch Timer 2 count for port 0x42 - as described above
  1001 - Set Timer 2 LSB mode for port 0x42 - as described above
  1010 - Set Timer 2 MSB mode for port 0x42 - as described above
  1011 - Set Timer 2 16-bit mode for port 0x42 - as described above

  1101 - General counter latch
        Latch combination of counters into corresponding ports
        Bit 3 = Counter 2
        Bit 2 = Counter 1
        Bit 1 = Counter 0
        Bit 0 = Unused

  1110 - Latch timer status
        Latch combination of counter mode into corresponding ports
        Bit 3 = Counter 2
        Bit 2 = Counter 1
        Bit 1 = Counter 0

        The output of ports 0x40-0x42 following this command will be:

        Bit 7 = Output pin
        Bit 6 = Count loaded (0 if timer has expired)
        Bit 5-4 = Read / Write mode
            01 = LSB only
            10 = MSB only
            11 = LSB / MSB (16-bit)
        Bit 3-1 = Mode
        Bit 0 = Binary (0) / BCD mode (1)
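
To make the bit layout above concrete, the following sketch assembles a
command byte and decodes a latched status byte.  The constant and function
names are hypothetical; the field positions follow the tables above, with the
4-bit command split into a timer select (bits 7-6) and an access mode
(bits 5-4).

```python
ACCESS_LATCH, ACCESS_LSB, ACCESS_MSB, ACCESS_16BIT = 0, 1, 2, 3


def pit_command(timer, access, mode, bcd=False):
    """Build the byte written to port 0x43 for timers 0-2."""
    if timer not in (0, 1, 2):
        raise ValueError("timer must be 0, 1 or 2")
    if access not in (0, 1, 2, 3):
        raise ValueError("access must be 0..3")
    if mode not in range(6):
        raise ValueError("mode must be 0..5")
    return (timer << 6) | (access << 4) | (mode << 1) | int(bcd)


def decode_status(status):
    """Split a latched status byte into the fields listed above."""
    return {
        "output_high": bool(status & 0x80),
        "access": (status >> 4) & 0x3,
        "mode": (status >> 1) & 0x7,
        "bcd": bool(status & 0x01),
    }


tick_cmd = pit_command(0, ACCESS_16BIT, 2)     # timer 0, rate generator
speaker_cmd = pit_command(2, ACCESS_16BIT, 3)  # timer 2, square wave
status = decode_status(0xB6)
```

A device model would typically perform the inverse of `pit_command` on every
guest write to port 0x43; the sketch only shows the encoding direction.
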

2.2. RTC
--------

The second device which was available in the original PC was the MC146818 real
time clock.  The original device is now obsolete, and is usually emulated by
the system chipset, sometimes by an HPET and some frankenstein IRQ routing.

The RTC is accessed through CMOS variables, which use an index register to
control which bytes are read.  Since there is only one index register, reads
of the CMOS and reads of the RTC require lock protection (in addition, it is
dangerous to allow userspace utilities such as hwclock to have direct RTC
access, as they could corrupt kernel reads and writes of CMOS memory).

The RTC generates an interrupt which is usually routed to IRQ 8.  The interrupt
can function as a periodic timer or as a once-a-day alarm, and can also be
issued after an update of the CMOS registers by the MC146818 is complete.
The type of interrupt is signalled in the RTC status registers.

The RTC will update the current time fields on battery power even while the
system is off.  The current time fields should not be read while an update is
in progress, as indicated in the status register.

The clock uses a 32.768kHz crystal, so bits 6-4 of register A should be
programmed to a 32kHz divider if the RTC is to count seconds.

This is the RAM map originally used for the RTC/CMOS::

  Location    Size    Description
  ------------------------------------------
  00h         byte    Current second (BCD)
  01h         byte    Seconds alarm (BCD)
  02h         byte    Current minute (BCD)
  03h         byte    Minutes alarm (BCD)
  04h         byte    Current hour (BCD)
  05h         byte    Hours alarm (BCD)
  06h         byte    Current day of week (BCD)
  07h         byte    Current day of month (BCD)
  08h         byte    Current month (BCD)
  09h         byte    Current year (BCD)
  0Ah         byte    Register A
                       bit 7   = Update in progress
                       bit 6-4 = Divider for clock
                                  000 = 4.194 MHz
                                  001 = 1.049 MHz
                                  010 = 32 kHz
                                  10X = test modes
                                  110 = reset / disable
                                  111 = reset / disable
                       bit 3-0 = Rate selection for periodic interrupt
                                 0000 = periodic timer disabled
                                 0001 = 3.90625 mS
                                 0010 = 7.8125 mS
                                 0011 = .122070 mS
                                 0100 = .244141 mS
                                     ...
                                 1101 = 125 mS
                                 1110 = 250 mS
                                 1111 = 500 mS
  0Bh         byte    Register B
                       bit 7   = Run (0) / Halt (1)
                       bit 6   = Periodic interrupt enable
                       bit 5   = Alarm interrupt enable
                       bit 4   = Update-ended interrupt enable
                       bit 3   = Square wave output enable
                       bit 2   = BCD calendar (0) / Binary (1)
                       bit 1   = 12-hour mode (0) / 24-hour mode (1)
                       bit 0   = 0 (DST off) / 1 (DST enabled)
  0Ch         byte    Register C (read only)
                       bit 7   = interrupt request flag (IRQF)
                       bit 6   = periodic interrupt flag (PF)
                       bit 5   = alarm interrupt flag (AF)
                       bit 4   = update interrupt flag (UF)
                       bit 3-0 = reserved
  0Dh         byte    Register D (read only)
                       bit 7   = RTC has power
                       bit 6-0 = reserved
  32h         byte    Current century BCD (*)
  (*) location vendor specific and now determined from ACPI global tables
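
The following sketch shows how the BCD time fields and Register A from the map
above might be interpreted.  The function names are hypothetical, and the rate
calculation assumes the 32.768 kHz time base described earlier; selections 1
and 2 repeat the 256 Hz and 128 Hz rates, which is why they are remapped
before the shift.

```python
def bcd_to_int(value):
    """Decode one BCD byte, e.g. 0x59 -> 59."""
    return (value >> 4) * 10 + (value & 0x0F)


def update_in_progress(register_a):
    """Bit 7 of Register A: time fields must not be read while set."""
    return bool(register_a & 0x80)


def periodic_rate_hz(register_a):
    """Periodic interrupt rate selected by bits 3-0 of Register A."""
    rate_select = register_a & 0x0F
    if rate_select == 0:
        return 0.0  # periodic timer disabled
    if rate_select <= 2:
        # Selections 1 and 2 give 256 Hz and 128 Hz.
        rate_select += 7
    return 32768 / (1 << (rate_select - 1))


seconds = bcd_to_int(0x59)
default_rate = periodic_rate_hz(0x26)   # 32 kHz divider, rate selection 0110
slowest_rate = periodic_rate_hz(0x2F)   # rate selection 1111, 500 mS period
```

The reciprocal of the returned rate reproduces the periods listed in the
table, for example 1 / 8192 Hz for selection 0011.
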

2.3. APIC
---------

On Pentium and later processors, an on-board timer is available to each CPU
as part of the Advanced Programmable Interrupt Controller.  The APIC is
accessed through memory-mapped registers and provides interrupt service to each
CPU, used for IPIs and local timer interrupts.

Although in theory the APIC is a safe and stable source for local interrupts,
in practice, many bugs and glitches have occurred due to the special nature of
the APIC CPU-local memory-mapped hardware.  Beware that CPU errata may affect
the use of the APIC and that workarounds may be required.  In addition, some of
these workarounds pose unique constraints for virtualization - requiring either
extra overhead incurred from extra reads of memory-mapped I/O or additional
functionality that may be more computationally expensive to implement.

Since the APIC is documented quite well in the Intel and AMD manuals, we will
avoid repetition of the detail here.  It should be pointed out that the APIC
timer is programmed through the timer entry of the LVT (local vector table),
is capable of one-shot or periodic operation, and is based on the bus clock
divided down by the programmable divider register.
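
As a small worked example of the relationship between bus clock, divider and
count, the sketch below computes the period of an APIC timer.  The function
name and the 100 MHz bus clock are assumptions made for illustration; the real
bus frequency is platform specific and normally has to be calibrated.

```python
VALID_DIVIDERS = (1, 2, 4, 8, 16, 32, 64, 128)


def apic_timer_period_ns(bus_hz, divider, initial_count):
    """Time for the timer to count initial_count down to zero."""
    if divider not in VALID_DIVIDERS:
        raise ValueError("divider must be a power of two from 1 to 128")
    return initial_count * divider * 1e9 / bus_hz


# Assumed 100 MHz bus clock, divide by 16, count of 62500.
period_ns = apic_timer_period_ns(100_000_000, 16, 62_500)
```

With these assumed values the timer fires every 10 ms; in periodic mode the
count is reloaded and the same interval repeats.
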

2.4. HPET
---------

HPET is quite complex, and was originally intended to replace the PIT / RTC
support of the X86 PC.  It remains to be seen whether that will be the case, as
the de facto standard of PC hardware is to emulate these older devices.  Some
systems designated as legacy free may support only the HPET as a hardware timer
device.

The HPET spec is rather loose and vague, requiring at least 3 hardware timers,
but allowing implementation freedom to support many more.  It also imposes no
fixed rate on the timer frequency, but does impose some extremal values on
frequency, error and slew.

In general, the HPET is recommended as a high precision (compared to PIT / RTC)
time source which is independent of local variation (as there is only one HPET
in any given system).  The HPET is also memory-mapped, and its presence is
indicated through ACPI tables by the BIOS.

Detailed specification of the HPET is beyond the current scope of this
document, as it is also very well documented elsewhere.

2.5. Offboard Timers
--------------------

Several cards, both proprietary (watchdog boards) and commonplace (e1000), have
timing chips built into the cards which may have registers which are accessible
to kernel or user drivers.  To the author's knowledge, using these to generate
a clocksource for a Linux or other kernel has not yet been attempted and is in
general frowned upon as not playing by the agreed rules of the game.  Such a
timer device would require additional support to be virtualized properly and is
not considered important at this time as no known operating system does this.

3. TSC Hardware
===============

The TSC or time stamp counter is relatively simple in theory; it counts
instruction cycles issued by the processor, which can be used as a measure of
time.  In practice, due to a number of problems, it is the most complicated
timekeeping device to use.

The TSC is represented internally as a 64-bit MSR which can be read with the
RDMSR, RDTSC, or RDTSCP (when available) instructions.  The TSC can also be
written, but in the past hardware limitations generally made it possible to
write only the low 32 bits of the 64-bit counter, and the upper 32 bits of the
counter were cleared.  Now, however, on Intel processors family 0Fh, for
models 3, 4 and 6, and family 06h, models e and f, this restriction has been
lifted and all 64 bits are writable.  On AMD systems, the ability to write the
TSC MSR is not an architectural guarantee.

The TSC is accessible from CPL 0 and, conditionally, from CPL > 0 software,
subject to the CR4.TSD bit, which, when set, disables CPL > 0 TSC access.

Some vendors have implemented an additional instruction, RDTSCP, which returns
atomically not just the TSC, but an indicator which corresponds to the
processor number.  This can be used to index into an array of TSC variables to
determine offset information in SMP systems where TSCs are not synchronized.
The presence of this instruction must be determined by consulting CPUID feature
bits.
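
The indexing scheme described for RDTSCP can be sketched as follows.  This is
only a model: `read_tscp` is a hypothetical callable standing in for the
instruction, and the per-CPU offsets are made-up values.

```python
MASK64 = (1 << 64) - 1


def corrected_tsc(raw_tsc, cpu, offsets):
    """Apply the per-CPU correction selected by the processor indicator."""
    return (raw_tsc + offsets[cpu]) & MASK64


def read_corrected(read_tscp, offsets):
    # Both values come from one atomic read, so the offset always matches
    # the CPU on which the TSC value was sampled.
    raw_tsc, cpu = read_tscp()
    return corrected_tsc(raw_tsc, cpu, offsets)


# Illustrative values only: CPU 1 is assumed to lag CPU 0 by 500 cycles.
offsets = [0, 500]
on_cpu0 = read_corrected(lambda: (10_000, 0), offsets)
on_cpu1 = read_corrected(lambda: (9_500, 1), offsets)
```

Reading the TSC and the processor number separately would leave a window in
which the task could migrate, so the wrong offset might be applied; the atomic
pair avoids that.
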

Both VMX and SVM provide extension fields in the virtualization hardware which
allow the guest-visible TSC to be offset by a constant.  Newer implementations
promise to allow the TSC to additionally be scaled, but this hardware is not
yet widely available.
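
The effect of offsetting and scaling can be modelled with plain 64-bit
arithmetic.  In this sketch the function name, the fixed-point `ratio` and the
choice of 32 fractional bits are assumptions for illustration; the actual
ratio format differs between hardware implementations.

```python
MASK64 = (1 << 64) - 1


def guest_tsc(host_tsc, offset, ratio=None, frac_bits=32):
    """Model the guest-visible TSC for a given host TSC value."""
    if ratio is None:
        scaled = host_tsc  # offset-only hardware
    else:
        # ratio is a fixed-point multiplier with frac_bits fractional bits.
        scaled = (host_tsc * ratio) >> frac_bits
    # The offset is added modulo 2**64, so a "negative" offset is simply
    # stored as its two's complement.
    return (scaled + offset) & MASK64


behind_host = guest_tsc(1000, -200 & MASK64)      # guest sees 800
scaled_only = guest_tsc(1000, 0, ratio=3 << 31)   # ratio of 1.5
```

Scaling is applied before the offset here, so a guest running at a different
nominal frequency can still be given an arbitrary starting value.
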

3.1. TSC synchronization
------------------------

The TSC is a CPU-local clock in most implementations.  This means, on SMP
platforms, the TSCs of different CPUs may start at different times depending
on when the CPUs are powered on.  Generally, CPUs on the same die will share
the same clock; however, this is not always the case.

The BIOS may attempt to resynchronize the TSCs during the poweron process and
the operating system or other system software may attempt to do this as well.
Several hardware limitations make the problem worse - if it is not possible to
write the full 64 bits of the TSC, it may be impossible to match the TSC in
newly arriving CPUs to that of the rest of the system, resulting in
unsynchronized TSCs.  Resynchronization may be attempted by BIOS or system
software, but in practice, getting a perfectly synchronized TSC will not be
possible unless all values are read from the same clock, which generally is
only possible on single socket systems or those with special hardware support.

3.2. TSC and CPU hotplug
------------------------

As touched on already, CPUs which arrive later than the boot time of the system
may not have a TSC value that is synchronized with the rest of the system.
Either system software, BIOS, or SMM code may actually try to establish the TSC
to a value matching the rest of the system, but a perfect match is usually not
guaranteed.  This can have the effect of bringing a system from a state where
TSC is synchronized back to a state where TSC synchronization flaws, however
small, may be exposed to the OS and any virtualization environment.

3.3. TSC and multi-socket / NUMA
--------------------------------

Multi-socket systems, especially large multi-socket systems, are likely to have
individual clocksources rather than a single, universally distributed clock.
Since these clocks are driven by different crystals, they will not have
perfectly matched frequency, and temperature and electrical variations will
cause the CPU clocks, and thus the TSCs, to drift over time.  Depending on the
exact clock and bus design, the drift may or may not be fixed in absolute
error, and may accumulate over time.
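
To give a feel for the magnitudes involved, the sketch below estimates how far
two TSCs diverge when their crystals differ by a fixed amount.  The 50 ppm
mismatch and the 2 GHz TSC rate are assumed figures for illustration, not
measurements, and the function names are hypothetical.

```python
def drift_seconds(ppm_difference, elapsed_seconds):
    """Accumulated time difference between two clocks after elapsed_seconds."""
    return ppm_difference * 1e-6 * elapsed_seconds


def drift_cycles(tsc_hz, ppm_difference, elapsed_seconds):
    """The same divergence expressed in TSC cycles."""
    return round(tsc_hz * drift_seconds(ppm_difference, elapsed_seconds))


# Assumed values: 50 ppm mismatch observed over one hour at 2 GHz.
hourly_drift_s = drift_seconds(50, 3600)
hourly_drift_cycles = drift_cycles(2_000_000_000, 50, 3600)
```

Even a mismatch this small amounts to hundreds of millions of cycles per hour,
far more than the time needed to migrate a task between sockets.
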

In addition, very large systems may deliberately slew the clocks of individual
cores.  This technique, known as spread-spectrum clocking, reduces EMI at the
clock frequency and harmonics of it, which may be required to pass FCC
standards for telecommunications and computer equipment.

It is recommended not to trust the TSCs to remain synchronized on NUMA or
multiple socket systems for these reasons.

3.4. TSC and C-states
---------------------

C-states, or idling states of the processor, especially C1E and deeper sleep
states, may be problematic for TSC as well.  The TSC may stop advancing in such
a state, resulting in a TSC which is behind that of other CPUs when execution
is resumed.  Such CPUs must be detected and flagged by the operating system
based on CPU and chipset identifications.

The TSC in such a case may be corrected by catching it up to a known external
clocksource.
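
One way to picture the catch-up is to compute where the TSC should be from the
time elapsed on the external clocksource.  The function names below are
hypothetical, and the sketch assumes the TSC frequency and the reference clock
frequency are both already known.

```python
def expected_tsc(tsc_before, ref_before, ref_after, tsc_hz, ref_hz):
    """TSC value implied by the reference clock after a deep C-state."""
    elapsed_ref_ticks = ref_after - ref_before
    return tsc_before + elapsed_ref_ticks * tsc_hz // ref_hz


def catch_up_cycles(stalled_tsc, tsc_before, ref_before, ref_after,
                    tsc_hz, ref_hz):
    """Cycles to add to a stalled TSC; never moves the TSC backwards."""
    target = expected_tsc(tsc_before, ref_before, ref_after, tsc_hz, ref_hz)
    return max(0, target - stalled_tsc)


# Assumed values: 2 GHz TSC, nanosecond reference clock, 5 ms spent idle,
# during which the TSC advanced by only 200000 cycles.
missing = catch_up_cycles(1_200_000, 1_000_000, 0, 5_000_000,
                          2_000_000_000, 1_000_000_000)
```

Clamping the correction at zero keeps the adjusted TSC monotonic even if the
reference clock is slightly behind.
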

3.5. TSC frequency change / P-states
------------------------------------

To make things slightly more interesting, some CPUs may change frequency.  They
may or may not run the TSC at the same rate, and because the frequency change
may be staggered or slewed, at some points in time, the TSC rate may not be
known other than falling within a range of values.  In this case, the TSC will
not be a stable time source, and must be calibrated against a known, stable,
external clock to be a usable source of time.
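
Calibration against an external clock amounts to counting TSC cycles over a
known number of reference ticks.  The sketch below shows only the arithmetic;
the function name is hypothetical and the sample values are invented.

```python
def estimate_tsc_hz(tsc_delta, ref_delta_ticks, ref_hz):
    """Estimate the TSC rate from cycles counted over a reference interval."""
    if ref_delta_ticks <= 0:
        raise ValueError("reference interval must be positive")
    return tsc_delta * ref_hz // ref_delta_ticks


# Assumed measurement: 2,000,000 TSC cycles during 1000 ticks of a 1 MHz
# reference clock, i.e. over one millisecond.
tsc_hz = estimate_tsc_hz(2_000_000, 1_000, 1_000_000)

# The same arithmetic using the PIT as the reference, over 11932 PIT ticks.
tsc_hz_from_pit = estimate_tsc_hz(30_000_000, 11_932, 1_193_182)
```

The result is only valid for the interval that was measured; if the frequency
changes afterwards, the calibration has to be repeated.
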

Whether the TSC runs at a constant rate or scales with the P-state is model
dependent and must be determined by inspecting CPUID, chipset or vendor
specific MSR fields.

In addition, some vendors have known bugs where the P-state is actually
compensated for properly during normal operation, but when the processor is
inactive, the P-state may be raised temporarily to service cache misses from
other processors.  In such cases, the TSC on halted CPUs could advance faster
than that of non-halted processors.  AMD Turion processors are known to have
this problem.

3.6. TSC and STPCLK / T-states
------------------------------

External signals given to the processor may also have the effect of stopping
the TSC.  This is typically done for thermal emergency power control to prevent
an overheating condition, and typically, there is no way to detect that this
condition has happened.

3.7. TSC virtualization - VMX
-----------------------------

VMX provides conditional trapping of RDTSC, RDMSR, WRMSR and RDTSCP
instructions, which is enough for full virtualization of TSC in any manner.  In
addition, VMX allows passing through the host TSC plus an additional TSC_OFFSET
field specified in the VMCS.  Special instructions must be used to read and
write the VMCS field.
453*daec8d40SPaolo Bonzini
454*daec8d40SPaolo Bonzini3.8. TSC virtualization - SVM
455*daec8d40SPaolo Bonzini-----------------------------
456*daec8d40SPaolo Bonzini
457*daec8d40SPaolo BonziniSVM provides conditional trapping of RDTSC, RDMSR, WRMSR and RDTSCP
458*daec8d40SPaolo Bonziniinstructions, which is enough for full virtualization of TSC in any manner.  In
459*daec8d40SPaolo Bonziniaddition, SVM allows passing through the host TSC plus an additional offset
460*daec8d40SPaolo Bonzinifield specified in the SVM control block.
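
For both VMX and SVM, the passthrough mode amounts to the guest observing the
host TSC plus the programmed offset.  The arithmetic can be sketched as below;
the function names are hypothetical, and the offset is modelled as an unsigned
64-bit value so that a negative offset is expressed by wraparound:

.. code-block:: c

   #include <stdint.h>

   /* Value the guest reads: host TSC plus offset, modulo 2^64. */
   static uint64_t guest_tsc(uint64_t host_tsc, uint64_t tsc_offset)
   {
       return host_tsc + tsc_offset;
   }

   /*
    * Offset that makes the guest read 'wanted_guest_tsc' at the instant the
    * host TSC reads 'host_tsc'.  The result wraps when the guest value is
    * smaller than the host value, which stands in for a negative offset.
    */
   static uint64_t tsc_offset_for(uint64_t host_tsc, uint64_t wanted_guest_tsc)
   {
       return wanted_guest_tsc - host_tsc;
   }

For example, an offset computed so that a newly created guest starts at zero
keeps the guest TSC advancing at the host rate from that point on.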

3.9. TSC feature bits in Linux
------------------------------

In summary, there is no way to guarantee the TSC remains in perfect
synchronization unless it is explicitly guaranteed by the architecture.  Even
if so, the TSCs in multi-socket or NUMA systems may still run independently
despite being locally consistent.

The following feature bits are used by Linux to signal various TSC attributes,
but they can only be taken to be meaningful for UP or single-node systems.

=========================       =======================================
X86_FEATURE_TSC                 The TSC is available in hardware
X86_FEATURE_RDTSCP              The RDTSCP instruction is available
X86_FEATURE_CONSTANT_TSC        The TSC rate is unchanged with P-states
X86_FEATURE_NONSTOP_TSC         The TSC does not stop in C-states
X86_FEATURE_TSC_RELIABLE        TSC sync checks are skipped (VMware)
=========================       =======================================
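
From userspace, these bits appear in the ``flags`` line of ``/proc/cpuinfo``
under lower-case names without the ``X86_FEATURE_`` prefix (``tsc``,
``rdtscp``, ``constant_tsc``, ``nonstop_tsc``, ``tsc_reliable``).  A minimal
sketch of a whole-word flag test follows; the helper name is hypothetical and
reading the file itself is left to the caller:

.. code-block:: c

   #include <string.h>

   /*
    * Return 1 if 'flag' appears as a whole whitespace-separated word in
    * 'flags', so that "tsc" does not match inside "constant_tsc".
    */
   static int has_cpu_flag(const char *flags, const char *flag)
   {
       size_t len = strlen(flag);
       const char *p = flags;

       if (len == 0)
           return 0;
       while ((p = strstr(p, flag)) != NULL) {
           int starts = (p == flags) || p[-1] == ' ' || p[-1] == '\t';
           int ends = p[len] == '\0' || p[len] == ' ' || p[len] == '\n';

           if (starts && ends)
               return 1;
           p++;    /* keep searching after a partial match */
       }
       return 0;
   }

As noted above, a set flag describes the local CPU only and says nothing about
synchronization across sockets.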

4. Virtualization Problems
==========================

Timekeeping is especially problematic for virtualization because a number of
challenges arise.  The most obvious problem is that time is now shared between
the host and, potentially, a number of virtual machines.  Thus the virtual
operating system does not run with 100% usage of the CPU, despite the fact that
it may very well make that assumption.  It may expect it to remain true to very
exacting bounds when interrupt sources are disabled, but in reality only its
virtual interrupt sources are disabled, and the machine may still be preempted
at any time.  This causes problems because the guest's view of the passage of
time, the injection of machine interrupts and the associated clock sources are
no longer completely synchronized with real time.

This same problem can occur on native hardware to a degree, as SMM may steal
cycles from the operating system on X86 systems when it is used by the BIOS,
but not in such an extreme fashion.  However, the fact that SMM may cause
similar problems to virtualization makes it a good justification for solving
many of these problems on bare metal.

4.1. Interrupt clocking
-----------------------

One of the most immediate problems that occurs with legacy operating systems
is that the system timekeeping routines are often designed to keep track of
time by counting periodic interrupts.  These interrupts may come from the PIT
or the RTC, but the problem is the same: the host virtualization engine may not
be able to deliver the proper number of interrupts per second, and so guest
time may fall behind.  This is especially problematic if a high interrupt rate
is selected, such as 1000 Hz, which is unfortunately the default for many Linux
guests.

There are three approaches to solving this problem.  First, it may be possible
to simply ignore it.  Guests which have a separate time source for tracking
'wall clock' or 'real time' may not need any adjustment of their interrupts to
maintain proper time.  Second, if this is not sufficient, it may be necessary
to inject additional interrupts into the guest in order to increase the
effective interrupt rate.  This approach leads to complications in extreme
conditions, where host load or guest lag is too much to compensate for, and
thus a third solution to the problem has arisen: the guest may need to become
aware of lost ticks and compensate for them internally.  Although promising in
theory, the implementation of this policy in Linux has been extremely
error-prone, and a number of buggy variants of lost tick compensation are
distributed across commonly used Linux systems.
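
Both interrupt reinjection and guest-side compensation need an estimate of how
many ticks are missing.  The sketch below shows the basic arithmetic only,
assuming some independent reference clock supplies the elapsed time; the
function name is hypothetical and it is not the algorithm used by any
particular kernel version:

.. code-block:: c

   #include <stdint.h>

   /*
    * Number of periodic ticks that should have arrived in 'elapsed_ns' but
    * did not.  'delivered' is the count of ticks actually seen.
    */
   static uint64_t lost_ticks(uint64_t elapsed_ns, uint64_t period_ns,
                              uint64_t delivered)
   {
       uint64_t expected;

       if (period_ns == 0)
           return 0;                    /* no periodic timer configured */
       expected = elapsed_ns / period_ns;
       return expected > delivered ? expected - delivered : 0;
   }

With a 1000 Hz tick (a period of 1,000,000 ns), one second of elapsed time and
940 delivered interrupts, the function reports 60 lost ticks.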

Windows uses periodic RTC clocking as a means of keeping time internally, and
thus requires interrupt slewing to keep proper time.  It does use a low enough
rate (ed: is it 18.2 Hz?), however, that it has not yet been a problem in
practice.

4.2. TSC sampling and serialization
-----------------------------------

As the highest-precision time source available, the cycle counter of the CPU
has aroused much interest from developers.  As explained above, this timer has
many problems unique to its nature as a local, potentially unstable and
potentially unsynchronized source.  One issue which is not unique to the TSC,
but is highlighted because of its very precise nature, is sampling delay.  By
definition, the counter, once read, is already old.  However, it is also
possible for the counter to be read ahead of the actual use of the result.
This is a consequence of the superscalar execution of the instruction stream,
which may execute instructions out of order.  Such execution is called
non-serialized.  Forcing serialized execution is necessary for precise
measurement with the TSC, and requires a serializing instruction, such as CPUID
or an MSR read.
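
A minimal sketch of a serialized read, using CPUID as the serializing
instruction, is shown below.  It assumes GCC or Clang inline assembly on x86,
and the function name is hypothetical:

.. code-block:: c

   #include <stdint.h>

   /* Read the TSC after forcing earlier instructions to complete. */
   static inline uint64_t tsc_read_serialized(void)
   {
       uint32_t eax = 0, ebx, ecx = 0, edx;
       uint32_t lo, hi;

       /* CPUID leaf 0 is used purely for its serializing side effect. */
       __asm__ __volatile__("cpuid"
                            : "+a"(eax), "=b"(ebx), "+c"(ecx), "=d"(edx)
                            :
                            : "memory");
       (void)eax; (void)ebx; (void)ecx; (void)edx;

       /* RDTSC returns the counter in EDX:EAX. */
       __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi) : : "memory");
       return ((uint64_t)hi << 32) | lo;
   }

Both statements are marked volatile so the compiler keeps them in program
order.  The cost of the CPUID itself becomes part of the measured interval.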

Since CPUID may actually be virtualized by a trap-and-emulate mechanism, this
serialization can pose a performance issue for hardware virtualization.  An
accurate time stamp counter reading may therefore not always be available, and
it may be necessary for an implementation to guard against "backwards" reads of
the TSC as seen from other CPUs, even in an otherwise perfectly synchronized
system.
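
One simple form of such a guard is to never return a value lower than the
highest one already handed out.  The following single-threaded sketch shows
the idea; the type and function names are hypothetical, and a version shared
between CPUs would need an atomic compare-and-exchange on ``last``:

.. code-block:: c

   #include <stdint.h>

   /* Remembers the largest TSC value returned so far. */
   struct tsc_guard {
       uint64_t last;
   };

   /* Clamp a raw TSC sample so the returned sequence never decreases. */
   static uint64_t tsc_guard_read(struct tsc_guard *g, uint64_t raw)
   {
       if (raw > g->last)
           g->last = raw;
       return g->last;
   }

The trade-off is that a sample which arrives "early" is reported as the
previous value, so consecutive reads may return identical timestamps.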

4.3. Timespec aliasing
----------------------

Additionally, this lack of serialization from the TSC poses another challenge
when TSC results are measured against another time source.  As the TSC is much
higher precision, many possible values of the TSC may be read while another
clock is still expressing the same value.

That is, you may read (T, T+10) while external clock C maintains the same
value.  Due to non-serialized reads, you may actually end up with a range which
fluctuates, from (T-1 .. T+10).  Thus, any time calculated from a TSC, but
calibrated against an external value, may have a range of valid values.
Re-calibrating this computation may actually cause time, as computed after the
calibration, to go backwards, compared with time computed before the
calibration.
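
The backwards step can be reproduced with a toy model.  The sketch below
assumes one TSC cycle per nanosecond and an external clock that only advances
in steps of ``granularity_ns`` (which must be non-zero); all names are
hypothetical and the numbers are chosen purely for illustration:

.. code-block:: c

   #include <stdint.h>

   /* A calibration point: TSC value paired with the external clock reading. */
   struct calib {
       uint64_t base_tsc;
       uint64_t base_ns;
   };

   /* External clock: true time rounded down to its granularity. */
   static uint64_t coarse_clock_ns(uint64_t true_ns, uint64_t granularity_ns)
   {
       return true_ns - true_ns % granularity_ns;
   }

   /* Take a calibration at 'tsc'; in this model true time equals the TSC. */
   static struct calib calibrate(uint64_t tsc, uint64_t granularity_ns)
   {
       struct calib c = { tsc, coarse_clock_ns(tsc, granularity_ns) };
       return c;
   }

   /* Time derived from the TSC relative to a calibration point. */
   static uint64_t calibrated_ns(const struct calib *c, uint64_t tsc)
   {
       return c->base_ns + (tsc - c->base_tsc);
   }

With a granularity of 10 ns, calibrating at TSC 1000 and reading at TSC 1009
yields 1009 ns.  Recalibrating at TSC 1009, while the external clock still
reads 1000, and then reading at TSC 1010 yields 1001 ns, which is earlier than
the previous result.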

This problem is particularly pronounced with an internal time source in Linux,
the kernel time, which is expressed in the theoretically high resolution
timespec - but which advances in much larger granularity intervals, sometimes
at the rate of jiffies, and possibly in catchup modes, at a much larger step.

This aliasing requires care in the computation and recalibration of kvmclock
and any other values derived from TSC computation (such as TSC virtualization
itself).

4.4. Migration
--------------

Migration of a virtual machine raises problems for timekeeping in two ways.
First, the migration itself may take time, during which interrupts cannot be
delivered, and after which, the guest time may need to be caught up.  NTP may
be able to help to some degree here, as the clock correction required is
typically small enough to fall in the NTP-correctable window.

An additional concern is that timers based on the TSC (or HPET, if the raw bus
clock is exposed) may now be running at different rates, requiring compensation
in some way in the hypervisor by virtualizing these timers.  In addition,
migrating to a faster machine may preclude the use of a passthrough TSC, as a
faster clock cannot be made visible to a guest without the potential of time
advancing faster than usual.  A slower clock is less of a problem, as it can
always be caught up to the original rate.  KVM clock avoids these problems by
simply storing multipliers and offsets against the TSC for the guest to convert
back into nanosecond resolution values.
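
The conversion a guest performs with such multipliers and offsets can be
sketched as follows.  The structure is loosely modelled on the fields the
Linux pvclock interface exposes, but the names here are illustrative rather
than the real ABI, and ``unsigned __int128`` is a GCC/Clang extension assumed
to be available on the target:

.. code-block:: c

   #include <stdint.h>

   /* Hypothetical per-guest clock parameters published by the hypervisor. */
   struct clock_params {
       uint64_t tsc_timestamp;   /* TSC value when the parameters were set */
       uint64_t system_time_ns;  /* guest time in ns at tsc_timestamp */
       uint32_t tsc_to_ns_mul;   /* 32.32 fixed-point ns-per-cycle factor */
       int8_t   tsc_shift;       /* pre-scaling shift applied to the delta */
   };

   /* Convert a raw TSC sample into nanoseconds of guest time. */
   static uint64_t tsc_to_ns(const struct clock_params *p, uint64_t tsc)
   {
       uint64_t delta = tsc - p->tsc_timestamp;

       if (p->tsc_shift >= 0)
           delta <<= p->tsc_shift;
       else
           delta >>= -p->tsc_shift;

       /* A 64x32-bit product needs up to 96 bits before the final shift. */
       return p->system_time_ns +
              (uint64_t)(((unsigned __int128)delta * p->tsc_to_ns_mul) >> 32);
   }

After migration to a host with a different TSC frequency, only the published
multiplier, shift and base values need to change; the guest-side formula stays
the same.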

4.5. Scheduling
---------------

Since scheduling may be based on precise timing and firing of interrupts, the
scheduling algorithms of an operating system may be adversely affected by
virtualization.  In theory, the effect is random and should be uniformly
distributed, but in contrived as well as real scenarios (guest device access,
causes of virtualization exits, possible context switch), this may not always
be the case.  The effect of this has not been well studied.

In an attempt to work around this, several implementations have provided a
paravirtualized scheduler clock, which reveals the true amount of CPU time for
which a virtual machine has been running.
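
Conceptually, such a clock subtracts the time during which the virtual CPU was
not running from the wall-clock interval.  A minimal sketch, assuming the
hypervisor reports the time taken away from the guest as a nanosecond count
(the function and parameter names are hypothetical):

.. code-block:: c

   #include <stdint.h>

   /* CPU time the guest actually received during a wall-clock interval. */
   static uint64_t guest_run_ns(uint64_t wall_elapsed_ns, uint64_t stolen_ns)
   {
       /* Clamp at zero in case the two samples were not taken atomically. */
       return stolen_ns < wall_elapsed_ns ? wall_elapsed_ns - stolen_ns : 0;
   }

A guest scheduler that charges tasks with this value instead of the raw
interval avoids penalizing a task for time in which the whole virtual CPU was
preempted.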

4.6. Watchdogs
--------------

Watchdog timers, such as the lockup detector in Linux, may fire accidentally
when running under hardware virtualization due to timer interrupts being
delayed or misinterpretation of the passage of real time.  Usually, these
warnings are spurious and can be ignored, but in some circumstances it may be
necessary to disable such detection.

4.7. Delays and precision timing
--------------------------------

Precise timing and delays may not be possible in a virtualized system.  This
can happen if the system is controlling physical hardware, or issues delays to
compensate for slower I/O to and from devices.  The first issue is not solvable
in general for a virtualized system; hardware control software can't be
adequately virtualized without a full real-time operating system, which would
require an RT-aware virtualization platform.

The second issue may cause performance problems, but this is unlikely to be a
significant issue.  In many cases these delays may be eliminated through
configuration or paravirtualization.

4.8. Covert channels and leaks
------------------------------

In addition to the above problems, time information will inevitably leak to the
guest about the host in anything but a perfect implementation of virtualized
time.  This may allow the guest to infer the presence of a hypervisor (as in a
red-pill type detection), and it may allow information to leak between guests
by using CPU utilization itself as a signalling channel.  Preventing such
problems would require completely isolated virtual time which may not track
real time any longer.  This may be useful in certain security or QA contexts,
but in general isn't recommended for real-world deployment scenarios.