.. SPDX-License-Identifier: GPL-2.0

======================================================
Timekeeping Virtualization for X86-Based Architectures
======================================================

:Author: Zachary Amsden <zamsden@redhat.com>
:Copyright: (c) 2010, Red Hat.  All rights reserved.

.. Contents

   1) Overview
   2) Timing Devices
   3) TSC Hardware
   4) Virtualization Problems

1. Overview
===========

One of the most complicated parts of the X86 platform, and specifically,
the virtualization of this platform is the plethora of timing devices available
and the complexity of emulating those devices.  In addition, virtualization of
time introduces a new set of challenges because it introduces a multiplexed
division of time beyond the control of the guest CPU.

First, we will describe the various timekeeping hardware available, then
present some of the problems which arise and solutions available, giving
specific recommendations for certain classes of KVM guests.

The purpose of this document is to collect data and information relevant to
timekeeping which may be difficult to find elsewhere, specifically,
information relevant to KVM and hardware-based virtualization.

2. Timing Devices
=================

First we discuss the basic hardware devices available.  TSC and the related
KVM clock are special enough to warrant a full exposition and are described in
the following section.

2.1. i8254 - PIT
----------------

One of the first timer devices available is the programmable interval timer,
or PIT.  The PIT has a fixed frequency 1.193182 MHz base clock and three
channels which can be programmed to deliver periodic or one-shot interrupts.
These three channels can be configured in different modes and have individual
counters.  Channels 1 and 2 were not available for general use in the original
IBM PC, and historically were connected to control RAM refresh and the PC
speaker.  Now the PIT is typically integrated as part of an emulated chipset
and a separate physical PIT is not used.

The PIT uses I/O ports 0x40 - 0x43.  Access to the 16-bit counters is done
using single or multiple byte access to the I/O ports.  There are 6 modes
available, but not all modes are available to all timers, as only timer 2
has a connected gate input, required for modes 1 and 5.  The gate line is
controlled by port 61h, bit 0, as illustrated in the following diagram::

  --------------             ----------------
 |              |           |                |
 |  1.1932 MHz  |---------->| CLOCK      OUT | ---------> IRQ 0
 |    Clock     |   |       |                |
  --------------    |    +->| GATE  TIMER 0  |
                    |        ----------------
                    |
                    |        ----------------
                    |       |                |
                    |------>| CLOCK      OUT | ---------> 66.3 KHZ DRAM
                    |       |                |            (aka /dev/null)
                    |    +->| GATE  TIMER 1  |
                    |        ----------------
                    |
                    |        ----------------
                    |       |                |
                    |------>| CLOCK      OUT | ---------> Port 61h, bit 5
                            |                |      |
  Port 61h, bit 0 -------->| GATE  TIMER 2  |       \_.----   ____
                             ----------------         _|    )--|LPF|---Speaker
                                                     / *----   \___/
  Port 61h, bit 1 ---------------------------------/

The timer modes are now described.

Mode 0: Single Timeout.
 This is a one-shot software timeout that counts down
 when the gate is high (always true for timers 0 and 1).  When the count
 reaches zero, the output goes high.

Mode 1: Triggered One-shot.
 The output is initially set high.  When the gate
 line is set high, a countdown is initiated (which does not stop if the gate is
 lowered), during which the output is set low.  When the count reaches zero,
 the output goes high.

Mode 2: Rate Generator.
 The output is initially set high.  When the countdown
 reaches 1, the output goes low for one count and then returns high.  The value
 is reloaded and the countdown automatically resumes.  If the gate line goes
 low, the count is halted.  If the output is low when the gate is lowered, the
 output automatically goes high (this only affects timer 2).

Mode 3: Square Wave.
 This generates a high / low square wave.  The count
 determines the length of the pulse, which alternates between high and low
 when zero is reached.  The count only proceeds when gate is high and is
 automatically reloaded on reaching zero.  The count is decremented twice at
 each clock to generate a full high / low cycle at the full periodic rate.
 If the count is even, the output remains high for N/2 counts and low for N/2
 counts; if the count is odd, the output is high for (N+1)/2 counts and low
 for (N-1)/2 counts.  Only even values are latched by the counter, so odd
 values are not observed when reading.  This is the intended mode for timer 2,
 which generates sine-like tones by low-pass filtering the square wave output.

Mode 4: Software Strobe.
 After programming this mode and loading the counter,
 the output remains high until the counter reaches zero.  Then the output
 goes low for 1 clock cycle and returns high.  The counter is not reloaded.
 Counting only occurs when gate is high.

Mode 5: Hardware Strobe.
 After programming and loading the counter, the
 output remains high.  When the gate is raised, a countdown is initiated
 (which does not stop if the gate is lowered).  When the counter reaches zero,
 the output goes low for 1 clock cycle and then returns high.  The counter is
 not reloaded.

In addition to normal binary counting, the PIT supports BCD counting.  The
command port, 0x43, is used to set the counter and mode for each of the three
timers.

PIT commands are issued to port 0x43 using the following bit encoding::

  Bit 7-4: Command (See table below)
  Bit 3-1: Mode (000 = Mode 0, 101 = Mode 5, 11X = undefined)
  Bit 0  : Binary (0) / BCD (1)

Command table::

  0000 - Latch Timer 0 count for port 0x40
        sample and hold the count to be read in port 0x40;
        additional commands ignored until counter is read;
        mode bits ignored.

  0001 - Set Timer 0 LSB mode for port 0x40
        set timer to read LSB only and force MSB to zero;
        mode bits set timer mode

  0010 - Set Timer 0 MSB mode for port 0x40
        set timer to read MSB only and force LSB to zero;
        mode bits set timer mode

  0011 - Set Timer 0 16-bit mode for port 0x40
        set timer to read / write LSB first, then MSB;
        mode bits set timer mode

  0100 - Latch Timer 1 count for port 0x41 - as described above
  0101 - Set Timer 1 LSB mode for port 0x41 - as described above
  0110 - Set Timer 1 MSB mode for port 0x41 - as described above
  0111 - Set Timer 1 16-bit mode for port 0x41 - as described above

  1000 - Latch Timer 2 count for port 0x42 - as described above
  1001 - Set Timer 2 LSB mode for port 0x42 - as described above
  1010 - Set Timer 2 MSB mode for port 0x42 - as described above
  1011 - Set Timer 2 16-bit mode for port 0x42 - as described above

  1101 - General counter latch
        Latch combination of counters into corresponding ports
        Bit 3 = Counter 2
        Bit 2 = Counter 1
        Bit 1 = Counter 0
        Bit 0 = Unused

  1110 - Latch timer status
        Latch combination of counter mode into corresponding ports
        Bit 3 = Counter 2
        Bit 2 = Counter 1
        Bit 1 = Counter 0

        The output of ports 0x40-0x42 following this command will be:

        Bit 7 = Output pin
        Bit 6 = Count loaded (0 if timer has expired)
        Bit 5-4 = Read / Write mode
            01 = LSB only
            10 = MSB only
            11 = LSB / MSB (16-bit)
        Bit 3-1 = Mode
        Bit 0 = Binary (0) / BCD mode (1)

2.2. RTC
--------

The second device which was available in the original PC was the MC146818 real
time clock.  The original device is now obsolete, and usually emulated by the
system chipset, sometimes by an HPET and some frankenstein IRQ routing.

The RTC is accessed through CMOS variables, which use an index register to
control which bytes are read.  Since there is only one index register, reads
of the CMOS and reads of the RTC require lock protection (in addition, it is
dangerous to allow userspace utilities such as hwclock to have direct RTC
access, as they could corrupt kernel reads and writes of CMOS memory).

The RTC generates an interrupt which is usually routed to IRQ 8.  The interrupt
can function as a periodic timer, an additional once-a-day alarm, and can issue
interrupts after an update of the CMOS registers by the MC146818 is complete.
The type of interrupt is signalled in the RTC status registers.

The RTC will update the current time fields on battery power even while the
system is off.  The current time fields should not be read while an update is
in progress, as indicated in the status register.

The clock uses a 32.768kHz crystal, so bits 6-4 of register A should be
programmed to a 32kHz divider if the RTC is to count seconds.

This is the RAM map originally used for the RTC/CMOS::

  Location    Size    Description
  ------------------------------------------
  00h         byte    Current second (BCD)
  01h         byte    Seconds alarm (BCD)
  02h         byte    Current minute (BCD)
  03h         byte    Minutes alarm (BCD)
  04h         byte    Current hour (BCD)
  05h         byte    Hours alarm (BCD)
  06h         byte    Current day of week (BCD)
  07h         byte    Current day of month (BCD)
  08h         byte    Current month (BCD)
  09h         byte    Current year (BCD)
  0Ah         byte    Register A
                       bit 7   = Update in progress
                       bit 6-4 = Divider for clock
                                  000 = 4.194 MHz
                                  001 = 1.049 MHz
                                  010 = 32 kHz
                                  10X = test modes
                                  110 = reset / disable
                                  111 = reset / disable
                       bit 3-0 = Rate selection for periodic interrupt
                                  0000 = periodic timer disabled
                                  0001 = 3.90625 ms
                                  0010 = 7.8125 ms
                                  0011 = .122070 ms
                                  0100 = .244141 ms
                                     ...
                                  1101 = 125 ms
                                  1110 = 250 ms
                                  1111 = 500 ms
  0Bh         byte    Register B
                       bit 7   = Run (0) / Halt (1)
                       bit 6   = Periodic interrupt enable
                       bit 5   = Alarm interrupt enable
                       bit 4   = Update-ended interrupt enable
                       bit 3   = Square wave output enable
                       bit 2   = BCD calendar (0) / Binary (1)
                       bit 1   = 12-hour mode (0) / 24-hour mode (1)
                       bit 0   = 0 (DST off) / 1 (DST enabled)
  0Ch         byte    Register C (read only)
                       bit 7   = interrupt request flag (IRQF)
                       bit 6   = periodic interrupt flag (PF)
                       bit 5   = alarm interrupt flag (AF)
                       bit 4   = update interrupt flag (UF)
                       bit 3-0 = reserved
  0Dh         byte    Register D (read only)
                       bit 7   = RTC has power
                       bit 6-0 = reserved
  32h         byte    Current century BCD (*)
  (*) location vendor specific and now determined from ACPI global tables

2.3. APIC
---------

On Pentium and later processors, an on-board timer is available to each CPU
as part of the Advanced Programmable Interrupt Controller.
The APIC is accessed through memory-mapped registers and provides interrupt
service to each CPU, used for IPIs and local timer interrupts.

Although in theory the APIC is a safe and stable source for local interrupts,
in practice, many bugs and glitches have occurred due to the special nature of
the APIC CPU-local memory-mapped hardware.  Beware that CPU errata may affect
the use of the APIC and that workarounds may be required.  In addition, some of
these workarounds pose unique constraints for virtualization - requiring either
extra overhead incurred from extra reads of memory-mapped I/O or additional
functionality that may be more computationally expensive to implement.

Since the APIC is documented quite well in the Intel and AMD manuals, we will
avoid repetition of the detail here.  It should be pointed out that the APIC
timer is programmed through the LVT (local vector table) timer register, is
capable of one-shot or periodic operation, and is based on the bus clock
divided down by the programmable divider register.

2.4. HPET
---------

HPET is quite complex, and was originally intended to replace the PIT / RTC
support of the X86 PC.  It remains to be seen whether that will be the case, as
the de facto standard of PC hardware is to emulate these older devices.  Some
systems designated as legacy free may support only the HPET as a hardware timer
device.

The HPET spec is rather loose and vague, requiring at least 3 hardware timers,
but allowing implementation freedom to support many more.  It also imposes no
fixed rate on the timer frequency, but does impose some extremal values on
frequency, error and slew.

In general, the HPET is recommended as a high precision (compared to PIT / RTC)
time source which is independent of local variation (as there is only one HPET
in any given system).  The HPET is also memory-mapped, and its presence is
indicated through ACPI tables by the BIOS.

Detailed specification of the HPET is beyond the current scope of this
document, as it is also very well documented elsewhere.

2.5. Offboard Timers
--------------------

Several cards, both proprietary (watchdog boards) and commonplace (e1000), have
timing chips built into the cards which may have registers which are accessible
to kernel or user drivers.  To the author's knowledge, using these to generate
a clocksource for a Linux or other kernel has not yet been attempted and is in
general frowned upon as not playing by the agreed rules of the game.  Such a
timer device would require additional support to be virtualized properly and is
not considered important at this time as no known operating system does this.

3. TSC Hardware
===============

The TSC or time stamp counter is relatively simple in theory; it counts
instruction cycles issued by the processor, which can be used as a measure of
time.  In practice, due to a number of problems, it is the most complicated
timekeeping device to use.

The TSC is represented internally as a 64-bit MSR which can be read with the
RDMSR, RDTSC, or RDTSCP (when available) instructions.  Writing the TSC has
long been possible, but on old hardware it was generally only possible to write
the low 32 bits of the 64-bit counter, and the upper 32 bits of the counter
were cleared.  Now, however, on Intel processors family 0Fh, for models 3, 4
and 6, and family 06h, models e and f, this restriction has been lifted and all
64 bits are writable.  On AMD systems, the ability to write the TSC MSR is not
an architectural guarantee.

The TSC is accessible from CPL 0 and, conditionally, from CPL > 0 software:
the CR4.TSD bit, when set, disables TSC access from CPL > 0.
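
As an illustration of the read path described above, the following is a
minimal sketch, not taken from the kernel, of reading the TSC with RDTSC from
C.  The helper names (``tsc_from_halves``, ``tsc_elapsed``, ``read_tsc``) are
hypothetical, and the inline assembly assumes a GCC-compatible compiler on
x86.  RDTSC returns the counter split across EDX:EAX, so the two halves must
be recombined, and the read faults at CPL > 0 if CR4.TSD is set.

```c
#include <assert.h>
#include <stdint.h>

/* Combine the EDX (high) and EAX (low) halves that RDTSC returns. */
static inline uint64_t tsc_from_halves(uint32_t lo, uint32_t hi)
{
    return ((uint64_t)hi << 32) | lo;
}

/* Cycles between two samples; unsigned subtraction tolerates wraparound. */
static inline uint64_t tsc_elapsed(uint64_t start, uint64_t end)
{
    return end - start;
}

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
/*
 * Read the TSC of the CPU this code happens to be running on.
 * Assumption: CR4.TSD is clear, otherwise this faults at CPL > 0.
 */
static inline uint64_t read_tsc(void)
{
    uint32_t lo, hi;

    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return tsc_from_halves(lo, hi);
}
#endif
```

Two samples taken this way are only comparable if both were read on the same
CPU, or on CPUs whose TSCs are known to be synchronized.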

Some vendors have implemented an additional instruction, RDTSCP, which returns
atomically not just the TSC, but an indicator which corresponds to the
processor number.  This can be used to index into an array of TSC variables to
determine offset information in SMP systems where TSCs are not synchronized.
The presence of this instruction must be determined by consulting CPUID feature
bits.

Both VMX and SVM provide extension fields in the virtualization hardware which
allow the guest visible TSC to be offset by a constant.  Newer implementations
promise to allow the TSC to additionally be scaled, but this hardware is not
yet widely available.

3.1. TSC synchronization
------------------------

The TSC is a CPU-local clock in most implementations.  This means, on SMP
platforms, the TSCs of different CPUs may start at different times depending
on when the CPUs are powered on.  Generally, CPUs on the same die will share
the same clock; however, this is not always the case.

The BIOS may attempt to resynchronize the TSCs during the poweron process and
the operating system or other system software may attempt to do this as well.
Several hardware limitations make the problem worse - if it is not possible to
write the full 64 bits of the TSC, it may be impossible to match the TSC in
newly arriving CPUs to that of the rest of the system, resulting in
unsynchronized TSCs.  Resynchronization may be attempted by BIOS or system
software, but in practice, getting a perfectly synchronized TSC will not be
possible unless all values are read from the same clock, which generally is
only possible on single socket systems or those with special hardware support.

3.2. TSC and CPU hotplug
------------------------

As touched on already, CPUs which arrive later than the boot time of the system
may not have a TSC value that is synchronized with the rest of the system.
Either system software, BIOS, or SMM code may actually try to establish the TSC
to a value matching the rest of the system, but a perfect match is usually not
a guarantee.  This can have the effect of bringing a system from a state where
TSC is synchronized back to a state where TSC synchronization flaws, however
small, may be exposed to the OS and any virtualization environment.

3.3. TSC and multi-socket / NUMA
--------------------------------

Multi-socket systems, especially large multi-socket systems, are likely to have
individual clocksources rather than a single, universally distributed clock.
Since these clocks are driven by different crystals, they will not have
perfectly matched frequency, and temperature and electrical variations will
cause the CPU clocks, and thus the TSCs, to drift over time.  Depending on the
exact clock and bus design, the drift may or may not be fixed in absolute
error, and may accumulate over time.

In addition, very large systems may deliberately slew the clocks of individual
cores.  This technique, known as spread-spectrum clocking, reduces EMI at the
clock frequency and harmonics of it, which may be required to pass FCC
standards for telecommunications and computer equipment.

It is recommended not to trust the TSCs to remain synchronized on NUMA or
multiple socket systems for these reasons.

3.4. TSC and C-states
---------------------

C-states, or idling states of the processor, especially C1E and deeper sleep
states, may be problematic for TSC as well.  The TSC may stop advancing in such
a state, resulting in a TSC which is behind that of other CPUs when execution
is resumed.  Such CPUs must be detected and flagged by the operating system
based on CPU and chipset identifications.

The TSC in such a case may be corrected by catching it up to a known external
clocksource.

3.5. TSC frequency change / P-states
------------------------------------

To make things slightly more interesting, some CPUs may change frequency.  They
may or may not run the TSC at the same rate, and because the frequency change
may be staggered or slewed, at some points in time, the TSC rate may not be
known other than falling within a range of values.  In this case, the TSC will
not be a stable time source, and must be calibrated against a known, stable,
external clock to be a usable source of time.

Whether the TSC runs at a constant rate or scales with the P-state is model
dependent and must be determined by inspecting CPUID, chipset or vendor
specific MSR fields.

In addition, some vendors have known bugs where the P-state is actually
compensated for properly during normal operation, but when the processor is
inactive, the P-state may be raised temporarily to service cache misses from
other processors.  In such cases, the TSC on halted CPUs could advance faster
than that of non-halted processors.  AMD Turion processors are known to have
this problem.

3.6. TSC and STPCLK / T-states
------------------------------

External signals given to the processor may also have the effect of stopping
the TSC.  This is typically done for thermal emergency power control to prevent
an overheating condition, and typically, there is no way to detect that this
condition has happened.

3.7. TSC virtualization - VMX
-----------------------------

VMX provides conditional trapping of RDTSC, RDMSR, WRMSR and RDTSCP
instructions, which is enough for full virtualization of TSC in any manner.  In
addition, VMX allows passing through the host TSC plus an additional TSC_OFFSET
field specified in the VMCS.  Special instructions must be used to read and
write the VMCS field.

3.8. TSC virtualization - SVM
-----------------------------

SVM provides conditional trapping of RDTSC, RDMSR, WRMSR and RDTSCP
instructions, which is enough for full virtualization of TSC in any manner.  In
addition, SVM allows passing through the host TSC plus an additional offset
field specified in the SVM control block.

3.9. TSC feature bits in Linux
------------------------------

In summary, there is no way to guarantee the TSC remains in perfect
synchronization unless it is explicitly guaranteed by the architecture.  Even
if so, the TSCs in multi-socket or NUMA systems may still run independently
despite being locally consistent.
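
The constant-offset passthrough that sections 3.7 and 3.8 describe for VMX and
SVM reduces to 64-bit modular arithmetic: the guest observes the host TSC plus
the offset field.  The sketch below only illustrates that arithmetic; the
function names are hypothetical and this is not how KVM itself is written.
It shows how a hypervisor could pick an offset so that the guest observes a
chosen value, including the case where the wanted value is below the host's.

```c
#include <assert.h>
#include <stdint.h>

/* Value the guest observes when the hardware adds a constant offset. */
static inline uint64_t guest_tsc(uint64_t host_tsc, uint64_t tsc_offset)
{
    return host_tsc + tsc_offset;   /* wraps modulo 2^64 */
}

/*
 * Offset that makes the guest observe 'wanted' at the moment the host
 * counter reads 'host_tsc'.  A "negative" offset is simply a large
 * unsigned value; the modular addition above undoes the wrap.
 */
static inline uint64_t tsc_offset_for(uint64_t host_tsc, uint64_t wanted)
{
    return wanted - host_tsc;
}
```

Because the offset is constant, it corrects only where the guest counter
starts.  It cannot compensate for a host TSC that stops, drifts, or changes
rate, which is why the problems listed in the preceding sections remain
visible to a guest that is given passthrough access.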
469*daec8d40SPaolo Bonzini 470*daec8d40SPaolo BonziniThe following feature bits are used by Linux to signal various TSC attributes, 471*daec8d40SPaolo Bonzinibut they can only be taken to be meaningful for UP or single node systems. 472*daec8d40SPaolo Bonzini 473*daec8d40SPaolo Bonzini========================= ======================================= 474*daec8d40SPaolo BonziniX86_FEATURE_TSC The TSC is available in hardware 475*daec8d40SPaolo BonziniX86_FEATURE_RDTSCP The RDTSCP instruction is available 476*daec8d40SPaolo BonziniX86_FEATURE_CONSTANT_TSC The TSC rate is unchanged with P-states 477*daec8d40SPaolo BonziniX86_FEATURE_NONSTOP_TSC The TSC does not stop in C-states 478*daec8d40SPaolo BonziniX86_FEATURE_TSC_RELIABLE TSC sync checks are skipped (VMware) 479*daec8d40SPaolo Bonzini========================= ======================================= 480*daec8d40SPaolo Bonzini 481*daec8d40SPaolo Bonzini4. Virtualization Problems 482*daec8d40SPaolo Bonzini========================== 483*daec8d40SPaolo Bonzini 484*daec8d40SPaolo BonziniTimekeeping is especially problematic for virtualization because a number of 485*daec8d40SPaolo Bonzinichallenges arise. The most obvious problem is that time is now shared between 486*daec8d40SPaolo Bonzinithe host and, potentially, a number of virtual machines. Thus the virtual 487*daec8d40SPaolo Bonzinioperating system does not run with 100% usage of the CPU, despite the fact that 488*daec8d40SPaolo Bonziniit may very well make that assumption. It may expect it to remain true to very 489*daec8d40SPaolo Bonziniexacting bounds when interrupt sources are disabled, but in reality only its 490*daec8d40SPaolo Bonzinivirtual interrupt sources are disabled, and the machine may still be preempted 491*daec8d40SPaolo Bonziniat any time. 
This causes problems, as the injection of machine interrupts and the
associated clock sources are no longer completely synchronized with the
passage of real time.

This same problem can occur on native hardware to a degree, as SMM may steal
cycles from the operating system on X86 systems where the BIOS uses it, but
not in such an extreme fashion. However, the fact that SMM may cause problems
similar to those of virtualization is a good justification for solving many of
these problems on bare metal.

4.1. Interrupt clocking
-----------------------

One of the most immediate problems that occurs with legacy operating systems
is that the system timekeeping routines are often designed to keep track of
time by counting periodic interrupts. These interrupts may come from the PIT
or the RTC, but the problem is the same: the host virtualization engine may not
be able to deliver the proper number of interrupts per second, and so guest
time may fall behind. This is especially problematic if a high interrupt rate
is selected, such as 1000 Hz, which is unfortunately the default for many Linux
guests.

There are three approaches to solving this problem; first, it may be possible
to simply ignore it.
Guests which have a separate time source for tracking
'wall clock' or 'real time' may not need any adjustment of their interrupts to
maintain proper time. If this is not sufficient, it may be necessary to inject
additional interrupts into the guest in order to increase the effective
interrupt rate. This approach leads to complications in extreme conditions,
where host load or guest lag is too much to compensate for, and thus another
solution to the problem has arisen: the guest may need to become aware of lost
ticks and compensate for them internally. Although promising in theory, the
implementation of this policy in Linux has been extremely error-prone, and a
number of buggy variants of lost tick compensation are distributed across
commonly used Linux systems.

Windows uses periodic RTC clocking as a means of keeping time internally, and
thus requires interrupt slewing to keep proper time. However, it uses a low
enough rate (ed: is it 18.2 Hz?) that this has not yet been a problem in
practice.

4.2. TSC sampling and serialization
-----------------------------------

As the highest precision time source available, the cycle counter of the CPU
has aroused much interest from developers. As explained above, this timer has
many problems unique to its nature as a local, potentially unstable and
potentially unsynchronized source.
One issue which is not unique to the TSC,
but is highlighted because of its very precise nature, is sampling delay. By
definition, the counter, once read, is already old. However, it is also
possible for the counter to be read ahead of the actual use of the result.
This is a consequence of the superscalar execution of the instruction stream,
which may execute instructions out of order. Such execution is called
non-serialized. Forcing serialized execution is necessary for precise
measurement with the TSC, and requires a serializing instruction, such as CPUID
or an MSR write.

Since CPUID may actually be virtualized by a trap-and-emulate mechanism, this
serialization can pose a performance issue for hardware virtualization. An
accurate time stamp counter reading may therefore not always be available, and
it may be necessary for an implementation to guard against "backwards" reads of
the TSC as seen from other CPUs, even in an otherwise perfectly synchronized
system.

4.3. Timespec aliasing
----------------------

Additionally, this lack of serialization from the TSC poses another challenge
when TSC results are measured against another time source. As the TSC has much
higher precision, many possible values of the TSC may be read while another
clock is still expressing the same value.
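The aliasing can be illustrated with a toy model. This is only a sketch under
an assumed ratio of 1000 TSC cycles per tick of the coarser clock; the names
``CYCLES_PER_TICK``, ``coarse_clock``, ``alias_window`` and
``calibration_error`` are hypothetical and chosen for illustration.

```c
#include <stdint.h>

/* Assumed ratio, for illustration only: the coarse clock advances once
 * every 1000 TSC cycles. */
#define CYCLES_PER_TICK 1000ULL

/* Hypothetical coarse clock, modelled as sharing the TSC's time base. */
static inline uint64_t coarse_clock(uint64_t tsc)
{
    return tsc / CYCLES_PER_TICK;
}

/* Lowest and highest TSC values that alias to the same coarse clock
 * value as tsc. */
static inline void alias_window(uint64_t tsc, uint64_t *lo, uint64_t *hi)
{
    *lo = coarse_clock(tsc) * CYCLES_PER_TICK;
    *hi = *lo + CYCLES_PER_TICK - 1;
}

/* How many cycles into the current coarse tick the sample fell. A
 * calibration that pairs this TSC reading with the coarse clock value
 * is off by this many cycles. */
static inline uint64_t calibration_error(uint64_t tsc)
{
    return tsc % CYCLES_PER_TICK;
}
```

With the assumed ratio, 1000 consecutive TSC values share each value of the
coarse clock, and a single calibration sample can be off by up to 999 cycles.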

That is, you may read any TSC value in the range (T, T+10) while external
clock C maintains the same value. Due to non-serialized reads, you may actually
end up with a range which fluctuates, such as (T-1, T+10). Thus, any time
calculated from a TSC, but calibrated against an external value, may have a
range of valid values. Re-calibrating this computation may actually cause time,
as computed after the calibration, to go backwards, compared with time computed
before the calibration.

This problem is particularly pronounced with an internal time source in Linux,
the kernel time, which is expressed in the theoretically high resolution
timespec, but which advances in much larger granularity intervals, sometimes
at the rate of jiffies, and possibly, in catchup modes, at a much larger step.

This aliasing requires care in the computation and recalibration of kvmclock
and any other values derived from TSC computation (such as TSC virtualization
itself).

4.4. Migration
--------------

Migration of a virtual machine raises problems for timekeeping in two ways.
First, the migration itself may take time, during which interrupts cannot be
delivered, and after which the guest time may need to be caught up.
NTP may
be able to help to some degree here, as the clock correction required is
typically small enough to fall in the NTP-correctable window.

An additional concern is that timers based on the TSC (or HPET, if the raw bus
clock is exposed) may now be running at different rates, requiring compensation
in some way in the hypervisor by virtualizing these timers. In addition,
migrating to a faster machine may preclude the use of a passthrough TSC, as a
faster clock cannot be made visible to a guest without the potential of time
advancing faster than usual. A slower clock is less of a problem, as it can
always be caught up to the original rate. KVM clock avoids these problems by
simply storing multipliers and offsets against the TSC for the guest to convert
back into nanosecond-resolution values.

4.5. Scheduling
---------------

Since scheduling may be based on precise timing and firing of interrupts, the
scheduling algorithms of an operating system may be adversely affected by
virtualization. In theory, the effect is random and should be uniformly
distributed, but in contrived as well as real scenarios (guest device access,
causes of virtualization exits, possible context switch), this may not always
be the case. The effect of this has not been well studied.

In an attempt to work around this, several implementations have provided a
paravirtualized scheduler clock, which reveals the true amount of CPU time for
which a virtual machine has been running.

4.6. Watchdogs
--------------

Watchdog timers, such as the lockup detector in Linux, may fire accidentally
when running under hardware virtualization, due to timer interrupts being
delayed or misinterpretation of the passage of real time. Usually, these
warnings are spurious and can be ignored, but in some circumstances it may be
necessary to disable such detection.

4.7. Delays and precision timing
--------------------------------

Precise timing and delays may not be possible in a virtualized system. This
matters if the system is controlling physical hardware, or issues delays to
compensate for slower I/O to and from devices. The first issue is not solvable
in general for a virtualized system; hardware control software can't be
adequately virtualized without a full real-time operating system, which would
require an RT-aware virtualization platform.

The second issue may cause performance problems, but this is unlikely to be a
significant issue. In many cases these delays may be eliminated through
configuration or paravirtualization.

4.8. Covert channels and leaks
------------------------------

In addition to the above problems, timing information about the host will
inevitably leak to the guest in anything but a perfect implementation of
virtualized time. This may allow the guest to infer the presence of a
hypervisor (as in a red-pill type detection), and it may allow information to
leak between guests by using CPU utilization itself as a signalling channel.
Preventing such problems would require completely isolated virtual time which
may not track real time any longer. This may be useful in certain security or
QA contexts, but in general isn't recommended for real-world deployment
scenarios.
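As a rough illustration of how such timing information becomes visible, a
guest can compare consecutive clock reads and look for unusually large gaps.
The sketch below assumes a POSIX system with ``clock_gettime()``; the helper
names ``now_ns`` and ``max_read_gap_ns`` are hypothetical. A large gap is not
proof of a hypervisor, since ordinary preemption or SMM activity on bare metal
may produce one as well.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Current CLOCK_MONOTONIC reading in nanoseconds. */
static inline uint64_t now_ns(void)
{
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);

    assert(rc == 0);
    (void)rc; /* keeps the build warning-free when NDEBUG is defined */
    return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

/*
 * Largest gap, in nanoseconds, seen between consecutive clock reads over
 * the given number of samples. Gaps far above the typical read cost hint
 * that the caller was not running for part of that interval.
 */
static inline uint64_t max_read_gap_ns(unsigned int samples)
{
    uint64_t prev = now_ns();
    uint64_t worst = 0;

    for (unsigned int i = 0; i < samples; i++) {
        uint64_t cur = now_ns();

        if (cur - prev > worst)
            worst = cur - prev;
        prev = cur;
    }
    return worst;
}
```

Calling ``max_read_gap_ns()`` with a large sample count while the host is under
varying load would be one way to observe the effect described above.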