Under the hood of Windows timestamps
Programmers have always needed reliable timestamps and programming on Windows is no different. But getting accurate timestamps efficiently from user mode, a seemingly simple task that every operating system supports, seems to have at best made Microsoft jump through hoops for 30 years, and at worst eluded them completely.
Compared to Linux, where the clock_gettime API will return all sorts of timestamps at a standardized nanosecond resolution, Windows is not so simple, and this leads to assumptions and a quagmire of edge cases. Sure, Windows has a complicated legacy and there are lots of backwards compatibility issues that still dictate how time works to this day, but why is it that even in Windows 11 there is no time API with standard units? Why do we always have to convert from some arbitrary frequency, which only aids assumptions and bad programming? These timing assumptions and other edge cases are typically what cause old software to misbehave on modern machines. Old applications sometimes stutter, run too slow, run too fast, and sometimes outright crash. These misbehaving apps are not directly Microsoft's fault; they are usually victims of assumptions that were valid at the time.
In this post we are taking a deep dive into the Windows time APIs to see exactly how they work and how they evolved, from DOS to hypervisors. We'll figure out what modern apps should be using, and we'll see that Windows 11 gets pretty close to a standard timer; not quite, but close enough that it will probably lead to more bad assumptions in the future.
For a fleeting moment it seemed like we could avoid all time-based APIs and do things ourselves. Starting with the original Pentium processor, some 30 years ago, Intel added the RDTSC instruction. It reads a hardware time stamp counter (TSC) that increments every clock cycle, and for the first time applications had a hardware counter, accessible with low latency from user mode, that provided cycle-level precision. For a while all of our timing problems were solved: even a lowly 100MHz Pentium could give 10ns timestamps and compute accurate elapsed time periods, and code could easily be profiled to see the results of the smallest micro optimizations; all things that were exceptionally difficult or impossible with the Windows time APIs of the day. It didn't last though: while RDTSC was used all over the place, it became very unreliable and a lot of software broke.
Early multi-processor systems had physically separate processors in physically separate packages, and each had its own non-synchronized TSC. If your code switched processors, the RDTSC instruction could give the impression of time going backwards, which causes all sorts of problems. Lots of games and media apps written in the days of single processors suffered timing problems because of this. It wasn't just third-party code, even system tools suffered; for example, ping on an early Windows Server would sometimes report negative time!
C:\>ping google.com
Ping google.com (142.250.69.238) with 32 bytes of data:
Reply from 142.250.69.238: bytes=32 time=-59ms TTL=128
Reply from 142.250.69.238: bytes=32 time=-59ms TTL=128
Reply from 142.250.69.238: bytes=32 time=-59ms TTL=128
Reply from 142.250.69.238: bytes=32 time=-59ms TTL=128
Over time, using RDTSC even on single processors became unreliable because the TSC incremented at the raw clock speed of the processor, which wasn't stable. The frequency would change by small amounts due to spread spectrum clocking, but would change by huge amounts due to thermal and power management, which made the TSC unusable. To make things worse, variable frequency clocking arrived around the same time that multi-processor machines started to become commonplace, so it was a double whammy. RDTSC could still be used, albeit with great care, but it was generally impossible to correlate the TSC to wall time. Today this problem is fixed via an invariant TSC, which increments at a fixed rate in all power and thermal states; from a software point of view the invariant TSC increments at the base clock speed of the processor, regardless of the actual operating speed. The TSC can once again be used for accurate timing as it directly correlates to wall time, yet Microsoft still recommends not using the RDTSC instruction and instead calling the QueryPerformanceCounter API. Is this true? Before we answer that question and dive into the nitty gritty of QueryPerformanceCounter, along with the role the TSC plays in modern Windows, let's look at the methods used to get time prior to RDTSC. There is a lot of history here and it's not hard to figure out why RDTSC got used so much.
Prior to RDTSC, and during the many years that RDTSC couldn't reliably be used, Windows only had a few options for elapsed time; most of them were timer-interrupt driven and none of them were great. Stuttering in early games and media was often the result of poor timing. GetTickCount() has been in the Windows API since the 16-bit days, it still exists today and is used everywhere. It returns the integer number of milliseconds since the system was booted. The resulting 32-bit counter wraps after 49.7 days, and to avoid this modern versions of Windows have GetTickCount64(), which is identical in functionality other than returning a 64-bit result. Realistically, having a timer wrap after 49.7 days probably wasn't an issue for Windows 3.1 because it wouldn't stay on that long. While the accuracy of GetTickCount() is 1ms, its resolution has never been close and still isn't. Initially the resolution of this API was a whopping 55ms, and this was true all the way through the Win9x operating systems; for Windows NT based operating systems the resolution has always been 10-15ms, where it still is today, even on Windows 11.
To understand where the 55ms interval comes from, which seems like a very random number in computer terms, we have to go all the way back to the origins of the PC BIOS. The original IBM PC's CPU ran at approximately 4.77MHz, and the 8254 programmable interval timer was clocked at a quarter of that speed, 1.19318MHz. 65536 cycles at 1.19318MHz is 54.92ms. The timer generated an 18.2065Hz interrupt on IRQ0 (interrupt vector 08h) which the BIOS used to track wall time. Early PCs didn't have a battery-backed clock so had no concept of wall time; the user would enter the date and time at boot and the BIOS would keep track of wall time using this interrupt. At an 18.2065Hz interrupt rate, 2^16 interrupts equals approximately 3599.59 seconds, which is almost exactly one hour. To save a few CPU cycles the BIOS used this roughly once-an-hour 16-bit overflow to check whether the wall time had crossed midnight, and when it did, it incremented the date.
Lots of DOS games hijacked the 8254 timer hardware and installed their own interrupt handler, so DOS games had access to accurate time. Early Windows ran on top of DOS and used the DOS and BIOS services as-is, so it didn't mess with the timer hardware. The original GetTickCount() API was a wrapper for the BIOS timer interrupt: it literally counted interrupts and multiplied that count by 54.92 to compute elapsed time in milliseconds. As an optimization, the API did the multiplication only when it was called; amazingly it is still implemented this way on Windows 11, and although the origin of the tick counter is a little different, the resolution is still only about 15ms. NT based operating systems were never based on DOS and the NT kernel did hijack the hardware timer. On a single processor NT machine the interrupt period was set to about 10ms, while on multi-processor machines it was about 15ms; much better than the 55ms of 16-bit Windows and Win9x, but nowhere near good enough for games and media.
A common pattern within Windows NT based operating systems is to have documented public Win32 APIs that are internally implemented with the somewhat undocumented native API. GetTickCount was no different: it had been a documented API since the first version of Windows, and on Windows NT it simply wrapped the native API NtGetTickCount, which was originally a system call into the kernel. Therefore getting the elapsed time involved a system call. Today system calls are still expensive and optimizations are in place to avoid them; 30 years ago they were really expensive.
Today, GetTickCount is just a few instructions; there are no system calls and it doesn't call NtGetTickCount (which is implemented identically).
GetTickCount:
mov ecx,7FFE0320h ;get the interrupt count
mov rcx,qword ptr [rcx]
mov eax,dword ptr [7FFE0004h] ;get a 24bit fixed point multiplier (defined by the interrupt period)
imul rax,rcx ;multiply
shr rax,18h ;shift down to give integer milliseconds
ret
The disassembly above doesn't use any syscalls, but it does read from the hardcoded addresses 0x7FFE0320 and 0x7FFE0004. These addresses are part of what is known as kuser_shared, which was first implemented in Windows NT 3.5 and has been there, in some form, ever since. kuser_shared is a read-only memory page mapped into every process by the kernel at the fixed address 0x7FFE0000. This allows the kernel to share things like constants, counters and configuration data without relying on system calls; user mode code can simply read the data from memory at full speed, and all user mode processes see the same data at the same time. It's intended for internal use and was never officially documented until the Windows 10 DDK. Needless to say, there are a lot of elements in kuser_shared that relate to time.
It's highly recommended that production code doesn't poke around in kuser_shared; it can change between even minor versions of Windows. But it's exactly what we're going to do.
Looking back at the disassembly above we can make sense of it. 0x7FFE0320 is a 64-bit counter that the kernel increments every tick interrupt. This used to be a 32-bit counter at address 0x7FFE0000, but that location is no longer used and is zero; a perfect example of why production code shouldn't use kuser_shared directly, and why some old applications that did no longer work. 0x7FFE0004 is known as the TickCountMultiplier and is the interrupt period in units of 100ns, shifted left by 24 bits and divided by 10,000. The conversion of the raw tick count to integer milliseconds is simply a multiplication by TickCountMultiplier and a shift right by 24 bits, which is exactly what the above code does (and is pretty much exactly what 16-bit Windows did).
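To make the arithmetic concrete, here is a minimal sketch that reproduces the GetTickCount() result by reading those two kuser_shared values directly (illustration only; as noted above, production code shouldn't touch these addresses):
#include <cstdint>
#include <cstdio>
#include <windows.h>

int main()
{
    //illustration only: kuser_shared offsets are not a contract and can change
    uint64_t tick_count = *(volatile uint64_t*)0x7FFE0320; //ticks since boot
    uint32_t multiplier = *(volatile uint32_t*)0x7FFE0004; //TickCountMultiplier
    uint64_t my_ms = (tick_count * multiplier) >> 24;      //milliseconds = ticks * multiplier >> 24
    printf("computed = %llu ms, GetTickCount64 = %llu ms\n", my_ms, GetTickCount64());
}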
Today most machines have a multiplier of 0x0FA00000, which works out to a period of 156250 (100ns system units) or 15.6250ms. Even in 2024, on Windows 11, the tick interrupt period is 15.625ms and this is the resolution of GetTickCount - about the same as all the earlier NT based cousins. While the value of the tick count multiplier has been constant at 0x0FA00000 for a while, you shouldn't rely on it, and you certainly shouldn't compute the tick period directly from the multiplier at 0x7FFE0004. There are more legitimate ways to get the tick period from the kernel.
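One legitimate route, sketched below, is the documented GetSystemTimeAdjustment API; its lpTimeIncrement output reports the periodic clock update interval in 100ns units, which I'd expect to match the 156250 figure above on a default machine:
#include <cstdio>
#include <windows.h>

int main()
{
    DWORD adjustment = 0;
    DWORD increment = 0;    //interval between periodic clock updates, in 100ns units
    BOOL disabled = FALSE;
    if (GetSystemTimeAdjustment(&adjustment, &increment, &disabled))
    {
        //on most machines this prints 156250 (15.625ms)
        printf("tick period = %lu (100ns units) = %.4fms\n", increment, increment / 10000.0);
    }
}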
At this point it is worth noting that the tick count at 0x7FFE0320, which drives GetTickCount, isn't driven directly from a hardware interrupt like it used to be. Instead, the tick count is derived from the more modern kernel interval timer interrupt, which can, and does, change its period on the fly. If I remember correctly from my time on the Windows kernel, the interval timer interrupt is handled by a function called KeUpdateSystemTime; in this function the kernel updates the interrupt time and then simply decrements the remaining tick count period by the elapsed time since the last interval interrupt (in 100ns units), and when this remaining period reaches zero the tick counter at location 0x7FFE0320 is incremented - it's kind of crude. Because of this, the kernel interval timer interrupt may occur much faster than the tick period but it can never be slower, otherwise the tick counter would miss increments. On my machine the kernel interval interrupt occurs somewhere between 0.5ms and 15.625ms.
It's really important to note that the period of the kernel interval timer interrupt doesn't have to be a factor of 15.625ms (the tick count period); it can be anything between the minimum and maximum timer period. Therefore the tick counter at location 0x7FFE0320 might average out to its nominal period of 15.625ms, but it has a whole lot of jitter: if the kernel interval timer is set to 7ms, the kernel might not update the tick count at 0x7FFE0320 until 21ms have elapsed.
The various counters in kuser_shared are updated without the need for locks on multi-processor machines. The older 32-bit versions of Windows had to jump through a few hoops to ensure the 64-bit values were updated reliably. Everything in this post assumes a modern 64-bit version of Windows.
Here is some test code that shows this in real time. The code spins while it waits for the kernel interval timer interrupt time at location 0x7FFE0008 to change (this address holds the time of the last kernel interval timer interrupt in units of 100ns since boot, and is set by the kernel within the interrupt handler). The code then looks to see if the tick count has changed, and if it has, computes the elapsed time since the last update.
#include <cstdint>
#include <cstdio>

int main()
{
uint64_t old = *(volatile uint64_t*)0x7FFE0008;
uint64_t old_tick = *(volatile uint64_t*)0x7FFE0320;
uint64_t last_tick_change = old;
while (1)
{
uint64_t current = *(volatile uint64_t*)0x7FFE0008;
//has a kernel interrupt fired
if (old != current)
{
uint64_t kdelta = current - old;
uint64_t tick_count = *(volatile uint64_t*)0x7FFE0320;
if (tick_count == old_tick)
{
printf("ktimer = %lld (kdelta=%lld), tick=%lld\n", current, kdelta, tick_count);
}
else
{
printf("ktimer = %lld (kdelta=%lld), tick=%lld, tick_delta=%lld\n", current, kdelta, tick_count, current - last_tick_change);
last_tick_change = current;
old_tick = tick_count;
}
old = current;
}
}
}
Here are real results from my Windows 11 test machine; the amount of jitter in tick_delta is not good.
initially the timer was running at about 1ms intervals
you can see the tick delta from the previous update is 15.9ms
This 1ms interval continues until it switches to 10ms
ktimer = 4946927507603 (kdelta=9994), tick=31660336, tick_delta=159942
ktimer = 4946927517606 (kdelta=10003), tick=31660336
ktimer = 4946927527603 (kdelta=9997), tick=31660336
ktimer = 4946927537604 (kdelta=10001), tick=31660336
ktimer = 4946927547608 (kdelta=10004), tick=31660336
ktimer = 4946927557606 (kdelta=9998), tick=31660336
ktimer = 4946927567603 (kdelta=9997), tick=31660336
ktimer = 4946927577604 (kdelta=10001), tick=31660336
ktimer = 4946927587604 (kdelta=10000), tick=31660336
ktimer = 4946927597603 (kdelta=9999), tick=31660336
ktimer = 4946927607618 (kdelta=10015), tick=31660336
ktimer = 4946927617607 (kdelta=9989), tick=31660336
Here is where the interrupt interval switches to 10ms, and because of this
the next tick update doesn't occur for 21ms.
With a 10ms interval the best the 15.625ms tick can do is bounce between 10 and 20ms
ktimer = 4946927718095 (kdelta=100488), tick=31660337, tick_delta=210492
ktimer = 4946927818092 (kdelta=99997), tick=31660338, tick_delta=99997
ktimer = 4946927918095 (kdelta=100003), tick=31660338
ktimer = 4946928018092 (kdelta=99997), tick=31660339, tick_delta=200000
ktimer = 4946928118084 (kdelta=99992), tick=31660339
ktimer = 4946928218105 (kdelta=100021), tick=31660340, tick_delta=200013
ktimer = 4946928318081 (kdelta=99976), tick=31660341, tick_delta=99976
ktimer = 4946928418096 (kdelta=100015), tick=31660341
ktimer = 4946928518105 (kdelta=100009), tick=31660342, tick_delta=200024
ktimer = 4946928618088 (kdelta=99983), tick=31660343, tick_delta=99983
ktimer = 4946928718076 (kdelta=99988), tick=31660343
ktimer = 4946928818082 (kdelta=100006), tick=31660344, tick_delta=199994
from here the interval is quite random for a while
ktimer = 4946928848076 (kdelta=29994), tick=31660344
ktimer = 4946928858169 (kdelta=10093), tick=31660344
ktimer = 4946928867110 (kdelta=8941), tick=31660344
ktimer = 4946928868757 (kdelta=1647), tick=31660344
ktimer = 4946928869416 (kdelta=659), tick=31660344
ktimer = 4946928947830 (kdelta=78414), tick=31660345, tick_delta=129748
ktimer = 4946928948959 (kdelta=1129), tick=31660345
ktimer = 4946928958946 (kdelta=9987), tick=31660345
ktimer = 4946928967940 (kdelta=8994), tick=31660345
ktimer = 4946929063042 (kdelta=95102), tick=31660346, tick_delta=115212
You don't have to compute the elapsed time between two timer interrupts to work out the current speed of the kernel interval timer; we can use the native API function NtQueryTimerResolution. This fills out the minimum resolution (slowest timer rate), the maximum resolution (fastest timer rate), and the current resolution at the time it was called (all in 100ns system time units). The minimum resolution of the kernel interval timer will always be the tick count interval, which ensures the tick counter always increments monotonically and never misses a value.
There are no easily accessible headers for the native API. You have to declare the functions yourself and either link with the static library ntdll.lib, or use GetProcAddress on ntdll.dll to resolve them dynamically.
extern "C" NTSYSAPI NTSTATUS NTAPI NtQueryTimerResolution(
OUT PULONG MinimumResolution,
OUT PULONG MaximumResolution,
OUT PULONG CurrentResolution);
//get the min, max and current resolution
ULONG minRes;
ULONG maxRes;
ULONG currentRes;
NtQueryTimerResolution(&minRes, &maxRes, &currentRes);
On my machine NtQueryTimerResolution reports a minimum period (maximum resolution) of 5000, or 0.5ms, and a maximum period (minimum resolution) of 156250, or 15.625ms. Calling it in a loop and displaying the current resolution shows the kernel timer changing resolution on the fly; a minimal version of that loop is sketched below, followed by the output on my machine.
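A sketch of that loop, assuming the declaration above and ntdll.lib (or GetProcAddress) for linkage:
#include <cstdio>
#include <windows.h>
#include <winternl.h> //for NTSTATUS

extern "C" NTSYSAPI NTSTATUS NTAPI NtQueryTimerResolution(
    PULONG MinimumResolution,
    PULONG MaximumResolution,
    PULONG CurrentResolution);

int main()
{
    while (1)
    {
        ULONG minRes, maxRes, currentRes;
        NtQueryTimerResolution(&minRes, &maxRes, &currentRes);
        printf("kernel timer resolution = %lu\n", currentRes);
        Sleep(1000); //sample once a second
    }
}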
kernel timer resolution = 156250
kernel timer resolution = 100000
kernel timer resolution = 156250
kernel timer resolution = 10000
kernel timer resolution = 10000
kernel timer resolution = 10000
kernel timer resolution = 156250
The reason the kernel timer resolution changes is that it's a system-wide setting. The default value is the minimum resolution (15.625ms), but it changes as other processes ask for a different resolution; the final resolution will always be the smallest interval any process has asked for, which makes the observed value effectively unpredictable because it depends entirely on the processes that are running and what they are currently doing. In both of the above logs it is easy to see that periodically some process is asking for 1ms resolution, which is common due to how the newer multimedia timers work, but in order to satisfy such a request, everybody gets 1ms resolution.
The timer resolution can be set directly using the native API NtSetTimerResolution:
extern "C" NTSYSAPI NTSTATUS NTAPI NtSetTimerResolution(ULONG DesiredResolution, BOOLEAN SetResolution, PULONG CurrentResolution);
//currentRes will be set to the current resolution, it may not be what you asked for if some other process has asked for a higher resolution.
ULONG currentRes;
NtSetTimerResolution(5000, TRUE, &currentRes);
The resolution has to be between the min and max reported by NtQueryTimerResolution, and it returns the resolution actually set, which may not be what you asked for because some other process may have already asked for a higher resolution. If we always set the maximum resolution, the timer rate should be constant because no other process will be able to change it. Adding a call to NtSetTimerResolution and running the same log again gives much cleaner results: the kernel timer fires every 5000 time units (0.5ms) and the jitter on the tick counter is much more consistent.
ktimer = 4948512812590 (kdelta=5001), tick=31670482, tick_delta=155002
ktimer = 4948512817589 (kdelta=4999), tick=31670482
ktimer = 4948512822590 (kdelta=5001), tick=31670482
ktimer = 4948512827590 (kdelta=5000), tick=31670482
ktimer = 4948512832590 (kdelta=5000), tick=31670482
ktimer = 4948512837589 (kdelta=4999), tick=31670482
ktimer = 4948512842590 (kdelta=5001), tick=31670482
ktimer = 4948512847589 (kdelta=4999), tick=31670482
ktimer = 4948512852589 (kdelta=5000), tick=31670482
ktimer = 4948512857593 (kdelta=5004), tick=31670482
ktimer = 4948512862590 (kdelta=4997), tick=31670482
ktimer = 4948512867595 (kdelta=5005), tick=31670482
ktimer = 4948512872589 (kdelta=4994), tick=31670482
ktimer = 4948512877588 (kdelta=4999), tick=31670482
ktimer = 4948512882589 (kdelta=5001), tick=31670482
ktimer = 4948512887589 (kdelta=5000), tick=31670482
ktimer = 4948512892589 (kdelta=5000), tick=31670482
ktimer = 4948512897589 (kdelta=5000), tick=31670482
ktimer = 4948512902600 (kdelta=5011), tick=31670482
ktimer = 4948512907592 (kdelta=4992), tick=31670482
ktimer = 4948512912590 (kdelta=4998), tick=31670482
ktimer = 4948512917606 (kdelta=5016), tick=31670482
ktimer = 4948512922597 (kdelta=4991), tick=31670482
ktimer = 4948512927605 (kdelta=5008), tick=31670482
ktimer = 4948512932591 (kdelta=4986), tick=31670482
ktimer = 4948512937591 (kdelta=5000), tick=31670482
ktimer = 4948512942592 (kdelta=5001), tick=31670482
ktimer = 4948512947589 (kdelta=4997), tick=31670482
ktimer = 4948512952589 (kdelta=5000), tick=31670482
ktimer = 4948512957616 (kdelta=5027), tick=31670482
ktimer = 4948512962590 (kdelta=4974), tick=31670482
ktimer = 4948512967589 (kdelta=4999), tick=31670482
ktimer = 4948512972589 (kdelta=5000), tick=31670483, tick_delta=159999
Unfortunately the Windows kernel makes no effort to minimize timer jitter; it doesn't try to compute a common factor of the requested periods, it simply uses the minimum period. If you need low jitter, the only option is to set the kernel timer to the maximum available resolution. The downside is that 0.5ms intervals generate 2000 kernel interrupts per second; while that sounds bad, it doesn't have any noticeable effect on performance. Microsoft should just use a fixed interval and be done with it.
In the above code another useful value from kuser_shared was used: 0x7FFE0008, the last kernel interrupt time in 100ns units. Reading the last interrupt time is a much better way to measure elapsed time than GetTickCount(); it's higher resolution and more accurate. One way to get the last kernel timer interrupt time is to read 0x7FFE0008 directly, but the best way is to use QueryInterruptTime(), which does exactly the same thing:
QueryInterruptTime:
mov eax,7FFE0008h
mov rax,qword ptr [rax]
mov qword ptr [rcx],rax
ret
Windows 95 added the timeGetTime API as part of the multimedia APIs; the same APIs first showed up on NT based operating systems in Windows 2000. These new APIs were still only accurate to integer milliseconds, but apps could now get 1ms resolution too. They still weren't great for accurately measuring elapsed time or scheduling video frames, but given the lack of better options both games and multimedia apps used them extensively, and even today these APIs are still used everywhere. Applications use timeBeginPeriod to set the resolution of the timer; as you can probably guess, this is just a wrapper for NtSetTimerResolution with the input quantized to whole milliseconds, with 1ms being the minimum. This is why 1ms shows up quite often in the above timer resolution log.
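For reference, the classic multimedia-timer pattern looks roughly like this (a sketch; winmm.lib needs to be linked and timeBeginPeriod should always be paired with timeEndPeriod):
#include <cstdio>
#include <windows.h>
#include <mmsystem.h>
#pragma comment(lib, "winmm.lib")

int main()
{
    timeBeginPeriod(1);          //ask for 1ms resolution (wraps NtSetTimerResolution)
    DWORD start = timeGetTime(); //integer milliseconds since the system started
    Sleep(5);
    printf("elapsed = %lums\n", timeGetTime() - start);
    timeEndPeriod(1);            //undo the resolution request
}
Under the hood, timeGetTime disassembles to this: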
timeGetTime:
sub rsp,28h
xor r9d,r9d
lea rdx,[TimeInit]
xor r8d,r8d
lea rcx,[g_TimeInitOnce]
call qword ptr [__imp_InitOnceExecuteOnce] ;Only run the init code once, from any thread within a given process
nop dword ptr [rax+rax]
mov ecx,7FFE0008h ;last interrupt time
mov rax,346DC5D63886594Bh ;fixed point multiply by 1/10000 (100ns units to ms)
mov rcx,qword ptr [rcx]
sub rcx,qword ptr [TimerData] ;subtract raw init time from TimerData
imul rcx
sar rdx,0Bh
mov rax,rdx
shr rax,3Fh
add rax,rdx
add eax,dword ptr [TimerData+8h] ;add on the init time in ms from TimerData+8
add rsp,28h
ret
timeGetTime just returns a scaled and offset version of the 64-bit interrupt time. The multimedia timers are significantly better than GetTickCount, but there is no point in using them in new code: calling QueryInterruptTime has far less CPU overhead and gives you 100ns units (though you are still at the whim of the kernel timer interrupt interval if you don't set it yourself).
Using QueryInterruptTime
after setting the timer resolution to 0.5ms:
#include <cstdint>
#include <cstdio>
#include <windows.h>
#include <realtimeapiset.h>

int main()
{
uint64_t old_tick = 0;
while (1)
{
uint64_t tick;
QueryInterruptTime(&tick);
if (tick != old_tick)
{
double delta = (double)(tick - old_tick) / 10000.0;
printf("delta = %0.6f\n", delta);
old_tick = tick;
}
}
}
The result is quite stable; the interrupt time is within a few microseconds of the expected time, but half a millisecond is the smallest interval that can be measured, and this still can't accurately measure a 16.6ms frame time.
delta = 0.500100
delta = 0.502400
delta = 0.497900
delta = 0.499500
delta = 0.500200
delta = 0.500200
delta = 0.499800
delta = 0.500400
delta = 0.499600
delta = 0.500100
delta = 0.499900
delta = 0.499900
delta = 0.500500
Windows also provides an API called GetSystemTimeAsFileTime which returns the current system time as a file time. Within Windows, system time and file time are both in units of 100ns; system time is the elapsed time since boot, while file time is the elapsed time since the epoch (Jan 1st 1601). This API is also driven by the same timer interrupt, but instead of reading 0x7FFE0008, which is the system timestamp, it reads 0x7FFE0014, which is the file time timestamp of the last timer interrupt.
; This API splits the 64bit result in to a pair of 32bit values instead of directly returning a 64bit value.
GetSystemTimeAsFileTime:
mov eax,7FFE0014h
mov rax,qword ptr [rax]
mov dword ptr [rcx],eax
shr rax,20h
mov dword ptr [rcx+4],eax
ret
There is a precise version of this API called GetSystemTimePreciseAsFileTime, which is a wrapper for the native function RtlGetSystemTimePrecise. Internally it uses the native function RtlQueryPerformanceCounter, along with the last interrupt time and some other magic values within kuser_shared; ultimately it returns the file time of when the API was actually called.
This function uses address 0x7FFE0340, which is an incrementing count of kernel timer interrupts. The counter is shifted left one bit so it appears to increment by 2, and the bottom bit is a lock bit. It's very difficult to catch the lock bit set, but I believe it is set while the kernel interrupt handler executes, to indicate the timer values are changing. There are a lot of timer related values within kuser_shared and they can't all be updated atomically, especially from a multi-processor point of view. You'll see this value being used in the following code:
RtlGetSystemTimePrecise:
mov qword ptr [rsp+10h],rbx
mov qword ptr [rsp+18h],rbp
mov qword ptr [rsp+20h],rsi
push rdi
push r12
push r13
push r14
push r15
sub rsp,20h
mov edi,7FFE0358h
lea r12d,[rdi-18h]
lea r13d,[rdi-10h]
retry2:
mov eax,7FFE0368h
mov ecx,7FFE0014h
retry1:
mov rbx,qword ptr [r12] ;0x340 - kernel timer counter + lock
test bl,1
jne pause1 ;if the lock in the bottom bit of 0x340 is set, pause and try again (aka locked)
mov r14,qword ptr [rcx] ;load system time for last interrupt
lea rcx,[rsp+50h] ;result location of RtlQueryPerformanceCounter
mov rbp,qword ptr [r13] ;baseline qpc interrupt time
mov r15,qword ptr [rdi] ;qpc system time increment
mov sil,byte ptr [rax] ;time shift
call RtlQueryPerformanceCounter ;get the qpc time stamp
nop
mov rax,qword ptr [r12] ;reload the interrupt counter/lock from 0x340
cmp rax,rbx
jne pause2 ;if a timer interrupt has occurred or the lock has changed, spin and try again
mov rdx,qword ptr [rsp+50h] ;result of RtlQueryPerformanceCounter
mov edi,0 ;delta time from last interrupt
cmp rdx,rbp ;current qpc vs qpc interrupt time
jbe comp_result ;qpc is below or equal (shouldn't happen), delta is zero, return the last interrupt
sub rdx,rbp ;qpc delta since last interrupt
dec rdx
test sil,sil
je skip_shift ;if shift is zero, skip shifting
mov cl,sil
shl rdx,cl ;shift qpc delta by the shift amount
skip_shift:
mov rax,r15 ;qpc system time increment
mul rax,rdx ;multiply the time delta by the time increment to compute 100ns
mov rdi,rdx
mov rbx,qword ptr [rsp+58h]
comp_result:
lea rax,[r14+rdi] ;add 100ns unit delta time to the baseline interrupt time in 100ns
mov rbp,qword ptr [rsp+60h]
mov rsi,qword ptr [rsp+68h]
add rsp,20h
pop r15
pop r14
pop r13
pop r12
pop rdi
ret
int 3
pause1:
pause
jmp retry1
pause2:
pause
jmp retry2
This is the first API that takes us away from times based on the kernel timer interrupt to a "now" time of when the API was actually called; it's the first API to eliminate the kernel timer jitter and the first that can be correlated to wall time. The implementation of RtlGetSystemTimePrecise jumps through some hoops because it always returns time in 100ns units, and although it uses RtlQueryPerformanceCounter it has to apply a scale and offset because RtlQueryPerformanceCounter doesn't have a fixed frequency.
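A quick sketch showing the difference between the coarse and precise variants (assumes Windows 8 or later for GetSystemTimePreciseAsFileTime):
#include <cstdint>
#include <cstdio>
#include <windows.h>

static uint64_t ToUint64(const FILETIME& ft)
{
    ULARGE_INTEGER v;
    v.LowPart = ft.dwLowDateTime;
    v.HighPart = ft.dwHighDateTime;
    return v.QuadPart; //100ns units since Jan 1st 1601
}

int main()
{
    FILETIME coarse, precise;
    GetSystemTimeAsFileTime(&coarse);         //file time of the last timer interrupt
    GetSystemTimePreciseAsFileTime(&precise); //interpolated to "now" using the performance counter
    printf("coarse  = %llu\n", ToUint64(coarse));
    printf("precise = %llu\n", ToUint64(precise));
    printf("difference = %.4fms\n", (double)(ToUint64(precise) - ToUint64(coarse)) / 10000.0);
}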
Following the typical NT native API pattern, the user mode API QueryPerformanceCounter and RtlQueryPerformanceCounter are the same thing.
Windows NT system and file times have always been in 100ns units, so it would make sense for QueryPerformanceCounter to return the same 100ns units, but for some reason this took until Windows 10 build 1607 (the Anniversary Update in 2016). Today it is mostly true that the frequency of QueryPerformanceCounter is fixed at 10MHz, but it's not guaranteed; make assumptions at your own risk. Over the years QueryPerformanceCounter has used a range of frequencies depending on the source timer, but it will always have a resolution better than 1us.
The frequency of QueryPerformanceCounter can be obtained via QueryPerformanceFrequency, and all this does is load and return the 64-bit value at 0x7FFE0300, which is set by the kernel once at boot and never changes at runtime.
QueryPerformanceFrequency:
mov rax,qword ptr [7FFE0300h]
mov qword ptr [rcx],rax
mov eax,1
ret
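Putting the counter and frequency together, the portable pattern is to always divide by the reported frequency rather than assuming 10MHz; a sketch:
#include <cstdio>
#include <windows.h>

int main()
{
    LARGE_INTEGER freq, start, end;
    QueryPerformanceFrequency(&freq); //counts per second, don't assume 10,000,000
    QueryPerformanceCounter(&start);
    Sleep(16);                        //stand-in for a frame of work
    QueryPerformanceCounter(&end);
    double ms = (double)(end.QuadPart - start.QuadPart) * 1000.0 / (double)freq.QuadPart;
    printf("elapsed = %.4fms (qpc frequency = %lld)\n", ms, freq.QuadPart);
}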
QueryPerformanceCounter and RtlQueryPerformanceCounter have a native equivalent called NtQueryPerformanceCounter. The big difference is that NtQueryPerformanceCounter is always a syscall to the kernel, even though it computes the exact same result.
QueryPerformanceCounter:
mov qword ptr [rsp+8],rbx
push rdi
sub rsp,20h
mov eax,7FFE03C6h ;qpc flags address
mov rbx,rcx
movzx r9d,byte ptr [rax] ;load qpc flags
test r9b,1 ;check bypass syscall
je use_syscall ;NtQueryPerformanceCounter (syscall 31h)
mov r11d,7FFE03B8h ;qpc bias address
mov r11,qword ptr [r11] ;load qpc bias
test r9b,2 ;check qpc flags for using hypervisor page
je no_hyperv ;no hypervisor page
mov r8,qword ptr [7FFBB84332A8h] ;RtlpHypervisorSharedUserVa = 000000007FFE1000
test r8,r8 ;is the hypervisor page null??
je use_syscall ;if null, use syscall
loop_hyperv:
mov r10d,dword ptr [r8] ;load cookie from RtlpHypervisorSharedUserVa+0
test r10d,r10d ;is the cookie zero?
je use_syscall ;if cookie is zero use syscall
test r9b,r9b ;is rdtscp available?
jns use_rdtsc_hyperv ;if not, use rdtsc with some combo of lfence/mfence
rdtscp
rdtsc_result_hyperv:
shl rdx,20h
or rdx,rax ;make 64bit result from rdtsc
mov rax,qword ptr [r8+8] ;load scale at RtlpHypervisorSharedUserVa+8
mov rcx,qword ptr [r8+10h] ;load offset at RtlpHypervisorSharedUserVa+16
mul rax,rdx ;multiply by scale
mov eax,dword ptr [r8] ;reload cookie from RtlpHypervisorSharedUserVa+0
add rdx,rcx ;add the offset
cmp eax,r10d
jne loop_hyperv ;do again if the hypervisor cookie has changed, maybe the scale/offset changed.
compute_result:
mov cl,byte ptr [7FFE03C7h] ;load qpc shift
lea rax,[rdx+r11] ;add on qpc bias
shr rax,cl ;shift the result right
mov qword ptr [rbx],rax ;store result
mov eax,1 ;return true
mov rbx,qword ptr [rsp+30h]
add rsp,20h
pop rdi
ret
;main code path if the hypervisor isn't being used
no_hyperv:
test r9b,r9b ;is rdtscp available
jns use_rdtsc ;if not, use rdtsc
rdtscp
rdtsc_result: ;if using rdtsc, it returns here
shl rdx,20h
or rdx,rax ;make 64bit value
jmp compute_result ;compute final result and return
;this is the hypervisor rdtsc path, identical to below, we just check sync flags prior to rdtsc
use_rdtsc_hyperv:
test r9b,20h ;use lfence?
je no_lfence_hyperv
lfence
jmp read_tsc_hyperv
no_lfence_hyperv:
test r9b,10h ;use mfence?
je read_tsc_hyperv
mfence
read_tsc_hyperv:
rdtsc
jmp rdtsc_result_hyperv
;this is the non-hypervisor rdtsc path, we might need a sync instruction before the rdtsc
use_rdtsc:
test r9b,20h ;use lfence for sync
je skip_lfence
lfence
jmp read_tsc
skip_lfence:
test r9b,10h ;use mfence for sync
je read_tsc
mfence
read_tsc:
rdtsc
jmp rdtsc_result
use_syscall:
....
Dissecting the above code, there are a few different paths but ultimately only two modes of operation: it's either a syscall or some derivative of a TSC read with either RDTSC or RDTSCP. The difference between these two almost identical instructions is that the original RDTSC is speculative; it doesn't necessarily wait for all previous instructions to execute before reading the counter, and similarly, subsequent instructions may begin executing before the counter is read. This can cause problems if the programmer expects the timestamp to be taken at the exact location of the instruction. Making this instruction behave requires some form of synchronization instruction:
- If software requires RDTSC to be executed only after all previous instructions have executed on Intel platforms, it can execute LFENCE immediately before RDTSC.
- If software requires RDTSC to be executed only after all previous instructions have executed on AMD platforms, it can execute MFENCE immediately before RDTSC. Technically this is slower than the Intel solution as it waits until all previous loads and stores are globally visible. Executing LFENCE on AMD hardware doesn't synchronize as intended.
- If software requires RDTSC to be executed prior to execution of any subsequent instruction, it can execute LFENCE immediately after RDTSC.
RDTSCP [available when CPUID.80000001h:EDX[27] is set] is a non-speculative version of RDTSC, equivalent to the first two items above: it waits for all prior instructions to execute before reading the counter, and it works the same on AMD and Intel hardware. It doesn't prevent subsequent instructions from starting to execute, which software can prevent by executing LFENCE after RDTSCP (item 3 above). A feature of this newer instruction, unrelated to timers, is that the model specific register IA32_TSC_AUX is returned in ECX; the Windows scheduler puts the processor ID in this register, which software can use to determine if a thread is hopping processors (Linux does something similar, but it is entirely up to the operating system to set this model specific register).
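For completeness, here is a sketch of reading the TSC directly from user mode with the MSVC intrinsics (it assumes an invariant TSC and a processor with RDTSCP support):
#include <cstdint>
#include <cstdio>
#include <intrin.h>

int main()
{
    unsigned int aux0, aux1;
    uint64_t t0 = __rdtscp(&aux0); //waits for prior instructions, returns IA32_TSC_AUX (processor ID on Windows)
    //...code being timed...
    uint64_t t1 = __rdtscp(&aux1);
    _mm_lfence();                  //stop later instructions starting before the read (item 3 above)
    printf("tsc delta = %llu cycles\n", t1 - t0);
    if (aux0 != aux1)
        printf("warning: the thread moved processors (%u -> %u)\n", aux0, aux1);
}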
Going back to QueryPerformanceCounter, you can see all of the above options being used. The code flow is controlled by the QPC flags at 0x7FFE03C6; the DDK names the various flags as follows:
SHARED_GLOBAL_FLAGS_QPC_BYPASS_ENABLED = 0x01
Set if QPC should bypass the syscall and use the timestamp instructions; this will only be set if the timestamp is invariant. This flag first appeared in Windows 8.1. If this bit is clear, QPC just calls NtQueryPerformanceCounter, which is always a syscall; this allows motherboard timers to be used, which are only accessible from kernel mode. If the bypass is not enabled then QueryPerformanceCounter is always a system call with significant CPU overhead, making it somewhat unsuitable for high frequency operations as a single call can take a few microseconds to execute (many thousands of clock cycles).
SHARED_GLOBAL_FLAGS_QPC_BYPASS_USE_HV_PAGE = 0x02
If set, QPC uses the hypervisor page. This memory page showed up in Windows 10 v1607 and is mapped into all processes in much the same way as kuser_shared. The page is used to apply a scale and offset to normalize the frequency to 10MHz (100ns units). This normalization can only occur if the hypervisor page is present, so earlier versions of Windows will not return a fixed 10MHz.
SHARED_GLOBAL_FLAGS_QPC_BYPASS_USE_MFENCE = 0x10
Use MFENCE for synchronization; not checked if RDTSCP is used, only used on AMD processors.
SHARED_GLOBAL_FLAGS_QPC_BYPASS_USE_LFENCE = 0x20
Use LFENCE for synchronization; not checked if RDTSCP is used, only used on Intel processors.
SHARED_GLOBAL_FLAGS_QPC_BYPASS_USE_RDTSCP = 0x80
If this is set, RDTSCP is available and will be used in place of RDTSC plus a synchronization instruction.
The syscall is avoided if at all possible, and the code is optimized for using RDTSCP with the hypervisor page to normalize to 10MHz; this is the path with no branches and the expected path on a typical Windows 10 or later machine. The possible paths are:
- User mode TSC via RDTSC or RDTSCP with hypervisor normalization to 10MHz. CPU cost is roughly 100 cycles.
- User mode TSC via RDTSC or RDTSCP with a shift. On the test machine the shift is set to 10, giving a frequency of 3.613000MHz (approx 3.7GHz / 1024). CPU cost is roughly 100 cycles.
- System call. If this path is being used it will most likely be the HPET with a frequency of 14.318180MHz; it could also be the PM timer at 3.579545MHz (exactly 4x slower than the HPET), but that's unlikely. CPU cost for the syscall path is thousands of cycles.
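A hedged sketch that peeks at those flag bits using the DDK names above (again, kuser_shared offsets such as 0x7FFE03C6 are not a contract and could move):
#include <cstdint>
#include <cstdio>

int main()
{
    //illustration only - production code should not read kuser_shared directly
    uint8_t flags = *(volatile uint8_t*)0x7FFE03C6;
    printf("QPC_BYPASS_ENABLED     : %d\n", (flags & 0x01) != 0);
    printf("QPC_BYPASS_USE_HV_PAGE : %d\n", (flags & 0x02) != 0);
    printf("QPC_BYPASS_USE_MFENCE  : %d\n", (flags & 0x10) != 0);
    printf("QPC_BYPASS_USE_LFENCE  : %d\n", (flags & 0x20) != 0);
    printf("QPC_BYPASS_USE_RDTSCP  : %d\n", (flags & 0x80) != 0);
}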
Windows 10/11 can be forced to use the HPET counter from an administrator terminal with bcdedit /set useplatformclock true. While this can be useful for testing, I recommend you don't leave this boot variable set because it can have some nasty side effects; use bcdedit /deletevalue useplatformclock to delete the variable and go back to the default. You need to reboot after changing this setting. I highly recommend you test with this, especially if you are making a high performance media app or game; there are machines out there that use this path and it has caused numerous problems for a handful of commercial games. It also changes the QPC frequency, so it's no longer the normalized 10MHz.
What is this hypervisor page? Unfortunately, there isn't much information on it. It first appeared in the Anniversary Update of Windows 10 (version 1607) and it seems to only be used by QueryPerformanceCounter. Its address isn't fixed, but on all the machines tested it always shows up very close to kuser_shared; on this test machine it is directly after the kernel page, at 0x7ffe1000. You can query the address in the local process by using the native API NtQuerySystemInformation with query class ID 0xc5.
extern "C" NTSYSAPI NTSTATUS NTAPI NtQuerySystemInformation(
ULONG SystemInformationClass,
PVOID SystemInformation,
ULONG SystemInformationLength,
PULONG ReturnLength
);
void* GetHypervisorPageAddress()
{
void* system_info;
ULONG res_size;
NTSTATUS result = NtQuerySystemInformation(0xC5, &system_info, sizeof(system_info), &res_size);
if ((result >= 0) && (res_size == sizeof(system_info)))
{
return system_info;
}
return 0;
}
Now we can write some code to dump out the known data in the hypervisor page, and compute the TSC frequency as seen by the operating system.
#include <cstdint>
#include <cstdio>
#include <intrin.h> //for _div128

int main()
{
//known elements of the hypervisor page
struct HypervisorPage
{
uint32_t cookie;
uint32_t unused; //maybe cookie is 64bits but it's currently loaded as 32bits
uint64_t rdtsc_scale;
uint64_t rdtsc_offset;
};
//use the above function
HypervisorPage* hv = (HypervisorPage*)GetHypervisorPageAddress();
if (hv)
{
printf("cookie: %x\n", hv->cookie);
printf("rdtsc_cale: %llx\n", hv->rdtsc_scale);
printf("rdtsc_offset: %llx\n", hv->rdtsc_offset);
//compute the tsc speed as an integer with 32bits of fraction
int64_t rem;
int64_t tsc_int = _div128((1ULL << 32), 0, hv->rdtsc_scale, &rem);
//convert to a double
double tsc = (double)tsc_int / (1ULL << 32);
printf("tsc clock speed = %.6f", tsc*10.0); //*10 for Mhz
//this is hacky but surprisingly accurate (only because rdtsc_scale fits in 53bits)
//double f = (double)0xffffffffffffffffULL / (double)hv->rdtsc_scale;
}
}
We can see from the disassembly of QueryPerformanceCounter that if the hypervisor cookie is zero the syscall is used. On Windows 11 the cookie starts at 0x1 and I assume it increments if the hypervisor makes changes to the scale or offset; I've never seen this happen, but it makes sense that the hypervisor could change these values and QueryPerformanceCounter is written to handle it (in Windows 10 the cookie used to be the value 'hAlt'). On my test machine rdtsc_scale is slightly different on every reboot, so it's probably computed at startup; for example, on the first boot it was 0x00b11b8333a4a9e5, which as 64-bit fixed point is 1/370.035209, and on the second boot 0x00b11b77df38e17c, or 1/370.0355705. Both are very close to the divisor required to get 100ns units from a processor with a base speed of 3.7GHz.
Another time related use for NtQuerySystemInformation is to determine whether QueryPerformanceCounter is going to use a syscall or stay in user mode. You could figure this out yourself, but the official method (albeit somewhat undocumented) is more likely to remain correct in the future. Knowing a syscall is going to be used might change how you use QueryPerformanceCounter.
bool DoesQPCBypassKernel()
{
struct SYSTEM_QUERY_PERFORMANCE_COUNTER_INFORMATION
{
ULONG Version;
//0x00000001 = kernelqpc
ULONG Flags;
ULONG ValidFlags;
};
SYSTEM_QUERY_PERFORMANCE_COUNTER_INFORMATION qpc_info;
ULONG res_size;
qpc_info.Version = 1; //must set the version on input (only version=1 is defined)
NTSTATUS result = NtQuerySystemInformation(0x7c, &qpc_info, sizeof(qpc_info), &res_size);
if ((result >= 0) && (res_size == sizeof(qpc_info)))
{
if (qpc_info.ValidFlags & 0x1) //is kernel flag valid
{
return (qpc_info.Flags & 0x1) == 0; //if valid it needs to be zero to bypass
}
}
//this is not actually false, but unknown
return false;
}
Have we come full circle? Why don't we use RDTSC for timestamps? If QueryPerformanceCounter is just a wrapper for RDTSC/RDTSCP, then why not avoid all the overhead and execute the instructions directly? Is Microsoft correct in saying everybody should use QueryPerformanceCounter? Well, it depends.
If CPUID.80000007h:EDX[8] indicates invariant timestamp support then there is little harm in making your own ultra low overhead timers using RDTSC/RDTSCP. RDTSC has the lowest overhead, and I've never seen the lack of synchronization cause problems outside of profiling short code sequences. If you do things yourself you have to figure out the base clock frequency, something that is given to you if you use QueryPerformanceCounter. While I don't see the timestamp instructions becoming unusable again, there is always a risk of some processor changing the invariant timer in ways software doesn't expect; nobody expected RDTSC to cause as much trouble as it did when it was first used 30 years ago.
On a modern machine I don't see any problem with using QueryPerformanceCounter, especially if it's bypassing the syscall; it's plenty fast enough for anything other than maybe micro-profiling short sequences. QueryPerformanceCounter is pretty future proof when used properly, and the OS handles the edge cases and platform differences. For general time keeping, no matter what the QueryPerformanceCounter source may be, or even if it's using a syscall, the overhead is insignificant. I am starting to see numerous posts online where the frequency is assumed to be a fixed 10MHz - don't do this! Today, I would only use RDTSC for careful micro profiling, or when thousands and thousands of timestamps are needed and even the overhead of the API calls is too much.
If you make your own timers and need the TSC frequency, I recommend not using the hypervisor-page code above outside of a development environment; it will cause you problems in the future and your app will be the one crashing in 10 years. It would be nice if Windows just had an API to return the TSC frequency, but it doesn't. On Intel processors CPUID 16h can be used to get the base speed, but if you want to compute the TSC frequency you can do a one-time calibration by comparing TSC readings against a known system clock, or even against QueryPerformanceCounter itself; during development you can check your calibration code against the hypervisor frequency to see if it's reliable. If you use a good reference timer and throw out any outliers, this type of calibration can be very accurate; it's future proof, works on Intel and AMD, and it accounts for things like virtualization and overclocking. Never calibrate against GetTickCount, it'll never be accurate with the amount of jitter we saw earlier.
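As a sketch of that calibration idea, here is a one-shot measurement of TSC ticks against QueryPerformanceCounter over a short window; a real implementation would repeat the measurement and reject outliers:
#include <cstdint>
#include <cstdio>
#include <intrin.h>
#include <windows.h>

int main()
{
    LARGE_INTEGER freq, q0, q1;
    QueryPerformanceFrequency(&freq);
    unsigned int aux;
    QueryPerformanceCounter(&q0);
    uint64_t t0 = __rdtscp(&aux);
    Sleep(100);                   //calibration window
    QueryPerformanceCounter(&q1);
    uint64_t t1 = __rdtscp(&aux);
    double seconds = (double)(q1.QuadPart - q0.QuadPart) / (double)freq.QuadPart;
    printf("estimated TSC frequency = %.3fMHz\n", (double)(t1 - t0) / seconds / 1e6);
}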
Finally, the invariant TSC might appear to have the same counter frequency as the base processor speed, but it certainly doesn't increment at that speed. Typically the invariant TSC runs on a much slower, stable, always running reference clock, say 25MHz, 50MHz or 100MHz. The counter increments at this reference speed, adding whatever multiple is required to match the base speed. On my test machine the amount added is 148 to give the impression of a 3.7GHz counter, from which we can calculate a reference clock of 25MHz. On Intel processors this increment rate is available from CPUID 15h. However, things are not that simple, because RDTSC always returns a value that is monotonically incrementing, so it won't return the same value twice if you call it back to back. How this is implemented with respect to the reference clock, and what is actually returned, is model specific; for example some Intel processors only return even numbers. While it's safe to assume the invariant TSC is accurate against wall time over long periods, you probably don't want to use it to time ultra small sections of code unless you know exactly how RDTSC is implemented on your particular processor.
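A sketch of querying those CPUID leaves on Intel hardware (leaf 15h returns the TSC/crystal ratio and, when enumerated, the crystal frequency; leaf 16h returns the base frequency in MHz; older parts may report zeros, so treat the results as hints):
#include <cstdio>
#include <intrin.h>

int main()
{
    int regs[4];
    __cpuid(regs, 0x15); //EAX=denominator, EBX=numerator, ECX=crystal clock in Hz (0 if not reported)
    unsigned denom = (unsigned)regs[0], numer = (unsigned)regs[1], crystal = (unsigned)regs[2];
    if (denom && numer && crystal)
        printf("TSC = %.3fMHz (crystal %.3fMHz * %u/%u)\n",
               (double)crystal * numer / denom / 1e6, crystal / 1e6, numer, denom);
    __cpuid(regs, 0x16); //EAX=base MHz, EBX=max MHz, ECX=bus MHz
    printf("base = %dMHz, max = %dMHz\n", regs[0] & 0xFFFF, regs[1] & 0xFFFF);
}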