Under the hood of Windows timestamps
Programmers have always needed reliable timestamps and programming on Windows is no different. But getting accurate timestamps efficiently from user mode, a seemingly simple task that every operating system supports, seems to have at best made Microsoft jump through hoops for 30 years, and at worst eluded them completely.
Compared to Linux, where the clock_gettime API will return all sorts of timestamps at a standardized nanosecond resolution, Windows is not so simple, and this leads to assumptions and a quagmire of edge cases. Sure, Windows has a complicated legacy and there are lots of backwards compatibility issues that still dictate how time works to this day, but why is it that even in Windows 11 there is no time API with standard units? Why do we always have to convert from some arbitrary frequency, which only aids assumptions and bad programming? These timing assumptions and other edge cases are typically what cause old software to misbehave on modern machines. Old applications sometimes stutter, run too slow, run too fast, and sometimes outright crash. These misbehaving apps are not directly Microsoft's fault; they are usually victims of assumptions that were valid at the time.
In this post we are taking a deep dive into the Windows time APIs to see exactly how they work and how they evolved, from DOS to hypervisors. We'll figure out what modern apps should be using, and we'll see that Windows 11 gets pretty close to a standard timer; not quite, but close enough that it will probably lead to more bad assumptions in the future.
For a fleeting moment it seemed like we could avoid all time-based APIs and do things ourselves. Starting with the original Pentium processor, some 30 years ago, Intel added the RDTSC instruction. It reads a hardware time stamp counter (TSC) that increments every clock cycle, and for the first time applications had a hardware counter, accessible with low latency from user mode, that provided cycle-level precision. For a while all of our timing problems were solved: even a lowly 100MHz Pentium could give 10ns timestamps and compute accurate elapsed time periods, and code could easily be profiled to see the results of the smallest micro optimizations; all things that were exceptionally difficult or impossible with the Windows time APIs of the day. It didn't last though: while RDTSC was used all over the place, it became very unreliable and a lot of software broke.
Early multi-processor systems had physically separate processors in physically separate packages, and each had its own non-synchronized TSC. If your code switched processors, the RDTSC instruction could give the impression of time going backwards, which causes all sorts of problems. Lots of games and media apps written in the days of single processors suffered timing problems because of this. It wasn't just third-party code, even system tools suffered; for example, ping on an early Windows Server would sometimes report negative time!
C:\>ping google.com
Ping google.com (142.250.69.238) with 32 bytes of data:
Reply from 142.250.69.238: bytes=32 time=-59ms TTL=128
Reply from 142.250.69.238: bytes=32 time=-59ms TTL=128
Reply from 142.250.69.238: bytes=32 time=-59ms TTL=128
Reply from 142.250.69.238: bytes=32 time=-59ms TTL=128
Over time, using RDTSC even on single processors became unreliable because the TSC incremented at the raw clock speed of the processor, which wasn't stable. The frequency would change by small amounts due to spread spectrum clocking, but would change by huge amounts due to thermal and power management, which made the TSC unusable. To make things worse, variable frequency clocking arrived around the same time that multi-processor machines started to become commonplace, so it was a double whammy. RDTSC could still be used, albeit with great care, but it was generally impossible to correlate the TSC to wall time. Today this problem is fixed via an invariant TSC, which increments at a fixed rate in all power and thermal states; from a software point of view the invariant TSC increments at the base clock speed of the processor, regardless of the actual operating speed. The TSC can once again be used for accurate timing as it directly correlates to wall time, yet Microsoft still recommends not using the RDTSC instruction and instead calling the QueryPerformanceCounter API. Is this true? Before we answer that question and dive into the nitty gritty of QueryPerformanceCounter, along with the role the TSC plays in modern Windows, let's look at the methods used to get time prior to RDTSC. There is a lot of history here and it's not hard to figure out why RDTSC got used so much.
Prior to RDTSC, and during the many years that RDTSC couldn't reliably be used, Windows only had a few options for elapsed time; most of them were timer-interrupt driven and none of them were great. Stuttering in early games and media was often the result of poor timing. GetTickCount() has been in the Windows API since the 16-bit days, it still exists today and is used everywhere. It returns the integer number of milliseconds since the system was booted. The resulting 32-bit counter wraps after 49.7 days, and to avoid this modern versions of Windows have GetTickCount64(), which is identical in functionality other than returning a 64-bit result. Realistically, having a timer wrap after 49.7 days probably wasn't an issue for Windows 3.1 because it wouldn't stay on that long. While the accuracy of GetTickCount() is 1ms, its resolution has never been close and still isn't. Initially the resolution of this API was a whopping 55ms, and this was true all the way through the Win9x operating systems; for Windows NT based operating systems the resolution has always been 10-15ms, where it still is today, even on Windows 11.
To understand where the 55ms interval comes from, which seems like a very random number in computer terms, we have to go all the way back to the origins of the PC BIOS. The original IBM PC's CPU ran at approximately 4.77MHz, and the 8254 programmable interval timer was clocked at a quarter of that speed, 1.19318MHz. 65536 cycles at 1.19318MHz is 54.92ms. The timer generated an 18.2065Hz interrupt on IRQ0 (interrupt vector 08h) which the BIOS used to track wall time. Early PCs didn't have a battery-backed clock so had no concept of wall time; the user would enter the date and time at boot and the BIOS would keep track of wall time using this interrupt. At an 18.2065Hz interrupt rate, 2^16 interrupts equals approximately 3599.59 seconds, which is almost exactly one hour. To save a few CPU cycles the BIOS used this roughly once-an-hour 16-bit overflow to check whether the wall time had crossed midnight, and when it did, it incremented the date.
Lots of DOS games hijacked the 8254 timer hardware and installed their own interrupt handler, so DOS games had access to accurate time. Early Windows ran on top of DOS and used the DOS and BIOS services as-is, so it didn't mess with the timer hardware. The original GetTickCount() API was a wrapper for the BIOS timer interrupt: it literally counted interrupts and multiplied that count by 54.92 to compute elapsed time in milliseconds. As an optimization, the API did the multiplication only when it was called; amazingly it is still implemented this way on Windows 11, and although the origin of the tick counter is a little different, the resolution is still only about 15ms. NT based operating systems were never based on DOS and the NT kernel did hijack the hardware timer. On a single processor NT machine the interrupt period was set to about 10ms, while on multi-processor machines it was about 15ms; much better than the 55ms of 16-bit Windows and Win9x, but nowhere near good enough for games and media.
A common pattern within Windows NT based operating systems is to have documented public Win32 APIs that are internally implemented with the somewhat undocumented native API. GetTickCount was no different: it had been a documented API since the first version of Windows, and on Windows NT it simply wrapped the native API NtGetTickCount, which was originally a system call into the kernel. Therefore getting the elapsed time involved a system call. Today system calls are still expensive and optimizations are in place to avoid them; 30 years ago they were really expensive.
Today, GetTickCount is just a few instructions; there are no system calls and it doesn't call NtGetTickCount (which is implemented identically).
GetTickCount:
mov ecx,7FFE0320h ;get the interrupt count
mov rcx,qword ptr [rcx]
mov eax,dword ptr [7FFE0004h] ;get a 24bit fixed point multiplier (defined by the interrupt period)
imul rax,rcx ;multiply
shr rax,18h ;shift down to give integer milliseconds
ret
The disassembly above doesn't use any syscalls, but it does read from the hardcoded addresses 0x7FFE0320 and 0x7FFE0004. These addresses are part of what is known as kuser_shared, which was first implemented in Windows NT 3.5 and has been there, in some form, ever since. kuser_shared is a read-only memory page mapped into every process by the kernel at the fixed address 0x7FFE0000. This allows the kernel to share things like constants, counters and configuration data without relying on system calls; user mode code can simply read the data from memory at full speed, and all user mode processes see the same data at the same time. It's intended for internal use and was never officially documented until the Windows 10 DDK. Needless to say, there are a lot of elements in kuser_shared that relate to time.
It's highly recommended that production code doesn't poke around in kuser_shared; it can change between even minor versions of Windows. But it's exactly what we're going to do.
Looking back at the disassembly above we can make sense of it. 0x7FFE0320 is a 64-bit counter that the kernel increments every tick interrupt. This used to be a 32-bit counter at address 0x7FFE0000, but that location is no longer used and is zero; a perfect example of why production code shouldn't use kuser_shared directly, and why some old applications that did no longer work. 0x7FFE0004 is known as the TickCountMultiplier and is the interrupt period in units of 100ns, shifted left by 24 bits and divided by 10,000. The conversion of the raw tick count to integer milliseconds is simply a multiplication by TickCountMultiplier and a shift right by 24 bits, which is exactly what the above code does (and is pretty much exactly what 16-bit Windows did).
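To make the arithmetic concrete, here is a minimal sketch that reproduces the GetTickCount() result by reading those two kuser_shared values directly (illustration only; as noted above, production code shouldn't touch these addresses):
#include <cstdint>
#include <cstdio>
#include <windows.h>

int main()
{
    //illustration only: kuser_shared offsets are not a contract and can change
    uint64_t tick_count = *(volatile uint64_t*)0x7FFE0320; //ticks since boot
    uint32_t multiplier = *(volatile uint32_t*)0x7FFE0004; //TickCountMultiplier
    uint64_t my_ms = (tick_count * multiplier) >> 24;      //milliseconds = ticks * multiplier >> 24
    printf("computed = %llu ms, GetTickCount64 = %llu ms\n", my_ms, GetTickCount64());
}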
Today most machines have a multiplier of 0x0FA00000, which works out to a period of 156250 (100ns system units) or 15.6250ms. Even in 2024, on Windows 11, the tick interrupt period is 15.625ms and this is the resolution of GetTickCount - about the same as all the earlier NT based cousins. While the value of the tick count multiplier has been constant at 0x0FA00000 for a while, you shouldn't rely on it, and you certainly shouldn't compute the tick period directly from the multiplier at 0x7FFE0004. There are more legitimate ways to get the tick period from the kernel.
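One legitimate route, sketched below, is the documented GetSystemTimeAdjustment API; its lpTimeIncrement output reports the periodic clock update interval in 100ns units, which I'd expect to match the 156250 figure above on a default machine:
#include <cstdio>
#include <windows.h>

int main()
{
    DWORD adjustment = 0;
    DWORD increment = 0;    //interval between periodic clock updates, in 100ns units
    BOOL disabled = FALSE;
    if (GetSystemTimeAdjustment(&adjustment, &increment, &disabled))
    {
        //on most machines this prints 156250 (15.625ms)
        printf("tick period = %lu (100ns units) = %.4fms\n", increment, increment / 10000.0);
    }
}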
At this point it is worth noting that the tick count at 0x7FFE0320, which drives GetTickCount, isn't driven directly from a hardware interrupt like it used to be. Instead, the tick count is derived from the more modern kernel interval timer interrupt, which can, and does, change its period on the fly. If I remember correctly from my time on the Windows kernel, the interval timer interrupt is handled by a function called KeUpdateSystemTime; in this function the kernel updates the interrupt time and then simply decrements the remaining tick count period by the elapsed time since the last interval interrupt (in 100ns units), and when this remaining period reaches zero the tick counter at location 0x7FFE0320 is incremented - it's kind of crude. Because of this, the kernel interval timer interrupt may occur much faster than the tick period but it can never be slower, otherwise the tick counter would miss increments. On my machine the kernel interval interrupt occurs somewhere between 0.5ms and 15.625ms.
It's really important to note that the period of the kernel interval timer interrupt doesn't have to be a factor of 15.625ms (the tick count period); it can be anything between the minimum and maximum timer period. Therefore the tick counter at location 0x7FFE0320 might average out to its nominal period of 15.625ms, but it has a whole lot of jitter: if the kernel interval timer is set to 7ms, the kernel might not update the tick count at 0x7FFE0320 until 21ms have elapsed.
The various counters in kuser_shared are updated without the need for locks on multi-processor machines. The older 32-bit versions of Windows had to jump through a few hoops to ensure the 64-bit values were updated reliably. Everything in this post assumes a modern 64-bit version of Windows.
Here is some test code that shows this in real time. The code spins while it waits for the kernel interval timer interrupt time at location 0x7FFE0008 to change (this address holds the time of the last kernel interval timer interrupt in units of 100ns since boot, and is set by the kernel within the interrupt handler). The code then looks to see if the tick count has changed, and if it has, computes the elapsed time since the last update.
#include <cstdint>
#include <cstdio>

int main()
{
uint64_t old = *(volatile uint64_t*)0x7FFE0008;
uint64_t old_tick = *(volatile uint64_t*)0x7FFE0320;
uint64_t last_tick_change = old;
while (1)
{
uint64_t current = *(volatile uint64_t*)0x7FFE0008;
//has a kernel interrupt fired
if (old != current)
{
uint64_t kdelta = current - old;
uint64_t tick_count = *(volatile uint64_t*)0x7FFE0320;
if (tick_count == old_tick)
{
printf("ktimer = %lld (kdelta=%lld), tick=%lld\n", current, kdelta, tick_count);
}
else
{
printf("ktimer = %lld (kdelta=%lld), tick=%lld, tick_delta=%lld\n", current, kdelta, tick_count, current - last_tick_change);
last_tick_change = current;
old_tick = tick_count;
}
old = current;
}
}
}
Here are real results from my Windows 11 test machine; the amount of jitter in tick_delta is not good.
initially the timer was running at about 1ms intervals
you can see the tick delta from the previous update is 15.9ms
This 1ms interval continues until it switches to 10ms
ktimer = 4946927507603 (kdelta=9994), tick=31660336, tick_delta=159942
ktimer = 4946927517606 (kdelta=10003), tick=31660336
ktimer = 4946927527603 (kdelta=9997), tick=31660336
ktimer = 4946927537604 (kdelta=10001), tick=31660336
ktimer = 4946927547608 (kdelta=10004), tick=31660336
ktimer = 4946927557606 (kdelta=9998), tick=31660336
ktimer = 4946927567603 (kdelta=9997), tick=31660336
ktimer = 4946927577604 (kdelta=10001), tick=31660336
ktimer = 4946927587604 (kdelta=10000), tick=31660336
ktimer = 4946927597603 (kdelta=9999), tick=31660336
ktimer = 4946927607618 (kdelta=10015), tick=31660336
ktimer = 4946927617607 (kdelta=9989), tick=31660336
Here is where the interrupt interval switches to 10ms, and because of this
the next tick update doesn't occur for 21ms.
With a 10ms interval the best the 15.625ms tick can do is bounce between 10 and 20ms
ktimer = 4946927718095 (kdelta=100488), tick=31660337, tick_delta=210492
ktimer = 4946927818092 (kdelta=99997), tick=31660338, tick_delta=99997
ktimer = 4946927918095 (kdelta=100003), tick=31660338
ktimer = 4946928018092 (kdelta=99997), tick=31660339, tick_delta=200000
ktimer = 4946928118084 (kdelta=99992), tick=31660339
ktimer = 4946928218105 (kdelta=100021), tick=31660340, tick_delta=200013
ktimer = 4946928318081 (kdelta=99976), tick=31660341, tick_delta=99976
ktimer = 4946928418096 (kdelta=100015), tick=31660341
ktimer = 4946928518105 (kdelta=100009), tick=31660342, tick_delta=200024
ktimer = 4946928618088 (kdelta=99983), tick=31660343, tick_delta=99983
ktimer = 4946928718076 (kdelta=99988), tick=31660343
ktimer = 4946928818082 (kdelta=100006), tick=31660344, tick_delta=199994
from here the interval is quite random for a while
ktimer = 4946928848076 (kdelta=29994), tick=31660344
ktimer = 4946928858169 (kdelta=10093), tick=31660344
ktimer = 4946928867110 (kdelta=8941), tick=31660344
ktimer = 4946928868757 (kdelta=1647), tick=31660344
ktimer = 4946928869416 (kdelta=659), tick=31660344
ktimer = 4946928947830 (kdelta=78414), tick=31660345, tick_delta=129748
ktimer = 4946928948959 (kdelta=1129), tick=31660345
ktimer = 4946928958946 (kdelta=9987), tick=31660345
ktimer = 4946928967940 (kdelta=8994), tick=31660345
ktimer = 4946929063042 (kdelta=95102), tick=31660346, tick_delta=115212
You don't have to compute the elapsed time between two timer interrupts to work out the current speed of the kernel interval timer; we can use the native API function NtQueryTimerResolution. This fills out the minimum resolution (slowest timer rate), the maximum resolution (fastest timer rate), and the current resolution at the time it was called (all in 100ns system time units). The minimum resolution of the kernel interval timer will always be the tick count interval, which ensures the tick counter always increments monotonically and never misses a value.
There are no easily accessible headers for the native API. You have to declare the functions yourself and either link with the static library ntdll.lib, or use GetProcAddress on ntdll.dll to resolve them dynamically.
extern "C" NTSYSAPI NTSTATUS NTAPI NtQueryTimerResolution(
OUT PULONG MinimumResolution,
OUT PULONG MaximumResolution,
OUT PULONG CurrentResolution);
//get the min, max and current resolution
ULONG minRes;
ULONG maxRes;
ULONG currentRes;
NtQueryTimerResolution(&minRes, &maxRes, &currentRes);
On my machine NtQueryTimerResolution reports a minimum period (maximum resolution) of 5000, or 0.5ms, and a maximum period (minimum resolution) of 156250, or 15.625ms. Calling it in a loop and displaying the current resolution shows the kernel timer changing resolution on the fly; a minimal version of that loop is sketched below, followed by the output on my machine.
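A sketch of that loop, assuming the declaration above and ntdll.lib (or GetProcAddress) for linkage:
#include <cstdio>
#include <windows.h>
#include <winternl.h> //for NTSTATUS

extern "C" NTSYSAPI NTSTATUS NTAPI NtQueryTimerResolution(
    PULONG MinimumResolution,
    PULONG MaximumResolution,
    PULONG CurrentResolution);

int main()
{
    while (1)
    {
        ULONG minRes, maxRes, currentRes;
        NtQueryTimerResolution(&minRes, &maxRes, &currentRes);
        printf("kernel timer resolution = %lu\n", currentRes);
        Sleep(1000); //sample once a second
    }
}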
kernel timer resolution = 156250
kernel timer resolution = 100000
kernel timer resolution = 156250
kernel timer resolution = 10000
kernel timer resolution = 10000
kernel timer resolution = 10000
kernel timer resolution = 156250
The reason the kernel timer resolution changes is that it's a system-wide setting. The default value is the minimum resolution (15.625ms), but it changes as other processes ask for a different resolution; the final resolution will always be the smallest interval any process has asked for, which makes the observed value effectively unpredictable because it depends entirely on the processes that are running and what they are currently doing. In both of the above logs it is easy to see that periodically some process is asking for 1ms resolution, which is common due to how the newer multimedia timers work, but in order to satisfy such a request, everybody gets 1ms resolution.
The timer resolution can be set directly using the native API NtSetTimerResolution:
extern "C" NTSYSAPI NTSTATUS NTAPI NtSetTimerResolution(ULONG DesiredResolution, BOOLEAN SetResolution, PULONG CurrentResolution);
//currentRes will be set to the current resolution, it may not be what you asked for if some other process has asked for a higher resolution.
ULONG currentRes;
NtSetTimerResolution(5000, TRUE, &currentRes);
The resolution has to be between the min and max reported by NtQueryTimerResolution, and it returns the resolution actually set, which may not be what you asked for because some other process may have already asked for a higher resolution. If we always set the maximum resolution, the timer rate should be constant because no other process will be able to change it. Adding a call to NtSetTimerResolution and running the same log again gives much cleaner results: the kernel timer fires every 5000 time units (0.5ms) and the jitter on the tick counter is much more consistent.
ktimer = 4948512812590 (kdelta=5001), tick=31670482, tick_delta=155002
ktimer = 4948512817589 (kdelta=4999), tick=31670482
ktimer = 4948512822590 (kdelta=5001), tick=31670482
ktimer = 4948512827590 (kdelta=5000), tick=31670482
ktimer = 4948512832590 (kdelta=5000), tick=31670482
ktimer = 4948512837589 (kdelta=4999), tick=31670482
ktimer = 4948512842590 (kdelta=5001), tick=31670482
ktimer = 4948512847589 (kdelta=4999), tick=31670482
ktimer = 4948512852589 (kdelta=5000), tick=31670482
ktimer = 4948512857593 (kdelta=5004), tick=31670482
ktimer = 4948512862590 (kdelta=4997), tick=31670482
ktimer = 4948512867595 (kdelta=5005), tick=31670482
ktimer = 4948512872589 (kdelta=4994), tick=31670482
ktimer = 4948512877588 (kdelta=4999), tick=31670482
ktimer = 4948512882589 (kdelta=5001), tick=31670482
ktimer = 4948512887589 (kdelta=5000), tick=31670482
ktimer = 4948512892589 (kdelta=5000), tick=31670482
ktimer = 4948512897589 (kdelta=5000), tick=31670482
ktimer = 4948512902600 (kdelta=5011), tick=31670482
ktimer = 4948512907592 (kdelta=4992), tick=31670482
ktimer = 4948512912590 (kdelta=4998), tick=31670482
ktimer = 4948512917606 (kdelta=5016), tick=31670482
ktimer = 4948512922597 (kdelta=4991), tick=31670482
ktimer = 4948512927605 (kdelta=5008), tick=31670482
ktimer = 4948512932591 (kdelta=4986), tick=31670482
ktimer = 4948512937591 (kdelta=5000), tick=31670482
ktimer = 4948512942592 (kdelta=5001), tick=31670482
ktimer = 4948512947589 (kdelta=4997), tick=31670482
ktimer = 4948512952589 (kdelta=5000), tick=31670482
ktimer = 4948512957616 (kdelta=5027), tick=31670482
ktimer = 4948512962590 (kdelta=4974), tick=31670482
ktimer = 4948512967589 (kdelta=4999), tick=31670482
ktimer = 4948512972589 (kdelta=5000), tick=31670483, tick_delta=159999
Unfortunately the Windows kernel makes no effort to minimize timer jitter; it doesn't try to compute a common factor of the requested periods, it simply uses the minimum period. If you need low jitter, the only option is to set the kernel timer to the maximum available resolution. The downside is that 0.5ms intervals generate 2000 kernel interrupts per second; while that sounds bad, it doesn't have any noticeable effect on performance. Microsoft should just use a fixed interval and be done with it.
In the above code another useful value from kuser_shared was used: 0x7FFE0008, the last kernel interrupt time in 100ns units. Reading the last interrupt time is a much better way to measure elapsed time than GetTickCount(); it's higher resolution and more accurate. One way to get the last kernel timer interrupt time is to read 0x7FFE0008 directly, but the best way is to use QueryInterruptTime(), which does exactly the same thing:
QueryInterruptTime:
mov eax,7FFE0008h
mov rax,qword ptr [rax]
mov qword ptr [rcx],rax
ret
Windows 95 added the timeGetTime API as part of the multimedia APIs; the same APIs first showed up on NT based operating systems in Windows 2000. These new APIs were still only accurate to integer milliseconds, but apps could now get 1ms resolution too. They still weren't great for accurately measuring elapsed time or scheduling video frames, but given the lack of better options both games and multimedia apps used them extensively, and even today these APIs are still used everywhere. Applications use timeBeginPeriod to set the resolution of the timer; as you can probably guess, this is just a wrapper for NtSetTimerResolution with the input quantized to whole milliseconds, with 1ms being the minimum. This is why 1ms shows up quite often in the above timer resolution log.
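For reference, the classic multimedia-timer pattern looks roughly like this (a sketch; winmm.lib needs to be linked and timeBeginPeriod should always be paired with timeEndPeriod):
#include <cstdio>
#include <windows.h>
#include <mmsystem.h>
#pragma comment(lib, "winmm.lib")

int main()
{
    timeBeginPeriod(1);          //ask for 1ms resolution (wraps NtSetTimerResolution)
    DWORD start = timeGetTime(); //integer milliseconds since the system started
    Sleep(5);
    printf("elapsed = %lums\n", timeGetTime() - start);
    timeEndPeriod(1);            //undo the resolution request
}
Under the hood, timeGetTime disassembles to this: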
timeGetTime:
sub rsp,28h
xor r9d,r9d
lea rdx,[TimeInit]
xor r8d,r8d
lea rcx,[g_TimeInitOnce]
call qword ptr [__imp_InitOnceExecuteOnce] ;Only run the init code once, from any thread within a given process
nop dword ptr [rax+rax]
mov ecx,7FFE0008h ;last interrupt time
mov rax,346DC5D63886594Bh ;fixed point multiply by 1/10000 (100ns units to ms)
mov rcx,qword ptr [rcx]
sub rcx,qword ptr [TimerData] ;subtract raw init time from TimerData
imul rcx
sar rdx,0Bh
mov rax,rdx
shr rax,3Fh
add rax,rdx
add eax,dword ptr [TimerData+8h] ;add on the init time in ms from TimerData+8
add rsp,28h
ret
timeGetTime just returns a scaled and offset version of the 64-bit interrupt time. The multimedia timers are significantly better than GetTickCount, but there is no point in using them in new code: calling QueryInterruptTime has far less CPU overhead and gives you 100ns units (though you are still at the whim of the kernel timer interrupt interval if you don't set it yourself).
Using QueryInterruptTime
after setting the timer resolution to 0.5ms:
#include <cstdint>
#include <cstdio>
#include <windows.h>
#include <realtimeapiset.h>

int main()
{
uint64_t old_tick = 0;
while (1)
{
uint64_t tick;
QueryInterruptTime(&tick);
if (tick != old_tick)
{
double delta = (double)(tick - old_tick) / 10000.0;
printf("delta = %0.6f\n", delta);
old_tick = tick;
}
}
}
The result is quite stable; the interrupt time is within a few microseconds of the expected time, but half a millisecond is the smallest interval that can be measured, and this still can't accurately measure a 16.6ms frame time.
delta = 0.500100
delta = 0.502400
delta = 0.497900
delta = 0.499500
delta = 0.500200
delta = 0.500200
delta = 0.499800
delta = 0.500400
delta = 0.499600
delta = 0.500100
delta = 0.499900
delta = 0.499900
delta = 0.500500
Windows also provides an API called GetSystemTimeAsFileTime which returns the current system time as a file time. Within Windows, system time and file time are both in units of 100ns; system time is the elapsed time since boot, while file time is the elapsed time since the epoch (Jan 1st 1601). This API is also driven by the same timer interrupt, but instead of reading 0x7FFE0008, which is the system timestamp, it reads 0x7FFE0014, which is the file time timestamp of the last timer interrupt.
; This API splits the 64bit result in to a pair of 32bit values instead of directly returning a 64bit value.
GetSystemTimeAsFileTime:
mov eax,7FFE0014h
mov rax,qword ptr [rax]
mov dword ptr [rcx],eax
shr rax,20h
mov dword ptr [rcx+4],eax
ret
There is a precise version of this API called GetSystemTimePreciseAsFileTime, which is a wrapper for the native function RtlGetSystemTimePrecise. Internally it uses the native function RtlQueryPerformanceCounter, along with the last interrupt time and some other magic values within kuser_shared; ultimately it returns the file time of when the API was actually called.
This function uses address 0x7FFE0340, which is an incrementing count of kernel timer interrupts. The counter is shifted left one bit so it appears to increment by 2, and the bottom bit is a lock bit. It's very difficult to catch the lock bit set, but I believe it is set while the kernel interrupt handler executes, to indicate the timer values are changing. There are a lot of timer related values within kuser_shared and they can't all be updated atomically, especially from a multi-processor point of view. You'll see this value being used in the following code:
RtlGetSystemTimePrecise:
mov qword ptr [rsp+10h],rbx
mov qword ptr [rsp+18h],rbp
mov qword ptr [rsp+20h],rsi
push rdi
push r12
push r13
push r14
push r15
sub rsp,20h
mov edi,7FFE0358h
lea r12d,[rdi-18h]
lea r13d,[rdi-10h]
retry2:
mov eax,7FFE0368h
mov ecx,7FFE0014h
retry1:
mov rbx,qword ptr [r12] ;0x340 - kernel timer counter + lock
test bl,1
jne pause1 ;if the lock in the bottom bit of 0x340 is set, pause and try again (aka locked)
mov r14,qword ptr [rcx] ;load system time for last interrupt
lea rcx,[rsp+50h] ;result location of RtlQueryPerformanceCounter
mov rbp,qword ptr [r13] ;baseline qpc interrupt time
mov r15,qword ptr [rdi] ;qpc system time increment
mov sil,byte ptr [rax] ;time shift
call RtlQueryPerformanceCounter ;get the qpc time stamp
nop
mov rax,qword ptr [r12] ;reload the interrupt counter/lock from 0x340
cmp rax,rbx
jne pause2 ;if a timer interrupt has occurred or the lock has changed, spin and try again
mov rdx,qword ptr [rsp+50h] ;result of RtlQueryPerformanceCounter
mov edi,0 ;delta time from last interrupt
cmp rdx,rbp ;current qpc vs qpc interrupt time
jbe comp_result ;qpc is below or equal (shouldn't happen), delta is zero, return the last interrupt
sub rdx,rbp ;qpc delta since last interrupt
dec rdx
test sil,sil
je skip_shift ;if shift is zero, skip shifting
mov cl,sil
shl rdx,cl ;shift qpc delta by the shift amount
skip_shift:
mov rax,r15 ;qpc system time increment
mul rax,rdx ;multiply the time delta by the time increment to compute 100ns
mov rdi,rdx
mov rbx,qword ptr [rsp+58h]
comp_result:
lea rax,[r14+rdi] ;add 100ns unit delta time to the baseline interrupt time in 100ns
mov rbp,qword ptr [rsp+60h]
mov rsi,qword ptr [rsp+68h]
add rsp,20h
pop r15
pop r14
pop r13
pop r12
pop rdi
ret
int 3
pause1:
pause
jmp retry1
pause2:
pause
jmp retry2
This is the first API that takes us away from times based on the kernel timer interrupt to a "now" time of when the API was actually called; it's the first API to eliminate the kernel timer jitter and the first that can be correlated to wall time. The implementation of RtlGetSystemTimePrecise jumps through some hoops because it always returns time in 100ns units, and although it uses RtlQueryPerformanceCounter it has to apply a scale and offset because RtlQueryPerformanceCounter doesn't have a fixed frequency.
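A quick sketch showing the difference between the coarse and precise variants (assumes Windows 8 or later for GetSystemTimePreciseAsFileTime):
#include <cstdint>
#include <cstdio>
#include <windows.h>

static uint64_t ToUint64(const FILETIME& ft)
{
    ULARGE_INTEGER v;
    v.LowPart = ft.dwLowDateTime;
    v.HighPart = ft.dwHighDateTime;
    return v.QuadPart; //100ns units since Jan 1st 1601
}

int main()
{
    FILETIME coarse, precise;
    GetSystemTimeAsFileTime(&coarse);         //file time of the last timer interrupt
    GetSystemTimePreciseAsFileTime(&precise); //interpolated to "now" using the performance counter
    printf("coarse  = %llu\n", ToUint64(coarse));
    printf("precise = %llu\n", ToUint64(precise));
    printf("difference = %.4fms\n", (double)(ToUint64(precise) - ToUint64(coarse)) / 10000.0);
}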
Following the typical NT native API pattern, the user mode API QueryPerformanceCounter and RtlQueryPerformanceCounter are the same thing.
Windows NT system and file times have always been in 100ns units, so it would make sense for QueryPerformanceCounter to return the same 100ns units, but for some reason this took until Windows 10 build 1607 (the Anniversary Update in 2016). Today it is mostly true that the frequency of QueryPerformanceCounter is fixed at 10MHz, but it's not guaranteed; make assumptions at your own risk. Over the years QueryPerformanceCounter has used a range of frequencies depending on the source timer, but it will always have a resolution better than 1us.
The frequency of QueryPerformanceCounter can be obtained via QueryPerformanceFrequency, and all this does is load and return the 64-bit value at 0x7FFE0300, which is set by the kernel once at boot and never changes at runtime.
QueryPerformanceFrequency:
mov rax,qword ptr [7FFE0300h]
mov qword ptr [rcx],rax
mov eax,1
ret
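Putting the counter and frequency together, the portable pattern is to always divide by the reported frequency rather than assuming 10MHz; a sketch:
#include <cstdio>
#include <windows.h>

int main()
{
    LARGE_INTEGER freq, start, end;
    QueryPerformanceFrequency(&freq); //counts per second, don't assume 10,000,000
    QueryPerformanceCounter(&start);
    Sleep(16);                        //stand-in for a frame of work
    QueryPerformanceCounter(&end);
    double ms = (double)(end.QuadPart - start.QuadPart) * 1000.0 / (double)freq.QuadPart;
    printf("elapsed = %.4fms (qpc frequency = %lld)\n", ms, freq.QuadPart);
}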
QueryPerformanceCounter and RtlQueryPerformanceCounter have a native equivalent called NtQueryPerformanceCounter. The big difference is that NtQueryPerformanceCounter is always a syscall to the kernel, even though it computes the exact same result.
QueryPerformanceCounter:
mov qword ptr [rsp+8],rbx
push rdi
sub rsp,20h
mov eax,7FFE03C6h ;qpc flags address
mov rbx,rcx
movzx r9d,byte ptr [rax] ;load qpc flags
test r9b,1 ;check bypass syscall
je use_syscall ;NtQueryPerformanceCounter (syscall 31h)
mov r11d,7FFE03B8h ;qpc bias address
mov r11,qword ptr [r11] ;load qpc bias
test r9b,2 ;check qpc flags for using hypervisor page
je no_hyperv ;no hypervisor page
mov r8,qword ptr [7FFBB84332A8h] ;RtlpHypervisorSharedUserVa = 000000007FFE1000
test r8,r8 ;is the hypervisor page null??
je use_syscall ;if null, use syscall
loop_hyperv:
mov r10d,dword ptr [r8] ;load cookie from RtlpHypervisorSharedUserVa+0
test r10d,r10d ;is the cookie zero?
je use_syscall ;if cookie is zero use syscall
test r9b,r9b ;is rdtscp available?
jns use_rdtsc_hyperv ;if not, use rdtsc with some combo of lfence/mfence
rdtscp
rdtsc_result_hyperv:
shl rdx,20h
or rdx,rax ;make 64bit result from rdtsc
mov rax,qword ptr [r8+8] ;load scale at RtlpHypervisorSharedUserVa+8
mov rcx,qword ptr [r8+10h] ;load offset at RtlpHypervisorSharedUserVa+16
mul rax,rdx ;multiply by scale
mov eax,dword ptr [r8] ;reload cookie from RtlpHypervisorSharedUserVa+0
add rdx,rcx ;add the offset
cmp eax,r10d
jne loop_hyperv ;do again if the hypervisor cookie has changed, maybe the scale/offset changed.
compute_result:
mov cl,byte ptr [7FFE03C7h] ;load qpc shift
lea rax,[rdx+r11] ;add on qpc bias
shr rax,cl ;shift the result right
mov qword ptr [rbx],rax ;store result
mov eax,1 ;return true
mov rbx,qword ptr [rsp+30h]
add rsp,20h
pop rdi
ret
;main code path if the hypervisor isn't being used
no_hyperv:
test r9b,r9b ;is rdtscp available
jns use_rdtsc ;if not, use rdtsc
rdtscp
rdtsc_result: ;if using rdtsc, it returns here
shl rdx,20h
or rdx,rax ;make 64bit value
jmp compute_result ;compute final result and return
;this is the hypervisor rdtsc path, identical to below, we just check sync flags prior to rdtsc
use_rdtsc_hyperv:
test r9b,20h ;use lfence?
je no_lfence_hyperv
lfence
jmp read_tsc_hyperv
no_lfence_hyperv:
test r9b,10h ;use mfence?
je read_tsc_hyperv
mfence
read_tsc_hyperv:
rdtsc
jmp rdtsc_result_hyperv
;this is the non-hypervisor rdtsc path, we might need a sync instruction before the rdtsc
use_rdtsc:
test r9b,20h ;use lfence for sync
je skip_lfence
lfence
jmp read_tsc
skip_lfence:
test r9b,10h ;use mfence for sync
je read_tsc
mfence
read_tsc:
rdtsc
jmp rdtsc_result
use_syscall:
....
Dissecting the above code, there are a few different paths but ultimately only two modes of operation: it's either a syscall or some derivative of a TSC read with either RDTSC or RDTSCP. The difference between these two almost identical instructions is that the original RDTSC is speculative; it doesn't necessarily wait for all previous instructions to execute before reading the counter, and similarly, subsequent instructions may begin executing before the counter is read. This can cause problems if the programmer expects the timestamp to be taken at the exact location of the instruction. Making this instruction behave requires some form of synchronization instruction:
- If software requires RDTSC to be executed only after all previous instructions have executed on Intel platforms, it can execute LFENCE immediately before RDTSC.
- If software requires RDTSC to be executed only after all previous instructions have executed on AMD platforms, it can execute MFENCE immediately before RDTSC. Technically this is slower than the Intel solution as it waits until all previous loads and stores are globally visible. Executing LFENCE on AMD hardware doesn't synchronize as intended.
- If software requires RDTSC to be executed prior to execution of any subsequent instruction, it can execute LFENCE immediately after RDTSC.
RDTSCP [available when CPUID.80000001h:EDX[27] is set] is a non-speculative version of RDTSC, equivalent to the first two items above: it waits for all prior instructions to execute before reading the counter, and it works the same on AMD and Intel hardware. It doesn't prevent subsequent instructions from starting to execute, which software can prevent by executing LFENCE after RDTSCP (item 3 above). A feature of this newer instruction, unrelated to timers, is that the model specific register IA32_TSC_AUX is returned in ECX; the Windows scheduler puts the processor ID in this register, which software can use to determine if a thread is hopping processors (Linux does something similar, but it is entirely up to the operating system to set this model specific register).
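For completeness, here is a sketch of reading the TSC directly from user mode with the MSVC intrinsics (it assumes an invariant TSC and a processor with RDTSCP support):
#include <cstdint>
#include <cstdio>
#include <intrin.h>

int main()
{
    unsigned int aux0, aux1;
    uint64_t t0 = __rdtscp(&aux0); //waits for prior instructions, returns IA32_TSC_AUX (processor ID on Windows)
    //...code being timed...
    uint64_t t1 = __rdtscp(&aux1);
    _mm_lfence();                  //stop later instructions starting before the read (item 3 above)
    printf("tsc delta = %llu cycles\n", t1 - t0);
    if (aux0 != aux1)
        printf("warning: the thread moved processors (%u -> %u)\n", aux0, aux1);
}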
Going back to QueryPerformanceCounter, you can see all of the above options being used. The code flow is controlled by the QPC flags at 0x7FFE03C6; the DDK names the various flags as follows:
SHARED_GLOBAL_FLAGS_QPC_BYPASS_ENABLED = 0x01
Set if QPC should bypass the syscall and use the timestamp instructions; this will only be set if the timestamp is invariant. This flag first appeared in Windows 8.1. If this bit is clear, QPC just calls NtQueryPerformanceCounter, which is always a syscall; this allows motherboard timers to be used, which are only accessible from kernel mode. If the bypass is not enabled then QueryPerformanceCounter is always a system call with significant CPU overhead, making it somewhat unsuitable for high frequency operations as a single call can take a few microseconds to execute (many thousands of clock cycles).
SHARED_GLOBAL_FLAGS_QPC_BYPASS_USE_HV_PAGE = 0x02
If set, QPC uses the hypervisor page. This memory page showed up in Windows 10 v1607 and is mapped into all processes in much the same way as kuser_shared. The page is used to apply a scale and offset to normalize the frequency to 10MHz (100ns units). This normalization can only occur if the hypervisor page is present, so earlier versions of Windows will not return a fixed 10MHz.
SHARED_GLOBAL_FLAGS_QPC_BYPASS_USE_MFENCE = 0x10
Use MFENCE for synchronization; not checked if RDTSCP is used, only used on AMD processors.
SHARED_GLOBAL_FLAGS_QPC_BYPASS_USE_LFENCE = 0x20
Use LFENCE for synchronization; not checked if RDTSCP is used, only used on Intel processors.
SHARED_GLOBAL_FLAGS_QPC_BYPASS_USE_RDTSCP = 0x80
If this is set, RDTSCP is available and will be used in place of RDTSC plus a synchronization instruction.
The syscall is avoided if at all possible, and the code is optimized for using RDTSCP with the hypervisor page to normalize to 10MHz; this is the path with no branches and the expected path on a typical Windows 10 or later machine. The possible paths are:
- User mode TSC via RDTSC or RDTSCP with hypervisor normalization to 10MHz. CPU cost is roughly 100 cycles.
- User mode TSC via RDTSC or RDTSCP with a shift. On the test machine the shift is set to 10, giving a frequency of 3.613000MHz (approx 3.7GHz / 1024). CPU cost is roughly 100 cycles.
- System call. If this path is being used it will most likely be the HPET with a frequency of 14.318180MHz; it could also be the PM timer at 3.579545MHz (exactly 4x slower than the HPET), but that's unlikely. CPU cost for the syscall path is thousands of cycles.
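A hedged sketch that peeks at those flag bits using the DDK names above (again, kuser_shared offsets such as 0x7FFE03C6 are not a contract and could move):
#include <cstdint>
#include <cstdio>

int main()
{
    //illustration only - production code should not read kuser_shared directly
    uint8_t flags = *(volatile uint8_t*)0x7FFE03C6;
    printf("QPC_BYPASS_ENABLED     : %d\n", (flags & 0x01) != 0);
    printf("QPC_BYPASS_USE_HV_PAGE : %d\n", (flags & 0x02) != 0);
    printf("QPC_BYPASS_USE_MFENCE  : %d\n", (flags & 0x10) != 0);
    printf("QPC_BYPASS_USE_LFENCE  : %d\n", (flags & 0x20) != 0);
    printf("QPC_BYPASS_USE_RDTSCP  : %d\n", (flags & 0x80) != 0);
}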
Windows 10/11 can be forced to use the HPET counter from an administrator terminal with bcdedit /set useplatformclock true. While this can be useful for testing, I recommend you don't leave this boot variable set because it can have some nasty side effects; use bcdedit /deletevalue useplatformclock to delete the variable and go back to the default. You need to reboot after changing this setting. I highly recommend you test with this, especially if you are making a high performance media app or game; there are machines out there that use this path and it has caused numerous problems for a handful of commercial games. It also changes the QPC frequency, so it's no longer the normalized 10MHz.
What is this hypervisor page? Unfortunately, there isn't much information on it. It first appeared in the Anniversary Update of Windows 10 (version 1607) and it seems to only be used by QueryPerformanceCounter. Its address isn't fixed, but on all the machines tested it always shows up very close to kuser_shared; on this test machine it is directly after the kernel page, at 0x7ffe1000. You can query the address in the local process by using the native API NtQuerySystemInformation with query class ID 0xc5.
extern "C" NTSYSAPI NTSTATUS NTAPI NtQuerySystemInformation(
ULONG SystemInformationClass,
PVOID SystemInformation,
ULONG SystemInformationLength,
PULONG ReturnLength
);
void* GetHypervisorPageAddress()
{
void* system_info;
ULONG res_size;
NTSTATUS result = NtQuerySystemInformation(0xC5, &system_info, sizeof(system_info), &res_size);
if ((result >= 0) && (res_size == sizeof(system_info)))
{
return system_info;
}
return 0;
}
Now we can write some code to dump out the known data in the hypervisor page, and compute the TSC frequency as seen by the operating system.
#include <cstdint>
#include <cstdio>
#include <intrin.h> //for _div128

int main()
{
//known elements of the hypervisor page
struct HypervisorPage
{
uint32_t cookie;
uint32_t unused; //maybe cookie is 64bits but it's currently loaded as 32bits
uint64_t rdtsc_scale;
uint64_t rdtsc_offset;
};
//use the above function
HypervisorPage* hv = (HypervisorPage*)GetHypervisorPageAddress();
if (hv)
{
printf("cookie: %x\n", hv->cookie);
printf("rdtsc_cale: %llx\n", hv->rdtsc_scale);
printf("rdtsc_offset: %llx\n", hv->rdtsc_offset);
//compute the tsc speed as an integer with 32bits of fraction
int64_t rem;
int64_t tsc_int = _div128((1ULL << 32), 0, hv->rdtsc_scale, &rem);
//convert to a double
double tsc = (double)tsc_int / (1ULL << 32);
printf("tsc clock speed = %.6f", tsc*10.0); //*10 for Mhz
//this is hacky but surprisingly accurate (only because rdtsc_scale fits in 53bits)
//double f = (double)0xffffffffffffffffULL / (double)hv->rdtsc_scale;
}
}
We can see from the disassembly of QueryPerformanceCounter that if the hypervisor cookie is zero the syscall is used. On Windows 11 the cookie starts at 0x1 and I assume it increments if the hypervisor makes changes to the scale or offset; I've never seen this happen, but it makes sense that the hypervisor could change these values and QueryPerformanceCounter is written to handle it (in Windows 10 the cookie used to be the value 'hAlt'). On my test machine rdtsc_scale is slightly different on every reboot, so it's probably computed at startup; for example, on the first boot it was 0x00b11b8333a4a9e5, which as 64-bit fixed point is 1/370.035209, and on the second boot 0x00b11b77df38e17c, or 1/370.0355705. Both are very close to the divisor required to get 100ns units from a processor with a base speed of 3.7GHz.
Another time related use for NtQuerySystemInformation is to determine whether QueryPerformanceCounter is going to use a syscall or stay in user mode. You could figure this out yourself, but the official method (albeit somewhat undocumented) is more likely to remain correct in the future. Knowing a syscall is going to be used might change how you use QueryPerformanceCounter.
bool DoesQPCBypassKernel()
{
struct SYSTEM_QUERY_PERFORMANCE_COUNTER_INFORMATION
{
ULONG Version;
//0x00000001 = kernelqpc
ULONG Flags;
ULONG ValidFlags;
};
SYSTEM_QUERY_PERFORMANCE_COUNTER_INFORMATION qpc_info;
ULONG res_size;
qpc_info.Version = 1; //must set the version on input (only version=1 is defined)
NTSTATUS result = NtQuerySystemInformation(0x7c, &qpc_info, sizeof(qpc_info), &res_size);
if ((result >= 0) && (res_size == sizeof(qpc_info)))
{
if (qpc_info.ValidFlags & 0x1) //is kernel flag valid
{
return (qpc_info.Flags & 0x1) == 0; //if valid it needs to be zero to bypass
}
}
//this is not actually false, but unknown
return false;
}
Have we come full circle? Why don't we use RDTSC for timestamps? If QueryPerformanceCounter is just a wrapper for RDTSC/RDTSCP, then why not avoid all the overhead and execute the instructions directly? Is Microsoft correct in saying everybody should use QueryPerformanceCounter? Well, it depends.
If CPUID.80000007h:EDX[8] indicates invariant timestamp support then there is little harm in making your own ultra low overhead timers using RDTSC/RDTSCP. RDTSC has the lowest overhead, and I've never seen the lack of synchronization cause problems outside of profiling short code sequences. If you do things yourself you have to figure out the base clock frequency, something that is given to you if you use QueryPerformanceCounter. While I don't see the timestamp instructions becoming unusable again, there is always a risk of some processor changing the invariant timer in ways software doesn't expect; nobody expected RDTSC to cause as much trouble as it did when it was first used 30 years ago.
On a modern machine I don't see any problem with using QueryPerformanceCounter, especially if it's bypassing the syscall; it's plenty fast enough for anything other than maybe micro-profiling short sequences. QueryPerformanceCounter is pretty future proof when used properly, and the OS handles the edge cases and platform differences. For general time keeping, no matter what the QueryPerformanceCounter source may be, or even if it's using a syscall, the overhead is insignificant. I am starting to see numerous posts online where the frequency is assumed to be a fixed 10MHz - don't do this! Today, I would only use RDTSC for careful micro profiling, or when thousands and thousands of timestamps are needed and even the overhead of the API calls is too much.
If you make your own timers and need the TSC frequency, I recommend not using the hypervisor-page code above outside of a development environment; it will cause you problems in the future and your app will be the one crashing in 10 years. It would be nice if Windows just had an API to return the TSC frequency, but it doesn't. On Intel processors CPUID 16h can be used to get the base speed, but if you want to compute the TSC frequency you can do a one-time calibration by comparing TSC readings against a known system clock, or even against QueryPerformanceCounter itself; during development you can check your calibration code against the hypervisor frequency to see if it's reliable. If you use a good reference timer and throw out any outliers, this type of calibration can be very accurate; it's future proof, works on Intel and AMD, and it accounts for things like virtualization and overclocking. Never calibrate against GetTickCount, it'll never be accurate with the amount of jitter we saw earlier.
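As a sketch of that calibration idea, here is a one-shot measurement of TSC ticks against QueryPerformanceCounter over a short window; a real implementation would repeat the measurement and reject outliers:
#include <cstdint>
#include <cstdio>
#include <intrin.h>
#include <windows.h>

int main()
{
    LARGE_INTEGER freq, q0, q1;
    QueryPerformanceFrequency(&freq);
    unsigned int aux;
    QueryPerformanceCounter(&q0);
    uint64_t t0 = __rdtscp(&aux);
    Sleep(100);                   //calibration window
    QueryPerformanceCounter(&q1);
    uint64_t t1 = __rdtscp(&aux);
    double seconds = (double)(q1.QuadPart - q0.QuadPart) / (double)freq.QuadPart;
    printf("estimated TSC frequency = %.3fMHz\n", (double)(t1 - t0) / seconds / 1e6);
}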
Finally, the invariant TSC might appear to have the same counter frequency as the base processor speed, but it certainly doesn't increment at that speed. Typically the invariant TSC runs on a much slower, stable, always running reference clock, say 25MHz, 50MHz or 100MHz. The counter increments at this reference speed, adding whatever multiple is required to match the base speed. On my test machine the amount added is 148 to give the impression of a 3.7GHz counter, from which we can calculate a reference clock of 25MHz. On Intel processors this increment rate is available from CPUID 15h. However, things are not that simple, because RDTSC always returns a value that is monotonically incrementing, so it won't return the same value twice if you call it back to back. How this is implemented with respect to the reference clock, and what is actually returned, is model specific; for example some Intel processors only return even numbers. While it's safe to assume the invariant TSC is accurate against wall time over long periods, you probably don't want to use it to time ultra small sections of code unless you know exactly how RDTSC is implemented on your particular processor.
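A sketch of querying those CPUID leaves on Intel hardware (leaf 15h returns the TSC/crystal ratio and, when enumerated, the crystal frequency; leaf 16h returns the base frequency in MHz; older parts may report zeros, so treat the results as hints):
#include <cstdio>
#include <intrin.h>

int main()
{
    int regs[4];
    __cpuid(regs, 0x15); //EAX=denominator, EBX=numerator, ECX=crystal clock in Hz (0 if not reported)
    unsigned denom = (unsigned)regs[0], numer = (unsigned)regs[1], crystal = (unsigned)regs[2];
    if (denom && numer && crystal)
        printf("TSC = %.3fMHz (crystal %.3fMHz * %u/%u)\n",
               (double)crystal * numer / denom / 1e6, crystal / 1e6, numer, denom);
    __cpuid(regs, 0x16); //EAX=base MHz, EBX=max MHz, ECX=bus MHz
    printf("base = %dMHz, max = %dMHz\n", regs[0] & 0xFFFF, regs[1] & 0xFFFF);
}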