
Windows HLK Clock Interrupt Test Failure Analysis

Summary:

The Microsoft Windows HLK Clock Interrupt Test measures the clock interrupt resolution by enabling ETW
tracing and processing that data in real-time. During the test, the clock interrupt resolution is manipulated to
confirm that the HAL and underlying hardware are performing as expected at a variety of supported resolutions.

Figure 1. Clock Interrupt Illustration.

As illustrated in Figure 1, the clock timer is set to a particular resolution, d, and interrupts are
generated at this interval. The time between events is measured to determine whether it matches d.
An error occurs when the measured interval is skewed, arriving either too early or too late, as illustrated in Figure 2.

Figure 2. Clock Interrupt Skew Illustration.

In Figure 2, a skew of d prime is shown, where the measured interval is either shorter or longer than the
expected d value. When the skew occurs, the test reports an error condition.
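
The comparison described above can be sketched in a few lines of Python. This is only an illustration, not the HLK implementation: the function name, the sample timestamps, and the 10% tolerance band are assumptions chosen for the example.

```python
from typing import List, Tuple


def classify_intervals(timestamps: List[int], expected: int,
                       tolerance: float = 0.1) -> List[Tuple[int, str]]:
    """Compare each interval between consecutive interrupts with `expected`.

    `tolerance` is a hypothetical fractional band around the expected
    interval d; the real test's tolerance is not stated in this report.
    """
    low = expected * (1 - tolerance)
    high = expected * (1 + tolerance)
    results = []
    for prev, curr in zip(timestamps, timestamps[1:]):
        interval = curr - prev  # measured d between two interrupts
        if interval < low:
            verdict = "early"
        elif interval > high:
            verdict = "late"
        else:
            verdict = "ok"
        results.append((interval, verdict))
    return results


# Made-up timestamps with an expected interval d of 1000 units.
ticks = [0, 1000, 2000, 2400, 3000, 4600]
for interval, verdict in classify_intervals(ticks, expected=1000):
    print(interval, verdict)
```

Both the early and the late cases count as skew here, matching the two cases shown in Figure 2.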

The HLK Clock Interrupt Test reports several errors as it runs. These include:
1. NtSetTimerResolution indicated that the current timer resolution is XXX but NtQueryTimerResolution indicated
that the current timer resolution is YYYY!
2. Explicitly setting previously discovered timer resolution XXX via NtSetTimerResolution resulted in an unexpected
change in timer resolution to YYY!
3. Test iteration invalidated due to interference from another component. modeSwitchDeliveredIncrement
= XXXX Retrying
4. Unable to complete testing due to repeated interference from another component. Please investigate,
resolve conflict, and retry.

All errors occur randomly; that is, they are not predictable, occur at different times during the test,
and sometimes do not occur at all. The errors have been observed to occur less often if the system Power
Profile is set to Performance, the Windows Defender service is disabled, or the Windows Search Indexing
service is disabled. Even without any changes, with default settings, the test will pass occasionally.

Errors 1-3 can occur when another application is also manipulating the timer resolution, or when another
application is consuming a large percentage of CPU time. Although Errors 1 and 2 are reported as
errors in the log, the test continues. Error 3 is informational: the test simply retries setting the
timer and reruns the desired timer resolution.

Error 4 is of primary interest because this error could be the result of disk I/O activity. The working assumption
is that the NVMe driver stack components, namely stornvme.sys and storport.sys, may be taking
CPU cycles, which skews the measurements and causes the test to fail. Although several
anomalies have been identified by reviewing Windows Performance Analysis traces, PCIe bus traces, and other
Windows event traces, a clear correlation between the test failures and disk I/O has not been identified. Some
traces where the error is reported show no disk I/O activity at all at the time the error appears in the test log.

Detailed Explanation:

Error 1: This error occurs when the HLK Clock Interrupt Test attempts to change the clock timer resolution
using the Windows system call NtSetTimerResolution. The application calls this function to set a new
timer resolution, but when it reads the timer resolution back with the Windows system call
NtQueryTimerResolution, the value returned is not the same as the value that was set.

This can occur if other applications running in the system are also changing the system timer
resolution. Other system components, such as power management subsystems, sometimes change the system
timer to decrease the number of interrupts and thereby reduce overall system power
consumption.
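
The set-then-read-back check behind Error 1 can be sketched as follows. This is a hedged illustration and not the test's actual code. NtSetTimerResolution and NtQueryTimerResolution are exports of ntdll.dll without official public documentation, so the signatures used here (resolution values in 100 ns units, a BOOLEAN flag that requests the change) follow the commonly described prototypes and should be treated as assumptions. The helper names are hypothetical.

```python
import ctypes
import sys


def hundred_ns_to_ms(value: int) -> float:
    """Convert a resolution expressed in 100 ns units to milliseconds."""
    return value / 10_000


def set_and_query(desired: int):
    """Request a timer resolution, then read the current one back.

    Windows only. `desired` is in 100 ns units. Returns the value reported
    by the set call and the value reported by the query call. Prototypes
    below are assumptions based on commonly published declarations.
    """
    if sys.platform != "win32":
        raise OSError("this sketch only runs on Windows")

    ntdll = ctypes.WinDLL("ntdll")
    ulong_p = ctypes.POINTER(ctypes.c_ulong)

    set_res = ntdll.NtSetTimerResolution
    set_res.argtypes = [ctypes.c_ulong, ctypes.c_ubyte, ulong_p]
    set_res.restype = ctypes.c_long  # NTSTATUS, negative on failure

    query_res = ntdll.NtQueryTimerResolution
    query_res.argtypes = [ulong_p, ulong_p, ulong_p]
    query_res.restype = ctypes.c_long

    reported_by_set = ctypes.c_ulong()
    if set_res(desired, 1, ctypes.byref(reported_by_set)) < 0:
        raise OSError("NtSetTimerResolution failed")

    # The first two outputs describe the supported range; only the third
    # (the current resolution) is needed for the comparison.
    range_a, range_b = ctypes.c_ulong(), ctypes.c_ulong()
    reported_by_query = ctypes.c_ulong()
    if query_res(ctypes.byref(range_a), ctypes.byref(range_b),
                 ctypes.byref(reported_by_query)) < 0:
        raise OSError("NtQueryTimerResolution failed")

    return reported_by_set.value, reported_by_query.value


if __name__ == "__main__" and sys.platform == "win32":
    from_set, from_query = set_and_query(10_000)  # request 1 ms
    print("set reported  :", hundred_ns_to_ms(from_set), "ms")
    print("query reported:", hundred_ns_to_ms(from_query), "ms")
    if from_set != from_query:
        print("mismatch: another component may have changed the resolution")
```

A mismatch between the two values corresponds to the condition Error 1 reports. Because the system timer resolution is shared, any other process that requests a different resolution between the two calls can produce it.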

Error 2: This error is similar to Error 1 in that the system timer resolution differs from the value
that was set by the NtSetTimerResolution function call.

Error 3: The details of this error are limited, but it appears to be associated with Errors 1 and 2 in that the test is
unable to start its measurements due to other activity in the system. The test waits 2500ns before attempting
to start again. This error can occur multiple times throughout a test run.

Error 4:

The test runs through a number of different clock resolutions (the selection of frequencies appears to be
somewhat random). When this error occurs, additional information is provided: the log indicates where in
the test the measured value differs from the expected value, “Previous results in this
log marked with '*' are out of tolerance. max = 14880, min = 496”

Example:
8/7/2019 9:17:50.355 AM 00060: 0x00000166f907c03e 0x0000000000001360 4960
8/7/2019 9:17:50.355 AM 00061: 0x00000166f907c180 0x0000000000000142 322 *
8/7/2019 9:17:50.355 AM 00062: 0x00000166f907d39f 0x000000000000121f 4639

As shown in the listing above, at 8/7/2019 9:17:50.355 AM the test expected to measure 4960ns but measured
322ns. It is odd that the value is lower than expected, since a disturbance such as disk I/O activity
would be expected to make the value larger, not smaller. This is likely due to the
resolution being changed by another Windows component. It has been observed that when the test changes
the clock resolution, a number of change events occur in between the requests from the test.
This indicates that another software component is changing the resolution during the test, as
illustrated in Figure 3.

Figure 3. Clock Change Requests and Events.
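
The arithmetic in the Error 4 log excerpt can be checked with a short script. This is a minimal sketch that assumes the columns are an entry index, a timestamp in hex, the delta from the previous timestamp in hex, and the same delta in decimal. That column layout is inferred from the three lines shown, not taken from HLK documentation. The bounds are the max and min values quoted from the log message.

```python
import re

LOG = """\
8/7/2019 9:17:50.355 AM 00060: 0x00000166f907c03e 0x0000000000001360 4960
8/7/2019 9:17:50.355 AM 00061: 0x00000166f907c180 0x0000000000000142 322 *
8/7/2019 9:17:50.355 AM 00062: 0x00000166f907d39f 0x000000000000121f 4639
"""

# Bounds quoted in the log: "max = 14880, min = 496".
MIN_DELTA, MAX_DELTA = 496, 14880

LINE = re.compile(r"(\d{5}): (0x[0-9a-f]+) (0x[0-9a-f]+) (\d+)( \*)?$")


def parse(log: str):
    """Parse each log line into a dict (assumed column meanings, see above)."""
    rows = []
    for line in log.splitlines():
        m = LINE.search(line)
        if m:
            rows.append({
                "index": int(m.group(1)),
                "timestamp": int(m.group(2), 16),
                "delta_hex": int(m.group(3), 16),
                "delta": int(m.group(4)),
                "flagged": m.group(5) is not None,
            })
    return rows


rows = parse(LOG)
for prev, curr in zip(rows, rows[1:]):
    # The delta column should equal the difference of the two timestamps.
    computed = curr["timestamp"] - prev["timestamp"]
    out_of_range = not (MIN_DELTA <= curr["delta"] <= MAX_DELTA)
    print(curr["index"], computed, curr["delta"], "out of tolerance" if out_of_range else "ok")
```

For this excerpt the computed differences agree with the logged deltas, and only entry 61 (322) falls outside the quoted bounds, matching the '*' marker. Two further observations follow from the numbers alone: the quoted bounds equal one tenth of and three times the expected 4960, and the short interval plus the one after it sum to 4961, within one unit of the expected value.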

It has also been observed that clock resolution change events occur whenever disk I/O activity occurs, as illustrated in
Figure 4.

Figure 4. Disk I/O and Clock Resolution Change Events.

It is unclear what is issuing these change events, but they are likely related to the Windows power management
subsystem. The longer the disk I/O, the more likely these resolution change events are to disrupt
the HLK Clock Interrupt Test.

In our analysis of various traces, on occasions where the test passes, the number of disk I/O disruptions is
minimal. If a drive exhibits longer I/O (higher-latency commands), the test is more likely to fail.

Anomalies discovered in the traces for some disk I/O appear to influence the test failure. Given that the Clock
Interrupt Test measures clock skew based on ISR/DPC responsiveness, the interrupt processing for
the NVMe driver was examined. As shown in Figure 5, there are some areas where the driver takes more time
than usual to process I/O.

Figure 5. Disk I/O Anomaly.

Looking at the detail of these anomalies, there are actually two peaks of interest, both of which exhibit the same
phenomenon. On occasion, a write takes a long time to complete; see Figure 6, where a single
4K write takes 46988.900us to complete.

Figure 6. Detailed view of long I/O write Anomaly.

As shown in Figure 4, clock resolution change events occur whenever there is disk I/O. The longer the disk I/O, the
more likely the HLK Clock Interrupt Test will be disrupted. Since it is suspected that these clock resolution
changes come from the Windows power management subsystem, selecting the “Performance”
Power Profile makes the test more likely to pass. Decreasing the amount of disk I/O activity by disabling
services such as Windows Defender or Windows Search Indexing also results in a higher likelihood of the
test passing.

The findings here are not final. Further investigation is required to determine the exact correlation between
the test failures and specific disk I/O, and to determine why some I/O operations have
high latency. Ultimately, drives with lower-latency I/O are expected to exhibit fewer failures, whereas drives with these
anomalies are expected to exhibit a higher number of test failures.

It has been observed that in some cases where failures are reported in the log file, there is no corresponding
disk I/O. Although Disk I/O can contribute to the test failing, it appears there are other factors which could
cause the test to fail beyond Disk I/O.
