
SVR422

Windows Hang and Crash Dump Analysis


Mark Russinovich
Chief Software Architect
Winternals Software
Copyright 2006 Mark Russinovich

About The Speaker


Co-author of Windows Internals and Inside Windows 2000 (Microsoft Press)
Senior Contributing Editor, Windows IT Pro Magazine
Author of tools on www.sysinternals.com
Co-founder and chief software architect of Winternals Software (www.winternals.com)
Microsoft Most Valuable Professional (MVP) 2005, 2006
Teaches public and private live classes on Windows Internals and Advanced Troubleshooting with David Solomon (www.solsem.com)

Outline
Crash dumps and tools
Analysis basics
  IRQLs
  Stacks
Analyzing an easy crash
Un-analyzable crashes
  Crash transformation
  Buffer overrun
  Code overwrite
  Microsoft Windows Memory Diagnostic
Manual analysis
  Stack trashes
Hung Systems
When there is no crash dump

Introduction
Many systems administrators ignore Windows crash dump options
  "I didn't know I could analyze crashes"
  "Crash analysis is too hard"
  "A crash dump won't tell me anything anyway"
Basic crash dump analysis is actually pretty straightforward
  Even if only 1 out of 5 or 10 dumps tells you what's wrong, isn't it worth spending a few minutes?
More advanced crash dump analysis is much harder
  Not well documented
  Requires advanced internals, compiler and CPU knowledge
  Requires lots of experience
  Often difficult to pinpoint the cause
    More often than not, the victim is not the culprit
    For example, a driver corrupts an operating system structure; Windows crashes later

Why Does Windows Crash?
Windows crashes when something is wrong in kernel mode:
  Unhandled exception (for example, executing an invalid instruction)
  OS or driver detects a severe inconsistency
  Referencing paged-out memory at interrupt level (the famous IRQL_NOT_LESS_OR_EQUAL crash)
  A reschedule is attempted at dispatch level IRQL or higher
  Hardware error

INF402

Why Does Windows Crash?
Microsoft's analysis of crash root causes indicates:
  ~70% caused by third-party driver code
  ~15% caused by unknown (memory is too corrupted to tell)
  ~10% caused by hardware issues
  ~5% caused by Microsoft code
There are lots of third-party drivers!
From the online crash analysis database:
  55,000 unique drivers, 24 new/day (28,000 in 2004)
  220,000 total drivers, 98 revised/day (130,000 in 2004)

What Happens at the Crash
When a condition is detected that requires a crash, KeBugCheckEx is called
Takes five arguments:
  Stop code (also called bugcheck code)
  Four stop-code-defined parameters
KeBugCheckEx:
  Turns off interrupts
  Tells other CPUs to stop
  Paints the blue screen
  Notifies registered drivers of the crash
  If a dump is configured (and it is safe to do so), writes dump to disk

Many Devices
Over 1,263,300 distinct Plug and Play (PnP) IDs (680,000 in 2004)
1,600 PnP IDs added every day

Bugcheck Codes
Bugcheck codes are shared by many components and drivers
There are about 150 defined stop codes
Two common ones are:
  (DRIVER_)IRQL_NOT_LESS_OR_EQUAL (0x0A, or 0xD1 for the DRIVER_ variant): usually an invalid memory access
  UNEXPECTED_KERNEL_MODE_TRAP (0x7F) and KMODE_EXCEPTION_NOT_HANDLED (0x1E): generated by executing garbage instructions, usually caused when a stack is trashed
Most are documented in the Debugging Tools help file
  Also search the Microsoft Knowledge Base (www.microsoft.com/support)
Often, the bugcheck code and parameters are not enough to solve the crash
  Need to examine the crash dump

Crash Dump Options
Small Memory Dump (aka minidump or Triage Dump)
  Default for Microsoft Windows 2000/Windows XP Professional/Home
  Only 64 KB (128 KB on 64-bit systems, up to 512 KB on Vista)
  Contains minimal crash information
  Creates a unique file name in \Windows\Minidump after reboot
Kernel
  Writes OS memory and not processes
    Most crash debugging doesn't involve looking at process memory anyway
  Useful for large memory systems
  Overwrites every time
  Default on Windows Vista
Full
  Writes all of RAM
  Overwrites every time
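
To illustrate how the stop code and its four parameters fit together, the following Python sketch parses a blue-screen STOP line and names the stop codes mentioned on the Bugcheck Codes slide. The sample line, the parse_stop_line helper and the regular expression are assumptions made for this example; the parameter values are made up.

```python
import re

# Stop codes mentioned on the Bugcheck Codes slide.
# 0xD1 is the DRIVER_ variant of 0x0A.
STOP_CODES = {
    0x0A: "IRQL_NOT_LESS_OR_EQUAL",
    0x1E: "KMODE_EXCEPTION_NOT_HANDLED",
    0x7F: "UNEXPECTED_KERNEL_MODE_TRAP",
    0xD1: "DRIVER_IRQL_NOT_LESS_OR_EQUAL",
}

# Matches "STOP: <code> (<p1>, <p2>, <p3>, <p4>)" with hexadecimal values.
STOP_LINE = re.compile(
    r"STOP:\s*(0x[0-9A-Fa-f]+)\s*\(\s*((?:0x[0-9A-Fa-f]+\s*,?\s*){4})\)"
)


def parse_stop_line(text):
    """Return (stop code, symbolic name, four parameters) from a STOP line."""
    match = STOP_LINE.search(text)
    if match is None:
        raise ValueError("no STOP line found")
    code = int(match.group(1), 16)
    params = [int(p, 16) for p in re.findall(r"0x[0-9A-Fa-f]+", match.group(2))]
    return code, STOP_CODES.get(code, "UNKNOWN"), params


# Hypothetical blue-screen line, for illustration only.
sample = "*** STOP: 0x000000D1 (0xE1147008, 0x00000002, 0x00000000, 0xF86B5A89)"
code, name, params = parse_stop_line(sample)
print(hex(code), name, [hex(p) for p in params])
```

The lookup table is deliberately tiny; the Debugging Tools help file remains the place to look up the full list of stop codes and the meaning of each parameter.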

Minidumps
On Windows XP, Windows Server 2003, and Windows Vista, a minidump is always created, even if the system is set to full or kernel dump
Can extract a minidump from a kernel or full dump using the debugger .dump /m command
To analyze, requires access to the images on the system that crashed
  At a minimum, must have access to Ntoskrnl.exe
  Microsoft Symbol Server now has images for Windows XP and later
    Set image path to same as symbol path (covered later)

Writing a Crash Dump
Crash dumps are written to the paging file
  Too risky to try to create a new file (no guarantee you will get a dump anyway)
How is even this protected?
  When the system boots it checks HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\CrashControl
  The boot volume paging file's on-disk mapping is obtained
  Relevant components are checksummed:
    Boot disk miniport driver
    Crash I/O functions
    Page file map
  On crash, if the checksum doesn't match, the dump is not written
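
The boot-time checksum idea can be sketched as follows. This is a conceptual model, not Windows code: CRC32 stands in for whatever checksum the kernel actually uses, and the component names and byte strings are made up for the example.

```python
import zlib


def checksum(components):
    """Return one CRC32 over all component images, in a fixed (sorted) order."""
    crc = 0
    for name in sorted(components):
        crc = zlib.crc32(components[name], crc)
    return crc


def safe_to_write_dump(components, boot_checksum):
    """At crash time, only write the dump if nothing changed since boot."""
    return checksum(components) == boot_checksum


# Hypothetical stand-ins for the three items named on the slide.
components = {
    "boot_disk_miniport": b"\x55\x8b\xec\x83\xec\x10",
    "crash_io_functions": b"\x8b\xff\x55\x8b\xec",
    "page_file_map": b"\x00\x10\x00\x00\x00\x20\x00\x00",
}
boot_checksum = checksum(components)          # taken once at boot

components["crash_io_functions"] = b"\x8b\xff\x55\x8b\x00"  # later corruption
print(safe_to_write_dump(components, boot_checksum))
```

The point of the check is that a crash can be caused by the very corruption that would make writing to disk dangerous, so a mismatch means giving up on the dump rather than risking the disk contents.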


Why Would You Not Get a Dump?
Crash occurred before paging file was open
  For example, a crash during driver initialization
The crash corrupted components involved in the dump process
Spontaneous reboot
Paging file on boot volume is too small
Not enough free space for extracted dump
Hung system
We'll cover how to troubleshoot these problems later

At The Reboot
[Diagram: Session Manager, WinLogon and SaveDump in user mode; NtCreatePagingFile in kernel mode; the Paging File and Memory.dmp]

At The Reboot
Session Manager process (\Windows\system32\smss.exe) initializes paging file
NtCreatePagingFile determines if the dump has a crash header
  Protects the dump from use
  Note: crash dump portion of paging file is in use during the copy, so virtual memory can run low while the copy is in progress
WinLogon calls NtQuerySystemInformation to tell if there's a dump to extract
If there's a dump, Winlogon executes SaveDump (\Windows\system32\savedump.exe)
  Writes an event to the System event log
  SaveDump writes contents to appropriate file
  On Windows XP or later, checks to see if Windows Error Reporting should be invoked

Online Crash Analysis (OCA)
By default, after a reboot Windows XP/Windows Server 2003 prompts you to send information to http://watson.microsoft.com
Can be configured with Computer Properties->Advanced->Error Reporting
Can be customized with Group Policies
  Do/do not show UI
  Send dump to an internal error reporting server

Windows Error Reporting
Savedump checks if kernel error reporting is enabled
  Checks two values under HKLM\Software\Microsoft\PCHealth\ErrorReporting: IncludeKernelFaults and DoReport
If crash reporting is enabled, Savedump:
  Extracts a minidump from the dump file (if system set to full or kernel dumps)
  Writes the name of the minidump under HKLM\Software\Microsoft\PCHealth\ErrorReporting\KernelFaults
  Adds a command to execute Dumprep.exe to HKLM\Software\Microsoft\Windows\CurrentVersion\Run
    This will cause it to run at the first user log on
Dumprep then:
  Generates an XML description of system version, drivers present, loaded plug and play drivers
  Depending on the configuration, displays the message box (if enabled) to send the dump
  Submits the dump for automatic analysis

What Gets Sent
1. XML description of system version, drivers present, loaded plug and play drivers
2. Minidump file

<?xml version="1.0" encoding="Unicode" ?>
<SYSTEMINFO>
  <SYSTEM>
    <OSNAME>Microsoft Windows XP Professional</OSNAME>
    <OSVER>5.1.2477 0.0</OSVER>
    <OSLANGUAGE>1033</OSLANGUAGE>
  </SYSTEM>
  <DRIVERS>
    <DRIVER>
      <FILENAME>ac97intc.sys</FILENAME>
      <FILESIZE>98112</FILESIZE>
      <CREATIONDATE>05-17-2001 06:31:52</CREATIONDATE>
      <VERSION>5.10.0.3518</VERSION>
      <MANUFACTURER>Intel Corporation</MANUFACTURER>
      <PRODUCTNAME>Intel(r) Integrated Controller Hub Audio Driver</PRODUCTNAME>
    </DRIVER>


What Does OCA Do?
Server farm uses !analyze, but looks up crash fingerprint in Microsoft's crash resolution database
Sometimes OCA will point you at KB articles that describe the problem
  KB articles may tell you to use Windows Update to get newer drivers, a hotfix, or install a Service Pack
Many times OCA will say "A driver caused a problem"
  OCA can't tell you when it suspects a driver that hasn't been conclusively identified as being responsible by hand analysis

Analyzing a Crash Dump Yourself
There are two kernel-level debuggers that can open crash dump files:
  WinDbg: Windows program
  Kd: command-line program
  Both provide same kernel debugger analysis commands
Must first configure to point to symbols
  Easiest to use Microsoft Symbol Server for symbol access
  Windbg: click on File->Symbol File Path
  Enter srv*c:\symbols*http://msdl.microsoft.com/download/symbols
If a minidump, must also configure image path to point to location of images (File->Image File Path)
  Use same string as for symbol server (Windows XP and beyond)
To open a crash dump:
  WinDbg: File->Open Crash Dump
  Kd crash dump syntax: kd -z <memory dump file> -y <symbols directory> -i <image path>

IRQLs
IRQL stands for Interrupt Request Level
  Each CPU maintains IRQL independently
  Software and hardware interrupts map to IRQLs
  When a CPU raises its IRQL to a level, all interrupts at that level and below are masked for that CPU
[Diagram: IRQLs from highest to lowest: SYNCH_LEVEL, ..., DEVICE_IRQL 2, DEVICE_IRQL 1 (hardware interrupts), DISPATCH_LEVEL, APC_LEVEL (software interrupts), PASSIVE_LEVEL; levels above the current IRQL are unmasked, levels at or below it are masked]
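
The masking rule on the IRQLs slide can be modelled in a few lines of Python. This is a toy model rather than kernel code: the Cpu class and its method names are invented for the example, and DEVICE_IRQL_EXAMPLE is a hypothetical device level chosen only to sit above DISPATCH_LEVEL.

```python
PASSIVE_LEVEL = 0
APC_LEVEL = 1
DISPATCH_LEVEL = 2
DEVICE_IRQL_EXAMPLE = 5  # hypothetical device interrupt level, illustration only


class Cpu:
    """Toy model of one CPU's IRQL; each CPU tracks its own level."""

    def __init__(self):
        self.irql = PASSIVE_LEVEL

    def raise_irql(self, new_irql):
        """Raise the IRQL and return the previous level so it can be restored."""
        if new_irql < self.irql:
            raise ValueError("raise_irql cannot lower the IRQL")
        previous, self.irql = self.irql, new_irql
        return previous

    def lower_irql(self, previous_irql):
        """Restore a level previously returned by raise_irql."""
        if previous_irql > self.irql:
            raise ValueError("lower_irql cannot raise the IRQL")
        self.irql = previous_irql

    def is_masked(self, interrupt_irql):
        """An interrupt is masked when its level is at or below the current IRQL."""
        return interrupt_irql <= self.irql


cpu = Cpu()
old = cpu.raise_irql(DISPATCH_LEVEL)
print(cpu.is_masked(APC_LEVEL), cpu.is_masked(DEVICE_IRQL_EXAMPLE))
cpu.lower_irql(old)
```

Because each Cpu object keeps its own irql, raising the level on one CPU says nothing about what the others will accept, which matches the first bullet on the slide.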

Key IRQLs
PASSIVE_LEVEL:
  No interrupts are masked
  User mode code always executes at PASSIVE_LEVEL
  Kernel-mode code executes at PASSIVE_LEVEL most of the time
DISPATCH_LEVEL:
  Highest software interrupt level
  Scheduler is off
  Page faults cannot be handled and are illegal operations

Stacks
Each thread has a user-mode and kernel-mode stack
  The user-mode stack is usually 1 MB on x86
  The kernel-mode stack is typically 12 KB (20 KB for GUI threads) on x86 systems
Stacks allow for nested function invocation
  Parameters can be passed on the stack
  Stores return address
  Serves as storage for local variables


Stack Frames
[Diagram: three nested stack frames, higher addresses at the top]
Function 1 stack frame:
  Parameter 1
  Return Address
  Frame Pointer
  Local Variable 1
  Local Variable 2
Function 2 stack frame:
  Parameter 3
  Parameter 2
  Parameter 1
  Return Address
  Frame Pointer
  Local Variable 1
  Local Variable 2
Function 3 stack frame:
  Parameter 2
  Parameter 1
  Return Address
  Frame Pointer
  Local Variable 1

Calling Conventions
Stacks are easy to interpret if functions use standard calling conventions
Other calling conventions make the stack hard to figure out
  No frame pointer
  Register arguments (fast calls)
A debugger requires symbol information to parse non-standard stack frames
  Makes accurate analysis of crashes involving third-party drivers difficult
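
When every function saves a frame pointer, as in the Stack Frames layout, a debugger can recover the call chain by following the saved frame pointers. The sketch below simulates that walk for 32-bit x86 over a dictionary standing in for memory; the addresses and the walk_frames helper are invented for the example.

```python
def walk_frames(memory, frame_pointer, max_frames=32):
    """Return the return addresses found by following saved frame pointers.

    memory maps an address to the 32-bit value stored there. In the standard
    x86 layout the saved frame pointer is at [ebp] and the return address
    is at [ebp + 4].
    """
    return_addresses = []
    while frame_pointer != 0 and len(return_addresses) < max_frames:
        saved_frame_pointer = memory[frame_pointer]
        return_addresses.append(memory[frame_pointer + 4])
        frame_pointer = saved_frame_pointer
    return return_addresses


# Hypothetical stack contents: Function 3 was called by Function 2,
# which was called by Function 1. Higher addresses are older frames.
memory = {
    0x1000: 0x1020, 0x1004: 0x00401234,  # Function 3 frame
    0x1020: 0x1040, 0x1024: 0x00401100,  # Function 2 frame
    0x1040: 0x0000, 0x1044: 0x00401000,  # Function 1 frame (end of chain)
}
print([hex(a) for a in walk_frames(memory, 0x1000)])
```

If a function omits the frame pointer or takes its arguments in registers, the chain above is broken or incomplete, which is why the debugger then needs symbol information to reconstruct the frames.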

NotMyFault.exe
In order to demonstrate common crash scenarios, Mark wrote NotMyFault.exe
  Download from http://www.sysinternals.com/files/notmyfault.zip
It loads MyFault.sys
  MyFault.sys has an IOCTL interface that implements different bugs
[Diagram: NotMyFault.exe in user mode calls the IOCTL interface of MyFault.sys in kernel mode]

Generating an Easy Crash
Run NotMyFault and select "High IRQL fault (kernel mode)"
  Allocates paged pool buffer
  Frees the buffer
  Raises IRQL >= DISPATCH_LEVEL
  Touches the buffer and pages following the buffer
Paged buffers that are marked "not present" but are touched when IRQL >= DISPATCH_LEVEL result in the DRIVER_IRQL_NOT_LESS_OR_EQUAL bug check
  Memory Manager calls KeBugCheckEx from page fault handler
  The IRQL is not less than or equal to the maximum IRQL at which the operation is legal (which is < DISPATCH_LEVEL)

Analyzing an Easy Crash
Open crash dump with Windbg
!analyze easily identifies MyFault.sys by looking at the KeBugCheckEx parameters
  The Memory Manager looked at the stack and determined the address that caused the page fault
!analyze often looks at the stack to determine the cause of a crash
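
Opening a dump and running !analyze can also be done from the command line with Kd. The sketch below only builds the argument list, using the -z, -y and -i options and the symbol server string given in this deck; the kd_command helper, the default cache directory and the use of -c to pass an initial command are assumptions to check against the Debugging Tools help for your version.

```python
SYMBOL_SERVER = "http://msdl.microsoft.com/download/symbols"


def kd_command(dump_file, symbol_cache=r"c:\symbols", initial_command="!analyze -v"):
    """Build a Kd argument list that opens a crash dump and runs one command.

    The same srv* string is used for the symbol path and the image path,
    which is what a minidump needs on Windows XP and later.
    """
    symbol_path = "srv*{}*{}".format(symbol_cache, SYMBOL_SERVER)
    return [
        "kd",
        "-z", dump_file,        # crash dump to open
        "-y", symbol_path,      # symbol path
        "-i", symbol_path,      # image path
        "-c", initial_command,  # command to run once the dump is loaded
    ]


print(" ".join(kd_command(r"c:\windows\memory.dmp")))
```

The list form can be handed to subprocess.run on a machine where the Debugging Tools are installed and kd is on the path.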


Automated Analysis
When you open a crash dump with Windbg or Kd you get a basic crash analysis:
  Stop code and parameters
  A guess at the offending driver
The analysis is the result of the automated execution of the !analyze debugger command
  !analyze uses heuristics to walk up the stack and determine what driver is the likely cause of the crash
  "Followup" is taken from optional triage.ini file
Don't trust blame of ntoskrnl, win32k, hal, ntfs or other core Windows components

Crash Transformation
Many crashes can't be analyzed
  The victim crashed the system, not the criminal
  The analyzer may point at Ntoskrnl.exe or Win32K.sys or other Windows components
  Or, you may get many different crash dumps all pointing at different causes
Your goal isn't to analyze impossible crashes
  It's to try to make an unanalyzable crash into one that can be analyzed

Using the Driver Verifier
The tool for crash transformation is the Driver Verifier (Verifier.exe, not in Start menu)
  Introduced in Windows 2000
  Helps developers test their drivers and systems administrators identify faulty drivers
Run Verifier.exe
  Choose "Create Custom Settings"
  Choose "Select Individual Settings from a List"
  Enable all options except Low Resource Simulation

Selecting Drivers to Verify
Don't verify all the drivers
  Performance hit will make system unusable
  Limits effectiveness of the Verifier

Crash Transformation Recipe
The Recipe:
  1. First, try any suspicious drivers (recently updated, known to be problematic, etc.)
  2. If still un-analyzable crashes, try enabling verification on all third-party drivers and/or all unsigned drivers
  3. As a last resort enable verification on groups of 10-20 drivers at a time
  4. Run the Windows Memory Diagnostic
The following crash examples demonstrate the Driver Verifier making un-analyzable crashes into ones that point at the problem
  Buffer overflow
  System code overwrite
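
Verifying drivers in groups of 10-20 at a time is easy to script. The sketch below splits a driver list into batches and builds one Verifier command line per batch. The driver names are made up, and using `verifier /standard /driver` is an assumption made for brevity: the deck configures the Verifier through its GUI, and the standard settings may not match "all options except Low Resource Simulation" exactly.

```python
def verifier_batches(drivers, group_size=15):
    """Split the driver list into groups (the deck suggests 10-20 per group)."""
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    return [drivers[i:i + group_size] for i in range(0, len(drivers), group_size)]


def verifier_command(batch):
    """Build a Verifier command line for one batch (assumed /standard settings)."""
    return ["verifier", "/standard", "/driver"] + list(batch)


# Hypothetical driver names, for illustration only.
drivers = ["drv{:02d}.sys".format(n) for n in range(35)]
for batch in verifier_batches(drivers):
    print(" ".join(verifier_command(batch)))
```

Verifier settings chosen this way apply from the next boot, so each batch means a reboot followed by a wait to see whether the crash reproduces with a more useful dump.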


Buffer Overruns
Result when a driver goes past the end (overrun) or the beginning (underrun) of a buffer
Usually detected when overwritten data is referenced
  Another driver or the kernel makes the reference
  There can be a long delay between corruption and detection
[Diagram: pool memory from lower to higher addresses: Driver Buffer, Pool Structures, Another Driver's Buffer]

Causing a Buffer Overrun
Run NotMyFault and select "Buffer Overrun"
  Allocates a nonpaged pool buffer
  Writes a string past the end
Note that you might have to run several times since a crash will occur only if:
  The kernel references the corrupted pool structures
  A driver references the corrupted buffer
The crash tells you what happened, but not why

A Buffer Overrun Bluescreen
In this example, where the crash was the result of the kernel tripping on corrupt pool tracking structures, the Bluescreen tells you what to do:
[Screenshot: blue screen]

What is Special Pool?
Special pool is a kernel buffer area where buffers are sandwiched with invalid pages
Conditions for a driver allocating from special pool:
  Driver Verifier is verifying driver
  Special pool is enabled and available
  Allocation is slightly less than one page (4 KB on x86)
Special pool is a limited resource
  When it runs out verified drivers allocate from standard pool
Note: can be enabled without rebooting
[Diagram, lower to higher addresses: Page n (Invalid); Page n+1 (Signature, then Buffer at the end of the page); Page n+2 (Invalid)]
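
The reason special pool catches an overrun immediately is that the buffer is pushed to the end of its page, so the first byte past the end lies on an invalid page. A small address calculation shows this; the helper names and the page base are invented for the example, and the model ignores any alignment the real allocator applies.

```python
PAGE_SIZE = 4096  # x86 page size, as on the slide


def place_at_page_end(page_base, size):
    """Return the buffer start address so that the buffer ends at the page end."""
    if not 0 < size <= PAGE_SIZE:
        raise ValueError("allocation must fit in one page")
    return page_base + PAGE_SIZE - size


def touches_invalid_page(page_base, address):
    """True when the address falls outside the one valid page (page n+1)."""
    return not (page_base <= address < page_base + PAGE_SIZE)


page_base = 0x10000                        # hypothetical address of page n+1
start = place_at_page_end(page_base, 100)  # 100-byte allocation
print(touches_invalid_page(page_base, start + 99))   # last byte of the buffer
print(touches_invalid_page(page_base, start + 100))  # first byte past the end
```

A write one byte before the buffer still lands on the valid page, in the signature area shown in the diagram, so in this layout an underrun is not trapped by the invalid page the way an overrun is.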

The Verifier Catching Buffer Overrun
The Driver Verifier catches the overrun when it occurs
  The Bluescreen tells you whose fault it is
  !analyze explains the crash and also tells you the buggy driver name
  The stack shows where the driver bug is

Code Overwrites
Caused when a bug results in a wild pointer
  A wild pointer that points at invalid memory is easily detected
  A wild pointer that points at data is similar to buffer overrun
    Might not cause a problem for a long time
    Crash makes it look like it's something else's fault
System code write protection catches code overwrite, but it's not on if:
  It's a Windows 2000 system with > 127 MB memory
  It's a Windows XP or Windows Server 2003 system with > 255 MB
  In other words, it's off on most systems


Causing a Code Overwrite
Run NotMyFault and select "Code Overwrite"
  Overwrites first bytes of nt!NtReadFile
  Function is most common entry to I/O system so a random thread will cause the crash
The crash hints that the fault occurred in NtReadFile
  The last user-mode address is ZwReadFile
  The ebx register in the exception frame points at NtReadFile
  NtReadFile's start location looks scrambled (u ntreadfile)

System Code Write Protection
To obtain a more obvious crash, enable system code write protection by turning on Driver Verifier on one or more drivers
  Can also manually enable by setting, under HKLM\System\CurrentControlSet\Control\Session Manager\Memory Management:
    LargePageMinimum REG_DWORD 0xFFFFFFFF
    EnforceWriteProtection REG_DWORD 1
  Reboot to take effect
Rerun NotMyFault
  Crash occurs immediately and even the blue screen points at MyFault.sys
  !analyze shows the address of the write and the target (NtReadFile)
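
The two registry values for manually enabling system code write protection can be captured in a .reg file so that the change is repeatable. The sketch below generates the file text; the reg_file helper is invented for the example, and the output should be reviewed before importing it on a real system.

```python
MEMORY_MANAGEMENT_KEY = (
    "HKEY_LOCAL_MACHINE\\System\\CurrentControlSet\\Control\\"
    "Session Manager\\Memory Management"
)


def reg_file(key, dword_values):
    """Return .reg file text that sets REG_DWORD values under one key."""
    lines = ["Windows Registry Editor Version 5.00", "", "[{}]".format(key)]
    for name, value in dword_values.items():
        # .reg files write DWORDs as eight hexadecimal digits.
        lines.append('"{}"=dword:{:08x}'.format(name, value))
    return "\r\n".join(lines) + "\r\n"


print(reg_file(MEMORY_MANAGEMENT_KEY, {
    "LargePageMinimum": 0xFFFFFFFF,
    "EnforceWriteProtection": 1,
}))
```

As the slide notes, the setting only takes effect after a reboot.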

Windows Memory Diagnostic
Memory errors are a significant cause of hardware-related crashes
Windows Memory Diagnostic checks memory for errors
  Free download from Microsoft.com
    http://oca.microsoft.com/en/windiag.asp
  Installs to floppy or CD-ROM
  Built into Windows Vista
Run at least one pass

Manual Analysis
Sometimes !analyze isn't enough
  Doesn't tell you anything useful
  You want to know what was happening at the time of the crash
Useful commands:
  List loaded drivers: lm kv
    Make sure drivers are all recognized and up to date
  Look at memory usage: !vm
    Make sure memory pools are not full
    If full, use !poolused (requires pool tagging to be on)
  Examine current thread: !thread
    May or may not be related to the crash
  List all processes: !process 0 0
    Make sure you understand what was running on the system
  If the Verifier detected a deadlock: !deadlock
  Additional commands: !help

Stack Trashing
An example of a crash requiring manual analysis is a stack trash
Stack trashes have several possible causes:
  A driver pushing things on the stack causes the stack to overflow
  A driver overruns a stack-allocated buffer
Usually results in garbage code being executed (KMODE_EXCEPTION_NOT_HANDLED)
  Driver Verifier can't determine cause
  Since the stack is corrupted, analysis is especially hard


Debugging Stack Trashes
Run NotMyFault and select "Stack Trash"
  Allocates a buffer on the stack
  Overruns the buffer
  Returns to the caller
Crash doesn't show much offhand
  !analyze actually blames Win32K.sys, the Win32 kernel-mode subsystem
  Stack doesn't show anything except an exception handler
Look deeper
  !thread shows an outstanding IRP
  !irp <irp> shows that myfault.sys was the target of the IRP

Troubleshooting Crashes That Don't Generate Crash Dumps
If you are getting crashes with no resulting dump (or other spontaneous reboots), you need to boot in Debugging Mode:
  Press F8 during the boot and choose Debugging Mode
  Or, edit the target's boot.ini file to configure:
    /debugport=comX /baudrate=XXX (note: default baud rate in Debugging Mode is 19200)
    Windows XP and Windows 2003 support 1394
    Windows Vista supports USB 2.0
In either case, this loads the kernel debugger at boot time
  Does not affect performance
  On a crash, the system will wait indefinitely for a debugger connection and will not reboot, even if configured to do so!
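
The boot.ini change amounts to appending the two switches to the operating system entry. A minimal sketch, assuming a serial connection; the helper names and the sample boot.ini entry are made up for the example.

```python
def debug_switches(com_port, baud_rate=19200):
    """Return the boot.ini switches for serial kernel debugging.

    19200 is the Debugging Mode default mentioned on the slide.
    """
    if com_port < 1:
        raise ValueError("COM port numbers start at 1")
    return "/debugport=com{} /baudrate={}".format(com_port, baud_rate)


def add_debug_switches(boot_entry, com_port, baud_rate=19200):
    """Append the debugging switches to one [operating systems] entry."""
    return boot_entry.rstrip() + " " + debug_switches(com_port, baud_rate)


# Hypothetical boot.ini entry, for illustration only.
entry = r'multi(0)disk(0)rdisk(0)partition(1)\WINDOWS="Windows XP (debug)" /fastdetect'
print(add_debug_switches(entry, 1, 115200))
```

Adding the switches to a copy of the normal entry, rather than editing it in place, keeps a non-debug boot option available in the boot menu.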

Connecting to a Crashed System
When system crashes, attach a kernel debugger and analyze
  In Windbg, choose File->Kernel Debug
  Configure baud rate and COM port
  Click OK
Debugger should connect and display the bugcheck code
  Type !analyze -v, and if necessary, perform additional analysis commands as described earlier
To save complete memory dump for offline analysis, use .dump (or .dump /f to capture a full dump)
  Note: this will be slow over a serial cable

Hung Systems
Sometimes system becomes unresponsive
  Keyboard and mouse freeze
Two types of hang:
  Instant lockup
    Kernel synchronization deadlock
    Infinite loop at high IRQL or very high priority thread
  Grinding to a halt
    Storage stack resource deadlock
Two techniques that both require prior setup and a reboot:
  Manually crash the hung system and hope you get a dump to analyze offline
  Boot the system in debugging mode and when it hangs, break in with the kernel debugger and analyze system

Initiating a Manual Crash
Crash from keyboard
  Requires PS/2 keyboard and right control key
  Hold the right CTRL key and press Scroll Lock twice
  Must be configured in the Registry: HKLM\SYSTEM\CurrentControlSet\Services\i8042prt\Parameters\CrashOnCtrlScroll (DWORD) set to value of 1
    Documented in Debugging Tools help file
  Keyboard interrupts must run for this to work
Use a hardware dump switch
  Some servers come with an NMI button
  You can also make one: http://www.microsoft.com/whdc/system/CEC/dmpsw.mspx
  Must be configured in the Registry: HKLM\System\CurrentControlSet\Control\CrashControl\NMICrashDump (DWORD) set to value of 1

Breaking into a Hung System
Instead of crashing you can boot in debugging mode and break in when it hangs
After the hang, connect the host debugger system to the target
  Run WinDbg (or KD)
  Press Ctrl-C (or click Debug->Break); this breaks into the target system


Analyzing a Hang
Then attempt to determine reason for hang. (This is the hard part.)
  Use !thread to see what's running and check the stack
  Check each CPU by using the ~ command, for example, ~0, ~1
  Use !locks to look at possible deadlocks
  Use !irql to see previous IRQL (Windows Server 2003 and later)
If you can't figure it out but want to save it for later analysis:
  Use .crash to force a crash
  Or .dump to save the current state of the system in a dump file
    This can also be done with LiveKD (free from Sysinternals) on a live system

Generating a Hung System
Enable keyboard-initiated manual crash and reboot
Run Notmyfault
Select "Hang" and press "Do Bug"
On reboot, open dump and look at current thread
  !thread
  Remember to check each CPU of an SMP system: ~0, ~1, etc.
Try to determine reason for hang

Analyzing a Sick System
Sometimes a system is still responsive, but you know that something is wrong with it
  You want to look at its kernel state, but
  You don't want to take it off line by crashing it or connecting a debugger to it
You can get a dump of a live system with LiveKd (free download from Sysinternals.com)
  Use it to run Windbg or Kd
  Use .dump to snapshot live system

The Bluescreen Screen Saver
Scare your enemies and fool your friends with the Sysinternals Bluescreen Screen Saver
  Remotely execute it (requires admin privilege on remote system):
    psexec -i -d -c "sysInternals bluescreen.scr" /s
Be careful, your job may be on the line!

More Information
Windows Internals, 4th Edition
  Chapter 10: Crash Dump Analysis
The help file which is installed with Debugging Tools for Windows
Knowledge Base Articles
  http://www.microsoft.com/ddk/debugging
Other books:
  http://www.microsoft.com/ddk/newbooks.asp
The debugger team wants your feedback and bug reports
  windbgfb@microsoft.com
  microsoft.public.windbg newsgroup

Resources
Technical Chats and Webcasts
  http://www.microsoft.com/communities/chats/default.mspx
  http://www.microsoft.com/usa/webcasts/default.asp
Microsoft Learning and Certification
  http://www.microsoft.com/learning/default.mspx
MSDN & TechNet
  http://microsoft.com/msdn
  http://microsoft.com/technet
Virtual Labs
  http://www.microsoft.com/technet/traincert/virtuallab/rms.mspx
Newsgroups
  http://communities2.microsoft.com/communities/newsgroups/en-us/default.aspx
Technical Community Sites
  http://www.microsoft.com/communities/default.mspx
User Groups
  http://www.microsoft.com/communities/usergroups/default.mspx


© 2006 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.
