Cluster Administration and Troubleshooting Guide
Document Control
Document Owner: Fabian SIRACH
Review Cycle: 12 months
Table of Contents
DOCUMENT CONTROL..........................................................................................................2
TABLE OF CONTENTS...........................................................................................................3
1 PREFACE...............................................................................................................................6
1.1 DOCUMENT AUDIENCE............................................................................................................6
1.2 PREREQUISITES AND RELATED DOCUMENTATION............................................................................6
2 CLUSTER ADMINISTRATION...............................................................................................7
2.1 RESOURCES MANAGEMENT......................................................................................................7
2.2 GROUP MANAGEMENT............................................................................................................8
2.3 NODE MANAGEMENT............................................................................................................10
2.4 APPLYING SERVICE PACK AND HOTFIX.....................................................................................13
2.5 CHKDSK AND AUTOCHK.........................................................................................................13
2.6 CLUSTER COMMAND LINE.....................................................................................................14
3 TROUBLESHOOTING.........................................................................................................15
3.1 ONE NODE IS DOWN............................................................................................................15
3.2 ENTIRE CLUSTER IS DOWN....................................................................................................17
3.3 ONE OR MORE SERVERS QUIT RESPONDING.............................................................................17
3.4 CLUSTER SERVICE DOES NOT START........................................................................17
3.5 CLUSTER SERVICE STARTS BUT CLUSTER ADMINISTRATOR WILL NOT CONNECT...................................18
3.6 CLUSTER ADMINISTRATOR STOPS RESPONDING ON FAILOVER.......................................................19
3.7 GROUP/RESOURCES FAILOVER PROBLEMS................................................................................20
3.8 QUORUM RESOURCES FAILURE..............................................................................................23
3.9 NETWORK NAME RESOURCE DOES NOT GO ONLINE..................................................................25
3.10 PHYSICAL DISK RESOURCE PROBLEM....................................................................................25
3.11 CLIENT CONNECTIVITY PROBLEM...........................................................................................26
3.11.1 Clients have intermittent Connectivity Based on Group Ownership......................26
3.11.2 Clients do not Have any Connection with the Cluster............................................27
3.11.3 Clients have Problems Accessing Data Through a File Share..............................27
3.11.4 Clients Experience Intermittent Access...................................................................28
4 APPENDIX: MSCS EVENT MESSAGES............................................................................29
4.1 EVENT ID 1000................................................................................................................29
4.2 EVENT ID 1002................................................................................................................29
4.3 EVENT ID 1006................................................................................................................30
4.4 EVENT ID 1007................................................................................................................30
4.5 EVENT ID 1009................................................................................................................30
4.6 EVENT ID 1010................................................................................................................31
4.7 EVENT ID 1011.................................................................................................................31
4.8 EVENT ID 1012................................................................................................................31
4.9 EVENT ID 1015................................................................................................................31
2001 Nestec Ltd. – GLOBE – Global Business Excellence. http://veviis01.nestec.ch/GLOBE/ GL-GLOBE
Proprietary document not to be divulged outside the Company.
Printed by Nestec Ltd., CH-1800 Vevey, Switzerland.
Cluster Administration and
Troubleshooting Guide
1 Preface
This document describes the operations required to administer and troubleshoot the
Windows 2000 Cluster Service for the NESTLE Enterprise Portal.
2 Cluster Administration
2.1 Resources Management
Refer to:
W2K_GEO_SERVICE_FAILOVER Procedure
W2K_GEO_SERVICE_TAKEOVER Procedure
Remark: taking a resource offline causes all resources that depend on that resource to be
taken offline.
2.2 Group Management
Refer to:
W2K_GEO_SERVICE_FAILOVER Procedure
W2K_GEO_SERVICE_TAKEOVER Procedure
2.3 Node Management
To install the Cluster Service on an additional node:
1. Open Add/Remove Programs (click Start, point to Settings, click Control Panel, and
then double-click Add/Remove Programs).
2. Click Add/Remove Windows Components.
3. The Welcome to the Windows Components wizard will begin.
4. In Components, select Cluster Service.
5. Click Next.
6. Cluster Service files are located on the Windows 2000 Advanced Server CD-ROM.
Enter Z:\i386.
7. Click OK.
8. Click Next.
9. Click I Understand to accept the condition that Cluster Service is supported on
hardware from the Hardware Compatibility List only.
10. In the Create or Join a Cluster dialog, select The second or next node in the cluster,
and click Next.
11. Enter the cluster name and click Next.
12. Leave Connect to cluster as unchecked. The Cluster Service Configuration wizard
will automatically supply the name of the user account selected during the installation
of the first node. Always use the same account as you used when setting up the first
cluster node.
13. Enter the password for the account and click Next.
14. At the next dialog box, click Finish to complete configuration.
15. The Cluster Service will start. Click OK.
16. Close Add/Remove Programs.
To verify that the cluster is operating:
1. Click Start, point to Programs, click Administrative Tools, and then click Cluster
Administrator.
2. The presence of two nodes shows that a cluster exists and is in operation.
To remove the Cluster Service from a node:
1. Open Add/Remove Programs (click Start, point to Settings, click Control Panel, and
then double-click Add/Remove Programs).
2. Click Add/Remove Windows Components.
3. The Welcome to the Windows Components wizard will begin.
4. Click Next.
5. In Components, click to clear Cluster Service, and then click Next.
To enable the cluster diagnostic log:
1. Open Control Panel (click Start, point to Settings, and then click Control Panel).
2. Double-click System.
3. On the Advanced tab, click Environment Variables.
4. Under System variables, click New.
5. In Variable Name, specify the name of the variable. In Variable Value, specify the
name of the diagnostic log file.
For example, set Variable Name to Clusterlog and Variable Value to
C:\Temp\Cluster.log.
6. Click OK, and then click OK again. Close Control Panel.
7. Stop and restart the Cluster service.
To disable the cluster diagnostic log:
1. Open Control Panel (click Start, point to Settings, and then click Control Panel).
2. Double-click System.
3. On the Advanced tab, click Environment Variables.
4. Under System variables, select Clusterlog.
5. Click Delete, click OK, and then click OK again. Close Control Panel.
6. Stop and restart the Cluster service.
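The effect of the two procedures above can be illustrated with a short sketch. This is a conceptual model only: the real Cluster Service reads the variable internally, and the lookup behaviour shown here (two capitalisations checked, logging disabled when the variable is absent) is an assumption for illustration.

```python
import os

# Illustrative sketch of the ClusterLog environment variable's effect,
# as described in the enable/disable procedures above. The fallback
# behaviour (no variable -> no diagnostic log) is an assumption; the
# real Cluster Service logic is internal to Windows 2000.
def diagnostic_log_path(env=None):
    """Return the configured diagnostic log path, or None if disabled."""
    env = os.environ if env is None else env
    # Environment variable names are case-insensitive on Windows;
    # check the two capitalisations used in this document.
    path = env.get("Clusterlog") or env.get("CLUSTERLOG")
    return path or None

# With the variable set as in step 5 of the enable procedure:
print(diagnostic_log_path({"Clusterlog": r"C:\Temp\Cluster.log"}))
```

Deleting the variable (the disable procedure) makes the lookup return None, i.e. no diagnostic log is written.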
2.6 Cluster Command Line
With the Cluster.exe command-line tool, you can perform every operation that is available in Cluster Administrator.
3 Troubleshooting
3.1 One Node Is Down
If a single node is unavailable, first make sure that its resources and groups are available
on the other node, then gather information about the failure:
1. Check the event logs on the online node (the meaning of event log messages is given in
the Appendix).
2. Check the cluster diagnostic logfile.
3. Check for a recent Memory.dmp file that may have been created by a crash. If
necessary, contact Microsoft Product Support Services for assistance with this file.
4. Go to the paragraph corresponding to the failure.
3.2 Entire Cluster Is Down
If at least one of the servers can be started, gather information about the failure:
1. Check the event logs on the online node (the meaning of event log messages is given in
the Appendix).
2. Check the cluster diagnostic logfile.
3. Check for a recent Memory.dmp file that may have been created by a crash. If
necessary, contact Microsoft Product Support Services for assistance with this file.
4. Go to the paragraph corresponding to the identified failure.
If no server can be restarted, as a last resort restore both nodes (this procedure is defined
in the Disaster Recovery Guide).
Information: failures related to the service account may result in Event ID 7000 or
Event ID 7013 errors in the event log. In addition, you may receive the following error
message:
"Could not start the Cluster Service on \\computername. Error 1069: The
service did not start because of a logon failure."
Make sure the account is not disabled and that password expiration is not a
factor.
This domain account needs to be a member of the local administrators group on
each server.
The account needs the Logon as a service and Lock pages in memory rights.
Make sure the password specified for the Cluster Service account is correct
(retype it, click the Apply button, and try to restart the service).
3. Check to make sure the quorum disk is online and that the Fibre Channel has
proper termination and is functioning properly.
Information: if the quorum disk is not accessible during startup, the following error
message may occur:
"Could not start the Cluster Service on \\computername. Error 0021: The device
is not ready."
If the Cluster Service is running on the other cluster node, check the cluster
logfile on that system for indications of whether the other node attempted to join
the cluster. If the node did try to join and the request was denied, the logfile may
contain details of the event. For example, if you evict a node from the cluster but
do not remove and reinstall MSCS on that node, its subsequent request to join the
cluster will be denied.
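The join-denial behaviour described above can be sketched as a small admission check. This is a conceptual model, not MSCS code: the membership bookkeeping and the message strings are invented for illustration.

```python
# Conceptual sketch of the join-admission behaviour described above.
# The data structures and messages are invented for illustration; the
# real MSCS membership protocol is internal to the Cluster Service.
def handle_join_request(node, members, evicted):
    """Decide whether an online cluster node accepts a join request."""
    if node in evicted:
        # An evicted node must remove and reinstall MSCS before it can rejoin.
        return "denied: node was evicted"
    if node not in members:
        return "denied: node is not a configured cluster member"
    return "accepted"

# A node evicted from the cluster is refused until MSCS is reinstalled:
print(handle_join_request("NODEB", {"NODEA", "NODEB"}, {"NODEB"}))
```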
To start a cluster node when the Cluster Service does not start and no cluster.log file
exists, use the -debug option:
3.5 Cluster Service Starts but Cluster Administrator will not connect
If the Services utility in Administrative Tools indicates that the service is running,
and you cannot connect with Cluster Administrator to administer the cluster, the
problem may be related to:
3.7 Group/Resources Failover Problems
A group typically fails to fail over properly because of problems with resources within the
group. For example, if you move a group from one node to another, the resources within the
group are taken offline and ownership of the group is transferred to the other node. On
receiving ownership, the node attempts to bring the resources online, according to the
dependencies defined for them. If resources fail to go online, MSCS attempts again to bring
them online. After repeated failures, the failing resource or resources may affect the group
and cause it to transition back to the previous node. Eventually, if failures continue, the
group or the affected resources may be taken offline. You can configure the number of
attempts and the allowed failures through resource and group properties.
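The retry-and-failover behaviour described above can be sketched as a small simulation. The threshold value, function names, and result strings are illustrative assumptions, not actual MSCS property names.

```python
# Illustrative simulation of the retry/failover behaviour described
# above. The restart threshold and the result strings are assumptions
# for illustration, not real MSCS resource-property names.
def bring_group_online(resource_ok, restart_threshold=3, failover_allowed=True):
    """Try to bring a group's failing resource online on the current node.

    resource_ok: callable returning True once the resource comes online.
    Returns 'online', 'failed over' (group transitions back to the other
    node), or 'offline' (failures continue and failover is exhausted).
    """
    for _ in range(restart_threshold):
        if resource_ok():
            return "online"
    # Repeated failures: the group transitions back to the previous
    # node, or is taken offline if failover is not allowed.
    return "failed over" if failover_allowed else "offline"

# A resource that fails twice and then succeeds stays on this node:
attempts = iter([False, False, True])
print(bring_group_online(lambda: next(attempts)))
```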
1. When you experience problems with group or resource failover, determine which
resource or resources are failing and why they will not go online.
2. Check that resource dependencies are configured correctly and that the resources
depended on are available.
3. Make sure that the "Possible Owners" list includes both nodes.
4. If resource properties do not appear to be part of the problem, check the event log
or the cluster logfile for details.
Information: the "Preferred Owners" list is designed for automatic failback or initial group
placement within the cluster. In a two-node cluster, this list should only contain the name of
the preferred node for the group, and should not contain multiple entries.
3.8 Quorum Resources Failure
If the Cluster Service will not start because of a quorum disk failure, check the
corresponding device. Quorum access problems are usually caused by connectivity or
authentication issues. If this is not the case, execute the procedure below.
This operation is complex and has a great impact on the cluster, so it is recommended
to perform it with the assistance of Microsoft Product Support Services.
You can check the status of the quorum device by starting the service with the -fixquorum
switch; you can then attempt to bring the quorum disk online, or change the quorum
location for the service.
When you use the -fixquorum option to start the Cluster Service, only the cluster name and
cluster IP resources are brought online. To recover a failed quorum resource:
1. Start the Cluster Service with the -fixquorum option on a single node:
a. Start the Services snap-in (click Start, point to Programs, click Administrative
Tools, and then click Services).
b. Right-click the Cluster Service and select Properties.
c. In the Start Parameters box, type -fixquorum.
d. Click the Start button.
2. Use Cluster Administrator to configure the Cluster Service to use a different disk on the
shared bus for the quorum resource.
3. To view or change the quorum drive settings, right-click the cluster name at the top of the
tree, listed on the left portion of the Cluster Administrator window, and select Properties.
4. The Cluster Properties window contains three tabs, one of which is for the
quorum disk.
5. From this tab, you can view or change quorum disk settings. You can also redesignate
the quorum resource.
Recover a failed quorum log:
This operation is complex and has a great impact on the cluster, so it is recommended
to perform it with the assistance of Microsoft Product Support Services.
Remark: if the error message occurs after you restore the system state on a computer
that has lost the quorum log, the quorum information is copied to
%SystemRoot%\Cluster\Cluster_backup. You can then use the Clusrest.exe tool from the
Resource Kit to restore this information to the quorum disk.
If you have a backup of the system state on one of the computers after the last
changes were made to the cluster, you can restore the quorum by restoring this
information.
If you do not have a backup of the quorum log file, recreate a new quorum log file from
the cluster configuration information in the local system's cluster hive by starting the
Cluster Service with the -resetquorumlog switch:
1. Start the Services snap-in (click Start, point to Programs, click Administrative Tools,
and then click Services).
2. Right-click the Cluster Service and select Properties.
3. In the Start Parameters box, type -resetquorumlog.
4. Click the Start button.
1. Check the system event log on each server for possible errors.
2. Check to make sure that the group has at least one IP address resource and one
network name resource.
3. Check that clients use one of these to access the resource or resources within
the group. If clients connect with any other network name or IP address, they may not
be accessing the correct server in the event that ownership of the resources changes.
As a result of improper addressing, access to these resources may appear limited to a
particular node.
4. If you are able to confirm that clients use proper addressing for the resource or
resources, check the IP address and network name resources to see that they are
online.
5. Check network connectivity with the server that owns the resources. For example,
try some of the following techniques:
From the server
PING server's primary adapter IP address (on client network)
PING IP address of the group
PING Network Name of the group
PING Router/Gateway between client and server (if any)
PING Client IP address
If the above tests work correctly up to the router/gateway check, the problem may be
elsewhere on the network because you have connectivity with the other server and
local addresses. If tests complete up to the client IP address test, there may be a client
configuration or routing problem.
If the tests from the server all pass, but you experience failures performing tests from
the client, there may be client configuration problems. If all tests complete except the
test using the network name of the group, there may be a name resolution problem.
This may be related to client configuration, or it may be a problem with the client's
designated DNS server.
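The interpretation rules in the two paragraphs above can be sketched as a small decision helper. The test names, their order, and the diagnosis strings are invented for illustration; they mirror the ping list and the reasoning above, not any real tool's output.

```python
# Sketch of the server-side interpretation described above. The test
# order mirrors the ping list: primary adapter IP, group IP, group
# network name, router/gateway, client IP. Names and diagnosis strings
# are illustrative assumptions.
PING_ORDER = ["adapter_ip", "group_ip", "group_name", "gateway", "client_ip"]

def diagnose(results):
    """results maps each test name to True (reply) or False (timeout)."""
    for test in PING_ORDER:
        if not results.get(test, False):
            if test == "group_name":
                return "possible name-resolution problem (client config or DNS)"
            if test == "client_ip":
                return "possible client configuration or routing problem"
            if test == "gateway":
                return "problem likely elsewhere on the network"
            return "local connectivity problem at " + test
    return "server-side tests pass; repeat the tests from the client"

print(diagnose(dict.fromkeys(PING_ORDER, True)))
```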
1. Some autosense settings for network speed can spontaneously redetect network speed.
During the detection, network traffic through the adapter may be compromised. For best
results, set the network speed manually to avoid the recalibration.
8. Make sure to use the correct network adapter drivers. Some adapters may require
special drivers, although they may be detected as a similar device.
Solution: Check the system event log and the cluster diagnostic logfile for additional information. It is possible that the cluster service may restart itself after the error. This event message may indicate serious problems that may be related to hardware or other causes.
Solution: Remove MSCS from the affected node, and reinstall MSCS on that system if desired.
Problem: The cluster service attempted to run but found that it is not a member of an existing cluster. This may be due to eviction by an administrator or an incomplete attempt to join a cluster. This error indicates a need to remove and reinstall the cluster software.
Solution: Remove MSCS from the affected node, and reinstall MSCS on that server if desired.
Description: Microsoft Cluster Server did not start because the current version of Windows is not correct.
Problem: The quorum logfile for the cluster was found to be corrupt. The system will attempt to resolve the problem.
Solution: The system will attempt to resolve this problem. This error may also be an indication that the cluster property for maximum size should be increased through the Quorum tab. You can manually resolve this problem by using the -noquorumlogging parameter.
Problem: Available disk space is low on the quorum disk and must be resolved.
Description: The quorum resource was not found. The Microsoft Cluster Server has terminated.
Solution: Use the -fixquorum startup option for the cluster service. Investigate and resolve the problem with the quorum disk. If necessary, designate another disk as the quorum device and restart the cluster service before starting other nodes.
Solution: Close any applications that may have an open handle to the registry key so that it may be replicated as configured with the resource properties. If necessary, contact the application vendor about this problem.
Problem: The disk did not respond to the issued SCSI command. This usually indicates a hardware problem.
Description: Reservation of cluster disk "Disk W:" has been lost. Please check your system and disk configuration.
Problem: The cluster service had exclusive use of the disk, and lost the reservation of the device on the shared SCSI bus.
Solution: The disk may have gone offline or failed. Another node may have taken control of the disk, or a SCSI bus reset command was issued on the bus that caused a loss of reservation.
Description: The NetBIOS interface for "IP Address" resource has failed.
Solution: Check the system event log for errors. Check network adapter configuration and operation. Check TCP/IP configuration and name resolution methods. Check DNS servers for possible database problems or invalid static mappings.
Problem: The cluster service attempted to bring the share online, but the attempt to create the share failed.
Solution: Check the system event log for errors. Check the cluster diagnostic log (if it is enabled) for status codes that may be related to this event. Check the resource properties for proper configuration. Also, make sure the file share has proper dependencies defined for related resources.
Solution: Make sure another node of the same cluster is online first before starting this node. Upon joining with another cluster node, the node will receive an updated copy of the official cluster database, which should alleviate this error.
Problem: The Cluster Service tried to open the CLUSDB registry hive and could not do so. As a result, the cluster service cannot be brought online.
Description: The Cluster Resource Monitor could not load the DLL %1 for resource type %2.
Problem: The Cluster Service tried to load the named resource DLL and it failed to initialize. The DLL could be corrupt, or an incompatible version. As a result, the resource cannot be brought online.
Solution: Check any parameters related to the resource and check the event log for details.
Solution: Scan the event log for additional errors. The disk corruption could be indicative of other problems. Check related hardware and devices on the shared bus and ensure proper cables and termination. This error may be a symptom of failing hardware or a deteriorating drive.
Problem: The file share cannot be brought online. The problem may be caused by permissions to the directory or the disk on which the directory resides. This may also be related to permission problems within the domain.
Solution: Check to make sure that the Cluster Service account has rights to the directory to be shared. Make sure a domain controller is accessible on the network. Make sure dependencies for the share and for other resources in the group are set correctly. Error 5 translates to "Access Denied."
Problem: The named resource failed and the cluster service logged the event. In this example, a disk resource failed.
Solution: For disk resources, check the device for proper operation. Check cables, termination, and logfiles on both cluster nodes. For other resources, check resource properties for proper configuration, and check to make sure dependencies are configured correctly. Check the diagnostic log (if it is enabled) for status codes corresponding to the failure.
Description: Cluster node attempted to join the cluster but failed with error 5052.
Solution: If the node was previously evicted from the cluster, you must remove and reinstall MSCS on the affected server.
Problem: Another node attempted to join the cluster and this node refused the request.
Solution: If the node was previously evicted from the cluster, you must remove and reinstall MSCS on the affected server. Look in Cluster Administrator to see if the other node is listed as a possible cluster member.
Problem: The cluster service on the affected node was halted because of some kind of inconsistency between cluster nodes.
Solution: Check the system event log for errors. Check the network adapter for proper operation and replace the adapter if necessary. Check to make sure the proper adapter driver is loaded for the device and check for newer versions of the driver.
Solution: Check the quorum drive for available disk space. The file system may be corrupted or the device may be failing. Check file system permissions to ensure that the cluster service account has full access to the drive and directory.
Problem: The cluster service attempted to start but found that it was not a valid member of the cluster.
Problem: The network configuration for the adapter has changed and the cluster service cannot make use of the adapter for the network that was assigned to it.
Description: Microsoft Cluster Server did not find any network adapters with valid IP addresses installed in the system. The node will not be able to join a cluster.
Solution: Check to make sure that the networks are available and functioning correctly. This may be a symptom of larger network problems or domain security issues.
5.1 Event ID 9
Source: Disk
Problem: An I/O request was sent to a SCSI device and was not serviced within an acceptable time. The device timeout was logged by this event.
Description: The server was unable to add the virtual root "/" for the directory "path" because of the following error: The system cannot find the path specified. The data is the error.
Problem: The World Wide Web Publishing service could not create a virtual root for the IIS Virtual Root resource. The directory path may have been deleted.
Description: DHCP IP address lease "IP address" for the card with network address "media access control Address" has been denied.
Description: DHCP failed to renew a lease for the card with network address "MAC Address." The following error occurred: The semaphore timeout period has expired.
More Info: The description for this error message may vary somewhat based on the actual error. For example, another error that may be listed in the event detail might be: "Logon Failure:
Problem: The Cluster Service attempted to start but could not gain access to the quorum log on the quorum disk. This may be because of problems gaining access to the disk or problems joining a cluster that has already formed.
Solution: Check the disk and quorum log for problems. If necessary, check the cluster logfile for more information. There may be other events in the system event log that give more information.
Tools and their uses:
Disk Management (compmgmt.msc): Determine whether a disk is available to a particular node. If the disk can be selected under Disk Management, it is online to the local system. If the disk object appears dimmed, it is not available for that node.
Services option in Administrative Tools: Verify that the Cluster Service is running.
Windows 2000 Explorer, My Computer, or the Net View command: Verify that a particular share has been exported from the server you expected.
Event Viewer: View and manage System, Security, and Application event logs.
Dr. Watson: Detect, log, and diagnose application errors.
Task Manager: Monitor applications, tasks, and key performance metrics; view detailed information on memory and CPU usage for each application and process.
Performance Monitor: Monitor system details of application and system behaviours, and monitor performance.
Network Monitor: Monitor and troubleshoot network connectivity by capturing and analyzing network traffic.
Windows Diagnostics (Winmsd.exe): Examine system information on device drivers, network usage, and system resources, such as IRQ, DMA, and I/O addresses.
Additional tools and their uses:
Diskmap: This command-line utility produces a detailed report on the configuration of the hard disk that you specify. It provides information from the registry about disk characteristics and geometry, and reads and displays data about all of the partitions and logical drives defined on the disk. It also shows disk signatures.
Dumpel: Dump Event Log is a command-line utility that dumps an event log for a local or remote system into a tab-separated text file. This utility can also be used to filter for or filter out certain event types.
Filever: This command-line tool examines the version resource structure of a file or a directory of files on either a local or remote computer, and displays information on the versions of executable files, such as .exe and .dll files.
Getmac: GetMAC provides a quick method for obtaining the MAC (Ethernet) layer address and binding order for a computer running Windows 2000, locally or across a network. This can be useful when you want to enter the address into a sniffer, or if you need to know what protocols are currently in use on a computer.
Netcons: This GUI tool monitors and displays current net connections, taking the place of the net use command-line command.
Clustool: This tool permits backup and restore of the cluster configuration.
If you set the CLUSTERLOG environment variable, the cluster will create a logfile that
contains diagnostic information using the path specified. Important events during the
operation of the Cluster Service will be logged in this file. Because so many different events
occur, the logfile may be somewhat cryptic or hard to read. This document gives some hints
about how to read the logfile and information about what items to look for.
Note: Each time you attempt to start the Cluster Service, the log will be cleared and a new
logfile started. Each component of MSCS that places an entry in the logfile will indicate itself
by abbreviation in square brackets. For example, the Node Manager component would be
abbreviated [NM]. Logfile entries will vary from one cluster to another. As a result, other
logfiles may vary from excerpts referenced in this document.
Note: Log entry lines in the following sections have been wrapped because of space
constraints in this document. The lines do not normally wrap.
Near the beginning of the logfile, notice the build number of MSCS, followed by the operating
system version number and service pack level. If you call for support, engineers may ask for
this information:
082::14-21:29:26.625 Cluster Service started - Cluster Version 1.224.
082::14-21:29:26.625 OS Version 4.0.1381 - Service Pack 3.
Following the version information, some initialization steps occur. Those steps are followed by
an attempt to join the cluster, if one node already exists in a running state. If the Cluster
Service could not detect any other cluster members, it will attempt to form the cluster.
Consider the following log entries:
0b5::12-20:15:23.531 We’re initing Ep...
0b5::12-20:15:23.531 [DM]: Initialization
0b5::12-20:15:23.531 [DM] DmpRestartFlusher: Entry
0b5::12-20:15:23.531 [DM] DmpStartFlusher: Entry
0b5::12-20:15:23.531 [DM] DmpStartFlusher: thread created
0b5::12-20:15:23.531 [NMINIT] Initializing the Node Manager...
0b5::12-20:15:23.546 [NMINIT] Local node name = NODEA.
0b5::12-20:15:23.546 [NMINIT] Local node ID = 1.
0b5::12-20:15:23.546 [NM] Creating object for node 1 (NODEA)
0b5::12-20:15:23.546 [NM] node 1 state 1
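Log lines of the form quoted above can be split into fields with a short parser. The field interpretation (thread id, day-and-time stamp, optional bracketed component abbreviation, message) is inferred from these excerpts, not from a documented format specification.

```python
import re

# Rough parser for the cluster logfile excerpts shown above. The field
# names are descriptive guesses at the layout inferred from the samples.
LINE = re.compile(
    r"^(?P<thread>[0-9a-f]+)::(?P<stamp>\d+-\d+:\d+:\d+\.\d+)\s+"
    r"(?:\[(?P<component>[A-Z]+)\]:?\s*)?(?P<message>.*)$"
)

def parse_line(line):
    """Split one log line into thread, stamp, component, and message."""
    m = LINE.match(line)
    return m.groupdict() if m else None

# A line with a component abbreviation, as in the excerpt above:
print(parse_line("0b5::12-20:15:23.546 [NM] Creating object for node 1 (NODEA)"))
```

Lines without a bracketed abbreviation (such as the version lines near the start of the logfile) parse with `component` set to None.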
Note that the cluster service attempts to join the cluster. If it cannot connect with an existing
member, the software decides to form the cluster. The next series of steps attempts to form
groups and resources necessary to accomplish this task. It is important to note that the
cluster service must arbitrate control of the quorum disk.
0b5::12-20:15:32.781 [FM] Creating group a1a13a86-0eaf-11d1-8427-0000f8034599
0b5::12-20:15:32.781 [FM] Group a1a13a86-0eaf-11d1-8427-0000f8034599 contains a1a13a87-
0eaf-11d1-8427-0000f8034599.
0b5::12-20:15:32.781 [FM] Creating resource a1a13a87-0eaf-11d1-8427-0000f8034599
0b5::12-20:15:32.781 [FM] FmpAddPossibleEntry adding 1 to a1a13a87-0eaf-11d1-8427-
0000f8034599 possible node list
0b5::12-20:15:32.781 [FMX] Found the quorum resource a1a13a87-0eaf-11d1-8427-0000f8034599.
0b5::12-20:15:32.781 [FM] All dependencies for a1a13a87-0eaf-11d1-8427-0000f8034599 created
0b5::12-20:15:32.781 [FM] arbitrate for quorum resource id a1a13a87-0eaf-11d1-8427-
0000f8034599.
0b5::12-20:15:32.781 FmpRmCreateResource: creating resource a1a13a87-0eaf-11d1-8427-
0000f8034599 in shared resource monitor
0b5::12-20:15:32.812 FmpRmCreateResource: created resource a1a13a87-0eaf-11d1-8427-
0000f8034599, resid 1363016
0dc::12-20:15:32.828 Physical Disk <Disk D:>: Arbitrate returned status 0.
0b5::12-20:15:32.828 [FM] FmGetQuorumResource successful
0b5::12-20:15:32.828 FmpRmOnlineResource: bringing resource a1a13a87-0eaf-11d1-8427-
0000f8034599 (resid 1363016) online.
0b5::12-20:15:32.843 [CP] CppResourceNotify for resource Disk D:
0b5::12-20:15:32.843 [GUM] GumSendUpdate: Locker waiting type 0 context 8
0b5::12-20:15:32.843 [GUM] Thread 0xb5 UpdateLock wait on Type 0
0b5::12-20:15:32.843 [GUM] DoLockingUpdate successful, lock granted to 1
0b5::12-20:15:32.843 [GUM] GumSendUpdate: Locker dispatching seq 388 type 0 context 8
0b5::12-20:15:32.843 [GUM] GumpDoUnlockingUpdate releasing lock ownership
0b5::12-20:15:32.843 [GUM] GumSendUpdate: completed update seq 388 type 0 context 8
0b5::12-20:15:32.843 [GUM] GumSendUpdate: Locker waiting type 0 context 9
0b5::12-20:15:32.843 [GUM] Thread 0xb5 UpdateLock wait on Type 0
0b5::12-20:15:32.843 [GUM] DoLockingUpdate successful, lock granted to 1
0b5::12-20:15:32.843 [GUM] GumSendUpdate: Locker dispatching seq 389 type 0 context 9
0b5::12-20:15:32.843 [GUM] GumpDoUnlockingUpdate releasing lock ownership
0b5::12-20:15:32.843 [GUM] GumSendUpdate: completed update seq 389 type 0 context 9
0b5::12-20:15:32.843 FmpRmOnlineResource: Resource a1a13a87-0eaf-11d1-8427-0000f8034599
pending
0e1::12-20:15:33.359 Physical Disk <Disk D:>: Online, created registry watcher thread.
In this case, the node creates the cluster group and the quorum disk resource, gains control of
the disk, and opens the quorum logfile. From here, the cluster performs operations against the
logfile and proceeds to form the cluster, which involves configuring the network interfaces and
bringing them online.
0b5::12-20:15:33.718 [NM] Beginning form process.
0b5::12-20:15:33.718 [NM] Synchronizing node information.
0b5::12-20:15:33.718 [NM] Creating node objects.
0b5::12-20:15:33.718 [NM] Configuring networks & interfaces.
0b5::12-20:15:33.718 [NM] Synchronizing network information.
0b5::12-20:15:33.718 [NM] Synchronizing interface information.
0b5::12-20:15:33.718 [dm] DmBeginLocalUpdate Entry
0b5::12-20:15:33.718 [dm] DmBeginLocalUpdate Exit, pLocalXsaction=0x00151c20
dwError=0x00000000
0b5::12-20:15:33.718 [NM] Setting database entry for interface a1a13a7f-0eaf-11d1-8427-
0000f8034599
0b5::12-20:15:33.718 [dm] DmCommitLocalUpdate Entry
0b5::12-20:15:33.718 [dm] DmCommitLocalUpdate Exit, dwError=0x00000000
0b5::12-20:15:33.718 [dm] DmBeginLocalUpdate Entry
0b5::12-20:15:33.875 [dm] DmBeginLocalUpdate Exit, pLocalXsaction=0x00151c20
dwError=0x00000000
0b5::12-20:15:33.875 [NM] Setting database entry for interface a1a13a81-0eaf-11d1-8427-
0000f8034599
0b5::12-20:15:33.875 [dm] DmCommitLocalUpdate Entry
0b5::12-20:15:33.875 [dm] DmCommitLocalUpdate Exit, dwError=0x00000000
0b5::12-20:15:33.875 [NM] Matched 2 networks, created 0 new networks.
0b5::12-20:15:33.875 [NM] Resynchronizing network information.
0b5::12-20:15:33.875 [NM] Resynchronizing interface information.
0b5::12-20:15:33.875 [NM] Creating network objects.
0b5::12-20:15:33.875 [NM] Creating object for network a1a13a7e-0eaf-11d1-8427-0000f8034599
0b5::12-20:15:33.875 [NM] Creating object for network a1a13a80-0eaf-11d1-8427-0000f8034599
0b5::12-20:15:33.875 [NM] Creating interface objects.
0b5::12-20:15:33.875 [NM] Creating object for interface a1a13a7f-0eaf-11d1-8427-
0000f8034599.
0b5::12-20:15:33.875 [NM] Registering network a1a13a7e-0eaf-11d1-8427-0000f8034599 with
cluster transport.
0b5::12-20:15:33.875 [NM] Registering interfaces for network a1a13a7e-0eaf-11d1-8427-
0000f8034599 with cluster transport.
0b5::12-20:15:33.875 [NM] Registering interface a1a13a7f-0eaf-11d1-8427-0000f8034599 with
cluster transport, addr 9.9.9.2, endpoint 3003.
0b5::12-20:15:33.890 [NM] Instructing cluster transport to bring network a1a13a7e-0eaf-
11d1-8427-0000f8034599 online.
0b5::12-20:15:33.890 [NM] Creating object for interface a1a13a81-0eaf-11d1-8427-
0000f8034599.
0b5::12-20:15:33.890 [NM] Registering network a1a13a80-0eaf-11d1-8427-0000f8034599 with
cluster transport.
0b5::12-20:15:33.890 [NM] Registering interfaces for network a1a13a80-0eaf-11d1-8427-
0000f8034599 with cluster transport.
After initializing the network interfaces, the cluster continues formation by enumerating the
cluster nodes. In this case, as a newly formed cluster, it contains only one node.
If this session had been joining an existing cluster, the node enumeration would show two
nodes. Next, the cluster brings the Cluster IP Address and Cluster Name resources online.
0b5::12-20:15:34.015 [FM] OnlineGroup: setting group state to Online for f901aa29-0eaf-
11d1-8427-0000f8034599
069::12-20:15:34.015 IP address <Cluster IP address>: Created NBT interface
\Device\NetBt_If6 (instance 355833456).
0b5::12-20:15:34.015 [FM] FmpAddPossibleEntry adding 1 to a1a13a87-0eaf-11d1-8427-
0000f8034599 possible node list
0b5::12-20:15:34.015 [FM] FmFormNewClusterPhase2 complete.
.
.
.
0b5::12-20:15:34.281 [INIT] Successfully formed a cluster.
09c::12-20:15:34.281 [lm] :ReSyncTimerHandles Entry.
09c::12-20:15:34.281 [lm] :ReSyncTimerHandles Exit gdwNumHandles=3
0b5::12-20:15:34.281 [INIT] Cluster Started! Original Min WS is 204800, Max WS is 1413120.
08c::12-20:15:34.296 [CPROXY] clussvc initialized
069::12-20:15:40.421 IP address <Cluster IP Address>: IP Address 192.88.80.114 on adapter
DC21X41 online
.
.
.
04d::12-20:15:40.421 [FM] OnlineWaitingTree, a1a13a84-0eaf-11d1-8427-0000f8034599 depends
on a1a13a83-0eaf-11d1-8427-0000f8034599. Start first
04d::12-20:15:40.421 [FM] OnlineWaitingTree, Start resource a1a13a84-0eaf-11d1-8427-
0000f8034599
04d::12-20:15:40.421 [FM] OnlineResource: a1a13a84-0eaf-11d1-8427-0000f8034599 depends on
a1a13a83-0eaf-11d1-8427-0000f8034599. Bring online first.
04d::12-20:15:40.421 FmpRmOnlineResource: bringing resource a1a13a84-0eaf-11d1-8427-
0000f8034599 (resid 1391032) online.
04d::12-20:15:40.421 [CP] CppResourceNotify for resource Cluster Name
04d::12-20:15:40.421 [GUM] GumSendUpdate: Locker waiting type 0 context 8
04d::12-20:15:40.437 [GUM] Thread 0x4d UpdateLock wait on Type 0
04d::12-20:15:40.437 [GUM] DoLockingUpdate successful, lock granted to 1
076::12-20:15:40.437 Network Name <Cluster Name>: Bringing resource online...
04d::12-20:15:40.437 [GUM] GumSendUpdate: Locker dispatching seq 411 type 0 context 8
04d::12-20:15:40.437 [GUM] GumpDoUnlockingUpdate releasing lock ownership
04d::12-20:15:40.437 [GUM] GumSendUpdate: completed update seq 411 type 0 context 8
04d::12-20:15:40.437 [GUM] GumSendUpdate: Locker waiting type 0 context 11
.
.
.
076::12-20:15:43.515 Network Name <Cluster Name>: Registered server name MDLCLUSTER on
transport \Device\NetBt_If6.
076::12-20:15:46.578 Network Name <Cluster Name>: Registered workstation name MDLCLUSTER on
transport \Device\NetBt_If6.
076::12-20:15:46.578 Network Name <Cluster Name>: Network Name MDLCLUSTER is now online
Following these steps, the cluster will attempt to bring other resources and groups online. The
logfile continues to grow as long as the Cluster Service runs. Therefore, it is usually best to
enable logging while you are troubleshooting a problem, rather than leaving it on for days or
weeks at a time.
After reviewing a successful startup of the Cluster Service, you may want to examine some
errors that can appear because of various failures. The following examples illustrate possible
log entries for four different failures.
Note: The error code on these logfile entries is 21. You can issue net helpmsg 21 from the
command line to receive an explanation of the error status code. Status code 21 means
"The device is not ready," which indicates a possible problem with the device. In this case,
the device was turned off, and the error status correctly identified the problem.
Status code 2 means "The system cannot find the file specified." In this case, the error may
mean that the Cluster Service cannot find the disk, or that, because of some other problem, it
cannot locate the quorum logfile that should be on the disk.
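As a stand-in for net helpmsg where a live Windows node is not at hand, the two status codes discussed here can be looked up from a small hard-coded table. The messages below are the documented texts quoted above; the table is illustrative only, not a complete Win32 error map:

```python
# A minimal stand-in for "net helpmsg <code>": the two Win32 status
# codes discussed above, hard-coded from their documented message text.
# (On a live Windows node, run "net helpmsg <code>" instead, which can
# translate any status code.)
WIN32_STATUS = {
    2: "The system cannot find the file specified.",
    21: "The device is not ready.",
}

def helpmsg(code):
    """Return the message text for a known status code."""
    return WIN32_STATUS.get(code, "Unknown status code %d" % code)
```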
If another computer on the network has the same IP address as the cluster IP address
resource, the resource will be prevented from coming online. Further, the cluster name will not
be registered on the network, because it depends on the IP address resource. Because this name
is the network name used for cluster administration, you will not be able to administer the
cluster by this name during this type of failure. However, you may be able to connect with
Cluster Administrator using the computer name of the cluster node, or connect locally from the
console using the loopback address. The following sample entries are from a cluster logfile
during this type of failure:
0b9::14-21:32:59.968 IP Address <Cluster IP Address>: The IP address is already in use on
the network, status 5057.
0d2::14-21:32:59.984 [FM] NotifyCallBackRoutine: enqueuing event
03e::14-21:32:59.984 [FM] WorkerThread, processing transition event for a1a13a83-0eaf-11d1-
8427-0000f8034599, oldState = 129, newState = 4.03e
.
.
.
03e::14-21:32:59.984 FmpHandleResourceFailure: taking resource a1a13a83-0eaf-11d1-8427-
0000f8034599 and dependents offline
03e::14-21:32:59.984 [FM] TerminateResource: a1a13a84-0eaf-11d1-8427-0000f8034599 depends
on a1a13a83-0eaf-11d1-8427-0000f8034599. Terminating first
0d3::14-21:32:59.984 Network Name <Cluster Name>: Terminating name MDLCLUSTER...
0d3::14-21:32:59.984 Network Name <Cluster Name>: Name MDLCLUSTER is already offline.
.
.
.
03e::14-21:33:00.000 FmpRmTerminateResource: a1a13a84-0eaf-11d1-8427-0000f8034599 is now
offline
0c7::14-21:33:00.000 IP Address <Cluster IP Address>: Terminating resource...
0c7::14-21:33:00.000 IP Address <Cluster IP Address>: Address 192.88.80.114 on adapter
DC21X41 offline.
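When reviewing a saved cluster logfile, failures like the one above can be pulled out mechanically. The following sketch filters a log for a few failure signatures; the marker strings are assumptions drawn from the excerpts in this guide, not an exhaustive list:

```python
# Failure signatures taken from the sample log excerpts in this guide;
# extend the tuple with other markers seen in your own cluster.log.
FAILURE_MARKERS = (
    "already in use",            # duplicate IP address, status 5057
    "FmpHandleResourceFailure",  # a resource and its dependents go offline
    "is not available",          # join attempts refused by sponsors
)

def failure_lines(log_text):
    """Return the lines of a cluster log that match a failure marker."""
    return [line for line in log_text.splitlines()
            if any(marker in line for marker in FAILURE_MARKERS)]
```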
If you evict a node from a cluster, the cluster software on that node must be reinstalled
before it can rejoin the cluster. If you start the evicted node and the Cluster Service
attempts to join the cluster, entries similar to the following may appear in the cluster logfile:
032::26-16:11:45.109 [INIT] Attempting to join cluster MDLCLUSTER
032::26-16:11:45.109 [JOIN] Spawning thread to connect to sponsor 192.88.80.115
040::26-16:11:45.109 [JOIN] Asking 192.88.80.115 to sponsor us.
032::26-16:11:45.109 [JOIN] Spawning thread to connect to sponsor 9.9.9.2
032::26-16:11:45.109 [JOIN] Spawning thread to connect to sponsor 192.88.80.190
099::26-16:11:45.109 [JOIN] Asking 9.9.9.2 to sponsor us.
032::26-16:11:45.109 [JOIN] Spawning thread to connect to sponsor NODEA
098::26-16:11:45.109 [JOIN] Asking 192.88.80.190 to sponsor us.
032::26-16:11:45.125 [JOIN] Waiting for all connect threads to terminate.
092::26-16:11:45.125 [JOIN] Asking NODEA to sponsor us.
040::26-16:12:18.640 [JOIN] Sponsor 192.88.80.115 is not available (JoinVersion),
status=1722.
098::26-16:12:18.640 [JOIN] Sponsor 192.88.80.190 is not available (JoinVersion),
status=1722.
099::26-16:12:18.640 [JOIN] Sponsor 9.9.9.2 is not available (JoinVersion), status=1722.
The node attempts to join the existing cluster, but its credentials are invalid because it was
previously evicted, so the existing node refuses to communicate with it. The evicted node may
then attempt to form its own version of the cluster, but it cannot gain control of the quorum
disk, because the existing cluster node maintains ownership. Examining the logfile on the
existing cluster node reveals that the Cluster Service posted entries reflecting the failed
join attempt:
0c4::29-18:13:31.035 [NMJOIN] Processing request by node 2 to begin joining.
0c4::29-18:13:31.035 [NMJOIN] Node 2 is not a member of this cluster. Cannot join.
SUMMARY
=======
This is a list of the available switches that can be used as startup
parameters for the Cluster service.
To use one, open the properties of the service, type the appropriate switch
in the Start Parameters box, and then click Start.
NOTE: You must include a forward slash (/) at the beginning of the switch.
You can also supply the desired switch when starting the Cluster service from
the command line, for example: "NET START CLUSSVC /FIXQUORUM"
NOTE: The Debug switch has special startup requirements; see the Debug
section below for proper usage.
MORE INFORMATION
================
- Debug
Function: Cluster logging may not contain any helpful information for
diagnosing failures of the Cluster service to start, because the service may
fail before Cluster.log logging begins. Starting the Cluster service with
this switch displays the initialization of the Cluster service on the screen
and can be beneficial in identifying these early failures.
Usage scenarios: Use this switch only when the Cluster service fails to
start. It displays the operation of the Cluster service on the screen as the
service attempts to start. It can be used only when starting the service
from the command prompt, and you must be in the directory where the Cluster
service is installed (by default, %SystemRoot%\Cluster). This is also the
only switch that cannot be used with the NET START command.
Operation: Open a Command Prompt and change your current directory to the
%SystemRoot%\cluster directory. Then type:
"CLUSSVC /debug"
The cluster service will send output to the window similar to what would
normally be seen in the cluster.log. You may also capture this information to
a file by using the following command syntax instead:
"CLUSSVC /debug > c:\debug.log"
Once you are satisfied that the Cluster service is running properly, press
CTRL+C to stop the process, and then restart the service normally.
Note: You may wish to use the ClusterLogLevel environment variable to control
the output level when using the Debug switch; see the related article on
ClusterLogLevel for additional information.
- FixQuorum
Function: Lets the Cluster service start despite problems with the quorum
device. The only resources that will be brought online once the service has
started are the Cluster IP Address and the Cluster Name. You can open
Cluster Administrator and bring other resources online manually.
Operation: After the Cluster service has started, all other resources,
including the quorum resource, remain offline. You can then manually try to
bring the quorum resource online, monitor the cluster log and the new event
log entries, and attempt to diagnose any problems with the quorum resource.
- ResetQuorumLog
Function: If the quorum log and checkpoint files are not found or are
corrupt, this switch can be used to re-create them based on the information
in the local node's %SystemRoot%\Cluster\CLUSDB registry hive. If the quorum
log file is found to be in proper order, this switch has no effect.
Requirements: Typically, only one node is started with this switch, and the
switch is used alone. It must be used only by experienced users who
understand the consequences of creating a new quorum log file from
information that is potentially out of date.
Usage scenarios: Use this switch only when the Cluster service fails to
start on a Windows 2000 or later machine because of missing or corrupt
quorum log (QUOLOG.LOG) and CHKxxx.TMP files. Windows NT 4.0 automatically
re-creates these files if they do not exist; this switch was added in
Windows 2000 to give more control over the start of the Cluster service.
Operation: The Cluster service performs an auto-reset of the quorum log file
if it is found missing or corrupt, using the information in the currently
loaded cluster hive (%SystemRoot%\Cluster\CLUSDB).
- DebugResMon
Function: Helps you to debug the resource monitor process and, therefore, the
resource dynamic-link libraries (DLLs) that are loaded by the resource
monitor. You can use any standard Windows-based debugger.
Requirements: Can be used only when the Cluster service is started from the
command prompt with the "/debug" option; there is no equivalent registry
setting for use when the Cluster service runs as a service. A debugger must
be available to attach to the resource monitor when it starts up.
Typically, this switch is used alone.
Usage scenarios: Developers use it to debug the resource monitor process and
resource DLLs. This option is extremely useful if a bug in a resource DLL
causes the resource monitor process to crash soon after the Cluster service
starts it, before users can manually attach a debugger to the resource
monitor process.
Operation: Just before a resource monitor process starts, the Cluster
service waits with the message "Waiting for debugger to connect to the
resmon process X", where X is the PID (process ID) of the resource monitor
process. The Cluster service waits in this way for every resource monitor
process it creates. Once the user attaches a debugger to the resource
monitor process and the process starts up, the Cluster service continues
with its initialization.
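The PID in that wait message can be extracted mechanically, for example to feed it to a debugger's attach-by-PID option. This sketch assumes the message wording shown above is stable:

```python
import re

# The Cluster service prints a line like
#   "Waiting for debugger to connect to the resmon process X"
# (quoted above); this helper pulls out the PID so a debugger can be
# attached to it. The exact message wording is taken from the text above
# and is assumed to be stable.
WAIT_MSG = re.compile(
    r"Waiting for debugger to connect to the resmon process (\d+)")

def resmon_pid(line):
    """Return the resource monitor PID from the wait message, or None."""
    m = WAIT_MSG.search(line)
    return int(m.group(1)) if m else None
```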
- NoRepEvtLogging
Function: Prevents the node started with this switch from replicating its
event log information to other nodes. The node will still receive
information from other nodes that were started normally.
Usage scenarios: For example, to start the Cluster service and log those
events not recorded in the event log to a local file, Debugnorep.log:
"CLUSSVC /debug /norepevtlogging > c:\debugnorep.log"
======================================================================
Copyright Microsoft Corporation 2001.