Documenti di Didattica
Documenti di Professioni
Documenti di Cultura
Update 1
Modified on 24 MAY 2018
VMware vSphere 6.5
VMware ESXi 6.5
vCenter Server 6.5
vSphere Troubleshooting
You can find the most up-to-date technical documentation on the VMware website at:
https://docs.vmware.com/
If you have comments about this documentation, submit your feedback to
docfeedback@vmware.com
VMware, Inc.
3401 Hillview Ave.
Palo Alto, CA 94304
www.vmware.com
Copyright © 2010–2018 VMware, Inc. All rights reserved. Copyright and trademark information.
VMware, Inc. 2
Contents
Updated Information 6
1 Troubleshooting Overview 7
Guidelines for Troubleshooting 7
Troubleshooting with Logs 9
3 Troubleshooting Hosts 21
Troubleshooting vSphere HA Host States 21
Troubleshooting vSphere Auto Deploy 26
Authentication Token Manipulation Error 34
Active Directory Rule Set Error Causes Host Profile Compliance Failure 34
Unable to Download VIBs When Using vCenter Server Reverse Proxy 35
5 Troubleshooting Availability 43
Troubleshooting vSphere HA Admission Control 43
Troubleshooting Heartbeat Datastores 45
Troubleshooting vSphere HA Failure Response 47
Troubleshooting vSphere Fault Tolerance in Network Partitions 49
Troubleshooting VM Component Protection 50
VMware, Inc. 3
vSphere Troubleshooting
7 Troubleshooting Storage 61
Resolving SAN Storage Display Problems 61
Resolving SAN Performance Problems 63
Virtual Machines with RDMs Need to Ignore SCSI INQUIRY Cache 68
Software iSCSI Adapter Is Enabled When Not Needed 69
Failure to Mount NFS Datastores 69
Troubleshooting Storage Adapters 70
Checking Metadata Consistency with VOMA 70
No Failover for Storage Path When TUR Command Is Unsuccessful 72
Troubleshooting Flash Devices 74
Troubleshooting Virtual Volumes 77
Troubleshooting VAIO Filters 80
8 Troubleshooting Networking 82
Troubleshooting MAC Address Allocation 82
The Conversion to the Enhanced LACP Support Fails 86
Unable to Remove a Host from a vSphere Distributed Switch 87
Hosts on a vSphere Distributed Switch 5.1 and Later Lose Connectivity to vCenter Server 88
Hosts on vSphere Distributed Switch 5.0 and Earlier Lose Connectivity to vCenter Server 90
Alarm for Loss of Network Redundancy on a Host 91
Virtual Machines Lose Connectivity After Changing the Uplink Failover Order of a Distributed Port
Group 92
Unable to Add a Physical Adapter to a vSphere Distributed Switch That Has Network I/O Control
Enabled 93
Troubleshooting SR-IOV Enabled Workloads 94
A Virtual Machine that Runs a VPN Client Causes Denial of Service for Virtual Machines on the
Host or Across a vSphere HA Cluster 95
Low Throughput for UDP Workloads on Windows Virtual Machines 97
Virtual Machines on the Same Distributed Port Group and on Different Hosts Cannot Communicate
with Each Other 99
Attempt to Power On a Migrated vApp Fails Because the Associated Protocol Profile Is Missing 100
Networking Configuration Operation Is Rolled Back and a Host Is Disconnected from
vCenter Server 101
VMware, Inc. 4
About vSphere Troubleshooting
®
vSphere Troubleshooting describes troubleshooting issues and procedures for VMware vCenter Server
implementations and related components.
Intended Audience
This information is for anyone who wants to troubleshoot virtual machines, ESXi hosts, clusters, and
related storage solutions. The information in this book is for experienced Windows or Linux system
administrators who are familiar with virtual machine technology and data center operations.
Note Not all functionality in the vSphere Web Client has been implemented for the vSphere Client in the
vSphere 6.5 release. For an up-to-date list of unsupported functionality, see Functionality Updates for the
vSphere Client Guide at http://www.vmware.com/info?id=1413.
VMware, Inc. 5
Updated Information
This vSphere Troubleshooting is updated with each release of the product or when necessary.
Revision Description
04 OCT 2017 n Updated log information in vSphere Auto Deploy TFTP Timeout Error at Boot Time.
n Updated log directories in Troubleshooting with Logs.
VMware, Inc. 6
Troubleshooting Overview 1
vSphere Troubleshooting contains common troubleshooting scenarios and provides solutions for each of
these problems. You can also find guidance here for resolving problems that have similar origins. For
unique problems, consider developing and adopting a troubleshooting methodology.
The following approach for effective troubleshooting elaborates on how to gather troubleshooting
information, such as identifying symptoms and defining the problem space. Troubleshooting with log files
is also discussed.
Defining the Problem After you have isolated the symptoms of the problem, you must define the
Space problem space. Identify the software or hardware components that are
affected and might be causing the problem and those components that are
not involved.
Testing Possible When you know what the symptoms of the problem are and which
Solutions components are involved, test the solutions systematically until the problem
is resolved.
Identifying Symptoms
Before you attempt to resolve a problem in your implementation, you must identify precisely how it is
failing.
VMware, Inc. 7
vSphere Troubleshooting
The first step in the troubleshooting process is to gather information that defines the specific symptoms of
what is happening. You might ask these questions when gathering this information:
n Can the affected task be divided into subtasks that you can evaluate separately?
n What has changed recently in the software or hardware that might be related to the failure?
To define the problem space in an implementation of vSphere, be aware of the components present. In
addition to VMware software, consider third-party software in use and which hardware is being used with
the VMware virtual hardware.
Recognizing the characteristics of the software and hardware elements and how they can impact the
problem, you can explore general problems that might be causing the symptoms.
n Incompatibility of components
Break down the process and consider each piece and the likelihood of its involvement separately. For
example, a case that is related to a virtual disk on local storage is probably unrelated to third-party router
configuration. However, a local disk controller setting might be contributing to the problem. If a component
is unrelated to the specific symptoms, you can probably eliminate it as a candidate for solution testing.
Think about what changed in the configuration recently before the problems started. Look for what is
common in the problem. If several problems started at the same time, you can probably trace all the
problems to the same cause.
With the information that you have gained about the symptoms and affected components, you can design
tests for pinpointing and resolving the problem. These tips might make this process more effective.
n Verify that each solution determines unequivocally whether the problem is fixed. Test each potential
solution but move on promptly if the fix does not resolve the problem.
VMware, Inc. 8
vSphere Troubleshooting
n Develop and pursue a hierarchy of potential solutions based on likelihood. Systematically eliminate
each potential problem from the most likely to the least likely until the symptoms disappear.
n When testing potential solutions, change only one thing at a time. If your setup works after many
things are changed at once, you might not be able to discern which of those things made a difference.
n If the changes that you made for a solution do not help resolve the problem, return the
implementation to its previous status. If you do not return the implementation to its previous status,
new errors might be introduced.
n Find a similar implementation that is working and test it in parallel with the implementation that is not
working properly. Make changes on both systems at the same time until few differences or only one
difference remains between them.
Common Logs
The following logs are common to all deployments on Windows or Linux.
VMware, Inc. 9
vSphere Troubleshooting
For Platform Services Controller node deployments, additional runtime logs are located at
C:\ProgramData\VMware\CIS\runtime\VMwareSTSService\logs.
VMware, Inc. 10
Troubleshooting Virtual
Machines 2
The virtual machine troubleshooting topics provide solutions to potential problems that you might
encounter when using your virtual machines.
n Virtual Machine Does Not Power On After Cloning or Deploying from Template
The troubleshooting topics discussed focus on problems that you might encounter when using the
vSphere Fault Tolerance feature on your virtual machines. The topics also describe how to resolve
problems.
You can also see the VMware knowledge base article at http://kb.vmware.com/kb/1033634 to help you
troubleshoot Fault Tolerance. This article contains a list of error messages that you might encounter when
you attempt to use the feature and, where applicable, advice on how to resolve each error.
Problem
When you attempt to power on a virtual machine with Fault Tolerance enabled, an error message might
appear if you did not enable HV.
Cause
This error is often the result of HV not being available on the ESXi server on which you are attempting to
power on the virtual machine. HV might not be available either because it is not supported by the ESXi
server hardware or because HV is not enabled in the BIOS.
VMware, Inc. 11
vSphere Troubleshooting
Solution
If the ESXi server hardware supports HV, but HV is not currently enabled, enable HV in the BIOS on that
server. The process for enabling HV varies among BIOSes. See the documentation for your hosts'
BIOSes for details on how to enable HV.
If the ESXi server hardware does not support HV, switch to hardware that uses processors that support
Fault Tolerance.
Problem
Secondary VM could not be powered on as there are no compatible hosts that can accommodate it.
Cause
This can occur for a variety of reasons including that there are no other hosts in the cluster, there are no
other hosts with HV enabled, Hardware MMU Virtualization is not supported by host CPUs, data stores
are inaccessible, there is no available capacity, or hosts are in maintenance mode.
Solution
If there are insufficient hosts, add more hosts to the cluster. If there are hosts in the cluster, ensure they
support HV and that HV is enabled. The process for enabling HV varies among BIOSes. See the
documentation for your hosts' BIOSes for details on how to enable HV. Check that hosts have sufficient
capacity and that they are not in maintenance mode.
Problem
When a Secondary VM resides on a host that is heavily loaded, the Secondary VM can affect the
performance of the Primary VM.
Cause
A Secondary VM running on a host that is overcommitted (for example, with its CPU resources) might not
get the same amount of resources as the Primary VM. When this occurs, the Primary VM must slow down
to allow the Secondary VM to keep up, effectively reducing its execution speed to the slower speed of the
Secondary VM.
VMware, Inc. 12
vSphere Troubleshooting
Solution
If the Secondary VM is on an overcommitted host, you can move the VM to another location without
resource contention problems. Or more specifically, do the following:
n For FT networking contention, use vMotion technology to move the Secondary VM to a host with
fewer FT VMs contending on the FT network. Verify that the quality of the storage access to the VM is
not asymmetric.
n For storage contention problems, turn FT off and on again. When you recreate the Secondary VM,
change its datastore to a location with less resource contention and better performance potential.
n To resolve a CPU resources problem, set an explicit CPU reservation for the Primary VM at an MHz
value sufficient to run its workload at the desired performance level. This reservation is applied to
both the Primary and Secondary VMs, ensuring that both VMs can execute at a specified rate. For
guidance in setting this reservation, view the performance graphs of the virtual machine (before Fault
Tolerance was enabled) to see how many CPU resources it used under normal conditions.
Problem
FT VMs might see a variable increase in packet latency (on the order of milliseconds). Applications that
demand very low network packet latency or jitter (for example, certain real-time applications) might see a
degradation in performance.
Cause
Some increase in network latency is expected overhead for Fault Tolerance, but certain factors can add to
this latency. For example, if the FT network is on a particularly high latency link, this latency is passed on
to the applications. Also, if the FT network has insufficient bandwidth (fewer than 10 Gbps), greater
latency might occur.
Solution
Verify that the FT network has sufficient bandwidth (10 Gbps or more) and uses a low latency link
between the Primary VM and Secondary VM. These precautions do not eliminate network latency, but
minimize its potential impact.
Problem
Some hosts in the cluster might become overloaded with FT VMs, while other hosts might have unused
resources.
VMware, Inc. 13
vSphere Troubleshooting
Cause
vSphere DRS does not load balance FT VMs (unless they are using legacy FT). This limitation might
result in a cluster where hosts are unevenly distributed with FT VMs.
Solution
Manually rebalance the FT VMs across the cluster by using vSphere vMotion. Generally, the fewer FT
VMs that are on a host, the better they perform, due to reduced contention for FT network bandwidth and
CPU resources.
Problem
n If both the Primary VM and Secondary VM cannot access the metadata datastore, the VMs might fail
unexpectedly. Typically, an unrelated failure that terminates FT must also occur when access to the
FT metadata datastore is lost by both VMs. vSphere HA then tries to restart the Primary VM on a host
with access to the metadata datastore.
n The VM might stop being recognized as an FT VM by vCenter Server. This failed recognition can
allow unsupported operations such as taking snapshots to be performed on the VM and cause
problematic behavior.
Cause
Lack of access to the Fault Tolerance metadata datastore can lead to the undesirable outcomes in the
previous list.
Solution
When planning your FT deployment, place the metadata datastore on highly available storage. While FT
is running, if you see that the access to the metadata datastore is lost on either the Primary VM or the
Secondary VM, promptly address the storage problem before loss of access causes one of the previous
problems. If a VM stops being recognized as an FT VM by vCenter Server, do not perform unsupported
operations on the VM. Restore access to the metadata datastore. After access is restored for the FT VMs
and the refresh period has ended, the VMs are recognizable.
VMware, Inc. 14
vSphere Troubleshooting
Problem
When you select Turn On Fault Tolerance for a powered-on VM, the operation fails and you see an
Unknown error message.
Cause
This operation can fail if the host that the VM is running on has insufficient memory resources to provide
fault tolerant protection. vSphere Fault Tolerance automatically tries to allocate a full memory reservation
on the host for the VM. Overhead memory is required for fault tolerant VMs and can sometimes expand to
1 to 2 GB. If the powered-on VM is running on a host that has insufficient memory resources to
accommodate the full reservation plus the overhead memory, trying to turn on Fault Tolerance fails.
Subsequently, the Unknown error message is returned.
Solution
n Free up memory resources on the host to accommodate the VM's memory reservation and the added
overhead.
n Move the VM to a host with ample free memory resources and try again.
Problem
Because EVC is a prerequisite for using DRS with FT VMs, DRS does not place or evacuate them if EVC
has been disabled (even if it is later reenabled).
Cause
When EVC is disabled on a DRS cluster, a VM override that disables DRS on an FT VM might be added.
Even if EVC is later reenabled, this override is not canceled.
Solution
If DRS does not place or evacuate FT VMs in the cluster, check the VMs for a VM override that is
disabling DRS. If you find one, remove the override that is disabling DRS.
Note For more information on how to edit or delete VM overrides, see vSphere Resource Management.
VMware, Inc. 15
vSphere Troubleshooting
Use a10-Gbit logging network for FT and verify that the network is low latency.
Too Much Activity on VMFS Volume Can Lead to Virtual Machine Failovers
When a number of file system locking operations, virtual machine power ons, power offs, or vMotion
migrations occur on a single VMFS volume, this can trigger fault tolerant virtual machines to be failed
over. A symptom that this might be occurring is receiving many warnings about SCSI reservations in the
VMkernel log. To resolve this problem, reduce the number of file system operations or ensure that the
fault tolerant virtual machine is on a VMFS volume that does not have an abundance of other virtual
machines that are regularly being powered on, powered off, or migrated using vMotion.
VMware, Inc. 16
vSphere Troubleshooting
Error Message When You Try to Migrate Virtual Machine with USB
Devices Attached
Migration with vMotion cannot proceed and issues a confusing error message when you connect multiple
USB devices from an ESXi host to a virtual machine and one or more devices are not enabled for
vMotion.
Problem
The Migrate Virtual Machine wizard runs a compatibility check before a migration operation begins. If
unsupported USB devices are detected, the compatibility check fails and an error message similar to the
following appears: Currently connected device 'USB 1' uses backing 'path:1/7/1', which
is not accessible.
Cause
To successfully pass vMotion compatibility checks, you must enable all USB devices that are connected
to the virtual machine from a host for vMotion. If one or more devices are not enabled for vMotion,
migration will fail.
Solution
1 Make sure that the devices are not in the process of transferring data before removing them.
Problem
If the USB arbitrator is being used for USB passthrough from an ESXi host to a virtual machine the USB
device appears under lsusb but does not mount correctly.
Cause
This problem occurs because the nonbootable USB device is reserved for the virtual machine by default.
It does not appear on the host's file system, even though lsusb can see the device.
Solution
VMware, Inc. 17
vSphere Troubleshooting
4 Restart hostd and any running virtual machines to restore access to the passthrough devices in the
virtual machine.
What to do next
Problem
Virtual machines that reside on an ESXi host that vCenter Server manages might become orphaned in
rare cases. Such virtual machines exist in the vCenter Server database, but the ESXi host no longer
recognizes them.
Cause
Virtual machines can become orphaned if a host failover is unsuccessful, or when the virtual machine is
unregistered directly on the host. If this situation occurs, move the orphaned virtual machine to another
host in the data center on which the virtual machine files are stored.
Solution
1 Determine the datastore where the virtual machine configuration (.vmx) file is located.
a Select the virtual machine in the vSphere Web Client inventory, and click the Datastores tab.
The datastore or datastores where the virtual machine files are stored are displayed.
b If more than one datastore is displayed, select each datastore and click the file browser icon to
browse for the .vmx file.
2 Return to the virtual machine in the vSphere Web Client, right-click it, and select All Virtual
Infrastructure Actions > Remove from Inventory.
a Right-click the datastore where the virtual machine file is located and select Register VM.
c Select the location for the virtual machine and click Next.
d Select the host on which to run the virtual machine and click Next.
e Click Finish.
VMware, Inc. 18
vSphere Troubleshooting
Problem
When you clone a virtual machine or deploy a virtual machine from a template, you might not be able to
power on the virtual machine after creation.
Cause
The swap file size is not reserved when the virtual machine disks are created.
Solution
n Reduce the size of the swap file that is required for the virtual machine. You can do this by increasing
the virtual machine memory reservation.
c Use the Reservation drop-down menu to increase the amount of memory allocated to the virtual
machine.
d Click OK.
n Alternatively, you can increase the amount of space available for the swap file by moving other virtual
machine disks off the datastore that is being used for the swap file.
c For each virtual machine to move, right-click the virtual machine and select Migrate.
n You can also increase the amount of space available for the swap file by changing the swap file
location to a datastore with adequate space.
VMware, Inc. 19
vSphere Troubleshooting
d Click Edit.
Note If the host is part of a cluster that specifies that the virtual machine swap files are stored in
the same directory as the virtual machine, you cannot click Edit. You must use the Cluster
Settings dialog box to change the swap file location policy for the cluster.
e Select Use a specific datastore and select a datastore from the list.
f Click OK.
VMware, Inc. 20
Troubleshooting Hosts 3
The host troubleshooting topics provide solutions to potential problems that you might encounter when
using your vCenter Servers and ESXi hosts.
n Active Directory Rule Set Error Causes Host Profile Compliance Failure
Problem
vSphere HA reports that an agent is in the Agent Unreachable state when the agent for the host cannot
be contacted by the master host or by vCenter Server. Consequently, vSphere HA is not able to monitor
the virtual machines on the host and might not restart them after a failure.
VMware, Inc. 21
vSphere Troubleshooting
Cause
A vSphere HA agent can be in the Agent Unreachable state for several reasons. This condition most
often indicates that a networking problem is preventing vCenter Server or the master host from contacting
the agent on the host, or that all hosts in the cluster have failed. This condition can also indicate the
unlikely situation that vSphere HA was disabled and then re-enabled on the cluster while vCenter Server
could not communicate with the vSphere HA agent on the host, or that the ESXi host agent on the host
has failed, and the watchdog process was unable to restart it. In any of these cases, a failover event is
not triggered when a host goes into the Unreachable state.
Solution
Determine if vCenter Server is reporting the host as not responding. If so, there is a networking problem,
an ESXi host agent failure, or a total cluster failure. After the condition is resolved, vSphere HA should
work correctly. If not, reconfigure vSphere HA on the host. Similarly, if vCenter Server reports the hosts
are responding but a host's state is Agent Unreachable, reconfigure vSphere HA on that host.
Problem
vSphere HA reports that an agent is in the Uninitialized state when the agent for the host is unable to
enter the run state and become the master host or to connect to the master host. Consequently, vSphere
HA is not able to monitor the virtual machines on the host and might not restart them after a failure.
Cause
A vSphere HA agent can be in the Uninitialized state for one or more reasons. This condition most often
indicates that the host does not have access to any datastores. Less frequently, this condition indicates
that the host does not have access to its local datastore on which vSphere HA caches state information,
the agent on the host is inaccessible, or the vSphere HA agent is unable to open required firewall ports. It
is also possible that the ESXi host agent has stopped.
Solution
Search the list of the host's events for recent occurrences of the event vSphere HA Agent for the
host has an error. This event indicates the reason for the host being in the uninitialized state. If the
condition exists because of a datastore problem, resolve whatever is preventing the host from accessing
the affected datastores. If the ESXi host agent has stopped, you must restart it. After the problem has
been resolved, if the agent does not return to an operational state, reconfigure vSphere HA on the host.
Note If the condition exists because of a firewall problem, check if there is another service on the host
that is using port 8182. If so, shut down that service, and reconfigure vSphere HA.
VMware, Inc. 22
vSphere Troubleshooting
Problem
vSphere HA reports that an agent is in the Initialization Error state when the last attempt to configure
vSphere HA for the host failed. vSphere HA does not monitor the virtual machines on such a host and
might not restart them after a failure.
Cause
This condition most often indicates that vCenter Server was unable to connect to the host while the
vSphere HA agent was being installed or configured on the host. This condition might also indicate that
the installation and configuration completed, but the agent did not become a master host or a slave host
within a timeout period. Less frequently, the condition is an indication that there is insufficient disk space
on the host's local datastore to install the agent, or that there are insufficient unreserved memory
resources on the host for the agent resource pool. Finally, for ESXi 5.x hosts, the configuration fails if a
previous installation of another component required a host reboot, but the reboot has not yet occurred.
Solution
Host communication Resolve any communication problems with the host and retry the configuration operation.
errors
Timeout errors Possible causes include that the host crashed during the configuration task, the agent failed to start
after being installed, or the agent was unable to initialize itself after starting up. Verify that vCenter
Server is able to communicate with the host. If so, see vSphere HA Agent Is in the Agent Unreachable
State or vSphere HA Agent is in the Uninitialized State for possible solutions.
Lack of resources Free up approximately 75MB of disk space. If the failure is due to insufficient unreserved memory, free
up memory on the host by either relocating virtual machines to another host or reducing their
reservations. In either case, retry the vSphere HA configuration task after resolving the problem.
Reboot pending If an installation for a 5.0 or later host fails because a reboot is pending, reboot the host and retry the
vSphere HA configuration task.
VMware, Inc. 23
vSphere Troubleshooting
Problem
vSphere HA reports that an agent is in the Uninitialization Error state when vCenter Server is unable to
unconfigure the agent on the host during the Unconfigure HA task. An agent left in this state can interfere
with the operation of the cluster. For example, the agent on the host might elect itself as master host and
lock a datastore. Locking a datastore prevents the valid cluster master host from managing the virtual
machines with configuration files on that datastore.
Cause
This condition usually indicates that vCenter Server lost the connection to the host while the agent was
being unconfigured.
Solution
Add the host back to vCenter Server (version 5.0 or later). The host can be added as a stand-alone host
or added to any cluster.
Problem
Usually, such reports indicate that a host has actually failed, but failure reports can sometimes be
incorrect. A failed host reduces the available capacity in the cluster and, in the case of an incorrect report,
prevents vSphere HA from protecting the virtual machines running on the host.
Cause
This host state is reported when the vSphere HA master host to which vCenter Server is connected is
unable to communicate with the host and with the heartbeat datastores that are in use for the host. Any
storage failure that makes the datastores inaccessible to hosts can cause this condition if accompanied
by a network failure.
Solution
Check for the noted failure conditions and resolve any that are found.
VMware, Inc. 24
vSphere Troubleshooting
Problem
While the virtual machines running on the host continue to be monitored by the master hosts that are
responsible for them, vSphere HA's ability to restart the virtual machines after a failure is affected. First,
each master host has access to a subset of the hosts, so less failover capacity is available to each host.
Second, vSphere HA might be unable to restart a FT Secondary VM after a failure (see Primary VM
Remains in the Need Secondary State).
Cause
n The vSphere HA master host to which vCenter Server is connected is unable to communicate with
the host by using the management (or VMware vSAN™)network, but is able to communicate with that
host by using the heartbeat datastores that have been selected for it.
A network partition can occur for a number of reasons including incorrect VLAN tagging, the failure of a
physical NIC or switch, configuring a cluster with some hosts that use only IPv4 and others that use only
IPv6, or the management networks for some hosts were moved to a different virtual switch without first
putting the host into maintenance mode.
Solution
Resolve the networking problem that prevents the hosts from communicating by using the management
networks.
Problem
When a host is in the Network Isolated state, there are two things to consider -- the isolated host and the
vSphere HA agent that holds the master role.
n On the isolated host, the vSphere HA agent applies the configured isolation response to the running
VMs, determining if they should be shut down or powered off. It does this after checking whether a
master agent is able to take responsibility for each VM (by locking the VM's home datastore.) If not,
the agent defers applying the isolation response for the VM and rechecks the datastore state after a
short delay.
n If the vSphere HA master agent can access one or more of the datastores, it monitors the VMs that
were running on the host when it became isolated and attempts to restart any that were powered off
or shut down.
VMware, Inc. 25
vSphere Troubleshooting
Cause
n Isolation addresses have been configured and the host is unable to ping them.
n The vSphere HA agent on the host is unable to access any of the agents running on the other cluster
hosts.
Note If your vSphere HA cluster has vSAN enabled, a host is determined to be isolated if it cannot
communicate with the other vSphere HA agents in the cluster and cannot reach the configured isolation
addresses. Although the vSphere HA agents use the vSAN network for inter-agent communication, the
default isolation address is still the gateway of the host. Hence, in the default configuration, both networks
must fail for a host be declared isolated.
Solution
Resolve the networking problem that is preventing the host from pinging its isolation addresses and
communicating with other hosts.
Problem
When you enable vSphere HA on an existing cluster with a large number of hosts and virtual machines,
the setup of vSphere HA on some of the hosts might fail.
Cause
This failure is the result of a time out occurring before the installation of vSphere HA on the host(s)
completes.
Solution
Set the vCenter Server advanced option config.vpxd.das.electionWaitTimeSec to value=240. Once this
change is made, the time outs do not occur.
VMware, Inc. 26
vSphere Troubleshooting
Problem
A TFTP Timeout error message appears when a host provisioned with vSphere Auto Deploy boots. The
text of the message depends on the BIOS.
Cause
Solution
n Ensure that your TFTP service is running and reachable by the host that you are trying to boot.
n To view the diagnostic logs for details on the present error, see your TFTP service documentation.
Problem
A host is booting with a different ESXi image profile or configuration than the image profile or
configuration that the rules specify. For example, you change the rules to assign a different image profile,
but the host still uses the old image profile.
Cause
After the host has been added to a vCenter Server system, the boot configuration is determined by the
vCenter Server system. The vCenter Server system associates an image profile, host profile, or folder
location with the host.
Solution
Problem
During boot, a host that you want to provision with vSphere Auto Deploy loads iPXE. The host is not
redirected to the vSphere Auto Deploy server.
Cause
The tramp file that is included in the TFTP ZIP file has the wrong IP address for the vSphere Auto Deploy
server.
VMware, Inc. 27
vSphere Troubleshooting
Solution
u Correct the IP address of the vSphere Auto Deploy server in the tramp file, as explained in the
vSphere Installation and Setup documentation.
Problem
When you write or modify rules to assign an image profile to one or more hosts, the following error
results:
Warning: Image Profile <name-here> contains one or more software packages that are
not stateless-ready. You may experience problems when using this profile with Auto
Deploy.
Cause
Each VIB in an image profile has a stateless-ready flag that indicates that the VIB is meant for use
with vSphere Auto Deploy. You get the error if you attempt to write a vSphere Auto Deploy rule that uses
an image profile in which one or more VIBs have that flag set to FALSE.
Note You can use hosts provisioned with vSphere Auto Deploy that include VIBs that are not stateless
ready without problems. However booting with an image profile that includes VIBs that are not stateless
ready is treated like a fresh install. Each time you boot the host, you lose any configuration data that
would otherwise be available across reboots for hosts provisioned with vSphere Auto Deploy.
Solution
1 Use vSphere ESXi Image Builder cmdlets in a vSphere PowerCLI session to view the VIBs in the
image profile.
vSphere Auto Deploy Host with a Built-In USB Flash Drive Does
Not Send Coredumps to Local Disk
If your vSphere Auto Deploy host has a built-in USB flash drive, and an error results in a coredump, the
coredump is lost. Set up your system to use ESXi Dump Collector to store coredumps on a networked
host.
Problem
If your vSphere Auto Deploy host has a built-in USB Flash, and if it encounters an error that results in a
coredump, the coredump is not sent to the local disk.
VMware, Inc. 28
vSphere Troubleshooting
Solution
Problem
A host to be provisioned with vSphere Auto Deploy boots from iPXE and displays iPXE information on the
console. However, after five minutes, the host displays the following message to the console and reboots.
Details: This host has been added to VC, but no Image Profile
is associated with it. You can use Apply-ESXImageProfile in the
PowerCLI to associate an Image Profile with this host.
Alternatively, you can reevaluate the rules for this host with the
Test-DeployRuleSetCompliance and Repair-DeployRuleSetCompliance cmdlets.
The console then displays the host's machine attributes including vendor, serial number, IP address, and
so on.
Cause
VMware, Inc. 29
vSphere Troubleshooting
Solution
You can assign an image profile to the host by running the Apply-EsxImageProfile cmdlet, or by
creating the following rule:
1 Run the New-DeployRule cmdlet to create a rule that includes a pattern that matches the host with
an image profile.
3 Run the Test-DeployRuleSetCompliance cmdlet and use the output of that cmdlet as the input to
the Repair-DeployRuleSetCompliance cmdlet.
Problem
When you attempt to boot a host provisioned with vSphere Auto Deploy, the host performs a network boot
and is assigned a DHCP address by the DHCP server, but the host cannot contact the TFTP server.
Cause
The TFTP server might have stopped running, or a firewall might block the TFTP port.
Solution
n If you installed the WinAgents TFTP server, open the WinAgents TFTP management console and
verify that the service is running. If the service is running, check the Windows firewall's inbound rules
to make sure the TFTP port is not blocked. Turn off the firewall temporarily to see whether the firewall
is the problem.
n For all other TFTP servers, see the server documentation for debugging procedures.
Problem
When you attempt to boot a host provisioned with vSphere Auto Deploy, the boot process stops at the
iPXE boot screen and the status message indicates that the host is attempting to get the ESXi image
from the vSphere Auto Deploy server.
Cause
The vSphere Auto Deploy service might be stopped or the vSphere Auto Deploy server might be
unaccessible.
VMware, Inc. 30
vSphere Troubleshooting
Solution
1 Log in to the system on which you installed the vSphere Auto Deploy server.
a Click Start > Settings > Control Panel > Administrative Tools.
c In the Services field, look for the VMware vSphere Auto Deploy Waiter service and restart the
service if it is not running.
3 Open a Web browser, enter the following URL, and check whether the vSphere Auto Deploy server is
accessible.
https://Auto_Deploy_Server_IP_Address:Auto_Deploy_Server_Port/vmw/rdb
Note Use this address only to check whether the server is accessible.
a Try setting up permissive TCP Inbound rules for the vSphere Auto Deploy server port.
The port is 6501 unless you specified a different port during installation.
b As a last resort, disable the firewall temporarily and enable it again after you verified whether it
blocked the traffic. Do not disable the firewall on production environments.
To disable the firewall, run netsh firewall set opmode disable. To enable the firewall, run
netsh firewall set opmode enable.
Problem
When you attempt to boot a host provisioned with vSphere Auto Deploy, the host performs a network boot
but is not assigned a DHCP address. The vSphere Auto Deploy server cannot provision the host with the
image profile.
Cause
You might have a problem with the DHCP service or with the firewall setup.
VMware, Inc. 31
vSphere Troubleshooting
Solution
1 Check that the DHCP server service is running on the Windows system on which the DHCP server is
set up to provision hosts.
a Click Start > Settings > Control Panel > Administrative Tools.
c In the Services field, look for the DHCP server service and restart the service if it is not running.
2 If the DHCP server is running, recheck the DHCP scope and the DHCP reservations that you
configured for your target hosts.
If the DHCP scope and reservations are configured correctly, the problem most likely involves the
firewall.
3 As a temporary workaround, turn off the firewall to see whether that resolves the problem.
a Open the command prompt by clicking Start > Program > Accessories > Command prompt.
b Type the following command to temporarily turn off the firewall. Do not turn off the firewall in a
production environment.
See the firewall documentation for DHCP and for the Windows system on which the DHCP server is
running for details.
Problem
When you attempt to boot a host provisioned with vSphere Auto Deploy, the host does not start the
network boot process.
Cause
Solution
1 Reboot the host and follow the on-screen instructions to access the BIOS configuration.
2 In the BIOS configuration, enable Network Boot in the Boot Device configuration.
VMware, Inc. 32
vSphere Troubleshooting
Problem
When you use vSphere Auto Deploy to provision the ESXi hosts in your environment, you might
encounter a problem with the vSphere Auto Deploy database.
Important This is a rare problem. Follow all other vSphere Auto Deploy troubleshooting strategies
before you replace the current database file. Rules or associations that you created since the backup you
choose are lost.
Cause
This problem happens only with hosts that are provisioned with vSphere Auto Deploy.
Solution
2 Find the vSphere Auto Deploy log by going to the vSphere Auto Deploy page in the
vSphere Web Client.
If you see the message, replace the existing database with the most recent backup.
The directory contains a file named db, and backup files named db-yyy-mm-dd.
VMware Support might ask for that file if you call for assistance.
8 If the message still appears in the log, repeat the steps to use the next recent backup until vSphere
Auto Deploy works without database errors.
VMware, Inc. 33
vSphere Troubleshooting
Problem
When you create a password on the host, the following fault message appears: A general system
error occurred: passwd: Authentication token manipulation error.
The following message is included: Failed to set the password. It is possible that your
password does not meet the complexity criteria set by the system.
Cause
The host checks for password compliance using the default authentication plug-in, pam_passwdqc.so. If
the password is not compliant, the error appears.
Solution
When you create a password, include a mix of characters from four character classes: lowercase letters,
uppercase letters, numbers, and special characters such as an underscore or dash.
n Passwords containing characters from three character classes must be at least eight characters long.
n Passwords containing characters from all four character classes must be at least seven characters
long.
Note An uppercase character that begins a password does not count toward the number of character
classes used. A number that ends a password does not count toward the number of character classes
used.
Problem
When you apply a host profile that specifies an Active Directory domain to join, but you do not enable the
activeDirectoryAll rule set in the firewall configuration, a compliance failure occurs. The
vSphere Web Client displays the error message Failures against the host profile: Ruleset
activedirectoryAll does not match the specification. The compliance failure also occurs
when you apply a host profile to leave an Active Directory domain, but you do not disable the
activeDirectoryAll rule set in the host profile.
VMware, Inc. 34
vSphere Troubleshooting
Cause
Active Directory requires the activeDirectoryAll firewall rule set. You must enable the rule set in the
firewall configuration. If you omit this setting, the system adds the necessary firewall rules when the host
joins the domain, but the host will be noncompliant because of the mismatch in firewall rules. The host will
also be noncompliant if you remove it from the domain without disabling the Active Directory rule set.
Solution
To find a host profile, click Policies and Profiles > Host Profiles on the vSphere Web Client Home
page.
3 Click Next.
4 Select Security and Services > Firewall Configuration > Firewall configuration > Ruleset
Configuration.
6 In the right panel, select the Flag indicating whether ruleset should be enabled check box.
7 Click Next, and then click Finish to complete the change to the host profile.
Problem
If you configure vCenter Server reverse proxy to use a custom port, the VIB downloads fail.
Cause
If vCenter Server is using a custom port for the reverse proxy, the custom port is not automatically
enabled in the ESXi firewall and the VIB downloads fail.
Solution
cp /etc/vmware/firewall/service.xml /etc/vmware/firewall/service.xml.bak
VMware, Inc. 35
vSphere Troubleshooting
4 Edit the access permissions of the service.xml file to allow writes by running the chmod command.
6 Add a new rule to the service.xml file that enables the custom port for the vCenter Server reverse
proxy .
<service id='id_value'>
<id>vcenterrhttpproxy</id>
<rule id='0000'>
<direction>outbound</direction>
<protocol>tcp</protocol>
<port type='dst'>custom_reverse_proxy_port</port>
</rule>
<enabled>true</enabled>
<required>false</required>
</service>
Where id_value must be a unique value, for example, if the last listed service in the service.xml file
has ID 0040, you must enter id number 0041.
7 Revert the access permissions of the service.xml file to the default read-only setting.
VMware, Inc. 36
vSphere Troubleshooting
10 (Optional) If you want the firewall configuration to persist after a reboot of the ESXi host, copy the
service.xml onto persistent storage and modify the local.sh file.
a Copy the modified service.xml file onto persistent storage, for example /store/, or onto a
VMFS volume, for example /vmfs/volumes/volume/.
cp /etc/vmware/firewall/service.xml location_of_xml_file
You can store a VMFS volume in a single location and copy it to multiple hosts.
b Add the service.xml file information to the local.sh file on the host.
cp location_of_xml_file /etc/vmware/firewall
esxcli network firewall refresh
VMware, Inc. 37
Troubleshooting vCenter Server
and the vSphere Web Client 4
The vCenter Server and vSphere Web Client troubleshooting topics provide solutions to problems you
might encounter when you set up and configure vCenter Server and the vSphere Web Client, including
vCenter Single Sign-On.
Problem
If the vCenter Server installer cannot stop the Tomcat service during an upgrade, the upgrade fails with
an error message similar to Unable to delete VC Tomcat service. This problem can occur even if
you stop the Tomcat service manually before the upgrade, if some files that are used by the Tomcat
process are locked.
Solution
1 From the Windows Start menu, select Settings > Control Panel > Administrative Tools >
Services.
VMware, Inc. 38
vSphere Troubleshooting
This releases any locked files that are used by the Tomcat process, and enables the vCenter Server
installer to stop the Tomcat service for the upgrade.
Alternatively, you can restart the vCenter Server machine and restart the upgrade process, but select the
option not to overwrite the vCenter Server data.
Problem
The following error message appears: The DB User entered does not have the required
permissions needed to install and configure vCenter Server with the selected DB.
Please correct the following error(s): %s
Cause
The database version must be supported for vCenter Server. For SQL, even if the database is a
supported version, if it is set to run in compatibility mode with an unsupported version, this error occurs.
For example, if SQL 2008 is set to run in SQL 2000 compatibility mode, this error occurs.
Solution
u Make sure the vCenter Server database is a supported version and is not set to compatibility mode
with an unsupported version. See the VMware Product Interoperability Matrixes at
http://partnerweb.vmware.com/comp_guide2/sim/interop_matrix.php?.
Problem
When you log in to the vSphere Web Client, the inventory appears to be empty or the vCenter Server
system you expected to see does not appear.
VMware, Inc. 39
vSphere Troubleshooting
Cause
In vSphere 5.1 and later, you log into the vSphere Web Client to view and manage multiple instances of
vCenter Server. Any vCenter Server system on which you have permissions appears in the inventory, if
the server is registered with the same Component Manager as the vSphere Web Client.
Solution
n Log in to the vSphere Web Client as a user with permissions on the vCenter Server system.
The vCenter Server system will not appear in the inventory if you do not have permissions on it. For
example, if you log in as the vCenter Single Sign On administrator user, you might not have
permissions on any vCenter Server system.
n Verify that the vCenter Server system is registered with the same Component Manager as the
vSphere Web Client.
The vSphere Web Client discovers only vCenter Server systems that are registered with the same
Componenent Manager.
Problem
When you attempt to open a virtual machine console from the vSphere Web Client, the console does not
open. The following error message appears:
Cause
Another program or process is using port 9443, the default port used by the HTML5 virtual machine
console.
VMware, Inc. 40
vSphere Troubleshooting
Solution
u Edit the webclient.properties file to add the line html.console.port=port, where port is the
new port number.
The webclient.properties file is located in one of the following locations, depending on the
operating system on the machine on which the vSphere Web Client is installed:
Problem
vCenter Server cannot connect to managed hosts after server certificates are replaced and the system is
restarted.
Solution
Log into the host as the root user and reconnect the host to vCenter Server.
Problem
When you install new vCenter Server certificates, you might not see the new certificate.
Cause
Existing open connections to vCenter Server are not forcibly closed and might still use the old certificate.
Solution
To force all connections to use the new certificate, use one of the following methods.
VMware, Inc. 41
vSphere Troubleshooting
Problem
When you attempt to enable vSphere HA on a host with custom SSL certificates installed, the following
error message appears: vSphere HA cannot be configured on this host because its SSL
thumbprint has not been verified.
Cause
When you add a host to vCenter Server, and vCenter Server already trusts the host's SSL certificate,
VPX_HOST.EXPECTED_SSL_THUMBPRINT is not populated in the vCenter Server database. vSphere HA
obtains the host's SSL thumbprint from this field in the database. Without the thumbprint, you cannot
enable vSphere HA.
Solution
1 In the vSphere Web Client, disconnect the host that has custom SSL certificates installed.
VMware, Inc. 42
Troubleshooting Availability 5
The availability troubleshooting topics provide solutions to potential problems that you might encounter
when using your hosts and datastores in vSphere HA clusters.
You might get an error message when you try to use vSphere HA or vSphere FT. For information about
these error messages, see the VMware knowledge base article at http://kb.vmware.com/kb/1033634.
If vSphere HA admission control does not function properly, there is no assurance that all virtual
machines in the cluster can be restarted after a host failure.
Problem
If you select the Host Failures Cluster Tolerates admission control policy and certain problems arise, the
cluster turns red.
Cause
This problem can arise when hosts in the cluster are disconnected, in maintenance mode, not
responding, or have a vSphere HA error. Disconnected and maintenance mode hosts are typically caused
by user action. Unresponsive or error-possessing hosts usually result from a more serious problem, for
example, hosts or agents have failed or a networking problem exists.
VMware, Inc. 43
vSphere Troubleshooting
Another possible cause of this problem is if your cluster contains any virtual machines that have much
larger memory or CPU reservations than the others. The Host Failures Cluster Tolerates admission
control policy is based on the calculation on a slot size consisting of two components, the CPU and
memory reservations of a virtual machine. If the calculation of this slot size is skewed by outlier virtual
machines, the admission control policy can become too restrictive and result in a red cluster. In this case,
you can use the vSphere HA advanced options to reduce the slot size, use a different admission control
policy, or modify the policy to tolerate fewer host failures.
Solution
Check that all hosts in the cluster are healthy, that is, connected, not in maintenance mode and free of
vSphere HA errors. vSphere HA admission control only considers resources from healthy hosts.
Problem
If you select the Host Failures Cluster Tolerates admission control policy and certain problems arise, you
might be prevented from powering on a virtual machine due to insufficient resources.
Cause
n Hosts in the cluster are disconnected, in maintenance mode, not responding, or have a vSphere HA
error.
Disconnected and maintenance mode hosts are typically caused by user action. Unresponsive or
error-possessing hosts usually result from a more serious problem, for example, hosts or agents have
failed or a networking problem exists).
n Cluster contains virtual machines that have much larger memory or CPU reservations than the
others.
The Host Failures Cluster Tolerates admission control policy is based on the calculation on a slot size
comprised of two components, the CPU and memory reservations of a virtual machine. If the
calculation of this slot size is skewed by outlier virtual machines, the admission control policy can
become too restrictive and result in the inability to power on virtual machines.
Problems occur if there are no free slots in the cluster or if powering on a virtual machine causes the
slot size to increase because it has a larger reservation than existing virtual machines. In either case,
you should use the vSphere HA advanced options to reduce the slot size, use a different admission
control policy, or modify the policy to tolerate fewer host failures.
VMware, Inc. 44
vSphere Troubleshooting
Solution
View the Advanced Runtime Info pane that appears in the vSphere HA section of the cluster's Monitor
tab in the vSphere Web Client. This information pane shows the slot size and how many available slots
there are in the cluster. If the slot size appears too high, click on the Resource Allocation tab of the
cluster and sort the virtual machines by reservation to determine which have the largest CPU and
memory reservations. If there are outlier virtual machines with much higher reservations than the others,
consider using a different vSphere HA admission control policy (such as the Percentage of Cluster
Resources Reserved admission control policy) or use the vSphere HA advanced options to place an
absolute cap on the slot size. Both of these options, however, increase the risk of resource fragmentation.
Problem
When you select the Host Failures Cluster Tolerates admission control policy, view the Advanced
Runtime Info pane that appears in the vSphere HA section of the cluster's Monitor tab in the
vSphere Web Client. This pane displays information about the cluster, including the number of slots
available to power on additional virtual machines in the cluster. This number might be smaller than
expected under certain conditions.
Cause
Slot size is calculated using the largest reservations plus the memory overhead of any powered on virtual
machines in the cluster. However, vSphere HA admission control considers only the resources on a host
that are available for virtual machines. This amount is less than the total amount of physical resources on
the host, because there is some overhead.
Solution
Reduce the virtual machine reservations if possible, use vSphere HA advanced options to reduce the slot
size, or use a different admission control policy.
VMware, Inc. 45
vSphere Troubleshooting
vCenter Server automatically selects a preferred set of datastores for heartbeating. This selection is
made with the goal of maximizing the number of hosts that have access to a given datastore and
minimizing the likelihood that the selected datastores are backed by the same storage array or NFS
server. In most cases, this selection should not be changed. To see which datastores vSphere HA has
selected for use, in the vSphere Web Client you can go to the cluster's Monitor tab and select vSphere
HA and Heartbeat. Only datastores mounted by at least two hosts are available here.
Note There is no heartbeat datastore available if the only shared storage accessible to all hosts in the
cluster is vSAN.
Problem
You can specify the datastores preferred for storage heartbeating, and based on this preference, vCenter
Server determines the final set of datastores to use. However, vCenter Server might not choose the
datastores that you specify.
Cause
n The specified number of datastores is more than is required. vCenter Server chooses the optimal
number of required datastores out of the stated user preference and ignores the rest.
n A specified datastore is not optimal for host accessibility and storage backing redundancy. More
specifically, the datastore might not be chosen if it is accessible to only a small set of hosts in the
cluster. A datastore also might not be chosen if it is on the same LUN or the same NFS server as
datastores that vCenter Server has already chosen.
n A specified datastore is inaccessible because of storage failures, for example, storage array all paths
down (APD) or permanent device loss (PDL).
n If the cluster contains a network partition, or if a host is unreachable or isolated, the host continues to
use the existing heartbeat datastores even if the user preferences change.
Solution
Verify that all the hosts in the cluster are reachable and have the vSphere HA agent running. Also, ensure
that the specified datastores are accessible to most, if not all, hosts in the cluster and that the datastores
are on different LUNs or NFS servers.
VMware, Inc. 46
vSphere Troubleshooting
Problem
The operation to unmount or remove a datastore fails if the datastore has any opened files. For these
user operations, the vSphere HA agent closes all the files that it has opened, for example, heartbeat files.
If the agent is not reachable by vCenter Server or the agent cannot flush out pending I/Os to close the
files, a The HA agent on host '{hostName}' failed to quiesce file activity on datastore
'{dsName} fault is triggered.
Cause
If the datastore to be unmounted or removed is used for heartbeating, vCenter Server excludes it from
heartbeating and chooses a new one. However, the agent does not receive the updated heartbeat
datastores if it is not reachable, that is, if the host is isolated or in a network partition. In such cases,
heartbeat files are not closed and the user operation fails. The operation can also fail if the datastore is
not accessible because of storage failures such as all paths down.
Note When you remove a VMFS datastore, the datastore is removed from all the hosts in inventory. So
if there are any hosts in a vSphere HA cluster that are unreachable or that cannot access the datastore,
the operation fails.
Solution
Ensure that the datastore is accessible and the affected hosts are reachable.
There are various reasons why affected virtual machines might not be restarted and if this happens you
should troubleshoot to determine the cause.
Problem
When a virtual machine is powered on for several minutes, yet its vSphere HA protection state remains as
unprotected, if a failure occurs, vSphere HA might not attempt to restart the virtual machine.
Cause
vCenter Server reports a virtual machine as protected after the vSphere HA master host that is
responsible for the virtual machine has saved to disk the information that the virtual machine must be
restarted after a failure. This process can fail for a number of reasons.
n vSphere HA master host has not been elected or vCenter Server is unable to communicate with it.
VMware, Inc. 47
vSphere Troubleshooting
In this situation, vCenter Server reports the vSphere HA host state for the cluster hosts as Agent
Unreachable or Agent Uninitialized and reports a cluster configuration problem that a master host has
not been found.
n Multiple master hosts exist and the one with which vCenter Server is communicating is not
responsible for the virtual machine.
Problems occur when vCenter Server is in contact with a master host, but due to a management
network partition, there are multiple master hosts, and the agent with which vCenter Server is
communicating is not responsible for the virtual machine. This situation is likely if vCenter Server is
reporting the vSphere HA state of some hosts as network partitioned.
n Agent is unable to access the datastore on which the configuration file of the virtual machine is
stored.
vCenter Server might be in contact with the vSphere HA master host that owns the virtual machine,
but the agent is unable to access the datastore on which the configuration file of the virtual machine is
stored. This situation can occur if an all paths down condition affects all hosts in the cluster.
Solution
1 Determine whether vCenter Server is in contact with a vSphere HA master host, and if not, address
this problem.
2 If vCenter Server is in contact with a master host, determine whether there is a network partition, and
if so, address that problem.
3 If the problem persists, determine if other virtual machines that use the same datastore for their
configuration files are also unprotected.
4 If these virtual machines are unprotected, verify that the vSphere HA master host can access the
datastore.
5 If none of the previous steps resolves the problem, restore protection by reconfiguring vSphere HA on
the host on which the virtual machine is running.
Problem
When a host fails or a virtual machine fails while its host continues running, the virtual machine might not
restart or restarts only after a long delay.
Cause
vSphere HA might not restart a virtual machine after a failure or might delay its restart for several
reasons.
n Virtual machine is not protected by vSphere HA at the time the failure occurred
n Insufficient spare capacity on hosts with which the virtual machine is compatible
VMware, Inc. 48
vSphere Troubleshooting
n vSphere HA attempted to restart the virtual machine but encountered a fatal error each time it tried.
n Your cluster's shared storage is vSAN and one of the virtual machine's files has become inaccessible
due to the occurrence of more than the specified number of host failures.
Solution
To avoid virtual machine restart failures, check that virtual machines become protected by vSphere HA
after they are powered on. Also, ensure that your admission control settings match your restart
expectations if a failure occurs. Maximizing the compatibility between virtual machines and hosts in the
cluster can also reduce the likelihood of restart failures.
Note For information on the factors vSphere HA considers for virtual machine restarts, see "Determining
Responses to Host Issues" in vSphere Availability.
A cluster partition impedes cluster management functions such as vMotion and can impact vSphere HA’s
ability to monitor and restart virtual machines after a failure. This condition must be corrected as soon as
possible.
Network partitions also degrade the functionality of vSphere Fault Tolerance. For example, in a partitioned
cluster, a Primary VM (or its Secondary VM) could end up in a partition managed by a master host that is
not responsible for the virtual machine. When a Secondary VM must be restarted, vSphere HA does so
only if the Primary VM is in a partition managed by the master host responsible for it. Ultimately, you must
correct the network partition, but until that is possible, you must troubleshoot and correct any problems
that arise with your fault-tolerant virtual machines to ensure that they are properly protected.
Problem
vSphere HA might not restart the Secondary VM of a vSphere Fault Tolerance (FT) virtual machine pair
even though there are sufficient resources available.
VMware, Inc. 49
vSphere Troubleshooting
Cause
To restart a Secondary VM, vSphere HA requires that the Primary VM be running on a host that is in the
same partition as the one containing the vSphere HA master host responsible for the FT pair. In addition,
the vSphere HA agent on the Primary VM’s host must be operating correctly. If these conditions are met,
FT also requires that there be at least one other host in the same partition that is compatible with the FT
pair and that has a functioning vSphere HA agent.
Solution
To fix this condition, check the vSphere HA host states reported by vCenter Server. If hosts are identified
as partitioned, isolated, or unreachable, resolve the problem before proceeding. In some situations, you
can resolve a restart problem by reconfiguring vSphere HA on the host that vCenter Server is reporting as
the master host. However, in most situations, this step is insufficient, and you must resolve all host state
problems.
After you have addressed any host state problems, check if there are any hosts in the cluster other than
the Primary VM's that are compatible with the FT virtual machine pair. You can determine compatibility by
trying to migrate the Primary VM to other hosts. Address any incompatibilities that are discovered.
Problem
After a failover occurs, vCenter Server might incorrectly report that the Primary VM is powered off and
registered to its original host, and that the Secondary VM is powered on and registered to its original host.
Cause
This error occurs when vCenter Server is unable to communicate with the hosts on which the Primary VM
and Secondary VM are actually running. vCenter Server reports these hosts as not responding and the
problem persists until vCenter Server is able to communicate with the hosts.
Solution
To fix this problem, resolve the networking problem that is preventing vCenter Server from communicating
with the hosts in the cluster.
If the response that you have configured VMCP to make for such a failure is not run, you should
troubleshoot to determine the cause.
VMware, Inc. 50
vSphere Troubleshooting
Problem
When an All Paths Down (APD) or Permanent Device Loss (PDL) failure occurs and a datastore becomes
inaccessible, VMCP might not resolve the issue for the affected virtual machines.
Cause
In an APD or PDL failure situation, VMCP might not terminate a virtual machine for the following reasons:
Furthermore, if the failure is an APD, VMCP might not terminate a VM for several reasons:
n During a network partition or isolation, the host affected by the APD failure is not able to query the
master host for available capacity. In such a case, vSphere HA defers to the user policy and
terminates the VM if the VM Component Protection setting is aggressive.
n vSphere HA terminates APD-affected VMs only after the following timeouts expire:
n APD failover delay (default 180 seconds). For faster recovery, this can be set to 0.
Note Based on these default values, vSphere HA terminates the affected virtual machine after
320 seconds (APD timeout + APD failover delay)
Solution
VMware, Inc. 51
Troubleshooting Resource
Management 6
The resource management troubleshooting topics provide solutions to potential problems that you might
encounter when using your hosts and datastores in vSphere DRS or vSphere Storage DRS cluster.
Problem
You have enabled Storage DRS for a datastore cluster, but Storage DRS is disabled on one or more
virtual machine disks in the datastore cluster.
Cause
The following scenarios can cause Storage DRS to be disabled on a virtual disk.
n A virtual machine's swap file is host-local (the swap file is stored in a specified datastore that is on the
host). The swap file cannot be relocated and Storage DRS is disabled for the swap file disk.
n A certain location is specified for a virtual machine's .vmx swap file. The swap file cannot be
relocated and Storage DRS is disabled on the .vmx swap file disk.
n The relocate or Storage vMotion operation is currently disabled for the virtual machine in vCenter
Server (for example, because other vCenter Server operations are in progress on the virtual
machine). Storage DRS is disabled until the relocate or Storage vMotion operation is re-enabled in
vCenter Server.
n The home disk of a virtual machine is protected by vSphere HA and relocating it will cause loss of
vSphere HA protection.
VMware, Inc. 52
vSphere Troubleshooting
n If the disk is an independent disk, Storage DRS is disabled, except in the case of relocation or clone
placement.
n If the virtual machine has system files on a separate datastore from the home datastore (legacy),
Storage DRS is disabled on the home disk. If you use Storage vMotion to manually migrate the home
disk, the system files on different datastores will be all be located on the target datastore and Storage
DRS will be enabled on the home disk.
n If the virtual machine has a disk whose base/redo files are spread across separate datastores
(legacy), Storage DRS for the disk is disabled. If you use Storage vMotion to manually migrate the
disk, the files on different datastores will be all be located on the target datastore and Storage DRS
will be enabled on the disk.
n The virtual machine has hidden disks (such as disks in previous snapshots, not in the current
snapshot). This situation causes Storage DRS to be disabled on the virtual machine.
n The virtual machine is being Storage DRS-placed with manually specified datastores.
Solution
Address the problem that is causing Storage DRS to be disabled on the disk.
Problem
A datastore in a datastore cluster cannot enter maintenance mode. The Entering Maintenance Mode
status remains at 1%.
Cause
One or more disks on the datastore cannot be migrated with Storage vMotion. This condition can occur in
the following instances.
n Storage DRS rules prevent Storage DRS from making migration recommendations for the disk.
Solution
n If Storage DRS is disabled, enable it or determine why it is disabled. See Storage DRS is Disabled on
a Virtual Disk for reasons why Storage DRS might be disabled.
VMware, Inc. 53
vSphere Troubleshooting
n If Storage DRS rules are preventing Storage DRS from making migration recommendations, you can
remove or disable particular rules.
a Browse to the datastore cluster in the vSphere Web Client object navigator.
d Click Remove.
n Alternatively, if Storage DRS rules are preventing Storage DRS from making migration
recommendations, you can set the Storage DRS advanced option IgnoreAffinityRulesForMaintenance
to 1.
a Browse to the datastore cluster in the vSphere Web Client object navigator.
g Click OK.
Problem
Storage DRS generates an event and an alarm and Storage DRS cannot operate.
Cause
The following scenarios can cause vCenter Server to disable Storage DRS for a datastore.
Storage DRS is not supported on datastores that are shared across multiple data centers. This
configuration can occur when a host in one data center mounts a datastore in another data center, or
when a host using the datastore is moved to a different data center. When a datastore is shared
across multiple data centers, Storage DRS I/O load balancing is disabled for the entire datastore
cluster. However, Storage DRS space balancing remains active for all datastores in the datastore
cluster that are not shared across data centers.
n The datastore is connected to a host that is not running Storage I/O Control.
VMware, Inc. 54
vSphere Troubleshooting
Solution
n The datastore must be visible in only one data center. Move the hosts to the same data center or
unmount the datastore from hosts that reside in other data centers.
n Ensure that all hosts associated with the datastore cluster are ESXi 5.0 or later.
n Ensure that all hosts associated with the datastore cluster have Storage I/O Control enabled.
Problem
When you attempt to migrate multiple virtual machines into a datastore cluster, some virtual machines
migrate successfully, but migration of subsequent virtual machines fails. vCenter Server displays the error
message, Insufficient Disk Space on Datastore.
Cause
Until each placement recommendation is applied, the space resources appear to be available to Storage
DRS. Therefore, Storage DRS might reallocate space resources to subsequent requests for space.
Solution
Retry the failed migration operations one at a time and ensure that each recommendation is applied
before requesting the next migration
Problem
When you attempt to create or clone a virtual machine on a datastore cluster, you might receive the error
message, Operation Not Allowed in the Current State.
Cause
Storage DRS checks for rule violations when you create a virtual machine on a Storage DRS-enabled
datastore. If Storage DRS cannot create the new virtual machine's disks in compliance with the rules, it
generates a fault. The fault is generated because Storage DRS cannot reference the virtual machine,
which is in the process of being created and does not yet exist.
Solution
Revise or remove the rules and retry the create or clone virtual machine operation.
VMware, Inc. 55
vSphere Troubleshooting
Problem
When you deploy an OVF template with Storage DRS disabled on a datastore cluster, the resulting virtual
machine has Storage DRS enabled.
Cause
The vSphere Web Client applies the default automation level of the datastore cluster to virtual machines
deployed from an OVF template.
Solution
1 To manually change the automation level of the virtual machine, browse to the datastore cluster in the
vSphere Web Client object navigator.
5 From the Keep VMDKs Together dropdown menu, select No and click OK.
Problem
The Faults dialog box appears to display multiple instances of identical faults, but in fact, each fault refers
to a different datastore. The Faults dialog box does not list the names of the datastores, which causes the
faults to appear to be redundant.
Solution
The Faults dialog box always displays a separate rule violation fault for each datastore that is considered
for placement. If you want the datastore to enter maintenance mode, remove the rule that prevents the
virtual machine from being migrated.
VMware, Inc. 56
vSphere Troubleshooting
Problem
When you remove a virtual machine from a datastore cluster, and that virtual machine is subject to an
affinity or anti-affinity rule in a datastore cluster, the rule remains. This allows you to store virtual machine
configurations in different datastore clusters. If the virtual machine is moved back into the datastore
cluster, the rule is applied. You cannot delete the rule after you remove the virtual machine from the
datastore cluster.
Cause
vCenter Server retains rules for a virtual machine that is removed from a datastore cluster if the virtual
machine remains in the vCenter Server inventory.
Solution
To remove a rule from a datastore cluster configuration, you must delete the rule before you remove the
virtual machine to which the rule applies from the datastore cluster.
5 Click OK.
Problem
Storage DRS generates a single placement recommendation when you create, clone, or relocate a virtual
machine. No alternative recommendations are provided when multiple alternative recommendations are
expected.
Cause
If the destination host explicitly specifies the virtual machine's swap file location as a datastore in the
target datastore cluster, the disks to be placed in that cluster do not form a single affinity group. Storage
DRS generates alternative placement recommendations only for a single item or a single affinity group.
Solution
Accept the single recommendation. To obtain multiple recommendations, choose a destination host that
does not specify that the virtual machine swap file location is on a datastore that is in the target datastore
cluster.
VMware, Inc. 57
vSphere Troubleshooting
Problem
When you apply Storage DRS recommendations for space or I/O load balancing, the operation fails.
Cause
The following scenarios can prevent you from applying Storage DRS recommendations.
n A Thin Provisioning Threshold Crossed alarm might have been triggered for the target datastore,
which indicates that the datastore is running out of space and no virtual machines will be migrated to
it.
Solution
n Address the issue that triggered the Thin Provisioning Threshold Crossed alarm.
n Verify that the target datastore is not in maintenance mode or entering maintenance mode.
Problem
Cause
The datastore is Storage I/O Control-enabled, but it cannot be fully controlled by Storage I/O Control
because of the external workload.
This condition can occur if the Storage I/O Control-enabled datastore is connected to a host that does not
support Storage I/O Control.
Solution
Ensure that all hosts that are connected to the datastore support Storage I/O Control.
VMware, Inc. 58
vSphere Troubleshooting
Problem
Cause
The array is shared with non-vSphere workloads, or the array is performing system tasks such as
replication.
Solution
There is no solution. vCenter Server does not reduce the total amount of I/O sent to the array, but
continues to enforce shares.
Problem
You are unable to view performance charts for a datastore on the Performance tab in the
vSphere Web Client.
Cause
Solution
4 Click OK.
Problem
VMware, Inc. 59
vSphere Troubleshooting
Cause
The following reasons might prevent you from enabling Storage I/O Control on a datastore.
n At least one host that is connected to the datastore is not running ESX/ESXi 4.1 or later.
n You do not have the appropriate license to enable Storage I/O Control.
Solution
n Verify that the hosts connected to the datastore are ESX/ESXi 4.1 or later.
n Verify that you have the appropriate license to enable Storage I/O Control.
VMware, Inc. 60
Troubleshooting Storage 7
The storage troubleshooting topics provide solutions to potential problems that you might encounter when
using vSphere in different storage environments that include SAN, vSAN, or Virtual Volumes.
VMware, Inc. 61
vSphere Troubleshooting
Check cable connectivity. If you do not see a port, the problem could be cable connectivity. Check the cables first. Ensure
that cables are connected to the ports and a link light indicates that the connection is good. If
each end of the cable does not show a good link light, replace the cable.
Check zoning. Zoning limits access to specific storage devices, increases security, and decreases traffic over
the network. Some storage vendors allow only single-initiator zones. In that case, an HBA can
be in multiple zones to only one target. Other vendors allow multiple-initiator zones. See your
storage vendor’s documentation for zoning requirements. Use the SAN switch software to
configure and manage zoning.
Check access control n The MASK_PATH plug-in allows you to prevent your host from accessing a specific storage
configuration. array or specific LUNs on a storage array. If your host is detecting devices and paths that
you do not want the host to access, path masking could have been set up incorrectly.
n For booting from a SAN, ensure that each host sees only required LUNs. Do not allow any
host to see any boot LUN other than its own. Use storage system software to make sure
that the host can see only the LUNs that it is supposed to see.
n Ensure that the Disk.MaxLUN parameter allows you to view the LUN you expect to see.
For information on the parameter, see the vSphere Storage documentation.
Check storage processor If a disk array has more than one storage processor (SP), make sure that the SAN switch has a
setup. connection to the SP that owns the LUNs you want to access. On some disk arrays, only one
SP is active and the other SP is passive until there is a failure. If you are connected to the
wrong SP (the one with the passive path), you might see the LUNs but get errors when trying
to access them.
Rescan your HBA. Perform a rescan each time you complete the following tasks:
n Create new LUNs on a SAN.
n Change the path masking configuration on the host.
n Reconnect a cable.
n Make a change to a host in a cluster.
For information, see the vSphere Storage documentation.
Check cable connectivity. If you do not see a port, the problem could be cable connectivity or routing. Check the cables
first. Ensure that cables are connected to the ports and a link light indicates that the connection
is good. If each end of the cable does not show a good link light, replace the cable.
Check routing settings. Controls connectivity between different subnets on your Ethernet configuration. If your ESXi
system and iSCSI storage are not on the same subnet, ensure that appropriate routing exists
between the subnets. Also, ensure that the subnet mask and gateway address are set correctly
on the iSCSI storage and the iSCSI initiator in the ESXi host.
VMware, Inc. 62
vSphere Troubleshooting
Check access control If the expected LUNs do not appear after rescan, access control might not be configured
configuration. correctly on the storage system side:
n If CHAP is configured, ensure that it is enabled on the ESXi host and matches the storage
system setup.
n If IP-based filtering is used, ensure that the iSCSI HBA or the VMkernel port group IP
address is allowed.
n If you are using initiator name-based filtering, ensure that the name is a qualified iSCSI
name and matches the storage system setup.
n For booting from a SAN, ensure that each host sees only required LUNs. Do not allow any
host to see any boot LUN other than its own. Use storage system software to make sure
that the host can see only the LUNs that it is supposed to see.
n Ensure that theDisk.MaxLUN setting allows you to view the LUN you expect to see. For
information, see the vSphere Storage documentation.
Check storage processor If a storage system has more than one storage processor, make sure that the SAN switch has
setup. a connection to the SP that owns the LUNs you want to access. On some storage systems,
only one SP is active and the other SP is passive until a failure occurs. If you are connected to
the wrong SP (the one with the passive path) you might not see the expected LUNs, or you
might see the LUNs but get errors when trying to access them.
For software and dependent The software iSCSI and dependent hardware adapters in ESXi require that VMkernel network
hardware iSCSI, check port have access to the iSCSI storage. The adapters use the VMkernel for data transfer
network configuration. between the ESXi system and the iSCSI storage.
Rescan your iSCSI initiator. Perform a rescan each time you complete the following tasks:
n Create new LUNs on a SAN.
n Change the LUN masking.
n Reconnect a cable.
n Make a change to a host in a cluster.
n Change CHAP settings or add new discovery addresses.
For information, see the vSphere Storage documentation.
To monitor storage performance in real time, use the resxtop and esxtop command-line utilities. For
more information, see the vSphere Monitoring and Performance documentation.
VMware, Inc. 63
vSphere Troubleshooting
Problem
Excessive SCSI reservations cause performance degradation and SCSI reservation conflicts.
Cause
n Creating a template
Note For storage devices that support the hardware acceleration, the hosts use the atomic test and set
(ATS) algorithm to lock the LUN. For more information on hardware acceleration, see the vSphere
Storage documentation.
Solution
n Serialize the operations of the shared LUNs, if possible, limit the number of operations on different
hosts that require SCSI reservation at the same time.
n Increase the number of LUNs and limit the number of hosts accessing the same LUN.
n Reduce the number of virtual machines per LUN. Follow recommendations in Configuration
Maximums.
n Make sure that you have the latest HBA firmware across all hosts.
For information about handling SCSI reservation conflicts on specific storage arrays, see the VMware
knowledge base article at http://kb.vmware.com/kb/1005009.
VMware, Inc. 64
vSphere Troubleshooting
Problem
Your host is unable to access a LUN, or access is very slow. The host's log files might indicate frequent
path state changes. For example:
Frequent path state changes are occurring for path vmhba2:C0:T0:L3. This may indicate a storage
problem. Affected device: naa.600600000000000000edd1. Affected datastores: ds1
Cause
The problem might be caused by path thrashing. Path thrashing might occur when two hosts access the
same LUN through different storage processors (SPs) and, as a result, the LUN is never available.
Path thrashing typically occurs on active-passive arrays. Path thrashing can also occur on a directly
connected array with HBA failover on one or more nodes. Active-active arrays or arrays that provide
transparent failover do not cause path thrashing.
Solution
1 Ensure that all hosts that share the same set of LUNs on the active-passive arrays use the same
storage processor.
2 Correct any cabling or masking inconsistencies between different hosts and SAN targets so that all
HBAs see the same targets.
3 Ensure that the claim rules defined on all hosts that share the LUNs are exactly the same.
4 Configure the path to use the Most Recently Used PSP, which is the default.
Problem
The host takes longer to complete I/O requests and virtual machines display unsatisfactory performance.
Cause
The problem might be caused by an inadequate LUN queue depth. SCSI device drivers have a
configurable parameter called the LUN queue depth that determines how many commands to a given
LUN can be active at one time. If the host generates more commands to a LUN, the excess commands
are queued in the VMkernel.
VMware, Inc. 65
vSphere Troubleshooting
Solution
1 If the sum of active commands from all virtual machines consistently exceeds the LUN depth,
increase the queue depth.
The procedure that you use to increase the queue depth depends on the type of storage adapter the
host uses.
2 When multiple virtual machines are active on a LUN, change the Disk.SchedNumReqOutstanding
(DSNRO) parameter, so that it matches the queue depth value.
The maximum value refers to the queue depths reported for various paths to the LUN. When you lower
this value, it throttles the host's throughput and alleviates SAN contention concerns if multiple hosts are
overutilizing the storage and are filling its command queue.
To adjust the maximum queue depth parameter, use the vCLI commands.
In the procedure, --server=server_name specifies the target server. The specified target server
prompts you for a user name and password. Other connection options, such as a configuration file or
session file, are supported. For a list of connection options, see Getting Started with vSphere Command-
Line Interfaces.
Prerequisites
Install vCLI or deploy the vSphere Management Assistant (vMA) virtual machine. See Getting Started with
vSphere Command-Line Interfaces. For troubleshooting, run esxcli commands in the ESXi Shell.
Procedure
1 Verify which HBA module is currently loaded by entering the following command:
Option Description
qla QLogic
lpfc Emulex
bfa Brocade
VMware, Inc. 66
vSphere Troubleshooting
Use the following strings for the parameter and module options.
String Description
-p ql2xmaxqdepth=value QLogic
-m qla2xxx
-p lpfc0_lun_queue_depth=value Emulex
-m lpfc820
-p bfa_lun_queue_depth=value Brocade
-m bfa
Prerequisites
n Install vCLI or deploy the vSphere Management Assistant (vMA) virtual machine. See Getting Started
with vSphere Command-Line Interfaces. For troubleshooting, you can run esxcli commands in the
ESXi Shell.
n In the procedure, the --server=server_name connection option specifies the target server. Be
prepared to enter a user name and password when the target server prompts you. For a list of other
possible connection options, see Getting Started with vSphere Command-Line Interfaces.
Procedure
VMware, Inc. 67
vSphere Troubleshooting
The following output shows the queue depth for software iSCSI.
Caution Setting the queue depth to a value higher than the default can decrease the total number of
LUNs supported.
Change this parameter only when you have multiple virtual machines active on a LUN. The parameter
does not apply when only one virtual machine is active. In that case, the bandwidth is controlled by the
queue depth of the storage adapter.
Procedure
1 Enter the following command to display the current DSNRO setting for the specified device:
Problem
Certain guest operating systems or applications run in virtual machines with RDMs display unpredictable
behavior.
VMware, Inc. 68
vSphere Troubleshooting
Cause
This behavior might be caused by cached SCSI INQUIRY data that interferes with specific guest
operating systems and applications.
When the ESXi host first connects to a target storage device on a SAN, it issues the SCSI INQUIRY
command to obtain basic identification data from the device. By default, ESXi caches the received SCSI
INQUIRY data (Standard, page 80, and page 83) and the data remains unchanged afterwards.
Solution
u Configure the virtual machine with RDM to ignore the SCSI INQUIRY cache by adding the following
parameter to the .vmx file.
scsix:y.ignoreDeviceInquiryCache = "true"
where x is the SCSI controller number and y is the SCSI target number of the RDM.
Enable this parameter only when your storage vendor recommends that you do so. This parameter is
required for just a limited number of storage arrays and only for specific guest operating systems.
Problem
After your ESXi host's first boot, the software iSCSI adapter is enabled and appears in the
vSphere Web Client on the list of storage adapters.
Cause
The iBFT-enabled network adapter on your host causes the software iSCSI to be always present. This
condition occurs even when you do not use iBFT for the iSCSI boot.
Solution
If you do not use the iBFT-enabled network adapter for the iSCSI boot and do not want the software
iSCSI adapter to be enabled, remove the iBFT configuration from the network adapter. Because this
process is vendor-specific, consult your vendor documentation for details.
Problem
The use of non-ASCII characters for directory and filenames on NFS storage might cause unpredictable
behavior. For example, you might fail to mount an NFS datastore or not be able to power on a virtual
machine.
VMware, Inc. 69
vSphere Troubleshooting
Cause
ESXi supports the use of non-ASCII characters for directory and filenames on NFS storage, so you can
create datastores and virtual machines using names in international languages. However, when the
underlying NFS server does not offer internationalization support, unpredictable failures might occur.
Solution
Always make sure that the underlying NFS server offers internationalization support. If the server does
not, use only ASCII characters.
Problem
Solution
Use the esxcli storage san commands to obtain and display events and statistics for the adapters.
You can analyze the commands' output to identify the adapter problems and to find appropriate solutions.
esxcli storage san fc Retrieve events for Fibre Channel adapters. -- adapter | -A
events get Adapter name (vmhbaX), or
none, to list information for
all Fibre Channel adapters
on the system.
VMware, Inc. 70
vSphere Troubleshooting
Problem
You can check metadata consistency when you experience problems with a VMFS datastore or a virtual
flash resource. For example, perform a metadata check if one of the following occurs:
n You see metadata errors in the vmkernel.log file similar to the following:
n You see corruption being reported for a datastore in events tabs of vCenter Server.
Solution
To check metadata consistency, run VOMA from the CLI of an ESXi host. VOMA can be used to check
and fix minor inconsistency issues for a VMFS datastore or a virtual flash resource. To resolve errors
reported by VOMA, consult VMware Support.
n Make sure that the VMFS datastore you analyze does not span multiple extents. You can run VOMA
only against a single-extent datastore.
n Power off any virtual machines that are running or migrate them to a different datastore.
The following example demonstrates how to use VOMA to check VMFS metadata consistency.
1 Obtain the name and partition number of the device that backs the VMFS datastore that you want to
check.
The Device Name and Partition columns in the output identify the device. For example:
Provide the absolute path to the device partition that backs the VMFS datastore, and provide a
partition number with the device name. For example:
VMware, Inc. 71
vSphere Troubleshooting
The output lists possible errors. For example, the following output indicates that the heartbeat
address is invalid.
XXXXXXXXXXXXXXXXXXXXXXX
Phase 2: Checking VMFS heartbeat region
ON-DISK ERROR: Invalid HB address
Phase 3: Checking all file descriptors.
Phase 4: Checking pathname and connectivity.
Phase 5: Checking resource reference counts.
Command options that the VOMA tool takes include the following.
-d|--device Device or disk to be inspected. Make sure to provide the absolute path to the device
partition backing the VMFS datastore. For example, /vmfs/devices/disks/naa.
00000000000000000000000000:1.
For more details, see the VMware Knowledge Base article 2036767.
VMware, Inc. 72
vSphere Troubleshooting
Problem
Typically, when a storage path experiences problems, an ESXi host sends the Test Unit Ready (TUR)
command to confirm that the path is down before initiating a path failover. However, if the TUR command
is unsuccessful and repeatedly returns a retry operation request
(VMK_STORAGE_RETRY_OPERATION), the host continues to retry the command without triggering the
failover. Usually, the following errors cause the host to retry the TUR command:
n SCSI_HOST_BUS_BUSY 0x02
n SCSI_HOST_SOFT_ERROR 0x0b
n SCSI_HOST_RETRY 0x0c
Cause
To resolve this issue, you can use the enable|disable_action_OnRetryErrors parameter. When you
enable this parameter, the ESXi host can mark the problematic path as dead. After marking the path as
dead, the host can trigger the failover and use an alternative working path.
Solution
Action Command
Enable the ability to mark a # esxcli storage nmp satp generic deviceconfig set -c
problematic path as dead enable_action_OnRetryErrors -d naa.XXX
Disable the ability to mark a # esxcli storage nmp satp generic deviceconfig set -c
problematic path as dead disable_action_OnRetryErrors -d naa.XXX
The following example output indicates that the parameter has been enabled:
naa.XXX
Device Display Name: DGC Fibre Channel Disk (naa.XXX)
Storage Array Type: VMW_SATP_CX Storage Array Type Device
Config: {navireg ipfilter action_OnRetryErrors}
You can also set this parameter when configuring an SATP claim rule:
VMware, Inc. 73
vSphere Troubleshooting
The troubleshooting topics can help you avoid potential problems and provide solutions for issues that
you might encounter when configuring flash drives.
Problem
When you attempt to configure either vSAN or virtual flash resource, a local flash disk does not appear on
the list of disks to be used.
Cause
This problem might occur when flash disk intended for use with either feature has been already formatted
with VMFS. Virtual flash and vSAN cannot share the flash disk with VMFS or any other file system.
Also, because virtual flash and vSAN are mutually exclusive consumers of flash disks, both features
cannot share a flash disk. If the flash disk is already claimed by one feature, for example vSAN, you
cannot use it for another, such as virtual flash, unless you release the disk.
Solution
Use only unformatted flash disks for virtual flash resource and vSAN configuration.
n Avoid formatting the flash disks with VMFS during ESXi installation or Auto Deploy.
n If the flash disk is already formatted with VMFS, remove the VMFS datastore. For information, see
the vSphere Storage documentation.
n To use the flash disk as a virtual flash resource, do not claim this disk for vSAN. If the disk is claimed
by vSAN, remove the disk from vSAN. The flash disk is released from vSAN and becomes available
on the list of disks to use with virtual flash. For information about removing disks from vSAN, see the
Administering VMware vSAN documentation.
n If you intend to use the flash disk with vSAN, do not use the disk for a virtual flash resource. If the
flash disk is used as the virtual flash resource, remove the virtual flash configuration. The disk
becomes available for vSAN. See the vSphere Storage documentation.
Another reason that makes flash disk unavailable is when ESXi cannot detect the disk. See Local Flash
Disks Are Undetectable.
VMware, Inc. 74
vSphere Troubleshooting
Problem
By default, auto-partitioning deploys VMFS file systems on any unused local storage disks on your host,
including flash disks.
However, a flash disk formatted with VMFS becomes unavailable for such features as virtual flash and
vSAN. Both features require an unformatted flash disk and neither can share the disk with any other file
system.
Solution
To ensure that auto-partitioning does not format the flash disk with VMFS, use the following boot options
when you install ESXi or boot the ESXi host for the first time:
n autoPartition=TRUE
n skipPartitioningSsds=TRUE
1 In the vSphere Web Client, select the host to use as a reference host and click the Configure tab.
2 Click System to open the system options, and click Advanced System Settings.
If flash disks that you plan to use with Flash Read Cache and vSAN already have VMFS datastores,
remove the datastores.
Problem
ESXi might not be able to detect flash disks, or recognize them as local. This problem can occur when
you configure entities that require only local flash disks, for example, virtual flash resource or vSAN.
Cause
ESXi does not recognize certain devices as flash disks when their vendors do not support automatic flash
disk detection. In other cases, some flash disks might not be detected as local, and ESXi marks them as
remote. When the host does not recognize the disks as the local flash disks, it excludes them from the list
of disks available for configuration.
Solution
n If ESXi does not automatically recognize its disks as flash disks, tag them as flash disk disks.
n If ESXi does not detect flash disks as local, manually set them as local.
VMware, Inc. 75
vSphere Troubleshooting
ESXi does not recognize certain devices as flash when their vendors do not support automatic flash disk
detection. The Drive Type column for the devices shows HDD as their type.
Caution Marking the HDD devices as flash might deteriorate the performance of datastores and
services that use them. Mark the devices only if you are certain that they are flash devices.
Prerequisites
Procedure
4 From the list of storage devices, select one or several HDD devices to mark as flash devices and click
the Mark as Flash Disks ( ) icon.
What to do next
If the flash device that you mark is shared among multiple hosts, make sure that you mark the device
from all hosts that share the device.
Prerequisites
n Power off virtual machines that reside on the device and unmount an associated datastore.
Procedure
4 From the list of storage devices, select one or several remote devices to mark as local and click the
All Actions icon.
VMware, Inc. 76
vSphere Troubleshooting
For information about the Virtual Volumes functionality, see the vSphere Storagepublication.
esxcli storage vvol daemon unbindall Unbind all virtual volumes from all
VASA providers known to the ESXi
host.
esxcli storage vvol list List all protocol endpoints that your
protocolendpoint host can access.
esxcli storage vvol vasacontext get Show the VASA context (VC
UUID) associated with the host.
esxcli storage vvol vasaprovider list List all storage (VASA) providers
associated with the host.
Problem
The vSphere Web Client shows the datastore as inaccessible. You cannot use the datastore for virtual
machine provisioning.
VMware, Inc. 77
vSphere Troubleshooting
Cause
This problem might occur when you fail to configure protocol endpoints for the SCSI-based storage
container that is mapped to the virtual datastore. Like traditional LUNs, SCSI protocol endpoints need to
be configured so that an ESXi host can detect them.
Solution
Before creating virtual datastores for SCSI-based containers, make sure to configure protocol endpoints
on the storage side.
Problem
An OVF template or a VM being migrated from a nonvirtual datastore might include additional large files,
such as ISO disk images, DVD images, and image files. If these additional files cause the configuration
virtual volume to exceed its 4-GB limit, migration or deployment to a virtual datastore fails.
Cause
The configuration virtual volume, or config-VVol, contains various VM-related files. On traditional
nonvirtual datastores, these files are stored in the VM home directory. Similar to the VM home directory,
the config-VVol typically includes the VM configuration file, virtual disk and snapshot descriptor files, log
files, lock files, and so on.
On virtual datastores, all other large-sized files, such as virtual disks, memory snapshots, swap, and
digest, are stored as separate virtual volumes.
Config-VVols are created as 4-GB virtual volumes. Generic content of the config-VVol usually consumes
only a fraction of this 4-GB allocation, so config-VVols are typically thin-provisioned to conserve backing
space. Any additional large files, such as ISO disk images, DVD images, and image files, might cause the
config-VVol to exceed its 4-GB limit. If such files are included in an OVF template, deployment of the VM
OVF to vSphere Virtual Volumes storage fails. If these files are part of an existing VM, migration of that
VM from a traditional datastore to vSphere Virtual Volumes storage also fails.
Solution
n For VM migration. Before migrating a VM from a traditional datastore to a virtual datastore, remove
excess content from the VM home directory to keep the config-VVol under the 4-GB limit.
n For OVF deployment. Because you cannot deploy an OVF template that contains excess files directly
to a virtual datastore, first deploy the VM to a nonvirtual datastore. Remove any excess content from
the VM home directory, and migrate the resulting VM to vSphere Virtual Volumes storage.
VMware, Inc. 78
vSphere Troubleshooting
Problem
The following problems occur when you migrate a version 10 or earlier VM with memory snapshots:
n Migration of a version 10 or earlier VM with memory snapshots to a virtual datastore is not supported
and causes a failure.
n Migration of a version 10 or earlier VM with memory snapshots from a virtual datastore to a nonvirtual
datastore, such as VMFS, can succeed. If you later make additional snapshots and attempt to
migrate this VM back to vSphere Virtual Volumes storage, your attempt fails.
Cause
vSphere Virtual Volumes storage does not require that you use a particular hardware version for your
virtual machines. Typically, you can move a virtual machine with any hardware version to vSphere Virtual
Volumes storage. However, if you have a VM with memory snapshots, and plan to migrate this VM
between a virtual datastore and a nonvirtual datastore, use the VM of hardware version 11.
Non-VVols virtual machines of hardware version 11 or later use separate files to store their memory
snapshots. This usage is consistent with VMs on vSphere Virtual Volumes storage, where memory
snapshots are created as separate VVols instead of being stored as part of a .vmsn file in the VM home
directory. In contrast, non-VVols VMs with hardware version 10 continue to store their memory snapshots
as part of the .vmsn file in the VM home directory. As a result, you might experience problems or failures
when attempting to migrate these VMs between virtual and nonvirtual datastores.
Solution
To avoid problems when migrating VMs with memory snapshots across virtual and nonvirtual datastores,
use hardware version 11. Follow these guidelines when migrating version 10 or earlier VMs with memory
snapshots:
n Migrating a version 10 or earlier VM with memory snapshots to a virtual datastore is not supported.
The only workaround is to remove all snapshots. Upgrading the hardware version does not solve this
problem.
n Migrating a version 10 or earlier VM with memory snapshots from a virtual datastore to a nonvirtual
datastore, such as VMFS, can succeed. However, the migration might put the VM in an inconsistent
state. The snapshots that were taken on the virtual datastore use the vmem object. Any memory
snapshots taken after migrating to VMFS are stored in the .vmsn file. If you later attempt to migrate
this VM back to vSphere Virtual Volumes storage, your attempt fails. As with the previous case,
remove all snapshots to work around this problem.
VMware, Inc. 79
vSphere Troubleshooting
For information about I/O filters, see the see the vSphere Storagepublication.
If an I/O filter installation fails on a host, the system generates events that report the failure. In addition,
an alarm on the host shows the reason for the failure. Examples of failures include the following:
n The VIB requires the host to be in maintenance mode for an upgrade or uninstallation.
n The VIB requires the host to reboot after the installation or uninstallation.
n Attempts to put the host in maintenance mode fail because the virtual machine cannot be evacuated
from the host.
vCenter Server can resolve some failures. You might have to intervene for other failures. For example,
you might need to edit the VIB URL, manually evacuate or power off virtual machines, or manually install
or uninstall VIBs.
When you specify a target server by using --server=server_name, the server prompts you for a user
name and password. Other connection options, such as a configuration file or session file, are supported.
For a list of connection options, see Getting Started with vSphere Command-Line Interfaces, or run
esxcli --help at the vCLI command prompt.
Prerequisites
Install vCLI or deploy the vSphere Management Assistant (vMA) virtual machine. See Getting Started with
vSphere Command-Line Interfaces. For troubleshooting, run esxcli commands in the ESXi Shell.
VMware, Inc. 80
vSphere Troubleshooting
Procedure
Options for the install command allow you to perform a dry run, specify a specific VIB, bypass
acceptance-level verification, and so on. Do not bypass verification on production systems. See the
vSphere Command-Line Interface Reference documentation.
VMware, Inc. 81
Troubleshooting Networking 8
The troubleshooting topics about networking in vSphere provide solutions to potential problems that you
might encounter with the connectivity of ESXi hosts, vCenter Server and virtual machines.
n Hosts on a vSphere Distributed Switch 5.1 and Later Lose Connectivity to vCenter Server
n Hosts on vSphere Distributed Switch 5.0 and Earlier Lose Connectivity to vCenter Server
n Virtual Machines Lose Connectivity After Changing the Uplink Failover Order of a Distributed Port
Group
n Unable to Add a Physical Adapter to a vSphere Distributed Switch That Has Network I/O Control
Enabled
n A Virtual Machine that Runs a VPN Client Causes Denial of Service for Virtual Machines on the Host
or Across a vSphere HA Cluster
n Virtual Machines on the Same Distributed Port Group and on Different Hosts Cannot Communicate
with Each Other
n Attempt to Power On a Migrated vApp Fails Because the Associated Protocol Profile Is Missing
n Networking Configuration Operation Is Rolled Back and a Host Is Disconnected from vCenter Server
VMware, Inc. 82
vSphere Troubleshooting
Problem
The MAC addresses of virtual machines on the same broadcast domain or IP subnet are in conflict, or
vCenter Server generates a duplicate MAC address for a newly created virtual machine.
A virtual machine powers on and functions properly, but shares a MAC address with another virtual
machine. This situation might cause packet loss and other problems.
Cause
Virtual machines might have duplicate MAC addresses due to several reasons.
n Two vCenter Server instances with identical IDs generate overlapping MAC addresses for virtual
machine network adapters.
Each vCenter Server instance has an ID between 0 and 63 that is randomly generated at installation
time, but can be reconfigured after installation. vCenter Server uses the instance ID to generate MAC
addresses for the network adapters of the machine.
n A virtual machine has been transferred in power-off state from one vCenter Server instance to
another in the same network, for example, by using shared storage, and a new virtual machine
network adapter on the first vCenter Server receives the freed MAC address.
Solution
If you have an existing virtual machine with a conflicting MAC address, you must provide a unique
MAC address in the Virtual Hardware settings.
n Power off the virtual machine, configure the adapter to use a manual MAC address, and type the
new address.
n If you cannot power the virtual machine off for configuration, re-create the network adapter that is
in conflict with enabled manual MAC address assignment and type the new address. In the guest
operating system, set the same static IP address to the re-added adapter as before.
For information about configuring the network adapters of virtual machines, see the vSphere
Networking and vSphere Virtual Machine Administration documentation.
VMware, Inc. 83
vSphere Troubleshooting
n If the vCenter Server instance generates the MAC addresses of virtual machines according to the
default allocation, VMware OUI, change the vCenter Server instance ID or use another allocation
method to resolve conflicts.
Note Changing the vCenter Server instance ID or switching to a different allocation scheme does
not resolve MAC address conflicts in existing virtual machines. Only virtual machines created or
network adapters added after the change receive addresses according to the new scheme.
For information about MAC address allocation schemes and setup, see the vSphere Networking
documentation.
Solution Description
Change the vCenter Server ID You can keep using the VMware OUI allocation scheme if your deployment
contains a small number of vCenter Server instances. According to this scheme,
a MAC address has the following format:
00:50:56:XX:YY:ZZ
Switch to prefix-based allocation You can use a custom OUI. For example, for a 02:12:34 locally administered
address range, MAC addresses have the form 02:12:34:XX:YY:ZZ. You can use
the fourth octet XX to distribute the OUI address space between the
vCenter Server instances. This structure results in 255 address clusters, each
cluster managed by a vCenter Server instance, and in about 65000 MAC
addresses per vCenter Server. For example, 02:12:34:01:YY:ZZ for vCenter
Server A, 02:12:34:02:YY:ZZ for vCenter Server B, and so on.
Prefix-based allocation is suitable for deployments of a larger scale.
For globally unique MAC addresses, the OUI must be registered in IEEE.
b Apply the new MAC address allocation scheme to an existing virtual machine in its Virtual
Hardware settings.
n Power off a virtual machine, configure the adapter to use a manual MAC address, revert to
automatic MAC address allocation, and power on the virtual machine.
n If the virtual machine is in production and you cannot power it off for configuration, after you
change the vCenter Server ID or the address allocation scheme, re-create the network
adapter in conflict with enabled automatic MAC address assignment. In the guest operating
system, set the same static IP address to the re-added adapter as before.
VMware, Inc. 84
vSphere Troubleshooting
n Enforce MAC address regeneration when transferring a virtual machine between vCenter Server
instances by using the virtual machine files from a datastore.
a Power off a virtual machine, remove it from the inventory, and in its configuration file (.vmx), set
the ethernetX.addressType parameter to generated.
X next to ethernet stands for the sequence number of the virtual NIC in the virtual machine.
b Import the virtual machine from one vCenter Server system to another by registering the virtual
machine from a datastore in the target vCenter Server.
The virtual machine files can reside in a datastore that is shared between the two vCenter Server
instances or can be uploaded to a datastore that is accessible only from the target
vCenter Server system.
For information about registering a virtual machine from a datastore, see vSphere Virtual Machine
Administration.
While the virtual machine is starting up, an information icon appears on the virtual machine in the
vSphere Web Client.
d Right-click the virtual machine and select Guest OS > Answer Question.
The target vCenter Server re-generates the MAC address of the virtual machine. The new MAC
address starts with the VMware OUI 00:0c:29 and is based on the BIOS UUID of the virtual
machine. The BIOS UUID of the virtual machine is calculated from the BIOS UUID of the host.
n If the vCenter Server and hosts are version 6.0 and later and the vCenter Server instances are
connected in Enhanced Linked Mode, migrate virtual machines by using vMotion across
vCenter Server systems.
When a virtual machine is migrated across vCenter Server systems, the source vCenter Server adds
the MAC address of the virtual machine to a blacklist and does not assign them to other virtual
machines.
VMware, Inc. 85
vSphere Troubleshooting
Problem
In the vSphere Web Client, after you assign a MAC address within the range 00:50:56:40:YY:ZZ –
00:50:56:7F:YY:ZZ to a virtual machine, attempts to power the virtual machine on fail with a status
message that the MAC address is in conflict.
Cause
You attempt to assign a MAC address which starts with the VMware OUI 00:50:56 and is within the
address range allocated for host VMkernel adapters on the vCenter Server system.
Solution
If you want to preserve the VMware OUI prefix, set a static MAC address within the range
00:50:56:00:00:00 – 00:50:56:3F:FF:FF. Otherwise, set an arbitrary MAC address whose prefix is
different from the VMware OUI one. For information about the ranges available for static MAC addresses
that have the VMware OUI prefix, see the vSphere Networking documentation.
Problem
After you upgrade a vSphere distributed switch to version 5.5 and later, when you initiate the conversion
to the enhanced LACP support from an existing LACP configuration, the conversion fails at a certain
stage of the process.
Cause
The conversion from an existing LACP configuration to the enhanced LACP support includes several
tasks for reconfiguring the distributed switch. The conversion might fail because another user might have
reconfigured the distributed switch during the conversion. For example, physical NICs from the hosts
might have been reassigned to different uplinks or the teaming and failover configuration of the distributed
port groups might have been changed.
Another reason for the failure might be that some of the hosts have disconnected during the conversion.
Solution
When the conversion to the enhanced LACP support fails on a certain stage, it is completed only partially.
You must check the configuration of the distributed switch and the participating hosts to identify the
objects with incomplete LACP configuration.
Check the target configuration that must result from each conversion stage in the order that is listed in the
table. When you locate the stage where the conversion has failed, complete its target configuration
manually and continue with the stages that follow.
VMware, Inc. 86
vSphere Troubleshooting
Table 8‑1. Steps to Complete the Conversion to the Enhanced LACP Manually
Conversion Stage Target Configuration State Solution
1. Create a new LAG. A newly created LAG must be Check the LACP configuration of the distributed switch and
present on the distributed switch. create a new LAG if there is none.
2. Create a an intermediate The newly created LAG must be Check the teaming and failover configuration of the
LACP teaming and failover standby that lets you migrate distributed port group. Set the new LAG as standby if it is
configuration on the physical NICs to the LAG without not.
distributed port groups. losing connectivity. If you do not want to use a LAG to handle the traffic for all
distributed port groups, revert the teaming and failover
configuration to a state where standalone uplinks are active
and the LAG is unused .
3. Reassign physical NICs All physical NICs from the LAG Check whether physical NICs are assigned to the LAG ports.
from standalone uplinks to ports must be reassigned from Assign a physical NIC to every LAG port.
LAG ports. standalone uplinks to the LAG
Note The LAG must remain standby in the teaming and
ports
failover order of the distributed port groups while you
reassign physical NICs to the LAG ports.
4. Create the final LACP The final LACP teaming and Check the teaming and failover configuration of the
teaming and failover failover configuration is the distributed port group. Create a valid LACP teaming and
configuration on the following. failover configuration for all distributed port groups for which
distributed port groups. n Active: only the new LAG you want to apply LACP.
n Standby: empty
n Unused: all standalone
uplinks
For example, suppose you verify that a new LAG has been created on the distributed switch and that an
intermediate teaming and failover configuration has been created for the distributed port groups. You
continue with checking whether there are physical NICs assigned to the LAG ports. You find out that not
all hosts have physical NICs assigned to the LAG ports, and you assign the NICs manually. You complete
the conversion by creating the final LACP teaming and failover configuration for the distributed port
groups.
Problem
n Attempts to remove a host from a vSphere distributed switch fail, and you receive a notification that
resources are still in use. The notification that you receive might look like the following:
VMware, Inc. 87
vSphere Troubleshooting
n Attempts to remove a host proxy switch that still exists on the host from a previous networking
configuration fail. For example, you moved the host to a different data center or vCenter Server
system, or upgraded the ESXi and vCenter Server software, and created new networking
configuration. When trying to remove the host proxy switch, the operation fails because resources on
the proxy switch are still in use.
Cause
You cannot remove the host from the distributed switch or delete the host proxy switch because of the
following reasons.
Solution
Problem Solution
Cannot remove a 1 In the vSphere Web Client, navigate to the distributed switch.
host from a 2 On the Configure tab, select More > Ports.
distributed switch 3 Locate all ports that are still in use and check which VMkernel or virtual machine network adapters on
the host are still attached to the ports.
4 Migrate or delete the VMkernel and virtual machine network adapters that are still connected to the
switch.
5 Use the Add and Manage Hosts wizard in the vSphere Web Client to remove the host from the
switch.
After the host is removed, the host proxy switch is deleted automatically.
Problem
After you change the networking configuration of a port group on a vSphere Distributed Switch 5.1 and
later that contains the VMkernel adapters for the management network, the hosts on the switch lose
connectivity to vCenter Server. In the vSphere Web Client the status of the hosts is nonresponsive.
VMware, Inc. 88
vSphere Troubleshooting
Cause
On a vSphere Distributed Switch 5.1 and later in vCenter Server that has networking rollback disabled,
the port group containing the VMkernel adapters for the management network is misconfigured in
vCenter Server and the invalid configuration is propagated to the hosts on the switch.
Note In vSphere 5.1 and later, networking rollback is enabled by default. However, you can enable or
disable rollbacks at the vCenter Server level. For more information see the vSphere Networking
documentation.
Solution
1 From the Direct Console User Interface (DCUI) to an affected host, use the Restore vDS option from
the Network Restore Options menu to configure the uplinks and the ID of the VLAN for the
management network.
The DCUI creates a local ephemeral port and applies the VLAN and uplink configuration to the port.
The DCUI changes the VMkernel adapter for the management network to use the new host local port
to restore connectivity to vCenter Server.
After the host re-connects to vCenter Server, the vSphere Web Client displays a warning that some
hosts on the switch have different networking configuration from the configuration stored in vSphere
distributed switch.
2 In the vSphere Web Client, configure the distributed port group for the management network with
correct settings.
Situation Solution
You have altered the port group You can roll the configuration of the port group back one step. Right-click the port
configuration only once group, click Restore Configuration, and select Restore to previous
configuration.
You have backed up a valid You can restore the configuration of the port group by using the backup file. Right-
configuration of the port group click the port group, click Restore Configuration, and select Restore
configuration from a file.
You can also restore the configuration for the entire switch, including the port
group, from a backup file for the switch.
You have performed more than one You must provide valid settings for the port group manually.
configuration step and you do not have
a backup file
For information about networking rollback, recovery, and restore, see the vSphere Networking
documentation.
3 Migrate the VMkernel adapter for the management network from the host local ephemeral port to a
distributed port on the switch by using the Add and Manage Hosts wizard.
Unlike distributed ports, the ephemeral local port of the VMKernel has a non-numeric ID.
For information about handling VMkernel adapters through the Add and Manage Hosts wizard, see
the vSphere Networking documentation.
VMware, Inc. 89
vSphere Troubleshooting
4 Apply the configuration of the distributed port group and VMkernel adapter from vCenter Server to the
host.
n Push the correct configuration of the distributed port group and VMkernel adapter from
vCenter Server to the host.
c From the Virtual switches list, select the distributed switch and click Rectify the state of the
selected distributed switch on the host.
n Wait until vCenter Server applies the settings within the next 24 hours.
Problem
After you change the networking configuration of a port group on a vSphere Distributed Switch 5.0 or
earlier that contains the VMkernel adapters for the management network, the hosts on the switch lose
connectivity to vCenter Server. In the vSphere Web Client the status of the hosts is nonresponsive.
Cause
On a vSphere Distributed Switch 5.0 and earlier in vCenter Server, the port group containing the
VMkernel adapters for the management network is misconfigured in vCenter Server and the invalid
configuration is propagated to the hosts on the switch.
Solution
3 In the vSphere Standard Switch view, create a new standard switch if the host does not have a
standard switch suitable for the management network.
b In the Add Network wizard, under Connection Types select Virtual Machine, and click Next.
d Under the Create a vSphere standard switch section, select one or more unoccupied physical
adapters on the host to carry the management traffic and click Next.
If all physical adapters are already busy with traffic from other switches, create the switch without
a physical network adapter connected. Later, remove the physical adapter for the management
network from the proxy switch of the distributed switch and add it to this standard switch.
VMware, Inc. 90
vSphere Troubleshooting
e In the Port Group Properties section, type a network label that identifies the port group that you
are creating and optionally a VLAN ID.
f Click Finish.
4 In the vSphere Distributed Switch view, migrate the VMkernel adapter for the network to a standard
switch.
a Select the vSphere Distributed Switch view, and for the distributed switch, click Manage Virtual
Adapters.
b In the Manage Virtual Adapters wizard, select the VMkernel adapter from the list and click
Migrate.
c Select the newly created or another standard switch to migrate the adapter to, and click Next.
d Enter a network label that is unique in the scope of the host and optionally a VLAN ID for the
management network, and click Next.
e Review the settings on the target standard switch and click Finish.
5 In the vSphere Web Client, configure the distributed port group for the management network with
correct settings.
6 Migrate the VMkernel adapter for the management network from the standard switch to a port on the
distributed switch by using the Add and Manage Hosts wizard.
For information about the Add and Manage Hosts wizard, see the vSphere Networking
documentation.
7 If you have moved the physical adapter from the proxy switch to the standard switch, you can
reattach it to the distributed switch again by using the Add and Manage Hosts wizard.
Problem
No redundant physical NICs for a host are connected to a particular standard or a distributed switch, and
the following alarm appears:
Cause
Only one physical NIC on the host is connected to a certain standard or a distributed switch. The
redundant physical NICs are either down or are not assigned to the switch.
For example, assume that a host in your environment has physical NICs vmnic0 and vmnic1 connected to
vSwitch0, and the physical NIC vmnic1 goes offline, leaving only vmnic0 connected to vSwitch0. As a
result, the uplink redundancy for vSwitch0 is lost on the host.
VMware, Inc. 91
vSphere Troubleshooting
Solution
Check which switch has lost uplink redundancy on the host. Connect at least one more physical NIC on
the host to this switch and reset the alarm to green. You can use the vSphere Web Client or the
ESXi Shell.
If a physical NIC is down, try to bring it back up by using the ESXi Shell on the host.
For information about using the networking commands in the ESXi Shell, see vSphere Command-Line
Interface Reference. For information about configuring networking on a host in the vSphere Web Client,
see vSphere Networking.
Problem
After you rearrange the uplinks in the failover groups for a distributed port group in vCenter Server, for
example, by using the vSphere Web Client, some virtual machines in the port group can no longer access
the external network.
Cause
After changing the failover order, many reasons might cause virtual machines to lose connectivity to the
external network.
n The host that runs the virtual machines does not have physical NICs associated with the uplinks that
are set to active or standby. All uplinks that are associated with physical NICs from the host for the
port group are moved to unused.
n A Link Aggregation Group (LAG) that has no physical NICs from the host is set as the only active
uplink according to the requirements for using LACP in vSphere.
n If the virtual machine traffic is separated in VLANs, the host physical adapters for the active uplinks
might be connected to trunk ports on the physical switch that do not handle traffic from these VLANs.
n If the port group is configured with IP hash load balancing policy, an active uplink adapter is
connected to a physical switch port that might not be in an EtherChannel.
You can examine the connectivity of the virtual machines in the port group to associated host uplinks and
uplink adapters from the central topology diagram of the distributed switch or from the proxy switch
diagram for the host.
Solution
n Restore the failover order with the uplink that is associated with a single physical NIC on the host
back to active.
VMware, Inc. 92
vSphere Troubleshooting
n Create a port group with identical settings, make it use the valid uplink number for the host, and
migrate the virtual machine networking to the port group.
n Move the NIC to an uplink that participates in the active failover group.
You can use the vSphere Web Client to move the host physical NIC to another uplink.
n Use the Add and Manage Hosts wizard on the distributed switch.
c On the Select task page, select the Manage host networking option and select the host.
d To assign the NIC of the host to an active uplink, navigate to the Manage physical network
adapters page and associate the NIC to the switch uplink.
a Navigate to the host in the vSphere Web Client, and on the Configure tab, expand the
Networking menu.
c Click Manage the physical network adapters connected to the selected switch, and
move the NIC to the active uplink
Problem
You try to add a physical adapter with low speed, for example, 1 Gbps, to a vSphere Distributed Switch
that is connected to physical adapters with high speed, for example, 10 Gbps. Network I/O Control
version 3 is enabled on the switch and bandwidth reservations exist for one or more system traffic types,
such as vSphere management traffic, vSphere vMotion traffic, vSphere NFS traffic, and so on. The task
for adding the physical adapter fails with a status message that a parameter is incorrect.
Cause
Network I/O Control aligns the bandwidth that is available for reservation to the 10-Gbps speed of the
individual physical adapters that are already connected to the distributed switch. After you reserve a part
of this bandwidth, adding a physical adapter whose speed is less than 10 Gbps might not meet the
potential needs of a system traffic type.
For information about Network I/O Control version 3, see the vSphere Networking documentation.
VMware, Inc. 93
vSphere Troubleshooting
Solution
4 Type the physical adapters that you want to use outside the scope of Network I/O Control as a
comma-separated list for the Net.IOControlPnicOptOut parameter.
6 In the vSphere Web Client, add the physical adapter to the distributed switch.
Problem
When you connect the network adapter of a virtual machine to an SR-IOV virtual function (VF), you create
a passthrough network adapter for the virtual machine. After the (VF) driver in the guest operating system
modifies the MAC address for the passthrough network adapter, the guest operating system shows that
the change is successful but the VM network adapter loses connectivity. Although the guest operating
system shows that the new MAC address is enabled, a log message in the /var/log/vmkernel.log file
indicates that the operation has failed.
Requested mac address change to new MAC address on port VM NIC port number, disallowed by vswitch
policy.
where
n new MAC address is the MAC address in the guest operation system.
n VM NIC port number is the port number of the VM network adapter in hexadecimal format.
Cause
The default security policy on the port group to which the passthrough network adapter is connected does
not allow changes in the MAC address in the guest operating system. As a result, the networking
interface in the guest operating system cannot acquire an IP address and loses connectivity.
VMware, Inc. 94
vSphere Troubleshooting
Solution
u In the guest operating system, reset the interface to cause the passthrough network adapter to regain
its valid MAC address. If the interface is configured to use DHCP for address assignment, the
interface acquires an IP address automatically.
For example, on a Linux virtual machine run the ifconfig console command.
where X in ethX represents the sequence number of the virtual machine network adapter in the guest
operating system.
Problem
A virtual machine that is expected to send BPDU frames causes the traffic to the external network of the
virtual machines in the same port group to be blocked.
If the virtual machine runs on a host that is a part of a vSphere HA cluster, and the host becomes
network-isolated under certain conditions, you observe Denial of Service (DoS) on the hosts in the
cluster.
Cause
As a best practice, a physical switch port that is connected to an ESXi host has the Port Fast and BPDU
guard enabled to enforce the boundary of the Spanning Tree Protocol (STP). A standard or distributed
switch does not support STP, and it does not send any BPDU frames to the switch port. However, if any
BPDU frame from a compromised virtual machine arrives at a physical switch port facing an ESXi host ,
the BPDU guard feature disables the port to stop the frames from affecting the Spanning Tree Topology of
the network.
In certain cases a virtual machine is expected to send BPDU frames, for example, when deploying VPN
that is connected through a Windows bridge device or through a bridge function. If the physical switch
port paired with the physical adapter that handles the traffic from this virtual machine has the BPDU guard
on, the port is error-disabled, and the virtual machines and VMkernel adapters using the host physical
adapter cannot communicate with the external network anymore.
VMware, Inc. 95
vSphere Troubleshooting
If the teaming and failover policy of the port group contains more active uplinks, the BPDU traffic is moved
to the adapter for the next active uplink. The new physical switch port becomes disabled, and more
workloads become unable to exchange packets with the network. Eventually, almost all entities on the
ESXi host might become unreachable.
If the virtual machine runs on a host that is a part of a vSphere HA cluster, and the host becomes
network-isolated because most of the physical switch ports connected to it are disabled, the active master
host in the cluster moves the BPDU sender virtual machine to another host. The virtual machine starts
disabling the physical switch ports connected to the new host. The migration across the vSphere HA
cluster eventually leads to accumulated DoS across the entire cluster.
Solution
n If the VPN software must continue its work on the virtual machine, allow the traffic out of the virtual
machine and configure the physical switch port individually to pass the BPDU frames.
Network
Device Configuration
Distributed or Set the Forged Transmit security property on the port group to Accept to allow BPDU frames to leave
standard switch the host and reach the physical switch port.
You can isolate the settings and the physical adapter for the VPN traffic by placing the virtual machine in
a separate port group and assigning the physical adapter to the group.
Caution Setting the Forged Transmit security property to Accept to enable a host to send BPDU
frames carries a security risk because a compromised virtual machine can perform spoofing attacks.
Note Do not enable the BPDU filter globally. If the BPDU filter is enabled globally, the Port Fast mode
becomes disabled and all physical switch ports perform the full set of STP functions.
n To deploy a bridge device between two virtual machine NICs connected to the same Layer 2 network,
allow the BPDU traffic out of the virtual machines and deactivate Port Fast and BPDU loop prevention
features.
Distributed or Set the Forged Transmit property of the security policy on the port groups to Accept to allow BPDU
standard switch frames to leave the host and reach the physical switch port.
You can isolate the settings and one or more physical adapters for the bridge traffic by placing the
virtual machine in a separate port group and assigning the physical adapters to the group.
Caution Setting the Forged Transmit security property to Accept to enable bridge deployment carries
a security risk because a compromised virtual machine can perform spoofing attacks.
Physical switch n Disable Port Fast on the ports to the virtual bridge device to run STP on them.
n Disable BPDU guard and filter on the ports facing the bridge device.
VMware, Inc. 96
vSphere Troubleshooting
n Protect the environment from DoS attacks in any case by activating the BPDU filter on the ESXi host
or on the physical switch.
n On a host running ESXi 4.1 Update 3, ESXi 5.0 Patch 04 and later 5.0 releases, and ESXi 5.1
Patch 01 and later, enable the Guest BPDU filter in one of the following ways and reboot the host:
n In the Advanced System Settings table on the Configure tab for the host in the
vSphere Web Client, set the Net.BlockGuestBPDU property to 1.
n On a host that does not have the Guest BPDU filter implemented enable the BPDU filter on the
physical switch port to the virtual bridge device.
Distributed or Set the Forged Transmit property of the security policy on the port group to Reject.
standard switch
Note Do not enable the BPDU filter globally. If the BPDU filter is enabled globally, the Port Fast
mode becomes disabled and all physical switch ports perform the full set of STP functions.
Problem
When a Windows virtual machine transmits UDP packets larger than 1024 bytes, you experience lower
than expected or oscillating throughput even when other traffic is negligible. In case of a video streaming
server, video playback pauses.
Cause
For every UDP packet larger than 1024 bytes, the Windows network stack waits for a transmit completion
interrupt before sending the next packet. Unlike for earlier releases, vSphere 5.1 and later releases do not
provide a transparent workaround of the situation.
VMware, Inc. 97
vSphere Troubleshooting
Solution
n Increase the threshold in bytes at which Windows changes its behavior for UDP packets by modifying
the registry of the Windows guest OS.
b Add a value with the name FastSendDatagramThreshold of type DWORD equal to 1500.
For information about fixing this issue in the Windows registry, see
http://support.microsoft.com/kb/235257.
If the Windows virtual machine has a VMXNET3 vNIC adapter, configure one of the following
parameters in the .vmx file of the virtual machine. Use the vSphere Web Client, or directly modify
the .vmx file.
Increase the interrupt rate of the virtual machine to a higher rate than ethernetX.coalescingScheme rbc
expected packet rate. For example, if the expected packet rate is 15000 ethernetX.coalescingParams 16000
interrupts per second, set the interrupt rate to 16000 interrupts per second.
Set the ethernetX.coalescingScheme parameter to rbc and the
ethernetX.coalescingParams parameter to 16000. The default interrupt
rate is 4000 interrupts per second.
Disable coalescing for low throughput or latency-sensitive workloads. For ethernetX.coalescingScheme disabled
information about configuring low-latency workloads, see Best Practices for
Performance Tuning of Latency-Sensitive Workloads in vSphere VMs.
Revert to the coalescing algorithm from earlier ESXi releases. ethernetX.coalescingScheme calibrate
Note The ability to revert to the earlier algorithm will not be available in later
vSphere releases.
X next to ethernet stands for the sequence number of the vNIC in the virtual machine.
For more information about configuring parameters in the .vmx file, see the vSphere Virtual Machine
Administrationdocumentation.
This approach affects all virtual machines and all virtual machine NICs on the host.
You can edit the advanced system settings list for the host in the vSphere Web Client, or by using a
vCLI console command on the host from the ESXi Shell.
VMware, Inc. 98
vSphere Troubleshooting
Set a default interrupt rate higher than the Net.CoalesceScheme /Net/CoalesceScheme rbc
expected packet rate. For example, set the Net.CoalesceParams /Net/CoalesceParams 16000
interrupt rate to 16000 if 15000 interrupts
are expected per second.
For information about configuring a host from the vSphere Web Client, see the vCenter Server and
Host Management documentation. For information about setting host properties by using a vCLI
command, refer to the vSphere Command-Line Interface Reference documentation.
Problem
Virtual machines that reside on different hosts and on the same port group are unable to communicate.
Pings from one virtual machine to another have no effect. You cannot migrate the virtual machines
between the hosts by using vMotion.
Cause
n There are no physical NICs on some of the hosts assigned to active or standby uplinks in the teaming
and failover order of the distributed port group.
n The physical NICs on the hosts that are assigned to the active or standby uplinks reside in different
VLANs on the physical switch. The physical NICs in different VLANs cannot see each other and thus
cannot communicate with each other.
Solution
n In the topology of the distributed switch, check which host does not have physical NICs assigned to
an active or standby uplink on the distributed port group. Assign at least one physical NIC on that
host to an active uplink on the port group.
VMware, Inc. 99
vSphere Troubleshooting
n In the topology of the distributed switch, check the VLAN IDs of the physical NICs that are assigned
to the active uplinks on the distributed port group. On all hosts, assign physical NICs that are from the
same VLAN to an active uplink on the distributed port group.
n To verify that there is no problem at the physical layer, migrate the virtual machines to the same host
and check the communication between them. Verify that inbound and outbound ICMP traffic is
enabled in the guest OS. By default ICMP traffic is disabled in Windows Server 2008 and Windows
Server 2012.
Problem
After you cold migrate a vApp or a virtual machine to another data center or vCenter Server system, an
attempt to power it on fails. An error message states that a property cannot be initialized or allocated
because the network of the vApp or virtual machine does not have an associated network protocol profile.
Cannot initialize property 'property'. Network 'port group' has no associated network protocol profile.
Cannot allocate IP address for property 'property'. Network 'port group' has no associated network
protocol profile.
Cause
By using the OVF environment, the vApp or virtual machine retrieves network settings from a network
protocol profile that is associated with the port group of the vApp or virtual machine.
vCenter Server creates such a network protocol profile for you when you install the OVF of a vApp and
associates the profile with the port group that you specify during the installation.
The mapping between the protocol profile and port group is valid only in the scope of a data center. When
you move the vApp, the protocol profile is not transferred to the target data center because of the
following reasons:
n The network settings of the protocol profile might not be valid in the network environment of the target
data center.
n A port group that has the same name and is associated with another protocol profile might already
exist in the target data center, and vApps and virtual machines might be connected to this group.
Replacing the protocol profiles for the port group might affect the connectivity of these vApp and
virtual machines.
Solution
n Create a network protocol profile on the target data center or vCenter Server system with the required
network settings and associate the protocol profile with the port group to which the vApp or virtual
machine is connected. For example, this approach is suitable if the vApp or virtual machine is a
vCenter Server extension that uses the vCenter Extension vService.
For information about providing network settings to a vApp or virtual machine from a network protocol
profile, see the vSphere Networking documentation.
n Use the vSphere Web Client to export the OVF file of the vApp or virtual machine from the source
data center or vCenter Server system and deploy it on the target data center or vCenter Server
system.
When you use the vSphere Web Client to deploy the OVF file, the target vCenter Server system
creates the network protocol profile for the vApp.
For information about managing OVF files in the vSphere Web Client, see the vSphere Virtual
Machine Administration documentation.
Problem
Cause
Under stressful conditions on a host, that is, if many concurrent networking operations compete for limited
resources, the time to perform some of the operations might exceed the default timeout for rollback of
network configuration operations on the distributed switch. As a result, these operations are rolled back.
For example, such a condition might come up when you create a VMkernel adapter on a host that has a
very high number of switch ports or virtual adapters, all of which consume system resources on the host.
Solution
n Use the vSphere Web Client to increase the timeout for rollback on vCenter Server.
If you encounter the same problem again, increase the rollback timeout with 60 seconds
incrementally until the operation has enough time to succeed.
e Click OK.
n Increase the timeout for rollback by editing the vpxd.cfg configuration file.
If you encounter the same problem again, increase the rollback timeout with 60 seconds
incrementally until the operation has enough time to succeed.
a On a vCenter Server instance, navigate to the directory that contains the vpxd.cfg configuration
file.
c Under the <network> section, increase the timeout, in the <rollbackTimeout> element.
<config>
<vpxd>
<network>
<rollbackTimeout>60</rollbackTimeout>
</network>
</vpxd>
</config>
Problem
You try to assign a license to an ESXi host, but you cannot perform the operation and you receive an
error message.
Cause
You might be unable to assign a license to an ESXi host because of the following reasons:
n The calculated license usage for the host exceeds the license capacity. For example, you have a
vSphere license key with capacity for two CPUs. You try to assign the key to a host that has four
CPUs. You cannot assign the license, because the required license usage for the host is greater than
the license capacity.
n The features on the host do not match the license edition. For example, you might configure hosts
with vSphere Distributed Switch and vSphere DRS while in evaluation mode. Later, you try to assign
vSphere Standard license to the hosts. This operation fails because the vSphere Standard edition
does not include vSphere Distributed Switch and vSphere DRS.
n The host is connected to a vCenter Server system that is assigned a license that restricts the edition
of the license that you want to assign.
Solution
n Upgrade the license edition to match the resources and features on the host, or disable the features
that do not match the license edition.
n Assign a vSphere license whose edition is compatible with the license edition of vCenter Server.
Problem
An ESXi host disconnects from vCenter Server when the host evaluation period or license expires. All
ESXi hosts disconnect from vCenter Server when the evaluation period or the license of vCenter Server
expire. You receive a licensing-related error message both when a single host disconnects and when all
hosts disconnect. You cannot add hosts to the vCenter Server inventory. The hosts and the virtual
machines on the hosts continue to run.
Cause
n The 60-day evaluation period of the host has expired or the host license has expired.
n The 60-day evaluation period of vCenter Server is expired or the vCenter Server license is expired.
Solution
n Assign a vSphere license to the ESXi host and try to reconnect it to vCenter Server.
Problem
Cause
You might be unable to power on a virtual machine because of the following reasons.
Solution
The evaluation period of the host is expired Assign a vSphere license to the ESXi host
The license of the host is expired Assign a vSphere license to the ESXi host
Problem
You cannot use or configure a feature and a licensing-related error message appears.
Cause
The ESXi host or the vCenter Server system is assigned a license that does not support the features that
you want to configure.
Solution
Check the licensed features on the ESXi host and on the vCenter Server system. Upgrade the edition of
the license assigned to the host or vCenter Server if they do not include the features that you try to
configure or use.