
Understanding Red Hat Enterprise Virtualization: Common Issues and Troubleshooting

Derrick Ornelas Software Maintenance Engineer, Red Hat 06.28.12

INTENDED AUDIENCE
RHEV administrators

with a good working knowledge of the product
who want to learn how to begin troubleshooting their own issues
who want a better understanding of the inner workings of the product

AGENDA

Understanding RHEV

Components
Networking structure
Storage structure
Understanding Logs
Common Issues

Troubleshooting

RHEV ARCHITECTURE

Components Manager
What is the RHEV Manager?

Management platform for Virtualization Single platform for managing virtual servers and desktops

Built on Red Hat Enterprise Linux and JBoss

Runs on RHEL 6.2+ Server (physical or virtual)
Cannot currently run on the same hosts that it manages
RHEV-M is written in Java and runs on JBoss AS 5.1.2
JBoss is included with the RHEV subscription but is not supported for use in hosting non-RHEV applications

Uses an embedded PostgreSQL database

Components Hypervisor

Dedicated hypervisor

The minimum OS needed to run/manage virtual machines
Well-defined management interfaces/APIs
~120MB image size
750MB disk space required for installation
Supports the same hardware as RHEL
Leverages hardware certification and partner testing

Small footprint

Built using livecd-tools to create RHEL 6 live image

Utilizes KVM (Kernel-based Virtual Machine)
Includes libvirt & vdsm for virtual machine management

Components VDSM

vdsm daemon listens for incoming commands from RHEV-M

Operates libvirt for VM lifecycle management
Manages Storage Domains, Pools, the SPM role, metadata, VM volumes and snapshots
Monitors storage domain availability
Written in Python
Communicates with RHEV-M using XML-RPC on port 54321
Configuration in /etc/vdsm/vdsm.conf
Used to operate and control virtual machines: Start/Stop/Restart, Migrations, Monitoring

libvirt starts, stops, pauses and migrates VMs

Components VDSM
vdsClient
Can be used to interact with vdsmd, for troubleshooting only
Does not update the RHEV-M database!
Examples
Print a list of running VMs:
vdsClient -s 0 list table

Get VM info from host:


vdsClient -s 0 getAllVmStats

Start a virtual machine (for emergency situations only):


vdsClient -s 0 create /dev/null vmId=b53eff20-7fb2-4b73-8172-76ec279f917b memSize=1024 macAddr=00:1a:4a:40:18:0b display=vnc vmName=rhel6_2 drive=pool:82e6bb7a-8c10-41c9-80c2-f947d6adac13,domain:d964e86d-ac5f-48a6-b7e4-7742b6fcf271,image:9c997323-36b1-4ce9-906f-c9a7e8ba8e08,volume:c1acf9b6-ac55-44f1-bfe6-b38c20c27bec,boot:true,format:cow bridge=rhevm

Direct access to libvirt functionality via virsh is restricted

NETWORKING logical networks

Networking

Each Data Center defines the logical networks that exist in its environment
These logical networks are usually assigned by functionality and physical topology. For example:

Guest data network Storage network access Management network (may be out of band for managing the servers) Display network (for Spice/VNC)

Each Cluster may have a different set of logical networks, but they must exist in the Data Center definition
All the Hosts in the Cluster should have the same network configuration
By default RHEV-M defines only the management network, rhevm
The logical network layout does not necessarily correspond to the physical NICs on the host, but the infrastructure must support VLAN tags and bonding for that; otherwise logical network == physical NIC

NETWORKING

Support for VLAN tags and bonding

Supported bonding modes


active-backup / mode 1
balance-xor / mode 2
802.3ad / mode 4
balance-tlb / mode 5

balance-rr (mode 0) and balance-alb (mode 6) not supported due to incompatibility with software bridges

A software bridge is created for each logical network and functions like a switch
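As a reading aid, here is a hedged sketch of what the host-side configuration for an 802.3ad bond carrying a VLAN-tagged logical network can look like. The interface names, VLAN ID, and bridge name are invented, and in practice RHEV-M pushes this configuration to the host through VDSM rather than an admin editing it by hand:

```
# /etc/sysconfig/network-scripts/ifcfg-bond0  (illustrative)
DEVICE=bond0
BONDING_OPTS="mode=4 miimon=100"    # 802.3ad
ONBOOT=yes

# /etc/sysconfig/network-scripts/ifcfg-bond0.20  (VLAN 20 carrying a logical network)
DEVICE=bond0.20
VLAN=yes
BRIDGE=vmdata                       # the software bridge for this logical network
ONBOOT=yes

# /etc/sysconfig/network-scripts/ifcfg-vmdata
DEVICE=vmdata
TYPE=Bridge
ONBOOT=yes
```

VM vNICs attached to the logical network are plugged into the vmdata bridge, which behaves like a switch.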

NETWORKING ports

Note: SPICE/VNC clients connect directly to the host
USB redirection clients connect directly to the guest

STORAGE definitions
Storage pool: logical equivalent of a Data Center; groups storage domains together
Storage domain: physical chunk of storage that holds virtual machine disks
Storage Pool Manager (SPM): single host in the data center that is chosen to manage all storage in a storage pool
Host Storage Manager (HSM): VDSM component on each host that reads/writes messages to the SPM

STORAGE
[Diagram: RHEV-M talks to VDSM on every host in the Storage Pool; each host runs a Host Storage Manager, and exactly one host at a time holds the Storage Pool Manager role.]

STORAGE architecture
Storage Pool

File (NFS share)
Each domain is a share
Volumes and metadata are files
Domain types: Data, Export, ISO

Block (iSCSI, managed via iscsiadm; FC, direct access)
Devices are managed by LVM and multipathd
Each LUN is a PV
Each domain is represented by a VG, tagged with RHAT_storage_domain
Metadata & volumes are represented by LVs
Domain types: Data

STORAGE architecture
How are virtual machines stored?

OVF file
Holds the VM description: name, NICs, CPU, memory, disks and more
Only used when importing/exporting VMs to/from RHEV

VM disk
Managed as an image, which is a logical group of volumes
Volumes in an image are different versions of a disk
Stored as files on NFS
Stored on LVM logical volumes on iSCSI/FC

STORAGE architecture
Snapshots

A new sparse volume is created, regardless of the type of the original volume
QCOW2 chains the volumes together, grouped as an image
The last volume in the chain is read-write (rw); all the others are read-only (r)
On block storage, all of an image's volumes/LVs must be active
A template volume can be used as the head of the chain

Templates

Template volume is always read-only in this case

STORAGE architecture
Volumes are visible to all hosts in the storage pool
SPM: single host that controls all storage operations
Master Data Storage Domain: single storage domain that keeps all the up-to-date information about the storage pool as metadata
Metadata:
The storage pool and each domain have metadata describing them
Each volume also has metadata describing it
On block storage, volume metadata is stored on an LV
On NFS storage, volume metadata is a file per volume with a .meta suffix

Storage Architecture
Each storage domain contains the following files/volumes for internal use:
ids: not used
inbox: monitored by the SPM for HSM messages
outbox: monitored by HSMs for SPM messages
leases: the SPM writes a timestamp here to prevent other hosts from becoming SPM at the same time
master: ext3 filesystem with vms and tasks directories; only mounted on the SPM, and only used on the master storage domain
metadata: contains volume metadata; on NFS a file for each volume, on block an LV for each

Storage Metadata

Metadata - information describing the storage pool and each of its storage domains that is stored on the physical storage Consists of a combination of text and LVM tags Two storage domain metadata versions exist: V1 and V2

Version 1 used by ISO and Export storage domains, and all RHEV 2.x storage domains Version 2 used by new data storage domains in RHEV 3.0

Block storage metadata
V1 storage domain metadata is located in the first 2k bytes of /dev/<SD_UUID>/metadata
V2 storage domain metadata is part of the VG tags
Volume metadata is located on /dev/<SD_UUID>/metadata

NFS storage metadata
Storage domain metadata is located in /rhev/data-center/mnt/<mountpoint>/<SD_UUID>/dom_md/metadata
Volume metadata is located in /rhev/data-center/mnt/<mountpoint>/<SD_UUID>/images/<image_GUID>/<volume_UUID>.meta

Storage Structure
# tree /rhev/data-center/
Shows the tree structure of the Storage Pool as seen by the host
The tree package is not installed by default on RHEL 6 and is not available on RHEV-H

# python /usr/share/vdsm/dumpStorageTable.py
Provides a table view of the storage

# pvs | vgs | lvs -o +tags
Shows LVM information with the RHEV-related tags
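Since the tree package may be absent, find(1) gives a similar picture. The following self-contained sketch mocks up an NFS data domain's on-disk layout under /tmp (the mount point and all UUIDs are fabricated; only the internal file names match the real layout) so the structure can be seen without a live pool:

```shell
# Build a throwaway mock of an NFS storage domain (UUIDs are made up)
SD=/tmp/rhev-demo/data-center/mnt/nfsserver:_export/11111111-1111-1111-1111-111111111111
IMG=$SD/images/22222222-2222-2222-2222-222222222222
mkdir -p "$SD/dom_md" "$IMG"
# Internal files kept in every storage domain (see the list above)
touch "$SD/dom_md/ids" "$SD/dom_md/inbox" "$SD/dom_md/outbox" \
      "$SD/dom_md/leases" "$SD/dom_md/metadata"
# A volume plus its per-volume metadata file (NFS keeps one .meta per volume)
touch "$IMG/33333333-3333-3333-3333-333333333333" \
      "$IMG/33333333-3333-3333-3333-333333333333.meta"
find /tmp/rhev-demo -type f | sort
```

On a real host, running the same find against /rhev/data-center/ shows the live equivalent of this structure.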

TROUBLESHOOTING understanding logs

Main RHEV-M log: /var/log/rhevm/rhevm.log
Timestamps will be in the timezone of the OS (i.e. localtime)
Main hypervisor logs: /var/log/vdsm/vdsm.log and /var/log/vdsm/libvirt.log
Timestamps will be in UTC on RHEV-H
The time difference between the manager and hypervisor may need to be taken into account when following task flows from RHEV-M to RHEV-H
rhevm.log:
2012-01-14 17:26:17,803 INFO [org.ovirt.engine.core.bll.RunVmCommand] (pool-11-thread1425) Running command: RunVmCommand internal: false. Entities affected : ID: 570c6cfd-6fe4-4a33-8fd0-d32d5bfa2bd5 Type: VM
vdsm.log:
Thread-200734::DEBUG::2012-01-14 07:26:19,168::clientIF::54::vds::(wrapper) [10.64.24.140]::call create with ({'bridge': 'rhevm', 'acpiEnable': 'true', 'emulatedMachine': 'rhel6.2.0', 'vmId': '570c6cfd-6fe4-4a33-8fd0-d32d5bfa2bd5' ...
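Because rhevm.log is in localtime while RHEV-H logs are in UTC, it helps to normalize a timestamp before matching entries across the two logs. A small GNU date sketch (the timezone here is an assumption; substitute the manager's actual zone):

```shell
# Convert a manager-side localtime timestamp to UTC for comparison with
# the hypervisor's vdsm.log (GNU date; America/New_York is just an example)
date -u -d 'TZ="America/New_York" 2012-01-14 17:26:17' '+%Y-%m-%d %H:%M:%S'
# → 2012-01-14 22:26:17
```

The converted value can then be grepped for directly in vdsm.log, plus or minus a few seconds of task latency.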

Troubleshooting Understanding Logs

All actions are initiated by the manager
The vdsm daemon listens for incoming tasks
Tasks are handled asynchronously by vdsm; the manager polls for status
A response is returned to the manager when the task is completed
Check the vdsm logs for "Run and protect" to find the start (and end) of a new task
Thread-227417::INFO::2012-01-14 07:54:39,246::dispatcher::94::Storage.Dispatcher.Protect::(run) Run and protect: getSpmStatus, args: ( spUUID=82e6bb7a-8c10-41c9-80c2-f947d6adac13) ... Thread-227417::INFO::2012-01-14 07:54:39,248::dispatcher::100::Storage.Dispatcher.Protect::(run) Run and protect: getSpmStatus, Return response: {'status': {'message': 'OK', 'code': 0}, 'spm_st': {'spmId': 3, 'spmStatus': 'SPM', 'spmLver': 8}}

Task flow can be followed in the vdsm log by looking for lines that share the same prefix:
<Thread-xxx> - usually for short tasks, or
<task number> - for long (async) tasks
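A minimal demonstration of both grep techniques on a captured snippet (the two log lines below are copied from the example above; the scratch path is throwaway):

```shell
# Two vdsm.log lines saved into a scratch file for demonstration
cat > /tmp/vdsm-sample.log <<'EOF'
Thread-227417::INFO::2012-01-14 07:54:39,246::dispatcher::94::Storage.Dispatcher.Protect::(run) Run and protect: getSpmStatus, args: ( spUUID=82e6bb7a-8c10-41c9-80c2-f947d6adac13)
Thread-227417::INFO::2012-01-14 07:54:39,248::dispatcher::100::Storage.Dispatcher.Protect::(run) Run and protect: getSpmStatus, Return response: {'status': {'message': 'OK', 'code': 0}, 'spm_st': {'spmId': 3, 'spmStatus': 'SPM', 'spmLver': 8}}
EOF
# Follow one short task end-to-end by its thread ID
grep 'Thread-227417' /tmp/vdsm-sample.log
# Show only the task start/end markers
grep -c 'Run and protect' /tmp/vdsm-sample.log   # → 2
```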

TROUBLESHOOTING checking database

Info (e.g. UUIDs, current state) about VMs, storage pools, domains, VM images, etc. can be obtained by querying the RHEV-M database
Embedded PostgreSQL database server
Database name: rhevm
Connect by running 'psql rhevm rhevm'
Restoring from a database backup:

pg_restore -c -d rhevm -U postgres <dump_file>

A graphical client such as pgadmin3 can be used for convenience
Available in the Fedora or EPEL repositories
Example query: show all hosts and their info
select * from vds_static\x\g\x

TROUBLESHOOTING database
Important tables:
images: list of all VM volumes
image_vm_map: maps VMs to active volumes
lun_storage_server_connection_map: maps LUNs to iSCSI connection info
lun: list of physical storage LUNs
storage_pool: list of all storage pools/data centers
storage_domain_static: list of all storage domains
storage_server_connections: list of iSCSI and NFS storage connections
vds_static: list of all hosts/hypervisors
vm_static: list of all virtual machines
Note: table and column names can be confusing. A disk image is referred to as an image_group, while a disk volume is referred to as an image

TROUBLESHOOTING storage issues

For storage issues, generally concentrate on the SPM host (or the host attempting to become SPM, if the problem relates to acquiring the SPM role)
The current SPM host may not have been the SPM at the time the problem occurred:
Search rhevm.log for 'starting spm on' to find the SPM at the time of the problem
RHEV storage operations use standard RHEL commands, so typical storage troubleshooting applies:
Storage commands can be run from the hypervisor command line
multipath, iscsiadm, showmount/mount/rpcinfo, cat /proc/scsi/scsi, less /var/log/messages, etc.
Storage domains, VM disks, snapshots and templates in iSCSI/FC data centers are LVM volume groups / logical volumes, so typical LVM troubleshooting applies:
vgscan, lvs, vgchange, cat /etc/lvm/{archive,backup}/<VG>, etc.

TROUBLESHOOTING certificates

CA certificate is generated during rhevm-setup
Located on the manager at:
http://<fqdn>:<http_port>/ca.crt
/var/lib/jbossas/server/rhevm-slimmed/deploy/ROOT.war/ca.crt
/etc/pki/rhevm/ca.pem
The CA certificate must be in the Windows Trusted Root Certification Authorities certificate store on the client to connect via HTTPS
SSL certificates for rhevm<->vdsmd communication are created during the host registration process
Located on the manager at /etc/pki/rhevm/certs/<host_address>cert.pem
Located on hosts at /etc/pki/vdsm/certs/vdsmcert.pem
SSL/TLS requires times to be in sync or the connection will fail to be established (use NTP)
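Certificate problems are often really clock problems, and openssl can show a certificate's validity window for comparison against both sides' clocks. The self-signed certificate below is generated only so the inspection command has something to work on (on a real system, point openssl x509 at ca.pem or vdsmcert.pem instead):

```shell
# Throwaway self-signed cert as a stand-in for /etc/pki/rhevm/ca.pem
openssl req -x509 -newkey rsa:2048 -nodes -subj '/CN=rhevm.example.com' \
    -keyout /tmp/demo.key -out /tmp/demo.pem -days 1 2>/dev/null
# Check the validity window: if "now" on either the manager or the host
# falls outside it, or the two clocks disagree wildly, the SSL handshake
# between rhevm and vdsmd will fail
openssl x509 -in /tmp/demo.pem -noout -subject -dates
```

Comparing the notBefore/notAfter output with `date` on both machines quickly confirms or rules out time skew.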

TROUBLESHOOTING certificates

WPF application code can not run without being signed

Code-signing cert is installed by RHEV-GUI-CertificateInstaller.exe offered by admin portal on first access

Located on the manager at /usr/share/rhevm/rhevm.ear/rhevmanager.war/RHEV-GUI-CertificateInstaller.exe

The WPF code-signing certificate must be installed by the installer or an admin, or the WPF app will crash on start

TROUBLESHOOTING data center down

Data center = storage pool

The storage pool must have a master storage domain (holds the storage pool metadata)
Common problem: cannot activate the master storage domain
The storage pool must have an SPM host managing all changes to the storage domains
Common problem: cannot start the SPM on any host

Usually caused by storage- or network-related problems:
Missing LUN(s) and/or volume group(s) (storage domain = volume group)
Corrupt / inconsistent storage domain metadata or LVM metadata
The current SPM host is non-responsive and no fencing is defined for that host

Troubleshooting examples:

Situation where the current SPM host is non-responsive with no power fencing:
Manually reboot the host and 'Confirm host has been rebooted' in the RHEV-M GUI
Put the master storage domain into maintenance and try to activate it again
Put all hosts into maintenance (or reboot them all) and try to activate just one host
Focus on vdsm.log on that host to determine why the storage pool won't activate

TROUBLESHOOTING problematic host states

Non-responsive

Cause: RHEV-M cannot communicate with the host on vdsm port 54321
RHEV-M regularly monitors hosts; if a host cannot be contacted for a while, errors similar to these will appear in rhevm.log:

ResourceManager::refreshVdsRunTimeInfo::Failed to refresh VDS , vds = 0b1f2e8e-3b5a-11e1-b24a-5254005ef58b : rhevhost1.redhat.com, VDS Network Error, continuing
ResourceManager::vdsNotResponding entered for Host 0b1f2e8e-3b5a-11e1-b24a-5254005ef58b, 10.10.1.205

Power fencing will allow a non-responsive host to be rebooted and HA VMs to restart on another host (SPM role will transfer, too, if host was SPM)

Non-operational
Cause: RHEV-M can still communicate with the host on port 54321, but something is wrong with the configuration/operation of the host.

It is a problem with the host not being able to successfully operate all the components defined for a host in its cluster

Storage domain (volume group) cannot be found/activated
Metadata corruption / inconsistency (wrong master version)
A logical network (other than the rhevm network) cannot be created or is down

TROUBLESHOOTING problematic VM states

Unknown

Usually related to a host becoming non-responsive when no power management has been defined for that host
RHEV-M can no longer monitor the VMs on that host to determine their state, so it marks them as Unknown

Failed to run Fence script on vds:host5.redhat.com, VMs moved to UnKnown instead

Paused

A VM usually enters this state when it encounters storage problems outside the VM. Pausing prevents further operation of the VM, stopping it from trying to access/write to its disk

libvirtEventLoop::INFO::2012-01-13 06:29:02,646::libvirtvm::1231::vm.Vm::(_onAbnormalStop) vmId=`b53eff20-7fb2-4b73-8172-76ec279f917b`::abnormal vm stop device ide0-0-0 error eio libvirtEventLoop::DEBUG::2012-01-13 06:29:02,646::libvirtvm::1386::vm.Vm:: (_onLibvirtLifecycleEvent) vmId=`b53eff20-7fb2-4b73-8172-76ec279f917b`::event Suspended detail 2 opaque None

Example causes:

Storage connection problems
Storage domain full when trying to extend a sparse VM volume

TROUBLESHOOTING can't start VM

Disk problems, examples:
Moving a VM with multiple volumes between storage domains may have been interrupted, leaving some of the volumes in the original storage domain and some in the destination domain
A VM disk based on a template has been moved to a new storage domain, but the template wasn't moved
Solution: may need to check the DB to get a list of images/storage domains for the VM, then use LVM commands (e.g. lvs) to get an on-disk view of where its images are. If there are discrepancies, may need to run SQL update statements to correct the DB view and/or vdsClient commands to modify the on-disk view
RHEV-M UI problems, examples:
The VM appears to be stuck in a state that won't allow it to be started, e.g. Image Locked or Unknown; the RHEV-M DB has become out of sync with what is happening on the hypervisors
For Image Locked, check whether the task (e.g. creating/moving/deleting an image) has completed. For Unknown, check whether the VM is running on any host
Solution: may need to run SQL update statements against the DB to change the state of the VM (e.g. to Down) so the VM can be started again

TROUBLESHOOTING can't migrate VM

Hostname problems
The hostname of the destination needs to be DNS-resolvable by the source host

migration destination error: Migration destination has an invalid hostname

Firewall problems
Live migration needs ports 49152-49216 on the destination to be accessible
Hardware incompatibilities
E.g. the CPU on the destination host doesn't match the source host
Timeout problems
The live migration timeout is 300 seconds (migrate_timeout in /etc/vdsm/vdsm.conf)
Some factors influencing the time to migrate a VM memory image between hosts:
Amount of memory in the VM
Amount of memory activity happening in the VM
Saturation of the network used to migrate the memory image

Live migrating many VMs at once, e.g. putting a host with many running VMs into maintenance, may mean not all VMs complete their live migration before the timeout expires, preventing the host from going into maintenance
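To see why the 300-second timeout bites, a back-of-envelope sketch (assuming an idle guest, no dirty-page re-transfers, and a dedicated link, all of which are optimistic simplifications):

```shell
# Rough lower bound on seconds to move a memory image: MB * 8 bits / Mbps
MEM_MB=16384       # a 16 GB guest
LINK_MBPS=1000     # a 1 GbE migration network
echo $(( MEM_MB * 8 / LINK_MBPS ))   # → 131 seconds
```

A busy guest dirtying memory faster than it can be copied, or several VMs sharing the same link, can easily push the real figure past migrate_timeout.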

TROUBLESHOOTING KVM and QEMU


Common Issues

Time drift
Causes: CPU overcommit, heavy server load
Use NTP as much as possible
See the RHEV Administration Guide, Appendix G, "KVM Virtual Machine Timing Management"

Performance (suggested best practices):

Guest:
noatime
elevator=noop
vCPU numbers
Minimal services

Host:
elevator=deadline
noatime
Multipathing
Bonding
Avoid overallocation: more threads than physical CPUs means threads block, which causes problems, as a VM should essentially be running in real time
Don't overcommit servers with more CPUs or memory than you actually have

TROUBLESHOOTING collecting logs

rhevm-log-collector
Main tool used by support personnel to capture a snapshot of a customer's RHEV environment

Collects RHEV-M log files and the database
tmp/logcollector/RHEVH-and-PostgreSQL-reports/time_diff.txt
Runs sosreports on nominated hypervisors to collect the usual RHEL-related info (e.g. log files / command output) as well as RHEV-specific info (e.g. VDSM/libvirt log files and vdsClient command output)

Usage:
rhevm-log-collector [options] list
rhevm-log-collector [options] collect
The 'list' operation will list hosts, data centers, or clusters from which logs may be collected
The 'collect' operation will collect the data and compress it
This process may take some time depending on log file size
https://access.redhat.com/knowledge/techbriefs/troubleshooting-red-hat-enterprise-virtualization-manager-log-collection-rhev-3

Stay connected through the Red Hat Customer Portal

Troubleshooting Host Installation for Red Hat Enterprise Virtualization 3.0

Review Tech brief

access.redhat.com
