
VERITAS Storage Foundation 5.x for UNIX:
Fundamentals

Lesson 7
Resolving Hardware Problems
Lesson Introduction
• Lesson 1: Virtual Objects
• Lesson 2: Installation and Interfaces
• Lesson 3: Creating a Volume and File System
• Lesson 4: Selecting Volume Layouts
• Lesson 5: Making Basic Configuration Changes
• Lesson 6: Administering File Systems
• Lesson 7: Resolving Hardware Problems
Lesson Topics and Objectives
Topic                                                   After completing this lesson, you will be able to:
Topic 1: How Does VxVM Interpret Failures in Hardware   Interpret failures in hardware.
Topic 2: Recovering Disabled Disk Groups                Recover disabled disk groups.
Topic 3: Resolving Disk Failures                        Resolve disk failures.
Topic 4: Managing Hot Relocation at the Host Level      Manage hot relocation at the host level.
Topic 1: How Does VxVM Interpret Failures in Hardware

After completing this topic, you will be able to interpret failures in hardware.
Potential Failures in a Storage Environment
Temporary Failures
• Power cut
• Fibre connection failure
• Complete SAN failure
• SAN switch failure
• HBA card/port failure
Can Be Permanent or Temporary
• LUN/disk failure
• Complete disk array failure
• Site failure
[Diagram labels: Disk Arrays, SAN, JBOD]
I/O Error Handling
The OS detects the I/O error and informs VxVM.

If the LUN/disk cannot be accessed at all, dynamic multipathing (DMP) disables the path. If there is only one path, the DMP node is disabled.

Does the failure impact the whole disk group?
No: Treat the failure as a disk/LUN failure.
Yes: Disable the disk group.

If the failure impacts any file system, disable the file system.
Identifying Disabled Disk Groups
VxVM disk and disk group records before the failure:
vxdisk list
DEVICE TYPE DISK GROUP STATUS

disk0_1 auto:cdsdisk datadg01 datadg online
disk0_2 auto:cdsdisk datadg02 datadg online
disk0_3 auto:none - - online invalid

vxdg list
NAME STATE ID
datadg enabled,cds 1150193039.58.train1

VxVM disk and disk group records after the failure:


vxdisk list
DEVICE TYPE DISK GROUP STATUS

disk0_1 auto:cdsdisk datadg01 datadg online dgdisabled
disk0_2 auto:cdsdisk datadg02 datadg online dgdisabled
disk0_3 auto - - error

vxdg list
NAME STATE ID
datadg disabled 1150193039.58.train1
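The comparison above can be automated with a small filter. The following is a minimal sketch, assuming the `vxdg list` column layout shown on this slide (NAME, STATE, ID after one header line); the function name disabled_dgs is hypothetical and not part of VxVM.

```sh
# Print the name of every disk group whose STATE column contains "disabled".
# Reads `vxdg list` output on stdin and skips the header line.
disabled_dgs() {
    awk 'NR > 1 && $2 ~ /disabled/ { print $1 }'
}

# Typical use on a host with VxVM installed:
#   vxdg list | disabled_dgs
```

Because the filter only reads text, it can be tried against saved output before it is used on a live system.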
Identifying Failed Disks
VxVM disk records before the failure:
vxdisk list
DEVICE TYPE DISK GROUP STATUS
disk0_0 sliced rootdisk sysdg online
disk0_1 auto:cdsdisk datadg01 datadg online
disk0_2 auto:cdsdisk datadg02 datadg online
disk0_3 auto:cdsdisk - - online
disk0_4 auto:none - - online invalid
VxVM disk records after the failure:
vxdisk list
DEVICE TYPE DISK GROUP STATUS
disk0_0 sliced rootdisk sysdg online
disk0_1 auto:cdsdisk datadg01 datadg online
disk0_2 auto - - error
disk0_3 auto - - error
disk0_4 auto - - error
- - datadg02 datadg failed was:disk0_2
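The failed record in the last line of the listing carries both the disk media name and the device it was last seen on. A minimal sketch of extracting that pair, assuming the six-column layout shown above; the function name failed_disks is hypothetical.

```sh
# Print "disk_media_name last_known_device" for every failed disk record.
# Reads `vxdisk list` output on stdin.
failed_disks() {
    awk '$5 == "failed" && $6 ~ /^was:/ {
        sub(/^was:/, "", $6)    # strip the "was:" prefix from the device name
        print $3, $6
    }'
}

# Typical use on a host with VxVM installed:
#   vxdisk list | failed_disks
```

For the listing above, the only matching record is the datadg02 line.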
Permanent versus Temporary Failures
• Temporary Failure
  – Data on the LUN/disk is still there, only temporarily unavailable.
  – When the hardware problem is resolved, in most cases recovery can make use of the pre-existing data.
• Permanent Failure
  – The data on the LUN/disk is completely destroyed.
  – If the volumes were not redundant, data needs to be restored from backup.
  – However, the VxVM objects and the disk group configuration information can be restored.
Topic 2: Recovering Disabled Disk Groups

After completing this topic, you will be able to recover disabled disk groups.
Device Recovery
• As soon as the hardware problem is resolved, the OS recognizes the disk array and the disks.
• DMP automatically detects the change, adds the disk array to the configuration, and enables the DMP paths. This can take up to 300 seconds. To speed it up, run the vxdctl enable command immediately after resolving the hardware problem.
• Relevant messages are logged to the system log (Solaris example):

June 13 12:06:25 train1 vxdmp: [ID 803759 kern.notice] NOTICE: VxVM vxdmp V-5-0-34 added disk array D60J0DDA, datype = HDS9500-ALUA
June 13 12:06:25 train1 vxdmp: [ID 736771 kern.notice] NOTICE: VxVM vxdmp V-5-0-148 enabled path 32/0xa0 belonging to the dmpnode 253/0x10
June 13 12:06:25 train1 vxdmp: [ID 899070 kern.notice] NOTICE: VxVM vxdmp V-5-0-147 enabled dmpnode 253/0x10
…
Recovering From Temporary Disk Group Failures
• The disks still have their private regions. Therefore, there is no need to recover the disk group configuration data.
• Recover the disk group as follows:
1. Unmount any disabled file systems in the disk group.
2. Deport the disk group.
3. Make sure that the DMP paths are enabled using:
   vxdisk -o alldgs list
4. Import the disk group.
5. Start the volumes in the disk group using:
   vxvol -g diskgroup startall
   Note that mirrored volumes may go through a synchronization process in the background if they were open at the time of the failure.
6. Carry out file system checks.
7. Mount the file systems.
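The seven steps above can be sketched as a script. This is an illustration under assumptions, not a supported procedure: it handles one volume and one mount point, uses the Solaris-style `-F vxfs` option that appears elsewhere in this lesson, and the names run and recover_temp_dg are hypothetical. With DRY_RUN=1 the commands are printed instead of executed.

```sh
# Execute a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# recover_temp_dg DISKGROUP VOLUME MOUNTPOINT (hypothetical helper)
recover_temp_dg() {
    dg=$1 vol=$2 mnt=$3
    run umount "$mnt" || return 1                          # 1. unmount the disabled file system
    run vxdg deport "$dg" || return 1                      # 2. deport the disk group
    run vxdisk -o alldgs list || return 1                  # 3. confirm that the disks are visible again
    run vxdg import "$dg" || return 1                      # 4. import the disk group
    run vxvol -g "$dg" startall || return 1                # 5. start the volumes
    run fsck -F vxfs "/dev/vx/rdsk/$dg/$vol" || return 1   # 6. check the file system
    run mount -F vxfs "/dev/vx/dsk/$dg/$vol" "$mnt"        # 7. mount the file system
}
```

Running `DRY_RUN=1 recover_temp_dg datadg vol01 /data` first shows the exact command sequence; the mount point /data is only an example value.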
Recovering From Permanent Disk Group Failures
• DMP recovery is automatic, as with temporary failures. This time, however, the disks no longer have private regions that contain the disk group configuration data.
• After the DMP paths are enabled, recover the disk group as follows:
1. Unmount any disabled file systems in the disk group.
2. Deport the disk group.
   At this point all disk group information is lost except for the configuration backups.
3. Restore the disk group configuration data.
   Note that mirrored volumes will go through a synchronization process in the background.
4. Re-create the file systems if necessary.
5. Restore data from a backup.
Disk Group Configuration Backup and Restore
Back Up:    vxconfigbackup diskgroup
Precommit:  vxconfigrestore -p diskgroup
Commit:     vxconfigrestore -c diskgroup
[Diagram: disk group engdg with volume vol01 at the back up, precommit, and commit stages]
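The three stages can be sketched as two helper functions. The helper names are hypothetical, and the vxprint call between precommit and commit is an assumption about how an administrator would review the precommitted configuration; the slide itself only names the three commands. With DRY_RUN=1 the commands are printed instead of executed.

```sh
# Execute a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Back up the configuration while the disk group is healthy.
backup_dg_config() {
    run vxconfigbackup "$1"
}

# Restore in two phases: precommit, review, then commit.
restore_dg_config() {
    dg=$1
    run vxconfigrestore -p "$dg" || return 1   # precommit
    run vxprint -g "$dg" -ht || return 1       # review (assumed inspection step)
    run vxconfigrestore -c "$dg"               # commit
}
```

Splitting the restore into precommit and commit gives a point at which the restored layout can be inspected before it is made permanent.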
Topic 3: Resolving Disk Failures

After completing this topic, you will be able to resolve disk failures.
Disk Failure: Volume States After the Failure
Note: datadg02 is the failed disk.
vxprint -g datadg -ht
DG NAME NCONFIG NLOG MINORS GROUP-ID
DM NAME DEVICE TYPE PRIVLEN PUBLEN STATE
RV NAME RLINK_CNT KSTATE STATE PRIMARY DATAVOLS SRL
RL NAME RVG KSTATE STATE REM_HOST REM_DG REM_RLNK
V NAME RVG KSTATE STATE LENGTH READPOL PREFPLEX UTYPE
PL NAME VOLUME KSTATE STATE LENGTH LAYOUT NCOL/WID MODE
SD NAME PLEX DISK DISKOFFS LENGTH [COL/]OFF DEVICE MODE
SV NAME PLEX VOLNAME NVOLLAYR LENGTH [COL/]OFF AM/NM MODE
. . .
dg datadg default default 64000 954250803.2005.train06
dm datadg01 disk0_1 auto:cdsdisk 1519 4152640 -
dm datadg02 - - - - NODEVICE
v vol01 - ENABLED ACTIVE 204800 SELECT - fsgen
pl vol01-01 vol01 ENABLED ACTIVE 205200 CONCAT - RW
sd datadg01-01 vol01-01 datadg01 0 205200 0 disk0_1 ENA
pl vol01-02 vol01 DISABLED NODEVICE 205200 CONCAT - RW
sd datadg02-01 vol01-02 datadg02 0 205200 0 - RLOC
v vol02 - DISABLED ACTIVE 204800 SELECT - fsgen
pl vol02-01 vol02 DISABLED NODEVICE 205200 CONCAT - RW
sd datadg02-02 vol02-01 datadg02 205200 205200 0 -
Disk Replacement Tasks

1. Physical Replacement
   Replace the corrupt disk with a new disk.

2. Logical Replacement
   • Replace the disk in VxVM.
   • Start disabled volumes.
   • Resynchronize redundant volumes.
Physically Replacing a Disk
1. Connect the new disk.
2. Ensure that the operating system recognizes the disk.
3. Get VxVM to recognize the disk:
vxdctl enable
4. Verify that VxVM recognizes the disk:
vxdisk -o alldgs list

Note: In VEA, use Actions > Rescan to run disk setup commands appropriate for the OS and ensure that VxVM recognizes newly attached hardware.
Logically Replacing a Disk
VEA:
• Select the disk to be replaced.
• Select Actions > Replace Disk.
vxdiskadm:
“Replace a failed or removed disk”
CLI:
vxdg -k -g diskgroup adddisk disk_name=device_name
The -k option forces VxVM to take the disk media name of the failed disk and assign it to the new disk. Use with caution.
Example:
vxdg -k -g datadg adddisk datadg01=c1t1d0
Note: You may need to initialize the disk before running the vxdg adddisk command:
vxdisksetup -i device_name
Recovering a Volume
VEA:
• Select the volume to be recovered.
• Select Actions > Recover Volume.
CLI:
vxreattach [-bcr] [device_tag]
• Reattaches disks to a disk group after a transient failure, such as when a drive is turned off and then turned back on.
• -r attempts to recover stale plexes using vxrecover.

vxrecover [-bnpsvV] [-g diskgroup] [volume_name|disk_name]
Example:
vxrecover -b -g datadg datavol
Resolving Disk Failures - Summary
1. Fix the hardware problem (replace disks, re-cable, change HBA, …).
2. Ensure that the OS recognizes the device.
3. Force VxVM to scan for added devices:
   vxdctl enable
4. Permanent disk failure:
   4-a. Initialize a new drive:
        vxdisksetup -i device_name
   4-b. Attach the disk media name to the new drive:
        vxdg -g diskgroup -k adddisk disk_name=device_name
   Temporary disk failure:
   4. Reattach the disk media name to the disk access name:
        vxreattach
5. Recover the redundant volumes:
   vxrecover
6. Start any non-redundant volumes:
   vxvol -g diskgroup -f start volume
7. Permanent disk failure: Restore non-redundant volume data from backup.
   Temporary disk failure: Check data for consistency:
   fsck -F vxfs /dev/vx/rdsk/diskgroup/volume
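Steps 3 through 5 of the summary differ only in step 4, which a short sketch can make explicit. The function fix_disk and its argument order are hypothetical; steps 1, 2, 6, and 7 are left to the administrator. On many installations vxdisksetup is located in /etc/vx/bin, so the PATH may need adjusting. With DRY_RUN=1 the commands are printed instead of executed.

```sh
# Execute a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# fix_disk permanent DISKGROUP DISK_MEDIA_NAME NEW_DEVICE
# fix_disk temporary DISKGROUP
fix_disk() {
    mode=$1 dg=$2 dm=${3:-} dev=${4:-}
    run vxdctl enable || return 1                                # step 3
    case $mode in
        permanent)
            run vxdisksetup -i "$dev" || return 1                # step 4-a
            run vxdg -g "$dg" -k adddisk "${dm}=${dev}" || return 1   # step 4-b
            ;;
        temporary)
            run vxreattach || return 1                           # step 4
            ;;
        *)
            echo "mode must be permanent or temporary" >&2
            return 2
            ;;
    esac
    run vxrecover -g "$dg"                                       # step 5
}
```

For example, `DRY_RUN=1 fix_disk permanent datadg datadg02 disk0_2` prints the four commands for replacing the failed disk from the earlier listing.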
Disk Failure: Volume States After Attaching the Disk
vxprint -g datadg -ht
DG NAME NCONFIG NLOG MINORS GROUP-ID
DM NAME DEVICE TYPE PRIVLEN PUBLEN STATE
RV NAME RLINK_CNT KSTATE STATE PRIMARY DATAVOLS SRL
RL NAME RVG KSTATE STATE REM_HOST REM_DG REM_RLNK
V NAME RVG KSTATE STATE LENGTH READPOL PREFPLEX UTYPE
PL NAME VOLUME KSTATE STATE LENGTH LAYOUT NCOL/WID MODE
SD NAME PLEX DISK DISKOFFS LENGTH [COL/]OFF DEVICE MODE
SV NAME PLEX VOLNAME NVOLLAYR LENGTH [COL/]OFF AM/NM MODE
. . .
dg datadg default default 64000 954250803.2005.train06
dm datadg01 disk0_1 auto:cdsdisk 1519 4152640 -
dm datadg02 disk0_2 auto:cdsdisk 1519 4152640 -
v vol01 - ENABLED ACTIVE 204800 SELECT - fsgen
pl vol01-01 vol01 ENABLED ACTIVE 205200 CONCAT - RW
sd datadg01-01 vol01-01 datadg01 0 205200 0 disk0_1 ENA
pl vol01-02 vol01 DISABLED IOFAIL 205200 CONCAT - RW
sd datadg02-01 vol01-02 datadg02 0 205200 0 disk0_2 ENA
v vol02 - DISABLED ACTIVE 204800 SELECT - fsgen
pl vol02-01 vol02 DISABLED RECOVER 205200 CONCAT - RW
sd datadg02-02 vol02-01 datadg02 205200 205200 0 disk0_2 ENA
Disk Failure: Volume States During Recovery
vxprint -g datadg -ht
DG NAME NCONFIG NLOG MINORS GROUP-ID
DM NAME DEVICE TYPE PRIVLEN PUBLEN STATE
RV NAME RLINK_CNT KSTATE STATE PRIMARY DATAVOLS SRL
RL NAME RVG KSTATE STATE REM_HOST REM_DG REM_RLNK
V NAME RVG KSTATE STATE LENGTH READPOL PREFPLEX UTYPE
PL NAME VOLUME KSTATE STATE LENGTH LAYOUT NCOL/WID MODE
SD NAME PLEX DISK DISKOFFS LENGTH [COL/]OFF DEVICE MODE
SV NAME PLEX VOLNAME NVOLLAYR LENGTH [COL/]OFF AM/NM MODE
. . .
dg datadg default default 64000 954250803.2005.train06
dm datadg01 disk0_1 auto:cdsdisk 1519 4152640 -
dm datadg02 disk0_2 auto:cdsdisk 1519 4152640 -
v vol01 - ENABLED ACTIVE 204800 SELECT - fsgen
pl vol01-01 vol01 ENABLED ACTIVE 205200 CONCAT - RW
sd datadg01-01 vol01-01 datadg01 0 205200 0 disk0_1 ENA
pl vol01-02 vol01 ENABLED STALE 205200 CONCAT - WO
sd datadg02-01 vol01-02 datadg02 0 205200 0 disk0_2 ENA
v vol02 - DISABLED ACTIVE 204800 SELECT - fsgen
pl vol02-01 vol02 DISABLED RECOVER 205200 CONCAT - RW
sd datadg02-02 vol02-01 datadg02 205200 205200 0 disk0_2 ENA
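The plex STATE column is what changes across the three listings (NODEVICE, then IOFAIL and RECOVER, then STALE and RECOVER). A minimal sketch that pulls out every plex that is not ACTIVE, assuming the `vxprint -ht` record layout shown above; the function name unhealthy_plexes is hypothetical.

```sh
# Print "volume plex state" for every plex record whose STATE is not ACTIVE.
# Reads `vxprint -g diskgroup -ht` output on stdin. Header lines start with
# upper-case "PL", so only real plex records (lower-case "pl") match.
unhealthy_plexes() {
    awk '$1 == "pl" && $5 != "ACTIVE" { print $3, $2, $5 }'
}

# Typical use on a host with VxVM installed:
#   vxprint -g datadg -ht | unhealthy_plexes
```

Applied to the listing taken during recovery, this reports vol01-02 as STALE and vol02-01 as RECOVER.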
Intermittent Disk Failures
• VxVM can mark a disk as failing if the disk is experiencing I/O failures but is still accessible.
vxdisk list
DEVICE TYPE DISK GROUP STATUS

disk0_1 auto:cdsdisk datadg01 datadg online failing
disk0_2 auto:cdsdisk datadg02 datadg online

• Disks marked as failing are not used for any new volume space.
• To resolve intermittent disk failure problems:
  – If any volumes on the failing disk are not redundant, attempt to mirror those volumes:
    • If you can mirror the volumes, continue with the procedure for redundant volumes.
    • If you cannot mirror the volume, prepare for backup and restore.
  – If the volume is mirrored:
    • Prevent read I/O from accessing the failing disk by changing the volume read policy.
    • Remove the failing disk.
    • Replace the disk.
    • Set the volume read policy back to the original policy.
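The two read-policy steps for a mirrored volume can be sketched as follows. The slide does not show the command, so the `vxvol rdpol` usage here is an assumption based on the vxvol manual page (policies prefer, round, and select); check it against the installed version. The helper names are hypothetical. With DRY_RUN=1 the commands are printed instead of executed.

```sh
# Execute a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# prefer_plex DISKGROUP VOLUME PLEX
# Direct reads to a plex that does not live on the failing disk.
prefer_plex() {
    run vxvol -g "$1" rdpol prefer "$2" "$3"
}

# restore_rdpol DISKGROUP VOLUME POLICY
# Put back the original policy (round or select) after the disk is replaced.
restore_rdpol() {
    case $3 in
        round|select) run vxvol -g "$1" rdpol "$3" "$2" ;;
        *) echo "policy must be round or select" >&2; return 2 ;;
    esac
}
```

In the listing above datadg01 is the failing disk, so for a volume mirrored like vol01 the healthy plex would be vol01-02: `prefer_plex datadg vol01 vol01-02`, then, once the disk has been replaced, `restore_rdpol datadg vol01 select`.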
Forced Removal
To forcibly remove a disk and not evacuate the data:
1. Use the vxdiskadm option, “Remove a disk for
replacement.” VxVM handles the drive as if it has
already failed.
2. Use the vxdiskadm option, “Replace a failed or
removed disk.”
Using the command line:
vxdg -k -g diskgroup rmdisk disk_name
vxdisksetup -i new_device_name
vxdg -k -g diskgroup adddisk disk_name=new_device_name
Topic 4: Managing Hot Relocation at the Host Level
After completing this topic, you will be able to manage hot relocation at the host level.
What Is Hot Relocation?
Hot Relocation: The system automatically reacts to I/O failures on redundant VxVM objects and restores redundancy to those objects by relocating affected subdisks.
Subdisks are relocated to disks designated as spare disks or to free space in the disk group.
[Diagram labels: VM Disks, Spare Disks]
Hot-Relocation Process

1. vxrelocd detects disk failure.
2. Administrator is notified by e-mail.
3. Subdisks are relocated to a spare.
4. Volume recovery is attempted.
[Diagram labels: Volumes, VM Disks, Spare Disks, Administrator]
How Is Space Selected?
• Hot relocation attempts to move all subdisks from a failing drive to a single spare destination disk.
• If no disks have been designated as spares, VxVM uses any available free space in the disk group in which the failure occurs.
• If there is not enough spare disk space, a combination of spare disk space and free space is used.
• Free space that you exclude from hot relocation is not used.
Managing Spare Disks
VEA:
Actions > Set Disk Usage
vxdiskadm:
• “Mark a disk as a spare for a disk group”
• “Turn off the spare flag on a disk”
• “Exclude a disk from hot-relocation use”
• “Make a disk available for hot-relocation use”
CLI:
To designate a disk as a spare:
vxedit -g diskgroup set spare=on|off disk_name
To exclude/include a disk for hot relocation:
vxedit -g diskgroup set nohotuse=on|off disk_name
To force hot relocation to only use spare disks:
Add spare=only to /etc/default/vxassist
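The two vxedit attributes above point in opposite directions: spare=on opts a disk in, while nohotuse=on opts it out. A minimal sketch that wraps both behind on/off arguments; the helper names set_spare and set_hot_use are hypothetical. With DRY_RUN=1 the commands are printed instead of executed.

```sh
# Execute a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# set_spare DISKGROUP DISK on|off
set_spare() {
    case $3 in
        on|off) run vxedit -g "$1" set "spare=$3" "$2" ;;
        *) echo "state must be on or off" >&2; return 2 ;;
    esac
}

# set_hot_use DISKGROUP DISK on|off
# "on" allows hot relocation to use the disk, which means nohotuse=off.
set_hot_use() {
    case $3 in
        on)  run vxedit -g "$1" set nohotuse=off "$2" ;;
        off) run vxedit -g "$1" set nohotuse=on "$2" ;;
        *) echo "state must be on or off" >&2; return 2 ;;
    esac
}
```

For example, `set_spare acctdg disk01 on` marks disk01 in disk group acctdg as a spare.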
Lesson Summary

• Key Points
  This lesson described how to interpret failures in hardware, recover disabled disk groups, resolve disk failures, and manage hot relocation at the host level.
• Reference Materials
  – VERITAS Volume Manager Administrator’s Guide
  – VERITAS Storage Foundation Release Notes
Lab 7: Resolving Hardware Problems
In this lab, you practice recovering from a variety of hardware failure scenarios that result in disabled disk groups and failed disks. First you recover a temporarily disabled disk group, and then you use a set of interactive lab scripts to investigate and practice recovery techniques.

For Lab Exercises, see Appendix A.
For Lab Solutions, see Appendix B.
What Did You Learn?

The following questions review the content of this lesson. Read and try to answer each question before checking the correct answer given after it.
Volume Manager defines a disk as FAILED if:
A. There are uncorrectable I/O failures on the public region of the drive, but VxVM can still access the private region.
B. VxVM cannot access the private region or the public region.
C. VxVM can access the public region of the drive, but there are uncorrectable I/O failures on the private region of the drive.
D. There are failures on slice 2 of the disk.
The correct answer is B.
Permanent disk failures:
A. Are failures in which the data on the drive can no longer be accessed for any reason
B. Are disk devices that have failures that are repaired some time later
C. Are hardware failures localized to a part of the disk
D. Are failures that occur off and on and that involve problems that cannot be consistently reproduced
The correct answer is A.
To recover from the permanent failure, the first step after physically replacing the disk is to:
A. Recover the redundant volumes.
B. Start any nonredundant volumes.
C. Initialize the new drive.
D. Attach the disk media name to the new drive.
The correct answer is C.
Select the command that causes VxVM to immediately recognize a newly added replacement disk that is recognized by the operating system.
A. vxdctl enable
B. vxiod -f set 0
C. vxscandisks
D. vxconfigd restart
The correct answer is A.
After a detached disk is reattached, which plex state indicates that VERITAS Volume Manager believes that the data in that plex needs to be recovered?
A. STALE
B. IOFAIL
C. DISABLED
D. RECOVER
The correct answer is D.
When relocating subdisks, where is the first location that VxVM attempts to select a destination disk?
A. Scatter the subdisks to different disks
B. The same controller, target, and device as the failed drive
C. The same controller and target, but to a different device
D. A different controller
The correct answer is B.
Select the appropriate command to set up a disk disk01 as a spare. The disk is in the disk group acctdg.
A. vxedit -g acctdg set spare=on disk01
B. vxassist -g acctdg set spare=on disk01
C. vxedit -g acctdg set hotuse=on disk01
D. vxedit -g acctdg set hotrel=on disk01
The correct answer is A.
Select the appropriate command to get VxVM to recognize that a failed disk is now working again.
A. vxdctl
B. vxdisk list
C. vxdisk adddisk
D. vxdctl enable
The correct answer is D.
The command that attempts to find the name of a drive in the private region and match it to a disk media record that is missing a disk access record is _________.
A. vxreattach
B. vxunreloc
C. vxattach
D. vxrecover
The correct answer is A.
End of Presentation
