
Veritas Cluster --Tech Tips Document

Some Important VCS Problems and Their Solutions (Solaris)

TECHNOLOGY INFRASTRUCTURE SERVICES

Author: K.Kandasamy
Team: TIS - Unix Team
Date of Creation: 13th December 2007
Reviewed by: Ramachandra Pargaonkar
Email-id: Kandasamy.kumaravel@wipro.com


1) How to change the name of a cluster?


Details:
Use these commands to change the name of a cluster:
# haconf -makerw
# haclus -modify ClusterName [New_ClusterName]
# haconf -dump -makero

Note: For RAC environments, the cluster name (similar to the hostid) is stamped onto the
private region of the disks. Therefore, if you change the name of the cluster, also update the
cluster name on the disks by following these steps:
1. Confirm you are on the Master Node:
# vxdctl -c mode
2. Update the cluster name stamped on the disks:
# vxdg deport [disk_group]
# vxdg -Cs import [disk_group]
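To confirm the change, the new name can be read back from the running configuration, for example:
# haclus -value ClusterName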
===============================================================
2) VCS CRITICAL V-16-1-10029 VxFEN driver not configured. VCS Stopping.
Manually restart VCS after configuring fencing
After rebooting the nodes in a cluster, Veritas Cluster Server (VCS) fails to start and the
following messages are seen in the /var/VRTSvcs/log/engine_A.log file:
2007/10/22 05:16:47 VCS CRITICAL V-16-1-10037 VxFEN driver not configured.
Retrying...
2007/10/22 05:17:02 VCS CRITICAL V-16-1-10037 VxFEN driver not configured.
Retrying...
2007/10/22 05:17:17 VCS CRITICAL V-16-1-10037 VxFEN driver not configured.
Retrying...
2007/10/22 05:17:33 VCS CRITICAL V-16-1-10037 VxFEN driver not configured.
Retrying...
2007/10/22 05:17:48 VCS CRITICAL V-16-1-10037 VxFEN driver not configured.
Retrying...
2007/10/22 05:18:03 VCS CRITICAL V-16-1-10037 VxFEN driver not configured.
Retrying...
2007/10/22 05:18:18 VCS CRITICAL V-16-1-10029 VxFEN driver not configured. VCS
Stopping. Manually restart VCS after configuring fencing
Attempting to start I/O Fencing manually results in the following error:
# ./S97vxfen start

Starting vxfen..
Starting vxfen.. Done
[/etc/rc2.d]# VCS FEN vxfenconfig NOTICE Driver will use SCSI-3 compliant disks.
VCS FEN vxfenconfig ERROR V-11-2-1016 There exists the potential for a preexisting
split-brain
The coordinator disks list no nodes which
are in the current membership. However, they
also list nodes which are not in the
current membership.
I/O Fencing Disabled!
This is a clear indication there are pre-existing keys left on the coordinator disks.
Resolution:
Clear these keys to start I/O fencing, using these commands:
# /opt/VRTSvcs/vxfen/bin/vxfenclearpre
# reboot
Note: Reboot all nodes in the cluster
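Before running vxfenclearpre, the registration keys currently on the coordinator disks can be inspected to confirm the diagnosis. A hedged example follows; the option letter varies by release (-g on the 4.x/5.0 releases this document covers):
# vxfenadm -g all -f /etc/vxfentab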
3) Import failed: License has expired or is not available for operation
Exact Error Message:
Import failed: License has expired or is not available for operation
This error message is generated when failing over a disk group from one server to
another in Veritas Cluster Server (VCS) 5.0.
Workaround:
1. Check that the license for Veritas Volume Manager (VxVM) and VCS is valid and not
expired.
2. Verify main.cf for the disk group:
DiskGroup test (
    DiskGroup = oradg_u01
    DiskGroupType = SAN
    )
Note: According to the Veritas Cluster Server Bundled Agent's Guide for VCS 5.0, a SAN
disk group is only supported in the Storage Foundation Volume Set (SFVS) environment and
therefore needs a different license.
3. Change DiskGroupType to private with these commands:
# haconf -makerw
# hares -modify test DiskGroupType private
# haconf -dump -makero

The disk group now can fail over to another server.
===============================================================
4) Error when starting cluster with hastart
Details:
An error occurs when starting the cluster with hastart. When the name of the system is changed,
the sysname file might not be changed along with it. The cluster software checks the names in
the configuration files to see that they are consistent.
Check that the sysname file contains the name of the system used in the other cluster
configuration files. The location of the configuration file is:
/etc/VRTSvcs/conf
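For example, the following commands can be used to compare the names (assuming the default configuration paths):
# cat /etc/VRTSvcs/conf/sysname
# uname -n
# grep '^system' /etc/VRTSvcs/conf/config/main.cf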
===============================================================
5) How to change a heartbeat on the fly?
Details:
A heartbeat sometimes needs to be changed during a production timeframe. Low Latency
Transport (LLT) reads the /etc/llttab file when it starts; this loads the proper devices for LLT to
monitor. It is also possible to change the devices that LLT monitors while it is active.
To add a new high-priority link while LLT is active, use the following command:
lltconfig -t <alias> -d <device> -b ether
To see the results instantly, run:
lltstat -vvn
The new device will show in the output.
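For example, to add a second high-priority heartbeat over a spare qfe interface (the tag and device names below are illustrative only):
# lltconfig -t qfe3 -d /dev/qfe:3 -b ether
# lltstat -vvn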
6) Getting an error while trying to switch the service group
Exact Error Message
# hagrp -switch asdoraSG -to gsun908
VCS WARNING V-16-1-10484 Group dependency is violated if group asdoraSG goes offline
on system gsun909
Details:
Cause:
In an online global dependency, a child service group must be online on a system in the
cluster before the parent service group can come online.
The child service group cannot be taken offline while the parent service group is online;
however, the parent service group can be taken offline while the child service group is online.
Solution:

1. Determine the child and parent service groups.


e.g.
# hagrp -dep asdoraSG
#Parent Child Relationship
asddmSG asdoraSG online global firm
2. Take the parent service group offline, switch the child service group to another system, and
then bring the parent service group online again.
e.g.
# hagrp -offline asddmSG -sys gsun909
# hagrp -switch asdoraSG -to gsun908
# hagrp -online asddmSG -sys gsun909
===============================================================
7) A disk group under Veritas Cluster Server (VCS) control cannot be deported.
Details:
The DiskGroup resource had the attributes StartVolumes = 0 and StopVolumes = 0.
With these attributes, the DiskGroup agent uses:
vxdg flush <diskgroup>
vxdg deport <diskgroup>
Manually running vxdg flush <diskgroup> never completed.
Resolution:
Remove and reinstall Storage Foundation.
8) How to clear a faulted disk group agent?
Details:
The output from the command hastatus -sum shows the disk group agent faulted on one
system.
Workaround:
Clear the fault by stopping and restarting the agent:
# haagent -stop DiskGroup -sys <system>
# haagent -start DiskGroup -sys <system>
===============================================================
9) A Mount resource faults, but the file system mounts successfully when not under Veritas
Cluster Server (VCS) control
Details:
This issue is caused by a syntax error in the main.cf file:
Mount xxx-xxx-Mount (
    Critical = 0
    MountPoint = "/xxx"
    BlockDevice = "dev/vx/dsk/xxxx-dg/xxx"
    FSType = vxfs
    MountOpt = rw
    FsckOpt = "-y"
    )
Note: There is no leading "/" before "dev" on the BlockDevice line.
The BlockDevice line should read:
BlockDevice = "/dev/vx/dsk/xxxx-dg/xxx"
Workaround:
Run the command hacf -verify to verify the syntax of the main.cf file.
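For example (assuming the default configuration directory):
# cd /etc/VRTSvcs/conf/config
# hacf -verify .
The command returns silently if the syntax is correct; otherwise it reports the offending line in main.cf.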
10) The Veritas Cluster Server (VCS) utility 'hastatus -sum' shows that a node is stuck in
REMOTE_BUILD status
Details:
The issue: After upgrading to network cards that require a new Solaris network driver
(e1000g), a node joining the VCS cluster is stuck in a REMOTE_BUILD state.
Symptoms:
/opt/VRTSvcs/bin/hastatus -sum reports that a node is in a REMOTE_BUILD state.
Conditions:
The Maximum Transmission Unit (MTU) size on the new driver/card is set greater than the
corresponding MTU value on the network switch.
Cause:
The MTU size is too high.
Workaround:
1. Change the MTU of the new driver/card to a value less than or equal to the MTU value on the
network switch.
2. Restart LLT (a possible sequence is sketched below).
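A possible sequence on the affected node; e1000g0 is an example interface name, and VCS and GAB must be stopped before LLT can be unconfigured:
# ifconfig e1000g0          (confirm the current MTU)
# hastop -local
# gabconfig -U
# lltconfig -U
# lltconfig -c              (reconfigure LLT after correcting the MTU)
# gabconfig -c -n <number_of_cluster_nodes>
# hastart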
===============================================================
11) How to add the include line to the main.cf file via the command line?
Details:
Run the following command to create the main.cmd file:
hacf -typetocmd <types.cf>
The main.cmd file is a file of commands that is needed to add the include line to the main.cf
file. Each command in the main.cmd needs to be run to add the include line to the main.cf
file. This can be a lengthy process.
===============================================================
12) How to turn off the notifier?
Details:
In order to disable the notifier, the following can be run from any node in the cluster:
haconf -makerw
hares -modify ntfr Enabled 0
hares -modify ntfr Critical 0
haconf -dump -makero
After the notifier is disabled, the service group will always show as partially online; it will
never show as fully online. To re-enable the notifier, repeat the steps above, replacing the 0
with a 1 in the two hares -modify commands.
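For example, to re-enable the notifier (assuming the notifier resource is named ntfr, as above):
# haconf -makerw
# hares -modify ntfr Enabled 1
# hares -modify ntfr Critical 1
# haconf -dump -makero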
13) The largefiles option does not work

Details:
The issue:
An error message in /var/VRTSvcs/log/engine_A.log states that the mount option is
incompatible with the file system.
Change:
The largefiles option was added to the MountOpt attribute for the mount resource.
Resolution:
The largefiles option must be enabled for the file system at the operating system level before
it can be configured for largefiles within Veritas Cluster Server (VCS).
Enable largefiles with this command:
/usr/lib/fs/vxfs/fsadm -o largefiles <mount>
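For example, to check whether the flag is already set and then enable it (/oradata is an illustrative mount point):
# /usr/lib/fs/vxfs/fsadm /oradata
# /usr/lib/fs/vxfs/fsadm -o largefiles /oradata
The first command reports largefiles or nolargefiles for the mounted file system.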
14) Kernel message: Dazed and confused, but trying to continue
Details:
Symptoms:
System panic with error messages on boot:
kernel: LLT INFO V-14-1-10009 LLT Protocol available
kernel: device eth1 entered promiscuous mode
kernel: tg3: eth1: Link is up at 1000 Mbps, full duplex.
kernel: tg3: eth1: Flow control is on for TX and on for RX.
kernel: device eth1 left promiscuous mode
kernel: Uhhuh. NMI received for unknown reason 35 on CPU 0.
kernel: Dazed and confused, but trying to continue
kernel: Do you have a strange power saving mode enabled?
Cause:
Defective hardware; the fault surfaced when the LLT module was loaded.
Resolution:
Replace CPU 0.
===============================================================
15) Multi Network Interface Card B (MultiNICB) resource demands high CPU time
with Solaris IP Multipathing (IPMP)
Details
Solaris IP Multipathing (IPMP) is in use and the MultiNICB resource is configured with the
UseMpathd attribute enabled (set to 1).
Cause
During every monitor cycle, the MultiNICB agent checks the system process table for the
IPMP daemon process, in.mpathd.
If several MultiNICB resources are configured in the cluster, the agent checks for the IPMP
daemon many times every minute, resulting in a higher CPU demand.
Workaround
To decrease CPU demand:
1. Increase the MonitorInterval attribute for the MultiNICB resource from the default 10 seconds
to 30 seconds.
2. In cluster configurations that have more than three MultiNICB resources, change the
NumThreads attribute from the default of 10 to 1 or 2 (see the example below).
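A sketch of how these attributes could be changed; both are type-level attributes, so hatype is used, and the values follow the workaround above:
# haconf -makerw
# hatype -modify MultiNICB MonitorInterval 30
# hatype -modify MultiNICB NumThreads 2
# haconf -dump -makero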

Enhancement request e426856 addresses this issue in a future product or patch release.

16) 'Stale NFS handle' errors seen by NFS clients accessing exported file systems


Exact Error Message
Stale NFS handle
Details:
On SLES 10 SP1 systems, whenever an NFS resource is faulted and fails over to the second
node, the clients, on accessing the exported file system, may see 'Stale NFS handle' errors.
This has been resolved by mounting the special file system nfsd before starting nfs daemons.
Resolution
Download the 4.1MP4+e1023246.tar_288679.gz file from the Download Now link
(ftp.support.veritas.com/pub/support/products) and then unpack the file:
# mv 4.1MP4+e1023246.tar_288679.gz 4.1MP4+e1023246.tar.gz
# cksum 4.1MP4+e1023246.tar.gz
608184018 3487 4.1MP4+e1023246.tar.gz
# gzip -d 4.1MP4+e1023246.tar.gz
# tar -xf 4.1MP4+e1023246.tar
# cd 4.1MP4+e1023246
# cat README
POINT PATCH FOR VERITAS CLUSTER SERVER 4.1 MP4
==============================================
NAME:                     4.1MP4+e1023246
DATE:                     2007-May-14
VCS RELEASE:              4.1MP4
LINUX RELEASE:            SLES 10 SP1
RELEVANT ARCHITECTURES:   i586, x86_64, and ia64
ETRACK REFERENCE:         1023246

PROBLEM DESCRIPTION:
'Stale NFS handle' errors seen on accessing files exported by NFS clients
when a service group configured with NFS fails over

PATCH CONTAINS
--------------
|__ online
|__ README
PATCH INSTALLATION INSTRUCTIONS:
--------------------------------
Install this patch after installing Veritas Cluster Server 4.1 MP4, following these steps.
The default value of $VCS_HOME is /opt/VRTSvcs.
1. Log in as superuser to the system where the point patch is to be installed.
2. Go to the directory $VCS_HOME/bin/NFS:
# cd $VCS_HOME/bin/NFS
3. Copy online as online.orig on all nodes of the cluster:
# cp online online.orig
4. On each node of the cluster, copy the "online" script from this patch to
$VCS_HOME/bin/NFS/online:
# cp /PointPatchDir/online ./online
===============================================================
17) Veritas Cluster Server (VCS) I/O Fencing parameters for racing (Solaris)
Details:
When communication between cluster nodes fails, causing the cluster to be divided into
sub-clusters, these sub-clusters start a race to grab the coordinator disks for data protection
(VCS I/O Fencing). vxfen has a mechanism that enables cluster administrators to give larger
sub-clusters better odds of winning this race. This document describes the differences in
implementation between VCS versions and their tunable parameters.
Note: While this mechanism can be used to give larger sub-clusters much better odds of winning
the race, it cannot be used to guarantee that the larger sub-cluster will always win.
1. How the odds are given
Prior to 4.1 MP2
If the number of nodes in a sub-cluster is less than the number of nodes leaving the original
cluster, the sub-cluster repeats reading the coordinator disks to delay the start of the race. By
default, the number of reads is calculated as the cube of the number of leaving nodes. For
example, if a 5-node cluster is divided into a 3-node and a 2-node cluster, the 2-node
sub-cluster repeats reading the coordinator disks 27 (= 3 cubed) times. A tunable parameter,
max_read_coord_disk, can be used to change this value, as described later.
4.1 MP2 and 5.0 or later

If the number of nodes in a sub-cluster is less than the number of nodes leaving the
original cluster, the sub-cluster waits for a number of seconds before joining the race. This
wait time is calculated as (the number of leaving nodes) x 5. For example, if a 5-node cluster
is divided into a 3-node and a 2-node cluster, the 2-node sub-cluster will wait for 15 (3 times
5) seconds. Tunable parameters min_delay and max_delay (4.1MP2), or vxfen_min_delay and
vxfen_max_delay (5.0 or later), can be employed to change this wait time, as described later.
2. Tunable parameters
The following parameters can be specified in the file /kernel/drv/vxfen.conf (an illustrative
example appears at the end of this section). Use these parameters only in situations where you
often see a larger sub-cluster losing the vxfen race. Note that careful and ample testing is
required to determine the optimal values for a specific environment.
Prior to 4.1 MP2
max_read_coord_disk: The maximum number of times vxfen will loop reading coordinator
disks. If the calculated repeat count exceeds this limit, this value will be used instead.
Default = 25
Min = 1
Max = 1000
4.1MP2
min_delay: The lower limit of the wait time in seconds. If the calculated wait time is below
this limit, this value will be used instead.
Default = 1
Min = 1
Max = N/A
max_delay: The upper limit of the wait time in seconds. If the calculated wait time exceeds
this limit, this value will be used instead.
Default = 600
Min = N/A
Max = 600
Limitations: The implementation in 4.1 MP2 is a subset of the 5.0 (or later) implementation and
has some limitations. vxfen in 4.1 MP2 will "round down" the min_delay and max_delay values
to a number that is a multiple of 5. For example, if the calculated delay time is 20 and the
max_delay specified is 18, the wait time value chosen will be 18; however, vxfen will only
wait for 15 seconds, ignoring the remaining 3 seconds. Therefore, to avoid confusion, it is
recommended that only numbers that are a multiple of 5 be specified.
Note: With 4.1 MP2, the default and minimum values implemented for min_delay are a bit
confusing, as they are both 1 and not a multiple of 5. Magic in the code prevents this value
from being rounded down to 0, so specifying 1 here - or using the default value of 1 - is safe
and will not be a problem.
VCS 5.0 or later allows more granularity, so the delay can be specified in any number of
seconds within the minimum and maximum boundaries.
5.0 or later
vxfen_min_delay: The lower limit of the wait time in seconds. If the calculated wait time is
below this, this value will be used instead.

Default = 1
Min = 1
Max = 600
vxfen_max_delay: The upper limit of the wait time in seconds. If the calculated wait time
exceeds this, this value will be used instead.
Default = 60
Min = the value of vxfen_min_delay
Max = 600
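As an illustration only, such a tunable could be added to /kernel/drv/vxfen.conf using standard Solaris driver.conf property syntax; the exact contents of the shipped file vary by release, and the values shown here are placeholders:
name="vxfen" parent="pseudo" instance=0 vxfen_min_delay=1 vxfen_max_delay=120;
A reboot (or unloading and reloading the vxfen driver) is typically required for the change to take effect.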
18) On Veritas Cluster Server 4.1/4.1MP1 and Solaris 10 with an encapsulated root disk, the
haremajor command prevents the system from starting up due to missing devices under the
/dev/vx/dsk directory
Details:
Example:
# haremajor -vx 320 321
LLT INFO V-14-1-10009 LLT Protocol available
GAB INFO V-15-1-20021 GAB available
haremajor 1.1
Using the following major number(s):
320
321
Do you want to continue [y/n]? y
Updating /etc/name_to_major
Backing up swapvol
Generating the new device: swapvol 320 62003
Backing up swapvol
Generating the new device: swapvol 320 62003
Backing up rootvol
Generating the new device: rootvol 320 0
Backing up rootvol
Generating the new device: rootvol 320 0
Backing up var
Generating the new device: var 320 62000
Backing up var
Generating the new device: var 320 62000
Backing up home
Generating the new device: home 320 62002
Backing up home
Generating the new device: home 320 62002
Backing up opt
Generating the new device: opt 320 62001
Backing up opt
Generating the new device: opt 320 62001
If there are any problems, you can backout the changes by restoring the
following files:
- /etc/name_to_major.off.363
- /dev/vx/dsk/bootdg/off.swapvol
- /dev/vx/rdsk/bootdg/off.swapvol
- /dev/vx/dsk/bootdg/off.rootvol

- /dev/vx/rdsk/bootdg/off.rootvol
- /dev/vx/dsk/bootdg/off.var
- /dev/vx/rdsk/bootdg/off.var
- /dev/vx/dsk/bootdg/off.home
- /dev/vx/rdsk/bootdg/off.home
- /dev/vx/dsk/bootdg/off.opt
- /dev/vx/rdsk/bootdg/off.opt
To complete re-majoring, reboot your machine with the following command:
reboot
**********
Rebooting with command: boot -s
Boot device: /sbus@3,0/SUNW,fas@3,8800000/sd@2,0:a File and args: -s
SunOS Release 5.10 Version Generic_118833-18 64-bit Copyright 1983-2005
Sun Microsystems, Inc. All rights reserved.
Use is subject to license terms.
Booting to milestone "milestone/single-user:default".
Hostname: olds14.example.com
NOTICE: VxVM vxdmp V-5-0-34 added disk array DISKS, datype = Disk
The / file system (/dev/vx/rdsk/bootdg/rootvol) is being checked.
WARNING - Unable to repair the / filesystem.
Run fsck manually (fsck -F ufs /dev/vx/rdsk/bootdg/rootvol).
Jan 8 16:04:11 svc.startd[7]: svc:/system/filesystem/usr:default:Method "/lib/svc/method/fsusr" failed with exit sta.
[ system/filesystem/usr:default failed fatally (see 'svcs -x' for details) ]
Requesting System Maintenance Mode
Console login service(s) cannot run
Solution:
A hotfix is available to fix this issue. Please contact Symantec Technical Support to obtain this
hotfix.
===============================================================
