Revision History
0.1 Review
1.0 Initial publication (written and published by Hue Vu)
1.5 Revised to include the following configurations/additions, and distributed for review:
• Extended Cluster for RAC
• Continentalclusters with RAC
• Continentalclusters with single IP Subnet configuration
2.0 Updated to reflect feedback from review of version 1.5
• Document wording revised, per feedback
• Executive summary enhanced
• New section (General Requirements) added, to precede “Types of
Disaster Tolerant Clusters” section
• For readability, design considerations and implementation attributes have been moved to an Appendix
• Section 4 (comparison of 4 HP-UX solutions) table reformatted for
usability
• Metrocluster section updated to reflect policy to determine maximum
supported distance
3.0 Second publication of document
3.1 Updated to reflect enhancements to DTS, including:
• CFS Support
• Oracle 10g Discussion
• Virtual Server Environment (VSE) Support
• Support of SRDF asynchronous data replication with MC/SRDF
Executive Summary
In a Serviceguard cluster configuration, high availability is achieved by using redundant hardware to
eliminate single points of failure. This protects the cluster against hardware faults, such as a single
node failure. This architecture, which is typically implemented on one site in a single data center, is
sometimes called a local cluster. For some installations, the level of protection provided by a local
cluster is insufficient for the business. Consider an order-processing center where power outages are
common during harsh weather. Or consider the systems running the stock market, where multiple
system failures, for any reason, have a significant financial impact. For these types of installations,
and many more like them, it is important to guard not only against single points of failure, but against
multiple points of failure (MPOF), or against single massive failures that cause many components to
fail (such as the failure of a data center, an entire site, or a small area).
Creating clusters that are resistant to multiple points of failure or single massive failures requires a
different type of cluster architecture from the local cluster. This architecture is called a disaster tolerant
architecture – often referred to as a Disaster Tolerant Solution (DTS). This architecture provides you
with the ability to fail over automatically to another part of the cluster or manually to a different cluster
after certain disasters. Specifically, the disaster tolerant solution provides appropriate failover in the
case where an entire data center becomes unavailable. HP has a rich portfolio of disaster tolerant
cluster offerings, including Extended Campus Cluster[1], Metrocluster, and Continentalclusters. While
each of these solutions has its own characteristics, their common goal is to protect users from a site-
wide outage. To achieve this, the common feature they all implement is multiple data centers with
multiple copies of the user’s data. Effectively, if one data center fails, a second data center is
available to continue processing.
Both Metrocluster and Extended Campus Cluster solutions are single Serviceguard clusters, meaning
an application can automatically fail over from one data center to the other in the event of a failure.
Although similar in nature, these topologies have key differences that provide different levels of
disaster tolerance. For example, a key difference between these two topologies is the method of data
replication used. Metrocluster implements storage-based data replication with one of the following three storage subsystems:
– HP StorageWorks Continuous Access XP (aka Metrocluster/CAXP)
– EMC’s Symmetrix arrays (aka Metrocluster/SRDF)
– HP StorageWorks Continuous Access EVA (aka Metrocluster/CAEVA)
Extended Campus Cluster is a host-based data replication product.
While Extended Campus Cluster spans two data centers up to 100km apart, the distance between
Metrocluster sites is based on the cluster network and data replication link. In a Metrocluster
configuration, maximum distance is the shortest of the distances defined by:
• Cluster network – maximum distance cannot exceed roundtrip cluster heartbeat network latency
requirement of 200ms
• DWDM provider – distance cannot exceed the maximum as specified for the product supplied by
the DWDM provider
• Data replication link – maximum supported distance as stated by the storage partner
The third solution – Continentalclusters – is built on top of two individual Serviceguard clusters, and
uses semi-automatic failover to start up an application on its recovery cluster. When a site fails, the
[1] Extended Campus Cluster is also known as “CampusCluster” and “Extended Distance Cluster”. Throughout this document, this configuration will be referred to as “Extended Campus Cluster”.
user is notified, and must initiate a “recovery” process on the secondary site for the affected
applications to be brought up. Continentalclusters has no distance limitation (i.e., it may span very
short to very long distances, implementing both LAN and WAN technologies).
Continentalclusters also supports a configuration with three data centers. In the three data center
configuration, the first two data centers implement Metrocluster. The third data center is a traditional
single data center Serviceguard cluster. This configuration is suited for environments that may already have two data centers implemented but, for business reasons, require a third data center.
Deployment of this configuration is rare. Typically, Continentalclusters is implemented with two data
centers (i.e., two single data center Serviceguard clusters), with semi-automatic failover between data
centers.
At first glance, the solutions appear to be interchangeable. The key to selecting the
appropriate fit for a customer’s environment is often driven by the customer’s Recovery Time and
Recovery Point Objectives (referred to as RTO and RPO). Customers requiring the least amount of
downtime will require a solution that tightly integrates data currency with application availability. The
best solution for this customer is one that offers automatic failover of the application. On the other
hand, customers who want control over application failover would prefer a solution that allows the
user to decide when an application starts at the recovery site. Please refer to “Section 9:
Recommendations” for guidelines on selecting and recommending a disaster tolerant solution.
Section 1: Introduction
Many decisions have to be made when designing a disaster tolerant solution. These decisions can have a tremendous impact on the availability of the solution, the consistency of the data, and the overall cost of the solution. This paper discusses the overall disaster tolerant architecture and its general requirements, describes the solutions that HP currently offers for HP-UX and the differences between them, and offers high-level design guidelines. Architectures discussed include:
• Extended Campus Cluster[2]
• Extended distance support for Oracle Real Application Clusters (RAC)
– In an active/active configuration (Extended Cluster with RAC)
– In an active/standby configuration (Continentalclusters with RAC)
• Metrocluster
• Continentalclusters
Target Audience
This paper is only available internally to HP personnel. It is intended for use by HP’s pre-sales force to
aid in providing recommendations to customers on disaster tolerant solutions.
Purpose of Document
The purpose of this document is two-fold:
• Discuss and compare disaster tolerant cluster solutions that HP currently offers for HP-UX
• Provide recommendations on positioning our products relative to each other, enabling HP Field
personnel to help customers determine the best disaster tolerant solution for their environments
As this document specifically discusses HP-UX solutions, it does not address implementations on
platforms other than HP-UX.
[2] Extended Campus Cluster is also known as “CampusCluster” and “Extended Distance Cluster”. Throughout this document, this configuration will be referred to as “Extended Campus Cluster”.
Section 2: What is a Disaster Tolerant Architecture?
In a conventional Serviceguard cluster configuration, all components are in a single data center. This
is referred to as a local cluster. High availability is achieved by using redundant hardware to
guard against single points of failure, such as the node failure depicted in Figure 1.
[Figure 1. A local cluster: a single data center with a data LAN + heartbeat network, failover between nodes, and shared SAN storage.]
However, for many types of installations, it is important to guard not only against single points of
failure, but against multiple points of failure (MPOF), or against single massive failures that
cause many components to fail, such as the failure of a data center, of an entire site, or of a small
area. A data center, in the context of disaster recovery, is a physically proximate collection of
servers, storage, networking, and power that can be used to run one or more business applications, usually
all in one room. Creating clusters that are resistant to multiple points of failure or single massive
failures requires a different type of cluster architecture called a disaster tolerant architecture.
This architecture provides you with the ability to fail over automatically to another part of the cluster or
manually to a different cluster after certain disasters. Specifically, the disaster tolerant cluster provides
appropriate failover in the case where an entire data center becomes unavailable, as in the sample
configuration in Figure 2.
[Figure 2. A disaster tolerant cluster spanning two data centers, each with its own data LAN + heartbeat network and SAN, with failover between the data centers.]
Depending on the type of disaster you are protecting against and the available technology, the
nodes can be as close as partitions within a single node, nodes in another room in the same building,
or as far away as another continent. Whatever the distance, the goal of a disaster tolerant
architecture is to survive the loss of a data center that contains critical resources to run a business
application. Putting clustered nodes further apart increases the likelihood that alternate nodes will be
available for failover in the event of a disaster. The most significant losses during a disaster are the
loss of access to data, and the loss of data itself. You protect against this loss through data
replication (i.e., creating extra copies of the data); a toy model illustrating these properties follows the list below. Data replication should:
• Ensure data consistency by replicating data in a logical order so that it is immediately usable or
recoverable. Inconsistent data is unusable and is not recoverable for processing. Consistent data
may or may not be current.
• Ensure data currency by replicating data quickly so that a replica of the data can be recovered
to include all committed disk writes that were applied to the local disks.
• Ensure data recoverability so that there are some actions that can be taken to make the data
consistent, such as applying logs or rolling a database.
• Minimize data loss by configuring data replication to address consistency, currency, and
recoverability.
Section 4: Cluster File System (CFS) Support
Traditionally, the only storage management options in Serviceguard (SG) environments have been either the Logical Volume Manager (LVM) or the VERITAS Volume Manager (VxVM) from Symantec. Similarly, the only
options available to SG Extension for RAC (SG/SGeRAC) were the Shared Logical Volume Manager
(SLVM) and the Symantec Cluster Volume Manager (CVM), where Oracle’s application software is
typically installed on a local file system. In December 2005, support was extended to include
Symantec’s Cluster File System (CFS) by both SG and SG/SGeRAC. With CFS, executables and data
alike can be managed by the file system (e.g., Oracle data files and Oracle binaries can both be put
in a CFS). CFS provides major enhancements such as improved manageability and improved
maintenance. For instance, with CFS, Oracle binaries are installed only once, and are visible to all
cluster nodes. A central location is available to store runtime logs, archive logs, etc. From a
maintenance perspective, software updates, patches, and changes have to be applied only once.
CFS support – which requires CVM 4.1 – is currently available for single data centers only. Support for CFS and CVM 4.1 with Extended Campus Cluster, Extended Cluster for RAC, and Continentalclusters is targeted for calendar year 2006; until that support is provided, CVM 4.1 is not supported in DTS configurations. Please note that the support of CFS
requires the HP Storage Management Suite which includes appropriate versions of both CFS and
CVM in the Management Suite. There are presently no plans to support CFS with Metrocluster.
Support of ASM by SG/SGeRAC (version A.11.17 and beyond) is available on HP-UX 11iv2 for
RAC databases only (i.e., there is no ASM support for Oracle single instance database with SG).
Additionally, SG/SGeRAC configurations using ASM must use raw logical volumes managed by
SLVM (i.e., ASM “sits on top of” SLVM). The primary reason SLVM is required is to leverage the
multipathing capabilities provided by SLVM so that ASM can be supported by SG/SGeRAC on HP-
UX 11iv2. There are presently no plans to support any Disaster Tolerant cluster configuration with
ASM.
Extended Distance SG/SGeRAC and Continentalclusters currently support the Oracle 10g RAC
database server in non-ASM, non-CFS configurations. Additionally, Metrocluster and
Continentalclusters support the Oracle 10g single instance database server in non-ASM, non-CFS
configurations. CFS support by Extended Distance SG/SGeRAC and Continentalclusters is targeted
for 2006.
More information on SG/SGeRAC integration with Oracle 10g may be found at the HA ATC
website: http://haweb.cup.hp.com/ATC/, and in the product user’s guides (i.e., Designing Disaster
Tolerant High Availability Clusters, 14th Edition, and Using Serviceguard Extension for RAC, 3rd Edition).
Section 6: DTS and HP’s Virtualization Strategy
DTS products support HP’s VSE strategy. Serviceguard is integrated with HP VSE products related to
partitioning, utility pricing, workload management and tools for managing the overall VSE
environment. The addition of DTS leverages this integration to extend support from a single data
center to multiple data centers. More information on the integration of DTS with VSE may be found in
the following document: http://www.hp.com/products1/unix/operating/docs/wlm.serviceguard.pdf. Additionally, a demo that implements Metrocluster in a VSE may be downloaded from the HA ATC website, http://haweb.cup.hp.com/ATC/; once on the website, the download is available on the “Demos” webpage.
Extended Campus Cluster

Extended Campus Cluster relies on the capabilities of Fibre Channel (FC) technology. It uses FC
switches and/or hubs, and Dense Wavelength Division Multiplexing (DWDM) to provide host-to-
storage connectivity across two data centers up to 100km apart.
In the Extended Campus Cluster architecture, each clustered server is directly connected to the storage
in both data centers. The following diagram depicts a 4-node Extended Campus Cluster using dual
cluster lock disks for arbitration. Cluster locks are discussed in Appendix A of this document.
Figure 3. Extended Campus Cluster with two data centers (dual cluster lock disks used for cluster arbitration); the data centers can be up to 100 km apart if DWDM is used.
Limitations of Extended Campus Cluster
• With MirrorDisk/UX, there is an increased I/O load for writes, since each write has to be done
twice by the host. If data resynchronization is required, based on the amount of data involved, this
can have a major performance impact on the host.
Extended Cluster for RAC

Extended Cluster for RAC merges Extended Campus Cluster with SGeRAC. One key difference
between the two configurations is the volume manager. While Extended Campus Cluster uses LVM
and VxVM, Extended Cluster for RAC implements SLVM and CVM 3.5. Additionally, CFS support is
targeted for (calendar year) 2006.
Benefits of Extended Cluster for RAC
• In addition to the benefits of Extended Campus Cluster, RAC runs in active/active mode in the
cluster, so that all resources in both data centers are utilized. The database and data are
synchronized and replicated across two data centers up to 100km apart. In the event of a site failure,
no failover is required, since the instance is already running at the remote site.
• Extended Cluster for RAC implements SLVM so that SGeRAC has a “built-in” mechanism for
determining the status of volume group extents in both data centers (i.e., the state of the volume
groups is kept in memory at the remote site), and SLVM will not operate on non-current data.
Metrocluster
Similar to Extended Campus Cluster, a Metrocluster is a normal Serviceguard cluster that has
clustered nodes and storage devices located in different data centers separated by some distance.
Applications run in an active/standby mode (i.e., application resources are only available to one
node at a time). The distinct characteristic of Metrocluster is its integration with array-based data
replication. Currently, Metrocluster implements three different solutions:
• Metrocluster/CAXP – HP StorageWorks Continuous Access XP
• Metrocluster/CAEVA – HP StorageWorks Continuous Access EVA
• Metrocluster/SRDF - EMC’s Symmetrix arrays
Each data center has a set of nodes connected to the storage local to that data center. Disk arrays in
the two data centers are physically connected to each other. Since the data replication/mirroring is
done by the storage subsystem, there is no need for a storage connection from a local server to the
disk array at the remote data center. Either arbitrator nodes, located in a third location, or a quorum
server is used for cluster arbitration.
The following diagram provides an example of Metrocluster/CAXP, configured with arbitrator nodes
at a location separate from either of the two data centers.
Figure 4. Metrocluster/CAXP with two data centers (1st site: Data Center 1; 2nd site: Data Center 2, linked by DWDM) and a 3rd site housing the arbitrator nodes.
The distance separating the data centers in a Metrocluster is based on the cluster network and data
replication link. In a Metrocluster configuration, maximum distance is the shortest of the distances
defined by:
– Cluster network – maximum distance cannot exceed roundtrip cluster heartbeat network latency
requirement of 200ms
– DWDM provider – distance cannot exceed the maximum as specified for the product supplied by
the DWDM provider
– Data replication link – maximum supported distance as stated by the storage partner
Since this is a single SG cluster, all cluster nodes have to be on the same IP subnet for cluster network
communication.
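A minimal sketch of the distance rule above, in Python (the numbers are hypothetical placeholders, not vendor figures): the supported inter-site distance is simply the shortest of the three stated limits.

def max_supported_distance_km(network_km, dwdm_km, replication_km):
    """The binding constraint is the shortest of the three stated distances."""
    return min(network_km, dwdm_km, replication_km)

# Example: here the data replication link is the binding constraint (80 km).
print(max_supported_distance_km(
    network_km=500,       # distance at which roundtrip heartbeat latency would hit 200 ms
    dwdm_km=100,          # DWDM product's stated maximum
    replication_km=80))   # storage partner's stated maximum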
Benefits of Metrocluster
• Metrocluster offers a more resilient solution than Extended Campus Cluster, as it provides full
integration between Serviceguard’s application package and the data replication subsystem. The
storage subsystem is queried to determine the state of the data on the arrays. Metrocluster knows
that application package data is replicated between two data centers. It takes advantage of this
knowledge to evaluate the status of the local and remote copies of the data, including whether the
local site holds the primary copy or the secondary copy of data, whether the local data is consistent
or not, and whether the local data is current or not. Depending on the result of this evaluation, it
decides if it is safe to start the application package, whether a resynchronization of data is needed
before the package can start, or whether manual intervention is required to determine the state of
the data before the application package is started. Metrocluster allows for customization of the
startup behavior for application packages depending on the customer's requirements, such as data
currency or application availability. By default, Metrocluster prioritizes data consistency and data currency over application availability; if, however, the customer chooses to prioritize availability over currency, Metrocluster can be configured to start up even when the state of the data cannot be determined to be fully current (but the data is consistent). A sketch of this startup decision logic follows this list.
• Users wishing to prioritize performance over data currency between the data centers have a choice
of Metrocluster CAXP or Metrocluster SRDF, as each supports both synchronous and asynchronous
replication modes.
• Because data replication and resynchronization are performed by the storage subsystem,
Metrocluster may provide significantly better performance than Extended Campus Cluster during
recovery. Unlike Extended Campus Cluster, Metrocluster does not require any additional CPU time,
which minimizes the impact on the host.
• There is little or no lag time writing to the replica, so the data remains very current.
• Data can be copied in both directions, so that if the primary site fails and the replica takes over,
data can be copied back to the primary site when it comes back up.
• Disk resynchronization is independent from CPU failure (i.e., if the hosts at the primary site fail but
the disk remains up, the disk knows it does not have to be resynchronized).
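The first benefit above describes a decision procedure. The following hedged sketch (Python; the states and names are hypothetical, and the real logic lives in HP's Metrocluster integration modules, not shown here) models how package startup could be decided from the evaluated state of the local data copy:

from enum import Enum, auto

class Action(Enum):
    START = auto()
    RESYNC_THEN_START = auto()
    MANUAL_INTERVENTION = auto()

def startup_decision(consistent, current, prefer_availability=False):
    """Decide whether the package may start on the local copy of the data."""
    if not consistent:
        # Inconsistent data is unusable; a human must determine its state.
        return Action.MANUAL_INTERVENTION
    if current:
        return Action.START
    # Consistent but stale: the default policy favors currency.
    if prefer_availability:
        return Action.START            # customer accepts possibly stale data
    return Action.RESYNC_THEN_START    # make the data current before starting

print(startup_decision(True, False))                            # RESYNC_THEN_START
print(startup_decision(True, False, prefer_availability=True))  # START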
Limitations of Metrocluster
• Specialized storage hardware is required in a Metrocluster environment, meaning customers are
not allowed to choose their own storage component. Supported storage subsystems include HP
StorageWorks XP, HP StorageWorks EVA, and EMC Symmetrix with SRDF. In addition to
specialized storage, disk arrays from different vendors are incompatible (i.e., a pair of disk arrays
from the same vendor is required).
• There are no plans to support Oracle RAC (9i or 10g) in a Metrocluster configuration.
• There are no plans to support CFS in a Metrocluster configuration.
Continentalclusters
Continentalclusters provides an alternative disaster tolerant solution in which short to long
distances separate distinct Serviceguard clusters, with either a local area network (LAN) or a wide
area network (WAN) between the clusters. Unlike Metrocluster and Extended Campus Cluster, which have a single-cluster architecture, Continentalclusters uses multiple clusters to provide application
recovery. Applications run in the active/standby mode, with application data replicated between
data centers by either storage array-based data replication products (such as Continuous Access XP
or EMC's SRDF), or software-based data replication (such as Oracle 8i Standby DBMS and Oracle 9i
Data Guard).
Two types of connections are needed between the two Serviceguard clusters in this architecture: one
for the inter-cluster communication, and another for the data replication. Depending on the distance
between the two sites, either LAN (i.e., single IP subnet) or WAN connections may be used for cluster
network communication. For data replication, depending on the type of connection (ESCON or FC)
that is supported by the data replication software, the data can be replicated over DWDM, 100Base-
T and Gigabit Ethernet using Internet Protocol (IP), ATM, and T1 or T3/E3 leased lines or switched
lines. The Ethernet links and ATM can be implemented over multiple T1 or T3/E3 leased lines.
Continentalclusters provides the ability to monitor a Serviceguard cluster and fail over mission critical
applications to another cluster if the monitored cluster should become unavailable. In addition,
Continentalclusters supports mutual recovery, which allows for mission critical applications to run on
both clusters, with each cluster configured to recover the mission critical applications of the other. As
of March 2003, Continentalclusters supports SGeRAC in addition to Serviceguard. In an SGeRAC
configuration, Oracle RAC database instances are simultaneously accessible by nodes in the same
cluster, while across sites the database is accessible to only one site at a time. The Oracle database and data are replicated to the 2nd data center, and the RAC instances are configured for recoverability, so that the 2nd data center stands by, ready to begin processing in the event of a site failure at the 1st data center (i.e., across sites, this is an active/standby configuration).
NOTE: THE MOVEMENT OF AN APPLICATION FROM ONE CLUSTER TO ANOTHER CLUSTER DOES NOT REPLACE
LOCAL FAILOVER. APPLICATION PACKAGES SHOULD BE CONFIGURED TO FAIL OVER BETWEEN NODES (OR PARTITIONS) IN THE LOCAL CLUSTER.
The following diagram depicts a Continentalclusters configuration with two data centers.
[Figure: a Continentalclusters configuration with two data centers, each with its own data LAN + heartbeat, connected via IP routers across an IP network, with CNT Edge devices on the data replication link.]
Benefits of Continentalclusters (CC)
• Customers can build data centers virtually anywhere and still have the data centers provide disaster tolerance for each other. Since Continentalclusters uses two clusters, theoretically there is no limit to
the distance between the two clusters. The distance between the clusters is dictated by the required
rate of data replication to the remote site, level of data currency, and the quality of networking links
between the two data centers.
• Inter-cluster communication can be implemented with either WAN or LAN. LAN support is a great
advantage for customers who have data centers in proximity to each other, but for whatever
reason, do not want the data centers configured into a single cluster. One example may be a
customer who already has two SG clusters close to each other. For business reasons, the customer
cannot merge these two clusters into a single cluster, but is concerned about having one of the
centers become unavailable. Continentalclusters can be added to provide disaster tolerance.
• Customers can integrate Continentalclusters with any storage component of choice that is supported
by Serviceguard. Continentalclusters provides a structure to work with any type of data replication
mechanism. A set of guidelines for integrating customers’ chosen data replication scheme with
Continentalclusters is included in the “Designing Disaster Tolerant High Availability Clusters”
manual.
• Besides selecting their own storage and data replication solution, customers can also take
advantage of the following (HP) pre-integrated solutions
– Storage subsystems implemented by Metrocluster are also pre-integrated with Continentalclusters.
Continentalclusters uses the same data replication integration module that Metrocluster
implements to check for data status of the application package before package start up.
– If either Oracle8i or Oracle9i DBMS is used and logical data replication is the preferred method,
depending on the version, either Oracle 8i Standby or Oracle 9i Data Guard with log shipping is
used to replicate the data between two data centers. HP provides a supported integration toolkit
for Oracle 8i Standby DB in the Enterprise Cluster Master Toolkit (ECMT). Contributed
integration templates for Oracle 9i Data Guard are available at the following location:
http://haweb.cup.hp.com/ATC/. While the integration templates for Oracle 9i Data Guard have
been tested with Continentalclusters by ACSL, the scripts are provided at no charge, with no
support from HP.
• Both Oracle 9i and Oracle 10g RAC are supported by Continentalclusters by integrating CC with
SGeRAC. In this configuration, multiple nodes in a single cluster can simultaneously access the
database (i.e., nodes in one data center can access the database). If the site fails, the RAC
instances can be recovered at the second site.
• In a 2-data center configuration, Continentalclusters supports a maximum of 32 nodes – i.e., a
maximum of 16 nodes per data center.
• Continentalclusters supports up to three data centers. In this configuration, the first two data centers
must implement Metrocluster so that applications automatically fail over between the first two data
centers before migrating to the third data center. The third data center is a traditional (single)
Serviceguard data center. If both the first and second data centers fail, the customer will be notified
and advised to migrate the application to the third site.
NOTE: THIS CONFIGURATION MUST BE DEPLOYED VERY CAREFULLY, AS APPLICATION AND DATA FAILBACK IS A HIGHLY MANUAL PROCESS.
• Failover for Continentalclusters is semi-automatic. If a data center fails, the administrator is notified, and is required to take action to bring the application up on the surviving cluster. Per customer feedback via our Field personnel, some customers prefer notification that the site is down before the application migrates to the recovery site. (A sketch of this notify-and-confirm flow follows this list.)
• CFS support is targeted for 2006.
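The semi-automatic failover described above amounts to a monitor-notify-confirm loop. A purely illustrative sketch follows (Python; all function names are hypothetical stand-ins, and the actual Continentalclusters monitoring and recovery commands are not shown here):

def primary_cluster_reachable():
    return False  # stand-in for a real health probe of the monitored cluster

def notify_operator(message):
    print("ALERT:", message)

def operator_confirms_recovery():
    return True   # stand-in for the human decision to "push the button"

def start_recovery_packages():
    print("starting recovery packages on the recovery cluster")

if not primary_cluster_reachable():
    notify_operator("monitored cluster is unreachable")
    if operator_confirms_recovery():     # no automatic failover between clusters
        start_recovery_packages()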
Limitations of Continentalclusters
• Semi-automatic failover is a concern for some customers, depending on their Recovery Time
Objectives (RTO). Per feedback from Field personnel, some customers would like the option of
automatic failover as well as semi-automatic failover.
• Although not a limitation of the Continentalclusters product itself, it should be noted that increased distance can significantly complicate the solution. For example, operational issues, such as working with different staff with different processes, and conducting failover rehearsals, become more difficult the further apart the clusters are. In addition, for configurations where the distance between the clusters requires a WAN, the physical connection is one or more leased lines managed by a common carrier, and common carriers cannot guarantee the same reliability as a dedicated physical cable. The distance can also introduce a time lag for data replication, which creates an issue with data currency; this can increase the overall solution cost by requiring higher speed connections to improve data replication performance and reduce latency.
Comparison of Solutions
One of the major problems the Field faces is distinguishing between Extended Campus Cluster and
Metrocluster. The following section is provided to highlight key differences between the two.
Comparison - All DTS Solutions
The following table extends the comparison to include Extended Cluster with RAC and
Continentalclusters.
The following attributes are included, as they must be considered based on the type of disaster(s) about which the customer is concerned.

Key Benefit
• Extended Campus Cluster: Excellent in “normal” operations and partial failures. Since all hosts have access to both disks, in a failure where the node running the application is up but the disk becomes unavailable, no failover occurs; the node accesses the remote disk to continue processing.
• Extended Cluster with RAC: Excellent in “normal” operations and partial failures. The active/active configuration provides maximum data throughput and reduces the need for failover (since both data centers are active, the application is already up at the 2nd site).
• Metrocluster: Two significant benefits: it provides maximum data protection (the state of the data is determined before the application is started, and if necessary, data resynchronization is performed before the application is brought up), and it offers better performance than Extended Campus Cluster for resyncs, as replication is done by the storage subsystem (no impact to the host).
• Continentalclusters: Increased data protection by supporting unlimited distance between data centers (protects against such disasters as those caused by earthquakes or violent attacks, where an entire area can be disrupted).

Key Limitation
• Extended Campus Cluster: No ability to check the state of the data before starting up the application. If the volume group (vg) can be activated, the application will be started; if mirrors are split or PV links are down, as long as the vg can be activated, the application will be started. Data resynchronization can have a big impact on system performance, as this is a host-based solution.
• Extended Cluster with RAC: The SLVM configuration is limited to 2 nodes. The CVM 3.5 configuration supports up to 4 nodes; however, the 4-node configuration is limited to a distance of 10 km. Data resynchronization can have a big impact on system performance, as this is a host-based solution.
• Metrocluster: Specialized storage is required. Currently, XP with Continuous Access, EVA with Continuous Access, and EMC’s Symmetrix with SRDF are supported.
• Continentalclusters: No automatic failover between clusters.

Maximum Distance [1]
• Extended Campus Cluster: 100 kilometers.
• Extended Cluster with RAC: 100 km (maximum of 2 nodes, with either SLVM or CVM 3.5); 10 km (maximum of 4 nodes, with CVM 3.5).
• Metrocluster: Shortest of the three distances defined by cluster network latency (not to exceed 200 ms roundtrip), the data replication link maximum distance, and the DWDM provider maximum distance. [2]
• Continentalclusters: No distance restrictions. [3]

The following attributes are included, as they directly affect data consistency, currency, and availability, and must be considered when evaluating the customer’s RTO.

Data Replication Mechanism
• Extended Campus Cluster: Host-based, via MirrorDisk/UX or (Symantec) VERITAS VxVM. Replication can affect performance (writes are synchronous). Re-syncs can impact performance (a full re-sync is required in many scenarios that have multiple failures [4]).
• Extended Cluster with RAC: Host-based, via MirrorDisk/UX or (Symantec) VERITAS CVM 3.5. Replication can impact performance (writes are synchronous). Re-syncs can impact performance (a full re-sync is required in many scenarios that have multiple failures [4]).
• Metrocluster: Array-based, via CAXP, CAEVA, or EMC SRDF. Replication and resynchronization are performed by the storage subsystem, so the host does not experience a performance hit. Incremental re-syncs are done, minimizing the need for full re-syncs.
• Continentalclusters: Customers have a choice of either selecting their own SG-supported storage and data replication mechanism, or implementing one of HP’s pre-integrated solutions (including CAXP, CAEVA, and EMC SRDF for array-based replication, or Oracle 8i Standby for host-based replication). Also, customers may choose Oracle 9i Data Guard as a host-based solution; contributed (i.e., unsupported) integration templates for Oracle 9i Data Guard are available for download at http://haweb.cup.hp.com/ATC/.

Application Failover Type
• Extended Campus Cluster: Automatic (no manual intervention required).
• Extended Cluster with RAC: Instance is already running at the 2nd site.
• Metrocluster: Automatic (no manual intervention required).
• Continentalclusters: Semi-automatic (user must “push the button” to initiate recovery).

Access Mode [5]
• Extended Campus Cluster: Active/standby.
• Extended Cluster with RAC: Active/active.
• Metrocluster: Active/standby.
• Continentalclusters: Active/standby.

Client Transparency
• Extended Campus Cluster: Client detects the lost connection; the user must reconnect once the application is recovered at the 2nd site.
• Extended Cluster with RAC: Client may already have a standby connection to the remote site.
• Metrocluster: Client detects the lost connection; the user must reconnect once the application is recovered at the 2nd site.
• Continentalclusters: User must reconnect once the application is recovered at the 2nd site.

The following attributes are included, as they directly impact system scalability.

Maximum Cluster Size
• Extended Campus Cluster: 2 to 16 nodes allowed (up to 4 when using dual lock disks).
• Extended Cluster with RAC: 2 nodes with SLVM or CVM 3.5, with a maximum distance of 100 km; 4 nodes with CVM 3.5, with a maximum distance of 10 km.
• Metrocluster: 3 to 16 nodes.
• Continentalclusters: 1 to 16 nodes in each cluster (maximum total of 32 nodes, i.e., 16 nodes per cluster in a 2-data center configuration).

The following attributes are included, as they directly affect cost of implementation and maintenance.

Storage
• Extended Campus Cluster: Identical storage is not required (replication is host-based, with either MirrorDisk/UX or VxVM mirroring).
• Extended Cluster with RAC: Identical storage is not required (replication is host-based, with either MirrorDisk/UX or CVM 3.5 mirroring).
• Metrocluster: Identical storage is required.
• Continentalclusters: Identical storage is required if storage-based mirroring is used; identical storage is not required for other data replication implementations.

Data Replication Link
• Extended Campus Cluster: Dark fiber, FC over IP, or FC over ATM.
• Extended Cluster with RAC: Dark fiber.
• Metrocluster: Dark fiber.
• Continentalclusters: WAN or LAN; dark fiber, FC over IP, and FC over ATM are available as pre-integrated solutions.

Cluster Network
• Extended Campus Cluster: Single IP subnet.
• Extended Cluster with RAC: Single IP subnet.
• Metrocluster: Single IP subnet.
• Continentalclusters: Two configurations: a single IP subnet for both clusters (LAN connection between clusters), or separate IP subnets (WAN connection between clusters).

Notes:
[1] Data centers that are farther apart increase the likelihood that alternate nodes will be available for failover in the event of a disaster.
[2] Metrocluster distance is determined by the shortest of the maximum distance that guarantees a network latency of no more than 200 ms, the maximum supported distance for the data replication link, or the DWDM provider’s maximum supported distance. As such, these values will vary between configurations, based on these factors.
[3] Continentalclusters has no limitation on distance between the two data centers. The distance is dictated by the required rate of data replication to the remote site, the level of data currency, and the quality of the networking links between the two data centers.
[4] A full re-sync is required if a failure that caused one of the mirrors to be unavailable (such as a path failure to the remote site) is followed by a failure that causes a failover to the host at the remote site that uses the mirror that was unavailable.
[5] Active/standby access means one node at a time is accessing the application’s resources. Active/active access means all resources are available to multiple nodes.
Section 8: Disaster Tolerant Cluster Limitations
Disaster tolerant clusters have limitations, some of which can be mitigated by good planning. Some
examples of multiple points of failure that may not be covered by disaster tolerant configurations
include:
• Failure of all networks among the data centers — using a different route for all network cables can
mitigate the risk.
• Loss of power in more than one site (e.g., a data center + the site housing arbitrator nodes) — This
can be mitigated by making sure sites are on different power circuits, redundant power supplies are
on different circuits, and power circuits are fed from different grids. If power outages are frequent
in your area, and down time is expensive, you may want to invest in a backup generator.
• Loss of all copies of the on-line data — this can be mitigated by replicating data off-line (frequent
backups). It can also be mitigated by taking snapshots of consistent data and storing it on-line;
Business Copy XP and EMC Symmetrix BCV (Business Consistency Volumes) provide this
functionality and the additional benefit of quick recovery should anything happen to both copies of
on-line data.
• A rolling disaster is a disaster that occurs before the cluster is able to recover from a non-
disastrous failure. An example is a data replication link that fails, then, as it is being restored and
data is being resynchronized, a disaster causes an entire data center to fail. Ensuring that a copy
of the data is stored either off-line or on a separate disk that can quickly be restored can mitigate
the effects of rolling disasters. The trade-off is a lack of currency of the data in the off-line copy.
Section 9: Recommendations
As previously stated, customers’ recovery time and recovery point objectives (RTO and RPO) typically
drive the type of disaster tolerant solution selected. The following guidelines are provided to help determine how to select a solution for recommendation; a short sketch distilling them follows the scenarios below.
• When should I recommend Extended Campus Cluster or Extended Cluster for RAC?
Extended Campus Cluster is recommended for any of the following situations:
– A customer needs to provide some level of protection, but has his own storage. Since any storage supported by SG is approved for Extended Campus Cluster, this may be the best solution for this customer.
– A customer has a requirement to implement disaster tolerance on a very limited budget. Metrocluster would be the customer’s choice, but the cost to deploy it exceeds his budget. Extended Campus Cluster is a good recommendation – as long as the customer understands and accepts its limitations.
– A customer’s business is in the financial industry (such as banking), with an extraordinarily large volume of real-time transactions, so the customer needs to maximize resource usage. The customer is also concerned about such natural events as flooding. In this instance, you may recommend Extended Cluster for RAC.
• When should I recommend Metrocluster?
Metrocluster is recommended for any of the following situations:
– A customer has one data center running SG. The shared storage is a disk array (XP, EMC, or
EVA). The customer is investigating building a 2nd data center a few miles away to be used
primarily for development and test. This data center can also be used as a back up for the
existing data center.
– A customer has two data centers that are within Metrocluster distance limits. One data center
is running an SG cluster. The 2nd data center is used strictly to back up the data via physical
data replication (such as EMC’s SRDF). The 2nd data center is not running any (business
critical) applications. In this situation, the data is protected, such that in the event of an
outage at the primary data center, the data can be physically moved to a location where a
cluster can be brought up and transaction processing restored. This process is manually
intensive. Because of its automatic failover capability, Metrocluster shortens recovery time,
offering a much better solution.
– A customer has three data centers running independently of each other, and realizes the
vulnerability of having unprotected data at each of the data centers. HP offers a solution for
three data centers. The first two data centers implement Metrocluster for automatic failover.
Continentalclusters is then implemented so that the third data center backs up the first two. In
this configuration, if the entire Metrocluster fails, the third data center will take over
operations.
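As a rough distillation of the guidelines above, the following sketch (Python; the thresholds and inputs are hypothetical simplifications, and real positioning weighs budget, staffing, data change rate, and more) captures the main decision points:

def recommend(needs_active_active, inter_site_km, rto_minutes, must_use_own_storage):
    if needs_active_active and inter_site_km <= 100:
        return "Extended Cluster for RAC"     # active/active RAC across sites
    if inter_site_km > 100 or rto_minutes >= 60:
        return "Continentalclusters"          # any distance, semi-automatic failover
    if must_use_own_storage:
        return "Extended Campus Cluster"      # any SG-supported storage, lower cost
    return "Metrocluster"                     # automatic failover, array-based replication

print(recommend(needs_active_active=False, inter_site_km=40,
                rto_minutes=10, must_use_own_storage=False))   # Metrocluster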
As you can see, disaster tolerant solutions require a significant investment in hardware with
geographically dispersed data centers, a means to continuously replicate data from the primary site
to the recovery site, clustering software to monitor faults and manage the failover of the applications,
as well as IT staff in all data centers to operate the environment. With defined RTO and RPO targets, the customer can then decide whether implementing a disaster tolerant solution is worth the investment.
Appendix A – DTS Design Considerations
Once a customer defines his requirements and chooses to implement a disaster tolerant solution, he
must make many decisions about the actual implementation. The following information is included to
help with the selection of solution components.
Cluster Arbitration
To protect application data integrity, Serviceguard uses a process called arbitration to prevent
more than one incarnation of a cluster from running and starting up a second instance of an
application. In the Serviceguard user’s manual, this process is known as tie breaking, because it is a
means to decide on a definitive cluster membership when different competing cluster nodes are
independently trying to re-form a cluster. Cluster re-formation takes place when there is a change in
cluster membership. In general, the algorithm for cluster re-formation requires the new cluster to
achieve a cluster quorum of a strict majority (that is, more than 50%) of the nodes previously
running. If both halves (exactly 50%) of a previously running cluster were allowed to re-form, there
would be a split-brain situation in which two instances of the same cluster were running.
Serviceguard employs a lock disk, a quorum server, or arbitrator nodes to provide definitive
arbitration to prevent split-brain conditions.
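A minimal sketch of the strict-majority rule (Python; illustrative only) shows why arbitration matters: exactly half is never enough, so an even split of a cluster without a tie-breaker cannot re-form.

def can_reform(previously_running, surviving):
    """Strict majority: a partition re-forms only with MORE than 50% of the nodes."""
    return 2 * surviving > previously_running

print(can_reform(4, 2))  # False: a 2+2 split of a 4-node cluster needs a tie-breaker
print(can_reform(5, 3))  # True: a 5th (arbitrator) node makes an even split impossible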
In an Extended Campus Cluster where the cluster nodes are running in two separate data centers, a
single cluster lock disk would be a single point of failure if the data center it resides in suffers a
catastrophic failure. In this solution, there should be one lock disk in each of the two data centers,
and all nodes must have access to both lock disks. In the event of a failure of one of the data centers,
the nodes in the remaining data center will be able to acquire their local lock disk, allowing them to
successfully re-form a new cluster. A solution that uses dual cluster lock disks is susceptible to split-brain syndrome; if properly designed, configured, and deployed, split-brain would be very difficult to trigger (all storage links and all cluster network links would have to fail), but it is still possible. A dual
cluster lock disk is only supported with Extended Cluster for RAC and Extended Campus Cluster in a
cluster size of four nodes or less.
A quorum server should be located at a third site, away from the two data centers. The farther the third location is from the two data centers, the
higher disaster protection the solution can provide. If the customer chooses a building within the
same campus as one of the data centers to house the quorum server, the customer may be protected
from a fire or power outage, but may not be protected from an earthquake or a hurricane. One
advantage of the quorum server is that additional cluster nodes do not have to be configured for
arbitration. Also, one quorum server can serve multiple clusters.
Since you cannot configure a redundant quorum server, an entire cluster will fail if the quorum server fails and is then followed by a failure that requires cluster re-formation. To reduce this exposure, you need to
make sure that the quorum server is packaged in its own SG cluster so that when a disaster occurs to
one of the main data centers, the quorum server is available to provide cluster quorum to the
remaining cluster nodes to form a new cluster. A solution using a quorum server is not susceptible to split-brain syndrome.
An arbitrator node is the same as any other cluster node and is not configured in any special way in
the cluster configuration file. It is used to make an even partition of the cluster impossible or at least
extremely unlikely. A single failure in a four-node cluster could result in two equal-sized partitions, but
a single failure in a five-node cluster could not. The fifth node in the cluster, then, performs the job of
arbitration by virtue of the fact that it makes the number of nodes in the cluster odd. If one data
center in the solution were down due to disaster, the surviving data center would still remain
connected to the arbitrator node, so the surviving group of nodes would be larger than 50% of the
previously running nodes in the cluster. It could therefore obtain the quorum and re-form the cluster.
As in the case of quorum server, the arbitrator node should be located in a site separate from the two
data centers to provide the appropriate degree of disaster tolerance. The farther the site is away from
the two data centers, the higher disaster protection the solution can provide. A properly designed
cluster solution with two data centers and a 3rd site using arbitrator node(s) will always be able to
achieve cluster quorum after a site failure because a cluster quorum of a strict majority (that is, more
than 50%) of the nodes previously running will always be available to form a new cluster.
It is recommended that two arbitrator nodes be configured at a site separate from either of the data centers, to prevent a single arbitrator node from becoming an SPOF in the solution. The arbitrator nodes
can be used to run an application that doesn’t need disaster tolerant protection. The arbitrator nodes
can be configured to share some common local disk storage. A Serviceguard package can be
configured to provide local fail over of the application between the two arbitrator nodes.
• A dual cluster lock disk is the lowest-cost option, but is susceptible to a slight chance of split-brain syndrome – it is only supported with Extended Campus Cluster and Extended Cluster for RAC
Off-line data replication is fine for many applications for which recovery time is not critical to the
business. Although data might be replicated weekly or even daily, recovery could take from a day to
a week depending on the volume of data. Some applications, depending on the role they play in the
business, may need to have a faster recovery time, within hours or even minutes. For these
applications, off-line data replication would not be appropriate.
Currently the two ways of replicating data on-line are physical data replication and logical data
replication. Either of these can be configured to use synchronous or asynchronous writes.
applications under normal circumstances. Then, if a disaster occurs, an alternate node can take
ownership of applications and data, provided the replicated data is current and consistent.
Replication Mode
Currently, there are three hardware physical data replication products integrated and supported with
HP-UX Disaster Tolerant Solutions – CAXP, CAEVA, and EMC SRDF. Both CAXP and EMC SRDF are supported in both synchronous and asynchronous modes.
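The trade-off between the two modes can be shown with a toy model (Python; purely illustrative, not the behavior of any specific array): synchronous replication acknowledges a write only after the remote copy has it, so nothing committed is lost; asynchronous replication acknowledges immediately and ships the write later, so in-flight writes are at risk.

local_copy, remote_copy, in_flight = [], [], []

def write_sync(data):
    local_copy.append(data)
    remote_copy.append(data)   # remote write completes before the host gets the ack

def write_async(data):
    local_copy.append(data)
    in_flight.append(data)     # acked now; applied at the remote site later

write_sync("txn-1")
write_async("txn-2")
# If the primary site were lost now, txn-2 would still be in flight:
print("remote has", remote_copy, "- at risk:", in_flight)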
• Data copies are peers, so there is no issue with reconfiguring a replica to function as a primary disk
after failover.
• Because there are multiple read devices, that is, the node has access to both copies of data, there
may be improvements in read performance.
• Writes are synchronous unless the link or disk is down.
For logical data replication, currently the Continentalclusters product has a fully integrated and
supported solution with Oracle 8i Standby Database. The integration script is available in the
Enterprise Cluster Master Toolkit. Contributed integration templates for Continentalclusters with Oracle 9i Data Guard are available for download from http://haweb.cup.hp.com/ATC/. While the
integration templates for Oracle 9i Data Guard have been tested with Continentalclusters by ACSL,
the scripts are provided at no charge, with no support from HP.
separate data replication link. As a result, there may be a significant lag in replicating transactions
at the remote site, which affects data currency.
• When a site disaster occurs, logical records or logs being prepared for shipment and in the process
of being transferred to the recovery site will be lost. In this instance, the amount of data loss can be
significant depending on the number of transactions contained within the logical records or logs
(e.g., an Oracle archive log can potentially contain hundreds of database transactions). Data loss can be minimized by reducing the number of transactions contained within each data transfer and increasing the frequency of the transfers, which also improves data currency.
• If the primary database fails and is corrupt, and the replica takes over, the process for restoring the
primary database so that it can be used as the replica is complex. It often involves recreating the
database and doing a database dump from the replica.
• Logic errors in applications or in the RDBMS code itself that cause database corruption will be
replicated to remote sites. This is also an issue with physical replication. However, Oracle Standby can be configured so that replicated logs are not applied immediately to the standby database, providing a window for DBA intervention.
• Most logical replication methods do not support personality swapping, which is the ability after a
failure to allow the secondary site to become the primary and the original primary to become the
new secondary site. This capability can provide increased up time.
Housing remote nodes in another building often implies they are powered by a different circuit, so it
is especially important to make sure all nodes are powered from a different source if the disaster
tolerant cluster is located in two data centers in the same building. Some disaster tolerant designs go
as far as ensuring their redundant power source is supplied by a different power substation on the
grid, and the power circuits are fed from different grids. This adds protection against large-scale
power failures, such as brownouts, sabotage, or electrical storms.
Standard high-availability guidelines require redundant networks. Redundant networks may be highly
available, but they are not disaster tolerant if a single accident can interrupt both network
connections. For example, if you use the same trench to lay cables for both networks, you do not have
a disaster tolerant architecture because a single accident, such as backhoe digging in the wrong
28
place, can sever both cables at once. This may lead to a split-brain syndrome in an Extended
Campus Cluster using dual cluster lock disks. In a disaster tolerant architecture, the reliability of the
network is paramount. To reduce the likelihood of a single accident causing both networks to fail,
redundant network cables should be installed to use physically different routes for each network.
In addition to redundant lines, you also need to consider what bandwidth you need to support the
data replication method you have chosen. Bandwidth affects the rate of data replication, and
therefore the currency of the data at the remote site. For Extended Campus Cluster, Extended Cluster
with RAC, and Metrocluster, the networking link for cluster communication should have no more than
200 milliseconds latency.
The reliability of the data replication link affects whether or not data replication happens, and
therefore the consistency of the data at the remote site. Dark fiber is more reliable but more costly
than leased lines.
Cost influences both bandwidth and reliability. It is best to address data consistency issues first by
installing redundant lines, then weigh the price of data currency and select the line speed
accordingly.
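A back-of-the-envelope check (Python; the figures are hypothetical) makes the bandwidth point concrete: if the application's write (change) rate exceeds the replication link's throughput, the remote copy falls progressively behind, and data currency degrades.

def backlog_mb(change_rate_mb_s, link_rate_mb_s, hours):
    """MB of un-replicated data accumulated over the interval (0 if the link keeps up)."""
    deficit = max(0.0, change_rate_mb_s - link_rate_mb_s)
    return deficit * hours * 3600

# A 12 MB/s change rate over a 10 MB/s link falls 2 MB/s behind:
print(backlog_mb(12, 10, 1), "MB behind after 1 hour")  # 7200.0 MB behind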
Even if recovery is automated, you may choose to, or need to, recover from some types of disasters manually. A rolling disaster, which is a disaster that happens before the cluster has recovered from a previous disaster, is an example of when you may want to switch over manually.
If the data link failed, and as it was coming up and re-synchronizing data, a data center failed, you
would want human intervention to make judgment calls on whether the remote site has consistent
data before failing over.
Each data center often has its own operations staff with their own processes and ways of working.
These operations people will now be required to communicate with each other and coordinate maintenance, failover rehearsals, change control, and IT processes, as well as work together to recover from an actual disaster. If the remote nodes are placed in a “lights-out” data center, the
operations staff may want to put additional processes or monitoring software in place to maintain
the nodes in the remote location. Rehearsals of failover scenarios are important to keep everyone
prepared. Changes made to the production environment (such as OS and/or application upgrades)
must also be tested at the recovery site, to ensure applications fail over correctly in the event of a disaster. A written plan should outline what to do in cases of disaster, with a minimum recommended rehearsal schedule of once every 6 months, ideally once every 3 months.
For more information
• Product User’s Guides and Release Notes, found at http://docs.hp.com/en/ha.html
• Current Unix Server Configuration Guide (by chapter) - found under “Ordering/Configuration
Guides” at http://source.hp.com/portal/site/source/
• DTS Whitepapers and Customer Presentations – found with the search key “DTS” at
http://source.hp.com/portal/site/source/
• HA ATC links, found at http://haweb.cup.hp.com/ATC/
• Clusters for High Availability: A Primer of HP Solutions, Second Edition, Peter S. Weygant
• DWDM: A white paper, Joseph Algieri and Xavier Dahan
• Evaluation of Data Replication Solutions, Bob Baird
• Extended SAN: A Performance Study, Xavier Dahan
• Extended MC/Serviceguard Cluster Configurations (Campus Cluster), Joseph Algieri and Xavier
Dahan
• High Availability Technical Documentation: http://docs.hp.com/hpux/ha/index.html
• HP Extended Cluster for RAC – 100 Kilometer Separation Becomes a Reality