Octavian Lascu
Mustafa Mah
Michel Passet
Harald Hammershøi
SeongLul Son
Maciej Przepiórka
ibm.com/redbooks
International Technical Support Organization
April 2008
SG24-7541-00
Note: Before using this information and the product it supports, read the information in “Notices” on
page vii.
This edition applies to Version 3, Release 1, Modification 6 of IBM General Parallel File System (product
number 5765-G66), Version 5, Release 3 of IBM High Availability Cluster Multi-Processing (product
number 5765-F62), Version 5, Release 3, Technology Level 6 of AIX (product number 5765-G03), and Oracle
CRS Version 10 Release 2 and Oracle RAC Version 10 Release 2.
Notices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
Trademarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
The team that wrote this book . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
Become a published author . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
Comments welcome. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
Chapter 1. Introduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1 Why clustering. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 Architectural considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2.1 RAC and Oracle Clusterware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2.2 IBM GPFS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.3 Configuration options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.3.1 RAC with GPFS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.3.2 RAC with automatic storage management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.3.3 RAC with HACMP and CLVM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
Part 5. Appendixes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289
This information was developed for products and services offered in the U.S.A.
IBM may not offer the products, services, or features discussed in this document in other countries. Consult
your local IBM representative for information on the products and services currently available in your area.
Any reference to an IBM product, program, or service is not intended to state or imply that only that IBM
product, program, or service may be used. Any functionally equivalent product, program, or service that does
not infringe any IBM intellectual property right may be used instead. However, it is the user's responsibility to
evaluate and verify the operation of any non-IBM product, program, or service.
IBM may have patents or pending patent applications covering subject matter described in this document. The
furnishing of this document does not give you any license to these patents. You can send license inquiries, in
writing, to:
IBM Director of Licensing, IBM Corporation, North Castle Drive, Armonk, NY 10504-1785 U.S.A.
The following paragraph does not apply to the United Kingdom or any other country where such
provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION
PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR
IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT,
MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of
express or implied warranties in certain transactions, therefore, this statement may not apply to you.
This information could include technical inaccuracies or typographical errors. Changes are periodically made
to the information herein; these changes will be incorporated in new editions of the publication. IBM may make
improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time
without notice.
Any references in this information to non-IBM Web sites are provided for convenience only and do not in any
manner serve as an endorsement of those Web sites. The materials at those Web sites are not part of the
materials for this IBM product and use of those Web sites is at your own risk.
IBM may use or distribute any of the information you supply in any way it believes appropriate without
incurring any obligation to you.
Information concerning non-IBM products was obtained from the suppliers of those products, their published
announcements or other publicly available sources. IBM has not tested those products and cannot confirm the
accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the
capabilities of non-IBM products should be addressed to the suppliers of those products.
This information contains examples of data and reports used in daily business operations. To illustrate them
as completely as possible, the examples include the names of individuals, companies, brands, and products.
All of these names are fictitious and any similarity to the names and addresses used by an actual business
enterprise is entirely coincidental.
COPYRIGHT LICENSE:
This information contains sample application programs in source language, which illustrate programming
techniques on various operating platforms. You may copy, modify, and distribute these sample programs in
any form without payment to IBM, for the purposes of developing, using, marketing or distributing application
programs conforming to the application programming interface for the operating platform for which the sample
programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore,
cannot guarantee or imply reliability, serviceability, or function of these programs.
The following terms are trademarks of the International Business Machines Corporation in the United States,
other countries, or both:
Redbooks (logo)®, eServer™, AIX 5L™, AIX®, Blue Gene®, DS4000™, DS6000™, DS8000™,
Enterprise Storage Server®, General Parallel File System™, GPFS™, HACMP™, IBM®, POWER™,
POWER3™, POWER4™, POWER5™, Redbooks®, System p™, System p5™, System Storage™,
Tivoli®, TotalStorage®
Oracle, JD Edwards, PeopleSoft, Siebel, and TopLink are registered trademarks of Oracle Corporation and/or
its affiliates.
Snapshot, and the Network Appliance logo are trademarks or registered trademarks of Network Appliance,
Inc. in the U.S. and other countries.
InfiniBand, and the InfiniBand design marks are trademarks and/or service marks of the InfiniBand Trade
Association.
Windows, and the Windows logo are trademarks of Microsoft Corporation in the United States, other
countries, or both.
UNIX is a registered trademark of The Open Group in the United States and other countries.
Linux is a trademark of Linus Torvalds in the United States, other countries, or both.
Other company, product, or service names may be trademarks or service marks of others.
This IBM Redbooks publication will help you architect, install, tailor, and configure Oracle® 10g RAC
on System p™ clusters running AIX®. We describe the architecture and how to design, plan,
and implement a highly available infrastructure for Oracle database using the IBM® General
Parallel File System™ V3.1.
This book gives a broad understanding of how Oracle 10g RAC can use and benefit from the
virtualization facilities embedded in the System p architecture, and how to efficiently use the
tremendous computing power and availability characteristics of the POWER5™ hardware and the
AIX 5L™ operating system.
This book also helps you design and create a solution to migrate your existing Oracle 9i RAC
configurations to Oracle 10g RAC by simplifying configurations and making them easier to
administer and more resilient to failures.
This book also describes how to quickly deploy Oracle 10g RAC test environments, and how
to use some of the built-in disaster recovery capabilities of IBM GPFS™ and storage
subsystems to make your cluster resilient to various failures.
Mustafa Mah is an Advisory Software Engineer working for IBM System and Technology
Group in Poughkeepsie, New York. He currently provides problem determination and
technical assistance in the IBM General Parallel File System (GPFS) to clients on IBM
System p, System x, and Blue Gene® clusters. He previously worked as an application
developer for the IBM System and Technology Group supporting client fulfillment tools. He
holds a Bachelor of Science in Electrical Engineering from the State University of New York in
New Paltz, New York, and a Master of Science in Software Development from Marist College
in Poughkeepsie, New York.
SeongLul Son is a Senior IT specialist working at IBM Korea. He has eleven years of
experience in the IT industry and his expertise includes networking, e-learning, System p
virtualization, HACMP™, and GPFS with Oracle. He has written extensively about GPFS
implementation, database migration, and Oracle in a virtualized environment in this
publication. He also co-authored the AIX 5L Version 5.3 Differences Guide and AIX 5L and
Windows® 2000: Solutions for Interoperability IBM Redbooks® publications in previous
residencies.
Maciej Przepiorka is an IT Architect with the IBM Innovation Center in Poland. His job is to
provide IBM Business Partners and clients with IBM technical consulting and equipment. His
areas of expertise include technologies related to IBM System p servers running AIX,
virtualization, and information management systems, including Oracle databases
(architecture, clustering, RAC, performance tuning, optimization, and problem determination).
He has over 12 years of experience in the IT industry and holds an M.Sc. Eng. degree in
Computer Science from Warsaw University of Technology, Faculty of Electronics and
Information Technology.
Authors: Mustafa (insert), Michel, SeongLul (SL), Harald, Octavian, and Maciej (Mike)
Thanks to the Oracle/IBM Joint Solution Center in Montpellier, France, for reviewing the draft,
and to the following people for their contributions to this project:
Dino Quintero
IBM Poughkeepsie
Andrei Socolic
IBM Romania
Cristian Stanciu
IBM Romania
Rick Piasecki
IBM Austin
Jonggun Shin
Goodus Inc., Korea
Renee Johnson
ITSO Austin
The authors of the previous edition of this book, Deploying Oracle9i RAC on eServer Cluster
1600 with GPFS, SG24-6954, published in October 2003:
Octavian Lascu
Vigil Carastanef
Lifang Li
Michel Passet
Norbert Pistoor
James Wang
Your efforts will help increase product acceptance and customer satisfaction. As a bonus, you
will develop a network of contacts in IBM development labs, and increase your productivity
and marketability.
Find out more about the residency program, browse the residency index, and apply online at:
ibm.com/redbooks/residencies.html
Comments welcome
Your comments are important to us!
We want our books to be as helpful as possible. Send us your comments about this book or
other IBM Redbooks publications in one of the following ways:
Use the online Contact us review IBM Redbooks publication form found at:
ibm.com/redbooks
Send your comments in an e-mail to:
redbooks@us.ibm.com
Mail your comments to:
IBM Corporation, International Technical Support Organization
Dept. HYTD Mail Station P099
2455 South Road
Poughkeepsie, NY 12601-5400
Chapter 1. Introduction
This chapter provides an overview of the infrastructure and clustering technologies that you
can use to deploy a highly available, load balancing database environment using Oracle 10g
RAC and IBM System p, running AIX and IBM General Parallel File System (GPFS). We also
provide information about various other storage management techniques.
In general, clusters1 are used to provide higher performance and availability than a single
computer or application can deliver. They are typically more cost-effective than solutions
based on single computers of similar performance2.
However, in most cases, a clustering solution provides more than one benefit; for example, a
high availability cluster can also provide load balancing for the same application.
Today’s commercial environments require that their applications are available 24x7x365. For
commercial environments, high availability and load balancing are key features for IT
infrastructure. Applications must be able to work with hardware and operating systems to
deliver according to the agreed upon service level.
The idea of having multiple instances access the same physical data files goes back to
Oracle 7 (and actually started in 1988 with Oracle 6 on the Virtual Memory System (VMS) platform).
Oracle 7 was developed to scale horizontally, when a single symmetric multiprocessing (SMP)
server did not provide adequate performance.
1 According to Oxford University Press’ American Dictionary of Current English, a cluster is: “A number of things of
the same sort gathered together or growing together”.
2 Performance calculated using standardized benchmark programs.
Due to the coarse locking granularity, false pinging could occur for blocks that were already clean.
Oracle 8 introduced fine-grained locking, which eliminated false pinging.
In Oracle 8.1, Parallel Server introduced the Cache Fusion mechanism for consistent reads (that
is, exchanging data blocks through an interconnect network to avoid a physical disk I/O read
operation).
Starting with Oracle 9i Real Application Clusters (RAC), consistent read and current read
operations use the cache fusion mechanism.
In Oracle 10g, the basic cluster functionality and the Real Application Clusters (RAC)
database option are split into two products:
The basic cluster functionality is now Oracle Clusterware (10.2 and forward; it was Cluster
Ready Services (CRS) in 10.1).
CRS is now a component of Oracle Clusterware. Most Oracle Clusterware commands
reflect the former name, CRS.
Oracle 10g RAC uses Oracle Clusterware for the infrastructure to bind multiple servers so
that they can operate as a single system.
Oracle Clusterware is a cluster management solution that is integrated with Oracle database.
The Oracle Clusterware is also a required component when using RAC. In addition, Oracle
Clusterware enables both single-instance Oracle databases and RAC databases to use the
Oracle high availability infrastructure.
In the past, Oracle RAC configurations required vendor specific clusterware. With Oracle
Clusterware (10.2), vendor specific clusterware is no longer required. However, Oracle
Clusterware can coexist with vendor clusterware, such as High-Availability Cluster
Multi-Processing (HACMP). The integration between Oracle Clusterware and Oracle
database means that Oracle Clusterware has inherent knowledge of the relationships among
RAC instances, automatic storage management (ASM) instances, and listeners. It knows
the sequence in which to start and stop all of these components.
Oracle Clusterware components
Figure 1-1 shows a diagram of the major functional components that are provided by Oracle
Clusterware.
Figure 1-1 Oracle Clusterware functional components: group membership (topology), process monitor
(watchdog), virtual IP addresses, monitoring and starting/restarting of applications, node halt/reset, and
the IP interconnect
The Oracle Clusterware requires two components from the platform: shared storage and an
IP interconnect. Shared storage is required for voting disks to record node membership
information and for the Oracle Cluster Registry (OCR) for cluster configuration information
(repository).
Oracle Clusterware requires that each node is connected to a dedicated high speed
(preferably low latency) IP network3.
We highly recommend that the interconnect is inaccessible to nodes (systems) that are not
part of the cluster (not managed by Oracle Clusterware).
The Oracle Clusterware shared configuration data is stored in the OCR, which can be a file or a raw
device. There are no strict placement rules for the OCR, such as there are for the Oracle database
pfile/spfile; however, Oracle has to record the location of the OCR disk/file on each cluster node.
In Figure 1-2, the component processes are grouped, and access to the OCR and voting
disks is shown for one node.
The VIP address is handled as a CRS resource, just as other resources, such as a database,
an instance, a listener, and so on. It does not have a dedicated process.
Figure 1-2 Oracle Clusterware component relationship4: evmd (evmd.bin, evmlogger), init.cssd with
ocssd (ocssd.bin), crsd.bin, and the process monitor (oprocd, oclsomon(*)), with access to the OCR and
voting disk
Oracle recommends that you configure redundant network adapters to prevent interconnect
components from being a single point of failure.
3 In certain configurations Oracle may also support InfiniBand® using RDS (Reliable Datagram Socket) protocol.
4 (*) The oclsomon daemon is not mentioned in the 10.2 documentation, but it is running in 10.2.0.3. According to 11g
documentation, oclsomon is monitoring css (to detect if css hangs).
Here are a few examples of what happens in typical situations:
Listener failure
When CRS detects that a registered component, such as the listener, is not responding,
CRS tries to restart this component. By default, CRS tries to restart this component five
times.
Interconnect failure
If the interconnect is lost for one or more nodes (split brain), CSS resolves this failure through
the voting disks. The surviving subcluster is:
– The subcluster with the largest number of nodes, or
– If the subclusters are of equal size, the subcluster that contains the node with the lowest node number
Node malfunction
If the OPROCD process is unable to become active within the expected time, CRS
reboots the node.
RAC is the Oracle database option that provides a single system image for multiple servers to
access one Oracle database. In RAC, each Oracle instance usually runs on a separate
server (OS image).
You can use Oracle 10g RAC for both horizontal scaling (scale out in Oracle terms) and for
high availability where client connections from a malfunctioning node are taken over by the
remaining nodes in RAC.
RAC instances use two processes to ensure that each RAC database instance obtains the
block that it needs to satisfy a query or transaction: the Global Cache Service (GCS) and the
Global Enqueue Service (GES).
The GCS and GES maintain status records for each data file and each cached block using a
Global Cache Directory (GCD). The GCD contents are distributed across all active instances
and are part of the SGA.
An instance is defined as the shared memory (SGA) and the associated background
processes. When running in a RAC, the SGA has an additional member, the Global Cache
Directory (GCD), and an additional background process for the GCS and GES services.
GC mode can be NULL, shared, or exclusive. A NULL mode means that another instance
has this block in exclusive mode. Exclusive mode means the instance has the privilege to
update the block.
The GCS and GES use the private interconnect for exchanging control messages and for
actually exchanging data when performing Cache Fusions. Cache Fusion is a data block
transfer on the interconnect. This type of a transfer occurs when one instance needs access
to a data block that is already cached by another instance, thus avoiding physical I/O. GCS
modes are cached with the blocks, so if an instance needs to update a block for which it has
already been granted exclusive mode, no additional interconnect traffic is required.
The basic concept for updates is that when an instance wants to update a data block, it must
be granted exclusive mode on that block from the GRD, which means that at any given time,
only one instance is able to update a given data block. Therefore, if the interconnect is lost, no
instance can be granted exclusive mode on any block until the cluster recovers interconnect
connectivity between the nodes.
However, in a multinode RAC, if an interconnect network failure on certain nodes results in
subclusters (a split-brain configuration), each subcluster considers itself the survivor.
Oracle Clusterware avoids this scenario by using the voting disks.
GPFS has two major components: a GPFS daemon, running on all cluster nodes, which provides
cluster management, membership, and over-the-network disk access, and a kernel
extension (the file system device driver) that provides file system access to the applications.
GPFS provides cluster topology and membership management based on built-in heartbeat
and quorum decision mechanisms. Also, at the file system level, GPFS provides concurrent
and consistent access using locking mechanisms and a file system descriptor quorum.
Because GPFS is Portable Operating System Interface (POSIX) compliant, most applications
work without modification; however, in certain cases, applications must be recompiled to
fully benefit from the concurrency mechanism provided by GPFS.
In addition to concurrent access, GPFS also provides availability and reliability through
replication and metadata logging, as well as advanced functions, such as information life
cycle management, access control lists, quota management, multi-clustering, and disaster
recovery support. Caching, as well as direct I/O, is supported.
Oracle RAC uses GPFS for concurrent access to Oracle database files. For database
administrators, GPFS is easy to use and manage compared to other concurrent storage
mechanisms (concurrent raw devices, ASM). It provides almost the same performance level
as raw devices. The basic requirement for Oracle 10g RAC is that all disks used by GPFS
(and by Oracle) are directly accessible from all nodes (each node must have a host bus
adapter (HBA) connected to the shared storage and access the same logical unit numbers
(LUNs)).
There are several options available to implement Oracle 10g RAC on Advanced Interactive
eXecutive (AIX) in terms of storage and data file placement. Prior to Oracle 10g, for
configurations running on AIX, only two possibilities existed: GPFS file systems or raw
devices on concurrent logical volume managers (CLVMs). In Oracle 10g Release 1, Oracle
introduced its own disk management layer named Automatic Storage Management (ASM).
GPFS greatly simplifies the installation and administration of Oracle 10g RAC. Because it is a
shared file system, all database files can be placed in one common directory, and database
administrators can use the file system as a typical journaled file system (JFS)/JFS2.
Allocation of new data files or resizing existing files does not require system administrator
intervention. Free space on GPFS is seen as a traditional file system that is easily monitored
by administrators.
Moreover, with GPFS we can keep a single image of Oracle binary files and share them
between all cluster nodes. This single image applies both to Oracle database binaries
(ORACLE_HOME) and Oracle Clusterware binary files. This approach simplifies
maintenance operations, such as applying patch sets and one-off patches, and keeps all sets
of log files and installation media in one common space.
For clients running Oracle Applications (eBusiness Suite) with multiple application tier nodes,
it is also possible and convenient to use GPFS as a shared APPL_TOP file system.
In GPFS Version 2.3, IBM introduced cluster topology services within GPFS. Thus, for a GPFS
configuration, other clustering layers, such as HACMP or RSCT, are no longer required.
You can locate every single Oracle 10g RAC file type (for database and clusterware
products) on the GPFS, which includes:
Clusterware binaries
Clusterware registry files
Clusterware voting files
Database binary files
Database initialization files (init.ora or spfile)
Control files
Data files
Redo log files
Archived redo log files
Flashback recovery area files
You can use the GPFS to store database backups. In that case, you can perform the restore
process from any available cluster node. You can locate other non-database-related files on
the same GPFS as well.
Figure 1-4 shows a diagram of Oracle RAC on GPFS architecture. All files related to Oracle
Clusterware and database are located on the GPFS.
Figure 1-4 Oracle RAC on GPFS architecture (all Oracle Clusterware and database files on GPFS over shared storage)
Although it is possible to place Oracle Clusterware configuration disk data (OCR) and voting
disk (quorum) data on the GPFS, we generally do not recommend it. In case of GPFS
configuration manager node failure, failover time and I/O freeze during its reconfiguration
might be too long for Oracle Clusterware, and nodes might be evicted from the cluster.
Figure 1-5 shows the recommended architecture with CRS devices outside the GPFS.
Note: CRS config and vote devices located on raw physical volumes.
Figure 1-5 Oracle RAC with GPFS: CRS configuration and voting devices on raw physical volumes (shared storage)
We discuss detailed information about GPFS installation and configuration in the following
sections of this book.
Important: The previous list is not exhaustive or up-to-date for all of the versions
supported. When using RAC, always check with both Oracle and the storage manufacturer
for the latest support and compatibility list.
For a complete list of GPFS 3.1 software and hardware requirements, visit GPFS 3.1
documentation and FAQs on the following Web page:
http://publib.boulder.ibm.com/infocenter/clresctr/vxrx/index.jsp?topic=/com.ibm.cluster.gpfs.doc/gpfsbooks.html
Each disk subsystem requires a specific set of device drivers for proper operation while
attached to a host running GPFS.
Note: For the minimum software versions and patches that are required to support Oracle
products on IBM AIX, read Oracle Metalink bulletin 282036.1.
Figure 1-6 Oracle RAC with ASM: Node A and Node B attached to shared storage
With AIX, each LUN has a raw device file in the /dev directory, such as /dev/rhdisk0. For an
ASM environment, this raw device file for a LUN is assigned to the oracle user. An ASM
instance manages these device files. In a RAC cluster, one ASM instance is created per RAC
node.
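For illustration, assigning such a raw device to the oracle user is an ordinary AIX command sequence; a minimal sketch (the device name and the dba group are assumptions, not taken from this configuration):
root@austin1:/> chown oracle:dba /dev/rhdisk20
root@austin1:/> chmod 660 /dev/rhdisk20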
Important: In AIX, for each hdisk, there are two devices created in /dev directory: hdisk
and rhdisk. The hdisk device is a block type device, and rhdisk is a character (sequential)
device. For Oracle Clusterware and database, you must use character devices:
root@austin1:/> ls -l /dev |grep hdisk10
brw------- 1 root system 20, 11 Sep 14 19:35 hdisk10
crw------- 1 root system 20, 11 Sep 14 19:35 rhdisk10
Collections of these disk devices are assigned to ASM to form ASM disk groups. For each
ASM disk group, a level of redundancy is defined, which might be normal (mirrored), high
(three mirrors), or external (no mirroring). When normal or high redundancy is used, disks can
be organized in failure groups to ensure that data and its redundant copy do not both reside
on disks that are likely to fail together.
Figure 1-7 on page 15 shows dependencies between disk devices, failure groups, and disk
groups within ASM. Within disk group ASMDG1, data is mirrored between failure groups one
and two. For performance reasons, ASM implements the Stripe And Mirror Everything
(SAME) strategy across disk groups, so that data is distributed across all disk devices.
Figure 1-7 ASM disk groups: ASMDG1 (normal redundancy, mirrored across failure groups one and two)
and ASMDG2 (external redundancy), built on raw disk devices such as rhdisk20 and rhdisk21
Important: Assigning hdisks that are used by ASM to a volume group, or setting a PVID on them, results
in data corruption.
ASM does not rely on any AIX mechanism to manage disk devices. No PVID, volume
group label, or hardware reservation can be assigned to an hdisk device belonging to an
ASM disk group. AIX reports ASM disks as not belonging to a volume group (unused
disks). This raises a serious security problem.
Example 1-1 shows a result of the lspv command on the AIX server; hdisk2, hdisk3, hdisk4,
hdisk5, and hdisk6 do not have PVID signatures and are not assigned to any volume group.
They look like unused hdisks, but they might also belong to an ASM disk group.
Example 1-1 AIX lspv command result
root@austin1:/> lspv
hdisk0 0022be2ab1cd11ac rootvg active
hdisk1 00cc5d5caa5832e0 None
hdisk2 none None
hdisk3 none None
hdisk4 none None
hdisk5 none None
hdisk6 none None
hdisk7 none nsd_tb1
hdisk8 none nsd_tb2
hdisk9 none nsd_tb3
hdisk10 none nsd01
hdisk11 none nsd02
hdisk12 none nsd03
hdisk13 none nsd04
hdisk14 none nsd05
hdisk15 none nsd06
The same problem exists with Oracle Clusterware disks when they reside outside of the file
system or any LVM. It is not obvious if they are used by Oracle or available.
ASM manages storage only for database files. Oracle binaries, OCR, and voting disks cannot
be located on ASM disk groups. If shared binaries are desired, you must use a clustered file
system, such as GPFS.
Detailed ASM installation and configuration on the AIX operating system is covered in
CookBook V2 - Oracle RAC 10g Release 2 with ASM on IBM System p running AIX V5
(5.2/5.3) on SAN Storage by Oracle/IBM Joint Solutions Center at:
http://www.oracleracsig.com/
Note: For the minimum software versions and patches that are required to support Oracle
products on IBM AIX, check the Oracle Metalink bulletin 282036.1.
In Oracle 9i RAC, HACMP is used to provide cluster topology services and shared disk
access, and to maintain high availability of the interconnect network for the Oracle instances.
The major drawback of this approach is that administrators have to maintain two
clusterware products within the same environment, while most of the HACMP core functionality,
which provides service high availability and failover, is not used at all.
HACMP provides Oracle 10g RAC with the infrastructure for concurrent access to disks.
Although HACMP provides concurrent access and a disk locking mechanism, this
mechanism is only used to open the files (raw devices) and for managing hardware disk
reservation. Oracle database, instead, provides its own data block locking mechanism for
concurrent data access, integrity, and consistency.
Volume groups are varied on (activated) on all the nodes (under the control of RSCT), thus ensuring a
short failover time in case one node loses its disk or network connection. This type of concurrent
access can only be provided for raw logical volumes (devices).
Oracle datafiles use the raw devices located on the shared disk subsystem. In this
configuration, you must define an HACMP resource group to handle the concurrent volume
groups.
There are two options when using HACMP and CLVM with Oracle RAC. Oracle Clusterware
devices are located on concurrent (raw) logical volumes provided by HACMP (Figure 1-8) or
on separate physical disk devices or LUNs. You must start HACMP services on all nodes
before Oracle Clusterware services are activated.
Note: CRS devices are located on concurrent logical volumes provided by HACMP and
CLVM.
Figure 1-8 Oracle RAC with CRS devices on concurrent logical volumes provided by HACMP and CLVM (shared storage)
When using physical raw volumes (Figure 1-9), Oracle Clusterware and HACMP are not
dependent on each other; however, both products have to be up and running before the
database startup.
Note: CRS devices are located on raw physical volumes. CRS does not make use of any
extended HACMP functionality.
Figure 1-9 Oracle RAC with CRS devices on raw physical volumes: Node A and Node B each run
HACMP/CLVM, Oracle CRS, and Oracle RAC, and attach to the shared storage
For both sample scenarios, if HACMP is configured before Oracle, CRS uses HACMP node
names and numbers.
The drawback of this configuration option stems from the fairly complex administrative tasks,
such as maintaining datafiles, Oracle code, and backup and restore operations.
Detailed Oracle RAC installation and configuration on AIX operating system is covered in
CookBook V1 - Oracle RAC 10g Release 2 on IBM System p running AIX V5 with SAN
Storage by Oracle/IBM Joint Solutions Center, January 2006, at:
http://www.oracleracsig.com/
Note: For the minimum software versions and the patches that are required to support
Oracle products on IBM AIX, read Oracle Metalink bulletin 282036.1.
The diagram in Figure 2-1 shows the test environment that we use for this scenario.
Figure 2-1 Test environment: nodes austin1 (public IP 192.168.100.31, interconnect IP 10.1.100.31, VIP
austin1_vip) and austin2 (public IP 192.168.100.32, interconnect IP 10.1.100.32, VIP austin2_vip), each
with adapters ent2 and ent3 and a local rootvg on hdisk0, attached to shared DS4800 storage
Nodes
We implement a configuration consisting of two nodes (logical partitions (LPARs)) in two IBM
System p5™ p570s. Each LPAR has four processors and 16 GB of random access memory
(RAM).
Networks
Each node is connected to two networks:
We use one “private” network for the RAC interconnect and GPFS metadata traffic, configured
as an Etherchannel with two Ethernet interfaces on each node. The communication protocol is
IP.
One public network (Ethernet, IP)
Storage
The storage (DS4800) connects to a SAN switch (2109-F32) via two 2 Gb Fibre Channel
paths. Each node has one 2 Gb 64-bit PCI-X Fibre Channel (FC) adapter. Figure 2-1 shows
an overview of the configuration that was used in our environment.
You must prepare the host operating system before installing and configuring Oracle
Clusterware. In addition to OS prerequisites (software packages), Oracle Clusterware
requires:
Configuring network IP addresses
Name resolution
Enabling remote command execution
Oracle user and group
The following filesets are required on both nodes (the same set on each node):
bos.adt.base
bos.adt.lib
bos.adt.libm
bos.perf.libperfstat
bos.perf.perfstat
bos.perf.proctools
rsct.basic.rte
rsct.compat.clients.rte
xlC.aix50.rte 7.0.0.4 or 8.xxx
xlC.rte 7.0.0.1 or 8.xxx
bos.adt.prof (a)
bos.cifs_fs (a)
a. See the following information in the shaded Tip box.
Tip: If bos.adt.prof and bos.cifs_fs filesets are missing, the Oracle installation verification
utility complains about this during CRS installation. However, these files are not required
for Oracle, and this error message can be ignored at this point. See Oracle Metalink doc
ID: 340617.1 at:
http://metalink.oracle.com
For better availability, we recommend that you set up separate Etherchannel interfaces for
Oracle interconnect and GPFS. However, it is possible to use the same Etherchannel
interface for both Oracle interconnect and GPFS metadata traffic. For more information, refer
to 2.5, “Networking considerations” on page 76.
Example 2-1 shows the list of Ethernet interfaces (ent0 and ent1) that we use to set up the
Etherchannel interface.
Example 2-3 on page 23 shows RAC and GPFS interconnect, ent0, and ent1 interfaces.
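As a quick check before building the Etherchannel, you can list the physical Ethernet adapters on each node; the following is only a sketch of the idea:
root@austin1:/> lsdev -Cc adapter | grep ent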
Important: Make sure that the network interfaces’ names and numbers (for example, en2
and en3 in our case) are identical on all nodes that are part of the RAC cluster. This
consistency is an Oracle RAC requirement.
We decided to use the same Etherchannel interface for Oracle Clusterware, Oracle RAC, and
GPFS interconnect. This configuration is possible, because GPFS does not use significant
communication bandwidth.
Example 2-4 on page 24 shows the system management interface tool (SMIT) window
through which we choose the Etherchannel interface parameters. We use the round_robin
load balancing mode and default values for all other fields. To see details about configuring
Etherchannel, refer to Appendix A, “EtherChannel parameters on AIX” on page 255.
[Entry Fields]
EtherChannel / Link Aggregation Adapters ent0,ent1 +
Enable Alternate Address no +
Alternate Address [] +
Enable Gigabit Ethernet Jumbo Frames no +
Mode round_robin +
Hash Mode default +
Backup Adapter +
Automatically Recover to Main Channel yes +
Perform Lossless Failover After Ping Failure yes +
Internet Address to Ping []
Number of Retries [] +#
Retry Timeout (sec) [] +#
Next, we configure the IP address for the Etherchannel interface using the SMIT fastpath
smitty chinet. We select the previously created interface from the list (en3 in our case) and
fill in the required fields, as shown in Example 2-5 on page 25.
[Entry Fields]
Network Interface Name en3
INTERNET ADDRESS (dotted decimal) [10.1.100.31]
Network MASK (hexadecimal or dotted decimal) [255.255.255.0]
Current STATE up +
Use Address Resolution Protocol (ARP)? yes +
BROADCAST ADDRESS (dotted decimal) []
Interface Specific Network Options
('NULL' will unset the option)
rfc1323 []
tcp_mssdflt []
tcp_nodelay []
tcp_recvspace []
tcp_sendspace []
Apply change to DATABASE only no +
Note: The user and group id must be the same on both nodes.
Use the SMIT commands, smitty mkuser and smitty mkgroup, to create the user and the
group. We use the command line, as shown in Example 2-6.
Optionally, you can create the oinstall group. This group is the Oracle inventory group. If this
group exists, it owns the Oracle code files. This group is a secondary group for the oracle
user (besides the staff group).
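The following is a minimal command-line sketch of the group and user creation (the numeric IDs and home directory are assumptions; use identical values on both nodes):
root@austin1:/> mkgroup id=500 dba
root@austin1:/> mkgroup id=501 oinstall
root@austin1:/> mkuser id=500 pgrp=staff groups=staff,dba,oinstall home=/home/oracle oracle
root@austin1:/> passwd oracle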
Note: The AIXTHREAD_SCOPE environment variable controls whether a process runs with
process-wide contention scope (the default) or with system-wide contention scope. When
using system-wide contention scope, there is a one-to-one mapping between the user
thread and a kernel thread.
This mechanism operates most efficiently with Oracle applications when using
system-wide thread contention scope (AIXTHREAD_SCOPE=S). In addition, as of AIX
V5.2, system-wide thread contention scope also significantly reduces the amount of
memory that is required for each Oracle process. For these reasons, we recommend that you
always export AIXTHREAD_SCOPE=S before starting Oracle processes.
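For example, a simple way to make this setting persistent for the oracle user (the profile location is an assumption):
root@austin1:/> echo "export AIXTHREAD_SCOPE=S" >> /home/oracle/.profile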
Example 2-8 shows the sample /etc/hosts file that we use for our environment. The IP labels
and addresses in this scenario are in bold characters.
# Public network
192.168.100.31 austin1
192.168.100.32 austin2
192.168.100.251 ds4800_c1
192.168.100.252 ds4800_c2
192.168.100.231 hmc_p5
192.168.100.232 hmc_p6
You can use either ssh or the standard remote shell (rsh). If ssh is already configured, Oracle
automatically uses ssh for remote execution. Otherwise, rsh is used. To keep it simple, we
use rsh in our test environment.
Important: Oracle remote command execution fails if there are any intermediate
messages (including banners) during the authentication phase. For example, if you are
using rsh with two authentication methods (kerberos and system), and kerberos
authentication fails, even though the system authentication works correctly, the
intermediate kerberos failing message received by Oracle will result in Oracle remote
command execution failure.
GPFS also requires remote command execution without user interaction between cluster
nodes (as root). GPFS also supports using ssh or rsh. You can specify the remote command
execution when creating the GPFS cluster (the mmcrcluster command).
For rsh, rcp, and rlogin, you must set up user equivalence for the oracle and root accounts.
We set up equivalence by editing the /etc/hosts.equiv file on each cluster node and also the
$HOME/.rhosts files in the root and oracle home directories, as shown in Example 2-9.
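A minimal sketch of the entries (which host names to list depends on the interfaces that Oracle and GPFS actually use; the names below are an assumption based on our environment). The same list goes into /etc/hosts.equiv and into the .rhosts files of the root and oracle users on each node:
austin1
austin2
austin1_interconnect
austin2_interconnect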
Change the parameter Maximum number of PROCESSES allowed per user to 2048 or
greater:
root@austin1:/> chdev -l sys0 -a maxuproc=2048
Also, Oracle recommends that you configure the user file, CPU, data, and stack limits in
/etc/security/limits as shown in Example 2-10.
oracle:
fsize = -1
cpu = -1
data = -1
stack = -1
Table 2-2 on page 30 shows the minimum recommended values of the TCP/IP stack parameters for an
Oracle installation. For production database systems, Oracle recommends that you tune
these values to optimize system performance.
Refer to your operating system documentation for more information about tuning TCP/IP
parameters.
ipqmaxlen 512
rfc1323 1
sb_max 1310720
tcp_recvspace 65536
tcp_sendspace 65536
udp_recvspace 655360 (a)
udp_sendspace 65536 (b)
a. The recommended value of this parameter is 10 times the value of the udp_sendspace
parameter. The value must be less than the value of the sb_max parameter.
b. This value is suitable for a default database installation. For production databases, the
minimum value for this parameter is 4 KB plus the value of the database DB_BLOCK_SIZE
initialization parameter multiplied by the value of the DB_MULTIBLOCK_READ_COUNT
initialization parameter: (DB_BLOCK_SIZE * DB_MULTIBLOCK_READ_COUNT) + 4 KB
Note: Certain parameters are set at interface (en*) level (check with lsattr -El en*).
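As an illustration, the system-wide values can be set with the no command (on AIX 5.3, the -p flag makes a change persistent across reboots; ipqmaxlen is a reboot tunable and is set with -r), and the interface-level options can be checked with lsattr:
root@austin1:/> no -p -o rfc1323=1 -o sb_max=1310720
root@austin1:/> no -p -o tcp_recvspace=65536 -o tcp_sendspace=65536
root@austin1:/> no -p -o udp_recvspace=655360 -o udp_sendspace=65536
root@austin1:/> no -r -o ipqmaxlen=512
root@austin1:/> lsattr -El en3 | egrep "rfc1323|tcp_recvspace|tcp_sendspace"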
In our configuration, we use two GPFSs, one for Oracle data files and the other for Oracle
binary files. For the Oracle Cluster Registry (OCR) and CRS voting disks that are required for
Oracle Clusterware installation, we use raw devices (disks).
Note: We chose this configuration to avoid a situation where GPFS and Oracle
Clusterware interfere with each other during the node recovery process.
Installing GPFS
We use GPFS V3.1 for our test environment. We have installed the filesets and verified the
packages by using the lslpp command on each node as shown in Example 2-11.
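For example, a quick check on each node looks like the following sketch; you can expect to see the gpfs.base, gpfs.msg.en_US, and gpfs.docs.data filesets (levels vary with the installed PTF level):
root@austin1:/> lslpp -l "gpfs.*"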
When creating the GPFS cluster, you must provide a file containing a list of node descriptors,
one per line, for each node to be included in the cluster, as shown in Example 2-13. Because
this is a two node configuration, both nodes are quorum and manager nodes.
Node roles
The node roles are:
quorum | nonquorum
This designation specifies whether or not the node is included in the
pool of nodes from which quorum is derived. The default is
nonquorum. You must designate at least one node as a quorum node.
manager | client Indicates whether a node is part of the node pool from which
configuration managers, file system managers, and the token
manager can be selected. The special functions of the file system
manager consume extra CPU time.
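A sketch of such a node descriptor file for this two-node cluster (the file name is an assumption; the descriptor format is NodeName:NodeDesignations):
root@austin1:/etc/gpfs_config> cat gpfs_nodes
austin1_interconnect:quorum-manager
austin2_interconnect:quorum-manager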
Prepare each physical disk for GPFS Network Shared Disks (NSDs)1 using the mmcrnsd
command, as shown in Example 2-14 on page 32. You can create NSDs on physical disks
(hdisk or vpath devices in AIX).
In our testing environment, because both nodes are directly attached to the storage, we do not
assign an NSD server in the disk description file. However, you must still create the NSDs,
because they are required to create a file system (unless you are using VSDs), regardless of
whether you use NSD servers.
1 Network Shared Disk is a concept that represents the way that the GPFS file system device driver accesses a raw
disk device regardless of whether the disk is locally attached (SAN or SCSI) or is attached to another GPFS node
(via network).
Tip: If you use many small files and the file system metadata is dynamic, separating data
and metadata improves performance. However, if large files are mostly used, and there is
little metadata activity, separating data from metadata does not improve performance.
Tip: We recommend that you define DesiredName, because you can use meaningful
names that make system administration easier (see Example 2-21, nsd_tb1 is used as a
tiebreaker, because of the “tb” in the suffix). If a desired name is not specified, the NSD is
assigned a name according to the convention: gpfsNNnsd where NN is a unique
nonnegative integer (for example, gpfs01nsd, gpfs02nsd, and so on).
StoragePool StoragePool specifies the name of the storage pool to which the NSD
is assigned (if desired). If this name is not provided, the default is
system. Only the system pool can contain metadataOnly,
dataAndMetadata, or descOnly disks.
Example 2-15 shows the disk description file for tiebreaker disks.
Note: The disk descriptor file shown in Example 2-15 does not specify the diskUsage,
because we use these NSDs for the cluster quorum (tiebreakers), and they will not be part of
any file systems.
Example 2-15 Sample disk description file used for creating tiebreaker NSDs
root@austin1:/etc/gpfs_config> cat gpfs_disks_tb
hdisk7:::::nsd_tb1
hdisk8:::::nsd_tb2
hdisk9:::::nsd_tb3
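A command sketch for turning these disks into NSDs with the descriptor file shown above (-v yes, the default, verifies that the disks are not already in use by another file system):
root@austin1:/etc/gpfs_config> mmcrnsd -F /etc/gpfs_config/gpfs_disks_tb -v yes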
Cluster quorum
GPFS cluster quorum must be maintained for the GPFS file systems to remain available. If
the quorum semantics are broken, GPFS performs the recovery in an attempt to achieve
quorum again. GPFS can use one of two methods for determining quorum:
Node quorum
Node quorum with tiebreaker disks
Table 2-3 explains the difference between node quorum and node quorum with
tiebreakerDisks.
Table 2-3 Difference between node quorum and node quorum with tiebreaker disks
Node quorum:
– Quorum is defined as one plus half of the explicitly defined quorum nodes in the GPFS cluster.
– There are no default quorum nodes; you must specify which nodes have this role.
– GPFS does not limit the number of quorum nodes.
Node quorum with tiebreaker disks (a):
– There is a maximum of eight quorum nodes.
– You must include the primary and secondary cluster configuration servers as quorum nodes.
– You can have an unlimited number of non-quorum nodes.
a. See the following tip.
Most Oracle RAC with GPFS configurations are two node clusters; therefore, you must set
up node quorum with tiebreaker disks. A GPFS cluster can survive and maintain file
systems available with one quorum node and one available tiebreaker disk in this
configuration. You can have one, two, or three tiebreaker disks. However, we recommend
that you use an odd number of tiebreaker disks (three).
Configuring GPFS
Before you start setting up a GPFS cluster, verify that the remote command execution is
working properly between all GPFS nodes via the interfaces that are used for GPFS
metadata traffic (austin1_interconnect and austin2_interconnect).
The following list gives a short explanation for the mmcrcluster command parameters shown
in Example 2-16.
-N NodeFile NodeFile specifies the file containing the list of node descriptors (see
Example 2-13 on page 31), one per line, to be included in the GPFS
cluster.
-p PrimaryServer PrimaryServer specifies the primary GPFS cluster configuration server
node used to store the GPFS configuration data.
-s SecondaryServer SecondaryServer specifies the secondary GPFS cluster configuration
server node used to store the GPFS cluster data. We suggest that you
specify a secondary GPFS cluster configuration server to prevent the
loss of configuration data in the event that your primary GPFS cluster
configuration server goes down. When the GPFS daemon starts up, at
least one of the two GPFS cluster configuration servers must be
accessible.
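Assuming the node descriptor file sketched earlier and rsh/rcp for remote command execution, the cluster creation command can look like the following (the file path and cluster name are assumptions):
root@austin1:/etc/gpfs_config> mmcrcluster -N /etc/gpfs_config/gpfs_nodes -p austin1_interconnect \
-s austin2_interconnect -r /usr/bin/rsh -R /usr/bin/rcp -C austin_cluster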
To check the current configuration information for the GPFS cluster, use the mmlscluster
command (see Example 2-17).
root@austin1:/etc/gpfs_config> mmlscluster
Use the mmlsnsd command to display the current NSD information, as shown in
Example 2-19.
Upon successful completion of the mmcrnsd command, the disk descriptor files are rewritten to
contain the created NSD names in place of the device name, as shown in Example 2-20. This
is done to prepare the disk descriptor files for subsequent usage for creating GPFS file
systems (mmcrfs or mmadddisk commands).
Now all of the NSDs are defined, and you can see the mapping of physical disks to GPFS
NSDs using the command shown in Example 2-21 on page 37.
root@austin1:/> mmlsnsd -a -m
root@austin1:/> lspv
hdisk0 0022be2ab1cd11ac rootvg active
hdisk1 00cc5d5caa5832e0 None
hdisk2 none None
hdisk3 none None
hdisk4 none None
hdisk5 none None
hdisk6 none None
hdisk7 none nsd_tb1
hdisk8 none nsd_tb2
hdisk9 none nsd_tb3
hdisk10 none nsd01
hdisk11 none nsd02
hdisk12 none nsd03
hdisk13 none nsd04
hdisk14 none nsd05
hdisk15 none nsd06
Configure the tiebreakerDisks by using the mmchconfig command as shown in Example 2-22
on page 38. Then, run the mmlsconfig command to check if they are on the list.
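A sketch of the commands (note that the GPFS daemon must be stopped on all nodes before the tiebreakerDisks attribute can be changed):
root@austin1:/etc/gpfs_config> mmshutdown -a
root@austin1:/etc/gpfs_config> mmchconfig tiebreakerDisks="nsd_tb1;nsd_tb2;nsd_tb3"
root@austin1:/etc/gpfs_config> mmstartup -a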
root@austin1:/etc/gpfs_config> mmlsconfig
Configuration data for cluster austin_cluster.austin1_interconnect:
-------------------------------------------------------------------
clusterName austin_cluster.austin1_interconnect
clusterId 720967500852369612
clusterType lc
autoload yes
useDiskLease yes
maxFeatureLevelAllowed 906
tiebreakerDisks nsd_tb1;nsd_tb2;nsd_tb3
[austin1_interconnect]
takeOverSdrServ yes
To create the file systems, run the mmcrfs command. We created two file systems: /orabin
for the Oracle binaries (Example 2-24 on page 39) and /oradata for the Oracle database files
(Example 2-25 on page 39).
GPFS: 6027-531 The following disks of orabin will be formatted on node austin1:
nsd05: size 10485760 KB
nsd06: size 10485760 KB
GPFS: 6027-540 Formatting file system ...
GPFS: 6027-535 Disks up to size 25 GB can be added to storage pool 'system'.
Creating Inode File
Creating Allocation Maps
Clearing Inode Allocation Map
Clearing Block Allocation Map
GPFS: 6027-572 Completed creation of file system /dev/orabin.
mmcrfs: 6027-1371 Propagating the cluster configuration data to all
affected nodes. This is an asynchronous process.
Tip: We recommend that you have 50000 inodes for a file system that is used for Oracle
binaries if you plan to install Oracle Clusterware and the database in this file system. The
mmcrfs -N option is for the maximum number of files in the file system. This value defaults
to the size of the file system divided by 1M. Therefore, we intentionally used mmcrfs -N
50000 for the /orabin file system.
GPFS: 6027-531 The following disks of oradata will be formatted on node austin2:
nsd01: size 10485760 KB
nsd02: size 10485760 KB
nsd03: size 10485760 KB
nsd04: size 10485760 KB
GPFS: 6027-540 Formatting file system ...
GPFS: 6027-535 Disks up to size 51 GB can be added to storage pool 'system'.
Creating Inode File
Creating Allocation Maps
Clearing Inode Allocation Map
Clearing Block Allocation Map
GPFS: 6027-572 Completed creation of file system /dev/oradata.
mmcrfs: 6027-1371 Propagating the cluster configuration data to all
affected nodes. This is an asynchronous process.
The following list gives a short explanation for the mmcrfs command parameters shown in
Example 2-25:
/oradata This parameter is the mount point directory of the GPFS.
oradata This parameter is the name of the file system to be created, as it
will appear in the /dev directory. File system names do not need to be
fully qualified; oradata is as acceptable as /dev/oradata. However, file
system names must be unique within a GPFS cluster. Do not specify
an existing entry in /dev.
Tip: In an Oracle with GPFS environment, we generally recommend a GPFS block size of
512 KB. Using 256 KB block size is recommended when there is significant file activity
other than Oracle, or there are many small files not belonging to the database. A block size
of 1 MB is recommended for file systems of 100 TB or larger. See Oracle Metalink doc ID:
302806.1 at:
http://metalink.oracle.com
-M MaxMetadataReplicas
This parameter is the default maximum number of copies of inodes,
directories, and indirect blocks for a file. Valid values are 1 and 2 but
cannot be less than DefaultMetadataReplicas. The default is 1.
-m DefaultMetadataReplicas
This parameter is the default number of copies of inodes, directories,
and indirect blocks for a file. Valid values are 1 and 2 but cannot be
greater than the value of MaxMetadataReplicas. The default is 1.
-R MaxDataReplicas This parameter is the default maximum number of copies of data
blocks for a file. Valid values are 1 and 2 but cannot be less than
DefaultDataReplicas. The default is 1.
-r DefaultDataReplicas
This parameter is the default number of copies of each data block for a
file. Valid values are 1 and 2 but cannot be greater than
MaxDataReplicas. The default is 1.
-n NumNodes This parameter is the estimated number of nodes that will mount the file
system. This value is used as a best estimate for the initial size of
several file system data structures. The default is 32. When you create
a GPFS file system, you might want to overestimate the number of
nodes that mount the file system. GPFS uses this information for
creating data structures that are essential for achieving maximum
parallelism in file system operations. Although a large estimate
consumes additional memory, underestimating the data structure
allocation can reduce the efficiency of a node when it processes
parallel requests, such as the allotment of disk space to a file. If you
cannot predict the number of nodes that will mount the file system, apply
the default value. If you are planning to add nodes to your system,
specify a number larger than the default. However, do not make
estimates that are unrealistic. Specifying an excessive number of
nodes can have an adverse effect on buffer operations.
-N NumInodes This parameter is the maximum number of files in the file system. This
value defaults to the size of the file system at creation, divided by 1 M,
and can be specified with a suffix, for example 8 K or 2 M. This value
is also constrained by the formula:
maximum number of files = (total file system space/2) / (inode size + subblock size)
Tip: For file systems that will perform parallel file creates, if the total number of free inodes
is not greater than 5% of the total number of inodes, there is the potential for slowdown in
file system access. Take this into consideration when changing your file system.
-v {yes | no} Verify that specified disks do not belong to an existing file system. The
default is -v yes. Specify -v no only when you want to reuse disks that
are no longer needed for an existing file system. If the command is
interrupted for any reason, you must use the -v no option on the next
invocation of the command.
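To illustrate how these parameters combine, the /oradata file system could be created with a command similar to the following sketch (the descriptor file name is an assumption; block size, replication settings, and node count follow the discussion above):
root@austin1:/etc/gpfs_config> mmcrfs /oradata /dev/oradata -F /etc/gpfs_config/gpfs_disks_data \
-A yes -B 512K -n 2 -m 1 -M 2 -r 1 -R 2 -v yes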
Example 2-26 shows the file system information. You can see block size, maximum number of
inodes, number of replicas, and so on.
You can check the mounted file systems using the mmlsmount (GPFS) and mount (system)
commands, as shown in Example 2-28 on page 43.
root@austin1:/> mount
node mounted mounted over vfs date options
-------- --------------- --------------- ------ ------------ ---------------
/dev/hd4 / jfs2 Oct 03 11:10 rw,log=/dev/hd8
/dev/hd2 /usr jfs2 Oct 03 11:10 rw,log=/dev/hd8
/dev/hd9var /var jfs2 Oct 03 11:10 rw,log=/dev/hd8
/dev/hd3 /tmp jfs2 Oct 03 11:10 rw,log=/dev/hd8
/dev/hd1 /home jfs2 Oct 03 11:11 rw,log=/dev/hd8
/proc /proc procfs Oct 03 11:11 rw
/dev/hd10opt /opt jfs2 Oct 03 11:11 rw,log=/dev/hd8
/dev/fslv00 /oracle jfs2 Oct 03 11:11 rw,log=/dev/hd8
/dev/orabin /orabin mmfs Oct 03 11:13
rw,mtime,atime,dev=orabin
/dev/oradata /oradata mmfs Oct 03 11:13
rw,mtime,atime,dev=oradata
To check the available space in a GPFS file system, use the mmdf command, as shown in
Example 2-29. The system df command can display inaccurate information about GPFS file
systems; thus, we recommend using the mmdf command. This command displays information,
such as free blocks, that is presented by failure group and storage pool.
(total)            20971520            20893696 (100%)        1376 ( 0%)
Tip:
prefetchThreads is for large sequential file I/O, whereas worker1Threads is for random,
small file I/O.
worker1Threads is primarily used for random read or write requests that cannot be
prefetched, random I/O requests, or small file activity. worker1Threads controls the
maximum number of concurrent file operations at any one instant. If there are more
requests than that, the excess will wait until a previous request has finished (default: 48,
maximum: 548).
These changes through the mmchconfig command take effect upon restart of the GPFS
daemon.
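For example (the values are illustrative only and must be sized for your workload), the attributes are changed with mmchconfig and the GPFS daemon is then restarted:
root@austin1:/> mmchconfig prefetchThreads=72,worker1Threads=150
root@austin1:/> mmshutdown -a
root@austin1:/> mmstartup -a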
The number of AIX AIO kprocs to create is approximately the same as the GPFS
worker1Threads setting:
– The AIX AIO maxservers setting is the number of kprocs PER CPU. We suggest setting
this value slightly larger than worker1Threads divided by the number of CPUs.
– Set the Oracle read-ahead value to prefetch one or two full GPFS blocks. For example,
if your GPFS block size is 512 KB and the Oracle block size is 16 KB, set the Oracle
multiblock read count to either 32 or 64 blocks.
Do not use the dio option on the mount command, because using the dio option forces
DIO when accessing all files. Oracle automatically uses DIO to open database files on
GPFS.
When running Oracle RAC 10g R1, we suggest that you increase the value for
OPROCD_DEFAULT_MARGIN to at least 500 to avoid possible random reboots of
nodes.
Note: The Oracle Clusterware I/O fencing daemon has its margin defined in two places
in the /etc/init.cssd, and the values are 500 and 100 respectively. Because it is defined
twice in the same file, the latter value of 100 is used; thus, we recommend that you
remove the second (100) value.
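Before editing the file, you can check where and how often the value is defined; a simple sketch (the exact matches differ by Clusterware release):
root@austin1:/> grep -n OPROCD_DEFAULT_MARGIN /etc/init.cssd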
From a GPFS perspective, even 500 milliseconds might be too low in situations where
node failover can take up to one minute or two minutes to resolve. However, if during node
failure, the surviving node is already performing direct IO to the oprocd control file, the
surviving node has the necessary tokens and indirect block cached and therefore does not
have to wait during failover.
Oracle databases requiring high performance usually benefit from running with a pinned
Oracle SGA, which is also true when running with GPFS, because GPFS uses DIO, which
requires that the user I/O buffers (in the SGA) are pinned. GPFS normally pins the I/O
buffers only for the duration of the I/O operation.
We use raw disks for OCR and voting disk for Oracle Clusterware. During the Oracle
Clusterware installation, you are prompted to provide two OCR disks and three CRS voting
(vote) disks. Even though it is possible to install the Oracle Clusterware with one OCR disk
and one voting (vote) disk, we encourage you to have multiple OCR disks and voting (vote)
disks for availability. In Example 2-31, we select rhdisk2 and rhdisk3 for OCR disks. We use
rhdisk4, rhdisk5, and rhdisk6 as voting (vote) disks.
Example 2-31 Selecting raw physical disks for OCR and CRS voting (vote) disks
root@austin1:/> ls -l /dev/rhdisk*
crw------- 1 root system 20, 3 Sep 14 17:45 /dev/rhdisk2
crw------- 1 root system 20, 4 Sep 14 17:45 /dev/rhdisk3
crw------- 1 root system 20, 5 Sep 14 17:45 /dev/rhdisk4
crw------- 1 root system 20, 6 Sep 14 17:45 /dev/rhdisk5
crw------- 1 root system 20, 7 Sep 14 17:45 /dev/rhdisk6
Creating special device files for OCR and CRS voting disks
We create special files (using the mknod command) for OCR and voting (vote) disks, as shown
in Example 2-32 on page 47. Then, change ownership and permission for those files. You
must run these commands on both nodes:
mknod SpecialFileName { b | c } Major# Minor#
b indicates the special file is a block-oriented device.
c indicates the special file is a character-oriented device.
Example 2-32 Creating special files for ocr and vote disks
root@austin1:/> mknod /dev/ocrdisk1 c 20 3
root@austin1:/> mknod /dev/ocrdisk2 c 20 4
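A sketch of the remaining commands for the voting (vote) disks and of the ownership and permission changes (minor numbers follow Example 2-31; device names, owner, and group are illustrative and must match your installation):
root@austin1:/> mknod /dev/votedisk1 c 20 5
root@austin1:/> mknod /dev/votedisk2 c 20 6
root@austin1:/> mknod /dev/votedisk3 c 20 7
root@austin1:/> chown root.dba /dev/ocrdisk1 /dev/ocrdisk2
root@austin1:/> chmod 640 /dev/ocrdisk1 /dev/ocrdisk2
root@austin1:/> chown oracle.dba /dev/votedisk1 /dev/votedisk2 /dev/votedisk3
root@austin1:/> chmod 644 /dev/votedisk1 /dev/votedisk2 /dev/votedisk3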
Verify and change the reservation_policy to no_reserve on the disks (rhdisk2, rhdisk3,
rhdisk4, rhdisk5, and rhdisk6 on both nodes) that are used for OCR and CRS voting (vote)
disks as shown in Example 2-33. Run these commands on both nodes.
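A sketch of verifying and changing the policy on one disk (repeat for hdisk2 through hdisk6 on both nodes; the attribute name can vary with the disk driver, reserve_policy being the usual one for MPIO disks):
root@austin1:/> lsattr -El hdisk2 -a reserve_policy
root@austin1:/> chdev -l hdisk2 -a reserve_policy=no_reserve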
root@austin1:/> df -g
Filesystem GB blocks Free %Used Iused %Iused Mounted on
/dev/hd2 2.25 0.10 96% 41949 62% /usr
/dev/hd9var 0.06 0.05 28% 494 5% /var
/dev/hd3 1.06 1.03 4% 57 1% /tmp
/dev/hd1 0.50 0.49 2% 76 1% /home
root@austin2:/> /usr/sbin/slibclean
root@austin2:/> df -g
Filesystem GB blocks Free %Used Iused %Iused Mounted on
/dev/hd2 2.31 0.16 94% 41921 51% /usr
/dev/hd9var 0.06 0.04 31% 494 5% /var
/dev/hd3 1.06 0.78 27% 833 1% /tmp
/dev/hd1 0.44 0.44 1% 28 1% /home
Important: Oracle Clusterware uses the interface name (en3, for example) to define the
interconnect network to be used. It is mandatory that this interface name is the same on
all the nodes in the cluster. You have to enforce this requirement prior to installing Oracle
Clusterware.
You need a graphical user interface (GUI) to run the Oracle Universal Installer (OUI). Export
DISPLAY to the appropriate value, change to the directory that contains the Oracle
install packages, and run the installer as the oracle user, as shown in Figure 2-2 on page 50.
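A sketch of this sequence (the display address and staging directory are illustrative):
{austin1:oracle}/home/oracle -> export DISPLAY=10.1.1.50:0.0
{austin1:oracle}/home/oracle -> cd /orabin/clusterware/Disk1
{austin1:oracle}/orabin/clusterware/Disk1 -> ./runInstaller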
You are asked whether rootpre.sh has been run, as shown in Figure 2-3 on page 51. Make sure to
execute Disk1/rootpre/rootpre.sh as the root user on each node. After running rootpre.sh on
both nodes, type <y> to proceed to the next step.
Note: If you have the Oracle code on a CD-ROM mounted on one of the nodes, you need
to NFS export the CD-ROM mount directory and mount it on the other node. You can
also remote copy the files to the other node, then run rootpre.sh on both nodes.
However, because in our test environment the CRS Disk1 (code) is on a GPFS shared
file system, there is no need to make an NFS mount or copy files to the other node.
5. Repeat for the second node and check the cluster information as shown in Figure 2-8.
Important: Oracle Clusterware uses the interface name (en3 in our example) to define the
interconnect network to use. It is mandatory that this interface name is the same on all
the nodes in the cluster. You must enforce this requirement prior to installing Oracle
Clusterware.
Note: At the final stage of executing root.sh on the second node, the VIP Configuration
Assistant starts automatically. However, if you get an error message stating that “The given
interface(s), ‘en2’ is not public”, use public interfaces to configure the VIPs: run
$ORACLE_CRS_HOME/bin/vipca as root in a GUI environment.
The reason for this error message: when verifying the IP addresses, VIPCA checks whether
each address is routable. In this case, it finds that the addresses are non-routable (for
example, addresses such as 192.168.* and 10.10.*). Oracle is aware that these addresses can
be made public, but because they are mostly used for private networks, it displays this error
message.
Wait for the patch installation process to complete (Figure 2-28 on page 75).
For more information about Oracle 9i RAC on an IBM System p setup, refer to Deploying
Oracle9i RAC on eServer Cluster 1600 with GPFS, SG24-6954.
Oracle Clusterware
Oracle 10g RAC needs a “private” interconnect network for cache fusion traffic, which
includes data block exchange between instances, plus service messages. Depending on the
amount of load and the type of database operations (select, insert, updates, cross-update,
and so forth) running on the instances, the throughput on this interconnect can be high. Most
Oracle database traffic between instances is based on UDP protocol.
The term “private” used for the interconnect means that this network must be separated from
the client access (public) network (used by the clients to access the database). The private
interconnect is limited to the nodes hosting a RAC instance, whereas the public network might
connect to a WAN. However, the term “private” does not mean that another cluster layer (in this
case, GPFS) cannot share it.
Important: Oracle Clusterware uses the interface name (en3, for example) to define the
interconnect network to be used. It is mandatory that this interface name is the same on
all the nodes in the cluster. You have to enforce this requirement prior to installing Oracle
Clusterware.
GPFS
As a cluster file system, GPFS needs an interconnect network. In a typical Oracle database
and GPFS configuration, the actual data I/O flows through the host bus adapters (for
example, Fibre Channel) and not through the IP network (interconnect), which allows
superior performance. The GPFS interconnect network is used for service messages and the
token management mechanism. However, Oracle 10g RAC comes with its own data
synchronization mechanism and does not use GPFS locking.
GPFS uses TCP for its internal messages and relies on IP addresses, not on a specific
interface name. Because data I/O does not use the IP network, GPFS does not require
high network bandwidth; thus, the GPFS interconnect can be overlapped with the Oracle
interconnect (same network).
Note: Even though the Oracle 10g RAC interconnect requires special attention, sizing for this
network is based on the same principles as for any other IP network. If the interconnect is
properly sized, this network can be shared with other clustering traffic, such as GPFS, which
adds almost no load to the network. Therefore, the GPFS interconnect traffic can be
mixed with the Oracle interconnect traffic without impacting RAC performance.
Note: We recommend using a single network for both Oracle 10g RAC and GPFS
interconnects.
Note: In our configuration, Oracle public network is using the same adapter, en2, on both
nodes, and en3 is dedicated to RAC and GPFS interconnect traffic.
Example 2-36 presents the network interface configuration for both public and private
networks on node austin1.
Oracle Clusterware installation fails if the network interface names are not the same on all nodes in
the cluster. The VIP addresses are configured as IP aliases on the public network, as shown
in Example 2-37.
Oracle does not support the use of crossover cables between two nodes. Use a switch in all
cases. A switch is needed for interconnect network failure detection by Oracle Clusterware.
Although there are no AIX issues, crossover cables are neither recommended nor supported.
Tip: For more information, see the Oracle Metalink note 220970.1 at:
http://metalink.oracle.com
Figure 2-31 Virtual interconnect network for nodes in the same server
A virtual network environment can be used for development, test, or benchmark purposes
where no high availability is required for client access, and when the nodes are different
logical partitions (LPARs) of the same physical server. In this case, a virtual network can be
created without the need for physical network interfaces or a Virtual I/O Server (VIOS).
This virtual Ethernet network practically never fails, because it does not rely on physical
adapters or cables. A virtual network inside a physical server is highly available in itself;
there is no need to secure it with EtherChannel, for example, as you do for physical Ethernet
networks.
This virtual network is a perfect candidate for RAC and GPFS interconnects when all cluster
nodes reside inside one physical server (for example, for a test environment). The bandwidth
is 1 Gb/s minimum, but it can be much higher. The latency varies depending on the overall
CPU load for the entire server.
In the current IBM System p5 implementation, external network access from a virtual Ethernet
network requires a VIOS with a shared Ethernet adapter (SEA). It is possible to design,
implement, and use an interconnect network similar to the one shown in Figure 2-32 on
page 82. A typical configuration uses two VIOSs per frame with SEA failover, which
provides good high availability for the network.
Figure 2-32 Virtual interconnect network for nodes on different servers (see previous Note)
The use of a single VIOS (and thus, no SEA failover) for the interconnect network is not
resilient enough. The VIOS is a single point of failure. This configuration is not recommended,
although it is supported.
Although this setup is not the best one for RAC interconnect purposes, it remains the state of
the art for all other usages, including public or administrative networks.
For another virtual network setup using Etherchannel over dual VIOS to protect the network
against failures (instead of using SEA failover), refer to 7.1, “Virtual networking environment”
on page 229.
When RAC nodes reside on different servers (which should be the standard configuration),
we recommend that you set up a physical interconnect network by using dedicated adapters
that are not managed by a VIOS, as shown in Figure 2-33 on page 83.
Figure 2-33 Physical Ethernet interconnect network
Note: Except for testing or development, we recommend nodes on different hardware with
a physical network as interconnect.
Jumbo frames
Most of the modern 1Gb (or higher) Ethernet network switches support a feature called
“jumbo frames”, which allows to them handle a maximum packet size of 9000 bytes instead of
traditional Ethernet frames (1500 bytes). You can set this parameter at the interface level and
switch. Jumbo frames are not activated by default.
Example 2-38 shows how to enable the jumbo frames for one adapter.
[Entry Fields]
Ethernet Adapter ent0
Description 2-Port 10/100/1000 Ba>
Status Available
Location 03-08
Rcv descriptor queue size [1024] +#
TX descriptor queue size [512] +#
Software transmit queue size [8192] +#
Transmit jumbo frames yes +
Enable hardware TX TCP resegmentation yes +
Enable hardware transmit and receive checksum yes +
Media speed Auto_Negotiation +
Enable ALTERNATE ETHERNET address no +
ALTERNATE ETHERNET address [0x000000000000] +
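As an alternative to SMIT, a command-line sketch (the adapter and interface names are illustrative; -P defers the change until the interface or system is restarted):
root@austin1:/> chdev -l ent0 -a jumbo_frames=yes -P
root@austin1:/> chdev -l en0 -a mtu=9000 -P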
If you use Etherchannel, enabling jumbo frames when creating the Etherchannel pseudo
device automatically sets transmit jumbo frames to yes for all the underlying interfaces
(starting in AIX 5.2), as shown in Example 2-39.
[Entry Fields]
EtherChannel / Link Aggregation Adapters ent0,ent1 +
Enable Alternate Address no +
Alternate Address [] +
Enable Gigabit Ethernet Jumbo Frames yes +
Mode round_robin +
Hash Mode default +
Backup Adapter +
Automatically Recover to Main Channel yes +
Perform Lossless Failover After Ping Failure yes +
Internet Address to Ping []
Number of Retries [] +#
Retry Timeout (sec) [] +#
Switches and all other networking components involved must support jumbo frames.
Whenever possible, using jumbo frames for the interconnect network is a good choice.
Jumbo frames reduce the number of packets (and thus data fragmentation) under
heavy loads. If all of your networks are 1 Gb or faster, and none of them are 10 or 100 Mb,
jumbo frames can be enabled everywhere.
Note: As long as the switches support jumbo frames, we recommend using jumbo frames
for interconnect network.
Etherchannel
Etherchannel is a network port aggregation technology that allows several Ethernet adapters
to be put together to form a single pseudo-Ethernet device. In our test environment, on nodes
austin1 and austin2, the ent0 and ent1 adapters have been aggregated to form the logical
device called ent2; the corresponding interface, en2, is configured with an IP address. The
system and the remote hosts consider these aggregated adapters as one logical device (interface).
All adapters in an Etherchannel must be configured for the same speed (1Gb, for example)
and must be full duplex. Mixing adapters of different speeds in the same Etherchannel is not
supported.
In order to achieve bandwidth aggregation, all physical adapters have to be connected to the
same switch, which must also support Etherchannel.
You can have up to eight primary Ethernet adapters and only one backup adapter per
Etherchannel.
Both Etherchannel and IEEE 802.3ad Link Aggregation require switches capable of handling
these protocols. Certain switches can auto-discover the IEEE 802.3ad ports to aggregate.
Etherchannel needs configuration at the switch level to define the grouped ports.
IEEE 802.3ad
According to the IEEE 802.3ad specification, the packets are always distributed in the
standard fashion, never in a round-robin mode.
Example 2-39 on page 84 shows you how to set the round-robin mode when creating the
Etherchannel.
Note: For Oracle 10g RAC private interconnect network, we recommend an Etherchannel
using the round-robin algorithm.
Remember that all adapters must be connected to the same switch in order to aggregate the
bandwidth. Although we have several adapters, the switch itself can be considered a
single point of failure: the entire Etherchannel is lost if the switch is unplugged or fails, even if
the network adapters are still available.
To address this issue and remove the last single point of failure found in the interconnect
networks, Etherchannel provides a backup interface. In the event that all of the adapters in
the Etherchannel fail, or if the primary switch fails, the backup adapter will be used to send
and receive all traffic. In this case, the bandwidth is the one provided by the single backup
adapter, with no aggregation any longer. When any primary link in the Etherchannel is
restored, the service is moved back to the Etherchannel. Only one backup adapter per
Etherchannel can be configured. The adapters configured in the primary Etherchannel are
used preferentially over the backup adapter. As long as at least one of the primary adapters is
functional, it is used.
Of course, the backup adapter has to be connected to a separate switch and linked to a
different network infrastructure. It is not necessary for the backup switch to be Etherchannel
capable or enabled.
Figure 2-34 on page 87 shows how to design a resilient Etherchannel and how to connect the
physical network adapters to the switches. Interfaces en2 and en3 are used together in an
aggregated mode and are connected on the Etherchannel capable switch. Interface en1 is
connected on the backup switch and is used only if both en2 and en3 fail, or if the primary
switch has problems.
Figure 2-34 Resilient Etherchannel architecture
When set with an IP address, this mechanism overrides clusterware settings, and the
specified IP address is used for the interconnect traffic, including Oracle Global Cache
Service (GCS), Global Enqueue Service (GES), and Interprocessor Parallel Query (IPQ). If
set with two addresses, both addresses are used in a load balancing mode, but as soon as
one link is down, all interconnect traffic is stopped, because the failover mode is turned off.
To query (using Oracle SQL client) the network used by Oracle 10g RAC for its private
interconnect usage, see Example 2-41.
SQL>
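Example 2-41 is shown only partially above; a sketch of the kind of query used (gv$cluster_interconnects is a standard Oracle 10g view):
SQL> select inst_id, name, ip_address, is_public from gv$cluster_interconnects;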
With GPFS, it is possible to share the same set of binary files among all instances and thus
minimize the effort of upgrading or patching software. The following components can be
stored on GPFS file systems:
Oracle database files
Oracle Clusterware files (OCR and voting disks)
Oracle Flash Recovery Area
Oracle archive log destination
Oracle Inventory
Oracle database binaries (ORACLE_HOME)
Oracle Clusterware binaries (ORA_CRS_HOME)
Oracle database log/trace files
Oracle Clusterware log/trace files
Oracle datafiles, Oracle Clusterware OCR and voting disks, Oracle Flash Recovery Area, and
Oracle archive log destination require shared storage. However, this is not mandatory for the
remaining components. For these components, you can choose a shared space (file system)
or individual storage space on each cluster node. The advantage of using GPFS for these
components is ease of administration as well as the possibility to access files belonging to a
crashed node, before the node has been recovered. The disadvantage of this solution is the
extra layer that GPFS introduces and the constraint of not being able to perform Oracle rolling
upgrades.
When using a shared file system for Oracle binaries, you need to make sure that all instances
are shut down before code upgrades, because code files for Oracle RAC, as well as Oracle
Clusterware, cannot be changed while the instance is running.
To shut down the Oracle cluster, refer to the readme file shipped with the patch code. You
must make sure that all database instances, Enterprise Manager Database Control,
iSQL*Plus, and Oracle Clusterware processes are shut down.
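A sketch of the shutdown sequence before patching shared binaries (the database name is a placeholder; crsctl must be run as root on every node):
{austin1:oracle}/home/oracle -> emctl stop dbconsole
{austin1:oracle}/home/oracle -> isqlplusctl stop
{austin1:oracle}/home/oracle -> srvctl stop database -d <dbname>
root@austin1:/> crsctl stop crs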
Similar to OCR and voting disks, you can argue that having Oracle Clusterware binaries and
log files on GPFS might cause clusterware malfunction and thus cause node eviction in case
of GPFS freeze or an erroneous configuration. Furthermore, Oracle Clusterware does
support rolling upgrades, which will not work with a shared binaries installation. In fact,
OUI does not seem to fully recognize that Oracle Clusterware is installed on shared space.
In conclusion, even though GPFS can be used to provide shared storage for Oracle
Clusterware throughout the environment, we recommend using local file systems for the
Oracle Clusterware code.
Figure 2-35 on page 90 shows an example of a partitioned eight CPU IBM System p5 server.
Processors, memory, and disks are shared among partitions using virtualization capabilities.
Through the Hardware Management Console (HMC), an administrator can dynamically
adjust running partitions by changing the number of assigned processors, the size of
memory, and physical or virtual adapters. This capability allows for better utilization of all
server resources by moving them to partitions that have higher requirements.
Oracle 10g Database is DLPAR aware, which means that it is capable of adapting to changes
in the LPAR configuration and making use of additional (dynamically added) resources. This
section describes how the Oracle database exploits dynamic changes in processors and
memory when running in an LPAR.
Note: When running Oracle 10g RAC on LPAR nodes, we recommend that you have
LPARs located on separate System p servers in order to avoid single points of failure, such
as the power supply, Central Electronic Complex (CEC), system backplane, and so forth.
Figure 2-35 Example of a partitioned eight-CPU IBM System p5 server managed through the HMC
The size of virtual memory allocated by Oracle at startup time is equal to the value of the
SGA_MAX_SIZE parameter, but only the part specified by SGA_TARGET is actually used.
This means that the Oracle database can start with a larger SGA_MAX_SIZE than the
amount of memory assigned to the partition at Oracle startup time. SGA_TARGET can be
increased up to the limit of the physical memory available to the LPAR. By adding more memory
to the partition, SGA_TARGET can be increased as well. The administrator must anticipate
the amount of memory that can be given to the instance and set the SGA_MAX_SIZE
parameter accordingly.
The following Oracle views are useful to monitor the behavior of the dynamic SGA:
v$sga view displays summary information about SGA
v$sgastat displays detailed information about SGA
v$sgainfo displays size information about SGA, including sizes of different SGA
components, granule size, and free memory
v$sga_dynamic_components displays current, minimum, and maximum size for the
dynamic SGA components
v$sga_dynamic_free_memory displays information about the amount of SGA memory
that is available for future dynamic SGA operations
v$sga_resize_ops displays information about the last 400 completed SGA resize
operations
The AIX 5L operating system does not allow the removal of pinned memory, which means
that when using a pinned SGA, the database administrator can neither reduce the effective size
of SGA_TARGET nor remove real memory from the LPAR. For this reason, when
using a pinned SGA, it is not possible to change the SGA_TARGET value in order to move
memory out of the LPAR. When the SGA is not pinned, this is possible.
Note: When specifying pinned memory for SGA, an instance does not start unless there is
enough memory for the LPAR to host the SGA_MAX_SIZE. Also, DLPAR memory
operations are not permitted on memory reserved for SGA_MAX_SIZE.
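A sketch of enabling a pinned SGA (both settings are required; the maxpin% value is illustrative and must leave enough unpinned memory for AIX):
root@austin1:/> vmo -p -o v_pinshm=1 -o maxpin%=80
SQL> alter system set lock_sga=true scope=spfile;
Restart the instance for the LOCK_SGA change to take effect.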
Example 2-42 Physical memory available to AIX operating system before addition
{texas:oracle}/orabin/ora102/dbs -> prtconf | grep "Memory Size"
Memory Size: 2560 MB
Good Memory Size: 2560 MB
Figure 2-37 on page 93 shows an additional 1 GB of memory assigned to this partition with
the Hardware Management Console.
After this operation, AIX sees more physical memory (3.5 GB), as shown in Example 2-44.
At this point, Oracle allocates only 1 GB of memory for SGA (SGA_TARGET parameter
value). Output from the Oracle sqlplus command is in Example 2-45.
Example 2-45 Size information about SGA memory, including free SGA
SQL> select * from v$sgainfo;
11 rows selected.
The next step is to change the SGA_TARGET value, so Oracle can use additional memory
segments. In Example 2-46, SGA_TARGET is set to 3.5 GB (3584 MB).
System altered.
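A sketch of the command behind Example 2-46, which produced the preceding "System altered." message:
SQL> alter system set sga_target=3584M;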
At this point, v$sgainfo view values correspond to new values. Only about 512 MB of SGA
memory is available (see Example 2-47).
Example 2-47 Size information about SGA memory after resizing SGA_TARGET
SQL> select * from v$sgainfo;
11 rows selected.
Things are more complicated when micropartitioning is used. The AIX operating system
sees virtual processors instead of physical ones, because the kernel and its scheduler have
to see a whole number of processors.
Up to 10 virtual processors can be defined per assigned processing unit (1.0 CPU), and,
conversely, a single virtual processor can be backed by as little as 0.1 of a physical processor.
With dynamic partitioning, both entitled capacity (amount of processing units) and the
number of virtual processors can be changed dynamically. When necessary, both entitled
capacity and the number of virtual processors can change at the same time.
When the capacity is increased, applications run faster, because the power hypervisor
assigns more physical processor time to each virtual processor. By increasing the number of
virtual CPUs only, it is unlikely that the application runs faster, because the overall amount of
processing units does not change.
All running applications gain performance when the capacity is increased in an LPAR. Oracle
also recognizes new virtual processors (because they appear the same way as dedicated
CPUs on AIX) and adjusts its SQL optimizer plans.
With POWER5 processor and AIX V5.3, Simultaneous Multi-Threading (SMT) is introduced.
With SMT, the POWER5 processor gets instructions from more than one thread. What
differentiates this implementation is its ability to schedule instructions for execution from all
threads concurrently.
With SMT, the system dynamically adjusts to the environment, allowing instructions to
execute from each thread (if possible) and allowing instructions from one thread to use all the
execution units if the other thread encounters a long latency event. The POWER5 design
implements two-way SMT on each CPU.
The simultaneous multi-threading policy is controlled by the operating system and is partition
specific.
The action shown in Example 2-49 does not change the actual partition configuration. In the
next step, the number of virtual processors changes in the partition to two, as presented on
Figure 2-39 on page 98.
Example 2-50 shows the number of processors that change in AIX and Oracle.
Additional information appears in the Oracle alert.log file (Example 2-51 on page 99).
Oracle does not change the CPU_COUNT value if the number of CPUs is more than three
times the CPU count at instance startup. For example, after starting an Oracle instance with one
CPU and increasing the number of processors to four, CPU_COUNT is set to three and
the entry shown in Example 2-52 is generated in the alert.log file.
When operating in the AIX 5L and System p environment, you can dynamically add or
remove CPUs from an LPAR with an active Oracle instance. The AIX 5L kernel scheduler
automatically distributes work across all CPUs. In addition, Oracle Database 10g dynamically
detects the change in CPU count and exploits it with parallel query processes.
The purpose of this scenario is to change the storage space for Oracle files (single instance)
from JFS/JFS2 or raw devices to GPFS. There is no Oracle clustering involved at this time. At
the end of this migration, a single instance database is still running on a single node GPFS
cluster.
In this section, we consider two source scenarios. The target is GPFS, but the source can be
either JFS2 file systems or raw partitions for the data files.
Note: Although a single node GPFS cluster is not officially supported, this scenario is in
fact a step toward a multi-node RAC environment based on GPFS. For this reason, the GPFS
file system parameters are configured as for a multi-node cluster (the estimated number of
nodes that will mount the file system, the -n option of mmcrfs).
For details about GPFS considerations, refer to the GPFS V3.1 Concepts, Planning, and
Installation Guide, GA76-0413, and section 2.1.7, “Special consideration for GPFS with
Oracle” on page 44.
The starting point for this scenario is a single database instance, using JFS2 (local) file
system. Oracle code files are located in /orabin directory (a separate file system) and Oracle
data files are in /oradata file system. Oracle Inventory is stored in /home/oracle directory.
Oracle Inventory, data files, and code are moved to GPFS.
To move ORACLE_HOME and Oracle Inventory from JFS to GPFS, follow these steps:
1. Shut down all Oracle processes.
2. Unmount the /orabin file system and remount it on /jfsorabin.
3. Create a single node GPFS cluster and the NSDs that you are using for GPFS.
4. Create GPFS for /orabin, and mount on /orabin. Make sure that the right permissions exist
for the /orabin GPFS file system, for example, oracle:dba.
5. Copy the entire ORACLE_HOME from JFS2 to GPFS (as oracle user):
cd /jfsorabin; tar cvf - ora102 | (cd /orabin; tar xvf -)
6. Unmount the /jfsorabin file system.
7. Move Oracle Inventory:
a. cd /home/oracle; tar cvf - OraInventory | (cd /orabin; tar xvf -)
b. Update the OraInventory location stored in /etc/oraInst.loc.
Note: On the test system, the file oraInst.loc exists in several places:
root@dallas1:/> find / -name oraInst.loc -print 2> /dev/null
/etc/oraInst.loc
/orabin/ora102/oraInst.loc
/orabin/ora102/bigbend1_GPFSMIG1/oraInst.loc
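A sketch of steps 3 and 4 of the list above (the node name, NSD descriptor file, and file system parameters are illustrative):
root@dallas1:/> mmcrcluster -N dallas1:quorum-manager -p dallas1 -C gpfsmig
root@dallas1:/> mmstartup -a
root@dallas1:/> mmcrnsd -F /tmp/orabin_disks.desc
root@dallas1:/> mmcrfs /orabin /dev/orabin -F /tmp/orabin_disks.desc -B 512K -n 2
root@dallas1:/> mmmount orabin
root@dallas1:/> chown oracle.dba /orabin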
We describe a more complex example covering this topic in 3.4, “Migrating from
HACMP-based RAC cluster to GPFS using RMAN” on page 123 and in 3.5,
“Migrating from RAC with HACMP cluster to GPFS using dd” on page 133.
For this scenario, we conduct a full installation of both Oracle Clusterware and Oracle RAC
software. Software installation is needed, because Oracle must link its binary files to the
system libraries. After software installation is complete, the steps to convert from single
instance to RAC are:
1. Perform basic node preparation: prerequisites, network, and GPFS code.
2. Add the node to the GPFS cluster.
3. Install and configure Oracle Clusterware using OUI.
4. Install Oracle RAC code using OUI.
5. Configure the database for RAC.
6. Set up Transparent Application Failover (TAF).
Tip: A good technique is to use AIX Network Installation Manager (NIM) to “clone” an mksysb of the
existing node.
2. Make sure that you have sufficient free space in /tmp. Oracle Installer requires 600 MB,
but the requirements for the node might be higher.
3. Check that the kernel configuration parameters are identical to those on the existing node.
4. Create the oracle user with the same user and group ID as on the existing node.
5. Set up the oracle user environment and shell limits as on the existing node.
6. Attach the new node to the storage subsystem, and make sure that you can access the
GPFS logical unit numbers (LUNs) from both nodes.
7. Check the remote command execution (rsh/ssh) between nodes.
3.2.2 Add the new node to existing (single node) GPFS cluster
After the preparations described in section 3.2.1, “Setting up the new node” on page 107
have been completed, add the new node to the GPFS cluster by running the mmaddnode
command from the node that is already part of the cluster. After you add the node, make sure
the new node has been added to the cluster, then start the GPFS daemon on the new node
using the mmstartup command. If necessary, use the mmmount command to mount the existing
file systems on the new node.
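A sketch of these commands (the node names are illustrative; adjust them to your cluster):
root@dallas1:/> mmaddnode -N dallas2
root@dallas1:/> mmlscluster
root@dallas1:/> mmstartup -N dallas2
root@dallas1:/> mmmount all -N dallas2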
Note: After the file system has been successfully mounted and is accessible from both
nodes in the cluster, stop GPFS on both nodes and make sure your GPFS cluster and file
systems follow the quorum and availability recommendations:
– Check and adjust the cluster quorum method. Add NSD tiebreaker disks to the
cluster and change the cluster quorum to node quorum with tiebreaker disks (see the
sketch after this note).
– Check and configure the secondary cluster data server.
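A sketch of switching to node quorum with tiebreaker disks and defining the secondary configuration server (GPFS must be down on all nodes; NSD and node names are illustrative):
root@dallas1:/> mmshutdown -a
root@dallas1:/> mmchconfig tiebreakerDisks="tbnsd1;tbnsd2;tbnsd3"
root@dallas1:/> mmchcluster -s dallas2
root@dallas1:/> mmstartup -a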
Note: If an Oracle Clusterware uninstall is needed, note that just running OUI will not cleanly
deinstall Oracle Clusterware. Oracle Metalink Doc ID Note:239998.1 documents this
process. For a complete deinstallation, $ORACLE_CRS_HOME/install contains the scripts
rootdelete.sh and rootdeinstall.sh, which you need to run before using the OUI.
If your nodes continuously reboot, the only chance you have to stop this behavior is to try
to log on to the system as root as soon as you get a login prompt, before Oracle
Clusterware starts, and use the crsctl disable crs command. This command will stop
repeated system reboot. Oracle Metalink can be found at:
http://metalink.oracle.com
We use the default parameters from the 10g installation. However, in the field, you might run
into installations that were upgraded from 9i, or even with MAXINSTANCES deliberately set to
2. The 10g defaults from austin1 are:
MAXLOGFILES 192
MAXLOGMEMBERS 3
MAXDATAFILES 1024
MAXINSTANCES 32
MAXLOGHISTORY 292
One way to verify the parameters is to back up the control file to trace, which produces a file in the
udump destination, as shown in Example 3-1.
Connected to:
Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP and Data Mining options
Session altered.
Database altered.
SQL> Disconnected from Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 -
64bit Production
With the Partitioning, Real Application Clusters, OLAP and Data Mining options
{austin1:oracle}/home/oracle ->
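A sketch of the statements behind Example 3-1 (the tracefile identifier is illustrative):
SQL> alter session set tracefile_identifier = 'ctlfile';
SQL> alter database backup controlfile to trace;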
Edit the pfile and add specific RAC information. In Oracle RAC, each instance needs its own
undo table space and its own redo logs. Thus, the new configuration needs to be similar to
the configuration shown in Example 3-3.
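A hedged sketch of the kind of per-instance entries added to the pfile (the instance names GPFSMIG1 and GPFSMIG2 and the tablespace names are illustrative):
*.cluster_database=true
*.cluster_database_instances=2
GPFSMIG1.instance_number=1
GPFSMIG2.instance_number=2
GPFSMIG1.thread=1
GPFSMIG2.thread=2
GPFSMIG1.undo_tablespace='UNDOTBS1'
GPFSMIG2.undo_tablespace='UNDOTBS2'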
The new environment has specific (per instance) configurations; you must remove any
database-wide configuration that might conflict with this new environment’s configurations. In
our environment, only the undo configuration is in conflict. The parameter that we removed is
shown in Example 3-4.
Note: Even though the spfiles are created in $ORACLE_HOME/dbs, we recommend that
you place the spfiles outside of $ORACLE_HOME.
Other configuration changes might be needed to increase the SGA, because Oracle RAC uses part of
the SGA for the Global Cache Directory (GCD). The size of the GCD depends on the database
size. Also, due to the multi-versioning of data blocks in the instances, an increased SGA might
be needed. Whether to increase the SGA depends on the application load characteristics.
Note: It is impossible to recommend a proper value for the SGA size. Use the buffer cache
advisor to assess the effectiveness of the caching.
Note: You must evaluate various options, such as naming, sizing, and mirroring for redo
logs based on the installation that you are upgrading.
In this scenario, undo and redo logs for the new instance are created in a similar manner to
the old instance. The undo log creation is shown in Example 3-6 on page 112.
Tablespace created.
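A sketch of the statement behind Example 3-6 (the file name and size are illustrative):
SQL> create undo tablespace UNDOTBS2 datafile '/oradata/GPFSMIG/undotbs02.dbf' size 500M;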
In our scenario, we chose to rename the redo logs. Example 3-7 lists the current redo logs.
GROUP# MEMBER
---------- ------------------------------
3 /oradata/GPFSMIG/redo03.log
2 /oradata/GPFSMIG/redo02.log
1 /oradata/GPFSMIG/redo01.log
Example 3-8 on page 113 shows the creation of new redo log files. The naming is chosen so
the thread is part of the redo log file name. Therefore, the new redo logs are named
differently. In a later step, the old files are renamed.
Database altered.
Database altered.
Database altered.
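A sketch of the statements behind Example 3-8 (the sizes are illustrative; the group numbers and file names match the listing that follows):
SQL> alter database add logfile thread 2 group 4 '/oradata/GPFSMIG/redo01-02.log' size 100M;
SQL> alter database add logfile thread 2 group 5 '/oradata/GPFSMIG/redo02-02.log' size 100M;
SQL> alter database add logfile thread 2 group 6 '/oradata/GPFSMIG/redo03-02.log' size 100M;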
To rename the redo logs, the database must be in mount mode. Renaming is shown in
Example 3-9.
GROUP# MEMBER
---------- ------------------------------
3 /oradata/GPFSMIG/redo03.log
2 /oradata/GPFSMIG/redo02.log
1 /oradata/GPFSMIG/redo01.log
4 /oradata/GPFSMIG/redo01-02.log
5 /oradata/GPFSMIG/redo02-02.log
6 /oradata/GPFSMIG/redo03-02.log
6 rows selected.
Database altered.
GROUP# MEMBER
---------- ------------------------------
3 /oradata/GPFSMIG/redo03-01.log
2 /oradata/GPFSMIG/redo02-01.log
1 /oradata/GPFSMIG/redo01-01.log
6 rows selected.
Finally, the new thread is enabled, as shown in Example 3-10. After the new thread is
enabled, the new instance can be started.
Database altered.
Both instances can now be started, and the database can be opened from both nodes.
Note: At this time, you must create certain Oracle RAC specific data dictionary views by
running the catclust.sql file. This action produces a lot of output, so we do not show
running the catclust.sql file here. The catclust file is in $ORACLE_HOME/rdbms/admin.
Example 3-12 on page 115 shows how the crs_stat -t output reflects that srvctl starts the
database and instances.
INSTANCE_NUMBER
---------------
1
shutdown abort
ORACLE instance shut down.
exit
SQL>
REM end shutdown instance 1
SQL>
You add the new node information on the second window of the OUI, which is shown in
Figure 3-2 on page 118. In all other windows, click Next.
When the installation is finished, OUI asks to run root scripts on both the old nodes and the
new nodes (see Figure 3-3 on page 119). Running these root scripts performs all of the
configuration changes required to install Oracle Clusterware and start Oracle Clusterware on
the new node.
2. Next, run the /orabin/crs102/root.sh script on the new node, which is shown in
Example 3-16 on page 120.
Done.
root@bigbend3:/orabin/crs102> crs_stat -t
Name Type Target State Host
------------------------------------------------------------
ora....G1.inst application OFFLINE OFFLINE
ora....G2.inst application OFFLINE OFFLINE
ora.GPFSMIG.db application OFFLINE OFFLINE
ora....D1.lsnr application ONLINE ONLINE bigbend1
ora....nd1.gsd application ONLINE ONLINE bigbend1
ora....nd1.ons application ONLINE ONLINE bigbend1
ora....nd1.vip application ONLINE ONLINE bigbend1
ora....D2.lsnr application ONLINE ONLINE bigbend2
As you can see in the last part of Example 3-16 on page 120, bigbend3 now appears with the
basic Oracle Clusterware services, but not instance and listener, because they are not
configured.
Note: The CRS-0215 error can occur when configuring VIP. According to Oracle Metalink,
this error is caused by default routing configuration. In our test environment, we have
observed the same issue. The VIP is not configured (the vipca command did not complete
successfully). We solve this problem by running the following command:
ifconfig en0 alias 192.168.100.157 netmask 255.255.255.0
The interface is en0, and 192.168.100.157 is the IP address used as the VIP.
Note: When addNode.sh was run with Oracle Inventory on shared storage, the node list
was not updated. When Oracle Inventory resides on local storage, node list was updated
on the local Inventory, but not for all nodes. To update the Oracle Inventory node list for a
specific HOME on a specific node, you can use the following command:
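A sketch of the command (the HOME path and node names are illustrative):
{bigbend1:oracle}/home/oracle -> $ORACLE_HOME/oui/bin/runInstaller -updateNodeList \
ORACLE_HOME=$ORACLE_HOME "CLUSTER_NODES={bigbend1,bigbend2,bigbend3}"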
The initial cluster configuration (Figure 3-4) is based on HACMP; oraclevg is in enhanced
concurrent mode (ECM) and is opened in concurrent mode when HACMP is up and running
(RSCT is responsible for resolving concurrent access).
Figure 3-4 Initial HACMP-based cluster configuration (oraclevg is an ECM volume group managed by HACMP)
All raw devices in this test environment are created with the mklv -B -TO options, thus the
logical volume control block (LVCB) does not occupy the first block of the logical volume.
Special consideration must be taken if the raw devices for Oracle are created without the mklv
-TO option for later use of the dd command to copy data files from raw logical volumes to a file
system, as described in 3.5.1, “Logical volume type and the dd copy command” on page 134.
Example 3-19 on page 124 shows the list of raw devices that we have used for this scenario.
Example 3-20 Migrate control files and data files to GPFS using RMAN
SQL> startup nomount
ORACLE instance started.
System altered.
###Even though there is an RMAN command for copying all data files at once, we
decided to copy one data file at a time, because we want to give the files names
of our own choosing.
database opened
Tablespace created.
Database altered.
Tablespace dropped.
Tablespace altered.
FILE_NAME
----------------------------------------------------------------------------------
-----------
/oradata/temp01.dbf
SQL> alter database add logfile thread 1 group 5 '/oradata/redo5.log' size 120M;
SQL> alter database add logfile thread 1 group 6 '/oradata/redo6.log' size 120M;
SQL> alter database add logfile thread 2 group 7 '/oradata/redo7.log' size 120M;
SQL> alter database add logfile thread 2 group 8 '/oradata/redo8.log' size 120M;
###While repeatedly running SQL> alter system switch logfile; on each node,
drop the logfiles that are in “inactive” or “unused” status one by one.
File created.
File created.
###Remove a previous password file and link to the new password file.
{austin1:oracle}/oracle/ora102/dbs -> ls -l
total 80
-rw-rw---- 1 oracle dba 1552 Sep 29 21:53 hc_austindb1.dat
-rw-rw---- 1 oracle dba 1552 Sep 27 09:32 hc_raw1.dat
-rw-r----- 1 oracle dba 8385 Sep 11 1998 init.ora
-rw-r----- 1 oracle dba 34 Sep 28 15:12 initaustindb1.ora
-rw-r----- 1 oracle dba 12920 May 03 2001 initdw.ora
lrwxrwxrwx 1 oracle dba 17 Sep 28 12:02 orapwaustindb1 -> /dev/
rraw_pwdfile
###Remove a previous password file and link to the new password file on the second
node.
root@austin2:/oracle/ora102/dbs> rm orapwaustindb2
root@austin2:/oracle/ora102/dbs> ln -s /oradata/orapw_austindb orapwaustindb2
Example 3-25 Error when running root.sh without removing the /opt/ORCLcluster directory
root@austin1:/oracle/crs> root.sh
WARNING: directory '/oracle' is not owned by root
Checking to see if Oracle CRS stack is already configured
root@austin1:/oracle/ora102/lib> ls -l libskgxn2*
lrwxrwxrwx 1 oracle dba 32 Oct 03 14:04 libskgxn2.a ->
/opt/ORCLcluster/lib/libskgxn2.a
###Verify the new CRS links for the newly created library files
root@austin1:/oracle/ora102/lib> cd /oracle/crs/lib
root@austin1:/oracle/crs/lib> ls -l libskgxn2*
lrwxrwxrwx 1 oracle system 27 Oct 03 22:49 libskgxn2.a ->
/oracle/crs/lib/libskgxns.a
lrwxrwxrwx 1 oracle system 28 Oct 03 22:49 libskgxn2.so ->
/oracle/crs/lib/libskgxns.so
root@austin1:/oracle/ora102/lib> rm libskgxn2.a
root@austin1:/oracle/ora102/lib> rm libskgxn2.so
root@austin1:/oracle/ora102/lib> ls -l libskgxn2*
lrwxrwxrwx 1 root system 27 Oct 03 23:53 libskgxn2.a ->
/oracle/crs/lib/libskgxns.a
lrwxrwxrwx 1 root system 28 Oct 03 23:54 libskgxn2.so ->
/oracle/crs/lib/libskgxns.so
Connecting to (ADDRESS=(PROTOCOL=tcp)(HOST=)(PORT=1521))
STATUS of the LISTENER
------------------------
Alias LISTENER
Version TNSLSNR for IBM/AIX RISC System/6000: Version 10.2.0.1.0
- Production
Start Date 03-OCT-2007 23:25:58
Uptime 0 days 0 hr. 0 min. 0 sec
Trace Level off
Security ON: Local OS Authentication
SNMP ON
Listener Parameter File /oracle/ora102/network/admin/listener.ora
Listener Log File /oracle/ora102/network/log/listener.log
Listening Endpoints Summary...
(DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=austin2)(PORT=1521)))
The listener supports no services
The command completed successfully
Example 3-29 Register a database and instance using the srvctl command
###Add a database
Depending on the logical volume device subtype (see Table 3-1), you need different options
for the dd command: DS_LVZ or DS_LV. The mklv -TO flag indicates that the logical volume
control block does not occupy the first block of the logical volume; therefore, the space is
available for application data. This logical volume has a device subtype of DS_LVZ. A logical
volume created without this option has a device subtype of DS_LV. For “classic” volume
groups, the devsubtype of a logical volume is always DS_LV. For scalable format volume
groups, the devsubtype of a logical volume is always DS_LVZ, regardless of whether the mklv
-TO flag is used to create the logical volume.
Volume group type      Device subtype   LVCB location                                   Notes
Normal volume group    Always DS_LV     The logical volume control block occupies       The mklv -TO flag is always ignored in a
                                         the first block of the logical volume.          normal volume group.
Big volume group       DS_LV            The logical volume control block occupies       When mklv is used without -TO in a big
                                         the first block of the logical volume.          volume group.
Scalable volume group  Always DS_LVZ    The logical volume control block does not       DS_LVZ (mklv -TO) is always set by default
                                         occupy the first block of the logical volume.   in a scalable volume group.
If the “-TO” flag is used in a big volume group, it will show the following additional attribute:
"DEVICESUBTYPE : DS_LVZ".
If the raw devices are not of the DS_LVZ type, when using the dd command to copy raw devices to a file
system, you must skip the first block to avoid data corruption:
$ dd if=/dev/rraw_control1 of=/oradata/control1.dbf bs=4096 skip=1 count=30720
###Repeat the same process for the other data files (sysaux, users tablespaces), except
the system and undo tablespaces.
For the system and undo tablespaces, because they cannot be taken offline, run the dd command as
shown in Example 3-32 on page 137 while the database is in mount (not open) status.
Database altered.
Example 3-33 Stop CRS and database and upgrade HACMP in the RAC environment
###Check the current status of CRS and database
In this section, we document both methods. The setup used for this exercise is a two-node
cluster with nodes dallas1 and dallas2. GPFS Version 2.3 is installed, and there are two file
systems: /oradata and /orabin. For details, refer to Appendix C, “Creating a GPFS 2.3” on
page 263.
Note: In preparation for any migration or upgrade operation, we strongly recommend that
you save your data and also have a fallback or recovery plan in case something goes
wrong during this process.
Migrating to GPFS 3.1 from GPFS 2.3 consists of the following steps:
1. Stop all file system user activity. For Oracle 10g RAC, stop all file system user activity
by running the following command as the root user on all nodes:
crsctl stop crs
Note: You might also need to run the emctl stop dbconsole and isqlplusctl stop
commands. Any scripts run from cron or other places must be stopped as well.
2. As root, cleanly unmount all GPFS file systems. Do not use force unmount. Use the
fuser -cux command to identify any leftover processes attached to the file system.
3. Stop GPFS on all nodes in the cluster (as root user):
mmshutdown -a
Note: After the upgrade, the output of the mmlsconfig command shows the same
maxFeatureLevelAllowed (822) as before, which is normal behavior.
8. Migrate all file systems to reflect the latest metadata format changes. For each file system
in your cluster, use: mmchfs <file_system> -V, as shown in Example 3-35.
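A sketch, using the two file systems in our cluster:
root@dallas1:/> mmchfs /dev/oradata -V
root@dallas1:/> mmchfs /dev/orabin -V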
For more details about the GPFS upgrade procedure, see the manual GPFS V3.1 Concepts,
Planning, and Installation Guide, GA76-0413.
Important: In this scenario, we delete the existing GPFS cluster and recreate it after
installing new GPFS code. You must prepare the environment, node, and disk definition
files for the new cluster.
Start by shutting down all activity (see 3.7.1, “Upgrading using the mmchconfig and mmchfs
commands” on page 139), and then perform the following actions:
1. Export the file systems one by one: mmexportfs <file system> -o <Export-file>, as
shown in Example 3-36 on page 141.
Note: The mmexportfs command actually removes the file system definition from the
cluster.
Note: We recommend that you use mmexportfs for individual file systems, and do not use
mmexportfs all, because this will also export NSD disks that are not used for any file
systems, such as tiebreaker NSDs. Using all can create issues when importing all file
systems into the new cluster.
Note: Because we are reusing disks, use the -v no option to let mmcrnsd overwrite the disks:
mmcrnsd -F /etc/gpfs_config/gpfs_disks_tb -v no
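After the new cluster and NSDs are created, a sketch of importing the file systems back and mounting them (the export file names are illustrative and must match those produced by mmexportfs):
root@dallas1:/> mmimportfs /dev/oradata -i /etc/gpfs_config/3.1-Upgrade/oradata.exp
root@dallas1:/> mmimportfs /dev/orabin -i /etc/gpfs_config/3.1-Upgrade/orabin.exp
root@dallas1:/> mmmount all -a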
3.8 Moving OCR and voting disks from GPFS to raw devices
This section presents the actions that we took to move OCR and Oracle Clusterware voting
disks from GPFS to raw devices. In order to move OCR and voting disks out of GPFS, you
must first prepare the raw partitions and then run the commands for actually moving OCR and
voting disks.
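The OCR part of the move is done with the ocrconfig command; a sketch, run as root once the raw devices described below are prepared (the device /dev/OCR1 is created later in this section, /dev/OCR2 is an assumed second device name):
root@dallas1:/> ocrconfig -replace ocr /dev/OCR1
root@dallas1:/> ocrconfig -replace ocrmirror /dev/OCR2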
Note: Even though the Oracle installation documentation states that a minimum of 100 MB
is required for the OCR, for replacement you need LUNs of at least 256 MB. In
fact, even 256 MB did not work in our test case, so we had to increase the LUN size to
260 MB. We saw the following errors:
PROT-21: Invalid parameter
PROT-16: Internal error
PROT-22: Storage too small
2. Use at least 20 MB per voting disk partition. Ownership for OCR devices must be root,
and the group must be the same as the oracle installation owner, in this case dba. The
voting disk must have owner and group as the oracle installation, in this case oracle and
dba. Permissions must be 640 for OCR and 644 for voting disks.
The current voting disks can be listed with the crsctl command, as shown in
Example 3-41.
located 3 votedisk(s).
oracle@dallas1:/oracle>
OCR device/path names can be obtained using the ocrcheck command, as shown in
Example 3-42 on page 145.
3. Use the mknod command to create a device with a meaningful name, using the same
major/minor number as the AIX hdisk.
In Example 3-43, we show how to identify the LUNs for DS4000 Series storage.
---dar0---
Knowing the mapping between LUN names and AIX default naming, we can now get the
major/minor numbers that we need to create devices, as shown in Example 3-44 on
page 146.
The LUN DALLAS_ocr1 is used for OCR. This is translated to hdisk2, so its major/minor
numbers are 36.3. Example 3-45 shows how we use the mknod command to create the
device OCR1.
The link between these devices and the hdisks is the major/minor number pair and the AIX default
naming. To identify which hdisk is actually used for /dev/crs_votedisk2, use the major and
minor numbers, as shown in Example 3-46.
Example 3-46 Listing all devices with the specific major/minor number
root@dallas1:/> ls -l /dev/crs_votedisk2
crw-r--r-- 1 root system 36, 6 Sep 14 15:38 /dev/crs_votedisk2
root@dallas1:/> ls -l /dev | grep "36, 6"
crw-r--r-- 1 root system 36, 6 Sep 14 15:38 crs_votedisk2
brw------- 1 root system 36, 6 Sep 10 10:07 hdisk5
crw------- 1 root system 36, 6 Sep 10 10:07 rhdisk5
root@dallas1:/>
4. Set the ownership mode to 640 and root.dba for all OCR devices and to oracle.dba and
644 for CRS voting disk devices. Make sure that the AIX LUN reservation policy is set to
no_reserve. To change the reservation policy, use the chdev command, as shown in
Example 3-47.
5. Make sure that new raw devices do not contain any information that might confuse CRS.
We used the dd command to erase any information about /dev/OCR1, as shown in
Example 3-48.
6. Erase all raw devices before proceeding to the next step. Refer to the UNIX man pages for
more information about the dd command. The write error in Example 3-48 indicates that
the data copied from /dev/zero is larger than the raw device. To check the device size, use the
bootinfo command, as shown in Example 3-49.
Note: Make sure that the mknod, chown, chmod, and chdev commands are run on all nodes
in the cluster.
2. The current voting disks are listed using the crsctl command, as shown in Example 3-55
on page 149.
located 3 votedisk(s).
Even though Oracle Clusterware is shut down, we still need to use the -force option when
deleting and adding voting disks. Example 3-56 shows how to delete and add voting disks
with the crsctl command.
located 3 votedisk(s).
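A sketch of the delete and add commands (the paths are illustrative; the GPFS path is the old location and the raw device is the new one):
root@dallas1:/> crsctl delete css votedisk /oradata/crs/vote1 -force
root@dallas1:/> crsctl add css votedisk /dev/crs_votedisk1 -force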
Note: During the testing, we experienced Oracle Clusterware rebooting the nodes (due
to our own mistake), while we were adding a voting disk, which led to a voting disk entry
without a name for the voting disk. Use the /orabin/crs/bin/crsctl delete css
votedisk ... to forcefully remove it.
3. Repeat this for all voting disks. Example 3-57 shows how we remove the remaining two
voting disks.
located 3 votedisk(s).
For more information, refer to the Chapter 3 of the Oracle Database Oracle Clusterware and
Oracle Real Application Clusters Administration and Deployment Guide,10g Release 2
(10.2), Part Number B14197-04.
GPFS mirroring is also known as replication and is independent of any other replication
mechanism (storage-based or AIX Logical Volume Manager (LVM)). GPFS replication uses
synchronous mirroring. This solution consists of production nodes and storage that are
located in two sites, plus a third node that is located in a separate (third) site. The node in the
third site keeps the GPFS cluster alive in case one of the production sites fails: it acts as a
quorum buster for both the GPFS cluster and the GPFS file systems. It also participates in
Oracle CRS voting, by defining a CRS voting disk on an NFS share held by this third node.
The third node is not connected to the SAN.
Server level
EtherChannel and Multi-Path I/O (MPIO) provide high availability at the AIX level. Each
new release of AIX also provides more features that contribute to continuous operations,
for example, by reducing the need to reboot the server when upgrading the OS or performing
system maintenance. However, the server itself remains a single point of failure (SPOF).
Storage level
A SAN can also provide high availability, because the storage subsystems are designed to be
fully redundant and fault resilient. Failures from individual spindles (disks) are managed
through the RAID algorithm by using automatic replacement with hot spare disks. All the
Fibre Channel connections are at least doubled, and there are two separate controllers to
manage the host access to the data. There is no single point of failure.
Application level
On top of the resilient hardware, Oracle Clusterware and RAC software provide a highly
available database.
Figure 4-1 on page 155 shows a common architecture for high availability: two nodes
connected to a storage device. Of course, the nodes belong to two different physical frames.
We do not recommend using two logical partitions (LPARs) in the same frame, because the
frame itself is a single point of failure. This solution is also called local high availability,
because the two servers and the storage device are located in the same data center.
Figure 4-1 Typical high availability architecture: two nodes and one type of storage in one data center
This setup provides excellent high availability for your IT environment. But in case of a global
disaster that affects the entire data center, all the hardware, servers, and storage are lost at
the same time. Disasters include fire, flood, building collapse, and power supply failure, but also
malicious attacks or acts of terrorism.
To address these issues related to a single data center and thus reduce the risk related to a
disaster, you must use a second data center. In this case, this is not called high availability,
but disaster recovery.
In addition to having two data centers, a disaster recovery solution also requires two storage
subsystems and a storage replication mechanism.
Distance considerations
The distance between the sites provides a better separation, but it also introduces latency in
the communication between the sites (IP and SAN).
The distance between the sites impacts the system performance depending on the data
throughput required by your application. If the throughput is high, a maximum of a few
kilometers between the two sites must be considered. If the SAN is less heavily used, a
distance of 20 to 40 km is not a problem. These distances are only indicative and vary
depending on the quality of the SAN, the I/O on the disks, the application, and so on.
A good compromise is to locate the two data centers in two different buildings of the same
company. The distance is then less than a few kilometers, so it is not a concern.
This is a good response to fire disasters, but not the best for earthquakes or floods. This
setup is called a campus-wide disaster recovery solution.
Another frequently used architecture involves two sites of the company in the same city, or
nearby, located less than 20 km (12 miles) apart. The impact of the distance remains
reasonable, and it is still possible to administer and manage both sites. This architecture is
considered a metropolitan disaster recovery solution.
To fully address the earthquake risk, imagine a backup data center on another continent.
Here, the major point is only the distance, and a completely different set of solutions applies
(for example, asynchronous replication). We do not address this subject in this book.
Mirroring considerations
Because we have two storage units (one in each site) and the same set of nodes and
applications accessing the same data, mirroring must be defined between the two storage
units. You need to use mirroring to keep the application running with only one surviving site
and with a full data copy.
You can implement mirroring at the file system level (here GPFS), or at the storage level, by
using Metro Mirror (synchronous) or Global Mirror (asynchronous). There are differences
between these mirroring methods. We describe GPFS mirroring, also called replication, in
this chapter. We describe how to use Metro Mirror for disaster recovery in Chapter 5,
“Disaster recovery using PPRC over SAN” on page 185.
Note: In this configuration, the application runs in both primary and secondary production
sites, but not in the third site. In case of either production site failure, operation in the
surviving site continues without any user intervention.
Data mirroring is done at the GPFS level, so all the disks must be visible from both nodes.
GPFS mirrors data based on failure group information. In this case, a failure group is a set of
disks that belongs to the same site. GPFS enforces the mirroring between the failure groups,
guaranteeing that each site contains a good copy of the entire data, metadata, and file
system descriptors.
The third node is not attached to the SAN and has only internal disks. It provides GPFS with
an internal disk as a Network Shared Disk (NSD). This disk holds only a third copy of the file
system descriptor. The I/O throughput for this node is negligible, because the disk attached to
this node does not hold any data or metadata, and no application that accesses the GPFS file
system runs on this node (in fact, this node must not run any application); thus, this node does
not affect the overall GPFS performance. This node is called the tiebreaker node, and the site
is called the tiebreaker site.
This third (tiebreaker) site must be an independent site; it cannot be a node that is hosted in
one of the two production sites. GPFS cannot survive if a main site and the third site are down
at the same time: GPFS can survive the failure of only one site. If two sites fail at the same
time, GPFS stops on the surviving site, even though this site might still hold a whole set of the
data (one valid copy). Figure 4-4 shows the disaster resilient architecture using GPFS
replication.
Figure 4-4 Disaster resilient architecture using GPFS replication: nodes austin1 (Site A) and austin2 (Site B) are attached through the SAN to storage units α and β; the tiebreaker node has only an internal disk (no external storage); all nodes communicate over the IP network
Important: This file must be the same on all the nodes and must remain unchanged after
the GPFS cluster is created. IP name resolution is critical for any clustering environment.
You must make sure that all nodes in this cluster resolve all IP labels (names) identically.
The two production nodes must have the manager attribute, but the third node is only a client.
However, all three nodes must be quorum nodes.
We have prepared a node descriptor file, which is shown in Example 4-2. For more details
about how to create a GPFS cluster, see 2.1.6, “GPFS configuration” on page 30.
Example 4-4 GPFS three node topology where each node must be part of the quorum
root@austin1:/home/michel> mmlscluster
Example 4-5 Avoid propagating inappropriate error messages on the third node
root@gpfs_dr:/home/michel> mmchconfig unmountOnDiskFail=yes gpfs_dr_interconnect
mmchconfig: Command successfully completed
mmchconfig: 6027-1371 Propagating the cluster configuration data to all
affected nodes. This is an asynchronous process.
Verify that the parameter has been set as shown in Example 4-6.
In our example, there are only two LUNs, but in an actual configuration, you might have more
LUNs. Just make sure that you have an even number of LUNs and that all LUNs are the
same size. Also make sure that half of the LUNs are located in each storage subsystem (in
different sites). By assigning the LUNs in each storage subsystem to a different failure group,
you make sure that GPFS replication (mirroring) is consistent and useful in case one
production site fails.
On node gpfs_dr, we have one free internal SCSI disk. The size of this disk is not that
important; it contains only a copy of the file system descriptors (no data or metadata). This
disk is in a separate failure group.
The LUNs in storage subsystems in sites A and B and the internal SCSI disk belong to the
same GPFS. If you plan to have more than one file system, in addition to an equal number of
Example 4-7 shows the disks that will be used for the new GPFS file system.
Example 4-7 Free LUNs on the main nodes and free internal disk on the third node
root@austin1:/home/michel> lspv
hdisk0 0022be2ab1cd11ac rootvg active
...
hdisk16 none None
hdisk17 none None
root@gpfs_dr:/home/michel> lspv
hdisk0 00c6629e00bddee5 rootvg active
...
hdisk3 none None
Create the NSDs that GPFS will use later, on one of the main nodes and on the third node. The
command is mmcrnsd. You must issue this command on a node that sees the disk or the LUN.
Example 4-8 shows the disk descriptor file that we use for this scenario. For more information
about this command, refer to 2.1.6, “GPFS configuration” on page 30.
Make sure that the failure group (1, 2, or 3) for each LUN reflects the actual site; there is one
failure group on each site. The disks in our example are:
hdisk16 is a LUN in site A storage (failure group 1)
hdisk17 is a LUN (same size as hdisk16) in site B storage (failure group 2)
hdisk3 is an internal disk in the node in site C (failure group 3)
For more information about the sites, see Figure 4-4 on page 159.
Example 4-8 GPFS disk file for NSD creation on the main nodes and third node
root@austin1:/home/michel> cat gpfs_disk_file
hdisk16:austin1_interconnect:austin2_interconnect:dataAndMetadata:1:dr_copy1:
hdisk17:austin2_interconnect:austin1_interconnect:dataAndMetadata:2:dr_copy2:
hdisk3:gpfs_dr_interconnect::descOnly:3:dr_desc:
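A minimal sketch of the mmcrnsd invocation, using the descriptor file from Example 4-8, is the
following (add -v no only if the disks were previously used by GPFS):
root@austin1:/home/michel> mmcrnsd -F gpfs_disk_file
Verify the result with the mmlsnsd -m command: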
root@austin1:/home/michel> mmlsnsd -m
Note: Disks dr_copy1 and dr_copy2 appear twice in the listing shown in Example 4-9,
which is normal, because these disks are attached (via SAN) to both production nodes.
We are now ready to create the file system. To create the file system, we use the same disk
descriptor file that was used for the mmcrnsd command. This file was modified by the mmcrnsd
command when the NSDs were created, as shown in Example 4-10. At this point, start the
GPFS daemon on all nodes in the cluster (mmstartup -a).
The NSD disks are accessible from any node in the cluster; thus, you can run the mmcrfs
command on any node (see Example 4-11 on page 165). Because you want a fully replicated
file system, make sure that you use the correct replication parameters: -m2 -M2 -r2 -R2.
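A minimal sketch of such an mmcrfs invocation, assuming the mount point /disaster and the
device name disaster (the -A yes flag, for automatic mounting, is an assumption), is:
root@austin1:/home/michel> mmcrfs /disaster disaster -F gpfs_disk_file -m2 -M2 -r2 -R2 -A yes
The output of the command is similar to the following: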
GPFS: 6027-531 The following disks of disaster will be formatted on node austin1:
dr_copy1: size 4194304 KB
dr_copy2: size 4194304 KB
dr_desc: size 71687000 KB
GPFS: 6027-540 Formatting file system ...
GPFS: 6027-535 Disks up to size 208 GB can be added to storage pool 'system'.
Creating Inode File
Creating Allocation Maps
Clearing Inode Allocation Map
Clearing Block Allocation Map
GPFS: 6027-572 Completed creation of file system /dev/disaster.
mmcrfs: 6027-1371 Propagating the cluster configuration data to all affected
nodes. This is an asynchronous process.
The replication parameters that enable and activate replication are:
-m Default number of copies of metadata (inodes, directories, and indirect blocks) for a file.
Valid values are 1 and 2 (a value of 2 activates metadata replication by default). You
cannot set this parameter to 2 if -M is not also set to 2.
-M Maximum number of copies of metadata (inodes, directories, and indirect blocks) for
a file. Valid values are 1 and 2 (a value of 2 enables metadata replication).
-r Default number of copies of each data block for a file. Valid values are 1 and 2 (a value
of 2 activates data replication by default). You cannot set this parameter to 2 if -R is not
also set to 2.
-R Maximum number of copies of data blocks for a file. Valid values are 1 and 2 (a value of
2 enables data replication).
Mount and check the parameters of the newly created file system, as shown in Example 4-12.
root@austin1:/home/michel> mount
node mounted mounted over vfs date options
-------- --------------- --------------- ------ ------------ ---------------
/dev/hd4 / jfs2 Sep 16 02:06 rw,log=/dev/hd8
/dev/hd2 /usr jfs2 Sep 16 02:06 rw,log=/dev/hd8
/dev/hd9var /var jfs2 Sep 16 02:06 rw,log=/dev/hd8
/dev/hd3 /tmp jfs2 Sep 16 02:06 rw,log=/dev/hd8
/dev/hd1 /home jfs2 Sep 16 02:06 rw,log=/dev/hd8
/proc /proc procfs Sep 16 02:06 rw
/dev/hd10opt /opt jfs2 Sep 16 02:06 rw,log=/dev/hd8
/dev/disaster /disaster mmfs Sep 19 18:24
rw,mtime,atime,dev=disaster
After successful creation, you can use this file system to store the Oracle data files. The file
system is resilient to the loss of either one of the production sites.
4.2.4 Oracle 10g RAC clusterware configuration using three voting disks
We have seen earlier that Oracle 10g RAC provides its own high availability mechanism. Is it
the same for disaster recovery?
A disaster recovery configuration implies duplicated SAN storage in different sites. Because
the Oracle data files are located on GPFS, they are protected by the file system layer against
disaster (if one site is down). Now, what about Oracle Clusterware, which is also called CRS?
Oracle RAC is based on a concurrent (shared) storage architecture, and the main goal of the
clustering layer is to prevent unauthorized storage access from nodes that are not considered
“safe”. From this perspective, Oracle Clusterware Cluster Ready Services (CRS) manages
node failure similarly to GPFS. It uses a voting disk to act as a tiebreaker in case of a node
failure. The CRS voting disk might be a raw device or a file that is accessible to all nodes in
the cluster. The voting disk (or the access to the disk) is vital for Oracle Clusterware. If the
voting disk is lost to any node, even temporarily, it triggers the reboot of the respective node.
Oracle Clusterware cannot survive without a valid voting disk. You can define up to 32 voting
disks, all of which contain the same information. For the cluster to be up and running, more
than half of the declared voting disks must be accessible. Assume that half of the voting disks
are located on a storage unit in site A and the other half in site B. We can see immediately
that if one of the storage units (through a site failure) is lost, the voting disks quorum cannot
be fulfilled, and all nodes are rebooted by CRS. After the reboot, CRS can reconfigure itself
with only the surviving voting disks and can restart all the instances. However, to avoid any
disruption, we must have an odd number of voting disks (usually three copies are enough)
and have (at least) the third copy on a third site.
As discussed in 1.3.1, “RAC with GPFS” on page 11, we do not recommend using GPFS for
storing the CRS voting disks. You must use NFS-shared or SAN-attached storage (as raw
devices). It is also possible to use a combination of NFS-shared and SAN-attached raw
devices.
Figure: voting disk placement across three sites. Nodes austin1 (Site A) and austin2 (Site B) hold the database files and voting disks vote1 and vote2 on their SAN-attached storage; the tiebreaker node gpfs_dr in Site C has no external storage and acts as the NFS server for the third voting disk (vote3); the production nodes are connected through the SAN, and all three nodes communicate over the IP network.
Configure the buffer size, timeout, protocol, and security method, as shown in Example 4-16
on page 170. Make sure that this directory is mounted automatically after a reboot.
Check the /etc/filesystems file for the new entry, as shown in Example 4-17 on page 171.
Also, add the noac option in the /etc/filesystems file.
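As a sketch, the resulting /etc/filesystems stanza might look like the following (the stanza
layout is an illustration; the option values match the mount output in Example 4-18):
/voting_disk:
        dev             = "/voting_disk"
        vfs             = nfs
        nodename        = gpfs_dr
        mount           = true
        options         = rw,bg,hard,intr,rsize=32768,wsize=32768,timeo=600,vers=3,proto=tcp,noac,sec=sys
        account         = false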
Make sure that the file system is mounted, as shown in Example 4-18.
root@austin1:/home/michel> mount
node mounted mounted over vfs date options
-------- --------------- --------------- ------ ------------ ---------------
/dev/hd4 / jfs2 Sep 21 14:52 rw,log=/dev/hd8
/dev/hd2 /usr jfs2 Sep 21 14:52 rw,log=/dev/hd8
/dev/hd9var /var jfs2 Sep 21 14:53 rw,log=/dev/hd8
/dev/hd3 /tmp jfs2 Sep 21 14:53 rw,log=/dev/hd8
/dev/hd1 /home jfs2 Sep 21 14:53 rw,log=/dev/hd8
/proc /proc procfs Sep 21 14:53 rw
/dev/hd10opt /opt jfs2 Sep 21 14:53 rw,log=/dev/hd8
/dev/disaster /disaster mmfs Sep 21 14:54
rw,mtime,atime,dev=disaster
gpfs_dr /voting_disk /voting_disk nfs3 Sep 21 16:10
rw,bg,hard,intr,rsize=32768,wsize=32768,timeo=600,vers=3,proto=tcp,noac,sec=sys
2. Check to see if CRS is really stopped, as shown in Example 4-20 on page 172.
The initial configuration was made with three voting disks, which are located on two different
storage devices. The voting disks are shown in Example 4-21.
located 3 votedisk(s).
3. Delete one of these disks located in the storage that holds two voting disks, as shown in
Example 4-22.
4. We add the NFS-shared voting disk as shown in Example 4-23. Even though the NFS
voting disk creates traffic over the IP network, this traffic is insignificant, and the existence
of the voting disk is more important than its actual I/O.
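A minimal sketch of such delete and add operations follows (the device and file names are
taken from this chapter, but the exact command lines in Examples 4-22 and 4-23 might differ;
-force is needed because CRS is stopped):
root@austin2:/> crsctl delete css votedisk /dev/votedisk2 -force
root@austin2:/> crsctl add css votedisk /voting_disk/voting_disk3_for_DR -force
root@austin2:/> crsctl query css votedisk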
5. Next, we have to change the owner of the new voting disk (an NFS file in our case). This
step, shown in Example 4-24, is extremely important, and if you skip it, CRS will not start.
Example 4-24 Change the owner of the newly created voting disk
root@austin2:/voting_disk> ll
-rw-r--r-- 1 root system 10306048 Sep 21 16:27 voting_disk3_for_DR
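# The owner change itself is done with a command similar to the following (assumption):
root@austin2:/voting_disk> chown oracle:dba voting_disk3_for_DR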
root@austin2:/voting_disk> ll
-rw-r--r-- 1 oracle dba 10306048 Sep 21 16:27 voting_disk3_for_DR
6. Check the configuration. You must have an output similar to Example 4-25 on page 173.
located 3 votedisk(s).
7. Restart Oracle Clusterware on all the nodes, which triggers the restart of instances as
well.
Node failure is simulated by halting the node (halt -q command), which also powers off the
node. This method is different from a normal shutdown (shutdown -Fr), which stops all
applications and processes, synchronizes the file systems, and then stops.
Storage failure is simulated by removing the host mapping at the storage level; thus, the host
loses the disk connection immediately.
The purpose of this series of tests is to verify that GPFS and Oracle 10g RAC behave as
expected in a disaster recovery situation. The results are in line with the expectations, for
both GPFS and Oracle, as long as the configuration explained in this chapter is complete.
The hardware architecture of the test platform is shown in Figure 4-4 on page 159. There are
two production nodes: austin1 and austin2. A third one, gpfs_dr, is used as a tiebreaker
node for GPFS and holds the third Oracle Clusterware voting disk that was exported via NFS
to austin1 and austin2.
We test the worst case scenario by stopping the primary node, austin1. This node is
the GPFS cluster manager and also the file system manager node for the GPFS file system
(mount point /disaster), as shown in Example 4-27.
Example 4-27 Checking the file system manager node for /disaster file system
root@austin1:/> mmlsmgr
file system manager node [from 10.1.100.31 (austin1_interconnect)]
---------------- ------------------
disaster 10.1.100.31 (austin1_interconnect)
The script shown in Example 4-28 was run on the main nodes, austin1 and austin2, before
the failure. Its goal is to estimate the outage time. We have observed no outage.
Example 4-28 Script to check GPFS file system availability during failures
root@austin1:/home/michel> while true
> do
> print $(date) >> /disaster/test_date_austin1
> sleep 1
> done
On the other node, austin2, we can see in Example 4-30 on page 175 that there is no outage
on the GPFS file system /disaster, which remains up and running despite the failure of one
node. The third node (gpfs_dr) is important to maintain the node quorum, thus, keeping the
GPFS file system active on the surviving nodes.
When the failing node has the role of cluster configuration manager, the process to fail over
this management role to another node (one that has the management capability) takes less
than 135 seconds. During this time, the GPFS file systems are frozen on all the nodes. I/O can
still be buffered in memory, as long as the page pool size is sufficient; after the memory
buffers are filled, the application waits for the I/O to complete, just like normal I/O. If the
failing node is not the cluster configuration manager, there is no freeze at all.
The second node is aware of the austin1 failure, as displayed in austin2’s GPFS log shown
in Example 4-31.
Example 4-31 Node austin2 GPFS log during node failure test
root@austin2:/var/adm/ras> cat mmfs.log.latest
...
Thu Sep 20 12:22:04 2007: GPFS: 6027-777 Recovering nodes: 10.1.100.31
Thu Sep 20 12:22:05 2007: GPFS: 6027-630 Node 10.1.100.32 (austin2_interconnect)
appointed as manager for disaster.
Thu Sep 20 12:22:38 2007: GPFS: 6027-643 Node 10.1.100.32 (austin2_interconnect)
completed take over for disaster.
Thu Sep 20 12:22:38 2007: GPFS: 6027-2706 Recovered 1 nodes.
...
Node austin2 assumes the role of file system manager (mmlsmgr command) and cluster
configuration manager (mmfsadm dump cfgmgr command) as shown in Example 4-32.
Example 4-32 Node austin2 is the new manager node for /disaster GPFS file system after node failure
root@gpfs_dr:/home/michel> mmlsmgr
file system manager node [from 10.1.100.32 (austin2_interconnect)]
---------------- ------------------
disaster 10.1.100.32 (austin2_interconnect)
Current clock tick (seconds since boot): 163390.45 (resolution 0.010) = 2007-09-20
14:25:24
For more information about the cluster configuration manager and file system manager roles,
refer to 2.1.6, “GPFS configuration” on page 30.
The third node (gpfs_dr) is also aware of the cluster changes, but it takes no special action,
as shown in Example 4-33.
Example 4-33 Node gpfs_dr GPFS log during node failure test
root@gpfs_dr:/var/adm/ras> cat mmfs.log.latest
...
Thu Sep 20 12:21:54 2007: GPFS: 6027-777 Recovering nodes: 10.1.100.31
Thu Sep 20 12:21:55 2007: GPFS: 6027-2706 Recovered 1 nodes.
...
Even if the failing node had the cluster configuration manager role before its failure, this role
is not transferred back automatically. The other node (austin2) continues to perform this role,
thus avoiding an unnecessary fallback that might freeze file system activity for a short time.
Node austin1 has both management roles (listed in Example 4-34 and Example 4-35 on
page 178). We test what happens if this node loses its access to the local storage.
Example 4-34 Node austin1 is the file system manager for /disaster
root@austin1:/var/mmfs/gen> mmlsmgr
file system manager node [from 10.1.100.31 (austin1_interconnect)]
---------------- ------------------
disaster 10.1.100.31 (austin1_interconnect)
Current clock tick (seconds since boot): 3413.57 (resolution 0.010) = 2007-09-20
16:05:29
It is important to check the GPFS (/disaster) replication before the test. In Example 4-36, you
can see that the NSD disks dr_copy1 and dr_copy2 are both holding data, metadata, and file
system descriptors. They are connected with dual Fibre Channel attachment to both austin1
and austin2. These disks are located on different storage units, situated in separate sites, so
the risk of losing both disks is limited. The third NSD disk, dr_desc, is an internal SCSI disk in
the third node. Accessed by the network only (no SAN), it contains a third copy of the file
system descriptors. The replication settings have been defined at the file system level (see
4.2.3, “Disk configuration using GPFS replication” on page 162).
Example 4-36 GPFS replicated file system configuration before disk failure
Now, the austin1 node has lost its NSD disks, because the LUN mapping to the host was
removed at 16h19m19s. The failing disk is hdisk16 for AIX, or dr_copy1 for GPFS, as
revealed by the disk error messages from the AIX error report (errpt | egrep
"ARRAY|mmfs"), which are detailed using the errpt -aj command, as shown in Example 4-37.
Example 4-37 Node austin1 AIX error report during disk failure
root@austin1:/> errpt |egrep “ARRAY|mmfs”
2E493F13 0920161907 P H hdisk16 ARRAY OPERATION ERROR
9C6C05FA 0920161907 P H mmfs DISK FAILURE
2E493F13 0920162007 P H hdisk16 ARRAY OPERATION ERROR
Description
ARRAY OPERATION ERROR
Probable Causes
ARRAY DASD DEVICE
Failure Causes
DISK DRIVE
DISK DRIVE ELECTRONICS
Recommended Actions
PERFORM PROBLEM DETERMINATION PROCEDURES
Description
DISK FAILURE
Probable Causes
STORAGE SUBSYSTEM
DISK
Failure Causes
STORAGE SUBSYSTEM
DISK
Recommended Actions
CHECK POWER
RUN DIAGNOSTICS AGAINST THE FAILING DEVICE
Detail Data
EVENT CODE
15913921
VOLUME
disaster
RETURN CODE
22
PHYSICAL VOLUME
dr_copy1
Because GPFS replication is activated, there is no impact and no freeze of the file system
/disaster. The file system remains operational on all three nodes, and a user or an application
does not notice anything unusual regarding I/O. You only see the problem in the logs (the AIX
error report shown in Example 4-37 on page 179 and the GPFS log, which is shown in
Example 4-38).
Example 4-39 shows the status of the file system during the disk failure. A good copy of the
data and the metadata is still accessible via the dr_copy2 disk, and the data is not lost during
the failure. Also, because two valid copies of the file system descriptor still exist (dr_copy2
and dr_desc disks), the file system is still mounted and active.
Note: When using a GPFS cluster with three nodes in three sites and a replicated GPFS
file system on two storage devices, the failure of one storage device has no impact on the
I/O. There is no freeze, and there is no data loss. Everything is managed transparently by
GPFS.
Example 4-40 Synchronize the cluster configuration (if changed during the failure)
root@austin1:/> mmchcluster -p LATEST
2. Then, run the command shown in Example 4-41, from any node, to bring back (start) the
disk that has been marked down since its failure.
As a result, the command in Example 4-42 shows that our disk is operational.
3. The last action is to replicate the data and metadata (synchronize the mirror). Be aware
that this can be an I/O intensive action, depending on the size of your file system.
Example 4-43 on page 182 shows how to resynchronize the file system.
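A sketch of the commands behind these steps for our configuration (file system disaster,
failed NSD dr_copy1; the exact command lines in Examples 4-41 through 4-43 might differ):
root@austin1:/> mmchdisk disaster start -d dr_copy1
root@austin1:/> mmlsdisk disaster
root@austin1:/> mmrestripefs disaster -r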
Now, you have fully recovered from the disaster. It was not difficult.
As in the previous examples, the node stopped is the cluster configuration manager and also
the file system manager for the /disaster file system. Also, the mapping of dr_copy2 disk is
removed (at the storage subsystem level) to simulate a storage device problem in site1, and
the austin1 node is halted at the same time. So Site A is not responding anymore.
Although it represents two events at the same time, it does not differ from the node failure and
disk failure cases that we discussed in 4.3.1, “Failure of a GPFS node” on page 174 and
4.3.3, “Loss of one storage unit” on page 177.
The GPFS log on a surviving node has captured both events, as shown in Example 4-44.
Example 4-44 GPFS log showing a disk failure (dr_copy2), and a node failure (austin1)
root@austin2:/var/mmfs/gen> cat mmfslog
Thu Sep 20 17:48:12 2007: GPFS: 6027-680 Disk failure. Volume disaster. rc = 22.
Physical volume dr_copy2.
Thu Sep 20 17:49:13 2007: GPFS: 6027-777 Recovering nodes: 10.1.100.31
Thu Sep 20 17:49:13 2007: GPFS: 6027-630 Node 10.1.100.32 (austin2_interconnect)
appointed as manager for disaster.
Thu Sep 20 17:49:37 2007: GPFS: 6027-643 Node 10.1.100.32 (austin2_interconnect)
completed take over for disaster.
Thu Sep 20 17:49:37 2007: GPFS: 6027-2706 Recovered 1 nodes.
In this scenario, we test the failure of one of the CRS voting disks.
Note: We do not recommend that you store the voting disks on GPFS file systems, as
stated in 1.3.1, “RAC with GPFS” on page 11. Use the raw hdisk (shared SAN) without any
Logical Volume Manager or file system layer. A third voting disk is supported on NFS
(see 4.2.4, “Oracle 10g RAC clusterware configuration using three voting disks” on
page 166).
We want to determine if Oracle 10g RAC can survive a site failure with a node and instance
crash and storage outage (including one of the voting disks). We have already tested the
GPFS layer, and we know that it can survive a disaster without problems. However, because the
CRS voting disks are outside GPFS (two disks on the shared storage as LUNs, and one
NFS-exported file on the third node), we test this scenario separately.
CRS voting disk failure is simulated (in the same way that the previous tests were simulated)
by removing the LUN mapping on the storage subsystem. As a result, RAC remains up and
running with no loss of service. In the CRS log, we can see these lines shown in
Example 4-45.
Example 4-45 CRS logs during the failure of one voting disk
/orabin/crs/log/austin1/alertaustin1.log
/orabin/crs/log/austin1/cssd/ocssd.log
[ CSSD]2007-09-21 17:16:17.076 [1287] >ERROR: clssnmvReadBlocks: read failed
1 at offset 133 of /dev/votedisk2
Note: Oracle 10g RAC can survive the loss of one of the three voting disks.
Example 4-46 CRS logs during the failure of two voting disks
/orabin/crs/log/austin2/alertaustin2.log
2007-09-21 17:48:09.659
[cssd(479262)]CRS-1606:CSSD Insufficient voting files available [1 of 3]. Details
in /orabin/crs/log/austin2/cssd/ocssd.log.
Because the voting disk quorum is no longer met (more than half of the voting disks must be
accessible), all Oracle 10g RAC instances are stopped, and CRS reboots the servers. When
nodes come back up, CRS reconfigures itself to use only one voting disk and restarts the
instances. So, the database service is up again, but it is not disaster resilient any longer.
Note: Oracle 10g RAC cannot survive the loss of two out of the three voting disks.
Figure 5-1 Metro Mirror (PPRC) configuration across two sites: nodes austin1 (Site A) and austin2 (Site B), each with a VIP (austin1_vip, austin2_vip) on the public network (192.168.100.31/32) and the RAC interconnect (austin1_interconn, austin2_interconn, 10.1.100.31/32), are attached to the SAN through fcs0 adapters; Storage A is replicated to Storage B over the PPRC (Metro Mirror) links
In this configuration, both nodes A and B are part of the same GPFS cluster and Oracle RAC.
Moreover, both nodes are active and can be used for submitting application workload.
However, only storage in Site A is active and provides logical unit numbers (LUNs) for GPFS
and RAC. Storage in Site B (secondary) provides replication for the LUNs in Site A. The
LUNs in the secondary storage are unavailable to either node during normal operation.
Note: The configuration in Figure 5-1 requires two nodes. In normal operation, both
nodes are active at the same time and access the LUNs in Storage A. One of the benefits
of this configuration is that it does not require additional (standby) hardware, and in case
Site A fails, Site B can provide service with degraded performance (as opposed to
requiring dedicated contingency hardware). Although reasonably simple, this configuration
requires extensive effort for implementation and testing.
A more sophisticated configuration consists of two active nodes in Site A and two backup
nodes (inactive) in Site B. However, this configuration adds an additional complexity level
for the (manually initiated) failover and failback operations.
During normal operation, the LUNs in the primary storage device are replicated
synchronously to the secondary storage device. In case the storage in Site A becomes
unavailable, the secondary copy in Site B must be activated manually.
You must take extra precautions when performing recovery. In normal operation, the replicated
LUNs in Storage B are not mapped to any of the nodes (because they have the same IDs as
the LUNs in the primary storage). During the recovery process, you must prevent LUNs with the
same IDs from becoming active at the same time, because this confuses the application (GPFS
or Oracle). Thus, you must make sure that the replicated LUNs belonging to the primary (failing)
storage are unmapped from both nodes before you resume the primary storage operation.
When the storage subsystem in Site A is restored to operational status, the system
administrator must reinitiate the replication process to synchronize the copies. After the data
has been synchronized, the secondary storage can remain active (holding the primary copy),
or you can manually restore the original configuration.
Note: This configuration is based on synchronous replication. The distance between sites
is a factor that affects the performance of your application.
5.2 Implementation
Metro Mirror for IBM System Storage DS8000 (formerly Peer to Peer Remote Copy (PPRC))
is a storage replication product that is completely platform and application independent. PPRC
can provide replication between sites for all types of storage methods that are used for Oracle. It
can be used for both stand-alone and RAC (clustered) databases. Database files can be plain
files (JFS or JFS2), raw devices, ASM, or files in GPFS file systems.
In this test, we have used a two node RAC/GPFS cluster and two IBM System Storage
DS8000 units. To simulate the two locations, the SAN provides two IBM 2109-F32 switches
that are connected using long wave single mode optical fiber (1300nm - LW GBICs).
Important: Metro Mirror (PPRC) is used to replicate all of the LUNs that are used for our
configuration, which include:
Oracle Cluster Repository
CRS voting disks
GPFS NSDs
We have configured LUNs, masking, and zoning to support our configuration. We do not
describe the masking and zoning process in this book.
In this section, we describe how we establish replication between the two storage units, the
actions that we must take when storage in Site A becomes unavailable, and the steps to
perform when the primary storage is recovered.
We assume that the LUN configuration has already been performed, and we use only one
pair of LUNs to show our configuration.
Note: In your environment, you must make sure that all LUNs belonging to your application
are replicated, including OCR and CRS voting disks and GPFS tiebreaker disks. We
recommend that you script the failover and failback process and test it thoroughly before
deploying the production environment.
Check the pair of LUNs that are going to be used for replication on both storage subsystems,
as shown in Example 5-2.
# On second storage:
Check the PPRC links available between the two storage subsystems, as shown in
Example 5-3 on page 189.
dscli> lspprcpath 90
Date/Time: November 22, 2007 4:42:30 PM EET IBM DSCLI Version: 5.1.720.139 DS:
IBM.2107-75N0291
Src Tgt State SS Port Attached Port Tgt WWNN
=========================================================
90 90 Success FF90 I0100 I0332 5005076303FFC46A
90 90 Success FF90 I0101 I0333 5005076303FFC46A
90 90 Success FF90 I0230 I0302 5005076303FFC46A
90 90 Success FF90 I0231 I0231 5005076303FFC46A
# and on Storage B:
dscli> lspprcpath 90
Date/Time: November 22, 2007 4:42:26 PM EET IBM DSCLI Version: 5.1.720.139 DS:
IBM.2107-7572791
Src Tgt State SS Port Attached Port Tgt WWNN
=========================================================
90 90 Success FF90 I0231 I0231 5005076306FFC1DE
90 90 Success FF90 I0302 I0230 5005076306FFC1DE
90 90 Success FF90 I0332 I0100 5005076306FFC1DE
90 90 Success FF90 I0333 I0101 5005076306FFC1DE
Create the PPRC relationship between storage in A and B (A → B), as shown in Example 5-4.
List the PPRC relationship as shown in Example 5-5. At this point, the two copies are not
synchronized yet.
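A sketch of such commands, run from the Storage A console and assuming the volume pair
9070:9070 and the storage image IDs used elsewhere in this chapter (the exact flags in
Examples 5-4 and 5-5 might differ):
dscli> mkpprc -remotedev IBM.2107-7572791 -type mmir 9070:9070
dscli> lspprc -remotedev IBM.2107-7572791 9070:9070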
Example 5-7 lists the relationship as seen from storage B (to connect to the storage B
console, use dscli -cfg /opt/ibm/dscli/profile/ds_B.profile):
This situation is the most complicated of all of the recovery situations, because the node in
Site A is still available. Therefore, we must take extra precautions when reconfiguring the
LUN mapping on both nodes.
Connected to storage B, we activate the secondary copy using the command that is shown in
Example 5-9.
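A sketch of such a failover command, run from the Storage B console (the volume pair and
storage image ID are assumptions based on our configuration):
dscli> failoverpprc -remotedev IBM.2107-75N0291 -type mmir 9070:9070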
Next, make sure that Oracle and GPFS are stopped on both nodes. Then, unmap the LUNs
belonging to Storage A from both nodes, A and B; if Storage A is unavailable, make sure that
when it comes back up, the LUNs used for PPRC are not available to either node A or node B.
Note: Unmapping LUNs is storage-specific, and we do not discuss it in this publication. For
storage subsystem operations, check with your storage/SAN administrator to make sure
that you understand the consequences of any action that you might take.
After LUNs in storage A have been unmapped, map the replicated LUNs (in this case vol1_B)
to both nodes (A and B). Start GPFS and check if it can see the NSDs. Make sure also that
OCR and CRS voting disks are available from Storage B.
Verify the NSDs’ availability and file system quorum as shown in Example 5-10. Make sure
that all disks are accessed via the localhost (direct access according to RAC requirements).
Check the GPFS cluster quorum and node availability using the mmgetstate -a -L command,
as shown in Example 5-11.
Node number Node name Quorum Nodes up Total nodes GPFS state Remarks
------------------------------------------------------------------------------------
1 austin1_interconnect 2* 2 3 active quorum node
2 austin2_interconnect 2* 2 3 active quorum node
root@austin1:/>
When the GPFS file system is available, you can start Oracle Clusterware (CRS), then Oracle
RAC, and resume operation.
Important: Restoring the original configuration is a disruptive action and requires planned
downtime.
We recommend that you script all operations and check the procedures before putting your
system in production.
In this section, we describe only step 3, because the other steps have been discussed in
other sections or materials.
4. The synchronization process starts automatically when you create the relationship. Check
for the synchronized copy by using the command shown in Example 5-15.
5. After the synchronization, start the process of moving the primary copy back to Site A:
a. Pause the B → A relationship, as shown in Example 5-16 on page 194 (commands run
on Storage B).
b. Fail over to Site A. On Storage A, execute the commands shown in Example 5-17.
As seen from Storage A:
dscli> lspprc -remotedev IBM.2107-7572791 9070:9070
Date/Time: November 22, 2007 5:17:39 PM EET IBM DSCLI Version: 5.1.720.139 DS: IBM.2107-75N0291
ID State Reason Type SourceLSS Timeout (secs) Critical Mode First Pass Status
=====================================================================================================
9070:9070 Suspended Host Source Metro Mirror 90 300 Disabled Invalid
c. Fail back A → B, as shown in Example 5-18 (commands run on Storage A). Check for
the synchronized copy.
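As a sketch, the whole sequence of steps 5a through 5c might look like the following (the
volume pair and storage image IDs are taken from our configuration; the exact command
lines in Examples 5-16 through 5-18 might differ):
# On Storage B: pause the B -> A relationship
dscli> pausepprc -remotedev IBM.2107-75N0291 9070:9070
# On Storage A: fail over to Site A
dscli> failoverpprc -remotedev IBM.2107-7572791 -type mmir 9070:9070
# On Storage A: fail back A -> B and re-establish replication
dscli> failbackpprc -remotedev IBM.2107-7572791 -type mmir 9070:9070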
6. At this point, redo the original LUN mapping (so that both AIX nodes see only the LUNs
that belong to the primary PPRC copy, located in Storage A).
7. Next, run cfgmgr on both AIX nodes and make sure that all LUNs are available, including
OCR and CRS voting disks.
8. Start CRS, GPFS, and Oracle RAC.
The GPFS snapshot is designed to be fast. GPFS basically performs a file system
synchronization of all dirty data, blocks new requests, performs file system sync again of any
new dirty data, then creates the empty snapshot inode file, and then resumes. The slow part
is waiting for all existing file system write requests to complete and get synchronized to disk.
A really busy file system can be blocked for several seconds to flush all dirty data.
The GPFS mmbackup utility uses snapshots to back up the contents of a GPFS file system at a
point in time to a Tivoli® Storage Manager server. GPFS snapshots also provide an online
backup mechanism to recover quickly from accidentally deleted files.
Snapshots are read-only, so changes are only made in active files and directories. Because
snapshots are not a copy of the entire file system, they cannot be used as protection against
disk subsystem failures.
When using GPFS snapshots with databases that perform Direct I/O to disk (Oracle uses this
feature), there is a severe performance penalty while the snapshot exists and is being backed
up. Every time that a write occurs, GPFS checks to make sure that the old block is
copied-on-write to the snapshot file. This extra checking overhead can double or triple the
normal I/O time.
mmcrsnapshot
The mmcrsnapshot command creates a snapshot of an entire GPFS file system at a single
point in time. The command syntax is:
mmcrsnapshot Device Directory
Where:
Device is the device name of the file system for which the snapshot is to be created. File
system names do not need to be fully qualified. Using oradata is just as acceptable as
/dev/oradata.
Directory is the subdirectory name where the snapshots are stored. This is a subdirectory
of the root directory and must be a unique name within the root directory.
mmlssnapshot
The mmlssnapshot command displays GPFS snapshot information for a file system. The syntax is:
mmlssnapshot Device [-d] [-Q]
Where:
Device is the device name of the file system for which snapshot information is to be
shown.
-d displays the amount of storage used by the snapshot.
-Q displays whether quotas were set to be automatically activated upon mounting the file
system at the time that the snapshot was taken.
mmdelsnapshot
The mmdelsnapshot command deletes a GPFS snapshot. It has the following syntax:
mmdelsnapshot Device Directory
Where:
Device is the device name of the file system for which the snapshot is to be deleted.
Directory is the snapshot subdirectory to be deleted.
mmrestorefs
The mmrestorefs command restores a file system from a GPFS snapshot. The syntax is:
mmrestorefs Device Directory [-c]
Where:
Device is the device name of the file system for which the restore is to be run.
Directory is the snapshot with which to restore the file system.
-c continues to restore the file system in the event that errors occur.
mmsnapdir
The mmsnapdir command creates and deletes invisible directories that connect to the
snapshots of a GPFS file system and changes the name of the snapshots subdirectory. The
syntax is:
mmsnapdir Device {[-r | -a] [-s SnapDirName]}
mmsnapdir Device [-q]
Where:
Device is the device name of the file system.
-a adds a snapshots subdirectory to all subdirectories in the file system.
-q displays the current settings if it is issued without any other flags.
-r reverses the effect of the -a option. All invisible snapshot directories are removed. The
snapshot directory under the file system root directory is not affected.
In Example 6-1, we use the time command to check the elapsed time during the execution of
the mmcrsnapshot command.
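Based on the file system and snapshot names used later in this chapter, the command was
probably similar to the following (an assumption):
root@alamo1:/> time mmcrsnapshot oradata snap1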
real 0m0.64s
user 0m0.19s
sys 0m0.05s
The last few ls commands show that the default snapshots directory name is .snapshots and
that this default subdirectory exists in the root directory of the file system.
You can change the default subdirectory using the mmsnapdir command as shown in
Example 6-2.
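A minimal sketch of such a change (the new directory name .snaps is purely illustrative):
root@alamo1:/> mmsnapdir oradata -s .snaps
root@alamo1:/> mmsnapdir oradata -q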
Note: For an overview of GPFS snapshots, refer to the chapter titled “Creating and
maintaining snapshots of GPFS file system” in the GPFS V3.1 Advanced Administration
Guide, SC23-5182.
Storage manufacturers provide functions in their disk subsystems, such as flash copy or
snapshots, that offer a point-in-time view of a specified volume at the storage level; these
functions are not discussed in this document. A GPFS snapshot is a similar mechanism that is
built into GPFS and is storage subsystem independent.
We present an overview of GPFS snapshots that are used with Oracle Database in
Figure 6-1.
Figure 6-1 GPFS snapshots used with an Oracle database on the /oradata file system: snapshot 1 is created at SCN+7 (visible under /oradata/.snapshots/snap1), copied to a remote location, and then deleted; a second snapshot is created at SCN+14; after a user error at SCN+18, the file system is restored from that snapshot
In our test, the production database is located on the /oradata file system. Each horizontal
block of the production file system (at the bottom of Figure 6-1) represents a single System
Change Number (SCN), which is assigned to each transaction in the database. SCN
numbers increase with time. They are used for consistency and recovery purposes.
At SCN+7, the administrator creates the first GPFS snapshot of the database file system.
Because snapshots do not change with time (they are consistent and read-only), they can be
used as the source for a database backup or database clone. Remember that in this
scenario, the backup will not be consistent from a database point of view. After restoration
from this backup, you must perform a recovery by using online redo logs. If redo logs are
unavailable, recovery is impossible, which makes the backup unusable. After a copy of files
within the /oradata/.snapshots/snap1 directory is complete, snapshot 1 can be deleted to
save disk space.
At SCN+14, another snapshot is created. It might coexist with snapshot 1, but additional disk
space is required. In this example, the user accidentally deletes data at time SCN+18. The
database can be restored from snapshot 2, and the data reflects the state at SCN+14.
When the restore process is necessary, backup files must be copied back to the original
database location, not to the snapshots directory. After that operation, you can open the
Oracle database, and no recovery process is performed while starting up the instance.
The database has to be stopped before creating a snapshot, which means all users have to
be logged off, and all applications connected to the database must be shut down. Downtime
caused by stopping and starting the database might be as long as several minutes, but taking
a GPFS snapshot of the file system with Oracle datafiles takes only a fraction of a second.
Overall downtime might be as long as 10 - 15 minutes (or longer, depending on how much
time is required to start the remaining applications), but when considering this method for a
large database, using GPFS snapshots can reduce system unavailability from hours to
minutes.
Cloning databases
Database “cloning” means creating a copy or multiple copies of a database, usually for
testing and development purposes.
When cloning the database to a new host, the database name and paths do not have to be
modified. The process is easier, because the control files do not have to be recreated, and
init.ora parameters do not have to be changed. Only the source files from the snapshots
directory have to be transferred to the target host.
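As a sketch, the transfer can be as simple as the following (the target host name clonehost
and the target path are hypothetical):
root@alamo1:/oradata/.snapshots/snap1/ALAMO> tar cf - . | ssh clonehost "cd /oradata/ALAMO && tar xf -"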
6.1.3 Examples
We provide several examples in this section.
2. Next, create a GPFS snapshot of the database file system. If the database spans multiple
GPFS file systems, a snapshot has to be created on each of these file systems. We created
a snapshot by using the command shown in Example 6-5.
real 0m0.66s
user 0m0.18s
sys 0m0.05s
All database files (all /oradata file system) are frozen in the /oradata/.snapshots/snap1
directory, as presented in Example 6-6 on page 203. They will not change after the
modification of the original files (/oradata).
3. After creating the snapshot, you can start the database. Because taking the GPFS
snapshot was very quick and the files will be backed up later, the overall downtime of the
database is much shorter than taking the database down until the backup process is
finished. Start the database as shown in Example 6-7.
Even though Oracle keeps all of the database files open while the database is running, the
snapshot contents do not change. Snapshot files can be backed up without stopping the
database, and the consistency of the database is not threatened. While the database is
running, file changes are reflected in the snapshot space usage, as shown in Example 6-8. We
can see that (for now) only 40 MB of data and 1.7 MB of metadata in the GPFS file system have
changed, but these numbers will increase over time.
Example 6-9 Backing up snapshot files with the tar and gzip commands
root@alamo1:/> cd /oradata/.snapshots/snap1/ALAMO
root@alamo1:/oradata/.snapshots/snap1/ALAMO> tar cfv - * | gzip >
/backup/ALAMO.tar.gz
After all of the files are archived, the GPFS snapshot is no longer necessary and can be
deleted to preserve disk space, as shown in Example 6-10.
Whenever restore is necessary, the .tar.gz file created in the previous step can be restored to
the original database location.
The database is open all of the time, and the sample load is generated from a different
machine on the network.
real 0m0.67s
user 0m0.20s
sys 0m0.04s
3. After the database corruption or the deletion of a datafile, the database is stopped on both
cluster nodes to restore the files. Files are restored to the original database location using
the commands shown in Example 6-13.
Because the snapshot was taken with the database active, the database is inconsistent,
and a recovery process is necessary before the database is usable. In this case, all of the
database files were part of a single GPFS file system, which makes this scenario much
easier, because consistency at the file system level is preserved.
4. In this case, the recovery process is handled automatically by the Oracle database. When
starting the instance, after the MOUNT phase and before OPEN, a recovery process will
be performed. Example 6-14 shows the alert.log entries for instance ALAMO1 that were
logged after performing the MOUNT phase.
In this scenario, the redo log files were a part of the same file system as the rest of the
database files. If the database spans across multiple GPFS file systems, it is impossible to
create several GPFS snapshots (one for each file system) at exactly the same time; thus,
recovery is impossible.
In this case, to guarantee consistency across several snapshots, the I/O must be frozen at
the database level using the alter system suspend command, and after creating GPFS
snapshots, resumed with the alter system resume command as presented in Example 6-15.
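A sketch of this sequence follows (the second file system, oralogs, is purely illustrative; only
/oradata comes from our configuration):
SQL> alter system suspend;
System altered.
root@alamo1:/> mmcrsnapshot oradata snap1
root@alamo1:/> mmcrsnapshot oralogs snap1
SQL> alter system resume;
System altered.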
In the case of Oracle RAC, alter system suspend and alter system resume are
cluster-aware and global; therefore, the operations on all cluster nodes will be suspended or
resumed accordingly.
Moreover, to be absolutely sure that the database can be recovered, run the database in
archivelog mode and put all tablespaces (or the whole database) in backup mode before
taking the snapshots. The Oracle Database Backup and Recovery Advanced User’s Guide,
B14191-01, describes this procedure in detail.
mmlsfs
This command displays file system attributes. The syntax is:
mmlsfs Device [-P]
-P displays the storage pools that are defined within the file system.
mmdf
This command queries the available file space on a GPFS file system. The syntax is:
mmdf Device [-P poolName]
-P poolName lists only the disks that belong to the requested storage pool.
mmlsattr
This command queries file attributes. The syntax is:
mmlsattr [-L] FileName
Where:
-L displays additional file attributes.
FileName is the name of the file to be queried.
mmchattr
This command changes the replication attributes, storage pool assignment, and I/O caching
policy for one or more GPFS files. The syntax is:
mmchattr [-P PoolName] [-I {yes|defer}] Filename
Where:
-P PoolName changes the file’s assigned storage pool to the specified user pool name.
-I {yes | defer} specifies if migration between pools is to be performed immediately
(-I yes) or deferred until a later call to mmrestripefs or mmrestripefile (-I defer). By
deferring the updates to more than one file, the data movement can be done in parallel.
The default is yes.
Filename is the name of the file to be changed.
mmrestripefs
This command rebalances or restores the replication factor of all files in a file system. The
syntax is:
mmrestripefs Device {-p} [-P PoolName]
Where:
-p indicates mmrestripefs will repair the file placement within the storage pool.
-P PoolName indicates mmrestripefs will repair only files that are assigned to the specified
storage pool.
mmrestripefile
The mmrestripefile command rebalances or restores the replication of individual files. The
syntax is:
mmrestripefile {-b | -m | -p | -r} Filename [Filename ...]
Where:
Filename is the name of one or more files to be restriped.
-m migrates all critical data off any suspended disk in this file system.
-r migrates all data off of the suspended disks and restores all replicated files in the file
system to their designated degree of replication.
-p repairs the file placement within the storage pool.
-b rebalances all files across all disks that are not suspended.
Policy commands
We present the GPFS policy commands and options next.
mmchpolicy
This GPFS policy command establishes policy rules for a file system. The syntax is:
mmchpolicy Device PolicyFileName [-t DescriptiveName] [-I {yes|test}]
Where:
Device is the device name of the file system for which policy information is to be
established or changed.
PolicyFileName is the name of the file containing the policy rules.
-t DescriptiveName is the optional descriptive name to be associated with the policy
rules.
-I {yes | test} specifies whether to activate the rules in the policy file PolicyFileName.
yes means that policy rules are validated and immediately activated, which is the default.
test means that policy rules are validated, but not installed.
mmlspolicy
This GPFS policy command displays policy information for the file system. The syntax is:
mmlspolicy Device [-L]
Where:
Device is the device name of the file system for which policy information is to be displayed.
-L shows the entire original policy file.
mmapplypolicy
This GPFS policy command deletes files or migrates file data between storage pools in
accordance with policy rules. The syntax is:
mmapplypolicy {Device|Directory} [-P PolicyFile] [-I {yes|defer|test}] [-L n ]
[-D yyyy-mm-dd[@hh:mm[:ss]]] [-s WorkDirectory]
Where:
Device is the device name of the file system from which files are to be deleted or migrated.
-P PolicyFile is the name of the policy file.
Directory is the fully qualified path name of a GPFS file system subtree from which files
are to be deleted or migrated.
-I {yes | defer | test} determines which actions the mmapplypolicy command
performs on files.
yes means that all applicable MIGRATE and DELETE policy rules are run, and the data
movement between pools is done during the processing of the mmapplypolicy command.
This is the default action.
defer means that all applicable MIGRATE and DELETE policy rules are run, but actual
data movement between pools is deferred until the next mmrestripefs or mmrestripefile
command.
test means that all policy rules are evaluated, but the mmapplypolicy command only
displays the actions that are performed if -I defer or -I yes is specified.
-L n controls the level of information that is displayed by the mmapplypolicy command.
-D yyyy-mm-dd[@hh:mm[:ss]] specifies a date and optionally a Coordinated Universal
Time (UTC) as year-month-day at hour:minute:second.
-s WorkDirectory is the directory to be used for temporary storage during the
mmapplypolicy command processing. The default directory is /tmp.
In this chapter, we discuss storage pools and policy rules in the context of data partitioning in
an Oracle database test environment.
Note: For details about GPFS policy-based data management implementations (storage
pools, filesets, policies, and rules), refer to the GPFS V3.1 Advanced Administration
Guide, SC23-5182.
The Oracle Database Administrator’s Guide 10g Release 2, B14231-01, provides a detailed
explanation of all partitioning methods.
In this chapter, we used range partitioning, because it is the closest to an ILM strategy and the
easiest way to demonstrate the described features.
Range partitioning is useful when data has logical ranges into which it can be distributed.
Accounting data is an example of this type of data. Every accounting operation has an
assigned date (a time stamp). By splitting this data into partitions, each period can reside in a
different tablespace. In addition, with GPFS 3.1 storage pools, each of those tablespaces can
be assigned to a different storage pool while being located in the same file system.
Figure 6-2 Oracle Database partitioning and GPFS storage pools idea: the quarterly range partitions of the table (Q1 2006 through Q4 2007) are mapped to GPFS storage pools; the most recent data (Q3 and Q4 2007) remains in the system pool on enterprise (highest performance) storage, Q1 and Q2 2007 data is placed in pool1 on midrange storage, and the 2006 data is placed in pool2
With these mechanisms, it is possible to achieve high performance while reducing the cost of
hardware.
The following sections describe GPFS storage pools and Oracle Database partitioning
working together.
Example 6-18 shows creating a working collective that includes GPFS nodes so that we can
ssh/scp across the cluster nodes.
We created a disk descriptor file named disk.desc (Example 6-19 on page 215). We used
this file to create Network Shared Disks (NSDs).
In the next step, we edited the disk descriptor file, disks.desc, as shown in Example 6-22.
In this example:
gpfs14nsd and gpfs15nsd will be used for the system storage pool.
gpfs16nsd and gpfs17nsd are for user storage pool1.
gpfs18nsd and gpfs19nsd are for user storage pool2.
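A sketch of what the edited disks.desc file might contain after this step (the failure groups are
assumptions; note that user storage pools such as pool1 and pool2 can hold data only, so all
metadata stays in the system pool):
gpfs14nsd:::dataAndMetadata:1::system
gpfs15nsd:::dataAndMetadata:1::system
gpfs16nsd:::dataOnly:1::pool1
gpfs17nsd:::dataOnly:1::pool1
gpfs18nsd:::dataOnly:1::pool2
gpfs19nsd:::dataOnly:1::pool2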
In the next step, we created the /oradata file system by using the mmcrfs command.
Example 6-23 shows the log of the detailed output.
Example 6-23 Creating the GPFS file system with storage pools
root@alamo1:/tmp/mah> mmcrfs /oradata oradata -F disks.desc
GPFS: 6027-531 The following disks of oradata will be formatted on node alamo2:
gpfs14nsd: size 10485760 KB
gpfs15nsd: size 10485760 KB
gpfs16nsd: size 10485760 KB
gpfs17nsd: size 10485760 KB
gpfs18nsd: size 10485760 KB
gpfs19nsd: size 10485760 KB
GPFS: 6027-540 Formatting file system ...
GPFS: 6027-535 Disks up to size 24 GB can be added to storage pool 'system'.
GPFS: 6027-535 Disks up to size 24 GB can be added to storage pool 'pool1'.
GPFS: 6027-535 Disks up to size 24 GB can be added to storage pool 'pool2'.
Creating Inode File
Creating Allocation Maps
Clearing Inode Allocation Map
Clearing Block Allocation Map
GPFS: 6027-572 Completed creation of file system /dev/oradata.
mmcrfs: 6027-1371 Propagating the cluster configuration data to all
affected nodes. This is an asynchronous process.
All of the disks are up, and the file system is ready. The file system was mounted by using the
mmdsh mount /oradata command.
We installed an Oracle Clusterware and the Oracle database code files in another file system,
/orabin, which is also shared between cluster nodes. We created a sample database to
demonstrate partitioning, and we located the database files on the /oradata/ALAMO directory,
where ALAMO is the database name.
To make use of storage pools, we had to create all of the tablespaces necessary for table and
index partitions first. In our scenario, we assume that both the data and index segments are
located in the same tablespace. Example 6-24 shows the tablespace creation process.
Tablespace created.
Tablespace created.
Tablespace created.
Tablespace created.
Tablespace created.
Tablespace created.
Tablespace created.
Tablespace created.
We created a sample table, TRANSACTIONS, and within the table, we defined several partitions
by using the range partitioning key.
The example below (Example 6-25) creates the TRANSACTIONS table with eight partitions. Each
table partition corresponds to one quarter of the year, and the corresponding data is stored in
separate tablespaces. Partition TRANS2007Q1 will contain the transactions for only the first
quarter of year 2007.
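An abbreviated sketch of such a statement follows (the column definitions are hypothetical;
the partition and tablespace names follow the naming used in this chapter, and only the first
two of the eight quarterly partitions are shown):
SQL> create table transactions (
       trans_id    number,
       trans_date  date,
       amount      number(12,2))
     partition by range (trans_date)
     (partition trans2006q1 values less than
        (to_date('2006-04-01','YYYY-MM-DD')) tablespace data2006q1,
      partition trans2006q2 values less than
        (to_date('2006-07-01','YYYY-MM-DD')) tablespace data2006q2);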
Then, we created a range-partitioned global index on the TRANSACTIONS table. Each index
partition is stored in a different tablespace, in the same manner as the table partitions.
Table created.
Index created.
Inode Information
-----------------
Number of used inodes: 4062
Number of free inodes: 58402
Number of allocated inodes: 62464
Maximum number of inodes: 62464
We decided to migrate all 2006 data to the storage pool pool2 and Q1 and Q2 2007 data to
pool1. We created and tested the GPFS policy file, as shown in Example 6-27.
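A sketch of what such a policy file might contain (the rule names, the policy file name
pool_policy, and the file-name patterns are assumptions based on the datafile naming used in
this chapter):
RULE 'mig2006' MIGRATE FROM POOL 'system' TO POOL 'pool2'
     WHERE NAME LIKE 'data2006%'
RULE 'mig2007h1' MIGRATE FROM POOL 'system' TO POOL 'pool1'
     WHERE NAME LIKE 'data2007q1%' OR NAME LIKE 'data2007q2%'
The rules can be validated without moving any data by running mmapplypolicy with the
-I test option, for example:
root@alamo1:/oradata/ALAMO> mmapplypolicy /oradata/ALAMO -P pool_policy -I test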
The tested GPFS policy file was executed and the datafiles were moved according to defined
rules. Example 6-28 shows the output of the mmapplypolicy command.
Example 6-29 Storage pools’ free space after applying the policy
root@alamo1:/oradata/ALAMO> mmdf oradata
disk disk size failure holds holds free KB free KB
name in KB group metadata data in full blocks in fragments
--------------- ------------- ------- -------- ----- --------------- ------------
Disks in storage pool: system
gpfs14nsd 10485760 1 yes yes 8342784 ( 80%) 6184 ( 0%)
gpfs15nsd 10485760 1 yes yes 8343552 ( 80%) 6448 ( 0%)
------------- --------------- ------------
(pool total) 20971520 16686336 ( 80%) 12632 ( 0%)
Inode Information
-----------------
Number of used inodes: 4062
Number of free inodes: 58402
Number of allocated inodes: 62464
Maximum number of inodes: 62464
As seen in Example 6-29, we used user storage pools pool1 and pool2.
Tablespace created.
SQL> alter table transactions add partition trans2008q1 values less than
(to_date('2008-04-01','YYYY-MM-DD')) tablespace data2008q1;
Table altered.
Index altered.
After this operation, you can shift older partitions to other GPFS storage pools.
Example 6-31 Useful GPFS commands that are related to storage pools
root@alamo1:/oradata/ALAMO> mmlsattr -L data2006q3.dbf
file name: data2006q3.dbf
metadata replication: 1 max 1
data replication: 1 max 1
flags:
storage pool name: pool2
fileset name: root
snapshot name:
In Example 6-31 on page 223, the mmlsattr outputs show that each file is in its assigned
storage pool.
Note: For specific command syntax, refer to the section titled “GPFS commands” in GPFS
V3.1 Administration and Programming Reference, SA23-2221.
The flexibility of the IBM System p virtualization features provides a cost-effective solution for
quick deployment of test environments to validate solutions before they are put into
production.
Disclaimer: The configuration examples using System p virtual resources (VIO Server,
virtual Ethernet, virtual SCSI) are just for test purposes. Virtual SCSI disks are currently
NOT supported in all configurations. As of the release date of this book, virtual SCSI disks
are only supported using Oracle's ASM, but not with GPFS. For the current IBM/Oracle
cross-certification status, check the following URL:
http://www.oracle.com/technology/support/metalink/index.html
For example, if a VIO server partition is lost or temporarily unavailable, all resources
associated with that partition become unavailable, which causes an outage of all associated
client partitions. Because Oracle RAC is designed to be highly available, architects and
administrators do not want to compromise this design by introducing the VIO server as a
single point of failure. In this chapter, we demonstrate that an Oracle RAC solution can be
deployed with good availability in a System p environment utilizing virtual resources.
With careful design and planning and relying on the IBM System p exceptional virtualization
capabilities, you can achieve high redundancy in the virtualized System p environment. For
example, by using two VIO server partitions per server and the proper configuration of virtual
devices (Multi-Path I/O (MPIO), Logical Volume Manager (LVM) mirroring, EtherChannel,
and so on), you can mask the failure of hardware resources and even an entire VIO server.
These configurations allow you to shut down one of the VIO servers for maintenance
purposes, software upgrade, or reconfiguration while the other VIO server provides network
and disk connectivity for client partitions. These configurations are redundant, so in the case
of a failure of one VIO server, the client partitions continue to operate through the surviving
VIO server.
In this chapter, we demonstrate how to set up a dual VIO server configuration that will provide
high availability for Oracle RAC. Remember that for a highly available Oracle RAC
environment, you need to build two similar hardware configurations (two systems each with
two VIO servers). Although you can install and run Oracle RAC on two logical partitions
(LPARs) of the same server, we do not recommend that you use the same server, because
the server itself represents a single point of failure.
This chapter provides the necessary information to create a resilient architecture for a
two-node Oracle RAC cluster. We discuss the following topics:
Configuration of the network and shared Ethernet adapters
Storage configuration with MPIO
Considerations when using System p virtualization with production RAC databases
We do not describe the installation and configuration of Oracle RAC here, because the
installation and configuration of Oracle RAC are the same as installing RAC on two physical
servers, which we have already described in this book.
There are two common ways to provide high availability for a virtualized network:
SEA failover
A link aggregation adapter with one primary adapter and one backup adapter, known as a
Network Interface Backup (NIB)
SEA failover is implemented at the VIO server level. When several client partitions run within
the same system, SEA is configured only one time for the entire System p server and
provides highly available network connectivity to every partition that utilizes virtual Ethernet.
When using SEA, a failover to a second VIO server can take as long as 30 seconds in case of
an adapter failure, which can cause problems when SEA is used for Oracle RAC
interconnect. Timeouts might be long enough to cause a “split brain” resolution in Oracle
Clusterware and evict nodes from the cluster. Of course, you can still use SEA for
administrative and Virtual IP address (VIP) networks and dedicate physical Ethernet adapters
for interconnect. Mixing physical and virtual adapters is allowed and fully supported in System
p virtualization.
The second possibility, a NIB, is implemented on every client partition and does not rely on
the SEA failover mechanism. A NIB is implemented the same way as an EtherChannel
adapter with a single primary adapter and a backup adapter.
In Figure 7-1 on page 230, the client uses two virtual Ethernet adapters to create an
EtherChannel adapter (en3) that consists of one primary adapter (en1) and one backup
adapter (en2). If the primary adapter becomes unavailable due to VIO server unavailability or
the corresponding physical Ethernet adapter failure on a VIO server partition, the NIB
switches to the backup adapter and routes the traffic through the second VIO server. This
configuration supports a total of two virtual adapters: one active virtual adapter and one
standby virtual adapter.
In this scenario, because there is no hardware link failure for virtual Ethernet adapters to
trigger a failover to the other adapter, it is mandatory to use the ping-to-address feature of
EtherChannel to detect network failures. When configuring virtual adapters for NIB, the two
internal networks must be separated in the hypervisor layer by assigning two different PVIDs.
Note: There is a common behavior with both SEA failover and NIB: They do not check the
reachability of the specified IP address through the backup-path as long as the primary
path is active. They do not check, because the virtual Ethernet adapter is always
connected, and there is no linkup event, such as there is with physical adapters. You do
not know if you really have an operational backup until your primary path fails.
Virtual Ethernet uses the system processors for all communication functions instead of
offloading the load to processors on network adapter cards. As a result, there is an increase
in the system processor load that is generated by the virtual Ethernet traffic. This might be a
good reason to consider using physical adapters for Oracle RAC interconnect.
The connection to the client partition that is shown in Figure 7-1 is still available in case of:
A switch failure
Failure of any Ethernet link
Failure of the physical Ethernet adapter on the VIO server
Virtual I/O server failure or maintenance
Assuming both virtual interfaces are visible on the AIX partition, the easiest way to configure
NIB is by using SMIT. Follow these steps to create an ent3 adapter, which will be the
aggregated adapter with the ent1 and ent2 adapters:
1. Use the following SMIT fastpath: smitty etherchannel.
2. Select Add An EtherChannel / Link Aggregation.
3. From the list, choose a primary adapter for NIB, in this case, ent1.
4. The window in Example 7-1 appears.
[Entry Fields]
EtherChannel / Link Aggregation Adapters ent1 +
Enable Alternate Address no +
Alternate Address [] +
Enable Gigabit Ethernet Jumbo Frames no +
Mode standard +
Hash Mode default +
Backup Adapter ent2 +
Automatically Recover to Main Channel yes +
Perform Lossless Failover After Ping Failure yes +
Internet Address to Ping [10.10.10.1]
Number of Retries [2] +#
Retry Timeout (sec) [1] +#
The next step is to assign an IP address for the newly created NIB. In our test scenario, we
configure it with address 10.10.10.2 (see Example 7-2).
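Example 7-2 is not reproduced in this listing. The same result can also be obtained from the command line, for example (the netmask is an assumption):
chdev -l en3 -a netaddr=10.10.10.2 -a netmask=255.255.255.0 -a state=up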
After completing this part, a Network Interface Backup is configured and ready to use.
During the transfer, the first VIO server partition (which was handling the network traffic) was
shut down with the Hardware Management Console (HMC).
8. After the FTP transfer completed, we verified the integrity of the transferred file on the AIX
partition and saw no loss of data.
At the same time, during the VIO server failure, AIX detected the failure of the primary
interface that was used for the network interface backup and switched to the backup
interface. Example 7-4 shows the output from the errpt command that indicates the failure of
the network interface backup.
root@texas:/> errpt -a
---------------------------------------------------------------------------
LABEL: ECH_PING_FAIL_PRMRY
IDENTIFIER: 9F7B0FA6
Description
PING TO REMOTE HOST FAILED
Probable Causes
CABLE
SWITCH
ADAPTER
Recommended Actions
CHECK CABLE AND ITS CONNECTIONS
IF ERROR PERSISTS, REPLACE ADAPTER CARD.
Detail Data
FAILING ADAPTER
PRIMARY
SWITCHING TO ADAPTER
ent2
Unable to reach remote host through primary adapter: switching over to backup
adapter
Disclaimer: The configuration examples using System p virtual resources (VIO Server,
virtual Ethernet, virtual SCSI) are just for test purposes. Virtual SCSI disks are currently
NOT supported in all configurations. As of the release date of this book, virtual SCSI disks
are only supported using Oracle's ASM, but not with GPFS. For the current IBM/Oracle
cross-certification status, check the following URL:
http://www.oracle.com/technology/support/metalink/index.html
For test or proof of concept (POC) environments, you can deploy a configuration that is not
based on a SAN. In fact, a carefully designed VIO server environment can simulate a virtual
SAN. Refer to Chapter 8, “Deploying test environments using virtualized SAN” on page 241
for more details.
7.2.1 External storage LUNs for Oracle 10g RAC data files
Due to the characteristics of the VIO server implementation, you must configure concurrent
access to the same storage devices from two or more client partitions, considering the
following aspects.
No reserve
To access the same set of LUNs on external storage from two VIO servers at the same time,
the Small Computer System Interface (SCSI) disk reservation has to be disabled. Failing to
disable the SCSI reservation prevents one VIO server from accessing the disk. This
configuration has to be enforced, and it does not depend on your choice to use GPFS, ASM,
or direct hdisks for voting or OCR disks.
You must set this no reserve policy on both VIO servers and on all RAC nodes.
Make sure that the reserve policy is set to no_reserve, as shown in Example 7-5. Note that
the default value is single_path.
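A quick way to check and set the attribute on an hdisk follows; hdisk4 is a placeholder, and the commands must be run on both VIO servers and on all RAC nodes.
# From AIX (or from the VIO server root shell reached with oem_setup_env)
lsattr -El hdisk4 -a reserve_policy
chdev -l hdisk4 -a reserve_policy=no_reserve
# Equivalent command in the VIO server padmin restricted shell:
# chdev -dev hdisk4 -attr reserve_policy=no_reserve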
For Oracle 10g RAC shared storage, the same LUNs on the SAN have to be accessed
concurrently by two partitions through two VIO servers. Thus, you cannot use a part of a
physical disk or logical volume (LV) as a shared disk, because the definition is local to the
LVM of a VIO server. So, when defining the mappings of the Oracle 10g RAC data disks on
the VIO server, map only an entire LUN to the client partition.
For disks to be used as rootvg by the client partitions, you can map logical volumes in the VIO
server if you do not want to dedicate an entire disk (LUN) for one partition. See 7.2.2, “Internal
disk for client LPAR rootvg” on page 237 for more details.
Note: All LUNs on external storage (that will be used to hold Oracle 10g RAC data files)
must be mapped in the VIO server as a whole disk to the client partitions. The reserve
policy must be set to no_reserve.
Figure 7-2 External storage disk configuration with MPIO and redundant VIO servers
The same LUN is accessed by two VIO servers and mapped (the entire disk) to the same AIX
node. The AIX MPIO layer is aware that the two paths point to the same disk.
If any of the SAN switches, cables, or physical HBAs fail, or if a VIO server is stopped, there
is still another path available to reach the disk. MPIO manages the load balancing and
failover at the AIX level, which is transparent to Oracle. Thus, a dual HBA in each VIO server
is not mandatory, because it does not add availability at the node level.
If a VIO server is stopped, the result is the failure of one path to the storage. MPIO fails over
to the surviving path without requiring administrative action. When the failed path is back,
MPIO reintegrates it automatically. The failure and failback are completely transparent to
users and applications; there is nothing to do. Errors are recorded in the AIX error report to
keep track of the events.
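On the client partition, you can confirm that MPIO sees one path per VIO server; hdisk4 is a placeholder, and the sample output lines are illustrative.
lspath -l hdisk4
# Expected: two Enabled paths, one per virtual SCSI adapter, for example:
#   Enabled hdisk4 vscsi0
#   Enabled hdisk4 vscsi1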
Figure 7-3 Disk configuration with redundant VIO servers for DS4000
One of the goals of virtualization is to utilize the resources in the best manner possible. For
example, one 143 GB disk might be too large for a single rootvg; thus, you can efficiently use
the space on this disk by allocating logical volumes on this disk at the VIO server level and
mapping the LVs as virtual SCSI disks used for rootvg to each client LPAR. Another goal is to
share resources between various client LPARs. A limited number of internal disks in the VIO
server can be shared by a large number of client LPARs. With the LVM mirroring proposed
next, you can further increase the level of high availability.
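On the VIO server (padmin shell), carving a logical volume out of an internal disk and mapping it as a client rootvg disk might look like the following sketch; the volume group name, the 20 GB size, and the LV and VTD names are assumptions for illustration.
mklv -lv client1vg_lv clientvg 20G
mkvdev -vdev client1vg_lv -vadapter vhost0 -dev client1_vg_vtd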
Mirrored rootvg
To remove all the SPOFs, we use two VIO servers, and the rootvg must be mirrored using
LVM, which we display in Figure 7-4.
A failure of an internal disk in one VIO server, or a shutdown of the VIO server, results in the
failure of one LV copy. However, rootvg is still alive. After the VIO server is rebooted (for
example), you must resynchronize the mirror (syncvg command).
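On the client partition, mirroring rootvg across the two virtual SCSI disks (one from each VIO server) and resynchronizing after an outage might look like this sketch, assuming hdisk0 and hdisk1 are the two virtual disks:
extendvg rootvg hdisk1
mirrorvg rootvg hdisk1
bosboot -ad /dev/hdisk1
bootlist -m normal hdisk0 hdisk1
# After the failed VIO server is back, resynchronize the stale copy:
syncvg -v rootvg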
Disclaimer: The configuration examples using System p virtual resources (VIO Server,
virtual Ethernet, virtual SCSI) are just for test purposes. Virtual SCSI disks are currently
NOT supported in all configurations. As of the release date of this book, virtual SCSI disks
are only supported using Oracle's ASM, but not with GPFS. For the current IBM/Oracle
cross-certification status, check the following URL:
http://www.oracle.com/technology/support/metalink/index.html
Figure 7-5 Oracle RAC configuration with two virtualized System p servers
The architecture that we propose is suitable for development and testing purposes. One of its
goals is to use the least possible number of disk and network (physical) adapters, which we
achieve by sharing the same physical disk for all of the rootvg and virtual Ethernet adapters,
for example. Creating a virtual SAN resource further contributes to reducing the hardware
and administrative costs that are required to deploy clusters. You can create lightweight
partitions with almost no hardware. For example, if you have a virtual I/O (VIO) server and
one free disk, you can create two partitions for running Oracle 10g RAC easily. Of course, this
configuration cannot match the performance of a similar environment with dedicated physical
resources.
Disclaimer: The configuration examples using System p virtual resources (VIO Server,
virtual Ethernet, virtual SCSI) are just for test purposes. Virtual SCSI disks are currently
NOT supported in all configurations. As of the release date of this book, virtual SCSI disks
are only supported using Oracle's ASM, but not with GPFS. For the current IBM/Oracle
cross-certification status, check the following URL:
http://www.oracle.com/technology/support/metalink/index.html
Logical volumes can be assigned as virtual SCSI disks, but they can only be used in one
LPAR; they cannot be shared between two client LPARs.
Figure 8-1 Simple fully virtualized Oracle 10g RAC architecture for development or testing
This architecture is not highly available. There are several single points of failure. Usually,
Oracle 10g RAC is used for providing high availability or even disaster recovery capabilities.
However, the goal of this configuration is to deploy a real (but lightweight) RAC.
Disclaimer: The configuration examples using System p virtual resources (VIO Server,
virtual Ethernet, virtual SCSI) are just for test purposes. Virtual SCSI disks are currently
NOT supported in all configurations. As of the release date of this book, virtual SCSI disks
are only supported using Oracle's Automatic Storage Management (ASM), but not with
GPFS. For the current IBM/Oracle cross-certification status, check the following URL:
http://www.oracle.com/technology/support/metalink/index.html
You can create a GPFS file system with as little as one SCSI disk. This disk can hold the
normal data files and be used as a tiebreaker disk at the same time.
Note: This configuration provides no data protection, which is acceptable because this is a
test environment.
When creating the virtual adapters, you must make sure that the interface number (en#) is
the same on all of the nodes for the same network (interconnect and public). The Oracle
Clusterware configuration uses the interface number to define the networks, not the
associated IP label (name).
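A quick way to confirm that the interface numbering matches on all nodes is a simple loop; the node names match the client1 and client2 partitions used in this chapter.
for node in client1 client2; do
   echo "### $node"
   ssh $node 'netstat -in | grep -v link#'
done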
The outbound client traffic for two RAC nodes shares the same physical interface. If this
design becomes a bottleneck, you can use a link aggregation interface (EtherChannel) with
two or more physical interfaces.
$ lspv
NAME PVID VG STATUS
hdisk0 0022be2abc04a1ca rootvg active
hdisk1 00cc5d5c6b8fd309 None
hdisk2 0022be2a80b97feb None
hdisk3 0022be2abc247c91 None
$ lsdev -virtual
name status description
ent3 Available Virtual I/O Ethernet Adapter (l-lan)
vhost0 Available Virtual SCSI Server Adapter
vhost1 Available Virtual SCSI Server Adapter
vsa0 Available LPAR Virtual Serial Adapter
### Create virtual target devices for mapping backing devices to virtual SCSI
adapters. These devices will be used for local client rootvgs.
### Create virtual target devices for mapping backing devices to virtual SCSI
adapters. These devices will be used for shared disks.
### Verify mapping information between backing devices and virtual SCSI adapters
VTD clinet1_vg_vtd
LUN 0x8100000000000000
Backing device client1vg_lv
Physloc
VTD vtscsi0
LUN 0x8200000000000000
Backing device hdisk2
Physloc U7879.001.DQDKZNP-P1-T14-L4-L0
VTD clinet2_vg_vtd
LUN 0x8100000000000000
Backing device client2vg_lv
Physloc
VTD vtscsi1
LUN 0x8200000000000000
Backing device hdisk2
Physloc U7879.001.DQDKZNP-P1-T14-L4-L0
### Choose a physical Ethernet adapter and a virtual Ethernet adapter to create a
shared Ethernet adapter. Make sure that no IP address is assigned to the physical
adapter at the time you create the shared Ethernet adapter.
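The mkvdev commands behind these steps are not shown in this listing. In the following sketch, hdisk2, vhost0, vhost1, vtscsi0, and vtscsi1 match the mapping output below, while ent0 (physical adapter), ent3 (virtual adapter), and the default PVID of 1 are assumptions.
# Map the entire shared LUN to the virtual SCSI adapters of both clients
mkvdev -vdev hdisk2 -vadapter vhost0 -dev vtscsi0
mkvdev -vdev hdisk2 -vadapter vhost1 -dev vtscsi1
# Create the shared Ethernet adapter (ent0 must not have an IP address configured)
mkvdev -sea ent0 -vadapter ent3 -default ent3 -defaultid 1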
### Check the virtual devices on both nodes defined in client partitions
root@client1:/> lspv
hdisk0 00c7cd9e76c83540 rootvg active
hdisk1 00c7cd9ece71f8d4 None
When the VIO server and the client partitions have been configured, proceed to install the
operating system on the client1 and client2 LPARs (we have used the Network Installation
Management (NIM) installation), configure networking, and configure GPFS.
Install Oracle Clusterware and database as described in Chapter 2, “Basic RAC configuration
with GPFS” on page 19.
Tip: The Available Network Adapters window displays all Ethernet adapters. If you select
an Ethernet adapter that is already being used (has a defined interface), you get an error
message. You first need to detach this interface if you want to use it.
Note: It is an invalid combination to select a Hash Mode other than default with a Mode of
round_robin.
Backup Adapter: This field is optional. Enter the adapter that you want to use as your
EtherChannel backup.
Internet Address to Ping: This field is optional and only takes effect if you are running
Network Interface Backup mode, or if you have one or more adapters in the EtherChannel
and a backup adapter. The EtherChannel pings the IP address or host name that you
specify here. If the EtherChannel is unable to ping this address for the number of times
specified in the Number of Retries field, and in the intervals specified in the Retry Timeout
field, the EtherChannel switches adapters.
Number of Retries: Enter the number of ping response failures that are allowed before
the EtherChannel switches adapters. The default is three. This field is optional and valid
only if you set an Internet Address to Ping.
Retry Timeout: Enter the number of seconds between the EtherChannel's ping attempts to
the Internet Address to Ping. The default is one second. This field is optional and valid only if
you have set an Internet Address to Ping.
4. Press Enter after changing the desired fields to create the EtherChannel. Configure IP
over the newly created EtherChannel device by typing smitty chinet at the command
line.
5. Select your new EtherChannel interface from the list. Fill in all of the required fields and
press Enter.
Note: The methods described allow for encrypted traffic between the cluster nodes without
the need to enter a password or passphrase. This means that even though the traffic is
encrypted, a malicious user who gains access to one of the cluster nodes has access to all
cluster nodes.
Important: Be careful when changing the server keys on the nodes in the cluster. Doing
things in the wrong order might prevent you from logging on to the systems (especially if
ssh is the only way to access the systems over the network).
a. Then, edit the file, adding the remaining nodes in the cluster. See Example B-3.
b. As seen in Example B-3, there are two nodes in the cluster, alamo1 and alamo2. Both
nodes are accessible in the 192.168.100.x and 10.1.100.x subnets.
3. Create the authorized_keys.
Because we intend to use the same keys on all nodes, the authorized keys will be just the
root user’s key. See Example B-4.
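Example B-4 is not reproduced in this listing. A sketch of generating one RSA key pair without a passphrase and distributing it follows; the key type and file locations are assumptions consistent with a default OpenSSH setup.
ssh-keygen -t rsa -N "" -f $HOME/.ssh/id_rsa
cat $HOME/.ssh/id_rsa.pub > $HOME/.ssh/authorized_keys
# Copy the same key pair and authorized_keys file to the second node
scp -p $HOME/.ssh/id_rsa $HOME/.ssh/id_rsa.pub $HOME/.ssh/authorized_keys alamo2:.ssh/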
b. After changing the server keys, sshd must be restarted, as shown in Example B-6.
Example B-6 Restart sshd on the nodes that have had the keys changed
root@alamo2:/> stopsrc -s sshd
0513-044 The sshd Subsystem was requested to stop.
root@alamo2:/> startsrc -s sshd
0513-059 The sshd Subsystem has been started. Subsystem PID is 295030.
root@alamo2:/>
e. Finally, verify that everything works. We use the command shown in Example B-9.
---dar0---
2. We create the node and disk descriptor files in /etc/gpfs_config. They are listed in
Example C-2 on page 264.
3. We create the cluster using the gpfs_nodes descriptor file, as shown in Example C-3.
4. We create the Network Shared Disks (NSDs) using the descriptor file gpfs_disks_tb for
tiebreaker disks, gpfs_disk_oradata for the /oradata file system disks, and
gpfs_disk_orabin for the /orabin file system, as shown in Example C-4 on page 265.
5. Add the tiebreaker NSDs to the cluster configuration, as shown in Example C-5.
6. The cluster is now ready, and we start GPFS on all nodes (Example C-6).
7. The /orabin file system is then created, as shown in Example C-7 on page 266. The block
size is set to 256k; the file system is created with a maximum of 80k inodes.
GPFS: 6027-531 The following disks of orabin will be formatted on node dallas1:
nsd05: size 10485760 KB
nsd06: size 10485760 KB
GPFS: 6027-540 Formatting file system ...
Creating Inode File
Creating Allocation Maps
Clearing Inode Allocation Map
Clearing Block Allocation Map
Flushing Allocation Maps
GPFS: 6027-535 Disks up to size 27 GB can be added to this file system.
GPFS: 6027-572 Completed creation of file system /dev/orabin.
mmcrfs: 6027-1371 Propagating the changes to all affected nodes.
This is an asynchronous process.
GPFS: 6027-531 The following disks of oradata will be formatted on node dallas2:
nsd01: size 10485760 KB
nsd02: size 10485760 KB
nsd03: size 10485760 KB
nsd04: size 10485760 KB
GPFS: 6027-540 Formatting file system ...
Creating Inode File
Creating Allocation Maps
Clearing Inode Allocation Map
Clearing Block Allocation Map
Flushing Allocation Maps
GPFS: 6027-535 Disks up to size 70 GB can be added to this file system.
GPFS: 6027-572 Completed creation of file system /dev/oradata.
mmcrfs: 6027-1371 Propagating the changes to all affected nodes.
This is an asynchronous process.
9. Finally, the GPFS cluster is restarted and the file systems are checked, as shown in
Example C-9 on page 267.
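The commands behind Examples C-3 through C-7 are not reproduced in this listing. A consolidated sketch for GPFS 3.1 follows; the descriptor file paths match the /etc/gpfs_config names above, while the tiebreaker NSD names, inode count, block sizes, and mount options are assumptions, so verify the exact syntax against the GPFS V3.1 Administration and Programming Reference.
# Create the cluster from the node descriptor file (dallas1 and dallas2)
mmcrcluster -N /etc/gpfs_config/gpfs_nodes -p dallas1 -s dallas2 \
            -r /usr/bin/ssh -R /usr/bin/scp
# Create the NSDs for the tiebreaker, /oradata, and /orabin disks
mmcrnsd -F /etc/gpfs_config/gpfs_disks_tb
mmcrnsd -F /etc/gpfs_config/gpfs_disk_oradata
mmcrnsd -F /etc/gpfs_config/gpfs_disk_orabin
# Register the tiebreaker disks and start GPFS on all nodes
mmchconfig tiebreakerDisks="tbnsd1;tbnsd2;tbnsd3"
mmstartup -a
# Create the file systems (256 KB block size and 80,000 inodes for /orabin)
mmcrfs /orabin /dev/orabin -F /etc/gpfs_config/gpfs_disk_orabin -B 256K -N 80000 -A yes
mmcrfs /oradata /dev/oradata -F /etc/gpfs_config/gpfs_disk_oradata -A yes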
You are asked whether rootpre.sh has been run, as shown in Figure 2-3 on page 51. Make sure
that you execute Disk1/rootpre/rootpre.sh as the root user on each node. The steps are:
12. As the root user, execute root.sh on each node, as shown in Example 8-2.
root@austin2:/orabin/ora102> root.sh
Running Oracle10 root.sh script...
If there is no problem, you will see “End of Installation” as shown in Figure D-12.
Before you proceed to database creation, we recommend that you apply the recommended
patch set for the database code files.
For details, see also the following Oracle Metalink document: Removing a Node from a 10g
RAC Cluster, Doc ID: Note:269320.1 at:
http://metalink.oracle.com
The publications listed in this section are considered particularly suitable for a more detailed
discussion of the topics covered in this book.
Other publications
These publications are also relevant as further information sources:
GPFS V3.1 Concepts, Planning, and Installation Guide, GA76-0413
GPFS V3.1 Advanced Administration Guide, SC23-5182
GPFS V3.1 Administration and Programming Reference, SA23-2221
GPFS V3.1 Problem Determination Guide, GA76-0415-00
Online resources
These Web sites are also relevant as further information sources:
Oracle articles about changing OCR and CRS voting disks to raw devices:
http://www.oracle.com/technology/pub/articles/vallath-nodes.html
http://www.oracle.com/technology/pub/articles/chan_sing2rac_install.html
Oracle Knowledge Base (Metalink)
http://metalink.oracle.com
G
General Parallel File System (GPFS) 19, 103–107, 163, 165, 195–199
Global Cache Directory (GCD) 111
GPFS 3.1
   release 207
   storage pool 212
GPFS cluster 104–107, 138, 153, 157–158, 160–161, 174, 181, 263, 266
GPFS code 106, 140, 142
GPFS commands
   mmapplypolicy 209–210, 220
   mmbackup 196
   mmchattr 208
   mmchpolicy 210
   mmcommon 214
   mmcrnsd 215
   mmcrsnapshot 196
   mmdelsnapshot 197, 199, 204
   mmdf 208, 219
   mmdsh 214
   mmfsadm dump cfgmgr 175
   mmgetstate 192
   mmlsattr 208, 223
   mmlscluster 214
   mmlsdisk 191
   mmlsfs 208
   mmlsnsnapshot 197
   mmlspolicy 210
   mmrestorefs 197, 199
   mmrestripefile 209
   mmrestripefs 208
   mmsnapdir 197–198
GPFS file
   system 105, 199
   system layer 183
GPFS file system 105–106, 139, 162, 164, 166, 174–175, 178, 181, 183, 196, 199, 201–202
   layer 183
   namespace 209
   subtree 210
I
IBM DSCLI (ID) 188, 190–191, 193
IEEE 802.3ad
   Link Aggregation configuration 256
Information Lifecycle Management (ILM) 206, 212
Inode File 165, 196, 216, 266
interface number 243
Internet Address 231, 256–257
IP address 230–231, 251, 256–257
IP label 160
IP traffic 256–257
J
jumbo frame 255
L
Link Aggregation 229, 231, 255–256
   Adapter 255
   Control Protocol 256
link aggregation
   interface 243
logical volume
   control block 134
   first block 123, 134
   manager 183
logical volume (LV) 123, 134–135, 235, 237, 242, 250
long wave (LW) 187
LPAR 20, 242, 247, 249–250, 252
LUN mapping 179, 183
LUN size 144, 190
LUNs 107, 144–145, 162, 183, 186–188, 191, 194, 234–236
LVM mirroring 227, 237–238
M
Metro Mirror 185
migration 211
mklv 123, 133–135
mmcrcluster 264
S
select instance_number 116, 122
separate tablespaces 212, 218
Shared Ethernet Adapter (SEA) 229, 243, 251, 255
single point 154, 157, 227, 235–236, 238, 242
Single Point of Failure (SPOF) 154, 235
size 10485760 KB 216, 266
size 50M 113
snapshot files 201
spfile 108, 110, 122, 125
SQL commands
   alter database 205
   alter system 206
   create index 219
   create table 218
   create tablespace 217
   drop tablespace 127
   select database_status 206
sqlplus 109, 116
ssh 28
startup nomount 125
Storage Area Network (SAN) 154, 185, 234–236
storage B 187, 190–191, 193
Storage pool
   file placement 208
   GPFS file system 213, 216
   newly created files 209
storage pool 151, 165, 206–207, 210, 212
storage pools 207, 213
storage subsystem 107, 154–155, 158, 162, 185, 187–188, 191–192, 199
   original mapping 192
System Change Number (SCN) 200, 205
System storage pool 207
T
tablespaces 206, 212, 217
tar cfv 204
TCP traffic 256–257
test environment 20, 104, 121, 123, 135, 243, 255
   raw devices 123
third node 153, 158–160, 162–163
   free internal disk 163
   inappropriate error messages 162
   internal SCSI disk 178
   NFS server 169
Thu Sep 20 174, 180
Thu Sep 27 125, 203, 205
Transparent Application Failover (TAF) 106, 183
ttl 233
U
user storage pools 222
V
VIO server 227, 244–246, 249, 252
   Configuring virtual resources 249
Virtual I/O Server
   partition 227
   unavailability 229
virtual I/O server
   external storage 234
Virtual I/O Server (VIOS) 229–232, 243
virtual IO server
   dual HBA 236
Voting disc 144, 148, 166, 171–172, 183
voting disc
   NFS clients 169
Understand clustering layers that help harden your configuration
Learn System p virtualization and advanced GPFS features
Deploy disaster recovery and test scenarios
This IBM Redbooks publication helps you architect, install, tailor, and configure Oracle 10g
RAC on System p™ clusters running AIX®. We describe the architecture and how to design,
plan, and implement a highly available infrastructure for Oracle database using IBM General
Parallel File System (GPFS) V3.1.
This book gives a broad understanding of how Oracle 10g RAC can use and benefit from the
virtualization facilities embedded in the System p architecture and how to efficiently use the
tremendous computing power and availability characteristics of the POWER5 hardware and
the AIX 5L operating system.
This book also helps you design and create a solution to migrate your existing Oracle 9i RAC
configurations to Oracle 10g RAC, simplifying configurations and making them easier to
administer and more resilient to failures.
This book also describes how to quickly deploy Oracle 10g RAC test environments and how
to use some of the built-in disaster recovery capabilities of IBM GPFS and storage
subsystems to make your cluster resilient to various failures.
This book is intended for anyone planning to architect, install, tailor, and configure Oracle
10g RAC on System p™ clusters running AIX and GPFS.
IBM Redbooks are developed by the IBM International Technical Support Organization.
Experts from IBM, Customers and Partners from around the world create timely technical
information based on realistic scenarios. Specific recommendations are provided to help you
implement IT solutions more effectively in your environment.