Octavian Lascu
Mustafa Mah
Michel Passet
Harald Hammershøi
SeongLul Son
Maciej Przepiórka
ibm.com/redbooks
International Technical Support Organization
April 2008
SG24-7541-00
Note: Before using this information and the product it supports, read the information in “Notices” on
page vii.
This edition applies to Version 3, Release 1, Modification 6 of IBM General Parallel File System (product
number 5765-G66), Version 5, Release 3 of IBM High Availability Cluster Multi-Processing (product
number 5765-F62), Version 5, Release 3, Technology Level 6 of AIX (product number 5765-G03), and Oracle
CRS Version 10 Release 2 and Oracle RAC Version 10 Release 2.
Notices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
Trademarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
The team that wrote this book . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
Become a published author . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
Comments welcome. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
Chapter 1. Introduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1 Why clustering. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 Architectural considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2.1 RAC and Oracle Clusterware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2.2 IBM GPFS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.3 Configuration options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.3.1 RAC with GPFS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.3.2 RAC with automatic storage management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.3.3 RAC with HACMP and CLVM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
Part 5. Appendixes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289
This information was developed for products and services offered in the U.S.A.
IBM may not offer the products, services, or features discussed in this document in other countries. Consult
your local IBM representative for information on the products and services currently available in your area.
Any reference to an IBM product, program, or service is not intended to state or imply that only that IBM
product, program, or service may be used. Any functionally equivalent product, program, or service that does
not infringe any IBM intellectual property right may be used instead. However, it is the user's responsibility to
evaluate and verify the operation of any non-IBM product, program, or service.
IBM may have patents or pending patent applications covering subject matter described in this document. The
furnishing of this document does not give you any license to these patents. You can send license inquiries, in
writing, to:
IBM Director of Licensing, IBM Corporation, North Castle Drive, Armonk, NY 10504-1785 U.S.A.
The following paragraph does not apply to the United Kingdom or any other country where such
provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION
PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR
IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT,
MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of
express or implied warranties in certain transactions, therefore, this statement may not apply to you.
This information could include technical inaccuracies or typographical errors. Changes are periodically made
to the information herein; these changes will be incorporated in new editions of the publication. IBM may make
improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time
without notice.
Any references in this information to non-IBM Web sites are provided for convenience only and do not in any
manner serve as an endorsement of those Web sites. The materials at those Web sites are not part of the
materials for this IBM product and use of those Web sites is at your own risk.
IBM may use or distribute any of the information you supply in any way it believes appropriate without
incurring any obligation to you.
Information concerning non-IBM products was obtained from the suppliers of those products, their published
announcements or other publicly available sources. IBM has not tested those products and cannot confirm the
accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the
capabilities of non-IBM products should be addressed to the suppliers of those products.
This information contains examples of data and reports used in daily business operations. To illustrate them
as completely as possible, the examples include the names of individuals, companies, brands, and products.
All of these names are fictitious and any similarity to the names and addresses used by an actual business
enterprise is entirely coincidental.
COPYRIGHT LICENSE:
This information contains sample application programs in source language, which illustrate programming
techniques on various operating platforms. You may copy, modify, and distribute these sample programs in
any form without payment to IBM, for the purposes of developing, using, marketing or distributing application
programs conforming to the application programming interface for the operating platform for which the sample
programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore,
cannot guarantee or imply reliability, serviceability, or function of these programs.
The following terms are trademarks of the International Business Machines Corporation in the United States,
other countries, or both:
Redbooks (logo)®, eServer™, AIX 5L™, AIX®, Blue Gene®, DS4000™, DS6000™, DS8000™,
Enterprise Storage Server®, General Parallel File System™, GPFS™, HACMP™, IBM®, POWER™,
POWER3™, POWER4™, POWER5™, Redbooks®, System p™, System p5™, System Storage™,
Tivoli®, TotalStorage®
Oracle, JD Edwards, PeopleSoft, Siebel, and TopLink are registered trademarks of Oracle Corporation and/or
its affiliates.
Snapshot, and the Network Appliance logo are trademarks or registered trademarks of Network Appliance,
Inc. in the U.S. and other countries.
InfiniBand, and the InfiniBand design marks are trademarks and/or service marks of the InfiniBand Trade
Association.
Windows, and the Windows logo are trademarks of Microsoft Corporation in the United States, other
countries, or both.
UNIX is a registered trademark of The Open Group in the United States and other countries.
Linux is a trademark of Linus Torvalds in the United States, other countries, or both.
Other company, product, or service names may be trademarks or service marks of others.
This IBM Redbooks publication will help you architect, install, tailor, and configure Oracle® 10g RAC
on System p™ clusters running AIX®. We describe the architecture and how to design, plan,
and implement a highly available infrastructure for Oracle database using the IBM® General
Parallel File System™ V3.1.
This book gives a broad understanding of how Oracle 10g RAC can use and benefit from the
virtualization facilities embedded in the System p architecture, and how to efficiently use the
tremendous computing power and availability characteristics of the POWER5™ hardware and the
AIX 5L™ operating system.
This book also helps you design and create a solution to migrate your existing Oracle 9i RAC
configurations to Oracle 10g RAC by simplifying configurations and making them easier to
administer and more resilient to failures.
This book also describes how to quickly deploy Oracle 10g RAC test environments, and how
to use some of the built-in disaster recovery capabilities of IBM GPFS™ and storage
subsystems to make your cluster resilient to various failures.
Mustafa Mah is an Advisory Software Engineer working for IBM System and Technology
Group in Poughkeepsie, New York. He currently provides problem determination and
technical assistance in the IBM General Parallel File System (GPFS) to clients on IBM
System p, System x, and Blue Gene® clusters. He previously worked as an application
developer for the IBM System and Technology Group supporting client fulfillment tools. He
holds a Bachelor of Science in Electrical Engineering from the State University of New York in
New Paltz, New York, and a Master of Science in Software Development from Marist College
in Poughkeepsie, New York.
SeongLul Son is a Senior IT specialist working at IBM Korea. He has eleven years of
experience in the IT industry and his expertise includes networking, e-learning, System p
virtualization, HACMP™, and GPFS with Oracle. He has written extensively about GPFS
implementation, database migration, and Oracle in a virtualized environment in this
publication. He also co-authored the AIX 5L Version 5.3 Differences Guide and AIX 5L and
Windows® 2000: Solutions for Interoperability IBM Redbooks® publications in previous
residencies.
Maciej Przepiorka is an IT Architect with the IBM Innovation Center in Poland. His job is to
provide IBM Business Partners and clients with IBM technical consulting and equipment. His
areas of expertise include technologies related to IBM System p servers running AIX,
virtualization, and information management systems, including Oracle databases
(architecture, clustering, RAC, performance tuning, optimization, and problem determination).
He has over 12 years of experience in the IT industry and holds an M.Sc. Eng. degree in
Computer Science from Warsaw University of Technology, Faculty of Electronics and
Information Technology.
Authors: Mustafa (insert), Michel, SeongLul (SL), Harald, Octavian, and Maciej (Mike)
Thanks to the Oracle/IBM Joint Solution Center in Montpellier, France, for reviewing the draft,
and to the following people for their contributions to this project:
Dino Quintero
IBM Poughkeepsie
Andrei Socolic
IBM Romania
Cristian Stanciu
IBM Romania
Rick Piasecki
IBM Austin
Jonggun Shin
Goodus Inc., Korea
Renee Johnson
ITSO Austin
The authors of the previous edition of this book, Deploying Oracle9i RAC on eServer Cluster
1600 with GPFS, SG24-6954, published in October 2003:
Octavian Lascu
Vigil Carastanef
Lifang Li
Michel Passet
Norbert Pistoor
James Wang
Your efforts will help increase product acceptance and customer satisfaction. As a bonus, you
will develop a network of contacts in IBM development labs, and increase your productivity
and marketability.
Find out more about the residency program, browse the residency index, and apply online at:
ibm.com/redbooks/residencies.html
Comments welcome
Your comments are important to us!
We want our books to be as helpful as possible. Send us your comments about this book or
other IBM Redbooks publications in one of the following ways:
Use the online Contact us review IBM Redbooks publication form found at:
ibm.com/redbooks
Send your comments in an e-mail to:
redbooks@us.ibm.com
Mail your comments to:
IBM Corporation, International Technical Support Organization
Dept. HYTD Mail Station P099
2455 South Road
Poughkeepsie, NY 12601-5400
Chapter 1. Introduction
This chapter provides an overview of the infrastructure and clustering technologies that you
can use to deploy a highly available, load balancing database environment using Oracle 10g
RAC and IBM System p, running AIX and IBM General Parallel File System (GPFS). We also
provide information about various other storage management techniques.
In general, clusters1 are used to provide higher performance and availability than a single
computer or application can deliver. They are typically more cost-effective than solutions
based on single computers of similar performance2.
However, in most cases, a clustering solution provides more than one benefit; for example, a
high availability cluster can also provide load balancing for the same application.
Today’s commercial environments require that their applications are available 24x7x365. For
commercial environments, high availability and load balancing are key features for IT
infrastructure. Applications must be able to work with hardware and operating systems to
deliver according to the agreed upon service level.
The idea of having multiple instances access the same physical data files goes back to
Oracle 7 (and actually started in 1988 with Oracle 6 on the Virtual Memory System (VMS) platform).
Oracle 7 was developed to scale horizontally, when a single symmetric multiprocessing (SMP)
server did not provide adequate performance.
1 According to Oxford University Press’ American Dictionary of Current English, a cluster is: “A number of things of
the same sort gathered together or growing together”.
2 Performance calculated using standardized benchmark programs.
Due to the coarse locking granularity, false pinging could occur for blocks that were already clean.
Oracle 8 introduced fine-grained locking, which eliminated false pinging.
In Oracle 8.1, Parallel Server introduced the Cache Fusion mechanism for consistent reads (that
is, exchanging data blocks through an interconnect network to avoid a physical disk I/O read
operation).
Starting with Oracle 9i Real Application Clusters (RAC), consistent read and current read
operations use the cache fusion mechanism.
In Oracle 10g, the basic cluster functionality and the Real Application Clusters (RAC)
database option are split into two products:
The basic cluster functionality is now Oracle Clusterware (10.2 and forward; it was Cluster
Ready Services (CRS) in 10.1).
CRS is now a component of Oracle Clusterware. Most Oracle Clusterware commands
reflect the former name, CRS.
Oracle 10g RAC uses Oracle Clusterware for the infrastructure to bind multiple servers so
that they can operate as a single system.
Oracle Clusterware is a cluster management solution that is integrated with Oracle database.
The Oracle Clusterware is also a required component when using RAC. In addition, Oracle
Clusterware enables both single-instance Oracle databases and RAC databases to use the
Oracle high availability infrastructure.
In the past, Oracle RAC configurations required vendor specific clusterware. With Oracle
Clusterware (10.2), vendor specific clusterware is no longer required. However, Oracle
Clusterware can coexist with vendor clusterware, such as High-Availability Cluster
Multi-Processing (HACMP). The integration between Oracle Clusterware and Oracle
database means that Oracle Clusterware has inherent knowledge of the relationships among
RAC instances, automatic storage management (ASM) instances, and listeners. It knows
the sequence in which to start and stop all of these components.
Oracle Clusterware components
Figure 1-1 shows a diagram of the major functional components that are provided by Oracle
Clusterware.
Figure 1-1 Oracle Clusterware functional components: group membership (topology), process monitor
(watchdog), virtual IP addresses, monitoring and starting/restarting of applications, node halt/reset, and
the IP interconnect
The Oracle Clusterware requires two components from the platform: shared storage and an
IP interconnect. Shared storage is required for voting disks to record node membership
information and for the Oracle Cluster Registry (OCR) for cluster configuration information
(repository).
Oracle Clusterware requires that each node is connected to a dedicated high speed
(preferably low latency) IP network3.
We highly recommend that the interconnect is inaccessible to nodes (systems) that are not
part of the cluster (not managed by Oracle Clusterware).
The Oracle Clusterware shared configuration data is stored in the OCR, which can be a file or a raw
device. There are no strict placement rules for the OCR, such as there are for the Oracle database
pfile/spfile; however, Oracle has to record the location of the OCR disk/file on each cluster node.
In Figure 1-2, the component processes are grouped, and access to the OCR and voting
disks is shown for one node.
The VIP address is handled as a CRS resource, just as other resources, such as a database,
an instance, a listener, and so on. It does not have a dedicated process.
Figure 1-2 Oracle Clusterware component relationship4: evmd (evmd.bin, evmlogger), init.cssd with
ocssd (ocssd.bin), crsd.bin, and the process monitor (oprocd, oclsomon(*)), with access to the OCR and
voting disk
Oracle recommends that you configure redundant network adapters to prevent interconnect
components from being a single point of failure.
3 In certain configurations Oracle may also support InfiniBand® using RDS (Reliable Datagram Socket) protocol.
4 (*) The oclsomon daemon is not mentioned in the 10.2 documentation, but it is running in 10.2.0.3. According to 11g
documentation, oclsomon is monitoring css (to detect if css hangs).
Here are a few examples of what happens in typical situations:
Listener failure
When CRS detects that a registered component, such as the listener, is not responding,
CRS tries to restart this component. By default, CRS tries to restart this component five
times.
Interconnect failure
If the interconnect is lost for one or more nodes (split brain), CSS resolves this failure through
the voting disks. The surviving subcluster is:
– The subcluster with the largest number of nodes, or
– If the subclusters are of equal size, the subcluster that contains the node with the lowest node number
Node malfunction
If the OPROCD process is unable to become active within the expected time, CRS
reboots the node.
RAC is the Oracle database option that provides a single system image for multiple servers to
access one Oracle database. In RAC, each Oracle instance usually runs on a separate
server (OS image).
You can use Oracle 10g RAC for both horizontal scaling (scale out in Oracle terms) and for
high availability where client connections from a malfunctioning node are taken over by the
remaining nodes in RAC.
RAC instances use two processes to ensure that each RAC database instance obtains the
block that it needs to satisfy a query or transaction: the Global Cache Service (GCS) and the
Global Enqueue Service (GES).
The GCS and GES maintain status records for each data file and each cached block using a
Global Cache Directory (GCD). The GCD contents are distributed across all active instances
and are part of the SGA.
An instance is defined as the shared memory (SGA) and the associated background
processes. When running in a RAC, the SGA has an additional member, the Global Cache
Directory (GCD), and an additional background process for the GCS and GES services.
GC mode can be NULL, shared, or exclusive. A NULL mode means that another instance
has this block in exclusive mode. Exclusive mode means the instance has the privilege to
update the block.
The GCS and GES use the private interconnect for exchanging control messages and for
actually exchanging data when performing Cache Fusions. Cache Fusion is a data block
transfer on the interconnect. This type of a transfer occurs when one instance needs access
to a data block that is already cached by another instance, thus avoiding physical I/O. GCS
modes are cached with the blocks, so if an instance needs to update a block for which it has
already been granted exclusive mode, no additional interconnect traffic is required.
The basic concept for updates is that when an instance wants to update a data block, it must
be granted exclusive mode on that block from the GRD, which means that at any given time,
only one instance is able to update a given data block. Therefore, if the interconnect is lost, no
instance can be granted exclusive mode on any block until the cluster recovers interconnect
connectivity between the nodes.
However, in a multinode RAC, if an interconnect network failure on certain nodes results in
subclusters (a split-brain configuration), each subcluster considers itself the survivor.
Oracle Clusterware avoids this scenario by using the voting disks.
GPFS has two major components: a GPFS daemon, running on all cluster nodes, which provides
cluster management, membership, and over-the-network disk access, and a kernel
extension (the file system device driver) that provides file system access to the applications.
GPFS provides cluster topology and membership management based on built-in heartbeat
and quorum decision mechanisms. Also, at the file system level, GPFS provides concurrent
and consistent access using locking mechanisms and a file system descriptor quorum.
Because GPFS is Portable Operating System Interface (POSIX) compliant, most applications
work without modification; however, in certain cases, applications must be recompiled to
fully benefit from the concurrency mechanism provided by GPFS.
In addition to concurrent access, GPFS also provides availability and reliability through
replication and metadata logging, as well as advanced functions, such as information life
cycle management, access control lists, quota management, multi-clustering, and disaster
recovery support. Caching, as well as direct I/O, is supported.
Oracle RAC uses GPFS for concurrent access to Oracle database files. For database
administrators, GPFS is easy to use and manage compared to other concurrent storage
mechanisms (concurrent raw devices, ASM). It provides almost the same performance level
as raw devices. The basic requirement for Oracle 10g RAC is that all disks used by GPFS
(and by Oracle) are directly accessible from all nodes (each node must have a host bus
adapter (HBA) connected to the shared storage and access the same logical unit numbers
(LUNs)).
There are several options available to implement Oracle 10g RAC on Advanced Interactive
eXecutive (AIX) in terms of storage and data file placement. Prior to Oracle 10g, for
configurations running on AIX, only two possibilities existed: GPFS file systems or raw
devices on concurrent logical volume managers (CLVMs). In Oracle 10g Release 1, Oracle
introduced its own disk management layer named Automatic Storage Management (ASM).
GPFS greatly simplifies the installation and administration of Oracle 10g RAC. Because it is a
shared file system, all database files can be placed in one common directory, and database
administrators can use the file system as a typical journaled file system (JFS)/JFS2.
Allocation of new data files or resizing existing files does not require system administrator
intervention. Free space on GPFS is seen as a traditional file system that is easily monitored
by administrators.
Moreover, with GPFS we can keep a single image of Oracle binary files and share them
between all cluster nodes. This single image applies both to Oracle database binaries
(ORACLE_HOME) and Oracle Clusterware binary files. This approach simplifies
maintenance operations, such as applying patch sets and one-off patches, and keeps all sets
of log files and installation media in one common space.
For clients running Oracle Applications (eBusiness Suite) with multiple application tier nodes,
it is also possible and convenient to use GPFS as a shared APPL_TOP file system.
In GPFS Version 2.3, IBM introduced cluster topology services within GPFS. Thus, for a GPFS
configuration, other clustering layers, such as HACMP or RSCT, are no longer required.
You can locate every single Oracle 10g RAC file type (for database and clusterware
products) on the GPFS, which includes:
Clusterware binaries
Clusterware registry files
Clusterware voting files
Database binary files
Database initialization files (init.ora or spfile)
Control files
Data files
Redo log files
Archived redo log files
Flashback recovery area files
You can use the GPFS to store database backups. In that case, you can perform the restore
process from any available cluster node. You can locate other non-database-related files on
the same GPFS as well.
Figure 1-4 shows a diagram of Oracle RAC on GPFS architecture. All files related to Oracle
Clusterware and database are located on the GPFS.
Figure 1-4 Oracle RAC on GPFS architecture (all Oracle Clusterware and database files on GPFS over shared storage)
Although it is possible to place Oracle Clusterware configuration disk data (OCR) and voting
disk (quorum) data on the GPFS, we generally do not recommend it. In case of GPFS
configuration manager node failure, failover time and I/O freeze during its reconfiguration
might be too long for Oracle Clusterware, and nodes might be evicted from the cluster.
Figure 1-5 shows the recommended architecture with CRS devices outside the GPFS.
Note: CRS config and vote devices located on raw physical volumes.
Figure 1-5 Oracle RAC with GPFS: CRS configuration and voting devices on raw physical volumes (shared storage)
We discuss detailed information about GPFS installation and configuration in the following
sections of this book.
Important: The previous list is not exhaustive or up-to-date for all of the versions
supported. When using RAC, always check with both Oracle and the storage manufacturer
for the latest support and compatibility list.
For a complete list of GPFS 3.1 software and hardware requirements, visit GPFS 3.1
documentation and FAQs on the following Web page:
http://publib.boulder.ibm.com/infocenter/clresctr/vxrx/index.jsp?topic=/com.ibm.cluster.gpfs.doc/gpfsbooks.html
Each disk subsystem requires a specific set of device drivers for proper operation while
attached to a host running GPFS.
Note: For the minimum software versions and patches that are required to support Oracle
products on IBM AIX, read Oracle Metalink bulletin 282036.1.
Figure 1-6 Oracle RAC with ASM: Node A and Node B attached to shared storage
With AIX, each LUN has a raw device file in the /dev directory, such as /dev/rhdisk0. For an
ASM environment, this raw device file for a LUN is assigned to the oracle user. An ASM
instance manages these device files. In a RAC cluster, one ASM instance is created per RAC
node.
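For illustration, assigning such a raw device to the oracle user is an ordinary AIX command sequence; a minimal sketch (the device name and the dba group are assumptions, not taken from this configuration):
root@austin1:/> chown oracle:dba /dev/rhdisk20
root@austin1:/> chmod 660 /dev/rhdisk20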
Important: In AIX, for each hdisk, there are two devices created in /dev directory: hdisk
and rhdisk. The hdisk device is a block type device, and rhdisk is a character (sequential)
device. For Oracle Clusterware and database, you must use character devices:
root@austin1:/> ls -l /dev |grep hdisk10
brw------- 1 root system 20, 11 Sep 14 19:35 hdisk10
crw------- 1 root system 20, 11 Sep 14 19:35 rhdisk10
Collections of these disk devices are assigned to ASM to form ASM disk groups. For each
ASM disk group, a level of redundancy is defined, which might be normal (mirrored), high
(three mirrors), or external (no mirroring). When normal or high redundancy is used, disks can
be organized in failure groups to ensure that data and its redundant copy do not both reside
on disks that are likely to fail together.
Figure 1-7 on page 15 shows dependencies between disk devices, failure groups, and disk
groups within ASM. Within disk group ASMDG1, data is mirrored between failure groups one
and two. For performance reasons, ASM implements the Stripe And Mirror Everything
(SAME) strategy across disk groups, so that data is distributed across all disk devices.
Figure 1-7 ASM disk groups: ASMDG1 (normal redundancy, mirrored across failure groups one and two)
and ASMDG2 (external redundancy), built on raw disk devices such as rhdisk20 and rhdisk21
Important: Assigning hdisks that are used by ASM to a volume group, or setting a PVID on them, results
in data corruption.
ASM does not rely on any AIX mechanism to manage disk devices. No PVID, volume
group label, or hardware reservation can be assigned to an hdisk device belonging to an
ASM disk group. AIX reports ASM disks as not belonging to a volume group (unused
disks). This raises a serious security problem.
Example 1-1 shows a result of the lspv command on the AIX server; hdisk2, hdisk3, hdisk4,
hdisk5, and hdisk6 do not have PVID signatures and are not assigned to any volume group.
They look like unused hdisks, but they might also belong to an ASM disk group.
Example 1-1 AIX lspv command result
root@austin1:/> lspv
hdisk0 0022be2ab1cd11ac rootvg active
hdisk1 00cc5d5caa5832e0 None
hdisk2 none None
hdisk3 none None
hdisk4 none None
hdisk5 none None
hdisk6 none None
hdisk7 none nsd_tb1
hdisk8 none nsd_tb2
hdisk9 none nsd_tb3
hdisk10 none nsd01
hdisk11 none nsd02
hdisk12 none nsd03
hdisk13 none nsd04
hdisk14 none nsd05
hdisk15 none nsd06
The same problem exists with Oracle Clusterware disks when they reside outside of the file
system or any LVM. It is not obvious if they are used by Oracle or available.
ASM manages storage only for database files. Oracle binaries, OCR, and voting disks cannot
be located on ASM disk groups. If shared binaries are desired, you must use a clustered file
system, such as GPFS.
Detailed ASM installation and configuration on the AIX operating system is covered in
CookBook V2 - Oracle RAC 10g Release 2 with ASM on IBM System p running AIX V5
(5.2/5.3) on SAN Storage by Oracle/IBM Joint Solutions Center at:
http://www.oracleracsig.com/
Note: For the minimum software versions and patches that are required to support Oracle
products on IBM AIX, check the Oracle Metalink bulletin 282036.1.
In Oracle 9i RAC, HACMP is used to provide cluster topology services and shared disk
access, and to maintain high availability of the interconnect network for the Oracle instances.
The major drawback of this approach is that administrators have to maintain two
clusterware products within the same environment, while most of the HACMP core functionality,
which provides service high availability and failover, is not used at all.
HACMP provides Oracle 10g RAC with the infrastructure for concurrent access to disks.
Although HACMP provides concurrent access and a disk locking mechanism, this
mechanism is only used to open the files (raw devices) and for managing hardware disk
reservation. Oracle database, instead, provides its own data block locking mechanism for
concurrent data access, integrity, and consistency.
Volume groups are varied on (activated) on all the nodes (under the control of RSCT), thus ensuring a
short failover time in case one node loses its disk or network connection. This type of concurrent
access can only be provided for raw logical volumes (devices).
Oracle datafiles use the raw devices located on the shared disk subsystem. In this
configuration, you must define an HACMP resource group to handle the concurrent volume
groups.
There are two options when using HACMP and CLVM with Oracle RAC. Oracle Clusterware
devices are located on concurrent (raw) logical volumes provided by HACMP (Figure 1-8) or
on separate physical disk devices or LUNs. You must start HACMP services on all nodes
before Oracle Clusterware services are activated.
Note: CRS devices are located on concurrent logical volumes provided by HACMP and
CLVM.
Figure 1-8 Oracle RAC with CRS devices on concurrent logical volumes provided by HACMP and CLVM (shared storage)
When using physical raw volumes (Figure 1-9), Oracle Clusterware and HACMP are not
dependent on each other; however, both products have to be up and running before the
database startup.
Note: CRS devices are located on raw physical volumes. CRS does not make use of any
extended HACMP functionality.
Figure 1-9 Oracle RAC with CRS devices on raw physical volumes: Node A and Node B each run
HACMP/CLVM, Oracle CRS, and Oracle RAC, and attach to the shared storage
For both sample scenarios, if HACMP is configured before Oracle, CRS uses HACMP node
names and numbers.
The drawback of this configuration option stems from the fairly complex administrative tasks,
such as maintaining datafiles, Oracle code, and backup and restore operations.
Detailed Oracle RAC installation and configuration on AIX operating system is covered in
CookBook V1 - Oracle RAC 10g Release 2 on IBM System p running AIX V5 with SAN
Storage by Oracle/IBM Joint Solutions Center, January 2006, at:
http://www.oracleracsig.com/
Note: For the minimum software versions and the patches that are required to support
Oracle products on IBM AIX, read Oracle Metalink bulletin 282036.1.
The diagram in Figure 2-1 shows the test environment that we use for this scenario.
Figure 2-1 Test environment: nodes austin1 (public IP 192.168.100.31, interconnect IP 10.1.100.31, VIP
austin1_vip) and austin2 (public IP 192.168.100.32, interconnect IP 10.1.100.32, VIP austin2_vip), each
with adapters ent2 and ent3 and a local rootvg on hdisk0, attached to shared DS4800 storage
Nodes
We implement a configuration consisting of two nodes (logical partitions (LPARs)) in two IBM
System p5™ p570s. Each LPAR has four processors and 16 GB of random access memory
(RAM).
Networks
Each node is connected to two networks:
We use one “private” network for the RAC interconnect and GPFS metadata traffic, configured
as an Etherchannel with two Ethernet interfaces on each node. The communication protocol is
IP.
One public network (Ethernet, IP)
Storage
The storage (DS4800) connects to a SAN switch (2109-F32) via two 2 Gb Fibre Channel
paths. Each node has one 2 Gb 64-bit PCI-X Fibre Channel (FC) adapter. Figure 2-1 shows
an overview of the configuration that was used in our environment.
You must prepare the host operating system before installing and configuring Oracle
Clusterware. In addition to OS prerequisites (software packages), Oracle Clusterware
requires:
Configuring network IP addresses
Name resolution
Enabling remote command execution
Oracle user and group
The following filesets are required on both nodes (the same set on each node):
bos.adt.base
bos.adt.lib
bos.adt.libm
bos.perf.libperfstat
bos.perf.perfstat
bos.perf.proctools
rsct.basic.rte
rsct.compat.clients.rte
xlC.aix50.rte 7.0.0.4 or 8.xxx
xlC.rte 7.0.0.1 or 8.xxx
bos.adt.prof (a)
bos.cifs_fs (a)
a. See the following information in the shaded Tip box.
Tip: If bos.adt.prof and bos.cifs_fs filesets are missing, the Oracle installation verification
utility complains about this during CRS installation. However, these files are not required
for Oracle, and this error message can be ignored at this point. See Oracle Metalink doc
ID: 340617.1 at:
http://metalink.oracle.com
For better availability, we recommend that you set up separate Etherchannel interfaces for
Oracle interconnect and GPFS. However, it is possible to use the same Etherchannel
interface for both Oracle interconnect and GPFS metadata traffic. For more information, refer
to 2.5, “Networking considerations” on page 76.
Example 2-1 shows the list of Ethernet interfaces (ent0 and ent1) that we use to set up the
Etherchannel interface.
Example 2-3 on page 23 shows RAC and GPFS interconnect, ent0, and ent1 interfaces.
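As a quick check before building the Etherchannel, you can list the physical Ethernet adapters on each node; the following is only a sketch of the idea:
root@austin1:/> lsdev -Cc adapter | grep ent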
Important: Make sure that the network interfaces’ names and numbers (for example, en2
and en3 in our case) are identical on all nodes that are part of the RAC cluster. This
consistency is an Oracle RAC requirement.
We decided to use the same Etherchannel interface for Oracle Clusterware, Oracle RAC, and
GPFS interconnect. This configuration is possible, because GPFS does not use significant
communication bandwidth.
Example 2-4 on page 24 shows the system management interface tool (SMIT) window
through which we choose the Etherchannel interface parameters. We use the round_robin
load balancing mode and default values for all other fields. To see details about configuring
Etherchannel, refer to Appendix A, “EtherChannel parameters on AIX” on page 255.
[Entry Fields]
EtherChannel / Link Aggregation Adapters ent0,ent1 +
Enable Alternate Address no +
Alternate Address [] +
Enable Gigabit Ethernet Jumbo Frames no +
Mode round_robin +
Hash Mode default +
Backup Adapter +
Automatically Recover to Main Channel yes +
Perform Lossless Failover After Ping Failure yes +
Internet Address to Ping []
Number of Retries [] +#
Retry Timeout (sec) [] +#
Next, we configure the IP address for the Etherchannel interface using the SMIT fastpath
smitty chinet. We select the previously created interface from the list (en3 in our case) and
fill in the required fields, as shown in Example 2-5 on page 25.
[Entry Fields]
Network Interface Name en3
INTERNET ADDRESS (dotted decimal) [10.1.100.31]
Network MASK (hexadecimal or dotted decimal) [255.255.255.0]
Current STATE up +
Use Address Resolution Protocol (ARP)? yes +
BROADCAST ADDRESS (dotted decimal) []
Interface Specific Network Options
('NULL' will unset the option)
rfc1323 []
tcp_mssdflt []
tcp_nodelay []
tcp_recvspace []
tcp_sendspace []
Apply change to DATABASE only no +
Note: The user and group id must be the same on both nodes.
Use the SMIT commands, smitty mkuser and smitty mkgroup, to create the user and the
group. We use the command line, as shown in Example 2-6.
Optionally, you can create the oinstall group. This group is the Oracle inventory group. If this
group exists, it owns the Oracle code files. This group is a secondary group for the oracle
user (besides the staff group).
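The following is a minimal command-line sketch of the group and user creation (the numeric IDs and home directory are assumptions; use identical values on both nodes):
root@austin1:/> mkgroup id=500 dba
root@austin1:/> mkgroup id=501 oinstall
root@austin1:/> mkuser id=500 pgrp=staff groups=staff,dba,oinstall home=/home/oracle oracle
root@austin1:/> passwd oracle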
Note: The AIXTHREAD_SCOPE environment variable controls whether a process runs with
process-wide contention scope (the default) or with system-wide contention scope. When
using system-wide contention scope, there is a one-to-one mapping between the user
thread and a kernel thread.
This mechanism operates most efficiently with Oracle applications when using
system-wide thread contention scope (AIXTHREAD_SCOPE=S). In addition, as of AIX
V5.2, system-wide thread contention scope also significantly reduces the amount of
memory that is required for each Oracle process. For these reasons, we recommend that you
always export AIXTHREAD_SCOPE=S before starting Oracle processes.
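For example, a simple way to make this setting persistent for the oracle user (the profile location is an assumption):
root@austin1:/> echo "export AIXTHREAD_SCOPE=S" >> /home/oracle/.profile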
Example 2-8 shows the sample /etc/hosts file that we use for our environment. The IP labels
and addresses in this scenario are in bold characters.
# Public network
192.168.100.31 austin1
192.168.100.32 austin2
192.168.100.251 ds4800_c1
192.168.100.252 ds4800_c2
192.168.100.231 hmc_p5
192.168.100.232 hmc_p6
You can use either ssh or the standard remote shell (rsh). If ssh is already configured, Oracle
automatically uses ssh for remote execution. Otherwise, rsh is used. To keep it simple, we
use rsh in our test environment.
Important: Oracle remote command execution fails if there are any intermediate
messages (including banners) during the authentication phase. For example, if you are
using rsh with two authentication methods (kerberos and system), and kerberos
authentication fails, even though the system authentication works correctly, the
intermediate kerberos failing message received by Oracle will result in Oracle remote
command execution failure.
GPFS also requires remote command execution without user interaction between cluster
nodes (as root). GPFS also supports using ssh or rsh. You can specify the remote command
execution when creating the GPFS cluster (the mmcrcluster command).
For rsh, rcp, and rlogin, you must set up user equivalence for the oracle and root accounts.
We set up equivalence by editing the /etc/hosts.equiv file on each cluster node and also the
$HOME/.rhosts files in the root and oracle home directories, as shown in Example 2-9.
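A minimal sketch of the entries (which host names to list depends on the interfaces that Oracle and GPFS actually use; the names below are an assumption based on our environment). The same list goes into /etc/hosts.equiv and into the .rhosts files of the root and oracle users on each node:
austin1
austin2
austin1_interconnect
austin2_interconnect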
Change the parameter Maximum number of PROCESSES allowed per user to 2048 or
greater:
root@austin1:/> chdev -l sys0 -a maxuproc=2048
Also, Oracle recommends that you configure the user file, CPU, data, and stack limits in
/etc/security/limits as shown in Example 2-10.
oracle:
fsize = -1
cpu = -1
data = -1
stack = -1
Table 2-2 on page 30 shows the minimum recommended values of the TCP/IP stack parameters for an
Oracle installation. For production database systems, Oracle recommends that you tune
these values to optimize system performance.
Refer to your operating system documentation for more information about tuning TCP/IP
parameters.
ipqmaxlen 512
rfc1323 1
sb_max 1310720
tcp_recvspace 65536
tcp_sendspace 65536
udp_recvspace 655360 (a)
udp_sendspace 65536 (b)
a. The recommended value of this parameter is 10 times the value of the udp_sendspace
parameter. The value must be less than the value of the sb_max parameter.
b. This value is suitable for a default database installation. For production databases, the
minimum value for this parameter is 4 KB plus the value of the database DB_BLOCK_SIZE
initialization parameter multiplied by the value of the DB_MULTIBLOCK_READ_COUNT
initialization parameter: (DB_BLOCK_SIZE * DB_MULTIBLOCK_READ_COUNT) + 4 KB
Note: Certain parameters are set at interface (en*) level (check with lsattr -El en*).
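As an illustration, the system-wide values can be set with the no command (on AIX 5.3, the -p flag makes a change persistent across reboots; ipqmaxlen is a reboot tunable and is set with -r), and the interface-level options can be checked with lsattr:
root@austin1:/> no -p -o rfc1323=1 -o sb_max=1310720
root@austin1:/> no -p -o tcp_recvspace=65536 -o tcp_sendspace=65536
root@austin1:/> no -p -o udp_recvspace=655360 -o udp_sendspace=65536
root@austin1:/> no -r -o ipqmaxlen=512
root@austin1:/> lsattr -El en3 | egrep "rfc1323|tcp_recvspace|tcp_sendspace"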
In our configuration, we use two GPFSs, one for Oracle data files and the other for Oracle
binary files. For the Oracle Cluster Registry (OCR) and CRS voting disks that are required for
Oracle Clusterware installation, we use raw devices (disks).
Note: We chose this configuration to avoid a situation where GPFS and Oracle
Clusterware interfere with each other during the node recovery process.
Installing GPFS
We use GPFS V3.1 for our test environment. We have installed the filesets and verified the
packages by using the lslpp command on each node as shown in Example 2-11.
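For example, a quick check on each node looks like the following sketch; you can expect to see the gpfs.base, gpfs.msg.en_US, and gpfs.docs.data filesets (levels vary with the installed PTF level):
root@austin1:/> lslpp -l "gpfs.*"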
When creating the GPFS cluster, you must provide a file containing a list of node descriptors,
one per line, for each node to be included in the cluster, as shown in Example 2-13. Because
this is a two node configuration, both nodes are quorum and manager nodes.
Node roles
The node roles are:
quorum | nonquorum
This designation specifies whether or not the node is included in the
pool of nodes from which quorum is derived. The default is
nonquorum. You must designate at least one node as a quorum node.
manager | client Indicates whether a node is part of the node pool from which
configuration managers, file system managers, and the token
manager can be selected. The special functions of the file system
manager consume extra CPU time.
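A sketch of such a node descriptor file for this two-node cluster (the file name is an assumption; the descriptor format is NodeName:NodeDesignations):
root@austin1:/etc/gpfs_config> cat gpfs_nodes
austin1_interconnect:quorum-manager
austin2_interconnect:quorum-manager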
Prepare each physical disk for GPFS Network Shared Disks (NSDs)1 using the mmcrnsd
command, as shown in Example 2-14 on page 32. You can create NSDs on physical disks
(hdisk or vpath devices in AIX).
In our testing environment, because both nodes are directly attached to the storage, we do not
assign an NSD server in the disk description file. However, you must still create the NSDs,
because they are required to create a file system (unless you are using VSDs), regardless of
whether you use NSD servers.
1 Network Shared Disk is a concept that represents the way that the GPFS file system device driver accesses a raw
disk device regardless of whether the disk is locally attached (SAN or SCSI) or is attached to another GPFS node
(via network).
Tip: If you use many small files and the file system metadata is dynamic, separating data
and metadata improves performance. However, if large files are mostly used, and there is
little metadata activity, separating data from metadata does not improve performance.
Tip: We recommend that you define DesiredName, because you can use meaningful
names that make system administration easier (see Example 2-21, nsd_tb1 is used as a
tiebreaker, because of the “tb” in the suffix). If a desired name is not specified, the NSD is
assigned a name according to the convention: gpfsNNnsd where NN is a unique
nonnegative integer (for example, gpfs01nsd, gpfs02nsd, and so on).
StoragePool StoragePool specifies the name of the storage pool to which the NSD
is assigned (if desired). If this name is not provided, the default is
system. Only the system pool can contain metadataOnly,
dataAndMetadata, or descOnly disks.
Example 2-15 shows the disk description file for tiebreaker disks.
Note: The disk descriptor file shown in Example 2-15 does not specify the diskUsage,
because we use these NSDs for the cluster quorum (tiebreakers), and they will not be part of
any file systems.
Example 2-15 Sample disk description file used for creating tiebreaker NSDs
root@austin1:/etc/gpfs_config> cat gpfs_disks_tb
hdisk7:::::nsd_tb1
hdisk8:::::nsd_tb2
hdisk9:::::nsd_tb3
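A command sketch for turning these disks into NSDs with the descriptor file shown above (-v yes, the default, verifies that the disks are not already in use by another file system):
root@austin1:/etc/gpfs_config> mmcrnsd -F /etc/gpfs_config/gpfs_disks_tb -v yes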
Cluster quorum
GPFS cluster quorum must be maintained for the GPFS file systems to remain available. If
the quorum semantics are broken, GPFS performs the recovery in an attempt to achieve
quorum again. GPFS can use one of two methods for determining quorum:
Node quorum
Node quorum with tiebreaker disks
Table 2-3 explains the difference between node quorum and node quorum with
tiebreakerDisks.
Table 2-3 Difference between node quorum and node quorum with tiebreaker disks
Node quorum:
– Quorum is defined as one plus half of the explicitly defined quorum nodes in the GPFS cluster.
– There are no default quorum nodes; you must specify which nodes have this role.
– GPFS does not limit the number of quorum nodes.
Node quorum with tiebreaker disks (a):
– There is a maximum of eight quorum nodes.
– You must include the primary and secondary cluster configuration servers as quorum nodes.
– You can have an unlimited number of non-quorum nodes.
a. See the following tip.
Most Oracle RAC with GPFS configurations are two node clusters; therefore, you must set
up node quorum with tiebreaker disks. A GPFS cluster can survive and maintain file
systems available with one quorum node and one available tiebreaker disk in this
configuration. You can have one, two, or three tiebreaker disks. However, we recommend
that you use an odd number of tiebreaker disks (three).
Configuring GPFS
Before you start setting up a GPFS cluster, verify that the remote command execution is
working properly between all GPFS nodes via the interfaces that are used for GPFS
metadata traffic (austin1_interconnect and austin2_interconnect).
The following list gives a short explanation for the mmcrcluster command parameters shown
in Example 2-16.
-N NodeFile NodeFile specifies the file containing the list of node descriptors (see
Example 2-13 on page 31), one per line, to be included in the GPFS
cluster.
-p PrimaryServer PrimaryServer specifies the primary GPFS cluster configuration server
node used to store the GPFS configuration data.
-s SecondaryServer SecondaryServer specifies the secondary GPFS cluster configuration
server node used to store the GPFS cluster data. We suggest that you
specify a secondary GPFS cluster configuration server to prevent the
loss of configuration data in the event that your primary GPFS cluster
configuration server goes down. When the GPFS daemon starts up, at
least one of the two GPFS cluster configuration servers must be
accessible.
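Assuming the node descriptor file sketched earlier and rsh/rcp for remote command execution, the cluster creation command can look like the following (the file path and cluster name are assumptions):
root@austin1:/etc/gpfs_config> mmcrcluster -N /etc/gpfs_config/gpfs_nodes -p austin1_interconnect \
-s austin2_interconnect -r /usr/bin/rsh -R /usr/bin/rcp -C austin_cluster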
To check the current configuration information for the GPFS cluster, use the mmlscluster
command (see Example 2-17).
root@austin1:/etc/gpfs_config> mmlscluster
Use the mmlsnsd command to display the current NSD information, as shown in
Example 2-19.
Upon successful completion of the mmcrnsd command, the disk descriptor files are rewritten to
contain the created NSD names in place of the device name, as shown in Example 2-20. This
is done to prepare the disk descriptor files for subsequent usage for creating GPFS file
systems (mmcrfs or mmadddisk commands).
Now all of the NSDs are defined, and you can see the mapping of physical disks to GPFS
NSDs using the command shown in Example 2-21 on page 37.
root@austin1:/> mmlsnsd -a -m
root@austin1:/> lspv
hdisk0 0022be2ab1cd11ac rootvg active
hdisk1 00cc5d5caa5832e0 None
hdisk2 none None
hdisk3 none None
hdisk4 none None
hdisk5 none None
hdisk6 none None
hdisk7 none nsd_tb1
hdisk8 none nsd_tb2
hdisk9 none nsd_tb3
hdisk10 none nsd01
hdisk11 none nsd02
hdisk12 none nsd03
hdisk13 none nsd04
hdisk14 none nsd05
hdisk15 none nsd06
Configure the tiebreakerDisks by using the mmchconfig command as shown in Example 2-22
on page 38. Then, run the mmlsconfig command to check if they are on the list.
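A sketch of the commands (note that the GPFS daemon must be stopped on all nodes before the tiebreakerDisks attribute can be changed):
root@austin1:/etc/gpfs_config> mmshutdown -a
root@austin1:/etc/gpfs_config> mmchconfig tiebreakerDisks="nsd_tb1;nsd_tb2;nsd_tb3"
root@austin1:/etc/gpfs_config> mmstartup -a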
root@austin1:/etc/gpfs_config> mmlsconfig
Configuration data for cluster austin_cluster.austin1_interconnect:
-------------------------------------------------------------------
clusterName austin_cluster.austin1_interconnect
clusterId 720967500852369612
clusterType lc
autoload yes
useDiskLease yes
maxFeatureLevelAllowed 906
tiebreakerDisks nsd_tb1;nsd_tb2;nsd_tb3
[austin1_interconnect]
takeOverSdrServ yes
To create the file systems, run the mmcrfs command. We created two file systems: /orabin
for the Oracle binaries (Example 2-24 on page 39) and /oradata for the Oracle database files
(Example 2-25 on page 39).
GPFS: 6027-531 The following disks of orabin will be formatted on node austin1:
nsd05: size 10485760 KB
nsd06: size 10485760 KB
GPFS: 6027-540 Formatting file system ...
GPFS: 6027-535 Disks up to size 25 GB can be added to storage pool 'system'.
Creating Inode File
Creating Allocation Maps
Clearing Inode Allocation Map
Clearing Block Allocation Map
GPFS: 6027-572 Completed creation of file system /dev/orabin.
mmcrfs: 6027-1371 Propagating the cluster configuration data to all
affected nodes. This is an asynchronous process.
Tip: We recommend that you have 50000 inodes for a file system that is used for Oracle
binaries if you plan to install Oracle Clusterware and the database in this file system. The
mmcrfs -N option is for the maximum number of files in the file system. This value defaults
to the size of the file system divided by 1M. Therefore, we intentionally used mmcrfs -N
50000 for the /orabin file system.
GPFS: 6027-531 The following disks of oradata will be formatted on node austin2:
nsd01: size 10485760 KB
nsd02: size 10485760 KB
nsd03: size 10485760 KB
nsd04: size 10485760 KB
GPFS: 6027-540 Formatting file system ...
GPFS: 6027-535 Disks up to size 51 GB can be added to storage pool 'system'.
Creating Inode File
Creating Allocation Maps
Clearing Inode Allocation Map
Clearing Block Allocation Map
GPFS: 6027-572 Completed creation of file system /dev/oradata.
mmcrfs: 6027-1371 Propagating the cluster configuration data to all
affected nodes. This is an asynchronous process.
The following list gives a short explanation for the mmcrfs command parameters shown in
Example 2-25:
/oradata This parameter is the mount point directory of the GPFS.
oradata This parameter is the name of the file system to be created, as it
will appear in the /dev directory. File system names do not need to be
fully qualified; oradata is as acceptable as /dev/oradata. However, file
system names must be unique within a GPFS cluster. Do not specify
an existing entry in /dev.
Tip: In an Oracle with GPFS environment, we generally recommend a GPFS block size of
512 KB. Using 256 KB block size is recommended when there is significant file activity
other than Oracle, or there are many small files not belonging to the database. A block size
of 1 MB is recommended for file systems of 100 TB or larger. See Oracle Metalink doc ID:
302806.1 at:
http://metalink.oracle.com
-M MaxMetadataReplicas
This parameter is the default maximum number of copies of inodes,
directories, and indirect blocks for a file. Valid values are 1 and 2 but
cannot be less than DefaultMetadataReplicas. The default is 1.
-m DefaultMetadataReplicas
This parameter is the default number of copies of inodes, directories,
and indirect blocks for a file. Valid values are 1 and 2 but cannot be
greater than the value of MaxMetadataReplicas. The default is 1.
-R MaxDataReplicas This parameter is the default maximum number of copies of data
blocks for a file. Valid values are 1 and 2 but cannot be less than
DefaultDataReplicas. The default is 1.
-r DefaultDataReplicas
This parameter is the default number of copies of each data block for a
file. Valid values are 1 and 2 but cannot be greater than
MaxDataReplicas. The default is 1.
-n NumNodes This parameter is the estimated number of nodes that will mount the file
system. This value is used as a best estimate for the initial size of
several file system data structures. The default is 32. When you create
a GPFS file system, you might want to overestimate the number of
nodes that mount the file system. GPFS uses this information for
creating data structures that are essential for achieving maximum
parallelism in file system operations. Although a large estimate
consumes additional memory, underestimating the data structure
allocation can reduce the efficiency of a node when it processes
parallel requests, such as the allotment of disk space to a file. If you
cannot predict the number of nodes that will mount the file system, apply
the default value. If you are planning to add nodes to your system,
specify a number larger than the default. However, do not make
estimates that are unrealistic. Specifying an excessive number of
nodes can have an adverse effect on buffer operations.
-N NumInodes This parameter is the maximum number of files in the file system. This
value defaults to the size of the file system at creation, divided by 1 M,
and can be specified with a suffix, for example 8 K or 2 M. This value
is also constrained by the formula:
maximum number of files = (total file system space/2) / (inode size + subblock size)
Tip: For file systems that will perform parallel file creates, if the total number of free inodes
is not greater than 5% of the total number of inodes, there is the potential for slowdown in
file system access. Take this into consideration when changing your file system.
-v {yes | no} Verify that specified disks do not belong to an existing file system. The
default is -v yes. Specify -v no only when you want to reuse disks that
are no longer needed for an existing file system. If the command is
interrupted for any reason, you must use the -v no option on the next
invocation of the command.
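To illustrate how these parameters combine, the /oradata file system could be created with a command similar to the following sketch (the descriptor file name is an assumption; block size, replication settings, and node count follow the discussion above):
root@austin1:/etc/gpfs_config> mmcrfs /oradata /dev/oradata -F /etc/gpfs_config/gpfs_disks_data \
-A yes -B 512K -n 2 -m 1 -M 2 -r 1 -R 2 -v yes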
Example 2-26 shows the file system information. You can see block size, maximum number of
inodes, number of replicas, and so on.
You can check the mounted file systems using the mmlsmount (GPFS) and mount (system)
commands, as shown in Example 2-28 on page 43.
root@austin1:/> mount
node mounted mounted over vfs date options
-------- --------------- --------------- ------ ------------ ---------------
/dev/hd4 / jfs2 Oct 03 11:10 rw,log=/dev/hd8
/dev/hd2 /usr jfs2 Oct 03 11:10 rw,log=/dev/hd8
/dev/hd9var /var jfs2 Oct 03 11:10 rw,log=/dev/hd8
/dev/hd3 /tmp jfs2 Oct 03 11:10 rw,log=/dev/hd8
/dev/hd1 /home jfs2 Oct 03 11:11 rw,log=/dev/hd8
/proc /proc procfs Oct 03 11:11 rw
/dev/hd10opt /opt jfs2 Oct 03 11:11 rw,log=/dev/hd8
/dev/fslv00 /oracle jfs2 Oct 03 11:11 rw,log=/dev/hd8
/dev/orabin /orabin mmfs Oct 03 11:13
rw,mtime,atime,dev=orabin
/dev/oradata /oradata mmfs Oct 03 11:13
rw,mtime,atime,dev=oradata
To check the available space in a GPFS file system, use the mmdf command, as shown in
Example 2-29. The system df command can display inaccurate information about GPFS file
systems; thus, we recommend using the mmdf command. This command displays information,
such as free blocks, that is presented by failure group and storage pool.
(total)            20971520            20893696 (100%)        1376 ( 0%)
Tip:
prefetchThreads is for large sequential file I/O, whereas worker1Threads is for random,
small file I/O.
worker1Threads is primarily used for random read or write requests that cannot be
prefetched, random I/O requests, or small file activity. worker1Threads controls the
maximum number of concurrent file operations at any one instant. If there are more
requests than that, the excess will wait until a previous request has finished (default: 48,
maximum: 548).
These changes through the mmchconfig command take effect upon restart of the GPFS
daemon.
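For example (the values are illustrative only and must be sized for your workload), the attributes are changed with mmchconfig and the GPFS daemon is then restarted:
root@austin1:/> mmchconfig prefetchThreads=72,worker1Threads=150
root@austin1:/> mmshutdown -a
root@austin1:/> mmstartup -a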
The number of AIX AIO kprocs to create is approximately the same as the GPFS
worker1Threads setting:
– The AIX AIO maxservers setting is the number of kprocs PER CPU. We suggest setting
this value slightly larger than worker1Threads divided by the number of CPUs.
– Set the Oracle read-ahead value to prefetch one or two full GPFS blocks. For example,
if your GPFS block size is 512 KB and the Oracle block size is 16 KB, set the Oracle
multiblock read count to either 32 or 64 blocks.
Do not use the dio option on the mount command, because using the dio option forces
DIO when accessing all files. Oracle automatically uses DIO to open database files on
GPFS.
When running Oracle RAC 10g R1, we suggest that you increase the value for
OPROCD_DEFAULT_MARGIN to at least 500 to avoid possible random reboots of
nodes.
Note: The Oracle Clusterware I/O fencing daemon has its margin defined in two places
in the /etc/init.cssd, and the values are 500 and 100 respectively. Because it is defined
twice in the same file, the latter value of 100 is used; thus, we recommend that you
remove the second (100) value.
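Before editing the file, you can check where and how often the value is defined; a simple sketch (the exact matches differ by Clusterware release):
root@austin1:/> grep -n OPROCD_DEFAULT_MARGIN /etc/init.cssd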
From a GPFS perspective, even 500 milliseconds might be too low in situations where
node failover can take up to one minute or two minutes to resolve. However, if during node
failure, the surviving node is already performing direct IO to the oprocd control file, the
surviving node has the necessary tokens and indirect block cached and therefore does not
have to wait during failover.
Oracle databases requiring high performance usually benefit from running with a pinned
Oracle SGA, which is also true when running with GPFS, because GPFS uses DIO, which
requires that the user I/O buffers (in the SGA) are pinned. GPFS normally pins the I/O
buffers only for the duration of the I/O operation.
We use raw disks for OCR and voting disk for Oracle Clusterware. During the Oracle
Clusterware installation, you are prompted to provide two OCR disks and three CRS voting
(vote) disks. Even though it is possible to install the Oracle Clusterware with one OCR disk
and one voting (vote) disk, we encourage you to have multiple OCR disks and voting (vote)
disks for availability. In Example 2-31, we select rhdisk2 and rhdisk3 for OCR disks. We use
rhdisk4, rhdisk5, and rhdisk6 as voting (vote) disks.
Example 2-31 Selecting raw physical disks for OCR and CRS voting (vote) disks
root@austin1:/> ls -l /dev/rhdisk*
crw------- 1 root system 20, 3 Sep 14 17:45 /dev/rhdisk2
crw------- 1 root system 20, 4 Sep 14 17:45 /dev/rhdisk3
crw------- 1 root system 20, 5 Sep 14 17:45 /dev/rhdisk4
crw------- 1 root system 20, 6 Sep 14 17:45 /dev/rhdisk5
crw------- 1 root system 20, 7 Sep 14 17:45 /dev/rhdisk6
Creating special device files for OCR and CRS voting disks
We create special files (using the mknod command) for OCR and voting (vote) disks, as shown
in Example 2-32 on page 47. Then, change ownership and permission for those files. You
must run these commands on both nodes:
mknod SpecialFileName { b | c } Major# Minor#
b indicates the special file is a block-oriented device.
c indicates the special file is a character-oriented device.
Example 2-32 Creating special files for ocr and vote disks
root@austin1:/> mknod /dev/ocrdisk1 c 20 3
root@austin1:/> mknod /dev/ocrdisk2 c 20 4
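A sketch of the remaining commands for the voting (vote) disks and of the ownership and permission changes (minor numbers follow Example 2-31; device names, owner, and group are illustrative and must match your installation):
root@austin1:/> mknod /dev/votedisk1 c 20 5
root@austin1:/> mknod /dev/votedisk2 c 20 6
root@austin1:/> mknod /dev/votedisk3 c 20 7
root@austin1:/> chown root.dba /dev/ocrdisk1 /dev/ocrdisk2
root@austin1:/> chmod 640 /dev/ocrdisk1 /dev/ocrdisk2
root@austin1:/> chown oracle.dba /dev/votedisk1 /dev/votedisk2 /dev/votedisk3
root@austin1:/> chmod 644 /dev/votedisk1 /dev/votedisk2 /dev/votedisk3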
Verify and change the reservation_policy to no_reserve on the disks (rhdisk2, rhdisk3,
rhdisk4, rhdisk5, and rhdisk6 on both nodes) that are used for OCR and CRS voting (vote)
disks as shown in Example 2-33. Run these commands on both nodes.
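A sketch of verifying and changing the policy on one disk (repeat for hdisk2 through hdisk6 on both nodes; the attribute name can vary with the disk driver, reserve_policy being the usual one for MPIO disks):
root@austin1:/> lsattr -El hdisk2 -a reserve_policy
root@austin1:/> chdev -l hdisk2 -a reserve_policy=no_reserve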
root@austin1:/> df -g
Filesystem GB blocks Free %Used Iused %Iused Mounted on
/dev/hd2 2.25 0.10 96% 41949 62% /usr
/dev/hd9var 0.06 0.05 28% 494 5% /var
/dev/hd3 1.06 1.03 4% 57 1% /tmp
/dev/hd1 0.50 0.49 2% 76 1% /home
root@austin2:/> /usr/sbin/slibclean
root@austin2:/> df -g
Filesystem GB blocks Free %Used Iused %Iused Mounted on
/dev/hd2 2.31 0.16 94% 41921 51% /usr
/dev/hd9var 0.06 0.04 31% 494 5% /var
/dev/hd3 1.06 0.78 27% 833 1% /tmp
/dev/hd1 0.44 0.44 1% 28 1% /home
Important: Oracle Clusterware uses the interface name (en3, for example) to define the
interconnect network to be used. It is mandatory that this interface name is the same on
all the nodes in the cluster. You have to enforce this requirement prior to installing Oracle
Clusterware.
You need a graphical user interface (GUI) to run the Oracle Universal Installer (OUI). Export
DISPLAY to the appropriate value, change to the directory that contains the Oracle
install packages, and run the installer as the oracle user, as shown in Figure 2-2 on page 50.
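A sketch of this sequence (the display address and staging directory are illustrative):
{austin1:oracle}/home/oracle -> export DISPLAY=10.1.1.50:0.0
{austin1:oracle}/home/oracle -> cd /orabin/clusterware/Disk1
{austin1:oracle}/orabin/clusterware/Disk1 -> ./runInstaller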
You are asked whether rootpre.sh has been run, as shown in Figure 2-3 on page 51. Make sure to
execute Disk1/rootpre/rootpre.sh as the root user on each node. After running rootpre.sh on
both nodes, type <y> to proceed to the next step.
Note: If you have the Oracle code on a CD-ROM mounted on one of the nodes, you need
to NFS export the CD-ROM mount directory and mount it on the other node. You can
also remote copy the files to the other node, then run rootpre.sh on both nodes.
However, because in our test environment the CRS Disk1 (code) is on a GPFS shared
file system, there is no need to make an NFS mount or copy files to the other node.
5. Repeat for the second node and check the cluster information as shown in Figure 2-8.
Important: Oracle Clusterware uses the interface name (en3 in our example) to define the
interconnect network to use. It is mandatory that this interface name is the same on all
the nodes in the cluster. You must enforce this requirement prior to installing Oracle
Clusterware.
Note: At the final stage of executing root.sh on the second node, the VIP Configuration
Assistant starts automatically. However, if you get an error message stating that “The given
interface(s), ‘en2’ is not public”, use public interfaces to configure the VIPs: run
$ORACLE_CRS_HOME/bin/vipca as root in a GUI environment.
The reason for this error message: when verifying the IP addresses, VIPCA checks whether
each address is routable. In this case, it finds that the addresses are non-routable (for
example, addresses such as 192.168.* and 10.10.*). Oracle is aware that these addresses can
be made public, but because they are mostly used for private networks, it displays this error
message.
Wait for the patch installation process to complete (Figure 2-28 on page 75).
For more information about Oracle 9i RAC on an IBM System p setup, refer to Deploying
Oracle9i RAC on eServer Cluster 1600 with GPFS, SG24-6954.
Oracle Clusterware
Oracle 10g RAC needs a “private” interconnect network for cache fusion traffic, which
includes data block exchange between instances, plus service messages. Depending on the
amount of load and the type of database operations (select, insert, updates, cross-update,
and so forth) running on the instances, the throughput on this interconnect can be high. Most
Oracle database traffic between instances is based on UDP protocol.
The term “private” used for the interconnect means that this network must be separated from
the client access (public) network (used by the clients to access the database). The private
interconnect is limited to the nodes hosting a RAC instance, whereas the public network might
connect to a WAN. However, the term “private” does not mean that another cluster layer (in this
case, GPFS) cannot share it.
Important: Oracle Clusterware uses the interface name (en3, for example) to define the
interconnect network to be used. It is mandatory that this interface name is the same on
all the nodes in the cluster. You have to enforce this requirement prior to installing Oracle
Clusterware.
GPFS
As a cluster file system, GPFS needs an interconnect network. In a typical Oracle database
and GPFS configuration, the actual data I/O flows through the host bus adapters (for
example, Fibre Channel) and not through the IP network (interconnect), which allows
superior performance. The GPFS interconnect network is used for service messages and the
token management mechanism. However, Oracle 10g RAC comes with its own data
synchronization mechanism and does not use GPFS locking.
GPFS uses TCP for its internal messages and relies on IP addresses, not on a specific
interface name. Because data I/O does not use the IP network, GPFS does not require
high network bandwidth; thus, the GPFS interconnect can be overlapped with the Oracle
interconnect (same network).
Note: Even though the Oracle 10g RAC interconnect requires special attention, sizing for this
network is based on the same principles as for any other IP network. If the interconnect is
properly sized, this network can be shared with other clustering traffic, such as GPFS, which
adds almost no load to the network. Therefore, the GPFS interconnect traffic can be
mixed with the Oracle interconnect traffic without impacting RAC performance.
Note: We recommend using a single network for both Oracle 10g RAC and GPFS
interconnects.
Note: In our configuration, Oracle public network is using the same adapter, en2, on both
nodes, and en3 is dedicated to RAC and GPFS interconnect traffic.
Example 2-36 presents the network interface configuration for both public and private
networks on node austin1.
Oracle Clusterware installation fails if the network interface names are not the same on all nodes in
the cluster. The VIP addresses are configured as IP aliases on the public network, as shown
in Example 2-37.
Oracle does not support the use of crossover cables between two nodes. Use a switch in all
cases. A switch is needed for interconnect network failure detection by Oracle Clusterware.
Although there are no AIX issues, crossover cables are neither recommended nor supported.
Tip: For more information, see the Oracle Metalink note 220970.1 at:
http://metalink.oracle.com
Figure 2-31 Virtual interconnect network for nodes in the same server
A virtual network environment can be used for development, test, or benchmark purposes
where no high availability is required for client access, and when the nodes are different
logical partitions (LPARs) of the same physical server. In this case, a virtual network can be
created without the need for physical network interfaces or a Virtual I/O Server (VIOS).
This virtual Ethernet network practically never fails, because it does not rely on physical
adapters or cables. A virtual network inside a physical server is highly available in itself;
there is no need to secure it with EtherChannel, for example, as you do for physical Ethernet
networks.
This virtual network is a perfect candidate for RAC and GPFS interconnects when all cluster
nodes reside inside one physical server (for example, for a test environment). The bandwidth
is 1 Gb/s minimum, but it can be much higher. The latency varies depending on the overall
CPU load for the entire server.
In the current IBM System p5 implementation, external network access from a virtual Ethernet
network requires a VIOS with a shared Ethernet adapter (SEA). It is possible to design,
implement, and use an interconnect network similar to the one shown in Figure 2-32 on
page 82. A typical configuration uses two VIOSs per frame with SEA failover, which
provides good high availability for the network.
Figure 2-32 Virtual interconnect network for nodes on different servers (see previous Note)
The use of a single VIOS (and thus, no SEA failover) for the interconnect network is not
resilient enough. The VIOS is a single point of failure. This configuration is not recommended,
although it is supported.
Although this setup is not the best one for RAC interconnect purposes, it remains the state of
the art for all other usages, including public or administrative networks.
For another virtual network setup using Etherchannel over dual VIOS to protect the network
against failures (instead of using SEA failover), refer to 7.1, “Virtual networking environment”
on page 229.
When RAC nodes reside on different servers (which should be the standard configuration),
we recommend that you set up a physical interconnect network by using dedicated adapters
that are not managed by a VIOS, as shown in Figure 2-33 on page 83.
Figure 2-33 Physical Ethernet interconnect network
Note: Except for testing or development, we recommend nodes on different hardware with
a physical network as interconnect.
Jumbo frames
Most of the modern 1Gb (or higher) Ethernet network switches support a feature called
“jumbo frames”, which allows to them handle a maximum packet size of 9000 bytes instead of
traditional Ethernet frames (1500 bytes). You can set this parameter at the interface level and
switch. Jumbo frames are not activated by default.
Example 2-38 shows how to enable the jumbo frames for one adapter.
[Entry Fields]
Ethernet Adapter ent0
Description 2-Port 10/100/1000 Ba>
Status Available
Location 03-08
Rcv descriptor queue size [1024] +#
TX descriptor queue size [512] +#
Software transmit queue size [8192] +#
Transmit jumbo frames yes +
Enable hardware TX TCP resegmentation yes +
Enable hardware transmit and receive checksum yes +
Media speed Auto_Negotiation +
Enable ALTERNATE ETHERNET address no +
ALTERNATE ETHERNET address [0x000000000000] +
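As an alternative to SMIT, a command-line sketch (the adapter and interface names are illustrative; -P defers the change until the interface or system is restarted):
root@austin1:/> chdev -l ent0 -a jumbo_frames=yes -P
root@austin1:/> chdev -l en0 -a mtu=9000 -P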
If you use Etherchannel, enabling jumbo frames when creating the Etherchannel pseudo
device automatically sets transmit jumbo frames to yes for all the underlying interfaces
(starting in AIX 5.2), as shown in Example 2-39.
[Entry Fields]
EtherChannel / Link Aggregation Adapters ent0,ent1 +
Enable Alternate Address no +
Alternate Address [] +
Enable Gigabit Ethernet Jumbo Frames yes +
Mode round_robin +
Hash Mode default +
Backup Adapter +
Automatically Recover to Main Channel yes +
Perform Lossless Failover After Ping Failure yes +
Internet Address to Ping []
Number of Retries [] +#
Retry Timeout (sec) [] +#
Switches and all other networking components involved must support jumbo frames.
Whenever possible, using jumbo frames for the interconnect network is a good choice.
Jumbo frames reduce the number of packets (and thus data fragmentation) under
heavy loads. If all of your networks are 1 Gb or faster, and none of them are 10 or 100 Mb,
jumbo frames can be enabled everywhere.
Note: As long as the switches support jumbo frames, we recommend using jumbo frames
for interconnect network.
Etherchannel
Etherchannel is a network port aggregation technology that allows several Ethernet adapters
to be put together to form a single pseudo-Ethernet device. In our test environment, on nodes
austin1 and austin2, the ent0 and ent1 adapters have been aggregated to form the logical
device called ent2; the corresponding interface, en2, is configured with an IP address. The
system and the remote hosts consider these aggregated adapters as one logical device (interface).
All adapters in an Etherchannel must be configured for the same speed (1Gb, for example)
and must be full duplex. Mixing adapters of different speeds in the same Etherchannel is not
supported.
In order to achieve bandwidth aggregation, all physical adapters have to be connected to the
same switch, which must also support Etherchannel.
You can have up to eight primary Ethernet adapters and only one backup adapter per
Etherchannel.
Both Etherchannel and IEEE 802.3ad Link Aggregation require switches capable of handling
these protocols. Certain switches can auto-discover the IEEE 802.3ad ports to aggregate.
Etherchannel needs configuration at the switch level to define the grouped ports.
IEEE 802.3ad
According to the IEEE 802.3ad specification, the packets are always distributed in the
standard fashion, never in a round-robin mode.
Example 2-39 on page 84 shows you how to set the round-robin mode when creating the
Etherchannel.
Note: For Oracle 10g RAC private interconnect network, we recommend an Etherchannel
using the round-robin algorithm.
Remember that all adapters must be connected to the same switch in order to aggregate the
bandwidth. Although we have several adapters, the switch itself can be considered a
single point of failure: the entire Etherchannel is lost if the switch is unplugged or fails, even if
the network adapters are still available.
To address this issue and remove the last single point of failure found in the interconnect
networks, Etherchannel provides a backup interface. In the event that all of the adapters in
the Etherchannel fail, or if the primary switch fails, the backup adapter will be used to send
and receive all traffic. In this case, the bandwidth is the one provided by the single backup
adapter, with no aggregation any longer. When any primary link in the Etherchannel is
restored, the service is moved back to the Etherchannel. Only one backup adapter per
Etherchannel can be configured. The adapters configured in the primary Etherchannel are
used preferentially over the backup adapter. As long as at least one of the primary adapters is
functional, it is used.
Of course, the backup adapter has to be connected to a separate switch and linked to a
different network infrastructure. It is not necessary for the backup switch to be Etherchannel
capable or enabled.
Figure 2-34 on page 87 shows how to design a resilient Etherchannel and how to connect the
physical network adapters to the switches. Interfaces en2 and en3 are used together in an
aggregated mode and are connected on the Etherchannel capable switch. Interface en1 is
connected on the backup switch and is used only if both en2 and en3 fail, or if the primary
switch has problems.
Figure 2-34 Resilient Etherchannel architecture
When set with an IP address, this mechanism overrides clusterware settings, and the
specified IP address is used for the interconnect traffic, including Oracle Global Cache
Service (GCS), Global Enqueue Service (GES), and Interprocessor Parallel Query (IPQ). If
set with two addresses, both addresses are used in a load balancing mode, but as soon as
one link is down, all interconnect traffic is stopped, because the failover mode is turned off.
To query (using Oracle SQL client) the network used by Oracle 10g RAC for its private
interconnect usage, see Example 2-41.
SQL>
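Example 2-41 is shown only partially above; a sketch of the kind of query used (gv$cluster_interconnects is a standard Oracle 10g view):
SQL> select inst_id, name, ip_address, is_public from gv$cluster_interconnects;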
With GPFS, it is possible to share the same set of binary files among all instances and thus
minimize the effort of upgrading or patching software. The following components can be
stored on GPFS file systems:
Oracle database files
Oracle Clusterware files (OCR and voting disks)
Oracle Flash Recovery Area
Oracle archive log destination
Oracle Inventory
Oracle database binaries (ORACLE_HOME)
Oracle Clusterware binaries (ORA_CRS_HOME)
Oracle database log/trace files
Oracle Clusterware log/trace files
Oracle datafiles, Oracle Clusterware OCR and voting disks, Oracle Flash Recovery Area, and
Oracle archive log destination require shared storage. However, this is not mandatory for the
remaining components. For these components, you can choose a shared space (file system)
or individual storage space on each cluster node. The advantage of using GPFS for these
components is ease of administration as well as the possibility to access files belonging to a
crashed node, before the node has been recovered. The disadvantage of this solution is the
extra layer that GPFS introduces and the constraint of not being able to perform Oracle rolling
upgrades.
When using a shared file system for Oracle binaries, you need to make sure that all instances
are shut down before code upgrades, because code files for Oracle RAC, as well as Oracle
Clusterware, cannot be changed while the instance is running.
To shut down the Oracle cluster, refer to the readme file shipped with the patch code. You
must make sure that all database instances, Enterprise Manager Database Control,
iSQL*Plus, and Oracle Clusterware processes are shut down.
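A sketch of the shutdown sequence before patching shared binaries (the database name is a placeholder; crsctl must be run as root on every node):
{austin1:oracle}/home/oracle -> emctl stop dbconsole
{austin1:oracle}/home/oracle -> isqlplusctl stop
{austin1:oracle}/home/oracle -> srvctl stop database -d <dbname>
root@austin1:/> crsctl stop crs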
Similar to OCR and voting disks, you can argue that having Oracle Clusterware binaries and
log files on GPFS might cause clusterware malfunction and thus cause node eviction in case
of GPFS freeze or an erroneous configuration. Furthermore, Oracle Clusterware does
support rolling upgrades, which will not work with a shared binaries installation. In fact,
OUI does not seem to fully recognize that Oracle Clusterware is installed on shared space.
In conclusion, even though GPFS can be used to provide shared storage for Oracle
Clusterware throughout the environment, we recommend using local file systems for the
Oracle Clusterware code.
Figure 2-35 on page 90 shows an example of a partitioned eight CPU IBM System p5 server.
Processors, memory, and disks are shared among partitions using virtualization capabilities.
Through the Hardware Management Console (HMC), an administrator can dynamically
adjust running partitions by changing the number of assigned processors, the size of
memory, and physical or virtual adapters. This capability allows for better utilization of all
server resources by moving them to partitions that have higher requirements.
Oracle 10g Database is DLPAR aware, which means that it is capable of adapting to changes
in the LPAR configuration and making use of additional (dynamically added) resources. This
section describes how the Oracle database exploits dynamic changes in processors and
memory when running in an LPAR.
Note: When running Oracle 10g RAC on LPAR nodes, we recommend that you have
LPARs located on separate System p servers in order to avoid single points of failure, such
as the power supply, Central Electronic Complex (CEC), system backplane, and so forth.
Figure 2-35 Example of a partitioned eight-CPU IBM System p5 server managed through the HMC
The size of virtual memory allocated by Oracle at startup time is equal to the value of the
SGA_MAX_SIZE parameter, but only the part specified by SGA_TARGET is actually used.
This means that the Oracle database can start with a larger SGA_MAX_SIZE than the
amount of memory assigned to the partition at Oracle startup time. SGA_TARGET can be
increased up to the limit of the physical memory available to the LPAR. By adding more memory
to the partition, SGA_TARGET can be increased as well. The administrator must anticipate
the amount of memory that can be given to the instance and set the SGA_MAX_SIZE
parameter accordingly.
The following Oracle views are useful to monitor the behavior of the dynamic SGA:
v$sga view displays summary information about SGA
v$sgastat displays detailed information about SGA
v$sgainfo displays size information about SGA, including sizes of different SGA
components, granule size, and free memory
v$sga_dynamic_components displays current, minimum, and maximum size for the
dynamic SGA components
v$sga_dynamic_free_memory displays information about the amount of SGA memory
that is available for future dynamic SGA operations
v$sga_resize_ops displays information about the last 400 completed SGA resize
operations
The AIX 5L operating system does not allow the removal of pinned memory, which means
that when using a pinned SGA, the database administrator can neither reduce the effective size
of SGA_TARGET nor remove real memory from the LPAR. For this reason, when
using a pinned SGA, it is not possible to change the SGA_TARGET value in order to move
memory out of the LPAR. When the SGA is not pinned, this is possible.
Note: When specifying pinned memory for SGA, an instance does not start unless there is
enough memory for the LPAR to host the SGA_MAX_SIZE. Also, DLPAR memory
operations are not permitted on memory reserved for SGA_MAX_SIZE.
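A sketch of enabling a pinned SGA (both settings are required; the maxpin% value is illustrative and must leave enough unpinned memory for AIX):
root@austin1:/> vmo -p -o v_pinshm=1 -o maxpin%=80
SQL> alter system set lock_sga=true scope=spfile;
Restart the instance for the LOCK_SGA change to take effect.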
Example 2-42 Physical memory available to AIX operating system before addition
{texas:oracle}/orabin/ora102/dbs -> prtconf | grep "Memory Size"
Memory Size: 2560 MB
Good Memory Size: 2560 MB
Figure 2-37 on page 93 shows an additional 1 GB of memory assigned to this partition with
the Hardware Management Console.
After this operation, AIX sees more physical memory (3.5 GB), as shown in Example 2-44.
At this point, Oracle allocates only 1 GB of memory for SGA (SGA_TARGET parameter
value). Output from the Oracle sqlplus command is in Example 2-45.
Example 2-45 Size information about SGA memory, including free SGA
SQL> select * from v$sgainfo;
11 rows selected.
The next step is to change the SGA_TARGET value, so Oracle can use additional memory
segments. In Example 2-46, SGA_TARGET is set to 3.5 GB (3584 MB).
System altered.
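A sketch of the command behind Example 2-46, which produced the preceding "System altered." message:
SQL> alter system set sga_target=3584M;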
At this point, v$sgainfo view values correspond to new values. Only about 512 MB of SGA
memory is available (see Example 2-47).
Example 2-47 Size information about SGA memory after resizing SGA_TARGET
SQL> select * from v$sgainfo;
11 rows selected.
Things are more complicated when micropartitioning is used. The AIX operating system
sees virtual processors instead of physical ones, because the kernel and its scheduler have
to see a whole number of processors.
Up to 10 virtual processors can be defined per assigned processing unit (1.0 CPU), and,
conversely, a single virtual processor can be backed by as little as 0.1 of a physical processor.
With dynamic partitioning, both entitled capacity (amount of processing units) and the
number of virtual processors can be changed dynamically. When necessary, both entitled
capacity and the number of virtual processors can change at the same time.
When the capacity is increased, applications run faster, because the power hypervisor
assigns more physical processor time to each virtual processor. By increasing the number of
virtual CPUs only, it is unlikely that the application runs faster, because the overall amount of
processing units does not change.
All running applications gain performance when the capacity is increased in an LPAR. Oracle
also recognizes new virtual processors (because they appear the same way as dedicated
CPUs on AIX) and adjusts its SQL optimizer plans.
With POWER5 processor and AIX V5.3, Simultaneous Multi-Threading (SMT) is introduced.
With SMT, the POWER5 processor gets instructions from more than one thread. What
differentiates this implementation is its ability to schedule instructions for execution from all
threads concurrently.
With SMT, the system dynamically adjusts to the environment, allowing instructions to
execute from each thread (if possible) and allowing instructions from one thread to use all the
execution units if the other thread encounters a long latency event. The POWER5 design
implements two-way SMT on each CPU.
The simultaneous multi-threading policy is controlled by the operating system and is partition
specific.
The action shown in Example 2-49 does not change the actual partition configuration. In the
next step, the number of virtual processors changes in the partition to two, as presented on
Figure 2-39 on page 98.
Example 2-50 shows the number of processors that change in AIX and Oracle.
Additional information appears in the Oracle alert.log file (Example 2-51 on page 99).
Oracle does not change the CPU_COUNT value if the number of CPUs is more than three
times the CPU count at instance startup. For example, after starting an Oracle instance with one
CPU and increasing the number of processors to four, CPU_COUNT is set to three and
the entry shown in Example 2-52 is generated in the alert.log file.
When operating in the AIX 5L and System p environment, you can dynamically add or
remove CPUs from an LPAR with an active Oracle instance. The AIX 5L kernel scheduler
automatically distributes work across all CPUs. In addition, Oracle Database 10g dynamically
detects the change in CPU count and exploits it with parallel query processes.
The purpose of this scenario is to change the storage space for Oracle files (single instance)
from JFS/JFS2 or raw devices to GPFS. There is no Oracle clustering involved at this time. At
the end of this migration, a single instance database is still running on a single node GPFS
cluster.
In this section, we consider two source scenarios. The target is GPFS, but the source can be
either JFS2 file systems or raw partitions for the data files.
Note: Although a single node GPFS cluster is not officially supported, this scenario is in
fact a step toward a multi-node RAC environment based on GPFS. For this reason, the GPFS
file system parameters are configured as for a multi-node cluster (the estimated number of
nodes that will mount the file system, the -n option of mmcrfs).
For details about GPFS considerations, refer to the GPFS V3.1 Concepts, Planning, and
Installation Guide, GA76-0413, and section 2.1.7, “Special consideration for GPFS with
Oracle” on page 44.
The starting point for this scenario is a single database instance, using JFS2 (local) file
system. Oracle code files are located in /orabin directory (a separate file system) and Oracle
data files are in /oradata file system. Oracle Inventory is stored in /home/oracle directory.
Oracle Inventory, data files, and code are moved to GPFS.
To move ORACLE_HOME and Oracle Inventory from JFS to GPFS, follow these steps:
1. Shut down all Oracle processes.
2. Unmount the /orabin file system and remount it on /jfsorabin.
3. Create a single node GPFS cluster and the NSDs that you are using for GPFS.
4. Create GPFS for /orabin, and mount on /orabin. Make sure that the right permissions exist
for the /orabin GPFS file system, for example, oracle:dba.
5. Copy the entire ORACLE_HOME from JFS2 to GPFS (as oracle user):
cd /jfsorabin; tar cvf - ora102 | (cd /orabin; tar xvf -)
6. Unmount the /jfsorabin file system.
7. Move Oracle Inventory:
a. cd /home/oracle; tar cvf - OraInventory | (cd /orabin; tar xvf -)
b. Update the OraInventory location stored in /etc/oraInst.loc.
Note: On the test system, the file oraInst.loc exists in several places:
root@dallas1:/> find / -name oraInst.loc -print 2> /dev/null
/etc/oraInst.loc
/orabin/ora102/oraInst.loc
/orabin/ora102/bigbend1_GPFSMIG1/oraInst.loc
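A sketch of steps 3 and 4 of the list above (the node name, NSD descriptor file, and file system parameters are illustrative):
root@dallas1:/> mmcrcluster -N dallas1:quorum-manager -p dallas1 -C gpfsmig
root@dallas1:/> mmstartup -a
root@dallas1:/> mmcrnsd -F /tmp/orabin_disks.desc
root@dallas1:/> mmcrfs /orabin /dev/orabin -F /tmp/orabin_disks.desc -B 512K -n 2
root@dallas1:/> mmmount orabin
root@dallas1:/> chown oracle.dba /orabin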
We describe a more complex example covering this topic in 3.4, “Migrating from
HACMP-based RAC cluster to GPFS using RMAN” on page 123 and in 3.5,
“Migrating from RAC with HACMP cluster to GPFS using dd” on page 133.
For this scenario, we conduct a full installation of both Oracle Clusterware and Oracle RAC
software. Software installation is needed, because Oracle must link its binary files to the
system libraries. After software installation is complete, the steps to convert from single
instance to RAC are:
1. Perform basic node preparation: prerequisites, network, and GPFS code.
2. Add the node to the GPFS cluster.
3. Install and configure Oracle Clusterware using OUI.
4. Install Oracle RAC code using OUI.
5. Configure the database for RAC.
6. Set up Transparent Application Failover (TAF).
Tip: A good technique is to use AIX Network Installation Manager (NIM) to “clone” an mksysb of the
existing node.
2. Make sure that you have sufficient free space in /tmp. Oracle Installer requires 600 MB,
but the requirements for the node might be higher.
3. Check that the kernel configuration parameters are identical to those on the existing node.
4. Create the oracle user with the same user and group ID as on the existing node.
5. Set up the oracle user environment and shell limits as on the existing node.
6. Attach the new node to the storage subsystem, and make sure that you can access the
GPFS logical unit numbers (LUNs) from both nodes.
7. Check the remote command execution (rsh/ssh) between nodes.
3.2.2 Add the new node to existing (single node) GPFS cluster
After the preparations described in section 3.2.1, “Setting up the new node” on page 107
have been completed, add the new node to the GPFS cluster by running the mmaddnode
command from the node that is already part of the cluster. After you add the node, make sure
the new node has been added to the cluster, then start the GPFS daemon on the new node
using the mmstartup command. If necessary, use the mmmount command to mount the existing
file systems on the new node.
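A sketch of these commands (the node names are illustrative; adjust them to your cluster):
root@dallas1:/> mmaddnode -N dallas2
root@dallas1:/> mmlscluster
root@dallas1:/> mmstartup -N dallas2
root@dallas1:/> mmmount all -N dallas2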
Note: After the file system has been successfully mounted and is accessible from both
nodes in the cluster, stop GPFS on both nodes and make sure your GPFS cluster and file
systems follow the quorum and availability recommendations:
– Check and adjust the cluster quorum method. Add NSD tiebreaker disks to the
cluster and change the cluster quorum to node quorum with tiebreaker disks (see the
sketch after this note).
– Check and configure the secondary cluster data server.
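A sketch of switching to node quorum with tiebreaker disks and defining the secondary configuration server (GPFS must be down on all nodes; NSD and node names are illustrative):
root@dallas1:/> mmshutdown -a
root@dallas1:/> mmchconfig tiebreakerDisks="tbnsd1;tbnsd2;tbnsd3"
root@dallas1:/> mmchcluster -s dallas2
root@dallas1:/> mmstartup -a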
Note: If an Oracle Clusterware uninstall is needed, note that just running OUI will not cleanly
deinstall Oracle Clusterware. Oracle Metalink Doc ID Note:239998.1 documents this
process. For a complete deinstallation, $ORACLE_CRS_HOME/install contains the scripts
rootdelete.sh and rootdeinstall.sh, which you need to run before using the OUI.
If your nodes continuously reboot, the only chance you have to stop this behavior is to try
to log on to the system as root as soon as you get a login prompt, before Oracle
Clusterware starts, and use the crsctl disable crs command. This command will stop
repeated system reboot. Oracle Metalink can be found at:
http://metalink.oracle.com
We use the default parameters from the 10g installation. However, in the field, you might run
into installations that were upgraded from 9i, or even with MAXINSTANCES deliberately set to
2. The 10g defaults from austin1 are:
MAXLOGFILES 192
MAXLOGMEMBERS 3
MAXDATAFILES 1024
MAXINSTANCES 32
MAXLOGHISTORY 292
One way to verify the parameters is to back up the control file to trace, which produces a file in the
udump destination, as shown in Example 3-1.
Connected to:
Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP and Data Mining options
Session altered.
Database altered.
SQL> Disconnected from Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 -
64bit Production
With the Partitioning, Real Application Clusters, OLAP and Data Mining options
{austin1:oracle}/home/oracle ->
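A sketch of the statements behind Example 3-1 (the tracefile identifier is illustrative):
SQL> alter session set tracefile_identifier = 'ctlfile';
SQL> alter database backup controlfile to trace;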
Edit the pfile and add specific RAC information. In Oracle RAC, each instance needs its own
undo table space and its own redo logs. Thus, the new configuration needs to be similar to
the configuration shown in Example 3-3.
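A hedged sketch of the kind of per-instance entries added to the pfile (the instance names GPFSMIG1 and GPFSMIG2 and the tablespace names are illustrative):
*.cluster_database=true
*.cluster_database_instances=2
GPFSMIG1.instance_number=1
GPFSMIG2.instance_number=2
GPFSMIG1.thread=1
GPFSMIG2.thread=2
GPFSMIG1.undo_tablespace='UNDOTBS1'
GPFSMIG2.undo_tablespace='UNDOTBS2'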
The new environment has specific (per instance) configurations; you must remove any
database-wide configuration that might conflict with this new environment’s configurations. In
our environment, only the undo configuration is in conflict. The parameter that we removed is
shown in Example 3-4.
Note: Even though the spfiles are created in $ORACLE_HOME/dbs, we recommend that
you place the spfiles outside of $ORACLE_HOME.
Other configuration changes might be needed to increase the SGA, because Oracle RAC uses part of
the SGA for the Global Cache Directory (GCD). The size of the GCD depends on the database
size. Also, due to the multi-versioning of data blocks in the instances, an increased SGA might
be needed. Whether to increase the SGA depends on the application load characteristics.
Note: It is impossible to recommend a proper value for the SGA size. Use the buffer cache
advisor to assess the effectiveness of the caching.
Note: You must evaluate various options, such as naming, sizing, and mirroring for redo
logs based on the installation that you are upgrading.
In this scenario, undo and redo logs for the new instance are created in a similar manner to
the old instance. The undo log creation is shown in Example 3-6 on page 112.
Tablespace created.
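A sketch of the statement behind Example 3-6 (the file name and size are illustrative):
SQL> create undo tablespace UNDOTBS2 datafile '/oradata/GPFSMIG/undotbs02.dbf' size 500M;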
In our scenario, we chose to rename the redo logs. Example 3-7 lists the current redo logs.
GROUP# MEMBER
---------- ------------------------------
3 /oradata/GPFSMIG/redo03.log
2 /oradata/GPFSMIG/redo02.log
1 /oradata/GPFSMIG/redo01.log
Example 3-8 on page 113 shows the creation of new redo log files. The naming is chosen so
the thread is part of the redo log file name. Therefore, the new redo logs are named
differently. In a later step, the old files are renamed.
Database altered.
Database altered.
Database altered.
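A sketch of the statements behind Example 3-8 (the sizes are illustrative; the group numbers and file names match the listing that follows):
SQL> alter database add logfile thread 2 group 4 '/oradata/GPFSMIG/redo01-02.log' size 100M;
SQL> alter database add logfile thread 2 group 5 '/oradata/GPFSMIG/redo02-02.log' size 100M;
SQL> alter database add logfile thread 2 group 6 '/oradata/GPFSMIG/redo03-02.log' size 100M;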
To rename the redo logs, the database must be in mount mode. Renaming is shown in
Example 3-9.
GROUP# MEMBER
---------- ------------------------------
3 /oradata/GPFSMIG/redo03.log
2 /oradata/GPFSMIG/redo02.log
1 /oradata/GPFSMIG/redo01.log
4 /oradata/GPFSMIG/redo01-02.log
5 /oradata/GPFSMIG/redo02-02.log
6 /oradata/GPFSMIG/redo03-02.log
6 rows selected.
Database altered.
GROUP# MEMBER
---------- ------------------------------
3 /oradata/GPFSMIG/redo03-01.log
2 /oradata/GPFSMIG/redo02-01.log
1 /oradata/GPFSMIG/redo01-01.log
6 rows selected.
Finally, the new thread is enabled, as shown in Example 3-10. After the new thread is
enabled, the new instance can be started.
Database altered.
Both instances can now be started, and the database can be opened from both nodes.
Note: At this time, you must create certain Oracle RAC specific data dictionary views by
running the catclust.sql file. This action produces a lot of output, so we do not show
running the catclust.sql file here. The catclust file is in $ORACLE_HOME/rdbms/admin.
Example 3-12 on page 115 shows how the crs_stat -t output reflects that srvctl starts the
database and instances.
INSTANCE_NUMBER
---------------
1
shutdown abort
ORACLE instance shut down.
exit
SQL>
REM end shutdown instance 1
SQL>
You add the new node information on the second window of the OUI, which is shown in
Figure 3-2 on page 118. In all other windows, click Next.
When the installation is finished, OUI asks to run root scripts on both the old nodes and the
new nodes (see Figure 3-3 on page 119). Running these root scripts performs all of the
configuration changes required to install Oracle Clusterware and start Oracle Clusterware on
the new node.
2. Next, run the /orabin/crs102/root.sh script on the new node, which is shown in
Example 3-16 on page 120.
Done.
root@bigbend3:/orabin/crs102> crs_stat -t
Name Type Target State Host
------------------------------------------------------------
ora....G1.inst application OFFLINE OFFLINE
ora....G2.inst application OFFLINE OFFLINE
ora.GPFSMIG.db application OFFLINE OFFLINE
ora....D1.lsnr application ONLINE ONLINE bigbend1
ora....nd1.gsd application ONLINE ONLINE bigbend1
ora....nd1.ons application ONLINE ONLINE bigbend1
ora....nd1.vip application ONLINE ONLINE bigbend1
ora....D2.lsnr application ONLINE ONLINE bigbend2
As you can see in the last part of Example 3-16 on page 120, bigbend3 now appears with the
basic Oracle Clusterware services, but not instance and listener, because they are not
configured.
Note: The CRS-0215 error can occur when configuring VIP. According to Oracle Metalink,
this error is caused by default routing configuration. In our test environment, we have
observed the same issue. The VIP is not configured (the vipca command did not complete
successfully). We solve this problem by running the following command:
ifconfig en0 alias 192.168.100.157 netmask 255.255.255.0
The interface is en0, and 192.168.100.157 is the IP address used as the VIP.
Note: When addNode.sh was run with Oracle Inventory on shared storage, the node list
was not updated. When Oracle Inventory resides on local storage, node list was updated
on the local Inventory, but not for all nodes. To update the Oracle Inventory node list for a
specific HOME on a specific node, you can use the following command:
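A sketch of the command (the HOME path and node names are illustrative):
{bigbend1:oracle}/home/oracle -> $ORACLE_HOME/oui/bin/runInstaller -updateNodeList \
ORACLE_HOME=$ORACLE_HOME "CLUSTER_NODES={bigbend1,bigbend2,bigbend3}"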
The initial cluster configuration (Figure 3-4) is based on HACMP; oraclevg is in enhanced
concurrent mode (ECM) and is opened in concurrent mode when HACMP is up and running
(RSCT is responsible for resolving concurrent access).
Figure 3-4 Initial HACMP-based cluster configuration (oraclevg is an ECM volume group managed by HACMP)
All raw devices in this test environment are created with the mklv -B -TO options, thus the
logical volume control block (LVCB) does not occupy the first block of the logical volume.
Special consideration must be taken if the raw devices for Oracle are created without the mklv
-TO option for later use of the dd command to copy data files from raw logical volumes to a file
system, as described in 3.5.1, “Logical volume type and the dd copy command” on page 134.
Example 3-19 on page 124 shows the list of raw devices that we have used for this scenario.
Example 3-20 Migrate control files and data files to GPFS using RMAN
SQL> startup nomount
ORACLE instance started.
System altered.
###Even though there is an RMAN command for copying all data files at once, we
decided to copy one data file at a time, because we want to give the files names
of our own choosing.
database opened
Tablespace created.
Database altered.
Tablespace dropped.
Tablespace altered.
FILE_NAME
----------------------------------------------------------------------------------
-----------
/oradata/temp01.dbf
SQL> alter database add logfile thread 1 group 5 '/oradata/redo5.log' size 120M;
SQL> alter database add logfile thread 1 group 6 '/oradata/redo6.log' size 120M;
SQL> alter database add logfile thread 2 group 7 '/oradata/redo7.log' size 120M;
SQL> alter database add logfile thread 2 group 8 '/oradata/redo8.log' size 120M;
###While repeatedly running SQL> alter system switch logfile; on each node,
drop the logfiles that are in “inactive” or “unused” status one by one.
File created.
File created.
###Remove a previous password file and link to the new password file.
{austin1:oracle}/oracle/ora102/dbs -> ls -l
total 80
-rw-rw---- 1 oracle dba 1552 Sep 29 21:53 hc_austindb1.dat
-rw-rw---- 1 oracle dba 1552 Sep 27 09:32 hc_raw1.dat
-rw-r----- 1 oracle dba 8385 Sep 11 1998 init.ora
-rw-r----- 1 oracle dba 34 Sep 28 15:12 initaustindb1.ora
-rw-r----- 1 oracle dba 12920 May 03 2001 initdw.ora
lrwxrwxrwx 1 oracle dba 17 Sep 28 12:02 orapwaustindb1 -> /dev/
rraw_pwdfile
###Remove a previous password file and link to the new password file on the second
node.
root@austin2:/oracle/ora102/dbs> rm orapwaustindb2
root@austin2:/oracle/ora102/dbs> ln -s /oradata/orapw_austindb orapwaustindb2
Example 3-25 Error when running root.sh without removing the /opt/ORCLcluster directory
root@austin1:/oracle/crs> root.sh
WARNING: directory '/oracle' is not owned by root
Checking to see if Oracle CRS stack is already configured
root@austin1:/oracle/ora102/lib> ls -l libskgxn2*
lrwxrwxrwx 1 oracle dba 32 Oct 03 14:04 libskgxn2.a ->
/opt/ORCLcluster/lib/libskgxn2.a
###Verify the new CRS links for the newly created library files
root@austin1:/oracle/ora102/lib> cd /oracle/crs/lib
root@austin1:/oracle/crs/lib> ls -l libskgxn2*
lrwxrwxrwx 1 oracle system 27 Oct 03 22:49 libskgxn2.a ->
/oracle/crs/lib/libskgxns.a
lrwxrwxrwx 1 oracle system 28 Oct 03 22:49 libskgxn2.so ->
/oracle/crs/lib/libskgxns.so
root@austin1:/oracle/ora102/lib> rm libskgxn2.a
root@austin1:/oracle/ora102/lib> rm libskgxn2.so
root@austin1:/oracle/ora102/lib> ls -l libskgxn2*
lrwxrwxrwx 1 root system 27 Oct 03 23:53 libskgxn2.a ->
/oracle/crs/lib/libskgxns.a
lrwxrwxrwx 1 root system 28 Oct 03 23:54 libskgxn2.so ->
/oracle/crs/lib/libskgxns.so
Connecting to (ADDRESS=(PROTOCOL=tcp)(HOST=)(PORT=1521))
STATUS of the LISTENER
------------------------
Alias LISTENER
Version TNSLSNR for IBM/AIX RISC System/6000: Version 10.2.0.1.0
- Production
Start Date 03-OCT-2007 23:25:58
Uptime 0 days 0 hr. 0 min. 0 sec
Trace Level off
Security ON: Local OS Authentication
SNMP ON
Listener Parameter File /oracle/ora102/network/admin/listener.ora
Listener Log File /oracle/ora102/network/log/listener.log
Listening Endpoints Summary...
(DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=austin2)(PORT=1521)))
The listener supports no services
The command completed successfully
Example 3-29 Register a database and instance using the srvctl command
###Add a database
Depending on the logical volume device subtype (see Table 3-1), you need different options
for the dd command: DS_LVZ or DS_LV. The mklv -TO flag indicates that the logical volume
control block does not occupy the first block of the logical volume; therefore, the space is
available for application data. This logical volume has a device subtype of DS_LVZ. A logical
volume created without this option has a device subtype of DS_LV. For “classic” volume
groups, the devsubtype of a logical volume is always DS_LV. For scalable format volume
groups, the devsubtype of a logical volume is always DS_LVZ, regardless of whether the mklv
-TO flag is used to create the logical volume.
Volume group type      Device subtype   LVCB location                                   Notes
Normal volume group    Always DS_LV     The logical volume control block occupies       The mklv -TO flag is always ignored in a
                                         the first block of the logical volume.          normal volume group.
Big volume group       DS_LV            The logical volume control block occupies       When mklv is used without -TO in a big
                                         the first block of the logical volume.          volume group.
Scalable volume group  Always DS_LVZ    The logical volume control block does not       DS_LVZ (mklv -TO) is always set by default
                                         occupy the first block of the logical volume.   in a scalable volume group.
If the “-TO” flag is used in a big volume group, it will show the following additional attribute:
"DEVICESUBTYPE : DS_LVZ".
If the raw devices are not of the DS_LVZ type, when using the dd command to copy raw devices to a file
system, you must skip the first block to avoid data corruption:
$ dd if=/dev/rraw_control1 of=/oradata/control1.dbf bs=4096 skip=1 count=30720
###Repeat the same process for the other data files (sysaux, users tablespaces), except
the system and undo tablespaces.
For the system and undo tablespaces, because they cannot be taken offline, run the dd command as
shown in Example 3-32 on page 137 while the database is in mount (not open) status.
Database altered.
Example 3-33 Stop CRS and database and upgrade HACMP in the RAC environment
###Check the current status of CRS and database
In this section, we document both methods. The setup used for this exercise is a two-node
cluster with nodes dallas1 and dallas2. GPFS Version 2.3 is installed, and there are two file
systems: /oradata and /orabin. For details, refer to Appendix C, “Creating a GPFS 2.3” on
page 263.
Note: In preparation for any migration or upgrade operation, we strongly recommend that
you save your data and also have a fallback or recovery plan in case something goes
wrong during this process.
Migrating to GPFS 3.1 from GPFS 2.3 consists of the following steps:
1. Stop all file system user activity. For Oracle 10g RAC, stop all file system user activity
by running the following command as the root user on all nodes:
crsctl stop crs
Note: You might also need to run the emctl stop dbconsole and isqlplusctl stop
commands. Any scripts run from cron or other places must be stopped as well.
2. As root, cleanly unmount all GPFS file systems. Do not use force unmount. Use the
fuser -cux command to identify any leftover processes attached to the file system.
3. Stop GPFS on all nodes in the cluster (as root user):
mmshutdown -a
Note: After the upgrade, the output of the mmlsconfig command shows the same
maxFeatureLevelAllowed (822) as before, which is normal behavior.
8. Migrate all file systems to reflect the latest metadata format changes. For each file system
in your cluster, use: mmchfs <file_system> -V, as shown in Example 3-35.
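A sketch, using the two file systems in our cluster:
root@dallas1:/> mmchfs /dev/oradata -V
root@dallas1:/> mmchfs /dev/orabin -V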
For more details about the GPFS upgrade procedure, see the manual GPFS V3.1 Concepts,
Planning, and Installation Guide, GA76-0413.
Important: In this scenario, we delete the existing GPFS cluster and recreate it after
installing new GPFS code. You must prepare the environment, node, and disk definition
files for the new cluster.
Start by shutting down all activity (see 3.7.1, “Upgrading using the mmchconfig and mmchfs
commands” on page 139), and then perform the following actions:
1. Export the file systems one by one: mmexportfs <file system> -o <Export-file>, as
shown in Example 3-36 on page 141.
Note: The mmexportfs command actually removes the file system definition from the
cluster.
Note: We recommend that you use mmexportfs for individual file systems, and do not use
mmexportfs all, because this will also export NSD disks that are not used for any file
systems, such as tiebreaker NSDs. Using all can create issues when importing all file
systems into the new cluster.
Note: Because we are reusing disks, use the -v no option to let mmcrnsd overwrite the disks:
mmcrnsd -F /etc/gpfs_config/gpfs_disks_tb -v no
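After the new cluster and NSDs are created, a sketch of importing the file systems back and mounting them (the export file names are illustrative and must match those produced by mmexportfs):
root@dallas1:/> mmimportfs /dev/oradata -i /etc/gpfs_config/3.1-Upgrade/oradata.exp
root@dallas1:/> mmimportfs /dev/orabin -i /etc/gpfs_config/3.1-Upgrade/orabin.exp
root@dallas1:/> mmmount all -a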
3.8 Moving OCR and voting disks from GPFS to raw devices
This section presents the actions that we took to move OCR and Oracle Clusterware voting
disks from GPFS to raw devices. In order to move OCR and voting disks out of GPFS, you
must first prepare the raw partitions and then run the commands for actually moving OCR and
voting disks.
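The OCR part of the move is done with the ocrconfig command; a sketch, run as root once the raw devices described below are prepared (the device /dev/OCR1 is created later in this section, /dev/OCR2 is an assumed second device name):
root@dallas1:/> ocrconfig -replace ocr /dev/OCR1
root@dallas1:/> ocrconfig -replace ocrmirror /dev/OCR2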
Note: Even though the Oracle installation documentation states that a minimum of 100 MB
is required for the OCR, for replacement you need LUNs of at least 256 MB. In
fact, even 256 MB did not work in our test case, so we had to increase the LUN size to
260 MB. We saw the following errors:
PROT-21: Invalid parameter
PROT-16: Internal error
PROT-22: Storage too small
2. Use at least 20 MB per voting disk partition. Ownership for OCR devices must be root,
and the group must be the same as the oracle installation owner, in this case dba. The
voting disk must have owner and group as the oracle installation, in this case oracle and
dba. Permissions must be 640 for OCR and 644 for voting disks.
The current voting disks can be listed with the crsctl command, as shown in
Example 3-41.
located 3 votedisk(s).
oracle@dallas1:/oracle>
OCR device/path names can be obtained using the ocrcheck command, as shown in
Example 3-42 on page 145.
3. Use the mknod command to create a device with a meaningful name, using the same
major/minor number as the AIX hdisk.
In Example 3-43, we show how to identify the LUNs for DS4000 Series storage.
---dar0---
Knowing the mapping between LUN names and AIX default naming, we can now get the
major/minor numbers that we need to create devices, as shown in Example 3-44 on
page 146.
The LUN DALLAS_ocr1 is used for OCR. This is translated to hdisk2, so its major/minor
numbers are 36.3. Example 3-45 shows how we use the mknod command to create the
device OCR1.
The link between these devices and the hdisks is the major/minor number pair and the AIX default
naming. To identify which hdisk is actually used for /dev/crs_votedisk2, use the major and
minor numbers, as shown in Example 3-46.
Example 3-46 Listing all devices with the specific major/minor number
root@dallas1:/> ls -l /dev/crs_votedisk2
crw-r--r-- 1 root system 36, 6 Sep 14 15:38 /dev/crs_votedisk2
root@dallas1:/> ls -l /dev | grep "36, 6"
crw-r--r-- 1 root system 36, 6 Sep 14 15:38 crs_votedisk2
brw------- 1 root system 36, 6 Sep 10 10:07 hdisk5
crw------- 1 root system 36, 6 Sep 10 10:07 rhdisk5
root@dallas1:/>
4. Set the ownership mode to 640 and root.dba for all OCR devices and to oracle.dba and
644 for CRS voting disk devices. Make sure that the AIX LUN reservation policy is set to
no_reserve. To change the reservation policy, use the chdev command, as shown in
Example 3-47.
5. Make sure that new raw devices do not contain any information that might confuse CRS.
We used the dd command to erase any information about /dev/OCR1, as shown in
Example 3-48.
6. Erase all raw devices before proceeding to the next step. Refer to the UNIX man pages for
more information about the dd command. The write error in Example 3-48 indicates that
the data copied from /dev/zero is larger than the raw device. To check the device size, use the
bootinfo command, as shown in Example 3-49.
Note: Make sure that the mknod, chown, chmod, and chdev commands are run on all nodes
in the cluster.
2. The current voting disks are listed using the crsctl command, as shown in Example 3-55
on page 149.
located 3 votedisk(s).
Even though Oracle Clusterware is shut down, we still need to use the -force option when
deleting and adding voting disks. Example 3-56 shows how to delete and add voting disks
with the crsctl command.
located 3 votedisk(s).
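A sketch of the delete and add commands (the paths are illustrative; the GPFS path is the old location and the raw device is the new one):
root@dallas1:/> crsctl delete css votedisk /oradata/crs/vote1 -force
root@dallas1:/> crsctl add css votedisk /dev/crs_votedisk1 -force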
Note: During the testing, we experienced Oracle Clusterware rebooting the nodes (due
to our own mistake), while we were adding a voting disk, which led to a voting disk entry
without a name for the voting disk. Use the /orabin/crs/bin/crsctl delete css
votedisk ... to forcefully remove it.
3. Repeat this for all voting disks. Example 3-57 shows how we remove the remaining two
voting disks.
located 3 votedisk(s).
For more information, refer to the Chapter 3 of the Oracle Database Oracle Clusterware and
Oracle Real Application Clusters Administration and Deployment Guide,10g Release 2
(10.2), Part Number B14197-04.
GPFS mirroring is also known as replication and is independent of any other replication
mechanism (storage-based or AIX Logical Volume Manager (LVM)). GPFS replication uses
synchronous mirroring. This solution consists of production nodes and storage that are
located in two sites, plus a third node that is located in a separate (third) site. The node in the
third site keeps the GPFS cluster alive in case one of the production sites fails: it acts as a
quorum buster for both the GPFS cluster and the GPFS file systems. It also participates in
Oracle CRS voting, by defining a CRS voting disk on an NFS share held by this third node.
The third node is not connected to the SAN.
Server level
EtherChannel and Multi-Path I/O (MPIO) provide high availability at the AIX level. Each
new release of AIX also provides more features that contribute to continuous operations,
for example, by reducing the need to reboot the server when upgrading the OS or performing
system maintenance. However, the server itself remains a single point of failure (SPOF).
Storage level
A SAN can also provide high availability, because the storage subsystems are designed to be
fully redundant and fault resilient. Failures from individual spindles (disks) are managed
through the RAID algorithm by using automatic replacement with hot spare disks. All the
Fibre Channel connections are at least doubled, and there are two separate controllers to
manage the host access to the data. There is no single point of failure.
Application level
On top of the resilient hardware, Oracle Clusterware and RAC software provide a highly
available database.
Figure 4-1 on page 155 shows a common architecture for high availability: two nodes
connected to a storage device. Of course, the nodes belong to two different physical frames.
We do not recommend using two logical partitions (LPARs) in the same frame, because the
frame itself is a single point of failure. This solution is also called local high availability,
because the two servers and the storage device are located in the same data center.
Figure 4-1 Typical high availability architecture: two nodes and one type of storage in one data center
This setup provides excellent high availability for your IT environment. But in case of a global
disaster that affects the entire data center, all the hardware, servers, and storage are lost at
the same time. Disasters include fire, flood, building collapse, and power supply failure, but also
malicious attacks or acts of terrorism.
To address these issues related to a single data center and thus reduce the risk related to a
disaster, you must use a second data center. In this case, this is not called high availability,
but disaster recovery.
In addition to having two data centers, a disaster recovery solution also requires two storage
subsystems and a storage replication mechanism.
Distance considerations
The distance between the sites provides a better separation, but it also introduces latency in
the communication between the sites (IP and SAN).
The distance between the sites impacts the system performance depending on the data
throughput required by your application. If the throughput is high, a maximum of a few
kilometers between the two sites must be considered. If the SAN is less heavily used, a
distance of 20 to 40 km is not a problem. These distances are only indicative and vary
depending on the quality of the SAN, the I/O on the disks, the application, and so on.
A good compromise is to locate the two data centers in two different buildings of the same
company. The distance is then less than a few kilometers, so it is not a concern.
This is a good response to fire disasters, but not the best for earthquakes or floods. This
setup is called a campus-wide disaster recovery solution.
Another frequently used architecture involves two sites of the company in the same city, or
nearby, located less than 20 km (12 miles) apart. The impact of the distance remains
reasonable, and it is still possible to administer and manage both sites. This architecture is
considered a metropolitan disaster recovery solution.
To fully address the earthquake risk, imagine a backup data center on another continent.
Here, the major point is only the distance, and a completely different set of solutions applies
(for example, asynchronous replication). We do not address this subject in this book.
Mirroring considerations
Because we have two storage units (one in each site) and the same set of nodes and
applications accessing the same data, mirroring must be defined between the two storage
units. You need to use mirroring to keep the application running with only one surviving site
and with a full data copy.
You can implement mirroring at the file system level (here GPFS), or at the storage level, by
using Metro Mirror (synchronous) or Global Mirror (asynchronous). There are differences
between these mirroring methods. We describe GPFS mirroring, also called replication, in
this chapter. We describe how to use Metro Mirror for disaster recovery in Chapter 5,
“Disaster recovery using PPRC over SAN” on page 185.
Note: In this configuration, the application runs in both primary and secondary production
sites, but not in the third site. In case of either production site failure, operation in the
surviving site continues without any user intervention.
Data mirroring is done at the GPFS level, so all the disks must be visible from both nodes.
GPFS mirrors data based on failure group information. In this case, a failure group is a set of
disks that belongs to the same site. GPFS enforces the mirroring between the failure groups,
guaranteeing that each site contains a good copy of the entire data, metadata, and file
system descriptors.
The third node is not attached to the SAN and has only internal disks. It provides GPFS with
an internal disk as a Network Shared Disk (NSD). This disk holds only a third copy of the file
system descriptor. The I/O throughput for this node is negligible, because the disk attached to
this node does not hold any data or metadata, and no application that accesses the GPFS file
system runs on this node (in fact, this node must not run any application); thus, this node does
not affect the overall GPFS performance. This node is called the tiebreaker node, and the site
is called the tiebreaker site.
This third (tiebreaker) site must be an independent site; it cannot be a node that is hosted in
one of the two production sites. GPFS cannot survive if a main site and the third site are down
at the same time: GPFS can survive the failure of only one site. If two sites fail at the same
time, GPFS stops on the surviving site, even though this site might still hold a whole set of the
data (one valid copy). Figure 4-4 shows the disaster resilient architecture using GPFS
replication.
Figure 4-4 Disaster resilient architecture using GPFS replication: nodes austin1 (Site A) and austin2 (Site B) are attached through the SAN to storage units α and β; the tiebreaker node has only an internal disk (no external storage); all nodes communicate over the IP network
Important: This file must be the same on all the nodes and must remain unchanged after
the GPFS cluster is created. IP name resolution is critical for any clustering environment.
You must make sure that all nodes in this cluster resolve all IP labels (names) identically.
The two production nodes must have the manager attribute, but the third node is only a client.
However, all three nodes must be quorum nodes.
We have prepared a node descriptor file, which is shown in Example 4-2. For more details
about how to create a GPFS cluster, see 2.1.6, “GPFS configuration” on page 30.
Example 4-4 GPFS three node topology where each node must be part of the quorum
root@austin1:/home/michel> mmlscluster
Example 4-5 Avoid propagating inappropriate error messages on the third node
root@gpfs_dr:/home/michel> mmchconfig unmountOnDiskFail=yes gpfs_dr_interconnect
mmchconfig: Command successfully completed
mmchconfig: 6027-1371 Propagating the cluster configuration data to all
affected nodes. This is an asynchronous process.
Verify that the parameter has been set as shown in Example 4-6.
In our example, there are only two LUNs, but in an actual configuration, you might have more
LUNs. Just make sure that you have an even number of LUNs and that all LUNs are the
same size. Also make sure that half of the LUNs are located in each storage subsystem (in
different sites). By assigning the LUNs in each storage subsystem to a different failure group,
you make sure that GPFS replication (mirroring) is consistent and useful in case one
production site fails.
On node gpfs_dr, we have one free internal SCSI disk. The size of this disk is not that
important; it contains only a copy of the file system descriptors (no data or metadata). This
disk is in a separate failure group.
The LUNs in storage subsystems in sites A and B and the internal SCSI disk belong to the
same GPFS. If you plan to have more than one file system, in addition to an equal number of
Example 4-7 shows the disks that will be used for the new GPFS file system.
Example 4-7 Free LUNs on the main nodes and free internal disk on the third node
root@austin1:/home/michel> lspv
hdisk0 0022be2ab1cd11ac rootvg active
...
hdisk16 none None
hdisk17 none None
root@gpfs_dr:/home/michel> lspv
hdisk0 00c6629e00bddee5 rootvg active
...
hdisk3 none None
Create the NSDs that GPFS will use later, on one of the main nodes and on the third node. The
command is mmcrnsd. You must issue this command on a node that sees the disk or the LUN.
Example 4-8 shows the disk descriptor file that we use for this scenario. For more information
about this command, refer to 2.1.6, “GPFS configuration” on page 30.
Make sure that the failure group (1, 2, or 3) for each LUN reflects the actual site; there is one
failure group on each site. The disks in our example are:
hdisk16 is a LUN in site A storage (failure group 1)
hdisk17 is a LUN (same size as hdisk16) in site B storage (failure group 2)
hdisk3 is an internal disk in the node in site C (failure group 3)
For more information about the sites, see Figure 4-4 on page 159.
Example 4-8 GPFS disk file for NSD creation on the main nodes and third node
root@austin1:/home/michel> cat gpfs_disk_file
hdisk16:austin1_interconnect:austin2_interconnect:dataAndMetadata:1:dr_copy1:
hdisk17:austin2_interconnect:austin1_interconnect:dataAndMetadata:2:dr_copy2:
hdisk3:gpfs_dr_interconnect::descOnly:3:dr_desc:
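A minimal sketch of the mmcrnsd invocation, using the descriptor file from Example 4-8, is the
following (add -v no only if the disks were previously used by GPFS):
root@austin1:/home/michel> mmcrnsd -F gpfs_disk_file
Verify the result with the mmlsnsd -m command: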
root@austin1:/home/michel> mmlsnsd -m
Note: Disks dr_copy1 and dr_copy2 appear twice in the listing shown in Example 4-9,
which is normal, because these disks are attached (via SAN) to both production nodes.
We are now ready to create the file system. To create the file system, we use the same disk
descriptor file that was used for the mmcrnsd command. This file was modified by the mmcrnsd
command when the NSDs were created, as shown in Example 4-10. At this point, start the
GPFS daemon on all nodes in the cluster (mmstartup -a).
The NSD disks are accessible from any node in the cluster; thus, you can run the mmcrfs
command on any node (see Example 4-11 on page 165). Because you want a fully replicated
file system, make sure that you use the correct replication parameters: -m2 -M2 -r2 -R2.
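A minimal sketch of such an mmcrfs invocation, assuming the mount point /disaster and the
device name disaster (the -A yes flag, for automatic mounting, is an assumption), is:
root@austin1:/home/michel> mmcrfs /disaster disaster -F gpfs_disk_file -m2 -M2 -r2 -R2 -A yes
The output of the command is similar to the following: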
GPFS: 6027-531 The following disks of disaster will be formatted on node austin1:
dr_copy1: size 4194304 KB
dr_copy2: size 4194304 KB
dr_desc: size 71687000 KB
GPFS: 6027-540 Formatting file system ...
GPFS: 6027-535 Disks up to size 208 GB can be added to storage pool 'system'.
Creating Inode File
Creating Allocation Maps
Clearing Inode Allocation Map
Clearing Block Allocation Map
GPFS: 6027-572 Completed creation of file system /dev/disaster.
mmcrfs: 6027-1371 Propagating the cluster configuration data to all affected
nodes. This is an asynchronous process.
The replication parameters that enable and activate replication are:
-m Default number of copies of metadata (inodes, directories, and indirect blocks) for a file.
Valid values are 1 and 2 (a value of 2 activates metadata replication by default). You
cannot set this parameter to 2 if -M is not also set to 2.
-M Maximum number of copies of metadata (inodes, directories, and indirect blocks) for
a file. Valid values are 1 and 2 (a value of 2 enables metadata replication).
-r Default number of copies of each data block for a file. Valid values are 1 and 2 (a value
of 2 activates data replication by default). You cannot set this parameter to 2 if -R is not
also set to 2.
-R Maximum number of copies of data blocks for a file. Valid values are 1 and 2 (a value of
2 enables data replication).
Mount and check the parameters of the newly created file system, as shown in Example 4-12.
root@austin1:/home/michel> mount
node mounted mounted over vfs date options
-------- --------------- --------------- ------ ------------ ---------------
/dev/hd4 / jfs2 Sep 16 02:06 rw,log=/dev/hd8
/dev/hd2 /usr jfs2 Sep 16 02:06 rw,log=/dev/hd8
/dev/hd9var /var jfs2 Sep 16 02:06 rw,log=/dev/hd8
/dev/hd3 /tmp jfs2 Sep 16 02:06 rw,log=/dev/hd8
/dev/hd1 /home jfs2 Sep 16 02:06 rw,log=/dev/hd8
/proc /proc procfs Sep 16 02:06 rw
/dev/hd10opt /opt jfs2 Sep 16 02:06 rw,log=/dev/hd8
/dev/disaster /disaster mmfs Sep 19 18:24
rw,mtime,atime,dev=disaster
After successful creation, you can use this file system to store the Oracle data files. The file
system is resilient to the loss of either one of the production sites.
4.2.4 Oracle 10g RAC clusterware configuration using three voting disks
We have seen earlier that Oracle 10g RAC provides its own high availability mechanism. Is it
the same for disaster recovery?
A disaster recovery configuration implies duplicated SAN storage in different sites. Because
the Oracle data files are located on GPFS, they are protected by the file system layer against
disaster (if one site is down). Now, what about Oracle Clusterware, which is also called CRS?
Oracle RAC is based on a concurrent (shared) storage architecture, and the main goal of the
clustering layer is to prevent unauthorized storage access from nodes that are not considered
“safe”. From this perspective, Oracle Clusterware Cluster Ready Services (CRS) manages
node failure similarly to GPFS. It uses a voting disk to act as a tiebreaker in case of a node
failure. The CRS voting disk might be a raw device or a file that is accessible to all nodes in
the cluster. The voting disk (or the access to the disk) is vital for Oracle Clusterware. If the
voting disk is lost to any node, even temporarily, it triggers the reboot of the respective node.
Oracle Clusterware cannot survive without a valid voting disk. You can define up to 32 voting
disks, all of which contain the same information. For the cluster to be up and running, more
than half of the declared voting disks must be accessible. Assume that half of the voting disks
are located on a storage unit in site A and the other half in site B. We can see immediately
that if one of the storage units (through a site failure) is lost, the voting disks quorum cannot
be fulfilled, and all nodes are rebooted by CRS. After the reboot, CRS can reconfigure itself
with only the surviving voting disks and can restart all the instances. However, to avoid any
disruption, we must have an odd number of voting disks (usually three copies are enough)
and have (at least) the third copy on a third site.
As discussed in 1.3.1, “RAC with GPFS” on page 11, we do not recommend using GPFS for
storing the CRS voting disks. You must use NFS-shared or SAN-attached storage (as raw
devices). It is also possible to use a combination of NFS-shared and SAN-attached raw
devices.
Figure: voting disk placement across three sites. Nodes austin1 (Site A) and austin2 (Site B) hold the database files and voting disks vote1 and vote2 on their SAN-attached storage; the tiebreaker node gpfs_dr in Site C has no external storage and acts as the NFS server for the third voting disk (vote3); the production nodes are connected through the SAN, and all three nodes communicate over the IP network.
Configure the buffer size, timeout, protocol, and security method, as shown in Example 4-16
on page 170. Make sure that this directory is mounted automatically after a reboot.
Check the /etc/filesystems file for the new entry, as shown in Example 4-17 on page 171.
Also, add the noac option in the /etc/filesystems file.
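As a sketch, the resulting /etc/filesystems stanza might look like the following (the stanza
layout is an illustration; the option values match the mount output in Example 4-18):
/voting_disk:
        dev             = "/voting_disk"
        vfs             = nfs
        nodename        = gpfs_dr
        mount           = true
        options         = rw,bg,hard,intr,rsize=32768,wsize=32768,timeo=600,vers=3,proto=tcp,noac,sec=sys
        account         = false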
Make sure that the file system is mounted, as shown in Example 4-18.
root@austin1:/home/michel> mount
node mounted mounted over vfs date options
-------- --------------- --------------- ------ ------------ ---------------
/dev/hd4 / jfs2 Sep 21 14:52 rw,log=/dev/hd8
/dev/hd2 /usr jfs2 Sep 21 14:52 rw,log=/dev/hd8
/dev/hd9var /var jfs2 Sep 21 14:53 rw,log=/dev/hd8
/dev/hd3 /tmp jfs2 Sep 21 14:53 rw,log=/dev/hd8
/dev/hd1 /home jfs2 Sep 21 14:53 rw,log=/dev/hd8
/proc /proc procfs Sep 21 14:53 rw
/dev/hd10opt /opt jfs2 Sep 21 14:53 rw,log=/dev/hd8
/dev/disaster /disaster mmfs Sep 21 14:54
rw,mtime,atime,dev=disaster
gpfs_dr /voting_disk /voting_disk nfs3 Sep 21 16:10
rw,bg,hard,intr,rsize=32768,wsize=32768,timeo=600,vers=3,proto=tcp,noac,sec=sys
2. Check to see if CRS is really stopped, as shown in Example 4-20 on page 172.
The initial configuration was made with three voting disks, which are located on two different
storage devices. The voting disks are shown in Example 4-21.
located 3 votedisk(s).
3. Delete one of these disks located in the storage that holds two voting disks, as shown in
Example 4-22.
4. We add the NFS-shared voting disk as shown in Example 4-23. Even though the NFS
voting disk creates traffic over the IP network, this traffic is insignificant, and the existence
of the voting disk is more important than its actual I/O.
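A minimal sketch of such delete and add operations follows (the device and file names are
taken from this chapter, but the exact command lines in Examples 4-22 and 4-23 might differ;
-force is needed because CRS is stopped):
root@austin2:/> crsctl delete css votedisk /dev/votedisk2 -force
root@austin2:/> crsctl add css votedisk /voting_disk/voting_disk3_for_DR -force
root@austin2:/> crsctl query css votedisk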
5. Next, we have to change the owner of the new voting disk (an NFS file in our case). This
step, shown in Example 4-24, is extremely important, and if you skip it, CRS will not start.
Example 4-24 Change the owner of the newly created voting disk
root@austin2:/voting_disk> ll
-rw-r--r-- 1 root system 10306048 Sep 21 16:27 voting_disk3_for_DR
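# The owner change itself is done with a command similar to the following (assumption):
root@austin2:/voting_disk> chown oracle:dba voting_disk3_for_DR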
root@austin2:/voting_disk> ll
-rw-r--r-- 1 oracle dba 10306048 Sep 21 16:27 voting_disk3_for_DR
6. Check the configuration. You must have an output similar to Example 4-25 on page 173.
located 3 votedisk(s).
7. Restart Oracle Clusterware on all the nodes, which triggers the restart of instances as
well.
Node failure is simulated by halting the node (halt -q command), which also powers off the
node. This method is different from a normal shutdown (shutdown -Fr), which stops all
applications and processes, synchronizes the file systems, and then stops.
Storage failure is simulated by removing the host mapping at the storage level; thus, the host
loses the disk connection immediately.
The purpose of this series of tests is to verify that GPFS and Oracle 10g RAC behave as
expected in a disaster recovery situation. The results are in line with the expectations, for
both GPFS and Oracle, as long as the configuration explained in this chapter is complete.
The hardware architecture of the test platform is shown in Figure 4-4 on page 159. There are
two production nodes: austin1 and austin2. A third one, gpfs_dr, is used as a tiebreaker
node for GPFS and holds the third Oracle Clusterware voting disk that was exported via NFS
to austin1 and austin2.
We test the worst case scenario by stopping the primary node, austin1. This node is
the GPFS cluster manager and also the file system manager node for the GPFS file system
(mount point /disaster), as shown in Example 4-27.
Example 4-27 Checking the file system manager node for /disaster file system
root@austin1:/> mmlsmgr
file system manager node [from 10.1.100.31 (austin1_interconnect)]
---------------- ------------------
disaster 10.1.100.31 (austin1_interconnect)
The script shown in Example 4-28 was run on the main nodes, austin1 and austin2, before
the failure. Its goal is to estimate the outage time. We have observed no outage.
Example 4-28 Script to check GPFS file system availability during failures
root@austin1:/home/michel> while true
> do
> print $(date) >> /disaster/test_date_austin1
> sleep 1
> done
On the other node, austin2, we can see in Example 4-30 on page 175 that there is no outage
on the GPFS file system /disaster, which remains up and running despite the failure of one
node. The third node (gpfs_dr) is important to maintain the node quorum, thus, keeping the
GPFS file system active on the surviving nodes.
When the failing node has the role of cluster configuration manager, the process to fail over
this management role to another node (one that has the management capability) takes less
than 135 seconds. During this time, the GPFS file systems are frozen on all the nodes. I/O can
still be buffered in memory, as long as the page pool size is sufficient; after the memory
buffers are filled, the application waits for the I/O to complete, just like normal I/O. If the
failing node is not the cluster configuration manager, there is no freeze at all.
The second node is aware of the austin1 failure, as displayed in austin2’s GPFS log shown
in Example 4-31.
Example 4-31 Node austin2 GPFS log during node failure test
root@austin2:/var/adm/ras> cat mmfs.log.latest
...
Thu Sep 20 12:22:04 2007: GPFS: 6027-777 Recovering nodes: 10.1.100.31
Thu Sep 20 12:22:05 2007: GPFS: 6027-630 Node 10.1.100.32 (austin2_interconnect)
appointed as manager for disaster.
Thu Sep 20 12:22:38 2007: GPFS: 6027-643 Node 10.1.100.32 (austin2_interconnect)
completed take over for disaster.
Thu Sep 20 12:22:38 2007: GPFS: 6027-2706 Recovered 1 nodes.
...
Node austin2 assumes the role of file system manager (mmlsmgr command) and cluster
configuration manager (mmfsadm dump cfgmgr command) as shown in Example 4-32.
Example 4-32 Node austin2 is the new manager node for /disaster GPFS file system after node failure
root@gpfs_dr:/home/michel> mmlsmgr
file system manager node [from 10.1.100.32 (austin2_interconnect)]
---------------- ------------------
disaster 10.1.100.32 (austin2_interconnect)
Current clock tick (seconds since boot): 163390.45 (resolution 0.010) = 2007-09-20
14:25:24
For more information about the cluster configuration manager and file system manager roles,
refer to 2.1.6, “GPFS configuration” on page 30.
The third node (gpfs_dr) is also aware of the cluster changes, but it takes no special action,
as shown in Example 4-33.
Example 4-33 Node gpfs_dr GPFS log during node failure test
root@gpfs_dr:/var/adm/ras> cat mmfs.log.latest
...
Thu Sep 20 12:21:54 2007: GPFS: 6027-777 Recovering nodes: 10.1.100.31
Thu Sep 20 12:21:55 2007: GPFS: 6027-2706 Recovered 1 nodes.
...
Even if the failing node had the cluster configuration manager role before its failure, this role
is not transferred back automatically. The other node (austin2) continues to perform this role,
thus avoiding an unnecessary fallback that might freeze file system activity for a short time.
Node austin1 has both management roles (listed in Example 4-34 and Example 4-35 on
page 178). We test what happens if this node loses its access to the local storage.
Example 4-34 Node austin1 is the file system manager for /disaster
root@austin1:/var/mmfs/gen> mmlsmgr
file system manager node [from 10.1.100.31 (austin1_interconnect)]
---------------- ------------------
disaster 10.1.100.31 (austin1_interconnect)
Current clock tick (seconds since boot): 3413.57 (resolution 0.010) = 2007-09-20
16:05:29
It is important to check the GPFS (/disaster) replication before the test. In Example 4-36, you
can see that the NSD disks dr_copy1 and dr_copy2 are both holding data, metadata, and file
system descriptors. They are connected with dual Fibre Channel attachment to both austin1
and austin2. These disks are located on different storage units, situated in separate sites, so
the risk of losing both disks is limited. The third NSD disk, dr_desc, is an internal SCSI disk in
the third node. Accessed by the network only (no SAN), it contains a third copy of the file
system descriptors. The replication settings have been defined at the file system level (see
4.2.3, “Disk configuration using GPFS replication” on page 162).
Example 4-36 GPFS replicated file system configuration before disk failure
Now, the austin1 node has lost its NSD disks, because the LUN mapping to the host was
removed at 16h19m19s. The failing disk is hdisk16 for AIX, or dr_copy1 for GPFS, as
revealed by the disk error messages from the AIX error report (errpt | egrep
"ARRAY|mmfs"), which are detailed using the errpt -aj command, as shown in Example 4-37.
Example 4-37 Node austin1 AIX error report during disk failure
root@austin1:/> errpt |egrep “ARRAY|mmfs”
2E493F13 0920161907 P H hdisk16 ARRAY OPERATION ERROR
9C6C05FA 0920161907 P H mmfs DISK FAILURE
2E493F13 0920162007 P H hdisk16 ARRAY OPERATION ERROR
Description
ARRAY OPERATION ERROR
Probable Causes
ARRAY DASD DEVICE
Failure Causes
DISK DRIVE
DISK DRIVE ELECTRONICS
Recommended Actions
PERFORM PROBLEM DETERMINATION PROCEDURES
Description
DISK FAILURE
Probable Causes
STORAGE SUBSYSTEM
DISK
Failure Causes
STORAGE SUBSYSTEM
DISK
Recommended Actions
CHECK POWER
RUN DIAGNOSTICS AGAINST THE FAILING DEVICE
Detail Data
EVENT CODE
15913921
VOLUME
disaster
RETURN CODE
22
PHYSICAL VOLUME
dr_copy1
Because GPFS replication is activated, there is no impact and no freeze of the file system
/disaster. The file system remains operational on all three nodes, and a user or an application
does not notice anything unusual regarding I/O. You only see the problem in the logs (the AIX
error report shown in Example 4-37 on page 179 and the GPFS log, which is shown in
Example 4-38).
Example 4-39 shows the status of the file system during the disk failure. A good copy of the
data and the metadata is still accessible via the dr_copy2 disk, and the data is not lost during
the failure. Also, because two valid copies of the file system descriptor still exist (dr_copy2
and dr_desc disks), the file system is still mounted and active.
Note: When using a GPFS cluster with three nodes in three sites and a replicated GPFS
file system on two storage devices, the failure of one storage device has no impact on the
I/O. There is no freeze, and there is no data loss. Everything is managed transparently by
GPFS.
Example 4-40 Synchronize the cluster configuration (if changed during the failure)
root@austin1:/> mmchcluster -p LATEST
2. Then, run the command shown in Example 4-41, from any node, to bring back (start) the
disk that has been marked down since its failure.
As a result, the command in Example 4-42 shows that our disk is operational.
3. The last action is to replicate the data and metadata (synchronize the mirror). Be aware
that this can be an I/O intensive action, depending on the size of your file system.
Example 4-43 on page 182 shows how to resynchronize the file system.
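A sketch of the commands behind these steps for our configuration (file system disaster,
failed NSD dr_copy1; the exact command lines in Examples 4-41 through 4-43 might differ):
root@austin1:/> mmchdisk disaster start -d dr_copy1
root@austin1:/> mmlsdisk disaster
root@austin1:/> mmrestripefs disaster -r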
Now, you have fully recovered from the disaster. It was not difficult.
As in the previous examples, the node stopped is the cluster configuration manager and also
the file system manager for the /disaster file system. Also, the mapping of dr_copy2 disk is
removed (at the storage subsystem level) to simulate a storage device problem in site1, and
the austin1 node is halted at the same time. So Site A is not responding anymore.
Although it represents two events at the same time, it does not differ from the node failure and
disk failure cases that we discussed in 4.3.1, “Failure of a GPFS node” on page 174 and
4.3.3, “Loss of one storage unit” on page 177.
The GPFS log on a surviving node has captured both events, as shown in Example 4-44.
Example 4-44 GPFS log showing a disk failure (dr_copy2), and a node failure (austin1)
root@austin2:/var/mmfs/gen> cat mmfslog
Thu Sep 20 17:48:12 2007: GPFS: 6027-680 Disk failure. Volume disaster. rc = 22.
Physical volume dr_copy2.
Thu Sep 20 17:49:13 2007: GPFS: 6027-777 Recovering nodes: 10.1.100.31
Thu Sep 20 17:49:13 2007: GPFS: 6027-630 Node 10.1.100.32 (austin2_interconnect)
appointed as manager for disaster.
Thu Sep 20 17:49:37 2007: GPFS: 6027-643 Node 10.1.100.32 (austin2_interconnect)
completed take over for disaster.
Thu Sep 20 17:49:37 2007: GPFS: 6027-2706 Recovered 1 nodes.
In this scenario, we test the failure of one of the CRS voting disks.
Note: We do not recommend that you store the voting disks on GPFS file systems, as
stated in 1.3.1, “RAC with GPFS” on page 11. Use the raw hdisk (shared SAN) without any
Logical Volume Manager or file system layer. A third voting disk is supported on NFS
(see 4.2.4, “Oracle 10g RAC clusterware configuration using three voting disks” on
page 166).
We want to determine if Oracle 10g RAC can survive a site failure with a node and instance
crash and storage outage (including one of the voting disks). We have already tested the
GPFS layer, and we know that it can survive a disaster without problems. However, because the
CRS voting disks are outside GPFS (two disks on the shared storage as LUNs, and one
NFS-exported file on the third node), we test this scenario separately.
CRS voting disk failure is simulated (in the same way that the previous tests were simulated)
by removing the LUN mapping on the storage subsystem. As a result, RAC remains up and
running with no loss of service. In the CRS log, we can see these lines shown in
Example 4-45.
Example 4-45 CRS logs during the failure of one voting disk
/orabin/crs/log/austin1/alertaustin1.log
/orabin/crs/log/austin1/cssd/ocssd.log
[ CSSD]2007-09-21 17:16:17.076 [1287] >ERROR: clssnmvReadBlocks: read failed
1 at offset 133 of /dev/votedisk2
Note: Oracle 10g RAC can survive the loss of one of the three voting disks.
Example 4-46 CRS logs during the failure of two voting disks
/orabin/crs/log/austin2/alertaustin2.log
2007-09-21 17:48:09.659
[cssd(479262)]CRS-1606:CSSD Insufficient voting files available [1 of 3]. Details
in /orabin/crs/log/austin2/cssd/ocssd.log.
Because the voting disk quorum is no longer met (more than half of the voting disks must be
accessible), all Oracle 10g RAC instances are stopped, and CRS reboots the servers. When
nodes come back up, CRS reconfigures itself to use only one voting disk and restarts the
instances. So, the database service is up again, but it is not disaster resilient any longer.
Note: Oracle 10g RAC cannot survive the loss of two out of the three voting disks.
Figure 5-1 Metro Mirror (PPRC) configuration across two sites: nodes austin1 (Site A) and austin2 (Site B), each with a VIP (austin1_vip, austin2_vip) on the public network (192.168.100.31/32) and the RAC interconnect (austin1_interconn, austin2_interconn, 10.1.100.31/32), are attached to the SAN through fcs0 adapters; Storage A is replicated to Storage B over the PPRC (Metro Mirror) links
In this configuration, both nodes A and B are part of the same GPFS cluster and Oracle RAC.
Moreover, both nodes are active and can be used for submitting application workload.
However, only storage in Site A is active and provides logical unit numbers (LUNs) for GPFS
and RAC. Storage in Site B (secondary) provides replication for the LUNs in Site A. The
LUNs in the secondary storage are unavailable to either node during normal operation.
Note: The configuration in Figure 5-1 requires two nodes. In normal operation, both
nodes are active at the same time and access the LUNs in Storage A. One of the benefits
of this configuration is that it does not require additional (standby) hardware, and in case
Site A fails, Site B can provide service with degraded performance (as opposed to
requiring dedicated contingency hardware). Although reasonably simple, this configuration
requires extensive effort for implementation and testing.
A more sophisticated configuration consists of two active nodes in Site A and two backup
nodes (inactive) in Site B. However, this configuration adds an additional complexity level
for the (manually initiated) failover and failback operations.
During normal operation, the LUNs in the primary storage device are replicated
synchronously to the secondary storage device. In case the storage in Site A becomes
unavailable, the secondary copy in Site B must be activated manually.
You must take extra precautions when performing recovery. In normal operation, the replicated
LUNs in Storage B are not mapped to any of the nodes (because they have the same IDs as
the LUNs in the primary storage). During the recovery process, you must prevent LUNs with the
same IDs from becoming active at the same time, because this confuses the application (GPFS
or Oracle). Thus, you must make sure that the replicated LUNs belonging to the primary (failing)
storage are unmapped from both nodes before you resume the primary storage operation.
When the storage subsystem in Site A is restored to operational status, the system
administrator must reinitiate the replication process to synchronize the copies. After the data
has been synchronized, the secondary storage can remain active (holding the primary copy),
or you can manually restore the original configuration.
Note: This configuration is based on synchronous replication. The distance between sites
is a factor that affects the performance of your application.
5.2 Implementation
Metro Mirror for IBM System Storage DS8000 (formerly Peer to Peer Remote Copy (PPRC))
is a storage replication product that is completely platform and application independent. PPRC
can provide replication between sites for all types of storage methods that are used for Oracle. It
can be used for both stand-alone and RAC (clustered) databases. Database files can be plain
files (JFS or JFS2), raw devices, ASM, or files in GPFS file systems.
In this test, we have used a two node RAC/GPFS cluster and two IBM System Storage
DS8000 units. To simulate the two locations, the SAN provides two IBM 2109-F32 switches
that are connected using long wave single mode optical fiber (1300nm - LW GBICs).
Important: Metro Mirror (PPRC) is used to replicate all of the LUNs that are used for our
configuration, which include:
Oracle Cluster Repository
CRS voting disks
GPFS NSDs
We have configured LUNs, masking, and zoning to support our configuration. We do not
describe the masking and zoning process in this book.
In this section, we describe how we establish replication between the two storage units, the
actions that we must take when storage in Site A becomes unavailable, and the steps to
perform when the primary storage is recovered.
We assume that the LUN configuration has already been performed, and we use only one
pair of LUNs to show our configuration.
Note: In your environment, you must make sure that all LUNs belonging to your application
are replicated, including OCR and CRS voting disks and GPFS tiebreaker disks. We
recommend that you script the failover and failback process and test it thoroughly before
deploying the production environment.
Check the pair of LUNs that are going to be used for replication on both storage subsystems,
as shown in Example 5-2.
# On second storage:
Check the PPRC links available between the two storage subsystems, as shown in
Example 5-3 on page 189.
dscli> lspprcpath 90
Date/Time: November 22, 2007 4:42:30 PM EET IBM DSCLI Version: 5.1.720.139 DS:
IBM.2107-75N0291
Src Tgt State SS Port Attached Port Tgt WWNN
=========================================================
90 90 Success FF90 I0100 I0332 5005076303FFC46A
90 90 Success FF90 I0101 I0333 5005076303FFC46A
90 90 Success FF90 I0230 I0302 5005076303FFC46A
90 90 Success FF90 I0231 I0231 5005076303FFC46A
# and on Storage B:
dscli> lspprcpath 90
Date/Time: November 22, 2007 4:42:26 PM EET IBM DSCLI Version: 5.1.720.139 DS:
IBM.2107-7572791
Src Tgt State SS Port Attached Port Tgt WWNN
=========================================================
90 90 Success FF90 I0231 I0231 5005076306FFC1DE
90 90 Success FF90 I0302 I0230 5005076306FFC1DE
90 90 Success FF90 I0332 I0100 5005076306FFC1DE
90 90 Success FF90 I0333 I0101 5005076306FFC1DE
Create the PPRC relationship between storage in A and B (A → B), as shown in Example 5-4.
List the PPRC relationship as shown in Example 5-5. At this point, the two copies are not
synchronized yet.
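A sketch of such commands, run from the Storage A console and assuming the volume pair
9070:9070 and the storage image IDs used elsewhere in this chapter (the exact flags in
Examples 5-4 and 5-5 might differ):
dscli> mkpprc -remotedev IBM.2107-7572791 -type mmir 9070:9070
dscli> lspprc -remotedev IBM.2107-7572791 9070:9070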
Example 5-7 lists the relationship as seen from storage B (to connect to the storage B
console, use dscli -cfg /opt/ibm/dscli/profile/ds_B.profile):
This situation is the most complicated of all of the recovery situations, because the node in
Site A is still available. Therefore, we must take extra precautions when reconfiguring the
LUN mapping on both nodes.
Connected to storage B, we activate the secondary copy using the command that is shown in
Example 5-9.
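A sketch of such a failover command, run from the Storage B console (the volume pair and
storage image ID are assumptions based on our configuration):
dscli> failoverpprc -remotedev IBM.2107-75N0291 -type mmir 9070:9070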
Next, make sure that Oracle and GPFS are stopped on both nodes. Then, unmap the LUNs
belonging to Storage A from both nodes, A and B; if Storage A is unavailable, make sure that
when it comes back up, the LUNs used for PPRC are not available to either node A or node B.
Note: Unmapping LUNs is storage-specific, and we do not discuss it in this publication. For
storage subsystem operations, check with your storage/SAN administrator to make sure
that you understand the consequences of any action that you might take.
After LUNs in storage A have been unmapped, map the replicated LUNs (in this case vol1_B)
to both nodes (A and B). Start GPFS and check if it can see the NSDs. Make sure also that
OCR and CRS voting disks are available from Storage B.
Verify the NSDs’ availability and file system quorum as shown in Example 5-10. Make sure
that all disks are accessed via the localhost (direct access according to RAC requirements).
Check the GPFS cluster quorum and node availability using the mmgetstate -a -L command,
as shown in Example 5-11.
Node number Node name Quorum Nodes up Total nodes GPFS state Remarks
------------------------------------------------------------------------------------
1 austin1_interconnect 2* 2 3 active quorum node
2 austin2_interconnect 2* 2 3 active quorum node
root@austin1:/>
When the GPFS file system is available, you can start Oracle Clusterware (CRS), then Oracle
RAC, and resume operation.
Important: Restoring the original configuration is a disruptive action and requires planned
downtime.
We recommend that you script all operations and check the procedures before putting your
system in production.
In this section, we describe only step 3, because the other steps have been discussed in
other sections or materials.
4. The synchronization process starts automatically when you create the relationship. Check
for the synchronized copy by using the command shown in Example 5-15.
5. After the synchronization, start the process of moving the primary copy back to Site A:
a. Pause the B → A relationship, as shown in Example 5-16 on page 194 (commands run
on Storage B).
b. Fail over to Site A. On Storage A, execute the commands shown in Example 5-17.
As seen from Storage A:
dscli> lspprc -remotedev IBM.2107-7572791 9070:9070
Date/Time: November 22, 2007 5:17:39 PM EET IBM DSCLI Version: 5.1.720.139 DS: IBM.2107-75N0291
ID State Reason Type SourceLSS Timeout (secs) Critical Mode First Pass Status
=====================================================================================================
9070:9070 Suspended Host Source Metro Mirror 90 300 Disabled Invalid
c. Fail back A → B, as shown in Example 5-18 (commands run on Storage A). Check for
the synchronized copy.
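As a sketch, the whole sequence of steps 5a through 5c might look like the following (the
volume pair and storage image IDs are taken from our configuration; the exact command
lines in Examples 5-16 through 5-18 might differ):
# On Storage B: pause the B -> A relationship
dscli> pausepprc -remotedev IBM.2107-75N0291 9070:9070
# On Storage A: fail over to Site A
dscli> failoverpprc -remotedev IBM.2107-7572791 -type mmir 9070:9070
# On Storage A: fail back A -> B and re-establish replication
dscli> failbackpprc -remotedev IBM.2107-7572791 -type mmir 9070:9070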
6. At this point, redo the original LUN mapping (so that both AIX nodes see only the LUNs
that belong to the primary PPRC copy, located in Storage A).
7. Next, run cfgmgr on both AIX nodes and make sure that all LUNs are available, including
OCR and CRS voting disks.
8. Start CRS, GPFS, and Oracle RAC.
The GPFS snapshot is designed to be fast. GPFS basically performs a file system
synchronization of all dirty data, blocks new requests, performs file system sync again of any
new dirty data, then creates the empty snapshot inode file, and then resumes. The slow part
is waiting for all existing file system write requests to complete and get synchronized to disk.
A really busy file system can be blocked for several seconds to flush all dirty data.
The GPFS mmbackup utility uses snapshots to back up the contents of a GPFS file system at a
point in time to a Tivoli® Storage Manager server. GPFS snapshots also provide an online
backup mechanism to recover quickly from accidentally deleted files.
Snapshots are read-only, so changes are only made in active files and directories. Because
snapshots are not a copy of the entire file system, they cannot be used as protection against
disk subsystem failures.
When using GPFS snapshots with databases that perform Direct I/O to disk (Oracle uses this
feature), there is a severe performance penalty while the snapshot exists and is being backed
up. Every time that a write occurs, GPFS checks to make sure that the old block is
copied-on-write to the snapshot file. This extra checking overhead can double or triple the
normal I/O time.
mmcrsnapshot
The mmcrsnapshot command creates a snapshot of an entire GPFS file system at a single
point in time. The command syntax is:
mmcrsnapshot Device Directory
Where:
Device is the device name of the file system for which the snapshot is to be created. File
system names do not need to be fully qualified. Using oradata is just as acceptable as
/dev/oradata.
Directory is the subdirectory name where the snapshots are stored. This is a subdirectory
of the root directory and must be a unique name within the root directory.
mmlssnapshot
The mmlssnapshot command displays GPFS snapshot information for a file system. The syntax is:
mmlssnapshot Device [-d] [-Q]
Where:
Device is the device name of the file system for which snapshot information is to be
shown.
-d displays the amount of storage used by the snapshot.
-Q displays whether quotas were set to be automatically activated upon mounting the file
system at the time that the snapshot was taken.
mmdelsnapshot
The mmdelsnapshot command deletes a GPFS snapshot. It has the following syntax:
mmdelsnapshot Device Directory
Where:
Device is the device name of the file system for which the snapshot is to be deleted.
Directory is the snapshot subdirectory to be deleted.
mmrestorefs
The mmrestorefs command restores a file system from a GPFS snapshot. The syntax is:
mmrestorefs Device Directory [-c]
Where:
Device is the device name of the file system for which the restore is to be run.
Directory is the snapshot with which to restore the file system.
-c continues to restore the file system in the event that errors occur.
mmsnapdir
The mmsnapdir command creates and deletes invisible directories that connect to the
snapshots of a GPFS file system and changes the name of the snapshots subdirectory. The
syntax is:
mmsnapdir Device {[-r | -a] [-s SnapDirName]}
mmsnapdir Device [-q]
Where:
Device is the device name of the file system.
-a adds a snapshots subdirectory to all subdirectories in the file system.
-q displays the current settings if it is issued without any other flags.
-r reverses the effect of the -a option. All invisible snapshot directories are removed. The
snapshot directory under the file system root directory is not affected.
In Example 6-1, we use the time command to check the elapsed time during the execution of
the mmcrsnapshot command.
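Based on the file system and snapshot names used later in this chapter, the command was
probably similar to the following (an assumption):
root@alamo1:/> time mmcrsnapshot oradata snap1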
real 0m0.64s
user 0m0.19s
sys 0m0.05s
The last few ls commands show that the default snapshots directory name is .snapshots and
that this default subdirectory exists in the root directory of the file system.
You can change the default subdirectory using the mmsnapdir command as shown in
Example 6-2.
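A minimal sketch of such a change (the new directory name .snaps is purely illustrative):
root@alamo1:/> mmsnapdir oradata -s .snaps
root@alamo1:/> mmsnapdir oradata -q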
Note: For an overview of GPFS snapshots, refer to the chapter titled “Creating and
maintaining snapshots of GPFS file system” in the GPFS V3.1 Advanced Administration
Guide, SC23-5182.
Storage manufacturers provide functions in their disk subsystems, such as flash copy or
snapshots, that offer a point-in-time view of a specified volume at the storage level; these
functions are not discussed in this document. A GPFS snapshot is a similar mechanism that is
built into GPFS and is storage subsystem independent.
We present an overview of GPFS snapshots that are used with Oracle Database in
Figure 6-1.
Figure 6-1 GPFS snapshots used with an Oracle database on the /oradata file system: snapshot 1 is created at SCN+7 (visible under /oradata/.snapshots/snap1), copied to a remote location, and then deleted; a second snapshot is created at SCN+14; after a user error at SCN+18, the file system is restored from that snapshot
In our test, the production database is located on the /oradata file system. Each horizontal
block of the production file system (at the bottom of Figure 6-1) represents a single System
Change Number (SCN), which is assigned to each transaction in the database. SCN
numbers increase with time. They are used for consistency and recovery purposes.
At SCN+7, the administrator creates the first GPFS snapshot of the database file system.
Because snapshots do not change with time (they are consistent and read-only), they can be
used as the source for a database backup or database clone. Remember that in this
scenario, the backup will not be consistent from a database point of view. After restoration
from this backup, you must perform a recovery by using online redo logs. If redo logs are
unavailable, recovery is impossible, which makes the backup unusable. After a copy of files
within the /oradata/.snapshots/snap1 directory is complete, snapshot 1 can be deleted to
save disk space.
At SCN+14, another snapshot is created. It might coexist with snapshot 1, but additional disk
space is required. In this example, the user accidentally deletes data at time SCN+18. The
database can be restored from snapshot 2, and the data reflects the state at SCN+14.
When the restore process is necessary, backup files must be copied back to the original
database location, not to the snapshots directory. After that operation, you can open the
Oracle database, and no recovery process is performed while starting up the instance.
The database has to be stopped before creating a snapshot, which means all users have to
be logged off, and all applications connected to the database must be shut down. Downtime
caused by stopping and starting the database might be as long as several minutes, but taking
a GPFS snapshot of the file system with Oracle datafiles takes only a fraction of a second.
Overall downtime might be as long as 10 - 15 minutes (or longer, depending on how much
time is required to start the remaining applications), but when considering this method for a
large database, using GPFS snapshots can reduce system unavailability from hours to
minutes.
Cloning databases
Database “cloning” means creating a copy or multiple copies of a database, usually for
testing and development purposes.
When cloning the database to a new host, the database name and paths do not have to be
modified. The process is easier, because the control files do not have to be recreated, and
init.ora parameters do not have to be changed. Only the source files from the snapshots
directory have to be transferred to the target host.
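As a sketch, the transfer can be as simple as the following (the target host name clonehost
and the target path are hypothetical):
root@alamo1:/oradata/.snapshots/snap1/ALAMO> tar cf - . | ssh clonehost "cd /oradata/ALAMO && tar xf -"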
6.1.3 Examples
We provide several examples in this section.
2. Next, create a GPFS snapshot of the database file system. If the database spans multiple
GPFS file systems, a snapshot has to be created on each of these file systems. We created
a snapshot by using the command shown in Example 6-5.
real 0m0.66s
user 0m0.18s
sys 0m0.05s
All database files (all /oradata file system) are frozen in the /oradata/.snapshots/snap1
directory, as presented in Example 6-6 on page 203. They will not change after the
modification of the original files (/oradata).
3. After creating the snapshot, you can start the database. Because taking the GPFS
snapshot was very quick and the files will be backed up later, the overall downtime of the
database is much shorter than taking the database down until the backup process is
finished. Start the database as shown in Example 6-7.
Even though Oracle keeps all of the database files open while the database is running, the
snapshot contents do not change. Snapshot files can be backed up without stopping the
database, and the consistency of the database is not threatened. While the database is
running, file changes are reflected in the snapshot space usage, as shown in Example 6-8. We
can see that (for now) only 40 MB of data and 1.7 MB of metadata in the GPFS file system have
changed, but these numbers will increase over time.
Example 6-9 Backing up snapshot files with the tar and gzip commands
root@alamo1:/> cd /oradata/.snapshots/snap1/ALAMO
root@alamo1:/oradata/.snapshots/snap1/ALAMO> tar cfv - * | gzip >
/backup/ALAMO.tar.gz
After all of the files are archived, the GPFS snapshot is no longer necessary and can be
deleted to preserve disk space, as shown in Example 6-10.
Whenever restore is necessary, the .tar.gz file created in the previous step can be restored to
the original database location.
The database is open all of the time, and the sample load is generated from a different
machine on the network.
real 0m0.67s
user 0m0.20s
sys 0m0.04s
3. After the database corruption or the deletion of a datafile, the database is stopped on both
cluster nodes to restore the files. Files are restored to the original database location using
the commands shown in Example 6-13.
Because the snapshot was taken with the database active, the database is inconsistent,
and a recovery process is necessary before the database is usable. In this case, all of the
database files were part of a single GPFS file system, which makes this scenario much
easier, because consistency at the file system level is preserved.
4. In this case, the recovery process is handled automatically by the Oracle database. When
starting the instance, after the MOUNT phase and before OPEN, a recovery process will
be performed. Example 6-14 shows the alert.log entries for instance ALAMO1 that were
logged after performing the MOUNT phase.
In this scenario, the redo log files were a part of the same file system as the rest of the
database files. If the database spans across multiple GPFS file systems, it is impossible to
create several GPFS snapshots (one for each file system) at exactly the same time; thus,
recovery is impossible.
In this case, to guarantee consistency across several snapshots, the I/O must be frozen at
the database level using the alter system suspend command, and after creating GPFS
snapshots, resumed with the alter system resume command as presented in Example 6-15.
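A sketch of this sequence follows (the second file system, oralogs, is purely illustrative; only
/oradata comes from our configuration):
SQL> alter system suspend;
System altered.
root@alamo1:/> mmcrsnapshot oradata snap1
root@alamo1:/> mmcrsnapshot oralogs snap1
SQL> alter system resume;
System altered.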
In the case of Oracle RAC, alter system suspend and alter system resume are
cluster-aware and global; therefore, the operations on all cluster nodes will be suspended or
resumed accordingly.
Moreover, to be absolutely sure that the database can be recovered, run the database in
archivelog mode and put all tablespaces (or the whole database) in backup mode before
taking the snapshots. The Oracle Database Backup and Recovery Advanced User’s Guide,
B14191-01, describes this procedure in detail.
mmlsfs
This command displays file system attributes. The syntax is:
mmlsfs Device [-P]
-P displays the storage pools that are defined within the file system.
mmdf
This command queries the available file space on a GPFS file system. The syntax is:
mmdf Device [-P poolName]
-P poolName lists only the disks that belong to the requested storage pool.
mmlsattr
This command queries file attributes. The syntax is:
mmlsattr [-L] FileName
Where:
-L displays additional file attributes.
FileName is the name of the file to be queried.
mmchattr
This command changes the replication attributes, storage pool assignment, and I/O caching
policy for one or more GPFS files. The syntax is:
mmchattr [-P PoolName] [-I {yes|defer}] Filename
Where:
-P PoolName changes the file’s assigned storage pool to the specified user pool name.
-I {yes | defer} specifies if migration between pools is to be performed immediately
(-I yes) or deferred until a later call to mmrestripefs or mmrestripefile (-I defer). By
deferring the updates to more than one file, the data movement can be done in parallel.
The default is yes.
Filename is the name of the file to be changed.
mmrestripefs
This command rebalances or restores the replication factor of all files in a file system. The
syntax is:
mmrestripefs Device {-p} [-P PoolName]
Where:
-p indicates mmrestripefs will repair the file placement within the storage pool.
-P PoolName indicates mmrestripefs will repair only files that are assigned to the specified
storage pool.
mmrestripefile
The mmrestripefile command rebalances or restores the replication of individual files. The
syntax is:
mmrestripefile {-b | -m | -p | -r} Filename [Filename ...]
Where:
Filename is the name of one or more files to be restriped.
-m migrates all critical data off any suspended disk in this file system.
-r migrates all data off of the suspended disks and restores all replicated files in the file
system to their designated degree of replication.
-p repairs the file placement within the storage pool.
-b rebalances all files across all disks that are not suspended.
Policy commands
We present the GPFS policy commands and options next.
mmchpolicy
This GPFS policy command establishes policy rules for a file system. The syntax is:
mmchpolicy Device PolicyFileName [-t DescriptiveName] [-I {yes|test}]
Where:
Device is the device name of the file system for which policy information is to be
established or changed.
PolicyFileName is the name of the file containing the policy rules.
-t DescriptiveName is the optional descriptive name to be associated with the policy
rules.
-I {yes | test} specifies whether to activate the rules in the policy file PolicyFileName.
yes means that policy rules are validated and immediately activated, which is the default.
test means that policy rules are validated, but not installed.
mmlspolicy
This GPFS policy command displays policy information for the file system. The syntax is:
mmlspolicy Device [-L]
Where:
Device is the device name of the file system for which policy information is to be displayed.
-L shows the entire original policy file.
mmapplypolicy
This GPFS policy command deletes files or migrates file data between storage pools in
accordance with policy rules. The syntax is:
mmapplypolicy {Device|Directory} [-P PolicyFile] [-I {yes|defer|test}] [-L n ]
[-D yyyy-mm-dd[@hh:mm[:ss]]] [-s WorkDirectory]
Where:
Device is the device name of the file system from which files are to be deleted or migrated.
-P PolicyFile is the name of the policy file.
Directory is the fully qualified path name of a GPFS file system subtree from which files
are to be deleted or migrated.
-I {yes | defer | test} determines which actions the mmapplypolicy command
performs on files.
yes means that all applicable MIGRATE and DELETE policy rules are run, and the data
movement between pools is done during the processing of the mmapplypolicy command.
This is the default action.
defer means that all applicable MIGRATE and DELETE policy rules are run, but actual
data movement between pools is deferred until the next mmrestripefs or mmrestripefile
command.
test means that all policy rules are evaluated, but the mmapplypolicy command only
displays the actions that are performed if -I defer or -I yes is specified.
-L n controls the level of information that is displayed by the mmapplypolicy command.
-D yyyy-mm-dd[@hh:mm[:ss]] specifies a date and optionally a Coordinated Universal
Time (UTC) as year-month-day at hour:minute:second.
-s WorkDirectory is the directory to be used for temporary storage during the
mmapplypolicy command processing. The default directory is /tmp.
In this chapter, we discuss storage pools and policy rules in the context of data partitioning in
an Oracle database test environment.
Note: For details about GPFS policy-based data management implementations (storage
pools, filesets, policies, and rules), refer to the GPFS V3.1 Advanced Administration
Guide, SC23-5182.
The Oracle Database Administrator’s Guide 10g Release 2, B14231-01, provides a detailed
explanation of all partitioning methods.
In this chapter, we used range partitioning, because it is the closest to an ILM strategy and the
easiest way to demonstrate the described features.
Range partitioning is useful when data has logical ranges into which it can be distributed.
Accounting data is an example of this type of data. Every accounting operation has an
assigned date (a time stamp). By splitting this data into partitions, each period can reside in a
different tablespace. In addition, with GPFS 3.1 storage pools, each of those tablespaces can
be assigned to a different storage pool while being located in the same file system.
Figure 6-2 Oracle Database partitioning and GPFS storage pools idea: the quarterly range partitions of the table (Q1 2006 through Q4 2007) are mapped to GPFS storage pools; the most recent data (Q3 and Q4 2007) remains in the system pool on enterprise (highest performance) storage, Q1 and Q2 2007 data is placed in pool1 on midrange storage, and the 2006 data is placed in pool2
With these mechanisms, it is possible to achieve high performance while reducing the cost of
hardware.
The following sections describe GPFS storage pools and Oracle Database partitioning
working together.
Example 6-18 shows creating a working collective that includes GPFS nodes so that we can
ssh/scp across the cluster nodes.
We created a disk descriptor file named disk.desc (Example 6-19 on page 215). We used
this file to create Network Shared Disks (NSDs).
In the next step, we edited the disk descriptor file, disks.desc, as shown in Example 6-22.
In this example:
gpfs14nsd and gpfs15nsd will be used for the system storage pool.
gpfs16nsd and gpfs17nsd are for user storage pool1.
gpfs18nsd and gpfs19nsd are for user storage pool2.
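A sketch of what the edited disks.desc file might contain after this step (the failure groups are
assumptions; note that user storage pools such as pool1 and pool2 can hold data only, so all
metadata stays in the system pool):
gpfs14nsd:::dataAndMetadata:1::system
gpfs15nsd:::dataAndMetadata:1::system
gpfs16nsd:::dataOnly:1::pool1
gpfs17nsd:::dataOnly:1::pool1
gpfs18nsd:::dataOnly:1::pool2
gpfs19nsd:::dataOnly:1::pool2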
In the next step, we created the /oradata file system by using the mmcrfs command.
Example 6-23 shows the log of the detailed output.
Example 6-23 Creating the GPFS file system with storage pools
root@alamo1:/tmp/mah> mmcrfs /oradata oradata -F disks.desc
GPFS: 6027-531 The following disks of oradata will be formatted on node alamo2:
gpfs14nsd: size 10485760 KB
gpfs15nsd: size 10485760 KB
gpfs16nsd: size 10485760 KB
gpfs17nsd: size 10485760 KB
gpfs18nsd: size 10485760 KB
gpfs19nsd: size 10485760 KB
GPFS: 6027-540 Formatting file system ...
GPFS: 6027-535 Disks up to size 24 GB can be added to storage pool 'system'.
GPFS: 6027-535 Disks up to size 24 GB can be added to storage pool 'pool1'.
GPFS: 6027-535 Disks up to size 24 GB can be added to storage pool 'pool2'.
Creating Inode File
Creating Allocation Maps
Clearing Inode Allocation Map
Clearing Block Allocation Map
GPFS: 6027-572 Completed creation of file system /dev/oradata.
mmcrfs: 6027-1371 Propagating the cluster configuration data to all
affected nodes. This is an asynchronous process.
All of the disks are up, and the file system is ready. The file system was mounted by using the
mmdsh mount /oradata command.
We installed an Oracle Clusterware and the Oracle database code files in another file system,
/orabin, which is also shared between cluster nodes. We created a sample database to
demonstrate partitioning, and we located the database files on the /oradata/ALAMO directory,
where ALAMO is the database name.
To make use of storage pools, we had to create all of the tablespaces necessary for table and
index partitions first. In our scenario, we assume that both the data and index segments are
located in the same tablespace. Example 6-24 shows the tablespace creation process.
Tablespace created.
Tablespace created.
Tablespace created.
Tablespace created.
Tablespace created.
Tablespace created.
Tablespace created.
Tablespace created.
We created a sample table, TRANSACTIONS, and within the table, we defined several partitions
by using the range partitioning key.
The example below (Example 6-25) creates the TRANSACTIONS table with eight partitions. Each
table partition corresponds to one quarter of the year, and the corresponding data is stored in
separate tablespaces. Partition TRANS2007Q1 will contain the transactions for only the first
quarter of year 2007.
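An abbreviated sketch of such a statement follows (the column definitions are hypothetical;
the partition and tablespace names follow the naming used in this chapter, and only the first
two of the eight quarterly partitions are shown):
SQL> create table transactions (
       trans_id    number,
       trans_date  date,
       amount      number(12,2))
     partition by range (trans_date)
     (partition trans2006q1 values less than
        (to_date('2006-04-01','YYYY-MM-DD')) tablespace data2006q1,
      partition trans2006q2 values less than
        (to_date('2006-07-01','YYYY-MM-DD')) tablespace data2006q2);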
Then, we created a range-partitioned global index on the TRANSACTIONS table. Each index
partition is stored in a different tablespace, in the same manner as the table partitions.
Table created.
Index created.
Inode Information
-----------------
Number of used inodes: 4062
Number of free inodes: 58402
Number of allocated inodes: 62464
Maximum number of inodes: 62464
We decided to migrate all 2006 data to the storage pool pool2 and Q1 and Q2 2007 data to
pool1. We created and tested the GPFS policy file, as shown in Example 6-27.
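A sketch of what such a policy file might contain (the rule names, the policy file name
pool_policy, and the file-name patterns are assumptions based on the datafile naming used in
this chapter):
RULE 'mig2006' MIGRATE FROM POOL 'system' TO POOL 'pool2'
     WHERE NAME LIKE 'data2006%'
RULE 'mig2007h1' MIGRATE FROM POOL 'system' TO POOL 'pool1'
     WHERE NAME LIKE 'data2007q1%' OR NAME LIKE 'data2007q2%'
The rules can be validated without moving any data by running mmapplypolicy with the
-I test option, for example:
root@alamo1:/oradata/ALAMO> mmapplypolicy /oradata/ALAMO -P pool_policy -I test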
The tested GPFS policy file was executed and the datafiles were moved according to defined
rules. Example 6-28 shows the output of the mmapplypolicy command.
Example 6-29 Storage pools’ free space after applying the policy
root@alamo1:/oradata/ALAMO> mmdf oradata
disk disk size failure holds holds free KB free KB
name in KB group metadata data in full blocks in fragments
--------------- ------------- ------- -------- ----- --------------- ------------
Disks in storage pool: system
gpfs14nsd 10485760 1 yes yes 8342784 ( 80%) 6184 ( 0%)
gpfs15nsd 10485760 1 yes yes 8343552 ( 80%) 6448 ( 0%)
------------- --------------- ------------
(pool total) 20971520 16686336 ( 80%) 12632 ( 0%)
Inode Information
-----------------
Number of used inodes: 4062
Number of free inodes: 58402
Number of allocated inodes: 62464
Maximum number of inodes: 62464
As seen in Example 6-29, we used user storage pools pool1 and pool2.
Tablespace created.
SQL> alter table transactions add partition trans2008q1 values less than
(to_date('2008-04-01','YYYY-MM-DD')) tablespace data2008q1;
Table altered.
Index altered.
After this operation, you can shift older partitions to other GPFS storage pools.
Example 6-31 Useful GPFS commands that are related to storage pools
root@alamo1:/oradata/ALAMO> mmlsattr -L data2006q3.dbf
file name: data2006q3.dbf
metadata replication: 1 max 1
data replication: 1 max 1
flags:
storage pool name: pool2
fileset name: root
snapshot name:
In Example 6-31 on page 223, the mmlsattr outputs show that each file is in its assigned
storage pool.
Note: For specific command syntax, refer to the section titled “GPFS commands” in GPFS
V3.1 Administration and Programming Reference, SA23-2221.
The flexibility of the IBM System p virtualization features provides a cost-effective solution for
quick deployment of test environments to validate solutions before they are put into
production.
Disclaimer: The configuration examples using System p virtual resources (VIO Server,
virtual Ethernet, virtual SCSI) are just for test purposes. Virtual SCSI disks are currently
NOT supported in all configurations. As of the release date of this book, virtual SCSI disks
are only supported using Oracle's ASM, but not with GPFS. For the current IBM/Oracle
cross-certification status, check the following URL:
http://www.oracle.com/technology/support/metalink/index.html
For example, if a VIO server partition is lost or temporarily unavailable, all resources
associated with that partition become unavailable, which causes an outage of all associated
client partitions. Because Oracle RAC is designed to be highly available, architects and
administrators do not want to compromise this design by introducing the VIO server as a
single point of failure. In this chapter, we demonstrate that an Oracle RAC solution can be
deployed with good availability in a System p environment utilizing virtual resources.
With careful design and planning and relying on the IBM System p exceptional virtualization
capabilities, you can achieve high redundancy in the virtualized System p environment. For
example, by using two VIO server partitions per server and the proper configuration of virtual
devices (Multi-Path I/O (MPIO), Logical Volume Manager (LVM) mirroring, EtherChannel,
and so on), you can mask the failure of hardware resources and even an entire VIO server.
These configurations allow you to shut down one of the VIO servers for maintenance
purposes, software upgrade, or reconfiguration while the other VIO server provides network
and disk connectivity for client partitions. These configurations are redundant, so in the case
of a failure of one VIO server, the client partitions continue to operate through the surviving
VIO server.
In this chapter, we demonstrate how to set up a dual VIO server configuration that will provide
high availability for Oracle RAC. Remember that for a highly available Oracle RAC
environment, you need to build two similar hardware configurations (two systems each with
two VIO servers). Although you can install and run Oracle RAC on two logical partitions
(LPARs) of the same server, we do not recommend that you use the same server, because
the server itself represents a single point of failure.
This chapter provides the necessary information to create a resilient architecture for a
two-node Oracle RAC cluster. We discuss the following topics:
Configuration of the network and shared Ethernet adapters
Storage configuration with MPIO
Considerations when using System p virtualization with production RAC databases
We do not describe the installation and configuration of Oracle RAC here, because the
installation and configuration of Oracle RAC are the same as installing RAC on two physical
servers, which we have already described in this book.
There are two common ways to provide high availability for a virtualized network:
SEA failover
A link aggregation adapter with one primary adapter and one backup adapter, known as a
Network Interface Backup (NIB)
SEA failover is implemented at the VIO server level. When several client partitions run within
the same system, SEA is configured only one time for the entire System p server and
provides highly available network connectivity to every partition that utilizes virtual Ethernet.
When using SEA, a failover to a second VIO server can take as long as 30 seconds in case of
an adapter failure, which can cause problems when SEA is used for Oracle RAC
interconnect. Timeouts might be long enough to cause a “split brain” resolution in Oracle
Clusterware and evict nodes from the cluster. Of course, you can still use SEA for
administrative and Virtual IP address (VIP) networks and dedicate physical Ethernet adapters
for interconnect. Mixing physical and virtual adapters is allowed and fully supported in System
p virtualization.
The second possibility, a NIB, is implemented on every client partition and does not rely on
the SEA failover mechanism. A NIB is implemented the same way as an EtherChannel
adapter with a single primary adapter and a backup adapter.
In Figure 7-1 on page 230, the client uses two virtual Ethernet adapters to create an
EtherChannel adapter (en3) that consists of one primary adapter (en1) and one backup
adapter (en2). If the primary adapter becomes unavailable due to VIO server unavailability or
the corresponding physical Ethernet adapter failure on a VIO server partition, the NIB
switches to the backup adapter and routes the traffic through the second VIO server. This
configuration supports a total of two virtual adapters: one active virtual adapter and one
standby virtual adapter.
In this scenario, because there is no hardware link failure for virtual Ethernet adapters to
trigger a failover to the other adapter, it is mandatory to use the ping-to-address feature of
EtherChannel to detect network failures. When configuring virtual adapters for NIB, the two
internal networks must be separated in the hypervisor layer by assigning two different PVIDs.
Note: There is a common behavior with both SEA failover and NIB: They do not check the
reachability of the specified IP address through the backup-path as long as the primary
path is active. They do not check, because the virtual Ethernet adapter is always
connected, and there is no linkup event, such as there is with physical adapters. You do
not know if you really have an operational backup until your primary path fails.
Virtual Ethernet uses the system processors for all communication functions instead of
offloading the load to processors on network adapter cards. As a result, there is an increase
in the system processor load that is generated by the virtual Ethernet traffic. This might be a
good reason to consider using physical adapters for Oracle RAC interconnect.
The connection to the client partition that is shown in Figure 7-1 is still available in case of:
A switch failure
Failure of any Ethernet link
Failure of the physical Ethernet adapter on the VIO server
Virtual I/O server failure or maintenance
Assuming both virtual interfaces are visible on the AIX partition, the easiest way to configure
NIB is by using SMIT. Follow these steps to create an ent3 adapter, which will be the
aggregated adapter with the ent1 and ent2 adapters:
1. Use the following SMIT fastpath: smitty etherchannel.
2. Select Add An EtherChannel / Link Aggregation.
3. From the list, choose a primary adapter for NIB, in this case, ent1.
4. The window in Example 7-1 appears.
[Entry Fields]
EtherChannel / Link Aggregation Adapters ent1 +
Enable Alternate Address no +
Alternate Address [] +
Enable Gigabit Ethernet Jumbo Frames no +
Mode standard +
Hash Mode default +
Backup Adapter ent2 +
Automatically Recover to Main Channel yes +
Perform Lossless Failover After Ping Failure yes +
Internet Address to Ping [10.10.10.1]
Number of Retries [2] +#
Retry Timeout (sec) [1] +#
The next step is to assign an IP address for the newly created NIB. In our test scenario, we
configure it with address 10.10.10.2 (see Example 7-2).
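Example 7-2 is not reproduced in this listing. The same result can also be obtained from the command line, for example (the netmask is an assumption):
chdev -l en3 -a netaddr=10.10.10.2 -a netmask=255.255.255.0 -a state=up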
After completing this part, a Network Interface Backup is configured and ready to use.
During the transfer, the first VIO server partition (which was handling the network traffic) was
shut down with the Hardware Management Console (HMC).
8. After the FTP transfer completed, we verified the integrity of the transferred file on the AIX
partition and saw no loss of data.
At the same time, during the VIO server failure, AIX detected the failure of the primary
interface that was used for the network interface backup and switched to the backup
interface. Example 7-4 shows the output from the errpt command that indicates the failure of
the network interface backup.
root@texas:/> errpt -a
---------------------------------------------------------------------------
LABEL: ECH_PING_FAIL_PRMRY
IDENTIFIER: 9F7B0FA6
Description
PING TO REMOTE HOST FAILED
Probable Causes
CABLE
SWITCH
ADAPTER
Recommended Actions
CHECK CABLE AND ITS CONNECTIONS
IF ERROR PERSISTS, REPLACE ADAPTER CARD.
Detail Data
FAILING ADAPTER
PRIMARY
SWITCHING TO ADAPTER
ent2
Unable to reach remote host through primary adapter: switching over to backup
adapter
Disclaimer: The configuration examples using System p virtual resources (VIO Server,
virtual Ethernet, virtual SCSI) are just for test purposes. Virtual SCSI disks are currently
NOT supported in all configurations. As of the release date of this book, virtual SCSI disks
are only supported using Oracle's ASM, but not with GPFS. For the current IBM/Oracle
cross-certification status, check the following URL:
http://www.oracle.com/technology/support/metalink/index.html
For test or proof of concept (POC) environments, you can deploy a configuration that is not
based on a SAN. In fact, a carefully designed VIO server environment can simulate a virtual
SAN. Refer to Chapter 8, “Deploying test environments using virtualized SAN” on page 241
for more details.
7.2.1 External storage LUNs for Oracle 10g RAC data files
Due to the characteristics of the VIO server implementation, you must configure concurrent
access to the same storage devices from two or more client partitions, considering the
following aspects.
No reserve
To access the same set of LUNs on external storage from two VIO servers at the same time,
the Small Computer System Interface (SCSI) disk reservation has to be disabled. Failing to
disable the SCSI reservation prevents one VIO server from accessing the disk. This
configuration has to be enforced, and it does not depend on your choice to use GPFS, ASM,
or direct hdisks for voting or OCR disks.
You must set this no reserve policy on both VIO servers and on all RAC nodes.
Make sure that the reserve policy is set to no_reserve, as shown in Example 7-5. Note that
the default value is single_path.
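A quick way to check and set the attribute on an hdisk follows; hdisk4 is a placeholder, and the commands must be run on both VIO servers and on all RAC nodes.
# From AIX (or from the VIO server root shell reached with oem_setup_env)
lsattr -El hdisk4 -a reserve_policy
chdev -l hdisk4 -a reserve_policy=no_reserve
# Equivalent command in the VIO server padmin restricted shell:
# chdev -dev hdisk4 -attr reserve_policy=no_reserve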
For Oracle 10g RAC shared storage, the same LUNs on the SAN have to be accessed
concurrently by two partitions through two VIO servers. Thus, you cannot use a part of a
physical disk or logical volume (LV) as a shared disk, because the definition is local to the
LVM of a VIO server. So, when defining the mappings of the Oracle 10g RAC data disks on
the VIO server, map only an entire LUN to the client partition.
For disks to be used as rootvg by the client partitions, you can map logical volumes in the VIO
server if you do not want to dedicate an entire disk (LUN) for one partition. See 7.2.2, “Internal
disk for client LPAR rootvg” on page 237 for more details.
Note: All LUNs on external storage (that will be used to hold Oracle 10g RAC data files)
must be mapped in the VIO server as a whole disk to the client partitions. The reserve
policy must be set to no_reserve.
Figure 7-2 External storage disk configuration with MPIO and redundant VIO servers
The same LUN is accessed by two VIO servers and mapped (the entire disk) to the same AIX
node. The AIX MPIO layer is aware that the two paths point to the same disk.
If any of the SAN switches, cables, or physical HBAs fail, or if a VIO server is stopped, there
is still another path available to reach the disk. MPIO manages the load balancing and
failover at the AIX level, which is transparent to Oracle. Thus, a dual HBA in each VIO server
is not mandatory, because it does not add availability at the node level.
If a VIO server is stopped, the result is the failure of one path to the storage. MPIO fails over
to the surviving path without requiring administrative action. When the failed path is back,
MPIO reintegrates it automatically. The failure and failback are completely transparent to
users and applications; there is nothing to do. Errors are recorded in the AIX error report to
keep track of the events.
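On the client partition, you can confirm that MPIO sees one path per VIO server; hdisk4 is a placeholder, and the sample output lines are illustrative.
lspath -l hdisk4
# Expected: two Enabled paths, one per virtual SCSI adapter, for example:
#   Enabled hdisk4 vscsi0
#   Enabled hdisk4 vscsi1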
Figure 7-3 Disk configuration with redundant VIO servers for DS4000
One of the goals of virtualization is to utilize the resources in the best manner possible. For
example, one 143 GB disk might be too large for a single rootvg; thus, you can efficiently use
the space on this disk by allocating logical volumes on this disk at the VIO server level and
mapping the LVs as virtual SCSI disks used for rootvg to each client LPAR. Another goal is to
share resources between various client LPARs. A limited number of internal disks in the VIO
server can be shared by a large number of client LPARs. With the LVM mirroring proposed
next, you can further increase the level of high availability.
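On the VIO server (padmin shell), carving a logical volume out of an internal disk and mapping it as a client rootvg disk might look like the following sketch; the volume group name, the 20 GB size, and the LV and VTD names are assumptions for illustration.
mklv -lv client1vg_lv clientvg 20G
mkvdev -vdev client1vg_lv -vadapter vhost0 -dev client1_vg_vtd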
Mirrored rootvg
To remove all the SPOFs, we use two VIO servers, and the rootvg must be mirrored using
LVM, which we display in Figure 7-4.
A failure of an internal disk in one VIO server, or a shutdown of the VIO server, results in the
failure of one LV copy. However, rootvg is still alive. After the VIO server is rebooted (for
example), you must resynchronize the mirror (syncvg command).
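On the client partition, mirroring rootvg across the two virtual SCSI disks (one from each VIO server) and resynchronizing after an outage might look like this sketch, assuming hdisk0 and hdisk1 are the two virtual disks:
extendvg rootvg hdisk1
mirrorvg rootvg hdisk1
bosboot -ad /dev/hdisk1
bootlist -m normal hdisk0 hdisk1
# After the failed VIO server is back, resynchronize the stale copy:
syncvg -v rootvg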
Disclaimer: The configuration examples using System p virtual resources (VIO Server,
virtual Ethernet, virtual SCSI) are just for test purposes. Virtual SCSI disks are currently
NOT supported in all configurations. As of the release date of this book, virtual SCSI disks
are only supported using Oracle's ASM, but not with GPFS. For the current IBM/Oracle
cross-certification status, check the following URL:
http://www.oracle.com/technology/support/metalink/index.html
Figure 7-5 Oracle RAC configuration with two virtualized System p servers
The architecture that we propose is suitable for development and testing purposes. One of its
goals is to use the least possible number of disk and network (physical) adapters, which we
achieve by sharing the same physical disk for all of the rootvg and virtual Ethernet adapters,
for example. Creating a virtual SAN resource further contributes to reducing the hardware
and administrative costs that are required to deploy clusters. You can create lightweight
partitions with almost no hardware. For example, if you have a virtual I/O (VIO) server and
one free disk, you can create two partitions for running Oracle 10g RAC easily. Of course, this
configuration cannot match the performance of a similar environment with dedicated physical
resources.
Disclaimer: The configuration examples using System p virtual resources (VIO Server,
virtual Ethernet, virtual SCSI) are just for test purposes. Virtual SCSI disks are currently
NOT supported in all configurations. As of the release date of this book, virtual SCSI disks
are only supported using Oracle's ASM, but not with GPFS. For the current IBM/Oracle
cross-certification status, check the following URL:
http://www.oracle.com/technology/support/metalink/index.html
Logical volumes can be assigned as virtual SCSI disks, but they can only be used in one
LPAR; they cannot be shared between two client LPARs.
Figure 8-1 Simple fully virtualized Oracle 10g RAC architecture for development or testing
This architecture is not highly available. There are several single points of failure. Usually,
Oracle 10g RAC is used for providing high availability or even disaster recovery capabilities.
However, the goal of this configuration is to deploy a real (but lightweight) RAC.
Disclaimer: The configuration examples using System p virtual resources (VIO Server,
virtual Ethernet, virtual SCSI) are just for test purposes. Virtual SCSI disks are currently
NOT supported in all configurations. As of the release date of this book, virtual SCSI disks
are only supported using Oracle's Automatic Storage Management (ASM), but not with
GPFS. For the current IBM/Oracle cross-certification status, check the following URL:
http://www.oracle.com/technology/support/metalink/index.html
You can create a GPFS file system with as little as one SCSI disk. This disk can hold the
normal data files and be used as a tiebreaker disk at the same time.
Note: This configuration provides no data protection, which is acceptable because this is a
test environment.
When creating the virtual adapters, you must make sure that the interface number (en#) is
the same on all of the nodes for the same network (interconnect and public). The Oracle
Clusterware configuration uses the interface number to define the networks, not the
associated IP label (name).
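A quick way to confirm that the interface numbering matches on all nodes is a simple loop; the node names match the client1 and client2 partitions used in this chapter.
for node in client1 client2; do
   echo "### $node"
   ssh $node 'netstat -in | grep -v link#'
done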
The outbound client traffic for two RAC nodes shares the same physical interface. If this
design becomes a bottleneck, you can use a link aggregation interface (EtherChannel) with
two or more physical interfaces.
$ lspv
NAME PVID VG STATUS
hdisk0 0022be2abc04a1ca rootvg active
hdisk1 00cc5d5c6b8fd309 None
hdisk2 0022be2a80b97feb None
hdisk3 0022be2abc247c91 None
$ lsdev -virtual
name status description
ent3 Available Virtual I/O Ethernet Adapter (l-lan)
vhost0 Available Virtual SCSI Server Adapter
vhost1 Available Virtual SCSI Server Adapter
vsa0 Available LPAR Virtual Serial Adapter
### Create virtual target devices for mapping backing devices to virtual SCSI
adapters. These devices will be used for local client rootvgs.
### Create virtual target devices for mapping backing devices to virtual SCSI
adapters. These devices will be used for shared disks.
### Verify mapping information between backing devices and virtual SCSI adapters
VTD clinet1_vg_vtd
LUN 0x8100000000000000
Backing device client1vg_lv
Physloc
VTD vtscsi0
LUN 0x8200000000000000
Backing device hdisk2
Physloc U7879.001.DQDKZNP-P1-T14-L4-L0
VTD clinet2_vg_vtd
LUN 0x8100000000000000
Backing device client2vg_lv
Physloc
VTD vtscsi1
LUN 0x8200000000000000
Backing device hdisk2
Physloc U7879.001.DQDKZNP-P1-T14-L4-L0
### Choose a physical Ethernet adapter and a virtual Ethernet adapter to create a
shared Ethernet adapter. Make sure that no IP address is assigned to the physical
adapter at the time you create the shared Ethernet adapter.
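The mkvdev commands behind these steps are not shown in this listing. In the following sketch, hdisk2, vhost0, vhost1, vtscsi0, and vtscsi1 match the mapping output below, while ent0 (physical adapter), ent3 (virtual adapter), and the default PVID of 1 are assumptions.
# Map the entire shared LUN to the virtual SCSI adapters of both clients
mkvdev -vdev hdisk2 -vadapter vhost0 -dev vtscsi0
mkvdev -vdev hdisk2 -vadapter vhost1 -dev vtscsi1
# Create the shared Ethernet adapter (ent0 must not have an IP address configured)
mkvdev -sea ent0 -vadapter ent3 -default ent3 -defaultid 1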
### Check the virtual devices on both nodes defined in client partitions
root@client1:/> lspv
hdisk0 00c7cd9e76c83540 rootvg active
hdisk1 00c7cd9ece71f8d4 None
When the VIO server and the client partitions have been configured, proceed to install the
operating system on the client1 and client2 LPARs (we have used the Network Installation
Management (NIM) installation), configure networking, and configure GPFS.
Install Oracle Clusterware and database as described in Chapter 2, “Basic RAC configuration
with GPFS” on page 19.
Tip: The Available Network Adapters window displays all Ethernet adapters. If you select
an Ethernet adapter that is already being used (has a defined interface), you get an error
message. You first need to detach this interface if you want to use it.
Note: It is an invalid combination to select a Hash Mode other than default with a Mode of
round_robin.
Backup Adapter: This field is optional. Enter the adapter that you want to use as your
EtherChannel backup.
Internet Address to Ping: This field is optional and only takes effect if you are running
Network Interface Backup mode, or if you have one or more adapters in the EtherChannel
and a backup adapter. The EtherChannel pings the IP address or host name that you
specify here. If the EtherChannel is unable to ping this address for the number of times
specified in the Number of Retries field, and in the intervals specified in the Retry Timeout
field, the EtherChannel switches adapters.
Number of Retries: Enter the number of ping response failures that are allowed before
the EtherChannel switches adapters. The default is three. This field is optional and valid
only if you set an Internet Address to Ping.
Retry Timeout: Enter the number of seconds between the EtherChannel's ping attempts to
the Internet Address to Ping. The default is one second. This field is optional and valid only if
you have set an Internet Address to Ping.
4. Press Enter after changing the desired fields to create the EtherChannel. Configure IP
over the newly created EtherChannel device by typing smitty chinet at the command
line.
5. Select your new EtherChannel interface from the list. Fill in all of the required fields and
press Enter.
Note: The methods described allow for encrypted traffic between the cluster nodes without
the need to enter a password or passphrase. This means that even though the traffic is
encrypted, a malicious user who gains access to one of the cluster nodes has access to all
cluster nodes.
Important: Be careful when changing the server keys on the nodes in the cluster. Doing
things in the wrong order might prevent you from logging on to the systems (especially if
ssh is the only way to access the systems over the network).
a. Then, edit the file, adding the remaining nodes in the cluster. See Example B-3.
b. As seen in Example B-3, there are two nodes in the cluster, alamo1 and alamo2. Both
nodes are accessible in the 192.168.100.x and 10.1.100.x subnets.
3. Create the authorized_keys.
Because we intend to use the same keys on all nodes, the authorized keys will be just the
root user’s key. See Example B-4.
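Example B-4 is not reproduced in this listing. A sketch of generating one RSA key pair without a passphrase and distributing it follows; the key type and file locations are assumptions consistent with a default OpenSSH setup.
ssh-keygen -t rsa -N "" -f $HOME/.ssh/id_rsa
cat $HOME/.ssh/id_rsa.pub > $HOME/.ssh/authorized_keys
# Copy the same key pair and authorized_keys file to the second node
scp -p $HOME/.ssh/id_rsa $HOME/.ssh/id_rsa.pub $HOME/.ssh/authorized_keys alamo2:.ssh/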
b. After changing the server keys, sshd must be restarted, as shown in Example B-6.
Example B-6 Restart sshd on the nodes that have had the keys changed
root@alamo2:/> stopsrc -s sshd
0513-044 The sshd Subsystem was requested to stop.
root@alamo2:/> startsrc -s sshd
0513-059 The sshd Subsystem has been started. Subsystem PID is 295030.
root@alamo2:/>
e. Finally, verify that everything works. We use the command shown in Example B-9.
---dar0---
2. We create the node and disk descriptor files in /etc/gpfs_config. They are listed in
Example C-2 on page 264.
3. We create the cluster using the gpfs_nodes descriptor file, as shown in Example C-3.
4. We create the Network Shared Disks (NSDs) using the descriptor file gpfs_disks_tb for
tiebreaker disks, gpfs_disk_oradata for the /oradata file system disks, and
gpfs_disk_orabin for the /orabin file system, as shown in Example C-4 on page 265.
5. Add the tiebreaker NSDs to the cluster configuration, as shown in Example C-5.
6. The cluster is now ready, and we start GPFS on all nodes (Example C-6).
7. The /orabin file system is then created, as shown in Example C-7 on page 266. The block
size is set to 256k; the file system is created with a maximum of 80k inodes.
GPFS: 6027-531 The following disks of orabin will be formatted on node dallas1:
nsd05: size 10485760 KB
nsd06: size 10485760 KB
GPFS: 6027-540 Formatting file system ...
Creating Inode File
Creating Allocation Maps
Clearing Inode Allocation Map
Clearing Block Allocation Map
Flushing Allocation Maps
GPFS: 6027-535 Disks up to size 27 GB can be added to this file system.
GPFS: 6027-572 Completed creation of file system /dev/orabin.
mmcrfs: 6027-1371 Propagating the changes to all affected nodes.
This is an asynchronous process.
GPFS: 6027-531 The following disks of oradata will be formatted on node dallas2:
nsd01: size 10485760 KB
nsd02: size 10485760 KB
nsd03: size 10485760 KB
nsd04: size 10485760 KB
GPFS: 6027-540 Formatting file system ...
Creating Inode File
Creating Allocation Maps
Clearing Inode Allocation Map
Clearing Block Allocation Map
Flushing Allocation Maps
GPFS: 6027-535 Disks up to size 70 GB can be added to this file system.
GPFS: 6027-572 Completed creation of file system /dev/oradata.
mmcrfs: 6027-1371 Propagating the changes to all affected nodes.
This is an asynchronous process.
9. Finally, the GPFS cluster is restarted and the file systems are checked, as shown in
Example C-9 on page 267.
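The commands behind Examples C-3 through C-7 are not reproduced in this listing. A consolidated sketch for GPFS 3.1 follows; the descriptor file paths match the /etc/gpfs_config names above, while the tiebreaker NSD names, inode count, block sizes, and mount options are assumptions, so verify the exact syntax against the GPFS V3.1 Administration and Programming Reference.
# Create the cluster from the node descriptor file (dallas1 and dallas2)
mmcrcluster -N /etc/gpfs_config/gpfs_nodes -p dallas1 -s dallas2 \
            -r /usr/bin/ssh -R /usr/bin/scp
# Create the NSDs for the tiebreaker, /oradata, and /orabin disks
mmcrnsd -F /etc/gpfs_config/gpfs_disks_tb
mmcrnsd -F /etc/gpfs_config/gpfs_disk_oradata
mmcrnsd -F /etc/gpfs_config/gpfs_disk_orabin
# Register the tiebreaker disks and start GPFS on all nodes
mmchconfig tiebreakerDisks="tbnsd1;tbnsd2;tbnsd3"
mmstartup -a
# Create the file systems (256 KB block size and 80,000 inodes for /orabin)
mmcrfs /orabin /dev/orabin -F /etc/gpfs_config/gpfs_disk_orabin -B 256K -N 80000 -A yes
mmcrfs /oradata /dev/oradata -F /etc/gpfs_config/gpfs_disk_oradata -A yes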
You are asked whether rootpre.sh has been run, as shown in Figure 2-3 on page 51. Make sure
that you execute Disk1/rootpre/rootpre.sh as the root user on each node. The steps are:
12. As the root user, execute root.sh on each node, as shown in Example 8-2.
root@austin2:/orabin/ora102> root.sh
Running Oracle10 root.sh script...
If there is no problem, you will see “End of Installation” as shown in Figure D-12.
Before you proceed to database creation, we recommend that you apply the recommended
patch set for the database code files.
For details, see also the following Oracle Metalink document: Removing a Node from a 10g
RAC Cluster, Doc ID: Note:269320.1 at:
http://metalink.oracle.com
The publications listed in this section are considered particularly suitable for a more detailed
discussion of the topics covered in this book.
Other publications
These publications are also relevant as further information sources:
GPFS V3.1 Concepts, Planning, and Installation Guide, GA76-0413
GPFS V3.1 Advanced Administration Guide, SC23-5182
GPFS V3.1 Administration and Programming Reference, SA23-2221
GPFS V3.1 Problem Determination Guide, GA76-0415-00
Online resources
These Web sites are also relevant as further information sources:
Oracle articles about changing OCR and CRS voting disks to raw devices:
http://www.oracle.com/technology/pub/articles/vallath-nodes.html
http://www.oracle.com/technology/pub/articles/chan_sing2rac_install.html
Oracle Knowledge Base (Metalink)
http://metalink.oracle.com
G
General Parallel File System (GPFS) 19, 103–107, 163, 165, 195–199
Global Cache Directory (GCD) 111
GPFS 3.1
   release 207
   storage pool 212
GPFS cluster 104–107, 138, 153, 157–158, 160–161, 174, 181, 263, 266
GPFS code 106, 140, 142
GPFS commands
   mmapplypolicy 209–210, 220
   mmbackup 196
   mmchattr 208
   mmchpolicy 210
   mmcommon 214
   mmcrnsd 215
   mmcrsnapshot 196
   mmdelsnapshot 197, 199, 204
   mmdf 208, 219
   mmdsh 214
   mmfsadm dump cfgmgr 175
   mmgetstate 192
   mmlsattr 208, 223
   mmlscluster 214
   mmlsdisk 191
   mmlsfs 208
   mmlsnsnapshot 197
   mmlspolicy 210
   mmrestorefs 197, 199
   mmrestripefile 209
   mmrestripefs 208
   mmsnapdir 197–198
GPFS file
   system 105, 199
   system layer 183
GPFS file system 105–106, 139, 162, 164, 166, 174–175, 178, 181, 183, 196, 199, 201–202
   layer 183
   namespace 209
   subtree 210
I
IBM DSCLI (ID) 188, 190–191, 193
IEEE 802.3ad
   Link Aggregation configuration 256
Information Lifecycle Management (ILM) 206, 212
Inode File 165, 196, 216, 266
interface number 243
Internet Address 231, 256–257
IP address 230–231, 251, 256–257
IP label 160
IP traffic 256–257
J
jumbo frame 255
L
Link Aggregation 229, 231, 255–256
   Adapter 255
   Control Protocol 256
link aggregation
   interface 243
logical volume
   control block 134
   first block 123, 134
   manager 183
logical volume (LV) 123, 134–135, 235, 237, 242, 250
long wave (LW) 187
LPAR 20, 242, 247, 249–250, 252
LUN mapping 179, 183
LUN size 144, 190
LUNs 107, 144–145, 162, 183, 186–188, 191, 194, 234–236
LVM mirroring 227, 237–238
M
Metro Mirror 185
migration 211
mklv 123, 133–135
mmcrcluster 264
S
select instance_number 116, 122
separate tablespaces 212, 218
Shared Ethernet Adapter (SEA) 229, 243, 251, 255
single point 154, 157, 227, 235–236, 238, 242
Single Point of Failure (SPOF) 154, 235
size 10485760 KB 216, 266
size 50M 113
snapshot files 201
spfile 108, 110, 122, 125
SQL commands
   alter database 205
   alter system 206
   create index 219
   create table 218
   create tablespace 217
   drop tablespace 127
   select database_status 206
sqlplus 109, 116
ssh 28
startup nomount 125
Storage Area Network (SAN) 154, 185, 234–236
storage B 187, 190–191, 193
Storage pool
   file placement 208
   GPFS file system 213, 216
   newly created files 209
storage pool 151, 165, 206–207, 210, 212
storage pools 207, 213
storage subsystem 107, 154–155, 158, 162, 185, 187–188, 191–192, 199
   original mapping 192
System Change Number (SCN) 200, 205
System storage pool 207
T
tablespaces 206, 212, 217
tar cfv 204
TCP traffic 256–257
test environment 20, 104, 121, 123, 135, 243, 255
   raw devices 123
third node 153, 158–160, 162–163
   free internal disk 163
   inappropriate error messages 162
   internal SCSI disk 178
   NFS server 169
Thu Sep 20 174, 180
Thu Sep 27 125, 203, 205
Transparent Application Failover (TAF) 106, 183
ttl 233
U
user storage pools 222
V
VIO server 227, 244–246, 249, 252
   Configuring virtual resources 249
Virtual I/O Server
   partition 227
   unavailability 229
virtual I/O server
   external storage 234
Virtual I/O Server (VIOS) 229–232, 243
virtual IO server
   dual HBA 236
Voting disc 144, 148, 166, 171–172, 183
voting disc
   NFS clients 169
Understand clustering layers that help harden your configuration
Learn System p virtualization and advanced GPFS features
Deploy disaster recovery and test scenarios
This IBM Redbooks publication helps you architect, install, tailor, and configure Oracle 10g
RAC on System p™ clusters running AIX®. We describe the architecture and how to design,
plan, and implement a highly available infrastructure for Oracle database using IBM General
Parallel File System (GPFS) V3.1.
This book gives a broad understanding of how Oracle 10g RAC can use and benefit from the
virtualization facilities embedded in the System p architecture and how to efficiently use the
tremendous computing power and availability characteristics of the POWER5 hardware and
the AIX 5L operating system.
This book also helps you design and create a solution to migrate your existing Oracle 9i RAC
configurations to Oracle 10g RAC, simplifying configurations and making them easier to
administer and more resilient to failures.
This book also describes how to quickly deploy Oracle 10g RAC test environments and how
to use some of the built-in disaster recovery capabilities of IBM GPFS and storage
subsystems to make your cluster resilient to various failures.
This book is intended for anyone planning to architect, install, tailor, and configure Oracle
10g RAC on System p™ clusters running AIX and GPFS.
IBM Redbooks are developed by the IBM International Technical Support Organization.
Experts from IBM, Customers and Partners from around the world create timely technical
information based on realistic scenarios. Specific recommendations are provided to help you
implement IT solutions more effectively in your environment.