Introduction
This document describes the procedure to restore a Sun Oracle Database Machine database server (Sun Fire X4170) or an HP Oracle Database Machine database server (HP DL360 G5) from bare metal.
Scope
This document describes the procedure to rebuild a compute node that was determined to have been irretrievably damaged and replace it with a new, unconfigured compute node (Bare Metal) that must be re-imaged to the proper specifications. At the end of the procedure, the Bare Metal compute node is synchronized with the surviving members of the cluster with respect to the Exadata and Oracle stack components. This document does not cover the diagnostics to determine that hardware has failed, the procedures to mechanically replace the failed hardware, or the replacement of any customer scripting, cron jobs, maintenance actions, or non-Oracle software that may have been placed on the compute nodes of the cluster. After the Bare Metal Restore Procedure is complete, the customer is responsible for restoring scripting, cron jobs, maintenance actions, and non-Oracle software.
Impact
During the Bare Metal Restore Procedure, the databases running on the database machine remain available. When the failed database server is added back into the cluster, the software is copied from one of the surviving database servers to the replacement database server over the TCP/IP network. Apart from this copy, the system resources on the surviving database servers are not significantly impacted.
Conventions
The examples in this document use the standard naming and deployment conventions of other Oracle documentation. If the environment in which the Bare Metal Restore Procedure is to be performed uses other naming and location conventions, then you must modify the commands presented in this document to fit the environment.
The host dm01db01 refers to the replacement database server. Host dm01db02 refers to the surviving database server. In this example, there are only two nodes in the cluster (this is a Quarter Rack configuration).
[root@replacement]# indicates the command should be run as the root user on the replacement database server, while [oracle@surviving]$ indicates the command should be run on a surviving database server while logged in as the oracle user.
The environment variable ORACLE_HOME is set to the directory where the database software was installed previously. The environment variable PATH includes the ORACLE_HOME/bin path. Default username and passwords used during the installation of the database machine are used throughout the example syntax. As part of the re-imaging process, a blank 2 GB USB flash drive needs to be obtained onto which the Image Maker will write the bootable imaging software.
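For reference, a minimal sketch of these settings, assuming the default deployment location that appears later in this document:

$ export ORACLE_HOME=/u01/app/oracle/product/11.2.0/dbhome_1
$ export PATH=$ORACLE_HOME/bin:$PATH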
The pre-repair steps are performed as soon as permanent failure is detected. The repair phase is performed either by an HP or Oracle engineer or, in the case of a CRU (customer-replaceable unit), by you. The reconfiguration steps are performed after the failed database server is repaired.
Table 1 shows time estimates for performing each step in the pre-repair and reconfiguration phases. Each step provides a link to the section that provides more information.
TABLE 1. TIME ESTIMATES FOR THE PRE-REPAIR AND RECONFIGURATION PHASES

STEP                                                              TIME (MINUTES)

PRE-REPAIR STEPS
Open Oracle Support request                                       15
Remove failed database server from cluster                        30 to 60
Prepare the USB flash drive for imaging                           30

RECONFIGURATION STEPS
Image replacement database server                                 30 to 60
Configure replacement database server                             30
Prepare replacement database server for the cluster               60
Apply Exadata patch bundles to replacement database server        30 to 60
Clone Oracle Grid Infrastructure to replacement database server   30 to 60
Clone Oracle Database home to replacement database server         30 to 60
Stop the VIP resource for the failed database server and delete it:
[root@surviving]# srvctl stop vip -i dm01db01-vip
PRCC-1016 : dm01db01-vip.acme.com was already stopped
[root@surviving]# srvctl remove vip -i dm01db01-vip
Please confirm that you intend to remove the VIPs dm01db01-vip (y/[n]) y
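To confirm the removal took effect, the resource can be queried directly; a hedged sketch (the CRS-2613 message indicates the VIP resource no longer exists):

[root@surviving]# crsctl stat res ora.dm01db01.vip
CRS-2613: Could not find resource 'ora.dm01db01.vip'.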
Note: To make sure dualboot is forced to no, modify the makeImageMedia.sh file. Search for a line like dualboot= and set it to no, like this: dualboot=no

For versions prior to 11.2.2.3.5:
a. # cd dl360
b. # ./makeImageMedia.sh

For versions 11.2.2.3.5 and up:
1. # cd dl360
2. # ./makeImageMedia.sh dualboot no
Configure the service processor. This must include the IP address, subnet mask, gateway, NTP server, and time zone.
Configure the BIOS boot order. This should be the RAID controller first and the USB flash drive next. Because the unimaged system cannot boot from the RAID controller, the BIOS falls through to the USB flash drive.
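On the Sun Fire server, the service processor settings can be applied through the ILOM command-line interface. The following is a sketch only, with placeholder addresses and ILOM 3.0 property names; the HP server uses iLO, whose syntax differs:

-> set /SP/network pendingipaddress=10.128.1.10 pendingipnetmask=255.255.255.0 pendingipgateway=10.128.1.1 commitpending=true
-> set /SP/clients/ntp/server/1 address=10.128.1.5
-> set /SP/clock timezone=America/Chicago usentpserver=enabled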
b. The second phase of the imaging process installs the factory image onto the replacement database server. At the end of the imaging process, a message requests that you remove the USB flash drive from the server and then press Enter to power off the server.
5. Remove the USB flash drive from the replacement database server.
6. Press Enter to power off the server.
Image Replacement Database Server Using the ISO File Image
1. Transfer the ISO file image to the desktop from which the web ILOM interface will be used for the reimage process.
7. Remove the USB flash drive from the replacement database server.
8. Press Enter to power off the server.
1. Copy or merge the following configuration files from a surviving database server (a copy sketch follows the list):
a) Copy the /etc/security/limits.conf file.
b) Merge the contents of the /etc/hosts file.
c) Copy the /etc/oracle/cell/network-config/cellinit.ora file and update the IP address to reflect the IP address of the bond0 interface on the replacement database server.
d) Copy the /etc/oracle/cell/network-config/cellip.ora file. The content of the cellip.ora file should be the same on all database servers.
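A sketch of the copy steps, assuming root SSH access from the replacement server to the surviving server; /etc/hosts is merged by hand rather than overwritten, and cellinit.ora is edited after the copy:

[root@replacement]# scp dm01db02:/etc/security/limits.conf /etc/security/limits.conf
[root@replacement]# scp dm01db02:/etc/oracle/cell/network-config/cellip.ora /etc/oracle/cell/network-config/
[root@replacement]# scp dm01db02:/etc/oracle/cell/network-config/cellinit.ora /etc/oracle/cell/network-config/
# then edit cellinit.ora so it lists the bond0 IP address of the replacement server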
2. Set up the oracle user on the replacement database server.
a) Add the group (or groups) for the Oracle software owner (typically, the owner is oracle). On the surviving node, obtain the current group information:
[root@surviving]# id oracle
uid=1000(oracle) gid=1001(oinstall) groups=1001(oinstall),1002(dba),1003(oper),1004(asmdba)
On the replacement node, use the groupadd command to add the group information:
[root@replacement]# groupadd -g 1001 oinstall
[root@replacement]# groupadd -g 1002 dba
[root@replacement]# groupadd -g 1003 oper
[root@replacement]# groupadd -g 1004 asmdba
b) Add the user (or users) for the Oracle environment (typically, this is oracle). On the surviving node, obtain the current user information:
[root@surviving]# id oracle
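On the replacement node, recreate the user with matching IDs. A minimal sketch, assuming the uid and group values shown in the id output above:

[root@replacement]# useradd -u 1000 -g oinstall -G oinstall,dba,oper,asmdba -m -s /bin/bash oracle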
c) Set the password for the Oracle software owner (the password is typically configured during deployment to be welcome):
[root@replacement]# passwd oracle
Changing password for user oracle.
New UNIX password:
Retype new UNIX password:
passwd: all authentication tokens updated successfully.
d) Create the ORACLE_BASE and Grid Infrastructure directories such as /u01/app/oracle and /u01/app/11.2.0/grid, as follows:
[root@replacement]# mkdir -p /u01/app/oracle
[root@replacement]# mkdir -p /u01/app/11.2.0/grid
[root@replacement]# chown -R oracle:oinstall /u01/app
e) Change the ownership on the cellip.ora and cellinit.ora files. This is typically oracle:dba
[root@replacement]# chown -R oracle:dba /etc/oracle/cell/network-config
f) Set up SSH within the oracle user account.
i. Log in to the oracle account:
[root@replacement]# su - oracle
ii. Create a dcli group file listing the nodes in the Oracle cluster.
iii. Run the setup ssh script (this assumes the oracle password on all servers in the dbs_group list is set to welcome). A dcli-based sketch follows.
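The setup script itself is not reproduced in this excerpt. One hedged alternative uses dcli's -k option, which pushes the current user's SSH key to every host listed in the group file:

[oracle@replacement]$ vi ~/dbs_group    # one hostname per line: dm01db01, dm01db02
[oracle@replacement]$ dcli -g ~/dbs_group -l oracle -k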
g) Verify ssh equivalency has been set up. Log in to the oracle account and verify that each database server responds using the dcli command:
[root@replacement]# su - oracle
[oracle@replacement]$ dcli -g dbs_group -l oracle date
dm01db01: Wed Mar 10 17:21:33 CST 2010
dm01db02: Wed Mar 10 17:21:34 CST 2010
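The command that produces the report referenced below is not shown in this excerpt; it is presumably the Cluster Verification Utility hardware and operating system post-check, along these lines:

[oracle@surviving]$ cluvfy stage -post hwos -n dm01db01,dm01db02 -verbose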
At the end of the report, you should see the text "Post-check for hardware and operating system setup was successful."

2. Verify peer compatibility:
[oracle@surviving]$ cluvfy comp peer -refnode dm01db02 -n dm01db01 -orainv oinstall -osdba dba | grep -B 3 -A 2 mismatched

Compatibility check: Available memory [reference node: dm01db02]
Node Name     Status                    Ref. node status          Comment
------------  ------------------------  ------------------------  ----------
dm01db01      31.02GB (3.2527572E7KB)   29.26GB (3.0681252E7KB)   mismatched
Available memory check failed

Compatibility check: Free disk space for "/tmp" [reference node: dm01db02]
Node Name     Status                    Ref. node status          Comment
------------  ------------------------  ------------------------  ----------
dm01db01      55.52GB (5.8217472E7KB)   51.82GB (5.4340608E7KB)   mismatched
If the only components that failed are related to physical memory, swap space, and disk space, then it is safe to continue.
If the only component that fails is related to swap space, then it is safe to continue.

4. Add the replacement database server into the cluster:
[oracle@surviving]$ cd /u01/app/11.2.0/grid/oui/bin/
[oracle@surviving]$ ./addNode.sh -silent "CLUSTER_NEW_NODES={dm01db01}" "CLUSTER_NEW_VIRTUAL_HOSTNAMES={dm01db01-vip}"
This initiates the OUI to copy the clusterware software to the replacement database server.
WARNING: A new inventory has been created on one or more nodes in this session. However, it has not yet been registered as the central inventory of this system. To register the new inventory please run the script at '/u01/app/oraInventory/orainstRoot.sh' with root privileges on nodes 'dm01db01'. If you do not register the inventory, you may not be able to update or patch the products you installed.

The following configuration scripts need to be executed as the "root" user in each cluster node:
/u01/app/oraInventory/orainstRoot.sh #On nodes dm01db01
/u01/app/11.2.0/grid/root.sh #On nodes dm01db01
To execute the configuration scripts:
a) Open a terminal window.
b) Log in as root.
c) Run the scripts on each cluster node.

After the scripts are finished, you should see the following informational messages:
The Cluster Node Addition of /u01/app/11.2.0/grid was successful. Please check '/tmp/silentInstall.log' for more details.
5. Run the orainstRoot.sh and root.sh scripts for the replacement database server:
[root@replacement]# /u01/app/oraInventory/orainstRoot.sh
Creating the Oracle inventory pointer file (/etc/oraInst.loc)
Changing permissions of /u01/app/oraInventory.
Adding read,write permissions for group.
Removing read,write,execute permissions for world.
Changing groupname of /u01/app/oraInventory to oinstall.
The execution of the script is complete.

[root@replacement]# /u01/app/11.2.0/grid/root.sh
Check /u01/app/11.2.0/grid/install/root_dm01db01.acme.com_2010-03-10_17-59-15.log for the output of root script

The output file created above will report that the LISTENER resource on the replaced database server failed to start. This is the expected output.

PRCR-1013 : Failed to start resource ora.LISTENER.lsnr
PRCR-1064 : Failed to start resource ora.LISTENER.lsnr on node dm01db01
CRS-2662: Resource 'ora.LISTENER.lsnr' is disabled on server 'dm01db01'
start listener on node=dm01db01 ... failed
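The listener failure is expected because the resource was presumably disabled when the failed node was removed from the cluster; a later step re-enables it. A sketch of that step, assuming the default listener name LISTENER:

[oracle@surviving]$ srvctl enable listener -l LISTENER -n dm01db01
[oracle@surviving]$ srvctl start listener -l LISTENER -n dm01db01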
These commands initiate the OUI (Oracle Universal Installer) to copy the Oracle Database software to the replacement database server. However, to complete the installation, you must run the root scripts on the replacement database server after the command completes.
WARNING: The following configuration scripts need to be executed as the "root" user in each cluster node:
/u01/app/oracle/product/11.2.0/dbhome_1/root.sh #On nodes dm01db01

To execute the configuration scripts:
a) Open a terminal window.
b) Log in as root.
c) Run the scripts on each cluster node.
After the scripts are finished, you should see the following informational messages:
The Cluster Node Addition of /u01/app/oracle/product/11.2.0/dbhome_1 was successful. Please check '/tmp/silentInstall.log' for more details.
2. Validate the initialization parameter files. Verify that the init<SID>.ora file under $ORACLE_HOME/dbs references the spfile in the ASM shared storage.
3. Review the password file, which is copied under $ORACLE_HOME/dbs during the addNode process; it needs to be renamed to orapw<SID>. A sketch of both checks follows.
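A quick way to perform both checks, using dbm1 as a hypothetical instance SID and +DATA as a hypothetical disk group (the actual names depend on the deployment):

[oracle@replacement]$ cat $ORACLE_HOME/dbs/initdbm1.ora
SPFILE='+DATA/dbm/spfiledbm.ora'
[oracle@replacement]$ ls $ORACLE_HOME/dbs/orapw*
# rename the copied password file to carry the instance SID (source name here is hypothetical)
[oracle@replacement]$ mv $ORACLE_HOME/dbs/orapwdbm $ORACLE_HOME/dbs/orapwdbm1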