
Bare Metal Restore Procedure

Introduction
This document describes the procedure to perform a bare metal restore on a Sun Oracle Database Machine database server (Sun Fire X4170) or an HP Oracle Database Machine database server (HP DL360 G5).

Scope
This document describes the procedure to rebuild a compute node that was determined to have been irretrievably damaged and replace it with a new, unconfigured compute node (Bare Metal) that must be re-imaged to the proper specifications. At the end of the procedure, the Bare Metal compute node is synchronized with the surviving members of the cluster with respect to the Exadata and Oracle stack components. This document does not include the diagnostics to determine that hardware has failed, the procedures to mechanically replace the failed hardware, or the replacement of any customer scripting, cron jobs, maintenance actions, or non-Oracle software that may have been placed on the compute nodes of the cluster. After the Bare Metal Restore Procedure is complete, the customer is responsible for restoring such scripting, cron jobs, maintenance actions, and non-Oracle software.

Impact
During the Bare Metal Restore Procedure, the databases running on the database machine remain available. When the failed database server is added back into the cluster, the software is copied from one of the surviving database servers to the replacement database server over the TCP/IP network. Apart from this copy operation, the system resources on the surviving database servers are not significantly impacted.

Conventions
The examples in this document use the standard naming and deployment conventions of other Oracle documentation. If the environment in which the Bare Metal Restore Procedure is to be performed uses other naming and location conventions, then you must modify the commands presented in this document to fit the environment.

The host dm01db01 refers to the replacement database server. Host dm01db02 refers to the surviving database server. In this example, there are only two nodes in the cluster (a Quarter Rack configuration).

[root@replacement] indicates the command should be run on the replacement database server while logged in as the root user.

[oracle@surviving] indicates the command should be run on a surviving database server while logged in as the oracle user.

The environment variable ORACLE_HOME is set to the directory where the database software was installed previously. The environment variable PATH includes the ORACLE_HOME/bin path. Default username and passwords used during the installation of the database machine are used throughout the example syntax. As part of the re-imaging process, a blank 2 GB USB flash drive needs to be obtained onto which the Image Maker will write the bootable imaging software.
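For reference, a minimal sketch of how these environment variables might be set in the oracle user's session before running the commands in this document (the ORACLE_HOME path shown is the standard deployment location used elsewhere in this paper; adjust it if your installation differs):

[oracle@surviving]$ export ORACLE_HOME=/u01/app/oracle/product/11.2.0/dbhome_1
[oracle@surviving]$ export PATH=$ORACLE_HOME/bin:$PATH
[oracle@surviving]$ echo $ORACLE_HOME
/u01/app/oracle/product/11.2.0/dbhome_1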



Overview of the Bare Metal Restore Procedure
The Bare Metal Restore Procedure is broken into three phases: pre-repair, repair, and reconfiguration.

The pre-repair steps are performed as soon as permanent failure is detected. The repair phase is performed either by an HP or Oracle engineer or, in the case of a customer-replaceable unit (CRU), by you. The reconfiguration steps are performed after the failed database server is repaired.

Table 1 shows time estimates for performing each step in the pre-repair and reconfiguration phases. Each step provides a link to the section that provides more information.
TABLE 1. TIME DURATION FOR THE PRE-RESTORE AND RECONFIGURATION PHASES

STEP                                                              DURATION (IN MINUTES)

PRE-REPAIR STEPS
Open Oracle Support request                                       15
Remove failed database server from cluster                        30 to 60
Prepare the USB flash drive for imaging                           30

RECONFIGURATION STEPS
Image replacement database server                                 30 to 60
Configure replacement database server                             30
Prepare replacement database server for the cluster               60
Apply Exadata patch bundles to replacement database server        30 to 60
Clone Oracle Grid Infrastructure to replacement database server   30 to 60
Clone Oracle Database home to replacement database server         30 to 60

Restore Procedure Steps


Open Oracle Support Request
As part of diagnosing the failed database server, you must open an SR with Oracle Support. The support engineer will identify the failed component and send a replacement part. Additionally, the Oracle support engineer will provide a link to the computeImageMaker that you use to re-image the database server after the failed components have been fixed.



Remove the Failed Database Server from the Cluster
Perform the steps in this section on a surviving database server; you can therefore perform them before the hardware servicing has been completed. The following steps are for example purposes only and are based on the chapter about adding and deleting Oracle RAC nodes on Linux and UNIX systems in the Oracle Real Application Clusters Administration and Deployment Guide for Oracle Database 11g Release 2 (11.2). If you are running Oracle Database 11g Release 1 (11.1), then review the documentation for that release of the database.

Note: The steps to delete the instances associated with the failed database server (dropping the undo tablespace and redo log threads associated with the instance) are not required. These objects reside on the storage cells and are not affected, and the same node will be added back to the cluster using the same ORACLE_SIDs, so there is no need to remove the instance. The same applies to configuration files such as the spfile and to resources stored in the OCR.
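As an optional check (not part of the original procedure), you can confirm that the instance registration remains in the OCR. The database name dbm and the output shown below are hypothetical examples:

[oracle@surviving]$ srvctl config database -d dbm | grep -i instances
Database instances: dbm1,dbm2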

1.

Disable the listener that runs on the failed database server:


[oracle@surviving]$ srvctl disable listener -n dm01db01
[oracle@surviving]$ srvctl stop listener -n dm01db01
PRCC-1017 : LISTENER was already stopped on dm01db01

2.

Delete the Oracle Home from the Oracle inventory:


[oracle@surviving]$ cd ${ORACLE_HOME}/oui/bin
[oracle@surviving]$ ./runInstaller -updateNodeList ORACLE_HOME=/u01/app/oracle/product/11.2.0/dbhome_1 "CLUSTER_NODES=dm01db02"
Starting Oracle Universal Installer...
Checking swap space: must be greater than 500 MB.   Actual 16383 MB    Passed
The inventory pointer is located at /etc/oraInst.loc
The inventory is located at /u01/app/oraInventory
'UpdateNodeList' was successful.

3.

Verify that the failed database server is unpinned:


[oracle@surviving]$ olsnodes -s -t
dm01db01        Inactive        Unpinned
dm01db02        Active          Unpinned

4.

Stop the VIP resource for the failed database server and delete it:
[root@surviving]# srvctl stop vip -i dm01db01-vip
PRCC-1016 : dm01db01-vip.acme.com was already stopped
[root@surviving]# srvctl remove vip -i dm01db01-vip
Please confirm that you intend to remove the VIPs dm01db01-vip (y/[n]) y



5. Delete the node from the cluster:
[root@surviving]# crsctl delete node -n dm01db01
CRS-4661: Node dm01db01 successfully deleted.

6.

Update the Oracle Inventory:


[oracle@surviving]$ cd ${ORACLE_HOME}/oui/bin
[oracle@surviving]$ ./runInstaller -updateNodeList ORACLE_HOME=/u01/app/11.2.0/grid "CLUSTER_NODES=dm01db02" CRS=TRUE
Starting Oracle Universal Installer...
Checking swap space: must be greater than 500 MB.   Actual 16383 MB    Passed
The inventory pointer is located at /etc/oraInst.loc
The inventory is located at /u01/app/oraInventory
'UpdateNodeList' was successful.

7.

Verify the node deletion is successful:


[oracle@surviving]$ cluvfy stage -post nodedel -n dm01db01 -verbose
Performing post-checks for node removal
Checking CRS integrity...
The Oracle clusterware is healthy on node "dm01db02"
CRS integrity check passed
Result: Node removal check passed
Post-check for node removal was successful.

Prepare the USB Flash Drive for Imaging


Identify the correct image

For compute nodes that were originally installed with an older image, identify the correct image to use as follows: run the imagehistory command on one of the healthy nodes and identify the original image, so that the correct Linux kernel will be installed. If the imageinfo command is used instead, it returns the latest patch applied; re-imaging the node with that version may install a newer Linux kernel that differs from the kernel used on the other nodes of the cluster. After re-imaging, apply the latest convenience package included in the latest image applied to the storage cells. Master Note 888828.1 includes the matrix of the different images and their Linux kernels.
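A minimal sketch of this check on a healthy database server; the versions, dates, and the number of entries shown are illustrative only:

[root@surviving]# imagehistory
Version                  : 11.2.1.2.3
Image activation date    : 2010-01-15 11:22:34 -0600
Imaging mode             : fresh
Imaging status           : success

Version                  : 11.2.1.3.1
Image activation date    : 2010-03-02 09:10:11 -0600
Imaging mode             : patch
Imaging status           : success

The first entry (the fresh install) identifies the original image and therefore the Linux kernel in use across the cluster.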



The most common methods to re-image the compute node are using an ISO file image or using a USB flash image.

Using ISO file Image


Oracle Support can provide an ISO file image that will be attached as a virtual CDROM through ILOM.

Using USB image


A USB flash drive is used to restore the image to the new database server. The following procedure describes how to prepare the USB flash drive for use:

1. Insert a blank USB flash drive into a working database server in the cluster.
2. Log in as the root user.
3. Obtain from Oracle Support the tar file computeImageMaker_<Exadata_release>_LINUX.X64_<release_date.platform>.tar.
4. Unpack the file:
   # tar xvf computeImageMaker_<Exadata_release>_LINUX.X64_<release_date.platform>.tar
5. Generate the USB image.

For versions before 11.2.2.2.X:
   a. # cd dl360
   b. # ./makeImageMedia.sh

For versions between 11.2.2.2.X and 11.2.2.3.2:
   Note: To make sure dualboot is forced to no, modify the makeImageMedia.sh file. Search for a line like dualboot= and set it to no, like this: dualboot=no
   a. # cd dl360
   b. # ./makeImageMedia.sh

For versions 11.2.2.3.5 and up:
   a. # cd dl360
   b. # ./makeImageMedia.sh dualboot no



6. Remove the USB flash drive from the surviving node in the cluster.
7. Remove the unzipped dl360 directory and the computeImageMaker file. These two combined require about 2 GB of disk space (a sketch of the cleanup follows).
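A minimal sketch of the cleanup in step 7, assuming the tar file was unpacked in the current directory; substitute the actual file name of the release you downloaded:

[root@surviving]# rm -rf dl360
[root@surviving]# rm computeImageMaker_<Exadata_release>_LINUX.X64_<release_date.platform>.tar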
Prepare Replacement Database Server

This step is performed by the hardware engineer replacing the failed component, if required. If the service processor and motherboard do not need to be replaced, then the information entered at this time should still be present after the engineer has fixed the failed components. However, if the hardware engineer replaces the service processor (SP) or the motherboard, then the hardware engineer must configure ILOM for the Sun Oracle Database Machine, or iLO2 for the HP Oracle Database Machine. The engineer needs the following information, which should be the same on all database servers in the Database Machine. The IP address for the SP can be obtained from DNS. Additionally, if the motherboard is replaced, then the hardware engineer must configure the correct BIOS boot order.

Configure the service processor: this must include the IP address, subnet mask, gateway, NTP server, and time zone (an illustrative ILOM sketch follows below).

Configure the BIOS boot order: this should be the RAID controller first and the USB flash drive next. If the system is unable to boot from the RAID controller, the BIOS falls through to the USB flash drive.
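For the Sun Oracle Database Machine, a minimal sketch of the service processor settings using the ILOM command-line interface. The addresses and time zone are placeholders, and property names can vary between ILOM firmware versions, so treat this as an illustration rather than the definitive procedure:

-> set /SP/network pendingipdiscovery=static pendingipaddress=10.204.74.61 pendingipnetmask=255.255.252.0 pendingipgateway=10.204.72.1 commitpending=true
-> set /SP/clock timezone=America/Chicago
-> set /SP/clients/ntp/server/1 address=10.204.74.2
-> set /SP/clock usentpserver=enabled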

Image Replacement Database Server


Imaging the database server can be implemented either using the ISO file image or the USB flash drive, generated in the previous step. This section describes the imaging procedure for each source:

Image Replacement Database Server using the USB flash drive


1. Insert the USB flash drive prepared in the previous step into the USB port on the replacement database server.
2. Log in to the console through the service processor, or via the KVM, to monitor the progress (see the console sketch after this list).
3. Power on the database server using either the service processor interface or by physically pressing the server power button.
4. The system boots and should detect the CELLUSBINSTALL media. Allow the system to boot.
   a. The first phase of the imaging process identifies any BIOS or firmware that is out of date, and upgrades the components to the level expected by the ImageMaker. If any components must be upgraded (or downgraded), then the system is automatically rebooted.

   b. The second phase of the imaging process installs the factory image onto the replacement database server. At the end of the imaging process, a message requests that you remove the USB flash drive from the server and then press Enter to power off the server.
5. Remove the USB flash drive from the replacement database server.
6. Press Enter to power off the server.
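One way to monitor the imaging progress (step 2 above) is the ILOM serial console; a minimal sketch, where dm01db01-ilom is a hypothetical ILOM hostname:

[user@workstation]$ ssh root@dm01db01-ilom
-> start /SP/console
Are you sure you want to start /SP/console (y/n)? y
Serial console started.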

Image Replacement Database Server using the ISO file image

1. Transfer the ISO file image to the desktop from which the ILOM web interface will be used for the re-image process.



2. Log in to the ILOM via the web interface and enable the remote console.
3. Attach the ISO image to the CD-ROM.
4. Connect to the ILOM via the web interface. Go to the Remote Control tab, then the Host Control tab. From Next Boot Device, select CDROM. The next time the server is rebooted, it will boot from the attached ISO image. This setting is valid for one boot only; after that, the default BIOS boot order settings remain in effect.
5. Reboot the server and let the process pick up the ISO image and start the re-image process.
6. The system boots and should detect the ISO image media. Allow the system to boot.
   a. The first phase of the imaging process identifies any BIOS or firmware that is out of date, and upgrades the components to the level expected by the ImageMaker. If any components must be upgraded (or downgraded), then the system is automatically rebooted.
   b. The second phase of the imaging process installs the factory image onto the replacement database server. At the end of the imaging process, a message requests that you remove the USB flash drive from the server and then press Enter to power off the server.

7. Remove the USB flash drive from the replacement database server.
8. Press Enter to power off the server.

Configure Replacement Database Server


The database server at present does not have any hostnames, IP, DNS, or NTP settings. This section describes how this information is configured in the process called Configuring Oracle Exadata. The following list shows the information that you will be asked to supply:

1. Name servers
2. Time zone (for example, America/Chicago)
3. NTP servers
4. IP address information for the Management Network
5. IP address information for the Client Access Network
6. IP address information for the InfiniBand Network
7. The canonical host name
8. The default gateway



This information should be the same on all database servers in the Database Machine, and the IP addresses can be obtained from DNS. Additionally, when the database machine was installed, a document should have been provided that included all of this information. To begin the configuration process, power on the replacement database server. When the system boots, it automatically runs the Configuring Oracle Exadata routine and prompts you for the information above. After all information is entered, the system prompts you to confirm the settings and then completes the boot process.

Note: If the database server does not utilize all network interfaces, then the configuration process stops, warns that some network interfaces are disconnected, and prompts you whether you want to retry the discovery phase again. Answer yes or no, as appropriate.

Note: If bonding is used for the Client Access Network, then it is set up in the default active-passive mode at this time.
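As an optional check (not part of the original procedure), you can confirm the bonding mode of the client access bond once the server is up. The bond name below is an assumption (bond1 on the image generation used in this example; later images use bondeth0), so adjust it to match your system:

[root@replacement]# grep "Bonding Mode" /proc/net/bonding/bond1
Bonding Mode: fault-tolerance (active-backup)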

Prepare Replacement Database Server for the Cluster


During the initial installation of the Database Machine, certain files were modified by the installation process known as OneCommand. Perform the following tasks to ensure that the changes made to the system by OneCommand are reflected on the replacement database server.

1. Using the files on a surviving database server for reference, copy or merge the contents of the following files (a sketch of the copy commands follows this list):

Copy the /etc/security/limits.conf file.
Merge the contents of /etc/hosts.
Copy the /etc/oracle/cell/network-config/cellinit.ora file and update the IP address to reflect the IP address of the bond0 interface on the replacement database server.
Copy the /etc/oracle/cell/network-config/cellip.ora file. The content of the cellip.ora file should be the same on all database servers.
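A minimal sketch of the copy step, run as root on the replacement server and assuming root SSH access to the surviving server dm01db02. Remember to merge /etc/hosts by hand rather than overwriting it, and to edit cellinit.ora afterwards:

[root@replacement]# mkdir -p /etc/oracle/cell/network-config
[root@replacement]# scp dm01db02:/etc/security/limits.conf /etc/security/limits.conf
[root@replacement]# scp dm01db02:/etc/oracle/cell/network-config/cellip.ora /etc/oracle/cell/network-config/
[root@replacement]# scp dm01db02:/etc/oracle/cell/network-config/cellinit.ora /etc/oracle/cell/network-config/
[root@replacement]# vi /etc/oracle/cell/network-config/cellinit.ora    # update the IP to this node's bond0 address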

2. Set up the oracle user on the replacement database server.

a) Add the group (or groups) for the Oracle software owner (typically, the owner is oracle). On the surviving node, obtain the current group information:
[root@surviving]# id oracle
uid=1000(oracle) gid=1001(oinstall) groups=1001(oinstall),1002(dba),1003(oper),1004(asmdba)

On the replacement node, use the groupadd command to add the group information:
[root@replacement]# groupadd -g 1001 oinstall
[root@replacement]# groupadd -g 1002 dba
[root@replacement]# groupadd -g 1003 oper
[root@replacement]# groupadd -g 1004 asmdba

b) Add the user (or users) for the Oracle environment (typically, this is oracle). On the surviving node, obtain the current user information:
[root@surviving]# id oracle



uid=1000(oracle) gid=1001(oinstall) groups=1001(oinstall),1002(dba),1003(oper),1004(asmdba)
[root@surviving]# finger oracle
Login: oracle                           Name: (null)
Directory: /home/oracle                 Shell: /bin/bash
Never logged in.
No mail.
No Plan.

On the replacement node, add user information:


[root@replacement]# useradd -u 1000 -g 1001 -G 1001,1002,1003,1004 -m -d /home/oracle -s /bin/bash oracle

c) Set the password for the Oracle software owner (the password is typically configured during deployment to be the oracle user):
[root@replacement]# passwd oracle
Changing password for user oracle.
New UNIX password:
Retype new UNIX password:
passwd: all authentication tokens updated successfully.

d) Create the ORACLE_BASE and Grid Infrastructure directories such as /u01/app/oracle and /u01/app/11.2.0/grid, as follows:
[root@replacement]# mkdir -p /u01/app/oracle
[root@replacement]# mkdir -p /u01/app/11.2.0/grid
[root@replacement]# chown -R oracle:oinstall /u01/app

e) Change the ownership of the cellip.ora and cellinit.ora files. The owner is typically oracle:dba:
[root@replacement]# chown -R oracle:dba /etc/oracle/cell/network-config

f) Set up SSH within the oracle user account.
   i. Log in to the oracle account:
[root@replacement]# su - oracle

   ii. Create a dcli group file listing the nodes in the Oracle cluster (see the sketch after the setssh.sh command below).
   iii. Run the setup SSH script (this assumes the oracle password on all servers in the dbs_group list is set to welcome):

[oracle@replacement]$ /opt/oracle.SupportTools/onecommand/setssh.sh -s -u oracle -p welcome -n N -h dbs_group .........................
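A minimal sketch of the dcli group file referenced in step ii, listing the database servers in this example cluster. The location /home/oracle/dbs_group is an assumption; place the file wherever your dcli and setssh.sh invocations expect it:

[oracle@replacement]$ cat > /home/oracle/dbs_group <<EOF
dm01db01
dm01db02
EOF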

g) Verify ssh equivalency has been set up. Login to the oracle account and verify the database server using the dcli command:
[root@replacement]# su - oracle
[oracle@replacement]$ dcli -g dbs_group -l oracle date
dm01db01: Wed Mar 10 17:21:33 CST 2010
dm01db02: Wed Mar 10 17:21:34 CST 2010



h) Set up or copy any custom login scripts from a surviving database server:
[oracle@surviving]$ scp .bash* oracle@dm01db01:.

Apply Exadata Patch Bundles to Replacement Database Server


Oracle periodically releases Exadata patch bundles that need to be applied to the Database Machine. These patch bundles might require steps to be performed on the database servers and on the Exadata Storage Servers. If an Exadata patch bundle has been applied to the database servers that upgrades the Exadata version past the one provided by the computeImageMaker file used in the Prepare the USB Flash Drive section, then apply that Exadata patch to the replacement database server at this time.

Prior to Exadata release 11.2.1.2.3, the database servers did not maintain version history information. Log in to an Exadata cell and run the "imageinfo -ver" command. If the command shows a version different from the version used by the ImageMaker file, then an Exadata patch has been applied to the Database Machine and should be applied to the replacement node. Starting with release 11.2.1.2.3, the imagehistory command exists on the database server; compare its output on the replacement database server and on a surviving database server to determine whether an Exadata patch bundle still needs to be applied to the replacement database server.
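A minimal sketch of the check on a storage cell; dm01cel01 is a hypothetical cell name and the version string is illustrative:

[root@dm01cel01]# imageinfo -ver
11.2.1.3.1

Here the cell reports 11.2.1.3.1; if the computeImageMaker file used to re-image the node was, say, 11.2.1.2.3, the matching Exadata patch bundle would still need to be applied to the replacement database server.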

Clone Oracle Grid Infrastructure to the Replacement Database Server


The following steps are for example purposes only and the information is based on the discussion in Chapter 4 Adding a Cluster Node on Linux and UNIX Systems in the Oracle Clusterware Administration and Deployment Guide 11g Release 2 (11.2). If you are running Oracle Database 11g Release 1 (11.1), then review the documentation for that release of the database. 1. Verify the hardware and operating system installations with the Cluster Verification Utility (CVU):
[oracle@surviving]$ cluvfy stage -post hwos -n dm01db01,dm01db02 -verbose

At the end of the report, you should see the text Post-check for hardware and operating system setup was successful.

2. Verify peer compatibility:
[oracle@surviving]$ cluvfy comp peer -refnode dm01db02 -n dm01db01 -orainv oinstall -osdba dba | grep -B 3 -A 2 mismatched
Compatibility check: Available memory [reference node: dm01db02]
  Node Name     Status                    Ref. node status          Comment
  ------------  ------------------------  ------------------------  ----------
  dm01db01      31.02GB (3.2527572E7KB)   29.26GB (3.0681252E7KB)   mismatched
Available memory check failed

Compatibility check: Free disk space for "/tmp" [reference node: dm01db02]
  Node Name     Status                    Ref. node status          Comment
  ------------  ------------------------  ------------------------  ----------
  dm01db01      55.52GB (5.8217472E7KB)   51.82GB (5.4340608E7KB)   mismatched
Free disk space check failed

If the only components that failed are related to physical memory, swap space, and disk space, then it is safe to continue.




3. Perform requisite checks for node addition:
[oracle@surviving]$ cluvfy stage -pre nodeadd -n dm01db01 -fixup -fixupdir /home/oracle/fixup.d

If the only component that fails is related to swap space, then it is safe to continue.

4. Add the replacement database server into the cluster:
[oracle@surviving]$ cd /u01/app/11.2.0/grid/oui/bin/
[oracle@surviving]$ ./addNode.sh -silent "CLUSTER_NEW_NODES={dm01db01}" "CLUSTER_NEW_VIRTUAL_HOSTNAMES={dm01db01-vip}"

This initiates the OUI to copy the clusterware software to the replacement database server.
WARNING: A new inventory has been created on one or more nodes in this session. However, it has not yet been registered as the central inventory of this system. To register the new inventory please run the script at '/u01/app/oraInventory/orainstRoot.sh' with root privileges on nodes 'dm01db01'. If you do not register the inventory, you may not be able to update or patch the products you installed.

The following configuration scripts need to be executed as the "root" user in each cluster node:
/u01/app/oraInventory/orainstRoot.sh  #On nodes dm01db01
/u01/app/11.2.0/grid/root.sh          #On nodes dm01db01

To execute the configuration scripts:
a) Open a terminal window.
b) Log in as root.
c) Run the scripts on each cluster node.

After the scripts are finished, you should see the following informational messages:
The Cluster Node Addition of /u01/app/11.2.0/grid was successful. Please check '/tmp/silentInstall.log' for more details.

5.

Run the orainstRoot.sh and root.sh scripts for the replacement database server:
[root@replacement]# /u01/app/oraInventory/orainstRoot.sh
Creating the Oracle inventory pointer file (/etc/oraInst.loc)
Changing permissions of /u01/app/oraInventory.
Adding read,write permissions for group.
Removing read,write,execute permissions for world.
Changing groupname of /u01/app/oraInventory to oinstall.
The execution of the script is complete.

[root@replacement]# /u01/app/11.2.0/grid/root.sh
Check /u01/app/11.2.0/grid/install/root_dm01db01.acme.com_2010-03-10_17-5915.log for the output of root script

The output file created above will report that the LISTENER resource on the replaced database server failed to start. This is the expected output.

PRCR-1013 : Failed to start resource ora.LISTENER.lsnr
PRCR-1064 : Failed to start resource ora.LISTENER.lsnr on node dm01db01
CRS-2662: Resource 'ora.LISTENER.lsnr' is disabled on server 'dm01db01'
start listener on node=dm01db01 ... failed




6. Reenable the listener resource that was stopped and disabled in the Remove the Failed Database Server from the Cluster section earlier in this white paper.
[root@replacement]# /u01/app/11.2.0/grid/bin/srvctl enable listener -l LISTENER -n dm01db01
[root@replacement]# /u01/app/11.2.0/grid/bin/srvctl start listener -l LISTENER -n dm01db01

Clone Oracle Database Homes to Replacement Database Server


The following steps are for example purposes only and are based on Chapter 9, Adding Oracle RAC to Nodes with Oracle Clusterware Installed, in the Oracle Real Application Clusters Administration and Deployment Guide 11g Release 2 (11.2). If you use Oracle Database 11g Release 1 (11.1), then review the documentation for that release.

1. Add the RDBMS ORACLE_HOME on the replacement database server:
[oracle@surviving]$ cd /u01/app/oracle/product/11.2.0/dbhome_1/oui/bin/
[oracle@surviving]$ ./addNode.sh -silent "CLUSTER_NEW_NODES={dm01db01}"

These commands initiate the OUI (Oracle Universal Installer) to copy the Oracle Database software to the replacement database server. However, to complete the installation, you must run the root scripts on the replacement database server after the command completes.
WARNING: The following configuration scripts need to be executed as the root user in each cluster node.
/u01/app/oracle/product/11.2.0/dbhome_1/root.sh  #On nodes dm01db01

To execute the configuration scripts:
a) Open a terminal window.
b) Log in as root.
c) Run the scripts on each cluster node.

After the scripts are finished, you should see the following informational messages:
The Cluster Node Addition of /u01/app/oracle/product/11.2.0/dbhome_1 was successful. Please check '/tmp/silentInstall.log' for more details.

2.

Run the following scripts on the replacement database server:


[root@replacement]# /u01/app/oracle/product/11.2.0/dbhome_1/root.sh
Check /u01/app/oracle/product/11.2.0/dbhome_1/install/root_dm01db01.acme.com_2010-03-10_18-27-16.log for the output of root script

3.

Validate the initialization parameter files. Verify that the init<SID>.ora file under $ORACLE_HOME/dbs references the spfile in ASM shared storage. Also review the password file that is copied under $ORACLE_HOME/dbs during the add-node operation; it needs to be renamed to orapw<SID>.
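A minimal sketch of these checks, using a hypothetical database named dbm with instance dbm1 on the replacement node; your SID, disk group, and file names will differ:

[oracle@replacement]$ cat $ORACLE_HOME/dbs/initdbm1.ora
SPFILE='+DATA/dbm/spfiledbm.ora'

[oracle@replacement]$ cd $ORACLE_HOME/dbs
[oracle@replacement]$ ls orapw*
orapwdbm
[oracle@replacement]$ mv orapwdbm orapwdbm1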

