
Deploying Oracle Database

11gR2 for High Availability


Deep Dive
Alex Gorbachev
Las Vegas, 22-Apr-2012 (updated 9-May-2012)
Alex Gorbachev

CTO, The Pythian Group


Blogger

OakTable Network member


Oracle ACE Director
BattleAgainstAnyGuess.com

President, Oracle RAC SIG

2 2009/2010 Pythian
Why Companies Trust Pythian
Recognized Leader:
Global industry-leader in remote database administration services and consulting for Oracle,
Oracle Applications, MySQL and SQL Server
Work with over 150 multinational companies such as Western Union, Fox Interactive Media, and
MDS Inc. to help manage their complex IT deployments

Expertise:
One of the world's largest concentrations of dedicated, full-time DBA expertise.

Global Reach & Scalability:


24/7/365 global remote support for DBA and consulting, systems administration, special projects
or emergency response

38 2011 Pythian

- Successful growing business for more than 10 years


- Served many customers with complex requirements/infrastructure just like yours.
- Operate globally for 24 x 7 always awake services
4 2012 Pythian

Apply at hr@pythian.com
Deep Dive Agenda

What does HA mean?


HA principles
Components of HA system (h/w, s/w, processes, people)
Automatic Storage Management
DataGuard and other replication solutions

RAC, RAC One Node, cold failover


Host virtualization
Licensing cheat sheet

5 2012 Pythian
What Does HA Mean?

6
Availability
is the
proportion of time a system
is in a functioning condition
http://en.wikipedia.org/wiki/Availability

7 2012 Pythian
Availability is the
proportion of time a system
is in a functioning condition

uptime - system is functional


downtime - otherwise

A(t) = Pr [ system is functional ]

8 2012 Pythian
Functional?
100% operational?

10% operational?

somewhat operational?

pingable?

most critical functionality is available?

9 2012 Pythian
Example: DaaS - Database as a Service

What is functional?
Instance up?
Can connect? faster than 1 sec?
Can select data?
Can insert a row?
Can commit? faster than 100ms?
Get my minimal CPU capacity?
Get my minimal I/O capacity?

10 2012 Pythian
Example: Monitoring System

What is functional?

System is functional if a monitored target failure is detected
within 1 minute

System is functional if it provides assurance that there are no
monitored target failures

11 2012 Pythian
Database Availability as Infrastructure SLA

SLA users are database applications / clients


Use cases
DaaS, IaaS and consolidation deployments
Inability or unwillingness to integrate with application SLAs and
design for holistic availability
SLAs defined as uptime percentage
Measured most frequently by sampling with a probe
Manually tracked
Degraded mode - limited capacity or functionality
Service outage window threshold (e.g., count only outages
longer than 5 minutes)

12 2012 Pythian
Database Availability as Application SLA

SLA users are application users


Use cases
Holistic application high availability design
SLA defined as percentage of successful transactions
Example: 99.9% of transactions completed within 10 seconds and 99%
within 1 second, over a rolling or calendar month
SLA defined as data freshness for ETL-like workload
Measured often from end user perspective (application logs)
No degraded mode other than in the SLA definition
e.g., different SLAs per transaction type to allow for response time variability

13 2012 Pythian
What's first?

Application SLA
or
Infrastructure SLA

Image courtesy of http://bp2.blogger.com/_aOANamthC7U/SCLiz3HPYuI/AAAAAAAAAB0/OQ7De4aWoEw/s320/Chicken_or_Egg.jpg

14 2012 Pythian
Infrastructure SLA as Proxy for Application SLA

99.9% of transactions should be completed within 10 seconds

If average transaction rate is constant 24x7...


1 month = 30 days = 720 hours = 43,200 minutes
0.1% of transaction failures means a cumulative outage of 43.2
minutes per month

If 90% of traffic is from 9am to 9pm


Peak-time = 21,600 minutes, Off-peak = 21,600 minutes
0.1% of transaction failures is 24 minutes of peak-time downtime or
3.6 hours of off-peak downtime, or some combination (see the sketch below)
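
A quick sanity check of the arithmetic above; a minimal SQL sketch with the slide's assumptions (99.9% SLA, 30-day month, 90% of traffic between 9am and 9pm) hard-coded:

-- downtime budget implied by a 99.9% transaction SLA
select 43200 * 0.001             as flat_downtime_min,        -- 43.2 minutes/month at a constant rate
       0.001 * 21600 / 0.9       as peak_only_downtime_min,   -- 24 minutes if all downtime hits peak hours
       0.001 * 21600 / 0.1 / 60  as offpeak_only_downtime_hrs -- 3.6 hours if all downtime is off-peak
from dual;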

15 2012 Pythian
Alex's timeless (?) definition of High Availability

High Availability is when a system meets its SLAs

16 2012 Pythian

Roger Magoulas defined Big Data as the amount of data that becomes challenging for an
organization to manage, store and process.
Service Availability vs Data Availability

Data Availability is often called Recoverability to distinguish
it from service availability

Recoverability defines acceptable data loss as the Recovery Point
Objective (RPO)

HA designers often need to balance between data
recoverability and service availability

17 2012 Pythian
HA Principles

18
KISS - Keep It Simple Stupid

The central enemy of reliability is complexity


-- Geer et al. (CyberInsecurity: The Cost of Monopoly)

Simplicity is prerequisite for reliability


-- Edsger W. Dijkstra (How do we tell truths that might hurt?)

Complexity is the
enemy of
availability!
19 2012 Pythian

Dr. Daniel Geer, Chief Technology Officer and co-founder of @stake, was fired by @stake for co-authoring CyberInsecurity: The Cost of Monopoly. @stake is a supplier to
Microsoft.
KISS - endorsed by Oracle

Oracle Documentation: High Availability Overview 11.2 (2.2.5
Manageability Goal)
A manageability goal is more subjective than either the RPO or the RTO. It results
from an objective evaluation of the skill sets and management resources available
in an organization, and the degree to which the organization can successfully
manage all elements of a high availability architecture. Just as RPO and RTO
measure an organization's tolerance for downtime or data loss, your
manageability goal measures the organization's tolerance
for complexity in the IT environment. When less
complexity is a requirement, simpler methods of achieving
high availability are preferred over methods that may be
more complex to manage, even if the latter could attain more
aggressive RTO and RPO objectives. Understanding manageability goals helps
organizations differentiate between what is possible and what is practical to
implement.

20 2012 Pythian

No SPOF (Single Point Of Failure)

21 2012 Pythian
Active vs Passive Redundancy

Passive redundancy
Enough disk capacity to sustain disk failures
Enough aggregated network capacity to sustain NIC failure
Active redundancy
Relocate services on RAC node failure
DataGuard failover
Active/passive multipathing configuration

22 2012 Pythian
Cost balance

23 2012 Pythian
RPO - Recovery Point Objective

RTO - Recovery Time Objective

24 2012 Pythian
SLAs + failure probability + $$ => RPO/RTO

Failure            RTO      RPO
NIC failure        1 sec    0
Server failure     1 min    0
SAN failure        15 min   5 min
Site failure       1 day    1 hour
Clusterware bug    1 hour   0
Clusterware bug    5 min    10 sec

25 2012 Pythian
How to Design for High Availability?

Option 1 - design to balance the cost and the risk of violating SLAs
Requires careful planning and close collaboration with the business

Option 2 - HA by design
This is what most people do

26 2012 Pythian
Manageability vs Fault Tolerance

Planned downtime                      Unplanned downtime
Patching                              Software failures (bugs)
Adding/removing capacity              Hardware failures
Application version releases          Human mistakes
Platform migrations                   Unexpected change impact
Hardware upgrades
DR validation

27 2012 Pythian
Maintenance windows

Option 1 - count as downtime towards SLA

Option 2 - count as uptime and explicitly specify
maintenance window rules in SLAs
Example 1 - a 4-hour maintenance window can be scheduled with 48 hours notice
on Sundays, up to twice per month
Example 2 - maintenance windows are 15 minutes every Saturday at 9am UTC
and 2 hours every last Saturday of the quarter at the same time
Example 3 - one 15-minute maintenance window can be scheduled every
quarter, agreed with customers one year in advance

28 2012 Pythian
Availability must be
monitored!

29 2012 Pythian
Components of HA System

30
Hardware Components Relative Failure Rates

Average rating of hardware component failures - rated from 1 to 7
(less likely to fail -> more likely to fail; chart scale roughly 2.00 to 6.00)

Hard disks
Network cards
Servers (except disks and network cards)
Network switches
Storage arrays
Cables
SAN switches

Based on a public survey of 51 respondents from the Oracle community (Pythian, LinkedIn RAC SIG, Twitter)

31 2012 Pythian
Fault Tolerance in Enterprise Servers

Redundant components
Power Supplies
Cooling

Disks

Network cards
Proactive monitoring / failure anticipation / remote
diagnostics
Enterprise vendor support with call home and proactive services
Environmental conditions - cool and clean
Enterprise grade servers and components
Server CPUs, ECC server RAM, enterprise HDDs and SSDs

32 2012 Pythian
Software Components

Operating Systems
Storage multipathing drivers
Network redundancy drivers
Oracle Database software
Oracle ASM

Oracle Grid Infrastructure (Clusterware)


Oracle DataGuard and other replication software
Monitoring software
Application database clients (OCI, JDBC, ODP.NET)

33 2012 Pythian
Importance of the Processes for High Availability
change control

monitoring of the relevant components

requirements and procurement

operations

avoidance of network failures

avoidance of internal application failures

avoidance of external services that fail

physical environment

data redundancy
network redundancy
technical solution of backup
process solution of backup
resilient client/server solutions
physical location
infrastructure redundancy
storage architecture redundancy

Survey question: "How large a share of currently unavailable enterprise IT systems
would you guess would be available if best practice factor X had been present?"
(chart scale: 0% to 30%)

Ulrik Franke, Pontus Johnson, Johan König, Liv Marcks von Würtemberg: Availability of enterprise IT systems - an expert-
based Bayesian model, Proc. Fourth International Workshop on Software Quality and Maintainability (WSQM 2010), Madrid

34 2012 Pythian
Time-Cost-Quality Triangle

time

cost

quality

35 2012 Pythian
HA Team Triangle

availability / down time

team
quality

processes
24x7
36 2012 Pythian
High Availability with Oracle Engineered Systems

Standard, well-tested configurations

Thousands of customers are running
*exactly* the same configurations
No issues with component compatibility

37 2012 Pythian
ASM and Storage Reliability

38
ASM Striping

Extents and Allocation Units


Coarse striping & Fine striping
Random striping -> equal distribution
You cannot disable ASM striping

(diagram: extents of file1 and file2 striped across the disks of the diskgroup)

39 2012 Pythian
ASM Rebalancing

asm_power_limit - rebalancing speed


Async rebalancing in 11.2.0.2
No auto-magic re-layout based on performance
You can force rebalancing for a diskgroup (sketch below)
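
For reference, a minimal sketch of forcing a rebalance from the ASM instance (the diskgroup name DATA is illustrative):

-- rebalance diskgroup DATA with a higher power; progress is visible in V$ASM_OPERATION
ALTER DISKGROUP DATA REBALANCE POWER 8;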

40 2012 Pythian
ASM Mirroring

Primary extent & secondary extent (mirrored copy)


Write is done to all extent copies
Read is done always from primary extent by default

41 2012 Pythian
ASM Failure Groups

A group of disks that can fail at once

Mirror extents between failure groups
Beware of space provisioning

(diagram: one failure group per SAN - SAN 1 and SAN 2)

42 2012 Pythian
ASM Redundancy Levels

Per diskgroup or per file


DG type external redundancy
per file - unprotected only
DG type normal redundancy
per file - unprotected, two-way or three-way
DG type high redundancy
per file - three-way only (creation sketch below)
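
As an illustration of failure groups and redundancy, a hedged sketch of creating a normal-redundancy diskgroup mirrored across two SANs (disk paths and names are made up):

CREATE DISKGROUP DATA NORMAL REDUNDANCY
  FAILGROUP san1 DISK '/dev/mapper/san1_lun1', '/dev/mapper/san1_lun2'
  FAILGROUP san2 DISK '/dev/mapper/san2_lun1', '/dev/mapper/san2_lun2';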

43 2012 Pythian
Read IO Failures

Can re-read a mirror block since 7.3 (initially done for Veritas)
The same happens with ASM - attempt to re-read from another
mirror
If successful, the repair is done for all mirrors
H/w RAID mirroring doesn't give that flexibility
the re-read is done by the database blindly, hoping for the best
Read failures in the database are handled depending on
context
ASM can recover not only from media failure errors but also
from corruptions (bad checksum or wrong SCN)

44 2012 Pythian
Read IO Failure Remapping

Read from the secondary extent is performed

Write back to the *same* place is attempted
The disk might do its own block reallocation
If the write to the same location fails, the extent is relocated
and the original AU is marked as unusable
If the second write fails, the disk is set OFFLINE like on any
other write failure
The fix occurs only if the reading process can lock that extent

REMAP command - detects only read media failures,
not block corruption

45 2012 Pythian
Silent Corruptions of Secondary Extents

Reads are first attempted from the primary extent

The secondary extent is accessed only if the primary read fails
preferred_read_failure_groups can cause the same
issue when some mirror extents are rarely read
Automatic remap is done when a failure is detected
The ASMCMD REMAP command forces a read of an extent, so if a read
media error is produced, remapping happens
REMAP is best used alongside disk scrubbing features
REMAP doesn't detect mirror inconsistencies or logical block
corruptions!
AMDU utility - Google for "Luca Canali AMDU"

46 2012 Pythian

ASM Recovery from Read and Write I/O Errors

Read errors can be the result of a loss of access to the entire disk or media corruptions on an otherwise healthy disk. ASM tries to recover from read errors on corrupted sectors
on a disk. When a read error by the database or ASM triggers the ASM instance to attempt bad block remapping, ASM reads a good copy of the extent and copies it to the disk that
had the read error.

If the write to the same location succeeds, then the underlying allocation unit (sector) is deemed healthy. This might be because the underlying disk did its own bad block
reallocation.

If the write fails, ASM attempts to write the extent to a new allocation unit on the same disk. If this write succeeds, the original allocation unit is marked as unusable. If the
write fails, the disk is taken offline.

One unique benefit of ASM-based mirroring is that the database instance is aware of the mirroring. For many types of logical corruptions such as a bad checksum or incorrect
System Change Number (SCN), the database instance proceeds through the mirror side looking for valid content and proceeds without errors. If the process in the database that
encountered the read is in a position to obtain the appropriate locks to ensure data consistency, it writes the correct data to all mirror sides.

When encountering a write error, a database instance sends the ASM instance a disk offline message.

If the database can successfully complete a write to at least one extent copy and receive acknowledgment of the offline disk from ASM, the write is considered successful.

If the write to all mirror sides fails, the database takes the appropriate actions in response to a write error, such as taking the tablespace offline.

When the ASM instance receives a write error message from a database instance, or when an ASM instance encounters a write error itself, the ASM instance attempts to take the disk
offline. ASM consults the Partner Status Table (PST) to see whether any of the disk's partners are offline. If too many partners are already offline, ASM forces the dismounting of
the disk group. Otherwise, ASM takes the disk offline.

The ASMCMD remap command was introduced to address situations where a range of bad sectors exists on a disk and must be corrected before ASM or database I/O. For
information on the remap command, see "remap Command".
Partial Writes

ASM doesn't use DRL (Dirty Region Logging)

So what happens if one mirror is written and the second mirror write didn't
complete during a crash?
The Oracle database has the ability to recover corrupted blocks
The Oracle database reads both mirrors if corruption is detected,
and if one of the mirrors is good, the repair happens

47 2012 Pythian
Disk Failure

A disk is taken offline on a write error - not on a read error

On a disk failure, ASM reads the headers of the disks in the FG
The whole FG is dismounted rather than individual disks
Multiple FG failures -> dismount of the DG
ASM also probes partners when a disk fails, trying to identify a
failure group pathology
If a disk is lost while one of its partners is offline, the DG is dismounted
A read failure from the disk header -> ASM takes the disk offline

48 2012 Pythian
What is MTBF?

Mean Time Between Failures (practically MTTF)

MTBF is the inverse of the failure rate during useful life
Thus MTBF is an indicator of the failure rate but does not predict useful
lifetime
Assuming failures have an exponential distribution:
AFR = 1 - exp(-t / MTBF)
Annualized Failure Rate (AFR) is calculated for time t = 1 year
MTBF of 1,600,000 hours is 182+ years
Doesn't mean the drive will likely work for 182 years!
AFR = 1 - exp(-24 * 365 / 1,600,000) = 0.546% (checked below)
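
The AFR number above can be reproduced with a one-liner (a sketch assuming the exponential model on the slide):

-- AFR for MTBF = 1,600,000 hours over t = 1 year (8,760 hours)
select 1 - exp(-(24*365) / 1600000) as afr from dual;   -- ~0.00546 = 0.546%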

49 2012 Pythian
Bathtub Failure Rate Curve

50 2012 Pythian
Real AFR & Utilization (ref Google study)

3 months infant mortality: 10% AFR

Assuming an exponential distribution, the hourly failure rate = 0.001%

Poisson process: exponentially distributed time between failures
51 2012 Pythian
Hardware Mirroring RAID1 and 2nd Disk Failure

Data loss happens only if the 2nd failed disk is
the mirror of the 1st failed disk.
Window of vulnerability = 5 hours
(time to replace a failed disk)
The chance that the 2nd mirror fails in the
next 5 hours is 0.005%, resulting in data
loss (based on Google's data of 10% AFR)

52 2012 Pythian
ASM Mirroring and 2nd Disk Failure

What if the mirrored extent is placed
randomly on other disks?
... any 2nd disk failure
would result in data loss
Think Exadata - 168
disks and one fails...
then a 2nd failure of any
of 156 disks will result
in data loss
5h window of
vulnerability
Chance of data loss =
0.777%

53 2012 Pythian

5 hrs FR = 1 - (1 - 0.00001)^5 = 0.00005 (0.005%)
Pr(DL) = 1 - (1 - 0.00005)^156 = 0.00777 (0.777%)
(see the sketch below)
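
The speaker-note formulas can be reproduced directly (a sketch using Google's ~10% AFR, i.e. ~0.001% hourly failure rate, and the 5-hour window):

-- failure rate of one disk over 5 hours, then probability that any of the
-- 156 candidate disks fails in that window (data loss)
select 1 - power(1 - 0.00001, 5)                       as fr_5h_one_disk,  -- ~0.005%
       1 - power(1 - (1 - power(1 - 0.00001, 5)), 156) as pr_data_loss     -- ~0.777%
from dual;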
ASM Partnering

Each disk is assigned
several partner disks
(8 partners in large diskgroups: 11.2.0.1+)
_asm_partner_target_disk_part

Now only one of the 8 partner disk failures
would cause data loss

For a manufacturer's MTBF of 1,600,000 hours (AFR = 0.546%):
the data loss chance is 0.002% after the first failure,
during the 5-hour recovery time

For Google's 10% AFR: only a 0.04%
chance of data loss during the 5-hour
window of vulnerability after the
first disk failure

Assumption: disk failures follow a Poisson process!
(recomputed in the sketch below)
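
A hedged check of the partnering numbers, approximating the hourly rate as AFR/8760 (which appears to be what the slide does):

-- probability that one of the 8 partner disks fails during the 5-hour window
select 1 - power(1 - 5 * 0.00546 / 8760, 8) as pr_dl_mfr_afr,    -- ~0.002% (AFR 0.546%)
       1 - power(1 - 5 * 0.10 / 8760, 8)    as pr_dl_google_afr  -- ~0.046% (slide rounds to 0.04%)
from dual;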

54 2012 Pythian

Disk Failures in Real Life - a non-Poisson Process

Failures are correlated in time


Manufacturing defects affecting a batch
Environmental (temperature)
Operational (power surge)
Software bugs
Non-exponential distribution in time

55 2012 Pythian
Consider a group of n disks all coming from the same production batch.
We will consider two distinct failure processes:
1. Each disk will be subject to independent failures that will be
exponentially distributed with rate λ; these independent failures are
the ones that are normally considered in reliability studies.
2. The whole batch will be subject to the unpredictable manifestation
of a common defect. This event will be exponentially distributed
with rate λ' << λ. It will not result in the immediate failure of any
disk but will accelerate disk failures and make them happen at a rate
λ'' >> λ.

56 2012 Pythian

batch-correlated failures
Probability of Surviving a Data Loss After a Failure

P_surv = exp(-n λ T_R)
n - # of partners for a failed ASM disk (8)
λ - rate of failure
T_R - window of vulnerability (5 hours)

(normal redundancy)
57 2012 Pythian
Random Failure
(No Global Batch Defect Manifestations)

P_surv = exp(-n λ T_R)
λ - normal rate of failure (= 1/MTBF)

MTBF = 1,000,000 hours:
P_surv = 99.996%
Google's AFR of 10%:
P_surv = 99.954% (normal redundancy)
58 2012 Pythian
After a Failure Caused by a Global Defect

P_surv = exp(-n λ'' T_R)
λ'' - accelerated rate of failure

λ'' is one failure per week:
P_surv = 78.813%
λ'' is one failure per month:
P_surv = 94.596% (normal redundancy)
59 2012 Pythian
After a Failure Caused by a Global Defect

P_surv = (1 + n λ'' T_R) exp(-n λ'' T_R)
λ'' - accelerated rate of failure
T_R - 5 hours
λ'' is one failure per week:
P_surv = 97.58%
λ'' is one failure per month:
P_surv = 99.85% (high redundancy)
60 2012 Pythian
After a Failure Caused by a Global Defect

P_surv = (1 + n λ'' T_R) exp(-n λ'' T_R)
λ'' - accelerated rate of failure
T_R - 1 hour
λ'' is one failure per week:
P_surv = 99.89%
λ'' is one failure per month:
P_surv = 99.99% (high redundancy; recomputed in the sketch below)
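
These survival probabilities are straightforward to recompute (a sketch with n = 8 partners, λ'' expressed in failures per hour, and one month taken as 720 hours, matching the slides):

select exp(-8 * (1/168) * 5)                      as norm_week_5h,   -- ~78.8%
       exp(-8 * (1/720) * 5)                      as norm_month_5h,  -- ~94.6%
       (1 + 8*(1/168)*5) * exp(-8 * (1/168) * 5)  as high_week_5h,   -- ~97.6%
       (1 + 8*(1/168)*1) * exp(-8 * (1/168) * 1)  as high_week_1h    -- ~99.9%
from dual;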
61 2012 Pythian
DEMO

-- target # of partners
select ksppstvl
from sys.x$ksppi p, sys.x$ksppcv v
where p.indx=v.indx and
ksppinm='_asm_partner_target_disk_part';

-- disk partners
select d.path, p.number_kfdpartner
from x$kfdpartner p, v$asm_disk_stat d
where p.disk=&disk_no
and p.grp=group_number
and p.number_kfdpartner=d.disk_number;

62 2012 Pythian
Disk Failure Recovery - ASM vs RAID1 mirroring

RAID1 - longer window of vulnerability, but fewer
disks whose failure causes data loss (1)

ASM - shorter window of vulnerability, but more
disks whose failure causes data loss (# of partners = 8)

63 2012 Pythian
High Availability with DataGuard
and other replication solutions

64
Two Kinds of Oracle Database Replication

Physical: database duplication


Physical block changes are replicated
DataGuard physical standby, Dbvisit, archivelogs transfer and apply
Storage replication (like EMC SRDF, Veritas VVR)
Logical: data duplication
Data row changes are replicated
Logical DataGuard standby, Streams, GoldenGate, Shareplex, Dbvisit
Replicate

65 2012 Pythian
Three Kinds of Oracle Database Replication

Physical: database duplication


Physical block changes are replicated
DataGuard physical standby, Dbvisit, archivelogs transfer and apply
Storage replication (like EMC SRDF, Veritas VVR)
Logical: data duplication
Data row changes are replicated
Logical DataGuard standby, Streams, GoldenGate, Shareplex, Dbvisit
Replicate
Distributed systems: application driven replication

66 2012 Pythian
Physical replication is
the most common
replication for HA

67 2012 Pythian
Redo Generation

(diagram: inside the instance, server (shadow) processes fill the redo log buffer; on commit, LGWR writes it to the online redo logs; DBWn writes dirty buffers from the buffer cache to the datafiles at checkpoints)
68 2012 Pythian
Database Recovery

Manual physical standby ("poor
man's DataGuard")
Clone the database to the standby host
Transfer redo logs to the standby as
they are produced on the primary
(normally archivelogs)
Apply redo as it gets
transferred to the standby host
(recovery process)

Usually implemented with
scripts or using third-party
products like Dbvisit (hedged sketch below)
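
A hedged sketch of what such a script typically does; the hosts, paths and exact recovery command are placeholders and vary by implementation:

# ship newly generated archivelogs to the standby host
rsync -a /u01/arch/ORCL/ standby-host:/u01/arch/ORCL/
# apply whatever has arrived (standby is mounted, not open)
ssh standby-host 'echo "recover automatic standby database;" | sqlplus -s / as sysdba'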

69 2012 Pythian
Physical Standby Architecture

(diagram: Primary Instance - redo generation; redo transfer to the Standby Instance, which is permanently recovering / applying redo; Primary Database -> Standby Database)

70 2012 Pythian
ARCH redo transfer
(diagram: on the primary, LGWR writes the redo log buffer to the online redo logs; ARCn archives them and ships the archivelogs to the RFS process on the standby; MRP applies redo from the standby's archivelogs to the datafiles)

71 2012 Pythian
LGWR SYNC redo transfer
(diagram: LGWR hands redo to LNSn, which ships it synchronously to RFS on the standby; RFS writes to standby redo logs; MRP applies the redo; ARCn archives on both sides)

72 2012 Pythian
LGWR ASYNC redo transfer
(diagram: LNSn ships redo asynchronously to RFS on the standby; RFS writes to standby redo logs; MRP applies the redo; ARCn archives on both sides)

73 2012 Pythian
DataGuard Protection Modes

Maximum Protection
ZERO DATA LOSS
Primary stops if redo cannot be shipped to at least one standby
Maximum Performance
redo transfer is asynchronous, so some data loss is likely
Maximum Availability
Like maximum protection mode, but if redo cannot be shipped it
switches to maximum performance mode instead of stopping (sketch below)
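
For reference, a hedged sketch of the statements involved in raising the protection mode (the destination string and DB_UNIQUE_NAME are illustrative):

-- synchronous redo transport to the standby, then raise the protection mode
ALTER SYSTEM SET log_archive_dest_2 =
  'SERVICE=stby_tns SYNC AFFIRM VALID_FOR=(ONLINE_LOGFILES,PRIMARY_ROLE) DB_UNIQUE_NAME=stby';
ALTER DATABASE SET STANDBY DATABASE TO MAXIMIZE AVAILABILITY;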

74 2012 Pythian
DataGuard Broker

Broker automates many operations


Configuration

Switchover

Failover

Enterprise Manager integration


Health-checks

Pre-requisite for Fast Start Failover (FSFO)

75 2012 Pythian
Fast Start Failover and Observer
Site 3: Observer

(diagram: the Observer at Site 3 exchanges heartbeats with both the primary at Site 1 and the standby at Site 2; redo transfer and a direct heartbeat run between the Primary Instance/Database and the Standby Instance/Database)

76 2012 Pythian
Dbvisit - third party DataGuard alternative

77 2012 Pythian
Incrementally updated standby

Clone the primary database from a level 0 backup

Take incremental level 1 differential backups
Enable Block Change Tracking on the primary
Transfer the incremental backups to the standby
Apply the incremental backups using RMAN (sketch below)
Still need to transfer archivelogs

Supports NOLOGGING changes!
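
One hedged way this is commonly scripted with RMAN (tag, staging path and the exact incremental variant are placeholders):

# on the primary
rman target / <<EOF
BACKUP INCREMENTAL LEVEL 1 DATABASE FORMAT '/u01/stage/stby_%U' TAG 'stby_roll';
EOF
# copy /u01/stage/stby_* to the standby host, then on the standby
rman target / <<EOF
CATALOG START WITH '/u01/stage/';
RECOVER DATABASE NOREDO;
EOF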

78 2012 Pythian
Storage layer replication

Storage array or volume manager is the replicator


Database independent
Can't optimize the speed and amount of data copied
ASYNC replication
potential data loss
lower performance overhead
SYNC replication
no data loss
significant performance overhead - adding time to all write operations

79 2012 Pythian
Storage vs Database replication

Storage replication                     | Database replication (DataGuard, etc.)
More bandwidth needed                   | Less bandwidth needed
More performance overhead               | Less performance overhead
(writes must replicate in sequence)     |
Data corruptions often propagated       | Recovery process detects many block corruptions
Cold failover licensing rules apply     | Standby instance must be licensed
  (see http://bit.ly/J1T49S)            |
Simple for NOLOGGING                    | NOLOGGING replication needs a custom solution
Less flexibility                        | More flexible
Can't query replicated data             | Active DataGuard option

80 2012 Pythian
Active DataGuard

(diagram: the primary is open read-write; redo transfer to the standby, which runs redo apply while open read-only)

81 2012 Pythian
GoldenGate and Streams?

More complex to setup and maintain than physical standby


Best features in manageability
Application releases
Database patching
Database upgrades
Platform migrations

82 2012 Pythian
Application level replication can be simple...

Parallel data warehouses


Two sites are running ETL independently using the same data or
sources
Queries can be served by one or both sites
Almost everything is independent
Except input data
Planned maintenance in rolling fashion
Rolling upgrades
Rolling application releases
Passive redundancy for any failure

83 2012 Pythian
High Availability with Oracle RAC
and other clustering solutions

84
Single Instance Oracle Database
APP

Query/DML/DDL

SERVER

INSTANCE

Memory (SGA, PGA)


Processes (PMON, SMON, LGWR, etc., plus
multiple shadow processes)

Read/write

Datafiles
Controlfiles
redo logs
Database flashback logs, change tracking, etc.

85 2012 Pythian
Single Instance Oracle Database
APP

SERVER

INSTANCE

Database

86 2012 Pythian
Oracle RAC Database
APP

SERVER 1 SERVER 2 SERVER 3

INSTANCE 1 INSTANCE 2 INSTANCE 3

Database

87 2012 Pythian
RAC looks simple. Eh?

88 2012 Pythian
Actually... RAC is very complicated

Must manage multiple redo threads


Must manage multiple undo tablespaces
Must manage global buffer cache
Block mastering concept
Current blocks and consistent images
Global locks (library cache, dictionary cache)
Global ACID
Changes done in one instance must be immediately visible on others
Datafile writes must be synchronized

89 2012 Pythian
Role of Grid Infrastructure
OS OS OS

VIP VIP VIP


Listener Listener Listener
Service Service Service

Instance Instance Instance

ASM ASM ASM


Grid Infrastr. Grid Infrastr. Grid Infrastr.

interconnect
storage access

OCR Voting
disk

Shared storage

90 2012 Pythian

Clusterware is generic with customizations for Oracle resources.


Only Clusterware accesses OCR and VD.
Only DB instances access shared database files.
OCR is accessed by almost every Clusterware component - configuration read from OCR.
VIP is part of Oracle Clusterware.
Emphasize shared access to data!!!
OS OS

Clusterware Clusterware

CSSD CSSD

interconnect
OPROCD OPROCD

91 2012 Pythian

CSSDs cannot talk to each other -> operations are not synchronized -> shared data
access -> corruption
Shoot The Other Node In The Head

(diagram: two nodes, each running OS, Clusterware, CSSD and OPROCD, connected by the interconnect and sharing the voting disk)

92 2012 Pythian

In addition to the network heartbeat (NHB), Oracle introduced the disk heartbeat (DHB).

IO fencing is needed on split brain to avoid the evicted node doing any further IOs.
Oracle doesn't rely on any specific hardware - it needs compatibility with all platforms/
hardware.
OS OS

Clusterware Clusterware

VIP

RACG
EVMD

CRSD

CSSD CSSD

interconnect
OPROCD OPROCD

Voting
disk

93 2012 Pythian
Ask The Other Node To Reboot Itself (c) known quote

(diagram: two nodes, each running OS, Clusterware, CSSD and OPROCD, connected by the interconnect and sharing the voting disk)

94 2012 Pythian

Oracle can't shoot another node without remote control and can't rely on one type of
IO fencing (HBA/SCSI reservations).
What's left - beg the other node: please shoot yourself!
11gR2 Grid Infrastructure:
CSSD attempts graceful
shutdown

95 2012 Pythian

As long as it's able to stop all IO-capable clients


OS OS

Clusterware Clusterware

CSSD Monitor/Agent

CSSD CSSD

interconnect
OPROCD OPROCD

Voting
disk

96 2012 Pythian

What if CSSD is not healthy? It's very possible that it's not a network problem but that CSSD
just doesn't reply for some reason. CSSD Monitor (OCLSOMON) comes onto the scene.
OS

Clusterware

CSSD CSSD

interconnect
OPROCD/CSSD Mon OPROCD

Voting
disk

97 2012 Pythian

Worse yet, the whole node is sick and even OCLSOMON can't function properly - e.g.,
CPU execution is stalled.
OS OS

Clusterware Clusterware

CSSD CSSD
interconnect
OPROCD OPROCD

Voting
disk

98 2012 Pythian

Losing access to the voting disks - CSSD commits suicide.

Why? The cluster must have two communication paths + the VD is the medium for IO fencing.
OS OS

Clusterware Clusterware

CSSD CSSD
interconnect
OPROCD OPROCD

Voting
disk

99 2012 Pythian

All nodes can reboot if the voting disk is lost.


Good time to discuss voting disk redundancy? 1 vs 2 vs 3
11gR2 Grid Infrastructure:
CSSD attempts graceful
shutdown

100 2012 Pythian

As long as it's able to stop all IO-capable clients


OS OS

Clusterware Clusterware

Instance Instance
LMON
member kill

CSSD CSSD

interconnect
OPROCD OPROCD

Eviction by escalation of
a member kill

101 2012 Pythian

oclskd (Oracle Clusterware Kill Daemon)


OS OS

Clusterware Clusterware

CSSD CSSD

interconnect
OPROCD OPROCD

11gR2:
Intelligent Platform Management Interface (IPMI)

Voting disk

102 2012 Pythian


OS OS

Clusterware Clusterware

CSSD CSSD

interconnect
OPROCD OPROCD

Voting

Exadata Fencing disk

103 2012 Pythian


Oracle Cluster Services

104 2012 Pythian


Orders processing Cluster Services example

Business areas
Customer orders (web)
Content display (web cached on app-tier)
Feedback (web)
Orders fulfillment (back-office)
Data extraction batches

105 2012 Pythian


Services: preferred & available

NEW_ORD NEW_ORD NEW_ORD NEW_ORD

CONTENT CONTENT CONTENT CONTENT

FEEDBACK FEEDBACK FEEDBACK FEEDBACK

PROC_ORD PROC_ORD

BATCH BATCH

DB1 DB2 DB3 DB4

106 2012 Pythian

CP based placement

default instances running service - preferred

available can take over on failure

automated failover
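
A hedged sketch of defining such services with preferred and available instances (the database name ORDERS and instance names are made up; service names come from the slide):

# NEW_ORD preferred on instances 1-3, can fail over to instance 4
srvctl add service -d ORDERS -s NEW_ORD -r "ORDERS1,ORDERS2,ORDERS3" -a "ORDERS4"
# PROC_ORD preferred on instance 4 only, can fail over to instance 3
srvctl add service -d ORDERS -s PROC_ORD -r "ORDERS4" -a "ORDERS3"
srvctl start service -d ORDERS -s NEW_ORD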

Cache Fusion overhead


Node affinity

NEW_ORD NEW_ORD

CONTENT CONTENT

FEEDBACK FEEDBACK

PROC_ORD PROC_ORD

BATCH

DB1 DB2 DB3 DB4

107 2012 Pythian

Fits on a single node, each service works with a different subset of data

Minimize RAC overhead - no block shipment over the interconnect
Dynamic re-mastering minimizes the need to communicate over the interconnect
Service failover

NEW_ORD NEW_ORD

CONTENT CONTENT

FEEDBACK FEEDBACK

PROC_ORD PROC_ORD
PROC_ORD

BATCH
BATCH

DB1 DB2 DB3 DB4

108 2012 Pythian

instance failure - service moved automatically


but impact from heavy-weights
solution - stop them... manually? too long!
Client-side connection balancing and connection-
time failover
(diagram: clients connecting through listeners LISTENER_G41..G44 on hosts lh1..lh4 via VIPs lh1-vip..lh4-vip, with three client TNS configurations:
LOAD_BALANCE=OFF, FAILOVER=OFF
LOAD_BALANCE=OFF, FAILOVER=ON
LOAD_BALANCE=ON, FAILOVER=ON)

2012 Pythian

This works with ALL drivers - OCI, ODP.NET, thin/thick JDBC

FAILOVER=ON by default (hedged tnsnames.ora example below)
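
A hedged tnsnames.ora entry matching the third configuration above (VIP host names come from the diagram; the service name is illustrative):

NEW_ORD =
  (DESCRIPTION =
    (ADDRESS_LIST =
      (LOAD_BALANCE = ON)
      (FAILOVER = ON)
      (ADDRESS = (PROTOCOL = TCP)(HOST = lh1-vip)(PORT = 1521))
      (ADDRESS = (PROTOCOL = TCP)(HOST = lh2-vip)(PORT = 1521))
      (ADDRESS = (PROTOCOL = TCP)(HOST = lh3-vip)(PORT = 1521))
      (ADDRESS = (PROTOCOL = TCP)(HOST = lh4-vip)(PORT = 1521))
    )
    (CONNECT_DATA = (SERVICE_NAME = NEW_ORD))
  )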
Remote listener registration (no SCAN)

LISTENER_G41 LISTENER_G42
Local registration
G41.local_listener=LISTENER_G41
G42.local_listener=LISTENER_G42
G43.local_listener=LISTENER_G43
G44.local_listener=LISTENER_G44

Remote registration (instances G41, G42):

*.remote_listener=LISTENERS_G4

lh1 lh2

2012 Pythian

simplify - use tnsnames.ora named descriptors in
init.ora (hedged example below)
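
A hedged example of the tnsnames.ora entries such an init.ora could reference (ports and VIP host names are assumed from the diagram):

LISTENERS_G4 =
  (ADDRESS_LIST =
    (ADDRESS = (PROTOCOL = TCP)(HOST = lh1-vip)(PORT = 1521))
    (ADDRESS = (PROTOCOL = TCP)(HOST = lh2-vip)(PORT = 1521))
  )

LISTENER_G41 =
  (ADDRESS = (PROTOCOL = TCP)(HOST = lh1-vip)(PORT = 1521))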
Server-side CLB

<= 10gR1: the listener picks an instance based on instance load and node load
10gR2: based on service metrics (from AWR)

(diagram: LISTENER_G44 deciding which instance G41..G44 receives the connection)

111 2012 Pythian


Virtual IP

node1-vip node2-vip

Listener Listener

node1 node2

112 2012 Pythian

Compare SSH "Connection refused" vs "Connection timed out"


Failover

Failure detection
Clean up and re-establish connection
Restore session state
Process failure in application logic

2012 Pythian

Barb Lundhilds whitepaper - http://www.oracle.com/technetwork/database/clustering/overview/awm11gr2-130711.pdf


Failure detection

James Morle has an excellent explanation

Conflict: detection speed vs. false positives
VIP is a must for node crash situations
VIP failover time is impacted by CRS timing
Without a VIP, the network delay depends on some TCP settings

2012 Pythian

http://www.scaleabilities.co.uk/wp-content/uploads/downloads/2011/11/RAC_Connection_Management.pdf
Re-connect

No magic
Can be automated to some degree
FMF Fully Manual Failover
STRF Semi-Transparent Reactive Failover
TAF Transparent Application Failover
STPAF Semi-Transparent Pro-Active Failover
FCF Fast Connection Failover

2012 Pythian
Manual reconnect

Reacting upon response failure


Possibly much later at the next DB call
Need to clean up session
Re-establish new connection
Restore session context
Fail transaction in application and possibly re-process it

2012 Pythian
Transparent Application Failover

Semi-Transparent Reactive Failover


Reacting upon response failure
Clean up automated
Re-establish is automatic
Need session context restore TAF callback
Fail the transaction in the application and possibly re-process it
OCI and thick JDBC (OCI-based), not thin JDBC (hedged example below)
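
A hedged CONNECT_DATA fragment enabling TAF in tnsnames.ora (service name, retries and delay are illustrative):

(CONNECT_DATA =
  (SERVICE_NAME = NEW_ORD)
  (FAILOVER_MODE = (TYPE = SELECT)(METHOD = BASIC)(RETRIES = 30)(DELAY = 5))
)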

2012 Pythian
Fast Connection Failover (FCF) the next generation

Semi-Transparent Pro-Active Failover


Server-side technology
Based on FAN (Fast Application Notification)
Integrated with client drivers
Connection pooling is a must

2012 Pythian
Oracle Notification Services (ONS)

Diagram from Barb Lundhild's presentation

2012 Pythian
Diagram from Barb Lundhild's presentation

The ONS daemon implements a simple publish-subscribe messaging mechanism.

Cold Failover
APP

Oracle Grid Infrastructure,
Microsoft Failsafe,
or other clusterware

(diagram: Active and Passive nodes; the VIP and Listener fail over between them; a database instance runs on only one node at a time; both nodes have storage access to the same Database)

Cold failover licensing -
rule of 10 days
120 2012 Pythian


RAC One Node

RAC One Node has the failure tolerance features of cold failover

RAC One Node has the manageability features of RAC
OMotion - active/active instance switchover
RAC One Node is not a scalability solution
RAC is a scalability solution
RAC One Node is cheaper than RAC
A passive RAC One Node server can fit the cold failover database
licensing conditions

121 2012 Pythian


Virtualization for High Availability

122
Server Virtualization HA

On VM failure or host failure, VMs can be automatically
restarted on other hosts
Active failure tolerance measure
VMware and OVM high availability features
Like cold failover
VMware vMotion and OVM Live Migration - move a VM to
another host *online*
Manageability feature - not a failure tolerance feature
Like RAC One Node switchover
Both VMware and OVM require shared storage
VMware vSphere 5 has a replication feature
Careful with licensing (cold failover rules apply)
123 2012 Pythian
What about deploying RAC on VMware or OVM?

RAC scales out                    Virtualization scales in

RAC provides failure tolerance    Virtualization provides failure tolerance

RAC has manageability features    Virtualization has manageability features

RAC is not managed like web
farms - RAC is a much more
static environment
124 2012 Pythian
Licensing Cheat Sheet

125
Selecting Standard Database Editions

RAC is included
limit on number of sockets is per cluster - 2 or 4 sockets
Doesn't support DataGuard or managed standby
Manual physical standby using scripts or third-party products like
Dbvisit
Missing most of online manageability features
Very limited Streams support
GoldenGate works well
Standard Edition One - $5,800 per socket
Standard Edition - $17,500 per socket
No extra options are available

126 2012 Pythian


Selecting Enterprise Database Edition
Enterprise Edition - $47,500 per core*
RAC option - $23,000 per core*
RAC One Node option - $10,000 per core*
DataGuard is included
Must license primary and standby(s)
Active DataGuard option - $10,000 per core*
Must license primary and standby(s)
Lots of online manageability features included
All Streams features are included

GoldenGate option - $17,500 per core*


Active DataGuard included
* Core multipliers apply - http://www.oracle.com/us/corporate/contracts/processor-core-factor-table-070634.pdf

127 2012 Pythian


Cold Failover licensing rules

A standby node license is included in the primary license if

Oracle software is normally not running on the cold failover server
Can run the software on the standby node for up to 10 days per calendar year
Any part of a calendar day is counted as a full day

DataGuard standby doesn't qualify

the Oracle instance is mounted on the standby
Storage replication qualifies
Cold failover qualifies
RAC One Node qualifies - see http://bit.ly/J1T49S

128 2012 Pythian


Q&A

Email me - gorbachev@pythian.com
Read my blog - http://www.pythian.com
Follow me on Twitter - @AlexGorbachev
Join Pythian fan club on Facebook & LinkedIn

129 2011 Pythian
