
Deploying Oracle Database

11gR2 for High Availability


Deep Dive
Alex Gorbachev
Las Vegas, 22-Apr-2012 (updated 9-May-2012)
Alex Gorbachev

CTO, The Pythian Group


Blogger

OakTable Network member


Oracle ACE Director
BattleAgainstAnyGuess.com

President, Oracle RAC SIG

2 2009/2010 Pythian
Why Companies Trust Pythian
Recognized Leader:
Global industry-leader in remote database administration services and consulting for Oracle,
Oracle Applications, MySQL and SQL Server
Work with over 150 multinational companies such as Western Union, Fox Interactive Media, and
MDS Inc. to help manage their complex IT deployments

Expertise:
One of the world's largest concentrations of dedicated, full-time DBA expertise.

Global Reach & Scalability:


24/7/365 global remote support for DBA and consulting, systems administration, special projects
or emergency response

38 2011 Pythian

- Successful growing business for more than 10 years


- Served many customers with complex requirements/infrastructure just like yours.
- Operate globally for 24 x 7 always awake services
4 2012 Pythian

Apply at hr@pythian.com
Deep Dive Agenda

What does HA mean?


HA principles
Components of HA system (h/w, s/w, processes, people)
Automatic Storage Management
DataGuard and other replication solutions

RAC, RAC One Node, cold failover


Host virtualization
Licensing cheat sheet

5 2012 Pythian
What Does HA Mean?

6
Availability
is the
proportion of time a system
is in a functioning condition
http://en.wikipedia.org/wiki/Availability

7 2012 Pythian
Availability is the
proportion of time a system
is in a functioning condition

uptime - system is functional


downtime - otherwise

A(t) = Pr [ system is functional ]

8 2012 Pythian
Functional?
100% operational?

10% operational?

somewhat operational?

pingable?

most critical functionality is available?

9 2012 Pythian
Example: DaaS - Database as a Service

What is functional?
Instance up?
Can connect? faster than 1 sec?
Can select data?
Can insert a row?
Can commit? faster than 100ms?
Get my minimal CPU capacity?
Get my minimal I/O capacity?

10 2012 Pythian
Example: Monitoring System

What is functional?

System is functional if a monitored target failure is detected
within 1 minute

System is functional if it provides assurance that there are no
monitored target failures

11 2012 Pythian
Database Availability as Infrastructure SLA

SLA users are database applications / clients


Use cases
DaaS, IaaS and consolidation deployments
Inability or unwillingness to integrate with application SLAs and
design for holistic availability
SLAs defined as uptime percentage
Measured most frequently by sampling with a probe
Manually tracked
Degraded mode - limited capacity or functionality
Service outage window threshold (e.g., count only outages
longer than 5 minutes)

12 2012 Pythian
Database Availability as Application SLA

SLA users are application users


Use cases
Holistic application high availability design
SLA defined as percentage of successful transactions
Example: 99.9% of transactions completed within 10 seconds and 99%
within 1 second, over a rolling or calendar month
SLA defined as data freshness for ETL-like workload
Measured often from end user perspective (application logs)
No degraded mode other than in the SLA definition
e.g., different SLAs per transaction type to allow for response time variability

13 2012 Pythian
What's first?

Application SLA
or
Infrastructure SLA

Image courtesy of http://bp2.blogger.com/_aOANamthC7U/SCLiz3HPYuI/AAAAAAAAAB0/OQ7De4aWoEw/s320/Chicken_or_Egg.jpg

14 2012 Pythian
Infrastructure SLA as Proxy for Application SLA

99.9% of transactions should be completed within 10 seconds

If average transaction rate is constant 24x7...


1 month = 30 days = 720 hours = 43,200 minutes
0.1% of transaction failures means a cumulative outage of 43.2
minutes per month

If 90% of traffic is from 9am to 9pm


Peak-time = 21,600 minutes, Off-peak = 21,600 minutes
0.1% of transaction failures is 24 minutes of peak-time downtime or
3.6 hours of off-peak downtime, or some combination (see the sketch below)
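
A quick sanity check of the arithmetic above; a minimal SQL sketch with the slide's assumptions (99.9% SLA, 30-day month, 90% of traffic between 9am and 9pm) hard-coded:

-- downtime budget implied by a 99.9% transaction SLA
select 43200 * 0.001             as flat_downtime_min,        -- 43.2 minutes/month at a constant rate
       0.001 * 21600 / 0.9       as peak_only_downtime_min,   -- 24 minutes if all downtime hits peak hours
       0.001 * 21600 / 0.1 / 60  as offpeak_only_downtime_hrs -- 3.6 hours if all downtime is off-peak
from dual;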

15 2012 Pythian
Alex's timeless (?) definition of High Availability

High Availability is when a system meets its SLAs

16 2012 Pythian

Roger Magoulas defined Big Data as the amount of data that becomes challenging for an
organization to manage, store and process.
Service Availability vs Data Availability

Data Availability is often called Recoverability to distinguish
it from service availability

Recoverability defines acceptable data loss as the Recovery Point
Objective (RPO)

HA designers often need to balance between data
recoverability and service availability

17 2012 Pythian
HA Principles

18
KISS - Keep It Simple Stupid

The central enemy of reliability is complexity


-- Geer et al. (CyberInsecurity: The Cost of Monopoly)

Simplicity is prerequisite for reliability


-- Edsger W. Dijkstra (How do we tell truths that might hurt?)

Complexity is the
enemy of
availability!
19 2012 Pythian

Dr. Daniel Geer, Chief Technology Officer and co-founder of @stake, was fired by @stake for co-authoring CyberInsecurity: The Cost of Monopoly. @stake is a supplier to
Microsoft.
KISS - endorsed by Oracle

Oracle Documentation: High Availability Overview 11.2 (2.2.5
Manageability Goal)
A manageability goal is more subjective than either the RPO or the RTO. It results
from an objective evaluation of the skill sets and management resources available
in an organization, and the degree to which the organization can successfully
manage all elements of a high availability architecture. Just as RPO and RTO
measure an organization's tolerance for downtime or data loss, your
manageability goal measures the organization's tolerance
for complexity in the IT environment. When less
complexity is a requirement, simpler methods of achieving
high availability are preferred over methods that may be
more complex to manage, even if the latter could attain more
aggressive RTO and RPO objectives. Understanding manageability goals helps
organizations differentiate between what is possible and what is practical to
implement.

20 2012 Pythian

No SPOF (Single Point Of Failure)

21 2012 Pythian
Active vs Passive Redundancy

Passive redundancy
Enough disk capacity to sustain disk failures
Enough aggregated network capacity to sustain NIC failure
Active redundancy
Relocate services on RAC node failure
DataGuard failover
Active/passive multipathing configuration

22 2012 Pythian
Cost balance

23 2012 Pythian
RPO - Recovery Point Objective

RTO - Recovery Time Objective

24 2012 Pythian
SLAs + failure probability + $$ => RPO/RTO

Failure            RTO      RPO
NIC failure        1 sec    0
Server failure     1 min    0
SAN failure        15 min   5 min
Site failure       1 day    1 hour
Clusterware bug    1 hour   0
Clusterware bug    5 min    10 sec

25 2012 Pythian
How to Design for High Availability?

Option 1 - design to balance the cost and the risk of violating SLAs
Requires careful planning and close collaboration with the business

Option 2 - HA by design
This is what most people do

26 2012 Pythian
Manageability vs Fault Tolerance

Planned downtime                      Unplanned downtime
Patching                              Software failures (bugs)
Adding/removing capacity              Hardware failures
Application version releases          Human mistakes
Platform migrations                   Unexpected change impact
Hardware upgrades
DR validation

27 2012 Pythian
Maintenance windows

Option 1 - count as downtime towards SLA

Option 2 - count as uptime and explicitly specify
maintenance window rules in SLAs
Example 1 - a 4-hour maintenance window can be scheduled with 48 hours notice
on Sundays, up to twice per month
Example 2 - maintenance windows are 15 minutes every Saturday at 9am UTC
and 2 hours every last Saturday of the quarter at the same time
Example 3 - one 15-minute maintenance window can be scheduled every
quarter, agreed with customers one year in advance

28 2012 Pythian
Availability must be
monitored!

29 2012 Pythian
Components of HA System

30
Hardware Components Relative Failure Rates

Average rating of hardware component failures - rated from 1 to 7
(less likely to fail -> more likely to fail; chart scale roughly 2.00 to 6.00)

Hard disks
Network cards
Servers (except disks and network cards)
Network switches
Storage arrays
Cables
SAN switches

Based on a public survey of 51 respondents from the Oracle community (Pythian, LinkedIn RAC SIG, Twitter)

31 2012 Pythian
Fault Tolerance in Enterprise Servers

Redundant components
Power Supplies
Cooling

Disks

Network cards
Proactive monitoring / failure anticipation / remote
diagnostics
Enterprise vendor support with call home and proactive services
Environmental conditions - cool and clean
Enterprise grade servers and components
Server CPUs, ECC server RAM, enterprise HDDs and SSDs

32 2012 Pythian
Software Components

Operating Systems
Storage multipathing drivers
Network redundancy drivers
Oracle Database software
Oracle ASM

Oracle Grid Infrastructure (Clusterware)


Oracle DataGuard and other replication software
Monitoring software
Application database clients (OCI, JDBC, ODP.NET)

33 2012 Pythian
Importance of the Processes for High Availability
change control

monitoring of the relevant components

requirements and procurement

operations

avoidance of network failures

avoidance of internal application failures

avoidance of external services that fail

physical environment

data redundancy
network redundancy
technical solution of backup
process solution of backup
resilient client/server solutions
physical location
infrastructure redundancy
storage architecture redundancy

Survey question: "How large a share of currently unavailable enterprise IT systems
would you guess would be available if best practice factor X had been present?"
(chart scale: 0% to 30%)

Ulrik Franke, Pontus Johnson, Johan König, Liv Marcks von Würtemberg: Availability of enterprise IT systems - an expert-
based Bayesian model, Proc. Fourth International Workshop on Software Quality and Maintainability (WSQM 2010), Madrid

34 2012 Pythian
Time-Cost-Quality Triangle

time

cost

quality

35 2012 Pythian
HA Team Triangle

availability / down time

team
quality

processes
24x7
36 2012 Pythian
High Availability with Oracle Engineered Systems

Standard, well-tested configurations

Thousands of customers are running
*exactly* the same configurations
No issues with component compatibility

37 2012 Pythian
ASM and Storage Reliability

38
ASM Striping

Extents and Allocation Units


Coarse striping & Fine striping
Random striping -> equal distribution
You cannot disable ASM striping

(diagram: extents of file1 and file2 striped across the disks of the diskgroup)

39 2012 Pythian
ASM Rebalancing

asm_power_limit - rebalancing speed


Async rebalancing in 11.2.0.2
No auto-magic re-layout based on performance
You can force rebalancing for a diskgroup (sketch below)
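
For reference, a minimal sketch of forcing a rebalance from the ASM instance (the diskgroup name DATA is illustrative):

-- rebalance diskgroup DATA with a higher power; progress is visible in V$ASM_OPERATION
ALTER DISKGROUP DATA REBALANCE POWER 8;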

40 2012 Pythian
ASM Mirroring

Primary extent & secondary extent (mirrored copy)


Write is done to all extent copies
Read is done always from primary extent by default

41 2012 Pythian
ASM Failure Groups

A group of disks that can fail at once

Mirror extents between failure groups
Beware of space provisioning

(diagram: one failure group per SAN - SAN 1 and SAN 2)

42 2012 Pythian
ASM Redundancy Levels

Per diskgroup or per file


DG type external redundancy
per file - unprotected only
DG type normal redundancy
per file - unprotected, two-way or three-way
DG type high redundancy
per file - three-way only (creation sketch below)
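
As an illustration of failure groups and redundancy, a hedged sketch of creating a normal-redundancy diskgroup mirrored across two SANs (disk paths and names are made up):

CREATE DISKGROUP DATA NORMAL REDUNDANCY
  FAILGROUP san1 DISK '/dev/mapper/san1_lun1', '/dev/mapper/san1_lun2'
  FAILGROUP san2 DISK '/dev/mapper/san2_lun1', '/dev/mapper/san2_lun2';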

43 2012 Pythian
Read IO Failures

Can re-read a mirror block since 7.3 (initially done for Veritas)
The same happens with ASM - attempt to re-read from another
mirror
If successful, the repair is done for all mirrors
H/w RAID mirroring doesn't give that flexibility
the re-read is done by the database blindly, hoping for the best
Read failures in the database are handled depending on
context
ASM can recover not only from media failure errors but also
from corruptions (bad checksum or wrong SCN)

44 2012 Pythian
Read IO Failure Remapping

Read from the secondary extent is performed

Write back to the *same* place is attempted
The disk might do its own block reallocation
If the write to the same location fails, the extent is relocated
and the original AU is marked as unusable
If the second write fails, the disk is set OFFLINE like on any
other write failure
The fix occurs only if the reading process can lock that extent

REMAP command - detects only read media failures,
not block corruption

45 2012 Pythian
Silent Corruptions of Secondary Extents

Reads are first attempted from the primary extent

The secondary extent is accessed only if the primary read fails
preferred_read_failure_groups can cause the same
issue when some mirror extents are rarely read
Automatic remap is done when a failure is detected
The ASMCMD REMAP command forces a read of an extent, so if a read
media error is produced, remapping happens
REMAP is best used alongside disk scrubbing features
REMAP doesn't detect mirror inconsistencies or logical block
corruptions!
AMDU utility - Google for "Luca Canali AMDU"

46 2012 Pythian

ASM Recovery from Read and Write I/O Errors

Read errors can be the result of a loss of access to the entire disk or media corruptions on an otherwise healthy disk. ASM tries to recover from read errors on corrupted sectors
on a disk. When a read error by the database or ASM triggers the ASM instance to attempt bad block remapping, ASM reads a good copy of the extent and copies it to the disk that
had the read error.

If the write to the same location succeeds, then the underlying allocation unit (sector) is deemed healthy. This might be because the underlying disk did its own bad block
reallocation.

If the write fails, ASM attempts to write the extent to a new allocation unit on the same disk. If this write succeeds, the original allocation unit is marked as unusable. If the
write fails, the disk is taken offline.

One unique benefit of ASM-based mirroring is that the database instance is aware of the mirroring. For many types of logical corruptions such as a bad checksum or incorrect
System Change Number (SCN), the database instance proceeds through the mirror side looking for valid content and proceeds without errors. If the process in the database that
encountered the read is in a position to obtain the appropriate locks to ensure data consistency, it writes the correct data to all mirror sides.

When encountering a write error, a database instance sends the ASM instance a disk offline message.

If the database can successfully complete a write to at least one extent copy and receive acknowledgment of the offline disk from ASM, the write is considered successful.

If the write to all mirror sides fails, the database takes the appropriate actions in response to a write error, such as taking the tablespace offline.

When the ASM instance receives a write error message from a database instance, or when an ASM instance encounters a write error itself, the ASM instance attempts to take the disk
offline. ASM consults the Partner Status Table (PST) to see whether any of the disk's partners are offline. If too many partners are already offline, ASM forces the dismounting of
the disk group. Otherwise, ASM takes the disk offline.

The ASMCMD remap command was introduced to address situations where a range of bad sectors exists on a disk and must be corrected before ASM or database I/O. For
information on the remap command, see "remap Command".
Partial Writes

ASM doesn't use DRL (Dirty Region Logging)

So what happens if one mirror is written and the second mirror write didn't
complete during a crash?
The Oracle database has the ability to recover corrupted blocks
The Oracle database reads both mirrors if corruption is detected,
and if one of the mirrors is good, the repair happens

47 2012 Pythian
Disk Failure

A disk is taken offline on a write error - not on a read error

On a disk failure, ASM reads the headers of the disks in the FG
The whole FG is dismounted rather than individual disks
Multiple FG failures -> dismount of the DG
ASM also probes partners when a disk fails, trying to identify a
failure group pathology
If a disk is lost while one of its partners is offline, the DG is dismounted
A read failure from the disk header -> ASM takes the disk offline

48 2012 Pythian
What is MTBF?

Mean Time Between Failures (practically MTTF)

MTBF is the inverse of the failure rate during useful life
Thus MTBF is an indicator of the failure rate but does not predict useful
lifetime
Assuming failures have an exponential distribution:
AFR = 1 - exp(-t / MTBF)
Annualized Failure Rate (AFR) is calculated for time t = 1 year
MTBF of 1,600,000 hours is 182+ years
Doesn't mean the drive will likely work for 182 years!
AFR = 1 - exp(-24 * 365 / 1,600,000) = 0.546% (checked below)
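
The AFR number above can be reproduced with a one-liner (a sketch assuming the exponential model on the slide):

-- AFR for MTBF = 1,600,000 hours over t = 1 year (8,760 hours)
select 1 - exp(-(24*365) / 1600000) as afr from dual;   -- ~0.00546 = 0.546%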

49 2012 Pythian
Bathtub Failure Rate Curve

50 2012 Pythian
Real AFR & Utilization (ref Google study)

3 months infant mortality: 10% AFR

Assuming an exponential distribution, the hourly failure rate = 0.001%

Poisson process: exponentially distributed time between failures
51 2012 Pythian
Hardware Mirroring RAID1 and 2nd Disk Failure

Data loss happens only if the 2nd failed disk is
the mirror of the 1st failed disk.
Window of vulnerability = 5 hours
(time to replace a failed disk)
The chance that the 2nd mirror fails in the
next 5 hours is 0.005%, resulting in data
loss (based on Google's data of 10% AFR)

52 2012 Pythian
ASM Mirroring and 2nd Disk Failure

What if the mirrored extent is placed
randomly on other disks?
... any 2nd disk failure
would result in data loss
Think Exadata - 168
disks and one fails...
then a 2nd failure of any
of 156 disks will result
in data loss
5h window of
vulnerability
Chance of data loss =
0.777%

53 2012 Pythian

5 hrs FR = 1 - (1 - 0.00001)^5 = 0.00005 (0.005%)
Pr(DL) = 1 - (1 - 0.00005)^156 = 0.00777 (0.777%)
(see the sketch below)
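
The speaker-note formulas can be reproduced directly (a sketch using Google's ~10% AFR, i.e. ~0.001% hourly failure rate, and the 5-hour window):

-- failure rate of one disk over 5 hours, then probability that any of the
-- 156 candidate disks fails in that window (data loss)
select 1 - power(1 - 0.00001, 5)                       as fr_5h_one_disk,  -- ~0.005%
       1 - power(1 - (1 - power(1 - 0.00001, 5)), 156) as pr_data_loss     -- ~0.777%
from dual;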
ASM Partnering

Each disk is assigned
several partner disks
(8 partners in large diskgroups: 11.2.0.1+)
_asm_partner_target_disk_part

Now only one of the 8 partner disk failures
would cause data loss

For a manufacturer's MTBF of 1,600,000 hours (AFR = 0.546%):
the data loss chance is 0.002% after the first failure,
during the 5-hour recovery time

For Google's 10% AFR: only a 0.04%
chance of data loss during the 5-hour
window of vulnerability after the
first disk failure

Assumption: disk failures follow a Poisson process!
(recomputed in the sketch below)
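
A hedged check of the partnering numbers, approximating the hourly rate as AFR/8760 (which appears to be what the slide does):

-- probability that one of the 8 partner disks fails during the 5-hour window
select 1 - power(1 - 5 * 0.00546 / 8760, 8) as pr_dl_mfr_afr,    -- ~0.002% (AFR 0.546%)
       1 - power(1 - 5 * 0.10 / 8760, 8)    as pr_dl_google_afr  -- ~0.046% (slide rounds to 0.04%)
from dual;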

54 2012 Pythian

Disk Failures in Real Life - a non-Poisson Process

Failures are correlated in time


Manufacturing defects affecting a batch
Environmental (temperature)
Operational (power surge)
Software bugs
Non-exponential distribution in time

55 2012 Pythian
Consider a group of n disks all coming from the same production batch.
We will consider two distinct failure processes:
1. Each disk will be subject to independent failures that will be
exponentially distributed with rate λ; these independent failures are
the ones that are normally considered in reliability studies.
2. The whole batch will be subject to the unpredictable manifestation
of a common defect. This event will be exponentially distributed
with rate λ' << λ. It will not result in the immediate failure of any
disk but will accelerate disk failures and make them happen at a rate
λ'' >> λ.

56 2012 Pythian

batch-correlated failures
Probability of Surviving a Data Loss After a Failure

P_surv = exp(-n λ T_R)
n - # of partners for a failed ASM disk (8)
λ - rate of failure
T_R - window of vulnerability (5 hours)

(normal redundancy)
57 2012 Pythian
Random Failure
(No Global Batch Defect Manifestations)

P_surv = exp(-n λ T_R)
λ - normal rate of failure (= 1/MTBF)

MTBF = 1,000,000 hours:
P_surv = 99.996%
Google's AFR of 10%:
P_surv = 99.954% (normal redundancy)
58 2012 Pythian
After a Failure Caused by a Global Defect

P_surv = exp(-n λ'' T_R)
λ'' - accelerated rate of failure

λ'' is one failure per week:
P_surv = 78.813%
λ'' is one failure per month:
P_surv = 94.596% (normal redundancy)
59 2012 Pythian
After a Failure Caused by a Global Defect

P_surv = (1 + n λ'' T_R) exp(-n λ'' T_R)
λ'' - accelerated rate of failure
T_R - 5 hours
λ'' is one failure per week:
P_surv = 97.58%
λ'' is one failure per month:
P_surv = 99.85% (high redundancy)
60 2012 Pythian
After a Failure Caused by a Global Defect

P_surv = (1 + n λ'' T_R) exp(-n λ'' T_R)
λ'' - accelerated rate of failure
T_R - 1 hour
λ'' is one failure per week:
P_surv = 99.89%
λ'' is one failure per month:
P_surv = 99.99% (high redundancy; recomputed in the sketch below)
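
These survival probabilities are straightforward to recompute (a sketch with n = 8 partners, λ'' expressed in failures per hour, and one month taken as 720 hours, matching the slides):

select exp(-8 * (1/168) * 5)                      as norm_week_5h,   -- ~78.8%
       exp(-8 * (1/720) * 5)                      as norm_month_5h,  -- ~94.6%
       (1 + 8*(1/168)*5) * exp(-8 * (1/168) * 5)  as high_week_5h,   -- ~97.6%
       (1 + 8*(1/168)*1) * exp(-8 * (1/168) * 1)  as high_week_1h    -- ~99.9%
from dual;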
61 2012 Pythian
DEMO

-- target # of partners
select ksppstvl
from sys.x$ksppi p, sys.x$ksppcv v
where p.indx=v.indx and
ksppinm='_asm_partner_target_disk_part';

-- disk partners
select d.path, p.number_kfdpartner
from x$kfdpartner p, v$asm_disk_stat d
where p.disk=&disk_no
and p.grp=group_number
and p.number_kfdpartner=d.disk_number;

62 2012 Pythian
Disk Failure Recovery - ASM vs RAID1 mirroring

RAID1 - longer window of vulnerability, but fewer
disks whose failure causes data loss (1)

ASM - shorter window of vulnerability, but more
disks whose failure causes data loss (# of partners = 8)

63 2012 Pythian
High Availability with DataGuard
and other replication solutions

64
Two Kinds of Oracle Database Replication

Physical: database duplication


Physical block changes are replicated
DataGuard physical standby, Dbvisit, archivelogs transfer and apply
Storage replication (like EMC SRDF, Veritas VVR)
Logical: data duplication
Data row changes are replicated
Logical DataGuard standby, Streams, GoldenGate, Shareplex, Dbvisit
Replicate

65 2012 Pythian
Three Kinds of Oracle Database Replication

Physical: database duplication


Physical block changes are replicated
DataGuard physical standby, Dbvisit, archivelogs transfer and apply
Storage replication (like EMC SRDF, Veritas VVR)
Logical: data duplication
Data row changes are replicated
Logical DataGuard standby, Streams, GoldenGate, Shareplex, Dbvisit
Replicate
Distributed systems: application driven replication

66 2012 Pythian
Physical replication is
the most common
replication for HA

67 2012 Pythian
Redo Generation

(diagram: inside the instance, server (shadow) processes fill the redo log buffer; on commit, LGWR writes it to the online redo logs; DBWn writes dirty buffers from the buffer cache to the datafiles at checkpoints)
68 2012 Pythian
Database Recovery

Manual physical standby ("poor
man's DataGuard")
Clone the database to the standby host
Transfer redo logs to the standby as
they are produced on the primary
(normally archivelogs)
Apply redo as it gets
transferred to the standby host
(recovery process)

Usually implemented with
scripts or using third-party
products like Dbvisit (hedged sketch below)
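
A hedged sketch of what such a script typically does; the hosts, paths and exact recovery command are placeholders and vary by implementation:

# ship newly generated archivelogs to the standby host
rsync -a /u01/arch/ORCL/ standby-host:/u01/arch/ORCL/
# apply whatever has arrived (standby is mounted, not open)
ssh standby-host 'echo "recover automatic standby database;" | sqlplus -s / as sysdba'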

69 2012 Pythian
Physical Standby Architecture

(diagram: Primary Instance - redo generation; redo transfer to the Standby Instance, which is permanently recovering / applying redo; Primary Database -> Standby Database)

70 2012 Pythian
ARCH redo transfer
(diagram: on the primary, LGWR writes the redo log buffer to the online redo logs; ARCn archives them and ships the archivelogs to the RFS process on the standby; MRP applies redo from the standby's archivelogs to the datafiles)

71 2012 Pythian
LGWR SYNC redo transfer
(diagram: LGWR hands redo to LNSn, which ships it synchronously to RFS on the standby; RFS writes to standby redo logs; MRP applies the redo; ARCn archives on both sides)

72 2012 Pythian
LGWR ASYNC redo transfer
(diagram: LNSn ships redo asynchronously to RFS on the standby; RFS writes to standby redo logs; MRP applies the redo; ARCn archives on both sides)

73 2012 Pythian
DataGuard Protection Modes

Maximum Protection
ZERO DATA LOSS
Primary stops if redo cannot be shipped to at least one standby
Maximum Performance
redo transfer is asynchronous, so some data loss is likely
Maximum Availability
Like maximum protection mode, but if redo cannot be shipped it
switches to maximum performance mode instead of stopping (sketch below)
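
For reference, a hedged sketch of the statements involved in raising the protection mode (the destination string and DB_UNIQUE_NAME are illustrative):

-- synchronous redo transport to the standby, then raise the protection mode
ALTER SYSTEM SET log_archive_dest_2 =
  'SERVICE=stby_tns SYNC AFFIRM VALID_FOR=(ONLINE_LOGFILES,PRIMARY_ROLE) DB_UNIQUE_NAME=stby';
ALTER DATABASE SET STANDBY DATABASE TO MAXIMIZE AVAILABILITY;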

74 2012 Pythian
DataGuard Broker

Broker automates many operations


Configuration

Switchover

Failover

Enterprise Manager integration


Health-checks

Pre-requisite for Fast Start Failover (FSFO)

75 2012 Pythian
Fast Start Failover and Observer
Site 3: Observer

(diagram: the Observer at Site 3 exchanges heartbeats with both the primary at Site 1 and the standby at Site 2; redo transfer and a direct heartbeat run between the Primary Instance/Database and the Standby Instance/Database)

76 2012 Pythian
Dbvisit - third party DataGuard alternative

77 2012 Pythian
Incrementally updated standby

Clone the primary database from a level 0 backup

Take incremental level 1 differential backups
Enable Block Change Tracking on the primary
Transfer the incremental backups to the standby
Apply the incremental backups using RMAN (sketch below)
Still need to transfer archivelogs

Supports NOLOGGING changes!
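
One hedged way this is commonly scripted with RMAN (tag, staging path and the exact incremental variant are placeholders):

# on the primary
rman target / <<EOF
BACKUP INCREMENTAL LEVEL 1 DATABASE FORMAT '/u01/stage/stby_%U' TAG 'stby_roll';
EOF
# copy /u01/stage/stby_* to the standby host, then on the standby
rman target / <<EOF
CATALOG START WITH '/u01/stage/';
RECOVER DATABASE NOREDO;
EOF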

78 2012 Pythian
Storage layer replication

Storage array or volume manager is the replicator


Database independent
Can't optimize the speed and amount of data copied
ASYNC replication
potential data loss
lower performance overhead
SYNC replication
no data loss
significant performance overhead - adding time to all write operations

79 2012 Pythian
Storage vs Database replication

Storage replication                     | Database replication (DataGuard, etc.)
More bandwidth needed                   | Less bandwidth needed
More performance overhead               | Less performance overhead
(writes must replicate in sequence)     |
Data corruptions often propagated       | Recovery process detects many block corruptions
Cold failover licensing rules apply     | Standby instance must be licensed
  (see http://bit.ly/J1T49S)            |
Simple for NOLOGGING                    | NOLOGGING replication needs a custom solution
Less flexibility                        | More flexible
Can't query replicated data             | Active DataGuard option

80 2012 Pythian
Active DataGuard

(diagram: the primary is open read-write; redo transfer to the standby, which runs redo apply while open read-only)

81 2012 Pythian
GoldenGate and Streams?

More complex to setup and maintain than physical standby


Best features in manageability
Application releases
Database patching
Database upgrades
Platform migrations

82 2012 Pythian
Application level replication can be simple...

Parallel data warehouses


Two sites are running ETL independently using the same data or
sources
Queries can be served by one or both sites
Almost everything is independent
Except input data
Planned maintenance in rolling fashion
Rolling upgrades
Rolling application releases
Passive redundancy for any failure

83 2012 Pythian
High Availability with Oracle RAC
and other clustering solutions

84
Single Instance Oracle Database
APP

Query/DML/DDL

SERVER

INSTANCE

Memory (SGA, PGA)


Processes (PMON, SMON, LGWR, etc., plus
multiple shadow processes)

Read/write

Datafiles
Controlfiles
redo logs
Database flashback logs, change tracking, etc.

85 2012 Pythian
Single Instance Oracle Database
APP

SERVER

INSTANCE

Database

86 2012 Pythian
Oracle RAC Database
APP

SERVER 1 SERVER 2 SERVER 3

INSTANCE 1 INSTANCE 2 INSTANCE 3

Database

87 2012 Pythian
RAC looks simple. Eh?

88 2012 Pythian
Actually... RAC is very complicated

Must manage multiple redo threads


Must manage multiple undo tablespaces
Must manage global buffer cache
Block mastering concept
Current blocks and consistent images
Global locks (library cache, dictionary cache)
Global ACID
Changes done in one instance must be immediately visible on others
Datafile writes must be synchronized

89 2012 Pythian
Role of Grid Infrastructure
OS OS OS

VIP VIP VIP


Listener Listener Listener
Service Service Service

Instance Instance Instance

ASM ASM ASM


Grid Infrastr. Grid Infrastr. Grid Infrastr.

interconnect
storage access

OCR Voting
disk

Shared storage

90 2012 Pythian

Clusterware is generic with customizations for Oracle resources.


Only Clusterware accesses OCR and VD.
Only DB instances access shared database files.
OCR is accessed by almost every Clusterware component - configuration read from OCR.
VIP is part of Oracle Clusterware.
Emphasize shared access to data!!!
OS OS

Clusterware Clusterware

CSSD CSSD

interconnect
OPROCD OPROCD

91 2012 Pythian

CSSDs cannot talk to each other -> operations are not synchronized -> shared data
access -> corruption
Shoot The Other Node In The Head

(diagram: two nodes, each running OS, Clusterware, CSSD and OPROCD, connected by the interconnect and sharing the voting disk)

92 2012 Pythian

In addition to the network heartbeat (NHB), Oracle introduced the disk heartbeat (DHB).

IO fencing is needed on split brain to avoid the evicted node doing any further IOs.
Oracle doesn't rely on any specific hardware - it needs compatibility with all platforms/
hardware.
OS OS

Clusterware Clusterware

VIP

RACG
EVMD

CRSD

CSSD CSSD

interconnect
OPROCD OPROCD

Voting
disk

93 2012 Pythian
Ask The Other Node To Reboot Itself (c) known quote

(diagram: two nodes, each running OS, Clusterware, CSSD and OPROCD, connected by the interconnect and sharing the voting disk)

94 2012 Pythian

Oracle can't shoot another node without remote control and can't rely on one type of
IO fencing (HBA/SCSI reservations).
What's left - beg the other node: please shoot yourself!
11gR2 Grid Infrastructure:
CSSD attempts graceful
shutdown

95 2012 Pythian

As long as it's able to stop all IO-capable clients


OS OS

Clusterware Clusterware

CSSD Monitor/Agent

CSSD CSSD

interconnect
OPROCD OPROCD

Voting
disk

96 2012 Pythian

What if CSSD is not healthy? It's very possible that it's not a network problem but that CSSD
just doesn't reply for some reason. CSSD Monitor (OCLSOMON) comes onto the scene.
OS

Clusterware

CSSD CSSD

interconnect
OPROCD/CSSD Mon OPROCD

Voting
disk

97 2012 Pythian

Worse yet, the whole node is sick and even OCLSOMON can't function properly - e.g.,
CPU execution is stalled.
OS OS

Clusterware Clusterware

CSSD CSSD
interconnect
OPROCD OPROCD

Voting
disk

98 2012 Pythian

Losing access to the voting disks - CSSD commits suicide.

Why? The cluster must have two communication paths + the VD is the medium for IO fencing.
OS OS

Clusterware Clusterware

CSSD CSSD
interconnect
OPROCD OPROCD

Voting
disk

99 2012 Pythian

All nodes can reboot if the voting disk is lost.


Good time to discuss voting disk redundancy? 1 vs 2 vs 3
11gR2 Grid Infrastructure:
CSSD attempts graceful
shutdown

100 2012 Pythian

As long as it's able to stop all IO-capable clients


OS OS

Clusterware Clusterware

Instance Instance
LMON
member kill

CSSD CSSD

interconnect
OPROCD OPROCD

Eviction by escalation of
a member kill

101 2012 Pythian

oclskd (Oracle Clusterware Kill Daemon)


OS OS

Clusterware Clusterware

CSSD CSSD

interconnect
OPROCD OPROCD

11gR2:
Intelligent Platform Management Interface (IPMI)

Voting disk

102 2012 Pythian


OS OS

Clusterware Clusterware

CSSD CSSD

interconnect
OPROCD OPROCD

Voting

Exadata Fencing disk

103 2012 Pythian


Oracle Cluster Services

104 2012 Pythian


Orders processing Cluster Services example

Business areas
Customer orders (web)
Content display (web cached on app-tier)
Feedback (web)
Orders fulfillment (back-office)
Data extraction batches

105 2012 Pythian


Services: preferred & available

NEW_ORD NEW_ORD NEW_ORD NEW_ORD

CONTENT CONTENT CONTENT CONTENT

FEEDBACK FEEDBACK FEEDBACK FEEDBACK

PROC_ORD PROC_ORD

BATCH BATCH

DB1 DB2 DB3 DB4

106 2012 Pythian

CP based placement

default instances running service - preferred

available can take over on failure

automated failover
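
A hedged sketch of defining such services with preferred and available instances (the database name ORDERS and instance names are made up; service names come from the slide):

# NEW_ORD preferred on instances 1-3, can fail over to instance 4
srvctl add service -d ORDERS -s NEW_ORD -r "ORDERS1,ORDERS2,ORDERS3" -a "ORDERS4"
# PROC_ORD preferred on instance 4 only, can fail over to instance 3
srvctl add service -d ORDERS -s PROC_ORD -r "ORDERS4" -a "ORDERS3"
srvctl start service -d ORDERS -s NEW_ORD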

Cache Fusion overhead


Node affinity

NEW_ORD NEW_ORD

CONTENT CONTENT

FEEDBACK FEEDBACK

PROC_ORD PROC_ORD

BATCH

DB1 DB2 DB3 DB4

107 2012 Pythian

Fits on a single node, each service works with a different subset of data

Minimize RAC overhead - no block shipment over the interconnect
Dynamic re-mastering minimizes the need to communicate over the interconnect
Service failover

NEW_ORD NEW_ORD

CONTENT CONTENT

FEEDBACK FEEDBACK

PROC_ORD PROC_ORD
PROC_ORD

BATCH
BATCH

DB1 DB2 DB3 DB4

108 2012 Pythian

instance failure - service moved automatically


but impact from heavy-weights
solution - stop them... manually? too long!
Client-side connection balancing and connection-
time failover
(diagram: clients connecting through listeners LISTENER_G41..G44 on hosts lh1..lh4 via VIPs lh1-vip..lh4-vip, with three client TNS configurations:
LOAD_BALANCE=OFF, FAILOVER=OFF
LOAD_BALANCE=OFF, FAILOVER=ON
LOAD_BALANCE=ON, FAILOVER=ON)

2012 Pythian

This works with ALL drivers - OCI, ODP.NET, thin/thick JDBC

FAILOVER=ON by default (hedged tnsnames.ora example below)
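
A hedged tnsnames.ora entry matching the third configuration above (VIP host names come from the diagram; the service name is illustrative):

NEW_ORD =
  (DESCRIPTION =
    (ADDRESS_LIST =
      (LOAD_BALANCE = ON)
      (FAILOVER = ON)
      (ADDRESS = (PROTOCOL = TCP)(HOST = lh1-vip)(PORT = 1521))
      (ADDRESS = (PROTOCOL = TCP)(HOST = lh2-vip)(PORT = 1521))
      (ADDRESS = (PROTOCOL = TCP)(HOST = lh3-vip)(PORT = 1521))
      (ADDRESS = (PROTOCOL = TCP)(HOST = lh4-vip)(PORT = 1521))
    )
    (CONNECT_DATA = (SERVICE_NAME = NEW_ORD))
  )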
Remote listener registration (no SCAN)

LISTENER_G41 LISTENER_G42
Local registration
G41.local_listener=LISTENER_G41
G42.local_listener=LISTENER_G42
G43.local_listener=LISTENER_G43
G44.local_listener=LISTENER_G44

Remote registration (instances G41, G42):

*.remote_listener=LISTENERS_G4

lh1 lh2

2012 Pythian

simplify - use tnsnames.ora named descriptors in
init.ora (hedged example below)
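
A hedged example of the tnsnames.ora entries such an init.ora could reference (ports and VIP host names are assumed from the diagram):

LISTENERS_G4 =
  (ADDRESS_LIST =
    (ADDRESS = (PROTOCOL = TCP)(HOST = lh1-vip)(PORT = 1521))
    (ADDRESS = (PROTOCOL = TCP)(HOST = lh2-vip)(PORT = 1521))
  )

LISTENER_G41 =
  (ADDRESS = (PROTOCOL = TCP)(HOST = lh1-vip)(PORT = 1521))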
Server-side CLB

<= 10gR1: the listener picks an instance based on instance load and node load
10gR2: based on service metrics (from AWR)

(diagram: LISTENER_G44 deciding which instance G41..G44 receives the connection)

111 2012 Pythian


Virtual IP

node1-vip node2-vip

Listener Listener

node1 node2

112 2012 Pythian

Compare SSH "Connection refused" vs "Connection timed out"


Failover

Failure detection
Clean up and re-establish connection
Restore session state
Process failure in application logic

2012 Pythian

Barb Lundhilds whitepaper - http://www.oracle.com/technetwork/database/clustering/overview/awm11gr2-130711.pdf


Failure detection

James Morle has an excellent explanation

Conflict: detection speed vs. false positives
VIP is a must for node crash situations
VIP failover time is impacted by CRS timing
Without a VIP, the network delay depends on some TCP settings

2012 Pythian

http://www.scaleabilities.co.uk/wp-content/uploads/downloads/2011/11/RAC_Connection_Management.pdf
Re-connect

No magic
Can be automated to some degree
FMF Fully Manual Failover
STRF Semi-Transparent Reactive Failover
TAF Transparent Application Failover
STPAF Semi-Transparent Pro-Active Failover
FCF Fast Connection Failover

2012 Pythian
Manual reconnect

Reacting upon response failure


Possibly much later at the next DB call
Need to clean up session
Re-establish new connection
Restore session context
Fail transaction in application and possibly re-process it

2012 Pythian
Transparent Application Failover

Semi-Transparent Reactive Failover


Reacting upon response failure
Clean up automated
Re-establish is automatic
Need session context restore TAF callback
Fail the transaction in the application and possibly re-process it
OCI and thick JDBC (OCI-based), not thin JDBC (hedged example below)
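
A hedged CONNECT_DATA fragment enabling TAF in tnsnames.ora (service name, retries and delay are illustrative):

(CONNECT_DATA =
  (SERVICE_NAME = NEW_ORD)
  (FAILOVER_MODE = (TYPE = SELECT)(METHOD = BASIC)(RETRIES = 30)(DELAY = 5))
)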

2012 Pythian
Fast Connection Failover (FCF) the next generation

Semi-Transparent Pro-Active Failover


Server-side technology
Based on FAN (Fast Application Notification)
Integrated with client drivers
Connection pooling is a must

2012 Pythian
Oracle Notification Services (ONS)

Diagram from Barb Lundhild's presentation

2012 Pythian
Diagram from Barb Lundhild's presentation

The ONS daemon implements a simple publish-subscribe messaging mechanism.

Cold Failover
APP

Oracle Grid Infrastructure,
Microsoft Failsafe,
or other clusterware

(diagram: Active and Passive nodes; the VIP and Listener fail over between them; a database instance runs on only one node at a time; both nodes have storage access to the same Database)

Cold failover licensing -
rule of 10 days
120 2012 Pythian


RAC One Node

RAC One Node has the failure tolerance features of cold failover

RAC One Node has the manageability features of RAC
OMotion - active/active instance switchover
RAC One Node is not a scalability solution
RAC is a scalability solution
RAC One Node is cheaper than RAC
A passive RAC One Node server can fit the cold failover database
licensing conditions

121 2012 Pythian


Virtualization for High Availability

122
Server Virtualization HA

On VM failure or host failure, VMs can be automatically
restarted on other hosts
Active failure tolerance measure
VMware and OVM high availability features
Like cold failover
VMware vMotion and OVM Live Migration - move a VM to
another host *online*
Manageability feature - not a failure tolerance feature
Like RAC One Node switchover
Both VMware and OVM require shared storage
VMware vSphere 5 has a replication feature
Careful with licensing (cold failover rules apply)
123 2012 Pythian
What about deploying RAC on VMware or OVM?

RAC scales out                    Virtualization scales in

RAC provides failure tolerance    Virtualization provides failure tolerance

RAC has manageability features    Virtualization has manageability features

RAC is not managed like web
farms - RAC is a much more
static environment
124 2012 Pythian
Licensing Cheat Sheet

125
Selecting Standard Database Editions

RAC is included
limit on number of sockets is per cluster - 2 or 4 sockets
Doesn't support DataGuard or managed standby
Manual physical standby using scripts or third-party products like
Dbvisit
Missing most of online manageability features
Very limited Streams support
GoldenGate works well
Standard Edition One - $5,800 per socket
Standard Edition - $17,500 per socket
No extra options are available

126 2012 Pythian


Selecting Enterprise Database Edition
Enterprise Edition - $47,500 per core*
RAC option - $23,000 per core*
RAC One Node option - $10,000 per core*
DataGuard is included
Must license primary and standby(s)
Active DataGuard option - $10,000 per core*
Must license primary and standby(s)
Lots of online manageability features included
All Streams features are included

GoldenGate option - $17,500 per core*


Active DataGuard included
* Core multipliers apply - http://www.oracle.com/us/corporate/contracts/processor-core-factor-table-070634.pdf

127 2012 Pythian


Cold Failover licensing rules

A standby node license is included in the primary license if

Oracle software is normally not running on the cold failover server
Can run the software on the standby node for up to 10 days per calendar year
Any part of a calendar day is counted as a full day

DataGuard standby doesn't qualify

the Oracle instance is mounted on the standby
Storage replication qualifies
Cold failover qualifies
RAC One Node qualifies - see http://bit.ly/J1T49S

128 2012 Pythian


Q&A

Email me - gorbachev@pythian.com
Read my blog - http://www.pythian.com
Follow me on Twitter - @AlexGorbachev
Join Pythian fan club on Facebook & LinkedIn

129 2011 Pythian
