2 2009/2010 Pythian
Why Companies Trust Pythian
Recognized Leader:
Global industry-leader in remote database administration services and consulting for Oracle,
Oracle Applications, MySQL and SQL Server
Work with over 150 multinational companies such as Western Union, Fox Interactive Media, and
MDS Inc. to help manage their complex IT deployments
Expertise:
One of the world's largest concentrations of dedicated, full-time DBA expertise.
Deep Dive Agenda
What Does HA Mean?
Availability is the proportion of time a system is in a functioning condition
http://en.wikipedia.org/wiki/Availability
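The definition above can be turned into simple arithmetic. The following is a minimal sketch, assuming availability is measured over a fixed observation period; the function names are hypothetical and chosen only for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year


def availability(uptime_hours: float, downtime_hours: float) -> float:
    """Proportion of the observation period spent in a functioning condition."""
    total = uptime_hours + downtime_hours
    if total <= 0:
        raise ValueError("observation period must be positive")
    return uptime_hours / total


def allowed_downtime_hours(target: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Downtime budget implied by an availability target over a period."""
    return (1.0 - target) * period_hours


# A "three nines" target leaves roughly 8.76 hours of downtime per year.
print(f"{allowed_downtime_hours(0.999):.2f} hours/year")
```

The hard part, as the next slides point out, is deciding what counts as "functioning" before any of these numbers mean anything.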
Functional?
100% operational?
10% operational?
somewhat operational?
pingable?
Example: DaaS - Database as a Service
What is functional?
Instance up?
Can connect? faster than 1 sec?
Can select data?
Can insert a row?
Can commit? faster than 100ms?
Get my minimal CPU capacity?
Get my minimal I/O capacity?
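One way to make "functional" measurable is to express each question in the list as a probe with an optional time limit. This is only a sketch: the probe callables are stubs standing in for real database calls, and the names `timed_check`, `evaluate` and `is_functional` are hypothetical.

```python
import time
from typing import Callable, Dict, Optional, Tuple

# A check is an action plus an optional time limit in seconds.
Check = Tuple[Callable[[], object], Optional[float]]


def timed_check(action: Callable[[], object], limit_seconds: Optional[float] = None) -> bool:
    """Return True if the action succeeds, and within the limit when one is given."""
    start = time.perf_counter()
    try:
        action()
    except Exception:
        return False
    elapsed = time.perf_counter() - start
    return limit_seconds is None or elapsed <= limit_seconds


def evaluate(checks: Dict[str, Check]) -> Dict[str, bool]:
    """Run every check and report pass/fail per name."""
    return {name: timed_check(action, limit) for name, (action, limit) in checks.items()}


def is_functional(checks: Dict[str, Check]) -> bool:
    """The service counts as functional only if every agreed check passes."""
    return all(evaluate(checks).values())


# Stub probes; in practice each lambda would connect, select, insert or commit.
example_checks: Dict[str, Check] = {
    "can connect (< 1 s)": (lambda: None, 1.0),
    "can select data": (lambda: None, None),
    "can commit (< 100 ms)": (lambda: None, 0.1),
}
print(evaluate(example_checks))
```

Which checks belong in the dictionary, and with what limits, is exactly what the SLA has to spell out.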
Example: Monitoring System
What is functional?
Database Availability as Infrastructure SLA
Database Availability as Application SLA
What's first?
Application SLA
or
Infrastructure SLA
Infrastructure SLA as Proxy for Application SLA
Alex's timeless (?) definition of High Availability
Roger Magoulas defined Big Data as an amount of data that becomes challenging for an organization to manage, store and process.
Service Availability vs Data Availability
HA Principles
KISS - Keep It Simple Stupid
Complexity is the enemy of availability!
Dr. Daniel Geer, Chief Technology Officer and co-founder of AtStake, was fired by AtStake for co-authoring CyberInsecurity: The Cost of Monopoly. AtStake is a supplier to
Microsoft.
KISS - endorsed by Oracle
No SPOF (Single Point Of Failure)
Active vs Passive Redundancy
Passive redundancy
Enough disk capacity to sustain disk failures
Enough aggregated network capacity to sustain NIC failure
Active redundancy
Relocate services on RAC node failure
DataGuard failover
Active/passive multipathing configuration
Cost balance
RPO - Recovery Point Objective
SLAs + failure probability + $$ => RPO/RTO
How to Design for High Availability?
Option 2 - HA by design
This is what most people do
Manageability vs Fault Tolerance
Maintenance windows
Option 2 - count as uptime and explicitly specify maintenance window rules in SLAs
Example 1 - a 4-hour maintenance window can be scheduled with 48 hours' notice on Sundays, up to twice per month
Example 2 - maintenance windows are 15 minutes every Saturday at 9am UTC and 2 hours every last Saturday of the quarter at the same time
Example 3 - one 15-minute maintenance window can be scheduled every quarter, agreed with customers one year in advance
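To illustrate Option 2, here is a small sketch of how an SLA report might treat maintenance time. It assumes that maintenance inside the agreed allowance counts as uptime and that any overrun counts as downtime; that overrun rule and the function name are assumptions made for this example, not something stated on the slide.

```python
def sla_availability(
    period_hours: float,
    unplanned_downtime_hours: float,
    maintenance_hours_used: float,
    maintenance_hours_allowed: float,
) -> float:
    """Availability where maintenance within the allowance counts as uptime."""
    # Assumption: only the part of maintenance beyond the allowance is downtime.
    overrun = max(0.0, maintenance_hours_used - maintenance_hours_allowed)
    downtime = unplanned_downtime_hours + overrun
    return (period_hours - downtime) / period_hours


# Example 1 rules: two 4-hour windows per month, so 8 hours allowed.
# Assumption: a 30-day month (720 hours) with 1 hour of unplanned downtime.
within = sla_availability(720, 1.0, 6.0, 8.0)
overrun = sla_availability(720, 1.0, 10.0, 8.0)
print(f"within allowance: {within:.4%}, with 2h overrun: {overrun:.4%}")
```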
Availability must be monitored!
Components of HA System
Hardware Components Relative Failure Rates
Hard disks
Network cards
Network switches
Storage arrays
Cables
SAN switches
[Bar chart: relative failure rate of each component above, axis from 2.00 to 6.00]
Based on a public survey of 51 respondents from the Oracle community (Pythian, LinkedIn RAC SIG, Twitter)
Fault Tolerance in Enterprise Servers
Redundant components
Power Supplies
Cooling
Disks
Network cards
Proactive monitoring / failure anticipation / remote diagnostics
Enterprise vendor support with call home and proactive services
Environmental conditions - cool and clean
Enterprise grade servers and components
Server CPUs, ECC server RAM, enterprise HDDs and SSDs
Software Components
Operating Systems
Storage multipathing drivers
Network redundancy drivers
Oracle Database software
Oracle ASM
Importance of the Processes for High Availability
change control
operations
physical environment
Ulrik Franke, Pontus Johnson, Johan König, Liv Marcks von Würtemberg: Availability of enterprise IT systems - an expert-based Bayesian model, Proc. Fourth International Workshop on Software Quality and Maintainability (WSQM 2010), Madrid
Time-Cost-Quality Triangle
time
cost
quality
HA Team Triangle
team
quality
processes
24x7
High Availability with Oracle Engineered Systems
ASM and Storage Reliability
ASM Striping
file1 file2
ASM Rebalancing
ASM Mirroring
ASM Failure Groups
SAN 1 SAN 2
ASM Redundancy Levels
Read IO Failures
Can re-read mirror block since 7.3 (initially done for Veritas)
Same happens with ASM - attempt to re-read from another mirror
If successful, repair is done for all mirrors
H/w RAID mirroring doesn't give that flexibility
Re-read is done by database blindly hoping for the best
Read failures in the database are handled depending on context
ASM can recover not only from media failure errors but also from corruptions (bad checksum or wrong SCN)
Read IO Failure Remapping
Silent Corruptions of Secondary Extents
Read errors can be the result of a loss of access to the entire disk or media corruptions on an otherwise healthy disk. ASM tries to recover from read errors on corrupted sectors on a disk. When a read error by the database or ASM triggers the ASM instance to attempt bad block remapping, ASM reads a good copy of the extent and copies it to the disk that had the read error.
If the write to the same location succeeds, then the underlying allocation unit (sector) is deemed healthy. This might be because the underlying disk did its own bad block reallocation.
If the write fails, ASM attempts to write the extent to a new allocation unit on the same disk. If this write succeeds, the original allocation unit is marked as unusable. If the write fails, the disk is taken offline.
One unique benefit of ASM-based mirroring is that the database instance is aware of the mirroring. For many types of logical corruptions such as a bad checksum or incorrect System Change Number (SCN), the database instance proceeds through the mirror side looking for valid content and proceeds without errors. If the process in the database that encountered the read is in a position to obtain the appropriate locks to ensure data consistency, it writes the correct data to all mirror sides.
When encountering a write error, a database instance sends the ASM instance a disk offline message.
If the database can successfully complete a write to at least one extent copy and receive acknowledgment of the offline disk from ASM, the write is considered successful.
If the write to all mirror sides fails, the database takes the appropriate actions in response to a write error, such as taking the tablespace offline.
When the ASM instance receives a write error message from a database instance or when an ASM instance encounters a write error itself, the ASM instance attempts to take the disk offline. ASM consults the Partner Status Table (PST) to see whether any of the disk's partners are offline. If too many partners are already offline, ASM forces the dismounting of the disk group. Otherwise, ASM takes the disk offline.
The ASMCMD remap command was introduced to address situations where a range of bad sectors exists on a disk and must be corrected before ASM or database I/O. For information on the remap command, see "remap Command".
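The bad block remapping sequence described in these notes can be summarized as a three-step decision. The sketch below is a toy model of that sequence only; the two callables are hypothetical stand-ins for the actual writes and nothing here reflects ASM internals or any Oracle API.

```python
from enum import Enum
from typing import Callable


class RemapOutcome(Enum):
    SAME_LOCATION_HEALTHY = "rewritten in place; allocation unit deemed healthy"
    RELOCATED = "written to a new allocation unit; original marked unusable"
    DISK_OFFLINE = "both writes failed; disk taken offline"


def remap_after_read_error(
    write_same_location: Callable[[], bool],
    write_new_allocation_unit: Callable[[], bool],
) -> RemapOutcome:
    """Model the order of attempts after a good mirror copy has been read."""
    # Step 1: try to rewrite the good copy over the location that failed to read.
    if write_same_location():
        return RemapOutcome.SAME_LOCATION_HEALTHY
    # Step 2: try a different allocation unit on the same disk.
    if write_new_allocation_unit():
        return RemapOutcome.RELOCATED
    # Step 3: nothing worked, so the disk goes offline.
    return RemapOutcome.DISK_OFFLINE


print(remap_after_read_error(lambda: False, lambda: True).value)
```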
Partial Writes
Disk Failure
What is MTBF?
Bathtub Failure Rate Curve
Real AFR & Utilization (ref Google study)
3 months infant mortality: 10% AFR
Assuming exponential distribution, hourly failure rate = 0.001%
Poisson process - exponential distribution
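The conversion from an annualized failure rate (AFR) to an hourly rate follows from the exponential assumption: if the probability of failing within a year is AFR, then the rate is -ln(1 - AFR) divided by the hours in a year. A quick sketch, assuming a 365-day year (the function name is hypothetical):

```python
import math

HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year


def hourly_failure_rate(afr: float) -> float:
    """Hourly rate of an exponential distribution matching a given AFR.

    P(fail within a year) = 1 - exp(-rate * hours), solved for rate.
    """
    if not 0.0 <= afr < 1.0:
        raise ValueError("AFR must be in [0, 1)")
    return -math.log(1.0 - afr) / HOURS_PER_YEAR


rate = hourly_failure_rate(0.10)
print(f"hourly failure rate: {rate:.6%}")
```

For a 10% AFR this comes out at about 0.0012% per hour, which the slide rounds to 0.001%.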
Hardware Mirroring RAID1 and 2nd Disk Failure
ASM Mirroring and 2nd Disk Failure
What if mirrored extent is placed randomly on other disks?
... any 2nd disk failure would result in data loss
Think Exadata - 168 disks and one fails... then 2nd failure of any of 156 disks will result in data loss
5h window of vulnerability
Chance of data loss = 0.777%
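The 0.777% figure can be reproduced under the exponential assumption: the chance that at least one of the exposed disks fails during the window is 1 - exp(-disks x rate x hours). The sketch below assumes the rounded hourly rate of 0.001% (1e-5) quoted earlier in the deck; the function name is hypothetical.

```python
import math


def chance_of_data_loss(exposed_disks: int, hourly_rate: float, window_hours: float) -> float:
    """Probability that at least one exposed disk fails within the window."""
    # -expm1(-x) equals 1 - exp(-x) but keeps precision for small x.
    return -math.expm1(-exposed_disks * hourly_rate * window_hours)


# 156 disks outside the failed disk's failure group, 5-hour window, 0.001%/hour.
p = chance_of_data_loss(156, 1e-5, 5)
print(f"chance of data loss: {p:.3%}")
```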
Consider a group of n disks all coming from the same production batch.
We will consider two distinct failure processes:
1. Each disk will be subject to independent failures that will be exponentially distributed with rate λ; these independent failures are the ones that are normally considered in reliability studies.
2. The whole batch will be subject to the unpredictable manifestation of a common defect. This event will be exponentially distributed with rate λ' << λ. It will not result in the immediate failure of any disk but will accelerate disk failures and make them happen at a rate λ'' >> λ.
batch-correlated failures
Probability of Surviving a Data Loss After a Failure
P_surv = exp(-n λ T_R)
n - # of partners for a failed ASM disk (8)
λ - rate of failure
T_R - window of vulnerability (5 hours)
(normal redundancy)
Random Failure
(No Global Batch Defect Manifestations)
P_surv = exp(-n λ T_R)
λ - normal rate of failure (or 1/MTBF)
P_surv = exp(-n λ T_R)
λ - accelerated rate of failure
P_surv = (1 + n λ T_R) exp(-n λ T_R)
λ - accelerated rate of failure
T_R - 5 hours
λ is one failure per week: P_surv = 97.58%
λ is one failure per month: P_surv = 99.85%
(high redundancy)
After a Failure Caused by a Global Defect
P_surv = (1 + n λ T_R) exp(-n λ T_R)
λ - accelerated rate of failure
T_R - 1 hour
λ is one failure per week: P_surv = 99.89%
λ is one failure per month: P_surv = 99.99%
(high redundancy)
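The survival figures on these slides can be checked numerically. With n partner disks, failure rate λ per hour and a vulnerability window of T_R hours, normal redundancy survives only if no partner fails, exp(-nλT_R), while high redundancy also tolerates one more failure, (1 + nλT_R) exp(-nλT_R). The sketch below assumes n = 8 partners and a 30-day month; the function names are hypothetical.

```python
import math

HOURS_PER_WEEK = 7 * 24
HOURS_PER_MONTH = 30 * 24  # assumption: 30-day month


def p_surv_normal(n: int, rate: float, t_r: float) -> float:
    """Normal redundancy: survive only if none of the n partners fails."""
    return math.exp(-n * rate * t_r)


def p_surv_high(n: int, rate: float, t_r: float) -> float:
    """High redundancy: survive zero or one further failure (Poisson terms 0 and 1)."""
    x = n * rate * t_r
    return (1.0 + x) * math.exp(-x)


for label, hours in (("week", HOURS_PER_WEEK), ("month", HOURS_PER_MONTH)):
    for t_r in (5, 1):
        p = p_surv_high(8, 1.0 / hours, t_r)
        print(f"one failure per {label}, T_R = {t_r}h: {p:.2%}")
```

Under these assumptions the four printed values match the percentages quoted for the 5-hour and 1-hour windows.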
DEMO
-- target # of partners (hidden parameter, read as SYS)
select v.ksppstvl
from sys.x$ksppi p, sys.x$ksppcv v
where p.indx = v.indx
and p.ksppinm = '_asm_partner_target_disk_part';
-- disk partners of disk &disk_no
select d.path, p.number_kfdpartner
from x$kfdpartner p, v$asm_disk_stat d
where p.disk = &disk_no
and p.grp = d.group_number
and p.number_kfdpartner = d.disk_number;
Disk Failure Recovery - ASM vs RAID1 mirroring
High Availability with DataGuard
and other replication solutions
Two Kinds of Oracle Database Replication
Three (not two) Kinds of Oracle Database Replication
Physical replication is the most common replication for HA
Redo Generation
[Diagram: shadow process, commit, LGWR, DBWx processes, database]
Database Recovery
Physical Standby Architecture
Primary Instance / Primary Database - redo generation
redo transfer
Standby Instance / Standby Database - permanently recovering (applying redo)
ARCH redo transfer
[Diagram, Primary to Standby: redo log buffer, LGWR, ARCn, archivelogs, RFS, MRP, redo apply, datafiles]
LGWR SYNC redo transfer
[Diagram, Primary to Standby: redo log buffer, LGWR, LNSn, redo transfer, RFS, MRP, redo apply, datafiles, ARCn, archivelogs]
LGWR ASYNC redo transfer
[Diagram, Primary to Standby: redo log buffer, LGWR, online redo logs, LNSn, redo transfer, RFS, MRP, redo apply, datafiles, ARCn, archivelogs]
DataGuard Protection Modes
Maximum Protection
ZERO DATA LOSS
Primary stops if redo cannot be shipped to at least one standby
Maximum Performance
redo transfer is asynchronous so some data loss is likely
Maximum Availability
Like maximum protection mode, but if redo cannot be shipped, it switches to maximum performance mode instead of stopping
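The difference between the three modes comes down to what the primary does when redo cannot reach a standby. The following toy model only restates the slide as a decision function; it is not a DataGuard API, and the enum, function and return strings are hypothetical.

```python
from enum import Enum


class ProtectionMode(Enum):
    MAXIMUM_PROTECTION = "maximum protection"
    MAXIMUM_AVAILABILITY = "maximum availability"
    MAXIMUM_PERFORMANCE = "maximum performance"


def primary_behaviour(mode: ProtectionMode, standby_reachable: bool) -> str:
    """Describe how the primary reacts, per mode, to standby reachability."""
    if mode is ProtectionMode.MAXIMUM_PERFORMANCE:
        # Asynchronous transfer: commits never wait, so some loss is possible.
        return "keep running; ship redo asynchronously"
    if standby_reachable:
        return "keep running; ship redo synchronously"
    if mode is ProtectionMode.MAXIMUM_PROTECTION:
        # Zero data loss is bought by giving up primary availability.
        return "stop the primary"
    # Maximum availability degrades instead of stopping.
    return "keep running; behave like maximum performance"


for mode in ProtectionMode:
    print(mode.value, "->", primary_behaviour(mode, standby_reachable=False))
```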
DataGuard Broker
Switchover
Failover
Fast Start Failover and Observer
Site 3 - Observer
Site 1 - Primary Database
Site 2 - Standby Database
[Diagram: heartbeats between the observer and each database, and between primary and standby]
Dbvisit - third party DataGuard alternative
Incrementally updated standby
Storage layer replication
Storage vs Database replication
Active DataGuard
[Diagram: redo transfer from Primary Database to Standby Database]
GoldenGate and Streams?
Application level replication can be simple...
High Availability with Oracle RAC
and other clustering solutions
Single Instance Oracle Database
APP
Query/DML/DDL
SERVER
INSTANCE
Read/write
Database: datafiles, controlfiles, redo logs, flashback logs, change tracking, etc.
Single Instance Oracle Database
APP
SERVER
INSTANCE
Database
Oracle RAC Database
APP
Database
RAC looks simple. Eh?
Actually... RAC is very complicated
Role of Grid Infrastructure
[Diagram: three OS nodes linked by the interconnect, with storage access to shared storage holding the OCR and voting disk]
[Diagram: two Clusterware nodes, each running CSSD and OPROCD, linked by the interconnect]
CSSD daemons cannot talk to each other -> operations are not synchronized -> shared data access -> corruption
Shoot The Other Node In The Head
[Diagram: two OS nodes with CSSD and OPROCD, interconnect, voting disk]
[Diagram: Clusterware stack on two nodes (VIP, RACG, EVMD, CRSD, CSSD, OPROCD) with interconnect and voting disk]
Ask The Other Node To Reboot
[Diagram: two OS nodes running Clusterware with CSSD and OPROCD, interconnect, voting disk]
Oracle can't shoot another node without remote control and can't rely on one type of IO fencing (HBA/SCSI reservations).
What's left - beg the other node - please shoot yourself!
11gR2 Grid Infrastructure: CSSD attempts graceful shutdown
[Diagram: CSSD Monitor/Agent next to CSSD on each Clusterware node, with OPROCD, interconnect and voting disk]
What if CSSD is not healthy? It's very possible that it's not a network problem but CSSD just doesn't reply for some reason. CSSD Monitor (OCLSOMON) comes onto the scene.
[Diagram: Clusterware nodes with CSSD, OPROCD/CSSD Mon, interconnect, voting disk]
Worse yet, the whole node is sick and even OCLSOMON can't function properly, for example when CPU execution is stalled.
[Diagram: two OS nodes running Clusterware with CSSD and OPROCD, interconnect, voting disk]
Eviction by escalation of a member kill
[Diagram: two Clusterware nodes with instances; labels: LMON, member kill, CSSD, OPROCD, interconnect]
11gR2: Intelligent Platform Management Interface
[Diagram: two Clusterware nodes with CSSD and OPROCD, interconnect, voting disk]
Business areas
Customers order (web)
Content display (web cached on app-tier)
Feedback (web)
Orders fulfillment (back-office)
Data extraction batches
[Diagrams: placement of services NEW_ORD, CONTENT, FEEDBACK, PROC_ORD and BATCH across cluster nodes]
CP based placement
automated failover
LOAD_BALANCE=ON
FAILOVER=ON
LISTENER_G41 LISTENER_G41
Local registration
G41.local_listener=LISTENER_G41
G42.local_listener=LISTENER_G42
G43.local_listener=LISTENER_G43
G44.local_listener=LISTENER_G44
*.remote_listener=LISTENERS_G4
lh1 lh2
Which instance?
<= 10gR1: instance load, node load
10gR2: service metrics, AWR
[Diagram: listeners on node1 and node2 behind node1-vip and node2-vip; LISTENER_G44]
Failure detection
Clean up and re-establish connection
Restore session state
Process failure in application logic
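The four steps above map naturally onto a retry wrapper in application code. This is a generic sketch, not tied to any Oracle driver: `connect`, `restore_session_state`, `work` and `is_connection_failure` are hypothetical callables the application would supply.

```python
import time
from typing import Any, Callable, Optional


def run_with_reconnect(
    connect: Callable[[], Any],
    restore_session_state: Callable[[Any], None],
    work: Callable[[Any], Any],
    is_connection_failure: Callable[[Exception], bool],
    max_attempts: int = 3,
    delay_seconds: float = 0.0,
) -> Any:
    """Retry a unit of work on a fresh connection after a connection failure."""
    last_error: Optional[Exception] = None
    for _ in range(max_attempts):
        try:
            conn = connect()             # clean up and re-establish connection
            restore_session_state(conn)  # restore session state
            return work(conn)            # application logic, re-run from the start
        except Exception as exc:
            # Failure detection: only connection failures are retried.
            if not is_connection_failure(exc):
                raise
            last_error = exc
            time.sleep(delay_seconds)
    raise RuntimeError("all reconnect attempts failed") from last_error
```

Re-running `work` from the start is only safe if the unit of work is idempotent or was rolled back with the lost session; handling that correctly is the "process failure in application logic" step, and no wrapper can decide it for the application.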
http://www.scaleabilities.co.uk/wp-content/uploads/downloads/2011/11/RAC_Connection_Management.pdf
Re-connect
No magic
Can be automated to some degree
FMF - Fully Manual Failover
STRF - Semi-Transparent Reactive Failover
TAF - Transparent Application Failover
STPAF - Semi-Transparent Pro-Active Failover
FCF - Fast Connection Failover
Manual reconnect
Transparent Application Failover
Fast Connection Failover (FCF) - the next generation
Oracle Notification Services (ONS)
Diagram from Barb Lundhild's presentation
Cold failover licensing - rule of 10 days
[Diagram: two database instances, each with storage access to one shared database]
Server Virtualization HA
Selecting Standard Database Editions
RAC is included
limit on number of sockets is per cluster - 2 or 4 sockets
Doesn't support DataGuard and managed standby
Manual physical standby using scripts or third-party products like Dbvisit
Missing most of online manageability features
Very limited Streams support
GoldenGate works well
Standard Edition One - $5,800 per socket
Standard Edition - $17,500 per socket
No extra options are available
Email me - gorbachev@pythian.com
Read my blog - http://www.pythian.com
Follow me on Twitter - @AlexGorbachev
Join Pythian fan club on Facebook & LinkedIn