
Disaster Recovery
Planning (DRP)
 DRP is the process of regaining access to the
data, hardware and software necessary to
resume critical business operations after a
natural or human-induced disaster.
 A disaster recovery plan (DRP) should also
include plans for coping with the unexpected
or sudden loss of key personnel, although this is
not covered in this article, the focus of which is
data protection.
 DRP is part of a larger process known as
business continuity planning (BCP).
What is the difference between DRP
and BCP? (1/2)
 Disaster recovery is the process by which you
resume business after a disruptive event.
 The event might be
 something huge, like an earthquake or the
terrorist attacks on the World Trade Center
 something small, like malfunctioning software
caused by a computer virus.
 Given the human tendency to look on the
bright side, many business executives are
prone to ignoring "disaster recovery" because
disaster seems an unlikely event.
What is the difference between DRP
and BCP? (2/2)
 "Business continuity planning" suggests a
more comprehensive approach to making sure
you can keep making money.
 Often, the two terms are married under the
acronym BC/DR.
 At any rate, DR and/or BC determines how a
company will keep functioning after a disruptive
event until its normal facilities are restored.
What do these plans
include (1/2)
 All BC/DR plans need to encompass
 how employees will communicate
 where they will go
 how they will keep doing their jobs.

 The details can vary greatly, depending
on the size and scope of a company and
the way it does business.
What do these plans
include (2/2)
 For example, the plan at one global
manufacturing company is to:
 restore critical mainframes with vital data at
a backup site within four to six days of a
disruptive event,
 obtain a mobile PBX unit with 3000
telephones within two days
 recover the company's 1000-plus LANs in
order of business need
 set up a temporary call center for 100
agents at a nearby training facility.
Events that necessitate
disaster recovery
 Natural disasters
 Fire
 Power failure
 Terrorist attacks
 Organized or deliberate disruptions
 Theft
 System and/or equipment failures
 Human error
 Computer viruses
 Testing
Prevention against data
loss (1/2)
 Backups sent off-site at regular intervals
 Include software as well as all data
information, to facilitate recovery
 Create an insurance copy on microfilm or
similar and store the records off-site.
 Use a remote backup facility if possible to
minimize data loss
 Storage Area Networks (SANs) over
multiple sites make data immediately
available without the need to recover or
Prevention against data
loss (2/2)
 Surge protectors: to minimize the
effect of power surges on delicate
electronic equipment
 Uninterruptible power supply (UPS) and/or
backup generator
 Fire prevention: more alarms,
accessible extinguishers
 Anti-virus software and other security
measures
Techniques and technology
 Mirroring
 Disk mirroring: redundant array of inexpensive
disks, level 1 (RAID 1)
 Server mirroring: web / FTP / email
 RAID: RAID 0 to 6 and combinations
 On-site data storage
 Backup: tape / optical disk
 Off-site data storage (backup site)
 Cold sites
 Warm sites
 Hot sites
Mirroring
 Mirroring can occur locally or remotely.
 Locally means that a server has a second hard
drive that stores data. The second
drive is called a mirrored drive.
 A remote mirror means that a remote server
contains an exact duplicate of the data.
 Data is written to the original drive when a
write request is issued. Data is then copied to
the mirrored drive, providing a mirror image
of the primary drive.
 If one of the hard drives fails, all data is
protected from loss.
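The write path described above can be sketched in a few lines of Python. This is a toy model only: the class name `MirroredVolume` and the use of dictionaries as "drives" are illustrative assumptions, not how any real mirroring driver is built.

```python
class MirroredVolume:
    """Toy two-drive mirror; each drive is modelled as a dict of block -> bytes."""

    def __init__(self):
        self.primary = {}
        self.mirror = {}

    def write(self, block, data):
        # The write request goes to the original drive first...
        self.primary[block] = data
        # ...and the data is then copied to the mirrored drive.
        self.mirror[block] = data

    def read(self, block, primary_failed=False):
        # If the primary drive has failed, the mirror still holds the data.
        source = self.mirror if primary_failed else self.primary
        return source[block]


volume = MirroredVolume()
volume.write(0, b"payroll")
survivor_copy = volume.read(0, primary_failed=True)
```

Because every write lands on both drives, a read served from the mirror returns the same bytes as one served from the primary.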
Disk mirroring (RAID1)
 The replication of
logical disk volumes
onto separate physical
hard disks in real time to
ensure continuous
availability, currency and accuracy.
 A mirrored volume is a
complete logical
representation of
separate volume copies.
Server mirroring
 Mirror sites are most commonly used to provide
multiple sources of the same information, and are of
particular value as a way of providing reliable access
to large downloads.
 Mirroring is a type of file synchronization
 Web server
 To preserve a website or page, especially when it is closed or
is about to be closed.
 To counteract censorship and promote freedom of information
 Email server
 To protect against loss of email information
 FTP server
 To allow faster downloads for users at a specific geographical
location
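Since mirroring is described above as a type of file synchronization, a minimal one-way sync can be sketched with the Python standard library. The function name `mirror_directory` and the file names are hypothetical, and this sketch only copies files toward the mirror; it does not delete files that have disappeared from the origin, which a real mirroring tool would normally handle. It assumes Python 3.8 or later for `dirs_exist_ok`.

```python
import shutil
import tempfile
from pathlib import Path


def mirror_directory(source: Path, destination: Path) -> None:
    """Copy every file under source into destination, overwriting existing copies."""
    shutil.copytree(source, destination, dirs_exist_ok=True)


# Demonstration in a throwaway directory so nothing real is touched.
with tempfile.TemporaryDirectory() as tmp:
    origin = Path(tmp) / "origin"
    replica = Path(tmp) / "replica"
    origin.mkdir()
    (origin / "index.html").write_text("hello")
    mirror_directory(origin, replica)
    mirrored_text = (replica / "index.html").read_text()
```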
Redundant arrays of
inexpensive disks (RAID)
 RAID distributes the data across multiple
smaller disks, offering protection from a crash
that could wipe out all data on a single,
shared disk.
 Benefits of RAID include the following:
 Increased storage capacity per logical disk volume
 High data transfer or I/O rates that improve
information throughput
 Lower cost per megabyte of storage

RAID0
 RAID Level 0 (also known as a stripe set
or striped volume) splits data
evenly across two or more
disks (striped) with no parity
information for redundancy.
 It is important to note that
RAID 0 provides zero
data redundancy.
 RAID 0 is normally used to
increase performance
 A RAID 0 can be created with
disks of differing sizes, but the
storage space added to the
array by each disk is limited to
RAID1
 A RAID 1 creates an exact
copy (or mirror) of a set of
data on two or more disks.
 This is useful when read
performance or reliability
are more important than
data storage capacity.
 Such an array can only be
as big as the smallest
member disk.
 A classic RAID 1 mirrored
pair contains two disks (see
diagram), which increases
reliability.
RAID2
 A RAID 2 stripes data at the bit (rather than block) level, and
uses a Hamming code for error correction.
 Extremely high data transfer rates are possible.
 RAID 2 is the only standard RAID level which can automatically
recover accurate data from single-bit corruption in data.
 At the moment, there are no commercial implementations of
RAID 2.
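To illustrate how a Hamming code recovers from single-bit corruption, here is a small sketch using the Hamming(7,4) code. The choice of this particular code and the function names are assumptions made for illustration; the slide does not say which Hamming variant a RAID 2 array would use.

```python
def hamming74_encode(data_bits):
    """Encode four data bits as a seven-bit codeword [p1, p2, d1, p4, d2, d3, d4]."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over codeword positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_correct(codeword):
    """Fix at most one flipped bit and return the four data bits."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    error_position = s1 + 2 * s2 + 4 * s4  # 0 means no error was detected
    if error_position:
        c[error_position - 1] ^= 1  # flip the corrupted bit back
    return [c[2], c[4], c[5], c[6]]


stored = hamming74_encode([1, 0, 1, 1])
stored[4] ^= 1  # simulate a single-bit corruption on one disk
recovered = hamming74_correct(stored)
```

The three syndrome bits, read as a binary number, give the position of the flipped bit, which is why a single error can be corrected rather than merely detected.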
RAID3
 RAID Level 3 uses byte-level
striping with a dedicated
parity disk.
 RAID 3 is very rare in
practice.
 One of the side effects of
RAID 3 is that it generally
cannot service multiple
requests simultaneously.
 This comes about because
any single block of data
will, by definition, be spread
across all members of the
set and will reside in the
same location.
RAID4
 RAID Level 4 uses block-level
striping with a dedicated parity disk.
 This allows each member of the
set to act independently when only
a single block is requested.
 RAID 4 looks similar to RAID 3
except that it stripes at the block
level, rather than the byte level.
 In the example, a read request for
block "A1" would be serviced by
disk 0. A simultaneous read request
for block B1 would have to wait,
but a read request for B2 could be
serviced concurrently by disk 1.
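The read example above can be expressed as a small lookup. This sketch assumes a block naming scheme in which the letter identifies the stripe and the number is the block's 1-based position within it, so that block 1 of every stripe lives on disk 0. The function names are hypothetical.

```python
def disk_for(block: str) -> int:
    """Return the data disk holding a block named like 'A1' or 'B2'."""
    position_in_stripe = int(block[1:])  # 'A1' -> 1, 'B2' -> 2
    return position_in_stripe - 1        # disks are numbered from 0


def can_read_concurrently(first: str, second: str) -> bool:
    """Two single-block reads can proceed in parallel only on different disks."""
    return disk_for(first) != disk_for(second)


a1_and_b1 = can_read_concurrently("A1", "B1")  # both blocks sit on disk 0
a1_and_b2 = can_read_concurrently("A1", "B2")  # disk 0 and disk 1
```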
RAID5
 A RAID 5 uses block-level striping
with parity data distributed across
all member disks.
 RAID 5 has achieved popularity
due to its low cost of redundancy.
 A minimum of 3 disks is generally
required for a complete RAID 5
configuration.
 In the example, a read request for
block "A1" would be serviced by
disk 0.
 A simultaneous read request for
block B1 would have to wait, but a
read request for B2 could be
serviced concurrently by disk 1.
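The slide does not say how the parity data is computed. A common approach is a bytewise XOR across the data blocks of a stripe, which lets any single missing block be rebuilt from the survivors. The sketch below assumes that approach; the block contents and the helper name `xor_blocks` are made up for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, one byte column at a time."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe spread over three data blocks (hypothetical contents).
stripe = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(stripe)

# The disk holding the second block fails: rebuild it from the rest plus parity.
rebuilt = xor_blocks([stripe[0], stripe[2], parity])
```

XOR-ing the surviving blocks with the parity cancels them out and leaves exactly the missing block, which is why one disk's worth of parity is enough to survive one disk failure.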
RAID6
 A RAID 6 extends RAID 5 by
adding an additional parity
block; it thus uses block-level
striping with two parity
blocks distributed across all
member disks.
 This improves reliability.
 Like RAID 5, the parity is
distributed in stripes, with
the parity blocks in a different
place in each stripe.
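As a rough way to compare the levels above, usable capacity can be estimated from the number of member disks. This is a simplified sketch, not a vendor formula: it assumes every member contributes only as much space as the smallest disk, and the function name `usable_capacity` is hypothetical.

```python
def usable_capacity(level, disk_sizes):
    """Estimate usable capacity, in the same unit as disk_sizes."""
    count = len(disk_sizes)
    smallest = min(disk_sizes)
    if level == 0 and count >= 2:
        return count * smallest        # striping only, no redundancy
    if level == 1 and count >= 2:
        return smallest                # every disk holds the same copy
    if level == 5 and count >= 3:
        return (count - 1) * smallest  # one disk's worth of parity
    if level == 6 and count >= 4:
        return (count - 2) * smallest  # two disks' worth of parity
    raise ValueError(f"unsupported level or too few disks: RAID {level}, {count} disks")


raid5_three_disks = usable_capacity(5, [500, 500, 500])
raid6_four_disks = usable_capacity(6, [500, 500, 500, 500])
```

With 500 GB disks, three disks in RAID 5 and four disks in RAID 6 both yield about 1000 GB of usable space; RAID 6 spends the extra disk on the second parity block.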
Nested RAID
Storage Model
Storage Area Network
 The Storage Networking Industry Association
(SNIA) defines the SAN as a network whose primary
purpose is the transfer of data between
computer systems and storage elements.
 A SAN consists of a communication
infrastructure, which provides physical connections;
and a management layer, which organizes
the connections, storage elements, and
computer systems so that data transfer is secu
SAN's definition
 Put in simple terms, a SAN is a
specialized, high-speed network attaching
servers and storage devices.
 It is sometimes referred to as “the
network behind the servers.”
 A SAN introduces the flexibility of
networking to enable one server or many
heterogeneous servers to share a common
storage utility, which may comprise
many storage devices, including disk, ta
SAN Components
 SAN Connectivity
 The connectivity of storage and server
components, typically using Fibre Channel
(FC).
 SAN Storage
 Tape / RAID / ESS (Enterprise Storage
System) / JBOD (Just a Bunch Of Disks) / SSA
(Serial Storage Architecture)
 SAN Server
 Windows / Unix / Linux, etc.
Switched Fabric
 A fabric is an infrastructure specially designed to
handle storage communications.
 A typical Fibre Channel SAN fabric is
made up of a number of Fibre Channel
switches.
 Today, all major SAN equipment vendors
also offer some form of Fibre Channel
routing solution, and these bring substantial
scalability benefits to the SAN archit
Fibre Channel protocol
 Fibre Channel is a layered protocol. It consists of five
layers, namely:
 FC0: the physical layer, which includes cables, fiber
optics, connectors, pinouts, etc.
 FC1: the data link layer, which implements the 8b/10b
encoding and decoding of signals.
 FC2: the network layer, defined by the FC-PI-2
standard, which consists of the core of Fibre Channel and
defines the main protocols.
 FC3: the common services layer, a thin layer that
could eventually implement functions like encryption
or RAID.
 FC4: the protocol mapping layer, in which other
protocols, such as SCSI, are encapsulated into an
information unit for delivery to FC2.
IP Storage Networking
 FCIP (Fibre Channel over IP)
 A method for tunneling
Fibre Channel information
through an IP network.
 iFCP (Internet Fibre Channel Protocol)
 A mechanism for transmitting data to and
from Fibre Channel storage devices in a SAN,
or on the Internet, using TCP/IP.
 Internet SCSI (iSCSI)
 A transport protocol that carries SCSI
commands from an initiator to a target.
FCIP (Fibre Channel over
IP)
 FCIP encapsulates FC frames within TCP/IP,
allowing islands of FC SANs to be
interconnected over an IP-based network
 TCP/IP is used as the underlying transport to
provide congestion control and in-order delivery
of FC frames
 All classes of FC frames are treated the same
as datagrams
 End-station addressing, address resolution,
message routing, and other elements of the FC
iFCP
 iFCP is a gateway-to-gateway protocol for
implementing a Fibre Channel fabric over a
TCP/IP network
 Traffic between Fibre Channel devices is routed
and switched by the TCP/IP network
 The iFCP layer maps Fibre Channel frames to a
predetermined TCP connection for transport
 FC messaging and routing services are
terminated at the gateways so the fabrics are
not merged with one another
iSCSI
 iSCSI is a SCSI transport protocol for
mapping block-oriented storage data over
TCP/IP networks
 The iSCSI protocol enables universal access
to storage devices and Storage Area Networks
(SANs) over standard TCP/IP networks
Backup site
 A backup site is a location where a business
can easily relocate following a disaster, such
as fire, flood, or terrorist threat. This is an
integral part of the disaster
recovery plan of a business.
 A backup site can be another location
operated by the business, or contracted via a
company that specializes in disaster recovery
services.
 In some cases, a business will have an
Cold Sites
 A cold site is the least expensive type of
backup site for a business to operate.
 It provides office space in which to operate.
 It does not include backed-up copies of data
and information from the original location of
the business, nor does it include hardware
already set up.
 The lack of hardware contributes to the
minimal startup costs of the cold site, but
requires additional time following the disaster to
have the operation running at a capacity close to
Warm Sites
 A warm site is a location to which the
business can relocate after the disaster
and that is already stocked with computer
hardware similar to that of the original
site, but does not contain backed-up
copies of data and information.
Hot Sites
 A hot site is a duplicate of the original site of
the business, with full computer systems as
well as near-complete backups of user data.
 Ideally, a hot site will be up and running within
a matter of hours. This type of backup site is
the most expensive to operate.
 Hot sites are popular with stock exchanges
and other financial institutions who may need
to evacuate due to potential bomb threats and
must resume normal operations as soon as po
How to choose
 Choosing the type is mainly decided by a
company's cost vs. benefit strategy.
 Hot sites are traditionally more
expensive than cold sites since much of
the equipment the company needs has
already been purchased and thus the
operational costs are higher.
 However if the same company loses a
substantial amount of revenue for each
day they are inactive then it may be wor
 The advantage of a cold site is
simple: cost. It requires far fewer
resources to operate a cold site because no
equipment has been bought prior to the
disaster.
 The downside with a cold site is the
potential cost that must be incurred in
order to make the cold site effective.
 The costs of purchasing equipment on
very short notice may be higher and the
Disaster Recovery Planning steps
(1/3)
 Assess business impact and risk.
 This should include an assessment of the business
unit's function and, preferably, a business impact
analysis (BIA).
 The purpose of the assessment is to determine
the business unit's relative contribution to the
larger organization (monetary and functional).
 The greater the potential impact, the more money
a company should spend to restore a system or
process quickly.
 For instance, a stock trading company may decide
to pay for completely redundant IT systems that
would allow it to immediately start processing tra
Disaster Recovery Planning steps
(2/3)
 Develop a Disaster Recovery framework.
 Data should be categorized by importance.
Two measures of importance are used, RTO
and RPO.
 Recovery Time Objective (RTO) is the
acceptable amount of time between the
disaster and the post-disaster resumption of
function (how long can we wait to restore data?).
 Recovery Point Objective (RPO) is the
Disaster Recovery Planning steps
(3/3)
 Develop a recovery strategy and then a
written Disaster Recovery Plan.
 That written plan should address, at a
minimum, detailed tasks for response, recovery,
and resumption of services.
 Adjust information systems to make
Disaster Recovery easier.
 This includes consolidating servers and data,
perhaps with a Storage Area Network or
other archival storage method.
Important factors (1/3)
 Communication
 Personnel: notify all key personnel of the
problem and assign them tasks focused
toward the recovery plan.
 Customers: notifying clients about the
problem minimizes panic.
 Recall backups
 If backup tapes are taken offsite, these need
to be recalled. If using remote backup
services, a network connection to the remote
backup location (or the Internet) will be re
Important factors (2/3)
 Facilities
 Having backup hot sites or cold sites, for
larger companies. Mobile recovery facilities
are also available from many suppliers.
 Prepare your employees
 During a disaster, employees are required
to work longer, more stressful hours, and a
support system should be in place to alleviate
some of the stress. Prepare them ahead
of time to ensure that work runs smoothly.
Important factors (3/3)
 Business information
 Backups should be stored in a completely
separate location from the company.
 Testing the plan
 Provisions, directions, and frequency for testing
the plan should be stipulated.
Things to do in DRP (1/4)
 Here are 10 absolute basics your plan should
cover:
1. Develop and practice a contingency plan
that includes a succession plan for your CEO.
2. Train backup employees to perform
emergency tasks. The employees you count on
to lead in an emergency will not always be
available.
3. Determine offsite crisis meeting places for
top executives.
Things to do in DRP (2/4)
4. Make sure that all employees, as well as
executives, are involved in the exercises so
that they get practice in responding to an
emergency.
5. Make exercises realistic enough to tap into
employees' emotions so that you can see how
they'll react when the situation gets stressful.
6. Practice crisis communication with
Things to do in DRP (3/4)
7. Invest in an alternate means of
communication in case the phone networks
go down.
8. Form partnerships with local emergency
response groups (firefighters, police) to
establish a good working relationship. Let them
become familiar with your company and site.
Things to do in DRP (4/4)
9. Evaluate your company's performance
during each test, and work toward
constant improvement. Continuity
exercises should reveal weaknesses.
10. Test your continuity plan regularly to
reveal and accommodate changes.
Technology, personnel and facilities are
in a constant state of flux at any compan
Top mistakes in disaster
recovery (1/3)
1. Inadequate planning:
 Have you identified all critical systems, and
do you have detailed plans to recover them to the
current day?
 Everybody thinks they know what they have on
their networks, but most people don't really know
how many servers they have, how they're configured,
what applications reside on them, what services
were running, or what versions of software or
operating systems they were using.
Top mistakes in disaster
recovery (2/3)
 2. Failure to bring the business into the planning
and testing of your recovery efforts.
 3. Failure to gain support from senior-level
managers.
The largest problems here are:
 Not demonstrating the level of effort required for full
recovery.
 Not conducting a business impact analysis and
addressing all gaps in your recovery model.
Top mistakes in disaster
recovery (3/3)
 Not building adequate recovery plans that
outline your recovery time objective, critical
systems and applications, vital documents
needed by the business, and business
functions by building plans for operational
activities to be continued after a disaster.
 Not having proper funding that will allow
for a minimum of semiannual testing.
