
Database Administration:

The Complete Guide to Practices and Procedures

Chapter 18
Data and Storage
Management

Agenda

Storage Management Basics


Files and Data Sets
Space Management
Fragmentation and Storage
Storage Options
Planning for the Future
Questions

Storage Management Basics


DBMS vendors do not certify or explicitly support
any specific third-party storage products.
Nevertheless, some underlying and reliable storage
technology must be used.

The DBA must evaluate the many products,
technologies, and vendors that provide storage
solutions.
The DBA may work in conjunction with a storage
administration group, if one exists at the shop.
Some storage technologies are better suited than
others in terms of performance, reliability, usability,
and cost.

Disk
Modern disk drives are more reliable than in years past,
with an ever-increasing mean time between failures
(MTBF).
Disk drives may achieve in excess of a hundred thousand
hours of availability before failing.
But the mechanical nature of disk drives renders them more
vulnerable to failure than other computer components.
As the number of disk drives in a system increases, the
vulnerability of the system increases.

A single organization may rely on hundreds or thousands
of disk drives to support its database applications.
Certain modern storage solutions, such as RAID, can be
used to address some of the MTBF problems.
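
As a rough illustration of why vulnerability grows with the number of drives, the sketch below assumes that drives fail independently at a constant rate (an assumption, not something stated on the slide). Under that assumption the failure rates add up, so the expected time between failures somewhere in the system is roughly the single-drive MTBF divided by the number of drives. The function name and the figures are illustrative only.

```python
def hours_between_failures(drive_mtbf_hours: float, drive_count: int) -> float:
    """Approximate mean time between drive failures across a whole system.

    Assumes independent drives with a constant failure rate, so the
    individual failure rates simply add up.
    """
    if drive_count < 1:
        raise ValueError("drive_count must be at least 1")
    return drive_mtbf_hours / drive_count


if __name__ == "__main__":
    # 100,000 hours is the single-drive figure mentioned on the slide.
    for count in (1, 100, 1000):
        hours = hours_between_failures(100_000, count)
        print(f"{count:>5} drives: a failure roughly every {hours:,.0f} hours")
```

With 1,000 drives at 100,000 hours each, this simple model suggests a drive failure about every 100 hours, which is the kind of exposure that redundancy such as RAID is meant to absorb.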

Integrity vs. Availability


For mission-critical applications, data integrity can be
more important than data availability.
If the storage media is unreliable and a failure causes data
corruption, the lost data can be more of a problem than the
downtime.
Database storage solutions must protect the data at all costs.

Database performance is I/O dependent


The faster the DBMS can complete an I/O operation the
faster the database application will run.
Data retrieval from storage media takes much longer to
complete than data retrieval from cache or memory.

Modern storage systems provide their own caching
mechanism to prestage data in memory.

Data Growth
IBM: "Inside IBM we talk about ten times more connected people, a
hundred times more network speed, a thousand times more devices, and
a million times more data."
IDC Corporation: The amount of digital information created in the world in
2010 exceeded a zettabyte (1 trillion gigabytes) for the first time. The
amount of digital information created in 2011 will surpass 1.8 zettabytes.
A 2011 study shows that almost all shops reported data growth over the
past year, and one-third reported that the amount of data within their
enterprises grew by 25% or more during the past year.
About 10% have data stores in the petabyte range.
27% exceeded 100 TB in total.

Winter Corporation TopTen Program: The most recent report highlights a 23 TB
transaction processing database hosted on the mainframe and a 100 TB
data warehouse hosted on Unix.
Giga Research Group: How much data is on the planet?
Approximately 201,000 TB, or about 197 PB.
http://www.informationweek.com/816/gerstner.htm
http://www.emc.com/collateral/demos/microsites/emc-digital-universe-2011/index.htm
http://www.wintercorp.com/WhitePapers/WC_TopTenWP.pdf

Data Storage and Size Terminology

Abbreviation  Term       Size         Power of 2
B             Byte       8 bits
KB            Kilobyte   1,024 bytes  2^10 bytes
MB            Megabyte   1,024 KB     2^20 bytes
GB            Gigabyte   1,024 MB     2^30 bytes
TB            Terabyte   1,024 GB     2^40 bytes
PB            Petabyte   1,024 TB     2^50 bytes
EB            Exabyte    1,024 PB     2^60 bytes
ZB            Zettabyte  1,024 EB     2^70 bytes
YB            Yottabyte  1,024 ZB     2^80 bytes

Goals When Building a Storage System
Preventing loss of data: the number-one priority
Assuring that adequate capacity is available and that
the storage solution can easily scale as storage needs
grow
Selecting a solution that provides fast access to data
with minimal or no interruptions to service
Choosing storage solutions that are fault tolerant and
that can be repaired quickly when a failure occurs.
Selecting a storage solution where you can add or
replace disks without an outage
Combining all of the above into a cost-effective
storage solution your company can afford

Files and Data Sets


A file typically corresponds
to a data file
or a data set.

Storage Devices
The DBA may choose to use multiple storage devices for
the different files to:
Align the performance requirements of the file with the
appropriate disk device
Separate indexes from data for performance reasons
Isolate the transaction log on a separate and very fast
device
Isolate temporary and work files on a single volume; if
a disk error occurs, temporary files can be deleted and
redefined with no backup and recovery implications
Spread the data across multiple devices to facilitate
parallel access.

File Placement on Disk


File placement can achieve performance gains.
Consider placing index files and data files on separate
disk devices.
Place the transaction log on a separate device from the
database.
Short-stroking:
Use only the outer edge of the disks so that the arm
does not need to move very far across the platter to find the
needed data.
This also keeps the platter from needing to rotate many times
to find all that data.

Exact file placement is becoming less beneficial with the
advent of modern storage devices.
RAID mixes things up regardless of how you try to place the
data.

Raw Partitions vs. File Systems
A raw partition is simply a raw disk
device with no operating system or
file system installed.
Raw partitions avoid the additional buffering of the
O/S.
The drawback to raw devices is the
difficulty of tracking the database
files.

Temporary Database Files


Modern DBMSs provide capabilities to
create temporary database objects
that exist only during the scope of a
specific transaction.
Temporary database objects require
some form of short-term persistent
storage.
Depending on the DBMS, the DBA
will need to assign disk devices and
an amount of storage for use by

Space Management

Number of secondary extents
Device fragmentation
Fragment usage information
Free space available
Segment or partition size
Tables and indexes allocated per segment
Amount of reserved space that is currently
unused
Objects approaching an out-of-space
condition

Data Page Layout


Header Information

Data Rows

Offset Table

Page Components
The page layout for a database object usually consists of
three basic components.
Page header - housekeeping information.
The page header may include a page identifier, forward and
backward links to other data pages, an identifier indicating to
which table the page belongs, free space pointers, and the
minimum row length for the table.

Data rows - actual data rows of the table (or index).


Rows will not cross a page boundary except for certain large
data types like text, images, and other binary large objects.

Offset table - pointers to each data row on the data page.


Some DBMSs always use some form of offset table, whereas
others use the offset table only when variable-length rows exist
on the data page.

Allocation Pages
The DBMS uses an allocation page to manage
the other pages in the database object.
Sometimes called a space map page.

Allocation pages control physical pages. Each
physical page is mapped to a single database.
Usually stored as a bitmap, containing a series of
bits that are either turned on (1) or off (0).
Each bit refers to pages within the allocation unit
and to whether space is available within the page.
The allocation page will have a distinctly different
format than a typical data page.
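
The bitmap idea can be sketched in a few lines. This is a toy model, not the format of any particular DBMS: it assumes one bit per page in the allocation unit, with 1 meaning "in use" and 0 meaning "free". The class and method names are hypothetical.

```python
from typing import Optional


class AllocationPage:
    """Toy space map: one bit per page in the allocation unit (1 = in use)."""

    def __init__(self, pages_in_unit: int) -> None:
        self.pages_in_unit = pages_in_unit
        # Round up to whole bytes; every bit starts as 0 (free).
        self.bits = bytearray((pages_in_unit + 7) // 8)

    def _check(self, page_no: int) -> None:
        if not 0 <= page_no < self.pages_in_unit:
            raise IndexError(f"page {page_no} is outside the allocation unit")

    def is_used(self, page_no: int) -> bool:
        self._check(page_no)
        return bool(self.bits[page_no // 8] & (1 << (page_no % 8)))

    def mark_used(self, page_no: int) -> None:
        self._check(page_no)
        self.bits[page_no // 8] |= 1 << (page_no % 8)

    def mark_free(self, page_no: int) -> None:
        self._check(page_no)
        self.bits[page_no // 8] &= ~(1 << (page_no % 8)) & 0xFF

    def first_free(self) -> Optional[int]:
        """Return the lowest-numbered free page, or None if the unit is full."""
        for page_no in range(self.pages_in_unit):
            if not self.is_used(page_no):
                return page_no
        return None
```

A real allocation page typically tracks more than a single used/free flag (for example, how full each page is), but the lookup pattern is the same: consult the bitmap rather than reading the data pages themselves.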

Data Record Layouts


Row Header.
Each record typically begins with several bytes of physical
housekeeping information detailing the structure and
composition of the data contents of the row. This might
include row length, information on variable-length data, and
other control structures.

Row Data.
This consists of the actual data contents of the data columns
for the row, in the order of their definition. Depending on the
DBMS, the variable- and fixed-length columns may be
separated.

Offset Tables.
Optionally, the record may contain an offset table with
pointers to manage and control where variable-length fields
are stored within the row.

Calculating Row Length

Calculating Table Size


Assuming a 4K page size with 32
bytes of overhead, calculate the
number of rows per page as:

And the total amount of space
required for the table is calculated
as:
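
One plausible way to carry out this arithmetic is sketched below. It uses the 4K page size and 32 bytes of page overhead from the slide; everything else is an assumption for illustration: rows are taken to be fixed length, whole rows only are stored on a page, and free space settings are ignored. Your DBMS documentation gives the exact formulas for its own page format.

```python
import math

PAGE_SIZE = 4096      # 4K page, as stated on the slide
PAGE_OVERHEAD = 32    # bytes of page overhead, as stated on the slide


def rows_per_page(row_length: int) -> int:
    """Whole rows that fit in the usable part of one page."""
    if row_length < 1:
        raise ValueError("row_length must be positive")
    return (PAGE_SIZE - PAGE_OVERHEAD) // row_length


def table_size_bytes(row_count: int, row_length: int) -> int:
    """Space for the table: pages needed (rounded up) times the page size."""
    per_page = rows_per_page(row_length)
    if per_page == 0:
        raise ValueError("row does not fit on a single page")
    pages = math.ceil(row_count / per_page)
    return pages * PAGE_SIZE


if __name__ == "__main__":
    # Hypothetical table: one million rows of 200 bytes each.
    print(rows_per_page(200))                # rows that fit on one page
    print(table_size_bytes(1_000_000, 200))  # total bytes required
```

For the hypothetical 200-byte row, 20 rows fit on a page, so one million rows need 50,000 pages, or 204,800,000 bytes.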

Index Page Layouts


Header information.
Physical housekeeping information detailing the structure
and composition of the index record.

Row length.
For variable-length keys, the index may need to store the
actual length of the indexed data.

Index key values.


The actual data values for the index key.

Page pointer.
Points to the physical location of the data page in the table
that actually holds the indexed data.

Offset and adjust tables.


May be required to manage and control the position of
variable-length fields stored within the index row.

Calculating Index Size (Row Size)
Start by calculating the row size for
the index using one of the following
formulas:

Calculating Index Size (Record Size)
You need to calculate the size of an
index record, not just the size of the
row.
Records are simply the row plus the
overhead required to store the row.
To calculate the record size you will
need to obtain the size of the row
overhead required for indexes in the
DBMS you are using.

Calculating Index Size (Entries Per Page)
Use this formula to calculate the
number of records that can fit on a
single index page.

Calculating Index Size (Levels)
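
The four steps (row size, record size, entries per page, levels) can be chained together as in the sketch below. The overhead values are placeholders, not figures from any specific DBMS: an 8-byte page pointer, 4 bytes of record overhead, and the same 4K page with 32 bytes of page overhead used for the table calculation. The function name is hypothetical, and the model assumes fixed-length keys and completely full pages.

```python
from typing import List


def index_pages_per_level(
    row_count: int,
    key_length: int,
    pointer_size: int = 8,      # assumed size of the page pointer
    record_overhead: int = 4,   # assumed per-record overhead
    page_size: int = 4096,
    page_overhead: int = 32,
) -> List[int]:
    """Return the page count for each index level, leaf level first."""
    row_size = key_length + pointer_size          # step 1: row size
    record_size = row_size + record_overhead      # step 2: record size
    entries_per_page = (page_size - page_overhead) // record_size  # step 3
    if entries_per_page < 2:
        raise ValueError("index records are too large for this page size")

    levels = []
    entries = row_count
    while True:                                   # step 4: levels
        # Ceiling division: pages needed to hold this level's entries.
        pages = max(1, -(-entries // entries_per_page))
        levels.append(pages)
        if pages == 1:
            break
        # The next level up holds one entry per page of this level.
        entries = pages
    return levels


if __name__ == "__main__":
    levels = index_pages_per_level(1_000_000, key_length=20)
    print(levels, "total pages:", sum(levels))
```

For a hypothetical one-million-row table with a 20-byte key, this gives 127 entries per page and three levels: 7,875 leaf pages, 63 intermediate pages, and 1 root page.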

Transaction Log Sizing


Arriving at an exact size for the transaction log can be
more of an art than a science.
A good rule of thumb is to provide sufficient space for the
log file to capture all database modifications that will
occur between log archivals during your busiest
processing period.
The tricky part is determining the most active processing
period and the impact of database modifications on the log.
If you do not automatically archive or back up your
database logs, you should pay close attention to the size
of the database log file(s).
Failure to do so can cause data to be unrecoverable or cause
processing to slow down or halt completely.
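
The rule of thumb above can be turned into a back-of-the-envelope estimate. The sketch assumes you can measure (or guess) the rate at which log data is written during the busiest period; the safety factor and the sample figures are assumptions chosen for illustration, not recommendations.

```python
import math


def log_space_needed(
    peak_log_bytes_per_sec: float,
    seconds_between_archives: float,
    safety_factor: float = 1.5,   # assumed headroom for unexpected bursts
) -> int:
    """Bytes of log space needed to survive one archive interval at peak load."""
    if peak_log_bytes_per_sec < 0 or seconds_between_archives < 0:
        raise ValueError("rates and intervals must be non-negative")
    return math.ceil(
        peak_log_bytes_per_sec * seconds_between_archives * safety_factor
    )


if __name__ == "__main__":
    # Hypothetical workload: 2 MB/s of log writes, archived every 15 minutes.
    needed = log_space_needed(2 * 1024 * 1024, 15 * 60)
    print(f"{needed / 1024 ** 3:.2f} GB of log space")
```

The hard part remains the input: the peak logging rate has to come from monitoring your own workload.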

Fragmentation and Storage


Fragmentation can be the enemy of performance and
storage requirements.
Disk fragmentation.
Occurs at the operating system level and the DBA must be able
to use OS tools to analyze drives on which databases reside.

Index fragmentation.
A more serious concern for DBAs.
Indexes can become disorganized and fragmented as data is
added, modified, and removed from indexed tables.
Fragmented indexes consist of many scattered areas of storage
that are too small to be used productively.
This causes wasted space, which can hinder performance and
increase storage costs.
Use tools to scan indexes for fragmentation and take actions to
defragment or rebuild indexes on a regular basis.

Storage Options

SSD
RAID
JBOD
SAN
NAS
Tiered Storage

SSD: Solid State Devices


Also referred to as solid-state drives, solid-state
disks, or electronic disks.
SSDs use solid-state memory to store data.
Because SSDs use memory as opposed to
magnetic disks, which are mechanical, I/O
performance is greatly improved.

Although SSDs outperform traditional disk
and are less susceptible to physical damage,
they are considerably more expensive and
tend to have lower capacities.

RAID
RAID: Redundant Arrays of Independent
Disks
RAID combines multiple disk devices into an
array.
There are many levels of RAID technology,
which deliver different degrees of fault
tolerance and performance.
RAID can improve availability, remove the
need for outages to change hardware, and
overall minimize downtime.

RAID Levels
Vendors provide varying levels of
support for the RAID levels that have
been defined.
These various levels of RAID support
continuous availability through
combinations of functions:
Mirroring
Striping
Parity

Mirroring
Mirroring occurs when complete
copies of the data are made on at
least two disk drives, and all changes
made to the data are made
simultaneously to both copies.
If one fails, access is
automatically shifted
to the remaining copy.

Striping
Striping occurs when subsets of data
are spread across multiple disk
drives.
If any one drive fails, the impact of the
failure is limited to the data within the
stripe on that disk.

Parity
Parity bits are encoded data that can
be used to facilitate the
reconstruction of the original data
in the event that all or part of the data
cannot be accessed because a drive fails.

The lost data can be reconstructed
on the fly until it can be rewritten
to undamaged disks.
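
A small sketch shows how reconstruction from parity can work. It uses XOR, which is one common parity encoding; the slide does not say which encoding a given RAID product uses, so treat this as an illustration of the principle. The function names and the sample block contents are made up.

```python
from functools import reduce
from typing import List, Sequence


def xor_blocks(blocks: Sequence[bytes]) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    if not blocks:
        raise ValueError("at least one block is required")
    length = len(blocks[0])
    if any(len(block) != length for block in blocks):
        raise ValueError("all blocks must be the same length")
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(surviving: List[bytes], parity: bytes) -> bytes:
    """Recover the single missing block from the survivors plus parity."""
    return xor_blocks(surviving + [parity])


if __name__ == "__main__":
    data = [b"DBA1", b"LOGS", b"IDX7"]   # blocks on three data disks
    parity = xor_blocks(data)            # block on the parity disk

    # Simulate losing the second disk and rebuilding its block.
    rebuilt = rebuild_missing([data[0], data[2]], parity)
    print(rebuilt == data[1])
```

Because XOR is its own inverse, XORing the parity block with every surviving data block yields exactly the block that was lost. This only works when a single block per stripe is missing.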

RAID Levels

Level      Fault tolerance  Read performance  Write performance  Cost
No RAID    No               Normal            Normal             Inexpensive
Level 0    No               Fast              Fast               Expensive
Level 1    Yes              Normal            Normal             Moderate
Level 2    Yes              Normal            Normal             Moderate
Level 3    Yes              Normal            Normal             Moderate
Level 4    Yes              Normal            Slow               Moderate
Level 5    Yes              Fast              Slow               Expensive
Level 6    Yes              Fast              Slow               Expensive
Level 10   Yes              Fast              Normal             Expensive
Level 50   Yes              Normal            Normal             Expensive
Level 0+1  Yes              Fast              Fast               Very Expensive

RAID-0

RAID-1

RAID-2

RAID-3

RAID-4

RAID-5

RAID-6

RAID-10

RAID-50

RAID-0+1

Proprietary RAID
A number of proprietary variants and
levels of RAID have been defined by
the storage vendors.
If you are in the market for RAID
storage, be sure you understand
exactly what the storage vendor is
delivering.

http://en.wikipedia.org/wiki/RAID

Evaluating RAID
Favor fault-tolerant RAID levels for database files. Database
files not on fault-tolerant disks are subject to downtime and
lost data.
Choose the appropriate disk system for the type of activity
each database object will experience. For example, you
might want to implement two separate RAID systems: one
at RAID-5 for data that is heavily read-focused, such as
analysis and reporting, and another at RAID-1 (or RAID-0+1)
for transaction data that is frequently written and updated.
For high-performance, mission-critical implementations with
sufficient budget, consider RAID-10 for its performance and
fault tolerance.
If you have the budget at your disposal, consider RAID-0+1
because it has fast read, fast write, and fault tolerance.

JBOD
JBOD stands for just a bunch of disks.
The term is used to differentiate
traditional disk technologies from newer
storage technology.

SAN: Storage Area Network


A storage area network, or SAN, generally refers
to an interconnected network of storage devices.
No industry-wide standard definition of SAN exists, and
it means different things to different folks.
To some, a SAN is anything that includes a Fibre
Channel switch.
Others define a SAN to be anything with two or more
host systems using Fibre Channel technology.
However you define them, SANs offer high speed,
coupled with high availability.

Fibre Channel is a serial interface that can deliver
a transfer rate of up to 105 MB/second.

Benefits of SAN
SAN affords the following
benefits:
Shared storage between multiple
hosts
High I/O performance
Server and storage consolidation

NAS: Network-Attached
Storage
NAS refers to storage that can be
accessed directly from the network.
Hosts or client systems can read and write
data over a network interface.

NAS provides the following benefits:
Shared storage between multiple hosts
Simpler management due to reducing
duplicate storage
Application-based storage access at the file level

SAN versus NAS


A SAN is best utilized as a storage backbone providing
basic storage services to host systems and NAS servers.
SANs are well suited for sharing storage and building the
infrastructure for server and storage consolidation.
Applications requiring high performance or large capacities are
good candidates for SAN technology.
SAN is ideal for database applications.

NAS is better suited for solving multimedia storage
problems, data sharing issues, and sharing of storage for
smaller systems.
NAS does not efficiently handle the block-based storage used
by database systems.

Do not make the mistake of forcing all storage to be of
one type, either SAN or NAS.
Match the storage requirements to the access and modification
needs of each database and application.

Tiered Storage
With tiered storage, different categories of data are
assigned to different types of storage media in order
to reduce total storage cost.
This can be important for organizations that manage significant
amounts of data that continue to grow in volume.
Tiered storage can offer some financial relief.

To take advantage of tiered storage, data first must be
categorized based on criteria related to the data and
its use. Example categories include:
usage patterns
importance to the organization
level of protection required
recovery needs (local and disaster)
performance requirements

Effective Storage Tiering Requirements
A useful categorization technique needs
to be defined
A method of measuring data such that it
can be categorized is required
All of the available storage options need
to be aligned with the data categories
And a technique needs to be created for
moving the data to the appropriate
storage devices

Multi-Temperature Data
Popularized by Teradata.
This technique deploys four categories:
Hot
Warm
Cool
Dormant

The temperature of data is defined as a function of
the access rate for queries, updates, and data
maintenance.
Hotter data is accessed more frequently than warm data
which is accessed more frequently than cool data.
Finally, dormant data is rarely updated or queried and it is
part of a static data model.
http://www.teradatamagazine.com/v11n03/Tech2Tech/Why-Multi-Temperature-Data-Mat

Categorize Devices to
Temperature
The next step is to categorize your storage devices for use with
each data temperature.
Hot data that is I/O intensive and requires high availability can be
placed on storage devices offering high performance, reliability,
advanced features, and large capacity.
RAID or SSD

Warm data is less frequently accessed than hot data and often is
read more than it is modified. It can be placed on less expensive disk
with good performance and reliability, but not top-of-the-line.
SATA and SCSI configured storage can work well for warm data.

Cool data is not accessed often. Such data usually still needs to
reside on direct access storage devices.
Perhaps NAS (network-attached storage) or object-based storage.

Dormant data has not been accessed for a long time
(perhaps years), and its data model is stable.
It can be placed on offline storage systems such as intelligent tape or optical disk.
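
The mapping from temperature to device class can be expressed as a simple lookup. In the sketch below, the tier descriptions follow the slide, but the access-rate thresholds are invented purely for illustration; a real scheme would derive them from measured usage and would likely weigh more than one criterion. All names are hypothetical.

```python
# Storage suggestions per temperature, taken from the slide.
TIERS = {
    "hot": "RAID or SSD",
    "warm": "SATA or SCSI configured storage",
    "cool": "NAS or object-based storage",
    "dormant": "offline storage such as tape or optical disk",
}


def classify_temperature(
    accesses_per_day: float,
    hot_threshold: float = 1000.0,   # assumed cut-off, not from the source
    warm_threshold: float = 50.0,    # assumed cut-off, not from the source
    cool_threshold: float = 1.0,     # assumed cut-off, not from the source
) -> str:
    """Bucket data by access rate (queries, updates, and maintenance)."""
    if accesses_per_day >= hot_threshold:
        return "hot"
    if accesses_per_day >= warm_threshold:
        return "warm"
    if accesses_per_day >= cool_threshold:
        return "cool"
    return "dormant"


def storage_for(accesses_per_day: float) -> str:
    """Suggested storage class for data with the given access rate."""
    return TIERS[classify_temperature(accesses_per_day)]


if __name__ == "__main__":
    for rate in (5000, 200, 3, 0.01):
        print(f"{rate:>8} accesses/day -> {storage_for(rate)}")
```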

Automated Storage Tiering


Storage vendors offer automated
tiered-storage solutions that work
based on defined rules and tiering
definitions.

Planning for the Future


Database implementations are not static.
Once deployed, databases are queried, updated,
loaded, unloaded, and reorganized, and data is
deleted and inserted from them on an ongoing
basis.
As the data composition of a database changes,
its storage requirements will change as well.

The DBA must be ever-vigilant in planning for
future growth.
The DBA must keep an eye on the amount of data
and the number of users accessing the data.
When either expands, database storage may have
to be modified.

Capacity Planning
Capacity planning measures and compares system
capacity against requirements.
Determine whether your existing infrastructure can
sustain the anticipated workload by:
Measuring current capacity
Gauging the growth of capacity over time
Factoring in the anticipated capacity requirements of new
corporate and IT initiatives

If the projected growth outpaces the ability of your
computing environment to support it, you will need
to evaluate the cost of modifying and possibly
scaling up your computing infrastructure.
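
A simple projection shows how measured capacity and an observed growth rate combine into an estimate of when storage will run out. The sketch assumes steady compound growth at a fixed monthly rate, which real workloads rarely follow exactly, and it ignores new initiatives that add data in steps. The function name and the sample numbers are illustrative.

```python
import math


def months_until_full(
    used_tb: float, capacity_tb: float, monthly_growth_rate: float
) -> float:
    """Months until used storage reaches capacity under compound growth.

    Solves used * (1 + rate) ** months = capacity for months.
    """
    if used_tb <= 0 or capacity_tb <= 0:
        raise ValueError("used_tb and capacity_tb must be positive")
    if used_tb >= capacity_tb:
        return 0.0
    if monthly_growth_rate <= 0:
        return math.inf   # no growth: capacity is never reached
    return math.log(capacity_tb / used_tb) / math.log(1 + monthly_growth_rate)


if __name__ == "__main__":
    # Hypothetical shop: 40 TB used of 100 TB, growing 2% per month.
    months = months_until_full(40, 100, 0.02)
    print(f"Storage is projected to fill in about {months:.0f} months")
```

In the hypothetical case above, the projection comes out at a little over 46 months. Rerunning it as new measurements arrive gives an early warning when growth accelerates.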

Storage Planning
From a storage perspective, this may involve
simply adding more disk devices and assigning
them to the DBMS.
However, it may involve additional tasks to
support additional data and users, such as the
following:
Redesigning applications
Redesigning databases
Modifying DBMS parameters
Reconfiguring hardware components
Adjusting software interfaces

Storage Planning Questions


When will more storage be required?
How much additional storage is
needed?
Where is the additional storage
needed?
What needs to be done to align the
additional storage with the DBMS?

Questions
