Designing For Performance

NetWorker Best Practices Designing for Performance
EMC Proven Professional Knowledge Sharing August, 2007
Matt Steinberg Senior Solutions Architect Cambridge Computer Services, Inc. msteinberg@cambridgecomputer.com
Table of Contents
Defining the Technical and Business Questions
    Business Questions
    Technical Questions
Identifying Backup System Bottlenecks
    Front-End Side Bottlenecks
    Central Bottlenecks
    Back-end Bottlenecks
Eliminating Backup System Bottlenecks
    Front-end Solutions
    Central Solutions
    Back-end Solutions
    Alternate Solutions
Conclusion
Author's Biography
Disclaimer: The views, processes or methodologies published in this article are those of the author. They do not necessarily reflect EMC Corporation's views, processes or methodologies.
This article addresses three distinct areas of NetWorker performance: defining business and technical questions, identifying backup system bottlenecks, and eliminating backup system bottlenecks.

Defining the Business and Technical Questions

In this first section, we define the business and technical problems affecting NetWorker performance in a backup environment. This is vital to ensure that solutions are focused on specific areas of concern.

Business Questions

There are business issues to keep in mind when designing any backup system. In most organizations, including non-profits, universities and hospitals, you must work with the customer to define two simple objectives: the Recovery Point Objective (RPO) and the Recovery Time Objective (RTO).

An organization must consider how much data it can lose while still remaining viable. Defining the RPO allows an organization to set goals about how far back an application or set of systems can be restored. For example, most customers will agree that email is a critical application, yet most email systems are backed up nightly. If email data is corrupted or lost before the next scheduled backup, restoring last night's backup means the entire day's email is lost. Is this acceptable to the customer? On the other hand, a simple front-end web server for a customer-facing application can be restored from the night before, or maybe even from a week before, since very little data changes and users will not be affected. Therefore, setting the RPO is important not just for the organization as a whole, but for each major application in the environment.

Once you've determined the restore interval, defining the RTO allows an organization to set goals for how quickly data must be recovered. To use the same example, some organizations may accept a 4-8 hour interval to restore email. The front-end web server is different, however. If it is the customer-facing application for a financial organization, it may have to be the first application restored. Every moment that it is unavailable costs the company money and customers' goodwill.

While RPO and RTO are related, they do not depend on each other. Each is independently important in determining the performance of a NetWorker environment, because both factor into the design of the backup system for backup and restore.
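As a rough illustration of how the two objectives translate into checks on a design, the sketch below treats the worst-case data loss as the time between backups and the restore time as data size divided by restore rate. Only the nightly backup and the 4-8 hour restore target come from the discussion above; the 4-hour RPO, the 200 GB data size and the 50 GB/hr restore rate are hypothetical values chosen for the example.

```python
def worst_case_loss_hours(backup_interval_hours: float) -> float:
    """Worst case, a failure happens just before the next backup runs."""
    return backup_interval_hours


def restore_hours(data_gb: float, restore_gb_per_hr: float) -> float:
    """Time to bring the data back at a sustained restore rate."""
    return data_gb / restore_gb_per_hr


def check_objectives(rpo_hours, rto_hours, backup_interval_hours,
                     data_gb, restore_gb_per_hr):
    """Return (meets_rpo, meets_rto) for one application."""
    meets_rpo = worst_case_loss_hours(backup_interval_hours) <= rpo_hours
    meets_rto = restore_hours(data_gb, restore_gb_per_hr) <= rto_hours
    return meets_rpo, meets_rto


# Hypothetical email system: nightly backup, 200 GB, restores at 50 GB/hr.
email = check_objectives(rpo_hours=4, rto_hours=8,
                         backup_interval_hours=24,
                         data_gb=200, restore_gb_per_hr=50)
print(email)  # (False, True): the restore is fast enough, but a day of mail can be lost
```

The two results are independent, which mirrors the point that RPO and RTO have to be set and checked separately for each application.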
Technical Questions

Once you've addressed the business questions, it is time to address the technical ones. The central question is: why are backup systems still an issue? There are several answers to discuss.
First, the infrastructure for conventional backups is not growing nearly as fast as disk capacities. In particular, tape capacities have not kept up with disk. A brief review of disk and DLT tape capacities over the generations illustrates the issue. In the days of single tape drives inside servers, a hard drive fit on a single tape. Today, with typical volumes over 2 TB, a volume residing on one tape is a distant memory.

Tape capacity is not the only factor; tape performance needs to increase in parallel. For example, if a data set is 100 GB and needs to be backed up in 4 hours, a 100 GB tape that writes at approximately 25 GB/hr (7 MB/s) is required. If the data set doubles to 200 GB and must be backed up in the same 4 hours, not only does the tape capacity need to double to 200 GB, but the performance of the tape drive also needs to double to 50 GB/hr (14 MB/s). Historically, disk capacities have more than doubled every 12-18 months, while tape capacities and performance have lagged behind, doubling every 18-24 months.

Tape backup is also inefficient by design. A typical backup environment is redundant, and wastes a large amount of tape media and time by performing multiple full backups weekly or even daily. Tape is a sequential medium; it can only be written or read from end to end. Therefore, a NetWorker backup system cannot update a backup set already written to tape. This results in a high ratio of tape capacity to production data, since most customers keep 30-90 days of backups on tape locally for restore.

Tape backup systems also require additional resources as production data grows, yet most people do not grow the backup system in parallel with the data. As a result, backup systems are designed and purchased for the current data set, and are not scalable enough to meet short- or long-term requirements.

The majority of backup systems still use tape despite its challenges. There are historical reasons for tape, such as low cost, easy portability and long shelf life.
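The backup-window arithmetic in this section (100 GB or 200 GB in a 4-hour window) can be reproduced with a few lines. This is a minimal sketch that assumes decimal units (1 GB = 1000 MB); the function name is illustrative only.

```python
def required_rate(data_gb: float, window_hours: float):
    """Return the sustained rate needed to finish inside the window.

    Assumes decimal units: 1 GB = 1000 MB.
    """
    gb_per_hr = data_gb / window_hours
    mb_per_s = gb_per_hr * 1000 / 3600
    return gb_per_hr, mb_per_s


for size_gb in (100, 200):
    gb_hr, mb_s = required_rate(size_gb, window_hours=4)
    print(f"{size_gb} GB in 4 h needs {gb_hr:.0f} GB/hr ({mb_s:.1f} MB/s)")
```

Doubling the data while holding the window fixed doubles the required drive speed, which is why capacity growth alone does not solve the problem.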
The cost of tape today is still roughly 10-20% of the cost of enterprise disk per GB, making it an affordable backup medium. The ability to move tape off-site easily and store it for 30 years makes it suitable for storage as retention policies dictate. Given these benefits, it is likely that most NetWorker environments will continue to use tape for specific use cases.

Identifying Backup System Bottlenecks

Once the business and technical questions have been answered, it is important to identify the specific bottlenecks in the NetWorker environment. Bottlenecks can be found in three distinct areas: front-end, central, and back-end, as illustrated in the following chart.
CHART: NetWorker Bottlenecks (front-end bottlenecks on the hosts; network bandwidth on the LAN in the front-end and central; central bottlenecks in the NetWorker Server/Storage Node; back-end storage devices)
Pinpointing bottlenecks can be more challenging than determining whether they exist at all. The bottleneck on a host can vary from job to job and host to host, and can fluctuate throughout the nightly backup. NetWorker provides simple tools to flush these out, including the reports built into NetWorker 7.3.x and the various logs in the system. Beyond the reports, there are more sophisticated ways to identify bottlenecks. One method uses Backup Advisor to analyze a backup system and graphically represent each host's performance throughout the night.

Front-End Side Bottlenecks

Most backup systems are designed from back to front, and neglect to consider front-end bottlenecks. The first front-end bottlenecks stem from how a host processes the data to be backed up. These include the network, CPU and RAM of the host, as well as the storage channel and the number of physical disk spindles on which the data resides. If a client cannot send data quickly and efficiently across the network to the NetWorker Server or Storage Node, a faster tape drive will not solve performance problems.

File and volume sizes are other front-end bottlenecks. File size is a big factor in how NetWorker processes a backup set when it reads the file system of a given client and sends it to a NetWorker Server or Storage Node. If there are a large number of small files, the host can have trouble reading all of the information because of how the files are laid out in the file system and on the particular disk drives. This causes disk thrashing that drastically decreases backup speed. In addition, NetWorker needs to catalog every file name in the host's Client File Index (CFI), resulting in a large database and a heavier load on the NetWorker Server.
Volume size also affects how long it takes to back up data. A larger volume results in a longer backup. The size at which this becomes a problem is relative to each environment, but for every backup system there is a volume size that takes longer than 24 hours to back up across the network. This is a serious issue for daily incremental or differential backups and for restores (returning to the discussion of RPO and RTO).

The last front-end bottleneck occurs in the particular application, for example Microsoft Exchange. Mailbox or brick-level backups of Exchange go through Microsoft's Messaging API (MAPI), which is slow for backup purposes and a common source of end-user frustration. An application like Exchange causes bottlenecks in a backup system that are sometimes impossible to resolve using conventional backup methods, though other options with NetWorker have recently become available.

Central Bottlenecks

Central bottlenecks have a major impact on performance in NetWorker environments, but are sometimes the easiest to address. Historically, if the NetWorker Server and/or Storage Node had only one 10/100 connection, it was severely limited in its ability to pull enough data from the network to write to the storage devices. A 100 Mb/s link can support, at the theoretical maximum, only 12.5 MB/s (45 GB/hr) of bandwidth. This problem is less common with gigabit networks, but still exists in some environments.

If an environment has only one NetWorker Server/Storage Node, that server can become its own bottleneck. When the Server has a lot of data to process, it can be bogged down by trying to write the data to a storage device while cataloging the backup data. On any major operating system, a NetWorker Server can only accommodate so many backup clients simultaneously. Therefore, as a NetWorker environment grows, there can be diminishing returns in performance through the NetWorker Server.
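The link figures quoted above follow from simple unit conversion, and the same conversion gives the largest volume a single link could move in 24 hours. The sketch below assumes decimal units and the theoretical line rate with no protocol overhead, so real numbers will be lower; the function names are illustrative.

```python
def link_capacity(mbit_per_s: float):
    """Theoretical (MB/s, GB/hr) for a link, ignoring protocol overhead."""
    mb_per_s = mbit_per_s / 8
    gb_per_hr = mb_per_s * 3600 / 1000
    return mb_per_s, gb_per_hr


def max_volume_gb(mbit_per_s: float, window_hours: float) -> float:
    """Largest volume that fits through the link inside the window."""
    return link_capacity(mbit_per_s)[1] * window_hours


print(link_capacity(100))      # (12.5, 45.0) for a 100 Mb/s link
print(link_capacity(1000))     # (125.0, 450.0) for gigabit
print(max_volume_gb(100, 24))  # 1080.0 GB in a full 24 hours
```

At best, then, a single 100 Mb/s connection caps a full backup at roughly 1 TB per day, before any host-side or tape-side limits are considered.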
Back-end Bottlenecks

Lastly, there are back-end bottlenecks, which lie in the storage devices NetWorker backs up to, mainly tape drives. There are two points to consider when looking at these back-end bottlenecks.

First, the number of tape drives in a tape library is an important factor in overall performance. For example, if there is one NetWorker Server/Storage Node that can write at a maximum of only 50 MB/s, and the tape library it is attached to holds four LTO-2 tape drives, it would be impossible to keep those tape drives streaming. Each LTO-2 tape drive has a native throughput of 35 MB/s, so four drives would require 140 MB/s of total throughput. The following chart illustrates this typical problem: one NetWorker Server attempts to keep four high-speed tape drives at full speed throughout the course of the night.

CHART: Individual Tape Drive Performance (Cambridge Computer Customer, 4/2003)
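The mismatch in the example above can be checked with a short calculation. This sketch uses the figures from the text (a 50 MB/s server and 35 MB/s native per LTO-2 drive) and assumes the server's output is split evenly across the drives, which is a simplification.

```python
def aggregate_demand(drives: int, native_mb_s: float) -> float:
    """Throughput needed to keep every drive at native speed."""
    return drives * native_mb_s


def per_drive_feed(server_mb_s: float, drives: int) -> float:
    """Data rate each drive receives if the server's output is split evenly."""
    return server_mb_s / drives


def drives_kept_streaming(server_mb_s: float, native_mb_s: float) -> int:
    """How many drives the server can feed at full native speed."""
    return int(server_mb_s // native_mb_s)


print(aggregate_demand(4, 35))        # 140 MB/s required
print(per_drive_feed(50, 4))          # 12.5 MB/s actually delivered per drive
print(drives_kept_streaming(50, 35))  # 1 drive can run at native speed
```

Under these assumptions each drive receives about a third of its native rate, so most of the drives spend the night starved for data.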
Second, the type and nature of these tape drives can become a bottleneck in certain environments. The concept of repositioning, or shoe-shining, is not a new one, yet it has become more important as linear tape drives have increased in speed. Shoe-shining occurs because of the linear nature of DLT, SDLT and LTO tape drives. In general, a linear tape drive's read/write head travels at approximately 160 inches/sec relative to the tape, writing the data in tracks in a serpentine fashion. As a tape drive receives data from a NetWorker Server/Storage Node, it stores the data in a tape drive buffer, and then flushes that buffer to the tape. When the buffer empties, the tape drive cannot simply stop and wait for the buffer to be refilled, because the tape is moving too fast. It must first slow down, then rewind, reposition and start again. This two-steps-forward, one-step-back motion causes the shoe-shine effect and degrades tape drive performance. With the shoe-shine concept in mind, it is easy to see how backup performance can suffer.

Designing a tape drive back-end is also important for restore and clone performance. These two operations are vital to any organization once RTO and RPO goals have been set, as well as for creating tapes for off-site retention in a timely manner. Simply adding more or faster tape drives is not the answer, and can sometimes make the problems worse.

Eliminating Backup System Bottlenecks

Front-end Solutions

Once you have identified all of the bottlenecks in a NetWorker backup system, they should be addressed from front to back. Beginning with the host side, there are two possible approaches to tackling problems: one addresses efficiency and the other relies on brute force.

In NetWorker environments, efficiency means either not backing up data more often than it needs to be backed up, or simply not backing it up at all. If a host struggles to back up across the network, reduce redundancy. This is accomplished by using the different levels available in NetWorker's schedule resource. NetWorker allows for more than the typical full and incremental backup strategy: it lets end users utilize 9 levels of differential backup, in addition to a unique consolidated backup feature.
With the levels of differential backup, an administrator can be creative in scheduling backups with multiple layers of sophistication. A numbered-level backup sends only the changes since the most recent backup at a lower level. For example, a level 5 backup backs up all changes since the last level 4, 3, 2, 1 or 0 (a full), whichever is most recent. Therefore, a typical monthly full schedule could be created to limit the amount of data a host has to send over the network.

        SUN  MON  TUE  WED  THU  FRI  SAT
Week 1  0    inc  inc  inc  inc  inc  inc
Week 2  5    inc  inc  inc  inc  inc  inc
Week 3  4    inc  inc  inc  inc  inc  inc
Week 4  3    inc  inc  inc  inc  inc  inc
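The level rule and the four-week schedule above can be modelled in a few lines to see which backups a restore would need. This is a simplified sketch, not NetWorker's implementation: it treats "inc" as a hypothetical level 10 that depends on the most recent backup of any level, and the function names are invented for the example.

```python
from typing import List, Optional

INCR = 10  # modelling assumption: an incremental ranks above levels 1-9


def base_index(levels: List[int], i: int) -> Optional[int]:
    """Index of the backup that backup i is taken relative to (None for a full)."""
    level = levels[i]
    if level == 0:
        return None
    for j in range(i - 1, -1, -1):
        # Incremental: most recent backup of any level.
        # Numbered level: most recent backup at a strictly lower level.
        if level == INCR or levels[j] < level:
            return j
    return None  # nothing earlier to build on


def restore_chain(levels: List[int], i: int) -> List[int]:
    """Backups needed, oldest first, to restore to the state at backup i."""
    chain = []
    idx: Optional[int] = i
    while idx is not None:
        chain.append(idx)
        idx = base_index(levels, idx)
    return chain[::-1]


# Four weeks, Sunday first: levels 0, 5, 4, 3 on Sundays, incrementals otherwise.
schedule: List[int] = []
for sunday_level in (0, 5, 4, 3):
    schedule += [sunday_level] + [INCR] * 6

# Restore to Wednesday of week 3 (day index 17, counting from 0).
print(restore_chain(schedule, 17))  # [0, 14, 15, 16, 17]
```

In this model the mid-month restore needs the full, the week 3 level 4, and three incrementals, rather than every backup taken since the start of the month.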
The backup schedule above performs a full backup once every four weeks, incrementals Monday through Saturday, and a numbered-level differential on each of the other Sundays, which backs up all changes since the last full backup. This allows the administrator to limit redundancy by performing full backups across the network only once a month, while also limiting the number of restores needed if a disaster occurs in the middle of the month: the only data restored would be the original full, the last differential and the incrementals that followed it.

You can also achieve backup efficiency by not backing up data at all. Utilizing an ILM or HSM strategy on the host file system enables the host to remove stale data that would otherwise be backed up multiple times even though it has not been accessed. DiskXtender, for instance, can migrate data that has not been accessed in 90 days or is over two years old onto lower-cost media, such as SATA disk, optical or tape, to eliminate backup redundancy. The software leaves a stub file, so the data can be retrieved if necessary but occupies less space in the file system. This drastically decreases the size of the backup and reduces the backup window on that server.

Applying brute force is the second method of eliminating front-end bottlenecks. NetWorker can brute-force a host backup in different ways. In the basic configuration of the client, server and device resources, there are settings to push large amounts of data through the backup system simultaneously. On the host side, the parallelism settings allow the client to push multiple streams of backup data to the Server/Storage Node. This technology, sometimes called multiplexing, forces the client to maximize its resources to send the backup data across the network.

A bottleneck caused by file size is difficult to solve with the previously mentioned methods. NetWorker has a module called SnapImage that addresses this particular issue. SnapImage is a volume-based backup for certain operating systems that allows the client to send data across the network in a block-based fashion, while still keeping track of the files that have been backed up. This gives the host the ability to perform file-level as well as full-volume restores, and the backup runs much faster because the client does not have to walk the file system and send millions of small files across the network.
Large volume size is the last front-end bottleneck. When large volumes take too long to transfer over the network, NetWorker's three-tiered architecture makes it possible to upgrade the client to a Dedicated Storage Node (DSN). A DSN is a server that writes its own data directly to a storage device, tape or disk, without traveling over the network. It backs up over SCSI or some type of SAN. Although a DSN has advantages, it also has disadvantages. Some servers, once they become a DSN, are too slow to keep up with the speed of the tape drive. If this happens, either the tape drive will be tied up for a long time during the backup, or it will shoe-shine. Either scenario could lead to a slower backup than if the server had been backed up across the network.

PowerSnap is another way to solve a bottleneck caused by large volume size. Utilizing the SAN, PowerSnap interacts with the snapshot software built into certain SAN hardware platforms, scheduling and retaining those snapshots within NetWorker. This gives administrators an extra level of protection from failure between backups, and also makes it possible to use the snapshot to perform a backup. PowerSnap can mount a snapshot of a large server's volume on the NetWorker Server or a Storage Node and back it up from there. Once the snapshot is mounted, the backup of that volume no longer moves across the network. Instead, it is backed up directly to the storage device. This takes the load off the server on which the snapshot was created.

Central Solutions

Two main bottlenecks were identified in the central section of the backup system. The first is network bandwidth on the NetWorker Server. A TCP Offload Engine (TOE) is a simple solution for this problem. It is a special NIC that moves the TCP/IP processing a server normally does in its CPU onto the card itself, greatly increasing the performance of network backup.

The second bottleneck, the load on a single NetWorker Server, is addressed by implementing a NetWorker Storage Node. A NetWorker Storage Node is part of the three-tiered architecture within NetWorker. It is a server that only moves data between the network and the storage devices, and sends all database information back to the NetWorker Server. Adding a NetWorker Storage Node can as much as double the amount of network backup an environment can handle. Storage Nodes help in backup systems with a lot of LAN data to back up, and in systems where storage devices need to be driven at high speed.

Back-end Solutions

Lastly, you must eliminate back-end bottlenecks. One way is to design a tape backup system with either very few linear tape drives, thus reducing the possibility of shoe-shining, or a tape drive technology that doesn't shoe-shine at all.
With the lessons learned from our discussion of central bottlenecks, an administrator can implement a separate server as a NetWorker Storage Node. It can then share a few fast linear tape drives with the NetWorker Server without shoe-shining. Utilizing Dynamic Drive Sharing (DDS) between the NetWorker Server and Storage Node, NetWorker's scheduling can coordinate the two servers so that the drives are always being written at full speed. The following chart shows an environment with a NetWorker Server and Storage Node backing up across a SAN to four tape drives.

CHART: NetWorker Server and Storage Node with DDS (NetWorker Server and NetWorker Storage Node on the LAN, sharing tape drives)

Purchasing a helical scan tape drive may also solve back-end performance problems. A helical scan tape drive uses a different technology than linear tape drives like DLT, SDLT or LTO. The most common helical scan tape drives are AIT and SAIT. They differ from linear tape in that they write data in bands, using an angled spinning head drum that writes diagonally across the tape. The drum spins at approximately 150 revolutions/sec, but the tape itself moves past the head at only about 1 inch/sec. This technology eliminates the possibility of shoe-shining because the tape is moving so slowly; if the tape drive's buffer empties, the tape can simply stop without repositioning.

Even if the tape technology is chosen to suit the particular environment, it still might not be fast or flexible enough to overcome the back-end bottlenecks of the backup system. While the right tape technology is important, many end users have begun to implement backup-to-disk strategies. Although backup-to-disk may seem like the answer to the problem, it still needs to be designed thoughtfully in accordance with the specific environment.

NetWorker has long had the ability to stage backups to disk, keep them for short periods of time, and then move them to tape automatically. This functionality was enhanced in NetWorker 7 with the introduction of the Advanced File Type Device (AFTD). The AFTD allows backups and restores to occur simultaneously and organizes the backups on disk in a random-access fashion, instead of the historical NetWorker tape method. This new method of storing backups takes advantage of the random-access nature of disk, greatly increasing the performance of disk backup.

The new AFTD disk backup does have disadvantages. The disk backup volumes are disk drives on the NetWorker Server or Storage Node. Because of this, the volumes have a storage size limit that depends on the file system used on the server. You will need more AFTDs if the maximum file system size is small, as in some Windows or Solaris file systems. Also, since the disk backup is done through a file system, it is nearly impossible to share an AFTD between a Server and a Storage Node, which multiplies the number of AFTDs needed. In larger environments, multiple AFTDs are complicated and cumbersome to manage.

Environments that still need disk backup can use a Virtual Tape Library (VTL). A VTL appliance manages the disk backup and presents a tape library to the NetWorker Server or Storage Node. Emulating a tape library can be beneficial to a larger NetWorker environment, as it can be shared easily. Sharing a VTL is like sharing any physical tape library, except that there is no limit on the number of tape drives. This makes it less complicated than sharing a tape drive using DDS. Utilizing a VTL can also greatly increase the performance of disk backup, because the VTL appliance manages the disk behind it, allowing for enhanced throughput.

Alternate Solutions

If the solutions we've discussed do not eliminate the backup system bottlenecks in a NetWorker environment, there are others. Some are based on Continuous Data Protection (CDP). CDP can be defined in multiple ways, but in essence it is a way to copy data more frequently than a typical backup system like NetWorker does. RepliStor is one of the oldest and most robust CDP solutions. It is a host-based, asynchronous replication product designed for Microsoft Windows. RepliStor, in conjunction with NetWorker, can greatly enhance a backup system.
Replication products can be used on servers that do not change as frequently as typical applications, taking them out of the backup system's daily backup rotation. If a tape copy is necessary, the secondary server that is the target of replication can be backed up instead, taking the load off the original host. If the secondary server is a Storage Node or DSN, the load is also taken off the network.

RecoverPoint is another CDP solution, one that differs somewhat from the host-based replication software products. RecoverPoint works as an out-of-band appliance, replicating data on a SAN to a SAN in a separate location. This solution could replace the typical NetWorker backup solution, since RecoverPoint could give an end user a restore point for file servers and applications as well as off-site retention.

Conclusion

There are a few steps you can take to tackle issues in a backup system. First, identify the needs and goals of the organization with regard to designing a NetWorker environment. Second, with these goals in mind, pinpoint the problems in the existing backup system. Once the issues have been identified, eliminate them using traditional or new solutions. Lastly, reconcile these solutions with the available budget.
Author's Biography

Matt Steinberg is a Senior Solutions Architect at Cambridge Computer Services, Inc., and has been working with customers to solve backup and storage problems for over six years. He has written several NetWorker Best Practice lectures that have been delivered in conjunction with EMC, and has taught certified NetWorker classes in multiple locations across the globe.