
EMC VPLEX Metro Witness Technology and High Availability

Version 2.1

EMC VPLEX Witness
VPLEX Metro High Availability
Metro HA Deployment Scenarios

Jennifer Aspesi
Oliver Shorey

Copyright 2010 - 2012 EMC Corporation. All rights reserved.


EMC believes the information in this publication is accurate as of its publication date. The information is subject to change without notice. THE INFORMATION IN THIS PUBLICATION IS PROVIDED AS IS. EMC CORPORATION MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND WITH RESPECT TO THE INFORMATION IN THIS PUBLICATION, AND SPECIFICALLY DISCLAIMS IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Use, copying, and distribution of any EMC software described in this publication requires an applicable software license. For the most up-to-date regulatory document for your product line, go to the Technical Documentation and Advisories section on EMC Powerlink. For the most up-to-date listing of EMC product names, see EMC Corporation Trademarks on EMC.com. All other trademarks used herein are the property of their respective owners.

Part number H7113.2


Contents

Preface

Chapter 1   VPLEX Family and Use Case Overview

Introduction ....................................................................................... 18
VPLEX value overview .................................................................... 19
VPLEX product offerings ................................................................ 23
    VPLEX Local, VPLEX Metro, and VPLEX Geo ...................... 23
    Architecture highlights .............................................................. 25
Metro high availability design considerations ............................. 28
    Planned application mobility compared with disaster restart ........ 29

Chapter 2

Hardware and Software


Introduction ....................................................................................... 32
    VPLEX I/O .................................................................................. 32
    High-level VPLEX I/O flow ...................................................... 32
    Distributed coherent cache ........................................................ 33
    VPLEX family clustering architecture ..................................... 33
    VPLEX single, dual, and quad engines ................................... 35
    VPLEX sizing tool ....................................................................... 35
    Upgrade paths ............................................................................. 36
    Hardware upgrades ................................................................... 36
    Software upgrades ...................................................................... 36
VPLEX management interfaces ...................................................... 37
    Web-based GUI ........................................................................... 37
    VPLEX CLI .................................................................................. 37
    SNMP support for performance statistics ............................... 38
    LDAP/AD support ..................................................................... 38


    VPLEX Element Manager API .................................................. 38
Simplified storage management ..................................................... 39
Management server user accounts ................................................. 40
Management server software .......................................................... 41
    Management console ................................................................. 41
    Command line interface ............................................................ 43
    System reporting ........................................................................ 44
Director software .............................................................................. 45
Configuration overview ................................................................... 46
    Single engine configurations .................................................... 46
    Dual configurations ................................................................... 47
    Quad configurations .................................................................. 48
I/O implementation ......................................................................... 50
    Cache coherence ......................................................................... 50
    Meta-directory ............................................................................ 50
    How a read is handled ............................................................... 50
    How a write is handled ............................................................. 52

Chapter 3

System and Component Integrity


Overview ............................................................................................ 54
Cluster ................................................................................................ 55
Path redundancy through different ports ..................................... 56
Path redundancy through different directors ............................... 57
Path redundancy through different engines ................................. 58
Path redundancy through site distribution ................................... 59
Serviceability ..................................................................................... 60

Chapter 4

Foundations of VPLEX High Availability


Foundations of VPLEX High Availability ..................................... 62
Failure handling without VPLEX Witness (static preference) .... 70

Chapter 5

Introduction to VPLEX Witness


VPLEX Witness overview and architecture .................................. 82
VPLEX Witness target solution, rules, and best practices ........... 85
VPLEX Witness failure semantics ................................................... 87
CLI example outputs ........................................................................ 93
VPLEX Witness: The importance of the third failure domain .... 97


Chapter 6

VPLEX Metro HA
VPLEX Metro HA overview ........................................................... 100
VPLEX Metro HA Campus (with cross-connect) ........................ 101
VPLEX Metro HA (without cross-cluster connection) ................ 111

Chapter 7

Conclusion
Conclusion ........................................................................................ 120
    Better protection from storage-related failures ..................... 121
    Protection from a larger array of possible failures ................ 121
    Greater overall resource utilization ........................................ 122

Glossary



Figures

Title                                                                                      Page

1   Application and data mobility example ................................................. 20
2   HA infrastructure example ................................................................ 21
3   Distributed data collaboration example ................................................ 22
4   VPLEX offerings .......................................................................... 24
5   Architecture highlights .................................................................. 26
6   VPLEX cluster example .................................................................... 34
7   VPLEX Management Console ................................................................. 42
8   Management Console welcome screen ....................................................... 43
9   VPLEX single engine configuration ....................................................... 47
10  VPLEX dual engine configuration ......................................................... 48
11  VPLEX quad engine configuration ......................................................... 49
12  Port redundancy .......................................................................... 56
13  Director redundancy ...................................................................... 57
14  Engine redundancy ........................................................................ 58
15  Site redundancy .......................................................................... 59
16  High level functional sites in communication ............................................ 62
17  High level Site A failure ................................................................ 63
18  High level Inter-site link failure ....................................................... 63
19  VPLEX active and functional between two sites ........................................... 64
20  VPLEX concept diagram with failure at Site A ............................................ 65
21  Correct resolution after volume failure at Site A ....................................... 66
22  VPLEX active and functional between two sites ........................................... 67
23  Inter-site link failure and cluster partition ........................................... 68
24  Correct handling of cluster partition ................................................... 69
25  VPLEX static detach rule ................................................................. 71
26  Typical detach rule setup ................................................................ 72
27  Non-preferred site failure ............................................................... 73
28  Volume remains active at Cluster 1 ...................................................... 74
29  Typical detach rule setup before link failure ........................................... 75
30  Inter-site link failure and cluster partition ........................................... 76
31  Suspension after inter-site link failure and cluster partition ......................... 77
32  Cluster 2 is preferred ................................................................... 78
33  Preferred site failure causes full Data Unavailability .................................. 79
34  High Level VPLEX Witness architecture ................................................... 83
35  High Level VPLEX Witness deployment ..................................................... 84
36  Supported VPLEX versions for VPLEX Witness .............................................. 86
37  VPLEX Witness volume types and rule support ............................................. 86
38  Typical VPLEX Witness configuration ..................................................... 87
39  VPLEX Witness and an inter-cluster link failure ......................................... 88
40  VPLEX Witness and static preference after cluster partition ............................ 89
41  VPLEX Witness typical configuration for cluster 2 detaches ............................. 90
42  VPLEX Witness diagram showing cluster 2 failure ......................................... 91
43  VPLEX Witness with static preference override ........................................... 92
44  Possible dual failure cluster isolation scenarios ....................................... 95
45  Highly unlikely dual failure scenarios that require manual intervention ............... 96
46  Two further dual failure scenarios that would require manual intervention ............. 97
47  High-level diagram of a Metro HA campus solution for VMware .......................... 101
48  Metro HA campus diagram with failure domains .......................................... 104
49  Metro HA campus diagram with disaster in zone A1 ...................................... 105
50  Metro HA campus diagram with failure in zone A2 ....................................... 106
51  Metro HA campus diagram with failure in zone A3 or B3 ................................. 107
52  Metro HA campus diagram with failure in zone C1 ....................................... 108
53  Metro HA campus diagram with intersite link failure ................................... 109
54  Metro HA Standard High-level diagram ................................................... 111
55  Metro HA high-level diagram with fault domains ........................................ 113
56  Metro HA high-level diagram with failure in domain A2 ................................. 114
57  Metro HA high-level diagram with intersite failure .................................... 116


Tables

Title                                                                                      Page

1   Overview of VPLEX features and benefits ................................................. 26
2   Configurations at a glance ............................................................... 35
3   Management server user accounts ......................................................... 40




Preface

This EMC Engineering TechBook describes how implementing VPLEX can provide a higher level of availability. As part of an effort to improve and enhance the performance and capabilities of its product lines, EMC periodically releases revisions of its hardware and software. Therefore, some functions described in this document may not be supported by all versions of the software or hardware currently in use. For the most up-to-date information on product features, refer to your product release notes. If a product does not function properly or does not function as described in this document, please contact your EMC representative.

Audience

This document is part of the EMC VPLEX family documentation set and is intended for use by storage and system administrators. Readers of this document are expected to be familiar with the following topics:

Storage area networks
Storage virtualization technologies
EMC Symmetrix, VNX series, and CLARiiON products

Related documentation

Refer to the EMC Powerlink website at http://powerlink.emc.com, where the majority of the following documentation can be found under Support > Technical Documentation and Advisories > Hardware Platforms > VPLEX Family.

EMC VPLEX Architecture Guide
EMC VPLEX Installation and Setup Guide
EMC VPLEX Site Preparation Guide


Implementation and Planning Best Practices for EMC VPLEX Technical Notes
Using VMware Virtualization Platforms with EMC VPLEX - Best Practices Planning
VMware KB: Using VPLEX Metro with VMware HA
Implementing EMC VPLEX Metro with Microsoft Hyper-V, Exchange Server 2010 with Enhanced Failover Clustering Support
White Paper: Using VMware vSphere with EMC VPLEX - Best Practices Planning
Oracle Extended RAC with EMC VPLEX Metro - Best Practices Planning
White Paper: EMC VPLEX with IBM AIX Virtualization and Clustering
White Paper: Conditions for Stretched Hosts Cluster Support on EMC VPLEX Metro
White Paper: Implementing EMC VPLEX and Microsoft Hyper-V and SQL Server with Enhanced Failover Clustering Support - Applied Technology

Organization of this TechBook

This document is divided into the following chapters:

Chapter 1, VPLEX Family and Use Case Overview, summarizes the VPLEX family. It also covers some of the key features of the VPLEX family system, its architecture, and its use cases.

Chapter 2, Hardware and Software, summarizes the hardware, software, and network components of the VPLEX system. It also highlights the software interfaces an administrator can use to manage all aspects of a VPLEX system.

Chapter 3, System and Component Integrity, summarizes how VPLEX clusters handle hardware failures in any subsystem within the storage cluster.

Chapter 4, Foundations of VPLEX High Availability, summarizes the industry-wide challenge of building truly highly available environments and how VPLEX Metro functionality addresses it.

Chapter 5, Introduction to VPLEX Witness, explains VPLEX Witness architecture and operation.



Chapter 6, VPLEX Metro HA, explains how VPLEX functionality can provide absolute HA capability by introducing a Witness to the inter-cluster environment.

Chapter 7, Conclusion, provides a summary of the benefits of using VPLEX technology as related to VPLEX Witness and High Availability.

Appendix A, vSphere 5.0 Update 1 Additional Settings, provides additional settings needed when using vSphere 5.0 Update 1.

Authors

This TechBook was authored by the following individuals from the Enterprise Storage Division, VPLEX Business Unit, based at EMC headquarters in Hopkinton, Massachusetts. Jennifer Aspesi has over 10 years of work experience with EMC in Storage Area Networks (SAN), Wide Area Networks (WAN), and network and storage security technologies. Jen currently manages the Corporate Systems Engineer team for the VPLEX Business Unit. She earned her M.S. in Marketing and Technological Innovation from Worcester Polytechnic Institute, Massachusetts. Oliver Shorey has over 11 years of experience working within the Business Continuity arena, seven of which have been with EMC engineering, designing and documenting high-end replication and geographically dispersed clustering technologies. He is currently a Principal Corporate Systems Engineer in the VPLEX Business Unit.

Additional contributors

Additional contributors to this book include: Colin Durocher has 8 years of experience developing, testing, and helping customers implement the EMC VPLEX product and its predecessor. He is currently working on the product management team for the VPLEX Business Unit. He has a B.S. in Computer Engineering from the University of Alberta and is currently pursuing an MBA at the John Molson School of Business. Gene Ortenberg has more than 15 years of experience building fault-tolerant distributed systems and applications. For the past 8 years he has been designing and developing highly available storage virtualization solutions at EMC. He is currently a Software Architect for the VPLEX Business Unit under the EMC Enterprise Storage Division.



Fernanda Torres has over 10 years of marketing experience in the consumer products industry, most recently in consumer electronics. Fernanda is the Product Marketing Manager for VPLEX under the EMC Enterprise Storage Division. She has an undergraduate degree from the University of Notre Dame and a bilingual degree (English/Spanish) from IESE in Barcelona, Spain.

Typographical conventions

EMC uses the following type style conventions in this document:

Normal: Used in running (nonprocedural) text for: names of interface elements (such as names of windows, dialog boxes, buttons, fields, and menus); names of resources, attributes, pools, Boolean expressions, buttons, DQL statements, keywords, clauses, environment variables, functions, and utilities; URLs, pathnames, filenames, directory names, computer names, links, groups, service keys, file systems, and notifications.

Bold: Used in running (nonprocedural) text for names of commands, daemons, options, programs, processes, services, applications, utilities, kernels, notifications, system calls, and man pages. Used in procedures for names of interface elements (such as names of windows, dialog boxes, buttons, fields, and menus) and for what the user specifically selects, clicks, presses, or types.

Italic: Used in all text (including procedures) for full titles of publications referenced in text, emphasis (for example, a new term), and variables.

Courier: Used for system output, such as an error message or script, and for URLs, complete paths, filenames, prompts, and syntax when shown outside of running text.

Courier bold: Used for specific user input (such as commands).

Courier italic: Used in procedures for variables on the command line and user input variables.

< >: Angle brackets enclose parameter or variable values supplied by the user.

[ ]: Square brackets enclose optional values.

|: Vertical bar indicates alternate selections; the bar means "or".

{ }: Braces indicate content that you must specify (that is, x or y or z).

...: Ellipses indicate nonessential information omitted from the example.

We'd like to hear from you!

Your feedback on our TechBooks is important to us! We want our books to be as helpful and relevant as possible, so please feel free to send us your comments, opinions and thoughts on this or any other TechBook:
TechBooks@emc.com




1
VPLEX Family and Use Case Overview

This chapter provides a brief summary of the main use cases for the EMC VPLEX family and design considerations for high availability. It also covers some of the key features of the VPLEX family system. Topics include:

Introduction ........................................................................................ 18
VPLEX value overview ..................................................................... 19
VPLEX product offerings ................................................................. 23
Metro high availability design considerations .............................. 28



Introduction
The purpose of this TechBook is to introduce EMC VPLEX high availability and the VPLEX Witness as they are typically architected by customer storage administrators and EMC Solutions Architects. VPLEX Witness provides customers with absolute physical and logical fabric and cache-coherent redundancy when it is properly designed into the VPLEX Metro environment. This TechBook provides an overview of the features and functionality associated with the VPLEX Metro configuration and the importance of active/active data resiliency for today's advanced host applications.


VPLEX value overview


At the highest level, VPLEX has unique capabilities that storage administrators value and seek in order to enhance their existing data centers. It delivers distributed, dynamic, and smart functionality into existing or new data centers to provide storage virtualization across geographical boundaries.

VPLEX is distributed because it provides a single interface for multi-vendor storage and delivers dynamic data mobility, enabling applications and data to move in real time with no outage required. VPLEX is dynamic because it provides data availability and flexibility, maintaining business continuity through failures that traditionally required outages or manual restore procedures. VPLEX is smart because its unique AccessAnywhere technology can present and keep the same data consistent within and between sites, enabling distributed data collaboration.

Because of these capabilities, VPLEX delivers unique and differentiated value to address three distinct requirements within our target customers' IT environments:

The ability to dynamically move applications and data across different compute and storage installations, be they within the same data center, across a campus, within a geographical region, and now, with VPLEX Geo, across even greater distances.

The ability to create a high-availability storage and compute infrastructure across these same varied geographies with unmatched resiliency.

The ability to provide efficient real-time data collaboration over distance for such big data applications as video, geographic/oceanographic research, and more.

EMC VPLEX technology is a scalable, distributed-storage federation solution that provides non-disruptive, heterogeneous data-movement and volume-management functionality. Insert VPLEX technology between hosts and storage in a storage area network (SAN) and data can be extended over distance within, between, and across data centers.


The VPLEX architecture provides a highly available solution suitable for many deployment strategies including:

Application and Data Mobility: The movement of virtual machines (VMs) without downtime. An example is shown in Figure 1.

Figure 1

Application and data mobility example

Storage administrators have the ability to automatically balance loads through VPLEX, using storage and compute resources from either cluster's location. When combined with server virtualization, VPLEX allows users to transparently move and relocate virtual machines and their corresponding applications and data over distance. This provides a unique capability allowing users to relocate, share, and balance infrastructure resources between sites, which can be within a campus or between data centers, up to 5 ms apart with VPLEX Metro, or farther apart (50 ms RTT) across asynchronous distances with VPLEX Geo.

Note: Submit an RPQ if VPLEX Metro is required at up to 10 ms RTT, or check the support matrix for the latest supported latencies.


HA Infrastructure: Reduces recovery time objective (RTO). An example is shown in Figure 2.

Figure 2

HA infrastructure example

High availability is a term that many products claim to deliver. Ultimately, a high availability solution is supposed to protect against a failure and keep an application online. Storage administrators plan around HA to provide near-continuous uptime for their critical applications and to automate the restart of an application once a failure has occurred, with as little human intervention as possible. With conventional solutions, customers typically have to accept a recovery point objective (RPO) and a recovery time objective (RTO). But even though some solutions offer small RTOs and RPOs, there can still be downtime, and for most customers any downtime can be costly.
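The cost of a nonzero RPO and RTO can be made concrete with simple arithmetic. The sketch below is illustrative only; the function name and the two-term cost model are assumptions for this example, not from this TechBook: one failover event costs the revenue lost during the recovery window plus the rework needed to re-create data written since the last recovery point.

```python
def outage_cost(rto_hours: float, rpo_hours: float,
                revenue_per_hour: float, rework_per_lost_hour: float) -> float:
    """Rough cost of one failover event with a conventional DR solution:
    revenue lost while the application is down (the RTO window) plus the
    cost of re-creating data written since the last recovery point (RPO).
    Hypothetical model for illustration only.
    """
    return rto_hours * revenue_per_hour + rpo_hours * rework_per_lost_hour

# Example: a 4-hour recovery with 1 hour of lost writes, at $10,000/hour
# of downtime and $5,000/hour of rework, costs $45,000 per event.
print(outage_cost(4, 1, 10_000, 5_000))
```

Even small per-event figures like these are why an active/active design that avoids the failover event altogether is attractive.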


Distributed Data Collaboration: Increases utilization of passive disaster recovery (DR) assets and provides simultaneous access to data. An example is shown in Figure 3.

Figure 3

Distributed data collaboration example

This is when a workforce has multiple users at different sites who need to work on the same data and maintain consistency in the dataset when changes are made. Use cases include co-development of software, where development happens across different teams in separate locations, and collaborative workflows such as engineering, graphic arts, videos, educational programs, designs, research reports, and so forth. When customers have tried to build collaboration across distance with traditional solutions, they normally have to save the entire file at one location and then send it to another site using FTP. This is slow, can incur heavy bandwidth costs for large files (or even small files that move regularly), and negatively impacts productivity, because the other sites sit idle while they wait to receive the latest data. If teams decide to do their own work independently of each other, the dataset quickly becomes inconsistent, as multiple people work on it at the same time unaware of each other's most recent changes. Bringing all of the changes together in the end is time-consuming, costly, and grows more complicated as the dataset gets larger.


VPLEX product offerings


VPLEX first meets high-availability and data mobility requirements and then scales up to the I/O throughput required for the front-end applications and back-end storage. High-availability and data mobility features are characteristics of VPLEX Local, VPLEX Metro, and VPLEX Geo. A VPLEX cluster consists of one, two, or four engines (each containing two directors), and a management server. A dual-engine or quad-engine cluster also contains a pair of Fibre Channel switches for communication between directors. Each engine is protected by a standby power supply (SPS), and each Fibre Channel switch gets its power through an uninterruptible power supply (UPS). (In a dual-engine or quad-engine cluster, the management server also gets power from a UPS.) The management server has a public Ethernet port, which provides cluster management services when connected to the customer network. This section provides information on the following:

VPLEX Local, VPLEX Metro, and VPLEX Geo on page 23 Architecture highlights on page 25

VPLEX Local, VPLEX Metro, and VPLEX Geo


EMC offers VPLEX in three configurations to address customer needs for high-availability and data mobility:

VPLEX Local
VPLEX Metro
VPLEX Geo


Figure 4 provides an example of each.

Figure 4

VPLEX offerings

VPLEX Local

VPLEX Local provides seamless, non-disruptive data mobility and the ability to manage multiple heterogeneous arrays from a single interface within a data center. VPLEX Local allows increased availability, simplified management, and improved utilization across multiple arrays.

VPLEX Metro with AccessAnywhere

VPLEX Metro with AccessAnywhere enables active-active, block-level access to data between two sites within synchronous distances. The distance is limited by what synchronous behavior can withstand, as well as by host application stability and MAN traffic considerations. Depending on the application, it is recommended that Metro latency be less than or equal to 5 ms RTT.1 The combination of virtual storage with VPLEX Metro and virtual servers enables the transparent movement of virtual machines and storage across a distance. This technology provides improved utilization across heterogeneous arrays and multiple sites.

1. Refer to VPLEX and vendor-specific White Papers for confirmation of latency limitations.


VPLEX Geo with AccessAnywhere

VPLEX Geo with AccessAnywhere enables active-active, block-level access to data between two sites within asynchronous distances. VPLEX Geo enables better, more cost-effective use of resources and power. Geo provides the same distributed device flexibility as Metro but extends the distance up to 50 ms RTT. As with any asynchronous transport medium, bandwidth is also important to consider for optimal behavior, as is application sharing on the link.

Note: For the purposes of this TechBook, the focus is on the Metro configuration only. VPLEX Witness is supported with VPLEX Geo; however, it is beyond the scope of this TechBook.
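The synchronous/asynchronous split above follows directly from latency arithmetic: a synchronous distributed write cannot be acknowledged to the host until the remote cluster has seen it, so every dependent write absorbs at least one inter-site round trip. A minimal sketch of that bound follows; the figures and function names are illustrative assumptions, and real throughput also depends on array latency, queue depth, and link bandwidth.

```python
def sync_write_service_time(local_ms: float, rtt_ms: float) -> float:
    """Minimum service time for one synchronous distributed write:
    local processing plus at least one inter-site round trip."""
    return local_ms + rtt_ms

def max_dependent_writes_per_sec(local_ms: float, rtt_ms: float) -> float:
    """Upper bound on the rate of a single stream of dependent writes,
    where each write must complete before the next can start."""
    return 1000.0 / sync_write_service_time(local_ms, rtt_ms)

# With 1 ms of local latency, a 5 ms RTT (Metro range) still sustains
# roughly 166 dependent writes per second per stream, while a 50 ms RTT
# would allow only about 19 -- which is why longer distances use
# asynchronous (Geo) semantics instead of synchronous ones.
metro = max_dependent_writes_per_sec(1.0, 5.0)
geo_if_sync = max_dependent_writes_per_sec(1.0, 50.0)
```

This is the same reasoning behind the 5 ms Metro guidance earlier in this chapter: past a few milliseconds of round trip, synchronous write streams slow noticeably.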

Architecture highlights
VPLEX support is open and heterogeneous, supporting both EMC storage and common arrays from other storage vendors, such as HDS, HP, and IBM. VPLEX conforms to established world wide name (WWN) guidelines that can be used for zoning. VPLEX supports operating systems in both physical and virtual server environments, with VMware ESX and Microsoft Hyper-V. VPLEX supports network fabrics from Brocade and Cisco, including legacy McData SANs.

Note: For the latest information please refer to the ESSM (EMC Simple Support Matrix) for supported host types as well as the connectivity ESM for fabric and extended fabric support.


An example of the architecture is shown in Figure 5.

Figure 5    Architecture highlights

Table 1 provides an overview of VPLEX features and their benefits.


Table 1    Overview of VPLEX features and benefits

Features                         Benefits
Mobility                         Move data and applications without impact on users.
Resiliency                       Mirror across arrays without host impact, and increase
                                 high availability for critical applications.
Distributed cache coherency      Automate sharing, balancing, and failover of I/O across
                                 the cluster and between clusters.
Advanced data caching            Improve I/O performance and reduce storage array
                                 contention.
Virtual Storage federation       Achieve transparent mobility and access in a data center
                                 and between data centers.
Scale-out cluster architecture   Start small and grow larger with predictable service
                                 levels.

For all VPLEX products, the appliance-based VPLEX technology:

- Presents storage area network (SAN) volumes from back-end arrays to VPLEX engines
- Packages the SAN volumes into sets of VPLEX virtual volumes with user-defined configuration and protection levels
- Presents virtual volumes to production hosts in the SAN via the VPLEX front-end
- For VPLEX Metro and VPLEX Geo products, presents a global, block-level directory for distributed cache and I/O between VPLEX clusters

Location and distance determine high-availability and data mobility requirements. For example, if all storage arrays are in a single data center, a VPLEX Local product federates back-end storage arrays within the data center. When back-end storage arrays span two data centers, the AccessAnywhere feature in a VPLEX Metro or a VPLEX Geo product federates storage in an active-active configuration between VPLEX clusters. Choosing between VPLEX Metro or VPLEX Geo depends on distance and data synchronicity requirements. Application and back-end storage I/O throughput determine the number of engines in each VPLEX cluster. High-availability features within the VPLEX cluster allow for non-disruptive software upgrades and expansion as I/O throughput increases.
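The product-selection reasoning above can be sketched as a simple decision function. The latency thresholds (5 ms RTT for synchronous Metro, 50 ms RTT for asynchronous Geo) come from this chapter; the function itself is purely illustrative, not an EMC tool.

```python
def choose_vplex_product(sites: int, rtt_ms: float) -> str:
    """Pick a VPLEX product from topology and inter-site latency.

    Thresholds follow the guidance in this chapter: roughly 5 ms RTT
    for synchronous Metro, up to 50 ms RTT for asynchronous Geo.
    """
    if sites == 1:
        return "VPLEX Local"      # federate arrays inside one data center
    if rtt_ms <= 5:
        return "VPLEX Metro"      # synchronous, active/active between sites
    if rtt_ms <= 50:
        return "VPLEX Geo"        # asynchronous, active/active between sites
    return "unsupported distance"

print(choose_vplex_product(1, 0))    # VPLEX Local
print(choose_vplex_product(2, 3))    # VPLEX Metro
print(choose_vplex_product(2, 30))   # VPLEX Geo
```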


Metro high availability design considerations


VPLEX Metro 5.0 (and above) introduces high availability concepts beyond what is traditionally known as physical high availability. Introducing VPLEX Witness into a high availability environment allows the VPLEX solution to increase the overall availability of the environment by arbitrating between a pure communication failure between two primary sites and a true site failure in a multi-site architecture. EMC VPLEX is the first product to bring to market the features and functionality provided by VPLEX Witness, which observes the activity between clusters in a multi-site architecture and guides failure handling between them. Through this TechBook, administrators and customers gain an understanding of the high availability solution that VPLEX provides them:

- Enabling of load balancing between their data centers
- Active/active use of both of their data centers
- Increased availability for their applications (no single points of storage failure, auto-restart)
- Fully automatic failure handling
- Better resource utilization
- Lower CapEx and lower OpEx as a result

Broadly speaking, legacy environments typically implement highly available designs within a data center and disaster recovery functionality between data centers. One of the main reasons for this is that within a data center components generally operate active/active (or active/passive with automatic failover), whereas between data centers legacy replication technologies use active/passive techniques that require manual failover to bring the passive component into use. When using VPLEX Metro active/active replication technology in conjunction with new features, such as the VPLEX Witness server (as described in Introduction to VPLEX Witness on page 81), the lines between local high availability and long-distance disaster recovery become somewhat blurred, since HA can be stretched beyond the data center walls. Since replication is a by-product of federated and distributed storage, disaster avoidance is also achievable within these geographically dispersed HA environments.

Planned application mobility compared with disaster restart


This section compares planned application mobility and disaster restart.

Planned application mobility

An online planned application mobility event is one in which an application or virtual machine is moved fully online, without disruption, from one location to another in either the same or a remote data center. This type of movement can only be performed when all components that participate in the move are available (for example, the running state of the application or VM exists in volatile memory, which would not be the case if the active site had failed) and when all participating hosts have read/write access to the same block storage at both locations. Additionally, a mechanism is required to transition volatile memory data from one system/host to another. When performing planned online mobility jobs over distance, a prerequisite is the use of an active/active underlying storage replication solution (VPLEX Metro only at the time of this publication).

An example of online application mobility is VMware vMotion, where a virtual machine must be fully operational before it can be moved. It may sound obvious, but if the VM were offline the movement could not be performed online (this is important to understand and is the key difference over application restart). When vMotion is executed, all live components required to make the VM function are copied elsewhere in the background before the VM is cut over. Since these mobility tasks are totally seamless to the user, typical use cases include disaster avoidance, where an application or VM is moved ahead of a disaster (such as a hurricane or tsunami) while the running state is still available to be copied, and load balancing across multiple systems or even data centers. Because the running state must be available for these types of relocations, these movements are always deemed planned activities.


Disaster restart

Disaster restart is when an application or service is restarted in another location after a failure (be it on a different server or in a different data center) and typically interrupts the service/application during the failover. A good example of this technology is a VMware HA cluster configured over two geographically dispersed sites using VPLEX Metro, where a cluster is formed over a number of ESX servers and either single or multiple virtual machines can run on any of the ESX servers within the cluster. If an active ESX server fails (perhaps due to site failure), the VM can be restarted on a remaining ESX server within the cluster at the remote site, because the datastore where it was running spans the two locations on a VPLEX Metro distributed volume. This is deemed an unplanned failover and incurs a small outage of the application: the running state of the VM was lost when the ESX server failed, so the service is unavailable until the VM has restarted elsewhere.

Although a planned application mobility event and an unplanned disaster restart produce the same outcome (a service relocating elsewhere), there is a big difference: the planned mobility job keeps the application online during the relocation, whereas the disaster restart takes the application offline while a restart is conducted. Compared to active/active technologies, the use of legacy active/passive solutions in these restart scenarios typically requires an extra step over and above standard application failover, since a storage failover is also required (that is, changing the status of a write-disabled remote copy to read/write and reversing the replication direction flow).
This is where VPLEX can assist greatly: because it is active/active, in most cases no manual intervention at the storage layer is required, which greatly reduces the complexity of a DR failover solution. If best practices for physically highly available and redundant hardware connectivity are followed, VPLEX Witness will truly provide customers with absolute availability.


Chapter 2
Hardware and Software

This chapter provides insight into the hardware and software interfaces that can be used by an administrator to manage all aspects of a VPLEX system. In addition, a brief overview of the internal system software is included. Topics include:

- Introduction ................................................................................... 32
- VPLEX management interfaces ................................................... 37
- Simplified storage management .................................................. 39
- Management server user accounts .............................................. 40
- Management server software ....................................................... 41
- Director software ........................................................................... 45
- Configuration overview ................................................................ 46
- I/O implementation ...................................................................... 50


Introduction
This section provides basic information on the following:

- VPLEX I/O on page 32
- High-level VPLEX I/O flow on page 32
- Distributed coherent cache on page 33
- VPLEX family clustering architecture on page 33

VPLEX I/O
VPLEX is built on a lightweight protocol that maintains cache coherency for storage I/O and the VPLEX cluster provides highly available cache, processing power, front-end, and back-end Fibre Channel interfaces. EMC hardware powers the VPLEX cluster design so that all devices are always available and I/O that enters the cluster from anywhere can be serviced by any node within the cluster. The AccessAnywhere feature in the VPLEX Metro and VPLEX Geo products extends the cache coherency between data centers at a distance.

High-level VPLEX I/O flow


VPLEX abstracts a block-level ownership model into a highly organized hierarchical directory structure that is updated for every I/O and shared across all engines. The directory uses a small amount of metadata to tell all other engines in the cluster, in 4K block transmissions, which block of data is owned by which engine and at what time. After a write completes and ownership is reflected in the directory, VPLEX dynamically manages read requests for the completed write in the most efficient way possible. When a read request arrives, VPLEX checks the directory for an owner. After VPLEX locates the owner, the read request goes directly to that engine.


On reads from other engines, VPLEX checks the directory and tries to pull the read I/O directly from the engine cache to avoid going to the physical arrays to satisfy the read. This model enables VPLEX to stretch the cluster, as VPLEX distributes the directory between clusters and sites. Due to the hierarchical nature of the VPLEX directory, VPLEX operates efficiently with minimal overhead and enables I/O communication over distance.
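As a rough illustration of the ownership directory described above, the following sketch (invented names and plain dictionaries, not VPLEX internals) shows how recording the last writer lets reads be routed to the owning engine's cache instead of the back-end array:

```python
class CacheDirectory:
    """Toy model of the per-volume directory: for each block it records
    which engine last wrote it, so a read can be routed to the owner's
    cache rather than the physical array. Illustrative only."""

    def __init__(self):
        self.owner = {}  # block number -> owning engine id

    def record_write(self, block: int, engine: str) -> None:
        # a completed write makes the writing engine the block's owner
        self.owner[block] = engine

    def route_read(self, block: int) -> str:
        # reads go to the owning engine's cache; with no owner recorded,
        # the read falls through to the back-end array
        return self.owner.get(block, "back-end array")

d = CacheDirectory()
d.record_write(42, "engine-1")
print(d.route_read(42))   # engine-1
print(d.route_read(7))    # back-end array
```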

Distributed coherent cache


The VPLEX engine includes two directors that each have a total of 36 GB (version 5 hardware, also known as VS2) of local cache. Cache pages are keyed by volume and go through a lifecycle from staging, to visible, to draining. The global cache is a combination of all director caches that spans all clusters. The cache page holder information is maintained in a memory data structure called a directory. The directory is divided into chunks and distributed among the VPLEX directors and locality controls where ownership is maintained. A meta-directory identifies which director owns which directory chunks within the global directory.

VPLEX family clustering architecture


The VPLEX family uses a unique clustering architecture to help customers break the boundaries of the data center and allow servers at multiple data centers to have read/write access to shared block storage devices. A VPLEX cluster, as shown in Figure 6 on page 34, can scale up through the addition of more engines, and scale out by connecting clusters into an EMC VPLEX Metro (two VPLEX Metro clusters connected within Metro distances).


Figure 6

VPLEX cluster example

VPLEX Metro transparently moves and shares workloads for a variety of applications, VMs, databases, and cluster file systems. VPLEX Metro consolidates data centers and optimizes resource utilization across them. In addition, it provides non-disruptive data mobility, heterogeneous storage management, and improved application availability. VPLEX Metro supports up to two clusters, which can be in the same data center or at two different sites within synchronous environments. VPLEX Geo, the asynchronous partner to Metro, extends clustering across greater distances; analyzing VPLEX Geo capabilities is outside the scope of this document.


VPLEX single, dual, and quad engines


The VPLEX engine provides cache and processing power with redundant directors that each include two I/O modules per director and one optional WAN COM I/O module for use in VPLEX Metro and VPLEX Geo configurations. The rackable hardware components are shipped in NEMA standard racks or provided, as an option, as a field rackable product. Table 2 provides a list of configurations.
Table 2    Configurations at a glance

Components                                    Single engine   Dual engine   Quad engine
Directors                                     2               4             8
Redundant Engine SPSs                         Yes             Yes           Yes
FE Fibre Channel ports (VS1)                  16              32            64
FE Fibre Channel ports (VS2)                  8               16            32
BE Fibre Channel ports (VS1)                  16              32            64
BE Fibre Channel ports (VS2)                  8               16            32
Cache size (VS1 Hardware)                     64 GB           128 GB        256 GB
Cache size (VS2 Hardware)                     72 GB           144 GB        288 GB
Management Servers                            1               1             1
Internal Fibre Channel switches (Local Comm)  None            2             2
Uninterruptable Power Supplies (UPSs)         None            2             2

VPLEX sizing tool


Use the EMC VPLEX sizing tool provided by EMC Global Services Software Development to configure the right VPLEX cluster configuration. The sizing tool concentrates on I/O throughput requirement for installed applications (mail exchange, OLTP, data warehouse, video streaming, etc.) and back-end configuration such as virtual volumes, size and quantity of storage volumes, and initiators.


Upgrade paths
VPLEX facilitates application and storage upgrades without a service window through its flexibility to shift production workloads throughout the VPLEX technology. In addition, high-availability features of the VPLEX cluster allow for non-disruptive VPLEX hardware and software upgrades. This flexibility means that VPLEX is always servicing I/O and never has to be completely shut down.

Hardware upgrades
Upgrades are supported for single-engine VPLEX systems to dual- or quad-engine systems. A single VPLEX Local system can be reconfigured to work as a VPLEX Metro or VPLEX Geo by adding a new remote VPLEX cluster. Additionally, an entire VPLEX VS1 cluster (hardware) can be fully upgraded to VS2 hardware non-disruptively. Information for VPLEX hardware upgrades is in the Procedure Generator that is available through EMC PowerLink.

Software upgrades
VPLEX features a robust non-disruptive upgrade (NDU) technology to upgrade the software on VPLEX engines and VPLEX Witness servers. Management server software must be upgraded before running the NDU. Due to the VPLEX distributed coherent cache, directors elsewhere in the VPLEX installation service I/Os while the upgrade is taking place. This alleviates the need for service windows and reduces RTO. The NDU includes the following steps:

- Preparing the VPLEX system for the NDU
- Starting the NDU
- Transferring the I/O to an upgraded director
- Completing the NDU
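The effect of these steps can be illustrated with a toy rolling-upgrade loop, in which directors are upgraded one at a time so the rest keep servicing I/O throughout. This is a conceptual sketch only; the real NDU is orchestrated by VPLEX itself, and the director names are invented.

```python
def rolling_ndu(directors, new_version):
    """Upgrade directors one at a time; the remaining directors keep
    servicing I/O, so there is no service window. Illustrative only."""
    for d in directors:
        serving = [x for x in directors if x is not d]
        assert serving, "at least one director must keep serving I/O"
        d["version"] = new_version  # upgrade this director
    return directors

dirs = [{"name": "director-1-1-A", "version": "5.0"},
        {"name": "director-1-1-B", "version": "5.0"}]
rolling_ndu(dirs, "5.1")
print(all(d["version"] == "5.1" for d in dirs))  # True
```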


VPLEX management interfaces


Within the VPLEX cluster, TCP/IP-based management traffic travels through a private network subnet to the components in one or more clusters. In VPLEX Metro and VPLEX Geo, VPLEX establishes a VPN tunnel between the management servers of both clusters. When VPLEX Witness is deployed, the VPN tunnel is extended to a 3-way tunnel including both Management Servers and VPLEX Witness.

Web-based GUI
VPLEX includes a Web-based graphical user interface (GUI) for management. The EMC VPLEX Management Console Help provides more information on using this interface. To perform other VPLEX operations that are not available in the GUI, refer to the CLI, which supports full functionality. The EMC VPLEX CLI Guide provides a comprehensive list of VPLEX commands and detailed instructions on using those commands. The EMC VPLEX Management Console contains but is not limited to the following functions:

- Supports storage array discovery and provisioning
- Local provisioning
- Distributed provisioning
- Mobility Central
- Online help

VPLEX CLI
VPlexcli is a command line interface (CLI) used to configure and operate VPLEX systems. It also provides the EZ Wizard Setup process to make installation of VPLEX easier and quicker. The CLI is divided into command contexts. Some commands are accessible from all contexts, and are referred to as global commands. The remaining commands are arranged in a hierarchical context tree and can only be executed from the appropriate location in the context tree.
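The context-tree idea can be sketched as follows. The context names and commands below are invented for illustration and do not reflect the actual VPlexcli tree; the point is simply that a context exposes its own commands plus the global ones.

```python
# Toy model of a hierarchical CLI with global commands available
# everywhere and context-specific commands scoped to their context.
contexts = {
    "/": {"ls", "cd", "help"},                       # global commands
    "/clusters": {"summary"},
    "/clusters/cluster-1/virtual-volumes": {"create", "destroy"},
}

def available(context: str) -> set:
    """Commands runnable in a context: its own plus the global ones."""
    return contexts["/"] | contexts.get(context, set())

print("create" in available("/clusters/cluster-1/virtual-volumes"))  # True
print("create" in available("/clusters"))                            # False
```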


The VPlexcli encompasses all capabilities needed to function if the management station is unavailable. It is fully functional and comprehensive, supporting full configuration, provisioning, and advanced systems management capabilities.

SNMP support for performance statistics


The VPLEX snmpv2c SNMP agent:

- Supports retrieval of performance-related statistics as published in the VPLEX-MIB.mib
- Runs on the management server and fetches performance-related data from individual directors using a firmware-specific interface
- Provides SNMP MIB data for directors for the local cluster only

LDAP/AD support


VPLEX offers Lightweight Directory Access Protocol (LDAP) or Active Directory for an authentication directory service.

VPLEX Element Manager API


VPLEX Element Manager API uses the Representational State Transfer (REST) software architecture for distributed systems such as the World Wide Web. It allows software developers and other users to use the API to create scripts to run VPLEX CLI commands. The VPLEX Element Manager API supports all VPLEX CLI commands that can be executed from the root context on a director.
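As a hedged sketch of scripting against a REST-style element manager, the following builds a request for a CLI command. The URL scheme and payload shape here are assumptions made for illustration, not the documented VPLEX wire format; consult the Element Manager API documentation for the real endpoints.

```python
import json

def build_cli_request(host: str, command: str) -> tuple:
    """Build a (url, body) pair for submitting a CLI command to a
    REST-style management endpoint. Hypothetical URL and payload."""
    parts = command.split()
    url = f"https://{host}/vplex/{parts[0]}"          # hypothetical path
    body = json.dumps({"args": " ".join(parts[1:])})  # hypothetical shape
    return url, body

url, body = build_cli_request("mgmt-server", "version -a")
print(url)   # https://mgmt-server/vplex/version
print(body)  # {"args": "-a"}
```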


Simplified storage management


VPLEX supports a variety of arrays from various vendors covering both active/active and active/passive type arrays. VPLEX simplifies storage management by allowing simple LUNs, provisioned from the various arrays, to be managed through a centralized management interface that is simple to use and very intuitive. In addition, a VPLEX Metro or VPLEX Geo environment that spans data centers allows the storage administrator to manage both locations through the one interface from either location by logging in at the local site.


Management server user accounts


The management server requires the setup of user accounts for access to certain tasks. Table 3 describes the types of user accounts on the management server.
Table 3    Management server user accounts

Account type            Purpose
admin (customer)        Performs administrative actions, such as user management.
                        Creates and deletes Linux CLI accounts.
                        Resets passwords for all Linux CLI users.
                        Modifies the public Ethernet settings.
                        Starts and stops necessary OS and VPLEX services.
service (EMC service)   Cannot modify user accounts. (Customers do have access to
                        this account.)
                        Uses VPlexcli to manage federated storage.
Linux CLI accounts      Uses VPlexcli.
                        Modifies their own password.
All account types       Can SSH or VNC into the management server.
                        Can SCP files off the management server from directories to
                        which they have access.

Some service and administrator tasks require OS commands that need root privileges. The management server has been configured to use the sudo program to provide these root privileges just for the duration of the command. Sudo is a secure and well-established UNIX program for allowing users to run commands with root privileges. VPLEX documentation indicates which commands must be prefixed with "sudo" in order to acquire the necessary privileges. The sudo command asks for the user's password when it runs for the first time, to ensure that the user knows the password for their own account. This prevents unauthorized users from executing these privileged commands when they find an authenticated SSH login that was left open.


Management server software


The management server software is installed during manufacturing and is fully field upgradeable. The software includes:

- VPLEX Management Console
- VPlexcli
- Server Base Image Updates (when necessary)
- Call-home software

Each are briefly discussed in this section.

Management console
The VPLEX Management Console provides a graphical user interface (GUI) to manage the VPLEX cluster. The GUI can be used to provision storage, as well as manage and monitor system performance. Figure 7 on page 42 shows the VPLEX Management Console window with the cluster tree expanded to show the objects that are manageable from the front-end, back-end, and the federated storage.


Figure 7    VPLEX Management Console

The VPLEX Management Console provides online help for all of its available functions. Online help can be accessed in the following ways:

- Click the Help icon in the upper right corner on the main screen to open the online help system, or in a specific screen to open a topic specific to the current task.
- Click the Help button on the task bar to display a list of links to additional VPLEX documentation and other sources of information.


Figure 8 is the welcome screen of the VPLEX Management Console GUI, which utilizes a secure http connection via a browser. The interface uses Flash technology for rapid response and unique look and feel.

Figure 8    Management Console welcome screen

Command line interface


The VPlexcli is a command line interface (CLI) for configuring and running the VPLEX system, for setting up and monitoring the system's hardware and intersite links (including com/tcp), and for configuring global inter-site I/O cost and link-failure recovery. The CLI runs as a service on the VPLEX management server and is accessible using Secure Shell (SSH).


For information about the VPlexcli, refer to the EMC VPLEX CLI Guide.

System reporting
VPLEX system reporting software collects configuration information from each cluster and each engine. The resulting configuration file (XML) is zipped and stored locally on the management server or presented to the SYR system at EMC via call home. You can schedule a weekly job to automatically collect SYR data (VPlexcli command scheduleSYR), or manually collect it whenever needed (VPlexcli command syrcollect).


Director software
The director software provides:

- Basic Input/Output System (BIOS): Provides low-level hardware support to the operating system, and maintains boot configuration.
- Power-On Self Test (POST): Provides automated testing of system hardware during power on.
- Linux: Provides basic operating system services to the VPlexcli software stack running on the directors.
- VPLEX Power and Environmental Monitoring (ZPEM): Provides monitoring and reporting of system hardware status.
- EMC Common Object Model (ECOM): Provides management logic and interfaces to the internal components of the system.
- Log server: Collates log messages from director processes and sends them to the SMS.
- EMC GeoSynchrony (I/O Stack): Processes I/O from hosts, performs all cache processing, replication, and virtualization logic, and interfaces with arrays for claiming and I/O.


Configuration overview
The VPLEX configurations are based on how many engines are in the cabinet. The basic configurations are single, dual, and quad (previously known as small, medium, and large). The configuration sizes refer to the number of engines in the VPLEX cabinet. The remainder of this section describes each configuration size.

Single engine configurations


The VPLEX single engine configuration includes the following:

- Two directors
- One engine
- Redundant engine SPSs
- 8 front-end Fibre Channel ports (16 for VS1 hardware)
- 8 back-end Fibre Channel ports (16 for VS1 hardware)
- One management server

The unused space between engine 1 and the management server as shown in Figure 9 on page 47 is intentional.


Figure 9    VPLEX single engine configuration

Dual configurations
The VPLEX dual engine configuration includes the following:

- Four directors
- Two engines
- Redundant engine SPSs
- 16 front-end Fibre Channel ports (32 for VS1 hardware)
- 16 back-end Fibre Channel ports (32 for VS1 hardware)
- One management server
- Redundant Fibre Channel COM switches for local COM; UPS for each Fibre Channel switch

Figure 10 shows an example of a dual engine configuration.

Figure 10    VPLEX dual engine configuration

Quad configurations
The VPLEX quad engine configuration includes the following:

- Eight directors
- Four engines
- Redundant engine SPSs
- 32 front-end Fibre Channel ports (64 for VS1 hardware)
- 32 back-end Fibre Channel ports (64 for VS1 hardware)
- One management server
- Redundant Fibre Channel COM switches for local COM; UPS for each Fibre Channel switch

Figure 11 shows an example of a quad configuration.

Figure 11    VPLEX quad engine configuration


I/O implementation
The VPLEX cluster utilizes a write-through mode when configured for either VPLEX Local or Metro whereby all writes are written through the cache to the back-end storage. To maintain data integrity, a host write is acknowledged only after the back-end arrays (in one cluster in case of VPLEX Local and in two clusters in case of VPLEX Metro) acknowledge the write. This section describes the VPLEX cluster caching layers, roles, and interactions. It gives an overview of how reads and writes are handled within the VPLEX cluster and how distributed cache coherency works. This is important to the introduction of high availability concepts.

Cache coherence
Cache coherence creates a consistent global view of a volume. Distributed cache coherence is maintained using a directory. There is one directory per virtual volume and each directory is split into chunks (4096 directory entries within each). These chunks exist only if they are populated. There is one directory entry per global cache page, with responsibility for:

- Tracking page owner(s) and remembering the last writer
- Locking and queuing

Meta-directory
Directory chunks are managed by the meta-directory, which assigns and remembers chunk ownership. These chunks can migrate using Locality-Conscious Directory Migration (LCDM). This meta-directory knowledge is cached across the share group (i.e., a group of multiple directors within the cluster that are exporting a given virtual volume) for efficiency.
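A toy model of the chunked directory and meta-directory might look like the following. The chunk size (4096 entries) comes from the text above; the director names, lazy chunk creation, and data structures are illustrative, not the VPLEX implementation.

```python
CHUNK_SIZE = 4096  # directory entries per chunk, as described above

def chunk_of(page: int) -> int:
    """Which directory chunk a global cache page's entry lives in."""
    return page // CHUNK_SIZE

# Toy meta-directory: chunk id -> owning director. Chunks are created
# lazily ("exist only if populated") and could migrate between directors.
meta_directory = {}

def owner_for(page: int, default_director: str = "director-1-1-A") -> str:
    # lazily assign an owner the first time a page in this chunk is used
    return meta_directory.setdefault(chunk_of(page), default_director)

print(chunk_of(0), chunk_of(4095), chunk_of(4096))  # 0 0 1
print(owner_for(10))  # director-1-1-A
```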

How a read is handled


When a host makes a read request, VPLEX first searches its local cache. If the data is found there, it is returned to the host.


If the data is not found in local cache, VPLEX searches global cache. Global cache includes all directors that are connected to one another within the single VPLEX cluster for VPLEX Local, and all of the VPLEX clusters for both VPLEX Metro and VPLEX Geo. If there is a global read hit in the local cluster (i.e. same cluster, but different director) then the read will be serviced from global cache in the same cluster. The read could also be serviced by the remote global cache if the consistency group setting local read override is set to false (the default is true). Whenever the read is serviced from global cache (same cluster or remote), a copy is also stored in the local cache of the director from where the request originated. If a read cannot be serviced from either local cache or global cache, it is read directly from the back-end storage. In these cases both the global and local cache are updated.

I/O flow of a local read hit

1. Read request issued to virtual volume from host.
2. Look up in local cache of ingress director.
3. On hit, data returned from local cache to host.

I/O flow of a global read hit

1. Read request issued to virtual volume from host.
2. Look up in local cache of ingress director.
3. On miss, look up in global cache.
4. On hit, data is copied from owner director into local cache.
5. Data returned from local cache to host.

I/O flow of a read miss

1. Read request issued to virtual volume from host.
2. Look up in local cache of ingress director.
3. On miss, look up in global cache.
4. On miss, data read from storage volume into local cache.
5. Data returned from local cache to host.
6. The director that returned the data becomes the chunk owner.
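The read flows above can be sketched in a few lines. Plain dictionaries stand in for the local cache, global cache, and back-end array, so this is illustrative only, not the VPLEX code path.

```python
def read(block, local_cache, global_cache, backend):
    """Sketch of the three read flows: local hit, global hit, and a
    miss that falls through to back-end storage."""
    if block in local_cache:                      # local read hit
        return local_cache[block], "local hit"
    if block in global_cache:                     # global read hit
        local_cache[block] = global_cache[block]  # copy into local cache
        return local_cache[block], "global hit"
    data = backend[block]                         # read miss: go to array
    global_cache[block] = data                    # update both caches
    local_cache[block] = data
    return data, "miss"

local, glob, array = {}, {5: "B"}, {9: "C"}
print(read(5, local, glob, array)[1])  # global hit
print(read(9, local, glob, array)[1])  # miss
print(read(9, local, glob, array)[1])  # local hit (now cached locally)
```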


How a write is handled


For both VPLEX Local and Metro, all writes are written through cache to the back-end storage. Writes are completed to the host only after they have been completed to the back-end arrays. In the case of VPLEX Metro, each write is duplicated at the cluster where it was written. One copy is written through to the local back-end disk, while the other is sent to the remote VPLEX cluster, where it in turn is written through to the remote back-end disk. Host acknowledgement is given only once both writes to back-end storage have been acknowledged.

I/O flow of a write miss

1. Write request issued to virtual volume from host.
2. Look for prior data in local cache.
3. Look for prior data in global cache.
4. Transfer data to local cache.
5. Data is written through to back-end storage.
6. Write is acknowledged to host.

I/O flow of a write hit

1. Write request issued to virtual volume from host.
2. Look for prior data in local cache.
3. Look for prior data in global cache.
4. Invalidate prior data.
5. Transfer data to local cache.
6. Data is written through to back-end storage.
7. Write is acknowledged to host.
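The Metro write-through behavior described above can be sketched similarly, with dictionaries standing in for the back-end arrays at each cluster. Illustrative only; the invalidation and duplication logic is greatly simplified.

```python
def metro_write(block, data, local_array, remote_array, caches):
    """Sketch of Metro write-through: invalidate prior cached data,
    write through to the back-end disks at both clusters, and only
    then acknowledge the host."""
    for cache in caches:
        cache.pop(block, None)   # invalidate any prior cached data
    local_array[block] = data    # write through to local back end
    remote_array[block] = data   # and to the remote back end
    return "ack"                 # host ack only after both complete

a1, a2 = {}, {}
print(metro_write(3, "payload", a1, a2, [{}, {3: "stale"}]))  # ack
print(a1[3] == a2[3] == "payload")  # True
```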


3
System and Component Integrity

This chapter explains how VPLEX clusters are able to handle hardware failures in any subsystem within the storage cluster. Topics include:

Overview ............................................................................................. 54
Cluster.................................................................................................. 55
Path redundancy through different ports ...................................... 56
Path redundancy through different directors ................................ 57
Path redundancy through different engines .................................. 58
Path redundancy through site distribution.................................... 59
Serviceability....................................................................................... 60


Overview
VPLEX clusters are capable of surviving any single hardware failure in any subsystem within the overall storage cluster. These include host connectivity subsystem, memory subsystem, etc. A single failure in any subsystem will not affect the availability or integrity of the data. Multiple failures in a single subsystem and certain combinations of single failures in multiple subsystems may affect the availability or integrity of data. High availability requires that host connections be redundant and that hosts are supplied with multipath drivers. In the event of a front-end port failure or a director failure, hosts without redundant physical connectivity to a VPLEX cluster and without multipathing software installed may be susceptible to data unavailability.


Cluster
A cluster is a collection of one, two, or four engines in a physical cabinet. A cluster serves I/O for one storage domain and is managed as one storage cluster. All hardware resources (CPU cycles, I/O ports, and cache memory) are pooled:

The front-end ports on all directors provide active/active access to the virtual volumes exported by the cluster. For maximum availability, virtual volumes can be presented through all directors so that all directors but one can fail without causing data loss or unavailability. To achieve this with version 5.0.1 code and below, directors must be connected to all storage.
Note: A simultaneous failure of all directors but one in a dual- or quad-engine system would cause the last remaining director to fail as well, since it would lose quorum. The statement above therefore holds only if directors fail one at a time.
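The quorum behavior described in the note can be illustrated with a simple majority-style model. This is a simplification for intuition only, with ties assumed resolved in the survivors' favor; the real VPLEX quorum logic is more involved:

```python
# Simplified quorum model: a surviving group keeps quorum if it holds at
# least half of the last agreed membership. Hypothetical sketch, not VPLEX code.

def has_quorum(survivors, membership):
    return survivors >= membership / 2

def survives(failure_batches, directors):
    """failure_batches: list of director counts failing simultaneously."""
    membership = directors
    for failed in failure_batches:
        survivors = membership - failed
        if survivors <= 0 or not has_quorum(survivors, membership):
            return False          # the remaining directors lose quorum too
        membership = survivors    # membership re-agreed after each event
    return True

# Directors failing one at a time in a quad-director (dual-engine) cluster:
one_at_a_time = survives([1, 1, 1], 4)   # the last director keeps serving I/O
# Three of four directors failing at the same instant:
instant = survives([3], 4)               # the lone survivor loses quorum
```

This matches the note: sequential single failures are survivable down to one director, but an instant failure of all directors bar one takes the last one down too.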


Path redundancy through different ports


Because all paths are duplicated, when a director port goes down for any reason, I/O seamlessly continues through a port on the other director, as shown in Figure 12 (assuming correct multipath software is in place).
Figure 12    Port redundancy

Multipathing software plus redundant volume presentation yields continuous data availability in the presence of port failures.


Path redundancy through different directors


If a director were to go down, the other director can completely take over the I/O processing from the host, as shown in Figure 13.

Figure 13    Director redundancy

Multipathing software plus volume presentation on different directors yields continuous data availability in the presence of director failures.


Path redundancy through different engines


In a clustered environment, if one engine goes down, another engine completes the host I/O processing, as shown in Figure 14.

Figure 14    Engine redundancy

Multipathing software plus volume presentation on different engines yields continuous data availability in the presence of engine failures.


Path redundancy through site distribution


Distributed site redundancy, enabled through VPLEX Metro HA (including VPLEX Witness), ensures that if a site goes down, or even if the link to that site goes down, the other site can continue processing the host I/O seamlessly, as shown in Figure 15. As illustrated, if a site failure occurs at Site B, I/O continues unhindered at Site A.

Figure 15    Site redundancy


Serviceability
In addition to the redundancy fail-safe features, the VPLEX cluster provides event logs and call home capability via EMC Secure Remote Support (ESRS).


4
Foundations of VPLEX High Availability

This chapter explains VPLEX architecture and operation:


Foundations of VPLEX High Availability ..................................... 62
Failure handling without VPLEX Witness (static preference) ..... 70


Foundations of VPLEX High Availability


The following section discusses, at a high level, several disruptive scenarios for a multi-site VPLEX Metro configuration without VPLEX Witness. The purpose of this section is to give the customer or solutions architect an understanding of site failure semantics prior to the deployment of VPLEX Witness and the related solutions outlined in this book. This section is not designed to highlight flaws in high availability architecture as implemented by basic VPLEX best practices without VPLEX Witness: any solution deployed in a Metro active/active state, VPLEX or not, will run into the same issues when an independent observer (or witness) such as VPLEX Witness is not deployed. The decision for an architect to apply the VPLEX Witness capabilities, or to enhance connectivity paths across data centers using the Metro HA Cross-Cluster Connect solution, depends on their basic failover needs.
Note: To keep the explanation in the following section at a high level, the graphics have been broken down into major objects (e.g., Site A, Site B, and Link). You can assume that a VPLEX cluster resides within each site; therefore, when a site failure is shown, it also implies a full VPLEX cluster failure within that site. You can also assume that the link object between sites represents the main inter-cluster data network connected to each VPLEX cluster at either site. One further assumption is that all components within a site share the same failure domain, so a site failure affects every component in that domain, including the VPLEX cluster.

Figure 16 shows normal operation, where all three components are fully operational. (In these diagrams, green symbolizes normal operation and red symbolizes failure.)

Figure 16    High level functional sites in communication


Figure 17 demonstrates that Site A has failed.

Figure 17    High level Site A failure

Suppose that an application or VM was running only in Site A at the time of the incident; it would now need to be restarted at the remaining Site B. Reading this document, you know this because you have an external perspective and can see the entire diagram. However, looking at this purely from Site B's perspective, all that can be deduced is that communication has been lost to Site A. Without an external independent observer of some kind, it is impossible to distinguish a full Site A failure from an inter-cluster link failure. Figure 18 depicts such an inter-cluster link failure (red arrow).

Figure 18    High level Inter-site link failure

Similar to the previous example, from an overall perspective you can see that it is the link which is faulted. However, from Site A's or Site B's perspective, all that can be deduced is that communication to the other site has been lost (exactly as in the previous example); it cannot be determined whether the link or the remote site is at fault.


The next section shows how different failures affect a VPLEX distributed volume and highlights the different resolutions required in each case, starting with the site failure scenario. Figure 19 shows, at a high level, a VPLEX distributed volume spanning two sites:

Figure 19    VPLEX active and functional between two sites

As shown, the distributed volume is made up of a mirror leg at each site (M1 and M2). Using the distributed cache coherency semantics provided by VPLEX GeoSynchrony, a consistent presentation of a logical volume is achieved across both clusters. Furthermore, cache coherency enables active/active data access (both read and write) from the two sites. The example also shows a distributed network where users are able to access either site, as would be the case in a fully active/active environment.


Figure 20 shows a total failure at one of the sites (in this case Site A has failed). The distributed volume becomes degraded, since the hardware required at Site A to support that mirror leg is no longer available. The desired resolution in this case is to keep the volume active at Site B so the application can resume there.

Figure 20    VPLEX concept diagram with failure at Site A


Figure 21 shows the desired resolution if a failure at Site A were to occur. As discussed previously, the correct outcome is to keep the volume online at Site B.

Figure 21    Correct resolution after volume failure at Site A

Failure handling without VPLEX Witness (static preference) on page 70 discusses the outcome after an inter-cluster link partition/failure.


Figure 22 shows the configuration before the failure.

Figure 22    VPLEX active and functional between two sites

Recall from the simple Site A/Site B failure scenarios that when the link failed, neither site could determine the exact nature of the failure. With an active/active distributed volume, a link failure also degrades the distributed volume, since write I/O at either site is unable to propagate to the remote site.


Figure 23 shows what would happen if there was no mechanism to suspend I/O at one of the sites in this scenario.

Figure 23    Inter-site link failure and cluster partition

As shown, this would lead to split brain (or a conflicting detach, in VPLEX terminology): since writes could be accepted at both sites, there is the potential to end up with two divergent copies of the data. To protect against data corruption this situation has to be avoided. Therefore, VPLEX must act and suspend access to the distributed volume on one of the clusters.
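A trivial sketch makes the divergence concrete (hypothetical block names and values, for illustration only):

```python
# Two mirror legs hold identical data before the partition.
site_a = {"blk0": "v0"}
site_b = dict(site_a)

# Partition: replication stops, but suppose both sides keep accepting writes.
site_a["blk0"] = "written-at-A"
site_b["blk0"] = "written-at-B"

# The legs now hold two irreconcilable versions of the same block; neither
# version can safely be merged or discarded. Suspending I/O on one cluster
# at partition time is exactly what prevents reaching this state.
diverged = site_a["blk0"] != site_b["blk0"]
```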


Figure 24 displays a valid and acceptable state in the event of a link partition, as Site A is now suspended. This preferential behavior (selectable for either cluster) is the default, automatic behavior of VPLEX distributed volumes and protects against data corruption and split brain scenarios. The following section explains in more detail how this functions.

Figure 24    Correct handling of cluster partition


Failure handling without VPLEX Witness (static preference)


As previously demonstrated, in the presence of failures, VPLEX active/active distributed solutions require different resolutions depending on the type of failure. However, since VPLEX version 4.0 had no means to perform external arbitration, no mechanism existed to distinguish between a site failure and a link failure. To overcome this, a feature called static preference (previously known as static bias) is used to guard against split brain scenarios. The premise of static preference is to set a detach rule, ahead of any failure, for each distributed volume (or group of distributed volumes) that spans two VPLEX clusters. The rule defines which cluster is declared the preferred cluster, which maintains access to the volume, and which cluster is declared the non-preferred cluster, which suspends access, should the two VPLEX clusters lose communication with each other (a concept that covers both site and link failures). This is known as a detach rule: the preferred cluster can unilaterally detach the other cluster and assume that the detached cluster is either dead or will stay suspended if it is alive.
Note: VPLEX Metro also supports the rule set no automatic winner. If a consistency group is configured with this setting, then I/O will suspend at both VPLEX clusters if either the link partitions or an entire VPLEX cluster fails. Manual intervention can then be used to resume I/O at a remaining cluster if required. Care should be taken when setting this policy: although it ensures that both VPLEX clusters remain identical at all times, the trade-off is that the production environment is halted. This is useful if a customer wishes to integrate VPLEX failover semantics with failover behavior driven by the application (for example, if the application has its own witness). In this case, the application can provide a script that invokes the resume CLI command on the VPLEX cluster of its choosing.
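The static preference behavior described above can be condensed into a few lines. This is a sketch of the decision only, with rule-set names taken from the text; it is not VPLEX code:

```python
# What a single cluster does with a distributed volume when it loses contact
# with its peer, under static preference alone (no VPLEX Witness). From the
# cluster's viewpoint, a site failure and a link failure look identical.

def resolve_io(cluster, detach_rule):
    """cluster: 'cluster-1' or 'cluster-2'.
    detach_rule: 'cluster-1-detaches', 'cluster-2-detaches',
                 or 'no-automatic-winner'.
    Returns 'continue' or 'suspend' for the given cluster."""
    if detach_rule == "no-automatic-winner":
        return "suspend"   # both sides suspend; manual resume required
    preferred = "cluster-1" if detach_rule == "cluster-1-detaches" else "cluster-2"
    return "continue" if cluster == preferred else "suspend"
```

Note the weakness the rest of this chapter addresses: if the preferred cluster is the one that actually died, the surviving non-preferred cluster still returns "suspend", giving a non-zero RTO until someone intervenes manually.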


Figure 25 shows how static preference can be set for each distributed volume (also known as a DR1, or Distributed RAID 1).

Figure 25    VPLEX static detach rule

This detach rule can be set either within the VPLEX GUI or via the VPLEX CLI. Each volume can be set to Cluster 1 detaches, Cluster 2 detaches, or no automatic winner. If a Distributed RAID 1 device (DR1) is set to Cluster 1 detaches, then in any failure scenario the preferred cluster for that volume is Cluster 1; if its detach rule is set to Cluster 2 detaches, then in any failure scenario the preferred cluster for that volume is Cluster 2.
Note: When reading these rules, some people prefer to substitute the word detaches with preferred or wins, which is perfectly acceptable and may make them easier to understand.


Setting the rule on a volume to Cluster 1 detaches means that Cluster 1 is the preferred site for the given volume. (Saying that Cluster 1 has the bias for the given volume is equally appropriate.) Once this rule is set, then regardless of the failure type (be it link or site), the rule is always invoked.
Note: A caveat exists here: if the state of the back-end storage at the preferred cluster is out of date (due to a prior back-end failure, an incomplete rebuild, or another issue), the preferred cluster will suspend I/O regardless of preference.

The following diagrams show some examples of the rule set in action for different failures, the first being a site loss at Site B with a single DR1 set to Cluster 1 detaches. Figure 26 shows the initial running setup of the configuration; the volume is set to Cluster 1 detaches.

Figure 26    Typical detach rule setup


If a problem occurred at Site B, the DR1 would become degraded, as shown in Figure 27.

Figure 27    Non-preferred site failure


Because the preference rule was set to Cluster 1 detaches, the distributed volume remains active at Site A. This is shown in Figure 28.

Figure 28    Volume remains active at Cluster 1

Therefore in this scenario, if the service, application, or VM was running only at Site A (the preferred site), it would continue uninterrupted without needing to restart. However, if the application was running only at Site B on the given distributed volume, it will need to be restarted at Site A; since VPLEX is an active/active solution, no manual intervention at the storage layer is required in this case.


The next example shows static preference working under link failure conditions. Figure 29 shows a configuration with a distributed volume set to Cluster 1 detaches as per the previous configuration.

Figure 29    Typical detach rule setup before link failure


If the link were now lost, the distributed volume would again be degraded, as shown in Figure 30.

Figure 30    Inter-site link failure and cluster partition

To ensure that split brain does not occur after this type of failure, the static preference rule is applied and I/O is suspended at Cluster 2, since the rule is set to Cluster 1 detaches.


This is shown in Figure 31.

Figure 31    Suspension after inter-site link failure and cluster partition

Therefore, in this scenario, if the service, application, or VM was running only at Site A, it would continue uninterrupted without needing to restart. However, if the application was running only at Site B, it will need to be restarted at Site A, since the preference rule suspends access to the given distributed volumes on Cluster 2. Again, no manual intervention is required at the storage level, as the volume at Cluster 1 automatically remains available. In summary, static preference is a very effective method of preventing split brain. However, there is one particular scenario that will require manual intervention if the static preference feature is used alone: a VPLEX cluster or site failure at the preferred cluster (that is, the pre-defined preferred cluster for the given distributed volume).


This is shown in Figure 32, where there is a distributed volume which has Cluster 2 detaches set on the DR1.

Figure 32    Cluster 2 is preferred


If Site B had a total failure in this example, disruption would now also occur at Site A as shown in Figure 33.

Figure 33    Preferred site failure causes full Data Unavailability

As can be seen, the preferred site has failed and the preference rule has been applied; but because the rule is static and cannot distinguish between a link failure and a remote site failure, the remaining site becomes suspended. Therefore, in this case, manual intervention is required to bring the volume online at Site A. Static preference is a very powerful rule: it provides zero RPO and zero RTO resolution for non-preferred cluster failures and inter-cluster partition scenarios, and it completely avoids split brain. However, in the presence of a preferred cluster failure it provides a non-zero RTO. Note that this feature is available without automation and is a valuable alternative when a VPLEX Witness configuration (discussed in the next chapter) is unavailable, or when the customer infrastructure cannot accommodate one due to the lack of a third failure domain.


VPLEX Witness has been designed to overcome this particular non-zero RTO scenario, since it can override the static preference and keep what was the non-preferred site active, while guaranteeing that split brain scenarios are always avoided.
Note: If using a VPLEX Metro deployment without VPLEX Witness, and the preferred cluster has been lost, I/O can be manually resumed via the CLI at the remaining (non-preferred) VPLEX cluster. However, care should be taken here to avoid a conflicting detach or split brain scenario. (VPLEX Witness solves this problem automatically.)


5
Introduction to VPLEX Witness

This chapter explains VPLEX architecture and operation:


VPLEX Witness overview and architecture ................................... 82
VPLEX Witness target solution, rules, and best practices............ 85
VPLEX Witness failure semantics.................................................... 87
CLI example outputs ......................................................................... 93


VPLEX Witness overview and architecture


VPLEX Metro v5.0 (and above) systems can rely on a new component called VPLEX Witness. VPLEX Witness is an optional component designed to be deployed in customer environments where the regular preference rule sets are insufficient to provide seamless zero or near-zero RTO storage availability in the presence of site disasters and VPLEX cluster and inter-cluster failures.

As described in the previous section, without VPLEX Witness all distributed volumes rely on configured rule sets to identify the preferred cluster in the presence of a cluster partition or a cluster/site failure. However, if the preferred cluster happens to fail (as the result of a disaster event, for example), VPLEX is unable to automatically allow the surviving cluster to continue I/O to the affected distributed volumes. VPLEX Witness has been designed specifically to overcome this case.

An external VPLEX Witness Server is installed as a virtual machine running on a customer-supplied VMware ESX host deployed in a failure domain separate from either of the VPLEX clusters (to eliminate the possibility of a single fault affecting both a cluster and the VPLEX Witness). VPLEX Witness connects to both VPLEX clusters over the management IP network. By reconciling its own observations with the information reported periodically by the clusters, the VPLEX Witness enables the clusters to distinguish between inter-cluster network partition failures and cluster failures, and to automatically resume I/O in these situations.

Figure 34 on page 83 shows a high-level deployment of VPLEX Witness and how it can augment an existing static preference solution. The VPLEX Witness server resides in a fault domain separate from VPLEX cluster 1 and cluster 2.


Figure 34    High Level VPLEX Witness architecture

Since the VPLEX Witness server is external to both production locations, it gains more perspective on the nature of a particular failure, allowing the correct action to be taken. As mentioned previously, it is this perspective that is vital in order to distinguish between a site outage and a link outage, since each of these scenarios requires a different action.


Figure 35 shows a high-level circuit diagram of how the VPLEX Witness Server should be connected.

Figure 35    High Level VPLEX Witness deployment

The VPLEX Witness server is connected via the VPLEX management IP network in a third failure domain. Depending on the scenario to be protected against, this third fault domain could reside on a different floor of the same building as VPLEX cluster 1 and cluster 2, or it could be located in a completely geographically dispersed data center, even in a different country.
Note: VPLEX Witness Server supports up to 1 second of network latency over the management IP network.

Clearly, using the example of a third floor in the same building, one would not be protected from a disaster affecting the entire building; so, depending on the requirement, careful consideration should be given when choosing this third failure domain.


VPLEX Witness target solution, rules, and best practices


VPLEX Witness is architecturally designed for VPLEX Metro clusters. Customers who use only VPLEX Local do not require VPLEX Witness functionality. Furthermore, VPLEX Witness is only suitable for customers who have a third failure domain connected via two physical networks, one from each of the data centers where the VPLEX clusters reside, into each VPLEX management station Ethernet port. VPLEX Witness failure handling semantics apply only to distributed volumes in synchronous (i.e., Metro) consistency groups on a pair of VPLEX v5.x clusters when VPLEX Witness is enabled. VPLEX Witness failure handling semantics do not apply to:

• Local volumes
• Distributed volumes outside of a consistency group
• Distributed volumes within a consistency group if the VPLEX Witness is disabled
• Distributed volumes within a consistency group if the preference rule is set to no automatic winner

At the time of writing, only one VPLEX Witness Server can be configured for a given Metro system; when it is configured and enabled, its failure semantics apply to all configured consistency groups. Additionally, a single VPLEX Witness Server (virtual machine) can support only a single VPLEX Metro system (however, more than one VPLEX Witness Server can be configured on a single physical ESX host).


Figure 36 shows the supported versions (at the time of writing) for VPLEX Witness.

Figure 36    Supported VPLEX versions for VPLEX Witness

As noted in Figure 36, depending on the solution, VPLEX static preference alone, without VPLEX Witness, may still be relevant in some cases. Figure 37 shows the volume types and rules which are supported with VPLEX Witness.

Figure 37    VPLEX Witness volume types and rule support

Check the latest VPLEX ESSM (EMC Simple Support Matrix), located at https://elabnavigator.emc.com, Simple Support Matrix tab, for the latest information including VPLEX Witness server physical host requirements and site qualification.


VPLEX Witness failure semantics


As seen in the previous section, VPLEX Witness operates at the consistency group level for a group of distributed devices, and functions in conjunction with the detach rule set within the consistency group. Starting with the inter-cluster link partition, the next few pages discuss the failure scenarios (both site and link) raised in previous sections and show how the failure semantics differ when using VPLEX Witness compared to static preference alone. Figure 38 shows a typical setup for VPLEX 5.x with a single distributed volume configured in a consistency group which has a rule set configured for cluster 2 detaches (i.e., cluster 2 is preferred). It also shows the VPLEX Witness server connected via the management network in a third failure domain.

Figure 38    Typical VPLEX Witness configuration


If the inter-cluster link were to fail in this scenario, VPLEX Witness would still be able to communicate with both VPLEX clusters, since the management network that connects the VPLEX Witness server to both VPLEX clusters is still operational. By communicating with both VPLEX clusters, VPLEX Witness deduces that the inter-cluster link has failed, since both VPLEX clusters report that connectivity with the remote VPLEX cluster has been lost (i.e., cluster 1 reports that cluster 2 is unavailable, and vice versa). This is shown in Figure 39.

Figure 39    VPLEX Witness and an inter-cluster link failure

In this case the VPLEX Witness guides both clusters to follow the pre-configured static preference rules and volume access at cluster 1 will be suspended since the rule set was configured as cluster 2 detaches.


Figure 40 shows the final state after this failure.

Figure 40    VPLEX Witness and static preference after cluster partition

The next example shows how VPLEX Witness can assist if you have a site failure at the preferred site. As discussed above, this type of failure without VPLEX Witness would cause the volumes in the surviving site to go offline. This is where VPLEX Witness greatly improves the outcome of this event and removes the need for manual intervention.


Figure 41 shows a typical setup for VPLEX v5.x with a distributed volume configured in a consistency group with a rule set configured for Cluster 2 detaches (i.e., Cluster 2 wins).

Figure 41    VPLEX Witness typical configuration for cluster 2 detaches


Figure 42 shows that Site B has now failed.

Figure 42    VPLEX Witness diagram showing cluster 2 failure

As discussed in the previous section, when a site fails, the distributed volumes become degraded. However, unlike the earlier example, where a site failure at the preferred site and the static preference rule forced volumes into a suspended state at cluster 1, VPLEX Witness now observes that communication is still possible with cluster 1 (but not with cluster 2). Additionally, since cluster 1 cannot contact cluster 2, VPLEX Witness can make an informed decision and guide cluster 1 to override the static rule set and proceed with I/O.


Figure 43 shows the outcome.

Figure 43    VPLEX Witness with static preference override

Clearly, this is a big improvement over the scenario where only the static preference rule set was used without VPLEX Witness: previously, volumes had to be suspended at cluster 1 because there was no way to tell the difference between a site failure and a link failure.
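The two failure cases just shown can be condensed into the guidance logic sketched below. This is a simplification for illustration; the real mechanism is based on periodic health reports exchanged over the management network, and the function name is hypothetical:

```python
# VPLEX Witness guidance once the two clusters lose contact with each other,
# based on which clusters the witness itself can still reach (simplified).

def witness_guidance(sees_cluster_1, sees_cluster_2):
    if sees_cluster_1 and sees_cluster_2:
        # Both clusters reachable, yet they cannot see each other: the
        # inter-cluster link has failed. Follow the static preference rule.
        return "follow-static-preference"
    if sees_cluster_1:
        return "cluster-1-proceeds"   # cluster 2 (or its whole site) is down
    if sees_cluster_2:
        return "cluster-2-proceeds"   # cluster 1 (or its whole site) is down
    return "no-guidance"              # the witness is isolated from both
```

The key difference from static preference alone is the second and third branches: when only one cluster is reachable, the witness can safely let that cluster proceed even if it was the non-preferred one.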


CLI example outputs


On systems where VPLEX Witness is deployed and configured, the VPLEX Witness CLI context appears under the root context as "cluster-witness." By default, this context is hidden and will not be visible until VPLEX Witness has been deployed by running the cluster-witness configure command. Once the user deploys VPLEX Witness, the VPLEX Witness CLI context becomes visible. The CLI context typically displays the following information:
VPlexcli:/> cd cluster-witness/
VPlexcli:/cluster-witness> ls
Attributes:
Name                Value
------------------  -------------
admin-state         enabled
private-ip-address  128.221.254.3
public-ip-address   10.31.25.45

Contexts:
components

VPlexcli:/cluster-witness> ll components/
/cluster-witness/components:
Name       ID  Admin State  Operational State    Mgmt Connectivity
---------  --  -----------  -------------------  -----------------
cluster-1  1   enabled      in-contact           ok
cluster-2  2   enabled      in-contact           ok
server     -   enabled      clusters-in-contact  ok

VPlexcli:/cluster-witness> ll components/*
/cluster-witness/components/cluster-1:
Name                     Value
-----------------------  ------------------------------------------------------
admin-state              enabled
diagnostic               INFO: Current state of cluster-1 is in-contact
                         (last state change: 0 days, 13056 secs ago;
                         last message from server: 0 days, 0 secs ago.)
id                       1
management-connectivity  ok
operational-state        in-contact

/cluster-witness/components/cluster-2:
Name                     Value
-----------------------  ------------------------------------------------------
admin-state              enabled
diagnostic               INFO: Current state of cluster-2 is in-contact
                         (last state change: 0 days, 13056 secs ago;
                         last message from server: 0 days, 0 secs ago.)
id                       2
management-connectivity  ok
operational-state        in-contact

/cluster-witness/components/server:
Name                     Value
-----------------------  ------------------------------------------------------
admin-state              enabled
diagnostic               INFO: Current state is clusters-in-contact
                         (last state change: 0 days, 13056 secs ago.)
                         (last time of communication with cluster-2: 0 days, 0 secs ago.)
                         (last time of communication with cluster-1: 0 days, 0 secs ago.)
id                       -
management-connectivity  ok
operational-state        clusters-in-contact

Refer to the VPLEX CLI Guide, found on Powerlink, for more details about the VPLEX Witness CLI.

VPLEX Witness cluster isolation semantics and dual failures

As discussed in the previous section, deploying a VPLEX solution with VPLEX Witness provides continuous availability of the storage volumes through either a site failure or an inter-cluster link failure. These are single component failures, and no single point of failure can induce data unavailability when VPLEX Witness is used. It should be noted, however, that in rare situations more than one fault or component outage can occur, especially when considering the inter-cluster communication links; if two links failed at once, the VPLEX cluster at one site would become isolated. For instance, consider a typical VPLEX setup with VPLEX Witness: there are automatically three failure domains (this example uses A, B, and C, where VPLEX cluster 1 resides at A, VPLEX cluster 2 at B, and the VPLEX Witness server at C). There is an inter-cluster link between A and B (clusters 1 and 2), plus a management IP link between A and C and another between B and C, effectively giving a triangulated topology. In rare situations, if the link between A and B failed, followed by a further link failure from either A to C or B to C, one of the sites would become isolated (cut off).


Due to the nature of VPLEX Witness, these types of isolation can also be handled effectively without manual intervention. This is possible because a site isolation is very similar, in terms of technical behavior, to a full site outage; the main difference is that the isolated site is still fully operational and powered up (but needs to be forced into I/O suspension), whereas in a site failure the failed site is not operational. In these cases the failure semantics of VPLEX Witness are effectively the same. However, two further actions are taken at the site that becomes isolated:

- I/O is shut off/suspended at the isolated site.
- The VPLEX cluster attempts to call home.
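The decision each cluster takes in these isolation scenarios can be sketched as a small function. This is a minimal illustrative model only, not EMC's implementation; the function name, parameters, and guidance values are assumptions based on the semantics described in this section:

```python
def cluster_action(sees_peer, witness_guidance):
    """Decide what one VPLEX cluster does for a distributed volume.

    sees_peer: whether this cluster can still reach the other cluster.
    witness_guidance: "proceed" when the Witness tells this cluster to
    stay online, or None when the Witness server is unreachable.
    (Hypothetical names; illustrative of the described semantics only.)
    """
    if sees_peer:
        # No partition between clusters: both continue to serve I/O.
        return "continue-io"
    if witness_guidance == "proceed":
        # Witness confirms the peer is failed or isolated.
        return "continue-io"
    # Peer unreachable AND Witness unreachable: this cluster is isolated.
    # As described above, it suspends I/O and attempts to call home.
    return "suspend-io-and-call-home"


# Site A isolated: cluster 1 can reach neither its peer nor the Witness.
print(cluster_action(sees_peer=False, witness_guidance=None))
# Cluster 2 still reaches the Witness, which guides it to proceed.
print(cluster_action(sees_peer=False, witness_guidance="proceed"))
```

The key point the model captures is that an isolated cluster cannot distinguish "everyone else failed" from "I am cut off", so the safe action is always to suspend rather than risk a split-brain.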

Figure 44 shows the three scenarios that are described above:

Figure 44

Possible dual failure cluster isolation scenarios

As discussed previously, it is extremely rare to experience a double failure, and Figure 44 shows how VPLEX can automatically ride through isolation scenarios. However, there are also some other possible situations where a dual failure could occur and require manual intervention at one of the VPLEX clusters, since VPLEX Witness will not be able to distinguish the actual failure.


Note: If best practices are followed, the likelihood of these scenarios occurring is significantly lower than even the rare isolation incidents discussed above, mainly because the faults would have to disrupt components in totally different fault domains spread over many miles.

Figure 45 shows three scenarios where a double failure would require manual intervention to bring the remaining component online since VPLEX Witness would not be able to determine the gravity of the failure.

Figure 45

Highly unlikely dual failure scenarios that require manual intervention

A point to note in the above scenarios is that, for the shown outcomes to be correct, the failures must happen in a specific order: the link to the VPLEX Witness (or the Witness itself) fails first, and then either the inter-cluster link or a VPLEX cluster fails. If the order of failure is reversed, in all three cases the outcome would be different, since one of the VPLEX clusters would have remained online for the given distributed volume and manual intervention would not be required. This is because, once a failure occurs, VPLEX Witness gives guidance to the VPLEX clusters. This guidance is sticky: once it has been provided, the Witness is no longer consulted during any subsequent failure until the system has been returned to a fully operational state (i.e., has fully recovered and connectivity between both clusters and the VPLEX Witness is fully restored).
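The "sticky" guidance behavior just described can be modeled as a tiny state machine. This is a toy sketch under assumed names (StickyWitnessModel, on_partition, on_full_recovery are hypothetical); it only illustrates the rule that guidance, once given, is reused until full recovery:

```python
class StickyWitnessModel:
    """Toy model of sticky Witness guidance; not the VPLEX implementation."""

    def __init__(self):
        self.guidance = None  # e.g. {"cluster-2": "proceed"}

    def on_partition(self, reachable_clusters):
        """Called when the clusters lose contact with each other."""
        if self.guidance is None and len(reachable_clusters) == 1:
            # First failure: guide the one still-reachable cluster to proceed.
            self.guidance = {reachable_clusters[0]: "proceed"}
        # Subsequent failures reuse the existing guidance; the Witness is
        # not consulted again until full recovery.
        return self.guidance

    def on_full_recovery(self):
        """Both clusters and the Witness are fully reconnected: reset."""
        self.guidance = None


w = StickyWitnessModel()
first = w.on_partition(["cluster-2"])   # cluster 1 fails or is cut off
later = w.on_partition(["cluster-1"])   # later fault before full recovery
print(first, later)  # both reflect the first decision: guidance is sticky
```

This is why the order of failures matters in the scenarios above: whichever cluster received "proceed" guidance first keeps it for the duration of the incident.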


VPLEX Witness: the importance of the third failure domain


As discussed in the previous section, dual failures can occur but are highly unlikely. As also mentioned many times within this TechBook, it is imperative that, if VPLEX Witness is to be deployed, the VPLEX Witness server component be installed in a different failure domain from either of the two VPLEX clusters. Figure 46 shows two further dual failure scenarios where both a VPLEX cluster and the VPLEX Witness server have failed.

Figure 46

Two further dual failure scenarios that would require manual intervention

Again, if best practice is followed and each component resides within its own fault domain, these two situations are just as unlikely as the previous three scenarios that required manual intervention. However, now consider what could happen if the VPLEX Witness server were deployed not in a third failure domain, but in the same domain as one of the VPLEX clusters. A single domain failure could then take out two components at once, inducing a dual failure. This effectively turns a highly unlikely scenario into a far more probable single-failure scenario and should be avoided.


By deploying the VPLEX Witness server into a third failure domain, the dual failure risk is substantially lowered: manual intervention would only be needed if a fault disabled more than one dissimilar component, potentially hundreds of miles apart in different fault domains.
Note: It is always considered best practice to ensure ESRS and alerting are fully configured when using VPLEX Witness. This way, if a VPLEX cluster loses communication with the Witness server, the VPLEX cluster will dial home and alert. It also ensures that, if both VPLEX clusters lose communication with the Witness, the Witness function can be manually disabled when the communication outage is expected to last for an extended time, reducing the risk of data unavailability in the event of an additional VPLEX cluster failure or WAN partition.


6
VPLEX Metro HA

This chapter explains VPLEX Metro HA architecture and operation:


VPLEX Metro HA overview ........................................................... 100
VPLEX Metro HA Campus (with cross-connect) ........................ 101
VPLEX Metro HA (without cross-cluster connection) ................ 111


VPLEX Metro HA overview


From a technical perspective, VPLEX Metro HA solutions are effectively two new flavors of reference architecture that utilize the VPLEX Witness feature introduced in VPLEX v5.0. They greatly enhance the overall solution's ability to tolerate component failure, causing less or no disruption compared with legacy solutions, with little or no human intervention, over either campus or metro distances. The two main architecture types enabled by VPLEX Witness are:

- VPLEX Metro HA Campus: Defined as clusters within campus distance (typically < 1 ms round trip time, or RTT). This solution utilizes a cross-connected front-end host path configuration, giving each host an alternate path to the VPLEX Metro distributed volume via the remote VPLEX cluster.
- VPLEX Metro HA: Defined as distances larger than campus but still within synchronous distance (typically more than 1 ms RTT, but not more than 5 ms RTT), where a VPLEX Metro distributed volume is deployed between two VPLEX clusters using a VPLEX Witness, but without a cross-connected host path configuration.

This section looks at each of these solutions in turn and shows how system uptime can be maximized, stepping through different failure scenarios and showing, from both a VPLEX and a host HA cluster perspective, how the technologies interact with each failure. In all of the scenarios shown, VPLEX is able to continue servicing I/O automatically across at least one of the VPLEX clusters with zero data loss, ensuring that the application or service within the host HA cluster either remains online fully uninterrupted or is restarted elsewhere automatically by the host cluster.


VPLEX Metro HA Campus (with cross-connect)


VPLEX Metro HA Campus can be deployed when two sites are within campus distance of each other (up to 1 ms round trip latency). A VPLEX Metro distributed volume is deployed across the two sites using a cross-connected front-end configuration, and a VPLEX Witness server is installed in a different fault domain. Figure 47 shows a high-level schematic of a Metro HA Campus solution for VMware.

Figure 47

High-level diagram of a Metro HA campus solution for VMware

As can be seen, a single VPLEX cluster is deployed at each site, connected via an inter-cluster link.


A VPLEX distributed volume has been created across both locations, and a vSphere HA cluster instance has been stretched across both locations using the underlying VPLEX distributed volume. As also shown in Figure 47 on page 101, the physical ESX hosts are not only connected to the local VPLEX cluster where they physically reside, but also have an alternate path to the remote VPLEX cluster via an additional cross-connect network that is physically separate from the VPLEX inter-cluster link connecting the two VPLEX clusters. The key benefit of this solution is its ability to minimize or eliminate recovery time if components fail, up to and including an entire VPLEX cluster (unlikely, since there are no single points of failure within a VPLEX engine), because the physical host has an alternate path to the same storage actively served up by the remote VPLEX cluster, which automatically remains online due to VPLEX Witness regardless of rule set. The high-level deployment best practices for a cross-connect configuration are as follows:

- At the time of writing, inter-cluster network latency must not exceed 1 ms round trip time between VPLEX clusters.
- VPLEX Witness must be deployed (in a third failure domain) when using a cross-connect campus configuration.
- All remote VPLEX connections should be zoned to the local host as per EMC best practice, and local host initiators must be registered to the remote VPLEX. The distributed volume is then exposed from both VPLEX clusters to the same host.
- The host should have a local path preference set, ensuring the remote path is only used if the primary one fails, so that no additional latency is incurred.
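The local-path-preference practice above can be illustrated with a simple selection function. This is not a real multipathing driver; the function name and the path-record layout are assumptions for illustration only:

```python
def pick_path(paths):
    """Return the path a host should use: a live local path if one
    exists, otherwise a live remote (cross-connect) path, otherwise
    None. Sketch of the local-path-preference best practice only.
    """
    live = [p for p in paths if p["alive"]]
    local = [p for p in live if p["locality"] == "local"]
    if local:
        return local[0]   # preferred: avoids extra cross-site latency
    if live:
        return live[0]    # fail over to the remote VPLEX cluster
    return None


paths = [
    {"name": "hba0->local-vplex", "locality": "local", "alive": True},
    {"name": "hba1->remote-vplex", "locality": "remote", "alive": True},
]
print(pick_path(paths)["name"])   # local path chosen while it is alive
paths[0]["alive"] = False
print(pick_path(paths)["name"])   # cross-connect path used after local fails
```

The design point is that the remote path exists purely as a standby: while a local path is alive it is never selected, so steady-state I/O incurs no cross-site round trip.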
Note: At the time of writing, the only qualified host cluster solutions that can be configured with VPLEX Metro HA Campus (as opposed to standard VPLEX Metro HA without the cross-cluster connect) are vSphere versions 4.1 and 5.0, Windows 2008, and IBM PowerHA 5.4. Be sure to check the latest VPLEX Simple Support Matrix, located at https://elabnavigator.emc.com, Simple Support Matrix tab, for the latest support information, or submit an RPQ.


Failure scenarios

For the following failure scenarios, this section assumes that vSphere 5.0 update 1 or above is configured in a stretched HA topology with DRS, so that all of the physical hosts (ESX servers) are within the same HA cluster. As discussed previously, this type of configuration brings the ability to teleport virtual machines over distance, which is extremely useful in disaster avoidance, load balancing, and cloud infrastructure use cases. These use cases are all enabled using out-of-the-box features and functions; however, additional value can be derived from deploying the VPLEX Metro HA Campus solution to ensure total availability for both planned and unplanned events. High-level recommendations and prerequisites for stretching vSphere HA in conjunction with VPLEX Metro are as follows:

- A single vCenter instance must span both locations that contain the VPLEX Metro cluster pairs. (It is recommended that this instance be virtualized and protected via vSphere Heartbeat to ensure restart in the event of failure.)
- A stretched layer 2 network must be used, ensuring that once a VM is moved it still resides on the same logical network.
- vSphere HA can be enabled within the vSphere cluster, but vSphere Fault Tolerance is not supported (at the time of writing, but is planned for late 2012).
- vSphere DRS can be used. However, careful consideration should be given to this feature for vSphere versions prior to 5.0 update 1, since certain failure conditions where the VM is running at the non-preferred site may not invoke a VM failover, due to a problem where the ESX server does not detect a storage Persistent Device Loss (PDL) state. This can lead to the VM remaining online but intermittently unresponsive (also known as a zombie VM). Manual intervention would be required in this scenario.
Note: This can be avoided by using a VPLEX Metro HA Campus solution with a cross-cluster connect on a separate physical network from the VPLEX inter-cluster link. This ensures that an active path to the storage is always present, no matter where the VM is running. Another option (where supported) is to use host affinity groups. It is recommended to upgrade to the latest ESX and vSphere versions (5.0 update 1 and above) to avoid these conditions.


For detailed setup instructions and best practice planning for a stretched vSphere HA environment, refer to the white paper Using VMware vSphere with EMC VPLEX Best Practices Planning, which can be found at http://powerlink.emc.com under Support > Technical Documentation and Advisories > Hardware Platforms > VPLEX Family > White Papers. Figure 48 shows the topology of a Metro HA Campus environment divided into logical fault domains.

Figure 48

Metro HA campus diagram with failure domains

The following sections demonstrate the recovery automation for a single failure within any of these domains and show that no single fault in any domain can take down the system as a whole, in most cases without any interruption of service. If a physical host failure were to occur in either domain A1 or B1, the VMware HA cluster would restart the affected virtual machines on the remaining ESX servers.


Example 1

Figure 49 shows all physical ESX hosts failing in domain A1.

Figure 49

Metro HA campus diagram with disaster in zone A1

Since all of the physical hosts in domain B1 are connected to the same datastores via the VPLEX Metro distributed device, VMware HA can restart the virtual machines on any of the physical ESX hosts in domain B1.

Example 2

The next example describes what happens in the unlikely event that a VPLEX cluster fails in either domain A2 or B2.
Note: This failure condition is considered unlikely since it would constitute a dual failure as a VPLEX cluster has no single points of failure.

In this instance there would be no interruption of service to any of the virtual machines.


Figure 50 shows a full VPLEX cluster outage in domain A2.

Figure 50

Metro HA campus diagram with failure in zone A2

Since the ESX servers are cross-connected to both VPLEX clusters, ESX simply re-routes the I/O to the alternate path, which is still available because VPLEX is configured with a Witness-protected distributed volume. The distributed volume remains online in domain B2: the VPLEX Witness server observes that it cannot communicate with the VPLEX cluster in A2 and guides the VPLEX cluster in B2 (which also cannot communicate with A2) to remain online, since A2 is either isolated or failed.
Note: Similarly, in the event of a full isolation at A2, the distributed volumes would simply suspend at A2, since communication would not be possible with either the VPLEX Witness server or the VPLEX cluster in domain B2. In this case the outcome is identical from a vSphere perspective: there is no interruption, since I/O is re-directed across the cross-connect to domain B2, where the distributed volume remains online and available to service I/O.


Example 3

The following example describes what happens in the event of a failure of one (or all) of the back-end storage arrays in either domain A3 or B3. Again, in this instance there would be no interruption to any of the virtual machines. Figure 51 shows the failure of all storage arrays that reside in domain A3. Since a cache-coherent VPLEX Metro distributed volume is configured between domains A2 and B2, I/O can continue to be actively serviced from the VPLEX in A2 even though the local back-end storage has failed. This is due to the embedded VPLEX cache coherency, which efficiently caches reads in the A2 domain while propagating writes to the back-end storage in domain B3 via the remote VPLEX cluster in site B2.

Figure 51

Metro HA campus diagram with failure in zone A3 or B3


Example 4

The next example describes what happens in the event of a VPLEX Witness server failure in domain C1. Again, in this instance there would be no interruption to any of the virtual machines or VPLEX clusters. Figure 52 shows a complete failure of domain C1, where the VPLEX Witness server resides. Since the VPLEX Witness is not in the I/O path and is only an optional component, I/O actively continues for any distributed volume in domains A2 and B2: the inter-cluster link is still available, so cache coherency can be maintained between the VPLEX cluster domains. Although service is uninterrupted, both VPLEX clusters will now dial home and indicate that they have lost communication with the VPLEX Witness server, because a further failure of either VPLEX cluster (in domain A2 or B2) or of the inter-cluster link would cause data unavailability. This risk is heightened if the VPLEX Witness server is offline for an extended duration. To remove the risk, the Witness feature may be disabled manually, allowing the VPLEX clusters to follow the static preference rules.

Figure 52

Metro HA campus diagram with failure in zone C1


Example 5

The next example describes what happens in the event of a failure of the inter-cluster link between domains A2 and B2. Again, in this instance there would be no interruption to any of the virtual machines or VPLEX clusters. Figure 53 shows the inter-cluster link failed between domains A2 and B2. In this instance the previously defined static preference rule set is invoked, since neither VPLEX cluster can communicate with the other (but the VPLEX Witness server can communicate with both VPLEX clusters). Access to the given distributed volume within one of the domains, A2 or B2, is therefore suspended. Because in this example the cross-connect network is physically separate from the inter-cluster link, alternate paths are still available to the remote VPLEX cluster where the volume remains online; ESX simply re-routes the traffic to the alternate VPLEX cluster, so the virtual machine remains online and unaffected whichever site it was running on.

Figure 53

Metro HA campus diagram with intersite link failure


Note: It is plausible in this example that the alternate path physically routes across the same ISL that has failed. In that case there could be a small interruption if a virtual machine was running in A1, as it will be restarted in B1 (by the host cluster) since the alternate path is also dead. It is also possible, with vSphere versions prior to 5.0 update 1, that the guest OS will simply hang and vSphere HA will not be prompted to restart it. Although this is beyond the scope of the TechBook, to avoid any disruption in any host cluster environment, EMC suggests that the network used for the cross-cluster connect be physically separate from the VPLEX inter-cluster link, avoiding this potential problem altogether. Refer to the VPLEX Metro HA (without cross-connect) scenarios in the next section for more details on full cluster partition, as well as Appendix A, vSphere 5.0 Update 1 Additional Settings, for the vSphere HA configuration settings applicable to ESX implementations from version 5.0 update 1 onward that avoid this problem.


VPLEX Metro HA (without cross-cluster connection)


VPLEX Metro HA (without cross-cluster connection) deployment is very similar to the Metro HA Campus deployment described in the previous section; however, this solution is designed to cover distances beyond campus range and into metropolitan range, where round trip latency is beyond 1 ms and up to 5 ms. A VPLEX Metro distributed volume is deployed across the two sites, and a VPLEX Witness server is deployed within a different, third failure/fault domain. Figure 54 shows a high-level schematic of a Metro HA solution for vSphere without the cross-cluster deployment.

Figure 54

Metro HA Standard High-level diagram

As can be seen, a single VPLEX cluster is deployed at each site, connected via an inter-cluster link.


A VPLEX distributed volume has been created across both locations, and a vSphere HA cluster instance has been stretched across both locations using the underlying VPLEX distributed volume. As also shown in Figure 54 on page 111, the physical ESX hosts are now only connected to the local VPLEX cluster where they reside, since the cross-connect does not exist. The key benefit of this solution is its ability to minimize or eliminate recovery time when components fail: the host cluster is connected to a VPLEX distributed volume that actively serves the same block data via both VPLEX clusters and, due to VPLEX Witness, will remain online via at least one VPLEX cluster under any single failure event, regardless of rule set.
Note: At the time of writing, a larger number of host-based cluster solutions are supported with VPLEX Metro HA than with VPLEX Metro HA Campus (with the cross-connect). Although this next section only discusses vSphere with VPLEX Metro HA, other supported host clusters include Microsoft Hyper-V, Oracle RAC, PowerHA, Serviceguard, and others. Be sure to check the latest VPLEX Simple Support Matrix, found at https://elabnavigator.emc.com, Simple Support Matrix tab, for the latest support information.

Failure scenarios

As with the previous section, when deploying a stretched vSphere configuration with VPLEX Metro HA, it is also possible to enable long-distance vMotion (virtual machine teleportation), since the ESX datastore resides on a VPLEX Metro distributed volume and therefore exists in two places at the same time, exactly as described in the previous section. Again, for these failure scenarios this section assumes that vSphere version 5.0 update 1 or higher is configured in a stretched HA topology so that all of the physical hosts (ESX servers) at either site are within the same HA cluster. Since the majority of the failure scenarios behave identically to the cross-connect configuration, this section shows only two failure scenarios where the outcomes differ slightly from the previous section.


Note: For detailed setup instructions and best practice planning for a stretched HA vSphere environment please read White Paper: Using VMware vSphere with EMC VPLEX Best Practices Planning, which can be found on Powerlink (http://powerlink.emc.com) under Home > Support > Technical Documentation and Advisories > Hardware/Platforms Documentation > VPLEX Family > White Papers.

Figure 55 shows the topology of a Metro HA environment divided into logical fault domains. The next sections demonstrate the recovery automation for single failures within any of these domains.

Figure 55

Metro HA high-level diagram with fault domains


Example 1

The following example describes what happens in the unlikely event that a VPLEX cluster fails in domain A2. In this instance there would be no interruption of service to any virtual machines running in domain B1; however, any virtual machines that were running in domain A1 would see a minor interruption as they are restarted at B1. Figure 56 shows a full VPLEX cluster outage in domain A2.

Figure 56

Metro HA high-level diagram with failure in domain A2

Note: This failure condition is considered unlikely since it would constitute a dual failure, as a VPLEX cluster has no single points of failure.

As can be seen in the graphic, since the ESX servers are not cross-connected to the remote VPLEX cluster, the ESX servers lose access to the storage, causing the HA host cluster (in this case vSphere) to perform an HA restart of the virtual machines within domain A2. It can do this because the distributed volumes remain active at B2: the VPLEX is configured with a Witness-protected distributed volume, and the Witness deduces that the VPLEX in domain A2 is unavailable (neither the VPLEX Witness server nor the VPLEX cluster in B2 can communicate with the VPLEX cluster in A2, so VPLEX Witness guides the VPLEX cluster in B2 to remain online).
Note: It is important to understand that this failure is deemed a dual failure. At the time of writing, it is possible with vSphere (all versions, including 5.0 update 1) that the guest VMs in site A will simply hang in this situation (known as a zombie) and VMware HA will not be prompted to restart them (all other supported HA clusters would detect this failure and perform a restart under these conditions). Although this is beyond the scope of the TechBook, manual intervention may be required here to resume VMs at the remote VPLEX cluster, which automatically remains online regardless of rule sets due to the VPLEX Witness. The reason is that if the VPLEX cluster is totally disconnected from the hosts, the hosts cannot receive the PDL (Persistent Device Loss) status issued by the VPLEX; vSphere only sees this as an APD (All Paths Down) state and in most cases waits for the device to be brought back online without failing the VM.

Example 2

The next example describes what happens in the event of a failure of the inter-cluster link between domains A2 and B2. One of two outcomes will occur:

- VMs running at the preferred site: If the static preference for a given distributed volume is set to "cluster-1 detaches" (assuming cluster 1 resides in domain A2) and the virtual machine is running at the site where the volume remains online (the preferred site), there is no interruption to service.
- VMs running at the non-preferred site: If the static preference for a given distributed volume is set to "cluster-1 detaches" (assuming cluster 1 resides in domain A2) and the virtual machine is running at the remote site (domain B1), the VM's storage will be in the suspended state (PDL). In this case the guest operating system fails, allowing the virtual machine to be automatically restarted in domain A1.


Figure 57 shows the link has failed between domains A2 and B2.

Figure 57

Metro HA high-level diagram with intersite failure

In this instance, the static preference rule set previously defined as "cluster-1 detaches" is invoked, since neither VPLEX cluster can communicate with the other (but the VPLEX Witness server can communicate with both VPLEX clusters). Access to the given distributed volume is therefore suspended in domain B2 while remaining active at A2. Virtual machines that were running at A1 are uninterrupted, and virtual machines that were running at B1 are restarted at A1.
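The partition outcomes above can be summarized in a couple of lines of logic. This is a simplified illustrative model of the described behavior, with hypothetical function names and string values; it is not VPLEX code:

```python
def volume_state(cluster, preferred):
    """During a full inter-cluster partition (with the Witness reachable
    from both sides), the static detach rule decides which cluster keeps
    the distributed volume online."""
    return "online" if cluster == preferred else "suspended"


def vm_outcome(vm_cluster, preferred):
    """vm_cluster: the VPLEX cluster serving the VM's datastore."""
    if volume_state(vm_cluster, preferred) == "online":
        return "uninterrupted"
    # Storage suspends (PDL) at the non-preferred site; the host HA
    # cluster restarts the VM at the preferred site.
    return "restarted-at-preferred-site"


# Rule set "cluster-1 detaches": cluster 1 (domain A2) wins the partition.
print(vm_outcome("cluster-1", preferred="cluster-1"))
print(vm_outcome("cluster-2", preferred="cluster-1"))
```

Without a cross-connect, a VM's fate during a partition is decided entirely by whether it happens to be running at the preferred site, which is why the host-affinity guidance in the following note matters.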


Note: Similar to the previous note, and though it is beyond the scope of this TechBook, vSphere HA versions prior to 5.0 update 1 may not detect this condition and may not restart the VM if it was running at the non-preferred site. To avoid any disruption when using vSphere in this type of configuration (for versions prior to 5.0 update 1), VMware DRS host affinity rules can be used (where supported) to ensure that virtual machines are always running in their preferred location (i.e., the location that the storage they rely on is biased towards). Other ways to avoid this scenario are to disable DRS altogether and use vSphere HA only, or to use a cross-connect configuration deployed across a separate physical network, as discussed in the previous section. See Appendix A, vSphere 5.0 Update 1 Additional Settings, for the vSphere HA configuration settings applicable to ESX implementations from version 5.0 update 1 onward that avoid this problem.

The remaining failure scenarios with this solution are identical to the previously discussed VPLEX Metro HA campus solutions. For failure handling in domains A1, B1, A3, B3, or C, see VPLEX Metro HA Campus (with cross-connect) on page 101.



7
Conclusion

This chapter provides a conclusion to the VPLEX solutions outlined in this TechBook:

Conclusion ........................................................................................ 120


Conclusion
As outlined in this book, by using VPLEX AccessAnywhere™ technology in combination with high availability and VPLEX Witness, storage administrators and data center managers can provide physical and logical high availability for their organizations' mission-critical applications with less resource overhead and less dependency on manual intervention. Increasingly, those mission-critical applications are virtualized, in most cases using VMware vSphere or Microsoft Hyper-V virtual machine technologies. It is expected that VPLEX customers will use the HA / VPLEX Witness solution to incorporate several application-specific clustering and virtualization technologies to provide HA benefits for targeted mission-critical applications. As described, the storage administrator is provided with two specific VPLEX Metro-based high availability solutions, outlined specifically for VMware ESX 4.1 or higher as integrated into the VPLEX Metro HA Campus (cross-cluster connect) and standard (non-campus) Metro environments. VPLEX Metro HA Campus provides a higher level of HA than the VPLEX Metro HA deployment without cross-cluster connectivity; however, it is limited to in-data-center use or cases where the network latency between data centers is negligible. Both solutions are ideal for customers who are currently highly virtualized, or planning to become so, and are looking for the following:

Elimination of the night shift storage and server administrator positions. To accomplish this, they must be confident that their applications will ride through any failures that happen during the night.

Reduction of capital expenditures by moving from an active/passive data center replication model to a fully active, highly available data center model.

Increased application availability by protecting against flood and fire disasters that could affect an entire data center.


Viewed holistically, both types of solutions provide the storage administrator a common set of benefits, with some variances. EMC VPLEX technology with Witness provides the following, each discussed briefly:

Better protection from storage-related failures on page 121
Protection from a larger array of possible failures on page 121
Greater overall resource utilization on page 122

Better protection from storage-related failures


Within a data center, applications are typically protected against storage-related failures through the use of multipathing software such as EMC PowerPath. This allows applications to ride through HBA failures, switch failures, cable failures, or storage array controller failures by routing I/O around the location of the failure. The VPLEX Metro HA cross-cluster connect solution extends this protection to the rack and/or data center level by multipathing between VPLEX clusters in independent failure domains. The VPLEX Metro HA solution adds to this the ability to restart the application in the other data center if no alternative route for the I/O exists in its current data center. For example, if a fire were to affect an entire VPLEX rack, the application could be restarted in the backup data center automatically. This provides customers a much higher level of availability and a lower level of risk.

Protection from a larger array of possible failures


To highlight the advantages of VPLEX Witness functionality, let's recall how VMware HA operates. VMware HA (and similar offerings) provides automatic restart of virtual machines (applications) in the event of virtual machine failure for any reason (server failure, failed connection to storage, and so on). This restart involves a complete boot of the virtual machine's guest operating system and applications. While a VM failure leads to an outage, recovery from that failure is usually automatic. When combined with VPLEX in the Metro HA configuration, the same level of protection extends to data center-scale disaster scenarios.


Greater overall resource utilization


Taking the same view of server virtualization products and their recovery capabilities, but turning to utilization: VMware DRS (Distributed Resource Scheduler) can automatically move applications between servers to balance their computational and memory load across all available servers. Within a data center, this has increased server utilization because administrators no longer need to size individual servers to the applications that will run on them. Instead, they can size the entire data center to the suite of applications that will run within it. By adding an HA configuration (Metro or Campus), the available pool of server resources now covers both the primary and backup data centers. Both can be actively used, and excess compute capacity in one data center can be used to satisfy new demands in the other. Alternative vendor solutions:

Microsoft Hyper-V Server 2008 R2 with Performance and Resource Optimization (PRO)

Overall, as data centers continue their expected growth patterns and storage administrators struggle to expand capacity and consolidate at the same time, by introducing EMC VPLEX they can reduce several areas of concern. To recap, these areas are:

Hardware and component failures impacting data consistency
System integrity
High availability without manual intervention
Witness to protect the entire highly available system

In reality, by reducing inter-site overhead and dependencies on disaster recovery, administrators can depend on VPLEX to keep their data available at any time while the beepers and cell phones are silenced.


A
vSphere 5.0 Update 1 Additional Settings

This appendix contains the following information for additional settings needed for vSphere 5.0 update 1:

vSphere 5.0 update 1........................................................................ 124


vSphere 5.0 update 1


As discussed in previous sections, vSphere HA does not automatically recognize a SCSI PDL (persistent device loss) state as one that should cause a VM to invoke an HA failover. Clearly, this may not be desirable when using vSphere HA with VPLEX in a stretched cluster configuration. Therefore, it is important to configure vSphere so that if the VPLEX WAN is partitioned and a VM happens to be running at the non-preferred site (that is, the storage device is put into a PDL state), the VM recognizes this condition and invokes the steps required to perform an HA failover. ESX and vSphere versions prior to version 5.0 update 1 have no ability to act on a SCSI PDL status and will therefore typically hang (that is, continue to be alive but in an unresponsive state). However, vSphere 5.0 update 1 and later can act on the SCSI PDL state by powering off the VM, which in turn invokes an HA failover. To ensure that the VM behaves in this way, additional settings within the vSphere cluster are required. At the time of this writing, the settings are:

1. Using vSphere Client, select the cluster, right-click, and select Edit Settings. From the pop-up menu, select vSphere HA, then click Advanced Options. Define and save the option:
das.maskCleanShutdownEnabled=true

2. On every ESXi server, use vi to edit /etc/vmware/settings with the content below, then reboot the ESXi server. The following output shows the correct setting applied in the file:
~ # cat /etc/vmware/settings
disk.terminateVMOnPDLDefault=TRUE

Refer to the ESX documentation for further details.
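The manual edit in step 2 can be scripted for a fleet of hosts. The sketch below is a hypothetical helper (it is not part of ESXi or any VMware toolkit) that appends the setting only if it is not already present; the file path is a parameter so the idea can be tried against any file.

```python
# Hypothetical helper (assumption: not an ESXi or VMware-supplied tool):
# ensure the PDL setting from step 2 is present in a settings file such as
# /etc/vmware/settings, appending it only when missing (idempotent).
def ensure_pdl_setting(path, line="disk.terminateVMOnPDLDefault=TRUE"):
    try:
        with open(path) as f:
            present = line in (l.strip() for l in f)
    except FileNotFoundError:
        present = False          # file absent: the setting is certainly missing
    if not present:
        with open(path, "a") as f:
            f.write(line + "\n")
```

As the appendix notes, the ESXi server must still be rebooted for the setting to take effect.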


Glossary

This glossary contains terms related to VPLEX federated storage systems. Many of these terms are used in this manual.

A
AccessAnywhere: The breakthrough technology that enables VPLEX clusters to provide access to information between clusters that are separated by distance.

active/active: A cluster with no primary or standby servers, because all servers can run applications and interchangeably act as backup for one another.

active/passive: A powered component that is ready to operate upon the failure of a primary component.

array: A collection of disk drives where user data and parity data may be stored. Devices can consist of some or all of the drives within an array.

asynchronous: Describes objects or events that are not coordinated in time. A process operates independently of other processes, being initiated and left for another task before being acknowledged. For example, a host writes data to the blades and then begins other work while the data is transferred to a local disk and across the WAN asynchronously. See also synchronous.


B
bandwidth: The range of transmission frequencies a network can accommodate, expressed as the difference between the highest and lowest frequencies of a transmission cycle. High bandwidth allows fast or high-volume transmissions.

bias: When a cluster has the bias for a given DR1, it will remain online if connectivity is lost to the remote cluster (in some cases this may get overruled by VPLEX Cluster Witness). This is now known as preference.

bit: A unit of information that has a binary digit value of either 0 or 1.

block: The smallest amount of data that can be transferred following SCSI standards, which is traditionally 512 bytes. Virtual volumes are presented to users as contiguous lists of blocks.

block size: The actual size of a block on a device.

byte: Memory space used to store eight bits of data.
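The bias (preference) rule can be illustrated with a small sketch. This is purely illustrative logic, not VPLEX code, and it deliberately ignores the Witness override mentioned in the definition:

```python
# Illustrative only (not VPLEX code): the static bias/preference rule for a
# DR1 when inter-cluster connectivity is lost. The VPLEX Cluster Witness
# override noted in the glossary entry is deliberately left out.
def dr1_stays_online(cluster, preferred_cluster, remote_reachable):
    if remote_reachable:
        return True                      # no partition: both clusters serve I/O
    return cluster == preferred_cluster  # partition: only the preferred cluster continues
```

For example, with the preference on cluster-1 and the inter-cluster link down, cluster-2 suspends I/O while cluster-1 keeps the DR1 online.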

C
cache: Temporary storage for recent writes and recently accessed data. Disk data is read through the cache so that subsequent read references are found in the cache.

cache coherency: Managing the cache so data is not lost, corrupted, or overwritten. With multiple processors, data blocks may have several copies, one in the main memory and one in each of the cache memories. Cache coherency propagates the blocks of multiple users throughout the system in a timely fashion, ensuring the data blocks do not have inconsistent versions in the different processors' caches.

cluster: Two or more VPLEX directors forming a single fault-tolerant cluster, deployed as one to four engines.

cluster ID: The identifier for each cluster in a multi-cluster deployment. The ID is assigned during installation.


cluster deployment ID: A numerical cluster identifier, unique within a VPLEX cluster. By default, VPLEX clusters have a cluster deployment ID of 1. For multi-cluster deployments, all but one cluster must be reconfigured to have different cluster deployment IDs.

clustering: Using two or more computers to function together as a single entity. Benefits include fault tolerance and load balancing, which increases reliability and up time.

COM: The intra-cluster communication (Fibre Channel). The communication used for cache coherency and replication traffic.

command line interface (CLI): A way to interact with a computer operating system or software by typing commands to perform specific tasks.

continuity of operations (COOP): The goal of establishing policies and procedures to be used during an emergency, including the ability to process, store, and transmit data before and after.

controller: A device that controls the transfer of data to and from a computer and a peripheral device.

D
data sharing: The ability to share access to the same data with multiple servers regardless of time and location.

detach rule: A rule set applied to a DR1 to declare a winning and a losing cluster in the event of a failure.

device: A combination of one or more extents to which you add specific RAID properties. Devices use storage from one cluster only; distributed devices use storage from both clusters in a multi-cluster plex. See also distributed device.

director: A CPU module that runs GeoSynchrony, the core VPLEX software. There are two directors in each engine, and each has dedicated resources and is capable of functioning independently.

dirty data: The write-specific data stored in the cache memory that has yet to be written to disk.

disaster recovery (DR): The ability to restart system operations after an error, preventing data loss.


disk cache: A section of RAM that provides cache between the disk and the CPU. RAM's access time is significantly faster than disk access time; therefore, a disk-caching program enables the computer to operate faster by placing recently accessed data in the disk cache.

distributed device: A RAID 1 device whose mirrors are in geographically separate locations.

distributed file system (DFS): Supports the sharing of files and resources in the form of persistent storage over a network.

Distributed RAID 1 device (DR1): A cache-coherent VPLEX Metro or Geo volume that is distributed between two VPLEX clusters.

E
engine: Enclosure that contains two directors, management modules, and redundant power.

Ethernet: A Local Area Network (LAN) protocol. Ethernet uses a bus topology, meaning all devices are connected to a central cable, and supports data transfer rates of between 10 megabits per second and 10 gigabits per second. For example, 100 Base-T supports data transfer rates of 100 Mb/s.

event: A log message that results from a significant action initiated by a user or the system.

extent: A slice (range of blocks) of a storage volume.

F
failover: Automatically switching to a redundant or standby device, system, or data path upon the failure or abnormal termination of the currently active device, system, or data path.

fault domain: A concept where each component of an HA solution is separated by a logical or physical boundary so that if a fault happens in one domain it will not transfer to the other. The boundary can represent any item that could fail (for example, separate power domains mean that power would remain in the second domain if it failed in the first domain).


fault tolerance: Ability of a system to keep working in the event of hardware or software failure, usually achieved by duplicating key system components.

Fibre Channel (FC): A protocol for transmitting data between computer devices. Longer distance requires the use of optical fiber; however, FC also works using coaxial cable and ordinary telephone twisted pair media. Fibre Channel offers point-to-point, switched, and loop interfaces. Used within a SAN to carry SCSI traffic.

field replaceable unit (FRU): A unit or component of a system that can be replaced on site, as opposed to returning the system to the manufacturer for repair.

firmware: Software that is loaded on and runs from the flash ROM on the VPLEX directors.

G
geographically distributed system: A system physically distributed across two or more geographically separated sites. The degree of distribution can vary widely, from different locations on a campus or in a city to different continents.

Geoplex: A DR1 device configured for VPLEX Geo.

gigabit (Gb or Gbit): 1,073,741,824 (2^30) bits. Often rounded to 10^9.

gigabit Ethernet: The version of Ethernet that supports data transfer rates of 1 gigabit per second.

gigabyte (GB): 1,073,741,824 (2^30) bytes. Often rounded to 10^9.

global file system (GFS): A shared-storage cluster or distributed file system.

H
host bus adapter (HBA): An I/O adapter that manages the transfer of information between the host computer's bus and memory system. The adapter performs many low-level interface functions automatically or with minimal processor involvement to minimize the impact on the host processor's performance.


I
input/output (I/O): Any operation, program, or device that transfers data to or from a computer.

internet Fibre Channel protocol (iFCP): Connects Fibre Channel storage devices to SANs or the Internet in geographically distributed systems using TCP.

intranet: A network operating like the World Wide Web but with access restricted to a limited group of authorized users.

internet small computer system interface (iSCSI): A protocol that allows commands to travel through IP networks, which carries data from storage units to servers anywhere in a computer network.

I/O (input/output): The transfer of data to or from a computer.

K
kilobit (Kb): 1,024 (2^10) bits. Often rounded to 10^3.

kilobyte (K or KB): 1,024 (2^10) bytes. Often rounded to 10^3.

L
latency: Amount of time it takes to fulfill an I/O request.

load balancing: Distributing the processing and communications activity evenly across a system or network so no single device is overwhelmed. Load balancing is especially important when the number of I/O requests issued is unpredictable.

local area network (LAN): A group of computers and associated devices that share a common communications line and typically share the resources of a single processor or server within a small geographic area.

logical unit number (LUN): Used to identify SCSI devices, such as external hard drives, connected to a computer. Each device is assigned a LUN number which serves as the device's unique address.


M
megabit (Mb): 1,048,576 (2^20) bits. Often rounded to 10^6.

megabyte (MB): 1,048,576 (2^20) bytes. Often rounded to 10^6.

metadata: Data about data, such as data quality, content, and condition.

metavolume: A storage volume used by the system that contains the metadata for all the virtual volumes managed by the system. There is one metadata storage volume per cluster.

Metro-Plex: Two VPLEX Metro clusters connected within metro (synchronous) distances, approximately 60 miles or 100 kilometers.

metroplex: A DR1 device configured for VPLEX Metro.

mirroring: The writing of data to two or more disks simultaneously. If one of the disk drives fails, the system can instantly switch to one of the other disks without losing data or service. RAID 1 provides mirroring.

miss: An operation where the cache is searched but does not contain the data, so the data instead must be accessed from disk.

N
namespace: A set of names recognized by a file system in which all names are unique.

network: System of computers, terminals, and databases connected by communication lines.

network architecture: Design of a network, including hardware, software, method of connection, and the protocol used.

network-attached storage (NAS): Storage elements connected directly to a network.

network partition: When one site loses contact or communication with another site.


P
parity: The even or odd number of 0s and 1s in binary code.

parity checking: Checking for errors in binary data. Depending on whether the byte has an even or odd number of 1 bits, an extra 0 or 1 bit, called a parity bit, is added to each byte in a transmission. The sender and receiver agree on odd parity, even parity, or no parity. If they agree on even parity, a parity bit is added that makes each byte even. If they agree on odd parity, a parity bit is added that makes each byte odd. If the data is transmitted incorrectly, the change in parity will reveal the error.

partition: A subdivision of a physical or virtual disk, which is a logical entity only visible to the end user, not any of the devices.

plex: A VPLEX single cluster.

preference: When a cluster has the preference for a given DR1, it will remain online if connectivity is lost to the remote cluster (in some cases this may get overruled by VPLEX Cluster Witness). This was previously known as bias.
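The even-parity scheme described under parity checking can be sketched in a few lines (an illustration of the general concept, not of any VPLEX internal):

```python
# Even-parity sketch: choose a parity bit so the byte plus the bit carries an
# even number of 1 bits; any single flipped bit then changes the parity and
# reveals the transmission error.
def parity_bit(byte: int) -> int:
    """Return 0 or 1 so that byte plus the bit has an even count of 1 bits."""
    return bin(byte).count("1") % 2

def check_even_parity(byte: int, bit: int) -> bool:
    """True if the received byte and parity bit together have even parity."""
    return (bin(byte).count("1") + bit) % 2 == 0
```

For example, 0b1011 contains three 1 bits, so its even-parity bit is 1; if a bit flips in transit, the check fails.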

R
RAID: The use of two or more storage volumes to provide better performance, error recovery, and fault tolerance.

RAID 0: A performance-oriented striped or dispersed data mapping technique. Uniformly sized blocks of storage are assigned in regular sequence to all of the array's disks. Provides high I/O performance at low inherent cost. No additional disks are required. The advantages of RAID 0 are a very simple design and an ease of implementation.

RAID 1: Also called mirroring, this has been used longer than any other form of RAID. It remains popular because of simplicity and a high level of data availability. A mirrored array consists of two or more disks. Each disk in a mirrored array holds an identical image of the user data. RAID 1 has no striping. Read performance is improved since either disk can be read at the same time. Write performance is lower than single disk storage. Writes must be performed on all disks, or mirrors, in the RAID 1. RAID 1 provides very good data reliability for read-intensive applications.


RAID leg: A copy of data, called a mirror, that is located at a user's current location.

rebuild: The process of reconstructing data onto a spare or replacement drive after a drive failure. Data is reconstructed from the data on the surviving disks, assuming mirroring has been employed.

redundancy: The duplication of hardware and software components. In a redundant system, if a component fails then a redundant component takes over, allowing operations to continue without interruption.

reliability: The ability of a system to recover lost data.

remote direct memory access (RDMA): Allows computers within a network to exchange data using their main memories and without using the processor, cache, or operating system of either computer.

Recovery Point Objective (RPO): The amount of data that can be lost before a given failure event.

Recovery Time Objective (RTO): The amount of time the service takes to fully recover after a failure event.

S
scalability: Ability to easily change a system in size or configuration to suit changing conditions; to grow with your needs.

simple network management protocol (SNMP): Monitors systems and devices in a network.

site ID: The identifier for each cluster in a multi-cluster plex. By default, in a non-geographically distributed system the ID is 0. In a geographically distributed system, one cluster's ID is 1, the next is 2, and so on, each number identifying a physically separate cluster. These identifiers are assigned during installation.

small computer system interface (SCSI): A set of evolving ANSI standard electronic interfaces that allow personal computers to communicate faster and more flexibly than previous interfaces with peripheral hardware such as disk drives, tape drives, CD-ROM drives, printers, and scanners.


split brain: Condition when a partitioned DR1 accepts writes from both clusters. This is also known as a conflicting detach.

storage RTO: The amount of time taken for the storage to be available after a failure event. (In all cases this will be a smaller time interval than the RTO, since the storage is a prerequisite.)

stripe depth: The number of blocks of data stored contiguously on each storage volume in a RAID 0 device.

striping: A technique for spreading data over multiple disk drives. Disk striping can speed up operations that retrieve data from disk storage. Data is divided into units and distributed across the available disks. RAID 0 provides disk striping.

storage area network (SAN): A high-speed special-purpose network or subnetwork that interconnects different kinds of data storage devices with associated data servers on behalf of a larger network of users.

storage view: A combination of registered initiators (hosts), front-end ports, and virtual volumes, used to control a host's access to storage.

storage volume: A LUN exported from an array.

synchronous: Describes objects or events that are coordinated in time. A process is initiated and must be completed before another task is allowed to begin. For example, in banking, two withdrawals from a checking account that are started at the same time must not overlap; therefore, they are processed synchronously. See also asynchronous.
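The stripe depth and striping entries can be illustrated by mapping a logical block address to its disk and on-disk offset in a RAID 0 set (an illustration of the general round-robin technique; actual products may lay data out differently):

```python
# Illustrative RAID 0 layout sketch (not a VPLEX algorithm): logical blocks
# are grouped into stripe units of `stripe_depth` blocks, and stripe units
# are assigned round-robin across the disks in the array.
def raid0_locate(lba: int, num_disks: int, stripe_depth: int):
    """Map a logical block address to (disk_index, block_on_disk)."""
    stripe = lba // stripe_depth           # which stripe unit the block is in
    offset = lba % stripe_depth            # position inside the stripe unit
    disk = stripe % num_disks              # stripe units rotate across disks
    block_on_disk = (stripe // num_disks) * stripe_depth + offset
    return disk, block_on_disk
```

With two disks and a stripe depth of 4, logical blocks 0-3 land on disk 0, blocks 4-7 on disk 1, block 8 back on disk 0, and so on.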

T
throughput: 1. The number of bits, characters, or blocks passing through a data communication system or portion of that system. 2. The maximum capacity of a communications channel or system. 3. A measure of the amount of work performed by a system over a period of time. For example, the number of I/Os per day.

tool command language (TCL): A scripting language often used for rapid prototypes and scripted applications.


transmission control protocol/Internet protocol (TCP/IP): The basic communication language or protocol used for traffic on a private network and the Internet.

U
uninterruptible power supply (UPS): A power supply that includes a battery to maintain power in the event of a power failure.

universal unique identifier (UUID): A 64-bit number used to uniquely identify each VPLEX director. This number is based on the hardware serial number assigned to each director.

V
virtualization: A layer of abstraction implemented in software that servers use to divide available physical storage into storage volumes or virtual volumes.

virtual volume: A virtual volume looks like a contiguous volume, but can be distributed over two or more storage volumes. Virtual volumes are presented to hosts.

VPLEX Cluster Witness: A new feature in VPLEX V5.x that can augment and improve upon the failure handling semantics of static preference.

W
wide area network (WAN): A geographically dispersed telecommunications network. This term distinguishes a broader telecommunication structure from a local area network (LAN).

world wide name (WWN): A specific Fibre Channel Name Identifier that is unique worldwide and represented by a 64-bit unsigned binary value.

write-through mode: A caching technique in which the completion of a write request is communicated only after data is written to disk. This is almost equivalent to non-cached systems, but with data protection.
