This video is part of the Microsoft Virtual Academy.

In this session we are going to be diving deeper into Microsoft's high availability solutions. Part one of this series looked at the application infrastructure, meaning failover clustering, virtualization and some of the other key infrastructure components. Part two is going to look at the applications which run on top of this infrastructure. We're going to spend most of our time looking at SQL Server and Exchange Server, and then briefly cover some of the other server high availability solutions. I'm Symon Perriman, and I'm going to be joined in this session by SQL program manager Justin Erickson and Exchange technical writer Scott Schnoll.

Learn about Microsoft's different high availability technologies and when to use each of them. High availability is important because it keeps our applications up and running, and not only for availability's sake: by maintaining continual service we keep our customers happy and connected in a 24/7 marketplace. This session will focus specifically on the application layer; we've covered the core infrastructure in part one of this video series, and part three will look at management, focusing on System Center.

I'm now going to turn it over to Justin Erickson, Senior Program Manager with the SQL team. Justin. Justin: Hello everyone, I'm Justin Erickson, a program manager on the SQL Server database engine team.

So let's quickly go through an introduction to each of the technologies; if you have questions, there are sessions in the SQL Server track that go into more detail on a lot of the high availability technologies. The key thing I want to point out is that when you look at what comprises database downtime, there are two big portions. You have unplanned downtime, where I actually have a failure or a user caused an issue and I have to move to a different system, and there's also planned downtime, where I'm doing an application upgrade, applying a patch, or just trying to maintain the SLAs that I need for my system's throughput. We look at all of these drivers as we decide what makes up SQL Server's availability technology.

The gamut of technologies that we have, looking at existing releases as well as what's coming in the SQL Server Denali release with AlwaysOn, is listed here. I'll walk through each one and talk about how each technology builds on the previous one. You'll see there's a sequence of backup and restore, log shipping, and database mirroring, which is essentially the same technology built up incrementally to give you better SLAs. There are also technologies like replication, which sort of fits into this space, and failover cluster instances, which use a lower level of data protection with SANs and shared storage, with SQL Server failing over on top of that shared storage. And then we'll end by talking about some of the ways to manage planned downtime, which is the majority of the downtime you'll see. Backup and restore is the most basic technology. Regardless of what technology you're using on top of it, it's always a good idea to have a physical backup of your database so you can recreate the entire system from scratch should your high availability system go down, your entire data center go down, or should you hit some other issue where you need to go back to a point in time. Backup and restore is your base set of technologies there.

When you look at the downtime of the backup and restore solution, though: you have the backup, and if something goes down you need to use the restore process to get your system back up and running. You're now doing a full installation of that system and applying the restore, or maybe the system is already there but you're restoring from scratch, which for terabyte-sized databases can take a good amount of time. That's where SQL Server log shipping comes into play. This is basically an automated backup and restore process: you have transactions coming into your primary system, log backups happening on a periodic basis through the SQL Server Agent job schedule, another job that copies the backups to the secondary system, and finally a third job that runs the restore process. So this is basically doing backup and restore, but not waiting for a failure; it says, I'm going to have the secondary system ready to go, constantly running the backup job, the copy job and the restore job, so that when I have a failure I just have to apply whatever logs haven't been applied at that time. And there's a nice wizard in SQL Server Management Studio to help you set this up and determine what intervals you want to configure based on your needs.
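
To make the backup, copy and restore cycle concrete, here is a minimal sketch of the two T-SQL operations that the log shipping jobs automate, run from PowerShell with the SQLPS module; the instance names, database name and share path are hypothetical examples, and in practice you would let the Management Studio wizard create the Agent jobs for you.

```powershell
# Hypothetical sketch of the cycle that log shipping automates (names and paths are examples).
# Assumes a full backup of SalesDB was already restored WITH NORECOVERY on the secondary.
Import-Module SQLPS -DisableNameChecking   # SQL Server PowerShell module (Denali)

# On the primary: take a transaction log backup to a shared location.
Invoke-Sqlcmd -ServerInstance "PRIMARY\SQL1" -Query @"
BACKUP LOG [SalesDB] TO DISK = N'\\backupshare\logs\SalesDB_001.trn';
"@

# On the secondary: restore the log with NORECOVERY so later logs can still be applied.
Invoke-Sqlcmd -ServerInstance "SECONDARY\SQL1" -Query @"
RESTORE LOG [SalesDB] FROM DISK = N'\\backupshare\logs\SalesDB_001.trn' WITH NORECOVERY;
"@
```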

The next technology takes log shipping, which was the automated backup and restore process, and builds it into the engine, so it streams log records directly. Because it's built into the engine we can do things like provide synchronous commit, where I can make sure my secondary system is fully up to date with the primary so that when I fail over there's zero data loss. The way this works is your application commits a set of transactions; at the time we write the transactions locally to our log file, we also send them over to the secondary. If you're in synchronous mode we write them to the log file on the secondary side, send back an acknowledgement, and only then do we tell the application that its transaction has been committed. This means that at any point in time, when we execute that failover, my secondary is fully up to date with the primary. Of course you don't have to run in synchronous mode; you can always run in asynchronous mode, where the primary just sends log records over as fast as it can without waiting for the acknowledgement from the mirror, and the primary continues ahead so I'm not slowing down the workload. That ends up being a choice depending on what your SLA needs are.
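
As a rough illustration of how a synchronous database mirroring session is configured, here is a hedged T-SQL sketch (run via PowerShell for consistency with the other examples); it assumes mirroring endpoints already exist on both instances, and all server and database names are hypothetical.

```powershell
# Hypothetical sketch: establish a synchronous database mirroring session.
# Assumes DATABASE_MIRRORING endpoints already exist on both instances.
Import-Module SQLPS -DisableNameChecking

# 1. On the mirror server, point at the principal.
Invoke-Sqlcmd -ServerInstance "MIRROR\SQL1" -Query @"
ALTER DATABASE [SalesDB] SET PARTNER = N'TCP://principal.contoso.com:5022';
"@

# 2. On the principal, point at the mirror and require synchronous (full safety) commits.
Invoke-Sqlcmd -ServerInstance "PRINCIPAL\SQL1" -Query @"
ALTER DATABASE [SalesDB] SET PARTNER = N'TCP://mirror.contoso.com:5022';
ALTER DATABASE [SalesDB] SET PARTNER SAFETY FULL;  -- SAFETY OFF would give asynchronous mode
"@
```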

Another technology which isn't really built to be a high availability technology but is often used in high availability scenarios is replication, and the reason it's typically used is to get extra utilization out of the hardware. When I'm using a system like log shipping or database mirroring, I have a mirror or secondary that's sitting there waiting for a failure, and sometimes we hear from customers: well, if I have that hardware I also want to do something with it. There are scenarios where customers have used replication in the past because it not only allows you to send the data to your secondary but also to read the data from the secondary, for reporting or for offloading other workloads. I'll talk in a second about how AlwaysOn availability groups take away this need, so as we go forward we're simplifying the technology stack.

With SQL Server AlwaysOn in the upcoming release of SQL Server, we took a holistic look at what we do for high availability and asked how we could build an integrated, flexible, efficient, single solution to meet your high availability needs, rather than the previous set of technologies that you had to put together to build a solution. From that we came up with two main feature areas. We have AlwaysOn availability groups, which provide database protection with SQL Server doing the data replication, similar to database mirroring and log shipping. And we have AlwaysOn failover cluster instances, which allow customers to use their existing infrastructure and provide data protection at a lower layer of the hardware stack, using the SAN and shared storage for data protection with SQL Server failing over between nodes. Failover cluster instances are a technology that existed in previous releases but were enhanced in Denali with multi-site clustering, a flexible failover policy, a better health detection and diagnostics infrastructure, and improved failover times with indirect checkpoints. AlwaysOn availability groups are a new feature in Denali that replaces database mirroring; they provide a multi-database failover unit and multiple secondaries, so I don't need to combine database mirroring and log shipping, as well as active secondaries that I can read from and take backups from, so replication no longer needs to be in the high availability mix. On top of this feature set we also looked at providing an integrated HA management solution.

So what are SQL Server failover cluster instances? This is built on the clustering technologies Symon went through earlier. With AlwaysOn failover cluster instances we use a shared disk for the data protection, so each of the machines accesses the same set of files, and when we fail over we're moving access to that same data file to another machine and having SQL Server start up on that side. So on the SQL side we're providing protection of the binaries and processes between machines, while relying on SAN technologies underneath to provide protection of the database files themselves.

WSFC here stands for Windows Server Failover Clustering, and the slide contrasts WSFC with FCI (failover cluster instances). Scoping: no two replicas are placed on the same node (something to keep in mind with Hyper-V). Availability groups use WSFC for: 1. primary selection and coordination, 2. primary health detection, and 3. distributing changes and the authoritative state. Secondary health is driven from the primary, with no impact to the primary.

SQL Server AlwaysOn availability groups use SQL Server itself to provide the data protection. It's still built on top of Windows clustering, which helps us with inter-node health detection and with making state and configuration changes consistently across the system, but it does not rely on any SAN or shared storage infrastructure; SQL Server provides the data protection. Collections of databases move between nodes, rather than the binaries and services moving.
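
For a sense of what this looks like, here is a hedged sketch of creating a small availability group with the Denali T-SQL syntax; it assumes the WSFC cluster, the AlwaysOn feature and the mirroring endpoints are already in place, and every name below is a hypothetical example.

```powershell
# Hypothetical sketch: a two-replica availability group for one database (Denali syntax).
# Seeding the database onto the secondary (restore WITH NORECOVERY and join) is omitted here.
Import-Module SQLPS -DisableNameChecking

Invoke-Sqlcmd -ServerInstance "NODE1\SQL1" -Query @"
CREATE AVAILABILITY GROUP [SalesAG]
FOR DATABASE [SalesDB]
REPLICA ON
    N'NODE1\SQL1' WITH (ENDPOINT_URL = N'TCP://node1.contoso.com:5022',
                        AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
                        FAILOVER_MODE = AUTOMATIC),
    N'NODE2\SQL1' WITH (ENDPOINT_URL = N'TCP://node2.contoso.com:5022',
                        AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
                        FAILOVER_MODE = AUTOMATIC);
"@

# On the secondary replica, join it to the group.
Invoke-Sqlcmd -ServerInstance "NODE2\SQL1" -Query @"
ALTER AVAILABILITY GROUP [SalesAG] JOIN;
"@
```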

When we look at AlwaysOn as a comprehensive solution, it's built to meet combinations of needs. In some cases you're using shared storage and SANs for data protection within a data center, and availability groups between data centers, which is like the picture on the right. In other cases you don't have any investment in shared storage and you want a cheaper solution with faster failover, and that's where AlwaysOn availability groups come in on their own. You can mix and match these technologies to meet your needs, whatever they are.

Another common question that comes up is: what about virtualization, how does this fit into the mix? Virtualization is often used in consolidation scenarios with SQL Server, and virtualization on its own does provide some high availability guarantees. When you look at virtualization you need to consider both planned and unplanned downtime, at the host as well as the guest layer. Virtualization provides live migration, where you can move VMs between hosts with zero downtime, and that's the best solution for planned downtime at the host level. If you have an unplanned event at the host level, that's when you're failing over the entire VM and doing an OS restart, so it provides some protection there but with a slower recovery time. If you have failures at the guest level, that's where virtualization doesn't provide any protection: if I have database file corruption, or the binaries inside that OS get corrupted for whatever reason, just using virtualization doesn't protect me, and I'm falling back to backup and restore, or I can use an additional technology inside the guest to get the best of both worlds. Similarly, at the planned level, when I'm patching the guest OS I'm taking downtime during the patch unless I have another technology within the guest to provide protection. So when you look at when to use an AlwaysOn technology versus just virtualization: if you look at these requirements and your customer says this isn't enough, that's when it's worth investing in the additional complexity. If these capabilities meet your SLAs, then it's fine to stick with virtualization as your core technology rather than biting off the additional complexity of adding another solution in the guest. And all of our technologies will work inside virtualization.

So that gives you a quick introduction to our unplanned downtime features. When we look at planned downtime there are other things to consider: how do I handle OS as well as SQL Server upgrades? That's where each of the technologies has a rolling upgrade story, where I can upgrade the mirror, fail over to the secondary, patch the old primary and then fail back if I need to. Online operations are another key piece: if I'm making an application change where I'm changing database structures or adding new data to the system, online operations, which are enhanced in Denali, allow me to make those changes without impacting the currently running workloads. There are new enhancements in SQL Server Denali where we can do more online index builds, including on LOB (large object) data types, as well as adding non-nullable columns online, which wasn't available in previous releases. Along with this, you often look at other sources that impact your SLAs: if I'm building my SLAs as a business, I'm looking not only at what happens in the event of a failover, but also at whether my system can respond at the throughput I need. That's where Resource Governor is a great technology; it allows you to throttle workloads to reserve capacity for your core workloads. I can say I want to reserve 80% of my CPU for my core workload while still allowing lower-priority workloads to run on that same box, restricted to a certain share of the resources, so Resource Governor is a great way to keep lower-priority workloads from impacting the SLAs of your most critical workloads.
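
Here is a hedged T-SQL sketch of the Resource Governor pattern described above, reserving CPU for a core workload and capping a lower-priority one; the pool, group, function and login names are all hypothetical.

```powershell
# Hypothetical sketch: reserve CPU for the core workload and cap a lower-priority reporting workload.
Import-Module SQLPS -DisableNameChecking

Invoke-Sqlcmd -ServerInstance "PRIMARY\SQL1" -Database "master" -Query @"
CREATE RESOURCE POOL CorePool   WITH (MIN_CPU_PERCENT = 80);
CREATE RESOURCE POOL ReportPool WITH (MAX_CPU_PERCENT = 20);
CREATE WORKLOAD GROUP ReportsGroup USING ReportPool;
"@

# The classifier function runs at login time and routes sessions to a workload group.
Invoke-Sqlcmd -ServerInstance "PRIMARY\SQL1" -Database "master" -Query @"
CREATE FUNCTION dbo.fnClassifier() RETURNS sysname WITH SCHEMABINDING
AS
BEGIN
    RETURN CASE WHEN SUSER_SNAME() = N'ReportUser' THEN N'ReportsGroup' ELSE N'default' END;
END;
"@

Invoke-Sqlcmd -ServerInstance "PRIMARY\SQL1" -Database "master" -Query @"
ALTER RESOURCE GOVERNOR WITH (CLASSIFIER_FUNCTION = dbo.fnClassifier);
ALTER RESOURCE GOVERNOR RECONFIGURE;
"@
```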

That's a quick introduction to SQL Server; now I'll hand it over to Scott to talk about Exchange.

Scott: My name is Scott Schnoll, I'm a Principal Technical Writer on the Exchange team. Among other things I write all of the product documentation around high availability, site resilience, disaster recovery and a few other areas, so I'm really excited to talk to you about this.

I do want to tell you that Exchange does things a little differently from what you've heard until now. We do use failover clustering technologies, but we don't use any shared storage, we don't use the cluster resource model, and in fact we're really more of a consumer of clustering technologies, as you'll see in a minute. We also have a very specific definition of high availability in Exchange: to have true high availability for an Exchange server you must have three things, service availability, data availability, and automatic recovery from most failures. We say most failures because you're not going to get automatic recovery from all failures; for example, you wouldn't get automatic recovery from a data center level event. We have mechanisms to do manual recovery for that, but it's not an automatic solution, and that would be a DR process, not a high availability process. The other thing I want to tell you about is that we use the term *overs a lot, and that's really just our shorthand for switchovers and failovers. We've been talking about failovers a lot: a failover is simply when the system takes the automatic corrective action for you, whereas a switchover is when an administrator manually activates, for instance, a passive copy of an Exchange database. And then we have site resilience as well. Site resilience and HA are unified into a single platform inside of Exchange 2010, but they are different operations with different configurations, as you'll see in a minute; site resilience is the DR-type configuration that you use to protect yourself when you have multiple data centers and you want redundancy across those data centers.

Now, we actually introduced both service availability and *over capabilities way back in Exchange 5.5, but in those days we were using Microsoft Cluster Server on NT 4, we were using the cluster resource model, and many of our core components were cluster-aware: Exchange knew it was being installed in a cluster and behaved a little differently from an unclustered Exchange server. We also relied very heavily on third-party partner products at that time; we didn't have any built-in data replication whatsoever, so we had no native data availability in Exchange and instead relied on hardware vendors, storage vendors and replication vendors to make copies of our data for us. In Exchange 2007 we took a revolutionary leap forward and started breaking away from the old legacy way of doing Exchange clustering. We still supported the old style of Exchange clustering where you use shared storage, but we gave that a different name, a single copy cluster, to reflect that in that cluster you only had one single copy of your data. In 2007 we also introduced a second form of Exchange clustering called cluster continuous replication, and that is when we introduced our continuous replication, or what we call log shipping, technology.

We actually had three different forms of continuous replication in Exchange 2007. One is local, where you're just shipping a copy of the logs to a database that's connected to the same server as your active copy. We also had cluster continuous replication, where every database you had on the active node was replicated to the passive node, and you always had them in pairs. And then we had one called standby continuous replication, which we introduced in Service Pack 1 for Exchange 2007, and that allowed you to replicate data pretty much anywhere: from a standalone mailbox server to another standalone server, from a cluster to a standby cluster, and so forth. In fact, as Exchange 2007 evolved and matured, it became pretty much the de facto architecture to use a combination of cluster continuous replication for high availability within the data center and standby continuous replication to get site resilience for that data center. And this is basically what it looks like: the Information Store in Exchange is doing what it's done since day one, generating log files, and as those log files are closed they're copied over to the other copy of the database, they're inspected by that copy, and assuming they pass inspection they get replayed into that copy, thereby making it pretty much an up-to-date, bit-for-bit duplicate of the original active copy.

Now, this is typically what it would look like in the organization topology. Here I've got two separate CCR clusters; remember, CCR was always a pair of two, an active and a passive. I've got some Outlook, Outlook Web App and ActiveSync clients out there; they're going through our front-end component called a client access server in 2007 and later, or in the case of Outlook going directly to the Information Store, and basically we would replicate one for one within these pairs. If you wanted to extend that solution to another data center you used a separate technology, standby continuous replication, and that actually worked really well, although it had some challenges. But it worked, it got the data over there; you had a standby server, maybe a standby cluster, and so if you had any problem with your primary site, in this case San Jose, you could go ahead and activate the Dallas site, get your clustered mailbox server up and running, and life was good. There were some challenges, though. When you clustered the mailbox role in 2007 it couldn't co-exist with any other server roles, the client access role, the transport role, the unified messaging role; you had to buy extra hardware for those, because only the mailbox role was allowed in the cluster. That meant at a minimum, if you wanted high availability for Exchange 2007, you had to buy at least four servers. Another challenge was that you had to have some clustering knowledge, and that might not seem like a big deal if you've been doing it for a long time, but most of the administrators who manage Exchange are Exchange pros, not cluster pros. So it was sometimes challenging for them to build the underlying cluster correctly before deploying Exchange. That wasn't so much true in the CCR paradigm, but it was especially true in the other type of cluster we had in 2007, the single copy cluster, where you also had to deal with the shared storage and the interconnects and get all of that just right. Another challenge was that even though 2007 supported 50 databases per server, if you had a problem with just a single database on that server, you had to fail over the whole clustered mailbox server; the entire Exchange server's network identity had to be moved to another server, even if you only had one problematic database out of 50, so that wasn't very optimal. We did introduce SCR, so people finally had a built-in way to get data replicated outside of the cluster and offsite to a different data center, but we introduced it in a service pack, and typically when we introduce major features in a service pack we don't put GUI around them. That meant if you wanted to manage SCR you had to do it all from the Exchange Management Shell, which is a PowerShell-based console; you couldn't use the Exchange Management Console, which is an MMC snap-in where you click on pictures, so administrators had to learn to manage CCR one way and manage SCR a completely different way.
And then the last challenge was that even after you got the data over there, it was a pretty complex activation process you had to go through. There were many, many steps that involved taking over the clustered mailbox server identity and forklifting it over to the recovery server. That took time, and for some administrators it was confusing because of the different technologies involved, so we looked at all of this and came up with a whole new solution in Exchange 2010.

In fact, Exchange 2010 is very different from anything we've done in the past. First of all, there's no more clustered mailbox server. We don't use the cluster resource model anymore; or, put slightly differently, the cluster has no idea that we're even there. But we know the cluster is there because we use it: we use the cluster's node and membership APIs so that we can join the servers together in a group, we use the cluster's heartbeating technology, which is very mature, proven technology, to find out when servers are dropping off the network, and of course we use the cluster database, because there's data we need to share between the members of the solution, and we need to share it much more quickly than if we stored it in Active Directory and waited for it to replicate. What you're seeing here is a representation of a new construct that we call a database availability group, or DAG for short. A DAG is simply a collection of mailbox servers, in this case five, that host replicated databases. If you look at DB1, for example, you can see that DB1 on mailbox server 1 is green, which in this case means it's the active copy, and then DB1 on mailbox server 2 and DB1 on mailbox server 4 are in blue; those represent passive copies of the database, copies that the system keeps up to date itself and that are waiting to become active in the case of some failure affecting the active database. We also made another architectural change: all clients, including Outlook MAPI clients, no longer connect directly to the Information Store. Instead they connect to a set of services on the client access server. One is the Address Book service, where they get their directory information, and the other is the RPC Client Access service, which is where they get their MAPI endpoint now. So all Outlook knows is that it's got its MAPI and directory endpoints; it has no idea it's talking to a client access server rather than a mailbox server anymore. You can also see here that I have the option to replicate databases as I see fit. It's not like CCR, where every database on the active node gets replicated to the passive node; it's more like SCR, in that the administrator gets to choose which databases get replicated and to where. In this case the administrator only wanted three copies of DB1, so we spread them across mailbox servers one, two and four. Similarly you can see on mailbox server 1 that DB1 and DB3 are both green, those are the active copies, but mailbox server 1 also hosts a passive copy of DB2. Again, this is another departure from our previous model, where you had only active instances on one server and only passive instances on another. Now you can have active and passive copies of multiple databases on multiple servers, as you see here.
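
To give a feel for how simple this is, here is a hedged Exchange Management Shell sketch that creates a DAG, adds members, and adds passive database copies like the ones on the slide; the DAG, witness, server and database names are hypothetical.

```powershell
# Hypothetical sketch (Exchange 2010 Management Shell): build a small DAG and add copies.
New-DatabaseAvailabilityGroup -Name DAG1 -WitnessServer HUB1 -WitnessDirectory C:\DAG1

# Add mailbox servers as DAG members.
Add-DatabaseAvailabilityGroupServer -Identity DAG1 -MailboxServer MBX1
Add-DatabaseAvailabilityGroupServer -Identity DAG1 -MailboxServer MBX2
Add-DatabaseAvailabilityGroupServer -Identity DAG1 -MailboxServer MBX4

# Create passive copies of DB1 on two other members (the administrator chooses which and where).
Add-MailboxDatabaseCopy -Identity DB1 -MailboxServer MBX2 -ActivationPreference 2
Add-MailboxDatabaseCopy -Identity DB1 -MailboxServer MBX4 -ActivationPreference 3
```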

Changing to this model changed everything from a failover perspective, because we don't have a clustered mailbox server anymore and we don't have a network identity to move anymore. Now we only have to move the designation of the active copy, and I say move the designation because we're not really picking up a database and moving it. All we're saying is: you're active now, you're passive now; you had a problem, so you're passive now and you're active now. It's that simple. Failover is now managed completely within Exchange, because there is no cluster resource model. If you open Failover Cluster Manager on a mailbox server that's a member of a DAG and you look under services and applications, you're not going to see anything: there's no Exchange group, no Exchange resources, no IP addresses, no storage groups, no databases, no Information Store, no System Attendant, nothing. We don't use the cluster resource model anymore. That means we had to have some mechanism to handle failover within Exchange; in previous versions, if we had a problem, the cluster moved the resource over to another node for us. Now we have a brand new component inside of Exchange called Active Manager, and Active Manager runs in a key service on these mailbox servers called the Microsoft Exchange Replication service. It's the same service we introduced in 2007 to do log shipping and CCR and SCR, but now there's a new component running inside that service called Active Manager, and that's the brain of the Exchange solution. Active Manager is not only responsible for managing everything, it's also responsible for initiating the corrective action when some sort of failure occurs. Say, for instance, the disk hosting DB1 just dies; we're not using RAID in this case, so the disk dies and the database is gone with it. Active Manager detects that and will automatically fail over the active copy to one of the other passive copies, whichever one it believes to be the best, most up-to-date healthy copy. I mentioned that all clients connect via CAS, so the system works like this: I've got a client out there, it might be Outlook, might be Outlook Web App, might be ActiveSync, we don't know, it's just a client accessing the system and getting messages in. There's an Active Manager client that also runs inside of the client access server, and that knows where the user's database is located, so users only connect to CAS, and it's CAS that talks MAPI over RPC to the Information Store. Users don't talk to the Information Store directly anymore; we've abstracted the user connection away from the Information Store so that we can get fast failover when one of these databases has to fail over.

So messages come in, they go to the appropriate database, and then the log files representing those messages get replicated to the copies of that database. The slide uses message icons, but it's not the actual messages that we replicate; it's the transaction log files generated by the Exchange database engine itself. So if we have a failure affecting database 1, and database 1 disappears for whatever reason, maybe it's the storage, maybe it's some sort of corruption, we don't know, database 1 is gone, what happens in 30 seconds or less is that, as you can see under mailbox server 2, the copy of DB1 has now gone green: a new active copy replaces the failed active copy by choosing the best available passive copy to activate. In this case the system decided that the best copy was on mailbox server 2. Notice the client stays connected to CAS even though their underlying database went away; CAS understands what's going on because of the Active Manager client, so Active Manager says, no, your database is now over here, I'm going to connect CAS to mailbox server 2, and the client is back in business. And because it happens so quickly, it's quite possible, and more often than not it's the case, that clients don't even notice that any of this happened. They're abstracted away from it, so as their database goes away they don't get disconnected. They might get disconnected if the CAS server goes away, but we'll talk about load balancing and how to deal with that in a minute. Assuming CAS doesn't go away and it's just a failure of the mailbox server, the mailbox server's network, or the mailbox server's disks, that's going to be a transparent failover to the client; they're probably not going to notice anything. Now, if they happen to be in the middle of Outlook Web App, composing something, and they go to hit send, and in the middle of hitting send a failover occurs, they will get a message saying that their mailbox is temporarily unavailable. But if they just wait a few seconds and press F5 to refresh the browser, it will bring them right back into their mailbox and they won't even have to log on again. It's that fast. And of course mail flow will continue, because Active Manager knows where the database is located, and replication will continue as long as you have multiple copies left.
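
The administrator-driven counterpart to this automatic failover, the switchover mentioned earlier, is a single Exchange Management Shell command; here is a hedged sketch with hypothetical database and server names.

```powershell
# Hypothetical sketch: a switchover, moving DB1's active copy to MBX2.
Move-ActiveMailboxDatabase DB1 -ActivateOnServer MBX2

# Check the health of every copy of DB1 (status plus copy and replay queue lengths).
Get-MailboxDatabaseCopyStatus "DB1\*" |
    Format-Table Name, Status, CopyQueueLength, ReplayQueueLength -AutoSize
```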

Now, each DAG, and I'm showing a five-member DAG here, can have up to sixteen members, so you can have up to sixteen copies of each of your databases, and each Exchange server itself supports 100 databases, so you can have 1,600 databases inside a single DAG. Of course those would be non-replicated databases, but you could have 800 databases with 2 copies of each, 533 databases with 3 copies of each, and so forth. Also consider that our maximum recommended database size is now 2 terabytes per database, so you can grow this very, very large; it scales incredibly well, and in case you're wondering how well, this solution is what's running Outlook.com, Office 365 and so forth, so we're talking 75 million, almost 80 million, mailboxes on this solution. The beauty of it is that the same exact commands you use to create the DAG inside a data center are the same ones you would use to extend it to another data center to put yourself in a site resilience configuration; it's that easy.

Now, I mentioned the DAG, and this is basically what the architecture looks like: clients are talking to the client access server, specifically to the RPC Client Access service and the Address Book service, and it's the Active Manager component that tells those services where the user's mailbox is located so that CAS can talk to the mailbox server for them.

So a DAG is simply a set of up to sixteen servers that host a set of replicated databases. You can have multiple DAGs in a single org; obviously if you need more than 16 members you have to use a second DAG, and so forth. We do leverage the Windows failover clustering technologies, but we're not cluster-aware and we don't use the cluster resource model, and the DAG itself defines the boundary for replication, so you won't be replicating outside the DAG.

As I mentioned, we added a second form of continuous replication in Service Pack 1, so let me briefly talk about that.

So now we have two forms. In 2007 and in 2010 RTM we had one form of continuous replication, a form of log shipping where we ship closed transaction log files. The active copy, in green, creates the log files, and then the passive copy says, hey, send me your latest log files, I've got these so far, and the latest log files go across. If the passive copy is able to keep up and catch up with log generation activity on the active copy, which in this case it would because it now has log five, the last one generated, then the system says, you know what, this database copy is up to date, and it switches into block mode. In block mode, instead of shipping closed transaction log files, we actually ship blocks of ESE transactions as they're being written to the log buffer. We write to the log buffer on the active side and at the same time send that information over to a corresponding buffer on the passive side to keep things up to date. Now, all continuous replication is asynchronous, so we don't wait for acknowledgement from the other side; there is potential for data loss, but we've got other mechanisms built into the system to get that data back. But again, we now have the ability to replicate blocks; we don't have to wait for a transaction log file to be closed in order to externalize that data, which means the amount of losable data has substantially decreased with Service Pack 1 as a result of databases being able to leverage block mode. And of course, once the buffer is full, the corresponding log file is built on each side separately, inspected, and replayed into the copy of the database after that. We also have a mechanism whereby, if we only get a partial buffer and then the active copy goes away, we'll actually use that: we take what we call a log fragment, convert it into a full log, and if there are usable transactions in there we play those transactions against the database, so we at least get the data that made it over.

We also have the concept of a lagged database copy. This is a database copy for which you can delay the replay of log files for up to 14 days, so think of it as a point-in-time backup of your database up to 14 days old. Don't go beyond 14 days; we have a hard-coded limit of 14 days for the lag. It's basically there to provide you with up to 14 days of protection against things like logical corruption. Physical corruption in your store isn't a problem, because continuous replication will detect it and block physical corruption from being replicated to another database copy. For logical corruption there's no way for the system to tell, so as a fallback mechanism you have the ability to delay replay into a passive copy, and if an end user does detect logical corruption you can activate a copy at a point in time before that corruption took place. Lagged copies do affect storage design, since you're holding all of those log files, so you need to size them appropriately, but it's another layer of protection that we have.
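
A lagged copy is configured when you add the database copy; here is a hedged sketch using hypothetical names, with replay delayed the full 14 days.

```powershell
# Hypothetical sketch: add a lagged copy of DB1 on MBX5 with a 14-day replay lag.
Add-MailboxDatabaseCopy -Identity DB1 -MailboxServer MBX5 `
    -ReplayLagTime 14.00:00:00 -TruncationLagTime 1.00:00:00
```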

Now, load balancing has changed a little bit with Exchange 2010 as well. In 2007, most customers were used to doing load balancing for the reverse proxies, so that traffic coming from the internet would get load balanced and not overwhelm a single reverse proxy. As a result of the architectural change we made, where Outlook now connects to the client access server instead of the Information Store, you need a form of RPC load balancing for your Outlook clients so that they aren't all going to a single CAS server. So you will need a load balancer, and it has to be able to load balance RPC, which means something like Windows Network Load Balancing won't handle it for you; it has to be an RPC-capable load balancer, and it has to support not only RPC but client affinity as well. This catches some customers off guard because it's a new requirement we never had in Exchange before, so be aware of it when you talk to customers who are migrating from 2003 or 2007 to 2010. The last thing is that Exchange does support backup and recovery; obviously that's disaster recovery, not high availability. We used to support both the ESE streaming backup APIs and the VSS APIs, but because we're dealing with much larger data sets now, the ESE streaming APIs just weren't going to do the job, so we cut them from Exchange 2010.
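
One common way to set this up in Exchange 2010 is a client access server array whose FQDN resolves to the load balancer's virtual IP; here is a hedged sketch with hypothetical names (the load balancer itself is configured outside of Exchange).

```powershell
# Hypothetical sketch: create a CAS array and point a database's Outlook (RPC) endpoint at it.
New-ClientAccessArray -Name "CASArray-SanJose" -Fqdn "outlook.contoso.com" -Site "San Jose"
Set-MailboxDatabase DB1 -RpcClientAccessServer "outlook.contoso.com"
```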

We now support only VSS-based backups, but the good news is we ship a plug-in for Windows Server Backup in the box, so if you just want a basic VSS backup of your databases you get that in the box with Exchange; you don't have to buy other products. If you want something more full-featured, that's where DPM or another Exchange-aware third-party VSS solution comes in. We also have some other DR technologies. One is called a recovery database: it's basically an object into which you can restore a database and then extract data out of it. We also have the concept of database portability, where you can take any Exchange 2010 database and move it to any other Exchange 2010 server inside the org; even if you didn't replicate it, you can pick it up and forklift it over somewhere else. And the last thing we have is dial tone portability: if you do have a failure affecting your only database copy, you at least have the ability to spin up what we call a dial tone database. It's an empty database that just allows users to send and receive mail; it doesn't have their historical data in it, but it gives them dial tone so they can at least send and receive messages while you restore their data in the background.
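
As a rough illustration of the recovery database workflow, here is a hedged Exchange 2010 SP1 sketch; the paths, database and mailbox names are hypothetical, and it assumes the backup product has already restored the database files to the recovery location.

```powershell
# Hypothetical sketch: create a recovery database, then pull one mailbox's data out of it.
New-MailboxDatabase -Recovery -Name RDB1 -Server MBX1 `
    -EdbFilePath "D:\Recovery\DB1.edb" -LogFolderPath "D:\Recovery\Logs"

# After the restored database in RDB1 is mounted, extract one user's data (SP1 syntax).
New-MailboxRestoreRequest -SourceDatabase RDB1 -SourceStoreMailbox "Kim Akers" -TargetMailbox kim
```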

Thank you, Scott. Now I'm going to continue and talk about some of the other mission-critical servers from Microsoft and their high availability solutions.

http://support.microsoft.com/kb/957006 First of all, think about virtualization. Virtualization is one of Microsoft's key investments with the Hyper-V platform, and all teams now test their products on Hyper-V. Microsoft has what's called the Common Engineering Criteria, a series of guidelines which each engineering team must follow to ensure their applications are enterprise-ready. One of the key tenets of this guide is to test on Hyper-V to make sure the product has equivalent performance and equivalent resiliency. You can go online and check out KB article 957006, which is kept up to date as far as which versions of the various Microsoft products are supported in a Hyper-V environment. And when we think of Hyper-V, think Hyper-V with failover clustering, meaning that we can run all of these application services inside a VM guest and the VM itself is clustered.
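
Making a guest highly available at the host level is a one-line operation once the VM's storage is on cluster storage; here is a hedged sketch with hypothetical VM and cluster names.

```powershell
# Hypothetical sketch: make an existing VM a highly available clustered role (Windows Server 2008 R2).
Import-Module FailoverClusters
Add-ClusterVirtualMachineRole -VMName "SQLGuest01" -Cluster "HVCluster1"
```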

The next major application is the file server, which of course manages your storage, shares, replication, and search and indexing. Traditionally file servers use failover clustering; this is the default configuration, and you can have multiple file server instances on a failover cluster. DFS Replication is another technology that is part of the file server role, and it can be used as a high availability technology in that it allows you to push information from one server to another, so the data lives in multiple locations. If a primary server crashes or becomes unavailable, you can recover the information from a secondary location. You can do this within a single site, within a data center, or within a group of servers, and you can do it across multiple sites, which builds up a disaster recovery solution if you have multiple data centers. Replication also gives you the ability to access offline files; if you use the Offline Files feature you're actually using some DFS replication on the back end to push out and keep your local copies of these files updated. Now, one key thing to keep in mind is that replication only happens when a file is closed. That can give you pretty good availability if you're working on something like a Word document or an Excel spreadsheet, but if we extend this concept to the enterprise it doesn't do a great job of replicating things that keep their files open, for example a virtual machine's VHD file or a SQL database. These types of resources are kept open indefinitely and are really only closed when they're taken offline. If they're kept open and replication hasn't happened, you could potentially lose all of the data that has been collected since the last replication, and for this reason failover clustering does not support DFSR as a replication technology, since it is possible that some data could be lost. Additionally, keep in mind that there can be replication conflicts: if multiple people work on the same document simultaneously in two different locations, there can be conflicts that need to be resolved when replication happens. Nevertheless, it is a great in-box solution to give you some level of high availability in your data center.
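
For reference, here is a hedged sketch of adding a clustered file server instance to an existing failover cluster; the role name, disk name and IP address are hypothetical.

```powershell
# Hypothetical sketch: add a highly available file server role to an existing cluster.
Import-Module FailoverClusters
Add-ClusterFileServerRole -Name "FS-HA01" -Storage "Cluster Disk 2" -StaticAddress 192.168.1.50
```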

Lync Server is an extension of Microsoft's unified communications offerings; it covers all types of messaging, including IM, voice and video, and content sharing over live streaming media. Lync Server has a relatively flexible high availability architecture. The core is using load balancers to connect people to a registrar: when a user wants to connect to the Lync server, they get sent to a registrar. There is a requirement to use hardware load balancers for the registrar, and NLB, Microsoft's Network Load Balancing, is explicitly not supported. The registrars themselves have access to what's called a backup registrar pool, so if the primary registrar is unavailable when a client connects, or it crashes, the client gets sent to the backup registrar; from the user's perspective they may be disconnected temporarily, but their session will be recovered and they can stay online. There is also DNS load balancing available for other types of network traffic in Lync Server, which gives you basic high availability just by distributing incoming clients across different registrars or Lync server components, so that one specific server or component isn't overloaded with too many client connections. There are also partners which deliver what are called SBAs, or Survivable Branch Appliances. These are essentially customized unified communications appliances which contain a subset of the Lync functionality, so by connecting to an SBA, which generally sits in a branch office, we can keep a client connected and using the basic communication and collaboration tools, although with limited access to the full functionality of Lync Server. By having an SBA in a branch office we can keep our client up and running even if they cannot connect to their primary data center.

Lync also has two multi-site high availability solutions, called Data Center Resiliency and Metropolitan Data Center Resiliency. Data Center Resiliency allows us to spread our Lync servers across multiple physical locations, and we can even have high availability for the voice communication, so if somebody is on a phone call and the primary data center crashes, we can fail over to the secondary without dropping the call. The high availability is built specifically for voice failover, so it is possible that other types of transactions, such as IM communication, could be temporarily lost if a failover happens. The more advanced version is what's called Metropolitan Data Center Resiliency, where you have an active/active configuration with continual replication between the sites at the hardware level. The reason it's called metropolitan is that it's generally deployed within a single city, meaning the distance over which you can stretch the sites is limited to a few miles or a few dozen miles. This gives you higher availability, since it's an active/active configuration, so you'll have better resilience, but the distance between the data centers is limited. The final Lync Server high availability solution is simply backup and restore: if you lose information or a server crashes, you can pull it back using an expedited service restoration process, a workflow that can be pre-programmed or pre-orchestrated to help you recover as quickly as possible.

http://blogs.msdn.com/b/joelo/archive/2007/03/09/sharepoint-backup-restore-high-availabilityand-disaster-recovery.aspx

SharePoint Server is Microsoft's web platform for all types of collaboration and document management. It's primarily built around a database where all of the shared content is stored, and this database can be made highly available using SQL Server: it can use SQL backup and restore, database mirroring or log shipping, and it can be protected using System Center Data Protection Manager, or DPM. DPM has some nice integration points with SharePoint because it gives you the ability to granularly restore specific objects, so if you lost a specific document you could recover just that document rather than having to restore the whole database. Two of the additional SharePoint server roles, the Crawl or Index server and the Search or Query server, are deployed in a redundant topology, meaning there are multiple instances of them available throughout the infrastructure, and if one of them is unavailable, clients are simply reconnected to another one to keep indexing and search queries fast. SharePoint has a rich, web-based front-end experience where users browse documents or collaborate on a site, and this is made highly available using Windows Network Load Balancing; NLB distributes the traffic across multiple front-end servers to ensure that a single server is not overloaded. An additional high availability feature which is unique to SharePoint is the recycle bin, which gives you the ability to recover items that were accidentally deleted: if you lose any type of file, list or application, by default it is still saved for 30 days before being permanently deleted, giving people the opportunity to recover documents that were accidentally removed.

As we talked about with the web front end for SharePoint, a lot of this is built on Microsoft's web server, IIS. IIS has a rich set of clients and a rich topology to handle all the different kinds of web services, from file transfers to actually serving websites. Network Load Balancing is used for most of the web server roles with IIS; this means that when a client tries to connect, they can go through NLB and be load balanced across multiple servers, and additionally hardware load balancing can be used. Now, Network Load Balancing does its load balancing at layers 2 and 3 of the networking stack. With IIS, however, there are often load balancing requirements at layer 7, which is the HTTP traffic, so a load balancer at this level actually looks at the URL, HTTP://microsoft.com for example, and load balances the traffic based on what is contained within that URL. IIS does this with what's called an Application Request Routing server, or ARR. ARR contains the logic to do the load balancing for this layer 7 traffic. However, the ARR server itself needs to be made highly available so that it's not a single point of failure, and the ARR servers can use Network Load Balancing to be deployed in redundant arrays. So at the front end you have ARR with Network Load Balancing, which not only gives you traffic load balancing at layers 2 and 3 but also figures out at layer 7 where it should direct clients among the content servers, and then you have the middle tier which actually serves up the content. Additionally, IIS has high availability for two of its roles using failover clustering: the FTP role and the WWW role. There are white papers that show you how to explicitly configure these roles on a Windows Server failover cluster so that the client access point for anyone trying to connect to either of these roles is always available and can move between different nodes in the failover cluster.
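
As a rough illustration of the NLB piece, here is a hedged PowerShell sketch (Windows Server 2008 R2 NLB module) that creates an NLB cluster for the web front ends and adds a second host; the interface names, cluster name and addresses are hypothetical.

```powershell
# Hypothetical sketch: create an NLB cluster for the web front ends and add a second host.
Import-Module NetworkLoadBalancingClusters
New-NlbCluster -InterfaceName "Ethernet" -ClusterName "WebFarm" `
    -ClusterPrimaryIP 192.168.1.100 -OperationMode Multicast
Add-NlbClusterNode -NewNodeName "WEB2" -NewNodeInterface "Ethernet" -InterfaceName "Ethernet"
```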

As we wrap up this section, we've covered quite a lot of the core servers and core applications from Microsoft. As we know, downtime is inevitable, so not only is it important to keep our infrastructure up and running, it's even more important to keep our applications up and running. While most of these technologies and servers can use virtualization or network load balancing, many of them have unique and specific dependencies on failover clustering. And as we've seen with Exchange Server as well as SQL Server, they both use failover clustering as one of the underlying technologies, yet they abstract a lot of the management and functions specific to SQL and Exchange away from clustering; so while they might use the cluster for membership or for health checking, the rest of the functionality is unique to Exchange and to SQL. We hope that you found this module on application high availability useful, and check out part three of this series, which will look at management high availability.

This video is a part of the Microsoft Virtual Academy. Thank you.
