System High Availability
Technical White Paper
Issue 01
Date 2017-02-03
and other Huawei trademarks are trademarks of Huawei Technologies Co., Ltd.
All other trademarks and trade names mentioned in this document are the property of their respective
holders.
Notice
The purchased products, services and features are stipulated by the contract made between Huawei and
the customer. All or part of the products, services and features described in this document may not
be within the purchase scope or the usage scope. Unless otherwise specified in the contract, all
statements, information, and recommendations in this document are provided "AS IS" without warranties,
guarantees or representations of any kind, either express or implied.
The information in this document is subject to change without notice. Every effort has been made in the
preparation of this document to ensure accuracy of the contents, but all statements, information, and
recommendations in this document do not constitute a warranty of any kind, express or implied.
Website: http://e.huawei.com
Contents
A Glossary
The architecture components of the Huawei FusionCloud desktop solution, a desktop cloud
product, are deployed on virtual machines (VMs). Figure 1-1 shows the architecture of
the Huawei FusionCloud desktop solution.
The annual average global VM availability rate reaches 99.9% (as stated in the Service Level
Agreement).
This specification indicates the proportion of time during which VMs are available. It is
determined by the reliability and the repair capability, as expressed by the following
formula:
$$A = \frac{MTBF}{MTBF + MTTR}$$
Where:
A indicates the availability.
MTBF indicates the mean time between failures.
MTTR indicates the mean time to repair.
For details about how to achieve the committed value, see Chapter 3 "System
Reliability."
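As a rough worked illustration of what the 99.9% figure implies, the relation A = MTBF / (MTBF + MTTR) can be read as an annual downtime budget. The 8,760-hour year (365 days) is an assumption made here for the arithmetic and is not a figure taken from the Service Level Agreement.

```latex
\[
\text{Downtime budget per year} = (1 - A) \times 8760\ \text{h}
  = (1 - 0.999) \times 8760\ \text{h} = 8.76\ \text{h}
\]
```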
Duty Time, 24 x 7
The value 24 x 7 indicates that the VM can provide services around the clock.
Power Recovery Duration, shorter than two hours
This specification indicates the duration from the time power to the cloud platform is
restored to the time all services have recovered.
System software of the cloud platform, including management software and computing
server software, does not need to be loaded sequentially. Loading each server takes less
than 5 minutes. A maximum of 20 servers can be loaded concurrently.
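The loading figures above can be turned into a rough upper-bound estimate. The sketch below assumes, purely for illustration, that servers load in strict waves of at most 20 and that every wave takes the full 5 minutes; the function name and the wave model are hypothetical and do not describe the platform's actual scheduling.

```python
import math


def estimated_load_minutes(server_count: int,
                           minutes_per_server: float = 5.0,
                           max_concurrent: int = 20) -> float:
    """Upper-bound loading time, assuming servers load in waves.

    minutes_per_server and max_concurrent come from the text; the
    wave-by-wave model itself is an assumption of this sketch.
    """
    if server_count < 0:
        raise ValueError("server_count must not be negative")
    waves = math.ceil(server_count / max_concurrent)
    return waves * minutes_per_server


if __name__ == "__main__":
    for count in (20, 100, 480):
        print(count, "servers ->", estimated_load_minutes(count), "minutes")
```

Under this simplified model, software loading alone stays within two hours for up to 480 servers. Time spent on hardware power-up and service startup is not modeled.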
VM Migration Duration, 3 minutes
This specification indicates the duration from the time the system detects that a VM is
powered off or has broken down to the time the system successfully restarts the VM, or the
duration from the time the system detects that a VM is faulty to the time the system starts
the VM on another server. The duration depends on the startup duration of the
operating system (OS) on the VM.
If the system management server does not receive any heartbeat response from a
VM within 40 seconds, it starts the VM on another server. This is the high
availability (HA) process. VM Migration Duration does not include the startup duration
of the VM itself. VM split-brain is avoided by using a lock mechanism.
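The heartbeat rule above can be sketched as a small timeout check. Only the 40-second threshold comes from the text; the class, method names and VM identifiers are hypothetical, and the real management server's detection logic is not described in this document.

```python
from dataclasses import dataclass, field
from typing import Dict, List

HEARTBEAT_TIMEOUT_S = 40.0  # threshold stated in the text


@dataclass
class HeartbeatMonitor:
    """Tracks the last heartbeat time of each VM (illustrative only)."""
    timeout_s: float = HEARTBEAT_TIMEOUT_S
    last_seen: Dict[str, float] = field(default_factory=dict)

    def record_heartbeat(self, vm_id: str, now: float) -> None:
        # Called whenever a heartbeat response arrives from a VM.
        self.last_seen[vm_id] = now

    def overdue_vms(self, now: float) -> List[str]:
        # VMs whose last heartbeat is at least timeout_s old are
        # candidates for being started on another server.
        return sorted(vm for vm, seen in self.last_seen.items()
                      if now - seen >= self.timeout_s)


if __name__ == "__main__":
    monitor = HeartbeatMonitor()
    monitor.record_heartbeat("vm-a", now=0.0)
    monitor.record_heartbeat("vm-b", now=30.0)
    print(monitor.overdue_vms(now=45.0))
```

Timestamps are passed in explicitly so that the check can be exercised without waiting in real time.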
Live Migration Duration, 20 seconds for a VM with 1 GB memory
This specification indicates the duration for migrating a VM from one server to another
server without affecting service provisioning.
During live migration, the virtualization software copies the VM memory to the
destination physical server at a rate of about 1 GB per 20 seconds. After that, the
software copies the data that changed during the previous copy operation to the destination
physical server at the same rate. The process repeats until all the latest data has been
migrated to the destination physical server. Then the VM is started on the destination
server and the original VM is stopped. This final switchover takes a few milliseconds,
and the user is unaware of the process.
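The iterative copy described above is commonly called pre-copy migration, and its convergence can be illustrated with a small simulation. The copy rate of 1 GB per 20 seconds comes from the text; the dirty rate, the stop threshold and the function name are assumptions chosen for illustration, not product parameters.

```python
from typing import List, Tuple

COPY_RATE_GB_PER_S = 1.0 / 20.0  # "about 1 GB per 20 seconds" from the text


def simulate_precopy(memory_gb: float,
                     dirty_rate_gb_per_s: float,
                     stop_threshold_gb: float = 0.002,
                     max_rounds: int = 30
                     ) -> Tuple[List[Tuple[float, float]], float]:
    """Return ([(gb_copied, seconds), ...], gb_left_for_final_switchover).

    dirty_rate_gb_per_s and stop_threshold_gb are hypothetical inputs.
    """
    if dirty_rate_gb_per_s >= COPY_RATE_GB_PER_S:
        raise ValueError("memory is dirtied faster than it can be copied")
    rounds: List[Tuple[float, float]] = []
    to_copy = memory_gb
    for _ in range(max_rounds):
        seconds = to_copy / COPY_RATE_GB_PER_S
        rounds.append((to_copy, seconds))
        # Memory changed while this round was being copied must be re-sent.
        to_copy = dirty_rate_gb_per_s * seconds
        if to_copy <= stop_threshold_gb:
            break
    return rounds, to_copy


if __name__ == "__main__":
    rounds, leftover = simulate_precopy(memory_gb=1.0, dirty_rate_gb_per_s=0.005)
    for index, (gb, seconds) in enumerate(rounds, start=1):
        print(f"round {index}: {gb:.3f} GB in {seconds:.1f} s")
    print(f"left for the final switchover: {leftover:.4f} GB")
```

With these assumed inputs, each round is one tenth the size of the previous one, so the first full copy of about 20 seconds dominates the total.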
TC Yearly Failure Rate, smaller than 3%
3 System Reliability
3.1 Cabinet
The cabinet has the following reliability characteristics:
Dual power distribution units (PDUs) with overcurrent protection
Earthquake-proof up to magnitude 9
In compliance with NEBS L3
3.2 Server
3.2.1 CPU Reliability
| Reliability Feature | Concept | Benefit |
|---|---|---|
| Core isolation | Disables services running on some CPU cores when the cores are faulty. | Ensures service availability by sacrificing certain performance and recovers system processing capabilities during off-peak hours. |
| Socket isolation | Starts the active CPU only for services when the standby CPU is faulty. | Ensures service availability by sacrificing certain performance and recovers system processing capabilities during off-peak hours. |
| PECI-based MCA register access by out-of-band systems | Provides a PECI channel that is decoupled from the internal active system for out-of-band systems to access MCA registers of CPUs. | Enables out-of-band systems to access MCA registers using the Platform Environment Control Interface (PECI). Uses the PECI channel of out-of-band systems to access MCA registers of CPUs when an error occurs in the PCH and the PECI channel of the ME is unavailable, capturing as much fault information as possible for fault location. |
The availability of the IP SAN is 99.9991%, its MTBF is 12.684 years (111110.1 hours), and its
yearly outage is 4.73 minutes.
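The IP SAN figures can be cross-checked against the availability relation A = MTBF / (MTBF + MTTR). The sketch below assumes a 365-day year; the implied MTTR it computes is derived here for illustration and is not a value stated in this document.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """A = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def yearly_outage_minutes(avail: float) -> float:
    """Expected unavailable minutes per year at a given availability."""
    return (1.0 - avail) * HOURS_PER_YEAR * 60.0


def implied_mttr_hours(avail: float, mtbf_hours: float) -> float:
    """Solve A = MTBF / (MTBF + MTTR) for MTTR."""
    return mtbf_hours * (1.0 - avail) / avail


if __name__ == "__main__":
    a = 0.999991          # 99.9991% from the text
    mtbf = 111110.1       # hours, from the text
    print(f"yearly outage: {yearly_outage_minutes(a):.2f} minutes")
    print(f"implied MTTR:  {implied_mttr_hours(a, mtbf):.2f} hours")
```

The computed yearly outage agrees with the 4.73 minutes quoted above, and the stated availability and MTBF together correspond to a repair time of roughly one hour.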
FusionStorage and the IP SAN support core data protection mechanisms,
such as data redundancy protection, power-off protection, background scanning,
and data pre-reconstruction. These mechanisms ensure data security. The user data
persistence rate reaches 99.999%, whereas the data persistence rate for traditional PCs is
less than 95%.
After multiple NICs on one server are bonded, the system uses the bonded group to expand
incoming and outgoing bandwidth on the server, to implement load balancing, and to enhance
disaster tolerance. This avoids transmission congestion or service interruption caused by a
single NIC.
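The load-balancing and fault-tolerance idea behind NIC bonding can be sketched as follows. This is a minimal illustration that assumes flows are spread over healthy NICs by hashing a flow identifier; the function name, NIC names and hashing policy are hypothetical, and the bonding mode actually used by the product is not specified in this document.

```python
import zlib
from typing import Dict


def pick_nic(flow_id: str, nic_up: Dict[str, bool]) -> str:
    """Choose a NIC for a flow, skipping NICs whose link is down."""
    healthy = sorted(name for name, up in nic_up.items() if up)
    if not healthy:
        raise RuntimeError("no healthy NIC left in the bond")
    # A stable hash keeps one flow on one NIC while spreading different flows.
    index = zlib.crc32(flow_id.encode("utf-8")) % len(healthy)
    return healthy[index]


if __name__ == "__main__":
    bond = {"eth0": True, "eth1": True}
    print(pick_nic("10.0.0.5:443", bond))
    bond["eth1"] = False  # simulate a NIC failure
    print(pick_nic("10.0.0.5:443", bond))
```

When one NIC fails, every flow is mapped to the remaining healthy NIC, so traffic continues at reduced bandwidth.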
VRRP dynamically associates a virtual router with the physical equipment that carries
service traffic. When that equipment is faulty, other equipment is selected to
take over service transmission. The whole process is transparent to users and ensures continuous
communication between the internal network and the external network.
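The takeover behavior can be illustrated with a simplified master election. In VRRP, the router with the highest priority becomes the master, and a tie is broken in favor of the higher IP address. The sketch below models only that selection rule; the `Router` class, the names and the addresses are hypothetical, and advertisement timers and preemption are not modeled.

```python
from dataclasses import dataclass
from ipaddress import IPv4Address
from typing import Iterable, Optional


@dataclass(frozen=True)
class Router:
    name: str
    priority: int
    ip: IPv4Address
    healthy: bool = True


def elect_master(routers: Iterable[Router]) -> Optional[Router]:
    """Pick the healthy router with the highest (priority, IP address)."""
    candidates = [r for r in routers if r.healthy]
    if not candidates:
        return None
    return max(candidates, key=lambda r: (r.priority, r.ip))


if __name__ == "__main__":
    r1 = Router("gw-1", 120, IPv4Address("192.0.2.1"))
    r2 = Router("gw-2", 100, IPv4Address("192.0.2.2"))
    print(elect_master([r1, r2]).name)
    failed_r1 = Router("gw-1", 120, IPv4Address("192.0.2.1"), healthy=False)
    print(elect_master([failed_r1, r2]).name)
```

Because clients address the virtual router rather than a physical device, the change of master does not require any reconfiguration on their side.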
The management node is responsible for all the services in the entire system. The node works
in active/standby mode. If the active and standby nodes malfunction simultaneously,
related services, such as VM creation or deletion, also fail. Running VMs are not
affected by such a malfunction. Users can continue to use applications
on their VMs without knowing that a fault has occurred on the management nodes.
3.5.3 VM Backup
The eBackup VM backup scheme uses the Huawei eBackup software and the snapshot
backup function of FusionCompute to back up data for VMs. By working with
FusionCompute, eBackup backs up data of a specified VM or a specified volume of the VM
based on specified policies. If VM data is lost or the VM is faulty, data can be restored by
using backup data. Backup data is stored on the virtual disks attached to the eBackup VM or
on external storage devices that use the Network File System (NFS) or Common Internet File
System (CIFS).
The eBackup VM backup scheme has the following characteristics:
Ease of use. Users do not need to install backup proxy software. They only need to create
VM templates and manage VMs on the graphical user interface (GUI).
The VM-level backup service enables users to configure the full backup policy,
incremental backup policy, backup cycle, backup period, and backup data expiration
policy. Different types of VMs can be configured with different backup policies.
Efficient backup and restoration. In full backup mode, only valid data is backed up. In
incremental backup mode, only modified data is backed up. This minimizes the backup
traffic and the required backup storage space.
Concurrent backup and restoration. Each backup device supports 200 VMs and allows
concurrent backup and restoration for eight VMs. Each backup domain supports 10
backup devices. The backup has no impact on production VMs because eBackup
software is deployed on dedicated virtual devices.
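The capacity figures in the list above lend themselves to a simple sizing calculation. The sketch below assumes, for illustration only, that VMs are packed onto backup devices until each device reaches its 200-VM limit and that backups run in waves of eight; the function and key names are hypothetical.

```python
import math
from typing import Dict

VMS_PER_DEVICE = 200            # from the text
CONCURRENT_JOBS_PER_DEVICE = 8  # from the text
DEVICES_PER_DOMAIN = 10         # from the text


def size_backup(vm_count: int) -> Dict[str, int]:
    """Estimate backup devices, backup domains and waves for vm_count VMs."""
    if vm_count < 0:
        raise ValueError("vm_count must not be negative")
    devices = math.ceil(vm_count / VMS_PER_DEVICE)
    domains = math.ceil(devices / DEVICES_PER_DOMAIN)
    # Number of 8-VM waves needed on the most heavily loaded device.
    busiest = min(vm_count, VMS_PER_DEVICE)
    waves = math.ceil(busiest / CONCURRENT_JOBS_PER_DEVICE)
    return {"devices": devices, "domains": domains,
            "waves_on_busiest_device": waves}


if __name__ == "__main__":
    print(size_backup(2000))
```

By these figures, one backup domain covers up to 2,000 VMs, with at most 80 backup or restoration jobs running concurrently.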
3.5.4 VM HA
If a physical CNA server is powered down or restarted abnormally, the system can migrate
the HA-enabled VMs to other computing servers. This ensures that VMs can
be quickly restored to the normal state.
The FusionCloud solution provides multiple migration strategies. Because thousands of VMs
can run within a cluster, after a computing server is powered down the system migrates VMs to
different destination servers based on the network traffic status and load of the destination
servers. This avoids network congestion and destination server overload.
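One way to picture a load-aware and traffic-aware placement is a greedy assignment that always picks the least busy destination. The sketch below is an illustration under assumed inputs: the `Host` fields, the fixed per-VM cost and the 80% limit are hypothetical and do not represent the solution's actual migration strategies.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Host:
    name: str
    cpu_pct: int  # current CPU load, 0-100
    net_pct: int  # current network load, 0-100


def plan_restarts(vm_ids: List[str], hosts: List[Host],
                  vm_cost_pct: int = 10, limit_pct: int = 80) -> Dict[str, str]:
    """Assign each VM to the destination whose busier resource is lowest."""
    load = {h.name: [h.cpu_pct, h.net_pct] for h in hosts}
    plan: Dict[str, str] = {}
    for vm_id in vm_ids:
        eligible = [name for name, (cpu, net) in load.items()
                    if cpu + vm_cost_pct <= limit_pct
                    and net + vm_cost_pct <= limit_pct]
        if not eligible:
            raise RuntimeError(f"no destination server has room for {vm_id}")
        # Ties are broken by name so that the plan is deterministic.
        best = min(eligible, key=lambda name: (max(load[name]), name))
        plan[vm_id] = best
        load[best][0] += vm_cost_pct
        load[best][1] += vm_cost_pct
    return plan


if __name__ == "__main__":
    hosts = [Host("cna-1", 20, 10), Host("cna-2", 50, 30)]
    print(plan_restarts([f"vm-{i}" for i in range(1, 6)], hosts))
```

Updating the projected load after every assignment is what spreads the VMs out instead of sending them all to the server that was least busy at the start.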
VMs support the high availability feature. If a VM cannot connect to the VRM, the VRM
regards the VM as faulty and issues a command to restart the VM on
another computing node. The VM fault is then recovered automatically. To prevent VM
split-brain caused by incorrect decisions, the system uses an anti-split-brain
lock mechanism.
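The document does not describe how the anti-split-brain lock works internally. The following is a minimal sketch of one common approach, a lease-based lock that a host must hold and renew in order to run a VM; the class, the 30-second lease and the host names are all assumptions made for illustration.

```python
from typing import Dict, Tuple


class VmLockStore:
    """In-memory stand-in for a per-VM lock (hypothetical design)."""

    def __init__(self, lease_s: float = 30.0) -> None:
        self.lease_s = lease_s
        self._locks: Dict[str, Tuple[str, float]] = {}  # vm_id -> (host, renewed_at)

    def acquire_or_renew(self, vm_id: str, host: str, now: float) -> bool:
        """Grant the lock if it is free, already held by host, or expired."""
        holder = self._locks.get(vm_id)
        if holder is None or holder[0] == host or now - holder[1] >= self.lease_s:
            self._locks[vm_id] = (host, now)
            return True
        return False


if __name__ == "__main__":
    locks = VmLockStore()
    locks.acquire_or_renew("vm-1", "cna-1", now=0.0)
    # cna-1 is still alive and renewing, so a restart on cna-2 is refused.
    print(locks.acquire_or_renew("vm-1", "cna-2", now=10.0))
    # cna-1 stops renewing; once the lease expires, cna-2 may start the VM.
    print(locks.acquire_or_renew("vm-1", "cna-2", now=40.0))
```

In this sketch, a VM whose original host is merely unreachable from the management plane keeps its lock, so a second copy of the same VM cannot be started elsewhere.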
Management nodes run in active/standby mode and feature high availability (HA).
The active and standby management nodes are deployed on different CNAs and are mutually
exclusive. For example, when the active management node is faulty, the standby
management node becomes the active management node. Meanwhile, the faulty management
node is restarted on another CNA and serves as the standby management node. This design
prevents the active and standby management nodes from breaking down simultaneously.
thereby preventing misjudgment on HA due to management network faults. After this function
is enabled, FusionSphere can detect host faults on the service plane and generate alarms
accordingly.
Live migration of VMs can reduce service running costs for the customer. With this
function, services running on different servers can be migrated to fewer servers or a single
server when traffic is light, and the idle servers can then be turned off. This helps the
customer reduce costs. It also saves energy and reduces emissions.
Live migration of VMs can ensure high reliability of the customer system. When a fault
occurs on a running physical machine, you can migrate the services to another properly running
machine before the situation gets worse.
The hardware can be upgraded without interrupting services. When the customer wants to
upgrade the hardware without interrupting services, you can migrate all the VMs on the
physical machine to other machines and then upgrade the machine. After the upgrade is
finished, you can migrate the VMs back. During the process, the services are not interrupted.
Currently, the system only supports live migration of the VM in the following application
scenarios:
Manually migrate the VM to any idle physical server as required.
Migrate the VM in batches to any idle physical server based on the status of resource
utilization.
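The consolidation scenario, in which VMs are packed onto fewer servers during light traffic so that idle servers can be turned off, resembles a bin-packing problem. The sketch below uses the first-fit decreasing heuristic purely as an illustration; the load units, the capacity value and the function name are assumptions, and the product's own batch-migration policy is not described in this document.

```python
from typing import Dict, List


def consolidate(vm_loads: Dict[str, int], host_capacity: int) -> List[List[str]]:
    """Pack VMs onto as few hosts as first-fit decreasing manages to use."""
    hosts: List[Dict[str, object]] = []  # each: {"used": int, "vms": [names]}
    # Place the largest VMs first; names break ties deterministically.
    for vm, load in sorted(vm_loads.items(), key=lambda kv: (-kv[1], kv[0])):
        if load > host_capacity:
            raise ValueError(f"{vm} does not fit on any host")
        for host in hosts:
            if host["used"] + load <= host_capacity:
                host["used"] += load
                host["vms"].append(vm)
                break
        else:
            hosts.append({"used": load, "vms": [vm]})
    return [host["vms"] for host in hosts]


if __name__ == "__main__":
    loads = {"vm-a": 50, "vm-b": 30, "vm-c": 20, "vm-d": 40}
    for index, vms in enumerate(consolidate(loads, host_capacity=100), start=1):
        print(f"server {index}: {vms}")
```

Any server that receives no VMs in the resulting plan is a candidate for being powered off until traffic increases again.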
[Figure: each node sends its backup files to the backup server over FTPS.]
The data on each node is backed up into compressed files at 01:00. The backup files are sent
to the backup server over FTPS. The backup server can be the one provided in the Huawei VDI
solution or an FTP server provided by the customer. The IP address of the FTP backup server
can be configured on the ITA management page. Only the latest 10 backup copies are
retained on the backup server. If data is damaged, you can download the backup from the backup
server and quickly restore the data according to the documentation.
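The retention rule (keep only the latest 10 copies) can be sketched as a pruning step. The file-naming scheme below, with a sortable timestamp embedded in each name, is a hypothetical convention chosen for illustration; the actual backup file names are not given in this document.

```python
from typing import List

KEEP_LATEST = 10  # retention count stated in the text


def backups_to_delete(filenames: List[str], keep: int = KEEP_LATEST) -> List[str]:
    """Return the backups that fall outside the newest `keep` copies.

    Assumes names sort chronologically, for example
    backup-20170203-0100.tar.gz (hypothetical naming scheme).
    """
    newest_first = sorted(filenames, reverse=True)
    return sorted(newest_first[keep:])


if __name__ == "__main__":
    names = [f"backup-201702{day:02d}-0100.tar.gz" for day in range(1, 13)]
    print(backups_to_delete(names))
```

With twelve daily files in this example, the two oldest are selected for deletion and the ten newest remain.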
A Glossary
A
AD active directory
ATA advanced technology attachment
B
BIOS basic input/output system
C
CNA Computing Node Agent
D
DB database
DHCP Dynamic Host Configuration Protocol
DNS domain name server
G
GM GalaxManager
I
ITA IT adapter
M
MTBF mean time between failures
MTTR mean time to repair
N
NC network computer
NEBS Network Equipment Building System
P
PDU power distribution unit
R