
Huawei FusionCloud Desktop Solution 6.1
System High Availability
Technical White Paper

Issue 01

Date 2017-02-03

HUAWEI TECHNOLOGIES CO., LTD.


Copyright © Huawei Technologies Co., Ltd. 2017. All rights reserved.
No part of this document may be reproduced or transmitted in any form or by any means without
prior written consent of Huawei Technologies Co., Ltd.

Trademarks and Permissions

The Huawei logo and other Huawei trademarks are trademarks of Huawei Technologies Co., Ltd.
All other trademarks and trade names mentioned in this document are the property of their respective
holders.

Notice
The purchased products, services and features are stipulated by the contract made between Huawei and
the customer. All or part of the products, services and features described in this document may not
be within the purchase scope or the usage scope. Unless otherwise specified in the contract, all
statements, information, and recommendations in this document are provided "AS IS" without warranties,
guarantees or representations of any kind, either express or implied.
The information in this document is subject to change without notice. Every effort has been made in the
preparation of this document to ensure accuracy of the contents, but all statements, information, and
recommendations in this document do not constitute a warranty of any kind, express or implied.

Huawei Technologies Co., Ltd.


Address: Huawei Industrial Base
Bantian, Longgang
Shenzhen 518129
People's Republic of China

Website: http://e.huawei.com

Issue 01 (2017-02-03) Huawei Proprietary and Confidential i


Copyright © Huawei Technologies Co., Ltd.
Huawei FusionCloud Desktop Solution 6.1
System High Availability Technical White Paper Contents

Contents

1 Huawei FusionCloud Desktop Solution
2 System Availability Specifications
3 System Reliability
3.1 Cabinet
3.2 Server
3.2.1 CPU Reliability
3.2.2 Memory Reliability
3.2.3 Hard Disk Reliability
3.2.4 Supporting Regular Disk On-Line Faulty Detection and Precautions
3.2.5 Power Reliability
3.2.6 System Monitoring
3.2.7 Onboard Software Reliability
3.3 Storage Devices
3.4 Network Devices
3.4.2 NIC Load-Sharing
3.4.3 Switch Stacking
3.4.4 Switch Interconnection Redundancy
3.4.5 Virtual Router Redundancy Protection
3.4.6 Detached-Plane Network Communication
3.5 Cloud Platform HA
3.5.1 Management Node HA
3.5.2 Data Backup for Management Nodes
3.5.3 VM Backup
3.5.4 VM HA
3.5.5 VRM-Independent VM HA Management
3.5.6 VM Fault Detection and Handling
3.5.7 Live Migration of VMs
3.5.8 Storage Migration
3.5.9 VM Load Balancing
3.5.10 Black Box
3.5.11 Data Consistency
3.5.12 Health Check Tool and Information Collection Tool
3.6 FusionAccess Availability
3.6.1 FusionAccess Service HA
3.6.2 FusionAccess Service Monitoring
3.6.3 Desktop Access HA
3.6.4 FusionAccess Management Data Backup
3.6.5 Power-on Recovery Reliability Design
A Glossary


1 Huawei FusionCloud Desktop Solution

The components of the Huawei FusionCloud Desktop Solution, Huawei's desktop cloud product, are
deployed on virtual machines (VMs). Figure 1-1 shows the architecture of the solution.

Figure 1-1 Architecture of Huawei FusionCloud desktop solution


2 System Availability Specifications

• The annual average global VM availability rate reaches 99.9% (as written in the Service Level
Agreement).
This specification indicates the proportion of time during which VMs are available. It is
determined by the reliability and repair capability of the system. For details, see the following
formula:
\[ A = \frac{\mathrm{MTBF}}{\mathrm{MTBF} + \mathrm{MTTR}} \]
Where:
A indicates the availability.
MTBF indicates the mean time between failures.
MTTR indicates the mean time to repair.
For details about how the promised value is achieved, see Chapter 3 "System Reliability."
• Duty Time, 24 x 7
The value 24 x 7 indicates that VMs can provide services around the clock.
• Power Recovery Duration, shorter than two hours
This specification indicates the duration from the time the power to the cloud platform is
resumed to the time all services are recovered.
System software of the cloud platform, including management software and computing
server software, does not need to be loaded sequentially. Loading each server takes less
than 5 minutes. A maximum of 20 servers can be loaded concurrently.
• VM Migration Duration, 3 minutes
This specification indicates the duration from the time the system detects that a VM is
powered off or has broken down to the time the system successfully restarts the VM, or the
duration from the time the system detects that a VM is faulty to the time the system starts
the VM on another server. The duration depends on the startup time of the operating system
(OS) on the VM.
If the system management server does not receive any heartbeat response from a
VM within 40 seconds, it starts the VM on another server. This is the high
availability (HA) process. VM Migration Duration does not include the startup duration
of the VM itself. A lock mechanism prevents VM split-brain.
• Live Migration Duration, 20 seconds for a VM with 1 GB memory
This specification indicates the duration for migrating a VM from one server to another
server without affecting service provisioning.
During the live migration, the virtualization software copies the VM memory to the
destination physical server at a rate of about 1 GB per 20 seconds. It then copies the data
that changed during the previous copy operation to the destination physical server at the
same rate. The process repeats until all the latest data has been migrated to the destination
physical server. Then the new VM is started and the original VM is stopped. This final
switchover takes a few milliseconds, and the user is unaware of the process.
• Thin client (TC) Yearly Failure Rate, lower than 3%
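
The iterative memory copy described under Live Migration Duration can be sketched as a small simulation. This is an illustration only, not the hypervisor's actual algorithm: the copy rate of 1 GB per 20 seconds comes from the specification above, while the dirty rate, the stop threshold, the round limit, and the function name `simulate_precopy` are assumptions chosen for the example.

```python
def simulate_precopy(memory_mb, copy_rate_mb_s=1024 / 20, dirty_rate_mb_s=5.0,
                     stop_threshold_mb=1.0, max_rounds=30):
    """Simulate iterative memory copying during live migration.

    Returns a list of (round_number, megabytes_copied, seconds_taken).
    dirty_rate_mb_s and stop_threshold_mb are assumed values, not product figures.
    """
    if dirty_rate_mb_s >= copy_rate_mb_s:
        raise ValueError("memory is dirtied faster than it can be copied")
    rounds = []
    remaining = float(memory_mb)
    for number in range(1, max_rounds + 1):
        seconds = remaining / copy_rate_mb_s
        rounds.append((number, remaining, seconds))
        # Memory written by the guest during this round must be sent again.
        remaining = dirty_rate_mb_s * seconds
        if remaining <= stop_threshold_mb:
            break  # small enough to move during the final switchover
    return rounds


rounds = simulate_precopy(1024)
total_seconds = sum(seconds for _, _, seconds in rounds)
for number, megabytes, seconds in rounds:
    print(f"round {number}: copied {megabytes:.2f} MB in {seconds:.2f} s")
print(f"total copy time: {total_seconds:.2f} s")
```

Each round has less data to send than the one before, which is why the process converges and only a small remainder is left for the switchover.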


3 System Reliability

3.1 Cabinet
The cabinet has the following reliability characteristics:
• Dual power distribution units (PDUs) with overcurrent protection
• Earthquake-proof up to magnitude 9
• Compliance with NEBS Level 3

3.2 Server
3.2.1 CPU Reliability
Reliability Feature | Concept | Benefit
Core isolation | Disables services running on some CPU cores when the cores are faulty. | Ensures service availability by sacrificing certain performance and recovers system processing capabilities during off-peak hours.
Socket isolation | Starts the active CPU only for services when the standby CPU is faulty. | Ensures service availability by sacrificing certain performance and recovers system processing capabilities during off-peak hours.
PECI-based MCA register access by out-of-band systems | Provides a PECI channel that is decoupled from the active system for out-of-band systems to access MCA registers of CPUs. | Enables out-of-band systems to access MCA registers using the Platform Environment Control Interface (PECI). When an internal error occurs in the PCH and the PECI channel of the ME is unavailable, out-of-band systems use their own PECI channel to access the MCA registers of CPUs, capturing as much fault information as possible for fault locating.
IVR (core/socket voltage detection) | Provides monitoring and alarm functions for the IVR modules integrated into CPUs. | Provides fault monitoring and alarm functions for all modules in CPUs so that risks to system stability can be identified and handled in a timely manner.

3.2.2 Memory Reliability


Memory errors fall into two main types: hardware errors (hard errors) and software errors (soft
errors). Hardware errors are caused by invalid or damaged hardware, which returns incorrect data
continuously. Hardware errors can be detected by memory self-checking during server startup.
Software errors occur more frequently and cannot be detected by memory self-checking. The data
in the memory can only be protected by error checking and correcting (ECC) algorithms.
To correct software errors, the X6000 and E6000 servers use industry-standard ECC technology,
which detects 2-bit memory errors and corrects single-bit memory errors.
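
The "correct single-bit errors, detect 2-bit errors" behavior can be illustrated with a toy extended Hamming code. This sketch is not the servers' ECC implementation: server memory typically protects much wider words, and the 4-bit data width and the function names `encode` and `decode` are assumptions chosen to keep the example short.

```python
def encode(nibble):
    """Encode 4 data bits (0..15) as an 8-bit extended Hamming codeword (list of bits).

    Index 0 holds the overall parity bit; indexes 1..7 are Hamming(7,4) positions.
    """
    data = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = data
    code[1] = code[3] ^ code[5] ^ code[7]   # parity over positions with bit 0 set
    code[2] = code[3] ^ code[6] ^ code[7]   # parity over positions with bit 1 set
    code[4] = code[5] ^ code[6] ^ code[7]   # parity over positions with bit 2 set
    code[0] = sum(code[1:]) % 2             # overall parity enables 2-bit detection
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected', or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for position in range(1, 8):
        if code[position]:
            syndrome ^= position
    parity_error = sum(code) % 2 == 1
    if syndrome == 0 and not parity_error:
        status = "ok"
    elif parity_error:
        code[syndrome] ^= 1      # single-bit error; syndrome 0 means the parity bit itself
        status = "corrected"
    else:
        return "uncorrectable", None   # two bits flipped: detected but not repairable
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble


word = encode(0b1011)
one_flip = list(word)
one_flip[6] ^= 1
two_flips = list(one_flip)
two_flips[2] ^= 1
print(decode(word), decode(one_flip), decode(two_flips))
```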

3.2.3 Hard Disk Reliability


• Hard disk hot-swapping: The servers support hot-swapping of SATA and SAS hard disks
while the system is running.
• Hard disk RAID: The X6000 and E6000 servers support several RAID modes, such as
RAID 0, RAID 1, and RAID 5, and support hot spare disks for RAID groups, ensuring high
reliability of hard disk data. When a hard disk in a RAID group is faulty, the
servers support data restoration, RAID group recovery, and online hard disk
replacement. The RAID card has batteries, which improve hard disk access
performance and protect the data in the cache when power outages occur.

3.2.4 Supporting Regular Disk On-Line Faulty Detection and Precautions
The storage module of the Huawei desktop cloud solution adopts the Self-Monitoring, Analysis
and Reporting Technology (SMART) standard to monitor Advanced Technology Attachment (ATA) and
Small Computer System Interface (SCSI) hard disks, manage and check hard disk reliability, and
predict disk errors. SMART tracks hard disk attributes such as data throughput performance,
motor start time, and error rate. By comparing the attribute values with standard values, it
infers impending hard disk faults and displays a dialog box so that data loss can be avoided.
SMART is widely used to improve hard disk reliability. The key attributes that SMART monitors
include:
• Read error rate
• Start/stop count
• Reallocated sector count
• Spin up retry count
• Drive calibration retry count
• Ultra DMA CRC error rate
• Multi-zone error rate
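
The comparison of attribute values against standard values can be sketched as follows. The attribute names come from the list above; the numeric values, the thresholds, and the names `SmartAttribute` and `failing_attributes` are hypothetical. The sketch assumes the usual SMART convention that a normalized value at or below its threshold marks the attribute as failed.

```python
from dataclasses import dataclass


@dataclass
class SmartAttribute:
    name: str
    value: int       # normalized current value (higher is healthier)
    threshold: int   # failure threshold for this attribute (hypothetical here)


def failing_attributes(attributes):
    """Return the names of attributes whose value has dropped to or below the threshold."""
    return [a.name for a in attributes if a.value <= a.threshold]


disk = [
    SmartAttribute("Read error rate", 100, 50),
    SmartAttribute("Spin up retry count", 90, 97),
    SmartAttribute("Multi-zone error rate", 100, 10),
]
alerts = failing_attributes(disk)
if alerts:
    print("Disk fault predicted, back up data now:", ", ".join(alerts))
```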

3.2.5 Power Reliability


Servers are configured with multiple power supply units (PSUs) and generate
alarms when a PSU fault occurs. PSUs support redundancy and hot swap, which ensures that the
system keeps running when any PSU is faulty. The faulty PSU can be replaced online.

3.2.6 System Monitoring


The system monitors the temperature of key components, such as the CPU and memory.
Combined with intelligent fan speed control and monitoring, this ensures system
reliability.
The system also monitors the running status of key components, such as fans, PSUs, and hard
disks. An alarm is generated when a fault occurs. Devices that support hot swap can be
replaced online. Devices that do not support hot swap must be powered off before
replacement.

3.2.7 Onboard Software Reliability


The BMC software supports dual images. If one image in the flash memory is damaged, the BMC
starts from the other image. This prevents system startup failures.
The BMC software monitors the processes running on the server and restarts the server if a
process stops responding.

3.3 Storage Devices


The FusionStorage distributed storage system is designed for appliance use. FusionStorage
uses local storage resources on the computing nodes to store user data, and adopts redundancy
and distributed cache technologies to ensure data consistency and provide a high-performance
storage solution. In the FusionStorage distributed storage scenario, three copies of the data
are stored, providing 99.9993% availability.
Each IP SAN (a Huawei S5500T system) consists of one controller enclosure and three disk
enclosures, with a maximum of 96 hard disks. The following configurations ensure the
reliability of the storage devices:
• Eight physical links for multipathing
• Two global hot spare disks for each set
• Seven 9+1 RAID 5 groups for each set
Table 3-1 shows IP SAN reliability specifications.

Table 3-1 IP SAN reliability specifications

Scenario | Availability | MTBF (years) | Yearly Outage (minutes per year)
Ten disks configured as RAID 5 | 99.9991% | 12.684 | 4.73


The availability of the IP SAN is 99.9991%, the MTBF is 12.684 years (111110.1 hours), and the
yearly outage is 4.73 minutes.
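
These figures can be cross-checked against the availability relation from Chapter 2, in which availability equals MTBF divided by the sum of MTBF and MTTR. The sketch below assumes a 365-day year. The function names are illustrative, and the implied MTTR is derived here from the published numbers rather than stated in the source.

```python
MINUTES_PER_YEAR = 365 * 24 * 60   # assumes a 365-day year


def yearly_outage_minutes(availability):
    """Expected downtime per year for a given availability (0..1)."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def implied_mttr_hours(availability, mtbf_hours):
    """Solve A = MTBF / (MTBF + MTTR) for MTTR."""
    return mtbf_hours * (1.0 - availability) / availability


ip_san_outage = round(yearly_outage_minutes(0.999991), 2)
ip_san_mttr = round(implied_mttr_hours(0.999991, 111110.1), 2)
vm_outage_hours = round(yearly_outage_minutes(0.999) / 60, 2)

print(ip_san_outage)     # minutes per year for the IP SAN
print(ip_san_mttr)       # hours to repair implied by the IP SAN figures
print(vm_outage_hours)   # hours per year allowed by the 99.9% VM availability target
```

The computed outage matches the 4.73 minutes in Table 3-1, and the published availability and MTBF together imply a repair time of about one hour.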

Both FusionStorage and the IP SAN support core data protection mechanisms, such as data
redundancy protection, power-off protection, background scanning, and data
pre-reconstruction. These mechanisms ensure data security. The user data persistence rate
reaches 99.999%, whereas the data persistence rate for traditional PCs is less than 95%.

3.4 Network Devices


The network subsystem takes five measures to enhance system reliability:

Figure 3-1 Network subsystem

3.4.2 NIC Load-Sharing


As shown in Figure 3-1, the system bonds the multiple NICs provided by a physical server.
This ensures system reliability and load sharing. In bonding mode, multiple NICs are bonded
into one logical NIC and work simultaneously, so the traffic to the server is shared among the
NICs. The load on each NIC is therefore much smaller, and the server can handle more
concurrent access, which ensures stable and quick access to the server. In addition, if one
NIC is faulty, the other NICs take over its load seamlessly. This avoids service interruption
caused by the failure of a single NIC or link.


After multiple NICs on a server are bonded, the system uses the NIC array to expand the
incoming and outgoing bandwidth of the server, implement load balancing, and enhance disaster
tolerance. This avoids the transmission congestion or service interruption that relying on a
single NIC can cause.
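
The combination of load sharing and failover can be modeled with a toy logical NIC. This is a conceptual sketch, not the bonding driver used by the product: the class `Bond`, the interface names, the flow identifiers, and the hash-based distribution policy are all assumptions made for illustration.

```python
import zlib


class Bond:
    """Toy model of a logical NIC that spreads flows over healthy physical NICs."""

    def __init__(self, nics):
        self.link_up = {nic: True for nic in nics}

    def set_link(self, nic, up):
        self.link_up[nic] = up

    def pick_nic(self, flow_id):
        active = [nic for nic, up in self.link_up.items() if up]
        if not active:
            raise RuntimeError("all NICs in the bond are down")
        # A stable hash keeps all packets of one flow on the same NIC.
        return active[zlib.crc32(flow_id.encode()) % len(active)]


bond = Bond(["eth0", "eth1"])
flows = [f"10.0.0.{i}:3389" for i in range(1, 9)]
before = {flow: bond.pick_nic(flow) for flow in flows}

bond.set_link("eth0", False)   # simulate a NIC or link failure
after = {flow: bond.pick_nic(flow) for flow in flows}
print(before)
print(after)
```

After the simulated failure every flow is carried by the remaining NIC, so the logical NIC stays reachable with reduced bandwidth.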

3.4.3 Switch Stacking


Switch stacking connects a set of physical switches through stacking cables or
high-speed uplink ports to form one reliable logical switch. The access switches are stacked
through stacking ports. The switch stack mechanism improves the reliability of switch devices
and allows them to be managed in a centralized manner, which reduces maintenance costs.
After two switches are stacked, they act as a single switch and are presented as one switch
device to peripherals. The two physical switches work in master/slave mode. When one of
them is faulty, the other takes over its services.
The switches are connected in a ring or chain topology, and the stack master is elected by
running the stack management protocol. The stack master is responsible for managing the stack
system, including assigning IDs to stack members, collecting information about the
stack topology, and sending the topology information to stack members. The stack master also
designates the stack slave, which is promoted to stack master to manage the stack system
if the master is faulty.

3.4.4 Switch Interconnection Redundancy


Smart Link, also called backup link, provides a reliable and efficient backup and switchover
solution for dual uplinks, and is usually used in dual uplink networking.
Compared with the Spanning Tree Protocol (STP), Smart Link converges faster. Compared with the
Rapid Ring Protection Protocol (RRPP) and Smart Ethernet Protection (SEP), Smart Link is
simpler to configure.
Dual uplink networking is one of the most common networking modes. It typically uses STP to
block the redundant link and provide a backup. When the master link is faulty, traffic fails
over to the slave link. This meets users' functional requirements for redundancy and backup,
but fails to meet many users' performance requirements, because convergence still takes
seconds even when the rapid transition mechanism of Rapid STP is adopted. This is an
unfavorable performance KPI for high-end Ethernet switches used in carrier-class core
networks.
For these reasons, Huawei FusionCloud introduces the Smart Link solution, which
provides active/standby link redundancy and fast switchover for dual uplink
networking. Smart Link is customized for dual uplink networking, ensures
performance, and simplifies configuration. A port association solution called Monitor
Link is introduced as a complement to Smart Link. It monitors the uplink, improving
the backup function of Smart Link.

3.4.5 Virtual Router Redundancy Protection


The Virtual Router Redundancy Protocol (VRRP) is a fault tolerance protocol. With this
protocol, several routers can be grouped into a virtual router. When the next-hop device of a
host is faulty, services fail over to a backup router in this virtual router without being
interrupted.
VRRP groups a set of routing devices on the LAN into a VRRP backup group, which
is equivalent to a virtual router. The hosts on the LAN only need to know the IP address of
the virtual router instead of the IP addresses of specific devices. After the default gateway
of the hosts is set to the IP address of the virtual router, the hosts can use the virtual
gateway to communicate with external networks.


VRRP dynamically associates the virtual router with the physical device that carries
service traffic. When that device is faulty, a new device is selected to
take over service transmission. The whole process is transparent to users, which maintains
continuous communication between the internal network and external networks.
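
How a backup group selects the device that carries traffic can be sketched as a priority-based election. The rule used here, highest priority first with the higher IP address breaking ties, follows the VRRP specification; the router names, addresses, priorities, and the function `elect_master` are hypothetical, and the sketch omits advertisement timers and preemption settings.

```python
import ipaddress
from dataclasses import dataclass


@dataclass
class Router:
    name: str
    ip: str
    priority: int        # higher priority wins the election
    healthy: bool = True


def elect_master(group):
    """Pick the master of a backup group, or None if no member is healthy."""
    candidates = [r for r in group if r.healthy]
    if not candidates:
        return None
    return max(candidates, key=lambda r: (r.priority, ipaddress.ip_address(r.ip)))


VIRTUAL_GATEWAY = "192.168.1.1"   # the only address hosts need to configure
group = [
    Router("gw-a", "192.168.1.2", priority=120),
    Router("gw-b", "192.168.1.3", priority=100),
]
first_master = elect_master(group).name

group[0].healthy = False           # the current master fails
second_master = elect_master(group).name
print(first_master, "->", second_master, "serving", VIRTUAL_GATEWAY)
```

The hosts keep using the same virtual gateway address throughout, which is what makes the failover transparent to them.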

3.4.6 Detached-Plane Network Communication


The entire cloud computing system is logically divided into three planes: the management
plane, the storage plane, and the service plane. To ensure the data reliability of each
network plane, the FusionCloud solution adopts a detached-plane architecture in which the
planes are separated by virtual local area networks (VLANs). If one plane malfunctions, the
other two planes continue working. For example, when a temporary malfunction occurs on the
management plane, the service plane still works properly and provides services to cloud end
users. In addition, the cloud computing system supports VLAN-based priority settings. Internal
management and control packets are given the highest priority, so administrators and users
can manage and control the system at any time.

3.5 Cloud Platform HA


3.5.1 Management Node HA
The active and standby management nodes of the FusionCloud system use a heartbeat
detection mechanism. The standby node detects the health status of the active
node in real time. When a fault is detected on the active management node, the standby
management node takes over the services of the active node and continues to provide services.
Once started, the Watchdog monitors all the application processes on the service management
node in real time. The Watchdog can detect abnormal process states, such as a
deadlock, and restart the process for recovery. If the process cannot be recovered by the
restart, an active/standby switchover is performed on the service management node and
an active/standby exception alarm is generated to ensure the reliability of the application
process.
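
The two recovery paths described above, heartbeat-triggered takeover and "restart first, switch over only if the restart fails", can be sketched as follows. This is a simplified model under stated assumptions: the node names, the 3-second heartbeat timeout, and the classes `Node` and `Cluster` are hypothetical and do not reflect the product's internal interfaces.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    role: str                    # "active" or "standby"
    last_heartbeat: float = 0.0  # time of the most recent heartbeat, in seconds


@dataclass
class Cluster:
    active: Node
    standby: Node
    heartbeat_timeout: float = 3.0   # assumed value, not a product figure
    alarms: list = field(default_factory=list)

    def switchover(self, reason):
        self.active.role, self.standby.role = "standby", "active"
        self.active, self.standby = self.standby, self.active
        self.alarms.append(f"active/standby switchover: {reason}")

    def check_heartbeat(self, now):
        """Run by the standby node: take over if the active node has gone silent."""
        if now - self.active.last_heartbeat > self.heartbeat_timeout:
            self.switchover("heartbeat lost")

    def on_process_hang(self, restart):
        """Run by the watchdog: restart the process, switch over only if that fails."""
        if not restart():
            self.switchover("process restart failed")


cluster = Cluster(Node("vrm-1", "active"), Node("vrm-2", "standby"))
cluster.on_process_hang(restart=lambda: True)    # restart succeeds, no switchover
cluster.on_process_hang(restart=lambda: False)   # restart fails, vrm-2 takes over
cluster.active.last_heartbeat = 10.0
cluster.check_heartbeat(now=11.0)                # heartbeat is recent, nothing happens
cluster.check_heartbeat(now=20.0)                # heartbeat lost, vrm-1 takes over
print(cluster.active.name, cluster.alarms)
```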


Figure 3-2 Management node HA

The management node is responsible for all the services in the entire system. The node works
in active/standby mode. If the active and standby nodes malfunction simultaneously,
related services, such as VM creation and deletion, also fail. Running VMs, however, are not
affected by the malfunction of the active and standby nodes. Users can continue to use the
applications on their VMs without noticing that a fault has occurred on the management nodes.

3.5.2 Data Backup for Management Nodes


All data on the management nodes is automatically backed up at regular intervals. Even when
both the active and standby management servers are faulty and all data is lost, the data can
be recovered quickly.
The following describes the data recovery process when the active and standby management
servers are faulty and all data is lost:
Step 1 Replace the management server.
Step 2 Reload the management node.
Step 3 Copy the backup data to the management node, and start the management node.
It takes about 30 minutes to recover all the lost data.
----End

3.5.3 VM Backup
The eBackup VM backup scheme uses the Huawei eBackup software and the snapshot
backup function of FusionCompute to back up VM data. Working with
FusionCompute, eBackup backs up the data of a specified VM or a specified volume of the VM
based on specified policies. If VM data is lost or the VM is faulty, the data can be restored
from the backup. Backup data is stored on the virtual disks attached to the eBackup VM or on
external storage devices accessed through the Network File System (NFS) or Common Internet
File System (CIFS).
The eBackup VM backup scheme has the following characteristics:
• Ease of use. Users do not need to install backup proxy software. They only need to create
VM templates and manage VMs on the graphical user interface (GUI).
• The VM-level backup service enables users to configure the full backup policy,
incremental backup policy, backup cycle, backup period, and backup data expiration
policy. Different types of VMs can be configured with different backup policies.
• Efficient backup and restoration. In full backup mode, only valid data is backed up. In
incremental backup mode, only modified data is backed up. This minimizes the backup
traffic and the required backup storage space.
• Concurrent backup and restoration. Each backup device supports 200 VMs and allows
concurrent backup and restoration for eight VMs. Each backup domain supports 10
backup devices. The backup has no impact on production VMs because eBackup
software is deployed on dedicated virtual devices.
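
The difference between full and incremental backup in the list above can be sketched with a toy block model. This is not how eBackup is implemented, since the source does not describe its change tracking: the dictionary of allocated blocks, the content hashing used to find modified blocks, and all function names are assumptions made for illustration. Deleted blocks are ignored for brevity.

```python
import hashlib


def block_digests(disk):
    """Fingerprint every allocated block of a disk given as {block_index: bytes}."""
    return {index: hashlib.sha256(data).hexdigest() for index, data in disk.items()}


def full_backup(disk):
    """Copy only blocks that hold valid data; unallocated blocks are absent from disk."""
    return dict(disk)


def incremental_backup(disk, previous_digests):
    """Copy only blocks that are new or changed since the previous backup."""
    current = block_digests(disk)
    return {index: disk[index] for index, digest in current.items()
            if previous_digests.get(index) != digest}


disk = {0: b"boot", 1: b"user-profile"}
baseline = full_backup(disk)
digests = block_digests(disk)

disk[1] = b"user-profile-v2"   # an existing block is modified
disk[5] = b"new-document"      # a new block is written
increment = incremental_backup(disk, digests)
print(sorted(baseline), sorted(increment))
```

Only the two touched blocks appear in the increment, which is why incremental backups reduce both traffic and storage space.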

3.5.4 VM HA
If a physical CNA server is powered down or restarts abnormally, the system can migrate
the VMs that have high availability (HA) enabled to other computing servers. This ensures
that the VMs can be quickly restored to the normal state.
The FusionCloud solution provides multiple migration strategies. Because thousands of VMs can
run within a cluster, after a computing server is powered down the system distributes its VMs
across different destination servers based on the network traffic status and load of those
servers. This avoids network congestion and destination server overload.
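
Spreading the VMs of a failed server across several destinations can be sketched with a greedy placement. This is one possible strategy, not the product's scheduler: the source does not specify the algorithm, and the single load number per host (standing in for both traffic and load), the host and VM names, and the function `place_vms` are assumptions.

```python
import heapq


def place_vms(vm_loads, host_loads):
    """Assign each VM, largest first, to the host that is currently least loaded."""
    heap = [(load, host) for host, load in host_loads.items()]
    heapq.heapify(heap)
    placement = {}
    for vm, load in sorted(vm_loads.items(), key=lambda item: -item[1]):
        host_load, host = heapq.heappop(heap)
        placement[vm] = host
        heapq.heappush(heap, (host_load + load, host))
    return placement


# VMs that were running on the failed server, with hypothetical load units.
failed_server_vms = {"vm1": 4, "vm2": 3, "vm3": 2, "vm4": 1}
# Current load of the surviving servers, in the same hypothetical units.
survivors = {"cna-2": 10, "cna-3": 12, "cna-4": 11}

placement = place_vms(failed_server_vms, survivors)
print(placement)
```

No single destination receives all the VMs, which mirrors the goal of avoiding congestion and overload on any one server.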


Figure 3-3 VM HA feature

VMs support the high availability feature. If a VM cannot connect to the VRM, the VRM
regards the VM as faulty and issues a command to restart the VM on another computing node,
so the VM fault is recovered automatically. To prevent VM split-brain caused by
incorrect decisions, the system introduces an anti-split-brain lock mechanism.
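
The interplay between fault detection and the anti-split-brain lock can be sketched as follows. The 40-second heartbeat timeout is taken from the specifications in Chapter 2. Everything else is an assumption, because the source does not describe how the lock works: here the lock is modeled as a per-VM ownership record that a second host cannot take while the first host still holds it.

```python
HEARTBEAT_TIMEOUT_S = 40   # from the VM Migration Duration specification


class VmLock:
    """Stand-in for the anti-split-brain lock: at most one owning host per VM."""

    def __init__(self):
        self._owner = {}

    def acquire(self, vm, host):
        return self._owner.setdefault(vm, host) == host

    def release(self, vm, host):
        if self._owner.get(vm) == host:
            del self._owner[vm]


def handle_vm(vm, seconds_since_heartbeat, target_host, lock):
    """Decide what to do with a VM based on its heartbeat age and the lock state."""
    if seconds_since_heartbeat < HEARTBEAT_TIMEOUT_S:
        return "healthy"
    if not lock.acquire(vm, target_host):
        # The original host still owns the VM, so it may still be running there.
        return "not restarted: lock still held"
    return f"restarted on {target_host}"


lock = VmLock()
lock.acquire("desktop-01", "cna-1")

normal = handle_vm("desktop-01", 12, "cna-2", lock)
isolated = handle_vm("desktop-01", 45, "cna-2", lock)   # only the heartbeat path failed
lock.release("desktop-01", "cna-1")                     # the original host is really down
recovered = handle_vm("desktop-01", 45, "cna-2", lock)
print(normal, "|", isolated, "|", recovered)
```

In the second case the heartbeat is lost but the lock is still held, so the VM is not started a second time; this is the incorrect decision that the lock is meant to prevent.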
Management nodes run in active/standby mode and feature high availability (HA).
The active and standby management nodes are deployed on different CNAs in a mutually
exclusive relationship. For example, when the active management node is faulty, the standby
management node becomes the active management node. Meanwhile, the faulty management
node is restarted on another CNA and serves as the standby management node. Therefore, the
active and standby management nodes do not break down simultaneously.

3.5.5 VRM-Independent VM HA Management


FusionSphere supports VRM-independent VM HA management. This function allows
FusionSphere to detect the network heartbeat connection between hosts independently of the
VRM node, so the HA function between hosts takes effect even if the VRM node is faulty.
This function also allows the data stores associated with hosts to detect host status,
which prevents HA misjudgments caused by management network faults. After this function
is enabled, FusionSphere can detect host faults on the service plane and generate alarms
accordingly.

3.5.6 VM Fault Detection and Handling


Most VMs run Windows, which can encounter faults such as the blue screen of death (BSOD).
When a BSOD occurs on a VM, the Huawei cloud platform detects the BSOD information
and automatically restarts the VM. After the VM is restarted, the user only needs to
reconnect to the VM.
The Huawei cloud platform also supports the one-touch disk migration function. When the
operating system (OS) of a user VM breaks down, the user does not need to reinstall the OS,
which avoids the risk of data loss. Instead, the Huawei cloud platform creates a VM with
the same specifications as the faulty one and automatically mounts the data from the faulty
VM to the new VM. The user can log in to the new VM to obtain the data without any other
manual operations.

3.5.7 Live Migration of VMs


The VM is the resource entity through which the cloud platform provides elastic computing
services. To prevent service interruption caused by VM unavailability, the system enables a
VM to migrate without interrupting services, which is called live migration. During the
migration, to keep the memory synchronized, the hypervisor quickly copies the memory
data and migrates the VM to the target host without interrupting the service. Figure 3-4 shows
how the VM migrates to the target host without interrupting services. Because shared storage
resources are used, the data on the VM remains unaltered after the migration.

Figure 3-4 VM live migration

Live migration of VMs can reduce service running costs for the customer. When traffic is
light, services running on several servers can be migrated to fewer servers or a single
server, and the idle servers can then be powered off. This reduces costs, saves energy,
and reduces emissions.


Live migration of VMs can ensure high reliability of the customer system. When a fault
occurs on a running physical machine, you can migrate the services to another properly
running machine before the situation worsens.
The hardware can be upgraded without interrupting services. To do so, migrate all the VMs
on the physical machine to other machines, upgrade the machine, and then migrate the VMs
back. Services are not interrupted during the process.
Currently, the system supports live migration of VMs only in the following application
scenarios:
• Manually migrate the VM to any idle physical server as required.
• Migrate VMs in batches to any idle physical server based on resource utilization.
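
The consolidation scenario in this section (moving VMs onto fewer servers when traffic is light) can be illustrated with a small planning sketch. It is only an assumption about how such a plan might be computed: the function plan_consolidation, the VM names, and the single numeric load per VM are hypothetical, and first-fit decreasing is a generic packing heuristic, not the scheduler documented for the product.

```python
def plan_consolidation(vm_loads, host_capacity):
    """Pack VM loads onto as few hosts as possible (first-fit decreasing).

    vm_loads maps a VM name to its load in arbitrary units (hypothetical).
    Returns a list of hosts, each given as the list of VM names placed on it.
    """
    hosts = []  # each entry is [remaining_capacity, [vm names]]
    # Place the largest VMs first; ties are broken by name for stable output.
    for name, load in sorted(vm_loads.items(), key=lambda kv: (-kv[1], kv[0])):
        if load > host_capacity:
            raise ValueError(f"{name} does not fit on any host")
        for host in hosts:
            if host[0] >= load:
                host[0] -= load
                host[1].append(name)
                break
        else:
            # No existing host has room, so one more host stays powered on.
            hosts.append([host_capacity - load, [name]])
    return [vms for _, vms in hosts]
```

Hosts that do not appear in the returned plan are the candidates for being powered off after their VMs are live migrated.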

3.5.8 Storage Migration


The storage virtualization module on the cloud platform supports storage live migration.
User data must be migrated from one storage device to another in any of the following
scenarios:
• The storage device is being maintained.
• Users have higher requirements on storage performance.
• The existing storage devices cannot meet requirements.
Storage live migration ensures that user data can be migrated from one storage device
to another without affecting VMs.
Figure 3-5 shows the storage migration.


Figure 3-5 Storage migration

3.5.9 VM Load Balancing


When the system works in load balancing mode, it dynamically distributes load based on the
current load of each physical computing server whenever a new VM is started, VMs are live
migrated, or computing nodes are remotely restarted due to faults. This keeps the load of
the physical computing servers in a cluster dynamically balanced.
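
A placement decision of this kind can be sketched as choosing the least-loaded server that still has room for the VM. The function pick_host, the host names, and the percentage-based load figures are hypothetical, and the actual balancing policy of the platform is not described in this document.

```python
def pick_host(host_loads, vm_load, capacity=100.0):
    """Return the least-loaded host that can take vm_load, or None.

    host_loads maps a host name to its current load in percent (assumed unit).
    """
    candidates = {
        host: load for host, load in host_loads.items()
        if load + vm_load <= capacity
    }
    if not candidates:
        return None
    # Prefer the lowest current load; break ties by host name.
    return min(candidates, key=lambda host: (candidates[host], host))
```

For example, with loads of 70, 35, and 50 percent on three hosts, a VM that needs 20 percent is placed on the host currently at 35 percent.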

3.5.10 Black Box


Black box technology is introduced to the management and computing nodes. When the system
runs abnormally or breaks down, the black box automatically saves the kernel logs of the
virtual machine manager (VMM), the system snapshot, the diagnosis information of the
kernel logs, and the last words of the system to a reliable storage device (the computing
node) or sends them to a remote server, such as a TFTP server, using netpoll. After the
system breaks down, this information can be exported for problem analysis and identification.

3.5.11 Data Consistency


The entire cloud system meets the high reliability requirements of the telecom field. More
than 80% of the system code handles various faults. Checkpoint and rollback mechanisms are
adopted to ensure data consistency. The cloud system also has a data auditing mechanism
that audits and clears potential junk data left by faults. The junk data can also be
collected. This prevents data inconsistency due to the junk data and ensures that
services can run properly.
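
The checkpoint and rollback idea can be illustrated with a minimal in-memory sketch. The class CheckpointedStore and the helper apply_steps are hypothetical and only show the general pattern: record a consistent state, apply a multi-step change, and restore the recorded state if any step fails, so that no partial (junk) data remains.

```python
import copy


class CheckpointedStore:
    """Key-value store that can return to its last checkpoint (illustrative)."""

    def __init__(self):
        self._data = {}
        self._checkpoint = None

    def set(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def checkpoint(self):
        # Record a deep copy so later changes cannot alter the saved state.
        self._checkpoint = copy.deepcopy(self._data)

    def rollback(self):
        if self._checkpoint is None:
            raise RuntimeError("no checkpoint recorded")
        self._data = copy.deepcopy(self._checkpoint)


def apply_steps(store, steps):
    """Apply (key, value) steps atomically; a None value simulates a fault."""
    store.checkpoint()
    try:
        for key, value in steps:
            if value is None:
                raise ValueError(f"step for {key} failed")
            store.set(key, value)
    except ValueError:
        store.rollback()  # discard the partial changes
        return False
    return True
```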


3.5.12 Health Check Tool and Information Collection Tool


The system provides a health check tool and an information collection tool to meet the high
reliability requirements of the telecom field.
The health check tool regularly examines system health. It monitors system running
status, alarm and log information, progress status, key configuration information, key
resource usage, and trends in key resource usage to detect performance, security, and
reliability risks. The health check tool also checks the system before and after high-risk
operations and system upgrades to verify system health status.
The information collection tool accurately collects fault-related logs and alarm
information based on fault types. This facilitates fault location and analysis, simplifies
fault information collection, and shortens the service breakdown duration.

3.6 FusionAccess Availability


3.6.1 FusionAccess Service HA
Table 3-2 describes the FusionAccess service software deployment modes. All FusionAccess
services except the license service are deployed redundantly. The license service runs on a
single node because the HDC can cache licenses, so services are not affected within one
month even if the license service fails. When a fault occurs in any service, the system
detects and isolates the fault in a timely manner. In addition to redundant deployment, all
FusionAccess services support local service monitoring. If a service is abnormal, it is
restarted to ensure proper running.
In scenarios where domain controllers, including the AD and LiteAD, are used to authenticate
users, if the domain controllers cannot authenticate users properly, the system can use the
local authentication function of VMs to ensure proper use of desktop cloud services.

Table 3-2 FusionAccess service software deployment modes

Service Name | Function | Deployment Mode | Service Impact of Single Point Failure
WI | Allows users to log in to VMs. | Load balancing | None
HDC | Performs desktop access control. | Load balancing | None
License | Performs license control. | Single-node deployment | Services are not affected within one month even if a single point failure occurs in the license service.
ITA | Supports service provisioning. | Active/Standby | None
vLB | Performs load balancing for WIs. | Active/Standby | None
vAG | Functions as the self-service maintenance login gateway and desktop access gateway. | Load balancing | None
GaussDB | Stores desktop service data. | Active/Standby | None
AD (LiteAD)/DNS/DHCP | Functions as IT infrastructure facilities. | Active/Standby | None
Backup Server | Manages system backup. | Single-node deployment | Services are not affected, and monitoring is implemented.
UNS | Unified domain name service | Load balancing |

3.6.2 FusionAccess Service Monitoring


FusionAccess monitors VDI infrastructure servers in real time. When services (or servers)
malfunction, alarms are centrally displayed on the ITA portal. Guides are provided for
handling each alarm. Different services (or servers) are monitored using different methods.
Services on Linux servers proactively report heartbeats to the ITA. The heartbeat
information carries CPU and memory usage. If the ITA does not receive server heartbeats in
three consecutive cycles, an alarm indicating abnormal services is generated. If the CPU or
memory usage carried in heartbeats exceeds 80%, an alarm is also generated. Windows
servers are monitored by checking service status. Table 3-3 lists the major FusionAccess
alarms.
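
The heartbeat rules above (three consecutive missed cycles, or CPU or memory usage above 80%) can be sketched as follows. The class and field names are hypothetical and the alarm texts are placeholders; only the two thresholds come from this section.

```python
from dataclasses import dataclass
from typing import List, Optional

MISSED_CYCLES_LIMIT = 3      # consecutive cycles without a heartbeat
USAGE_LIMIT_PERCENT = 80.0   # CPU or memory usage threshold


@dataclass
class Heartbeat:
    cpu_percent: float
    memory_percent: float


class ServiceMonitor:
    """Tracks heartbeats of one service as seen by the ITA (illustrative)."""

    def __init__(self, name: str):
        self.name = name
        self.missed = 0

    def on_cycle(self, heartbeat: Optional[Heartbeat]) -> List[str]:
        """Process one reporting cycle; None means no heartbeat arrived."""
        alarms = []
        if heartbeat is None:
            self.missed += 1
            if self.missed >= MISSED_CYCLES_LIMIT:
                alarms.append(f"{self.name}: service abnormal")
            return alarms
        self.missed = 0  # any received heartbeat resets the counter
        if heartbeat.cpu_percent > USAGE_LIMIT_PERCENT:
            alarms.append(f"{self.name}: CPU usage above {USAGE_LIMIT_PERCENT:.0f}%")
        if heartbeat.memory_percent > USAGE_LIMIT_PERCENT:
            alarms.append(f"{self.name}: memory usage above {USAGE_LIMIT_PERCENT:.0f}%")
        return alarms
```

In this sketch the missed-cycle counter is reset by any received heartbeat, so only consecutive misses raise the abnormal-service alarm.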

Table 3-3 FusionAccess alarms


Service Name | Function | Monitoring Method | Remarks
WI | VM login page | Heartbeat |
HDC | Desktop access control | Heartbeat |
License | License control | Heartbeat |
ITA | Service provisioning | Checking service status | Two ITA servers check the service status of each other.
vLB | WI load balancer | Heartbeat |
vAG | Self-maintenance login gateway and desktop access gateway | Heartbeat |
GaussDB | Stores desktop service data. | Triggered by HDC and ITA services. |
AD (LiteAD)/DNS/DHCP | IT infrastructure | The ITA proactively monitors processes. |
Backup Server | Manages and configures data backup. | Monitored by checking backup results. |
UNS | Unified domain name management | Heartbeat |
CPU/Memory/Disk | Monitors all servers, CPUs, memories, and disks of VDI. | Servers periodically report CPU, memory, and disk status. | Ensure that the disk is not full. In the Linux OS, ensure that the number of inodes does not exceed 80% of disk partitions.
Clock synchronization | Synchronizes clocks in a system. | Servers periodically report clock synchronization status. |

3.6.3 Desktop Access HA


The following three methods are used to improve desktop access HA:
• Automatic desktop reconnection
  If a desktop is disconnected due to intermittent network disconnection or other
  causes, clients automatically reconnect to the desktop and users do not need to log in
  again.
• Automatic desktop service port switching
  If desktop service ports are fixed, clients may fail to connect to desktops when
  applications installed by users occupy those ports on user VMs. To avoid such software
  compatibility problems, the Huawei desktop service adopts automatic port switching
  technology to ensure that a client can use another available port when a port is occupied.
• Desktop service process HA
  Desktop service processes running on user VMs can automatically recover after
  abnormal termination (regardless of whether the processes are terminated by users or
  by other processes due to errors).
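
Automatic port switching can be illustrated with a generic fallback sketch: try the preferred port first and move to the next candidate when it is already occupied. The function bind_first_available and the candidate port list are hypothetical. The sketch uses standard TCP sockets and does not reflect the protocol or port numbers of the Huawei desktop service.

```python
import socket


def bind_first_available(preferred_port, fallback_ports, host="127.0.0.1"):
    """Listen on the preferred port, or on the first free fallback port.

    Returns (listening_socket, actual_port). Raises OSError if every
    candidate port is occupied.
    """
    for port in [preferred_port, *fallback_ports]:
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind((host, port))
        except OSError:
            # The port is in use by another application: try the next one.
            sock.close()
            continue
        sock.listen()
        # Report the port that was actually bound so that clients can be
        # told which port to connect to.
        return sock, sock.getsockname()[1]
    raise OSError("no candidate port is available")
```

The service side would then need to inform the client of the port that was actually bound. How that is communicated is outside the scope of this sketch.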


3.6.4 FusionAccess Management Data Backup


Figure 3-6 shows the management data backup.

Figure 3-6 Management data backup

(Figure: the ITA, WI, HDC, License, DB, and AD (LiteAD)/DHCP/DNS nodes each send backup
files to the backup server over FTPS.)

The data on each node is backed up into compressed files at 01:00. The backup files are sent
to the backup server over FTPS. The backup server can be the one in the Huawei VDI solution
or an FTP server provided by the customer. The IP address of the FTP backup server can be
configured on the ITA management page. Only the latest 10 backup copies are retained on the
backup server. If data is damaged, you can download the data from the backup server and
quickly restore it according to the documentation.
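
The retention rule (only the latest 10 copies are kept) can be sketched as a pruning step. The function prune_backups and the file naming scheme are hypothetical; the sketch assumes that backup file names sort chronologically, for example because they embed the date.

```python
MAX_BACKUPS = 10  # number of copies retained on the backup server


def prune_backups(backup_names, keep=MAX_BACKUPS):
    """Split backup names into (kept, deleted), keeping the newest copies.

    Assumes that the names sort in chronological order (hypothetical scheme).
    """
    if keep < 1:
        raise ValueError("keep must be at least 1")
    ordered = sorted(backup_names)
    kept = ordered[-keep:]
    deleted = ordered[:-keep]
    return kept, deleted
```

With 12 daily files, for example, the two oldest are selected for deletion and the 10 most recent remain.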

3.6.5 Power-on Recovery Reliability Design


When the power supply is recovered after an unexpected outage in a data center, the system
can automatically start all servers. Server nodes may start in any order, and the system
still works properly after the power supply recovery in this scenario.


A Glossary

A
AD active directory
ATA advanced technology attachment
B
BIOS basic input/output system
C
CNA Computing Node Agent
D
DB database
DHCP Dynamic Host Configuration Protocol
DNS domain name server
G
GM GalaxManager
I
ITA IT adapter
M
MTBF mean time between failures
MTTR mean time to repair
N
NC network computer
NEBS Network Equipment Building System
P
PDU power distribution unit
R
RAID redundant array of independent disks
RBD reliability block diagram
S
SCSI Small Computer System Interface
T
TFTP Trivial File Transfer Protocol
