Table Of Contents
1. Executive Summary
1.1 Business Case
1.2 Solution Overview
1.3 Key Results
2. Oracle on All-Flash vSAN Reference Architecture
2.1 Purpose
2.2 Scope
2.3 Audience
2.4 Terminology
3. Technology Overview
3.1 Overview
3.2 VMware vSphere 6.0 Update 2
3.3 VMware vSAN 6.2
3.4 VMware vSAN Stretched Cluster
3.5 Oracle Database 12c
4. Solution Configuration
4.1 Overview
4.2 Architecture Diagram
4.3 Hardware Resources
4.4 Software Resources
4.5 Network Configuration
4.6 Oracle Database VM and DB Storage Configuration
5. Solution Validation
5.1 Test Overview
5.2 Test and Performance Data Collection Tools
5.3 vSAN Configurations Used in this Solution
5.4 Oracle OLTP Performance on vSAN
5.5 Oracle DSS Workload on vSAN
5.6 Oracle OLTP and DSS Mixed Workload on vSAN
5.7 vSAN Resiliency
5.8 vSphere vMotion on vSAN
5.9 Oracle Database 12c on vSAN Stretched Cluster
5.10 Oracle Database 12c on vSAN Centralized Management
6. Best Practices of Oracle Database on All-Flash vSAN
6.1 Best Practices of Oracle Database on All-Flash vSAN
7. Conclusion
7.1 Conclusion
8. Reference
8.1 White Paper
8.2 Product Documentation
8.3 Other Documentation
9. Appendix A SLOB Configuration
9.1 Appendix A SLOB Configuration
10. About the Author and Contributors
10.1 About the Author and Contributors
1. Executive Summary
This section covers the Business Case, Solution Overview, and Key Results of
the Oracle Database 12c on VMware vSAN 6.2 All-Flash reference architecture.
1.1 Business Case
With more and more production servers being virtualized, the demand for
highly converged server-based storage is surging. VMware® vSAN™ aims at
providing highly scalable, available, reliable, and high-performance storage
using cost-effective hardware, specifically direct-attached disks in VMware
ESXi™ hosts. vSAN adheres to a new policy-based storage management
paradigm, which simplifies and automates the complex management workflows
that exist in traditional enterprise storage systems with respect to
configuration and clustering.
1.2 Solution Overview
This solution addresses the common business challenges that CIOs face today
in an online transaction processing (OLTP) and decision support system (DSS)
environment that requires predictable performance and cost-effective storage.
The solution helps customers design and implement optimal configurations
specifically for Oracle database on All-Flash vSAN.
1.3 Key Results
2.1 Purpose
2.2 Scope
2.3 Audience
2.4 Terminology
TERM | DEFINITION
Oracle Automatic Storage Management (Oracle ASM) | Oracle ASM is a volume manager and a file system for Oracle database files.
Table 1. Terminology
3. Technology Overview
This section provides an overview of the technologies used in this solution.
3.1 Overview
protection.
QoS (Quality of Service) with IOPS limit: policy-driven QoS limits and
monitoring of IOPS consumed by specific virtual machines, eliminating
noisy neighbor issues and managing performance SLAs.
Software checksum: end-to-end data checksums detect and resolve
silent errors to ensure data integrity; this feature is also policy-driven.
Client Cache: leverages dynamic random access memory (DRAM) local to the
virtual machine's host to accelerate read performance. The amount of
memory allocated is 0.4 percent of total host memory, up to 1GB per
host, and serves the virtual machines local to that host.
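As a quick illustration of the Client Cache sizing rule above, the short sketch below (a hypothetical helper written for this document, not part of the vSAN product or the original paper) computes the per-host DRAM read cache implied by the 0.4 percent / 1GB rule; the host memory sizes are example values only.

# Hypothetical helper: per-host vSAN Client Cache size implied by the
# "0.4 percent of host memory, capped at 1GB" rule described above.
def client_cache_gb(host_memory_gb: float) -> float:
    return min(host_memory_gb * 0.004, 1.0)

# Example values: a 256GB host caps out at 1GB; a 128GB host gets 0.512GB.
print(client_cache_gb(256))   # 1.0
print(client_cache_gb(128))   # 0.512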
With these new features, vSAN 6.2 provides the following advantages:
All-Flash Architecture
All-Flash vSAN aims at delivering extremely high IOPS with predictable low
latencies. In this architecture, two different grades of flash devices are
commonly used: lower-capacity, higher-endurance devices for the cache layer,
and more cost-effective, higher-capacity, lower-endurance devices for the
capacity layer. Writes are performed at the cache layer and then destaged to
the capacity layer only as needed. This helps extend the usable life of the
lower-endurance flash devices in the capacity layer.
Erasure Coding
Erasure coding provides the same levels of redundancy as mirroring, but with a
reduced capacity requirement. In general, erasure coding is a method of taking
data, breaking it into multiple pieces and spreading it across multiple devices,
while adding parity data so it may be recreated in the event that one or more
pieces are corrupted or lost.
RAID 5
In this case, RAID 5 requires four hosts at a minimum because it uses a 3+1
logic. With four hosts, one can fail without data loss. This results in a significant
reduction of required disk capacity. Normally, a 20GB disk would require 40GB
of disk capacity in a mirrored protection, but in the case of RAID 5, the
requirement is only around 27GB.
RAID 6
With RAID 6, two host failures can be tolerated. In the RAID 1 scenario for a
20GB disk, the required disk capacity would be 60GB. However, with RAID 6,
this is just 30GB. Note that the parity is distributed across all hosts and there is
no dedicated parity host. A 4+2 configuration is used in RAID 6, which means
that at least six hosts are required in this configuration.
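To make the capacity arithmetic above concrete, the sketch below (an illustrative helper, not from the original paper) computes the raw capacity required for a virtual disk under each protection scheme; the 20GB example reproduces the 40GB/27GB and 60GB/30GB figures quoted above.

# Illustrative helper: raw vSAN capacity needed to protect a virtual disk
# under the protection schemes discussed in this section.
def raw_capacity_gb(usable_gb: float, scheme: str) -> float:
    overhead = {
        "RAID1_FTT1": 2.0,    # two full copies
        "RAID1_FTT2": 3.0,    # three full copies
        "RAID5": 4.0 / 3.0,   # 3 data + 1 parity segments (3+1)
        "RAID6": 6.0 / 4.0,   # 4 data + 2 parity segments (4+2)
    }
    return usable_gb * overhead[scheme]

for scheme in ("RAID1_FTT1", "RAID5", "RAID1_FTT2", "RAID6"):
    print(scheme, round(raw_capacity_gb(20, scheme), 1))
# RAID1_FTT1 40.0, RAID5 26.7, RAID1_FTT2 60.0, RAID6 30.0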
vSAN 6.1 introduced the Stretched Cluster feature. vSAN Stretched Cluster
provides customers with the ability to deploy a single vSAN Cluster across
multiple data centers. vSAN Stretched Cluster is a specific configuration
implemented in environments where disaster or downtime avoidance is a key
requirement.
vSAN Stretched Cluster builds on the foundation of fault domains. The fault
domain feature introduced rack awareness in vSAN 6.0. The feature allows
customers to group multiple hosts into failure zones across multiple server
racks to ensure that replicas of virtual machine objects are not provisioned
onto the same logical failure zones or server racks. vSAN Stretched Cluster
requires three failure domains based on three sites (two active sites and one
witness site). The witness site is only utilized to host witness virtual appliances
that store witness objects and cluster metadata information and also provide
cluster quorum services during failure events.
Write latency: In a standard vSAN Cluster, mirrored writes incur similar
latency because all copies are written within the same site. In a vSAN
Stretched Cluster, write operations must be prepared at both sites, so every
write traverses the inter-site link and incurs the inter-site latency. The
higher that latency, the longer write operations take to complete.
Read locality: The vSAN Cluster does read operations in a round robin pattern
across the mirrored copies of an object. The Stretched Cluster does all reads
from the single-object copy available at the local site.
Failure: In the event of any failure, recovery traffic needs to originate from the
remote site, which has the only mirrored copy of the object. Thus, all recovery
traffic traverses the inter-site link. In addition, since the local copy of the object
on a failed node is degraded, all reads to that object are redirected to the
remote copy across the inter-site link.
See more information in the VMware vSAN 6.2 Stretched Cluster Guide.
4. Solution Configuration
This section introduces the resources and configurations for the solution
including architecture diagram and hardware & software resources.
4.1 Overview
This section introduces the resources and configurations for the solution
including:
Architecture diagram
Hardware resources
Software resources
Network configuration
Oracle database VM and database storage configuration
4.2 Architecture Diagram
This solution had two architectures: one was vSAN Cluster as shown in Figure
5 and the other was vSAN Stretched Cluster as shown in Figure 25. In the
configuration of vSAN Stretched Cluster, the same servers used for vSAN
Cluster were used but split evenly into two sites.
The key designs for the vSAN Cluster solution for Oracle database were:
A 4-node vSAN Cluster with two vSAN disk groups on each ESXi host.
Each disk group was created from 1 x 800GB SSD (cache) and 3 x
800GB SSDs (capacity).
For the vSAN policies used, see vSAN Configurations Used in this Solution.
Two different VM sizes were used:
Medium VM—4 vCPU and 64GB memory with Oracle SGA set to 53GB
and PGA set to 10GB
Large VM—8 vCPU and 96GB memory with Oracle SGA set to 77GB
and PGA set to 10GB
Oracle Linux 7.0 operating system was used for database VMs.
The following table lists the Oracle VM configurations used in the tests.
TEST | VM CONFIGURATION
OLTP tests | 2 x medium VM, 2 x large VM
DSS test | 1 x medium VM, 1 x large VM
vSAN Stretched Cluster test | 2 x large VM
4.3 Hardware Resources
DESCRIPTION | SPECIFICATION
Storage controller | 1 x 12G SAS Modular RAID Controller
The storage controller used in the reference architecture supports the pass-
through mode. The pass-through mode is the preferred mode for vSAN and it
gives vSAN complete control of the local SSDs attached to the storage
controller.
4.4 Software Resources
VMware vSAN 6.2 | Software-defined storage solution for hyperconverged infrastructure
4.5 Network Configuration
A port group defines properties regarding security, traffic shaping, and NIC
teaming. Jumbo frames (MTU=9,000 bytes) were enabled on the vSAN and
vSphere vMotion interfaces, and the default port group settings were used.
Figure 6 shows the distributed switch port groups created for different
functions and the respective active and standby uplinks used to balance
traffic across the available uplinks. Three port groups were created:
on ESXi vSAN VMkernel ports for routing between different VLANs (sites). The
Linux VM leverages the netem functionality built into Linux to simulate
network latency between the sites. Furthermore, XORP installed on the Linux
VM provided support for multicast traffic between the two vSAN fault domains.
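The exact netem rule used in the lab is not reproduced in this document; the sketch below shows the general form of such a rule, wrapped in Python for consistency with the other examples. The interface name (eth1) and the 5ms delay are assumptions, and the command must run as root on the Linux router VM.

# Hypothetical sketch: add a fixed one-way delay on the router VM's
# inter-site interface using the Linux netem queuing discipline.
import subprocess

def add_intersite_delay(interface: str = "eth1", delay_ms: int = 5) -> None:
    # Equivalent to: tc qdisc add dev eth1 root netem delay 5ms
    subprocess.run(
        ["tc", "qdisc", "add", "dev", interface, "root",
         "netem", "delay", f"{delay_ms}ms"],
        check=True,
    )

if __name__ == "__main__":
    add_intersite_delay()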
For vSAN network design best practices, see VMware vSAN 6.2 Network
Design Guide.
Oracle Single Instance Database VM was installed with Oracle Linux 7.0 and
was configured as follows:
The Oracle ASM data disk group was configured with external redundancy and
an allocation unit size of 1M. The data and redo ASM disk groups were
presented on different PVSCSI controllers.
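The DDL used in the lab is not included in this document; the sketch below (a hypothetical helper with placeholder device paths) generates ASM DDL of the kind described above: an external-redundancy disk group with a 1M allocation unit, since vSAN already provides the data protection.

# Hypothetical helper: build the CREATE DISKGROUP statement for an
# external-redundancy ASM disk group with a 1M allocation unit.
def asm_diskgroup_ddl(name: str, disks: list[str], au_size: str = "1M") -> str:
    disk_list = ", ".join(f"'{d}'" for d in disks)
    return (
        f"CREATE DISKGROUP {name} EXTERNAL REDUNDANCY\n"
        f"  DISK {disk_list}\n"
        f"  ATTRIBUTE 'au_size' = '{au_size}';"
    )

# Placeholder device paths for the four data disks in Table 4.
print(asm_diskgroup_ddl("DATA",
                        [f"/dev/oracleasm/disks/DATA{i}" for i in range(1, 5)]))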
See Oracle best practices in the Best Practices of Oracle Database on All-Flash
vSAN chapter.
Table 4 provides the Oracle VM disk layout and ASM disk group configuration.
NAME | SCSI TYPE | SCSI ID (CONTROLLER, LUN) | SIZE (GB) | ASM DISK GROUP
Operating System (OS) and Oracle binary disk | LSI Logic | SCSI (0:0) | 50 | Not Applicable
Database data disk 1 | Paravirtual | SCSI (1:0) | 100 | DATA
Database data disk 2 | Paravirtual | SCSI (1:1) | 100 | DATA
Database data disk 3 | Paravirtual | SCSI (2:0) | 100 | DATA
Database data disk 4 | Paravirtual | SCSI (2:1) | 100 | DATA
Online redo disk 1 | Paravirtual | SCSI (3:0) | 20 | REDO
Online redo disk 2 | Paravirtual | SCSI (3:1) | 20 | REDO
5. Solution Validation
In this section, we present the test methodologies and processes used in this
reference architecture.
5.1 Test Overview
vSAN performance with OLTP and DSS workloads by using new features
in vSAN 6.2:
Oracle workload testing using SLOB to generate OLTP-like workload.
Oracle workload testing using Swingbench to generate DSS-like
workload.
vSAN resiliency during failures.
vSphere vMotion on vSAN.
vSAN Stretched Cluster for continuous data availability during a site
failure.
vSAN and Oracle health and performance management using VMware
vRealize Operations Manager™.
Two medium VMs (4 vCPU and 64GB memory) and two large VMs (8 vCPU
and 96GB memory).
Each VM was on a separate ESXi host of a 4-node cluster.
One medium VM (4 vCPU and 64GB memory) and one large VM (8 vCPU
and 96GB memory).
Each VM was on a separate ESXi host.
A 350GB Swingbench Sales History schema was created in each
database VM.
The Sales History workload in Swingbench is 100 percent read by default
and is IO throughput intensive.
The default Sales History configuration file with 24 users was used with
the following transactions:
Sales Rollup by Month and Channel
Sales Cube by Month and Channel
Sales Cube by Week and Channel
Product Sales Cube and Rollup by Month
Sales within Quarter by Country
Sales within Week by Country
During this test, four Oracle database VMs were online with one on each ESXi
host. OLTP workloads were on two VMs and DSS workloads were on the other
two VMs.
Mixed Configuration
One medium VM (4 vCPU and 64GB memory) and one large VM (8 vCPU
and 96GB memory) running SLOB with a 300GB database.
One medium VM (4 vCPU and 64GB memory) and one large VM (8 vCPU
and 96GB memory) running Swingbench with a 350GB Sales History
schema.
Each VM was on a separate ESXi host of a 4-node cluster.
The same “SLOB Heavy workload” and Swingbench configurations described
in the previous sections were applied.
We measured three important workload metrics in all tests: I/Os per second
(IOPS), average latency of each IO operation (ms), and IO throughput (MB/s).
IOPS and average latency are the key metrics for OLTP workloads, while IO
throughput is the key metric for DSS workloads.
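The three metrics are related through the average IO size, as the small sketch below illustrates (the 8KB OLTP block size is the Oracle default and the 1MB DSS read size is a typical multiblock read; both are assumptions for illustration, not measurements from this paper).

# Illustrative relationship between IOPS, IO size, and throughput.
def throughput_mbps(iops: float, io_size_kb: float) -> float:
    return iops * io_size_kb / 1024.0

print(throughput_mbps(64_000, 8))    # small OLTP IOs: 500 MB/s
print(throughput_mbps(650, 1024))    # large DSS reads: 650 MB/s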
vSAN Observer
esxtop utility
more information.
Several vSAN 6.2 feature combinations were used during the tests. Table 5
shows the abbreviations used to represent the feature configurations.
NAME | RAID LEVEL | CHECKSUM | DEDUPLICATION AND COMPRESSION (SPACE EFFICIENCY)
R1 | 1 | Disabled | No
R1+C | 1 | Enabled | No
R1+C+SE | 1 | Enabled | Yes
R15+C+SE[4] | Data disks—5, OS disks—5, Redo disks—1 | Enabled | Yes
Unless otherwise specified in the test, the vSAN Cluster was designed with the
following common configuration parameters:
5.4 Oracle OLTP Performance on vSAN
Test Overview
This test focused on extremely heavy Oracle OLTP workload on vSAN. SLOB
was used to stress four Oracle databases concurrently in the vSAN Cluster. A
300GB SLOB database was loaded on each of the four Oracle VMs (two large
VMs and two medium VMs). A VM was placed on every ESXi host in the 4-node
vSAN Cluster as shown in Figure 5. The same workload was running on all four
database VMs for 60 minutes. The baseline tests used R1 vSAN configuration.
Two different SLOB workload tests were run as part of the baseline test:
SLOB Medium workload: the number of users was set to 48 and the think
time frequency to 5, hitting each database with concurrent requests to
generate a sustained OLTP workload.
SLOB Heavy workload: the number of users was set to 128 and the think
time frequency to 0, hitting each database with the maximum number of
concurrent requests to generate an extremely intensive OLTP workload.
We measured the key metrics for the OLTP workload. Figure 9 shows the IOPS
generated by four Oracle database VMs during the test. The IOPS reached a
peak of over 65 thousand with an average IOPS of 64 thousand. There was a
20-minute warm-up period and the performance reached a steady state after
that. Notice the workload was a mix of 75 percent read and 25 percent write
IOPS, which mimicked a transactional database workload.
Latency in an OLTP test is a critical metric of how well the workload is running.
Lower IO latency reduces the time CPU waits for IO completion and improves
application performance. Figure 10 shows the average read latency was 1.3ms
and the average write latency was 6.1ms during IO intensive workload.
Figure 10. vSAN Average Latency during R1 Configuration Test with SLOB
Medium Workload
We measured the key metrics for extremely intensive Oracle OLTP workload.
Figure 11 shows the IOPS generated by four Oracle database VMs during the
test. The IOPS reached a peak of over 100 thousand with an average IOPS of
95 thousand. There was a 20-minute warm-up period and the performance
reached a steady state after that. Notice the workload was a mix of 75 percent
read and 25 percent write IOPS, which mimicked a transactional database
workload.
Figure 11. vSAN IOPS in R1 Configuration Test with SLOB Heavy Workload
Latency was relatively low for this solution considering the extremely heavy
IO generated concurrently by four Oracle databases. Figure 12 shows the
average read latency was 4ms and the average write latency was 17ms during
a peak workload scenario; more realistic real-world database environments
running in steady state will see much lower latencies.
Figure 12. vSAN Average Latency during R1 Configuration Test with SLOB
Heavy Workload
vSAN 6.2 introduced a host of new features like checksum and built-in data
reduction technologies including erasure coding, deduplication and
compression.
Figure 13 shows the IO metrics comparison, while the space efficiency and CPU
utilization comparisons are shown in Figure 14 and Figure 15, respectively.
Figure 13 shows the latency under different vSAN configurations. These
configurations were tested under a peak IO utilization scenario with
concurrent workloads from four OLTP database VMs; typical real-world
environments should see lower IO latencies.
For latency-sensitive applications, use RAID 1 (mirroring) for data and redo
disks; otherwise, use RAID 5 (erasure coding) for data and RAID 1 for redo to
gain space efficiency with a reasonable performance trade-off.
Figure 13. OLTP Workload IO Metrics Comparison with Different vSAN 6.2
Features
Under the Oracle OLTP workload, we observed that the resource overhead (CPU
and memory) caused by the space efficiency and checksum features was
minimal. Figure 15 shows the average ESXi CPU utilization across the tests: it
was between 20 percent and 25.6 percent. In the case of memory, the overhead
was negligible.
Summary
The figures above show various heavy OLTP workload tests with different
vSAN configurations. Table 6 summarizes all the test results.
The IOPS and latency data in the table are from vSAN Observer. Matching
IOPS and latency data was observed from the Linux operating system's iostat
command in each database VM.
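A cross-check of this kind can be scripted; the sketch below (a hypothetical helper, not the tooling used in the paper) samples iostat inside a database VM and sums read/write IOPS across the ASM data devices. It assumes the sysstat iostat with extended statistics; device names and column labels (r/s, w/s) may differ between Linux distributions and sysstat versions.

# Hypothetical cross-check: sum per-device read/write IOPS from iostat.
import subprocess

def device_iops(devices: set[str], interval: int = 5) -> dict[str, float]:
    out = subprocess.run(
        ["iostat", "-x", str(interval), "2"],
        capture_output=True, text=True, check=True,
    ).stdout
    lines = out.splitlines()
    # Use the second (interval) sample: find the last "Device" header line.
    header_idx = max(i for i, line in enumerate(lines) if line.startswith("Device"))
    header = lines[header_idx].split()
    r_col, w_col = header.index("r/s"), header.index("w/s")
    totals = {"read": 0.0, "write": 0.0}
    for line in lines[header_idx + 1:]:
        fields = line.split()
        if fields and fields[0] in devices:
            totals["read"] += float(fields[r_col])
            totals["write"] += float(fields[w_col])
    return totals

print(device_iops({"sdb", "sdc", "sdd", "sde"}))  # device names are assumptions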
VSAN CONFIGURATION | AVERAGE TOTAL IOPS | AVERAGE READ IOPS | AVERAGE WRITE IOPS | AVERAGE READ LATENCY (MS) | AVERAGE WRITE LATENCY (MS) | AVERAGE ESXI CPU UTILIZATION (%) | SPACE EFFICIENCY (%)
5.5 Oracle DSS Workload on vSAN
Test Overview
At the database level, as observed from the AWR reports of both databases,
the combined “physical read total bytes” was 644MB/s.
This test proves that vSAN is a feasible solution not only for OLTP systems but
also for IO throughput-intensive DSS workloads.
5.6 Oracle OLTP and DSS Mixed Workload on vSAN
Test Overview
We used four Oracle database VMs with one VM on each ESXi host. We ran
OLTP workload on a medium VM and a large VM; similarly, we ran DSS
workload on a medium VM and a large VM. Both workloads were running
concurrently for 60 minutes. We used the same configuration as used in the
previous DSS test: R15+C+SE vSAN configuration for the best combination of
performance and storage efficiency. This configuration included all the key
features such as checksum, RAID 5, deduplication and compression.
Another key requirement for OLTP performance is predictable latency so that
transactions are processed quickly. Figure 19 shows the average IO latency
from the OLTP VMs. The read latency was 2.6ms and the write latency was 9ms.
Even though the DSS workload was also running, it had no or minimal impact
on the latency of the OLTP VMs. These results demonstrate that All-Flash vSAN
is an ideal HCI platform for mixed workloads.
5.7 vSAN Resiliency
Test Overview
This section validates vSAN resiliency and impact on Oracle database when
handling disk and host failures. We designed the following scenarios to
emulate potential real-world component failures during OLTP workload.
During this test, an Oracle database on a large VM was used. We ran a baseline
test with a steady OLTP workload for 30 minutes without any failure to
compare with the failure test results. The vSAN configuration used for this
testing was R15+C+SE. This policy provides the benefits of space efficiency
while maintaining the performance level required for business-critical Oracle
database. The same heavy SLOB workload configuration was applied.
We tested a disk failure scenario and a host failure scenario. With the
introduction of deduplication and compression in vSAN 6.2, there is a behavior
change on how a disk failure affects a cluster. In a cluster without
deduplication and compression enabled, a capacity disk failure only affects the
components on the disk. If deduplication and compression is enabled, the
whole disk group is affected and a disk group failure occurs. In a deduplication
and compression enabled cluster, a single disk failure at the capacity layer is
therefore treated as a complete disk group failure, just like a cache-tier disk
failure in a vSAN Cluster.
In this test, vSAN was enabled with deduplication and compression, so a disk
failure in either the cache or capacity tier results in a complete disk group
failure. We used a disk-fault-injection script to generate a permanent disk
failure on a capacity-tier SSD to simulate a disk group failure. The host
selected for this disk failure did not host the Oracle database VM; however,
the failed SSD held components of the data and redo disk objects. The disk
failure was introduced at the 15th minute of the 30-minute test run, and the
impact on IO performance was measured.
Host failure
In this test, one of the ESXi hosts was shut down abruptly using the Cisco UCS
manager during workload to simulate host failure. The host that was powered
down did not run the Oracle database VM. The host failure was introduced at
the 15th minute of the 30-minute test run, and the impact on IO performance
was measured.
In the case of disk group failure, as soon as the permanent disk error was
injected on a disk, the disk group failed. After the failure occurred, vSAN
rebuilt the failed objects to bring them back into compliance with the
protection policy. An average IOPS drop of 14 percent was recorded compared
to the scenario without failure, as shown in Figure 20. However, the virtual
machine objects and components remained accessible and transactions continued.
This disk group failure was caused by a permanent disk error injection, so the
rebuild traffic started immediately. Figure 21 shows the rebuild traffic after
the error was injected: the background rebuild started immediately and
completed after 36 minutes. While the SLOB workload was running, the
background rebuild rate was low at an average of 50.6 MB/s; once the SLOB
workload finished, vSAN increased the rebuild rate to 89.8 MB/s. vSAN
automatically uses this intelligent prioritization and resync/rebuild throttling
mechanism to reduce the impact on production workloads. The rebuild and
resynchronization time depends on the amount of data that needs to be rebuilt
or resynchronized as well as other factors, including the production workload
level and the cluster capacity in terms of compute and disk group configurations.
In the case of host failure, after the host was abruptly powered down, the
Oracle database continued to serve transactions. However, the IO performance
was affected more because both disk groups on the host failed. As shown in
Figure 20, the average IOPS dropped by 28 percent compared to the scenario
without failure. Because there might be cases of host reboot due to
maintenance or upgrade, the rebuild start time is governed by the default
repair delay time, which is 60 minutes. This helps to avoid unnecessary data
rebuild and resynchronization during the planned maintenance of hosts. The
default repair delay value can be modified. See the VMware Knowledge Base
Article 2075456 for steps to change it.
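For reference, the sketch below shows one way to script that change (a hypothetical example, not part of the paper): it assumes the advanced option is exposed as /VSAN/ClomRepairDelay, as described in KB 2075456, and that it runs directly on an ESXi host where both Python and esxcli are available. Follow the KB for the complete, supported procedure, including any required service restarts.

# Hypothetical example: change the vSAN object repair delay on this host.
import subprocess

def set_repair_delay(minutes: int = 60) -> None:
    # Assumed option name per KB 2075456; 60 minutes is the default value.
    subprocess.run(
        ["esxcli", "system", "settings", "advanced", "set",
         "-o", "/VSAN/ClomRepairDelay", "-i", str(minutes)],
        check=True,
    )

if __name__ == "__main__":
    set_repair_delay(60)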
Figure 22 shows the IO latency based on Oracle disk type. The write latency of
redo disk with disk group failure was slightly higher due to rebuild traffic in the
background. In the case of the host failure, there was no immediate rebuild
and thus no impact on write latency. However, the data disk read latency
increased due to the failure of two disk groups. The variation in IO latency
with and without failure was minimal during these failure tests. None of the
tests reported IO errors in the Linux VM or Oracle user-session disconnections,
which demonstrates the resiliency of vSAN during component failures.
Figure 22. Oracle Disk Latency Comparison with and without Failure
5.8 vSphere vMotion on vSAN
Test Overview
vSphere vMotion live migration allows moving an entire virtual machine from
one physical server to another without downtime. We used this feature to
migrate an Oracle database instance seamlessly between the ESXi hosts in a
vSAN Cluster.
In this test, we used vSphere vMotion to migrate one of the Oracle database
VMs (a large VM) from one ESXi host to another (ESXi 2 to ESXi 3) in the vSAN
Cluster, as shown in Figure 23. The migration was initiated while a SLOB OLTP
workload was running against the database. We used the R15+C+SE vSAN
configuration and the test ran for 30 minutes. The same heavy SLOB workload
configuration was applied.
While SLOB was generating the OLTP workload, vMotion was initiated to
migrate the Oracle database VM from ESXi 2 to ESXi 3 in the vSAN Cluster. The
migration took around 201 seconds. During vMotion, there was a momentary
reduction in IOPS during the last phase of the migration, and the IOPS
returned to the normal level afterwards. As shown in Figure 24, we compared
the result of the vMotion test with a similar test without any vMotion
operation for the same duration of time. The average IOPS without vMotion
was 54,708 and it was 49,563 with one vMotion operation, an average IOPS
drop of about 9 percent. This test demonstrated the mobility of Oracle
database VMs deployed in a vSAN Cluster using vMotion.
5.9 Oracle Database 12c on vSAN Stretched Cluster
We set up a vSAN Stretched Cluster using five ESXi hosts: two at site A, two at
site B, and a Witness ESXi Appliance at Site C. We used two Oracle database
large VMs: one VM at site A and the other VM at site B as shown in Figure 25.
We used R1+C+SE configuration in this test, which had checksum,
deduplication and compression enabled. Erasure coding is not supported on
vSAN Stretched Cluster because it needs four fault domains for RAID 5 while
there are only three fault domains in the vSAN Stretched Cluster configuration.
This test demonstrated one of the powerful features of the vSAN Stretched
Cluster: maintaining data availability even under the impact of a complete site
failure.
In this test, we used two Oracle database large VMs: VM1 at site A and VM2 at
site B. We used SLOB to generate OLTP workload in both database VMs
concurrently for a period of 60 minutes. After 20 minutes, site A was failed by
powering off both ESXi hosts at the site as shown in Figure 27.
After site A went down, the database VM at site A (VM1) was offline
temporarily before it was restarted at site B. However, the database VM at
site B (VM2) continued processing transactions. The site outage did not affect
data availability because a copy of all the data from site A existed at site B.
The IOPS on the vSAN Stretched Cluster during this test is shown in Figure 28.
At the beginning, there was no failure and the Oracle workload was running in
both VMs. After the test had run for 20 minutes, the site A failure temporarily
took VM1 offline. From the 20th to the 29th minute, the workload in the
cluster came only from VM2 at site B, during which time VM1 was restarted at
site B by vSphere HA. VMware vSphere Distributed Resource Scheduler™
(DRS) placed VM1 on the ESXi host at site B that did not have any workload.
Subsequently, the SLOB workload was resumed in VM1 and both VMs continued
with their workloads.
Figure 29 shows the average IOPS on the cluster before (from 0 to the 20th
minute) and after (from the 30th to the 60th minute) the site failure while the
OLTP workload was running. As shown in Figure 28, the IOPS decreased and
the read latency increased after the site failure because fewer vSAN disk
groups were available due to the two failed hosts at site A. However, the write
latency improved, at the cost of the data no longer being mirrored across
sites for protection during write operations.
After the test was completed, site A was brought back by powering on both
ESXi hosts. The vSAN Stretched Cluster then started resynchronizing site A
with the components that had changed at site B after the failure. The results
demonstrated how vSAN Stretched Cluster provides storage high availability
during a site failure by automating the failover and failback process,
leveraging vSphere HA and vSphere DRS. This proves vSAN Stretched Cluster's
ability to survive a complete site failure in an Oracle database environment.
For information about this Oracle Extended RAC on vSAN Stretched Cluster
solution, see Oracle Real Application Clusters on VMware vSAN.
Figure 29. IO Metrics Comparison before and after Site Failure in the Stretched
Cluster
Performance after a site failure depends on adequate resources, such as CPU
and memory, being available on the surviving site to accommodate the virtual
machines restarted there by vSphere HA.
In the event of a site failure and subsequent recovery, vSAN will wait for some
additional time for all hosts to become ready at the failed site before it starts
to synchronize components. This avoids repeatedly resynchronizing a large
amount of data across the sites. Therefore, instead of bringing up the failed
vSAN hosts in a staggered fashion, it is recommended to bring all hosts online
approximately at the same time. After the site is recovered, it is also
recommended to wait for the recovery traffic to complete before migrating
virtual machines back to the recovered site. To enforce this, change the
vSphere DRS policy from fully automated to partially automated in the event of
a site failure.
5.10 Oracle Database 12c on vSAN Centralized Management
This section shows some of the ready-to-use dashboards available for health
and performance monitoring and troubleshooting of vSAN and Oracle database.
We recorded the dashboard views after an OLTP workload was run on an
Oracle database backed by an All-Flash vSAN Cluster.
vRealize Operations Manager provides global visibility across vSAN Clusters
for monitoring, with proactive alerts and notifications on an ongoing basis. As
shown in Figure 30, this view provides the alerts generated from vSAN and
Oracle Database 12c. During a heavy OLTP workload, notice the warning alerts
from Oracle for “High executions” and “High Redo Generated”.
Figure 30. Dashboard View Showing the Health and Performance Alerts in a
Centralized Pane
Figure 31 shows the VM level IO metrics during the two-hour OLTP run. During
the same period, Figure 32 shows the database-level IOPS. The average IOPS
was 60 thousand at the database level and VM level during this period of time.
This end-to-end view and correlation can help identify key trends and
troubleshoot bottlenecks.
Figure 33 illustrates the virtual machine level IO latency during the OLTP
workload. The average write latency was 8ms and the average read latency
was 1.5ms.
6. Best Practices of Oracle Database on All-Flash vSAN
The vSAN 6.2 Design and Sizing Guide provides a comprehensive set of
guidelines for designing vSAN. A few key guidelines relevant to Oracle
database are provided below:
7. Conclusion
This section provides a summary of how the reference architecture validates
vSAN as an HCI platform capable of delivering scalability and high performance
to Oracle database environments.
7.1 Conclusion
In this solution, we ran an extremely heavy Oracle workload with the space
efficiency and data integrity features enabled and demonstrated that vSAN
provided excellent performance with minor resource overhead, while
significantly lowering the TCO of the solution. The mixed workload test results
validated vSAN as a viable platform for running both OLTP and DSS Oracle
workloads together.
8. Reference
This section lists the relevant references used for this document.
8.1 White Paper
8.2 Product Documentation
8.3 Other Documentation
DSS Benchmarking
SLOB Resources
9. Appendix A SLOB Configuration
The following is the SLOB configuration file we used for the SLOB Medium
workload testing:
UPDATE_PCT=25
RUN_TIME=3600
WORK_LOOP=0
SCALE=210000
WORK_UNIT=64
REDO_STRESS=LITE
LOAD_PARALLEL_DEGREE=4
THREADS_PER_SCHEMA=1
# Settings for SQL*Net connectivity:
#ADMIN_SQLNET_SERVICE=slob
#SQLNET_SERVICE_BASE=slob
#SQLNET_SERVICE_MAX=2
#SYSDBA_PASSWD=change_on_install
#########################
#### Advanced settings:
#
# The following are Hot Spot related parameters.
# By default Hot Spot functionality is disabled (DO_HOTSPOT=
FALSE).
#
DO_HOTSPOT=FALSE
HOTSPOT_MB=8
HOTSPOT_OFFSET_MB=16
HOTSPOT_FREQUENCY=3
#
# The following controls operations on Hot Schema
# Default Value: 0. Default setting disables Hot Schema
#
HOT_SCHEMA_FREQUENCY=0
# The following parameters control think time between SLOB
# operations (SQL Executions).
# Setting the frequency to 0 disables think time.
#
THINK_TM_FREQUENCY=5
THINK_TM_MIN=.1
THINK_TM_MAX=.5
#########################
The following is the command we used to start SLOB workload with 48 users:
“/home/oracle/SLOB/runit.sh 48”
The following is the SLOB configuration file we used for the SLOB Heavy
workload testing:
UPDATE_PCT=25
RUN_TIME=3600
WORK_LOOP=0
SCALE=210000
WORK_UNIT=64
REDO_STRESS=LITE
LOAD_PARALLEL_DEGREE=4
THREADS_PER_SCHEMA=1
# Settings for SQL*Net connectivity:
#ADMIN_SQLNET_SERVICE=slob
#SQLNET_SERVICE_BASE=slob
#SQLNET_SERVICE_MAX=2
#SYSDBA_PASSWD=change_on_install
#########################
#### Advanced settings:
#
# The following are Hot Spot related parameters.
# By default Hot Spot functionality is disabled (DO_HOTSPOT=
FALSE).
#
DO_HOTSPOT=FALSE
HOTSPOT_MB=8
HOTSPOT_OFFSET_MB=16
HOTSPOT_FREQUENCY=3
#
# The following controls operations on Hot Schema
# Default Value: 0. Default setting disables Hot Schema
#
HOT_SCHEMA_FREQUENCY=0
# The following parameters control think time between SLOB
# operations (SQL Executions).
# Setting the frequency to 0 disables think time.
#
THINK_TM_FREQUENCY=0
THINK_TM_MIN=.1
THINK_TM_MAX=.5
#########################
export UPDATE_PCT RUN_TIME WORK_LOOP SCALE WORK_UNIT LOAD_PA
RALLEL_DEGREE REDO_STRESS
The following is the command we used to start SLOB workload with 128
users:
“/home/oracle/SLOB/runit.sh 128”