Cloud computing performance
A Bitcurrent study on the performance of cloud computing platforms
June 2010
Executive Summary
Cloud computing is a significant shift in the way companies build and run IT resources. It promises pay-as-you-go economics and elastic capacity. Every major change in IT forces IT professionals to rebalance their application strategy: just look at client-server computing, the web, or mobile devices. Today, cloud computing is prompting a similar reconsideration of IT strategy. But it's still early days for clouds. Many enterprises are skeptical of on-demand computing, because it forces them to relinquish control over the underlying networks and architectures on which their applications run.

In late 2009, performance monitoring firm Webmetrics approached us to write a study on cloud performance. We decided to assess several cloud platforms, across several dimensions, using Webmetrics' testing services to collect performance data. Over the course of several months, we created test agents for five leading cloud providers that would measure network, CPU, and I/O constraints. We also analyzed five companies' sites running on each of the five clouds. As you might imagine, this resulted in a considerable amount of data, which we then processed and browsed for informative patterns that would help us understand the performance and capacity of these platforms. This report is the result of that effort.

Testing across several platforms is by its very nature imprecise. Different clouds require different programming techniques, so no two test agents were alike. Some clouds use large-scale data storage that's optimized for quick retrieval; others rely on traditional databases. As a result, the data in this report should serve only as a guideline for further testing: your mileage will vary greatly.
Acknowledgements
First and foremost, this research would not have been possible without Webmetrics/Neustar. The company funded the development of the testing agents, allowed us to use their systems for data collection, and underwrote the cost of generating the study. They did so without constraints or editorial input, in the hopes of contributing to ongoing discussions and industry dialogue about cloud performance. This kind of altruism is rare, and we're grateful for their assistance.

This research was a team effort. Eric Packman and Pete Taylor worked hard to develop, deploy, and maintain multiple software agents across several clouds, and to collect and analyze the data, looking for anomalies and addressing monitoring issues. Lenny Rachitsky and Shirin Rejali of Webmetrics supported the idea of independent research that would help the IT community, and funded this work. Sean Power continues to be a great colleague and co-author on all things to do with web monitoring. Ian Rae, Dan Koffler, and the team at Syntenic offered first-hand cloud experience and assisted with analysis and early feedback. Finally, cloud experts, including Shlomo Swidler, Randy Bias, and Jeremy Edberg, and many other end users provided invaluable insight.
Contents
Executive Summary
Acknowledgements
Contents
The state of web performance
    Elements of latency
    A shift towards composed designs
    The reasons performance matters
The state of cloud computing
    What do we mean by clouds?
    The uncertainty of a shared resource
    The problem with computing far away
    Cloud architectures
The big questions
Testing methodology
Test limitations
Test results
    Real website tests: high-level metrics
    Agent tests: high-level metrics
    Performance histograms
    The performance of individual clouds
    How do different clouds handle workloads?
Noteworthy observations
Conclusions
Further research and reading
    Peter Van Eijk
    Cloudstatus
    Cloudharmony
    Alan Williamson and Cloudkick
    The Bitsource
Cloud test agent code
    Simple objects
    CPU test
    I/O test
The state of web performance

This report focuses primarily on the performance of cloud computing. While website performance optimization has come a long way from the early days of static sites, there are still many reasons that applications are slow.
Elements of latency

Web latency comes from four main factors:

Service discovery involves finding the website. This is usually a Domain Name Service (DNS) lookup in which the URL of the website is resolved to an IP address. There may be other delays in this process, for example, when a site redirects the client to another site. Clouds change how lookups happen, because they may redirect visitors to different destinations.

Network latency is the time spent travelling across a network. This is a function of two basic things: the round-trip time between the browser and the web server, and the number of round-trips required to load a page. A page with few objects will take far less time to load, even over a slower link, than a page with many components on it.

Processing latency, or host time, is the work the server has to do when preparing content for the browser. This is the primary focus of our study, since cloud computing changes how the server works. Host latency may come from simply responding to a request; from computationally intensive calculations; from retrieving data from other sources such as a database; or from connecting to back-end systems behind the server itself.

Client-side latency is the time the browser takes to assemble and present the web page content. While client-side latency is an increasingly important component of web monitoring in modern websites, clients don't care (much) whether the server is a physical machine or a cloud, so we won't concentrate on this kind of delay in this report.
What's more, in a mashup site that contains embedded components from elsewhere, such as JavaScript for analytics or third-party embedded DIV tags, the browser must retrieve those additional components from other sources.
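To make these categories concrete, here is a minimal sketch of our own (it is not one of the study's test agents) that splits a single request's latency into service discovery, network, and host time using PHP's cURL timing counters. The URL is purely illustrative, and client-side latency is not captured because it happens in the browser.

<?php
// Illustrative probe: break one request's latency into the components
// described above. The URL is a placeholder, not a tested site.
$ch = curl_init('http://www.example.com/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_exec($ch);

$dns     = curl_getinfo($ch, CURLINFO_NAMELOOKUP_TIME);        // service discovery (DNS)
$network = curl_getinfo($ch, CURLINFO_CONNECT_TIME) - $dns;    // network time to establish the connection
$host    = curl_getinfo($ch, CURLINFO_STARTTRANSFER_TIME)
         - curl_getinfo($ch, CURLINFO_PRETRANSFER_TIME);       // roughly the server's processing time
$total   = curl_getinfo($ch, CURLINFO_TOTAL_TIME);

printf("DNS %.3fs, connect %.3fs, host %.3fs, total %.3fs\n",
       $dns, $network, $host, $total);
curl_close($ch);
?>

The host figure is approximate, since it still includes one network round trip, but it illustrates how the components of a single page request can be teased apart.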
A shift towards composed designs

Clouds achieve economies of scale through automation, and as a result are cheaper than dedicated infrastructure for many applications. A consequence of these technologies and business models is the emergence of the developer as the central role in IT. Today's developer can spin up machines easily, without long purchase cycles or the burden of administration. She can experiment with several versions of the application for only a few pennies an hour. And she can scale up and down with demand. At the root of this is the idea of a composed design: an application that consists of many sub-applications stitched together. An application might rely on a virtual machine as a computing resource, but use another service to store large objects; another to queue messages for processing; and another to quickly store and retrieve key-value pair data. This is a different way of building applications from earlier, more monolithic approaches.
The reasons performance matters

Regardless of what kind of site you're running, you should care about performance. If you're considering cloud computing, you need to know how that move will affect your business.
The state of cloud computing

The term "cloud computing" has only been around a few years, and there's been a lot of confusion around what a cloud is, and isn't. Given the huge "it just works" appeal of clouds, many providers and vendors have clothed themselves in the cloud mantle in the hopes of revitalizing their web-based applications.
In other words, for this study we're looking exclusively at public PaaS and IaaS clouds.
Google's computing system, which allows very fast searches across the entire Internet. Indeed, once our data was in Bigtable, queries were very fast. Humans also expect responsiveness. If you're building an internal application for people in your building, a cloud may introduce delay simply because it's not in the same building as your users. On the other hand, if you're building an application for users throughout the world, then Google's App Engine has over 20 points of presence and may speed things up; similarly, Amazon's Cloudfront offering can speed up the delivery of static content to your users.
Cloud architectures

The application you're building will dictate the cloud model you adopt:

If you're building a simple three-tiered website, clouds are a simple choice, and you can launch a pre-configured LAMP stack in an IaaS environment.

If your site experiences spiky traffic, and you're willing to edit your code, you might consider a PaaS model to handle scale and pay only for CPU cycles.

If you're doing data mining across a large data set, then you'll want a framework like Hadoop that can process things in parallel.

If you want a messaging platform, then you may want a message queue service.

If you're trying to broadcast media to many destinations, you'll likely involve a content delivery network.

In the end, you have architectural control over clouds, but it's at a much higher level. Rather than worrying about which processes are running on an individual server, you're selecting which clouds and cloud services to use as you assemble your application.
The big questions

For our research, we want to answer a few basic questions about clouds:

How do different clouds perform for web users?

What are the differences across clouds for specific functions: network delivery, computation, and back-end I/O?

How much is one cloud user affected by its neighbors?

How do IaaS and PaaS clouds vary in performance?

How do different platforms handle spikes in concurrent requests?

How variable, or predictable, are specific cloud platforms?
Testing methodology

To answer these questions, we chose five cloud platforms: three IaaS and two PaaS. For each, we created a test application that could exercise the various elements of latency in which we were interested:

A one-pixel GIF, to test raw response time and caching.

A 2-MByte object, to test network throughput and congestion.

A CPU-intensive task (repeatedly calculating a sine function) that would consume processing capacity.

An I/O-intensive task (searching through a database for a specific record, then clearing the cache) to measure back-end systems and resource contention.
We also chose five companies whose web businesses used each cloud platform, according to the following criteria:

They had to be reputable, real-world companies rather than experiments or personal sites.

The sites had to be dynamic, rather than static.

They had to be coded; that is, not simply parameterized versions of turnkey SaaS applications that had been skinned to match a particular brand.
Despite the popularity of cloud computing, this wasn't as easy as it sounds. Many websites that claim to be running in the cloud are actually customer- or partner-portal front-ends for SaaS applications, while others are little more than static marketing websites. For each site, we tested a single object (usually a one-pixel GIF or a favicon.ico image) and a full page load. This meant that each cloud had fourteen tests: five real sites tested twice, and four agent tests.
These tests were run from multiple locations worldwide using Webmetrics' testing service. We generated the analysis of performance from both Webmetrics reports and post-processing of the logfiles generated by the Webmetrics service. The goal of this research isn't to recommend a particular cloud; indeed, our research shows that different platforms work better for different application types. We've named the five cloud platforms on which we tested, but we've hidden the names of the production applications that were running on each cloud.
Test limitations
First and foremost: this is not a scientific, repeatable test. The very nature of clouds is that they're inconsistent and multitenant, and subject to seasonal and daily variances. Consider:

There are huge differences between the ways data is stored in cloud platforms. Some use databases, others use massively parallel data storage.

Some clouds are busy, and we're sharing machines with others; on other clouds, we may have the luxury of a machine to ourselves. We don't know.

The way a particular function (such as a sine calculation) is performed varies from cloud to cloud.

The physical machine on which a virtual machine is running will vary from test to test.

Factors beyond the cloud provider's control, such as Internet congestion, DDOS attacks, and problems with DNS, may change the results.
Much of the delay within an application may be the result of something the developer does; we saw several tested sites change performance, and go down, during the test.

We purposely switched between sequential and simultaneous test modes during the data collection period in order to understand the impact of contention on resources. This is a reasonable simulation of a website that's facing changing traffic patterns: sometimes, visits are spread out; other times, they happen all at once.

PaaS platforms artificially limit the number of instructions we can run in a period of time, which means we had to reduce the number of sine calculations performed, so it's not accurate to say that PaaS is faster than IaaS from this data, because the PaaS platform is actually doing far less work. In fact, when we tried the same number of computations on PaaS and IaaS platforms, the IaaS platform was so much faster that the results weren't easily measurable.
Nevertheless, the results are valuable precisely because they show the variance and uncertainty, and help us understand what to look for across clouds.
Test results
We'll first look at high-level results across platforms and test types, then look at individual test types, individual clouds, and trends among websites running on each cloud.
Figure: Histogram of test delay in seconds for Terremark, Amazon, Rackspace, Google, and Salesforce
Here's a data table of average load times across all five clouds for a two-day period in early April.

Table: Average load times, in seconds, for the 1-pixel GIF (light test), CPU test, 2-MByte GIF (heavy test), and I/O test on Salesforce, Google, Rackspace, Amazon, and Terremark

Averages can be misleading, however, as some clouds experienced far greater latency. Here's the median latency:
Figure: Median latency, in seconds, of the I/O, CPU, 2-MByte GIF, and 1-pixel GIF tests on each cloud
Sometimes variance is as important as latency, however, so we looked at the standard deviation across each cloud test:
Figure: Standard deviation of latency, in seconds, for each test on each cloud
Note that IaaS clouds consistently show greater variance than PaaS clouds; and that, because it's measuring network performance, the 2-MByte GIF test varies significantly when tested from many locations on the Internet. In other words, the variance here is largely due to geography: WAN latency varies the most.
Performance histograms

A better way to understand latency is to look at a histogram of performance, which shows how many tests experienced a particular level of latency. The following chart combines all four kinds of test, since network, server responsiveness, computation, and back-end I/O are all components of an application, to show how each cloud compares.
Figure: Combined performance histogram showing the percent of tests at each delay, in seconds, for Terremark, Amazon, Rackspace, Google, and Salesforce
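For reference, the binning behind histograms like this one is straightforward. The sketch below is our own illustration, assuming a plain-text log with one latency value in seconds per line (which is not the actual Webmetrics log format); it buckets latencies into one-second bins and prints the percentage of tests in each bin.

<?php
// Illustrative histogram binning: count tests per one-second latency bucket.
// 'latencies.log' is an assumed input file, one latency (in seconds) per line.
$latencies = array_map('floatval', file('latencies.log', FILE_IGNORE_NEW_LINES));

$bins = array_fill(0, 11, 0);                 // buckets for 0-1s ... 9-10s, plus 10s+
foreach ($latencies as $t) {
    $bins[min((int)floor($t), 10)]++;
}
foreach ($bins as $sec => $count) {
    $label = ($sec < 10) ? $sec . '-' . ($sec + 1) . 's' : '10s+';
    printf("%-6s %6.2f%%\n", $label, 100 * $count / max(count($latencies), 1));
}
?>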
The performance of individual clouds

Salesforce's Force.com

This is a PaaS cloud featuring its own programming language (Apex). It's designed primarily for extending the Salesforce.com SaaS CRM, and developers have access to their CRM's data structures such as contacts and sales funnels.
Here's the performance histogram for the five sites running on Force.com.
Figure: Performance histogram, percent of tests by delay in seconds, for the five sites running on Force.com
Google App Engine

Google's cloud platform is a PaaS. Developers can use either Java or Python, and are billed for computation used. Their applications must use the Google-specific data store, which allows scalability and fault tolerance. Here's how the five sites running on Google App Engine did during the test:
Here's the performance histogram for the five sites running on App Engine:
Figure: Test-count histogram by delay in seconds for the I/O, CPU, 2-MByte, and 1-pixel tests on Google App Engine
Rackspace Cloud

Rackspace.com is an established managed service provider, offering bare-metal hosting, managed hosting, and IaaS and PaaS clouds. Here's how the five companies whose sites run on Rackspace fared during our test.
Here's the performance histogram of the websites we evaluated. Note that performance is fairly consistent across the test period, but that one site is slow compared to all others and affects results significantly.
Here's the performance histogram for the four test functions on Rackspace.
Figure: Test-count histogram by delay in seconds for the four test functions on Rackspace
The CPU test suffered from two spikes seen in the test period.
Amazon Web Services

Amazon is the clear market leader, and for many technologists their model is synonymous with cloud computing. Most customers use the EC2 virtual machines and S3 storage, but other services aren't as broadly adopted. Here's what the five sites running on Amazon looked like:
Here's the performance histogram for the five sites running on AWS:
Here's the performance histogram for the four tests in the same period:
Figure: Test-count histogram by delay in seconds for the four tests on Amazon Web Services
Terremark vCloud

Unlike most public clouds, which rely on the open-source Xen hypervisor for virtualization, Terremark's cloud offering is based on VMware's virtualization technology. Here's how five Terremark customers fared during the test period:
Figure: Performance histogram, percent of tests by delay in seconds, for the five Terremark customers
How do different clouds handle workloads?

Now let's look at the agents we've created for each cloud. Here's a high-level view of how well the clouds handled each type of test.
Figure: High-level view of delay in seconds for each test type on each cloud
1-pixel GIF

Here are the results for the retrieval of a one-pixel GIF by each of the clouds.
Here's the performance histogram for each cloud. This is a measure of how well the cloud caches static content and responds quickly to requests.
Figure: 1-pixel GIF performance histogram by delay in seconds for each cloud
2-MByte GIF

The second test evaluates network throughput. Here's what it looked like across a month of testing.
Here's the performance histogram for each cloud running the test.
Figure: 2-MByte GIF performance histogram by delay in seconds for each cloud
CPU test

The third test looks at processing capacity, asking the cloud platform to compute 1,000,000 sine and sum operations.
Note that the measurements for Salesforce.com's cloud must not be compared to the others directly; because of governors on the PaaS platform that limit the number of instructions that can be carried out, this cloud only executed 100,000 operations, one tenth of those the other clouds were processing. Here's the performance histogram for the test:
Amazon's processing of the CPU tests was slow, but we chose the smallest virtual machine available from their service catalog; bigger machines would have handled this more quickly. We also looked at error rates for the CPU test:

Errors seen, CPU test (Mar 15 - Apr 15)
             Salesforce   Google    Rackspace   Amazon    Terremark
Uptime       99.96%       99.99%    99.93%      100.00%   100.00%
Errors       11           3         20          0         0
Successes    29106        32212     32470       30371     31426
I/O test

In the fourth test, we search a storage system for a specific string. Here's the result of that process over time; this test yielded the most interesting data.
Note the two spikes in latency within Rackspace; the massive drop in both Terremark and Amazon, followed by an increase; and the gradual growth in delay within Google's App Engine. The drop, and subsequent increase, are caused by switching the testing model from sequential (where tests happen one after another) to simultaneous (where all tests occur at once). Both the Terremark and Amazon IaaS agents become much slower when tests are simultaneous because they're unable to scale automatically; however, Rackspace, which is also an IaaS model, wasn't affected by the change. We'll cover this in more detail below. Here's the performance histogram:
Figure: I/O test performance histogram by delay in seconds for each cloud
The PaaS providers did well, largely because of their shared storage model that is optimized for large data sets across many machines; Rackspace's cloud performed very well but had occasional slow-downs.
Noteworthy observations
As we conducted this research, we saw many anomalies and interesting patterns. Here are a few of them.
Gradual increase in latency on Google App Engine's CPU

Over the course of the test, we saw a significant increase in the CPU latency of Google's App Engine. While Google's service was still faster than other platforms at this specific task, we saw the delay roughly double in this time.
Spikes in Rackspace's CPU performance

Rackspace's CPU tests encountered two significant spikes during the test, when latency jumped from a respectable 2 seconds to three times as much.
A problem with a subset of customers in an IaaS

In this chart, we see several Rackspace Cloud customers experiencing a slow-down at the same time. One becomes completely unavailable, while two others are slow. Our static test agent remains fast throughout the test, suggesting that this is an internal problem affecting several sites simultaneously.
How big a problem is WAN latency?

In this chart, we compare WAN performance for our I/O test. WAN latency is only a fraction of the total delay here. As a whole, there's a roughly two-second difference between the fastest and slowest locations from which we tested the agent.
Spiky I/O performance and outages

In this chart, we see the Rackspace I/O test varying significantly, while one of the Rackspace customers we're watching also goes offline, suggesting that a resource problem on the back end affected us, and also affected an I/O-dependent company with which we shared the cloud.
High variance within Salesforce's cloud

This chart shows performance on the Salesforce cloud. Early on, CPU is very slow; then it recovers, and proceeds to get slower again. During that first spike, other sites occupying the PaaS platform also become spiky and slow.
The impact of availability zones

This data was presented by Eran Shir, CTO of advertising service Dapper, at Web2Expo San Francisco. We include it here to emphasize the issues with availability zones and performance. In Amazon's model, customers can choose an availability zone from which their application will be served. This is partly for compliance reasons, so customers know what legislation applies to their data, and partly to provide separation for failover and SLAs. But all zones are not the same. Consider the latency of an application running in Availability Zone US-East-1A:
In the Eastern zone, performance often spikes to 2 seconds or more; in the Western zone, it's consistently fast. This kind of performance degradation is essential to watch.
WAN latency variance

Here's a good example of performance degradation from one part of the world (the West Coast) that doesn't affect another.
Slow-downs affecting all sites on one day

This chart of February 23 shows many sites running on Salesforce's cloud all slowing down concurrently.
Slow-downs affecting a single site

Degradation doesn't always affect everyone on a cloud. Here are several applications running on App Engine, where only one of them has a problem.
When the whole Internet is slow

Some days, many clouds see a problem, which is likely a result of intermittent issues with the Internet as a whole.
Money spent on clouds

For the duration of the tests, here's how much we spent on each platform.

Cloud                                 Amount
Amazon Web Services (EC2 and EBS)     $127.69
Google App Engine                     $4.71
Rackspace.com                         $61.44
Force.com                             $0.00
Terremark                             $102.87

Note that we didn't hit any of the rate limits for the Salesforce force.com cloud, so we didn't pay for the capacity we used as part of the test, but because of force.com's governors, we were performing only 10% of the CPU operations of the other clouds.
IO contention on Rackspace

During our analysis, we saw wide changes in I/O performance. At one point, 56,356 rsec/s meant only a 74.26% utilization of the system; at another, 584 rsec/s meant 100.00% utilization. This is a clear sign that the system is competing with other processes for resources. Here's the data for the two periods:

          Device  rrqm/s  wrqm/s  r/s     w/s   rsec/s    wsec/s  avgrq-sz  avgqu-sz  await   svctm  %util
Period 1  sda1    0.00    0.00    673.27  0.00  56356.44  0.00    83.71     4.07      6.04    1.10   74.26
Period 2  sda1    0.00    0.00    19.00   0.00  584.00    0.00    30.74     2.95      147.37  52.63  100.00
Simultaneous versus spread-out tests

When we began the testing, we launched several requests from several cities simultaneously. This meant that the test applications had to deal with three or four users at exactly the same moment, and our CPU- and I/O-intensive tests did not deal with this well in an Infrastructure-as-a-Service model.
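For illustration, here is a rough sketch of our own (not the Webmetrics harness) of the difference between the two modes: it issues the same probe URL one request at a time, and then with all requests in flight at once using PHP's curl_multi functions. The URL and request count are placeholders.

<?php
// Illustrative only: contrast the "spread out" (sequential) and simultaneous
// test modes against a single agent URL. The URL below is a placeholder.
$url   = 'http://cloud-agent.example.com/IO.php';
$count = 4;

// Sequential: one request at a time.
for ($i = 1; $i <= $count; $i++) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_exec($ch);
    printf("sequential #%d: %.3fs\n", $i, curl_getinfo($ch, CURLINFO_TOTAL_TIME));
    curl_close($ch);
}

// Simultaneous: all requests in flight at once.
$mh = curl_multi_init();
$handles = array();
for ($i = 0; $i < $count; $i++) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_multi_add_handle($mh, $ch);
    $handles[] = $ch;
}
do {
    curl_multi_exec($mh, $running);
    curl_multi_select($mh);
} while ($running > 0);
foreach ($handles as $i => $ch) {
    printf("simultaneous #%d: %.3fs\n", $i + 1, curl_getinfo($ch, CURLINFO_TOTAL_TIME));
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);
?>

In the simultaneous case, a single small IaaS instance has to serve every request at once, which is where the contention described above shows up.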
Later, we switched testing to spread requests out more evenly (the sequential setting), and the IaaS platforms dealt with the load far better. Consider the following chart, which shows latency of the I/O test on an IaaS cloud before and after the change:
Now consider the same test volume, for the same period, on a PaaS cloud. Remember that PaaS has no notion of a machine: the system scales to handle load automatically.
While there's some variance, it's far less than in the IaaS example above. Here's a graph of three cloud platforms before and after the switch: the two IaaS platforms (Terremark and Amazon) fare far better when the requests are spread out, while the PaaS platform (Google) barely changes.
Here's Rackspace's performance across this period, showing the improvement in I/O latency by spreading the requests out.
Spikes in a single cloud

The following chart shows the I/O performance across all cloud providers. We can see that I/O on one of them has spiked significantly, but that other clouds also seem to have sporadic slow-downs.
Variance within Google App Engine

This chart shows a three-day period during which App Engine's performance varied significantly. While the small image test was relatively consistent in its performance, all three other tests spiked, and there appeared to be correlation between I/O, network, and CPU latency.
Conclusions
First and foremost: there's a lot to watch. Clouds can fail in many unexpected ways; here are some of the lessons we've learned.

Watch your neighbors. We've seen good evidence that several cloud applications slow down at once, so you'll definitely be affected by others using the same cloud as you.

Understand the profile of your cloud. The histograms shown here demonstrate that different clouds are good at different tasks. You'll need to choose the size of your virtual machines, in terms of CPU, memory, and so on, in order to deliver good performance.

You need an agent on the inside. When you plan a monitoring strategy, you need custom code that exercises back-end functions so you can triage problems quickly.

Choose PaaS or IaaS. If you're willing to re-code your application to take advantage of big data systems like Bigtable, you can scale well by choosing a PaaS cloud. On the other hand, if you need individual machines, you'll have to build elasticity into your IaaS configuration.

Big data doesn't come for free. Using large, sharded data stores might seem nice, but it takes time to put the data in there, which may not be appropriate for your application's usage patterns.

Monitor usage and governors. In PaaS, if you exceed your rate limits, your users will get errors.

Troubleshooting gets harder. You need data on the Internet as a whole, your cloud provider as a whole, and your individual application's various tiers, in order to properly triage a problem. When you were running on dedicated infrastructure, you didn't have to spend as much time eliminating third-party issues (such as contention for shared bandwidth or I/O blocking).

PaaS means you're in the same basket. We noticed that if you're using a PaaS, when the cloud gets slow, everyone gets slow. With IaaS, there's more separation of the CPU and the server's responsiveness, but you're still contending for shared storage and network bandwidth.

Watch several zones. When you rely on availability zones to distribute the risk of an outage, you'll also need to deploy additional monitoring to compare those zones to one another.
Further research and reading

There's already a good body of research on cloud performance in the public domain. In preparing this study, we consulted many online resources; here are some of the most relevant ones:
Peter Van Eijk

Figure 2: Peter Van Eijk's research into Amazon Cloudfront performance from NYC
His results were presented at the Computer Measurement Group's annual meeting and are available on Slideshare.1
Cloudstatus

Now part of VMware, Hyperic has a variety of agents that collect cloud-specific metrics across several public clouds.2
Figure 3: A sample of Hyperic's Cloudstatus dashboard
What's notable here is that each cloud has unique things to measure: on Amazon, for example, it might include SimpleDB or SimpleMQ data, while on Google it might include Bigtable lookup times.
Cloudharmony
Over a two-month period, Cloudharmony paid end users to run their testing tool and measure the speed of clouds.3
3 http://blog.cloudharmony.com/2010/02/cloud-speed-test-results.html
The primary focus of the testing was end-user throughput, rather than web page latency, but they were able to test many different platforms. Cloudharmony's client-side monitoring tool can exercise many cloud functions as part of its testing routine.
Alan Williamson and Cloudkick

Alan Williamson published data on oversubscription and contention within an IaaS cloud,4 referencing data gathered by Cloudkick.5 The data indicates increasing congestion within Amazon's resources. These issues seem to plague Amazon's US-East availability zone, something we've heard from several sources.
4 http://alan.blog-city.com/has_amazon_ec2_become_over_subscribed.htm
5 https://www.cloudkick.com/blog/2010/jan/12/visual-ec2-latency/
While network congestion is only one factor that can lead to poor end user experience, it's an important one. As clouds grow in popularity, it's important to remember that they're a shared resource and to ensure that the provider is scaling infrastructure capacity commensurate with demand.
The Bitsource

The Bitsource did an independent comparison of Rackspace and Amazon6 to measure performance within the clouds themselves.
Figure 7: The Bitsource's comparison of compilation time across cloud virtual machines
This comparison focused more on applying traditional benchmarks, such as compilation or I/O operations, across two systems.

6 http://www.thebitsource.com/featured-posts/rackspace-cloud-servers-versus-amazon-ec2-performance-analysis/
Cloud test agent code

Here's some additional information on how the test agents were constructed. As noted above, the construction of the tests varies widely by platform; while we attempted to be consistent across platforms, the results should by no means be used to pick a particular cloud provider without further testing. Most notably, the Force.com CPU tests consisted of 100,000, rather than 1,000,000, operations.
Simple objects

The one-pixel and 2-MByte tests are simply objects retrieved by URL. The one-pixel retrieval does not include any cookies or other information, so it can be retrieved quickly and is primarily a test of network round-trip time and server responsiveness; in many cases, the cloud moved this to a cache. The larger image also tests network throughput and congestion.
CPU test

To test processing resources, we conduct 1,000,000 SIN operations and 1,000,000 SUM operations. The code for the CPU load is the same on all clouds, except that in one PaaS environment (force.com), we reduce this to 50,000 operations in order to comply with the platform's governors; despite this, it's still slower than the IaaS platforms. The code is as follows:
<?
$lcnt = 1000000;
$a = '<h1>CPU Header '.$lcnt.'</h1><br/>';
print($a);

$sinsum = 0;
// compute 1,000,000 sine values
for ($y = 0; $y < $lcnt; ++$y) {
    $temp = $y;
    $x[$y] = sin($temp);
}
// sum the 1,000,000 results
for ($y = 0; $y < $lcnt; ++$y) {
    $sinsum += $x[$y];
}
print($sinsum.' <br/>');
?>
I/O test

In each case, we loaded the available data store (a database in IaaS environments, or the built-in one in a PaaS environment) with 500,000 records. IO.php runs a full table scan of the 500,000 rows, and then flushes the disk buffers to ensure the data isn't cached. In IaaS environments, the code for IO.php is:
<?
$chandle = mysql_pconnect("localhost", "root", "barfly")
    or die("ERROR: Connection Failure to Database");
mysql_select_db("MainObj", $chandle) or die("Database not found.");

$query = "select ip_addr, comment, value from vote where comment like '%OIYDGOAPSL%'";
$result = mysql_db_query("MainObj", $query) or die("ERROR: Failed Query");
$result = mysql_query($query);

echo "<h1>IO</h1><br />";

// Check result
// This shows the actual query sent to MySQL, and the error. Useful for debugging.
if (!$result) {
    $message  = 'ERROR: ' . mysql_error() . "<br />";
    $message .= 'Whole query: ' . $query;
    die($message);
}

while ($row = mysql_fetch_assoc($result)) {
    echo $row['ip_addr'];
    echo "<br />";
}
mysql_free_result($result);

// this is to flush the cache for next time or we're effectively cheating the load Dom0
system("/sbin/fc");
?>
/sbin/fc assumes superuser privileges to flush all disk cache. This was done at the end of the code to prevent other load issues as the OS recovers a little from having the buffers suddenly all marked dirty. This forces the next web hit to search the entire database from disk rather than from cache, so disk I/O happens as though this were the first query ever. The code for fc.c is:
#include <stdio.h>
#include <stdlib.h>   /* system() */
#include <unistd.h>   /* setreuid() */

int main() {
    /* Become root (requires the binary to be setuid root), then drop the page
       cache, dentries, and inodes so the next query must hit disk. */
    setreuid(0, 0);
    system("echo '3' > /proc/sys/vm/drop_caches");
    return 0;
}
To prove this works, use iostat -x 1 and visit the web page IO.php a few times; you'll see all the I/O load anew each time. Rename IO.php and you'll note the entire DB fits into RAM, and there's no I/O to disk anymore.

The code for the PaaS I/O tests varies significantly by PaaS platform. Both Force.com and Google App Engine add visible and invisible data per row. Force.com adds author, creator, and other metadata, such that 524k rows is about 1.1 GB, whereas it's only 190 MB in a MySQL database. While we can't flush the cache in a PaaS cloud, the fact that we're running a full scan of a gigabyte of data in a shared environment suggests that it's less likely to be cached.
Google's "insert slow to query fast" philosophy means that data was inserted at a rate of 31,160 300-byte rows per CPU-hour. We quickly burned through the 6.5 free CPU-hours Google offers its users, and in the end it took nearly 3 days to push all of the test data into App Engine's Bigtable storage. Google App Engine does not support searching for a substring in a string field. As such, the usual search for the 6-byte string found only in row 11692 (which was the basis for the I/O test) is replaced by a search for the entire field (all 255 bytes). This is essentially still a full index scan, as this column is not normally indexed, and it should produce, give or take, about the same aggregate I/O load as all the other tests.