Cisco Press
800 East 96th Street, 3rd Floor
Indianapolis, IN 46240 USA
Feedback Information
At Cisco Press, our goal is to create in-depth technical books of the highest quality and value. Each book is crafted
with care and precision, undergoing rigorous development that involves the unique expertise of members from the
professional technical community.
Readers' feedback is a natural continuation of this process. If you have any comments regarding how we could
improve the quality of this book, or otherwise alter it to better suit your needs, you can contact us through e-mail
at feedback@ciscopress.com. Please make sure to include the book title and ISBN in your message.
We greatly appreciate your assistance.
Trademark Acknowledgments
All terms mentioned in this book that are known to be trademarks or service marks have been appropriately capitalized.
Cisco Press or Cisco Systems, Inc. cannot attest to the accuracy of this information. Use of a term in this book
should not be regarded as affecting the validity of any trademark or service mark.
Publisher  John Wait
Editor-in-Chief  John Kane
Executive Editor  Brett Bartow
Cisco Representative  Anthony Wolfenden
Cisco Press Program Manager  Sonia Torres Chavez
Manager, Marketing Communications, Cisco Systems  Scott Miller
Cisco Marketing Program Manager  Edie Quiroz
Production Manager  Patrick Kanouse
Acquisitions Editor  Michelle Grandin
Development Editor  Jill Batistick
Project Editor  Marc Fowler
Copy Editor  Jill Batistick
Technical Editors  David M. Fishman, John P. Morency, Richard L. Ptak
Team Coordinator  Tammi Barnett
Book Designer  Gina Rexrode
Cover Designer  Louisa Adair
Composition  Mark Shirar
Indexer  Larry Sweazy
In Loving Memory
This book was finished as a final tribute to my late husband, John McConnell.
I hope these words keep his ideas alive in the industry a little longer.
Grace Morlock McConnell
Dedication
John W. McConnell
December 9, 1943 – November 3, 2002
This book is dedicated to my wife, Grace, whose support has been so helpful in carving out the time and quiet
needed for this project. My friends and Grace have also provided a supportive environment and tolerated my
frequent absences to work with clients. Returning home to a warm community has been really important to me.
Acknowledgments
Many people have been part of this process of turning some ideas and experience into a book. First, my thanks to
the Cisco Press team, especially Michelle Grandin. The steady enthusiasm and willingness of all to help are deeply
appreciated.
In the same vein, the technical reviewers have been so helpful. I've had the pleasure of spending good time exchanging
views with John Morency and Rich Ptak at many analyst conferences and other events; their suggestions for this
manuscript were specific and helpful, and in some cases spurred some spirited discussions. Although I've never met
David Fishman face to face, I'd be pleased to buy him a good meal someday as thanks for so many good suggestions
and his attention to detail and integrity in getting it right.
Another group I want to acknowledge is the clients I've worked with around the world. I've gotten to learn a lot
about how technology is actually used and to work with people who want to push the envelope.
Finally, my thanks to my friends and colleagues in the industry who constantly stimulate and challenge me. It's
been a tremendous blessing to be among so many creative and independent thinkers and doers who have shaped the
networking industry.
John McConnell
It's impossible to begin these acknowledgments without wishing that John were still alive. This is his book, not
mine. He conceived it; he drafted it; he should have been writing this page. We all used to joke about how John
towered over the industry, and it wasn't just because of his height. In working from John's drafts to complete the
book, in talking to colleagues about his work, and in remembering the easy, jovial way he talked about examples of
industry practices, I was constantly reminded of his stature and of the friendly way he had. I think I can say, with
confidence, that everyone in the industry truly misses him; I certainly do.
John's wife wanted to see this book come to publication, and Cisco Press went far out of their way to make that happen.
Jill Batistick and Michelle Grandin, the editors, were wonderfully friendly and helpful; they made the process of
working through the chapters almost enjoyable. The technical reviewers, Rich Ptak, John Morency, and David Fishman,
put a tremendous amount of work into the book. They didn't just point out my errors; they suggested corrections
and entire new paragraphs that could improve the text. They were truly partners in bringing the book to publication.
I'd also like to thank Astrid Wasserman of MediaLive International, Inc. (the organizers of Networld+Interop),
who gave me a copy of John's proposed two-day seminar on Service Level Management. Although it was never
presented, the seminar slides gave me a lot of insight into his ideas.
I have tried to stay close to John's original thoughts and text, although I have occasionally succumbed to temptation
and added additional information. Minor additions occur in all chapters; major additions are in Chapter 2
(measurement statistics), Chapter 6 (triage for quick assignment of problems to appropriate diagnostic teams),
Chapter 8 (transaction response time), and Chapter 11 (flash loads and abandonment). Most of the additions are
topics that I had discussed with John at various conferences we attended together; I hope, and believe, that he
would agree with them. In all cases when the author speaks directly to the reader, that author is John.
Eric Siegel
October 14, 2003
Contents at a Glance

Preface xxi

Part I
Chapter 1  Introduction
Chapter 2
Chapter 3

Part II
Chapter 4  Instrumentation
Chapter 5  Event Management
Chapter 6  Real-Time Operations
Chapter 7  Policy-Based Management
Chapter 8
Chapter 9
Contents

Preface xxi

Part I

Chapter 1  Introduction
    E-business Services
        B2B
        B2C
        B2E

Chapter 2
    Workload 18
    Availability 19
    Transaction Failure Rate 20
    Transaction Response Time 20
    File Transfer Time 20
    Stream Quality 20
    Low-Level Technical Metrics
        Latency 22
        Jitter 22
        Server Response Time
    Measurement Granularity
    Measurement Scope

Chapter 3
    Instrumentation Management

Part II

Chapter 4  Instrumentation
    Passive Collection 75
    Active Collection 75
    Trade-Offs Between Passive and Active Collection
    Hybrid Systems 77
    Instrumentation Trends
        Adaptability
        Collaboration

Chapter 5  Event Management
    Basic Event Management Functions: Reducing the Noise and Boosting the Signal
        Volume Reduction
            Roll-Up Method 86
            De-duplication 87
            Intelligent Monitoring
        Artifact Reduction
            Verification 88
            Filtering 89
            Correlation 90
    Business Impact: Integrating Technology and Services
        Top-Down and Bottom-Up Approaches
        Modeling a Service 92
        Care and Feeding Considerations 93
    Prioritization
    Activation
    Coordination

Chapter 6  Real-Time Operations
    Reactive Management
        Triage
        Root-Cause Analysis
        Brownouts 110
        Virtualized Resources 110
        The Value of Good Enough 111
    Proactive Management

Chapter 7  Policy-Based Management
    The Need for Policies
    Policy Distribution
    Policy Design
    Policy Hierarchy
    Policy Attributes
    Policy Auditing

Chapter 8
    Propagation Delay
    Processing Delay

Chapter 9
    Content Distribution

Chapter 10
    Round-Trip Latency
    Jitter
    QoS Technologies
        Tag-Based QoS
    Levels of Control
    Demarcation Points

Part III
    Capacity Planning

Part IV
    Project Benefits
        Soft Benefits
    Event Integration
    Process Integration
Preface
Some years ago I received a true pearl of wisdom from an industry colleague. "To truly understand your
profession," he advised, "you must make the effort to learn other disciplines that are completely different from the
one that you espouse."
That colleague was John McConnell, a man who truly understood this advice by walking the talk over the course of
his life. Born into a military family, John developed a keen understanding of the importance of the global ecosystem at
a very young age through his childhood experiences in both Europe and the Far East. Despite being a shy, scholarly
individual throughout primary and secondary school, John also demonstrated the value of hard work and dedication
by making the varsity rowing team at U.C. Berkeley.
The strong work ethic that John nurtured at Berkeley served him well after he received his master's in computer science
in 1968. What differentiated John from many of his fellow graduates, however, was the application of his craft to
non-IT disciplines after graduation. Some of his first initiatives included the application of computer technology to measure
the rate of solar intensity upon the earth and the development of a programming language that was designed to test the
content and substance of moon samples brought back to earth by the Apollo astronauts. In addition, John developed
a number of network control programs for the ARPANET (the predecessor to today's Internet) in the mid-1970s,
when the state of the commercial data networking industry was in its true infancy.
John also spent a number of years in professional capacities that had very little to do with information technology.
After graduate school, John became an accomplished massage therapist, hypnotist, and practitioner in the art of
Rolfing, a technique for the detection, treatment, and removal of bodily stress and pain. In 1983, using his Rolfing
technique, John was selected to work with the members of the U.S. Olympic bicycling team, and he applied this
technique to aid the team in preparing for the 1984 Olympic games. Recently, when not consulting, John was training
to become an instructor in the Ridhwan Foundation, an institution whose focus is the rediscovery and integration of
the true self into one's own professional and personal life. Over the years, he had a myriad of personal interests,
including soaring, mountain climbing, bird watching, backpacking, rowing, and blues festivals. One of his most
recent and satisfying accomplishments was the design, building, and completion of a second home in southern
Costa Rica that effectively enabled both him and his wife Grace to really get away from it all.
First and foremost, John's professional focus in the IT industry was the advancement of technologies and products
that improved the efficiency and the effectiveness of IT management.
Given his whole life background, John was especially dedicated to reducing the operational and business pain
points associated with IT implementation and management. This focus is reflected in John's prior work,
Internetworking Computer Systems and Managing Client/Server Environments, as well as in Practical Service
Level Management: Delivering High-Quality Web-Based Services. John's numerous publications, conferences, and
televised briefings reflect a focused dedication to the removal of technological barriers to the optimal effectiveness
of IT organizations worldwide. His life experiences as a true Renaissance man uniquely enabled him to both
understand and drive the level of change needed to not only improve the state of the art, but also the quality of life.
John was indeed the gold standard of knowledge, professionalism, and personal integrity that made the pursuit of these
goals not only a logical possibility but, for many of us, a practical reality. The loss of John will be keenly felt for
some time, but the goals and values that he aspired to and embraced will inspire and guide many of us for years to
come.
John Morency, President, Momenta Research
May 2003
Part I: Introduction

Chapter 1: Introduction
The World Wide Web (the Web) is the catalyst for the changes in our communications,
work styles, business processes, and ways of seeking entertainment and information. The
Internet is just the transport infrastructure for the web-based services that drive so much
innovation. Note, however, that the Internet generally gets all the credit. As Thomas
Friedman writes in The Lexus and the Olive Tree:

The Internet is going to be like a huge vise that takes the globalization system that I have described, and
keeps tightening and tightening that system around everyone, in ways that will only make the world smaller
and smaller and faster and faster with each passing day.
This is an accurate description of the environment that most of us deal with directly on a
daily basis. The Internet is a tremendous business engine, and, as it transforms the ways we
do business, it is being transformed in turn by the ways we use it. We must learn how to
manage the growing array of online business services or risk being marginalized by a
faster-moving and more dynamic business environment.
In this introductory chapter, I discuss the following:
E-business Services
E-business is a generic term defining business activities that are carried out totally, or in
part, through electronic communications between distributed organizations and people.
These activities are characterized by speed, flexibility, and constant change.
The Internet has become the vehicle for transforming business processes. The reasons for
its ascendancy include the following:
The Internet protocols are the only workable set of technologies that really provide a
high degree of interoperability among different systems.
The wide geographic reach of the Internet increases the size of any potential market.
Chapter 1: Introduction
The introduction of the browser and its supporting technologies makes the Internet
much easier to use, thereby increasing the potential market.
There are many ways of segmenting and describing the large variety of services available
through the Internet and the Web. A simple classication that covers most services is based
on the relationship of the business to customers, business partners, and employees. For
example, the process shown in Figure 1-1 describes a simple situation involving all three
types of relationships: business to business (B2B), business to consumer (B2C), and
business to employee (B2E). These segments are an easy way of organizing our thinking
about services, although its important to remember that business processes in the real
world will have many variations and overlaps.
Figure 1-1  Business Relationships
(Diagram: internal enterprise systems connected to a supplier via B2B, to customers via B2C, and to sales staff via B2E.)
B2B
B2B services are a broad category that incorporates transactions among different
businesses and government agencies. Many current B2B services, such as supply chain
management and credit authorization, use the Internet to drive down the costs and delays
associated with current processes and to boost their productivity.
B2B is rapidly broadening to include more than supply chain management and credit
authorization. Functions such as shipping, billing, and Customer Relationship Management
(CRM) are now often external to the business; other businesses provide and host these
specialized services as a utility. For example, entry of a customer's order can result in more
than the functions of pricing, authorizing, assembling, and shipping; a modern system
might use B2B links to provide the customer with a shipment tracking number from the
shipping company, and it might interact with an external CRM service to reflect the current
purchases and other factors of the customer's profile. Meanwhile, the salesperson might be
indirectly using B2B links to handle her commissions and personnel data through
outsourced employee management services, and engineering staff might use B2B links for
collaborative design.
Thanks to the Web, B2B is rapidly transforming into an even more dynamic set of services
from which an enterprise can select in real time. No one wants to be dependent on a single
supplier or customer; everyone must deal with competitive pressures exerted from both
sides. Services such as credit authorization and shipping are examples of those that can be
selected in real time based on their performance or costs. Other services and supplies may
be selected from web-based exchanges or e-markets.
B2B processes can be complex. They must follow the business requirements for tracking
orders, negotiating contracts, arranging payments, and reporting outcomes that govern
these processes when they take place without the automation of electronic communications.
Note that new benefits become available, although at the cost of additional complexity,
when B2B replaces older systems. For example, organizations can change their business
processes to increase their business effectiveness by obtaining real-time information on
order volumes, revenue rates, cancelled orders, and other factors. This additional
information, while adding to complexity, provides value in addition to the acceleration of
the processes themselves by identifying further efficiencies.
Continuous monitoring of B2B suppliers, partners, and web infrastructure
(communications, hosting, and exchanges) is necessary to determine whether they are
meeting their service quality commitments.
As in conventional commerce, managing across organizations adds complexity. All the
links in the B2B services chains are known, but these links are controlled by many different
organizations, are complex, and may change rapidly as services are selected in real time.
Managing B2B services therefore requires cooperation with the management teams of the
other participants and, possibly, with third-party measurement organizations to assure true
end-to-end service quality.
B2C
B2C garnered most of the early attention from the trade press and analysts as traditional
businesses took advantage of the Internet's wide geographic reach and low costs for
reaching customers. Some businesses (eBay and Amazon.com, for example) were founded
to exploit this new market opportunity.
B2C sites continually add new services of their own while offering links to related
businesses and services in an attempt to offer one-stop shopping (and selling) to their
customers. This is a highly competitive segment with little customer loyalty. The wide
selection of competing sites draws customers away whenever any one site has a service
disruption.
B2C environments are characterized by a lack of visibility and management control of the
customer-access infrastructure, which is the set of networks, caches, and other systems that
consumers use to connect to the B2C site. Customers usually don't want measurement tools
embedded in their systems, and the access infrastructure providers also resist making their
internal performance readily visible. There is also limited visibility into the performance of
partner sites (advertisers and other third parties), which are important parts of the
customer's perception of total site performance. The span of control and management
available to B2C sites is therefore usually limited to monitoring and managing their internal
operations (inside the firewall) as well as measurement of Internet delays and performance
as seen from various points on the edge of the Internet.
B2E
B2E services are also known as the intranet. These services help improve the internal
effectiveness of an organization and help it keep pace with its customers and business
partners. Many B2E services enable employees to query their benefits, schedule vacations,
fill out expense reports, and conduct a set of activities that formerly required a large staff to
coordinate.

B2C and B2E services use the web browser as the access device. Transactions are initiated
from the browser to deliver information and activate a range of business processes.
However, B2E environments are the only ones that enable administrators to have control of
both ends: the servers as well as the desktops, cell phones, and personal communicators
used to access them.
Financials application. However, further discussion soon revealed that their international
operations used real-time currency conversion decisions. The real-time exchange rates in
the Oracle Financials application were, in fact, accessed through the Web.
Indeed, webbed services are now taking on many of the characteristics of an ecosystem,
which is a group of independent but interrelated elements comprising a unified whole. A
smooth business process depends on each element carrying out its tasks accurately and
quickly, with consideration for maintaining balances among all the elements. In a
well-balanced webbed ecosystem, all elements bear appropriate shares of the load. None is
overwhelmed, none is underutilized. Balance is concurrently maintained between service
quality and service cost. The ecosystem metaphor is gaining momentum as online
processes evolve to dynamically select their elements (underlying services) based on their
current behavior and performance.
The webbed ecosystem perspective also holds within any subgroup of systems. For
instance, hosting facilities use a range of technologies, such as prioritizing devices,
bandwidth managers, global load balancers, and caches, to deliver online business services.
These systems also need balanced management; adding bandwidth when servers are
congested is a wasteful investment.
QoE is the most important to customers, yet it is also the most difficult to evaluate. For
example, I recently visited a large company that derives over half its revenues online. They
were justly proud of a new initiative that reduced web page download times by two seconds.
However, the content was so dense and difcult to navigate that users still needed a long
time to understand the directions and identify the buttons or links they wanted to use next.
Improved technical performance did not appreciably raise the QoE in this case; users
wasted at least two seconds looking for what they wanted.
The second group of chapters (8 through 10) in this part steps through the major
systems used for web service delivery. It looks at the ways they can be used
to improve service delivery and also discusses their specific instrumentation
needs, using the system management infrastructures described in the first
part of this section. Chapter 8 investigates the instrumentation and
management of applications and of end-user access devices, such as
browsers. Chapter 9 looks at web server systems, including servers, load
balancers, and content distribution networks. Finally, Chapter 10 discusses
instrumentation and management of the transport infrastructure, including
QoS technology and traffic shaping to achieve policy objectives.
Summary
The Internet, and the Web, are transforming business processes for interaction among
businesses, government, suppliers, customers, and employees. As more and more critical
business processes go online, the service quality of those processes becomes more
important to the success of business as a whole.
SLAs are the formal, negotiated contracts between service providers and service users that
define the services to be provided, their quality goals, and the actions to be taken if the SLA
terms are violated.
SLM is the process of managing network and computing resources to ensure the delivery
of acceptable service quality, usually as dened in an SLA, at an acceptable price in an
acceptable time frame. It is a competitive weapon in the marketplace because it can improve
customer relationships, create more revenue opportunities, and reduce costs.
Chapter 2

This chapter covers the following topics:

- An overview of SLM
- An introduction to technical metrics
- Detailed discussions of measurement granularity and measurement validation
- Business process metrics
- Service Level Agreements (SLAs)

Note that the chapter ends with a summary discussion in the context of building an SLA.
Use of metrics in combination with the SLA's service level objectives to control
performance is discussed in Chapter 6, "Real-Time Operations," and Chapter 7,
"Policy-Based Management."
(Figure: chain of customer roles in service delivery, linking the IT group, hosting company, content delivery provider, ISP #1, ISP #2, and the telephone companies.)
Service providers may package the metrics into specific profiles that suit common customer
requirements while simplifying the process of selecting and specifying the parameters.
Service profiles help the service provider by simplifying their planning and resource
allocation operations.
(Table 2-1: example metrics and their descriptions; high-level metrics include Availability and Stream Quality, and low-level metrics include Availability, Packet Loss, Latency, and Jitter.)
Workload is an important characteristic of both high- and low-level metrics. It's not a
measure of delivered quality; instead, it's a critical measure of the load applied to the
system. For example, consider the workload of serving web pages. A text-only page might
comprise only 10 KB, whereas a graphics page could comprise a few megabytes. If the
requirement is to deliver a page in six seconds to the end user, massively different
bandwidth and capacity will be necessary. Indeed, content may need to be altered for
low-speed connections to meet the six-second download time.
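The bandwidth arithmetic behind this example can be sketched directly. The 15 percent protocol-overhead allowance below is an illustrative assumption, not a figure from the text:

```python
def required_bandwidth_kbps(page_bytes: int, target_seconds: float = 6.0,
                            protocol_overhead: float = 1.15) -> float:
    """Minimum sustained bandwidth (kbit/s) to deliver a page in the
    target time. protocol_overhead is a rough allowance for TCP/IP and
    HTTP framing; the 1.15 figure is an illustrative assumption."""
    bits = page_bytes * 8 * protocol_overhead
    return bits / target_seconds / 1000.0

text_page = required_bandwidth_kbps(10_000)         # 10 KB text page: ~15 kbit/s
graphics_page = required_bandwidth_kbps(2_000_000)  # 2 MB graphics page: ~3,067 kbit/s
```

The two-hundredfold spread between the two results is the reason content for low-speed connections often must be altered rather than merely delivered more slowly.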
NOTE
In many situations, certain technical metrics aren't specified in the SLA. Instead, the
supplier is asked to use best effort, which represents the classic Internet delivery strategy of
"get it there somehow" without concern for service quality. Today, best effort represents the
commodity level for services. There are no special treatments for best-effort services. The
only need is that there are sufficient resources to prevent best-effort services from starving
out, which means having the connection time out because of long periods of inactivity.
Discussions of all of the examples in Table 2-1 follow, to illustrate the basic concepts of
technical metrics. Additional descriptions of these metrics, and other technical metrics,
appear in Chapters 4 and 8 through 10.
Workload
The workload high-level technical metric is the measure of applied load in end-user terms.
It's unreasonable to expect a service provider to agree to service levels for an unspecified
amount of workload; it's also unreasonable to expect that an end user will willingly
substitute specification of obscurely-related low-level workload metrics instead of
understandable high-level metrics. SLAs should therefore begin by specifying the
high-level workload metrics, and service providers can then work with the customer's
technical staff to derive low-level workload metrics from them.
For transaction systems, the workload metric is usually specified in terms of the end-user
transaction mix and volumes, which typically vary according to time of day and other
business cycles. For existing systems, these statistics can be obtained from logs; for new
systems or situations (such as a proposed major advertising campaign designed to drive
prospective customers to a web site), the organization's marketing group or their
consultants should work to produce the most accurate, specific estimates possible. These
workload estimates for new systems should be used for load testing as well as for SLAs.
Transaction workload metrics must include end-user tolerance for transaction response
time delays. If response time delays are too long, external customers will abandon the
transaction. In legacy systems where external customers did not interact directly with the
server systems, abandonment was not a factor in workload testing. Call-center operators
handled any delays by talking to the customers, shielding them from the problem, if
necessary. On the Web, customers see the delays without any shielding, and they may
decide at any point to abandon the transaction, with immediate impact on the server
system's workload.
Another effect of the direct connection between customers and web-serving systems is that
there's no buffer between those customers and the servers. In a call center, the workload is
buffered by external queues. Incoming calls go through an automatic call distribution
system; callers are placed on hold until an operator is available. In an order-entry center,
the workload is buffered by the stack of documents on the entry clerk's desk. In contrast,
the web workload has no external buffer; massive spikes in workload hit the servers
instantly. These spikes in workload are called flash load, and they must be specified in the
workload metric and considered during load testing. Load specification for the Web should
therefore be in terms of arrival rate, not concurrent users, as was the case for call centers
and order-entry centers.
File-serving, web-page, and streaming-media workload metrics are similar to transaction
metrics, but simpler. They're usually specified in terms of the size and number of files that
must be transferred in a given time interval. (For web pages, the types of the files are usually
specified. Dynamically-generated files are clearly more resource-intensive than stored
static files.) The serving system must have the bandwidth to serve the files, and it must also
be able to handle the anticipated number of concurrent connections. There's a relationship
between these two variables; given a certain arrival rate, higher end-to-end bandwidth
results in fewer concurrent users.
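This relationship is Little's Law: average concurrency equals the arrival rate multiplied by the average time each transfer holds its connection. The sketch below approximates holding time as pure transfer time, and the traffic figures are illustrative:

```python
def concurrent_connections(arrival_rate_per_sec: float,
                           file_bytes: int,
                           bandwidth_bps: float) -> float:
    """Little's Law (L = lambda * W): concurrency is arrival rate times
    the time each transfer holds its connection. Holding time is
    approximated here as transfer time alone; real connections also
    include setup and server think time."""
    transfer_seconds = file_bytes * 8 / bandwidth_bps
    return arrival_rate_per_sec * transfer_seconds

# Same 50 requests/s arrival rate for a 100 KB file; doubling the
# per-user bandwidth halves the number of concurrent connections.
slow = concurrent_connections(50, 100_000, 1_000_000)   # 40.0 concurrent
fast = concurrent_connections(50, 100_000, 2_000_000)   # 20.0 concurrent
```

This is also why arrival rate, not concurrent users, is the stable way to specify web load: concurrency is a consequence of arrival rate and service speed, not an independent input.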
Availability
Availability is the percentage of time that the system is perceived as available and
functioning by the end user. It is a function of both the Mean Time Between Failures
(MTBF) and the Mean Time To Repair (MTTR). Scheduled downtime might, in some
organizations, be excluded from these calculations. In those organizations, a system can be
declared 100 percent available even though it's down for an hour every night for system
maintenance.
Availability is a binary measurement: the service is either available or it isn't. For the end
user, and therefore for the high-level availability metric, the fact that particular underlying
components of a service are unavailable is not a concern if that unavailability is concealed
through redundant systems design.
Availability can be improved by increasing the MTBF or by decreasing the time spent on
each failure, which is measured by the MTTR. Chapter 3, "Service Management
Architecture," introduces the concept of triage, which decreases MTTR through quick
assignment of problems to the appropriate specialist organization.
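The standard steady-state formula is Availability = MTBF / (MTBF + MTTR). A minimal sketch, with illustrative figures:

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability as a fraction of time:
    MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Failures about once a month (MTBF = 720 hours) with a 2-hour repair time:
monthly = availability(720, 2)          # roughly 0.9972, about 99.7 percent
# Halving MTTR raises availability without touching MTBF, which is the
# effect triage aims for:
faster_repair = availability(720, 1)
```

The formula makes the text's point concrete: faster assignment and repair (lower MTTR) improves the metric just as surely as more reliable components (higher MTBF).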
Stream Quality
The quality of multimedia streams is difficult to measure. Although underlying low-level
technical metrics, such as frame loss, can be obtained, their relationship to the quality as
perceived by an end user is very complex.

Streaming is a real-time service in which the content continues flowing even with variations
in the underlying data transmission rates and despite some underlying errors. A content
consumer may see a small blemish on a graphic because a packet is lost in transit,
equivalent to static on your car radio. There is no rewinding and playing it again, as there
might be with interactive services. Thus, packet loss is handled by just continuing with the
streaming rather than retransmitting lost packets.
Occasional packet loss can still be tolerated and sometimes may not even be noticed. If
packet loss increases, quality will begin to degrade until it falls below a threshold and
becomes unacceptable. Years of development have been focused on concealing these
low-level errors from the multimedia consumer, and the major existing technologies from
Microsoft, Real Networks, Apple, and others have different sensitivities to these errors.
Nevertheless, quality must be measured. The telephone companies years ago established
the Mean Opinion Score (MOS), a measure of the quality of telephone voice transmission.
There are also international standards for evaluating audio and video quality as perceived
by human end users; examples are the International Telecommunication Union's ITU-T
P.800-series and P.900-series standards and the American National Standards Institute's
T1.518 and T1.801 standards. Simpler methods are also in use, such as measuring the
percentage of successful connection attempts to the streaming server, the effective
bandwidth delivered over that connection, and the number of rebuffers during transmission.
Packet Loss
Packet loss has different effects on the end-user experience, depending on the service using
the transport. The choice of a packet loss metric for a particular application must be
carefully considered. For example, packet loss in file transfer forces retransmission unless
the high-level transport contains embedded error-correction codes. In contrast, moderate
packet loss in streaming media may have no user-perceptible effect at all, unless bad luck
results in the loss of a key frame.
The burst length must be included in packet loss metrics. Usually a uniform distribution of
dropped packets over longer time intervals is implicitly assumed. For example, out of every
100 packets there could be two lost without violating an SLA calling for two percent packet
loss. There may be a different perspective if you examine behavior over longer intervals,
such as 1,000 packets. Up to 20 packets in a row could be lost without violating the SLA.
However, losing 20 consecutive packets, which creates a significant gap in the data received,
might drive quality levels to unacceptable values.
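The arithmetic above can be checked directly: a 2 percent loss SLA evaluated over 1,000-packet windows tolerates a 20-packet burst that a 100-packet window would flag. A small sketch (the simulated stream is illustrative):

```python
# 1,000 packets with a single 20-packet burst of loss (1 = lost).
stream = [0] * 1000
for i in range(500, 520):   # 20 consecutive losses
    stream[i] = 1

def loss_rate(packets, start, length):
    """Fraction of packets lost within one measurement window."""
    window = packets[start:start + length]
    return sum(window) / len(window)

# Over the full 1,000-packet interval, the 2 percent SLA is met...
print(loss_rate(stream, 0, 1000))

# ...but the 100-packet window containing the burst shows 20 percent loss.
print(loss_rate(stream, 500, 100))
```

The same average loss rate can therefore describe very different user experiences, which is why burst length belongs in the metric definition.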
Latency
Latency is the time needed for transit across the network; it's critical for real-time services.
Excessive latency quickly degrades the quality of web sites and of interactive sound and video.
Routes in the Internet are usually asymmetric, with flows often taking different paths
coming and going between any pair of locations. Thus, the delays in each direction are
usually different. Fortunately, most Internet applications are primarily sensitive to round-trip delays, which are much simpler to measure than one-way delays. File transfer, web
sites, and transactions all require a flow of acknowledgments in the direction opposite to the
data flow. If acknowledgments are delayed, transmission temporarily ceases. The round-trip latency therefore controls the effective bandwidth of the transmission.
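This effect can be quantified: for a window-based protocol such as TCP, throughput is bounded by the amount of unacknowledged data allowed in flight divided by the round-trip time. A rough sketch with illustrative numbers:

```python
# Effective bandwidth ceiling for a window-based transport:
# at most one window of data can be outstanding per round trip.

def max_throughput_bps(window_bytes: int, rtt_seconds: float) -> float:
    return window_bytes * 8 / rtt_seconds

# Illustrative: a 64 KB window over a 50 ms round trip
# yields roughly a 10 Mbps ceiling...
print(max_throughput_bps(65536, 0.050) / 1e6)

# ...and the same window over a 200 ms round trip is four times slower,
# regardless of the link's raw capacity.
print(max_throughput_bps(65536, 0.200) / 1e6)
```

This is why round-trip latency, not raw link speed, often determines the effective bandwidth an end user sees.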
Round-trip latency is much simpler to measure than one-way latency because clock
synchronization of separated locations is not necessary. That synchronization can be quite
tricky if it is accomplished across the same network that's having its one-way delay
measured. In that case, fluctuations in the metric that's being measured (one-way latency)
can easily affect the stability of the measurement apparatus itself. An external reference,
such as the satellite Global Positioning System (GPS) timers, is often used in such
situations.
Jitter
Jitter is the deviation in the arrival rate of data from ideal, evenly spaced arrival; see Figure
2-3. Some packets may be bunched more closely together (in terms of inter-packet delays)
or spread farther apart after crossing the network infrastructure. Jitter is caused by the
internal operation of network equipment, and it's unavoidable; it is created whenever
there are queues and buffering in a system. Extreme jitter is also created when packets are
rerouted because of network congestion or failure.
Figure 2-3 Jitter: Ideal Versus Actual Packet Spacing
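One common way to quantify this deviation is to compare the observed inter-packet gaps against the ideal, even spacing. The sketch below uses the mean absolute deviation from the ideal gap; actual measurement schemes vary, and the arrival times are illustrative.

```python
# Arrival times (seconds) of packets sent at a uniform 20 ms interval.
arrivals = [0.000, 0.021, 0.039, 0.062, 0.080]
ideal_gap = 0.020

# Inter-packet gaps actually observed after crossing the network.
gaps = [b - a for a, b in zip(arrivals, arrivals[1:])]

# Jitter here: mean absolute deviation of each gap from the ideal spacing.
jitter = sum(abs(g - ideal_gap) for g in gaps) / len(gaps)
print(f"{jitter * 1000:.2f} ms")
```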
Measurement Granularity
The SLA must describe the granularity of the measurements. There are three related parts
to that granularity: the scope, the sampling frequency, and the aggregation interval.
Measurement Scope
The first consideration is the scope of the measurement, and availability metrics make an
excellent example. Many providers define the availability of their services based on an
overall average of availability across all access points. This approach gives the
service providers the most flexibility and cushion for meeting negotiated levels.
Consider if your company had 100 sites and a target of 99 percent availability based on an
overall average. Ninety-nine of your sites could have complete availability (100 percent)
while one could have zero. Having a site with an extended period of complete unavailability
isn't usually acceptable, but the service provider has complied with the negotiated terms of
the SLA.
If the availability level is specified on a per-site basis instead, the provider would have been
found to be noncompliant and appropriate actions would follow in the form of penalties or
lost customers. The same principle applies when measuring the availability of multiple
sites, servers, or other units.
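The 100-site example can be checked with a few lines; the site figures are illustrative:

```python
# 99 sites fully available, one site completely down for the period.
site_availability = [1.00] * 99 + [0.00]

# The overall average meets a 99 percent target...
overall = sum(site_availability) / len(site_availability)
print(overall)

# ...but a per-site clause catches what the average conceals.
worst = min(site_availability)
print(worst >= 0.99)
```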
Availability has an additional scope dimension, in addition to breadth: the depth to which
the end user can penetrate to the desired service. To use a telephone analogy, is dial tone
sufficient, or must the end user be able to reach specific numbers? In other words, which
transactions must be accessible for the system to be regarded as available?
Scope issues for performance metrics are similar to those for the availability metric. There
may be different sets of metrics for different groups of transactions, different times of day,
and different groups of end users. Some transactions may be unusually important to
particular groups of end users at particular times and completely unimportant at other
times.
Regardless of the scope selected for a given individual metric, it's important to realize that
executive management will want these various metrics aggregated into a single measure of
overall performance. Derivation of that aggregated metric must be addressed during
measurement definition.
percent chance that the true median, if we could perform an infinite number of
measurements, would be between five seconds and seven seconds. That is what the 95
Percent Confidence Interval seeks to estimate, as shown in Figure 2-4. When you take
more measurements, the confidence interval (two seconds in this example) usually becomes
narrower. Therefore, confidence intervals can be used to help estimate how many
measurements you'll need to obtain a given level of precision with statistical confidence.
Figure 2-4 Confidence Interval for Internet Data
There are simple techniques for calculating confidence intervals for normal distributions
of data (the familiar bell-shaped curve). Unfortunately, as discussed in the subsequent
section on statistical analysis, Internet distributions are so different from the normal
distribution that these techniques cannot be used. Instead, the statistical simulation
technique known as bootstrapping can be used for these calculations on Internet
distributions.
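A bootstrap confidence interval works by resampling the measured data points with replacement many times and reading the interval off the distribution of the recomputed statistic. A minimal sketch for a median follows; the resampling count, percentile method, and sample data are illustrative choices.

```python
import random
import statistics

def bootstrap_median_ci(data, resamples=2000, confidence=0.95, seed=42):
    """Confidence interval for the median via the bootstrap
    percentile method: resample with replacement, recompute the
    median each time, and take the central percentile range."""
    rng = random.Random(seed)
    medians = sorted(
        statistics.median(rng.choices(data, k=len(data)))
        for _ in range(resamples)
    )
    lo_idx = int((1 - confidence) / 2 * resamples)
    hi_idx = int((1 + confidence) / 2 * resamples) - 1
    return medians[lo_idx], medians[hi_idx]

# Illustrative response times (seconds), heavy-tailed on the right.
samples = [1.1, 1.3, 1.2, 1.4, 1.2, 9.8, 1.3, 1.1, 1.5, 1.2]
low, high = bootstrap_median_ci(samples)
print(low, high)
```

No assumption about the underlying distribution is needed, which is what makes the technique suitable for heavy-tailed Internet data.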
In some cases, depending on the pattern of measurements, simple approximations may be
used to calculate confidence intervals. Keynote Systems recommends the following
approximation for calculating the confidence interval for availability metrics.
(This information is drawn from Keynote Data Accuracy and Statistical Analysis for
Performance Trending and Service Level Management, Keynote Systems Inc., San Mateo,
California, 2002.) The formula is as follows:
Now you must decide if the preliminary calculations are reasonable. We suggest that
the preliminary calculation should be accepted only if the upper limit is below 100
percent and the lower limit is above 0 percent. (The example just used gives an upper
limit > 100% for n = 29 or fewer, so this rule suggests that the calculation is reasonable
if n = 30 or greater.)
Note that we're not saying that the confidence interval is too wide if the
upper limit is above 100 percent (or if the average availability itself is 100
percent because no errors were detected); we're saying that you don't know
what the confidence interval is. The reason is that the simplifying
assumptions used to construct the calculation break down if there are not
enough data points.
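The behavior described here, with a preliminary upper limit that exceeds 100 percent when there are too few samples, is characteristic of the standard normal-approximation interval for a proportion. The sketch below assumes that common form as a stand-in, since it is not reproduced here, and the sample counts are illustrative only.

```python
import math

def availability_ci(successes: int, n: int, z: float = 1.96):
    """Normal-approximation interval for an availability proportion;
    an assumed stand-in for the cited formula, not a quotation of it."""
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

# With few samples, the preliminary upper limit exceeds 100 percent,
# so by the acceptance rule above it should be rejected...
print(availability_ci(24, 25))

# ...while a larger sample keeps both limits inside (0%, 100%).
print(availability_ci(960, 1000))
```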
For performance metrics, a simple solution to the problem of confidence intervals is to use
geometric means and geometric deviations as measures of performance, which are
described in the subsequent section of this chapter on statistical analysis.
Keynote Systems suggests, in the paper previously cited, that you can approximate the 95
Percent Confidence Interval for the geometric mean as follows, for a measurement sample
with n valid (nonerror) data points:
Upper Limit = [geometric mean] * [geometric deviation] ^ (1.96 / sqrt(n - 1))
Lower Limit = [geometric mean] / [geometric deviation] ^ (1.96 / sqrt(n - 1))
This is similar to the use of the standard deviation with normally distributed data and can
be used as a rough approximation of condence intervals for performance measurements.
Note that this ignores cyclic variations, such as by time of day or day of week; it is also
somewhat distorted because even the logarithms of the original data are asymmetrically
distributed, sometimes with a skew greater than 3. Nevertheless, the errors encountered
using this recipe are much less than those that result from the usual use of mean and
standard deviation.
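Those limits translate directly into code. The sketch below assumes the geometric mean and geometric deviation have already been computed; the input values are illustrative.

```python
import math

def geometric_ci(geo_mean, geo_dev, n, z=1.96):
    """Approximate 95 percent confidence limits for the geometric
    mean, per the Keynote approximation quoted in the text."""
    factor = geo_dev ** (z / math.sqrt(n - 1))
    return geo_mean / factor, geo_mean * factor

# Illustrative: geometric mean 2.0 s, geometric deviation 1.5,
# from 101 valid measurements.
low, high = geometric_ci(2.0, 1.5, 101)
print(low, high)   # the limits are multiplicative, not symmetric offsets
```

Note that the mean is multiplied and divided by the same factor, so the limits are symmetric in ratio rather than in distance, consistent with working in log space.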
Table 2-2 shows this idea. If availability is measured on a small scale (hourly), high-availability requirements such as the "5-9s" (99.999 percent) permit only 0.036 seconds of
outage before there's a breach of the SLA. Providers must provision with adequate
redundancy to meet this type of stringent requirement, and clearly they will pass on these
costs to the customers that demand such high availability.
Table 2-2 Permitted Cumulative Outage by Availability Level and Measurement Interval

Availability   Hour        Day         Week       4 Weeks
98%            1.2 min     28.8 min    3.36 hr    13.4 hr
98.5%          0.9 min     21.6 min    2.52 hr    10 hr
99%            0.6 min     14.4 min    1.68 hr    6.7 hr
99.5%          0.3 min     7.2 min     50.4 min   3.36 hr
99.9%          3.6 sec     1.44 min    10 min     40 min
99.99%         0.36 sec    8.64 sec    1 min      4 min
99.999%        0.036 sec   0.864 sec   6 sec      24 sec
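Each cell in Table 2-2 is simply the measurement interval multiplied by the permitted unavailability, which makes the table easy to verify or extend:

```python
# Permitted outage (seconds) = interval length * (1 - availability).
def permitted_outage_seconds(availability: float, interval_seconds: float) -> float:
    return interval_seconds * (1 - availability)

HOUR, DAY = 3600, 86400
WEEK, FOUR_WEEKS = 7 * 86400, 28 * 86400

print(permitted_outage_seconds(0.99999, HOUR))            # ~0.036 sec (five nines, hourly)
print(permitted_outage_seconds(0.999, FOUR_WEEKS) / 60)   # ~40 min
print(permitted_outage_seconds(0.98, FOUR_WEEKS) / 3600)  # ~13.4 hr
```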
If a monthly (four-week) measurement interval is chosen, the 99.999 percent level indicates
that an acceptable cumulative outage of 24 seconds per month is permitted while remaining
in compliance. A 99.9 percent availability level permits up to 40 minutes of accumulated
downtime for a service each month. Many providers are still trying to negotiate an SLA
with availability levels ranging from 98 to 99.5 percent, or cumulative downtimes of 13.4
to 3.5 hours each month.
Note that these values assume 24 x 7 x 365 operations. For operations that do not require
round-the-clock availability, are not up during weekends, or have scheduled maintenance
periods, the values will change. That said, they're easy to compute.
The key is for service provider and service customer to set a common definition of the
critical time interval. Because longer aggregation intervals permit longer periods during
which metrics may be outside tolerance, many organizations must look more deeply at their
aggregation definitions and at their tolerance for service interruption. A 98 percent
availability level may be adequate and also economically acceptable, but how would the
business function if the 13.4 allotted hours of downtime per month occurred in a single
outage? Could the business tolerate an interruption of that length without serious damage?
If not, then another metric that limits the interruption must be incorporated. This could be
expressed in a statement such as the following: "Monthly availability at all sites shall be 98
percent or higher, and no service outage shall exceed three minutes." In other words, a little
arithmetic to evaluate scenarios for compliance goes a long way.
Measurement Validation
Measurement problems, which are artifacts of the measurement process, are inevitable in
any large-scale measurement system. The important issues are how quickly these errors are
detected and tagged in the database, and the degree of engineering and business integrity
that's applied to the process of error detection and tagging.
Measurement problems can be caused by instrument malfunction, such as a response timer
that fails, and by synthetic transaction script failure, which leads to false transaction error
reports. They can also be caused by abnormal congestion on a measurement tool's access link
to the backbone network and by many other factors. These are failures of the measurement
system, not of the system being measured. They therefore are best excluded from any SLA
compliance metrics.
Detection and tagging of erroneous measurements may take time, sometimes up to a day or
more, as the measurement team investigates the situation. Fortunately, SLA reports are not
generally produced in real time, so there's an opportunity to detect and remove such
measurements.
The same measurements will probably also be used for quick diagnosis, or triage, and that
usage requires real-time reporting. There's therefore no chance to remove erroneous
measurements before use, and the quick-diagnosis techniques must themselves handle
possible problems in the measurement system. Good, fast-acting artifact reduction
techniques (discussed in Chapter 5, "Event Management") can eliminate a large number of
misleading error messages and reduce the burden on the provider management system.
An emerging alternative is using a trusted, independent third party to provide the
monitoring and SLA compliance verification. The advantage of having an independent
party provide the information is that both service providers and their customers can view
this party as objective when they have disputes about delivered service quality.
Keynote Systems and Brix Networks are early movers into this market space. Keynote
Systems provides a service, whereas Brix Networks provides an integrated set of software
and hardware measurement devices to be installed and managed by the owner of the SLA.
They both provide active, managed measurement devices placed at the service demarcation
points between customers and providers or between different providers. (Other companies,
such as Mercury Interactive and BMC, now offer similar services and software.)
The measurement devices, known as "agents" in the Keynote service and "verifier
platforms" in the Brix service, carry out periodic service quality measurements. They
collect information and reduce it to trends and baselines. There is also a real-time alerting
component: when a measurement device detects a noncompliant situation, alerts are
forwarded to the Keynote or BrixWorx operations center, where they are logged and
included in service level quality reports. Because the Keynote system is a service, Keynote
provides measurement device management and measurement validation.
Keynote and BrixWorx also offer integration with other management systems and support
systems for reporting to customers, provisioning staff, and other back-office departments.
Test suites for more detailed testing are also stored at the center and deployed to the
measurement platforms as necessary.
Trusted third parties may be the solution needed to reduce the problems when customer
experience and provider data are not in close agreement.
Statistical Analysis
Most statistical behavior that you see in life is described by a normal distribution, the
typical bell-shaped curve. This is an extremely convenient and well-understood data
distribution, and much of our intuitive understanding of data is built on the assumption that
the data we're examining fits the normal distribution. For a normal distribution, the
arithmetic average is, indeed, the typical value of the data points, and a standard deviation
calculated by the usual formula gives a good sense of the breadth of the distribution.
(A small standard deviation implies a very tight grouping of data points around the average;
a large standard deviation implies a loose grouping.) For a normal distribution, 68 percent
of the measurements are within one standard deviation of the average, and 95 percent are
within two standard deviations of the average.
Unfortunately, Web and Internet behavior do not conform to the normal distribution. As a
result of intermixing long and short files, compressed video and acknowledgments, and
retransmission timeouts, Internet performance has been shown to be "heavy-tailed," with a
long right tail. (See Figure 2-5.) This means that a small but significant portion of the
measurement data points will be much, much larger than the median.
Figure 2-5
If you use just a few measurements to estimate an arithmetic average with a heavy-tailed
distribution, the average will be very noisy. It's unpredictable whether one of the very large
measurements will creep in and massively alter the whole average. Alternatively, you may be lulled
into a false sense of security by not encountering such an outlying measurement (an outlier).
The situation for standard deviations is even worse because these are computed by squaring
the distance from the arithmetic average. A single large measurement can therefore
outweigh tens of thousands of typical measurements, creating a highly misleading standard
deviation. It's mathematically computable, but worse than useless for business decisions.
Use of arithmetic averages, standard deviations, and other statistical techniques that depend
on an underlying normal distribution can therefore be quite misleading. They should
certainly not be used for SLA contracts.
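A tiny experiment shows how badly a single heavy-tailed outlier distorts the arithmetic statistics while the median barely moves; the response times are illustrative.

```python
import statistics

# Typical web response times (seconds) with one heavy-tailed outlier.
typical = [0.9, 1.0, 1.1, 1.0, 0.9, 1.1, 1.0, 1.0, 0.9, 1.1]
with_outlier = typical + [300.0]

# The mean jumps from about 1 second to nearly half a minute...
print(statistics.mean(typical), statistics.mean(with_outlier))

# ...the standard deviation explodes because distances are squared...
print(statistics.stdev(typical), statistics.stdev(with_outlier))

# ...while the median still reflects the typical user experience.
print(statistics.median(typical), statistics.median(with_outlier))
```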
The geometric mean and the geometric standard deviation should be used for Internet
measurements. Those measures are not only computationally manageable, they're also a
good psychological fit for an end user's intuitive sense of the typical measurement.
As an alternative, the median and eighty-fifth percentile may be used, but they take more
power to compute.
The geometric mean is the nth root of the product of the n data points. The geometric
deviation is the standard deviation of the data points in log space. The following algorithm
should be used to avoid computational instabilities:
- Take the logarithm of each data point.
- Compute the arithmetic mean and standard deviation of those logarithms.
- Undo the logarithms by exponentiating the results to the same base originally used.
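In code, the log-space algorithm might look like the following sketch; natural logarithms are used as the base, and the measurement data is illustrative.

```python
import math
import statistics

def geometric_stats(data):
    """Geometric mean and geometric deviation computed in log space,
    avoiding overflow from multiplying many data points together."""
    logs = [math.log(x) for x in data]               # take logs
    mean_log = statistics.fmean(logs)                # mean of logs...
    stdev_log = statistics.stdev(logs)               # ...and their deviation
    return math.exp(mean_log), math.exp(stdev_log)   # exponentiate back

# Illustrative heavy-tailed measurements (seconds).
geo_mean, geo_dev = geometric_stats([1.0, 1.2, 0.9, 1.1, 14.0])
print(geo_mean, geo_dev)

# The geometric deviation is a multiplicative factor:
print(geo_mean / geo_dev, geo_mean * geo_dev)   # lower and upper deviations
```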
Note that the geometric deviation is a factor; the geometric mean must be multiplied and
divided by it to create the upper and lower deviations. Because of the use of logarithms, the
upper and lower deviations are not symmetrical, as they are with a standard deviation in
normal space. This is one of the prices you pay for the use of the geometric measures.
Another disadvantage is that, as is also true for percentiles, you cannot simply add the
geometric statistics for different measurements to get the geometric statistics for the sum
of the measurements. For example, the geometric mean of (connection establishment time
+ file download time) is not the sum of the geometric means of the two components.
Instead, each individual pair of data points must be individually combined before the
computations are made.
These calculations of both the geometric mean and the geometric deviation, or the median
and the eighty-fifth percentile, should be used for end-user response time specification.
Using these statistics instead of conventional arithmetic averages or absolute maximums
helps manage SLA violations effectively and avoids the expense of fixing violations that
were caused by transient, unimportant problems.
Difficulty in finding experts at the provider who actually understand the provider's
own services
Although such issues have made the service-provider marketplace somewhat turbulent, the
good news is that the situation is improving because of two developments.
The first is the continuing build-out of the Internet core with optical transmission systems
of tremendous capacity, coupled with the widening deployment of broadband services for the
"last mile" access links to the customer. When this capacity is fully in place, bandwidth
services can be activated and deactivated without the delays associated with running new
wiring and cable. As these high-capacity transmission systems become more widespread,
it becomes a question of coordinating the activities of both customer and provider
management systems for more effective and economical service delivery.
That introduces the second enabling factor: the development of standards, such as the
Extensible Markup Language (XML) and the Common Information Model (CIM), which are
making the sharing of management information easier and simpler than it used to be.
Customers and service providers can use mechanisms such as XML to loosely couple their
management systems. Neither party needs to expose internal information processes to the
other, but they can exchange requests and information in real time to speed up and simplify
their interactions.
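As an illustration of such loose coupling, a customer system might post a small XML document describing a service change without either side exposing its internal processes. The element names and values below are entirely hypothetical, not drawn from any standard or from the text.

```python
import xml.etree.ElementTree as ET

# Hypothetical service-change request a customer's management system
# could send to a provider's interface.
request = ET.Element("serviceRequest", attrib={"customer": "example-co"})
ET.SubElement(request, "action").text = "activate"
ET.SubElement(request, "service").text = "bandwidth-on-demand"
ET.SubElement(request, "capacityMbps").text = "100"

document = ET.tostring(request, encoding="unicode")
print(document)

# The provider side parses the request without needing to know
# anything about the customer's internal systems.
parsed = ET.fromstring(document)
print(parsed.findtext("action"), parsed.findtext("capacityMbps"))
```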
Customers can allocate their spending more precisely by activating and deactivating
services with finer control, thereby reducing their usage charges. They can also
temporarily add capacity or services to accommodate sudden shifts in online business
activities.
Providers have a competitive edge when they have the appropriate service management
systems. They can meet customer needs quickly and use their own dynamic pricing
strategies to generate additional revenues.
Business process metrics measure the quality of the interactions between customers and
service providers as a way of including those interactions in an SLA and thereby improving
them. Some of these metrics may be incorporated in standard provider service profiles,
while others may need to be negotiated explicitly.
Many customer organizations maintain relationships with multiple service providers to
avoid depending on a single provider and to use the competition to extract the best prices
and service quality they can negotiate.
Business process speed and accuracy will be even more important in the future as customer
and provider management systems are integrated, and as services are activated and
deactivated in real time. Service providers must be able to provision quickly, bill
appropriately, and adjust services in a matter of a few seconds to a few minutes. Customers
must also be able to understand their service mix and adjust their requests to the service
provider to match changes in their business requirements. It is this environment that will
begin to accelerate the use of business process metrics as part of the selection and continued
evaluation of a set of service providers.
Table 2-3 lists two emerging categories of business process metrics. Problem management
metrics measure the provider's responses to customer problems, whereas real-time service
management metrics track the responses to customer requests for service modifications.
Table 2-3 Business Process Metrics

Metric                          Description
Notification Time
Escalation Time
Activation/Deactivation Time
Change Latency                  The elapsed time to effect a parameter change across the entire system
Change latency is an idea for a metric that arose from the experience of one of my
colleagues. She works for a large multinational organization with approximately 1,200
global access devices. Some access points support a small number of dial-in users, while
others accommodate larger buildings and campuses. Her organization wanted to change
some access control policies and asked the service provider to update all the access devices.
The problem occurred because the provider changed only portions of the devices in phases
over two days rather than all at once, leading to a situation in which devices had
inconsistent access control information. The result was disruptions to the business.
- They have an explicit agreement that defines the services that will be provided, the
metrics to assess service provider performance, the measurements that are required,
and the penalties for noncompliance.
- The clarity of the SLA removes much of the ambiguity in customer-service provider
communication. The metrics, rather than arguments based on subjective opinions of
whether the response time is acceptable, are the determinant for compliance.
- The SLA also helps customers manage their costs because they can allocate their
spending on a differentiated scale, with premiums for critical services and commodity
pricing where best effort is sufficient.
- Customers have the confidence that they can successfully deploy the critical services
that improve their internal operations (remote training and web-based internal
services) or strengthen their ability to compete (web services and supply chains). Too
many efforts have floundered due to unacceptable service quality after deployment.
- The SLA becomes more important as you move toward customer-managed service
activation and resource management. The SLA will determine what the customer is
allowed to do in real time in terms of changing priorities and service selections.
Service providers have been reluctant to negotiate SLAs because of their increased
exposure to financial penalties and potentially adverse publicity if they fail to meet
customer needs. In spite of their reluctance, they have been forced into adopting SLAs to
keep their major customers. The evolution of SLAs has therefore been driven mainly by
customer demands and fear of losing business.
Early SLAs focused primarily on availability because it was easier to measure and show
compliance. Availability is also easier for a provider to supply by investing in the
appropriate degree of redundancy so that failures do not have a significant impact on
availability levels.
Performance metrics are beginning to be included in more SLAs because customers
demand them. Providers have a more difficult time guaranteeing performance levels
because of the dynamism of their shared infrastructures. Simply adding more bandwidth
will not guarantee acceptable response time without significant traffic engineering,
measurement, and continued analysis and adjustment. The difficulty of managing highly
dynamic flows has made many providers reluctant to accept the financial penalties that are
part of most SLAs.
Nonetheless, the value of the SLA to providers is also recognized, and some of the
significant factors are as follows:
- The clarity of the SLA serves the provider as it does the customer. Clearly defined
metrics simplify the assignment of responsibility when service levels are questioned.
- The SLA offers service providers the capacity to differentiate their services and
escape (somewhat) the struggles of competing in a commodity-based market. As
providers create and deploy new services, they can charge on a value-pricing basis to
increase their profit margins.
When constructing an SLA, customers must assess their desired mix of services and weigh
their relative priorities. A useful first attempt is to match those needs against the provider's
preconfigured service profiles. This will group services with common characteristics and
requirements, and it will also help identify any special services that are not easily
accommodated by the predefined categories. Requirements that do not fit a predefined class
will require special consideration when negotiating an SLA.
After services have been grouped, their relative priorities within each category must be
established. Customers can do this by selecting the appropriate service profile; for example,
many service providers offer a variation on the platinum, gold, and silver profiles. Typically,
platinum services are the most expensive and provide the highest quality; gold and silver
are increasingly less expensive and provide relatively lower quality.
Even if prebuilt service profiles are used, the SLA negotiations must include discussions of
how the SLA metrics are to be measured and how any penalties or rewards are to be
calculated. Customers will continue to push for stronger financial penalties for
noncompliance, and providers will yield to that pressure as slowly as they can in a highly
competitive market.
Unfortunately, it's not uncommon for providers and customers to have ongoing disputes
about the delivered services and their quality. Some of the roots of the problem are
technical: customers and providers may have different measurement and monitoring
capabilities and are therefore comparing apples to oranges. Other problems are rooted in
the terms of the SLA, where ambiguities lead to different interpretations. SLAs must
therefore incorporate relevant measurement, artifact reduction, verification mechanisms,
and appropriate statistical treatments to protect both parties as much as possible. Customers
must play a role in the verification process because they still have the most to lose when
serious service disruptions occur.
SLA penalties and rewards are a form of risk management on the part of the customer.
However, they continue to be among the least well-developed elements of service offerings.
More mature industries offer guarantees and incentives; the ability of the service provider
to reduce and absorb some risk for its customers is a key competitive differentiator.
Still, customers bear the brunt of any disruptions caused by a provider. As one customer
once said, "The problem is the punishment doesn't fit the crime; an hour-long outage costs
us over $100,000, and my provider just gives me a 10 percent rebate on my next bill."
Nevertheless, the correct role for penalties and rewards is to encourage good performance,
not to compensate the customer for all losses. If loss compensation is needed, it's a job for
risk insurance.
Rather, SLA penalties and rewards must focus on motivation. The penalties and rewards
should be sufficient to inspire the performance the customer wants, and the goals should be
set to ensure that the motivating quality of the SLA remains throughout the time period.
Impossible or trivial goals don't motivate, and capped penalties or goals stop motivating
when the cap is reached. For example, if a provider must pay a penalty based on monthly
performance, and the SLA is violated in the first three days of the month (so the maximum
penalty must be paid), the provider won't be motivated to handle problems that appear
during the remainder of the month. After all, that particular customer's ship has already
sunk; maybe another customer's ship is still sinking and can be rescued without paying a
maximum penalty!
Web performance goals that are set unrealistically high, with no reference to the Internet's
background behavior, will cause the supplier to refuse the SLA or insist on minor penalties.
A solution to this problem is to include in the SLA metrics a background measure of
Internet performance or of competitors' performance, possibly from a public performance
index or from specific measurements undertaken as part of the SLA.
Sometimes, performance is so poor that a contract must be terminated. The SLA should
discuss the conditions under which termination is an option, and it should also discuss who
bears the costs for that termination. Again, the costs should be primarily designed to
motivate the supplier to avoid terminations; it may not be possible to agree on an SLA in
which all of the customer's termination costs are repaid.
Finally, customers may want to include security concerns in their SLA as part of a service
profile through additional negotiation and specification. Security is notoriously difficult to
measure, except in very large aggregates. Security metrics are more likely to take the form
of response-time commitments in the event of a breach, whether to roll out patches, shut down
access, or detect an intrusion. The bulk of security discussions around service levels will be
about policies, not measurement.
Summary
This chapter covers a lot of territory and sets the stage for the following chapters, which
cover different aspects of actually managing services. Successful service management
is predicated on delivering acceptable service quality at acceptable price points and within
acceptable time frames. Correctly handled, it improves service quality, improves
relationships with suppliers, and may even lower total costs.
The SLA is the basic tool used to define acceptable quality and any relationships between
quality and price. It is a formal, negotiated contract between a service provider and a service
user that defines the services to be provided, the service quality goals (often called service
level indicators and service level objectives), and the actions to be taken if the service
provider does not comply with the SLA terms.
Measurement is a key part of an SLA, and most SLAs have two different classes of metrics:
technical metrics and business process metrics. Technical metrics include both high-level
technical metrics, such as the success rate of an entire transaction as seen by an end user,
and low-level technical metrics, such as the error rate of an underlying communications network.
Business process metrics include measures of provider business practices, such as the speed
with which they respond to problem reports. Metrics should also include measures of the
workload expected. Service providers may package the metrics into specific profiles that
suit common customer requirements while simplifying the process of selecting and
specifying the parameters.
In any case, a properly constructed SLA is based on metrics that are relevant to the end-user
experience. Many of the low-level technical metrics, such as communications packet loss,
have complex relationships to end-user experience; it's usually much better to use high-level
technical metrics that directly measure end-user experience, such as web page
download time and transaction time. The low-level technical metrics can then be derived
from the high-level technical metrics and used to manage subordinate systems.
SLA metrics must be carefully defined in terms of scope, sampling frequency, and
aggregation interval:
Scope represents the breadth of measurement (for example, the number of test points
from which availability is measured and the percentage of them that must be
unavailable for the entire system to be marked as unavailable).
The aggregation interval is also important, as longer intervals, often chosen in SLAs,
may allow long periods of sub-par performance. The tolerance for service interruption
then becomes important and may need to be separately specified.
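The interplay of scope and aggregation interval can be sketched in code. The probe layout and the down-fraction rule below are invented for illustration, not drawn from any particular SLA:

```python
# Illustrative only: aggregate per-test-point probe results into an
# availability figure for one aggregation interval.
def interval_availability(samples, down_fraction=0.5):
    """samples: list of probe rounds; each round is a list of booleans,
    one per test point (True = that test point saw the site as up).
    The system counts as 'down' for a round only if at least
    `down_fraction` of the test points report failure (the scope rule)."""
    up_rounds = 0
    for round_results in samples:
        failures = sum(1 for ok in round_results if not ok)
        if failures / len(round_results) < down_fraction:
            up_rounds += 1
    return up_rounds / len(samples)

# Three test points, four probe rounds in the interval:
rounds = [
    [True, True, True],    # all test points see the site as up
    [True, False, True],   # one point down -> still counted as up
    [False, False, True],  # two of three down -> counted as down
    [True, True, True],
]
print(interval_availability(rounds))  # 0.75
```

Note how both parameters matter: widening the scope (more test points) or shortening the interval (fewer rounds averaged together) changes the reported availability for the same underlying behavior.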
Measurements must also be validated and subjected to statistical treatment when used in
SLAs, and the methods for that validation and treatment must be documented in the SLA
to avoid dispute. Validation ensures that erroneous measurements are removed, insofar as
is possible, before computation of the metrics used in the SLA. Statistical treatment ensures
that outlying measurements do not create a misleading picture of the performance as
perceived by end users, with the resulting waste of resources spent fixing what may be a
minor issue. Arithmetic averages and standard deviations should not be used to handle
Internet statistics.
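As a small illustration of why this warning matters, the following sketch (with invented response times) contrasts the arithmetic mean with the median and a high percentile on a heavy-tailed sample:

```python
import statistics

# Hypothetical page-download times in seconds; the two large values
# are the kind of outliers common in Internet measurements.
times = [1.1, 1.2, 1.0, 1.3, 1.1, 1.2, 30.0, 45.0]

mean = statistics.mean(times)
median = statistics.median(times)
p90 = statistics.quantiles(times, n=10)[8]  # 90th percentile cut point

print(f"mean={mean:.2f}s  median={median:.2f}s  90th pct={p90:.2f}s")
# The mean (about 10.2s) suggests a broken site; the median (1.2s)
# shows that the typical user is fine, and the high percentile
# exposes the tail without letting it dominate the whole picture.
```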
Finally, the SLA should be written with penalty and reward clauses that are sufficient to
inspire the performance the customer wants, and the goals should be set to ensure that the
motivating quality of the SLA remains throughout the time period. Capped penalties or
goals are examples of techniques that may motivate a supplier to abandon work on an
account just because the cap has been reached; that is probably not the desired behavior.
The service level indicators and objectives described in the SLA are then used by the
operations staff and by automated systems to manage the service levels, as described in
Chapters 6 and 7.
CHAPTER 3
Service Management Architecture
This chapter describes the overall architecture for the management of service delivery on
the Web; it forms the framework that later chapters will build on to create a complete
design. It also gives some of the relevant history of service management architectures to
enhance the reader's understanding of the issues facing management architectures. The
history will let the reader see the origins of some of the major management products in the
marketplace.
Before the discussions begin, however, it's important to understand that a system
architecture defines the components of a system and their relationships, showing how they
provide the required system functions and meet the system objectives. The interconnections
among the architecture's subsystems are clearly defined, and each subsystem can be
expanded to reveal internal details. A good architecture therefore provides a high-level overview
for those who need it, while also providing detailed technical information, if necessary.
Any number of architectures can be created to handle the same challenge; the number of
potential architectures is limited only by the imagination of the architect. Each can stress
different capabilities or organizing principles. Ideally, the selected architecture provides the
maximum business value (including functionality, flexibility, match to the organization's
culture, and more) while costing the least in funding and effort to build and manage.
Design decisions made within each architecture's subsystems may affect the function and
performance of the other subsystems, although the subsystems in some architectures are
more tightly interrelated than in others. For example, in a Web service delivery architecture,
the decision to use a private network of geographically distributed caching systems has
implications for the design of the central Web-serving system. A well-defined architecture
helps those managing it to see the implications of a subsystem's design and management
decisions on the system as a whole.
This chapter is organized into three sections. The first section provides a brief description
of a large-scale Web services delivery architecture along with its business environment.
That's because it's impractical to discuss management systems without having a common
understanding of the architecture of the systems being managed and the business
environment (the webbed ecosystem) within which those systems must function. The
middle section discusses the history of service management platforms for heterogeneous
systems and the design factors and standards that go into them. The last section gives a
summary of the service management architecture used in this book and provides references
to the relevant chapters.
Figure 3-1 [figure: a large-scale Web service delivery environment. The primary server farm (Internet access router, firewall, load distributor, and tiers of web, application, and database servers, plus a DNS server and server-side cache) connects through multiple backbone providers (AT&T, Sprint, MCI, C&W, UUNET, Verio, Qwest, Level 3) across the Internet to access providers, CDN servers, and caches serving end users.]
This server farm uses a three-tier application model, which is normally used for large-scale
systems. The three tiers are as follows:
Web servers, which maintain the connections with client browsers and other client
devices, parsing and handling input from them, formatting data to be sent to them,
serving unchanging (static) web pages, and often being responsible for maintaining
transaction context.
43
Application servers, which run the major transaction and dynamic web page
generation systems, as well as any specialized applications for the end users. They
often run specialized transaction-processing operating systems that simplify
programming for scalability and availability.
Database servers, which handle the large back-end databases needed by larger Web
systems.
Because the three tiers are loosely coupled, each tier can grow independently of the others,
and interconnections can be used to increase availability.
Above the three server tiers in Figure 3-1 is the load distributor, which distributes incoming
requests among the web servers, and the firewall and Internet access router.
In Figure 3-1, the primary server farm is multi-homed; it's connected to two different
Internet Service Providers (ISPs) to increase availability. The primary server farm usually
also includes ancillary devices, such as the authoritative Domain Name System (DNS)
server, which provides the key records for mapping the site's Internet host names to Internet
numeric addresses, and server-side caches, which can be used to relieve the serving systems
of highly repetitive work by storing the results of commonly repeated requests. The end user's
Quality of Experience (QoE) depends on much more than the primary server farm's
performance, however. Multiple server farms, caching devices, content distribution networks,
third-party content providers, and the DNS may also be involved.
Most large systems rely, often indirectly, on multiple, distributed server farms. Some
enterprises have multiple locations from which they provide their basic content, and they
use geographic distribution technologies to try to direct end users to the server farm that will
respond the fastest. For example, it's impossible to deliver rapid web page downloads in Asia
from a server system in New York City; enterprises that have a large user base in Asia must,
therefore, have some server systems on that side of the Pacific. Geographic distribution is
critical to providing good QoE, though it's difficult to locate an end user with great precision
by using that end user's Internet address. Obtaining detailed knowledge of location, while
very important for some applications and for some performance situations, can be quite tricky.
Caching devices are used to store frequently requested data inside the network, at the server
location, or within the end user's local network to decrease both network traffic and the time
needed to locate and display data. These devices are often provided free of charge for web
sites to use, but configuring web pages for use with remote caching can be complex.
Precise evaluation of the QoE at an end user's location as the result of caching requires
remote measurement facilities.
A Content Distribution Network (CDN) is a service that uses a large network of remote
caches to provide much more control of caching than is available using free caching. A
CDN can provide prepositioning of content, such as a major advertising campaign; it also
provides the ability to cache download files and streaming media, which are usually not
stored by public caches. A CDN gives the content owner direct, immediate control over
remotely cached content. A CDN can also supply differentiated content to end users, based
on their location.
44
Most web sites use third-party content providers for some advertising or even basic site
content. Many stock-trading sites, for example, use a third-party provider for stock price
graphs that are visually embedded in their web pages. Although the content comes
from third-party content providers, the end user usually does not realize that it
originates from different sites. If there are performance problems, the site owner is blamed,
not the third-party provider.
Finally, the web site can't even be found by the end user if there are problems with the
performance of the DNS. DNS is a worldwide hierarchy of server systems configured as a
distributed directory, and it must be able to reach the web site's authoritative record and
interpret that (often complex) record correctly. DNS information can then be cached in the
DNS's own dedicated system of distributed caching servers, with some control from the
web site's owner. Without measurement from end-user locations, problems with the DNS
are often not detected until irate end users call up the site to complain about the site's being
offline. The site may be completely accessible from the site owner's intranet, but
completely inaccessible from large areas of the Internet.
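As a hypothetical illustration of DNS-oriented measurement, the sketch below times a name resolution from the local vantage point. Production monitoring would run probes like this from many end-user locations; the host name is only an example:

```python
import socket
import time

def timed_lookup(hostname):
    """Resolve a name and report how long resolution took.
    Run from several locations, probes like this expose DNS problems
    that are invisible from the site owner's own intranet."""
    start = time.perf_counter()
    try:
        # Collect the distinct addresses the resolver returns.
        addrs = {info[4][0] for info in socket.getaddrinfo(hostname, 80)}
        return time.perf_counter() - start, sorted(addrs)
    except socket.gaierror as err:
        # Resolution failure is itself a measurement worth recording.
        return time.perf_counter() - start, err

elapsed, result = timed_lookup("www.example.com")
print(f"resolved in {elapsed * 1000:.1f} ms: {result}")
```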
All the traffic between the end user and the various server farms, caching devices, content
distribution networks, and DNS servers travels over the Internet's complex mesh of
backbones and peering points, which are the locations at which different organizations
interconnect their backbones. The routing tables, used to direct Internet traffic, are so
complex that the routing software cannot consider fluctuations in transit time when making
routing decisions. If it did, the Internet would be saturated by routing table update
messages, and the router CPUs would be saturated by the calculations required. The result
is that routing through the Internet is often suboptimal, and traffic often heads for congested
areas and peering points instead of traveling around them.
Routers do attempt to reroute around failed pieces of the Internet; they just don't usually
reroute around congested pieces. Delays can build quickly at congestion points, and packets
can be lost or duplicated as routers try to recover from their problems. To add to the
complexity, the route between any pair of endpoints is almost always different in the two
directions.
For all Internet and Web situations, you can see that measurement of the performance as
seen by the end user must be available to detect QoE problems occurring in the complex
Web-serving systems. Those measurements must also be quickly available and must be
credible; otherwise, the web site's owner won't be able to use them to get an Internet service
provider to fix a problem. Of course, some problems, such as a very localized difficulty
with an ISP's bank of dial-in modems, are beyond the scope of responsibility of a web
site's owner, even though that web site's availability appears to be affected. In such cases,
which occur constantly, measurements from standard locations and the use of public
performance index measurements can be used to reassure management (and the end user,
if necessary) that the problem is a local one and is beyond the direct responsibility of the
owners or operators of the web site.
If a problem occurred in the traditional IBM SNA-based system, the system operator had
central, integrated control of the application, the network, and the end user's terminal. The
network's System Services Control Points (SSCPs) could instantly locate and diagnose the
complete end-to-end connection between a particular application and a particular terminal.
Given clear visibility into underlying connections, other tools could diagnose the
application problems quickly. The entire system was tightly coupled. Configuration was
extremely complex and could be error-prone, but a running system was under strong central
control.
In contrast, Internet-based systems are loosely coupled and do not rely on massive,
centralized configurations of servers, storage, and network hardware. However, these more
flexible configurations are more difficult to operate at a given level of service and do not
have any central management system. Instead of having that central authority, which is the
keystone of a traditional system, Internet-based systems have a loose confederation of
interacting, separately owned and controlled subsystems.
Of course, the flexibility of networked architectures is a mixed blessing. It does facilitate
changes to keep pace with changing demands of the business. However, such change can
also introduce new complexities and vulnerabilities. When a problem occurs in an Internet-based
system, finding the precise end-to-end path that the data flow is taking may be
extremely difficult; there's no central switching or routing authority. Even if that path is
found, it's unclear that the knowledge could be effectively used to fix any problems
quickly; the responsible ISP might be one that isn't directly accountable to either end of
the connection.
Further exacerbating the situation is that a problem as seen by the end user could have been
caused by any of dozens of interacting subsystems and servers. The image on the end user's
browser probably comes from multiple servers simultaneously (third-party suppliers
provide stock charts, ads, and so on); each data flow may have been invisibly intercepted
and possibly cached by devices unknown to server or end user; the server assigned to a
particular end user may have been assigned only temporarily and cannot easily be traced at
a later time or even while the error is occurring. Running a help desk in the complex Internet
environment is much more technically difficult, and it takes more ongoing negotiating and
interacting with external suppliers than running one in the traditional environment!
The traditional approach, still common today, is to use each tool in isolation. When several
tools are needed for a task, a staff person takes the output from one tool and uses that
information to drive the next tool. This approach needs additional staff attention, can
consume large amounts of time, and adds the risk of introducing errors with manual steps.
It also requires an investment in additional equipment and requires additional physical
space, adding significantly to the cost of monitoring and management. Integrating the tools
appears to be a better solution.
If integration is good, you might wonder why there is so little of it. Consider the following
reasons:
Second, the market has been willing to settle for integration as defined by marketing
departments: integration that seemed to correspond to needs, but that has failed to
meet the test of practice.
The early management platforms touted themselves as integration points for a set of best-of-breed
management tools. Unfortunately, their marketing hype exceeded their capacity
for delivering any meaningful integration. Competition was on the basis of who had the
longest list of third-party management tools sharing interfaces to their platform,
notwithstanding any real integration efforts. The market positioning suggested that
commonality of interfaces was the key to making tools useful; that turns out not to be the
case in practice.
The integration many early management platform vendors actually offered might be better
characterized as consolidation and tool launching. Consolidation allows customers to use a
single server for a set of management tools rather than use a server for each one. Tools can
be launched after an alert triggers a response. This is useful, but there is no integration;
each tool still operates as a separate entity with its own commands, functions, data schema,
and display formats.
Some management platforms added integration on the glass: a consistent look and feel for
a set of tools. This feature is useful because it simplifies usage and reduces staff training
requirements. The platforms offered this common look and feel for their products and the
overall console. However, each tool could, and often did, have its own conventions after
being launched.
All the early platform vendors got away with these low-level integration features because
the market was relatively unsophisticated, and systems management did not demand as
much integration. However, today, this lighter level of integration is no longer adequate;
management tools must now work in a webbed services environment.
A cynical view is that the early lack of deep integration also served vendors as they built
substantial professional services organizations to finish the job. I had one vendor in a
moment of candor admit that his company made $10 in professional services for each $1 a
customer spent on the actual software. Market studies in general showed that the consulting
spent to take such tools off the shelf and put them to use exceeded the licensing fees by a
factor of 2:1 or more.
The relatively shallow integration left organizations with several other choices: they could
find another integrator, undertake the effort themselves, or live with a set of disjointed
management tools. Of course, using a systems integrator was expensive and time-consuming;
it often meant that a company was dependent upon the integrator every time
new management tools were acquired. The alternative of internal integration efforts was
also expensive and time-consuming, as it diverted development resources from the core
business initiatives.
As with much of technology, invention is the mother of necessity in management. The
management industry has been responding to the need for better integration through
consolidation. The big players buy up niche products and offer the suite as an integrated
solution. Others are forging strategic partnerships and integrating their products. Both
trends offer some additional value for management solution buyers. However, it is still
unusual for these efforts to produce a product suite that offers more than integration on the
glass. Often surface integration relies too heavily on limited new software to try to glue the
disparate pieces together.
The contrast with traditional management organization strategies is stark. In the past, teams
had isolated, well-bounded responsibilities; for instance, the network and application
infrastructure managers had little reason to interact. Today, such specializations must be
integrated with a structure for mutual responsibility and collaboration by specialists across
these different layers. Infrastructure managers can be specialists, but service managers must
also be generalists.
Boundaries between customers, their providers, and their business partners are also
becoming more fluid. At any point, the constellation of providers and partners can change
as the mix of services responds to business shifts. To keep pace with the changing mix,
management systems must interact more frequently, and customers need to assume some
of the management functions that have been the provider's domain.
The syntax (command and data structure) of SNMP can be used by a management
application to determine that a variable included in an array of system management
information is a 32-bit integer used as a counter. However, without the semantics (meaning)
of the counter, the application cannot use the data. Missing information may include the
following:
Many standard sets of SNMP syntax and semantics for network and applications systems
were defined, and they attempt to answer these questions. However, manufacturers quickly
introduced proprietary extensions to the standard MIB data definitions, and often the
semantics of those extensions were poorly specified, which stymied interoperability.
Proprietary extensions are inevitable, and they are a mixed blessing for customers. They are
desirable because they enable vendors to innovate and offer unique value-added features to
the standard SNMP management capabilities. They are a problem because management
applications from other vendors often do not use the foreign extensions to advantage.
Without complete specifications, and without a financial incentive to do the integration
work, vendors can't and won't incorporate other vendors' extensions into their
management tools. This leads to situations where customers having similar network
devices from several vendors must use different management tools for each product set,
even though all the devices perform the same functions in almost identical ways.
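The syntax-versus-semantics gap can be made concrete with SNMP's common 32-bit counter type: the syntax tells the application only that the value is a 32-bit counter, while the semantics say that usable data comes from differencing successive polls and handling wraparound at 2^32. The sketch below assumes those semantics; the poll values are invented:

```python
MAX32 = 2 ** 32  # a 32-bit SNMP counter wraps back to 0 at this value

def counter_delta(previous, current):
    """Difference between two successive polls of a 32-bit counter.
    Without the semantic knowledge that the counter only increases
    and wraps at 2**32, a raw reading is meaningless on its own."""
    if current >= previous:
        return current - previous
    # The counter wrapped between the two polls.
    return MAX32 - previous + current

# Invented successive poll values for an octet counter on an interface:
print(counter_delta(1_000, 51_000))        # 50000 octets in the interval
print(counter_delta(4_294_967_000, 704))   # wrapped: 1000 octets
```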
This problem of data definition continues in current standards efforts, although there has
been some improvement. The extensible markup language (XML) standard is already being
used extensively for exchanging structured information, and many vendors have adopted
XML as a means for exchanging information between their own management products. XML
takes a step forward by including methods for converting the format of a message's data
into a format understood by the receiver, but the semantics of that message must still be
defined elsewhere.
The Distributed Management Task Force, a standards body composed of industry players,
has recently defined the Common Information Model (CIM), which is intended to
complement XML by offering more complete definitions for all management tools. For
example, CIM can be used to describe each managed object by the following:
Methods: Describe the operations that can be performed on the object. For the
servers, there would be methods for rebooting, killing a process, creating a process,
changing the number of active threads, and other operations.
CIM is still very young and not yet widely used, but it points the way to the future.
Instrumentation
Instrumentation, described in detail in Chapter 4, Instrumentation, and Chapters 8 through 10,
and shown at the top of Figure 3-2, monitors and measures the performance and availability
of system components, as well as that of services. Instrumentation of components, or element
instrumentation, tracks the status and behavior of individual components, such as network
devices, servers, and applications. Examples of element measurements are CPU busy
percentage and the percentage of received packets that contain transmission errors. Services
instrumentation tracks the behavior of services using active and passive collectors. Examples
of measured services are round-trip time through a network and transaction response time.
Instrumentation takes two forms:
Figure 3-2 [figure: the generic service management architecture. Instrumentation feeds instrumentation management, which sends alerts to real-time event management and measurement data to SLA statistics and reporting; real-time operations, policy-based management, long-term operations, and back-office operations complete the picture.]
Instrumentation Management
Instrumentation managers, described in Chapter 4 and shown in the middle of Figure 3-2,
configure the instrumentation systems and receive the measurement data from them. They
examine each incoming data item, filtering out obvious measurement errors and comparing
measurements to specified thresholds to see if an alert should be issued. If measurements
indicate a possible problem, the instrumentation manager may demand additional
measurements to help make sense of the problem and to see if the original measurement
was an outlier or was a true indicator of a difficulty. There are two primary outputs from the
instrumentation manager: alerts and service level indicator data. The former consists of
alerts that are important enough to be escalated to the real-time event handler, where they
will be combined with other data for evaluation; the latter consists of data sets and
aggregated measurements that are all forwarded to the SLA statistics system for statistical
treatment and reporting on system performance.
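The filter, compare, and escalate cycle of an instrumentation manager can be sketched as follows; the validity range, threshold, and single re-measurement rule are invented simplifications:

```python
def process_sample(value, threshold, remeasure, valid_range=(0.0, 120.0)):
    """One cycle of a simplified instrumentation manager.
    value:      incoming measurement (e.g., response time in seconds)
    remeasure:  callable that takes one fresh measurement on demand
    Returns ('discard' | 'ok' | 'alert', value_for_sla_statistics)."""
    lo, hi = valid_range
    if not (lo <= value <= hi):
        return "discard", None           # obvious measurement error
    if value <= threshold:
        return "ok", value               # forward to SLA statistics only
    confirm = remeasure()                # outlier, or real trouble?
    if confirm > threshold:
        return "alert", value            # escalate to real-time event mgmt
    return "ok", value                   # outlier; keep the data, no alert

# A 9s response against a 5s threshold, confirmed by a second 8s sample:
print(process_sample(9.0, 5.0, remeasure=lambda: 8.0))  # ('alert', 9.0)
```

Note the two output paths from the text: everything valid flows to SLA statistics, while only confirmed threshold violations become alerts.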
The policy manager applies business rules to the operation of the system. It is an automated
tool that identifies the service levels allocated to each end user and application, based on
rules programmed by the system operators. It then tunes the system and denies system
access as needed to enforce those service levels.
Some examples of the functions performed by the trio of event manager, operations
manager, and policy manager are listed here and are discussed in more detail in Chapters 5 through 7:
Long-Term Operations
Some operations are considered to be longer term because their activation or completion
within a short time interval is not critical. Such longer-term operations, shown at the bottom
of Figure 3-2, can be associated with strategic changes to the service-delivery environment,
or they can offer more fundamental remediation of problems identied by alarms. Some
examples of longer-term operations include the following:
Load testing is discussed in Chapter 11, Load Testing, and system modeling and capacity
planning are discussed in Chapter 12, Modeling and Capacity Planning.
Back-Office Operations
Back-office operations, shown at the bottom of Figure 3-2, are related to the business side
of service delivery. These processes have usually been described as Operations Support
Systems (OSS) in the world of traditional telephone providers. They constitute a bridge
between operations of the service-delivery environment and the management of the
business that pays for them. Typical back-office functions for service providers include the
following:
Billing: Tracks resource usage and charges accordingly. Billing will be tied to the
negotiated terms of the SLA, and it must be flexible and easily extended to incorporate
new services.
Customer service: Provides the help desk, web pages, and other means of
interacting with customers and supporting a wide range of needs, such as ordering
services, getting information, and resolving disputes.
Order tracking: Follows customer orders through the steps from initial contact
through ordering, activation, and revenue capture.
Service consumers must manage their online business with similar types of information.
For example, they should track the performance of their providers, the cost of the services
that they use, and the benefit or income from their use of those services.
The business-process metrics described in Chapter 2 can be used to suggest metrics that
will help manage the overall performance of the back-office operations.
Summary
This chapter outlines the important parts of a complete service level management system.
Starting with a description of a large-scale Web services delivery architecture, it then shows
the influences of that architecture on the design of service level management systems.
Critical influences are the constant demands for changing services, the use of multiple
service providers and partners, elastic boundaries among teams and providers, demands for
fast system management, and the need for mutually understandable data item definitions
and event signaling mechanisms among the various pieces of the management system.
The generic management system outlined in Figure 3-2 is used as a reference model in the
rest of the book. The parts of the generic management system consist of system
instrumentation (fully described in Chapters 4 and 8 through 10), instrumentation management
systems (described in Chapter 4), and the SLA statistics and reporting systems (described
in Chapter 2) that use the data from instrumentation. The parts also consist of the real-time
operations systems (event handling, Chapter 5; operations, Chapter 6; and policy, Chapter 7),
along with long-term operations (load testing, Chapter 11; and system modeling and
capacity planning, Chapter 12), and, finally, back-office operations, which are not further
described in this book.
PART
II
Instrumentation
Chapter 5
Event Management
Chapter 6
Real-Time Operations
Chapter 7
Policy-Based Management
Chapter 8
Chapter 9
Chapter 10
CHAPTER 4
Instrumentation
The term instrumentation is used to describe the technologies and processes for monitoring
and measuring the behaviors of services, infrastructures, and elements. You use
instrumentation to monitor behavior and assess the impact of changing operational
conditions on your ability to meet the compliance requirements for Service Level
Agreements (SLAs). Services managers need appropriate instrumentation to inform them
of actual or potential problems and provide feedback after they make adjustments. This
chapter introduces an instrumentation methodology for managing the infrastructure
described in the overview in Chapter 3, Service Management Architecture.
This chapter covers the following topics:
Figure 4-1 [figure: service instrumentation measures aggregate service behavior, such as response time, while element instrumentation tracks individual components: CPU, queue depth, frames, packets, errors, and hits.]
Figure 4-1 shows the relationship between elements and service instrumentation. Service
instrumentation must monitor and measure the overall behavior of the aggregate elements
supporting any service ow.
Business managers also rely on real-time information to track business processes and goals.
Technical information must be translated into business-centric metrics. A large transaction
volume indicates high server performance, but this high volume may have no business
value if the transactions are completing quickly because the desired content is missing.
Lack of the appropriate service instrumentation results in service managers being left to
manage by hope and by customer feedback. This strategy involves reacting to problems
reported by customers and trying alternatives until something works.
Decisions based on poor management information result in the following:
Exacerbating the issue: Poor information can lead to poor decisions that introduce
further disruptions.
Technical and business managers need information about, and insight into, service behavior
so that they can make effective service management decisions. Operational decisions must
be made within short time intervals, while other decisions, having major long-term effects,
can be made more slowly. The text discusses these in turn.
As an example of the business perspective, an online web site selling merchandise can be
monitored with network-based probes tracking the actual URLs being used. The web
applications can also provide direct access to information. The instrumentation indicates
the number of abandoned shopping carts by analyzing the URLs flowing on the network or
reaching certain points within the application itself.
Technical problems, such as slow credit authorization or billing services, can increase the
number of abandoned shopping carts. These problems are correctable with standard
technical means. Carts are also abandoned when there is a problem with the actual web
content or navigation. The business administrator needs to understand when an unwelcome
change has occurred and take steps to keep the business running smoothly.
Stress testing services to determine their actual capacities and breaking points: Managers stress test those operational areas and loads where service disruptions are more likely.

Evaluating the services mix to determine if new services will destabilize the current mix and introduce more service degradation: Managers can avoid unpleasant surprises and outages by planning ahead.
Instrumentation is the bedrock for managing services and service quality. An instrumentation system provides accurate and timely information for a range of management
decisions and other functions. In addition, instrumentation provides essential feedback for
technical and business administrators. Measuring the results of any decision validates good
choices or indicates whether further attention is still needed.
There are two primary instrumentation modes: activating trip wires and taking time slices.
A combination of trip wires and time-sliced measurements is used for supporting
operational and strategic service management tasks. Figure 4-2 shows the use of trip wires,
which can generate real-time alerts by comparing a behavior to a static threshold value or
by tracking deviations from a normal behavioral envelope. Time slices are repetitive
measurements of the same variables over time.
NOTE
Trip wires and time slices are used for real-time alerts; time slices also help with longer-term functions, such as planning.
Figure 4-2 Trip Wires (figure: alerts generated when a monitored behavior crosses a static threshold or deviates from the normal behavioral envelope)
Trip Wires
Trip wires provide simple real-time alerting for operational decision making. Management
tools compare the collected information to established thresholds. An alert is sent to a
management application when the value is higher or lower than the established threshold.
Further processing of the alert determines whether it is a valid problem, whom to notify,
and which tools to activate.
A series of thresholds can be established. Consider an SLA requiring response times of five seconds or less. A warning level (2.5 seconds, for instance) gives administrators ample time
to investigate a performance shift and take appropriate action. A three-second threshold
denotes a performance level that is getting closer to unacceptable values, and a four-second
threshold is used to bring an urgent response from the management system.
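As a rough sketch, the graduated thresholds described above can be expressed as a simple classification; the level names and cutoffs are taken from the five-second SLA example, but the function itself is only an illustration:

```python
# Hypothetical sketch: classifying a response-time measurement against a
# graduated series of SLA thresholds (values from the 5-second SLA example).
WARNING, ELEVATED, URGENT = 2.5, 3.0, 4.0   # seconds; assumed warning levels

def classify(response_time: float) -> str:
    """Return the alert level for one response-time measurement."""
    if response_time >= URGENT:
        return "urgent"       # bring an urgent response from the management system
    if response_time >= ELEVATED:
        return "elevated"     # approaching unacceptable values
    if response_time >= WARNING:
        return "warning"      # time to investigate the performance shift
    return "ok"

print(classify(2.7))   # -> warning
```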
Determining when a trip wire should be triggered is fairly simple. However, the simplicity introduces difficulties because a threshold is usually a static value and the environment is dynamic. There may be peaks and valleys of activity, and a threshold set too low will trigger a rash of alerts that do not really indicate a problem. Raising the threshold reduces the alert volume when the normal load is high, but introduces the risk of missing situations when normal volumes are lighter.
One key to effective instrumentation is selecting realistic thresholds to ensure accurate warnings. Many product specifications are based on a set of optimum conditions, and actual performance can be quite different. Realistic load testing is a practical means of determining accurate threshold values. Load testing is discussed in Chapter 11, "Load Testing."
Time Slices
Time slices are repetitive measurements of the same variables over longer time intervals.
They track changes in normal behavior over an extended period of time.
Baselines are an example of a time-sliced measurement. Baselines are also used as trip
wires because they provide a more accurate assessment of dynamic behavior. Repetitive
measurements are used to set the initial baseline for normal behavior as an envelope with
high, low, and average values. Statistical techniques such as those mentioned in Chapter 2, "Service Level Management," can be used to set the baseline values. A baseline approach
sends an alert whenever measurements fall outside the normal envelope. Current measures
are compared to the baseline and deviations can reveal conditions such as the following:
A shift in normal behavior that naturally occurs over time with growth or changes: This situation merely defines a new normal baseline.

A trend showing that performance is shifting toward the edges of the envelope and thus may indicate an underlying problem: Administrators spend time only on situations that are actually abnormal.

A measurement that should not have any influence can be detected and discarded: A single measurement, for instance, could have a very abnormal value, but a single occurrence requires no further attention. Many artifacts, which are false indications of the actual situation, can be automatically screened, saving valuable staff time and minimizing unnecessary interruptions.
Baselines are most effective when the environment is stable long enough to take the
measurements and make the calculations. Baselines must be recalculated as normal loads
grow or newly added services alter the environment.
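Assuming the envelope is derived from the mean and standard deviation of repeated measurements (one of several possible statistical techniques; the names and the k=3 width are assumptions), a baseline check might be sketched as:

```python
import statistics

def baseline_envelope(samples, k=3.0):
    """Derive a normal-behavior envelope (low, high) from repeated
    time-sliced measurements, using mean +/- k standard deviations."""
    mean = statistics.mean(samples)
    sd = statistics.pstdev(samples)
    return mean - k * sd, mean + k * sd

def outside_envelope(value, envelope):
    """Trip-wire style test: alert when a measurement leaves the envelope."""
    low, high = envelope
    return value < low or value > high

history = [1.1, 0.9, 1.0, 1.2, 1.0, 0.8, 1.1]   # response times in seconds
env = baseline_envelope(history)
print(outside_envelope(5.0, env))   # clearly abnormal -> True
```

The envelope must be recalculated as normal loads grow, matching the recalculation requirement noted above.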
Time slices require consistent measurements over time and some processing to determine
the trends. Trends revealed with time-sliced measurements are used for longer-term
planning and optimization functions.
Storing the collected information for a variety of other service management functions
The instrumentation system provides the framework for monitoring service behaviors and
reporting them to other parts of the service management system. The instrumentation
system manages collectors and aggregators, ensuring that they are operating properly and
collecting the appropriate information. The processing functions organize the data and save
some for long-term storage. The collectors and aggregators collect and reduce data and pass
alerts to the processing or event management functions. An instrumentation system
produces the information necessary for making sound management decisions at the tactical
or strategic levels.
A service instrumentation system provides an organizing framework for leveraging the
installed instrumentation base while guiding the incorporation of new components.
Instrumentation is dynamic; new instrumentation emerges with new technologies and
services. New information sources must be incorporated with minimal staff intervention
and then leveraged by other service management tools.
The major components of a service instrumentation system are shown in Figure 4-3. Event
handling, in the real-time event manager, and SLA management tools are also included
because they are tightly coupled with the instrumentation system. The basic cyclic behavior
of instrumentation management, collection, and processing drives many other management
functions.
These components represent an abstract way of discussing what an instrumentation system
does. The reality of how it is actually implemented is usually messier; some of these
functions take place in several stages and in different parts of the system. Different vendors
offer different sets of features and functions; the completeness of the system functions is
the goal. The behavior can be viewed as cyclic. The collected information causes
adjustments in the information collection process, which creates new information, which
results in a change, and so forth.
Figure 4-3 Major Components of a Service Instrumentation System (figure: instrumentation management, collectors, aggregators, and processing, coupled with real-time event management and SLA management tools)
Instrumentation Managers

Instrumentation managers use measurement policies to control the local data collection activities in a distributed set of collectors and aggregators.
These policies can specify the types of measurements to be taken, their frequency, and the
acceptable ranges of values. For example, simple policies can dictate that more than three
consecutive abnormal measures should generate an alert. The measurement frequency
policy should be based on the failover latency (how long it takes redundant components to respond to a service disruption and resume service delivery at the specified quality levels). Thus, for example, if a service fails over within five minutes, your system should test every 1 to 2 minutes.
Instrumentation managers simplify operations because a single command affects the
operation of many collectors and aggregators. Thus, staff time and mistakes are reduced and
the instrumentation is managed effectively.
Instrumentation managers periodically use a heartbeat to verify continued collector and
aggregator operations. A heartbeat is a periodic exchange of messages to verify that both
parties are operating properly. Consider that an independent collector (discussed later in
this chapter) might not communicate for long periods of time when no problems are
detected. The instrumentation manager, in this case, uses a heartbeat to determine whether
the collector is still operating; if heartbeats are not returned, the instrumentation manager
must take steps to reestablish communication or shift to other monitors.
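A minimal sketch of such a heartbeat check, assuming a send_heartbeat transport call and a three-miss escalation policy (both assumptions, not a vendor API):

```python
# Hypothetical sketch of the heartbeat check an instrumentation manager
# might run. send_heartbeat is an assumed transport call, stubbed below.
MAX_MISSED = 3                   # missed replies before escalation (assumed)

def check_collector(send_heartbeat, escalate):
    """Probe a collector; escalate after MAX_MISSED consecutive misses."""
    missed = 0
    while missed < MAX_MISSED:
        if send_heartbeat():     # True when the collector answers
            return "alive"
        missed += 1
    escalate()                   # reestablish contact or shift to other monitors
    return "unresponsive"

dead = iter([False, False, False])
print(check_collector(lambda: next(dead), lambda: None))   # -> unresponsive
```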
Collectors
Collectors measure service behavior instead of element behavior. They collect information
that is suited for each class of services. The information includes response times for
transactions and packet loss for interactive or streaming classes. Collectors measure
specific service instances, verifying that individuals, groups, or regions receive acceptable
service quality.
Collectors can be programmed to provide more granular service information. They can
measure subtransactions to make distinctions among functions, such as downloading a
page, executing a stock trade, or ordering merchandise. Collectors use a combination of
active and passive techniques. These techniques are discussed later in this chapter.
Collectors and aggregators are shown in Figure 4-3 in relation to other instrumentation
system components. They are the source of management information and alerts for the
processing and event management components. Alerts are the trip wire; the collectors send
alerts when a certain condition, such as an unacceptable delay, has been detected. The
system provides time-sliced data for SLA tracking and a variety of purposes.
Collectors can be embedded in network elements (as described in the sidebar), incorporated
as software modules in desktops or servers, or packaged as standalone components.
Continued processor price/performance improvements reduce the impact when more
instrumentation processing is embedded. In addition, the additional processing power
enables more complex measurements. In the future, collectors will interact with other
collectors and instrumentation managers.
Cisco Systems Embeds SAAs to Measure Performance Within or Across a Network Infrastructure
Aggregators
Aggregators are used for scaling and for providing efficiency through monitoring and managing a set of local collectors. Aggregators consolidate the information and usually carry out simple filtering to reduce the volume of information they forward to the processing functions. Aggregators also conserve bandwidth by filtering alerts and forwarding only those needing further attention. Figure 4-3 shows how aggregators can be
cascaded to scale even further.
Aggregators also scale the instrumentation management tasks because they can accept a
single management directive and distribute it to the collectors they control; in that case, they
are instrumentation managers as well as aggregators. They use heartbeats to check collector
health and to set new monitoring policies as directed.
Aggregators can also provide local correlation and integration of the information from
multiple collectors. This creates higher-quality information for components higher in the
chain.
Processing
Processing involves a range of functions that are packaged in vendor-dependent ways.
Further, these processing functions are widely distributed within the instrumentation
system. For example, the collectors themselves usually test for trip-wire situations. In
addition, they often build baselines and carry out more sophisticated measurements.
Remember this rule of thumb: Functions tend to move toward the information source.
Some functions overlap with event management or with features of some management
tools. Such situations are acceptable because completeness of monitoring coverage is the
goal.
When new information arrives, it may need grooming. Grooming is the process that
simplies the information-handling tasks of the other components. For example, some data
values might need normalization because different collectors use different value ranges. For example, collectors from one vendor might have a range from 1 to 10, and another collector might provide values from 1 to 50 for the same type of information. The data cannot be accurately compared until the ranges are normalized to the same scale; in this case, multiplying the first set of data by 5 provides consistency.
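That normalization step can be sketched directly; the value ranges come from the example above, while the function name is an assumption:

```python
# Sketch of range normalization during grooming: one vendor's collector
# reports values on a 1-10 scale, another on 1-50 for the same metric.
def normalize(value, source_max, target_max=50):
    """Rescale a measurement so both collectors share the same range."""
    return value * (target_max / source_max)

vendor_a = [2, 7, 10]                              # 1-10 range
comparable = [normalize(v, 10) for v in vendor_a]  # now on the 1-50 scale
print(comparable)                                  # -> [10.0, 35.0, 50.0]
```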
Grooming can also include artifact reduction, as discussed in Chapter 5, "Event Management." Some of these functions are packaged differently, depending on particular
vendor packaging choices.
Trip wires require real-time processing of the collected data to test when an alert is sent.
Developers of collectors are applying more sophisticated testing to reduce the alert volume.
For example, the collector might not forward an alert unless a threshold has been exceeded
for three measurements in succession.
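That suppression logic can be sketched as follows; the three-in-a-row policy comes from the example above, while the class name is hypothetical:

```python
# Sketch: a collector that forwards an alert only after a threshold has
# been exceeded for three measurements in succession.
class ConsecutiveTripWire:
    def __init__(self, threshold, required=3):
        self.threshold = threshold
        self.required = required
        self.streak = 0          # consecutive exceedances seen so far

    def observe(self, value):
        """Return True when an alert should be forwarded."""
        if value > self.threshold:
            self.streak += 1
        else:
            self.streak = 0      # a normal reading resets the streak
        return self.streak >= self.required

tw = ConsecutiveTripWire(threshold=5.0)
readings = [6.1, 6.3, 4.0, 6.2, 6.5, 6.8]
print([tw.observe(r) for r in readings])   # alert only on the third in a row
```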
Demarcation Points
Collectors are deployed most effectively by selecting the appropriate demarcation points, usually a boundary between organizations or infrastructures. The enterprise-service provider interface is an example of an organizational demarcation point. Collectors positioned at each demarcation point measure the delay across the provider network as well as within parts of the enterprise network structure. They can then provide end-to-end service quality measurements; additional placements break the measurements into specific domains.
The collectors in Figure 4-5 are placed at demarcation points. Moving from right to left, they separate the total delay into remote server delays, service provider delays, and local delays.

Figure 4-5 Collectors at Demarcation Points (figure: the total round-trip delay decomposed into remote server, service provider, and local delays)
The desktop (or wireless phone or PDA) collector measures the entire round-trip delay for
any transaction initiated from that location. No further measurements are needed unless the
delay exceeds specifications in the SLA.
For situations in which the desktop is beyond the control of the enterprise (for example, a
web site serving the general public), or for situations where a disinterested third party is
needed, measurement services, such as those offered by Keynote Systems, can be used.
The other demarcation points are used to identify the likely cause of the delay so that staff
members are properly assigned without wasting additional time and interrupting other
activities.
As an example of the use of demarcation points, consider that measuring the round-trip
delay between the desktop and the edge of the service provider network isolates the delay
associated with the local infrastructure. Tracking the round trip between the edges of the
service provider network measures the delay introduced by the provider. Finally, measuring
a transaction from the collector closest to the server tracks the server delays.
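With assumed round-trip figures (every number below is illustrative, not from the text), the decomposition just described reduces to simple subtraction:

```python
# Sketch of isolating per-domain delays from round-trip measurements taken
# at the demarcation points of Figure 4-5. All values are assumed, in ms.
total_delay = 480        # measured at the desktop collector
to_provider_edge = 410   # desktop <-> far edge of the provider network
to_local_edge = 130      # desktop <-> near edge of the provider network

local_delay = to_local_edge                     # local infrastructure
provider_delay = to_provider_edge - to_local_edge   # provider network
server_delay = total_delay - to_provider_edge   # remote server side

print(local_delay, provider_delay, server_delay)   # -> 130 280 70
```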
Passive Collection
The most common collectors use passive collection. In other words, they gather only the
information that flows by. For example, a desktop collector tracks user activity as it occurs and keeps a record of specific transactions and their completion times. Passive collectors
can be relatively simple and can consume minimal resources. They use no additional
bandwidth, but they can generate large volumes of data. They are good for detailed data
collection and for reactive management, such as forwarding an alarm when a problem is
detected.
Placing a collector in a desktop is a common form of passive collection and measurement. One of the first to offer desktop instrumentation was VitalSigns, which became the Lucent Technologies VitalSuite product line after acquisition. The collector usually intercepts traffic flowing between the desktop and the network and measures round-trip delay while tracking the applications and subtransactions actually being used. The information usually is stored at the desktop until it is passed to the management system for further analysis and processing. A real-time alert is forwarded whenever the response time exceeds a predefined threshold value.
Active Collection
Active collection, in contrast, uses active agents to generate network and application
activity for management measurement purposes. An active approach is proactive because it
is exercising networks and services and evaluating their behavior rather than waiting for a
passive collector to detect a problem. Periodic active measurements detect problems earlier
than the passive approach. Active measurements are probing behavior even in the middle
of the night; they do not depend on user actions to highlight a problem.
Virtual transaction (or synthetic transaction) is the commonly used term for describing active measurements. Virtual transactions range in complexity, measuring everything from basic connectivity to complete business processes.
Virtual transactions match the actual business processes being measured; thus, the
measurements are viewed with confidence by administrators. Virtual transactions are of limited value if they don't match the actual business processes. Using a simple database query in a virtual transaction doesn't illuminate potential problems when the actual
Checking for correct operation is essential after a virtual transaction extends beyond the simple ping. For example, a web server might return a "page not found" message quickly. Using that measurement to route more traffic to that (apparently) lightly loaded server only compounds the problem. As another example, a virtual transaction for ordering a product must verify that appropriate information is placed correctly in forms, that the credit card authorization worked, and that a confirming message was sent.
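A sketch of such a correctness check follows; the fetch helper, its return shape, and the expected text are hypothetical, illustrating only that a fast wrong answer must be treated as a failure:

```python
# Hypothetical sketch: a virtual transaction must verify correctness, not
# just speed -- a fast "page not found" is still a failure. fetch() is an
# assumed helper returning (status_code, body, elapsed_seconds).
def virtual_transaction(fetch, expected_text, sla_seconds=5.0):
    status, body, elapsed = fetch()
    if status != 200 or expected_text not in body:
        return "content-failure"    # wrong result, however fast it arrived
    if elapsed > sla_seconds:
        return "slow"               # correct result, but out of spec
    return "ok"

fast_404 = lambda: (404, "page not found", 0.2)
print(virtual_transaction(fast_404, "Order confirmed"))   # -> content-failure
```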
Active agents are usually used as a proxy for a set of local desktops. Therefore, they must be carefully placed and configured so that they accurately reflect the user experience. The virtual transactions they use must match the actual transactions of the local desktops, and they must access the same services so that the traffic flows over the same areas of the network. When the Internet is involved and there are thousands of external customers, a measurement service, such as that offered by Keynote Systems, can perform virtual transactions from the same backbones and geographic locations as the customers.
Active agents consume network and application resources. Therefore, they must be
constrained through measurement policies defining the virtual transactions to use and the frequency. Other policy parameters define acceptable values so that trip wires are activated.
Highly dynamic environments that frequently create new transactions or modify current
ones add to the administrative burden. Administrators must develop new virtual
transactions or modify their current set. This entails taking the time to understand
the transactions, modeling the steps, determining successful outcomes, and measuring
parameters.
Hybrid Systems
A combination of active and passive agents offers optimum instrumentation coverage. The
passive agents collect information on the actual transactions and their performance, and the
active agents proactively find problems and build accurate baselines. This maximizes the
information quality while minimizing the resource impacts of virtual transactions.
An instrumentation system for tracking service behaviors can be conceptualized as a new
layer that sits above the element instrumentation. The services layer uses its own
monitoring tools and techniques for measuring and tracking service-level metrics.
Integrating the information from both layers is discussed in Chapter 5.
Instrumentation Trends
Instrumentation for tracking service behaviors is continuing its evolution. There are several
ways of leveraging the instrumentation after the basic system is installed.
Adaptability
Adding adaptable measuring strategies reduces loads and adjusts the granularity as needed.
There are minimal changes while service delivery is operating without problems. However,
as problems arise, the instrumentation can shift into other modes. If an active collector
detects a service outage, it can shorten the interval it uses for virtual transactions to measure
the duration of the outage. After service availability is restored, it uses a longer time
interval.
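A minimal sketch of that adaptive policy, with both interval values as assumptions:

```python
# Sketch of an adaptive measurement policy: shorten the virtual-transaction
# interval during an outage to measure its duration more precisely.
NORMAL_INTERVAL = 300    # seconds between probes in steady state (assumed)
OUTAGE_INTERVAL = 15     # tighter interval while the service is down (assumed)

def next_interval(service_up: bool) -> int:
    """Choose the probe interval based on current service availability."""
    return NORMAL_INTERVAL if service_up else OUTAGE_INTERVAL

print(next_interval(True), next_interval(False))   # -> 300 15
```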
Collaboration
Collectors are beginning to collaborate with each other as well as with external
management applications. Measurements between pairs of collectors help determine
overall compliance and simplify problem isolation (see the sidebar in this chapter for more
information).
Summary
Measuring service quality and determining compliance with SLAs are fundamental goals
of instrumentation. Careful selection of demarcation points places intelligent collectors at
the proper points to gather information on service behavior and quality.
Active techniques offer proactive problem detection and consistent baseline measurements.
They are coupled with widely distributed passive collectors for thorough coverage.
Trip wires and time slices provide real-time notifications and solid data for planning and
provisioning.
Each infrastructure involved in service management has its own specic instrumentation
needs. These are discussed in the chapters covering each infrastructure: Chapter 8, "Managing the Application Infrastructure"; Chapter 9, "Managing the Server Infrastructure"; and Chapter 10, "Managing the Transport Infrastructure."
CHAPTER 5

Event Management
Chapter 4, "Instrumentation," describes how service behaviors are monitored to track
compliance with service level metrics and to identify potential or actual service disruptions.
Event management, which is the topic of this chapter, describes the different steps that
transform a flood of raw alerts into a reduced set of events that require further action from
the management system.
The instrumentation system collects raw data, such as response times or packet loss.
Unfortunately, administrators often find raw data of low value for making sound service management decisions. A response-time measurement might have little meaning without further analysis; for example, is it an isolated incident or part of a growing number of slow
transactions? Although the measurement is compared to thresholds and baselines to
generate an alert, more context is needed to determine how important any measurement
actually is. The event management functions refine raw instrumentation measurements into those that require further attention from the management system. Administrators spend their time on important problems and make better decisions with refined information.
This chapter is a complement to Chapter 4, and both should be considered part of the same
process: helping you turn raw data into usable information and indicating the next steps
when a response to a service disruption is detected and reported.
Note that one of the difficulties in organizing material for this chapter is that vendors offer a range of packaging options. For instance, some of the functions discussed in this chapter, such as artifact reduction, can also be embedded in the intelligent collectors discussed in Chapter 4. In addition, other products are designed specifically for handling events. Thus, be aware that just because event management is discussed separately doesn't mean that it must be packaged as a separate product.
Regardless of the sometimes-vague dividing lines between instrumentation and event
management, the goal of this chapter is to examine the range of event management
functions and their contribution to the process of creating usable, action-oriented
information. Specifically, the focus is on the following:
Simple Network Management Protocol (SNMP) traps, which are sent mainly by
network infrastructure elements, although elements in other infrastructures (such as
the server infrastructure) also use SNMP
The following subsections discuss how alerts are triggered, the need to transport alerts
reliably from their origin to the central event manager, and the need for the event manager
to handle the fact that some alerts are more important than others.
Alert Triggers
Baselines, thresholds, and internal failures are the usual triggers of alerts from element
instrumentation. Threshold alerts can be triggered when a threshold is crossed in either
direction (see Figure 4-2 in Chapter 4). Baselines represent a normal operating range of
measurements. Alerts are triggered when the monitored variable is moving toward the edge
of the envelope or has moved outside that envelope.
Alerts are also generated when there are internal failures, such as with a disk system, an
application, or an interface. An element might be able to report certain failures itself. For
example, a server that is still operating after an application fails can easily report that
failure.
Other failure alerts are indirect, often because of a failure that prevented the element from
reporting its own problems. For example, a central instrumentation management portal
monitors a set of collectors with a heartbeat exchange. If a collector doesn't respond, an
alert is generated, noting that it might have failed.
Internally generated alerts are used to integrate and coordinate event management
operations and to activate other management system components. As seen in Figure 5-1,
the event manager activates other functional areas, such as fault- or performance-management tools. At this point, the event management system has organized an alert stream into a set of actions based on the alerts that are generated.
Figure 5-1 Event Management Functions (figure: raw alerts from instrumentation management, aggregation, and filtering, together with SLA data and internal alerts, pass through artifact reduction, volume reduction, and filtering; compound metrics and correlation; business impact and prioritization; and process activation, generating actions for real-time operations, reporting and billing, and policy-based management)
The alert volume can be substantial; several large organizations that I have spoken with
recently have tens of thousands of element alerts daily, while the services alarms are in the
mid-hundreds. There are usually more element alerts than service alerts because many
element problems do not affect service quality when there is sufficient redundancy.
Automated processing of alerts is needed to identify those requiring immediate action from
the management system. High alert volumes and more complex sorting criteria can
overwhelm human staff.
One common solution to the problem of missing alerts is to have remotely located aggregators that are in proximity to the source of the alerts. If they use a reasonably error-free communications channel to connect to the alert sources, the aggregators will receive almost all alerts correctly.
To push measurement alert information reliably into the enterprise's event manager from the remote aggregators, it is necessary to avoid using unreliable transport. This can be difficult if the alerting system uses industry-standard SNMP traps; normal SNMP uses unreliable transport.
One way of reliably transporting SNMP is illustrated by the web-performance
measurement service, Keynote Systems, which uses industry-standard SNMP traps to push
its measurement alert information into event managers from Tivoli, HP OpenView,
Micromuse Netcool, and other major management systems. Keynote places a small
appliance next to the enterprises event manager, inside the enterprises rewall. That
appliance connects across the Internet to the Keynote system using once-a-minute,
outgoing, secure, reliable connections. Retrieved alerts are then signaled with SNMP traps
from the Keynote appliance to the management system thats only a few feet away; theres
little chance of losing the alert.
Keynote also offers direct plug-in into some event managers, such as Unicenter/TNG; in
those cases, software is installed into the event manager to communicate directly with the
Keynote systems using reliable, secure transport and XML. Either method (local appliance or direct plug-in) can be used to improve the reliability of alert transport.
Alert Management
The alert stream contains information of differing value for managing services. Not every
alert requires further attention. As an example, an alert reporting a slow response time for
a single customer might not indicate a problem by itself. A single slow transaction can be
caused by temporary server congestion, lost packets, or a routing change. No further
attention is needed as long as the percentage of completed transactions with acceptable
response times is very high.
The managed environment is highly dynamic and the instrumentation can create artifacts,
which are false indications of the actual situation (false positives); they need to be
eliminated before a false diagnosis causes further disruptions to staff and operations. In
fact, responding to artifacts wastes staff time because subsequent measurements usually
reveal no problem at all.
The event manager organizes the remainder of the alerts after the artifacts have been
removed from consideration. There are ranges of actions depending on the overall
operational context. For example, a measurement that exceeds a warning threshold requires
different attention than a measurement indicating noncompliance with a Service Level
Agreement (SLA).
Basic Event Management Functions: Reducing the Noise and Boosting the Signal
Alerts also have different business impacts that affect subsequent management decisions.
A disruption that affects revenues and business relationships should draw more attention
than a slight slowing of internal e-mail.
Refer again to Figure 5-1, which illustrates event management functions. Starting at the top
are the event management inputs, which are either internally generated alerts or those from
the service or element instrumentation.
The event management functions are shown within the rectangle in the middle. Functions
such as artifact reduction, filtering, and correlation are applied to any alert.
The event management system identies the events that require further action. The next
step depends on the event. Some events activate a fault management tool while others
launch a performance management tool. Events can also trigger a billing subsystem, page
an administrator, generate a report, or initiate other functions.
Table 5-1 summarizes the value of the basic event management functions: volume reduction, artifact reduction, business impacts, prioritization, activation, and coordination.
Volume Reduction
Simply reducing the alert volume can be very helpful. Hundreds of alerts reporting the same
situation can be generated. However, only a single alert is necessary to note the database
server failure and to start recovery procedures.
There are different methods of reducing the alert volume: roll-up, de-duplication, and
intelligent monitoring.
Roll-Up Method
Hierarchical collector structures reduce alert volumes by rolling them up from one level to
the next. The aggregators described in Chapter 4 are a natural place for implementing this
alert compression. In Figure 5-2, three collectors are using virtual transactions against the
same server. If the server is congested, all collectors forward a "server slow" alert to the
aggregator. The aggregator simply passes a single "server slow" alert downward to the event
manager or another level in the instrumentation hierarchy.
Figure 5-2  Roll-Up: The Aggregator Consolidates Multiple Slow-Server Alerts into a Single Alert for the Event Manager
De-duplication
A failure may generate a multitude of virtually identical alarms and events that can be
consolidated into one alarm by de-duplication. For example, a router failure may spawn a
large number of alarms about dropped connections. De-duplication adds information to a
single event, indicating the number of similar alarms it represents.
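A minimal sketch of de-duplication (the alert fields and names are illustrative, not any product's schema) keys alerts on their identifying fields and tallies the repeats:

```python
from collections import OrderedDict

def deduplicate(alerts):
    """Consolidate alerts that share (source, type) into one record with a tally."""
    merged = OrderedDict()
    for alert in alerts:
        key = (alert["source"], alert["type"])
        if key in merged:
            merged[key]["count"] += 1           # same alarm again: bump the tally
            merged[key]["last_seen"] = alert["time"]
        else:
            merged[key] = {**alert, "count": 1, "last_seen": alert["time"]}
    return list(merged.values())

# Three identical link-down alarms collapse into one event with count 3.
raw = [{"source": "router-7", "type": "link-down", "time": t} for t in (0, 1, 2)]
raw.append({"source": "db-1", "type": "conn-lost", "time": 3})
events = deduplicate(raw)
```

Operators then see two consolidated events instead of four raw alarms.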
Intelligent Monitoring
Adaptive instrumentation (Chapter 4) provides the flexibility for intelligently monitoring
situations, and it thereby reduces alert volumes. Consider the example at the beginning of
this section. Active collectors have reported a database server failure. It may take some time
for the failover procedures to complete and even longer to resolve the problem. If the active
collectors continue monitoring, they only add to network and alert loads without adding any
new information.
Adaptive instrumentation helps the situation by continuing measurements and not
generating any further alerts until the virtual transaction indicates that service is restored.
A different alert then informs the management system that the service is again healthy. The
elapsed time between the failure and restoration alerts measures the outage.
Additional reduction is possible with deeper knowledge of the service topology.
Dependencies can indicate that for some failures, downstream monitoring will not be
productive. For example, service behavior can be monitored in steps or in smaller parts of
the entire transaction. If a step fails, monitoring the steps that follow does not yield any useful
information until the failed step is repaired. The same approach of monitoring, but not
generating, new alerts is used to detect the restoration of the service step.
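This suppress-until-restored behavior can be sketched as a small state machine; the class and alert names here are hypothetical:

```python
class OutageTracker:
    """Keep measuring, but raise only one failure alert and one restoration alert."""
    def __init__(self):
        self.down_since = None

    def observe(self, time, ok):
        if not ok and self.down_since is None:
            self.down_since = time                 # first failure: alert once
            return ("service-down", None)
        if ok and self.down_since is not None:
            outage = time - self.down_since        # failure-to-restoration interval
            self.down_since = None
            return ("service-restored", outage)
        return (None, None)                        # suppress repeat observations

tracker = OutageTracker()
samples = [(0, True), (5, False), (10, False), (15, False), (20, True)]
alerts = [tracker.observe(t, ok) for t, ok in samples]
```

The elapsed time carried by the restoration alert is the outage measurement described above.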
Artifact Reduction
There are techniques to reduce the raw alert volume by eliminating artifacts, which are
measurements that falsely imply an important problem where none actually exists.
Response time, for example, could be slowed while network routers are recalculating
routing options after a failure or topology change. The next transaction has satisfactory
response time after the routing system has stabilized. There is no value in notifying the
transaction manager of this artifact. There is nothing to chase and correct because the
transient behavior of the routing system has ceased.
A transaction could also be lost or timed out while a server in a tier fails and is replaced.
Specific transactions might be lost in the failed server, but after the replacement is
operating, operations resume at satisfactory levels.
A similar situation arises when an occasional lost packet triggers an alert because a
transaction failed or timed out. Further checking, however, usually finds operations
proceeding within the normal range of behaviors.
Large numbers of artifacts can consume large amounts of staff time and divert effort from
other tasks. It is difficult for humans to identify all the artifacts and ignore them when
appropriate to do so.
There are approaches to help reduce the number of artifacts that slip through:
Verification
Filtering
Correlation
Verification
Quickly verifying that an incoming alert is reporting an actual problem is an effective first
step in eliminating artifacts. For example, an active collector can be used to run a
transaction that has been reported as noncompliant. The active measurement establishes
whether the problem persists and is repeatable; if it is, further attention might be warranted.
The initial measurement is treated as an artifact if the test doesn't reveal a problem.
Using a repeat-failures filter for simple thresholds can help discriminate noise from real
failure conditions by requiring that several successive measurements exceed the threshold
before an alert is issued. For instance, you can stipulate that it will take 10 minutes to
forward an alert if the interval between virtual transactions is 5 minutes and the rule is that
two repeated failures are needed. Using criteria for successive measurements frees the
system from responding to a single blip that later cannot be found.
A more proactive form of verification uses active measurements after the initial alert is
received. Verification with an immediate series of virtual transactions clarifies the situation
quickly. Successive failures are detected in less than a minute rather than waiting for 10
minutes to attack the problem.
For example, consider a customer who is verifying the response time of a remotely hosted
service. Suppose an active measurement device periodically initiates a virtual transaction,
perhaps every 10 minutes. If one of these virtual transactions exceeds the specified response
time, the measurement device immediately sends a series of closely spaced virtual
transactions. If those complete successfully, no further action is necessary.
If the problem persists, the customer management system notifies the provider and begins
tracking the provider's response until the problem is resolved and the customer verifies that
acceptable service levels are restored.
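The verify-on-alert step might look like the following sketch, where `run_transaction` stands in for a real virtual-transaction probe:

```python
def verify_alert(run_transaction, threshold, burst=5):
    """On an over-threshold report, immediately re-run the transaction a few times.
    Returns 'artifact' if the burst completes cleanly, else 'confirmed'."""
    failures = sum(1 for _ in range(burst) if run_transaction() > threshold)
    return "confirmed" if failures > 0 else "artifact"

# Simulated probe results: a transient blip that clears, and a persistent slowdown.
healthy = iter([0.8, 0.9, 0.7, 0.8, 0.9])
slow    = iter([2.5, 2.4, 2.6, 2.5, 2.7])

blip_result = verify_alert(lambda: next(healthy), threshold=2.0)
real_result = verify_alert(lambda: next(slow), threshold=2.0)
```

Running the burst from the same collector that raised the alert, as the text recommends, keeps the measurement environment consistent.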
The collector sending the alert should be used for verication whenever possible because
the environment will be more consistent. Using a collector in a different location might
change the results, depending on the location of the problem. On the other hand, using
multiple monitors from multiple locations can provide some diagnostic triangulation;
noting that a problem is detected from one side of the network but not the other can aid in
problem isolation.
Filtering
Filtering is the application of rules to a single alert source over some time interval. Figure
5-3 illustrates the application of rules concerning measurements exceeding a specified
response-time threshold. This is more sophisticated than a check for successive over-threshold
measurements; it is an "X out of Y" process instead. That is, within a set of Y transactions,
any X that are slow constitute an alert. The figure illustrates a "three out of eight" condition:
any three over-threshold measurements out of eight will trigger an alert.
Note that these filtering rules require state to be maintained between measurements. They
should be selectively applied to a small number of sources to avoid loading the event
manager. Using simple counters places less processing demand on the event manager, but
this comes at the expense of being less able to exploit more effective filtering rules.
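An "X out of Y" rule can be sketched with a sliding window; the class name and parameter values are illustrative:

```python
from collections import deque

class XOutOfYFilter:
    """Alert when at least x of the last y measurements exceed the threshold."""
    def __init__(self, x, y, threshold):
        self.x, self.threshold = x, threshold
        self.window = deque(maxlen=y)    # the state kept between measurements

    def observe(self, value):
        self.window.append(value > self.threshold)
        return sum(self.window) >= self.x

# Three slow transactions out of eight trigger the alert on the final observation.
f = XOutOfYFilter(x=3, y=8, threshold=2.0)
readings = [1.0, 2.5, 1.1, 1.0, 2.6, 1.2, 1.1, 2.7]
fired = [f.observe(r) for r in readings]
```

The per-source window is exactly the state the text warns about: each monitored source needs its own filter instance.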
Figure 5-3  Filtering Window: Within Each Window of Eight Transactions, Isolated Over-Threshold Responses Are Suppressed as Artifacts
Correlation
Filtering tracks a single source over a period of time and eliminates artifacts as a result.
Correlation, in contrast, works with a number of alert sources simultaneously (or within
short intervals). As mentioned, some types of failures or disruptions trigger many additional
alerts. Correlation works with this flood of alerts and removes the secondary artifacts,
which are those alerts caused by another problem. For instance, the database server failure
results in many reports of failed transactions. Administrators will waste valuable time
looking at each service with a problem rather than addressing the cause of all the secondary
artifacts.
Correlation is more powerful than filtering because it identifies the most likely cause of a
flurry of alerts. The accuracy speeds problem resolution and reduces staff disruption.
Correlation is also more complicated than ltering because it deals with multiple,
independent alert sources. Correlation depends on understanding the relationships among
various service elements. Essentially, it is the rule of cause and effect. (If you cannot reach
a router, you cannot reach the networks that are connected to it, for example.)
Building the appropriate information for a correlation engine is a challenge. Early
correlation engines, such as the Tivoli Enterprise Console and the Veritas NerveCenter,
were powerful, but they often became shelfware (software that wasn't used in production).
Business Impacts
Conversely, if an element fails, an administrator needs to know which services are affected
and what the business ramifications of those services are. A failure affecting a critical
business service receives more attention than one that interrupts internal data backup.
Service managers can also use element instrumentation in a different way. Elements
associated with key services can send alerts to the service manager. These are informational
because the service manager is not usually responsible for responding to element problems.
The service manager is informed that changes are occurring even if no disruptions are
threatened for key services. Several element failures affecting the service would be another
early warning mechanism.
Understanding the business impacts of any alert enables administrators to truly understand
what is important to the business and make better decisions.
The subsequent discussion is further divided into subsections, beginning with modeling a service.
Modeling a Service
Modeling is the most effective way of associating a service with the elements supporting
it. Models use the power inherent in object-based descriptions and tools (see the
accompanying sidebar for more information).
Building a service model is fairly straightforward if the proper ingredients are available.
One of the most important is an object library that has the templates for all the common
components, whether physical or logical. Objects representing physical components, such
as servers, must be readily available and easily customized to represent specific instances
of any physical entity. Objects also represent logical entities, such as an application or an
external service. Some of these will be common, and others will be specific to each
organization.
After the objects are defined, they must be related by setting the appropriate attributes in
each object that define dependencies on other objects. Thus, an application object is related
to the server where it executes. The application is also related to other functional objects,
such as a database system, a content delivery network, or an external search engine.
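A minimal sketch of such a model, with illustrative object names and a naive dependency walk (real products use much richer object libraries):

```python
class ManagedObject:
    """Template-style object: a physical or logical component with dependencies."""
    def __init__(self, name, depends_on=()):
        self.name = name
        self.depends_on = list(depends_on)   # attributes relating this object to others

def affected_services(services, failed):
    """Walk each service's dependency tree to see whether it rests on the failed element."""
    def reaches(obj):
        return obj.name == failed or any(reaches(d) for d in obj.depends_on)
    return [s.name for s in services if reaches(s)]

# An application depends on a database, which depends on a server.
server   = ManagedObject("server-12")
database = ManagedObject("order-db", depends_on=[server])
app      = ManagedObject("order-entry", depends_on=[database])
mail     = ManagedObject("internal-mail", depends_on=[ManagedObject("server-9")])

impacted = affected_services([app, mail], failed="server-12")
```

Given a failed element, the same relationships answer both questions in the text: which services are affected, and which elements a service depends on.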
Prioritization
This is the stage at which an alert has been identified as an event requiring some response
from the management system, such as notifying staff, sending reports to the appropriate
staff, assigning staff to the event, or immediately activating automated procedures.
Not all events are of equal importance, however, and management teams must keep the
business-critical services running smoothly. A report that online customers are abandoning
their shopping carts in droves will be of immediate concern to a business manager; a report
that an occasional catalog lookup is slow need not receive the same attention.
Prioritizing correctly requires collaboration between the management staff and its service
customers (directly or indirectly). It is the customers who must determine and communicate
the relative priority of their services to their providers (internal or external). Only when they
have this clear indication of business priorities can providers assign the appropriate
priorities to the associated alerts and events.
Providers must also assign other priorities to help them with their operations. The text has
already mentioned that customers might pay higher premiums or have stricter
noncompliance penalties. These concerns must also be incorporated into priority
assessments.
The event manager uses these assigned priorities to organize the event stream and guide
responses more effectively. Staff members are directed to address their attention to the most
critical events.
The event manager also needs an aging mechanism so that low-priority events receive
attention within a specified time frame rather than being completely starved out by
higher-priority events. The aging mechanism automatically increases an event's priority if the
event hasn't received attention within that time frame.
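A simple aging rule might be sketched as follows, assuming numeric priorities where 1 is most urgent (the scale and escalation interval are illustrative):

```python
def age_priority(priority, waited, escalate_after, ceiling=1):
    """Raise an event's priority one level (1 = most urgent) for each interval
    it has gone unattended, so low-priority events are not starved out."""
    bumps = waited // escalate_after
    return max(ceiling, priority - bumps)

# A priority-4 event escalates twice after waiting through two 30-minute intervals.
p_fresh = age_priority(4, waited=10, escalate_after=30)
p_stale = age_priority(4, waited=65, escalate_after=30)
```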
Prioritized events are usually placed in their appropriate queue, minimally offering a
severe, moderate, or warning priority level. Some products offer much more granularity
with more priority levels to assign. Multiple thresholds can be used to trigger different
responses depending on the severity of the alert.
The event monitor interface is also a means for tracking workflow. Details for each event,
such as when the event was received and its current status, are available. As events are
cleared, they are appropriately marked, logged, and removed from the active queues.
Events must be organized in a variety of formats to meet different management needs. An
overall display might show all outstanding events by priority class. Other displays are
needed to show the affected customers, the specic SLAs, and the penalties that apply. Staff
can also modify the events, changing priorities or clearing them from the console.
Activation
Any event activates one or more management tools. The time constraints imposed by SLAs
mandate automatic and rapid responses to problems while the management staff is being
notified.
Registration is the process of linking events and management tools. Management tools are
activated by the event manager when any events for which they have registered are detected.
Specific tools register with the event manager for types and classes of events. A Cisco
Systems device manager would register to receive any events generated by specic Cisco
elements, for example.
Registration is usually accomplished with an application program interface (API) for the
event manager. Most products use a publish/subscribe approach, where a management tool
subscribes to certain events. The event manager publishes events, which activate the
subscribers. Multiple tools can also be activated by a single event.
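A publish/subscribe registration scheme can be sketched in a few lines; the event-class names here are hypothetical:

```python
from collections import defaultdict

class EventManager:
    """Publish/subscribe sketch: tools register for event classes; publishing
    an event activates every subscriber registered for that class."""
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, event_class, tool):
        self.subscribers[event_class].append(tool)

    def publish(self, event_class, event):
        # Multiple tools can be activated by a single event.
        return [tool(event) for tool in self.subscribers[event_class]]

mgr = EventManager()
mgr.subscribe("cisco.element", lambda e: f"device-manager handling {e}")
mgr.subscribe("cisco.element", lambda e: f"ticketing logging {e}")

activated = mgr.publish("cisco.element", "router-7 link-down")
```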
The event manager currently uses local server functions to activate the specied
management tool. In the future, XML documents will activate remote management tools.
Coordination
Event management can help integrate the services and technology management areas as
well as integrate management tools into processes. It is a natural place for integration
because element and services instrumentation are already converging there.
One key factor in a bigger role for event management is the use of internally generated
alerts. Figure 5-4 offers an example of event management as an integrating factor. A server
failure alert is generated (1 in the figure) and leads to the activation (2) of a server
management tool. The server manager performs the detailed problem analysis and
determines that a hardware failure has occurred and the server is not operational. The server
manager then creates an internally generated alert (3), which comes from the management
tool, not the managed environment. The event manager then sends another alert that
activates a tool that determines the impact on services (4).
Figure 5-4  Event Management as an Integrating Point: (1) Server Failure Alert; (2) Event Manager Activates the Server Element Manager; (3) Internal Alert; (4) Event Manager Activates the Service Manager; (5) Internal Alert
The impact assessment tool determines if the server failure is having an impact on service
quality, such as congesting the remaining servers and creating unacceptable response times.
If that is the case, it sends another internally generated alert (5 in Figure 5-4) that activates
provisioning tools and traffic redirection tools and that notifies the staff of a serious threat
to service-level compliance.
Incorporating internal alerts from the management system adds more value because a single
point receives all alerts and can place them in the proper context. One example of this
function would be a performance manager sending an alert if the pool of stand-by servers
falls below a dened threshold number. The management staff then has this information and
can prioritize it against other events to allocate efforts as effectively as possible.
Event management has a range of functions for sifting through an alert stream and picking
out the alerts that need immediate attention. Some products have a
full set of these functions, while other products use a more limited set. Still others
distribute these functions and couple them more tightly to instrumentation.
A set of Netcool Probes and Monitors uses passive and active techniques for collecting
operational information and feeds the data to the ObjectServer. The Probes are passive collectors,
implemented as software modules placed in monitored elements. Monitors are separate
systems carrying out active measurements. The set of Probes and Monitors covers a wide
variety of equipment, services, and transactions.
The Probes and Monitors also provide local processing to reduce the loads on the network
and ObjectServer. Local processing enables sophisticated filtering of the alarm streams, and
it helps the solution scale for large, managed environments.
The active monitors maintain complex rules that can calculate expressions with multiple
alarm sources. The value of the expression determines if an event is triggered. Both types
of collectors track cumulative behavior, such as the number of slow transactions over the
preceding two hours.
Instrumenting across the service infrastructures gives Netcool a definite advantage
compared with many event management systems that are focused on network elements and
alarms. In contrast, Netcool uses complex expressions in the monitors that are based on
sources in different infrastructures. Administrators can zero in on service behaviors across
the infrastructures.
A publish/subscribe interface is used to associate management tools and events. Any tool
registers for one or more events; when these events occur, the appropriate tool is launched
and responds as needed.
Administrators access the event management system from anywhere on the Internet. They
can navigate quickly through different views. Administrators can view event status, clear
events, change their priority, and generate reports on event-management activity.
Event Management
Micromuse offers a set of capabilities for reducing alarm volumes and delivering
action-oriented information to the management tools. Some of these functions include
de-duplication, normalization, automatic suppression of transient conditions, and presentation
from a service perspective.
De-duplication consolidates related alarms and thereby provides a clearer picture for the
staff. Instead of dozens of individual alert messages, the operators see a single message
with associated information about the number of underlying alarms.
Normalization is a means of aligning the values from many different sources so that data
can be compared accurately. As a case in point, consider the situation in which one collector
measures utilization as an integer value representing the percentage, and another expresses
utilization as a fraction. A management tool would not necessarily know that 0.9 is actually
larger than 80. Normalization converts incoming information into consistent formats and
data ranges. Any tool that subscribes to the ObjectServer can use the normalized data.
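A sketch of this kind of normalization for the utilization example; the heuristic for telling fractions from percentages is an assumption, not Micromuse's actual rule:

```python
def normalize_utilization(value):
    """Coerce utilization readings to one scale (0-100 percent), whether the
    collector reported a fraction or an integer percentage.
    Assumption: any reading between 0 and 1 is a fraction."""
    return value * 100 if 0 <= value <= 1 else float(value)

# One collector reports a fraction, another an integer percentage.
readings = [0.9, 80]
normalized = [normalize_utilization(v) for v in readings]
```

After normalization, the 0.9 reading (90 percent) correctly compares as larger than 80.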
Transient conditions are a burden for the management team. An arriving alert can consume
staff time and effort to set up new measurements, select troubleshooting tools, and launch a
trouble ticket. By then, the transient may have already died down, leaving the staff with nothing
out of the ordinary to measure and analyze. The Netcool auto-clear function tracks these
transients and removes them from the active list when they disappear by themselves.
The Netcool/Impact management tool adds the intelligence to transform element
information into service-centric perspectives. Impact enables the staff to build service
models and associate them with actual elements. Whenever an element fails, Impact
determines which customers and services are affected. It uses other information to assign
the event the proper priority so that the staff can allocate its workload accordingly. Finally,
Impact provides the problem resolution policy associated with the failed element.
Micromuse is adding more tools and integrating partner products into the Netcool suite as
well. Its basic architecture has been copied extensively.
Summary
Event management takes in a large volume of alerts of varying value and produces a smaller
volume of events that require further attention. It reduces the volume using de-duplication,
roll-up, artifact elimination, filtering, and correlation to remove artifacts and identify the
alarms that matter for maintaining service quality.
After events are identified, they are prioritized to provide additional guidance that keeps the
staff focused on the most pressing technical or business problems. The last step is using the
event manager to activate the management tools that have registered for specific events, thus
completing the transformation of raw instrumentation into actions to be taken.
Event management will become a key integration point as internally generated alerts are
used to activate other tools and to manage process steps. The creation of sequenced tool
operations enables organizations to build more sophisticated automated management
processes.
CHAPTER 6
Real-Time Operations
The effectiveness and accuracy of real-time operations directly affect compliance with Service
Level Agreements (SLAs). The time taken to detect a problem, determine its cause, and
take corrective action is the time during which service quality is at risk and SLA violations
may occur.
The demand for higher service quality shrinks the time allowance for responding to the
actual variations in service behavior. Every SLA has time-based metrics. For example,
availability metrics are all about time: the available uptime each month; the total of outages
(downtime); the time between outages; and the duration of each outage. Transaction
completion times are another example of a time metric. The emerging metrics for
measuring provider Quality of Service (QoS), such as service activation time or responses
to trouble tickets, are also time-based.
Real-time operations management deals with time-sensitive tasks
such as monitoring, analyzing, and responding to potential service disruptions. Real-time
operations management tools are a core part of most commercial system management
consoles.
To illustrate real-time operations management, consider Figure 6-1. As shown, the real-time
operations manager receives alert input from the real-time event management module
and time-sliced measurement input from the SLA statistics module. The real-time
operations manager typically contains the sophisticated analysis tools that evaluate the
incoming alerts and SLA measurements, identifying problems and proposing solutions.
NOTE
Time-sliced, periodic measurements are used for managing and reporting on service
quality. The real-time operations manager must continuously update its assessment of
system behavior and take further actions as needed. It may generate internal alerts if it
detects that the time-sliced measurements are straying over predetermined thresholds.
Figure 6-1  The Real-Time Operations Manager and Its Inputs
Automated responses can be directly activated by alerts or after other functions, such as
root-cause analysis, have performed their tasks. The use of automated responses assists in
decreasing the time to handle a situation and the potential for errors and misjudgments.
Many routine issues can be handled by automation; where issues cannot be mitigated
automatically, automated analysis, even if only partial, can assist the human
troubleshooters.
NOTE
The function of real-time operations management is to help staff reduce Mean Time To
Repair (MTTR) when incidents occur and to increase the Mean Time Between Failures
(MTBF) whenever possible through proactive prediction of difficulties. There are three
basic methods discussed in this chapter to achieve those goals:
Reactive management
Proactive management
Automated responses
These methods are discussed in order in this chapter, followed by illustrative descriptions
of some major commercial real-time operations managers, including response managers
for denial-of-service attacks.
Reactive Management
Reactive management will always be needed because, simply, failure happens. Devices
unexpectedly fail, changes turn out to have unintended consequences, backhoes cut fiber,
or entire electrical grids go down. Reactive management is the most demanding from a time
perspective because administrators have no prior warning and still must assemble their
resources and attack the problem as best they can.
Components of reaction time include the following:
Problem detection and verification, initiated by the instrumentation and refined by the
management tools
Problem isolation, consisting of further analysis to identify and isolate the cause of
any (potential) service disruption
Problem resolution, in which steps are taken to resolve the problem and restore
service levels, if necessary
Most of the time involved in resolving a problem is usually spent in the problem isolation
phase, attempting to determine what is actually causing the problem. Increasingly complex
service environments add to the challenge because even the simplest delivery chains span
multiple elements and organizations. The instrumentation quickly detects threshold
violations, baseline drifts, and other warning conditions.
For an organization that's well prepared for reactive real-time management, many actions
to resolve a problem (such as bringing an additional server into the mix, selecting an
alternate network route, switching to another service provider, or redirecting trafc to a
lightly loaded data center) can be completed quickly. However, even for the most agile
organizations, the maximum leverage in reducing resolution time is in reducing problem
isolation time with speed and accuracy improvements.
Accelerating problem detection and verification, thereby increasing the speed with which
validated alarms can be generated, buys time for the problem isolation process. Moreover,
faster analysis means there is less lead time needed between the arrival of a warning and
Triage
Triage is the process of determining which part of the service delivery chain is the most
likely source of a potential disruption. First, it's important to understand what triage does
not do. It isn't diagnostic; it isn't focused on determining the precise technical explanation
for a problem. Instead, it's a technique for very quickly identifying the organizational group
or set of subsystems that's probably responsible for the problem.
Triage thereby saves problem isolation time in two ways. First, it ensures that the
best-qualified group is identified and set to work on the problem as quickly as possible; second,
it decreases finger-pointing time.
Identifying the best-qualified group to deal with the problem means finding those who are
most likely to have the specialized tools and knowledge that can be used to solve the
problem more quickly than if it were left with a generalist group.
Equally important, triage techniques are focused on drastically decreasing finger-pointing
time, during which various groups try to avoid taking responsibility for a problem. It does
that by presenting the responsible group with data that's sufficiently detailed and credible
to convince them that it's truly their problem.
An example of a triage technique should clarify the difference between triage and detailed
diagnosis. In this approach, called the "white box" technique, a simple web server (the
"white box") is installed, as shown in Figure 6-2, at the point where the enterprise web
server systems connect to the Internet infrastructure. (The web server could be extremely
inexpensive; it could be just an old PC running a flavor of Unix and the Apache web server
system without any configuration, serving the default Apache home page or some other
simple content.)
Figure 6-2  White-Box Triage: An Unloaded Web Server at the Demarcation Between the Network Group's Responsibility (Router) and the Server Group's Responsibility (Load Distributor, Production Web Servers, Server Farm)
The white box web server in Figure 6-2 is located at the demarcation point between two
different organizational groups: the group responsible for the web server systems and the
group responsible for Internet connectivity. Active measurement instrumentation is located
outside the enterprise server room at the opposite end of the network. It is at end user
locations, and it measures both the enterprises web pages and the web page on the white
box. Because no end user knows about the existence of the white box, the white box has
almost no workload; it is used only by the measurement agents.
Figure 6-3 shows an example of some response time measurements from the system
diagramed in Figure 6-2. It's easy to see that when the event occurred, the unloaded white
box server was unaffected. The chart can be created in a few seconds, and it is sufficient to
convince the server group that it's almost certainly their responsibility. The root-cause
reason for the problem is unknown; the chart is not diagnostic. However, the responsible
group has almost certainly been correctly identified within a few seconds, and
finger-pointing time has been cut to zero. The server group can then use their root-cause
analysis tool or other specialized tools and knowledge to study the problem further.
Figure 6-3  Triage Example Measurements
Triage points can be established at many boundaries within a system, and different
techniques can be used to establish those boundaries. Triage points can be placed at the
demarcations between network and server groups, as shown in Figure 6-2, and they can also
be placed just outside a firewall, at a load-distribution device, and at a specialized subgroup
of web servers.
White boxes can be measured to create easy-to-understand differential measurements, but
they're not always necessary. For example, consider an organization that measures the
response time of a configuration screen on a load-distribution device to see if there are any
problems up to that point. Triage can also be performed by placing active measurement
instrumentation at demarcation points, such as just outside a major customer's firewall, to
see if response time from that point is acceptable.
Finally, detailed measurements can themselves be used for triage, although more technical
knowledge is usually necessary. For example, an external agent can measure the time
needed to establish a connection between itself and a file server, followed immediately by
a measurement of the time needed to download a file from that server. It can probably be
assumed that if file download time increases greatly without any corresponding increase in
connection time, then there's a problem with the server, not with the network. (This is
further discussed in Chapter 10, "Managing the Transport Infrastructure.")
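That reasoning can be sketched as a simple differential check; the two-times-baseline degradation rule is an assumed threshold, not a standard:

```python
def triage(connect_time, download_time, baseline_connect, baseline_download):
    """Crude demarcation logic: a big jump in download time without a matching
    jump in connection time points at the server, not the network."""
    net_degraded = connect_time > 2 * baseline_connect
    srv_degraded = download_time > 2 * baseline_download
    if srv_degraded and not net_degraded:
        return "server group"
    if net_degraded:
        return "network group"
    return "no significant degradation"

# Connection setup is normal, but the file download has blown past its baseline.
verdict = triage(connect_time=0.05, download_time=9.0,
                 baseline_connect=0.04, baseline_download=1.5)
```

As with the white box, the output is not a diagnosis; it only identifies which group should investigate.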
Such triage techniques are very useful in the heterogeneous, fluid world of web systems.
Triage requires much less detailed knowledge of the internals of the various subsystems than does
root-cause analysis. This is a great virtue when things change frequently and the internals
of some systems are hidden. It also cuts time from the most time-intensive part of system
management. However, it can be difficult to use it for complete diagnostic analysis within
a complex system; too many triage points, or demarcation points, are needed. For complex
systems, true root-cause analysis tools are a necessary complement.
Root-Cause Analysis
Root-cause analysis tools can require considerable investment and configuration, but they
can be surprisingly powerful and beneficial. They use a variety of approaches to organize
and sift through inputs from many sources. These sources include raw and processed real-time instrumentation (trip-wires), historical (time-sliced) data, topologies, and policy
information. They produce a likely cause more quickly and more accurately than staff-intensive analysis. Analysis tools are activated in a fraction of a second after an alert is
generated and are already collecting data much faster than staff could respond to a pager or
to e-mail. Because conditions can change quickly, and critical diagnostic evidence may not
be preserved, compression of activation time is paid back with more effective analysis.
Root-cause analysis tools can be targeted at elements, at services, or at both. The earliest
root-cause tools focused on a single infrastructure, usually the network; newer products are
focusing on service performance spanning many infrastructures.
A difficult case arises when all the infrastructures are behaving within their normal
operating envelopes. This is an opportunity for automated tools to collect as much
information as possible for a staff member to use. The information might not be conclusive,
but it can guide the staff member's next steps in an effective way.
Assembled information can include the following:
For instance, an end-to-end response problem could automatically result in the comparison
of other infrastructure measurements to their historical precedents and could also result in
the automatic initiation of new infrastructure measurements. Those automated
investigations could fail to find any performance that exceeds thresholds. However,
learning that the transport infrastructure delay has suddenly increased fivefold while all the
other infrastructures are operating within their normal envelopes would indicate the most
likely area for further investigation.
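Comparing each infrastructure's current measurement against its historical precedent can be automated with a simple ratio test. This is a hypothetical sketch; the names and units are assumptions:

```python
def most_suspect_infrastructure(current, historical_mean):
    """Rank infrastructures by departure from their historical norms.

    current and historical_mean map an infrastructure name to a delay
    in milliseconds; the largest relative increase marks the most
    likely area for further investigation.
    """
    ratios = {name: current[name] / historical_mean[name]
              for name in current}
    worst = max(ratios, key=ratios.get)
    return worst, ratios[worst]
```

A fivefold transport delay against otherwise normal infrastructures would surface immediately as the top-ranked candidate.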
Linking service root-cause analysis to element root-cause analysis adds leverage to
accelerate the resolution process at both levels. Passing information and control between
the two domains speeds operations and keeps both teams informed and effective.
An alert is forwarded to the alarm manager, which in turn activates element tools and
noties staff. Information is also passed to help the troubleshooting process at this time. It
includes the following:
- Indications of changes in the actual site used (Domain Name System [DNS] redirection)
The troubleshooters already have this information as they start to narrow down the cause,
identifying which parts of the services are behaving well and which parts require further
investigation.
Today's technology still leaves some manual steps in the hand-off, such as transferring
information to the element root-cause tools. This is the step where time is lost and errors
can be introduced. In the future, automation of isolation functions (discussed later in this
chapter) might be used to simplify and accelerate the process. For example, an automatic
script could exercise the route used by agents running the measurement transactions,
probing all the devices on the path and looking for any exceptional status or operating loads.
It can have this information ready for a staff member or for another tool that can then
investigate further.
The flow must be bidirectional. Element instrumentation may detect an element failure
first. Redundancy keeps operations flowing while actions are taken to address the failure.
The primary consideration is the services impacted by an element failure. When a service
has been impacted, the management system might respond by monitoring more closely and
setting thresholds for more sensitivity. Being able to understand the relationship between
elements and the services depending on them allows administrators to prioritize tasks and
ensure that critical services have the highest degree of redundancy.
Tools in the services domain may notice unwanted trends and correlate them with the
element failure; sometimes a simple time correlation between the failure and the detection
of the shift is all that is necessary. In the best case, both domains have information and can
communicate effectively as they watch for and resolve developing problems. (This is where
a lot of the new investment in management products is going: building management
systems that can correlate symptoms from disparate elements and understand the impact on
the multiple services and customers while helping operations staff prioritize and fix the
problems. The InCharge system from System Management Arts, Inc., is an example of this
trend.)
Complicating Factors
Brownouts and virtualized resources make the tasks of triage and root-cause analysis more
difficult. These are discussed in the following subsections.
Brownouts
A brownout can be a difficult challenge to diagnose because all the elements are still
operable, but performance suffers nonetheless. In contrast, hard (complete) failures are
easier to resolve because a hard failure is a binary value: something works or it doesn't.
There are tests that verify a failure and help identify the source of a problem.
It is harder to identify the likely cause of a brownout because there is no definite service
failure that lends certainty to the search. Degrading performance can be caused by any of
the following: a configuration error, high loads, or an underlying element failure that
increases congestion in another part of the environment. Redundancy further complicates
isolation of the cause of the brownout because underlying element failures may be hidden
from service measurements by element redundancy.
The steps described for basic root-cause analysis still apply in brownout failures.
Troubleshooters need all the information and context that they can assemble. Historical
comparisons, indications of recent changes, and other data can help them understand the
situation more clearly. Some patterns, such as a fixed percentage of all web requests taking
an abnormally long time, strongly suggest the probable causes, especially if the same
percentage of web servers has recently been upgraded to new software. Sophisticated root-cause analysis tools can learn to look for these patterns and thereby help diagnose brownout
failures.
Virtualized Resources
Another complicating factor is introduced by the common system architecture of
virtualizing resources, in which an entire set of similar resources appears to the end user as
a single, virtual resource. Virtualization simplifies many tasks for the end user and the
application developer; it's most common in storage systems, where rather than identify
physical sectors on individual disks, storage software virtualizes the storage resources as
volumes and file systems. In the webbed services customer's case, for example, geographic
load distribution makes a set of distributed sites available with the same name. The
geographic load balancer selects the site, and the end user automatically connects to the
closest site without having to know the details.
NOTE
Application developers use object brokers to hide the details of locating and transforming
the objects that an application accesses.
Load balancing switches are another means of virtualizing; they hide a tier of servers
behind the switch. Requests are directed to the switch, which in turn allocates them to any
member of the set. Firewalls and hidden networks using Network Address Translation
(NAT) technology also create virtualization.
Unfortunately, from a root-cause perspective, virtualization obscures important details. A
synthetic measurement transaction might detect a performance shift because the
geographic load distributor has selected a different site with different transport and
transaction delays. Understanding that distinction helps the troubleshooting team save time.
They may determine no further actions are needed until the usual site is restored to service.
In fact, the redirection may have behaved entirely as expected, with service measurements
verifying the resilience of the environment. It might also lead to further investigation
because service levels must still be maintained even when these actions are taken.
To handle the complications of virtual resources, management tools must be able to
distinguish among the various hidden resources or, at least, must be able to suggest that the
problem lies somewhere within the virtual group. Instrumentation within the virtual group
can take measurements without having the individual group members' identities obscured
by the virtualization process.
As suggested before, failure patterns can suggest the cause of the problem, even if the
virtualization layer cannot be penetrated. In addition, some IT organizations create special,
secret addresses for servers within a virtual group so that they can be measured externally
without revealing those addresses to the general end-user base, as in the white box triage
technique previously discussed.
Proactive Management
Because reactive management is challenging at best, administrators are attempting to be
more proactive: identifying indications of potential service disruptions early enough to
avert them entirely or to minimize their impact. Proactive management is highly desirable,
so many products claim to deliver it. Reduced downtime is one yardstick for measuring
these claims.
Baseline Monitoring
Baselines are continuously calculated from regular samples of behavioral variables. A
baseline is usually the average or median of the load, response time, or another monitored
attribute. A range around the baseline defines the normal activity envelope, which is the normal
range of the monitored variable through time. The actual behavior is compared with this
envelope; if the measured value lies within the envelope, the behavior is within its
normal range.
The trend of the measured value is also important because you want to know if behavior is
likely to stay within the expected envelope. If the behavior is trending toward the edge of
the envelope, thats more of a concern than if the trend is moving deeper within the
envelope.
The baseline is an effective early warning mechanism; however, the warning usually lacks
the specific information needed to take specific steps.
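The envelope and trend checks can be sketched as follows. The two-standard-deviation band and the particular trend test are illustrative choices, not something the text prescribes:

```python
from statistics import mean, stdev

def envelope(samples, k=2.0):
    """Normal activity envelope: mean +/- k standard deviations."""
    m, s = mean(samples), stdev(samples)
    return m - k * s, m + k * s

def trending_toward_edge(recent, lo, hi):
    """True when recent samples drift toward an edge of the envelope.

    Drift toward the edge is more of a concern than movement deeper
    inside the envelope, even while every sample is still in range.
    """
    center = (lo + hi) / 2.0
    return abs(recent[-1] - center) > abs(recent[0] - center)
```

A sample inside the envelope but trending outward is exactly the early-warning case the baseline is meant to surface.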
- Identify potential problems with sufficient accuracy that administrators will heed the warnings and take action
Automated Responses
Automated responses are another key real-time function; they are activated after an alarm
is detected or after another tool, such as root-cause analysis, has done its task.
Automation was initially introduced to reduce staff effort and errors by automatically
initiating corrective actions or collecting information for further staff attention. Taking over
repetitive tasks, such as making regularly scheduled measurements or setting
configurations for groups of elements, saves substantial labor and reduces the errors and
inconsistencies that occur with manual input.
Automation also speeds up processes because they do not demand staff attention; they are
activated as needed without waiting for permission. Speeding up processes is always
valuable, but you reach a point where more speed may not give the leverage you seek.
The new challenge is not just speeding up a simple task and continuously shrinking the
window; it is about using the same time window for making more complex, intelligent
decisions.
A Case Study
To better understand how automated responses work, consider the set of actions needed
after a root-cause analysis has identified a failed element. The management team is more
effective when it is addressing the most critical problems and keeping business processes
functioning. Any task, such as addressing a failed element, must be prioritized against other
tasks demanding staff time and attention.
The impact of element failure must be assessed in real time to make the best decisions (by
management tools or by staff). In the example in Figure 6-4, you can see that distinct steps
are involved. Each step is discussed in the following subsections.
Figure 6-4 (elements shown: edge routers, firewalls, premises routers, LAN switches, web servers)
Redundancy is temporarily disabled in this case because each router has only one
connection left to other parts of the physical infrastructure.
The topology information is supplied by the enterprise management platform. The
application uses the published schema and application program interface (API) to collect
the topology information it needs. Note that future plans could include conversion to the
Common Information Model (CIM) specied by the Distributed Management Task Force
(DMTF).
system to direct traffic away from the site if there is less than a predefined amount.
3 Check inventory for a replacement firewall or a computer system that can be loaded
with the software, in the event that the firewall cannot be repaired in place in a timely
fashion.
4 Assess the relative priority of the task and place it in the workflow system.
Step 5: Reporting
Real-time reports are generated for browser access by members of the operations team and
the group responsible for configuring the automated systems. They include the following:
ProactiveNet
ProactiveNet was an early player in the active monitoring and management of complex
e-business infrastructures. ProactiveNet bases its approach on statistical quality control
principles. It uses sampling and analysis to track behavioral shifts and identify root causes.
Sampling is more efficient than measuring everything all of the time. The key is selecting
the variables to sample; they should be ones whose changes are the most influential.
(Netuitive, discussed later, uses a similar approach with its strongly correlated variables.)
The sampling interval is a basic parameter that determines the granularity: every five
minutes, for example. The trade-off between granularity and volume is a major issue to
decide. Frequent samples will detect smaller shifts in behavior, but at the expense of
generating huge amounts of data to store, manage, and protect, on top of the additional
sampling traffic.
Sampling of service behavior establishes the operational envelope: the average, maximum,
and minimum values of a behavior (response time, utilization, or help desk calls, for
example) over a period of time. These baselines represent the ranges of normal behavior.
ProactiveNet builds intelligent thresholds from its baselines. It determines a practical
threshold value after the baseline is created and the maximum and minimum ranges are
determined. Thresholds are adjusted as the baseline changes, always providing an accurate
warning at any time without manual staff adjustments. Adjusting thresholds on the fly is a
critical feature because it accounts for the full range of motion in the environment and
reduces the likelihood of both false positives (reporting problems that don't exist) and false
negatives (failing to report real problems).
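ProactiveNet's threshold algorithm is proprietary, but the idea of a threshold that tracks a moving baseline can be illustrated with exponential smoothing. Every name and parameter value below is an assumption for illustration, not the product's method:

```python
class AdaptiveThreshold:
    """A threshold that follows a moving baseline.

    alpha controls how quickly the baseline follows new samples;
    k sets how far above the baseline a sample must be to alarm.
    """
    def __init__(self, initial, alpha=0.1, k=1.5):
        self.baseline = float(initial)
        self.alpha = alpha
        self.k = k

    def update(self, sample):
        # Test against the current threshold before adapting to the sample.
        alarm = sample > self.k * self.baseline
        # Move the baseline so the threshold tracks the environment,
        # reducing both false positives and false negatives over time.
        self.baseline += self.alpha * (sample - self.baseline)
        return alarm
```

A fixed threshold would either alarm constantly as load grows or miss real problems at low load; letting the baseline move sidesteps both failure modes.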
A large variety of other management tools leverage the collected data. They determine
thresholds, test for SLA compliance, isolate failures, or produce business metrics, for
example.
ProactiveNet monitors a variety of devices, applications, servers, and other infrastructure
components. The baselines and intelligent thresholds provide the warning and identify the
cause if resource usage shifts toward a potential service disruption.
These techniques are very powerful for dealing with single elements. However, web-based
services are composed of a highly interrelated set of elements distributed across multiple
organizations. More information is needed for tracking down service disruptions in a
complex infrastructure. ProactiveNet therefore provides a pre-built set of dependency
relationships for most common applications so that customers are spared the effort of
building them. This feature alone makes a significant contribution to reducing deployment
cycle time with ProactiveNet.
When searching for the root cause of a problem, ProactiveNet uses a sequential filtering
approach, progressively eliminating elements as the root cause until only those likely to be
a cause of the disruption remain. As shown in Figure 6-5, each filtering step removes a
portion of the remaining candidates. The steps that are taken are initiated by an alarm
reporting degraded performance.
The first filter discriminates between normal and abnormal behaviors. Processing is very
efficient because ProactiveNet has already established the adaptive resource baselines.
Only abnormal behaviors are selected, resulting in a significant reduction in candidates, on
the order of 200:1.
The second filter applies time-based correlation to the remaining candidates. The premise
at this stage is that simultaneous, unrelated baseline deviations are improbable.
Time correlation associates a set of abnormal baselines with a single cause, as yet
undetermined.
Figure 6-5 Isolation Filter (uses known system dependencies to pass only relevant measurements, approximately 1 in 50)
System dependencies are the focus of the third filtering stage. ProactiveNet uses the
relationship information stored for common transactions to further isolate the root cause.
The dependencies point back to the root cause because the transaction depends on these
resources. For example, given a sluggish transaction, anomalies in e-mail performance
metrics are set aside if the transaction doesn't depend on the e-mail service.
Finally, the filtered element data is examined in detail and ranked, if possible, by the
probability that it is the cause of the problem. It is then presented to the operators for
evaluation. With a ranked set of potential causes computed automatically, administrators
and troubleshooters can get to work applying professional judgment much more quickly
than if they had to work through the root-cause triage manually.
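The filtering stages can be sketched as a pipeline. This is a toy rendition of the sequential-filtering idea, not ProactiveNet's implementation; the field names and the 60-second correlation window are invented for illustration:

```python
def sequential_filter(candidates, alarm_time, dependencies, window=60.0):
    """Sequentially eliminate root-cause candidates.

    candidates: list of dicts with 'name', 'abnormal' (bool),
    'deviation_time' (seconds), and 'score' (anomaly magnitude).
    dependencies: set of resource names the failing transaction uses.
    """
    # Filter 1: keep only abnormal baselines.
    stage = [c for c in candidates if c["abnormal"]]
    # Filter 2: time correlation - keep deviations near the alarm.
    stage = [c for c in stage
             if abs(c["deviation_time"] - alarm_time) <= window]
    # Filter 3: keep only resources the transaction depends on.
    stage = [c for c in stage if c["name"] in dependencies]
    # Rank survivors by anomaly magnitude as a proxy for likelihood.
    return sorted(stage, key=lambda c: c["score"], reverse=True)
```

Each stage is cheap because it only inspects what the previous stage passed, which is why the cumulative reduction can be so large.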
Netuitive
Netuitive specializes in predicting future behavior. The company's offerings build
predictive models that are used to identify behaviors that may lie outside the range of
expected activity thresholds. The models are derived from correlation inputs in
combination with congurations set by subject matter experts.
Most services have a large number of parameters that characterize their behavior. Any root
cause or triage strategy needs to determine which variables are the most useful in
understanding and predicting behavior. Typically, there is an overabundance of variables from
which to choose, confounded by a lack of understanding of their relationships to each other.
Netuitive proceeds through a set of steps as it builds a predictive model. The tool collects
operational information, refines the variables, incorporates expert knowledge, and refines
the model.
The process begins by baselining the range of operational behavior, collecting operational
data for a 14-day period as a first step in modeling for a new application. Netuitive captures
all the variables the application provides as it builds a representation of average, maximum,
and minimum values for the operational envelope.
Netuitive then identifies strongly correlated variables: those whose behavior is tightly
coupled to other variables that define the operational envelope. A change in one variable
will be reflected in other strongly correlated variables and is therefore a good predictor of
change. Conversely, tracking a variable with low correlation does not provide any
indication of its impact on overall application behavior.
The goal is to determine a small set of strongly correlated variables that are accurate
indicators of behavioral change. This enables the model to be as simple as possible, but
not so simple that it provides inaccurate predictions.
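Selecting strongly correlated variables can be illustrated with a plain Pearson correlation screen. This is a sketch, not Netuitive's actual selection method, and the 0.8 cutoff is an arbitrary assumption:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

def strongly_correlated(series, target, threshold=0.8):
    """Pick variables whose history tracks the target variable.

    series: dict of name -> list of samples; target: list of samples.
    """
    return [name for name, xs in series.items()
            if abs(pearson(xs, target)) >= threshold]
```

Weakly correlated variables are dropped because, as the text notes, tracking them gives no indication of overall application behavior.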
Netuitive also facilitates incorporation of input from experts that understand the modeled
application. These are usually members of the original development team or those who
have extensive practical experience with using the application. Such subject matter experts
provide the root-cause information, using their knowledge to link specific variable changes
with their likely causes.
The final stage in predictive model development is verifying the capabilities and usefulness
of the model. Anomalous events are introduced to validate that the model detects them and
provides the correct root-cause analysis for them. The assessment also tracks the number
of false alarms that are generated as a measure of the model's accuracy.
The application model is now ready for production use; it shows how the various
measurement inputs are correlated and how they can be used to predict performance
problems.
In production, the Netuitive Analytics Core System Engine calculates dynamic thresholds
for the model's variables using their workload and the time as the basis. The threshold values
defining the operational envelope are updated continuously.
The Netuitive system also calculates imputed values for the model's variables based on the
actual variable values and history. In other words, it evaluates the expected value if the
variables follow their normal relationships and the correlation between them holds.
Real-time alerts are generated when actual measurements differ from the imputed values
by an amount that indicates a possible problem. Predictive alerts indicate that a forecasted
value will exceed the forecasted baseline range. The alerting module also has a parameter
that defines the number of alerts that trigger an alarm to other management elements, such
as a management platform.
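The imputed-value idea can be sketched with the simplest possible relationship: a linear fit to one strongly correlated peer variable. Netuitive's real models are richer than this; the slope, intercept, and tolerance here are assumed inputs:

```python
def imputed_value(peer_value, slope, intercept):
    """Expected value of a variable, given a strongly correlated peer."""
    return slope * peer_value + intercept

def alerts(actuals, peers, slope, intercept, tolerance):
    """Alert whenever an actual measurement strays from its imputed value."""
    out = []
    for t, (actual, peer) in enumerate(zip(actuals, peers)):
        expected = imputed_value(peer, slope, intercept)
        if abs(actual - expected) > tolerance:
            out.append((t, actual, expected))
    return out
```

The alert fires not because a fixed threshold was crossed, but because the normal relationship between correlated variables stopped holding.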
Netuitive's approach offers a peek into the future. A possible drawback is keeping the
models current with application enhancements. Changes in the application may introduce
new correlations between variables. New behavior must also be incorporated after finding
the needed experts. This may represent an ongoing effort that should be balanced against the
gains of predictive tools.
even the most robust Internet sites. The attacker unleashes the attack by sending the
zombies a directive specifying the target system and the attack parameters (some actually
select from a repertoire of attacks).
The sudden onslaught of a DDoS attack can quickly disrupt operations. There is often no
warning, such as a more gradual increase in loading might provide. The performance and
availability collapse can be very sudden if elements are operating with little headroom.
erode.
Step 2 Capture as much traffic as possible for analysis; ideally, this means
packets to you.
Step 7 Get the ISP's help in tracing the packet to the next ISP in the chain.
Step 8 Continue working with ISPs until you reach the origin network for the
attacking packets.
Step 9 Get the origin ISP to set up filters to stop the packets from entering the
Automated Defenses
One of the continuing threats that DDoS attacks pose is the shock of a sudden traffic surge
that disrupts services for lengthy periods. All too often, rapidly deteriorating service quality
is the earliest indication that an attack is underway; in fact, such a degradation shows it is
already succeeding.
Earlier detection clearly helps the defense: detecting the early signs, or signatures, of an
impending attack and activating the appropriate defensive measures in time. A warning
from any source is helpful, but those that provide the longest lead time are the most
valuable. Longer lead times are attained with some trade-off with accuracy. A sudden surge
of 100,000 connection attempts within the last minute has a very high likelihood of being
a DDoS attack, but your lead time is very short.
Conversely, detecting a smaller perturbation that predicts an incipient attack increases your
lead time and options for countering the attack. However, the longer lead time might also
come with an occasional false alarm when some perturbation was not actually indicative of
an attack.
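The lead-time versus accuracy trade-off reduces to a single sensitivity parameter in even the simplest detector. The numbers below are invented for illustration and this is not any product's algorithm:

```python
def surge_alarm(attempts_per_minute, baseline, factor):
    """Flag minutes whose connection-attempt rate exceeds the baseline.

    A low factor gives earlier warning of a building attack but more
    false alarms; a high factor alarms only on an unmistakable surge,
    when the attack is already succeeding.
    """
    return [i for i, rate in enumerate(attempts_per_minute)
            if rate > factor * baseline]
```

Tuning the factor is exactly the policy decision the text describes: how many false alarms an operations team will tolerate in exchange for earlier warning.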
Two examples of defense solutions are described in the following sections to show the
range of automated approaches available.
(Figure components: Router, PeakFlow Controller, Firewall, PeakFlow Collector, Load Distributor, Web Servers)
The Arbor PeakFlow system is a set of distributed components for defending against DDoS
attacks. Collectors gather statistics from Cisco and Juniper routers and from other network
components, such as switches. They also monitor routing update messages to follow
changes in the routing fabric. Periodic sampling is used to build a normal activity baseline
for each collector.
Collectors detect anomalous changes in the trafc patterns and characterize them for the
PeakFlow controllers. (This is where each vendor has their secret sauce: proprietary
algorithms for mining the changing operational patterns and extracting better predictions
of future problems.) The distributed PeakFlow collectors provide the detailed analysis that
identifies particular attack signatures. Knowing the type of attack helps direct the defensive
response more accurately. Remote collectors also capture new anomalies that may prove to
be new attack signatures. New signatures are added to provide faster diagnosis if the same
attack is attempted in the future.
126
A PeakFlow controller integrates the reports from a set of collectors and determines if a
DDoS attack is indicated in the anomalies. The controller traces the attack to its source and
constructs a set of defensive filters. The controller can then automatically load and activate
the defensive filters, or they can be initiated after staff inspection.
The Arbor Networks approach uses a centralized correlation engine to pick out attack
indicators from the collector anomaly reports. Centralized correlation aids accuracy
because the distributed nature of attacks may be obscured when looking at each point of
attack; the aggregate pattern is more revealing.
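The benefit of central correlation, a pattern invisible at any one point but obvious in aggregate, can be sketched as a signature count across collectors. The data shapes here are assumptions for illustration, not Arbor's interfaces:

```python
def correlate_collectors(reports, min_sites):
    """Aggregate per-collector anomaly reports at a central controller.

    reports: dict of collector name -> set of anomaly signatures seen
    there. A signature appearing at several sites at once suggests a
    coordinated, distributed attack rather than a local problem.
    """
    counts = {}
    for signatures in reports.values():
        for sig in signatures:
            counts[sig] = counts.get(sig, 0) + 1
    return {sig for sig, n in counts.items() if n >= min_sites}
```

A single collector seeing one anomaly has no way to distinguish a local fault from one arm of a distributed attack; the controller's aggregate view does.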
attack occurs. Some policy questions that need discussion and agreement prior to the attack
itself include the following:
- What is the set of graduated steps to take after an attacker is traced (usually selecting the least disruptive first)?
Increasingly sophisticated DDoS attacks demand more sophisticated defenses. This area
merits continuous attention, as attackers will not rest once their current methods have been
defeated. Detectors and the policies they activate must be reviewed and refined to meet
evolving threats.
Summary
Real-time operations comprise a set of key functions that must operate within tight time
constraints. Information flows into the real-time operations system from the
instrumentation manager (the source of alert data) and from the SLA statistics modules,
which provide time-sliced measurements of performance. The real-time operations system
then processes the inputs in an attempt to improve MTBFpossibly by using proactive
techniques to predict possible failures. At the same time, it tries to assist the operations staff
in decreasing the MTTR when a failure actually occurs.
Reactive management, used to decrease MTTR, is based on the use of triage and root-cause
analysis. Triage tries to identify the responsible organization very quickly, in the hope that
they will be able to use their specialized tools and knowledge to fix the situation. Root-cause analysis is a more detailed, technically intense process that tries to assist in the
detailed diagnosis of the situation.
Root-cause analysis uses sophisticated methods of filtering and correlating input data,
possibly combined with a model of the system being managed, to make reasonable
suggestions about the cause of a performance problem.
Active responses can then be used to handle routine problems or even predicted problems
so that system operators can concentrate on more complex issues.
CHAPTER 7
Policy-Based Management
Managing services in compliance with a Service Level Agreement (SLA) places more
demands on the management system and the staff. Stiffer penalties for noncompliance
increase the pressures to respond quickly and accurately even while the environment grows
more dynamic and complex. Often, more sophisticated automation than that described in
Chapter 6, "Real-Time Operations," is needed to relieve and supplement overworked staff
members. Toward that end, this chapter covers the following:
Policy-based management
The need for policies
The policy architecture
Policy design
Examples of products
Policy-Based Management
Automation is a key attribute of an effective Service Level Management (SLM) system.
Stringent SLA compliance criteria reduce the time cushion that administrators might have
had. One of the compliance criteria mentioned in Chapter 2, "Service Level Management,"
is a demand for higher availability. If management staff members are left to deal with high
rates of change and growing complexity, the resolution times are unacceptable. Automated
management tasks are the only way to add speed and to deal with complexity.
Note, however, that automated tasks are also of concern to administrators because they are
taking actions and making changes at a faster rate than humans can maintain. A policy-based management system is an attempt to leverage automation while constraining actions.
Policies are sets of rules that define and constrain the actions the management system takes
in different situations. Table 7-1 shows the various levels of rules that might be involved in
a policy-based system. The rules are defined from the business level downward. Each rule
level supports the goals of the levels above and depends on lower levels to achieve those
goals.
Table 7-1 Multi-Level Rules (Level / Focus)
Business rules
Service rules
Infrastructure rules
Element rules
As an example, consider that infrastructure rules might involve establishing special routes
for low-latency network trafc or allocating more servers behind a load-balancing switch.
Those infrastructure rules depend in turn on the proper element configurations.
Management system rules govern internal management processes, such as monitoring.
Monitoring processes have targets, polling and heartbeat frequency, threshold values for
alerts, and steps to take when there are failures in the instrumentation system.
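A rule at any of these levels can be modeled as a condition/action pair evaluated against monitored state. This minimal sketch is illustrative only; the class, the state keys, and the sample rule are invented, not taken from any product described here:

```python
class Policy:
    """A rule that defines and constrains a management-system action."""
    def __init__(self, name, condition, action):
        self.name = name
        self.condition = condition   # callable: state -> bool
        self.action = action         # callable: state -> action description

    def evaluate(self, state):
        """Return the action to take, or None if the rule does not apply."""
        return self.action(state) if self.condition(state) else None

def run_policies(policies, state):
    """Evaluate every policy against the current state; collect actions."""
    results = []
    for p in policies:
        action = p.evaluate(state)
        if action is not None:
            results.append((p.name, action))
    return results
```

An infrastructure rule such as "establish special routes for low-latency traffic" would be one such condition/action pair, triggered when monitored latency drifts past its limit.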
Many policies are activated when a potential or actual service disruption is passed along by
an alert from the real-time event manager. Other policies are activated in response to
changes in the SLA statistics.
Policy-based management has a learning curve. Simple policies save staff time and effort
and are usually implemented first. More sophisticated policies are implemented as the
management team gains experience and learns how to extend policies more deeply into
business processes and to more areas in the managed environment.
Policy-based management is the systematic creation of policies that drive the management
system to maintain the highest service quality.
- Staffing costs: Economic pressures to show strong bottom-line results work against the need to hire expensive expert staff.
Table 7-2 illustrates the major differences between the service and element policy domains.
For example, high levels of redundancy can absorb element failures without substantially
degrading service quality. More sophisticated policies are needed for managing a complex
and dynamic service environment.
Table 7-2
Service-Centric
Applied to services
Relatively simple
Relatively complex
The next two sections discuss management policies for elements and for entire services.
Policies offer large environments that have many devices, sites, and users a consistent way
of handling element configuration. This approach scales gracefully as the environment
grows. In addition, staff are freed from element-specific details and are involved only if a
policy fails.
While freeing administrators from a plethora of low-level decisions and reducing the
likelihood of error is attractive, it is important to remember that the best results are obtained
when policy management has unambiguous input. Elements that have very clear
management instrumentation and a limited set of configuration options are the best
candidates for applying automated policies.
Conversely, elements such as high-end operating systems, application servers, and other
parts of the service delivery architecture don't always expose their management
information clearly. This situation makes automated decisions less clear-cut. The multiple
layers of complexity inside some elements, such as servers, also make tuning them a
challenge. The pressure is on policy designers to incorporate those subtleties to get the most
from a policy-based approach.
Service-Centric Policies
This policy category deals with service-quality issues rather than element behavior. Such
policies are inherently more complex, and they can span several infrastructures. Most
importantly, service-centric policies are targeted as much toward achieving business aims
as maintaining technical performance. For example, policies are focused on minimizing
penalties or treating the affected customers in various ways.
Let's look at an example to clarify the differences between element- and service-centric
policies. Consider a provider using a tiered server farm to speed transaction flows. The
redundancy of the farm means that a single server failure does not immediately impact
service availability, but it begins to expose the site to performance problems if the
remaining servers are approaching their loading limits. This is an example of a
service-centric policy, which focuses on maintaining adequate server capacity rather than
responding in detail to the failure of any server in the farm.
The policy actions taken when a server fails can include the following:
Check the other servers on the tier: Is their load after the failure still under the
defined threshold? As an example, consider that a set of four servers each running at
a 25 percent load transforms into three servers each with a 33 percent load.
Check the load: If the load is acceptable for now, send an alert and wait for the staff
to take further action. If the load is too high, increase the severity of the alert and page
the server manager.
Check the number of servers in the standby pool: If they are depleted below a
threshold value, send a high-priority alert to the event management system.
Provide detailed reports: The reports should cover the steps taken and warn of
imminent problems. They should also generate a problem ticket for repair of the failed
server.
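Taken together, these actions amount to a small decision procedure. The following Python sketch is one way to express it; the function names, alert severities, and threshold values are hypothetical, not drawn from any particular product.

```python
# Illustrative sketch of the service-centric server-failure policy above.
# All names and threshold values are assumptions for illustration.

LOAD_THRESHOLD = 0.75      # maximum acceptable per-server load
STANDBY_MINIMUM = 2        # alert when the standby pool drops below this

def on_server_failure(tier_loads, standby_pool, send_alert, open_ticket):
    """Apply the policy when one server in a tier fails."""
    # Check the other servers on the tier: is their load still acceptable?
    if all(load <= LOAD_THRESHOLD for load in tier_loads):
        send_alert("minor", "Server failed; remaining tier load acceptable")
    else:
        send_alert("major", "Server failed; tier load above threshold")

    # Check the standby pool and escalate if it is depleted.
    if standby_pool < STANDBY_MINIMUM:
        send_alert("high", "Standby server pool depleted")

    # Always generate a problem ticket for repair of the failed server.
    open_ticket("Repair failed server")

# Four servers at 25 percent load become three at about 33 percent each.
loads_after_failure = [0.25 * 4 / 3] * 3
```

The alert and ticket functions are passed in so the same policy logic can feed whatever event manager and trouble-ticket system are in use.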
Other information can be used to increase the intelligence of the response. For instance,
there can be a check to see if there are imminent load changes. The alert could then provide
more information, such as whether the remaining servers in the tier are operating under
threshold now and whether the afternoon trafc surge is 30 minutes away. This gives the
staff better information and indicates that attention is needed to avoid compounding the
problems.
A Policy Architecture
This section covers the basic components comprising a policy management system. The
components include policy enforcers, repositories, and policy managers (after all,
something has to manage the management policies). Specific products might have different
combinations of components.
Security features control access to the policy management tools. Only selected
administrators can create or modify the policies. The same restrictions are applied to the
policies for each customer; only those administrators responsible for a customer can set or
modify the appropriate policies.
Repository
The policy repository contains the policy information used by the other elements in the
policy system. The repository can be implemented in many ways: as a set of flat files, a
database, or, more commonly, as a directory. Directories are winning favor because they
offer advantages, including the following:
Directories are already widely used for other functions, including configuration,
access control, and resource allocation. This offers the advantage of using existing
mechanisms rather than inventing something equivalent.
Repositories are structured to provide independent policy domains for each customer
organization, their business units, and specic individuals within the organization.
The policy management tools aid administrators in creating and modifying the information
held in the policy repository. After the information is safely in the repository, it must be
available to the elements that actually act on it.
Policy Distribution
After they are in the repository, policies can be distributed to the appropriate elements.
There are three types of distribution models: pull, or component-centric; push, or
repository-centric; and a hybrid that combines elements of each. These approaches are
discussed in the following subsections.
Adding more intelligence to the repository enables finer tuning and control. For example,
distinctions between a customer accessing services from high-speed or low-speed
connections can be made, and the appropriate policies can be applied to each case.
Further, time of day might be used as a criterion for blocking or
permitting access to specific services in different time periods. Such a policy would be one
method of preventing undesired activities, such as bulk transfers or database mirroring,
from taking place when they would interfere with other activities.
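A time-of-day rule like the one just described can be sketched in a few lines of Python; the activity names and business-hours window below are assumptions for illustration.

```python
# Hypothetical sketch of a time-of-day access policy: block bulk
# activities during business hours, permit them otherwise.
from datetime import time

BLOCKED_ACTIVITIES = {"bulk_transfer", "database_mirroring"}
BUSINESS_HOURS = (time(8, 0), time(18, 0))   # illustrative window

def access_permitted(activity, when):
    """Permit an activity unless it is blocked during business hours."""
    start, end = BUSINESS_HOURS
    in_business_hours = start <= when <= end
    return not (activity in BLOCKED_ACTIVITIES and in_business_hours)
```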
A pull model evolves through on-demand delivery to the components. Over a period of
time, each component pulls together the unique information it needs for its specic
functions.
The drawback of this approach is that policies must remain fresh; policy information in a
component can become stale and of no value. In fact, it has negative value because an old
policy is in effect rather than its successor. As usual, there's a compromise to be made
between frequent update requests and using local caching to reduce traffic and delays.
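That compromise can be sketched as a local cache with a time-to-live: fresh entries are served locally, stale entries trigger a pull. The repository interface and TTL value here are hypothetical.

```python
# Sketch of pull distribution with local caching and aging.
import time

CACHE_TTL_SECONDS = 300   # illustrative compromise: freshness vs. traffic

class PolicyCache:
    def __init__(self, fetch_from_repository):
        self._fetch = fetch_from_repository
        self._cache = {}              # name -> (policy, fetched_at)

    def get(self, name, now=None):
        now = time.time() if now is None else now
        entry = self._cache.get(name)
        if entry is not None and now - entry[1] < CACHE_TTL_SECONDS:
            return entry[0]           # fresh: serve locally, no traffic
        policy = self._fetch(name)    # stale or missing: pull an update
        self._cache[name] = (policy, now)
        return policy
```

A push operation would simply overwrite or invalidate cache entries directly, which is why the hybrid model can change behavior quickly when it must.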
Hybrid Distribution
Using both models takes advantage of the strengths of each. The pull model obtains the
latest information and can reduce traffic with local caching and aging. The push model is
used when large-scale rapid policy changes are necessary (for example, when a security
breach is detected). The policy system would use a push operation to change operations
very quickly and minimize the damage from an intrusion.
Enforcers
This is where the rubber meets the road: enforcers ensure that policies are properly carried
out. Enforcers are distributed throughout each infrastructure and carry out specific
functions. Some enforcers are part of management agents embedded in another element
while others, such as those in load-balancing switches, are major parts of the element's core
function. Examples of enforcers include the following:
Access devices at the network edge apply policies that determine access to
services. These policies must be very granular, capable of being applied to
individuals and services.
Routers in the network core switch and forward traffic according to policies for
reserving bandwidth and forwarding traffic. This must also be granular for each
customer and each service.
Servers apply priority policies for scheduling their tasks. These policies can also vary
with activities. For example, a customer browsing a catalog might receive a lower
priority than one who is completing a purchase.
The geographic distribution system applies policies to select a site based on distance,
relative loads, or customer profiles.
Policy Design
Designing policies becomes more complicated as you move from the elements to the
services they support. Several infrastructures might be involved, and the decisions made by
the policy system must reflect more conditions, each of which must be tested and analyzed.
One important tool for assessing policy robustness as policies are designed is called Failure
Modes and Effects Analysis (FMEA). It is commonly found in structured process and
design methods, such as Six Sigma.
Using a spreadsheet or table with yellow stickies on a whiteboard, FMEA accounts for the
following:
The severity of the failure mode, rated on a scale of 1 to 10
The frequency with which the failure mode occurs, rated on a scale of 1 to 10
The likelihood that the failure mode will be detected, rated on a scale of 1 to 10
A weighted Risk Priority Number (RPN), which is obtained by multiplying the three
ratings: severity, frequency, and likelihood of detection
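As a worked example, the RPN arithmetic can be captured in a few lines; the failure modes and ratings below are invented for illustration only.

```python
# Sketch of the FMEA bookkeeping described above. Each failure mode is
# rated 1-10 for severity, frequency, and likelihood of detection; the
# Risk Priority Number (RPN) is the product of the three ratings.
def risk_priority_number(severity, frequency, detection):
    for rating in (severity, frequency, detection):
        if not 1 <= rating <= 10:
            raise ValueError("FMEA ratings are on a 1-10 scale")
    return severity * frequency * detection

# Hypothetical failure modes for a server-farm policy.
failure_modes = {
    "stale load measurement": risk_priority_number(7, 4, 6),   # 168
    "standby pool exhausted": risk_priority_number(9, 2, 2),   # 36
    "alert lost in transit":  risk_priority_number(8, 3, 8),   # 192
}
ranked = sorted(failure_modes, key=failure_modes.get, reverse=True)
```

Ranking failure modes by RPN is what focuses the cross-domain discussion on the risks most worth designing the policy around.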
The power of the FMEA method rests on two foundations. First, it makes explicit the
catalog of policy inputs required to make the policy successful, and it helps make explicit
how a policy has to deal with them. Second, and more importantly, FMEA provides a
discussion framework for collaboration by experts from multiple domains. These domains
include applications, networks, servers, electricity, and so forth. If a policy can be
thoroughly accounted for in an FMEA, it's a good candidate for automation; if not, the
policy will likely not succeed.
Further discussion in this section is divided into the following subsections:
Policy hierarchy
Policy attributes
Policy auditing
Policy closure criteria
Policy testing
Policy Hierarchy
Policies can be organized into hierarchical structures that give advantages to providers and
customers. Customers can have an overall policy that applies to all their users. They can
then assign additional constraints to different business units, allowing them
finer-grained control.
One example would have a customer policy forbidding certain applications from running
during normal business hours. Each business unit must conform to that organizational
policy, but they can add more constraints that do not violate it. One business unit may add
additional times when those applications are forbidden, for instance.
Hierarchy enables constraints to be organized and applied while still preserving the
flexibility of lower-level policies that do not violate the basic constraints.
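One minimal way to model this add-only inheritance, assuming forbidden hours are represented as simple sets of hours (the specific hours are illustrative):

```python
# Sketch of hierarchical policy constraints: a business unit inherits the
# customer-wide forbidden hours and may only add to them, never remove.
def effective_forbidden_hours(parent_hours, unit_extra_hours):
    """A unit's effective policy is the parent policy plus its additions."""
    return set(parent_hours) | set(unit_extra_hours)

customer_policy = set(range(9, 17))   # forbidden 09:00-16:59 company-wide
unit_policy = effective_forbidden_hours(customer_policy, {17, 18})
```

Because the union can only grow, a business unit can never relax the organizational policy, which is exactly the constraint the hierarchy is meant to enforce.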
Policy Attributes
Policy attributes are important to consider as well. The attributes differ according to the
needs of each policy.
Policies can be based on the initiating user. As an example, user A can use a particular
service only at night and only at the bronze level. On the other hand, user B can use the
same service anytime at a platinum level. This provides a great deal of granularity and
customization.
The time of day will have a strong impact on many policies because it controls when certain
services are available, or it changes the constraints on service quality. Thus, time can be an
important attribute because certain activities can be scheduled for, and restricted to, times
when they do not interfere with more critical service flows. Policy violations also identify
those users who are trying to violate the policy or who do not understand it.
Policies can come into conflict with each other, just as they do in other work areas.
Assigning each policy a precedence value helps resolve conflicts, with the highest
precedence being the operative policy at that moment. Using precedence is a good practice
because it forces administrators to evaluate the relative priorities of the policies they create
and manage.
Policies might also need a lifetime to help with their administration and management. There
will be situations where an administrator creates a policy for a special situation. It is easy to
define its lifetime and automatically deactivate it when the time expires. This saves
administrative effort in tracking policies and prevents old policies from being used without
oversight. Some policies can have a lifetime value of forever; they will exist until an
administrator takes specific action to delete them.
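The precedence and lifetime attributes combine naturally into one small resolver; the field names below are illustrative, not taken from any policy product.

```python
# Sketch: drop expired policies, then resolve conflicts by precedence.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Policy:
    name: str
    precedence: int                 # higher value wins a conflict
    expires_at: Optional[float]     # None means a lifetime of "forever"

def operative_policy(candidates, now):
    """Return the highest-precedence unexpired policy, or None."""
    active = [p for p in candidates
              if p.expires_at is None or p.expires_at > now]
    return max(active, key=lambda p: p.precedence) if active else None
```

A temporary high-precedence policy overrides the baseline while it is alive and drops out automatically when its lifetime expires, with no administrator action required.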
Policy Auditing
Policy systems need auditing functions. I have worked with several organizations, for
example, that had created large numbers of policies and then found that only a small
percentage of them were actually used. Knowing which policies are heavily used also
focuses time and energy on optimizing those that give the highest payoff.
Policy Testing
Many policy systems lack strong testing capabilities. Administrators must have confidence
that the policies they are using will work properly in the operating range the policy was
designed to handle. Policies can be tested in development laboratories or by hand, by
running through a set of possible scenarios.
Periodic comparison of the device configuration against the policy definitions stored
in the repository
The repository is a directory accessed through the Lightweight Directory Access Protocol
(LDAP). LDAP establishes a client/server connection before transporting requests from the
client. LDAP is the common-access mechanism, leaving the choice of the actual repository
schema open. The repository could be an object-oriented database, an SQL database, or
even a flat file.
QPM uses several policy-distribution mechanisms to support its push-delivery model.
Cisco will continue to support Simple Network Management Protocol (SNMP) devices, but
it sees Common Open Policy Services (COPS) as its strategic future direction. Distributed
servers can be placed close to concentrations of policy elements to avoid backbone
congestion and to speed policy updates to the components.
COPS
COPS is an emerging Internet standard for interactions between a policy client (the Policy
Enforcement Point [PEP]) and a policy server (the Policy Decision Point [PDP]). It
supports both models for distributing policy information. The pull model is used when the
PEP initiates requests, updates, and deletes to the PDP, which returns a decision for each
request. The push model is also available, enabling the PDP to push new information to a
PEP, or delete old information.
COPS uses reliable communication between the PEP and the PDP. Because secured policy
exchanges are essential, COPS uses message-level security for message integrity,
authentication, and replay protection. Other security mechanisms, such as IPSec, can also
be used.
Messages between the PDP and PEP contain self-identifying objects relating to each
request or response. Examples include client type, interfaces, errors, decisions, and timers.
COPS offers higher reliability than earlier connectionless protocols, such as SNMP.
However, it imposes the burden on the PEP and PDP of keeping the connection active with
heartbeat traffic, which also adds more network loading.
Cisco provides special engines that use Network Based Application Recognition (NBAR)
to inspect each incoming packet and classify it according to the specified policy. Enforcers
are also built into most Cisco devices.
QPM has all the components for a policy-based management system oriented toward
element-management policies. It has Cisco devices with the capabilities to classify traffic
and apply a range of enforcement policies. Figure 7-1 shows an example of the use of QPM
to handle an SLA. Among the SLA specifications are descriptions of service classes,
services, and metrics.
Figure 7-1   Using QPM to Handle an SLA
Under the SLA in Figure 7-1, there are two branches: the left branch sets up the
classification and enforcement functions while the right branch handles instrumentation.
With the left branch, the SLA defines the service classes, such as streaming, interactive, or
transactional, that are covered by the agreement. The next stage is defining the membership
of each service or application and defining which service class applies.
This information is then enhanced with the definition of the relative priorities for each
service class. Applications are identified by criteria such as the information in the
communications packet. This information is used to configure the NBAR functional
module so that it recognizes each application and appends the appropriate information to
each packet it processes.
Information can also be loaded into other devices that act as enforcers for the policy system.
For example, edge devices can have rate and admission control functions that are activated
for each type of application/service flow.
The metrics are handled on the right branch of Figure 7-1. They are specified for each class
and service in the SLA. A solution such as QPM takes these metrics and configures the
instrumentation system accordingly. Instrumentation can be found in devices, desktops,
servers, and stand-alone collectors and aggregators. Both passive and active
instrumentation are configured to capture the metrics and report them to a management
server.
Some of these steps require some manual translation today. Future SLAs can be constructed
as XML documents, providing an electronic input for QPM. More of the process can be
automated, adding more value by reducing staff labor and errors.
Summary
This chapter introduced the idea of policy-based management as a means of dealing with
demands for service management in a complex environment with tight time constraints.
Automating many of the responses and procedures minimizes staff labor, reduces staff
mistakes, and provides the speed needed to meet stringent SLA compliance criteria.
Policies serve two main purposes: they define what actions the management system takes
in certain situations, and they prohibit other management activities that are irrelevant to a
specific problem.
Policy systems have enforcers to determine the appropriate actions to take on service flows.
Policies are distributed using push, pull, or hybrid approaches. The push model is very
effective for abruptly changing the policy system behavior. The pull model enables each
component to ask for information as needed.
Policies for service management evolve by automatically integrating more functionality.
Consider the policies that could be activated when a desktop initiates a streaming
connection. Collectors inside the desktop are activated to measure the latency and packet
loss on the connection. Alerts are forwarded if the measurements indicate an actual or
potential service disruption. Monitoring is discontinued when the connection is terminated.
A security breach might activate a set of policies that adjust firewalls, isolate key resources,
inform corporate management, and track the intruder while alerting the management team.
Policy-based products for service management are still maturing and administrators need
to assess their actual capabilities carefully.
For any policy, the output of a decision is only as good as the quality of the input. In
selecting where to apply policy-based management, as much consideration must be given
to the information used to make the decision as to the automation of possible outcomes.
This reinforces the importance of good instrumentation and event management for good
policy-based management.
CHAPTER
The critical need to have applications designers and the network and services
managers share the same perspectives about service delivery
Application-level service metrics, which are high-level technical metrics and other
end-user experience metrics
Granted, this does simplify the application development process, but there is the impact of
not knowing where your objects (content) are actually located. Content location can have a
significant performance impact due to long distance (propagation delay), restricted
bandwidth, or overloaded servers at a given location. In response, the teams felt that any
unacceptable access delays could be fixed by adding more bandwidth. However, this
answer masks confusion between bandwidth and propagation delay. In this instance, the
stated solution would undoubtedly contribute to poor application design decisions that
cannot always be fixed later by throwing resources (money) at the problem. This is
especially true when wide-area networks (WANs) spanning long distances are involved.
To enhance the understanding of network impact on application behavior, tools such as
Compuware's Application Expert can be useful. Application Expert is used during the
development phase to quickly test different application scenarios. It enables developers to
see the effect of network delays and bandwidth on application performance. This
information is fresh enough that it has an impact during the development process rather
than after the fact; this leads to better implementation. In the deployment phase,
Application Expert can be used to monitor the actual performance and identify further
opportunities for improvements.
Application-Level Metrics
Applications require instrumentation to make their behavior observable, and, as
appropriate, controllable. Client-side collectors operating in passive or active modes
provide some of the instrumentation because they measure the user experience from that
location. Note that client-side collectors are usually not a part of the application
instrumentation itself; they measure the application behavior for a specic virtual
(synthetic) transaction.
Table 8-1 shows examples of application instrumentation for a web sales application. The
instrumentation can be further divided into internal and external measurements:
Table 8-1
The internal measurements give insight into the behavior of the systems within the
direct control of the IT group. There are categories of internal measurements,
including those for workload, customer behavior, and business behavior.
External measurements show the behavior as seen by an end user outside the scope of
the IT organization, such as one using the Internet to access the system and
perform a transaction.
Measurement Category    Example Metrics                          Insight
Workload                Number of transactions/second            Gauges of activity
Customer behavior       Navigation, stickiness                   Effectiveness of content
Business measurement    Number of completed orders, revenues
                        generated, abandoned carts, promotion
                        feedback
Workload
Workload metrics track the capacity of the application. Capacity is determined by the
quality of the implementation and the assigned computing, storage, and network resources.
Some measurements of overall activity and capacity might include the number of
transactions per second, the number of concurrent connections, or the actual server loading
measurements.
Periodic measurements of workload can build activity baselines that prole the normal
ranges over longer time intervals. Alerts can then be generated when the comparison of the
actual workload against the baselines indicates a trend away from the normal ranges.
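A minimal sketch of such baseline checking, assuming a simple two-sigma band around historical samples (the band width is a design choice, not a prescription):

```python
# Sketch of workload baselining: alert when the current measurement
# leaves a band derived from historical samples.
from statistics import mean, stdev

def out_of_range(history, current, sigmas=2.0):
    """True when the current value leaves the baseline band."""
    mu, sd = mean(history), stdev(history)
    return abs(current - mu) > sigmas * sd
```

In practice the history would be kept per hour of day or day of week, so that a normal afternoon surge is not flagged against an overnight baseline.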
I recently visited a site that had melted down when a web page designer added two simple
objects, and another 45 KB of payload, to the home page. Testing in the lab showed no
apparent bugs, so the new page was placed in production. The problems began appearing
during times of heavy customer access. When there were over 10,000 active connections,
which occurred during the hours of peak demand, the additional load was 450 MB being
sent across the network for users downloading the home pages. As it turned out, this bump
on the backbone was actually the final straw that convinced this organization to outsource
their content delivery to a managed infrastructure provider.
Other aspects of customer behavior are used to optimize application performance. For
example, tracking the most heavily used content or the most frequent transactions gives
valuable information for improving the effectiveness of the customer-facing application.
For example, one site found that one of the most frequently accessed pages took five clicks
to reach. This was a business site wanting short visits and quick navigation. The desired
content was moved to the home page; with more direct navigation to their desired
destination, fewer customers lost interest, resulting in improved customer satisfaction and
increased revenue.
The popularity of content also assists managers in making intelligent decisions about
content placement, cache preloading, and the number of replication sites. The same value is
provided by instrumentation that identifies the most frequently used transactions.
Developers can focus their attention and optimize those transactions that will offer the
highest payoff in improved performance.
Business Measurements
Business measurements are becoming increasingly important. They are directly important
to business managers who want to understand how their online business is actually
functioning in real time. These metrics are important to technology managers as well; being
the source of critical business information establishes the value of better management
investments.
Some examples of business metrics are completed orders, generated revenue, promotion
feedback, and abandoned shopping carts.
A tally of completed orders indirectly measures the effectiveness of the web site: whether
it is keeping customer interest long enough to close sales, for example. This measures only
bottom-line effectiveness, not efficiency; however, it is a basic metric for many
organizations at this time.
The completed orders metric can be broken into more details, such as the following:
Promotion feedback can be invaluable because business managers and their marketing teams
are always focused on guiding users down a certain path to meet objectives such as
strengthening the Internet brand, creating stronger differentiation from competitors,
responding to market and competitor moves, and maintaining customer loyalty. They are
under continuous pressure to capture a greater market share while simultaneously reducing
customer acquisition costs.
Instrumentation can use special web pages, special buttons or links, or other ways of
tracking responses to a variety of promotions. This information can be analyzed and
organized to assess the effectiveness of different promotions and to understand acquisition
costs.
For some reason, abandoned shopping carts always seem to get a business manager's
attention. There have been some anecdotal reports that abandonment rates are often over 50
percent for some consumer sites. This should be distressing because these are potential
buyers who have taken time to navigate the site and select products before they go to
another site.
Business behavior metrics may be derived from other more basic measurements. For
example, revenues are calculated after each order is completed. The basic revenues may be
further segmented by the customer, the product, the time of day, a promotion, or other
criteria. These metrics must be baselined, and thresholds should be established. Because
business managers want to understand and respond to situations more quickly and because
technology managers want to make adjustments to maintain compliance with Service Level
Agreements (SLAs), an alarm can be sent to the appropriate business and technology
managers when an application has a sudden drop in revenues or visitors.
Each web page within the transaction should also be measured for download time because
that can be a good indicator of user abandonment behavior. On legacy systems, users didn't
see the computer screen directly; they spoke to call center operators. If the computer was
slow, the operator would talk to the customer and save the sale. On the Web, the customers
are directly exposed to slow web service, and they'll abandon a slow transaction. A
two-minute transaction that consists of ten 12-second page downloads is considerably different
from a two-minute transaction that consists of five 6-second page downloads and one
90-second download. Many users will abandon during that 90-second download. That's crucial
information for the business groups and should be included in the SLA.
Figure 8-1   Serialization Delay and Propagation Delay
Serialization Delay
Serialization delay is caused by the process of converting a byte or word in the computer's
memory to or from a serial string of bits on the communications line. Serialization causes
delays in most routers and, of course, at the source and destination. The time needed for
serialization is the time needed to write bits on to or off of the communications line; it's
controlled by the line speed. For example, 1500 bytes requires 8 milliseconds (ms) to
serialize at 1.5 Mbps and 300 ms to serialize at 40 kbps. The added header and trailer overhead
increases serialization delay because of the time needed to write and read those bytes.
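The arithmetic is simple enough to capture in a one-line helper, which reproduces the figures above:

```python
# Serialization delay: payload size in bytes, line speed in bits/second,
# result in milliseconds.
def serialization_delay_ms(payload_bytes, line_bps):
    return payload_bytes * 8 / line_bps * 1000.0

# 1500 bytes: 8 ms at 1.5 Mbps, 300 ms at 40 kbps.
t1 = serialization_delay_ms(1500, 1_500_000)
t2 = serialization_delay_ms(1500, 40_000)
```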
Decreasing overhead by fitting more data into each packet decreases download time by
decreasing serialization delay. However, most systems that run over the public Internet use
either 1460 bytes per packet (for high-speed connections) or 576 bytes per packet (for
dial-up connections); it's not easy to change those values. Changes are more easily made on
private systems. (Longer packets increase jitter and the penalty for a packet error; but in a
private, dedicated network where the number of router hops is constrained and transmission
quality is more controllable, this might not be a major issue.)
It's important to note that serialization delay is greatly influenced by compression and
encryption of content. For example, the standard home-user, dial-up modems perform hardware
compression within the modem itself. For some data patterns, the modem compression ratio is
4:1 or better. If a data block has been compressed, it is shorter and therefore takes much less time
to serialize. On the other hand, encrypted data cannot be compressed. (An encrypted string
appears to be purely random and therefore uncompressible.) The result is that secure web pages
are transmitted much more slowly on transmission links that have a large serialization delay.
Such web pages should be compressed before encryption. (This is also a strong argument in
favor of using true end-user measurements instead of computed or simulated end-user
measurements. A true end-user measurement would include the effects of modem hardware
compression; no commercial emulated measurements do that.)
Queuing Delay
Queuing delay is caused by waits in queues at origin, destination, and intermediate
switching or routing nodes. Variations in this delay cause jitter. For streaming media
applications, a dejitter buffer is required at the receiving end. (The delay in the dejitter
buffer is typically one or two times the typical jitter.)
Propagation Delay
Propagation delay is governed by the laws of physics; propagation delay cannot be
decreased by increasing the line speed. It is a distance-sensitive parameter. The ITU-T
standard G.114 specifies 4 µs/km for radio, 5 µs/km for optical fiber, and 6 µs/km for
submarine coaxial cables, including repeaters. Therefore, it will require 20 ms to travel the
4000 kilometers (km) from New York City to Los Angeles, or 100 ms to travel the 17,000
km from New York City to Melbourne, Australia. A signal beamed up to a geosynchronous
satellite and down again, a distance of 72,000 km, takes approximately 280 ms.
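The G.114 figures lend themselves to a small helper; note that the 280 ms satellite figure in the text is an approximation of the computed 288 ms.

```python
# Propagation delay from the ITU-T G.114 per-medium figures.
# Rates in microseconds per kilometer; result in milliseconds.
G114_US_PER_KM = {"radio": 4, "fiber": 5, "submarine_coax": 6}

def propagation_delay_ms(distance_km, medium):
    return distance_km * G114_US_PER_KM[medium] / 1000.0

nyc_to_la = propagation_delay_ms(4000, "fiber")        # 20 ms
satellite_hop = propagation_delay_ms(72000, "radio")   # 288 ms
```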
An example may help illustrate the massive importance of propagation delay. Imagine a
1-MB file to be transmitted over three different connections:
A local, high-speed Ethernet connection with negligible propagation delay
An Internet connection from New York to Los Angeles with an effective bandwidth of
15 Mbps and a one-way propagation delay of 75 ms
An Internet connection from New York to Los Angeles with an effective bandwidth of
1.5 Mbps and a one-way propagation delay of 75 ms (a typical coast-to-coast latency
on the Internet)
There's some additional complexity that must be mentioned here: the Transmission Control
Protocol (TCP) used by web browsers and for reliable file transmission over the Internet
has a typical data block size of 1460 bytes and a window size of 17,520 bytes.
NOTE
The window is the maximum amount of unacknowledged data that can be outstanding at
any given time; the value given here is for the Windows 2000 operating system (OS). Thus,
for a window size of 17,520 bytes, twelve 1460-byte data packets can be transmitted before
an acknowledgment must be received. An acknowledgment is sent after each
even-numbered packet is received.
Note also that TCP's slow start algorithm, which slowly increases the transmission rate at the
start of a file to avoid congestion, is being ignored for this example. (The large file size
makes slow start less important here, but it can be important for short files.)
Now you can see the effects of propagation delay on performance (see Figures 8-2 and 8-3):
For local, high-speed Ethernet, the propagation delay is so low that there's never a
problem receiving the acknowledgments before 17,520 bytes have been serialized.
The transmission of 1 MB proceeds at full line speed and is complete in
approximately 0.1 seconds.
For the 1.5-Mbps Internet connection in our example, serialization of 17,520 bytes
takes approximately 100 ms, and the propagation delay across the U.S. takes
approximately 75 ms. The round trip is therefore approximately 150 ms, and the first
acknowledgment is generated when the second packet has finished arriving,
approximately 15 ms after the first packet begins to arrive. Therefore, as shown in Figure
8-2, there's a 65-ms pause to wait for an acknowledgment after each block of 17,520
bytes is transmitted. Transmission of the 1 MB in 58 separate blocks of 17,520 bytes
each takes approximately 9.5 seconds.
For the 15-Mbps Internet connection, serialization of 17,520 bytes takes approximately 10
ms, and the propagation delay across the U.S. takes approximately 75 ms. The round trip
is therefore approximately 150 ms, and the first acknowledgment is generated
approximately 1.5 ms after the first packet begins to arrive at the receiver. Therefore, as
shown in Figure 8-3, there's a 142-ms pause to wait for an acknowledgment after each
block of 17,520 bytes is transmitted. Transmission of the 1 MB in 58 blocks of 17,520
bytes takes approximately 9 seconds, almost the same as for the 1.5-Mbps connection!
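The block-by-block behavior just described can be captured in a small model. This sketch assumes, as the text does, that the sender transmits one 17,520-byte window and then stalls until the first acknowledgment (sent after the second 1460-byte packet arrives) comes back; slow start is ignored, and the function name and structure are illustrative, not from the book:

```python
def transfer_time_s(file_bytes, line_bps, rtt_s, window=17_520, mss=1460):
    """Rough window-limited TCP transfer time: each block of one window
    costs a round trip plus the serialization of the first two packets
    (the first ACK is sent once the second packet arrives), but a block
    can never finish faster than the line rate allows."""
    per_block = rtt_s + 2 * mss * 8 / line_bps
    per_block = max(per_block, window * 8 / line_bps)  # line-rate floor
    return (file_bytes / window) * per_block

MB = 1_000_000
print(round(transfer_time_s(MB, 1.5e6, 0.150), 1))   # ~9.5 s
print(round(transfer_time_s(MB, 15e6, 0.150), 1))    # ~8.7 s
print(round(transfer_time_s(MB, 100e6, 0.0), 2))     # ~0.08 s (local Ethernet)
```

The model reproduces the paradox in the text: a tenfold increase in bandwidth barely changes the transfer time, because the round trips, not the line rate, dominate.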
[Figure 8-2: timeline for the 1.5-Mbps connection, showing start/finish transmit, start/finish receive, and first-ACK events between 0 and 240 ms]
[Figure 8-3: Serialization Delay; timeline for the 15-Mbps connection, showing start/finish transmit, start/finish receive, and first-ACK events between 0 and 237 ms]
This situation is even more important for web transactions, where each web page may
require many les, each with this type of sensitivity to transmission delays. The number of
round trips required by a web page or a transaction is sometimes referred to as turns, and
decreasing that number clearly decreases the sensitivity to transmission delay. Another way
of decreasing download time is to decrease transmission delay itself, and Chapter 9
discusses how content distribution networks can be used to place some of the page's content
closer (in terms of transmission delay) to the end user.
Processing Delay
Processing delay in the network includes modem delays (typically 40 ms or more for a pair
of V.34 modems without compression and error correction functions, for example), router
delays, and telephone network switching equipment delays.
Processing delay at the web server encompasses such functions as authentication, database
access, use of supporting services, and calculation. Increasing the server performance,
improving caching and load distribution, accelerating encryption speeds, adding servers, or
adding disc capacity are all ways to reduce the processing time or time spent on a server.
Instrumenting Applications
157
The network administrators must also increase their application awareness as well. They
need to select window and packet sizes that reduce latency and improve efciency wherever
they can. They, too, must understand that bandwidth does not solve every application
performance problem.
The placement of content is becoming a concern as pressures to deliver and use richer
content at higher quality continue. The content delivery infrastructure discussed in Chapter
9 and the sensitivities just covered in this chapter indicate the trade-offs that must be
considered in application design and operations.
Instrumenting Applications
Applications must provide the internal loading, customer, and business behavior metrics
that are necessary to understand their functioning. These metrics can be collected by
instrumentation from the web-server systems and applications, from other server
components, and from the end user. These are discussed in the following sections.
One way of tagging a page is to insert an almost-invisible phantom object on each page.
Usually this is a transparent, extremely small image; the technique is often called pixel-based
tracking or page-bug tracking. When the page is loaded into a browser, the browser
automatically makes a request for this invisible object, exactly as it requests all the other
images on the page. It's just another image as far as the browser is concerned. The phantom
object's tag is no different from a standard image tag, except that the tag references the data
collection server. Because of that reference, the phantom object request is directed to a
third-party recording site or tool that captures the activity. Using a single object in a page
to represent the page as a whole reduces the number of entries for each page retrieved.
Instead of having a log entry for each item on the page (and there may be 50 or more),
there's only one entry per page.
Unfortunately, the simplest version of this type of tagging can't see the interactions that the
user makes with the web page. However, more complex versions of phantom object tagging
enable the tag to contain a parameter string in the form of a query string in the image
request. That parameter string can be constructed by JavaScript running in the browser, and
it can therefore record user actions on the page along with any other information available
to JavaScript, such as browser size and available plug-ins.
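As a sketch of the mechanism (in real deployments the query string is assembled by JavaScript in the browser; the collector host name and parameter names here are invented for illustration):

```python
from urllib.parse import urlencode

def phantom_object_url(page, user_actions, viewport):
    """Build the one-pixel image URL that carries page activity to a
    hypothetical third-party collection server as a query string."""
    params = {
        "page": page,                        # which page was viewed
        "actions": ",".join(user_actions),   # user interactions recorded so far
        "viewport": "%dx%d" % viewport,      # browser size, as the text mentions
    }
    return "https://collector.example.com/pixel.gif?" + urlencode(params)

print(phantom_object_url("/checkout", ["click:buy"], (1024, 768)))
```

Because the browser treats this URL as just another image, the request reaches the collection server even though the rest of the page comes from elsewhere.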
Of course, use of phantom objects requires that each page to be measured include the
phantom object and, probably, a piece of special JavaScript code. In contrast, log file
analysis does not necessitate changes to the web pages.
Cookies can also be used by themselves or in conjunction with phantom-object tagging and
JavaScript. A cookie is an object exchanged between the browser and a web application. It
contains application and user information that applications can use for authentication,
personalization of content, and identication of customers for differentiated treatment. It is
stored in the browser at the request of the server, and a copy of it is returned to the server
with any subsequent requests made to that server. (Many users have set their browsers to
reject cookies automatically, unfortunately.)
WebSideStory's HitBox is one of the leading tools that use phantom objects, usually in
combination with JavaScript and sometimes with cookies.
Clickstream is a tool from Clickstream Technologies that uses cookies in combination with
a tracking module installed on the web server. Each request for a page serves the page to
the browser, accompanied by a page-side measurement algorithm that records page display
times as well as any offline and cached browsing activities that occur. The information is
recorded in a cookie and later sent to a server that records and analyzes all the request
information, including the browser-side, cache-based activities that would not otherwise be
seen by instrumentation because they did not result in any traffic on the communications link.
Keynote's WebEffective is a measurement service that's different from the phantom object
services. To use WebEffective, one line of JavaScript is embedded in the web site's entry
pages. That JavaScript redirects selected users (a sample of all users, specific users, and so
on) to the WebEffective server. The WebEffective server then inserts itself between the end
user's browser and the original web server systems. In that position, it records everything
the end user does on the web page and everything that the original web server systems do
in response. (For example, it can discover that an end user is not clicking on a particular
button because that button is not displayed on the end user's small browser window.) If
requested, WebEffective presents a pop-up window to the end user to ask permission to
track activity. After permission is given, it can ask questions of end users at any time. It can
even intercept end users who are abandoning the site to ask them why they're leaving, and
it can track them to their next site.
The integrity of web pages can also be evaluated by web analytics tools. I have spoken with
several organizations that have written a simple application for periodically validating the
integrity of the web application content. These applications improved service availability
by ensuring that the correct content was correctly linked. The rapid, frequent changes on
many sites might introduce a broken link, a pointer to a non-existent page, or other
problems that result in poor customer experience, lost business, and reduced chances of
future visits. Any problems with links or content are passed in an alert to the alarm manager.
(Such tools are also available from commercial vendors and include the Keynote
WebIntegrity tool and the Mercury Interactive Astra SiteManager tool.)
The integrity testing tool can be used to exercise the embedded links in the web pages. The
virtual transactions load a page, check for the correct content using a simple technique like
a checksum, and then initiate further virtual transactions based on links in the new page.
Unfortunately, some manual intervention might be needed because of potential loops in the
sequence of links; without it, the tests would never terminate. The virtual transactions would exercise
selected trails, such as those leading to visitor purchases. Web analytics tools help identify
the paths that have the heaviest visitor volume.
In an example I saw, the first deployment of an integrity-testing tool was at a site with a high
number of objects within each page. Before the application was tuned by the inclusion of
delays inside each testing transaction, the rapid sequence of links delivered large numbers
of new objects to the server's cache. This would cause the cache to replace other content
with these transitory objects and add some delays while the cache refreshed its normal
content after the test. The active collector was attached in a new position so that the server's
cache was not directly in line and was therefore not disturbed by the integrity testing.
locations will show degradation. However, rather than faulting the application as a whole,
the data from active collectors at the edge of the data center shows that the real problem lies
with the network.
Legacy applications are still very much in the mix for most organizations, although they are
hidden behind better web interfaces or safely tucked away in the back-end areas. The
challenge with legacy applications is that many were never initially instrumented for
remote monitoring and management.
The relative opacity of legacy applications dictates less direct approaches to understanding
application behavior. One approach pioneered by BMC Software was to treat an application
as a black box and observe behavior indirectly. BMC Software started to instrument
mainframe applications by observing their effects on system logs, disc system activity, and
memory usage, among other factors. BMC uses experience to make an educated guess
about the application's behavior, based on inferences derived from analysis of those factors
that could be monitored.
Geodesic Systems offers a more direct approach to instrumenting applications with
their Geodesic TraceBack tool. They actually embed instrumentation during the application
build process. Instrumentation is incorporated into the application code at compile time,
and it records application behavior at such a fine level of detail that application errors can
be pinpointed to a specific line of code.
End-User Measurements
For end-user measurements, passive and active collectors are placed near concentrations of
customers or at key infrastructure locations. They interact with the web applications and
carry out normal transactions. They measure the end-to-end performance of the application
from various sites. Almost all measurement system vendors, such as Computer Associates,
Tivoli, and HP, offer tools for running synthetic transactions or for passively observing an
end user.
Measurement services are also available from companies such as Keynote Systems and
Mercury Interactive. Keynote Systems is the largest supplier, with over 1500 active
measurement collectors at over 100 locations on all the major Internet backbones
worldwide. They run synthetic web transactions over high bandwidth, dial-up, and wireless
links, and they can also pull streaming media and evaluate the end-user experience. Use of
measurement service suppliers makes the most sense when your customers are dispersed
over the Internet or when you need a disinterested third party to provide your SLA metrics.
It's important to measure accurately when you want to evaluate the end-user experience. As
mentioned, emulation of dial-up user experiences by using restricted-bandwidth devices
fails miserably because of the impact of a real modem's hardware compression feature. A
study in the Proceedings of the 27th Annual Conference of the Computer Measurement
Summary
The application infrastructure is aptly named because most applications are composed of
related elements and supporting services. Applications include customer-facing elements,
which are activated most often through a browser, as well as backend functions, such as
credit authorization and order tracking. A single interaction with an end user commonly
involves multiple applications and services, and those applications and services are
themselves usually constructed of many smaller modules. Delivering superior service
quality in such a system requires good coordination and management of the supporting
applications and services.
One essential need is for closer communication between application designers and the
operations teams before, during, and after deployment. Application performance is
sensitive to many factors that designers usually ignore, such as transmission delay and the
number of turns, or back-and-forth data exchanges, needed for a transaction.
Legacy applications usually lack adequate instrumentation, whereas newer applications are
providing some embedded monitoring and tracing functions. Instrumentation allows
administrators to track application workload as well as business and customer behavior.
CHAPTER
9
[Figure 9-1: a tiered web site architecture: access provider, CDN server, cache, DNS server, routers, firewall, load distributor, web servers, application servers, and database server farm]
A similar approach is emerging to wring more performance from server farms. Servers are
designed to be high-performance computing and data-access platforms. They can suffer
from dealing with high-speed network communications tasks, such as the following:
Performing the processing needed for key establishment and for encryption and
decryption of Secure Sockets Layer (SSL) connections
Hypertext Transfer Protocol (HTTP), the protocol used for web page transfers, adds
additional strains because some versions of HTTP use a separate connection for each object
that is accessed. This means that a browser will create a connection, access the object, and
break the connection for each object, even if the same server is involved for all the objects
on a page, and many web sites have 50 or more objects per page. This adds additional
server overhead and slows response.
New products address these limitations with a computer system placed between the server
farm and the customers using the site. This new front end is purpose-built for handling
communications tasks, in contrast to a general-purpose server where these functions
compete with application services for resources. Such front-end devices handle the
communications tasks and also perform load-balancing functions.
SSL Accelerators
Businesses and customers are increasingly concerned about the privacy of their
transactions. The SSL protocol is an application-layer protocol using TCP for reliable
delivery. SSL uses special software at the client and server ends of the connection to ensure
that communications are private.
After a TCP connection is established, the client and server authenticate each other to
establish that they are who they represent themselves to be. Encrypted digital certificates
are exchanged and validated. Then the parties exchange encrypted messages and create a
unique key that they use for only this session. The key enables secure communications and
the detection of any alterations to the traffic in transit.
SSL adds some additional network overhead for authenticating the partners and negotiating
the security profile, but the biggest SSL impact is in the computing processing loads associated
with key creation. Large numbers of secure connections can degrade server performance
because servers must dedicate cycles to the processing associated with SSL establishment.
There are two related types of load distribution: local load distribution, which shares load
across servers in a single server farm, and geographic load distribution, which uses the end
users location to optimize server farm selection. Both types often contain extra functions,
such as SSL acceleration, attack handling, and aggregation of many hundreds of incoming
connections into far fewer server connections to decrease the servers' connection-handling
workload.
The state of the server infrastructure is an example of a supply-side criterion that can
influence the selection process. The load distributor uses information about server loads,
access controls, application or content availability, and priority to find the best server at that
moment.
Dynamic server supply-side selection strategies are based upon criteria such as the
following:
Some load distributors periodically execute a set of scripts that check the health of the
servers. For example, they can request a web page and test to see if it was correctly
presented. This can be done at the same time that they're timing the server's response speed.
Session persistence is important to transactions that require multiple requests. A customer
making a purchase needs to enter information, such as an address, a credit card number, and
shipping instructions. These types of services are stateful; some context is needed between
requests to maintain coherence and associate the requests.
When servers share a common repository for state information, switching a request to any
server is acceptable. However, most applications maintain their state independently, each
in a specic server. In those cases, sending requests associated with a single transaction to
different servers can cause the transaction to fail.
Session persistence is the capacity to associate a set of requests so that they are directed to
the same server. A few examples of the persistence options used by load distributors
illustrate the capabilities:
Source persistence: The load distributor remembers the address of the end user
and the identity of the server that was assigned on the first request. Further requests
from that user are directed to the original server. However, many end users are on the
other side of a firewall or a Network Address Translation (NAT) system, which can
reassign addresses frequently, limiting the utility of this technique.
SSL session ID persistence: For secure sessions using SSL, the load distributor can
use the SSL protocol's session ID, which is unique to a particular end user, to identify
that end user and the assigned server.
Performance is improved by getting users close to the desired content on the network.
Users can be identified by country and receive content in the specified language.
The distributed sites will show higher performance in the aggregate if traffic is distributed
intelligently, so this strategy works best if all sites are (approximately) equally loaded. Having
one center under-utilized while another is congested wastes resources at each location.
Intelligent geographic distribution decisions must also incorporate persistence; the user
must be directed to a single site, at least for the duration of a transaction.
Geographic distribution decisions are made using the same criteria as for local load
distribution, with additional input about the location of the end user. In some cases, the
Internet address of the end user or of the end user's DNS server can be matched against a
table of Internet addresses to determine probable locations. In other cases, all the server
farms attempt return contact with the end user, and the server farm with the fastest access
is assigned to that end user.
Content distribution network (CDN) switching is an interesting feature that enables a web
site to use public content distribution networks for extended geographic reach and to handle
traffic surges. As the web site's data centers reach capacity, a public CDN is used to handle the
overflow until traffic levels fall. Public CDNs are available with usage-based pricing,
enabling the web site owners to control costs as well.
Caching
Caches are special-purpose appliances that hide network and server latency by quickly
delivering frequently used content. A cache delivers objects faster than a server, after the
objects are in the cache.
The cache is used in conjunction with an interception switch, which intercepts all trafc
designated for particular services, such as the Web service on TCP Port 80, usually without
regard for the ultimate destination. The cache looks to see if it already has that object in its
storage; if so, it provides that object to the requester much faster than if the object had to
be fetched from the server. Objects can be stored in the cache explicitly, by being preloaded, or
they are stored when the cache sees an object requested by an end user that it hasn't seen
before. In that case, the cache performs the fetch from the web server on behalf of the end
user, and it then stores the object in the cache memory for the next retrieval (see Figure 9-2).
[Figure 9-2: Caching; an interception switch redirects requests to the cache, which serves a copy of the desired object or performs end-to-end retrieval from the web server]
The cache must be sensitive to the aging characteristics of each object. Some objects, such
as company logos, may never change. Other content, such as current stock prices, will
change constantly. Caching loses its value when it delivers expired (stale) content. Objects
often are delivered with caching headers from their origin servers; those headers tell the
cache how long it can store the object before it expires. If there isn't a caching header, or if
the item is probably unique and will never be requested again (for example, a URL with an
embedded query string, a dynamically generated file with a URL ending in .jsp, or an
encrypted object), the cache will simply ignore the object and not cache it.
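The rules just described amount to a short decision procedure. A simplified sketch (the header handling is reduced to a couple of common cases and is not a complete HTTP caching implementation):

```python
def is_cacheable(url, headers):
    """Apply the heuristics described above: skip query strings,
    dynamically generated pages, and encrypted objects, then honor
    a no-store/no-cache header if one is present."""
    if "?" in url:                    # embedded query string: probably unique
        return False
    if url.endswith(".jsp"):          # dynamically generated file
        return False
    if url.startswith("https://"):    # encrypted object
        return False
    cache_control = headers.get("Cache-Control", "")
    if "no-store" in cache_control or "no-cache" in cache_control:
        return False
    return True

print(is_cacheable("http://www.example.com/logo.gif",
                   {"Cache-Control": "max-age=86400"}))           # True
print(is_cacheable("http://www.example.com/quote?sym=CSCO", {}))  # False
```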
Placing a cache in front of a server hides much of the server delay after objects are in the
cache. Preloading those objects in the cache can be easily controlled by the server
administration, if it owns both the servers and the server-side cache.
The cache can also be placed close to the client so that network latencies are eliminated as
well. In those cases, the client-side caches are probably owned by the end user's ISP and
will depend on caching headers for information about expiration.
All browsers also contain caches; this is readily apparent when an end user navigates by
using the Back button on the browser. Note that a lot of the end-user activity may be
concealed from server management tools if it comes out of cache instead of from the
original server.
Content Distribution
On the face of it, there seem to be significant business opportunities for those who can
deliver high-quality, content-rich services, including detailed graphics, animation, and
sound. Service providers are also attracted to the potential of these high-value and high-
system which pieces of the page are unchanging, what their cache expiration times are, and
how to do some simple processing to select one of a number of web page fragments for
inclusion in a web page to be delivered to an end user.
Content distribution is available by assembling content servers and content managers from
components sold by cache vendors or by subscribing to a content distribution network
service, such as those furnished by Akamai, Speedera, and Mirror Image.
Current behavior based upon CPU load, memory usage, network, and disc activity, for
instance
discarded connections, the number of times particular servers have been chosen, and more.
The load-distribution system can then be tuned to handle performance situations as they
occur. Detailed usage information can also be extracted for accurate billing and resource
forecasting. Through a Web services XML interface, F5 network devices can be integrated
directly with any third-party application.
Cache Instrumentation
It's important for instrumentation design to consider the fact that caches absorb incoming
requests from end users. Phantom objects (page bugs), discussed in Chapter 8, can be used
to count web page downloads even when the entire page is cached. The phantom object's
file is simply marked with a cache header as uncacheable, or it is given an attribute, such as
a query string, that cache agents will avoid. That phantom object will always be fetched
from the origin server, even when the entire rest of the page is fetched from cache. The
relatively slow delivery of the phantom object won't interfere with the end user's
perception of page performance, as it's usually an invisible, one-pixel object.
If server-side caches are used, the cache's performance data is available for analysis.
Information on cache hits and misses can be used to compute the bandwidth savings
resulting from the cache. (For browser-side caches, such computations can also be made,
but they're slightly more complex. The number of page views must be combined with
knowledge of how many page elements were actually fetched from the server and
compared to the number of elements designed into the page.) In any case, end-user
measurements are needed to see the impact of caching on end-user performance.
Other metrics for the content-delivery infrastructure can be derived as well. One useful
measurement would assess the real impact of the content-delivery infrastructure on the
backbone. Bandwidth gain is a metric that compares the total content delivered to
consumers to the backbone bandwidth needed for preloading objects, refreshing content,
and accessing the server when an object is not resident in the cache. For example, the
bandwidth gain would be 5 if the cache delivered 100 Mbps of content while using 20 Mbps
on the backbone for cache overhead. This shows the benefit of content servers on the edge
versus upgrading the backbone for a centralized approach. The bandwidth gain becomes
even more significant in the aggregate view. If there are 25 content servers at the edge, each
with a bandwidth gain of 5, the backbone impact is very clear.
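The arithmetic behind this metric is straightforward; a sketch using the numbers from the text:

```python
def bandwidth_gain(delivered_mbps, backbone_mbps):
    """Content delivered to consumers divided by the backbone bandwidth
    spent on preloads, refreshes, and cache misses."""
    return delivered_mbps / backbone_mbps

# The example above: 100 Mbps delivered for 20 Mbps of backbone overhead.
print(bandwidth_gain(100, 20))   # 5.0

# The aggregate view: 25 edge servers at the same gain deliver 2500 Mbps
# of content while drawing only 500 Mbps from the backbone.
print(25 * 100, "Mbps delivered;", 25 * 20, "Mbps of backbone used")
```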
Content delivery networks can supply detailed information, including not just workload
volumes but also information about the geographic location of your end users and the
particular web pages or other content that they request.
Summary
There are a growing number of options for boosting server performance and availability.
Global load distribution and a tiered architecture provide high levels of availability with
two levels of redundancy: multiple sites and multiple servers within each site.
Individual servers must be optimized as they are assigned to tiers. As transactions flow
through the tiers, they can be load balanced by content switches or accelerated with new
front-end processors.
Content delivery infrastructures speed the delivery of content, opening new opportunities
for service providers and their customers. Providers have new, high-margin services to
offer, while customers have new applications that save money, increase competitive
advantage, and strengthen their Internet presence. The traditional use of a centralized origin
server is being replaced by a set of content servers at the edge. That enables customers to
get high-quality, content-rich services, while providers avoid moving large volumes of
time-sensitive content across their backbones.
The distribution of servers across the network means that managing server instrumentation
is becoming a critical skill. Load distribution, caching, and content distribution networks
greatly affect both performance and the management information that is available to system
administrators. (A lot of the end-user activity may be concealed from the original servers
management tools if it comes out of cache or content-distribution networks instead of from
the original server.)
These services also need to be managed to ensure that they provide a return on investment.
Each type of service provides element-level management data, but synthetic transactions
from many locations in the system are important for Service Level Agreements (SLAs) and
for handling performance issues.
CHAPTER
10
The control and measurement of transport service quality when the traffic flows
among separate organizations
These were first introduced in Chapter 2, Service Level Management, and they are
expanded upon here.
percentile value of all of those measurements is used as the basis for billing. An effect of
this is that the top five percent of the five-minute samples are ignored; you can therefore
burst up to the maximum bandwidth of your access line for up to five percent of the month
without any additional cost.
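The billing rule can be sketched in a few lines. (Providers differ in the exact percentile convention; this version sorts the month's samples and keeps the highest one below the 95th-percentile cutoff.)

```python
def billable_rate(samples_mbps):
    """95th-percentile billing: sort the 5-minute usage samples,
    discard the top 5 percent, and bill at the highest remaining rate."""
    ordered = sorted(samples_mbps)
    cutoff = int(len(ordered) * 0.95)   # number of samples kept
    return ordered[cutoff - 1]

# A month simplified to 100 samples: steady 10 Mbps with 5 bursts
# at the 45-Mbps access-line rate.  The bursts are free.
print(billable_rate([10.0] * 95 + [45.0] * 5))   # 10.0
```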
In both cases (total number of bytes transmitted or peak usage), the workload measured is
not precisely the same as the workload or bandwidth as seen by the application. If errors
interfere with data packets, and those packets are therefore retransmitted, the low-level
workload metrics usually count those packets again. The paradoxical result is that as link
quality deteriorates, the byte count carried by that link rises!
are one or more errors within a particular data block; any error at all will require
retransmission of the entire block if perfect transmission is required. The usual exception
is streaming media, which can accept low error rates without major impact on the
application; in those cases, simple bit error rates may be sufficient.
Packet loss over an Internet connection may be defined very coarsely in terms of ping ratios. For
that measure, short ping packets are transmitted in a burst and are immediately echoed back
by the destination. The percentage of packets that return within a defined, short time window is
taken as the success ratio. This is an extremely coarse approximation because Internet paths are
variable, and error rates can fluctuate greatly because of intersecting traffic flows. The effect of
errors on TCP communications is also quite complex, as TCP is very sensitive to the particular
pattern of errors over time. Therefore, a brief burst of ping packets may not be representative of
the effective performance of the connection as seen by TCP. Block-oriented error ratios for the
underlying transport, such as ATM's cell error ratio or similar measures for Frame Relay, are
therefore preferable. They're not perfect for TCP, but they're better than ping ratios.
One-Way Latency
Services in the interactive classes may require one-way delay measurements. Routing
protocols often select different paths in each direction between a pair of nodes, so the
latency in the two directions can be considerably different. (Each ISP typically tries to hand
a packet to another ISP as soon as possible, thereby decreasing the distance it must carry
the packet.) See Figure 10-1.
Figure 10-1 Internet Latencies and Asymmetric Routes
The challenge is getting the time measurement between two unsynchronized hosts. The
Network Time Protocol (NTP) can be used; there are commercial variants that offer similar
functionality. A more expensive, and more accurate, approach is using Global Positioning
System (GPS) receivers to measure time and synchronize the clocks at each site.
Round-Trip Latency
Round-trip latency is the common metric for transactional and (some) interactive services.
Synchronizing clocks is not an issue because the initiator gets a response and can easily
determine the elapsed time, which is the metric of interest.
Round-trip latency on the Web is easily measured by using the time elapsed between the
first two steps in the establishment of a TCP connection. (A SYN packet is sent out, and
a SYN ACK packet is returned.) The turnaround time at the destination is minimal, as it
does not involve the destination application. Most active measurement collectors provide
that measurement, labeling it initial connection time or TCP connection time. It can also be
obtained from ping packets.
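A rough version of this measurement can be taken with nothing more than a socket call, because connect() returns once the SYN-ACK has been processed. This is only a sketch of the idea, not how commercial collectors implement it:

```python
import socket
import time

def tcp_connect_time(host, port, timeout=5.0):
    """Round-trip estimate from the TCP three-way handshake.

    connect() completes when the SYN-ACK arrives, so the elapsed time
    approximates one network round trip plus minimal turnaround at the
    destination -- no application code is involved on the far end.
    """
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        elapsed = time.perf_counter() - start
    return elapsed * 1000.0  # milliseconds
```

A call such as `tcp_connect_time("www.example.com", 80)` yields the "initial connection time" that active collectors report.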
Jitter
Jitter is the variance in packet arrival times. It can have a serious impact on service quality,
although buffering in the receiving host helps considerably. Anyone with a CD in his or her
car appreciates the value of buffering and knows its limitations when there are too many
bumps in the road. The real problem is the additional delay added by the dejitter buffer; it's
usually one or two times the expected jitter. For interactive use of the network, such as voice
communications, ITU-T's standard G.114 suggests a maximum one-way latency of 150
milliseconds (ms). Therefore, a dejitter buffer of, say, 50 ms would form a large part of the
latency budget.
Jitter is measured by tracking the arrival time of each successive packet and calculating the
variance between them, assuming a transmitter is introducing traffic into the network at a
constant rate. Some commercial tools enhance this measure with additional analytics; for
example, the NetIQ product calculates the distribution of the jitter measures to provide
more insight into the range of underlying performance.
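As a sketch of the basic calculation, before any vendor-specific analytics are layered on, jitter can be taken as the spread of the inter-arrival gaps. The arrival times below are invented for illustration, assuming a sender transmitting every 20 ms:

```python
from statistics import pstdev

def interarrival_jitter_ms(arrival_times_ms):
    """Jitter as the spread (population std. dev.) of inter-arrival gaps,
    assuming the sender transmitted at a constant rate."""
    gaps = [b - a for a, b in zip(arrival_times_ms, arrival_times_ms[1:])]
    return pstdev(gaps)

# Packets sent every 20 ms; network delay variation perturbs arrivals.
arrivals = [0.0, 20.0, 41.5, 60.2, 80.0, 101.3]
jitter = interarrival_jitter_ms(arrivals)  # roughly 1 ms of jitter
```

Commercial tools typically go further, reporting the full distribution of these gaps rather than a single summary number, as the NetIQ example in the text describes.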
QoS Technologies
QoS technologies classify network traffic and then ensure that some of that traffic receives
special handling. The special handling may include attempts to provide improved
availability, error rates, latency, and jitter.
However, because of the perceived complexity of QoS, many organizations choose to
implement service quality differentiation through the use of separate facilities (isolation)
instead of through QoS technology. For example, a separate LAN can be built to handle
Voice over IP (VoIP), thereby isolating it from delays caused by large-scale file transfers on
the data LAN. The QoS alternative of using frame tagging on the LAN may appear to be
too complicated. In some cases, the organization may simply over-provision the transport
facilities massively and hope that bandwidth constriction will never occur. In addition,
signaling between ISPs for QoS is generally not done, so use of QoS technologies across
the public Internet is impractical.
Even when it's completely implemented, QoS does not necessarily guarantee particular
performance. Performance guarantees can be quite difficult and expensive to provide in
packet-switched networks, and most applications and users can be satisfied with less
stringent promises, such as prioritization only, without delay guarantees.
For the stated reasons, QoS technology is primarily used in private networks and has not
yet achieved the widespread use that was predicted some years ago. Nevertheless, with the
growing use of VoIP and other latency-sensitive network uses, interest in QoS is growing.
This section of the chapter discusses the major QoS technologies. The QoS technologies
are placed into two groups: tag-based QoS, which relies on identification tags placed into
data frames and used by network switches; and traffic-shaping QoS, which tries to manage
bandwidth allocations through queuing or rate-shaping at a single point instead of through
the active cooperation of all network elements and explicit tagging.
This section of the chapter also discusses the alternative to the major QoS technologies. This
alternative is called over-provisioning, also known as "design by hope."
Tag-Based QoS
Networks forward traffic through routers, switches, and access devices. The transport
infrastructures must make forwarding decisions based on the required treatment for each
traffic flow.
In tag-based QoS, traffic is initially classified by having the appropriate forwarding
information added to each packet. Desktops, other customer input devices, and network
devices can classify and mark the packets, possibly relying on a central database of
authorizations. Traffic can be identified by end user, protocol, and application at the
network entry point. Then the classifier can decide whether to admit the traffic to
the network and, if admitted, which classification tag to place into the data packet headers.
Switches, routers, and other network devices in the core of the network then examine the
tags to determine how to handle the traffic.
There are different types of QoS technologies that use classification and tagging. This
subsection describes IEEE 802 LAN QoS, IP Type of Service (TOS), IP Differentiated
Services (DiffServ), Multiprotocol Label Switching (MPLS), and Resource Reservation
Protocol (RSVP) as examples; other technologies are also available.
802.1p field) for priority. There are therefore eight possible non-overlapping priorities.
There is also a field (the 802.1Q field) that enables the identification of up to 4094 virtual
LANs (VLANs). VLAN traffic may also receive differentiated treatment.
IP TOS
The TOS byte has been part of the IP header from the earliest specification. It provides three
bits that can be used to differentiate priority levels. Routers can examine these bits to set
queuing priorities and to select among routing options. It's also possible to set router filters
to examine other parts of the packet header (for example, the protocol type or the origin/
destination address pair) when choosing a particular forwarding priority.
Some administrators use TOS fields and filtering to provide very coarse prioritization
within the router queues. At most, one or two classes of traffic, such as certain transactions,
are given priority over other traffic in the queues. No strict QoS guarantees are made, and
there is no attempt to influence routing decisions at all.
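On most operating systems, an application can set the TOS byte itself through a socket option. The sketch below marks a UDP socket with IP precedence 5 (the top three bits of the TOS byte); whether any router actually honors the marking depends entirely on network configuration.

```python
import socket

# IP precedence occupies the top three bits of the TOS byte;
# precedence 5 shifted into place gives 0xA0.
PRECEDENCE_5 = 5 << 5

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, PRECEDENCE_5)
# Datagrams sent on this socket now carry TOS 0xA0 in their IP headers.
tos = sock.getsockopt(socket.IPPROTO_IP, socket.IP_TOS)
sock.close()
```

This is exactly the kind of end-system marking that the coarse prioritization schemes described above rely on: the router's queuing filters match on these bits.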
IP DiffServ
DiffServ technology can provide both performance guarantees and performance
prioritization. It does that by using the TOS byte (renamed the DS byte) in the IP header to
indicate the QoS class. All the information that the router needs to handle the packet is
contained in the packet header, so routers don't need to learn or store information about
individual traffic flows. The disadvantage of DiffServ is that flows must be handled as part
of larger groups; it's not possible to single out a particular flow for special handling,
independent of other flows. Instead, it must be grouped with many other flows for its trip through
all the routers, and it will receive the same handling as all of the other flows in its group.
Aggregation into a small set of classes simplifies the management of large numbers of
flows, improving scalability for large backbones. Each router, for instance, interprets the
DS byte and follows the associated forwarding behavior.
Devices at the network boundary may be used to set the DS byte according to current
resource allocation policies. They can map between the DS byte and IEEE 802 QoS tags, for
example.
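Such a boundary mapping can be as simple as a table. The sketch below uses the common Class Selector convention (CSn = n << 3), which preserves the old precedence semantics; an actual deployment would encode site policy instead.

```python
def dot1p_to_dscp(priority):
    """Map an IEEE 802.1p priority (0-7) to the matching DiffServ
    Class Selector code point (CS0..CS7)."""
    if not 0 <= priority <= 7:
        raise ValueError("802.1p priority must be 0-7")
    return priority << 3

def dscp_to_ds_byte(dscp):
    """Position a 6-bit DSCP in the top six bits of the DS byte."""
    return dscp << 2

voice_dscp = dot1p_to_dscp(5)        # CS5 = 40, often used for voice
ef_ds_byte = dscp_to_ds_byte(46)     # EF (46) appears as 0xB8 on the wire
```

The reverse direction (DS byte back to an 802 tag) is just the same shifts inverted, which is why a boundary device can translate between the LAN and IP schemes cheaply.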
MPLS
A classical router processes each packet by doing the following:
Checking whether the packet should be discarded for being too old (the time to live
has expired)
The router repeats these steps even if the next arriving packet belongs to the same flow. This
approach becomes a bottleneck with higher flow volumes and faster trunk speeds. MPLS is
intended to overcome the limitations of classical routers in backbones with tens of
thousands of flows.
MPLS adds a tag to each packet; each tag is associated with a predefined routing and
handling strategy. Each router simply reads the tag, using it to identify the next hop and
forwarding policies to use. Processing time is shortened to a table lookup, and traffic is
forwarded quickly through the MPLS domain.
MPLS was originally designed for the dense Internet core where high volumes must be
routed with no delays. Administrators define the sets of routes and treatments associated
with each label and distribute the information to the core routers. Edge devices are also
configured to identify incoming service flows and append the appropriate tag or label.
Administrators can take advantage of this strategy to build static routes for traffic with time
constraints, using routes with fewer hops and higher speed trunks, for instance. They can
also choose to allow dynamic routing where the routers exchange reachability information
and make adjustments on their own.
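The per-packet work of an MPLS core router reduces, in essence, to a label lookup and swap. The toy table below is invented for illustration; real label bindings come from the administrator or a label distribution protocol:

```python
# Toy label-forwarding table: the per-packet decision collapses to a
# single dictionary lookup instead of a full routing computation.
# Labels and next hops here are illustrative only.
LABEL_TABLE = {
    100: {"next_hop": "10.0.0.2", "out_label": 201},
    101: {"next_hop": "10.0.0.3", "out_label": 305},
}

def forward(packet):
    entry = LABEL_TABLE[packet["label"]]   # one table lookup
    packet["label"] = entry["out_label"]   # label swap for the next hop
    return entry["next_hop"]

pkt = {"label": 100, "payload": b"..."}
hop = forward(pkt)  # packet leaves toward 10.0.0.2 carrying label 201
```

Because every router in the domain does only this lookup-and-swap, the expensive classification happens once, at the edge.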
RSVP
RSVP is a common mechanism for reserving bandwidth across a single network
infrastructure. A receiver initiates a reservation request for a requested flow. The request is
passed through the network devices, and those that are RSVP-enabled reserve the
bandwidth as requested. The traffic flow, identified by its addresses and protocol type, is
then given special handling when it passes through the RSVP-enabled devices that have
accepted the reservation.
One of the drawbacks of RSVP is that devices that do not support it simply pass the request
through. This leads to a situation where a device might not support RSVP but does not
inform anyone of that fact; the data packets depart their point of origin assuming that they
have a dedicated route, but arrive and find no resources reserved on their behalf. Rerouting
of traffic flows, which is not uncommon in IP-based networks, may also result in the traffic
flow going through routers that are temporarily unaware of the special handling that the
packets should receive.
The network must also handle all service flows appropriately, allocating the resources
needed to comply with the constraints of every active SLA.
Traffic-Shaping QoS
Bandwidth management is an essential function for guaranteeing service quality. Network
bandwidth is shared among a competing set of service flows and must be allocated and
managed effectively. The most critical services must receive sufficient resources to meet the
objectives set forth in SLAs.
The tag-based approaches to QoS try to perform bandwidth management by tagging
specific packets and then instructing all the network equipment to give those packets
preferential treatment. Those approaches have difficulties if all the pieces of network
equipment in the data flow path don't participate. Use of traffic-shaping QoS is an
alternative.
In traffic-shaping QoS, a special appliance or process in a router is invoked to identify data
flows (by their source and destination addresses) and sort them into different queues or
otherwise manage their data rates. The appliances or processes try to change the
characteristics of the traffic itself rather than trying to control the handling of packets
between the connection end points. If that appliance or process is located at a key point
through which all traffic flows, it can control the available bandwidths even though all the
devices in the data flow's path don't participate. Traffic-shaping QoS is not always as
precise as tag-based QoS approaches, but it's easier to implement.
There are two basic approaches to traffic-shaping QoS: rate control and queuing. Each is
discussed in the following sections.
Rate Control
Rate control is a QoS strategy that helps regulate a set of TCP flows with a range of
forwarding needs. Rate control regulates the introduction of traffic into the transport
infrastructures to minimize the interference among flows competing for the same network
resources and to set relative priorities for access to scarce resources. Coordinating the
behavior of a group of connections is a large departure from the basic free-for-all of the
original TCP design concepts and implementations. (See the preceding sidebar for more
details.)
Rate control is analogous to the air traffic control system. You have probably had the
experience of having your flight delayed because of congestion at the destination airport.
You don't take off until there is an opening for you at the other end, the weather clears, or
other situations improve. In contrast, the typical TCP philosophy discussed in the sidebar
could be characterized like this: launch all the planes and hope they land before they run
out of gas or the skies get too crowded.
Packeteer started this niche in 1996 when they introduced their PacketShaper product line,
based on Packeteer's patented rate control technology. A PacketShaper device is placed
between the sender and receiver, typically in front of servers or at customer-provider
network demarcation points. This enables it to intercept the receiver's feedback (TCP flow
credits and acknowledgements) and adjust the actual TCP connection behavior by
manipulating the timing of protocol acknowledgments and flow control allocations. The
PacketShaper inspects the packet headers for address, protocol, and application
information, classifies them, and then applies the appropriate policies as the traffic flows
through it. This approach leaves the connection endpoints unchanged and unaware of
actions taken by the queuing appliance.
The PacketShaper smooths bursty traffic and thus minimizes its impact on other service
flows. TCP connections can be assigned a guaranteed bit rate by the PacketShaper, a
function not explicitly enabled by the TCP specifications. The assigned rate is a minimum
guaranteed rate, allowing for higher rates when additional bandwidth is available. If a flow
for a traffic class cannot get the required bandwidth guarantee, the connection request can
be refused, or the connection can be established without a guarantee.
The PacketShaper can also block service flows from the network if desired. A discard
policy blocks connection attempts and discards packets without notifying the user. The
granular classification lets the PacketShaper redirect web users to an error URL that
informs them of the blockage. This technique lets administrators keep unwanted flows off
of their networks or allow them only at specified times.
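Packeteer's patented mechanism works by manipulating TCP feedback, but the general idea of admitting traffic at a sustained rate with a bounded burst can be sketched with a standard token bucket. All numbers here are illustrative:

```python
class TokenBucket:
    """Minimal token bucket: admits traffic at a sustained rate with a
    bounded burst. This illustrates rate enforcement in general, not
    Packeteer's TCP-feedback technique."""

    def __init__(self, rate_bps, burst_bytes):
        self.rate = rate_bps / 8.0      # refill rate in bytes per second
        self.capacity = burst_bytes     # maximum burst size
        self.tokens = burst_bytes       # start with a full bucket
        self.last = 0.0

    def allow(self, packet_bytes, now):
        # Refill tokens for the time elapsed, capped at the burst size.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if packet_bytes <= self.tokens:
            self.tokens -= packet_bytes
            return True                 # conforming: forward the packet
        return False                    # non-conforming: delay or drop

bucket = TokenBucket(rate_bps=8000, burst_bytes=1500)  # 1 KB/s, one-packet burst
```

A shaper delays non-conforming packets until tokens accumulate, whereas a policer drops them; both enforce the same long-run rate.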
Queuing
Queuing can be used to reorganize the traffic streams passing through the queue. For
example, a low-priority packet is queued and held if higher-priority traffic is waiting to be
forwarded. An arriving high-priority packet is placed in the queue ahead of lower-priority
packets.
In queuing-based QoS, the packets in an arriving flow are inspected and assigned to a class.
All flows that are members of a class share a queue. Packets are transmitted from the queue
based on relative queue priority and rules of fairness among queues to ensure that flows
have enough (even if minimal) resources to continue operations.
Class-based queuing (CBQ), originally developed by Sally Floyd and others at the
Lawrence Berkeley Laboratory, is an attempt to provide fair allocation of bandwidth
without requiring massive amounts of processing power in the network devices. Each class
of user is guaranteed a certain minimum bandwidth, and any excess bandwidth is allocated
according to rules set up by the network administration. Specific implementations of CBQ
have been designed with an entire hierarchy of classes, with, for example, excess capacity
redistributed within each branch of the hierarchy as much as possible.
A drawback to CBQ is that there is no fairness within a class. A large burst of packets from
one member of the class extends the waiting time for packets from the other flows in that
class. This causes inconsistency in forwarding and possible quality fluctuations.
Weighted fair queuing (WFQ), which is more complex than CBQ and requires much more
processing power, can be used to provide absolute guarantees of maximum latency.
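A minimal sketch of the class-based idea: each class owns a FIFO, and a weighted round-robin scheduler guarantees every class some minimum share of each transmission round. Real CBQ implementations add borrowing hierarchies and much more; the classes and weights below are invented.

```python
from collections import deque

class ClassBasedQueues:
    """Each class gets its own FIFO; the scheduler serves classes in a
    weighted round-robin so every class keeps a minimum share of the
    link, no matter how busy the other classes are."""

    def __init__(self, weights):
        self.weights = weights                       # class -> slots per round
        self.queues = {c: deque() for c in weights}

    def enqueue(self, cls, packet):
        self.queues[cls].append(packet)

    def dequeue_round(self):
        """Transmit one scheduling round and return the packets sent."""
        sent = []
        for cls, weight in self.weights.items():
            for _ in range(weight):
                if self.queues[cls]:
                    sent.append(self.queues[cls].popleft())
        return sent

q = ClassBasedQueues({"voice": 3, "bulk": 1})
for i in range(4):
    q.enqueue("bulk", f"b{i}")
    q.enqueue("voice", f"v{i}")
first_round = q.dequeue_round()  # voice gets 3 slots, bulk gets 1
```

Note the intra-class unfairness the text describes: all "voice" senders share one FIFO, so one chatty sender can delay the others in its class.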
Levels of Control
There are macro- and micro-levels of control involved for flow-through QoS.
The micro-level is the management of internal flows within any organization or provider
network infrastructure. All the tools mentioned in the first part of this chapter can be applied
as needed. The management team for that infrastructure has the responsibility of managing
their own infrastructure to meet compliance criteria.
The macro-level entails monitoring end-to-end service quality, identifying the portion
(responsible organization) of the infrastructure that contributes to poor service quality, and
verifying that quality is restored.
Demarcation Points
Demarcation points are the boundaries between management organizations and the
resources they control. Periodic measurements can be made across the cloud between
different demarcation points. The collectors use active techniques to exchange and measure
traffic between themselves. The basic delay measurements are augmented by jitter and
packet loss measurements. A trend indicating degrading service quality triggers an alert to
management systems that verify the measurements and activate the appropriate management tools to oversee the details of resolving the problem.
(Figure: management console screen capture showing Measure, Alarm, Diagnose, Test, Admin, Help, and Log Out functions.)
End-user measurements can also be used in attempts to bypass transport provider problems.
A new approach named route control is emerging as a way of allocating traffic among a set
of service providers to maintain overall control of service quality.
Many organizations use several ISPs to gain the following advantages:
A router on the customer premises has connections to each service provider and selects one
using the Border Gateway Protocol (BGP). BGP has some shortcomings. It is complex and
difficult to tune, and it can send traffic over paths that don't provide the lowest latency or
the lowest cost. In addition, it can certainly create situations in which low-priority traffic
flows across the ISP with the highest charges.
netVmg, an early mover in route-control solutions, introduced the Flow Control Platform
to address the problems in what they term the "middle mile": the set of Internet backbones
between the sender and the receiver.
The Flow Control Platform inspects outgoing flows, identifies the destination networks, and
uses proprietary approaches to measure performance to each destination across all the
attached ISPs. Performance baselines are monitored for conformance to latency, packet
loss, and cost policies. If the current ISP is unable to comply, the Flow Control Platform
selects another outgoing ISP based on cost and performance specifications. It sends a BGP
update to the web site's boundary router to initiate a change to a new provider. The Flow
Control Platform also allows for more sophisticated policies that consider security and
other factors, such as requiring or forbidding certain ISPs. Customers also have the
advantage of consistent provider measurements to facilitate SLA adjustments and contract
negotiations. They can optimize a set of provider services to their needs.
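The selection logic of such a route-control device can be sketched as a policy filter over per-ISP measurements. The policy and all the figures below are hypothetical, not netVmg's actual algorithm:

```python
def choose_isp(measurements, max_latency_ms, max_loss):
    """Pick the cheapest attached ISP that meets the latency and loss
    policy; if none complies, fall back to the lowest-latency ISP."""
    compliant = [m for m in measurements
                 if m["latency_ms"] <= max_latency_ms and m["loss"] <= max_loss]
    if compliant:
        return min(compliant, key=lambda m: m["cost"])["name"]
    return min(measurements, key=lambda m: m["latency_ms"])["name"]

# Invented per-destination measurements across three attached ISPs.
isps = [
    {"name": "isp-a", "latency_ms": 80,  "loss": 0.001, "cost": 9},
    {"name": "isp-b", "latency_ms": 45,  "loss": 0.000, "cost": 14},
    {"name": "isp-c", "latency_ms": 120, "loss": 0.020, "cost": 4},
]
best = choose_isp(isps, max_latency_ms=100, max_loss=0.005)  # "isp-a"
```

In a real deployment, the chosen ISP would then be installed by sending a BGP update to the boundary router, as the text describes.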
Even if route control isn't used, multiple providers or multiple transport services can
provide different levels of transport quality for an enterprise. The enterprise's border
switches or routers can sort outgoing traffic into different classes, depending on the
required transport performance (possibly signaled by the same tagging system used within
the enterprise). Those border devices can then route the outgoing traffic over the
appropriate provider service; for example, traffic requiring latency guarantees would be
sent over a service that can provide such guarantees, probably at a higher price than for
the service providing best-effort delivery. This is similar to the use of multiple, isolated
networks to provide different levels of service within the enterprise, as discussed in the
previous subsection on over-provisioning and isolated networks.
Summary
The transport infrastructure spans a large number of networks and traffic that must be
managed at two levels. The network managers must consider how they will measure and
manage the network characteristics of bandwidth, availability, packet loss, latency, and
jitter.
Flows within a single organization's domain are managed using specialized technologies
that can tag data packets for special handling and can make reservations for that special
handling.
Rate control and queuing can be used at key network points to manage bandwidth for all
the traffic streams passing through those points. Some organizations don't use flow
management technologies; they simply over-provision massively, hoping that there won't
be severe congestion problems.
Flows across management domain boundaries are an ongoing challenge. Attaining flow-through QoS and delivering consistent end-to-end quality independent of the set of ISPs
used is the goal. The technical means are becoming available with the introduction of
MPLS and route control, although the administrative burdens are still high.
PART III

CHAPTER 11

Load Testing
Load balancers, caches, and Quality of Service (QoS) enabled network devices
dynamically shift resources in real time to meet compliance requirements. In contrast, a
longer-term management focus is intended to identify future problems with enough lead
time to take the appropriate preventive actions. The long-term time scale is days or weeks
versus the minutes and seconds involved in real-time management processes. Real-time
operations play the hand that's been dealt; long-term operations focus on drawing better
cards to begin with.
If you are a services manager, one question that is usually in the back of your mind is this:
"When will my services fail to meet their quality objectives?" (That you will fail to meet
Service Level Agreement (SLA) objectives under some combination of circumstances is
almost a given. Perhaps a more accurate statement of the question is this: "Who breaks my
services first? My customers, or my testing team?")
The long-term functions discussed in this chapter and Chapter 12, "Modeling and Capacity
Planning," ensure that there are adequate resources to allocate and that you better
understand how to structure operational options. The long-term functions are capacity
planning, modeling, and load testing.
Capacity planning using modeling tools, discussed in Chapter 12, has often been difficult
and chancy, greatly depending on the assumptions you make about loads, request rates, and
other key variables. Analysis time lengthens as the numbers of elements and their
interactions increase. Moreover, there are many elements and effects whose interactions are
very difcult to quantify (latency and queuing are two key examples), and the simulations
are no better than the assumptions that went into them.
In contrast, load testing is empirical rather than analytic; it addresses the question in a more
straightforward manner: see how the services actually behave and where they really break. With
load testing, you can learn about problematic performance and address it before deployment.
You can then baseline the behavior so that you can allocate resources properly and set realistic
performance goals. Load testing can also be used to improve the accuracy of analytic tools.
Load testing uses active validation techniques to define the performance envelope of a
service or an element. The service or element being tested is subjected to an increasing
offered load until service quality begins to plummet. Alternatively, the service is subjected
to a steady load over a longer period of time, to identify effects, if any, of caching or
buffering.
This chapter discusses load testing of the server and application infrastructures, which can
be handled together. I don't discuss in detail the use of load tests to evaluate transport
infrastructures, but specialized tools and techniques are available to test transport networks.
They generate high traffic volumes and can be set to introduce errors and shift timing
relationships to create jitter between packets. Device vendors use these expensive traffic
generators to test and validate their products and to determine the relative strengths and
weaknesses of competing products.
A load testing strategy needs these pieces: a test bed, controllable load generators, good
system profiles, good analysis, and clear reporting tools. Toward that end, this chapter
covers the following:
There are two areas of interest in the classical load curve: where the behavior is linear
and where it changes to being nonlinear. The inflection point is the boundary between linear and
nonlinear responses and is a function of the peak service rate (whether measured in packets
per second, frames per second, or transactions per second) that the specific layer's
infrastructure has been engineered to provide. It's based on elementary queuing theory;
above the inflection point, even small increases in applied load result in large changes in
performance and, possibly, in availability.
The linear portion of the response curve represents the most stable and predictable part of
the performance envelope. It represents conditions where the resources are sufficient for the
load applied. Queuing delays are minimal, and the response time is low for a range of loads.
As the offered load grows, the response time begins to lengthen as resources are more
heavily subscribed. Loading up to the inflection point is (approximately) linear. Any
load increases up to the inflection point have the same impact: the incremental
response increases are invariant. Each increment of offered load has an equal
corresponding incremental increase in response time.
The slope of the linear portion gives important information: it indicates the sensitivity to
loading changes. A flat slope shows that response is less sensitive to a loading change than
a steeper slope. Flat slopes are desirable because they are characteristics that lend stability
to the behavior so that response time doesn't degrade as loads vary, something customers
demand. Note that a flat slope may also mean an underutilized system. However,
underutilized today also means headroom for further growth, tying directly to the
challenges of capacity planning.
An administrator or planner wants to work in the linear part of the performance envelope
because he or she can make fairly accurate estimates about expected response times with
increasing loading.
At the inflection point, the offered loads exceed the ability of the tested environment to
process them quickly enough, and queue lengths begin to increase exponentially. Delays
grow quickly, degrading time-sensitive activities, congesting servers and networks, and
causing customers to go elsewhere and online transactions to fail.
Note that it also takes some time to recover and shift back to linear operation, even if the
offered load is removed completely. Queue lengths must be reduced first. Administrators
want to avoid the nonlinear area because of its unpredictability. At one moment, operations
are still within the metrics of the SLA; then, when a small burst of requests arrives, the
entire operation can grind to a halt. This is risky ground to manage.
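The elementary queuing theory behind the inflection point shows up even in the simplest model, the M/M/1 queue, where mean response time is R = S / (1 - λS) for service time S and arrival rate λ. A short calculation reproduces the classical load curve's two regions:

```python
def mm1_response_time(arrival_rate, service_rate):
    """M/M/1 mean response time: nearly flat at light load, exploding
    as utilization approaches 1 -- the inflection-point behavior of the
    classical load curve."""
    rho = arrival_rate / service_rate      # utilization
    if rho >= 1.0:
        return float("inf")                # saturated: queues grow without bound
    return (1.0 / service_rate) / (1.0 - rho)

# Suppose the system can service 100 transactions/sec (10 ms each):
light = mm1_response_time(50, 100)   # 0.02 s: in the predictable, linear region
heavy = mm1_response_time(95, 100)   # 0.20 s: a 2x load gives a 10x response time
```

Going from 50 to 95 transactions per second (less than double the load) multiplies the response time tenfold, which is exactly why administrators want to operate well below the inflection point.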
On the public Web, there are two additional phenomena that must be considered: flash load
and abandonment.
On most non-Web, transaction-oriented computer systems, the queue is external; that is, it
is outside the system. Customers wait in telephone queues, or there is a pile of incoming
documents to be processed in front of each data-entry clerk. The flow of transactions is
therefore reasonably steady, with a firm maximum number of sessions set by the number of
clerk terminals or dial-in lines. Under heavy load, the external queue builds, and the result
is that there's a steady, unchanging workload, classically measured by the concurrent
sessions statistic.
On the public Web, in sharp contrast, the incoming traffic hits the system directly. Massive
flash loads can appear in response to a television ad or a mention in a news article, with
hundreds of thousands of users trying to establish TCP sessions simultaneously. Such loads
can overwhelm the system at the precise time that user satisfaction is most important. (Why
run a television ad and then convince most of the public that they never want to go to your
web site again?)
Loads on public web sites can therefore be much higher than the loads generated by
classical load-generation tools; special Web load generation tools are necessary. In
addition, load statistics for Web-based transactions should be in terms of arrival rate over a
given interval, not concurrent users. For example, the Keynote LoadPro service can handle
hundreds of thousands of concurrent user session initiation attempts flooding in from the
Internet, and it measures load in terms of session initiation rate, not in terms of concurrent
users.
The other major difference between classical load transaction testing and Web load
transaction testing is abandonment. In classical systems and on corporate intranets, users
don't abandon a transaction. They remain in the transaction until completion, regardless of
the amount of time it takes. There simply isn't anywhere else to go. Call center operators
and data entry operators must wait if the transaction response time is very slow, and external
customers who dialed into a corporate mainframe don't usually disconnect and then dial
into a competitor on a whim, just to see if the competitor is faster.
On the public Web, however, it's extremely easy to abandon a transaction; people do it all
the time. Worse, web protocols usually don't inform the system when an end user has abandoned
a transaction; the web server system must use timeouts or other special methods to guess
when an end user has abandoned a transaction. The result is that many transactions in a Web
system may be inactive, waiting for timeout, especially under a heavy load with the long
response times that encourage abandonment. If the Web system's abandonment-detection
and resource-recovery mechanisms are inadequate, the system may clog as a result of
massive numbers of abandoned transactions, leading to even worse performance and even
more abandonment in a vicious cycle. The system might be able to handle a brief peak load,
but be unable to endure a longer-duration peak load because it cannot recover resources
efficiently from abandoned transactions.
Load testing of public web sites must therefore include a way to simulate transaction
abandonment and must be run long enough to determine the endurance of the system. In
the Keynote LoadPro system, dissatisfaction and abandonment scores are kept for each
simulated user. They vary according to the type of user (beginner, experienced, and so on)
and according to the type of web page (home page, search page, and so on). They vary
because different classes of users have different tolerances for server delay, and users are
willing to wait different lengths of time for different types of pages. The LoadPro system
then simulates abandonment at the appropriate points, and it also reports a dissatisfaction
score at the end of the load simulation, to indicate aggregate user satisfaction with the entire
experience.
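A load generator can mimic abandonment by drawing a patience threshold for each simulated user and giving up when the response exceeds it. The uniform patience model below is invented purely for illustration; LoadPro's actual per-user scoring is proprietary.

```python
import random

def simulate_abandonment(response_times_s, patience_s, seed=None):
    """Fraction of simulated users who abandon. Each user draws a
    random patience around patience_s (here, uniformly within +/-50%)
    and gives up if the page response exceeds it."""
    rng = random.Random(seed)
    abandoned = 0
    for rt in response_times_s:
        patience = rng.uniform(0.5 * patience_s, 1.5 * patience_s)
        if rt > patience:
            abandoned += 1
    return abandoned / len(response_times_s)

# Five simulated page loads with worsening response times, 8 s nominal patience.
rate = simulate_abandonment([1.0, 3.0, 8.0, 15.0, 30.0], patience_s=8.0, seed=1)
```

Varying patience_s by user class (beginner versus experienced) and by page type (home page versus search page) would give the kind of differentiated tolerance model the text describes.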
Abandonment is another reason, in addition to flash loads, that concurrent sessions should
be avoided as a measure of load in the Web environment, where session termination is
difficult to detect. Concurrent sessions can, however, be used as a measure of performance.
For a given arrival rate of new transactions, the number of concurrent sessions decreases as
the system's performance increases; good performance allows end users to do their work
quickly and log off.
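The relationship just described is Little's law: average concurrent sessions equal the arrival rate multiplied by the mean session duration. A quick calculation (with invented numbers) shows why concurrency reflects performance rather than load:

```python
def concurrent_sessions(arrival_rate_per_s, mean_session_s):
    """Little's law: L = lambda * W. For a fixed arrival rate, faster
    service (shorter sessions) means fewer concurrent users."""
    return arrival_rate_per_s * mean_session_s

# Same offered load (50 new sessions/sec), different site performance:
fast_site = concurrent_sessions(50, 4)    # 4 s sessions  -> 200 concurrent
slow_site = concurrent_sessions(50, 20)   # 20 s sessions -> 1000 concurrent
```

Both sites face identical load, yet the slow one shows five times the concurrency, which is exactly why arrival rate, not concurrent users, is the right load statistic for the Web.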
There have been cases where the same testing organization has published different reports
showing that the sponsor of each won hands down over the same competitors that had
beaten them in a previous report. This happens for two reasons:
The worst case is a testing facility that caters to whoever pays the bills.
A more common reason is that each test uses slightly different conditions to favor the
sponsor. As a case in point, many years ago, a network device vendor trumpeted its
packet forwarding performance. It turned out that each interface card had storage for
recently used IP addresses, saving a routing database lookup and speeding up the
forwarding process. Needless to say, their testing process always used a small number
of IP addresses, so the local storage of recent IP addresses was leveraged to the hilt.
The test generated impressive numbers, but had little relevance to a real environment.
When a large set of IP addresses was used, the performance dropped significantly.
This isn't surprising, and there is still a lot that can be learned, but you have to work
a little harder. It's also an important lesson in attention to detail in designing internal
tests; such effects can be inadvertently introduced into any test environment.
It is important to understand as much as you can about the testing process behind the data
in the testing report. A good report should include the following:
A full description of the Device Under Test (DUT) features used. Are they realistic in
your world? Are features that lower performance used in the tests? Does the network
equipment handle a mixed background workload, along with any access control lists
and similar resource-consuming features, comparable to your environment?
A description of the testing profiles. What are the loading characteristics and
workload properties?
Gathering a set of load testing benchmark reports can be useful even if the results of the
tests are not directly comparable to your situation. Studies sponsored by your vendor's
competitors may reveal problems that your vendor will not mention to you. Benchmark
reports may also point out common problem areas for which you should be looking. The
value of independent element testing is that you don't pay for it, and it helps define realistic
envelopes and their associated thresholds.
Operational environments with thousands of servers are usually too expensive to replicate
completely. I have worked with several large organizations that have an exact copy of their
operating environment (hundreds of switches, servers, routers, and other equipment), but
these few companies are the exception to the rule. A smaller, but still faithful, copy must
be used, with adjustments to the load testing results to give realistic operational guidance.
In effect, the test bed is a manageable subset of the target production environment.
Costs of administering the test bed should not be underestimated. (In addition, as with most
testing, investments must be compared with the cost of failure if investments are not made.)
There must be close coordination between the operations and testing teams; new software
upgrades, patches, and changes in operating systems or connectivity must be included in
the test bed as well. Some organizations manage their test-bed environments as though they
were production-quality operations when new software is distributed, to ensure consistency
between the two environments.
An environment with a test bed and load generators can be useful in other ways. For
example, planners and administrators can carry out some real-world what-if analyses by
changing the test bed or the proles to test for extremes in behavior or sensitivities.
Experimenting with different conditions may also reveal more information. For instance,
performance degradation may follow different scenarios when it results from a steadily
increasing load versus when a sudden jump in the load occurs. Carefully controlling the
offered loads applied to the test bed can reveal inflection points in a variety of subsystems.
What-if analysis also extends to other management strategies. Trying new content
distribution techniques, or assessing the real impact of a cache in the test bed, helps drive
better resource allocation decisions.
Many valuable tests can be conducted in a local environment, especially those identifying
the inflection points for LANs, switches, and servers. However, for many services, the
Internet and the other networks used by the enterprise must be part of the testing procedure.
Testing across the Internet and enterprise networks is necessary because of the variability
introduced by distance and routing changes, the use of different providers, and the
interactions with other traffic flows.
Testers can use desktop systems located in various locations to drive the transactions at the
test bed. For large-scale tests over the Internet or within corporate networks, service
providers such as Keynote Systems and Mercury Interactive can provide load testing on
demand. In many cases, it's much less expensive to use a service for highly realistic,
massive load tests than it is to acquire the software, hardware, network connectivity, and
expertise to run the test on your own.
Testing staff should perform highly repetitive preparatory testing using in-house systems,
but they should consider using an external service for final, large-scale acceptance tests
before production. (Those external services can normally reuse your in-house scripts,
although they often need to be supplemented by parameters for abandonment behavior and
flash load characteristics.) External testing organizations offer advantages beyond just
saving money for major test efforts. They have extensive testing experience, may be faster
than your own organization, and do not take your staff away from their normal functions.
Load testing of web applications requires load generators, which are specially programmed
computer systems that produce large numbers of synthetic (virtual) transactions. Load
generators run scripts of synthetic transactions that follow a prescribed set of steps. Some
of these steps can be parameterized to simulate a wider variety of users or transaction types
found in normal operations. These steps may include the following:
Other tips that can help in this effort include the following:
Maintaining the same ratios of resources seems to help with gauging results. This is a
practical rule of thumb: the same dependencies are more likely to be revealed if the
ratio of aggregated uplinks to backbone speed is maintained. As an example, if the
production environment aggregates eight 100-Mb Fast Ethernet links into a 1-Gb
Ethernet backbone, the test bed should use the same ratios. Using half that number
would not create the same bottlenecks at high loads. It would, therefore, be misleading
if used to determine a practical level of service over-subscription that both supports
the target usage rates and is economical in terms of the needed amount of
supporting hardware and software.
Performing unit testing helps you see the maximum number of concurrent
transactions or connections a single server can actually handle, along with the effect
of adding multiple servers. (In some cases, you don't get full advantage from each
additional server because of inter-server synchronization overhead and other factors.)
Then the number of servers for a given load can be estimated. It is also important to
stress test elements, such as load balancers, to determine their actual performance
envelope for the anticipated number of connections and workload.
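The estimate described in the second item might be sketched as follows. The diminishing-returns model and the 0.85 efficiency figure are illustrative assumptions, not measured values from the text:

```python
def servers_needed(peak_tps, single_server_tps, marginal_efficiency=0.85):
    """Smallest server count whose modeled capacity covers peak_tps.

    The first server delivers its full unit-tested capacity; each
    additional server is assumed to contribute a diminishing fraction
    of it, a crude stand-in for inter-server synchronization overhead.
    """
    capacity, n = 0.0, 0
    while capacity < peak_tps:
        increment = single_server_tps * (marginal_efficiency ** n)
        if increment < 1e-9:
            raise ValueError("peak exceeds achievable capacity in this model")
        capacity += increment
        n += 1
    return n
```

With a unit-tested capacity of 400 transactions per second per server, covering a 1000-TPS peak takes three servers rather than the naive 2.5, because the second and third servers contribute less than the first.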
The collected information from load testing is used to characterize the performance
envelope, including the system behaviors under flash load and when users are abandoning
transactions. A good data management capability is needed to save and organize data from
a series of tests. Planners, developers, and administrators can compare results from
different tests and refine their understanding of behavior and their testing procedures.
Statistical techniques can be applied profitably to tease out the contributions of
different factors; good software for this has come down in price over the past several years.
The most important statistical technique is graphing; visual representation of test results
can quickly identify inflection points.
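The eyeball test that graphing provides can also be backed by a simple numeric check. The rough "knee" detector below flags the load at which the local slope of the response-time curve far exceeds its initial slope; the factor of 3 is an arbitrary illustration, not a standard:

```python
def inflection_point(loads, response_times, factor=3.0):
    """Return the first load at which the local slope of the
    response-time curve exceeds `factor` times the initial slope,
    or None if the curve stays roughly linear.
    """
    base = (response_times[1] - response_times[0]) / (loads[1] - loads[0])
    for i in range(1, len(loads) - 1):
        slope = (response_times[i + 1] - response_times[i]) / (loads[i + 1] - loads[i])
        if slope > factor * base:
            return loads[i]
    return None
```

Plotting the curve remains the best sanity check; this merely automates flagging the point where response times leave the linear domain.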
If the service you are testing will actually use an external provider for credit authorization,
clearing payments, or Customer Relationship Management (CRM), those behaviors must
be included. One issue is that there are fees for these services, and testing large transaction
volumes is expensive. There is also the problem of transactions creating data that has to be
backed out later after the testing is completed; no one wants thousands of products ordered
in the test to be actually shipped and paid for.
What is needed instead is easy access to the initiation of the external service. A simple
application can simulate receiving the request, waiting a defined time, responding, and then
replicating the external activities in a controlled environment. Such simulators of external
activities can also introduce errors or timeouts to see how such conditions affect
performance under load.
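Such a simulator can be very small. The sketch below stands in for an external credit-authorization service with a configurable delay, error rate, and timeout rate; the endpoint path, port, and response format are invented for illustration:

```python
import json
import random
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

DELAY_SECONDS = 0.25  # simulated processing latency
ERROR_RATE = 0.05     # fraction of requests answered with HTTP 500
TIMEOUT_RATE = 0.01   # fraction of requests never answered at all

class AuthStub(BaseHTTPRequestHandler):
    """Stand-in for an external authorization service during load tests."""

    def do_POST(self):
        # Consume the request body, then decide this request's fate.
        self.rfile.read(int(self.headers.get("Content-Length", 0)))
        roll = random.random()
        if roll < TIMEOUT_RATE:
            return  # drop the request so the caller's timeout fires
        time.sleep(DELAY_SECONDS)
        status = 500 if roll < TIMEOUT_RATE + ERROR_RATE else 200
        body = json.dumps({"approved": status == 200}).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep load-test output quiet
```

Running `HTTPServer(("localhost", 8080), AuthStub).serve_forever()` starts the stub; raising `ERROR_RATE` or `TIMEOUT_RATE` between runs shows how the system under load copes with a misbehaving partner without incurring per-transaction fees.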
For a new service, you may need some initial guessing because there is no real operational
data. Good instrumentation makes better operational data available after deployment. The
feedback from actual operations is used to modify the loading assumptions as needed.
Future tests will be more accurate because they use more accurate loading characteristics.
After the appropriate transaction set has been identified, the individual transactions must be
recorded and prepared for the testing phase.
Testers typically use a transaction recorder to capture real transactions and structure them
into scripts that can be replicated to different load generators and executed repeatedly. The
key is that the capture should not require any staff effort to effect the conversion to a test
script, as most of the early products of this type did.
The script of the recorded transaction is static: it captures one user accessing one set of
information, ordering one product, and using one credit card; therefore, it is of limited
value. Running the same transaction repeatedly will not offer much insight into the actual
operational behavior under a wider variety of loads and transactions. In fact, I know of a
team that tested against the same transaction for so long that they unconsciously adapted
their code to produce stunning performance for that transaction only.
Variability in the scripts is needed, and good load testing tools must provide this. The
variability should be structured in several ways. For example, a file with a list of user
names, products, catalog numbers, or other important information could be used as input.
File-driven variability is usually used first because repeating the tests accurately is helpful
in the early testing phases. Adding randomly generated transactions is also helpful for
testing a wider range of behaviors.
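Both kinds of variability can be combined in one transaction feed. In this sketch, the CSV column names and product codes are invented for illustration:

```python
import csv
import random

def transaction_feed(user_file, n_random, seed=42):
    """Yield file-driven transactions first (repeatable early-phase
    runs), then append randomized ones to widen behavioral coverage.
    """
    with open(user_file, newline="") as f:
        for row in csv.DictReader(f):
            yield {"user": row["user"], "product": row["product"]}
    rng = random.Random(seed)  # seeded so random runs are reproducible too
    products = ["P-100", "P-200", "P-300"]
    for i in range(n_random):
        yield {"user": f"synthetic-{i}", "product": rng.choice(products)}
```

Seeding the random generator preserves repeatability while still exercising a wider range of transactions than the recorded script alone.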
The actual testing process uses the generated scripts to simulate the activity of real
customers. The test procedure includes other parameters that govern its operations,
including the following:
A complex environment has dozens to hundreds of possible configuration options that can
be changed from test to test, and those permutations can quickly get out of hand. However,
statistical methods exist, in a discipline known as Design of Experiments (DOE), that make
it possible to achieve robust results without resorting to the full range of permutations.
Such methods have been used with great success in a wide variety of scientific, engineering,
and quality control disciplines. References and resources about DOE are available on the
Web.
Remember that an important goal of load testing is to break the service by forcing it into
operational areas in which performance rapidly degrades. The breakdown provides a rich
source of data if the test bed is instrumented properly. What you also want to know is where
the performance broke. Was it network congestion, server overload, poor application
design, the database, or a combination of factors that caused the problem? This information
offers one key to alleviating performance problems, or at least to pushing the inflection point
toward the right, meaning higher loads are sustainable before nonlinear behavior occurs.
The system elements are already instrumented to provide information on loads and
exceptions; sometimes the elements can provide all the information needed, by monitoring
their internal states during web transaction load testing. At other times, you'll need to drive
specialized transactions into a portion of the test bed to determine server, application,
network, or database delays.
Information about the contributions of specific components to overall response time allows
accurate investments and optimum returns.
Developers use load testing as feedback on the soundness of their designs and their use of
best practices. They are able to determine if the application meets basic performance,
stability, and reliability needs. Tracking the number of failed transactions is also critical.
High transaction volume at the edge of the performance envelope counts as success only if
the transactions are actually completing as expected. Developers need to know if fast
transaction execution actually masks failures within each attempt.
Load testing data also assists administrators with setting realistic thresholds and with
adjusting operations throughout a business cycle. The relationship of the inflection point to
the negotiated SLA must be evaluated. For example, if the required response time lies well
within the linear domain, to the left of the inflection point, there is a high probability that
delivered services will be compliant. In contrast, if the service will be operating in the
nonlinear part of its performance envelope, to the right of the inflection point, compliance
is likely to suffer.
Alarm thresholds can be set as a result of load tests and their determination of the inflection
points. As loading or response times approach the inflection point, the system can be
configured to issue an alert with a high severity level. A warning alert can be generated at
lower loading and response levels to give the operations team some lead time before
instability occurs. These alerts can be incorporated into automated operations or policy
systems (as discussed in Chapter 7, "Policy-Based Management") so that actions such as
redirecting traffic to other sites or bringing more resources online can be initiated
automatically as performance moves toward the inflection point.
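One possible way to turn a measured inflection point into alert settings is sketched below; the 75 and 90 percent margins are illustrative policy choices, not recommendations from the text:

```python
def alert_thresholds(inflection_load, warning_margin=0.75, critical_margin=0.90):
    """Derive warning and critical load thresholds from a load-test
    inflection point, leaving operations lead time before nonlinear
    behavior sets in.
    """
    return {
        "warning": inflection_load * warning_margin,
        "critical": inflection_load * critical_margin,
    }

def severity(current_load, thresholds):
    """Classify the current load against the derived thresholds."""
    if current_load >= thresholds["critical"]:
        return "critical"
    if current_load >= thresholds["warning"]:
        return "warning"
    return "normal"
```

The `severity` result is what would feed an automated operations or policy system: "warning" gives the team lead time, while "critical" can trigger traffic redirection or resource scale-out.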
Load testing results can also be used to make modeling tools more accurate and effective,
as discussed in Chapter 12.
Summary
Load testing is an important long-term function that is used to assist management in
understanding the behavior of their systems. Of particular importance is the inflection
point, where the relationship between applied load and system response shifts from being
linear to nonlinear. It can be used in setting operations alerts and parameters as well as in
anticipating future problems caused by lack of resources. Load testing can also be used to
help build models, as discussed in Chapter 12.
Load testing on the Web has some characteristics that are quite different from load testing
in pre-Web, transaction-oriented systems.
The first key difference is the appearance of flash load. This occurs because external users
connecting from the public Internet can appear in unprecedented numbers, much greater
than seen in controlled, proprietary systems.
The second key difference is abandonment and the fact that Web systems usually cannot
detect abandonment directly. Unlike call center operators and employees using an intranet,
the Internet's web users quickly abandon a transaction if the response time is too long. That
abandonment affects system load and therefore the response time for other users. Also, because
the system usually cannot directly detect abandonment but must use timeouts instead,
abandoned transactions can congest the system if resources are not efficiently recovered.
Load testing on the Web must therefore include user abandonment behavior and endurance
tests to evaluate the system's ability to detect abandoned transactions and recover resources
efficiently.
Of course, Web load testing must also involve all Web services, networks, and other
equipment to ensure that it is a realistic view of actual performance. Web load test services
may be useful to provide large-scale, highly realistic load testing as a final step before
production.
CHAPTER
12
Are there any instabilities we should know about, such as scaling under loads?
Where are we in the variable's range? Will a bigger change to it lead to a bigger
improvement?
Are there simpler or cheaper alternatives?
Because of limited time and resources, some organizations trying to answer these questions
often limit their approach to what has been done in the past. They are playing it safe, but
missing opportunities to make a bigger impact. The other risk is missing key trends by
focusing on the familiar. More than a few decisions have also been based upon just plain
guessing and hoping for the best.
Simulation modeling quickly explores many options, leading to better understanding and
decision making. The benefits of simulation modeling are as follows:
After a simulation model is constructed and validated, it can be used for exploring a
range of alternatives for planners, administrators, and designers. They can quickly
eliminate those alternatives that do not improve performance or service quality. Rapidly
iterating through alternatives leads to an optimum solution for a set of operating scenarios.
Being able to evaluate alternative designs and workloads without modifying hardware and
software can be a definite advantage.
Evaluating a range of alternatives is also helpful in identifying sensitivities. For example,
changing the loading characteristics, the transaction mix, the topology, or other factors can
identify specic sensitivities. A certain mixture of services may introduce instability and
mutual interference, while the same mixture with different proportions operates smoothly.
Simulation modeling tools allow more agility and faster results when compared to using
load testing on a test bed (which is discussed in Chapter 11). For example, with a model,
you can add a different kind of device or one that is needed and not yet delivered to the test
bed. This enables testing and analysis to go forward without waiting for all the real pieces
of the environment to be assembled. The elements of a model can be updated, replaced, or
modified in a matter of minutes, and new results can be produced quickly thereafter. In
contrast, ordering all possible products that can be placed in the test bed is not economically
feasible. Even if it were, delays in obtaining the products and integrating them into the test
bed must be considered. Modeling is especially useful when parts of the physical or
software infrastructure are not available and testing can begin without them.
Results are usually obtained faster with modeling than with load testing on a test bed
because there is no physical infrastructure to deal with and no software to modify. For
example, making a change to the test bed requires staff time to create any physical changes
such as altering connectivity, reassigning servers, moving switches, or changing link
capacity. Additional time may be required to modify software, update directories, and
adjust management tools to reflect changes. Further effort is needed to verify that the
changes were made properly and introduced no new sources of errors or problems.
The expenditures for the equipment in the test bed, for its administration, and for its
operation must also be considered relative to the costs for acquiring and learning to use
good modeling tools. It's a matter of balancing your investment strategy and making sure
that the complete suite of tools provides the most cost-effective management capabilities.
Some organizations have their integrator maintain a model that they use for planning and
what-if scenarios. Others acquire load testing tools or work with testing organizations.
Model Construction
Building a model has usually been a tedious and error-prone process. Describing the
elements and their relationships grows more difficult as the environment grows more
complex. The possibility of errors being introduced into the model also grows,
necessitating more laborious checking.
Modeling tools use automatic discovery as much as they can to simplify model building.
This approach works reasonably well at the topology level where the elements and their
connections can be determined by most discovery tools. The task gets more difficult when
the applications and dependent services are included.
As applications are distributed across many servers and data centers, understanding the
relationships among their components can be quite challenging. Most services are
dependent upon other applications and services, and the dependencies are usually
incorporated into the model manually.
Models are usually constructed by combining the interconnections of the system with the
characteristics of the individual system nodes. The interconnections, or topology, can be
discovered automatically from an existing system, or the topology can be constructed using
design tools. Even if the topology is discovered automatically or imported from other
systems that have discovered it automatically, some manual intervention may be needed to
ensure that it's accurate.
The individual system nodes are usually based on prepackaged object libraries, or
templates, that are ready for out-of-the-box model building. In these libraries, behavioral
descriptions are built for each type of object. For example, network object libraries
have descriptions for each device, detailing the maximum number of interfaces and
maximum link speeds, packet forwarding rates, Quality of Service (QoS) capabilities, and
other factors. A server object would describe CPU power, memory capacity, disk I/O rates,
and similar server behaviors.
Application objects are complex and usually require some manual characterization of the
application process flow. Sophisticated simulation modeling packages include
programming languages that can be used to construct those flows.
Planners use the library to build a model quickly and explore its behavior. The predefined
templates save time and reduce errors; they are complemented with tools for constructing
new objects and incorporating them into the library.
Models are then driven by a variety of inputs for a thorough coverage of the system's
performance envelope. Using actual inputs is always the best alternative. Actual network
traffic can be captured with a variety of collectors: remote monitoring (RMON) agents,
protocol analyzers, or a variety of point products. Transactions can be captured with
transaction recorders and from server logs. These sources give the most accurate input to
the model. Models can also be driven from scripts, files, or other sources. These inputs can
be tuned to stress different parts of the model and are also used as a repetitive, consistent
baseline to track changes in results.
The OPNET Network Editor is used to build and display topology information. Network
topology information can be imported or constructed graphically with the Network Editor.
Users have a palette of node and link objects to choose from while they build a topological
description of their environment. OPNET has an extensive object library, including objects
for an aggregated cloud node that can be configured with the latencies and packet-loss
ratios that have been measured from a real network. Customers can also create their own
objects for new devices. Simple dialog boxes for each object instance provide a means for
configuring them with the appropriate parameters, although reasonable defaults are
provided.
OPNET Flow Analysis can then be used to model the detailed characteristics of networks,
and OPNET Application Characterization Environment (ACE) can be used to model the
details of application transactions. ACE can use input from measurement collectors; it
discovers transactions and their detailed performance characteristics for input into the
model.
Similarly, the HyPerformix solutions include the HyPerformix Infrastructure Optimizer and
Performance Profiler; these jointly create or import topology information, model the
system, and use input from measurement collectors to discover transaction performance
characteristics for use in the models.
Model Validation
Determining the "good enough" point requires validation of the model's results. Then you
actually know how good it is and how good you need it to be. One approach uses the test
bed, if it exists, and compares actual results produced by the test bed to the model results.
When the discrepancy between them is acceptable, the model is good enough.
HyPerformix suggests driving the model to the point where the most heavily used server
is at 50, 70, and 90 percent of maximum capacity. They recommend as a validation
guideline that modeled server utilization should be within 10 percent of measured
utilization, modeled response time should be within 10 to 20 percent of measured response
time, and modeled throughput should be within 10 to 15 percent of measured throughput. Of
course, acceptable accuracy is also determined by your time and resource commitments,
tolerance for risk, and your staffing skill levels. At some point, the marginal value produced
by more refinements is not worth the time and expense to achieve them.
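These guidelines are easy to automate against each calibration run. A minimal check, using the tolerances quoted above with response time held to the looser 20 percent bound:

```python
def within(modeled, measured, tolerance):
    """Relative deviation check: |modeled - measured| / measured <= tolerance."""
    return abs(modeled - measured) / measured <= tolerance

def validate_model(modeled, measured):
    """Apply the validation guidelines quoted above (10% utilization,
    20% response time, 15% throughput) to one load point; returns the
    failing metrics, empty if the model is good enough.
    """
    tolerances = {"utilization": 0.10, "response_time": 0.20, "throughput": 0.15}
    return [metric for metric, tol in tolerances.items()
            if not within(modeled[metric], measured[metric], tol)]
```

Running this at the 50, 70, and 90 percent load points gives a quick pass/fail per metric before deciding whether further refinement is worth the expense.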
Comparing the model results with the test bed results can also identify areas where the
model's results can be adjusted with the real-world input from the test bed. Rather than
extensive modifications to the model, a simple adjustment of the results can sometimes
suffice. Data from the actual production environment can also be used to calibrate the
model results. Good instrumentation captures the loading characteristics and the responses.
The actual loads are used to drive the model, and its results are compared with those from
the actual production environment.
The model becomes even more valuable after it has been calibrated because its results can
be adjusted to achieve more accuracy. Combining modeling with load testing and other
capabilities builds a stronger overall long-term management capability.
Reporting
Presenting the modeling results in an easy-to-understand form is another key. Models, like
the environments they simulate, generate large amounts of data, and their value lies in
converting it into usable information, particularly through visual representation. A variety
of formats as well as the ability to interact with the data are key to effective analysis.
Interactive use of the model clarifies sensitivities to certain operating conditions, showing
the changes in model outputs that result from changes in model inputs.
Capacity Planning
Capacity planning is another key long-term operation. Many of the real-time technologies
I have discussed implement policies that describe how the resources are divided among a
set of competing services. Capacity planning ensures that there are enough resources in the
future to make the resource allocation strategies work; no real-time strategy works
effectively when its resources are over-subscribed.
The goals of capacity planning are similar to those for proactive management: to give the
management team sufficient time to take the necessary actions to prevent a service
disruption. In the case of real-time operations, the lead time is measured in minutes,
whereas capacity planning works on the scale of weeks or months.
Capacity planning is considerably more complex than in the early client-server days. In
early client-server designs, the basic environment was a server attached to a router.
Performance problems were usually addressed through boosting server performance or
increasing the Internet connection speed. Most of the time the problem was solved, or at
least postponed for some time.
Today's environments are not amenable to the blanket upgrade approach; adding resources
to every element is simply too expensive. Even when the funding for large-scale
over-provisioning is available, there is no guarantee that it will actually solve the problem.
Having excess resources helps with loading fluctuations, but the resource enhancements
that actually contribute to any improvement are often hard to pinpoint.
Planners need to understand the sensitivities, the factors that have the most influence on
behavior, in their environments. The dynamics of complex systems depend on the
relationships between resources and the changing distance between operating and
inflection points. Understanding sensitivities to user volume, transaction volume, service
mix, and other factors helps focus on the areas where the highest return will be realized.
Identifying the resources that are the first to be over-subscribed and congested enables
specific interventions that produce the largest improvement for the least investment. This
represents another chance to take a short breather before the process begins again. Some
other resource becomes a new problem area, and the planning and evaluation of alternatives
is repeated.
It's important to note that the different constituent domains of the services delivery
environment (software, hardware, servers, networking elements, and people) all scale
somewhat differently. As a result, there's no single capacity planning methodology that
extends across multiple domains. There are also important dependencies among them;
server scaling is obviously a product of the demands of the application or applications that
the server will be running.
Note that more mature applications, such as databases and enterprise resource planning
(ERP), are fairly well characterized, and they have the instrumentation to support analysis
of capacity horizons. Newer applications, such as directories and application servers, are
less straightforward. In both cases, there's a fair amount of literature that addresses
planning for different subsystems. However, given the expertise required for each domain,
effective capacity planning will remain an art of collaboration among such experts for the
foreseeable future.
Summary
Simulation modeling is increasingly important because the current operational complexity
and dynamism overwhelm staff. Models are helpful for predicting future behavior, finding
optimum solutions, and exploring alternatives. Models offer speed and agility compared
with setting up physical test beds.
Test beds are also useful as a reality check for the model; the fit of actual behavior to the
model results improves the confidence in the model and points to areas where tuning or
adjustments are needed.
Capacity planning prevents future disruption caused by insufcient resources.
PART
IV
Chapter 14
Chapter 15
Future Developments
CHAPTER
13
As shown in Figure 13-1, the goal is to determine when the project has broken even,
delivering additional value that matches the costs. The project begins in the red: there have
been expenses with no recouping of the investment yet. The starting point coordinates are
determined by the purchase and implementation costs (vertical axis) and the deployment
date. As the investment becomes operational, it begins generating business value, such as
increased revenue, reduced costs, or higher service quality.
[Figure 13-1 plots cumulative (Benefit - Cost) against Time. The curve starts in the red at
deployment, at a depth set by the initial cost, rises through the break-even point, and then
becomes profitable; the interval from deployment to break-even is the Time to Value.]
At some point in time, the cumulative benefit value matches the costs and the break-even
point is reached. The Time to Value is frequently used to describe the time needed to reach
the break-even point; shortening this time interval recovers the costs more quickly and
increases the leverage gained from that investment. The slope of the operations curve
indicates the rate of recovery: a steeper slope means a shorter Time to Value and a higher
payback for each succeeding interval.
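With the simplifying assumption of a linear benefit curve, Time to Value reduces to a one-line calculation; the dollar figures below are invented for illustration:

```python
import math

def time_to_value(initial_cost, monthly_net_benefit):
    """Months from deployment until cumulative benefit recovers the
    initial cost, assuming the linear benefit growth of Figure 13-1.
    """
    return math.ceil(initial_cost / monthly_net_benefit)
```

A $100,000 deployment returning $12,500 per month breaks even in 8 months; doubling the slope to $25,000 per month halves the Time to Value to 4 months.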
Each project alternative has its own ROI graph to find its Time to Value and value slopes.
For example, the solid line in Figure 13-1 indicates a stronger ROI potential than the dashed
one because the solid line's Time to Value is shorter and its value grows more rapidly over
the same time interval. The investment continues to provide additional value after reaching the
break-even point. Cost savings and deferred spending are direct business benefits that can
be realized throughout a long operational life.
The two ROI lines in Figure 13-1 are linear for simplicity; actual projections may be curved
or have discontinuities because of variations introduced by seasonal behavior, sudden
market shifts, or other events.
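For a linear projection, the break-even geometry in Figure 13-1 reduces to simple arithmetic. A minimal sketch, comparing two hypothetical alternatives (the costs and monthly benefit rates below are illustrative, not from the text):

```python
# Linear ROI model behind Figure 13-1: cumulative value is
# monthly_benefit * months - initial_cost, so Time to Value is where
# that expression crosses zero.

def time_to_value(initial_cost: float, monthly_benefit: float) -> float:
    """Months until the cumulative benefit matches the initial cost."""
    return initial_cost / monthly_benefit

# Two hypothetical alternatives, like the solid and dashed lines in Figure 13-1:
solid = time_to_value(initial_cost=100_000, monthly_benefit=25_000)   # 4.0 months
dashed = time_to_value(initial_cost=100_000, monthly_benefit=12_500)  # 8.0 months
# The steeper slope (higher monthly benefit) reaches break-even sooner.
```

The same arithmetic applies to any alternative: a steeper value slope shortens the Time to Value for the same initial cost.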
ROI graphs for long-duration projects should include calculation of net present value
(NPV). Cash today is worth more than cash tomorrow because cash can be invested and
earn interest or another type of return. Therefore, if $1000 is invested in Web systems today
to obtain $1000 of benefits a year from now, the project is in the red. That $1000 could have
been invested in the financial markets and would have returned more than $1000 over the year.
NPV calculations or similar calculations (such as internal rate of return [IRR]) are used by
financial officers to handle this analysis. They use the time value of money, which is the
return that money can earn if invested, to make the benefit of an investment clear. After all,
investing the money carries much less risk than using that money to improve a Web system.
If the money can earn more through investment than it will bring in if used to make those
Web system improvements, the project may not be worth the effort.
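The NPV calculation itself is short. A minimal sketch of the $1000-today-for-$1000-next-year example; the 5 percent discount rate is an assumption for illustration (a financial officer would supply the actual rate):

```python
# Net present value: discount each year's cash flow back to today and sum.
# cash_flows[0] is today (year 0), cash_flows[1] is one year out, and so on.

def npv(rate: float, cash_flows: list[float]) -> float:
    return sum(cf / (1 + rate) ** year for year, cf in enumerate(cash_flows))

# Spend $1000 now to receive $1000 of benefit one year from now, at 5%:
project = npv(0.05, [-1000, 1000])
print(round(project, 2))  # -47.62: the project is in the red
```

A negative NPV says the same money invested at the discount rate would have returned more, which is exactly the point the text makes.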
An investment in content delivery infrastructure is another example. The metrics that might
be used to assess the ROI potential for content delivery investments (as discussed in the
"Content Distribution" and "Instrumentation of the Server Infrastructure" sections of
Chapter 9, "Managing the Server Infrastructure") are as follows:
- The projected, or measured, bandwidth gain at the network edge. Bandwidth gain is used to estimate the bandwidth savings, relative to a centralized distribution system, when a content distribution network is used.
- Download time
- Availability
- Transaction delay
Project Costs
The costs and the time of deployment locate the starting point in Figure 13-1. There are a
number of factors to consider, and each project will need to select and weigh the factors that
apply to that specific evaluation.
Many factors can be incorporated into a cost calculation, including the following:
- Staff time for design, project evaluation, project implementation, and project management
Some of these elements may be harder to identify and quantify, depending on the specific
organization. For example, the requirements may come out of a planning group that
addresses a range of technology issues. Time spent on evaluations may include actual
testing scenarios, specification reviews, and customer research. The implementation costs
might include some vendor-provided professional services, services from other sources,
training for operations staff, modifications and updates to management tools, or additional
hardware for parallel operation and gradual cut-over.
The cost of maintaining the status quo must also be considered. For example, consider a
key service that generates $12 million in annual revenue. If the average availability is 98
percent, there is a potential $240,000 revenue loss (every month the company loses another
$20,000 in potential revenues). Investing $50,000 to raise availability to 99.5 percent is an
attractive option because the improvement generates $180,000 annually in new revenue
opportunities. The Time to Value is less than four months, and each month past the break-even point adds another $15,000 (potentially) to the revenue stream.
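The arithmetic behind that example can be worked through directly, using the figures from the text ($12 million annual revenue, 98 percent availability raised to 99.5 percent for a $50,000 investment):

```python
# Cost of the status quo and the return on the availability improvement.

annual_revenue = 12_000_000
loss_now = annual_revenue * (1 - 0.98)       # ~$240,000 potential loss per year
monthly_loss = loss_now / 12                 # ~$20,000 per month

recovered = annual_revenue * (0.995 - 0.98)  # ~$180,000/year in new opportunity
monthly_gain = recovered / 12                # ~$15,000 per month
time_to_value = 50_000 / monthly_gain        # about 3.3 months, under four
```

Each month past break-even then adds roughly another $15,000 of potential revenue, as the text notes.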
At the same time, calculating the costs of the contributors and the alternatives is not always
black and white. For example, will all the headcount involved in development and
deployment be involved in the project 100 percent of their time? Does an uptime of 98
percent mean that revenue grinds to an absolute halt when the systems are down, or could
there be alternatives that keep the revenue going? For example, orders phoned in or faxed
in instead of submitted over the Internet are still orders. It's useful to apply a little
skepticism to estimating both costs and benefits, as you'll see in the following sections.
Project Benefits
Estimating the benefits from the investment may involve various levels of guessing and
using rules of thumb. One of the most frustrating aspects is trying to find some estimates
that at least bring you closer to understanding the costs and benefits of an investment. One
major question is this: How reasonable are the estimates? Is a 30 percent improvement in
major question is this: How reasonable are the estimates? Is a 30 percent improvement in
bandwidth utilization within the realm of possibility, or is 15 percent more realistic? Each
estimate changes the slope of the ROI line and shifts the Time to Value.
Many vendors now have ROI calculators that they use as part of the sales process. They
generally embed a set of assumptions about costs and impacts within their calculator. Of
course, the estimates are helpful only to the degree that they match the environment. The
size of the company, the industry segment, and the relative technical maturity are among
the factors influencing the ROI outcomes. Obtaining more information on the embedded
assumptions, how they were gathered, and the data used to build the model will help calibrate
the results.
I have worked with several vendors to begin building some rules of thumb from early
adopter experiences. The goal is to get some idea of what other customers report about their
experience with implementation, introduction, and day-to-day operations. These rules of
thumb will give potential buyers some additional ways to relate to the usual ROI
information. For example, a potential customer can relate to a rule of thumb that indicates
that similar customers have reduced their staffing levels by 20 percent while improving
service quality. (Some of the rules that can be applied are described in the following
subsections.)
Measurements should be analyzed to determine the actual benefits derived from the
investment. The key is having sound measurements and a starting baseline. Care should be
taken to document performance levels fully and quantitatively before implementation
begins. As the implementation proceeds, it is also important to remain flexible and
incorporate measurements for tracking unexpected outcomes. Following up with the actual
assessment is very helpful for the IT group. It helps them calibrate their own estimates and
refine their ability to project benefits with more accuracy in the future. Equally important
is establishing their credibility with the business managers through accurate projections.
Remember that a business-knowledgeable IT manager should be involved in the analysis to
improve the relevance for other business managers.
Availability Benefits
Changes in availability are usually evaluated in terms of capturing more revenues or
distributing more information to consumers. Identifying the actual revenue rates is often the
most difficult step. Good service instrumentation is necessary to determine the number of
orders and the revenue generated when the system is available. In some cases, however, a
simple calculation (dividing the revenues by the number of operational hours to get an
average revenue rate) may suffice.
After the revenue rate is determined, the rule of thumb is that a 1-percent change in
availability is an additional 7.2 hours per month of potential revenue gain. If your
organization has annual revenues of $12 million, for example, the rate is $1 million
monthly, or almost $1400 per hour. A 1-percent change can produce an additional $10,000
in monthly revenues, or $120,000 on an annual basis. Spending $50,000 to raise availability
by 1 percent gives an ROI Time to Value of five months.
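The 7.2-hours rule of thumb follows from a month having roughly 720 hours, so 1 percent of a month is 7.2 hours. A sketch using the text's $12 million example:

```python
# The 7.2-hour availability rule of thumb applied to a $12M/year service.

annual_revenue = 12_000_000
hourly_rate = annual_revenue / 12 / 720   # almost $1,400 per hour
monthly_gain = 7.2 * hourly_rate          # ~$10,000/month for a 1% change
time_to_value = 50_000 / monthly_gain     # five months for a $50,000 investment
```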
Performance Benefits
Performance is easy to measure for compliance: it is straightforward to determine the
percentage of transactions that complete within the specified response time. Assessing the
business impact is a bit more difficult. A subset of the total transactions actually generates
revenues; the remaining transactions are used for browsing product information, checking
promotions, facilitating customer-managed support, or tracking outstanding orders, among
other possibilities. Identifying the types of transactions and tracking each category may be
required.
Tracking the actual business that is transacted and closed provides deeper insight into the
impacts of any investment. Some metrics that demonstrate the value include the following:
- Change in the deal size: Are the orders larger as a result of better service quality?
- Change in unit transaction costs: The investment may allow larger transaction volumes without requiring expenditures for additional infrastructure, thereby reducing the unit costs of transactions.
Staffing Benefits
Staffing impacts may be important considerations in some situations. For example,
investments in automated tools may allow staff head count reductions or reassignments to
other tasks. The savings include salaries, benefits, and possibly training costs. At other
times, staffing may remain constant while the infrastructure grows over time, improving
staff productivity by managing more elements and services without adding team members.
In the end, the best metric for services management is the change in transaction volumes
relative to staffing levels.
Infrastructure Benefits
Many investments will impact one or more infrastructures. A basic metric is the change in
service flows relative to the infrastructure changes. If an infrastructure handles more service
flows as a result of the investment, its productivity has been improved accordingly. For
example, a company I interviewed was able to increase their transaction flows by 30 percent
without additional spending, translating into a substantial savings compared to scaling the
infrastructure by that amount.
Deployment Benefits
Deploying new services is an ongoing process rather than one of periodic releases. Rapid
deployment is facilitated by load testing and good design practices. Load testing also
helps determine if a new service will degrade the current services mix when it is placed into
production. The usual response is to refuse to deploy any new services that disturb the
normal operational baselines.
Deployment delays can have significant impacts on revenues; a rule of thumb that I use is
a $20,000 revenue loss for every $1 million in annual revenues that the service generates.
Indirect impacts include disappointing customers and losing competitive advantage to other
early movers. Using predeployment load testing and other service-level management
techniques can decrease the probability of deployment delays and, therefore, improve
revenues.
Soft Benefits
Business analyses depend on quantitative results that can be demonstrated and calculated
from data. There are often qualitative results that may be important in certain situations. For
example, many businesses use customer satisfaction surveys to gauge their relationships
with the market. Improvements in customer satisfaction are important, although they may
not be directly connected to specific business metrics. Surveys of internal IT customers may
also indicate improvements after a project is completed.
Improvement in management staff retention frequently results from projects that
implement better management tools and processes. Staff is freed from repetitive lower-level
tasks and can spend more time focusing on more challenging and valuable tasks.
Soft results, while not to be ignored, must be regarded skeptically, particularly because it is
easy to predict optimistic outcomes for productivity enhancements. Working closely with
customers to identify priorities for investment can help sharpen focus on tradeoffs.
Similarly, collaboration with human resources and finance organizations can help establish
consensus and buyoff where numbers alone don't tell the whole story.
results. Initial measurements taken after deployment indicated that the actual customer
volumes and transaction activity were substantially lower than the projections. The initial
response to the disappointing behavior was to consider spending more for additional
computing power and network bandwidth, which is the common response.
Cooler heads prevailed, fortunately, and argued for better measurements to clarify the
situation rather than possibly compounding the problem with misplaced efforts and poor
investments. A set of synthetic transactions established that the server response times and
the network delays were not the problem; in fact, they were very low. These measurements
were helpful: the technology was performing well, although the site wasn't. They
indicated that other alternatives needed investigation.
An engagement with a professional services firm that used web analytics was initiated, and
they collected information on the user behavior within the web applications. The data
clearly illuminated the problems. The major contributor to the disappointing outcome was
that the web application was difficult to navigate, and many customers would simply click
away after becoming frustrated.
The key web pages were those that guided potential customers to the offered products and
services and hopefully converted their interest into actual sales. However, the navigation
paths involved traversing a large number of pages with confusing content and links that
were not obvious. Further discussions indicated some of this was done deliberately so that
other cross-selling opportunities could be presented with each new page. This is the same
annoying dead-end strategy used by many sites that churn out new pop-up ads with each page.
The web applications were able to be changed fairly quickly because they were built with
JavaBeans and were, therefore, easy to modify. Shorter paths to the key content were
constructed, and the intervening pages were designed with simpler layouts. Upselling was
linked to the key pages rather than adding distractions on the path.
The synthetic transactions were used to verify that the changes did not add any significant
delays to the original baselines. The operational results verified that the web application
design, rather than the underlying technology, was the problem. The number of customers
remained steady for the first six weeks or so. The critical change was that the number of
customers reaching the key pages tripled and generated a 15-percent revenue gain.
The number of customers began to increase, partly by word of mouth and partly from repeat
visits. Over the next six months, the number of customer visits doubled, and revenues per
customer also grew considerably as the application was refined. The impact of a cleaner,
shorter navigation path was that the volume growth was accommodated without any new
infrastructure investments. Much of the computing and networking load was reduced
because customers were not linking through meaningless pages to reach the desired
content.
The ROI evaluation is summarized in Table 13-1. The costs included the professional
services of an analytics firm, the internal staff time to adapt and test the web applications,
and measurement tools. A case could be made that the tools will be used for many other
purposes and shouldn't be expensed solely to this effort. In this situation, however, a simple
analysis was deemed sufficient.
Table 13-1

Cost                                 Benefit
Web analytics services    $7,500     Initial deployment period    $54,000
Rework web applications   $22,500    Following six months         $2,153,000
Measurement tools         $165,000   Total                        $2,207,000
The benefits included the revenue increases over the initial deployment period and for the
following six months. The hard numbers indicate the project reached a break-even point in
one month (on an annual basis). The project solved the problems and had a definite business
impact.
There were other benefits that deserve mention as well. The infrastructure showed a 100
percent leverage, doubling traffic with no additional investments after the applications were
modified, and the unit costs of transactions were halved as a result.
Summary
Demonstrating the business value of technology investments is becoming an integral part
of most purchasing processes. Being able to find the strongest ROI is equally as important
as finding the best technical approach. The basic ROI determination is fairly
straightforward: total the costs, determine the benefits, and calculate the Time to Value. As
with many other things that are simple in concept, the details are more complicated. A
major challenge is projecting the benets before implementation because they will be based
on estimates rather than hard data. After implementation, there will be quantitative data
available.
Assessing the potential benefits involves looking at as many different outcomes as
possible: changes in customer visits, transaction volumes, the size of orders, and the
percentage of customers who return for further business.
Scrutiny of significant technology initiatives is being moved higher in most organizations,
usually involving business managers with less technical expertise. The ROI is the way they
evaluate and decide on key technology purchases.
CHAPTER
14
Implementing Service Level Management
I have covered many aspects of Service Level Management (SLM) in a webbed world.
Pulling them all together requires a coherent approach to implementation so that service
management can evolve as a system rather than merely comprise a disjoint set of management tools. This chapter presents some ideas for implementing an effective SLM system.
The focus is on the process and a strategy for moving through that process, rather than on
specific technology or product decisions.
The text discusses the following:
service. Service Level Agreement (SLA) wording, definition of metrics, and service level
objectives along with their statistical treatment will probably be new to the organization.
Accompanying these will be the need to handle integration and grooming of instrumentation measurements, changed problem management techniques, and service level reporting.
An application that depends on service levels may already exist; for example, many legacy
transaction systems are given priority on internal enterprise networks, and Voice over IP
(VoIP) systems usually are given priority on LANs. These prioritizations are commonly
based on rudimentary packet or frame tagging and on simple priority queues within routers
and switches, usually without a comprehensive system to report on and manage service
levels. Moving one of those applications to a new, more integrated SLM methodology is
probably the smoothest first project. The migration can provide an opportunity for staff who
are already involved with service level techniques to learn the new methods. The staff can
also bring their knowledge of the organization's needs into the initial development of the
new SLM systems.
It's also important to choose a pilot implementation in which end users can review plans at
critical junctures. Some of this is common sense. SLM helps address the needs of users, so
asking their advice can help avoid blind spots. Allowing users to participate, or at least to
observe, also lets them buy into the learning process. If users are part of the process, they
will probably be more forgiving of the inevitable mistakes and delays.
If there is no existing application already using some type of service level technology, it's
probably best to pick an application that uses a limited subset of the enterprise's systems,
instead of trying for a global initial project. The fewer the number of different subsystems
and providers involved, the less complexity that will have to be addressed in this rst trial.
Note that a simpler environment also exposes problems more clearly, and the implementation team learns more quickly. However, limitations in scope must be balanced with the
need to detect upcoming implementation problems. Subsystems and providers that are
widely used in the enterprise should be included in one of the earlier SLM projects even if
that increases complexity. One or more of those subsystems or providers might have an
incompatibility with the chosen SLM technologies, and it's better to detect that problem
early, before the momentum for a particular set of SLM technologies has grown.
Incremental Aggregation
It is best to introduce and activate new services in small increments. Each service can be
monitored for a trial period to ensure that baseline service quality and stability are
maintained under a variety of loading conditions. At that point, the service can be
incorporated into early SLM projects, and continuous monitoring for compliance can
become part of the regular management routine.
233
Further projects can be brought online as soon as the initial set has been proven successful.
This increased use of the service through aggregation of the needs of multiple projects has
advantages both in building on a now-proven SLM and service technology and in
negotiating more favorable terms with service providers. For example, aggregating the
projected bandwidth demands from each business unit into a single acquisition gives the
organization more leverage in obtaining bulk discounts or other benets. Using the initial
pilot as a teaser for the supplier, with the promise of additional projects from aggregated
additional needs, can provide important negotiating leverage.
This iterative process may appear slower in the beginning, but the phased, gradual approach
is useful. There are usually gaps between the projections of resource requirements and the
actual conditions. Ongoing measurements can be used in the phased approach to determine
if resource adjustments are needed before more services are added.
A services census checks whether devices have service management capabilities; the goal
is to catalog the service management capabilities already in place. In addition to Quality of
Service (QoS)-enabled network devices, elements such as caches, load balancers, and
traffic shapers should be identified.
measuring the outage duration in terms of the service offered, such as the period of time
during which a piece of the network was not functioning. Where the commitments made by
providers do not integrate effectively, end users will perceive a different impact from the
outages than might be indicated by the availability statistics of the underlying
subcomponents.
It is also necessary to specify measurement validation and any statistical treatments that
should be applied to the data. These should be combined with sampling frequency to ensure
that confidence intervals are acceptable. For example, a critical service might be probed
every five minutes, while those of lesser importance are checked every fifteen minutes. (See
Chapter 2 for detailed discussions.) The increased granularity of more frequent
measurements must be balanced against the additional demand on servers and networks.
More organizations are adding a dynamic specication that shortens the measurement
interval if the metric is trending toward unacceptable values.
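That dynamic specification can be expressed as a simple policy: probe at the normal interval, but tighten it once the metric approaches its objective. A sketch with illustrative thresholds and intervals (the 80-percent warning fraction is an assumption, not from the text):

```python
# Dynamic measurement interval: shorten the sampling interval when the
# measured value trends toward the unacceptable range.

NORMAL_INTERVAL_S = 5 * 60   # critical service: probe every five minutes
FAST_INTERVAL_S = 60         # tightened interval while trending bad

def next_interval(latest_ms: float, objective_ms: float,
                  warn_fraction: float = 0.8) -> int:
    """Shorten the sampling interval once the latest measurement passes
    a warning fraction of the service level objective."""
    if latest_ms >= warn_fraction * objective_ms:
        return FAST_INTERVAL_S
    return NORMAL_INTERVAL_S

# With a 150 ms response-time objective: 100 ms is comfortable, 130 ms is not.
assert next_interval(100, 150) == NORMAL_INTERVAL_S
assert next_interval(130, 150) == FAST_INTERVAL_S
```

The finer granularity costs extra load on servers and networks, which is why it is enabled only when the trend warrants it.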
After the metrics and their measurement procedures are specied, the service level
objectives can be established based on the requirements of the application. In cases where
the performance characteristics of a service are well-established, such as those associated
with a service from a major external supplier, it may be necessary to choose from the
service classes that the supplier offers. For example, an interactive application might be
able to choose among three offered classes of service with three different sets of acceptable
response times and packet losses, as shown in Table 14-1. Major Internet Service Providers
(ISPs) offer service guarantees for transit that's completely on their networks, and some are
offering guarantees for transit to and from endpoints on other networks.
Table 14-1

Service Class    Response Time (ms)    Packet Loss
Platinum         60                    0.25%
Gold             100                   0.5%
Silver           150                   2%
Passive Measurements
Passive measurements provide insight into the actual services being used and their volumes
throughout the day. Passive monitors should be placed on the access links from data centers
where they can capture the actual traffic flowing to and from the center. Placing agents near
the organizational boundary also tracks the outbound traffic originating within the
organization.
Remote Monitoring (RMON) agents are passive agents that can provide a rich view of
application flows across the networked infrastructure. Many LAN switches have embedded
RMON agents that can be used. A stand-alone agent can also be used to collect the
information, allowing measurements at sites where a switch does not have an RMON probe
or where the large volume of collected data impacts the switch performance.
Passive agents in servers can also provide information about the applications that are
executing, the numbers of concurrent users, and the time distribution of usage. The server
information can supplement or replace the RMON data. The measurements at the
organization's edge will still be needed to understand the outbound traffic.
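Once collected, the value of passive data lies in aggregation. A minimal sketch of turning per-flow records, such as an RMON agent or server agent might report, into application volumes; the (application, bytes) record format is an assumption for illustration:

```python
# Aggregate passive per-flow records into total volume per application.

from collections import defaultdict

def volumes_by_application(flow_records):
    """Sum byte counts per application from (application, bytes) records."""
    totals = defaultdict(int)
    for app, nbytes in flow_records:
        totals[app] += nbytes
    return dict(totals)

records = [("web", 1_200), ("mail", 300), ("web", 800)]
assert volumes_by_application(records) == {"web": 2_000, "mail": 300}
```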
Active Measurements
Active measurements are used to build a consistent view of service behavior as seen by the
end users. For example, a set of synthetic transactions can be constructed that are realistic
approximations of actual end-user activity. Performance is tracked by sending synthetic
transactions to the actual site. This offers the performance perspective as experienced by
customers, partners, or suppliers. The measurements are used to alert the local management
team that service disruptions may be threatening sales or business relationships.
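A minimal active-measurement sketch: time one synthetic transaction (here just a plain HTTP GET) against the actual site and flag it against the response-time objective. Real synthetic transactions would script a full user interaction; the URL, timeout policy, and threshold check are placeholders:

```python
# Time a synthetic request and compare the result to the objective.

import time
import urllib.request

def within_objective(elapsed_s: float, objective_s: float) -> bool:
    """True when the measured response time meets the objective."""
    return elapsed_s <= objective_s

def probe(url: str, objective_s: float) -> tuple[float, bool]:
    """Return (elapsed seconds, met-objective?) for one synthetic request."""
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=objective_s * 2) as response:
        response.read()
    elapsed = time.monotonic() - start
    return elapsed, within_objective(elapsed, objective_s)

# elapsed, ok = probe("https://example.com/", objective_s=2.0)
# if not ok: alert the local management team
```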
Active probes should be placed in multiple locations for the best results. The internal
environment can be as transparent as desired because multiple probes can provide detailed
and granular measurements. There is more flexibility within an infrastructure that the
organization controls and manages; probe placements should be used to provide overall
end-to-end measurements and to break down the components along the path. Performance
of different areas (across the backbone, on the web server, or on the backend server, for
example) gives the detailed data needed for resource planning.
Placing probes at different points in the internal network also gives a broad picture of any
service quality variations that are related to differences between locations. This gives
resource planners a finer level of detail and identifies specific areas that require further
attention.
One caution in distributing active probes across the infrastructure is that overduplication of
measurements should be avoided. For example, it may seem reasonable to place a probe
that monitors an end-to-end service, plus individual probes that monitor each component
of that service. However, when multiple edge services use the same core service, you don't
need to add multiple monitors for the core service as well.
External parties will not expose their internal operations to outsiders; nonetheless, they still
must be measured. Therefore, the main measurement objective for external services is the
identification of the proper demarcation points so that the performance of external parties
(suppliers, partners, or hosted services) can be isolated and measured by appropriate
instrumentation deployed at those points.
Demarcation points close to edge routers connecting to external services are the most
desirable locations for such instrumentation. Active measurements can track the
performance between these demarcation points, and that performance can be used in SLAs
with the external service providers. These measurements can evaluate the delay in provider
networks, the delays at hosting sites, or the delays within a partner environment.
Measuring a provider is simple when both demarcation points are within the same
organization. Placement is not restricted, and the provider delay can be clearly determined.
Negotiating with key partners or suppliers for placing measurement probes is becoming
more common. For security reasons, a business partner may not want any external
equipment connected inside the firewall. If a probe cannot be placed inside a firewall, a
probe located at a demarcation point just outside the firewall can be used to measure the
external network delays.
Adjustments to the design are indicated when the actual performance isn't sufficient to meet
the objectives. The gap between the desired and the actual performance will be a gauge
of the effort and expense to bridge the gap. A small gap may be bridged with a small
upgrade or some simple reorganization of the resources. A small gap may also be resolved
by upgrading a single resource, such as adding a faster server or distributing content to edge
servers to reduce congestion. The census process can help by identifying system
components that can be moved to places where they add more leverage and control while
reducing the need for new purchases.
Larger differences may indicate that an investment in multiple areas, such as network
bandwidth or a faster database server, may be necessary. A balance between larger
investments and the target levels may be considered. Would adding two seconds to the
proposed response-time metric result in lost business or productivity? Would the two
seconds be a good idea if it saved a substantial sum of money?
Granular internal instrumentation is very helpful at this stage. Measurements may directly
identify the main contributors to the delay. If the internal network delays are 10 ms and the
server delays are 6.5 seconds, most leverage is going to be found in improving server
performance (or whatever back-end services are activated).
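The comparison in that example is trivial once the component delays sit side by side, which is exactly what granular instrumentation provides. A small sketch:

```python
# Identify the component contributing most of the end-to-end delay.

def dominant_contributor(delays_ms: dict[str, float]) -> str:
    """Name the component responsible for the largest share of total delay."""
    return max(delays_ms, key=delays_ms.get)

# The figures from the text: 10 ms of network delay versus 6.5 s of server delay.
measured = {"internal network": 10.0, "server": 6500.0}
assert dominant_contributor(measured) == "server"
```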
Capacity of the systems under the actual workload is also a critical consideration.
Acceptable response when loads are light may be deceptive. If some of the services are not
yet deployed, there is more uncertainty about the actual infrastructure capacities.
Estimating expected growth in transactions or users is important to ensure that a new
service management system has the headroom to accommodate growth for some initial
period, all the while staying below the inflection point, which is where performance
becomes nonlinear. Baseline information or data from load testing is very helpful in
building a realistic assessment of the implementation effort, its costs, and time frames. (See
Chapter 11, "Load Testing," for more information about inflection points and load testing.)
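The headroom question can be framed as simple compound growth: given a monthly growth estimate, how long until load reaches the capacity where performance turns nonlinear? The figures below are illustrative assumptions, not from the text:

```python
# Months of headroom before projected load reaches the inflection point.

import math

def months_of_headroom(current_load: float, inflection_load: float,
                       monthly_growth: float) -> float:
    """Months until current_load * (1 + g)^m reaches the inflection load."""
    return math.log(inflection_load / current_load) / math.log(1 + monthly_growth)

# 5,000 transactions/hour today, inflection near 20,000, growing 10% a month:
months = months_of_headroom(5_000, 20_000, 0.10)   # roughly 14.5 months
```

Baseline or load-testing data supplies the inflection load; the growth rate comes from the transaction and user projections discussed above.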
After the service-level and system capacity needs have been determined, the process of
tuning system performance can begin. Having a set of choices increases options and
leverages competition. It also makes the selection process more complex because there will
be several ways of satisfying any particular requirement. For example, money can be spent
directly on the servers to improve processing power or memory, or it can be spent on storage
systems to boost server performance. The expenditures may also be indirect and include
buying load balancers, web server front ends, or content management and delivery systems.
The instrumentation used for baselines can now be used in sensitivity testing as the new
elements are added to the test bed and then to the production environment. It is best to
measure one change at a time to get a better feel for the changes that are making the most
signicant contributions.
Instrumentation may also reveal that some other applications carried by the system are
interfering with service performance. For example, some activities, such as playing games
over the network or downloading media files, are not related to business goals and waste
time and resources. At other times, a legitimate service, such as backing up a database, may
interfere with other critical services because they are scheduled incorrectly. Data from
instrumentation may therefore indicate a need for admission control policies and
enforcement. Undesired applications would be barred from using network resources, while
others could be scheduled to reduce their interference with other operations.
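That enforcement idea can be sketched as a simple policy lookup; the application names and the actions mapped to them below are invented for illustration, not taken from the text:

```python
# Admission control driven by instrumentation data: classify an observed
# application and return the enforcement action for its traffic.
POLICY = {
    "network-games": "block",       # not related to business goals
    "media-downloads": "block",     # wastes time and resources
    "db-backup": "reschedule",      # legitimate, but run it off-peak
}

def admission_action(application):
    """Return the enforcement action for traffic from an application."""
    return POLICY.get(application, "allow")
```

A real policy engine would act on flows in the network rather than on names in a table, but the decision structure is the same.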
Construction of SLAs
The foundation of any service management system is instrumentation, reporting,
and a clearly defined SLA. As part of an SLM implementation, most organizations will
create one set of SLAs with a range of external providers and partners. They will create a
separate internal set of SLAs between the IT group and the business units. All parties gain
from having an SLA; service customers expect to have more control of service quality and
their costs, while providers have clearer investment guidance and can reap premiums for
higher service quality.
Because it specifies the consensus across all parties, the SLA becomes the foundation for
deciding whether services are being delivered in a satisfactory fashion. I discussed SLAs in
Chapter 2, and they are reviewed here, along with some additional discussion of SLA
dispute resolution.
An SLA should clearly define the following:
There may be other areas, such as determining financial penalties, that must also be
addressed and resolved. This is driven by business considerations: the providers want the
lowest possible exposure to penalties, while the customers want realistic compensation for
service disruptions.
Usually the penalties suit the provider rather than the customer. For instance, the common
remedy for a disruption is a rebate on future bills or a refund. The downside for customers
is that the rebate may be a small fraction of the customer loss, possibly thousands of
dollars of lost revenue per minute. As discussed in Chapter 2, there are strategies that can
be applied to encourage desired supplier behavior and that can be coupled with risk
insurance, if necessary, to compensate for losses.
Even where legally binding SLAs with external providers are not involved, it's important
to reduce ambiguities to an absolute minimum.
The SLA areas of metrics and service level objectives were discussed in preceding sections;
the other SLA areas are described in the following subsections.
internal users, much like traffic reports on the radio. This enables them to see the actual
service quality they are experiencing, and they can track improvements over time.
Published reports reduce the load on the help desk because users can check performance
and other parameters for themselves rather than asking help desk staff for information. User
access to SLA compliance reporting also keeps the management service team accountable
because everyone sees the results.
It's important that there be consistency between the published reports used for determining
SLA compliance and the instant reports available on the Web. However, there may need to
be adjustment of the instant reports because of measurement errors and other problems;
customers must be made aware of that possibility to avoid losing credibility. Credibility
problems can also appear if instant measurements are made available before the reporting
system has been fully tested; spurious reports of service level problems will make the load
on the help desk worse instead of better.
Accountability is enhanced by scheduled, periodic reviews of the service level reports.
Defining who participates in these reviews from each side, how often they are scheduled,
and what material is to be reviewed should go hand in hand with the reporting requirements.
Dispute Resolution
There should be a mechanism dened for resolving disputes because they will inevitably
arise, despite having as many details as possible spelled out in the SLA. Given the
pressures to minimize penalties on the providers' side and the criticality of services on the
customer side, discrepancies in the measurements and their interpretation will be subjected
to substantial scrutiny.
In the past, service providers traditionally made the measurements themselves and just
reported them to customers. Today, many customers want to conduct their own
measurements. Some customers feel it keeps the provider honest. In this, the webbed
services industry is just catching up with more traditional industries, in which regular
monitoring of supplier inputs to manufacturing or other elements of the supply chain is a
critical, standard operating procedure.
When provider and consumer are measuring from different points or using different
intervals, the results will not be consistent and will lead to disputes when disruptions occur.
As the financial consequences mount, the probability of disputes rises accordingly.
There is, therefore, a move to use a trusted third party whose measurements are assumed to
be objective. Companies that measure Internet performance, such as Keynote Systems, are
used to verify the performance of the cloud, the integrated set of services. Other
companies, such as Brix Networks, have also been founded to address this specic concern.
Brix places measurement appliances at key demarcation points, collects the performance
information, and analyzes it at a central site. Mercury Interactive, through their Topaz
Managed Services, offers some of the same capabilities.
It benefits both parties when accountability is clearly determined from agreed-upon metrics
and measurements. Any dispute resolution process needs specic steps, such as both parties
simultaneously conducting measurements to determine the differences in the readings. Any
differences can be used to correct and calibrate the results before determining if services
are compliant. Such a process is critical to building a workable relationship between service
providers and service customers.
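The calibration step could look like the following sketch, which uses a simple mean-offset correction; the function names, the offset method, and the sample readings are all assumptions for illustration, not a prescribed procedure:

```python
def mean_offset(provider_ms, customer_ms):
    """Average difference between paired latency readings (ms),
    taken simultaneously by both parties."""
    diffs = [c - p for p, c in zip(provider_ms, customer_ms)]
    return sum(diffs) / len(diffs)

def calibrate(provider_ms, offset):
    """Shift the provider's readings by the agreed offset before
    checking them against the service level objective."""
    return [p + offset for p in provider_ms]

def compliant(readings_ms, objective_ms):
    """A period is compliant if every calibrated reading meets the SLO."""
    return all(r <= objective_ms for r in readings_ms)

# Simultaneous measurements from the two parties; the customer's probe
# sits further from the servers, so its readings run consistently higher.
provider = [100.0, 110.0, 105.0]
customer = [112.0, 122.0, 117.0]
offset = mean_offset(provider, customer)
adjusted = calibrate(provider, offset)
```

A real process would also bound the acceptable offset and flag reading pairs that diverge too far to be calibrated away.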
In some cases, especially during initial implementations of SLM, it may be necessary to
adjust the service level objectives because it has become apparent that the costs are too
high, or the system is not yet capable of meeting the original targets. The possibility of that
readjustment must be understood by all parties involved in those initial efforts.
Summary
The implementation process depends on solid basics, including crafting a clear and specific
SLA. Getting instrumentation and reporting in place early for internal SLAs helps with
realistic assessments of service quality before implementation begins. Giving internal users
access to service quality reports is also helpful in building support.
A phased approach is usually most effective, starting with smaller efforts and gaining
experience and speed over time. An existing application that uses simple prioritization or
other rudimentary service level management systems is a prime candidate for the first full
SLM project, as are applications that use manageable, but representative, subsets of the
enterprise's total architecture.
Baselining the performance of the existing system and taking a census of all the existing
systems' components and their capabilities will help during the system development and
tuning phases as well as during the evaluation of the success of the entire SLM project.
The implementation of full SLM relies on specification of the performance metrics, choice
of instrumentation types and locations, and correct construction of the SLA, including
methods for resolving the disputes that will inevitably occur in the first implementations.
CHAPTER 15
Future Developments
This final chapter offers some closing thoughts about future directions of Service Level
Management (SLM). As a rich area, it is evolving in several directions. Topics covered in
this chapter are as follows:
management system relationships might last indefinitely or endure for time periods as brief
as the duration of a single flow. For example, a downstream shipping service might be
dynamically selected based on geographic coverage, delivery schedules, and cost. The
management system needs to communicate with the selected shipper's management system
to detect and help diagnose any problems with transactions involving the shipping service.
The emergence of routing optimization products and services offers multi-homed sites
another example of a dynamic service chain. Using these route control products, customers
can select Internet service providers (ISPs) in real time, measuring their performance and
comparing costs. The basic interactions addressing service disruptions will be
supplemented with additional requests for information or new measurements. Other
interchanges will focus on reporting trends that indicate a drift toward SLA noncompliance.
Efficient providers are rewarded with more traffic and more revenues.
Real-time, customer-to-provider interactions will become more important because
customers are always eager to trim their costs and deliver their services more effectively.
These factors will continue to push customers and their providers toward real-time
exchanges that give the customers increasing control of the resources they buy from the
provider.
The term customer network management has been used to describe systems that enable
customers to make real-time requests for a range of network services. With customer
network management, customers can submit trouble tickets, track their status, obtain
billing and usage information, and change service metrics. The real-time interaction can
speed problem resolution and enable customers to adjust their bandwidth consumption to
their current and projected demands. Service providers also benefit because customers are
taking over tasks that were previously handled by provider staff. Service providers can
leverage more specialized offerings to strengthen their competitive differentiation.
One obstacle to greater customer participation in services delivery is a distinct lack of
integration among the various back-office systems. (These systems include provisioning,
billing, order tracking, and capacity planning.) Many new service rollouts, such as digital
subscriber line (DSL), overwhelmed the providers and their systems, resulting in long
delays and many installation and activation errors. Methods for customer participation in
service allocation, accounting, and management exist today in the switched networks used
by traditional telephony service providers; as the economics filter to routed networks, the
tools and infrastructure will have to catch up.
Customer management systems interact with management systems from business partners,
providers, customers, and suppliers. These interactions are between management systems
and are different from the business application interactions that are usually defined as
business-to-business ows. Different management systems need to exchange information
during regular operations, and especially when service disruptions occur.
Because of the complex, dynamic, and rapidly changing web of interactions, adaptable
management products are needed. These are tools that discover changes in the managed
environment, update their information, and continue operating without staff involvement.
The pressures of intense competition, constant change, and growing criticality have
outstripped the capacity for hands-on maintenance of service management tools.
If management tools depend on manual entry of information or rules in response to changes
in the current environment, they won't be able to keep pace. Organizations cannot afford
time delays while they update their tools. They also cannot afford incorrect analyses caused
by outdated tools or problems caused by tools failing to provide information in a timely
manner.
A simple change, such as assigning an application to a different server, might take no more
than a minute, but it can consume hours of staff time to update the management tools if they
must be updated manually. The combination of complexity, stringent time pressures, and
stiffer penalties for noncompliance is forcing management systems to become more
automated with more sophisticated analysis and responses.
RiverSoft introduced its Network Management Operating System (NMOS) to address this
problem. (RiverSoft is now part of Micromuse, and NMOS has been integrated into the
Netcool product line.) NMOS periodically checks the network infrastructure for changes.
When changes are detected, it updates its information accordingly. It then adjusts the
correlation information to reflect new connectivity and new dependencies. Other products
are taking an intermediate step of detecting and reporting changes, leaving the adjustments
to the staff. This feature will be a key differentiator, especially for those organizations that
are struggling to stay on top of their own environments.
A similar problem for administrators is the need to understand the relationships between
service flows and underlying infrastructures, which is difficult if those relationships keep
changing. Without knowledge of the relationships, administrators must take corrective
actions without understanding their impacts on the business. They may restore a device,
adjust a route, or change access parameters, among other tasks, without knowing if their
actions made any difference in the organization's business outcomes. Service management
tools must be able to present business-related information to administrators before they
make such management decisions.
Mapping these relationships automatically is not an easy task, and the relationships
between service flows and underlying infrastructures must be shown in both directions:
from flows to elements and from elements to flows. Most management vendors offer a
partial solution supporting automatic discovery of elements and applications, leaving the
identification of relationships to the staff. This helps, but falls short of what is actually
needed.
Dynamism makes the problem even more difcult because continuous changes increase the
risk of using outdated information. Some companies I have interviewed are introducing
new services weekly, constantly reallocating servers to applications as loads shift, and
altering bandwidth assignments in (near) real-time. Perhaps a new approach will be needed
that enables applications to register themselves when they are activated. A management
tool could receive the registration and update the information accordingly.
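Such a registration scheme might be sketched as follows; the class, service, and server names are hypothetical, invented only to show the flow of information:

```python
class ManagementRegistry:
    """Receives registrations from applications as they activate, so the
    management tool's view of the environment stays current without
    manual updates."""
    def __init__(self):
        self.services = {}

    def register(self, name, server):
        # An application announces itself (and its current server) at startup;
        # re-registering simply overwrites the stale entry.
        self.services[name] = {"server": server, "state": "active"}

    def deregister(self, name):
        if name in self.services:
            self.services[name]["state"] = "inactive"

    def where(self, name):
        """Return the server currently hosting an active service, or None."""
        entry = self.services.get(name)
        return entry["server"] if entry and entry["state"] == "active" else None

registry = ManagementRegistry()
registry.register("order-entry", "srv-12")   # initial activation
registry.register("order-entry", "srv-07")   # reassigned to a different server
```

The point of the sketch is that the reassignment updates the management view as a side effect of activation, with no staff involvement.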
Superficial Integration
The two most superficial forms of integration are integration on the glass and integration
on a system. Integration on the glass means that there is a consistent look and feel to the
management tools on the platform. Consistency is helpful because it reduces training and
simplifies many tasks, but integration on the glass is of limited value after the training
savings are realized.
The first Simple Network Management Protocol (SNMP) management tools used a
dedicated computer system, called integration on a system. This can be a wasteful
approach, especially when the demands on the server are low. Early SNMP platforms made
a virtue of this fact, by marketing several tools sharing a single server as a form of
integration. These platforms used their event management functions to launch a specic
tool whenever criteria called out by a set of rules were satisfied. Such sharing does save
hardware costs, and keeping it all local to one server simplifies some of the tool-launching
logic. However, after those efficiencies are realized, the value is limited.
Data Integration
Data integration has long been touted as a breakthrough that will solve management
challenges arising when a set of tools is needed to restore service levels. When one tool
cannot share its information in a straightforward way with others, staff time must be
expended to close the gap. The usual process has entailed using a tool, getting its output,
and entering that as input for another management tool. Involving staff also raises the
possibility of errors being introduced by staff as they manually move information between
tools.
Extensible Markup Language (XML) is emerging as the preferred way of attaining a level
of data sharing and integration. XML-parsing technology and document creation tools are
readily available, simplifying the transformation between local and standard
representations. In practice, selection of a common schema remains a challenge. All the
different parties sharing data must have a common way of interpreting the tagged and
structured information inside an XML document. Within an enterprise, such interfaces can
be handled with local standards because documents are shared in a single organization.
However, sharing information across organizational boundaries, or in the absence of strong
standards, can be more of a struggle because there's no guarantee that the schema used by
each party is compatible.
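The transformation between a local and a shared representation can be sketched with Python's standard XML parser; the local tag names and the agreed mapping here are invented for illustration:

```python
import xml.etree.ElementTree as ET

# A performance document in one party's local schema (tags are invented):
local_doc = """<perfReport>
  <svc>checkout</svc>
  <respMs>850</respMs>
</perfReport>"""

# The mapping the parties have agreed on: local tag -> shared-schema tag.
TAG_MAP = {"perfReport": "report", "svc": "service", "respMs": "response_time_ms"}

def to_shared(xml_text, tag_map):
    """Rewrite tag names so both parties interpret the document the same way."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        elem.tag = tag_map.get(elem.tag, elem.tag)
    return ET.tostring(root, encoding="unicode")

shared = to_shared(local_doc, TAG_MAP)
```

The hard part in practice is not the mechanical rewrite but agreeing on the mapping itself, which is exactly the schema problem described above.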
The work of the Distributed Management Task Force (DMTF) may be very helpful here.
This standards body has defined the Common Information Model (CIM) and an encoding
scheme for XML documents. However, this standard has not yet clearly demonstrated its
value. Economic realities tend to work against spending in support of standards that may
not demonstrate an immediate payback.
One outcome of the emerging focus on XML is a shift away from efforts to create a single
management information repository. Projects that attempt to define and implement such a
single repository for management information almost always fail, for reasons that are clear,
especially in hindsight.
The first barrier for unified management information repositories has been the schema.
Every management vendor has its own internal schema and prefers to impose it as the
industry standard. Competitors understand the advantages of having a proprietary schema,
which ensures lock-in for their products; the resulting deadlocks often lead to early failure.
Another barrier was the relative immaturity of distributed database technologies. Problems
of keeping information fresh, arbitrating concurrent updates, saving information, and
providing easy database backup and recovery often made any effort look impractical. It
appeared that the foundation technology was not ready for prime time.
Distributed database technologies have matured, but the reluctance to reengineer
fragmented databases is still strong. Adoption of a monolithic management repository
requires extensive changes in almost all organizations. The reality is that there are many
databases within an organization that hold critical management information. In some
cases, databases are separate for legal or regulatory reasons; in all cases, organizations are
reluctant to reorganize their databases.
Rather than focus on the data store, XML facilitates data exchange with a protocol for
documents, using a dened encoding scheme. Schema descriptions and presentation
information can also be appended to documents. That is what makes XML a strong
alternative to the repository concept for data integration.
Event Integration
XML document exchanges are sufficient for ongoing data communication between
management tools. Such exchanges are essentially synchronous: a tool receives a message
and responds to it. However, that is only part of the answer; event integration is also
required, because management tools need asynchronous communication with other parts of
the management system.
Event integration enables a management tool to signal asynchronously and activate another
part of the management system when a specific event occurs. Such integration must be bidirectional; events can flow in either direction as determined by the specific needs of any
management task. Each party must be able to understand the event so that it can take the
appropriate action. As more events are encoded as XML documents, XML can simplify the
integration process and leverage data-integration methods for use in event integration.
Process Integration
Solid data and event integration enables administrators to build management processes,
which are automated sequences of tool functions controlled by a
process manager. Process integration offers high value to administrators because each
automated process saves staff effort and expense each time its triggering situation occurs.
The webbed services environment can be the basis for building effective SLM solutions.
Many of the pieces are now available, and the benefits of a new approach are compelling.
The Web is based on exploiting the power of loosely coupled systems that interact in many
different ways to create a variety of services. That same loosely coupled approach can be
applied to the architecture of a services management system and to the processes performed
by that architecture; proposals for those designs are discussed in the following subsections.
Process Managers
Process managers oversee a management process by ensuring that all lower-level tools and
processes carry out their tasks successfully. They organize information and oversee
portions of the managed environment. The process managers are higher-level tools or
functions that coordinate tools and tool clusters; they also communicate with other process
managers. The process manager organizes the collected information and determines if its
task is complete; if it is, the process manager reports to a higher-level process manager.
When the task is not complete, the process manager initiates further activities or reports a
failure.
The process manager needs logic to analyze the incoming information and make the
appropriate decisions. It takes different steps depending upon the analysis; for example, the
process manager might request further detailed measurements, access other information
sources, or use different tools as its analysis dictates. Process managers may also have
correlation, policy, and presentation functions.
Correlation is important when determining a root cause or trying to understand the
interactions among different parts of the managed environment. The ability to correlate
across different infrastructures is a high-value capability.
The process manager might set or modify system policies while it collects information it
needs; in addition, it may adjust other parts of the managed environment. Note that some
process managers might focus primarily on overseeing policy-based operations.
Presentation is also a key function. A complex and dynamic environment is challenging to
manage, and it is also a challenge to organize and present information that is useful. Useful
information must be presented in a way that enables a human to gain an understanding of
the situation quickly. This function must be very flexible because different people will
respond to different types of presentation formats.
Instead of extending the signaling function of today's event managers, it is possible to use
message queuing systems to integrate management system components. Message queuing
software is offered by companies such as IBM (WebSphere MQ) and TIBCO
(ActiveEnterprise). These messaging products provide the means for different applications
on different computer systems to exchange information in a controlled way. Such
messaging platforms have been implemented as backbones in complex inter-application
environments, such as brokerages and other financial services organizations.
The information can be exchanged in the form of XML documents, which the parties are
responsible for transforming into locally useful forms. The messaging software handles the
other aspects of application-to-application communications. It handles synchronization,
queuing, backpressure or flow control, status reports, and other matters that smooth the
exchange of the XML documents.
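The pattern can be sketched with in-process queues standing in for a commercial messaging backbone; the tool names and document fields are illustrative assumptions:

```python
import queue
import xml.etree.ElementTree as ET

# Each management tool reads from its own inbox; the pair of queues here
# stands in for the messaging backbone, which would handle delivery,
# queuing, and flow control between tools on different servers.
inbox = {"event_manager": queue.Queue(), "trouble_ticketing": queue.Queue()}

def publish(dest, xml_text):
    """Hand an XML document to the messaging layer for delivery."""
    inbox[dest].put(xml_text)

def consume(tool):
    """Pull the next XML document and transform it into a local form."""
    doc = ET.fromstring(inbox[tool].get())
    return {child.tag: child.text for child in doc}

# A monitoring tool signals a threshold violation as an XML document:
publish("trouble_ticketing",
        "<event><service>checkout</service><status>degraded</status></event>")
event = consume("trouble_ticketing")
```

The receiving tool is responsible only for transforming the document into its own local form; everything about delivery is the messaging layer's problem, which is what makes the coupling loose.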
This combination of messaging and XML constitutes a strong foundation for integrating a
set of management tools. The XML documents provide the information, and the messaging
software handles efficient exchanges and signaling between applications. Management
tools can now be distributed across several servers, if desired. This flexibility enables
administrators to link their management tools into sequences that define management
processes. This level of integration offers substantial value to the management teams.
XML will be the major means of data sharing between management tools, especially those
from different vendors. XML has already achieved a strong foothold in many products, and
vendors are using it as an internal integration tool for their own products. This trend will
accelerate, especially because XML makes absorbing products from mergers and
acquisitions easier as well.
If the server infrastructures do not report any problems, the triage process manager could
turn to the transport infrastructure. For instance, it could ask transport infrastructure service
managers to initiate end-to-end measurements to determine basic delays, packet loss, or
other relevant metrics. The end-to-end testing tool may need to access instrumentation
information to locate the appropriate probes to activate for the measurements.
If the end-to-end measurements indicate that further investigation is warranted, additional
tools could be brought into action. If the servers are multi-homed, one example of an
investigation into network problems might be a check of ISP performance using
measurement data accumulated by the routing optimization system that's managing the
selection of ISPs. If an external network is determined to be the problem, synthetic
transactions or another testing method could be initiated to probe the external network and
determine when it resumes operating within the range that does not further threaten service
quality.
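The triage sequence described above might be sketched as an ordered walk through the layers; the check names, the probe stand-ins, and the thresholds are all illustrative:

```python
def triage(server_ok, transport_checks):
    """Walk the diagnosis steps in order, stopping at the first suspect
    layer; a real process manager would invoke instrumentation and
    end-to-end measurement tools instead of these stand-ins."""
    if not server_ok():
        return "server infrastructure"
    for name, measure, limit in transport_checks:
        if measure() > limit:
            return name
    return "no fault isolated"

# Stand-in probes: the server check passes, so the process manager turns
# to the transport infrastructure and runs end-to-end measurements.
checks = [
    ("end-to-end delay", lambda: 240, 200),   # ms; exceeds the 200 ms limit
    ("packet loss", lambda: 0.1, 1.0),        # percent; within limits
]
suspect = triage(server_ok=lambda: True, transport_checks=checks)
```

Each step runs only if the earlier, cheaper checks come back clean, which mirrors the escalation from server managers to transport measurements in the text.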
The loosely coupled, process-oriented approach enables administrators to focus on the
steps they need to follow to achieve a management result, instead of trying to address
management strictly in the context of a platform or element.
This approach will help IT groups manage the responsibilities and cooperation of each key
process team. If, for example, a triage operation within the services group identifies the
transport infrastructure as the likely cause of the disruption, the appropriate transport
specialists and processes can be automatically signaled to resolve the problem and restore
service levels. The technical means to integrate the support groups can come from the
emerging messaging and signaling functions discussed previously in this chapter. They
enable either team to signal the other and activate the appropriate processes.
Summary
SLM is not only necessary, it is vital in the webbed world that we inhabit. The rapid
change and dynamic service chains that are characteristic of newer Web-based services are
forcing development of new management systems that can cope with that change
automatically, instead of relying on manual configuration and manual system management.
Originally, management systems had simplistic integration; although all the management
tools might reside on the same platform and bear a superficial similarity, operators still had
to move data manually from one tool to the next. Errors were frequent, and automation was
problematic.
More effective and innovative tools are coming. Integration is becoming deeper,
at the level of data integration, event integration, and process integration. Events detected
by one tool will automatically trigger process initiation and automated sequences in a
completely different tool, using shared data. These tools can be loosely coupled, which
gives great flexibility in tool location and organization. Tools no longer will need to share
a single platform; instead, for example, they could communicate with each other using
message-based queuing and XML.
The service management industry is undergoing consolidation, and large companies are
aggressively acquiring smaller startups. When you need to choose a supplier, you should
focus on innovative companies that have a good path to viability, through their own assets
or partnering.
SLM is now feasible, although not as simple as we would like. Nonetheless, an exciting and
challenging world is emerging, and SLM will be a key enabler of the potential of the
Internet.
INDEX
A
abandoned shopping carts, 151
abandonment, 198
accelerators, SSL, 165
access links, passive monitoring, 236
accountability, 241
accuracy of root-cause analysis, 107
ACE (Application Characterization
Environment), 213
actions, 113-116
activation/deactivation time of management tools, 95
active collection, 75-78
active customers, 150
active measurements, 236
active monitoring, 75
activity baselines, 149
adaptive instrumentation, 77, 87
addresses, IP, 122
administration
aggregators, 72
applications
effect of organizational structures, 146
infrastructure, 145
instrumentation, 157, 161
metrics, 147, 152
operational environments, 146
time lines, 147
transaction response time, 152, 157
complexity, 130
CRM, 204
analysis
FMEA, 136
process managers, 251
root-cause, 107
statistical, 29-30
API (application programming interface), 96
Application Characterization Environment
(ACE), 213
application program interface (API), 96
applications, 117
baselining, 237
development teams, 146-147
existing, 232
infrastructure, 145
instrumentation, 61-63
legacy, 160
management
effect of organizational structures, 146
instrumentation, 157, 161
metrics, 147, 152
operational environments, 146
time lines, 147
transaction response time, 152, 157
management systems integration, 248-250
Netuitive, 120
network-aware, 156
ProactiveNet, 117
servers, 42
Arbor Networks, 125
architecture, 41
delivery of Web services, 42-44
beds
design, 45
drivers, 48-52
environment evolution, 45-46
heterogeneous systems, 46-48
example of, 52-56
instrumentation, 52
policies, 133, 136
servers, 163, 174
SNA, 245
Web management systems, 250, 254
Web services, 163
arrival rates, 198
artifacts
alerts, 84
correlation, 90
eliminating, 88
reducing, 28, 72, 88
assessment of
headroom, 115
local impact, 114
association, 51, 93
asymmetric routes, 180
ATM (Asynchronous Transfer Mode), 178
attacks
Arbor Networks, 125
DDoS, 121, 124
SYN Flood, 121
attributes, 51
objects, 93
policies, 137
audio, 170
auditing policies, 138
authentication, 165
authoritative DNS servers, 43
automation
defenses, 124
operations, 55
policy-based management, 129-130
architecture, 133, 136
design, 136, 138
elements, 131-132
need for, 130-131
products, 139, 142
service-centric, 132133
responses, 113, 116
availability, 19, 21
ROI, 224
transport services, 179
B
B2B (business to business), 6-7
B2C (business to consumer), 7-8
B2E (business to employee), 8
back-ofce operations, 56
bandwidth
over-provisioning, 188
traffic-shaping QoS, 185
transport services, 178
baselines
activity, 149
monitoring, 112
performance, 237
ProactiveNet, 117
revenue, 150
time slices, 67
beds, 200-203
behavior
customer behavior measurements, 149
Netuitive, 120
predicting, 112
services, 62
benchmarks, load testing, 199-200
best effort services, 18
BGP (Border Gateway Protocol), 190
billing, 56
boosting signals, 85, 97
bootstrapping, 25
Border Gateway Protocol (BGP), 190
bottom-up integration, 92
boundaries, elastic, 49
Brix Networks, 28
brownouts, 110-111
buffering, 124
dejitter, 154
jitter, 181
building
automated responses, 116
simulation modeling, 211, 213
business to business. See B2B
business to consumer. See B2C
business to employee. See B2E
businesses
e-business, 5-6
B2B, 6-7
B2C, 7-8
B2E, 8
goals for performance, 254
measurements, 150
process metrics, 31, 34
ROI, 219, 228
C
caches, 43, 157, 168-169
instrumentation, 172
server-side, 43
calculations
confidence intervals, 25
NPV, 222
candidates for automated responses, 116
capacity
planning, 197, 214215
workload metrics, 149
CBQ (class-based queuing), 187
CDN (Content Distribution Network), 43, 168
cell error ratios, 179
census of existing systems, 233
change latency, 34
characteristics, 51
CIM (Common Information Model), 51, 249
CIR (Committed Information Rate), 178
circuits, Frame Relay, 178
Cisco QoS Policy Manager (QPM), 139
class-based queuing (CBQ), 187
classes, one-way latency, 180
Clickstream Technologies, 158
client-side caches, 169
clocks, synchronizing, 180
closure criteria policies, 138
clusters, tools, 251
code, XML, 248
collaboration of instrumentation, 78
collectors, 125
deploying, 73
embedding, 71
event management, 82, 85
content
linkage, 78
managing, 72
measurements, 70
monitoring, 69
roll-up method, 86
services, 75
commerce
e-business, 5-6
B2B, 6-7
B2C, 7-8
B2E, 8
goals for performance, 254
measurements, 150
process metrics, 31, 34
ROI, 219-228
commercial operations, 116, 127
Committed Information Rate (CIR), 178
Common Information Model (CIM), 51, 249
communication
between design and operations, 156
effect of organizational structures, 146
completion rates, ROI, 225
complexity, managing, 130
compliance testing, 55
components
instrumentation, 159
SLM, 15
systems, 68, 73
computation load, 166
concurrent sessions, 198
concurrent statistics, 198
confidence intervals, 25
configuration
architecture, 45
drivers, 48-52
environment evolution, 45-46
D
data flow control, transport services, 188-191
data integration, 248. See also integration
data item definition, 50
databases
distributed, 249
instrumentation, 61-63
servers, 43
DDoS (Distributed Denial of Service) attacks, 121, 124
DE (discard eligible), 178
de-duplication, 87
defenses, automated, 124
dejitter buffers, 154, 181
delay, 154
processing, 156
propagation, 154
queuing, 154
round-trip, 189
serialization, 152
think time, 152
demand-side (end-user request) criteria, 166
demarcation points, 73, 189, 237
deployment, 231
collectors, 73
incremental aggregation, 232-233
initial project selection, 231-232
planning, 233-242
ROI, 225
design
architecture, 45
drivers, 48-52
environment evolution, 45-46
development
E
e-business services, 5-6
B2B, 6-7
B2C, 7-8
B2E, 8
edge routers, demarcation points, 237
Edge-Side Includes (ESI), 170
effective throughput, 179
efficiency of aggregators, 72
services, 64
virtualized resources, 110
escalation time, 33
ESI (Edge-Side Includes), 170
evaluation of ROI, 227. See also ROI
events
integration, 249. See also integration
managing, 81-82, 85
applying Micromuse, 97-99
reducing noise, 85, 97
tools, 95
publishing, 96
real-time handling, 54
signaling, 50
evolution of environments, 45-46
existing applications, 232
existing systems
baselining, 237
documentation, 233
optimizing, 237
expanding services, 49
extensible markup language (XML), 51, 248
external role of IT groups, 14
F
failover latency, 69
Failure Modes and Effects Analysis (FMEA), 136
failure rate of transactions, 20
false positives, alerts, 84
fast system management, 50
feedback, 65. See also management
fidelity of transactions, 78
file transfers, 20
filters
aggregators, 72
alerts, 89
correlation, 90
egress, 122
repeat failures, 88
financials, 56
flash load, 198
flooding, 198
Flow Control Platform, 190
flow-through QoS, 189. See also data flow control
FMEA (Failure Modes and Effects Analysis), 136
Frame Relay, CIR, 178
front-end processors, 164
functions, 72
instrumentation, 68
of event management, 86
H
headers, caching, 169
headroom, assessing, 115
heartbeats, 70
heavy-tailed distribution, 30
hierarchical collector structures, 86. See also collectors
heterogeneous systems, 46-48
hierarchies, policies, 137
high-level technical metrics, 17. See also metrics
history of architecture, 45
drivers, 48-52
environment evolution, 45-46
heterogeneous systems, 46-48
HTTP (Hypertext Transfer Protocol), 165
hybrid distribution, 135
hybrid systems, active/passive agents, 77
G
gateways, BGP, 190
generated revenues, 150
generators, load testing, 200-203
geographic distribution technologies, 43
geographic load distribution, 167
geometric deviation, 30
geometric mean, 30
geometric standard deviation, 30
GPS (Global Positioning System), 180
grooming, 72
groups, monitoring, 69
I
services
managing, 63-65
tracking, 77
systems, 68, 73, 77
Web servers, 157
integration
alerts, 96
on the glass, 47
processes, 92
systems, 248-250
technologies, 91
tools, 255
integrity, web pages, 159
intelligent monitoring, 87
interactive classes
collectors, 70
one-way latency, 180
interactive services, round-trip latency, 181
interfaces, API, 96
internal failures, 82. See also alerts; troubleshooting
internal role of IT groups, 14
internally-generated alerts, 96
Internet latencies, 180
Internet Service Providers (ISPs), 43, 49
intervals
aggregation, 26-27
time slices, 67
intranets, 8
invariant responses, 197
investments, ROI, 219, 228
IP (Internet Protocol)
addresses, 122
DiffServ, 183
TOS, 183
J
Jacobson, Van, 186
jitter, 22, 154, 181
K
Keynote Systems, 28
Keynote WebIntegrity tool, 159
keys, 165
knowledge repositories, need for, 131
L
LAN (local area network), 182
languages, 113
latency, 21
change latency, 34
diagnostics, 189
failover, 69
one-way, 180
round-trip, 181
Lawrence Berkeley Laboratories, 186
lead time, benets of, 112
legacy applications, opacity of, 160
linkage, 78
links
passive monitoring, 236
troubleshooting, 179
load balancing, 126, 166
load distribution, 164
geographic, 167
instrumentation, 172
local, 166
load testing, 67, 195-196
beds, 200-203
benchmarks, 199-200
generators, 200-203
performance envelope, 196, 199
results, 205-206
transaction load-test scripts, 203-205
local impact, assessing, 114
local load distribution, 166
locations
active probes, 236
instrumentation, 235
long-term effect of management decisions, 65. See also management
long-term operations, 55
loss, packets, 179
lower-level services, 152, 157
low-level technical metrics, 17. See also technical metrics
M
macro/micro-level, QoS, 189
management
aggregators, 72
applications
baselining, 237
development teams, 146-147
existing, 232
infrastructure, 145
instrumentation, 61-63
legacy, 160
management, 146-157, 248-250
Netuitive, 120
network-aware, 156
ProactiveNet, 117
servers, 42
complexity, 130
CRM, 204
demarcation points, 189
digital certificates, 166
events, 81-85
applying Micromuse, 97-99
reducing noise, 85, 97
instrumentation, 53
components, 68, 73
modes, 65
time slices, 67
trip wires, 66-67
new technologies, 31, 34
NMOS, 247
phased implementations, 231
incremental aggregation, 232-233
initial project selection, 231-232
planning, 233-242
policy-based, 129-130
architecture, 133, 136
design, 136-138
elements, 131-132
need for, 130-131
products, 139-142
service-centric, 132-133
problem management metrics, 33
real-time operations, 101
automated responses, 113, 116
brownouts, 110
commercial, 116, 127
proactive, 112-113
reactive, 103-104
root-cause analysis, 107-111
triage, 104, 107
virtualized resources, 110-111
real-time service metrics, 33
services, 63-65
SLM
components, 15
overview of, 9-17
SNMP, 248
systems integration, 248-250
technologies, 91
tools, 95, 255
transport infrastructure, 177
data flow control, 188-191
metrics, 178-181
QoS, 181, 188
Web system architecture, 250, 254
Management Information Base (MIB), 50
manual association of services, 93
maximum burst size, 178
problem management, 33
real-time service management, 33
SLA, 16
technical, 17, 23
transport services, 178-181
workload, 149
MIB (Management Information Base), 50
Micromuse, 97-99, 247
middle mile, 190
mission statements, ROI, 222. See also ROI
modeling
ROI, 220
simulation, 209-211
building, 211-213
performance, 211
reporting, 214
validating, 213
services, 92
modems, processing delays, 156
moderate priority level, 95
modes, instrumentation, 65-67
modification
services, 49
thresholds, 115
monitoring, 28
active, 75
baselines, 112
groups, 69
instrumentation design, 73, 77
intelligent, 87
passive, 75, 236
services, 61-63
transactions, 63
variables, 82
N
NAT (Network Address Translation), 167
net present value (NPV), 222
Netcool, 247. See also Micromuse
NetIQ, 157
NetScaler, 126
Netuitive, 120
Network Address Translation (NAT), 167
network edge, 170
Network Management Operating System (NMOS), 247
Network Time Protocol (NTP), 180
network-aware applications, 156
networks
Arbor Networks, 125
CDN, 168
collectors, 71
isolation, 188
management systems integration, 248-250
modems, 156
SNMP, 248
O
objects, 93
concurrent user session initiation attempts, 198
one-way latency, 180
opacity of legacy applications, 160
operating systems, NMOS, 247
operational business decisions, 64. See also management
operational environments, 146
operational technical decisions, 64. See also management
operations, 54, 101-107, 109-113, 116, 127
aggregators, 72
applications
baselining, 237
development teams, 146-147
existing, 232
infrastructure, 145
instrumentation, 61-63
legacy, 160
management, 146-157, 248-250
Netuitive, 120
network-aware, 156
ProactiveNet, 117
servers, 42
back-office, 56
complexity, 130
CRM, 204
demarcation points, 189
design groups, 156
digital certificates, 166
events, 81-85
applying Micromuse, 97-99
reducing noise, 85, 97
instrumentation, 53
components, 68, 73
modes, 65
time slices, 67
trip wires, 66-67
interaction teams, 146-147
long-term, 55
new technologies, 31, 34
NMOS, 247
performance envelope, 196, 199
phased implementations, 231
incremental aggregation, 232-233
initial project selection, 231-232
planning, 233-242
policy-based, 129-130
architecture, 133, 136
design, 136-138
elements, 131-132
need for, 130-131
products, 139-142
service-centric, 132-133
problem management metrics, 33
P
Packeteer, 186
packets
collectors, 70
jitter, 181
loss, 21, 179
PacketShaper, 187
page-bug tracking, 158
parsing XML, 248
partners, 49
passive collection, 75-78
passive measurements, 236
passive monitoring, 75
PathFinder (DIRIG Software), 94
payback, ROI, 228
peak cell rates, 178
peak service rates, 197
problem signatures, 91
process integration, 250
process managers, 251
processing, 72
alerts, 86
functions, 72
processing delays, 156
products, policies, 139-142
profiles, load testing, 203-205
programming XML, 248
projections, ROI, 221
promotion feedback, 151
propagation, delays, 152, 154
protocol analyzers, 212
protocols
analyzers, 212
BGP, 190
HTTP, 165
NTP, 180
SNMP, 50, 172, 248
TCP, 180
providers, measuring, 237
provisioning, 33, 56
publishing events, 96
pull (component-centric) model, 134
push (repository-centric) model, 135
Q
QoE (Quality of Experience), 10, 43
QoS (Quality of Service), 10, 212
services census, 233
transport services, 181, 188
R
rate control, QoS, 186
ratios, 180
customer orders to customer visitors, 150
load testing, 203
raw alerts, 81
reactive management, 103-104
brownouts, 110
root-cause analysis, 107-111
triage, 104, 107
virtualized resources, 110-111
real-time, 101
automated responses, 113, 116
brownouts, 110
commercial, 116, 127
proactive management, 112-113
reactive management, 103-104
root-cause analysis, 107-111
resources
SLM
components, 15
overview of, 9-17
virtualized, 110
responses
automated, 113, 116
servers, 23
transactions, 20, 152, 157
responsibilities, 240
results, load testing, 205-206
retransmissions, effective throughput, 179
Return on Investment (ROI), 56, 219-228
revenue baselines, 150
reviews, scheduling, 240
Risk Priority Number (RPN), 137
RiverSoft, 247
RMON (Remote Monitoring), 212, 236
ROI (Return on Investment), 56, 219-228
roles, 240
roll-up methods, 86
root cause, 33
root-cause analysis, 55, 107
round-trip delay, 189
round-trip latency, 181
route control, 190, 246
routers, demarcation points, 237
routes, asymmetric, 180
RPN (Risk Priority Number), 137
rules of policy-based management, 130
S
SAA (Service Assurance Agent), 71
sampling frequency, 24-26
scaling aggregators, 72
scheduled reviews, 240
scope, measurement of, 23-24
scripts, transaction load-test, 203-205
Secure Sockets Layer (SSL), 165-167
security
authentication, 165
DDoS attacks, 121
selection
of candidates, 116
of instrumentation, 235
of thresholds, 67
semantics, 50
sensitivities, 197, 214, 237
serialization of delays, 152
servers, 170
application, 42
database, 43
infrastructure, 163, 174
instrumentation, 6163
priority level, 95
response time, 23
Web, 42, 157
server-side caches, 43, 169
Service Assurance Agent (SAA), 71
Service Level Agreements. See SLAs
Service Level Management. See SLM
service providers, elastic boundaries, 49
service-centric policies, 132-133
services
behavior, 62
census, 233
collectors, 70
correlation, 93
disruptions, 62
event management, 82, 85
troubleshooting, 64
e-business services, 5-6
B2B, 6-7
B2C, 7-8
B2E, 8
expanding, 49
incremental aggregation, 232
instrumentation, 61-63
monitoring design, 73, 77
tracking, 77
integrating, 91
management, 63-65
measurement, 160
modeling, 92
modifying, 49
performance envelope, 196, 199, 254
quality measurement, 151
technical metrics, 17, 23
tracking, 75
transport
data flow control, 188-191
metrics, 178-181
QoS, 181, 188
Web
architecture, 163
delivery architecture, 42-44
webbed, 89
software, 117
baselining, 237
development teams, 146-147
existing, 232
infrastructure, 145
instrumentation, 61-63
legacy, 160
management
effect of organizational structures, 146
instrumentation, 157, 161
metrics, 147, 152
operational environments, 146
time lines, 147
transaction response time, 152, 157
management systems integration, 248-250
Netuitive, 120
network-aware, 156
ProactiveNet, 117
servers, 42
source persistence, 167
speed
demands of, 245, 247
root-cause analysis, 107
spoofing, 122
SSCPs (System Services Control Points), 46
SSL (Secure Sockets Layer), 165-167
staffing
costs, 130
ROI, 225
starving out best effort services, 18
statistics
analysis, 29-30
concurrent sessions, 198
SLA, 54
sticky environments, 149
streaming
collectors, 70
multimedia, 179
quality, 20
superficial integration, 248
supply-side (server status) criteria, 166
suppression, IP source address spoofing, 122
sustainable cell rates, 178
switches
content, 166
instrumentation, 6163
NetScaler, 126
switching, CDN, 168
SYN Flood attacks, 121
synchronization of GPS, 180
syntax, SNMP, 50. See also programming
synthetic (virtual) transactions, 75, 202
system architecture, 41. See also architecture
design, 45
drivers, 48-52
environment evolution, 45-46
heterogeneous systems, 46-48
example of, 52-56
Web service delivery, 42-44
System Services Control Points (SSCPs), 46
systems
instrumentation, 68, 73
integration, 248-250
Systems Network Architecture (SNA), 45, 245
T
tag-based QoS, 182
tagging erroneous measurements, 28
Tavve EventWatch, 117
TCP (Transmission Control Protocol)
Packeteer, 186
round-trip latency, 181
slow start algorithm, 155
traffic-shaping QoS, 185
transport services, 180
teams
development, 147-149
elastic boundaries, 49
technical metrics, 17, 23
technical quality metrics, 178-181
technologies, integrating, 91
telephone voice transmissions, quality of, 21
testing
DUT, 200
integrity, 159
load, 195-196
beds, 200-203
benchmarks, 199-200
generators, 200-203
performance envelope, 196, 199
results, 205-206
transaction load-test scripts, 203-205
phased implementation, 231-232
policies, 138
regression, 202
think time, 152
third-party content providers, 44
threats, automated defenses, 124
thresholds
alerts
triggers, 82
trip wires, 66-67
modifying, 115
throughput, 179
tiers of architecture, 163, 174
time
correlation, 91
lines, 147
NTP, 180
slices, 67
transactions, 151-152, 157
Time to Value (ROI), 221
time, 151. See also measurements, metrics
tolerance for service interruption, 27
tools, 255
baselining, 237
clusters, 251
development teams, 146-147
existing, 232
infrastructure, 145
instrumentation, 61-63
legacy, 160
management, 95
effect of organizational structures, 146
instrumentation, 157, 161
metrics, 147, 152
operational environments, 146
time lines, 147
transaction response time, 152, 157
management systems integration, 248-250
Micromuse, 97-99
Netuitive, 120
network-aware, 156
policy-based management, 133
ProactiveNet, 117
reporting, 240
servers, 42
simulation modeling, 211
building, 211-213
reporting, 214
validating, 213
top-down integration, 92
top-down process, 62
tracking
services, 75, 77
workflow, 95
traffic, tag-based QoS, 182, 185
transactions
collectors, 70
failure rates, 20
fidelity, 78
load-test scripts, 203-205
monitoring, 63
recorder, 204
response time, 20, 152, 157
ROI, 228. See also ROI
roll-up methods, 86
security of, 165
service quality measurement, 151
synthetic, 75
synthetic (virtual), 202
time, 151
virtual, 75
transfers, files, 20
transport infrastructure
managing, 177
metrics, 178-181
U
users
experience, 75
measurements, 160
utilities
baselining, 237
clusters, 251
development teams, 146-147
existing, 232
infrastructure, 145
instrumentation, 61-63
legacy, 160
management, 95
effect of organizational structures, 146
instrumentation, 157, 161
metrics, 147, 152
operational environments, 146
time lines, 147
transaction response time, 152, 157
management systems integration, 248-250
Micromuse, 97-99
Netuitive, 120
network-aware, 156
policy-based management, 133
ProactiveNet, 117
reporting, 240
servers, 42
simulation modeling, 211
building, 211-213
reporting, 214
validating, 213
V
validation
measurement, 28-29
simulation modeling, 213
values, NPV, 222
variables
bit rates, 178
monitoring, 82
time slices, 67
verification of alerts, 88
video, 170
virtual (synthetic) transactions, 147
virtual transactions, 75, 202
virtualized resources, 110
volume
intelligent monitoring, 87
reducing, 86
W
warning priority level, 95
Web management systems, 250, 254
web pages
integrity, 159
load testing, 199. See also load testing
Web servers, 42, 157
Web services
architecture, 163
delivery architecture, 42-44
webbed ecosystem, 89
webbed services, 89
WebEffective, 158
WebTrends Log Analyzer Series, 157
WFQ (weighted fair queuing), 187. See also queuing
windows, 154
workflow, tracking, 95
workload
metrics, 18, 21, 149
transport services, 178
X-Y
X out of Y process, 89
XML (extensible markup language), 51, 248
Z
zombies, 121