Network and Voice Management Green Book ENU

CA GREEN BOOKS
Network and Voice Management

An Integrated Solution for Network Fault and Performance Management
OVERVIEW OF CONVERGED NETWORK CHANGES AND MANAGEMENT NEEDS
BEST PRACTICES FOR DEPLOYING CA'S INTEGRATED SOLUTION FOR NETWORK AND VOICE MANAGEMENT
LEGAL NOTICE
This publication is based on current information and resource allocations as of its date of publication and is subject to change or withdrawal by CA at any time without notice. The information in this publication could include typographical errors or technical inaccuracies. CA may make modifications to any CA product, software program, method or procedure described in this publication at any time without notice. Any reference in this publication to non-CA products and non-CA websites are provided for convenience only and shall not serve as CA's endorsement of such products or websites. Your use of such products, websites, and any information regarding such products or any materials provided with such products or at such websites shall be at your own risk. Notwithstanding anything in this publication to the contrary, this publication shall not (i) constitute product documentation or specifications under any existing or future written license agreement or services agreement relating to any CA software product, or be subject to any warranty set forth in any such written agreement; (ii) serve to affect the rights and/or obligations of CA or its licensees under any existing or future written license agreement or services agreement relating to any CA software product; or (iii) serve to amend any product documentation or specifications for any CA software product. The development, release and timing of any features or functionality described in this publication remain at CA's sole discretion. The information in this publication is based upon CA's experiences with the referenced software products in a variety of development and customer environments. Past performance of the software products in such development and customer environments is not indicative of the future performance of such software products in identical, similar or different environments. CA does not warrant that the software products will operate as specifically set forth in this publication.
CA will support only the referenced products in accordance with (i) the documentation and specifications provided with the referenced product, and (ii) CA's then-current maintenance and support policy for the referenced product. Certain information in this publication may outline CA's general product direction. All information in this publication is for your informational purposes only and may not be incorporated into any contract. CA assumes no responsibility for the accuracy or completeness of the information. To the extent permitted by applicable law, CA provides this document AS IS without warranty of any kind, including, without limitation, any implied warranties of merchantability, fitness for a particular purpose, or noninfringement. In no event will CA be liable for any loss or damage, direct or indirect, from the use of this document, including, without limitation, lost profits, lost investment, business interruption, goodwill or lost data, even if CA is expressly advised of the possibility of such damages.
COPYRIGHT LICENSE AND NOTICE:

This publication contains sample application programming code and/or language which illustrate programming techniques on various operating systems. Notwithstanding anything to the contrary contained in this publication, such sample code does not constitute licensed products or software under any CA license or services agreement. You may copy, modify and use this sample code for the purposes of performing the installation methods and routines described in this document. These samples have not been tested. CA does not make, and you may not rely on, any promise, express or implied, of reliability, serviceability or function of the sample code. Copyright 2007 CA. All rights reserved. All trademarks, trade names, service marks and logos referenced herein belong to their respective companies.
ACKNOWLEDGEMENTS
CA thanks the following people for their contributions to this CA Green Book:

Principal Authors: Don LeClair, Sue Andersen, Jason Bryk, Roger Craig, Bill Donoghue, Justin Gagnon, Brian Gollaher, Andrew Haigh, Kathleen Hickey, Mark Hounslow, John Kane, Michael Marks, John Murdough, Barbara O'Toole, Pete Oliveira, Jason Warfield, Dianne Weiss

The principal authors and CA would like to thank the following contributors: Ajei Gopal, Tricia Bancroft, Lynn Beck, Gregory Buonaiuto, Curtis Lehman, Peter Clairmont, Dan Lewis, Anders Magnusson, Alexandre Moscoso, Joe Pennachio, David Soares, Peter Skotny, Cheryl Stauffer, Tom Wilson
CA PRODUCT REFERENCES
CA Network and Voice Management
eHealth
eHealth for Voice
SPECTRUM
eHealth for Voice Policy Manager
eHealth E2E Console
eHealth Live Health
eHealth Traffic Accountant
eHealth Universal Workflow Integration Modules
eHealth Universal Data Integration Modules
eHealth Universal Wireless Integration Modules
SPECTRUM Infinity
SPECTRUM Integrity
SPECTRUM Xsight
SPECTRUM OneClick
SPECTRUM Service Manager
SPECTRUM Report Manager
SPECTRUM Alarm Notification Manager
SPECTRUM ATM Circuit Manager
SPECTRUM Configuration Manager
SPECTRUM Secure Domain Manager
SPECTRUM Frame Relay Manager
SPECTRUM Microsoft Operations Manager Connector
SPECTRUM Multicast Manager
SPECTRUM OSS Integrations
SPECTRUM QoS Manager
SPECTRUM Remedy ARS Gateway
SPECTRUM SNMPv3 Support
SPECTRUM VPN Manager
SPECTRUM Watch Editor
SPECTRUM Service Performance Manager
SPECTRUM Assurance Server Xsight
SPECTRUM Assurance Server Integrity
SPECTRUM Assurance Server Infinity
Contents
Chapter 1: Introduction
  About This Book
  Executive Summary
  Evolving Requirements for Network and Voice Management
  CA's Network and Voice Management Solution

Chapter 2: Challenges of Network and Voice Management
  Evolution of the Network-to-Service Delivery Platform
  Impact on Network Operations Teams
  Impact on Network Management Software Requirements

Chapter 3: CA's Network and Voice Management Solution
  EITM: CA's Vision
  Enterprise Systems Management
  The Value of CA's Network and Voice Management Solution
  A Key Part of CA's EITM Vision
  Network and Voice Management for Key Vertical Markets
  Telecommunication Service Providers
  Government
  Enterprise
  The Components of the Solution
  eHealth
  eHealth Components
  The Benefits of eHealth
  SPECTRUM
  SPECTRUM Components
  The Benefits of SPECTRUM
  Integration between eHealth and SPECTRUM
  eHealth for Voice
  The Benefits of eHealth for Voice
  CA Technology Services Network and Voice Management Service Offerings
  Assessment - Understanding the Gaps
  CA Maturity Models
  Design - Building the Right Solution
  Implementation - The Bottom Line of Solution Success
  Optimization - Anticipating Change
  Why Trust Your Service Availability to CA Technology Services?
  How the Solution Delivers the Key Points of Value
  Effective Service Level Management
  Proactive Service Assurance
  Rapid Problem Resolution
  Predictive Capacity Planning

Chapter 4: Deployment Architecture for Network and Voice Management
  Network Performance Components
  E2E Console
  Live Health
  Integration Modules
  Distributed eHealth
  Remote Poller
  Report Center
  Traffic Accountant
  Network Fault Management Components
  Assurance Server
  OneClick
  Watch Editor
  Alarm Notification Manager
  Frame Relay Manager
  ATM Circuit Manager
  Multicast Manager
  QOS Manager
  VPN Manager
  SNMPv3
  Secure Domain Manager
  Configuration Manager
  Report Manager
  Service Performance Manager
  Service Manager
  Voice Management Components
  eHealth for Voice
  eHealth for Voice Policy Manager
  Deployment Architectures
  Small-to-Medium Enterprise Deployment
  Large Service Provider Deployment
  Network Performance Hardware and Software Requirements/Sizing
  Network Fault Management Hardware and Software Requirements/Sizing
  eHealth for Voice Single PC or Database Server Hardware and Software Requirements

Chapter 5: Setting Up and Configuring the Integrated Solution
  Installing the CA Network and Voice Management Solution Software
  Installation Prerequisites
  Installation Steps
  How You Install SPECTRUM
  How You Install SPECTRUM OneClick and Report Manager
  How You Install eHealth
  How You Install eHealth for Voice
  Configuring the Integrated Solution
  Best Practices
  Identify Resources and Use SPECTRUM to Discover Them as Global Collections
  Import Global Collections into eHealth
  Organize Your Resources by Creating eHealth Groups
  Schedule eHealth Discoveries of Global Collections
  Network and Voice Monitoring
  Set Up Live Health
  Forward Live Health Traps to SPECTRUM
  Customize and Schedule Health Reports to Forward Traps
  Configure eHealth for Voice to Send Alerts to SPECTRUM
  Configure SPECTRUM to Recognize the eHealth Server
  Configure SPECTRUM to View eHealth Alarms
  System Maintenance
  System Backup Archives
  Data Recovery Best Practices

Chapter 6: Gathering System Information from Agents
  Deployment and Administration of System Agents
  Best Practices
  Supported Agents
  Prerequisites
  How You Add System Agents in SPECTRUM
  Unicenter NSM Agents
  How You Add System Agents in eHealth
  Performance Reporting On System Agents
  At-a-Glance Reports
  MyHealth Reports for Systems
  Health Reports for Systems
  Using Live Trend
  How You Run Trend Reports for Systems
  Top N Reports
  What-If Capacity Trend Reports for Systems

Chapter 7: Service Level Management
  Interview Procedures
  Interview Questions
  General Questions
  Analysis and Mapping Procedures
  How You Organize the Resource Information
  How You Illustrate the Relationships of Resources to Each Other
  How You Decompose the Information and Mapping to Service Models
  Example of a Business Service Map to Service Models
  Creating Service Models and Relationships
  Key Concepts
  How You Create Service Models
  Example 1: A Customer Account Access Service
  Example 2: Extend the Service to Monitor Critical Processes
  Implement Example 2 in SPECTRUM
  Example 3: Extend the Service to Include a Response Time Element
  Create SLAs
  Key Concepts
  Create SLAs and Guarantees
  Example 4: An SLA for the Customer Account Access Service
  How You Implement the A to Z Account Access SLA in SPECTRUM
  Service and SLA Reporting
  Run SPECTRUM Service Manager Customer-Facing Reports
  Service Availability by: Name, Customer, Owner
  Service Availability Variable Health Level
  Service Summary by: Name, Customer, Owner
  Service Summary Variable Health Level
  SLA Detail By Customer
  SLA Inventory by Customer
  SPECTRUM Service Manager: Internal Reports
  Service Health by Service Name
  Service Inventory
  Top N Worst Performing Services
  Top N Worst Performing Services Including All Outage Types
  Top N Worst Service Outages
  Top N Worst Service Resources by Total Downtime
  SLA Status Current and Recent by Customer
  SLA Summary by: Name, Customer, Status
  SLA Summary Warned or Violated
  SLA Detail By: SLA Name, Time Range, Last N Periods
  SLA Detail with Resource Outages
  Customer SLA Summary

Chapter 8: Proactive Service Assurance
  How You Identify Potential Problems
  Configure Live Health to Watch for Growing Problems
  Configure Health Reports to Send Traps for Growing Problems
  Send Voice Alerts to SPECTRUM
  How You Respond to Alarm Actions in SPECTRUM

Chapter 9: Predictive Capacity Planning
  How You Identify Underutilized Resources
  Locate Underutilized Resources
  Confirm Underutilization
  How You Address Underutilized Resources
  Show ROI
  Update Your Configuration
  How You Identify Overutilized Resources
  Locate Overutilized Resources
  Confirm Overutilized Resources
  How You Address Overutilized Resources
  How You Plan Future Capacity Changes
  Identify Potential Capacity Changes
  Analyze Capacity Trends
  Visualize Capacity Changes
  How You Address Capacity Changes
  Voice Capacity Planning
  Analyze Voice Capacity
  Analyze GoS
  How You Address Underutilized Resources
  Show ROI
  How You Address and Confirm Overutilized Resources
  Analyze Voice Messaging Disk Capacity
  How You Resolve Disk Capacity Issues

Chapter 10: Rapid Problem Resolution
  Problem-Solving Techniques
  Complex Problems and Powerful Solutions
  Problem Prediction and Prevention
  Business Impact
  Event Correlation and RCA - A Three-Pronged Approach
  RCA
  Inductive Modeling Technology
  Event Management System
  Condition Correlation
  Fault Scenarios
  Communication Outages and Impacts
  How SPECTRUM's Intelligence Isolates Communication Outages
  Event Management System
  Apply Condition Correlation to Service Correlation
  Leverage the Integrated Solution

Index
Chapter 1: Introduction
About This Book
The CA Green Book for Network and Voice Management describes how to manage the performance and availability of converged networks. The CA solution provides proactive management of voice and data services, ensures that bandwidth and system capacity are sufficient, and supports business-driven service levels. The solution also provides integrated network fault and performance management to support the network as a service delivery platform.

The information contained in this CA Green Book is designed for network operators, engineering, and technical staff charged with managing voice and data networks. The deployment examples highlighted in this book present the views of a small enterprise and a large service provider. This information may be useful for many other network deployments, but may not meet all of their specific requirements.

This CA Green Book provides an understanding of capabilities that you can deploy today to manage your converged network. The opening sections provide a strategic view of the trends toward converged networks, and the subsequent sections present best practices for deploying and using the CA Network and Voice Management solution to manage the converged network.

This CA Green Book contains only information about network and voice management. It is one of a series of CA Green Books designed to help define the capabilities of CA's key solutions and provide best practices on how to manage and secure them. Other CA Green Books will present solutions across a wide range of IT management topics including systems management, database management, and workload automation.

This Network and Voice Management Green Book is targeted toward CIOs, network management teams, and technical teams. The book is structured as follows:

Chapters 1-3: Provide CIOs and network managers with an overview of the challenges of managing converged networks and the value of CA's Network and Voice Management solution.

Chapters 4-10: Deliver sample deployments for the products comprising CA's Network and Voice Management solution to network managers and other technical personnel. Provide best practices for planning, deploying, and configuring this solution to speed the time-to-value for investments made in optimizing the network.

This CA Green Book also covers the following topics:

- Technical descriptions of the components that comprise the recommended solution
- Best practices for setting up and configuring the components of the solution
- Defining and managing network service level management
- Best practices for enabling proactive service assurance and resolving problems quickly
- Performing capacity planning and management of voice and data networks
Executive Summary
Evolving Requirements for Network and Voice Management
The entire computing infrastructure depends upon the network. Recently, the use of the network has changed enormously, with data, voice, business-critical applications, and video content converging on the same network. Network faults and performance problems have immediate and negative business consequences for productivity, cost, and revenue. As a result, interest in and awareness of network fault and performance issues have expanded to a wider audience of business-oriented and non-technical users who want and need real-time information. In addition, the network operations team must rapidly learn new technologies and expand their management responsibilities to support converged data and voice networks.

In this environment, network management solutions must provide critical information and management capabilities appropriate to both technical and business users. Network management solutions need to provide configurable alerts, dashboards, and analytical capabilities to all users in addition to delivering traditional fault and performance management.

Today's network and voice management solutions must provide the following support:
Heterogeneous Networks: They must support data technologies such as internet protocol (IP), asynchronous transfer mode (ATM), frame relay (FR), and broadband, as well as voice infrastructures composed of legacy time-division multiplexing (TDM), pure-play IP telephony (IPT), and hybrid infrastructures.
Scale: They must support vast networks distributed across countries and continents, and enable central or regional network management teams.
Integration: They must provide a single solution for managing data infrastructures, voice infrastructures, and critical systems.
Role-Based Service Information: They must be able to communicate to external customers through Service Level Agreements (SLAs) and to internal customers through either Operational Level Agreements (OLAs) or less formal mechanisms. They must help both technical and business-minded audiences assess the network's ability to support the business, and identify and pinpoint the root cause of problems.
CA's Network and Voice Management Solution

CA's Network and Voice Management solution provides converged voice and data management. It supports the need for IT to proactively manage end-to-end voice and data services, ensure adequate capacity of bandwidth and systems, and support service levels defined by the business. This solution is a key part of Enterprise IT Management (EITM), which is CA's vision for how to dynamically manage and secure IT environments, enabling organizations to fully realize the potential of IT. CA's Network and Voice Management solution spans both IP and legacy voice technologies, enabling companies to migrate to IP telephony at their own pace, and reduce the complexity of managing heterogeneous infrastructures.
This tested, integrated solution provides the following key points of value:
Effective Service Level Management: Baseline, assess, and track services through the network, and communicate adherence to SLAs and OLAs to business and technical audiences.
Proactive Service Assurance: Use policy-based monitoring to detect service degradations before customers or end users are impacted.
Rapid Problem Detection: Resolve the true cause of the problem through event correlation, root cause analysis (RCA), and linkage with real-time and historical reporting.
Predictive Capacity Planning: Use intelligent embedded algorithms that help network operations teams identify when and where to make circuit and hardware changes to keep service within expected performance thresholds.

CA's Network and Voice Management solution comprises three primary components, all of which provide a consistent set of capabilities across the voice and data infrastructure:
eHealth Network Performance Management gathers and stores critical performance data from over 100 vendors and 1000 devices, and applies intelligent algorithms to that data to identify service degradation, capacity planning problems, and causes of performance problems.
SPECTRUM Fault Management provides fault management, RCA, and service level management.
eHealth for Voice offers system performance and fault management for legacy TDM and IP communication systems, messaging systems, and private branch exchange (PBX) switches.

This is a powerful and unique solution for managing increasingly complex network services within enterprise, government, and telecommunications environments. It enables technical teams to improve service, control costs, reduce risk, increase revenue, and drive efficiency when managing IT infrastructure as a business service.
To help ensure successful implementations, the CA Technology Services organization is equipped to help any organization assess, design, implement, and optimize network performance and availability solutions. It takes a lifecycle approach to the implementation of a total network management solution, which includes the following:
Assessment: Understanding the Gaps. Comprehensive assessments, such as the Event-to-Resolution Readiness Assessment, validate the current maturity and efficiency level of network performance and availability management.
Design: Building the Right Solution. CA architects design successful service availability solutions for customers that range in size from a single-location, medium-sized business to global IT operations demanding 24x7 availability and high-speed performance.
Implementation: The Bottom Line of Solution Success. CA consultants prepare the environment; install, configure, and customize eHealth and SPECTRUM; verify and document your eHealth and SPECTRUM solutions on test, quality assurance (QA), and production systems; and provide knowledge transfer to your staff.
Optimization: Anticipating Change. Optimization services evaluate ways in which your existing eHealth and SPECTRUM solutions can be further utilized or fine-tuned. Health check services can include tuning and reconfiguration, upgrades, and migrations, as well as training and certifications.

With CA's converged network and voice management solution, the IT organization becomes proactive rather than reactive in its approach to managing voice and data. Network operations teams have the tools to quickly determine the cause of problems. The IT planning or engineering group can determine if resources are underutilized or reaching a capacity threshold. The IT department can manage its relationships with key constituencies with formal service levels. The ability to monitor and report on grade of service (GoS) and quality of service (QoS) for calls in the voice network is essential to successful service level management. CA's Network and Voice Management solution makes this all possible.
Chapter 2: Challenges of Network and Voice Management

All computing, whether mainframe, client-server, distributed, grid, or web services, depends on the function of the network on which it resides. Today's businesses recognize the dependence of business-critical services, such as financial applications and voice, on the network infrastructure. If the infrastructure is down or slow, the resulting impact on business-critical applications and services, and on the end users who rely on them, causes lost revenue and productivity while increasing costs. Infonetics estimates that infrastructure downtime and degradation cost enterprises up to 3.6% of revenue annually. [1]
Evolution of the Network to Service Delivery Platform

Over the past couple of years, the nature of the network itself has changed, and this change has created significant implications for network management software and for the operations and IT team members who use it. Not long ago, the main function of the network was strictly maintaining data connectivity. Today, the network is considered to be more of a service delivery platform. The network supports real-time services such as Voice over IP (VoIP), IP Television (IPTV), and video teleconferencing, all of which have evolved from early adoption and are now approaching mainstream adoption. Enterprises are increasingly reliant on applications distributed over wide geographic areas to provide any-time access to employees, customers, and partners to accomplish critical business functions. Furthermore, the equipment comprising today's networks is now embedded with services such as security, high availability, and storage, which were previously provided by infrastructures found outside of the network.
Impact on Network Operations Teams

The impact of this evolution on the roles and responsibilities of network operations teams has been significant. As companies rely on real-time services to improve revenue, raise productivity, and cut costs, any degradation in service has an immediate impact on customer satisfaction and the business. Because the network is carrying applications that directly impact the company's bottom line, the range of internal constituents who want to understand the performance of the network has expanded from technical groups to business-oriented, non-technical, line-of-business managers. Network operations team members now have a whole new set of internal and external customers to whom they need to communicate their ability to meet service level commitments. Because this audience is non-technical, network operations teams need to communicate with it in business terms rather than technical terms.
[1] Infonetics Research, The Cost of Enterprise Downtime, North America 2004. http://www.infonetics.com. Used with permission.
In addition, because of the tight inter-dependence between the network infrastructure and the applications and services that are provided through it, network operations teams need to extend their oversight from purely network infrastructure to applications, voice, and other services as well. This extension requires a deeper understanding of new technologies formerly managed by other teams. The embedding of services into network equipment has also increased the range and complexity of devices that must be managed by the network teams, which further complicates their roles.
Impact on Network Management Software Requirements

The evolution of converged networks and network operations challenges requires management solutions able to maintain the connection between the business and both internal and external customers. Strategic use of network management capabilities can minimize potential accountability problems between network managers and application/IT managers. One group needs to account for network performance issues, while the other is responsible for the health of the applications deployed across the network and for satisfying internal customers through the application user experience.

The ability to provide a multitude of reports and statistics on network status is essential. However, simple and user-friendly network management tools with high-level alert dashboards and features that enable users to drill down to application and network issues are gaining acceptance and will ultimately be accounted for in the IT budget. Converging networks in both enterprises and service providers will force network and IT managers to view network conditions on a per-application, per-flow basis. New opportunities are already emerging in wireless local area network (WLAN) management, security management, VoIP management, and network configuration. Automation and visibility tools for managed service providers will be critical in offering services across multiple networks to multiple offices. This need is further amplified by the tightly knit supply chain within information networks and the increasing trend of distributed and mobile workforces.

Operations teams charged with the responsibility to maintain converged networks face the following critical challenges:
Heterogeneous Networks: Networks are now composed of a broad range of technologies and vendors, including data technologies like IP, ATM, FR, and broadband, as well as voice infrastructures composed of legacy TDM, pure-play IPT, and hybrid infrastructures.
Most migration to VoIP occurs gradually; therefore, the need to simultaneously manage both legacy TDM and IP telephony environments still exists.
Scale: Today's voice and data infrastructures are vast and span wide geographic areas across time zones, countries, and even continents. The management system must be able to scale to support these very large infrastructures and provide the required information to the operations teams, regardless of whether they are located centrally or regionally.
Required to Manage Data Infrastructures, Voice Infrastructures, and Critical Systems: The management system must span domains to allow IT team members to smoothly manage technical domains which were previously managed as silos.
Required to Communicate QoS to a Variety of Constituents: Operations teams need to be able to communicate to external customers through SLAs, and to internal constituents through OLAs or less formal mechanisms. Therefore, management software must contain the intelligence and capabilities to do the following for both technical and business-minded audiences:
    Assess the infrastructure's ability to support the business.
    Identify problems as they occur.
    Pinpoint the source or cause of the problems.

In response to these trends, the worldwide network availability market is growing rapidly. Delivering effective network management software will provide organizations with the tools that they need to evolve into more efficient structures.
Chapter 3: CA's Network and Voice Management Solution

EITM: CA's Vision
EITM is CA's vision for how to dynamically manage and secure IT environments, enabling organizations to realize the full potential of IT as a source of business value. EITM provides a common foundation for the integration and sharing of services and data that allows for the orchestration of all IT assets and resources in unison (infrastructure, applications, and business processes). This business-oriented approach also makes it possible to integrate the management of networks, systems, storage, databases, applications, and security, as well as to provide a way to measure, optimize, and demonstrate the impact of IT on the organization's goals as never before.

CA is the worldwide leader in management software solutions. We have been in the management software business for three decades, and have been focused on providing solutions for all areas of infrastructure management. We are committed to helping organizations achieve their goals by reducing IT costs to optimize capital and operating expenses, mitigating risk, achieving compliance, and helping to ensure that the infrastructure is always available and performing optimally.

The core of CA's approach is to deliver management solutions that provide a unified view of all assets and operations of the organization as they relate to business activities and needs. This view enables organizations to align IT with business, enabling them to make better, more informed business decisions about how to direct business activities and utilize assets.

CA's management solutions include Service Availability, which helps IT departments deliver consistently superior IT services by implementing proactive, integrated management that provides insight into the health of all systems and applications on which each business service depends.
Enterprise Systems Management

As organizations expand their businesses to reach a broader audience and maintain a more competitive edge, a new generation of network and system management solutions has evolved to assist them in taking a more proactive and service-oriented approach to optimizing business process availability and performance. These solutions enable risk reduction or mitigation while increasing operational efficiencies, business continuity, and adherence to regulatory compliance. The end result is cost-efficient IT aligned with the business.

Enterprise systems management solutions are flexible and powerful enough to deliver IT services that are business-aware and business-appropriate. The management solutions must make it possible to map IT resources to business requirements based on business impact. Automation that responds along policy-determined paths is at the heart of CA's enterprise systems management solutions.

The first business of IT is to manage a good operation, with services up and running at an appropriately high level of availability and performance. To support these goals, CA's enterprise systems management solutions help ensure the following:
Networks and systems are properly managed for availability and performance.
Jobs and events are automated for optimization.
Applications and databases are managed for performance.
Desktops and servers are provisioned and configured.

CA recognizes that enterprise management does not exist in isolation. It is part of an overall IT infrastructure that covers many disciplines. Servers are only useful to the extent that they are reliably working over predictably high-performing networks. Likewise, applications are only as secure as the systems that control their access. And nothing works well without storage systems that serve business needs.
CA's vision is based on the belief that traditionally distinct disciplines (networks and systems, storage, security, and service management) should be integrated tightly to optimize the performance, reliability, and efficiency of enterprise IT environments. CA has designed products that interact with each other, leveraging common services: software components that perform reusable functions across multiple applications. By developing a central management database to provide a unified view of all aspects of the enterprise in relation to business activities and needs, CA has created the foundation for a business-centric IT organization.

With access to comprehensive, cross-disciplinary information, IT organizations can fully understand how all IT resources are being used across their organizations. Armed with this knowledge, IT organizations can offer services tailored specifically to meet the needs of individual departments and provide executives with feedback about exactly how IT resources are being used and how costs are being incurred. The result is the empowerment of decision makers who are fully equipped to make informed choices about how to direct business activities.
The Value of CA's Network and Voice Management Solution

For today's businesses to be able to compete, high-performance voice and data solutions are essential. CA's vision for converged voice and data management supports the need for IT to proactively manage end-to-end voice and data services, ensure adequate capacity of bandwidth and systems, and support service levels defined by the business. CA's Network and Voice Management solution spans both IP and legacy technologies, enabling companies to migrate to IP telephony at their own pace, and reduce the complexity of managing heterogeneous infrastructures. Our product strategy supports this vision through continuous updating, innovation, and integration as defined by our customers.

CA's Network and Voice Management solution provides the following key points of value:
Effective Service Level Management: This solution enables you to baseline, assess, and track services through the network and communicate adherence to SLAs and OLAs to business and technical audiences.
Proactive Service Assurance: It provides a policy-based approach to monitoring degradations in service, which gives you the ability to identify degradations before customers are impacted, and to account for these degradations in RCA and event correlation.
Rapid Problem Detection: This solution focuses on resolving the true cause of the problem, not the symptom, through a combination of event correlation, RCA, and linkage with real-time and historical reporting.
Predictive Capacity Planning: The foundation of this solution is intelligent, embedded algorithms that inform network operations teams when to upgrade or downgrade circuits or other hardware based on past usage trends and tailored thresholds.

Our network and voice management strategy is four-fold:
Enable our customers to manage their business processes, as well as the data, voice, and multimedia services consistent with their competitive strategy.
Enable our customers to manage their converged networks, including business-critical applications and the transition from traditional to IP telephony.
Provide end-to-end fault and performance management for data, voice (TDM and IP telephony), and system and application infrastructures.
Extend management to the voice and multimedia resources as well as the network infrastructure.

CA is the only vendor that can provide the following:
An integrated, proactive management solution
A solution that spans both IP and legacy technologies
Solutions that help ensure voice network performance before, during, and after a migration to VoIP

With CA's Network and Voice Management solution, the IT organization becomes proactive rather than reactive in its approach to managing voice and data. For example, instead of waiting to receive customer complaints about poor voice quality before acting, IT staff are alerted when policies covering jitter, low mean opinion scores (MOS), or hardware problems (such as those on T1 circuits) detect that user-defined thresholds have been exceeded.

When a problem occurs in an IP telephony environment, it is sometimes difficult to determine the cause. For example, a user may complain about not being able to make a call, but many alarm events can be generated by routers, IPT systems, switches, and other devices. CA's Network and Voice Management solution provides a proactive approach to managing IP telephony and VoIP.

Capacity planning is essential to converged voice and data management. Data collected from voice systems, whether legacy or IPT, and from the network (including trunks and port capacity) helps the IT engineering team determine if resources are underutilized or reaching a capacity threshold. This information can also be used as the basis for predictive capacity planning for the network.

IT departments are tasked with providing service levels to key constituencies, especially revenue-generating business units such as contact centers. The ability to monitor and report on GoS and QoS for calls in the voice network is essential to service level management.
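In traditional telephony, GoS is commonly expressed as the probability that a call attempt is blocked because every trunk is busy, and the Erlang B formula is one standard way to estimate it. The sketch below is illustrative only: it assumes GoS is measured as blocking probability (this book does not define the metric), that the Erlang B assumptions hold (random call arrivals, blocked calls cleared), and it uses hypothetical function names that are not part of any CA product.

```python
def erlang_b(offered_erlangs: float, trunks: int) -> float:
    """Blocking probability for a trunk group under the Erlang B model.

    Uses the numerically stable recursion
    B(A, 0) = 1,  B(A, n) = A * B(A, n-1) / (n + A * B(A, n-1)).
    """
    if offered_erlangs < 0 or trunks < 0:
        raise ValueError("traffic and trunk count must be non-negative")
    blocking = 1.0  # with zero trunks, every call is blocked
    for n in range(1, trunks + 1):
        blocking = offered_erlangs * blocking / (n + offered_erlangs * blocking)
    return blocking


def trunks_needed(offered_erlangs: float, target_gos: float) -> int:
    """Smallest trunk count whose blocking probability meets the GoS target."""
    if not 0 < target_gos <= 1:
        raise ValueError("target GoS must be in (0, 1]")
    trunks = 0
    while erlang_b(offered_erlangs, trunks) > target_gos:
        trunks += 1
    return trunks


if __name__ == "__main__":
    busy_hour_traffic = 10.0  # erlangs, an assumed example value
    needed = trunks_needed(busy_hour_traffic, target_gos=0.01)
    print(f"Trunks needed for 1% blocking: {needed}")
    print(f"Blocking at that size: {erlang_b(busy_hour_traffic, needed):.4f}")
```

Comparing the measured busy-hour blocking with the modeled value for the installed trunk count is one way a team could judge whether a trunk group is under- or over-provisioned.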
A Key Part of CA's EITM Vision

CA's Network and Voice Management solution fits within CA's company-wide EITM strategy. The solution addresses the four major CIO imperatives in the following ways:
Improves service by providing proactive service assurance to detect problems before they impact end users, and ensures the reliability and responsiveness of the network infrastructure with powerful RCA, event correlation, and impact analysis.
Manages risk by assuring business continuity and enabling organizations to comply with regulatory and governance requirements.
Manages costs by significantly reducing the cost of downtime (outage or degradation), minimizing both the number of occurrences and their duration.
Aligns IT with business by giving the IT team a business view that provides the status of end-to-end business services.
Network and Voice Management for Key Vertical Markets

CA provides powerful and unique management software for managing increasingly complex network services within traditional enterprise and government environments, as well as telecommunications, cable, mobile wireless, and other service provider industries. It enables technical teams to improve service, control costs, reduce risk, increase revenue, and drive efficiency when managing IT infrastructure as a business service.
Telecommunication Service Providers

CA views the telecommunication service-provider industries as an important vertical market, the members of which leverage their operational environment as a key part of controlling costs and delivering product and service differentiation, both crucial factors in remaining competitive in today's challenging communications marketplace. Most service providers select a variety of management tools specific to element management, service provisioning, billing, customer care, and service assurance. Within large carrier and service provider environments, this creates many disparate applications and data stores. They must somehow be integrated to form efficient workflow processes that ensure services are delivered reliably while maintaining operational efficiencies.

CA's Network and Voice Management solution provides flexible integration points that allow it to function successfully in heterogeneous Operational Support System (OSS) software environments, while also reducing deployment time and complexity, as well as configuration and maintenance costs. CA works with many of the leading OSS vendors in areas of service assurance, fulfillment, and billing/customer care.

CA's Network and Voice Management solution enables service providers to accomplish the following goals:
Verify and validate SLA guarantees to improve customer satisfaction and reduce churn.
Reduce operating expenses by minimizing downtime and Mean Time to Repair (MTTR).
Reduce capital expenses through intelligent capacity planning.
Accelerate the support for new service offerings:
    Virtual Private Network (VPN)
    VoIP
    Voice/video/data triple-play
    New wireless data services
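The first two goals above rest on simple arithmetic over outage records. The following sketch shows one common way to compute availability and MTTR for a reporting period and compare the result with an SLA target. It assumes outages are recorded as start and end offsets in minutes and do not overlap; the `Outage` type and function names are hypothetical, not part of eHealth or SPECTRUM.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Outage:
    """One service outage, as minute offsets from the start of the period."""
    start_minute: float
    end_minute: float

    @property
    def duration(self) -> float:
        return self.end_minute - self.start_minute


def availability_percent(outages: Sequence[Outage], period_minutes: float) -> float:
    """Percentage of the period during which the service was up."""
    downtime = sum(o.duration for o in outages)
    return 100.0 * (period_minutes - downtime) / period_minutes


def mttr_minutes(outages: Sequence[Outage]) -> float:
    """Mean Time to Repair: average outage duration (0 if there were none)."""
    if not outages:
        return 0.0
    return sum(o.duration for o in outages) / len(outages)


def meets_sla(outages: Sequence[Outage], period_minutes: float,
              target_percent: float) -> bool:
    """True when measured availability is at or above the SLA target."""
    return availability_percent(outages, period_minutes) >= target_percent


if __name__ == "__main__":
    month = 30 * 24 * 60  # a 30-day reporting period, in minutes
    outages = [Outage(1_000, 1_020), Outage(20_000, 20_010)]
    print(f"Availability: {availability_percent(outages, month):.3f}%")
    print(f"MTTR: {mttr_minutes(outages):.1f} minutes")
    print(f"Meets 99.9% SLA: {meets_sla(outages, month, 99.9)}")
```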
Government
CA has a strong presence in the world's national and local governments. CA's Network and Voice Management solution meets the IT operations automation needs of civilian and defense agencies with practical, achievable solutions that deliver rapid time-to-value and assure mission success. Deployed on trucks or other mobile vehicles, the solution helps to ensure the performance and reliability of IP, satellite, radio, microwave, and telephony communications between front lines and commanders in disparate operational posts. CA brings unique capabilities in support of network-centric warfare initiatives and information-age transformation.
Enterprise
CA's Network and Voice Management solution has a strong presence in the automotive, education, energy, entertainment, consumer, financial services, healthcare, hospitality, insurance, manufacturing, pharmaceutical, retail, semiconductor, technology, and transportation vertical markets. In all of these industries, downtime has a high impact.

For today's converged voice and data networks, the key components of a network and voice management solution must be able to manage the end-to-end data infrastructure as well as the voice messaging systems and services. To manage this range of infrastructure, the solution must offer network performance and fault management; voice management for the PBX, messaging, and other voice services; and system management so that network operations staff can manage the critical systems under their responsibility.
The Components of the Solution

CA's solution for this market comprises three primary product lines, all of which provide a consistent set of capabilities across the voice and data infrastructure:
eHealth assesses the health of your network and determines if it can accommodate voice. It tells you how well the network is performing, allows you to compare voice MOS with QoS statistics, and identifies trends in performance.
SPECTRUM provides fault management, RCA, and voice modeling. It also enables you to model the components of your voice services to ensure that services are operating smoothly.
eHealth for Voice offers system performance management for communication systems (IP and traditional TDM), messaging systems, and management from a telephony perspective.

These applications can work as standalone systems or integrated, as shown below.
eHealth
eHealth helps you take control of network performance and ensure QoS across the entire network infrastructure. It enables you to accomplish a multitude of tasks such as ensuring the availability and performance of the network, documenting service levels, managing capacity, and accurately planning for growth. This solution helps you address a number of challenges, including managing a diverse collection of devices from numerous vendors, isolating the source of performance degradation throughout the network, minimizing recurring wide area network (WAN) expenses, and providing consistent reporting across your heterogeneous network infrastructure.

This component of the solution enables you to achieve the following goals:
Improve the service availability of your network.
Reduce the cost of downtime and the end-user impact caused by downtime.
Identify and resolve problems faster.
Plan for capacity before it is needed.
Improve QoS.
Meet and prove committed service levels.
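To make "plan for capacity before it is needed" concrete, the sketch below fits a straight line to recent daily utilization samples and projects how many days remain before a threshold is crossed. It is a minimal illustration that assumes evenly spaced daily samples and a linear trend; it is not the algorithm eHealth uses, and the function name is hypothetical.

```python
from typing import Optional, Sequence


def days_until_threshold(daily_utilization: Sequence[float],
                         threshold: float) -> Optional[float]:
    """Project days until utilization reaches `threshold`.

    Fits a least-squares line through one sample per day. Returns 0.0 if
    the latest sample already meets the threshold, and None if the trend
    is flat or falling (no crossing is projected).
    """
    n = len(daily_utilization)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    if daily_utilization[-1] >= threshold:
        return 0.0

    # Ordinary least squares with x = 0, 1, ..., n-1 (day index).
    mean_x = (n - 1) / 2
    mean_y = sum(daily_utilization) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in enumerate(daily_utilization))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    if slope <= 0:
        return None
    crossing_day = (threshold - intercept) / slope
    return max(0.0, crossing_day - (n - 1))


if __name__ == "__main__":
    wan_link_percent = [50, 52, 54, 56, 58]  # assumed example samples
    remaining = days_until_threshold(wan_link_percent, threshold=80)
    print(f"Projected days until 80% utilization: {remaining}")
```

A flat or falling trend returns `None`, which a report could surface as a candidate for downgrading an underutilized circuit.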
eHealth Components
eHealth includes the following:
eHealth E2E Console
eHealth Live Health
eHealth Traffic Accountant
Report Center
Distributed eHealth
eHealth SPECTRUM Integration
eHealth Universal Workflow Integration Modules (HP OpenView, IBM (Micromuse) Netcool, Cisco CIC)
eHealth Universal Data Integration Modules (Cisco WAN Manager, Cisco IP Solution Center, Lucent, Nortel, Alcatel)
eHealth Universal Wireless Integration Modules (Nortel, Starent)
The Benefits of eHealth

eHealth can be differentiated from other solutions in the following ways:
eHealth offers best-in-class proactive management so that IT can correct problems before they become revenue-impacting issues.
It has the broadest multi-vendor support: over 1000 devices from 100 different vendors.
Its reports have built-in intelligence to troubleshoot without requiring intimate knowledge of every component of the service.
It provides auto-baselining; that is, eHealth learns the normal behavior of each managed device. The deviation-from-normal algorithm offers a more reliable threshold relative to history because the window of comparison is continuous.

Unlike CA, many vendors have a limited portfolio of device and technology support, which poses a problem to companies trying to reduce the number of management software vendors. Customers want greater accountability from their vendors as IT becomes more of a service. Having fewer vendors results in less complexity, better operating costs, and less risk in rolling out new business initiatives. The value of an integrated management platform is significant because IT operations spend a large amount of time and money identifying the cause of problems, and downtime is extremely expensive.
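The idea of a baseline learned over a continuous window can be illustrated with a small sketch. The code below keeps a sliding window of recent samples for one metric and flags a new sample when it lies more than a chosen number of standard deviations from the window mean. This is an assumption-laden illustration of the general technique, not eHealth's actual deviation-from-normal algorithm; the class name, window size, tolerance, and minimum spread are all hypothetical.

```python
from collections import deque
from statistics import mean, pstdev


class RollingBaseline:
    """Learns 'normal' for one metric from a sliding window of samples."""

    def __init__(self, window: int = 24, tolerance: float = 3.0,
                 min_spread: float = 1.0) -> None:
        self.samples = deque(maxlen=window)  # oldest samples fall off automatically
        self.tolerance = tolerance           # allowed deviation, in standard deviations
        self.min_spread = min_spread         # floor so a perfectly flat history is not hypersensitive

    def observe(self, value: float) -> bool:
        """Return True if `value` deviates from the learned baseline.

        The sample is then added to the window, so the baseline keeps
        moving with the data instead of being a fixed threshold.
        """
        deviates = False
        if len(self.samples) >= 2:
            center = mean(self.samples)
            spread = max(pstdev(self.samples), self.min_spread)
            deviates = abs(value - center) > self.tolerance * spread
        self.samples.append(value)
        return deviates


if __name__ == "__main__":
    baseline = RollingBaseline(window=5)
    for latency_ms in [10, 11, 9, 10, 10, 50]:  # assumed example samples
        flag = baseline.observe(latency_ms)
        print(f"{latency_ms} ms -> {'DEVIATION' if flag else 'normal'}")
```

Because the window slides, a sustained shift in behavior eventually becomes the new normal, which is the trade-off a continuous baseline makes against a static threshold.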
SPECTRUM
SPECTRUM is able to manage networks across the hall or around the world. It delivers granular Layer 2 and Layer 3 visibility down to the individual port and circuit for local area network (LAN), WAN, wired, wireless, physical, and virtual networks. It offers specific management applications that drill down to monitor and analyze ATM, Frame Relay, IP Multicast, QoS, voice, and VPN technologies. Its patented root cause and impact analysis technology pinpoints exact locations of degraded or failed devices.

This component of the solution enables you to achieve the following goals:
Manage the service availability of your network.
Identify the root cause of network failure.
See the relationships and impact of IT infrastructure on business services.
Review the assigned operations personnel and status.
Confirm the infrastructure's operational integrity by tracking infrastructure changes.
SPECTRUM Components
SPECTRUM includes the following:
SPECTRUM Infinity
SPECTRUM Integrity
SPECTRUM Xsight
SPECTRUM OneClick
SPECTRUM Service Manager
SPECTRUM Report Manager
SPECTRUM Alarm Notification Manager
SPECTRUM ATM Circuit Manager
SPECTRUM Configuration Manager
SPECTRUM Secure Domain Manager
SPECTRUM Frame Relay Manager
SPECTRUM Microsoft Operations Manager Connector
SPECTRUM Multicast Manager
SPECTRUM OSS Integrations
SPECTRUM QoS Manager
SPECTRUM Remedy ARS Gateway
SPECTRUM SNMPv3 Support
SPECTRUM VPN Manager
SPECTRUM Watch Editor
SPECTRUM Service Performance Manager
The Benefits of SPECTRUM

SPECTRUM can be differentiated from other solutions in the following ways: SPECTRUM provides best-in-class RCA and access to the SPECTRUM solution knowledge base: Less time is devoted to identifying the root cause of IT issues. Less time is devoted to investigating false alarms and alerts. Accountability issues and blame-avoidance are eliminated. IT can still take advantage of advanced RCA with rules-based Event Management System (EMS) or policy-based condition correlation (not model-based Inductive Modeling Technology (IMT)). It predicts and prevents problems out-of-box by offering intelligent thresholds; other solutions only identify a problem after it has happened.
Integration between eHealth and SPECTRUM

eHealth SPECTRUM integration combines the best in network fault and performance management to offer a complete solution ensuring high availability and responsiveness of your mission-critical data networks. It enables customers to achieve the following: Integrate Live Health alarms and Health report exception alarms into SPECTRUM, which are then presented in the SPECTRUM OneClick interface. Provide context-sensitive launching from SPECTRUM alarm and topology views to eHealth reports. Enable SPECTRUM to provide files for eHealth discovery.
eHealth for Voice

eHealth for Voice provides comprehensive management of voice services across the network including PBX, IP telephony, voice messaging, unified messaging, and bandwidth utilization. This component of the solution enables you to achieve the following goals: Consolidate information from various systems and network components, whether legacy or IP-based, to provide a comprehensive view of voice system and network performance. Automate the data collection or polling process from legacy devices via modem, IP, or secure access systems. Identify IP telephony infrastructure components. Manage QoS policies for voice traffic. Monitor voice gateways, digital signal processors (DSPs), and peer-to-peer links. Estimate MOS between router endpoints. Obtain end-to-end visibility of voice quality, jitter, and delay. eHealth for Voice provides you with the ability to obtain a global view of voice systems, networks, and services for effective service level management, and drill down to the details to ensure adequate capacity, RCA, and proactive service assurance.
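This document does not describe how eHealth for Voice estimates MOS. As an illustration only, one widely used approach maps an E-model transmission rating (R) to an estimated MOS using the conversion given in ITU-T G.107. The sketch below assumes R has already been derived from measured delay, jitter, and loss; the function name is hypothetical and is not a CA API.

```python
def r_factor_to_mos(r: float) -> float:
    """Convert an E-model rating R to an estimated MOS (ITU-T G.107 mapping).

    Assumption: R was computed elsewhere from delay, jitter, and packet loss.
    """
    if r <= 0:
        return 1.0   # lower bound of the MOS scale
    if r >= 100:
        return 4.5   # upper bound used by the G.107 mapping
    return 1.0 + 0.035 * r + r * (r - 60.0) * (100.0 - r) * 7e-6


if __name__ == "__main__":
    for rating in (50, 70, 80, 93):
        print(rating, round(r_factor_to_mos(rating), 2))
```

A rating near 80 maps to a MOS of roughly 4.0, which is commonly treated as the boundary of toll-quality voice.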
The Benefits of eHealth for Voice

eHealth for Voice offers the following benefits: Enables you to manage your business processes, including voice and multimedia services, consistent with your competitive strategy. Enables you to manage your transition from traditional to IP telephony. Provides end-to-end fault and performance management for IP telephony networks. Manages voice and multimedia resources to reduce risk, manage costs, enable new services, and align your investments with your IT objectives.
CA Technology Services Network and Voice Management Service Offerings

CA Technology Services has specialists in eHealth and SPECTRUM to help organizations assess, design, implement, and optimize network and voice availability and performance solutions across the enterprise. From financial services companies to telecommunications companies to government organizations and beyond, CA experts help you establish best-practice workflows, integrate your network management solutions, and combine your network and voice availability and performance solutions with your service desk for a consolidated network event management system. The focus of CA network and voice availability and performance experts is to help you achieve the following: Improve business alignment by mapping the network infrastructure to critical IT services that support the business, and ensuring that your network team is focused on delivering the organization's most important services. Increase business planning capabilities by delivering full visibility across the network infrastructure through consolidated consoles, reports, and metrics analysis. Reduce risk by defining and implementing automatic repair responses that avoid the possibility of human error and guarantee problem repair consistency. Reduce cost by consolidating network event management into a central point of control, which decreases staffing demands. Leverage value from existing network management systems by integrating and building upon your prevailing tools and workflows. Optimize IT service delivery by applying International Organization for Standardization (ISO) and CobiT standards, IT Infrastructure Library (ITIL) best practices, and proven network management processes.
LIFECYCLE APPROACH FOR NETWORK AND VOICE AVAILABILITY AND PERFORMANCE SOLUTIONS
The needs of every organization are unique, but network management yields common themes in workflow processes, monitoring, and management instrumentation. A CA solution is deployed and optimized to the particular needs of your organization through a lifecycle of best-practice service offerings.
Assessment: Understanding the Gaps

Comprehensive assessments validate the maturity and efficiency of network and voice availability and performance management. CA experts conduct a comprehensive analysis of your network management capabilities including the following: network management goals, objectives, capabilities, and strategies; network operations organization structure, and personnel roles and responsibilities; network monitoring, configuration, and integration software; network design and topology; voice traffic simulation; data analysis; associated security constraints (firewalls, access lists, and so on); alarm/event severity definitions; existing business, technical, and environmental challenges and issues; change control processes; and CA and third-party product integration requirements. Your current management capabilities are compared to the CA maturity model for people, processes, and technology, and the assessment results in a Solution Architecture Overview (SAO). The SAO is a blueprint that defines achievable solution phases to maximize problem determination and response workflows, apply automation, and integrate service desk operations. CA consultants and architects also research and map the network infrastructure to IT services, propose recommendations, and furnish business justifications to help you secure funding.
CA Maturity Models
Industry best practices have been aligned across four distinct phases of IT maturity. CA Maturity Models are designed to accomplish the following: Determine the current state of the IT management processes. Assess the maturity of IT capabilities. Identify the weakest processes and return on investment (ROI) from improving and automating those processes.
Design: Building the Right Solution

CA architects design successful network and voice availability and performance solutions for customers that range in size from a single-location business to global IT operations demanding 24x7 availability and high-speed performance. Value is delivered through the integration of the service desk with networks that support applications, databases, systems, storage, and security. Architects work with customers to review the assessment or document the as-is environment, conduct interviews to ensure that business and IT management goals are addressed, correlate the features and functions of eHealth and SPECTRUM to your business and IT requirements, and identify hardware and customization requirements. As part of this process, the architect develops a comprehensive design and implementation plan or Solution Architecture Specification (SAS). CA architects must be certified to create a design. The architect qualification process is an intensive two-year program that requires several industry certifications, IT management technology and CA product training, and CA architect governance board approval. We require our architects to re-certify at least every two years. Customers who have in-house architects and are managing their own design planning, but may require supplemental assistance, can take advantage of the Enterprise Systems Management Solution Expert Package, which provides customized short-term design guidance or review within a specifically defined scope.
Implementation: The Bottom Line of Solution Success

Using the SAS as a guide, CA consultants prepare the environment; install, configure, and customize eHealth and SPECTRUM; verify and document your eHealth and SPECTRUM solutions on test, QA, and production environments; and provide knowledge transfer to your staff. Implementation services also include the development and deployment of integration components between eHealth and SPECTRUM, your other IT management applications, and your service desk. To ensure that implementation efforts are tightly managed, PMP-certified Project Managers track and report on progress, questions, issues, and roadblocks. CA Technology Services uses PMP-certified Project Managers and highly trained architects, consultants, and partners. On an annual basis, CA Technology Services invests 50% more in training our professionals than the industry average.
Optimization: Anticipating Change

Optimization services evaluate ways in which your existing eHealth and SPECTRUM solutions can be further utilized or fine-tuned. Healthcheck services can include tuning and reconfiguration, upgrades, and migrations. Other services include training and certifications that focus on increasing staff efficiency. Past experience has found that staff training results in more efficient operations. These services are offered as onsite or offsite instructor-led, self-paced, or web-based. Instructors or course developers are also certified experts and dedicated to network and voice availability and performance.
Why Trust Your Service Availability to CA Technology Services?

Experience: CA has 30 years of enterprise systems management services experience. Proven Process: A dedicated assessment team plans, designs, and provides business justification for network and workflow recommendations and builds best practices into every customer blueprint. Expertise: A vibrant community of worldwide professionals focused on network, voice availability, and performance shares their solutions knowledge and continually contributes proven best-practice workflows and solution models. Focus: CA Technology Services is comprised of a team of Solution Managers and dedicated architects who are devoted exclusively to the assessment, design, delivery, and workflow methodologies offered around eHealth and SPECTRUM services and solutions.
How the Solution Delivers the Key Points of Value

Effective Service Level Management
The SPECTRUM Service Manager conveys the status of the business critical services, in a non-technical manner, to both operations and line-of-business managers. Information on the status of services can be organized in a variety of ways: by service, department (for internal customers), or customer (for service level management communication to external customers).
Proactive Service Assurance

CA's Network and Voice Management products are used together for proactive service assurance through the embedded algorithms within all of the product lines. This helps operations teams identify potential problems before they impact customer service. Within eHealth, this is accomplished primarily through the Time over Threshold and Deviation from Normal algorithms within Live Health, which allow an intelligent performance-based alert to be sent when current performance violates either a fixed threshold, or what is considered normal behavior (based on past history), for a particular length of time within a given analysis window. Similarly, eHealth for Voice sends alerts when violations of quality of service (QoS) or grade of service (GoS) are experienced. These alerts are fed into SPECTRUM, which applies its intelligence on policy, models, and rules to identify the severity of the problem and to provide alarm integration and correlation, taking advantage of the SPECTRUM Service Management and voice modeling capability.
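The Time over Threshold idea described above can be sketched as follows. This is a minimal illustration under stated assumptions, not Live Health's actual implementation: samples arrive at a fixed polling interval, the threshold is a fixed value, and the function and parameter names are hypothetical.

```python
def time_over_threshold(samples, threshold, poll_interval, min_time_over, window):
    """Return True when values exceeded `threshold` for at least
    `min_time_over` seconds within the trailing `window` seconds.

    `samples` is a time-ordered list of (timestamp_seconds, value) pairs,
    assumed to be collected every `poll_interval` seconds.
    """
    if not samples:
        return False
    latest = samples[-1][0]
    # Each violating sample inside the window contributes one poll interval.
    seconds_over = sum(
        poll_interval
        for timestamp, value in samples
        if latest - timestamp < window and value > threshold
    )
    return seconds_over >= min_time_over


if __name__ == "__main__":
    # Utilization (%) polled every 5 minutes.
    polls = [(0, 50), (300, 95), (600, 96), (900, 40), (1200, 97)]
    # Alert if utilization was above 90% for 15 minutes of the last hour.
    print(time_over_threshold(polls, 90, 300, 900, 3600))
```

Requiring accumulated time over the threshold, rather than a single violating sample, suppresses alerts on brief spikes.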
Rapid Problem Resolution

The network and voice management products work together to help network operations teams rapidly identify and resolve problems relating to voice and data infrastructures. Operations teams are notified of a potential problem first through SPECTRUM Service Manager. It displays business-critical services and changes their color according to policies established within SPECTRUM. Operations team members can quickly identify problem devices in the infrastructure within the topology map, and drill down to details on alarms (including those sent by either eHealth or eHealth for Voice) in the Alarm Detail report to understand the specific issue. They can also access probable cause information within SPECTRUM, and can then launch either eHealth or eHealth for Voice to obtain additional historical performance information and identify the conditions that the device was experiencing prior to the problem.
Predictive Capacity Planning

Network operations teams use intelligent algorithms within eHealth and eHealth for Voice to identify where and when additional capacity is needed to support committed goals for QoS or GoS. The products are also used to solve the opposite problem: to determine the areas in which the organization can save money by reducing capacity without negatively impacting the business.
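The document does not specify which algorithms eHealth uses for capacity planning. As a simple illustration of the "when is more capacity needed" question, the sketch below fits a least-squares trend line to daily utilization and projects when it will reach a planning threshold. The function name and the linear-trend assumption are illustrative only.

```python
def days_until_threshold(daily_utilization, threshold):
    """Project how many days after the last sample a linear trend reaches
    `threshold`. Returns None when utilization is flat or declining.

    Assumption: one utilization value (%) per day, oldest first.
    """
    n = len(daily_utilization)
    if n < 2:
        raise ValueError("need at least two daily samples")
    x_mean = (n - 1) / 2
    y_mean = sum(daily_utilization) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean)
              for x, y in enumerate(daily_utilization))
    slope = sxy / sxx
    if slope <= 0:
        return None  # no growth: candidate for capacity reduction review
    intercept = y_mean - slope * x_mean
    crossing_day = (threshold - intercept) / slope
    return max(0.0, crossing_day - (n - 1))


if __name__ == "__main__":
    print(days_until_threshold([50, 52, 54, 56, 58], 80))
```

A `None` result flags links with no growth trend, which corresponds to the opposite problem described above: places where capacity might be reduced.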
Chapter 4: Deployment Architecture for Network and Voice Management

This chapter provides information to prepare for the installation and configuration of CA's Network and Voice Management solution. The following key topics are presented: network performance components, network fault management components, voice management components, deployment architectures, and sizing recommendations.
Network Performance Components

eHealth is comprised of the following components. Required components: E2E Console. Optional components: Live Health, Integration Modules, Distributed eHealth, Remote Poller, Report Center, and Traffic Accountant.
E2E Console
The E2E Console is the core of an eHealth implementation and is required to operate eHealth. The E2E Console includes database, discovery, and poller functionality along with administration GUIs, reporting GUIs, and so on. eHealth licenses (universal and system) enable the eHealth Console to poll and collect data from certified devices with an embedded management software agent, and are required to operate eHealth. An element represents the eHealth model, or representation, for any part of an infrastructure that eHealth can analyze. eHealth can analyze a physical element, such as a specific port on a specific card of a specific router. It can also analyze a logical element, which refers to the logical purpose for a device or component, such as a network link. To determine if a device is certified for use with eHealth, log on to the Certification pages at http://support.concord.com. Note: You must have a Support account to access the http://support.concord.com site. You obtain an account with the purchase of the eHealth or SPECTRUM products.
Live Health
Live Health is the real-time performance monitoring engine that analyzes performance data collected with eHealth for deviations from normal behavior and threshold violations. Live Health includes three components: Live Exceptions gives you the ability to generate and display performance-based alarms. Live Status provides a single end-to-end view of the status of your infrastructure. Live Trend provides a real-time reporting capability.
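To illustrate what a "deviation from normal" check can look like, the sketch below compares a current reading with a baseline learned from history. It is a simplified stand-in under stated assumptions (baseline is the mean of past readings for a comparable time period, tolerance is a multiple of the standard deviation) and is not Live Health's actual algorithm; the function name is hypothetical.

```python
from statistics import mean, pstdev


def deviates_from_normal(history, current, tolerance=2.0):
    """Return True when `current` differs from the historical baseline by
    more than `tolerance` standard deviations.

    Assumption: `history` holds past values for a comparable period,
    for example the same hour of the week.
    """
    baseline = mean(history)
    spread = pstdev(history)
    return abs(current - baseline) > tolerance * spread


if __name__ == "__main__":
    past_mondays_9am = [40, 42, 38, 41, 39]  # utilization (%)
    print(deviates_from_normal(past_mondays_9am, 45))
    print(deviates_from_normal(past_mondays_9am, 41))
```

Because the baseline is derived from each element's own history, the same check adapts to a busy core link and a quiet branch link without per-element tuning.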
Integration Modules
eHealth provides a set of integration modules (IMs) that enable you to use eHealth to report on the data that various network management systems (NMSs) collect. When you license and install an IM, you can use it to tap into data already collected by a present NMS. You can then import it en masse into eHealth, providing critical data quickly and eliminating the need for redundant data gathering via duplicate polling. Universal Workflow IMs enable customers to drill back from supported fault management systems to eHealth. This type of IM is supported on the following systems: SPECTRUM, IBM (Micromuse) Netcool, Cisco Information Center (CIC), and HP OpenView Network Node Manager. Universal Data IMs enable the import of configuration and performance data. This type of IM is supported on the following systems: Cisco WAN Manager, Cisco ISC, Lucent, Alcatel, and Nortel. Universal Wireless Data IMs enable import of configuration and performance data from other wireless element management systems into the eHealth E2E Console. This type of IM is supported on the following systems: Nortel Shasta SCS GGSN and Starent ST-16 Bulk Stats.
Distributed eHealth
If you have a large infrastructure, you could deploy multiple Distributed eHealth Systems across different physical locations or alternatively co-locate them in a central configuration referred to as a cluster. The cluster contains several eHealth systems that manage specific sets of resources, and share the information with each other. By using Distributed eHealth, you can distribute the workload of collecting and processing data across multiple eHealth systems that work in parallel. Report users can access reports for any element or groups in the cluster from Distributed eHealth Consoles, which are reporting front-ends to the cluster. You would typically choose a Distributed eHealth site when you want to run reports for more elements than a standalone eHealth system can support. You might also choose a Distributed eHealth site if you want to place an eHealth web server system outside the firewall and insulate the Distributed eHealth Systems within the firewall of your infrastructure. Depending on the number of Distributed eHealth Systems that you have, and the system performance of the Distributed eHealth Console, a Distributed eHealth site could support reports for up to one million elements. The Distributed eHealth Package software contains all software for Distributed eHealth Consoles and the software required to turn a standalone eHealth System into a Distributed eHealth System. You must purchase all console software, elements, and agents for the standalone eHealth systems separately. For complete instructions on administering a cluster, see the Distributed eHealth Administration Guide.
Remote Poller
If you are administering a large-scale or wide-area environment, one eHealth system may not be sufficient to monitor all of your resources. With remote polling, you install eHealth on remote systems (called remote sites) and configure each site to poll a set of elements. The database at each remote eHealth site contains the data for the elements that it is polling, and you can use eHealth at each site to manage those elements. A central eHealth system then retrieves element information and performance data periodically from the remote eHealth systems and merges the data into one central eHealth database. From this central database, you can run reports for all elements. You would typically choose a remote polling site when it is critical that the polling systems do not know about each other or share information. For example, the remote sites could each manage separate customers of a service provider, or separate business units within a large organization. These remote poller sites could be geographically dispersed sites, as well as local sites separated by firewalls or other barriers over which polling could be delayed or impossible. For complete instructions on administering a remote poller system, see the Using the eHealth Remote Poller Guide.
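The periodic merge from remote sites into a central database can be pictured with the sketch below. The data structures are hypothetical (plain dictionaries rather than eHealth databases), and samples are assumed to be time-ordered; the point is only that the central system keys data by site and element and appends samples it has not yet seen.

```python
def merge_remote_sites(central, remote_sites):
    """Merge per-site element data into `central` and return it.

    central:      {(site, element): [(timestamp, value), ...]}
    remote_sites: {site: {element: [(timestamp, value), ...]}}
    Keying by (site, element) keeps sites isolated even when two sites
    use the same element name.
    """
    for site, site_db in remote_sites.items():
        for element, samples in site_db.items():
            known = central.setdefault((site, element), [])
            last_seen = known[-1][0] if known else float("-inf")
            # Append only samples newer than what the central copy holds.
            known.extend([s for s in samples if s[0] > last_seen])
    return central


if __name__ == "__main__":
    central_db = {}
    sites = {
        "siteA": {"router1-eth0": [(0, 10), (300, 12)]},
        "siteB": {"router1-eth0": [(0, 99)]},
    }
    merge_remote_sites(central_db, sites)
    print(sorted(central_db))
```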
Report Center
eHealth Report Center is an optional reporting application available with eHealth Release 6.0 and later. It offers an alternative to the eHealth Report Developer Language (RDL), which is used to customize the standard eHealth reports. Report Center allows users to create and customize entirely new types of eHealth reports. These reports can answer different types of questions about the performance of network, system, and application resources. Report Center offers a large amount of flexibility with customization, and has many capabilities that allow users to manipulate the appearance of reports, and how existing eHealth data is represented. It offers a web-based, Windows folder-style interface which users can change based on their preferences. This intuitive interface makes it easy to quickly identify, view, and run reports. Report Center provides valuable sample reports that users can run to view the performance of their resources, or use as templates when creating new reports.
Traffic Accountant
Most enterprises with Internet and e-business strategies need to ensure that their network resources are well-matched to the needs of users and the demands of applications. It is important to know who is using bandwidth and the applications that are being used. Traffic Accountant lets you monitor network traffic with industry-standard RMON2 probes and NetFlow-enabled Cisco routers. Traffic Accountant provides sophisticated grouping and sorting capabilities to create concise, easy-to-understand reports. You benefit by getting insight on trends and usage patterns that impact the performance of your network. Traffic Accountant requires a dedicated eHealth system; if you plan to monitor RMON2 traffic as well as the standard network performance traffic, you need two eHealth systems to ensure optimal polling and reporting performance.
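The kind of grouping and sorting that answers "who is using bandwidth, and with which applications" can be sketched as below. The flow record layout (`src`, `app`, `bytes`) is an assumption made for illustration and does not reflect Traffic Accountant's data model.

```python
from collections import Counter


def top_talkers(flows, key_fields=("src", "app"), limit=3):
    """Sum bytes per group and return the largest groups first.

    `flows` is an iterable of dicts; each must contain the fields named in
    `key_fields` plus a numeric "bytes" field (hypothetical record layout).
    """
    totals = Counter()
    for flow in flows:
        group = tuple(flow[field] for field in key_fields)
        totals[group] += flow["bytes"]
    return totals.most_common(limit)


if __name__ == "__main__":
    records = [
        {"src": "10.0.0.1", "app": "http", "bytes": 500},
        {"src": "10.0.0.2", "app": "smtp", "bytes": 200},
        {"src": "10.0.0.1", "app": "http", "bytes": 300},
    ]
    for group, total in top_talkers(records):
        print(group, total)
```

Changing `key_fields` to `("app",)` regroups the same records by application alone.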
Network Fault Management Components

SPECTRUM is comprised of the following components. Required components: Assurance Server (Assurance Server Xsight, Assurance Server Integrity, or Assurance Server Infinity) and OneClick. Optional components: Watch Editor, SNMP V3, Alarm Notification Manager, Secure Domain Manager, Frame Relay Manager, Configuration Manager, VPN Manager, ATM Circuit Manager, Report Manager, Multicast Manager, Service Performance Manager, QoS Manager, and Service Manager.
Assurance Server
SPECTRUM offers three types of Assurance Servers designed for different types of customers: Assurance Server Xsight (for emerging enterprises) Assurance Server Integrity (for larger enterprises) Assurance Server Infinity (for Service Providers)
ASSURANCE SERVER XSIGHT
The Assurance Server Xsight delivers the capabilities of core SPECTRUM technologies to a broader array of small businesses. With the introduction of SPECTRUM Xsight, CA extended the support of multi-vendor IP fault and performance management in a solution that is competitively priced and packaged to help you become operational quickly. This component provides support for most vendor devices found in today's enterprise networks. It supports single-server deployment only; it does not allow for a distributed deployment. The Assurance Server Xsight includes the following key features: root cause analysis; impact analysis; auto-discovery of multi-vendor and multi-technology networks; standards-based integrations; and one concurrent administrator license (fault-tolerant license not included).
ASSURANCE SERVER INTEGRITY
SPECTRUM's roots lie in serving large enterprise customers whose businesses change, merge, and scale rapidly. The SPECTRUM Integrity solution has transformed CA's patented technologies and combined them with new features and functionality to offer today's evolving enterprises the power to manage business-critical services across the hall or around the world. This component provides support for most vendor devices found in today's enterprise networks. The Assurance Server Integrity includes the following key features: root cause analysis; impact analysis; auto-discovery of multi-vendor and multi-technology networks; standards-based integrations; and one concurrent administrator license and a fault-tolerant license.
ASSURANCE SERVER INFINITY
SPECTRUM Infinity is specifically focused on the needs of today's service providers. It provides specific functionality with significant performance improvements dedicated to accelerating new service rollouts and exceeding customer quality expectations, while allowing them to manage a growing infrastructure with existing resources. This component provides Integrity device support and the Advanced Management Module pack that provides support for high-end devices typically found only in service provider networks. The Assurance Server Infinity includes the following key features: root cause analysis; impact analysis; auto-discovery of multi-vendor and multi-technology networks; standards-based integrations; and one concurrent administrator license, a fault-tolerant license, and two Southbound Gateway integration licenses.
OneClick
SPECTRUM OneClick is a three-tier, web-based console. The central component is a web server that connects directly to SPECTRUM Assurance Servers and delivers information to distributed Java clients. The feature-rich Java clients are downloaded, installed, and updated from the OneClick web server to ease implementation, administration, and maintenance. The SPECTRUM OneClick console combines the anywhere/anytime access and reduced training requirements of standard web-based applications with the scalability and responsiveness of a full desktop client application.
The OneClick console includes the following key features: Event Console; role-based, intuitive web interface; comprehensive topology maps; highly configurable alarm console; full suite of troubleshooting tools; management security; automated client install; and concurrent use licensing.
Watch Editor
SPECTRUM Watch Editor provides a simple way of monitoring key performance indicators across network, system, and application infrastructures. The watch editor is used to customize and augment monitoring. Adding additional polling, logging, and thresholding capabilities to SPECTRUM can automate notification, and launch scripts when performance changes or operates outside of normal baselines. Watches can monitor statistics, complex calculations, and/or values that should remain static. Watches can also be configured and applied simultaneously to thousands of devices. Customers often use SPECTRUM Watch Editor to monitor central processing unit (CPU), memory, storage, and bandwidth utilization across multiple vendors and technologies. These capabilities all work together to provide a cost-effective way to intelligently watch thresholds and proactively automate corrective actions without having to apply complex programming techniques. The Watch Editor includes the following key features: Ability to easily create proactive thresholds for key performance indicators for network, system, and application infrastructures Ability to automate corrective action Ability to enable watches based on timers, polling, statistics, and calculated values
Alarm Notification Manager

The SPECTRUM Alarm Notification Manager (SANM) enhances SPECTRUM's alarm processing capabilities. SANM's Policy Administrator enables you to specify the types of alarms that you want to receive and to filter out the alarms that you consider unimportant. SANM's point-and-click user interface can also control alarm behavior based on hourly, daily, weekly, or monthly scheduling parameters. Large enterprises and service providers routinely leverage SANM for lights-out Network Operations Centers. Specific IT staff members, based on their working hours, can be notified of problems by SPECTRUM wherever they are, and have root-cause information at their fingertips. SANM delivers business alarms in business terms to executives, while delivering technical alarms in technical terms to IT staff. This component provides a cost-effective way to manage your alarm notification policies for your SPECTRUM environment.
The Alarm Notification Manager includes the following key features: alarm consolidation, alarm filtering, policy-based alarm forwarding, and alarm notification.
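Policy-based filtering combined with schedule parameters can be sketched as follows. The policy fields, the numeric severity scale, and the function names are hypothetical and chosen only for illustration; they are not SANM's policy format. The sketch does not handle time windows that cross midnight.

```python
from dataclasses import dataclass
from datetime import datetime, time


@dataclass
class NotificationPolicy:
    recipient: str
    min_severity: int      # assumed scale: higher number = more severe
    weekdays: frozenset    # 0 = Monday ... 6 = Sunday
    start: time            # window start (inclusive)
    end: time              # window end (exclusive)

    def matches(self, severity: int, when: datetime) -> bool:
        return (severity >= self.min_severity
                and when.weekday() in self.weekdays
                and self.start <= when.time() < self.end)


def recipients_for(severity, when, policies):
    """Return the recipients whose policy accepts this alarm at this time."""
    return [p.recipient for p in policies if p.matches(severity, when)]


if __name__ == "__main__":
    policies = [
        NotificationPolicy("day-shift", 2, frozenset(range(5)),
                           time(8, 0), time(18, 0)),
        NotificationPolicy("on-call", 3, frozenset(range(7)),
                           time(0, 0), time(23, 59, 59)),
    ]
    monday_morning = datetime(2024, 1, 1, 9, 0)
    print(recipients_for(3, monday_morning, policies))
```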
Frame Relay Manager

SPECTRUM Frame Relay Manager delivers precise monitoring and performance thresholding of committed information rates (CIR), bandwidth utilization, and circuit congestion. Root cause analysis and fault isolation are provided per data link connection identifier (DLCI), with impact analysis to prioritize response and corrective action. Patented intelligent autodiscovery techniques leverage remote IP address information and traffic statistics to map DLCI connectivity and present an integrated topology view. Several large enterprises use SPECTRUM Frame Relay Manager to document SLA violations with their service providers. SPECTRUM's Frame Relay Manager can also determine if an enterprise has purchased too much or too little bandwidth on a per-circuit basis. This results in cost savings of thousands to tens of thousands of dollars per month in WAN connectivity charges. Service providers have used SPECTRUM Frame Relay Manager to ensure SLA compliance, improve customer service quality, and deliver differentiated service offerings. Using this component, one service provider is able to identify Frame Relay problems in 97 to 99% of cases before its customers do and is working to fix them before the customer's business is impacted. This component provides a cost-effective way to improve service quality, deliver end-to-end visibility, and reduce operating costs. The Frame Relay Manager includes the following key features: proactive communication with Frame Relay equipment that supports RFC 1315 or RFC 2115 Frame Relay MIBs with vendor extensions for Cisco and Nortel; fast, accurate modeling of physical and logical DLCI port connectivity with IP address, subnet mask, and remote IP address information; and out-of-box performance views that show CIR throughput, congestion statistics, and data terminal equipment (DTE) changes.
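Deciding whether a circuit carries too much or too little purchased bandwidth comes down to comparing observed throughput with the CIR. The sketch below is one simple way to do that, assuming a 95th-percentile comparison and illustrative cut-off ratios; the function name and thresholds are hypothetical and are not taken from the product.

```python
import math


def classify_circuit(cir_bps, throughput_samples_bps, low=0.3, high=0.9):
    """Classify a circuit by its 95th-percentile throughput relative to CIR.

    Assumptions: `low` and `high` are illustrative ratios, and the samples
    cover a representative period (for example, one month of polls).
    """
    if not throughput_samples_bps:
        raise ValueError("no throughput samples")
    ordered = sorted(throughput_samples_bps)
    index = max(0, math.ceil(0.95 * len(ordered)) - 1)
    ratio = ordered[index] / cir_bps
    if ratio > high:
        return "undersized"   # sustained traffic near or above CIR
    if ratio < low:
        return "oversized"    # paying for bandwidth that is rarely used
    return "adequate"


if __name__ == "__main__":
    print(classify_circuit(1_000_000, [100_000] * 20))
```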
ATM Circuit Manager

SPECTRUM ATM Circuit Manager delivers precise monitoring and performance thresholding of ATM throughput, bandwidth utilization, and circuit congestion. Root cause analysis and fault isolation are provided per virtual path link (VPL)/virtual channel link (VCL), with impact analysis to prioritize response and corrective action. Patented intelligent autodiscovery techniques leverage remote IP address information and traffic statistics to map virtual path identifier (VPI)/virtual channel identifier (VCI) connectivity and present an integrated topology view. The ATM circuit path view displays the endpoint-to-endpoint mapping for each device, physical port, and logical interface traversed. Enterprises can also import a list of permanent virtual circuits (PVCs) provided by their service provider to accurately model all ATM WAN links. Several large enterprises already use SPECTRUM ATM Circuit Manager to document SLA violations with their service provider. In addition to this capability, SPECTRUM's ATM Circuit Manager can also determine if an enterprise has purchased too much or too little bandwidth on a per-circuit basis. This can result in cost savings of several thousands of dollars per month in WAN connectivity charges. Service providers have used SPECTRUM ATM Circuit Manager to ensure SLA compliance and deliver differentiated service offerings. This component provides a cost-effective way to improve service quality, deliver end-to-end visibility, and reduce operating costs. The ATM Circuit Manager includes the following key features: proactive communication with ATM equipment that supports RFC 1695 with private management information bases (MIBs); fast, accurate modeling of physical and logical VPL/VCL port connectivity with IP address, subnet mask, and remote IP address information; and out-of-box performance views showing cells-per-second throughput and ATM QoS information.
Multicast Manager
SPECTRUM Multicast Manager provides multi-vendor visibility into logical multicast network sessions proactively monitoring key performance indicators while highlighting the impact of infrastructure outages on multicast services. All logical multicast overlay services are automatically discovered and modeled within the SPECTRUM Assurance Server. Multicast session models maintain complete knowledge of the multicast feed including its source, distribution tree, and receivers. SPECTRUM Multicast Manager presents the user with an easy-to-use interface for topology navigation and alarm monitoring. This results in lower training and administration costs as users have at-a-glance access to actionable information. Multicast enhancements allow the user to view the per-group multicast topology and the associated routers, switches, and ports that comprise the IP multicast group. SPECTRUM Multicast Manager also monitors multicast group health. If a resource in a multicast group (source, routers, switches, ports) experiences a reliability problem, SPECTRUM Multicast Manager will automatically understand the impact on the overall group. This component provides a cost-effective way to manage your multicast infrastructure as a business service. The Multicast Manager includes the following key features: Multi-vendor view of IP services with a detailed understanding of the elements that comprise a multicast group An intuitive interface for multicast topology navigation and alarm monitoring of groups, sources, receivers, and rendezvous point (RP) devices
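The idea of understanding the impact of a failed resource on multicast groups can be illustrated with a simple membership lookup. The group-to-resource mapping below is a hypothetical data structure, not SPECTRUM's model; a real model would also distinguish sources, distribution-tree nodes, and receivers.

```python
def impacted_groups(groups, failed_resource):
    """Return the multicast groups that include `failed_resource`.

    `groups` maps a group address to the set of resources (source, routers,
    switches, ports) that make up that group. Hypothetical structure.
    """
    return sorted(group for group, resources in groups.items()
                  if failed_resource in resources)


if __name__ == "__main__":
    multicast_groups = {
        "239.1.1.1": {"src-A", "rtr-1", "rtr-2", "sw-3"},
        "239.1.1.2": {"src-B", "rtr-2", "sw-4"},
        "239.1.1.3": {"src-C", "rtr-5"},
    }
    print(impacted_groups(multicast_groups, "rtr-2"))
```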
QoS Manager
The SPECTRUM QoS Manager enables enterprises and service providers to verify and validate the configuration and effectiveness of QoS Policies and Traffic Classes throughout the IT infrastructure. Technology Relationship Mapping and web-based reporting discover and document the health and performance of each CoS configured across the network. Patented SPECTRUM analytics intelligently integrate and automate modeling of your QoS Policies and Traffic Classes to deliver RCA and impact prioritization.
SPECTRUM presents a unified view of your QoS-enabled infrastructure with an easy-to-use explorer/navigator that enables you to drill down to access device and port level information associated with a Traffic Class. Furthermore, SPECTRUM allows users to view each configured behavior and a complete set of statistics for policing, shaping, queuing, and random early detection (RED). Users have access to actionable alarm information when a particular class is experiencing a high rate of packet drops or buffer queuing. The QoS Manager includes the following key features:
Ability to drill down to access device and port level information associated with a particular Traffic Class
Out-of-box performance views of QoS traffic statistics such as policing, shaping, queuing, RED
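To make the idea of a per-class drop-rate alarm concrete, here is a minimal Python sketch that computes the drop rate for each Traffic Class from sent and dropped packet counters and reports the classes that exceed a limit. The ClassStats structure, the class names, and the 5% limit are assumptions chosen for illustration; the source does not state how SPECTRUM evaluates these statistics.

```python
from dataclasses import dataclass


@dataclass
class ClassStats:
    """Hypothetical per-class counters collected over one polling interval."""
    name: str
    packets_sent: int
    packets_dropped: int


def drop_rate(stats):
    """Fraction of offered packets that were dropped (0.0 when idle)."""
    total = stats.packets_sent + stats.packets_dropped
    return stats.packets_dropped / total if total else 0.0


def classes_over_threshold(all_stats, max_drop_rate=0.05):
    """Names of the classes whose drop rate exceeds max_drop_rate."""
    return [s.name for s in all_stats if drop_rate(s) > max_drop_rate]


samples = [
    ClassStats("voice", packets_sent=990, packets_dropped=10),
    ClassStats("bulk-data", packets_sent=800, packets_dropped=200),
]
print(classes_over_threshold(samples))
```

In practice the limit would differ per class, since a drop rate that is tolerable for bulk data is usually unacceptable for voice.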
VPN Manager
The SPECTRUM VPN Manager enables enterprises and service providers to automatically discover and manage the performance and reliability of site-to-site Layer 3 VPN tunnels. Proactive notification of potential problems assures that corrective actions can be taken before critical services and customers are impacted. SPECTRUM VPN Manager dynamically discovers physical and logical connectivity and verifies availability through virtual routing and forwarding (VRF) ping heartbeat monitoring. This component provides a cost-effective way to provide integrated fault and performance management of your VPN infrastructure. The VPN Manager includes the following key features:
Logical VPN topology
Proactive communication with VPN equipment that supports RFC 2547 BGP/MPLS VPN MIB with private vendor extensions
Fast, accurate modeling of physical and logical VPN connectivity
Out-of-box performance views of individual tunnel and aggregate VPN statistics
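Heartbeat monitoring generally means declaring a path unavailable only after several consecutive probes fail, so that a single lost ping does not raise an alarm. The Python sketch below shows that logic in isolation. The tunnel_state function and the limit of three missed heartbeats are assumptions for illustration; the source does not describe the rule SPECTRUM applies to VRF ping results.

```python
def tunnel_state(ping_results, max_missed=3):
    """Classify a tunnel from its heartbeat history.

    ping_results is a list of booleans, oldest first, where True means the
    VRF ping was answered. The tunnel is reported "down" only when the most
    recent max_missed heartbeats all failed.
    """
    recent = ping_results[-max_missed:]
    if len(recent) == max_missed and not any(recent):
        return "down"
    return "up"


print(tunnel_state([True, True, False, False, False]))
```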
SNMPv3
With SPECTRUM SNMPv3, you can obtain secure management communications. The solution provides a translation engine for SNMPv1 and SNMPv3 requests and responses. SNMPv3 technology supports communication with network, system, storage, database, application, and security infrastructure components; proxy management of legacy systems; and manager-to-manager communications. Customers can leverage the full range of SNMPv3 authentication and encryption capabilities to provide fully secured management traffic and safe configuration/control of IT infrastructure components. SPECTRUM SNMPv3 includes the following key features:
Security threat protection
Authentication
Privacy
Confirmation
High-capacity counters
Secure Domain Manager

Today's complex and distributed networks are being driven by security policies that inhibit the use of insecure management protocols such as Simple Network Management Protocol (SNMP) v1 or Internet Control Message Protocol (ICMP) for management of those networks. For example, a Demilitarized Zone (DMZ) separates a set of elements from the intranet through a firewall for security purposes. Still, the businesses, processes, and customers need to be supported by IT services. This requires visibility into the complete infrastructure. SPECTRUM Secure Domain Manager (SDM) enables customers to manage those domains by securely tunneling SNMP and ICMP traffic through a secure sockets layer (SSL) connection. Only a single port needs to be opened in the firewall, allowing for extended manageability without impacting the security policies in place. This solution is totally transparent to the end user and all client applications, eliminating the need to perform additional administrative tasks. Note: This feature is not available for the Assurance Server Xsight. The Secure Domain Manager includes the following key features:
Multiple secure domain connectors
SNMP and ICMP traffic forwarding
Securely tunneled traffic via XML/SSL over Transmission Control Protocol (TCP)
Transparency to users and client applications
Configuration Manager
Managing today's complex infrastructures involves maintaining hundreds or thousands of business-critical devices. Being able to keep track of how they are all configured and making sure that configurations are accurate can be overwhelming. SPECTRUM Configuration Manager is an intelligent, integrated application that automates management of critical device configurations to keep your business operational. SPECTRUM Configuration Manager provides the tools that you need to capture, modify, load, and verify configurations for thousands of multi-vendor devices. With its unique design, SPECTRUM Configuration Manager allows users to perform device administration on configuration files, MIB object identifiers (OIDs), and SNMP attributes. Each configuration is time-stamped and identified by the revision number. SPECTRUM-specific values such as polling interval, community name, or security string can be edited. SPECTRUM Configuration Manager can quickly load any stored configuration to single or multiple devices simultaneously, tracking all changes, scheduling automatic uploads during maintenance windows, or rolling back configurations to their last known good state. Automatically scheduled configuration comparisons deliver immediate notification of unauthorized changes. This component provides cost-effective configuration management to ensure business continuity.
The Configuration Manager includes the following key features:
Configuration capture
Configuration edit
Configuration load and restore
Configuration comparison and validation
Automated scheduling
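Configuration comparison can be pictured as a line-by-line diff between a stored baseline and a newly captured configuration. The following Python sketch uses the standard difflib module to show the idea. The sample configuration text and the config_changes helper are hypothetical; this is not how SPECTRUM Configuration Manager is implemented.

```python
import difflib


def config_changes(baseline, current):
    """Return unified-diff lines between two configuration texts.

    An empty list means the captured configuration matches the baseline.
    """
    return list(
        difflib.unified_diff(
            baseline.splitlines(),
            current.splitlines(),
            fromfile="baseline",
            tofile="current",
            lineterm="",
        )
    )


# Hypothetical device configurations captured at two points in time.
baseline = "hostname core-1\nsnmp-server community public RO\n"
current = "hostname core-1\nsnmp-server community private RW\n"

diff = config_changes(baseline, current)
if diff:
    print("Configuration changed:")
    print("\n".join(diff))
```

A scheduled job could run such a comparison after each capture and send a notification whenever the result is not empty.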
Report Manager
Simple to install and deploy, the SPECTRUM Report Manager leverages availability, performance, asset, and service quality data collected by SPECTRUM. Its reports are delivered via the web or in a variety of export formats, so that you can identify problems proactively and take corrective action before business-critical services and customers are affected. Capital investment efficiency is driven by SPECTRUM Report Manager's asset and availability analysis by vendor, product family, and type of device. Studies indicate that accurate asset inventory reports can deliver a 30% capital expense reduction during the first year and 5-10% in ongoing operational savings by understanding asset utilization and re-allocation opportunities. Availability reports identify how vendors and/or specific products improve or degrade service quality, ensuring that IT investment decisions are supported by factual information. SPECTRUM Report Manager enables your organization to assess the overall availability, performance, and service quality of the end-to-end IT infrastructure. The Report Manager includes the following key features:
Automated report creation
Automated report distribution via email
Export to Adobe PDF, Microsoft Excel, and Word format
One concurrent user license
Service Performance Manager

SPECTRUM Service Performance Manager (SPM) offers a multi-vendor, multi-agent approach to performance management, enabling existing response time management investments to be leveraged. Intelligent, automated discovery and configuration of performance test points within the infrastructure accelerate deployment and optimize workflow efficiency. Real-time troubleshooting capabilities enable operations staff to quickly isolate the cause of performance degradations, while also testing and verifying performance after corrective action has been taken. From the perspective of the end user, a slow application is a broken application. SPM enables customers to proactively measure performance and detect problems with the network, systems, or applications by leveraging response time measurement capabilities already deployed within their infrastructure. Once corrective actions have been taken, real-time tests can verify that repairs have been effective. SPM provides a simple method for automatic discovery of performance test hosts, notification of threshold violations, centralized troubleshooting, and collection of statistical performance data for reporting and further analysis. The result is better performance and increased user satisfaction.
The Service Performance Manager includes the following key features:
Real-time notification of threshold violations based on scheduled synthetic response time tests
Baseline and threshold analysis
Automatic discovery of response time agents
Existing investments leveraged by unlocking the value of currently deployed performance agents: Cisco IP SLA (SAA & RTTMON), Cisco Ping MIB, Nortel Ping MIB, RFC 2925 (Extreme, Juniper, Riverstone), SystemEDGE, Micromuse (Network Harmoni SLA+)
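One common way to combine a baseline with a threshold is to flag a response time test when its result exceeds the historical mean by a set number of standard deviations. The Python sketch below illustrates that approach. The three-sigma rule, the sample values, and the function names are assumptions; the source does not specify the analysis that SPM performs.

```python
from statistics import mean, stdev


def baseline_threshold(history_ms, sigmas=3.0):
    """Threshold = mean + sigmas * sample standard deviation.

    history_ms must contain at least two past response times (milliseconds).
    """
    return mean(history_ms) + sigmas * stdev(history_ms)


def is_violation(sample_ms, history_ms, sigmas=3.0):
    """True when a new test result exceeds the baseline-derived threshold."""
    return sample_ms > baseline_threshold(history_ms, sigmas)


# Hypothetical results from earlier scheduled synthetic tests.
history = [100, 102, 98, 101, 99]
for sample in (103, 150):
    status = "violation" if is_violation(sample, history) else "ok"
    print(sample, status)
```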
Service Manager
The SPECTRUM Service Manager leverages SPECTRUM's Business Service Intelligence to provide real-time and historical management of business processes, SLAs, and customers. Rather than continuing to manage within a particular vertical technology silo (network, systems, applications, and so on), SPECTRUM Service Manager enables horizontal, cross-silo management in alignment with business process reliability. This tool understands the physical and logical relationships between the availability and performance of IT infrastructure components and the critical services and customers that they are designed to support. Related IT components are correlated into logical business services such as e-mail, Internet access, order entry, finance, and so on. SPECTRUM Service Manager bridges the gap between operations and customer care service desks by correlating infrastructure reliability problems to impacted services and affected customers. This allows resources to be allocated based on the importance of the services that are affected. Real-time alarms are generated, warning of service outages and impending SLA violations, including the root cause, so that they can be addressed quickly before the business is severely impacted. A service dashboard provides at-a-glance health status. Historical reports show past performance and details of degradations and outages, allowing the business to find new ways to improve the services over time. The Service Manager includes the following key features:
Proactive business service and SLA management
Intuitive service status dashboard
Web-based service level reporting
One concurrent service dashboard license
Ability to manage services from additional SPECTRUM Assurance Servers
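The correlation of components into business services can be sketched as a lookup from each service to the components it depends on, with a priority used to order the response. In the Python sketch below, the service names come from the examples in this section, while the component names, the priorities, and the impacted_services helper are hypothetical and do not represent SPECTRUM Service Manager's data model.

```python
# Hypothetical mapping of business services to the components they rely on.
SERVICE_COMPONENTS = {
    "e-mail": {"mail-server", "core-router"},
    "order entry": {"app-server", "db-server", "core-router"},
    "Internet access": {"edge-router", "core-router"},
}

# Lower number = more important to the business (assumed values).
SERVICE_PRIORITY = {"order entry": 1, "e-mail": 2, "Internet access": 3}


def impacted_services(failed_component):
    """Services that depend on failed_component, most important first."""
    hit = [
        service
        for service, components in SERVICE_COMPONENTS.items()
        if failed_component in components
    ]
    return sorted(hit, key=lambda service: SERVICE_PRIORITY[service])


print(impacted_services("core-router"))
print(impacted_services("db-server"))
```

Ordering the result by priority reflects the point made above: resources are allocated according to the importance of the affected services, not the order in which alarms arrive.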
Voice Management Components

The eHealth for Voice product comprises the following components:
Required Components:
eHealth for Voice Right to Use License (per PBX/message system)
Node License (1 per call/message system)
Optional Components:
eHealth for Voice Policy Manager
eHealth for Voice

eHealth for Voice is a multi-vendor, multi-system (call management and voice messaging) and multi-technology (traditional PBX (TDM) and IPT) performance management solution that greatly simplifies management of voice networks. eHealth for Voice eliminates the manual collection of data and the labor-intensive effort of report compilation and telephony GoS determination. This translates to improved voice system performance and availability delivered at a lower cost. Furthermore, eHealth for Voice is an agent-less solution that does not require any software to be installed on the voice systems, simplifying installation and greatly reducing time-to-value.
You can run a wide variety of reports for delivery to printers, email recipients, or a corporate intranet. With eHealth for Voice, accurate current and historical system information is always available for trending and analysis. Instead of fragmented data snapshots, true performance measurements are delivered to a desktop or printer, every day, automatically.
The eHealth for Voice architecture allows maximum scalability by modularizing functions. For smaller installations, a single server may contain the database and the data collection module. For larger applications, any number of additional servers in different locations may act as data collection agents, downloading data from clusters and sending it to the one central database. Data can be collected according to a user-defined schedule, twenty-four hours a day, seven days a week, so that all data is retrieved before it is overwritten. The central database may be accessed by any number of client machines over an IP network to provide access to data and reports.
You can purchase eHealth for Voice by ordering the eHealth for Voice Right to Use license for the PBX, call system, or messaging system to be monitored (for example, purchase CA eHealth for Voice Nortel CS-1000 and Meridian to monitor Nortel PBXs) and then ordering the appropriate number of node licenses.
One node license is required for each call system or messaging system monitored.
eHealth for Voice Policy Manager

eHealth for Voice Policy Manager is a component that plugs into the eHealth for Voice engine to monitor all data activity against user-defined criteria and provide automatic notification when those criteria are met. The module allows you to set specific thresholds and conditions at the node, platform, or system-wide level and to set notification actions including sending e-mail, console, and pager messages, SNMP traps to SPECTRUM, Unicenter NSM, or third-party monitoring systems, and invoking customized commands and
scripts. Conditions are combined with one or more response actions to create policies. The Policy Manager Service, running on an eHealth for Voice application server, then monitors data as it is loaded into the database against all active policies and automatically triggers policy actions as appropriate. You can purchase Policy Manager by ordering the eHealth for Voice Policy Manager Right to Use license and then ordering the appropriate number of node licenses. One node license is required for each call system or messaging system monitored.
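The relationship between conditions, response actions, and policies can be shown with a small Python sketch: a policy pairs one condition with one or more actions, and each newly loaded data record is checked against every active policy. All names here (Policy, evaluate, the trunk_utilization field, the 90% limit) are hypothetical and are not the Policy Manager's actual interface; the notification action simply records a message where a real deployment would send e-mail or an SNMP trap.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Policy:
    """A condition combined with one or more response actions."""
    name: str
    condition: Callable[[dict], bool]
    actions: List[Callable[[str, dict], None]] = field(default_factory=list)


def evaluate(policies, record):
    """Check one loaded data record against all active policies.

    Runs every action of each matching policy and returns the matched names.
    """
    triggered = []
    for policy in policies:
        if policy.condition(record):
            for action in policy.actions:
                action(policy.name, record)
            triggered.append(policy.name)
    return triggered


notifications = []  # stand-in for e-mail, pager, or SNMP trap delivery


def record_notification(policy_name, record):
    notifications.append(f"{policy_name}: node={record['node']}")


policies = [
    Policy(
        name="trunk-busy",
        condition=lambda r: r["trunk_utilization"] > 0.9,  # assumed limit
        actions=[record_notification],
    )
]

triggered = evaluate(policies, {"node": "pbx-1", "trunk_utilization": 0.95})
print(triggered, notifications)
```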
Deployment Architectures
The CA Network and Voice Management solution provides two common deployments:
Small-to-medium deployments, with one centralized management site and limited managed resources (under 100,000) across short geographic ranges
Large deployments, which may have multiple distributed management sites, large numbers of managed resources, and possibly a vast geographic range
The following sections describe two sample deployments: a small-to-medium local deployment suitable for a small enterprise, and a large, distributed deployment that would support a large-scale service provider. CA's solutions can support a variety of deployments and business services; the larger the environment, the more critical it is to leverage CA Technology Services to help plan, deploy, and manage the solution.
Small-to-Medium Enterprise Deployment

Typically, a small-to-medium enterprise deployment uses all of the performance components described in this chapter with the exception of the Distributed eHealth product and Remote Polling product. Small-to-medium deployments are typically used for customers who have 50,000 or fewer managed resources, with perhaps limited geographic spans. Chapter 5 describes the best practices and steps to create a small-to-medium-scale deployment. The following diagram represents the architecture for a small-to-medium enterprise deployment.
Typically, in a small-to-medium environment, you install SPECTRUM, eHealth, and eHealth for Voice on independent, dedicated systems. In addition, the SPECTRUM OneClick and Report Manager applications should also be installed on dedicated systems.
Large Service Provider Deployment

Typically, a large service provider deployment uses all of the components including the distributed configurations, such as Distributed eHealth, eHealth Remote Polling, Distributed SpectroSERVER, and distributed Voice installations. Large deployments typically consist of customers who have hundreds of thousands of managed resources, or more, which are spread over wide geographic ranges. Additionally, service providers need to ensure segmented views of the management resources, as the resources may be owned or allocated to independent customers. This guide does not describe how to plan and configure a large-scale deployment. If you are planning a large-scale solution deployment, work with your CA Sales and Technology Services team to plan and deploy your configuration based on the size and characteristics of your network. The following diagram represents the architecture for a large service provider deployment.
The CA Network and Voice Management solution scales to support these large configurations. As shown in the figure, central systems are typically installed within the primary network operations center (NOC) area. You can also install systems that collect data from areas closer to the remote locations in which the managed resources reside. The remote and central systems communicate to share information. The CA solution offers a
variety of ways to distribute the management coverage and polling services, as well as the reporting services, to cover a wide range of management environments.
Network Performance Hardware and Software Requirements/Sizing

Use the eHealth Sizing Wizard to determine the appropriate environment requirements. You can access the eHealth Sizing Wizard via http://www4.concord.com/sizing/swiz. The following are the minimum software and hardware requirements:
Technical Specifications for eHealth 6.0 (eHealth E2E Console required)
Minimum System Requirements: UNIX
  Hardware: Sun or HP server with minimum 900 MHz CPU(s)
  Operating Systems: Solaris 9, 10 (32 and 64-bit); HP-UX 11.i, 11.23 (64-bit)
  Window Manager: OpenWindows, OSF/Motif, CDE
  Memory: 3 GB
  Swap Space: 6 GB
  Free Disk Space: 14 GB (includes eHealth files, Oracle, database, and DB backup location). If you use the optional Report Center capability, add 50% more disk space and 1 GB more memory.
  Web Browser: Mozilla 1.6 (or higher)
Minimum System Requirements: Windows
  Hardware: Windows Server with minimum 2.0 GHz CPU(s)
  Operating Systems: Windows 2003: Standard, Enterprise
  Memory: 3 GB
  Swap Space: 6 GB
  Free Disk Space: 14 GB, NTFS format (includes eHealth files, Oracle, database, third-party applications, and DB backup location). If you use the optional Report Center capability, add 50% more disk space and 1 GB more memory.
  Web Browser: Mozilla 1.6 (or higher), Internet Explorer 6 (or higher), Mozilla FireFox 1.x
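The Report Center adjustment is simple arithmetic: 14 GB of disk plus 50% is 21 GB, and 3 GB of memory plus 1 GB is 4 GB. The Python sketch below encodes that calculation and adds a basic free-space check using the standard shutil module. The function names are hypothetical, and the sketch is only a pre-installation sanity check; it is not a substitute for the eHealth Sizing Wizard.

```python
import shutil

BASE_MEMORY_GB = 3   # minimum memory from the eHealth 6.0 specifications
BASE_DISK_GB = 14    # minimum free disk space from the same specifications


def ehealth_minimums(report_center=False):
    """Minimum memory and disk, adjusted for the optional Report Center."""
    memory_gb = BASE_MEMORY_GB
    disk_gb = float(BASE_DISK_GB)
    if report_center:
        memory_gb += 1    # add 1 GB more memory
        disk_gb *= 1.5    # add 50% more disk space
    return {"memory_gb": memory_gb, "disk_gb": disk_gb}


def has_enough_disk(path, required_gb):
    """True when the file system holding path has required_gb free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 1024 ** 3


needs = ehealth_minimums(report_center=True)
print(needs)
print("Disk OK:", has_enough_disk(".", needs["disk_gb"]))
```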
Network Fault Management Hardware and Software Requirements/Sizing

Use the SPECTRUM sizer to determine the number of SpectroSERVERs you need to efficiently manage your network. Contact Technical Support or your CA Sales representative for more information about using the SPECTRUM sizer. The following are the minimum software and hardware requirements:
Technical Specifications for SPECTRUM 8.0
Minimum System Requirements: UNIX/Linux
  Hardware: Sun UltraSPARC II; Linux Pentium Xeon
  Operating Systems: Solaris 9, 10 (see installation guide for required patches); Linux Red Hat Ver 3, update 6 or greater
  Memory: 2 GB
  Free Disk Space: 40 GB
Minimum System Requirements: Windows
  Hardware: Windows Pentium Xeon
  Operating Systems: Windows 2000, Windows XP Professional, or Windows 2003 Server
  Memory: 2 GB
  Free Disk Space: 40 GB
Technical Specifications for SPECTRUM Report Manager and OneClick Servers
Minimum System Requirements: UNIX/Linux
  Hardware: Sun SPARCstation; Linux Pentium Xeon
  Operating Systems: Solaris 9, 10 (see installation guide for required patches); Linux Red Hat Ver 3, update 6 or greater
  Memory: 1 GB (with 2 GB swap space for Solaris)
  Free Disk Space: 4 GB
  Applications: Linux Update 6 or greater; Solaris packages SUNWeu8os, SUNWeuluf
Minimum System Requirements: Windows
  Hardware: Windows Pentium Xeon
  Operating Systems: Windows 2000, Windows XP Professional, or Windows 2003 Server. Note: Business Objects XI supports a maximum of 10 users on XP.
  Memory: 1 GB
  Free Disk Space: 4 GB
  Applications: See Microsoft Support for information about updates for your Windows version. Business Objects XI Service Pack 1
Technical Specifications for SPECTRUM OneClick Servers (without Report Manager)
Minimum System Requirements: UNIX/Linux
  Hardware: Sun SPARCstation; Linux Pentium Xeon
  Operating Systems: Solaris 9, 10 (see installation guide for required patches); Linux Red Hat Ver 3, update 6 or greater
  Memory: 1 GB
  Free Disk Space: 230 MB
  Applications: Linux Update 6 or greater; Java 2 SDK, Standard Edition, version 1.5.0_06 or later
Minimum System Requirements: Windows
  Hardware: Windows Pentium Xeon
  Operating Systems: Windows 2000, Windows XP Professional, or Windows 2003 Server
  Memory: 1 GB
  Free Disk Space: 230 MB
  Applications: Windows 2000 - Service Pack 2 or later; Java 2 SDK, Standard Edition, version 1.5.0_06 or later
eHealth for Voice Single PC or Database Server Hardware and Software Requirements
Minimum System Requirements: Windows
  Hardware: Pentium 1 GHz or greater
  Operating Systems: Windows 2000 Server, Windows 2003 Server
  Memory: 512 MB minimum, 1 GB recommended
  Free Disk Space: 1 GB for the eHealth for Voice program files; 5-20 GB for the database, depending on the number of voice platforms and systems
  Database Software: Microsoft SQL Server 2000 with Service Pack 2 or higher, or Microsoft SQL Server 2005. Note: Must be purchased and installed separately by the customer.
For the hardware and software system requirements for application and client-only installations, see the eHealth for Voice Operations Guide.
Chapter 5: Setting Up and Configuring the Integrated Solution

This chapter describes how to install and configure the components that comprise CA's Network and Voice Management solution in a small-to-medium deployment environment. It also outlines best practices for configuring the components to monitor performance. This CA Green Book does not describe how to plan and configure a large-scale deployment. If you are planning a large-scale solution deployment, work with your CA Sales and Technology Services team to plan and deploy your particular configuration based on the size and characteristics of your network.
Installing the CA Network and Voice Management Solution Software

To create the integrated network and voice management environment, you must install the SPECTRUM, eHealth, and eHealth for Voice applications on systems that are dedicated to each application. Important: The installation procedures for each application are described in detail in product-specific installation guides that are referenced within this chapter. You must use those guides to correctly install the applications. Review this chapter first to obtain the best practices for creating the integrated solution with these applications.
Installation Prerequisites
Before you proceed, read Chapter 4 of this guide to ensure that your systems meet the basic minimum requirements, and that you have sized your systems appropriately to support these applications in your environment. If you do not properly size the systems, they will not be able to support the application processing loads, a condition that could degrade CPU, disk space, and memory performance. Note: Your CA Sales team will ensure that you properly size your systems to support the applications that you plan to use and the number of resources that you plan to manage using these applications.
Installation Steps
You can install the software applications in any order; as a best practice, install SPECTRUM and SPECTRUM OneClick/Report Manager first, then eHealth, and, finally, eHealth for Voice.
Important: The CA network and voice management applications require you to use systems that are dedicated to each application. Do not use those systems for other applications or services. While anti-virus and security software are recommended for any server system in your environment, disable the anti-virus software during installation to ensure that the applications install completely.
You need a minimum of four systems for the following basic configuration:
SPECTRUM (SpectroSERVER system)
SPECTRUM OneClick and Report Manager server
eHealth
eHealth for Voice
BEST PRACTICES
To facilitate the successful installation and setup of these components, review the following best practices:
Ensure that the systems on which you plan to install the software have fixed IP addresses.
Obtain and test login account privileges to the systems. For Windows systems, you need an account with Administrator privileges. For UNIX systems, you need access to the root user account.
How You Install SPECTRUM

On the system that you have designated as the SpectroSERVER for your environment, install SPECTRUM Release 8.0. Log on to http://support.concord.com to obtain the latest Service Pack for the release from the Software Downloads page. Follow the instructions in the SPECTRUM Installation Guide to complete the following tasks:
1. Confirm SPECTRUM prerequisites.
2. Prepare the operating system and optimize the system for best performance.
3. Make sure that you have SPECTRUM license and extraction keys. You obtain the keys from your CA sales representative when you purchase the software.
4. Install the software and perform any necessary troubleshooting for the installation.
5. Start the SPECTRUM software.
6. Enable access to the SPECTRUM system.
Following the SPECTRUM installation, proceed to the SPECTRUM OneClick installation. Install OneClick so that you have full administrative access to SPECTRUM. Note: OneClick is the primary administration interface to SPECTRUM. Use OneClick, rather than the legacy SpectroGRAPH interface, to perform administrative functions.
How You Install SPECTRUM OneClick and Report Manager

On the system that you have designated as the OneClick and Report Manager server for your environment, install SPECTRUM OneClick and Report Manager for Release 8.0.
Important: To install the OneClick and Business Objects software, follow the documentation carefully. Installation failures typically result if you diverge from the documented steps. Install Business Objects first and select the option to use an existing Java application server. In the event of a failure, you will have to remove and reinstall OneClick and Report Manager.
Follow the steps in the Report Manager Installation and Administration Guide to complete the following tasks:
1. Confirm OneClick prerequisites and system requirements.
2. Prepare the operating system and optimize the system parameters for best performance.
3. Install the software and perform any necessary troubleshooting for the installation.
4. Install the OneClick client to run the application and confirm that you can connect to the SpectroSERVER system.
5. On the OneClick interface, click the Report Manager tab on the OneClick index page to confirm that Report Manager installed correctly.
How You Install eHealth

On the system that you have designated as the eHealth server for your environment, install eHealth Release 6.0. Make sure that you log on to http://support.concord.com to obtain the latest InstallPlus kit for the release from the Software Downloads page. Follow the instructions in the New Installations of eHealth 6.0 Guide for your system platform (Windows or UNIX) to complete the following tasks:
1. Confirm system prerequisites and locations for the eHealth and embedded Oracle software.
2. Install the eHealth and Oracle software, and perform any necessary troubleshooting for the installation.
3. Make sure that you have the eHealth licenses for the features that you will be using. You obtain these licenses with the purchase of the eHealth products. For eHealth release 6.0 GA, note that you need an eHealth SPECTRUM Integration license to configure and use the integrated solution.
Important: After you complete the eHealth installation, follow the instructions provided in the section Configuring the Integrated Solution in this chapter. Do not follow the instructions to start the eHealth console and begin discovering your resources as elements. For the integrated solution, you will discover eHealth elements by importing the SPECTRUM configuration from the SpectroSERVER system. This simplifies the administration tasks for eHealth discovery. For a description of the eHealth administration tasks and interfaces, see the eHealth Administration Overview Guide.
How You Install eHealth for Voice

On the system that you have designated as the eHealth for Voice server for your environment, install eHealth for Voice Release 4. You can install eHealth for Voice in its entirety on one PC. You can also install the software in a distributed configuration on several PCs where one is the database server and the others are client systems that can access the database server for reports and administration tasks. The Database Manager server requires a Microsoft SQL Server database engine. You must purchase and install Microsoft SQL Server before installing eHealth for Voice. You can typically accomplish this by installing Microsoft SQL Server 2000 on the PC that will hold the eHealth for Voice database. Note: Microsoft SQL Server is required only on the PC that will contain the eHealth for Voice database (the Database Manager installation); it is not required on agent-only or client-only machines. Follow the instructions provided in the eHealth for Voice Operations Guide to complete the following tasks:
1. Confirm system prerequisites.
2. Install the software and perform any necessary troubleshooting for the installation.
3. Optionally, install client-only servers to access the eHealth for Voice database server.
4. Start the Program Console and define your voice environment:
   a. Install the licenses for the platforms to be supported.
   b. Start the following services (at minimum): Task Scheduler, Data Collector, Data Loader, Policy Manager.
   c. Define the following: Company, Group, Collector, Platform.
   d. Check the Data Collection queue to verify the scheduled data collection.
5. Set up the SPECTRUM integration by following the instructions provided in the eHealth for Voice Integration for SPECTRUM Guide.
When you complete the eHealth for Voice installation, follow the instructions to start the Program Console and define your voice environment; then set up your SPECTRUM integration by following the instructions provided in the eHealth for Voice Integration for SPECTRUM Guide.
Configuring the Integrated Solution

Using SPECTRUM, SPECTRUM OneClick, eHealth, and eHealth for Voice, you can deploy an integrated solution for managing your network and voice resources. SPECTRUM provides the top-level management interface for resource identification, fault management, and IT network problem resolution. SPECTRUM reduces alarm noise and detects root causes of problems. eHealth provides performance management by collecting detailed statistics on your resources and analyzing that data to detect growing problems and changes in behavior. eHealth Live Health compares performance to thresholds and service rules, and raises alarms when resource performance starts to degrade. Health reports and Live Health can send alarms (traps) to SPECTRUM to reflect these problems in OneClick views. eHealth for Voice manages the end-to-end service for traditional voice networks as well as Voice over IP converged networks. It can detect service policy violations and capacity problems, and send alarms to SPECTRUM to alert network managers through their OneClick views. While these products can be used separately to manage and report on network performance and faults, their combined capabilities provide network managers with a single top-level view of possible problems and changes in network performance, and the capabilities to drill down to reports for more information and troubleshooting.
Best Practices
The following sections describe the best practices for configuring CA's integrated Network and Voice Management solution. These practices streamline common administration tasks, and reduce time devoted to managing and maintaining the software configurations. To configure the integrated solution, follow these primary steps:
1. Identify the network resources that you want to manage using SPECTRUM discovery; then create Global Collections to organize those resources.
2. Import the SPECTRUM-discovered resources into eHealth using eHealth's discovery process.
3. Facilitate reporting and management of your resources by organizing related elements into groups and group lists based on relationships such as the geographic region, customer, organization, or department that they support.
4. Schedule eHealth discoveries of Global Collections to maintain the poller configuration.
The following sections describe these steps in more detail, and provide references to product documentation that provides complete information.
Identify Resources and Use SPECTRUM to Discover Them as Global Collections

Use SPECTRUM discovery to identify the network resources that you want to manage, and then create Global Collections to organize those resources into topology views. These views help network operators track various collections of network entities, organizations, or services that comprise your infrastructure.
In addition, eHealth discovery uses the Global Collections as input. The following procedure describes how to create a static collection of elements on-the-fly. SPECTRUM offers many types of discovery and approaches. To determine the most effective approach for your organization, see the Modeling Your IT Infrastructure Administrator Guide and the AutoDiscovery User Guide.
To create a new collection
1. Log in to the SPECTRUM OneClick console, and select Tools, Utilities, Discovery, New Discovery.
2. In the Discovery dialog, do the following:
   a. Specify a configuration name.
   b. Specify an IP address range or list, or import a file.
   c. Specify a valid community string.
   d. Under Modeling Options, select Discover Only.
   e. Click Discover.
   f. When the results appear, identify any devices that you do not want to monitor, right-click, and exclude them.
   g. Select Model.
   h. Select the desired Modeling Options.
   i. Click OK.
   j. Click Close.
   k. Select Yes to save the Discovery Configuration.
   l. Select Cancel to close the Discovery dialog.
3. Select the Locator tab in the OneClick navigation pane, and use a search option to find resources based on particular criteria (for example, IP address or model name).
4. Select one or more elements; then right-click to select Add To, Collections.
5. In the Select Collections dialog, click Create.
6. Specify a name for the collection and a description; then click OK.
7. Click OK.
To keep your configuration up-to-date, you can run this process regularly.
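The discovery configuration accepts an IP address range, a list, or an imported file. As a minimal sketch of preparing such a list, the following Python expands an inclusive IPv4 range into one address per line. The one-address-per-line layout and the function names are assumptions made for illustration; consult the AutoDiscovery User Guide for the import format that your SPECTRUM release expects.

```python
import ipaddress
from typing import List


def expand_ip_range(first: str, last: str) -> List[str]:
    """Return every IPv4 address from first to last, inclusive."""
    start = ipaddress.IPv4Address(first)
    end = ipaddress.IPv4Address(last)
    if end < start:
        raise ValueError("last address must not precede first address")
    return [str(ipaddress.IPv4Address(n)) for n in range(int(start), int(end) + 1)]


def to_import_text(addresses: List[str]) -> str:
    """Join addresses one per line (assumed layout, not a documented format)."""
    return "\n".join(addresses) + "\n"
```

The addresses used with this sketch should come from your own address plan; reviewing the generated list before importing it helps keep unwanted devices out of the discovery results.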
Import Global Collections into eHealth

Import the SPECTRUM-discovered resources into eHealth using eHealth's discover process. This allows eHealth to begin automatically polling resources and, thus, build a performance database and history for them. With this performance database, eHealth reports and Live Health can detect changes in behavior, identify potential problems in service degradation or capacity, and provide insight into performance trends over time.
SET UP THE SPECTRUM INTEGRATION
Before you can import Global Collections to eHealth, run the SPECTRUM setup program on the eHealth system.
To run the eHealth SPECTRUM Integration setup program
1. Log in to the eHealth system as the eHealth administrator.
2. Open a terminal window and change to the eHealth directory by entering the following command, where ehealth is the full pathname:
   cd ehealth
3. Run the setup program by entering the following command:
   ./bin/nhSpectrumSetup
   The SPECTRUM Import Setup dialog box opens.
4. Enter the following information when prompted by the setup program:
   - Hostname or IP address of the SPECTRUM OneClick server
   - Port number for OneClick server Web requests
   - Path where OneClick is installed on the server
   - Username used to log in to the OneClick server
   - Password for the specified user name
5. Click OK. eHealth verifies your settings and displays a message notifying you if they are valid.
   Note: The validation process may take a few seconds.
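The setup program validates the settings itself. If validation fails, it can help to confirm separately that the eHealth system can open a TCP connection to the OneClick web port. The following Python is a minimal sketch of such a check; the function name is hypothetical, and the host and port that you pass in are whatever you entered in the setup dialog.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within timeout seconds."""
    try:
        # The connection is closed immediately; no data is sent.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A result of False only shows that the port could not be reached from this system (for example, because of a firewall or a stopped web server). A result of True does not confirm that the user name, password, or installation path are correct.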
DISCOVER SPECTRUM GLOBAL COLLECTIONS
Use the eHealth discover process to import the SPECTRUM configuration into eHealth.
To discover a SPECTRUM Global Collection
1. Log in to the eHealth console.
2. Select Setup, Discover.
3. In the Discover dialog, do the following:
   a. In the Mode list, select the technology types associated with the resources that you want to discover.
   b. Select SPECTRUM Import and specify the SPECTRUM Global Collection that you want to import.
   c. Click Discover.
   eHealth connects to the OneClick server, extracts the information from the SPECTRUM collection, and discovers the appropriate elements.
4. Save the discovered elements to the poller configuration. eHealth automatically begins polling them to collect performance data.
Organize Your Resources by Creating eHealth Groups

eHealth provides a grouping capability that helps you to organize your elements effectively, facilitate administration, and simplify reporting. By focusing on a subset of elements rather than all elements in your infrastructure, you can manage them more easily as well as create effective reports that address specific needs. To manage your infrastructure, you can organize related elements into groups based on geographic regions, customers, organizations, or departments that they support. To organize your groups, you can associate them to group lists. For example, if you wanted to monitor the systems supporting your business within Europe, you could create a group called England (composed of resources that support offices in that country), and other groups for each country in which you operate. You could then add those groups to a group list called Europe and generate reports for the entire group list. To simplify reporting and administration, you can also filter your element lists based on your grouping strategy. Before grouping your resources, review the eHealth best practices for grouping outlined in the eHealth Element and Poller Management Guide.
To create a new group
1. Log in to the OneClick for eHealth console as an administrator who has permission to manage groups.
2. Select Find Elements in the Managed Resources folder.
3. Select the elements that you want to include. Select Element Chooser to filter the list. Include a wildcard such as an asterisk (*) to match characters, or a question mark (?) to match a single character.
4. Right-click and select Create Group with Selected Elements.
5. Specify the first group name and a description. If SmartTree is enabled, append a label to the group name that reflects the location of the elements and use the selected delimiter (for example: England-1, Germany-1, or Spain-1).
6. Click OK. The group immediately appears under By Group.
7. Repeat Steps 2 through 6 to create other groups with a suffix. For example: England-2, Germany-2, Spain-2.
8. Under Managed Resources, select By Group. If SmartTree is enabled, the element tree displays two separate tiers in an alphabetical hierarchy based on that naming convention.
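The wildcard filtering in the Element Chooser, where an asterisk (*) matches a run of characters and a question mark (?) matches a single character, can be illustrated with Python's standard fnmatch module. This is only a sketch of the matching idea: the element names are hypothetical, and the Element Chooser's exact rules may differ (fnmatch, for instance, also gives square brackets a special meaning).

```python
from fnmatch import fnmatchcase


def filter_elements(names, pattern):
    """Return the names that match pattern, where * matches any run of
    characters and ? matches exactly one character (case-sensitive)."""
    return [name for name in names if fnmatchcase(name, pattern)]


# Hypothetical element names, for illustration only.
elements = ["lon-rtr-01", "lon-rtr-02", "lon-sw-01", "ber-rtr-01", "mad-rtr-10"]
```

With these names, the pattern lon-* selects the three London elements, while ???-rtr-0? selects routers at any three-letter site whose number starts with 0.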
To add the groups to a group list
1. Log in to the OneClick for eHealth console as an administrator who has permission to manage groups.
2. Right-click By Group List in the Managed Resources folder, and select New Group List.
3. Specify a name and a description, then click OK.
4. Double-click the group list name under By Group List, and select the Groups Not in This Group List tab.
5. Select the groups that you want to include, right-click, and select Add Selected Groups to Group List.
6. Click Yes to confirm.
7. Click OK to save.
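The relationship between elements, groups, and group lists can be pictured as a simple two-level mapping. The following Python sketch models the Europe example from this section; all group and element names are hypothetical, and the data structures are an illustration rather than a description of how eHealth stores its configuration.

```python
# Hypothetical groups (group name -> element names) and one group list.
groups = {
    "England": {"lon-rtr-01", "lon-sw-01"},
    "Germany": {"ber-rtr-01"},
    "Spain": {"mad-rtr-10"},
    "Japan": {"tok-rtr-01"},
}
group_lists = {"Europe": ["England", "Germany", "Spain"]}


def elements_in_group_list(name):
    """Return the sorted union of elements across the groups in a group list."""
    members = set()
    for group in group_lists[name]:
        members |= groups[group]
    return sorted(members)
```

A report run against the Europe group list would cover the elements returned by elements_in_group_list("Europe"), and would leave out the Japan group because it is not a member of that list.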
Schedule eHealth Discoveries of Global Collections

To maintain the element definitions in your poller configuration, you can schedule eHealth discoveries of Global Collections. A scheduled discover job runs automatically to update element information and ensure that the elements that you are monitoring continue to respond to eHealth polls.
To schedule a discover process
1. Log in to the eHealth console.
2. Select Setup, Schedule Jobs.
3. In the Schedule Job dialog, select Add Discover from the Add list. (The Add list default is Add At-a-Glance.)
4. In the Add Scheduled Discover Job dialog, do all of the following, and then click Schedule:
   a. Select one or more element types under Mode.
   b. Select SPECTRUM Import, and specify the Global Collection.
   c. Specify the days and time on which to run the scheduled discover job. As a best practice, run the scheduled eHealth discovery to follow SPECTRUM autodiscovery updates so that your configurations remain synchronized. If your environment does not change frequently, schedule less frequent discovery updates. If your environment is very dynamic, schedule more frequent discovery updates to ensure that your configuration information remains up-to-date.
   d. Click Schedule.
   e. Click OK.
Network and Voice Monitoring

Using CA's integrated solution, you can closely monitor the performance of your network and voice resources. The eHealth Live Health application provides instantaneous feedback on trouble spots, telling you where the problems are, when they started, and their severity. It also identifies growing problems before they become failures, allowing you to take action and keep your business running smoothly. Live Exceptions sends alarm notifications to the SPECTRUM interface. You can then run reports from the SPECTRUM OneClick interface to review eHealth's analysis of the problems. To configure the integrated solution to monitor performance, follow these primary steps:
1. Set up Live Health monitoring of eHealth groups and group lists.
2. Forward Live Health traps to SPECTRUM.
3. Customize and schedule Health reports to send traps to SPECTRUM.
4. Configure eHealth for Voice to send alerts to SPECTRUM.
5. Configure SPECTRUM to recognize the eHealth server.
6. Configure SPECTRUM to view eHealth alarms.
The following sections describe these steps in more detail, and reference the product documentation for complete information.
Set Up Live Health

After you discover the resources that you want to monitor and group them, you can associate them to a Live Health profile to indicate when performance problems are occurring. A Live Health profile is a set of alarm rules that eHealth applies to groups or group lists of elements. Alarm rules define the types of elements and conditions to monitor, the problem thresholds and duration, and the problem severity. eHealth provides hundreds of technology-specific profiles for managing your network resources. For each technology, eHealth offers the following types of Live Health profiles:

Profile Name           Description of Purpose
Failure                Identifies problems with availability, errors, or other device failures.
Delay                  Warns of overutilization or congestion problems which could cause network delay.
Unusual workload       Indicates when an element's capacity or volume is outside its typical performance for the baseline.
Latency                Identifies when network latency is increasing. Latency is usually measured between the eHealth system and the device itself.
Configuration change   Detects when a device's configuration has changed, such as module/card insertions to a switch.
Security               Warns of problems such as a firewall detecting a ping of death attack, login failures, or unauthorized accesses.
Once you assign a profile to a group or group list of elements, Live Exceptions monitors the group or group list to look for any activity that violates the specified rules, and produces alarms when activity triggers any of the rules in the profile. With this integrated solution, you can configure Live Health to send alerts to the SPECTRUM interface when problems occur. Carefully review the Live Exceptions web help available with the product to ensure that you understand performance and how the rules identify performance problems.
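An alarm rule, as described above, combines a condition, a threshold, a duration, and a severity. The following Python is a deliberately simplified sketch of that idea: it raises the rule's severity when enough of the most recent polled samples exceed a threshold. The class, field names, and evaluation logic are assumptions made for illustration and do not represent the algorithm that Live Exceptions uses.

```python
from dataclasses import dataclass


@dataclass
class AlarmRule:
    threshold: float      # a sample above this value counts as a violation
    min_violations: int   # violating polls required ...
    window: int           # ... within this many most recent polls
    severity: str


def evaluate(rule, samples):
    """Return the rule's severity if enough recent samples violate the
    threshold, otherwise None."""
    recent = samples[-rule.window:]
    violations = sum(1 for value in recent if value > rule.threshold)
    return rule.severity if violations >= rule.min_violations else None
```

For example, a rule with a threshold of 90, min_violations of 3, and a window of 4 stays silent through a single utilization spike but reports its severity once three of the last four polls exceed 90.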
FIND LIVE EXCEPTIONS PROFILES
The Live Exceptions feature has hundreds of default profiles that you can use to monitor your resources. To search and review the profiles that apply to your types of resources, use the Live Health Profiles tool on the eHealth Certification support site.
To review available Live Exceptions profiles
1. Using a web browser, log on to http://support.concord.com.
2. Click Certification.
3. On the Certification page, click Live Health Profile Descriptions under Certification Information.
4. Click Element Types to display the various types of resources that eHealth can monitor.
5. Scroll through the list of elements to locate the types that you are currently monitoring. For example, if you are monitoring CPUs, click CPU, Router/Switch CPU(1), Generic Router/Switch CPU (2).
6. Click a profile name, and review the profile description to determine the types of problems for which it will raise alarms.
You can also create custom profiles and rules. For a description of how to create rules and profiles, see the Live Exceptions web help.
START LIVE EXCEPTIONS AND ASSOCIATE PROFILES
To use Live Exceptions, you must log in to the eHealth Web interface and download the Live Health client application to your local PC or workstation. Install the Live Exceptions client following the instructions provided on the download page. To access the eHealth web interface, use a web browser to navigate to http://hostname:port, where hostname is the name or IP address of your eHealth system, and port is the HTTP port used by the web server. If your Web server uses the default port 80, you can omit the port number. You must have an eHealth Web user account to log on to the eHealth web interface.
To set up Live Health to produce alarms when problems occur
1. Make sure that you have downloaded and installed the Live Health client software from the eHealth Web interface.
2. Do one of the following to open the Live Exceptions application:
   - If your system is a Windows system, select Start, Programs, eHealth, Live Exceptions. Your program group name will vary, depending on the name that you used when you installed the Live Health client.
   - On a UNIX system, change to the Live Health client installation directory and run the command nhLiveExceptions.
3. In the eHealth System field in the Live Exceptions application window, specify the name of the system to which you want to connect, and your user name and password. The Live Exceptions Browser appears.
4. Select Setup, Subjects to Monitor.
5. In the Setup Subjects dialog, click New.
6. In the Setup Subjects Editor dialog, associate a profile to a group or group list:
   a. From the Subjects list, select a group or group list of elements that you want to monitor using Live Health.
   b. Select an appropriate profile from the list that corresponds to the type of elements contained in the group or group list.
   c. Under Calendars, select a calendar to specify the time range during which Live Health should apply the profile to the group or group list.
   d. Click OK.
   e. Click OK to confirm; then click OK in the Setup Subjects dialog; then click OK to confirm that changes were sent to the eHealth server.
Live Exceptions will begin to monitor the group or group list to identify any activity that violates the specified rules. When activity triggers any of the rules in the profile, Live Exceptions raises an alarm. Generally, alarms will start to appear 15 to 20 minutes after eHealth has had the opportunity to poll elements for several consecutive polls.
Forward Live Health Traps to SPECTRUM

By forwarding traps to SPECTRUM from eHealth Live Health, you can manage Live Exceptions alarms from the SPECTRUM OneClick console, view Alarm Detail reports, and clear alarms to reduce the MTTR for network issues.
To configure Live Exceptions to forward traps
1. Launch the Live Exceptions browser.
2. Select Setup, Trap Destinations.
3. In the Trap Destinations Manager dialog, click New.
4. Specify the following information for the SpectroSERVER under Edit Trap Destination:
   - Hostname
   - IP address
   - Port number
5. Click Add.
6. Confirm that the name of the SpectroSERVER appears in the Existing Trap Destinations list; then click OK.
7. Select Setup, Notifier Rules.
8. In the Notifier Manager dialog, click New.
9. In the Notifier Rule Editor dialog, do the following:
   a. In the Name field, enter SPECTRUM.
   b. In the Action list, select Send Trap.
   c. In the To NMS list, select the SpectroSERVER that you specified in Step 4.
   d. Under When an alarm is, select both Raised and Cleared.
   e. Under Elements within, specify either a specific technology type or All Tech/Subjects.
   f. Click OK to save your Notifier rule.
10. Confirm that the Notifier rule appears; then close the window.
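When troubleshooting trap forwarding, one generic check is to confirm that UDP datagrams from the eHealth system actually arrive at a destination host and port. The following Python is a minimal sketch of such a listener, intended for a temporary test destination rather than a live SpectroSERVER (which already owns its trap port). It does not decode SNMP; the function name and defaults are hypothetical, and binding to ports below 1024, such as the standard trap port 162, usually requires elevated privileges.

```python
import socket


def wait_for_datagram(port, host="0.0.0.0", timeout=30.0):
    """Wait for one UDP datagram on host:port.

    Returns (sender_ip, payload_length), or None if nothing arrives in time.
    The payload is not decoded; this only confirms that traffic arrives.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind((host, port))
        sock.settimeout(timeout)
        try:
            payload, (sender_ip, _sender_port) = sock.recvfrom(65535)
        except socket.timeout:
            return None
        return sender_ip, len(payload)
```

If the function returns None while an alarm is being raised or cleared, the trap destination settings, the Notifier rule, or an intermediate firewall are the usual places to look.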
Customize and Schedule Health Reports to Forward Traps

A Health report evaluates the health of a group of elements by comparing current performance to historical performance over the course of a day, week, or month. The report identifies errors, unusual utilization rates, or shifts in volume that warrant investigation. This report helps you evaluate the health of your resources by monitoring how efficiently those resources are running, checking for availability of critical resources, and detecting whether they are beginning to experience problems. The report analyzes trends based on historical data and calculates averages using a service profile. You can configure individual Health reports to forward traps for Health exceptions to the SpectroSERVER. When a scheduled Health report runs, eHealth sends an SNMP trap to the SpectroSERVER for the top problem of each element in the Exceptions section of the Health report. Note: Only scheduled Health reports forward exceptions. If you manually run a Health report, it will not forward exceptions.
To forward exceptions from Health reports
1. Log in to the eHealth Console.
2. Select Reports, Customize, Health Reports.
3. Select the report from which you want to forward Health exceptions.
4. In the Presentation Attributes drop-down list, select General.
5. Select NMS IP and Port Trap Address in the Attribute table.
6. In the Value field, specify the SpectroSERVER IP address and SNMP port number, separated by a colon. For example: 001.02.03.004:162
7. Click Apply to save.
8. In the Presentation Attributes drop-down list, select Exceptions.
9. Select Send Exceptions SNMP Trap in the Attribute table.
10. Select Yes in the value field.
11. Click OK.
12. Click Save to save the custom report.
13. Select Setup, Schedule Jobs.
14. Select Add Health Report from the list.
15. In the Add Scheduled Report dialog, do the following:
   a. Select the report.
   b. For the subject, select the technology type and group for the report.
   c. Specify a time range for the report, and optionally, a time zone.
   d. Select the format in which you would like to output the report.
   e. Set the schedule for the job.
   f. Click OK.
16. Click OK.
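The NMS IP and Port Trap Address value is an IP address and a port joined by a colon, so a typing mistake is easy to make. The following Python is a minimal sketch that checks such a value before you enter it. The function name is hypothetical, and the check uses Python's own IPv4 parsing rules: recent Python versions reject zero-padded octets like those in the example above, so the sketch is best used with unpadded addresses such as the documentation address 192.0.2.10.

```python
import ipaddress


def parse_trap_address(value):
    """Split an 'IP:port' string and validate both parts.

    Returns (ip, port) or raises ValueError.
    """
    host, sep, port_text = value.rpartition(":")
    if not sep or not host:
        raise ValueError("expected IP and port separated by a colon")
    ip = str(ipaddress.IPv4Address(host))  # raises ValueError if malformed
    port = int(port_text)                  # raises ValueError if not a number
    if not 1 <= port <= 65535:
        raise ValueError("port must be between 1 and 65535")
    return ip, port
```

Passing "192.0.2.10:162" returns the address and the port as separate values; a string without a colon, or with a port outside the valid range, raises ValueError.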
Configure eHealth for Voice to Send Alerts to SPECTRUM

To allow SPECTRUM OneClick to show the voice-specific problems in PBXs, messaging systems, and other voice infrastructure monitored by eHealth for Voice, configure the eHealth for Voice Policy Manager to send alerts (SNMP traps) to SPECTRUM when a particular condition occurs. To configure the Policy Manager, you create a policy based on a defined action plan (the responses assigned to policies) and conditions.
To configure eHealth for Voice to send alerts to SPECTRUM
1. On the system on which eHealth for Voice is installed, select Start, Programs, eHealth for Voice, eHealth for Voice. The eHealth for Voice Program Console appears.
2. Select Tools, Service Setup to configure and start the Policy Manager service.
3. Click Configuration, Servers to configure the Email, SNMP, Web, and SPECTRUM servers.
4. Define the actions to include in the action plan:
   a. Click Templates in the Policy Manager group of the console tree.
   b. Click Actions.
   c. Right-click in the right pane and click New.
   d. Complete the details for the action type. Specify information under the Properties and the Configure tabs.
   e. Click Save.
   f. Click Cancel to close the window.
5. Create an action plan template to define the responses that you want to assign to the policy:
   a. Click Templates in the Policy Manager group of the console tree.
   b. Click Action Plans.
   c. Right-click in the right pane and select New.
   d. Specify a name, description, time zone, and actions.
   e. Click Save.
6. Create a policy based on that action plan:
   a. Click Policies in the Policy Manager group of the console tree.
   b. Click Global to create a policy based on eHealth for Voice global data.
   c. Right-click in the blank area of the right pane, and select New from the menu.
   d. Select Blank Policy, and click Next.
   e. Specify a name and description.
7. Click Add.
8. Define the condition:
   a. Specify the name, platform for the element, and data table.
   b. Specify the build criteria.
   c. Click Apply to save the condition.
9. Define the policy:
   a. Select the time zone, operating interval, and timeframe.
   b. Specify the number of times the condition should match the policy before triggering the action plan.
   c. Specify the severity level.
   d. Select an action plan.
   e. Click Save.
Configure SPECTRUM to Recognize the eHealth Server

After completing the eHealth setup, you must also configure SPECTRUM to recognize the eHealth server. This allows you to drill down to eHealth reports, as well as to clear alarms from the OneClick console.
To enable SPECTRUM to recognize the eHealth server
1. Log on to the SPECTRUM OneClick homepage using your SPECTRUM credentials and click Administration at the top of the page.
2. From the Administration menu, select eHealth Configuration.
3. In the eHealth Configuration window, enter the following information:
   - Hostname or IP address of the eHealth server
   - Port number on which eHealth listens for web requests
   - eHealth web administrator user name
   - eHealth web administrator password
4. Select Started in the Alarm Notifier Status section to enable SPECTRUM to clear Live Health alarms.
   Note: If you configure eHealth to forward alarms to SPECTRUM, and configure SPECTRUM to view eHealth alarms, the alarm notifier enables you to clear those alarms directly from the OneClick console.
5. Click Save.
Configure SPECTRUM to View eHealth Alarms

If you configured eHealth to forward Live Health alarms or Health exceptions to a SpectroSERVER, you must also configure SPECTRUM to receive the alarms.
To enable SPECTRUM to view eHealth alarms
1. Log in as a SPECTRUM administrator.
2. Select Start Console at the top of the OneClick page to launch the OneClick Console.
3. In the Explorer tab of the OneClick Navigation panel, select your SpectroSERVER, and then select Universe.
   Note: If you are monitoring multiple SpectroSERVERs, select Universe under the landscape for the Trap Director SpectroSERVER.
4. In the Contents panel, select the Topology tab.
5. In the Topology tab toolbar area, click the Create a new model by type icon. The Select Model Type dialog appears.
6. Select the All Model Types tab.
7. Select EventAdmin, and then click OK. The Create Model of Type dialog appears.
8. Specify the name and IP address of the eHealth server, and click OK. The eHealth server appears in the topology as an EventAdmin model.
   Note: For more information on creating a model in OneClick, see the Modeling Your IT Infrastructure Administrator Guide.
9. Select the EventAdmin model in the OneClick Topology.
10. Right-click the EventAdmin model; then select Utilities, Attribute Editor. The Attribute Editor dialog appears.
11. In the Attributes tree, select User Defined and click Add. The Attribute Selector dialog appears.
12. In the Select Model Type window, select Other, EventAdmin.
13. In the Attributes for EventAdmin window, select map_traps_to_this_model_using_IP_header, and click OK. The attribute appears in the User Defined list in the Attribute Editor.
14. Click the arrow that points to the right. The attribute moves to the right window.
15. In the right window, select map_traps_to_this_model_using_IP_header, and select Yes.
16. Click Apply. SPECTRUM applies the attributes to the model, and the Attribute Edit Results dialog appears.
17. Confirm your changes in the Attribute Edit Results window, and click Close.
18. Click OK in the Attribute Editor.
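The attribute name map_traps_to_this_model_using_IP_header suggests that incoming traps are associated with the EventAdmin model by the source address of the packet. The following Python sketch illustrates that lookup idea only. It is a conceptual model built on that reading of the attribute name, with hypothetical addresses and model names, and it does not describe SPECTRUM's internal implementation.

```python
# Hypothetical model table: IP address -> model name.
models_by_ip = {
    "192.0.2.20": "ehealth-server (EventAdmin)",
    "192.0.2.30": "core-router-1",
}


def model_for_trap(source_ip, default="unmapped"):
    """Return the model whose address equals the trap packet's source IP."""
    return models_by_ip.get(source_ip, default)
```

In this picture, a trap arriving from 192.0.2.20 is attached to the EventAdmin model created for the eHealth server, which is why Step 8 asks for the eHealth server's IP address.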
System Maintenance
As a best practice, you should back up your eHealth, OneClick, SPECTRUM, and eHealth for Voice systems on a regular basis. A good backup strategy ensures that you can recover your system quickly in the event of a disk failure, database corruption, or other unexpected event. Without a backup strategy, you risk spending hours or days reinstalling and reconfiguring these applications, and you could lose valuable data. For a complete description of how to manage system and database backups for these systems, see the following guides:
- SPECTRUM: Database Management
- eHealth: eHealth Database Management Guide
- eHealth for Voice: eHealth for Voice Operations Guide
System Backup Archives

Establish procedures for copying the backups to tape or to another system, into a directory owned by a secure user who has read-only permissions. (Any users who have access to the saved database can restore it on another system and view all of the data that it contains.) For maximum recovery safety, move the archived data offsite. You might also find it valuable to invest in a remote storage solution (such as those offered by CA BrightStor) that eliminates the need to manage tape libraries. You should retain several months of archived backups so that you can flexibly recover data and reports. This also enables you to recover if a corruption that occurred months ago went undetected.
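A retention policy of several months is easier to keep when the archive directory is reviewed by a script. The following Python is a minimal sketch that lists archive files older than a chosen number of days, based on file modification time. The function name and the idea of one file per backup are assumptions for illustration; the sketch deletes nothing and leaves the decision to the administrator.

```python
import time
from pathlib import Path


def expired_backups(directory, keep_days, now=None):
    """Return backup files whose modification time is older than keep_days.

    Nothing is deleted; the caller decides what to do with the result.
    """
    now = time.time() if now is None else now
    cutoff = now - keep_days * 86400  # 86400 seconds per day
    return sorted(
        path for path in Path(directory).iterdir()
        if path.is_file() and path.stat().st_mtime < cutoff
    )
```

With keep_days set to 180, for example, the function returns only archives last modified more than roughly six months ago.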
Data Recovery Best Practices

It is good practice to rehearse data recovery on a test machine, so that you become familiar with the procedures before facing a crisis.
Chapter 6: Gathering System Information from Agents

Systems are important components of the network. They typically contain your critical business applications, such as web servers, database applications, e-mail applications, and other company-critical applications. When their performance degrades, users are unable to run their applications and perform tasks that utilize those servers. This chapter discusses how to gather system monitoring information, and outlines best practices for configuring SPECTRUM and eHealth to manage Unicenter NSM, SystemEDGE, and third-party system agents.
Deployment and Administration of System Agents

SPECTRUM and eHealth leverage installed system monitoring agents for fault and performance information. This chapter describes three types of system agents:
- CA Unicenter NSM agents
- CA SystemEDGE agents
- Third-party agents
Important: The installation procedures for the Unicenter NSM and SystemEDGE agents are described in detail in product-specific installation guides. You must use those guides to correctly install the agents. After you complete the software installation, review this chapter to obtain the best practices for configuring SPECTRUM and eHealth to leverage these agents.
Best Practices
To facilitate the monitoring and management of system agents, follow these best practices:
- Ensure that the system agent has been successfully installed.
- Configure an SNMP read-only or read-write community string on the system.
- Configure the system agent to send traps to the SpectroSERVER.
- Confirm that you have specified the correct IP address and community string to discover the agents.
- Confirm that your systems have only one management agent enabled and running on them. Systems can sometimes have multiple SNMP agents. For example, they could have the Microsoft SNMP agent and a CA SystemEDGE agent. If multiple agents are running and responding to SNMP queries, SPECTRUM and eHealth could model both agents for the one system. For more information, see the Unicenter NSM Agents section later in this chapter.
Supported Agents
The following table highlights the Unicenter NSM agents supported by eHealth and SPECTRUM based on release. The Active Directory and Performance agents are supported only for eHealth reporting.

Unicenter NSM r11 Systems Agents             Unicenter NSM 3.1 Systems Agents
UNIX System Agent (caiUxsA2)                 UNIX System Agent (caiUxOs)
Windows System Agent (caiWinA3)              Windows System Agent (caiW2kOs)
Active Directory Services Agent (caiAdsA2)   Active Directory Services Agent (caiAdsA2)
Log Agent (caiLogA2)                         Log Agent (caiLogA2)
Performance Agent (hpxAgent)                 Performance Agent (hpxAgent)
SPECTRUM and eHealth also support all SystemEDGE agents as well as a variety of third-party agents provided by vendors such as Microsoft, Dell, Sun, HP, and IBM. In addition, these applications support any MIB-II or RFC 2790-compliant agents. These applications provide out-of-box automated fault management, trap support, and performance reporting and trending.
Note: For agents that support and use the RFC 2790 extensions of MIB-II, SPECTRUM can perform process, file system, and log file monitoring in addition to basic host systems performance monitoring. If you discover agents that do not have the RFC 2790 extensions, only basic host systems performance and log file monitoring may be possible.
Prerequisites
Before you begin, do the following:
- Confirm that you have administrator account access to both the SPECTRUM OneClick console and the eHealth console.
- If you are not familiar with the SPECTRUM OneClick console, see the OneClick Administration Guide for more information.
- If you are not familiar with the eHealth interfaces, review the descriptions of the eHealth console and the OneClick for eHealth (OneClickEH) console provided in the eHealth Administration Overview Guide.
How You Add System Agents in SPECTRUM

You can use either of these methods to add the system agents to SPECTRUM:
- Automatically discover the system agents using SPECTRUM's AutoDiscovery application.
- Manually add the system agents to SPECTRUM.
AUTOMATICALLY DISCOVER SYSTEM AGENTS
SPECTRUM can automatically discover and model your system resources using autodiscovery capabilities.
To automatically discover your systems
1. In the SPECTRUM OneClick console, select Tools, Utilities, Discovery, New Discovery. The Discover dialog appears.
2. In the Discovery window, do the following:
   - Specify a configuration name.
   - Specify an IP range or list, or select Import to import an IP list file.
   - Specify a valid community string. If you specify more than one, OneClick uses the entry at the top first.
   - Select Discover Only in Modeling Options.
   - Click Advanced Options and specify port 6665 to discover Unicenter NSM agents.
3. Click Discover. The Discovery dialog appears.
4. (Optional) After the results set appears, exclude entries by right-clicking them and selecting Exclude. This prevents those devices from being modeled in the SPECTRUM database.
5. Click Model to add the systems to SPECTRUM.
6. Under Model Options, deselect Create Wide Area Link Models and Create LANs. By deselecting these options, SPECTRUM does not automatically create subnet containers. For more information about discovery and modeling options, see the Modeling Your IT Infrastructure Administrator Guide.
7. Click OK.
8. After the systems are modeled, click Close in the Discovery dialog.
9. (Optional) Click the paper and pencil icon in the left corner of the Tools menu.
10. Edit the topology by moving icons and adding background images, as desired.
MANUALLY ADD A DEVICE USING CREATE MODEL BY IP ADDRESS
As an alternative to automatically discovering your systems, you can manually model them. This procedure works for systems that do not respond to discovery.
To model your system resources manually
1. Using the Explorer tab of the OneClick navigational panel, navigate to the Universe topology view in which you want the new device to appear. The selected Universe topology view appears under the Topology tab of the Contents panel.
   Tip: If you want to place the new device inside a network group container, double-click the container icon to display the topology view for that container.
2. In the Topology tab toolbar area, click the Create model by IP address button. The Create Model by IP Address dialog appears.
   Note: To remove a modeled element from a view, select the element and click Delete (X).
3. Specify the device network address and any other optional fields described in the following table.

   Create Model by IP Address Settings

   Field                             Description
   Network Address (mandatory)       Specifies the network address (IP address) for the device that you are modeling.
   Community Name (mandatory)        Specifies the SNMP community string for the device to be managed.
   DCM Timeout (optional)            Specifies the length of time that the Assurance Server will wait for a response from the device. Default is 3000 ms.
   DCM Retry Count (optional)        Specifies the number of times that the Assurance Server tries to communicate with the device after the DCM timeout value expires. Default is 2.
   Agent port (optional)             Specifies the port on which the agent listens for SNMP requests. Default is 161. (Unicenter NSM agents use 6665.)
   SNMP v2C Enabled (optional)       Specifies if the device that you are modeling supports SNMPv2c protocols. Check this option if you are modeling a device configured for SNMP v2C.
   SNMP v3 Enabled (optional)        Specifies if the device that you are modeling supports SNMPv3 protocols. Check this option if you are modeling a device configured for SNMP v3.
   Discover connections (optional)   Enables SPECTRUM OneClick to discover the linked connections (pipes) between the device that you are adding and its neighbor devices.

4. Click OK in the Create Model by IP Address dialog to create a device icon for the specified device (or click Cancel to cancel the model by IP operation). When you click OK, SPECTRUM OneClick places the newly created device icon in the selected Universe topology view.
Tips: To move or enhance the appearance of the recently modeled device icon, click the Edit mode button in the Topology tab toolbar. You can edit and arrange the model devices using the following techniques:
- To copy or paste the modeled device icon to another topology view other than the Universe topology, use the copy and paste functions in the Topology tab toolbar area.
- To change configuration parameters of a modeled device (for example, community name, polling interval, logging interval, security string, and so on), select the modeled device and change the appropriate settings in the Component Detail panel.
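If you collect these settings in a script or spreadsheet before entering them in the dialog, a small record type can help catch mistakes early. The following Python sketch is illustrative only: the class name ModelByIpSettings, its field names, and the helper worst_case_wait_ms are hypothetical and are not part of any SPECTRUM API. The numeric defaults come from the table above; the False defaults for the check boxes and the assumption that every retry waits one full timeout are guesses made for the example.

```python
from dataclasses import dataclass
from ipaddress import ip_address


@dataclass(frozen=True)
class ModelByIpSettings:
    """Hypothetical record of the Create Model by IP Address fields."""
    network_address: str                # mandatory
    community_name: str                 # mandatory
    dcm_timeout_ms: int = 3000          # default from the settings table
    dcm_retry_count: int = 2            # default from the settings table
    agent_port: int = 161               # Unicenter NSM agents use 6665
    snmp_v2c_enabled: bool = False      # assumed default, not stated in the table
    snmp_v3_enabled: bool = False       # assumed default, not stated in the table
    discover_connections: bool = False  # assumed default, not stated in the table

    def __post_init__(self):
        ip_address(self.network_address)  # raises ValueError if malformed
        if not self.community_name:
            raise ValueError("community name is mandatory")
        if not 1 <= self.agent_port <= 65535:
            raise ValueError("agent port must be between 1 and 65535")
        if self.dcm_timeout_ms <= 0 or self.dcm_retry_count < 0:
            raise ValueError("timeout must be positive, retry count non-negative")


def worst_case_wait_ms(settings):
    """Assumed upper bound: the first attempt plus each retry waits one timeout."""
    return settings.dcm_timeout_ms * (1 + settings.dcm_retry_count)
```

With the defaults, worst_case_wait_ms(ModelByIpSettings("192.0.2.10", "public")) evaluates to 9000 under the stated assumption.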
Unicenter NSM Agents

You can discover and model Unicenter NSM agents automatically using SPECTRUM discovery, or you can manually model them. Because Unicenter NSM agents use UDP port 6665 for SNMP communications by default, rather than the standard SNMP port 161, SPECTRUM can also discover and model other agents running on the same host device. For example, if a Windows workstation is running a Unicenter NSM agent bound to port 6665, as well as the Microsoft SNMP agent bound to port 161, SPECTRUM will create two models for the device: a Unicenter NSM System Host device model and a Windows Host device model, as shown in the following figure.
This scenario can cause poor performance for the following reasons:
- It creates unnecessary duplicate models in SPECTRUM.
- It causes redundant SNMP traffic and polling, which can reduce network and SPECTRUM performance.
- It reduces performance of the agent host machine because multiple management agents are providing performance data.

To avoid this scenario, do the following:

1. Before discovering and modeling, stop and/or remove all management agents except the one that you want to use to manage the system. By doing this, you can avoid creating and managing multiple models in SPECTRUM for the same host. Remember to use the correct SNMP port for the discovery.
2. If you must run more than one agent on a given host system, consider manually modeling only the agent that you want to manage with SPECTRUM.
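Before you run a discovery, it can help to list the hosts that answer on more than one SNMP port, because those are the hosts that would end up with duplicate models. The following Python sketch assumes that you already have a list of (host, port) pairs from your own inventory or port scan; the function name and the input format are hypothetical, and the code does not query SPECTRUM or any agent.

```python
from collections import defaultdict

STANDARD_SNMP_PORT = 161   # standard SNMP agent port
NSM_AGENT_PORT = 6665      # default port for Unicenter NSM agents


def hosts_with_multiple_agents(agents):
    """Return {host: sorted ports} for hosts that have agents on more than one port.

    agents is an iterable of (host, port) pairs gathered outside this script.
    """
    ports_by_host = defaultdict(set)
    for host, port in agents:
        ports_by_host[host].add(port)
    return {
        host: sorted(ports)
        for host, ports in ports_by_host.items()
        if len(ports) > 1
    }
```

For the Windows workstation example above, a host that appears with both port 161 and port 6665 is reported, which tells you to stop one agent or to model only one of them manually.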
How You Add System Agents in eHealth

After you discover and model systems using SPECTRUM, import a SPECTRUM Global Collection to add those system resources to eHealth for reporting and Live Health monitoring. You could add the systems using eHealth discovery as well, but as a best practice for the integrated solution, import the systems from SPECTRUM as described in Chapter 5. After a few eHealth poll cycles, you can run At-a-Glance reports and Trend reports from the OneClick interface.
Performance Reporting On System Agents

eHealth normalizes common performance data across all managed system agents (Unicenter NSM, SystemEDGE, and third-party). Presenting all performance data in a common, understandable format minimizes the learning curve for all users who access real-time and historical trending reports.
At-a-Glance Reports
An eHealth At-a-Glance report for system elements provides summary capacity statistics for the specified system including CPU, interface, and partition utilization; disk faults and I/O; and system availability. With these reports, you can quickly isolate busy CPUs or full disks and compare groups of systems. A sample At-a-Glance report for a system element follows.
HOW YOU RUN AT-A-GLANCE REPORTS
You can run At-a-Glance reports using one of these methods:
- In SPECTRUM OneClick, right-click a device and click At-a-Glance Reports. The integrated At-a-Glance report runs in the background and appears automatically in a web browser on your system.
- With a web browser, log in to the eHealth Web interface at the URL http://hostname:port, where hostname is the name or IP address of your eHealth system, and port is the HTTP port used by the web server. If your Web server uses the default port 80, you can omit the port number. You must have an eHealth Web user account to log in to the eHealth Web interface. Navigate to the Run Reports tab and run an At-a-Glance report on demand.
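The URL rule described above (omit the port when the web server uses the default port 80) can be expressed as a short Python helper. This is a minimal sketch; the function name ehealth_web_url is hypothetical and the code only builds a string, it does not contact the eHealth system.

```python
def ehealth_web_url(hostname, port=80):
    """Build the eHealth Web interface URL, omitting the default HTTP port 80."""
    if port == 80:
        return f"http://{hostname}"
    return f"http://{hostname}:{port}"
```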
MyHealth Reports for Systems

The MyHealth report page on the eHealth Web interface contains a series of charts that are tailored to your particular interest. MyHealth provides eHealth web users with one or more customized reports on the elements and groups that they consider critical. A MyHealth report page contains one or more panels, and each panel contains a separate chart.
Health Reports for Systems

A Health report contains information about the performance of a group of elements for a report period and alerts you to situations that require your attention. The report also identifies situations to investigate because of errors, unusual utilization rates, or excessive volume. You can use a Health report to do the following:
- Identify normal and exceptional system behavior.
- Compare the performance of a group of elements during a report period to their performance over a baseline period.
- Detect changes in behavior that indicate imminent or existing problems.
- Identify trends in volume.
- Identify systems that require further investigation.
Using Live Trend

You can use Live Trend to create charts that monitor statistics for elements that you are polling using eHealth. You can create a single chart or multiple charts in various styles to represent element trends (a single element with multiple variables) or variable trends (a single variable for multiple elements). The following chart shows a Live Trend chart for four variables on a system called atlanta.
START LIVE TREND
To use Live Trend, you must log in to the eHealth Web interface and download the Live Health client application to your local PC or workstation. Install the Live Health client following the instructions provided on the download page. You can then start the Live Trend application to run real-time performance charts for your systems and resources.

To start the Live Trend application

1. Make sure that you have downloaded and installed the Live Health client software from the eHealth Web interface.
2. Do one of the following to open the Live Trend application:
   - If your system is a Windows system, select Start, Programs, eHealth, Live Trend. Your program group name will vary, depending on the name that you used when you installed the Live Health client.
   - On a UNIX system, change to the Live Health client installation directory and run the command nhLiveTrend.
3. In the eHealth System field in the Live Trend application window, specify the name of the system to which you want to connect, and then specify your user name and password. The Live Trend Chart Definition Manager appears.
You can create your own charts through the Live Trend Chart Definition Editor to specify the elements and variables for which you want to view data. For more information, see the Live Trend web help that is accessible from the eHealth Web interface.
How You Run Trend Reports for Systems

You can use Trend reports to determine the value of one or more variables for your systems over a specified report period. This can help you to track the values of the variables to determine when values might have changed radically or when a particular event, such as a reboot or missed poll, occurred. The Trend variables differ for each element type. You can run reports for the following types of systems and system components:
- CPU
- Disk
- LAN
- Process and process set
- User or system partition
- WAN

Each of these types includes specific variables on which you can run reports. For example, server disk elements have variables for disk reads and writes, storage capacity, and storage utilization. You can select up to ten variables at a time on which to run a Trend report. For a complete list of system Trend variables, see the eHealth web help.

The following sample Trend report shows several common system variables:
- Total Bytes
- Total Incoming Bytes
- Total Outgoing Bytes
- System Calls
RUN A TREND REPORT
You can run Trend reports from the eHealth web interface.

To create a Trend report similar to the example above

1. Use a web browser to log on to the eHealth web interface.
2. Click the Run Reports tab.
3. Scroll the Available Reports frame to the Trend reports section.
4. Click Standard.
5. Select the System Element Type.
6. Scroll the Elements list and select the target system element.
7. Scroll the Variables list and select variables; the sample report shows the four variables Total Bytes, Total Incoming Bytes, Total Outgoing Bytes, and System Calls.
8. Select the chart type such as a stacked line chart.
9. Scroll the right frame and click More Options.
10. Select Show Summary Statistics in the General tab to show the tabular data below the chart.
11. Click Generate Report. eHealth processes the report data and displays the Trend report.
Top N Reports
A Top N report lists all of the elements in a group that exceed or fall below the report criteria goals that you specify. You can also specify the goal for each variable. eHealth calculates the difference between the actual value for that variable and the goal that you have set.
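The comparison behind a Top N report can be illustrated with a few lines of Python. This is a sketch of the idea only, not how eHealth implements the report: the function name top_n, the dictionary input, and the choice to rank by the size of the difference are all assumptions made for the example.

```python
def top_n(values, goal, n, exceed=True):
    """Rank elements by how far their actual value is from the goal.

    values maps an element name to its actual value for one variable.
    With exceed=True, only elements above the goal are listed, largest
    difference first; with exceed=False, only elements below the goal.
    Returns up to n tuples of (element, actual, actual - goal).
    """
    rows = [(name, actual, actual - goal) for name, actual in values.items()]
    if exceed:
        rows = [row for row in rows if row[2] > 0]
        rows.sort(key=lambda row: row[2], reverse=True)
    else:
        rows = [row for row in rows if row[2] < 0]
        rows.sort(key=lambda row: row[2])
    return rows[:n]
```

For example, with CPU utilization values of 90, 50, and 75 for three systems and a goal of 70, the two systems above the goal are listed with differences of 20 and 5.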
What-If Capacity Trend Reports for Systems

The eHealth What-If Capacity Trend report enables you to perform capacity planning by adjusting factors for capacity and demand until you have devised an appropriate what-if solution. By giving you the capability to illustrate possible future scenarios, this report helps you prepare for problems before they occur.
Chapter 7: Service Level Management

At the core of CA's solution is a concept called Business Service Intelligence (BSI), a methodology for understanding the relationships and impact of IT infrastructure on business services. BSI delivers Technology Relationship Mapping, impact analysis, and RCA that enable our customers to evolve their IT organizations from being tactically reactive to strategically proactive, while improving IT service quality from a customer and business perspective. BSI provides adaptive analytics that communicate bi-directionally with thousands of multivendor, multi-technology devices to identify, verify, and solve complex problems using model-based, rule-based, and policy-based correlation engines. Business service definition and ongoing maintenance issues are eased through automation, while asset, availability, capacity planning, change management, performance, and trend analysis validate SLA compliance.

BSI provides a bottom-up approach to Business Service Management (BSM) that is practical, achievable, and delivers rapid time-to-value. BSM provides the most obvious value when basic fault management data is insufficient and additional correlation is required to determine the impact of a fault and to identify the business services that may be affected.

The SPECTRUM Service Management module features the ability to organize, analyze, and control all aspects of this area. It also provides a dashboard view as an extension of OneClick that focuses directly on service health and hides the complexity of topology that is normally seen using OneClick.

In general, the approach to service management can be described as top-down to identify the relationships and dependencies of devices, systems, applications, or performance measurements. Within SPECTRUM, they are referred to as resources (models or data) and relationships. These resources and relationships are organized into service or subservice models.
You should define a service from the bottom up to permit the future reuse of common services or subservices. You can configure SLAs to dynamically measure violations and send alerts.

This chapter describes an approach to designing and implementing a service management system within the SPECTRUM application. Unlike most of the functions within SPECTRUM, preparing for service management may involve considerable planning to determine all of the required information and implications. The SPECTRUM methodology is designed to evolve over time. As more information becomes available during the implementation, a more granular representation and measurement of service modeling and management typically emerges.

Additional References:
- Service Manager User Guide
- Report Manager User Guide
- Service Performance Manager User Guide
Interview Procedures
This section describes an approach to designing and implementing the interview process for service management.
Interview Questions
To organize and implement service management and service level management, document and collect responses to the following interview questions:
- Which business services do you want to monitor?
- Which particular resources support those services?
  - Processes?
  - Software applications?
  - IT devices?
- How can conditions and faults that affect services be detected?
- Which resource attributes should be monitored to determine the health of a service?
- Who should be notified if a given service fails?
- What are the SLAs, and how should they be quantified (metrics)?
- What is the criticality of a given service relative to other services or subservices?
General Questions
To organize and implement service management and service level management, document and collect responses to the following general questions:
- Which WAN and LAN technologies support the service?
- Are QoS classes of service (CoSs) currently set up?
- Are the MPLS-based VPNs that are currently in use being monitored?
- Are all of the critical network devices and servers manageable and being managed?
- Are all elements being properly discovered and mapped down to layer 2 or layer 3 as appropriate?
- Are thresholds configured on your critical interfaces? Error rate? Discard rate? Load, and so on?
- Do any environmental monitors need to be monitored (temperature or humidity)?
- Do any power systems or battery backup systems need to be monitored?
- Are the critical log files or Windows event logs being monitored?
- Are the critical processes or Windows services being monitored?
- Are application ports being tested? File Transfer Protocol (FTP)? Hypertext Transfer Protocol (HTTP)? Domain Name System (DNS)?
- For more advanced needs, are custom thresholds/alarms configured?
- Can any model attributes be used to determine the health of a resource?
- Have unique alarms been created using event or condition correlation?
- Are any existing or custom integrations enabled with alarm data that can be used?
- Of the IT resources listed, who is responsible for the proper operation of each?
- Does each individual have access to the correct tools, and is their contact information (email, phone, and so on) available for distribution?
- Does each of the resources have a corresponding troubleshooter, or are troubleshooters added for proper notification?
- Which users benefit from the IT resources listed?
- Rate each user relative to each other:
  - Low
  - Medium-low
  - Medium
  - Medium-high
  - High
- Do logical groups of users exist for ordering purposes?
  - Department
  - Function
  - Role, and so on
- Is the device criticality defined and/or measured for all of the network devices and servers?
Analysis and Mapping Procedures

This section describes an approach to analyzing and mapping the results of the interview process for service management.
How You Organize the Resource Information

Follow these best practices to organize the information that you collected for your services.
- Sort the information by common information types such as the following:
  - Application names
  - Server names
  - Device names and/or types
  - Metric measurements and the sources of that data
- Identify logical groupings of these common resources to avoid duplication.
How You Illustrate the Relationships of Resources to Each Other

Create a diagram that shows how the resources relate to each other. The diagram can help you to map the way in which resources depend on or impact each other.
How You Decompose the Information and Mapping to Service Models

The information that you gather and prepare can help you to build the service models that you need to monitor with SPECTRUM. Take a bottom-up approach to create the most common resource models first. Create a service model by creating a relationship to the proper subservice models. Add service-specific resources not available via subservices.
Example of a Business Service Map to Service Models

Customer ABC has identified a critical business process. When their clients place phone orders, operators enter these orders into a web-based order processing system. These orders are stored and processed from an Oracle database. Because many problems can occur throughout this process, customer ABC wants to build a service that will indicate when the order processing is adversely impacted. In the interview process, some of the critical items were identified and then grouped in the following hierarchy:
- Web server (WEBORDER1)
  - Dell hardware, running SNMP agent (RFC 2790 or equivalent)
  - Microsoft Internet Information Services (IIS) web server
  - Log file with critical data flow entries
  - CPU and memory need to be monitored
  - APC uninterruptible power supply (UPS) battery backup
  - Proper response from web server required
- Oracle Database Server (WEBDB1)
  - Dell hardware, running SNMP agent (RFC 2790 or equivalent)
  - Oracle Database with Oracle Intelligent Agent
  - CPU and memory need to be monitored
  - APC UPS battery backup
- Cisco 6509 Catalyst switches
  - DATASW1 responsible for Server connections
  - DATASW2 responsible for Operator Workstation connections
- 25 Operator Workstations
- DNS service monitoring is required
- Dynamic Host Configuration Protocol (DHCP) service must function
By posing more questions such as the following, you can discover the criticality of items by possible faults:
- What are the most catastrophic failures that could occur?
- Is it possible to measure all of the items chosen?
- What would be the criticality of each item relative to every other item?
- Can any items within the list be reused, or are any necessary as a generic service to other IT business processes?
- What are the processes by which you want to manage problems?
- What would be more critical, losing 25% of the workstations, or losing the switch used to connect the servers?

Start by grouping the most critical assets and the most critical outages. Most certainly, the loss of the servers or switches would be the most catastrophic failure to occur, so begin by grouping items as follows:

SERVICE: Web Order Processing
- Components: WEBORDER1, WEBDB1, DATASW1, DATASW2
  - If any component is down, service is down.
- Ports on switches with server connections
  - If either port is down, service is down.
- Components: operator workstations
  - If 75% of the workstations are down, service is down.
  - If 50% of the workstations are down, service is degraded.
  - If 25% of the workstations are down, service is slightly degraded.
- Performance: web response time, TCP port for Oracle
  - If both are critical, service is down.
  - If one or the other is critical, service is degraded.
  - If one or the other is violated, service is slightly degraded.
- Alarm condition of all four resources
  - General criticality for alarm conditions (minor, major, or critical)

It is also necessary to determine how users are affected when these business services are impacted. To put it simply: who is affected when your business service is impacted, and how critical is that person? A customer who cannot access your sales website will be very inconvenienced; therefore, that customer is very likely a critical (very important) user.
If your internal users cannot access an internal web server that is not very important for their day-to-day tasks, assign a much lower criticality to that problem. Answer the following questions to help ascertain the impact of your business process:
- Of the server users listed, can you sort the list of users by relative importance?
- Once listed, can you organize these users or customers by company, organization, department, or role?
You also need to consider the network services. Although more general, network services such as the following do affect the business service: DNS, DHCP, and e-mail. You can set up response time tests and use a service to monitor the servers providing the service; however, you should treat them slightly differently. Since these services may be a common dependency for other services, you should create them with reuse and modularity in mind. An example of a DNS and DHCP service might look like this:

SUBSERVICE: DNS
- Components: DNS servers SERVER-DNS1 and SERVER-DNS2
  - If both servers are down, the service is down.
  - If one server is down, service is slightly degraded.
- Response time tests: test DNS response time
  - If response time is violated, service is slightly degraded.
- Alarm condition of both resources
  - General criticality for alarm condition (minor, major, or critical)
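Threshold groupings such as the operator workstation percentages and the DNS subservice conditions can be written down as small functions before you build the models, which makes it easy to check them against sample outages. The following Python sketch is a planning aid only and does not use any SPECTRUM interface. The function names are hypothetical, and reading each percentage as "at least this share of workstations is down" is an assumption, because the grouping above does not say how in-between values are treated.

```python
def workstation_health(down, total=25):
    """Map the share of operator workstations that are down to a service health.

    Assumes each threshold means "at least this percentage is down".
    """
    fraction = down / total
    if fraction >= 0.75:
        return "down"
    if fraction >= 0.50:
        return "degraded"
    if fraction >= 0.25:
        return "slightly degraded"
    return "up"


def dns_subservice_health(servers_down, response_time_violated, total_servers=2):
    """Apply the DNS subservice conditions to SERVER-DNS1 and SERVER-DNS2."""
    if servers_down >= total_servers:
        return "down"
    if servers_down >= 1 or response_time_violated:
        return "slightly degraded"
    return "up"
```

With 25 workstations, seven workstations down (28%) yields slightly degraded under this reading, while six down (24%) still reports up.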
Creating Service Models and Relationships

This section introduces service modeling concepts and techniques. Before creating service models, you should gain an understanding of a few key concepts.
Key Concepts
Resource Monitoring
Every service model is a resource monitor that actively monitors its resources to determine its own service health. Service resources are SPECTRUM models, and virtually any model could be a service resource. Service resources might consist of device models, interface models, SPM tests, process models, and even other service models. To monitor a resource, the service watches specific attributes of the resource model. A service model can monitor any attribute whose values are whole numbers. This behavior of a service watching the attribute values of its resources is called resource monitoring.

Service Health
Service health is represented by a small set of values: up, down, degraded, and slightly degraded. Each resource monitor determines its own service health based on attribute values from its resources. Specifically, a service health policy is applied to the collective attribute values from all resources. A policy is essentially a formula which calculates a service health value based on one or more resource attribute values. The logic applied by the policy is encapsulated into a set of policy rules. Each rule is a statement which, when evaluated, will be labeled as true or false. When a policy is evaluated, the first rule that is found to be true, or the first rule satisfied, determines the service health taken on by the service or resource monitor.

Root Cause and Service Impact
Considering that a service determines its own health by monitoring its resources, a logical relationship exists between resource outages and service health. This relationship is expressed in terms of root cause and service impact. When a resource outage results in a change in the health of a service, that outage is the root cause of the service health change. Likewise, when a resource outage affects service health, the outage has a service impact. These concepts become very important for users who must address service outages.

Hierarchical Service Modeling
As mentioned above, each service is a resource monitor that determines its own health by applying a policy to a set of attribute values from its resources. It is important to note that a service can monitor resources that are actually other services. This allows for the creation of service hierarchies; thus, a user can build services with components of other services. This allows for service modeling to extend from very low-level fundamental services to high-level conceptual service models.
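The "first rule satisfied" behavior of a policy can be illustrated with a short Python sketch. It is a conceptual model only, not SPECTRUM code: the function evaluate_policy, the rule representation, the value 1 for a down resource, and the default health of up when no rule matches are all assumptions introduced for the example.

```python
DOWN = 1  # hypothetical whole-number attribute value meaning "down"


def evaluate_policy(rules, values, default="up"):
    """Return the health of the first rule whose condition is true.

    rules is an ordered list of (condition, health) pairs, where condition
    is a function of the collected attribute values. Rules are tested in
    order, and evaluation stops at the first one that is satisfied.
    """
    for condition, health in rules:
        if condition(values):
            return health
    return default  # assumed result when no rule is satisfied


# An ordered rule set: the more severe rule must come first.
redundancy_rules = [
    (lambda values: all(v == DOWN for v in values), "down"),
    (lambda values: any(v == DOWN for v in values), "degraded"),
]
```

Rule order matters in this model: if the two rules are swapped, a total outage satisfies the "any" rule first and is reported as degraded rather than down.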
How You Create Service Models

The process of creating a service model is composed of two primary steps:

1. Select resources.
2. Select the policy that monitors the resources.
The following examples show how a user might create service models representing a web-based service. For more information about creating service models, see the Service Manager User Guide.
Example 1: A Customer Account Access Service

Determining the resources of a particular service can seem like a daunting task. In many cases, it is not possible for you to consider all possible components of a service, and then map how each component might impact a given service. One distinct advantage of SPECTRUM Service Manager is that you can start with small, simple models, and continually refine your service modeling as you gain a better understanding of the service components and how each one impacts the overall service. Although understanding all components of a service is difficult, it is usually easy to identify some of the most critical components, which provides an effective starting point.

For example, consider a simple web service used by a phone support organization to access customer account data. This service will be referred to as the Customer Account Access Service. With just this basic, general description, you can begin to identify some of the service components. As this is a web-based service, it must be supported by one or more web servers. In addition, the service is providing access to information from a database of customer accounts. This implies that the database is likely hosted on one or more systems.

For this example, consider an environment with two web servers and two database servers. This provides a starting point for modeling the service. If both web servers, or both database servers, are down, the entire service will not work; as long as one web server and one database server are up, the service will run, even though it will likely experience some degradation. This very simple description provides the basis for creating the Customer Account Access Service.

To begin modeling in SPECTRUM, consider each web server and database host to be resource models. Monitoring the contact status of these device models will determine if the systems are up. As mentioned in the Key Concepts section, each service is a resource monitor.
SPECTRUM offers a basic formula for service health which provides a general understanding of the availability of these four service resources. The following table presents a matrix containing each component and how its status (up/down) would affect the service relative to the status of the other resources.
Service Health Matrix Table

Web Server 1   Web Server 2   DB Server 1    DB Server 2    Service
ESTABLISHED    ESTABLISHED    ESTABLISHED    ESTABLISHED    UP
LOST           ESTABLISHED    ESTABLISHED    ESTABLISHED    SLIGHTLY DEGRADED
ESTABLISHED    LOST           ESTABLISHED    ESTABLISHED    SLIGHTLY DEGRADED
ESTABLISHED    ESTABLISHED    LOST           ESTABLISHED    SLIGHTLY DEGRADED
ESTABLISHED    ESTABLISHED    ESTABLISHED    LOST           SLIGHTLY DEGRADED
LOST           ESTABLISHED    LOST           ESTABLISHED    DEGRADED
LOST           ESTABLISHED    ESTABLISHED    LOST           DEGRADED
ESTABLISHED    LOST           LOST           ESTABLISHED    DEGRADED
ESTABLISHED    LOST           ESTABLISHED    LOST           DEGRADED
LOST           LOST           ESTABLISHED    ESTABLISHED    DOWN
LOST           LOST           LOST           LOST           DOWN
ESTABLISHED    ESTABLISHED    LOST           LOST           DOWN
ESTABLISHED    LOST           LOST           LOST           DOWN
LOST           ESTABLISHED    LOST           LOST           DOWN
This table indicates that if both web servers or both database servers are down, the service is down. If one web server is down and one database server is down, the service is degraded. If any one server is down, the service is slightly degraded. This is a very simplified approach, but it demonstrates a good starting point. From this, you can consider how to monitor each component. The table shows that particular combinations of status values result in specific levels of service degradation. Essentially, you can classify the resources into web server components and database components, and think of the grouped resources as services within a service. For the Customer Account Access Service to function, the web server components and database server components must be functioning. Within the Customer Account Access Service, smaller, more discrete subservices may exist.
Considering that each service is also a resource monitor, this example is a good case for creating resource monitors within the Customer Account Access Service, as shown below.
Resource monitors allow you to organize resources and monitor them based on specific criteria, with knowledge of how they will impact the service. The resource monitor becomes an abstraction of multiple resources, and reports a health value based on the collective status of the resources it monitors. That knowledge is the basis for a service resource monitoring policy. In this example, it was established that the contact status of each device model can be monitored to determine its availability. In addition, the table indicated how different combinations of contact status values impact the service. Looking first at the web servers, a policy can be produced which will adequately report the status of the web server components as a whole. These statements can be called the web servers redundancy policy.
WEB SERVERS REDUNDANCY POLICY
- When the contact status of all web servers is lost, the web server component of the service is down.
- When the contact status of any one web server is lost, the web server component of the service is degraded.

The web components and database components are described as services within a service. That concept is important when dealing with groups of resources that support a specific aspect of a service. In all cases, if both web server machines are down, the Customer Account Access Service is down. However, just one web server down does not necessarily indicate that the Customer Account Access Service is down, or even degraded. You can think of the web servers, collectively, as a component of the service in that when one of those servers is down, that component of the service is degraded. This might not be completely clear yet, but as the service model evolves, it will become apparent why this approach should be taken. The impact of loss of contact with the database servers mirrors that of the web servers. These statements can be referred to as the database servers redundancy policy.
DATABASE SERVERS REDUNDANCY POLICY
- When the contact status of all database servers is lost, the database component of the service is down.
- When the contact status of any one database server is lost, the database component of the service is degraded.

The web servers and database servers have been described collectively as a web server component and a database component. Consider how the web server and database server components impact the Customer Account Access Service.
If both web servers or both database servers are down, the service is down. These were organized into two groups: a web server component and a database component. These can be labeled as the Web Servers resource monitor and the Database Servers resource monitor. In review, each resource monitor determines its own health value, based on the resources that it is monitoring. The Web Servers resource monitor determines its health based on the contact status of both web server 1 and web server 2. The Database Servers resource monitor determines its health based on the contact status of database server 1 and database server 2. With the web server and database server systems encapsulated into resource monitors, the health of the overall service can be described by the following statements, which will be called the Standard Account Access Policy.
STANDARD ACCOUNT ACCESS POLICY
- When any resource monitor is down, the Customer Account Access Service is down.
- When all resource monitors are degraded, the Customer Account Access Service is degraded.
- When any one resource monitor is degraded, the Customer Account Access Service is slightly degraded.

Although redundancy exists within each resource monitor, if either resource monitor is down, the overall service is down. Looking back at the table of Contact Status and Service Health values, this design can be validated. You can use the following three scenarios to test the design.
DESIGN TEST SCENARIOS
- Web Server 1 is down. This will cause the web servers resource monitor to become degraded; the database servers are not affected, so the database servers resource monitor is up. To apply the rules defined in the account access policy: The first rule is not satisfied because neither of the resource monitors is down. The second rule is not satisfied because the database servers resource monitor is up, and not degraded. The third rule, however, is satisfied, because the web servers resource monitor is degraded. In this scenario, the Customer Account Access Service will report slightly degraded. Looking back at the matrix, when web server 1 is down and all other devices are up, the overall service health should be slightly degraded. The design works for this scenario.
- Web Server 1 is down and Database Server 1 is down. Based on the implementation described above, this would result in both the web servers resource monitor and the database servers resource monitor becoming degraded. When the account access policy is evaluated, the second rule is satisfied and the Customer Account Access Service will be degraded. According to the matrix, when web server 1 and database server 1 are both down, the overall service health should be degraded, so, again, this design works correctly.
- Database Server 1 and Database Server 2 are down. In this case, the database servers resource monitor would be down. When the account access policy is evaluated, the first rule is satisfied; thus, it produces a result of down. According to the matrix, when both database servers 1 and 2 are down, the overall health of the service should be down.
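The same three scenarios can be checked with a short Python sketch that composes the two redundancy policies with the Standard Account Access Policy. This is a paper model of the design, not SPECTRUM code: the function names are hypothetical, and contact status is represented as True for lost and False for established.

```python
def redundancy_policy(contact_lost):
    """Health of one resource monitor from its servers' contact status.

    contact_lost holds one boolean per server: True means contact is lost.
    """
    if all(contact_lost):
        return "down"
    if any(contact_lost):
        return "degraded"
    return "up"


def account_access_policy(monitor_healths):
    """Standard Account Access Policy applied to the resource monitors."""
    if any(health == "down" for health in monitor_healths):
        return "down"
    if all(health == "degraded" for health in monitor_healths):
        return "degraded"
    if any(health == "degraded" for health in monitor_healths):
        return "slightly degraded"
    return "up"


def customer_account_access_health(web_lost, db_lost):
    """Service health from the web servers and database servers monitors."""
    return account_access_policy(
        [redundancy_policy(web_lost), redundancy_policy(db_lost)]
    )


# The three design test scenarios described above.
scenario_1 = customer_account_access_health([True, False], [False, False])
scenario_2 = customer_account_access_health([True, False], [True, False])
scenario_3 = customer_account_access_health([False, False], [True, True])
```

Under this model, scenario_1 is slightly degraded, scenario_2 is degraded, and scenario_3 is down, which agrees with the matrix. Because the service monitors the two resource monitors rather than the four servers directly, a change to either redundancy policy does not require a change to the service-level policy.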
Although it is a very simple example, this process has identified the resources of a service and how to monitor them. Despite its simplicity, this implementation provides the knowledge to correctly report the health of the Customer Account Access Service for thirteen different fault scenarios involving the systems which host the web servers and database server applications. Obviously, this implementation is not yet very robust because it is only monitoring the four systems as up or down. Before extending the Customer Account Access Service, review the following steps to implement this design using SPECTRUM Service Manager.
IMPLEMENT EXAMPLE 1 IN SPECTRUM
You create Service Models in SPECTRUM using the Service Editor, which you launch from the Tools, Utilities menu of the OneClick Console.
To start building the service model:
1. In the OneClick Console, select Tools, Utilities, Service Editor.
2. Click Create.
3. Specify the service name Web Server Contact Monitor, and a description and security string.
4. Click the Locate resources and containers button (binoculars). In the left pane of the Locate Resources dialog, click Devices, Devices, By Model Name (or By IP Address). Launch the selected search (binoculars).
5. Specify search criteria (leading and trailing wildcards are implicit for model name) and click OK.
6. In the right pane, select all server models that you would like to associate with this service model.
7. Click Add Selected to Monitored Resources.
8. Click Close.
9. Click Select to display the Select Policy dialog. The resource monitor will use the Web Servers Redundancy Policy described previously in this chapter.
10. In the left pane, select Contact Status as the Value Map.
11. Click New in the Rule Set, and name the rule set Web Server Redundancy Rules.
12. Click Add to create the first rule: Rule Type All, When all are Down, the service is Down.
13. Click OK.
14. Click Add to create the second rule: Rule Type Any, When any 1 are Down, the service is Degraded.
15. Click OK.
16. Click Create in the Create Rule Set dialog.
17. Click OK.
18. Click Create in the Create Service dialog.
19. Repeat Steps 2 through 10 to start creating the Database Server Contact Monitor.
Note: This policy will be identical to the Web Server Redundancy Rules.
20. In the right pane, select Web Server Redundancy Rules and click Copy.
21. Define the new rule set's name as Database Server Redundancy Rules.
22. Click Create.
23. Click OK to close the Select Policy dialog.
24. Click Create in the Create Service dialog.
25. Click Create to start creating the top-level service (Customer Account Access Service) of the hierarchical structure.
26. Specify the service name Customer Account Access Service, and, optionally, a description and security string.

27. Click the binoculars, and then click Locator, Services, Services, All. Launch the selected search (binoculars).
28. Select the Landscape, if it appears.
29. Select the Web Server Contact Monitor and Database Server Contact Monitor Services.
30. Click Add Selected to Monitored Resources.
31. Click Close.
32. Click Select to display the Select Policy dialog.
33. In the left pane, select Service Health as the Value Map.
34. Click New in the Rule Set, and name the rule set Standard Account Access Policy, which is based on the following set of rules:
    Rule Type Any: When any 1 are Down, the service is Down.
    Rule Type All: When all are Degraded, the service is Degraded.
    Rule Type Any: When any 1 are Degraded, the service is Slightly Degraded.
35. Click Create in the Rule Set dialog.
36. Click OK.
37. Click Create in the Create Service dialog.
38. Close the window.
REVIEW EXAMPLE 1
The design for Example 1 includes one Service monitoring two Resource Monitors. This is a two-tiered approach in which each Resource Monitor consolidates the status of its own resources, and then reports the result as its service health. The Customer Account Access Service then determines its own service health, based on the collective service health of the two resource monitors. This pattern encompasses an important abstraction that is essential to understanding service management. Each service and resource monitor performs two tasks:
Monitors those resources to which it is related.
Determines its own service health by applying values from those resources to a policy.
Consider this question regarding the implementation of Example 1: Does the Customer Account Access Service have any knowledge of database server 2? The three test scenarios do not mention the Customer Account Access Service monitoring database server 2; however, in scenario 3, when both database servers 1 and 2 are down, the Customer Account Access Service correctly determined that its service health should be down.
How did it work? The Database Servers Resource Monitor determined that its own health was down. The Customer Account Access Service, which monitors the Web Servers Resource Monitor and the Database Servers Resource Monitor, determined that it, too, should be down: when evaluating its Account Access Policy, it found that one of the resource monitors was down and, therefore, its own health should be down. Database servers 1 and 2 are resources of the Database Servers Resource Monitor, and the Database Servers Resource Monitor is a resource of the Customer Account Access Service. Each component determines its own health based on its resources.
Example 2: Extend the Service to Monitor Critical Processes

Example 1 describes how to design and implement a very simple service using two resource monitors. Although this is a legitimate service, it is not a very complete one. In revisiting the Customer Account Access Service, you could expand the monitoring of service components in several ways. So far, only the Contact Status of those devices hosting the web servers and database servers has been incorporated into the service. Device availability alone does not ensure that you will be able to obtain customer account information.
You also need to consider that a web server is an application that supports web transactions. This application must be running in order for customer account access requests to be processed. Considering the criticality of these web server systems, it is logical that they will also host an agent that supports process monitoring, or a host information MIB such as the one defined by RFC 2790. This allows a user to monitor the web server process itself. You can use a process model to determine whether a particular process is actually running on a device. Because the web server system might be up while the web server application is not running, additional monitoring of the web server application processes is important to correctly determine the overall health of the Customer Account Access Service.
At first, it might appear simple enough to just add another resource monitor to watch the Condition of the web server process model and treat the availability of each process redundantly, in the same way as the device availability is monitored. Consider the following table, which shows a breakdown of potential fault scenarios and how each combination affects the availability of the web servers in terms of being able to process a request. This table demonstrates what is often called a high-sensitivity policy.

Service Health Matrix: Servers and Processes

Web Server 1   Web Server 2   Process 1   Process 2   Web Service Health
ESTABLISHED    ESTABLISHED    NORMAL      NORMAL      UP
ESTABLISHED    ESTABLISHED    CRITICAL    NORMAL      DEGRADED
ESTABLISHED    ESTABLISHED    NORMAL      CRITICAL    DEGRADED
ESTABLISHED    ESTABLISHED    CRITICAL    CRITICAL    DOWN
LOST           ESTABLISHED    CRITICAL    NORMAL      DEGRADED
LOST           ESTABLISHED    CRITICAL    CRITICAL    DOWN
LOST           LOST           CRITICAL    CRITICAL    DOWN
The following table replaces the individual devices and processes with the resource monitors that could be used to monitor them.

Service Health Matrix: Devices and Processes

SERVER DEVICES   SERVER PROCESSES   WEB SERVICE
UP               UP                 UP
UP               DEGRADED           DEGRADED
UP               DOWN               DOWN
DEGRADED         DEGRADED           DEGRADED
DEGRADED         DOWN               DOWN
DOWN             DOWN               DOWN
DEGRADED         UP                 DEGRADED
DOWN             UP                 DOWN
DOWN             DEGRADED           DOWN
Note: The three rows at the end of this table typically would not happen since a system that is reported as down should not report that it has running processes. However, the rules should handle these situations to avoid the possibility of getting into unknown states.
Looking at the devices and processes collectively, the table indicates that this is not a case for a redundancy policy, which appeared to be the first choice when evaluating the resources. As found in the case of the overall service, the relationship between web service hosts and the web server processes implies a high-sensitivity rule set similar to this one:
When any resource is down, the service is down.
When any resource is degraded, the service is degraded.
After evaluating this relationship between the server devices and processes, it would seem that you cannot easily extend the design in Example 1 to include supplemental monitoring such as the new process resource models. Because the initial design tried to encompass a high-level service with multiple components, it did not recognize that there are subservices within the Customer Account Access Service. After extending the monitoring to the process level, it becomes apparent that a web subservice and a database subservice do exist. Much like the web servers, you can monitor the database service host application using process models. A hierarchy is beginning to appear as the resource monitoring is extended to the process level.

Service Hierarchy

CUSTOMER ACCOUNT ACCESS SERVICE
    WEB SERVICE
        DEVICES
        PROCESSES
    DATABASE SERVICE
        DEVICES
        PROCESSES
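The high-sensitivity rule set stated above amounts to reporting the worst status found among the monitored resources. The following Python sketch is illustrative only and is not SPECTRUM code; the severity ordering and the function name are assumptions made for this example. It checks that rule against every row of the Devices and Processes matrix.

```python
# Illustrative check of the Devices and Processes matrix; not SPECTRUM code.
# The numeric severity ordering is an assumption used only for comparison.
SEVERITY = {"UP": 0, "DEGRADED": 1, "DOWN": 2}


def high_sensitivity(*resource_states):
    """Any resource down -> down; any resource degraded -> degraded."""
    return max(resource_states, key=SEVERITY.__getitem__)


# (server devices, server processes, expected web service health)
MATRIX = [
    ("UP", "UP", "UP"),
    ("UP", "DEGRADED", "DEGRADED"),
    ("UP", "DOWN", "DOWN"),
    ("DEGRADED", "DEGRADED", "DEGRADED"),
    ("DEGRADED", "DOWN", "DOWN"),
    ("DOWN", "DOWN", "DOWN"),
    ("DEGRADED", "UP", "DEGRADED"),
    ("DOWN", "UP", "DOWN"),
    ("DOWN", "DEGRADED", "DOWN"),
]

if __name__ == "__main__":
    for devices, processes, expected in MATRIX:
        result = high_sensitivity(devices, processes)
        print(devices, processes, "->", result, "(expected", expected + ")")
```

Because the rule is defined for any combination of inputs, the three unlikely rows at the end of the matrix are handled without special cases.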
It is very typical to discover lower level services which at first did not appear to be significant enough to warrant a service model. In general, the service modeling process is an iterative process. Each revision adds additional precision and extends the total number of fault scenarios that can be correctly reported. This iterative approach can be summarized in different ways. One way is to consider that the goal of each revision is to enrich the root cause information which will be available in the event of a service fault. Looking back to Example 1, if both web server devices were available, but one web server process was down, the service would not have reported a fault although service users would have experienced some performance degradation. By extending the monitoring to the process level, the service would now report the degradation and the process failure as the root cause. The next section shows how Example 2 can be implemented in SPECTRUM.
Implement Example 2 in SPECTRUM

The design for Example 2 includes the creation of four process models. Two of these process models will monitor the web server application and the other two will monitor the database server application. It is likely that a user may identify additional processes which impact the availability of a particular service component. This approach can be extended to include those processes as well.
To create the process models, you should locate the host model representing the server machine on which the process is running. In the example, this would be a web server or database server device model. If the agent on the device supports RFC 2790, you can create process models for each process that you want to monitor.
To create a process model for each process:
1. In the SPECTRUM OneClick console, list the host or hosts in the OneClick Contents panel.
2. Select the host for which you want to create a monitoring rule.
3. Expand the System Resources section within the OneClick Component Detail view. A subsection named Running and Monitored Processes appears.
4. Expand the Running and Monitored Processes view to show a section for Running Processes, which, in turn, reveals a table of processes.
   Note: If the text (RFC 2790) does not appear in the section names, the agent does not support the RFC 2790 extension to MIB-II. You will not be able to monitor processes on that host and raise alarms when processes start or stop.
   a. Right-click a process in the table and select Monitor this process. The Add Monitored Process dialog appears.
   b. Select Alarm on Stop and click OK. Using this setting, the process model will experience a critical alarm if the corresponding process is stopped. The process appears in the Monitored Processes view.
5. After creating the appropriate process models, launch the Service Editor by selecting Tools, Utilities, Service Editor, or by right-clicking a process and selecting Utilities, Service Editor.
The goal is to modify the Service that was created in Example 1 to handle this more complex situation. The steps are similar to those outlined for Example 1, but the hierarchy is deeper: it adds a new middle layer as well as logic for the processes.
Using the Condition value map and Redundancy rule set policy, create the service Web Servers Redundancy Monitor, which watches the web server process models.
Using the Service Health High Sensitivity policy, create the Web Service, which watches the Web Server Contact Monitor and the Web Servers Redundancy Monitor. This requires reparenting the Web Server Contact Monitor from the Customer Account Access Service to this service.
Duplicate these tasks for the Database Server Redundancy Monitor and the Database Service.
The Customer Account Access Service will now monitor the Web Service and Database Service with the Standard Account Access Policy described in the implementation of Example 1.
REVIEW EXAMPLE 2
Example 2 expanded the monitoring of the Customer Account Access Service to include monitoring the actual web server process. This example also reveals that two distinct subservices exist within the Customer Account Access Service. Each of these subservices consists of multiple resources which are monitored in different ways, as shown in the Service Editor Hierarchy view below.
The table below displays the ever-increasing set of fault scenarios which can be supported by the existing service modeling.
Service Health Matrix: Fault Scenarios

Legend:
WSD   Web server device
WSP   Web server process
DBD   Database device
DBP   Database process
CAAS  Customer Account Access Service
DG    Degraded service health
SD    Slightly degraded service health
DN    Down service health

Each column is one scenario.

WSD1  UP UP DN UP UP UP UP UP UP DN DN UP UP UP UP UP UP DN DN
WSD2  UP DN UP UP UP UP UP UP UP UP UP DN DN UP UP UP UP DN DN
WSP1  UP UP DN DN UP UP UP UP UP DN DN UP UP DN DN UP UP DN DN
WSP2  UP DN UP UP DN UP UP UP UP UP UP DN DN UP UP DN DN DN DN
DBD1  UP UP UP UP UP DN UP UP UP DN UP DN UP UP UP UP UP UP DN
DBD2  UP UP UP UP UP UP DN UP UP UP DN UP DN UP UP UP UP UP UP
DBP1  UP UP UP UP UP DN UP DN UP DN UP DN UP DN UP DN UP UP DN
DBP2  UP UP UP UP UP UP DN UP DN UP DN UP DN UP DN UP DN UP UP
CAAS  UP SD SD SD SD SD SD SD SD DG DG DG DG DG DG DG DG DN DN

WSD1  DN UP UP UP UP UP DN
WSD2  DN UP UP UP UP UP DN
WSP1  DN DN DN DN DN DN DN
WSP2  DN DN DN DN DN DN DN
DBD1  UP UP DN UP UP UP DN
DBD2  DN UP UP DN UP UP DN
DBP1  UP UP DN UP DN DN DN
DBP2  DN UP UP DN UP DN DN
CAAS  DN DN DN DN DN DN DN
The table indicates 25 different fault scenarios that can be reported with the implementation of Example 2. Note the scenario shown in bold, in which all four critical processes have failed while every device is still up. In this situation, the service is down, but it would not have been reported as down by the implementation of Example 1, which monitors only the devices.
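The difference between the two implementations in that scenario can be reproduced with a small model. The following Python sketch is illustrative only and is not SPECTRUM code; the function names are assumptions, and the policies are the redundancy, high-sensitivity, and account access rule sets as described in this chapter. It evaluates the same fault inputs against the two-tier design of Example 1 and the three-tier design of Example 2.

```python
# Illustrative comparison of Example 1 and Example 2; not SPECTRUM code.
# UP/SD/DG/DN follow the abbreviations in the fault scenario legend.
UP, SD, DG, DN = "UP", "SD", "DG", "DN"


def redundancy(states):
    """All down -> down; any one down -> degraded."""
    if all(s == DN for s in states):
        return DN
    return DG if DN in states else UP


def high_sensitivity(states):
    """Any down -> down; any degraded -> degraded."""
    if DN in states:
        return DN
    return DG if DG in states else UP


def account_access(states):
    """Any down -> down; all degraded -> degraded; any degraded -> slightly degraded."""
    if DN in states:
        return DN
    if all(s == DG for s in states):
        return DG
    return SD if DG in states else UP


def example_1(wsd, dbd):
    """Example 1: device contact status only."""
    return account_access([redundancy(wsd), redundancy(dbd)])


def example_2(wsd, wsp, dbd, dbp):
    """Example 2: devices and processes roll up into two subservices."""
    web = high_sensitivity([redundancy(wsd), redundancy(wsp)])
    database = high_sensitivity([redundancy(dbd), redundancy(dbp)])
    return account_access([web, database])


if __name__ == "__main__":
    devices_up = [UP, UP]
    processes_down = [DN, DN]
    print("Example 1:", example_1(devices_up, devices_up))
    print("Example 2:", example_2(devices_up, processes_down,
                                  devices_up, processes_down))
```

The only structural change between the two functions is the added middle layer, which is why the top-level policy can be reused unchanged.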
Example 3: Extend the Service to Include a Response Time Element

Example 2 enhanced the Customer Account Access service by extending visibility to the process level. In some situations, the devices may be up and the processes are running, but the service is not performing optimally. It is often useful to include some level of performance monitoring as a resource of service components. This is particularly important when the service health is intended to reflect what an end user is experiencing when using a service.
In this example, you add a response time element to the Web Service component of the Customer Account Access Service. Adding the performance element will not only enhance the service monitoring, it will also test the modularity of the design produced in Example 2. One goal of service design should be to produce services that you can easily enhance as you gain more insight into how each service resource can be monitored.
Adding the response time component involves creating Response Time Test models in SPECTRUM. Many devices and system agents are capable of supporting response time tests. Since this example is intended to enhance the monitoring of the Web Service component, you will be creating HTTP response time tests. The number of tests can vary based on your design. It is generally a good idea to build at least one HTTP request to each web server. For example, you could select two SPM test hosts and create two HTTP tests on each. The test host should issue requests to each web server. This would provide multiple request points to each individual server. The four new response time tests will collectively comprise a new set of resources within the Web Service.
You can take two typical approaches to monitoring response time tests:
Monitor the latest error status of each response time test model.
Monitor the aggregate result values of each test model.
The second approach is discussed in more detail later. For this example, you will monitor the latest error status of each test model. The following table maps Latest Error Status (Response Time) values to equivalent service health values. This process is used extensively by the SPECTRUM Service Manager. The goal is to normalize pure attribute values to comparable service health values which can easily be applied to various rule sets.

Service Health Matrix: Response Tests

Response Time Value    Equivalent Service Health
OK                     UP
TIMEOUT                CRITICAL
THRESHOLD CRITICAL     CRITICAL
THRESHOLD MAJOR        DEGRADED
THRESHOLD MINOR        SLIGHTLY DEGRADED
Under some circumstances, documentation might indicate acceptable response time levels. If this is not the case, a useful approach is to create response time tests without thresholds and review the latency results over a period of time. This will help you to establish baseline threshold values to ensure that an unusual latency value would result in a threshold violation.
In Example 2, the Web Service was developed to include a resource monitor for the contact status of the web server devices and a resource monitor for the condition of the web server process models. It may be possible to extend the monitoring of the Web Service to include the response time component by simply adding a third resource monitor which monitors the response time test models. When monitoring these response time test models, the following rule set might be appropriate:
When all resources are down, the service is down.
When any one resource is down, the service is degraded.
When all resources are degraded, the service is degraded.
Consider how this rule set would apply to a set of response time tests as described above. If all response time tests experienced a timeout or critical threshold violation, it would indicate that neither web server was capable of responding. Clearly, this is a critical scenario and should indicate a down service health. If any one response time test timed out or violated a critical threshold, it would indicate that one of the web servers was impacted to such a degree that it could not adequately handle requests. Considering that some of the other response time tests are succeeding, it can be surmised that the service is not entirely down, but it is degraded. If none of the tests were timing out or violating a critical threshold, but all were violating a major threshold, you could assume that the service health is degraded.
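The normalization step and the rule set can be combined in a short model. The following Python sketch is illustrative only and is not SPECTRUM code; the function and dictionary names are assumptions. It also assumes that the CRITICAL health shown in the table for TIMEOUT and THRESHOLD CRITICAL is what the rule set calls down, which is how the surrounding text treats those results.

```python
# Illustrative sketch; not SPECTRUM code. Status names follow the table above.
# Assumption: the table's CRITICAL health is treated as "down" by the rule set.
HEALTH_BY_STATUS = {
    "OK": "up",
    "TIMEOUT": "down",
    "THRESHOLD CRITICAL": "down",
    "THRESHOLD MAJOR": "degraded",
    "THRESHOLD MINOR": "slightly degraded",
}


def response_monitor_health(latest_error_statuses):
    """Normalize each test status, then apply the rule set in order."""
    if not latest_error_statuses:
        raise ValueError("at least one response time test is required")
    healths = [HEALTH_BY_STATUS[status] for status in latest_error_statuses]
    if all(h == "down" for h in healths):
        return "down"
    if any(h == "down" for h in healths):
        return "degraded"
    if all(h == "degraded" for h in healths):
        return "degraded"
    return "up"


if __name__ == "__main__":
    print(response_monitor_health(["TIMEOUT"] * 4))
    print(response_monitor_health(["TIMEOUT", "OK", "OK", "OK"]))
    print(response_monitor_health(["THRESHOLD MAJOR"] * 4))
    print(response_monitor_health(["OK"] * 4))
```

Keeping the status-to-health mapping separate from the rules means the same rule set can be reused for any attribute once its values have been normalized.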
Based on the configuration established above, you could enhance the web service by adding a new resource monitor for the response time tests. The web service component would function correctly under any of the scenarios described above. By monitoring its resources with the Service Health High Sensitivity policy, any resource monitor that is down would cause the Web Service to go down. Likewise, any resource that is degraded would cause the Web Service to also degrade. It turns out that the design produced in Example 2 can easily be extended to include the response time element. The following is an example of what the service hierarchy would look like after the addition of the response time component to the Web Service.
WEB SERVICE
    DEVICES
    RESPONSE TIME
    PROCESSES
DATABASE SERVICE
    DEVICES
    PROCESSES
IMPLEMENT EXAMPLE 3 IN SPECTRUM
For this example, you will create four HTTP response time tests. You can locate response time test hosts using the Locator tab in the OneClick Console. The Locator menu has a set of pre-configured SPM Searches.
Note: To run an HTTP test, you must discover test sources such as SystemEDGE Service Availability agents, Cisco IP SLA-enabled routers, and Network Harmoni agents using read-write community strings. For details about response testing and supported agents, refer to the Service Performance Manager User Guide.
To create response time tests:
1. Use the All Test Hosts search to locate test host models that can measure HTTP response time to the web servers. (In the Contents panel, expand SPM Searches and Test Hosts By; then right-click All Test Hosts and select Launch the selected search.)
2. From each designated test host, create new HTTP tests by right-clicking the host in the table, choosing New Test, and then selecting HTTP.
3. Specify the threshold data. Configure the thresholds to ensure that a critical threshold is generated when the response time is too slow to be usable, and a major threshold is generated when response time is usable but very slow.
4. Add the destination for the test, which would be one of the web server hosts.
To add the response time tests, use the Service Editor to add a new resource monitor, Web Server Response Monitor, which uses the Response Time High Sensitivity policy. The resources can be located by expanding SPM Searches. You can then add the four response time tests to the new resource monitor. Finally, attach this new resource to the Web Service.
REVIEW EXAMPLE 3
Example 3 describes how you can extend an existing service implementation to include more sophisticated resource monitoring without altering the service hierarchy. This flexibility in the service hierarchy design makes it very easy for users to continually enhance their service models. In addition, Example 3 outlines how to incorporate a response time component within a service to greatly enhance the accuracy of service health reporting. Again, this iteration expanded the set of fault scenarios supported and enriched the set of potential root causes of service impact.
Create SLAs
This section introduces SLA modeling concepts and techniques that you should understand before modeling SLAs.
Note: The following sections up to and including Example 4 provide instructions for tracking SLAs based on business hours. This functionality is included in SPECTRUM Release 8.1. For a complete description of the available capabilities, see the Service Performance Manager User Guide.
Key Concepts
SLA Periods
SLAs consist of a set of service level objectives or guarantees that are measured for a given period of time. Commonly, this period of time coincides with a well-defined billing cycle or a reporting cycle. Frequently, an SLA period will be monthly (that is, the compliance of an SLA is evaluated on a month-to-month basis). Typically, the compliance or violation of an SLA will be expressed in terms of a particular period. For example, you might consider an SLA compliant for the month of January. If the period was weekly, you might consider an SLA violated for the week of November 5-11.

SLA Guarantees or Service Level Objectives
Among other stipulations, an SLA will include a set of guarantees or service level objectives. In particular, many of these guarantees relate to the availability and performance of a particular service or set of services. In typical service provider environments, SLAs often state very specific guarantees. Users may find stipulations similar to the following: "certifies uptime at 99.9% monthly" or "will credit the customer 1/30th of the monthly service fee in the event that the customer reports a service outage of 30 minutes or more". These statements represent guarantees given by the provider of a particular service. Within the enterprise environment, SLAs also exist, although an enterprise SLA may be less formal. It is very common to find SLAs such as "the IT department guarantees no more than 30 minutes of web access down time per week". In either case, it is these guarantees or service level objectives which provide the basis for determining SLA compliance with SPECTRUM.

Active SLA Monitoring
Unlike other SLA management products, SPECTRUM Service Manager provides active SLA management. This means that within a given period, you are able to determine the status of the SLA for that period. Based on outage trends, you are provided a projected status for the overall period.
At the beginning of each SLA period, an SLA is considered unaffected. The unaffected status will persist until some form of outage causes the SLA to record outage time for the period. An SLA which has recorded outage time, but is not at a significant risk for a violation, is considered to have a compliant status. If additional outage time occurs within the period and the outage time accumulates to levels where the SLA is approaching violation, the SLA will transition to a status of warned. If outage time for the period continues and specific guarantee thresholds are reached, the SLA will transition to a status of violated. This transition from an unaffected to a violated state happens in real time. If the SLA period is monthly, and the SLA violates on the fifth day, you will be aware as soon as the SLA is violated as opposed to waiting for a report at the end of the period to indicate a violation. Consequently, this active SLA monitoring allows service providers to take action before the SLA becomes violated.

SLA Time of Enforcement or Business Hours
Within the SLA period, a particular guarantee is frequently enforced only during specific hours. The SLA may contain statements such as "guarantees no outage exceeding 30 minutes between the hours of 8AM and 5PM on Monday through Friday". A statement such as this is commonly known as a business-hour guarantee. Sometimes, multiple guarantees will be based on particular timeframes. For example, "guarantees 97.5% availability on a 7x24 basis, with 99.9% availability between the hours of 8AM to 5PM Monday-Friday, and 8AM to 12 PM on Saturday". Although the same service is being measured, this statement actually includes two guarantees: one guarantee with a 7x24 timeframe, and a second guarantee for specific hours during the week.
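The status progression described under Active SLA Monitoring can be illustrated with a simple availability guarantee. The following Python sketch is illustrative only and is not how SPECTRUM computes SLA status: the function names are assumptions, and the warn_fraction parameter is a hypothetical stand-in for the point at which an SLA is "approaching violation", which this document does not quantify. The sketch converts an availability percentage into an outage allowance for the period and classifies the accumulated outage time against it.

```python
# Illustrative sketch; not SPECTRUM's algorithm. warn_fraction is hypothetical.

def allowed_outage_minutes(period_minutes, availability_percent):
    """Outage time permitted by an availability guarantee for one period."""
    return period_minutes * (100 - availability_percent) / 100


def sla_status(outage_minutes, allowed_minutes, warn_fraction=0.8):
    """Classify an SLA period by the outage time recorded so far."""
    if outage_minutes <= 0:
        return "unaffected"
    # Following the text, the SLA is violated once the threshold is reached.
    if outage_minutes >= allowed_minutes:
        return "violated"
    if outage_minutes >= warn_fraction * allowed_minutes:
        return "warned"
    return "compliant"


if __name__ == "__main__":
    # A 30-day month with a 99.9% uptime guarantee.
    month_minutes = 30 * 24 * 60
    allowance = allowed_outage_minutes(month_minutes, 99.9)
    print("Allowance (minutes):", round(allowance, 1))
    for recorded in (0, 10, 40, 45):
        print(recorded, "minutes ->", sla_status(recorded, allowance))
```

Because the classification can be run at any moment in the period, a warned status is visible as soon as the outage time crosses the chosen fraction, not at the end of the month.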
Create SLAs and Guarantees

The first step in creating an SLA is to understand the particular service with which it is associated and the period during which the SLA will be in effect. The service modeling hierarchy often has some top-level service model which is logically associated with an SLA. For example, in the service provider environment, a high-level service such as Customer A High Speed Data may exist. Logically, the SLA is a binding of the high-speed data service which is being provided to Customer A.
The particular period may be stipulated in an SLA document or may be determined arbitrarily, but it must be a timeframe which is agreed upon by both the service provider and service customer. Monthly SLA periods are very common as they frequently coincide with a service billing cycle. For example, an SLA period may be in effect from the first of the month with guarantees based on availability and performance for that month. Commonly, an SLA will specify restitution guarantees if a customer contacts the service provider regarding a dispute within a certain number of days from the end of a given period.
Once the top-level service and SLA period have been determined, you should identify the SLA guarantees or service level objectives related to the availability and performance of the service being provided. Often, you can find these guarantees within the SLA document among other stipulations that are not within the scope of measuring the availability or performance of a service. You should look for those statements which specify a level of availability, a guaranteed response time, an acceptable level of latency, and so on. In addition, you should determine whether those statements are accompanied by statements that dictate the specific times within the SLA period when they are guaranteed. Having identified the guarantees within an SLA, you should categorize them into availability and response time guarantees.
Availability guarantees within the SLA are frequently specified as a percentage of availability. However, availability guarantees may be described in terms of downtime, for example, "no more than 1 hour of outage time". Response time guarantees can be identified by specific statements such as "2000 ms or better response time" or "latency not exceeding 5000 ms for more than 30 minutes".
Availability can be described in a couple of different ways. Previous sections of this document discussed service health. Typically, you could describe availability as a service being available when it is not down or a service being unavailable when it is down. However, an availability guarantee might also be described as a service being unavailable when it is not responsive. This second description can be very important when building guarantee models.
Response time guarantees measure services which utilize response time components as their resources. An interesting point regarding response time guarantees is that components within a service hierarchy may monitor response time specifically for the purpose of providing a way to support an SLA's response-time guarantee. This is actually a very common scenario. Frequently, a service hierarchy is built on the foundation of resources that actually comprise the physical devices and applications providing a user-consumable service. Response-time tests, despite providing an excellent way to report service health, are often not identified as service resources until an SLA is applied that stipulates response time guarantees.
As mentioned in previous sections, you can use response-time monitors to identify high latency or service degradation. The response time tests should report a major threshold violation when latency exceeds an acceptable level. In addition, you can use response-time monitors to report a critical condition when latency reaches an unusable level or response-time requests time out. When response time monitors are built with this in mind, they can support both monitoring latency and monitoring availability. In the case in which a service is considered unavailable when it is not responsive, a service designed to report response time can also be used to measure availability, whereas a service designed only to report availability can never support a response time guarantee.
Example 4: An SLA for the Customer Account Access Service

This section contains an example based on the Customer Account Access Service from the previous section. It includes an SLA and several guarantees. In the Creating Service Models section of this chapter, you implemented the Customer Account Access Service. For this example, the Customer Account Access service will represent the service being provided by a fictional company called Northeast Data Solutions (hereafter referred to as Northeast).
Northeast maintains customer account information for a large number of small businesses. Each small business is responsible for creating and maintaining its own customer data. Northeast takes responsibility for supporting and securing the customer account data. In addition to supporting the databases and web access, Northeast also negotiates with various Internet Service Providers (ISPs) to provide a local routing device for the remote customer site to ensure that customers will have reliable internet access to their customer account information. The relationship with the ISP is transparent to Northeast customers. They pay Northeast directly for service.
The following items are segments of an SLA provided to each Northeast Data Solutions customer:
Northeast Data Solutions provides access to customer account data guaranteeing that account access for each customer location will be available 99% of each month excluding those periods of scheduled system maintenance to be conducted between the hours of 12AM to 3AM on each Sunday.
Service availability to be restored such that the average outage resolution time is 30 minutes or less, with no individual outage exceeding 1 hour; outages are guaranteed to not exceed a rate of two or more outages within any 24-hour period.
A standard business hours timeframe to be defined as the hours of 7AM to 6PM Eastern Standard Time on the days of Monday through Friday of each week.
Within standard business hours, account access to be guaranteed available at 99.5% with no individual outage exceeding 20 minutes.
Average transaction time for initial account access is not to exceed five seconds for more than 5% of the standard business hours timeframe, with successful transaction completion to be guaranteed at 99% for standard business hours.
With a transaction deemed successful if completed within 15 seconds, no period of transaction failure shall persist for more than 20 minutes.
Transaction monitoring average based on a sampling of five queries to be delivered randomly within a five-minute interval during standard business hours, each query originating from the customer access point device.
Northeast assumes responsibility for an access device assuming the device is operational with the exception of power failure or an act of nature deemed beyond the control of Northeast.
In this example, the SLA text is fictional, but it includes a variety of guarantee metrics and terminology representative of the statements found within an actual SLA. Despite its confusing terminology, this SLA actually includes some very precise guarantee information, including how response time will actually be measured. This SLA would be provided for each Northeast customer, but this example will focus on the SLA between Northeast and a customer called A to Z Performance Components, which has offices in Atlanta and Savannah, Georgia.
As mentioned above, the first step to designing an SLA implementation is to determine which service supports the SLA and identify the period.
The hierarchy below represents the Customer Account Access Service.
[Figure: Customer Account Access Service hierarchy, showing a Web Service (Devices, Response Time, Processes) and a Database Service (Devices, Processes)]
Many components are required to monitor A to Z's service availability. In addition to providing web access and database access, Northeast must now build service components that monitor availability and response time specific to A to Z's Atlanta and Savannah offices. These new service components will monitor the access routers at each site and the newly created response time tests that are hosted on those access routers. The following figure shows how you might extend the hierarchy to support A to Z.
A to Z Account Access
    A to Z Site Access
        Atlanta Routing
        Savannah Routing
    A to Z Site Response Time
        Atlanta Response Time
        Savannah Response Time
    Customer Account Access
Evaluation of the SLA implies that a variety of guarantees exist and that some new service models will be required. The chart above represents one possible configuration to use. You should review each SLA implementation carefully to determine the best way to organize services. Among the new services is a hierarchy called A to Z Site Access. A to Z Site Access has two subservices called Atlanta Routing and Savannah Routing. These services are designed to monitor the on-site router which provides access to the Customer Account Access Service. You can break down each one of these subservices into a set of resource monitors, producing a hierarchy similar to the figure below.
Routing
    Router
    Interfaces
One resource monitor watches the contact status of the router device model, while the other resource monitor watches the port status of the router interfaces that are critical for providing access for the office. The routing service is considered down if the router is down or if all required interfaces are disabled. A similar service would be implemented for both Atlanta and Savannah.

In reference to the SLA, the following statements are related to the routing components of each site:

Northeast provides access to customer account data guaranteeing that account access for each customer location will be available 99% of each month excluding those periods of scheduled system maintenance to be conducted between the hours of 12AM to 3AM on each Sunday.

Service availability to be restored so that the average outage resolution time is 30 minutes or less, without an individual outage exceeding one hour; outages are guaranteed to not exceed a rate of two or more outages within any 24-hour period.

Guarantees apply on a per-site basis. Consider a guarantee of 99% availability for the month of November. November has 30 days, or 43,200 minutes, so 432 minutes of downtime are allowed. When building the SLA, carefully consider the service or services to which this guarantee should be applied. If this guarantee were applied to the A to Z Site Access service, and Atlanta experienced 300 minutes of downtime and Savannah experienced 200 minutes of downtime (for a total of 500 minutes of downtime), the SLA would be violated. However, the wording in the SLA states "...each customer location...", so a guarantee should be applied at each site. By applying the guarantees in this manner, the SLA would not be violated, as neither site experienced more than 432 minutes of downtime.

With regard to the availability of the Routing service, two separate guarantees of 99% apply:

    Atlanta availability 99%
    Savannah availability 99%

The SLA also states "...average outage resolution time is 30 minutes or less..." In addition to the 99% availability guarantee, a supplemental guarantee that specifies an average outage time of 30 minutes or less is therefore necessary. This component can be added to the availability guarantees as an MTTR supplement. The availability guarantees should now include the MTTR component:

    Atlanta availability 99%, MTTR 30 minutes
    Savannah availability 99%, MTTR 30 minutes

In addition to the MTTR component, the SLA states "...outages are guaranteed to not exceed a rate of 2 or more outages within any 24-hour period..." This statement is referred to as a Mean-Time-Between-Failures (MTBF) clause. The MTBF clause states that more than one outage per day cannot occur. The availability guarantees should now include the MTBF component:

    Atlanta availability 99%, MTTR 30 minutes, MTBF 24 hours
    Savannah availability 99%, MTTR 30 minutes, MTBF 24 hours

A similar guarantee should be applied to the Customer Account Access Service; however, this guarantee will be independent of either customer site:

    Atlanta availability 99%, MTTR 30 minutes, MTBF 24 hours
    Savannah availability 99%, MTTR 30 minutes, MTBF 24 hours
    Customer Account Access availability 99%, MTTR 30 minutes, MTBF 24 hours

In addition to the 99% overall availability guarantee, consider these additional availability specifications:

A standard business hours timeframe is to be defined as the hours of 7AM to 6PM Eastern Standard Time Monday through Friday of each week. Within standard business hours, account access will be guaranteed available at 99.5% with no individual outage exceeding 20 minutes.

Business-hour guarantees can be created by applying a schedule when the guarantee is created.
A weekly schedule for the days Monday through Friday from 7AM to 6PM will be applied to new guarantees ensuring 99.5% availability throughout the scheduled period. The new guarantees should be applied to each customer Routing Service and the Customer Account Access Service:

    Atlanta availability 99%, MTTR 30 minutes, MTBF 24 hours
    Savannah availability 99%, MTTR 30 minutes, MTBF 24 hours
    Customer Account Access availability 99%, MTTR 30 minutes, MTBF 24 hours
    Atlanta availability 99.5% M-F 7AM-6PM
    Savannah availability 99.5% M-F 7AM-6PM
    Customer Account Access availability 99.5% M-F 7AM-6PM
An additional stipulation to the business-hours guarantee must be accounted for: "...no outage can exceed 20 minutes..." This stipulation is referred to as a Maximum Outage Time (MOT) clause. The business-hours guarantees should also include the MOT component:

    Atlanta availability 99%, MTTR 30 minutes, MTBF 24 hours
    Savannah availability 99%, MTTR 30 minutes, MTBF 24 hours
    Customer Account Access availability 99%, MTTR 30 minutes, MTBF 24 hours
    Atlanta availability 99.5% M-F 7AM-6PM, MOT 20 minutes
    Savannah availability 99.5% M-F 7AM-6PM, MOT 20 minutes
    Customer Account Access availability 99.5% M-F 7AM-6PM, MOT 20 minutes

To this point, six different availability guarantees have been identified, but none of these guarantees accounts for the response time element within the SLA:

Average transaction time for initial account access is not to exceed five seconds for more than 5% of the standard business hours timeframe, with successful transaction completion to be guaranteed at 99% for standard business hours. A transaction will be deemed successful if completed within 15 seconds, and no period of transaction failure shall persist for more than 20 minutes.

Transaction monitoring average based on a sampling of five queries to be delivered randomly within a 5-minute interval during standard business hours, each query originating from the customer access point device.

The response time stipulations are very thorough and dictate how response time will be measured. To support this component of the SLA, you need to create two additional services using the response time tests as monitored resources. Create an Atlanta Response Time Service to monitor five new response time test models, which will be hosted on the Atlanta access router and run at five-minute intervals. The SLA specifies 5 seconds and 15 seconds as major and critical thresholds.
At first, you might consider that each response time test (SPM test) should be configured with a 5-second major threshold and a 15-second critical threshold; however, the SLA wording suggests that this would not be appropriate. Note the wording "Average transaction time for initial account access to not exceed 5 seconds". The average response time should be monitored, rather than the individual response time of each test. Imagine a response time result set of 4 seconds, 3 seconds, 3 seconds, 3 seconds, and 6 seconds. The 6-second result is in violation of the 5-second threshold. However, the average response time is 3.8 seconds, which is not in violation.

To support this behavior, you should not set thresholds on the individual response time tests. Instead, create a new Response Time Service with a policy to monitor the latency of the response time tests. The new service policy should be created to monitor the Latest Result attribute on the response time test models. The policy should apply an aggregate rule set when evaluating response times as follows:

    When the average for all resources is greater than 15000, the service is down.
    When the average for all resources is greater than 5000, the service is degraded.

The Latest Result attribute value of a response time test model is the number of milliseconds that the most recent test took to complete. Therefore, the values in the service policy above are expressed in milliseconds. The Atlanta Response Time and Savannah Response Time Services should both use this policy to monitor the five response time tests hosted by the respective site access router.
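The aggregate rule set can be expressed as a small Python sketch. This is an illustration of the averaging logic only, not of how SPECTRUM evaluates a service policy; the function name and the "up" label are assumptions.

```python
from statistics import mean
from typing import Sequence

DOWN_MS = 15000      # average above this: service is down
DEGRADED_MS = 5000   # average above this: service is degraded


def service_health(latest_results_ms: Sequence[float]) -> str:
    """Evaluate the average of the Latest Result values for all test models."""
    avg = mean(latest_results_ms)
    if avg > DOWN_MS:
        return "down"
    if avg > DEGRADED_MS:
        return "degraded"
    return "up"


if __name__ == "__main__":
    # The result set from the text: one 6-second sample, average 3.8 seconds.
    print(service_health([4000, 3000, 3000, 3000, 6000]))  # up
```

Because the rule acts on the average, the single 6-second sample in the result set above does not change the service health.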
Referring back to the response time specifications in the SLA, two business-hours response time guarantees are needed to track response time of the services created above:

    Atlanta response time 95% M-F 7AM-6PM
    Savannah response time 95% M-F 7AM-6PM

An additional availability component is included with the response time stipulation: "...successful transaction completion to be guaranteed at 99% for standard business hours. A transaction is deemed successful if completed within 15 seconds, and no period of transaction failure shall persist for more than 20 minutes." Recalling the second definition of an availability guarantee (a service is unavailable if it is not responsive), this specification requires two additional 99% availability guarantees with a supplemental MOT component:

    Atlanta response time 95% M-F 7AM-6PM
    Savannah response time 95% M-F 7AM-6PM
    Atlanta availability 99% M-F 7AM-6PM, MOT 20 minutes
    Savannah availability 99% M-F 7AM-6PM, MOT 20 minutes

A maintenance window clause within the SLA should also be considered: "...excluding those periods of scheduled system maintenance to be conducted between the hours of 12AM to 3AM on each Sunday..." To account for maintenance windows, modify each service to include a maintenance schedule for that time period.

The following table lists all SLA components that have been accounted for in this design:

SLA Design Components

SERVICE                    SLA COMPONENT
A to Z Account Access      Monthly SLA
Customer Account Access    Availability 99%, MTTR 30 minutes, MTBF 24 hours
                           Availability 99.5% M-F 7AM-6PM, MOT 20 minutes
Atlanta Routing            Availability 99%, MTTR 30 minutes, MTBF 24 hours
                           Availability 99.5% M-F 7AM-6PM, MOT 20 minutes
Savannah Routing           Availability 99%, MTTR 30 minutes, MTBF 24 hours
                           Availability 99.5% M-F 7AM-6PM, MOT 20 minutes
Atlanta Response Time      Response time 95% M-F 7AM-6PM
                           Availability 99% M-F 7AM-6PM, MOT 20 minutes
Savannah Response Time     Response time 95% M-F 7AM-6PM
                           Availability 99% M-F 7AM-6PM, MOT 20 minutes
How You Implement the A to Z Account Access SLA in SPECTRUM

To implement the SLA design described in the previous section, follow these high-level steps:
1. Create the two Routing Services and their resource monitors for Routers and Interfaces. This results in 6 services grouped into two hierarchies.
    For each site, use the Service Editor to configure the first resource monitor (Atlanta Router and Savannah Router) to watch the contact status of the access router device model with the Contact Status High Sensitivity policy.
    For each site, configure the second resource monitor (Atlanta Interfaces and Savannah Interfaces) to watch the port status of any critical router interface using a Port Status policy. Consider using either the Low Sensitivity or Percentage rule set for the service policy, depending on the number of interface models required to provide access.
    Use the Service Editor to create the Atlanta Routing and Savannah Routing Services, which monitor the service health of their two resource monitors (defined above) using the Service Health High Sensitivity policy.

2. Create the Atlanta and Savannah Response Time Services and their individual response time test (SPM) models.
    Use the OneClick Console's Locater tab to configure 5 SPM test models for each site with the following settings:
        A 5-minute (300 seconds) schedule interval and thresholds disabled
        A timeout value of 25-30 seconds
        Filter Timeout Data set to FALSE, so that the test models write the timeout value to the Latest Result attribute
    Use the Service Editor to create the Atlanta Response Time and Savannah Response Time Services, and ensure that they monitor the newly created response time test (SPM) models.
    Using the Service Policy Editor, configure each Response Time Service to use a new service policy that monitors the response time tests' Latest Result attribute with the following aggregate rule set:
        When the average for all resources is greater than 15000, the service is down.
        When the average for all resources is greater than 5000, the service is degraded.

3. Consolidate the Routing, Response Time, and previously created Customer Account Access (from Example 2) services into higher-level services.
    Consolidate the Routing and Response Time Services at both customer sites under two new services: A to Z Site Access and A to Z Response Time. The two new services should monitor their respective components with a Service Health High Sensitivity policy.
    Create the A to Z Account Access Service to monitor the A to Z Site Access, A to Z Response Time, and Customer Account Access services. The new service should monitor its components with a Service Health High Sensitivity policy.

4. Set up the SLA rules.
    After completing the changes to the service hierarchy, navigate to the SLA tab within the Service Editor.
    Create an SLA against the A to Z Account Access service using a monthly SLA period. Do not create guarantees against the A to Z Account Access Service at this time.
    Launch the Guarantee Editor with the new SLA highlighted.
    Use the Guarantee Editor to create each of the ten guarantees (8 availability guarantees and 2 response time guarantees) that are identified in the previous section.
    Apply the MTTR, MTBF, and MOT specifications to the appropriate guarantees.
    Note: The functionality to associate business hours with these guarantees is planned for SPECTRUM 8.1. Disregard the business-hours restriction for releases prior to 8.1.
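The reason for writing the timeout value to the Latest Result attribute (step 2) can be seen with a little arithmetic: timed-out tests then pull the average upward, so the aggregate rule set can report the service as down. The following Python sketch is a worked illustration only; the function name and the 1000-millisecond "healthy" latency are assumptions.

```python
from typing import Optional


def min_timeouts_for_down(timeout_ms: float, healthy_ms: float,
                          n_tests: int = 5, down_ms: float = 15000) -> Optional[int]:
    """Smallest number of timed-out tests that pushes the average above down_ms.

    Timed-out tests are assumed to report timeout_ms as their Latest Result;
    the remaining tests are assumed to report healthy_ms.
    """
    for k in range(n_tests + 1):
        avg = (k * timeout_ms + (n_tests - k) * healthy_ms) / n_tests
        if avg > down_ms:
            return k
    return None  # the average can never exceed down_ms with these values


if __name__ == "__main__":
    for timeout in (25000, 30000):
        print(timeout, min_timeouts_for_down(timeout, healthy_ms=1000))
```

With these assumed values, a timeout in the 25-30 second range means that three of the five tests must time out before the average exceeds 15000, whereas a timeout value at or below 15000 could never drive the average above the down threshold.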
An SLA guarantee will be violated if the following occurs: The threshold for any guarantee is violated. The threshold for any supplemental guarantee is violated (that is, MOT, MTTR, MTBF). When the MOT threshold is violated, the supplemental guarantee will immediately violate the SLA. If the MTTR or MTBF threshold is violated, the guarantee will transition the SLA to a state of at risk because the final determination of whether MTTR or MTBF has been violated cannot be made until the end of the SLA period. The SLA status is equivalent to the status of its worst guarantee. If any guarantee is violated, the SLA will likewise be violated for the SLA period. When the SLA period rolls over, the SLA will transition back to a state of unaffected.
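These status rules can be sketched in Python. This is a simplified model under stated assumptions, not SPECTRUM's implementation: the class and function names are hypothetical, outages are given as (start minute, duration in minutes) pairs within one SLA period, MTTR is taken as the mean outage duration, and MTBF is checked as the gap between consecutive outage start times.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Assumed severity ordering of the states named in the text.
SEVERITY = {"unaffected": 0, "at risk": 1, "violated": 2}

Outage = Tuple[float, float]  # (start minute in the period, duration in minutes)


@dataclass
class Guarantee:
    name: str
    allowed_downtime_min: float
    mot_min: Optional[float] = None   # maximum outage time
    mttr_min: Optional[float] = None  # mean time to repair
    mtbf_min: Optional[float] = None  # mean time between failures


def guarantee_status(g: Guarantee, outages: List[Outage]) -> str:
    durations = [d for _, d in outages]
    # The main threshold or an MOT breach violates the guarantee immediately.
    if sum(durations) > g.allowed_downtime_min:
        return "violated"
    if g.mot_min is not None and any(d > g.mot_min for d in durations):
        return "violated"
    # MTTR and MTBF breaches are only final at period end: report "at risk".
    at_risk = False
    if g.mttr_min is not None and durations:
        at_risk = sum(durations) / len(durations) > g.mttr_min
    if g.mtbf_min is not None:
        starts = sorted(s for s, _ in outages)
        gaps = [b - a for a, b in zip(starts, starts[1:])]
        at_risk = at_risk or any(gap < g.mtbf_min for gap in gaps)
    return "at risk" if at_risk else "unaffected"


def sla_status(statuses: List[str]) -> str:
    """The SLA takes the status of its worst guarantee."""
    return max(statuses, key=SEVERITY.__getitem__, default="unaffected")


if __name__ == "__main__":
    monthly = Guarantee("Atlanta monthly", 432, mttr_min=30, mtbf_min=24 * 60)
    business = Guarantee("Atlanta business hours", 72.6, mot_min=20)
    results = [guarantee_status(monthly, [(0, 40)]),
               guarantee_status(business, [(100, 15)])]
    print(results, sla_status(results))
```

In the usage example, a single 40-minute outage stays within the monthly downtime budget but exceeds the 30-minute MTTR, so the SLA is reported as at risk rather than violated.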
Service and SLA Reporting

Service Availability and SLA Reports are a major component of the Service Management solution. These reports complement the service and SLA modeling process and provide insight into the performance of service components over a variety of time periods. Service and SLA reports can be categorized into two groupings: customer-facing and internal.

Customer-facing reports provide service availability and SLA status information, and can be delivered to service customers. Frequently, SLAs will stipulate that customers will receive Service and SLA reports for each SLA period. Customer-facing reports tend to summarize status. For example, a customer-facing availability report would only show two metrics: available time or down time. Likewise, a customer-facing SLA report would only show two metrics: compliance or violation.

Internal reports are designed to provide a rich set of detailed data for use by the service provider or enterprise customer. In contrast to the customer-facing service availability report, an internal service availability report would also display maintenance time, loss of management time, and so on. Similarly, an internal SLA report would include all possible SLA states, including unaffected, compliant, warned, and violated. Other internal reports may summarize services with the greatest downtime or service resources that contribute the most downtime. Internal reports are intended to give service providers insight into the health and performance of their Services and SLAs over a period of time.
Run SPECTRUM Service Manager Customer-Facing Reports

Several different customer-facing reports within the Service Availability and SLA category are available to Service Management users. You use the SPECTRUM Report Manager application to generate and manage your reports. You can access the application from any computer that can connect via a web session to the OneClick server on which Report Manager is installed.

To access Report Manager and run reports

1. Point a web browser to the Report Manager web page using the URL http://hostname/spectrum/repmgr, where hostname is the name of the OneClick and Report Manager system.
2. Log in to the application by specifying your username and password in the OneClick login window.
3. Click the Begin Session link on the Report Manager Welcome window.

The Report Manager main window appears. The main window provides access to all report and report management options for your account. It lists any scheduled reports that have been generated for your account, reports that are scheduled to be generated for your account, and any messages in the Message of the Day and What's New text boxes posted by a Report Manager administrator. For a complete description of Report Manager and how to use it, see the Report Manager User Guide. The following sections describe the customer-facing reports.
Service Availability by: Name, Customer, Owner

This report includes a pie chart showing service up/down time and availability percentage based on the period for which the report was run. A table listing all outages shows start time, end time, duration, and outage notes. In addition, a subreport with detailed outage information is available for any outage within the table. You can generate multiple service availability reports by service name, service customer, or service owner.
Service Availability Variable Health Level

This report is similar to the Service Availability report, but allows you to include degraded and slightly degraded time if you choose. A pie chart including all service health types is shown with availability percentage calculation based on the period for which the report was run. All included outages are listed in the subreport showing detailed outage information.
Service Summary by: Name, Customer, Owner

This report lists multiple services based on service name, service customer, or service owner with outage times and percentage of availability.
Service Summary Variable Health Level

This report provides a table of services with columns that display summarized data for each service health level that you choose to include in the report, similar to the previous report. You can choose to display down only; down and degraded; or down, degraded, and slightly degraded. For each service listed in the table, a subreport with more detailed outage information is available.
SLA Detail By Customer

You can generate this report for one or more SLA periods. The report includes a pie chart that displays the percentage of guarantees for all reported periods that are compliant or violated. Below the chart, each of the guarantees is reported, including the status for each period. For any period, you can open a subreport showing detailed outage information for the particular guarantee, including any outage exemptions. If you run the report based on customer, a separate report will be generated for each of the customer's SLAs. You can provide the report to the customer at the end of each period.
SLA Inventory by Customer

This report shows the configuration of each SLA and guarantee for a particular customer. This is a useful report to generate for a customer when the SLA and guarantee models are first created. The user should be able to compare the configuration with the SLA document to verify that all guarantees or service level objectives are addressed.
SPECTRUM Service Manager: Internal Reports

Several different internal reports within the Service Availability and SLA category are available to Service Management users. The following sections describe the internal reports.
Service Health by Service Name

This report is very similar to a service availability report, but includes all service health levels, including maintenance and loss of management. The report can be run for both services and resource monitors. A pie chart shows the percentage of each service health value. A table showing outages of all service health types, including outage notes and links to detailed outage information, is also included in this report. The service health report provides service manager users with very detailed information regarding the performance of a service over a given period of time.
Service Inventory
This report shows a breakdown of all services, resource monitors, and resources which are modeled in the system. It can be used to preserve a snapshot of service inventory configuration for the current time.
Top N Worst Performing Services

This report allows you to view the top N worst performing services for a given period of time. A bar chart summarizes the downtime for each service. A table, including summarized availability information for each service, is also available. This table includes links to a more detailed service availability report. This report is useful for a service management user to gain information regarding the worst performing services for any period of time.
Top N Worst Performing Services Including All Outage Types

This report is similar to the Top N Worst Performing Services report, but includes degraded and slightly degraded time in addition to downtime. Summarized availability information and links to detailed service availability reports are available for each service model. The report provides very detailed information for the service manager user indicating which services experienced the most overall outage time for a particular period.
Top N Worst Service Outages

This report allows you to view the top N worst service outages which resulted in service downtime. This report is a useful tool for summarizing the worst outage for a period of time and may highlight areas within the service hierarchy which are lacking the redundancy to prevent service downtime.
Top N Worst Service Resources by Total Downtime

This report shows summarized information regarding the total service downtime caused by individual resources. This highlights the cumulative effect of each individual resource outage which results in downtime for one or more services. This report can be an important tool for identifying service resources which are chronic problem areas within the service modeling hierarchy.
SLA Status Current and Recent by Customer

This report provides you with a quick way of obtaining summarized SLA status for the current and recent periods. Status includes unaffected, compliant, warned, and violated SLAs with detailed subreports showing results for specific guarantees. This report can be run for selected SLAs or SLAs for a specific customer. This report can provide a quick review of the status of many SLAs for any customer.
SLA Summary by: Name, Customer, Status

This report produces a table of summarized SLA status for one or more periods. The report can be generated by SLA name, customer name, or simply organized by status. The report provides a summarized reference for multiple SLAs or multiple periods. You can access detailed subreports showing results for specific guarantees.
SLA Summary Warned or Violated

This report produces a table of all SLAs that are currently in the warned or violated state. The table also provides access to a subreport showing detailed guarantee outage information for the current period. This is a useful report for the service manager user to view SLAs that are not performing well for the current period.
SLA Detail By: SLA Name, Time Range, Last N Periods

This report is similar to the SLA Detail By Customer report except that it displays all SLA status values, including the unaffected, compliant, warned, and violated states. Detailed information is provided in a subreport that includes guarantee outages for the particular period. This is useful for obtaining detailed information about individual SLAs for one or more periods.
PERIOD DETAIL SUBREPORTS
SLA Detail with Resource Outages

This is a complex report that brings together SLA status and the associated resource outages which ultimately impacted the SLA for a specific period. This report is useful when used in conjunction with the Top N Worst Resources By Total Down Time report. You can use this report to show the impact of a particular resource at a very high level. Because it provides a great deal of information, it may generate many pages of data for SLAs with a high number of resource outages.
Customer SLA Summary

This report shows the status of the last six SLA periods for all of a customer's SLAs. The status includes all four values. For each SLA, a chart summarizing six periods of status information is shown within a table providing summarized information for each period and a link to more detailed guarantee outage information. This report provides service managers with a quick view of SLA performance for a specific customer over the last six periods. The report may also be used by the sales organization to verify whether a customer's SLAs have been met for recent periods.
Chapter 8: Proactive Service Assurance

The CA Network and Voice Management solution has embedded algorithms that help operations teams to identify growing problem areas within the infrastructure before they impact customer service. Problems rarely occur instantaneously; often, warning signs occur, such as subtle but growing service degradations, increasing errors, and delays. These problems might not be serious enough for users to notice or to warrant the opening of service calls, but they are growing. With tools that can analyze and detect growing problems and raise warnings, operations teams can proactively fix the problems before they result in outages and interrupted or lost service. This capability is particularly important for SLA enforcement. If you can resolve SLA troubles before the SLA is violated, and without requiring additional network resources or servers, you can avoid excessive effort and expense.

Prerequisites: The procedures in this chapter assume that you have installed SPECTRUM and eHealth, and that you have configured Live Health to send traps to SPECTRUM. For more information about configuring the integrated product solution, see Chapter 5. For details on the Live Health application and how to create monitoring rules, see the Live Health web help. The eHealth web help is installed on the eHealth system and is also available on the TotalDoc online documentation CD.
How You Identify Potential Problems

The proactive analysis of the eHealth Live Health application and the Health report exceptions analysis are the key tools that warn you about growing problems in your network. For converged networks, the eHealth for Voice Policy Manager identifies when voice and messaging problems are starting. All of these tools provide configurable thresholds and settings so that you can define when a problem is serious enough to merit proactive attention.
You can configure these tools to automatically watch for these growing problems and send alarms to SPECTRUM when the problems require attention. In addition, you can define how long the behavior must be occurring before alarms are raised so that you can reduce the false alarms of simple threshold violations and focus on the real, continuing situations. For example, the Live Exceptions application of the Live Health product family provides notifications of potential delay, failure, and unusual workload problems within networks, systems, and applications. It uses the historical data that eHealth gathers and maintains to assess potential problems over time. When Live Exceptions detects a condition that merits operator attention, it raises an alarm and sends it to SPECTRUM.
Configure Live Health to Watch for Growing Problems

For proactive service assurance, you use the Time over Threshold and Deviation from Normal algorithms within Live Exceptions to watch for growing problems in service. When performance changes from what is considered normal behavior (based on past history) for a particular length of time, Live Health raises an alarm and can send that alarm to SPECTRUM.
To configure Live Health for proactive service assurance
1. Use the Live Exceptions Browser to associate the applicable Unusual Workload default profiles to groups or group lists of your managed resources. Use the Live Health profile descriptions tool on http://support.concord.com/devices to identify the correct profiles for the element types that you have discovered. For more information about associating profiles to groups, see Chapter 5.
2. If you have custom SLAs, you can create custom profiles with Time over Threshold and Deviation from Normal alarms to reflect your service thresholds. Make sure that your rules are configured to warn you when the service degradations require attention, which will typically be at a threshold that is lower than your service agreement thresholds. For instructions on creating custom profiles, see the Live Health web help.
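A rule that compares current behavior to a historical baseline could look like the following sketch. The use of a mean and standard deviation, the default of three deviations, and all names and values are assumptions for illustration; the actual Deviation from Normal algorithm is described in the Live Health web help.

```python
from statistics import mean, stdev
from typing import Sequence


def deviates_from_normal(history: Sequence[float], current: float,
                         num_deviations: float = 3.0) -> bool:
    """True when `current` is more than `num_deviations` standard
    deviations away from the historical mean (assumed baseline model)."""
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        return current != baseline
    return abs(current - baseline) > num_deviations * spread


def sustained_deviation(history: Sequence[float], recent: Sequence[float],
                        min_consecutive: int,
                        num_deviations: float = 3.0) -> bool:
    """True when at least `min_consecutive` samples in a row deviate."""
    run = 0
    for value in recent:
        run = run + 1 if deviates_from_normal(history, value, num_deviations) else 0
        if run >= min_consecutive:
            return True
    return False


# Hypothetical history for the same hour on previous days, then two
# hypothetical sets of recent samples.
history = [50, 52, 48, 51, 49, 50, 53, 47]
growing_problem = sustained_deviation(history, [51, 70, 72, 75], min_consecutive=3)
brief_blip = sustained_deviation(history, [51, 70, 52, 49], min_consecutive=3)
```

Setting the warning level below the service agreement threshold, as recommended above, corresponds here to choosing a smaller `num_deviations` or a shorter `min_consecutive` run.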
Configure Health Reports to Send Traps for Growing Problems

The Exceptions section of a Health report contains information about elements that have experienced unusual events or that may not have sufficient resources to accommodate the demand that is placed on them. This section of a Health report identifies elements that have accumulated a high number of exception points as the result of errors, high utilization, and divergence from trends. Elements appear in the report only when their accumulated exception points exceed a minimum number. eHealth administrators can specify this number in the service profile for the report. As an additional means to proactively monitor service, you can configure Health reports to forward traps for Health exceptions to the SpectroSERVER. When the scheduled Health report runs, eHealth sends an SNMP trap to the SpectroSERVER for the leading problem for each element in the Exceptions section of the Health report. Trap-forwarding is not enabled by default for eHealth; you must create a custom Health report to enable this feature, and then schedule that Health report to run automatically. Note: Only scheduled Health reports forward exceptions. If you manually run a Health report, it will not forward exceptions. For instructions on creating a custom Health report that forwards exceptions as traps to Live Health, see Chapter 5.
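As a rough illustration of the selection logic described above, the following sketch filters elements by a minimum number of accumulated exception points and picks one leading problem per element. The data layout, the category names, and the assumption that the leading problem is the category with the most points are all hypothetical; eHealth performs this analysis internally.

```python
from typing import Dict, Tuple


def leading_exceptions(points_by_element: Dict[str, Dict[str, int]],
                       minimum_points: int) -> Dict[str, Tuple[str, int]]:
    """Return {element: (leading problem, total points)} for elements
    whose accumulated exception points exceed `minimum_points`."""
    report = {}
    for element, points in points_by_element.items():
        total = sum(points.values())
        if total > minimum_points:
            # Assumption: the leading problem is the largest contributor.
            leading = max(points, key=points.get)
            report[element] = (leading, total)
    return report


# Hypothetical exception points per element and problem category.
sample_points = {
    "router-a": {"errors": 12, "high utilization": 40, "trend divergence": 5},
    "switch-b": {"errors": 3, "high utilization": 2, "trend divergence": 1},
}
exceptions = leading_exceptions(sample_points, minimum_points=20)
```

Only elements that pass the minimum would produce a trap, one per element, for the leading problem.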
Send Voice Alerts to SPECTRUM

eHealth for Voice Release 4.0 provides alarm integration and correlation with SPECTRUM, taking advantage of the SPECTRUM Service Management and voice modeling capability. You can configure the eHealth for Voice Policy Manager application to send SNMP traps to SPECTRUM when violations of QoS or GoS occur. SPECTRUM applies its intelligence on policy, models, and rules to identify the severity of the problem. Policy Manager monitors all data, voice, and PBX activity, and then reviews that data against pre-defined criteria. With Policy Manager, you can define rules or policies against any data configuration changes, system traffic, individual usage, alarms, historical events, and so on. For instructions on configuring eHealth for Voice to send performance traps to SPECTRUM, see Chapter 5.
How You Respond to Alarm Actions in SPECTRUM

Using the SPECTRUM OneClick console, network operators and managers can view the models (or resources) in their topology and watch for events or status changes that indicate growing problems in their network. When SPECTRUM receives a trap from eHealth, the model that represents the element changes color to represent the alarm severity of the trap that was received. For example, when critical problems occur, the device icon changes to red, while minor problems cause it to change to yellow.
Operators can right-click the icon and take the following actions to troubleshoot or investigate the problems:
Drill down to an eHealth Alarm Detail report to obtain a picture of the performance trends that caused Live Health to detect performance problems that required an alarm to be raised. For example, if a device has performed outside its normal operating thresholds for more than 15 minutes, Live Health Alarm Detail reports can show you the performance trend line for the element.
Run an At-a-Glance or Trend eHealth report to review the performance history of the resource. While a Trend report shows you the performance of the specific problem variable, the At-a-Glance report shows you a set of common performance variables for that element type. Using this data, you can identify contributing causes or the root of the problem.
Clear the alarm. If the operator knows that the alarm is related to a known problem or situation, the operator could clear the alarm and return the device status to normal.
Open Service Desk tickets to record the problem as a work task and assign it to personnel to fix. With the Unicenter Service Desk integration, SPECTRUM can open, update, and close Service Desk tickets that track work to address problems in the network. Operators at the OneClick console can drill down to the Service Desk ticket details to determine the latest status and assigned troubleshooter for the tickets.
Chapter 9: Predictive Capacity Planning

Employee productivity and customer satisfaction both depend on the availability and performance of mission-critical applications. The applications depend on the IT infrastructure running smoothly and efficiently. Ensuring that IT resources meet the needs of your users requires more than just responding to problems. To keep your infrastructure running efficiently, you must obtain real-life data about the current status of your network, identify congestion and trouble spots before they affect users, and plan effectively for the future. These tasks are all part of predictive capacity planning.
Capacity planning is a complex and critical part of managing IT resources. It helps you to use your current resources efficiently, evaluate trends in demand, and project future resource needs. Effective capacity planning allows you to achieve the following:
Reduce costs through the reduction or elimination of underused leased lines.
Improve performance through identification of both overused and underused elements, and rebalancing of capacity with demand.
Reduce server and network downtime by anticipating overloads before they occur, and ensuring adequate capacity is in place.
Improve budget predictability by tracking trends and modeling the effects of new services or infrastructure, allowing you to avoid emergency purchases and ensure you get the best prices.
This chapter describes how eHealth can help you to perform four major capacity planning tasks:
Identify underutilized resources to find existing devices or resources that are underused, resulting in unnecessary costs for leased lines and systems that are sitting idle.
Identify overutilized resources to find existing devices or resources that are overused, resulting in performance degradation or penalty charges from overuse.
Plan future capacity needs to project capacity needs based on current demand trends or anticipated business changes, allowing you to plan purchases and install upgrades as needed.
Perform voice capacity planning to find over- and underutilization problems in your Telco or converged networks.
Prerequisites: To use the best practices in this chapter, your eHealth database must have at least a week of collected data. With more performance data and longer history, these reports perform better for highlighting capacity trends and utilization problems. These examples also assume that you are viewing the reports from the eHealth Web interface. Reports on the Web interface have interactive hot-spots, which you can click to drill down to other reports and closer detail. Drilldowns are not available from reports that are in PDF format.
Additional References: Procedures for running, scheduling, and customizing reports are described in detail in the eHealth Report Management Guide and the eHealth web help. For details on the eHealth reports and how they work, see the eHealth web help. The best practices in this section are taken from the Capacity Planning with eHealth topic, which is available on the eHealth Support web site at http://support.concord.com.
How You Identify Underutilized Resources

To identify underused resources, follow these steps:
1. Locate underutilized resources.
2. Confirm underutilization.
3. Address underutilized resources.
4. Show ROI.
5. Update your configuration.
Locate Underutilized Resources

eHealth provides an Underutilized Elements report that allows you to quickly identify elements that may be underutilized. This report is an optional Supplemental report in eHealth Health reports. To view this report, you must customize a Health report to include it.
To locate the underutilized elements in your network
1. Log in to the eHealth Web interface, and select the Run Reports tab.
2. Click the Standard Health report link. The Run Health Report page appears.
3. Specify the report subjects (for example, LAN/WAN technology, and a group of elements).
4. Click More Options, and select Supplemental under Presentation Attributes.
5. In the list of Supplemental reports, select Underutilized Elements to include that report.
6. Save the report with a unique name, such as Underutilization_Report.
7. Click Generate Report to run the report. Because this report can take several minutes to run on demand, the recommended best practice is to schedule the report to run from the eHealth console so that the report runs overnight or during a time when the eHealth system is not very busy.
8. Review the Underutilized Elements supplemental report. The report lists elements that meet the following criteria for the past 8 days:
   Never reached 50% utilization
   Did not reach 10% utilization more than 5% of the time
9. In the report, look for leased lines, routers, switches, and systems that have underutilized bandwidth, CPU capacity, memory, or disk space. For example, the following report shows several high-speed OC-3 lines that have very low usage and should be investigated further.
When you run an Underutilized Elements report for only LAN/WAN elements, the elements are sorted first by speed (since faster WAN links are more expensive), and then by the percentage of time that they were underutilized.
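The two report criteria and the sort order described above can be expressed as a short sketch. This only illustrates the stated rules and is not how eHealth computes the report; the `Element` type, the sample data, and the definition of "percentage of time underutilized" as the share of samples below 10% are assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Element:
    name: str
    speed_bps: float
    utilization: List[float]  # percent utilization samples for the past 8 days


def is_underutilized(element: Element) -> bool:
    """Apply both criteria: never reached 50% utilization, and reached
    10% utilization no more than 5% of the time."""
    samples = element.utilization
    if not samples:
        return False
    never_reached_50 = max(samples) < 50.0
    share_at_or_above_10 = sum(1 for u in samples if u >= 10.0) / len(samples)
    return never_reached_50 and share_at_or_above_10 <= 0.05


def percent_time_under_10(element: Element) -> float:
    samples = element.utilization
    return 100.0 * sum(1 for u in samples if u < 10.0) / len(samples)


def underutilized_report(elements: List[Element]) -> List[Element]:
    """Sort by speed (fastest first), then by time spent underutilized."""
    candidates = [e for e in elements if is_underutilized(e)]
    return sorted(candidates,
                  key=lambda e: (-e.speed_bps, -percent_time_under_10(e)))


# Hypothetical elements and samples.
elements = [
    Element("T1-C", 1.544e6, [1.0] * 100),
    Element("OC3-B", 155.52e6, [2.0] * 97 + [12.0] * 3),
    Element("T1-D", 1.544e6, [30.0] * 90 + [60.0] * 10),
    Element("OC3-A", 155.52e6, [2.0] * 100),
]
report_order = [e.name for e in underutilized_report(elements)]
```

Sorting by speed first puts the most expensive candidates at the top of the list, which matches the reasoning given for the LAN/WAN report.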
BEST PRACTICES
As you use the Underutilized Elements report, consider the following best practices that can help to make the report more meaningful for your environment:
When you first install eHealth, run the report weekly to identify resources that are not being used. After this initial period, you can run it less frequently (monthly or quarterly) to identify usage changes in your network.
Since the Underutilized Elements report looks at data from the past 8 days, you should schedule the report to run on Sunday so that you get data for an entire business week.
Depending on how your network is used, you can edit the service profile so that the report includes data from only certain days or times, to eliminate periods of low network usage such as nights or weekends.
Confirm Underutilization
After you find underutilized elements, analyze the purpose of each element and run reports to confirm that it is actually underused. Run a monthly Health Report to confirm that it has been underutilized for at least a month. Important: Check unused network links to determine if they are backups. Since backups are used only when the primary fails, they often do not have any usage.
To confirm that an element is underutilized

1. In the Health report (such as the one from the previous section), examine the Bandwidth Utilization chart on the Element Detail page of the report.
   The Bandwidth Utilization chart shows the load on each of the network interfaces over the report period. For example, the bar for Helium 5734 is completely gray, indicating that it did not have any usage during the month. Several other bars, such as Helium 7839, Miami, and Atlanta, are all dark green, indicating they never exceeded 10% usage. All of these interfaces appear underutilized.
2. Run a Bandwidth Trend Report by clicking the bar for the element that you suspect is underused. The Bandwidth Trend Report shows the utilization for that element for the same time period as the Health Report.
3. Examine the chart for constant underuse, or for a sudden decrease in usage due to network reconfiguration. For example, the element above never exceeded 3.5% utilization during the month, indicating that it is underused.
4. Establish a report trail to document evidence of low usage. In addition to weekly and monthly Health Reports, you can run a Service Customer Service Level report for longer term (quarterly or yearly) bandwidth utilization data. The Service Customer report also contains the Daily Bandwidth Utilization chart, which provides a detailed picture of daily usage.
How You Address Underutilized Resources

After you have identified and documented underused resources, consider taking the following actions to address utilization problems:
Eliminate unused lines. If a leased line is completely unused and does not serve as a backup, you can eliminate it and save the monthly cost of the line.
Downgrade underused lines. You can often save money by downgrading the capacity of the link. This solution is effective when the link is a fractional T1, frame relay circuit, or ATM channel, but may not be practical if you have to change technologies or run new cable.
Reroute traffic to the underused line. If you have overused lines in the same location, or lines that can be consolidated, consider rerouting network traffic to an underused line.
If you do not have other traffic from the same location, consider the following options:
Relocate local servers to a central site, increasing traffic over the underused line and lowering costs through server consolidation.
Aggregate traffic from several small local links to a regional center, and then use one shared high-speed link to the central site.
Show ROI
After you have determined the capacity changes to make, you can calculate and show potential monthly savings from eliminated lines and the difference in cost between existing lines and new downgraded lines.
To estimate the ROI
1. Review your monthly usage fees to identify the cost of leased lines that may be underutilized.
2. Contact your service providers to identify possible costs for changing service as well as the new monthly costs for changes in speed or bandwidth. If you have internal costs for changing service, take those costs into consideration as well.
3. Calculate the ROI for making changes using the following equation, which gives the number of months needed to recover the switching costs:
   ROI (months to break even) = switching costs / monthly savings
   The following table shows examples of costs and service charges for three interfaces.

   Interface    Current Speed  New Speed  Current Cost  New Cost  Switching Cost  Monthly Savings
   Net-Link     100 Mbps       10 Mbps    $3,300        $2,500    $5,000          $800
   Chicago T1   1.54 Mbps      512 Kbps   $2,500        $2,000    $1,000          $500
   Paxton T1    1.54 Mbps      128 Kbps   $5,000        $3,200    $1,200          $1,800

4. Based on your ROI calculations, determine whether making the proposed changes makes sense. For example, the table shows that downgrading the Net-Link from a 100 Mbps line to a 10 Mbps line would save $800 each month, but the high switching cost means that you would not break even for over six months. Downgrading the Paxton T1 to a 128 Kbps line, on the other hand, would give you an ROI in less than one month.
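The break-even arithmetic for the three example interfaces can be checked with a few lines of code. This is a minimal sketch that applies the equation from this procedure to the switching costs and monthly savings listed in the table; the function and variable names are illustrative only.

```python
def months_to_break_even(switching_cost: float, monthly_savings: float) -> float:
    """ROI as defined in this procedure: switching costs / monthly savings."""
    if monthly_savings <= 0:
        raise ValueError("a change with no monthly savings never breaks even")
    return switching_cost / monthly_savings


# (switching cost, monthly savings) taken from the example table.
interfaces = {
    "Net-Link": (5000, 800),
    "Chicago T1": (1000, 500),
    "Paxton T1": (1200, 1800),
}

break_even = {name: months_to_break_even(cost, savings)
              for name, (cost, savings) in interfaces.items()}

for name, months in break_even.items():
    print(f"{name}: {months:.2f} months to break even")
```

The Net-Link downgrade takes 6.25 months to pay for itself, the Chicago T1 takes 2 months, and the Paxton T1 takes about two-thirds of a month.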
Update Your Configuration

After you change your configuration to resolve capacity issues, update your SPECTRUM and eHealth environments to ensure that they reflect updated speeds and perhaps any resources that have been retired.
To update your configuration
1. Update your SPECTRUM views using rediscovery to ensure that they reflect the latest device information.
2. Update the eHealth polling configuration and element lists by re-importing the element information from SPECTRUM and rediscovering your elements. Future reports for the time ranges when element speeds changed may show unusual utilization percentages.
3. If you decreased capacity or added demand to existing resources, run Trend and Health reports on those resources. Look for any Health exceptions or other utilization problems that may result from the increased traffic.
4. If you eliminated an element, disable polling and retire the element in the eHealth database. Retiring the element allows you to continue reporting on it until its data ages out of the database.
How You Identify Overutilized Resources

To identify overutilized resources, follow these steps:
1. Locate overutilized resources.
2. Confirm overutilization.
3. Address overutilized resources.
Locate Overutilized Resources

eHealth's capacity planning tools can help you to identify overutilized resources before they start causing problems. By examining a single Health Report each week, you can identify network elements that are reaching their capacity. You can then consult other reports to analyze problems, and solve the issues before they become fires for your IT team.
To locate overutilized resources
1. On the eHealth web interface, run a daily Health report for the busiest day of your week.
2. In the left pane of the Health report window, click Exceptions Summary to open the Exceptions Summary Report. The Exceptions Summary report identifies elements that have experienced unusual events or whose resources are consistently inadequate for the demand on them. The elements are ranked by exception points, so that those elements experiencing the worst problems are listed first.
3. Look for elements in the report that list Utilization Health Index or Congestion Health Index in the Leading Exception column. These elements are experiencing high volume and may be overutilized. For example, the Frame Relay link to the Virginia office is listed first, and has Utilization Health Index as its leading exception. This link is likely overutilized and should be investigated further.
Confirm Overutilized Resources

The Situations to Watch chart identifies elements that are predicted to exceed, reach, or come close to reaching their trend thresholds. The chart shows you how close each element is to its threshold, how fast utilization is growing, and how long until demand exceeds capacity.
To confirm that resources are overutilized

1. On the eHealth web interface, run a Health report for the previous week.
2. Review the Situations to Watch chart in the Summary section of the Health report.
3. Review the elements listed in the chart, looking for those that have exceeded their threshold or are growing fast enough to soon reach it. For example, the first element (Virginia) has already exceeded its threshold for two days, while the next two are predicted to reach threshold in the next week. All of these are likely to be overutilized elements. Demand on the final two elements listed is increasing, but both are still at less than 20% capacity, and do not represent a problem.
4. Select Element Detail in the Health Report, and examine the Bandwidth Utilization chart for the elements that you suspect to be overutilized.
   The Bandwidth Utilization chart shows the percentage of time that each element was in each usage range. Generally, purple and red colors indicate an overutilized resource. Purple indicates greater than 100% utilization, meaning that the element is probably a leased line exceeding its contracted bandwidth, and, therefore, incurring overage charges.
5. Examine the chart to see how often a suspected element was overutilized during the course of the week. Some elements may show consistently high demand (such as the Virginia line), but since demand varies over time, most elements will show significant periods of low usage. Depending on your network activity, an element may not have any usage at certain times (overnight for example), but still be overutilized because demand exceeds capacity at peak times.
   For example, the Vermont line in the chart does not show any usage a third of the time (possibly overnight), and is under 20% usage most of the time. However, since it exceeds 100% utilization at peak demand, it could be incurring overage charges, and, therefore, be considered overutilized.
6. To obtain more details about an element's performance, create an At-a-Glance report by clicking that element in the Bandwidth Utilization chart.
7. Review the Bandwidth Utilization charts in the At-a-Glance report to determine how frequently the element was overused, and during what time periods. The sample charts show that the element had 50% utilization most of the week, but peaked near 100% several times. Depending on your business needs, an element that reaches its capacity for only one hour per week may be acceptable, or that one hour of overutilization could be a critical problem if it occurs at a key business time.
8. Review the other charts in the At-a-Glance report for any anomalies, including high error rates or signs of congestion (forward explicit congestion notifications (FECNs), backward explicit congestion notifications (BECNs), discards). Use this information to determine the conditions that might be affecting the element such as the following:
   Insufficient capacity
   Inefficient or misconfigured applications consuming excessive bandwidth
   Too many or too few stations overloading a WAN link
   A highly repeated or bridged domain that should be routed
9. Establish a report trail to document evidence of high usage. In addition to the reports described here, you can select specific elements in the Exceptions Summary Report and Situations to Watch chart to run detail reports for those elements. You can also run Bandwidth Trend reports on specific elements to show the long-term utilization of a resource.
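The point that an element can be idle much of the time and still be overutilized at peak can be shown with a small sketch that computes the share of time spent in each usage range. The range boundaries, the hourly sample values, and the rule that any time above 100% counts as overutilization are assumptions for illustration; the Bandwidth Utilization chart defines its own ranges.

```python
from typing import Dict, Sequence


def time_in_ranges(samples: Sequence[float]) -> Dict[str, float]:
    """Return the percentage of samples that fall in each usage range."""
    counts = {"no usage": 0, "under 20%": 0, "20-100%": 0, "over 100%": 0}
    for u in samples:
        if u == 0:
            counts["no usage"] += 1
        elif u < 20:
            counts["under 20%"] += 1
        elif u <= 100:
            counts["20-100%"] += 1
        else:
            counts["over 100%"] += 1
    return {name: 100.0 * n / len(samples) for name, n in counts.items()}


def is_overutilized(samples: Sequence[float]) -> bool:
    """Assumed rule: any time above 100% utilization counts, because a
    leased line over its contracted bandwidth may incur overage charges."""
    return time_in_ranges(samples)["over 100%"] > 0


# Hypothetical hourly samples for one day on a line that is idle
# overnight, lightly used most of the day, and over capacity at peak.
line_samples = [0] * 8 + [15] * 12 + [60] * 2 + [110] * 2
shares = time_in_ranges(line_samples)
overutilized = is_overutilized(line_samples)
```

An average over these samples would look harmless, which is why the distribution across ranges is more useful than a single utilization figure.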
How You Address Overutilized Resources

After you have identified and documented overutilized resources, consider taking these typical actions to resolve the problem:
Upgrade the element to a higher capacity.
Relocate demand to other resources.
Add additional elements to share the workload.

BEST PRACTICE
Use Capacity Trend What-If reports to visualize the effects of higher capacity or lower demand on the overutilized element, and determine the optimal capacity of any new resources. When you run the report, you specify an element, a capacity variable, and a time range. The report shows the value of that performance variable during that historical range. The What-If report is very similar to the eHealth Trend report; however, you can change the capacity of the resource, the demand placed on the resource during that time, or both, and then update the report to model the effects of possible changes.
Note: You must enter the values for capacity and demand as percentages of the current values. For example, 100% causes the report to use the current values; 50% causes the report to show half the current values (dividing the capacity or demand by 2); and 200% causes the report to double the current values.
This report shows that by increasing the capacity of the Virginia line by 50% (capacity = 125%), peak utilization would be reduced to about 60% of capacity. This capacity should be sufficient to meet expected demand.
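The percentage inputs described in the note can be thought of as scale factors on historical utilization. The sketch below assumes that utilization is simply demand divided by capacity, so scaling demand multiplies utilization and scaling capacity divides it. This is an assumption made for illustration, not a description of how the What-If report calculates its charts, and the sample values are hypothetical.

```python
from typing import List, Sequence


def what_if_utilization(observed_utilization: Sequence[float],
                        capacity_percent: float = 100.0,
                        demand_percent: float = 100.0) -> List[float]:
    """Rescale historical utilization (percent) for a modeled capacity and
    demand, both expressed as percentages of the current values."""
    if capacity_percent <= 0:
        raise ValueError("capacity must be greater than 0%")
    demand_factor = demand_percent / 100.0
    capacity_factor = capacity_percent / 100.0
    return [u * demand_factor / capacity_factor for u in observed_utilization]


# Hypothetical historical peak utilization values for a resource.
history = [70.0, 85.0, 90.0]

unchanged = what_if_utilization(history)                       # 100% / 100%
doubled_capacity = what_if_utilization(history, capacity_percent=200.0)
doubled_capacity_more_demand = what_if_utilization(
    history, capacity_percent=200.0, demand_percent=150.0)
```

Under this assumption, a 90% peak falls to 67.5% when capacity is doubled and demand grows by half, which is the kind of comparison against a trend threshold that the report is meant to support.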
How You Plan Future Capacity Changes

To plan and predict capacity changes, follow these steps:
1. Identify potential capacity changes.
2. Analyze capacity trends.
3. Visualize capacity changes.
4. Address capacity changes.
Identify Potential Capacity Changes

eHealth provides capacity planning reports that enable you to analyze the behavior of your resources under varying conditions, and predict where and when you'll need to add capacity.
To identify potential capacity issues
1. Schedule a Health report to run every Sunday to ensure that you obtain data for an entire business week. For instructions, refer to the section on customizing and scheduling Health reports in Chapter 5 of this guide.
2. Examine the Situations to Watch chart in the Summary section of the Health report. The Situations to Watch chart shows the top 10 elements (network interfaces, CPUs, disk partitions) that are nearing their capacity. The chart shows how close each element is to its threshold, how fast utilization is growing, and how long until demand exceeds capacity.
   This report shows several user partitions that are nearing their thresholds. In the Days To Threshold column, System-Orange shows 0, meaning that utilization has reached the Trend threshold. System-Green shows 20 days to threshold, and System-Pink shows Increasing, indicating utilization is growing, but will not reach threshold for a long period of time. Each of the systems at or near its threshold merits further investigation. For example, System-Orange could already be overutilized, or it could be a system partition designed to operate near capacity. System-Green, on the other hand, is 20 days from meeting its threshold, but could be a good candidate for upgrade if it is showing a steady increase in demand.
3. To drill down to more information for each reported situation, click the element name to run a Situations to Watch Detail report for the partition.
4. Examine the trend line to see how quickly the trend is approaching the threshold. If the line is rising at a steady rate, as in this example, consider adjusting capacity by increasing the size of the partition, deleting unneeded directories and files, or buying a new system.
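A days-to-threshold estimate of the kind shown in the chart can be approximated by fitting a straight line to recent utilization and extending it to the threshold. The sketch below uses an ordinary least-squares line; this is an assumed method chosen for illustration, and eHealth's own trend calculation may differ. All sample values are hypothetical.

```python
from typing import Optional, Sequence


def days_to_threshold(daily_utilization: Sequence[float],
                      threshold: float) -> Optional[float]:
    """Estimate days until a linear trend reaches `threshold`.

    Returns 0.0 if the trend has already reached the threshold, and None
    if utilization is flat or falling (the threshold is never reached).
    """
    n = len(daily_utilization)
    if n < 2:
        raise ValueError("need at least two daily samples to fit a trend")
    x_mean = (n - 1) / 2
    y_mean = sum(daily_utilization) / n
    covariance = sum((x - x_mean) * (y - y_mean)
                     for x, y in enumerate(daily_utilization))
    variance = sum((x - x_mean) ** 2 for x in range(n))
    slope = covariance / variance
    intercept = y_mean - slope * x_mean
    current = intercept + slope * (n - 1)  # trend value on the latest day
    if current >= threshold:
        return 0.0
    if slope <= 0:
        return None
    return (threshold - current) / slope


# Hypothetical partition utilization (percent) for the past week.
steady_growth = days_to_threshold([70, 71, 72, 73, 74, 75, 76], threshold=90)
flat = days_to_threshold([50] * 7, threshold=90)
already_over = days_to_threshold([88, 90, 92], threshold=90)
```

A steadily rising series yields a finite number of days, which is the case where adjusting capacity ahead of time is worth considering.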
Analyze Capacity Trends

After identifying potential upgrade candidates, run Capacity Projection and Capacity Provisioning reports to forecast volume changes over the upcoming weeks and months, and predict when elements need to be upgraded.
To run Capacity Projection and Capacity Provisioning reports
1. Log in to the eHealth Web interface, and select the Run Reports tab.
2. Click the Standard Health report link. The Run Health Report page appears.
3. Specify the report subjects (for example, System technology, and a group of elements).
4. Click More Options, and do the following:
   a. Under Presentation Attributes, select Capacity.
   b. Select Capacity Projection and Capacity Provisioning to include those reports.
   c. Specify 20 in the Capacity Provisioning Minimum Lead-Time field.
   d. Specify 90 in the Capacity Provisioning Maximum Lead-Time field.
5. Save the report as a template with a unique name, such as Capacity_Report.
6. Click Generate Report to run the report. The report can take a few minutes to run on demand. As a best practice, you can schedule the report to run from the eHealth console so that the report is automatically generated during off-peak hours and is ready for review when you need it.
7. Review the Capacity Projection report. The report forecasts how the capacity of a particular variable (partition utilization, for example) will change in the future. You can run the report based on peak, average, or percentile capacity values. eHealth measures the predicted capacity values against a threshold that you specify, and displays those elements predicted to exceed the threshold.
   This report displays the percentage of partition capacity that will be consumed on each system at 30 days, 90 days, and nine months into the future. You can see that demand on System-Orange is near threshold, but not increasing very much. Demand on System-Purple, however, is quickly increasing and will soon exceed capacity. System-Purple, therefore, may be in greatest need of upgrade.
8. To project when these elements will need to be upgraded, review the Capacity Provisioning report. The Capacity Provisioning report compares projected capacity values against an upgrade threshold, and displays those elements predicted to exceed the threshold, along with the number of days until an upgrade is required.
   Like the Capacity Projection report, you can run this report based on peak, average, or percentile capacity values. You can set both the upgrade threshold and an upgrade lead-time window by customizing the Presentation Attributes for the Health report. The report shows elements that are predicted to meet a 90% capacity upgrade point within the next 20 to 90 days. System-Green is most in need of upgrade, and should be addressed in the next 20 days.
BEST PRACTICES
For the Capacity Projection and Capacity Provisioning reports, it is important to know how much lead time you need to bring new capacity online. For example, some service providers may require 90 days to provide a new T1 line. For systems, you might need 30 days to order and add new disk space or memory. Therefore, for the types of resources you manage, you need to know when you must order additional capacity so that it is available (installed, tested, and turned over) before the upgrade point is reached. These reports identify those locations for which additional capacity needs to be ordered today to avoid reaching the threshold. The examples in this section show a 20-90 day lead time, and all three locations are projected to require an upgrade within that window. If it takes 90 days to add disk space or memory, it is very likely that the 90% upgrade threshold will be violated during this time period. When you first start to use eHealth to monitor your resources, you may find that some of your resources need upgrades sooner than your lead times might allow; but over time, these reports will help you to isolate problems earlier and avoid threshold violations before your lead time windows expire.
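The relationship between the projected upgrade point and your provisioning lead time comes down to simple date arithmetic, sketched below. The function names, resource names, and day counts are hypothetical; the 20 and 90 day defaults mirror the lead-time window used in the examples in this section.

```python
def in_lead_time_window(days_until_upgrade_point: int,
                        min_lead_days: int = 20,
                        max_lead_days: int = 90) -> bool:
    """True when the projected upgrade point falls inside the window."""
    return min_lead_days <= days_until_upgrade_point <= max_lead_days


def days_left_to_order(days_until_upgrade_point: int,
                       provisioning_days: int) -> int:
    """Days remaining before the order must be placed. A negative result
    means the upgrade point will pass before new capacity can arrive."""
    return days_until_upgrade_point - provisioning_days


# Hypothetical projections: resource -> (days until upgrade point,
# days needed to order, install, test, and turn over new capacity).
projections = {
    "wan-link-1": (60, 90),   # a new T1 line may take 90 days
    "server-disk-1": (45, 30),
    "server-disk-2": (120, 30),
}

for name, (days_until, lead) in projections.items():
    slack = days_left_to_order(days_until, lead)
    flagged = in_lead_time_window(days_until)
    print(f"{name}: in window={flagged}, days left to order={slack}")
```

A negative value for days left to order corresponds to the situation described above, where a resource needs an upgrade sooner than the lead time allows.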
Visualize Capacity Changes

The Capacity Trend What-If report shows how resources perform as your infrastructure changes and grows. These reports allow you to leverage historical data to predict future patterns, model changes in capacity or demand, and determine the effect on resources.
To help visualize the impact of changes in demand
1. Run a Capacity Trend What-If report to analyze potential solutions:
   a. On the Run Reports page in the Available Reports column under What-If, select CapacityTrend or another template name. The Run a Capacity Trend What-If Report page appears.
   b. Select an element type from the Element Type list, and then select an element from the Available elements list.
   c. Select a variable for your report.
   d. Under Chart type, select the chart format.
   e. Under Divide by, specify how you want to graph the selected variable.
   f. Optionally, select a time interval during which the data is aggregated.
   g. Select a sample size based on the time range for your report. The As Is sample size uses the most granular data available and does not aggregate the values.
   h. Under Report Time, select the report period. You can specify the values now, today, or yesterday, or an actual date or time value.
   i. Select More Options to specify the hours and days that the report will show.
   j. Optionally, customize the report by setting presentation attributes.
   k. Click Generate Report.
2. Use the fields at the top of the report to adjust the capacity and/or demand for the resource, and run the report again to model the change.
3. Use the report to model and determine whether an existing resource can support anticipated changes and, if not, how much capacity must be added. You can also illustrate potential problems so that you can propose requests for new equipment or upgrades. For example, this report shows that by doubling CPU capacity, demand on the server will be well under the trend threshold of 80%, even with a 50% increase in demand.
How You Address Capacity Changes

After you have identified possible capacity issues or improvements, consider these typical actions to resolve the problem:
Upgrade the element to a higher capacity.
Replace the element with a larger or faster device (such as a larger disk, faster interface, or faster CPU system).
The What-If report can help you to model, or visualize, how the proposed changes will improve performance. After making any changes to your devices or resources, update your configuration as described in Chapter 5.
Voice Capacity Planning

For networks with traditional or IP voice telephony devices or voice messaging systems, eHealth for Voice can help you to identify capacity problems and monitor GoS during the peak hours of the network. For voice devices, capacity problems can include trunk/port utilization and voice mail messaging disk space. When these resources are overutilized, service degrades and customer satisfaction suffers. When these resources are underutilized, it is important to identify where your devices might be overprovisioned so that you can take steps to reduce costs or reallocate resources to resolve congestion in other areas of the network.
Effective capacity planning enables you to achieve the following:
Reduce costs through the reduction or elimination of underused leased lines, as well as the reduction of maintenance costs for unused or unnecessary ports.
Improve performance through identification of overused and underused ports or trunks and rebalancing of capacity with demand.
Improve budget predictability by tracking trends, which helps you to avoid emergency purchases and to ensure that you can research and plan for the best service costs.
To understand traffic patterns, you need to collect information from the PBX that details peak traffic for each trunk group for at least a few weeks, preferably months. This information is available on the switch. eHealth for Voice automates the collection of this information, making it easier to run quarterly and on-demand maintenance assessments of your voice capacity and usage patterns.
Analyze Voice Capacity

Once you have collected traffic for the desired period, you can use the Capacity Analyzer tool to determine how well your voice devices are servicing customers during the busiest hour. From this dialog, you can quickly calculate GoS, view disk space capacity for message servers, and view process capacity for communications servers.
To access the Capacity Analyzer
1. On the system on which eHealth for Voice is installed, select Start, Programs, eHealth for Voice, eHealth for Voice. The eHealth for Voice Program Console appears.
2. Select Measurements, Reports in the left navigation console tree.
3. Double-click the Capacity Analyzer icon in the right pane of the console. The Capacity Analyzer dialog appears.
Analyze GoS
eHealth for Voice calculates a GoS to determine how callers are serviced (calls answered, busy, or ring no answer) during the busiest hour of the period. This can help you to determine additional bandwidth needed to carry voice traffic on the network.
To analyze the GoS
1. In the Capacity Analyzer dialog, select the Port Analysis tab to access the grade of service calculation tool. The target grade of service determines the percentage of callers that will be serviced (calls answered) during the busiest hour. A GoS of .001 means that 99.9% of callers will get through.
2. Select the target GoS and click Apply. The dialog shows the number of trunks that you need to add to support that GoS.
3. Use the horizontal scroll bar to scroll to the right side of the dialog.
4. Review the Add/(Delete) Trunks column to determine the number of trunks that you will need to provide that GoS. A number in parentheses shows the number of trunks that you could remove and still be able to support the GoS during the peak hour, which can help you to detect underutilized resources.
5. Review the Erlangs column to determine the actual peak traffic. An Erlang is a measurement of voice traffic. It expresses the total minutes of voice traffic that occur during an hour, divided by 60. If 10 users each make one 10-minute call in a given hour, the hour had 100 minutes of calls, or 100/60 = 1.67 Erlangs of traffic. This information helps you to identify how much additional bandwidth you need on the network to support voice.
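The Erlang arithmetic and the relationship between a target GoS and a trunk count can be sketched in a few lines. This guide does not state which traffic model the Capacity Analyzer uses; the sketch below assumes the standard Erlang B blocking formula, and all function names are illustrative rather than part of eHealth for Voice.

```python
def erlangs(calls: int, minutes_per_call: float, period_minutes: float = 60.0) -> float:
    """Traffic in Erlangs: total call minutes divided by the length of the period."""
    return calls * minutes_per_call / period_minutes


def erlang_b(traffic: float, trunks: int) -> float:
    """Blocking probability for `traffic` Erlangs offered to `trunks` trunks.

    Assumption: Erlang B model (blocked calls are cleared), computed with the
    usual numerically stable recursion.
    """
    blocking = 1.0  # with zero trunks, every call is blocked
    for n in range(1, trunks + 1):
        blocking = traffic * blocking / (n + traffic * blocking)
    return blocking


def trunks_needed(traffic: float, target_gos: float) -> int:
    """Smallest trunk count whose blocking probability meets the target GoS."""
    if not 0.0 < target_gos < 1.0:
        raise ValueError("target GoS must be between 0 and 1")
    trunks = 0
    while erlang_b(traffic, trunks) > target_gos:
        trunks += 1
    return trunks


def add_or_delete(current_trunks: int, traffic: float, target_gos: float) -> int:
    """Positive: trunks to add. Negative: trunks that could be removed."""
    return trunks_needed(traffic, target_gos) - current_trunks


# The example from the text: 10 users, one 10-minute call each, in one hour.
peak = erlangs(calls=10, minutes_per_call=10)
print(round(peak, 2))  # 1.67
```

A negative result from `add_or_delete` plays the role of the parenthesized value in the Add/(Delete) Trunks column: it is the number of trunks that could be removed while still meeting the selected GoS under the assumed model.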
BEST PRACTICES
As you use the Capacity Analyzer, consider the following best practices that can make results more meaningful for your environment:
- When you first install eHealth for Voice, run the Capacity Analyzer weekly to identify resources that are not being used. After this initial period, you can run it less frequently (monthly or quarterly) to identify usage changes in your network.
- Trunk or port lines with zero or very low traffic could be backups or overflow lines. Before proceeding with detailed service change plans, always confer with the person responsible for PBX/IP-PBX engineering to ensure that you understand the purpose of any trunks or ports.
- The level of over- or underutilization varies, depending upon the GoS selected. As the GoS decreases, the need for additional resources increases. Company service levels will help define the GoS needed for your environment.
How You Address Underutilized Resources

After you have identified and documented underused resources, consider eliminating unused trunks, ports, or PBXs to reduce the service charges and/or maintenance charges for your network.
Show ROI
After you have identified capacity changes that you could make, you can calculate and show potential monthly savings from eliminated trunks and the difference in maintenance costs between current and future configurations.
To estimate the ROI
1. Review your monthly usage fees to identify the cost of leased lines that may be underutilized.
2. Review your port maintenance fees to identify costs for unused ports.
3. Contact your service providers to identify possible costs for changing service or reducing the number of ports. If you have internal costs for changing service, take those costs into consideration as well.
4. Calculate the ROI for making changes using the following equation:
ROI = (service change + port-change fees) / monthly savings
5. Based on your ROI calculations, determine whether making the proposed changes is wise.
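As a minimal sketch, the ROI equation above can be applied as follows. Because it divides one-time fees by a monthly amount, the result can be read as the number of months needed to recover the cost of the change. The function name and the dollar figures are hypothetical and serve only to illustrate the arithmetic.

```python
def roi_months(service_change_fees: float, port_change_fees: float,
               monthly_savings: float) -> float:
    """ROI as defined in this guide: one-time change fees divided by monthly savings."""
    if monthly_savings <= 0:
        raise ValueError("monthly savings must be positive for the change to pay back")
    return (service_change_fees + port_change_fees) / monthly_savings


# Hypothetical figures: 1,200 in service-change fees, 300 in port-change fees,
# and 500 saved each month on leased lines and port maintenance.
print(roi_months(1200, 300, 500))  # 3.0
```

Include any internal costs for changing service in the fee totals before you run the calculation.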
How You Address and Confirm Overutilized Resources

The Capacity Analyzer provides the peak traffic for a given timeframe as well as the required number of trunks or ports to handle the traffic load for the desired grade of service. To confirm the results of overutilization, do the following:
- Run the Capacity Analyzer again and select a more granular date range. For example, if a quarterly-report peak hour shows utilizations that seem unusually out of range, evaluate each month to see the pattern or trends of busy-hour data. This can help you to investigate whether the busy hour is an anomaly, or if the traffic is growing in your network. If the busy hour is related to a one-time event, you can ignore this atypical activity in your capacity planning.
- Run Voice traffic reports for the platform to show trends in trunk or port usage. In this way, you can see if there is any overflow to another trunk group.
- Verify the GoS selected with the person responsible for PBX/IP-PBX engineering for this trunk group. Confirm that your analysis for each trunk group uses the GoS originally intended or planned for that group.
- Contact your service provider to add capacity, for example by adding trunks to the hunt group or, for a fractional T1, by adding more channels.
Analyze Voice Messaging Disk Capacity

The Capacity Analyzer provides you with the ability to check the message area capacity for your voice message mail services. If the free disk space runs out, users will not be able to create voice messages. For businesses that rely on voice mail messages to record sales orders or manage resources, disk space problems could result in critical service impacts.
To analyze voice messaging disk capacity
1. In the Capacity Analyzer dialog, select the Voice Messaging Disk Analysis tab.
2. Review the dialog to obtain information about disk utilization and busy-day statistics, which can help you to determine whether the voice mail servers are able to support their typical volume and to identify the busiest day.
How You Resolve Disk Capacity Issues

If your message services are running out of disk space, follow these steps to resolve the capacity problems:
- Request that users delete old/unwanted messages to free disk space.
- Consider implementing length limits on voice messages, time limits on how long messages can be saved, and limits on numbers of saved messages. This can help to keep user mailbox disk space usage more predictable.
- If the conservation techniques do not free enough disk space, add disk space to the messaging servers.
Chapter 10: Rapid Problem Resolution

When even the smallest problem occurs in a network, a wide range of services and capabilities can be affected. Network management systems detect these problems and can often send streams of events to report slowdowns, outages, and impacted services. This barrage of information, though accurate, often hinders troubleshooting efforts simply because of the amount of data that operators must filter through. CA's Network and Voice Management solution helps you to direct your troubleshooting efforts to the source of the problem. SPECTRUM software performs event correlation, impact analysis, and RCA for multiple vendors and technologies across network, system, voice, and application infrastructures. Combined with eHealth's ability to find and report on performance behavior changes, and eHealth for Voice's ability to monitor the policies and capacities of voice networks, CA offers a key solution for identifying problems, quickly targeting the real source of the problem, and providing deeper insights into historical trends and reports. This chapter describes how SPECTRUM's problem resolution and root cause identification processes work.
Problem-Solving Techniques
SPECTRUM offers three intelligent, automated, and integrated approaches to problem solving:
- Model-based IMT
- Rules-based EMS
- Policy-based Condition Correlation Technology (CCT)
SPECTRUM is fundamentally a model-based system. Model-based systems are adaptable to changes that regularly occur in a real-time, on-demand IT infrastructure. Rules-based systems are flexible in allowing customers to add their own intelligence without requiring programming skills. SPECTRUM combines the best of both approaches, using models to keep up with changes while leveraging easy-to-create rules running against the models to avoid the need for constant rule editing. Policy-based systems are automated means of connecting seemingly unrelated pieces of information to determine the condition and state of physical devices and logical services. This condition correlation engine combines with SPECTRUM's modeling engine and rules engine to deliver a higher level of cross-silo service analysis.
You can place almost every service delivery infrastructure problem into one of three categories: availability, performance, or threshold exceeded. Infrastructure faults occur when things break, whether they are related to LAN/WAN, server, storage, database, application, or security. Infrastructure performance problems often result in brown-out conditions in which services are available but are performing poorly. From the user's perspective, a slow infrastructure is a broken infrastructure. The final category is abnormal behavior conditions in which performance, utilization, or capacity thresholds have been exceeded as demand/load factors fall significantly above or below observed baselines.
SPECTRUM, eHealth, and eHealth for Voice can detect these problems in your network and raise alarms when problems occur. By sending all alarms to SPECTRUM, you can pinpoint the cause of problems. Model-based, rules-based, and policy-based analytics in SPECTRUM understand relationships between IT infrastructure elements and the customers or business processes that they are designed to support. It is through this understanding of relationships that SPECTRUM has been shown to deliver 70% reduction in downtime while resolving 90% of availability or performance problems from a central location. SPECTRUM's RCA has been able to reduce the number of alarms by several orders of magnitude while reducing MTTR from hours to minutes. SPECTRUM's distributed management architecture has also proven effective at performing RCA for over 5 million devices (20+ million ports) in a single environment with fully meshed and redundant core and distribution network layers. Our integrated approach to fault and performance management has enabled enterprise, government, and service provider organizations around the world to manage what matters through service level intelligence.
Complex Problems and Powerful Solutions

IT infrastructure operations management is a difficult and resource-intensive yet necessary undertaking. When the infrastructure fails or slows down, tools are required to quickly pinpoint the root cause, suppress all symptomatic faults, prioritize based on business impact, and aid in the troubleshooting and repair process to accelerate service restoration. To ensure the performance and availability of the infrastructure, most companies employ a dual approach of highly available, fault-tolerant, load-balancing designs for infrastructure devices and communication paths, and a management solution to ensure proper operation. In fact, the job of the management solution is further complicated by today's high-availability environments. The management solution must understand the load-balancing capacity, track primary and fault-tolerant backup paths, and understand when redundant systems are active. The investment in the management solution is as important as the investment in the infrastructure itself.
Problem Prediction and Prevention

Management software should help predict or prevent problems. CA's out-of-the-box utilization, performance, and response time thresholds can be used to act as an early warning system when a problem is about to happen or when a service level guarantee is about to be violated. While these thresholds can obviously be tuned for a specific customer environment, it is also important to have out-of-the-box thresholds that are relevant from the start of your monitoring baselines. Before you can begin the true task of troubleshooting, you must isolate the problem. Simply being aware of the problem and collecting the data is not sufficient. To effectively triage the issue, you need to determine the location or source of the problem (and where the problem does not exist). If multiple problems are occurring simultaneously, you should be able to automatically prioritize issues based on impacted customers, services, or infrastructure devices. It is far too costly to rely on human intervention to determine the root cause of problems, and to sift through an unending stream of symptomatic problems. Every minute that you devote to isolating the problem is a minute lost to solving the problem.
Business Impact
The best management solutions will not only be able to identify problems, isolate them and suppress all symptomatic events, but also identify all impacted components, services, and customers. For the business, understanding impact is as important as understanding the root cause. When outages or performance degradations occur, business services and the users of those business services are affected. When this happens, people typically cannot do their jobs effectively, resulting in lower productivity or efficiency. Sometimes the services provided by the company to their customers are affected, which results in lost revenue, SLA penalties, and lost customers. For large organizations, it is possible for many problems to occur at the same time. Knowing the root cause allows an organization to efficiently fix problems without wasting time pursuing symptoms. Being aware of the impact allows an organization to prioritize response efforts and effectively provide help desk services.
Event Correlation and RCA: A Three-Pronged Approach

SPECTRUM's Root Cause Analysis (RCA) is based on the following fundamentals:
- The system must understand the relationship between information within the infrastructure and the systems/applications/services/customers that depend on that information.
- The system must be proactive in its monitoring, rather than rely on event streams.
- The system must distinguish between a plethora of events and meaningful alarms.
- The system must scale and adapt to the requirements of growing and dynamic infrastructures.
- The solution must work across multiple-vendor and technology solutions.
- The system must allow for extensions and customization.
Management software applications efficiently performing RCA should raise an alarm for the root condition and should prevent any symptom/effect resulting from the root condition from being presented as a unique or separate alarm. SPECTRUM relies on multiple techniques working cooperatively to deliver its event correlation and RCA capabilities. These include IMT, EMS, and CCT. Each of these techniques helps to diagnose a diverse and often unpredictable set of problems.
RCA
RCA can be simply defined as the act of interpreting a set of symptoms/events and pinpointing the source that is causing them. Within the context of infrastructure management, events are occurrences of significant happenings presented from a source to other systems. Events are typically local to a source, and without proper context, do not always help with RCA. Correlation of events is often required to determine if an actionable condition or problem exists, but is almost always required to isolate problems, identify all impacted components and services, and suppress all symptomatic events. Many components provide events in many forms: SNMP traps, syslog messages, application log file entries, TL1 events, ASCII streams, and so on. More sophisticated management systems such as SPECTRUM, eHealth, and eHealth for Voice can also generate events based on proactive polling of component status, parameter-based threshold violations, response time measurement threshold violations, deviations from historical performance, and health analysis. SPECTRUM's RCA is the automated process of troubleshooting the infrastructure and identifying the managed elements that have failed to perform their function. The goal of SPECTRUM's RCA is straightforward: identify a single source of failure, the Root Cause, and generate the appropriate actionable alarm for the failed managed element.
Inductive Modeling Technology

The core of SPECTRUM's RCA solution is its patented IMT. IMT uses a powerful object-oriented modeling paradigm with model-based reasoning analytics. In SPECTRUM, IMT is most often used for physical and logical topology analysis, as SPECTRUM can automatically map topological relationships through its auto-discovery engine. In SPECTRUM, a model is the software representation of a real-world managed element, or a component of that managed element. This representation allows SPECTRUM not only to investigate and query an individual element within the network, but also to establish relationships between elements and recognize them as part of a larger system. IMT's RCA is based on a sophisticated system of models, relationships, and behaviors that create a software representation of the infrastructure. Decisions concerning which element is at fault are not determined by looking at a single element alone. Instead, the relationship between the elements is understood and the conditions of related managed elements are factored into the analysis. Models are in direct communication with their real-world counterparts, enabling SPECTRUM to not only listen, but proactively query for health status or additional diagnostic information. Models are described by their attributes, behaviors, relationships to other models, and algorithmic intelligence. Intelligent analysis is enabled through the collaboration of models in a system. This collaboration enables correlation of the symptoms, suppression of unnecessary alarms, and impact analysis of affected users, customers, and services. Collaboration includes the ability to exchange information and initiate processing between any models within the modeling system. A model that is making a request to another model may, in turn, trigger that model to make requests to other models, and so on. Relationships between models provide a context for collaboration.
Collaboration between models enables the following:
- Correlation of the symptoms
- Suppression of unnecessary/symptomatic alarms
- Impact analysis
A simple example of IMT in action can be demonstrated by a network router port transition from UP to DOWN. If a port model receives a LINK DOWN trap, it has intelligence to react by performing a status query to determine if the port is actually down. If it is, in fact, DOWN, it consults the system of models to determine if the port has lower layer sub-interfaces. If any of the lower layer sub-interfaces are also DOWN, only the condition of the lower layer port will be raised as an alarm. An application of this example can be described by several Frame Relay DLCIs transitioning to INACTIVE. If the Frame Relay port is DOWN, IMT will suppress the symptomatic DLCI INACTIVE conditions and raise an alarm on the Frame Relay port model. Additionally, when the port transitions to DOWN, IMT will query the status of the connected Network Elements (NEs) and if those are also DOWN, those conditions will be considered symptomatic of the port DOWN, will be suppressed, and will be identified as impacted by the port DOWN alarm. Root cause and impact are determined through IMT's ability to both listen and talk to the infrastructure.
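The port DOWN example above can be illustrated with a small sketch. This is not SPECTRUM code or its API; it is a simplified illustration, with hypothetical class and function names, of the reasoning described: verify the trap with a status query, look for a failed lower-layer sub-interface, and treat dependent DLCIs and connected NEs that are also down as symptoms of a single alarm.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Model:
    """Hypothetical stand-in for a model of a port, DLCI, or network element."""
    name: str
    lower_layers: List["Model"] = field(default_factory=list)  # lower-layer sub-interfaces
    dependents: List["Model"] = field(default_factory=list)    # DLCIs, connected NEs


@dataclass
class Alarm:
    root: str            # element on which the single alarm is raised
    impacted: List[str]  # symptomatic conditions suppressed under that alarm


def handle_link_down(port: Model, is_down: Callable[[Model], bool]) -> Optional[Alarm]:
    """React to a LINK DOWN trap for `port`; `is_down` simulates a status query."""
    if not is_down(port):
        return None  # the trap is not confirmed by the status query, so no alarm

    root, symptomatic = port, []
    # If a lower-layer sub-interface is also down, the alarm belongs there.
    lower = next((m for m in root.lower_layers if is_down(m)), None)
    while lower is not None:
        symptomatic.append(root.name)
        root = lower
        lower = next((m for m in root.lower_layers if is_down(m)), None)

    # Dependents of the port that are also down are symptoms, not separate alarms.
    symptomatic.extend(d.name for d in port.dependents if is_down(d))
    return Alarm(root=root.name, impacted=symptomatic)


# Frame Relay example: the port is down, so its DLCIs and the connected NE are symptoms.
status = {"fr-port": "DOWN", "dlci-100": "DOWN", "dlci-200": "DOWN", "remote-router": "DOWN"}
fr_port = Model("fr-port", dependents=[Model("dlci-100"), Model("dlci-200"), Model("remote-router")])
alarm = handle_link_down(fr_port, lambda m: status[m.name] == "DOWN")
print(alarm.root, alarm.impacted)
```

In this sketch one alarm is raised on the Frame Relay port, and the DLCI and NE conditions are listed as impacted rather than presented as separate alarms.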
Event Management System

At times, event streams local to a specific source are the only source of management information. Any one event may or may not be a significant occurrence, but in the context of other events, information, or time, it may be an actionable condition. Event Rules in SPECTRUM's Event Management System provide a more complex decision-making system to indicate how events should be processed. You can apply Event Rules to look for a series of events to occur on a model in a certain pattern, within a specific timeframe, or with certain data value ranges. You can use Event Rules to generate other events or even alarms. If events occur that meet the preconditions of a rule, SPECTRUM may do the following:
- Generate another event, allowing cascading events.
- Log the event for later reporting/troubleshooting purposes.
- Promote the event into an actionable alarm.
SPECTRUM provides six customizable Event Rule types that form the basis of the Event Management System rules-based engine. These rule types are building blocks that can be used individually or cooperatively to raise an alarm for the simplest or the most sophisticated event-oriented scenarios. This Event Management System rules engine allows for the correlation of event frequency/duration, event sequence, and event coincidence. The Event Rule types are as follows:
- Event Pair (Event Coincidence): This rule type applies when you expect two events that you define to occur as a pair, in sequence. If the second event in the series does not occur, this may indicate a problem. Event rules based on the Event Pair rule type generate a new, more relevant event when an event occurs without its paired event. It is possible for other events to occur between the specified event pair without affecting this event rule.
- Event Rate Counter (Event Frequency): This rule type generates a new event based on events that occur at a specified rate in a specified time span. A few events of a certain type might not be a problem, but if the number of these events reaches a certain threshold within a specified time period, notification is required. SPECTRUM does not generate additional events if the rate stays at or above the threshold. If the rate drops below the threshold and then subsequently rises above the threshold, it generates another event. The Event Rate Counter type is best suited for detecting a long, sustained burst of events.
- Event Rate Window (Event Frequency): This rule type generates a new event when a number of the same events are generated in a specified time period. The Event Rate Window type is best suited for accurately detecting shorter bursts of events. It monitors an event that is not significant if it occurs occasionally, but is significant if it happens frequently within a short period of time. If an event occurs a few times during the day, a problem may not exist. If an event occurs five times in one minute, perhaps that is a condition for which you want to be notified. If the event occurs above a certain rate, SPECTRUM generates another event. SPECTRUM will not generate additional events if the rate stays at or above the threshold. If the rate drops below the threshold and then subsequently rises above the threshold, it generates another event.
- Event Sequence (Event Sequence): This rule type generates an event when a particular ordered sequence of events occurs that might be significant in your environment. This sequence can include any number and any type of events. When the sequence is detected in the given period of time, SPECTRUM generates a new event.
- Event Combo (Event Coincidence): This rule type generates a new event when a certain combination of events occurs in any order. The combination can include any number and type of events. When the combination is detected within a given time period, SPECTRUM generates a new event.
- Event Condition (Event Coincidence): This rule type generates an event based on a conditional expression. As part of SPECTRUM's trust-but-verify methodology, a series of conditional expressions can be listed within the event rule, and the first expression that is found to be TRUE will generate the event specified with the condition. You can construct rules to provide correlation through a combination of evaluating event data with IMT model data (including attributes which can be read directly from the remote managed element). For example, if a trap is received notifying the management system of memory buffer overload, to validate that an alarm condition has occurred, an Event Condition rule can initiate a request to the device to check actual memory utilization.
SPECTRUM implements a number of event rules out-of-box by applying one or more of the event rule types to event streams. You can create or customize event rules using any of the rule types and apply these Event Rules on other event streams. Further implementation of event rules using the Event Management System is discussed later in this paper.
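Two of the rule types described above can be sketched to make their behavior concrete. The code is illustrative only and does not reflect how SPECTRUM defines or stores Event Rules; the class name, function name, and event names are hypothetical. The rate-window sketch fires once when the threshold is reached and stays silent until the rate has dropped below the threshold and risen again. The pair sketch assumes a time limit for the paired event to arrive, which is an assumption of this example.

```python
from collections import deque
from typing import List, Tuple


class EventRateWindow:
    """Fire when `count` events fall within any `window_seconds` span."""

    def __init__(self, count: int, window_seconds: float) -> None:
        self.count = count
        self.window = window_seconds
        self.times: deque = deque()
        self.above = False  # True while the rate is at or above the threshold

    def observe(self, timestamp: float) -> bool:
        """Record one event; return True if a new event should be generated."""
        self.times.append(timestamp)
        # Discard events that have aged out of the window.
        while timestamp - self.times[0] >= self.window:
            self.times.popleft()
        now_above = len(self.times) >= self.count
        fire = now_above and not self.above  # only on the transition to "above"
        self.above = now_above
        return fire


def unpaired_events(events: List[Tuple[float, str]], first: str, second: str,
                    timeout: float) -> List[float]:
    """Timestamps of `first` events not followed by `second` within `timeout`.

    `events` is a time-ordered list of (timestamp, event_type) tuples. Other
    event types occurring in between do not affect the result.
    """
    missing = []
    for i, (t, kind) in enumerate(events):
        if kind != first:
            continue
        paired = any(k == second and u <= t + timeout for u, k in events[i + 1:])
        if not paired:
            missing.append(t)
    return missing


# Five occurrences within one minute, as in the example in the text.
rule = EventRateWindow(count=5, window_seconds=60)
fired = [rule.observe(t) for t in (0, 10, 20, 30, 40, 45, 300, 301, 302, 303, 304)]
print(fired)
```

In the sample run, the rule fires on the fifth event of each burst and not on the sixth event of the first burst, because the rate has not yet dropped below the threshold.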
Condition Correlation
To perform more complex user-defined or user-controlled correlations, SPECTRUM offers a policy-based CCT that enables the following:
- Creation of correlation policies
- Creation of correlation domains
- Correlation of seemingly disparate event streams or conditions
- Correlation across sets of managed elements
- Correlation within managed domains
- Correlation across sets of managed domains
- Correlation of component conditions as they map to higher order concepts such as business services or customer access
Several important concepts relate to condition correlation:
- Conditions: A condition is similar to state. An event/action can set a condition and it can clear it. It is also possible to have an event set a condition but require a user-based action to clear the condition. The condition exists from the time it is set until the time it is cleared. A very simple example of a condition is a port down condition. The port down condition will exist for a particular interface from the time that the LINK DOWN trap or set event (such as a failed status poll) is received until the time the LINK UP trap or clear event (such as a successful status poll) is received. A number of conditions that may be useful for establishing domain level correlations are defined out-of-box in SPECTRUM, and you can add more.
- Seemingly Disparate Conditions: Many devices in an IT infrastructure provide a specific function. The device-level function is often without context as it relates to the functions of other devices/components. Most managed elements can emit event streams, but those event streams are local to each component. A simple example is when a Response Time Management system identifies a condition of a test result exceeding a threshold. At the same time, an Element Management System may identify a condition of a router port exceeding a transmit bandwidth threshold. These conditions are seemingly disparate as they are created independently and without context or knowledge of each other. In reality, the two are often closely related; that is, an overutilized port could be the cause of the response degradation.
- Rule Patterns: Rule Patterns associate conditions when specific criteria are met. A simple example is a port down condition caused by a board pulled condition. The two conditions are likely related if the port and board have the same slot number. The following diagram illustrates this rule pattern. A rule pattern can result in the creation of an actionable alarm or the suppression of symptomatic alarms.
- Correlation Domains: You can use a Correlation Domain to both define and limit the scope of one or more Correlation Policies. You can apply it to a specific Service. For example, in the Cable Broadband environment, a return path monitoring system may detect a return path failure in a certain geographic service area. This return path failure condition is causing subscribers' high-speed cable modems to become unreachable and Video on Demand (VoD) pay-per-view streams to fail. The knowledge that the return path failure, the modem problems, and the failed video streams are all in the same correlation domain is essential to correlating the events and ultimately identifying the root cause. However, it is also important to have the ability to distinguish that a return path failure condition occurring in one correlation domain (Philadelphia) should not be correlated with VoD stream failure conditions occurring in a different correlation domain (New York).
- Correlation Policies: You can bundle multiple Rule Patterns into Correlation Policies. You can then apply Correlation Policies to a Service or Correlation Domain. For example, you can create a bundle of rule patterns applicable to OSPF and label them the OSPF Correlation Policy. You can apply the OSPF Correlation Policy to each Correlation Domain, where each autonomous OSPF region and the supporting routers in that region define the Correlation Domain. As another example, you could define a Correlation Policy based on a set of rule patterns that operate within the confines of an MPLS/BGP VPN, labeled as the Intra-VPN Policy, and apply it to all modeled VPNs. Whenever you add a rule to a Correlation Policy, or delete one from it, SPECTRUM automatically updates all related Correlation Domains immediately. You can apply multiple Correlation Policies to any Correlation Domain, and apply a Correlation Policy to many Correlation Domains.
Condition-based correlations are very powerful and provide a mechanism to develop Correlation Policies and apply them to Correlation Domains. When you apply them to Service Level Management, Correlation Policies are similar to metrics of an SLA, and Correlation Domains are similar to service, customer, or geographical groupings. Occasionally, the only way to infer a causal relationship between two or more seemingly disparate conditions is when those conditions occur in a common Correlation Domain. These mechanisms are necessary when SPECTRUM cannot discover causal relationships through interrogations.
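The concepts above (conditions, rule patterns, correlation policies, and correlation domains) can be tied together in a short sketch. It is a conceptual illustration only, not SPECTRUM's policy format or API: the class names, condition names, and matching functions are hypothetical. A policy is modeled as a list of rule patterns, and a symptom is suppressed only when a matching cause exists in the same correlation domain.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Condition:
    """An active (set, not yet cleared) condition within a correlation domain."""
    kind: str
    domain: str
    device: str = ""
    slot: int = -1


@dataclass(frozen=True)
class RulePattern:
    """Associates a cause condition with a symptom when `related` holds."""
    cause: str
    symptom: str
    related: Callable[[Condition, Condition], bool]


def correlate(conditions: List[Condition],
              policy: List[RulePattern]) -> Tuple[List[Condition], List[Condition]]:
    """Split active conditions into (actionable, suppressed) using one policy."""
    suppressed = []
    for symptom in conditions:
        for cause in conditions:
            if cause is symptom or cause.domain != symptom.domain:
                continue  # correlation is limited to a single domain
            if any(p.cause == cause.kind and p.symptom == symptom.kind
                   and p.related(cause, symptom) for p in policy):
                suppressed.append(symptom)
                break
    actionable = [c for c in conditions if c not in suppressed]
    return actionable, suppressed


same_slot = lambda cause, symptom: cause.device == symptom.device and cause.slot == symptom.slot
anywhere_in_domain = lambda cause, symptom: True

policy = [
    RulePattern("BOARD_PULLED", "PORT_DOWN", same_slot),
    RulePattern("RETURN_PATH_FAILURE", "MODEM_UNREACHABLE", anywhere_in_domain),
    RulePattern("RETURN_PATH_FAILURE", "VOD_STREAM_FAILED", anywhere_in_domain),
]
active = [
    Condition("RETURN_PATH_FAILURE", "Philadelphia"),
    Condition("MODEM_UNREACHABLE", "Philadelphia", "modem-1"),
    Condition("VOD_STREAM_FAILED", "New York", "vod-7"),
    Condition("BOARD_PULLED", "Philadelphia", "rtr1", 3),
    Condition("PORT_DOWN", "Philadelphia", "rtr1", 3),
    Condition("PORT_DOWN", "Philadelphia", "rtr1", 4),
]
actionable, suppressed = correlate(active, policy)
print([c.kind for c in suppressed])
```

In the sample data, the New York VoD failure stays actionable because the return path failure belongs to the Philadelphia domain, and the port in slot 4 stays actionable because only the board in slot 3 was pulled.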
Fault Scenarios
Out-of-box, SPECTRUM addresses a wide range of scenarios for which it can perform RCA. This section provides specific scenarios where the techniques described in the previous section are employed to determine root cause and perform impact analysis. For the sake of simplicity and brevity, the detail will be limited to the basic processing. Also, for the purpose of the discussion and figures, the following table shows the color of alarms that are associated with the icon status of SPECTRUM models at any given time.
Communication Outages and Impacts

Communication outages are types of faults often described as black-outs or hard faults. With these types of faults, one or more communication paths are degraded to the point that traffic can no longer pass. The fault could be caused by many situations including broken copper/fiber cables/connections, improperly configured routers/switches, hardware failures, severe performance problems, security attacks, and so on. With these hard communication failures, limited information is available to the management system as it is unable to exchange information with one or more managed elements. With SPECTRUM's sophisticated system of models, relationships, and behaviors available through IMT, SPECTRUM can infer the fault and impact. IMT inference algorithms are also called Inference Handlers. A set of Inference Handlers designed for a purpose is referred to as an Intelligence Circuit, or simply Intelligence.
How SPECTRUM's Intelligence Isolates Communication Outages

SPECTRUM offers powerful capabilities that can help you to identify the real sources of problems in the network. For many management solutions, the steps to achieve this capability are often manual and very time-intensive. With SPECTRUM, however, many of these steps are performed automatically by the SPECTRUM software. The process that SPECTRUM uses to identify and isolate outages is as follows:
1. Use SPECTRUM discovery to build a model of your infrastructure that shows the resources in your network and how they are connected.
2. Upon receipt of a problem event, SPECTRUM checks the status of closely-connected resources to determine whether they have problems.
3. SPECTRUM analyzes the status of the resources to identify the likely root cause of the problem.
4. SPECTRUM suppresses alarms that are symptoms of the root cause, but not the cause itself.
5. SPECTRUM evaluates the severity of the problem to help prioritize the problem among any other reported problems in the network.
The following sections describe these SPECTRUM capabilities in more detail.

BUILD THE MODEL WITH AUTODISCOVERY
An accurate representation of the infrastructure is critical for determining the fault and the impact of the fault. SPECTRUM's modeling system can represent not only a wide array of multi-vendor equipment, but also a wide range of technologies and connections that can exist between infrastructure elements. SPECTRUM has specific solutions for discovering multi-path networks over a variety of technologies supporting many different architectures. It supports meshed and redundant, physical and logical topologies based on ATM, Ethernet, Frame Relay, HSRP, ISDN, ISL, MPLS, Multicast, PPP, VoIP, VPN, VLAN, and 802.11 wireless environments, as well as legacy technologies such as FDDI and Token Ring. SPECTRUM's modeling is extremely extensible and can be used to model OSI Layers 1-7 in a communication infrastructure.

SPECTRUM provides four different methods for building the physical and logical topology connectivity model for any given infrastructure:

AutoDiscovery: SPECTRUM's AutoDiscovery application automatically and dynamically interrogates the managed infrastructure about its physical and logical relationships. This approach to AutoDiscovery was patented in 1996, and SPECTRUM was the industry's first product to discover Layer 2 switch connectivity. AutoDiscovery works in two distinct phases (each phase contains many stages that are not covered here). The first phase is Discovery. When initiated (as described in Chapter 5), AutoDiscovery automatically discovers the elements that exist in the infrastructure. This provides SPECTRUM with an inventory of elements that could be managed. The second phase is Modeling. AutoDiscovery uses management and discovery protocols to query the elements it has found and gather the information needed to determine the Layer 2 and Layer 3 connectivity between managed elements. For example, AutoDiscovery uses SNMP to examine route tables, bridge tables, and interface tables, but also uses traffic analysis and vendor-proprietary discovery protocols such as Cisco's CDP. AutoDiscovery is a very thorough and automated mechanism for building the infrastructure model.

Modeling Gateway: The Modeling Gateway imports a description of the entire infrastructure's components, as well as physical and logical connectivity information, from external sources such as provisioning systems or network topology databases.

Command line interface or programmatic APIs: You can use the command line interface or the programmatic APIs to build a custom integration or application that imports information from external sources.

Graphical user interfaces: Users can point, click, and drag and drop to build the model manually.

SPECTRUM's modeling scheme allows a single managed element to be logically divided into any number of sub-models. This collection of models and the relationships between them is often referred to as the semantic data model for that type of managed element. Thus, a typical semantic data model for a networking device may include a chassis model with board models related to the chassis. Physical interface models would be associated with the board models. Each physical interface model may have a set of subinterface models associated below it. SPECTRUM has a set of well-defined associations that define how different semantic data model sets interact with one another. When SPECTRUM represents the connectivity between two devices, a relationship is established not only between the two ports that form the link between them, but also between device models and the corresponding interface and port models of other devices, as shown in the following figure.
START THE PROBLEM ANALYSIS
SPECTRUM can begin to solve a problem proactively upon receipt of a single symptom. Because many problems share the same set of symptoms, SPECTRUM must perform further analysis to determine the root cause. For communication outages, the analysis begins when a model in SPECTRUM recognizes the communication failures through failed polling, traps, events, performance threshold violations, or lack of response. SPECTRUM automatically validates the communication failures through retries, alternative protocols, and alternative path checking as part of its "trust but verify" methodology. The model that raised the problem and started the intelligence is called the initiator, although more than one model can trigger the intelligence. The initiator model's intelligence requests a list of other models that are directly connected to it. These connected models are referred to as the initiator model's neighbors. For example, the following figure shows five models, where Model B is the initiator, and Models A, C, D, and E are neighbors.
With a list of neighbors identified, the intelligence directs each neighbor model to check its current status. This check is referred to as the "Are You OK?" check. "OK" is a relative term: the set of performance and availability attributes that define it varies from model to model, based on the real-world capabilities of the device that the model represents. When a model is asked "Are You OK?", the model can initiate a variety of tests and checks to verify its current operational status. For example, with most SNMP-managed elements, the check is typically a combination of SNMP requests, but it could be as involved as interrogating an Element Management System or as simple as an ICMP ping. A comprehensive check could include threshold performance calculations or execution of response time tests.
Each neighbor model returns an answer to "Are You OK?".
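The neighbor query and status check can be pictured with a small sketch. This is illustrative Python, not SPECTRUM code: the `Model` class, the `connect` and `ask_neighbors` helpers, and the `reachable` set are hypothetical names introduced here, and the health check is reduced to a simple lookup where a real model would issue SNMP requests, pings, or EMS queries.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Model:
    """Hypothetical stand-in for a model of one managed element."""
    name: str
    # Each model supplies its own check (SNMP request, ICMP ping, EMS query, ...).
    is_ok: Callable[[], bool]
    neighbors: List["Model"] = field(default_factory=list)


def connect(a: Model, b: Model) -> None:
    """Record a bidirectional connection between two models."""
    a.neighbors.append(b)
    b.neighbors.append(a)


def ask_neighbors(initiator: Model) -> Dict[str, bool]:
    """Ask every directly connected neighbor 'Are You OK?' and collect the answers."""
    return {n.name: n.is_ok() for n in initiator.neighbors}


# Topology from the example: B is the initiator; A, C, D, and E are its neighbors.
reachable = {"A"}  # assumption for illustration: only Model A still responds
models = {
    name: Model(name, is_ok=lambda name=name: name in reachable)
    for name in "ABCDE"
}
for other in "ACDE":
    connect(models["B"], models[other])

answers = ask_neighbors(models["B"])
print(answers)
```

Keeping the check as a per-model callable mirrors the point that "OK" means something different for each kind of managed element.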
LOCATE THE ROOT CAUSE: FAULT ISOLATION
If the initiator model has a neighbor that responds that it is OK, such as Model A in the previous figure, SPECTRUM can infer that the problem lies between the unaffected neighbor and the affected initiator (Model B). In this case, the initiator model that triggered the intelligence is a likely culprit for this particular infrastructure failure. As a result, SPECTRUM raises a critical alarm on the initiator model, which is considered the Root Cause alarm, as shown in the next figure.
HIDE THE NOISE OF SYMPTOMATIC PROBLEMS WITH ALARM SUPPRESSION
As the analysis continues beyond isolating the device at fault (Model B), the next step is to analyze and suppress reporting of the effects of the fault. This is the goal of intelligent Alarm Suppression. If a neighbor (such as Models C, D, or E) of the initiator model responds that it is not OK, this neighbor is considered to be affected by the failure occurring elsewhere in the infrastructure. As a result, SPECTRUM places these models into a suppressed condition (Grey Color) because the alarms are symptomatic of a problem elsewhere. While these resources are experiencing problems, they are not the root cause problem; they will likely be fixed when operators have addressed the problems that are affecting Model B.
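The isolation and suppression logic described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the product's algorithm: the function name `isolate_fault` and the condition labels are hypothetical, and the case where no neighbor responds as OK is not described in this section, so the sketch simply raises no root cause alarm for it.

```python
from typing import Dict, List, Tuple


def isolate_fault(initiator: str, answers: Dict[str, bool]) -> Tuple[Dict[str, str], List[str]]:
    """Return (condition per model, list of suppressed models) for one initiator.

    `answers` maps each neighbor name to its reply to the status check.
    """
    conditions: Dict[str, str] = {}
    suppressed: List[str] = []
    if any(answers.values()):
        # A healthy neighbor exists, so the fault lies between it and the
        # initiator: the initiator receives the root cause (critical) alarm.
        conditions[initiator] = "critical"
    # Assumption: with no healthy neighbor, no root cause alarm is raised here.
    for name, ok in answers.items():
        if not ok:
            # Unhealthy neighbors are symptoms of the fault, so they are suppressed.
            conditions[name] = "suppressed"
            suppressed.append(name)
    return conditions, suppressed


conditions, suppressed = isolate_fault(
    "B", {"A": True, "C": False, "D": False, "E": False}
)
print(conditions)
print(suppressed)
```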
PRIORITIZE THE PROBLEM: IMPACT ANALYSIS
SPECTRUM continues to analyze the total impact of the fault because of its ability to understand that the individual models exist as part of a larger network of models representing the managed infrastructure. As such, the intelligence will analyze each Fault Domain, which is the collection of models with suppressed alarms related to the same failure. These impacted models are linked to the root fault for presentation and analysis. The intelligence provides a measurement of the impact that this fault is having by examining the models that are included within this Fault Domain and calculating a measurement that serves as the impact severity. The impact severity value provides a ranking system so that operators can quickly assess the relative impact of each particular infrastructure fault in order to prioritize their corrective actions.
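As a rough illustration of ranking faults by impact, the sketch below sums a weight for every model in a Fault Domain and sorts the root faults by that total. The actual impact severity calculation is not given in this section, so the `WEIGHTS` table, the model type names, and the summing rule are all assumptions made for the example.

```python
from typing import Dict, List

# Hypothetical per-model weights; the real calculation is not published here.
WEIGHTS = {"router": 5, "switch": 3, "server": 2, "workstation": 1}


def impact_severity(fault_domain: List[str]) -> int:
    """Sum the assumed weights of the model types in one Fault Domain."""
    return sum(WEIGHTS.get(model_type, 1) for model_type in fault_domain)


def rank_faults(fault_domains: Dict[str, List[str]]) -> List[str]:
    """Order root faults from highest to lowest impact severity."""
    return sorted(
        fault_domains,
        key=lambda root: impact_severity(fault_domains[root]),
        reverse=True,
    )


# Each key is a root fault; each value lists the suppressed models linked to it.
domains = {
    "Model X": ["workstation", "workstation"],
    "Model B": ["router", "switch", "switch", "server"],
}
ranking = rank_faults(domains)
print(ranking)
```

Whatever the real weighting, the operator-facing result is the same kind of ordered list: the fault with the largest affected domain is handled first.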
Event Management System

Event Rules provide additional processing and correlation of event streams. Event Rule processing is required for situations in which the event stream is the only source of management information. For example, SPECTRUM's Southbound Gateway enables SPECTRUM to accept event streams from devices and applications not directly monitored by SPECTRUM, such as the eHealth for Voice PBX devices and message servers. You can also apply event rules to perform intelligent processing of events within certain contexts: frequency, sequence, and combination. As described earlier in this chapter, six event rule types are available:

Event Pair: Expected pair event or missing pair event in a specified time span.
Event Rate Counter: Events at a specified rate in a specified time span.
Event Rate Window: Number of events in a specified time span.
Event Sequence: Ordered sequence of events in a specified time span.
Event Combo: Two or more events, in any order, in a specified time span.
Event Condition: Events parsed for specific data to allow creation of new events based on comparisons of variable bindings, attributes, constants, and so on.

SPECTRUM provides many out-of-the-box event rules, and also provides easy-to-use methods for creating new rules using one or more of the event rule types. This section highlights a couple of out-of-the-box event rules and a few customer examples of event rule applications.
OUT-OF-BOX EVENT PAIR RULE
SPECTRUM has the ability to interpret Cisco syslog messages as event streams. Each syslog message is generated on behalf of a managed switch or router and is directed to the SPECTRUM model representing that managed element. One of the many Cisco syslog messages indicates a new configuration has been loaded into the router. The Reload message should always be followed by a Restart message, indicating the device has been restarted to adopt the newly loaded configuration. If not, a failure during reload is probable. SPECTRUM uses an event rule based on the Event Pair rule type to raise an alarm with cause ERROR DURING ROUTER RELOAD if it does not receive the Restart message within 15 minutes of the Reload message. The following diagram illustrates the events and timing.
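The missing-pair logic can be sketched as follows. This is a minimal illustration, not SPECTRUM's rule syntax: the function `missing_pair`, the event names "Reload" and "Restart", and the tuple layout are assumptions chosen to match the description above.

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple

Event = Tuple[datetime, str]  # (timestamp, event name)


def missing_pair(events: List[Event], first: str, second: str,
                 window: timedelta, now: datetime) -> Optional[datetime]:
    """Return the time of the earliest `first` event that was not followed by
    `second` within `window`; return None if no pair is overdue."""
    ordered = sorted(events)
    for i, (ts, name) in enumerate(ordered):
        if name != first:
            continue
        deadline = ts + window
        answered = any(
            n == second and ts <= t <= deadline for t, n in ordered[i + 1:]
        )
        # Only report once the window has fully elapsed.
        if not answered and now > deadline:
            return ts
    return None


t0 = datetime(2024, 1, 1, 12, 0)
window = timedelta(minutes=15)
now = t0 + timedelta(minutes=20)

ok_stream = [(t0, "Reload"), (t0 + timedelta(minutes=3), "Restart")]
bad_stream = [(t0, "Reload")]

ok_result = missing_pair(ok_stream, "Reload", "Restart", window, now)
bad_result = missing_pair(bad_stream, "Reload", "Restart", window, now)
alarm = "ERROR DURING ROUTER RELOAD" if bad_result is not None else None
print(ok_result, bad_result, alarm)
```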
MANAGING SECURITY EVENTS USING AN EVENT RATE COUNTER RULE
SPECTRUM can collect event feeds from many sources. Some customers send events from security devices such as Intrusion Detection Systems (IDSs) and firewalls. These devices can generate millions of log file entries. These customers could use an Event Rate Counter rule to distinguish between sporadic client connection rejections and real security attacks. The rule generates a critical alarm if 20 or more connection failures occur in less than one minute, as shown in the following figure.
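One way to picture a rate rule of this kind is a sliding window over event timestamps. The sketch below is an assumption about how such a counter could work, not the product's implementation; the `RateCounter` class is hypothetical, timestamps are plain seconds, and events are assumed to arrive in time order.

```python
from collections import deque
from typing import Deque


class RateCounter:
    """Hypothetical sliding-window counter: fires at `threshold` events
    observed in less than `window_seconds`."""

    def __init__(self, threshold: int = 20, window_seconds: float = 60.0) -> None:
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._times: Deque[float] = deque()

    def observe(self, timestamp: float) -> bool:
        """Record one event; return True when the alarm condition is met."""
        self._times.append(timestamp)
        # Drop events that fall outside the window ending at `timestamp`.
        while timestamp - self._times[0] >= self.window_seconds:
            self._times.popleft()
        return len(self._times) >= self.threshold


# Sporadic rejections: one every 10 seconds never reaches the threshold.
sporadic = RateCounter()
sporadic_alarms = [sporadic.observe(10.0 * i) for i in range(30)]

# Burst: one failure per second reaches 20 events within the window.
burst = RateCounter()
burst_alarms = [burst.observe(1.0 * i) for i in range(20)]
print(any(sporadic_alarms), burst_alarms[-1])
```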
MANAGING SERVER MEMORY GROWTH USING AN EVENT SEQUENCE RULE
A common problem with some applications is the inability to manage memory usage. Some applications use system memory and never free it for other applications to use. This can degrade performance on the host machine, and eventually the memory-leaking application will fail. As one example, if you have a web server application with a history of slow memory leak problems, you might schedule a reboot once a week during a planned maintenance window to compensate for the memory consumption. However, if the memory leak occurs more quickly than usual, which is a deviation from normal behavior, you might want to perform an emergency reboot before the scheduled maintenance. You can employ a combination of progressive SPECTRUM thresholds with an Event Sequence rule to monitor for abnormal behavior, or you could use eHealth Live Health analysis to report the deviation from normal memory consumption. Using the SPECTRUM thresholds as an example, you could set monitoring to create events as memory usage passes threshold points of 50%, 75%, and 90%. If those threshold points are reached in a period of less than one week, SPECTRUM generates an alarm to provide notification to reboot the server prior to the scheduled maintenance window, as shown in the following diagram.
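The ordered-sequence check can be sketched as below. This is an illustrative approximation of an Event Sequence rule, with hypothetical event names (`mem50`, `mem75`, `mem90`) standing in for the three threshold events; it is not the product's rule definition format.

```python
from datetime import datetime, timedelta
from typing import List, Sequence, Tuple


def sequence_within(events: List[Tuple[datetime, str]],
                    expected: Sequence[str], window: timedelta) -> bool:
    """True if the events in `expected` occur in order, with the whole
    sequence completing in less than `window` after its first event."""
    ordered = sorted(events)
    for i, (start, name) in enumerate(ordered):
        if name != expected[0]:
            continue
        remaining = list(expected[1:])
        for ts, later in ordered[i + 1:]:
            if not remaining or ts - start >= window:
                break
            if later == remaining[0]:
                remaining.pop(0)
        if not remaining:
            return True
    return False


day0 = datetime(2024, 1, 1)
week = timedelta(weeks=1)
thresholds = ["mem50", "mem75", "mem90"]

fast_leak = [(day0, "mem50"),
             (day0 + timedelta(days=2), "mem75"),
             (day0 + timedelta(days=4), "mem90")]
slow_leak = [(day0, "mem50"),
             (day0 + timedelta(days=5), "mem75"),
             (day0 + timedelta(days=9), "mem90")]

fast_alarm = sequence_within(fast_leak, thresholds, week)
slow_alarm = sequence_within(slow_leak, thresholds, week)
print(fast_alarm, slow_alarm)
```

Only the fast leak raises the alarm; the slow leak reaches 90% after the weekly reboot would already have happened.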
AN OUT-OF-BOX EVENT CONDITION RULE COMBINED WITH AN EVENT PAIR RULE
RFC 2668 (MIB for IEEE 802.3 Medium Attachment Units) provides management definitions for Ethernet hubs. The RFC defines an SNMP trap used to notify a management system when the jabber state of an interface changes. Jabber occurs when a device that is experiencing circuitry or logic failure continuously sends random (garbage) data. The trap identifier simply indicates a change in condition, and the variable data portion of the trap indicates whether jabbering has started or stopped. SPECTRUM applies an Event Condition rule to create distinct start and stop events by examining the variable portion of the trap, and uses an Event Pair rule to create an alarm if the jabbering start event is not closely followed by a jabbering stop event.
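The Event Condition step amounts to inspecting one variable binding and emitting a different event for each value. The sketch below assumes the trap carries an `ifMauJabberState` value and uses the enumeration as it appears in the MAU MIB (noJabber = 3, jabbering = 4); verify these values against your copy of the MIB. The event names `JabberStart` and `JabberStop` and the dictionary layout are hypothetical.

```python
from typing import Dict, Optional

# ifMauJabberState enumeration values (assumed from the MAU MIB; verify locally).
NO_JABBER = 3
JABBERING = 4


def classify_jabber_trap(varbinds: Dict[str, int]) -> Optional[str]:
    """Turn one generic jabber-state trap into a distinct start or stop event."""
    state = varbinds.get("ifMauJabberState")
    if state == JABBERING:
        return "JabberStart"
    if state == NO_JABBER:
        return "JabberStop"
    # other(1) and unknown(2) create no event in this sketch.
    return None


start = classify_jabber_trap({"ifMauJabberState": 4})
stop = classify_jabber_trap({"ifMauJabberState": 3})
ignored = classify_jabber_trap({"ifMauJabberState": 2})
print(start, stop, ignored)
```

The resulting start and stop events are then exactly the kind of pair that an Event Pair rule can watch with a short time window.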
CONDITION CORRELATION TECHNOLOGY
The CA SPECTRUM CCT offers advanced customization capabilities for defining event relationships to isolate root causes of problems. For example, consider the complexities of managing an IP network that provides VPN connectivity across an MPLS backbone with intra-area routing maintained by Intermediate System-to-Intermediate System (IS-IS) and inter-area routing maintained by BGP. Any physical link or protocol failure could cause dozens of events from multiple devices. Without applying sophisticated correlation carefully, the network troubleshooters could spend most of their time chasing after symptoms, rather than fixing the root cause.
AN IS-IS ROUTING FAILURE EXAMPLE
The following example illustrates the range of capabilities for Condition Correlation. A core router, labeled in the figure as R1, has lost IS-IS adjacencies to all neighbors (labeled in the figure as R2, R3, and R4). This also causes the BGP session with the route reflector (labeled in the figure as RR) to be lost. This condition, if it persists, will result in routes aging out of R1 and adjacent edge routers R3 and R4. Eventually, the customer VPN sites serviced by these edge routers will be unable to reach their peer sites (labeled in the figure as CPE1, CPE2, CPE3).
This failure causes the routers to generate a series of syslog error messages and traps. The following table shows the messages and traps that SPECTRUM would receive:
The root cause of all these messages is the IS-IS routing problem related to R1. For many management systems, the operator or troubleshooter would see each of these messages and traps as seemingly disparate events on the event/alarm console. A trained operator or experienced troubleshooter may be able to deduce, after some careful thought, that an R1 routing problem has occurred. However, in a large environment, these events and alarms will likely be interspersed with others cluttering the console. Even if the operator or troubleshooter had the experience to identify the correlation manually, effort and time would be devoted to doing so. That time translates directly into higher costs, lower user satisfaction, and lost revenue. Without condition correlation, SPECTRUM would notify the alarm console users of ten or more events. However, using a combination of an Event Rule and Condition Correlation, you can apply a set of rule patterns to a Correlation Domain consisting of all core (LSR) routers, enabling SPECTRUM to produce a single actionable alarm. This alarm indicates that R1 has an IS-IS routing problem, and that a network outage may result if it is not corrected. The seemingly disparate conditions that SPECTRUM correlates to produce this alarm appear in the symptoms panel of the alarm console as follows:
1. A local Event Rate Counter rule was used to define multiple IS-IS adjacency change syslog messages reported by the same source as a routing problem for that source.
2. A rule pattern was used to make an IS-IS adjacency lost event caused by an IS-IS routing problem when the neighbor of the adjacency lost event is equal to the source of the routing problem event.
3. A rule pattern was used to make a BGP adjacency down event caused by an IS-IS routing problem when the neighbor of the adjacency down event is equal to the source of the routing problem event.
4. A rule pattern was used to make a BGP backward transition trap event caused by an IS-IS routing problem when the neighbor of the backward transition event is equal to the source of the routing problem event.
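The three rule patterns share one test: a symptom is attributed to a routing problem when the symptom's neighbor equals the routing problem's source. A minimal sketch of that matching follows. The `Condition` class, the condition kind strings, and the extra device `R7` are hypothetical; this is not CCT's rule pattern syntax.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Condition:
    kind: str                       # for example "isis_adjacency_lost"
    source: str                     # device that reported the condition
    neighbor: Optional[str] = None  # peer named in the message, if any


SYMPTOM_KINDS = {"isis_adjacency_lost", "bgp_adjacency_down", "bgp_backward_transition"}


def correlate(conditions: List[Condition]) -> Dict[str, List[Condition]]:
    """Map each routing-problem source to the symptom conditions it explains."""
    roots = {c.source for c in conditions if c.kind == "isis_routing_problem"}
    result: Dict[str, List[Condition]] = {root: [] for root in roots}
    for c in conditions:
        # Pattern: symptom is caused by the routing problem when the symptom's
        # neighbor equals the source of the routing problem.
        if c.kind in SYMPTOM_KINDS and c.neighbor in roots:
            result[c.neighbor].append(c)
    return result


conditions = [
    Condition("isis_routing_problem", "R1"),
    Condition("isis_adjacency_lost", "R2", "R1"),
    Condition("isis_adjacency_lost", "R3", "R1"),
    Condition("isis_adjacency_lost", "R4", "R1"),
    Condition("bgp_adjacency_down", "RR", "R1"),
    Condition("isis_adjacency_lost", "R7", "R9"),  # unrelated, stays uncorrelated
]
correlated = correlate(conditions)
print({root: [c.source for c in symptoms] for root, symptoms in correlated.items()})
```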
AN HSRP/VRRP ROUTING FAILURE EXAMPLE
Condition Correlation can also provide useful correlation of events when a link is lost to a router in a Hot Standby Router Protocol (HSRP) or Virtual Router Redundancy Protocol (VRRP) environment. In the following example, a site has two redundant routers that provide access via HSRP. In this case, the primary router experiences a failure, but the redundant router is still servicing the customer's site. You might want an alarm notification of the redundant fail-over, and want to distinguish it from a total site outage. Knowledge from IMT, EMS, and CCT can help to provide the RCA.
The following table outlines the syslog error messages and trap sequences for the HSRP failover.
The seemingly disparate conditions that SPECTRUM correlated to create this alarm appear in the symptoms panel of the alarm console as follows:
1. A correlation domain consisting only of the two CPE HSRP routers and the PE router interfaces that connect to these sites.
2. A rule pattern correlating the coincidence of an HSRPGrpStandByState event with a state of active and a Device Contact Lost event to infer a Primary Connection Lost condition.
3. A rule pattern that defines a Bad Link event caused by a Primary Connection Lost event.
SPECTRUM applies these rule patterns to the HSRP correlation domains to prevent any correlations outside of that scope. Without these rules, SPECTRUM would have raised a critical alarm on the lost CPE device and on the connected port model. With these rules, it raises a major (Orange) alarm on the CPE device indicating that the primary connection to the customer is lost. The other conditions appear in the symptoms table of this alarm.
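The scoping and coincidence logic can be sketched as follows. This is a simplified illustration under stated assumptions: the event name strings, the device names, and the `correlate_hsrp` function are hypothetical, and timing constraints on the coincidence are omitted.

```python
from typing import List, Optional, Set, Tuple

Event = Tuple[str, str]  # (device or interface, event name); names are hypothetical
STANDBY_ACTIVE = "HSRPGrpStandByState=active"
CONTACT_LOST = "DeviceContactLost"
BAD_LINK = "BadLink"


def correlate_hsrp(domain: Set[str], events: List[Event]
                   ) -> Tuple[Optional[Tuple[str, str, str]], List[Event]]:
    """Return (alarm, symptoms) for one HSRP correlation domain."""
    # Ignore anything outside the correlation domain.
    in_scope = [(dev, name) for dev, name in events if dev in domain]
    standby_took_over = any(name == STANDBY_ACTIVE for _, name in in_scope)
    lost = [dev for dev, name in in_scope if name == CONTACT_LOST]
    if not (standby_took_over and lost):
        return None, []  # pattern not matched; default alarm handling applies
    # Coincidence matched: one major alarm, everything else becomes a symptom.
    alarm = ("major", lost[0], "Primary Connection Lost")
    return alarm, in_scope


domain = {"CPE-primary", "CPE-standby", "PE-if-1"}
events = [
    ("CPE-standby", STANDBY_ACTIVE),
    ("CPE-primary", CONTACT_LOST),
    ("PE-if-1", BAD_LINK),
    ("unrelated-router", BAD_LINK),
]
alarm, symptoms = correlate_hsrp(domain, events)
no_failover = correlate_hsrp(domain, [("CPE-primary", CONTACT_LOST)])
print(alarm, len(symptoms), no_failover)
```

Restricting the match to the domain is what keeps an unrelated Bad Link event elsewhere in the network from being folded into this alarm.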
Apply Condition Correlation to Service Correlation

Typically, networks carry and support more than one service. As an example, in the cable industry, telephone service (VoIP), internet access (High Speed Data), VoD, and digital cable are delivered over the same physical data network. Managing this network can be quite a challenge. Inside the network (cable plant), the video transport equipment, video subscription services, and the Cable Modem Termination System (CMTS) all work together to put data on the cable network at the correct frequencies. Uncounted miles of cable, along with thousands of amplifiers and power supplies, must carry the signals to the homes of millions of subscribers. If the network lines are cut in one area, as shown in the following diagram, the return path monitoring system and the head end controller would report return path and power problems in that area. The CMTS would provide the number of cable modems off-line for the node. The video transport system would generate tune errors for video subscriptions in that area. Lastly, the management system would lose contact with any business customer modems that it is managing. With the flood of events and error messages from the managed elements, it will be very obvious that problems exist with the service; the challenge is to translate all that data into actionable root cause and service impact information.
SPECTRUM can interpret the resulting deluge of events by using the service area of the seemingly disparate events as a factor in the Condition Correlation. If the service areas and services are modeled in SPECTRUM, it can use Condition Correlation to determine which services in which areas are affected and the root cause or causes.
Service impact relevance goes beyond understanding what is impacted; it is also important to identify what is not impacted. It is possible for the video subscription service to fail to deliver VoD content to a single service area while all other services to that area operate normally. In another case, a return path problem in one area could cause Internet, VoIP, and VoD services to fail and digital cable to degrade, yet analog cable would still function normally. With the SPECTRUM capabilities and views of your infrastructure, you can more quickly and easily detect the root cause and focus on addressing that problem first.
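Using the service area as a correlation factor can be pictured as grouping events by area and then reporting both the impacted and the unimpacted services. The sketch below is an illustration only: the `service_impact` function, the area names, and the `ALL_SERVICES` set are assumptions, and real correlation would also weigh event types and severities.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# Assumed catalog of services delivered to every service area.
ALL_SERVICES = {"VoIP", "High Speed Data", "VoD", "digital cable", "analog cable"}


def service_impact(events: Iterable[Tuple[str, str]]) -> Dict[str, Dict[str, List[str]]]:
    """Group (service area, service) events and list impacted and unimpacted services."""
    affected: Dict[str, Set[str]] = defaultdict(set)
    for area, service in events:
        affected[area].add(service)
    return {
        area: {
            "impacted": sorted(services),
            "not_impacted": sorted(ALL_SERVICES - services),
        }
        for area, services in affected.items()
    }


events = [
    ("area-12", "VoD"),
    ("area-12", "VoD"),          # duplicate reports collapse into one entry
    ("area-7", "VoIP"),
    ("area-7", "High Speed Data"),
    ("area-7", "VoD"),
]
report = service_impact(events)
print(report["area-12"])
```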
Leverage the Integrated Solution

After SPECTRUM has identified the root cause problems, operations staff can quickly obtain details about the problem history and troubleshooting information by drilling down from the alarms in the OneClick browser to the eHealth and eHealth for Voice reports and tools. For example, operators could right-click an alarm in OneClick and do any of the following:

Drill down to Trend reports for the problem variable.
Drill down to At-a-Glance reports for a snapshot of several key performance variables for that resource.
Drill down to the eHealth for Voice console or the eHealth web reporting interface to view more reports and details about problem voice PBX systems and call message servers.
Launch a browser to the eHealth web reporting interface to view more reports and details about problem resources.
