
AVAILABILITY MEASUREMENT

SESSION NMS-2201

NMS-2201 9627_05_2004_c2

2004 Cisco Systems, Inc. All rights reserved.

Agenda
Introduction
Availability Measurement Methodologies:
- Trouble Ticketing
- Device Reachability: ICMP (Ping), SA Agent, COOL
- SNMP: Uptime, Ping-MIB, COOL, EEM, SA Agent
- Application

Developing an Availability Culture


Associated Sessions
NMS-1N01: Intro to Network Management
NMS-1N02: Intro to SNMP and MIBs
NMS-1N04: Intro to Service Assurance Agent
NMS-1N41: Introduction to Performance Management
NMS-2042: Performance Measurement with Cisco IOS
ACC-2010: Deploying Mobility in HA Wireless LANs
NMS-2202: How Cisco Achieved HA in Its LAN
RST-2514: HA in Campus Network Deployments
NMS-4043: Advanced Service Assurance Agent
RST-4312: High Availability in Routing

INTRODUCTION WHY MEASURE AVAILABILITY?


Why Measure Availability?


1. Baseline the network
2. Identify areas for network improvement
3. Measure the impact of improvement projects


Why Should We Care About Network Availability?


Where are we now? (baseline)
Where are we going? (business objectives)
How best do we get from where we are now to where we are going? (improvements)
What if we can't get there from here?


Why Should We Care About Network Availability?


Recent studies by Sage Research determined that US-based service providers encountered:
- Percent of downtime that is unscheduled: 44%
- 18% of customers experience over 100 hours of unscheduled downtime, an availability of 98.5%
- Average cost of network downtime per year: $21.6 million, or $2,169 per minute!

Downtime Costs Too Much!


SOURCE: Sage Research, IP Service Provider Downtime Study: Analysis of Downtime Causes, Costs and Containment Strategies, August 17, 2001, Prepared for Cisco SPLOB

Cause of Network Outages


Technology: 20%
- Hardware
- Links
- Design
- Environmental issues
- Natural disasters

User Error and Process: 40%
- Change management
- Process consistency

Software and Application: 40%
- Software issues
- Performance and load
- Scaling

Source: Gartner Group

Top Three Causes of Network Outages


Congestive degradation
Capacity (unanticipated peaks)
Solutions validation
Software quality
Inadvertent configuration change
Change management
Network design
WAN failure (e.g., major fiber cut or carrier failure)
Power
Critical services failure (e.g., DNS/DHCP)
Protocol implementations and misbehavior
Hardware fault


Method for Attaining a Highly-Available Network


Or: a Road to Five Nines
- Establish a standard measurement method
- Define business goals as related to metrics
- Categorize failures, root causes, and improvements
- Take action for root cause resolution and improvement implementation


Where Are We Going? Or What Are Your Business Goals?


Financial:
- ROI
- Economic value added
- Revenue/employee

Productivity:
- Time to market
- Organizational mission
- Customer perspective

Customer:
- Satisfaction
- Retention
- Market share

Define your end state: what is your goal?



Why Availability for Business Requirements?


Availability as a basis for productivity data:
- Measurement of total-factor productivity
- Benchmarking the organization
- Overall organizational performance metric

Availability as a basis for organizational competency:
- Availability as a core competency
- Availability improvement as an innovation metric

Resource allocation information:
- Identify defects
- Identify root cause
- Measure MTTR, tied to process

It Takes a Design Effort to Achieve HA


Hardware and Software Design

Process Design

Network and Physical Plant Design



INTRODUCTION WHAT IS NETWORK AVAILABILITY?


What Is High Availability?


High Availability Means an Average End User Will Experience Less than Five Minutes Downtime per Year
Availability    Downtime per Year (24x7x365)
99.000%         3 days, 15 hours, 36 minutes
99.500%         1 day, 19 hours, 48 minutes
99.900%         8 hours, 46 minutes
99.950%         4 hours, 23 minutes
99.990%         53 minutes
99.999%         5 minutes
99.9999%        30 seconds
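The availability-to-downtime conversion is simple arithmetic; a small Python helper (names are my own) reproduces the annual-downtime figures:

```python
# Convert an availability percentage into expected annual downtime,
# assuming a 24x7x365 service year (8,760 hours).

def annual_downtime_seconds(availability_pct: float) -> float:
    """Seconds of downtime per year implied by an availability percentage."""
    year_seconds = 365 * 24 * 3600
    return year_seconds * (1 - availability_pct / 100.0)

for pct in (99.0, 99.9, 99.999):
    secs = annual_downtime_seconds(pct)
    print(f"{pct}% -> {secs / 3600:.2f} hours/year")
```

For example, 99.999% availability works out to about 315 seconds, roughly the five minutes per year quoted above.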

Availability Definition
Availability definition is based on business objectives
Is it the user experience you are interested in measuring? Are some users more important than others?

Availability groups?
Definitions of different groups

Exceptions to the availability definition


e.g., the CEO should never experience a network problem


How You Define Availability


- Define availability perspective (customer, business, etc.)
- Define availability groups and levels of redundancy
- Define an outage
- Define impact to the network:
  - Ensure SLAs are compatible with the outage definition
  - Understand how maintenance windows affect the outage definition
  - Identify how to handle DNS and DHCP within the definition of a Layer 3 outage
  - Examine component-level sparing strategy
- Define what to measure
- Define measurement accuracy requirements



Network Design: What Is Reliability?


Reliability is often used as a general term that refers to the quality of a product:
- Failure rate
- MTBF (Mean Time Between Failures) or MTTF (Mean Time To Failure)
- Engineered availability

Reliability is defined as the probability of survival (no failure) for a stated length of time.


MTBF Defined
MTBF stands for Mean Time Between Failures; MTTF stands for Mean Time To Failure
- This is the average length of time between failures (MTBF) or to a failure (MTTF)
- More technically, it is the mean time to go from an OPERATIONAL STATE to a NON-OPERATIONAL STATE
- MTBF is usually used for repairable systems; MTTF is used for non-repairable systems

MTTR stands for Mean Time To Repair


One Method of Calculating Availability


Availability = MTBF / (MTBF + MTTR)

What is the availability of a computer with MTBF = 10,000 hrs and MTTR = 12 hrs?
A = 10,000 / (10,000 + 12) = 99.88%

Annual uptime: 8,760 hrs/year x 0.9988 = 8,749.5 hrs

Conversely, annual downtime: 8,760 hrs/year x (1 - 0.9988) = 10.5 hrs
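The worked example above, as a short Python check:

```python
# The slide's availability formula: Availability = MTBF / (MTBF + MTTR).

def availability(mtbf_hours: float, mttr_hours: float) -> float:
    return mtbf_hours / (mtbf_hours + mttr_hours)

a = availability(10_000, 12)              # the computer example above
print(f"A = {a:.4%}")                     # ~99.88%
print(f"uptime   = {8760 * a:.1f} hours/year")
print(f"downtime = {8760 * (1 - a):.1f} hours/year")
```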

Networks Consist of Series-Parallel


Combinations of in-series and redundant components

[RBD (reliability block diagram): component A in series with a 1-of-2 redundant pair (B1, B2) and a 2-of-3 redundant group (D1, D2, D3)]
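Under the standard RBD assumptions (independent failures, pure active parallel), series and parallel availabilities combine as sketched below; the example values are my own:

```python
# Series-parallel availability math: in-series availabilities multiply,
# while for pure active parallel blocks the UNavailabilities multiply.

from math import prod

def series(*avail: float) -> float:
    return prod(avail)

def parallel(*avail: float) -> float:
    return 1 - prod(1 - a for a in avail)

# Two 99.9% devices in series, backed by a redundant pair of 99.9% links:
a = series(0.999, 0.999, parallel(0.999, 0.999))
print(f"{a:.6f}")
```

Note how the redundant pair contributes almost no unavailability: its combined availability is 99.9999%.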


More Complex Redundancy


Pure active parallel
All components are on

Standby redundant
Backup components are not operating

Perfect switching
Switch-over is immediate and without fail

Switch-over reliability
The probability of switchover when it is not perfect

Load sharing
All units are on and workload is distributed

MEASURING THE PRODUCTION NETWORK


Reliability or Engineered Availability vs. Measured Availability


Calculations are similar: both are based on MTBF and MTTR
1. Reliability is an engineered probability of the network being available
2. Measured availability is the actual outcome produced by physically measuring the engineered system over time


Availability Choice Based on Business Goals


Passive availability measurement: without sending additional traffic on the production network, using data from problem management, fault management, or another system

Active availability measurement: with traffic sent specifically for availability measurement, using ICMP echo, SNMP, SA Agent, etc. to generate data


Types of Availability
- Device/interface
- Path
- Users
- Application


Some Types of Availability Metrics


- Mean Time to Repair (MTTR)
- Impacted User Minutes (IUM)
- Defects per Million (DPM)
- Mean Time Between Failures (MTBF)
- Performance (e.g., latency, drops)


Back to How Availability Is Calculated?


Availability (%) is calculated by tabulating end-user outage time, typically on a monthly basis. Some customers prefer to use DPM (Defects per Million) to represent network availability.

Availability (%) = (Total User Time - Total User Outage Time) / Total User Time x 100

DPM = Total User Outage Time / Total User Time x 10^6

Total User Time = total number of end users x time in the reporting period
Total User Outage Time = sum of (number of end users affected x outage time) over all incidents in the reporting period
Ports or connections may be substituted for end users.
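The two formulas can be sketched directly in Python (the example user counts are my own):

```python
# Availability (%) and DPM from total user time and total user outage time.
# "Users" may equally be ports or connections.

def availability_pct(total_user_time: float, total_user_outage_time: float) -> float:
    return (total_user_time - total_user_outage_time) / total_user_time * 100

def dpm(total_user_time: float, total_user_outage_time: float) -> float:
    return total_user_outage_time / total_user_time * 1e6

# 1,000 users over a 30-day month, one outage hitting 50 users for 2 hours:
total = 1000 * 30 * 24      # user-hours in the reporting period
outage = 50 * 2             # user-hours of outage
print(availability_pct(total, outage), dpm(total, outage))
```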


Defects per Million


Started with mass-produced items like toasters.

For PVCs:
DPM = sum(#conns x outage minutes) / sum(#conns x total minutes) x 10^6

For SVCs or phone calls:
DPM = (#existing calls lost + #new calls blocked) / total calls attempted x 10^6

For connectionless traffic (application dependent):
DPM = sum(#end users x outage minutes) / sum(#end users x total minutes) x 10^6
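A sketch of the per-traffic-type variants (function names are my own; the x 10^6 scaling is what makes these "per million" figures):

```python
# DPM variants for connection-oriented (PVC), call-oriented (SVC), and
# connectionless traffic.  Inputs are pre-summed totals.

def dpm_pvc(conn_outage_minutes: float, conn_total_minutes: float) -> float:
    # sum(#conns x outage minutes) / sum(#conns x total minutes) x 1e6
    return conn_outage_minutes / conn_total_minutes * 1e6

def dpm_svc(calls_lost: int, calls_blocked: int, calls_attempted: int) -> float:
    return (calls_lost + calls_blocked) / calls_attempted * 1e6

print(dpm_svc(3, 2, 1_000_000))   # 5 defective calls per million attempts
```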

NETWORK AVAILABILITY COLLECTION METHODS TROUBLE TICKETING METHODS


Availability Improvement Process


Step I:
- Validate data collection/calculation methodology
- Establish a network availability baseline
- Set high availability goals

Step II:
- Measure uptime on an ongoing basis
- Track defects per million (DPM), IUM, or availability (%)

Step III:
- Track customer impact for each ticket/MTTR
- Categorize DPM by reason code and begin trending
- Identify initiatives/areas of focus to eliminate defects

Data Collection/Analysis Process


Understand the current data collection methodology:
- Customer internal ticket database
- Manual

Monthly, collect network performance data and export the following fields to a spreadsheet or database system:
- Outage start time (date/time)
- Service restore time (date/time)
- Problem description
- Root cause
- Resolution
- Number of customers impacted
- Equipment model
- Component/part
- Planned maintenance activity/unplanned activity
- Total customers/ports on network
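As an illustrative sketch (class and field names are my own, and only a few of the exported fields above are modeled), tickets like these can be rolled up into the total user outage minutes used by the availability and DPM formulas:

```python
# Roll exported trouble tickets up into total user outage minutes,
# optionally excluding planned maintenance.

from dataclasses import dataclass
from datetime import datetime

@dataclass
class Ticket:
    outage_start: datetime
    service_restored: datetime
    users_impacted: int
    planned: bool

def total_user_outage_minutes(tickets, include_planned=False):
    return sum(
        t.users_impacted
        * (t.service_restored - t.outage_start).total_seconds() / 60
        for t in tickets
        if include_planned or not t.planned
    )

tix = [Ticket(datetime(2004, 1, 5, 9, 0), datetime(2004, 1, 5, 10, 30), 40, False)]
print(total_user_outage_minutes(tix))   # 40 users x 90 minutes
```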

Network Availability Results


Methodology and assumptions must be documented. Network availability reporting should include:
- Overall % network availability (baseline/trending)
- Conversion of downtime to DPM by: planned vs. unplanned, root cause, resolution, equipment type
- Overall MTTR
- MTTR by: root cause, resolution, equipment type

Results are not necessarily limited to the above, but should be customized based on your network and requirements.

Availability Metrics: Reviewed


- Network has 100 customers
- Time in the reporting period is one year, or 24 hours x 365 days
- 8 customers have 24 hours of downtime per year

DPM = (8 x 24) / (100 x 24 x 365) x 10^6 = 219.2 failures for every 1 million user hours

Availability = 1 - (8 x 24) / (100 x 24 x 365) = 0.999781

MTBF = (24 x 365) / 8 = 1095 hours

MTTR = 1095 x (1 - 0.999781) / 0.999781 = 0.24 hours
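The same arithmetic, as a quick Python check of the slide's numbers:

```python
# Re-deriving the example: 100 customers, one year, 8 customers down
# 24 hours each.

users, hours = 100, 24 * 365
outage_user_hours = 8 * 24

dpm = outage_user_hours / (users * hours) * 1e6           # ~219.2
availability = 1 - outage_user_hours / (users * hours)    # ~0.999781
mtbf = hours / 8                                          # 1095 hours
mttr = mtbf * (1 - availability) / availability           # ~0.24 hours
print(round(dpm, 1), round(mtbf), round(mttr, 2))
```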


TROUBLE TICKETING METHOD SAMPLE OUTPUT


Overall Network Availability (Planned/Unplanned)


[Illustrative chart: overall monthly network availability (%), July through June, on a 99.50%-100.00% scale]

Key takeaways

Platform Related DPM Comparison


[Illustrative chart: monthly total DPM split into Platform Related and Other, June through December, against the 99.99% target of 100 DPM; e.g., June: Other 339.5 + Platform Related 49.2 = 388.7 total]

- Platform-related DPM contributed 13% of total DPM in September
- Platform DPM includes events from: Backbone, NAS, PG, POP, Radius Server, VPN Radius Server
- All other events are included in the Other category

Breakdown of Platform Related DPM (illustrative):

                June   July   Aug    Sept
Backbone        1.5    .8     15.7   2.3
NAS             21.7   19.4   27     26.1
PG              26     59.6   56.8   18.9
POP             0      3.9    .5     1.6
Radius Server   0      0      1.2    .3
VPN Radius      0      8.8    2.8    3.4
Total Platform  49.2   82.5   104    52.6

- Network Access Server (NAS) accounts for 50% of the total platform-related DPM in September
- Private Access Gateway (PG) shows a significant decrease over the past 3 months

DPM by Cause
[Illustrative chart and table: monthly DPM by cause (Unknown, Human Error, Environmental, Power, Other, HW, Config/SW) with monthly totals, December through May]

MTTR Analysis: Hardware Faults


Router hardware example; produce for each fault type:

[Illustrative charts: monthly router hardware MTTR in hours, number of faults, and distribution of time-to-repair (<1 hr, 1-4 hr, 4-12 hr, 12-24 hr, >24 hr, >100 hr), June through December]

- The number of faults increased slightly in September; however, MTTR decreased
- 49% of faults were resolved in < 1 hour in September
- 11% of faults were resolved in > 24 hours, with an additional 3% > 100 hours


Unplanned DPM
(Illustrative monthly unplanned DPM)

          Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec
Other     70    100   35    79    80    80    165   110   40    10    0
Process   90    80    55    100   100   90    210   180   75    10    5
HW        90    200   80    104   180   65    385   325   245   110   5
SW        60    140   50    67    80    115   200   145   100   40    10
TOTAL     310   520   220   350   440   350   960   760   460   170   40

Key takeaways

Action plans
Identify areas of focus to enable reduction of DPM to achieve network availability goal


Trouble Ticketing Method


Pros:
- Easy to get started
- No network overhead
- Outages can be categorized based on event

Cons:
- Some internal subjective/consistency process issues
- Outages may occur that are not included in the trouble ticketing system
- Resources needed to scrub data and create reports
- May not work with the existing trouble ticketing system/process

Network Availability Collection Methods

AUTOMATED FAULT MANAGEMENT EVENTS METHOD


Availability Improvement Process


Step I:
- Determine availability goals
- Validate fault management data collection
- Determine a calculation methodology
- Build a software package to use the customer event log

Step II:
- Establish a network availability baseline
- Measure uptime on an ongoing basis

Step III:
- Track root cause and customer impact
- Begin trending of availability issues
- Identify initiatives and areas of focus to eliminate defects

Event Log
Analysis of events received from the network devices; analysis of the accuracy of the data.

Event log example:
Fri Jun 15 11:05:31 2001 Debug: Looking for message header ...
Fri Jun 15 11:05:33 2001 Debug: Message header is okay
Fri Jun 15 11:05:33 2001 Debug: $(LDT) -> "06152001110532"
Fri Jun 15 11:05:33 2001 Debug: $(MesgID) -> "100013"
Fri Jun 15 11:05:33 2001 Debug: $(NodeName) -> "ixc00asm"
Fri Jun 15 11:05:33 2001 Debug: $(IPAddr) -> "10.25.0.235"
Fri Jun 15 11:05:33 2001 Debug: $(ROCom) -> "xlr8ed!"
Fri Jun 15 11:05:33 2001 Debug: $(RWCom) -> "s39o!d%"
Fri Jun 15 11:05:33 2001 Debug: $(NPG) -> "CISCO-Large-special"
Fri Jun 15 11:05:33 2001 Debug: $(AlrmDN) -> "aSnmpStatus"
Fri Jun 15 11:05:33 2001 Debug: $(AlrmProp) -> "system"
Fri Jun 15 11:05:33 2001 Debug: $(OSN) -> "Testing"
Fri Jun 15 11:05:33 2001 Debug: $(OSS) -> "Normal"
Fri Jun 15 11:05:33 2001 Debug: $(DSN) -> "SNMP_Down"
Fri Jun 15 11:05:33 2001 Debug: $(DSS) -> "Agent_Down"
Fri Jun 15 11:05:33 2001 Debug: $(TrigName) -> "NodeStateUp"
Fri Jun 15 11:05:33 2001 Debug: $(BON) -> "nl-ping"
Fri Jun 15 11:05:33 2001 Debug: $(TrapGN) -> "-2"
Fri Jun 15 11:05:33 2001 Debug: $(TrapSN) -> "-2"


Calculation Methodology: Example


- Primary events are device down/up
- Downtime is calculated based on device-type outage duration
- Availability is calculated based on the total number of device types, the total time, and the total downtime
- MTTR numbers are calculated from the average duration of downtime
- With MTTR, the shortest and longest outages provide a simplified curve

Automated Fault Management Methodology


Pros:
- Outage duration and scope can be fairly accurate
- Can be implemented within an NMS fault management system
- No additional network overhead

Cons:
- Requires an excellent change management/provisioning process
- Requires an efficient and effective fault management system
- Requires custom development
- Does not account for routing problems
- Not a true end-to-end measure

NETWORK AVAILABILITY DATA COLLECTION SAMPLE OUTPUT


Automated Fault Management: Example Reports


Device Type     # of     Count of   Total Down Time  % Down   % Up      Shortest  Mean Time  Longest   Events per
                Devices  Incidents  (hhh:mm:ss)                         Outage    to Repair  Outage    Device
Host Totals     2389     801        202:27:27        .0673%   99.9327%  0:00:19   0:20:47    7:48:46   24.42
Network Totals  4732     1673       430:02:03        .1309%   99.8691%  0:00:24   0:22:36    9:49:35   14.90
Other Totals    897      173        212:29:46        .0509%   99.9491%  0:00:17   0:26:07    2:16:10   16.84
GRAND TOTAL     8018     2647       844:59:16        .0830%   99.9170%  0:00:20   0:23:10    6:38:11   18.72


Automated Fault Management: Example Reports (2)


[Illustrative pie charts:]
- Number of managed devices: Host 30%, Network 59%, Other 11%
- Count of incidents: Host 30%, Network 63%, Other 7%
- Total down time: Host 24%, Network 51%, Other 25%



Network Availability Collection Methods

ICMP ECHO (PING) AND SNMP AS DATA GATHERING TECHNIQUES


Data Gathering Techniques


- ICMP ping
- Link and device polling (SNMP)
- Embedded RMON
- Embedded event management
- Syslog messages
- COOL


Data Gathering Techniques


ICMP Reachability
Method definition:
A central workstation or computer is configured to send ping packets to the network edges (devices or ports) to determine reachability

How:
Edge interfaces and/or devices are defined and pinged on a determined interval

Unavailability:
Pre-defined, non-response from the interface
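A minimal poller in the spirit of this method might look like the following sketch, not production code: it shells out to the system ping command (the -c/-W flags assume a Linux-style ping), and the function names and probe interface are my own:

```python
# One-shot ICMP reachability checks, aggregated per polling cycle.

import subprocess

def is_reachable(host: str, timeout_s: int = 2) -> bool:
    """One ICMP echo; True if the host answered within the timeout."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

def poll_edges(hosts, probe=is_reachable):
    """Fraction of edge devices answering during this polling cycle."""
    up = sum(probe(h) for h in hosts)
    return up / len(hosts)
```

Each cycle's non-responses would be accumulated as outage time, subject to the pre-defined unavailability criteria above.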


Availability Measurement Through ICMP


Periodic ICMP tests:
- Periodic pings to network devices
- Periodic pings to network leaf nodes

Data Gathering Techniques


ICMP Reachability
Pros:
- Fairly accurate network availability
- Accounts for routing problems
- Can be implemented with fairly low network overhead

Cons:
- Point-to-multipoint, so not a true end-to-end measure
- Availability granularity limited by ping frequency
- Maintenance of the device database: must have a solid change management and provisioning process

Data Gathering Techniques


Link and Device Status
Method definition:
SNMP polling and trapping on links, edge ports, or edge devices

How:
An agent is configured to SNMP poll and tabulate outage times for defined devices or links; database maintains outage times and total service time; sometimes trap information is used to augment this method by providing more accurate information on outages

Unavailability:
Pre-defined, non-redundant links, ports, or devices that are down

Polling Interval vs. Sample Size


The polling interval is the period at which data is collected from the network:
Polling interval = 1 / sampling rate

The smaller the polling interval, the more detailed (granular) the collected data.
Example: polling data once every 15 minutes provides 4 times the detail (granularity) of polling once an hour.

A smaller polling interval does not necessarily provide a better margin of error.
Example: polling once every 15 minutes for one hour has the same margin of error as polling once an hour for 4 hours.


Link and Device Status Method


Method definition
SNMP polling and trapping on links, edge ports, or edge devices

How:
- Utilize existing NMS systems that are currently SNMP polling to tabulate outage times for defined devices or links
- A database maintains outage times and total service time
- SNMP trap information is also used to augment this method by providing more accurate information on outages


Link and Device Status Method


Pros:
- Outage duration and scope can be fairly accurate
- Utilizes existing NMS systems
- Low network overhead

Cons:
- No canned software to do this; custom development required
- Maintaining the element/device database is challenging
- Requires an excellent change management and provisioning process
- Does not account for routing problems
- Not a true end-to-end measure

CISCO SERVICE ASSURANCE AGENT (SA AGENT)


Service Assurance Agent


Method Definition:
SA Agent is an embedded feature of Cisco IOS software and requires configuration of the feature on routers within the customer network; use of the SA agent can provide for a rapid, cost-effective deployment without additional hardware probes

How:
A data collector creates SA Agent probes on the routers to monitor certain network/service performance metrics; the data collector then gathers this data from the routers, aggregates it, and makes it available

Unavailability:
Pre-defined paths with reporting on non-redundant links, ports, or devices that are down within a path

Case Study: Financial Institution (Collection)


[Topology sketch: SA Agent collectors measuring paths from remote sites to Internet web sites and DNS]

Availability Using Network-Based Probes


- DPM equations are used with network-based probes as input data
- Probes can be: a simple ICMP ping probe, a modified ping to test specific applications, or Cisco IOS SA Agent
- DPM is for connectivity between two points on the network: the source and destination of the probe
- The source of the probe is usually a management system and the destinations are the managed devices; DPM can be calculated for every device managed

DPM = Probes with No Response / Total Probes Sent x 10^6

Availability = 1 - Probes with No Response / Total Probes Sent


Availability Using Network-Based Probes: Example


- Network probe is a ping
- 10,000 probes are sent between the management system and a managed device
- 1 probe failed to respond

DPM = 1 / 10,000 x 10^6 = 100 (100 probes out of 1 million will fail)

Availability = 1 - 1/10,000 = 0.9999


Sample Size
Sample size is the number of samples that have been collected. The more samples collected, the higher the confidence that the data accurately represents the network. Confidence (margin of error) is defined by:

m = 1 / sqrt(sample size)

Example: data is collected from the network every hour.

After one day: m = 1 / sqrt(24) = 0.2041
After one month: m = 1 / sqrt(24 x 31) = 0.0367
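The rule of thumb above, checked in Python:

```python
# Margin of error as a function of sample size: m = 1 / sqrt(n).

from math import sqrt

def margin_of_error(samples: int) -> float:
    return 1 / sqrt(samples)

print(round(margin_of_error(24), 4))        # hourly polls, one day
print(round(margin_of_error(24 * 31), 4))   # hourly polls, one month
```

This is why polling more often for the same duration does not beat polling at the same rate for longer: only the total sample count matters here.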


Service Assurance Agent


Pros:
- Accurate network availability for defined paths
- Accounts for routing problems
- Implementation with very low network overhead

Cons:
- Requires a system to collect the SA Agent data
- Requires implementation in the router configurations
- Availability granularity limited by polling frequency
- Requires definition of the critical network paths to be measured


COMPONENT OUTAGE ONLINE MEASUREMENT (COOL)


COOL Objectives
- To automate measurement, increasing operational efficiency and reducing operational cost
- To measure the outage as close to the source of outage events as possible, to pinpoint the cause of the outages
- To cope with a large number of network elements without causing system and network performance degradation
- To maintain measurement data reliably in the presence of element failure or network partition
- To support simplicity in deployment, configuration, and data collection (autonomous measurement)

COOL Features
[Diagram: COOL embedded in an access router, exposing the Outage Monitor MIB to NMS and 3rd-party tools (NetTools, C-NOTE, PNL) with event notification filtering]

- Open access via the Outage Monitor MIB
- Embedded in the router
- Automated, real-time measurement
- Autonomous measurement
- Outage data stored in the router

COOL Features (Cont.)


Outage correlation and calculation

Two-tier framework (COOL measurement in the router; correlation and calculation in the NMS):
- Reduces performance impact on the router
- Provides scalability to the NMS
- Makes deployment easy
- Provides flexibility in availability calculation

Supports NMS or tools for such applications as:
- Calculation of software or hardware MTBF, MTTR, and availability per object, device, or network
- Verification of customer SLAs
- Troubleshooting in real time

[Diagram: NMS systems polling the Outage Monitor MIB on COOL-enabled core and access routers, with customer equipment attached to the access routers]

Outage Model
[Diagram: an access router showing monitored object types: physical entities (RP, power, fan, etc.), logical and physical interfaces, and links to remote devices (customer equipment via MUX/hub/switch, peer router), observed by a network management system]

Type  Objects Monitored        Failure Modes
A     Physical entity objects  Component hardware or software failure, including failure of line cards, power supplies, fans, switch fabric, and so on
B     Interface objects        Interface hardware or software failure; loss of signal
C     Remote objects           Failure of a remote device (customer equipment or peer networking device) or the link in between
D     Software objects         Failure of software processes running on the RPs and line cards


Outage Characterization
Data Definition
- Defect threshold: a value across which the object is considered to be defective (service degradation or complete outage)
- Duration threshold: the minimum period beyond which an outage needs to be reported (given the SLA)
- Start time: when the object outage starts
- End time: when the outage ends

[Timeline: a down event crossing the defect threshold marks the start time; the outage duration runs until the up event (end time), and the outage is reported only if its duration exceeds the duration threshold]


Architecture
Customer Interfaces
Outage Monitor MIB
SNMP Polling SNMP Notification

Configuration
CLI Customer Authentication

Data Table Structure


Outage Component Table

HA and Persistent Data Store


Time Stamp Temp Event Data Crash Reason Outage Data

MeasurementMap Table Metrics Event


Process Map Table Remote Component Map Table

Event History Table

Outage Manager

NVRAM

ATA Flash

Internal Component Outage Detector


Fault Manager CPU Event (IOS) Usage Source Detect Callbacks Syslog

Remote Component Outage Detector


Customer Equipment Ping Detection Function

SAA APIs

Baseline

Optional

Measurement Methods


Outage Data: AOT and NAF


Requirements of measurement metrics:
Enable calculation of MTTR, MTBF, availability, and SLA assessment
Ensure measurement efficiency in terms of resources (CPU, memory, and network bandwidth)

Measurement metrics per object:


AOT: Accumulated Outage Time since measurement started NAF: Number of Accumulated Failures since measurement started

Router 1
Up Down

System Crash 10

System Crash 10 Time

AOT = 20 and NAF = 2
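As a sketch of how these two counters accumulate (a hypothetical helper for illustration, not COOL's actual implementation), the Router 1 example above works out as:

```python
def accumulate(outage_minutes, duration_threshold=0):
    """Accumulate AOT and NAF from a list of per-outage durations (minutes).

    Only outages longer than the duration threshold are counted,
    matching the duration-threshold rule described earlier.
    """
    counted = [d for d in outage_minutes if d > duration_threshold]
    aot = sum(counted)   # Accumulated Outage Time
    naf = len(counted)   # Number of Accumulated Failures
    return aot, naf

# Router 1: two 10-minute system crashes
print(accumulate([10, 10]))  # → (20, 2)
```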


Outage Data: AOT and NAF


Object containment model
Router Device → Line Card → Physical Interface → Logical Interface

Containment independent property


Router 1
Up Down Interface 1 Up
Interface Failure

System Crash 10

System Crash 10

Router Device AOT = 20; NAF = 2;

Service Affecting AOT = 27; NAF = 3; 20

Time 20 Interface AOT = 7; NAF = 1;

10

10

7 Time
Router 1 Interface 1


Example: MTTR
Find MTTR for Object i
MTTRi = AOTi/NAFi = 14/2 = 7 min

Object i

Measurement Interval (T2T1)

TTR
Up Down

TTR 4 min. Failure T2 Time

10 min. T1 Failure


Example: MTBF and MTTF


Find MTBF and MTTF for Object i
MTBFi = (T2 - T1)/NAFi
MTTFi = MTBFi - MTTRi = (T2 - T1 - AOTi)/NAFi
MTBF = 700,000 = 1,400,000/2
MTTF = 699,993 = 700,000 - 7
Object i

Measurement Interval (T2T1) TBF TTR TTF 4 min. Failure T2 Time

Up Down

10 min. T1 Failure

(T2 - T1) = 1,400,000 min



Example: Availability and DPM


Find availability and DPM for Object i
Availability (%) = [MTBF/(MTBF + MTTR)] * 100

Availability = 99.999% = (700,000/700,007) * 100
DPMi = [AOTi/(T2 - T1)] x 10^6 = 10 DPM

Measurement Interval = 1,400,000 min. Object i Up


Down

10 min. T1 Failure

4 min. Failure T2

Time
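The three worked examples above (MTTR, MTBF/MTTF, and availability/DPM) are just arithmetic on AOT, NAF, and the measurement interval; reproduced in Python, mirroring the slide numbers:

```python
AOT = 14            # minutes: the 10-min and 4-min failures
NAF = 2
T = 1_400_000       # measurement interval (T2 - T1) in minutes

mttr = AOT / NAF                           # 7.0 min
mtbf = T / NAF                             # 700,000 min
mttf = mtbf - mttr                         # 699,993 min
availability = mtbf / (mtbf + mttr) * 100  # ≈ 99.999 %
dpm = AOT / T * 1e6                        # 10.0 defects per million
```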


Planned Outage Measurement


Captures operational CLI commands: both reload and forced switchover
A simple rule derives an upper bound of the planned outage:
If there is no NVRAM soft-crash file, check the reboot reason or switchover reason
If it is reload or forced switchover, it can be counted toward an upper bound of the planned outage

Send Break
Operation Caused Outage

Reload
Planned Outage Forced Switchover

Upper Bound of the Planned Outage


Event Filtering
Flapping interface detection and filtering:
A faulty interface can keep changing state between up and down
May cause virtual network disconnection
May cause an event storm, with hundreds of messages for each flapping episode
May make the object MTBF unreasonably low due to frequent short failures
This unstable condition needs the operator's attention

COOL detects the flapping status:
Catches very short outage events (less than the duration threshold)
Increments the event counter; Flapping status if the counter exceeds the flapping threshold (3 events) within a short period (1 sec); sends a notification
Stable status if it falls back below the threshold; sends another notification
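The flap-filtering rule can be sketched as follows; the threshold values follow the slide, but the function and its signature are illustrative, not COOL's internals:

```python
def flap_status(event_times, flap_threshold=3, window=1.0):
    """Classify an object as flapping or stable.

    event_times: timestamps (seconds) of short outage events, i.e. events
    below the duration threshold. If more than `flap_threshold` of them
    fall inside any `window`-second span, the object is flapping.
    """
    events = sorted(event_times)
    for start in events:
        in_window = [t for t in events if start <= t < start + window]
        if len(in_window) > flap_threshold:
            return "flapping"   # would trigger a notification
    return "stable"             # dropping below threshold notifies again

print(flap_status([0.0, 0.2, 0.4, 0.6]))  # → flapping
print(flap_status([0.0, 5.0, 10.0]))      # → stable
```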

Data Persistency and Redundancy


Router COOL
Event Driven Update

COOL RAM
Outage Data

RAM
Outage Data

Periodic Update

Persistent Outage Data Persistent Outage Data

NVRAM

NVRAM
Copy Copy
Persistent Outage Data Persistent Outage Data

FLASH

FLASH

Active RP

Standby RP

Data persistency
To avoid data loss due to a link outage or a crash of the router itself

Data redundancy
To continue the outage measurement after the switchover To retain the outage data even if the RP is physically replaced

Outage Monitor MIB


CISCO-OUTAGE-MONITOR-MIB
iso.org.dod.internet.private.enterprises.cisco.ciscoMgmt.ciscoOutageMIB (1.3.6.1.4.1.9.9.280)
cOutageHistoryTable: Object-Type; Object-Index; Event-Reason-Index; Event-Time; Event-Interval
cOutageObjectTable: Object-Type; Object-Index; Object-Status; Object-AOT; Object-NAF

IF-MIB Event Reason Map Table


(Event Description) ifTable (Interface Object Description)

ENTITY-MIB
entPhysicalTable (Physical Entity Object Description)

Process MIB Map

CISCO-PROCESS-MIB
cpmProcessTable (Process Object Description)

Remote Object Map Table


(Remote Object Description)


Configuration
MIB Display Show CLI
Show event-table Show object-table

Config CLI

Event Table Object Table

COOL
Update
Customer Equipment Detection Function

run; add; removal filtering-enable;

Cisco IOS Configuration

Update


Enabling COOL
ari#dir
Directory of disk0:/
1  -rw-  19014056  Oct 29 2003 16:09:28 +00:00  gsr-k4p-mz.120-26.S.bin

128057344 bytes total (109051904 bytes free)
ari#copy tftp disk0:
Address or name of remote host []? 88.1.88.9
Source filename []? auth_file
Destination filename [auth_file]?
Accessing tftp://88.1.88.9/auth_file...
Loading auth_file from 88.1.88.9 (via FastEthernet1/2): !
[OK - 705 bytes]
705 bytes copied in 0.532 secs (1325 bytes/sec)
ari#clear cool persist-files
ari#conf t
Enter configuration commands, one per line.  End with CNTL/Z.
ari(config)#cool run
ari(config)#^Z
ari#wr mem
Building configuration...
[OK][OK][OK]

Obtain Authorization File

Enable COOL


COOL
Pros
Accurate network availability for devices, components, and software
Accounts for routing problems
Implementation with low network overhead
Enables correlation between active and passive availability methodologies

Cons
Only a few systems currently have the COOL feature (new Cisco IOS feature)
Requires configuration changes on production devices
Availability granularity limited by polling frequency

Network Availability Collection Methods

APPLICATION LAYER MEASUREMENT


Application Reachability

Similar to ICMP Reachability


Method definition:
Central workstation or computer configured to send packets that mimic application packets

How:
Agents on client and server computers collect data: Fire Runner, Ganymede Chariot, Gyra Research, Response Networks, Vital Signs Software, NetScout; custom application queries on customer systems

Installing special probes on user and server subnets to send, receive, and collect data; NikSun and NetScout

Unavailability:
Pre-defined QoS definition

Application Reachability
Pros
Actual application availability can be understood QoS, by application, can be factored into the availability measurement

Cons
Depending on scale, high overhead and cost can be expected


DATA COLLECTION FOR ROOT CAUSE ANALYSIS (RCA) OF NETWORK OR DEVICE DOWNTIME


Data Gathering Techniques


Cisco IOS Embedded RMON
Alarm and event
History and statistics
Set thresholds in router configuration
Configure an SNMP trap to be sent when a MIB variable rises above and/or falls below a given threshold
Alleviates the need for frequent polling
Not an availability methodology by itself, but can add valuable information and customization to the data collection method

Data Gathering Techniques


Syslog Messages
Provide information on what the router is doing
Categorized by feature and severity level
User can configure syslog logging levels
User can configure syslog messages to be sent as SNMP traps
Not an availability methodology by itself, but can add valuable information and customization to the data collection method


Expression and Event MIB


Expression MIB
Allows you to create new SNMP objects based upon formulas
MIB persistence is supported: a MIB's SNMP data persists across reloads
Delta and wildcard support allows you to:
Calculate utilization for all interfaces with one expression
Calculate errors as a percentage of traffic

Event MIB
Allows you to create custom notifications and log them and/or send them as SNMP traps or informs
MIB persistence is supported: a MIB's SNMP data persists across reloads
Can be used to test objects on other devices
More flexible than RMON events/alarms; RMON is tailored for use with counter objects

Data Gathering Techniques


Embedded Event Manager
Underlying philosophy:
Embed intelligence in routers and switches to enable a scalable and distributed solution, with OPEN interfaces for NMS/EMS leverage of the features

Mission statement:
Provide robust, scalable, powerful, and easy-to-use embedded managers to solve problems such as syslog and event management within Cisco routers and switches


Embedded Event Manager (Cont.)


Development goal: predictable, consistent, scalable management
Distributed Independent of central management system

Control is in the customers hands


Customization

Local programmable actions:


Triggered by specific events


Cisco IOS Embedded Event Manager: Basic Architecture (v1)


[Diagram: syslog events, SNMP data, and other events feed the syslog, SNMP, and other event detectors; the Embedded Event Manager applies network knowledge through EEM policies, producing actions such as notify, switchover, and reload.]

EEM Versions
EEM Version 1
Allows policies to be defined using the Cisco IOS CLI applet
The following policy actions can be established:
Generate prioritized syslog messages
Generate a CNS event for upstream processing by Cisco CNS devices
Reload the Cisco IOS software
Switch to a secondary processor in a fully redundant hardware configuration

EEM Version 2
EEM Version 2 adds programmable actions using the Tcl subsystem within Cisco IOS Includes more event detectors and capabilities

EEM Version 2 Architecture


More event detectors!
Define policies or programmable local actions using Tcl
Register policy with EEM Server
Events trigger policy execution
Tcl extensions for CLI control and defined actions

[Diagram: event publishers (syslog daemon, system manager, watchdog sysmon, HA redundancy facility, POSIX process manager, timer services, counters, interface counters and stats, SNMP, IOS subsystems) feed event detectors in the Embedded Event Manager Server; Tcl-shell subscribers receive events and implement EEM policy actions, and applications can publish their own events through the application-specific event detector.]


What Does This Mean to the Business?


Better problem determination
Widely applicable scripts from Cisco engineering and TAC
Automated local action triggered by events
Automated data collection

Faster problem resolution


Reduces the "next time it happens, please collect..." cycle
Better diagnostic data to Cisco engineering
Faster identification and repair

Less downtime
Reduce susceptibility and Mean Time to Repair (MTTR)

Better service
Responsiveness Prevent recurrence Higher availability

Not an availability methodology by itself but can add valuable information and customization to the data collection method

INSTILLING AN AVAILABILITY CULTURE


Putting an Availability Program into Practice


Track network availability
Identify defects
Identify root cause and implement fix
Reduce operating expense by eliminating non-value-added work

How much does an outage cost today? How much can I save through process and product enhancements?

How Do I Start?
1. What are you using now?
a. Add or modify trouble ticketing analysis b. Add or improve active monitoring method

2. Processanalyze the data!


a. What caused an outage? b. Can a root cause be identified and addressed?

3. Implement improvements or fixes 4. Measure the results 5. Back to step 1are other metrics needed?


If You Have a Network Availability Method


Use the current method and metric for improvement
Dont try to change completely Use incremental improvements Develop additional methods to gather data as identified

Concentrate on understanding unavailability causes. All unavailability causes should be classified at a minimum under:
Change, SW, HW, power/facility, or link

Identify the actions to correct unavailability causes


e.g., network design, customer process change, HW MTBF improvement, etc.

Multilayer Network Design

SA Agent Between Access and Distribution

[Diagram: multilayer design with access, distribution, and core/backbone layers, a server farm, and building-block additions connecting to WAN, Internet, and PSTN.]

Multilayer Network Design

SA Agent Between Servers and WAN Users

[Diagram: the same multilayer design; SA Agent probes run between the server farm and WAN users.]

Multilayer Network Design

COOL for High-End Core Devices

[Diagram: the same multilayer design; COOL runs on the high-end core/backbone devices.]

Multilayer Network Design

Trouble Ticketing Methodology

[Diagram: the same multilayer design; the trouble ticketing methodology covers the whole network.]

AVAILABILITY MEASUREMENT SUMMARY


Summary
The availability metric is governed by your business objectives
Availability measurement's primary goals are:
To provide an availability baseline (maintain) To help identify where to improve the network To monitor and control improvement projects

Can you identify "where you are now" for your network? Do you know "where you are going" as network-oriented business objectives? Do you have a plan to take you there?

Complete Your Online Session Evaluation!


WHAT: Complete an online session evaluation and your name will be entered into a daily drawing
WHY: Win fabulous prizes! Give us your feedback!
WHERE: Go to the Internet stations located throughout the Convention Center
HOW: Winners will be posted on the onsite Networkers Website; four winners per day


Recommended Reading
Performance and Fault Management
ISBN: 1-57870-180-5

High Availability Network Fundamentals


ISBN: 1-58713-017-3

Network Performance Baselining


ISBN: 1-57870-240-2

The Practical Performance Analyst


ISBN: 0-07-912946-3


Recommended Reading (Cont.)


The Visual Display of Quantitative Information
by Edward Tufte (ISBN: 0-9613921-0)

Practical Planning for Network Growth


by John Blommers (ISBN: 0-13-206111-2)

The Art of Computer Systems Performance Analysis


by Raj Jain (ISBN: 0-471-50336-3)

Implementing Global Networked Systems Management: Strategies and Solutions


by Raj Ananthanpillai (ISBN: 0-07-001601-1)

Information Systems in Organizations: Improving Business Processes


by Richard Maddison and Geoffrey Darnton (ISBN: 0-412-62530-X)

Integrated Management of Networked SystemsConcepts, Architectures, and Their Operational Application


by Hegering, Abeck, Neumair (ISBN: 1558605711)

Appendix A: Acronyms
AVG: Average
ATM: Asynchronous Transfer Mode
DPM: Defects Per Million
FCAPS: Fault, Configuration, Accounting, Performance, Security
GE: Gigabit Ethernet
HA: High Availability
HDLC: High-Level Data Link Control
HSRP: Hot Standby Routing Protocol
IPM: Internet Performance Monitor
IUM: Impacted User Minutes
MIB: Management Information Base
MTBF: Mean Time Between Failure
MTTR: Mean Time to Repair
RME: Resource Manager Essentials
RMON: Remote Monitor
SA Agent: Service Assurance Agent
SNMP: Simple Network Management Protocol
SPF: Single Point of Failure; Shortest Path First (routing protocol)
TCP: Transmission Control Protocol


BACKUP SLIDES


ADDITIONAL RELIABILITY SLIDES


Network Design What Is Reliability?


Reliability is often used as a general term that refers to the quality of a product
Failure rate
MTBF (Mean Time Between Failures) or MTTF (Mean Time to Failure)
Availability


Reliability Defined Reliability:


1. The probability of survival (or no failure) for a stated length of time
2. Or, the fraction of units that will not fail in the stated length of time

A mission time must be stated
Annual reliability is the probability of survival for one year


Availability Defined Availability:


1. The probability that an item (or network, etc.) is operational, and ready to go, at any point in time
2. Or, the expected fraction of time it is operational; annual uptime is the amount of time (in days, hrs., min., etc.) the item is operational in a year

Example: For 98% availability, the annual uptime is 0.98 * 365 days = 357.7 days


MTBF Defined
MTBF stands for Mean Time Between Failure MTTF stands for Mean Time to Failure
This is the average length of time between failures (MTBF) or to a failure (MTTF)
More technically, it is the mean time to go from an operational state to a non-operational state
MTBF is usually used for repairable systems, and MTTF is used for non-repairable systems


How Reliable Is It?


MTBF Reliability:
R = e^-(MTBF/MTBF) = e^-1 = 36.8%

MTBF reliability is only about 37%; that is, 63% of your HARDWARE fails before the MTBF! But remember, failures are still random!


MTTR Defined
MTTR stands for Mean Time to Repair
or

MRT (Mean Restore Time)


This is the average length of time it takes to repair an item More technically, it is the mean time to go from a nonoperational state to an operational state


One Method of Calculating Availability


Availability = MTBF / (MTBF + MTTR)

What is the availability of a computer with MTBF = 10,000 hrs. and MTTR = 12 hrs.?
A = 10000 / (10000 + 12) = 99.88%


Uptime
Annual uptime
8,760 hrs/year X (0.9988) = 8,749.5 hrs

Conversely, annual DOWNtime is,


8,760 hrs/year X (1- 0.9988) = 10.5 hrs
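Both calculations chained together in Python, mirroring the two slides above:

```python
mtbf, mttr = 10_000.0, 12.0
availability = mtbf / (mtbf + mttr)              # ≈ 0.9988

annual_uptime_h = 8760 * availability            # ≈ 8,749.5 hrs
annual_downtime_h = 8760 * (1 - availability)    # ≈ 10.5 hrs
```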


Systems
Components In-Series: Component 1 -> Component 2

Components In-Parallel (Redundant): Component 1 alongside Component 2

[RBD (reliability block diagram) sketches of both arrangements.]

In-Series

[Timeline diagram: Part 1 and Part 2 each alternate between up and down; the in-series combination is down whenever either part is down.]


In-Parallel

[Timeline diagram: Part 1 and Part 2 each alternate between up and down; the in-parallel combination is down only while both parts are down at the same time.]


In-Series MTBF
COMPONENT 1 MTBF = 2,500 hrs. MTTR = 10 hrs. COMPONENT 2 MTBF = 2,500 hrs. MTTR = 10 hrs.

Component Failure Rate = 1/2500 = 0.0004
System Failure Rate = 0.0004 + 0.0004 = 0.0008
System MTBF = 1/(0.0008) = 1,250 hrs.

In-Series Reliability
COMPONENT 1 MTBF = 2,500 hrs. MTTR = 10 hrs. COMPONENT 2 MTBF = 2,500 hrs. MTTR = 10 hrs.

Component ANNUAL Reliability: R = e^-(8760/2500) = 0.03
System ANNUAL Reliability: R = 0.03 X 0.03 = 0.0009

In-Series Availability
COMPONENT 1 MTBF = 2,500 hrs. MTTR = 10 hrs. COMPONENT 2 MTBF = 2,500 hrs. MTTR = 10 hrs.

Component Availability: A = 2500 / (2500 + 10) = 0.996
System Availability: A = 0.996 X 0.996 = 0.992
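The three in-series examples collected in one place, as plain arithmetic mirroring the slides:

```python
import math

mtbf, mttr, n = 2500.0, 10.0, 2    # two identical components in series

# failure rates add in series
lam = 1 / mtbf                     # 0.0004 failures/hr per component
sys_mtbf = 1 / (n * lam)           # 1,250 hrs

# reliabilities multiply in series
r_comp = math.exp(-8760 / mtbf)    # ≈ 0.03 annual reliability
r_sys = r_comp ** n                # ≈ 0.0009

# availabilities multiply in series
a_comp = mtbf / (mtbf + mttr)      # ≈ 0.996
a_sys = a_comp ** n                # ≈ 0.992
```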

In-Parallel MTBF
COMPONENT 1 MTBF = 2,500 hrs. COMPONENT 2 MTBF = 2,500 hrs.

In general*, System MTBF = sum over i = 1 to n of (MTBF/i)

System MTBF = 2500 + 2500/2 = 3,750 hrs.

*For 1-of-n Redundancy of n Identical Components with NO Repair or Replacement of Failed Components

1-of-4 Example

Sum over i = 1 to 4 of (2500/i) = 2500/1 + 2500/2 + 2500/3 + 2500/4 = 5,208 hrs.

In general*, System MTBF = sum over i = 1 to n of (MTBF/i)

*For 1-of-n Redundancy of n Identical Components with NO Repair or Replacement of Failed Components

In-Parallel Reliability
COMPONENT 1 MTBF = 2,500 hrs. MTTR = 10 hrs. COMPONENT 2 MTBF = 2,500 hrs. MTTR = 10 hrs.

Component ANNUAL Reliability: R = e^-(8760/2500) = 0.03
System ANNUAL Reliability: R = 1 - [(1-0.03) X (1-0.03)] = 1 - 0.94 = 0.06

In-Parallel Availability
COMPONENT 1 MTBF = 2,500 hrs. MTTR = 10 hrs. COMPONENT 2 MTBF = 2,500 hrs. MTTR = 10 hrs.

Component Availability: A = 2500 / (2500 + 10) = 0.996
System Availability: A = 1 - [(1-0.996) X (1-0.996)] = 1 - 0.000016 = 0.999984
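And the in-parallel examples: the 1-of-n MTBF formula (no repair) plus the reliability and availability combinations, again just the slide math in Python:

```python
import math

mtbf, mttr = 2500.0, 10.0

def one_of_n_mtbf(n):
    # 1-of-n identical components, no repair: MTBF * (1 + 1/2 + ... + 1/n)
    return mtbf * sum(1.0 / i for i in range(1, n + 1))

one_of_n_mtbf(2)                     # 3,750 hrs
one_of_n_mtbf(4)                     # ≈ 5,208 hrs

r = math.exp(-8760 / mtbf)           # ≈ 0.03 component annual reliability
r_sys = 1 - (1 - r) ** 2             # ≈ 0.06 system annual reliability

a = mtbf / (mtbf + mttr)             # ≈ 0.996 component availability
a_sys = 1 - (1 - a) ** 2             # ≈ 0.999984 system availability
```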



Complex Redundancy
Examples: 1-of-2, 2-of-3, 2-of-4, 8-of-10; in general, m-of-n

[Diagram: n components 1, 2, 3, ..., n in pure active parallel.]



More Complex Redundancy


Pure active parallel
All components are on

Standby redundant
Backup components are not operating

Perfect switching
Switch-over is immediate and without fail

Switchover reliability
The probability of switchover when it is not perfect

Load sharing
All units are on and workload is distributed

Networks Consist of Series-Parallel


Combinations of in-series and redundant components

[RBD diagram: component A in series with a 1-of-2 redundant pair (B1, B2) and a 2-of-3 redundant group (D1, D2, D3).]


Failure Rate
The number of failures per time:
Failures/hour
Failures/day
Failures/week
Failures/10^6 hours
Failures/10^9 hours, called FITs (Failures in Time)


Approximating MTBF
13 units are tested in a lab for 1,000 hours, with 2 failures occurring
Another 4 units were tested for 6,000 hours, with 1 failure occurring
The failed units are repaired (or replaced)
What is the approximate MTBF?


Approximating MTBF (Cont.)


MTBF = (13*1000 + 4*6000) / (2 + 1) = 37,000 / 3 = 12,333 hours
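In code: total unit-hours on test divided by the number of failures observed.

```python
unit_hours = 13 * 1000 + 4 * 6000   # 37,000 total unit-hours of testing
failures = 2 + 1                    # 3 failures observed
mtbf = unit_hours / failures        # ≈ 12,333 hours
```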


Modeling
Distributions: Normal, Log-Normal, Weibull, Exponential

[Histogram diagrams: time-to-failure frequency distributions, each with the MTBF marked.]

Constant Failure Rate

The Exponential Distribution


The exponential function:
f(t) = λe^(-λt), t > 0
Failure rate, λ, IS CONSTANT: λ = 1/MTBF

If MTBF = 2,500 hrs., what is the failure rate?
λ = 1/2500 = 0.0004 failures/hr.


The Bathtub Curve

[Diagram: the bathtub curve: failure rate decreases during infant mortality, stays constant over the useful life period, and increases at wear-out.]


The Exponential Reliability Formula


Commonly used for electronic equipment
The exponential reliability formula: R(t) = e^(-λt), or R(t) = e^(-t/MTBF)


Calculating Reliability
A certain Cisco router has an MTBF of 100,000 hrs; what is the annual reliability?
Annual reliability is the reliability for one year or 8,760 hrs

R = e^-(8760/100000) = 91.6%
This says that the probability of no failure in one year is 91.6%; or, 91.6% of all units will survive one year
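The same calculation in Python:

```python
import math

mtbf_hrs = 100_000.0
hours_per_year = 8760
annual_reliability = math.exp(-hours_per_year / mtbf_hrs)  # ≈ 0.916, i.e. 91.6%
```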


ADDITIONAL TROUBLE TICKETING SLIDES


Essential Data Elements


Parameter                   Format              Description
Date                        dd/mmm/yy           Date ticket issued
Ticket                      Alphanumeric        Trouble ticket number
Start Date                  dd/mmm/yy           Date of fault
Start Time                  hh:mm               Time of fault
Resolution Date             dd/mmm/yy           Date of resolution
Resolution Time             hh:mm               Time of resolution
Customers Impacted          Integer             Number of customers that lost service; number impacted or names of customers impacted
Problem Description         String              Outline of the problem
Root Cause                  String              HW, SW, process, environmental, etc.
Component/Part/SW Version   Alphanumeric        For HW problems include product ID; for SW include release version
Type                        Planned/Unplanned   Identifies whether the event was due to planned maintenance activity or an unplanned outage
Resolution                  String              Description of action taken to fix the problem

Note: Above Is the Minimum Data Set, However, if Other Information Is Captured it Should Be Provided

HA Metrics/NAIS Synergy
Referral for Analysis

Data Analysis

Baseline availability Determine DPM Network reliability improvement analysis (Defects Per Million) Trouble Tickets by: Problem management Definitions
Planned/Unplanned Root Cause Resolution Equipment Data accuracy Collection processes

Operational Process and Procedures Analysis

MTTR

Fault management Resiliency assessment Change management Performance management Availability management


Analyzed Trouble Ticket Data Referral for Process/Procedural Improvement


ADDITIONAL SA AGENT SLIDES


SA Agent: How It Works

1. User configures Collectors through the Mgmt Application GUI
2. Mgmt Application provisions Source routers with Collectors via SNMP
3. Source router measures and stores performance data, e.g. response time and availability
4. Source router evaluates SLAs and sends SNMP traps on violations
5. Source router stores the latest data point plus 2 hours of aggregated points
6. Application retrieves data from the Source routers once an hour
7. Data is written to a database
8. Reports are generated
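The Collector provisioning in steps 1-2 can also be done by hand at the CLI. A minimal sketch for an ICMP echo Collector on IOS 12.x; the target address and collector number are illustrative:

```
! Minimal SA Agent (SAA) collector: ICMP echo to a test point,
! probed once a minute. Target IP and collector number are examples.
rtr 10
 type echo protocol ipIcmpEcho 192.0.2.1
 frequency 60
rtr schedule 10 life forever start-time now
! Allow the router to send SA Agent SNMP notifications (step 4)
snmp-server enable traps rtr
```

A management application does the equivalent of this over SNMP via the CISCO-RTTMON-MIB.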

SAA Monitoring IP Core

[Figure: edge routers R1, R2 and R3 run SA Agent probes across the IP core (core routers P1, P2, P3); a Management System collects the results]

Monitoring Customer IP Reachability

[Figure: customer networks Nw1-NwN attach via edge routers P1-Pn; test points TP1-TPx sit in the IP core]

P1-Pn: Service Assurance Agent ICMP polls to a test point in the IP core

Service Assurance Agent Features

Measures Service Level Agreement (SLA) metrics:
- Packet loss
- Response time
- Availability
- Throughput
- Jitter

Evaluates SLAs and proactively sends notification of SLA violations

SA Agent Impact on Devices

- Low impact on CPU utilization
- Approximately 18 KB of memory per SA Agent Collector
- The SAA "rtr low-memory" command prevents new Collectors from being configured when free memory runs low

Monitored Network Availability Calculation

Not calculated, because:
- An availability baseline already exists
- Fault type, frequency and downtime may be more useful
- Faults are directly measured from the management system(s)

Monitored Network Availability Assumptions

- All connections below IP are fixed
- Management systems can be notified of all fixed-connection state changes
- All Layer 2 events impact the IP (Layer 3) service

ADDITIONAL COOL SLIDES

CLIs

Configuration CLI commands:
  [no] cool run
  [no] cool interface <interface-name (idb)>
  [no] cool physical-FRU-entity <entity-index (int)>
  [no] cool group-interface <group-objectID (string)>
  [no] cool add-cpu <objectID> <threshold> <duration>
  [no] cool remote-device <dest-IP (paddr)> <obj-descr (string)> <rate (int)> <repeat (int)> [<local-ip (paddr)> <mode (int)>]
  [no] cool if-filter <group-objectID (string)>

Display CLI commands:
  Router# show cool event-table [<number-of-entries>]   (displays all entries if not specified)
  Router# show cool object-table [<object-type (int)>]  (displays all object types if not specified)
  Router# show cool fru-entity

Exec CLI commands:
  Router# clear cool event-table
  Router# clear cool persistent-files
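Putting the commands together, a monitoring setup might look like the following sketch; interface names, entity indexes and addresses are illustrative, and the remote-device form follows the example shown later in this section:

```
! Enable COOL and select objects to monitor; names and addresses are examples.
cool run
cool interface POS1/0
cool group-interface ATM2/0.
cool physical-FRU-entity 7
cool remote-device 1 50.1.1.2 remobj.1 30 2 50.1.1.1 1
!
! Then inspect the collected outage data:
! Router# show cool object-table
! Router# show cool event-table 10
```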

Measurement Example: Router Device Outage

Reload (operational), power outage, or device H/W failure

- Type: interface (1), physicalEntity (2), process (3), or remoteObject (4)
- Index: the corresponding MIB table index; for physicalEntity (2), the index into the ENTITY-MIB
- Status: up (1) or down (2)
- Last-change: time of the last object status change
- AOT: Accumulated Outage Time (seconds)
- NAF: Number of Accumulated Failures
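The AOT and NAF counters map directly onto the usual availability metrics. A sketch, where the measurement window is an assumed value:

```python
# Derive MTTR, MTBF and availability from COOL's AOT/NAF counters.
# window_sec is the measurement period; all values are illustrative.
window_sec = 30 * 24 * 3600   # 30-day window
aot_sec = 540                 # Accumulated Outage Time (seconds)
naf = 3                       # Number of Accumulated Failures

mttr_sec = aot_sec / naf                  # mean time to repair
mtbf_sec = (window_sec - aot_sec) / naf   # mean time between failures
availability = (window_sec - aot_sec) / window_sec

print(mttr_sec, round(availability * 100, 4))  # 180.0 99.9792
```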

Measurement Example: Cisco IOS S/W Outage

- Standby RP in slot 0 crashed using "test crash" address error (4); an AdEL exception, caused purely by Cisco IOS S/W
- Standby RP crashed using "test crash" jump to zero (5); a Bp exception, which can be caused by S/W, H/W, or operation

Measurement Example: Linecard Outage

- Add a linecard, then reset the linecard
- The down event and the subsequent up event are captured in the event table
- AOT and NAF are updated in the object table

Measurement Example: Interface Outage

1. Configure monitoring of all interfaces matching the string ATM2/0., except ATM2/0.3:

   12406-R1202(config)# cool group-interface ATM2/0.
   12406-R1202(config)# no cool group-interface ATM2/0.3

   The object table now contains ATM2/0.1, ATM2/0.2, ATM2/0.4 and ATM2/0.5:

   12406-R1202# sh cool object 1 | include ATM2/0.
   33  1  1054859087  0  0  0  ATM2/0.1
   35  1  1054859088  0  0  0  ATM2/0.2
   39  1  1054859090  0  0  0  ATM2/0.4
   41  1  1054859090  0  0  0  ATM2/0.5

2. Shut the interface; down events are captured:

   12406-R1202(config)# interface ATM2/0
   12406-R1202(config-if)# shut

   12406-R1202# show cool event-table
   **** COOL Event Table ****
   type  index  event  time-stamp  interval  hist_id  object-name
   1     33     1      1054859105  18        1        ATM2/0.1
   1     35     1      1054859106  18        2        ATM2/0.2
   1     39     1      1054859107  17        3        ATM2/0.4
   1     41     1      1054859108  18        4        ATM2/0.5

3. No shut the interface; up events are captured:

   12406-R1202(config)# interface ATM2/0
   12406-R1202(config-if)# no shut

   12406-R1202# show cool event-table
   **** COOL Event Table ****
   type  index  event  time-stamp  interval  hist_id  object-name
   1     33     0      1054859146  41        1        ATM2/0.1
   1     35     0      1054859147  41        2        ATM2/0.2
   1     39     0      1054859149  42        3        ATM2/0.4
   1     41     0      1054859150  42        4        ATM2/0.5

4. The object table now shows the updated AOT and NAF:

   12406-R1202# sh cool object 1 | include ATM2/0.
   33  1  1054859087  0  41  1  ATM2/0.1
   35  1  1054859088  0  41  1  ATM2/0.2
   39  1  1054859090  0  42  1  ATM2/0.4
   41  1  1054859090  0  42  1  ATM2/0.5

Measurement Example: Remote Device Outage

1. Three remote devices are added:

   12406-R1202(config)# cool remote-device 1 50.1.1.2 remobj.1 30 2 50.1.1.1 1
   12406-R1202(config)# cool remote-device 2 50.1.2.2 remobj.2 30 2 50.1.2.1 1
   12406-R1202(config)# cool remote-device 3 50.1.3.2 remobj.3 30 2 50.1.3.1 1

   12406-R1202# sh cool object-table 4 | include remobj
   1  1  1054867061  0  0  remobj.1
   2  1  1054867063  0  0  remobj.2
   3  1  1054867065  0  0  remobj.3

2. Shut down the interface link between the remote devices and the router; down events are captured:

   12406-R1202(config)# interface ATM2/0
   12406-R1202(config-if)# shut

   4  2  5  1054867105  42  2   remobj.2
   4  1  5  1054867108  47  3   remobj.1
   4  3  5  1054867130  65  10  remobj.3

3. No shut the interface link; up events are captured:

   12406-R1202(config)# interface ATM2/0
   12406-R1202(config-if)# no shut

   4  1  4  1054867171  63  1   remobj.1
   4  3  4  1054867193  63  8   remobj.3
   4  2  4  1054867200  95  10  remobj.2

4. The object table now shows the updated AOT and NAF:

   12406-R1202# sh cool object-table 4 | include remobj
   1  1  1054867061  63  1  remobj.1
   2  1  1054867063  63  1  remobj.2
   3  1  1054867065  95  1  remobj.3
