SESSION NMS-2201
NMS-2201 9627_05_2004_c2
Agenda
Introduction
Availability Measurement Methodologies
  Trouble Ticketing
  Device Reachability: ICMP (Ping), SA Agent, COOL
  SNMP: Uptime, Ping-MIB, COOL, EEM, SA Agent
  Application
2004 Cisco Systems, Inc. All rights reserved. Printed in USA. Presentation_ID.scr
Associated Sessions
NMS-1N01: Intro to Network Management
NMS-1N02: Intro to SNMP and MIBs
NMS-1N04: Intro to Service Assurance Agent
NMS-1N41: Introduction to Performance Management
NMS-2042: Performance Measurement with Cisco IOS
ACC-2010: Deploying Mobility in HA Wireless LANs
NMS-2202: How Cisco Achieved HA in Its LAN
RST-2514: HA in Campus Network Deployments
NMS-4043: Advanced Service Assurance Agent
RST-4312: High Availability in Routing
Design
Environmental issues
Natural disasters
Software issues
Performance and load
Scaling
Process Design
Downtime per Year (24x7x365)
Availability   Downtime per Year
99.000%        3 Days 15 Hours 36 Minutes
99.500%        1 Day 19 Hours 48 Minutes
99.900%        8 Hours 46 Minutes
99.950%        4 Hours 23 Minutes
99.990%        53 Minutes
99.999%        5 Minutes
99.9999%       30 Seconds
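The downtime figures follow from simple arithmetic on a year of 8,760 hours. The sketch below is illustrative only; the function name is hypothetical and not part of the presentation. For example, 99% availability gives 87.6 hours, or 3 days 15 hours 36 minutes.

```python
# Convert an availability percentage into expected downtime per year (24x7x365).
HOURS_PER_YEAR = 24 * 365  # 8,760 hours


def downtime_hours_per_year(availability_pct: float) -> float:
    """Return the yearly downtime, in hours, implied by an availability percentage."""
    return HOURS_PER_YEAR * (1.0 - availability_pct / 100.0)


if __name__ == "__main__":
    for pct in (99.0, 99.9, 99.99, 99.999):
        hours = downtime_hours_per_year(pct)
        print(f"{pct}% -> {hours:.2f} hours ({hours * 60:.1f} minutes)")
```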
Availability Definition
Availability definition is based on business objectives
Is it the user experience you are interested in measuring? Are some users more important than others?
Availability groups?
Definitions of different groups
Reliability is defined as the probability of survival (or no failure) for a stated length of time
MTBF Defined
MTBF stands for Mean Time Between Failure
MTTF stands for Mean Time to Failure
This is the average length of time between failures (MTBF) or to a failure (MTTF)
More technically, it is the mean time to go from an OPERATIONAL STATE to a NON-OPERATIONAL STATE
MTBF is usually used for repairable systems, and MTTF is used for non-repairable systems
What is the availability of a computer with MTBF = 10,000 hrs. and MTTR = 12 hrs?
A = MTBF / (MTBF + MTTR) = 10,000 / (10,000 + 12) = 99.88%
Annual uptime
8,760 hrs/year X (0.9988) = 8,749.5 hrs
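The same calculation can be sketched in a few lines of Python. The function names are illustrative, and the uptime check uses the rounded availability of 0.9988, as the slide does.

```python
HOURS_PER_YEAR = 8760  # 24 x 365


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def annual_uptime_hours(avail: float) -> float:
    """Expected hours of uptime in one year for a given availability (0..1)."""
    return HOURS_PER_YEAR * avail


if __name__ == "__main__":
    a = availability(10_000, 12)
    print(f"A = {a:.4%}, annual uptime = {annual_uptime_hours(a):.1f} hrs")
```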
[Figure: reliability block diagram (RBD) with block A, a 1/2 group (B1, B2), and a 2/3 group (D1, D2, D3)]
Standby redundant
Backup components are not operating
Perfect switching
Switch-over is immediate and without fail
Switch-over reliability
The probability of switchover when it is not perfect
Load sharing
All units are on and workload is distributed
Types of Availability
Device/interface
Path
Users
Application
Step II
Measure uptime on an ongoing basis
Track defects per million (DPM), IUM, or availability (%)
Step III
Track customer impact for each ticket/MTTR
Categorize DPM by reason code and begin trending
Identify initiatives/areas of focus to eliminate defects
Collect network performance data monthly and export the following fields to a spreadsheet or database system:
Outage start time (date/time)
Service restore time (date/time)
Problem description
Root cause
Resolution
Number of customers impacted
Equipment model
Component/part
Planned maintenance activity/unplanned activity
Total customers/ports on network
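As a rough sketch of how these fields could feed a DPM figure: the code below assumes DPM is impacted user minutes divided by total user minutes in the period, times one million. That weighting is an assumption made for illustration, not a formula taken from the slides, and the `Ticket` class and function names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Ticket:
    """Hypothetical record built from the exported trouble-ticket fields."""
    outage_start: datetime
    service_restored: datetime
    customers_impacted: int
    planned: bool


def impacted_user_minutes(tickets) -> float:
    """Sum of (outage minutes x customers impacted) over all tickets."""
    total = 0.0
    for t in tickets:
        minutes = (t.service_restored - t.outage_start).total_seconds() / 60.0
        total += minutes * t.customers_impacted
    return total


def dpm(tickets, total_customers: int, period_minutes: float) -> float:
    """Defects per million, assuming impact-weighted user minutes (an assumption)."""
    return impacted_user_minutes(tickets) / (total_customers * period_minutes) * 1_000_000


def availability(tickets, total_customers: int, period_minutes: float) -> float:
    """Availability as 1 minus the impacted share of user minutes."""
    return 1.0 - impacted_user_minutes(tickets) / (total_customers * period_minutes)
```

For one 60-minute outage affecting 100 of 10,000 customers in a 30-day month (43,200 minutes), this yields roughly 13.9 DPM.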
Results are not necessarily limited to the above but should be customized based on your network and requirements
Availability = 1 -
MTBF =
MTTR =
[Figure: illustrative monthly chart, July through June]
Key takeaways
DPM
[Figure: illustrative chart and table of monthly DPM, June through December, showing Other, Platform Related, and Total DPM against the 99.99% target of 100 DPM. Recoverable values: June: Other 339.5, Platform Related 49.2, Total 388.7. Platform Related DPM for July, August, and September: 82.5, 104, and 52.6.]
Platform related DPM contributed 13% of total DPM in September
Platform DPM includes events from:
Backbone NAS PG POP Radius Server VPN Radius Server
Network Access Server (NAS) accounts for 50% of the total Platform related DPM in September
Private Access Gateway (PG) shows a significant decrease over the past 3 months
DPM by Cause
[Figure: illustrative chart and table of DPM by cause, by month (December through May), with a TOTAL row; DPM axis from 0 to 2,500]
[Figure: illustrative MTTR charts by month (June through December): MTTR in hours, and percentage of total faults by time to resolve, including >24 Hr and >100 Hr categories]
Number of faults increased slightly in September; however, MTTR decreased
49% of faults were resolved in < 1 hour in September
11% of faults were resolved in > 24 hours, with an additional 3% taking > 100 hours
Unplanned DPM
[Figure: illustrative chart and table of unplanned DPM by month (February through December); DPM axis from 0 to 1,000]
Key takeaways
Action plans
Identify areas of focus to enable reduction of DPM to achieve network availability goal
Cons
Some internal subjective/consistency process issues
Outages may occur that are not included in the trouble ticketing systems
Resources needed to scrub data and create reports
May not work with existing trouble ticketing system/process
Step II
Establish network availability baseline
Measure uptime on an ongoing basis
Step III
Track root cause and customer impact
Begin trending of availability issues
Identify initiatives and areas of focus to eliminate defects
Event Log
Analysis of events received from the network devices
Analysis of accuracy of the data
Event Log Example
Fri Jun 15 11:05:31 2001 Debug: Looking for message header ...
Fri Jun 15 11:05:33 2001 Debug: Message header is okay
Fri Jun 15 11:05:33 2001 Debug: $(LDT) -> "06152001110532"
Fri Jun 15 11:05:33 2001 Debug: $(MesgID) -> "100013"
Fri Jun 15 11:05:33 2001 Debug: $(NodeName) -> "ixc00asm"
Fri Jun 15 11:05:33 2001 Debug: $(IPAddr) -> "10.25.0.235"
Fri Jun 15 11:05:33 2001 Debug: $(ROCom) -> "xlr8ed!"
Fri Jun 15 11:05:33 2001 Debug: $(RWCom) -> "s39o!d%"
Fri Jun 15 11:05:33 2001 Debug: $(NPG) -> "CISCO-Large-special"
Fri Jun 15 11:05:33 2001 Debug: $(AlrmDN) -> "aSnmpStatus"
Fri Jun 15 11:05:33 2001 Debug: $(AlrmProp) -> "system"
Fri Jun 15 11:05:33 2001 Debug: $(OSN) -> "Testing"
Fri Jun 15 11:05:33 2001 Debug: $(OSS) -> "Normal"
Fri Jun 15 11:05:33 2001 Debug: $(DSN) -> "SNMP_Down"
Fri Jun 15 11:05:33 2001 Debug: $(DSS) -> "Agent_Down"
Fri Jun 15 11:05:33 2001 Debug: $(TrigName) -> "NodeStateUp"
Fri Jun 15 11:05:33 2001 Debug: $(BON) -> "nl-ping"
Fri Jun 15 11:05:33 2001 Debug: $(TrapGN) -> "-2"
Fri Jun 15 11:05:33 2001 Debug: $(TrapSN) -> "-2
Cons
Requires an excellent change management/provisioning process
Requires an efficient and effective fault management system
Requires custom development
Does not account for routing problems
Not a true end-to-end measure
[Figure: count of incidents by category (Host Totals, Network Totals, Other Totals) with a grand total; Other Totals account for 7%]
How:
Edge interfaces and/or devices are defined and pinged at a defined interval
Unavailability:
Pre-defined, non-response from the interface
Cons
Point to multipoint implies not a true end-to-end measure
Availability granularity limited by ping frequency
Maintenance of device database: must have a solid change management and provisioning process
How:
An agent is configured to SNMP poll and tabulate outage times for defined devices or links; database maintains outage times and total service time; sometimes trap information is used to augment this method by providing more accurate information on outages
Unavailability:
Pre-defined, non-redundant links, ports, or devices that are down
The smaller the polling interval, the more detailed (granular) the data collected
Example: polling once every 15 minutes provides 4 times the detail (granularity) of polling once an hour
A smaller polling interval does not necessarily provide a better margin of error
Example: polling once every 15 minutes for one hour has the same margin of error as polling once an hour for 4 hours
How:
Use existing NMS systems that already poll via SNMP to tabulate outage times for defined devices or links
A database maintains outage times and total service time
SNMP trap information is also used to augment this method by providing more accurate information on outages
Cons
No canned SW to do this; custom development
Maintaining element device database challenging
Requires an excellent change mgmt and provisioning process
Does not account for routing problems
Not a true end-to-end measure
How:
A data collector creates SA Agents on the routers to monitor selected network/service performance metrics; the data collector then collects this data from the routers, aggregates it, and makes it available
Unavailability:
Pre-defined paths with reporting on non-redundant links, ports, or devices that are down within a path
[Figure: network diagram showing DNS, SA Agent collectors, and remote sites]
DPM is calculated for connectivity between 2 points on the network: the source and the destination of the probe
The source of the probe is usually a management system, and the destinations are the managed devices
DPM can be calculated for every managed device
DPM = (Probes with No Response / Total Probes Sent) x 10^6
Availability = 1 - (Probes with No Response / Total Probes Sent)
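The two probe formulas translate directly into code. This is a minimal sketch; the function names are illustrative and the counts would come from whatever system sends the probes.

```python
def probe_dpm(no_response: int, total_sent: int) -> float:
    """DPM = (probes with no response / total probes sent) x 10^6."""
    if total_sent <= 0:
        raise ValueError("total_sent must be positive")
    return no_response / total_sent * 1_000_000


def probe_availability(no_response: int, total_sent: int) -> float:
    """Availability = 1 - (probes with no response / total probes sent)."""
    if total_sent <= 0:
        raise ValueError("total_sent must be positive")
    return 1.0 - no_response / total_sent
```

For example, 3 unanswered probes out of 10,000 sent correspond to 300 DPM and 99.97% availability.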
Availability = 1 -
Sample Size
Sample size is the number of samples that have been collected
The more samples collected, the higher the confidence that the data accurately represents the network
Confidence (margin of error) is defined by
m = 1 / sqrt(sample size)
m = 1 / sqrt(24) = 0.2041
After One Month
m = 1 / sqrt(24 x 31) = 0.0367
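The margin-of-error values can be reproduced with a one-line function; the name is illustrative. It also shows why the polling interval alone does not change the margin: only the number of samples matters.

```python
import math


def margin_of_error(sample_size: int) -> float:
    """Margin of error m = 1 / sqrt(sample size)."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    return 1.0 / math.sqrt(sample_size)


if __name__ == "__main__":
    print(round(margin_of_error(24), 4))       # 24 samples
    print(round(margin_of_error(24 * 31), 4))  # 24 samples a day for 31 days
    # Four samples give the same margin whether collected in one hour or four.
    print(margin_of_error(4))
```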
Cons
Requires a system to collect the SAA data
Requires implementation in the router configurations
Availability granularity limited by polling frequency
Definition of the critical network paths to be measured
COOL Objectives
To automate the measurement to increase operational efficiency and reduce operational cost
To measure the outage as close to the source of outage events as possible, to pinpoint the cause of the outages
To cope with a large number of network elements without causing system and network performance degradation
To maintain measurement data reliably in the presence of element failure or network partition
To support simplicity in deployment, configuration, and data collection (autonomous measurement)
COOL Features
Open access via Outage Monitor MIB
Embedded in Router
Automated Real-Time Measurement
Autonomous Measurement
Outage Data Stored in Router
[Figure: NMS, COOL running in the access router, and customer equipment]
Two-tier framework
Reduces performance impact on the router
Provides scalability to the NMS
Makes deployment easy
Provides flexibility in availability calculation
[Figure: multiple NMS stations in the two-tier framework]
Outage Model
[Figure: access router (RP, physical interface) connected by links to customer equipment and a peer router, and to the network management system, with the monitored object types A, B, C, and D marked]
Type  Objects Monitored        Failure Modes
A     Physical Entity Objects  Component hardware or software failure, including the failure of line card, power supplies, fan, switch fabric, and so on
B     Interface Objects        Interface hardware or software failure, loss of signal
C     Remote Objects           Failure of remote device (customer equipment or peer networking device) or link in between
D     Software Objects         Failure of software processes running on the RPs and line cards
Outage Characterization
Data Definition
Defect threshold: a value across which the object is considered to be defective (service degradation or complete outage)
Duration threshold: the minimum period beyond which an outage needs to be reported (given SLA)
Start time: when the object outage starts
End time: when the outage ends
[Figure: outage characterization against the defect threshold]
Architecture
[Figure: COOL architecture. Labeled components: Customer Interfaces, Outage Monitor MIB (SNMP polling, SNMP notification), Configuration (CLI, customer authentication), Outage Manager, Measurement Methods, NVRAM, ATA Flash, and SAA APIs, marked as baseline or optional]
[Figure: up/down timelines for Router 1 and for Router 1 Interface 1, showing system crash outages over time]
Example: MTTR
Find MTTR for Object i
MTTR_i = AOT_i / NAF_i = 14 / 2 = 7 min
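A minimal sketch of this calculation, assuming AOT is the accumulated outage time and NAF the number of accumulated failures for the object (these expansions are assumptions; the slide gives only the abbreviations). The two outages of 10 and 4 minutes come from the example.

```python
def mttr(outage_minutes) -> float:
    """MTTR = AOT / NAF for one object, given its individual outage durations."""
    naf = len(outage_minutes)   # number of accumulated failures (assumed meaning)
    if naf == 0:
        raise ValueError("no failures recorded")
    aot = sum(outage_minutes)   # accumulated outage time (assumed meaning)
    return aot / naf


if __name__ == "__main__":
    print(mttr([10, 4]))  # two failures: 10 min and 4 min
```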
[Figure: up/down timeline for Object i, with a 10 min. failure at T1 and a 4 min. failure at T2; TTR marks the time to repair for each failure]
[Figure: operation-caused and planned outages: Send Break, Reload, Forced Switchover]
Event Filtering
Flapping interface detection and filtering:
Some faulty interfaces keep changing state between up and down
May cause virtual network disconnection
May cause an event storm, with hundreds of messages for each flapping event
May make the object MTBF unreasonably low due to frequent short failures
This unstable condition needs the operator's attention
COOL detects the flapping status:
  Catches very short outage events (shorter than the duration threshold)
  Increments the event counter
  Flapping status: if the counter goes over the flapping threshold (3 events) within the short period (1 sec), sends a notification
  Stable status: if the counter falls below the threshold, sends another notification
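The flapping logic described above can be sketched as a sliding-window counter. This is not COOL's implementation; the class and method names are hypothetical, and the sketch assumes that reaching 3 short outages within 1 second counts as flapping and that dropping below 3 counts as stable again.

```python
from collections import deque


class FlapDetector:
    """Illustrative sliding-window flap detector (not the COOL implementation)."""

    def __init__(self, flap_threshold: int = 3, window_seconds: float = 1.0):
        self.flap_threshold = flap_threshold
        self.window_seconds = window_seconds
        self._events = deque()  # timestamps of recent short outages
        self.flapping = False

    def record_short_outage(self, timestamp: float):
        """Record one short outage; return 'flapping', 'stable', or None if unchanged."""
        self._events.append(timestamp)
        return self._update(timestamp)

    def tick(self, now: float):
        """Re-evaluate the state at time `now` without a new event."""
        return self._update(now)

    def _update(self, now: float):
        # Drop events that have fallen out of the observation window.
        while self._events and now - self._events[0] > self.window_seconds:
            self._events.popleft()
        over = len(self._events) >= self.flap_threshold  # assumption: >= threshold
        if over and not self.flapping:
            self.flapping = True
            return "flapping"   # a notification would be sent here
        if not over and self.flapping:
            self.flapping = False
            return "stable"     # a second notification would be sent here
        return None
```

Returning a state change only on transitions keeps the number of notifications to two per flapping episode, which matches the goal of avoiding an event storm.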
[Figure: outage data storage on the active RP and standby RP: COOL outage data in RAM, periodic update to NVRAM, and copies of persistent outage data in FLASH]
Data persistency
To avoid data loss due to a link outage or a crash of the router itself
Data redundancy
To continue the outage measurement after the switchover
To retain the outage data even if the RP is physically replaced
ENTITY-MIB
entPhysicalTable (Physical Entity Object Description)
CISCO-PROCESS-MIB
cpmProcessTable (Process Object Description)
Configuration
[Figure: COOL configuration and display paths: Config CLI, Show CLI (show event-table, show object-table), MIB display, and the customer equipment detection function, with updates to COOL]
Enabling COOL
ari#dir
Directory of disk0:/
1  -rw-  19014056  Oct 29 2003 16:09:28 +00:00  gsr-k4p-mz.120-26.S.bin

128057344 bytes total (109051904 bytes free)
ari#copy tftp disk0:
Address or name of remote host []? 88.1.88.9
Source filename []? auth_file
Destination filename [auth_file]?
Accessing tftp://88.1.88.9/auth_file...
Loading auth_file from 88.1.88.9 (via FastEthernet1/2): !
[OK - 705 bytes]
705 bytes copied in 0.532 secs (1325 bytes/sec)
ari#clear cool per
ari#clear cool persist-files
ari#conf t
Enter configuration commands, one per line. End with CNTL/Z.
ari(config)#cool run
ari(config)#^Z
ari#wr mem
Building configuration...
[OK][OK][OK]
COOL
Pros
Accurate network availability for devices, components, and software
Accounts for routing problems
Implementation with low network overhead
Enables correlation between active and passive availability methodologies
Cons
Only a few systems currently have the COOL feature
Requires implementation in the router configurations of production devices
Availability granularity limited by polling frequency
New Cisco IOS feature
Application Reachability
How:
Agents on client and server computers collect data: Fire Runner, Ganymede Chariot, Gyra Research, Response Networks, Vital Signs Software, NetScout, custom application queries on customer systems
Installing special probes located on user and server subnets to send, receive and collect data; NikSun and NetScout
Unavailability:
Pre-defined QoS definition
Application Reachability
Pros
Actual application availability can be understood
QoS, by application, can be factored into the availability measurement
Cons
Depending on scale, potential high overhead and cost can be expected
DATA COLLECTION FOR ROOT CAUSE ANALYSIS (RCA) OF NETWORK OR DEVICE DOWNTIME
Event MIB
Allows you to create custom notifications and log them and/or send them as SNMP traps or informs
MIB persistence is supported: a MIB's SNMP data persists across reloads
Can be used to test objects on other devices
More flexible than RMON events/alarms
RMON is tailored for use with counter objects
Mission statement:
Provide robust, scalable, powerful, and easy-to-use embedded managers to solve problems such as syslog and event management within Cisco routers and switches
[Figure: event detectors (syslog event detector, SNMP event detector, other event detectors) feed EEM, which triggers actions]
EEM Versions
EEM Version 1
Allows policies to be defined using the Cisco IOS CLI applet
The following policy actions can be established:
  Generate prioritized syslog messages
  Generate a CNS event for upstream processing by Cisco CNS devices
  Reload the Cisco IOS software
  Switch to a secondary processor in a fully redundant hardware configuration
EEM Version 2
EEM Version 2 adds programmable actions using the Tcl subsystem within Cisco IOS
Includes more event detectors and capabilities
More event detectors!
Define policies or programmable local actions using Tcl
Register policy with EEM Server
Events trigger policy execution
Tcl extensions for CLI control and defined actions
[Figure: EEM Version 2 architecture. Labeled elements: timer services, counters, redundancy facility, IOS process watchdog, Tcl shell, EEM policy; subscribers receive events and implement policy actions; applications receive and publish application events using an application-specific event detector]
Less downtime
Reduce susceptibility and Mean Time to Repair (MTTR)
Better service
Responsiveness
Prevent recurrence
Higher availability
Not an availability methodology by itself but can add valuable information and customization to the data collection method
How Do I Start?
1. What are you using now?
a. Add or modify trouble ticketing analysis
b. Add or improve active monitoring method
3. Implement improvements or fixes
4. Measure the results
5. Back to step 1: are other metrics needed?
Concentrate on understanding unavailability causes
All unavailability causes should be classified at a minimum under:
Change, SW, HW, power/facility, or link
[Figure: network diagram with Distribution, Core, Core/Backbone, Server Farm, WAN, Internet, and PSTN blocks]
Summary
The availability metric is governed by your business objectives
The primary goals of availability measurement are:
To provide an availability baseline (maintain)
To help identify where to improve the network
To monitor and control improvement projects
Can you identify "where you are now" for your network?
Do you know "where you are going" in terms of network-oriented business objectives?
Do you have a plan to take you there?
Recommended Reading
Performance and Fault Management
ISBN: 1-57870-180-5
Appendix A: Acronyms
AVG: Average
ATM: Asynchronous Transfer Mode
DPM: Defects Per Million
FCAPS: Fault, Config, Acct, Perf, Security
GE: Gigabit Ethernet
HA: High Availability
HDLC: High Level Data Link Control
HSRP: Hot Standby Router Protocol
IPM: Internet Performance Monitor
IUM: Impacted User Minutes
MIB: Management Information Base
MTBF: Mean Time Between Failure
MTTR: Mean Time to Repair
RME: Resource Manager Essentials
RMON: Remote Monitor
SA Agent: Service Assurance Agent
SNMP: Simple Network Management Protocol
SPF: Single Point of Failure; Shortest Path First (routing protocol)
TCP: Transmission Control Protocol
BACKUP SLIDES
MTBF reliability is only 37%; that is, 63% of your HARDWARE fails before the MTBF! But remember, failures are still random!
MTTR Defined
MTTR stands for Mean Time to Repair
or
Systems
Components In-Series
[Figure: reliability block diagrams (RBD) built from Component 1 and Component 2]
In-Series
[Figure: up/down timelines for Part 1, Part 2, and the resulting in-series system]
In-Parallel
[Figure: up/down timelines for Part 1, Part 2, and the resulting in-parallel system]
In-Series MTBF
COMPONENT 1 MTBF = 2,500 hrs. MTTR = 10 hrs.
COMPONENT 2 MTBF = 2,500 hrs. MTTR = 10 hrs.
Component Failure Rate = 1/2500 = 0.0004
System Failure Rate = 0.0004 + 0.0004 = 0.0008
System MTBF = 1/(0.0008) = 1,250 hrs.
In-Series Reliability
COMPONENT 1 MTBF = 2,500 hrs. MTTR = 10 hrs.
COMPONENT 2 MTBF = 2,500 hrs. MTTR = 10 hrs.
Component ANNUAL Reliability: R = e^(-8760/2500) = 0.03
System ANNUAL Reliability: R = 0.03 X 0.03 = 0.0009
In-Series Availability
COMPONENT 1 MTBF = 2,500 hrs. MTTR = 10 hrs.
COMPONENT 2 MTBF = 2,500 hrs. MTTR = 10 hrs.
Component Availability: A = 2500 / (2500 + 10) = 0.996
System Availability: A = 0.996 X 0.996 = 0.992
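The in-series calculations (MTBF, annual reliability, availability) can be sketched together. The function names are illustrative, and the sketch assumes independent components with constant failure rates, which is what the slide arithmetic implies. Results differ slightly from the slides because the slides round intermediate values.

```python
import math


def series_mtbf(mtbfs) -> float:
    """System MTBF in series: reciprocal of the summed failure rates."""
    return 1.0 / sum(1.0 / m for m in mtbfs)


def reliability(hours: float, mtbf: float) -> float:
    """R(t) = e^(-t / MTBF) for a constant failure rate."""
    return math.exp(-hours / mtbf)


def series_reliability(hours: float, mtbfs) -> float:
    """In series, every component must survive: multiply the reliabilities."""
    return math.prod(reliability(hours, m) for m in mtbfs)


def availability(mtbf: float, mttr: float) -> float:
    """A = MTBF / (MTBF + MTTR)."""
    return mtbf / (mtbf + mttr)


def series_availability(components) -> float:
    """In series, multiply the availabilities; components are (MTBF, MTTR) pairs."""
    return math.prod(availability(mtbf, mttr) for mtbf, mttr in components)


if __name__ == "__main__":
    print(series_mtbf([2500, 2500]))
    print(series_reliability(8760, [2500, 2500]))
    print(series_availability([(2500, 10), (2500, 10)]))
```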
In-Parallel MTBF
COMPONENT 1 MTBF = 2,500 hrs.
COMPONENT 2 MTBF = 2,500 hrs.
In general*, System MTBF = sum for i = 1 to n of (MTBF / i)
*For 1-of-n Redundancy of n Identical Components with NO Repair or Replacement of Failed Components
1-of-4 Example
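System MTBF = sum for i = 1 to 4 of (2500 / i) = 2500/1 + 2500/2 + 2500/3 + 2500/4 = 5,208 hrs. (approx.)
In general*, System MTBF = sum for i = 1 to n of (MTBF / i)
*For 1-of-n Redundancy of n Identical Components with NO Repair or Replacement of Failed Components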
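A short sketch of the 1-of-n sum, under the same conditions as the footnote (n identical components, no repair or replacement). The function name is illustrative.

```python
def parallel_mtbf(component_mtbf: float, n: int) -> float:
    """System MTBF for 1-of-n redundancy: sum of MTBF / i for i = 1..n."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return sum(component_mtbf / i for i in range(1, n + 1))


if __name__ == "__main__":
    print(parallel_mtbf(2500, 2))  # two components in parallel
    print(parallel_mtbf(2500, 4))  # the 1-of-4 example
```

Each added component contributes less than the previous one (MTBF/2, MTBF/3, ...), so redundancy without repair shows diminishing returns.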
In-Parallel Reliability
COMPONENT 1 MTBF = 2,500 hrs. MTTR = 10 hrs.
COMPONENT 2 MTBF = 2,500 hrs. MTTR = 10 hrs.
Component ANNUAL Reliability: R = e^(-8760/2500) = 0.03
System ANNUAL Reliability: R = 1 - [(1 - 0.03) X (1 - 0.03)] = 1 - 0.94 = 0.06, where (1 - 0.03) is the unreliability of each component
In-Parallel Availability
COMPONENT 1 MTBF = 2,500 hrs. MTTR = 10 hrs.
COMPONENT 2 MTBF = 2,500 hrs. MTTR = 10 hrs.
[Callout: Unavailability]
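The in-parallel rule multiplies the complements (unreliability or unavailability) and subtracts the product from 1. The sketch below applies it to both reliability and availability; the availability result is computed here by analogy with the reliability slide and assumes independent failures, since the slide's own figure did not survive extraction. The function name is illustrative.

```python
import math


def parallel(values) -> float:
    """1 - product of (1 - v): the system fails only if every component fails."""
    return 1.0 - math.prod(1.0 - v for v in values)


if __name__ == "__main__":
    r = math.exp(-8760 / 2500)   # component annual reliability
    a = 2500 / (2500 + 10)       # component availability
    print(parallel([r, r]))      # system annual reliability
    print(parallel([a, a]))      # system availability (assumes independence)
```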
Complex Redundancy
Examples: 1-of-2
m-of-n
[Figure: n components in parallel, numbered 1, 2, 3, ..., n]
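For m-of-n redundancy, a common way to estimate system reliability is the binomial sum over the number of surviving components. This sketch assumes n identical, independent components with the same reliability r; that model is an assumption for illustration and is not stated on the slide.

```python
from math import comb


def m_of_n_reliability(m: int, n: int, r: float) -> float:
    """Probability that at least m of n identical, independent components survive."""
    if not 0 < m <= n:
        raise ValueError("require 0 < m <= n")
    return sum(comb(n, k) * r**k * (1 - r) ** (n - k) for k in range(m, n + 1))


if __name__ == "__main__":
    print(m_of_n_reliability(1, 2, 0.9))  # 1-of-2
    print(m_of_n_reliability(2, 3, 0.9))  # 2-of-3
```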
Failure Rate
The number of failures per time:
Failures/hour
Failures/day
Failures/week
Failures/10^6 hours
Failures/10^9 hours, called FITs (Failures in Time)
Approximating MTBF
13 units are tested in a lab for 1,000 hours with 2 failures occurring
Another 4 units were tested for 6,000 hours with 1 failure occurring
The failed units are repaired (or replaced)
What is the approximate MTBF?
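One common way to approach this question is to divide total operating hours by total failures. The sketch below uses that estimate as an assumption; the slide with the presenter's own answer is not part of the recovered text, and the function name is illustrative.

```python
def approximate_mtbf(test_runs) -> float:
    """Estimate MTBF as total unit-hours / total failures.

    test_runs: iterable of (units, hours_per_unit, failures) tuples.
    """
    runs = list(test_runs)
    total_hours = sum(units * hours for units, hours, _ in runs)
    total_failures = sum(failures for _, _, failures in runs)
    if total_failures == 0:
        raise ValueError("no failures observed; MTBF cannot be estimated this way")
    return total_hours / total_failures


if __name__ == "__main__":
    # 13 units x 1,000 hrs with 2 failures, plus 4 units x 6,000 hrs with 1 failure
    print(approximate_mtbf([(13, 1000, 2), (4, 6000, 1)]))
```

With these inputs the estimate is 37,000 unit-hours over 3 failures, or about 12,333 hours.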
Modeling
Distributions: Normal, Log-Normal, Weibull, Exponential
[Figure: frequency versus time-to-failure curves with the MTBF marked]
If MTBF = 2,500 hrs., what is the failure rate?
Failure rate = 1/MTBF = 1/2500 = 0.0004 failures/hr.
Failure Rate
[Figure: failure rate plotted over time, with the Infant Mortality and Wear-Out regions labeled]
Calculating Reliability
A certain Cisco router has an MTBF of 100,000 hrs; what is the annual reliability?
Annual reliability is the reliability for one year or 8,760 hrs
R = e^{-(8760/100000)} = 91.6%
This says that the probability of no failure in one year is 91.6%; or, 91.6% of all units will survive one year
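The same calculation as a sketch, assuming a constant failure rate. The fleet size of 1,000 units is a hypothetical figure used only to illustrate the "units surviving" reading.

```python
import math

def reliability(mtbf_hours, hours):
    """Probability of no failure over the given number of hours."""
    return math.exp(-hours / mtbf_hours)

annual = reliability(100_000, 8760)
fleet_size = 1000  # hypothetical number of installed routers
expected_survivors = fleet_size * annual
print(f"annual reliability: {annual:.1%}")
print(f"expected units with no failure after one year: {expected_survivors:.0f} of {fleet_size}")
```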
Format | Description
dd/mmm/yy | Date Ticket Issued
Alphanumeric | Trouble Ticket Number
dd/mmm/yy | Date of Fault
hh:mm | Time of Fault
dd/mmm/yy | Date of Resolution
hh:mm | Time of Resolution
Integer | Number of Customers that Lost Service; Number Impacted or Names of Customers Impacted
String | Outline of the Problem
String | HW, SW, Process, Environmental, etc.
Alphanumeric | For HW Problems Include Product ID; for SW Include Release Version
Planned/Unplanned | Identify if the Event Was Due to Planned Maintenance Activity or Unplanned Outage
String | Description of Action Taken to Fix the Problem
Note: The Above Is the Minimum Data Set; However, if Other Information Is Captured, It Should Be Provided
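As an illustration of how these fields feed an MTTR figure, the sketch below computes outage durations from the fault and resolution timestamps. The ticket records and dictionary keys are hypothetical, and the month abbreviation is assumed to be English.

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%d/%b/%y %H:%M"  # dd/mmm/yy hh:mm, e.g. "05/Jun/03 14:30"

def outage_hours(ticket):
    """Hours between the fault and its resolution."""
    start = datetime.strptime(f"{ticket['fault_date']} {ticket['fault_time']}", TIMESTAMP_FORMAT)
    end = datetime.strptime(f"{ticket['resolution_date']} {ticket['resolution_time']}", TIMESTAMP_FORMAT)
    return (end - start).total_seconds() / 3600.0

def mean_time_to_repair(tickets, outage_type="Unplanned"):
    """Average outage duration in hours for tickets of the given type."""
    durations = [outage_hours(t) for t in tickets if t["type"] == outage_type]
    return sum(durations) / len(durations) if durations else 0.0

# Hypothetical sample tickets
tickets = [
    {"number": "TT-0001", "fault_date": "05/Jun/03", "fault_time": "14:30",
     "resolution_date": "05/Jun/03", "resolution_time": "16:30", "type": "Unplanned"},
    {"number": "TT-0002", "fault_date": "10/Jun/03", "fault_time": "23:00",
     "resolution_date": "11/Jun/03", "resolution_time": "03:00", "type": "Unplanned"},
    {"number": "TT-0003", "fault_date": "15/Jun/03", "fault_time": "01:00",
     "resolution_date": "15/Jun/03", "resolution_time": "02:00", "type": "Planned"},
]

mttr = mean_time_to_repair(tickets)
print(f"unplanned MTTR: {mttr:.1f} hrs.")
```

Keeping the Planned/Unplanned field lets the same data produce separate figures for maintenance windows and real outages.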
HA Metrics/NAIS Synergy
Data Analysis
Baseline availability
Determine DPM (Defects Per Million) by: Planned/Unplanned, Root Cause, Resolution, Equipment
MTTR
Trouble Tickets
Definitions
Data accuracy
Collection processes
Referral for Analysis
Network reliability improvement analysis
Problem management
Fault management
Resiliency assessment
Change management
Performance management
Availability management
Management Application
1. User configures Collectors through Mgmt Application GUI
2. Mgmt Application provisions Source routers with Collectors
SA Agent
3. Source router measures and stores performance data, e.g., response time and availability
4. Source router evaluates SLAs, sends SNMP Traps
5. Source router stores latest data point and 2 hours of aggregated points
Management Application
6. Application retrieves data from Source routers once an hour
7. Data is written to a database
8. Reports are generated
[Figure: IP Core with probes P2 and P3, router R3, and the Management System]
[Figure: networks Nw1 to NwN with probes P1 to PN, and test points TP1 to TPx in the IP Core]
P1-PN: Service Assurance Agent ICMP Polls to a Test Point in the IP Core
CLIs
Configuration CLI Commands
[no] cool run <cr>
[no] cool interface interface-name(idb) <cr>
[no] cool physical-FRU-entity entity-index(int) <cr>
[no] cool group-interface group-objectID(string) <cr>
[no] cool add-cpu objectID threshold duration <cr>
[no] cool remote-device dest-IP(paddr) obj-descr(string) rate(int) repeat(int) [local-ip(paddr) mode(int)] <cr>
[no] cool if-filter group-objectID(string) <cr>
Type: interface(1), physicalEntity(2), Process(3), and remoteObject(4).
Index: the corresponding MIB table index. If the type is physicalEntity(2), it is the index in the ENTITY-MIB.
Status: Up (1), Down (2).
Last-change: last object status change time.
AOT: Accumulated Outage Time (sec).
NAF: Number of Accumulated Failures.
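AOT and NAF are enough to derive basic availability figures for an object. The sketch below assumes the measurement period is known to the collector, and takes MTBF as operating time per failure; both are assumptions, not definitions from the slides.

```python
def outage_metrics(aot_seconds, naf, period_seconds):
    """Return (availability, MTTR seconds, MTBF seconds) for one monitored object."""
    availability = 1.0 - aot_seconds / period_seconds
    mttr = aot_seconds / naf if naf else 0.0
    mtbf = (period_seconds - aot_seconds) / naf if naf else float("inf")
    return availability, mttr, mtbf

# Hypothetical example: 63 seconds of outage from 1 failure over one day
avail, mttr_s, mtbf_s = outage_metrics(aot_seconds=63, naf=1, period_seconds=86_400)
print(f"availability: {avail:.5%}, MTTR: {mttr_s:.0f} s, MTBF: {mtbf_s:.0f} s")
```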
Standby RP crash using the Jump to Zero (5) test crash; Bp exception
It can be caused by S/W, H/W, or operation
Configure to monitor all the interfaces whose names include the string ATM2/0, except ATM2/0.3
Object Table
ATM2/0.1 ATM2/0.2 ATM2/0.4 ATM2/0.5
Step 3: Shut down interface ATM2/0
12406-R1202(config)#interface ATM2/0
12406-R1202(config-if)#shut
show cool event-table
**** COOL Event Table ****
type index event time-stamp interval hist_id object-name
1 33 1 1054859105 18 1 ATM2/0.1
1 35 1 1054859106 18 2 ATM2/0.2
1 39 1 1054859107 17 3 ATM2/0.4
1 41 1 1054859108 18 4 ATM2/0.5
Down event captured
Step 4: No shut interface ATM2/0
12406-R1202(config)#interface ATM2/0
12406-R1202(config-if)#no shut
show cool event-table
**** COOL Event Table ****
type index event time-stamp interval hist_id object-name
1 33 0 1054859146 41 1 ATM2/0.1
1 35 0 1054859147 41 2 ATM2/0.2
1 39 0 1054859149 42 3 ATM2/0.4
1 41 0 1054859150 42 4 ATM2/0.5
Up event captured
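The two event tables can be paired to obtain the outage duration per object. The sketch below parses the output shown on these slides; it assumes event 1 means down and event 0 means up, which is inferred from the sample and not from a documented specification.

```python
DOWN, UP = 1, 0  # event codes as they appear in the sample output (assumed meaning)

def parse_event_rows(text):
    """Parse 'show cool event-table' rows; banner and header lines are skipped."""
    rows = []
    for line in text.strip().splitlines():
        fields = line.split()
        if len(fields) != 7 or not fields[0].isdigit():
            continue
        type_, index, event, timestamp, interval, hist_id = map(int, fields[:6])
        rows.append({"type": type_, "index": index, "event": event,
                     "timestamp": timestamp, "interval": interval,
                     "hist_id": hist_id, "name": fields[6]})
    return rows

def outage_durations(down_rows, up_rows):
    """Seconds between the down event and the matching up event, per object."""
    down_at = {r["name"]: r["timestamp"] for r in down_rows if r["event"] == DOWN}
    return {r["name"]: r["timestamp"] - down_at[r["name"]]
            for r in up_rows if r["event"] == UP and r["name"] in down_at}

down_output = """
**** COOL Event Table ****
type index event time-stamp interval hist_id object-name
1 33 1 1054859105 18 1 ATM2/0.1
1 35 1 1054859106 18 2 ATM2/0.2
1 39 1 1054859107 17 3 ATM2/0.4
1 41 1 1054859108 18 4 ATM2/0.5
"""

up_output = """
**** COOL Event Table ****
type index event time-stamp interval hist_id object-name
1 33 0 1054859146 41 1 ATM2/0.1
1 35 0 1054859147 41 2 ATM2/0.2
1 39 0 1054859149 42 3 ATM2/0.4
1 41 0 1054859150 42 4 ATM2/0.5
"""

down_rows = parse_event_rows(down_output)
up_rows = parse_event_rows(up_output)
durations = outage_durations(down_rows, up_rows)
for name, seconds in durations.items():
    print(f"{name}: down for {seconds} s")
```

For this sample, the computed durations equal the interval column of the up events (41 or 42 seconds).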
Object Table
Shut Down the Interface Link Between the Remote Device and Router
Down Event Captured
12406-R1202(config)#interface ATM2/0
12406-R1202(config-if)#no shut
4 4 4 1 3 2 4 4 4 1054867171 1054867193 1054867200 63 63 95 1 8 10 remobj.1 remobj.3 remobj.2
Up Event Captured
sh cool object-table 4 | include remobj
1 1 1054867061 63 1 remobj.1
2 1 1054867063 63 1 remobj.2
3 1 1054867065 95 1 remobj.3