
Deal with Production Issues

Suggestions from ITIL


Problems to solve

Long resolution time


Neglected issues
Issues we lose track of until our
users remind us
Recurring issues
Inconsistency in response time
Developers are constantly distracted
by having to resolve issues
Goal

Manage issues in a consistent
manner
Fast resolution
Reduce client impact
Proactively resolve issues
before they impact clients
Basic Concepts
Incidents
Any event which is not part of the standard
operation of a service and which causes, or may
cause an interruption to or a reduction in, the
quality of that service
Problems
A problem is a condition often identified as the
cause of multiple incidents that exhibit common
symptoms.
Known Errors
A known error is a condition identified by
successful diagnosis of the root cause of a
problem, and subsequent development of a
workaround.
Relationship of the three

A Problem is the root cause of its
Incidents
An Incident is the manifestation of an
underlying Problem
One Problem can cause many
Incidents
A Known Error is a Problem with a
known root cause and a known
workaround
Manage Incident vs. Manage
Problem
Different goals
Incident Management focuses on restoring
service operation as quickly as possible
Problem Management focuses on finding and
eliminating the root cause
Different actions
Incident Management applies workarounds or
temporary fixes to quickly restore the services
Problem Management issues a change to
fundamentally eliminate the root cause
Incident Management is reactive and
Problem Management is proactive
Incident Management emphasizes speed and
Problem Management emphasizes quality
Common mistakes

Spending tremendous time and
effort to find the root cause before
the service level is recovered
Stopping the investigation after an
incident is fixed by a
workaround
Letting the same incident occur
repeatedly without
understanding the root cause
Solutions from ITIL

Separate Incident Management
and Problem Management into two
independent but related processes
Handle incidents (restore service) as
quickly as possible
Proactively and independently work
on resolving problems
Wisely manage Known Errors
Incident Management
Always remember the goal is to
restore the service level as quickly as
possible
How to go fast?
Classification
Match known errors and known
workarounds
Appropriate escalation
Go fast, but don't go crazy. Don't miss:
Record
Prioritize
Follow up
Incident Management Process
Acceptance And Record

Benefits of recording
Helps diagnose new incidents based
on known incidents
Helps Problem Management find the
root cause
Makes it easy to determine the impact
Makes it possible to track and control the issue
resolution.
Incident Reporting Channels
User
System Monitor/Alert
IT person
Incident Record
Unique ID
Basic diagnosis info
Timestamp
Symptoms
User info (name, contact info)
Who's responsible
Additional information
Screenshots
Logs
Status
New, Accepted, Scheduled, Assigned, Active,
Suspended, Resolved, Terminated
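The record fields and statuses listed above could be captured in code roughly as follows. This is only a sketch: the class name IncidentRecord, the field names and the in-memory ID counter are assumptions made for illustration, not part of ITIL or of any particular tracking tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from itertools import count


class Status(Enum):
    """Incident statuses as listed on the slide."""
    NEW = "New"
    ACCEPTED = "Accepted"
    SCHEDULED = "Scheduled"
    ASSIGNED = "Assigned"
    ACTIVE = "Active"
    SUSPENDED = "Suspended"
    RESOLVED = "Resolved"
    TERMINATED = "Terminated"


# Hypothetical ID source; a real tracking tool would assign the unique ID.
_next_id = count(1)


@dataclass
class IncidentRecord:
    # Basic diagnosis info
    symptoms: str
    user_name: str
    user_contact: str
    owner: str  # who's responsible
    incident_id: int = field(default_factory=lambda: next(_next_id))
    timestamp: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )
    # Additional information (file names or paths)
    screenshots: list[str] = field(default_factory=list)
    logs: list[str] = field(default_factory=list)
    status: Status = Status.NEW


incident = IncidentRecord(
    symptoms="Login page returns HTTP 500",
    user_name="J. Doe",
    user_contact="jdoe@example.com",
    owner="Service Desk",
)
incident.logs.append("app.log")
incident.status = Status.ACCEPTED
```

Keeping the status as an enumeration rather than free text makes it harder for two people to record the same state under different names.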
Classification

Classification
Possible reasons (application, network,
database, business logic, etc.)
Supporting group (application group,
database group, infrastructure group,
network group, etc.)
Prioritize
Priority = Impact X Urgency
Determine resolution timeline (resolve
within X hours) based on Service Level
Agreement
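One possible way to turn "Priority = Impact X Urgency" into a number is sketched below. The 1 to 3 rating scales and the score bands are assumptions chosen for illustration; the resolution targets (2, 4, 6 and 8 hours) are the ones given in the escalation table of this deck. In practice both would come from the Service Level Agreement.

```python
from datetime import timedelta

# Resolution targets per priority, as in the escalation table of this deck.
RESOLUTION_TARGET = {
    1: timedelta(hours=2),
    2: timedelta(hours=4),
    3: timedelta(hours=6),
    4: timedelta(hours=8),
}


def priority(impact: int, urgency: int) -> int:
    """Return a priority from 1 (highest) to 4.

    Assumption: impact and urgency are each rated 1 (high) to 3 (low),
    and the score bands below are illustrative only.
    """
    if impact not in (1, 2, 3) or urgency not in (1, 2, 3):
        raise ValueError("impact and urgency must be 1, 2 or 3")
    score = impact * urgency  # Priority = Impact x Urgency
    if score == 1:
        return 1
    if score == 2:
        return 2
    if score <= 4:
        return 3
    return 4


p = priority(impact=1, urgency=2)
print(p, RESOLUTION_TARGET[p])  # 2 4:00:00
```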
Preliminary Support

Preliminary Response
Acknowledgement of acceptance
Collect basic info

Provide basic help to the user

Service Requests
A Service Request is a standard service such as
checking status, resetting a password, etc.
Go through standard procedure to
handle service requests
Match

Match known errors


Known solution
Known workaround
Known resolution procedure

Match existing incidents


Link the new incident with the existing
incidents
Increase the impact level of the existing
incident
If the existing one is already being worked on,
inform the responsible person/group
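The matching step could be automated in a very simple form, as sketched below. Matching on shared words is an assumption made to keep the example short; the KnownError class and the match_known_errors function are hypothetical names, and a real Service Desk tool would normally provide its own search.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KnownError:
    error_id: str
    symptoms: str
    workaround: str


def _words(text: str) -> set[str]:
    """Lowercase word set, ignoring very short words."""
    return {w for w in text.lower().split() if len(w) > 2}


def match_known_errors(incident_symptoms, known_errors, min_overlap=2):
    """Return known errors sharing at least min_overlap words with the
    incident's symptoms, best match first."""
    incident_words = _words(incident_symptoms)
    scored = []
    for ke in known_errors:
        overlap = len(incident_words & _words(ke.symptoms))
        if overlap >= min_overlap:
            scored.append((overlap, ke))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [ke for _, ke in scored]


known = [
    KnownError("KE-1", "report export times out",
               "Export in smaller batches"),
    KnownError("KE-2", "login fails after password reset",
               "Clear the session cache"),
]
matches = match_known_errors(
    "User says login fails after a password reset", known
)
print([ke.error_id for ke in matches])  # ['KE-2']
```

When a match is found, the Service Desk can apply the recorded workaround immediately instead of starting a new investigation.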
Investigation and Diagnosis

Escalation
Functional escalation (Technical
escalation) : Involve more
technical experts, involve teams in
other functional group, or involve
external suppliers
Hierarchical escalation
(Management escalation):
Escalate to higher level
management team
Escalation by Priorities
Priority  Resolution  0 min  10 min  30% of    60% of    100% of
          timeline                   timeline  timeline  timeline
1         2 hr        A      B       C, D      E, F
2         4 hr        A      B       C         D         E, F
3         6 hr        A      B       C         D
4         8 hr        A      B       C

A (Service Desk)
B (Second Line)
C (Third Line, Supplier)
D (Incident Manager)
E (Division Management)
F (Corporate Management)
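A time-bound escalation rule of this kind can be checked mechanically. The sketch below encodes only the Priority 2 row (4 hour timeline: A at 0 minutes, B at 10 minutes, C at 30%, D at 60%, E and F at 100% of the timeline); other rows could be added in the same shape. The data layout and the roles_involved function are assumptions made for illustration.

```python
from datetime import timedelta

ROLES = {
    "A": "Service Desk",
    "B": "Second Line",
    "C": "Third Line, Supplier",
    "D": "Incident Manager",
    "E": "Division Management",
    "F": "Corporate Management",
}

# Each step is (kind, value, roles). "abs" is minutes since the incident
# was opened; "pct" is a fraction of the resolution timeline.
PRIORITY_2 = {
    "timeline": timedelta(hours=4),
    "steps": [
        ("abs", 0, "A"),
        ("abs", 10, "B"),
        ("pct", 0.30, "C"),
        ("pct", 0.60, "D"),
        ("pct", 1.00, "EF"),
    ],
}


def roles_involved(schedule, elapsed):
    """Return the roles that should be involved once `elapsed` time
    has passed since the incident was opened."""
    involved = []
    for kind, value, roles in schedule["steps"]:
        if kind == "abs":
            threshold = timedelta(minutes=value)
        else:
            threshold = schedule["timeline"] * value
        if elapsed >= threshold:
            involved.extend(ROLES[r] for r in roles)
    return involved


print(roles_involved(PRIORITY_2, timedelta(minutes=90)))
# ['Service Desk', 'Second Line', 'Third Line, Supplier']
```

Running such a check on a schedule (for example every few minutes against all open incidents) is one way to enforce escalation instead of relying on someone remembering to escalate.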
Investigation Activities

Assign dedicated support person


Collect basic info
Query historical data
Recent releases
Recent changes
Workload trend

Analyze
Again, don't spend too much time
finding the root cause. Find a
workaround as soon as possible!
Resolve and recover

Resolution (workarounds or
permanent fix)
Create a Request For Change (RFC)
Approve RFC

Implement Change.

Record the analysis, the root cause,
the workaround and the solution
Leave the incident in Open status
when a resolution hasn't been found
Termination

Contact the user to confirm the
incident is resolved
Change the Incident status to
Closed
Update the Incident record to
reflect the final priority, impact,
user and root cause
Track and Monitor

Assign an owner to each
incident. Usually it's the Service
Desk person.
Provide feedback to the users
after a change
Enforce the escalation based on
the priority
Problem Management

Problem Control
Find the root cause of a problem
Turn a problem into a Known Error

Error Control
Control and Monitor the Known Errors
until they are appropriately handled
Proactive Problem Management
Resolve problems before they cause
any incidents
Problem Control
Identify Problems

Analyze the trends of incidents


Likely to reoccur
Likely more will occur
Likely to have larger impact

Analyze the weaknesses of the
infrastructure
Availability
Capability

A significant incident (outage)
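A very small form of incident trend analysis is to count how often the same kind of incident recurs and treat frequent ones as problem candidates. The sketch below assumes each closed incident is reduced to a (classification, symptom summary) pair; the sample data and the problem_candidates function are hypothetical.

```python
from collections import Counter

# Hypothetical closed incidents: (classification, symptom summary)
incidents = [
    ("database", "connection pool exhausted"),
    ("application", "report export times out"),
    ("database", "connection pool exhausted"),
    ("network", "VPN drops"),
    ("database", "connection pool exhausted"),
    ("application", "report export times out"),
]


def problem_candidates(incidents, min_occurrences=2):
    """Return (incident signature, count) pairs seen at least
    min_occurrences times, most frequent first."""
    counts = Counter(incidents)
    return [
        (signature, n)
        for signature, n in counts.most_common()
        if n >= min_occurrences
    ]


for (category, symptom), n in problem_candidates(incidents):
    print(f"{category}: {symptom} ({n} incidents)")
# database: connection pool exhausted (3 incidents)
# application: report export times out (2 incidents)
```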


Diagnosis

Recreate the incident in a testing
environment
Link the modules with incidents
Review the latest changes
After the root cause of a
problem is found, this problem
becomes a Known Error
Temporary Fixes
It's important to find a temporary fix if
the problem causes a significant
incident
If the temporary fix involves changes in
the infrastructure, a Request For
Change must be submitted. (Later,
another RFC may be submitted to fix
the root cause)
For urgent problems, the Emergency
Change Request process should be
initiated.
Error Control
Identify and Record Known
Error
Identify
Find the root cause of a problem
Link a problem with a known error
Record
Assign an ID
Symptoms
Root cause
Status
Notification
Notify incident management team. They
can associate new incidents with known
errors
Determine the solution

Evaluate based on
Service Level Agreement
Impact and Urgency
Cost and benefit

Possible solutions
Temporary fixes
Permanent fixes
No fix (cost is greater than benefits)

Record the decision in the Problem
Database
Known Errors from other
environments
Known errors from development
environment
We may choose to release with some
minor known issues
Known errors from suppliers
Usually reported in the release notes
Record, Monitor and Track those
known errors
Relate problems with those known
errors
PIR (Post Implementation
Review)
Normal problems
Confirm all the related incidents are closed
Verify if the problem record is complete
(symptoms, root cause and solutions)
Change the problem status to Resolved

Significant problems
What went well?
What went wrong?
How to do better next time?
How to prevent similar issues from
happening again?
Track and Monitor

Track the full lifecycle of each
known error
Reevaluate impact and urgency.
Adjust the priorities accordingly.
Monitor the progress of the
diagnosis and implementation of
the solution. Monitor the
implementation of the RFC.
Proactive Problem
Management
Focus on the quality of the
service and the infrastructure
Analyze operational trends
Detect the potential incidents
and prevent them from
happening
Find out the weak points of the
infrastructure or the overloaded
components
Ideas to improve our
Production Support process
Idea 1: Create an independent Problem
Management Team.
Idea 2: Create a Problem Database
Idea 3: Define the Production Support
Procedure
Idea 4: Review and revise the procedures
of using TeamTrack
Idea 5: Enforce Post Implementation
Review
Idea 6: Proactively manage problems
Idea 7 (optional): Acquire Service Desk
software to facilitate the process
Create an independent
Problem Management Team.
Can be a full time team or a part time team
Appoint a Problem Management Manager.
Must be different than the Production
Support Manager. Their goals, schedules
and requirements are different.
Responsible for managing all the production
problems (not incidents) for multiple
applications
Identify problems
Record problem
Find and evaluate solutions
Track the progress till closure
Work closely with the existing Production
Support team.
Create a Problem Database
An easy-to-search knowledge database
Include problems and known errors
Track symptoms, root causes, temporary
fixes, workarounds, and permanent
solutions
Include all the known errors in DEV and
unresolved or deferred defects in QA/RATE
environments
Maintained by the Problem Management
Team
Will be used by the Production Support team
for matching and fast resolution of incidents
Define the Production Support
Procedure (Work Instructions)
Create a formal and detailed document.
Train Production Support Team to follow
the new procedure
Start with ITIL Incident Management
Process. Adjust it to our own situation and
tools
Clearly define how to calculate priorities
Clearly define the time-bound escalation
procedure
Clearly define the monitoring and tracking
steps
Review and revise the procedure
for using TeamTrack
TeamTrack is our existing Incident Tracking
system
Review the functions of TeamTrack
Redefine the incident escalation process
according to ITIL suggestions
Define the interface between PC Support
and IT Production Support Team
Communication channel
Roles and responsibilities
Escalation
Track and Control
Knowledge sharing
Enforce PIR

Contact each user to confirm all
the incidents are closed
Make sure the Problem record
is complete and useful
Identify issues in the Incident
and Problem Management
process. Add those to Problem
database.
Proactively Manage Problems
Responsibility of the Problem Management
Team.
Perform the following activities:
Analyze incidents to find the trend
Analyze infrastructure to identify possible
bottlenecks
Run fail-over and stress tests
Apply a problem solution across multiple related
applications
Establish and maintain the Production Monitor
System to proactively detect system anomalies
Evaluate how many problems are
proactively identified and resolved
Service Desk Software

Evaluate the existing TeamTrack
software and see if it covers our
needs
Other popular options
HP OpenView Service Desk
Remedy Strategic Service Suite

CA Unicenter Service Desk
