Introduction
87% of IT problems reported to your Service Desk get fixed within hours, if not minutes. A further 11% are resolved within days by 2nd and 3rd line support, perhaps with help from supplier technical staff. The final 2% are the toughest and represent chronic problems that are fixed by trial & error, or simply remain unresolved. Unfortunately, trial & error is very slow, sometimes expensive and often disruptive.
The good news is that we are at the dawn of a new era in problem resolution. The wide adoption of ITIL has provided a framework for the management of incidents and problems, and this in turn is driving an interest in problem resolution methods, particularly those based on Root Cause Analysis (RCA). One such method is Advance7's Rapid Problem Resolution (RPR) Technique. In this paper we look at the need for RPR, when we should use it, how it works, the skills we need to practise it, and its challenges & limitations. We finish the paper with a very recent case study that helps demonstrate the benefits.
We can put this in an ITIL context. Incident Management covers all of Phase 1 and much of Phase 2, in that most of these issues are dealt with through any action that achieves rapid service recovery. Where the IT Team identifies repeated occurrences of the same problem, the issue moves into Problem Management. Such problems are typical of those that drop through Phase 2, are often fixed in Phase 3 and sometimes end up in Phase 4.
Pattern Method is at the core of a number of problem resolution methods and involves linking the first indication of a problem, a change in frequency or the pattern of problems to a common cause, e.g. only Windows XP users experience the problem, therefore the problem must be related to Windows XP.
Executive Whitepaper 2008 Advance Seven Limited Advance7 Defining IT stability, control & performance www.Advance7.com +44 (0) 1371 876805
In Phase 4, the combination of procedures, skills and tools available in a typical company has been unable to determine the cause of the problem. At this point the IT Team is faced with two choices:
1. Do more of the same, even though this has not produced the solution to date
2. Resort to trial and error, which could involve testing a range of changes, upgrading infrastructure, or replacing software
There is a third way: method-based problem resolution, and there are many methods available. Some have sprung from the IT industry, but many of the front-runners are actually adaptations of business problem resolution methods. There are some common shortcomings:
- Due to the soft nature of many business problems, methods with this lineage are not designed to take advantage of the logic and tools available to us in the IT industry
- Many of the methods are actually just processes with no supporting IT techniques, making it difficult for IT people to run the process
- Many methods require that we already know the root cause as one of a list of possible causes that are then tested
- To avoid disruption, some methods force the IT Team to attempt to recreate the problem in a lab environment, which is time consuming, can be expensive and rarely works
- Many methods rely on statistical analysis, which often fails with Phase 4 problems due to their intermittent nature and transient causes
- Some methods rely on trial and error
- Although many methods claim to be based on root cause analysis, most only achieve this with hindsight
RPR was designed from the outset to solve IT problems and is heavily influenced by software engineering techniques, primarily IBM's PSI/PD. From this starting point RPR avoids the shortcomings suffered by other methods because:
- It makes full use of the IT tools that are available in every business
- It is a fully mature method with a core 5-step process and supporting IT techniques
- RPR requires no pre-conceived idea of the cause of the problem; in fact such thoughts are positively discouraged
- The method uses non-disruptive techniques and so there is minimal business impact
- RPR doesn't require recreation in a lab environment, or even testing outside of normal working hours
- The method is based around the collection of definitive diagnostic data at the exact point of a problem and so precisely identifies the cause, transient or not
- RPR's primary objective is to identify the root cause
- RPR enhances the skills of the IT team and support companies
In the late 1970s and early 1980s IBM taught its software engineers a two-stage process of problem diagnosis called Problem Source Identification / Problem Determination (PSI/PD).
RPR Process
RPR starts with the premise that first we must identify the root cause, and only then can we define a fix. The method differs from many Root Cause Analysis methods that start from the fix and work backwards to the root cause. RPR begins by determining the root cause and then works forward to a solution. The root cause is determined through a five-step process:
1. Gain an accurate understanding of the problem at some level
2. Choose one specific symptom
3. Create an Action Plan to capture definitive diagnostic data for one or more identifiable instances of the chosen symptom
4. Execute the plan whilst controlling the environment
5. Analyse the results and either:
  - identify the root cause and determine a fix, or
  - define a new Action Plan and execute it
It is likely that we will need to iterate around the last three steps, revising our Action Plan and re-analysing the results. At first sight the process might look ridiculously simplistic, but the devil is in the detail.
A Single Symptom
RPR dictates that we can only diagnose one symptom at a time, even if we think many symptoms have one common cause. This can be a tougher proposition than you might think. The RPR Practitioner can come under considerable pressure to deal with all of the issues, particularly "as they are all linked". RPR warns against trying to establish patterns, i.e. links between differing symptoms.
There are many reasons for avoiding using a pattern method to diagnose problems, and here's an illustration of one. The Service Desk reported that users suffered a "3 o'clock slowdown". At around three o'clock every day users said that the network was slow. Three specific symptoms were identified: Outlook Inbox items were slow to open, Word documents sometimes took 30 seconds to save, and Citrix users suffered intermittent type-ahead delays. It turned out that none of these problems had a common cause. Starting from a presumption that all were linked would have led to failure to find the root cause of any of them.
If multiple symptoms have the same cause then by fixing one we will fix them all.
Definitive Diagnostics
Shortcomings with Statistics
Generating definitive diagnostics is a very big subject that alone could fill several whitepapers. The most important point of this step is that we must be able to gather diagnostics that can be directly correlated with the user's experience of the problem. RPR rejects the use of statistical data that cannot be directly matched to the moment that the problem occurred. This usually comes down to wiggly graphs like this:
[Figure: CPU Utilisation (%), on a scale of 0 to 100, plotted against Time of Day]
If the CPU utilisation of a server was constantly above 90% it would make sense to solve this, although even then such load does not guarantee that we have discovered the cause of the performance problem in a complex end-to-end system. More often than not we are actually faced with a graph like that above, the interpretation of which is very subjective.
To remove the subjectivity we might work on a designated overload threshold of 50% for the 95th percentile figure, but even that ignores the issue of transient problems, which can get hidden by the averaging that occurs over the sample period. A possible solution is to use a more granular measurement based on one-second samples, but even then we must be able to match the start and end time of the problem to the correct points on the graph. Without correlation there is a danger that we might spend money and, more importantly, time on upgrading the server only to find that we still have the problem.
Correlation
RPR proposes that we gather diagnostics that can be directly correlated with one or more user-experienced problems. So, returning to our earlier scenario of the slow response time to the Appointments button, we might decide to set up:
- A network trace for the user's PC
- A network trace showing everything going in and out of the application server
- A SQL trace for all database calls
- A perfmon study of CPU, memory and disk I/O on the application server and SQL server
We then wait for the problem and when it occurs we immediately mark the diagnostic data. Many of the Supporting Techniques of RPR are designed to address the issue of correlation of the user experience of a problem to the corresponding diagnostic events.
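The effect of averaging, and the value of matching samples to the marked start and end of a problem, can be illustrated with a minimal Python sketch. The sample values, the date and the helper name samples_in_window are all assumptions made up for illustration; they are not RPR tooling or real measurements.

```python
from datetime import datetime, timedelta
from statistics import mean


def samples_in_window(samples, start, end):
    """Return the (timestamp, value) samples that fall inside [start, end]."""
    return [(t, v) for t, v in samples if start <= t <= end]


# Hypothetical one-second CPU samples covering 11:00:00 to 11:00:59.
base = datetime(2008, 1, 7, 11, 0, 0)
samples = [(base + timedelta(seconds=i), 20.0) for i in range(60)]

# A five-second transient (11:00:30 to 11:00:34) pushes the CPU to 100%.
for i in range(30, 35):
    samples[i] = (samples[i][0], 100.0)

# Averaged over the whole minute, the transient all but disappears.
minute_average = mean(v for _, v in samples)

# Assume the problem was marked as starting at 11:00:30 and ending at 11:00:34.
problem_start = base + timedelta(seconds=30)
problem_end = base + timedelta(seconds=34)
window = samples_in_window(samples, problem_start, problem_end)
window_peak = max(v for _, v in window)

print(f"one-minute average: {minute_average:.1f}%")
print(f"peak during the marked problem window: {window_peak:.1f}%")
```

The one-minute figure would sit comfortably under a 50% threshold, while the samples that line up with the marked problem show the server saturated for the whole of the user's wait.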
Markers
Markers are a key RPR technique to achieve correlation. A marker is simply an entry in the diagnostic data that is generated under our control, and is unique and easily identifiable.
Here are some examples of markers:
- A ping with a payload length of 101 bytes: ping -n 1 -l 101 LONSERVER01
  - Generates an identifiable entry in network trace data
  - Correlate with perfmon by adding ICMP / Received Echo / sec counter
- A GET for a non-existent URL: http://intranet/marker101.asp
  - Generates a trace entry and web log entry so that we can match server and analyzer time
- A dir command for a file that doesn't exist
  - Generates a marker in a filemon and / or procmon trace
- Remote execution of Performance to show the Thread Count for a process every second
  - When the process dies we see the event in the network trace
- Use a SQL client to generate an identifiable query in a database: select marker101;
  - Generates a marker in a network trace and SQL profiler trace
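As a rough illustration of how a marker helps match clocks, the Python sketch below builds the ping marker command and then uses the marker's appearance in a trace to translate a time on the user's PC clock into analyser time. The record layout, timestamps and function names are assumptions for the example; a real analyser export will look different, and the sketch ignores network transit time.

```python
from datetime import datetime

MARKER_PAYLOAD = 101  # payload length of the marker ping described in the text


def ping_marker_command(host, payload=MARKER_PAYLOAD):
    """Build the Windows ping command line for a single marker ping."""
    return ["ping", "-n", "1", "-l", str(payload), host]


def find_marker(trace, payload=MARKER_PAYLOAD):
    """Return the analyser timestamp of the first ICMP packet with the marker payload."""
    for record in trace:
        if record["protocol"] == "ICMP" and record["payload_len"] == payload:
            return record["time"]
    return None


# Hypothetical decoded trace records (analyser clock).
trace = [
    {"time": datetime(2008, 1, 7, 15, 2, 10), "protocol": "TCP", "payload_len": 512},
    {"time": datetime(2008, 1, 7, 15, 2, 14), "protocol": "ICMP", "payload_len": 32},
    {"time": datetime(2008, 1, 7, 15, 2, 17), "protocol": "ICMP", "payload_len": 101},
]

# Time on the user's PC clock when the marker ping was issued (assumed noted by hand).
sent_at_pc = datetime(2008, 1, 7, 15, 2, 14)

seen_at_analyser = find_marker(trace)
clock_offset = seen_at_analyser - sent_at_pc  # network transit time is ignored

# A problem the user timed at 15:05:40 on the PC clock maps to this analyser time.
problem_on_pc = datetime(2008, 1, 7, 15, 5, 40)
problem_on_analyser = problem_on_pc + clock_offset

print(" ".join(ping_marker_command("LONSERVER01")))
print("clock offset:", clock_offset)
print("look in the trace at:", problem_on_analyser.time())
```

The unusual payload length is what makes the marker easy to pick out: an ordinary Windows ping carries 32 bytes by default, so the 101-byte packet stands out from routine ICMP traffic.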
In all these cases, and many more like them, we have to set the clock back to zero. We cannot safely assume that anything we have already discovered remains true. Of course, every IT department comes under great pressure to fix a problem as quickly as possible. Sometimes a pragmatic approach is needed and we are not able to follow the RPR method. We just need to make sure that everyone involved accepts that RPR will not work if ad-hoc changes are made.
- When studying a slow response time from, for example, an application server, try to account for all of the time spent on network interactions; anything left over must have occurred inside the server
- When studying a failure or a hang, compare diagnostics for a working scenario with those for a failure and focus on the point where things first differ
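The first of these techniques is simple arithmetic, sketched below in Python. The timings and leg names are invented for illustration and the function name is an assumption; in practice the figures would be read from trace data taken around the server.

```python
def time_inside_server_ms(total_ms, network_legs_ms):
    """Subtract the time seen on the network from the total response time."""
    accounted = sum(network_legs_ms.values())
    if accounted > total_ms:
        raise ValueError("network legs exceed the total response time")
    return total_ms - accounted


# Hypothetical timings, in milliseconds, for one slow request.
total_ms = 14200
network_legs_ms = {
    "request travelling from PC to application server": 40,
    "application server waiting on database calls": 13000,
    "response travelling from application server to PC": 60,
}

leftover_ms = time_inside_server_ms(total_ms, network_legs_ms)
print(f"time unaccounted for on the network: {leftover_ms} ms")
```

With these assumed figures only 1,100 ms is left inside the application server, which would point the next Action Plan at the database calls rather than at the server itself.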
It is useful to consider the type of output that we would like from the analysis. Continuing with the Appointments scenario, once we have successfully executed the plan, careful analysis of the diagnostics might lead us to the conclusions that:
- At the time of the user-reported problem we can see from network trace data that the database takes 13 seconds to execute a request
- The request is a stored procedure call, sp_GetCustomerDetail
- At this time SQL server CPU load is 65%, max disk I/O queue length is 2 and there is 1.2 GB of memory in use
- Analysis of a matching SQL Profiler trace taken at the time shows that an additional index is required on table custinfo
The techniques presented are generic, i.e. they are applicable to any technology. Although some tools and technologies are cited as an illustration of a technique, RPR teaches nothing about specific tools or technologies. One of the Supporting Techniques is Whiteboard Analysis, and we'll take a brief look at this by way of example.
Whiteboard Analysis
We use whiteboard analysis to pull together the strands of a brainstorming session, the first of which is the Initiation Workshop. The technique is quite simple. Write five headings on a whiteboard or flipchart:
- Symptoms
- Boundaries
- Other Observations
- Possible Causes
- Action Plan
Under Symptoms write an accurate description of one or more symptoms to a level that would enable us to attempt to recreate the problem. Prioritise the symptoms and choose the one we intend to tackle; remember, we can only tackle one.
Under Boundaries note when, where and under what circumstances the problem occurs. These boundaries are used to determine when, where and how to conduct the investigation. They are never used to reach a conclusion regarding the cause of a problem. The boundaries simply guide you to the best time and place to do the diagnosis. For example, if the problem has only ever been noted in the Munich office at 3 pm on a Friday, there is little point trying to collect diagnostics in London on Monday morning. That doesn't mean that the problem is linked to the Munich office or the time of day. For all we know the problem might occur in London, but the users just don't report it, or their pattern of application use is different.
Other Observations are things that may or may not be related to the problem, and are worth noting for later consideration but must not figure in our diagnostic efforts. Put simply, they must be ignored at this stage.
From the Symptoms and Boundaries you can identify the components of the end-to-end system. Each of these is a Possible Cause. Don't be too granular in determining the possible causes; choose large chunks of infrastructure. The list might be:
- User PC
- Munich LAN
- WAN
- Data centre LAN
- Application Server
- Database Server
Now devise an Action Plan to prove that a particular Possible Cause is, or is not, the root cause of the problem. If it's possible to maintain good control of the diagnostic data capture, diagnosis can be made quicker by gathering data from many places along the end-to-end path. For the first pass make sure that the plan includes capture of data adjacent to the user's PC, since it's here that the user experience can be most easily correlated with diagnostic events.
Appendix B - Whiteboard Exercise Briefing Sheet on page 11 gives a sample scenario for practice.
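For teams that prefer to keep the whiteboard in a file, the five headings map naturally onto a small record. The Python sketch below is only one possible layout: the class, its method names and the symptom wording are hypothetical, while the boundary and the list of possible causes follow the Munich example above.

```python
from dataclasses import dataclass, field


@dataclass
class Whiteboard:
    """The five Whiteboard Analysis headings, plus the one symptom chosen."""
    symptoms: list = field(default_factory=list)
    boundaries: list = field(default_factory=list)
    other_observations: list = field(default_factory=list)
    possible_causes: list = field(default_factory=list)
    action_plan: list = field(default_factory=list)
    chosen_symptom: str = ""

    def choose_symptom(self, index=0):
        """Pick exactly one symptom; symptoms are assumed to be in priority order."""
        self.chosen_symptom = self.symptoms[index]
        return self.chosen_symptom

    def plan_first_pass(self):
        """Draft one capture action per possible cause, nearest the user first."""
        self.action_plan = [f"Capture diagnostics at: {c}" for c in self.possible_causes]
        return self.action_plan


board = Whiteboard(
    symptoms=["Application response takes over 10 seconds"],  # hypothetical wording
    boundaries=["Only noted in the Munich office at 3 pm on a Friday"],
    possible_causes=[
        "User PC", "Munich LAN", "WAN",
        "Data centre LAN", "Application Server", "Database Server",
    ],
)
board.choose_symptom()
for action in board.plan_first_pass():
    print(action)
```

Keeping Other Observations as a separate, initially empty list reflects the rule that they are recorded but play no part in the first Action Plan.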
- IT Management must allow the Problem Analyst dedicated time for most of the duration of the problem analysis, certainly during the data collection phase
- The Problem Analyst must have a good knowledge of diagnostic tools, when to use them and their limitations
- The Problem Analyst must have strong analytical skills; I guess this goes with the territory
Initially selling the approach as the most effective method of resolving Phase 4 problems can be tough, but, on the positive side, it becomes easier to promote the method after the first success.
Skills Needed
A broad knowledge of IT is needed to make best use of RPR. The practitioner must have a good basic knowledge of the technologies and concepts of an end-to-end system. We live in a networked world and modern systems have many components networked together. The networking slant is further influenced by the choice of tools. The natural tool of choice for analysis of end-to-end problems is the network analyser since:
- Its use is non-disruptive
- There is no requirement to install additional software on any components of the system
- It can be used to narrow a problem to a component, and hence an owner
This means that a good knowledge of networking and protocols is a significant advantage. Once we have narrowed the problem to, say, a server, we may then need to use further tools to drill into that component, and this demands further skills, largely based around knowledge of software and operating systems.
For career enhancement, becoming an RPR practitioner is ideal for senior IT support staff who don't want to move into management, but need a new challenge. A very bright junior would also be a valuable addition to a problem management team, but it's important to note that there is often a lot of politics, many business considerations and significant people management in the resolution of a Phase 4 problem, and so senior support is likely to be needed. The ideal RPR person would be a senior application developer who has a good knowledge of networking, or vice versa.
Those companies that have embraced ITIL may have dedicated Problem Managers and Problem Analysts. RPR is ideal training for both, since the RPR Process provides a framework for the Problem Manager, and the Supporting Techniques help develop effective skills in the Problem Analyst.
Problem analysis skills can be taught (as per RPR) and require development through experience. It's important to recognise that effective problem analysis requires its own set of skills. These skills are often built upon operational (BAU) experience, but the skills are different. For a Problem Analyst to fix problems quickly with RPR, he or she needs to be exposed to Phase 4 problems almost continuously. It may be difficult to keep a Problem Analyst sufficiently occupied in an organisation of fewer than, say, 8,000 users.
Closing Comments
I hope this short introduction has given some idea of the power of RPR. There is only so much that we can cover in a whitepaper. However, you can benefit today from this whitepaper:
- Using it as a guide, you could set an escalation point to recognise that a problem has entered Phase 4 and requires a different approach; hearing or using the phrase "We're just going to try one more thing" is a good starting point
- Focusing on one symptom of a Phase 4 problem will simplify diagnosis and speed up resolution
- Avoiding pattern-based methods will increase the likelihood of success
- Collecting definitive diagnostics will identify the root cause, which will save time and money
- Using the Whiteboard Analysis will help you plan the diagnosis
Build your knowledge base Rapid Problem Resolution (RPR) Paul Offord 4th September 2007
This was a textbook RPR scenario and demonstrates the collaborative nature of the method. I hope it also shows that you can use the method even if you are not an expert in the particular technical subject.
Staff at Giant Finance use a third-party application called TopFund hosted by Acme Corp. Access is via a Virtual Private Network (VPN) established across the Internet from Giant's Southampton site to Acme's Derby Data Centre. Giant Finance staff at Cambridge say that the system is so slow that it is unusable. Here's the information as it is told to you:
- Only Giant Finance users in Cambridge suffer the problem, so it must be a Giant Finance problem
- Cambridge users don't have problems with any other systems and so it must be an Acme problem
- No other Acme users get the problem and so it must be a problem with Giant's VPN firewalls
- Giant users access other 3rd party systems via the same VPN firewalls and they don't have problems
- Users of the Citrix-based TopFund system suffer type-ahead delays
- The problem only happens when more than one Cambridge user accesses TopFund, so it must be a Citrix server issue
- Lots of users access the same TopFund servers and don't experience problems
- All Giant users experience slow web access but the Citrix traffic has been prioritised above web traffic
Write five headings on a whiteboard or flipchart:
- Symptom: comes from the rule "You can go no further until you understand the problem at some level". It must be accurate. It must be specific
- Boundaries: used to help you decide where and when to investigate the problem
- Other Observations: things that may or may not be related to the problem
- Possible Causes: at a high level
- Action Plan: a plan to prove or disprove the first of the possible causes
Users of terminal-type services that use character echo protocols such as Citrix ICA, Windows Terminal Services and TELNET sometimes complain that they type characters but nothing appears for a few seconds, then all the characters appear at once. We call this effect type-ahead delay.
If this were a general complaint of slow performance, we might make "Slow web browsing" a symptom and then determine the priority of the two symptoms. However, in this case the issue at hand was Citrix type-ahead problems, and so the slow web browsing is relegated to an observation.