Improving Software Estimation Using Monte Carlo Simulation
Using statistical simulation to provide a software delivery date and the probability of meeting a given deadline

Troy Magennis, July 2008

Introduction

Software estimations are commonly viewed by well informed executives with the same skepticism as the automotive history provided by a used-car salesperson in New Orleans after the devastation of Hurricane Katrina. It's not unreasonable for executive level management to desire some indication of when a software project may be delivered. However, they are often furnished with massive ranges: wild estimates presented as fact, with no indication that the date was likely calculated from the sell-by date on the milk container in the engineering team's kitchen.

What is the most likely delivery date for a current project? What is the probability of hitting this go-live date? These are fair questions for any executive or investor to ask, but they are often answered by a sweating and nervous looking development manager who presents a date with no mention of probability. This date holds the key to making better executive decisions, and risk is a normal aspect of any decision process. Padding estimates to cover all risks will give a date too far out to be acceptable; the counter case of aggressive estimates might get the project funded, but makes actual delivery of a quality product unlikely. What is the likelihood of successful delivery by a given date?

This article borrows a technique described as Applied Information Economics (AIE) to quickly build an estimation framework and a statistically driven simulation mechanism that reflects the normal distribution of estimate risk ranges, presenting the probability of completion in a certain time, e.g. an 80% chance of success in 66 days, but only a 34% chance in 60 days. It is also a simple charting step to present the same simulation data as a histogram showing the level of confidence that a given date is achievable.
Figure 1 - Our desired outputs: a probability calculator and a probability histogram of a specific completion date. [Charts: target probability calculator; Histogram of Days to Completion (n = 1,000 iterations), Frequency vs. Bin (Days)]

The technique described exhibits the following characteristics:

- Allows an initial date estimate to be achieved quickly
- Provides a consistent probability of hitting a date
- Allows a date estimate to be progressively firmed up as more information comes to hand
- Works for an iteration of a project and/or the entire project schedule (e.g. a 92% chance of hitting an iteration with feature set x, but only a 63% chance with feature set y)
- Provides a mathematically proven mechanism for presenting the risk profile of a project (e.g. given a target of 65 days, there is a 73% chance of hitting that mark)
- Avoids the common issue of massive over-estimation in an attempt to highlight risk (this often gets projects cancelled before they begin)
This technique was generically described by Douglas Hubbard in his book How to Measure Anything: Finding the Value of "Intangibles" in Business (Hubbard, 2007), and this article proposes one implementation that migrates those principles into the software estimation problem domain. Hubbard's approach borrows and explains techniques commonly used in many decision fields. Wikipedia is a good starting source for understanding the basic principles of Applied Information Economics, its history and adoption: http://en.wikipedia.org/wiki/Applied_information_economics
The simulation method used, the Monte Carlo method, is very commonly employed in other fields to solve probability problems, such as in electronic component simulation software, with which the author has some experience. Monte Carlo methods are used in calculations where it is not possible to deterministically solve for an exact result. Instead, repeated calculations using bounded random inputs are used in an attempt to discern the probability histogram among viable results.

Software Estimation Basics

In the aftermath of many failed software projects, the technique of simply asking the development manager for an accurate delivery estimate between lunch and 2pm, whilst they eat pizza, has been reconsidered in many organizations. Mentally solving the complex risk relationships for even a small software project is unlikely at best.

Software estimation practices vary, but often share the common principle of breaking down the larger project into smaller chunks. Smaller tasks are thought of as easier to accurately estimate, following the logic that you are more likely to accurately estimate the time to walk across the room than the time to walk across the country (excluding narrow countries like New Zealand, Chile, Italy, etc.). The act of breaking down the project into smaller pieces also clarifies specification intention and forces some basic design decisions to be resolved up-front, reducing the risk that these will be uncovered later.

Breaking down the tasks for even a small project can take considerable time. The conflict between getting a rough estimate in reasonable time, while breaking down enough detail to reduce the risk of missing a considerable feature, is hard to solve. Nobody likes giving an inaccurate estimate, but estimation time grows exponentially in order to narrow in on more accuracy, and this might take more time than is available. Some estimates will have to be accepted as having a higher uncertainty than others. This is often portrayed as a low, medium and high scale, which is used to multiply the estimate by an increasing factor. One example: low risk estimates are multiplied by 1, medium risk by 1.5, and high risk estimates are doubled (these are common values, but some people use higher multipliers; I'm trying to keep the numbers real).
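As a concrete illustration of this weighting arithmetic, the following is a minimal Python sketch using the same feature estimates as the worked example in Table 1 below. The structure and names are illustrative, not taken from any accompanying spreadsheet.

# A minimal sketch of the fixed-multiplier weighting technique.
# Feature estimates and risk levels mirror the worked example in Table 1.
RISK_WEIGHTS = {"low": 1.0, "medium": 1.5, "high": 2.0}

features = [
    ("A", 10, "low"), ("B", 5, "medium"), ("C", 7, "high"),
    ("D", 3, "low"), ("E", 2, "medium"), ("F", 8, "high"),
    ("G", 5, "low"), ("H", 6, "medium"), ("I", 1, "high"),
    ("J", 4, "low"),
]

raw_total = sum(estimate for _, estimate, _ in features)
weighted_total = sum(estimate * RISK_WEIGHTS[risk] for _, estimate, risk in features)

print(f"Raw estimate: {raw_total}, weighted estimate: {weighted_total}")
# Raw estimate: 51, weighted estimate: 73.5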
Feature Name | Raw Estimate | Estimate Risk | Weighted Estimate
A | 10 | Low | 10
B | 5 | Medium | 7.5
C | 7 | High | 14
D | 3 | Low | 3
E | 2 | Medium | 3
F | 8 | High | 16
G | 5 | Low | 5
H | 6 | Medium | 9
I | 1 | High | 2
J | 4 | Low | 4
Total: | 51 | | 73.5
Risk Weight Key: Low = 1, Medium = 1.5, High = 2

Table 1 - Example of a common technique for arriving at a quick project estimate. Estimates are allocated a low, medium or high risk, and these risks are used to weight the estimates in order to give a low and a high date.

Table 1 would indicate an estimate between 51 units and 73.5 units (let's say days for this experiment). Is an over 40% increase acceptable? If you were being charged $1,000 per day, that would be a dollar range of $51,000 to $73,500. If $60,000 was the budgeted limit, you just lost your project.

Problems with Range Estimates

One issue with weighted risk estimation techniques is that the weighted estimate is often interpreted at the extremes. Well-intentioned attempts to cater for inaccurate estimates mean some people over-estimate, making the prediction too long. Other estimators are more susceptible to under-estimating, creating unrealistic expectations. What are the issues with a simple weighted range?

- Not all values within the 51 to 74 day range have an equal chance
- There is no measure of probability for whether any given date is achievable
- One person hears 51 days, the other 74 days. Both parties will start the project with different expectations, a sure recipe for discontentment in the future

In real life, estimates won't all be at the extreme high end of the risk multiplier, nor will they all be at the low end. There is no way of predicting how many will be at which end of the risk multiplier scale; well, almost no way. Statistics offers some hope by introducing the concept of the Normal Distribution, often referred to as the bell curve.

Real World Simulation - Simple Monte Carlo Method

If you were to sit down and pick random estimates for each task within a given range, and continue doing this and tabulating the results for thousands of imaginary cycles, you would achieve a
statistical average following the Normal Distribution pattern. This is exactly what Monte Carlo simulation does. To carry out this simulation we need the following inputs:

1. A sum for each risk level: the sum of all low estimates, medium estimates and high estimates
2. A lower and upper bound for each risk level: low, medium and high
3. The estimated mean (average) for each risk level: low, medium and high

From these inputs it is possible, using random functions, to pick a viable value for the multiplier for each risk segment. Monte Carlo methods repeat and tabulate what the total estimate might look like for these random scenario iterations.

Step 1 - Estimates and Risk Allocation

The simulation needs an estimate to work with. Each feature needs an allocated estimate and a well considered low, medium or high risk rating. It goes without saying that the better these estimates, the better the calculated data. However, the intention of this estimation technique is to get a result in a reasonable time. Know when you need to mark an estimate high risk and move on. One technique the author uses is to allocate a medium to every estimate, then work through the list and allocate a low when certain there are no misunderstood or new concepts, and a high when there are many unknowns. Depending on whether you are after a rough-cut estimate or a fairly accurate solution, make a decision as to how much research is undertaken for each feature, prioritizing in an attempt to decrease the high risks to medium or low. After moving through the list, a table similar to that shown in Table 2 should be produced (yours will be bigger, and the feature names hopefully more descriptive than A, B, C, etc.).

Feature Name | Low Risk | Med Risk | High Risk
A | 10 | |
B | | 5 |
C | | | 7
D | 3 | |
E | | 2 |
F | | | 8
G | 5 | |
H | | 6 |
I | | | 1
J | 4 | |
Total: | 22 | 13 | 16

Table 2 - Feature estimates partitioned by risk level.

(Hubbard, 2007) advises that estimators be calibrated. His work demonstrates a simple technique of asking calibration questions of individuals and determining, when someone says they are 90% sure, how often they actually are. Some people will estimate their own
Troy Magennis July 2008 Page 5 accuracy low, some people will estimate high. Profiling your estimation team will bring immediate rewards in accuracy and help refine the initial upper and lower bounds for risk multipliers. Calibrating estimators aside, nothing improves estimate accuracy more than applying the lessons learnt by comparing real values versus estimates over time and determining your specific projects trend. Step 2 Create Initial Risk Bounds At the start of any project, the lower and upper boundary for each risk level multiplier is subjective. As a project progresses and actual values can be compared to estimates, a pattern that profiles the development team, and also profiles the estimation teams bias can be used to refined weights. For initial rough-cut estimates, the values shown in Table 3 are the starting point used for this article. In addition to the upper and lower bounds, the average (or mean) value for each risk level needs to be determined. The mean value does not have to be centered in the bounds. In fact, for the medium and high risk levels, Table 3 shows the offset the average towards the higher bound. The logic is that low risk estimates might go either way (be under or over by a small amount), but as the risk grows, estimates are more likely to be low, and the actual features taking longer due to un-identified unknowns. Low Medium High Upper Bound 110% 150% 200% Mean 100% 125% 150% Lower Bound 90% 85% 80% Table 3 - Initial risk boundaries and mean values
Figure 2 - The high risk estimates will be multiplied by random samples spread statistically over the Normal Distribution function (bell curve) between 80% and 200%. The mean is slightly offset towards the higher end, ensuring more values will be around 150%. [Chart: bell curve with markers at x 0.8, x 1.5 and x 2.0]

Step 3 - Build the Random Function

Our random selection of a multiplier for an estimate has to follow the normal cumulative distribution profile, falling between the lower and upper bound and centered on the mean. Our formula determines random values within the bounds, falling off towards the bounds at a rate consistent with the traditional bell curve function.
The examples in this article use built-in functions in Microsoft Excel, although any spreadsheet will perform similar calculations. Excel's NORMINV function, according to the Excel 2007 documentation (Microsoft Corp, 2007), "Returns the inverse of the normal cumulative distribution for the specified mean and standard deviation." We will use this function and the built-in random number function RAND to compute a likely risk multiplier for each of our risk levels. The general form is NORMINV(Probability, Mean, Standard Deviation):

=NORMINV(RAND(), LowMean, (LowUpperBound - LowLowerBound) / 3.29)
=NORMINV(RAND(), MediumMean, (MediumUpperBound - MediumLowerBound) / 3.29)
=NORMINV(RAND(), HighMean, (HighUpperBound - HighLowerBound) / 3.29)

Figure 3 - Random functions used to determine the multiplier value for each iteration scenario

The Standard Deviation is used to determine the fall-off rate either side of the mean in our ranges. The only complexity in the Standard Deviation calculation is the division by 3.29. A normal distribution keeps roughly 99.7% of its values within 3 standard deviations of the mean, while a 90% confidence interval spans 3.29 standard deviations (1.645 either side of the mean). Dividing the range by 3.29 therefore treats the estimator's bounds as a 90% confidence interval, so roughly one sample in ten will fall outside the bounds. More experimentation is required to determine if this has any real impact on the outcomes.

Step 4 - Perform Multiple Iterations and Compute Totals

Monte Carlo analysis now dictates computing an outcome hundreds or thousands of times. This allows the computation of a likely distribution, and of the likelihood that a certain date is possible (what percentage of the iterations fall below a given date).

Iteration | Low Mult | Med Mult | High Mult | Low Weighted | Med Weighted | High Weighted | Total
1 | 0.93 | 1.09 | 1.89 | 20.54 (22 x 0.93) | 14.17 (13 x 1.09) | 30.31 (16 x 1.89) | 65.02
2 | 0.97 | 1.14 | 1.36 | 21.45 | 14.82 | 21.80 | 58.07
3 | 1.05 | 1.37 | 1.74 | 23.12 | 17.81 | 27.82 | 68.75
4 | 0.97 | 1.31 | 2.11 | 21.37 | 17.03 | 33.75 | 72.15
5 | 1.00 | 1.29 | 1.10 | 21.98 | 16.77 | 17.67 | 56.42

Table 4 - The first 5 iterations of ideally hundreds or thousands. Each multiplier is sampled as described in Figure 2, and each weighted column multiplies that risk level's estimate total from Table 2 (22, 13 and 16) by the sampled multiplier. How many iterations fall within your desired range? This is the key to calculating probability rather than a simple average.

Step 5 - Calculate Probabilities

From the set of iteration scenarios, the Monte Carlo analysis returns a set of total estimates. The probability calculation is a matter of counting those below a given threshold and presenting this as a percentage:

% probability of success = (count of iterations < target / number of iterations) * 100
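The same five steps can be expressed outside a spreadsheet. The sketch below is a minimal Python equivalent, assuming the Table 2 risk totals and Table 3 bounds; random.gauss plays the role of NORMINV(RAND(), mean, sd), since both draw from a normal distribution with the given mean and standard deviation. Names such as simulate and risk_totals are illustrative.

import random

# Risk level estimate totals from Table 2, and (lower, mean, upper)
# multiplier bounds repeating Table 3.
risk_totals = {"low": 22, "medium": 13, "high": 16}
risk_bounds = {
    "low":    (0.90, 1.00, 1.10),
    "medium": (0.85, 1.25, 1.50),
    "high":   (0.80, 1.50, 2.00),
}

def sample_multiplier(lower, mean, upper):
    # Equivalent of =NORMINV(RAND(), Mean, (Upper - Lower) / 3.29):
    # a normal draw whose standard deviation treats the bounds as a
    # 90% confidence interval.
    return random.gauss(mean, (upper - lower) / 3.29)

def simulate(iterations=1000):
    # Steps 3 and 4: sample a multiplier per risk level, weight each
    # risk level's estimate total, and record the grand total.
    totals = []
    for _ in range(iterations):
        totals.append(sum(
            total * sample_multiplier(*risk_bounds[level])
            for level, total in risk_totals.items()
        ))
    return totals

totals = simulate()

# Step 5: probability of success = share of iteration totals under the target.
target_days = 65
chance = sum(1 for t in totals if t < target_days) / len(totals)
print(f"{chance:.0%} chance of completing within {target_days} days")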
Figure 4 - Sample target probability calculator. This allows the spreadsheet user to see what percentage of simulation iterations fell under a desired target number of days.

Excel has a neat function for counting rows that match a certain criterion, and the Chance of Success columns in Figure 4 use the following basic formula:

=COUNTIF([simulation total cell range], "<" & [target days]) / [iterations]

In addition to this single value, the entire result set can be displayed in a histogram, as demonstrated in Figure 5. This graph shows the full spread of scenario total estimates. It is also useful to understand which risk level contributes what to the total, and a histogram for each risk level total is often helpful to determine whether more time spent analysing high or medium risks will improve the probability estimate by narrowing the range.
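Continuing the Python sketch from Step 5 (and reusing its totals list), the COUNTIF formula and the Figure 5 histogram reduce to a few lines. The target list and bin width here are illustrative choices, not values from the article's spreadsheet.

from collections import Counter

# Figure 4 equivalent of the COUNTIF formula: chance of success for a
# range of candidate target dates.
for target in (55, 60, 65, 70, 75):
    chance = sum(1 for t in totals if t < target) / len(totals)
    print(f"{target} days: {chance:.0%} chance of success")

# Figure 5 equivalent: bucket the totals into fixed-width bins.
bin_width = 2  # days per histogram bin (an arbitrary choice)
histogram = Counter(int(t // bin_width) * bin_width for t in totals)
for bin_start in sorted(histogram):
    print(f"{bin_start}-{bin_start + bin_width} days: {histogram[bin_start]}")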
Figure 5 - Histogram of 1,000 iterations. This shows the most common scenarios are centred between 58 and 63 days. [Chart: Total Estimate Histogram, Frequency vs. Bin (Days)]

Building Confidence and Refinement

Once the project is underway, it is possible to measure how closely estimates and actuals are aligning. This insight allows the estimated completion date to be refined by:

1. Narrowing the range of the risk multipliers by understanding the estimation team's bias
2. High risk tasks being understood and moved to medium, and medium tasks to low

The most important contributor to accuracy, though, is the initial estimates. If the estimates are incorrect, then there is no mathematical magic that will bring accuracy to the endeavor. The process of deriving these estimates should, in the absence of actually understanding every unknown,
be, at the very least, consistent. Define the rules as to what qualifies a feature as high, medium or low risk early, and keep to that recipe.

Consider running a higher number of iteration scenarios. The higher the number of iterations, the less statistical outliers will skew the values. Even the long-shot pays dividends occasionally, and the random function and distribution pattern we are using allow these to be represented in the data. The histogram shown in Figure 5 clearly demonstrates that we don't yet have a perfect Normal Distribution pattern (61 is the highest bin, but 62 is lower than 63). More repetitions would bring this outcome into conformity. However, your actual project will have an iteration count of 1! We are looking for probabilities; there are no absolutes.

The issue of testing and other dark matter (tasks that need to be done, but aren't captured in the estimates) also needs to be represented. This can be achieved in one of the following ways:

1. Add an estimate to each iteration for dark matter
2. Move the mean value for the risk multipliers further towards the upper bound
3. Add a separate random function for each iteration scenario to create a random dark matter estimate for each simulation. If you choose to do this, increase the permutations and keep the range low to avoid it over-influencing the outcome. If dark matter is that much of a problem, some important tasks must be being overlooked in the estimates.

Summary

Applied Information Economics (AIE), which this technique adapts to the software estimation process, is a well respected method for bringing statistical and mathematical rigor to business decision problems. It allows probability to be part of the decision process, alongside tables of numbers and dates. Monte Carlo simulation brings normal probability distribution patterns to the intangible risks in order to find the most likely outcomes from thousands of possibilities. Averaging the risk values would give a result in the likely range; however, the simulation technique derives a probability measure. It is this probability measure that offers so much value in the decision making discussion and negotiation process.

The ability to quickly estimate a project, an iteration, or the probability of hitting a specified date should be a key competency practiced by anyone who seriously wants to improve the software planning and development process. Monte Carlo simulation and other Applied Information Economics principles should be another weapon in your arsenal.

About the author: Troy Magennis is a software developer and Enterprise Software Architect located in Seattle, WA. He has worked on numerous enterprise scale software projects over the last 14 years, and is a Microsoft MVP for Visual C#. He welcomes feedback by email at troy@aspiring-technology.com.

Sample Spreadsheet and More Information

A sample spreadsheet is available to support this article. Download it from the following location: http://blog.aspiring-technology.com/page/Technical-Articles.aspx
Updates to this article will be posted at the same location. The author intends to blog about updates and new ideas enhancing this technique. You can follow these on the blog:

Blog: http://blog.aspiring-technology.com
RSS feed: http://feeds.feedburner.com/LinqedIn

The author welcomes feedback (good and bad), and can be reached at the following email address: troy@aspiring-technology.com

Bibliography

Hubbard, D. (2007). How to Measure Anything: Finding the Value of "Intangibles" in Business. Wiley.
Microsoft Corp. (2007). Excel 2007 Help File.
Wikipedia. (n.d.). Applied Information Economics. Retrieved July 14, 2008, from http://en.wikipedia.org/wiki/Applied_information_economics