David S. Matteson
School of Operations Research and Information Engineering
Rhodes Hall, Cornell University
Ithaca, NY 14853, USA
dm484@cornell.edu
January 20, 2009
1. The Greatest
Basis for the scientific method, management science, intelligent empiricism.
Tidbits:
Fact-based decision making.
Scientific decision making.
In God we trust; everyone else bring data.
Conclusion: Hunches can be wrong, intuition can be fooled, instincts may be immature or just wrong.
Hydrology: 10,000-year flood. Build the dam so high that water will exceed the height once per 10,000 years. (Small quantile estimation to the cognoscenti.)
Finance: Value at risk: the bank's risk reserve needs to be high enough to cover a loss so big that it occurs with probability 1/10,000.
Environment: A city is ruled out of compliance with EPA regulations if the pollutant concentration exceeds a specified level on more than 5% of the days in a year; that is, the standard is set so that the probability of exceeding it is 5%.
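All three examples above come down to estimating a quantile or an exceedance probability from data. The following is a minimal Python sketch of the environmental case, assuming synthetic daily concentrations. The data, the lognormal model, and the function names `empirical_quantile` and `exceedance_fraction` are illustrative assumptions, not part of the lecture.

```python
import math
import random


def empirical_quantile(values, p):
    """Return the smallest observed value with at least a fraction p of the data at or below it."""
    ordered = sorted(values)
    # Position of the order statistic that covers a fraction p of the sample.
    k = max(0, math.ceil(p * len(ordered)) - 1)
    return ordered[k]


def exceedance_fraction(values, limit):
    """Return the fraction of observations strictly above the limit."""
    return sum(v > limit for v in values) / len(values)


# Synthetic data (assumption): one year of daily pollutant concentrations.
random.seed(0)
daily = [random.lognormvariate(0.0, 0.5) for _ in range(365)]

# A standard set at the empirical 95% quantile is exceeded on at most 5% of days.
q95 = empirical_quantile(daily, 0.95)
frac = exceedance_fraction(daily, q95)
print(frac <= 0.05)
```

The 10,000-year flood and value-at-risk examples ask for far more extreme quantiles (probability 1/10,000), where a plain empirical quantile like this one is unreliable unless the sample is very large.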
Weather prediction: Chance of rain tomorrow is 53%. What does this mean?
"Statistics are like bikinis. What they reveal is suggestive, but what they conceal is vital." (Aaron Levenstein)
"It is easy to lie with statistics, but it is easier to lie without them." (Frederick Mosteller)
Subject perceived as dull. (Why?)
Tell someone at a social gathering that you are studying statistics and watch their reaction. (Is the reaction the same if you say Engineering?)
A statistician is someone who is good with numbers, but lacks the personality to be an accountant. (Or is it the other way round?)
Contents: 1. The Greatest; 2. Scope; 3. More Details; 4. EDA; 5. Describing sample data
Statisticians sometimes perceived as hired guns. (My statistician can beat up your statistician.) Statisticians are often used as expert witnesses in court. Government witnesses in regulatory hearings.
Lowest blow. Best-selling book: How to Lie with Statistics.
Figure 2: Blowup
2. Scope
Caution: All models are wrong; some models are useful. George Box
Statistics: The science of organizing and summarizing data and using information in the data to draw conclusions; scientific extraction of information from data.
Statisticians are consumers of probability.
Statisticians fit parameters in probability models.
Statisticians draw conclusions about a population based on the partial information contained in a sample. The manner in which the conclusions are drawn uses probability tools and reasoning. (The sampling error in the estimate of p is 5%.)
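The parenthetical statement about a 5% sampling error can be made concrete. The slides do not define "sampling error", so the Python sketch below assumes the common reading: the 95% margin of error of a sample proportion under the normal approximation. The function name `margin_of_error` and the sample size of 400 are illustrative assumptions.

```python
import math


def margin_of_error(p_hat, n, z=1.96):
    """Approximate 95% margin of error for a sample proportion (normal approximation)."""
    return z * math.sqrt(p_hat * (1.0 - p_hat) / n)


# Assumed example: an estimated proportion of 0.5 from a sample of 400 items.
moe = margin_of_error(0.5, 400)
print(round(moe, 3))  # 0.049, i.e. roughly 5%
```

Under this reading, a 5% sampling error corresponds to a sample of about 400 when the estimated proportion is near one half.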
3. More Details
Population: a well-defined set of items under attention and discussion. (It is clear how to decide if an item is or is not in the population.)
Sample: a well-defined subset of the population that has been selected for study or measurement.
Observation:
Examples:
Population: all US universities. Sample: Ivy League universities.
Population: all voters in the US. Sample: all voters with incomes over $200,000.
Population: yearly best times in seconds (minus 3 minutes) in the mile run over 121 years (ending in the mid-80s). Sample: yearly best times which were record times during this time period.
[Figure: plot of the yearly best mile times; vertical axis "mile" (ticks from 50 to 100), horizontal axis "Time" (ticks from 20 to 120).]
3. Abstracted population represented by an infinite set, say an interval of real numbers (or worse).
Population representing time to failure of a machine component; the population could be represented by the set of all nonnegative real numbers [0, ∞).
Population representing the study of the point of maximum pollution concentration in a city; the population could be represented by a two-dimensional set {(x, y) : 0 ≤ x ≤ 1, 0 ≤ y ≤ 1}.
Types of samples:
A simple random sample of size n: a sample of size n in which each subset of size n in the population has the same likelihood of being selected.
Problem with the definition: "likelihood" is undefined. For populations from continuous infinite sets, the likelihood may be zero; e.g., the probability of drawing a sample of size 2 from [0, 1] and getting {1/4, 3/4} is 0.
Stratified random sample: The population consists of sub-populations, and a random sample is drawn from each. E.g.: Take a sample of size 700 from the Cornell student population by sampling 100 from each college. (Useful to predict the outcome of a union vote?)
Biased sample: The samples are clearly dependent and unrepresentative. Population: residents of Florida. Goal: determine the average income of residents of Florida. Biased sample: NBA players living in Florida.
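The difference between a simple random sample and a stratified sample can be sketched in Python. This is an illustration under stated assumptions: the population is made up (seven hypothetical colleges of 2,000 students each, matching only the "100 from each college, 700 in total" arithmetic of the example), and the college labels are placeholders.

```python
import random
from collections import Counter

random.seed(1)

# Hypothetical population (assumption): 7 colleges, 2000 students in each.
colleges = [f"college_{i}" for i in range(1, 8)]
population = [(college, student) for college in colleges for student in range(2000)]

# Simple random sample: every subset of size 700 is equally likely.
srs = random.sample(population, 700)

# Stratified random sample: draw 100 students at random within each college.
stratified = []
for college in colleges:
    stratum = [unit for unit in population if unit[0] == college]
    stratified.extend(random.sample(stratum, 100))

# College counts vary by chance in the simple random sample
# but are fixed at 100 in the stratified sample.
srs_counts = Counter(college for college, _ in srs)
stratified_counts = Counter(college for college, _ in stratified)
print(sorted(srs_counts.values()))
print(sorted(stratified_counts.values()))
```

`random.sample` draws without replacement, which matches the finite-population definition above.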
More later.
4. EDA
Exceptions:
Makes of cars sold in the US: {Buick Le Sabre, Volkswagen Passat, VW Golf, Honda Accord, Honda Civic, . . . }.
Marketing: brands of toothpaste: {Colgate, Crest, Tom's, . . . }.
Conclusion:
Result of sampling often yields a set of numbers. Sometimes the set is large. We need to make sense of the set of numbers.
Example:
Trace is packet counts per 100 milliseconds (1/10 second) for Financial Company X's wide area network link, including USA-UK traffic. Length of dataset = 288,009; 8 hours of collection from 9am to 5pm. Top plot too muddy; bottom represents subsets sized 20,000.
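Eight hours at ten counts per second gives 8 × 3600 × 10 = 288,000 values, consistent with the stated length. One way to obtain subsets of 20,000 for plotting is to cut the trace into consecutive blocks. The slides do not say how the subsets were formed, so the Python sketch below is an assumption, and the all-zero trace is a placeholder for the real data.

```python
def split_into_blocks(series, block_size):
    """Cut a sequence into consecutive blocks; the last block may be shorter."""
    return [series[i:i + block_size] for i in range(0, len(series), block_size)]


# Placeholder trace (assumption): zeros standing in for the 288,009 packet counts.
trace = [0] * 288_009
blocks = split_into_blocks(trace, 20_000)

print(len(blocks))      # 15 blocks
print(len(blocks[-1]))  # 8009 values in the final, shorter block
```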
Conclusion:
First step:
Descriptive statistics: organization and summarization of large amounts of data for the purposes of drawing conclusions. Use
Graphics
Summary statistics (mean, median, variance, ...)
Use of descriptive statistics is often a first step in an exploratory data analysis. Somewhat informal, pictorial.
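The summary statistics named above can be computed with Python's standard `statistics` module. The sketch uses the fourteen exam scores from the stem-and-leaf illustration in these slides. Python is an assumption made for illustration; the lecture itself uses R and Minitab.

```python
import statistics

# Exam scores from the stem-and-leaf illustration in these slides.
grades = [48, 63, 67, 69, 70, 73, 76, 79, 79, 80, 80, 83, 88, 95]

mean = statistics.mean(grades)          # arithmetic average
median = statistics.median(grades)      # middle value (average of the two middle scores)
variance = statistics.variance(grades)  # sample variance, divisor n - 1

print(mean)    # 75
print(median)  # 77.5
print(round(variance, 2))
```

The median (77.5) is above the mean (75) because the single low score of 48 pulls the mean down.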
Formal inference:
Draw scientific inferences about the population from the data. More formal methods. For example, we can formally test hypotheses. (Sometimes the hypotheses are suggested by the EDA.)
Most famous clinical trial: Salk vaccine given to a sample of kids in the early 50s, with a control group receiving a placebo. The formal hypothesis was that the Salk vaccine was more effective than randomness.
5. Describing sample data
5.1. Stem-and-leaf plot
Older method, originated for small univariate data sets when analysis was often done by hand. This is just a clever arrangement of the data values to reflect the shape of the distribution.
Advantage: simple, quick, easy to construct.
Disadvantage: a little primitive; doesn't capitalize on the graphics capabilities of packages and computers.
Procedure:
1. Split each x_i into a stem of leading digits and a leaf of the remaining digits.
2. List the stem values in the left-hand margin column and, to the right, list the leaves corresponding to each stem, in the order they are encountered in the data set.
A simple illustration: Suppose student scores on an exam are 48, 63, 67, 69, 70, 73, 76, 79, 79, 80, 80, 83, 88, 95. A stem-and-leaf plot is below:
9 | 5
8 | 0038
7 | 03699
6 | 379
5 |
4 | 8
Positive features:
The entire data set can be read with ease from the display.
Gives a clear indication of the shape of the distribution of data values.
R does this automatically with the following commands:
> grades <- c(48, 63, 67, 69, 70, 73, 76, 79, 79, 80, 80, 83, 88, 95)
> stem(grades)
4 | 8
5 |
6 | 379
7 | 03699
8 | 0038
9 | 5
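The two-step procedure can also be written out by hand. The following Python sketch assumes nonnegative integer data where the last digit is the leaf and the leading digits are the stem. The function name `stem_and_leaf` is illustrative, and the values are sorted first so that the leaves appear in increasing order, which for the exam scores is also the order in which they are encountered.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return display lines, one per stem, for nonnegative integers (leaf = last digit)."""
    leaves = defaultdict(list)
    for v in sorted(values):
        # Step 1: split each value into a stem (leading digits) and a leaf (last digit).
        leaves[v // 10].append(v % 10)
    lines = []
    # Step 2: list every stem from smallest to largest, including empty ones.
    for stem in range(min(leaves), max(leaves) + 1):
        digits = "".join(str(d) for d in leaves.get(stem, []))
        lines.append(f"{stem} | {digits}".rstrip())
    return lines


grades = [48, 63, 67, 69, 70, 73, 76, 79, 79, 80, 80, 83, 88, 95]
display = stem_and_leaf(grades)
print("\n".join(display))
```

Empty stems such as 5 are kept so that the gap between 48 and 63 stays visible in the shape of the display.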
Minitab output (check the drop-down menu under GRAPH):
Stem-and-Leaf Display: C1
Stem-and-leaf of C1  N = 14
Leaf Unit = 1.0
 1   4  8
 1   5
 4   6  379
(5)  7  03699
 5   8  0038
 1   9  5
Minitab output: extra column on the left. Features:
The number in parentheses gives the number of observations on the line that contains the median (or the middle value).
The 4 in the row above that gives the total number of observations in the first three rows, i.e. there were 4 scores below 70.
The 1 above that indicates that there was one score in the 50s or below.
The 5 below the median line indicates that there are 5 scores in the 80s and above.
The 1 on the last line indicates one score of 90 or above.
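The extra column can be reproduced from the number of leaves on each row. The Python sketch below follows the description above: rows before the median row show a running count from the top, rows after it show a running count from the bottom, and the median row shows its own count in parentheses. It is an assumption-based illustration, not Minitab's documented algorithm; the function name `depth_column` is hypothetical, and how Minitab handles a median that falls between two rows is not covered by the slides.

```python
def depth_column(counts):
    """Return the left-hand column labels given the number of leaves on each row."""
    n = sum(counts)
    labels = []
    cumulative = 0
    for count in counts:
        below = cumulative      # observations on earlier rows
        cumulative += count
        above = n - cumulative  # observations on later rows
        if count > 0 and below < n / 2 and above < n / 2:
            # This row contains the median: show its own count in parentheses.
            labels.append(f"({count})")
        else:
            # Otherwise count inward from whichever end of the display is nearer.
            labels.append(str(min(cumulative, n - below)))
    return labels


# Leaves per row for stems 4 through 9 of the exam scores.
print(depth_column([1, 0, 3, 5, 4, 1]))  # ['1', '1', '4', '(5)', '5', '1']
```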
Note: The help file gives a detailed explanation. More extensive data sets, particularly those in which three or more digits vary, are dealt with in a variety of ways, but all are similar to the simple case shown above.