Jiheng Zhang
Describing Data Sets

Frequency Tables and Graphs

A data set having a relatively small number of distinct values can be conveniently presented in a frequency table. For instance, Table 2.1 is a frequency table for a data set consisting of the starting yearly salaries (to the nearest thousand dollars) of 42 recently graduated students with B.S. degrees in electrical engineering. Table 2.1 tells us, among other things, that the lowest starting salary of $47,000 was received by four of the graduates, whereas the highest salary of $60,000 was received by a single student. The most common starting salary was $52,000, and was received by 10 of the students.
TABLE 2.1 Starting Yearly Salaries (in thousands of dollars)

Starting Salary    Frequency
47                 4
48                 1
49                 3
50                 5
51                 8
52                 10
53                 0
54                 5
56                 2
57                 3
60                 1
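As a minimal sketch of how such a frequency table can be tallied, the snippet below expands Table 2.1 back into 42 individual observations and counts them again. The variable names are illustrative and not part of the original notes.

```python
from collections import Counter

# Table 2.1: starting salary (thousands of dollars) -> frequency.
table_2_1 = {47: 4, 48: 1, 49: 3, 50: 5, 51: 8, 52: 10,
             53: 0, 54: 5, 56: 2, 57: 3, 60: 1}

# Expand the table into the raw list of 42 observations.
salaries = [value for value, count in table_2_1.items() for _ in range(count)]

# Counting the raw observations recovers the frequency table.
frequency = Counter(salaries)

for value in sorted(table_2_1):
    # A Counter returns 0 for values that never occur (here, 53).
    print(f"{value:>3}  {frequency[value]:>2}")
```

The most common salary can then be read off with `frequency.most_common(1)`.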
Spring, 2010
Data from a frequency table can be graphically represented by a line graph that plots the distinct data values on the horizontal axis and indicates their frequencies by the heights of vertical lines. A line graph of the data presented in Table 2.1 is shown in Figure 2.1.

When the lines in a line graph are given added thickness, the graph is called a bar graph. Figure 2.2 presents a bar graph.

Another type of graph used to represent a frequency table is the frequency polygon, which plots the frequencies of the different data values on the vertical axis, and then connects the plotted points with straight lines. Figure 2.3 presents a frequency polygon for the data of Table 2.1.
FIGURE 2.1 Starting salary data (line graph; horizontal axis: starting salary, vertical axis: frequency).
FIGURE 2.2 Bar graph for starting salary data.
FIGURE 2.3 Frequency polygon for starting salary data.

2.2.2 Relative Frequency Tables and Graphs

Consider a data set consisting of n values. If f is the frequency of a particular value, then the ratio f/n is called its relative frequency. That is, the relative frequency of a data value is the proportion of the data that have that value. The relative frequencies can be represented graphically by a relative frequency line or bar graph or by a relative frequency polygon. Indeed, these relative frequency graphs will look like the corresponding graphs of the absolute frequencies except that the labels on the vertical axis are now the old labels (that gave the frequencies) divided by the total number of data points.
EXAMPLE 2.2a Table 2.2 is a relative frequency table for the data of Table 2.1. The relative frequencies are obtained by dividing the corresponding frequencies of Table 2.1 by 42, the size of the data set.
TABLE 2.2

Starting Salary    Relative Frequency
47                 4/42 = .0952
48                 1/42 = .0238
49                 3/42
50                 5/42
51                 8/42
52                 10/42
53                 0
54                 5/42
56                 2/42
57                 3/42
60                 1/42
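The division by the data set size can be sketched in a few lines. This is only an illustration using the frequencies of Table 2.1; the dictionary names are assumptions made for the example.

```python
# Frequencies from Table 2.1 (salary in thousands of dollars -> count).
frequencies = {47: 4, 48: 1, 49: 3, 50: 5, 51: 8, 52: 10,
               53: 0, 54: 5, 56: 2, 57: 3, 60: 1}

# n is the size of the data set: the sum of all frequencies.
n = sum(frequencies.values())

# Relative frequency of a value = its frequency f divided by n.
relative = {value: f / n for value, f in frequencies.items()}

for value, r in relative.items():
    print(f"{value:>3}  {r:.4f}")
```

Because every observation is counted exactly once, the relative frequencies always add up to 1.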
A pie chart is often used to indicate relative frequencies when the data are not numerical in nature. A circle is constructed and then sliced into different sectors, one for each distinct type of data value. The relative frequency of a data value is indicated by the area of its sector, this area being equal to the total area of the circle multiplied by the relative frequency of the data value.
EXAMPLE 2.2b The following data relate to the different types of cancers affecting the 200 most recent patients to enroll at a clinic specializing in cancer. These data are represented in the pie chart presented in Figure 2.4.

Type of Cancer    Relative Frequency
Lung              .21
Breast            .25
Colon             .16
Prostate          .275
Melanoma          .045
Bladder           .06

FIGURE 2.4 (pie chart of the data above)
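Since a sector's area is proportional to its central angle, the sector for each cancer type can be sized by multiplying 360 degrees by the relative frequency. The sketch below works this out for the data of Example 2.2b and also converts the relative frequencies back to patient counts; the names `angles` and `counts` are illustrative only.

```python
# Relative frequencies from Example 2.2b (200 patients in total).
relative = {"Lung": 0.21, "Breast": 0.25, "Colon": 0.16,
            "Prostate": 0.275, "Melanoma": 0.045, "Bladder": 0.06}
n_patients = 200

# Sector area is proportional to the central angle, so each sector
# gets 360 degrees times its relative frequency.
angles = {kind: 360 * r for kind, r in relative.items()}

# Patient counts implied by the relative frequencies (rounded to
# absorb floating-point noise).
counts = {kind: round(n_patients * r) for kind, r in relative.items()}

for kind in relative:
    print(f"{kind:<9} {counts[kind]:>3} patients  {angles[kind]:6.1f} degrees")
```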
Grouped Data, Histograms and Ogives

For some data sets the number of distinct values is too large. It is useful to divide the values into groupings, or class intervals, and then plot the number of data values falling in each class interval.
    too few classes: lose too much information about the actual data values in a class
    too many classes: the frequencies of each class being too small for a pattern to be discernible
TABLE 2.3 Life in Hours of 200 Incandescent Lamps (Item Lifetimes)

1,067    919  1,196    785    855  1,092  1,162  1,170  1,126    936
1,156  1,035  1,045    950    918    948    905    972    920    929
1,157  1,195  1,195  1,340  1,122    956    938    970  1,009  1,157
1,151  1,009  1,237    958  1,102  1,022    978    832    765    902
  923  1,333    811  1,217  1,085    896    958  1,311  1,037    702
  521    933    928  1,153  1,069  1,062  1,063  1,063  1,002    858
1,071  1,021  1,157    946    909  1,077    830    930    807    954
  999    932  1,035    944    940  1,250    833  1,320  1,049  1,078
1,122  1,115  1,011  1,102    901  1,324    818  1,203    890  1,303
  996    621    780    900  1,106    704    854  1,178  1,138    951
1,187  1,067  1,118  1,037    958    760  1,101    949    992    966
  980    935    878    934    910  1,058    730    980    824    653
1,037  1,026  1,039  1,023  1,134    998    610    844    814  1,151
1,147  1,083    984    932    996    916    990  1,035    863    883
  867    990  1,040    856    938  1,133  1,001  1,289    924  1,078
  765    895    699    801  1,180    775    709  1,103  1,000    788
1,083    880  1,029    658    912  1,122  1,292  1,116    954    880
1,173    824    529  1,106  1,184  1,105  1,081  1,171    705  1,425
  860  1,110  1,149    972  1,002  1,143  1,112  1,258    935    931
1,192  1,069    970    922  1,170    932  1,150  1,067    904  1,091
As seen in Subsection 2.2.2, using a line or a bar graph to plot the frequencies of data values is often an effective way of portraying a data set. However, for some data sets the number of distinct values is too large to utilize this approach. Instead, in such cases, it is useful to divide the values into groupings, or class intervals, and then plot the number of data values falling in each class interval. The number of class intervals chosen should be a trade-off between (1) choosing too few classes at a cost of losing too much information about the actual data values in a class and (2) choosing too many classes, which will result in the frequencies of each class being too small for a pattern to be discernible. Although 5 to 10 class intervals are typical, the appropriate number is a subjective choice, and of course, you can try different numbers of class intervals to see which of the resulting charts appears to be most revealing about the data. It is common, although not essential, to choose class intervals of equal length.

The endpoints of a class interval are called the class boundaries. We will adopt the left-end inclusion convention, which stipulates that a class interval contains its left-end but not its right-end boundary point. Thus, for instance, the class interval 20-30 contains all values that are both greater than or equal to 20 and less than 30.

Table 2.3 presents the lifetimes of 200 incandescent lamps. A class frequency table for the data of Table 2.3 is presented in Table 2.4. The class intervals are of length 100, with the first one starting at 500.
TABLE 2.4 A Class Frequency Table

Class Interval    Frequency (Number of Data Values in the Interval)
500-600           2
600-700           5
700-800           12
800-900           25
900-1000          58
1000-1100         41
1100-1200         43
1200-1300         7
1300-1400         6
1400-1500         1
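The left-end inclusion convention can be sketched with integer division: a value x falls in class k exactly when start + k*width <= x < start + (k+1)*width. The function name and the short sample below (a handful of lifetimes taken from Table 2.3, not the full data set) are choices made for this illustration.

```python
def class_frequencies(values, start, width, n_classes):
    """Count values per class interval using left-end inclusion."""
    counts = [0] * n_classes
    for x in values:
        # Floor division puts a boundary value such as 900 into the
        # class that starts at 900, not the one that ends there.
        k = (x - start) // width
        if 0 <= k < n_classes:
            counts[k] += 1
    return counts

# A few lifetimes from Table 2.3, chosen only to keep the example short.
sample = [1067, 919, 1196, 785, 855, 1092, 1162, 1170, 1126, 936,
          521, 900, 1425]

counts = class_frequencies(sample, start=500, width=100, n_classes=10)
for k, count in enumerate(counts):
    left = 500 + 100 * k
    print(f"{left}-{left + 100}: {count}")
```

Applying the same function to all 200 lifetimes with the same arguments would give the class frequency table for the full data set.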
FIGURE 2.5 A frequency histogram (vertical axis: number of occurrences).
Histogram: bar graph with bars representing the class frequencies
    frequency histogram
    relative frequency histogram
Ogive: cumulative frequency (or cumulative relative frequency) graph. A point on the horizontal axis of such a graph represents a possible data value; its corresponding vertical plot gives the number (or proportion) of the data whose values are less than or equal to it.
FIGURE 2.6 A cumulative relative frequency plot (horizontal axis: lifetimes).
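The points of an ogive can be sketched directly from the class counts of Table 2.4. Because the class intervals use left-end inclusion, the running total at a class's right-end boundary is the number of data values strictly less than that boundary. The variable names are illustrative.

```python
from itertools import accumulate

# Right-end boundary of each class interval in Table 2.4: 600, ..., 1500.
boundaries = list(range(600, 1600, 100))
# Class frequencies from Table 2.4.
counts = [2, 5, 12, 25, 58, 41, 43, 7, 6, 1]

n = sum(counts)

# Running totals: number of values below each right-end boundary.
cumulative = list(accumulate(counts))
# Dividing by n gives the cumulative relative frequencies.
cumulative_relative = [c / n for c in cumulative]

for b, c, r in zip(boundaries, cumulative, cumulative_relative):
    print(f"below {b:>5}: {c:>3} values  ({r:.3f})")
```

For Table 2.4 the running totals are 44 of 200 below 900 and 143 of 200 below 1,100.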
A bar graph plot of class data, with the bars placed adjacent to each other, is called a histogram. The vertical axis of a histogram can represent either the class frequency or the relative class frequency; in the former case the graph is called a frequency histogram and in the latter a relative frequency histogram. Figure 2.5 presents a frequency histogram of the data in Table 2.4.

We are sometimes interested in plotting a cumulative frequency (or cumulative relative frequency) graph. A point on the horizontal axis of such a graph represents a possible data value; its corresponding vertical plot gives the number (or proportion) of the data whose values are less than or equal to it. A cumulative relative frequency plot of the data of Table 2.3 is given in Figure 2.6. We can conclude from this figure that 100 percent of the data values are less than 1,500, approximately 22 percent are less than 900, approximately 72 percent are less than 1,100, and so on. A cumulative frequency plot is called an ogive.

Stem and Leaf Plot

An efficient way of organizing a small- to moderate-sized data set is to utilize a stem and leaf plot. Such a plot is obtained by first dividing each data value into two parts, its stem and its leaf. For instance, if the data are all two-digit numbers, then we could let the stem part of a data value be its tens digit and let the leaf be its ones digit. Thus, for instance, the value 62 is expressed as

Stem    Leaf
6       2

and the two data values 62 and 67 can be represented as

Stem    Leaf
6       2, 7

The following data give noise levels measured at 36 different times directly outside of Grand Central Station in Manhattan.

82, 89, 94, 110, 74, 122, 112, 95, 100, 78, 65, 60,
90, 83, 87, 75, 114, 85, 69, 94, 124, 115, 107, 88,
97, 74, 72, 68, 83, 91, 90, 102, 77, 125, 108, 65

A stem and leaf plot is:

 6    0, 5, 5, 8, 9
 7    2, 4, 4, 5, 7, 8
 8    2, 3, 3, 5, 7, 8, 9
 9    0, 0, 1, 4, 4, 5, 7
10    0, 2, 7, 8
11    0, 2, 4, 5
12    2, 4, 5
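A stem and leaf plot of this kind can be sketched by splitting each value into its tens part (stem) and ones digit (leaf). The function name below is an assumption made for the example; the data are the 36 noise levels quoted in the text.

```python
from collections import defaultdict

# Noise levels measured at 36 different times (data from the text).
noise = [82, 89, 94, 110, 74, 122, 112, 95, 100, 78, 65, 60,
         90, 83, 87, 75, 114, 85, 69, 94, 124, 115, 107, 88,
         97, 74, 72, 68, 83, 91, 90, 102, 77, 125, 108, 65]

def stem_and_leaf(values):
    """Group the ones digits (leaves) under their tens part (stems)."""
    plot = defaultdict(list)
    for x in sorted(values):          # sorting keeps the leaves ordered
        stem, leaf = divmod(x, 10)    # e.g. 62 -> stem 6, leaf 2
        plot[stem].append(leaf)
    return dict(plot)

plot = stem_and_leaf(noise)
for stem in sorted(plot):
    print(f"{stem:>2}    " + ", ".join(str(leaf) for leaf in plot[stem]))
```

Note that three-digit values such as 110 get the two-digit stem 11, which matches the plot in the text.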
Chapter 1: Descriptive Statistics Jiheng Zhang
To obtain a feel for a large amount of data, it is useful to be able to summarize it by some suitably chosen measures.
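The sample mean, denoted by $\bar{x}$, of the data set $x_1, \ldots, x_n$ is

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

If $y_i = a x_i + b$, $i = 1, \ldots, n$, for constants $a$ and $b$, then

$$\bar{y} = \frac{1}{n}\sum_{i=1}^{n} (a x_i + b) = \frac{1}{n}\sum_{i=1}^{n} a x_i + \frac{1}{n}\sum_{i=1}^{n} b = a\bar{x} + b$$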
Example: The winning scores in the U.S. Masters golf tournament in the years from 1999 to 2008 were as follows:

280, 278, 272, 276, 281, 279, 276, 281, 289, 280

Subtract 280 from each one, $y_i = x_i - 280$:

0, -2, -8, -4, 1, -1, -4, 1, 9, 0

So $\bar{y} = -0.8$, thus $\bar{x} = \bar{y} + 280 = 279.2$.
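The shift trick in this example can be checked numerically. The sketch below computes the mean of the shifted scores and compares it with the mean computed directly; the variable names are illustrative.

```python
# Winning scores in the U.S. Masters, 1999 to 2008 (from the example).
scores = [280, 278, 272, 276, 281, 279, 276, 281, 289, 280]

# Subtract 280 from each score: y_i = x_i - 280.
shifted = [x - 280 for x in scores]

# Mean of the shifted values, then shift back.
y_bar = sum(shifted) / len(shifted)
x_bar = y_bar + 280

# The direct computation gives the same mean.
direct = sum(scores) / len(scores)
print(shifted, y_bar, x_bar, direct)
```

Working with the small shifted values keeps the arithmetic easy by hand, which is the point of the transformation.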
$$\bar{x} = \frac{15 \cdot 2 + 16 \cdot 5 + 17 \cdot 11 + 18 \cdot 9 + 19 \cdot 14 + 20 \cdot 13}{54} \approx 18.24$$
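The computation above is a frequency-weighted mean: each distinct value is multiplied by its frequency, and the total is divided by the number of observations. A minimal sketch, reading the values 15 to 20 and the frequencies 2, 5, 11, 9, 14, 13 off the formula (the list names are illustrative):

```python
# Distinct values and their frequencies, as they appear in the formula.
values = [15, 16, 17, 18, 19, 20]
freqs = [2, 5, 11, 9, 14, 13]

# Number of observations is the sum of the frequencies.
n = sum(freqs)

# Frequency-weighted sum divided by n.
weighted_sum = sum(v * f for v, f in zip(values, freqs))
x_bar = weighted_sum / n
print(n, weighted_sum, round(x_bar, 2))
```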
Suppose we have k distinct values $v_1, \ldots, v_k$ with corresponding frequencies $f_1, \ldots, f_k$. How many observations are in this data set?

$$n = \sum_{i=1}^{k} f_i$$

DEFINITION (SAMPLE MEDIAN) Order the values of a data set of size n from smallest to largest.
    If n is odd, the sample median is the value in position (n + 1)/2.
    If n is even, the sample median is the average of the values in positions n/2 and n/2 + 1.
The number of values that are greater than (>) the sample median is equal to the number of values that are less than (<) the sample median.
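The two cases of the sample median definition translate into a short function. This is a sketch with an illustrative function name; positions in the definition are 1-based, so the code subtracts 1 when indexing.

```python
def sample_median(data):
    """Sample median following the odd/even position rule."""
    ordered = sorted(data)
    n = len(ordered)
    if n % 2 == 1:
        # Odd n: the value in position (n + 1)/2 (1-based).
        return ordered[(n + 1) // 2 - 1]
    # Even n: average of the values in positions n/2 and n/2 + 1.
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

# The ten Masters winning scores quoted earlier in these notes (even n).
scores = [280, 278, 272, 276, 281, 279, 276, 281, 289, 280]
print(sample_median(scores))

# A small odd-sized data set, chosen only for illustration.
print(sample_median([3, 4, 6, 7, 10]))
```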
DEFINITION (SAMPLE MODE) The sample mode is the value that occurs with the greatest frequency. If no single value occurs most frequently, then all the values that occur at the highest frequency are called modal values.

Q: What is the relationship among the sample mean, sample median and sample mode?
Think about two data sets that have the same mean but different spread (variability).

DEFINITION (SAMPLE VARIANCE) The sample variance, denoted by $s^2$, of the data set $x_1, \ldots, x_n$ with mean $\bar{x}$ is defined by

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

Note: for technical reasons, the sum of squared distances is divided by $n - 1$ rather than $n$.
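Example: find the sample variance of the data sets A and B given below.

A: 3, 4, 6, 7, 10
B: -20, 5, 15, 24

Both data sets have sample mean 6. The sample variance for A is

$$s^2 = \frac{(-3)^2 + (-2)^2 + 0^2 + 1^2 + 4^2}{4} = 7.5$$

The following identity is useful for computing the sum of squared deviations:

$$\sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - n\bar{x}^2$$

It holds because

$$\sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} \left(x_i^2 - 2\bar{x}x_i + \bar{x}^2\right) = \sum_{i=1}^{n} x_i^2 - 2\bar{x}\sum_{i=1}^{n} x_i + \sum_{i=1}^{n} \bar{x}^2 = \sum_{i=1}^{n} x_i^2 - 2n\bar{x}^2 + n\bar{x}^2 = \sum_{i=1}^{n} x_i^2 - n\bar{x}^2$$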
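The definition and the computational identity can be compared numerically. The sketch below uses illustrative function names and takes data set B as -20, 5, 15, 24. That reading is an assumption: it supposes a minus sign was lost in extraction, since it gives B the same mean (6) as A, which is what the comparison of spreads requires.

```python
def sample_variance(data):
    """Sample variance from the definition: squared deviations over n - 1."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / (n - 1)

def sample_variance_shortcut(data):
    """Same quantity via the identity sum(x^2) - n * mean^2."""
    n = len(data)
    mean = sum(data) / n
    return (sum(x * x for x in data) - n * mean ** 2) / (n - 1)

A = [3, 4, 6, 7, 10]
B = [-20, 5, 15, 24]   # assumed reading of data set B (see lead-in)

for name, data in (("A", A), ("B", B)):
    print(name, sum(data) / len(data),
          sample_variance(data), sample_variance_shortcut(data))
```

Both sets have mean 6, yet B's variance is far larger, which is exactly the difference in spread that the mean alone cannot show.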
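$$\sum_{i=1}^{7} y_i = 35, \qquad \sum_{i=1}^{7} y_i^2 = 16 + 36 + 25 + 9 + 64 + 49 + 4 = 203$$

so that $\bar{y} = 35/7 = 5$ and

$$\sum_{i=1}^{7} (y_i - \bar{y})^2 = \sum_{i=1}^{7} y_i^2 - 7\bar{y}^2 = 203 - 7 \cdot 25 = 28$$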
DEFINITION (SAMPLE STANDARD DEVIATION) The quantity $s$, which is the square root of the sample variance, is called the sample standard deviation.

DEFINITION (SAMPLE PERCENTILE) The sample 100p percentile is that data value such that
    at least 100p percent of the data are less than or equal to it, and
    at least 100(1 - p) percent are greater than or equal to it.
If two data values satisfy this condition, then the sample 100p percentile is the arithmetic average of these two values.
TABLE 2.6

Rank    City                   Population
1       New York, NY           7,333,253
2       Los Angeles, CA        3,448,613
3       Chicago, IL            2,731,743
4       Houston, TX            1,702,086
5       Philadelphia, PA       1,524,249
6       San Diego, CA          1,151,977
7       Phoenix, AZ            1,048,949
8       Dallas, TX             1,022,830
9       San Antonio, TX        998,905
10      Detroit, MI            992,038
11      San Jose, CA           816,884
12      Indianapolis, IN       752,279
13      San Francisco, CA      734,676
14      Baltimore, MD          702,979
15      Jacksonville, FL       665,070
16      Columbus, OH           635,913
17      Milwaukee, WI          617,044
18      Memphis, TN            614,289
19      El Paso, TX            579,307
20      Washington, D.C.       567,094
21      Boston, MA             547,725
22      Seattle, WA            520,947
23      Austin, TX             514,013
24      Nashville, TN          504,505
25      Denver, CO             493,559
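What is the sample 10th percentile? Since 25 × 10/100 = 2.5 is not an integer, it is the 3rd smallest value: 514,013.

What is the sample 80th percentile? Since 25 × 80/100 = 20 is an integer, it is the average of the 20th and 21st smallest values: (1,151,977 + 1,524,249)/2 = 1,338,113.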
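The percentile definition can be applied literally: test every data value against the two "at least" conditions and average the values that pass. The sketch below does this for the populations of Table 2.6. The function name is illustrative, and the percent is passed as an integer so that the comparisons use exact integer arithmetic.

```python
# Populations from Table 2.6, in rank order.
populations = [7333253, 3448613, 2731743, 1702086, 1524249, 1151977,
               1048949, 1022830, 998905, 992038, 816884, 752279,
               734676, 702979, 665070, 635913, 617044, 614289,
               579307, 567094, 547725, 520947, 514013, 504505, 493559]

def sample_percentile(data, percent):
    """Sample percentile by checking the definition for every value."""
    n = len(data)
    qualifying = set()
    for x in data:
        at_most = sum(v <= x for v in data)    # values <= x
        at_least = sum(v >= x for v in data)   # values >= x
        # At least percent% are <= x and at least (100 - percent)% are >= x.
        if at_most * 100 >= n * percent and at_least * 100 >= n * (100 - percent):
            qualifying.add(x)
    # One qualifying value: return it. Two: return their average.
    return sum(qualifying) / len(qualifying)

print(sample_percentile(populations, 10))
print(sample_percentile(populations, 80))
```

This brute-force check is quadratic in the data size, which is fine for 25 cities; it is meant to mirror the definition rather than to be efficient.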
DEFINITION (SAMPLE QUARTILES)
    the first quartile: the sample 25th percentile
    the second quartile: the sample 50th percentile
    the third quartile: the sample 75th percentile

Q: What is another name for the second quartile?

The quartiles break up a data set into four parts, with roughly 25 percent of the data being less than the first quartile, 25 percent being between the first and second quartile, 25 percent being between the second and third quartile, and 25 percent being greater than the third quartile.

Example: for the 36 noise levels measured outside Grand Central Station, a stem and leaf plot is:

 6    0, 5, 5, 8, 9
 7    2, 4, 4, 5, 7, 8
 8    2, 3, 3, 5, 7, 8, 9
 9    0, 0, 1, 4, 4, 5, 7
10    0, 2, 7, 8
11    0, 2, 4, 5
12    2, 4, 5

    the first quartile is 76, the average of the 9th and 10th smallest data values (75 and 77)
    the second quartile is 89.5, the average of the 18th and 19th smallest values (89 and 90)
    the third quartile is 104.5, the average of the 27th and 28th smallest values (102 and 107)
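The quartiles can also be located by position. With n data values and p strictly between 0 and 1, if np is not an integer the percentile is the value in position ceil(np); if np is an integer it is the average of the values in positions np and np + 1. The sketch below applies this shortcut to the 36 noise levels. The function name is illustrative, and the noise list is rebuilt from the stem and leaf plot.

```python
# The 36 noise levels, read off the stem and leaf plot.
noise = [60, 65, 65, 68, 69, 72, 74, 74, 75, 77, 78,
         82, 83, 83, 85, 87, 88, 89, 90, 90, 91, 94, 94, 95, 97,
         100, 102, 107, 108, 110, 112, 114, 115, 122, 124, 125]

def percentile_by_position(data, percent):
    """Position shortcut for the sample percentile (0 < percent < 100)."""
    ordered = sorted(data)
    n = len(ordered)
    q, r = divmod(n * percent, 100)   # n * percent / 100 = q + r/100
    if r == 0:
        # np is an integer: average positions np and np + 1 (1-based).
        return (ordered[q - 1] + ordered[q]) / 2
    # np is not an integer: take position ceil(np) = q + 1 (1-based).
    return ordered[q]

quartiles = [percentile_by_position(noise, p) for p in (25, 50, 75)]
print(quartiles)
```

With n = 36, the products np for the three quartiles are 9, 18 and 27, all integers, so each quartile is an average of two neighbouring ordered values.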