271 (part 1)
Basic Statistics
Population
- The entire collection of individuals or elements whose characteristics are being
studied.
Sample
- A portion, or part, of the population of interest.
Variable
- A characteristic that takes different values for different elements.
Observation
- The value of a variable for a single element.
Data Set
- A collection of observations on one or more variables.
• Direct observation
- The simplest method of obtaining data.
- Advantage: relatively inexpensive
- Disadvantage: difficult to produce useful information
since it does not consider all aspects regarding the issues.
• Experiments
- A more expensive method, but a better way to produce data
- Data produced this way are called experimental data
• Surveys
- The most familiar method of data collection
- Depends on the response rate
• Personal Interview
- Has the advantage of a higher
expected response rate
- Fewer incorrect responses.
Grouped Data Vs Ungrouped Data
Line chart
Bar chart
Pie chart
Frequency table
Observation Frequency
Malay 33
Chinese 9
Indian 6
Others 2
[Figure: chart of the frequency table above, with categories Malay, Chinese, Indian, Others]
Ogive: a line graph in which the horizontal axis represents the upper
limit of each class interval and the vertical axis represents the
cumulative frequencies.
Constructing Frequency Distribution
When summarizing large quantities of raw data, it is often useful to distribute
the data into classes. Table 1 shows the classes used for the students'
weights.
Table 1: Weight of 100 male students in XYZ university
Weight    Frequency
60-62     5
63-65     18
66-68     42
69-71     27
72-74     8
Total     100
• A frequency distribution for quantitative data lists all the classes and the
number of values that belong to each class.
• Data presented in the form of a frequency distribution are called grouped
data.
• For quantitative data, a class is an interval that includes all the values
that fall between two numbers, the lower and upper class limits.
Classes are listed in the first column of the frequency distribution table.
Classes always represent a variable and do not overlap; each value
belongs to one and only one class.
• The numbers listed in the second column are called frequencies; they
give the number of values that belong to each class.
Frequencies are denoted by f.
Table 1: Weight of 100 male students in XYZ university
Weight (variable)    Frequency (frequency column)
60-62                5
63-65                18
66-68                42
69-71                27
72-74                8
Total                100
Annotations: the third class (an interval class) is 66-68 and its frequency is 42;
the lower limit of the fourth class is 69; the upper limit of the fifth class is 74.
• The class boundary is given by the midpoint of the upper limit of
one class and the lower limit of the next class.
• The difference between the two boundaries of a class gives the
class width; also called class size.
Formula:
- Class Midpoint or Mark
Class midpoint or mark = (Lower Limit + Upper Limit)/2
- Finding The Number of Classes
Number of classes = 1 + 3.3 log n
- Finding Class Width For Interval Class
class width , i = (Largest value – Smallest value)/Number of classes
* Any convenient number that is equal to or less than the smallest value in the
data set can be used as the lower limit of the first class.
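The two class-count formulas above can be sketched in Python as follows. This is only an illustrative sketch: the function names are hypothetical, and the logarithm is assumed to be base 10, which the slides do not state explicitly.

```python
import math


def number_of_classes(n: int) -> float:
    """Suggested number of classes for n observations: 1 + 3.3 log n.

    Assumption: "log" means the base-10 logarithm.
    """
    return 1 + 3.3 * math.log10(n)


def class_width(largest: float, smallest: float, k: float) -> float:
    """Class width i = (largest value - smallest value) / number of classes."""
    return (largest - smallest) / k


# For the 100 students of Table 1 the rule suggests about 7.6 classes,
# which would normally be rounded to a convenient whole number.
print(round(number_of_classes(100), 1))
```

In practice both results are rounded to convenient values, so the output is a starting point rather than a fixed answer.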
Example 1:
From Table 1:
Weight    Frequency    Cumulative Frequency    Class Boundary
60-62     5            5                       59.5-62.5
63-65     18           5 + 18 = 23             62.5-65.5
66-68     42           23 + 42 = 65            65.5-68.5
69-71     27           65 + 27 = 92            68.5-71.5
72-74     8            92 + 8 = 100            71.5-74.5
Total     100
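As a rough check of Example 1, the class boundaries and cumulative frequencies can be computed from the class limits of Table 1. The variable names below are illustrative only, and the sketch assumes all classes are separated by the same gap.

```python
from itertools import accumulate

# Class limits and frequencies taken from Table 1.
classes = [(60, 62), (63, 65), (66, 68), (69, 71), (72, 74)]
freqs = [5, 18, 42, 27, 8]

# The boundary lies midway between the upper limit of one class and the
# lower limit of the next, so each limit moves outward by half the gap.
gap = classes[1][0] - classes[0][1]
boundaries = [(lo - gap / 2, hi + gap / 2) for lo, hi in classes]

# Running total of the frequencies.
cumulative = list(accumulate(freqs))

for (lo, hi), f, cf, (bl, bu) in zip(classes, freqs, cumulative, boundaries):
    print(f"{lo}-{hi}  f={f}  F={cf}  boundary={bl}-{bu}")
```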
Exercise 1:
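Mean of ungrouped data:
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}, \ \text{for } i = 1, 2, \ldots, n, \quad\text{or}\quad \bar{x} = \frac{\sum x}{n}$$
Mean of grouped data:
$$\bar{x} = \frac{\sum_{i=1}^{n} f_i x_i}{\sum_{i=1}^{n} f_i} \quad\text{or}\quad \bar{x} = \frac{\sum fx}{\sum f}$$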
Example 2 (Ungrouped data):
Solution :
Weight Frequency
60-62 5
63-65 18
66-68 42
69-71 27
72-74 8
Solution :
Weight (Class Interval)    Frequency, f    Class Mark, x    fx
60-62                      5               61               305
63-65                      18              64               1152
66-68                      42              67               2814
69-71                      27              70               1890
72-74                      8               73               584
Total                      100                              6745
$$\bar{x} = \frac{\sum fx}{\sum f} = \frac{6745}{100} = 67.45$$
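The grouped-mean calculation can be reproduced with a short Python sketch. The function name `grouped_mean` is hypothetical; each class is represented by its class mark (midpoint), as in the solution table.

```python
def grouped_mean(classes, freqs):
    """Mean of grouped data: sum(f * x) / sum(f), with x the class mark."""
    marks = [(lo + hi) / 2 for lo, hi in classes]
    return sum(f * x for f, x in zip(freqs, marks)) / sum(freqs)


# Weight distribution of the 100 students.
classes = [(60, 62), (63, 65), (66, 68), (69, 71), (72, 74)]
freqs = [5, 18, 42, 27, 8]
print(grouped_mean(classes, freqs))
```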
Median of ungrouped data:
• The median depends on the number of observations in the data,
n . If n is odd, then the median is the (n+1)/2 th observation of the
ordered observations.
• But if n is even, then the median is the arithmetic mean of the
(n/2)th observation and the (n/2 + 1)th observation.
Median of grouped data:
$$\tilde{x} = L + c\left(\frac{\frac{\sum f}{2} - F_{j-1}}{f_j}\right)$$
where
L = the lower class boundary of the median class
c = the size of the median class interval
$F_{j-1}$ = the sum of frequencies of all classes lower than the median class
$f_j$ = the frequency of the median class
Example 4 (Ungrouped data):
n is odd
Find the median for data 4,6,3,1,2,5,7 ( n = 7)
Rearrange the data : 1,2,3,4,5,6,7
(median = (7+1)/2=4th place)
Median = 4
n is even
Find the median for data 4,6,3,2,5,7 (n = 6)
Rearrange the data : 2,3,4,5,6,7
Median = (4+5)/2 = 4.5
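A minimal Python sketch of the ungrouped median, covering both the odd and the even case of Example 4. The function name is illustrative; Python's standard `statistics.median` gives the same results.

```python
def median(values):
    """Median of ungrouped data."""
    data = sorted(values)          # rearrange the data in ascending order
    n = len(data)
    mid = n // 2
    if n % 2 == 1:
        # n odd: the middle ordered observation
        return data[mid]
    # n even: mean of the two middle ordered observations
    return (data[mid - 1] + data[mid]) / 2


print(median([4, 6, 3, 1, 2, 5, 7]))  # n = 7
print(median([4, 6, 3, 2, 5, 7]))     # n = 6
```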
Example 5 (Grouped Data):
Find the sample median for the frequency distribution in Example 3.
Solution:
Weight (Class Interval)    Frequency, f    Cumulative Frequency, F    Class Boundary
60-62                      5               5                          59.5-62.5
63-65                      18              23                         62.5-65.5
66-68 (median class)       42              65                         65.5-68.5
69-71                      27              92                         68.5-71.5
72-74                      8               100                        71.5-74.5
Total                      100
$$\tilde{x} = L + c\left(\frac{\frac{\sum f}{2} - F_{j-1}}{f_j}\right) = 65.5 + 3\left(\frac{\frac{100}{2} - 23}{42}\right) = 67.43$$
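The grouped-median formula can be sketched in Python as below. The function name is hypothetical; the median class is taken to be the first class whose cumulative frequency reaches half of the total. For the weight data this gives 65.5 + 3(27/42), approximately 67.43.

```python
def grouped_median(boundaries, freqs):
    """Median of grouped data: L + c * ((sum(f)/2 - F_prev) / f_j)."""
    n = sum(freqs)
    cum = 0  # sum of frequencies of all classes below the current one
    for (lower, upper), f in zip(boundaries, freqs):
        if cum + f >= n / 2:            # this is the median class
            c = upper - lower           # size of the median class interval
            return lower + c * (n / 2 - cum) / f
        cum += f


boundaries = [(59.5, 62.5), (62.5, 65.5), (65.5, 68.5), (68.5, 71.5), (71.5, 74.5)]
freqs = [5, 18, 42, 27, 8]
print(round(grouped_median(boundaries, freqs), 2))
```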
Mode
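When data have been grouped into classes and a frequency curve is drawn
to fit the data, the mode is the value of x corresponding to the maximum
point on the curve, that is
$$\hat{x} = L + c\left(\frac{\Delta_1}{\Delta_1 + \Delta_2}\right)$$
where L is the lower class boundary of the modal class, c is the class size,
$\Delta_1$ is the frequency of the modal class minus the frequency of the class before it,
and $\Delta_2$ is the frequency of the modal class minus the frequency of the class after it.
*The class which has the highest frequency is called the modal class.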
Example 6 (Ungrouped data)
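For the grouped weight data (total frequency 100), the modal class is 66-68:
$$\hat{x} = L + c\left(\frac{\Delta_1}{\Delta_1 + \Delta_2}\right) = 65.5 + 3\left[\frac{(42 - 18)}{(42 - 18) + (42 - 27)}\right] = 67.35$$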
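A hedged Python sketch of the grouped-mode calculation follows. The function name is hypothetical, and treating a missing neighbouring class as having frequency 0 is an assumption of this sketch, not something stated in the slides.

```python
def grouped_mode(boundaries, freqs):
    """Mode of grouped data: L + c * d1 / (d1 + d2)."""
    # Modal class: the class with the highest frequency.
    j = max(range(len(freqs)), key=freqs.__getitem__)
    lower, upper = boundaries[j]
    # Assumption: a neighbouring class that does not exist has frequency 0.
    prev_f = freqs[j - 1] if j > 0 else 0
    next_f = freqs[j + 1] if j + 1 < len(freqs) else 0
    d1 = freqs[j] - prev_f   # modal frequency minus the class before
    d2 = freqs[j] - next_f   # modal frequency minus the class after
    return lower + (upper - lower) * d1 / (d1 + d2)


boundaries = [(59.5, 62.5), (62.5, 65.5), (65.5, 68.5), (68.5, 71.5), (71.5, 74.5)]
freqs = [5, 18, 42, 27, 8]
print(round(grouped_mode(boundaries, freqs), 2))
```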
b) Measures of Dispersion
• Range = Largest value – smallest value
• Variance: measures the variability (differences) existing in a
set of data.
For sample
$$S^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$
For population
$$\sigma^2 = \frac{\sum (x - \mu)^2}{n}$$
The variance for the grouped data:
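• For sample
$$S^2 = \frac{\sum fx^2 - n\bar{x}^2}{n - 1} \quad\text{or}\quad S^2 = \frac{\sum fx^2 - \frac{(\sum fx)^2}{n}}{n - 1}$$
• For population
$$\sigma^2 = \frac{\sum fx^2 - n\bar{x}^2}{n} \quad\text{or}\quad \sigma^2 = \frac{\sum fx^2 - \frac{(\sum fx)^2}{n}}{n}$$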
The positive square root of the variance is the standard
deviation
$$S = \sqrt{S^2} = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} = \sqrt{\frac{\sum fx^2 - n\bar{x}^2}{n - 1}}$$
$$S^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$
$$s^2 = \frac{(3-5.1)^2 + (5-5.1)^2 + (2-5.1)^2 + (6-5.1)^2 + (5-5.1)^2 + (9-5.1)^2 + (5-5.1)^2 + (2-5.1)^2 + (8-5.1)^2 + (6-5.1)^2}{9} = \frac{48.9}{9} = 5.43$$
$$s = \sqrt{s^2} = \sqrt{5.43} = 2.33$$
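The ungrouped sample variance and standard deviation can be checked with a few lines of Python. The data list is read off the squared-deviation terms of the worked example (mean 5.1, n = 10); the function name is illustrative, and the standard-library `statistics.variance` is used only as a cross-check.

```python
import math
import statistics


def sample_variance(values):
    """Sample variance: sum((x - mean)^2) / (n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


# Values recovered from the terms of the worked example.
data = [3, 5, 2, 6, 5, 9, 5, 2, 8, 6]
s2 = sample_variance(data)
s = math.sqrt(s2)
print(round(s2, 2), round(s, 2))

# Cross-check against the standard library.
assert math.isclose(s2, statistics.variance(data))
```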
Example 9 (Grouped data)
Find the variance and standard deviation of the sample data below:
$$S^2 = \frac{\sum fx^2 - \frac{(\sum fx)^2}{n}}{n - 1} = \frac{455803 - \frac{6745^2}{100}}{99} = \frac{852.75}{99} = 8.61$$
$$s = \sqrt{8.61} = 2.93$$
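The grouped-data calculation can be sketched the same way. The function name is hypothetical, and the class marks and frequencies are those of the weight distribution, which give sum(fx) = 6745 and sum(fx^2) = 455803.

```python
import math


def grouped_sample_variance(marks, freqs):
    """Sample variance of grouped data: (sum(fx^2) - (sum(fx))^2 / n) / (n - 1)."""
    n = sum(freqs)
    sum_fx = sum(f * x for f, x in zip(freqs, marks))
    sum_fx2 = sum(f * x * x for f, x in zip(freqs, marks))
    return (sum_fx2 - sum_fx ** 2 / n) / (n - 1)


marks = [61, 64, 67, 70, 73]   # class marks of the weight classes
freqs = [5, 18, 42, 27, 8]
s2 = grouped_sample_variance(marks, freqs)
print(round(s2, 2), round(math.sqrt(s2), 2))
```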
Exercise 3
The defects from machine A for a sample of products were
organized into the following:
Defects (Class Interval)    Number of defective products, f (frequency)
2-6                         1
7-11                        4
12-16                       10
17-21                       3
22-26                       2
22 13 26 16 18 12 9 26 20 16
23 14 19 23 20 16 27 9 21 14
$\bar{x} \pm s = 40 \pm 12 = [28, 52]$
$\bar{x} \pm 2s = 40 \pm 2(12) = [16, 64]$
$\bar{x} \pm 3s = 40 \pm 3(12) = [4, 76]$
z scores
Percentiles
Quartiles
Outliers
Z SCORE
• A standard score or z score tells how many standard
deviations a data value is above or below the mean for a
specific distribution of values.
• If a z score is 0, then the data value is the same as the
mean.
• The formula is:
$$z = \frac{\text{value} - \text{mean}}{\text{standard deviation}}$$
for samples,
$$z = \frac{X - \bar{X}}{s}$$
for populations,
$$z = \frac{X - \mu}{\sigma}$$
• Note that if the z score is positive, the score is above the mean. If
the z score is 0, the score is the same as the mean. And if the z
score is negative, the score is below the mean.
Example:
A student scored 65 on a calculus test that had a mean of 50
and standard deviation of 10. She scored 30 on a history test
with a mean of 25 and a standard deviation of 5. Compare her
relative positions on the two tests.
Solution:
Find the z scores.
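For calculus: $z = \frac{65 - 50}{10} = 1.5$. The calculus score of 65 was 1.5 standard deviations above the mean of 50.
For history: $z = \frac{30 - 25}{5} = 1.0$. The history score of 30 was 1.0 standard deviation above the mean of 25.
Since the z score for calculus is larger, her relative position on the calculus test was higher than on the history test.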
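The comparison in this example can be reproduced with a small Python sketch; the function name `z_score` is illustrative.

```python
def z_score(value, mean, sd):
    """Number of standard deviations a value lies above (+) or below (-) the mean."""
    return (value - mean) / sd


calculus = z_score(65, 50, 10)
history = z_score(30, 25, 5)
print(calculus, history)

# The larger z score marks the better relative position.
better = "calculus" if calculus > history else "history"
print(better)
```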
Test X $\bar{X}$ s
Mathematics 38 40 5
Statistics 94 100 10
Quartiles
Divide data sets into fourths or four equal parts.
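[Figure: number line from the smallest data value to the largest data value, divided at Q1, Q2 and Q3]
$Q_1 = \frac{1}{4}(n + 1)$th value
$Q_2 = \text{median} = \frac{1}{2}(n + 1)$th value
$Q_3 = \frac{3}{4}(n + 1)$th value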
The positions are integers
Example: 5, 8, 4, 4, 6, 3, 8 (n=7)
3, 4, 4, 5, 6, 8, 8
$Q_1 = 4$, $Q_2 = 5$, $Q_3 = 8$
The positions are not integers
Example: 5, 12, 10, 4, 6, 3, 8, 14 (n=8)
3, 4, 5, 6, 8, 10, 12, 14
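$Q_1 = \frac{1}{4}(8 + 1)\text{th} = 2.25\text{th} = 4 + 0.25(5 - 4) = 4.25$
$Q_2 = \frac{1}{2}(8 + 1)\text{th} = 4.5\text{th} = 6 + 0.5(8 - 6) = 7$
$Q_3 = \frac{3}{4}(8 + 1)\text{th} = 6.75\text{th} = 10 + 0.75(12 - 10) = 11.5$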
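The (n + 1) position rule used in both quartile examples can be sketched in Python as follows. The function name is hypothetical, and the sketch assumes at least three data values so that every position falls inside the ordered data.

```python
import math


def quartile(values, k):
    """k-th quartile (k = 1, 2, 3) located at position k(n + 1)/4 in the ordered data."""
    data = sorted(values)
    pos = k * (len(data) + 1) / 4       # 1-based position, may be fractional
    whole = math.floor(pos)
    frac = pos - whole
    if frac == 0:
        return data[whole - 1]          # position is an integer
    # Otherwise interpolate between the two neighbouring ordered values.
    return data[whole - 1] + frac * (data[whole] - data[whole - 1])


print([quartile([5, 8, 4, 4, 6, 3, 8], k) for k in (1, 2, 3)])
print([quartile([5, 12, 10, 4, 6, 3, 8, 14], k) for k in (1, 2, 3)])
```

Other software may use a different quartile convention, so results can differ slightly from this rule.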
Example 11
The following data represent the number of inches of rain in
Chicago during the month of April for 10 randomly selected years.
$Q_1 = \frac{1}{4}(10 + 1)\text{th} = 2.75\text{th} = 1.14 + 0.75(1.85 - 1.14) = 1.6725$
$Q_2 = \frac{1}{2}(10 + 1)\text{th} = 5.5\text{th} = 3.41 + 0.5(3.94 - 3.41) = 3.675$
$Q_3 = \frac{3}{4}(10 + 1)\text{th} = 8.25\text{th} = 4.02 + 0.25(4.11 - 4.02) = 4.0425$
Outliers
• Extreme observations
• Can occur because of the error in measurement of a variable,
during data entry or errors in sampling.
Checking for outliers by using Quartiles
Step 1:
Determine the first and third quartiles of data.
Step 2:
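Compute the interquartile range (IQR): $IQR = Q_3 - Q_1$
Step 3:
Determine the fences. Fences serve as cut-off points for
determining outliers.
$\text{Lower Fence} = Q_1 - 1.5(IQR)$
$\text{Upper Fence} = Q_3 + 1.5(IQR)$
With $Q_1 = 1.6725$ and $Q_3 = 4.0425$: $IQR = 4.0425 - 1.6725 = 2.37$,
Lower Fence $= 1.6725 - 1.5(2.37) = -1.8825$ and Upper Fence $= 4.0425 + 1.5(2.37) = 7.5975$.
Since no data value is less than -1.8825 or greater than 7.5975, there are no outliers in the
data.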
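The quartile-based outlier check can be sketched in Python as follows. The function names are illustrative, and the upper fence is assumed to mirror the lower one, Q3 + 1.5(IQR).

```python
def fences(q1, q3):
    """Return (lower fence, upper fence) from the first and third quartiles."""
    iqr = q3 - q1                        # interquartile range
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def outliers(values, q1, q3):
    """Values lying below the lower fence or above the upper fence."""
    lower, upper = fences(q1, q3)
    return [x for x in values if x < lower or x > upper]


lower, upper = fences(1.6725, 4.0425)
print(round(lower, 4), round(upper, 4))
```

Any value returned by `outliers` would be flagged for closer inspection rather than discarded automatically.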
The Five Number Summary
MINIMUM Q1 M Q3 MAXIMUM
Example 13
Minimum = 0.97,
$Q_1$ = 1.6725,
M = $Q_2$ = 3.675,
$Q_3$ = 4.0425,
Maximum = 5.22
BOXPLOT
• The five-number summary can be used to create a
simple graph called a boxplot.
• From the boxplot, you can quickly detect any
skewness in the shape of the distribution and see
whether there are any outliers in the data set.
[Figure: boxplot with the lower and upper fences marked; points beyond either fence are outliers]
Interpreting Boxplot
- Symmetric
- Skewed left, because the tail is to the left
- Skewed right, because the tail is to the right
Mean/Median Versus Skewness
TO CONSTRUCT BOXPLOT
Step 1: Determine the lower and upper fences:
$\text{Lower Fence} = Q_1 - 1.5(IQR)$
$\text{Upper Fence} = Q_3 + 1.5(IQR)$