Chapter 4

Statistics
Statistics is the science of
conducting studies to collect,
organize, summarize, present,
analyze, and draw conclusions from
data.
• The branch of statistics that involves
the collection, organization,
summarization, and presentation of
data is called descriptive statistics.

• The branch that interprets and draws
conclusions from the data is called
inferential statistics.
Population: all subjects being studied

Sample: a group of subjects selected from a population
Variables and Types of Data
• Variables can be classified as
qualitative and quantitative

• Variables can also be classified by
how they are categorized, counted,
or measured. This type of
classification uses measurement
scales: nominal, ordinal, interval,
and ratio.
Nominal Level Data
• Zip code
• Gender
• Nationality
• Political affiliation
• Religious affiliation
• Major field (mathematics,
mechanical engineering, etc.)
• Eye color (black, brown, blue,
green, hazel)
Ordinal-Level Data
• Grade (1.0, 1.5, 2.0, etc.)
• Judging (first place, second place,
etc.)
• Rating scale (excellent, good, poor)
• Ranking (of tennis players)
Interval-Level Data
• IQ
• Temperature
• SAT
Ratio-Level Data
• Height
• Weight
• Time
• Salary
• Age
Example: Transportation Safety
The chart shows the number of job-related
injuries for each transportation
industry in 1998.
Industry         Number of Injuries
Railroad         4520
Intercity Bus    5100
Subway           6850
Trucking         7144
Airline          9950
1. What are the variables under study?
2. Categorize each variable as
quantitative or qualitative.
3. Identify the level measurement of
each variable.
4. The railroad is shown as the safest
transportation industry. Does that
mean the railroads have fewer
accidents than the other industries?
5. What factors other than safety
influence a person’s choice of
transportation?
1. The variables are industry and number
of job-related injuries.
2. The type of industry is a qualitative
variable, while the number of job-related
injuries is quantitative.
3. The type of industry is nominal, and
the number of job-related injuries is
ratio.
4. The railroads do show fewer job-related
injuries; however, there are
other things to consider.
5. A person's choice of transportation
might also be affected by convenience,
cost, service, etc.
Measures of Central Tendency
• Measures of central tendency
are measures of average; they
yield information about “particular
places or locations in a group of
numbers.”

– Mean
– Median
– Mode
– Midrange
– Weighted mean
Mean
• Is the average of a group of numbers
• Applicable for interval and ratio data,
not applicable for nominal or ordinal
data
• Affected by each value in the data set,
including extreme values
• Computed by summing all values in the
data set and dividing the sum by the
number of values in the data set
Example
The table represents the number of miles
run per week for a sample of 20
runners.
Median
• Middle value in an ordered array of
numbers.
• Applicable for ordinal, interval, and ratio
data
• Not applicable for nominal data
• Unaffected by extremely large and
extremely small values.
Find the median
1. The number of rooms in the downtown
hotels in Pittsburgh
292, 300, 311, 401, 595, 618, 713

2. The number of tornadoes that have
occurred in the United States over an 8-year
period
684, 764, 656, 702, 856, 1133, 1132, 1303
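The two exercises above can be checked with a short sketch. It uses Python's standard `statistics.median`, which sorts the data first; the variable names are illustrative and not part of the slides.

```python
from statistics import median

hotel_rooms = [292, 300, 311, 401, 595, 618, 713]
tornadoes = [684, 764, 656, 702, 856, 1133, 1132, 1303]

# Odd number of values: the median is the single middle value.
rooms_median = median(hotel_rooms)

# Even number of values: the median is the mean of the two middle
# values of the sorted data, here (764 + 856) / 2.
tornado_median = median(tornadoes)

print(rooms_median)    # 401
print(tornado_median)  # 810.0
```

Note that the tornado data must be ordered before the middle values are picked, which is why the unsorted list is passed as is and left to `median` to sort.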
Mode
• The most frequently occurring value in a
data set
• Applicable to all levels of data
measurement (nominal, ordinal, interval,
and ratio)
• Bimodal -- Data sets that have two
modes
• Multimodal -- Data sets that contain
more than two modes
The data shows the number of
licensed nuclear reactors in the US.
Find the mode.
104 104 104 104 104
107 109 109 109 110
109 111 112 111 109
Mode: 104 (occurs 5 times)

Mode: 109 (occurs 5 times)

Bimodal: 2 modes
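As a quick check of the reactor example, the following sketch uses `statistics.multimode` from the Python standard library (available from Python 3.8), which returns every value tied for the highest frequency.

```python
from statistics import multimode

reactors = [
    104, 104, 104, 104, 104,
    107, 109, 109, 109, 110,
    109, 111, 112, 111, 109,
]

# multimode lists all most-frequent values in order of first appearance.
modes = multimode(reactors)

print(modes)            # [104, 109]
print(len(modes) == 2)  # True: the data set is bimodal
```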
Find the modal class for the
frequency distribution of miles that
20 runners ran in one week
Class          Frequency
5.5 – 10.5     1
10.5 – 15.5    2
15.5 – 20.5    3
20.5 – 25.5    5
25.5 – 30.5    4
30.5 – 35.5    3
35.5 – 40.5    2
Midrange
Find the midrange of the NFL signing
bonuses, given that the bonuses in millions of
dollars are
18.0, 14.0, 34.5, 10, 12.4, 10
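The slide's formula did not survive extraction, so the sketch below assumes the usual definition of the midrange: the sum of the lowest and highest values divided by 2.

```python
bonuses = [18.0, 14.0, 34.5, 10, 12.4, 10]  # millions of dollars

# Assumed definition: midrange = (lowest value + highest value) / 2
midrange = (min(bonuses) + max(bonuses)) / 2

print(midrange)  # 22.25
```

Because only the two extreme values enter the calculation, the midrange is strongly affected by a single very large or very small value.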
Weighted Mean
A student received an A in English
Composition I (3 credits), a C in
Psychology I (3 credits), a B in Biology
(4 credits), and a D in PE (2 credits).
Assuming A = 4 grade points
B = 3 grade points
C = 2 grade points
D = 1 grade point
find the student's grade point average.
Course                 Credits (w)    Grade (X)
English Composition    3              A (4 points)
Psychology             3              C (2 points)
Biology                4              B (3 points)
PE                     2              D (1 point)
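A minimal sketch of the weighted mean for this table, assuming the standard definition (sum of each weight times its value, divided by the sum of the weights). The tuple layout is only an illustration.

```python
# (course, credits w, grade points X), taken from the table above
courses = [
    ("English Composition", 3, 4),
    ("Psychology", 3, 2),
    ("Biology", 4, 3),
    ("PE", 2, 1),
]

total_points = sum(w * x for _, w, x in courses)   # sum of w * X
total_credits = sum(w for _, w, _ in courses)      # sum of w

gpa = total_points / total_credits

print(total_points, total_credits)  # 32 12
print(round(gpa, 2))                # 2.67
```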
Measures of Variation
Comparison of Outdoor Paint

A testing lab wishes to test two
experimental brands of outdoor paint to
see how long each will last before fading.
The testing lab makes 6 gallons of each
paint to test. Since different chemical
agents are added to each group and only
six cans are involved, these two groups
constitute two small populations. The
results (in months) are shown. Find the
mean of each group.
Since the means are equal, we might
conclude that both brands of paints last
equally well. However, when the data sets
are examined graphically a somewhat
different conclusion might be drawn.
Even though the means are the same for
both brands, the spread, or variation, is
different.
[Figure: Variation of paint (in months). Brand A: values at 10, 20, 30, 40, 50, 60. Brand B: values at 25, 30, 35, 40, 45.]
Measures of Variation

• Range

• Variance

• Standard Deviation
Range
Comparison of outdoor paint
Find the range of the two brands of
paint.
For brand A, the range is R = 60 – 10
= 50 months
For brand B, the range is R = 45 – 25
= 20 months
The range for brand A shows that 50 months
separate the largest data value from the lowest
data value. For brand B, 20 months separate
the largest data value from the lowest data
value.
Variance
Standard Deviation (Population)
Brand B
X      X − μ    (X − μ)²
35      0         0
45     10       100
30     -5        25
35      0         0
40      5        25
25    -10       100
Σ(X − μ)² = 250
Since the standard deviation of brand A is
17.1 and the standard deviation of brand B
is 6.5, the data are more variable in brand
A. In summary, when the means are
equal, the larger the variance and the
standard deviation are, the more
variable the data are.
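The population variance and standard deviation can be sketched with the standard library. Brand B's values come from the table above; Brand A's six values (10 through 60) are read off the plot and should be treated as an assumption.

```python
from statistics import mean, pstdev, pvariance

brand_a = [10, 20, 30, 40, 50, 60]  # assumed: values read off the plot
brand_b = [35, 45, 30, 35, 40, 25]  # values from the deviation table

for name, data in (("A", brand_a), ("B", brand_b)):
    # pvariance and pstdev divide by N (population formulas), which fits
    # the slides' treatment of each group as a small population.
    print(name, mean(data), round(pvariance(data), 1), round(pstdev(data), 1))
```

Under these assumptions both means are 35, while the standard deviations come out to about 17.1 for brand A and 6.5 for brand B.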
Variance and Standard Deviation
(sample)
Find the sample variance and standard
deviation for the amount of European auto
sales for a sample of 6 years shown.
The data are in millions of dollars.

11.2, 11.9, 12.0, 12.8, 13.4, 14.3
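For a sample, the sum of squared deviations is divided by n − 1 rather than n. A short sketch using `statistics.variance` and `statistics.stdev`, which apply that sample formula:

```python
from statistics import mean, stdev, variance

sales = [11.2, 11.9, 12.0, 12.8, 13.4, 14.3]  # millions of dollars

sample_mean = mean(sales)
sample_variance = variance(sales)  # divides by n - 1
sample_sd = stdev(sales)           # square root of the sample variance

print(round(sample_mean, 1))      # 12.6
print(round(sample_variance, 3))  # 1.276
print(round(sample_sd, 2))        # 1.13
```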


Variance and Standard Deviation for
Grouped Data
Class Frequency Midpoint
5.5 – 10.5 1 8
10.5 -15.5 2 13
15.5 -20.5 3 18
20.5 – 25.5 5 23
25.5 – 30.5 4 28
30.5 – 35.5 3 33
35.5 – 40.5 2 38
Class          Frequency f    Midpoint Xm    f·Xm    f·Xm²
5.5 – 10.5     1              8              8       64
10.5 – 15.5    2              13             26      338
15.5 – 20.5    3              18             54      972
20.5 – 25.5    5              23             115     2645
25.5 – 30.5    4              28             112     3136
30.5 – 35.5    3              33             99      3267
35.5 – 40.5    2              38             76      2888

n = 20
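The slide's formula is missing from the extracted text, so this sketch assumes the common shortcut formula for grouped data, s² = [n·Σ(f·Xm²) − (Σ f·Xm)²] / [n(n − 1)], applied to the midpoints and frequencies in the table.

```python
from math import sqrt

# (midpoint Xm, frequency f) for each class in the table
classes = [(8, 1), (13, 2), (18, 3), (23, 5), (28, 4), (33, 3), (38, 2)]

n = sum(f for _, f in classes)
sum_f_xm = sum(f * xm for xm, f in classes)        # column f * Xm
sum_f_xm2 = sum(f * xm ** 2 for xm, f in classes)  # column f * Xm^2

# Assumed shortcut formula for the sample variance of grouped data
grouped_variance = (n * sum_f_xm2 - sum_f_xm ** 2) / (n * (n - 1))
grouped_sd = sqrt(grouped_variance)

print(n, sum_f_xm, sum_f_xm2)      # 20 490 13310
print(round(grouped_variance, 1))  # 68.7
print(round(grouped_sd, 1))        # 8.3
```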
Coefficient of Variation
The mean of the number of sales of
cars over a 3-month period is 87,
and the standard deviation is 5. The
mean of the commissions is $5225,
and the standard deviation is $773.
Compare the variation of the two.
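A minimal sketch of the comparison, assuming the usual definition of the coefficient of variation as the standard deviation divided by the mean, expressed as a percent. The function name is illustrative.

```python
def coefficient_of_variation(std_dev, mean):
    """Return the standard deviation as a percentage of the mean."""
    return std_dev / mean * 100


cv_sales = coefficient_of_variation(5, 87)
cv_commissions = coefficient_of_variation(773, 5225)

print(round(cv_sales, 1))        # 5.7
print(round(cv_commissions, 1))  # 14.8
```

Because the coefficient of variation is unit-free, the two results can be compared directly: under this definition the commissions are more variable than the sales.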
Measures of Position
Used to locate the relative position of a
data value in the data set. For example, if
a value is located at the 80th percentile,
it means that 80% of the values fall below
it in the distribution and 20% of the values
fall above it. The median is the 50th
percentile, since one half of the values
fall below it and one half of the values fall
above it.
Measures of Position
To measure the relative position of the
data value in the data set:

• Standard scores
• Quartiles
• Deciles
• Percentiles
There is an old saying that “You cannot
compare apples and oranges”. But with
the use of statistics, it can be done to
some extent. Suppose a student scored
90 on a music test and 45 on an English
exam. Direct comparison of the raw scores is
impossible, since the exams might not be
equivalent in terms of number of questions, value
of each question, and so on. However, a
comparison of a relative standard similar to
both can be made. This comparison uses
the mean and standard deviation and is
called the standard score or z score.
Standard Score or Z Score
Example:

A student scored 65 on a calculus test
that had a mean of 50 and a standard
deviation of 10; she scored 30 on a
history test with a mean of 25 and a
standard deviation of 5. Compare the
relative positions on the two tests.
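The comparison can be sketched as follows, assuming the standard definition z = (value − mean) / standard deviation, since the formula itself is not present in the extracted slide.

```python
def z_score(value, mean, std_dev):
    """Number of standard deviations that value lies above the mean."""
    return (value - mean) / std_dev


z_calculus = z_score(65, 50, 10)
z_history = z_score(30, 25, 5)

print(z_calculus)  # 1.5
print(z_history)   # 1.0
```

The calculus score is 1.5 standard deviations above its mean and the history score is 1.0, so the relative position on the calculus test is higher.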
Finding the Data Value
Corresponding to a Z Score
Percentile
Percentiles are position measures used in
educational and health-related fields to
indicate the position of an individual in a
group.

Percentiles divide the data set into 100
equal groups.
Finding a Value Corresponding to a
given Percentile
Quartiles
Procedure Table
Deciles
Exploratory Data Analysis
In traditional statistics, data are organized by
using a frequency distribution. From this
distribution various graphs such as the
histogram, frequency polygon, and ogive can
be constructed to determine the shape and
nature of the distribution. In addition, various
statistics such as mean and standard
deviation can be computed to summarize the
data.
Box Plot
Box Plot or Box-and-Whisker Plot
[Figure: box plot with minimum 30, Q1 = 47, median 83.5, Q3 = 164, maximum 296]
A dietitian is interested in comparing the
sodium content of real cheese with the
sodium content of a cheese substitute.
The data of the two random samples are
shown. Compare the distributions using
boxplots.
Real Cheese
310 420 45 40
220 240 180 90

Cheese Substitute
270 180 250 290
130 260 340 310
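Real Cheese: minimum 40, Q1 = 67.5, median 200, Q3 = 275, maximum 420

Cheese Substitute: minimum 130, Q1 = 215, median 265, Q3 = 300, maximum 340

[Figure: box plots of the two samples on a common axis from 0 to 500]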
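The five values behind each box plot can be sketched as below. The quartile convention is an assumption: Q1 and Q3 are taken as the medians of the lower and upper halves of the sorted data, leaving out the middle value when the count is odd. Other software uses other conventions and can give slightly different quartiles.

```python
from statistics import median


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) using median-of-halves quartiles."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]
    # Skip the middle value itself when the number of values is odd.
    upper = data[half + 1:] if len(data) % 2 else data[half:]
    return (data[0], median(lower), median(data), median(upper), data[-1])


real_cheese = [310, 420, 45, 40, 220, 240, 180, 90]
substitute = [270, 180, 250, 290, 130, 260, 340, 310]

print(five_number_summary(real_cheese))  # (40, 67.5, 200.0, 275.0, 420)
print(five_number_summary(substitute))   # (130, 215.0, 265.0, 300.0, 340)
```

Under this convention the real cheese sample has the wider box (Q3 − Q1 = 207.5 versus 85), so its sodium content is more spread out than that of the substitute.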


Example:
Construct a box-and-whisker plot
for
43 37 42 40 53 62 36 32
50 49 26 53 73 48 45 39
45 48 40 56 41 36 58 42
39
• Solution:
1. Arrange the data from lowest to
highest
1. 26    2. 32    3. 36    4. 36    5. 37
6. 39    7. 39    8. 40    9. 40    10. 41
11. 42   12. 42   13. 43   14. 45   15. 45
16. 48   17. 48   18. 49   19. 50   20. 53
21. 53   22. 56   23. 58   24. 62   25. 73

[Figure: box plot with minimum 26, Q1 = 39, median 43, Q3 = 51.5, and maximum 73, drawn on an axis from 20 to 80]
Stem and Leaf
• A stem & leaf plot organizes data points by
the place value of the leading digits. When
making a stem & leaf plot, each item of data
is separated into two parts. The “stems”
usually consist of the digits in the greatest
common place value of each item of
data. The “leaves” contain the other digits
of each item of data.


1. For example, suppose you're given the
data set 35, 37, 23, 24, 27, 31, 33, 49,
34, 35, 41.

The tens digits would be the "stems", and
the ones digits would be the "leaves". The
smallest tens digit is 2, and the greatest
is 4; write them vertically on the left.
Then, for each stem, write the leaves in
increasing order on the right.
2    3 4 7
3    1 3 4 5 5 7
4    1 9
2. Consider 65, 72, 96, 86, 43, 61, 75, 86,
49, 68, 98, 74, 84, 78, 85, 75,
86, 73
Stems Leaves

4 3 9
5
6 1 5 8
7 2 3 4 5 5 8
8 4 5 6 6 6
9 6 8
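The construction used in example 2 can be sketched in a few lines. The sketch assumes two-digit data, so the tens digit is the stem and the ones digit is the leaf; the function name is illustrative.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Map each stem (tens digit) to its sorted leaves (ones digits)."""
    groups = defaultdict(list)
    for value in sorted(values):
        groups[value // 10].append(value % 10)
    # Keep stems that have no leaves so gaps in the data stay visible.
    return {stem: groups.get(stem, []) for stem in range(min(groups), max(groups) + 1)}


data = [65, 72, 96, 86, 43, 61, 75, 86, 49, 68, 98, 74, 84, 78, 85, 75, 86, 73]
plot = stem_and_leaf(data)

for stem, leaves in plot.items():
    print(stem, " ".join(str(leaf) for leaf in leaves))
```

Printing the result reproduces the rows of the plot above, including the empty stem 5.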
3. The number of stories in selected
samples of tall buildings in Atlanta and
Philadelphia is shown. Construct a
back-to-back stem and leaf plot, and compare
the distributions.

Atlanta              Philadelphia
55 70 44 36 40       61 40 38 32 30
63 40 44 34 38       58 40 40 25 30
60 47 52 32 32       54 40 36 30 30
50 53 32 28 31       53 39 16 34 33
52 32 34 32 50       50 38 36 39 32
26 29
            Atlanta    Stem    Philadelphia
                       1       6
              9 8 6    2       5
8 6 4 4 2 2 2 2 2 1    3       0 0 0 0 2 2 3 4 6 6 8 8 9 9
          7 4 4 0 0    4       0 0 0 0
        5 3 2 2 0 0    5       0 3 4 8
                3 0    6       1
                  0    7
The buildings in Atlanta have a large
variation in the number of stories per
building. Although both distributions peak
in the 30-to-39-story class,
Philadelphia has more buildings in this
class. Atlanta has more buildings that
have 40 or more stories than Philadelphia
does.
Another important point to remember is
that the summary statistics (median and
interquartile range) used in exploratory
data analysis are said to be resistant
statistics. A resistant statistic is relatively
less affected by outliers than a nonresistant
statistic. The mean and standard deviation
are nonresistant statistics. When a distribution
is skewed or contains outliers, the median and
interquartile range may therefore summarize the
data more accurately than the mean and standard
deviation.
Traditional versus EDA Techniques
Traditional               Exploratory Data Analysis
Frequency Distribution    Stem and Leaf Plot
Histogram                 Boxplot
Mean                      Median
Standard Deviation        Interquartile Range
What is normal?
Medical researchers have determined
so-called normal intervals for a person's
blood pressure, cholesterol, triglycerides,
glucose, etc. For example, the normal
range of systolic blood pressure is 110 to
140, cholesterol should be 130 to 200,
triglycerides 60 to 130, and glucose
60 to 100.
By measuring these variables, a physician
can determine whether the patient's vital
statistics are within the normal interval or
whether some type of treatment is needed to
correct a condition and avoid future
illnesses.
Random variables can be either
discrete or continuous.

Discrete variable:
number of people, number of phone
calls, outcome of rolling a die, etc.

Continuous variable:
temperature, a person's height,
time, a person's weight, etc.
Many continuous variables
have distributions that are
bell-shaped, and these are called
approximately normally
distributed variables.
The Normal Distribution

A normal distribution forms a bell-shaped
curve that is symmetric about a vertical
line through the mean of the data.
Normal Distribution
Properties of Normal Distribution
Every normal distribution has the
following properties:
• A normal distribution curve is bell -
shaped.
• The mean, median, mode are
equal and are located at the center
of the distribution.
• A normal distribution is unimodal.
• The curve is symmetric about the
mean.
• The curve is continuous, that is,
there are no gaps or holes.
• The curve never touches the x
axis.
• The total area under the normal
distribution curve is equal to 1
or 100%
Empirical Rule for a Normal
Distribution
In a normal distribution
• 68.26% of the data lie within 1
standard deviation of the
mean
• 95.44% of the data lie within 2
standard deviations of the mean
• 99.74% of the data lie within 3
standard deviations of the mean
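These percentages can be checked numerically. The sketch below relies on the identity P(|Z| < k) = erf(k / √2) for a standard normal variable Z; the helper name is illustrative. Percentages built from a four-decimal z table can differ from these values in the last digit.

```python
from math import erf, sqrt


def area_within(k):
    """Area under the standard normal curve within k standard deviations of the mean."""
    return erf(k / sqrt(2))


for k in (1, 2, 3):
    print(k, f"{100 * area_within(k):.2f}%")
# 1 68.27%
# 2 95.45%
# 3 99.73%
```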
Common Distribution Shapes
When the majority of the data
values fall to the right of the
mean, the distribution is said to
be negatively skewed or left-
skewed distribution. The mean is
to the left of the median, and the
mean and the median are to the
left of the mode.
A distribution is negatively
skewed if the scores fall
toward the higher side of the
scale and there are few low
scores.
When the majority of the data values fall to
the left of the mean, a distribution is said
to be a positively skewed or right-skewed
distribution. The mean falls to the right of
the median, and both the mean and median
fall to the right of the mode.
In positively skewed distributions,
the mean is usually greater than
the median, which is usually
greater than the mode.
A survey of 1000 US gas stations found
that the price charged for a gallon of
regular gas could closely be
approximated by a normal distribution
with a mean of $ 3.10 and a standard
deviation of $0.18. How many of the
stations charge
a. Between $2.74 and $ 3.46 for a gallon
of regular gas?
b. Less than $3.28 for a gallon of regular
gas?
c. More than $ 3.46 for a gallon of regular
gas?
Solution:
a. i) The $2.74 per gallon price is 2
standard deviations below the mean. The
$3.46 price is 2 standard deviations
above the mean.
ii) In a normal distribution, 95% of all data
lie within 2 standard deviations of the
mean
Therefore, approximately
(95%)(1000) = (0.95)(1000)
= 950
Meaning 950 of the stations charge between
$ 2.74 and $ 3.46 for a gallon of regular gas.
b. i) The $ 3.28 price is 1 standard
deviation above the mean.
ii) In a normal distribution, 34% of all
data lie between the mean and 1 standard
deviation above the mean.
Thus approximately
(34%)(1000) = (0.34)(1000)
= 340
Therefore 340 of the stations charge
between $3.10 and $3.28 for a gallon of
regular gasoline. Half of 1000, or 500,
stations charge less than the mean.
Therefore about 340 + 500 = 840 of the stations
charge less than $3.28 for a gallon of
regular gas.
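The arithmetic in parts a and b, plus part c (whose worked solution is not shown above), can be sketched with the same rounded empirical-rule percentages the solution uses: 34% between the mean and 1 standard deviation, and 95% within 2 standard deviations. The part c result is therefore an approximation under those rounded figures, not a value taken from the slides.

```python
stations = 1000

pct_within_2sd = 95    # percent within 2 standard deviations of the mean
pct_mean_to_1sd = 34   # percent between the mean and 1 sd above it
pct_below_mean = 50    # percent below the mean, by symmetry

# a. Between $2.74 and $3.46 (within 2 sd of the mean)
count_a = stations * pct_within_2sd / 100

# b. Less than $3.28 (everything below the mean plus mean to +1 sd)
count_b = stations * (pct_below_mean + pct_mean_to_1sd) / 100

# c. More than $3.46: half of what lies outside 2 sd, by symmetry
count_c = stations * ((100 - pct_within_2sd) / 2) / 100

print(count_a, count_b, count_c)  # 950.0 840.0 25.0
```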
2. A vegetable distributor knows that
during the month of August, the weights
of its tomatoes are normally distributed
with a mean of 0.61 lb and a standard
deviation of 0.15 lb.
a. What percent of the tomatoes weigh
less than 0.76 lb?
b. In a shipment of 6000 tomatoes, how
many tomatoes can be expected to
weigh more than 0.31 lb?
c. In a shipment of 4500 tomatoes, how
many tomatoes can be expected to
weigh from 0.31 lb to 0.91 lb?
a. i) 0.76 lb is 1 standard deviation above
the mean.

ii) In a normal distribution, 34% of all
data lie between the mean and 1
standard deviation above the mean, and 50%
of all data lie below the mean.

iii) Thus 34% + 50% = 84% of the
tomatoes weigh less than 0.76 lb.
b. i) 0.31 lb is 2 standard deviations below
the mean of 0.61 lb.

ii) In a normal distribution, 47.5% of
the data lie between the mean and 2
standard deviations below the mean, and
50% of all data lie above the mean.

iii) This gives a total of 47.5% + 50% = 97.5%
of the tomatoes that weigh more than 0.31 lb.
Therefore (97.5%)(6000) = 5850
of the tomatoes can be expected to
weigh more than 0.31 lb.
c. i) 0.31 lb is 2 standard deviations below the
mean of 0.61 lb, and 0.91 lb is 2 standard
deviations above the mean.

ii) In a normal distribution, 95% of all data
lie within 2 standard deviations of
the mean.

iii) Therefore
(95%)(4500) = 4275
of the tomatoes can be expected to
weigh from 0.31 lb to 0.91 lb.
The Standard Normal Distribution
A table is used in order to determine the
approximate areas of the standard normal
distribution between the mean 0 and z
standard deviations from the mean.

In a standard normal distribution, the area
of the distribution from z = a to z = b
represents:
• The percentage of z-values that lie in
the interval from a to b.
• The probability that z lies in the interval
from a to b.
Finding Area under the standard
Normal distribution Curve

1. Draw the normal distribution curve and
shade the area.

2. Find the appropriate figure in the
procedure table and follow the directions.
Finding the Area Under the
Standard Normal Distribution Curve
1. To the left of any z value:
Look up z value in the table and use the area
given.
2. To the right of any z value:
Look up the z value and subtract from 1.
3. Between two z values :
Look up z values and subtract corresponding
z areas
Example:
1. Find the area to the left of z = 2.06.
The table gives 0.9803. Hence, 98.03% of the area
is to the left of z = 2.06.

2. Find the area to the right of z = -1.19.
The table gives 0.1170 for the area to the left of
z = -1.19. Subtract it from 1:
1.0000 - 0.1170 = 0.8830. Therefore
88.30% of the area under the standard
normal distribution curve is to the right of
z = -1.19.
3. Find the area between z = 1.68 and z = -1.37.
Since the desired area is between two
z values, look up the areas corresponding
to the two z values and subtract the smaller
from the larger.
z = 1.68 is 0.9535
z = -1.37 is 0.0853

Therefore the area between the two values
is 0.9535 - 0.0853 = 0.8682
= 86.82%
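The three lookups can be reproduced without a printed table. The sketch uses `statistics.NormalDist` from the Python standard library (3.8 and later), whose `cdf` method returns the area to the left of a value; a default `NormalDist()` is the standard normal distribution.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, standard deviation 1

area_left = Z.cdf(2.06)                    # 1. to the left of z = 2.06
area_right = 1 - Z.cdf(-1.19)              # 2. to the right of z = -1.19
area_between = Z.cdf(1.68) - Z.cdf(-1.37)  # 3. between the two z values

print(f"{area_left:.4f} {area_right:.4f} {area_between:.4f}")
# 0.9803 0.8830 0.8682
```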
A soda machine dispenses soda into 12-ounce
cups. Tests show that the actual amount of soda
dispensed is normally distributed with a mean of
11.5 oz and a standard deviation of 0.2 oz.

a. What percent of the cups receive less than
11.25 oz of soda?
b. What percent of the cups will receive between 11.2
oz and 11.55 oz of soda?
c. If a cup is chosen at random, what is the
probability that the machine will overflow the
cup?
A study of the careers of professional
football players shows that the lengths of their
careers are nearly normally distributed
with a mean of 6.1 years and a standard
deviation of 1.8 years.

a. What percent of professional
football players have a career of more than
9 years?

b. If a professional football player is chosen
at random, what is the probability that the
player will have a career between 3 and 4
years?
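Both exercises above follow the same pattern: convert each value to a z score, then find the area to the left, to the right, or between. A hedged sketch using `statistics.NormalDist`, which performs the conversion internally; it assumes a cup overflows when more than 12 oz is dispensed. Answers obtained from a printed table with z rounded to two decimals may differ slightly in the last digit.

```python
from statistics import NormalDist

# Soda machine: mean 11.5 oz, standard deviation 0.2 oz
soda = NormalDist(mu=11.5, sigma=0.2)
soda_a = soda.cdf(11.25)                   # less than 11.25 oz
soda_b = soda.cdf(11.55) - soda.cdf(11.2)  # between 11.2 oz and 11.55 oz
soda_c = 1 - soda.cdf(12)                  # more than 12 oz (assumed overflow)

# Football careers: mean 6.1 years, standard deviation 1.8 years
career = NormalDist(mu=6.1, sigma=1.8)
career_a = 1 - career.cdf(9)               # more than 9 years
career_b = career.cdf(4) - career.cdf(3)   # between 3 and 4 years

for label, p in [("soda a", soda_a), ("soda b", soda_b), ("soda c", soda_c),
                 ("career a", career_a), ("career b", career_b)]:
    print(f"{label}: {100 * p:.2f}%")
```

Under these assumptions the soda answers come out to roughly 10.6%, 53.2%, and 0.6%.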
Linear Regression and Correlation
• In many applications, scientists try to
determine whether two variables are
related. If they are related, the scientists
then try to find an equation that can be
used to model the relationship.
• For instance, a zoology professor, Prof.
R. Alexander, wanted to determine
whether the stride length of a dinosaur,
as shown by its fossilized footprints,
could be used to estimate the speed of
the dinosaur.
• The stride length of an animal is the
distance from one footprint to the next.
• Because no dinosaurs were available, the
professor carried out the experiments with
many kinds of animals, including adult men,
dogs, and camels.
a. Adult Men
Stride length (m)    2.5  3.0  3.3  3.5  3.8  4.0  4.2
Speed (m/s)          3.4  4.9  5.5  6.6  7.0  7.7  8.3

b. Dogs
Stride length (m)    1.2  1.7  2.0  2.4  2.7  3.0  3.2
Speed (m/s)          3.7  4.4  4.8  7.1  7.7  9.1  8.8

c. Camels
Stride length (m)    2.5  3.0  3.2  3.4  3.5  3.8  4.0
Speed (m/s)          2.3  3.9  4.4  5.0  5.5  6.2  7.1
After a relationship between paired data (which are
referred to as bivariate data) has been
discovered, the scientists try to model the
relationship with an equation. One method of
determining a linear relationship is called linear
regression.

The Least-Squares Regression Line

The least-squares regression line for a set of bivariate data is the line that
minimizes the sum of the squares of the vertical deviations from each data
point to the line.
Example:
1. Find the equation of the least-squares line for
the ordered pairs: (2.5, 3.4), (3.0, 4.9),
(3.3, 5.5), (3.5, 6.6), (3.8, 7.0), (4.0, 7.7),
(4.2, 8.3), (4.5, 8.7)
x      y      x²       xy
2.5    3.4    6.25     8.50
3.0    4.9    9.00     14.70
3.3    5.5    10.89    18.15
3.5    6.6    12.25    23.10
3.8    7.0    14.44    26.60
4.0    7.7    16.00    30.80
4.2    8.3    17.64    34.86
4.5    8.7    20.25    39.15
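The slide's formulas are not in the extracted text, so this sketch assumes the usual least-squares formulas for a line ŷ = a·x + b: a = [n·Σxy − Σx·Σy] / [n·Σx² − (Σx)²] and b = ȳ − a·x̄. The function name and the use of a, b for slope and intercept are illustrative.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(x * x for x in xs)               # column x^2
    sum_xy = sum(x * y for x, y in zip(xs, ys))   # column xy
    slope = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    intercept = (sum_y - slope * sum_x) / n       # same as y-bar - slope * x-bar
    return slope, intercept


stride = [2.5, 3.0, 3.3, 3.5, 3.8, 4.0, 4.2, 4.5]
speed = [3.4, 4.9, 5.5, 6.6, 7.0, 7.7, 8.3, 8.7]

slope, intercept = least_squares_line(stride, speed)
print(f"y = {slope:.4f}x + ({intercept:.4f})")
```

With the column sums from the table (Σx = 28.8, Σy = 52.1, Σx² = 106.72, Σxy = 195.86), the slope is about 2.73 and the intercept about −3.32.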
Car Rental Companies:

Company    Number of Cars        Revenue
           (in ten thousands)    (in billions of dollars)
A          63.0                  7.0
B          29.0                  3.9
C          20.8                  2.1
D          19.1                  2.8
E          13.4                  1.4
F          8.5                   1.5
Find the equation of the regression line and
graph the line on the scatter plot of the data.
Company    Cars x                Revenue y              xy        x²         y²
           (in ten thousands)    (in billions of $)
A          63.0                  7.0                    441.00    3969.00    49.00
B          29.0                  3.9                    113.10    841.00     15.21
C          20.8                  2.1                    43.68     432.64     4.41
D          19.1                  2.8                    53.48     364.81     7.84
E          13.4                  1.4                    18.76     179.56     1.96
F          8.5                   1.5                    12.75     72.25      2.25
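The three product columns in the table are what the usual computational formulas need. The sketch below assumes the standard Pearson formula r = [n·Σxy − Σx·Σy] / √{[n·Σx² − (Σx)²]·[n·Σy² − (Σy)²]} together with the least-squares slope and intercept; the slide's own formulas are not in the extracted text.

```python
from math import sqrt

cars = [63.0, 29.0, 20.8, 19.1, 13.4, 8.5]   # x, in ten thousands
revenue = [7.0, 3.9, 2.1, 2.8, 1.4, 1.5]     # y, in billions of dollars

n = len(cars)
sum_x, sum_y = sum(cars), sum(revenue)
sum_xy = sum(x * y for x, y in zip(cars, revenue))
sum_x2 = sum(x * x for x in cars)
sum_y2 = sum(y * y for y in revenue)

# Assumed formulas: Pearson r and the least-squares line y' = a + b*x
numerator = n * sum_xy - sum_x * sum_y
r = numerator / sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
b = numerator / (n * sum_x2 - sum_x ** 2)   # slope
a = (sum_y - b * sum_x) / n                 # intercept

print(f"r = {r:.3f}")
print(f"y' = {a:.3f} + {b:.3f}x")
```

Under these formulas r is about 0.982, a strong positive linear relationship, and the line is roughly y' = 0.396 + 0.106x.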


Linear Correlation Coefficient
Scatter Diagrams
