StatPad™
[Figure: Amazon Revenue ($billions) versus Time, 2004 to 2014, showing Revenue, Trend, and Forecast]
Table of Contents
What is StatPad?
How to Install StatPad
How to Use StatPad
Overview of StatPad Features
One-Sample Analysis
    Summaries
    Histogram
    Histogram (With Customized Bin Width and Landmark)
    Box Plot
    Cumulative Distribution
    Confidence Interval
    Confidence Interval (One-Sided, 99%)
    Hypothesis Test
    Hypothesis Test (One-Sided)
    Percentile
    Percentile Ranking
Sampling
    Random Sample Without Replacement
    Random Sample With Replacement
    Uniform Distribution
    Normal Distribution
    Binomial Distribution
    Binomial Percentages
Probability Calculations
    Normal Probability (Greater Than)
    Normal Probability (Between)
    Binomial Probability (Equal to)
    Binomial Probability (This or Less)
    Binomial Percent (Equal to)
    Binomial Percent (Between)
    Poisson Probability (Equal to)
    Poisson Probability (This or Less)
    Exponential Probability (This or More)
    Exponential Probability (Between)
    Discrete Probability
Two-Sample Analysis
    Summaries
    Histograms
    Box Plots
    Confidence Interval
    Hypothesis Test
What is StatPad?
Welcome to StatPad, a software system designed for people who wish to perform statistical
analysis within their Microsoft Excel spreadsheets. StatPad was designed to make
statistical analysis as accessible, painless, and easy to understand as possible by bringing basic
statistical analysis and its interpretation into the environment where business and other data are
often found: an Excel spreadsheet. Whenever possible, the analysis is guided by
choices from a dialog box that adapts itself automatically to your situation. The results,
consisting of charts, explanatory text, and computations, then become part of your worksheet.
StatPad covers the core tasks of basic statistics: design using a random sample, exploration
through graphic representations of data, estimation with summaries and confidence intervals
(both one- and two-sided, at various confidence levels), hypothesis testing, normal and binomial
probability calculations, multiple regression analysis, trend-seasonal time series analysis, and
statistical quality control charts.
Here's how to get started if you are in a hurry: after you open the file STATPAD.XLA, you will
find StatPad listed under Excel's Add-Ins Ribbon (or the Tools menu in older versions of Excel),
ready for you to select. When selected, StatPad greets you with its main dialog box, ready for
analysis.
If you wish to load StatPad manually each time you open Excel:
Either double-click the file STATPAD.XLA or use Excel's File / Open menu command to
open this file from its folder on your computer. Choose Enable Macros if necessary.
The choice StatPad will then be available under Excel's Add-Ins Ribbon (or the Tools menu
in older versions of Excel). StatPad will remain available until you close Excel.
If you need to change Excel's macro security level, you will find this at File / Options /
Trust Center / Trust Center Settings / Add-Ins.
4. Select a situation from the list near the top left (One Sample, Sampling, Probability, Two
Sample, Many Sample, Bivariate, Multivariate, Time Series, or Quality Control).
5. Select the analysis you want from the list near the top right. Note that this analysis list
changes automatically for you, depending on the situation you choose. For a One Sample
situation, the analysis choices are Summaries, Histogram, etc. But if you select
Probability instead, the analysis choices instantly change to Normal Probability, Binomial
Probability, and Binomial Percent.
6. Give StatPad the additional information it needs. StatPad will automatically change to
show you what is needed, so you may fill in the blanks as they appear. For One Sample,
Summaries, you need to give StatPad a data set name and an output range. For One
Sample, Confidence Interval, an edit box will also appear automatically so that you can
tell StatPad which confidence level you wish (you may also choose a one-sided
interval). Here's how the main dialog box changes:
For a multiple regression analysis, StatPad's main dialog changes again (automatically!),
allowing you to select the X variables (for example, income, percent male, and
readership) to use to explain the Y variable (for example, the cost of a full-page color
magazine ad).
3 Here's a quick way to find out the name (if any) associated with a list of numbers. Highlight the list (drag with the
mouse), then look for the name in Excel's Name Box near the top left corner of the worksheet. StatPad limits the size
of each list to a maximum of 65,000 numbers.
4 If you have used Excel to name a column of numbers (e.g., with Excel's Insert / Name / Define menu items), this name
will appear automatically in StatPad's list. When you name a column of numbers within StatPad, this name also
becomes an Excel range name for your data. Names can be deleted using Excel's Insert / Name / Define / Delete menu
items.
Here's how the screen might look after you (1) click in the Range box of the above
dialog box, (2) highlight your data in the worksheet, (3) click in the Name box of the
dialog box, and (4) type in the name (Prices for this example, but please don't use
spaces or special characters):
b. Alternatively, you may type a name for the data set into the edit box in
StatPad's main dialog box, even if that name is not proposed for you. This can be
done whenever only one data set can be used for the chosen situation (but please don't
use spaces or special characters in the name). Once you hit Enter, click on Do It, or
double-click to begin the analysis, StatPad will ask you to select the column of
numbers you want, using the following dialog box. After this, the name will
automatically show up in StatPad's data set lists.
9. Use the Output Range box at the lower right of the main dialog box to tell StatPad where
to put the results.
a. If you've asked for a chart:
i. If you provide a single cell as the Output Range, then StatPad will place a chart of
the default size with upper-left corner at this cell.
ii. If you provide a rectangular range of cells as the Output Range, then StatPad will
make the chart the same size as your range.
b. If you've asked for numbers and text:
i. If there is enough room without erasing any of your data, StatPad will place the
upper-left cell of the output at the Output Range you specified.
ii. If your results would overwrite any of your data, StatPad will give you the option
of either specifying a different Output Range, or (use caution!) going ahead and
erasing some of your data to make room for the results if you wish.
10. After StatPad performs the analysis you requested (or asks for clarification, if needed),
you will again find the StatPad main dialog box on your screen, ready for further analysis.
You may either continue your analysis with StatPad, or leave StatPad (select Cancel or
hit the Esc key) to return control to Excel and your worksheet.
11. Because StatPad's results are part of your Excel spreadsheet, you can format them after
leaving the StatPad dialog box by hitting the Esc key or selecting Cancel.
a. You can select and format numbers in individual cells as you ordinarily would in
Excel (for example, using the Number Group of the Home Ribbon). For example, you
can format with dollar signs, set the number of decimal places, format as percentages,
etc.
b. You can customize StatPad's charts as you would any Excel chart. For example,
you might select the chart and then use the Chart Tools Ribbons (Design, Layout, and
Format) at the top of the Excel window. Another method would be to double-click the
part of the chart you wish to change, for example the x axis, to bring up the relevant
formatting options. You might then choose to set the scale under Axis Options (e.g., to
change the minimum and/or maximum) or select Number (e.g., to change the number
formatting).
12. You can copy StatPad's results to your word processor, after leaving the StatPad dialog
box by hitting the Esc key or selecting Cancel.
a. To copy text and numbers to your word processor, proceed as follows:
i. Highlight your cell(s) and choose Copy from the Clipboard Group of the Home
Ribbon.
ii. Activate your word processor and move the cursor to where you want the results
to go.
iii. Depending upon your word processor, you may wish to paste as unformatted text.
The text then becomes part of the text document and you may format it as you
like. For example, with Microsoft Word 2010, you might click on the word
"Paste" in the Clipboard Group of the Home Ribbon, then choose Paste Special
from the Paste Options, to obtain the Unformatted Text choice.
b. To copy charts to your word processor, proceed as follows:
i. Click on the edge of a chart (just one at a time) to select it, then choose Copy from
the Clipboard Group of Excel's Home Ribbon.
ii. Activate your word processor and move the cursor to where you want the chart to
go.
iii. Depending upon your word processor, you may wish to paste as a Picture
(instead of as an Excel object). The chart then becomes part of the text document
and you can place and size it using your word processor's
commands. For example, with Microsoft Word 2010, you might click on the
word "Paste" in the Clipboard Group of the Home Ribbon, then choose Paste
Special from the Paste Options, to obtain the Picture (Enhanced Metafile)
choice.
13. For more information about statistical analysis, its applications and interpretation, please
consult a book such as Practical Business Statistics by Andrew F. Siegel (Elsevier /
Academic Press, sixth edition, 2012).
One Sample
Summaries
Compute statistical summaries for the data: count, average or mean, median,
smallest, largest, quartiles, standard deviation, and standard error.
Histogram
Draw a histogram to explore the data, showing the shape of the distribution,
typical values, variability, and outliers. Data are concentrated where the
histogram bars are high. Check 'Customize' to specify optional bin width and
landmark point.
Box Plot
Draw a box plot to explore the data, showing the 5-number summary (smallest,
lower quartile, median, upper quartile, and largest). In the ordinary box plot, a
line extends from the box on each side to the most extreme value. Check
'Detailed box plot' to indicate outliers separately and have the lines extend from
the box on each side to the most extreme value (adjacent value) that is not an
outlier.
Cumulative Distribution
Draw a cumulative distribution function for the data, showing the percentage of
data values less than each given number. This shows you the percentiles.
Confidence Interval
Hypothesis Test
Test the null hypothesis that the population mean is equal to a given reference
value. This is statistical inference about the population, based on random
sampling. Two-sided or one-sided testing (Student's t test) is used.
Percentile
Given a percentage, find the percentile value. This data value has approximately
this percentage of the data values smaller than it.
Percentile Ranking
Find the percentage ranking for a given value. This is the approximate
percentage of data values that are less than the given value.
Sampling
Sample Without Replacement
Sample With Replacement
Uniform Distribution
Select a random sample from a uniform distribution, where all values are
equally likely between the smallest and largest possible value. By specifying a
name, you will be able to easily use the result later.
Normal Distribution
Select a random sample from a normal distribution, given the mean and
standard deviation. By specifying a name, you will be able to easily use the
result later.
Binomial Distribution
Binomial Percentages
Select a random sample of binomial percentages, given the number of trials and
the probability of occurrence. By specifying a name, you will be able to easily
use the result later.
Probability
Normal Probability
Binomial Probability
Binomial Percent
Poisson Probability
Exponential Probability
Discrete Probability
Mean (expected value) and standard deviation for a discrete random variable,
given a set of values and their associated probabilities.
Two Samples
Summaries
Compute univariate summaries for each data set. Also find the average
difference and its standard error. If sample sizes are identical, you may indicate
that a pair of measurements was made on each item.
Histograms
Box Plots
Draw a box plot for each data set, for data exploration, using the same scale for
comparison. In the ordinary box plot, a line extends from the box on each side
to the most extreme value. Check 'Detailed box plot' to indicate outliers
separately and have the lines extend from the box on each side to the most
extreme value (adjacent value) that is not an outlier.
Confidence Interval
Hypothesis Test
Test the null hypothesis that the population mean difference is zero. This is
statistical inference. Two-sided testing using Student's t test. If sample sizes are
identical, you may indicate that a pair of measurements was made on each item.
Many Samples
Summaries
Select as many data sets as you wish. Compute univariate summaries for each.
Histograms
Box Plots
Draw a box plot for each sample, for data exploration, using the same scale for
comparison. In the ordinary box plot, a line extends from the box on each side
to the most extreme value. Check 'Detailed box plot' to indicate outliers
separately and have the lines extend from the box on each side to the most
extreme value (adjacent value) that is not an outlier.
F Test
One-way analysis of variance (ANOVA). Test the null hypothesis that the
population means are all identical. This is statistical inference.
Mean Differences
Confidence intervals and hypothesis tests for the difference of each pair of
population means (least-significant-difference test). This is statistical inference.
Bivariate
Scatterplot
Scatterplot with Line
Correlation
Find the strength of the relationship between two variables as a pure number
where 1 indicates a perfect increasing relationship, -1 a perfect decreasing
relationship, and 0 suggesting no relationship.
Correlation with Test
Find and test the strength of the relationship between two variables. This is
statistical inference.
Regression
Predicted and Residuals
Predicted values of Y based on X, the residual difference (Actual Y - Predicted Y),
and the standardized residuals.
Univariate Summaries
Histograms
Box Plots
Multivariate
Scatterplots
Select as many X variables as you wish, but just one Y variable. Draw
scatterplots for all pairs of variables to explore their relationships.
Correlations
Regression
Predicted and Residuals
Predicted values of Y based on the X variables, the residual differences (Actual Y -
Predicted Y), and the standardized residuals.
Diagnostic Plot
Look for problems in the regression linear model, such as unequal variability or
nonlinearity.
Univariate Summaries
Histograms
Box Plots
Draw a box plot of each variable, for data exploration. In the ordinary box plot,
a line extends from the box on each side to the most extreme value. Check
'Detailed box plot' to indicate outliers separately and have the lines extend from
the box on each side to the most extreme value (adjacent value) that is not an
outlier.
Time Series
Trend-Seasonal
Quality Control
X-Bar, R Charts
Chart the averages and the ranges of your data to see if this process is in or out
of control. Choose a subgroup size from 2 to 25. You may specify a standard if
one is available.
Pct, Count Chart
Chart the percents or counts to see if this process is in or out of control. Your
data may be either counts or percentages (counts divided by the sample size).
You may specify a standard if one is available.
One-Sample Analysis
Summaries
Summaries are used to give you selected
numbers that represent and describe your
data set.
StatPad's summaries (below) for Quality
scores show how many data values there are
(n = 50), typically how high the scores are
($\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i = 90.78$), and typically how far
individual scores are
($S = \sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2 / (n - 1)} = 7.56$) from
the average.

Summaries                                                          Quality
Count n                                                            50
Mean or average                                                    90.78
Standard deviation (variability of individuals)                    7.56
Smallest                                                           72
Lower quartile                                                     86
Median                                                             93
Upper quartile                                                     97
Largest                                                            99
Standard error (variability of sample average, if random sample)   1.069
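
As a rough illustration of how these summaries relate to one another, here is a short Python sketch. It is not part of StatPad; the function name `summaries` and the small data list are hypothetical, and quartiles are left out because this guide does not say which quartile rule StatPad uses. The standard error is assumed to follow the usual formula S / sqrt(n).

```python
import math
import statistics

def summaries(data):
    """Basic one-sample summaries (hypothetical helper, not StatPad's code)."""
    n = len(data)
    std_dev = statistics.stdev(data)  # sample standard deviation, divides by n - 1
    return {
        "n": n,
        "mean": statistics.mean(data),
        "std_dev": std_dev,
        "smallest": min(data),
        "median": statistics.median(data),
        "largest": max(data),
        "std_error": std_dev / math.sqrt(n),  # assumed formula: S / sqrt(n)
    }

# Made-up data, for illustration only.
print(summaries([2, 4, 4, 4, 5, 5, 7, 9]))

# Check against the Quality table above: S = 7.56 and n = 50.
print(round(7.56 / math.sqrt(50), 3))
```

The last line recomputes the standard error from the reported standard deviation and count, which is a quick way to check that the two rows of the table agree.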
Histogram
The histogram is used to visually explore
a data set. The data axis is horizontal, and
the bars show how many data values are
within each interval. Data are concentrated
where bars are tall. You can see typical
value, variability, and distribution shape.
StatPad's histogram (below) shows that
the Quality scores fall within the interval
from about 70 to 100. They are skewed with a
long tail towards lower values, being more
concentrated in the higher end of the range.
To create a histogram using StatPad's
main dialog box, select One Sample as the
situation and Histogram as the analysis. Select your data from the list (or use Add Data if your
column of numbers is in the worksheet but is not in the list), check the Output Range, and then
select Do It.
StatPad chooses a default bin width and landmark (which could be a left or right endpoint
of the histogram, or any bin boundary) for the histogram bars. These can be changed using the
Customize check-box (see next item). Note that Excel (not StatPad) chooses the minimum and
maximum of the horizontal scale. These may be changed (as was done for the chart below): leave
StatPad by hitting the Esc key or selecting Cancel, then double-click on the axis to find
Minimum and Maximum under Axis Options.
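
To make the roles of the bin width and the landmark more concrete, the following Python sketch builds bin boundaries that all lie a whole number of bin widths away from the landmark, then counts the data in each bin. This is only an illustration under stated assumptions: the function names are hypothetical, the five scores are made up, and the choice to put a value that falls exactly on a boundary into the bin to its right is an assumption, not a documented StatPad rule.

```python
import math

def bin_edges(data, width, landmark):
    """Bin boundaries aligned so that the landmark is one of the boundaries."""
    lo, hi = min(data), max(data)
    # First boundary at or below the smallest value, a whole number of widths from the landmark.
    first = landmark + math.floor((lo - landmark) / width) * width
    n_bins = math.floor((hi - first) / width) + 1
    return [first + k * width for k in range(n_bins + 1)]

def histogram_counts(data, width, landmark):
    """Return (edges, counts); each bin includes its left edge but not its right edge."""
    edges = bin_edges(data, width, landmark)
    counts = [0] * (len(edges) - 1)
    for x in data:
        counts[math.floor((x - edges[0]) / width)] += 1
    return edges, counts

scores = [72, 86, 93, 97, 99]  # made-up values, for illustration only
print(histogram_counts(scores, width=10, landmark=70))
print(histogram_counts(scores, width=10, landmark=75))
```

Moving the landmark from 70 to 75 shifts every boundary by 5 without changing the bin width, which is why the same data can produce a differently shaped histogram.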
[Figures: histograms of the Quality scores; horizontal axis Quality (60 to 100), vertical axis Frequency (0 to 20)]
Box Plot
The box plot is used to quickly and
visually explore a data set; it shows you a
central box defined by the quartiles, with the
median indicated within the box. In the
ordinary box plot, a line extends from the box
on each side to the most extreme value. In the
detailed box plot, outliers are indicated
separately and these lines extend from the
box on each side to the most extreme value
(adjacent value) that is not an outlier.
In StatPad's box plot for Sensitivity
(below left) you see that the middle half of the
data extends from about 60 to 100, with the
median at about 80. The line at the right
extends to the largest value, at about 180.
StatPad's detailed box plot (below right) shows outliers separately, revealing that the
largest value, at about 180, is an outlier.
To display a box plot using StatPad's main dialog box, select One Sample as the situation
and Box Plot as the analysis. Select your data from the list (or use Add Data if your column of
numbers is in the worksheet but is not in the list). Click on Detailed box plot if you wish outliers
to be displayed separately. Then check the Output Range and select Do It.
Outliers are defined as data values more than 1.5 times the interquartile range away from
either quartile.
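
The outlier rule above can be sketched in a few lines of Python. The function names are hypothetical, and the quartiles 60 and 100 are the approximate values read from the Sensitivity box plot described above, so the result is only illustrative.

```python
def outlier_fences(lower_quartile, upper_quartile, multiplier=1.5):
    """Limits beyond which a data value counts as an outlier."""
    iqr = upper_quartile - lower_quartile  # interquartile range
    return lower_quartile - multiplier * iqr, upper_quartile + multiplier * iqr

def is_outlier(value, lower_quartile, upper_quartile):
    """True if value is more than 1.5 interquartile ranges beyond either quartile."""
    low, high = outlier_fences(lower_quartile, upper_quartile)
    return value < low or value > high

# Approximate quartiles for Sensitivity: about 60 and about 100.
print(outlier_fences(60, 100))
print(is_outlier(180, 60, 100))
```

With these approximate quartiles the interquartile range is 40, so anything above about 160 is flagged, which agrees with the detailed box plot marking the value near 180 as an outlier.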
[Figure: ordinary box plot (left) and detailed box plot (right) of Sensitivity; horizontal axis Sensitivity (50 to 200)]
Cumulative Distribution
The cumulative distribution function is
used to show you the percentiles of the data.
Percentages are shown vertically (from 0 to
100%) and data values are horizontal. The
chart shows the percentage of the data values
(vertical scale) that are less than or equal to the
given value (horizontal scale).
In StatPad's cumulative distribution
function for Quality (below) you can see that
about 10% of the Quality scores are less than
or equal to 80, about 25% of the Quality
scores are less than or equal to 85, and that
about a third are scores of 90 or less.
To compute a cumulative distribution function using StatPad's main dialog box, select One
Sample as the situation and Cumulative Distribution as the analysis. Select your data from the
list (or use Add Data if your column of numbers is in the worksheet but is not in the list), check
the Output Range, and then select Do It.
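
For readers who want to see the idea behind the chart, this Python sketch computes the height of the cumulative distribution function at one value: the percentage of data values that are less than or equal to it. The function name and the ten scores are hypothetical and are not the Quality data.

```python
def cumulative_percent(data, value):
    """Percentage of data values that are less than or equal to value."""
    at_or_below = sum(1 for x in data if x <= value)
    return 100.0 * at_or_below / len(data)

scores = [72, 80, 85, 90, 93, 95, 97, 98, 99, 99]  # made-up values
for cutoff in (80, 90, 100):
    print(cutoff, cumulative_percent(scores, cutoff))
```

Evaluating this at every distinct data value and plotting the results as steps gives a chart of the kind described in this section.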
[Figure: cumulative distribution function of Quality; horizontal axis Quality (60 to 100), vertical axis Cumulative Percent (0% to 100%)]
Confidence Interval
A confidence interval for the mean
includes the unknown population mean with
known confidence, e.g., 95%. Random
sampling from a normal population is
assumed.
StatPad's two-sided 95% confidence
interval results for Quality (below) tell you
that the bounds of the interval are 88.63 and
92.93.
To compute a confidence interval using
StatPad's main dialog box, select One
Sample as the situation and Confidence
Interval as the analysis. Select your data
from the list (or use Add Data if your column of numbers is in the worksheet but is not in the
list), check the Output Range, and then select Do It. You may also change the Confidence level
(from the default 95%) or select a one-sided interval instead of a two-sided interval (see next
item).
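
The bounds quoted above can be checked by hand with the usual t-based formula, mean plus or minus t times the standard error. The Python sketch below is an illustration, not StatPad's own code: the function name is hypothetical, and the critical value 2.0096 (two-sided 95%, 49 degrees of freedom) is an assumed value taken from a t table rather than anything stated in this guide.

```python
import math

def confidence_interval(mean, std_dev, n, t_critical):
    """Two-sided confidence interval for the mean from summary statistics."""
    margin = t_critical * std_dev / math.sqrt(n)
    return mean - margin, mean + margin

# Assumed t table value for a two-sided 95% interval with n - 1 = 49 degrees of freedom.
T_CRITICAL_95_DF49 = 2.0096

low, high = confidence_interval(90.78, 7.56, 50, T_CRITICAL_95_DF49)
print(round(low, 2), round(high, 2))
```

Using the Quality summaries (mean 90.78, standard deviation 7.56, n = 50), the rounded result matches the bounds 88.63 and 92.93 reported above.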
Hypothesis Test
A hypothesis test is used to decide, based
on data, whether or not the unobservable
population mean could reasonably be equal
to a given reference value. Because the
sample average represents (with statistical
error) the unknown population mean, the
result is often stated in terms of a significant
(or nonsignificant) difference between the
sample average and the reference value, both
of which are known. Random sampling from
a normal population is assumed.
StatPad's hypothesis test results for
Quality (below) show a highly
significant difference between the reference
value (given here as 87.5) and the observed average Quality score of 90.78. Results include the t
value, the p value, the practical interpretation of the results, and a formal statement of the null
hypothesis being tested.
To perform a hypothesis test using StatPad's main dialog box, select One Sample as the
situation and Hypothesis Test as the analysis. Select your data from the list (or use Add Data if
your column of numbers is in the worksheet but is not in the list), specify the Reference Value,
check the Output Range, and then select Do It. Optionally, you may specify a one-sided test
(upper or lower); see next item.
The p value says that, if the population mean had been equal to the reference value, then p is
the probability of observing such a large (or larger) difference between the sample average and
the reference value. Smaller p values indicate significance because a difference this large would
be a rare event if the population mean really were equal to the reference value.
Hypothesis test for Quality:
t = 3.07
p = 0.00350
The sample average 90.78
is highly significantly different (p<0.01)
from the reference value 87.5
We have REJECTED the null hypothesis
that claims that the population mean equals 87.5
and have instead ACCEPTED the research hypothesis
(assuming a random sample from a normal population).
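
The t value in this output can be reproduced from the summaries with the standard one-sample formula, t = (sample average - reference value) / standard error. The Python sketch below is illustrative only; the function name is hypothetical, and the critical value 2.0096 (two-sided 5% level, 49 degrees of freedom) is an assumed t table value. Computing the p value itself needs the t distribution, which this sketch leaves out.

```python
import math

def t_statistic(mean, std_dev, n, reference):
    """One-sample t statistic from summary statistics."""
    standard_error = std_dev / math.sqrt(n)
    return (mean - reference) / standard_error

t = t_statistic(90.78, 7.56, 50, 87.5)
print(round(t, 2))

# Assumed t table value: two-sided test at the 5% level, 49 degrees of freedom.
T_CRITICAL_95_DF49 = 2.0096
print(abs(t) > T_CRITICAL_95_DF49)  # True means significant at the 5% level
```

The rounded result agrees with the t = 3.07 shown in the output above.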
Percentile
Percentiles are landmarks in the data
that are a known percentage (of the data
values) from smallest to largest. The smallest
data value is the 0th percentile, the largest is
the 100th percentile, the median is the 50th
percentile, and so forth.
In StatPad's percentile calculation
(below) the 85th percentile for the Quality
scores is found to be a score of 98. That is,
the score 98 is about 85% of the way (in the
ordered list of scores) from the smallest to the
largest score.
To find a percentile using StatPad's main
dialog box, select One Sample as the situation and Percentile as the analysis. Select your data
from the list (or use Add Data if your column of numbers is in the worksheet but is not in the
list), provide the Percentage for which you would like the percentile, check the Output Range,
and then select Do It.
For Quality:
85th percentile
is 98
Percentile Ranking
The percentile ranking of a given data
value gives you the percentage of the way
along in the list of data values (from smallest
to largest) that this given data value is.
In StatPad's percentile ranking calculation
(below) the Quality score 87.5 is found to be
30% of the way from smallest to largest.
To find a percentile ranking using
StatPad's main dialog box, select One
Sample as the situation and Percentile
Ranking as the analysis. Select your data
from the list (or use Add Data if your column
of numbers is in the worksheet but is not in
the list), provide the data Value for which you would like the percentile ranking, check the
Output Range, and then select Do It.
For Quality:
87.5 is the
30th percentile
Sampling
Random Sample Without Replacement
A random sample without replacement
is chosen from a population so that (1) all
population units are equally likely to be
chosen, (2) units are selected independently
of one another, and (3) once a unit is chosen,
it cannot be chosen again. All sampled units
are different when sampling without
replacement.
StatPad's results (below) show a sample
of 5 selected at random (without replacement)
from a population of size 100. The selected
items (in order) are 19, 25, 59, 67, and 89.
This list of five numbers has also been given a
name (firstSample was chosen here), which
will appear in StatPad's lists of data sets.
To select a random sample without replacement using StatPad's main dialog box, select
Sampling as the situation and Sample Without Replacement as the analysis. Specify a
Population Size and a Sample Size. Provide an optional name for the resulting data in case you
plan to refer to it later, check the Output Range, and then select Do It.
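
As a sketch of what sampling without replacement means in practice, the following Python code draws distinct unit numbers from 1 to the population size. It uses Python's standard `random.sample`, which is a stand-in for illustration and not StatPad's own random number generator; the function name and the `seed` parameter are hypothetical. Because the draws are random, the numbers will generally differ from the 19, 25, 59, 67, and 89 shown above.

```python
import random

def sample_without_replacement(population_size, sample_size, seed=None):
    """Pick distinct unit numbers from 1..population_size, listed in increasing order."""
    rng = random.Random(seed)  # a seed makes the draw repeatable
    units = range(1, population_size + 1)
    return sorted(rng.sample(units, sample_size))

first_sample = sample_without_replacement(100, 5, seed=2012)
print(first_sample)
```

Every unit has the same chance of selection, and no unit can appear twice, so the returned list always contains `sample_size` different numbers.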
Random Sample With Replacement
This list of five numbers has also been given a name (secondSample was chosen here), which will
appear in StatPad's lists of data sets.
To select a random sample with replacement using StatPad's main dialog box, select
Sampling as the situation and Sample With Replacement as the analysis. Specify a Population
Size and a Sample Size. Provide an optional name for the resulting data in case you plan to
refer to it later, check the Output Range, and then select Do It.
Uniform Distribution
A uniform distribution generates
numbers, chosen independently of one
another, that are equally likely to fall
anywhere within a specified interval.
In StatPad's results (below) five numbers
were selected uniformly from 35 to 45. This
list of five numbers has also been given a
name (uniformSample was chosen here),
which will appear in StatPad's lists of data
sets.
To select a uniform sample using
StatPad's main dialog box, select Sampling
as the situation and Uniform Distribution as
the analysis. Specify the Smallest and Largest values of the distribution. Specify the Sample
Size. Provide an optional name for the resulting data in case you plan to refer to it later, check
the Output Range, and then select Do It.
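
A uniform sample of this kind can be imitated with Python's standard `random.uniform`. This is a hedged illustration rather than StatPad's method; the function name and the `seed` parameter are hypothetical, and the numbers produced will not match StatPad's output.

```python
import random

def uniform_sample(smallest, largest, sample_size, seed=None):
    """Independent draws, each equally likely to fall anywhere from smallest to largest."""
    rng = random.Random(seed)  # a seed makes the draw repeatable
    return [rng.uniform(smallest, largest) for _ in range(sample_size)]

uniform_values = uniform_sample(35, 45, 5, seed=7)
print(uniform_values)
```

Each value is drawn separately, so, unlike sampling without replacement, nothing prevents two draws from landing close together.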
Normal Distribution
A normal distribution generates
numbers, chosen independently of one
another, that follow a bell-shaped
distribution, with values most likely to fall
near the mean and the width of the bell
defined by the standard deviation (Std dev).
Observations fall within one standard
deviation of the mean about 68% of the time.
In StatPad's results (below) five numbers
were selected from a normal distribution with
mean 65 and standard deviation 20. This list
of five numbers has also been given a name
(simulatedScores was chosen here), which
will appear in StatPad's lists of data sets.
To select a normal sample using StatPad's main dialog box, select Sampling as the
situation and Normal Distribution as the analysis. Specify the Mean and Standard Deviation
(Std dev) values of the distribution. Specify the Sample Size. Provide an optional name for the
resulting data in case you plan to refer to it later, check the Output Range, and then select Do It.
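
The 68% rule of thumb mentioned above can be checked by simulation. The Python sketch below uses the standard `random.gauss` as a stand-in for StatPad's generator; the function names and the `seed` parameter are hypothetical, and the exact fraction printed depends on the random draws.

```python
import random

def normal_sample(mean, std_dev, sample_size, seed=None):
    """Independent draws from a normal distribution with the given mean and standard deviation."""
    rng = random.Random(seed)  # a seed makes the draw repeatable
    return [rng.gauss(mean, std_dev) for _ in range(sample_size)]

def fraction_within_one_sd(sample, mean, std_dev):
    """Share of the sample lying within one standard deviation of the mean."""
    inside = sum(1 for x in sample if abs(x - mean) <= std_dev)
    return inside / len(sample)

simulated_scores = normal_sample(65, 20, 5, seed=1)
print(simulated_scores)

# With many draws, the share within 65 +/- 20 should be close to 0.68.
large_sample = normal_sample(65, 20, 10000, seed=1)
print(fraction_within_one_sd(large_sample, 65, 20))
```

A sample of only five values is too small to show the bell shape, which is why the check uses a much larger simulated sample.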
Binomial Distribution
A binomial distribution is used to
describe the number of times an event
happens out of n trials, where each trial was
performed independently with a fixed
probability.
In StatPad's results (below) five numbers
are selected from a binomial distribution with
10 trials each having probability 0.5 of
success. In the first of the five samples, there
were 4 out of 10 successes. In the second
sample, 6 of 10 were successful.
To select a binomial sample using
StatPad's main dialog box, select Sampling
as the situation and Binomial Distribution as the analysis. Specify the Number n of trials and
the Probability of each trial. Specify the Sample Size. Provide an optional name for the resulting
data in case you plan to refer to it later, check the Output Range, and then select Do It.
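One way to picture a binomial draw is to simulate the n independent trials directly and count the successes. The sketch below does this in Python; the helper name binomial_draw and the seed are assumptions, and this is not necessarily how StatPad generates its numbers.

```python
import random

def binomial_draw(n, p, rng):
    """Count successes in n independent trials, each succeeding with probability p."""
    return sum(1 for _ in range(n) if rng.random() < p)

rng = random.Random(3)  # fixed seed so the sketch is repeatable

# Five binomial counts, each from 10 trials with success probability 0.5.
binomial_sample = [binomial_draw(10, 0.5, rng) for _ in range(5)]
print(binomial_sample)
```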
Binomial Percentages
Binomial percentages describe the
percent or proportion of the time an event
happens out of n trials, where each trial was
performed independently with a fixed
probability.
In StatPad's results (below) five binomial
percents were selected from a distribution
with 10 trials each having probability 0.5 of
success. In the first of the five samples, 0.3 or
30% of the 10 trials were successful. In the
second sample, 60% of the 10 were
successful.
To select a sample of binomial
percentages using StatPad's main dialog box, select Sampling as the situation and Binomial
Percentages as the analysis. Specify the Number n of trials and the Probability of each trial.
Specify the Sample Size. Provide an optional name for the resulting data in case you plan to
refer to it later, check the Output Range, and then select Do It.
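A binomial percentage is simply the binomial count divided by the number of trials. The following Python sketch shows this under the same assumptions as any simulation (standard library generator, illustrative helper name and seed); it is not StatPad's code.

```python
import random

def binomial_percentage(n, p, rng):
    """Proportion of successes among n independent trials with success probability p."""
    successes = sum(1 for _ in range(n) if rng.random() < p)
    return successes / n

rng = random.Random(5)  # fixed seed so the sketch is repeatable

# Five proportions, each based on 10 trials with success probability 0.5.
binomial_percentages = [binomial_percentage(10, 0.5, rng) for _ in range(5)]
print(binomial_percentages)
```

With 10 trials, every result is a multiple of 0.1, such as the 0.3 and 0.6 described above.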
Probability Calculations
Normal Probability (Greater Than)
A normal distribution generates numbers
according to a bell-shaped distribution, with
values most likely to fall near the mean and
the width of the bell defined by the standard
deviation. Observations fall within one
standard deviation of the mean about 68% of
the time. Probabilities for a normal
distribution are given by the area under the
bell-shaped curve.
StatPad's result (below) shows the
probability (0.401) that the specified normal
distribution (with mean 75 and standard
deviation 20) is greater than the given value
(80).
To find a normal probability using StatPad's main dialog box, select Probability as the
situation and Normal Probability as the analysis. Choose the type of probability you want
(Greater than, Less than, Between, or Not between), then give the Value(s) requested. Specify the
Mean and Standard Deviation of the normal distribution. Check the Output Range, and then
select Do It.
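The probability in this example can be checked independently. The sketch below uses Python's statistics.NormalDist (standard library, Python 3.8 or later) rather than StatPad; the second pair of values, 55 and 95, is an assumption chosen to illustrate the Between and Not between types.

```python
from statistics import NormalDist

dist = NormalDist(mu=75, sigma=20)  # mean 75, standard deviation 20

greater_than = 1 - dist.cdf(80)            # P(X > 80)
less_than = dist.cdf(80)                   # P(X < 80)
between = dist.cdf(95) - dist.cdf(55)      # P(55 < X < 95), one Std dev each side
not_between = 1 - between                  # P(X < 55 or X > 95)

print(round(greater_than, 3))  # 0.401
print(round(between, 3))       # 0.683
```

The first value matches the 0.401 reported by StatPad, and the Between value reproduces the "about 68%" rule for one standard deviation on either side of the mean.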
Discrete Probability
A discrete probability distribution is
characterized by two lists: a list of values and
a list of probabilities (where the probabilities
must add up to 1). StatPad computes the
Expected Value (also called the Mean) as the
weighted average of the values (using
probabilities as the weights) and also
computes the standard deviation, once you
specify these two columns of numbers.
StatPad's results for a situation with
three possibilities are shown below, where the
probability is 0.2 that profit is 3
($thousands), the probability is 0.5 that profit
is 5, and the probability is 0.3 that profit is 8.
These are specified as two separate columns of numbers, each with its name (Profit is a
column containing 3, 5, and 8, while ProbabilityOfProfit is a column containing 0.2, 0.5, and
0.3 which properly add up to 1). We see from the results below that the expected value is $5.5
thousand and the standard deviation (measuring the risk of this situation) is $1.8 thousand for
this discrete random variable.
To compute the mean and standard deviation for a discrete random variable using StatPad's
main dialog box, select Probability as the situation and Discrete Probability as the analysis.
Select one from each of the two lists (or use Add Data if your columns of numbers are in the
worksheet but are not in the lists) being sure to correctly specify which one contains the values
and which one contains the probabilities. Next check the Output Range and then select Do It.
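The expected value and standard deviation for the profit example can be reproduced by hand. The Python sketch below applies the weighted-average definition described above; the variable names are illustrative, and the code is a check on the arithmetic rather than StatPad's own routine.

```python
import math

profit = [3, 5, 8]                        # values, in $thousands
probability_of_profit = [0.2, 0.5, 0.3]   # probabilities, which must add up to 1

assert math.isclose(sum(probability_of_profit), 1.0)

# Expected value: weighted average of the values, using probabilities as weights.
expected_value = sum(v * p for v, p in zip(profit, probability_of_profit))

# Standard deviation: square root of the probability-weighted squared deviations.
variance = sum(p * (v - expected_value) ** 2
               for v, p in zip(profit, probability_of_profit))
standard_deviation = math.sqrt(variance)

print(round(expected_value, 1), round(standard_deviation, 1))  # 5.5 1.8
```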
Two-Sample Analysis
Summaries
Summaries are used to give you selected
numbers that represent and describe your
data sets. When you have two samples,
StatPad first reports summaries for each
sample separately, then gives the average
difference and the standard error of the
average difference, indicating the sampling
variability of the average difference. Note
that the two samples are assumed to have the
same measurement units (e.g., dollars).
StatPad's two-sample summaries (below)
are shown for the results of a survey sent to
customers in the East and to those in the
West.
To compute summaries for two samples using StatPad's main dialog box, select Two
Samples as the situation and Summaries as the analysis. Select a data set from each list (or use
Add Data if your columns of numbers are in the worksheet but are not in the lists). You may
(optionally) click on Paired to specify that the data sets have a natural pairing if the counts are
equal for the two data sets. Next check the Output Range and then select Do It.
The Paired check-box only affects the standard error of the difference. For a paired
situation, StatPad gives the ordinary standard error for the paired differences. For an unpaired
situation, StatPad uses the large-sample formula $\sqrt{S_1^2/n_1 + S_2^2/n_2}$ if both counts are at least 30.
Otherwise, StatPad uses the small-sample formula (assuming equal population variabilities)
$\sqrt{\left[(n_1-1)S_1^2 + (n_2-1)S_2^2\right]\left(1/n_1 + 1/n_2\right)/(n_1+n_2-2)}$.
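The two unpaired standard-error formulas can be sketched in Python as follows. The function names are assumptions for illustration; the inputs are the counts and standard deviations reported for East (17 and 661) and West (19 and 761) in this section.

```python
import math

def large_sample_se(s1, n1, s2, n2):
    """Unpaired standard error of the difference when both counts are at least 30."""
    return math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)

def small_sample_se(s1, n1, s2, n2):
    """Unpaired standard error assuming equal population variabilities (pooled)."""
    pooled = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)
    return math.sqrt(pooled * (1 / n1 + 1 / n2))

def unpaired_se(s1, n1, s2, n2):
    """Choose the formula by the rule described in the text."""
    if n1 >= 30 and n2 >= 30:
        return large_sample_se(s1, n1, s2, n2)
    return small_sample_se(s1, n1, s2, n2)

# East and West counts are below 30, so the small-sample formula applies.
se_difference = unpaired_se(661, 17, 761, 19)
print(round(se_difference))
```

With the rounded summary values as input, the result is about 239, in line with the standard error of the difference that StatPad reports for these data.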
Summaries                                                           East    West
Count n                                                               17      19
Mean or average                                                    1,834   2,390
Standard deviation (variability of individuals)                      661     761
Smallest                                                             752     836
Lower quartile                                                     1,295   2,004
Median                                                             1,931   2,294
Upper quartile                                                     2,426   2,853
Largest                                                            2,975   4,085
Standard error (variability of sample average, if random sample)     160     175
Average difference (West minus East)                                 557
Standard error of the average difference                             239
Histograms
Histograms are used to visually explore
data sets. The data axis is horizontal, and the
bars show how many data values are within
each interval. Data are concentrated where
bars are tall. You can see typical value,
variability, and distribution shape.
StatPad's histograms are shown below
for the East and West survey data, one
histogram for each data set.
To create histograms for two samples
using StatPad's main dialog box, select Two
Samples as the situation and Histograms as
the analysis. Select a data set from each list
(or use Add Data if your columns of numbers are in the worksheet but are not in the lists). Next
check the Output Range and then select Do It.
StatPad chooses a default bin width and landmark for the histogram bars. If you wish to
change these, use the Customize check-box found under One Sample, Histogram. Note that
Excel (not StatPad) chooses the minimum and maximum of the horizontal scale. To change these,
leave StatPad (press the Esc key or select Cancel), then double-click the axis to
find Minimum and Maximum under Axis Options.
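To make the roles of bin width and landmark concrete, here is a small Python sketch of histogram counting. It assumes that the landmark is a value at which one bin boundary falls, with the other boundaries spaced one bin width apart; StatPad's exact convention is not stated in this section, and the function name and toy data are illustrative only.

```python
import math
from collections import Counter

def histogram_counts(data, bin_width, landmark=0):
    """Count values per bin; each bin is [start, start + bin_width),
    where start = landmark + k * bin_width for some integer k."""
    counts = Counter()
    for x in data:
        k = math.floor((x - landmark) / bin_width)
        counts[landmark + k * bin_width] += 1
    return dict(sorted(counts.items()))

# Toy data: the five summary values reported for East (smallest through largest).
toy_values = [752, 1295, 1931, 2426, 2975]
counts = histogram_counts(toy_values, bin_width=1000, landmark=0)
print(counts)  # {0: 1, 1000: 2, 2000: 2}
```

Changing the landmark shifts every bin boundary by the same amount, which can change the apparent shape of the histogram even when the bin width stays the same.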
[Figure: histograms of East and West, one per data set; vertical axis: Frequency]
Box Plots
Box plots are used to visually explore
and compare data sets; they show you a
central box defined by the quartiles, with the
median indicated within the box. In the
ordinary box plot, a line extends from the box
on each side to the most extreme value. In the
detailed box plot, outliers are indicated
separately and these lines extend from the
box on each side to the most extreme value
(adjacent value) that is not an outlier.
StatPad's detailed box plots are shown
below, on the same scale, for the East and
West survey data. Note that the western
values are generally somewhat higher,
although there is considerable overlap. There are no outliers.
To create box plots for two samples using StatPad's main dialog box, select Two Samples
as the situation and Box Plots as the analysis. Select a data set from each list (or use Add Data
if your columns of numbers are in the worksheet but are not in the lists). Click on Detailed box
plot if you wish outliers to be displayed separately. Next check the Output Range and then select
Do It.
Outliers are defined as data values more than 1.5 times the interquartile range away from
either quartile.
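The outlier rule can be written out as a short Python sketch. The function names are assumptions; the quartiles and extreme values used are those reported for East and West in the two-sample summaries.

```python
def outlier_fences(lower_quartile, upper_quartile):
    """Limits beyond which a value is more than 1.5 interquartile ranges from a quartile."""
    iqr = upper_quartile - lower_quartile
    return lower_quartile - 1.5 * iqr, upper_quartile + 1.5 * iqr

def find_outliers(data, lower_quartile, upper_quartile):
    """Return the values that fall outside the fences."""
    low, high = outlier_fences(lower_quartile, upper_quartile)
    return [x for x in data if x < low or x > high]

east_fences = outlier_fences(1295, 2426)
west_fences = outlier_fences(2004, 2853)
print(east_fences, west_fences)

# The smallest and largest values of each sample fall inside the fences.
print(find_outliers([752, 2975], 1295, 2426))
print(find_outliers([836, 4085], 2004, 2853))
```

Both lists of outliers come back empty, which agrees with the statement that there are no outliers in the East and West data.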
[Figure: detailed box plots for East and West on the same scale]
Confidence Interval
A two-sample confidence interval for the
population mean difference includes this
unknown population mean difference with
known confidence, e.g., 95%, when random
sampling is used and normal distributions are
assumed.
StatPad's 95% confidence interval
results (below) for the mean difference, West
minus East, tell you that the bounds of the
interval are 71.16 and 1,042.42.
To compute a two-sample confidence
interval using StatPad's main dialog box,
select Two Samples as the situation and Confidence Interval as the analysis. Select a data set
from each list (or use Add Data if your columns of numbers are in the worksheet but are not in
the lists). You may (optionally) change the Confidence level (from the default 95%). You may
also (optionally) click on Paired to specify that the data sets have a natural pairing if the counts
are equal for the two data sets. Next check the Output Range and then select Do It.
The two-sample confidence interval is based on the standard error of the difference,
described previously under Two Sample, Summaries. If unpaired, random sampling from each of
two normal populations is assumed (also assuming equal population variabilities if the small-sample standard error is used). If paired, random sampling from a normal population is
assumed for the differences formed from the two measurements on each unit sampled.
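As a hedged check on the interval for West minus East, the sketch below forms the usual "estimate plus or minus t times standard error" interval in Python. The average difference (557) and its standard error (239) are the rounded values from the two-sample summaries; the critical value 2.032 is an assumption taken from a t table for 34 degrees of freedom (17 + 19 - 2) at 95% confidence and does not appear in the original.

```python
def confidence_interval(difference, standard_error, t_critical):
    """Two-sided interval: difference plus or minus t_critical * standard_error."""
    margin = t_critical * standard_error
    return difference - margin, difference + margin

# t_critical = 2.032 is an assumed t-table value (34 degrees of freedom, 95%).
lower, upper = confidence_interval(difference=557, standard_error=239, t_critical=2.032)
print(round(lower, 1), round(upper, 1))
```

Because the inputs are rounded, the bounds agree with StatPad's 71.16 and 1,042.42 only to within a fraction of a unit.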
Hypothesis Test
A two-sample hypothesis test is used to
decide, based on data, whether or not the
unobservable population means could
reasonably be equal to each other. Because
the sample averages represent (with
statistical error) their respective unknown
population means, the result is often stated in
terms of a significant (or nonsignificant)
difference between the sample averages, both
of which are known.
StatPad's two-sample hypothesis test
results for the East and West survey (below)
show a significant difference between the two
regions (East and West) on average. Results
include the t value, the p value, the practical interpretation of the results, and a formal statement
of the null hypothesis being tested.
To perform a two-sample hypothesis test using StatPad's main dialog box, select Two
Samples as the situation and Hypothesis Test as the analysis. Select a data set from each list (or
use Add Data if your columns of numbers are in the worksheet but are not in the lists). You may
(optionally) click on Paired to specify that the data sets have a natural pairing if the counts are
equal for the two data sets. Next check the Output Range and then select Do It.
The two-sample hypothesis test is based on the standard error of the difference, described
previously under Two Sample Summaries. If unpaired, random sampling from each of two
normal populations is assumed (also assuming equal population variabilities if the small-sample
standard error is used). If paired, random sampling from a normal population is assumed for the
differences formed from the two measurements on each unit sampled.
The p value is the probability, if the population means were equal to each other, of
observing a difference between the sample averages at least as large as the one actually observed.
Smaller p values indicate significance because so large a difference would be a rare event if the means were equal.
Hypothesis test for East and West:
t = 2.33
p = 0.026
The sample averages
1,834 and 2,390
are significantly different (p<0.05).
We have REJECTED the null hypothesis
that claims that the population means are equal
and have instead ACCEPTED the research hypothesis
using the small-sample unpaired standard error,
which assumes equal variabilities, and assuming
random samples from normal populations.
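The t value in this output can be reproduced from the two-sample summaries: it is the average difference divided by its standard error. The Python sketch below does this with the rounded values 557 and 239; the critical value 2.032 (t table, 34 degrees of freedom, two-sided 5% level) is an assumption not given in the original, and the p value itself is not computed here.

```python
def t_statistic(difference, standard_error):
    """Number of standard errors separating the sample averages."""
    return difference / standard_error

t_value = t_statistic(difference=557, standard_error=239)

# Assumed t-table value for 34 degrees of freedom at the two-sided 5% level.
T_CRITICAL = 2.032
significant = abs(t_value) > T_CRITICAL

print(round(t_value, 2), significant)  # 2.33 True
```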
Many-Sample Analysis
Summaries
Summaries are used to give you selected
numbers that represent and describe your
data sets. When you have many samples,
StatPad reports summaries for each sample
separately.
StatPad's many-sample summaries
(below) are shown for the quality scores of
four suppliers (defining four samples). For
example, supplier B had 35 scores listed, with
an average of 85.14.
To compute summaries for many samples
using StatPad's main dialog box, select Many
Samples as the situation and Summaries as
the analysis. Select your data sets from the list (or use Add Data if your columns of numbers are
in the worksheet but are not in the list). Next check the Output Range and then select Do It.
Histograms
Histograms are used to visually explore
data sets. The data axis is horizontal, and the
bars show how many data values are within
each interval. Data are concentrated where
bars are tall. You can see typical value,
variability, and distribution shape.
StatPad's many-sample histograms
(below) are shown for the quality scores of
the four suppliers. Some of the horizontal
scales have been changed using Excel chart
commands (see below) because Excel's
choice did not show enough detail.
To create histograms for many samples
using StatPad's main dialog box, select Many Samples as the situation and Histograms as the
analysis. Select your data sets from the list (or use Add Data if your columns of numbers are in
the worksheet but are not in the list). Next check the Output Range and then select Do It.
StatPad chooses a default bin width and landmark for the histogram bars. If you wish to
change these, use the Customize check-box found under One Sample, Histogram. Note that
Excel (not StatPad) chooses the minimum and maximum of the horizontal scale. To change these,
leave StatPad (press the Esc key or select Cancel), then double-click the axis to
find Minimum and Maximum under Axis Options.
[Figure: histograms of quality scores for SupplierA, SupplierB, SupplierC, and SupplierD; vertical axis: Frequency]
Box Plots
Box plots are used to visually explore
and quickly compare data sets; they show you
a central box defined by the quartiles, with
the median indicated within the box. In the
ordinary box plot, a line extends from the box
on each side to the most extreme value. In the
detailed box plot, outliers are indicated
separately and these lines extend from the
box on each side to the most extreme value
(adjacent value) that is not an outlier.
StatPad's many-sample detailed box
plots (below) are shown for the quality scores
of the four suppliers, arranged on the same
scale for easy comparison. There is one box
plot for each supplier. Suppliers A and D seem to have the highest scores overall, while supplier
C has the lowest. Supplier D has a low outlier score. The horizontal scale was changed using
Excel chart commands (see below) because Excel's choice did not show enough detail.
To create box plots for many samples using StatPad's main dialog box, select Many
Samples as the situation and Box Plots as the analysis. Select your data sets from the list (or use
Add Data if your columns of numbers are in the worksheet but are not in the list). Click on
Detailed box plot if you wish outliers to be displayed separately. Next check the Output Range
and then select Do It.
Outliers are defined as data values more than 1.5 times the interquartile range away from
either quartile. Note that Excel (not StatPad) chooses the minimum and maximum of the horizontal
scale. To change these, leave StatPad (press the Esc key or select Cancel), then
double-click the axis to find Minimum and Maximum under Axis Options.
[Figure: detailed box plots of quality scores for the four suppliers on the same scale]
Mean Differences
If your F test is significant, indicating
that there are significant differences among
the averages, you may be interested in
learning which pairs in particular show
differences. The least-significant-difference
test can be used to provide a hypothesis test
and a confidence interval for each pair of
data sets. It is assumed that samples are
drawn randomly from normal populations
with equal variabilities.
StatPad's many-sample mean-differences
results for the four supplier quality scores
(below) show that all pairs of suppliers show
very highly significant differences (p<0.001)
with the exception of suppliers A and D. This corresponds well with the visual impression from
the box plots created earlier. Note that 99% confidence intervals were used. Note that with four
suppliers there are six pairs of suppliers.
To perform a many-sample mean-difference analysis using StatPad's main dialog box, select
Many Samples as the situation and Mean Differences as the analysis. Select your data sets from
the list (or use Add Data if your columns of numbers are in the worksheet but are not in the list).
Specify the confidence level to be used in computing the confidence intervals for the mean
differences. Next check the Output Range and then select Do It.
Sample1     99% LowerCI   99% UpperCI       t          p   Significant?
SupplierA         -9.30         -2.35   -4.41   2.83E-05   Yes (p<0.001)
SupplierA        -19.19        -10.73   -9.30   7.46E-15   Yes (p<0.001)
SupplierA         -5.34          2.10   -1.15   0.255012   No (p>0.05)
SupplierB        -12.96         -5.31   -6.29   1.1E-08    Yes (p<0.001)
SupplierB          0.96          7.45    3.41   0.000976   Yes (p<0.001)
SupplierC          9.30         17.39    8.67   1.52E-13   Yes (p<0.001)
Bivariate Analysis
Scatterplot
A scatterplot is used to visually explore a
bivariate data set, showing the distribution of
two measurements (X and Y) that describe
each item in a sample. The X axis is
horizontal and Y is vertical. Each item is
represented by one point in the scatterplot.
You can see if there is a linear or nonlinear
relationship, if the variability is equal or not,
if there is clustering or if outliers are present.
StatPad's scatterplot of coupon price (X)
and bid price (Y) for a group of tax-exempt
bonds is shown below. There is a strong
linear (straight-line) increasing relationship:
bonds that pay a higher coupon are worth
more. One bond stands out (an outlier?) with a lower coupon and price than the others.
To create a scatterplot using StatPad's main dialog box, select Bivariate as the situation
and Scatterplot as the analysis. Select a data set from each list (or use Add Data if your columns
of numbers are in the worksheet but are not in the lists). Your X variable will be on the
horizontal axis, with Y on the vertical axis. Next check the Output Range and then select Do It.
[Figure: scatterplot of price (vertical axis) against coupon (horizontal axis)]
Correlation
The correlation between two variables
indicates the strength of their relationship as
a pure number. A perfect linear relationship
(i.e., all points exactly along a straight line)
has correlation either 1 or -1 depending on
whether it is increasing or decreasing. If
there is no relationship, the correlation will
be close to 0 (although there can be a
nonlinear relationship with correlation 0). If
all points fall on a horizontal or vertical line,
the correlation is undefined.
StatPad finds the correlation between
coupon payment and bond price to be 0.945
(below). This is a strong correlation, close to
1, summarizing the strong increasing relationship visible in the scatterplot.
To compute a correlation using StatPad's main dialog box, select Bivariate as the situation
and Correlation as the analysis. Select a data set from each list (or use Add Data if your
columns of numbers are in the worksheet but are not in the lists). Next check the Output Range
and then select Do It.
0.945      Correlation between coupon and price
12.23      t
3.71E-10   p
The correlation
is very highly significantly different (p<0.001)
from the reference value zero.
We have REJECTED the null hypothesis
that claims that the population correlation is zero
and have instead ACCEPTED the research hypothesis
(assuming a random sample from a bivariate normal population).
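For readers who want to see the calculation behind these numbers, the Python sketch below computes a correlation from two lists and the t statistic used to test it against zero. The formula t = r * sqrt(n - 2) / sqrt(1 - r^2) is the standard test for a correlation; whether StatPad uses exactly this route is an assumption, and the function names and toy data are illustrative. The bond data themselves are not listed in this section.

```python
import math

def pearson_correlation(xs, ys):
    """Correlation between paired lists xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def correlation_t(r, n):
    """t statistic for testing a correlation r from n pairs against zero."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

# Toy data: perfectly increasing and perfectly decreasing straight lines.
print(pearson_correlation([1, 2, 3], [2, 4, 6]))
print(pearson_correlation([1, 2, 3], [6, 4, 2]))

# The reported correlation (0.945) and count (20 bonds).
t_from_rounded_r = correlation_t(0.945, 20)
print(round(t_from_rounded_r, 1))
```

Using the rounded correlation 0.945 gives a t value near the 12.23 that StatPad reports; the small gap comes from rounding the correlation to three decimals.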
Regression
Regression is used to predict or explain
the Y variable from the X variable, using the
least-squares line. The regression line
summarizes the form of the relationship and
can be used to predict Y for a new value of X.
StatPad's regression analysis of the bond
data (below) shows that prices are
approximately $48.326 plus 8.730 times the
coupon value. Results initially include the R-squared
value, the standard error of estimate, the
number of observations, the F statistic, and
the p value. The regression table gives
confidence intervals, standard errors, t, and p
values for the constant term and the
regression coefficient for coupon (the X variable). The practical interpretation of these results
then follows.
To perform a regression analysis using StatPad's main dialog box, select Bivariate as the
situation and Regression as the analysis. Select a data set from each list (or use Add Data if
your columns of numbers are in the worksheet but are not in the lists). You may optionally
change the Confidence level (from the default 95%). Next check the Output Range and then
select Do It.
Regression analysis to predict price from coupon.
The prediction equation is:
price = 48.326 + 8.730 coupon

0.893      R squared
1.034      Standard error of estimate
20         Number of observations
149.577    F statistic
3.71E-10   p value

            Coeff   95% LowerCI   95% UpperCI   StdErr        t          p   Significant?
Constant   48.326        39.594        57.058    4.156   11.627   8.38E-10   Yes (p<0.001)
coupon      8.730         7.230        10.230    0.714   12.230   3.71E-10   Yes (p<0.001)
The R-squared value, 89.3%, indicates the proportion of the variance of price
that is explained by the regression model.
Thus coupon explains
a very highly significant proportion of the variation in price, based on the F test (p<0.001).
The standard error of estimate, 1.034, indicates the typical size
of errors made in predicting price using the regression model.
We estimate that:
8.730 is the increase in price associated with an increase in coupon of 1 unit. This is very highly
significant (p<0.001).
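The least-squares line and its use for prediction can be sketched in Python as follows. The fitting function is the textbook formula (slope = Sxy / Sxx), not StatPad's code; the toy data are hypothetical, and the coupon value 6.0 is an arbitrary example. Only the coefficients 48.326 and 8.730 come from the output above.

```python
def least_squares_line(xs, ys):
    """Return (intercept, slope) of the least-squares line predicting ys from xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def predict_price(coupon, intercept=48.326, slope=8.730):
    """Predicted bond price from the prediction equation reported by StatPad."""
    return intercept + slope * coupon

# Hypothetical toy data lying exactly on the line y = 1 + 2x.
intercept, slope = least_squares_line([1, 2, 3], [3, 5, 7])
print(intercept, slope)

# Predicted price for an arbitrary coupon of 6.0.
print(round(predict_price(6.0), 3))
```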
-0.799   -0.772
-0.328   -0.317
-1.080   -1.044
-1.400   -1.353
-0.707   -0.684
 0.168    0.162
 1.033    0.999
 1.067    1.031
 0.035    0.034
 0.162    0.156
 1.158    1.119
 1.317    1.273
 1.033    0.999
 0.634    0.613
 0.297    0.287
 0.297    0.287
-2.727   -2.637
 0.408    0.394
-0.017   -0.016
-0.549   -0.530
Univariate Summaries
Summaries are used to give you selected
numbers that represent and describe your
data sets. When you have bivariate data, you
can find summaries separately for each
variable.
StatPad's univariate summaries for
bivariate data (below) are shown for the
coupon rates and prices of bonds.
To compute univariate summaries for
bivariate data using StatPad's main dialog
box, select Bivariate as the situation and
Univariate Summaries as the analysis. Select
a data set from each list (or use Add Data if
your columns of numbers are in the worksheet but are not in the lists). Next check the Output
Range and then select Do It.
Summaries                                                          coupon     price
Count n                                                                20        20
Mean or average                                                     5.814    99.081
Standard deviation (variability of individuals)                     0.332     3.072
Smallest                                                            5.000    89.250
Lower quartile                                                      5.550    97.375
Median                                                              5.813    99.375
Upper quartile                                                      6.125   101.125
Largest                                                             6.200   102.750
Standard error (variability of sample average, if random sample)    0.074     0.687
Histograms
Histograms are used to visually explore
data sets. The data axis is horizontal, and the
bars show how many data values are within
each interval. Data are concentrated where
bars are tall. You can see typical value,
variability, and distribution shape.
StatPad's histograms are shown below
for the coupon rates and prices of bonds, one
histogram for each variable.
To create histograms for bivariate data
using StatPad's main dialog box, select
Bivariate as the situation and Histograms as
the analysis. Select a data set from each list
(or use Add Data if your columns of numbers are in the worksheet but are not in the lists). Next
check the Output Range and then select Do It.
StatPad chooses a default bin width and landmark for the histogram bars. If you wish to
change these, use the Customize check-box found under One Sample, Histogram. Note that
Excel (not StatPad) chooses the minimum and maximum of the horizontal scale. To change these,
leave StatPad (press the Esc key or select Cancel), then double-click the axis to
find Minimum and Maximum under Axis Options.
[Figure: histograms of coupon and price, one per variable; vertical axis: Frequency]
Box Plots
Box plots are used to visually explore
and compare data sets; they show you a
central box defined by the quartiles, with the
median indicated within the box. In the
ordinary box plot, a line extends from the box
on each side to the most extreme value. In the
detailed box plot, outliers are indicated
separately and these lines extend from the
box on each side to the most extreme value
(adjacent value) that is not an outlier.
StatPad's detailed box plots are shown
below, on separate scales, for the coupon
rates and prices of bonds. There are no
coupon outliers, but there is one lower price
outlier.
To create box plots for bivariate data using StatPad's main dialog box, select Bivariate as
the situation and Box Plots as the analysis. Select a data set from each list (or use Add Data if
your columns of numbers are in the worksheet but are not in the lists). Click on Detailed box plot
if you wish outliers to be displayed separately. Next check the Output Range and then select Do
It.
Outliers are defined as data values more than 1.5 times the interquartile range away from
either quartile. With bivariate data, StatPad creates box plots separately for each variable
because bivariate data are often measured in different units. To see box plots on the same scale,
you would select Two Samples, Box Plots from StatPad's main dialog box.
[Figure: detailed box plots of coupon and price on separate scales]
Multivariate Analysis and Multiple Regression
Scatterplots
With multivariate data, scatterplots can
be drawn for each pair of variables. A
scatterplot is used to visually explore a
bivariate data set, showing the distribution of
the two measurements.
StatPad's results show scatterplots of all
pairs of variables (below) for a mail-order
firm's multivariate data set consisting of
information on recently received catalog
orders with questionnaires attached: order
size, income, education, and region (East or
West). Scatterplots involving region (West)
look different because it is coded as an
indicator variable with 1 = West and 0 = East.
[Figure: scatterplots of all pairs of the variables Order, Education, Income, and West]
To create scatterplots for multivariate data using StatPad's main dialog box, select
Multivariate as the situation and Scatterplots as the analysis. Choose one of your data sets to be
the Y variable, and select the others from the list of X variables (or use Add Data if your
columns of numbers are in the worksheet but are not in the lists). Next check the Output Range
and then select Do It.
Correlations
With multivariate data, the correlation
can be calculated for each pair of variables
and these can be displayed in a table (the
correlation matrix). The correlation between
two variables indicates the strength of their
relationship as a pure number.
StatPad's correlation matrix for the
catalog-order data is shown below. Note that
the diagonal values are all 1 because each
variable is perfectly correlated with itself.
The highest correlations are Order with West
(0.700), Education with Income (0.607), and
Order with Education (0.564), corresponding
to the scatterplots with the clearest tilt.
To find the correlations for multivariate data using StatPad's main dialog box, select
Multivariate as the situation and Correlations as the analysis. Choose one of your data sets to
be the Y variable, and select the others from the list of X variables (or use Add Data if your
columns of numbers are in the worksheet but are not in the lists). Next check the Output Range
and then select Do It.
Correlation   Order   Education   Income    West
Order         1.000       0.564    0.158   0.700
Education     0.564       1.000    0.607   0.207
Income        0.158       0.607    1.000   0.011
West          0.700       0.207    0.011   1.000
Multiple Regression
Multiple regression is used to predict or
explain the Y variable from two or more X
variables, using the best (least-squares) linear
relationship.
The prediction equation
summarizes the form of the relationship and
can be used to predict Y given new values for
each of the X variables.
StatPad's regression analysis of the
catalog-order data (below) shows the
prediction equation, the R-squared value, the standard
error of estimate, the number of observations,
the F statistic, and the p value. The regression
table gives confidence intervals, standard
errors, t, and p values for the constant term and
the regression coefficients (one line per X
variable). The practical interpretation of these results then follows.
To perform a multiple regression analysis using StatPad's main dialog box, select Multivariate
as the situation and Regression as the analysis. Choose one of your data sets to be the Y variable (to
be predicted), and select the others from the list of X variables (or use Add Data if your columns of
numbers are in the worksheet but are not in the lists). You may optionally change the Confidence
level (from the default 95%). Next check the Output Range and then select Do It.
Multiple regression analysis to predict Order from Education, Income and West.
The prediction equation is:
Order = -3.636 + 3.356 Education - 0.0002 Income + 24.595 West

0.690     R squared
13.413    Standard error of estimate
16        Number of observations
8.898     F statistic
0.002     p value

               Coeff   95% LowerCI   95% UpperCI     StdErr        t       p   Significant?
Constant      -3.636       -39.557        32.285     16.487   -0.221   0.829   No (p>0.05)
Education      3.356         0.532         6.180      1.296    2.589   0.024   Yes (p<0.05)
Income      -0.00020      -0.00072       0.00033    0.00024   -0.809   0.434   No (p>0.05)
West          24.595         9.301        39.889      7.019    3.504   0.004   Yes (p<0.01)
The R-squared value, 69.0%, indicates the proportion of the variance of Order
that is explained by the regression model.
Thus Education, Income and West together explain
a highly significant proportion of the variation in Order, based on the F test (p<0.01).
The standard error of estimate, 13.413, indicates the typical size
of errors made in predicting Order using the regression model.
Holding the other X variables constant, we estimate that:
3.356 is the increase in Order associated with an increase in Education of 1 unit. This is significant (p<0.05).
-0.00020 is the increase in Order associated with an increase in Income of 1 unit. This is not significant (p>0.05).
24.595 is the increase in Order associated with an increase in West of 1 unit. This is highly significant (p<0.01).
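To show how the prediction equation is applied, the Python sketch below plugs values into the reported coefficients. The function name and the example customer (16 years of education, income of 70,000, located in the West) are hypothetical; only the coefficients come from the StatPad output.

```python
def predict_order(education, income, west):
    """Predicted Order from the reported prediction equation.
    west is the indicator variable: 1 = West, 0 = East."""
    return -3.636 + 3.356 * education - 0.0002 * income + 24.595 * west

# Hypothetical customer: Education 16, Income 70,000, West.
predicted_west = predict_order(education=16, income=70_000, west=1)

# Same hypothetical customer in the East; only the indicator changes.
predicted_east = predict_order(education=16, income=70_000, west=0)

print(round(predicted_west, 3), round(predicted_east, 3))
```

The two predictions differ by exactly the West coefficient, 24.595, which is what "holding the other X variables constant" means in the interpretation above.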
Diagnostic Plot
The diagnostic plot is used to look for
potential problems in multiple regression. It
is a scatterplot of the residuals (vertically)
against the predicted values (horizontally).
Any useful structure that the regression
equation has failed to capture will be found
in the residuals. Consequently, if you see
structure in the diagnostic plot (a curve,
outliers, unequal variability) it suggests that
the multiple regression analysis is not
capturing all of the available structure in the
data.
StatPad's diagnostic plot for the catalog-order data is shown below. No problems are
evident: it looks like random scatter without tilt or curvature, suggesting that multiple regression
has already extracted the available structure from the data.
To create a diagnostic plot using StatPad's main dialog box, select Multivariate as the
situation and Diagnostic Plot as the analysis. Choose one of your data sets to be the Y variable (to
be predicted), and select the others from the list of X variables (or use Add Data if your columns of
numbers are in the worksheet but are not in the lists). Next check the Output Range and then select
Do It.
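The points in a diagnostic plot can be sketched in Python as pairs of (predicted value, residual), where each residual is the actual value minus the predicted value. The function name and the three data values below are hypothetical; the catalog-order observations are not listed in this section.

```python
def diagnostic_points(actual, predicted):
    """Pairs of (predicted value, residual), with residual = actual - predicted."""
    return [(p, a - p) for a, p in zip(actual, predicted)]

# Hypothetical actual and predicted Order values for three customers.
actual_orders = [50.0, 30.0, 72.0]
predicted_orders = [60.5, 25.0, 70.0]

points = diagnostic_points(actual_orders, predicted_orders)
print(points)  # [(60.5, -10.5), (25.0, 5.0), (70.0, 2.0)]
```

Plotting the first element of each pair horizontally and the second vertically gives the layout described above.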
[Figure: diagnostic plot of residuals (vertical axis) against Order values predicted from Education, Income and West (horizontal axis)]
Univariate Summaries
Summaries are used to give you selected
numbers that represent and describe your
data sets. When you have multivariate data,
you can find summaries separately for each
variable.
StatPad's univariate summaries for
multivariate data are shown below for the
catalog-order data.
To compute univariate summaries for
multivariate data using StatPad's main
dialog box, select Multivariate as the
situation and Univariate Summaries as the
analysis. Choose one of your data sets to be the
Y variable, and select the others from the list of X variables (or use Add Data if your columns of
numbers are in the worksheet but are not in the lists). Next check the Output Range and then select
Do It.
Summaries                                          Order  Education    Income   West
Count n                                               16         16        16     16
Mean or average                                   43.313     15.063   $73,558  0.438
Standard deviation (variability of individuals)   21.543      3.492   $18,360  0.512
Smallest                                            10.0        9.0   $41,000      0
Lower quartile                                      27.5       12.5   $65,418      0
Median                                              39.5       16.0   $70,519      0
Upper quartile                                      57.5       17.0   $79,839      1
Largest                                             93.0       21.0  $117,370      1
Standard error (variability of sample average,     5.386      0.873    $4,590  0.128
  if random sample)
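Several of these summaries can be sketched with Python's standard library. The example below is an illustration rather than StatPad's own computation. It assumes the West column holds seven 1s and nine 0s, which is inferred from the reported count (16) and mean (0.438); the order of the values does not matter here. The quartiles are left out because the manual does not say which quartile rule StatPad uses.

```python
import math
import statistics


def summarize(values):
    """Return a few of the univariate summaries for one variable."""
    n = len(values)
    sd = statistics.stdev(values)          # sample standard deviation (n - 1 divisor)
    return {
        "count": n,
        "mean": statistics.mean(values),
        "standard_deviation": sd,
        "smallest": min(values),
        "median": statistics.median(values),
        "largest": max(values),
        # Standard error of the average: standard deviation / sqrt(n)
        "standard_error": sd / math.sqrt(n),
    }


# Assumed reconstruction of the West indicator: seven 1s and nine 0s.
west = [1] * 7 + [0] * 9
summary = summarize(west)
for name, value in summary.items():
    print(f"{name:20s} {value}")
```

With these values the mean, standard deviation and standard error agree with the West column of the table to three decimals, which suggests that the table's standard deviation uses the n - 1 divisor.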
Histograms
Histograms are used to visually explore
data sets. Data are concentrated where bars
are tall.
StatPad's histograms are shown below
for the catalog-order data, one histogram for
each variable. The histogram for West
reflects the fact that each data value is either
0 or 1.
To create histograms for multivariate
data using StatPad's main dialog box, select
Multivariate as the situation and Histograms
as the analysis. Choose one of your data sets
to be the Y variable, and select the others from
the list of X variables (or use Add Data if your columns of numbers are in the worksheet but are not
in the lists). Next check the Output Range and then select Do It.
StatPad chooses a default bin width and landmark for the histogram bars. If you wish to
change these, use the Customize check-box found under One Sample, Histogram. Note that
Excel (not StatPad) chooses the minimum and maximum of the horizontal scale. To change them,
leave StatPad (press the Esc key or select Cancel), then double-click on the axis to
find Minimum and Maximum under Axis Options.
[Figure: histograms (Frequency on the vertical axis) for Order, Education, Income and West.]
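How a bin width and a landmark determine the histogram bars can be sketched as follows. This is a hedged illustration only: it assumes that the landmark is a value at which a bin boundary is placed, that bins include their left edge, and that the data values shown are hypothetical. StatPad's exact conventions may differ.

```python
import math
from collections import Counter


def histogram_counts(values, bin_width, landmark=0):
    """Count values per bin, with bin edges at landmark + k * bin_width."""
    counts = Counter()
    for v in values:
        k = math.floor((v - landmark) / bin_width)   # index of the bin containing v
        counts[landmark + k * bin_width] += 1         # key is the left edge of the bin
    return dict(sorted(counts.items()))


# Hypothetical data values (not the catalog-order data).
values = [12, 18, 25, 31, 33, 47]
print(histogram_counts(values, bin_width=10, landmark=0))
print(histogram_counts(values, bin_width=10, landmark=5))
```

Moving the landmark from 0 to 5 keeps the bin width but shifts every boundary, so the same data can produce bars of different heights.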
Box Plots
Box plots are used to visually explore
and compare data sets; they show you a
central box defined by the quartiles, with the
median indicated within the box. In the
ordinary box plot, a line extends from the box
on each side to the most extreme value. In the
detailed box plot, outliers are indicated
separately and these lines extend from the
box on each side to the most extreme value
(adjacent value) that is not an outlier.
StatPad's detailed box plots are shown
below, on separate scales, for each variable
in the catalog-order data set. The box plot for
West reflects the fact that each data value is
either 0 or 1.
To create box plots for multivariate data using StatPad's main dialog box, select
Multivariate as the situation and Box Plots as the analysis. Choose one of your data sets to be the
Y variable, and select the others from the list of X variables (or use Add Data if your columns of
numbers are in the worksheet but are not in the lists). Click on Detailed box plot if you wish
outliers to be displayed separately. Next check the Output Range and then select Do It.
Outliers are defined as data values more than 1.5 times the interquartile range away from
either quartile. With multivariate data, StatPad creates box plots separately for each variable
because multivariate data are often measured in different units. To see box plots on the same
scale, you would select Many Sample, Box Plots from StatPad's main dialog box.
[Figure: detailed box plots, on separate scales, for Order, Education, Income and West.]
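The outlier rule stated above (more than 1.5 times the interquartile range away from either quartile) can be sketched in Python. The quartiles and extremes below come from the Univariate Summaries table; the function name is illustrative, and it is an assumption that the box plots use the same quartiles as that table.

```python
def outlier_fences(lower_quartile, upper_quartile):
    """Return the (low, high) limits beyond which a value counts as an outlier."""
    iqr = upper_quartile - lower_quartile            # interquartile range
    return lower_quartile - 1.5 * iqr, upper_quartile + 1.5 * iqr


# (smallest, lower quartile, upper quartile, largest) from the summaries table
summaries = {
    "Order": (10.0, 27.5, 57.5, 93.0),
    "Education": (9.0, 12.5, 17.0, 21.0),
    "Income": (41000, 65418, 79839, 117370),
}

for name, (smallest, q1, q3, largest) in summaries.items():
    low, high = outlier_fences(q1, q3)
    print(f"{name}: fences {low} to {high}, "
          f"low outlier: {smallest < low}, high outlier: {largest > high}")
```

Under these assumptions Order and Education have no values beyond the fences, while the smallest and largest Income values both fall outside them and would be drawn separately in a detailed box plot.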
Time-Series Analysis
Trend-Seasonal
Trend-seasonal analysis of time series
provides understanding and forecasting of
data that show a repeating pattern, often
quarterly or monthly throughout the year.
There are four components: the trend (long-term, a straight line), the seasonal variation
(repeating each year), the cyclic variation
(medium-term wandering) and the irregular
component (randomness).
StatPad performs trend-seasonal
analysis of quarterly and monthly time-series
data. To show you what is available, StatPad
has a trend-seasonal dialog box that will
allow you to choose a combination of
components to chart or display, and then will return for further analysis of this same time-series
data set.
We will be working with sales numbers for a gift shop that tends to have higher sales in the
fourth quarter due to the holiday season. Data are quarterly from 2007 through 2010, starting in
the first quarter.
To begin trend-seasonal analysis using StatPad's main dialog box, select Time Series as the
situation and Trend-Seasonal as the analysis. Please note that StatPad assumes that time
increases as you move down your column of data. Select your data from the list (or use Add Data
if your column of numbers is in the worksheet but is not in the list), type in the starting year, click
to select Quarterly or Monthly, select the Initial Quarter or Initial Month, check the Output
Range, and then select Do It. You will then see the trend-seasonal dialog box (below).
[Figure: Sales and Forecast against Time, 2007 to 2014.]
[Figure: Sales and Smooth against Time, 2007 to 2011.]
Seasonal Index
The seasonal index shows you the
repeating yearly pattern, centered near the
value 1 (or 100%). A period that is typically
higher than the rest of the year will have a
seasonal index larger than 1. Seasonal
adjustment is done by dividing each data
value by the appropriate seasonal index (for
its month or quarter). Forecasts are obtained
by multiplying the trend by the seasonal
index.
In the chart below, StatPad shows the
seasonal index. It is best to plot the seasonal
index alone, without any of the others,
because its values are near 1 and may be
obscured by the scale of the series itself.
To display the seasonal index values using StatPad's trend-seasonal dialog box, be sure that
Seasonal Index is selected. Select Graph, check the Output Range, and then select Do It. It is
best not to choose any other items when charting the seasonal index.
[Figure: Seasonal (the seasonal index for Sales) against Time, 2007 to 2011.]
[Figure: Sales and Seasonally Adjusted against Time, 2007 to 2011.]
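Seasonal adjustment, as described under Seasonal Index, divides each data value by the index for its quarter. A minimal Python sketch follows. The sales figures and the four index values are taken from the Numeric Output table; the function and variable names are illustrative.

```python
# Quarterly sales, 2007 QI through 2010 QIV (from the Numeric Output table).
sales = [257, 308, 428, 850, 304, 431, 479, 831,
         318, 352, 564, 1255, 398, 472, 745, 1015]

# Seasonal index for QI, QII, QIII, QIV (from the same table).
seasonal_index = [0.599, 0.715, 0.914, 1.766]


def seasonally_adjust(series, index, first_period=0):
    """Divide each value by the seasonal index of its quarter (or month)."""
    periods = len(index)
    return [value / index[(first_period + i) % periods]
            for i, value in enumerate(series)]


adjusted = seasonally_adjust(sales, seasonal_index)
for value, adj in zip(sales, adjusted):
    print(f"{value:6d} -> {adj:7.1f}")
```

Because the index values are rounded to three decimals, the results match the Seasonally Adjusted column only to within a few tenths. The four index values average very nearly 1, consistent with the index being centered near 1.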
Long-Term Trend
The long-term trend summarizes the
basic behavior of the series as a line or very
smooth curve. It is often found by linear
regression (or exponential curve-fitting).
In the chart below, StatPad shows the
sales data series together with its long-term
trend.
To display a chart of the data series with
its long-term trend using StatPad's trend-seasonal
dialog box, be sure that Data Series
and Long-Term Trend are selected. Select
Graph, check the Output Range, and then
select Do It.
StatPad finds the long-term trend by fitting the best straight line using regression.
[Figure: Sales and Trend against Time, 2007 to 2011.]
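The straight-line fit can be sketched in Python. The manual does not say which series the line is fitted to; this sketch assumes a least-squares line through the seasonally adjusted values against the period number 1, 2, 3 and so on. That assumption reproduces the Trend column of the Numeric Output table to within rounding. The function name is illustrative.

```python
# Seasonally adjusted sales, 2007 QI through 2010 QIV (from the Numeric Output table).
adjusted = [428.8, 431.0, 468.1, 481.3, 507.2, 603.1, 523.9, 470.6,
            530.5, 492.5, 616.8, 710.7, 664.0, 660.4, 814.8, 574.8]


def fit_line(y):
    """Least-squares line y = intercept + slope * t for t = 1, 2, ..., len(y)."""
    n = len(y)
    t = range(1, n + 1)
    t_mean = sum(t) / n
    y_mean = sum(y) / n
    slope = (sum((ti - t_mean) * (yi - y_mean) for ti, yi in zip(t, y))
             / sum((ti - t_mean) ** 2 for ti in t))
    return y_mean - slope * t_mean, slope


intercept, slope = fit_line(adjusted)

# Extend the line over 28 quarters: 16 observed plus a three-year forecast horizon.
trend = [intercept + slope * t for t in range(1, 29)]
print(f"slope per quarter: {slope:.2f}")
print(f"first trend value: {trend[0]:.1f}, last trend value: {trend[-1]:.1f}")
```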
Seasonalized Trend
The seasonalized trend is found by
multiplying each long-term trend value by the
appropriate seasonal index value (for its
month or quarter).
In the chart below, StatPad shows the
sales data series together with its
seasonalized trend.
To display a chart of the data series with
its seasonalized trend using StatPad's trend-seasonal
dialog box, be sure that Data Series
and Seasonalized Trend are selected. Select
Graph, check the Output Range, and then
select Do It.
[Figure: Sales and Seasonalized Trend against Time, 2007 to 2011.]
[Figure: Sales, Trend and Forecast against Time, 2007 to 2014.]
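The seasonalized trend and the forecast both come from multiplying a trend value by the seasonal index for its quarter. A minimal Python sketch follows. The trend and index values are taken from the Numeric Output table; the variable names are illustrative.

```python
# Seasonal index for QI, QII, QIII, QIV (from the Numeric Output table).
seasonal_index = [0.599, 0.715, 0.914, 1.766]

# Long-term trend, 2007 QI through 2013 QIV (from the same table).
trend = [424.4, 442.6, 460.9, 479.1, 497.3, 515.6, 533.8, 552.0,
         570.3, 588.5, 606.7, 625.0, 643.2, 661.4, 679.7, 697.9,
         716.1, 734.4, 752.6, 770.8, 789.1, 807.3, 825.5, 843.8,
         862.0, 880.3, 898.5, 916.7]

# Multiply each trend value by the index for its quarter.
seasonalized = [t * seasonal_index[i % 4] for i, t in enumerate(trend)]

# The first 16 quarters (2007 to 2010) overlap the data: the seasonalized trend.
# The remaining 12 quarters (2011 to 2013) extend beyond it: the forecast.
seasonalized_trend = seasonalized[:16]
forecast = seasonalized[16:]

print([round(v, 1) for v in seasonalized_trend[:4]])
print([round(v, 1) for v in forecast[:4]])
```

Because the table's trend and index values are rounded, the products agree with the Seasonalized Trend and Forecast columns only to within a few tenths.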
Numeric Output
You may have a need for numbers as well
as charts for trend-seasonal analysis. All of
the options available for charting are also
there in StatPad for numeric output.
All options have been selected here
(including a three-year forecast) for
StatPad's numeric output, shown below.
To display your numeric-output
combination using StatPad's trend-seasonal
dialog box, make your selections, select
Numbers, check the Output Range, and then
select Do It.
                                        Seasonally         Seasonalized
Year  Quarter  Sales  Smooth  Seasonal    Adjusted  Trend         Trend  Forecast
2007  QI         257       -     0.599       428.8  424.4         254.4         -
2007  QII        308       -     0.715       431.0  442.6         316.3         -
2007  QIII       428   466.6     0.914       468.1  460.9         421.4         -
2007  QIV        850   487.9     1.766       481.3  479.1         846.0         -
2008  QI         304   509.6     0.599       507.2  497.3         298.1         -
2008  QII        431   513.6     0.715       603.1  515.6         368.5         -
2008  QIII       479   513.0     0.914       523.9  533.8         488.1         -
2008  QIV        831   504.9     1.766       470.6  552.0         974.8         -
2009  QI         318   505.6     0.599       530.5  570.3         341.8         -
2009  QII        352   569.3     0.715       492.5  588.5         420.6         -
2009  QIII       564   632.3     0.914       616.8  606.7         554.8         -
2009  QIV      1,255   657.3     1.766       710.7  625.0       1,103.6         -
2010  QI         398   694.9     0.599       664.0  643.2         385.5         -
2010  QII        472   687.5     0.715       660.4  661.4         472.7         -
2010  QIII       745       -     0.914       814.8  679.7         621.5         -
2010  QIV      1,015       -     1.766       574.8  697.9       1,232.4         -
2011  QI           -       -     0.599           -  716.1             -     429.3
2011  QII          -       -     0.715           -  734.4             -     524.8
2011  QIII         -       -     0.914           -  752.6             -     688.1
2011  QIV          -       -     1.766           -  770.8             -   1,361.2
2012  QI           -       -     0.599           -  789.1             -     473.0
2012  QII          -       -     0.715           -  807.3             -     577.0
2012  QIII         -       -     0.914           -  825.5             -     754.8
2012  QIV          -       -     1.766           -  843.8             -   1,490.0
2013  QI           -       -     0.599           -  862.0             -     516.7
2013  QII          -       -     0.715           -  880.3             -     629.1
2013  QIII         -       -     0.914           -  898.5             -     821.5
2013  QIV          -       -     1.766           -  916.7             -   1,618.8
Quality Control
X-Bar, R Charts (No Standard Given)
Control charts, such as X-bar ($\bar{X}$) and R
charts, show you whether your process is in
or out of control. A series of measurements is
divided into subgroups of a fixed size, e.g.,
five at a time. The average and range (largest
minus smallest) are computed for each
subgroup. Each is plotted together with a
central line and control limits (upper and
lower). If the series (averages or ranges) goes
outside the control limits, it indicates that the
process is not in control. However, even if the
series remains within the limits, it can still be
out of control, e.g., if there is a trend that will
clearly soon break out of the limits.
StatPad's X-bar and R charts for the light intensity of laser units (below) show a process in
control. The averages and ranges move seemingly at random within the control limits. The range
chart comes very close to the upper control limit at group 13, but this by itself is not a problem.
The input data consist of a column of 125 individual measurements. StatPad groups and
averages them 5 at a time, as requested. The control limits are computed based only on the data
values because no standard was given.
To create X-bar and R charts using StatPad's main dialog box, select Quality Control as the
situation and X-Bar, R Charts as the analysis. Select your data from the list (or use Add Data if
your column of numbers is in the worksheet but is not in the list). You may optionally change the
Subgroup Size from the default (5) to any whole number from 2 to 25. Next check the Output
Range, and then select Do It. You may, optionally, specify standards for the process mean and
standard deviation (see next item).
StatPad computes control limits according to ASTM-STP 15D, American Society for Testing
and Materials.
[Figure: X-bar and R charts showing Averages of Intensity and Ranges of Intensity against Group Number.]
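The grouping step and the control limits for the no-standard-given case can be sketched in Python. This is an illustration under stated assumptions, not StatPad's own code: it uses the conventional Shewhart factors for subgroups of size 5 (A2 = 0.577, D3 = 0, D4 = 2.114), which are assumed here to match the cited ASTM procedure, and the ten readings are hypothetical.

```python
# Conventional Shewhart control-chart factors, keyed by subgroup size.
# Only size 5 is listed here (an assumption for this sketch); other
# subgroup sizes need their own tabulated factors.
FACTORS = {5: {"A2": 0.577, "D3": 0.0, "D4": 2.114}}


def xbar_r_chart(measurements, subgroup_size=5):
    """Group the measurements and compute X-bar and R chart limits."""
    f = FACTORS[subgroup_size]
    # Split into consecutive complete subgroups; a short final group is dropped.
    groups = [measurements[i:i + subgroup_size]
              for i in range(0, len(measurements) - subgroup_size + 1, subgroup_size)]
    averages = [sum(g) / subgroup_size for g in groups]
    ranges = [max(g) - min(g) for g in groups]       # largest minus smallest
    center_x = sum(averages) / len(averages)         # grand average
    center_r = sum(ranges) / len(ranges)             # average range
    return {
        "averages": averages,
        "ranges": ranges,
        # Each tuple is (lower limit, center line, upper limit).
        "xbar_limits": (center_x - f["A2"] * center_r, center_x,
                        center_x + f["A2"] * center_r),
        "r_limits": (f["D3"] * center_r, center_r, f["D4"] * center_r),
    }


# Hypothetical readings: two subgroups of five (not the laser-intensity data).
readings = [24.8, 24.9, 25.0, 24.7, 24.6, 24.9, 24.8, 24.7, 25.0, 24.6]
chart = xbar_r_chart(readings)
print("X-bar limits:", chart["xbar_limits"])
print("R limits:", chart["r_limits"])
```

A subgroup average or range outside its limits would signal a process that is not in control.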
[Figure: a second pair of X-bar and R charts (Averages of Intensity and Ranges of Intensity against Group Number).]
The center line for percentages is the average percentage $\bar{p}$, and the lower and upper
control limits are found as follows: $\bar{p} - 3\sqrt{\bar{p}(1-\bar{p})/n}$ and $\bar{p} + 3\sqrt{\bar{p}(1-\bar{p})/n}$. For counts,
multiply the center line and control limits by the sample size n.
[Figure: percentage chart (0% to 15%) against Group Number.]
The center line for percentages is the standard $p_0$, and the lower and upper control limits
are found as follows: $p_0 - 3\sqrt{p_0(1-p_0)/n}$ and $p_0 + 3\sqrt{p_0(1-p_0)/n}$. For counts, multiply the
center line and control limits by the sample size n.
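Both versions of the limits, centered on the average percentage or on a given standard, use the same formula. A minimal Python sketch follows; the function names and the example numbers (a center of 10% with samples of size 400) are hypothetical.

```python
import math


def percentage_limits(center, n):
    """Return (lower, center, upper) control limits for percentages.

    center is a proportion between 0 and 1: either the average percentage
    computed from the data or a given standard. n is the sample size.
    """
    half_width = 3 * math.sqrt(center * (1 - center) / n)
    return center - half_width, center, center + half_width


def count_limits(center, n):
    """Same limits expressed as counts: multiply each value by n."""
    return tuple(n * value for value in percentage_limits(center, n))


# Hypothetical example: center line of 10% with samples of size 400.
lower, center, upper = percentage_limits(0.10, 400)
print(f"percentages: {lower:.3f}, {center:.3f}, {upper:.3f}")
print("counts:", count_limits(0.10, 400))
```

For a small center value or a small sample size the lower limit computed this way can be negative. The sketch leaves it as the formula gives it.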
[Figure: percentage chart (0% to 20%) against Group Number.]