© Orangetree Business Solutions Private Limited, 2012-13
No part of this book should be referenced or copied without the prior permission of the company.
A FEW WORDS TO THE STUDENTS
Analytics is becoming a popular tool for managerial decision making. It's still not so widespread in countries like India, but in the West it has become standard practice. Previously, studying analytics involved an in-depth knowledge of statistics and programming languages, but the widespread availability of statistical package software has changed that reality to some extent. Now more emphasis is given to the application of the techniques to solve business problems, so there is a need to understand the meaning of the statistical procedures. This book has been written to cater to that need.
In this book, all the necessary concepts have been explained keeping the business problem in mind. Also, to remove the apathy for statistics, the use of mathematical expressions has been limited. That doesn't imply that we don't have to study the mathematics part; the intention is to put substance first. As the students get accustomed to these statistical concepts, they can go for further investigations using various mathematical and statistical techniques. A list of suggested books and links has been given in the appendix.
This book is directly related to the instructor's presentation, so it is highly advisable that students go through this material at the end of each class. For general reading, the reader is advised to go in the order of the chapters: chapters have been arranged in order of increasing complexity, so the initial chapters are very important.
In this book, the statistical procedures have been implemented in SAS. The explanations of the codes are from the perspective of a data modeler. For the perspective of a programmer, students are advised to go through the documentation of the procedures on the SAS website.
In fine, statistical concepts are a way of thinking. The more you recognize the thinking pattern, the quicker you will learn.
Best of Luck!
Team OTG
CONTENTS

1. Introduction to Analytics and Basic Statistics
2. Introduction to Probability Theory
3. Sampling Theory and Estimation
4. Important Tests of Statistical Significance (Part I)
5. Understanding The Association Between The Variables
6. Important Tests of Statistical Significance (Part II)
7. Exploratory Factor Analysis
8. Cluster Analysis
9. Linear Regression
10. Logistic Regression
11. Time Series Analysis
Appendix: Suggested Books and References
Chapter 1
INTRODUCTION TO ANALYTICS AND BASIC STATISTICS
Business analytics (BA) refers to the skills, technologies, applications and practices for continuous iterative exploration and investigation of past business performance to gain insight and drive business planning.
There are three main categories of analytics:
1. Descriptive: the use of data to find out what happened in the past and what is happening now.
2. Predictive: the use of data to find out what could happen in the future.
3. Prescriptive: the use of data to prescribe the best course of action for the future.
ANALYTICS DOMAINS
1. Retail sales analytics
2. Financial services analytics
3. Risk & Credit analytics
4. Talent analytics
5. Marketing analytics
6. Behavioral analytics
7. Collections analytics
8. Fraud analytics
9. Pricing analytics
10. Telecommunications
11. Supply Chain analytics
12. Transportation analytics

According to the McKinsey Global Institute, the amount of data in our world has been exploding, and analyzing large data sets, so-called big data, will become a key basis of competition, underpinning new waves of productivity growth, innovation, and consumer surplus. MGI studied big data in five domains: healthcare in the United States, the public sector in Europe, retail in the United States, and manufacturing and personal-location data globally.
Big data can generate value in each. For example, a retailer using big data to the full could increase its operating margin by more than 60 percent. Harnessing big data in the public sector has enormous potential, too. If US healthcare were to use big data creatively and effectively to drive efficiency and quality, the sector could create more than $300 billion in value every year. Two-thirds of that would be in the form of reducing US healthcare expenditure by about 8 percent. In the developed economies of Europe, government administrators could save more than €100 billion ($149 billion) in operational efficiency improvements alone by using big data, not including using big data to reduce fraud and errors and boost the collection of tax revenues. And users of services enabled by personal-location data could capture $600 billion in consumer surplus.
Common Business Problems in Telecom:
Customer churn is a common term used both in academia and in practice to denote customers with a propensity to leave for competing companies. According to various estimates, in European mobile service markets the churn rate reaches twenty-five to thirty percent annually. On the other hand, financial analyses and economic studies agree that acquiring a new customer is about five times as expensive as retaining an existing one.
Snapshot of Companies Using Analytics:
MoneyGram International uses analytics to detect and prevent money transfer fraud before it impacts customers. It has prevented more than US$37.7 million in fraudulent transactions and reduced customer fraud complaints by 72 percent.
Primerica provides its representatives with the ability to drill down into sales data in order to increase productivity and boost revenue. Primerica has more than 142,000 licensed sales representatives.
T-Mobile USA uses analytics to detect the influencers in its network and design lucrative customized offers for them. In this way, it reduced its churn rate by 25%.
Dillard's uses analytics to improve its customer relationship management and merchandise management to deliver the right product at the right store.
Seton Healthcare Family uses analytics to detect patients who are at considerable risk.
Del Monte Foods uses analytics to understand macro variables like inflation and how these variables impact the company's cost structure.
Reliance Capital uses analytics to retain customers in its mutual fund business, to ensure continual premium payment in the life insurance business, to design products of high claim ratio in the general insurance business, and, finally, for credit scoring in the mortgage finance business.
Common Business Problems in Retail:
1. Increase customer value and overall revenues
2. Reduce costs and increase operational efficiency
3. Develop successful new products and services
4. Determine profitable sites for new stores and improve existing stores
5. Communicate effectively between departments for better decision making
Types of Data Analysis
Exploratory Data Analysis (EDA) makes few assumptions, and its purpose is to suggest hypotheses and assumptions. For example, an OEM manufacturer was experiencing customer complaints. A team wanted to identify and remove the causes of these complaints. They asked customers for usage data so the team could calculate defect rates. This started an Exploratory Data Analysis. The investigation established that a supplier used the wrong raw material. Discussions with the supplier and team members motivated further analysis of the raw material and its composition. This decision to analyze the raw material completed the Exploratory Data Analysis.
The Exploratory Data Analysis used both data analysis and process knowledge possessed by team members. The supplier and company then conducted a series of designed experiments which identified an improved raw material composition. Using this composition, the defect rate improved from .023% to .004%. The experimental design and its analysis were Confirmatory Data Analysis (CDA). Note that the experimental design required a hypothesis generated by the Exploratory Data Analysis: Exploratory Data Analysis uncovers statements or hypotheses for Confirmatory Data Analysis to consider.
Properties of Measurement
Identity: Each value on the measurement scale has a unique meaning.
Magnitude: Values on the measurement scale have an ordered relationship to one another. That is, some values are larger and some are smaller.
Equal intervals: Scale units along the scale are equal to one another. This means, for example, that the difference between 1 and 2 would be equal to the differ ence between 19 and 20.
Absolute zero: The scale has a true zero point, below which no values exist.
Scales of Measurement
Nominal Scale: The nominal scale of measurement only satisfies the identity property of measurement. Values assigned to variables represent a descriptive category, but have no inherent numerical value with respect to magnitude. Gender is an example of a variable that is measured on a nominal scale. Individuals may be classified as "male" or "female", but neither value represents more or less "gender" than the other. Religion and political affiliation are other examples of variables that are normally measured on a nominal scale.
Ordinal Scale: The ordinal scale has the property of both identity and magnitude. Each value on the ordinal scale has a unique meaning, and it has an ordered relationship to every other value on the scale. An example of an ordinal scale in action would be the results of a horse race, reported as "win", "place", and "show". We know the rank order in which the horses finished the race. The horse that won finished ahead of the horse that placed, and the horse that placed finished ahead of the horse that showed. However, we cannot tell from this ordinal scale whether it was a close race or whether the winning horse won by a mile.
Interval Scale: The interval scale of measurement has the properties of identity, magnitude, and equal intervals. A perfect example of an interval scale is the Fahrenheit scale of temperature. The scale is made up of equal temperature units, so that the difference between 40 and 50 degrees Fahrenheit is equal to the difference between 50 and 60 degrees Fahrenheit. With an interval scale, you know not only whether different values are bigger or smaller, you also know how much bigger or smaller they are. For example, suppose it is 60 degrees Fahrenheit on Monday and 70 degrees on Tuesday. You know not only that it was hotter on Tuesday; you also know that it was 10 degrees hotter.
Ratio Scale: The ratio scale of measurement satisfies all four of the properties of measurement: identity, magnitude, equal intervals, and an absolute zero. The weight of an object would be an example of a ratio scale. Each value on the weight scale has a unique meaning, weights can be rank ordered, units along the weight scale are equal to one another, and there is an absolute zero. Absolute zero is a property of the weight scale because objects at rest can be weightless, but they cannot have negative weight.
Types of Data
Quantitative Data: In most cases, we will find ourselves using numeric data, that is, data that consist of numbers.
Delivery Time in Minutes
19  10  17  15  18  16
12  16  16  18  15  15
16  18  13  15  19  17
14  10  13  12  13  16
Qualitative Data: The other type of data is string-type data. A string is simply a line of text and could represent comments about a certain participant, or other information that you don't wish to analyze as a grouping variable.
Cube # | Touch          | See    | Smell
1      | Rough          | Brown  | Wood
2      | Rough          | Silver | Metallic
3      | Slightly Rough | Silver | Metallic
4      | Smooth         | Gold   | No Smell
5      | Smooth         | Brown  | No Smell
6      | Smooth         | Brown  | No Smell
7      | Rough          | Brown  | Wood
8      | Smooth         | Gold   | No Smell
Categorical Data: The third type of data is categorical data, represented by a grouping variable. For example, you insert a variable called gender and insert 'Male' or 'Female' under this variable as observations. In this case we can group the entire data with respect to gender. Here gender is a grouping variable.
Color  | Number of Items
Brown  | 4
Gold   | 2
Silver | 2
Presentation of Qualitative Data
Tabular Presentation
Subcategory | Frequency | Percent | Cumulative Frequency | Cumulative Percent
Chocolate   | 491       | 32.73   | 491                  | 32.73
Fruit       | 170       | 11.33   | 661                  | 44.07
Gum         | 194       | 12.93   | 855                  | 57.00
Mixed       |  92       |  6.13   | 947                  | 63.13
Soft        | 365       | 24.33   | 1312                 | 87.47
Sweet       | 188       | 12.53   | 1500                 | 100.00
Graphical Presentation
Simple Bar Chart
Pie Chart
Horizontal Bar Chart: Good for geographical data
Stacked Bar Chart: Good for intra-analysis
Multiple Bar Chart: Good for inter-analysis

Presentation of Quantitative Data
Tabular Presentation
Graphical Presentation
Histogram: Understanding the distribution of the data
Scatter Plot: Understanding the relationship between two numerical variables
Various Types of Scatter Plots
Positively Related: as one increases, the other also increases
Negatively Related: as one increases, the other decreases
Undefined: no clear relation

Measure of Central Tendency
The vice president of marketing of a fast-food chain is studying the sales performance of the 100 stores in the eastern part of the country. He has constructed the following frequency distribution of annual sales:
Sales (000s) | Frequency     Sales (000s) | Frequency
700 - 799    |  4            1300 - 1399  | 13
800 - 899    |  7            1400 - 1499  | 10
900 - 999    |  8            1500 - 1599  |  9
1000 - 1099  | 10            1600 - 1699  |  7
1100 - 1199  | 12            1700 - 1799  |  2
1200 - 1299  | 17            1800 - 1899  |  1
He would be looking at the distribution with an eye toward getting information about the central tendency, to compare the eastern part with other parts of the country. Central tendency is basically the central-most value of a distribution. Now how do we know which one is the central-most value? There are precisely three ways to find the central value: Arithmetic Mean, Median and Mode.
The arithmetic mean is the simple average of the data. The problem with the arithmetic mean is that it is influenced by extreme values. Suppose you take a sample of 10 persons whose monthly incomes are 10k, 12k, 14k, 12.5k, 14.2k, 11k, 12.3k, 13k, 11k and 10k. The average income turns out to be 12k, which is a good representation of the data. Now if you replace the last value with 100k, the average turns out to be 21k, which is quite misleading, as 9 out of 10 people earn way below that mark. This problem of the arithmetic mean can be reduced through the use of the Geometric and Harmonic means. But the effect of outliers can be almost nullified by the use of the Median.
The median is the mark where the entire data is split into exact halves, that is, 50% of the data lie above the mark and the rest lie below. In an intuitive sense, it is the proper measure of central tendency, but for various computational reasons the arithmetic mean is the most popular measure. Whereas the median looks for the half mark, the Mode looks for the value with the highest frequency, that is, the highest number of occurrences.
So using central tendency, we are trying to find a value around which all the data cluster. This property of the data can be used to deal with missing values. Suppose some of the income data is missing; then you can replace the missing values with the mean or the median. If a city name is missing, one may replace it with the mode, that is, the city which appears most often.
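A minimal SAS sketch of this idea, assuming a hypothetical data set day1.incomes with a numeric variable income: compute the mean and median with PROC MEANS and substitute the median for the missing values in a data step.

proc means data=day1.incomes noprint;
  var income;
  output out=stats mean=mean_income median=med_income;
run;

data day1.incomes_filled;
  if _n_ = 1 then set stats;   /* carry the summary statistics onto every row */
  set day1.incomes;
  if missing(income) then income = med_income;   /* median imputation */
  drop mean_income med_income _type_ _freq_;
run;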
Measures of Dispersion
As the name says, here we are trying to assess how dispersed the data is. A measure of central tendency without any idea about the measures of dispersion doesn't make much sense. Why is that so? Look at the following charts.
The horizontal line is the central value in both cases. In the first case, where the data is less dispersed, the data is really clustered around the central line. In the second case, the data is so dispersed that the central value is not that meaningful, as you cannot say that the horizontal line is a true representative of the data. So there is a need to measure the dispersion in the data.
Broadly there are two kinds of measures of dispersion: absolute measures like Range or Variance, and relative measures like the Coefficient of Variation. Range is the simplest measure; it is basically the difference between the maximum and the minimum value in the data. The other absolute measure, Variance, is a bit complicated to express in plain words: it basically comes from the sum of the squared differences of each data point from the arithmetic mean of the data. Now, as you go on increasing the number of data points the sum keeps increasing, so we take the average. If you then take the square root (e.g. the square root of 9 is 3), we get the Standard Deviation of the data. If you like, you can memorize the following expression for the sample standard deviation:

s = √[ Σ (xᵢ − x̄)² / (n − 1) ]

Some of you might find difficulties with the denominator being n − 1 instead of n. The reason is that here we are calculating the sample standard deviation. If it had been the population standard deviation, we would have used n.
We will discuss populations and samples in the coming chapters.
Apart from understanding the dispersion in the data, the standard deviation can be used for transforming the data. Suppose we want to compare two variables like the amount of money persons earn and the number of pairs of shoes their wives have; then it is better to express those data in terms of standard deviations. That is, we simply divide the data by their respective standard deviations. So here the standard deviation acts as a unit, or we make the data unit-free.
Now, if you want to understand which data is more volatile, personal income or pairs of shoes, you should use the Coefficient of Variation. As mentioned earlier, it is a relative measure of dispersion and is expressed as the standard deviation per unit of central value, i.e. the mean. If you have income in dollar terms and income in rupee terms, and the first data has a lower coefficient of variation than the second one, use the first data for analysis. You will find more meaningful information.
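PROC MEANS can report both measures side by side; a small sketch, again with the hypothetical day1.incomes (the variables income and shoes are assumed names):

proc means data=day1.incomes n mean std cv;
  var income shoes;   /* cv allows comparing volatility across differently scaled variables */
run;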
Measures of Location
Using Measures of Location, we can get a bird's-eye view of the data. The Measures of Central Tendency also come under the Measures of Location. Minimum and maximum are also measures of location. Other measures are Percentiles, Deciles, and Quartiles. For example, if the 90th percentile is the number 86, then it is implied that 90% of the students have got marks less than 86. The 90th percentile is also the 9th Decile.
For quartiles, we are basically dividing the total data into four equal parts. So we are looking for 3 points: Q1, Q2, and Q3. The other name for Q2 is the Median. So we have 25% of the data below Q1, 25% between Q1 and Q2, similarly 25% between Q2 and Q3, and, finally, the remaining 25% above Q3.
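A sketch of how these measures can be requested from PROC UNIVARIATE, assuming a hypothetical data set day1.scores with a variable marks:

proc univariate data=day1.scores noprint;
  var marks;
  output out=locs pctlpts=10 25 50 75 90 pctlpre=P_;
  /* creates P_10, P_25 (=Q1), P_50 (=median), P_75 (=Q3), P_90 */
run;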
Statistics Related to The Shape of The Distribution
As we look at the shape of the histogram of numeric data, we gain various understandings about the distribution of the data. There are two statistics related to the shape of the distribution: Skewness and Kurtosis. If the distribution has a longer left tail, the data is negatively skewed; the opposite holds for positively skewed data. So we are basically detecting whether the data is symmetric about the central value of the distribution. In options markets, the difference in implied volatility at different strike prices represents the market's view of skew and is called volatility skew. (In pure Black–Scholes, implied volatility is constant with respect to strike and time to maturity.) Skewness causes skewness risk in statistical models that are built out of variables which are assumed to be symmetrically distributed. Kurtosis, on the other hand, measures the peakedness of the distribution as well as the heaviness of the tail. Generally, heavy-tailed distributions don't have a finite variance; in other words, we cannot calculate the variance for these distributions. Now if we assume that the distribution is not heavy-tailed and build the model on this assumption, it can lead to kurtosis risk in the model. For instance, Long-Term Capital Management, a hedge fund co-founded by Myron Scholes, ignored kurtosis risk to its detriment: after four successful years, this hedge fund had to be bailed out by major investment banks in the late 90s because it understated the kurtosis of many financial securities underlying the fund's own trading positions. There can be several situations, as shown in the chart. The value of kurtosis for a Mesokurtic distribution is zero; for a Platykurtic distribution it's negative, and for a Leptokurtic distribution it's positive. Kurtosis is sometimes referred to as the volatility of volatility, or the risk within risk.
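Both statistics can be obtained from PROC MEANS (note that SAS reports kurtosis relative to the normal distribution, so values near zero suggest a mesokurtic shape); a sketch with the same hypothetical day1.scores:

proc means data=day1.scores n mean std skew kurt;
  var marks;   /* skew and kurt near zero suggest an approximately normal shape */
run;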
Box Plot for Detecting Outliers
An outlier is a score very different from the rest of the data. When we analyze data we have to be aware of such values because they bias the model we fit to the data. A good example of this bias can be seen by looking at a simple statistical model such as the mean. Suppose a film gets ratings from 1 to 5. Seven people saw the film and rated it with ratings of 2, 5, 4, 5, 5, 5, and 5. All but one of these ratings are fairly similar (mainly 5 and 4), but the first rating was quite different from the rest: it was a rating of 2. This is an example of an outlier.
Box plots tell us something about the distribution of scores. A boxplot shows the lowest score (the bottom horizontal line) and the highest (the top horizontal line). The distance between the lowest horizontal line and the lowest edge of the tinted box is the range within which the lowest 25% of scores fall (the bottom quartile). The box (the tinted area) shows the middle 50% of scores, known as the interquartile range; i.e. 50% of the scores are bigger than the lowest part of the tinted area but smaller than the top part of the tinted area. The distance between the top edge of the tinted box and the top horizontal line shows the range within which the top 25% of scores fall (the top quartile). In the middle of the tinted box is a slightly thicker horizontal line; this represents the value of the median. Like histograms, box plots also tell us whether the distribution is symmetrical or skewed: for a symmetrical distribution, the whiskers on either side of the box are of equal length. Finally, you will notice some small circles above each boxplot. These are the cases that are deemed to be outliers; each circle has a number next to it that tells us in which row of the data editor to find the case.
Correcting Problems in the data
Generally we find problems related to the distribution or outliers while exploring the data. Suppose you detect outliers in the data. There are several options for reducing the impact of these values. However, before you do any of these things, it's worth checking whether the data you have entered is correct. If the data are correct, then the three main options you have are:
Remove the Case: This entails deleting the data from the person who contributed the outlier. However, this should be done only if you have good reason to believe that this case is not from the population that you intended to sample. For example, if you were investigating factors that affect how much babies cry and one baby didn't cry at all, this would likely be an outlier. Upon inspection, if you discovered that this "baby" was actually a 10-year-old boy, then you would have grounds to exclude the case, as it comes from a different population.
Transform the data: If you have a non-normal distribution then this should be done anyway (and skewed distributions will by their nature generally have outliers, because it's these outliers that skew the distribution). Such transformations should reduce the impact of the outliers. For transformation we use the compute variable facility; a short SAS sketch follows the three transformations below.
Log Transformation (log Xᵢ): Taking the logarithm of a set of numbers squashes the right tail of the distribution. However, you cannot take the log of zero or of negative numbers, so if your data tend to zero or contain negative numbers you need to add a constant to all the data before the transformation.
Square root transformation (√Xᵢ): Taking the square root of large values has more of an effect than taking the square root of small values. Consequently, taking the square root of each of your scores will bring large scores closer to the center. So this can be a very useful way to reduce positively skewed data. But we still have the problem with negative numbers.
Reciprocal transformation (1/Xᵢ): Dividing 1 by each of the scores reduces the impact of large scores. The transformed variable will have a lower limit of zero. One thing to bear in mind with this transformation is that it reverses the scores: scores that were originally large in the data set become small after the transformation, and scores that were originally small become big.
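A minimal data-step sketch of the three transformations, again using the hypothetical day1.scores and marks:

data day1.scores_tx;
  set day1.scores;
  if marks > 0 then log_marks = log(marks);      /* log: squashes the right tail */
  if marks >= 0 then sqrt_marks = sqrt(marks);   /* square root: pulls large scores in */
  if marks ne 0 then recip_marks = 1 / marks;    /* reciprocal: reverses and shrinks large scores */
run;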
Change the score: If transformation fails, then you can consider replacing the score. On the face of it this may seem like cheating (you are changing the data from what was actually collected); however, if the score you're changing is very unrepresentative and biases your statistical model anyway, then changing the score is helpful. There are several options for how to change the score. The first is the next highest value plus one. Alternatively, we can replace outliers with the mean plus three times the standard deviation derived from the rest of the data; a variation of this method uses two instead of three times the standard deviation. A sketch of the mean-plus-three-standard-deviations rule follows.
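A simplified sketch (for brevity the statistics here are computed from all the data, whereas the text computes them from the data excluding the outliers; day1.scores and marks remain hypothetical names):

proc means data=day1.scores noprint;
  var marks;
  output out=ms mean=m std=s;
run;

data day1.scores_capped;
  if _n_ = 1 then set ms;
  set day1.scores;
  if marks > m + 3*s then marks = m + 3*s;   /* cap high outliers */
  if marks < m - 3*s then marks = m - 3*s;   /* cap low outliers */
  drop m s _type_ _freq_;
run;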
SAS IMPLEMENTATION
BAR GRAPH
proc gchart data=day1.candy_sales_summary;
  vbar subcategory;
run;
gchart is the procedure to generate bar charts. The data set used here is candy_sales_summary. The bar chart is generated using the keyword vbar. This presentation is used to represent the qualitative variable subcategory. The code generates a bar graph showing the frequency of occurrence of the different subcategories.
proc gchart data=day1.candy_sales_summary;
  vbar3d subcategory;
run;
This code generates a 3D bar graph for 'subcategory'. This is a better form of representing qualitative data. 'vbar3d' is the keyword for generating a three-dimensional bar graph.
proc gchart data=day1.candy_sales_summary;
  hbar3d subcategory;
run;
This code generates a horizontal 3D bar graph for the variable 'subcategory'. 'hbar3d' is the keyword for generating the horizontal 3D bar graph. This form of representing the data is useful when we are representing spatial data.
proc gchart data=day1.candy_sales_summary;
  vbar3d subcategory / sum sumvar=sale_amount;
run;
This code generates a 3D vertical bar graph for the variable 'subcategory', but corresponding to each bar it also displays the total sale amount on top of the bar.
proc gchart data=day1.candy_sales_summary;
  vbar3d subcategory / sumvar=sale_amount;
run;
This code produces the same output as the code above but does not display the sum on top of each bar; the 'sum' keyword is responsible for that display.
proc gchart data=day1.candy_sales_summary;
  vbar3d subcategory / sum sumvar=sale_amount group=fiscal_year subgroup=fiscal_quarter;
run;
This code generates a subdivided multiple bar diagram. The group= option groups the bars by 'fiscal_year' and shows the sales corresponding to each subcategory for a given fiscal year, while subgroup= subdivides each bar by 'fiscal_quarter'.
goptions vsize=6in hsize=20in;
Graphs are drawn according to the dimensions specified with the 'goptions' keyword. This is a global statement which holds throughout the rest of the session: every graph constructed by the software from here on will have these dimensions.
proc gchart data=day1.candy_sales_summary;
  vbar3d subcategory / sum sumvar=sale_amount group=fiscal_year subgroup=fiscal_quarter;
run;
The multiple bar diagram generated by the earlier code can appear very cramped on screen. To make it look better, we need to space the bars out, and this is done by specifying the margins for the vertical and horizontal axes with goptions and re-running the procedure. The goptions statement is global, in the sense that any graphical representation from here onwards takes these dimensions as given.
PIECHART
proc gchart data=day1.candy_sales_summary;
  pie3d subcategory;
run;
This code generates a three-dimensional pie chart using the keyword 'pie3d'; 'gchart' is the procedure that generates the chart. The pie chart represents each 'subcategory' as a slice, i.e. as a percentage of 360 degrees.
proc gchart data=day1.candy_sales_summary;
  pie3d subcategory / discrete value=inside;
run;
This is a variation of the previous pie chart. It generates a pie chart where the discrete frequency value of each 'subcategory' is placed inside its slice: 'value=inside' keeps the frequency values inside the slices along with the names of the subcategories. Each subcategory is shown as a slice of a different color.
proc gchart data=day1.candy_sales_summary;
  pie3d subcategory / discrete value=inside percent=inside slice=outside;
run;
This code generates the pie chart such that the frequency value and the percentage frequency of each subcategory appear inside the slice, while the name of the subcategory appears outside the slice.
proc gchart data=day1.candy_sales_summary;
  pie3d subcategory / discrete value=inside percent=inside slice=outside freq=sale_amount;
run;
Here the freq= option weights each slice by 'sale_amount', so the pie shows each subcategory's share of total sales rather than its share of record counts. As before, the values and percentages are shown inside the slices and the subcategory names outside.
HISTOGRAM
proc univariate data=day1.candy_sales_summary;
  var sale_amount;
  histogram sale_amount;
run;
This is the representation of quantitative data. The 'univariate' procedure generates all the key descriptive statistics related to a particular variable; here, the variable under consideration is 'sale_amount'. The keyword to generate the histogram is 'histogram'. If no dimension is mentioned, then by default it is a two-dimensional diagram.
proc univariate data=day1.candy_sales_summary noprint;
  var sale_amount;
  histogram sale_amount;
  class subcategory;
run;

The 'univariate' procedure computes the descriptive statistics associated with the variable 'sale_amount' in the data set candy_sales_summary ('noprint' suppresses the printed tables). The other objective of the code is to construct a histogram for the same variable using the keyword 'histogram'. A separate histogram of 'sale_amount' is generated for each of the subcategories, which is specified using the keyword 'class'.
SCATTER PLOT
proc gplot data=day1.candy_sales_summary;
  plot sale_amount*units;
run;
'gplot' is the procedure to generate a plot of two quantitative variables. The scatter plot for the two variables sale_amount and units is generated using the keyword 'plot'. The variable on the left-hand side of the * goes on the y-axis and the variable on the right-hand side goes on the x-axis.
NORMALITY CHECK
proc univariate data=day1.class;
  var height;
run;
The 'univariate' procedure generates all the descriptive statistics associated with the variable 'height' in the data set 'class'. The descriptive statistics associated with a distribution help in identifying whether the distribution is normal. Normality of a distribution implies an element of symmetry in the distribution. In this data set the mean, median and mode are approximately 62. The standard deviation is pretty 'low' (about 5) compared to the mean, and the skewness and kurtosis of the data set lie in the neighborhood of zero. A basic analysis yields the result that the variable 'height' is approximately normally distributed in the data set 'class'.
proc univariate data=day1.class normal plot;
  var height;
  qqplot height / normal (mu=est sigma=est color=green);
run;
The Q-Q plot (quantile-quantile plot) is an alternate technique for examining whether a variable is normally distributed. The 'normal' and 'plot' options request normality tests and diagnostic plots for the variable. The keyword 'qqplot' generates a plot which compares a hypothetical normal line (with estimated mean and standard deviation) against the actual points of the distribution. If the actual points lie along the green normal line, then normality of the variable holds.
proc univariate data=day1.candy_sales_summary normal plot;
  var sale_amount;
  qqplot sale_amount / normal (mu=est sigma=est color=green);
run;
This is the same code executed for a different data set: candy_sales_summary. The mean of the variable sale_amount (4951.97) is significantly different from its median (4040.525) and mode (0.00). Also, the average fluctuation in the data set, represented by the standard deviation, is very high (3986). This means that the mean is not a 'good' representative value for the data set, as there is very high fluctuation in the data. It is easy to conclude that the variable sale_amount is not normally distributed.
BOXPLOT AND THE EXISTENCE OF OUTLIERS
The quality of the measures of central tendency and dispersion is affected adversely by the presence of outliers. The boxplot is widely used to examine the existence of outliers in a data set. Our reference data set is a hypothetical one consisting of students' marks and the name of the subject. Two important facts must be kept in mind for a boxplot:
The number of observations in the data set must be at least five.
If there is more than one category, the data set must be sorted according to the category.
proc import datafile="C:\Documents and Settings\OrangeTree\Desktop\Tanmoy\Book1.csv"
  out=day1.boxplot
  dbms=csv replace;
run;

A data set containing the marks of 5 students in the subjects English and Maths exists in CSV format. The file is imported into the SAS library using 'proc import'. The logic of this code is to import a given file from its existing format, convert it to SAS format, and replace any previously imported file of the same name.
proc boxplot data=day1.boxplot;
  plot marks*subject / boxstyle=schematic;
run;

'boxplot' is the keyword for generating a boxplot. The plot is drawn between the marks obtained by the students and the subject. Outliers in the data set appear as points outside the box. 'boxstyle' is a keyword to generate a particular format of the boxplot.
Chapter 2
INTRODUCTION TO PROBABILITY THEORY
Future events are far from certain in the business world. Most managers who use probabilities are concerned with two conditions:
The case when one event or another will occur
The situation where two or more events will both occur
We are interested in the first case when we ask, "What is the probability that today's demand will exceed our inventory?" To illustrate the second situation, we could ask, "What is the probability that today's demand will exceed our inventory and that more than 10% of our sales force will not report for work?" Probability is used throughout business to evaluate financial and decision-making risks. Every decision made by management carries some chance of failure, so probability analysis is conducted both formally ("math") and informally ("I hope"). Consider, for example, a company thinking of entering a new business line. If the company needs to generate $500,000 in revenue in order to break even, and its probability distribution tells it that there is a 10 percent chance that revenues will be less than $500,000, the company knows roughly what level of risk it is facing if it decides to pursue that new business line.
Three Approaches Towards Probability
Classical Approach:
P(A) = (Number of outcomes where event A occurs) / (Total number of possible outcomes)

Relative Frequency Approach:
P(A) = (Number of times event A has occurred) / (Total number of trials)

Axiomatic Approach:
A) 0 ≤ P(A) ≤ 1, for every event A
B) ∑ P(Aᵢ) = 1, summing over the mutually exclusive and exhaustive events Aᵢ
Apart from all these, there is the concept of subjective probability. It is basically based on an individual's past experience and intuition. Most higher-level social and managerial decisions are concerned with specific, unique situations; decision makers at this level make considerable use of subjective probability.
Concept of Random Variable
Informally, a random variable is the value of a measurement associated with an experiment, e.g. the number of heads in n tosses of a coin. More formally, a random variable is defined as follows:
A random variable over a sample space is a function that maps every sample point (i.e. outcome) to a real number. The picture shown has all the outcomes when two dice are rolled. We can define a random variable X as the sum of the points appearing on the two dice. Then X can assume values from 2 to 12, and each of these numbers represents a set of outcomes; elements of the same set share the same outer color, e.g. for X = 5 we have the outcomes in the yellow boxes. Based on the events that we have, there can be two types of random variables: discrete random variables and continuous random variables. In the previous example, we are talking about a discrete random variable. Again, Jon Brower Minnoch had a weight of 635 kg; let's say this is the upper limit of human weight. So the weight of a person lies between 0 and 635 kg, and here the random variable weight is continuous.
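To make the idea concrete, the dice example can be simulated in SAS; a small sketch:

data dice;
  do i = 1 to 10000;
    x = ceil(6*ranuni(12345)) + ceil(6*ranuni(12345));   /* X = sum of points on two fair dice */
    output;
  end;
run;

proc freq data=dice;
  tables x;   /* relative frequencies approximate P(X = 2), ..., P(X = 12) */
run;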
Probability Mass Function
Probability mass function (pmf) is a function that gives the probability that a discrete random variable is exactly equal to some value. The probability mass function is often the primary means of defining a discrete probability distribution. Suppose that S is the sample space of all outcomes of a single toss of a fair coin, and X is the random variable defined on S assigning 0 to "tails" and 1 to "heads". Since the coin is fair, the probability mass function is given by

f(x) = P(X = x) = 1/2 for x = 0, 1.

The probability mass function of a fair die has been shown in the chart: each of the six numbers on the die has an equal chance (1/6) of appearing on top when the die is rolled.
Probability Density Function
Probability density function (pdf), or density, of a continuous random variable is a function that describes the relative likelihood for this random variable to take on a given value. The probability for the random variable to fall within a particular region is given by the integral of this variable's density over the region. If f(x) is the density function, then the probability that X falls between a and b is given by

P(a ≤ X ≤ b) = ∫[a, b] f(x) dx

If you put this concept into a chart, it represents the area under the probability density function curve between a and b.
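In SAS this area is available through the CDF function, since P(a ≤ X ≤ b) = F(b) − F(a); a small sketch for a standard normal variable:

data _null_;
  a = 0; b = 1;
  p = cdf('normal', b, 0, 1) - cdf('normal', a, 0, 1);   /* P(0 <= X <= 1) */
  put p=;   /* about 0.3413 */
run;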
Suppose you toss a coin 10 times and get 7 heads. "Hmm, strange," you say. You then ask a friend to try tossing the coin 20 times; she gets 15 heads and 5 tails. So you have, in all, 22 heads and 8 tails out of 30 tosses. What did you expect? Was it something close to 15 heads and 15 tails (half and half)? Now suppose you turn the tossing over to a machine and get 792 heads and 208 tails out of 1000 tosses of the same coin. Then you might be suspicious of the coin, because it didn't live up to what you expected. To obtain the expected value of a discrete random variable, we multiply each value that the random variable can assume by the probability of occurrence of that value and sum these products, as in the table below. Again, remember that an expected value of 108.02 doesn't imply that tomorrow exactly 108.02 patients will visit the clinic.
Number of Patients (1) | Probability (2) | (1) x (2)
100 | 0.01 |  1.00
101 | 0.02 |  2.02
102 | 0.03 |  3.06
103 | 0.05 |  5.15
104 | 0.06 |  6.24
105 | 0.07 |  7.35
106 | 0.09 |  9.54
107 | 0.10 | 10.70
108 | 0.12 | 12.96
109 | 0.11 | 11.99
110 | 0.09 |  9.90
111 | 0.08 |  8.88
112 | 0.06 |  6.72
113 | 0.05 |  5.65
114 | 0.04 |  4.56
115 | 0.02 |  2.30
Expected Number of Patients: 108.02
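The same computation can be written as a short data step; a sketch assuming a hypothetical data set patients with variables n_patients and prob holding columns (1) and (2):

data _null_;
  set patients end=last;
  ev + n_patients*prob;   /* sum statement: accumulates value times probability */
  if last then put 'Expected number of patients: ' ev;   /* 108.02 for the table above */
run;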
Probability Distributions
Probability distributions are related to frequency distributions. We can think of a probability distribution as a theoretical frequency distribution: a distribution that describes how outcomes are expected to vary. Because these distributions deal with expectations, they are useful models for making inferences and decisions under conditions of uncertainty. A probability distribution is a listing of the probabilities of all the possible outcomes that could result if the experiment were done.
As the random variable is of two types, probability distributions are likewise of two types, namely discrete and continuous. The probability distribution for the sum of points on two dice rolled is as follows:

X (sum)  2     3     4     5     6     7     8     9     10    11    12
P(X)     1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36
Common Probability Distributions
Related to real-valued quantities that grow linearly (e.g. errors, offsets): Normal Distribution
Related to positive real-valued quantities that grow exponentially (e.g. prices, incomes, populations): Lognormal Distribution, Pareto Distribution
Related to real-valued quantities that are assumed to be uniformly distributed over a (possibly unknown) region: Uniform Distribution
Related to Bernoulli trials (yes/no events, with a given probability): Bernoulli Distribution, Binomial Distribution
Related to events in a Poisson process (events that occur independently at a given rate): Poisson Distribution, Exponential Distribution
Binomial Distribution
The binomial distribution describes discrete data resulting from an experiment known as a Bernoulli process. The tossing of a fair coin a fixed number of times is a Bernoulli process, and the outcomes of such tosses can be represented by the binomial probability distribution. The success or failure of interviewees on an aptitude test may also be described by a Bernoulli process. On the other hand, the frequency distribution of the lives of fluorescent lights in a factory would be measured on a continuous scale of hours and would not qualify as a binomial distribution. The probability mass function, the mean and the variance are as follows:

P(X = k) = C(n, k) * p^k * (1 - p)^(n - k), for k = 0, 1, ..., n
Mean = np, Variance = np(1 - p)

where n is the number of trials, p is the probability of success in a single trial, and C(n, k) is the number of ways of choosing k successes out of n trials.
Characteristics of a Binomial Distribution
There can be only two possible outcomes: heads or tails, yes or no, success or failure.
Each Bernoulli process has its own characteristic probability. Take the situation in which, historically, seven-tenths of all people who applied for a certain type of job passed the job test. We would say that the characteristic probability here is 0.7, but we could describe our testing results as Bernoulli only if we felt certain that the proportion of those passing the test (0.7) remained constant over time.
At the same time, the outcome of one test must not affect the outcome of the other tests.
Poisson Distribution
The Poisson distribution is used to describe a number of processes, including the distribution of telephone calls going through a switchboard system, the demand of patients for service at a health institution, the arrivals of trucks and cars at a tollbooth, and the number of accidents at an intersection. These examples all have a common element: they can be described by a discrete random variable that takes on integer values (0, 1, 2, 3, 4, and so on). The number of patients who arrive at a physician's office in a given interval of time will be 0, 1, 2, 3, 4, 5, or some other whole number. Similarly, if you count the number of cars arriving at a tollbooth on a highway during a 10-minute period, the number will be 0, 1, 2, 3, 4, 5, and so on. The probability mass function, the mean and the variance are as follows:

P(X = k) = (λ^k * e^(-λ)) / k!, for k = 0, 1, 2, ...
Mean = λ, Variance = λ

where λ is the average number of occurrences per interval.
Characteristics of a Poisson Distribution
If we consider the example of the number of cars, then the average number of vehicles that arrive per rush hour can be estimated from past traffic data.
If we divide the rush hour into intervals of one second each, we will find the following statements to be true:
The probability that exactly one vehicle will arrive at the single booth per second is a very small number and is constant for every one-second interval.
The probability that two or more vehicles will arrive within a one-second interval is so small that we can assign it a zero value.
The number of vehicles that arrive in a given one-second interval is independent of the time at which that one-second interval occurs during the rush hour.
The number of arrivals in any one-second interval is not dependent on the number of arrivals in any other one-second interval.
Normal Distribution
The normal distribution has applications in many areas of business administration. For example:
Modern portfolio theory commonly assumes that the returns of a diversified asset portfolio follow a normal distribution.
In operations management, process variations often are normally distributed.
In human resource management, employee performance sometimes is considered to be normally distributed.
The probability density function, mean, and variance are given by

f(x) = (1 / (σ√(2π))) * e^(-(x - μ)² / (2σ²)), for -∞ < x < ∞
Mean = μ, Variance = σ²
Is The Distribution Normal?
The following conditions should be satisfied by a distribution in order for it to be a normal distribution:
The mean, median and mode should be almost equal
The standard deviation should be low
Skewness and kurtosis should be close to zero
Median should lie exactly in between the upper and lower quartile
Normal Probability Plot
The normal probability plot is a graphical technique for normality testing: assessing whether or not a data set is approximately normally distributed. Here we are basically comparing the observed cumulative probability with the theoretical cumulative probability. If the observed data really come from a normal distribution, then we should get a straight line, as shown in the chart.
Q-Q Plot
The points in this graph are obtained by inverting the cumulative distribution function. Here we are comparing the points of the observed distribution to those of the theoretical distribution at the same probability level. Here again, if the data come from the theoretical distribution, the plot will be a straight line.
Standard Normal Distribution
It is a normal distribution with mean μ = 0 and standard deviation σ = 1.
For a normal distribution, 68.2% of the data lies within the range (mean - standard deviation, mean + standard deviation).
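This figure can be checked with the CDF function; a small sketch:

data _null_;
  p = cdf('normal', 1) - cdf('normal', -1);   /* P(-1 <= Z <= 1) for the standard normal */
  put p=;   /* about 0.6827 */
run;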
SAS IMPLEMENTATION
BINOMIAL DISTRIBUTION
data binom;
  binom_prob = pdf('binomial', 50, 0.6, 100);
run;
This code computes the probability of getting fifty successes in 100 trials of a binomial experiment where the probability of success in a single trial is 0.6. The code creates a new data set named 'binom' in the work library; the data set can also be created in a permanent library by assigning a library name. binom_prob is the variable that stores the probability associated with the above-mentioned outcome. 'pdf' stands for probability density function; it returns the probability associated with a given outcome, given the parameters of the distribution. 'pdf' is the general command for calculating the probabilities associated with various points of a distribution (be it discrete or continuous), since SAS does not recognize 'pmf' (probability mass function).
data binom_plot;
  do x = 0 to 20;
    binom_prob = pdf('binomial', x, 0.5, 20);
    output;
  end;
run;
This code generates a schedule of probabilities associated with the various numbers of successes in the binomial distribution. A loop is created for generating the schedule: the number of successes is kept variable within the loop, while the parameters of the distribution, namely the number of trials (20) and the probability of success (0.5), are fixed. The loop is terminated using the 'end' keyword, and the keyword 'output' writes out the result of each iteration.
proc gplot data=binom_plot;
  plot binom_prob*x;
run;
This command directly plots the binomial probability distribution with probabilities on the vertical axis and the number of successes on the horizontal axis.
data binom_plot;
  do x = 0 to 20;
    binom_prob = pdf('binomial', x, 0.3, 20);
    output;
  end;
run;

This is the command to generate the binomial probability distribution for 0 to 20 successes with a much lower probability of success. Examining the nature of the distribution over changing values of the probability of success gives us a fair idea of the skewness of the distribution. If the probability of obtaining a success in a particular trial is 'low', then the chance of getting very many successes is 'low' and that of getting few successes is very high: given this specification of the parameters, the distribution is positively skewed.
proc gplot data=binom_plot;
  plot binom_prob*x;
run;
The command plots the binomial probability distribution for the newly specified parameters, with the probabilities on the vertical axis and the number of successes on the horizontal axis. The graphical representation displays the varying nature of skewness in the distribution very distinctly.
POISSON DISTRIBUTION
data day1.poisson;
  pois_prob = pdf('Poisson', 12, 10);
run;
This is a data step where a data set named 'poisson' is created in the permanent library day1. The syntax of the 'pdf' function for the Poisson distribution is as follows:
New variable = pdf('Poisson', number of events, mean λ). This code calculates the probability of observing exactly 12 events when the mean of the Poisson distribution is 10.
data day1.pois_plot;
  do x = 0 to 25;
    pois_prob = pdf('Poisson', x, 10);
    output;
  end;
run;
This code generates a schedule of Poisson probabilities for 0 to 25 events with mean 10. The output keyword writes out the result of each iteration.
proc gplot data=day1.pois_plot;
  plot pois_prob*x;
run;

This command directly plots the Poisson probability distribution with the probabilities on the vertical axis and the number of events on the horizontal axis. The following set of codes is used for analyzing the skewness associated with the Poisson distribution:
data day1.pois_plot;
  do x = 0 to 25;
    pois_prob = pdf('Poisson', x, 10.5);
    output;
  end;
run;

proc gplot data=day1.pois_plot;
  plot pois_prob*x;
run;
This last couple of codes can be used to analyse the nature of the skewness of the Poisson distribution; the skewness can be examined by changing the parameter of the distribution.
NORMAL DISTRIBUTION
data day1.normal;
  do x = 12 to 18 by 0.05;
    normal_prob = pdf('normal', x, 3, 8);
    output;
  end;
run;
This is a data step which creates a new data set 'normal' in the user-defined library day1. The command generates a schedule of normal probability densities; the values of the respective densities are stored in the variable 'normal_prob'. The syntax of this function is: name of the variable = pdf('normal', value of x, mean, standard deviation). The mean and standard deviation must be specified for a proper characterization of the normal distribution. The schedule of densities corresponding to the different values of x is generated using the 'do' loop. Since the normal distribution is continuous, it assumes continuous values; by default the loop increases the value in steps of 1, which makes it discrete, so to approximate continuity we increase x in steps of 0.05. The result of each iteration is written out using the 'output' keyword.
proc gplot data=day1.normal;
  plot normal_prob*x;
run;

This command plots the normal probability densities on the vertical axis against the values of x on the horizontal axis. The graph obtained from the data set 'normal' is symmetric in nature.
proc univariate data=day1.class normal plot;
  var height;
  histogram height / normal (mu=est sigma=est color=green);
run;
'proc univariate' is the procedure for listing all the descriptive statistics associated with the analysis variable 'height'. The keyword 'histogram' generates a histogram over which a normal curve is superimposed; the normal curve here is green, specified by the estimated mean and the estimated standard deviation. Superimposing the normal curve on the histogram gives us an idea of whether the variable is normally distributed: if the normal curve fits the histogram nicely, we say that the variable is normally distributed. The variable 'height' in the data set 'class' fits the curve well, and its normality can be clearly observed in the resulting plot.
proc univariate data=day1.candy_sales_summary normal plot;
  var sale_amount;
  histogram sale_amount / normal (mu=est sigma=est color=green);
run;
This is the same code as above applied to a different variable in a different data set. The variable 'sale_amount' is not normally distributed, and the normal curve does not fit the histogram symmetrically.
Chapter 3
SAMPLING THEORY AND ESTIMATION
Sampling is concerned with the selection of a subset of individuals from within a population to estimate characteristics of the whole population. Researchers rarely survey the entire population because the cost of a census is too high. The three main advantages of sampling are that the cost is lower, data collection is faster, and, since the data set is smaller, it is possible to ensure homogeneity and to improve the accuracy and quality of the data.
Techniques of Sampling
There are two broad techniques of sampling: Probability Sampling (or Random Sampling) and Non-Probability Sampling, of which only random sampling can be used for statistical investigation.
Probability Sampling or Random Sampling
Probability sampling, or random sampling, is a sampling technique in which the probability of getting any particular sample can be calculated. Examples of random sampling include the following (a short PROC SURVEYSELECT sketch of these designs follows the stratified example):
Simple Random Sampling
Without Replacement: One deliberately avoids choosing any member of the population more than once.
With Replacement: One member can be chosen more than once.
Systematic Sampling
Systematic sampling relies on arranging the target population according to some ordering scheme and then selecting elements at regular intervals through that ordered list. For example, suppose you are taking data from every 10th person entering a mall.
Stratified Sampling
Where the population embraces a number of distinct categories or "strata", each stratum is sampled as an independent subpopulation, out of which individual elements can be randomly selected. Suppose, for example, that the staff of a company falls into the following categories:
male, full-time: 90
male, part-time: 18
female, full-time: 9
female, part-time: 63
Total: 180
and we are asked to take a sample of 40 staff, stratified according to the above categories.
The first step is to find the total number of staff (180) and calculate the percentage in each group.
% male, full-time = 90 / 180 = 50%
% male, part-time = 18 / 180 = 10%
% female, full-time = 9 / 180 = 5%
% female, part-time = 63 / 180 = 35%
This tells us that of our sample of 40, 50% (20) should be male full-time, 10% (4) male part-time, 5% (2) female full-time, and 35% (14) female part-time. Another easy way, without having to calculate the percentages, is to multiply each group size by the sample size and divide by the total population size (the size of the entire staff):
male, full-time = 90 x (40 / 180) = 20
male, part-time = 18 x (40 / 180) = 4
female, full-time = 9 x (40 / 180) = 2
female, part-time = 63 x (40 / 180) = 14
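In SAS, this proportional allocation can be requested directly. The sketch below is illustrative only: the data set day1.staff and the variables gender and employment_type are hypothetical stand-ins for the staff table above, and the ALLOC=PROP option of the STRATA statement assumes a reasonably recent SAS/STAT release (9.2 or later).
proc sort data=day1.staff out=day1.staff_sorted;
by gender employment_type; /* surveyselect needs the data sorted by the strata variables */
run;
proc surveyselect data=day1.staff_sorted out=day1.staff_sample
method=srs n=40 seed=1234; /* n=40 is the TOTAL sample size */
strata gender employment_type / alloc=prop; /* split n across strata in proportion to stratum sizes */
run;
Here ALLOC=PROP divides the total sample of 40 across the strata in proportion to the stratum sizes, reproducing the 20/4/2/14 allocation computed by hand above.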
Non-Probability Sampling
In non-probability sampling, we cannot assign a probability to the selected sample. Non-probability sampling techniques cannot be used to infer from the sample to the general population.
Examples of nonprobability sampling include:
Convenience, Haphazard or Accidental sampling: members of the population are chosen based on their relative ease of access. Sampling friends, co-workers, or shoppers at a single mall are all examples of convenience sampling.
Judgmental sampling or Purposive sampling: The researcher chooses the sample based on who they think would be appropriate for the study. This is used primarily when there is a limited number of people that have expertise in the area being researched.
Sampling Bias
In statistics, sampling bias is when a sample is collected in such a way that some members of the intended population are less likely to be included than others. It results in a biased sample, a non-random sample of a population (or non-human factors) in which all individuals, or instances, were not equally likely to have been selected. If this is not accounted for, results can be erroneously attributed to the phenomenon under study rather than to the method of sampling.
Sampling Distribution
The sampling distribution of a statistic is the distribution of that statistic, considered as a random variable, when derived from a random sample of size n. It may be considered as the distribution of the statistic for all possible samples from the same population of a given size.
Population Parameters and The Estimation Theory
A statistical parameter is a parameter that indexes a family of probability distributions. It can be regarded as a numerical characteristic of a population or a model. For example, the family of normal distributions has two parameters, the mean μ and the variance σ²: if these are specified, the distribution is known exactly. The family of Poisson distributions, on the other hand, has only one parameter, the mean λ.
In statistics, our purpose is to learn about the population by studying samples. Estimation refers to the process by which one makes inferences about a population, based on information obtained from a sample. Statisticians use sample statistics to estimate population parameters; for example, sample means are used to estimate population means. So the sample mean is an estimator here, and the particular value it yields is the estimate. In short, what a parameter is to the population, a statistic is to a sample.
Types of Estimator
There are two types of estimator: point estimators and interval estimators. Point estimators yield single-valued results, whereas interval estimators yield a range of plausible values.
Properties of Estimator
Unbiased: An estimator is unbiased if and only if its expectation is equal to the population parameter.
Consistency: An estimator is called consistent if increasing the sample size increases the probability of the estimator being close to the population parameter.
Efficiency: Among unbiased estimators, there often exists one with the lowest variance, called the minimum variance unbiased estimator (MVUE) or an efficient estimator.
Sufficiency: An estimator is called sufficient if no other statistic calculated from the same sample provides any additional information as to the value of the parameter.
Testing of Statistical Hypothesis
Statistical hypotheses are statements about real relationships; and like all hypotheses, statistical hypotheses may match the reality, or they may fail to do so. Statistical hypotheses have the special characteristic that one ordinarily attempts to test them (i.e., to reach a decision about whether or not one believes the statement is correct, in the sense of corresponding to the reality) by observing facts relevant to the hypothesis in a sample. This procedure, of course, introduces the difficulty that the sample may or may not represent well the population from which it was drawn.
Types of Hypotheses
Null Hypothesis (H0): Hypothesis testing works by collecting data and measuring how likely the particular set of data is, assuming the null hypothesis is true. If the data set is very unlikely, defined as belonging to a class of data sets that will only rarely be observed, the experimenter rejects the null hypothesis, concluding that it is (probably) false. The null hypothesis can never be proven; all we can do is reject it or fail to reject it.
Alternative Hypothesis (H1 or HA): The alternative hypothesis (also called the maintained or research hypothesis) and the null hypothesis are the two rival hypotheses compared by a statistical hypothesis test. An example might be where water quality in a stream has been observed over many years, and a test is made of the null hypothesis that there is no change in quality between the first and second halves of the data, against the alternative hypothesis that the quality is poorer in the second half of the record.
Examples of Statistical Hypotheses
The mean age of all Calcutta University students is 23.4 years.
The proportion of Calcutta University students who are women is 50 percent.
The heights of all the male students of Calcutta University are normally distributed.
Types of Errors in Testing of Hypothesis
There are two types of error as follows:
Type I Error: A type I error, also known as an error of the first kind, occurs when the null hypothesis (H0) is true but is rejected. It is asserting something that is absent, a false hit. In terms of folk tales, an investigator may be "crying wolf" without a wolf in sight (raising a false alarm) (H0: no wolf).
Type II Error: A type II error, also known as an error of the second kind, occurs when the null hypothesis is false but is erroneously accepted as true. It is failing to see what is present, a miss. A type II error may be compared with a so-called false negative (where an actual 'hit' was disregarded by the test and seen as a 'miss') in a test checking for a single condition with a definitive result of true or false. A type II error is committed when we fail to believe a truth.
Consequences of Type I and Type II Errors
Both types of errors are problems for individuals, corporations, and data analysis. Based on the real-life consequences of an error, one type may be more serious than the other. For example, NASA engineers would prefer to throw out an electronic circuit that is really fine (null hypothesis H0: not broken; reality: not broken; action: thrown out; error: type I, false positive) than to use one on a spacecraft that is actually broken (null hypothesis H0: not broken; reality: broken; action: use it; error: type II, false negative). In that situation a type I error raises the budget, but a type II error would risk the entire mission.
Level of Significance
Statistical significance is a statistical assessment of whether observations reflect a pattern rather than just chance, the fundamental challenge being that any partial picture is subject to observational error. In statistical testing, a result is deemed statistically significant if it is unlikely to have occurred by chance, and hence provides enough evidence to reject the hypothesis of 'no effect'. As used in statistics, significant does not mean important or meaningful, as it does in everyday speech.
The significance level is usually denoted by the Greek symbol α. Popular levels of significance are 10% (0.1), 5% (0.05), 1% (0.01), 0.5% (0.005), and 0.1% (0.001). If a test of significance gives a p-value lower than the significance level α, the null hypothesis is rejected.
Confidence Interval
In statistics, a confidence interval (CI) is a kind of interval estimate of a population parameter and is used to indicate the reliability of an estimate. Confidence intervals consist of a range of values (an interval) that act as good estimates of the unknown population parameter; however, in rare cases, none of these values may cover the value of the parameter. The level of confidence of the confidence interval indicates the probability that the confidence range captures the true population parameter given a distribution of samples. If a corresponding hypothesis test is performed, the confidence level corresponds with the level of significance: a 95% confidence interval reflects a significance level of 0.05, and the confidence interval contains the parameter values that, when tested, should not be rejected with the same sample.
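For a concrete special case: when sampling from a normal population with known standard deviation σ, the standard 95% confidence interval for the population mean is

$$\bar{x} \pm 1.96\,\frac{\sigma}{\sqrt{n}}$$

So with purely illustrative numbers, a sample of n = 100 observations with mean 50 and σ = 10 gives 50 ± 1.96 × 10/√100 = 50 ± 1.96, i.e. the interval (48.04, 51.96).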
SAS IMPLEMENTATION
SIMPLE RANDOM SAMPLING WITHOUT REPLACEMENT
proc surveyselect data=day1.employee_satisfaction out=day1.emp1 method=srs n=50; run;
Surveyselect is the procedure for executing a sampling procedure. The data set that we consider here is 'employee_satisfaction'. The method of sampling specified here is simple random sampling without replacement (srs). We have pre-specified the sample size to be 50. This is a proc step which generates a report. Some important concepts generated in the report are:
Random Number Seed: An integer used to set the starting point for generating a series of random numbers. The seed sets the generator to a random starting point; a unique seed returns a unique random number sequence. Given the seed, a series of random numbers is generated. If no random number seed is specified, the numerical value of the system time is used to generate the subsequent random numbers.
Selection Probability: This shows the probability of selecting a sample of n observations from a total of N observations (N > n). Each observation is equally likely to be drawn from the population, and a sample observation, once drawn, is not returned to the population.
Sampling Weight: A sampling weight is a statistical correction factor that compensates for a sample design that tends to over- or under-represent various segments within a population. In some samples, small subsets of the population, such as religious, ethnic, or racial minorities, may be oversampled in order to have enough cases to analyze. When these subsamples are combined with the larger sample, their disproportionately large numbers must be diluted by a sampling weight. This is just the reciprocal of the selection probability of a sample.
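As a quick illustration with made-up numbers (the actual counts depend on the employee_satisfaction data set): if the source table held N = 500 employees and we request n = 50, then

$$\text{Selection Probability} = \frac{50}{500} = 0.1, \qquad \text{Sampling Weight} = \frac{1}{0.1} = 10$$

that is, each sampled employee stands in for ten employees of the population.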
SIMPLE RANDOM SAMPLING WITH REPLACEMENT
proc surveyselect data=day1.employee_satisfaction out=day1.emp2 method=urs n=50; run;
This code describes an alternative technique of sampling. The method 'urs', or unrestricted random sampling, refers to the type of random sampling where the sample points are returned to the population once the observations are recorded. This process is also called simple random sampling with replacement. In the final data set that we get, there might not be 50 unique observations, since repetition may occur in the selection of the sample observations. In this form of sampling, the report generated contains, in addition to the concepts introduced in srs, another concept called the expected number of hits.
The concept of the expected number of hits is analogous to the concept of selection probability in simple random sampling. This measure represents the average number of times a particular observation is selected in the process of random sampling with replacement. The sampling weight, in this context, is the reciprocal of the expected number of hits.
STRATIFIED RANDOM SAMPLING
STEP 1: SORTING THE DATA SET ACCORDING TO THE SUB_CATEGORY
proc sort data=day1.candy_sales_summary
out=day1.candy_sort;
by subcategory;
run;
The command sorts the data set according to the variable 'subcategory'. The sorting of the data set is important because it divides the data set according to the available strata. The variable 'subcategory' acts as the strata variable in the given data set.
STEP 2: SAMPLING USING THE STRATIFICATION TECHNIQUE
proc surveyselect data=day1.candy_sort n= (5 7 15 10 12 8) method=seq
out=day1.candy_seq;
strata subcategory; run;
The method of sampling applied within each stratum is the sequential random sampling technique. The number of observations to be chosen from each stratum is specified using 'n'.
SYSTEMATIC OR ORDERED SAMPLING
In this technique, the sample is drawn from the population based on a particular order. For example, if a departmental store wants to know about the level of customer satisfaction, it needs to survey its customers. If the store expects a footfall of 1,000 customers in a day and requires a sample of 100, it can question every 10th person walking in through the door.
proc surveyselect data= day1.candy_sort
out=day1.candy_seq
n=30 method=sys; run;
The option 'method=sys' executes the systematic sampling process. The sampling interval is calculated as K = N/n, where n is the size of the sample and N the size of the population. So, for a sample of size 30 drawn by surveying every 50th observation, the implied population size is N = 30 × 50 = 1,500.
Chapter 4
IMPORTANT TESTS OF STATISTICAL SIGNIFICANCE (PART I)
Concept of Parametric Data
A parametric test is one that requires data from one of the large catalogue of distributions that statisticians have described, and for data to be parametric certain assumptions must be true. If you use a parametric test when your data is not parametric, the results are likely to be inaccurate. Therefore, it is very important that we check the assumptions before deciding which statistical test is appropriate.
Assumptions of Parametric Test
Normally Distributed Data: It is assumed that the data are from one or more normally distributed populations. The rationale behind hypothesis testing relies on normally distributed populations, and so if this assumption is not met the logic behind hypothesis testing is flawed. Most researchers eyeball their sample data using a histogram; if the sample data look roughly normal, the researchers assume that the populations are normal as well.
Homogeneity of Variance: This assumption means that the variance should be the same throughout the data. In designs in which you test several groups of participants, it means that each of these samples comes from populations with the same variance.
Interval Data: Data should be measured at least at the interval level. This means that the distance between points of your scale should be equal at all parts along the scale. For example, if you had a 10-point anxiety scale, then the difference in anxiety represented by a change in score from 2 to 3 should be the same as that represented by a change in score from 9 to 10.
Independence: This assumption is that data from different participants are independent, which means that the behavior of one participant does not influence the behavior of another.
The assumptions of interval data and independence are tested only by common sense. The assumption of homogeneity of variance is tested in different ways for different procedures.
Z Test
A Z-test is any statistical test for which the distribution of the test statistic under the null hypothesis can be approximated by a normal distribution.
Suppose that in a particular geographic region, the mean and standard deviation of scores on a reading test are 100 points and 12 points, respectively. Our interest is in the scores of 55 students in a particular school who received a mean score of 96. We can ask whether this mean score is significantly lower than the regional mean; that is, are the students in this school comparable to a simple random sample of 55 students from the region as a whole, or are their scores surprisingly low? The computation is worked out after the assumptions below.
Assumptions
The parent population from which the sample is drawn should be normal
The sample observations are independent, i.e., the given sample is random
The population standard deviation σ is known
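Plugging the reading-test example above into the standard one-sample Z statistic gives

$$z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} = \frac{96 - 100}{12/\sqrt{55}} \approx -2.47$$

Since |z| = 2.47 exceeds 1.96, the two-tailed critical value at the 5% level of significance, the school's mean score is significantly lower than the regional mean.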
T Test
A t-test is any statistical hypothesis test in which the test statistic follows a Student's t distribution if the null hypothesis is supported. Among the most frequently used t-tests are:
A one-sample location test of whether the mean of a normally distributed population has a value specified in a null hypothesis.
A two-sample location test of the null hypothesis that the means of two normally distributed populations are equal.
A test of the null hypothesis that the difference between two responses measured on the same statistical unit has a mean value of zero.
A test of whether the slope of a regression line differs significantly from zero.
Assumptions
Most t statistics have the form t = Z/s, where Z and s are functions of the data.
Z follows a standard normal distribution under the null hypothesis, i.e., the parent population from which the sample is drawn should be normal
The sample observations are independent, i.e., the given sample is random
The population standard deviation σ is unknown
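In the common one-sample case, the statistic specializes to the familiar form (a standard result; s is the sample standard deviation, which replaces the unknown σ):

$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}, \qquad df = n - 1$$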
Two Independent Samples T Test
Suppose you have conducted a survey that studied the commitment to change in your organization, and you now want to find out whether there are any differences in the commitment to change between male and female staff members. Or, for instance, a researcher wants to find out whether middle-level employees are more satisfied than top-level employees; in this case the researcher needs the satisfaction scores for middle and top management. Here again we can see that one variable (satisfaction) is divided into two groups (middle and top level). So, in summary, when we need to compare two groups on one numeric variable, we use the two independent samples t-test; the two samples are drawn from one single variable. An assumption of the two independent samples t-test is that the data are normally distributed.
Forming The Research Hypotheses
For instance, we have conducted a survey that studied the salary of respondents, and we want to check whether there is any difference in the salary of male and female employees in the business organization.
Example of research question: Are there any differences in the earnings of male and female employees?
What you need: One categorical independent variable with only two groups (e.g. sex: males/females). One continuous dependent variable (e.g. salary).
Hypotheses of the two independent samples t-test:
H0: The two population means are equal, i.e. there is no difference in earnings
H1: The two population means are not equal, i.e. there is a difference in earnings
Paired Sample T Test
A company markets an eight-week-long weight loss program and claims that at the end of the program, on average, a participant will have lost 5 pounds. On the other hand, you have studied the program and you believe that it is scientifically unsound and shouldn't work at all. You want to test the hypothesis that the weight loss program does not help people lose weight. Your plan is to get a random sample of people and put them on the program. You will measure their weight at the beginning of the program and then measure their weight again at the end of the program. Based on some previous research, you believe that the standard deviation of the weight difference over eight weeks will be 5 pounds.
Assumptions
The assumptions underlying the paired samples t-test are similar to the one-sample t-test but refer to the set of difference scores.
The observations are independent of each other
The dependent variable is measured on an interval scale
The differences are normally distributed in the population
Hypotheses of the paired sample t-test:
H0: The two population means are equal
H1: The two population means are not equal
In summary, a paired sample t-test assesses whether an action is effective or not.
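The paired test is, in effect, a one-sample t-test run on the within-pair differences (standard form; d̄ is the mean of the n differences and s_d their standard deviation):

$$t = \frac{\bar{d}}{s_d/\sqrt{n}}, \qquad df = n - 1$$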
SAS IMPLEMENTATION
A SINGLE VARIABLE TTEST
The case study on a single-variable t-test pertains to a leading hospital in the city. The baseline blood pressures for 60 patients belonging to different age groups were recorded. The data set contains three variables, namely: the subject (id variable), age (numeric variable) and baseline bp (numeric variable). The objective of the case study is to check whether there has been a statistically significant change in the average blood pressure over a span of 45 days. We use the t-test in this case. However, before using the test we need to test for the assumption of normality.
STEP 1: CHECK FOR NORMALITY
proc univariate data=day1.bp normal plot; var baselinebp; qqplot baselinebp/normal (mu=est sigma=est color=pink); run;
The univariate procedure generates all the vital descriptive statistics associated with the variable 'baselinebp'. The qqplot of 'baselinebp' shows that the observations of the variable lie very close to the hypothetical pink-coloured normal line. Therefore, 'baselinebp' is normally distributed.
STEP 2: TESTING THE SIGNIFICANCE OF THE HYPOTHESIS
proc ttest data=day1.bp h0=96 alpha=0.05; var baselinebp; run;
The procedure 'ttest' is used to run the Student's t-test. The null hypothesis (h0) is specified to be equal to 96; this implies that any differences observed in the readings of the average blood pressure are caused by sampling fluctuations in the data set. The keyword 'alpha' denotes the level of significance, which shows the probability of committing a Type I error.
The t-test generates the following tables:
Statistics: This table generates the vital statistics associated with the variable baselinebp. It displays the sample mean, variance and standard error associated with the sample.
T-test: This table reports the results associated with the t-test. Its most important component is the p-value, shown at the end of the table. The p-value here is 0.2688, which is much higher than the level of significance. Therefore, the analyst knows that he runs a very high chance of committing a Type I error if he rejects the null hypothesis, and so it is in his interest not to reject H0. In this situation it can be inferred that the minor fluctuations observed in the mean blood pressure are due to sampling fluctuations.
TWO INDEPENDENT SAMPLE TTEST
The two independent sample t-test is useful for examining significant differences in the means of two data sets. The present case study considers two renowned pizza companies: ABC and XYZ. The manager of the XYZ company is apprehensive about falling sales compared to its competitor ABC. The observed delivery time for pizza company ABC is less than for XYZ, but this can be considered a crucial factor in explaining the declining sales of XYZ only if the mean delivery time of company ABC is significantly less than the mean delivery time of XYZ.
STEP 1: IMPORTING THE REQUIRED FILE
The file containing the required information does not initially exist in the SAS database. The original file is in csv format, so we first import the data set using the 'import' procedure. This imports the data set into the SAS database and names it 'twoind_sample'.
proc import datafile="C:\Documents and Settings\OrangeTree\Desktop\Analytics data sets and case studies\twoindsample.csv"
out=day1.twoind_sample
dbms=csv replace; run;
STEP 2: RUNNING THE TTEST
proc ttest data=day1.twoind_sample; class company; var waiting_time_in_minutes_; run;
The t-test is executed using the procedure 'ttest'. Since this is a t-test to check the difference of means between two groups, we introduce the 'class' keyword to identify the two pizza companies. The variable in terms of which the t-test is to be carried out is 'waiting_time_in_minutes_'. Three important tables are generated once the code is executed:
STATISTICS: This table describes the vital statistics associated with the two pizza companies. It gives us a clear idea that the delivery time of pizza company ABC is distinctly less than the delivery time of company XYZ. How can we say so? From the confidence intervals within which the sample means of the two companies lie.
EQUALITY OF VARIANCES: To compare the means of two different sets it is necessary to check the variances of the two sets. The population variances of the two data sets must be identical in nature; this implies that the mean-difference test is executed under the assumption that the variance remains constant across the two data sets. The equality of variances is tested using the folded F-test, defined as

$$F = \frac{\max(s_1^2, s_2^2)}{\min(s_1^2, s_2^2)}$$

where s₁² and s₂² are the variances of category 1 and category 2.
The hypothesis tested is:
H0: The population variances are equal vs. HA: The population variances are unequal
The decision rule used is the p-value rule, whereby the null hypothesis is accepted if the exact probability of committing a type I error exceeds the benchmark probability prescribed by the level of significance. Here, the p-value associated with the folded F-statistic is 0.38, much greater than the level of significance. Hence the chance of committing a type I error on rejecting is high, so we do not reject the null hypothesis. Therefore, it is safe to conclude that the population variances of the two pizza companies are not significantly different, i.e., they can be treated as equal.
T-TESTS: This table displays the results of the t-test corresponding to the difference in the mean delivery time of pizzas. The results are displayed under two subheadings: Pooled Variance and Unequal Variance. Since the variances can be treated as equal, we consider the results corresponding to the pooled variance for the t-test analysis. The p-value corresponding to the t-statistic is 0.0003, which is less than the prescribed level of significance. Therefore, we conclude that the mean delivery times of the pizza companies ABC and XYZ are significantly different from one another.
PAIRED SAMPLE TTEST
To analyze the impact of e-learning on students, the Ministry of Human Resource Development of the Government of India performed an exploratory study on a sample of 50 students. The students were first taught in the traditional method of teaching and then through the method of e-learning, without the presence of any teachers. The marks were recorded for the students before the e-learning and after the e-learning. The marks were then compared to analyze the impact of e-learning on the performance of the students.
STEP 1: IMPORTING THE DATA FILE
The first step in this part is to import the required data file using the proc import keyword. The original file is in csv format.
proc import datafile="C:\Documents and Settings\OrangeTree\Desktop\Analytics data sets and case studies\pairedsample.csv"
out=day1.pairedsample
dbms=csv replace; run;
STEP 2: RUNNING THE PAIRED SAMPLE TTEST
proc ttest data=day1.pairedsample;
paired before*after; run;
The keyword 'paired' executes the paired t-test between the marks 'before' and the marks 'after'. The hypotheses set up are:
H0: The ex-post and ex-ante means are not significantly different vs. HA: The ex-post and ex-ante means are significantly different
The results for this ttest are displayed through the following tables:
STATISTICS: The statistics table shows that the mean marks of the students have increased after incorporating the e-learning process. The question that arises is: is the rise in the mean marks after e-learning a significant rise? To test the significance of the change we use the t-test table.
T-TEST: The t-test table details the significance of the difference of the paired means. The p-value rule is used to decide whether the null hypothesis should be accepted or not. The p-value generated (0.4539) within the model is greater than the level of significance, which means the differences in the means are not statistically significant. Therefore, the analysis shows that the mean performance of the students did not change significantly after the e-learning process. Hence, e-learning as employed by the ministry did not prove to be an effective strategy.
Chapter 5
UNDERSTANDING THE ASSOCIATION BETWEEN THE VARIABLES
Chi Square Test for Independence of Attributes
Consider the following questions:
Is there any association between income level and brand preference?
Is there any association between family size and the size of washing machine bought?
Are the attributes educational background and type of job chosen independent?
The solutions to the above questions need the help of the Chi-Square test of independence in a contingency table. Please note that the variables involved in Chi-Square analysis are nominally scaled. Nominal data are also known by two other names: categorical data and attribute data.
Contingency Table: Is there any relation between age and investment?

                 Investment
Age        Stock   Bond   Cash   Total
25 - 34      30      10      1      41
35 - 44      35      25      2      62
45 - 54      38      35      4      77
55 - 70      22      30      4      56
Total       125     100     11     236

Assumptions
The data should be categorical variables
The total frequency should be reasonably large, say greater than 50
The observations of the sample are independent, i.e., the samples are random
The theoretical frequency of any category or class should not be less than 5
Hypotheses of the test are
H0: There is no association between the variables
H1: There is an association between the variables
Calculation of Chi Square Statistic
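The formula did not survive in this copy, so the standard definition is reproduced here: the statistic compares the observed frequency O of each cell with its theoretical (expected) frequency E,

$$\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, \qquad df = (r - 1)(c - 1)$$

where r and c are the numbers of rows and columns of the contingency table.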
Calculation of Theoretical Frequency
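Again reproducing the standard formula: under the null hypothesis of independence, the theoretical frequency of the cell in row i and column j is computed from the marginal totals,

$$E_{ij} = \frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{\text{grand total}}$$

For the age-investment table above, for example, the expected count for the (25 - 34, Stock) cell is 41 × 125 / 236 ≈ 21.7.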
Remember, the Chi-Square test of independence only checks whether there is any association between the attributes; it does not tell us the nature of that association.
Correlation Analysis
The simplest way to look at whether two variables are associated is to look at whether they covary. To understand what covariance is, we first need to think back to the concept of variance:

$$\text{Variance} = \frac{\sum_i (x_i - m_x)^2}{N - 1} = \frac{\sum_i (x_i - m_x)(x_i - m_x)}{N - 1}$$

The mean of the sample is represented by m_x, x_i is the data point in question and N is the number of observations. If we are interested in whether two variables are related, then we are interested in whether changes in one variable are met with similar changes in the other variable. When there are two variables, rather than squaring each difference, we can multiply the difference for one variable by the corresponding difference for the second variable. As with the variance, if we want an average value of the combined differences for the two variables, we must divide by the number of observations (we actually divide by N - 1). This averaged sum of combined differences is known as the covariance:

$$\text{Cov}(x, y) = \frac{\sum_i (x_i - m_x)(y_i - m_y)}{N - 1}$$

There is, however, one problem with covariance as a measure of the relationship between variables: it depends upon the scales of measurement used. So covariance is not a standardized measure. To overcome the problem of dependence on the measurement scale, we need to convert the covariance into a standard set of units. This process is known as standardization, and the unit of measurement we use is the standard deviation. The standardized covariance is known as the correlation coefficient:

$$r = \frac{\text{Cov}(x, y)}{s_x s_y} = \frac{\sum_i (x_i - m_x)(y_i - m_y)}{(N - 1)\, s_x s_y}$$

which always lies between -1 and 1.
Remember, correlation doesn't necessarily imply causation.
Test of Hypotheses for Correlation
For pairs from an uncorrelated bivariate normal distribution, the sampling distribution of Pearson's correlation coefficient follows Student's t-distribution with n − 2 degrees of freedom. Specifically, if the underlying variables have a bivariate normal distribution, the variable

$$t = r\sqrt{\frac{n - 2}{1 - r^2}}$$

has a Student's t-distribution in the null case (zero correlation).
Correlations (the column headings abbreviate the same four variables as the rows)

                            Education   Experience   Salary   Age
Duration of Education
  Pearson Correlation        1           .308*        .115     .238
  Sig. (2-tailed)                        .017         .381     .067
Professional Experience
  Pearson Correlation        .308*       1            .121     .985**
  Sig. (2-tailed)            .017                     .358     .000
Salary in Dollar per Hour
  Pearson Correlation        .115        .121         1        .180
  Sig. (2-tailed)            .381        .358                  .169
Age of the Person
  Pearson Correlation        .238        .985**       .180     1
  Sig. (2-tailed)            .067        .000         .169

*. Correlation is significant at the 0.05 level (2-tailed).
**. Correlation is significant at the 0.01 level (2-tailed).
Partial Correlation
A correlation between two variables in which the effects of other variables are held constant is known as partial correlation. The partial correlation between variables 1 and 2, controlling for variable 3, is given by:

$$r_{12.3} = \frac{r_{12} - r_{13} r_{23}}{\sqrt{1 - r_{13}^2}\,\sqrt{1 - r_{23}^2}}$$

For example, we might find that the ordinary correlation between blood pressure and blood cholesterol is a strong positive correlation, yet the partial correlation between these two variables, after we have taken into account the age of the subject, is very small. If this were the case, it would suggest that both variables are related to age, and the observed correlation is only due to their common relationship to age.
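To see the formula at work with purely illustrative numbers (not taken from any data set in this book): suppose r12 = 0.80, r13 = 0.90 and r23 = 0.85. Then

$$r_{12.3} = \frac{0.80 - (0.90)(0.85)}{\sqrt{1 - 0.81}\,\sqrt{1 - 0.7225}} = \frac{0.035}{0.4359 \times 0.5268} \approx 0.15$$

A strong raw correlation of 0.80 almost vanishes once the third variable is controlled for, which is exactly the blood pressure/cholesterol/age situation described above.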
SAS IMPLEMENTATION
CORRELATION
proc corr data=day1.correlation; var Education Experience Age Wage_dollars_per_hour_; run;
proc corr is used to calculate the correlation between two or more quantitative variables. The var statement identifies the variables whose correlation coefficients are to be quantified. The output of this code is a 4x4 correlation matrix; each element in this matrix shows the correlation coefficient between two variables. Associated with each correlation coefficient is a p-value which shows the statistical significance of the correlation coefficient.
PARTIAL CORRELATION
proc corr data=day1.correlation; var Education Experience; partial Age; run;
This code produces the correlation between the two variables Education and Experience. The partial statement is used to adjust the correlation coefficient between Education and Experience for the impact of the variable Age. This adjustment is important for finding out the exact extent to which Education and Experience are correlated.
MATRIX PLOT
ods html; ods graphics on; proc corr data=day1.correlation noprint plots=matrix; var Education Experience Wage_dollars_per_hour_ Age; run; ods graphics off; ods html close;
For a matrix view of the correlations we first set the ods (Output Delivery System) to html. Then we turn on the graphics mode. In the proc corr we use the options noprint to suppress the output in the output window. At the same time, we set the type of the plot to matrix. After running the code, we turn off the graphics mode and reset the output delivery system.
CHI SQUARE TEST FOR INDEPENDENCE OF ATTRIBUTES
Here we are trying to find out whether there is any association between Frequency_of_Readership and Level_of_Educational_Achievement. This test is done under the procedure freq, and we request a chi-square test in the tables statement.
proc freq data=day1.chi; tables Frequency_of_Readership * Level_of_Educational_Achievement/chisq; run;
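To see the theoretical frequencies from the chi-square formula printed alongside the observed cell counts, the EXPECTED option of the TABLES statement can be added, a small variation on the code above:
proc freq data=day1.chi;
tables Frequency_of_Readership * Level_of_Educational_Achievement/chisq expected; /* expected prints E for every cell */
run;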
Chapter 6
IMPORTANT TESTS OF STATISTICAL SIGNIFICANCE (PART II)
One Way ANOVA
A manager wants to raise productivity at his company by increasing the speed at which his employees can use a particular spreadsheet program. As he does not have the skills in-house, he employs an external agency which provides training in this spreadsheet program. They offer three packages: a beginner, an intermediate and an advanced course. He is unsure which course is needed for the type of work they do at his company, so he sends 10 employees on the beginner course, 10 on the intermediate and 10 on the advanced course. When they all return from the training he gives them a problem to solve using the spreadsheet program and times how long it takes them to complete it. He then wishes to compare the three courses (beginner, intermediate, advanced) to see if there are any differences in the average time taken to complete the problem.
Assumptions
Response variables are normally distributed (or approximately normally distributed)
Samples are independent
Variances of populations are equal
Responses for a given group are independent and identically distributed normal random variables
The hypotheses for the test are:
H0: The population means are equal
H1: At least one of the population means is different
The name 'One Way ANOVA' implies that the number of independent variables is one. Here the inter-group variation is basically systematic variation and the intra-group variation is unsystematic. We are then checking whether the inter-group variation is significantly larger than the intra-group variation.
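Formally, this comparison is made through the F ratio (the standard one-way ANOVA decomposition, with k groups and N observations in total):

$$F = \frac{SSB/(k - 1)}{SSW/(N - k)}$$

where SSB is the between-group (systematic) sum of squares and SSW the within-group (unsystematic) sum of squares. A large F means the group means differ by more than chance alone would suggest.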
Two Way ANOVA
The two-way analysis of variance (ANOVA) test is an extension of the one-way ANOVA test that examines the influence of different categorical independent variables on one dependent variable. While the one-way ANOVA measures the significant effect of one independent variable (IV), the two-way ANOVA is used when there is more than one IV and multiple observations for each IV. The two-way ANOVA can not only determine the main effect of each IV but also identify whether there is a significant interaction effect between the IVs.
Example
A researcher was interested in whether an individual's interest in politics was influenced by their level of education and their gender. They recruited a random sample of participants to their study and asked them about their interest in politics, which they scored from 0 - 100, with higher scores meaning a greater interest. The researcher then divided the participants by gender (Male/Female) and then again by level of education (School/College/University).
What is Interaction?
When gender and level of education interact, we get 6 different groups, namely: Male-School, Female-School, Male-College, Female-College, Male-University and Female-University. Using two-way ANOVA, we are trying to understand whether any of the groups is significantly different from the rest. If the interaction levels don't show any significant differences, nor will the main factors for their levels.
Assumptions
As with other parametric tests, we make the following assumptions when using two-way ANOVA:
The populations from which the samples are obtained must be normally distributed
Sampling is done correctly; observations within and between groups must be independent
The variances among populations must be equal (homogeneity)
Data are interval or nominal
The hypotheses for the test are, for each factor and interaction:
H0: The means of all groups are equal
H1: At least one group mean is significantly different
SAS IMPLEMENTATION
ONEWAY ANOVA
We demonstrate one-way ANOVA through a case study. The case that we consider is that of three production plants: Maruti, Hyundai and Tata, along with the associated processing time of cars in each of these plants. The objective of the analyst is to find out whether there exists a significant difference between the mean processing times of the plants.
proc anova data=day1.anova; class plant; model processing_time=plant; run;
'anova' is the procedure used in analysis of variance when the data is balanced. 'class' is the keyword for specifying the different groups in the problem; in this case, the class variable identifies the respective production plants of the companies. The 'model' keyword is used for executing functions which involve an independent and a dependent variable: the left-hand side of the equality is the dependent variable and the right-hand side represents the independent variable. The code generates the following tables:
The first table shows the statistics associated with the overall goodness of the model. It displays the variation across the groups (mean model sum of squares) and within the groups (mean square of errors). The F-statistic is calculated as the ratio of the explained variation in the model to the unexplained variation. The p-value rule is employed to check the significance of the F-value. The p-value for the F-statistic in this study is 0.1447, which is much greater than the level of significance. Thus it can be concluded that there is no significant difference in the processing time of cars in the three plants.
The second table generates all the descriptive statistics corresponding to the variable mean_processing_time_of_plant. The mean processing times of the plants of the three companies are not significantly different from each other. One problem with the one-way ANOVA is that it does not include any interaction effect between the independent variables; this problem is addressed by two-way ANOVA.
TWOWAY ANOVA
A survey examined the weight gained by men as a result of two factors: the amount of food consumed and the type or nature of diet. Ten representative men were randomly selected, and each of them was fed each type of diet at the two specified diet amounts (i.e. "High" and "Low" respectively). The weight gained by the men was measured in grams. There are three variables with a total of 60 observations.
The numeric variable Weight_gain denotes the weight gained by the men. Two separate samples of pre- and post-treatment weight are not taken; rather, a single sample of actual weight gain is considered. The variable Diet_Amount denotes the amount of diet; it is a categorical variable recording two responses: 1 for 'High' and 2 for 'Low' amounts of diet. The variable Diet_type denotes the type of diet consumed, which is also a categorical variable recording three responses: 1 for a vegetarian diet, 2 for a non-vegetarian diet and 3 for a mixed diet. The objective of the study is to locate the factors which most significantly affect the weight gain in individuals. The code for two-way ANOVA is:
proc glm data=day1.twowayanova; class Diet_Amount Diet_type; model Weight_gain=Diet_Amount Diet_type Diet_Amount*Diet_type; means Diet_amount Diet_type/tukey; run;
This can also be done using proc anova, but anova works well when the data is balanced, i.e. when the interaction groups are equal in size. Also, we are more interested in the Type III sum of squares. So we prefer proc glm over proc anova.
Chapter 7
EXPLORATORY FACTOR ANALYSIS
Suppose we are interested in consumers' evaluation of a brand of coffee. We take a random sample of consumers, each of whom is given a cup of coffee; they are not told which brand of coffee they have been given. After they had drunk the coffee, they were asked to rate it on 14 semantic-differential scales. The 14 attributes which were investigated are shown below:
1. Pleasant flavor - Unpleasant flavor
2. Stagnant, muggy taste - Sparkling, refreshing taste
3. Mellow taste - Bitter taste
4. Cheap taste - Expensive taste
5. Comforting, harmonious - Irritating, discordant
6. Smooth, friendly taste - Rough, hostile taste
7. Dead, lifeless, dull taste - Alive, lively, peppy taste
8. Tastes artificial - Tastes like real coffee
9. Deep distinct flavor - Shallow indistinct flavor
10. Tastes warmed over - Tastes just brewed
11. Hearty, full-bodied, full flavor - Warm, thin, empty flavor
12. Pure, clear taste - Muddy, swampy taste
13. Raw taste - Stale taste
14. Overall preference: Excellent quality - Very poor quality

A factor analysis of the ratings given by consumers indicated that four factors could summarize the 14 attributes: comforting quality, heartiness, genuineness and freshness.

Factor                   Attributes
A. Comforting Quality    1. Pleasant flavor; 3. Mellow taste; 5. Comforting taste; 12. Pure, clear taste
B. Heartiness            9. Deep distinct flavor; 11. Hearty, full-bodied, full flavor
C. Genuineness           2. Sparkling taste; 4. Expensive taste; 6. Smooth, friendly taste; 7. Alive, lively, peppy taste; 8. Tastes like real coffee; 14. Overall preference
D. Freshness             10. Tastes just brewed; 13. Raw taste
Here we are only exploring the factors, but we cannot confirm whether these are the only factors, hence the name Exploratory Factor Analysis.
Principal Component Analysis
Principal component analysis was developed by Pearson and adapted for factor analysis by Hotelling. A goal for the user of PCA is to summarize the interrelationships among a set of original variables in terms of a smaller set of uncorrelated principal components that are linear combinations of the original variables.
Estimating The Initial Communalities
PCA assumes that there is as much variance to be analyzed as the number of observed variables, and that all of the variance in an item can be explained by the extracted factors. Communality means the variance that the items and factors share in common.
Eigenvalues and Eigen Vectors
PCA has been described as eigen analysis, or seeking the solution to the characteristic equation of the correlation matrix. An eigenvalue represents the amount of variance in all of the items that can be explained by a given principal component or factor. An eigenvector of a correlation matrix is a column of weights.
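In symbols, each eigenvalue-eigenvector pair solves the characteristic equation of the correlation matrix R (the standard eigen decomposition; p denotes the number of items):

$$R\,v_j = \lambda_j v_j, \qquad \sum_{j=1}^{p} \lambda_j = p$$

The second identity holds because a correlation matrix has ones on the diagonal, so the total standardized variance available to the components equals the number of items.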
Is Factor Analysis Feasible?
Correlation Matrix Check: Is it a combination of high and low correlations?
KMO MSA Check: The Kaiser-Meyer-Olkin Measure of Sampling Adequacy tests whether the partial correlations among variables are small.
Bartlett's Test of Sphericity: It tests whether the correlation matrix is an identity matrix, which could indicate that the factor model is inappropriate.
Factor Loadings
To obtain a principal component, each of the weights of an eigenvector is multiplied by the square root of the principal component's associated eigenvalue. These newly generated weights are called factor loadings and represent the correlation of each item with the given principal component.
Deciding The Number of Factors
A Priori Criterion: The number of factors to extract is pre-decided.
Eigenvalue Criterion:
Min Eigen Criterion: We decide the floor of the eigenvalue. If the floor is 0.6 and there are 3 eigenvalues above that mark, then we are looking for 3 factors.
Proportional and Cumulative Variance: We consider how much information is explained by an individual factor and, on aggregate, by the selected factors.
Scree Plot: This is basically a graphical presentation of the proportional variance.
So, PCA explains the entire variance, whereas EFA explains only a part of it: in EFA we are basically trying to explain the common variance among the variables.
Number of Factors
Factor analysis is an interdependence technique. In interdependence techniques the variables are not classified as dependent or independent; rather, the whole set of interdependence relationships is examined.
Problems of Factor Loadings and Solutions
Initially, the weights are distributed across all the variables, so it is not possible to identify the underlying factor of one or more variables. To remove this problem, we apply rotation to the axes. We mainly deal with two types of rotation:
Orthogonal Rotation: Varimax
Oblique Rotation: Promax
The problem with oblique rotation is that it makes the factors correlated. Varimax rotation is used in principal component analysis so that the axes are rotated to a position in which the sum of the variances of the loadings is the maximum possible.
SAS IMPLEMENTATION
EXPLORATORY FACTOR ANALYSIS
Here we are concerned with the underlying factors of employee satisfaction. The name of the data set is employee_satisfaction. Let's first look at the variables in the data set.
proc contents data=day1.employee_satisfaction position short; run; /*Employee Organization_competitive_place Like_ppl_I_work_with Job_allows_learn_newthngs Paid_more_than_others I_like_work_culture Frnds_have_heard_of_this_comp Co_looks_good_on_resume Can_work_from_home cutting_edge_work_done good_perks_incentives Good_pension_plan I_never_worked_on_weekend Paid_well_for_my_work*/
So, apart from the variable employee, which is basically the identification of the employee, all the variables contribute to the satisfaction of the employee. Using factor analysis we are going to find out the underlying factors of employee satisfaction and see which variable belongs to which factor.
But first we have to see whether factor analysis is feasible or not.
proc factor data=day1.employee_satisfaction corr msa scree; var Organization_competitive_place Like_ppl_I_work_with Job_allows_learn_newthngs Paid_more_than_others I_like_work_culture Frnds_have_heard_of_this_comp Co_looks_good_on_resume Can_work_from_home cutting_edge_work_done good_perks_incentives Good_pension_plan I_never_worked_on_weekend Paid_well_for_my_work; run;
The corr option in the proc factor statement produces the correlation matrix of the variables mentioned in the var statement. If the correlations between the variables are all very near to zero (say within +/- 0.2), then the variables are independent, so they themselves are the factors. The msa option produces the KMO MSA check, and the scree option produces a scree plot.
Now suppose we want to produce 4 factors. Then we set the value of n to 4.
proc factor data=day1.employee_satisfaction corr msa scree n = 4 rotate = varimax; var Organization_competitive_place Like_ppl_I_work_with Job_allows_learn_newthngs Paid_more_than_others I_like_work_culture Frnds_have_heard_of_this_comp Co_looks_good_on_resume Can_work_from_home cutting_edge_work_done good_perks_incentives Good_pension_plan I_never_worked_on_weekend Paid_well_for_my_work; run;
The rotate option specifies the type of rotation to apply; here we have requested varimax rotation.
If we want to calculate all the scoring coefficients, then we mention the option score.
proc factor data=day1.employee_satisfaction corr msa scree score mineigen = 0.5; var Organization_competitive_place Like_ppl_I_work_with Job_allows_learn_newthngs Paid_more_than_others I_like_work_culture Frnds_have_heard_of_this_comp Co_looks_good_on_resume Can_work_from_home cutting_edge_work_done good_perks_incentives Good_pension_plan I_never_worked_on_weekend Paid_well_for_my_work; run;
The mineigen = 0.5 option implies that we want to retain only those factors that have eigenvalues greater than 0.5.
For individual factor scores, we specify the option out = day1.factor_scores.
proc factor data=day1.employee_satisfaction n = 4 out = day1.factor_scores; var Organization_competitive_place Like_ppl_I_work_with Job_allows_learn_newthngs Paid_more_than_others I_like_work_culture Frnds_have_heard_of_this_comp Co_looks_good_on_resume Can_work_from_home cutting_edge_work_done good_perks_incentives Good_pension_plan I_never_worked_on_weekend Paid_well_for_my_work; run;