
No part of this book should be referenced or copied without the prior permission of the company.

A FEW WORDS TO THE STUDENTS

Analytics is becoming a popular tool for managerial decision making. It is still not widespread in countries like India, but in the West it has become standard practice. Previously, studying analytics required an in-depth knowledge of statistics and programming languages, but the widespread availability of statistical software packages has changed that to some extent. More emphasis is now placed on applying the techniques to solve business problems, so there is a need to understand the meaning of the statistical procedures. This book has been written to cater to that need.

In this book, all the necessary concepts have been explained with the business problem in mind. Also, to remove the apathy towards statistics, the use of mathematical expressions has been limited. That does not imply that we do not have to study the mathematics. The intention is to put substance over form. As students get accustomed to these statistical concepts, they can investigate further using various mathematical and statistical techniques. A list of suggested books and links is given in the appendix.

This book is directly related to the instructor's presentation, so students are strongly advised to go through this material at the end of each class. For general reading, the reader is advised to follow the order of the chapters. The chapters are arranged in order of increasing complexity, so the initial chapters are very important.

In this book, the statistical procedures have been implemented in SAS. The code is explained from the perspective of a data modeler. For the perspective of a programmer, students are advised to go through the documentation of the procedures on the SAS website.

In short, statistical concepts are a way of thinking. The more you recognize the thinking pattern, the quicker you will learn.

Best of Luck!

Team OTG

CONTENTS

1. Introduction to Analytics and Basic Statistics
2. Introduction to Probability Theory
3. Sampling Theory and Estimation
4. Important Tests of Statistical Significance (Part I)
5. Understanding the Association Between the Variables
6. Important Tests of Statistical Significance (Part II)
7. Exploratory Factor Analysis
8. Cluster Analysis
9. Linear Regression
10. Logistic Regression
11. Time Series Analysis
Appendix: Suggested Books and References


Chapter 1

INTRODUCTION TO ANALYTICS AND BASIC STATISTICS

Business analytics (BA) refers to the skills, technologies, applications and practices for continuous iterative exploration and investigation of past business performance to gain insight and drive business planning.

There are three main categories of analytics:

1. Descriptive - the use of data to find out what happened in the past and what is happening now.
2. Predictive - the use of data to find out what could happen in the future.
3. Prescriptive - the use of data to prescribe the best course of action for the future.

Analytics Domains

1. Retail sales analytics
2. Financial services analytics
3. Risk & Credit analytics
4. Talent analytics
5. Marketing analytics
6. Behavioral analytics
7. Collections analytics
8. Fraud analytics
9. Pricing analytics
10. Telecommunications
11. Supply Chain analytics
12. Transportation analytics

According to the McKinsey Global Institute (MGI), the amount of data in our world has been exploding, and analyzing large data sets, so-called big data, will become a key basis of competition, underpinning new waves of productivity growth, innovation, and consumer surplus. MGI studied big data in five domains: healthcare in the United States, the public sector in Europe, retail in the United States, and manufacturing and personal-location data globally.

Big data can generate value in each. For example, a retailer using big data to the full could increase its operating margin by more than 60 percent. Harnessing big data in the public sector has enormous potential, too. If US healthcare were to use big data creatively and effectively to drive efficiency and quality, the sector could create more than \$300 billion in value every year. Two-thirds of that would be in the form of reducing US healthcare expenditure by about 8 percent. In the developed economies of Europe, government administrators could save more than €100 billion (\$149 billion) in operational efficiency improvements alone by using big data, not including using big data to reduce fraud and errors and boost the collection of tax revenues. And users of services enabled by personal-location data could capture \$600 billion in consumer surplus.


Customer churn is a common term, used both in academia and in practice, to denote customers with a propensity to leave for competing companies. According to various estimates, the churn rate in European mobile service markets reaches twenty-five to thirty percent annually. On the other hand, financial analysis and economic studies agree that acquiring new customers is five times as expensive as retaining existing customers.

Snapshot of Companies Using Analytics:

MoneyGram International uses analytics to detect and prevent money transfer fraud before it impacts customers. It has prevented more than US\$37.7 million in fraudulent transactions and reduced customer fraud complaints by 72 percent.

Primerica provides its representatives with the ability to drill down into sales data in order to increase productivity and boost revenue. Primerica has more than 142 thousand licensed sales representatives.

T-Mobile USA uses analytics to detect the influencers in the network and design lucrative customized offers for them. In this way, it reduced the churn rate by 25%.

Dillard's uses analytics to improve its customer relationship management and merchandise management to deliver the right product at the right store.

Seton Healthcare Family uses analytics to detect patients who are at considerable risk.

Del Monte Foods uses analytics to understand the macro variables like inflation and how these variables impact the cost structure of the company.

Reliance Capital uses analytics to retain customers in its mutual fund business, to ensure continued premium payment in the life insurance business, to design products of high claim ratio in the general insurance business, and finally, for credit scoring in the mortgage finance business.

1. Increase customer value and overall revenues
2. Reduce costs and increase operational efficiency
3. Develop successful new products and services
4. Determine profitable sites for new stores and improve existing stores
5. Communicate effectively between departments for better decision making


Types of Data Analysis

Exploratory Data Analysis (EDA) makes few assumptions, and its purpose is to suggest hypotheses and assumptions. An OEM manufacturer was experiencing customer complaints. A team wanted to identify and remove the causes of these complaints. The team asked customers for usage data so that it could calculate defect rates. This started an Exploratory Data Analysis. The investigation established that a supplier used the wrong raw material. Discussions with the supplier and team members motivated further analysis of the raw material and its composition. This decision to analyze the raw material completed the Exploratory Data Analysis.

The Exploratory Data Analysis used both data analysis and process knowledge possessed by team members. The supplier and company conducted a series of designed experiments which identified an improved raw material composition. Using this composition, the defect rate improved from .023% to .004%. The experimental design and its analysis was Confirmatory Data Analysis (CDA). Note that the experimental design required a hypothesis generated by the Exploratory Data Analysis. Exploratory Data Analysis uncovers statements or hypotheses for Confirmatory Data Analysis to consider.

Properties of Measurement

Identity: Each value on the measurement scale has a unique meaning.

Magnitude: Values on the measurement scale have an ordered relationship to one another. That is, some values are larger and some are smaller.

Equal intervals: Scale units along the scale are equal to one another. This means, for example, that the difference between 1 and 2 would be equal to the difference between 19 and 20.

Absolute zero: The scale has a true zero point, below which no values exist.

Scales of Measurement

Nominal Scale: The nominal scale of measurement only satisfies the identity property of measurement. Values assigned to variables represent a descriptive category, but have no inherent numerical value with respect to magnitude. Gender is an example of a variable that is measured on a nominal scale. Individuals may be classified as "male" or "female", but neither value represents more or less "gender" than the other. Religion and political affiliation are other examples of variables that are normally measured on a nominal scale.


Ordinal Scale: The ordinal scale has the properties of both identity and magnitude. Each value on the ordinal scale has a unique meaning, and it has an ordered relationship to every other value on the scale. An example of an ordinal scale in action would be the results of a horse race, reported as "win", "place", and "show". We know the rank order in which horses finished the race. The horse that won finished ahead of the horse that placed, and the horse that placed finished ahead of the horse that showed. However, we cannot tell from this ordinal scale whether it was a close race or whether the winning horse won by a mile.

Interval Scale: The interval scale of measurement has the properties of identity, magnitude, and equal intervals. A perfect example of an interval scale is the Fahrenheit scale to measure temperature. The scale is made up of equal temperature units, so that the difference between 40 and 50 degrees Fahrenheit is equal to the difference between 50 and 60 degrees Fahrenheit. With an interval scale, you know not only whether different values are bigger or smaller, you also know how much bigger or smaller they are. For example, suppose it is 60 degrees Fahrenheit on Monday and 70 degrees on Tuesday. You know not only that it was hotter on Tuesday; you also know that it was 10 degrees hotter.

Ratio Scale: The ratio scale of measurement satisfies all four of the properties of measurement: identity, magnitude, equal intervals, and an absolute zero. The weight of an object would be an example of a ratio scale. Each value on the weight scale has a unique meaning, weights can be rank ordered, units along the weight scale are equal to one another, and there is an absolute zero. Absolute zero is a property of the weight scale because objects at rest can be weightless, but they cannot have negative weight.

Types of Data

Quantitative Data: In most of the cases, we will find ourselves using numeric data. This type of data is the one that contains numbers.

Delivery Time in Minutes

19 10 17 15 18 16 12 16 16 18 15 15 16 18 13 15 19 17 14 10 13 12 13 16


Qualitative Data: The other type of data is string type data. A string is simply a line of text and could represent comments about a certain participant, or other information that you don't wish to analyze as a grouping variable.

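| Cube # | Touch | See | Smell |
|---|---|---|---|
| 1 | Rough | Brown | Wood |
| 2 | Rough | Silver | Metallic |
| 3 | Slightly Rough | Silver | Metallic |
| 4 | Smooth | Gold | No Smell |
| 5 | Smooth | Brown | No Smell |
| 6 | Smooth | Brown | No Smell |
| 7 | Rough | Brown | Wood |
| 8 | Smooth | Gold | No Smell |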

Categorical Data: The third type of data is categorical data represented by a grouping variable. For example, you insert a variable called gender and insert 'Male' or 'Female' under this variable as observations. In this case we can group the entire data with respect to gender. Here gender is a grouping variable.

| Color | Number of Items |
|---|---|
| Brown | 4 |
| Gold | 2 |
| Silver | 2 |

Presentation of Qualitative Data

Tabular Presentation

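| Subcategory | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
|---|---|---|---|---|
| Chocolate | 491 | 32.73 | 491 | 32.73 |
| Fruit | 170 | 11.33 | 661 | 44.07 |
| Gum | 194 | 12.93 | 855 | 57.00 |
| Mixed | 92 | 6.13 | 947 | 63.13 |
| Soft | 365 | 24.33 | 1312 | 87.47 |
| Sweet | 188 | 12.53 | 1500 | 100.00 |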
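A table with these four statistic columns is what the SAS FREQ procedure prints by default for a single variable. The following is a minimal sketch only; it assumes that the table above was built from the subcategory variable of the day1.candy_sales_summary data set used later in this chapter, which the text does not state explicitly.

```sas
/* One-way frequency table for a qualitative variable.
   Default output columns: Frequency, Percent,
   Cumulative Frequency and Cumulative Percent.
   Assumes the libref day1 has already been assigned. */
proc freq data=day1.candy_sales_summary;
    tables subcategory;
run;
```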

Graphical Presentation

Simple Bar Chart

Pie Chart

Horizontal Bar Chart: Good for geographical data

Stacked Bar Chart: Good for intra analysis

Multiple Bar Chart: Good for inter analysis

Presentation of Quantitative Data

Tabular Presentation

Graphical Presentation

Histogram: Understanding the distribution of the data

Scatter Plot: Understanding the relationship between two numerical variables

Various Types of Scatter Plots

Positively Related: When one increases, the other also increases

Negatively Related: When one increases, the other decreases

Undefined: No clear relation

Measure of Central Tendency

The vice president of marketing of a fast food chain is studying the sales performance of the 100 stores in the eastern part of the country. He has constructed the following frequency distribution of annual sales:

| Sales (000s) | Frequency |
|---|---|
| 700 - 799 | 4 |
| 800 - 899 | 7 |
| 900 - 999 | 8 |
| 1000 - 1099 | 10 |
| 1100 - 1199 | 12 |
| 1200 - 1299 | 17 |
| 1300 - 1399 | 13 |
| 1400 - 1499 | 10 |
| 1500 - 1599 | 9 |
| 1600 - 1699 | 7 |
| 1700 - 1799 | 2 |
| 1800 - 1899 | 1 |

He would be looking at the distribution with an eye toward getting information about the central tendency, to compare the eastern part with other parts of the country. Central tendency is basically the central-most value of a distribution. Now how do we know which one is the central-most value? There are three common ways to find the central value: arithmetic mean, median and mode.

Arithmetic mean is the simple average of the data. The problem with the arithmetic mean is that it is influenced by extreme values. Suppose you take a sample of 10 persons whose monthly incomes are 10k, 12k, 14k, 12.5k, 14.2k, 11k, 12.3k, 13k, 11k and 10k. The average income turns out to be 12k, which is a good representation of the data. Now if you replace the last value with 100k, the average turns out to be 21k, which is absurd as 9 out of 10 people earn way below that mark. This problem of the arithmetic mean can be reduced through the use of the geometric and harmonic means, but the effect of outliers can be almost nullified by the use of the median.

Median is the mark where the data is split into exact halves, that is, 50% of the data lie above the mark and the rest lie below. In an intuitive sense, it is the proper measure of central tendency. But for various computational reasons, the arithmetic mean is the most popular measure. Whereas the median looks for the halfway mark, the mode looks for the value with the highest frequency, that is, the highest number of occurrences.

So using central tendency, we are trying to find a value around which all the data are clustering. This property of data can be used to deal with missing values. Suppose some of the income data is missing; then you can replace the missing values with the mean or the median. If some city name is missing, one may replace it with the mode, that is, the city which appears most often.
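The income example can be checked in SAS. The sketch below uses the ten incomes from the text (in thousands); the data set and variable names are made up for illustration. The second part fills missing incomes with the median using PROC STDIZE, which is part of SAS/STAT, on a hypothetical copy of the data in which two values have been set to missing.

```sas
/* Ten monthly incomes (in thousands) from the text. The second
   column repeats them with the last value replaced by 100. */
data income;
    input original with_outlier;
    datalines;
10   10
12   12
14   14
12.5 12.5
14.2 14.2
11   11
12.3 12.3
13   13
11   11
10   100
;
run;

/* The mean moves from 12 to 21, while the median barely changes. */
proc means data=income n mean median maxdec=2;
    var original with_outlier;
run;

/* Hypothetical copy in which two incomes are missing (.). */
data income_missing;
    input income_k @@;
    datalines;
10 12 . 12.5 14.2 11 . 13 11 100
;
run;

/* REPONLY replaces only the missing values, here with the median
   of the non-missing values; existing values are left unchanged. */
proc stdize data=income_missing out=income_filled
            reponly method=median;
    var income_k;
run;
```

Using method=mean instead would fill the gaps with the arithmetic mean, which in this example is pulled upwards by the value of 100.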

Measures of Dispersion

As the name says, here we are trying to assess how dispersed the data is. A measure of central tendency without any idea about the dispersion does not make much sense. Why is that so? Look at the following charts.

The horizontal line is the central value in both cases. In the first case, where the data is less dispersed, the data is really clustered around the central line. In the second case, the data is so dispersed that the central value is not that meaningful, as you cannot say that the horizontal line is a true representative of the data. So there is a need to measure the dispersion in the data.

Broadly there are two kinds of measures of dispersion: absolute measures like range or variance, and relative measures like the coefficient of variation.

Range is the simplest measure. It is the difference between the maximum and the minimum value in the data. The other absolute measure, variance, is a bit complicated to express in plain words. It comes from the sum of the squared differences of each data point from the arithmetic mean of the data. As you go on increasing the number of data points, this sum increases, so we take the average. If you then take the square root (e.g. the square root of 9 is 3), you get the standard deviation of the data. If you like, you can memorize the following expression:
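In symbols, writing x_1, ..., x_n for the n observations and x̄ for their arithmetic mean (this notation is assumed here, not taken from the text), the sample standard deviation s is usually written as:

$$
s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}, \qquad \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i
$$

As a small check with made-up numbers, take the three values 2, 4 and 6. Their mean is 4, the squared differences are 4, 0 and 4, and their sum is 8. Dividing by n-1 = 2 gives a variance of 4, and the square root gives a standard deviation of 2.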

Some of you might have difficulty with the denominator being n-1 instead of n. The reason is that here we are calculating the sample standard deviation. If it had been the population standard deviation, we would have used n.

We will discuss population and sample in the coming chapters.

Apart from helping us understand the dispersion in the data, the standard deviation can be used for transforming the data. Suppose we want to compare two variables, like the amount of money persons earn and the number of pairs of shoes their wives have; then it is better to express those data in terms of standard deviations. That is, we simply divide the data by their respective standard deviations. So here the standard deviation acts as a unit, or in other words we make the data unit free.

Now if you want to understand which data is more volatile, personal income or pairs of shoes, you had better use the coefficient of variation. As mentioned earlier, it is a relative measure of dispersion and is expressed as the standard deviation per unit of central value, i.e. the mean. If you have income in dollar terms and income in rupee terms, and the first has a lower coefficient of variation than the second, use the first for analysis. You will find more meaningful information.
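The comparison of volatility can be sketched with PROC MEANS, which reports the coefficient of variation through the cv keyword. The five households below are invented purely for illustration; only the procedure and its statistic keywords are standard SAS.

```sas
/* Hypothetical figures for five households: monthly income
   (in thousands) and number of pairs of shoes owned. */
data household;
    input income_k shoes;
    datalines;
30 4
45 6
52 5
38 12
60 8
;
run;

/* PROC MEANS reports CV as 100 * (standard deviation / mean),
   so the two variables can be compared despite different units. */
proc means data=household mean std cv maxdec=2;
    var income_k shoes;
run;
```

The variable with the larger CV is the more volatile one relative to its own mean.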

Measures of Location

Using measures of location, we can get a bird's eye view of the data. Measures of central tendency also come under measures of location. Minimum and maximum are also measures of location. Other measures are percentiles, deciles, and quartiles. For example, if the 90th percentile is 86, then it is implied that 90% of the students have got marks which are less than 86. The 90th percentile is also the 9th decile.


For quartiles, we are basically dividing the data into four equal parts. So we are looking for three points: Q1, Q2, and Q3. The other name for Q2 is the median. So we have 25% of the data below Q1, 25% between Q1 and Q2, 25% between Q2 and Q3 and, finally, the remaining 25% above Q3.
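These measures of location can be requested directly from PROC MEANS. The sketch below uses the delivery times listed earlier in this chapter; the data set name delivery and the variable name minutes are chosen here for illustration.

```sas
/* Delivery times in minutes, from the table earlier in the chapter. */
data delivery;
    input minutes @@;
    datalines;
19 10 17 15 18 16 12 16 16 18 15 15
16 18 13 15 19 17 14 10 13 12 13 16
;
run;

/* Minimum, the three quartiles (Q2 is the median),
   the 90th percentile (9th decile) and the maximum. */
proc means data=delivery min q1 median q3 p90 max;
    var minutes;
run;
```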

Statistics Related to the Shape of the Distribution

As we look at the shape of the histogram of numeric data, we learn various things about the distribution of the data. Two statistics are related to the shape of the distribution: skewness and kurtosis.

If the distribution has a longer left tail, the data is negatively skewed; if it has a longer right tail, the data is positively skewed. So we are basically detecting whether the data is symmetric about the central value of the distribution. In options markets, the difference in implied volatility at different strike prices represents the market's view of skew, and is called volatility skew. (In pure Black-Scholes, implied volatility is constant with respect to strike and time to maturity.) Skewness causes skewness risk in statistical models that are built from variables assumed to be symmetrically distributed.

Kurtosis, on the other hand, measures the peakedness of the distribution as well as the heaviness of its tails. Some heavy-tailed distributions do not have a finite variance; in other words, we cannot calculate the variance for these distributions. If we assume that a distribution is not heavy tailed and build the model on this assumption when it in fact is, this leads to kurtosis risk in the model. For instance, Long-Term Capital Management, a hedge fund cofounded by Myron Scholes, ignored kurtosis risk to its detriment. After four successful years, this hedge fund had to be bailed out by major investment banks in the late 90s because it understated the kurtosis of many financial securities underlying the fund's own trading positions.

There can be several situations, as shown in the chart. The value of (excess) kurtosis for a mesokurtic distribution is zero. For a platykurtic distribution it is negative and for a leptokurtic distribution it is positive. Kurtosis is sometimes referred to as the volatility of volatility, or the risk within risk.
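Both shape statistics are available as keywords in PROC MEANS. As a rough sketch, the data below reuse the ten incomes from the central tendency example with the last value replaced by 100 (in thousands); the data set name is illustrative. The kurtosis that SAS reports is measured relative to the normal distribution, which matches the convention above where a mesokurtic distribution has a value of zero.

```sas
/* Incomes with one very large value in the right tail. */
data income_outlier;
    input income_k @@;
    datalines;
10 12 14 12.5 14.2 11 12.3 13 11 100
;
run;

/* A single large value on the right should give positive skewness;
   compare the mean and the median to see the same asymmetry. */
proc means data=income_outlier n mean median skewness kurtosis maxdec=2;
    var income_k;
run;
```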


Box Plot for Detecting Outliers

An outlier is a score very different from the rest of the data. When we analyze data we have to be aware of such values because they bias the model we fit to the data. A good example of this bias can be seen by looking at a simple statistical model such as the mean. Suppose a film gets a rating from 1 to 5. Seven people saw the film and rated it 2, 5, 4, 5, 5, 5, and 5. All but one of these ratings are fairly similar (mainly 5 and 4), but the first rating, a 2, was quite different from the rest. This is an example of an outlier.

Boxplots tell us something about the distribution of scores. A boxplot shows the lowest score (the bottom horizontal line) and the highest score (the top horizontal line). The distance between the lowest horizontal line and the lowest edge of the tinted box is the range within which the lowest 25% of scores fall (called the bottom quartile). The box (the tinted area) shows the middle 50% of scores (known as the interquartile range); i.e. 50% of the scores are bigger than the lowest part of the tinted area but smaller than the top part of the tinted area. The distance between the top edge of the tinted box and the top horizontal line shows the range within which the top 25% of scores fall (the top quartile). In the middle of the tinted box is a slightly thicker horizontal line, which represents the value of the median.

Like histograms, boxplots also tell us whether the distribution is symmetrical or skewed. For a symmetrical distribution, the whiskers on either side of the box are of equal length. Finally, you will notice some small circles above each boxplot. These are the cases that are deemed to be outliers. Each circle has a number next to it that tells us in which row of the data editor to find the case.
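The effect of the single low rating on the mean can be seen with the seven film ratings from the text. This is a sketch only; the data set and variable names are illustrative, and the box plot is drawn here with PROC SGPLOT as one convenient option rather than as the method used elsewhere in this book.

```sas
/* The seven film ratings from the text. */
data ratings;
    input rating @@;
    datalines;
2 5 4 5 5 5 5
;
run;

/* Mean and median of all seven ratings. */
title "All seven ratings";
proc means data=ratings n mean median maxdec=2;
    var rating;
run;

/* The same statistics once the rating of 2 is left out. */
title "Without the rating of 2";
proc means data=ratings n mean median maxdec=2;
    where rating > 2;
    var rating;
run;
title;

/* Vertical box plot of the ratings. */
proc sgplot data=ratings;
    vbox rating;
run;
```

With all seven ratings the mean is 31/7, about 4.43; without the rating of 2 it is 29/6, about 4.83, while the median stays at 5 in both cases.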

Correcting Problems in the data

Generally we find problems related to the distribution or outliers while exploring the data. Suppose you detect outliers in the data. There are several options for reducing the impact of these values. However, before you do any of these things, it's worth checking whether the data you have entered is correct or not. If the data are correct then the three main options you have are:

Remove the Case: This entails deleting the data from the person who contributed the outlier. However, this should be done only if you have good reason to believe that this case is not from the population that you intend to sample. For example, if you were investigating factors that affect how much babies cry and one baby didn't cry at all, this would likely be an outlier. Upon inspection, if you discovered that this baby was actually a 10-year-old boy, then you would have grounds to exclude this case as it comes from a different population.

Transform the data: If you have a non-normal distribution then this should be done anyway (and skewed distributions will by their nature generally have outliers because it's these outliers that skew the distribution). Such transformation should reduce the impact of these outliers. For transformation we use the compute variable facility.

Log Transformation (log Xi): Taking the logarithm of a set of numbers squashes the right tail of the distribution. However, you cannot get a log value of zero or negative numbers, so if your data tend to zero or produce negative numbers you need to add a constant to all the data before you do the transformation.

Square root transformation (√Xi): Taking the square root of large values has more of an effect than taking the square root of small values. Consequently, taking the square root of each of your scores will bring large scores closer to the center. So this can be a very useful way to reduce positively skewed data. But we still have the problems related to negative numbers.

Reciprocal transformation (1/Xi): Dividing 1 by each of the scores reduces the impact of large scores. The transformed variable will have a lower limit of zero. One thing to bear in mind with this transformation is that it reverses the scores in the sense that scores that were originally large in the data set become small after the transformation, but the scores that were originally small become big after the transformation.

Change the score: If transformation fails, then you can consider replacing the score. On the face of it this may seem like cheating (you are changing the data from what was actually collected); however, if the score you're changing is very unrepresentative and biases your statistical model anyway, then changing the score is helpful. There are several options for how to change the score. The first is to replace it with the next highest value plus one. Alternatively, we can replace outliers with the mean plus three times the standard deviation, both derived from the rest of the data. A variation of this method is to use two instead of three times the standard deviation.
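The transformations and the "mean plus three standard deviations" replacement described above can be sketched in SAS with a DATA step and PROC MEANS. Everything in this example is hypothetical: the scores are invented, the value 95 plays the role of an outlier that has already been identified, and the data set and variable names are chosen only for illustration.

```sas
/* Hypothetical positively skewed scores; 95 is the known outlier. */
data scores;
    input x @@;
    datalines;
3 5 4 6 8 5 7 4 6 95
;
run;

/* The three transformations. All x here are strictly positive,
   so no constant has to be added before taking logs or reciprocals. */
data transformed;
    set scores;
    log_x   = log(x);     /* natural logarithm */
    sqrt_x  = sqrt(x);    /* square root */
    recip_x = 1 / x;      /* reciprocal: reverses the order of scores */
run;

/* Mean and standard deviation of the rest of the data,
   i.e. with the outlier excluded. */
proc means data=scores noprint;
    where x < 95;
    var x;
    output out=rest_stats mean=rest_mean std=rest_std;
run;

/* Replace any score above mean + 3 SD with that cut-off value. */
data capped;
    if _n_ = 1 then set rest_stats(keep=rest_mean rest_std);
    set scores;
    cutoff   = rest_mean + 3 * rest_std;
    x_capped = min(x, cutoff);
run;
```

Changing the multiplier from 3 to 2 gives the stricter variation mentioned above.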


SAS IMPLEMENTATION

BAR GRAPH

proc gchart data=day1.candy_sales_summary;
    vbar subcategory;
run;

gchart is the procedure used to generate bar charts. The data set used here is candy_sales_summary. The bar chart is generated using the keyword vbar. This presentation is used to represent the qualitative variable subcategory. The code generates a bar graph showing the frequency of occurrence of each subcategory.

proc gchart data=day1.candy_sales_summary;
    vbar3d subcategory;
run;

This code generates a 3D bar graph for 'subcategory'. This can be a more appealing way of representing qualitative data. 'vbar3d' is the keyword for generating a three-dimensional bar graph.

proc gchart data=day1.candy_sales_summary;
    hbar3d subcategory;
run;

This code generates a horizontal 3D bar graph for the variable 'subcategory'. 'hbar3d' is the keyword for generating the horizontal 3D bar graph. This form of representing the data is useful when we are representing spatial data.

proc gchart data=day1.candy_sales_summary;
    vbar3d subcategory / sum sumvar=sale_amount;
run;

This code generates a 3D vertical bar graph for the variable 'subcategory'. The height of each bar is the total sale amount of that subcategory (sumvar=sale_amount), and the 'sum' option prints this total on top of each bar.

proc gchart data=day1.candy_sales_summary;
    vbar3d subcategory / sumvar=sale_amount;
run;

This code produces the same chart as the code above but does not display the total on top of each bar. The 'sum' keyword is responsible for that display.


proc gchart data=day1.candy_sales_summary;
    vbar3d subcategory / sum sumvar=sale_amount
                         group=fiscal_year subgroup=fiscal_quarter;
run;

This code generates a sub-divided multiple bar diagram. The group option produces a set of bars for each fiscal year, showing the sales of each subcategory in that year, and the subgroup option divides each bar by fiscal quarter.

goptions vsize=6in hsize=20in;

The goptions statement sets the vertical and horizontal size of the graphics output, here 6 inches by 20 inches. It is a global statement which holds for the rest of the session: every graph constructed by the software from here on will have these dimensions.

proc gchart data=day1.candy_sales_summary;
    vbar3d subcategory / sum sumvar=sale_amount
                         group=fiscal_year subgroup=fiscal_quarter;
run;

The multiple bar diagram generated by the first run of this code appears very cramped on screen. To make it look better, we need to space the bars out, which is done by setting the vertical and horizontal dimensions with goptions and then running the chart code again. Because goptions is a global statement, any graphical representation from here onwards takes these dimensions as given.

PIE-CHART

proc gchart data=day1.candy_sales_summary;
    pie3d subcategory;
run;

This code generates a three-dimensional pie chart using the keyword 'pie3d'. 'gchart' is the procedure that generates the chart. The pie chart represents each 'subcategory' as a slice of the pie, i.e. as a share of the full 360 degrees.

proc gchart data=day1.candy_sales_summary;
    pie3d subcategory / discrete value=inside;
run;

This is a variation of the previous pie chart. It generates a pie chart in which the frequency value of each 'subcategory' is placed inside its slice: 'value=inside' keeps the frequency values inside the slices, and the names of the subcategories are also shown. Each subcategory is shown as a slice of a different color.

proc gchart data=day1.candy_sales_summary;
    pie3d subcategory / discrete value=inside percent=inside slice=outside;
run;

This code generates the pie chart with the frequency value and the percentage of each subcategory inside the slice and the name of the subcategory outside the slice.

proc gchart data=day1.candy_sales_summary;
    pie3d subcategory / discrete value=inside percent=inside slice=outside
                        freq=sale_amount;
run;

This code uses 'sale_amount' as a frequency (weight) variable, so each slice reflects the sale amount of the subcategory rather than the number of observations. The value and the percentage for each subcategory are shown inside the slice, and the name of the subcategory is shown outside the slice.

HISTOGRAM

proc univariate data=day1.candy_sales_summary;
    var sale_amount;
    histogram sale_amount;
run;

This is a representation of quantitative data. The univariate procedure generates all the key descriptive statistics for a particular variable, here 'sale_amount'. The statement that generates the histogram is 'histogram'. If no dimension is mentioned, the diagram is two-dimensional by default.

proc univariate data=day1.candy_sales_summary noprint;
    var sale_amount;
    histogram sale_amount;
    class subcategory;
run;

Here the 'noprint' option suppresses the tables of descriptive statistics for 'sale_amount' in the data set candy_sales_summary, so the only output is the histogram requested with the keyword 'histogram'. Because of the 'class' statement, a separate histogram of 'sale_amount' is produced for each subcategory.

SCATTER PLOT

proc gplot data=day1.candy_sales_summary;
    plot sale_amount*units;
run;

'gplot' is the procedure used to generate a plot of two quantitative variables. The scatter plot for the two variables sale_amount and units is generated using the keyword 'plot'. The variable on the left-hand side of the * is plotted on the y-axis and the variable on the right-hand side is plotted on the x-axis.

NORMALITY CHECK

proc univariate data=day1.class; var height; run;

The ‗univariate‘ keyword generates all the descriptive statistics associated with the var- iable heightin the data set Class. The descriptive statistics associated with a distri- bution helps in the identification of normality of a distribution. Normality of a distribution implies an element of symmetry associated with the distribution. In this data set the mean, median and the mode are approximately 62. The standard deviation is pretty ‗low‘ (5) compared to the existent mean. The Skewness and Kurtosis of the data set lies in the neighborhood of zero. A basic analysis yields the result that the variable ‗height‘ is normally distributed in the data set ‗Class‘.

proc univariate data= day1.class normal plot; var height; qqplot height/normal (mu=est sigma=est color=green); run;

The qqplot (Quantile-Quantile plot) is an alternate technique for examining whether a variable is normally distributed or not. The options 'normal' and 'plot' on the procedure statement request tests of normality and a set of basic plots, including a normal probability plot of the variable. The keyword 'qqplot' generates a plot which compares a hypothetical normal line (having the estimated mean and standard deviation) with the actual points from the distribution. If the actual points of the distribution lie around the green coloured normal line, then the normality of the variable holds.

proc univariate data= day1.candy_sales_summary normal plot; var sale_amount; qqplot sale_amount/normal (mu=est sigma=est color=green); run;

This is the same code, executed for a different data set: candy_sales_summary. The mean of the variable sale_amount (4951.97) is considerably different from its median (4040.525) and mode (0.00). Also, the average fluctuation in the data set, represented by the standard deviation, is very high (3986). This means that the mean is not a 'good' representative value for the data set. Since the mean, median and mode differ so widely, it is reasonable to conclude that the variable sale_amount is not normally distributed.

BOXPLOT AND THE EXISTENCE OF OUTLIERS

The quality of the measures of central tendency and dispersion is affected adversely by the presence of outliers. The box plot is widely used to examine the existence of outliers in a data set. Our reference data set is a hypothetical data set consisting of the marks and the name of the subject. Two important facts that must be kept in mind for a box plot are:

The number of observations in the data set must be at least five.

If there is more than one category in the data set, the data set must be sorted according to the category.

proc import datafile="C:\Documents and Settings\OrangeTree\Desktop\Tanmoy\Book1.csv" out=day1.boxplot dbms=csv replace; run;

A data set containing the marks of 5 students in the subjects English and Maths exists in csv format. The file is imported into the SAS library by using the 'proc import' code. The logic of this code is to read the given file in its existing format, convert it to a SAS data set, and replace any existing data set of the same name with the freshly imported one.

proc boxplot data=day1.boxplot; plot marks*subject/ boxstyle=schematic; run;

'boxplot' is the keyword for generating a boxplot. The plot is drawn between the marks obtained by the students and the subject. Outliers in the data set appear as individual points beyond the whiskers of the box. 'boxstyle' is the option that selects a particular format of the boxplot; 'schematic' is the style that marks outliers in this way.

Chapter 2

INTRODUCTION TO PROBABILITY THEORY

Future events are far from certain in the business world. Most managers who use probabilities are concerned with two conditions:

The case when one event or another will occur

The situation where two or more events will both occur

We are interested in the first case when we ask, "What is the probability that today's demand will exceed our inventory?" To illustrate the second situation, we could ask, "What is the probability that today's demand will exceed our inventory and that more than 10% of our sales force will not report for work?"

Probability is used throughout business to evaluate financial and decision-making risks. Every decision made by management carries some chance of failure, so probability analysis is conducted formally ("math") and informally ("I hope"). Consider, for example, a company considering entering a new business line. If the company needs to generate \$500,000 in revenue in order to break even and its probability distribution tells it that there is a 10 percent chance that revenues will be less than \$500,000, the company knows roughly what level of risk it is facing if it decides to pursue that new business line.

Three Approaches Towards Probability

Classical Approach:

$$P(\text{event}) = \frac{\text{Number of outcomes where the event occurs}}{\text{Total number of possible outcomes}}$$

Relative Frequency Approach:

Suppose we are tossing a coin. Initially the ratio of the number of heads to the number of trials will remain volatile. As the number of trials increases, the ratio converges to a fixed number (say 0.5). So the probability of getting a head is 0.5; this concept has been shown in the following chart.
[Chart: ratio of heads to trials (vertical axis, marked at 0.5 and 1.0) against the number of trials (horizontal axis, 20 to 260)]
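
The convergence of the ratio can be reproduced with a short simulation. The following is only a minimal sketch and not part of the original course material: the data set name coin_tosses, the seed 2024 and the 260 tosses (chosen to match the horizontal axis of the chart) are assumptions made for illustration.

```sas
/* Sketch: running ratio of heads in simulated tosses of a fair coin.   */
/* The seed and the number of tosses are arbitrary illustrative values. */
data coin_tosses;
    call streaminit(2024);                        /* makes the run repeatable  */
    heads = 0;
    do trial = 1 to 260;
        heads = heads + rand('bernoulli', 0.5);   /* 1 = head, 0 = tail        */
        ratio = heads / trial;                    /* share of heads so far     */
        output;                                   /* one observation per toss  */
    end;
run;

proc gplot data=coin_tosses;
    plot ratio*trial / vref=0.5;                  /* reference line at 0.5     */
run;
quit;
```

With a different seed the early part of the curve will look different, but the ratio should still settle near 0.5 as the number of trials grows.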

Axiomatic Approach:

A) 0 ≤ P(A) ≤ 1, for every event A

B) ∑ P(A) = 1, where the sum is taken over a set of mutually exclusive and exhaustive events

Apart from all these, there is a concept of subjective probability. It is based on an individual's past experience and intuition. Most higher-level social and managerial decisions are concerned with specific, unique situations. Decision makers at this level make considerable use of subjective probability.

Concept of Random Variable

Informally, a random variable is the value of a measurement associated with an experiment, e.g. the number of heads in n tosses of a coin. More formally, a random variable is defined as follows:

A random variable over a sample space is a function that maps every sample point (i.e. outcome) to a real number.

The picture shown has all the outcomes when two dice are rolled. We can define a random variable X which is the sum of the points that appear on the two dice. Then X can assume values from 2 to 12. Each of these numbers represents a set of outcomes. Elements of such sets have the same outer color, e.g. for X = 5, we have the outcomes in the yellow boxes.

Based on the events that we have, there can be two types of random variables: discrete random variables and continuous random variables. In the previous example, we are talking about a discrete random variable. In contrast, consider human weight: John Brower Minnoch had a weight of 635 kg. Let's say this is the upper limit of human weight. Then the weight of a person can take any value between 0 and 635, so here the random variable weight is continuous.

Probability Mass Function

Probability mass function (pmf) is a function that gives the probability that a discrete random variable is exactly equal to some value. The probability mass function is often the primary means of defining a discrete probability distribution. Suppose that S is the sample space of all outcomes of a single toss of a fair coin, and X is the random variable defined on S assigning 0 to "tails" and 1 to "heads". Since the coin is fair, the probability mass function is given by:

$$f_X(x) = \begin{cases} \frac{1}{2}, & x \in \{0, 1\} \\ 0, & \text{otherwise} \end{cases}$$

The probability mass function of a fair die has been shown in the chart. All the numbers on the die have an equal chance of appearing on top when the die is rolled, so $P(X = k) = 1/6$ for $k = 1, 2, \dots, 6$.

Probability Density Function

Probability density function (pdf), or density of a continuous random variable, is a function that describes the relative likelihood for this random variable to take on a given value. The probability for the random variable to fall within a particular region is given by the integral of this variable's density over the region. If f(x) is the density function, then the probability that X falls between a and b is given by

$$P(a \le X \le b) = \int_a^b f(x)\,dx$$

If you put this concept into a chart, then it will represent the area under the probability density function curve between a and b.

[Chart: the density f(x) plotted against x, with the points a and b marked on the horizontal axis]

Expectation of A Random Variable

Suppose you toss a coin 10 times and get 7 heads. "Hmm, strange," you say. You then ask a friend to try tossing the coin 20 times; she gets 15 heads and 5 tails. So you have, in all, 22 heads and 8 tails out of 30 tosses. What did you expect? Was it something close to 15 heads and 15 tails (half and half)? Now suppose you turn the tossing over to a machine and get 792 heads and 208 tails out of 1000 tosses of the same coin. Then you might be suspicious of the coin because it didn't live up to what you expected.

To obtain the expected value of a discrete random variable, we multiply each value that the random variable can assume by the probability of occurrence of that value and sum these products. The table below does this for the number of patients visiting a clinic. Again, remember that an expected value of 108.02 doesn't imply that tomorrow exactly 108.02 patients will visit the clinic.

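| Number of Patients (1) | Probability (2) | (1) x (2) |
|---|---|---|
| 100 | 0.01 | 1.00 |
| 101 | 0.02 | 2.02 |
| 102 | 0.03 | 3.06 |
| 103 | 0.05 | 5.15 |
| 104 | 0.06 | 6.24 |
| 105 | 0.07 | 7.35 |
| 106 | 0.09 | 9.54 |
| 107 | 0.10 | 10.70 |
| 108 | 0.12 | 12.96 |
| 109 | 0.11 | 11.99 |
| 110 | 0.09 | 9.90 |
| 111 | 0.08 | 8.88 |
| 112 | 0.06 | 6.72 |
| 113 | 0.05 | 5.65 |
| 114 | 0.04 | 4.56 |
| 115 | 0.02 | 2.30 |
| Expected Number of Patients | | 108.02 |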
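
The expected value in the table can be checked in SAS by multiplying each value by its probability and summing the products. This is a minimal sketch under stated assumptions: the data set name clinic and the variable names patients, probability and product are made up for illustration; the figures are the ones from the table.

```sas
/* Sketch: expected number of patients = sum of (value x probability). */
data clinic;
    input patients probability;
    product = patients * probability;      /* column (1) x (2) of the table */
    datalines;
100 0.01
101 0.02
102 0.03
103 0.05
104 0.06
105 0.07
106 0.09
107 0.10
108 0.12
109 0.11
110 0.09
111 0.08
112 0.06
113 0.05
114 0.04
115 0.02
;
run;

/* The sum of 'probability' should be 1 and the sum of 'product' 108.02. */
proc means data=clinic sum maxdec=2;
    var probability product;
run;
```

Checking that the probabilities add up to 1 is a useful habit: if they do not, the table is not a valid probability distribution.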

Probability Distributions

Probability distributions are related to frequency distributions. We can think of a probability distribution as a theoretical frequency distribution. A theoretical frequency distribution is a probability distribution that describes how outcomes are expected to vary. Because these distributions deal with expectations, they are useful models in making inferences and decisions under conditions of uncertainty. A probability distribution is a listing of the probabilities of all the possible outcomes that could result if the experiment were done.

As random variables are of two types, probability distributions are also of two types, namely discrete and continuous. The sum of the points on two rolled dice, for example, has a discrete probability distribution over the values 2 to 12.
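
The distribution of the sum of two dice can be tabulated by listing all 36 equally likely outcomes and counting how many give each sum. The code below is a sketch for illustration only; the data set name two_dice and the variable names die1, die2 and x are assumptions.

```sas
/* Sketch: probability distribution of the sum of two fair dice. */
data two_dice;
    do die1 = 1 to 6;
        do die2 = 1 to 6;
            x = die1 + die2;     /* the random variable X */
            output;              /* one row per outcome, 36 rows in all */
        end;
    end;
run;

/* Frequency / 36 is the probability; the Percent column shows it as a percentage. */
proc freq data=two_dice;
    tables x / nocum;
run;
```

The sum 7 occurs in 6 of the 36 outcomes, so it is the most likely value, while 2 and 12 occur in only one outcome each.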

Common Probability Distributions

Related to real-valued quantities that grow linearly (e.g. errors, offsets): Normal Distribution

Related to positive real-valued quantities that grow exponentially (e.g. prices, incomes, populations): Log-normal Distribution, Pareto Distribution

Related to real-valued quantities that are assumed to be uniformly distributed over a (possibly unknown) region: Uniform Distribution

Related to Bernoulli trials (yes/no events, with a given probability): Bernoulli Distribution, Binomial Distribution

Related to events in a Poisson process (events that occur independently with a given rate): Poisson Distribution, Exponential Distribution

Binomial Distribution

The binomial distribution describes discrete data resulting from an experiment known as a Bernoulli process. The tossing of a fair coin a fixed number of times is a Bernoulli process and the outcomes of such tosses can be represented by the binomial probability distribution. The success or failure of interviewees on an aptitude test may also be described by a Bernoulli process. On the other hand, the frequency distribution of the lives of fluorescent lights in a factory would be measured on a continuous scale of hours and would not qualify as a binomial distribution. The probability mass function, the mean and the variance are as follows:

$$P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k}, \quad k = 0, 1, \dots, n$$

$$\text{Mean} = np, \qquad \text{Variance} = np(1 - p)$$

where n is the number of trials and p is the probability of success in a single trial.

Characteristics of a Binomial Distribution

There can be only two possible outcomes: heads or tails, yes or no, success or failure.

Each Bernoulli process has its own characteristic probability. Take the situation in which historically seven tenths of all people who applied for a certain type of job passed the job test. We would say that the characteristic probability here is 0.7, but we could describe our testing results as Bernoulli only if we felt certain that the proportion of those passing the test (0.7) remained constant over time.

At the same time, the outcome of one test must not affect the outcome of the other tests.

Poisson Distribution

The Poisson distribution is used to describe a number of processes, including the distribution of telephone calls going through a switchboard system, the demand of patients for service at a health institution, the arrivals of trucks and cars at a tollbooth, and the number of accidents at an intersection. These examples all have a common element: they can be described by a discrete random variable that takes on integer values (0, 1, 2, 3, 4, and so on). The number of patients who arrive at a physician's office in a given interval of time will be 0, 1, 2, 3, 4, 5, or some other whole number. Similarly, if you count the number of cars arriving at a tollbooth on a highway during some 10-minute period, the number will be 0, 1, 2, 3, 4, 5, and so on. The probability mass function, the mean and the variance are as follows:

$$P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}, \quad k = 0, 1, 2, \dots$$

$$\text{Mean} = \lambda, \qquad \text{Variance} = \lambda$$

where λ is the average number of events in the interval.

Characteristics of a Poisson Distribution

If we consider the example of the number of cars, then the average number of vehicles that arrive per rush hour can be estimated from the past traffic data.

If we divide the rush hour into intervals of one second each, we will find the following statements to be true:

The probability that exactly one vehicle will arrive at the single booth per second is a very small number and is constant for every one-second interval.

The probability that two or more vehicles will arrive within a one-second interval is so small that we can assign it a zero value.

The number of vehicles that arrive in a given one-second interval is independent of the time at which that one-second interval occurs during the rush hour.

The number of arrivals in any one-second interval is not dependent on the number of arrivals in any other one-second interval.

Normal Distribution

The normal distribution has applications in many areas of business administration. For example:

Modern portfolio theory commonly assumes that the returns of a diversified asset portfolio follow a normal distribution.

In operations management, process variations often are normally distributed.

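In human resource management, employee performance sometimes is considered to be normally distributed.

The probability density function, mean, and variance are given by

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right), \quad -\infty < x < \infty$$

$$\text{Mean} = \mu, \qquad \text{Variance} = \sigma^2$$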

Is The Distribution Normal?

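As a rule of thumb, the following conditions should be satisfied for a distribution to be treated as approximately normal:

The mean, median and mode should be almost equal

The standard deviation should be low relative to the mean

Skewness and kurtosis should be close to zero (SAS reports kurtosis in excess of the normal value of 3, so zero corresponds to a normal distribution)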

Median should lie exactly in between the upper and lower quartile

Normal Probability Plot

The normal probability plot is a graphical technique for normality testing: assessing whether or not a data set is approximately normally distributed. Here we are basically comparing the observed cumulative probability with the theoretical cumulative probability. If the observed data are really from the normal distribution, then we should get a straight line as shown in the chart.

Q - Q Plot

The points in this graph are obtained through inverting the cumulative distribution function. Here we are comparing the points of observed distribution to the theoretical distribution for the same probability level. Here again, if the data are from the theoretical distribution, the plot will be a straight line.

Standard Normal Distribution

It is a normal distribution with mean μ = 0 and standard deviation σ = 1.

For any normal distribution, about 68.3% of the data lies within the (mean - standard deviation, mean + standard deviation) range.
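
The share of a normal distribution that lies within one standard deviation of the mean can be checked with the 'cdf' function. The code below is a small illustrative sketch; the variable names within_1sd and within_2sd are assumptions. When the mean and standard deviation are left out, cdf('normal', x) uses the standard normal distribution (mean 0, standard deviation 1).

```sas
/* Sketch: probability that a standard normal value lies within 1 and 2 */
/* standard deviations of the mean. Results are written to the SAS log. */
data _null_;
    within_1sd = cdf('normal', 1) - cdf('normal', -1);   /* about 0.6827 */
    within_2sd = cdf('normal', 2) - cdf('normal', -2);   /* about 0.9545 */
    put within_1sd= 6.4 within_2sd= 6.4;
run;
```

Because any normal variable can be standardized by subtracting its mean and dividing by its standard deviation, the same proportions hold for every normal distribution.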

SAS IMPLEMENTATION

BINOMIAL DISTRIBUTION

data binom; binom_prob=pdf ('binomial', 50, 0.6, 100); run;

This code calculates the probability of getting fifty successes in 100 trials of a binomial experiment where the probability of getting a success in a single trial is 0.6. This code creates a new data set by the name 'binom' in the work library. The data set can also be created in a permanent library by prefixing a library name. binom_prob is the variable that stores the probability associated with the above-mentioned outcome. 'pdf' stands for probability density function. It generates the probability associated with the given outcome, given the parameters of the distribution. 'pdf' is the general function for calculating the probabilities (or densities) associated with various points of a distribution, be it discrete or continuous; for a discrete distribution such as the binomial, the value it returns is the probability mass at that point.
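
To see what the 'pdf' function does for the binomial case, its result can be compared with the binomial formula, P(X = k) = C(n, k) p^k (1 - p)^(n - k), written out by hand. This is a sketch for illustration only; the data set name binom_check and the variable names are assumptions.

```sas
/* Sketch: binomial probability by formula versus the pdf function. */
data binom_check;
    n = 100;  k = 50;  p = 0.6;
    by_formula = comb(n, k) * p**k * (1 - p)**(n - k);   /* C(n,k) p^k (1-p)^(n-k) */
    by_pdf     = pdf('binomial', k, p, n);               /* same order as in the text */
    mean       = n * p;                                  /* 60 */
    variance   = n * p * (1 - p);                        /* 24 */
run;

proc print data=binom_check;
run;
```

The two probabilities should agree up to rounding error. Note the order of the arguments of 'pdf': the number of successes comes first, then the probability of success, then the number of trials.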

data binom_plot; do x=0 to 20; binom_prob=pdf ('binomial', x, 0.5, 20); output; end; run;

This code generates a schedule of probabilities associated with the various numbers of successes in the binomial distribution. A loop is created for generating the schedule. The number of successes in the experiment is the loop variable. The parameters of the distribution, namely the number of trials (20) and the probability of success (0.5), are specified. The loop is terminated using the 'end' keyword. The keyword 'output' writes one observation to the data set at each iteration.

proc gplot data=binom_plot; plot binom_prob*x; run;

This command directly plots the binomial probability distribution with probabilities on the vertical axis and the number of successes on the horizontal axis.

data binom_plot; do x=0 to 20; binom_prob=pdf ('binomial', x, 0.3, 20); output; end; run;

This is the command to generate the binomial probability distribution for 0 to 20 successes in 20 trials with a much lower probability of success (0.3). Examining the nature of the distribution over changing values of the probability of success gives us a fair idea of the Skewness of the distribution. If the probability of obtaining a success in a particular trial is 'low' then the chance of getting a very high number of successes is 'low' and that of getting a 'low' number of successes is high. The distribution, given this specification of the parameters, has a longer right tail, i.e. it is a positively skewed distribution.

proc gplot data=binom_plot; plot binom_prob*x; run;

The command plots the binomial probability distribution for the newly specified parameters with the values of the probability on the vertical axis and the number of successes on the horizontal axis. The graphical representation displays the varying nature of Skewness in the distribution very distinctly.

POISSON DISTRIBUTION

data day1.poisson; pois_prob=pdf ('Poisson', 12, 10); run;

This is a data step where a data set by the name 'poisson' is created in the permanent library day1. The syntax of the function 'pdf' for this distribution is as follows:

New variable = pdf (name of the distribution, value of x, value of the mean). This code calculates the probability of observing exactly 12 events in a Poisson experiment whose parameter, the mean number of events, is 10. Unlike the binomial distribution, the Poisson distribution has no number of trials.
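
As a check on the meaning of the two numbers passed to 'pdf', the same probability can be computed from the Poisson formula, P(X = k) = λ^k e^(-λ) / k!. The sketch below is for illustration only; the data set name pois_check and the variable names are assumptions.

```sas
/* Sketch: Poisson probability by formula versus the pdf function. */
data pois_check;
    lambda = 10;                                         /* mean number of events    */
    k      = 12;                                         /* number of events counted */
    by_formula = exp(-lambda) * lambda**k / fact(k);     /* lambda^k e^(-lambda) / k! */
    by_pdf     = pdf('poisson', k, lambda);
run;

proc print data=pois_check;
run;
```

The two values should agree up to rounding error, which confirms that the second argument is the count and the third is the mean.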

data day1.pois_plot; do x=0 to 25; pois_prob=pdf ('Poisson', x, 10); output; end; run;

This data step generates the schedule of Poisson probabilities for 0 to 25 events, with a mean of 10. The 'output' keyword writes one observation to the data set at each iteration.

proc gplot data=day1.pois_plot; plot pois_prob*x; run;

This command plots the Poisson probability distribution with probabilities on the vertical axis and the number of events on the horizontal axis. The following set of codes is used for analyzing the Skewness associated with the Poisson distribution:

data day1.pois_plot; do x= 0 to 25; pois_prob=pdf ('poisson', x, 10.5); output; end; run;

proc gplot data=day1.pois_plot; plot pois_prob*x; run;

The last two pieces of code can be used to analyse the nature of the Skewness of the Poisson distribution. The Skewness can be analyzed by changing the parameter of the distribution, here from 10 to 10.5, and plotting the result again.

NORMAL DISTRIBUTION

data day1.normal; do x=-12 to 18 by 0.05; normal_prob=pdf ('normal', x, 3, 8); output; end; run;

This is a data step which creates a new data set 'normal' in the user-defined library day1. The command generates the normal probability density. The values of the respective probability densities are stored in the variable 'normal_prob'. The syntax of this function is: Name of the variable = pdf (distribution name, value of x, mean, standard deviation). Note that the last argument is the standard deviation, not the variance; here the mean is 3 and the standard deviation is 8. The mean and standard deviation must be specified for a proper characterization of the normal distribution. The schedule of densities corresponding to the different values of x is generated using the 'do' loop. Since the normal distribution is a continuous distribution, x assumes continuous values. By default, the loop variable increases in steps of 1, which gives a coarse, discrete grid. To approximate a continuous curve we increase x in steps of 0.05. The result of each iteration is written to the data set using the 'output' keyword.

proc gplot data=day1.normal; plot normal_prob*x; run;

This command plots the normal probability density with the densities on the vertical axis and the values of x on the horizontal axis. The graph obtained from the data 'normal' is symmetric in nature.
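
The role of the last argument of 'pdf' for the normal distribution can be checked by writing out the density formula, f(x) = exp(-(x - μ)^2 / (2σ^2)) / (σ √(2π)), by hand. The sketch below is only an illustration; the data set name normal_check, the variable names and the point x = 5 are assumptions.

```sas
/* Sketch: normal density by formula versus the pdf function.             */
/* sigma is the standard deviation, so the variance in this case is 64.   */
data normal_check;
    mu = 3;  sigma = 8;  x = 5;
    by_formula = exp(-(x - mu)**2 / (2 * sigma**2))
                 / (sigma * sqrt(2 * constant('pi')));
    by_pdf     = pdf('normal', x, mu, sigma);
run;

proc print data=normal_check;
run;
```

If the two values agree, the fourth argument is being read as the standard deviation. Passing the variance instead would give a flatter curve than intended.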

proc univariate data=day1.class normal plot; var height; histogram height/normal (mu=est sigma=est color=green); run;

'proc univariate' is the procedure for listing out all the descriptive statistics associated with the variable 'height', which is our analysis variable. The keyword 'histogram' is used for generating a histogram over which a normal curve is superimposed. The normal curve here is a green coloured curve, specified by the estimated mean and the estimated standard deviation. Superimposing the normal curve over the histogram gives us an idea whether the variable is normally distributed. If the normal curve fits nicely onto the histogram then we say that the variable is 'normally distributed'. The variable 'height' in the data set 'class' is approximately normal, and this can be clearly observed in the resulting diagram.

proc univariate data=day1.candy_sales_summary normal plot; var sale_amount; histogram sale_amount/normal (mu=est sigma=est color=green); run;

This is the same code as above, used on a different variable in a different data set. The variable 'sale_amount' is not normally distributed and the normal curve does not fit symmetrically on the histogram.

Chapter 3

SAMPLING THEORY AND ESTIMATION

Sampling is concerned with the selection of a subset of individuals from within a population to estimate characteristics of the whole population. Researchers rarely survey the entire population because the cost of a census is too high. The three main advantages of sampling are that the cost is lower, data collection is faster, and since the data set is smaller it is possible to ensure homogeneity and to improve the accuracy and quality of the data.

Concept of Population

A population is the complete set of individuals or items whose characteristics we want to know. A sample is a subset selected from that population and is used to estimate those characteristics.

Techniques of Sampling

There are two broad techniques of sampling: Probability Sampling (or Random Sampling) and Non-probability Sampling. Of these, only Random Sampling can be used for statistical inference.

Probability Sampling or Random Sampling

Probability sampling, or random sampling, is a sampling technique in which the probability of getting any particular sample may be calculated. Examples of random sampling include:

Simple Random Sampling

Without Replacement: One deliberately avoids choosing any member of the population more than once.

With Replacement: One member can be chosen more than once.

Systematic Sampling

Systematic sampling relies on arranging the target population according to some ordering scheme and then selecting elements at regular intervals through that ordered list. For example, suppose you are taking data from every 10th person entering a mall.
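
In SAS, these sampling schemes can be carried out with 'proc surveyselect'. The code below is a minimal sketch, not taken from the course material: it assumes the sample data set sashelp.class (19 students) that ships with SAS, and the sample size of 5, the seed 101 and the output data set names are arbitrary choices for illustration.

```sas
/* Simple random sampling without replacement: 5 distinct students. */
proc surveyselect data=sashelp.class method=srs n=5 seed=101 out=srs_sample;
run;

/* Simple random sampling with replacement: the same student may be drawn */
/* more than once; 'outhits' writes one row for each time a student is drawn. */
proc surveyselect data=sashelp.class method=urs n=5 seed=101 outhits out=urs_sample;
run;

/* Systematic sampling: students taken at a regular interval down the list. */
proc surveyselect data=sashelp.class method=sys n=5 seed=101 out=sys_sample;
run;
```

The 'seed' option only makes the selection repeatable; leaving it out gives a different sample on every run.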

Stratified Sampling

Where the population embraces a number of distinct categories or "strata", each stratum is then sampled as an independent sub-population, out of which individual elements can be randomly selected. Suppose that a company has the following staff:

male, full-time: 90
male, part-time: 18
female, full-time: 9
female, part-time: 63
Total: 180

and we are asked to take a sample of 40 staff, stratified according to the above categories.

The first step is to find the total number of staff (180) and calculate the percentage in each group.

% male, full-time = 90 / 180 = 50%

% male, part-time = 18 / 180 = 10%

% female, full-time = 9 / 180 = 5%

% female, part-time = 63 / 180 = 35%

This tells us that of our sample of 40, 50% should be male, full-time; 10% should be male, part-time; 5% should be female, full-time; and 35% should be female, part-time.

50% of 40 is 20.
10% of 40 is 4.
5% of 40 is 2.
35% of 40 is 14.

Another easy way, without having to calculate the percentage, is to multiply each group size by the sample size and divide by the total population size (size of entire staff):

male, full-time = 90 x (40 / 180) = 20

male, part-time = 18 x (40 / 180) = 4

female, full-time = 9 x (40 / 180) = 2

female, part-time = 63 x (40 / 180) = 14
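
The stratified sample of 40 staff worked out above could be drawn in SAS roughly as follows. This is a sketch under stated assumptions: the data set staff is built artificially from the four group sizes, and the names gender, status and staff_sample and the seed are made up for illustration. The stratum sample sizes are listed in the sorted order of the strata (female full-time, female part-time, male full-time, male part-time).

```sas
/* Build an artificial staff list with the four group sizes from the text. */
data staff;
    length gender $6 status $9;
    gender = 'male';   status = 'full-time'; do i = 1 to 90; output; end;
    gender = 'male';   status = 'part-time'; do i = 1 to 18; output; end;
    gender = 'female'; status = 'full-time'; do i = 1 to 9;  output; end;
    gender = 'female'; status = 'part-time'; do i = 1 to 63; output; end;
    drop i;
run;

/* The input data must be sorted by the stratification variables. */
proc sort data=staff;
    by gender status;
run;

/* Draw 2, 14, 20 and 4 staff from the four strata, 40 in total. */
proc surveyselect data=staff method=srs sampsize=(2 14 20 4)
                  seed=2024 out=staff_sample;
    strata gender status;
run;

/* Check the composition of the sample. */
proc freq data=staff_sample;
    tables gender*status / norow nocol nopercent;
run;
```

Within each stratum the selection is a simple random sample without replacement, so the sample keeps the same proportions as the full staff.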

Non-Probability Sampling

In non-probability sampling, we cannot assign any probability to the selected sample. Non-probability sampling techniques cannot be used to infer from the sample to the general population.

Examples of nonprobability sampling include:

Convenience, Haphazard or Accidental sampling - members of the population are chosen based on their relative ease of access. Sampling friends, co-workers, or shoppers at a single mall are all examples of convenience sampling.

Judgmental sampling or Purposive sampling - The researcher chooses the sample based on who they think would be appropriate for the study. This is used primarily when there is a limited number of people that have expertise in the area being researched.

Sampling Bias

In statistics, sampling bias is when a sample is collected in such a way that some members of the intended population are less likely to be included than others. It results in a biased sample, a non-random sample of a population (or non-human factors) in which all individuals, or instances, were not equally likely to have been selected. If this is not accounted for, results can be erroneously attributed to the phenomenon under study rather than to the method of sampling.

Sampling Distribution

The sampling distribution of a statistic is the distribution of that statistic, considered as a random variable, when derived from a random sample of size n. It may be considered as the distribution of the statistic for all possible samples of a given size from the same population.
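
The idea of a sampling distribution can be made concrete by simulation: draw many samples of the same size, compute the statistic for each one, and look at how the values are spread. The code below is a sketch only; the choice of an exponential population, the 1000 samples of size 30, the seed and all data set and variable names are assumptions made for illustration.

```sas
/* Sketch: sampling distribution of the sample mean.            */
/* 1000 samples of size 30 from an exponential population.      */
data samples;
    call streaminit(1234);
    do sample_id = 1 to 1000;
        do obs = 1 to 30;
            x = rand('exponential');     /* skewed population with mean 1 */
            output;
        end;
    end;
run;

/* One mean per sample: these 1000 values approximate the sampling distribution. */
proc means data=samples noprint;
    by sample_id;
    var x;
    output out=sample_means mean=xbar;
run;

proc univariate data=sample_means noprint;
    var xbar;
    histogram xbar / normal (mu=est sigma=est color=green);
run;
```

Although the population here is strongly skewed, the histogram of the sample means is much closer to a bell shape and is centred near the population mean.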

Population Parameters and The Estimation Theory

A statistical parameter is a parameter that indexes a family of probability distributions. It can be regarded as a numerical characteristic of a population or a model. For example, the family of normal distributions has two parameters, the mean μ and the variance σ^2: if these are specified, the distribution is known exactly. The family of Poisson distributions, on the other hand, has only one parameter, the mean λ.

In statistics, our purpose is to learn about the population by studying the samples. Estimation refers to the process by which one makes inferences about a population, based on information obtained from a sample. Statisticians use sample statistics to estimate population parameters. For example, sample means are used to estimate population means. So, the sample mean is an estimator here and the value of the mean is the estimate. So as a parameter is to the population, a statistic is to a sample.

Types of Estimator

There are two types of estimator: Point Estimator and Interval Estimator. A point estimator yields a single-valued result, whereas an interval estimator results in a range of plausible values.

Properties of Estimator

Unbiased: The estimator $\hat{\theta}$ is an unbiased estimator of the parameter $\theta$ if and only if the expectation of the estimator is equal to the population parameter, i.e. $E(\hat{\theta}) = \theta$.

Consistency: An estimator is called consistent if increasing the sample size increases the probability of the estimator being close to the population parameter.

Efficiency: Among unbiased estimators, there often exists one with the lowest variance, called the minimum variance unbiased estimator (MVUE) or an efficient estimator.

Sufficiency: An estimator is called sufficient if no other statistic which can be calculated from the same sample provides any additional information as to the value of the parameter.

Testing of Statistical Hypothesis

Statistical hypotheses are statements about real relationships; and like all hypotheses, statistical hypotheses may match the reality, or they may fail to do so. Statistical hypotheses have the special characteristic that one ordinarily attempts to test them (i.e., to reach a decision about whether or not one believes the statement is correct, in the sense of corresponding to the reality) by observing facts relevant to the hypothesis in a sample. This procedure, of course, introduces the difficulty that the sample may or may not represent well the population from which it was drawn.

Types of Hypotheses

Null Hypothesis (H0): Hypothesis testing works by collecting data and measuring how likely the particular set of data is, assuming the null hypothesis is true. If the data set is very unlikely, defined as being part of a class of sets of data that only rarely will be observed, the experimenter rejects the null hypothesis, concluding it (probably) is false. The null hypothesis can never be proven; the only thing we can do is to reject it or fail to reject it.

Alternative Hypothesis (H1 or HA): The alternative hypothesis (or maintained hypothesis or research hypothesis) and the null hypothesis are the two rival hypotheses which are compared by a statistical hypothesis test. An example might be where water quality in a stream has been observed over many years and a test is made of the null hypothesis that there is no change in quality between the first and second halves of the data against the alternative hypothesis that the quality is poorer in the second half of the record.

Examples of Statistical Hypotheses

The mean age of all Calcutta University students is 23.4 years.

The proportion of Calcutta University students who are women is 50 percent.

The heights of all the male students of Calcutta University are normally distributed.

Types of Errors in Testing of Hypothesis

There are two types of error as follows:

Type I Error: A type I error, also known as an error of the first kind, occurs when the null hypothesis (H0) is true, but is rejected. It is asserting something that is absent, a false hit. In terms of folk tales, an investigator may be "crying wolf" without a wolf in sight (raising a false alarm) (H0: no wolf).

Type II Error: A type II error, also known as an error of the second kind, occurs when the null hypothesis is false, but it is erroneously accepted as true. It is failing to see what is present, a miss. A type II error may be compared with a so-called false negative (where an actual 'hit' was disregarded by the test and seen as a 'miss') in a test checking for a single condition with a definitive result of true or false. A type II error is committed when we fail to believe a truth.

Consequences of Type I and Type II Errors

Both types of errors are problems for individuals, corporations, and data analysis. Based on the real-life consequences of an error, one type may be more serious than the other. For example, NASA engineers would prefer to throw out an electronic circuit that is really fine (null hypothesis H0: not broken; reality: not broken; action: thrown out; error: type I, false positive) than to use one on a spacecraft that is actually broken (null hypothesis H0: not broken; reality: broken; action: use it; error: type II, false negative). In that situation a type I error raises the budget, but a type II error would risk the entire mission.


Level of Significance

Statistical significance is a statistical assessment of whether observations reflect a pattern rather than just chance, the fundamental challenge being that any partial picture is subject to observational error. In statistical testing, a result is deemed statistically significant if it is unlikely to have occurred by chance, and hence provides enough evidence to reject the hypothesis of 'no effect'. As used in statistics, significant does not mean important or meaningful, as it does in everyday speech.

The significance level is usually denoted by the Greek symbol α. Popular levels of significance are 10% (0.1), 5% (0.05), 1% (0.01), 0.5% (0.005), and 0.1% (0.001). If a test of significance gives a p-value lower than the significance level α, the null hypothesis is rejected.

Confidence Interval

In statistics, a confidence interval (CI) is a kind of interval estimate of a population parameter and is used to indicate the reliability of an estimate. Confidence intervals consist of a range of values (interval) that act as good estimates of the unknown population parameter. However, in rare cases, none of these values may cover the value of the parameter. The level of confidence of the confidence interval indicates the probability that the confidence range captures the true population parameter given a distribution of samples. If a corresponding hypothesis test is performed, the confidence level corresponds with the level of significance, i.e. a 95% confidence interval reflects a significance level of 0.05, and the confidence interval contains the parameter values that, when tested, should not be rejected with the same sample.
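The link between the confidence level and the significance level can be sketched with a short calculation. The example below is in Python rather than SAS and is only an illustration: it assumes the population standard deviation is known, and the sample mean, standard deviation and sample size are made-up numbers, not values from any data set used in this book.

```python
from math import sqrt
from statistics import NormalDist


def z_confidence_interval(sample_mean, sigma, n, confidence=0.95):
    """Two-sided confidence interval for a mean when sigma is known."""
    alpha = 1 - confidence                         # matching significance level
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for 95%
    margin = z_crit * sigma / sqrt(n)              # half-width of the interval
    return sample_mean - margin, sample_mean + margin


# Illustrative numbers only (assumed for this sketch).
low, high = z_confidence_interval(sample_mean=96, sigma=12, n=55)
print(f"95% confidence interval: ({low:.2f}, {high:.2f})")
```

Any hypothesized mean lying outside the printed interval would be rejected by the corresponding two-sided test at the 0.05 level.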


SAS IMPLEMENTATION

SIMPLE RANDOM SAMPLING WITHOUT REPLACEMENT

/* Simple random sample of 50 observations, without replacement */
proc surveyselect data=day1.employee_satisfaction
    out=day1.emp1 method=srs n=50;
run;

Surveyselect is the procedure for executing a sampling procedure. The data set that we consider here is 'employee_satisfaction'. The method of sampling specified here is 'simple random sampling without replacement' (SRS). We have pre-specified the sample size to be 50. This is a proc step which generates a report. Some important concepts that appear in the report are:

Random Number Seed: An integer used to set the starting point for generating a series of random numbers. The seed fixes the starting point of the generator, so the same seed always reproduces the same sequence of random numbers and a different seed gives a different sequence. If no random number seed is specified, the numerical value of the system time is used to generate the random numbers.

Selection Probability: This is the probability that a given observation is included in a sample of 'n' observations drawn from a total of 'N' observations (N > n); under simple random sampling without replacement it equals n/N. Each observation is equally likely to be drawn from the population, and an observation once drawn is not returned to the population.

Sampling Weight: A sampling weight is a statistical correction factor that compensates for a sample design that tends to over- or under-represent various segments within a population. In some samples, small subsets of the population, such as religious, ethnic, or racial minorities, may be oversampled in order to have enough cases to analyze. When these subsamples are combined with the larger sample, their disproportionately large numbers must be diluted by a sampling weight. The sampling weight is simply the reciprocal of the selection probability.
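The relation between the selection probability and the sampling weight can be checked with a few lines of arithmetic. This is a minimal Python sketch, not SAS output; the population size of 400 is an assumed figure, since the size of the employee_satisfaction data set is not stated here.

```python
def selection_probability(n, N):
    """Chance that any one observation enters an SRS sample of size n."""
    return n / N


def sampling_weight(n, N):
    """Reciprocal of the selection probability."""
    return 1 / selection_probability(n, N)


n = 50    # sample size used in the proc surveyselect call
N = 400   # hypothetical population size (assumption)

prob = selection_probability(n, N)
weight = sampling_weight(n, N)
print(f"selection probability = {prob}, sampling weight = {weight}")
```

Under these assumed numbers each sampled employee stands for eight employees in the population.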

SIMPLE RANDOM SAMPLING WITH REPLACEMENT

/* Simple random sample of 50 observations, with replacement */
proc surveyselect data=day1.employee_satisfaction
    out=day1.emp2 method=urs n=50;
run;

This code describes an alternate technique of sampling. The method 'urs', or unrestricted sampling, refers to the type of random sampling where the sample points are returned to the population once the observations are recorded. This process is also called simple random sampling with replacement. The final data set might not contain 50 unique observations, since the same observation may be selected more than once. In this form of sampling, the report contains, in addition to the concepts introduced under srs, another concept called the expected number of hits.


In simple random sampling with replacement, the expected number of hits plays the role that the selection probability plays in sampling without replacement. It represents the average number of times a particular observation is selected in the process of random sampling with replacement, which is n/N when every observation is equally likely to be drawn. The sampling weight, in this context, is the reciprocal of the expected number of hits.

STRATIFIED RANDOM SAMPLING

STEP 1: SORTING THE DATA SET ACCORDING TO THE SUB_CATEGORY

/* Sort by the stratification variable before sampling */
proc sort data=day1.candy_sales_summary
    out=day1.candy_sort;
    by subcategory;
run;

The command sorts the data set by the variable 'subcategory'. Sorting is important because the strata statement in the next step expects the data to be grouped by the stratification variable. The variable 'subcategory' acts as the strata variable in the given data set.

STEP 2: SAMPLING USING THE STRATIFICATION TECHNIQUE

/* One sample size per stratum, in sorted order of subcategory */
proc surveyselect data=day1.candy_sort
    n=(5 7 15 10 12 8) method=seq
    out=day1.candy_seq;
    strata subcategory;
run;

The method of sampling applied within each stratum is the sequential random sampling technique. The number of observations to be chosen from each stratum is specified using 'n'; the six values are applied to the six strata in their sorted order.

SYSTEMATIC OR ORDERED SAMPLING

The sample, in this technique, is drawn from the population based on a particular order. For example, if a departmental store wants to know the level of customer satisfaction, it needs to survey its customers. If on a given day the store expects a footfall of 1000 customers and the required sample size is 100, then the store can question every 10th person walking in through the door.

/* Systematic sample of 30 observations */
proc surveyselect data=day1.candy_sort
    out=day1.candy_seq
    n=30 method=sys;
run;

The option 'method=sys' is used to execute the systematic sampling process. The sampling interval is calculated using K = N/n, where n = size of the sample and N = size of the population. So, to get a sample of 30 from a population of 1500 observations, every 50th observation should be surveyed.
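The interval rule K = N/n can be sketched in a few lines. The Python function below is an illustrative simplification, not the algorithm used by proc surveyselect: it assumes n is no larger than N, uses the integer part of N/n as the interval, and picks a random starting point inside the first interval. The function name and the customer list are hypothetical.

```python
import random


def systematic_sample(population, n, seed=None):
    """Take every k-th unit after a random start, with k = N // n."""
    N = len(population)
    k = N // n                       # sampling interval K = N / n
    rng = random.Random(seed)        # seeded generator for reproducibility
    start = rng.randrange(k)         # random start within the first interval
    return [population[start + i * k] for i in range(n)]


# Store example from the text: 1000 expected customers, sample of 100.
customers = list(range(1, 1001))
sample = systematic_sample(customers, n=100, seed=1)
print(sample[:5])
```

With 1000 customers and a sample of 100, consecutive sampled customers are always 10 apart, which matches the "every 10th person" rule described above.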


Chapter 4

IMPORTANT TESTS OF STATISTICAL SIGNIFICANCE (PART I)

Concept of Parametric Data

A parametric test is one that requires data from one of the large catalogue of distributions that statisticians have described. For data to be parametric, certain assumptions must hold. If you use a parametric test when your data are not parametric, the results are likely to be inaccurate. Therefore, it is very important to check the assumptions before deciding which statistical test is appropriate.

Assumptions of Parametric Test

Normally Distributed Data: It is assumed that the data are from one or more normally distributed populations. The rationale behind hypothesis testing relies on normally distributed populations, so if this assumption is not met then the logic behind hypothesis testing is flawed. Most researchers eyeball their sample data using a histogram, and if the sample data look roughly normal, they assume that the populations are also normal.

Homogeneity of Variance: This assumption means that the variance should be the same throughout the data. In designs in which you test several groups of participants, this assumption means that each of these samples comes from a population with the same variance.

Interval Data: Data should be measured at least at the interval level. This means that the distance between points of your scale should be equal at all parts along the scale. For example, if you had a 10 point anxiety scale, then the difference in anxiety represented by a change in score from 2 to 3 should be the same as that represented by a change in score from 9 to 10.

Independence: This assumption is that data from different participants are independent, which means that the behavior of one participant does not influence the behavior of another.

The assumptions of interval data and independent measurement are tested only by common sense. The assumption of homogeneity of variance is tested in different ways for different procedures.

Z Test

A Z-test is any statistical test for which the distribution of the test statistic under the null hypothesis can be approximated by a normal distribution.


Suppose that in a particular geographic region, the mean and standard deviation of scores on a reading test are 100 points and 12 points, respectively. Our interest is in the scores of 55 students in a particular school who received a mean score of 96. We can ask whether this mean score is significantly lower than the regional mean; that is, are the students in this school comparable to a simple random sample of 55 students from the region as a whole, or are their scores surprisingly low?

Assumptions

The parent population from which the sample is drawn should be normal

The sample observations are independent, i.e., the given sample is random

The population standard deviation σ is known
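The reading-test example can be worked through numerically. The sketch below is in Python rather than SAS and uses only the figures given in the example (regional mean 100, standard deviation 12, 55 students, school mean 96); the choice of a one-sided (lower-tail) test is an assumption made to match the question "are their scores surprisingly low?".

```python
from math import sqrt
from statistics import NormalDist

mu0 = 100     # regional mean under the null hypothesis
sigma = 12    # known population standard deviation
n = 55        # number of students in the school
xbar = 96     # observed school mean

standard_error = sigma / sqrt(n)          # standard deviation of the sample mean
z = (xbar - mu0) / standard_error         # Z test statistic
p_lower = NormalDist().cdf(z)             # one-sided p-value (lower tail)

print(f"z = {z:.3f}, one-sided p-value = {p_lower:.4f}")
```

The statistic comes out near -2.47, and the lower-tail p-value is below 0.01, so at the usual 5% or 1% levels the school's mean would be judged significantly lower than the regional mean.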

T Test

A t-test is any statistical hypothesis test in which the test statistic follows a Student's t distribution if the null hypothesis is true. Among the most frequently used t-tests are:

A one-sample location test of whether the mean of a normally distributed population has a value specified in a null hypothesis.

A two sample location test of the null hypothesis that the means of two normally distributed populations are equal.

A test of the null hypothesis that the difference between two responses measured on the same statistical unit has a mean value of zero.

A test of whether the slope of a regression line differs significantly from zero.

Assumptions

Most t statistics have the form t = Z/s, where Z and s are functions of the data.

Z follows a standard normal distribution under the null hypothesis, or the parent population from which the sample is drawn should be normal

The sample observations are independent, i.e., the given sample is random

The population standard deviation σ is unknown

Two Independent Samples T Test

Suppose you have conducted a survey on commitment to change in your organization, and you need to find out whether commitment to change differs between male and female staff members. Or suppose a researcher wants to find out whether middle-level employees are more satisfied than top-level employees; the researcher then needs the satisfaction scores for middle and top management. In both cases one variable (for example, satisfaction) is divided into two groups (middle and top level). In summary, when we need to compare two groups on one numeric variable, we use the Two Independent Samples T Test; the two samples are values of the same variable taken from two different groups. An assumption of the Two Independent Samples T Test is that the data are normally distributed.

Forming The Research Hypotheses

Suppose we have conducted a survey that recorded the salary of each respondent, and we want to check whether there is any difference between the salaries of male and female employees in the organization.

Example of research question: Are there any differences in the earnings of male and female employees?

What you need: One categorical independent variable with only two groups (e.g. sex: males/females). One continuous dependent variable (e.g. salary).

Hypotheses of Two Independent Samples t Test:

H0: The two population means are equal, i.e. there is no difference in earnings

H1: The two population means are not equal, i.e. there is a difference in earnings

Paired Sample T Test

A company markets an eight-week-long weight loss program and claims that at the end of the program, on average, a participant will have lost 5 pounds. On the other hand, you have studied the program and you believe that it is scientifically unsound and shouldn't work at all. You want to test the hypothesis that the weight loss program does not help people lose weight. Your plan is to get a random sample of people and put them on the program. You will measure their weight at the beginning of the program and then measure their weight again at the end of the program. Based on some previous research, you believe that the standard deviation of the weight difference over eight weeks will be 5 pounds.

Assumptions

The assumptions underlying the paired samples t-test are similar to those of the one-sample t-test but refer to the set of difference scores.

The observations are independent of each other

The dependent variable is measured on an interval scale

The differences are normally distributed in the population

Hypotheses of Paired Sample t Test:

H0: The two population means are equal

H1: The two population means are not equal

In summary, a paired sample t test assesses whether an action is effective or not.
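Because the paired test works on the set of difference scores, its statistic is just a one-sample t statistic computed on the differences. The Python sketch below illustrates this for the weight-loss setting; the six before and after weights are invented for illustration and are not data from this book.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical weights (in pounds) for six participants.
before = [182.0, 190.5, 177.0, 185.5, 192.0, 180.0]
after = [180.5, 188.0, 177.5, 183.0, 190.0, 179.5]

# The paired test only looks at the per-person differences.
diffs = [b - a for b, a in zip(before, after)]

d_bar = mean(diffs)                  # mean difference
s_d = stdev(diffs)                   # sample standard deviation of differences
n = len(diffs)
t_stat = d_bar / (s_d / sqrt(n))     # t statistic for H0: mean difference = 0
df = n - 1                           # degrees of freedom

print(f"mean difference = {d_bar:.3f}, t = {t_stat:.3f}, df = {df}")
```

The resulting t value would then be compared with a Student's t distribution with n - 1 degrees of freedom, which is what proc ttest does when the paired statement is used.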


SAS IMPLEMENTATION

A SINGLE VARIABLE T-TEST

The case study on a single variable t-test pertains to a leading hospital in the city. The baseline blood pressures for 60 patients belonging to different age groups were recorded. The data set contains three variables, namely: the subject (id variable), Age (numeric variable) and Baseline bp (numeric variable).

The objective of the case study is to check whether there has been a statistically significant change in the average blood pressure over a span of 45 days. We use the t-test in this case. However, before using the test we need to check the assumption of normality.

STEP 1: CHECK FOR NORMALITY

/* Descriptive statistics, normality tests and a Q-Q plot */
proc univariate data=day1.bp normal plot;
    var baselinebp;
    qqplot baselinebp / normal(mu=est sigma=est color=pink);
run;

The univariate procedure generates all the vital descriptive statistics associated with the variable 'baselinebp'. The Q-Q plot of 'baselinebp' shows that the observations lie very close to the hypothetical pink-coloured normal line. Therefore, baselinebp can be treated as approximately normally distributed.

STEP 2: TESTING THE SIGNIFICANCE OF THE HYPOTHESIS

/* One-sample t-test of H0: mean = 96 at the 5% level */
proc ttest data=day1.bp h0=96 alpha=0.05;
    var baselinebp;
run;

The procedure 'ttest' is used to run the Student's t-test. The option h0=96 specifies the null hypothesis that the population mean of baselinebp is equal to 96. Under this hypothesis, any difference between the observed average blood pressure and 96 is attributed to sampling fluctuations in the data set. The option 'alpha' sets the level of significance, which is the probability of committing a Type I error.

The t-test generates the following tables:

Statistics: This table reports the vital statistics associated with the variable baselinebp. It displays the sample mean, variance and standard error associated with the sample.

T-test: This table reports the results associated with the t-test. The most important component of this table is the p-value, which is shown at the end of the table. The p-value is 0.2688, which is much higher than the level of significance of 0.05. If the null hypothesis were true, a sample mean at least this far from 96 would be observed in roughly 27% of samples, so the data do not provide enough evidence against the null hypothesis. The analyst therefore does not reject H0. In this situation it can be inferred that the minor differences observed in the mean blood pressure are due to sampling fluctuations.


TWO INDEPENDENT SAMPLE T-TEST

The two-independent sample t-test is useful for examining significant differences in the means of two data sets. The present case study considers two renowned pizza companies: ABC and XYZ. The manager of the XYZ company is apprehensive about its falling sales compared to its competitor ABC. The observed delivery time for ABC is lower than for XYZ, but this can be treated as a crucial factor in explaining the declining sales of XYZ only if the mean delivery time of ABC is significantly less than the mean delivery time of XYZ.

STEP 1: IMPORTING THE REQUIRED FILE

The file containing the required information does not initially exist in the SAS library. The original file is in csv format, so we first import the data set using the 'import' procedure. This imports the data set into the SAS library and names it 'twoind_sample'.

/* Import the csv file into the day1 library */
proc import datafile="C:\Documents and Settings\OrangeTree\Desktop\Analytics data sets and case studies\twoindsample.csv"
    out=day1.twoind_sample
    dbms=csv replace;
run;

STEP 2: RUNNING THE T-TEST

/* Compare mean waiting time between the two companies */
proc ttest data=day1.twoind_sample;
    class company;
    var waiting_time_in_minutes_;
run;

The t-test is executed using the procedure 'ttest'. Since this is a t-test to check the difference of means between two groups, we introduce the 'class' statement to identify the two pizza companies. The variable on which the t-test is carried out is 'waiting_time_in_minutes_'. Three important tables are generated once the code is executed:

STATISTICS: This table reports the vital statistics associated with the two pizza companies. It suggests that the delivery time of the pizza company ABC is distinctly less than the delivery time of the company XYZ. This can be seen from the confidence intervals within which the mean delivery times of the two companies lie.

EQUALITY OF VARIANCES: To compare the means of two different data sets it is necessary to check the variances of the two sets. The mean-difference test is executed under the assumption that the population variance is the same across the two data sets. The equality of variances is tested using the Folded F-test. This is defined as $F = \max(s_1^2, s_2^2) / \min(s_1^2, s_2^2)$, where $s_1^2$ and $s_2^2$ are the sample variances of category 1 and category 2.


The hypothesis tested is:

H0: The population variances are equal

HA: The population variances are unequal

The decision rule used is the p-value rule: the null hypothesis is rejected only if the p-value falls below the level of significance. Here, the p-value associated with the folded F-statistic is 0.38, which is much greater than the level of significance. If the population variances were truly equal, a variance ratio at least this large would be observed in about 38% of samples, so the data give no evidence against the null hypothesis and we do not reject it. Therefore, it is reasonable to proceed as if the population variances of the two pizza companies are not significantly different.
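The folded F statistic is simply the larger sample variance divided by the smaller one. The Python sketch below illustrates the calculation; the delivery times are invented for illustration and are not the values in the twoindsample file.

```python
from statistics import variance

# Hypothetical delivery times in minutes (illustrative only).
abc = [18.5, 20.0, 17.0, 21.5, 19.0, 18.0]
xyz = [24.0, 22.5, 26.0, 21.0, 25.5, 23.0]

s1_sq = variance(abc)    # sample variance of category 1
s2_sq = variance(xyz)    # sample variance of category 2

# Folded F: larger variance over smaller variance, so F is never below 1.
folded_f = max(s1_sq, s2_sq) / min(s1_sq, s2_sq)

# Numerator degrees of freedom belong to the sample with the larger variance.
if s1_sq >= s2_sq:
    df_num, df_den = len(abc) - 1, len(xyz) - 1
else:
    df_num, df_den = len(xyz) - 1, len(abc) - 1

print(f"folded F = {folded_f:.3f} with ({df_num}, {df_den}) degrees of freedom")
```

A value close to 1 indicates similar variances; the p-value reported by proc ttest comes from comparing this ratio with an F distribution on the two degrees of freedom.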

T-TESTS: This table displays the results of the t-test for the difference in the mean delivery time of pizzas. The results are displayed under two sub-headings: Pooled Variance and Unequal Variance. Since the equality of variances was not rejected, we consider the results corresponding to the pooled variance. The p-value corresponding to the t-statistic is 0.0003, which is less than the prescribed level of significance. Therefore, we conclude that the mean delivery times of the pizza companies ABC and XYZ are significantly different from one another.

PAIRED SAMPLE T-TEST

To analyze the impact of e-learning on students, the Ministry of Human Resource Development of the Government of India performed an exploratory study on a sample of 50 students. The students were first taught by the traditional method of teaching and then through e-learning without the presence of any teachers. The marks of the students were recorded before and after the e-learning. The marks were then compared to analyze the impact of e-learning on the performance of the students.

STEP 1: IMPORTING THE DATA FILE

The first step in this part is to import the required data file using the import procedure. The original file is in csv format.

/* Import the csv file into the day1 library */
proc import datafile="C:\Documents and Settings\OrangeTree\Desktop\Analytics data sets and case studies\pairedsample.csv"
    out=day1.pairedsample
    dbms=csv replace;
run;

STEP2: RUNNING THE PAIRED SAMPLE T-TEST

/* Paired t-test on the marks before and after e-learning */
proc ttest data=day1.pairedsample;
    paired before*after;
run;

The statement 'paired' is used to execute the paired t-test between the marks 'before' and the marks 'after'. The hypotheses set up are:

H0: The ex-post and ex-ante means are not significantly different

HA: The ex-post and ex-ante means are significantly different

The results for this t-test are displayed through the following tables:

STATISTICS: The statistics table shows that the mean marks of students have increased after incorporating the e-learning process. The question that arises from this table is: is the rise in the mean marks after the e-learning a significant rise? To test the significance of the change we use the t-test table.

T-TEST: The t-test table details the significance of the difference of the paired means. The p-value rule is used for deciding whether the null hypothesis should be rejected or not. The p-value generated (0.4539) is greater than the level of significance, so the null hypothesis is not rejected. This means that the difference in the means is not statistically significant. Therefore, the analysis shows that the mean performance of the students did not change significantly after the e-learning process. Hence, the data do not show that the e-learning employed by the Ministry of Human Resource Development was effective as a strategy.


Chapter 5

UNDERSTANDING THE ASSOCIATION BETWEEN THE VARIABLES

Chi Square Test for Independence of Attributes

Consider the following questions:

Is there any association between income level and brand preference?

Is there any association between family size and size of washing machine bought?

Are the attributes educational background and type of job chosen independent?

The solutions to the above questions need the help of the Chi-Square test of independence in a contingency table. Please note that the variables involved in Chi-Square analysis are nominally scaled. Nominal data are also known by two other names: categorical data and attribute data.

Contingency Table: Is there any relation between age and investment?

                Investment
Age         Stock   Bond   Cash   Total
25 - 34        30     10      1      41
35 - 44        35     25      2      62
45 - 54        38     35      4      77
55 - 70        22     30      4      56
Total         125    100     11     236

Assumptions

The data should be categorical variables

Total frequency should be reasonably large, say greater than 50

The observations of the sample are independent, i.e., the samples are random

The theoretical frequency of any category or class should not be less than 5

Hypotheses of the test are:

H0: There is no association between the variables

H1: There is an association between the variables

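Calculation of Chi Square Statistic

$\chi^2 = \sum_i \sum_j (O_{ij} - E_{ij})^2 / E_{ij}$, where $O_{ij}$ is the observed frequency and $E_{ij}$ is the theoretical frequency of the cell in row i and column j. For a table with r rows and c columns, the statistic has (r - 1)(c - 1) degrees of freedom.

Calculation of Theoretical Frequency

$E_{ij} = (\text{total of row } i \times \text{total of column } j) / \text{grand total}$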

Remember, Chi square test of independence only checks whether there is any associ- ation between the attributes, but it does not tell what is the nature of the association.
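The chi-square statistic compares each observed frequency with the theoretical frequency expected under independence (row total times column total, divided by the grand total). The Python sketch below applies this to the age and investment contingency table given earlier in this chapter; it is an illustration of the arithmetic only and does not reproduce SAS output.

```python
# Observed frequencies: columns are Stock, Bond, Cash.
observed = {
    "25-34": [30, 10, 1],
    "35-44": [35, 25, 2],
    "45-54": [38, 35, 4],
    "55-70": [22, 30, 4],
}

rows = list(observed.values())
row_totals = [sum(r) for r in rows]
col_totals = [sum(c) for c in zip(*rows)]
grand_total = sum(row_totals)

# Theoretical frequency of each cell under independence.
expected = [[rt * ct / grand_total for ct in col_totals] for rt in row_totals]

# Sum of (observed - expected)^2 / expected over all cells.
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(rows, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(rows) - 1) * (len(col_totals) - 1)

# Count cells that break the "theoretical frequency of at least 5" rule.
small_cells = sum(e < 5 for exp_row in expected for e in exp_row)

print(f"chi-square = {chi_square:.2f}, df = {df}, cells with E < 5: {small_cells}")
```

For this table the statistic is close to 12 on 6 degrees of freedom. All four theoretical frequencies in the Cash column fall below 5, so the fourth assumption listed above is not met; one common remedy, offered here only as a suggestion, is to merge Cash with another category before testing.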

Correlation Analysis

The simplest way to look at whether two variables are associated is to look at whether they covary. To understand what covariance is, we first need to think back to the concept of variance.

$s_x^2 = \frac{\sum (x_i - m_x)^2}{N - 1} = \frac{\sum (x_i - m_x)(x_i - m_x)}{N - 1}$

The mean of the sample is represented by $m_x$, $x_i$ is the data point in question and N is the number of observations. If we are interested in whether two variables are related, then we are interested in whether changes in one variable are met with similar changes in the other variable. When there are two variables, rather than squaring each difference, we can multiply the difference for one variable by the corresponding difference for the second variable. As with the variance, if we want an average value of the combined differences for the two variables, we must divide by the number of observations (we actually divide by N - 1). This averaged sum of combined differences is known as the covariance:

$\mathrm{Cov}(x, y) = \frac{\sum (x_i - m_x)(y_i - m_y)}{N - 1}$

There is, however, one problem with covariance as a measure of the relationship between variables: it depends upon the scales of measurement used. So, covariance is not a standardized measure. To overcome the problem of dependence on the measurement scale, we need to convert the covariance into a standard set of units. This process is known as standardization. Therefore, we need a unit of measurement into which any scale of measurement can be converted. The unit of measurement we use is the standard deviation. The standardized covariance is known as the correlation coefficient:

$r = \frac{\mathrm{Cov}(x, y)}{s_x s_y} = \frac{\sum (x_i - m_x)(y_i - m_y)}{(N - 1)\, s_x s_y}$

which always lies between -1 and 1.

Remember, correlation doesn‘t necessarily imply causation.
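The scale dependence of covariance, and the way standardization removes it, can be seen in a small calculation. The Python sketch below uses invented data (hours studied and a test score recorded once in percent and once as a fraction); the function names are our own and the numbers are purely illustrative.

```python
from math import sqrt


def covariance(x, y):
    """Sample covariance: sum of cross-products of deviations over N - 1."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (n - 1)


def correlation(x, y):
    """Covariance divided by the product of the two standard deviations."""
    return covariance(x, y) / sqrt(covariance(x, x) * covariance(y, y))


# Hypothetical data: the same scores on two different scales.
hours = [1, 2, 3, 4, 5]
score_pct = [52, 55, 61, 70, 72]
score_frac = [s / 100 for s in score_pct]

print(covariance(hours, score_pct), covariance(hours, score_frac))
print(correlation(hours, score_pct), correlation(hours, score_frac))
```

Changing the unit of the score changes the covariance by a factor of 100, while the correlation coefficient is the same on both scales.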


Test of Hypotheses for Correlation

For pairs from an uncorrelated bivariate normal distribution, the sampling distribution of the studentized Pearson's correlation coefficient follows Student's t-distribution with n − 2 degrees of freedom. Specifically, if the underlying variables have a bivariate normal distribution, the variable

$t = r \sqrt{\frac{n - 2}{1 - r^2}}$

has a Student's t-distribution in the null case (zero correlation).

Correlations

                                                 Duration of   Professional   Salary in Dollar   Age of the
                                                 Education     Experience     per Hour           Person
Duration of Education       Pearson Correlation    1            -.308*          .115              -.238
                            Sig. (2-tailed)                      .017           .381               .067
Professional Experience     Pearson Correlation   -.308*         1              .121               .985**
                            Sig. (2-tailed)        .017                         .358               .000
Salary in Dollar per Hour   Pearson Correlation    .115          .121           1                  .180
                            Sig. (2-tailed)        .381          .358                              .169
Age of the Person           Pearson Correlation   -.238          .985**         .180               1
                            Sig. (2-tailed)        .067          .000           .169

*. Correlation is significant at the 0.05 level (2-tailed).
**. Correlation is significant at the 0.01 level (2-tailed).
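The t statistic behind a significance value in the correlation table can be sketched as follows. The coefficient r = -0.308 (Duration of Education versus Professional Experience) is taken from the table; the sample size n = 60 is an assumption made only for this illustration, because the table does not report it. The formula used is t = r·sqrt((n − 2)/(1 − r²)).

```python
from math import sqrt


def t_from_r(r, n):
    """t statistic for H0: zero correlation, with n - 2 degrees of freedom."""
    return r * sqrt((n - 2) / (1 - r ** 2))


r = -0.308    # Education vs Experience, from the correlation table
n = 60        # hypothetical sample size (assumption)

t_stat = t_from_r(r, n)
df = n - 2
print(f"t = {t_stat:.3f} with {df} degrees of freedom")
```

The two-tailed significance is then obtained by comparing this value with a Student's t distribution on n − 2 degrees of freedom.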

Partial Correlation

A correlation between two variables in which the effects of other variables are held constant is known as partial correlation. The partial correlation for 1 and 2 with controlling variable 3 is given by:

$r_{12.3} = \frac{r_{12} - r_{13}\, r_{23}}{\sqrt{1 - r_{13}^2}\, \sqrt{1 - r_{23}^2}}$

For example, we might find that the ordinary correlation between blood pressure and blood cholesterol is a strong positive correlation. We could potentially find a very small partial correlation between these two variables after we have taken into account the age of the subject. If this were the case, it might suggest that both variables are related to age, and that the observed correlation is only due to their common relationship with age.
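The partial correlation formula, r12.3 = (r12 − r13·r23) / (sqrt(1 − r13²)·sqrt(1 − r23²)), can be applied directly to pairwise coefficients. The Python sketch below uses the rounded values from the correlation table in this chapter, with Education as variable 1, Experience as variable 2 and Age as the controlling variable 3. Because the inputs are rounded to three decimals and the correlation between Experience and Age is very close to 1, the result should be read as approximate.

```python
from math import sqrt


def partial_correlation(r12, r13, r23):
    """Correlation between 1 and 2 with variable 3 held constant."""
    return (r12 - r13 * r23) / (sqrt(1 - r13 ** 2) * sqrt(1 - r23 ** 2))


r_edu_exp = -0.308   # Education vs Experience
r_edu_age = -0.238   # Education vs Age
r_exp_age = 0.985    # Experience vs Age

r_partial = partial_correlation(r_edu_exp, r_edu_age, r_exp_age)
print(f"partial correlation (Education, Experience | Age) = {r_partial:.3f}")
```

With these inputs the partial correlation is about -0.44, stronger than the ordinary correlation of -0.308, which shows that controlling for a third variable can change the apparent strength of a relationship.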


SAS IMPLEMENTATION

CORRELATION

/* Pairwise Pearson correlations among the four variables */
proc corr data=day1.correlation;
    var Education Experience Age Wage_dollars_per_hour_;
run;

proc corr is used to calculate the correlation between two or more quantitative variables. The var statement identifies the variables whose correlation coefficients are to be computed. This code generates a 4x4 correlation matrix. Each element in this matrix shows the correlation coefficient between two variables. Associated with each correlation coefficient is a p-value which shows the statistical significance of the correlation coefficient.

PARTIAL CORRELATION

/* Correlation of Education and Experience, controlling for Age */
proc corr data=day1.correlation;
    var Education Experience;
    partial Age;
run;

This code produces the correlation between the two variables Education and Experience. The partial statement is used to adjust the correlation coefficient between Education and Experience for the impact of the variable 'Age'. This adjustment is important for finding out the exact extent to which Education and Experience are correlated.

MATRIX PLOT

ods html;
ods graphics on;
ods select MatrixPlot;
proc corr data=day1.correlation plots=matrix;
    var Education Experience Wage_dollars_per_hour_ Age;
run;
ods graphics off;
ods html close;

For a matrix view of the correlations we first set the ods (Output Delivery System) destination to html and turn on the graphics mode. In proc corr, plots=matrix requests the scatter plot matrix, and the ods select MatrixPlot statement keeps only that plot in the output. The noprint option is not used here because in proc corr it suppresses all displayed output, including the output produced by ODS graphics. After running the code, we turn off the graphics mode and close the html destination.


CHI SQUARE TEST FOR INDEPENDENCE OF ATTRIBUTES

Here we are trying to find out whether there is any association between Frequency_of_Readership and Level_of_Educational_Achievement. This test is done with the procedure freq, and we request a chi-square test with the chisq option in the tables statement.

/* Two-way frequency table with a chi-square test of independence */
proc freq data=day1.chi;
    tables Frequency_of_Readership * Level_of_Educational_Achievement / chisq;
run;


Chapter 6

IMPORTANT TESTS OF STATISTICAL SIGNIFICANCE (PART II)

One Way ANOVA

A manager wants to raise the productivity at his company by increasing the speed at which his employees can use a particular spreadsheet program. As he does not have the skills in-house, he employs an external agency which provides training in this spreadsheet program. They offer 3 packages: a beginner, an intermediate and an advanced course. He is unsure which course is needed for the type of work they do at his company, so he sends 10 employees on the beginner course, 10 on the intermediate and 10 on the advanced course. When they all return from the training he gives them a problem to solve using the spreadsheet program and times how long it takes them to complete the problem. He then wishes to compare the three courses (beginner, intermediate, advanced) to see if there are any differences in the average time it took to complete the problem.

[Figure: Time by course (Beginner, Intermediate)]

Assumptions

The response variable is normally distributed (or approximately normally distributed)

Samples are independent

Variances of populations are equal

Responses for a given group are independent and identically distributed normal random variables

The hypotheses for the test are:

H0: The population means are equal

H1: At least one of the population means is different

The name 'One Way ANOVA' implies that the number of independent variables is one. Here the inter-group variation is the systematic variation and the intra-group variation is the unsystematic variation. The test checks whether the inter-group variation is significantly larger than the intra-group variation.
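The comparison of inter-group and intra-group variation can be made concrete with the F ratio. The Python sketch below computes it from first principles for three small groups; the completion times are invented (four employees per course rather than ten) and serve only to illustrate the arithmetic.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return the F ratio and its degrees of freedom for a one-way layout."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)

    # Inter-group (systematic) variation: group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Intra-group (unsystematic) variation: values around their group mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)

    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical completion times in minutes for each course.
beginner = [25, 27, 26, 28]
intermediate = [22, 23, 24, 23]
advanced = [20, 19, 21, 20]

f_stat, df_between, df_within = one_way_anova_f([beginner, intermediate, advanced])
print(f"F = {f_stat:.2f} with ({df_between}, {df_within}) degrees of freedom")
```

A large F ratio, as in this made-up example, means the variation between course means is large relative to the variation among employees within the same course.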

Two Way ANOVA

The two-way analysis of variance (ANOVA) test is an extension of the one-way ANOVA test that examines the influence of two categorical independent variables on one dependent variable. While the one-way ANOVA measures the significant effect of one independent variable (IV), the two-way ANOVA is used when there are two IVs and multiple observations for each combination of their levels. The two-way ANOVA not only determines the main effect of each IV but also identifies whether there is a significant interaction effect between the IVs.

Example

A researcher was interested in whether an individual's interest in politics was influenced by their level of education and their gender. They recruited a random sample of participants to their study and asked them about their interest in politics, which they scored from 0 - 100 with higher scores meaning a greater interest. The researcher then divided the participants by gender (Male/Female) and then again by level of education (School/College/University).

What is Interaction?

When gender and level of education are crossed, we get 6 different groups, namely Male School, Female School, Male College, Female College, Male University and Female University. Using two-way ANOVA, we are trying to understand whether any of these groups is significantly different from the rest, and whether the effect of one factor depends on the level of the other. A non-significant interaction does not by itself rule out significant main effects, so the main factors are still tested separately.

Assumptions

As with other parametric tests, we make the following assumptions when using two-way ANOVA:

The populations from which the samples are obtained must be normally distributed

Sampling is done correctly. Observations within and between groups must be independent

The variances among populations must be equal (homogeneity)

The dependent variable is measured on an interval scale and the factors are nominal

The hypotheses for the test are:

For each factor and for the interaction,

H0: Means of all groups are equal

H1: At least one group mean is significantly different

SAS IMPLEMENTATION

ONE-WAY ANOVA

We demonstrate one way anova through a case study. The case that we consider is that of three production plants: Maruti, Hyundai and Tata. The associated processing time of cars in each of these plants is mentioned along with them. The objective of the analyst is to find out whether there exists a significant difference between the mean processing times of the plants.

proc anova data=day1.anova;
  class plant;
  model processing_time=plant;
run;

'anova' is the procedure used for analysis of variance when the data is balanced. 'class' is the statement for specifying the grouping variable in the problem. In this case, the class variable is plant, which identifies the production plant of each company. The 'model' statement specifies the relationship to be tested between the dependent and the independent variable. The left-hand side of the equality is the dependent variable and the right-hand side is the independent variable. The code generates the following tables:

The first table shows the statistics associated with the overall goodness of the model. This table displays the variation across the groups (Model Mean Square) and within the groups (Error Mean Square). The F-statistic is calculated as the ratio of the explained variation in the model to the unexplained variation. The p-value rule is employed to check the significance of the F-value. The p-value for the F-statistic in this study is 0.1447, which is greater than the usual levels of significance (for example 0.05). Thus we fail to reject the null hypothesis: there is no significant difference in the mean processing time of cars in the three plants.

The second table gives the descriptive statistics corresponding to the variable mean_processing_time_of_plant. The mean processing times of the plants of the three companies are not significantly different from each other. One limitation of one-way anova is that it handles a single independent variable, so it cannot capture any interaction effect between independent variables. This limitation is addressed by two-way anova.
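The equal-variance assumption of one-way ANOVA can be checked on the same data before relying on the F-test. The following is a minimal sketch, assuming the same data set day1.anova with the variables plant and processing_time. It fits the model with proc glm and asks the means statement for Levene's test through the hovtest option; the choice of Levene's test is an assumption, not something prescribed in the text.

```sas
/* Sketch: one-way model with a homogeneity-of-variance check.       */
/* Assumes day1.anova contains plant (group) and processing_time.    */
proc glm data=day1.anova;
  class plant;
  model processing_time = plant;
  /* Levene's test of equal variances across the plants */
  means plant / hovtest=levene;
run;
quit;
```

A small p-value in the Levene table would suggest that the variances differ across plants, in which case the ANOVA F-test should be read with caution.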

TWO-WAY ANOVA

A survey studied the weight gained by men because of two factors, namely the amount of food consumed and the type or nature of the diet. Ten representative men were randomly selected and each of them was fed each type of diet in the two specified diet amounts (i.e. "High" and "Low" respectively). The weight gained by the men was measured in grams. There are three variables with a total of 60 observations.


The numeric variable Weight Gain denotes the weight gained by the men. Separate samples of pre- and post-treatment weight are not taken; rather, a single sample of actual weight gain is considered. The variable Diet Amount denotes the amount of diet. It is a categorical variable recording two responses: 1 for 'High' and 2 for 'Low' amounts of diet. Also, the variable Diet Type denotes the type of diet consumed, which is also a categorical variable. It records three responses: 1 for vegetarian diet, 2 for non-vegetarian diet and 3 for a mixed diet. The objective of the study is to locate the factors which most significantly affect the weight gain in individuals. The code for two-way anova is:

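proc glm data=day1.twowayanova;
  class Diet_Amount Diet_type;
  /* main effects of both factors plus their interaction */
  model Weight_gain=Diet_Amount Diet_type Diet_Amount*Diet_type;
  /* Tukey comparisons of the main-effect means */
  means Diet_amount Diet_type/tukey;
run;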

This can also be done using proc anova. But proc anova works well only when the data is balanced, i.e. the interaction groups are equal in size. Also, we are more interested in the Type III sum of squares, which proc glm reports. So we prefer proc glm over proc anova.
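If the interaction term turns out to be significant, it helps to compare the six amount-by-type cells directly. The following is a sketch under the same assumptions as the code above (data set day1.twowayanova with Weight_gain, Diet_Amount and Diet_type). The ss3 option limits the output to the Type III sum of squares, and the lsmeans statement with Tukey adjustment is one possible way to compare the cell means; it is an illustration, not the only option.

```sas
/* Sketch: Type III tests plus pairwise comparison of the cell means. */
proc glm data=day1.twowayanova;
  class Diet_Amount Diet_type;
  /* ss3 prints only the Type III sum of squares */
  model Weight_gain = Diet_Amount Diet_type Diet_Amount*Diet_type / ss3;
  /* least-squares means of the 2 x 3 cells, Tukey-adjusted p-values */
  lsmeans Diet_Amount*Diet_type / pdiff adjust=tukey;
run;
quit;
```

The pdiff table lists an adjusted p-value for every pair of cells, which shows where the differences lie.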


Chapter 7

EXPLORATORY FACTOR ANALYSIS

Suppose we are interested in consumers' evaluation of a brand of coffee. We take a random sample of consumers who were given a cup of coffee. They were not told which brand of coffee they were given. After they had drunk the coffee, they were asked to rate it on 14 semantic differential scales. The 14 attributes which were investigated are shown below:

1. Pleasant flavor / Unpleasant flavor

2. Stagnant, muggy taste / Sparkling, refreshing taste

3. Mellow taste / Bitter taste

4. Cheap taste / Expensive taste

5. Comforting, harmonious / Irritating, discordant

6. Smooth, friendly taste / Rough, hostile taste

7. Alive, lively, peppy taste

8. Tastes artificial / Tastes like real coffee

9. Deep distinct flavor / Shallow indistinct flavor

10. Tastes warmed over / Tastes just brewed

11. Hearty, full-bodied, full flavor / Warm, thin empty flavor

12. Pure, clear taste / Muddy, swampy taste

13. Raw taste / Stale taste

14. Overall preference: Excellent quality / Very poor quality

A factor analysis of the ratings given by consumers indicated that four factors could summarize the 14 attributes. These factors were: comforting quality, heartiness, genuineness and freshness.

Factor: Attributes

A. Comforting Quality: 1. Pleasant flavor; 3. Mellow taste; 5. Comforting taste; 12. Pure, clear taste

B. Heartiness: 9. Deep distinct flavor; 11. Hearty, full-bodied, full flavor

C. Genuineness: 2. Sparkling taste; 4. Expensive taste; 6. Smooth, friendly taste; 7. Alive, lively, peppy taste; 8. Tastes like real coffee; 14. Overall preference

D. Freshness: 10. Tastes just brewed; 13. Raw taste

Here we are only exploring the factors, but we cannot confirm whether these are the only factors, hence the name Exploratory Factor Analysis.


Principal Component Analysis

Principal component analysis was developed by Pearson and adapted for factor analysis by Hotelling. A goal for the user of PCA is to summarize the interrelationships among a set of original variables in terms of a smaller set of uncorrelated principal components that are linear combinations of the original variables.

Estimating The Initial Communalities

PCA assumes that there is as much variance to be analyzed as the number of observed variables and that all of the variance in an item can be explained by the extracted factors. Communality means the variance that the items and factors share in common.

Eigenvalues and Eigen Vectors

PCA has been described as Eigen analysis, or seeking the solution to the characteristic equation of the correlation matrix. An Eigen value represents the amount of variance in all of the items that can be explained by a given principal component or factor. An Eigen vector of a correlation matrix is a column of weights.

Is Factor Analysis Feasible?

Correlation Matrix Check: Is it a combination of high and low correlations?

KMO MSA Check: The Kaiser Meyer Olkin Measure of Sampling Adequacy tests whether the partial correlations among variables are small.

Bartlett's Test of Sphericity: It tests whether the correlation matrix is an identity matrix, which could indicate that the factor model is inappropriate.

To obtain the loadings of a principal component, each of the weights of an Eigen vector is multiplied by the square root of the principal component's associated Eigen value. These newly generated weights are called factor loadings and represent the correlation of each item with the given principal component.
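The relation between Eigen vectors, Eigen values and factor loadings can be written compactly. The notation is an assumption made for this sketch: v_{ij} is the weight of item i in Eigen vector j, \lambda_j is the associated Eigen value and \ell_{ij} is the loading. The numerical illustration uses a hypothetical correlation of 0.6 between two standardized variables (the matrix needs the amsmath package).

```latex
\[
\ell_{ij} = v_{ij}\sqrt{\lambda_j}
\]
% Illustration: two standardized variables with a hypothetical correlation of 0.6
\[
R = \begin{pmatrix} 1 & 0.6 \\ 0.6 & 1 \end{pmatrix}, \quad
\lambda_1 = 1.6,\ \lambda_2 = 0.4, \quad
v_1 = \tfrac{1}{\sqrt{2}}(1,\,1)^{\top}
\]
\[
\ell_{11} = \ell_{21} = \tfrac{1}{\sqrt{2}}\sqrt{1.6} \approx 0.894
\]
```

Each variable therefore correlates about 0.894 with the first component, which accounts for 0.894^2 = 0.8 of its variance. The second component accounts for the remaining 0.2, so with all components retained the communality of each item is 1.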

Deciding The Number of Factors

A Priori Criterion: Number of Factors to extract is pre-decided

Eigen Value Criterion:

Min Eigen Criterion: We decide the floor of Eigen value. If the floor is 0.6 and there are 3 Eigen values above that mark, then we are looking for 3 factors.

Proportional and Cumulative Variance: We consider how much information is explained by an individual factor and in aggregate by the selected factors.

Scree Plot: This is basically a graphical presentation of the Eigen values (and hence of the proportional variance) against the number of factors
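The proportional and cumulative variance follow directly from the Eigen values. In PCA on a correlation matrix the total variance equals the number of variables p, so the shares can be sketched as below. The symbols p, m and \lambda_j are notation assumed for this illustration (\text needs the amsmath package).

```latex
\[
\text{proportion}_j = \frac{\lambda_j}{p}, \qquad
\text{cumulative}_m = \frac{1}{p}\sum_{j=1}^{m} \lambda_j
\]
```

For example, with p = 13 variables and a hypothetical first Eigen value of 3.9, the first factor would explain 3.9 / 13 = 0.30 of the total variance.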

So, PCA explains the entire variance and EFA explains a part of it. In EFA we are basically trying to explain the common variance among the variables.

[Figure: Scree Plot, Eigen Value against Number of Factors]

Factor Analysis is an Interdependence technique. In interdependence techniques the variables are not classified as dependent or independent; rather, the whole set of interdependence relationships is examined.

Initially, the weights are distributed across all the variables, so it is not possible to understand the underlying factor of one or more variables. To remove this problem, we apply rotation to the axes. We mainly deal with two types of rotation:

Orthogonal Rotation: Varimax

Oblique Rotation: Promax

The problem with oblique rotation is that it makes the factors correlated. Varimax rotation is used in principal component analysis so that the axes are rotated to a position in which the sum of the variances of the squared loadings is the maximum possible.

SAS IMPLEMENTATION

EXPLORATORY FACTOR ANALYSIS

Here we are concerned with the underlying factors of employee satisfaction. The name of the data set is employee_satisfaction. Let's first look at the variables in the data set.

proc contents data=day1.employee_satisfaction position short;
run;
/*Employee Organization_competitive_place Like_ppl_I_work_with
Job_allows_learn_newthngs Paid_more_than_others I_like_work_culture
Frnds_have_heard_of_this_comp Co_looks_good_on_resume Can_work_from_home
cutting_edge_work_done good_perks_incentives Good_pension_plan
I_never_worked_on_weekend Paid_well_for_my_work*/

So apart from the variable employee which is basically the identification of the employee, all the variables contribute to the satisfaction of the employee. Using factor analysis we are going to find out the underlying factors of the employee satisfaction and see which variable belongs to which factor.

But first we have to see whether factor analysis is feasible or not.

proc factor data=day1.employee_satisfaction corr msa scree;
  var Organization_competitive_place Like_ppl_I_work_with
      Job_allows_learn_newthngs Paid_more_than_others I_like_work_culture
      Frnds_have_heard_of_this_comp Co_looks_good_on_resume Can_work_from_home
      cutting_edge_work_done good_perks_incentives Good_pension_plan
      I_never_worked_on_weekend Paid_well_for_my_work;
run;

The corr option in the proc factor statement prints the correlation matrix of the variables mentioned in the var statement. If the correlations between the variables are very near to zero (say within +/- 0.2), then the variables are practically uncorrelated, so they themselves are the factors. The msa option produces the KMO MSA check. The scree option produces a scree plot.

Now suppose we want to produce 4 factors. Then we set the value of n to 4.

proc factor data=day1.employee_satisfaction corr msa scree n = 4 rotate = varimax;
  var Organization_competitive_place Like_ppl_I_work_with
      Job_allows_learn_newthngs Paid_more_than_others I_like_work_culture
      Frnds_have_heard_of_this_comp Co_looks_good_on_resume Can_work_from_home
      cutting_edge_work_done good_perks_incentives Good_pension_plan
      I_never_worked_on_weekend Paid_well_for_my_work;
run;


The rotate option specifies the type of rotation that we give. Here we have assigned Varimax rotation.
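For comparison, the same extraction can be run with the oblique Promax rotation. The following is a sketch only, assuming the same data set and variable list; the choice of 4 factors is carried over from the example above rather than derived from the data.

```sas
/* Sketch: 4 factors with an oblique (Promax) rotation instead of Varimax. */
proc factor data=day1.employee_satisfaction n=4 rotate=promax;
  var Organization_competitive_place Like_ppl_I_work_with
      Job_allows_learn_newthngs Paid_more_than_others I_like_work_culture
      Frnds_have_heard_of_this_comp Co_looks_good_on_resume Can_work_from_home
      cutting_edge_work_done good_perks_incentives Good_pension_plan
      I_never_worked_on_weekend Paid_well_for_my_work;
run;
```

With an oblique rotation the output also reports the correlations between the factors. If these are close to zero, the Varimax solution is usually adequate.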

If we want to calculate all the scoring coefficients, then we mention the option score.

proc factor data=day1.employee_satisfaction corr msa scree score mineigen = 0.5;
  var Organization_competitive_place Like_ppl_I_work_with
      Job_allows_learn_newthngs Paid_more_than_others I_like_work_culture
      Frnds_have_heard_of_this_comp Co_looks_good_on_resume Can_work_from_home
      cutting_edge_work_done good_perks_incentives Good_pension_plan
      I_never_worked_on_weekend Paid_well_for_my_work;
run;

The mineigen = 0.5 option implies we want to retain those factors only that have eigen values greater than 0.5.

For individual factor scores, we specify the option out = day1.factor_scores.

proc factor data=day1.employee_satisfaction n = 4 out = day1.factor_scores; var