
Encyclopedia of Research Design

Multiple Regression

Contributors: Chris Segrin
Editors: Neil J. Salkind
Book Title: Encyclopedia of Research Design
Chapter Title: "Multiple Regression"
Pub. Date: 2010
Publishing Company: SAGE Publications, Inc.
City: Thousand Oaks
Print ISBN: 9781412961271
Online ISBN: 9781412961288
DOI: http://dx.doi.org/10.4135/9781412961288.n253
Print pages: 845-850


Multiple regression is a general and flexible statistical method for analyzing associations between two or more independent variables and a single dependent variable. As a general statistical technique, multiple regression can be employed to predict values of a particular variable based on knowledge of its association with known values of other variables, and it can be used to test scientific hypotheses about whether and to what extent certain independent variables explain variation in a dependent variable of interest. As a flexible statistical method, multiple regression can be used to test associations among continuous as well as categorical variables, and it can be used to test associations between individual independent variables and a dependent variable, as well as interactions among multiple independent variables and a dependent variable. In this entry, different approaches to the use of multiple regression are presented, along with explanations of the more commonly used statistics in multiple regression, methods of conducting multiple regression analysis, and the assumptions of multiple regression.

Approaches to Using Multiple Regression

Prediction


One common application of multiple regression is for predicting values of a particular dependent variable based on knowledge of its association with certain independent variables. In this context, the independent variables are commonly referred to as predictor variables and the dependent variable is characterized as the criterion variable. In applied settings, it is often desirable to be able to predict a score on a criterion variable by using information that is available in certain predictor variables. For example, in the life insurance industry, actuarial scientists use complex regression models to predict, on the basis of certain predictor variables, how long a person will live. In scholastic settings, college and university admissions offices will use predictors such as high school grade point average (GPA) and ACT scores to predict an applicant's college GPA, even before he or she has entered the university.


Multiple regression is most commonly used to predict values of a criterion variable based on linear associations with predictor variables. A brief example using simple regression easily illustrates how this works. Assume that a horticulturist developed a new hybrid maple tree that grows exactly 2 feet for every year that it is alive. If the height of the tree was the criterion variable and the age of the tree was the predictor variable, one could accurately describe the relationship between the age and height of the tree with the formula for a straight line, which is also the formula for a simple regression equation:
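Y = bX + a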

where Y is the value of the dependent, or criterion, variable; X is the value of the independent, or predictor, variable; b is a regression coefficient that describes the slope of the line; and a is the Y intercept. The Y intercept is the value of Y when X is 0. Returning to the hybrid tree example, the exact relationship between the tree's age and height could be described as follows:
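Y = 2X + 0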

Notice that the Y intercept is 0 in this case because at 0 years of age, the tree has 0 height. At that point, it is just a seed in the ground. It is clear how knowledge of the relationship between the tree's age and height could be used to easily predict the height of any given tree by just knowing what its age is. A 5-year-old tree will be 10 feet tall, an 8-year-old tree will be 16 feet tall, and so on. At this point two important issues must be considered. First, virtually any time one is working with variables from people, animals, plants, and so forth, there are no perfect linear associations. Some students with high ACT scores do poorly in college whereas some students with low ACT scores do well in college. This shows that there can always be some error when one uses regression to predict values on a criterion variable. The stronger the association between the predictor and criterion variables, the less error there will be in that prediction. Accordingly, regression is based on the line of best fit, which is simply the line that best describes or captures the relationship between X and Y by minimizing the extent to which any data points fall off that line.


A college admissions committee wants to be able to predict the graduating GPA of the students whom they admit. The ACT score is useful for this, but as noted above, it does not have a perfect association with college GPA, so there is some error in that prediction. This is where multiple regression becomes very useful. By taking into account the association of additional predictor variables with college GPA, one can further minimize the error in predicting college GPA. For example, the admissions committee might also collect information on high school GPA and use that in conjunction with the ACT score to predict college GPA. In this case, the regression equation would be
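\hat{Y} = b_1 X_1 + b_2 X_2 + a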

where Ŷ is the predicted value of Y (college GPA), X1 and X2 are the values of the predictor variables (ACT score and high school GPA), b1 and b2 are the regression coefficients by which X1 and X2 are multiplied to get Ŷ, and a is the intercept (i.e., the value of Y when X1 and X2 are both 0). In this particular case, the intercept serves only an arithmetic function, as it has no practical interpretability because having a high school GPA of 0 and an ACT score of 0 is meaningless.
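To make the prediction workflow concrete, here is a minimal sketch in Python, assuming a small set of hypothetical applicant records (the numbers are invented for illustration): ordinary least squares estimates the intercept a and the coefficients b1 and b2, and the fitted equation is then used to predict the college GPA of a new applicant.

```python
import numpy as np

# Hypothetical admissions records: ACT score (X1), high school GPA (X2),
# and the college GPA each student eventually earned (Y).
act = np.array([21, 24, 28, 30, 33, 25, 27, 22, 31, 29], dtype=float)
hs_gpa = np.array([3.0, 3.2, 3.6, 3.8, 4.0, 3.1, 3.5, 2.9, 3.9, 3.4])
college_gpa = np.array([2.7, 2.9, 3.4, 3.5, 3.9, 3.0, 3.3, 2.6, 3.7, 3.2])

# Design matrix with a leading column of 1s so the intercept a is estimated.
X = np.column_stack([np.ones(len(act)), act, hs_gpa])

# Ordinary least squares solves for a, b1, b2 in Y-hat = a + b1*X1 + b2*X2.
(a, b1, b2), *_ = np.linalg.lstsq(X, college_gpa, rcond=None)

# Predict the graduating GPA of a new applicant with ACT = 26 and HS GPA = 3.4.
print(round(a + b1 * 26 + b2 * 3.4, 2))
```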

Explanation
In social scientific contexts, multiple regression is rarely used to predict unknown values on a criterion variable. In social scientific research, values of the independent and dependent variables are almost always known. In such cases multiple regression is used to test whether and to what extent the independent variables explain the dependent variable. Most often the researcher has theories and hypotheses that specify causal relations among the independent variables and the dependent variable. Multiple regression is a useful tool for testing such hypotheses. For example, suppose an economist is interested in testing a hypothesis about the determinants of workers' salaries. The model being tested could be depicted as follows:
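Family-of-origin SES → Years of formal education → Annual salary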

where SES stands for socioeconomic status. In this simple model, the economist hypothesizes that the SES of one's family of origin will influence how much formal education one acquires, which in turn will predict one's salary. If the economist collected data on these three variables from a sample of workers, the hypotheses could be tested with a multiple regression model that is comparable to the one presented previously in the college GPA example:
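Y = b_1 X_1 + b_2 X_2 + a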

In this case, what was Ŷ is now Y because the value of Y is known. It is useful to deconstruct the components of this equation to show how they can be used to test various aspects of the economist's model. In the equation above, b1 and b2 are the partial regression coefficients. They are the weights by which one multiplies the values of X1 and X2 when all variables are in the equation. In other words, they represent the expected change in Y per unit of X when all other variables are accounted for, or held constant. Computationally, the values of b1 and b2 can be determined easily by simply knowing the zero-order correlations among all possible pairwise combinations of Y, X1, and X2, as well as the standard deviations of the three variables:
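b_1 = \frac{r_{YX_1} - r_{YX_2} r_{X_1 X_2}}{1 - r_{X_1 X_2}^2} \cdot \frac{s_Y}{s_{X_1}}, \qquad b_2 = \frac{r_{YX_2} - r_{YX_1} r_{X_1 X_2}}{1 - r_{X_1 X_2}^2} \cdot \frac{s_Y}{s_{X_2}}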

where r_YX1 is the Pearson correlation between Y and X1, r_X1X2 is the Pearson correlation between X1 and X2, and so on, and s_Y is the standard deviation of variable Y, s_X1 is the standard deviation of X1, and so on. The partial regression coefficients are also referred to as unstandardized regression coefficients because they represent the value by which one would multiply the raw X1 or X2 score in order to arrive at Y. In the salary example, these coefficients could look something like this:
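Y = 745.67 X_1 + 104.36 X_2 + 11{,}325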


This means that subjects' annual salaries are best described by an equation whereby their family of origin SES is multiplied by 745.67, their years of formal education are multiplied by 104.36, and these products are added to 11,325. Notice how the regression coefficient for SES is much larger than that for years of formal education. Although it might be tempting to assume that family of origin SES is weighted more heavily than years of formal education, this would not necessarily be correct. The magnitude of an unstandardized regression coefficient is strongly influenced by the units of measurement used to assess the independent variable with which it is associated. In this example, assume that SES is measured on a 5-point scale (Levels 1-5) and that years of formal education, at least in the sample, run from 7 to 20. These differing scale ranges have a profound effect on the magnitude of each regression coefficient, rendering them incomparable. However, it is often the case that researchers want to understand the relative importance of each independent variable for explaining variation in the dependent variable. In other words, which is the more powerful determinant of people's salaries, their education or their family of origin's socioeconomic status? This question can be evaluated by examining the standardized regression coefficient, or β. Computationally, β1 and β2 can be determined by the following formulas:
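\beta_1 = \frac{r_{YX_1} - r_{YX_2} r_{X_1 X_2}}{1 - r_{X_1 X_2}^2}, \qquad \beta_2 = \frac{r_{YX_2} - r_{YX_1} r_{X_1 X_2}}{1 - r_{X_1 X_2}^2}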

The components of these formulas are identical to those for the unstandardized regression coefficients, but they lack multiplication by the ratio of the standard deviations of Y and X1 or X2. Incidentally, one can easily convert β to b with the following formulas, which illustrate their relationship:
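b_1 = \beta_1 \frac{s_Y}{s_{X_1}}, \qquad b_2 = \beta_2 \frac{s_Y}{s_{X_2}}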


where s_Y is the standard deviation of variable Y, and so on. Standardized regression coefficients can be thought of as the weight by which one would multiply the standardized score (or z score) of each independent variable in order to arrive at the z score for the dependent variable. Because z scores essentially place all variables on the same scale, researchers are inclined to make comparisons about the relative impact of each independent variable by comparing their associated standardized regression coefficients, sometimes called beta weights. In the economist's hypothesized model of workers' salaries, there are several subhypotheses or research questions that can be evaluated. For example, the model presumes that both family of origin SES and education will exert a causal influence on annual salary. One can get a sense of which variable has a greater impact on salary by comparing their beta weights. However, it is also important to ask whether either of the independent variables is a significant predictor of salary. In effect, these tests ask whether each independent variable explains a statistically significant portion of the variance in the dependent variable, independent of that explained by the other independent variable(s) also in the regression equation. This can be accomplished by dividing β by its standard error (SE_β). This ratio is distributed as t with degrees of freedom = n − k − 1, where n is the sample size and k is the number of independent variables in the regression analysis. Stated more formally,
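t = \frac{\beta}{SE_{\beta}}, \quad df = n - k - 1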

If this ratio is significant, that implies that the particular independent variable uniquely explains a statistically significant portion of the variance in the dependent variable. These t tests of the statistical significance of each independent variable are routinely provided by computer programs that conduct multiple regression analyses. They play an important role in testing hypotheses about the role of each independent variable in explaining the dependent variable.
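As a rough illustration of that output, the sketch below fits the salary model with the Python statsmodels library on hypothetical, randomly generated data patterned loosely on the example above (the generating values are assumptions, not values from this entry). The reported t values and p values are the per-predictor significance tests just described, and refitting the model on z-scored variables gives the beta weights.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data patterned on the salary example: family-of-origin SES
# (Levels 1-5), years of formal education (7-20), and annual salary.
rng = np.random.default_rng(0)
n = 200
ses = rng.integers(1, 6, size=n).astype(float)
education = rng.integers(7, 21, size=n).astype(float)
salary = 11325 + 745.67 * ses + 104.36 * education + rng.normal(0, 3000, size=n)

# Unstandardized solution: intercept a, then b1 (SES) and b2 (education).
X = sm.add_constant(np.column_stack([ses, education]))
fit = sm.OLS(salary, X).fit()
print(fit.params)    # a, b1, b2
print(fit.tvalues)   # t ratio for each coefficient
print(fit.pvalues)   # significance of each predictor

# Standardized solution: z-score every variable, so the slopes are beta weights.
z = lambda v: (v - v.mean()) / v.std(ddof=1)
Xz = sm.add_constant(np.column_stack([z(ses), z(education)]))
print(sm.OLS(z(salary), Xz).fit().params[1:])   # betas for SES and education
```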


In addition to concerns about the statistical significance and relative importance of each independent variable for explaining the dependent variable, it is important to understand the collective function of the independent variables for explaining the dependent variable. In this case, the question is whether the independent variables collectively explain a significant portion of the variance in scores on the dependent variable. This question is evaluated with the multiple correlation coefficient. Just as a simple bivariate correlation is represented by r, the multiple correlation coefficient is represented by R. In most contexts, data analysts prefer to use R² to understand the association between the independent variables and the dependent variable. This is because the squared multiple correlation coefficient can be thought of as the percentage of variance in the dependent variable that is collectively explained by the independent variables. So, an R² value of .65 implies that 65% of the variance in the dependent variable is explained by the combination of independent variables. In the case of two independent variables, the formula for the squared multiple correlation coefficient can be expressed as a function of the various pairwise correlations among the independent and dependent variables:

R^2 = \frac{r_{YX_1}^2 + r_{YX_2}^2 - 2 r_{YX_1} r_{YX_2} r_{X_1 X_2}}{1 - r_{X_1 X_2}^2}

In cases with more than two independent variables, this formula becomes much more complex, requiring the use of matrix algebra. In such cases, calculation of R² is ordinarily left to a computer. The question of whether the collection of independent variables explains a statistically significant amount of variance in the dependent variable can be approached by testing the multiple correlation coefficient for statistical significance. The test can be carried out by the following formula:

F = \frac{R^2 (n - k - 1)}{k (1 - R^2)}

This test is distributed as F with df = k in the numerator and n − k − 1 in the denominator, where n is the sample size and k is the number of independent variables. Two important features of the test for significance of the multiple correlation coefficient require discussion. First, notice how the sample size, n, appears in the numerator. This implies that, all other things held constant, the larger the sample size, the larger the F ratio will be. That means that the statistical significance of the multiple correlation coefficient is more probable as the sample size increases. Second, the amount of variation in the dependent variable that is not explained by the independent variables, indexed by 1 − R² (this is called error or residual variance), is multiplied by the number of independent variables, k. This implies that, all other things held equal, the larger the number of independent variables, the larger the denominator, and hence the smaller the F ratio. This illustrates how there is something of a penalty for using a lot of independent variables in a regression analysis. When trying to explain scores on a dependent variable, such as salary, it might be tempting to use a large number of predictors so as to take into account as many possible causal factors as possible. However, as this formula shows, this significance test favors parsimonious models that use only a few key predictor variables.
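As a quick numerical sketch of these two formulas, assuming made-up correlations and a made-up sample size, R² and its F ratio can be computed directly:

```python
from scipy.stats import f

# Assumed (illustrative) pairwise correlations and sample size.
r_y1, r_y2, r_12 = 0.45, 0.55, 0.30   # r(Y,X1), r(Y,X2), r(X1,X2)
n, k = 120, 2                          # sample size, number of predictors

# Squared multiple correlation from the two-predictor formula.
r2 = (r_y1**2 + r_y2**2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12**2)

# F test of R-squared with k and n - k - 1 degrees of freedom.
f_ratio = (r2 * (n - k - 1)) / (k * (1 - r2))
p_value = f.sf(f_ratio, k, n - k - 1)
print(round(r2, 3), round(f_ratio, 2), round(p_value, 4))
```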

Methods of Variable Entry


Computer programs used for multiple regression provide several options for the order of entry of each independent variable into the regression equation. The order of entry can make a difference in the results obtained and therefore becomes an important analytic consideration. In hierarchical regression, the data analyst specifies a particular order of entry of the independent variables, usually in separate steps for each. Although there are multiple possible logics by which one would specify a particular order of entry, perhaps the most common is that of causal priority. Ordinarily, one would enter independent variables in order from the most distal to the most proximal causes. In the previous example of the workers' salaries, a hierarchical regression analysis would enter family of origin SES into the equation first, followed by years of formal education. As a general rule, in hierarchical entry, an independent variable entered into the equation later should never be the cause of an independent variable entered into the equation earlier. Naturally, hierarchical regression analysis is facilitated by having a priori theories and hypotheses that specify a particular order of causal priority.

Another method of entry that is based purely on empirical rather than theoretical considerations is stepwise entry. In this case, the data analyst specifies the full complement of potential independent variables to the computer program and allows it to enter or not enter these variables into the regression equation, based on the strength of their unique association with the dependent variable. The program keeps entering independent variables up to the point at which addition of any further variables would no longer explain any statistically significant increment of variance in the dependent variable. Stepwise analysis is often used when the researcher has a large collection of independent variables and little theory to explain or guide their ordering or even their role in explaining the dependent variable. Because stepwise regression analysis capitalizes on chance and relies on a post hoc rationale, its use is often discouraged in social scientific contexts.
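A hierarchical entry of this kind can be sketched as two nested model fits, with the increment in R² at the second step tested against the residual variance of the full model; the data and generating values below are hypothetical and only illustrate the mechanics.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data: SES (entered first, the distal cause) and years of
# education (entered second, the proximal cause) predicting salary.
rng = np.random.default_rng(7)
n = 150
ses = rng.integers(1, 6, size=n).astype(float)
education = 8 + 2 * ses + rng.normal(0, 2, size=n)        # SES influences education
salary = 11325 + 745.67 * ses + 104.36 * education + rng.normal(0, 3000, size=n)

# Step 1: SES only.  Step 2: SES plus years of formal education.
step1 = sm.OLS(salary, sm.add_constant(ses)).fit()
step2 = sm.OLS(salary, sm.add_constant(np.column_stack([ses, education]))).fit()

# Test whether the R-squared increment at step 2 is statistically significant.
r2_change = step2.rsquared - step1.rsquared
f_change = (r2_change / 1) / ((1 - step2.rsquared) / (n - 2 - 1))
print(round(step1.rsquared, 3), round(step2.rsquared, 3), round(f_change, 2))
```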

Assumptions of Multiple Regression


Multiple regression is most appropriately used as a data analytic tool when certain assumptions about the data are met. First, the data should be collected through independent random sampling. Independent means that the data provided by one participant must be entirely unrelated to the data provided by another participant. Cases in which husbands and wives, college roommates, or doctors and their patients both provide data would violate this assumption. Second, multiple regression analysis assumes that there are linear relationships between the independent variables and the dependent variable. When this is not the case, a more complex version of multiple regression known as nonlinear regression must be employed. A third assumption of multiple regression is that at each possible value of each independent variable, the dependent variable must be normally distributed. However, multiple regression is reasonably robust in the case of modest violations of this assumption. Finally, for each possible value of each independent variable, the variance of the residuals or errors in predicting Y (i.e., Ŷ − Y) must be consistent. This is known as the homoscedasticity assumption. Returning to the workers' salaries example, it would be important that at each level of family-of-origin SES (Levels 1-5), the degree of error in predicting workers' salaries was comparable. If the salary predicted by the regression equation was within $5,000 for everyone at Level 1 SES, but it was within $36,000 for everyone at Level 3 SES, the homoscedasticity assumption would be violated because there is far greater variability in residuals at the higher compared with the lower SES levels.
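One informal way to probe this assumption, sketched below with hypothetical data, is to fit the model and compare the spread of the residuals at each SES level; roughly equal spreads are consistent with homoscedasticity, while sharply unequal spreads are not.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical salary data with SES measured on a 5-point scale.
rng = np.random.default_rng(3)
n = 250
ses = rng.integers(1, 6, size=n).astype(float)
education = rng.integers(7, 21, size=n).astype(float)
salary = 11325 + 745.67 * ses + 104.36 * education + rng.normal(0, 3000, size=n)

# Fit the regression and examine the spread of the residuals at each SES level.
fit = sm.OLS(salary, sm.add_constant(np.column_stack([ses, education]))).fit()
for level in range(1, 6):
    spread = fit.resid[ses == level].std(ddof=1)
    print(level, round(float(spread)))   # similar spreads across levels suggest homoscedasticity
```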


When this happens, the validity of significance tests in multiple regression becomes compromised.

Chris Segrin

Further Readings

Aiken, L. S., & West, S. G. (1991). Multiple regression: Testing and interpreting interactions. Newbury Park, CA: Sage.

Allison, P. D. (1999). Multiple regression: A primer. Thousand Oaks, CA: Sage.

Berry, W. D. (1993). Understanding regression assumptions. Thousand Oaks, CA: Sage.

Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences (3rd ed.). Hillsdale, NJ: Lawrence Erlbaum.

Pedhazur, E. J. (1997). Multiple regression in behavioral research (3rd ed.). New York: Wadsworth.

