Score Reliability: A Retrospective Look Back at 12 Years of Reliability Generalization Studies

Tammi Vacha-Haase and Bruce Thompson

Measurement and Evaluation in Counseling and Development, 44(3), 159–168. doi:10.1177/0748175611409845
Research in Brief
Abstract

The present study was conducted to characterize (a) the features of the thousands of primary reports synthesized in 47 reliability generalization (RG) measurement meta-analysis studies and (b) typical methodological practice within the RG literature to date. With respect to the treatment of score reliability in the literature, in an astounding 54.6% of the 12,994 primary reports authors did not even mention reliability! Furthermore, in 15.7% of the primary reports authors did mention score reliability, but merely inducted previously reported values as if they applied to their own data. Clearly, the admonitions of Wilkinson and the APA Task Force (1999) have yet to have their desired impacts with respect to reporting reliability estimates for one's own data.

Keywords

reliability, measurement, psychometrics, reliability generalization, meta-analysis

All the statistical analyses (e.g., t tests, ANOVA, ANCOVA, Pearson r, regression, as well as Hotelling's T², MANOVA, MANCOVA, descriptive discriminant analysis, canonical correlation analysis) within the general linear model (GLM; see Cohen, 1968; Knapp, 1978) are correlational, in that the implicit building block for these analyses is the computation of the intervariable correlation or covariance matrix. Indeed, secondary analyses of previously published results are easily performed given access to these matrices, even if the raw data are unavailable (Zientek & Thompson, 2009). However, poor score reliability will compromise estimates of both statistical significance (i.e., pCALCULATED values) and effect size within classical GLM analyses, because score reliabilities are not considered by the analyses. Instead, classical GLM analyses assume perfect or at least very good score reliabilities. Score reliability characterizes the degree to which scores measure something as opposed to nothing (e.g., are completely random).
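To illustrate why the correlation matrix is the implicit building block of GLM analyses, a secondary analysis in the spirit of Zientek and Thompson (2009) can be sketched: given only a published correlation matrix, standardized regression weights and R² are recoverable without the raw data. The matrix values below are hypothetical, invented for the sketch.

```python
import numpy as np

# Hypothetical published correlation matrix; variable order: X1, X2, Y.
R = np.array([
    [1.00, 0.30, 0.50],
    [0.30, 1.00, 0.40],
    [0.50, 0.40, 1.00],
])

Rxx = R[:2, :2]   # predictor intercorrelations
rxy = R[:2, 2]    # predictor-criterion correlations

# Standardized regression weights: beta = Rxx^{-1} rxy
beta = np.linalg.solve(Rxx, rxy)

# Proportion of criterion variance explained: R^2 = beta' rxy
r_squared = beta @ rxy

print(beta.round(3))        # approx [0.418, 0.275]
print(round(r_squared, 3))  # approx 0.319
```

The same trick generalizes to any GLM analysis whose sufficient statistics are the intervariable correlations, which is exactly why unreliable scores, by distorting those correlations, distort every downstream estimate.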
Random variations in data, including the random variations associated with measurement error, attenuate the relationships among measured variables. Such attenuation occurs because correlation coefficients are sensitive to systematic covariances among measured variables replicated over study participants, and not to random fluctuations. The fact that poor score reliability compromises the foundation of commonly applied statistical analyses suggests the obvious conclusion that evaluation of the score reliabilities for the scores in hand ought to be the obligatory first step in any quantitative study, prior to conducting any substantive analyses. In the words of the American Psychological Association (APA) Task Force on Statistical Inference,

It is important to remember that a test is not reliable or unreliable. Reliability is a property of the scores on a test for a particular population of examinees. . . . Thus, authors should provide reliability coefficients of the scores for the data being analyzed even when the focus of their research is not psychometric. (Wilkinson & APA Task Force, 1999, p. 596)

Given the importance of score reliability in all quantitative analyses, and the fluctuations in reliabilities across test administrations, ways to explore systematically the variabilities in reliabilities should be of special interest to researchers.

¹Colorado State University, Fort Collins, CO, USA
²Texas A&M University, College Station, TX, USA
³Baylor College of Medicine, Houston, TX, USA

Corresponding Author: Bruce Thompson, Dept. of Educ. Psyc., 4225 TAMU, College Station, TX 77843, USA. Email: bruce-thompson@tamu.edu

The dozen years or so since Vacha-Haase's (1998) conceptualization of RG have seen both RG-related methodology developments (e.g., Bonett, 2010; Rodriguez & Maeda, 2006) and an increasing number of published RG studies. Tutorials on how to do RG studies have been presented (Henson & Thompson, 2002), and recognition of RG has been international (Dandan & Houcan, 2004). To date, several dozen RG meta-analyses have been reported across an impressive array of measures. For example, RG studies have been conducted on literatures for measures involving state–trait anxiety (Barnes, Harp, & Jung, 2002), locus of control (Beretvas, Suizzo, Durham, & Yarnell, 2008), mathematics anxiety (Capraro, Capraro, & Henson, 2001), psychopathology (Campbell, Pulos, Hogan, & Murry, 2005), learning styles (Henson & Hwang, 2002), substance abuse propensities (Miller, Woodson, Howell, & Shields, 2009), ways of coping (Rexrode, Petersen, & O'Toole, 2008), and life satisfaction (Wallace & Wheeler, 2002).
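The attenuating effect of measurement error described earlier can be illustrated with a short simulation under the classical true-score model. The true-score correlation and the two reliabilities below are arbitrary choices for the sketch; the expected observed correlation follows Spearman's classical attenuation formula, r_observed = r_true * sqrt(rel_x * rel_y).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

rho_true = 0.60            # correlation between true scores (assumed)
rel_x, rel_y = 0.70, 0.80  # score reliabilities (assumed)

# Correlated true scores with unit variance
t = rng.multivariate_normal([0, 0], [[1, rho_true], [rho_true, 1]], size=n)

# Add error so that var(T)/var(X) equals the target reliability:
# with var(T) = 1, var(E) = (1 - rel) / rel.
x = t[:, 0] + rng.normal(0, np.sqrt((1 - rel_x) / rel_x), n)
y = t[:, 1] + rng.normal(0, np.sqrt((1 - rel_y) / rel_y), n)

r_observed = np.corrcoef(x, y)[0, 1]
r_expected = rho_true * np.sqrt(rel_x * rel_y)  # Spearman's attenuation formula

print(round(r_observed, 3), round(r_expected, 3))  # both approx 0.449
```

A true correlation of .60, measured with reliabilities of .70 and .80, is thus observed as roughly .45, which is exactly why classical GLM analyses that ignore reliability understate both effect sizes and statistical significance.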
the data analyzed in the studies, and he concluded that reliability ". . . is unreported in . . . [so much published research] is . . . inexcusable at this late date" (p. 9). Almost 20 years later, Vacha-Haase, Ness, Nilsson, and Reetz (1999) reviewed three journals and found that only 36% of the quantitative articles provided reliability coefficients for the data being analyzed. One reason for the poor treatment of psychometric issues within the social sciences literature is that "[a]lthough most programs in sociobehavioral sciences, especially doctoral programs, require a modicum of exposure to statistics and research design, few seem to require the same where measurement is concerned" (Pedhazur & Schmelkin, 1991, p. 2). Unfortunately, doctoral curricula in recent years have allocated less and less space for psychometric training (Capraro & Thompson, 2008), so practices may not have improved since the time of Willson's (1980) report.

Our mega meta-analysis (see Vacha-Haase, Henson, & Caruso, 2002) of the 47 RG studies provides a contemporary assessment of the degree to which authors of primary reports are attending to score reliability issues. The RG studies each synthesized an average of 342.0 (SD = 494.8) prior studies. Thus, the present study characterizes a huge array of original studies in diverse areas of the social sciences. This first research focus included consideration of how often primary researchers ignored reliability, inducted prior reliability estimates for measures rather than reporting reliability for their own scores (see Vacha-Haase, Kogan, & Thompson, 2000), or reported reliability for the data actually being analyzed in their substantive studies. We also sought to characterize the typical score reliabilities reported in the primary reports summarized in RG studies and the variability of these reliabilities.

Typical practice within the RG literature.
In addition to characterizing the quality of the primary reports synthesized in the 47 RG studies with respect to score reliability, we also sought, second, to characterize typical methodological practice within the RG literature to date. For example, we were interested in the ways that RG researchers identified source studies, the types of statistical and graphical analyses reported, the types of predictor variables used to predict variabilities in score reliabilities, and which predictors were or were not generally found to be useful in making these predictions.
RG studies on average investigated 8.5 (SD = 4.0) predictor variables in these analyses. The most commonly used predictor variables included gender (83.3% of the 47 RG studies), sample size (68.8%), age in years (54.2%), and ethnicity (52.1%). The predictors that, when used, tended to be noteworthy included the number of items for measures that had forms of different lengths (31.2%) and the score standard deviation in the primary report (29.2%). Both of these results are psychometrically reasonable. Scores from measures with more items tend to be more reliable, especially when more items result in more dispersed total test scores, because total test score dispersion drives score reliability so strongly (Reinhardt, 1996; Thompson, 2003). Of course, scores from longer tests do not inherently have higher reliability if the test is made longer by adding items of poor quality or items that do not increase total score dispersion. For example, the Bem Sex Role Inventory is a published test for which short form scores on the Femininity scale tend to yield higher reliabilities than their long form counterparts (Bem, 1981). Two other predictors, participant age (22.9%) and participant gender (22.9%), also tended to be among the better predictors of variabilities in score reliabilities.
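The relation between test length and score reliability noted above is conventionally quantified by the Spearman–Brown prophecy formula, which assumes the added (or deleted) items are parallel to the originals, precisely the assumption that fails when poor-quality items are added. A minimal sketch:

```python
def spearman_brown(rel: float, length_factor: float) -> float:
    """Projected score reliability when a test is lengthened (or shortened)
    by `length_factor`, assuming the added items are parallel to the
    originals: rel' = k*rel / (1 + (k - 1)*rel)."""
    return (length_factor * rel) / (1 + (length_factor - 1) * rel)

# Doubling a test whose scores have alpha = .60
print(round(spearman_brown(0.60, 2.0), 3))  # 0.75
# Halving the same test
print(round(spearman_brown(0.60, 0.5), 3))  # 0.429
```

The asymmetry is instructive: doubling a mediocre test buys a sizable gain only under the parallel-items assumption, which is why the number-of-items predictor in RG analyses behaves reasonably but not mechanically.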
RG studies were based on an average of 64.5 (SD = 52.4) primary reports in which authors reported reliability coefficients for their own data. In only eight of the RG studies did the researchers contact primary authors in an attempt to obtain information missing from the primary reports. Because some RG studies involved multiple related measures, multiple subscale scores from a single measure, reliability coefficients being reported for subgroups, or multiple administrations of measures, a given RG study often involved multiple reliability coefficients. RG studies on average involved 240.0 (SD = 755.6) reliability estimates. The average of the mean coefficient alpha values reported across the RG studies was .80 (SD = .09) and ranged from .45 to .95. The smallest mean alpha reported in the RG studies was .17, and the largest mean alpha was .92. However, some of these values were for subscales on measures rather than for total scores. And it must be remembered that coefficient alpha and other coefficients, such as stability reliability coefficients, measure quite different things and thus tend to vary even for the same measure (McCrae, Kurtz, Yamagata, & Terracciano, 2011).
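For reference, coefficient alpha, the estimate reported in most of the primary studies synthesized by RG researchers, can be computed directly from a persons-by-items score matrix. The item responses below are invented for illustration; note that, as the text cautions, alpha and stability coefficients estimate different things, so this is only one of the coefficient types an RG study may need to keep separate.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Coefficient alpha for an (n_persons x k_items) score matrix:
    alpha = k/(k-1) * (1 - sum(item variances) / var(total score))."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars / total_var)

# Toy data: 5 respondents x 4 hypothetical Likert items
scores = np.array([
    [4, 5, 4, 5],
    [2, 2, 3, 2],
    [3, 4, 3, 3],
    [5, 4, 5, 4],
    [1, 2, 1, 2],
])
print(round(cronbach_alpha(scores), 3))  # 0.952
```

Because the formula depends on total-score variance, the same instrument can yield very different alphas across samples with different score dispersions, which is the core phenomenon RG studies set out to model.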
Discussion
RG studies provide some insight about both the score reliabilities produced by given measures across samples and typical reliability reporting practices within the literature. With respect to the first outcome, authors of RG studies must work to avoid certain pitfalls (see Dimitrov, 2002). For example, RG investigators should take into account the use of (a) different types of reliability estimates across studies and (b) different test forms, especially when forms have different numbers of items. A particularly difficult challenge for RG researchers involves "the RG modeling misspecifications that occur when relevant characteristics of the study samples are not coded as independent variables in RG analysis" (Dimitrov, 2002, p. 794). These model misspecifications may occur because original reports often do not provide enough detail about the measurement and sampling designs being used. A related problem is that substantive researchers who report score reliability coefficients for their own data most often report only Cronbach's alpha, notwithstanding the limitations of this estimate and the fact that the measurement model underlying the estimate may not fit many of the situations in which it is used (see Dimitrov, 2002). Hogan, Benjamin, and Brezinski's (2000) empirical study of the literature found that two thirds of the articles they examined reported alpha, and they also noted that "despite their prominence in the psychometric literature of the past 20 years, we encountered no reference to generalizability coefficients . . . or to the test information functions that arise from item response theory" (p. 528). These problems limit the potential benefits of RG studies. Of course, as Thompson and Vacha-Haase (2000) reminded,

It is important to remember that RG studies are a meta-analytic characterization of what is hoped is a population of previous reports. We may not like the ingredients that go into making this sausage, but the RG chef can only work with the ingredients provided by the literature. (p. 184)
The admonitions of Wilkinson and the APA Task Force (1999) have yet to have their desired impacts. We believe this disturbing reality is an artifact of too many applied researchers still believing that tests qua tests themselves have the property of reliability. This misconception may not be conscious, but it is all the more pernicious when unconscious, because unconscious misperceptions may be less likely to be reconsidered and corrected.

The problem of sloppy speaking about reliability, in which tests are described as being reliable, is not just an issue of sloppy speaking; the problem is that sometimes we unconsciously come to think what we say or what we hear, so that sloppy speaking does sometimes lead to a more pernicious outcome, sloppy thinking and sloppy practice. (Thompson, 1992, p. 436)

Some textbooks directly confront the misconception that tests are reliable. For example, Pedhazur and Schmelkin (1991) noted, "Statements about the reliability of a measure are . . . [inherently] inappropriate and potentially misleading" (p. 82). Similarly, Gronlund and Linn (1990) emphasized that

reliability refers to the results obtained with an evaluation instrument and not to the instrument itself. . . . Thus, it is more appropriate to speak of the reliability of the test scores or the measurement than of the test or the instrument. (p. 78)

More recently, Urbina (2004) emphasized that "the fact is that the quality of reliability is one that, if present, belongs not to tests but to test scores" (p. 119). She perceptively noted that the distinction between scores versus tests being reliable is subtle, but that the distinction

is fundamental to an understanding of the implications of the concept of reliability with regard to the use of tests and the interpretation of test scores. If a test is described as reliable, the implication is that its reliability has been established permanently, in all respects for all uses, and with all users. (p. 120)

Urbina uses a piano analogy to illustrate the fallacy of describing tests as reliable, noting that saying "the test is reliable" is similar to stating that a piano will always sound the same, regardless of the type of music played, the person who is playing it, the type of piano, or the surrounding acoustical environment.

Our second major finding is that score reliability on average in the applied research literature appears reasonably sufficient to support inquiry using classical GLM statistics, given a mean coefficient alpha of .80 (SD = .09) and a range from .45 to .95. These results suggest a glass-half-full, glass-half-empty conclusion about the quality of our literature with respect to score reliability. Clearly, some substantive studies are being conducted with scores of questionable reliability. Furthermore, we must wonder what the reliabilities were of the scores in those studies in which reliabilities were not reported for the data in hand, or in which reliability was not even mentioned!

Sample and administration features may cause reliabilities to fluctuate across test administrations. Over time, we expect the quality of RG studies to improve further as more and more primary reports include estimates of score reliabilities and as more researchers realize that tests are not reliable. Indeed, the most important impact of the creation of RG and the reporting of RG findings is that these reports in themselves directly confront chronic misconceptions that tests are reliable. RG studies in and of themselves communicate the important understanding that score reliabilities vary across administrations and are not secreted into test booklets during the test printing process.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
References
Note: The 47 RG studies included in this study are marked with asterisks.

*Bachner, Y. G., & O'Rourke, N. (2007). Reliability generalization of responses by care providers to the Zarit Burden Interview. Aging & Mental Health, 11, 678–685. doi:10.1080/13607860701529965
*Barnes, L. L. B., Harp, D., & Jung, W. S. (2002). Reliability generalization of scores on the Spielberger State–Trait Anxiety Inventory. Educational and Psychological Measurement, 62, 603–618. doi:10.1177/0013164402062004005
Bem, S. L. (1981). Bem Sex-Role Inventory: Professional manual. Palo Alto, CA: Consulting Psychologists Press.
*Beretvas, S. N., Meyers, J. L., & Leite, W. L. (2002). A reliability generalization study of the Marlowe–Crowne Social Desirability Scale. Educational and Psychological Measurement, 62, 570–589. doi:10.1177/0013164402062004003
Dandan, G., & Houcan, Z. (2004). A redefinition of reliability and the study of reliability generalization. Psychological Science (China), 27, 445–448.
*Deditius-Island, H. K., & Caruso, J. C. (2002). An examination of the reliability of scores from Zuckerman's Sensation Seeking Scales, Form V. Educational and Psychological Measurement, 62, 728–734. doi:10.1177/0013164402062004012
Dimitrov, D. M. (2002). Reliability: Arguments for multiple perspectives and potential problems with generalizability across studies. Educational and Psychological Measurement, 62, 783–801. doi:10.1177/001316402236878
*Dunn, T. W., Smith, T. B., & Montoya, J. A. (2006). Multicultural competency instrumentation: A review and analysis of reliability generalization. Journal of Counseling & Development, 84, 471–482.
*Graham, J. M., & Christiansen, K. (2009). The reliability of romantic love: A reliability generalization meta-analysis. Personal Relationships, 16, 49–66. doi:10.1111/j.1475-6811.2009.01209.x
*Graham, J. M., Liu, Y. J., & Jeziorski, J. L. (2006). The Dyadic Adjustment Scale: A reliability generalization meta-analysis. Journal of Marriage and Family, 68, 701–717. doi:10.1111/j.1741-3737.2006.00284.x
Gronlund, N. E., & Linn, R. L. (1990). Measurement and evaluation in teaching (6th ed.). New York, NY: Macmillan.
*Hanson, W. E., Curry, K. T., & Bandalos, D. L. (2002). Reliability generalization of Working Alliance Inventory scale scores. Educational and Psychological Measurement, 62, 659–673. doi:10.1177/0013164402062004008
*Hellman, C. M., Fuqua, D. R., & Worley, J. (2006). A reliability generalization study on the Survey of Perceived Organizational Support: The effects of mean age and number of items on score reliability. Educational and Psychological Measurement, 66, 631–642. doi:10.1177/0013164406288158
*Hellman, C. M., Muilenburg-Trevino, E. M., & Worley, J. A. (2008). The belief in a just world: An examination of reliability estimates across three measures. Journal of Personality Assessment, 90, 399–401. doi:10.1080/00223890802108238
*Henson, R. K., & Hwang, D.-Y. (2002). Variability and prediction of measurement error in Kolb's Learning Style Inventory scores: A reliability generalization study. Educational and Psychological Measurement, 62, 712–727. doi:10.1177/0013164402062004011
*Henson, R. K., Kogan, L. R., & Vacha-Haase, T. (2001). A reliability generalization study of the Teacher Efficacy Scale and related instruments. Educational and Psychological Measurement, 61, 404–420. doi:10.1177/00131640121971284
Henson, R. K., & Thompson, B. (2002). Characterizing measurement error in scores across studies: Some recommendations for conducting reliability generalization (RG) studies. Measurement and Evaluation in Counseling and Development, 35, 113–127.
Hogan, T. P., Benjamin, A., & Brezinski, K. L. (2000). Reliability methods: A note on the frequency of use of various types. Educational and Psychological Measurement, 60, 523–531.
Hunter, J. E., & Schmidt, F. L. (1990). Methods of meta-analysis: Correcting error and bias in research findings. Newbury Park, CA: Sage.
*Huynh, Q.-L., Howell, R. T., & Benet-Martinez, V. (2009). Reliability of bidimensional acculturation scores: A meta-analysis. Journal of Cross-Cultural Psychology, 40, 256–274. doi:10.1177/0022022108328919
*Kieffer, K. M., Cronin, C., & Fister, M. C. (2004). Exploring variability and sources of measurement error in Alcohol Expectancy Questionnaire reliability coefficients: A meta-analytic reliability generalization study. Journal of Studies on Alcohol, 65, 663–671.
*Kieffer, K. M., & Reese, R. J. (2002). A reliability generalization study of the Geriatric Depression Scale. Educational and Psychological Measurement, 62, 969–994. doi:10.1177/0013164402238085
Knapp, T. R. (1978). Canonical correlation analysis: A general parametric significance testing system. Psychological Bulletin, 85, 410–416. doi:10.1037//0033-2909.85.2.410
*Lane, G. G., White, A. E., & Henson, R. K. (2002). Expanding reliability generalization methods with KR-21 estimates: An RG study of the Coopersmith Self-Esteem Inventory. Educational and Psychological Measurement.
*Shields, A. L., & Caruso, J. C. (2003). Reliability generalization of the Alcohol Use Disorders Identification Test. Educational and Psychological Measurement, 63, 404–413. doi:10.1177/0013164403063003004
*Shields, A. L., & Caruso, J. C. (2004). A reliability induction and reliability generalization study of the CAGE Questionnaire. Educational and Psychological Measurement, 64, 254–270. doi:10.1177/0013164403261814
Thompson, B. (1992). Two and one-half decades of leadership in measurement and evaluation. Journal of Counseling and Development, 70, 434–438.
Thompson, B. (1994). Guidelines for authors. Educational and Psychological Measurement, 54, 837–847.
Thompson, B. (Ed.). (2003). Score reliability: Contemporary thinking on reliability issues. Thousand Oaks, CA: Sage.
*Thompson, B., & Cook, C. (2002). Stability of the reliability of LibQUAL+™ scores: A reliability generalization meta-analysis study. Educational and Psychological Measurement, 62, 735–743. doi:10.1177/0013164402062004013
Thompson, B., & Vacha-Haase, T. (2000). Psychometrics is datametrics: The test is not reliable. Educational and Psychological Measurement, 60, 174–195. doi:10.1177/00131640021970448
Urbina, S. (2004). Essentials of psychological testing. Hoboken, NJ: John Wiley.
Vacha-Haase, T. (1998). Reliability generalization: Exploring variance in measurement error affecting score reliability across studies. Educational and Psychological Measurement, 58, 6–20. doi:10.1177/00131640121971059
Vacha-Haase, T., Henson, R. K., & Caruso, J. C. (2002). Reliability generalization: Moving toward improved understanding and use of score reliability. Educational and Psychological Measurement, 62, 562–569. doi:10.1177/0013164402062004002
*Vacha-Haase, T., Kogan, L. R., Tani, C. R., & Woodall, R. A. (2001). Reliability generalization: Exploring variation of reliability coefficients of MMPI clinical scales scores. Educational and Psychological Measurement, 61, 45–59. doi:10.1177/00131640121971059
Vacha-Haase, T., Kogan, L. R., & Thompson, B. (2000). Sample compositions and variabilities in published studies versus those in test manuals: Validity of score reliability inductions. Educational and Psychological Measurement, 60, 509–522. doi:10.1177/00131640021970682
Vacha-Haase, T., Ness, C. M., Nilsson, J., & Reetz, D. (1999). Practices regarding reporting of reliability coefficients: A review of three journals. Journal of Experimental Education, 67, 335–341. doi:10.1080/00220979909598487
*Vacha-Haase, T., Tani, C. R., Kogan, L. R., Woodall, R. A., & Thompson, B. (2001). Reliability generalization: Exploring reliability variations on MMPI/MMPI-2 validity scale scores. Assessment, 8, 391–401. doi:10.1177/107319110100800404
*Vassar, M., & Crosby, J. W. (2008). A reliability generalization study of coefficient alpha for the UCLA Loneliness Scale. Journal of Personality Assessment, 90, 601–607. doi:10.1080/00223890802388624
*Victorson, D., Barocas, J., Song, J., & Cella, D. (2008). Reliability across studies from the Functional Assessment of Cancer Therapy–General (FACT-G) and its subscales: A reliability generalization. Quality of Life Research: An International Journal of Quality of Life Aspects of Treatment, Care & Rehabilitation, 17, 1137–1146. doi:10.1007/s11136-008-9398-2
*Wallace, K. A., & Wheeler, A. J. (2002). Reliability generalization of the Life Satisfaction Index. Educational and Psychological Measurement, 62, 674–684. doi:10.1177/0013164402062004009
Wilkinson, L., & American Psychological Association (APA) Task Force on Statistical Inference. (1999). Statistical methods in psychology journals: Guidelines and explanations. American Psychologist, 54, 594–604.
Bios
Tammi Vacha-Haase is a professor of psychology at Colorado State University. Bruce Thompson is a distinguished professor of educational psychology, and of library science, at Texas A&M University, and adjunct professor of allied health sciences, Baylor College of Medicine (Houston).