Research in Brief

Score Reliability: A Retrospective Look Back at 12 Years of Reliability Generalization Studies


Tammi Vacha-Haase1 and Bruce Thompson2,3

Measurement and Evaluation in Counseling and Development, 44(3), 159–168. © The Author(s) 2011. DOI: 10.1177/0748175611409845. http://mecd.sagepub.com

Abstract

The present study was conducted to characterize (a) the features of the thousands of primary reports synthesized in 47 reliability generalization (RG) measurement meta-analysis studies and (b) typical methodological practice within the RG literature to date. With respect to the treatment of score reliability in the literature, in an astounding 54.6% of the 12,994 primary reports the authors did not even mention reliability! Furthermore, in 15.7% of the primary reports the authors did mention score reliability, but merely inducted previously reported values as if they applied to their data. Clearly, the admonitions of Wilkinson and the APA Task Force (1999) have yet to have their desired impacts with respect to reporting reliability estimates for one's own data.

Keywords: reliability, measurement, psychometrics, reliability generalization, meta-analysis

1 Colorado State University, Fort Collins, CO, USA
2 Texas A&M University, College Station, TX, USA
3 Baylor College of Medicine, Houston, TX, USA

Corresponding Author: Bruce Thompson, Department of Educational Psychology, 4225 TAMU, College Station, TX 77843, USA. Email: bruce-thompson@tamu.edu

All the statistical analyses (e.g., t tests, ANOVA, ANCOVA, Pearson r, and regression, as well as T², MANOVA, MANCOVA, descriptive discriminant analysis, and canonical correlation analysis) within the general linear model (GLM; see Cohen, 1968; Knapp, 1978) are correlational, in that the implicit building block for these analyses is the computation of the intervariable correlation or covariance matrix. Indeed, secondary analyses of previously published results are easily performed given access to these matrices, even if the raw data are unavailable (Zientek & Thompson, 2009). However, poor score reliability will compromise estimates of both statistical significance (i.e., p_CALCULATED values) and effect size within classical GLM analyses, because score reliabilities are not considered by these analyses. Instead, classical GLM analyses assume perfect, or at least very good, score reliabilities.

Score reliability characterizes the degree to which scores measure something, as opposed to nothing (i.e., scores that are completely random). Random variations in data, including the random variations associated with measurement error, attenuate the relationships among measured variables. Such attenuation occurs because correlation coefficients are sensitive to systematic covariances among measured variables replicated over study participants, and not to random fluctuations.

The fact that poor score reliability compromises the foundation of commonly applied statistical analyses suggests the obvious conclusion that evaluation of the reliabilities of the scores in hand ought to be the obligatory first step in any quantitative study, prior to conducting any substantive analyses. In the words of the American Psychological Association (APA) Task Force on Statistical Inference:

It is important to remember that a test is not reliable or unreliable. Reliability is a property of the scores on a test for a particular population of examinees. . . . Thus, authors should provide reliability coefficients of the scores for the data being analyzed even when the focus of their research is not psychometric. (Wilkinson & APA Task Force, 1999, p. 596)

Given the importance of score reliability in all quantitative analyses, and the fluctuations in reliabilities across test administrations, ways to explore systematically the variabilities in reliabilities should be of special interest to researchers.
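The attenuation just described has a classical closed form, Spearman's correction for attenuation; the formula below is standard psychometric background rather than a derivation from this article:

$$ r_{XY} \;=\; \rho_{T_X T_Y} \sqrt{r_{XX'}\; r_{YY'}} $$

where $\rho_{T_X T_Y}$ is the correlation between true scores and $r_{XX'}$ and $r_{YY'}$ are the score reliabilities of the two measures. For example, a true correlation of .50 measured with score reliabilities of .70 and .80 yields an expected observed correlation of only $.50\sqrt{.70 \times .80} \approx .37$.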

Reliability Generalization (RG) Meta-Analysis


Twelve years ago, in a seminal article, Vacha-Haase (1998) proposed RG as an extension of another measurement meta-analytic method, validity generalization, which was developed by Schmidt and Hunter (1977) and Hunter and Schmidt (1990). Vacha-Haase (1998) described RG as a method to characterize empirically "(a) the typical reliability of scores for a given test across studies, (b) the amount of variability in reliability coefficients for given measures, and (c) the sources of variability in reliability coefficients across studies" (p. 6).

Reliability generalization is built on the recognition that it is incorrect to speak of "the reliability of the test," or to say that "the test is reliable" (Thompson, 1994). Reliability inures as a property to scores, and not to tests (Thompson & Vacha-Haase, 2000). Thus, reliability coefficients fluctuate across test administrations, and these fluctuations are ripe for meta-analytic investigation.

The dozen or so years since Vacha-Haase's (1998) conceptualization of RG have seen both RG-related methodological developments (e.g., Bonnett, 2010; Rodriguez & Maeda, 2006) and an increasing number of published RG studies. Tutorials on how to conduct RG studies have been presented (Henson & Thompson, 2002), and recognition of RG has been international (Dandan & Houcan, 2004). To date, several dozen RG meta-analyses have been reported across an impressive array of measures. For example, RG studies have been conducted on literatures for measures involving state–trait anxiety (Barnes, Harp, & Jung, 2002), locus of control (Beretvas, Suizzo, Durham, & Yarnell, 2008), mathematics anxiety (Capraro, Capraro, & Henson, 2001), psychopathology (Campbell, Pulos, Hogan, & Murry, 2005), learning styles (Henson & Hwang, 2002), substance abuse propensities (Miller, Woodson, Howell, & Shields, 2009), ways of coping (Rexrode, Petersen, & O'Toole, 2008), and life satisfaction (Wallace & Wheeler, 2002).
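To make the three characterizations concrete, here is a minimal sketch of the descriptive core of an RG analysis. The alpha values, sample sizes, and score standard deviations below are entirely hypothetical, and real RG studies involve far more careful harvesting and coding of the literature:

```python
# Minimal reliability generalization (RG) sketch, using hypothetical data:
# (a) typical reliability, (b) its variability, and (c) predictors of that
# variability across test administrations harvested from the literature.
import numpy as np

alphas   = np.array([0.78, 0.84, 0.91, 0.69, 0.88, 0.80, 0.74, 0.86])
n_sizes  = np.array([120.0, 340.0, 85.0, 60.0, 410.0, 150.0, 95.0, 220.0])
score_sd = np.array([9.1, 11.4, 12.8, 7.9, 12.1, 10.0, 8.5, 11.0])

# (a) and (b): the typical coefficient and its spread across studies.
print(f"mean alpha = {alphas.mean():.3f}, SD = {alphas.std(ddof=1):.3f}")

# (c): ordinary least squares regression of the alphas on study features,
# the analysis most RG reports in this synthesis actually used.
X = np.column_stack([np.ones_like(alphas), n_sizes, score_sd])
coefs, *_ = np.linalg.lstsq(X, alphas, rcond=None)
print("intercept, sample-size weight, score-SD weight:", np.round(coefs, 4))
```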

Purposes of the Present Article


The present article reports a secondary analysis of the 47 RG studies presented in journal articles during the past 12 years. We identified these 47 RG studies by searching PsycINFO and ERIC for any use of the term "reliability generalization" in the title, abstract, or keywords. The source RG reports are designated with asterisks in our references. We conducted our study for two broad purposes.

Quality of the social sciences literature. Our first purpose was to characterize the quality of the social sciences literature with respect to score reliability considerations, as reflected in the primary reports synthesized in the 47 RG studies. Similar older analyses provide some historical context for our more contemporary report. In an examination of the American Educational Research Journal (AERJ), Willson (1980) reported that only 37% of AERJ articles explicitly provided reliability coefficients for the data analyzed in the studies, and he concluded that reliability ". . . is unreported in . . . [so much published research] is . . . inexcusable at this late date" (p. 9). Almost 20 years later, Vacha-Haase, Ness, Nilsson, and Reetz (1999) reviewed three journals and found that only 36% of the quantitative articles provided reliability coefficients for the data being analyzed.

One reason for poor treatment of psychometric issues within the social sciences literature is that "[a]lthough most programs in sociobehavioral sciences, especially doctoral programs, require a modicum of exposure to statistics and research design, few seem to require the same where measurement is concerned" (Pedhazur & Schmelkin, 1991, p. 2). Unfortunately, doctoral curricula in recent years have allocated less and less space to psychometric training (Capraro & Thompson, 2008), so practices may not have improved since the time of Willson's (1980) report.

Our "mega" meta-analysis (see Vacha-Haase, Henson, & Caruso, 2002) of the 47 RG studies provides a contemporary assessment of the degree to which authors of primary reports are attending to score reliability issues. The RG studies each synthesized an average of 342.0 (SD = 494.8) prior studies. Thus, the present study characterizes a huge array of original studies in diverse areas of the social sciences. This first research focus included consideration of how often primary researchers ignored reliability, inducted prior reliabilities for measures rather than reporting reliability for their own scores (see Vacha-Haase, Kogan, & Thompson, 2000), or reported reliability for the data actually being analyzed in their substantive studies. We also sought to characterize the typical score reliabilities reported in the primary reports summarized in RG studies, and the variability of these reliabilities.

Typical practice within the RG literature. In addition to characterizing the quality of the primary reports synthesized in the 47 RG studies with respect to score reliability, we also sought, second, to characterize typical methodological practice within the RG literature to date. For example, we were interested in the ways that RG researchers identified source studies, the types of statistical and graphical analyses reported, the types of predictor variables used to predict variabilities in score reliabilities, and which predictors were or were not generally found to be useful in making these predictions.

Results

Quality of the Literature With Respect to Score Reliability


Across the 47 studies, on average, literature searches for instrument uses yielded 814.1 hits (SD = 1,195.4). However, many of these turned out to be theoretical or nonempirical studies, or studies in which the target measure was mentioned but not administered. On average, each RG study involved 342.0 (SD = 494.9) empirical studies in which the target measure was administered.

In an astounding 54.6% of the 12,994 primary reports, the authors did not even mention reliability! This is a discouraging finding with respect to the integrity of such a broad array of substantive studies, especially because most of these reports used classical GLM methods. Although structural equation modeling (SEM) does estimate measurement error variance as part of substantive analyses, as noted previously, classical GLM methods (e.g., ANOVA, regression, descriptive discriminant analysis) do not estimate measurement error variances as part of their substantive analyses (see Yetkiner & Thompson, in press). Clearly, the admonitions of Wilkinson and the APA Task Force (1999) have yet to have their desired impacts with respect to reporting reliability estimates for one's own data.

In 15.7% of the 12,994 primary reports, the authors did mention score reliability but merely inducted previously reported values as if they applied to their data. When this was done, in 48.0% of the inductions only the test manual was referenced as the source of the induction, whereas in the remaining cases the manual and/or prior articles were referenced.
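The practical consequence of ignoring measurement error in classical GLM analyses can be made concrete with a small simulation; the parameters here are entirely made up, and the sketch is standard psychometric background rather than an analysis from the article:

```python
# Simulation (hypothetical parameters): regression slopes are attenuated when
# predictor scores are unreliable, because classical GLM analyses treat the
# observed scores as if they were error free.
import numpy as np

rng = np.random.default_rng(0)
n, beta, rel = 100_000, 0.50, 0.64     # sample size, true slope, reliability

T = rng.normal(0.0, 1.0, n)            # true scores (variance 1)
e_var = (1.0 - rel) / rel              # error variance yielding reliability rel
X = T + rng.normal(0.0, np.sqrt(e_var), n)   # observed, unreliable scores
Y = beta * T + rng.normal(0.0, 1.0, n)       # criterion driven by true scores

slope = np.polyfit(X, Y, 1)[0]         # classical OLS ignores the error in X
print(f"true slope = {beta}, estimated = {slope:.3f}, beta * rel = {beta * rel}")
```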


RG studies were based on an average of 64.5 (SD = 52.4) primary reports in which the authors reported reliability coefficients for their own data. In only eight of the RG studies did the researchers contact primary authors in an attempt to obtain information missing from the primary reports.

Because some RG studies involved multiple related measures, multiple subscale scores from a single measure, reliability coefficients reported for subgroups, or multiple administrations of measures, a given RG study often involved multiple reliability coefficients. RG studies on average involved 240.0 (SD = 755.6) reliability estimates. The average of the mean coefficient alpha values reported across the RG studies was .80 (SD = .09) and ranged from .45 to .95. The smallest mean alpha reported in the RG studies was .17, and the largest mean alpha was .92. However, some of these values were for subscales of measures rather than for total scores. And it must be remembered that coefficient alpha and other coefficients, such as stability reliability coefficients, measure quite different things and thus tend to vary even for the same measure (McCrae, Kurtz, Yamagata, & Terracciano, 2011).
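Because coefficient alpha is the estimate these syntheses most often aggregate, a minimal sketch of its computation may be useful; the item data below are simulated and hypothetical:

```python
# Coefficient alpha (hypothetical, simulated item data):
# alpha = k/(k-1) * (1 - sum of item variances / variance of total scores).
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: an (n_respondents, k_items) matrix of item scores."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)       # per-item variances
    total_var = items.sum(axis=1).var(ddof=1)   # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

rng = np.random.default_rng(1)
trait = rng.normal(size=(200, 1))                       # shared true score
items = trait + rng.normal(scale=1.0, size=(200, 6))    # 6 noisy items
print(f"alpha = {cronbach_alpha(items):.2f}")           # roughly .86 here
```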

Typical Practice Within the RG Literature


Diverse statistical and graphical methods were used across the 47 RG meta-analyses, and RG researchers frequently used multiple analyses to understand and characterize their RG data. A majority (i.e., 54.2%) of the 47 RG reports used multiple regression as an analysis, whereas 27.1% of the reports used ANOVA. Some 6.2% of the RG researchers used hierarchical linear modeling to honor the fact that in some studies subscales were nested within measures, or several reliability coefficients were nested within single primary reports. Box-and-whisker plots were used in 35.4% of the 47 RG studies.

RG researchers typically investigate which features of the primary reports may predict variabilities in score reliabilities. The RG studies on average investigated 8.5 (SD = 4.0) predictor variables in these analyses. The most commonly used predictor variables included gender (83.3% of the 47 RG studies), sample size (68.8%), age in years (54.2%), and ethnicity (52.1%).

The predictors that, when used, tended to be noteworthy included the number of items for measures that had forms of different lengths (31.2%) and the score standard deviation in the primary report (29.2%). Both of these results are psychometrically reasonable. Scores from measures with more items tend to be more reliable, especially when more items result in more dispersed total test scores, because total test score dispersion drives score reliability so strongly (Reinhardt, 1996; Thompson, 2003). Of course, scores from longer tests do not inherently have higher reliability if the test is made longer by adding items of poor quality or items that do not increase total score dispersion. For example, the Bem Sex Role Inventory is a published test for which the short form Femininity scale scores tend to be more reliable than their long form counterparts (Bem, 1981). Two other predictors also tended to be noteworthy in predicting variabilities in score reliabilities: participant age (22.9%) and participant gender (22.9%) were among the better predictors.
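The link between test length and score reliability invoked here is classically formalized by the Spearman–Brown prophecy formula; this is textbook background, not a formula used in the RG reports themselves:

$$ \rho_{kk} \;=\; \frac{k\,\rho_{11}}{1 + (k - 1)\,\rho_{11}} $$

where $\rho_{11}$ is the reliability of scores on the original-length test and $k$ is the factor by which the test is lengthened. Doubling a test whose scores have $\alpha = .60$ projects a reliability of $(2 \times .60)/(1 + .60) = .75$, but the projection presumes the added items are parallel to the originals, which is exactly the assumption violated when poor-quality items are added.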

Discussion
RG studies provide some insight about both the score reliabilities produced by given measures across samples and typical reliability reporting practices within the literature. With respect to the first outcome, authors of RG studies must work to avoid certain pitfalls (see Dimitrov, 2002). For example, RG investigators should take into account the use of (a) different types of reliability estimates across studies and (b) different test forms, especially when forms have different numbers of items. A particularly difficult challenge for RG researchers involves the RG modeling misspecifications that occur "when relevant characteristics of the study samples are not coded as independent variables in RG analysis" (Dimitrov, 2002, p. 794). These model misspecifications may occur because original reports often do not provide enough detail about the measurement and sampling designs being used.

A related problem is that substantive researchers who report score reliability coefficients for their own data most often report only Cronbach's alpha, notwithstanding the limitations of this estimate and the fact that the measurement model underlying the estimate may not fit many of the situations in which the estimate is used (see Dimitrov, 2002). Hogan, Benjamin, and Brezinski's (2000) empirical study of the literature found that two thirds of the articles they examined reported alpha, and the authors also noted that "despite their prominence in the psychometric literature of the past 20 years, we encountered no reference to generalizability coefficients . . . or to the test information functions that arise from item response theory" (p. 528).
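The measurement-model caveat about alpha can be stated precisely; the following is a standard classical test theory result, supplied as background rather than drawn from the article itself. When item measurement errors are uncorrelated,

$$ \alpha \;\le\; \rho_{XX'} $$

with equality holding only when the items are essentially tau-equivalent (equal true-score loadings). For congeneric items with unequal loadings, alpha understates score reliability; when errors are correlated, alpha can instead overstate it.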

These problems limit the potential benefits of RG studies. Of course, as Thompson and Vacha-Haase (2000) reminded:

It is important to remember that RG studies are a meta-analytic characterization of what is hoped is a population of previous reports. We may not like the ingredients that go into making this sausage, but the RG chef can only work with the ingredients provided by the literature. (p. 184)

Score Reliability Within the Social Sciences Literature

Our most important finding is that an astonishingly large proportion (i.e., a little more than half) of primary substantive studies did not even mention score reliability! This is a discouraging finding with respect to the integrity of such a broad array of primary studies, especially because most of these studies used classical GLM analyses. Clearly, the admonitions of Wilkinson and the APA Task Force (1999) have yet to have their desired impacts.

We believe this disturbing reality is an artifact of too many applied researchers still believing that tests qua tests themselves have the property of reliability. This misconception may not be conscious, but it is all the more pernicious when unconscious, because unconscious misperceptions may be less likely to be reconsidered and corrected. The problem of sloppy speaking about reliability, in which tests are described as being reliable,

is not just an issue of sloppy speaking: the problem is that sometimes we unconsciously come to think what we say or what we hear, so that sloppy speaking does sometimes lead to a more pernicious outcome, sloppy thinking and sloppy practice. (Thompson, 1992, p. 436)

Some textbooks directly confront the misconception that tests are reliable. For example, Pedhazur and Schmelkin (1991) noted, "Statements about the reliability of a measure are . . . [inherently] inappropriate and potentially misleading" (p. 82). Similarly, Gronlund and Linn (1990) emphasized that reliability

refers to the results obtained with an evaluation instrument and not to the instrument itself. . . . Thus, it is more appropriate to speak of the reliability of the "test scores" or the "measurement" than of the "test" or the "instrument." (p. 78)

More recently, Urbina (2004) emphasized that "the fact is that the quality of reliability is one that, if present, belongs not to tests but to test scores" (p. 119). She perceptively noted that the distinction between scores versus tests being reliable is subtle, but that the distinction is fundamental to an understanding of

the implications of the concept of reliability with regard to the use of tests and the interpretation of test scores. If a test is described as "reliable," the implication is that its reliability has been established permanently, in all respects, for all uses, and with all users. (p. 120)

Urbina used a piano analogy to illustrate the fallacy of describing tests as reliable, noting that saying "the test is reliable" is similar to stating that a piano will always sound the same, regardless of the type of music played, the person playing it, the type of piano, or the surrounding acoustical environment.

Our second major finding is that score reliability on average in the applied research literature appears to be reasonably sufficient to support inquiry using classical GLM statistics, given a mean coefficient alpha of .80 (SD = .09) and a range from .45 to .95. These results suggest a glass-half-full, glass-half-empty conclusion about the quality of our literature with respect to score reliability. Clearly, some substantive studies are being conducted with scores of questionable reliability. Furthermore, we must wonder what the reliabilities of the scores were in the studies in which reliabilities were not reported for the data in hand, or in which reliability was not even mentioned!

Typical Practice Within the RG Literature

Each of the 47 RG studies we investigated involved a gargantuan investment of researcher time and effort, as does any meta-analysis, whether the meta-analysis is substantive or psychometric in focus. RG researchers are employing a wide array of statistical analyses, and one out of three used box-and-whisker plots to communicate their results, which is consistent with the recommendation of Wilkinson and the APA Task Force (1999) to use graphics to communicate multiple features of data (e.g., central tendency, dispersion, shape, outliers) in pictures. We also found that RG researchers are using a wide array of predictor variables to help understand what design features may cause reliabilities to fluctuate across test administrations.

Over time, we expect the quality of RG studies to improve further as more and more primary reports include estimates of score reliabilities and as more researchers realize that tests are not reliable. Indeed, the most important impact of the creation of RG and the reporting of RG findings is that these reports in themselves directly confront chronic misconceptions that tests are reliable. RG studies in and of themselves communicate the important understanding that score reliabilities vary across administrations, and are not secreted into test booklets during the test printing process.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

References

Note: The 47 RG studies included in this study are marked with asterisks.

*Bachner, Y. G., & O'Rourke, N. (2007). Reliability generalization of responses by care providers to the Zarit Burden Interview. Aging & Mental Health, 11, 678–685. doi:10.1080/13607860701529965

*Barnes, L. L. B., Harp, D., & Jung, W. S. (2002). Reliability generalization of scores on the Spielberger State–Trait Anxiety Inventory. Educational and Psychological Measurement, 62, 603–618. doi:10.1177/0013164402062004005

Bem, S. L. (1981). Bem Sex-Role Inventory: Professional manual. Palo Alto, CA: Consulting Psychologists Press.

*Beretvas, S. N., Meyers, J. L., & Leite, W. L. (2002). A reliability generalization study of the Marlowe–Crowne Social Desirability Scale. Educational and Psychological Measurement, 62, 570–589. doi:10.1177/0013164402062004003

*Beretvas, S. N., Suizzo, M.-A., Durham, J. A., & Yarnell, L. M. (2008). A reliability generalization study of scores on Rotter's and Nowicki–Strickland's locus of control scales. Educational and Psychological Measurement, 68, 97–119. doi:10.1177/0013164407301529

Bonnett, D. G. (2010). Varying coefficient alpha meta-analytic methods for alpha reliability. Psychological Methods, 15, 368–385. doi:10.1037/a0020142

*Campbell, J. S., Pulos, S., Hogan, M., & Murry, F. (2005). Reliability generalization of the Psychopathy Checklist applied in youthful samples. Educational and Psychological Measurement, 65, 639–656. doi:10.1177/0013164405275666

*Capraro, R. M., & Capraro, M. M. (2002). Myers–Briggs Type Indicator score reliability across studies: A meta-analytic reliability generalization study. Educational and Psychological Measurement, 62, 590–602. doi:10.1177/0013164402062004004

*Capraro, M. M., Capraro, R. M., & Henson, R. K. (2001). Measurement error of scores on the Mathematics Anxiety Rating Scale across studies. Educational and Psychological Measurement, 61, 373–386. doi:10.1177/00131640121971266

Capraro, R. M., & Thompson, B. (2008). The educational researcher defined: What will future researchers be trained to do? Journal of Educational Research, 101, 247–253. doi:10.3200/JOER.101.4.247-253

*Caruso, J. C. (2000). Reliability generalization of the NEO personality scales. Educational and Psychological Measurement, 60, 236–254. doi:10.1177/00131640021970484

*Caruso, J. C., & Edwards, S. (2001). Reliability generalization of the Junior Eysenck Personality Questionnaire. Personality and Individual Differences, 31, 173–184. doi:10.1016/S0191-8869(00)00126-4

*Caruso, J. C., Witkiewitz, K., Belcourt-Dittloff, A., & Gottlieb, J. D. (2001). Reliability of scores from the Eysenck Personality Questionnaire: A reliability generalization study. Educational and Psychological Measurement, 61, 675–689. doi:10.1177/00131640121971437

Cohen, J. (1968). Multiple regression as a general data-analytic system. Psychological Bulletin, 70, 426–443. doi:10.1037/h0026714

Dandan, G., & Houcan, Z. (2004). A redefinition of reliability and the study of reliability generalization. Psychological Science (China), 27, 445–448.

*Deditius-Island, H. K., & Caruso, J. C. (2002). An examination of the reliability of scores from Zuckerman's Sensation Seeking Scales, Form V. Educational and Psychological Measurement, 62, 728–734. doi:10.1177/0013164402062004012

Dimitrov, D. M. (2002). Reliability: Arguments for multiple perspectives and potential problems with generalizability across studies. Educational and Psychological Measurement, 62, 783–801. doi:10.1177/001316402236878

*Dunn, T. W., Smith, T. B., & Montoya, J. A. (2006). Multicultural competency instrumentation: A review and analysis of reliability generalization. Journal of Counseling & Development, 84, 471–482.

*Graham, J. M., & Christiansen, K. (2009). The reliability of romantic love: A reliability generalization meta-analysis. Personal Relationships, 16, 49–66. doi:10.1111/j.1475-6811.2009.01209.x

*Graham, J. M., Liu, Y. J., & Jeziorski, J. L. (2006). The Dyadic Adjustment Scale: A reliability generalization meta-analysis. Journal of Marriage and Family, 68, 701–717. doi:10.1111/j.1741-3737.2006.00284.x

Gronlund, N. E., & Linn, R. L. (1990). Measurement and evaluation in teaching (6th ed.). New York, NY: Macmillan.

*Hanson, W. E., Curry, K. T., & Bandalos, D. L. (2002). Reliability generalization of Working Alliance Inventory scale scores. Educational and Psychological Measurement, 62, 659–673. doi:10.1177/0013164402062004008

*Hellman, C. M., Fuqua, D. R., & Worley, J. (2006). A reliability generalization study on the Survey of Perceived Organizational Support: The effects of mean age and number of items on score reliability. Educational and Psychological Measurement, 66, 631–642. doi:10.1177/0013164406288158

*Hellman, C. M., Muilenburg-Trevino, E. M., & Worley, J. A. (2008). The belief in a just world: An examination of reliability estimates across three measures. Journal of Personality Assessment, 90, 399–401. doi:10.1080/00223890802108238


*Henson, R. K., & Hwang, D.-Y. (2002). Variability and prediction of measurement error in Kolb's Learning Style Inventory scores: A reliability generalization study. Educational and Psychological Measurement, 62, 712–727. doi:10.1177/0013164402062004011

*Henson, R. K., Kogan, L. R., & Vacha-Haase, T. (2001). A reliability generalization study of the Teacher Efficacy Scale and related instruments. Educational and Psychological Measurement, 61, 404–420. doi:10.1177/00131640121971284

Henson, R. K., & Thompson, B. (2002). Characterizing measurement error in scores across studies: Some recommendations for conducting reliability generalization (RG) studies. Measurement and Evaluation in Counseling and Development, 35, 113–127.

Hogan, T. P., Benjamin, A., & Brezinski, K. L. (2000). Reliability methods: A note on the frequency of use of various types. Educational and Psychological Measurement, 60, 523–531.

Hunter, J. E., & Schmidt, F. L. (1990). Methods of meta-analysis: Correcting error and bias in research findings. Newbury Park, CA: Sage.

*Huynh, Q.-L., Howell, R. T., & Benet-Martinez, V. (2009). Reliability of bidimensional acculturation scores: A meta-analysis. Journal of Cross-Cultural Psychology, 40, 256–274. doi:10.1177/0022022108328919

*Kieffer, K. M., Cronin, C., & Fister, M. C. (2004). Exploring variability and sources of measurement error in Alcohol Expectancy Questionnaire reliability coefficients: A meta-analytic reliability generalization study. Journal of Studies on Alcohol, 65, 663–671.

*Kieffer, K. M., & Reese, R. J. (2002). A reliability generalization study of the Geriatric Depression Scale. Educational and Psychological Measurement, 62, 969–994. doi:10.1177/0013164402238085

Knapp, T. R. (1978). Canonical correlation analysis: A general parametric significance testing system. Psychological Bulletin, 85, 410–416. doi:10.1037//0033-2909.85.2.410

*Lane, G. G., White, A. E., & Henson, R. K. (2002). Expanding reliability generalization methods with KR-21 estimates: An RG study of the Coopersmith Self-Esteem Inventory. Educational and Psychological Measurement, 62, 685–711. doi:10.1177/0013164402062004010

*Leach, L. F., Henson, R. K., Odom, L. R., & Cagle, L. S. (2006). A reliability generalization study of the Self-Description Questionnaire. Educational and Psychological Measurement, 66, 285–304. doi:10.1177/0013164405284030

*Li, A., & Bagger, J. (2007). The Balanced Inventory of Desirable Responding (BIDR): A reliability generalization study. Educational and Psychological Measurement, 67, 525–544. doi:10.1177/001316440292087

*Lopez-Pina, J. A., Sanchez-Meca, J., & Rosa-Alcazar, A. I. (2009). The Hamilton Rating Scale for Depression: A meta-analytic reliability generalization study. International Journal of Clinical and Health Psychology, 9, 143–159.

McCrae, R. R., Kurtz, J. E., Yamagata, S., & Terracciano, A. (2011). Internal consistency, retest reliability, and their implications for personality scale validity. Personality and Social Psychology Review, 15, 28–50. doi:10.1177/1088868310366253

*Miller, B. K., & Byrne, Z. S. (2009). Perceptions of organizational politics: A demonstration of the reliability generalization technique. Journal of Managerial Issues, 21, 280–300.

*Miller, C. S., Shields, A. L., Campfield, D., Wallace, K. A., & Weiss, R. D. (2007). Substance use scales of the Minnesota Multiphasic Personality Inventory: An exploration of score reliability via meta-analysis. Educational and Psychological Measurement, 67, 1052–1065. doi:10.1177/0013164406299130

*Miller, C. S., Woodson, J., Howell, R. T., & Shields, A. L. (2009). SASSI: Assessing the reliability of scores produced by the Substance Abuse Subtle Screening Inventory. Substance Use & Misuse, 44, 1090–1100.

*Mji, A., & Alkhateeb, H. M. (2005). Combining reliability coefficients: Toward reliability generalization of the Conceptions of Mathematics Questionnaire. Psychological Reports, 96, 627–634. doi:10.2466/pr0.96.3.627-634

*Nilsson, J. E., Schmidt, C. K., & Meek, W. D. (2002). Reliability generalization: An examination of the Career Decision-Making Self-Efficacy Scale. Educational and Psychological Measurement, 62, 647–658. doi:10.1177/0013164402062004007

*O'Rourke, N. (2004). Reliability generalization of responses by care providers to the Center for Epidemiologic Studies–Depression Scale. Educational and Psychological Measurement, 64, 973–990. doi:10.1177/0013164404268668

Pedhazur, E. J., & Schmelkin, L. P. (1991). Measurement, design, and analysis: An integrated approach. Hillsdale, NJ: Erlbaum.

*Reese, R. J., Kieffer, K. M., & Briggs, B. K. (2002). A reliability generalization study of select measures of adult attachment style. Educational and Psychological Measurement, 62, 619–646. doi:10.1177/0013164402062004006

Reinhardt, B. (1996). Factors affecting coefficient alpha: A mini Monte Carlo study. In B. Thompson (Ed.), Advances in social science methodology (Vol. 4, pp. 3–20). Greenwich, CT: JAI Press.

*Rexrode, K. R., Petersen, S., & O'Toole, S. (2008). The Ways of Coping Scale: A reliability generalization study. Educational and Psychological Measurement, 68, 262–280. doi:10.1177/0013164407310128

Rodriguez, M. C., & Maeda, Y. (2006). Meta-analysis of coefficient alpha. Psychological Methods, 11, 306–322. doi:10.1037/1082-989X.11.3.306

*Ross, M. E., Blackburn, M., & Forbes, S. (2005). Reliability generalization of the Patterns of Adaptive Learning Survey goal orientation scales. Educational and Psychological Measurement, 65, 451–464. doi:10.1177/0013164404272496

*Rouse, S. V. (2007). Using reliability generalization methods to explore measurement error: An illustration using the MMPI-2 PSY-5 scales. Journal of Personality Assessment, 88, 264–275.

*Ryngala, D. J., Shields, A. L., & Caruso, J. C. (2005). Reliability generalization of the Revised Children's Manifest Anxiety Scale. Educational and Psychological Measurement, 65, 259–271. doi:10.1177/0013164404272495

Schmidt, F. L., & Hunter, J. E. (1977). Development of a general solution to the problem of validity generalization. Journal of Applied Psychology, 62, 529–540. doi:10.1037//0021-9010.62.5.529

*Shields, A. L., & Caruso, J. C. (2003). Reliability generalization of the Alcohol Use Disorders Identification Test. Educational and Psychological Measurement, 63, 404–413. doi:10.1177/0013164403063003004

*Shields, A. L., & Caruso, J. C. (2004). A reliability induction and reliability generalization study of the CAGE Questionnaire. Educational and Psychological Measurement, 64, 254–270. doi:10.1177/0013164403261814

Thompson, B. (1992). Two and one-half decades of leadership in measurement and evaluation. Journal of Counseling and Development, 70, 434–438.

Thompson, B. (1994). Guidelines for authors. Educational and Psychological Measurement, 54, 837–847.

Thompson, B. (Ed.). (2003). Score reliability: Contemporary thinking on reliability issues. Thousand Oaks, CA: Sage.

*Thompson, B., & Cook, C. (2002). Stability of the reliability of LibQUAL+™ scores: A reliability generalization meta-analysis study. Educational and Psychological Measurement, 62, 735–743. doi:10.1177/0013164402062004013

Thompson, B., & Vacha-Haase, T. (2000). Psychometrics is datametrics: The test is not reliable. Educational and Psychological Measurement, 60, 174–195. doi:10.1177/00131640021970448

Urbina, S. (2004). Essentials of psychological testing. Hoboken, NJ: John Wiley.

Vacha-Haase, T. (1998). Reliability generalization: Exploring variance in measurement error affecting score reliability across studies. Educational and Psychological Measurement, 58, 6–20. doi:10.1177/00131640121971059

Vacha-Haase, T., Henson, R. K., & Caruso, J. C. (2002). Reliability generalization: Moving toward improved understanding and use of score reliability. Educational and Psychological Measurement, 62, 562–569. doi:10.1177/0013164402062004002

*Vacha-Haase, T., Kogan, L. R., Tani, C. R., & Woodall, R. A. (2001). Reliability generalization: Exploring variation of reliability coefficients of MMPI clinical scales scores. Educational and Psychological Measurement, 61, 45–59. doi:10.1177/00131640121971059


Vacha-Haase, T., Kogan, L. R., & Thompson, B. (2000). Sample compositions and variabilities in published studies versus those in test manuals: Validity of score reliability inductions. Educational and Psychological Measurement, 60, 509–522. doi:10.1177/00131640021970682

Vacha-Haase, T., Ness, C. M., Nilsson, J., & Reetz, D. (1999). Practices regarding reporting of reliability coefficients: A review of three journals. Journal of Experimental Education, 67, 335–341. doi:10.1080/00220979909598487

*Vacha-Haase, T., Tani, C. R., Kogan, L. R., Woodall, R. A., & Thompson, B. (2001). Reliability generalization: Exploring reliability variations on MMPI/MMPI-2 validity scale scores. Assessment, 8, 391–401. doi:10.1177/107319110100800404

*Vassar, M., & Crosby, J. W. (2008). A reliability generalization study of coefficient alpha for the UCLA Loneliness Scale. Journal of Personality Assessment, 90, 601–607. doi:10.1080/00223890802388624

*Victorson, D., Barocas, J., Song, J., & Cella, D. (2008). Reliability across studies from the Functional Assessment of Cancer Therapy–General (FACT-G) and its subscales: A reliability generalization. Quality of Life Research: An International Journal of Quality of Life Aspects of Treatment, Care & Rehabilitation, 17, 1137–1146. doi:10.1007/s11136-008-9398-2

*Wallace, K. A., & Wheeler, A. J. (2002). Reliability generalization of the Life Satisfaction Index. Educational and Psychological Measurement, 62, 674–684. doi:10.1177/0013164402062004009

Wilkinson, L., & American Psychological Association (APA) Task Force on Statistical Inference. (1999). Statistical methods in psychology journals: Guidelines and explanations. American Psychologist, 54, 594–604. doi:10.1037//0003-066X.54.8.594

Willson, V. L. (1980). Research techniques in AERJ articles: 1969 to 1978. Educational Researcher, 9(6), 5–10. doi:10.2307/1175221

Yetkiner, Z. E., & Thompson, B. (in press). Demonstration of how score reliability is integrated into SEM and how reliability affects all statistical analyses. Multiple Linear Regression Viewpoints, 36(2).

*Yin, P., & Fan, X. (2000). Assessing the reliability of Beck Depression Inventory scores: Reliability generalization across studies. Educational and Psychological Measurement, 60, 201–223. doi:10.1177/00131640021970466

*Youngstrom, E. A., & Green, K. W. (2003). Reliability generalization of self-report of emotions when using the Differential Emotions Scale. Educational and Psychological Measurement, 63, 279–295. doi:10.1177/0013164403253226

*Zangaro, G. A., & Soeken, K. L. (2005). Meta-analysis of the reliability and validity of Part B of the Index of Work Satisfaction across studies. Journal of Nursing Measurement, 13, 7–22. doi:10.1891/jnum.2005.13.1.7

Zientek, L. R., & Thompson, B. (2009). Matrix summaries improve research reports: Secondary analyses using published literature. Educational Researcher, 38, 343–352. doi:10.3102/0013189X09339056

Bios
Tammi Vacha-Haase is a professor of psychology at Colorado State University.

Bruce Thompson is a distinguished professor of educational psychology, and of library science, at Texas A&M University, and an adjunct professor of allied health sciences at Baylor College of Medicine (Houston).
