Instrument validity and reliability have often been both misunderstood and underemphasized in the social scientific literature. In truth, these phenomena lie at the heart of competent and effective study. How productive can one's research be if the instrument that he or she implemented does not measure what it purports to gauge? How legitimate are results garnered from an unreliable or inconsistent instrument? What constitutes a valid instrument? What are the implications of both proper and improper testing? This contribution should, at a minimum, provide an explanation of these and other related questions in the field of measurement and instrument design. Further, the content should impart a sense of responsibility on academics and professionals to address these issues in their own work.
Reliability
DeVellis (1991) defines reliability as "the proportion of variance attributable to the true
score of the latent variable". The obscurity of this definition provides a perfect example of why
reliability, as a measurement concept, has been unable to effectively penetrate the general
research public thus far. In fact, the concept of reliability often goes unrecognized outside of the
measurement literature. Across disciplines competent researchers not only fail to report the
reliability of their measures (Henson, 2001; Thompson, 1999), but also fall short of grasping the
inextricable link between scale reliability and effective research. At best, measurement error
affects the ability to find significant results in one’s data. At worst, measurement error can
significantly damage the interpretability of scores or the function of a testing instrument. This
gap between well-founded ideas and corresponding measurement instruments based on sound
theoretical grounding is of particular concern. A first step towards remedying this split and integrating the two disciplines involves bringing the appropriate quantitative methodology to the applied researcher.
What is Reliability?
Reliability involves the consistency, or reproducibility, of test scores. That is, the degree
to which one can expect relatively constant deviation scores of individuals across testing
situations on the same, or parallel, testing instruments. This property is not a static feature of the test. Rather, reliability estimates change with different populations (i.e., population samples) and as a function of the error involved. These facts underscore the importance of estimating reliability anew for each administration, as samples, or subject populations, are rarely the same across situations and in different research settings. More important to understand is that reliability estimates are a function of the test
scores yielded from an instrument, not the test itself (Thompson, 1999). Accordingly, reliability
estimates should be considered based upon the various sources of measurement error that will
be involved in test administration (Crocker & Algina, 1986). In order to fully understand
reliability, then, it is first necessary to develop a clearer picture of measurement error and
test scores.
Measurement Error
In any given testing situation, there are two types of error present: systematic and
random error. Systematic error is the more problematic of the two, as it does not necessarily
affect the consistency of test scores. Rather, it may just affect the utility, or validity, of the test.
Systematic error refers to those errors that consistently affect an individual's observed score.
This may be a function of the individual themselves (e.g. a personality attribute, such as
forgetfulness, or a quality, like fatigue) or a function of the measure. For instance, a test could
reliably be measuring a test taker’s level of depression, even though it was intended to gauge
anxiety. Regardless, systematic error reflects the measurement of something other than the
intended construct. In contrast, random error refers to that error which affects individuals'
scores by pure chance. Consider an examinee who is ill, or whose family member was placed
in a hospital the night before; his or her observed score on the measure will certainly be less
reflective of their true score on a construct in comparison to other test takers. These random
errors directly affect the reliability of a measure, as well as the utility of the instrument. An examinee's true score is a hypothetical construct that represents the infinite average of an individual's observed scores (Crocker &
Algina, 1986). Therefore, researchers can only infer information about a true score from the
examinee’s observed score. Classical test theory indicates that observed test scores are
comprised of both an individual’s true score on an instrument and random error in testing
(Spearman, 1907, 1913, as cited in Crocker & Algina, 1986). Theoretically, error is considered
the disparity between one’s true score on a construct, as defined by the measure, and his or her
observed score, the result of his or her responses to the test items. This testing error should vary randomly over time, and for a given individual it should average to zero across repeated testings (Crocker & Algina, 1986; Gregory, 1992). Similarly, truly random error values should not
correlate with one another, or anything at all. Evidence of such a correspondence may indicate
the presence of systematic error, which would need to be addressed in the testing instrument.
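The classical decomposition of an observed score into a true score plus random error can be illustrated with a brief simulation. The sketch below is a minimal illustration using hypothetical values for the true score and the error distribution; it is not drawn from the sources cited here.

```python
import numpy as np

# Classical test theory: observed = true + random error, where error has mean zero.
rng = np.random.default_rng(0)
true_score = 50.0                                        # hypothetical true score on the construct
errors = rng.normal(loc=0.0, scale=3.0, size=10_000)     # random error across repeated testings
observed = true_score + errors

# The mean of the observed scores approaches the true score as administrations accumulate,
# which mirrors the "infinite average" definition of the true score above.
print(round(observed.mean(), 2))   # ~50.0
print(round(errors.mean(), 3))     # ~0.0, as random error should average to zero
```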
The standard error of measurement (SEM) refers to the individual standard deviation
scores for error averaged across an entire group of examinees, and can be thought of in terms
of the normal distribution of scores for each examinee (Crocker & Algina, 1986). This element allows one to construct a confidence interval around an individual's true score. This confidence interval, as defined by the observed score (X) ± SEM,
delineates a bound by which the observed values are accurate in assessing the true values, as
the actual true score of an examinee is never really known. In other words, in calculating the
standard error of measurement, one is able to get an idea of how well a set of observed scores reflects the examinees' true scores on a construct. For example, a
confidence interval might indicate, with 95% certainty, that an examinee’s true score on a math
test falls between his or her observed score of 83 ± a calculated SEM of 2.32 (Crocker & Algina,
1986).
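The standard error of measurement has a simple closed form in classical test theory: the standard deviation of the observed scores multiplied by the square root of one minus the reliability. The sketch below uses hypothetical values for the standard deviation and reliability; only the observed score of 83 comes from the example above.

```python
import math

def standard_error_of_measurement(sd_observed, reliability):
    """SEM = SD of observed scores * sqrt(1 - reliability), per classical test theory."""
    return sd_observed * math.sqrt(1.0 - reliability)

sem = standard_error_of_measurement(sd_observed=10.0, reliability=0.95)  # hypothetical values
observed_score = 83

# Band described in the text (observed score +/- SEM); a conventional 95% interval
# widens this to roughly +/- 1.96 * SEM.
band = (observed_score - sem, observed_score + sem)
ci_95 = (observed_score - 1.96 * sem, observed_score + 1.96 * sem)
print(round(sem, 2), band, ci_95)
```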
Estimating Reliability
There are many methods of estimating reliability, each tapping a different dimension of reliability. The application of these methods will vary depending on your testing situation
and how you plan to use the test results. Firstly, there are instances in which multiple testing
sessions of the same instrument will occur. Consider the clinician who is interested in
measuring the psychological growth of his clients over time (e.g. on a depression scale). In this
example, coefficients of stability most readily apply. A measure of reliability called the test-
retest method should be employed, where test proctors administer the same test to a set of
examinees more than once. The test should be administered, a sufficient period of time should
elapse, and the test should then be administered once again. Upon completion of the second
administration, one is able to calculate the correlation coefficient between scores on the two
measures, which will yield information on how stable the test results (i.e. observed scores) are
over time (Crocker & Algina, 1986; Gregory, 1992). A second type of reliability estimate is the
alternate form method. This test-retest technique evaluates the consistency of alternate forms
of a single test (DeVellis, 1991). This approach is particularly useful in the context of
standardized testing procedures, where it is ideal to have multiple, and equivalent, forms of the
same test. In this method, participants take one form of the test, a period of time elapses, and
they then take a second form of the test. Once results are gathered from both sessions, the
correlation coefficient between the two sets of scores is calculated. In this technique, a
coefficient of equivalence is yielded (Crocker & Algina, 1986; DeVellis, 1991; Gregory, 1992).
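Both the test-retest and alternate-form approaches reduce to correlating two sets of scores from the same examinees. A minimal sketch, with made-up scores, might look like this:

```python
import numpy as np

# Hypothetical scores for seven examinees on two administrations (or two parallel forms).
administration_1 = np.array([12, 15, 9, 20, 17, 11, 14])
administration_2 = np.array([13, 14, 10, 19, 18, 10, 15])

# Pearson correlation between the two score sets: a coefficient of stability for
# test-retest data, or a coefficient of equivalence for alternate forms.
r = np.corrcoef(administration_1, administration_2)[0, 1]
print(round(r, 3))
```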
Other reliability estimates only necessitate one test administration. These are often
referred to as internal consistency measures of reliability. These methods are concerned with
the consistency of scores within the test itself, or rather consistency of scores among the items
(Crocker & Algina, 1986; DeVellis, 1991). Here, the key is to have a homogeneous set of items
that reflects a unified underlying construct. High reliability estimates of this kind will result in
high inter-item correlations among the items or subscales (Crocker & Algina, 1986; DeVellis,
1991; Gregory, 1992; Henson, 2001). The most common method of assessing internal
consistency reliability estimates is through the use of coefficient alpha. Though there are three
different measures of coefficient alpha, the most widely used measure is Cronbach’s coefficient
alpha. Cronbach’s alpha is actually an average of all the possible split-half reliability estimates
of an instrument (Crocker & Algina, 1986; DeVellis, 1991; Gregory, 1992; Henson, 2001). It is
important to note that coefficient alpha represents a lower bound of reliability. The two lesser-used measures apply to more specific situations: for example, the Kuder-Richardson 20 is appropriate for use with dichotomously scored items, and Hoyt's method, which frames the same estimate in analysis-of-variance terms, is useful in particular testing situations.
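Cronbach's alpha can be computed directly from an examinee-by-item score matrix using the usual formula, alpha = (k / (k - 1)) * (1 - sum of item variances / variance of total scores). The function below is a generic sketch with hypothetical data, not code from the sources cited.

```python
import numpy as np

def cronbach_alpha(item_scores):
    """item_scores: 2-D array with one row per examinee and one column per item."""
    x = np.asarray(item_scores, dtype=float)
    k = x.shape[1]
    item_variances = x.var(axis=0, ddof=1)
    total_variance = x.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical responses of five examinees to four Likert-type items.
responses = np.array([
    [4, 5, 4, 4],
    [2, 3, 2, 3],
    [5, 5, 4, 5],
    [3, 3, 3, 2],
    [1, 2, 2, 1],
])
print(round(cronbach_alpha(responses), 3))
```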
Another type of internal consistency measure that only requires one test administration
is split-half reliability. This technique splits an instrument into two halves, scores each half separately, and correlates examinees' scores on the two halves. The method demands equal item representation across the two halves of the instrument. Clearly, the comparison of dissimilar sample items will not yield an accurate reliability estimate. Researchers can ensure equal item representation through the use of random item selection, matching items from one half to the next, or assigning items to halves based on an even/odd
distribution (Crocker & Algina, 1986). It should be noted that reliability estimates are often
underestimated when computing split-half reliability, due to the shortened nature of the
instrument. This underestimation can be addressed by using the Spearman-Brown prophecy formula, which provides the means to estimate reliability for the full-length test from the correlation between the two halves.
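The Spearman-Brown prophecy formula expresses the reliability of a test whose length has been changed by a factor n as n*r / (1 + (n - 1)*r), where r is the current reliability. Applied with n = 2, it steps a half-test correlation up to a full-length estimate. A minimal sketch with a hypothetical half-test correlation:

```python
def spearman_brown(reliability, length_factor):
    """Projected reliability when the test length is multiplied by length_factor."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

# Correcting a split-half correlation for the shortened (half-length) instrument:
half_test_r = 0.70                      # hypothetical correlation between the two halves
full_length_estimate = spearman_brown(half_test_r, length_factor=2)
print(round(full_length_estimate, 3))   # ~0.824
```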
In the social sciences, acceptable reliability estimates range from .70 to .80 (Nunnally &
Bernstein, 1994). However, research in the physical sciences typically demands more rigorous
reliability standards, as the constructs involved are more concrete and easily defined. In both
settings, acceptable reliability estimates should be congruent with the implications of the test
scores. That is, higher-stakes testing should be held to higher standards of instrument reliability.
Assessing scale reliability is crucial to maximizing power in one’s study. Simply put,
unreliable scales decrease the statistical power of one's analyses. This is important in many
ways. Most notably, as power decreases, larger sample sizes are necessary to find significant
results. An increase in statistical effect size is also observed with an increase in instrument
reliability and subsequent power gained. Additionally, reliable instruments introduce less error
into the statistical measurement and resulting analysis. Still, significant results may well be of little value if the scores that produced them cannot be trusted.
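The link between reliability and statistical power can be seen in a small simulation: adding measurement error to a predictor attenuates the observed correlation (roughly by the square root of the reliability), which in turn lowers the power of any test built on that correlation. The numbers below are entirely hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n, true_r = 500, 0.40

# A latent predictor and an outcome correlated with it at true_r.
latent = rng.normal(size=n)
outcome = true_r * latent + np.sqrt(1 - true_r**2) * rng.normal(size=n)

for reliability in (1.0, 0.8, 0.5):
    # Error variance chosen so that var(true) / var(observed) equals the target reliability.
    error_sd = np.sqrt((1 - reliability) / reliability)
    observed = latent + error_sd * rng.normal(size=n)
    r_obs = np.corrcoef(observed, outcome)[0, 1]
    print(reliability, round(r_obs, 3))   # observed r shrinks as reliability drops
```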
Low internal consistency estimates are often the result of poorly written items or an
excessively broad content area of measure (Crocker & Algina, 1986). However, other factors can equally reduce the reliability coefficient: namely, the homogeneity of your testing sample, imposed time limits in your testing situation, item difficulty, and the length of your testing
instrument (Crocker & Algina, 1986; Mehrens & Lehman, 1991; DeVellis, 1991; Gregory, 1992).
Group homogeneity is particularly influential when you are trying to apply a norm-referenced
test to a homogeneous testing sample. In such circumstances, the restriction of range of your
testing group (i.e. low variability) translates into a smaller proportion of variance explained by
your testing instrument, ultimately deflating the reliability coefficient. It is essential to bear in
mind the intended use of your instrument when considering these circumstances and deciding
how to use an instrument (Crocker & Algina, 1986; Mehrens & Lehman, 1991; Gregory, 1992).
Imposed time constraints in a testing situation pose a different type of problem. That is,
time limits ultimately affect a test taker’s ability to fully answer questions or to complete an
instrument. As a result, variance in test taker ability to work at a specific rate becomes
enmeshed in that person’s score variance. Ultimately, test takers that work at similar rates have
higher degrees of variance correspondence, artificially inflating the reliability of the testing
instrument. Clearly, this situation becomes problematic when the construct that an instrument
intends to measure has nothing to do with speed competency (Crocker & Algina, 1986).
The relationship between reliability and item difficulty addresses a variability issue once
again. Ultimately, if the testing instrument has little to no variability in the questions (i.e. either
items are all too difficult or too easy), the reliability of scores will be affected. Aside from this,
reliability estimates can also be artificially deflated if the test has too many difficult items, as they
promote uneducated guesses (Crocker & Algina, 1986; Mehrens & Lehman, 1991).
Lastly, test length also factors into the reliability estimate. Simply, longer tests yield
higher estimates of reliability. This phenomenon can be best explained through an examination
of the Spearman-Brown prophecy equation, which indicates that as the number of items increases,
there is a direct increase in the reliability estimate. However, one must consider the reliability
gains earned in such situations, as infinitely long tests are not necessarily desirable. For
instance, if you have an 80-item testing instrument with an internal consistency reliability
coefficient of .78 and the Spearman-Brown prophecy indicates that your reliability estimate will increase to roughly .82 if you add an additional 25 items, you may consider that the slightly lower
reliability estimate is more desirable than an excessively long instrument (Crocker & Algina, 1986).
A different set of considerations applies to criterion-referenced tests. In these circumstances, researchers or test administrators are interested in how individuals score
with reference to set standards of performance, not with reference to other individuals (i.e.,
norm-referenced testing). This situation does not necessarily demand high estimates of
reliability in the classical sense. In fact, it may be more desirable to have individuals all do well,
which would yield low variability. Therefore, the invariance of scores is acceptable as long as
the measure adequately represents the testing domain (the measure’s adequate representation
of the testing domain can be established via the content validity procedures discussed later).
Measures of reliability in this respect are interested in the “precision of the decision” or
the “precision of the score” (Mehrens & Lehman, 1991, p. 262), rather than the consistency of
individual scores. That is, test administrators are interested in whether the pass/fail, or mastery, decisions made about examinees remain consistent across testing administrations, or parallel forms of the instrument. In other words, individuals with the
same amount of knowledge or skill set should be judged in an equivalent manner. Accordingly,
classical measures of reliability are not relevant in this circumstance. Instead, these situations
call for the calculation of indices of consistency, which compute the “proportion of examinees for
which the decisions are the same [across different measures]” (Traub, 1994, p.142). Traub
illustrates this notion with a matrix, where participants' pass/fail outcomes on tests one and two, or on two parallel forms, are cross-classified:

                          Test #2
                     Fail           Pass
   Test #1   Fail    Fail/Fail      Fail/Pass
             Pass    Pass/Fail      Pass/Pass
Test administrators are seeking to make the same decisions for participants across
instruments or sessions. The Fail/Fail and Pass/Pass cells of the matrix illustrate this consistency. This index of
consistency has received mild criticism though, as there is always a proportion of chance
consistency. In fact, your reliability index will never be zero, even if the instrument is unreliable,
due to the simple rules of probability. However, there is a correction formula that addresses this
problem: coefficient kappa. Researchers may decide whether or not implementation of this
latter index is necessary through evaluating the shapes of the score distributions, assessing the
correlation between scores on the respective tests, and estimating the magnitude of the cut-off
scores. Regardless of which index is ultimately used for interpretation, the literature indicates that the choice of index, and the rationale behind it, should be made explicit.
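For criterion-referenced decisions, the observed proportion of consistent classifications and its chance-corrected counterpart, coefficient kappa, can both be computed from the pass/fail decisions on the two administrations. The sketch below uses invented decisions for ten examinees.

```python
import numpy as np

# Hypothetical pass (1) / fail (0) decisions for ten examinees on two administrations.
test_1 = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])
test_2 = np.array([1, 1, 0, 1, 1, 1, 0, 0, 1, 1])

p_observed = np.mean(test_1 == test_2)          # proportion of identical decisions

# Chance agreement expected from the marginal pass rates alone.
pass_1, pass_2 = test_1.mean(), test_2.mean()
p_chance = pass_1 * pass_2 + (1 - pass_1) * (1 - pass_2)

kappa = (p_observed - p_chance) / (1 - p_chance)
print(round(p_observed, 2), round(kappa, 2))
```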
For those testing situations that do not allow for multiple administrations of an
instrument, there are indices of consistency available for use with one test administration.
However, these indices normally entail more complicated computation. Some examples of
these methods are: the Subkoviak index of decision consistency (which is appropriate for use
with dichotomously scored items), linear regression estimates and Huynh’s estimate of decision
consistency. Please refer to Crocker and Algina (1986) for an in depth analysis of these issues.
Validity
Validity has been defined as "the extent to which [a test] measures what it claims to
measure” (Gregory, 1992, p.117). The focus here is not necessarily on scores or items, but
rather inferences made from the instrument. That is, the behavioral inferences that one can
extrapolate from test scores are of immediate focus. In order to be valid, the inferences made
from scores need to be “appropriate, meaningful, and useful” (Gregory, 1992, p. 117). These
distinctions illuminate the inextricable link between validity and reliability. For example, a testing
instrument can reliably measure something other than the supposed construct, but an unreliable
measure cannot be valid (Crocker & Algina, 1986; Gregory, 1992). Violations of instrument
validity severely impact the function and functioning of a testing instrument. In some ways,
validity inadequacies impart even more serious consequences on an instrument than its
reliability counterpart. This can be substantiated in the sense that validity is a comprehensive
construct that cannot be definitively measured in any one given statistic, and that this
instrumental testing property is often even less understood than reliability (Crocker & Algina,
1986; Gregory, 1992). Effective validity studies not only demand the integration of multiple
sources of evidence, but also must continually take place over time. That is, a measure cannot
be deemed valid in a simple instance of study. Rather, multiple studies must be implemented
over different samples, and the collection of validity evidence must cover specified areas
(Crocker & Algina, 1986; Gregory, 1992; Messick, 1995). Moreover, in recent years
researchers have expanded the understanding of validity to comprise more dimensionality than was traditionally recognized. Messick (1995), for instance, proposes an integrative approach to instrument validity that not only focuses on conventional test score issues, but also emphasizes the significance of score implications and their social use.
Still, this unified concept of validity is best understood and examined within the context of its
four discrete facets: content validity, construct validity, criterion validity, and consequential validity.
Content Validity
Content validity should not be confused with face validity, a non-statistical assessment of
whether or not a test appears to be “valid”. This concept is really not an index of validity at all.
Rather, it simply addresses the layman acceptability of a measure (Gregory, 1992). In contrast,
content validity considers whether or not the items on a given test accurately reflect the
theoretical domain of the latent construct it claims to measure. Items need to effectively act as
a representative sample of all the possible questions that could have been derived from the
construct (Crocker & Algina, 1986; DeVellis, 1991; Gregory, 1992). This concept is easy to illustrate with a concrete example: consider a chemistry test covering the elements of the periodic table. There are a finite number of elements in this population from which
questions can be drawn. Further, there are different groupings of elements that may be
addressed. Finally, there are a specified number of symbols that students can be tested on.
Thus, an accurate, or valid, testing measure would sample these criteria in such a way that
facets of knowledge in these areas would be equally represented. Application of this notion to
the social sciences is slightly more convoluted. This stems from the fact that the theories and
constructs involved are innately intangible (e.g. anxiety, intelligence, depression, etc.). In other words, it is far more difficult to specify a finite pool of items that is representative of the domain. In this respect, there is no clean set of exhaustive measures that
represents any given construct. Rather, there exists an almost infinite sampling domain from
which questions can be drawn. In this instance, content validity becomes more of a qualitative
judgment than an absolute definitive measure (Crocker & Algina, 1986; DeVellis, 1991; Gregory,
1992). Crocker and Algina (1986) have shared a sequential process that is relevant to both of
these circumstances. They suggest employing the following four steps to effectively evaluate
content validity: 1) identify and outline the domain of interest, 2) gather a panel of domain experts,
3) develop consistent matching methodology, and 4) analyze results from the matching task. In
this example, matching refers to the pairing of each item on an instrument with an element of
the construct’s theoretical domain. In the academic arena, the testing content is usually defined
via course objectives. Matching in this instance might involve pairing objectives with test items,
particularly in the context of achievement tests (Crocker & Algina, 1986). A number of different
issues arise in response to Crocker and Algina’s methodology. Two questions that are
particularly pertinent are: should researchers weight objectives in order of their importance in the theoretical domain, and how should the results of this process be interpreted? Researchers are
divided on the former consideration. Some (e.g. Katz, as cited in Crocker & Algina, 1986)
maintain that it is necessary to weight objectives in order of their pertinence to the theoretical
domain. However, most others argue that equal representation of theoretical objectives is not
damaging. With reference to summarizing the information gathered in the matching task, this
truly becomes a subjective judgment on the part of the researcher. There are no hard and fast
statistics that indicate whether or not one’s instrument is ‘valid enough’. However, there are
definite quantitative tools available to aid in this decision-making. Crocker and Algina (1986)
delineate these measures. Two particularly useful pieces of information researchers may
choose to implement are the percentage of successfully matched items and the percentage of
theoretical domain objectives not addressed in the instrument. It is necessary to consider these
and other measures in full when deciding on the validity of an instrument. Interpretation of only
one index is imprudent and not recommended. Some common problems experienced with
assessing content validity are inadequate representation of domain in test objectives and ethnic,
racial, and gender biases. Crocker and Algina (1986) provide an excellent example of the issue
of testing bias. Consider a teacher who is interested in measuring his or her students' math ability.
He or she provides a math test with word problems presented in English. This measure would
not accurately reflect the quantitative ability of non-English speakers in the classroom, as their
verbal ability, or inability, directly affects their ability to execute the problems. This problem may
have been averted with a clearer specification of the testing domain. In this example, the
teacher was really measuring the ability of students to execute word problems presented in English.
Construct Validity
DeVellis explains that the construct validity of a measure “is directly concerned with the
theoretical relationship of a variable (e.g. a score on some scale) to other variables. It is the
extent to which a measure ‘behaves’ the way that the construct it purports to measure should
behave with regard to established measures of other constructs” (1991, p. 46). As clear as this
sounds, it proves a bit more difficult in practice as constructs are not readily observable. Recall
that one must develop items, or variables, that act as representations of the construct and serve
to measure examinee scores with respect to the paradigm. This is facilitated through the
delineation of the construct itself. That is, a construct needs to be both operationalized and
syntactically defined in order to measure it effectively (Benson, 1998; Crocker & Algina, 1986; Gregory, 1992). Operational definition involves specifying the measurable behaviors or attributes that are hypothesized to correspond to the latent construct. Syntactic definition involves articulating the expected relationships between the construct of interest and other related constructs or behaviors (Benson, 1998; Crocker &
Algina, 1986; Gregory, 1992). An example may provide more clarity on this issue. Suppose
that a researcher is interested in the construct of anxiety. They first may choose to establish
behaviors that are characteristic of this state (e.g. heart palpitations, sweaty palms, intrusive
thoughts, etc), so that it becomes measurable. They then formulate hypotheses about how
anxiety should relate to other criteria. For instance, researchers may claim that anxiety should
have a high correlation with depression and socially avoidant behaviors. Crocker and Algina
(1986) provide a series of steps to follow when pursuing a construct validation study: 1)
generate hypotheses of how the construct should relate to both other constructs of interest and
relevant group differences, 2) choose a measure that adequately represents the construct of
interest, 3) pursue empirical study to examine the relationships hypothesized, and 4) analyze
gathered data to check hypothesized relationships and to assess whether or not alternative
hypotheses could explain the relationships found between the variables. Benson (1998)
suggests that not all construct validation studies are the same though. Rather, she considers
that there are both weak and strong approaches to this practice. She contends that weak
attempts at establishing construct validity only focus on gathering empirical research on both the
construct of interest and other related constructs. In contrast, she suggests that strong attempts
at proving construct validity instead concentrate on the theoretical (i.e. not exclusively empirical) development of the construct, from which hypothesized empirical relationships naturally evolve. Further, Benson affirms Nunnally's validation practice, which comprises a substantive, a structural, and an external component. Step one in this series is relatively similar to step two in Crocker and
Algina’s methodology. Step two here focuses on the relationships between the items, or
observables, themselves. That is, the researcher examines the relationships between the items
on the measure to see how they are relating to both the construct and other items on the scales.
Finally, step three parallels step number one in Crocker and Algina’s conceptualization. The
key difference between these views is that Benson emphasizes item-level analysis of the scale,
while this component is not utilized in the Crocker and Algina literature.
No cut-off value exists for what constitutes an acceptable correlation between the construct and related entities in order to support construct validity. Once again, this becomes a qualitative
judgment that the researcher must make. However, significance tests of the correlations
themselves or proportion of variance explained as a function of the other measure may aid in
this decision-making process (Crocker & Algina, 1986; DeVellis, 1991; Gregory, 1992). Further,
investigation of the inter-item correlations on the scale and differentiating item response
measures may aid in learning more about the function of the items themselves, with respect to
one another and the latent construct. Further, affirming hypothesized group differences will
bolster the relationship between the measure and related constructs. Exploratory factor analysis can provide evidence of the internal structure of the measure by illustrating how items load on particular factors, and confirmatory factor analysis techniques
should be employed to establish quality of model fit. Correlational evidence between the
underlying constructs may ultimately be studied within the context of a structural equation
model, which derives this information from the measures themselves (Benson, 1998).
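As a rough illustration of the exploratory step, the sketch below simulates four items driven by a single latent construct and recovers their loadings with scikit-learn's FactorAnalysis class. The loadings and sample size are invented for illustration; confirmatory models and structural equation models would normally be fit with dedicated software rather than this class.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(2)

# Simulate 300 examinees on four items that all reflect one latent construct.
latent = rng.normal(size=(300, 1))
true_loadings = np.array([[0.80, 0.70, 0.60, 0.75]])     # hypothetical loadings
items = latent @ true_loadings + 0.5 * rng.normal(size=(300, 4))

fa = FactorAnalysis(n_components=1)
fa.fit(items)
print(np.round(fa.components_, 2))   # estimated loadings of each item on the single factor
```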
Convergent and discriminant evidence also bears on construct validity. When comparing the measured construct to other constructs based on hypothesized
relationships, one expects to see either convergent or discriminant validity. That is, convergent
validity coefficients should arise when considering two constructs hypothesized to be related.
Further, the correlation between the two should be moderately high in order to support test validity, as you want the measure to be correlated with other scales purporting to measure the
same, or related, thing. Again, though there is no explicit rule that states what correlation
coefficient is acceptable, you should be looking for relationships around .50 and above (Crocker
& Algina, 1986). In contrast, when you are looking at the relationship between your scale of
interest and a construct that is not hypothesized to be related (i.e. should be unrelated), you aim
to find discriminant validity coefficients. In other words, you want little to no correlation
between your measure and unrelated constructs. For instance, if there is a high degree of
overlap and correlation between your measure that is supposed to measure depression and an
index of shoe size, you should be alerted to the fact that your measure is not functioning as
intended. Or rather, that your measure is not measuring the underlying construct you claim it measures.
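A crude check of convergent and discriminant evidence is simply to inspect the correlations between the new scale, an established measure of the same construct, and a measure of something that should be unrelated. The scores below are simulated purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200

# Simulated scores: a new depression scale, an established depression scale that shares
# the same latent construct, and an unrelated variable (e.g. shoe size).
latent_depression = rng.normal(size=n)
new_scale = latent_depression + 0.5 * rng.normal(size=n)
established_scale = latent_depression + 0.5 * rng.normal(size=n)
shoe_size = rng.normal(size=n)

convergent_r = np.corrcoef(new_scale, established_scale)[0, 1]   # expect roughly .50 or higher
discriminant_r = np.corrcoef(new_scale, shoe_size)[0, 1]         # expect near zero
print(round(convergent_r, 2), round(discriminant_r, 2))
```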
Criterion Validity
Criterion validity refers to the ability to draw accurate inferences from test scores to a
related behavioral criterion of interest. This validity evidence can be pursued in one of two ways: predictive or concurrent validation. In the predictive case, researchers are interested in assessing the predictive utility of an instrument. For example, the purpose of standardized tests such as the SAT and GRE is to predict performance outcomes in college
and graduate school respectively. The test scores ultimately become the basis from which
decisions are made. This predictive validation normally occurs in the following sequence: test
scores of an instrument are gathered, stored, and then compared to a criterion of interest later
on (Crocker & Algina, 1986; Gregory, 1992). Researchers are looking for a high degree of
correlation between the criterion variable and scores on the testing instrument, in order to assert
good criterion validity. Validity coefficients are ultimately derived from the correlation between
these components. From this, one can calculate a coefficient of determination for the measures
by squaring the validity coefficient. This informs the researcher what percentage of variance in
the criterion variable is accounted for by the testing measure, or predictor variable (Crocker &
Algina, 1986; Gregory, 1992). However, perfect correlations between the two variables do not
indicate that one is simply an estimation of the other. More specifically, high correlations do not
mean that the testing instrument estimates the criterion variable (DeVellis, 1991).
Though concurrent validity also looks at the correlation between criterion and test
scores, the two measures are taken one right after the other in this instance. In other words,
there is no extended time period between the testing measure and the criterion measure, and
the examiner is more interested in the measure’s ability to reflect current ability than any
predictive aptitude. This is illustrated well in multidimensional licensing exams, where there
might be both a written and practical component. For example, if a test taker does well on the
written portion of an exam that measures anatomical knowledge, you would expect them to
perform reasonably well on a surgical exam of the same area. If the two skill measures were
not highly correlated, you may choose to further investigate the dynamics of your testing
instrument (Crocker & Algina, 1986; DeVellis, 1991; Gregory, 1992). Measures with established
concurrent validity ultimately have the ability to act as “shortcut[s] for obtaining information that
might otherwise require the extended investment of professional time”. That is, validated
measures may be used as precursory diagnostic tools in professional settings. For example,
validated measures may serve clinical staff by providing initial diagnostic information that would otherwise require more extensive professional assessment.
The Standard Error of Estimate (SEE) reflects how ‘off’ you are, on average, in your
predictions of the criterion variable from the predictor variable. This index is derived from the
validity coefficient of a measure and the standard deviations of the scores on the criterion
measure, and parallels the Standard Error of Measurement as it relates to reliability. In this
context, the SEE reflects the degree to which a measure is not wholly valid. That is, it illustrates
the instrument’s inaccuracy in predicting scores on a related criterion measure (Crocker &
Algina, 1986).
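Both the coefficient of determination and the standard error of estimate follow directly from the validity coefficient: the former is its square, and the latter is the criterion standard deviation multiplied by the square root of one minus that square. A minimal sketch with hypothetical numbers:

```python
import math

validity_r = 0.60          # hypothetical correlation between test scores and the criterion
sd_criterion = 12.0        # hypothetical standard deviation of the criterion scores

coefficient_of_determination = validity_r ** 2          # share of criterion variance explained
see = sd_criterion * math.sqrt(1 - validity_r ** 2)     # standard error of estimate

print(coefficient_of_determination, round(see, 2))      # 0.36 and ~9.6
```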
Clearly, one needs to begin with reliable measures before attempting to validate them in
this manner. Likewise, criterion variables also have to maintain a high degree of reliability. The
subsequent validity of a measure is actually restricted by the influence of these two variables.
That is, the validity of a predictive measure has a precise relationship with the reliability of the
two measures involved in its study. Please see Crocker and Algina (1986) for a more thorough analysis of this relationship.
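The relationship alluded to here is usually expressed through the classical attenuation result: the observed validity coefficient is bounded by the square root of the product of the two reliabilities, and dividing by that quantity estimates the correlation between the underlying true scores. A sketch, using hypothetical reliabilities:

```python
import math

def maximum_validity(reliability_x, reliability_y):
    """Upper bound on the observed validity coefficient given the two reliabilities."""
    return math.sqrt(reliability_x * reliability_y)

def correct_for_attenuation(observed_r, reliability_x, reliability_y):
    """Estimated correlation between true scores, corrected for unreliability."""
    return observed_r / math.sqrt(reliability_x * reliability_y)

print(round(maximum_validity(0.81, 0.64), 2))               # 0.72 at most
print(round(correct_for_attenuation(0.45, 0.81, 0.64), 2))  # ~0.63
```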
Further problems associated with criterion-related validity include restriction of range and
sample size. Restriction of range artificially deflates the correlation coefficient yielded between
the two measures, and inadequate sample size has ramifications for both power and accurate
validity estimations. More specifically, sample sizes below 200 individuals are often incapable
of representing accurate population validity coefficients. With an increase in sample size, there
is a corresponding increase in the accuracy of validity prediction (Crocker & Algina, 1986).
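Restriction of range is easy to demonstrate by simulation: correlate a predictor with a criterion in a full sample, then recompute the correlation using only the examinees above a selection cut-off. All values below are invented.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5000

predictor = rng.normal(size=n)
criterion = 0.6 * predictor + np.sqrt(1 - 0.36) * rng.normal(size=n)

full_sample_r = np.corrcoef(predictor, criterion)[0, 1]

selected = predictor > 1.0     # e.g. only examinees admitted on the basis of the predictor
restricted_r = np.corrcoef(predictor[selected], criterion[selected])[0, 1]

print(round(full_sample_r, 2), round(restricted_r, 2))   # the restricted value is smaller
```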
Consequential Validity
In recent years, more emphasis has been placed on the social utility and bias of
interpretation in test scores. Messick (1995) has been at the forefront of this push for the explicit consideration of testing consequences. Consequential validity refers to the notion that the social consequences of test scores and their subsequent interpretation should be evaluated not only against the original intention of the test, but also against cultural norms (Messick, 1995). This idea points to both the intended and unintended
deleterious effects of an invalid instrument that creates unmerited distinctions between racial or
ethnic groups on an academic measure. It is critical that distinctions of this kind are supported
with valid measurement instruments, particularly if they go against the grain of social norms.
This concern is wholly based in the scale’s construct validity. Moreover, any inferences made
from test scores are contingent upon the construct validity of the scale. Therefore adequate
evidence, as collected through all four facets of validity study, is crucial in developing an instrument whose scores can be interpreted and used responsibly. It is also worth remembering that a validity coefficient is ultimately just a correlation between the two variables, or constructs, that you are interested in. Beyond this, comparable
measurement issues that affected the nature of reliability coefficients also affect validity
coefficients. That is, the more heterogeneous the groups are, the higher the correlations
between two measures will ultimately be. This phenomenon is most readily observable in
samples with a restriction of range problem. When the data range is limited, the scores become more homogeneous and the resulting correlation coefficients are artificially deflated. An
important matter of note is that the more effective an instrument is at screening individuals for a
particular purpose, the less heterogeneous the resulting sample will be, which in turn results in a
smaller validity coefficient. Consider a diagnostic instrument that screens potential patients for admission to a depression clinic. If the instrument is accurate in assessing the depression of these individuals, then only depressed people should result in the
later clinic sample. Note that in this instance the test functioned exactly as it should, and was
appropriate and accurate in its assessment of the individuals involved. Therefore, a testing
instrument may not necessarily be invalid, as a direct interpretation of a low validity coefficient might otherwise suggest.
Ultimately, researchers should strive for reliable and valid instrumentation in their
investigations. This goal can be accomplished in part by a push for quality item writing, an
insistence on reporting reliability data across studies, sound theoretical bases for construct
measurement and accurate operationalization of the constructs at hand. This objective imparts
a direct responsibility on behalf of all examiners in a given field. That is, it is essential for
researchers and test administrators to actively measure the reliability and validity of instrument
scores over populations and time. The continual nature of both these processes should not be underestimated. Moreover, this information must be made accessible outside of the quantitative domain in order to facilitate the understanding and sharing
of this knowledge. Without credible instrumentation that is monitored and measured over time,
research results become meaningless. Further, until this knowledge and insistence on reporting
is successfully integrated into the mainstream, there will continue to be visible shortcomings in social scientific research and measurement.
References
Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory.
DeVellis, R. F. (1991). Scale development: Theory and applications (Applied Social Research Methods Series). Newbury Park, CA: Sage.
Gregory, R. J. (1992). Psychological testing: History, principles and applications. Boston: Allyn and Bacon.
Mehrens, W. A., & Lehman, I. J. (1991). Measurement and evaluation in education and psychology (4th ed.). Orlando, FL: Holt, Rinehart and Winston.
Nunnally, J.C., & Bernstein, I.H. (1994). Psychometric Theory (3rd ed.). New York: McGraw-Hill.
Traub, R. E. (1994). Reliability for the social sciences: Theory and applications (Vol. 3). Thousand Oaks, CA: Sage.