

Running head: INSTRUMENT RELIABILITY AND VALIDITY

Instrument Reliability and Validity:

Introductory Concepts and Measures

Amanda Jane Fairchild

James Madison University



Instrument Reliability and Validity: Introductory Concepts and Measures

Instrument validity and reliability have often been both misunderstood and under

emphasized in the social scientific literature. In truth, these phenomena lie at the heart of

competent and effective study. How productive can one's research be if the instrument that

he or she implemented does not measure what it purports to gauge? How legitimate, or

justifiable, is research that is based on inconsistent instrumentation? What constitutes an

inconsistent instrument? What constitutes a valid instrument? What are the implications of both

proper and improper testing? This contribution should, at a minimum, provide answers to

these and other related questions in the field of measurement and instrument design. Further,

the content should impart a sense of responsibility to academics and professionals to address

the issues of reliability and validity in their respective fields of study.

Reliability

DeVellis (1991) defines reliability as "the proportion of variance attributable to the true

score of the latent variable". The obscurity of this definition provides a perfect example of why

reliability, as a measurement concept, has been unable to effectively penetrate the general

research public thus far. In fact, the concept of reliability often goes unrecognized outside of the

measurement literature. Across disciplines, competent researchers not only fail to report

reliability of their measures (Henson, 2001; Thompson, 1999), but also fall short of grasping the

inextricable link between scale reliability and effective research. At best, measurement error

affects the ability to find significant results in one’s data. At worst, measurement error can

significantly damage the interpretability of scores or the function of a testing instrument. This

gap between well-founded theoretical ideas and measurement instruments with correspondingly sound
grounding is of particular concern. A first step towards remedying this split and

integrating the two disciplines involves bringing the appropriate quantitative methodology to the

forefront of general understanding, as the inclusion of reliability analysis in scientific study is as

essential as the studies themselves (Thompson, 1999).



What is Reliability?

Reliability involves the consistency, or reproducibility, of test scores. That is, the degree

to which one can expect relatively constant deviation scores of individuals across testing

situations on the same, or parallel, testing instruments. This property is not a static function

of the test. Rather, reliability estimates change with different populations (i.e. population

samples) and as a function of the error involved. These facts underscore the importance of

consistently reporting reliability estimates for each administration of an instrument, as test

samples, or subject populations, are rarely the same across situations and in different research

settings. More important to understand is that reliability estimates are a function of the test

scores yielded from an instrument, not the test itself (Thompson, 1999). Accordingly, reliability

estimates should be considered based upon the various sources of measurement error that will

be involved in test administration (Crocker & Algina, 1986). In order to fully understand

reliability then, it is first necessary to develop a clearer picture of measurement error and

test scores.

Measurement Error

In any given testing situation, there are two types of error present: systematic and

random error. Systematic error is arguably the more insidious of the two, because it does not necessarily
affect the consistency of test scores; rather, it compromises the utility, or validity, of the test.

Systematic error refers to those errors that consistently affect an individual's observed score.

This may be a function of the individual themselves (e.g. a personality attribute, such as

forgetfulness, or a quality, like fatigue) or a function of the measure. For instance, a test could

reliably be measuring a test taker’s level of depression, even though it was intended to gauge

anxiety. Regardless, systematic error reflects the measurement of something other than the

intended construct. In contrast, random error refers to that error which affects individuals'

scores by pure chance. Consider an examinee who is ill, or whose family member was placed

in a hospital the night before; his or her observed score on the measure will certainly be less

reflective of his or her true score on a construct in comparison to other test takers. These random

errors directly affect the reliability of a measure, as well as impacting the utility of the instrument.

Components of Individual Scores

An examinee’s true score is never really known. Rather, it is simply a theoretical

construct that represents the average of an individual's observed scores over an infinite number of testings (Crocker &

Algina, 1986). Therefore, researchers can only infer information about a true score from the

examinee’s observed score. Classical test theory indicates that observed test scores are

composed of both an individual's true score on an instrument and random error in testing

(Spearman, 1907, 1913, as cited in Crocker & Algina, 1986). Theoretically, error is considered

the disparity between one’s true score on a construct, as defined by the measure, and his or her

observed score, the result of his or her responses to the test items. This testing error should vary
randomly over time, and for a given individual the error should have a mean of
zero across repeated testings (Crocker & Algina, 1986; Gregory, 1992). Similarly, truly random error values should not
correlate with one another, or with anything else. Evidence of such a correlation may indicate

the presence of systematic error, which would need to be addressed in the testing instrument

(Crocker & Algina, 1986).
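In the standard notation of classical test theory, the assumptions described above can be summarized compactly (see Crocker & Algina, 1986):

```latex
X = T + E, \qquad E(E) = 0, \qquad \rho_{TE} = 0, \qquad \rho_{E_1 E_2} = 0,
\qquad \text{and reliability } \rho_{XX'} = \frac{\sigma^2_T}{\sigma^2_X}
```

where X is the observed score, T the true score, and E random error; the final expression restates DeVellis's definition of reliability as the proportion of observed-score variance attributable to true scores.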

Standard Error of Measurement

The standard error of measurement (SEM) is the standard deviation of examinees' error scores; conceptually, it describes the spread of the normal distribution of observed scores each examinee would produce over repeated testings (Crocker & Algina, 1986). This element of measurement is important in that it is used to create a confidence interval around an individual's true score. Such an interval, for example the observed score (X) ± 1.96 SEM for an approximate 95% interval, delineates a band within which the true score is likely to fall, as the actual true score of an examinee is never really known. In other words, by calculating the standard error of measurement one is able to get an idea of how accurately a set of observed scores reflects the examinees' true scores on a construct. For example, a confidence interval might indicate, with 95% certainty, that an examinee's true score on a math test falls within about two SEMs of his or her observed score of 83 (e.g., 83 ± 1.96 × 2.32, for a calculated SEM of 2.32) (Crocker & Algina, 1986).
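As a computational illustration, the SEM can be obtained from the observed-score standard deviation and a reliability estimate via SEM = s√(1 − ρ). The following sketch uses hypothetical values chosen to reproduce the SEM of 2.32 above; it is an illustration, not a prescribed procedure:

```python
import math

def standard_error_of_measurement(sd_observed: float, reliability: float) -> float:
    """Classical test theory: SEM = s_X * sqrt(1 - reliability)."""
    return sd_observed * math.sqrt(1.0 - reliability)

# Hypothetical math test: observed-score SD = 7.34, reliability estimate = .90
sem = standard_error_of_measurement(7.34, 0.90)          # about 2.32
observed_score = 83
# Approximate 95% confidence band for the true score
low, high = observed_score - 1.96 * sem, observed_score + 1.96 * sem
print(f"SEM = {sem:.2f}; 95% band ~ [{low:.1f}, {high:.1f}]")
```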

Estimating Reliability

There are many methods of estimating reliability, each tapping a different dimension
of reliability. The application of these methods will vary depending on your testing situation

and how you plan to use the test results. Firstly, there are instances in which multiple testing

sessions of the same instrument will occur. Consider the clinician who is interested in

measuring the psychological growth of his clients over time (e.g. on a depression scale). In this

example, coefficients of stability most readily apply. A measure of reliability called the test-

retest method should be employed, where test proctors administer the same test to a set of

examinees more than once. The test should be administered, a sufficient period of time should

elapse, and the test should then be administered once again. Upon completion of the second

administration, one is able to calculate the correlation coefficient between scores on the two

measures, which will yield information on how stable the test results (i.e. observed scores) are

over time (Crocker & Algina, 1986; Gregory, 1992). A second type of reliability estimate is the

alternate form method. This variant of the test-retest technique evaluates the consistency of alternate forms

of a single test (DeVellis, 1991). This approach is particularly useful in the context of

standardized testing procedures, where it is ideal to have multiple, and equivalent, forms of the

same test. In this method, participants take one form of the test, a period of time elapses, and

they then take a second form of the test. Once results are gathered from both sessions, the

correlation coefficient between the two sets of scores is calculated. In this technique, a

coefficient of equivalence is yielded (Crocker & Algina, 1986; DeVellis, 1991; Gregory, 1992).
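Computationally, both the coefficient of stability and the coefficient of equivalence reduce to a Pearson correlation between two sets of scores. A minimal sketch with hypothetical data:

```python
import numpy as np

# Hypothetical scores for the same six examinees on two administrations
# (test-retest) or on two parallel forms (alternate forms)
scores_first  = np.array([12, 18, 25, 30, 22, 16])
scores_second = np.array([14, 17, 27, 29, 20, 18])

# Coefficient of stability or equivalence
reliability = np.corrcoef(scores_first, scores_second)[0, 1]
print(f"reliability estimate r = {reliability:.2f}")
```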

Other reliability estimates require only one test administration. These are often

referred to as internal consistency measures of reliability. These methods are concerned with

the consistency of scores within the test itself, or rather consistency of scores among the items

(Crocker & Algina,1986; DeVellis, 1991). Here, the key is to have a homogenous set of items

that reflects a unified underlying construct. High reliability estimates of this kind result from
high inter-item correlations among the items or subscales (Crocker & Algina, 1986; DeVellis,

1991; Gregory, 1992; Henson, 2001). The most common method of assessing internal

consistency reliability estimates is through the use of coefficient alpha. Though there are three

different measures of coefficient alpha, the most widely used measure is Cronbach’s coefficient

alpha. Cronbach’s alpha is actually an average of all the possible split-half reliability estimates

of an instrument (Crocker & Algina, 1986; DeVellis, 1991; Gregory, 1992; Henson, 2001). It is

important to note that coefficient alpha is a lower bound on reliability. The two lesser-used

techniques of estimating coefficient alpha are appropriate in limited circumstances. For

example, the Kuder-Richardson 20 (KR-20) is appropriate for use with dichotomously scored items, and
Hoyt's method, which frames reliability estimation as an analysis of variance problem, is useful in certain testing situations

(Crocker & Algina, 1986).
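For illustration, coefficient alpha can be computed directly from an examinee-by-item score matrix as α = k/(k − 1) · (1 − Σ item variances / total-score variance). A minimal sketch with hypothetical Likert-type responses (with dichotomous 0/1 items the same formula gives KR-20):

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """scores: rows = examinees, columns = items.
    alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)."""
    k = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1)
    total_score_variance = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1.0 - item_variances.sum() / total_score_variance)

# Hypothetical responses of six examinees to a five-item scale
responses = np.array([[4, 5, 4, 4, 5],
                      [2, 3, 2, 3, 2],
                      [5, 5, 4, 5, 5],
                      [3, 3, 3, 2, 3],
                      [1, 2, 2, 1, 2],
                      [4, 4, 5, 4, 4]])
print(f"coefficient alpha = {cronbach_alpha(responses):.2f}")
```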

Another type of internal consistency measure that only requires one test administration

is split-half reliability. This technique literally splits an instrument in half, scores each half
separately, and then correlates examinees' scores on the two halves. The
method demands equivalent item representation across the two halves of the instrument; clearly, the
comparison of dissimilar halves will not yield an accurate reliability estimate.

Researchers can ensure equal item representation through the use of random item selection,

matching items from one half to the next, or assigning items to halves based on an even/odd

distribution (Crocker & Algina, 1986). It should be noted that split-half reliability estimates tend to
underestimate reliability, because each half is only half the length of the full
instrument. This underestimation can be addressed by using the Spearman-Brown prophecy formula,

which provides the means necessary to estimate reliability for the full-length test based on your

split-half calculations (Crocker & Algina, 1986; Gregory, 1992).
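A split-half estimate, then, correlates examinees' scores on the two halves and steps the result up to full length with the Spearman-Brown formula, ρ_full = 2r / (1 + r). A minimal sketch with hypothetical half-test scores:

```python
import numpy as np

# Hypothetical odd-item and even-item half scores for eight examinees
odd_half  = np.array([10, 14, 9, 16, 12, 8, 15, 11])
even_half = np.array([11, 13, 10, 15, 12, 9, 14, 12])

r_half = np.corrcoef(odd_half, even_half)[0, 1]   # reliability of a half-length test
r_full = (2 * r_half) / (1 + r_half)              # Spearman-Brown correction to full length
print(f"half-test r = {r_half:.2f}; corrected full-test estimate = {r_full:.2f}")
```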

In the social sciences, acceptable reliability estimates range from .70 to .80 (Nunnally &

Bernstein, 1994). However, research in the physical sciences typically demands more rigorous

reliability standards, as the constructs involved are more concrete and easily defined. In both

settings, acceptable reliability estimates should be congruent with the implications of the test

scores. That is, higher stakes testing should have higher standards of instrument reliability

(Nunnally & Bernstein, 1994).

Reliability and Power

Assessing scale reliability is crucial to maximizing power in one's study. Simply put, unreliable scales decrease the statistical power of analyses based on their scores. This is important in many ways. Most notably, as power decreases, larger sample sizes are necessary to find significant results. Observed effect sizes also increase with instrument reliability, because measurement error attenuates the apparent relationships among variables. Additionally, reliable instruments introduce less error into the statistical measurement and resulting analysis. Still, significant results may well be meaningless if the instrument is faulty (DeVellis, 1991).

Factors that Affect Reliability

Low internal consistency estimates are often the result of poorly written items or an

excessively broad content area of measure (Crocker & Algina, 1986). However, other factors can also affect the reliability coefficient: namely, the homogeneity of your testing sample, imposed time limits in your testing situation, item difficulty, and the length of your testing instrument (Crocker & Algina, 1986; Mehrens & Lehman, 1991; DeVellis, 1991; Gregory, 1992).

Group homogeneity is particularly influential when you are trying to apply a norm-referenced

test to a homogenous testing sample. In such circumstances, the restriction of range of your

testing group (i.e. low variability) translates into a smaller proportion of variance explained by

your testing instrument, ultimately deflating the reliability coefficient. It is essential to bear in

mind the intended use of your instrument when considering these circumstances and deciding

how to use an instrument (Crocker & Algina, 1986; Mehrens & Lehman, 1991; Gregory, 1992).

Imposed time constraints in a testing situation pose a different type of problem. That is,

time limits ultimately affect a test taker’s ability to fully answer questions or to complete an

instrument. As a result, variance in test taker ability to work at a specific rate becomes

enmeshed in that person’s score variance. Ultimately, test takers that work at similar rates have

higher degrees of variance correspondence, artificially inflating the reliability of the testing

instrument. Clearly, this situation becomes problematic when the construct that an instrument

intends to measure has nothing to do with speed competency (Crocker & Algina, 1986;

Mehrens & Lehman, 1991; Gregory, 1992).

The relationship between reliability and item difficulty addresses a variability issue once

again. Ultimately, if the testing instrument has little to no variability in the questions (i.e. either

items are all too difficult or too easy), the reliability of scores will be affected. Aside from this,

reliability estimates can also be artificially deflated if the test has too many difficult items, as they

promote uneducated guesses (Crocker & Algina, 1986; Mehrens & Lehman, 1991).

Lastly, test length also factors into the reliability estimate. Simply, longer tests yield

higher estimates of reliability. This phenomenon can be best explained through an examination

of the Spearman-Brown prophecy equation, which indicates that as the number of items increases, the reliability estimate increases as well. However, one must weigh the reliability gains earned in such situations, as infinitely long tests are not necessarily desirable. For instance, if you have an 80-item testing instrument with an internal consistency reliability coefficient of .78 and the Spearman-Brown prophecy indicates that your reliability estimate will increase to roughly .82 if you add an additional 25 items, you may decide that the slightly lower reliability estimate is more desirable than an excessively long instrument (Crocker & Algina, 1986; Mehrens & Lehman, 1991; Gregory, 1992).
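The trade-off described above can be checked with the general form of the Spearman-Brown prophecy, ρ_new = nρ / (1 + (n − 1)ρ), where n is the factor by which the test is lengthened. A sketch reproducing the hypothetical 80-item example:

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Projected reliability when a test is lengthened by `length_factor`,
    assuming the added items are parallel to the existing ones."""
    n = length_factor
    return (n * reliability) / (1 + (n - 1) * reliability)

# 80-item instrument with reliability .78, lengthened to 105 items
projected = spearman_brown(0.78, 105 / 80)
print(f"projected reliability = {projected:.2f}")   # roughly .82
```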



Reliability and Criterion-Referenced Tests

Reliability plays a different role in criterion-referenced testing situations. Namely, in

these circumstances, researchers or test administrators are interested in how individuals score

with reference to set standards of performance, not with reference to other individuals (i.e.,

norm-referenced testing). This situation does not necessarily demand high estimates of

reliability in the classical sense. In fact, it may be more desirable to have individuals all do well,

which would yield low variability. Therefore, the invariance of scores is acceptable as long as

the measure adequately represents the testing domain (a property that can be established via the content validity procedures discussed later in this paper).

Measures of reliability in this context focus on the “precision of the decision” or

the “precision of the score” (Mehrens & Lehman, 1991, p. 262), rather than the consistency of

individual scores. That is, test administrators are interested in whether the pass/fail, or

satisfactory/unsatisfactory, decisions made about examinees are consistent across multiple

testing administrations, or parallel forms of the instrument. In other words, individuals with the

same amount of knowledge or skill set should be judged in an equivalent manner. Accordingly,

classical measures of reliability are not relevant in this circumstance. Instead, these situations

call for the calculation of indices of consistency, which compute the “proportion of examinees for

which the decisions are the same [across different measures]” (Traub, 1994, p.142). Traub

illustrates this notion with a matrix, where participant scores for tests one and two, or for test

administrations one and two, are given:

                                Test #2
                         Fail              Pass
   Test #1    Pass       Pass/Fail         Pass/Pass
              Fail       Fail/Fail         Fail/Pass

Test administrators are seeking to make the same decisions for participants across

instruments or sessions. The Pass/Pass and Fail/Fail cells of the table illustrate this consistency. This index of

consistency has received mild criticism though, as there is always a proportion of chance

consistency. In fact, your reliability index will never be zero, even if the instrument is unreliable,

due to the simple rules of probability. However, there is a correction formula that addresses this

problem: coefficient kappa. Researchers may decide whether or not implementation of this

latter index is necessary through evaluating the shapes of the score distributions, assessing the

correlation between scores on the respective tests, and estimating the magnitude of the cut-off

scores. Regardless of which index is ultimately used for interpretation, the literature indicates

that it is sound practice to report both coefficients (Traub, 1994).
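Both indices can be computed directly from the 2 × 2 decision table above. A minimal sketch with hypothetical decision counts:

```python
import numpy as np

# Hypothetical pass/fail decisions for 100 examinees:
# rows = test 1 (pass, fail); columns = test 2 (pass, fail)
decisions = np.array([[45, 10],
                      [ 8, 37]])

n = decisions.sum()
p_observed = np.trace(decisions) / n                    # proportion of consistent decisions
row_margins = decisions.sum(axis=1) / n
col_margins = decisions.sum(axis=0) / n
p_chance = (row_margins * col_margins).sum()            # consistency expected by chance alone
kappa = (p_observed - p_chance) / (1 - p_chance)        # chance-corrected coefficient kappa
print(f"p_o = {p_observed:.2f}, kappa = {kappa:.2f}")
```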

For those testing situations that do not allow for multiple administrations of an

instrument, there are indices of consistency available for use with one test administration.

However, these indices normally entail more complicated computation. Some examples of

these methods are: the Subkoviak index of decision consistency (which is appropriate for use

with dichotomously scored items), linear regression estimates and Huynh’s estimate of decision

consistency. Please refer to Crocker and Algina (1986) for an in-depth analysis of these issues.

Validity

Validity has been defined as “the extent to which [a test] measures what it claims to

measure” (Gregory, 1992, p.117). The focus here is not necessarily on scores or items, but

rather inferences made from the instrument. That is, the behavioral inferences that one can

extrapolate from test scores are of immediate focus. In order to be valid, the inferences made

from scores need to be “appropriate, meaningful, and useful” (Gregory, 1992, p. 117). These

distinctions illuminate the inextricable link between validity and reliability. For example, a testing

instrument can reliably measure something other than the supposed construct, but an unreliable

measure cannot be valid (Crocker & Algina, 1986; Gregory, 1992). Violations of instrument

validity severely impact the usefulness of a testing instrument. In some ways, inadequate validity carries even more serious consequences for an instrument than inadequate reliability, both because validity is a comprehensive construct that cannot be definitively captured in any one statistic and because this testing property is often even less well understood than reliability (Crocker & Algina,

1986; Gregory, 1992). Effective validity studies not only demand the integration of multiple

sources of evidence, but also must continually take place over time. That is, a measure cannot

be deemed valid in a simple instance of study. Rather, multiple studies must be implemented

over different samples, and the collection of validity evidence must cover specified areas

(Crocker & Algina, 1986; Gregory, 1992; Messick, 1995). Moreover, in recent years

researchers have expanded the understanding of validity to comprise more dimensionality than

previously recognized. Specifically, Messick (1995) has criticized traditional approaches to

validity, asserting that researchers have narrowly focused attention on compartmentalized

distinctions at the expense of fostering the development of a unified concept. Alternatively, he

proposes an integrative approach to instrument validity that not only focuses on conventional test
score issues, but also emphasizes the significance of score implications and their social use.

Still, this unified concept of validity is best understood and examined within the context of its

four discrete facets: content validity, construct validity, criterion validity and consequential

validity (Messick, 1995).

Content Validity

Content validity should not be confused with face validity, a non-statistical assessment of

whether or not a test appears to be “valid”. This concept is really not an index of validity at all.

Rather, it simply addresses a measure's surface acceptability to laypersons (Gregory, 1992). In contrast,

content validity considers whether or not the items on a given test accurately reflect the

theoretical domain of the latent construct it claims to measure. Items need to effectively act as

a representative sample of all the possible questions that could have been derived from the

construct (Crocker & Algina, 1986; DeVellis, 1991; Gregory, 1992). This concept is easy to

understand in terms of criterion-referenced measures. Consider a high school chemistry test on

the periodic table. There are a finite number of elements in this population from which

questions can be drawn. Further, there are different groupings of elements that may be

addressed. Finally, there are a specified number of symbols that students can be tested on.

Thus, an accurate, or valid, testing measure would sample these criteria in such a way that

facets of knowledge in these areas would be equally represented. Application of this notion to

the social sciences is slightly more convoluted. This stems from the fact that the theories and

constructs involved are innately intangible (e.g. anxiety, intelligence, depression, etc.). In other

words, their measurement depends on the operationalization of variables deemed to be

representative of the domain. In this respect, there is no clean set of exhaustive measures that

represents any given construct. Rather, there exists an almost infinite sampling domain from

which questions can be drawn. In this instance, content validity becomes more of a qualitative

judgment than an absolute definitive measure (Crocker & Algina, 1986; DeVellis, 1991; Gregory,

1992). Crocker and Algina (1986) have shared a sequential process that is relevant to both of

these circumstances. They suggest employing the following four steps to effectively evaluate

content validity: 1) identify and outline the domain of interest, 2) gather resident domain experts,

3) develop consistent matching methodology, and 4) analyze results from the matching task. In

this example, matching refers to the pairing of each item on an instrument with an element of

the construct’s theoretical domain. In the academic arena, the testing content is usually defined

via course objectives. Matching in this instance might involve pairing objectives with test items,

particularly in the context of achievement tests (Crocker & Algina, 1986). A number of different

issues arise in response to Crocker and Algina's methodology. Two questions are particularly pertinent: should researchers weight objectives in order of their importance in the theoretical domain, and how should the results of the matching process be interpreted? Researchers are

divided on the former consideration. Some (e.g. Katz, as cited in Crocker & Algina, 1986)

maintain that it is necessary to weight objectives in order of their pertinence to the theoretical

domain. However, most others argue that equal representation of theoretical objectives is not

damaging. With reference to summarizing the information gathered in the matching task, this

truly becomes a subjective judgment on the part of the researcher. There are no hard and fast

statistics that indicate whether or not one’s instrument is ‘valid enough’. However, there are

definite quantitative tools available to aid in this decision-making. Crocker and Algina (1986)

delineate these measures. Two particularly useful pieces of information researchers may

choose to implement are the percentage of successfully matched items and the percentage of

theoretical domain objectives not addressed in the instrument. It is necessary to consider these

and other measures in full when deciding on the validity of an instrument. Interpretation of only

one index is imprudent and not recommended. Some common problems experienced with

assessing content validity are inadequate representation of domain in test objectives and ethnic,

racial, and gender biases. Crocker and Algina (1986) provide an excellent example of the issue

of testing bias. Consider a teacher who is interested in measuring his or her students' math ability.

He or she provides a math test with word problems presented in English. This measure would

not accurately reflect the quantitative ability of non-English speakers in the classroom, as their English proficiency directly affects their ability to work the problems. This problem may

have been averted with a clearer specification of the testing domain. In this example, the

teacher was really measuring the ability of students to execute word problems presented in

English, not only their ability to do math.
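The two summary indices suggested earlier in this section (the percentage of matched items and the percentage of unaddressed objectives) are simple proportions. A minimal sketch with hypothetical panel results:

```python
# Hypothetical outcome of a content-validity matching task
items_matched, total_items = 38, 45            # items the expert panel matched to an objective
objectives_covered, total_objectives = 12, 15  # objectives addressed by at least one item

pct_items_matched = 100 * items_matched / total_items
pct_objectives_missed = 100 * (total_objectives - objectives_covered) / total_objectives
print(f"{pct_items_matched:.0f}% of items matched; "
      f"{pct_objectives_missed:.0f}% of domain objectives not addressed")
```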

Construct Validity

DeVellis explains that the construct validity of a measure “is directly concerned with the

theoretical relationship of a variable (e.g. a score on some scale) to other variables. It is the

extent to which a measure ‘behaves’ the way that the construct it purports to measure should

behave with regard to established measures of other constructs” (1991, p. 46). As clear as this

sounds, it proves a bit more difficult in practice as constructs are not readily observable. Recall

that one must develop items, or variables, that act as representations of the construct and serve

to measure examinee scores with respect to the paradigm. This is facilitated through the

delineation of the construct itself. That is, a construct needs to be both operationalized and

syntactically defined in order to measure it effectively (Benson, 1998; Crocker & Algina, 1986;

Gregory, 1992). The operationalizing of the construct involves developing a series of

measurable behaviors or attributes that are hypothesized to correspond to the latent construct.

Defining the construct syntactically involves establishing hypothesized relationships between

the construct of interest and other related constructs or behaviors (Benson, 1998; Crocker &

Algina, 1986; Gregory, 1992). An example may provide more clarity on this issue. Suppose

that a researcher is interested in the construct of anxiety. They first may choose to establish

behaviors that are characteristic of this state (e.g. heart palpitations, sweaty palms, intrusive

thoughts, etc), so that it becomes measurable. They then formulate hypotheses about how

anxiety should relate to other criteria. For instance, researchers may claim that anxiety should

have a high correlation with depression and socially avoidant behaviors. Crocker and Algina

(1986) provide a series of steps to follow when pursuing a construct validation study: 1)

generate hypotheses of how the construct should relate to both other constructs of interest and

relevant group differences, 2) choose a measure that adequately represents the construct of

interest, 3) pursue empirical study to examine the relationships hypothesized, and 4) analyze

gathered data to check hypothesized relationships and to assess whether or not alternative

hypotheses could explain the relationships found between the variables. Benson (1998)

suggests that not all construct validation studies are the same though. Rather, she considers

that there are both weak and strong approaches to this practice. She contends that weak

attempts at establishing construct validity only focus on gathering empirical research on both the

construct of interest and other related constructs. In contrast, she suggests that strong attempts

at proving construct validity instead concentrate on the theoretical (i.e. not exclusively empirical)

foundations of the constructs involved. Through an examination of the theory, hypothesized



relationships naturally evolve. Further, Benson affirms Nunnally’s validation practice which

suggests three crucial steps: 1) a substantive component, 2) a structural component, and 3) an

external component. Step one in this series is relatively similar to step two in Crocker and

Algina’s methodology. Step two here focuses on the relationships between the items, or

observables, themselves. That is, the researcher examines the relationships between the items

on the measure to see how they are relating to both the construct and other items on the scales.

Finally, step three parallels step number one in Crocker and Algina’s conceptualization. The

key difference between these views is that Benson emphasizes item-level analysis of the scale,
a component that is not utilized in the Crocker and Algina approach.

Statistical Procedures Involved in Establishing Construct Validity

No cut-off exists for what constitutes an acceptable correlation between the construct
and related measures in order to claim construct validity. Once again, this becomes a qualitative

judgment that the researcher must make. However, significance tests of the correlations

themselves or proportion of variance explained as a function of the other measure may aid in

this decision-making process (Crocker & Algina, 1986; DeVellis, 1991; Gregory, 1992). Further,
investigation of the inter-item correlations on the scale and of item-level response patterns
may aid in learning more about the function of the items themselves, with respect to
one another and the latent construct. In addition, confirming hypothesized group differences will
bolster the argument that the measure relates to the construct as theorized. Exploratory factor

analysis techniques facilitate the derivation of clearer structure evidence by statistically

illustrating how items load on particular factors, and confirmatory factor analysis techniques

should be employed to establish quality of model fit. Correlational evidence between the

underlying constructs may ultimately be studied within the context of a structural equation

model, which derives this information from the measures themselves (Benson, 1998).
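As a starting point for the item-level evidence described above, the inter-item correlation matrix can be inspected directly; exploratory and confirmatory factor analyses would then typically be run with dedicated statistical software. A minimal sketch with hypothetical item responses:

```python
import numpy as np

# Hypothetical responses of eight examinees to four items intended to tap one construct
items = np.array([[4, 5, 4, 5],
                  [2, 2, 3, 2],
                  [5, 4, 5, 5],
                  [3, 3, 2, 3],
                  [1, 2, 1, 2],
                  [4, 4, 4, 3],
                  [2, 3, 2, 2],
                  [5, 5, 4, 4]])

# Inter-item correlation matrix: uniformly positive, moderate-to-high values
# support the claim that the items reflect a single underlying construct.
inter_item_r = np.corrcoef(items, rowvar=False)
print(np.round(inter_item_r, 2))
```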

Subcomponents of Construct Validity



Different validity coefficients are yielded as a result of investigations into construct

validity. When comparing the measured construct to other constructs based on hypothesized

relationships, one expects to see either convergent or discriminant validity. That is, convergent

validity coefficients should arise when considering two constructs hypothesized to be related.

Further, the correlation between the two should be moderately high in order to contend test

validity, as you want the measure to be correlated to others scales purporting to measure the

same, or related, thing. Again, though there is no explicit rule that states what correlation

coefficient is acceptable, you should be looking for relationships around .50 and above (Crocker

& Algina, 1986). In contrast, when you are looking at the relationship between your scale of

interest and a construct that is not hypothesized to be related (i.e. should be unrelated), you aim

to find discriminant validity coefficients. In other words, you want little to no correlation

between your measure and unrelated constructs. For instance, if there is a high degree of

overlap and correlation between your measure that is supposed to measure depression and an

index of shoe size, you should be alerted to the fact that your measure is not functioning as

intended. Or rather, that your measure is not measuring the underlying construct you claim it

does (Crocker & Algina, 1986; DeVellis, 1991; Gregory, 1992).
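In practice, convergent and discriminant evidence is often summarized as a small set of correlations between the new scale and established measures. A minimal sketch with hypothetical scale totals (the variable names and values are invented for illustration):

```python
import numpy as np

# Hypothetical total scores for six examinees
anxiety_scale    = np.array([22, 35, 18, 40, 27, 31])   # new measure of interest
depression_scale = np.array([25, 30, 17, 38, 20, 33])   # related construct
shoe_size        = np.array([10, 10, 9, 9, 12, 9])      # theoretically unrelated variable

r_convergent   = np.corrcoef(anxiety_scale, depression_scale)[0, 1]  # should be moderately high
r_discriminant = np.corrcoef(anxiety_scale, shoe_size)[0, 1]         # should be near zero
print(f"convergent r = {r_convergent:.2f}, discriminant r = {r_discriminant:.2f}")
```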

Criterion Validity

Criterion validity refers to the ability to draw accurate inferences from test scores to a

related behavioral criterion of interest. This validity measure can be pursued in one of two

contexts: predictive validity or concurrent validity. In predictive validity, researchers are

interested in assessing the predictive utility of an instrument. For example, the purpose of

standardized tests such as the SAT and GRE is to predict performance outcomes in college

and graduate school respectively. The test scores ultimately become the basis from which

decisions are made. This predictive validation normally occurs in the following sequence: test

scores of an instrument are gathered, stored, and then compared to a criterion of interest later

on (Crocker & Algina, 1986; Gregory, 1992). Researchers are looking for a high degree of

correlation between the criterion variable and scores on the testing instrument, in order to assert

good criterion validity. Validity coefficients are ultimately derived from the correlation between

these components. From this, one can calculate a coefficient of determination for the measures

by squaring the validity coefficient. This informs the researcher what percentage of variance in

the criterion variable is accounted for by the testing measure, or predictor variable (Crocker &

Algina, 1986; Gregory, 1992). However, even a strong correlation between the two variables does not indicate that one is simply a stand-in for the other. More specifically, a high correlation does not mean that the testing instrument measures the same construct as the criterion variable (DeVellis, 1991).
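As an illustration, the validity coefficient and its square can be computed from paired predictor and criterion scores. A minimal sketch with hypothetical admissions-test scores and later first-year grades:

```python
import numpy as np

# Hypothetical predictor (test) scores and criterion (first-year GPA) values
test_scores = np.array([1100, 1250, 1380, 990, 1310, 1180, 1420, 1050])
first_year_gpa = np.array([2.9, 3.2, 3.6, 2.5, 3.4, 3.0, 3.7, 2.8])

validity = np.corrcoef(test_scores, first_year_gpa)[0, 1]  # criterion validity coefficient
determination = validity ** 2                              # proportion of criterion variance explained
print(f"validity coefficient = {validity:.2f}; variance explained = {determination:.0%}")
```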

Though concurrent validity also looks at the correlation between criterion and test

scores, the two measures are taken one right after the other in this instance. In other words,

there is no extended time period between the testing measure and the criterion measure, and

the examiner is more interested in the measure’s ability to reflect current ability than any

predictive aptitude. This is illustrated well in multidimensional licensing exams, where there

might be both a written and practical component. For example, if a test taker does well on the

written portion of an exam that measures anatomical knowledge, you would expect them to

perform reasonably well on a surgical exam of the same area. If the two skill measures were

not highly correlated, you may choose to further investigate the dynamics of your testing

instrument (Crocker & Algina, 1986; DeVellis, 1991; Gregory, 1992). Measures with established

concurrent validity ultimately have the ability to act as “shortcut[s] for obtaining information that might otherwise require the extended investment of professional time” (Gregory, 1992, p. 122). That is, validated measures may be used as precursory diagnostic tools in professional settings; for example, they may serve clinical staff by providing initial diagnostic information that would otherwise be difficult to obtain quickly.

Measurement Error and Criterion-Related Validity

The Standard Error of Estimate (SEE) reflects how ‘off’ you are, on average, in your

predictions of the criterion variable from the predictor variable. This index is derived from the

validity coefficient of a measure and the standard deviations of the scores on the criterion

measure, and parallels the Standard Error of Measurement as it relates to reliability. In this

context, the SEE reflects the degree to which a measure is not wholly valid. That is, it illustrates

the instrument’s inaccuracy in predicting scores on a related criterion measure (Crocker &

Algina, 1986).
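The SEE follows the form SEE = s_Y √(1 − r²), where s_Y is the standard deviation of the criterion scores and r the validity coefficient. A minimal sketch with hypothetical values:

```python
import math

def standard_error_of_estimate(sd_criterion: float, validity: float) -> float:
    """SEE = s_Y * sqrt(1 - r^2): the average error made when predicting
    criterion scores from test scores."""
    return sd_criterion * math.sqrt(1.0 - validity ** 2)

# Hypothetical criterion SD of 0.40 GPA points and validity coefficient of .60
see = standard_error_of_estimate(0.40, 0.60)
print(f"SEE = {see:.2f} GPA points")
```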

Clearly, one needs to begin with reliable measures before attempting to validate them in

this manner. Likewise, criterion variables also have to maintain a high degree of reliability. The

subsequent validity of a measure is actually restricted by these two reliabilities. That is, the validity of a predictive measure is bounded by a precise function of the reliabilities of the two measures involved in its study. Please see Crocker and Algina (1986) for a more thorough treatment of this relationship (see also Gregory, 1992).
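The precise relationship alluded to above is usually expressed through the classical correction for attenuation, under which the observed validity coefficient cannot exceed the square root of the product of the two reliabilities. A minimal sketch with hypothetical values:

```python
import math

# Hypothetical observed predictor-criterion correlation and reliabilities
r_observed = 0.45
reliability_test, reliability_criterion = 0.80, 0.70

ceiling = math.sqrt(reliability_test * reliability_criterion)  # maximum attainable validity
r_disattenuated = r_observed / ceiling                         # estimated true-score correlation
print(f"validity ceiling = {ceiling:.2f}; disattenuated r = {r_disattenuated:.2f}")
```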

Further problems associated with criterion-related validity include restriction of range and

sample size. Restriction of range artificially deflates the correlation coefficient yielded between

the two measures, and inadequate sample size has ramifications for both power and accurate

validity estimations. More specifically, sample sizes below 200 individuals are often incapable

of representing accurate population validity coefficients. With an increase in sample size, there

is a corresponding increase in the accuracy of validity prediction (Crocker & Algina, 1986).

Consequential Validity

In recent years, more emphasis has been placed on the social utility and bias of

interpretation in test scores. Messick (1995) has been at the forefront of this push for the

consideration of consequential validity within the context of a measure’s construct validity.

Consequential validity refers to the notion that the social consequences of test scores and their
subsequent interpretation should be evaluated in light of not only the original intention of the test, but also
cultural norms (Messick, 1995). This idea points to both the intended and unintended

consequences of a measure, which may be either positive or negative. Consider the

deleterious effects of an invalid instrument that creates unmerited distinctions between racial or

ethnic groups on an academic measure. It is critical that distinctions of this kind are supported

with valid measurement instruments, particularly if they go against the grain of social norms.

This concern is wholly based in the scale’s construct validity. Moreover, any inferences made

from test scores are contingent upon the construct validity of the scale. Therefore adequate

evidence, as collected through all four facets of validity study, is crucial in developing an

argument for the value of test scores (Messick, 1995).

General Factors That Affect Validity

An integral issue at hand in establishing validity coefficients is the actual relationship

between the two variables, or constructs, that you are interested in. Beyond this, comparable

measurement issues that affected the nature of reliability coefficients also affect validity

coefficients. That is, the more heterogeneous the groups are, the higher the correlations

between two measures will ultimately be. This phenomenon is most readily observable in

samples with a restriction of range problem. When the data range is limited, the scores become

more homogeneous and the resulting correlation coefficients are artificially deflated. An

important matter of note is that the more effective an instrument is at screening individuals for a

particular purpose, the less heterogeneous the resulting sample will be, which in turn results in a

smaller validity coefficient. Consider a diagnostic instrument that screens potential patients for

depression before receiving services at an inpatient clinic. If the measure is accurate in

assessing the depression of these individuals, then only depressed people should result in the

later clinic sample. Note that in this instance the test functioned exactly as it should, and was

appropriate and accurate in its assessment of the individuals involved. Therefore, a low validity coefficient obtained in such a restricted sample does not necessarily mean that the testing instrument is invalid, as a direct interpretation of the coefficient would suggest (Mehrens & Lehman, 1991).

Implications for Testing

Ultimately, researchers should strive for reliable and valid instrumentation in their

investigations. This goal can be accomplished in part by a push for quality item writing, an

insistence on reporting reliability data across studies, sound theoretical bases for construct

measurement and accurate operationalization of the constructs at hand. This objective places
a direct responsibility on all examiners in a given field. That is, it is essential for

researchers and test administrators to actively measure the reliability and validity of instrument

scores over populations and time. The continual nature of both these processes should not be

underestimated or overlooked. Moreover, it is critical for this type of information to be intellectually

accessible outside of the quantitative domain in order to facilitate the understanding and sharing

of this knowledge. Without credible instrumentation that is monitored and measured over time,

research results become meaningless. Further, until this knowledge and insistence on reporting

is successfully integrated into the mainstream, there will continue to be visible shortcomings in

the social scientific methodologies of today.



References

Benson, J. (1998). Developing a strong program of construct validation: A test anxiety example. Educational Measurement: Issues and Practice, 17, 10-17.

Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. Philadelphia: Harcourt Brace Jovanovich College Publishers.

DeVellis, R. F. (1991). Scale development: Theory and applications (Applied Social Research Methods Series, Vol. 26). Newbury Park, CA: Sage.

Gregory, R. J. (1992). Psychological testing: History, principles and applications. Boston: Allyn and Bacon.

Henson, R. K. (2001). Understanding internal consistency reliability estimates: A conceptual primer on coefficient alpha. Measurement and Evaluation in Counseling and Development, 34, 177-188.

Mehrens, W. A., & Lehman, I. J. (1991). Measurement and evaluation in education and psychology (4th ed.). Orlando, FL: Holt, Rinehart and Winston.

Messick, S. (1995). Validity of psychological assessment: Validation of inferences from persons' responses and performances as scientific inquiry into score meaning. American Psychologist, 50(9), 741-749.

Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric theory (3rd ed.). New York: McGraw-Hill.

Thompson, B. (1999, February). Understanding coefficient alpha, really. Paper presented at the annual meeting of the Education Research Exchange, College Station, TX.

Traub, R. E. (1994). Reliability for the social sciences: Theory and applications (Vol. 3). Thousand Oaks, CA: Sage.
