
Reliability and Validity

http://www.creative-wisdom.com/teaching/assessment/reliability.html


Chong Ho Yu, Ph.D.
Conventional views of reliability (AERA et al., 1985)

Temporal stability: This type of reliability is estimated by administering the same form of a test to the same group of examinees on two or more separate occasions (test-retest). On many occasions this approach is not practical because the examinees' behavior could be affected by repeated measurement. For example, examinees might adapt to the test format and thus tend to score higher on later administrations; this is known as the carry-over effect. Hence, careful implementation of the test-retest approach is strongly recommended (Yu, 2005).

Form equivalence: This approach requires two different test forms based on the same content (alternate forms). After the alternate forms have been developed and validated by test equating, they can be administered to different examinees. In high-stakes examinations it is very common to employ this method to preempt cheating: because the two forms contain different items, an examinee who took Form A earlier cannot "help" another student who takes Form B later.

Internal consistency: This type of reliability estimate is computed from scores obtained on a single administration of a test or survey (Cronbach's alpha, KR-20, split-half). Consider this scenario: respondents are asked to rate the statements in an attitude survey about computer anxiety. One statement is: "I feel very negative about computers in general." Another statement is: "I enjoy using computers." People who strongly agree with the first statement should strongly disagree with the second statement, and vice versa. If several respondents rate both statements high, or both low, the responses are said to be inconsistent and patternless. The same principle can be applied to a test: when no pattern is found in the students' responses, the test is probably too difficult and the examinees are simply guessing at random.
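
A minimal computational sketch of coefficient alpha, assuming Python with NumPy and entirely made-up ratings for the computer-anxiety example above (the negatively worded item is reverse-coded before scoring):

```python
# Hypothetical 1-5 ratings from five respondents on three attitude items;
# item 3 is negatively worded and is reverse-coded before the items are summed.
import numpy as np

ratings = np.array([
    [5, 4, 1],
    [4, 5, 2],
    [2, 1, 4],
    [1, 2, 5],
    [4, 4, 2],
], dtype=float)

# Reverse-code the negatively worded third item on a 1-5 scale.
ratings[:, 2] = 6 - ratings[:, 2]

def cronbach_alpha(items: np.ndarray) -> float:
    """Coefficient alpha: k/(k-1) * (1 - sum of item variances / variance of totals)."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

print(f"Cronbach's alpha = {cronbach_alpha(ratings):.2f}")
```

A value near 1 indicates consistent responding across items; values near 0 suggest patternless responding such as random guessing.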

Inter-rater reliability: This is a measure of agreement between two raters, coders, or observers; it is also known as intercoder reliability. This reliability estimate is useful when the subject matter under judgment is highly subjective (e.g., the aesthetic value of a painting or a photo). If the rating scale is continuous (e.g., 0-10), Pearson's product-moment correlation coefficient is suitable. If the scale is ordinal in nature, Spearman's coefficient is the most appropriate choice. For categorical classifications (e.g., pass/fail), Cohen's kappa coefficient should be employed.
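
A minimal sketch of the three statistics just named, assuming Python with SciPy and scikit-learn and hypothetical ratings from two raters:

```python
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import cohen_kappa_score

# Continuous 0-10 scores given by two raters to eight paintings.
rater_a = [8.5, 7.0, 9.0, 6.5, 5.0, 8.0, 7.5, 6.0]
rater_b = [8.0, 7.5, 9.5, 6.0, 5.5, 7.5, 8.0, 6.5]
r, _ = pearsonr(rater_a, rater_b)      # continuous scale
rho, _ = spearmanr(rater_a, rater_b)   # ordinal scale
print(f"Pearson r = {r:.2f}, Spearman rho = {rho:.2f}")

# Pass/fail judgments for the same eight works.
judge_a = ["pass", "pass", "pass", "fail", "fail", "pass", "pass", "fail"]
judge_b = ["pass", "pass", "pass", "fail", "pass", "pass", "pass", "fail"]
print(f"Cohen's kappa = {cohen_kappa_score(judge_a, judge_b):.2f}")
```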

Reliability is a necessary but not sufficient condition for validity. For instance, if the needle of the scale is five pounds away from zero, I always over-report my weight by five pounds. Is the measurement consistent? Yes, but it is consistently wrong! Is the measurement valid? No! (But if it always under-reports my weight by five pounds, I will accept this measurement!)

Performance, portfolio, and responsive evaluations, where the tasks vary substantially from student to student and where multiple tasks may be evaluated simultaneously, are attacked for lacking reliability. One of the difficulties is that there is more than one source of measurement error in performance assessment. For example, the reliability of a writing-skill test score is affected by the raters, the mode of discourse, and several other factors (Parkes, 2000).

Replications as unification: Users may be confused by the diversity of reliability indices. Nevertheless, different types of reliability measures share a common thread: what constitutes a replication of a measurement procedure? (Brennan, 2001) Take internal consistency as an example. This measure is used because it is convenient to compute a reliability index from data collected on a single occasion. However, the ultimate inference should go beyond that single testing occasion to others (Yu, 2005). In other words, any procedure for estimating reliability should attempt to mirror a result based upon full-length replications.

Conventional views of validity (Cronbach, 1971)

Face validity: Face validity simply means that the validity is taken at face value. As a check on face validity, test/survey items are sent to teachers or other subject matter experts to obtain suggestions for modification. Because of its vagueness and subjectivity, psychometricians abandoned this concept long ago. However, outside the measurement arena, face validity has come back in another form. While discussing the validity of a theory, Lacity and Jansen (1994) define validity as making common sense, and being persuasive and seeming right to the reader. For Polkinghorne (1988), the validity of a theory refers to results that have the appearance of truth or reality.

The internal structure of things may not concur with their appearance, and professional knowledge often runs counter to common sense. The criteria of validity in research should therefore go beyond "face," "appearance," and "common sense."

Content validity: In the context of content validity, we draw an inference from the test scores to a larger domain of items similar to those on the test. Thus, content validity is concerned with sample-population representativeness, i.e., the knowledge and skills covered by the test items should be representative of the larger domain of knowledge and skills.

For example, computer literacy includes skills in operating systems, word processing, spreadsheets, databases, graphics, the internet, and many others. However, it is difficult, if not impossible, to administer a test covering all aspects of computing. Therefore, only several tasks are sampled from the universe of computer skills.

Content validity is usually established by content experts. Take computer literacy as an example again. A test of computer literacy should be written or reviewed by computer science professors because computer scientists are assumed to know what is important in their own discipline. At first glance, this approach looks similar to the validation process of face validity, yet there is a subtle difference. In content validity, evidence is obtained by looking for agreement in the judgments of several judges. In short, face validity can be established by one person, but content validity should be checked by a panel, and thus it usually goes hand in hand with inter-rater reliability.

However, this approach has some drawbacks. First, experts tend to take their knowledge for granted and forget how little other people know, so it is not uncommon that tests written by content experts are extremely difficult.

Second, very often content experts fail to identify the learning objectives of a
subject. Take the following question in a philosophy test as an example:

What is the time period of the philosopher Epicurus?


a. 341-270 BC
b. 331-232 BC
c. 280-207 BC
d. None of the above
This type of question tests the ability to memorize historical facts, not the ability to philosophize or to reason logically. The content expert may argue that "historical facts" are important for a student to further understand philosophy.
Let's change the subject to computer science and statistics. Look at the following
two questions:

When was William Gates III, the founder and CEO of Microsoft, born?
a. 1949
b. 1953
c. 1957
d. None of the above

Which of the following statements is true about ANOVA?

a. It was invented by R. A. Fisher in 1914
b. It was invented by R. A. Fisher in 1920
c. It was invented by Karl Pearson in 1920
d. None of the above

Any computer scientist or statistician would be hard pressed to accept that the above questions fulfill content validity. As a matter of fact, the memorization approach is still a common practice among instructors.
Further, sampling knowledge from a larger domain of knowledge involves subjective values. For example, a test regarding art history may include many questions on oil paintings but fewer questions on watercolor paintings and photography, because of the perceived importance of oil paintings in art history.

Content validity is sample-oriented rather than sign-oriented. A behavior is viewed as a sample when it is a subgroup of the same kind of behaviors. On the other hand, a behavior is considered a sign when it is an indicator or a proxy of a construct (Goodenough, 1949). Construct validity and criterion validity, which will be discussed later, are sign-oriented because both of them point to behaviors that are different from those of the test.

Criterion: When the focus of the test is on criterion validity, we draw an inference from test scores to performance. A high score on a valid test indicates that the test taker has met the performance criteria.

Regression analysis can be applied to establish criterion validity: the independent variable serves as the predictor and the dependent variable as the criterion, and the correlation coefficient between them is called the validity coefficient.

For instance, the scores on a driving test by simulation are considered the predictor variable while the scores on the road test are treated as the criterion variable. It is hypothesized that if the tester passes the simulation test, he or she should be a competent driver. In other words, if the simulation test scores predict the road test scores in a regression model, the simulation test is claimed to have a high degree of criterion validity.
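
A minimal sketch of the driving-test example, assuming Python with SciPy and invented simulation and road-test scores; the Pearson correlation serves as the validity coefficient and a simple regression supplies the prediction equation:

```python
import numpy as np
from scipy.stats import linregress, pearsonr

simulation = np.array([62, 70, 75, 80, 85, 90, 94])   # predictor variable
road_test  = np.array([58, 66, 72, 78, 82, 91, 95])   # criterion variable

validity_coefficient, _ = pearsonr(simulation, road_test)
fit = linregress(simulation, road_test)

print(f"Validity coefficient r = {validity_coefficient:.2f}")
print(f"Predicted road score = {fit.intercept:.1f} + {fit.slope:.2f} * simulation score")
```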
In short, criterion validity is about prediction rather than explanation. Prediction is concerned with non-causal or mathematical dependence, whereas explanation pertains to causal or logical dependence. For example, one can predict the weather based on the height of the mercury inside a thermometer; thus, the height of the mercury could satisfy criterion validity as a predictor. However, one cannot explain why the weather changes by the variation of the mercury height. Because of this limitation of criterion validity, an evaluator has to conduct construct validation.

Construct: When construct validity is emphasized, as the name implies, we draw an inference from test scores to a psychological construct. Because it is concerned with abstract and theoretical constructs, construct validity is also known as theoretical construct validity.

According to Hunter and Schmidt (1990), construct validity is a quantitative question rather than a qualitative distinction such as "valid" or "invalid"; it is a matter of degree. Construct validity can be measured by the correlation between the intended independent variable (the construct) and the proxy independent variable (the indicator, or sign) that is actually used.

For example, an evaluator wants to study the relationship between general cognitive ability and job performance. However, the evaluator may not be able to administer a cognitive test to every subject. In this case, he can use a proxy variable such as amount of education as an indirect indicator of cognitive ability. After administering a cognitive test to a portion of the subjects and finding a strong correlation between general cognitive ability and years of schooling, he can use the latter for the larger group because its construct validity has been established.
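
A minimal sketch of this proxy-correlation idea, assuming Python with NumPy and fabricated cognitive-test scores and years of schooling:

```python
import numpy as np

# Hypothetical subsample that took both the cognitive test and reported schooling.
cognitive_score = np.array([95, 102, 110, 118, 125, 130, 138, 142])
years_schooling = np.array([10, 12, 12, 14, 16, 16, 18, 20])

# Correlation between the intended construct and the proxy indicator.
construct_validity = np.corrcoef(cognitive_score, years_schooling)[0, 1]
print(f"Proxy correlation (construct validity evidence) = {construct_validity:.2f}")
```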
Other authors (e.g., Angoff, 1988; Cronbach & Quirk, 1976) argue that construct validity cannot be expressed in a single coefficient; there is no mathematical index of construct validity. Rather, the nature of construct validity is qualitative. There are two types of indicators:

Reflective indicator: the effect of the construct.

Formative indicator: the cause of the construct.

When an indicator is expressed in terms of multiple items of an instrument, factor analysis is utilized for construct validation.
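
A minimal sketch of this use of factor analysis, assuming Python with scikit-learn and a fabricated six-item instrument in which items 1-3 tap one construct and items 4-6 another:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
n = 200
trait_a = rng.normal(size=n)
trait_b = rng.normal(size=n)

# Simulated item responses: each item is its underlying trait plus noise.
items = np.column_stack([
    trait_a + rng.normal(scale=0.5, size=n),   # items 1-3 load on trait A
    trait_a + rng.normal(scale=0.5, size=n),
    trait_a + rng.normal(scale=0.5, size=n),
    trait_b + rng.normal(scale=0.5, size=n),   # items 4-6 load on trait B
    trait_b + rng.normal(scale=0.5, size=n),
    trait_b + rng.normal(scale=0.5, size=n),
])

fa = FactorAnalysis(n_components=2).fit(items)
print(np.round(fa.components_, 2))   # loadings: each row is one factor
```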
Test bias is a major threat to construct validity, and therefore test bias analyses should be employed to examine the test items (Osterlind, 1983). The presence of test bias definitely affects the measurement of the psychological construct. However, the absence of test bias does not guarantee that the test possesses construct validity. In other words, the absence of test bias is a necessary, but not a sufficient, condition.

Construct validation as unification: The criterion and content models tend to be empirically oriented, while the construct model is inclined to be theoretical. Nevertheless, all models of validity require some form of interpretation: What is the test measuring? Can it measure what it intends to measure? In standard scientific inquiry, it is important to explicitly formulate an interpretative (theoretical) framework and then to subject it to empirical challenges. In this sense, theoretical construct validation is considered to function as a unifying framework for validity (Kane, 2001).

A modified view of reliability (Moss, 1994)

"There can be validity without reliability if reliability is defined as consistency


among independent measures.

Reliability is an aspect of construct validity. As assessment becomes less


standardized, distinctions between reliability and validity blur.

In many situations such as searching faculty candidate and conferring graduate


degree, committee members are not trained to agree on a common set of criteria
and standards

Inconsistency in students' performance across tasks does not invalidate the


assessment. Rather it becomes an empirical puzzle to be solved by searching for a
more comprehensive interpretation.

Initial disagreement (e.g., among students, teachers, and parents in responsive


evaluation) would not invalidate the assessment. Rather it would provide an
impetus for dialog."

Li (2003) argued that the preceding view is incorrect:

"The definition of reliability should be defined in terms of the classical test


theory: the squared correlation between observed and true scores or the proportion
of true variance in obtained test scores.

Reliability is a unitless measure and thus it is already model-free or standard-free.

It has been a tradition that multiple factors are introduced into a test to improve
validity but decrease internal-consistent reliability."

An extended view of Moss's reliability (Mislevy, 2004)

Inspired by Moss, Mislevy went further and asked whether there can be reliability without reliability (indices).

By blending psychometrics and hermeneutics, which takes a holistic and integrative approach to understanding the whole in light of its parts (the whole is more than the sum of its parts), Mislevy urged psychometricians to think about what they intend to make inferences about.

In many situations we do not present just one single argument; rather, problem solving involves a chain of arguments with multiple pieces of evidence.

Off-the-shelf inferential technology (e.g., computing reliability indices) may be problematic if we measure things or tasks that we do not know much about.

Probability-based reasoning applied to complex assessments grounded in cognitive science is necessary.

A radical view of reliability (Thompson et al., 2003)

Reliability is not a property of the test; rather, it is a property of the data. Thus, psychometrics is datametrics.

Tests are not reliable. It is important to explore reliability in virtually all studies.

Reliability generalization, which is similar to meta-analysis, should be implemented to assess the variance of measurement error across many studies.

An updated perspective of reliability (Cronbach, 2004)

In a 2004 article, Lee Cronbach, the inventor of Cronbach's alpha as a way of measuring reliability, reviewed the historical development of coefficient alpha. He asserted, "I no longer regard the formula (of Cronbach's alpha) as the most appropriate way to examine most data. Over the years, my associates and I developed the complex generalizability (G) theory" (p. 403). Discussion of G theory is beyond the scope of this document. Nevertheless, Cronbach did not object to the use of coefficient alpha, but he recommended that researchers take the following into consideration while employing this approach:

Standard error of measurement: It is the most important piece of information to report regarding the instrument, not a coefficient (a small computational sketch follows this list).

Independence of sampling

Heterogeneity of content

How the measurement will be used: Decide whether future uses of the
instrument are likely to be exclusively for absolute decisions, for differential
decisions, or both.

Number of conditions for the test
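
A small sketch of the standard error of measurement mentioned in the first bullet above, assuming Python and hypothetical values for the score standard deviation and reliability (classical formula: SEM = SD × sqrt(1 − reliability)):

```python
import math

observed_sd = 12.0   # standard deviation of observed test scores (hypothetical)
alpha = 0.85         # reliability estimate, e.g., coefficient alpha (hypothetical)

sem = observed_sd * math.sqrt(1 - alpha)
print(f"SEM = {sem:.1f} score points")
# Roughly, an observed score is expected to fall within +/- 1 SEM of the
# true score about 68% of the time.
```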

A critical view of validity (Pedhazur & Schmelkin, 1991)

"Content validity is not a type of validity at all because validity refers to inferences made about scores, not to an assessment of the content of an instrument.

The very definition of a construct implies a domain of content. There is no sharp distinction between test content and test construct."

A modified view of validity (Messick, 1995)

The conventional view (content, criterion, construct) is fragmented because it fails to take into account both evidence of the value implications of score meaning as a basis for action and the social consequences of using test scores. Validity is not a property of the test or assessment; rather, it is about the meaning of the test scores.

Content: evidence of content relevance, representativeness, and technical quality

Substantive: theoretical rationale

Structural: the fidelity of the scoring structure

Generalizability: generalization to the population and across populations

External: applications to multitrait-multimethod comparison

Consequential: bias, fairness, and justice; the social consequences of the assessment for society

Critics argued that consequences should not be a component of validity because test developers should not be held responsible for consequences of misuse that are out of their control; rather, accountability should rest with the misuser. Messick (1998) counter-argued that the social consequences of score interpretation include the value implications of the construct, and these implications must be addressed by evaluating the meaning of the test score. While test developers should not be held accountable for the misuse of tests, they should still be attentive to the unanticipated consequences of legitimate score interpretation.

A different view of reliability and validity (Salvucci, Walter, Conley, Fink, & Saba, 1997)

Some scholars argue that the traditional view that "reliability is a necessary but not a sufficient condition of validity" is incorrect. This school of thought conceptualizes reliability as invariance and validity as unbiasedness. A sample statistic may have an expected value over samples equal to the population parameter (unbiasedness) yet have very high variance because of a small sample size. Conversely, a sample statistic can have very low sampling variance but an expected value far removed from the population parameter (high bias). In this view, a measure can be unreliable (high variance) but still valid (unbiased).

[Figure: two sampling distributions. Left panel: the sample statistic (yellow line) equals the population parameter (red line), so it is unbiased, but its variance (green line) is high — unreliable but valid. Right panel: the sample statistic differs from the population parameter, so it is biased, but its variance is low — invalid but reliable.]
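
A small simulation sketch of this sampling view, assuming Python with NumPy; the population parameter, bias, and variances are arbitrary illustration values:

```python
import numpy as np

rng = np.random.default_rng(1)
parameter = 50.0
n_samples = 10_000

# Estimator 1: unbiased but with high sampling variance ("unreliable but valid").
unbiased_high_var = rng.normal(loc=parameter, scale=10.0, size=n_samples)
# Estimator 2: biased but with low sampling variance ("invalid but reliable").
biased_low_var = rng.normal(loc=parameter + 5.0, scale=1.0, size=n_samples)

for name, est in [("unreliable but valid", unbiased_high_var),
                  ("invalid but reliable", biased_low_var)]:
    print(f"{name}: mean = {est.mean():.1f}, SD = {est.std(ddof=1):.1f}")
```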

Caution and advice

There is a common misconception that if someone adopts a validated instrument, he or she does not need to check its reliability and validity with his or her own data. Imagine this: when I buy a drug that has been approved by the FDA and my friend asks me whether it heals me, I tell him, "I am taking a drug approved by the FDA and therefore I don't need to know whether it works for me or not!" A responsible evaluator should still check the instrument's reliability and validity with his or her own data and make any modifications if necessary.

Low reliability is less detrimental in the pretest. In the pretest, where subjects have not been exposed to the treatment and thus are unfamiliar with the subject matter, a low reliability caused by random guessing is expected. One easy way to overcome this problem is to include "I don't know" among the multiple-choice options. In an experimental setting where students' responses would not affect their final grades, the experimenter should explicitly instruct students to choose "I don't know" instead of guessing if they really don't know the answer. Low reliability is a signal of high measurement error, which reflects a gap between what students actually know and what scores they receive. The choice "I don't know" can help close this gap.
Last Updated: 2012

References
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1985). Standards for educational and psychological testing. Washington, DC: Authors.
Angoff, W. H. (1988). Validity: An evolving concept. In H. Wainer & H. I. Braun (Eds.), Test validity. Hillsdale, NJ: Lawrence Erlbaum.
Brennan, R. (2001). An essay on the history and future of reliability from the perspective of replications. Journal of Educational Measurement, 38, 295-317.
Cronbach, L. J. (1971). Test validation. In R. L. Thorndike (Ed.), Educational measurement (2nd ed.). Washington, DC: American Council on Education.
Cronbach, L. J. (2004). My current thoughts on coefficient alpha and successor procedures. Educational and Psychological Measurement, 64, 391-418.
Cronbach, L. J., & Quirk, T. J. (1976). Test validity. In International encyclopedia of education. New York: McGraw-Hill.
Goodenough, F. L. (1949). Mental testing: Its history, principles, and applications. New York: Rinehart.
Hunter, J. E., & Schmidt, F. L. (1990). Methods of meta-analysis: Correcting error and bias in research findings. Newbury Park, CA: Sage Publications.
Kane, M. (2001). Current concerns in validity theory. Journal of Educational Measurement, 38, 319-342.
Lacity, M., & Jansen, M. A. (1994). Understanding qualitative data: A framework of text analysis methods. Journal of Management Information Systems, 11, 137-160.
Li, H. (2003). The resolution of some paradoxes related to reliability and validity. Journal of Educational and Behavioral Statistics, 28, 89-95.
Messick, S. (1995). Validity of psychological assessment: Validation of inferences from persons' responses and performances as scientific inquiry into score meaning. American Psychologist, 50, 741-749.
Messick, S. (1998). Test validity: A matter of consequence. Social Indicators Research, 45, 35-44.
Mislevy, R. (2004). Can there be reliability without reliability? Journal of Educational and Behavioral Statistics, 29, 241-244.
Moss, P. A. (1994). Can there be validity without reliability? Educational Researcher, 23, 5-12.
Osterlind, S. J. (1983). Test item bias. Newbury Park, CA: Sage Publications.
Parkes, J. (2000). The relationship between the reliability and cost of performance assessments. Education Policy Analysis Archives, 8. Retrieved from http://epaa.asu.edu/epaa/v8n16/
Pedhazur, E. J., & Schmelkin, L. P. (1991). Measurement, design, and analysis: An integrated approach. Hillsdale, NJ: Lawrence Erlbaum Associates.
Polkinghorne, D. E. (1988). Narrative knowing and the human sciences. Albany: State University of New York Press.
Salvucci, S., Walter, E., Conley, V., Fink, S., & Saba, M. (1997). Measurement error studies at the National Center for Education Statistics. Washington, DC: U.S. Department of Education.
Thompson, B. (Ed.). (2003). Score reliability: Contemporary thinking on reliability issues. Thousand Oaks, CA: Sage.
Yu, C. H. (2005). Test-retest reliability. In K. Kempf-Leonard (Ed.), Encyclopedia of social measurement (Vol. 3, pp. 777-784). San Diego, CA: Academic Press.

Questions for discussion

Pick one of the following cases and determine whether the test or assessment is valid. Apply the concepts of reliability and validity to the situation. These cases may be remote from your cultural context; you may use your own example instead.

1. In ancient China, candidates for government office had to take an examination on literature and moral philosophy, rather than public administration.

2. Before July 1, 1997, when Hong Kong was a British colony, Hong Kong doctors, including specialists, who graduated from non-Commonwealth medical schools had to take a general medical examination covering all general areas in order to be certified.
