
Psychological Assessment

TEST WORTHINESS: VALIDITY, RELIABILITY, CROSS-CULTURAL FAIRNESS, AND PRACTICALITY

TEST WORTHINESS = based on validity, reliability, cross-cultural fairness, and practicality.

Demonstrating test worthiness requires an involved, objective analysis of a test in four critical areas:
(1) validity: whether it measures what it’s supposed to measure (accuracy)
(2) reliability: a measure of the stability or consistency of scores (repeatability)
(3) cross-cultural fairness: whether a person’s score is a true reflection of the individual and not a function of cultural bias inherent in
the test; and
(4) practicality: whether it makes sense to use a test in a particular situation.

RELATIONSHIP BETWEEN RELIABILITY AND VALIDITY

 ALL valid tests are reliable


 NOT ALL reliable tests are valid
 Reliability is a necessary BUT insufficient condition for validity
 A test cannot be valid YET unreliable
 Reliability LIMITS validity
 As the reliability increases, the measurement error decreases.
 Tests that are valid are ALSO reliable; however, tests that are reliable AREN'T ALWAYS valid.
 A test can be reliable WITHOUT being valid. Therefore, a test cannot be valid UNLESS it is reliable.

CORRELATION COEFFICIENT
 Shows the relationship between two sets of scores; it is a statistical concept frequently used in discussions of the critical factors just listed.
 The strength of a correlation is measured by a correlation coefficient, which is denoted by the symbol r.
 A correlation that approaches -1.00 or +1.00 demonstrates a strong relationship, while a correlation that approaches 0 shows little or no relationship between the two measures or variables.
 The + sign or – sign indicates the direction of the association between the variables

POSITIVE CORRELATIONS = show a tendency for scores to move in the same direction.
 For instance, if a group of individuals took two tests, a positive correlation would show a tendency for those
who obtained high scores on the first test to obtain high scores on the second test, for those who obtained
low scores on the first test to obtain low scores on the second test, and so forth.
 As another example, researchers have generally found positive correlations between social connectedness and other variables such as subjective well-being, longevity, and life satisfaction. This makes sense, as people who are more connected tend to be happier, live longer, and be more satisfied. These are positive correlations because the variables move in the same direction.

NEGATIVE CORRELATIONS = show an inverse relationship between sets of scores


 For instance, individuals who obtain high scores on the first test would be likely to obtain low scores on the
second test.
 We can expect a negative correlation between social connectedness and depression; that is, as people connect more with others, they tend to be less depressed. Since social connectedness and depression move in opposite directions, they are inversely related, or negatively correlated (both cases are illustrated in the sketch below).
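
A minimal sketch of both cases in Python, using invented scores only for illustration (np.corrcoef computes Pearson's r):

```python
import numpy as np

# Hypothetical scores for five people; values are invented for illustration.
connectedness = np.array([10, 14, 18, 22, 30])   # social connectedness
well_being    = np.array([40, 46, 55, 60, 72])   # rises with connectedness
depression    = np.array([30, 27, 22, 15, 10])   # falls as connectedness rises

# np.corrcoef returns a correlation matrix; the off-diagonal entry is r.
r_pos = np.corrcoef(connectedness, well_being)[0, 1]
r_neg = np.corrcoef(connectedness, depression)[0, 1]

print(f"connectedness vs. well-being: r = {r_pos:+.2f}")  # near +1.00 (same direction)
print(f"connectedness vs. depression: r = {r_neg:+.2f}")  # near -1.00 (inverse)
```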
VALIDITY

 How well does a test measure what it’s supposed to measure?

VALIDITY COEFFICIENT

 The higher the validity coefficient, the more beneficial it is to use the test.

VALIDITY COEFFICIENT VALUE / INTERPRETATION

Above .35: Very beneficial
.21 - .35: Likely to be useful
.11 - .20: Depends on the circumstances
Below .11: Unlikely to be useful

Face validity = Superficial appearance of a test — not true validity

1. CONTENT VALIDITY
 When a test has content validity, the items on the test represent the entire range of possible items the test should
cover.
 The extent to which a measure ‘covers’ the construct of interest.
 Like face validity, it is not usually assessed quantitatively. Instead, it is assessed by carefully checking the measurement method against the conceptual definition of the construct.
 Is the content of the test valid for the kind of test it is?

Example: If a researcher conceptually defines test anxiety as involving both sympathetic nervous system arousal (leading to nervous feelings) and negative thoughts, then the measure of test anxiety should include items about both nervous feelings and negative thoughts.

2. CRITERION-RELATED VALIDITY
 The extent to which people's scores on a measure are correlated with other variables (known as criteria) that one would expect them to be correlated with.
 A criterion can be any variable that one has reason to think should be correlated with the construct being measured, and there will usually be many of them.
 A test is said to have criterion-related validity when the test has demonstrated its effectiveness in predicting a criterion, or indicator, of a construct. For instance, when an employer hires new employees based on a normal hiring procedure such as an interview plus a screening test, the procedure demonstrates criterion-related validity if people who do well on the test also do well on the job, and people with low scores on the test do poorly on the job.
 What is the relationship between a test and a criterion (external source) that the test should be related to?

Example: People's scores on a new measure of test anxiety should be negatively correlated with their performance on an important school exam. If it were found that their scores were in fact negatively correlated with their exam performance, this would be evidence that the scores really reflect people's test anxiety.

A. CONCURRENT VALIDITY
 “Here and now”, occurs when a test is shown to be related to an external source that can be measured at
around the same time the test is being given.
 The criterion measures are obtained at the same time as the test scores.

For instance, do students’ scores in calculus class correlate well with their scores in linear algebra class? These scores
should be related concurrently because they are both tests of mathematics.

Another example is a test that measures depression: it measures the current level of depression experienced by the test taker.

B. PREDICTIVE VALIDITY
 Relationship between test scores and a future standard.
 The criterion is measured at some point in the future (after the construct has been measured).
 A practical application of predictive validity is the standard error of the estimate, which, based on one variable, allows us to predict a range of scores on a second variable (a brief sketch follows the example below).

For example, a career or aptitude test can be helpful in determining who is likely to succeed or fail in a certain subject or occupation.
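
A brief sketch of the standard error of the estimate, assuming hypothetical aptitude scores as the predictor and first-year GPA as the criterion (both invented); it fits a simple regression line and reports a rough prediction band:

```python
import numpy as np

# Invented predictor (aptitude test) and criterion (first-year GPA) scores.
aptitude = np.array([45, 52, 60, 68, 75, 81, 90])
gpa      = np.array([2.1, 2.4, 2.6, 3.0, 3.1, 3.4, 3.8])

# Fit the regression line predicting GPA from the aptitude score.
slope, intercept = np.polyfit(aptitude, gpa, 1)
predicted = intercept + slope * aptitude

# Standard error of the estimate: spread of criterion scores around the line.
see = np.sqrt(np.sum((gpa - predicted) ** 2) / (len(gpa) - 2))

new_score = 70                                   # a new applicant's test score
estimate = intercept + slope * new_score
print(f"predicted GPA: {estimate:.2f} +/- {see:.2f} (roughly a 68% range)")
```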

 Another concept in the application of predictive validity is false positives and false negatives.
o A false positive (F+) is when an instrument incorrectly predicts a test taker will have an attribute or
be successful when he or she will not.
o A false negative (F-) is when a test forecasts an individual will not have an attribute or will be
unsuccessful when in fact he or she will.

For example, let's say you are working in a high school and use the adolescent version of the Substance Abuse Subtle Screening Inventory (SASSI) to flag students who may have an addiction. After giving the instrument to 100 students, 10 are identified as having a high likelihood of substance dependency. If two of those students do not, in fact, have an addiction, those two results would be considered false positives. If one of the 90 students whose scores indicated a low probability of dependency did, in actuality, have a diagnosable substance dependence, that would be considered a false negative (see the counting sketch below).
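
A counting sketch of that example, with invented 0/1 labels (1 = dependency) arranged to give two false positives and one false negative:

```python
# 10 students flagged by the instrument, 90 not flagged (predicted labels).
predicted = [1] * 10 + [0] * 90
# Actual status: 8 of the flagged truly dependent, 2 not; 1 missed case among the 90.
actual = [1] * 8 + [0] * 2 + [1] * 1 + [0] * 89

false_positives = sum(p == 1 and a == 0 for p, a in zip(predicted, actual))
false_negatives = sum(p == 0 and a == 1 for p, a in zip(predicted, actual))

print(f"false positives (flagged, but no addiction): {false_positives}")  # 2
print(f"false negatives (missed dependency):         {false_negatives}")  # 1
```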

3. CONSTRUCT VALIDITY
 If an assessment is related to other assessments measuring the same psychological construct - a construct being a
concept used to explain behavior (e.g. intelligence, honesty)
 For example, intelligence is a construct used to explain a person's ability to understand and solve problems. Construct validity can be evaluated by comparing intelligence scores on one test to intelligence scores on other tests (e.g., the Wonderlic Cognitive Ability Test vs. the Wechsler Adult Intelligence Scale).

i. EXPERIMENTAL DESIGN VALIDITY


 Using experimentation to show that a test measures a concept
 Experimentally based construct validity is established when an experiment confirms the hypothesis you developed, scientifically showing that your construct exists.

ii. FACTOR ANALYSIS


 A data-reduction technique that aggregates a given set of items into a smaller set of factors, often based on a statistical technique called principal component analysis (see the sketch after this list).
 Demonstrates the statistical relationship among subscales or items of a test.
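
A minimal sketch of the idea using scikit-learn's PCA on simulated item responses; the two latent factors, item loadings, and sample size are all invented for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n = 200
nervous_feelings = rng.normal(size=n)    # latent factor 1
negative_thoughts = rng.normal(size=n)   # latent factor 2

# Six items: the first three load on factor 1, the last three on factor 2.
items = np.column_stack(
    [nervous_feelings + rng.normal(scale=0.5, size=n) for _ in range(3)]
    + [negative_thoughts + rng.normal(scale=0.5, size=n) for _ in range(3)]
)

pca = PCA(n_components=2).fit(items)
print("variance explained:", pca.explained_variance_ratio_.round(2))
print("loadings:\n", pca.components_.round(2))  # items should group by factor
```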

iii. CONVERGENT VALIDITY


 Refers to the closeness with which a measure RELATES TO (or converges on) the construct that it purports to measure.
 Can be established by comparing the observed values of one indicator of a construct with those of other indicators of the same construct and demonstrating similarity (high correlation) between the values of these indicators.
 With convergent validity you’re expecting to find a relationship between your test and other variables
of a similar nature.
 Occurs when you find a significant positive correlation between your test and other existing measures
of a similar nature. Sometimes this relationship involves very similar kinds of instruments, but other
times you would be looking for correlations between your test and variables that may seem only
somewhat related.
 For instance, if you correlate your test of depression with a test that measures despair, you would expect a positive correlation. However, because despair is theoretically different from depression, you would expect a lower correlation (maybe 0.4).

iv. DISCRIMINANT VALIDITY


 Showing a lack of relationship between a test and other dissimilar tests.
 The extent to which scores on a measure are NOT correlated with measures of variables that are
conceptually distinct.
 Established by demonstrating that indicators of one construct are dissimilar from indicators of other constructs (a combined convergent and discriminant sketch appears after this list).
 In a sense, you’re looking to find little or no relationship between your test and measures of
constructs that are not theoretically related to your test.
 For instance, with your test of depression, you might want to compare your test scores with an
existing test that measures anxiety. In this case, you give 500 subjects your test of depression as
well as a valid test to measure anxiety, looking to find little or no relationship. If you were able to give
your client a test that could discriminate depression from anxiety, you would get a better picture of
your client.
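
A combined sketch of convergent and discriminant checks, using simulated scores: a new depression test should correlate highly with an existing depression measure, moderately with despair, and near zero with anxiety. All variable names and effect sizes are invented:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
true_depression = rng.normal(size=n)

new_test = true_depression + rng.normal(scale=0.4, size=n)          # our new measure
existing = true_depression + rng.normal(scale=0.4, size=n)          # same construct
despair  = 0.5 * true_depression + rng.normal(scale=0.9, size=n)    # related construct
anxiety  = rng.normal(size=n)                                       # distinct construct

for name, other in [("existing depression test", existing),
                    ("despair measure", despair),
                    ("anxiety measure", anxiety)]:
    r = np.corrcoef(new_test, other)[0, 1]
    print(f"r with {name}: {r:+.2f}")
# Expect roughly: ~.85 (convergent), ~.4 (related but distinct), ~0 (discriminant).
```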

Both concurrent and predictive validity are represented under criterion-related validity. Concurrent validity is symbolized by a new test being compared to a known quantity (a ruler), while predictive validity hopefully allows a test to estimate a future event. In experimental design validity, we give our new test to a group of people receiving an intervention, hoping our instrument captures the desired change. The factor analysis figure represents the statistical process in which similar test items should correlate with other like items within the instrument, creating subgroups called factors or dimensions. In convergent validity, our new test will hopefully have a high correlation (r) with an existing instrument meant to measure the same construct. Divergent (discriminant) validity is the opposite: we want our new test to have a low correlation with another test that measures a different construct.

RELIABILITY

 Amount of freedom from measurement error—consistency of test scores.


 Defined as “the degree to which test scores are free from errors of measurement” (AERA, 1999).
 Psychologists consider three types of consistency: over time (test-retest), across items (internal consistency), and across
different researchers (inter-rater reliability)

FACTORS WHICH CAN AFFECT RELIABILITY:

 The length of the assessment – a longer assessment generally produces more reliable results.
 The suitability of the questions or tasks for the individual being assessed.
 The phrasing and terminology of the questions.
 The consistency in test administration.
 The design of the marking schedule and moderation of marking procedures.
 The readiness of the test-taker for the assessment.

RELIABILITY COEFFICIENT

» An index of reliability: the proportion of variance that indicates the ratio of true-score variance to observed-score variance.
» A measure of the consistency of a test's scores.
» Several methods are used to calculate reliability coefficients, depending on the type of reliability being estimated:
a) Cronbach’s Alpha – the most widely used internal-consistency coefficient.
b) Pearson’s Correlation – reliability coefficient between parallel tests.
c) Spearman Brown Formula – reliability for split-half tests.
d) Cohen's Kappa – coefficient for interrater reliability
» RULE OF THUMB
a) For high-stakes tests (e.g., college admission), > 0.85 or above .90
b) For low-stakes tests (e.g., classroom assessment), > 0.70 or above 0.80

RELIABILITY COEFFICIENT VALUE / INTERPRETATION

.90 and up: Excellent
.80 - .89: Good
.70 - .79: Adequate
Below .70: May have limited applicability

SOURCES OF ERROR
» Test Construction – Item sampling; content sampling
» Test Administration
» Test Scoring and Interpretation

TEST-RETEST RELIABILITY (Time Sampling)


 Relationship between scores from one test given at two different administrations.
 A relatively simple way to determine whether an instrument is reliable is to give the test twice to the same
group of people.
 When the interval between testing is greater than six months, the estimate of test-retest reliability is often
referred to as the coefficient of stability.
 Test-retest reliability tends to be more effective in areas that are less likely to change over time, such as
intelligence.
Example: A person who is highly intelligent today will be highly intelligent next week. This means that any good measure
of intelligence should produce roughly the same scores for this individual next week as it does today.
 A measure that produces highly inconsistent scores over time cannot be a very good measure of a construct
that is supposed to be consistent.

ALTERNATE/PARALLEL FORMS RELIABILITY (Item Sampling)


 Accomplished by creating a large pool of test items that measure the same quality and then randomly dividing the items into two separate tests. The two tests should be administered to the same subjects at the same time.
 Relationship between scores from two similar versions of the same test.
 These alternate forms are created to mimic one another, yet are different enough to eliminate some of the
problems found in test-retest reliability (e.g., looking up an answer).
 Use of Pearson’s r (Pearson’s product-moment correlation) to estimate reliability

In this case, rather than giving the same test twice, the examiner gives the alternate form the second time. Clearly, a
challenge involved in this kind of reliability is to assure that both forms of the test use the same or very similar directions,
format, and number of questions, and are equal in difficulty and content. Also, because creating a parallel form is labor-
intensive and costly, it is often not a practical kind of reliability to implement. If this kind of reliability is used, the burden
is on the test creator to prove that both tests are truly equal.

INTERNAL CONSISTENCY
 The consistency of people’s responses across the items on a multiple-item measure.
 Essentially, you are comparing test items that measure the same construct to determine the test's internal consistency.
 In general, all the items on such measures are supposed to reflect the same underlying construct, so people’s
scores on those items should be correlated with each other.
 Reliability is measured statistically by looking "within" the test to determine reliability estimates, rather than across administrations or forms as you would with test-retest or parallel-forms reliability.
 When you see a question that seems very similar to another test question, it may indicate that the two questions are being used to gauge reliability. Because of this similarity, and because both questions measure the same thing, the test taker should answer both questions the same way, which would indicate that the test has internal consistency.
o SPLIT-HALF CORRELATION (ODD-EVEN RELIABILITY)
» The most basic form of internal consistency reliability.
» This method, which requires only one form and one administration of the test, splits the test in
half and correlates the scores of one half of the test with the other half.
» Obtained by correlating the two sets of scores obtained from equivalent halves of a single test administered once
→ Step 1. Divide the test into equivalent halves.
→ Step 2. Calculate a Pearson r between scores on the two halves of the test.
→ Step 3. Adjust the half-test reliability using the Spearman-Brown formula
» Advantages of this kind of reliability include only having to give the test once and not having to
create a separate alternate form. One potential pitfall of this form of reliability would arise if the
two halves of the test were not parallel or equivalent, such as when a test gets progressively
more difficult.
» One common method to mathematically compensate for the shortened length of each half is to use the Spearman–Brown formula. So if a test manual states that split-half reliability was used, check to see if the Spearman–Brown formula was applied. If it was not, the test might be somewhat more reliable than actually noted (see the sketch below).
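
A minimal odd-even split-half sketch with the Spearman-Brown step-up; the 0/1 item responses are simulated:

```python
import numpy as np

rng = np.random.default_rng(2)
ability = rng.normal(size=300)                      # 300 simulated test takers
# 20 right/wrong items: higher ability means more correct answers.
items = (ability[:, None] + rng.normal(size=(300, 20)) > 0).astype(int)

odd_half  = items[:, 0::2].sum(axis=1)   # score on odd-numbered items
even_half = items[:, 1::2].sum(axis=1)   # score on even-numbered items

r_half = np.corrcoef(odd_half, even_half)[0, 1]
r_full = (2 * r_half) / (1 + r_half)     # Spearman-Brown correction to full length
print(f"half-test r = {r_half:.2f}, corrected full-test reliability = {r_full:.2f}")
```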

o CRONBACH’S COEFFICIENT ALPHA AND KUDER–RICHARDSON


» Reliability based on a mathematical comparison of individual items with one another and total
score.
 KUDER-RICHARDSON 20
o A measure of internal reliability for a binary test with right/wrong answers
(ex. Achievement test)
 CRONBACH’S ALPHA
o Measures internal reliability for tests with multiple possible answers or
responses (ex. Rating scales)
o α = the mean of all possible split-half correlations.
o A value of .80 or greater = good internal consistency (see the sketch below)
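
A short sketch of coefficient alpha computed from its definitional formula, alpha = (k / (k - 1)) * (1 - sum of item variances / variance of total scores), on simulated rating-scale items:

```python
import numpy as np

rng = np.random.default_rng(3)
trait = rng.normal(size=250)                                    # latent trait
items = trait[:, None] + rng.normal(scale=0.8, size=(250, 10))  # 10 simulated items

k = items.shape[1]
item_variances = items.var(axis=0, ddof=1)
total_variance = items.sum(axis=1).var(ddof=1)

alpha = (k / (k - 1)) * (1 - item_variances.sum() / total_variance)
print(f"Cronbach's alpha = {alpha:.2f}")  # .80 or greater suggests good consistency
# With binary right/wrong items, the same formula reduces to Kuder-Richardson 20.
```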

Notice that test-retest is represented by the same test being given again after a time interval, alternate forms is represented by two similar forms that are not affected by time, and two types of internal consistency are shown: split-half, which is represented by a test being split, and coefficient alpha and Kuder–Richardson, which are shown as a grid in which each item is related to the whole test. These visual representations can help you remember each form.

o INTERRATER RELIABILITY
» The extent to which different observers are consistent with their judgments.
» To the extent that the behavior of interest can be detected by an attentive observer, different observers' ratings should be highly correlated with each other (a kappa sketch follows the example below).

Example: Bobo Doll experiment by Bandura
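
A minimal interrater sketch with Cohen's kappa, assuming two hypothetical observers coding the same eight behaviors (labels invented):

```python
from sklearn.metrics import cohen_kappa_score

rater_1 = ["aggressive", "aggressive", "not", "not", "aggressive", "not", "not", "aggressive"]
rater_2 = ["aggressive", "aggressive", "not", "aggressive", "aggressive", "not", "not", "not"]

kappa = cohen_kappa_score(rater_1, rater_2)
print(f"Cohen's kappa = {kappa:.2f}")  # 1.00 = perfect agreement beyond chance
```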


ITEM RESPONSE THEORY: ANOTHER WAY OF LOOKING AT RELIABILITY

 Classical Test Theory, which originated with Spearman (1904), assumes that measurement error exists in every instrument.
Hence, a respondent’s true score is equal to his or her observed score plus or minus measurement error.

True score = observed score ± measurement error


 Reliability is the opposite of measurement error. As the reliability increases, the measurement error decreases.
 If we have a reliability α of 0.60, then we have a lot of measurement error, which makes it more difficult to
estimate a true score.
 An instrument with 0.93 reliability has little error, allowing us to feel more confident.
 In classical test theory, test items are viewed as a whole, with the desire to reduce overall measurement error (increase reliability) so that we can estimate the true score (a short sketch of this link follows below).
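
A short sketch of that link, using the classical-test-theory standard error of measurement, SEM = SD * sqrt(1 - reliability); the scale SD and observed score are invented:

```python
import math

sd = 15          # standard deviation of an IQ-like scale (illustrative)
observed = 108   # a hypothetical observed score

for reliability in (0.60, 0.93):
    sem = sd * math.sqrt(1 - reliability)
    print(f"reliability {reliability:.2f}: SEM = {sem:4.1f}, "
          f"true score likely within {observed - sem:.1f}-{observed + sem:.1f}")
# With reliability .60 the band is wide (much error); with .93 it is narrow.
```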

ITEM RESPONSE THEORY, an extension of classical test theory, examines items individually for their ability to measure the trait being
examined.

 The item characteristic curve is a tool used in IRT that assumes that as people’s abilities increase, their probability of answering
an item correctly increases.
 We can see that if the shape of the "S" flattens out, the item has less ability to discriminate, or to provide a range of probabilities of getting a correct or incorrect response. If the S is steep, the item creates strong differentiation across ability levels.

We can see that if someone has average ability (i.e., an IQ of 100), he or she has a 50% chance of getting the item correct. If an individual has less ability, say an IQ of 85, his or her chance of getting the item correct is reduced to about 25%. Similarly, as one's ability increases, the chance of getting the item correct approaches 100% (see the sketch below).
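
A minimal item characteristic curve sketch using a logistic (2PL-style) function; the difficulty and discrimination values are illustrative, chosen so an average ability (IQ 100) gives about a 50% chance and IQ 85 about 25%:

```python
import math

def p_correct(ability, difficulty=100.0, discrimination=0.07):
    """Probability of answering the item correctly at a given ability level."""
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))

for iq in (85, 100, 115, 130):
    print(f"IQ {iq}: P(correct) = {p_correct(iq):.2f}")
# A larger discrimination value makes the "S" steeper, so the item separates
# ability levels more sharply; a flatter curve discriminates less.
```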

PRACTICALITY

 Decisions about test selection should be made carefully because they have an impact on the person being tested, the examiner,
and at times, the institutions that are requiring testing.
 Examples of a few of the major practical concerns examiners face include:
o Time. Time factors tend to be related to the attention span of the client you are testing, to the amount of time allotted
for testing in a particular setting, and to the final cost of testing.
o Cost
o Format
o Readability
o Ease of Administration, Scoring, and Interpretation
