Reliability and Validity

Q560: Experimental Methods in Cognitive Science

Lecture 16
Psychological Measurement

Psychometric theory, sources of measurement noise


Cognitive Construct:
The label given to our hypothetical characteristic (e.g., attention,
STM-capacity, executive function, intelligence, forgetting, etc.)
Dimensions along which Ss can be located based on behavior
Cannot directly measure construct, so we link it to behavior
believed to reflect the construct

Operational Definition:
A statement that specifies how a construct is measured
Link between the overt and latent variables
Reliability and Validity

Reliability and Validity are important concepts in both
measurement and the full experiment, to ensure the link
between our statistics and conclusions is sound.

Reliability = consistency

Validity = on-targetness

Reliability is a necessary but insufficient condition for validity


Reliability

Reliability is the extent to which the measurements of a test
remain consistent over repeated tests of the same subject
under identical conditions.

An experiment is reliable if it yields consistent results for the
same measure. It is unreliable if repeated measurements give
different results.

Reliable car, repeatability, etc.

Reliability of experiment, replication, and error rate

Reliability does not imply validity


Validity

Validity of a measure is the degree to which the variable
measures what it is intended to measure

e.g., IQ tests, GREs, Eye tests

A valid measure is reliable


A reliable measure is not necessarily valid
Unreliable and Invalid Shooting:
Sam "Scattershot" Wilson

Reliable but Invalid Shooting:
Ralph "Rightpull" Roberts

Reliable and Valid Shooting:
Kit "Bullseye" Carson


Experimental Validity
Valid design is necessary for valid scientific conclusions
Crib sheet: www.indiana.edu/~clcl/Q560/Validity.pdf

1. Statistical Conclusion Validity:


Validity with which statements about the association of two
variables can be made based on statistical tests
Threats: low measurement reliability (Rxx), violated statistical assumptions

2. Construct Validity:
Validity with which we can make generalizations about higher-order
constructs from the experimental results
Threats: vague operational definitions (Watson); experimenter/participant
expectancy effects
Experimental Validity

3. Internal Validity:
Validity with which statements can be made about the causal
relationship between variables as manipulated
Threats: Confounds (history, maturation, testing), instrumentation,
statistical regression, mortality, etc.

4. External Validity:
Validity with which we can make generalizations from sample/expt
Ecological validity
Threats: Interactions of setting/selection method and treatment
Dr. N. Lewis is interested in whether memory encoding is stronger for pictures of
objects or for words that refer to the same objects. He has participants learn a list of 30
written words that refer to objects (then a distracting task) and then recall as many
words as they can. Next, he gives the same participants a list of 30 pictures of the
same objects and (after the same distracting task) has them again recall as many
words as they can. At the end of the experiment, participants recalled a mean of 16
words in the written condition vs. a mean of 24 words in the pictures condition.

1. What type of study is this?


2. What is the independent variable?
3. What is the dependent variable?
4. Is the dependent variable discrete or continuous?
5. What is the scale of measurement for the dependent variable?
6. Name one confounding variable.
A researcher is interested in whether or not snakes can detect insults. He buys 20
exotic snakes, and separates them into two groups of 10. For one group of snakes, he
insults them for 10 minutes each. For the other group, he simply stares at them for 10
minutes each. For each group, he records the number of times the snakes bite him
(assuming that a bite indicates that the snake took offense to him). At the end of the
experiment, the group of snakes he insulted had bitten him 23 times vs. only 8 bites from
the group he did not insult.

1. What type of study is this?


2. What is the independent variable?
3. What is the dependent variable?
4. Is the dependent variable discrete or continuous?
5. What is the scale of measurement for the dependent variable?
6. Name one confounding variable.
Article Critique

Assignment #5: Article critique


Small-N Designs
Why would we want to do an experiment with a
small number of subjects?

B/c it's easier/faster than doing an experiment
with a large number of Ss?

This is untrue: small-N experiments are frequently
more difficult and time-consuming, even though
there are only 1-5 participants
Small-N Designs
In a small N design, we make repeated
measurements on a small number of participants

Although there are fewer participants, there are
many more observations per participant

These experiments can involve many hours of testing,
consisting of several thousand trials

The goal is to provide a complete and accurate
description of a single subject's behavioral
changes as a function of a repeated measure

Other subjects are replications (separate expts)

The experimenter is often a subject


Why would we want to do this?
1. Practical Reasons: A small-N design may be
necessary b/c it is difficult to get Ss from a rare
population (e.g., OCD, Alzheimer's), or the
treatments may be expensive or time-consuming
(e.g., teaching sign language to a gorilla:
Patterson & Linden, 1981, spent 10 years doing
this)

2. Theoretical Reasons: Skinner believed that the
best way to understand behavior is to study
single individuals intensely: one should "study a
single subject for a thousand hours rather than a
thousand subjects for an hour" (1966, p. 21)

If conditions are precisely controlled, then orderly
and predictable behavior will follow
Why would we want to do this?
3. Methodological Reasons: Pooling or averaging
data from many subjects can produce misleading
results as an artifact of grouping (Sidman, 1960)

Averaging can produce results that do not
characterize any subject who participated. More
importantly, it can produce a result supporting
theory X when perhaps it shouldn't (Estes)

E.g.: Manis (1971) tested children in a
discrimination-learning task: stimuli were simple
objects, and the child had to learn which feature
is the diagnostic one (shape, color, position, etc.)
[Slide figure: six example training trials, each marked with the rewarded (+) stimulus.]
Continuity Theory: Concept learning is a gradual
process of accumulating habit strength

Noncontinuity Theory: Subjects actively try out
different hypotheses over trials. While they search
for the correct hypothesis, performance is at chance,
but once they hit the correct hypothesis,
performance shoots up to 100% and stays there

[Figure: predicted learning curves; performance (from chance to perfect) plotted as a function of trials.]
Here are the averaged data. Continuity is right!!
But here are the individual data before averaging.
Discontinuity is right?!
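The grouping artifact is easy to reproduce. In the hypothetical simulation below, every simulated subject behaves exactly as noncontinuity theory predicts (chance performance until a randomly timed "insight," then perfect performance), yet the group mean rises as the smooth, gradual curve continuity theory predicts:

```python
import random

random.seed(1)
n_subjects, n_trials = 100, 40

curves = []
for _ in range(n_subjects):
    switch = random.randint(5, 35)  # trial where this subject hits the right hypothesis
    # before the switch: 50% chance guessing; from the switch on: perfect performance
    curve = [(1 if random.random() < 0.5 else 0) if t < switch else 1
             for t in range(n_trials)]
    curves.append(curve)

# group mean accuracy per trial: rises gradually, though no individual curve is gradual
mean_curve = [sum(c[t] for c in curves) / n_subjects for t in range(n_trials)]
```

Each individual curve is a step function (only 0s and 1s), but because the step occurs at a different trial for each subject, averaging smears the steps into an incremental learning curve — exactly Sidman's and Estes's warning about pooled data.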
CogSci Began with Small-N Research
Ebbinghaus, Wundt, Dressler, Thorndike

It wasn't until the 1930s that experiments with
large numbers of participants and aggregate
statistics became commonplace (largely due to
Fisher)

Psychophysics is still dominated by Small-N
designs. The idea is that we have very high similarity
between our perceptual systems; with sufficient
control, a stable effect should be observable
without needing many subjects

An effect that isn't stable enough to be studied
with a small N isn't worth studying
Elements of Small-N Designs
1. A within-Ss manipulation

2. Target behavior must be operationally defined

3. Establish a baseline of responding/behavior

4. Begin treatment manipulation, and monitor
change from baseline

How do we analyze this?

Visual inspection

Curve/Trend fitting based on theory

Change from baseline
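A minimal change-from-baseline analysis can be sketched with invented ABAB data (the numbers and phase labels below are hypothetical): compute the mean response level in each phase and compare each treatment (B) phase against its adjacent baseline (A) phase.

```python
# Hypothetical weekly response counts across an ABAB withdrawal design
phases = {
    "A1": [4, 5, 3, 4],     # baseline
    "B1": [9, 11, 10, 12],  # treatment introduced
    "A2": [5, 4, 6, 5],     # treatment withdrawn (return to baseline)
    "B2": [10, 12, 11, 13], # treatment reinstated
}

means = {p: sum(v) / len(v) for p, v in phases.items()}

# Behavior tracks the treatment if both B-phase means exceed the neighboring A-phase means
effect = ((means["B1"] - means["A1"]) + (means["B2"] - means["A2"])) / 2
```

Here behavior rises under treatment, falls back on withdrawal, and rises again on reinstatement — the within-subject pattern that rules out history and maturation as explanations.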


Withdrawal Designs
Get measurement of baseline behavior on the DV

Introduce the manipulation, but note that a change in
responding may be due to history or maturation
(AB design)

Return to Baseline: If the change is due to history
or maturation, it is unlikely that the behavior will regress
when treatment is removed (called an ABA
design). The ABAB design is more popular

1. Multiple baselines design
2. Alternating treatments design
3. Changing criterion design
4. Staircase designs
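A staircase design can be sketched in a few lines. The simulation below (simulated observer, invented parameters) runs a 2-down/1-up staircase: two consecutive detections make the stimulus harder, any miss makes it easier, so the tested level converges near the observer's ~70.7%-correct intensity.

```python
import math
import random

random.seed(2)

def observer_detects(intensity, threshold=5.0):
    """Hypothetical observer: detection probability follows a logistic curve."""
    p = 1.0 / (1.0 + math.exp(-(intensity - threshold)))
    return random.random() < p

level, step = 10.0, 1.0   # start well above threshold
streak = 0
history = []
for _ in range(200):
    history.append(level)
    if observer_detects(level):
        streak += 1
        if streak == 2:   # 2-down: two detections in a row -> lower the intensity
            level -= step
            streak = 0
    else:                 # 1-up: any miss -> raise the intensity
        level += step
        streak = 0

# average of the late trials estimates the ~70.7%-correct intensity
threshold_estimate = sum(history[-50:]) / 50
```

The staircase concentrates trials near the threshold rather than wasting them at very easy or very hard levels — one reason psychophysical small-N sessions are so data-efficient per subject.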
Withdrawal Designs:
E.g., Does talking to a plant make it grow?

[Figure: growth of ficus divinicus in inches across three 3-month phases: (A) baseline, (B) treatment, (A) baseline.]




Small-N Designs and Psychophysics
Psychophysics: the relationship between the
physical stimulus and the perceptual reaction to it

Small-N designs are popular in psychophysics b/c:

Few Ss are needed b/c of the similarity
between our sensory systems (generalizes)
On each trial, data are much less affected by
error variance than in questionnaire research
(also b/c of laboratory control)
Trials are very quick and easy: so why do a
30-min experiment when the data collection
only takes 30 seconds?
Study one S; others are replications
Criticisms Against Small-N Designs
1. External validity: To what extent do these results
generalize?

2. Criticized for relying on visual inspection of data
instead of statistical analysis (but there are also
more theory-driven analyses, useful for model fitting)

3. Small-N designs cannot adequately test for
interaction effects (interactive designs exist, but
are very cumbersome: ABBCBABBCB design, etc.)

4. Due to their operant-learning tradition
(Skinnerian), they tend to focus on response
frequency as a DV, rather than RT, accuracy,
habituation, etc.
