PSYCHOLOGICAL ASSESSMENT
Conceptual Paradigm for Measurement and Evaluation

Samples of Behavior:
- Mental Abilities
- Personality

Measurement Scales (IRON):
- Interval
- Ratio
- Ordinal
- Nominal

Test: a single measure
Battery of Tests: a series of tests
Assessment: various techniques (DITO): Documents, Interview, Test, Observation
Evaluation (RAP): Recommendation, Action Plan, Program Development

Psychopathology:
- Diagnosis (classification, severity)
- Prognosis (predicting the development of the disorder)

Mental Abilities:
- General Intelligence (g): IQ
- Specific Intelligence (s): non-verbal IQ
- Multiple Intelligence
- Aptitude
- Interest
- Values

Personality:
- Traits
- States
- Types (e.g. the MBTI)

Measurement (IRON)
- Parametric statistics assume a normal distribution of scores (e.g. Pearson's r); non-parametric statistics are for non-normal distributions (e.g. Spearman's rho; chi-square for nominal data).
- Interval: temperature, time, IQ; has no absolute zero
- Ratio: weight, height; has an absolute zero
- Ordinal: rank, positions, Likert scale, birth order
- Nominal: sex, civil status; classifying

*Has an absolute zero: weight, because there can be 0 (no) weight. Has no absolute zero: temperature, because 0 degrees does not mean there is no temperature.
A distribution of scores is normal if the mean, median, and mode are all the same (measures of central tendency); an abnormal distribution is skewed.
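As a quick illustration of matching the statistic to the scale (all data invented): Pearson's r for interval/ratio scores, Spearman's rho for ordinal ranks, and chi-square for nominal categories.

```python
# Minimal sketch: choosing a statistic by measurement scale (data invented).
from scipy.stats import pearsonr, spearmanr, chi2_contingency

iq = [98, 105, 110, 92, 120, 101]             # interval: IQ scores
exam = [75, 80, 88, 70, 93, 78]               # interval: exam scores
r, p = pearsonr(iq, exam)                     # parametric correlation

rank_a = [1, 2, 3, 4, 5, 6]                   # ordinal: class ranks
rank_b = [2, 1, 3, 5, 4, 6]
rho, p_rho = spearmanr(rank_a, rank_b)        # non-parametric correlation

table = [[12, 8],    # nominal: sex x pass/fail counts (male: pass, fail)
         [15, 5]]    # (female: pass, fail)
chi2, p_chi, dof, expected = chi2_contingency(table)
```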

Psychological Tests

Objective Tests
- Standardized: test administration, scoring, and interpretation of test scores
- Limited number of responses: multiple choice; true or false
- Group tests

Norms: where we base the scores of the test takers; they transform raw scores into a meaningful scale
- Norm-referenced test (NRT): e.g. age norms
- Criterion-referenced test (CRT): there is a certain criterion to be met. Ex. how would we know if a basketball player is skillful? He meets the criterion for a sharpshooter.

Medium of Psychological Tests
- Paper and pencil
- Objects: wooden blocks, puzzles
- Machine: galvanic skin response (e.g. EEG, CT scan)
- Computer

Projective Tests (WIDU)
- Wishes
- Intrapsychic conflict: conflict between desires and morals
- Desires
- Unconscious motives
- Subjectivity in test interpretation / clinical judgment
- Self-administered / individual tests
- Unlimited number of responses

Battery of Tests: sets of tests

Psychological Tests

Ability Tests

Intelligence Tests
- Verbal intelligence
- Non-verbal intelligence
- Ex. Wechsler Adult Intelligence Scale, Stanford-Binet Intelligence Scale, Culture Fair Intelligence Test

Achievement Tests
- Measure the extent of one's knowledge in various academic subjects (what has been learned?)
- Ex. Stanford Achievement Test in reading

Aptitude Tests (predicting)
- Various skills / competencies
- Ex. Differential Aptitude Test

Personality Tests
- Traits / domains or factors
- Objective; usually no right or wrong answers
- Ex. Myers-Briggs Type Indicator (MBTI)

Results are integrated into a single score interpretation.

Assessment Techniques (DITO)
- Documents: records, protocols, collateral reports
- Interviews: interview responses; forms: written, verbal, visual
- Tests: initial assessment / screening, then verification
- Observation: behavioral observation; observation checklist

Evaluation: Recommendation, Action Plan, Program Development
- Summarizing the results of assessment
Tests are used to SSCCRREEN:

Screen applicants
Self-understanding
Classify people
Counsel individuals
Retain, dismiss, or promote employees
Research for programs, test construction
Evaluate performance for decision-making
Examine and gauge abilities
Need for diagnosis and intervention

VALIDITY: the degree to which a test measures what it purports to measure

Content Validity
- Degree to which the test items represent the essence, the topics, and the areas that the test is designed to measure (the appropriate domain)
- Primary concern of test developers, because it is the content of the items that really reflects the "whatness" of the property intended to be measured
- Ex. achievement, aptitude, personality tests
- Table of Specifications (TOS), a.k.a. the blueprint: generate items, then have them checked/validated by (at least 3) experts, a.k.a. raters
- Ex. a test for Depression might have the domains Suicidal Ideation and Self-harm, each in its own box of the TOS

Procedures for achieving a high degree of content validity

1. Pre-survey or Review of Related Literature
- Focus on the theoretical constructs related to the test you are planning to make: tests already in use, the purpose of each test, areas covered, format, scaling techniques, etc.
- This may start the development phase of the instrument you are to construct.
- Item analysis focuses on the items themselves: ability and aptitude tests (tests that have right and wrong answers).
- Factor analysis focuses on the domains (whether a factor really is a factor): personality tests; uses Cronbach's alpha.
- Empirical research.

2. Development of the Table of Specifications (TOS)
- Determining the areas/concepts that will represent the nature of the variable being measured, and the relative emphasis of each area, is essentially judgmental.
- A detailed TOS includes the areas/concepts, objectives, and number of items in each area (a minimal example follows this list).

3. Consultation with Experts (raters)
- After making your own judgments, consult your thesis adviser or someone who has the expertise to judge the representativeness/relevance of the entries in your TOS.

4. Item Writing
- At this stage, you should know what type of items you are supposed to construct: the type of instrument, format, scaling, and scoring techniques.
- Every test item is based on the creative talent of the item writer and on his or her background in the test content.
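A hypothetical, minimal TOS for the depression example above (domains, objectives, and item counts invented for illustration):

    Area / Domain       Objective                           No. of items
    Suicidal Ideation   Assess frequency and intensity      10
    Self-harm           Assess history and current urges    10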

Construct Validity
- Theoretical domains, factors / components (e.g. Personality)
- Ex. constructs measured by two tests, X and Y: X-Optimism with Y-Optimism (convergent); X-Optimism with Y-Pessimism (divergent)

1. Convergent validity: direct correlations between variables (X and Y); the measure correlates well with other tests believed to measure the same construct.
2. Divergent (discriminant) validity: demonstrates that a test measures something different from what other available tests measure. The test should have low correlations with them, providing evidence for what the test does not measure.
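A minimal sketch, using simulated data, of how convergent and divergent evidence shows up as correlations (scale names hypothetical):

```python
# Two optimism scales should correlate highly with each other (convergent)
# and weakly or negatively with a pessimism scale (divergent).
import numpy as np

rng = np.random.default_rng(0)
trait = rng.normal(size=200)                            # latent optimism
optimism_x = trait + rng.normal(scale=0.5, size=200)    # test X
optimism_y = trait + rng.normal(scale=0.5, size=200)    # test Y, same construct
pessimism = -trait + rng.normal(scale=0.5, size=200)    # opposite construct

print(np.corrcoef(optimism_x, optimism_y)[0, 1])  # high -> convergent evidence
print(np.corrcoef(optimism_x, pessimism)[0, 1])   # negative -> divergent evidence
```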

Criterion-related Validity is estimated by correlating subjects' scores on a test with their behavior on an independent, real-life criterion. If the criterion you need to assess and correlate is occurring now, you are assessing concurrent validity. If the criterion is to occur in the future, you are assessing predictive validity.
Construct Validity (a.k.a. true validity) is the extent to which there is evidence that a test measures a particular hypothetical construct. For example, are we really measuring intelligence with an IQ test when there are so many competing theories about what intelligence actually is?
- Coefficient value: an estimate value
- Variability: margin of error (because we are human beings)
- Unsystematic error can result from varied assessment implementation, e.g. scoring via raters.
RELIABILITY: consistency

This suggests that the scores you gather on psychological tests are not, in fact, true or real scores; rather, those scores represent a combination of the influence of many factors. In theory, the reliability coefficient (rxx) gives us an index of the relative influence of true scores and error scores on any given test. It is the ratio of true-score variance to the total variance of the test.

Observed Test Score = True Score + Measurement Error
X = T + e

In actuality, rxx is very similar to a correlation (r). The addition of two identical subscripts tells us that this r represents an rxx.
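A minimal simulation (all numbers invented) of the X = T + e model, showing the reliability coefficient as the ratio of true-score variance to total observed variance:

```python
# X = T + e: reliability r_xx = var(T) / var(X).
import numpy as np

rng = np.random.default_rng(1)
true_scores = rng.normal(100, 15, size=1000)   # T
error = rng.normal(0, 5, size=1000)            # e (measurement error)
observed = true_scores + error                 # X = T + e

r_xx = true_scores.var() / observed.var()      # true variance / total variance
print(round(r_xx, 2))                          # close to 15**2/(15**2+5**2) = 0.90
```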

Models / Types of Reliability (the type depends on what test you are going to measure)

1. Test-Retest Reliability (Pearson's r)
- Gives the same test to the same group of test takers on 2 different occasions
- Scores on the 1st administration are compared to scores on the 2nd administration using r
- Interval: 15 days to a month. Too early: familiarity (carryover effect); too long: maturation
- Researchers often consider this a better measure of temporal stability (consistency of test scores over time)
- Assumption: people don't change between the 2 administrations
- PROBLEM: practice or carryover effects (beneficial to the test takers)

2. Alternate Forms Reliability (r)
- To eliminate the practice effects and other problems with the test-retest method (i.e. reactivity), test developers often give 2 highly similar forms of the test to the same people at different times. Reliability, in this case, is again assessed across the administrations.
- The goal is to develop an alternate form that is equivalent in content, response format, and statistical characteristics.
- PROBLEM: the difficulty of developing another (equivalent; same difficulty) form of the test

3. Split-half Reliability (Spearman-Brown prophecy)
- Measures the internal consistency of the test
- Eliminates / reduces the following problems:
  1. the need for 2 administrations of a test
  2. the difficulty of developing another form
  3. carryover or reactivity effects

Spearman-Brown formula:

    rxx = k*r / (1 + (k - 1)*r)

where:
    rxx = reliability coefficient
    r   = correlation coefficient (between the halves)
    k   = factor by which the test length changes (k = 2 when projecting a half-test up to full length)

- KR20 (Kuder & Richardson, 1937): for tests whose questions can be scored either 0 or 1 (binary; dichotomous)
- Coefficient alpha (Cronbach, 1951): for rating scales that have 2 or more possible answers; every item is compared to every other item
- PROBLEM: whether the test being split is homogeneous (i.e. measuring one characteristic) or heterogeneous (i.e. measuring many characteristics)
- Split-half reliability is closely tied to internal consistency: the two halves of the test are correlated with each other. (See the sketch below.)
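A minimal sketch (simulated 0/1 item responses; data invented) of an odd-even split-half with the Spearman-Brown correction, plus Cronbach's alpha computed from the same matrix:

```python
# Odd-even split-half with Spearman-Brown (k = 2), and Cronbach's alpha.
import numpy as np

rng = np.random.default_rng(2)
ability = rng.normal(size=(50, 1))                      # latent ability
p = 1 / (1 + np.exp(-ability))                          # P(correct) per person
items = (rng.random((50, 10)) < p).astype(int)          # 50 examinees x 10 items

odd = items[:, 0::2].sum(axis=1)                        # half-test scores
even = items[:, 1::2].sum(axis=1)
r_half = np.corrcoef(odd, even)[0, 1]
r_sb = 2 * r_half / (1 + r_half)                        # rxx = k*r / (1 + (k-1)*r)

k = items.shape[1]
alpha = (k / (k - 1)) * (1 - items.var(axis=0, ddof=1).sum()
                         / items.sum(axis=1).var(ddof=1))
print(round(r_sb, 2), round(alpha, 2))
```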

4. Scorer Reliability (inter-rater reliability)
- Judgments or ratings made by different scorers are compared using correlation to see how much they agree.
- If tests are being used to make important final decisions about people, then the reliability of the test should be high (around 0.95).
- Lower reliability levels may be acceptable when making preliminary decisions, sorting people into groups, conducting research, etc.

Standard Error of Measurement (SEM)
- An index of measurement inconsistency: the amount of expected error in an individual score (i.e. how much the obtained score is likely to differ from the true score)
- Standard deviation (in terms of scores): high = heterogeneous (more spread); low = homogeneous (less spread)
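The notes describe the SEM but do not write out the formula; the standard expression is SEM = SD * sqrt(1 - rxx). A quick worked example (numbers chosen for illustration):

```python
# SEM = SD * sqrt(1 - r_xx): expected error around an individual's score.
sd, r_xx = 15, 0.90               # e.g. an IQ scale: SD = 15, reliability .90
sem = sd * (1 - r_xx) ** 0.5
print(round(sem, 2))              # ~4.74: an obtained IQ of 100 likely lies
                                  # within roughly 100 +/- 4.74 (one SEM)
```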
Factors that can affect reliability
1. Errors that can increase or decrease an individual score: the test itself, the test administrator, the test scoring, the test taker
2. Test length: as a rule, adding more homogeneous items will increase the reliability of the test.
3. Method used to estimate reliability: split-half methods yield higher reliability estimates than test-retest or alternate-forms methods.
Psychometric properties:
- reliability (consistency)
- validity (measures what it intends to measure)
- norming
- standardization

The goal is to increase the probability of getting the true score and to minimize the standard error of measurement.
A test score is composed of the observed score (actual score), the true score (a reflection of what you really know), and the error score (the difference between the true score and the actual score).

- Trait error: sources of error that reside within the individual taking the test (excuses: hungry, headache, unprepared, etc.)
- Method error: sources of error that reside in the testing situation (lousy instructions, too warm/cold a room, missing pages, etc.)

Observed Score = True Score + Error Score

Reliability = True Score Variance / (True Score Variance + Error Score Variance)

Interrater reliability = Number of agreements / (Number of agreements + Number of disagreements)

More error means lower reliability.
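A minimal sketch of the interrater formula above as percent agreement (ratings invented):

```python
# Interrater (scorer) reliability as percent agreement:
# agreements / (agreements + disagreements).
def percent_agreement(rater_a, rater_b):
    agreements = sum(a == b for a, b in zip(rater_a, rater_b))
    return agreements / len(rater_a)

# Hypothetical pass/fail ratings from two scorers on 10 protocols:
print(percent_agreement(list("PPFFPFPPPF"), list("PPFFPFPFPF")))  # 0.9
```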
Stability: the same results are obtained over repeated administrations of the instrument.
- test-retest reliability
- parallel, equivalent, or alternate forms

Homogeneity: internal consistency (unidimensionality)
- item-total correlations; split-half reliability; Kuder-Richardson coefficient; Cronbach's alpha

Item-total correlations: each item on an instrument is correlated with the total score; an item with a low correlation may be deleted. The highest and lowest correlations are usually reported.
- only important if homogeneity of items is desired

Kuder-Richardson coefficient: when items have a dichotomous response, e.g. yes/no (binary)
Cronbach's alpha: for Likert-scale or linear graphic response formats
- compares the consistency of responses across all items on the scale (may need to be computed for each sample)

Equivalence: consistency of agreement of observers using the same measure, or among alternate forms of a tool
- parallel or alternate forms (described under stability)
- interrater reliability
TEST CONSTRUCTION (rudiments and process)
Test Planning
Decision to develop a standard test: (1) no test exists for a particular purpose, or (2) the tests existing for a certain purpose are not adequate for one reason or another.

- Wechsler's idea for the WAIS originated from the Army Alpha (for literate soldiers) and Army Beta (for illiterate soldiers); that is why there are both vocabulary and performance tests.
- The Wechsler scales cover both fluid and crystallized intelligence, while the Culture Fair Intelligence Test looks into specific intelligence; the difference between the two lies in how they define intelligence.

Subject Matter Experts: the test developer must seek the help of experts in evaluating the test items and even the identified constructs or components of the test.
Writing Items: depends on whether the scale is to assess an attitude, content knowledge, ability, or personality traits; stick to one pattern (e.g. don't shift from declarative to interrogative statements).
Guidelines
1. Deal only with one central thought; an item with more than one is called double-barreled.
   Poor item: My instructor grades fairly and quickly.
   Better item: My instructor grades fairly.
2. Be precise.
   Poor item: I received good customer service from Y Company.
   Better item: A member of the sales staff at Y Company asked me if he could assist me within a minute of entering the store.
3. Be brief.
4. Avoid awkward wording or dangling constructs.
   Poor item: Being clear is the overall guiding principle in writing items.
   Better item: The overall guiding principle in writing items is to be clear.
   * The active voice is preferred over the passive voice.
5. Avoid irrelevant information.
6. Present items in positive language.
   * If using "not" is inevitable, italicize or CAPITALIZE it.
7. Avoid double negatives.
8. Avoid terms like "all" and "none."
   Poor item: Which of the following never occurs?
   Better item: Which of the following is extremely unlikely to occur?
9. Avoid indeterminate terms like "frequently" or "sometimes."
10. Have someone else review your items.
Table of Specifications (blueprint)
- Cognitive domain: factual knowledge, ideas, and intellectual abilities
- Affective domain: deals with the values of a learner, including his interests, appreciations, and attitudes
- Psychomotor domain: readiness for a particular action, which may be mental, physical, or emotional
Item Analysis
- A way of measuring the quality of questions: seeing how appropriate they were for the respondents and how well they measured their ability / trait
- A way of reusing items over and over again in different tests with prior knowledge of how they are going to perform, creating a population of questions with known properties (e.g. a test bank)
- At least 3 or 4 times more items are prepared

Approaches to item analysis:
- Classical Test Theory (CTT)
- Latent Trait Models (LTM): Item Response Theory (IRT) and Rasch models (1P, 2P, 3P, 4P)

CTT: the true score model (X = T + e)
- the easiest and most widely used form of analysis
- a set of psychometric procedures used to test items and scales: reliability, difficulty, discrimination, etc.
- performed on the test as a whole rather than on the items; although item statistics can be generated, they apply only to that group of students on that collection of items
- assumes that every person has a true score on an item or a scale

Level of Difficulty: the proportion or percent of examinees that answered the item correctly. To determine the difficulty level, tally the number of examinees with the correct answer on the item and then apply the formula (see the sketch after the table):

    P = (Nu / N) x 100

where:
    P  = % of students who answered the item correctly
    Nu = number of examinees who answered the item correctly
    N  = total examinees comprising the 2 groups

Table of % levels of difficulty:

    91% and above   Very easy         Unacceptable
    79% - 90%       Easy              Acceptable
    26% - 78%       Moderate          Optimum difficulty / Highly acceptable
    11% - 25%       Difficult         Acceptable
    10% and below   Very difficult    Unacceptable
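A minimal sketch of the difficulty index and the classification bands in the table above (counts invented):

```python
# Difficulty index P = (Nu / N) * 100, classified per the bands above.
def difficulty(n_correct, n_total):
    p = n_correct / n_total * 100
    if p >= 91: return p, "very easy (unacceptable)"
    if p >= 79: return p, "easy (acceptable)"
    if p >= 26: return p, "optimum difficulty (highly acceptable)"
    if p >= 11: return p, "difficult (acceptable)"
    return p, "very difficult (unacceptable)"

print(difficulty(54, 100))   # (54.0, 'optimum difficulty (highly acceptable)')
```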

Level of Difficulty Using Upper and Lower Groups
1. Score the papers after checking.
2. Arrange the papers from highest to lowest score.
3. Determine the upper and lower groups by multiplying the total number of examinees by 27%.
4. The top 27% of the examinees is the upper group, while the bottom 27% of the total examinees comprises the lower group.
5. Get the frequency of examinees who answered the item correctly in each of the 2 groups.
6. Determine the difficulty level and the discriminating power.

Discriminating Power: determines the difference between examinees who did well and those who did poorly on a particular item. To determine the discrimination level, perform the steps in the difficulty level, then take the difference between the 2 groups and divide it by half of the total examinees (see the sketch below). (note: not finished)
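A minimal sketch of the discrimination index as described, using the upper/lower 27% groups (counts invented):

```python
# D = (upper-group correct - lower-group correct) / (half the examinees in
# the two groups combined).
def discrimination(upper_correct, lower_correct, n_both_groups):
    return (upper_correct - lower_correct) / (n_both_groups / 2)

# 27 examinees per group (top and bottom 27% of 100 papers):
print(discrimination(upper_correct=22, lower_correct=9, n_both_groups=54))  # ~0.48
```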
Discriminability: Item/Total Correlation (Point-Biserial Method)
- Every item is correlated with the total score.
- The point-biserial method is best used for dichotomously scored items / items with a correct answer.
- One dichotomous variable (correct/incorrect) correlated with one continuous variable (total score) is a point-biserial correlation.
- Correlate the proportion of people getting each item correct with the total test score.
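A minimal sketch (scores invented) of correlating one dichotomous item with the continuous total score via scipy's point-biserial function:

```python
# Point-biserial: one 0/1 item vs. the continuous total test score.
from scipy.stats import pointbiserialr

item = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]             # correct/incorrect on one item
total = [38, 21, 35, 30, 19, 40, 24, 33, 29, 22]  # total test scores
r_pb, p = pointbiserialr(item, total)
print(round(r_pb, 2))   # a high positive r_pb -> the item discriminates well
```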
CTT:
- gauges the performance itself, but not the trait that drives it
- has the test as its basis
- statistics apply only to the students taking that test, although they are often generalized to similar students taking a similar test

LTM:
- aims to look beyond performance at the underlying traits which are producing the test performance
- measured at the item level and provides sample-free measurement

Latent Trait Models (LTM): developed in the 1940s but widely used only from the 1960s; practically unfeasible to use without specialized software.
Item Response Theory (IRT): a family of latent trait models used to establish the psychometric properties of items and scales.
- sometimes referred to as "modern psychometrics" because it has largely replaced CTT
- can predict whether someone has guessed on an item
3 Basic Components
1. Item Response Function (IRF): a mathematical function that relates the latent trait (e.g. individual differences on a construct) to the probability of endorsing an item.
2. Item Information Function (IIF): an indication of item quality; a good item has a strong ability to differentiate among respondents.
3. Invariance: item characteristics are independent of the particular sample of respondents.

Item Response Theory (IRT): models the relationship between examinee trait level, item properties, and the probability of endorsing the item.
- can be converted into Item Characteristic Curves (ICCs), graphical functions that represent the respondent's ability.

Item Parameters: Location (b)
- An item's location b is defined as the amount of the latent trait needed to have a 0.5 probability of endorsing the item.

Item Parameters: Discrimination (a)
- Indicates the steepness of the IRF at the item's location
- Shows how strongly related the item is to the latent trait, like loadings in a factor analysis (see the sketch below)
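A minimal sketch of a 2-parameter logistic IRF consistent with these definitions: b is the trait level where the endorsement probability is 0.5, and a sets the steepness there (parameter values invented):

```python
# 2PL item response function: P(endorse | theta) = 1 / (1 + exp(-a*(theta - b))).
import math

def irf_2pl(theta, a, b):
    return 1 / (1 + math.exp(-a * (theta - b)))

print(irf_2pl(theta=0.0, a=1.5, b=0.0))   # 0.5 exactly at the item's location b
print(irf_2pl(theta=1.0, a=1.5, b=0.0))   # higher trait -> higher probability
```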
