
Essentials of a Good Psychological Test

Contents: Reliability - overview | Types of reliability | How reliable should tests be? | Validity | Types of validity | Sources of invalidity | Generalizability | Standardization | Recommended Links

1) Reliability - overview

Last updated: 25 Jul 2004

Reliability is the extent to which a test is repeatable and yields consistent scores.

Note: In order to be valid, a test must be reliable; but reliability does not guarantee validity.

All measurement procedures have the potential for error, so the aim is to minimize it. An observed test score is made up of the true score plus measurement error. The goal of estimating reliability (consistency) is to determine how much of the variability in test scores is due to measurement error and how much is due to variability in true scores. Measurement errors are essentially random: a person's test score might not reflect their true score because they were sick, hungover, anxious, in a noisy room, etc.

Reliability can be improved by:

- getting repeated measurements using the same test, and
- getting many different measures using slightly different techniques and methods.

For example, consider how university assessment for grades draws on several sources. You would not consider one multiple-choice exam question a reliable basis for testing your knowledge of "individual differences". Many questions are asked in many different formats (e.g., exam, essay, presentation) to help provide a more reliable score.
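The true-score model behind these ideas can be made concrete with a small simulation. The sketch below uses hypothetical data and assumes NumPy is available: observed scores are generated as true score plus random error, reliability is estimated as the correlation between two parallel administrations, and averaging several parallel measurements yields a more reliable composite (illustrating why repeated measurements help).

```python
import numpy as np

rng = np.random.default_rng(42)
n_people = 1000

# Classical true-score model: observed = true + error
true_scores = rng.normal(100, 15, n_people)   # latent "true" ability

def administer(error_sd=10):
    """One administration of a parallel test: true score plus random error."""
    return true_scores + rng.normal(0, error_sd, n_people)

test_a = administer()
test_b = administer()

# Test-retest / parallel-forms reliability estimate: correlation of two administrations
r_single = np.corrcoef(test_a, test_b)[0, 1]

# Averaging several parallel measurements reduces the error component
composite_a = np.mean([administer() for _ in range(4)], axis=0)
composite_b = np.mean([administer() for _ in range(4)], axis=0)
r_composite = np.corrcoef(composite_a, composite_b)[0, 1]

print(f"reliability of a single test:      {r_single:.2f}")     # roughly 0.70 with these values
print(f"reliability of a 4-test composite: {r_composite:.2f}")  # noticeably higher
```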

Types of reliability

There are several types of reliability, and a number of ways to check that a test is reliable. A few of them:

1. Test-retest reliability. The test-retest method of estimating a test's reliability involves administering the test to the same group of people at least twice. The first set of scores is then correlated with the second set. Correlations range between 0 (low reliability) and 1 (high reliability); it is highly unlikely they will be negative. Remember that change might be due to measurement error: if you use a tape measure to measure a room on two different days, any difference in the result is likely due to measurement error rather than a change in the room's size. However, if you measure children's reading ability in February and then again in June, the change is likely due to real changes in reading ability. The actual experience of taking the test can also have an impact (called reactivity): after a history quiz you might look up the answers and do better next time, or simply remember your original answers.

2. Alternate forms. Administer Test A to a group and then administer Test B to the same group. The correlation between the two sets of scores is the estimate of the test's reliability.

3. Split-half reliability. The relationship between scores on one half of the items and scores on the other half.

4. Inter-rater reliability. Compare scores given by different raters. For example, for important work in higher education (e.g., theses) there are multiple markers, which helps ensure accurate assessment by checking inter-rater reliability.

5. Internal consistency. Internal consistency is commonly measured as Cronbach's alpha (based on inter-item correlations), which ranges between 0 (low) and 1 (high). The greater the number of similar items, the greater the internal consistency. That is why you sometimes get very long scales asking a question in a myriad of different ways: adding more items raises Cronbach's alpha. Generally, an alpha of .80 is considered a reasonable benchmark. (A small computational sketch follows the guidelines below.)

How reliable should tests be?

Some reliability guidelines:

.90 = high reliability
.80 = moderate reliability
.70 = low reliability
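Internal consistency, mentioned above, can be estimated directly from an item-score matrix. A minimal sketch, assuming NumPy and a hypothetical dataset of 200 respondents answering 8 items that tap the same trait; the alpha formula itself (k/(k-1) times one minus the ratio of summed item variances to total-score variance) is standard:

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an (n_respondents x k_items) score matrix."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)       # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)   # variance of the total score
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical data: 200 respondents, 8 items that all reflect the same trait plus noise
rng = np.random.default_rng(0)
trait = rng.normal(0, 1, 200)
items = np.column_stack([trait + rng.normal(0, 1, 200) for _ in range(8)])

print(f"alpha with 8 items:           {cronbach_alpha(items):.2f}")
# Adding more items of similar quality pushes alpha higher,
# which is why long scales ask the same question many different ways.
print(f"alpha with first 4 items only: {cronbach_alpha(items[:, :4]):.2f}")
```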


High reliability is required when (note: most standardized tests of intelligence report reliability estimates around .90, i.e., high):

- tests are used to make important decisions
- individuals are sorted into many different categories on the basis of relatively small individual differences (e.g., intelligence)

Lower reliability is acceptable when (note: for most testing applications, reliability estimates around .70 are usually regarded as low - i.e., only about 70% of the variability in scores is consistent, the rest being attributable to error):

- tests are used for preliminary rather than final decisions
- tests are used to sort people into a small number of groups based on gross individual differences (e.g., height, or sociability/extraversion)

Reliability estimates of .80 or higher are typically regarded as moderate to high (roughly 20% or less of the variability in test scores is attributable to error). Reliability estimates below .60 are usually regarded as unacceptably low. Levels of reliability typically reported for different types of tests and measurement devices are given in Table 7-6 of Murphy and Davidshofer (2001, p. 142).

2) Validity

Validity is the extent to which a test measures what it is supposed to measure. Validity is a subjective judgment made on the basis of experience and empirical indicators. Validity asks: "Is the test measuring what you think it's measuring?" For example, we might define "aggression" as an act intended to cause harm to another person (a conceptual definition), but the operational definition might be observing:

- how many times a child hits a doll
- how often a child pushes to the front of the queue
- how many physical scraps he/she gets into in the playground

Are these valid measures of aggression? That is, how well does the operational definition match the conceptual definition?

Remember: in order to be valid, a test must be reliable, but reliability does not guarantee validity; it is possible to have a highly reliable test which is meaningless (invalid). Note that where validity coefficients are calculated, they range between 0 (low) and 1 (high).

Types of Validity

Face validity

Face validity is the least important aspect of validity, because validity still needs to be checked directly through other methods. All that face validity means is: "Does the measure, on the face of it, seem to measure what is intended?" Sometimes researchers try to obscure a measure's face validity - say, if it measures a socially undesirable characteristic (such as modern racism). But the more practical point is to be suspicious of any measure that purports to measure one thing but seems to measure something different, e.g., political polls - a politician's current popularity is not necessarily a valid indicator of who is going to win an election.

Construct validity

Construct validity is the most important kind of validity. If a measure has construct validity, it measures what it purports to measure. Establishing construct validity is a long and complex process. The qualities that contribute to construct validity include:

- criterion validity (includes predictive and concurrent)
- convergent validity
- discriminant validity

To create a measure with construct validity, first define the domain of interest (i.e., what is to be measured), then design measurement items which adequately cover that domain. A scientific process of rigorously testing and modifying the measure is then undertaken. Note that in psychological testing there may be a bias towards selecting items which can easily be written down and scored objectively, rather than other indicators of the domain of interest (i.e., a source of invalidity).

Criterion validity

Criterion validity consists of concurrent and predictive validity.

- Concurrent validity: "Does the measure relate to other manifestations of the construct the device is supposed to be measuring?"
- Predictive validity: "Does the test predict an individual's performance in specific abilities?"

Convergent validity

It is important to know whether the test returns similar results to other tests which purport to measure the same or related constructs. Does the measure match an external 'criterion', e.g. behaviour or another, well-established, test? Does it match it concurrently, and can it predict this behaviour? For example:

- Observations of dominant behaviour (criterion) can be compared with self-report dominance scores (measure).
- Trained interviewer ratings (criterion) can be compared with self-report dominance scores (measure).

Discriminant validity

It is also important to show that a measure does not measure what it isn't meant to measure - i.e., that it discriminates. For example, discriminant validity would be evidenced by a low correlation between a quantitative reasoning test and a reading comprehension test, since reading ability is an irrelevant variable in a test designed to measure quantitative reasoning.
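A minimal illustration of convergent versus discriminant evidence, using purely hypothetical data (NumPy assumed): self-report dominance should correlate strongly with observer-rated dominance (convergent) and only weakly with an unrelated measure such as reading comprehension (discriminant).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300

dominance = rng.normal(0, 1, n)                      # latent trait the scale targets
self_report = dominance + rng.normal(0, 0.6, n)      # the new questionnaire
observer_rating = dominance + rng.normal(0, 0.8, n)  # criterion: trained-rater scores
reading_comp = rng.normal(0, 1, n)                   # construct the scale should NOT reflect

convergent = np.corrcoef(self_report, observer_rating)[0, 1]
discriminant = np.corrcoef(self_report, reading_comp)[0, 1]

print(f"convergent evidence  (self-report vs observer): {convergent:.2f}")   # high
print(f"discriminant check   (self-report vs reading):  {discriminant:.2f}") # near zero
```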

Sources of Invalidity

- Unreliability.
- Response sets: a psychological orientation or bias towards answering in a particular way.
  - Acquiescence: the tendency to agree, i.e. to say "yes". Hence the use of half negatively and half positively worded items (though there can be semantic difficulties with negative wording).
  - Social desirability: the tendency to portray oneself in a positive light. Try to design questions so that social desirability isn't salient.
  - Faking bad: purposely saying 'no' or looking bad if there is a 'reward' (e.g. attention, compensation, social welfare, etc.).
- Bias.
  - Cultural bias: does the psychological construct have the same meaning from one culture to another? How are the different items interpreted by people from different cultures? Actual content (face) validity may differ across cultures.
  - Gender bias may also be possible.
  - Test bias. Bias in measurement occurs when the test makes systematic errors in measuring a particular characteristic or attribute; e.g., many argue that most IQ tests may well be valid for middle-class whites but not for blacks or other minorities. In interviews, which are a type of test, research shows a bias in favour of good-looking applicants. Bias in prediction occurs when the test makes systematic errors in predicting some outcome (criterion); it is often suggested that tests used in academic admissions and in personnel selection underpredict the performance of minority applicants. A test may also be useful for predicting the performance of one group (e.g. males) but be less accurate in predicting the performance of another (e.g. females).

Generalizability

Just a brief word on generalizability. Reliability and validity are often discussed separately, but sometimes you will see them both referred to as aspects of generalizability. Often we want to know whether the results of a measure or test used with a particular group can be generalized to other tests or other groups. Is the result you get with one test, let's say the WISC-III, equivalent to the result you would get using the Stanford-Binet? Do both these tests give a similar IQ score? And do the results you get from the people you assessed apply to other kinds of people? Are the results generalizable?

So a test may be reliable and it may be valid, but its results may not be generalizable to other tests measuring the same construct, nor to populations other than the one sampled. For example: if I measured the levels of aggression of a very large random sample of children in primary schools in the ACT, I might use a scale which is a perfectly reliable and perfectly valid measure of aggression. But would my results be exactly the same had I used another equally valid and reliable measure of aggression? Probably not, as it is difficult to get a perfect measure of a construct like aggression. Furthermore, could I then generalize my findings to ALL children in the world, or even in Australia? No. The demographics of the ACT are quite different from those of Australia as a whole, and my sample is only truly representative of the population of primary school children in the ACT. Could I generalize my findings about levels of aggression to all 5-18 year olds in the ACT? No, because I've only measured primary school children, and their levels of aggression are not necessarily similar to the levels of aggression shown by adolescents.

Standardization

Standardized tests are:

- administered under uniform conditions, i.e. no matter where, when, by whom or to whom it is given, the test is administered in the same way.
- scored objectively, i.e. the procedures for scoring the test are specified in detail so that any number of trained scorers will arrive at the same score for the same set of responses. For example, questions that need subjective evaluation (e.g. essay questions) are generally not included in standardized tests.
- designed to measure relative performance, i.e. they are not designed to measure ABSOLUTE ability on a task. In order to measure relative performance, standardized tests are interpreted with reference to a comparable group of people: the standardization, or normative, sample. For example, suppose the highest possible grade on a test is 100 and a child scores 60 on a standardized achievement test. You may feel that the child has not demonstrated mastery of the material covered in the test (absolute ability), BUT if the average of the standardization sample was 55, the child has done quite well (RELATIVE performance).

The normative sample should (for hopefully obvious reasons!) be representative of the target population. However, this is not always the case, so norms and the structure of the test may need to be interpreted with appropriate caution.
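To make the relative-performance idea concrete, the sketch below converts the raw score of 60 from the example above into a z-score and an approximate percentile against a normative sample with mean 55; the standard deviation of 10 is a hypothetical figure, and the percentile conversion assumes roughly normally distributed norm scores.

```python
from statistics import NormalDist

raw_score = 60
norm_mean, norm_sd = 55, 10   # hypothetical normative sample statistics

z = (raw_score - norm_mean) / norm_sd
percentile = NormalDist().cdf(z) * 100   # assumes roughly normal norm-group scores

print(f"z-score:    {z:+.2f}")
print(f"percentile: {percentile:.0f}  (the child outperformed about this % of the norm group)")
```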

Other
The diversity of psychological tests is staggering. Thousands of different psychological tests are available commercially in English-speaking countries, and doubtless hundreds of others are published in other parts of the world. These tests range from personality inventories to self-scored IQ tests, from scholastic examinations to perceptual tests. Yet despite this diversity, several features are common to all psychological tests and, taken together, serve to define the term "test". A psychological test is a measurement instrument that has three defining characteristics:

- A psychological test is a sample of behavior.
- The sample is obtained under standardized conditions.
- There are established rules for scoring, i.e. for obtaining quantitative (numeric) information from the behavior sample.

Behavior Sampling

A behavior sample is the representative sample of behavior under consideration. Human behavior is a complex phenomenon; we cannot measure all of a person's behavior, because it extends from birth to death.

Incidental behavior is not included in the behavior sample; the sample is based on specific human behavior. Every psychological test requires the respondent to do something. The subject's behavior is used to measure some specific attribute (e.g. introversion) or to predict some specific outcome (e.g. success in a job training program). The use of behavior samples in psychological measurement has several implications. First, a psychological test is not an exhaustive measurement of all possible behaviors that could be used in measuring or defining a particular attribute. For example, suppose you wished to develop a test to measure a person's writing ability. One strategy would be to collect and evaluate everything that person had ever written, from term papers to laundry lists. Such a procedure would be highly accurate, but impractical. A psychological test attempts to approximate this exhaustive procedure by collecting a systematic sample of behavior; in this case, a writing test might include a series of short essays, sample letters, memos and the like. The second implication is that the quality of a test is largely determined by the representativeness of this sample. For example, one could construct a driving test in which each examinee was required to drive the circuit of a race track. This test would certainly sample some aspects of driving, but would omit others such as parking, following signals or negotiating traffic. It would therefore not be a very good driving test.

Standardization

"Standardization implies uniformity of procedure in both administering and scoring the test." Moreover, a test should be standardized on a representative sample of the population in order to obtain norms; the results of tests must be comparable within the population. A psychological test is a sample of behavior collected under standardized conditions. The conditions under which a test is administered are certain to affect the behavior of the person or persons taking it. You would probably give different answers to questions on an intelligence test or a personality inventory administered in a quiet, well-lit room than you would if the same test were administered at a baseball stadium during extra innings of a play-off game. A student is likely to do better on a test given in a regular classroom environment than on the same test given in a hot, noisy auditorium. It is not possible to achieve the same degree of standardization with all psychological tests.

Individually administered tests are difficult to standardize because the examiner is an integral part of the test: the same test given to the same subject by two different examiners is certain to elicit a somewhat different set of behaviors. Through specific training, however, a good deal of standardization in the essential features of testing can be achieved. Strict standard procedures for administering various psychological tests help to minimize the effects of extraneous variables, such as the physical conditions of testing, the characteristics of the examiner, or the subject's confusion regarding the demands of the test.

Scoring Rules

The immediate aim of testing is to measure, or to describe in a quantitative way, some attribute or set of attributes of the person taking the test. The final defining characteristic of a psychological test is that there must be some set of rules or procedures for describing in quantitative or numeric terms the subject's behavior in response to the test. These rules must be sufficiently comprehensive and well defined that different examiners will assign scores that are at least similar. For a classroom test these rules may be simple and well defined: the student earns a certain number of points for each item answered correctly, and the total score is determined by adding up the points. For other types of tests the scoring rules may not be so simple or definite. Most mass-produced standardized tests are characterized by objective scoring rules. In this context, "objective" indicates that two people, each applying the same set of scoring rules to an individual's responses, will always arrive at the same score for that individual; thus two teachers who score the same multiple-choice test will always arrive at the same total score. On the other hand, many psychological tests are characterized by subjective scoring rules, which typically rely on the judgment of the examiner. The term "subjective" does not necessarily imply inaccurate or unreliable scoring, but simply that human judgment is an integral part of scoring the test. Most psychological tests are designed so that two examiners confronted with the same set of responses will give similar scores. A measure that does not meet this criterion cannot be considered a satisfactory psychological test.
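Objective scoring rules of the kind described here are easy to express as a fixed procedure. The toy sketch below uses a made-up answer key purely for illustration: because the rule is completely specified, any two scorers (or programs) applying it to the same responses must arrive at the same total.

```python
ANSWER_KEY = ["B", "D", "A", "A", "C"]   # hypothetical fixed scoring rule: 1 point per correct item

def score(responses):
    """Objective scoring: the same responses always yield the same total."""
    return sum(1 for given, correct in zip(responses, ANSWER_KEY) if given == correct)

print(score(["B", "D", "A", "C", "C"]))   # 4, regardless of who runs the scoring
```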

Norms
Scores on psychological tests rarely provide absolute, ratio-scale measures of psychological attributes. Thus it rarely makes sense to ask, in an absolute sense, how much intelligence, motivation or depth perception a person has. Scores on psychological tests do, however, provide useful relative measures: it makes perfect sense to ask whether Scott is more intelligent, more motivated, or has better depth perception than Peter. Psychological tests provide a systematic method of answering such questions.

One of the most useful ways of describing a person's performance on a test is to compare his or her test score to the test scores of some other person or group of people. Many psychological tests base their scores on a comparison between each examinee and some standard population that has already taken the test. When a person's test score is interpreted by comparing it to the scores of several other people, this is referred to as a norm-based interpretation. The scores to which each individual is compared are referred to as norms, which provide standards for interpreting test scores. A norm-based score indicates where an individual stands in comparison to the particular normative group that defines the set of standards.

Normative Group

Several different groups might be used to provide normative information for interpreting test scores. First, no single population can be regarded as the normative group. Second, a wide variety of norm-based interpretations could be made for a given raw score, depending on which normative group is chosen. These two points suggest that careful attention must be given to the definition and development of the particular norms against which a test score will be interpreted.

Types of Norms

In some cases norms may be developed that are national in scope, as in the case of large achievement test batteries.

1. Percentile Ranks/Norms. The most common form of norms is the percentile rank, which represents the simplest method of presenting test data for comparative purposes. A percentile rank represents the percentage of the norm group that earned a raw score less than or equal to the score of a particular individual. It is possible to compare one's score to several different norm groups. (A small computational sketch follows this list.)

2. Age Norms. Many psychological characteristics change over time; vocabulary, mathematical ability and moral reasoning are examples. An age norm relates a level of test performance to the age of the people who have taken the test. The principle involved in developing age norms is fairly simple: they can be developed for any characteristic that changes systematically with age, at least up to some age level. We need to obtain a representative sample at each of several ages and measure the age-related characteristic in each of these samples. While age norms emphasize the average level at a given age, it is important to remember that there is considerable variability within the same age, which means that some children at one age will perform on the test similarly to children at other ages.

3. Grade Norms. Another type of norm, commonly used in school settings, is the grade norm. Grade norms are very similar to age norms and are most popular when reporting the achievement levels of school children. The interpretation of grade norms parallels that of age norms. In areas such as emotional and social growth, as well as in some achievement areas, a child may not perform at the grade equivalent.
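Percentile ranks, described under Types of Norms above, can be computed directly from a norm group's raw scores. A minimal sketch with a hypothetical ten-person norm group (NumPy assumed), reporting the percentage of the norm group scoring at or below a given raw score:

```python
import numpy as np

norm_scores = np.array([34, 41, 45, 48, 50, 52, 55, 57, 60, 66])  # hypothetical norm group

def percentile_rank(raw, norms=norm_scores):
    """Percentage of the norm group scoring less than or equal to `raw`."""
    return 100.0 * np.mean(norms <= raw)

print(percentile_rank(55))   # 70.0 -> a raw score of 55 sits at the 70th percentile of this group
```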

Norm-referenced test
From Wikipedia, the free encyclopedia

A norm-referenced test (NRT) is a type of test, assessment, or evaluation which yields an estimate of the position of the tested individual in a predefined population, with respect to the trait being measured. This estimate is derived from the analysis of test scores and possibly other relevant data from a sample drawn from the population.[1] That is, this type of test identifies whether the test taker performed better or worse than other test takers, but not whether the test taker knows either more or less material than is necessary for a given purpose.

The term normative assessment refers to the process of comparing one test-taker to his or her peers.[1] Norm-referenced assessment can be contrasted with criterion-referenced assessment and ipsative assessment. In a criterion-referenced assessment, the score shows whether or not the test takers performed well or poorly on a given task, but not how that compares to other test takers; in an ipsative system, the test taker is compared to his previous performance.

Other types
As an alternative to normative testing, tests can be ipsative: the individual's assessment is compared to his or her own performance through time.[2][3] By contrast, a test is criterion-referenced when provision is made for translating the test score into a statement about the behavior to be expected of a person with that score. The same test can be used in both ways.[4] Robert Glaser originally coined the terms norm-referenced test and criterion-referenced test.[5] Standards-based education reform is based on the belief that public education should establish what every student should know and be able to do.[6] Students should be tested against a fixed yardstick, rather than against each other or sorted into a mathematical bell curve.[7] By requiring that every student pass these new, higher standards, education officials believe that all students will achieve a diploma that prepares them for success in the 21st century.[8]

Common use

Most state achievement tests are criterion referenced. In other words, a predetermined level of acceptable performance is developed and students pass or fail in achieving or not achieving this level. Tests that set goals for students based on the average student's performance are norm-referenced tests. Tests that set goals for students based on a set standard (e.g., 80 words spelled correctly) are criterion-referenced tests. Many college entrance exams and nationally used school tests use norm-referenced tests. The SAT, Graduate Record Examination (GRE), and Wechsler Intelligence Scale for Children (WISC) compare individual student performance to the performance of a normative sample. Test-takers cannot "fail" a norm-referenced test, as each test-taker receives a score that compares the individual to others that have taken the test, usually given by a percentile. This is useful when there is a wide range of acceptable scores that is different for each college.

By contrast, nearly two-thirds of US high school students will be required to pass a criterion-referenced high school graduation examination. One high fixed score is set at a level adequate for university admission, whether the high school graduate is college bound or not. Each state gives its own test and sets its own passing level, with states like Massachusetts showing very high pass rates, while in Washington State even average students are failing, as well as 80 percent of some minority groups. This practice is opposed by many in the education community, such as Alfie Kohn, as unfair to groups and individuals who don't score as high as others.

Advantages and limitations

An obvious disadvantage of norm-referenced tests is that they cannot measure progress of the population as a whole, only where individuals fall within the whole. Thus, only measuring against a fixed goal can be used to gauge the success of an educational reform program which seeks to raise the achievement of all students against new standards that assess skills beyond choosing among multiple choices. However, while this is attractive in theory, in practice the bar has often been moved in the face of excessive failure rates, and improvement sometimes occurs simply because of familiarity with, and teaching to, the same test. With a norm-referenced test, grade level was traditionally set at the level set by the middle 50 percent of scores.[9] By contrast, the National Children's Reading Foundation believes it is essential to assure that virtually all children read at or above grade level by third grade, a goal which cannot be achieved with a norm-referenced definition of grade level.[10]

Advantages of this type of assessment include that students and teachers alike know what to expect from the test and how it will be conducted and graded. Likewise, every school conducts the exam in the same manner, reducing inaccuracies such as time differences or environmental differences that may distract students. This also makes these assessments fairly accurate as far as results are concerned, a major advantage for a test.

Critics of criterion-referenced tests point out that judges set bookmarks around items of varying difficulty without considering whether the items actually comply with grade-level content standards or are developmentally appropriate.[11] Thus, the original 1997 sample problems published for the WASL 4th grade mathematics contained items that were difficult for college-educated adults, or easily solved with 10th grade level methods such as similar triangles.[12] Both the difficulty level of the items and the cut-scores that determine passing levels are changed from year to year.[13] Pass rates also vary greatly from the 4th to the 7th and 10th grade graduation tests in some states.[14]

One of the limitations of No Child Left Behind is that each state can choose or construct its own test, which cannot be compared to any other state's.[15] A Rand study of Kentucky results found indications of artificial inflation of pass rates which were not reflected in increasing scores on other tests, such as the NAEP or SAT, given to the same student populations over the same period.[16]

Graduation test standards are typically set at a level consistent with native-born four-year university applicants[citation needed]. An unusual side effect is that while colleges often admit immigrants with very strong math skills who may be deficient in English, there is no such leeway in high school graduation tests, which usually require passing all sections, including language. Thus, it is not unusual for institutions like the University of Washington to admit strong Asian American or Latino students who did not pass the writing portion of the state WASL test, but such students would not even receive a diploma once the testing requirement is in place. Although tests such as the WASL are intended as a minimal bar for high school, 27 percent of 10th graders applying for Running Start in Washington State failed the math portion of the WASL. These students applied to take college-level courses in high school, and achieve at a much higher level than average students. The same study concluded that the level of difficulty was comparable to, or greater than, that of tests intended to place students already admitted to the college.[17]

A norm-referenced test has none of these problems, because it does not seek to enforce any expectation of what all students should know or be able to do other than what actual students demonstrate. Present levels of performance and inequity are taken as fact, not as defects to be removed by a redesigned system. Goals of student performance are not raised every year until all are proficient. Scores are not required to show continuous improvement through Total Quality Management systems.

Disadvantages include that norm-referenced assessments measure the level students are currently at by comparing them against where their peers currently are, rather than against the level students should be at. A rank-based system only produces data telling us which students perform at an average level, which do better, and which do worse. This contradicts the fundamental belief, whether optimistic or simply unfounded, that all will perform at one uniformly high level in a standards-based system if enough incentives and punishments are put into place. This difference in beliefs underlies the most significant differences between a traditional and a standards-based education system.
____________________________________________________

Use of particular Psychological Test


Psychologists primarily use tests to supplement or assist in various phases of treatment. Test results are used along with clinical discussions to help you move from one phase of treatment to the next. Tests that measure symptoms provide a picture of what needs to change, and tests that reveal unique traits give the psychologist an idea of how to assist you. From the initial assessment to the closing of treatment, test results provide vital information that keeps the therapeutic experience relevant for you.

1. Assessment. Psychologists use tests during one of your first few sessions to assess your problem. A psychologist tests at this point to supplement the clinical interview and to determine the severity, duration and extent of your problems. A test such as the Beck Depression Inventory, for example, aids in making these measurements.

2. Setting Goals. Psychologists use test results to help you set goals for improvement. They use notable results, such as a high level of depression, to develop specific and measurable goals. Goals such as "reduce the frequency of depression to half of that initially discovered" are clear and can be measured to show improvement.

3. Determining Interventions. Psychologists also use tests to identify the most effective interventions for you. Personality tests, such as the Myers-Briggs Type Indicator, can reveal much about how you think and the way you relate to other people. These tests reveal your strengths as well as your weaknesses. For example, if your test results reveal that you are a highly analytical person, interventions such as reading and rational analysis of problems may be effective in helping you make desired changes.

4. Reviewing Progress. Most psychologists use tests as a way of reviewing what you have accomplished in treatment. If you scored high on the Beck Anxiety Inventory in your initial assessment, re-taking the test three months later may reveal lower anxiety and provide you with momentum to keep up the work.

5. Closure. A psychologist does not want to keep you dependent on her. Her goal is to build your competence and confidence so you can manage your problems on your own. Psychologists often use tests as a way of ending treatment: test results are used as evidence in closing discussions about the progress you have achieved.

Purpose

Psychological tests are used to assess a variety of mental abilities and attributes, including achievement and ability, personality, and neurological functioning.
Tests are used to measure skill, knowledge, intelligence, capacities or aptitudes, and to make predictions about performance. Best known is the IQ test; other tests include achievement tests designed to evaluate a student's grade or performance level, and personality tests. The latter include both inventory-type (question-and-response) tests and projective tests such as the Rorschach (inkblot) and thematic apperception (picture-theme) tests, which are used by clinical psychologists and psychiatrists to help diagnose mental disorders, and by psychotherapists and counselors to help assess their clients. Experimental psychologists routinely devise tests to obtain data on perception, learning and motivation. Clinical neuropsychologists often use tests to assess cognitive functioning of people with brain injuries. See also experimental psychology; psychometrics.
--------------------------------------------------------------------------------------

Item Analysis

Item Analysis allows us to observe the characteristics of a particular question (item) and can be used to ensure that questions are of an appropriate standard and to select items for inclusion in a test.

Introduction

Item Analysis describes the statistical analyses which allow measurement of the effectiveness of individual test items. An understanding of the factors which govern effectiveness (and a means of measuring them) can enable us to create more effective test questions and also to regulate and standardise existing tests.

There are three main types of Item Analysis: Item Response Theory, Rasch Measurement and Classical Test Theory. Although Classical Test Theory and Rasch Measurement will be discussed, this document will concentrate primarily on Item Response Theory.

The Models
Classical Test Theory

Classical Test Theory (traditionally the main method used in the United Kingdom) utilises two main statistics: Facility and Discrimination.

Facility is essentially a measure of the difficulty of an item, arrived at by dividing the mean mark obtained by a sample of candidates by the maximum mark available. As a whole, a test should aim for an overall facility of around 0.5; however, it is acceptable for individual items to have higher or lower facility (ranging from roughly 0.2 to 0.8).

Discrimination measures how performance on one item correlates with performance on the test as a whole. There should always be some correlation between item and test performance; discrimination is expected to fall in a range between 0.2 and 1.0.
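Under these definitions, facility and discrimination can be computed directly from a candidate-by-item score matrix. A rough sketch with hypothetical dichotomous (0/1) items, assuming NumPy; discrimination is taken here as the item-total correlation (including the item itself, as a simple approximation):

```python
import numpy as np

# Hypothetical results: rows = candidates, columns = items, 1 = correct, 0 = wrong
scores = np.array([
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [1, 0, 0, 0],
])

facility = scores.mean(axis=0)   # mean mark / maximum mark (maximum = 1 for 0/1 items)

total = scores.sum(axis=1)
discrimination = np.array([
    np.corrcoef(scores[:, j], total)[0, 1]   # correlation of item with whole-test score
    for j in range(scores.shape[1])
])

print("facility:      ", np.round(facility, 2))        # ideally around 0.2 - 0.8 per item
print("discrimination:", np.round(discrimination, 2))  # should be clearly positive
```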

The main problem with Classical Test Theory is that the conclusions drawn depend very much on the sample used to collect the information: there is an inter-dependence of item and candidate.

Item Response Theory

Item Response Theory (IRT) assumes that there is a relationship between the score gained by a candidate on an item or test (measurable) and their overall ability on the latent trait which underlies test performance (which we want to discover). Critically, the 'characteristics' of an item are said to be independent of the ability of the candidates who were sampled. Item Response Theory comes in three forms - IRT1, IRT2 and IRT3 - reflecting the number of parameters considered in each case.

For IRT1, only the difficulty of an item is considered (difficulty is the level of ability required to be more likely to answer the question correctly than incorrectly).

For IRT2, difficulty and discrimination are considered (discrimination is how well the question separates candidates of similar ability).

For IRT3, difficulty, discrimination and chance are considered (chance is the random factor which raises a candidate's probability of success through guessing).

IRT can be used to create a unique plot for each item: the Item Characteristic Curve (ICC). The ICC is a plot of the probability that the item will be answered correctly against ability. The shape of the ICC reflects the influence of the three factors:

- Increasing the difficulty of an item shifts the curve to the right, as candidates need to be more able to have the same chance of passing.
- Increasing the discrimination of an item increases the gradient of the curve: candidates below a given ability become less likely to answer correctly, whilst candidates above that ability become more likely to answer correctly.
- Increasing the chance parameter raises the baseline of the curve.
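These three effects follow directly from the usual three-parameter logistic form of the ICC. A minimal sketch (standard 3PL formula; the parameter values are hypothetical, and NumPy is assumed):

```python
import numpy as np

def icc_3pl(theta, difficulty=0.0, discrimination=1.0, chance=0.0):
    """Three-parameter logistic ICC: P(correct answer | ability theta)."""
    return chance + (1 - chance) / (1 + np.exp(-discrimination * (theta - difficulty)))

abilities = np.linspace(-3, 3, 7)
print(np.round(icc_3pl(abilities, difficulty=0.5, discrimination=1.2, chance=0.2), 2))
# IRT1 fixes discrimination (and chance at 0); IRT2 also frees discrimination; IRT3 frees all three.
```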

An interactive simulation can be used to investigate the factors governing the shape of the Item Characteristic Curve, with all three well-known IRT models (IRT1, IRT2 and IRT3) represented and their curves superimposed on one another to see how they relate. Of course, when you carry out a test for the first time you do not know the ICC of an item, because you do not know its difficulty (or discrimination). Rather, you estimate the parameters (using parameter estimation techniques) to find values which fit the data you observed. Using IRT models allows items to be characterised and ranked by their difficulty, and this can be exploited when generating item banks of equivalent questions. It is important to remember, though, that in IRT2 and IRT3 question difficulty rankings may vary over the ability range.

Rasch Measurement

Rasch measurement is very similar to IRT1 in that it considers only one parameter (difficulty) and the ICC is calculated in the same way. When it comes to using these theories to categorise items, however, there is a significant difference: if you analyse a set of data with IRT1, you arrive at an ICC that fits the data observed; if you use Rasch measurement, extreme data (e.g. questions which are consistently well or poorly answered) are discarded and the model is fitted to the remaining data.

