
Test Design with Cognition in Mind

Joanna S. Gorin, Arizona State University



One of the primary themes of the National Research Council's 2001 book Knowing What Students Know was the importance of cognition as a component of assessment design and measurement theory (NRC, 2001). One reaction to the book has been an increased use of sophisticated statistical methods to model cognitive information available in test data. However, the application of these cognitive-psychometric methods is fruitless if the tests to which they are applied lack a formal cognitive structure. If assessments are to provide meaningful information about student ability, then cognition must be incorporated into the test development process much earlier than the data analysis stage. This paper reviews recent advancements in cognitively-based test development and validation and suggests various ways practitioners can incorporate similar methods into their own work.
Keywords: assessment design, test construction, cognitively-based assessment, validity, item writing

The notion of cognitive psychology as relevant to test development is not new. Early conceptualizations of validity incorporated cognitive theory to establish connections between test properties and the measured construct (Messick, 1975). In comparison to earlier theories of assessment design, however, recent test development frameworks rely more heavily on cognition than ever before. Perhaps this shift results from theoretical and technological advances made in cognitive science and theory. Or it may be a consequence of necessity: the educational community wants to use test scores to make inferences about student cognition, and hence needs appropriate measurement tools. In either case, the release of the 2001 National Research Council (NRC) report Knowing What Students Know (KWSK; NRC, 2001) solidified the role of cognitive psychology as a critical component in test theory and design. The authors assert that "[a]ll assessments will be more fruitful when based on an understanding of cognition in the domain and on the precept of reasoning with evidence" (NRC, 2001, p. 178). Thus, test design can be viewed in terms of developing observational tools that provide evidence about cognition (Mislevy, 1994).

This new conceptualization of test development closely parallels the process of hypothesis testing in cognitive psychology; researchers must hypothesize a model describing an ability or trait, select an operational definition, build measures to collect observations, and evaluate the evidence in terms of their hypotheses. Given this similarity, it seems reasonable to argue that theories and methodologies could be borrowed from cognitive psychology to assist with test development. The purpose of this paper is to examine three aspects of test development that might be improved with the use of cognitive psychology principles and methods: construct definition, validation, and item writing. For each process, emphasis is placed on connecting recent empirical work applying cognitive psychology to psychometrics with practical implications for practitioners currently engaged in test development.

Defining the Construct with a Cognitive Model

A common beginning to test development is the generation of a clear construct definition (DeVellis, 1991; Ferrara & DeMauro, 2006; Netemeyer, Bearden, & Sharma, 2003; Wilson, 2005). A comprehensive definition of the construct helps maintain the focus of item and test development on the ability or trait of interest. Traditionally, this definition has looked much like an entry in a dictionary, consisting of a sentence or two describing the general meaning of a trait or skill. For example, the construct measured by reading comprehension on one large-scale standardized test's verbal section is defined as one's ability to reason with words in solving problems, and that "[r]easoning effectively in a verbal medium depends primarily on ability to discern, comprehend, and analyze relationships among words or groups of words and within larger units of discourse such as sentences and written passages" (ETS, 1998). Some high-stakes achievement tests have expanded these definitions to include lists of curriculum components (i.e., instructional objectives) or standards targeted by the test items (Ferrara & DeMauro, 2006). As general descriptions of score meaning, either of these definitions may be sufficient. Items on these tests are traditionally written so that each item can be tied to at least one of the standards to be measured. However, whether phrased as a verbal definition or a list of standards-based skills, the generality of their language presents a significant limitation for test development and validation. In terms of item writing, how can an item writer efficiently develop tasks without an understanding of the various skills comprising a domain or a curriculum standard? In terms of validation, how can evidence be gathered to support inferences about cognition when no cognitive terms have been defined?

Joanna S. Gorin, Division of Psychology in Education, Arizona State University, Box 870611, Tempe, AZ 85287-0611; joanna.gorin@asu.edu.


The major limitation of traditional construct definitions is their lack of detail, specifically in terms of substantive information, about the target skills and their relationship to observable student behaviors (Ferrara & DeMauro, 2006). Construct definitions including descriptions of individual cognitive processes and hypothesized relationships among processes can provide a stronger foundation for test development and score interpretation (Embretson, 1994; Mislevy, 1994; Messick, 1995). In their discussion of definitions of achievement constructs, Ferrara and DeMauro (2006) suggest that useful achievement construct definitions should include specifications of content and procedural knowledge, a measurement plan describing the nature of the assessment tasks, and hypotheses and evidence about the nomological network of the construct. Cognitive models, a representation commonly used in experimental, developmental, social, and cognitive psychology, can arguably provide the necessary tools to meet these requirements.

Cognitive models for construct definition. Cognitive models for assessment specify learners' representations of a domain in terms of requisite knowledge, skills, and abilities (KSA). They can be informed both by empirical and theoretical research in cognition and by observational data in educational settings (NRC, 2001). The specific form of a cognitive model, the types of models that exist, and their various advantages and disadvantages remain issues of some debate. Leighton (2004) proposes three types of cognitive models: cognitive models of domain mastery, of test specifications, and of task performance, each useful for understanding test performance and score meaning. For purposes of construct definition, the cognitive model of domain mastery is perhaps most relevant. The domain mastery model describes relationships between the various skills and knowledge that comprise expertise in a given content area. Mislevy's Evidence-Centered Design (ECD) framework incorporates similar models, though ECD uses multiple hierarchical models to describe the relationship between item responses and inferences regarding student ability (see Mislevy, this issue, for a more detailed description of the ECD models). The ECD model of observed responses to a task, the task model, is related probabilistically to the model of student mastery, the student model, which describes the types of abilities and skills to be measured by a test.

ECD's student model is similar to Leighton's domain mastery model, though the former is typically phrased in terms of inferences about skills rather than a list of skills alone. Other psychometricians have adopted more traditional cognitive models of processing as forms of construct definition. Embretson, for example, has developed componential cognitive processing models that describe the components of cognition for solving abstract, verbal, quantitative, and spatial reasoning assessments (Embretson, 1998; Embretson & Gorin, 2001; Embretson & Wetzel, 1987; Gorin & Embretson, 2006). In this approach, the cognitive model for a test is specified in terms of working memory, representation of information, and other traditional cognitive processes needed to describe the complete processing of an item. These componential models, unlike models of domain mastery, are tied directly to a specific item type rather than to a general domain or ability. This specificity can be useful for test development purposes such as item writing and validity analysis, but may be less useful for initial construct definition. Therefore, discussion of these models is left for later sections of this article related to validity models.

Developing a model for construct definition. Perhaps the most natural question for practitioners, most of whom are not cognitive psychologists, is how to begin to develop a cognitive model. Theory is a logical place to begin. Theories of cognition, learning, expertise, training, and assessment in various domains can provide rich sources of information for model development. When theories are lacking, test developers must be more innovative in generating model components. To understand more fully the constructs of interest to educators, researchers have drawn increasingly upon qualitative research methodologies commonly employed in cognitive psychology to generate hypothetical models. Interviews with expert teachers, content experts, and even novice learners provide information to form a more complete representation of a content domain. Collecting the information is only the first challenge; the second is to organize it in a meaningful way to guide test development and score interpretation.

Wilson (2005) describes one tool for organizing cognitive information in a domain called a construct map. A construct map details the development of a skill and associated knowledge within a particular domain; it is a cognitive model. Its construction is a complex process integrating theory with expert pedagogical knowledge and observational data. Although a construct map can be developed for any skill domain, it is particularly well suited for modeling the cognitive development of mastery in a domain that is structured in a stage-like manner, with typical misconceptions and skill weaknesses at various ordered levels of reasoning. Recent examples of construct maps for middle school science assessments have improved the substantive interpretability of item responses (Briggs, Alonzo, Schwab, & Wilson, 2006). Briggs et al. gathered extensive qualitative data from multiple sources to develop complex models of cognitive development for a variety of science tasks (see Figure 1). Notice that the descriptions of knowledge and skills are not tied to any particular task. Rather, they are more general cognitive characteristics of an individual that should apply to any task in a domain. When developing and validating items, the construct map can provide criteria by which the quality of the items can be judged. In the Briggs et al. study, the construct maps were then used to develop diagnostic multiple-choice items tied directly to cognitive theories regarding students' misconceptions and skill weaknesses.

A qualitative data collection method, verbal protocol, is commonly employed in cognitive psychology to understand various aspects of human processing on complex tasks. Verbal protocols, or think-alouds, consist of student verbalizations of response processes for an item either while solving a problem (concurrent or on-line verbal protocols) or once the item is completed (retrospective verbal protocols) (Ericsson & Simon, 1993). They make explicit the processes that characterize human cognition. Several psychometric researchers have noted the potential for verbal protocols to inform theory and model building in assessment design (Embretson & Gorin, 2001; Leighton, 2004; Gorin, 2006). Leighton (2004) characterized traditional quantitative test analysis methods as "missed opportunities" that could be avoided with qualitative data collection methods such as verbal protocols.

FIGURE 1. Generalized construct map of Properties of Light adapted from Briggs, Alonzo, Schwab, & Wilson (2006).
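To make the construct map idea concrete, the sketch below represents an ordered set of levels as a simple data structure. The level labels, descriptions, and misconceptions are hypothetical placeholders for illustration only; they are not the actual Properties of Light levels reported by Briggs et al. (2006).

# A minimal sketch of a construct map as an ordered data structure.
# All level content below is hypothetical, not taken from Briggs et al.

construct_map = [
    {"level": 0, "description": "No evidence of understanding of the target concept",
     "typical_misconceptions": []},
    {"level": 1, "description": "Recognizes surface features of the phenomenon",
     "typical_misconceptions": ["treats related concepts as interchangeable"]},
    {"level": 2, "description": "Applies the core principle in familiar contexts",
     "typical_misconceptions": ["overgeneralizes the principle to novel contexts"]},
    {"level": 3, "description": "Reasons with the principle across novel contexts",
     "typical_misconceptions": []},
]

def describe_level(level: int) -> str:
    """Return the substantive description attached to an observed level."""
    for entry in construct_map:
        if entry["level"] == level:
            return entry["description"]
    raise ValueError(f"Level {level} is not defined in the construct map")

print(describe_level(2))

Because the levels are ordered, an item response located at a given level can be reported in terms of the substantive description attached to that level rather than as a bare score.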

Although Leighton's discussion of verbal protocols for informing cognitive model development focused on models of task performance, the same general principles could apply to more general construct definitions. In terms of defining a construct or domain of expertise, verbal protocols from a variety of tasks could be gathered to identify common skills that generalize to the domain as a whole. The primary difference from verbal protocols for models of task performance alone would be the use of multiple item types (e.g., open-ended, multiple-choice, and simulation) to gain more domain-level information rather than item-specific processes. Information gained from the verbal protocols could then be used to confirm other qualitative data gathered from interviews with content experts and educators.

Validating Cognitive Models and Score Interpretations

Up to this point I have focused on cognitive models for purposes of construct definition. However, for any construct or skill domain, test developers have a multitude of item types from which to choose. Any single item type may only assess a subset of the skills within a particular domain. Additionally, each item may introduce item-specific skills that are not part of the domain but rather construct-irrelevant skills that affect item responses. Hence, a more specific cognitive model, one that is tailored to the item or an item type, provides additional information regarding connections between the test question and the skills in the domain.

It is this item-specific cognitive model that allows researchers to examine construct validity at the item level. Central to the issue of construct validity is the question of whether the processes measured by items and tests are those that were intended by the researcher. Ferrara and his colleagues have described this correspondence in terms of the alignment between the intended Knowledge, Skills, Processes, and Strategies (KSPS), specified by test developers, and the observed KSPS, those applied by the examinees when solving an item (Ferrara et al., 2004). When the intended KSPS and the observed KSPS are the same, alignment is strong and valid score interpretations are supported. To the extent that the observed KSPS include additional skills or lack those specified in the intended KSPS, item score interpretations are less valid.

However, the use of the term observed for purposes of discussion in this paper may be somewhat misleading. For many assessment items, not all KSPS involved in an item solution are individually observable based on item responses alone. Rather, we observe item responses from which inferences are made regarding the underlying processes. Item types with complex response formats often provide increased opportunities to observe the KSPS, whereas multiple-choice items offer very little observable information. For this reason, Ferrara et al.'s observed KSPS will be described here as the enacted construct, that is, the construct that is actually measured. Construct validity can be conceived as the relationship between the intended and enacted construct (see Figure 2). For some items, all aspects of the enacted construct can be directly observed. This makes comparison of the enacted construct to the intended construct a straightforward process. Unfortunately, the observable examinee-item interactions on most large-scale assessments are limited to examinee answers (e.g., selected response to a multiple-choice question, responses to an essay question), with little observable information regarding the underlying processes. In this more common situation, the task of comparing the intended and enacted construct becomes more challenging. The first step in this process is to build a model of the enacted construct using the same level of description as the model of the intended construct (i.e., the construct definition). The following section reviews several tools for developing enacted construct models and methods for evaluating their alignment to the intended construct.

FIGURE 2. A construct-model comparison approach to construct validity for assessment.
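As a deliberately simplified illustration of the comparison in Figure 2, the sketch below treats the intended and enacted constructs as sets of skills and uses set differences to flag skills that are missing from the enacted construct and skills that are construct-irrelevant. The skill labels are hypothetical.

# A minimal sketch of the intended-versus-enacted comparison in Figure 2,
# assuming hypothetical skill lists for a reading comprehension item.

intended = {"encode propositions", "build text representation", "draw inferences"}
enacted = {"encode propositions", "draw inferences", "eliminate implausible options"}

missing = intended - enacted       # intended but not actually measured
irrelevant = enacted - intended    # measured but construct-irrelevant
aligned = intended & enacted       # evidence supporting score meaning

print("Aligned:", sorted(aligned))
print("Missing from enacted construct:", sorted(missing))
print("Construct-irrelevant:", sorted(irrelevant))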

Item Difficulty Models

Model building is an iterative process. One begins with a hypothesis to describe a phenomenon, and then a series of model tests and revisions ensues. The model testing is typically an evaluation of the model's ability to describe observable data. Models of enacted constructs should be capable of describing the observable data generated from item responses. Specifically, differences in a model's KSPS for various items should account for differences in examinee responses across items. For example, as the KSPS required to solve an item become more complex (e.g., higher working memory load, more cognitive complexity), we would expect that (a) fewer examinees answer the item correctly and (b) item responses are generated more slowly. Similarly, for items in which the KSPS are relatively simple (e.g., low-level cognitive processes), the majority of respondents should answer correctly and relatively quickly. Both the number of respondents answering an item correctly (i.e., the item's difficulty) and the amount of time needed to answer an item (i.e., response time) are observable entities that can be statistically evaluated. The quality of a model of the enacted construct can be judged by its ability to account for variations in item characteristics. Constructing models to account for item characteristics, specifically to account for item difficulty, has been referred to as item difficulty modeling (IDM).

An item difficulty model typically includes a list of cognitive processes or skills organized in terms of the sequence of item processing. To test these models, each process is defined by observable features of an item that can be systematically coded and entered into statistical analyses to test the impact of the process on item difficulty. A model that includes processes and features describing the majority of variability in item difficulty is thought to describe accurately the true processes and skills measured by an item. Figure 3 provides an example of an item difficulty model for passage-based multiple-choice (MC) reading comprehension questions. The first step in IDM is to formulate the expected solution process in cognitive terms. As shown in Figure 3, the model hypothesizes two primary processes as a basis for item responses: text representation and decision processes. Text representation, for example, describes how the material from the passages is initially processed. Examinees encode the propositions in the text in order to build a representation of the passage and then use coherence and integration processes to solidify the representation for use in answering questions (Kintsch & van Dijk, 1978). Encoding, coherence, and integration are all cognitive terms used to describe student processing. The next step in IDM is to identify features of the test questions associated with each process. The encoding process, for example, is related to item features such as the vocabulary, propositional density, and average sentence length of the passage (Embretson & Wetzel, 1987).

The level of these features for an item should determine a portion of processing complexity, which should in turn drive the difficulty level of the item. Based on this model, we would expect that items with differing levels of vocabulary and propositional density should have differing difficulty levels. Applications of the difficulty model to reading comprehension items from various standardized achievement tests explained from 35% to 72% of the variance in item difficulties (Embretson & Wetzel, 1987; Gorin & Embretson, 2006). Once developed, two characteristics of item difficulty models should be evaluated. First, to what extent do model-related item features account for variations in item properties? Second, are the skills and knowledge relevant to item processing the same as those of the cognitive model of the construct specified in the construct definition? Statistical procedures exist to answer the first question. Methods such as regression can be used to estimate the strength of the relationship between item features and the statistical properties of the item (e.g., item difficulty, item discrimination, and response time). Several more advanced psychometric approaches designed explicitly for examining these relationships will be discussed briefly later. Answering the second question, regarding the similarity between the item's cognitive model and the cognitive model of the construct, is a more subjective issue, one that is central to the validity of score interpretations. What is the relationship between the enacted construct model and the intended construct model?

FIGURE 3. Gorin and Embretson's (2006) item difficulty model for GRE multiple-choice reading comprehension questions.
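The regression logic behind IDM can be sketched as follows. The feature codes and difficulty values below are fabricated for illustration only; an operational study would regress calibrated difficulty estimates for real items on systematically coded features such as those in the Figure 3 model.

import numpy as np

# A minimal sketch of regression-based item difficulty modeling.
# Columns: vocabulary difficulty, propositional density, passage-option overlap.
# All values are fabricated for illustration.
X = np.array([
    [0.2, 0.35, 0.8],
    [0.5, 0.50, 0.6],
    [0.7, 0.55, 0.4],
    [0.9, 0.70, 0.2],
    [0.4, 0.45, 0.7],
    [0.8, 0.65, 0.3],
])
b = np.array([-1.2, -0.3, 0.4, 1.5, -0.6, 1.1])   # item difficulty estimates

X1 = np.column_stack([np.ones(len(X)), X])        # add an intercept column
coefs, *_ = np.linalg.lstsq(X1, b, rcond=None)    # ordinary least squares weights

pred = X1 @ coefs
r_squared = 1 - np.sum((b - pred) ** 2) / np.sum((b - b.mean()) ** 2)
print("Feature weights:", np.round(coefs[1:], 2))
print("Variance in difficulty explained (R^2):", round(float(r_squared), 2))

The proportion of variance explained (the R-squared value) is the statistic reported in the studies discussed below when judging how completely a set of features accounts for item difficulty.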

The verified cognitive model for an item describes only the enacted construct. Whether the processes described in that model are the same as the processes and skills of the intended construct is not assessed in the psychometric analysis. Further consideration of methods for making these evaluations should be explored as more item difficulty models are developed for existing and newly developed tests. It is important to note that, unlike the cognitive model of the construct, the cognitive model of the item includes both domain-level processes and item-specific processes associated with answering MC items. For example, while encoding and coherence are general processes required in almost all verbal reasoning tests, mapping and evaluating truth status are processes specific to answering MC questions (see Figure 3). Other item types used to measure verbal reasoning may not require these skills. Part of the strength of IDM procedures is that they allow researchers to examine processing at the level of the individual item. Therefore, the extent to which processing is affected by skills from the intended construct versus the effects of item format or other unintended factors can be assessed. This information can be quite useful for evaluating the validity of score interpretations. Among the first items to be examined with IDM were those measuring traditional cognitive abilities such as abstract and spatial reasoning (Embretson, 1998; Diehl, 2004; Embretson & Gorin, 2001). These items are ideal for IDM given the extensive theoretical and empirical cognitive literature documenting sources of their processing difficulty.

For example, IDMs of Raven's Progressive Matrices (Raven, 1941) and the Revised Minnesota Paper Form Board Test (RMPFBT; Quasha & Likert, 1937) have provided detailed descriptions of the skills and processes underlying items as evidence of the construct validity of score interpretations from these tests (Diehl, 2004; Embretson, 1998; Embretson & Gorin, 2001). The difficulty models for both of these tasks have subsequently been used to streamline item writing, including automated item generators that use computer algorithms to write new items. More discussion of this application of IDM will be presented later. For now, let us consider the progressive matrix items as an illustration of an item difficulty model. In the progressive matrix problems, students are provided a 3 × 3 matrix of shapes with one blank cell (see Table 1). The examinee's task is to identify the appropriate shape for the missing cell based on patterns (i.e., rules) governing the completed portions of the matrix. The target construct to be measured by these items is abstract reasoning, including the ability to infer associations between concepts and to apply them in new situations. An IDM study of these problems showed that item difficulty could be deconstructed into two general processes: inferring rules and applying them (Embretson, 1998). The cognitive complexity of these processes comes from two sources: (1) the number of rules and (2) the complexity of the rules. The difficulty of an item should be directly related to these two characteristics of the item.

Items written with multiple rules rather than a single rule, and those with more complex versus simpler rules, should be more difficult to solve than other items. In the example given in Table 1, Item 1 has only one rule at a low level of complexity; Item 2 has two rules at high levels of complexity. The model of processing therefore predicts that Item 2 should be more difficult than Item 1. Statistical tests of the relationship between item difficulty and the model showed that 80% of the variance in item difficulties for matrix questions could be explained by the IDM features (Embretson, 1998). This high proportion of explained variance in item difficulty provides strong evidence that scores from these items can be interpreted in terms of the typical skills associated with the ability to think abstractly.

IDM of educational aptitude and achievement tests has also increased in recent years. Investigations of item difficulty models for mathematics have spanned the developmental continuum, including studies of middle school-level mathematics tests (Shute, Graf, & Hansen, 2006), SAT quantitative reasoning items (Gierl, Tang, & Wang, 2005; Ivie, Kupzyk, & Embretson, 2004), and multiple studies of GRE quantitative reasoning (Enright, Morley, & Sheehan, 2002; Enright & Sheehan, 2002; Graf, Peterson, Steffen, & Lawless, 2005). For example, Enright et al. (2002) used regression-based procedures to develop a difficulty model of mathematics word problems for the Graduate Record Examination (GRE).

Table 1. Model-Based Comparison of Features for Two Progressive Matrix Items

Item      Number of Rules    Most Complex Rule
Item 1    1                  Identity Relation (Low-Complexity Rule)
Item 2    2                  Distribution of Three (High-Complexity Rule)

(The Item Structure column of the original table, showing each matrix figure, is not reproduced here.)


Their model, based on observable characteristics of the test questions (e.g., content, type of context, and necessary formulas), was useful in explaining up to 90% of the variability in item difficulty for several item types. Using regression models of difficulty, word problems involving calculations of rates were more difficult if they required operations on variables rather than on numbers. Further, items framed in a context related to cost were generally less difficult than other items. When developing items for future use, the authors suggest that these features, in addition to others not discussed here, can be used to generate items with predetermined difficulty levels, which could significantly reduce the amount of item pretesting that is typically required for new items.

A series of studies of the Trends in International Mathematics and Science Study (TIMSS-R) data led by Tatsuoka and her colleagues provides an example of item difficulty model development and the useful applications that can result from the analyses (Tatsuoka, Corter, & Guerrero, 2004; Tatsuoka, Corter, & Tatsuoka, 2004; Tatsuoka, Guerrero, Corter, Yamada, & Tatsuoka, 2003; Chen, Gorin, Thompson, & Tatsuoka, 2006). The process began with an intensive review of items by content area experts, cognitive psychologists, and psychometricians. Their data were combined with theory regarding mathematics problem solving to generate a comprehensive list of skill attributes (e.g., unit conversion, using proportional reasoning), content attributes (e.g., basic concepts and operations in whole numbers and integers), and knowledge attributes (e.g., applying rules in algebra, applying and evaluating mathematical correctness) needed to solve the TIMSS questions successfully. An item-by-skill matrix was then generated to represent the cognitive structure of the test and served as the hypothesized cognitive model of the enacted construct. With data from over 20 countries, regression analyses and other psychometric model testing were conducted to test the fit of the model to the observed data. Using the hypothesized set of item attributes, significant portions of the variance in item difficulty were explained (R² = .87). Further, based on the skill descriptions of items, student performance was reparameterized in terms of the specific mastered attributes rather than a single overall proficiency score (Tatsuoka, Corter, & Tatsuoka, 2004; Tatsuoka et al., 2003).

These attribute scores permitted substantive interpretations of differences in student performance across countries. For example, students from Singapore, a country among the highest performers on the TIMSS, were shown to excel in reading and computational skills, whereas students from Japan, whose overall scores were similar to Singapore's, achieved their top performance based more on higher-level thinking skills. Score interpretations such as these go beyond the use of a single overall ability estimate and have the potential to make larger contributions to curriculum and policy decisions. However, the ability to make appropriate interpretations necessitates a complete understanding of the skills required by the items.

Item difficulty models for verbal assessments, such as reading comprehension, critical reading, sentence completion, and listening comprehension, have been similarly developed (Buck & Tatsuoka, 1998; Buck, Tatsuoka, & Kostin, 1997; Embretson & Wetzel, 1987; Gierl, Tang, & Wang, 2005; Gorin & Embretson, 2006; Sheehan & Ginther, 2001; Sheehan, Kostin, & Persky, 2006). Tests of verbal skills are particularly interesting for item difficulty models. Perhaps even more than in other domains, many of the item types used to assess verbal skills are far removed from the more natural tasks that require verbal skills. For example, tests of language competency for English language learners often consist of multiple-choice, fill-in-the-blank, and short answer items. These paradigms are far more constrained in terms of the skills relevant for understanding and communicating in English with native speakers. IDM analyses can help identify the relevant skills that are in fact captured by the test items, and the extent to which they influence performance.

IDM reading comprehension example. To illustrate the process of IDM, we examine a recent study of GRE reading comprehension items (Gorin & Embretson, 2006). To begin the model building process, prior research on MC reading comprehension questions was examined. Embretson and Wetzel (1987) had previously developed a cognitive processing model of MC reading comprehension items from the Armed Services Vocational Aptitude Battery (ASVAB). Using correlational analyses, they developed a model of cognitive complexity derived from two general processes: text representation and response decision (see Figure 3).

Human judges and automated text processing programs were employed to code the hypothesized features associated with the processing model. Text representation processes, including encoding and coherence, were coded as linguistic features (e.g., vocabulary difficulty, propositional density) of the passage. The response decision processes consisted of encoding, coherence, text mapping, and evaluation. The complexity of these processes was coded based on the lexical similarity between response options and the text (i.e., verbatim wording and paraphrased wording), and on the vocabulary and reasoning level of the response options. Regression models including codes for these features accounted for more than 70% of the variability in item difficulty in ASVAB items, a somewhat smaller proportion than was explained for the nonverbal abstract reasoning items.

Sheehan and Ginther (2001) identified similar relevant item features that determined item difficulty for Main Idea MC reading comprehension questions from the Test of English as a Foreign Language (TOEFL 2000). Their model described item difficulty in terms of activation processes by which an individual selects a correct or incorrect response alternative. Individuals consider the activation level of each response alternative in a question and then select the option that is most activated. Three types of item and passage features affected activation in Main Idea questions: Location Effects (the location within the text of relevant information for answering a particular question), Correspondence Effects (the lexical and semantic similarity between the response option and the text), and Elaboration of Information (the extent to which the topic of the question is discussed within the passage itself). Differences in these features for various items should describe differences in their difficulty. Table 2 shows how Sheehan and Ginther's difficulty model could be used to describe difficulty for three sample reading comprehension items. Each item is given a specific value for the location, correspondence, and elaboration of the key and the distractor. These values are then used to determine their activation levels. Items for which the correct response has the highest activation are easier than items with higher distractor activation.

Table 2. Item Difficulty Models for Three MC Reading Comprehension Items with Varying Key and Distractor Activation Levels

Item Feature                        Item 1               Item 2               Item 3
Key: Location                       Early                Delayed              Delayed
Key: Correspondence                 Verbatim             Paraphrase           Paraphrase
Key: Elaboration                    Strong Elaboration   Strong Elaboration   No Elaboration
Key: Resulting Activation           High                 Moderate             Low
Distractor: Location                Delayed              Delayed              Early
Distractor: Correspondence          Paraphrased          Paraphrased          Verbatim
Distractor: Elaboration             No Elaboration       No Elaboration       Strong Elaboration
Distractor: Resulting Activation    Low                  Low                  High
Expected Difficulty                 Easy                 Medium               Hard
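A minimal sketch of how feature codes like those in Table 2 might be combined into activation levels is given below. The ordinal scoring is an illustrative simplification, not Sheehan and Ginther's actual scoring procedure.

# A simplified, illustrative combination of the Table 2 feature codes.
LOCATION = {"Early": 2, "Delayed": 1}
CORRESPONDENCE = {"Verbatim": 2, "Paraphrase": 1, "Paraphrased": 1}
ELABORATION = {"Strong Elaboration": 2, "No Elaboration": 0}

def activation(location: str, correspondence: str, elaboration: str) -> int:
    """Higher totals stand in for higher activation of a response option."""
    return LOCATION[location] + CORRESPONDENCE[correspondence] + ELABORATION[elaboration]

# Item 3 from Table 2: low key activation, high distractor activation -> hard item
key_activation = activation("Delayed", "Paraphrase", "No Elaboration")
distractor_activation = activation("Early", "Verbatim", "Strong Elaboration")
print("Expected difficulty:", "Hard" if key_activation < distractor_activation else "Easier")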

Coding items only in terms of these three variables, Sheehan and Ginther were able to explain 86% of the variability in item difficulties. In both the Embretson and Wetzel (1987) and Sheehan and Ginther (2001) studies, the significant regression weights for item features and the high R-squared values were interpreted as evidence of construct-related validity. In the development of a model for the GRE items, Gorin and Embretson (2006) adopted characteristics of these earlier models to examine the substantive meaning of the GRE items' enacted construct. Item features derived from the earlier studies of the TOEFL and ASVAB items were quantified for the GRE items, and regression analyses were applied to assess the fit of the proposed psychometric model. For the most part, the model statistics supported the cognitive model, with both text representation and decision processes accounting for approximately 35% of the variance in item difficulty. However, this proportion of variance was much lower than that explained for MC reading comprehension items from the other tests on which the models were explicitly built. Further, difficulty of the GRE-V reading comprehension items was explained primarily by decision processes and far less by text representation processes than was the case for the TOEFL and ASVAB items. Based on their analysis, the researchers concluded that GRE reading comprehension items measure many skills that are consistent with models of verbal reasoning, but that the items on the test clearly measure a slightly distinct construct.

In terms of conclusions regarding construct validity, no cognitive model of the intended construct was available for comparison with the information regarding the nature of the enacted construct. At least two general conclusions can be drawn from these studies. First, our understanding of the construct measured by reading comprehension items, and of what makes them difficult, is still limited. Our models have yet to account for all the variability in item difficulties across a variety of tests of reading comprehension. Second, and perhaps more interesting from a test development perspective, whatever the construct measured by MC reading comprehension questions may be, it does not seem to be constant across all tests. The model of ASVAB items, with relatively short passages and low-level reasoning questions, was quite different from that of the TOEFL items with long passages, or the GRE items with high-level reasoning and inference questions. Test-specific characteristics such as the structure of the passages (e.g., length and vocabulary level) and the nature of the questions (e.g., factual and reasoning) appear to alter the nature of the verbal skills needed to answer the questions. This finding highlights the sensitivity of score meaning to relatively small changes in item structure, suggesting that fluctuations in item format can meaningfully change the construct measured by the test. Test developers should be aware of the potential unintended impact of item design characteristics on construct meaning. One way to do this is to be explicit from the beginning regarding the skills that are of interest, and then design items specifically to assess these skills.

Developing an Item Difficulty Model

One of the most common approaches to IDM is a correlational analysis of item difficulty estimates. First, an initial hypothesis of the skills, knowledge, and processes underlying an item is made. In many cases, the model developed for construct definition can be modified to incorporate processes related to the item format. Item difficulty parameter estimates obtained from operational administrations of test items are then regressed on the item features. Typically, the values of these features vary across items to generate questions across a range of difficulty levels (Bejar, Lawless, Morley, Wagner, Bennett, & Revuelta, 2002). The key to IDM is to identify the relevant features that drive item processing and to estimate their impact (Bejar, 1993; Bennett, 1999). A preliminary list of item features is often generated from theoretical literature relevant to the content area and, if available, empirical investigations of information processing. The difficulty modeling process is often iterative, such that item features are added to or removed from the difficulty model based on their contribution to the explanatory power of the model, as sketched below. The ultimate goal is to develop a model that most completely accounts for item difficulty based on features of the test question associated with theoretical processes. Given the amount of effort and time needed to conduct item difficulty modeling studies, test developers might be tempted to adopt models developed on similar item formats from other comparable tests.
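The iterative refinement mentioned above might be sketched as follows, with simulated data, illustrative candidate features, and an arbitrary threshold for retaining a feature; none of these values come from an operational study.

import numpy as np

# A minimal sketch of iterative feature selection for an item difficulty model:
# a feature is retained only if it adds appreciably to explained variance.
rng = np.random.default_rng(0)
n_items = 40
features = {
    "vocabulary": rng.uniform(0, 1, n_items),
    "prop_density": rng.uniform(0, 1, n_items),
    "passage_length": rng.uniform(0, 1, n_items),
}
difficulty = 1.5 * features["vocabulary"] + 0.8 * features["prop_density"] \
    + rng.normal(0, 0.3, n_items)

def r_squared(names):
    X = np.column_stack([np.ones(n_items)] + [features[n] for n in names]) if names \
        else np.ones((n_items, 1))
    coefs, *_ = np.linalg.lstsq(X, difficulty, rcond=None)
    pred = X @ coefs
    return 1 - np.sum((difficulty - pred) ** 2) / np.sum((difficulty - difficulty.mean()) ** 2)

selected = []
for name in features:                  # test each candidate feature in turn
    gain = r_squared(selected + [name]) - r_squared(selected)
    if gain > 0.02:                    # keep features that add explanatory power
        selected.append(name)
print("Retained features:", selected, "R^2 =", round(r_squared(selected), 2))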


The underlying assumption might be that there is a single correct cognitive model to describe processing for a particular item type. However, as illustrated by the comparison of results from IDM analyses of MC reading comprehension items on various tests, we cannot assume that a difficulty model developed for one test is generalizable to other tests with similar item types. Consequently, practitioners would be advised to examine difficulty models for similar items from other tests as an initial hypothesis regarding their own test, but until empirical evidence supports the model for the items of interest they should be cautious in drawing conclusions regarding score meaning. The next question is, then, where does one begin to develop a cognitive model of item difficulty for an item type if none previously exists? Although theory may provide a good basis for initial item difficulty model development, not all achievement constructs have been extensively researched by cognitive psychologists. Further, as test developers explore the use of new item types for which no information is available, we are in the dark as to how individuals may interact with the test. In such situations, test developers must adopt the role of a cognitive psychologist and develop their own theory. Several useful methods are discussed here.

Experimental designs. The use of experimental methods to test a theory is not a new idea (Frederiksen, 1986). Frederiksen quotes Messick (1975) to say that test validation in the construct framework "is integrated with hypothesis testing and with all the philosophical and empirical means by which scientific theories are evaluated" (p. 995). Given the apparent similarity between validity investigations and hypothesis testing, it is no surprise that experimental methods have been adopted to verify item difficulty model components. Two general experimental designs can be noted in the recent IDM literature: (1) manipulations of item features (Embretson & Gorin, 2001; Enright, Morley, & Sheehan, 2002; Gorin, 2005) and (2) manipulations of item format or context (Katz & Lautenschlager, 1994, 2001; Powers & Wilson, 1993). In the first approach, experimenters manipulate features of items associated with the processing model, such as the vocabulary level of a reading passage or the number of variables in a math problem, and examine the effect of these manipulations on statistical item parameters.

Those manipulations that cause changes in the item parameters are assumed to play a role in item processing. Those that do not are assumed to be incidental to cognitive processing (Bejar et al., 2002). The other type of experimental design, manipulation of item format or context, deals more directly with how changes in the conditions under which a person responds to an item change (or do not change) the statistical item parameters. Similar conclusions can be drawn as with direct manipulation of item features; changes to item format that affect the difficulty level of an item are presumed to affect processing.

Returning to the example of IDM for reading comprehension items, numerous experimental analyses have been applied to verify the construct measured by the items. Researchers interested in sources of item difficulty for reading comprehension test questions have parsed sources of processing difficulty according to the components of items: the passage versus the question (Katz & Lautenschlager, 1994, 2001; Powers & Wilson, 1993). To examine these individual effects, participants responded to questions either with or without the passage. These studies with college-aged students demonstrated that items from secondary and postsecondary achievement tests could in fact be solved without reading the passage associated with the question. The results have been used to argue that processing of these items can be accounted for with little processing of the reading passages themselves. This finding alone may not be of concern to researchers. However, if test users interpret scores as meaningful indicators of individuals' ability to read text and reason verbally, they may be mistaken.

A more formal method of testing individual difficulty model components of MC reading comprehension stemmed from the earlier correlational work with the GRE items. Gorin (2005) generated multiple item variants by modifying item features, including propositional density, use of passive voice, negative wording, order of information, and lexical similarity between the passage and response options, all of which were theoretically grounded in an item difficulty model. Two hundred seventy-eight undergraduates were given a subset of 27 items of varying types (i.e., inference, author's purpose, and vocabulary in context) associated with a variety of passages (e.g., humanities, social sciences, and physical sciences). Results showed that manipulation of some passage features, such as increased use of negative wording, significantly increased item difficulty. Others, such as altering the order of information presentation in a passage, did not significantly affect item difficulty but did affect reaction time. These results provide evidence that certain theoretically based item features directly affect processing and can be considered part of the measured construct. However, nonsignificant results of several manipulations challenge the validity of the processing model, given that no direct links between theoretically relevant item features and individual item processing were established. Experimental manipulations such as these, when applied in item development stages, can be useful in establishing the meaning of the construct measured by a test and can suggest potential modifications that could strengthen the validity of score interpretations.
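A minimal sketch of how a single feature manipulation might be evaluated is given below, using a two-proportion z-test on fabricated counts; an actual study such as Gorin (2005) would also model response times and work with calibrated item parameters rather than raw proportions correct.

from math import sqrt
from statistics import NormalDist

# Compare the proportion of examinees answering an original item and a
# negatively worded variant correctly. Counts are fabricated for illustration.
correct_original, n_original = 112, 140     # proportion correct = 0.80
correct_variant, n_variant = 91, 138        # proportion correct ~ 0.66

p1, p2 = correct_original / n_original, correct_variant / n_variant
p_pool = (correct_original + correct_variant) / (n_original + n_variant)
se = sqrt(p_pool * (1 - p_pool) * (1 / n_original + 1 / n_variant))
z = (p1 - p2) / se
p_value = 2 * (1 - NormalDist().cdf(abs(z)))   # two-sided two-proportion z-test

print(f"Difficulty shift: {p1 - p2:.2f}, z = {z:.2f}, p = {p_value:.3f}")
# A significant shift suggests the manipulated feature plays a role in processing;
# a null result suggests the feature may be incidental (Bejar et al., 2002).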

Process-tracing techniques. In both the correlational and experimental design approaches to IDM, one of the key limiting factors is that the researcher must begin with a preconceived notion of how an individual solves a problem. Theory provides one source of information. However, to develop a strong hypothesis regarding the enacted construct for specific items, data regarding specific item processing, rather than domain processing, are crucial. In cognitive psychology, the use of process-tracing techniques has shown promise for gathering cognitive processing information from individuals while they solve complex cognitive tasks such as those found on achievement and ability tests (Cooke, 1994). Process-tracing methods record a prespecified type of data on-line as a person solves a problem in order to make inferences about the processes underlying task performance (Cooke, 1994). Two such methods have been advocated and used by researchers for IDM analysis: verbal protocols and digital eye tracking. Verbal protocols were described earlier with respect to cognitive model development for construct definition.


Some IDM researchers have argued that verbal protocols applied to specific tasks can provide unique insight into individual processing, including information about student misconceptions, skill weaknesses, and uses of various problem solving strategies (Desimone & LeFloch, 2004; Leighton, 2004). Unlike more structured data collection methods that presuppose the variables of interest to the researcher, verbal protocols do not restrict the information provided by the student. In terms of item difficulty modeling, this approach can be useful as an initial investigation when researchers know little about an item type, or as a confirmatory approach to verify a hypothesized processing model. Despite their informativeness, the excessive time and effort needed to collect and analyze verbal protocol data have limited their application to test design. Still, several examples of verbal protocols' usefulness for standardized test item analysis exist (Ferrara et al., 2004; Katz, Bennett, & Berger, 2000; Leighton, 2004; Williamson, Bauer, Steinberg, Mislevy, Behrens, & DeMark, 2004). Ferrara et al. (2004) developed a systematic approach to cognitive item analysis with verbal protocols called the Cognitive Laboratory Analysis (CLA) Framework. Cognitive interviews were conducted with 63 middle-school students solving problems from a state standardized science test. The interviews were then coded into four broad categories: science topics, science skills, broader cognitive processing, and examinee response strategies. The coded interviews were analyzed to compare the intended construct, specified by the state content standards, and the enacted construct, the description of item processing coded from the CLA. Though only reporting preliminary analyses from a larger project, Ferrara et al.'s study illustrates how process-tracing methods such as verbal protocols can provide the information necessary to judge the quality of items and the meaning of test scores.

Some researchers have, however, offered criticisms of verbal protocols. First, students' attempts to describe accurately their own cognition during processing (i.e., concurrent protocols) may interfere with and alter the interaction between student and stimulus from what would normally transpire. This interference may be more problematic depending on the nature of the task and the extent to which verbalization capitalizes on the same cognitive functions as the target task.

For example, verbalization on a spatial reasoning task may cause little interference because spatial reasoning engages different cognitive systems than those used to produce verbal accounts of behavior. Alternatively, verbal protocols collected while performing a task that requires verbal processes may alter the original problem solving process so significantly that the tasks are no longer equivalent with and without verbal protocols. To overcome this limitation, retrospective protocols can be used. Though retrospective approaches resolve the cognitive interference issue, this method introduces challenges related to students' accuracy in recollecting processing after the fact. Further, some criticism of both concurrent and retrospective procedures addresses the issue that some processing is implicit or unconscious; students are not aware of, nor can they verbalize, behavior related to these skills. As a compromise, some researchers use verbal protocols to develop objective self-report measures of processing. For example, information gained from verbal protocols of student problem solving has been used to develop strategy inventories regarding test-taking methods employed by students (Powers & Wilson, 1993). When examined relative to item and test performance, responses to strategy inventories can provide additional support for a hypothesized processing model, or alternatively, they may suggest processing components not previously considered in the modeling process.

Digital eye-tracking analysis. As an alternative to verbal protocols, some researchers suggest the use of a different technology, eye tracking, to gather information about individual processing (Leighton, 2004; Snow & Lohman, 1989). Eye-tracking methods record individuals' eye movement data during stimulus processing. A relatively new methodology in psychometric research, eye tracking can provide many of the same benefits to IDM as other process-tracing techniques, such as verbal protocols, but without some of the disadvantages. The assumption of the technology is that the location of an individual's eye fixation corresponds to an allocation of visual attention and, in turn, of cognitive processing resources. Some research supports this argument, citing empirical connections between visual fixations and cognitive processing (Underwood, Jebbett, & Roberts, 2004; Rayner, 1998; Rayner, Warren, Juhasz, & Liversedge, 2004).

However, direct evidence that the location of an individual's gaze corresponds directly to the information being processed by the individual is still needed. To the extent that this association does not hold, the usefulness of eye-tracking data for the purpose of IDM is weakened. In comparison to its use in cognitive psychological research, eye-tracking studies of standardized test questions have been relatively scarce. This is likely due to the relatively high cost of the equipment as well as the limited knowledge of the technology outside of cognitive psychology and cognitive science. One recent example of eye tracking for IDM exists for an assessment of verbal reasoning (Gorin, 2006). Preliminary eye-tracking data were collected on a small number of students solving reading comprehension questions from the SAT-I. Figure 4 shows the data collected for two of these participants on the same multiple-choice question. The lines and black circles superimposed on the test question represent the sequence and location of each student's visual gaze. Several interesting conclusions can be drawn from this illustration. First, it is clear that the two students engaged in different processes while solving the problem. Both students answered the question correctly. However, the first student did so without ever looking at the passage itself. The second student read the question in the upper right-hand corner of the screen, then moved visual focus to the passage on the left half of the screen, and finally returned to the response options on the right before selecting the correct answer. If we examine these results in comparison to the processing model developed for the GRE items (see Figure 3), the eye-tracking pattern for Student 1 does not seem to conform to the model. This student either did not consider information in the passage, as modeled in the mapping processes, or retrieved the representation of the text from memory and performed mapping on this representation. However, the retrieval of the text from memory is not included in the hypothesized model. Perhaps a second processing model for individuals who generate comprehensive representations of the text might be needed to supplement the original processing model.

FIGURE 4. Eye-fixation patterns for two students' solutions to the same SAT-I reading comprehension question.

Information such as that available from eye movements may provide insight into multiple cognitive models that describe item processing. Further research in this area is needed to examine the usefulness of eye trackers as a model development tool. They may be most useful in identifying additional skills and knowledge relevant to the cognitive model and in isolating multiple solution paths for problem solving.

Statistical Methods for Model Validation

As cognitive models become more central in test development, validation of test scores becomes increasingly influenced by the validity of the cognitive information incorporated into the item.

Consequently, statistical models that allow for the examination of items at the same level specified in the cognitive model, rather than at the more general level of the item response, are needed. As a result of changes to the conceptual framework for test development, practitioners should be aware that their statistical and psychometric needs may have changed as well. Recently, several new psychometric models, primarily item response theory (IRT) models, have been introduced that effectively leverage cognitive information. The majority of these models, often collectively termed cognitive-psychometric models, have been introduced in response to criticisms of traditional testing models for their disconnect from substantive theory. Although a detailed description of these models is not the focus here, it is important to mention them as appropriate statistical tools for making use of the cognitive information described up to this point.

Two related IRT models developed for use in cognitive-psychometric modeling are the linear logistic latent trait model (LLTM; Fischer, 1973) and the multicomponent latent trait model (MLTM; Whitely, 1980). The LLTM incorporates content information into the calculation of the probability of a correct response to an item. The model includes this information in the form of weights representing the impact that any cognitive component of a trait may have on the difficulty of an item. Essentially, item difficulty is decomposed into a linear combination of cognitive attributes and the impact of those attributes on solving an item. The presence or absence of a particular cognitive attribute in an item's solution path is represented by a design matrix relating items to processing components. The MLTM, a multidimensional extension of the LLTM, can be similarly applied to items measuring traits with multiple components (Whitely, 1980). In this model, it is assumed that the processing stages must be completed correctly in sequence in order to respond correctly to the overall item; failure to complete any of the stages (components) results in an incorrect response. Both latent trait models are useful for substantive examinations of score meaning and validity because they provide mechanisms to test the fit of cognitive processing models to the data.
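The core idea of the LLTM can be sketched as follows; the design (Q) matrix and attribute weights are illustrative assumptions, since in practice the weights are estimated from response data rather than specified in advance.

import numpy as np

# A minimal sketch of the LLTM decomposition: item difficulty is a weighted sum
# of cognitive attribute indicators rather than a free parameter per item.
Q = np.array([          # rows = items, columns = cognitive attributes
    [1, 0, 0],          # item 1 requires attribute A only
    [1, 1, 0],          # item 2 requires attributes A and B
    [1, 1, 1],          # item 3 requires attributes A, B, and C
])
eta = np.array([-0.5, 0.8, 1.2])    # assumed contribution of each attribute to difficulty

item_difficulty = Q @ eta           # beta_i = sum_k q_ik * eta_k

def p_correct(theta: float, beta: np.ndarray) -> np.ndarray:
    """Rasch probability of a correct response for an examinee with ability theta."""
    return 1 / (1 + np.exp(-(theta - beta)))

print("Model-implied item difficulties:", item_difficulty)
print("P(correct) for theta = 0.5:", np.round(p_correct(0.5, item_difficulty), 2))

Comparing the fit of the constrained model above with that of a model estimating a free difficulty parameter for each item is one way to test whether the hypothesized cognitive attributes adequately account for the item responses.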


Unlike the cognitive latent trait models just described, another set of models based on classification rules has also leveraged cognitive information in modeling assessment data. Applications of these models have primarily focused on diagnostic score reporting rather than examinations of validity, though the analyses provide information that can be used for both purposes. Tatsuoka's rule space model (RSM) is an approach to data analysis designed to provide accurate feedback to groups and individuals regarding skill mastery (Tatsuoka, 1985, 1995). It begins with an evaluation of the skills needed to solve a problem correctly. The student's skill level is then diagnosed based on responses to items and the association between the items and skills. The RSM has been successfully applied to tests of mathematics, reading comprehension, and listening to generate cognitive reports of student ability (Tatsuoka, Corter, & Tatsuoka, 2004; Buck & Tatsuoka, 1998; Buck et al., 1997). Two more recent cognitive-psychometric models for diagnosis are structured as constrained latent class models, where each latent class is associated with a different diagnostic state. The fusion model (Hartz, 2002) incorporates processing logic similar to the RSM, though it adds several parameters to allow for a more realistic model of the relationship between student skills and item processing. von Davier's General Diagnostic Model (GDM; von Davier, 2005) was designed as a more flexible model that, under certain conditions, simplifies into other existing diagnostic models that have recently been proposed. Two advantages of the GDM over models such as the fusion model are its ability to accommodate polytomous response data and the reduced computational complexity of its parameter estimation procedures. To date, the GDM has been fit to data from language competency assessments as well as large-scale NAEP data (von Davier, 2005; Xu & von Davier, 2006). Both analyses suggest that this model may be a viable option for cognitive-psychometric analysis of test data. Although each of these models has been successfully applied to assessment data for research purposes, few studies have examined their feasibility in operational settings. Application of these methods to the complete testing process (i.e., from construct definition to score reporting and interpretation) is needed to examine their advantages and limitations for operational testing programs.
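The classification logic these diagnostic models share can be illustrated with a deliberately simplified sketch: responses are interpreted through an item-by-skill Q-matrix, and a crude mastery heuristic stands in for the formal statistical machinery of the rule space model, fusion model, or GDM.

import numpy as np

# A simplified illustration of Q-matrix-based skill diagnosis. The mastery cutoff
# below is an arbitrary heuristic, not part of any of the models described above.
Q = np.array([              # rows = items, columns = skills
    [1, 0],                 # item 1 requires skill 1
    [1, 1],                 # item 2 requires skills 1 and 2
    [0, 1],                 # item 3 requires skill 2
    [1, 0],                 # item 4 requires skill 1
])
responses = np.array([1, 0, 0, 1])      # one examinee's scored item responses

for skill in range(Q.shape[1]):
    relevant = Q[:, skill] == 1
    mastery_rate = responses[relevant].mean()
    status = "mastered" if mastery_rate >= 0.7 else "not mastered"
    print(f"Skill {skill + 1}: {mastery_rate:.2f} of relevant items correct -> {status}")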

Several advantages to test developers have already been suggested in the literature. First, test developers interested in diagnostic score reports can use the models to generate proficiency scores for individual skills with better reliability than traditionally reported subscale scores. Further, technical manuals can be augmented with more detailed cognitive descriptions of the construct and of score meaning, derived from estimates of the impact of cognitive processes. Finally, test banks can be reorganized with items classified according to their skill structure, rather than simply by difficulty level or content codes, which can allow for skill-based item selection procedures that yield more informative test scores for individual examinees. However, to maximize the benefits of these procedures, the test items must be designed with this purpose in mind. Regardless of the sophistication of a statistical procedure, it cannot spontaneously generate cognitive information that is simply not there. The next section discusses approaches to item writing that increase the cognitive information available from test scores.

Item Writing with Cognitive Models
Aside from construct definition and validation, cognitive models are useful for guiding item design and writing. Descriptions of item writing often characterize the process as part art and part science (Haladyna, 2004; DeVellis, 1991; Wilson, 2005). Given the increasing pressure to extract meaningful information about student skills and knowledge from item responses, item development procedures that rely more heavily on the science and less on the art are desirable. Two advances stemming from cognitive frameworks of assessment design have significantly improved our ability to approach item writing scientifically: (1) the incorporation of innovative item types and (2) automated item generation procedures.

Developments in item format. Increasingly, educators and researchers have challenged large-scale test developers to move beyond MC-based assessments toward more cognitively rich item design (Lane, 2004).
Some of the most innovative changes in item writing have been attributed to rapid advancements in computer technology for adaptive and nonadaptive computerized testing. Computer-based item formats have the potential to improve the validity of score interpretations by capturing the construct of interest more realistically (Sireci & Zenisky, 2006). Explorations of new item types, often called innovative item types, have led to a variety of item forms that range widely in technological sophistication. At the technologically sophisticated end, simulation-based assessments have become popular for measuring complex skills. Simulations are a powerful assessment tool for gathering detailed information about task processing and student cognition (Bennett, Jenkins, Persky, & Weiss, 2003; Mislevy, Steinberg, Breyer, Almond, & Johnson, 2002). Unlike other, more static item types, simulations provide opportunities to observe student behavior and reasoning in cognitively rich contexts that mirror the complexity of the real world. Large-scale assessment projects such as the National Assessment of Educational Progress (NAEP) include simulation-based assessments among their measures to augment the achievement information available from more traditional item formats (Bennett et al., 2003). Recently, a computer-based simulation of networking skills, NetPass, was developed by Cisco Systems using the ECD approach to complex assessment design (Williamson et al., 2004). In contrast to traditional fixed-response multiple-choice exams, NetPass was designed to be an on-line, performance-based assessment capable of providing formative diagnostic feedback (Behrens, Mislevy, Bauer, Williamson, & Levy, 2004). A simulated network space was constructed in which examinees interact with network components to troubleshoot, design, and implement specified networks. Although NetPass has been reported only in its early stages of development, preliminary results indicate that responses to these tasks can provide information regarding the networking skills that affect the functionality and efficiency of a network design, the two key components of a correct outcome.

The notion of grounding item writing in cognitive models is not unique to technologically sophisticated item types. Cognitive models can inform traditional item development as well (Haladyna, 2004; Lane, 2004; Pek & Poh, 2004). An ideal example of a low-tech innovative item type is a new twist on multiple-choice questions: ordered multiple-choice questions (OMC; Briggs et al., 2006). Multiple-choice items have often been criticized for the limited information provided by their answers. They are generally written at only the knowledge or comprehension level of cognition, with very few measuring complex skills such as synthesis, analysis, or evaluation, and they are almost always scored simply as correct or incorrect, with no use of information from the distractors (Frederiksen, 1990). OMC items, in contrast, link all item components (the item stem, the correct answer, and the incorrect answers) to a cognitive developmental model of the construct. The primary building block of OMC items is the previously discussed construct map. Based on this construct definition, each response option of an OMC item is written to be cognitively consistent with a different developmental level of reasoning about the construct, including specific skill weaknesses or student misconceptions. Drawing on well-developed construct maps for several science domains, Briggs et al. (2006) developed OMC questions that provide rich diagnostic information regarding students' cognitive development.
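To illustrate the structure of such an item, here is a minimal sketch in which each response option is keyed to a level of a construct map and scoring returns that level rather than a right/wrong judgment. The item content, levels, and class names are invented for illustration and are not taken from Briggs et al. (2006).

```python
from dataclasses import dataclass
from typing import List

@dataclass
class OMCOption:
    text: str
    construct_level: int   # level of the construct map this option reflects

@dataclass
class OMCItem:
    stem: str
    options: List[OMCOption]

    def score(self, choice: int) -> int:
        """Return the construct-map level implied by the selected option."""
        return self.options[choice].construct_level

item = OMCItem(
    stem="Why does the Moon darken during a lunar eclipse?",
    options=[
        OMCOption("The Moon's own light becomes dimmer.", construct_level=1),       # misconception
        OMCOption("Clouds block our view of the Moon.", construct_level=1),         # misconception
        OMCOption("The Earth blocks sunlight from reaching the Moon.", construct_level=3),
        OMCOption("The Moon passes into the shadow cast by the Earth.", construct_level=4),
    ],
)
print(item.score(3))   # 4: the response reflects the highest developmental level
```

Because every option is tied to a level of the construct map, a wrong answer is itself diagnostic: it locates the student on the developmental continuum instead of simply counting against the total score.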

Developments in item generation. Whether developing innovative item types or standard MC questions, grounding item writing in cognitive models can have substantial benefits, both psychometric and economic. An increasing number of test developers have begun to examine the potential for automating the item writing process with a new technology called item generation (Bejar, 1993; Irvine, 2002). Item generation streamlines item development by specifying a priori the structure of an item, including all of the ways in which items can vary and how those variations affect student processing. With an item structure in place, test developers can produce a virtually limitless number of items with known cognitive and statistical properties, improving the efficiency and security of a test while reducing development costs. In its most sophisticated form, item generation entails computers generating test items on the fly as a test is in progress (Bennett, 1999; Bennett & Bejar, 1998; Bejar, 1993). The entire item writing process is incorporated into a computer algorithm that is invoked after each item response, much as an item selection algorithm is applied in computerized adaptive tests with item banks. In most content areas, the technological requirements for programming automated item generation are well within our capabilities; computerized item generators have already been developed that automatically produce spatial and abstract reasoning tasks. However, in achievement domains such as K-12 mathematics and reading, the cognitive models needed to direct such programming are still under development. Whether or not automated item generation is an option, item difficulty models that capture the relationship between construct-relevant skills and features of test items are needed before systematic item generation is feasible. Graf et al. (2005) described in detail a structure for item generation based on cognitive models and provided an example for generating quantitative reasoning items. They provide an illustration of the framework that summarizes many of the aspects of test development highlighted in this article: analyze the construct, conduct a cognitive analysis, and design and
develop item models (see Figure 5). Their framework and others described in the recent test development literature share several commonalities, the most significant of which is the use of cognitive information and models to inform item design and test development (Mislevy, 1994; Embretson, 1999; Graf et al., 2005). Additionally, each approach emphasizes the iterative nature of item development. In their example, Graf et al. used a cognitive analysis of quantitative reasoning to generate a model of student processing, including information about students' general misconceptions and errors, and used that model to create an MC structure for item generation. The process alternates repeatedly between cognitive analysis, data collection, model verification, and modification. They note that in many instances the exact factors that affect processing for a set of items cannot be understood until after data have been collected; empirical validation of the cognitive models after data collection must therefore be conducted to inform item revisions. Studies that examine the connections between item features and processing models, such as correlational studies, analyses of verbal protocols and eye-tracking data, and experimental manipulations, can play a critical role in this process (Embretson, 1999; Embretson & Gorin, 2001; Graf et al., 2005).

FIGURE 5. Item model development and analysis loop developed by Graf et al. (2005).
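To make the idea of an item model concrete, here is a minimal sketch of template-based generation with an LLTM-style difficulty prediction. It is not Graf et al.'s (2005) actual system; the template, the "remainder" feature, and all weights are invented for illustration.

```python
import random

# Hypothetical item model: the structure is fixed, a difficulty-relevant
# feature (a "radical") is controlled, and surface values are sampled freely.
TEMPLATE = "A car travels {d} miles in {t} hours. What is its average speed in miles per hour?"

def generate_item(require_remainder: bool):
    """When require_remainder is True, the answer is not a whole number,
    which this illustrative cognitive model treats as harder."""
    while True:
        d, t = random.randint(40, 400), random.randint(2, 8)
        if (d % t != 0) == require_remainder:
            break
    return {"stem": TEMPLATE.format(d=d, t=t),
            "key": d / t,
            "features": {"remainder": int(require_remainder)}}

# LLTM-style prediction: difficulty is a weighted sum of item features.
FEATURE_WEIGHTS = {"remainder": 0.6}
BASELINE = -0.3

def predicted_difficulty(item):
    return BASELINE + sum(FEATURE_WEIGHTS[f] * v for f, v in item["features"].items())

item = generate_item(require_remainder=True)
print(item["stem"])
print(predicted_difficulty(item))   # higher predicted difficulty for the remainder variant
```

In line with the iterative loop in Figure 5, the feature weights would be re-estimated from pretest data, and the item model revised whenever generated variants do not behave as the cognitive model predicts.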



Implications for Practice
At both the beginning and the end of KWSK, recommendations for assessment theory are made, including a number for test design, development, and validation. Several of these have been incorporated in suggestions made throughout this article. To be clear, I would like to summarize for practitioners the key implications of the work described on cognitive model development and item difficulty modeling. First, in order to develop tests that provide meaningful information, a scientific and principled approach is warranted. Approaching test development scientifically, with formal hypotheses and explicit methods for testing those hypotheses, maintains the evidence-gathering aspect of test development. A construct-related approach to test design that begins with a clear specification of the target construct will more strongly support student-level inferences. Test developers are encouraged to develop comprehensive construct definitions that are informative for item development and subsequent analysis of item quality. Practitioners should think outside the box in terms of useful sources of information for modeling a content domain, including not only experts but also theory, teachers, and even students themselves. Psychological data collection methods such as verbal protocols and interviews can be useful tools for gathering information from these sources. Although the temptation may be to move quickly through the construct definition process in order to focus on item writing, the effect may be to compromise the interpretability of test scores.

In terms of item writing, keep in mind that the process is iterative. Current practice for test design often includes item review by content experts, sensitivity review, and item pretesting to examine statistical properties. Test developers must consider even more rigorous methods of item examination before operational use, methods that provide explicit evidence regarding the skills, knowledge, and processes measured by the items. Item design should proceed from sources of cognitive complexity related to the construct of interest, rather than from unrelated surface features. Evidence of this can be gathered from a variety of sources. First, develop preliminary items that you believe achieve this goal. Then, test your hypotheses about the properties of the items in
several ways: verbal protocols, experimental manipulations, and eye-tracking data. These procedures may be more time-consuming, but they should yield maximally accurate, valid, and useful test scores. Finally, when developing and verifying a cognitive model for specific items, do not be limited to the traditional models that exist. Similar item types can be found on a variety of tests, many of which are developed for different populations of examinees and for different testing purposes. Subtle features of the items may differ from those of interest to you, and consequently the underlying cognitive model may differ as well. Existing cognitive models derived from comparable tests may provide some insight into item processing, but they should be adopted cautiously. Specific analyses to gather evidence that a cognitive model is valid for a given set of items should be conducted before scores are interpreted for high-stakes decisions.
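For the experimental-manipulation route in particular, a minimal sketch of the kind of check involved appears below: the same item is administered in two variants that differ only in one hypothesized source of cognitive complexity, and accuracy is compared. The counts are invented, and operational analyses would typically model the manipulation within an IRT or regression framework rather than with a simple two-proportion test.

```python
import numpy as np
from scipy.stats import norm

# Illustrative counts: variant A includes the hypothesized difficulty feature,
# variant B removes it; everything else about the item is held constant.
correct_a, n_a = 52, 120
correct_b, n_b = 83, 118

p_a, p_b = correct_a / n_a, correct_b / n_b
p_pool = (correct_a + correct_b) / (n_a + n_b)
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = (p_b - p_a) / se                 # two-proportion z-test
p_value = 2 * norm.sf(abs(z))

print(f"accuracy {p_a:.2f} vs {p_b:.2f}, z = {z:.2f}, p = {p_value:.4f}")
```

If the manipulated feature does not change performance in the predicted direction, the cognitive model (or the item) needs revision before the feature is used to justify score interpretations.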

Conclusion
As the use of standardized test scores for high-stakes decisions in education becomes more commonplace, the need for a more complete understanding of score meaning has simultaneously increased. What does it mean to get a question right? What decisions can I justify with the results of this test? What does a high score on a test mean? Theories and methods of cognitive psychology have provided assessment specialists with new tools to tackle these questions. The future success of cognitively-based test development will likely depend heavily on the ability of test developers and practitioners to learn and adopt these methods. Once referred to as a hybrid breed of psychometricians (Leighton, 2004, p. 13), researchers trained in both cognitive psychology and educational measurement may be the most valuable resource to the test development industry.

Note
1. This terminology is borrowed from research on curriculum alignment that examines the relationship between the intended curriculum (i.e., the state-mandated curriculum as defined by instructional objectives) and the enacted curriculum (i.e., the content actually taught within the classroom).

References
Behrens, J. T., Mislevy, R. J., Bauer, M., Williamson, D. M., & Levy, R. (2004). Introduction to evidence-centered design and lessons learned from its application in a global e-learning program. International Journal of Testing, 4, 295–302.
Bejar, I. I. (1993). A generative approach to psychological and educational measurement. In N. Frederiksen, R. J. Mislevy, & I. I. Bejar (Eds.), Test theory for a new generation of tests (pp. 323–359). Hillsdale, NJ: Erlbaum.
Bejar, I. I., Lawless, R. R., Morley, M. E., Wagner, M. E., Bennett, R. E., & Revuelta, J. (2002). A feasibility study of on-the-fly item generation in adaptive testing (GRE Board Professional Report No. 98-12P). Princeton, NJ: Educational Testing Service.
Bennett, R. E. (1999). Using new technology to improve assessment. Educational Measurement: Issues and Practice, 18, 5–12.
Bennett, R. E., & Bejar, I. I. (1998). Validity and automated scoring: It's not only the scoring. Educational Measurement: Issues and Practice, 17(4), 9–16.
Bennett, R. E., Jenkins, F., Persky, H., & Weiss, A. (2003). Assessing complex problem-solving performances (ETS Report No. RM-03-03). Princeton, NJ: Educational Testing Service.
Briggs, D. C., Alonzo, A. C., Schwab, C., & Wilson, M. (2006). Diagnostic assessment with ordered multiple-choice items. Educational Assessment, 11(1), 33–63.
Buck, G., & Tatsuoka, K. (1998). Application of the rule-space procedure to language testing: Examining attributes of a free response listening test. Language Testing, 15(2), 119–157.
Buck, G., Tatsuoka, K., & Kostin, I. (1997). The subskills of reading: Rule-space analysis of a multiple choice test of second language reading comprehension. Language Learning, 47(3), 423–466.
Chen, Y.-H., Gorin, J. S., Thompson, M., & Tatsuoka, K. K. (2006). Cognitively diagnostic examination of Taiwanese mathematics achievement on TIMSS-1999. Unpublished doctoral dissertation, Arizona State University, Tempe, AZ.
Cooke, N. J. (1994). Varieties of knowledge elicitation techniques. International Journal of Human-Computer Studies, 41, 801–849.
Desimone, L., & LeFloch, K. (2004). Are we asking the right questions? Using cognitive interviews to improve surveys in education research. Educational Evaluation and Policy Analysis, 26(1), 1–22.
DeVellis, R. F. (1991). Scale development: Theory and applications. Thousand Oaks, CA: Sage Publications.
Diehl, K. A. (2004). Algorithmic item generation and problem solving strategies in matrix completion problems. Dissertation Abstracts International: Section B: The Sciences and Engineering, 64, 4075.


Educational Testing Service (Ed.). (1998). GRE: Practicing to take the general test: Big book. Princeton, NJ: Educational Testing Service.
Embretson, S. E. (1994). Application of cognitive design systems to test development. In C. R. Reynolds (Ed.), Cognitive assessment: A multidisciplinary perspective (pp. 107–135). New York: Plenum Press.
Embretson, S. E. (1998). A cognitive design system approach to generating valid tests: Application to abstract reasoning. Psychological Methods, 3, 300–396.
Embretson, S. E. (1999). Generating items during testing: Psychometric issues and models. Psychometrika, 64, 407–433.
Embretson, S. E., & Gorin, J. S. (2001). Improving construct validity with cognitive psychology principles. Journal of Educational Measurement, 38(4), 343–368.
Embretson, S. E., & Wetzel, C. D. (1987). Component latent trait models for paragraph comprehension. Applied Psychological Measurement, 11, 175–193.
Enright, M. K., Morley, M., & Sheehan, K. M. (2002). Items by design: The impact of systematic variation on item statistical characteristics. Applied Measurement in Education, 15(1), 49–74.
Enright, M. K., & Sheehan, K. M. (2002). Modeling the difficulty of quantitative reasoning items: Implications for item generation. In S. H. Irvine & P. C. Kyllonen (Eds.), Item generation for test development (pp. 129–157). Mahwah, NJ: Lawrence Erlbaum Associates.
Ericsson, K. A., & Simon, H. A. (1993). Protocol analysis: Verbal reports as data (Rev. ed.). Cambridge, MA: MIT Press.
Ferrara, S., & DeMauro, G. E. (2006). Standardized assessment of individual achievement in K-12. In R. L. Brennan (Ed.), Educational measurement (4th ed.). Westport, CT: American Council on Education/Praeger.
Ferrara, S., Duncan, T. G., Freed, R., Vélez-Paschke, A., McGivern, J., Mushlin, S., Mattessich, A., Rogers, A., & Westphalen, K. (2004). Examining test score validity by examining item construct validity: Preliminary analysis of evidence of the alignment of targeted and observed content, skills, and cognitive processes in a middle school science assessment. Paper presented at the 2004 annual meeting of the American Educational Research Association.
Fischer, G. H. (1973). Linear logistic test model as an instrument in educational research. Acta Psychologica, 37, 359–374.
Frederiksen, N. (1986). Construct validity and construct similarity: Methods for use in test development and test validation. Multivariate Behavioral Research, 21(1), 3–28.
Frederiksen, N. (1990). Introduction. In N. Frederiksen, R. Glaser, A. Lesgold, & M. G. Shafto (Eds.), Diagnostic monitoring of skill and knowledge acquisition (pp. ix–xvii). Hillsdale, NJ: Lawrence Erlbaum.
Gierl, M. J., Tang, X., & Wang, C. (2005). Identifying content and cognitive dimensions on the SAT (College Board Research Report No. 2005-11). New York: College Board Press.

Gorin, J. S. (2005). Manipulation of processing difficulty on reading comprehension test questions: The feasibility of verbal item generation. Journal of Educational Measurement, 42, 351–373.
Gorin, J. S. (2006). Using alternative data sources to inform item difficulty modeling. Paper presented at the 2006 annual meeting of the National Council on Measurement in Education.
Gorin, J. S., & Embretson, S. E. (2006). Item difficulty modeling of paragraph comprehension items. Applied Psychological Measurement, 30(5), 394–411.
Graf, E. A., Peterson, S., Steffen, M., & Lawless, R. (2005). Psychometric and cognitive analysis as a basis for the design and revision of quantitative item models (ETS Research Report No. RR-05-25). Princeton, NJ: Educational Testing Service.
Haladyna, T. M. (2004). Developing and validating multiple-choice test items. Mahwah, NJ: Lawrence Erlbaum Associates.
Hartz, S. M. (2002). A Bayesian framework for the unified model for assessing cognitive abilities: Blending theory with practicality. Dissertation Abstracts International: Section B: The Sciences and Engineering, 63(2-B), 864.
Irvine, S. H. (2002). The foundations for item generation for mass testing. In S. H. Irvine & P. C. Kyllonen (Eds.), Item generation for test development (pp. 3–24). Mahwah, NJ: Erlbaum.
Ivie, J. L., Kupzyk, K. A., & Embretson, S. E. (2004). Final report of cognitive components study: Predicting strategies for solving multiple-choice quantitative reasoning items: An eye-tracker study. Princeton, NJ: Educational Testing Service, and Lawrence, KS: University Press of Kansas.
Katz, I. R., Bennett, R. E., & Berger, A. E. (2000). Effects of response format on difficulty of SAT-Mathematics items: It's not the strategy. Journal of Educational Measurement, 37, 39–57.
Katz, S., & Lautenschlager, G. J. (1994). Answering reading comprehension items without passages in the SAT-I, the ACT, and the GRE. Educational Assessment, 2, 295–308.
Katz, S., & Lautenschlager, G. J. (2001). The contribution of passage and no-passage factors to item performance on the SAT reading task. Educational Assessment, 7, 165–176.
Kintsch, W., & van Dijk, T. A. (1978). Toward a model of text comprehension and production. Psychological Review, 85, 363–394.
Lane, S. (2004). Validity of high-stakes assessment: Are students engaged in complex thinking? Educational Measurement: Issues and Practice, 23(3), 6–14.
Leighton, J. P. (2004). Avoiding misconception, misuse, and missed opportunities: The collection of verbal reports in educational achievement testing. Educational Measurement: Issues and Practice, 23(4), 6–15.

Messick, S. (1975). The standard problem: Meaning and values in measurement and evaluation. American Psychologist, 30, 955–966.
Messick, S. (1995). Validity of psychological assessment: Validation of inferences from persons' responses and performances as scientific inquiry into score meaning. American Psychologist, 50, 741–749.
Mislevy, R. J. (1994). Evidence and inference in educational assessment. Psychometrika, 59, 439–468.
Mislevy, R. J. (1995). Probability-based inference in cognitive diagnosis. In P. D. Nichols, S. F. Chipman, & R. L. Brennan (Eds.), Cognitively diagnostic assessment. Hillsdale, NJ: Lawrence Erlbaum Associates.
Mislevy, R. J., Steinberg, L. S., Breyer, F. J., Almond, R. G., & Johnson, L. (2002). Making sense of data from complex assessments. Applied Measurement in Education, 15, 363–389.
National Research Council (2001). Knowing what students know: The science and design of educational assessment. Washington, DC: National Academy Press.
Netemeyer, R. G., Bearden, W. O., & Sharma, S. (2003). Scaling procedures: Issues and applications. Thousand Oaks, CA: Sage Publications.
Pek, P. K., & Poh, K. L. (2004). A Bayesian tutoring system for Newtonian mechanics: Can it adapt to different learners? Journal of Educational Computing Research, 31(3), 281–307.
Powers, D. E., & Wilson, S. T. (1993). Passage dependence of the new SAT reading comprehension questions (College Board Report No. 93-3). New York: College Board.
Quasha, W. H., & Likert, R. (1937). The revised Minnesota Paper Form Board Test. Journal of Educational Psychology, 28, 197–204.
Raven, J. C. (1941). Standardisation of progressive matrices. British Journal of Medical Psychology, 19, 137–150.
Rayner, K. (1998). Eye movements in reading and information processing: 20 years of research. Psychological Bulletin, 124, 372–422.
Rayner, K., Warren, T., Juhasz, B. J., & Liversedge, S. P. (2004). The effect of plausibility on eye movements in reading. Journal of Experimental Psychology: Learning, Memory, & Cognition, 30, 1290–1301.
Sheehan, K. M., & Ginther, A. (2001). What do passage-based multiple-choice verbal reasoning items really measure? An analysis of the cognitive skills underlying performance on the current TOEFL reading section. Paper presented at the 2000 annual meeting of the National Council on Measurement in Education.
Sheehan, K. M., Kostin, I., & Persky, H. (2006). Predicting item difficulty as a function of inferential processing requirements: An examination of the reading skills underlying performance on the NAEP grade 8 reading assessment. Paper presented at the 2006 annual meeting of the National Council on Measurement in Education.



Shute, V. J., Graf, E. A., & Hansen, E. G. (2006). Designing adaptive, diagnostic math assessments for sighted and visually disabled students (ETS Research Report No. RR-06-01). Princeton, NJ: Educational Testing Service.
Sireci, S. G., & Zenisky, A. L. (2006). Innovative item formats in computer-based testing: In pursuit of improved construct representation. In S. M. Downing & T. M. Haladyna (Eds.), Handbook of test development. Mahwah, NJ: Lawrence Erlbaum Associates.
Snow, R. E., & Lohman, D. F. (1989). Implications of cognitive psychology for educational measurement. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 263–331). New York: Macmillan.
Tatsuoka, K. K. (1985). A probabilistic model for diagnosing misconceptions by the pattern classification approach. Journal of Educational Statistics, 10(1), 55–73.

Tatsuoka, K. K. (1995). Architecture of knowledge structures and cognitive diagnosis: A statistical pattern recognition and classification approach. In P. D. Nichols, S. F. Chipman, & R. L. Brennan (Eds.), Cognitively diagnostic assessment. Hillsdale, NJ: Lawrence Erlbaum Associates.
Tatsuoka, K. K., Corter, J. E., & Guerrero, A. (2004). Coding manual for identifying involvement of content, skill, and process subskills for the TIMSS-R 8th grade and 12th grade general mathematics test items (Technical Report). New York: Department of Human Development, Teachers College, Columbia University.
Tatsuoka, K. K., Corter, J. E., & Tatsuoka, C. (2004). Patterns of diagnosed mathematical content and process skills in TIMSS-R across a sample of 20 countries. American Educational Research Journal, 41(4), 901–926.
Tatsuoka, K. K., Guerrero, A., Corter, J. E., Yamada, T., & Tatsuoka, C. (2003). International comparisons of mathematical thinking skills in the TIMSS-R. Paper presented at the annual meeting of the National Council on Measurement in Education, Chicago, IL.

Underwood, G., Jebbett, L., & Roberts, K. (2004). Inspecting pictures for information to verify a sentence: Eye movements in general encoding and in focused search. Quarterly Journal of Experimental Psychology A: Human Experimental Psychology, 57A, 165–182.
von Davier, M. (2005). A general diagnostic model applied to language testing data (ETS Research Report No. RR-05-16). Princeton, NJ: Educational Testing Service.
Whitely, S. E. (1980). Multicomponent latent trait models for ability tests. Psychometrika, 45, 479–494.
Williamson, D. M., Bauer, M., Steinberg, L. S., Mislevy, R. J., Behrens, J. T., & DeMark, S. F. (2004). Design rationale for a complex performance assessment. International Journal of Testing, 4(4), 303–332.
Wilson, M. (2005). Constructing measures: An item response modeling approach. Mahwah, NJ: Lawrence Erlbaum Associates.
Xu, X., & von Davier, M. (2006). Cognitive diagnosis for NAEP proficiency data (ETS Research Report No. RR-06-08). Princeton, NJ: Educational Testing Service.


