
Scoring NAEP Geography Assessments

The NAEP geography items that are not scored by machine are constructed-response items—those for
which the student must write in a response rather than selecting from a printed list of multiple choices.
Each constructed-response item has a unique scoring guide that identifies the range of possible scores
for the item. To measure longitudinal trends in geography, NAEP requires trend scoring—replication of
scoring from prior assessment years—to demonstrate statistically that scoring was comparable across
years. Students' constructed responses are scored on computer workstations using an image-based
scoring system. This allows for item-by-item scoring and online, real-time monitoring of geography
interrater reliabilities, as well as the performance of each individual rater. A subset of these items—those
that appear in large-print booklets—require scoring by hand. The 2001 geography assessment
included 57 discrete constructed-response items. The total number of constructed responses scored was
381,477. The number of raters working on the geography assessment and the location of the scoring are
listed here:

Scoring activities, geography assessment: 2001


Scoring location   Start date   End date    Number of raters   Number of scoring supervisors
Iowa City, Iowa    5/7/2001     5/23/2001   81                 9
SOURCE: U.S. Department of Education, Institute of Education Sciences,
National Center for Education Statistics, National Assessment of Educational
Progress (NAEP), 2001 Geography Assessment.

Each constructed-response item has a unique scoring guide that identifies the range of possible scores
for the item and defines the criteria to be used in evaluating student responses. During the course of the
project, each team scores the items using a 2-, 3-, or 4-point scale as outlined below:

Dichotomous Items:
2 = Complete
1 = Inappropriate

Short Three-Point Items:
3 = Complete
2 = Partial
1 = Inappropriate

Extended Four-Point Items:
4 = Complete
3 = Essential
2 = Partial
1 = Inappropriate

In some cases a student response does not fit into any of the categories listed on the scoring guide. Such responses are assigned special coding categories for unscorable responses. An unscorable category is assigned only if no aspect of the student's response can be scored. Scoring supervisors and/or trainers are consulted before any special coding category is assigned to an item. The unscorable categories used for geography are outlined as follows.

Categories for unscorable responses, geography assessment: 2001


Label  Description
B      Blank responses, random marks on paper, word underlined in prompt but response area completely blank, mark on item number but response area completely blank
X      Completely crossed out, completely erased
IL     Completely illegible response
OT     Off task, off topic, comments to the test makers, refusal to answer, "Who cares," language other than English (unless otherwise noted)
?      "I don't know," "I can't do this," "No clue," "I don't understand," "I forget"
NOTE: Because the NAEP scoring contractor's database recognizes only alphanumeric characters and sets a single-
character field for the value for each score, the label "IL" appears in the database file as "I," the label "OT" appears as "T,"
and the label "?" appears as "D."
SOURCE: U.S. Department of Education, Institute of Education Sciences, National Center for Education Statistics,
National Assessment of Educational Progress (NAEP), 2001 Geography Assessment.
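Because the database stores a single alphanumeric character per score, the multi-character labels are remapped on entry, as the note above describes. A minimal sketch of that mapping (the dictionary and function names are illustrative, not taken from NAEP documentation):

```python
# Map unscorable-response labels to the one-character codes stored in the
# scoring contractor's database. "B" and "X" are already single characters
# and pass through unchanged; "IL", "OT", and "?" are remapped per the note.
UNSCORABLE_DB_CODES = {
    "B": "B",   # blank response, random marks
    "X": "X",   # completely crossed out or erased
    "IL": "I",  # completely illegible response
    "OT": "T",  # off task / off topic
    "?": "D",   # "I don't know" and similar
}

def to_db_code(label: str) -> str:
    """Return the one-character database code for an unscorable label."""
    return UNSCORABLE_DB_CODES[label]
```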

Number of constructed-response items, by score-point level and grade, geography national main assessment: 2001

Grade   Total   Dichotomous 2-point items   Short 3-point items   Extended 4-point items
Total   53      22                          16                    15
4       27      17                          5                     5
8       9       0                           7                     2
8/12    4       0                           0                     4
12      13      5                           4                     4
SOURCE: U.S. Department of Education, Institute of Education
Sciences, National Center for Education Statistics, National
Assessment of Educational Progress (NAEP), 2001 Geography
Assessment.

Number of 1994 constructed-response items rescored in 2001, by score-point level and grade, geography national main assessment: 2001

Grade   Total   Short 3-point items   Extended 4-point items
Total   53      39                    14
4       9       6                     3
4/8     5       4                     1
8       15      13                    2
8/12    6       5                     1
12      18      11                    7
SOURCE: U.S. Department of Education, Institute
of Education Sciences, National Center for
Education Statistics, National Assessment of
Educational Progress (NAEP), 2001 Geography
Assessment.

Geography Interrater Reliability

A subsample of the geography responses for each constructed-response item is scored by a second rater
to obtain statistics on interrater reliability. In general, geography items receive 25 percent second scoring.
This reliability information is also used by the scoring supervisor to monitor the capabilities of all raters
and maintain uniformity of scoring across raters. Reliability reports are generated on demand by the
scoring supervisor, trainer, scoring director, or item development subject area coordinator. Printed copies
are reviewed daily by lead scoring staff. In addition to the immediate feedback provided by the online
reliability reports, each scoring supervisor can also review the actual responses scored by a rater with the
backreading tool. In this way, the scoring supervisor can monitor each rater carefully and correct
difficulties in scoring almost immediately with a high degree of efficiency.
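The reliability statistic monitored here is the percentage of exact score agreement between first and second readings on the double-scored subsample. A minimal sketch of the computation (the function name is illustrative; NAEP's actual image-based system is proprietary):

```python
def percent_agreement(first_scores, second_scores):
    """Percent of double-scored responses where the two raters agree exactly."""
    pairs = list(zip(first_scores, second_scores))
    if not pairs:
        raise ValueError("no double-scored responses")
    exact = sum(1 for a, b in pairs if a == b)
    return 100.0 * exact / len(pairs)

# Example: raters agree on 3 of 4 double-scored responses -> 75.0
print(percent_agreement([3, 2, 1, 3], [3, 2, 2, 3]))
```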

Interrater reliability ranges, by assessment year, geography national main assessment: 2001

Assessment year   Number of unique items   Items between 70% and 79%   Items between 80% and 89%   Items above 90%
2001 geography    57                       8                           41                          8
1994 geography    64                       1                           15                          48
SOURCE: U.S. Department of Education, Institute of Education Sciences, National
Center for Education Statistics, National Assessment of Educational Progress (NAEP),
2001 Geography Assessment.

During the scoring of an item or the scoring of a calibration set, scoring supervisors monitor progress
using an interrater reliability tool. This display tool functions in either of two modes:

• to display information of all first readings versus all second readings; or

• to display all readings of an individual that were also scored by another rater versus the scores assigned by the other raters.

The information is displayed as a matrix, with scores awarded during first readings in rows and scores awarded during second readings in columns (for mode one), or the individual's scores in rows and all other raters' scores in columns (for mode two). In this format, instances of exact agreement fall
along the diagonal of the matrix. For completeness, data in each cell of the matrix contain the number
and percentage of cases of agreement (or disagreement). The display also contains information on the
total number of second readings and the overall percentage of reliability on the item. Since the interrater
reliability reports are cumulative, a printed copy of the reliability of each item is made periodically and
compared to previously generated reports. Scoring staff members save printed copies of all final reliability
reports and archive them with the training sets.
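The matrix described above can be sketched as a simple cross-tabulation of score pairs, with exact agreement on the diagonal; this is an illustration of the layout, not the contractor's actual tool, and all names are assumptions:

```python
from collections import Counter

def reliability_matrix(pairs, score_values):
    """Cross-tabulate first-reading scores (rows) against second-reading
    scores (columns). Exact agreement falls on the diagonal. Returns the
    cell counts, the number of second readings, and the overall percent
    agreement."""
    counts = Counter(pairs)
    total = len(pairs)
    matrix = {(r, c): counts[(r, c)] for r in score_values for c in score_values}
    agree = sum(counts[(s, s)] for s in score_values)
    overall = 100.0 * agree / total if total else 0.0
    return matrix, total, overall

# Three of four pairs agree; the (1, 2) cell holds the single disagreement.
matrix, n, pct = reliability_matrix([(3, 3), (2, 2), (1, 2), (3, 3)], [1, 2, 3])
```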

Item-by-item rater reliability, by grade, geography national main
assessment: 2001
Grade   Item      Score points   Number scored (1st and 2nd)   2001 reliability   1994 reliability
Total † † 420,157 † †
4 G027201 1,3 3,331 99 §
4 G027301 1,3 3,331 99 §
4 G027401 1,3 3,331 97 §
4 G027501 1,3 3,332 98 §
4 G027601 1,3 3,331 98 §
4 G027701 1,3 3,331 98 §
4 G027801 1,3 3,331 96 §
4 G028201 4 3,331 86 §
4 G028401 3 3,332 86 §
4 G028501 3 3,332 92 §
4 G028701 4 3,331 91 §
4 G008001 3 639 99 98
4 G008001 3 3,247 99 §
4 G008201 4 628 91 93
4 G008201 4 3,247 93 §
4 G008503 3 637 94 94
4 G008503 3 3,248 93 §
4 G008701 3 612 94 92
4 G008701 3 3,248 93 §
4 G009001 4 582 83 86
4 G009001 4 3,247 83 §
4 G009201 3 603 88 87
4 G009201 3 3,236 88 §
4 G009402 3 625 94 93
4 G009402 3 3,236 92 §
4 G009403 3 626 94 90
4 G009403 3 3,236 93 §
4 G009601 4 612 91 91
4 G009601 4 3,236 91 §
4 G029301 1,3 3,277 96 §
4 G029401 1,3 3,278 97 §
4 G029501 1,3 3,277 98 §
4 G029601 1,3 3,277 99 §
4 G029701 1,3 3,278 97 §
4 G030101 4 3,277 82 §
4 G030501 4 3,278 78 §
4 G030801 1,3 3,278 99 §
4 G030901 1,3 3,277 98 §
4 G031001 1,3 3,277 98 §
4 G031101 1,3 3,277 98 §
4 G031401 4 3,277 94 §
4 G031801 3 3,220 92 §
4 G031901 3 3,220 91 §
4 G032401 3 3,220 99 §
4 G032601 1,3 3,220 98 §
See notes at end of table.

Item-by-item rater reliability, by grade, geography national main
assessment: 2001 (continued)
Grade   Item      Score points   Number scored (1st and 2nd)   2001 reliability   1994 reliability
4 G012201 3 638 97 97
4 G012201 3 3,225 97 §
4 G012503 3 642 99 98
4 G012503 3 3,225 99 §
4 G012902 3 644 98 97
4 G012902 3 3,225 98 §
4 G013001 4 619 94 95
4 G013001 4 3,225 94 §
4 G013201 3 623 91 92
4 G013201 3 3,225 93 §
8 G013402 3 638 85 93
8 G013402 3 3,159 86 §
8 G014001 3 648 98 99
8 G014001 3 3,159 98 §
8 G014201 4 620 88 87
8 G014201 4 3,159 89 §
8 G014301 3 640 86 91
8 G014301 3 3,159 90 §
8 G014401 3 646 95 97
8 G014401 3 3,159 94 §
8 G032901 3 3,160 90 §
8 G033301 3 3,160 89 §
8 G033501 4 3,160 77 §
8 G033801 3 3,160 93 §
8 G034101 3 3,160 81 §
8 G016201 3 641 97 99
8 G016201 3 3,099 99 §
8 G016302 3 636 97 97
8 G016302 3 3,099 99 §
8 G016401 3 622 95 92
8 G016401 3 3,099 95 §
8 G016502 3 623 92 91
8 G016502 3 3,099 93 §
8 G016701 3 616 91 88
8 G016701 3 3,099 93 §
8 G017101 4 609 79 85
8 G017101 4 3,099 85 §
8 G034801 4 3,137 95 §
8 G035001 4 3,138 81 §
8 G035301 4 3,137 85 §
8 G035501 4 3,138 87 §
8 G036201 3 3,093 97 §
8 G036501 4 3,094 82 §
8 G036801 3 3,094 90 §
8 G037101 3 3,094 92 §
8 G012201 3 639 98 98
See notes at end of table.

Item-by-item rater reliability, by grade, geography national main
assessment: 2001 (continued)
Grade   Item      Score points   Number scored (1st and 2nd)   2001 reliability   1994 reliability
8 G012201 3 3,119 98 §
8 G012503 3 646 100 99
8 G012902 3 638 98 96
8 G012902 3 3,119 98 §
8 G013001 4 601 92 95
8 G013001 4 3,119 93 §
8 G013201 3 610 79 86
8 G013201 3 3,119 85 §
8 G019002 3 3,210 99 98
8 G019003 3 624 96 94
8 G019003 3 3,210 93 §
8 G019102 3 618 97 94
8 G019102 3 3,210 95 §
8 G019202 3 621 91 93
8 G019202 3 3,210 89 §
8 G019302 3 602 94 95
8 G019302 3 3,210 95 §
8 G019402 3 623 94 93
8 G019402 3 3,210 95 §
8 G019901 4 639 93 95
8 G019901 4 3,210 91 §
8 G020001 3 642 97 97
8 G020001 3 3,210 97 §
8 G020201 3 633 94 95
8 G020201 3 3,210 91 §
8 G020302 3 607 92 92
8 G020302 3 3,210 93 §
12 G020701 3 584 88 89
12 G020701 3 3,049 88 §
12 G021001 4 614 79 85
12 G021001 4 3,049 79 §
12 G021401 4 620 85 90
12 G021401 4 3,049 87 §
12 G021601 4 594 82 92
12 G021601 4 3,049 92 §
12 G021602 3 606 94 95
12 G021602 3 3,048 93 §
12 G037501 1,3 3,072 98 §
12 G037601 1,3 3,073 99 §
12 G037701 1,3 3,072 99 §
12 G037801 1,3 3,072 99 §
12 G037901 1,3 3,072 99 §
12 G038101 3 3,073 83 §
12 G038401 3 3,072 83 §
12 G038801 4 3,073 90 §
See notes at end of table.

Item-by-item rater reliability, by grade, geography national main
assessment: 2001 (continued)
Grade   Item      Score points   Number scored (1st and 2nd)   2001 reliability   1994 reliability
12 G039201 3 3,073 94 §
12 G016201 3 639 98 99
12 G016201 3 3,005 100 §
12 G016302 3 641 96 97
12 G016401 3 625 94 91
12 G016401 3 3,005 92 §
12 G016502 3 610 92 88
12 G016502 3 3,005 92 §
12 G016701 3 616 91 87
12 G016701 3 3,005 92 §
12 G017101 4 630 80 87
12 G017101 4 3,005 84 §
12 G034801 4 3,003 93 §
12 G035001 4 3,002 82 §
12 G035301 4 3,002 85 §
12 G035501 4 3,002 86 §
12 G039801 3 3,026 88 §
12 G040001 4 3,026 78 §
12 G040301 4 3,026 85 §
12 G040701 3 3,026 94 §
12 G025001 3 630 88 88
12 G025202 3 616 83 86
12 G025202 3 3,062 86 §
12 G025301 4 623 92 95
12 G025301 4 3,062 94 §
12 G025601 3 628 93 93
12 G025601 3 3,062 93 §
12 G025801 3 630 90 91
12 G025801 3 3,063 92 §
12 G026101 3 627 94 94
12 G026101 3 3,013 91 §
12 G026204 3 611 91 92
12 G026204 3 3,014 93 §
12 G026301 4 621 90 88
12 G026301 4 3,014 88 §
12 G026502 3 635 97 96
12 G026502 3 3,014 98 §
12 G026503 4 3,014 92 84
12 G026601 3 641 96 98
12 G026601 3 3,014 97 §
12 G026901 3 631 92 94
12 G026901 3 3,014 94 §
† Not applicable.
§ Item had not been created for the 1994 assessment.
SOURCE: U.S. Department of Education, Institute of Education Sciences, National
Center for Education Statistics, National Assessment of Educational Progress
(NAEP), 2001 Geography Assessment.

Scoring of the 2001 Geography Assessment Large-Print Booklets

A subset of the items scored came from large-print booklets. These booklets were
administered to students with disabilities who had met the criteria for participation with accommodations.
Because these booklets could not be scanned, they were transported to the scoring center after processing. A
log and score sheet were created to account for these booklets. As a rater scored an item, he or she
marked the score for that response, his or her rater ID, and the date on which the item was scored. Once
all items in each booklet for a given subject were scored, the geography scoring director returned the
sheets to NAEP clerical staff to enter those scores manually into the records for these booklets.

In the 2001 assessment, there were five large-print geography booklets.

Item-by-item rater reliability for items in large-print booklets, by grade,
geography national main assessment: 2001
Grade   Item        Score points   Number scored (1st and 2nd)   2001 reliability   1994 reliability
Total † † 136,680 † †
4 X1G3_01A 1,3 3,331 99 §
4 X1G3_01B 1,3 3,331 99 §
4 X1G3_01C 1,3 3,331 97 §
4 X1G3_01D 1,3 3,332 98 §
4 X1G3_01E 1,3 3,331 98 §
4 X1G3_01F 1,3 3,331 98 §
4 X1G3_01G 1,3 3,331 96 §
4 X1G3_05 4 3,331 86 §
4 X1G3_07 3 3,332 86 §
4 X1G3_08 3 3,332 92 §
4 X1G3_10 4 3,331 91 §
4 Q1G401 3 639 99 98
4 Q1G401 3 3,247 99 §
4 Q1G404 4 628 91 93
4 Q1G404 4 3,247 93 §
4 Q1G409 3 637 94 94
4 Q1G409 3 3,248 93 §
4 Q1G411 3 612 94 92
4 Q1G411 3 3,248 93 §
4 Q1G415 4 582 83 86
4 Q1G415 4 3,247 83 §
8 Q2G304 3 638 85 93
8 Q2G304 3 3,159 86 §
8 Q2G312 3 648 98 99
8 Q2G312 3 3,159 98 §
8 Q2G314 4 620 88 87
8 Q2G314 4 3,159 89 §
8 Q2G315 3 640 86 91
8 Q2G315 3 3,159 90 §
8 Q2G316 3 646 95 97
8 Q2G316 3 3,159 94 §
8 X2G4_01 3 3,160 90 §
8 X2G4_05 3 3,160 89 §
8 X2G4_07 4 3,160 77 §
8 X2G4_10 3 3,160 93 §
8 X2G4_13 3 3,160 81 §
12 Q3G304 3 584 88 89
12 Q3G304 3 3,049 88 §
12 Q3G308 4 614 79 85
12 Q3G308 4 3,049 79 §
12 Q3G312 4 620 85 90
12 Q3G312 4 3,049 87 §
12 Q3G314 4 594 82 92
12 Q3G314 4 3,049 92 §
12 Q3G315 3 606 94 95
See notes at end of table.

Item-by-item rater reliability for items in large-print booklets, by grade,
geography national main assessment: 2001 (continued)
Grade   Item        Score points   Number scored (1st and 2nd)   2001 reliability   1994 reliability
12 Q3G315 3 3,048 93 §
12 X3G4_01A 1,3 3,072 98 §
12 X3G4_01B 1,3 3,073 99 §
12 X3G4_01C 1,3 3,072 99 §
12 X3G4_01D 1,3 3,072 99 §
12 X3G4_01E 1,3 3,072 99 §
12 X3G4_03 3 3,073 83 §
12 X3G4_06 3 3,072 83 §
12 X3G4_10 4 3,073 90 §
12 X3G4_14 3 3,073 94 §
† Not applicable.
§ Item had not been created for the 1994 assessment.
SOURCE: U.S. Department of Education, Institute of Education Sciences, National
Center for Education Statistics, National Assessment of Educational Progress
(NAEP), 2001 Geography Assessment.

Scoring NAEP Mathematics Assessments

The NAEP mathematics items that are not scored by machine are constructed-response items—those for
which the student must write in a response rather than selecting from a printed list of multiple choices.
Each constructed-response item has a unique scoring guide that identifies the range of possible scores
for the item. To measure longitudinal trends in mathematics, NAEP requires trend scoring—replication of
scoring from prior assessment years—to demonstrate statistically that scoring is comparable across
years.

Students' constructed responses are scored on computer workstations using an image-based scoring
system. This allows for item-by-item scoring and online, real-time monitoring of mathematics interrater
reliabilities, as well as the performance of each individual rater. All responses from large-print booklets
are transcribed into the appropriate regular-sized booklet and scanned with other booklets. Image scoring
of these responses takes place with regular scoring.

The 2000 mathematics assessment included 199 discrete constructed-response items. The total number
of constructed responses scored was 3,856,211. The number of raters working on the mathematics
assessment and the location of the scoring are listed here:

Scoring activities, mathematics assessment: 2000


Scoring location   Start date   End date    Number of raters   Number of scoring supervisors
Tucson, Arizona    3/13/2000    4/29/2000   177                16
SOURCE: U.S. Department of Education, Institute of Education Sciences,
National Center for Education Statistics, National Assessment of
Educational Progress (NAEP), 2000 Mathematics Assessment.

Each constructed-response item has a unique scoring guide that identifies the range of possible scores
for the item and defines the criteria to be used in evaluating student responses. During the course of the
project, each team scores constructed-response items using a 2-, 3-, or 5-point scale as outlined below:

Dichotomous Items:
2 = Correct
1 = Incorrect

Short Three-Point Items:
3 = Correct
2 = Partial
1 = Incorrect

Extended Five-Point Items:
5 = Extended
4 = Satisfactory
3 = Partial
2 = Minimal
1 = Incorrect

Early (1990) mathematics constructed-response items used a rating scale in which 1 = incorrect and 7 = correct. Several of these items also tracked how a student approached the problem by expanding the rating 1 into codes 1, 2, and 3, or the rating 7 into codes 6 and 7. For example, if the student was asked to draw a figure with four 90-degree angles, a response rated 6 or 7 was correct: a 6 tracked the 'square' response and a 7 the 'rectangle' response. An incorrect response might be one in which the student renamed incorrectly in a subtraction problem and therefore arrived at a wrong answer; this might be tracked as a 2.
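Under that early scale, the expanded codes collapse back to a dichotomous correct/incorrect score. A sketch of the collapse, following the description above (the function name is illustrative, not from NAEP documentation):

```python
def collapse_1990_rating(code: int) -> str:
    """Collapse an expanded 1990 mathematics rating to correct/incorrect.
    Codes 1-3 track varieties of incorrect work (e.g., a renaming error in
    a subtraction problem); codes 6-7 track varieties of correct work
    (e.g., 'square' vs. 'rectangle' for a figure with four 90-degree angles)."""
    if code in (1, 2, 3):
        return "incorrect"
    if code in (6, 7):
        return "correct"
    raise ValueError(f"unknown 1990 rating code: {code}")
```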

In some cases, a student response does not fit into any of the categories listed on the scoring guide. Such responses are assigned special coding categories for unscorable responses. An unscorable category is assigned only if no aspect of the student's response can be scored. Scoring supervisors and/or trainers are consulted before any special coding category is assigned. The unscorable categories for mathematics are outlined in the following table.

Categories for unscorable responses, mathematics assessments: 2000
Label  Description
B      Blank responses, random marks on paper
X      Completely crossed out, completely erased
IL     Completely illegible response
OT     Off task, off topic, comments to the test makers, refusal to answer, "Who cares," language other than English (unless otherwise noted)
?      "I don't know," "I can't do this," "No clue," "I don't understand," "I forget"
NOTE: Because the NAEP scoring database recognizes only alphanumeric characters and sets a single-character
field for the value for each score, the label "IL" appears in the database file as "I," the label "OT" appears as "T," and
the label "?" appears as "D."
SOURCE: U.S. Department of Education, Institute of Education Sciences, National Center for Education Statistics,
National Assessment of Educational Progress (NAEP), Mathematics 2000 Assessment.

Special studies are also included in the mathematics assessment. When the special study item is the
same as the operational item, the responses are scored together within one team.

Number of constructed-response items, by score-point level and grade, national main and
state assessments: 2000
Grade    Total   Dichotomous 2-point items   Short 3-point items   Extended 4-point items   Extended 5-point items   Extended 6-point items
Total    163     41                          78                    13                       29                       2
4        52      11                          24                    6                        10                       1
4/8      17      3                           9                     2                        3                        0
8        30      7                           14                    0                        9                        0
8/12     11      4                           5                     1                        0                        1
12       49      13                          26                    3                        7                        0
4/8/12   4       3                           0                     1                        0                        0
SOURCE: U.S. Department of Education, Institute of Education Sciences, National Center for Education
Statistics, National Assessment of Educational Progress (NAEP), 2000 Mathematics Assessment.

Number of 1996 constructed-response items rescored in 2000, by score-point level and grade,
mathematics national main and state assessments: 2000
Grade    Total   Dichotomous 2-point items   Short 3-point items   Extended 4-point items   Extended 5-point items   Extended 6-point items
Total    126     35                          63                    8                        19                       1
4        32      5                           19                    3                        5                        0
4/8      15      3                           7                     2                        3                        0
8        25      7                           12                    0                        6                        0
8/12     9       4                           3                     1                        0                        1
12       41      13                          22                    1                        5                        0
4/8/12   4       3                           0                     1                        0                        0
SOURCE: U.S. Department of Education, Institute of Education Sciences, National Center for Education
Statistics, National Assessment of Educational Progress (NAEP), 2000 Mathematics Assessment.

Number of 1992 constructed-response items rescored in 2000, by score-point level and grade,
mathematics national main and state assessments: 2000
Grade    Total   Dichotomous 2-point items   Short 3-point items   Extended 4-point items   Extended 5-point items   Extended 6-point items
Total    65      34                          15                    8                        7                        1
4        13      4                           5                     3                        1                        0
4/8      10      3                           2                     2                        3                        0
8        11      7                           2                     0                        2                        0
8/12     8       4                           2                     1                        0                        1
12       19      13                          4                     1                        1                        0
4/8/12   4       3                           0                     1                        0                        0
SOURCE: U.S. Department of Education, Institute of Education Sciences, National Center for Education
Statistics, National Assessment of Educational Progress (NAEP), 2000 Mathematics Assessment.

Number of 1990 constructed-response items rescored in 2000, by score-point
level and grade, national main and state assessments: 2000
Grade    Total   Dichotomous 2-point items   Short 3-point items   Extended 4-point items   Extended 6-point items
Total    31      20                          6                     4                        1
4        3       1                           1                     1                        0
4/8      5       3                           2                     0                        0
8        2       2                           0                     0                        0
8/12     7       3                           2                     1                        1
12       10      8                           1                     1                        0
4/8/12   4       3                           0                     1                        0
NOTE: No extended 5-point items from the 1990 assessment were rescored in the 2000
assessment.
SOURCE: U.S. Department of Education, Institute of Education Sciences, National Center for
Education Statistics, National Assessment of Educational Progress (NAEP), 2000
Mathematics Assessment.

Mathematics Interrater Reliability

A subsample of the mathematics responses for each constructed-response item is scored by a second
rater to obtain statistics on interrater reliability. In general, items administered only to the national main
sample receive 25 percent second scoring, while those given in state samples receive 6 percent. This
reliability information is also used by the scoring supervisor to monitor the capabilities of all raters and
maintain uniformity of scoring across raters. Reliability reports are generated on demand by the scoring
supervisor, trainer, scoring director, or mathematics item development coordinator. Printed copies are
reviewed daily by the lead scoring staff. In addition to the immediate feedback provided by the online
reliability reports, each scoring supervisor can also review the actual responses scored by a rater with the
backreading tool. In this way, the scoring supervisor can monitor each rater carefully and correct
difficulties in scoring almost immediately with a high degree of efficiency.
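The differing second-scoring rates (25 percent for national-main-only items, 6 percent for state-sample items) can be sketched as a deterministic subsample selection. Taking every Nth response, as shown here, is one simple way to hit a target rate; it is an assumption for illustration, not NAEP's documented sampling method:

```python
def select_for_second_scoring(response_ids, rate):
    """Select roughly `rate` of responses for second scoring by taking
    every Nth response (N = 1/rate). Assumed approach, for illustration."""
    step = round(1 / rate)
    return [rid for i, rid in enumerate(response_ids) if i % step == 0]

national = select_for_second_scoring(list(range(100)), 0.25)  # ~25 percent
state = select_for_second_scoring(list(range(100)), 0.06)     # ~6 percent
```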

Interrater reliability ranges, by assessment year, mathematics national main and state assessments: 2000

Assessment year   Number of unique items   Items between 60% and 69%   Items between 70% and 79%   Items between 80% and 89%   Items above 90%
2000 assessment   199                      †                           1                           12                          186
1996 assessment   158                      †                           1                           2                           155
1992 assessment   91                       2                           †                           12                          77
1990 assessment   51                       3                           †                           2                           47
† No items fell within this interrater reliability range.
SOURCE: U.S. Department of Education, Institute of Education Sciences, National Center for
Education Statistics, National Assessment of Educational Progress (NAEP), 2000 Mathematics
Assessment.

During the scoring of an item or the scoring of a calibration set, scoring supervisors monitor progress
using an interrater reliability tool. This display tool functions in either of two modes:

• to display information of all first readings versus all second readings; or


• to display all readings of an individual that were also scored by another rater versus the scores
assigned by the other raters.

The information is displayed as a matrix, with scores awarded during first readings in rows and scores awarded during second readings in columns (for mode one), or the individual's scores in rows and all other raters' scores in columns (for mode two). In this format, instances of exact agreement fall
along the diagonal of the matrix. For completeness, data in each cell of the matrix contain the number
and percentage of cases of agreement (or disagreement). The display also contains information on the
total number of second readings and the overall percentage of reliability on the item. Since the interrater
reliability reports are cumulative, a printed copy of the reliability of each item is made periodically and
compared to previously generated reports. Scoring staff members save printed copies of all final reliability
reports and archive them with the training sets.

Item-by-item rater reliability, by grade, mathematics national main and state
assessment: 2000
Grade   Item      Score points   Number scored (1st and 2nd)   2000 reliability   1996 reliability   1992 reliability   1990 reliability
Total † † 3,856,211 † † † †
4 M039201 2 28,343 99 99 99 §
4 M039301 3 28,342 100 100 99 §
4 M040001 3 28,343 98 98 95 §
4 M040201 2 28,346 93 96 92 §
4 M066301 3 28,502 97 98 § §
4 M066501 3 28,505 96 98 § §
4 M066601 3 28,503 95 92 § §
4 M066701 3 28,500 96 97 § §
4 M066801 3 28,499 95 97 § §
4 M066901 5 28,503 89 92 § §
4 M019701 2 28,559 99 98 97 99
4 M019801 3 28,556 96 94 91 93
4 M019901 3 28,559 99 98 96 97
4 M020001 2 28,557 99 99 97 97
4 M020101 2 28,557 98 99 98 98
4 M020201 2 28,554 97 95 91 95
4 M020301 4 28,557 98 99 96 97
4 M020401 2 28,553 99 99 98 98
4 M020501 2 28,555 98 99 96 98
4 N277903 2 28,556 99 100 99 99
4 M020701 4 28,560 85 88 85 84
4 M067901 3 28,470 97 98 § §
4 M068001 3 28,464 98 99 § §
4 M068002 3 28,468 98 98 § §
4 M068003 3 28,469 98 98 § §
4 M068004 5 28,466 91 94 § §
4 M010631 3 29,306 98 99 96 97
4 M091101 4 29,186 97 § § §
4 M091201 5 29,184 91 § § §
4 M091401 6 29,185 91 § § §
4 M085701 3 28,470 96 § § §
4 M085901 3 28,474 96 § § §
4 M085401 5 28,469 94 § § §
4 M046001 5 29,237 99 99 97 §
4 M046601 4 29,236 97 97 94 §
4 M046801 5 29,238 99 99 96 §
4 M046901 5 29,238 99 100 97 §
4 M047301 4 29,238 98 99 96 §
4 M086601 5 28,355 97 § § §
4 M087001 3 28,356 97 § § §
4 M087301 5 28,356 94 § § §
4 M043201 2 29,107 98 98 97 §
4 M043301 3 29,108 99 99 96 §
4 M043401 4 29,107 96 97 91 §
4 M043402 4 29,116 96 98 91 §
See notes at end of table.

Item-by-item rater reliability, by grade, mathematics national main and state
assessment: 2000 (continued)
Grade   Item      Score points   Number scored (1st and 2nd)   2000 reliability   1996 reliability   1992 reliability   1990 reliability
4 M043403 3 29,108 99 98 91 §
4 M043501 5 29,112 92 91 87 §
4 M072201 3 29,183 100 99 § §
4 M072202 3 29,177 98 98 § §
4 M072401 3 29,179 95 95 § §
4 M072501 3 29,176 94 94 § §
4 M072601 3 29,127 99 99 § §
4 M072701 5 29,177 93 96 § §
4 M074301 2 28,405 99 100 § §
4 M074501 3 28,404 93 92 § §
4 M074701 3 28,405 98 98 § §
4 M074801 3 28,403 99 99 § §
4 M074901 3 28,406 95 95 § §
4 M075001 3 28,406 99 99 § §
4 M075101 5 28,404 90 91 § §
4 M087501 2 5,059 99 § § §
4 M087601 2 5,058 98 § § §
4 M088001 3 5,056 98 § § §
4 M088301 4 5,058 96 § § §
4 M088501 2 5,010 99 § § §
4 M088601 2 5,011 98 § § §
4 M088701 2 5,009 99 § § §
4 M088801 4 5,010 98 § § §
4 M089101 2 5,009 99 § § §
4 M089401 3 5,008 96 § § §
4 M090201 3 5,014 96 § § §
4 M090301 3 5,016 98 § § §
4 M090401 5 5,016 90 § § §
4 M019901 3 3,448 99 § § §
4 M067901 3 3,448 98 § § §
4 M066701 3 3,449 97 § § §
4 M075101 5 3,449 89 § § §
4 M074301 2 3,459 99 § § §
4 M074801 3 3,460 99 § § §
4 M074501 3 3,457 91 § § §
4 M043403 3 3,459 100 § § §
4 M043501 5 3,460 88 § § §
4 M043501 2 3,446 98 § § §
4 M043201 2 3,448 98 § § §
4 M046801 5 3,450 99 § § §
4 M072601 3 3,450 98 § § §
4 M040201 2 3,449 93 § § §
4 M043201 2 513 98 98 97 §
4 M043201 2 514 100 98 97 §
4 M043301 3 512 96 99 96 §
See notes at end of table.

Item-by-item rater reliability, by grade, mathematics national main and state
assessment: 2000 (continued)
Grade   Item      Score points   Number scored (1st and 2nd)   2000 reliability   1996 reliability   1992 reliability   1990 reliability
4 M043301 3 514 96 99 96 §
4 M043401 4 515 98 97 91 §
4 M043401 4 514 98 97 91 §
4 M043402 4 515 98 98 91 §
4 M043402 4 514 100 98 91 §
4 M043403 3 513 100 98 91 §
4 M043403 3 513 100 98 91 §
4 M043501 5 513 100 91 87 §
4 M043501 5 513 94 91 87 §
4 M072201 3 513 100 99 § §
4 M072201 3 514 96 99 § §
4 M072202 3 514 98 98 § §
4 M072202 3 514 98 98 § §
4 M072401 3 513 94 95 § §
4 M072401 3 515 96 95 § §
4 M072501 3 513 100 94 § §
4 M072501 3 514 96 94 § §
4 M072601 3 515 100 99 § §
4 M072601 3 514 100 99 § §
4 M072701 5 513 98 96 § §
4 M072701 5 513 96 96 § §
8 M093501 3 28,075 93 § § §
8 M093601 3 28,075 99 § § §
8 M093801 5 28,074 94 § § §
8 M066301 3 28,120 98 99 § §
8 M066501 3 28,121 95 96 § §
8 M066601 3 28,121 97 94 § §
8 M067201 3 28,123 91 91 § §
8 M067501 5 28,119 89 94 § §
8 M019701 2 28,092 99 100 99 100
8 M019801 3 28,095 98 98 96 96
8 M019901 3 28,092 99 99 97 98
8 M020001 2 28,093 100 100 98 98
8 M020101 2 28,092 98 100 99 99
8 M020201 2 28,092 99 97 96 97
8 M020301 4 28,094 99 100 98 98
8 M020401 2 28,096 99 99 100 99
8 M020501 2 28,095 99 100 99 99
8 M020801 6 28,090 96 97 93 93
8 M020901 2 28,092 93 94 90 69
8 M021001 2 28,092 100 100 98 99
8 M021101 3 28,096 95 95 91 92
8 M021201 3 28,092 98 98 96 97
8 M021301 2 28,095 98 98 96 95
8 M021302 2 28,098 97 98 94 95
See notes at end of table.

Item-by-item rater reliability, by grade, mathematics national main and state
assessment: 2000 (continued)
Grade   Item      Score points   Number scored (1st and 2nd)   2000 reliability   1996 reliability   1992 reliability   1990 reliability
8 M067901 3 27,976 98 98 § §
8 M068003 3 27,978 98 99 § §
8 M068006 3 27,985 95 94 § §
8 M068005 3 27,975 98 98 § §
8 M068008 3 27,981 95 93 § §
8 M068201 5 27,977 93 93 § §
8 M013031 4 28,094 99 99 96 97
8 M013131 2 28,091 98 98 96 95
8 M052401 2 28,095 94 96 89 §
8 M052901 2 28,094 92 95 84 §
8 M053001 2 28,095 95 94 90 §
8 M053101 5 28,089 89 90 86 §
8 M085701 3 28,192 98 § § §
8 M085901 3 28,192 98 § § §
8 M086301 5 28,194 94 § § §
8 M046001 5 28,192 99 99 98 §
8 M046601 4 28,189 98 98 96 §
8 M046801 5 28,192 99 99 98 §
8 M046901 5 28,192 99 100 98 §
8 M047301 4 28,192 99 100 98 §
8 M047901 3 28,190 99 99 97 §
8 M092401 3 27,961 95 § § §
8 M092601 3 27,964 99 § § §
8 M092001 5 27,966 88 § § §
8 M051201 2 28,098 100 100 99 §
8 M051301 2 28,102 100 100 99 §
8 M051601 2 28,100 99 99 97 §
8 M052101 3 28,106 100 97 96 §
8 M052201 5 28,101 90 93 88 §
8 M072901 3 28,109 96 93 § §
8 M073401 3 28,104 96 97 § §
8 M073501 3 28,106 98 98 § §
8 M073601 5 28,108 93 90 § §
8 M075301 3 28,078 97 98 § §
8 M075401 3 28,073 95 94 § §
8 M075601 3 28,074 95 95 § §
8 M075801 3 28,077 91 91 § §
8 M076001 5 28,076 91 92 § §
8 M051201 2 189 100 § § §
8 M051201 2 187 100 § § §
8 M051301 2 188 100 § § §
8 M051301 2 189 100 § § §
8 M051601 2 188 100 § § §
8 M051601 2 189 100 § § §
8 M052101 3 189 100 § § §
See notes at end of table.

Item-by-item rater reliability, by grade, mathematics national main and state
assessment: 2000 (continued)
Grade   Item      Score points   Number scored (1st and 2nd)   2000 reliability   1996 reliability   1992 reliability   1990 reliability
8 M052101 3 189 100 § § §
8 M052201 5 188 100 § § §
8 M052201 5 188 100 § § §
8 M072901 3 188 100 § § §
8 M072901 3 189 89 § § §
8 M073401 3 188 100 § § §
8 M073401 3 188 100 § § §
8 M073501 3 188 100 § § §
8 M073501 3 188 94 § § §
8 M073601 5 188 100 § § §
8 M073601 5 188 94 § § §
12 M056801 2 4,129 99 99 98 §
12 M056901 3 4,126 96 98 94 §
12 M057001 2 4,125 99 99 96 §
12 M057101 2 4,124 97 98 94 §
12 M070801 3 4,124 89 87 § §
12 M071001 3 4,123 96 95 § §
12 M071101 3 4,124 98 97 § §
12 M071201 3 4,127 96 94 § §
12 M071301 3 4,125 99 96 § §
12 M071401 5 4,126 97 98 § §
12 M021401 2 4,106 100 99 98 99
12 M021501 2 4,109 98 97 97 97
12 M021502 2 4,107 99 99 99 99
12 M021601 4 4,108 97 98 89 90
12 M021602 2 4,107 97 96 92 94
12 M020201 2 4,104 98 99 97 98
12 M020301 4 4,104 99 100 98 99
12 M020401 2 4,106 99 99 99 99
12 M020501 2 4,108 99 100 99 99
12 M020801 6 4,102 96 98 92 95
12 M020901 2 4,108 95 95 93 63
12 M021001 2 4,109 99 100 100 99
12 M021101 3 4,107 94 95 92 94
12 M021201 3 4,107 97 98 95 96
12 M021701 2 4,106 100 99 97 97
12 M021702 2 4,108 99 98 95 95
12 M021801 2 4,105 99 99 97 96
12 M071502 3 4,038 94 93 § §
12 M071602 3 4,040 92 92 § §
12 M071603 3 4,040 90 92 § §
12 M071604 3 4,040 95 94 § §
12 M071701 3 4,040 95 96 § §
12 M071801 5 4,041 95 96 § §
12 M013031 4 4,076 99 99 97 98
12 M013131 2 4,078 98 99 96 94
12 M011931 2 4,078 98 100 97 94
12 M012031 3 4,078 99 99 94 96
12 M052401 2 4,098 92 94 89 §
12 M053301 2 4,095 92 92 89 §
12 M053401 5 4,095 72 75 70 §
12 M094201 4 4,063 95 § § §
12 M094301 3 4,063 94 § § §
12 M094701 5 4,063 88 § § §
12 M058901 3 4,048 99 99 98 §
12 M059702 3 4,049 94 92 68 §
12 M059801 2 4,047 98 98 96 §
12 M092301 3 4,125 96 § § §
12 M092401 3 4,124 95 § § §
12 M092601 3 4,123 97 § § §
12 M092901 5 4,123 94 § § §
12 M095001 3 4,100 94 § § §
12 M095301 3 4,099 92 § § §
12 M095401 4 4,099 97 § § §
12 M073801 3 4,100 98 97 § §
12 M073401 3 4,097 93 94 § §
12 M073901 3 4,100 96 97 § §
12 M074001 3 4,099 94 97 § §
12 M074101 5 4,099 95 96 § §
12 M076101 3 4,114 98 99 § §
12 M076601 3 4,116 98 99 § §
12 M076701 3 4,113 98 98 § §
12 M076801 3 4,115 98 98 § §
12 M076901 3 4,116 98 97 § §
12 M077001 5 4,116 90 93 § §
† Not applicable.
§ Item had not been created at the time of the assessment noted in this column heading.
SOURCE: U.S. Department of Education, Institute of Education Sciences, National Center for
Education Statistics, National Assessment of Educational Progress (NAEP), 2000 Mathematics
Assessment.
Scoring of the 2000 Mathematics Assessment Large-Print Booklets

A subset of the items scored came from large-print booklets. These booklets were
administered to students with disabilities who had met the criteria for participation with accommodations.
Because these booklets were not scannable, they were transported to the scoring center after processing. A
log and score sheet were created to account for these booklets. As a rater scored an item, he or she
marked the score for that response, his or her rater ID, and the date on which the item was scored. Once
all items in each booklet for a given subject were scored, the mathematics scoring director returned the
sheets to Pearson clerical staff to enter those scores manually into the records for these booklets.

In the 2000 assessment, there were 32 large-print mathematics booklets.
Item-by-item rater reliability for items in large-print booklets, by grade,
mathematics national main and state assessments: 2000
Score Number scored 2000 1996 1992
Grade Item points (1st and 2nd) reliability reliability reliability
Total † † 753,796 † † †
4 W1M3_03 2 28,343 99 99 99
4 W1M3_04 3 28,342 100 100 99
4 W1M3_11 3 28,343 98 98 95
4 W1M3_13 2 28,346 93 96 92
4 W1M9_07 4 29,186 97 § §
4 W1M9_08 5 29,184 91 § §
4 W1M9_10 6 29,185 91 § §
4 W12M11A_01 5 29,237 99 99 97
4 W12M11A_07 4 29,236 97 97 94
4 W12M11A_09 5 29,238 99 99 96
4 W12M11A_10 5 29,238 99 100 97
4 W12M11A_14 4 29,238 98 99 96
8 W2M3_06 3 28,075 93 § §
8 W2M3_07 3 28,075 99 § §
8 W2M3_09 5 28,074 94 § §
8 W23M9B_02 2 28,095 94 96 89
8 W23M9B_07 2 28,094 92 95 84
8 W23M9B_08 2 28,095 95 94 90
8 W23M9B_09 5 28,089 89 90 86
8 W12M11B_01 5 28,192 99 99 98
8 W12M11B_07 4 28,189 98 98 96
8 W12M11B_09 5 28,192 99 99 98
8 W12M11B_10 5 28,192 99 100 98
8 W12M11B_14 4 28,192 99 100 98
8 W12M11B_18 3 28,190 99 99 97
12 S3M3_10 2 4,129 99 99 98
12 S3M3_12 3 4,126 96 98 94
12 S3M3_13 2 4,125 99 99 96
12 S3M3_14 2 4,124 97 98 94
12 S23M9C_02 2 4,098 92 94 89
12 S23M9C_08 2 4,095 92 92 89
12 S23M9C_09 5 4,095 72 75 70
12 S3M11_04 3 4,048 99 99 98
12 S3M11_13 3 4,049 94 92 68
12 S3M11_14 2 4,047 98 98 96
† Not applicable.
§ Item had not been created at the time of the assessment noted in this column heading.
SOURCE: U.S. Department of Education, Institute of Education Sciences, National Center for
Education Statistics, National Assessment of Educational Progress (NAEP), 2000
Mathematics Assessment.
Scoring NAEP Reading Assessments

The reading items scored include short constructed responses and extended constructed responses.
Each constructed-response item has a unique scoring guide that identifies the range of possible scores
for the item. To measure longitudinal trends in reading, NAEP requires trend scoring—replication of
scoring from prior assessment years—to demonstrate statistically that scoring is comparable across
years. Students' constructed responses are scored on computer workstations using an image-based
scoring system. This allows for item-by-item scoring and online, real-time monitoring of reading interrater
reliabilities, as well as the performance of each individual rater. A subset of these items—those
that appear in large-print booklets—requires scoring by hand. The 2000 reading assessment included 46
discrete constructed-response items. The total number of constructed responses scored was 123,100.
The number of raters working on the reading assessment and the location of the scoring are listed here:

Scoring activities, reading assessment: 2000


Number of Number of
Scoring location Start date End date raters supervisors
Tucson, Arizona 4/3/2000 4/3/2000 40 4
SOURCE: U.S. Department of Education, Institute of Education Sciences,
National Center for Education Statistics, National Assessment of
Educational Progress (NAEP), 2000 Reading Assessment.

Each constructed-response item has a unique scoring guide that identifies the range of possible scores
for the item and defines the criteria to be used in evaluating student responses. During the course of the
project, each team scores the items using a 2-, 3-, or 4-point scale as outlined below:

Dichotomous Items
1 = unacceptable
2, 3, or 4 = acceptable

Short Three-Point Items
1 = evidence of little or no comprehension
2 = evidence of partial or surface comprehension
3 = evidence of full comprehension

Extended Four-Point Items
1 = unsatisfactory
2 = partial
3 = essential
4 = extensive
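Dichotomous reading items collapse the scoring-guide levels into acceptable versus unacceptable, as listed above. A minimal sketch of that collapse in Python (the function name is illustrative, not part of the NAEP scoring system):

```python
def dichotomize(score_level: int) -> str:
    """Collapse a reading scoring-guide level to the dichotomous scale:
    level 1 is unacceptable; levels 2, 3, or 4 are acceptable."""
    if score_level == 1:
        return "unacceptable"
    if score_level in (2, 3, 4):
        return "acceptable"
    raise ValueError(f"unexpected score level: {score_level}")

print(dichotomize(3))  # -> acceptable
```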

In some cases, student responses do not fit into any of the categories listed on the scoring guide. Special
coding categories for the unscorable responses are assigned to these types of responses. These
categories are only assigned if no aspect of the student's response can be scored. Scoring supervisors
and/or trainers are consulted prior to the assignment of any of the special coding categories. The
unscorable categories used for reading are outlined as follows.

Categories for unscorable responses, reading assessment: 2000


Label Description
B Blank responses, random marks on paper, word underlined in prompt but response area
completely blank, mark on item number but response area completely blank
X Completely crossed out, completely erased
IL Completely illegible response
OT Off task, off topic, comments to the test makers, refusal to answer, "Who cares," language
other than English (unless otherwise noted)
? "I don't know," "I can't do this," "No clue," "I don't understand," "I forget"
NOTE: Because the NAEP scoring contractor's database recognizes only alphanumeric characters and sets a
single-character field for the value for each score, the label "IL" appears in the database file as "I," the label "OT"
appears as "T," and the label "?" appears as "D."
SOURCE: U.S. Department of Education, Institute of Education Sciences, National Center for Education
Statistics, National Assessment of Educational Progress (NAEP), 2000 Reading Assessment.
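The note beneath the table describes how multi-character labels are stored as single characters in the scoring contractor's database. A minimal sketch of that label-to-code mapping (the dictionary and function names are illustrative):

```python
# Map unscorable-response labels to the single-character codes stored in the
# database, per the table note ("IL" -> "I", "OT" -> "T", "?" -> "D").
UNSCORABLE_DB_CODES = {
    "B": "B",   # blank response or random marks
    "X": "X",   # completely crossed out or erased
    "IL": "I",  # completely illegible response
    "OT": "T",  # off task / off topic
    "?": "D",   # "I don't know" and similar responses
}

def to_db_code(label: str) -> str:
    """Return the one-character database code for an unscorable label."""
    return UNSCORABLE_DB_CODES[label]
```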
Special studies are also included in the reading assessment. When the special study item is the same as
the operational item, the responses are scored together within one team.
Number of constructed-response items, by score-point level, grade 4
reading national main assessment: 2000
Dichotomous Short 3-point Extended 4-
Assessment Total 2-point items items point items
Total 122 67 34 21
2000 reading items scored 46 24 14 8
1998 reading items rescored 41 23 11 7
1994 reading items rescored 35 20 9 6
SOURCE: U.S. Department of Education, Institute of Education Sciences, National
Center for Education Statistics, National Assessment of Educational Progress (NAEP),
2000 Reading Assessment.
Number of 1998 constructed-response items rescored in
2000, by score-point level, grade 4 reading national main
assessment: 2000
Dichotomous Short 3-point Extended 4-point
Grade Total 2-point items items items
Total 41 23 11 7
4 31 15 11 5
4/8 10 8 0 2
SOURCE: U.S. Department of Education, Institute of Education
Sciences, National Center for Education Statistics, National
Assessment of Educational Progress (NAEP), 2000 Reading
Assessment.
Number of 1994 constructed-response items rescored in 2000, by score-point
level, grade 4 reading national main assessment: 2000
Dichotomous Short Extended
Grade Total 2-point items 3-point items 4-point items
Total 35 20 9 6
4 25 12 9 4
4/8 10 8 0 2
SOURCE: U.S. Department of Education, Institute of Education Sciences, National Center for
Education Statistics, National Assessment of Educational Progress (NAEP), 2000 Reading
Assessment.
Reading Interrater Reliability

A subsample of the reading responses for each constructed-response item is scored by a second rater to
obtain statistics on interrater reliability. Reading item responses in the 2000 assessment received 25
percent second scoring. This reliability information is also used by the scoring supervisor to monitor the
capabilities of all raters and maintain uniformity of scoring across raters. Reliability reports are generated
on demand by the scoring supervisor, trainer, scoring director, or item development subject area
coordinator. Printed copies are reviewed daily by both Pearson and lead scoring staff. In addition to the
immediate feedback provided by the online reliability reports, each scoring supervisor can also review the
actual responses scored by a rater with the backreading tool. In this way, the scoring supervisor can
monitor each rater carefully and correct difficulties in scoring almost immediately with a high degree of
efficiency.
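Interrater reliability in these tables is the percentage of double-scored responses on which the first and second raters assigned exactly the same score. A short sketch of that computation, assuming paired first and second readings (the function name is illustrative):

```python
def exact_agreement(first_scores, second_scores):
    """Percentage of double-scored responses on which the first and second
    raters assigned the same score."""
    if len(first_scores) != len(second_scores):
        raise ValueError("score lists must be paired response-for-response")
    matches = sum(a == b for a, b in zip(first_scores, second_scores))
    return 100.0 * matches / len(first_scores)

# Example: 9 of 10 double-scored responses agree -> 90 percent reliability.
first = [1, 2, 2, 3, 1, 2, 3, 3, 1, 2]
second = [1, 2, 2, 3, 1, 2, 3, 2, 1, 2]
print(round(exact_agreement(first, second)))  # -> 90
```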

Interrater reliability ranges, by assessment year, reading
national main assessment: 2000
Number of Number of items Number of items Number of
unique between 70% between 80% items above
Assessment items and 79% and 89% 90%
2000 reading 46 3 17 26
1998 reading 41 2 16 23
1994 reading 35 † 13 22
† No items in the 1994 reading assessment fell in this range; interrater reliability exceeded 79% for all items.
SOURCE: U.S. Department of Education, Institute of Education Sciences,
National Center for Education Statistics, National Assessment of
Educational Progress (NAEP), 2000 Reading Assessment.

During the scoring of an item or the scoring of a calibration set, scoring supervisors monitor progress
using an interrater reliability tool. This display tool functions in either of two modes:

• to display information on all first readings versus all second readings; or

• to display all readings by an individual rater that were also scored by another rater, versus the scores
assigned by the other raters.

The information is displayed as a matrix, with scores awarded during first readings in rows and
scores awarded during second readings in columns (for mode one), or the individual rater's scores
in rows and all other raters' scores in columns (for mode two). In this format, instances of exact agreement fall
along the diagonal of the matrix. For completeness, data in each cell of the matrix contain the number
and percentage of cases of agreement (or disagreement). The display also contains information on the
total number of second readings and the overall percentage of reliability on the item. Since the interrater
reliability reports are cumulative, a printed copy of the reliability of each item is made periodically and
compared to previously generated reports. Scoring staff members save printed copies of all final reliability
reports and archive them with the training sets.
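The matrix display described above can be sketched as a simple cross-tabulation; this illustrative Python version covers mode one (all first readings versus all second readings), with exact agreement on the diagonal:

```python
from collections import Counter

def agreement_matrix(first, second, score_points):
    """Cross-tabulate first-reading scores (rows) against second-reading
    scores (columns); exact agreement lies on the diagonal."""
    counts = Counter(zip(first, second))
    matrix = [[counts[(r, c)] for c in score_points] for r in score_points]
    diagonal = sum(counts[(s, s)] for s in score_points)
    reliability = 100.0 * diagonal / len(first)
    return matrix, reliability

matrix, reliability = agreement_matrix(
    first=[1, 1, 2, 2, 3], second=[1, 2, 2, 2, 3], score_points=[1, 2, 3])
# matrix[0][1] == 1 records the single 1-versus-2 disagreement;
# 4 of 5 pairs agree, so reliability is 80.0.
```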
Item-by-item rater reliability, grade 4 reading national main assessment:
2000
Number scored
Item Score points (1st and 2nd) 2000 reliability 1998 reliability 1994 reliability
Total † 123,100 † † †
R017001 2 2,674 86 93 §
R017003 3 2,674 80 87 §
R017004 2 2,674 90 95 §
R017006 2 2,674 91 94 §
R017007 4 2,674 76 79 §
R017009 3 2,674 87 89 §
R012102 2 2,697 95 98 95
R012104 2 2,697 93 95 93
R012106 2 2,697 91 90 92
R012108 2 2,697 96 96 96
R012109 2 2,697 96 96 96
R012111 4 2,697 91 89 92
R012112 2 2,697 92 93 95
R012601 2 2,666 93 89 91
R012604 2 2,666 93 93 95
R012607 4 2,666 81 84 90
R012611 2 2,666 91 91 96
R017301 2 2,683 94 § §
R017303 3 2,683 88 § §
R017305 3 2,683 94 § §
R017307 4 2,683 81 § §
R017309 3 2,683 89 § §
R012702 2 2,684 91 96 94
R012703 2 2,684 87 93 92
R012705 2 2,684 92 93 95
R012706 2 2,684 83 88 92
R012708 4 2,684 83 87 86
R012710 2 2,684 94 95 96
R015702 3 2,679 81 85 86
R015703 3 2,679 90 89 88
R015704 3 2,679 83 84 85
R015705 3 2,679 92 90 90
R015707 4 2,679 85 88 88
R015709 3 2,679 95 90 91
R015802 2 2,670 96 91 91
R015803 3 2,670 88 87 84
R015804 4 2,670 77 80 83
R015806 3 2,670 86 86 84
R015807 3 2,670 89 89 83
R015809 3 2,670 93 91 84
R012503 2 2,650 90 93 90
R012504 2 2,650 98 98 96
R012506 2 2,650 93 96 92
R012508 2 2,650 97 97 97
R012511 2 2,650 98 97 95
R012512 4 2,650 84 85 83
† Not applicable.
§ Item had not been created in the year noted in this column heading.
SOURCE: U.S. Department of Education, Institute of Education Sciences, National
Center for Education Statistics, National Assessment of Educational Progress
(NAEP), 2000 Reading Assessment.
Scoring of the 2000 Reading Assessment Large-Print Booklets

A subset of the items scored came from large-print booklets. These booklets were
administered to students with disabilities who had met the criteria for participation with accommodations.
Because these booklets were not scannable, they were transported to the scoring center after processing. A
log and score sheet were created to account for these booklets. As a rater scored an item, he or she
marked the score for that response, his or her rater ID, and the date on which the item was scored. Once
all items in each booklet for a given subject were scored, the reading scoring director returned the sheets
to NAEP clerical staff to enter those scores manually into the records for these booklets.

In the 2000 assessment, there was one large-print reading booklet.
Scoring NAEP Science Assessments

The NAEP science items that are not scored by machine are constructed-response items—those for
which the student must write in a response rather than selecting from a printed list of multiple choices.
Each constructed-response item has a unique scoring guide that identifies the range of possible scores
for the item. To measure longitudinal trends in science, NAEP requires trend scoring—replication of
scoring from prior assessment years—to demonstrate statistically that scoring is comparable across
years.

Students' constructed responses are scored on computer workstations using an image-based scoring
system. This allows for item-by-item scoring and online, real-time monitoring of science interrater
reliabilities and the performance of each individual rater. A subset of these items—those that appeared in
large-print booklets—required scoring by hand. The 2000 science assessment included 295 discrete
constructed-response items. The total number of constructed responses scored was 4,398,021. The
number of raters working on the science assessment and the location of the scoring are listed here:

Location of scoring activities, science assessment: 2000


Number of
Number of scoring
Scoring location Start date End date raters supervisors
Iowa City, Iowa 3/13/2000 6/04/2000 115 16
Tucson, Arizona 4/13/2000 4/29/2000 40 4
SOURCE: U.S. Department of Education, Institute of Education Sciences,
National Center for Education Statistics, National Assessment of Educational
Progress (NAEP), 2000 Science Assessment.

One unique aspect of the science assessment is the use of "hands-on" tasks that are given to students as
a part of the assessment. Each student who performs a hands-on task is given a kit with all of the
materials needed to conduct the experiment. For the 2000 assessment, a total of 9 hands-on tasks (3 per
grade) originally designed for the 1996 assessment were chosen for use, although the actual kits used by
the students were new. During scoring of the hands-on task items, raters actually performed the
experiment as part of their training. Each student's experiment was scored as a unit because of the
interconnected nature of the questions the student had to answer.

Each item's scoring guide identifies the range of possible scores for the item and defines the criteria to be
used in evaluating student responses. During the course of the project, each team scores the items using
a 2-, 3-, 4-, or 5-point scale as outlined below:

Dichotomous Items
3 = complete
1 = unsatisfactory/incorrect

Short Three-Point Items
3 = complete
2 = partial
1 = unsatisfactory/incorrect

Extended Four-Point Items
4 = complete
3 = essential
2 = partial
1 = unsatisfactory/incorrect

Extended Five-Point Items
5 = complete
4 = essential
3 = adequate
2 = partial
1 = unsatisfactory/incorrect

In some cases, student responses do not fit into any of the categories listed in the scoring guide. Special
coding categories for the unscorable responses are assigned to these types of responses. These
categories are only assigned if no aspect of the student's response can be scored. Scoring supervisors
and/or trainers are consulted prior to the assignment of any of the special coding categories. The
unscorable categories used for science are outlined below.

Categories for unscorable responses, science assessments


Label Description
B Blank responses, random marks on paper, word underlined in prompt but
response area completely blank, mark on item number but response area
completely blank
X Completely crossed out, completely erased
IL Completely illegible response
OT Off task, off topic, comments to the test makers, refusal to answer, "Who
cares," language other than English (unless otherwise noted)
? "I don't know," "I can't do this," "No clue," "I don't understand," "I forget"
NOTE: Because the NAEP scoring contractor's database recognizes only alphanumeric
characters and sets a single-character field for the value for each score, the label "IL" appears in
the database file as "I," the label "OT" appears as "T," and the label "?" appears as "D."
SOURCE: U.S. Department of Education, Institute of Education Sciences, National Center for
Education Statistics, National Assessment of Educational Progress (NAEP), 2000 Science
Assessment.
Number of constructed-response items, by score-point level and grade, science
national main and state assessments: 2000
Dichotomous 2-point Short 3-point Extended 4-point Extended 5-point
Grade Total items items items items
Total 246 12 190 38 6
4 60 5 48 6 1
4/8 20 0 16 4 0
8 61 2 49 9 1
8/12 29 2 24 3 0
12 76 3 53 16 4
SOURCE: U.S. Department of Education, Institute of Education Sciences, National Center for
Education Statistics, National Assessment of Educational Progress (NAEP), 2000 Science
Assessment.
Number of 1996 constructed-response items rescored in 2000, by score-point level
and grade, science national main and state assessments: 2000
Dichotomous 2-point Short 3-point Extended 4-point Extended 5-point
Grade Total items items items items
Total 200 9 149 36 6
4 50 5 38 6 1
4/8 20 0 16 4 0
8 43 1 33 8 1
8/12 29 2 24 3 0
12 58 1 38 15 4
SOURCE: U.S. Department of Education, Institute of Education Sciences, National Center for
Education Statistics, National Assessment of Educational Progress (NAEP), 2000 Science
Assessment.
Science Interrater Reliability

A subsample of the science responses for each constructed-response item is scored by a second rater to
obtain statistics on interrater reliability. In general, items administered only to the national main sample
receive 25 percent second scoring, while those given in state samples receive 6 percent. This reliability
information is also used by the scoring supervisor to monitor the capabilities of all raters and maintain
uniformity of scoring across raters. Reliability reports are generated on demand by the scoring supervisor,
trainer, scoring director, or item development subject-area coordinator. Printed copies are reviewed daily
by lead scoring staff. In addition to the immediate feedback provided by the online reliability reports, each
scoring supervisor can also review the actual responses scored by a rater with the backreading tool. In
this way, the scoring supervisor can monitor each rater carefully and correct difficulties in scoring almost
immediately with a high degree of efficiency.
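The interrater reliability range tables in this section tally items into bands by their overall reliability. A minimal sketch of that tally, assuming whole-number percentages (the band labels and function name are illustrative):

```python
def tally_reliability_ranges(reliabilities):
    """Count items whose exact-agreement reliability falls in each band
    used by the range tables: 70-79%, 80-89%, and 90% or above."""
    bands = {"70-79": 0, "80-89": 0, "90+": 0}
    for r in reliabilities:
        if 70 <= r <= 79:
            bands["70-79"] += 1
        elif 80 <= r <= 89:
            bands["80-89"] += 1
        elif r >= 90:
            bands["90+"] += 1
    return bands

# Example: four items with reliabilities of 72, 85, 91, and 95 percent.
print(tally_reliability_ranges([72, 85, 91, 95]))
# -> {'70-79': 1, '80-89': 1, '90+': 2}
```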

Interrater reliability ranges, by assessment year,
science national main and state assessments: 2000
Number of Number of items
unique between 80% Number of items
Assessment items and 89% above 90%
2000 science 295 25 270
1996 science 249 41 208
SOURCE: U.S. Department of Education, Institute of
Education Sciences, National Center for Education Statistics,
National Assessment of Educational Progress (NAEP), 2000
Science Assessment.

During the scoring of an item or the scoring of a calibration set, scoring supervisors monitor progress
using an interrater reliability tool. This display tool functions in either of two modes:

• to display information on all first readings versus all second readings; or

• to display all readings by an individual rater that were also scored by another rater, versus the scores
assigned by the other raters.

The information is displayed as a matrix, with scores awarded during first readings in rows and
scores awarded during second readings in columns (for mode one), or the individual rater's scores
in rows and all other raters' scores in columns (for mode two). In this format, instances of exact agreement fall
along the diagonal of the matrix. For completeness, data in each cell of the matrix contain the number
and percentage of cases of agreement (or disagreement). The display also contains information on the
total number of second readings and the overall percentage of reliability on the item. Since the interrater
reliability reports are cumulative, a printed copy of the reliability of each item is made periodically and
compared to previously generated reports. Scoring staff members save printed copies of all final reliability
reports and archive them with the training sets.
Item-by-item rater reliability, by grade, science national main and state
assessment: 2000
Number scored
Grade Item Score points (1st and 2nd) 2000 reliability 1996 reliability
Total † † 4,398,021 † †
4 K031001 3 19,578 95 96
4 K031002 3 19,578 90 89
4 K031003 3 19,578 87 91
4 K031004 2 19,578 91 96
4 K031005 3 19,578 84 88
4 K031006 3 19,578 93 94
4 K031007 3 19,578 88 90
4 K031101 2 19,429 97 97
4 K031102 2 19,429 96 95
4 K031103 2 19,429 94 94
4 K031104 2 19,429 98 99
4 K031105 3 19,429 100 99
4 K031107 4 19,428 94 93
4 K031301 4 19,845 92 94
4 K031309 4 19,845 91 93
4 K031302 3 19,845 94 96
4 K031303 3 19,845 94 95
4 K031304 3 19,845 97 96
4 K031401 4 26,088 88 88
4 K031402 3 26,090 92 94
4 K031403 3 26,089 93 94
4 K031404 3 26,090 95 92
4 K031407 3 26,088 90 90
4 K031408 3 26,089 98 98
4 K031409 3 26,088 97 95
4 K031410 3 26,091 96 95
4 K103901 3 26,043 93 §
4 K103101 3 26,042 95 §
4 K031602 3 26,250 98 98
4 K031603 3 26,254 98 98
4 K031604 3 26,251 98 99
4 K031606 3 26,253 92 96
4 K031607 4 26,256 91 93
4 K031608 3 26,251 95 94
4 K031609 3 26,250 97 97
4 K031901 3 19,524 94 88
4 K032001 3 19,525 94 97
4 K032501 3 19,523 93 96
4 K032502 3 19,523 92 96
4 K032601 3 19,521 93 90
4 K032602 3 19,520 96 94
4 K099501 3 19,621 94 §
4 K098301 3 19,623 91 §
4 K098201 3 19,624 91 §
4 K092201 3 19,621 92 §
4 K034001 3 19,551 96 92
4 K034101 3 19,554 93 90
4 KW34101 3 19,554 93 90
4 KX34101 3 19,554 97 96
4 KY34101 3 19,554 91 86
4 KZ34101 3 19,554 95 93
4 K034401 3 19,552 94 92
4 K034501 4 19,552 91 92
4 K034502 3 19,551 98 99
4 K034802 3 19,878 93 94
4 K034901 3 19,881 93 96
4 K034902 3 19,882 92 95
4 K035201 3 19,883 92 93
4 K035301 4 19,881 92 94
4 K035601 3 26,087 92 95
4 K035801 3 26,086 90 94
4 K035901 3 26,086 97 97
4 K036101 3 26,087 93 90
4 K036301 3 26,086 97 97
4 K037301 3 19,570 96 97
4 K037401 3 19,567 94 96
4 K037501 3 19,570 99 97
4 K037601 3 19,569 92 93
4 K037701 4 19,569 98 97
4 K037702 3 19,570 98 94
4 K096901 3 19,825 92 §
4 K098401 3 19,826 94 §
4 K100701 3 19,828 95 §
4 K099601 3 19,826 92 §
4 K039801 3 19,658 96 97
4 K039901 5 19,656 92 91
4 K040001 4 19,658 92 90
4 K040301 3 19,660 92 89
4 K040401 4 19,659 94 96
4 K040501 3 19,659 98 98
8 K040601 3 18,996 99 99
8 K040603 4 18,996 96 96
8 K040604 4 18,996 93 94
8 K040605 4 18,996 95 96
8 K040606 4 18,996 95 96
8 K040607 3 18,996 93 94
8 K040608 3 18,996 94 91
8 K040609 3 18,996 97 95
8 K040610 4 18,996 95 93
8 K040801 3 19,142 97 98
8 K040802 3 19,142 96 97
8 K040808 3 19,142 98 99
8 K040809 3 19,142 97 98
8 K040803 3 19,142 95 97
8 K040805 3 19,142 94 97
8 K040806 2 19,142 94 96
8 K031301 4 19,281 95 95
8 K031309 4 19,281 96 97
8 K031302 3 19,281 90 98
8 K031305 3 19,281 93 97
8 K031306 2 19,281 97 89
8 K031307 3 19,281 92 95
8 K031308 3 19,280 99 98
8 K102301 3 25,535 96 §
8 K102001 4 25,536 87 §
8 K101801 3 25,535 87 §
8 K098501 3 25,536 95 §
8 K101201 3 25,533 91 §
8 K097901 3 25,533 97 §
8 K041306 3 25,614 86 87
8 K041307 3 25,616 90 90
8 K041401 3 25,614 96 96
8 K041402 3 25,614 97 99
8 K041403 3 25,612 91 94
8 K031602 3 25,427 97 98
8 K031603 3 25,429 99 99
8 K031604 3 25,424 99 100
8 K031606 3 25,428 92 95
8 K031610 3 25,423 96 98
8 K031607 4 25,427 89 90
8 K031608 3 25,427 93 92
8 K031609 3 25,427 93 96
8 K031611 3 25,427 98 97
8 K031613 3 25,426 98 99
8 K099001 3 19,163 97 §
8 K092601 3 19,166 96 §
8 K095901 3 19,165 94 §
8 K093601 3 19,166 98 §
8 K095801 2 19,166 92 §
8 K096101 3 19,164 94 §
8 K043001 3 19,147 96 95
8 K043101 3 19,147 89 89
8 K043102 4 19,147 91 85
8 K043103 3 19,145 91 92
8 K043501 3 19,145 93 94
8 K043601 3 19,145 94 90
8 K043602 3 19,145 97 95
8 K043603 3 19,148 89 88
8 K047201 4 19,153 94 88
8 K047301 3 19,158 98 96
8 K047401 3 19,154 95 92
8 K047901 3 19,156 98 93
8 K048001 3 19,153 99 99
8 K048101 4 19,157 93 91
8 K048102 3 19,157 98 96
8 K048103 2 19,159 94 95
8 K048601 3 19,265 97 93
8 K048901 3 19,265 99 98
8 K049001 3 19,266 99 100
8 K049301 3 19,269 100 98
8 K049401 3 19,263 94 94
8 K049402 3 19,266 89 90
8 K049403 3 19,268 93 89
8 K049404 4 19,268 87 85
8 K035601 3 25,607 91 93
8 K035801 3 25,609 93 94
8 K035901 3 25,602 95 95
8 K036101 3 25,603 93 89
8 K036301 3 25,606 95 95
8 K036401 3 25,601 98 97
8 K036403 3 25,601 95 92
8 K036404 3 25,601 94 93
8 K036402 3 25,603 92 93
8 K036701 3 25,606 98 97
8 K036801 3 25,603 96 97
8 K037301 3 19,180 95 93
8 K037401 3 19,176 92 93
8 K037501 3 19,179 95 96
8 K037601 3 19,179 92 89
8 K037701 4 19,180 99 99
8 K037703 3 19,178 97 91
8 K038101 3 19,180 96 98
8 K038201 3 19,179 94 92
8 K038301 5 19,180 88 87
8 K093901 3 19,185 90 §
8 K095401 3 19,187 92 §
8 K092801 3 19,187 99 §
8 K093701 3 19,185 93 §
8 K097001 3 19,184 91 §
8 K094901 3 19,185 94 §
8 K045301 3 19,068 95 93
8 K045601 4 19,068 95 96
8 K045701 3 19,070 92 90
8 K045801 3 19,069 93 93
8 K046301 3 19,070 96 93
8 K046401 3 19,070 93 94
8 K046501 3 19,071 97 96
8 K046601 3 19,067 91 92
8 K046701 4 19,072 88 89
12 K049501 4 3,234 99 98
12 K049502 5 3,234 93 93
12 K049503 3 3,234 91 94
12 K049504 3 3,234 91 94
12 K049505 3 3,234 94 94
12 K049506 4 3,234 93 97
12 K040801 3 3,195 98 99
12 K040802 3 3,195 98 98
12 K040808 3 3,195 98 99
12 K040809 3 3,195 98 98
12 K040803 3 3,195 97 97
12 K040804 3 3,195 92 95
12 K040805 3 3,195 95 97
12 K040806 2 3,195 92 92
12 K049701 3 3,263 99 99
12 K049702 3 3,263 97 98
12 K049708 2 3,263 98 98
12 K049703 3 3,263 96 94
12 K049704 3 3,263 96 96
12 K049705 4 3,263 91 90
12 K049706 3 3,263 90 86
12 K049707 5 3,263 89 88
12 K105501 3 4,355 96 §
12 K105601 2 4,353 99 §
12 K106101 4 4,354 97 §
12 K105001 3 4,352 97 §
12 K104501 3 4,355 98 §
12 K104601 3 4,354 98 §
12 K041306 3 4,260 89 88
12 K041307 3 4,260 92 90
12 K041401 3 4,259 96 94
12 K041402 3 4,260 96 99
12 K041403 3 4,261 91 94
12 K041404 3 4,262 96 95
12 K041406 3 4,262 100 96
12 K049901 3 4,304 91 94
12 K049902 3 4,301 90 87
12 K049903 3 4,299 92 90
12 K049904 3 4,300 95 93
12 K049907 3 4,298 94 86
12 K049908 3 4,300 95 97
12 K049909 4 4,300 98 94
12 K049911 3 4,301 94 94
12 K049914 3 4,301 98 94
12 K049912 5 4,301 87 84
12 K098601 3 3,258 91 §
12 K092701 3 3,258 94 §
12 K090501 4 3,258 98 §
12 K094601 3 3,257 95 §
12 K092101 3 3,258 96 §
12 K090301 3 3,257 98 §
12 K051701 3 3,241 96 96
12 K051801 3 3,242 92 91
12 K052301 3 3,242 94 98
12 K052401 4 3,244 92 90
12 K052402 3 3,243 95 97
12 K052501 4 3,242 98 91
12 K052502 4 3,241 98 92
12 K052503 3 3,242 98 94
12 K047201 4 3,221 94 91
12 K047301 3 3,221 97 96
12 K047401 3 3,218 95 92
12 K047901 3 3,221 100 94
12 K048001 3 3,222 99 98
12 K048101 4 3,220 85 86
12 K048102 3 3,220 94 94
12 K048103 2 3,221 93 96
12 K048601 3 3,259 95 92
12 K048901 3 3,257 99 98
12 K049001 3 3,257 96 99
12 K049301 3 3,261 99 97
12 K049401 3 3,258 94 94
12 K049402 3 3,261 89 87
12 K049403 3 3,259 92 88
12 K049404 4 3,259 85 84
12 K052901 4 4,372 98 90
12 K053001 3 4,370 98 91
12 K053101 3 4,373 98 99
12 K053102 3 4,372 98 92
12 K053601 5 4,370 94 89
12 K053701 3 4,370 98 96
12 K053801 3 4,368 93 91
12 K053901 3 4,371 98 94
12 K054001 4 3,256 97 98
12 K054002 3 3,256 97 94
12 K054003 3 3,256 100 99
12 K054004 3 3,257 100 97
12 K054005 3 3,258 100 97
12 K054006 3 3,256 98 97
12 K054007 3 3,259 100 87
12 K054008 4 3,256 96 84
12 K100001 2 3,248 100 §
12 K100101 3 3,250 98 §
12 K100201 3 3,248 95 §
12 K100301 3 3,250 97 §
12 K096501 3 3,250 100 §
12 K092001 3 3,250 94 §
12 K059001 3 3,190 95 90
12 K059101 3 3,190 91 94
12 K059201 4 3,191 95 93
12 K059301 4 3,189 99 99
12 K059801 3 3,189 95 92
12 K059901 4 3,191 96 95
12 K060001 3 3,189 95 93
12 K060101 4 3,187 93 90
† Not applicable.
§ Item had not been created at the time of the assessment noted in this column
heading.
SOURCE: U.S. Department of Education, Institute of Education Sciences, National
Center for Education Statistics, National Assessment of Educational Progress
(NAEP), 2000 Science Assessment.
Scoring of the 2000 Science Assessment Large-Print Booklets

A subset of the items scored came from large-print booklets, which were
administered to students with disabilities who had met the criteria for participation with accommodations.
Since these booklets were non-scannable, they were transported to the scoring center after processing. A
log and score sheet were created to account for these booklets. As a rater scored an item, he or she
marked the score for that response, his or her rater ID, and the date on which the item was scored. Once
all items in each booklet for a given subject were scored, the science scoring director returned the sheets
to NAEP clerical staff to enter those scores manually into the records for these booklets.

In the 2000 assessment, there were 28 large-print science booklets.
Item-by-item rater reliability for items in large-print booklets, by grade,
science national main and state assessments: 2000
Grade   Item        Score points   Number scored (1st and 2nd)   2000 reliability   1996 reliability
Total † † 679,711 † †
4 S12S9A_02 3 26,250 98 98
4 S12S9A_03 3 26,254 98 98
4 S12S9A_04 3 26,251 98 99
4 S12S9A_06 3 26,253 92 96
4 S12S9A_07 4 26,256 91 93
4 S12S9A_08 3 26,251 95 94
4 S12S9A_09 3 26,250 97 97
4 S1S21_04 3 19,658 96 97
4 S1S21_05 5 19,656 92 91
4 S1S21_06 4 19,658 92 90
4 S1S21_09 3 19,660 92 89
4 S1S21_10 4 19,659 94 96
4 S1S21_11 3 19,659 98 98
8 W2S7_01 3 25,535 96 §
8 W2S7_04 4 25,536 87 §
8 W2S7_06 3 25,535 87 §
8 W2S7_11 3 25,536 95 §
8 W2S7_16 3 25,533 91 §
8 W2S7_20 3 25,533 97 §
8 S12S15B_04 3 19,180 95 93
8 S12S15B_05 3 19,176 92 93
8 S12S15B_06 3 19,179 95 96
8 S12S15B_07 3 19,179 92 89
8 S12S15B_08 4 19,180 99 99
8 S12S15B_09 3 19,178 97 91
8 S12S15B_14 3 19,180 96 98
8 S12S15B_15 3 19,179 94 92
8 S12S15B_16 5 19,180 88 87
12 W3S7_09 3 4,355 96 §
12 W3S7_10 2 4,353 99 §
12 W3S7_11 4 4,354 97 §
12 W3S7_14 3 4,352 97 §
12 W3S7_18 3 4,355 98 §
12 W3S7_19 3 4,354 98 §
12 S3S15_01 4 3,256 97 98
12 S3S15_02 3 3,256 97 94
12 S3S15_03 3 3,256 100 99
12 S3S15_04 3 3,257 100 97
12 S3S15_05 3 3,258 100 97
12 S3S15_06 3 3,256 98 97
12 S3S15_07 3 3,259 100 87
12 S3S15_08 4 3,256 96 84
† Not applicable.
§ Item had not been created at the time of the assessment noted in this column heading.
SOURCE: U.S. Department of Education, Institute of Education Sciences, National Center for
Education Statistics, National Assessment of Educational Progress (NAEP), 2000 Science
Assessment.
Scoring NAEP U.S. History Assessments

The NAEP U.S. history items that are not scored by machine are constructed-response items—those for
which the student must write in a response rather than selecting from a printed list of multiple choices.
Each constructed-response item has a unique scoring guide that identifies the range of possible scores
for the item. To measure longitudinal trends in U.S. history, NAEP requires trend scoring—replication of
scoring from prior assessment years—to demonstrate statistically that scoring is comparable across
years.

Students' constructed responses are scored on computer workstations using an image-based scoring
system. This allows for item-by-item scoring and online, real-time monitoring of U.S. history interrater
reliabilities, as well as the performance of each individual rater. A subset of these items—those that
appear in large-print booklets—require scoring by hand. The 2001 U.S. history assessment included 47
discrete constructed-response items. The total number of constructed responses scored was 399,182.

Scoring activities, U.S. history assessment: 2001


Number of Number of scoring
Scoring location Start date End date raters supervisors
Iowa City, Iowa 5/7/2001 5/25/2001 81 9
NOTE: U.S. history was not assessed in 2000.
SOURCE: U.S. Department of Education, Institute of Education Sciences, National
Center for Education Statistics, National Assessment of Educational Progress
(NAEP), 2001 U.S. History Assessment.

Each item's scoring guide identifies the range of possible scores for the item and defines the criteria to be
used in evaluating student responses. During the course of the project, each team scores the items using
a 2-, 3-, or 4-point scale as outlined below:

Dichotomous Items        Short Three-Point Items    Extended Four-Point Items
1 = Inappropriate        1 = Inappropriate          1 = Inappropriate
2 = Appropriate          2 = Partial                2 = Partial
                         3 = Appropriate            3 = Essential
                                                    4 = Complete
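The three scales above can be represented as simple lookup tables. The sketch below is illustrative only (the function and table names are ours, not part of the NAEP scoring system) and shows how an assigned score could be checked against its item's scale:

```python
# Score labels for each item type, as defined in the scoring guides above.
SCALES = {
    "dichotomous": {1: "Inappropriate", 2: "Appropriate"},
    "short": {1: "Inappropriate", 2: "Partial", 3: "Appropriate"},
    "extended": {1: "Inappropriate", 2: "Partial", 3: "Essential", 4: "Complete"},
}

def validate_score(item_type: str, score: int) -> str:
    """Return the label for a score, or raise if the score is off-scale."""
    scale = SCALES[item_type]
    if score not in scale:
        raise ValueError(f"Score {score} is not valid for a {item_type} item")
    return scale[score]
```

For example, `validate_score("extended", 4)` returns `"Complete"`, while a score of 4 on a short three-point item is rejected.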

In some cases, a student response does not fit any of the categories listed on the scoring guide. Such
responses are assigned special coding categories for unscorable responses. These categories are
assigned only if no aspect of the student's response can be scored. Scoring supervisors and/or
trainers are consulted before any of the special coding categories is assigned. The unscorable
categories used for U.S. history are outlined below.

Categories for unscorable responses, U.S. history assessments


Label Description
B Blank responses, random marks on paper
X Completely crossed out, completely erased
IL Completely illegible response
OT Off task, off topic, comments to the test makers, refusal to answer, "Who cares,"
language other than English (unless otherwise noted)
? "I don't know," "I can't do this," "No clue," "I don't understand," "I forget"
NOTE: Because the NAEP scoring contractor's database recognizes only alphanumeric characters and sets a
single-character field for the value for each score, the label "IL" appears in the database file as "I," the label
"OT" appears as "T," and the label "?" appears as "D."
SOURCE: U.S. Department of Education, Institute of Education Sciences, National Center for Education
Statistics, National Assessment of Educational Progress (NAEP), 2001 U.S. History Assessment.
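The single-character collapsing described in the note amounts to a small lookup table. A minimal illustrative sketch (the dictionary and function names are ours, not the scoring contractor's):

```python
# Unscorable-response labels and their single-character database codes,
# per the note above: "IL" -> "I", "OT" -> "T", "?" -> "D";
# "B" and "X" are already single characters and pass through unchanged.
DB_CODES = {"B": "B", "X": "X", "IL": "I", "OT": "T", "?": "D"}

def to_db_code(label: str) -> str:
    """Map an unscorable-response label to its database code."""
    return DB_CODES[label]
```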
Number of constructed-response items, by score-point level and grade,
U.S. history national main assessment: 2001
Grade    Total    Short 3-point items    Extended 4-point items
Total    38       35                     3
4        10       10                     0
4/8      5        5                      0
8        5        5                      0
8/12     4        4                      0
12       14       11                     3
SOURCE: U.S. Department of Education, Institute of
Education Sciences, National Center for Education
Statistics, National Assessment of Educational
Progress (NAEP), 2001 U.S. History Assessment.
Number of 1994 constructed-response items rescored in 2001, by score-point
level and grade, U.S. history national main assessment: 2001
Grade    Total    Dichotomous 2-point items    Short 3-point items    Extended 4-point items
Total    66       2                            47                     17
4        10       0                            8                      2
4/8      6        1                            4                      1
8        20       0                            16                     4
8/12     5        0                            4                      1
12       25       1                            15                     9
SOURCE: U.S. Department of Education, Institute of Education Sciences, National Center for
Education Statistics, National Assessment of Educational Progress (NAEP), 2001 U.S.
History Assessment.
U.S. History Interrater Reliability

A subsample of the U.S. history responses for each constructed-response item is scored by a second
rater to obtain statistics on interrater reliability. In general, items administered only to the national main
sample receive 25 percent second scoring. This reliability information is also used by the scoring
supervisor to monitor the capabilities of all raters and maintain uniformity of scoring across raters.
Reliability reports are generated on demand by the scoring supervisor, trainer, scoring director, or item
development subject area coordinator. Printed copies are reviewed daily by lead scoring staff. In addition
to the immediate feedback provided by the online reliability reports, each scoring supervisor can also
review the actual responses scored by a rater with the backreading tool. In this way, the scoring
supervisor can monitor each rater carefully and correct difficulties in scoring almost immediately with a
high degree of efficiency.
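The reliability statistic reported in the tables that follow is percent exact agreement between the first and second scorings of the subsample. A minimal sketch of that computation (illustrative only; the function name is ours):

```python
def percent_agreement(first: list, second: list) -> float:
    """Percent of responses for which the first and second raters
    assigned exactly the same score."""
    if len(first) != len(second) or not first:
        raise ValueError("Need two equal-length, non-empty score lists")
    matches = sum(a == b for a, b in zip(first, second))
    return 100.0 * matches / len(first)
```

For instance, `percent_agreement([3, 2, 1, 3], [3, 2, 2, 3])` yields 75.0, since the raters agreed on three of four responses.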

Interrater reliability ranges, by assessment year, U.S. history national main
assessment: 2001
                     Number of       Items between   Items between   Items between   Items above
Assessment           unique items    60% and 69%     70% and 79%     80% and 89%     90%
2001 U.S. history    47              2               16              16              13
1994 U.S. history    79              †               1               33              45
† The interrater reliability of all 1994 items rescored in 2001 exceeded 69 percent.
SOURCE: U.S. Department of Education, Institute of Education Sciences, National Center for Education
Statistics, National Assessment of Educational Progress (NAEP), 2001 U.S. History Assessment.
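The range counts in the table above can be reproduced by binning each item's reliability percentage. An illustrative sketch (the function and band labels are ours):

```python
def bin_reliabilities(reliabilities):
    """Count items per interrater reliability band, as in the ranges table."""
    bands = {"60-69": 0, "70-79": 0, "80-89": 0, "90+": 0}
    for r in reliabilities:
        if 60 <= r <= 69:
            bands["60-69"] += 1
        elif 70 <= r <= 79:
            bands["70-79"] += 1
        elif 80 <= r <= 89:
            bands["80-89"] += 1
        elif r >= 90:
            bands["90+"] += 1
    return bands
```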

During the scoring of an item or the scoring of a calibration set, scoring supervisors monitor progress
using an interrater reliability tool. This display tool functions in either of two modes:

• to display all first readings versus all second readings, or

• to display all readings by an individual rater that were also scored by another rater, versus
the scores assigned by the other raters.

The information is displayed as a matrix, with scores awarded during first readings in rows and
scores awarded during second readings in columns (for mode one), or the individual rater's scores
in rows and all other raters' scores in columns (for mode two). In this format, instances of exact agreement fall
along the diagonal of the matrix. For completeness, data in each cell of the matrix contain the number
and percentage of cases of agreement (or disagreement). The display also contains information on the
total number of second readings and the overall percentage of reliability on the item. Since the interrater
reliability reports are cumulative, a printed copy of the reliability of each item is made periodically and
compared to previously generated reports. Scoring staff members save printed copies of all final reliability
reports and archive them with the training sets.
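In mode one, the display amounts to a cross-tabulation of first-reading scores against second-reading scores, with a count and percentage in each cell and exact agreement on the diagonal. A sketch of how such a matrix could be tallied (illustrative only; this is not the contractor's actual tool):

```python
from collections import Counter

def agreement_matrix(first, second):
    """Cross-tabulate first-reading scores (rows) against second-reading
    scores (columns). Returns the cell counts/percentages and the overall
    percent exact agreement (the diagonal of the matrix)."""
    cells = Counter(zip(first, second))
    total = len(first)
    diagonal = sum(n for (row, col), n in cells.items() if row == col)
    # Each cell holds the number and percentage of second-scored responses.
    matrix = {cell: (n, 100.0 * n / total) for cell, n in cells.items()}
    return matrix, 100.0 * diagonal / total
```

For example, with first readings `[1, 2, 2]` and second readings `[1, 2, 3]`, two of the three pairs fall on the diagonal, giving an overall reliability of about 66.7 percent.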
Item-by-item rater reliability, by grade, U.S. history national main
assessment: 2001
Grade   Item        Score points   Number scored (1st and 2nd)   2001 reliability   1994 reliability
Total † † 447,714 † †
4 H028201 4 642 88 92
4 H028201 4 3,324 87 §
4 H028701 3 591 90 89
4 H028701 3 3,323 91 §
4 H028702 3 601 89 91
4 H028702 3 3,324 91 §
4 H028801 3 606 88 88
4 H028801 3 3,324 91 §
4 H029002 3 628 86 89
4 H029002 3 3,323 90 §
4 H054301 3 3,283 82 §
4 H054601 3 3,283 90 §
4 H054801 3 3,282 92 §
4 H054901 3 3,283 94 §
4 H055301 3 3,283 88 §
4 H055901 3 3,359 92 §
4 H056101 3 3,359 93 §
4 H056401 3 3,358 94 §
4 H056601 3 3,359 95 §
4 H056801 3 3,359 98 §
4 H031701 4 604 86 86
4 H031701 4 3,354 86 §
4 H031801 3 553 89 90
4 H031801 3 3,354 86 §
4 H031802 3 637 96 97
4 H031802 3 3,353 94 §
4 H032301 3 576 83 87
4 H032301 3 3,354 86 §
4 H032503 3 606 92 93
4 H032503 3 3,354 91 §
4 H057501 3 3,223 97 §
4 H057701 4 3,223 85 §
4 H057801 3 3,223 89 §
4 H058601 3 3,223 88 §
4 H058701 3 3,223 91 §
4 H034101 4 609 93 94
4 H034101 4 3,285 94 §
4 H034401 3 642 94 97
4 H034401 3 3,285 93 §
4 H034501 2 657 100 99
4 H034501 2 3,281 99 §
4 H034702 3 644 98 98
4 H034702 3 3,281 99 §
4 H035001 3 627 89 90
4 H035001 3 3,281 92 §
4 H035101 3 637 88 92
4 H035101 3 3,282 90 §
8 H035801 3 639 92 92
8 H035801 3 3,177 92 §
8 H035901 3 630 94 89
8 H035901 3 3,177 93 §
8 H035902 3 613 84 85
8 H035902 3 3,177 87 §
8 H036101 3 593 87 87
8 H036101 3 3,177 87 §
8 H036402 4 572 63 82
8 H036402 4 3,178 83 §
8 H059001 3 3,195 88 §
8 H059201 3 3,195 82 §
8 H059701 3 3,195 91 §
8 H059801 3 3,195 85 §
8 H060201 3 3,195 87 §
8 H038103 4 617 88 92
8 H038103 4 3,222 88 §
8 H038301 3 611 91 90
8 H038301 3 3,222 90 §
8 H038601 3 594 87 90
8 H038601 3 3,223 85 §
8 H038702 3 601 85 85
8 H038702 3 3,223 88 §
8 H039001 3 630 95 93
8 H039001 3 3,222 98 §
8 H039401 3 614 89 90
8 H039401 3 3,257 89 §
8 H039901 4 609 89 89
8 H039901 4 3,256 89 §
8 H040001 3 632 93 94
8 H040001 3 3,256 95 §
8 H040103 3 640 81 90
8 H040103 3 3,256 89 §
8 H040201 3 610 92 93
8 H040201 3 3,256 95 §
8 H057501 3 3,204 96 §
8 H057701 4 3,204 83 §
8 H057801 3 3,204 83 §
8 H058601 3 3,204 85 §
8 H058701 3 3,204 92 §
8 H034101 4 609 87 85
8 H034101 4 3,285 90 §
8 H034401 3 642 93 95
8 H034401 3 3,285 94 §
8 H034501 2 3,285 99 §
8 H034702 3 636 92 93
8 H034702 3 3,285 94 §
8 H035001 3 587 80 81
8 H035001 3 3,285 84 §
8 H035101 3 640 88 92
8 H035101 3 3,285 89 §
8 H060701 3 3,156 96 §
8 H061501 3 3,156 93 §
8 H061601 3 3,156 92 §
8 H061801 3 3,156 98 §
8 H042201 3 643 99 96
8 H042201 3 3,209 97 §
8 H042801 4 615 89 86
8 H042801 4 3,209 92 §
8 H042902 3 622 94 91
8 H042902 3 3,209 94 §
8 H043001 3 607 85 92
8 H043001 3 3,209 89 §
8 H043101 3 608 94 89
8 H043101 3 3,209 94 §
8 H043201 3 615 87 90
8 H043201 3 3,101 88 §
8 H043401 3 612 89 89
8 H043401 3 3,101 86 §
8 H043501 4 616 81 92
8 H043501 4 3,101 87 §
8 H043601 3 601 86 86
8 H043601 3 3,101 83 §
8 H043701 3 606 90 90
8 H043701 3 3,101 89 §
8 H043705 3 582 78 81
8 H043705 3 3,101 83 §
8 H044001 4 602 79 83
8 H044001 4 3,101 77 §
12 H044301 4 620 70 87
12 H044301 4 3,071 78 §
12 H044501 4 617 84 92
12 H044501 4 3,071 85 §
12 H044702 3 601 91 87
12 H044702 3 3,071 93 §
12 H045102 3 627 86 92
12 H045102 3 3,071 92 §
12 H045301 3 634 89 97
12 H045301 3 3,071 93 §
12 H045501 4 560 61 78
12 H045501 4 3,027 78 §
12 H045901 4 3,026 80 §
12 H046001 2 650 100 99
12 H046001 2 3,026 99 §
12 H046101 3 621 96 96
12 H046101 3 3,026 95 §
12 H046301 3 600 88 90
12 H046301 3 3,026 91 §
12 H062001 3 3,100 79 §
12 H062201 3 3,100 88 §
12 H063101 3 3,100 87 §
12 H063401 4 3,100 84 §
12 H063601 3 3,100 86 §
12 H048901 3 645 97 99
12 H048901 3 2,996 99 §
12 H049401 4 639 84 88
12 H049401 4 2,996 86 §
12 H049503 3 620 85 92
12 H049503 3 2,996 86 §
12 H049601 4 596 84 88
12 H049601 4 2,997 86 §
12 H049701 3 648 99 98
12 H049701 3 2,996 98 §
12 H050101 4 592 79 86
12 H050101 4 3,063 80 §
12 H050201 4 590 76 82
12 H050201 4 3,063 78 §
12 H051002 3 639 97 98
12 H051002 3 3,063 98 §
12 H051101 3 601 84 90
12 H051101 3 3,063 87 §
12 H051102 3 558 78 81
12 H051102 3 3,064 81 §
12 H051301 3 590 79 83
12 H051301 3 3,055 85 §
12 H052301 4 628 82 92
12 H052301 4 3,055 71 §
12 H052501 3 626 87 93
12 H052501 3 3,055 92 §
12 H052601 3 607 81 85
12 H052601 3 3,055 86 §
12 H052701 3 614 87 88
12 H052701 3 3,055 88 §
12 H060701 3 3,038 94 §
12 H061501 3 3,037 82 §
12 H061601 3 3,038 91 §
12 H061801 3 3,037 96 §
12 H042201 3 632 97 96
12 H042201 3 3,045 96 §
12 H042801 4 598 89 88
12 H042801 4 3,045 91 §
12 H042902 3 606 91 87
12 H042902 3 3,045 93 §
12 H043001 3 618 81 87
12 H043001 3 3,045 86 §
12 H043101 3 610 96 91
12 H043101 3 3,045 93 §
12 H063801 3 3,054 80 §
12 H064001 3 3,054 84 §
12 H064101 3 3,054 81 §
12 H064401 4 3,053 79 §
12 H064901 3 3,054 81 §
12 H065101 3 3,054 79 §
12 H065201 3 3,054 76 §
12 H065301 3 3,054 81 §
12 H065401 4 3,054 88 §
† Not applicable.
§ Item had not been created in 1994.
SOURCE: U.S. Department of Education, Institute of Education Sciences, National
Center for Education Statistics, National Assessment of Educational Progress
(NAEP), 2001 U.S. History Assessment.
Scoring of the 2001 U.S. History Assessment Large-Print Booklets

A subset of the items scored came from large-print booklets, which were
administered to students with disabilities who had met the criteria for participation with accommodations.
Since these booklets were non-scannable, they were transported to the scoring center after processing. A
log and score sheet were created to account for these booklets. As a rater scored an item, he or she
marked the score for that response, his or her rater ID, and the date on which the item was scored. Once
all items in each booklet for a given subject were scored, the U.S. history scoring director returned the
sheets to NAEP clerical staff to enter those scores manually into the records for these booklets.

In the 2001 assessment, there were eight large-print U.S. history booklets.
Item-by-item rater reliability for items in large-print booklets, by grade,
U.S. history national main assessment: 2001
Grade   Item        Score points   Number scored (1st and 2nd)   2001 reliability   1994 reliability
Total † † 111,515 † †
4 X1H5_03 3 3,359 92 §
4 X1H5_05 3 3,359 93 §
4 X1H5_08 3 3,358 94 §
4 X1H5_10 3 3,359 95 §
4 X1H5_12 3 3,359 98 §
4 Q1H602 4 604 86 86
4 Q1H602 4 3,354 86 §
4 Q1H603 3 553 89 90
4 Q1H603 3 3,354 86 §
4 Q1H604 3 637 96 97
4 Q1H604 3 3,353 94 §
4 Q1H611 3 576 83 87
4 Q1H611 3 3,354 86 §
4 Q1H615 3 606 92 93
4 Q1H615 3 3,354 91 §
8 Q2H505 4 617 88 92
8 Q2H505 4 3,222 88 §
8 Q2H508 3 611 91 90
8 Q2H508 3 3,222 90 §
8 Q2H511 3 594 87 90
8 Q2H511 3 3,223 85 §
8 Q2H513 3 601 85 85
8 Q2H513 3 3,223 88 §
8 Q2H517 3 630 95 93
8 Q2H517 3 3,222 98 §
8 Q2H604 3 614 89 90
8 Q2H604 3 3,257 89 §
8 Q2H609 4 609 89 89
8 Q2H609 4 3,256 89 §
8 Q2H610 3 632 93 94
8 Q2H610 3 3,256 95 §
8 Q2H613 3 640 81 90
8 Q2H613 3 3,256 89 §
8 Q2H614 3 610 92 93
8 Q2H614 3 3,256 95 §
12 Q3H604 3 645 97 99
12 Q3H604 3 2,996 99 §
12 Q3H610 4 639 84 88
12 Q3H610 4 2,996 86 §
12 Q3H613 3 620 85 92
12 Q3H613 3 2,996 86 §
12 Q3H614 4 596 84 88
12 Q3H614 4 2,997 86 §
12 Q3H615 3 648 99 98
12 Q3H615 3 2,996 98 §
12 Q3H702 4 592 79 86
12 Q3H702 4 3,063 80 §
12 Q3H703 4 590 76 82
12 Q3H703 4 3,063 78 §
12 Q3H714 3 639 97 98
12 Q3H714 3 3,063 98 §
12 Q3H715 3 601 84 90
12 Q3H715 3 3,063 87 §
12 Q3H716 3 558 78 81
12 Q3H716 3 3,064 81 §
† Not applicable.
§ Item had not been created in 1994.
SOURCE: U.S. Department of Education, Institute of Education Sciences, National
Center for Education Statistics, National Assessment of Educational Progress
(NAEP), 2001 U.S. History Assessment.