Language Testing 1990 7: 31
DOI: 10.1177/026553229000700104
http://ltj.sagepub.com/content/7/1/31

Expertise in evaluating second language compositions

Alister Cumming, University of British Columbia

Address for correspondence: Dr Alister Cumming, Department of Language Education, University of British Columbia, 2125 Main Mall, Vancouver, B.C., Canada V6T 1Z5.

This study (1) assesses whether raters implicitly distinguish students’ writing
expertise and second language proficiency while evaluating ESL compositions
holistically and (2) seeks to describe the decision-making behaviours used by
experienced and inexperienced raters in this process. The performance of 7
novice and 6 expert ESL teachers was assessed while they evaluated 12
compositions written by adult students with differing levels of ESL proficiency
(intermediate and advanced) and writing expertise (average and professionally
experienced writers) in their mother tongues. Multivariate analyses of rating
scores indicated that both groups’ evaluations distinguished students’ second
language proficiency and writing skills as separate, non-interacting factors.
Descriptive analyses of the raters’ concurrent verbal reports revealed 28
common decision-making behaviours, many of which varied significantly in
use between novice and expert groups. Implications are discussed concerning biases in holistic
evaluations of second language compositions, aspects of expertise in this skill,
and potential uses of this research for the training of composition raters and
student-teachers.

I Introduction
Holistic and analytic methods for evaluating compositions have
gained wide acceptance in second language testing and teaching
practices (Canale, 1981; Carroll, 1980; Jacobs, Zinkgraf, Wormuth,
Hartfiel, and Hughey, 1981; Perkins, 1983). But understanding of
these evaluation methods has, in two respects, remained limited.
Firstly, it is not certain that students’ language proficiency can
logically be distinguished from their writing skills, since both factors
necessarily interact in the processes (i.e., for the student) and
products (i.e., for the evaluator) of composition writing in a second
language. For instance, Cumming (1989) found both factors had
significant, but separate, effects on analytic rating scores in three ESL
composition tasks. This suggests such evaluation procedures may
contain (1) a hidden bias toward more literate learners, if used as
language tests, or (2) a hidden bias against language learners, if used
as composition tests in settings where they have not mastered the
majority language. This view is supported by theories that academic
performance in second languages involves different cognitive and
linguistic skills (Cummins and Swain, 1986) and that composition
writing is a specialized expertise which students develop to varying
extents (Bereiter and Scardamalia, 1987).
Secondly, very little is known about what the skill of evaluating
compositions written in a second language entails, or how it develops.
This is problematic for the training of raters for composition tests
(Cohen, in press; Homburg, 1984), for defining the appropriate
criteria to apply to such assessments (Land and Whitley, 1989;
Mendelsohn and Cumming, 1987; Santos, 1988; Weir, 1988), as well
as for knowing the kinds of evaluation practices to promote in class-
room instruction or initial teacher education (Cumming, 1985;
Zamel, 1985). It is not clear what knowledge and thinking processes
are involved in this specialized kind of decision-making. For this
reason, methods for evaluating ESL writing have had to be validated
indirectly, i.e., for their reliability, for example by assessing the
consistency of rating scores, or by establishing correspondences to
other measures of language proficiency (e.g., Jacobs, Zinkgraf,
Wormuth, Hartfiel, and Hughey 1981; Wesche, Canale, Cray, Jones,
Mendelsohn, Tumpane and Tyacke, 1987). Direct validation of the
judgement processes used in these assessment methods has not been
possible because there is insufficient knowledge about the decision-
making or criteria which raters or teachers actually use to perform
such evaluations (Quellmalz, 1980; Purves, 1984).
Empirical accounts of the judgement processes used to evaluate
second language compositions are sketchy at best. Homburg (1984:
102) proposes that trained readers of ESL compositions may use a
common ’funnel model’ to guide their judgements, a process whereby
evaluators ’grossly characterize ESL compositions on the basis of one
feature and then further categorize on the basis of yet other features.’
However, this conception differs from the more interactive kind of
decision-making reported in an exploratory study by Cumming
(1985). In this study, three experienced ESL teachers performed
similar, complex reading behaviours while evaluating one ESL
composition, despite their uses of different feedback techniques.
Janopoulos (1989) also likens the process of reading ESL composi-
tions to a specialized comprehension task, assuming the mental
processing involved could vary with the quality of the writing which is
read.
The present study addressed these two issues, seeking (1) to

establish whether raters implicitly distinguish students’ writing
expertise and second language proficiency while evaluating ESL
compositions holistically; and (2) to describe the thinking behaviours
used by experienced and inexperienced raters to evaluate a range of
ESL compositions. Two research designs were used within the same
context. The compositions to be rated were organized in a repeated
measures 2 x 2 factorial design, juxtaposing texts written by 12 ESL
students with greater and lesser writing expertise in their mother
tongues and proficiency in their second language. Raters were
selected to form a novice-expert design, contrasting the performance
of six teachers with considerable expertise in this domain against that
of seven student-teachers preparing to be ESL instructors. Both
groups produced concurrent think-aloud reports to provide
indicators of their decision-making behaviours while evaluating the
compositions.

II Approach
Twelve compositions were selected from a pool of 147 composition
examinations administered as a placement test for ESL classes at a
Canadian university. The compositions were selected to represent:
two levels of ESL proficiency (intermediate and advanced); two levels
of writing expertise in students’ mother tongue (average student
writers and those with professional experience writing in their work);
and thirdly the writing of students from different language and
cultural backgrounds (so as to counterbalance this effect on the
ratings).
TOEFL scores and interview assessments, collected in the pro-
cess of placement testing, were used to establish ESL proficiency
groupings. The intermediate group had TOEFL scores between 387
and 457, and had been placed in pre-university ESL classes. The
advanced group’s scores ranged from 537 to 627; they had gained
admission to academic programmes at the university, though they
had opted to take one ESL course. Data on students’ expertise in
writing in their mother tongues were collected using a self-report
instrument (validated in several earlier studies, see Cumming, 1989)
asking for self-assessment of abilities to write in various situations, as
well as self-reports of professional experience writing. Compositions
were selected from extreme ends of the 4-point scale in this instru-
ment. Only students reporting professional experience writing were
included in the higher level; none reported being published authors.
For the 12 compositions chosen, TOEFL scores correlated with
holistic interview ratings (r = .8, p < .001), but not with students’
self-ratings of writing expertise (r = .04). The number of words
produced in the compositions was considered as an additional
criterion (cf. Homburg 1984), but this turned out not to correlate with
any of the other measures (r < .1). Appendix A cites features of the
compositions selected and their authors’ backgrounds, showing how
these form four cells in the study’s factorial design.
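
To make the selection design concrete, the following is a minimal sketch in Python of the 2 x 2 cell structure and the correlation checks reported above; all scores below are invented placeholders (not the study's data), and the study itself used SPSSx rather than any programming language.

# Sketch of the 2 x 2 selection design and the correlation checks described
# above. All scores below are invented placeholders, not the study's data.
from itertools import product
from scipy.stats import pearsonr

# Each of the 12 compositions falls into one of four cells:
# ESL proficiency (intermediate/advanced) x writing expertise (average/professional).
cells = list(product(["intermediate", "advanced"], ["average", "professional"]))
print(cells)  # four cells; three compositions per cell in the study

# Hypothetical measures for the 12 students: TOEFL score, holistic interview
# rating, and self-rated writing expertise on the 4-point instrument.
toefl = [390, 410, 427, 447, 455, 457, 537, 550, 563, 590, 610, 627]
interview = [1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4]
self_rated = [1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4]

# The study reports TOEFL correlating with interview ratings (r = .8, p < .001)
# but not with self-rated writing expertise (r = .04).
r_int, p_int = pearsonr(toefl, interview)
r_exp, _ = pearsonr(toefl, self_rated)
print(f"TOEFL vs interview rating: r = {r_int:.2f}, p = {p_int:.3f}")
print(f"TOEFL vs self-rated expertise: r = {r_exp:.2f}")
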
Two groups of volunteer teacher-raters were solicited. The novice
group responded to a letter distributed to three introductory classes in
TESL methodology. Seven student-teachers volunteered, were pre-
screened to ensure they had no prior teaching experience, and were
offered feedback on their performance in return for participating in
the study. The expert teachers were nominated by the coordinator of
the ESL programme at the same university, who proposed the names
of teachers with more than seven years’ experience specializing in ESL
composition instruction. The eight individuals nominated were sent
letters asking if they wished to volunteer their time in return for a
small stipend, and six agreed. All volunteers met individually with the
researcher, at their convenience, for the data collection (at a point
when the student-teachers were nearing the end of their TESL
methodology course).
The raters initially received thinking-aloud demonstrations and
training with arithmetic problems, following procedures in Ericsson
and Simon (1984: 375-79). They were then given written instructions
to rate the 12 compositions on a scale from 1 to 4 for their effective-
ness of ’language use’, ’rhetorical organization’, and ’substantive
content’. No definitions of these terms were given, it being said that a
purpose of the study was to find out how people defined them them-
selves. The raters were asked to ’do the task as if you are evaluating
the compositions for an ESL proficiency test’. They were not
informed that the compositions had been selected to conform to any
particular criteria. No efforts were made to simulate a teacher-student
relation between raters and authors of the compositions, since
previous research had indicated this might induce the variable rating
behaviours typical of pedagogical, rather than assessment, situations
(Cumming, 1985; Zamel, 1985). The original prompt for the
composition was specified in the instructions, ’Discuss the effects
your studies in Canada may have on your future.’ The compositions
were given to each participant in a different random order, so as to
counterbalance any effects for order of presentation. Raters were
offered the option of rating the texts alone or in the company of the
researcher. Only two chose to perform in the company of the
researcher; both were ’novices’ who wished to have feedback
immediately after their performance. People took from 30 to 120
minutes to complete the task (though later analyses showed this time
factor was not related to teaching experience).

Repeated measures MANOVAs (with SPSSx version 3) were used
to analyse the rating scores, treating the scores for each group of
raters as the dependent variable and the factors of writing expertise
and second language proficiency (as represented in the factorial
organization of the compositions) as independent variables. The
think-aloud protocols were transcribed in full by the researcher, then
reviewed impressionistically in conjunction with a second rater (one
of the expert subjects) to develop the coding scheme described below.
After piloting and refining the coding scheme on samples of the data,
a third independent coder blind-coded all of the data. Random
samples of 10% of the data (from each of the 12 compositions) were
later checked for intercoder reliability with the researcher (87% agree-
ment on 285 coding decisions) and intra-coder reliability (90% agree-
ment on 269 coding decisions). Data from one experienced teacher
(Expert 5), who did not provide a full concurrent think-aloud report,
were excluded from these analyses. Codings of the transcribed data
were tallied for each rater, means were calculated for the novice and
expert groups, then t-tests were used to assess differences in the mean
quantity of decisions between the two groups (after it was established
that neither the quantity of words nor decisions per protocol varied
significantly between the two groups).
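
As a rough illustration of the procedure just described, here is a short Python sketch of the tallying, percent-agreement, and group-comparison steps; the behaviour labels and counts are invented for illustration, and the original analyses were run with SPSSx rather than this code.

# Illustrative sketch of the protocol-analysis steps: tally coded behaviours
# per rater, compute intercoder percent agreement on a checked sample, and
# compare mean frequencies between groups with a t-test.
# All labels and counts are invented placeholders, not the study's data.
from collections import Counter
from scipy.stats import ttest_ind

# Hypothetical codings (one behaviour label per protocol segment) for one rater.
codings = ["edit phrases", "classify errors", "edit phrases", "assess relevance"]
print(Counter(codings))  # per-rater tally of each behaviour

# Intercoder reliability as simple percent agreement on a re-coded sample.
coder_a = ["edit phrases", "classify errors", "edit phrases", "assess relevance"]
coder_b = ["edit phrases", "classify errors", "summarize propositions", "assess relevance"]
agreement = sum(a == b for a, b in zip(coder_a, coder_b)) / len(coder_a)
print(f"percent agreement: {agreement:.0%}")

# Group comparison: mean frequency of one behaviour per rater,
# 7 novices against the 5 experts with full think-aloud protocols.
novice_counts = [30, 25, 0, 0, 40, 28, 22]
expert_counts = [5, 8, 2, 10, 6]
t_stat, p_value = ttest_ind(novice_counts, expert_counts, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")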

III Findings
1 Second language proficiency and writing expertise
MANOVA results indicated that both groups of raters implicitly
distinguished students’ ESL proficiency and writing expertise as
separate factors in their rating of the compositions. The ratings of the
expert group showed significant main effects for ESL proficiency
(F = 28.1, p < .0001) and writing expertise (F = 4.4, p < .04). Similar
results appeared for the novice group as main effects for ESL
proficiency (F = 12.5, p < .001) and writing expertise (F = 10.4,
p < .002). Interestingly, there were no interaction effects between the
main factors for either group, suggesting that both groups of raters
implicitly treated students’ ESL proficiency and writing expertise as
separate, distinct factors in their evaluations. As Figure 1 shows, the
ratings for both groups tended to increase consistently (about 1 point
on the 4-point scale) according to students’ levels of ESL proficiency
or writing expertise.

Figure 1 Group means for ratings of compositions.

Univariate tests indicated that the ratings for the novice and expert
groups were, however, significantly different from each other: for
ratings of ’content’ (F = 13.5, p < .0003) and ’rhetorical organiza-
tion’ (F = 13.6, p < .0003), but not for ratings of ’language use’
(F = 2.8, n.s.). As Figure 1 shows, the student-teachers consistently
tended to rate these aspects of the compositions higher than the
experienced teachers did. The novice group’s ratings for ’language
use’ were, overall, significantly different (F = 4.8, p < .03) from their
ratings for the other two aspects of the compositions. The expert
group showed no significant differences between their overall ratings
for the three categories. Other univariate tests did not reveal any
significant differences between groups or interactions between
factors.
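
For readers unfamiliar with the design, the short Python sketch below shows how main effects and the interaction can be read from a 2 x 2 table of mean ratings; the cell means are invented for illustration (the actual analyses were repeated measures MANOVAs, and Figure 1 reports the real group means).

# Toy 2 x 2 table of mean ratings: rows are ESL proficiency levels,
# columns are writing-expertise levels. Values are invented for illustration.
import numpy as np

means = np.array([
    [1.8, 2.8],   # intermediate ESL: average vs. professionally experienced writers
    [2.9, 3.9],   # advanced ESL:     average vs. professionally experienced writers
])

# Main effect of ESL proficiency: difference between row means.
proficiency_effect = means[1].mean() - means[0].mean()
# Main effect of writing expertise: difference between column means.
expertise_effect = means[:, 1].mean() - means[:, 0].mean()
# Interaction contrast: does the expertise effect differ across proficiency levels?
interaction = (means[1, 1] - means[1, 0]) - (means[0, 1] - means[0, 0])

print(round(proficiency_effect, 2))  # about 1 point, as with the real data in Figure 1
print(round(expertise_effect, 2))    # about 1 point
print(round(interaction, 2))         # 0.0: the two effects are additive, no interaction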

2 Decision-making behaviours
Impressionistic analyses of the raters’ verbal reports identified 28
decision-making behaviours used to interpret and evaluate the student
compositions. Three criteria were used to distinguish these: if they
represented logically relevant and distinct cognitive behaviours; if
they occurred with sufficient frequency (i.e., were reported by a
majority of the raters); and if they could be coded with sufficient
accuracy (80% agreement) in the initial trials of the coding scheme.
Figure 2 displays these behaviours as two kinds of strategies:
interpretation strategies used to read the texts (items 1 to 7) or judge-
ment strategies used to evaluate qualities of the texts (items 8 to 28).
These, in turn, have four kinds of focus: on the raters’ self-control of
their own reading or judgement processes (items 1, 2, and 8-13); on
the substantive content of the texts (items 3, 4, and 14-18); on the uses
of language in the texts (items 5, 6, and 19-24); or on the rhetorical
organization of the texts (items 7 and 25-28). Though these
behaviours are logically distinguishable, they frequently occurred in
conjunction with one another in the verbal report data, and were
coded as such. Appendix B provides segments of the data exempli-
fying each of the 28 behaviours.
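
Because Figure 2 itself is not reproduced in this text version, the following small Python sketch restates its classification: the item numbers grouped by strategy type and focus, exactly as listed in the paragraph above.

# The 28 decision-making behaviours as grouped in Figure 2: two strategy
# types (interpretation vs. judgement) crossed with four foci.
coding_scheme = {
    "interpretation": {                      # strategies used to read the texts
        "self-control": [1, 2],
        "content": [3, 4],
        "language use": [5, 6],
        "rhetorical organization": [7],
    },
    "judgement": {                           # strategies used to evaluate the texts
        "self-control": list(range(8, 14)),              # items 8-13
        "content": list(range(14, 19)),                  # items 14-18
        "language use": list(range(19, 25)),             # items 19-24
        "rhetorical organization": list(range(25, 29)),  # items 25-28
    },
}

# Sanity check: the scheme covers items 1 to 28 exactly once.
items = sorted(i for foci in coding_scheme.values() for nums in foci.values() for i in nums)
assert items == list(range(1, 29))
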
Figure 3 reports the mean frequencies of these behaviours for the
novice and expert groups over the 12 compositions, standard
deviations for each group, as well as results of t-tests to assess differ-
ences between the groups. Only two behaviours, ’editing phrases’ and
’classifying errors’, accounted for large proportions of the total
number of behaviours reported. All other behaviours accounted, on
average, for less than 10% of the total in the data reported. Many
behaviours show high standard deviations because of variation from
individual to individual. These tendencies suggest that the thinking
processes involved in evaluating ESL compositions consist in many
discrete decisions, which are relatively variable from person to
person. As noted above, t-tests showed that, overall, differences in
the number of decisions between the groups were not statistically signi-
ficant. Since both novices and experts tended to make a comparable
number of decisions, it can be inferred that differences in their
behaviour were predominantly qualitative. That is, the experts did
not make significantly more decisions than the novices, just qualita-
tively different ones.

Figure 3 Group means, standard deviations, and t-tests.

Differences between the novice and expert groups which proved to
be statistically significant related to the raters’ strategies for con-
trolling their own evaluation behaviours, as well as their attention to
specific aspects of the content, syntax, or rhetorical organization of
the compositions. In terms of controlling their own evaluation
behaviours, expert raters reported more self-reflexive behaviours.
They tended more frequently: to envision the situation of the writer of
the composition (usually as an ESL student with particular learn-
ing needs); to direct their own reading processes toward attending to
key criteria in the texts; to reflect on how they, themselves, were dis-
tinguishing the categories of language use, content, and rhetor-
ical organization; and to summarize their own rating judgement
collectively.
In evaluating the substantive content of the compositions, expert
raters tended more frequently: to count the number of main ideas in
each composition (often doing this several times) in order to assess
the students’ total written output; to consider carefully how well
the particular topics in each composition were developed; and
to assess the quality of ’interest’ conveyed by the compositions. While
assessing language use in the compositions, expert raters also tended
more frequently to establish a general impression of the ESL students’
command of syntax in English, reviewing phrases in the compositions
to look for key examples of this, such as verb forms or clause struc-
tures. The expert raters also devoted more attention to discerning the
rhetorical structures of the students’ compositions, first while reading
them, then later while judging their coherence or identifying
unnecessary repetitions.
About 30% (83.1) of the behaviours of the novice group involved
’editing phrases’. In contrast, for the expert group, about 19% (72.6)
of their behaviours involved ’classifying errors’. Though this was the
most pronounced quantitative difference between the two groups, the
frequency of these behaviours varied (see standard deviations in
Figure 3) substantially from individual to individual within groups,
and thus did not reach statistical significance in the t-tests. For
instance, two of the novice raters did not edit phrases at all. Among
the other five novices, though, this was a predominant, highly visible
strategy applied throughout their evaluations. As one student-teacher
put it, this behaviour formed a naive way of making sense of the ESL
writing: ’Well, I would just go here and start inserting verbs that I
think might be appropriate. I don’t know, I don’t know. I don’t really
understand it without doing that.’ Experts, in contrast, tended to
categorize the errors they encountered in the students’ texts, more as a
way of gathering information to judge the quality of language use in
the compositions.
These strategy differences are evident in the following extracts
from the beginning parts of protocols evaluating the same composi-
tions. (Slash bars indicate pauses of three seconds or more.) The two
novices ’correct’ many of the same conspicuous errors while they
read. In contrast, the two experts abstract their thinking away from
the surface features of the texts, categorizing key items to form over-
all judgements, while attending concurrently to content, rhetorical,
and language aspects of the writing:
Novice 10: Okay the best one is number 61./Okay, so, they say why they want
to learn English./Okay, this ’if’ is wrong, ’because’ I want to learn English.
And then it’s ’I’ve come to Canada’, not ’I’m going’. So not only is it the
wrong verb, but, um, um, the wrong tense also. And then this ’if ’ is completely
wrong./‘Produces’. There’s no pluralization anywhere here./’I liked my
work’. There should be some agreement./There’s a spelling mistake here. And
more importantly, there’s this ’ed’. The past tense morpheme is wrong./Okay,
so, what this sounds to me, at first, is that these friends speak English. Ah,
maybe./So this means their friends are native speakers of English or some-
thing. Anyway, ’many of them are speaking English’. So ’many of them speak
English’./‘Than at this moment’, that’s just a minor point./‘Interested’, so
that’s ’interesting’./So this is understandable, but there are quite a few little
mistakes./‘People’, not ’peoples’, no apostrophe./No apostrophe, and plural.
Novice 11: Okay, the next one, number 51./I think this person is confusing the
meaning of ’if’. He or she wants to say ‘because’./‘Produces’ instead of
’produce’./I guess, pluralize these items./’Liked’, change the verb
there./‘Practice’, not ’practiced’. Okay./I would change ’in this moment’ to
‘now’. She could probably tighten that sentence quite a bit./’Whom I hope I
can understand better than now’, ’better than I can now’./‘It’s interested’,
wrong word form. It should be ‘interesting’./‘People’, not ’people’s’.

Expert 1: Now I’m moving to paper number 61, which is presented in two
paragraphs, about half as long as the longest one so far. Well, a bit longer than
the one I just saw, but not by much./A wonderful job./We’ve got verb tense
problems. We’ve got conditional problems. Fairly basic stuff here./We go to
another paragraph here. Why? /This student has a strong sense of sentence. We
don’t have a lot of run-ons and fragments. Ah, the sentences are very simple. A
lot of, um, simple problems, verbs, conditionals, spelling. This student hasn’t
had very much English language training. She’s probably quite good orally.
But, ah, she hasn’t had a basic grounding in writing, nor in English structure.
Um, the person probably isn’t a very good writer. We have a very kind of oral
kind of writing./She gives us her background. The only reason really, the
student is on topic in that she may be able to get a better job back in Germany,
but, ah, she doesn’t really look at the effect she predicts her studies to have. She
presents us with her history.
Expert 4: Okay, 61. A shortie. Two paragraphs. Here we go./Okay, now, she
understands rhetorical paragraphing. But she doesn’t have a real conclusion
here. Ah, an introduction, that’s basically what? A paragraph, one line, what
she did, ah, in Germany, and why she wants to learn English. And, ah, kind of
muddled that second paragraph./Another word form. ’Quiet’ instead of
’quite’./Now, we’ve got several ideas here, but not well developed. So I think I
can only give her a 2 for rhetorical patterns. Content? A lot of different ideas
that are very nice. But the organization doesn’t allow you to see them. I’m not
sure that those ideas are there. She’s talking about how nice the city is. She’s

talking about money. Ah, she has a lot of friends who speak English. And
English is important. Ah, this is a poor pattern.
Two novices, however, avoided editing strategies altogether. Novice
7, for instance, focused almost exclusively on comprehending the
ideas communicated in the texts, rejecting analyses of the language
features of the compositions for the reason that that would be too
’technical’: ’That is really difficult, this whole business of commu-
nicating. If they get their ideas across, I guess, um, I mean that’s the
primary thing, I would say, if I can understand what they’re trying to
get across. And the rest of it seems very artificial to me, the technical
aspect of it.’ However, as she progressed with the evaluations, she
found she had avoided making distinctions necessary for assessment:
‘I can see my tendency is to give them a similar number straight across
the board. I haven’t made much discrimination between each of, ah,
the language use, content, and rhetorical organization. The fine
tuning in that area, I think, would drive me bananas.’

IV Discussion
The results of the first analysis in this study confirm Cumming’s
(1989) finding that analytic evaluations of ESL compositions assess
students’ language proficiency and writing expertise concurrently,
implicitly attributing separate values to each factor. Though raters
are probably not aware of this distinction while performing such
evaluations, analyses of their rating scores reveal a significant
tendency to produce ratings which conform closely to students’ skills
in both of these areas (as determined by external, objective criteria,
i.e., standard tests, ratings of spoken proficiency, self ratings of
working skills and professional experience). In the present study, this
tendency seems to have influenced raters’ assessments of diverse
aspects of written texts, not just language use, content, or rhetorical
organization alone. Likewise, it occurred for student-teachers as for
teachers with considerable expertise in this area. None of the raters
appeared to be aware of these distinctions in their behaviours, which
were revealed by post-hoc analyses of their performance.
An implication is that such analytic evaluation methods may not be
appropriate as tests of either second language proficiency or writing
skill exclusively, as occurs in many situations of evaluation practice.
Both skills are being evaluated in conjunction and are not logically
distinguished from each other within evaluators’ ratings. This may
disadvantage two groups of students in different ways in evaluation
practices. For minority language students, such as ESL learners in
English-dominant settings, analytic evaluations of their written
compositions may be biased against them because of their limited
language proficiency. Conversely, for unskilled writers in second
language programmes, the same kinds of evaluation may be biased
against them because of their lack of expertise in composition, a skill
which appears to be different from second language proficiency
(Cumming, 1989; Cumming, Rebuffot and Ledwell, 1989; see also
Cummins and Swain, 1986). For
students to be rated as highly ’effective’ in such assessment, it would
seem that they require high levels of both second language proficiency
and writing expertise.
Additional implications of this analysis concern the relative merits
of analytic or holistic rating. For testing purposes, analytic ratings of
different aspects of ESL compositions may not be necessary, if raters
(like the present expert and novice teachers) tend overall not to vary
their ratings appreciably across different analytic categories. A single
holistic rating of compositions may be less time-consuming and
equally reliable. Nonetheless, analytic scales may have the advantage
of drawing raters’ attention to specific aspects of students’ composi-
tions, as well as appropriate evaluation strategies and criteria. The
performance of student-teachers in the present study suggests that
novices in this domain probably need such explicit guidance to direct
their decision-making while evaluating compositions. Likewise, the
decision-making processes of expert teachers appear to vary
substantially from person to person in the absence of pre-specified
criteria and procedures for evaluation.
The second analysis in this study was able to describe, in a
preliminary way, the decision-making behaviours which raters
perform mentally while evaluating ESL compositions. These descrip-
tions are, of course, tentative; they require much further study before
a full model of the thinking processes integral to this skill could be
developed. The use of concurrent verbal reports, over a variety of
similar, well-defined tasks with a distinct range of subjects, permitted
identification of certain thinking processes integral to this skill. How-
ever, these data must merely be considered as indicators of cognitive
processes, not direct evidence of their full realizations (Ericsson and
Simon, 1984). Other limitations to the study, such as the small
number of subjects, all from one institution, make it necessary to
caution that the present findings require much further research before
they can be usefully applied to educational practice.
The descriptive results do, nonetheless, provide an empirical basis
for further research which may be of value in developing explicit,
accurate definitions of the knowledge and strategies appropriate to
training raters or student-teachers of second language composition.
The foregoing descriptions of differences in the performance of
novices and experts are consistent with many studies of expertise in
other domains (see Alexander and Judy, 1988; Chi, Glaser and Rees,
1982) and of teachers’ thinking processes in general (Clark and
Peterson, 1986; Kagan, 1988; Leinhardt and Greeno, 1986). Overall,
expert teachers appear to have a much fuller mental representation of
’the problem’ of evaluating student compositions, using a large
number of very diverse criteria, self-control strategies, and knowledge
sources to read and judge students’ texts. Novice teachers tend to
evaluate compositions with only a few of these component skills and
criteria, using skills which may derive from their general reading
abilities or other knowledge they have acquired previously (cf.
Perfetti, 1989; Perkins and Salomon, 1989). Their tendency to edit
student texts extensively appears to be an example of this kind of skill
transfer, since this behaviour would have served no other logical
purpose in the context of the experimental tasks (i.e., novice raters
could not have expected the ESL students to receive their feedback,
nor the researcher to benefit from their corrections).
Kintsch’s (1988, 1989) distinctions between ’situational’ and ’text-
based’ models of reading appear germane to explaining differences
between the behaviours of these expert and novice teachers. Many of
the novices did not seem to have developed a very thorough
’situational model’ for evaluating student compositions; they limited
their decision-making to a very ’text-based’ editing strategy, then
rapid judgements of quality based on often inexplicit criteria. Other
novices, in contrast, seem to have avoided considering textual
features altogether, searching only for ’situational’ impressions of the
students who may have composed the texts - a strategy which left
them little basis for judging qualities of language use. Expert
teachers, on the other hand, integrated their interpretations and
judgements of situational and textual features of the compositions
simultaneously, using a wide range of relevant knowledge and
strategies. Their decision-making appeared as complex, interactive
mental processes, as seem to characterize other cognitive aspects of
teaching skill (Kagan, 1988; Leinhardt and Greeno, 1986). This was a
far more multi-faceted, variable process than the unilinear ’funnel
model’ which Homburg (1984) suggests might guide the processes of
evaluating ESL compositions. The sheer quantity of interrelated
decisions which occur in this process testifies to the difficulty of
obtaining homogeneous ratings on composition exams, even among
skilled raters.
Interestingly, many novice teachers did, on occasion, perform the
range of decision-making behaviours which the expert teachers
displayed. This suggests that their potential to develop expertise in
this skill extends from practical knowledge they already possess, but
have not yet had the opportunity to refine to a point of expertise
through teaching and evaluation experiences (see Yinger, 1987). Of
course, the present study can say little about how to foster such
development in student-teachers; but it does provide some empirical
evidence to help define the kinds of skills and knowledge which
teacher education in second languages might seek to foster (see
Cumming, 1990; Freeman, 1989; Stern, 1983). It is important to note,
though, that the task of evaluating students’ compositions, as elicited
in the present study, is but one aspect of the complex interpersonal
interactions which obtain in classroom settings. As other research has
pointed out, many additional considerations operate in teaching
practice to determine how teachers evaluate students’ writing and
how this may provide opportunities to learn (Cohen, 1987; Freed-
man, 1987; Zamel, 1985).

Acknowledgements
I greatly appreciated: assistance from Ernest Hall and Catherine
Ostler-Howlett in coding the data; advice from Lee Gunderson on the
statistical analyses; the time and effort volunteered by the 13 teachers
and student-teachers to rate the compositions and report on their
thinking; and partial funding of this research by the Social Sciences
and Humanities Research Council of Canada through the University
of British Columbia.

V References

Alexander, P. and Judy, J. 1988: The interaction of domain-specific and
strategic knowledge in academic performance. Review of Educational
Research 58(4), 375-404.
Bereiter, C. and Scardamalia, M. 1987: The psychology of written
composition. Hillsdale, NJ: Erlbaum.
Canale, M. 1981: Communication: How to evaluate it? Bulletin of the
Canadian Association of Applied Linguistics 3(2), 77-94.
Carroll, B. 1980: Testing communicative performance. Oxford: Pergamon.
Chi, M., Glaser, R. and Rees, E. 1982: Expertise in problem solving. In
Sternberg, R., editor, Advances in the psychology of human
intelligence, vol. 1, pp. 7-75. Hillsdale, NJ: Erlbaum.
Clark, C. and Peterson, P. 1986: Teachers’ thought processes. In Wittrock,
M., editor, Handbook of research on teaching, 3rd edition,
pp. 255-96. New York, NY: Macmillan.
Cohen, A. 1987: Student processing of feedback on their compositions. In
Wenden, A. and Rubin J., editors, Learner strategies in language
learning, pp. 57-69. Englewood Cliffs, NJ: Prentice-Hall.
Cohen, A. in press: The taking and rating of summary tasks on a foreign-
language reading comprehension test. To appear in Hill, C. and Parry,
K., editors, The test at the gate: Towards an ethnography of reading
assessment.
Cumming, A. 1985: Responding to the writing of ESL students. In Maguire,
M. and Pare, A., editors, Patterns of development, pp. 58-75. Ottawa:
Canadian Council of Teachers of English.
Cumming, A. 1989: Writing expertise and second language proficiency. Language
Learning 39(1), 81-141.
Cumming, A. 1990: Student-teachers’ conceptions of curriculum: Toward an under-
standing of language teacher development. TESL Canada Journal
7(1), 33-51.
Cumming, A., Rebuffot, J. and Ledwell, M. 1989: Reading and
summarizing challenging texts in first and second languages. Reading
and Writing 2(2), 201-19.
Cummins, J. and Swain, M. 1986: Bilingualism in education. New York:
Longman.
Ericsson, A. and Simon, H. 1984: Verbal reports as data. Cambridge, Mass.:
MIT Press.
Freeman, D. 1989: Teacher training, development, and decision making: a
model of teaching and related strategies for language teacher educa-
tion. TESOL Quarterly 23(1), 27-45.
Freedman, S. 1987: Response to student writing. Urbana, IL: National
Council of Teachers of English.
Homburg, T. 1984: Holistic evaluation of ESL compositions: Can it be
validated objectively? TESOL Quarterly 18(1), 87-107.
Jacobs, H., Zinkgraf, A., Wormuth, D., Hartfiel, V. and Hughey, J. 1981:
Testing ESL composition. Rowley, MA: Newbury House.
Janopoulos, M. 1989: Reader comprehension and holistic assessment of
second language writing proficiency. Written Communication 6(2),
218-37.
Kagan, D. 1988: Teaching as clinical problem solving: a critical examination
of the analogy and its implications. Review of Educational Research
58(4), 482-505.
Kintsch, W. 1988: The role of knowledge in discourse comprehension: a
construction-integration model. Psychological Review 95(2), 163-82.
Kintsch, W. 1989: Learning from text. In Resnick, L., editor, Knowing, learning,
and instruction, pp. 25-46. Hillsdale, NJ: Erlbaum.
Land, R. and Whitley, C. 1989: Evaluating second language essays in regular
composition classes: toward a pluralistic U.S. rhetoric. In Johnson, D.
and Roen, D., editors, Richness in writing, pp.284-93. New York:
Longman.
Leinhardt, G. and Greeno, J. 1986: The cognitive skill of teaching. Journal
of Educational Psychology 78(1), 75-95.
Mendelsohn, D. and Cumming, A. 1987: Professors’ ratings of language use
and rhetorical organization in ESL compositions. TESL Canada
Journal 5(1), 9-26.
Perfetti, C. 1989: There are generalized abilities and one of them is reading.
In Resnick, L., editor, Knowing, learning, and instruction, pp. 307-35.
Hillsdale, NJ: Erlbaum.
Perkins, D. and Salomon, G. 1989: Are cognitive skills context-bound?
Educational Researcher 18(1), 16-25.
Perkins, K. 1983: On the use of composition scoring techniques, objective
measures, and objective tests to evaluate ESL writing ability. TESOL
Quarterly 17(4), 651-71.
Purves, A. 1984: In search of an internationally-valid scheme for scoring
compositions. College composition and communication 35(4), 426-38.
Quellmalz, E. 1980: Problems in stabilizing the judgment process. Center for
the Study of Evaluation report no. 136. Los Angeles CA: University of
California.
Santos, T. 1988: Professors’ reactions to the writing of nonnative-speaking
students. TESOL Quarterly 22(1), 69-90.
Stern, H. 1983: Language teacher education: an approach to the issues and a
framework for discussion. In Alatis, J., Stern, H. and Strevens, P.,
editors, Applied linguistics and the preparation of second language
teachers: Georgetown University round table on languages and
linguistics, pp.342-61. Washington, DC: Georgetown University
Press.
Wesche, M., Canale, M., Cray, E., Jones, S., Mendelsohn, D., Tumpane,
M. and Tyacke, M. 1987: The Ontario test of English as a second
language: A report on the research. Toronto: Ontario Ministry of
Colleges and Universities.
Weir, C. 1988: Academic writing - can we please all the people all the time?
ELT Documents 129, 24-34.
Yinger, R. 1987: Learning the language of practice. Curriculum Inquiry
17(3), 293-318.
Zamel, V. 1985: Responding to student writing. TESOL Quarterly 19(1),
79-101.

Appendix A Characteristics of compositions selected

Appendix B Examples of Coded Data


I. Self-control Strategies to Guide Interpretations
1. initially scan whole text to obtain impression before reading
All right here we go, number 9. Mm, he’s got lots of paragraphs. I wonder if
they all say something different. Let’s see. (Expert 4)

Okay, 142. A big one. (Novice 13)


2. envision the situation of writing and the writer

This may be rewriting his c.v. You know, we may have a language here that is
memorized, that’s not really generated. (Expert 1)
It’s very curious to me. I want to know, I guess the other thing I would like to
know is what are the life experiences of these people? Some of them are dead
give-aways in their writing. Others, um, are not. And I’m tempted, I guess I’m
just very curious to know if it’s their lack of experience, or age, or what have
you, that affects their writing. Because some of these come across as being
much younger. (Novice 7)

II. Strategies to Interpret Content


3. interpret ambiguous phrases

Ah, I think he means, ’reduced base’. (Expert 6)

Um, ’as it was’? What does he say? What’s the purpose of that? I’m not sure.
(Novice 8)
4. summarize propositions
Okay, the writer deals with, not only, studying both English and French. Um,
the value of studying English and French, then the value of meeting people
from other countries. As a, um, the value in broadening one’s world view.
(Expert 1)
This person has said, what? They introduced themselves, why they’re taking
English, why they’re here, um, what they like about being here. (Novice 10)

III. Strategies to Interpret Language Use


5. classify errors

Verb agreement. (Expert 6)


Um, okay, the, ah, sentence in the second paragraph tends to run-on. (Novice
8)
6. edit phrases
Um, if this isn’t the name of a section of government, then he should leave off
the capital. (Expert 3)
’And this is sure to take me big role in my future.’ It should be, ’this is sure to
take a big role in my future.’ Not the ’me’. (Novice 9)

IV. Strategies to Interpret Rhetorical Organization


7. discern rhetorical structure(s)

What has just crossed my mind, now, as I just look at the way the composition
actually looks is that it’s not broken down into paragraphs at all. (Expert 2)
The conclusion, the person has decided to, ah, slightly alter the topic, or the
conclusion of the topic. (Novice 12)

V. Self-control Strategies to Guide Judgements


8. establish personal responses to qualities of items
I like number 45. He doesn’t make any bones about what he’s after. (Expert 6)
Again, I guess, for some of these I’m struck by either the assumptions that
some of them are making or, um, some of the observations that they’re making
about language use, which is quite interesting. (Novice 7)

9. define, assess, or revise own criteria and strategies


So, in terms of rating, ’language use’ I interpret as grammar, spelling,
punctuation, and, ah, wording. (Expert 3)
So I take that, first of all, I’m reading through and wonder what the whole
thing means. And then I can go through and, ah, see what individual correc-
tions to work on. (Novice 10)
10. read to assess specific criteria

But, ah, language use? I’m looking now for just simple subject-verb units to
get a sense of sentence. (Expert 1)

So that’s number 85. Um, language use? Well, I’m looking for what kind of
content words there are that might show a good level of understanding, or use
of difficult words maybe. (Novice 10)

11. compare qualities of different compositions


Um, a little bit more effective than the last one I had in mind, that I just read
before. (Expert 2)
I’m just noticing this one from the other 2, the, um, paragraph structure is
shorter, in terms of fewer sentences, but of course more paragraphs. (Novice 8)
12. distinguish interactions between evaluation categories
I’m now, it’s hard to disassociate language use from content, because they
seem so entangled. But when I look at a sentence, and only look at language use
in a sentence, I am able to, I mean I would give it a 2. (Expert 2)
Content? Ah, well, the problem with content is, they’re not really addressing
the question. They’re saying more, um. This has to do with rhetorical
organization too actually. (Novice 10)
13. summarize own judgements collectively
So I’m happy now with 3, 4, 4 for number 142. (Expert 2)
The ideas, the structure, the rhetorical organization is quite good. It’s a bit
more of the language use. (Novice 9)

VI. Strategies to Judge Content


14. count propositions to assess total output
We definitely have 3 points that are made. (Expert 1)
Okay, well, the content wasn’t so bad. He could get the idea across, to acquire
skills, to improve opportunities, learning English, to show his friends. (Novice
9)
15. assess relevance

Okay, this is not, the whole paragraph is irrelevant. (Expert 3)


The content is good, but it addresses the topic, why you chose Canada, not
what effect the studies you do in Canada will have, may have on the future.
(Novice 13)
16. assess interest

Mm, that’s an interesting statement. (Expert 6)

You can’t help but be drawn into the personal thinking in some of these.
They’re fairly heroic characters, some of them. (Novice 7)
17. assess development of topics
He makes abundant applications. His reason for studying, it’s developed here.
Again it’s for the company. (Expert 4)
It seems to be quite logical. It begins with how we should consider the future.
And, um, it begins by telling us about his or her problem in Taiwan. And that
actually follows up to why this person is here. (Novice 8)
18. rate content overall
I’m going to give it a 1 in terms of content. (Expert 2)
Content? It isn’t ineffective, but it’s not effective. So I’d say, it’s a 2. (Novice
12)

VII. Strategies to Judge Language Use


19. establish level of comprehensibility
But I was noticing that a couple of times I was actually stopped. I could not
grasp the meaning of the sentence right away, primarily because of problems in
language use. (Expert 2)
Yeah, so this fifth paragraph really gives me some problems, although I still get
the sense of what’s trying to be said. (Novice 8)

20. establish error values

For language use, there are some really low level mechanical sorts of problems
in verbs. But she does use a lot of transitions and attempts at coherence.
(Expert 1)

So the first thing, again, very minor problems. (Novice 10)


21. establish error frequency
Not that many mistakes. (Expert 4)
Quite a few grammar errors in the first paragraph. (Novice 12)
22. establish command of syntactic complexity
I just realized that I haven’t really been verbally, or maybe consciously, paying
much attention to sentence complexity. In this, um, composition, because it’s
quite unsophisticated, I’m noticing simpler constructions. (Expert 1)
He definitely uses long sentences. (Novice 9)
23. establish appropriateness of lexis

Inappropriate words, used inappropriately here and there. (Expert 2)


Nice vocabulary use, it seems to me. (Novice 10)
24. rate language use overall

Yeah, 85’s language use is really low. (Expert 6)

Um, for language use? It’s pretty good for language use. I’d give it a 3 for
language use. (Novice 13)

VIII. Strategies to Judge Rhetorical Organization


25. assess coherence

Well, um, this writer has no concept of, ah, rhetorical organization. He’s put
everything in one paragraph. (Expert 4)


The organization, although the first two, with ’first’ and ’second’, there is
some kind of progression there. But it’s not developed. (Novice 8)

26. identify unnecessary repetition

Ah, okay, first paragraph, it’s repetitive. (Expert 6)


Okay, and the last sentence seems unnecessary. (Novice 11)
27. assess helpfulness in guiding reader
The rhetorical organization does not help me see that the student is actually
addressing the topic. (Expert 2)
She skips from present to past, and a bit from country to country, which is hard
for me to follow. (Novice 12)
28. rate organization overall
The organization isn’t particularly good. (Expert 3)
Rhetorical organization? A 2. It’s not that great. (Novice 13)
