
Assessing Writing 9 (2004) 122–159

Developing a common scale for the assessment of writing


Roger Hawkey a,*, Fiona Barker b

a Consultant in language testing research to Cambridge ESOL
b Validation Officer, Research & Validation, Cambridge ESOL

Available online 23 July 2004

Keywords: Writing assessment; Proficiency; Levels; Criteria; Scales

1. Background to the study

This article reports a qualitative analysis of a corpus of candidate scripts from three Cambridge ESOL examination levels, but all in response to the same writing task. The analyses, using both intuitive and computer-assisted approaches, are used to propose key language features distinguishing performance in writing at four pre-assessed proficiency levels, and to suggest how these features might be incorporated in a common scale for writing. The work is part of the Cambridge ESOL Common Scale for Writing (CSW) project, which aims to produce a scale of descriptors of writing proficiency levels to appear alongside the common scale for speaking in the Handbooks for the Main Suite and other Cambridge ESOL international exams. Such a scale would assist test users in interpreting levels of performance across exams and locating the level of one examination in relation to another.

Fig. 1 conceptualises the relationship between a common scale for writing and the levels typically covered by candidates for Cambridge ESOL examinations: the Key English Test (KET), the Preliminary English Test (PET), the First Certificate in English (FCE), the Certificate in Advanced English (CAE) and the Certificate of Proficiency in English (CPE), each of which has its own benchmark pass level, the C in Fig. 1. Writing tasks set as part of the tests in Fig. 1 are currently scored by rating degree of task fulfilment and evidence of target language control according to criteria such as communicative effectiveness, register, organisation, linguistic range and accuracy.

Tel.: +44 1840 212 080; fax: +44 1840 211 295. E-mail address: roger@hawkey58.freeserve.co.uk (R. Hawkey).

1075-2935/$ - see front matter © 2004 Elsevier Inc. All rights reserved. doi:10.1016/j.asw.2004.06.001


Fig. 1. Conceptual diagram of a common scale across examination levels and ranges.

From the assessment by trained raters of performance on these and other criteria, a candidate is assigned a score for the writing task set. As Fig. 1 indicates, a common scale may cover the whole conceptual range of proficiency (Common European Framework of Reference for Languages (CEF), 2001, p. 40).
Table 1
CEF overall written production (illustrative scale)

C2: Can write clear, smoothly flowing, complex texts in an appropriate and effective style and a logical structure which helps the reader to find significant points.
C1: Can write clear, well-structured texts of complex subjects, underlining the relevant salient issues, expanding and supporting points of view at some length with subsidiary points, reasons and relevant examples, and rounding off with an appropriate conclusion.
B2: Can write clear, detailed texts on a variety of subjects related to his/her field of interest, synthesising and evaluating information and arguments from a number of sources.
B1: Can write straightforward connected texts on a range of familiar subjects within his field of interest, by linking a series of shorter discrete elements into a linear sequence.
A2: Can write a series of simple phrases and sentences linked with simple connectors like and, but and because.
A1: Can write simple isolated phrases and sentences.


If examinations for candidates at different levels are located on a common scale of proficiency, then, to cite the CEF again, it should be possible, over a period of time, to establish the relationship between the grades on one examination in the series with the grades of another (ibid., p. 41). The investigations and analyses of the common scale for writing project should also inform key writing assessment issues such as the development and application of criteria, the generalisability of proficiency ratings and the use of exam candidate script corpora in writing assessment research.

2. Relevant issues in the testing of writing

2.1. Communicative writing constructs

Cambridge ESOL claims to undertake to assess skills at a range of levels, each of them having a clearly defined relevance to the needs of language learners, and to assess skills which are directly relevant to the range of uses to which learners will need to apply the language they have learnt, covering the four language skills: listening, speaking, reading and writing. Saville, in Weir and Milanovic (2003, p. 66), a volume describing the history and recent revision of the Cambridge ESOL Certificate of Proficiency in English exam, sees it as the task of the examination developer to construct definitions or traits of ability for the purpose of measurement, and claims that it is these definitions which are the constructs. In the context of language testing, Saville continues, a model of language ability represents the construct.

Given the stated Cambridge ESOL aim to assess the four skills as they are directly relevant to the range of uses to which learners will need to apply the language they have learnt, it is the construct of communicative language ability that should underlie the language tests concerned. The communicative language ability construct, derived from models such as Bachman's (1990) view of language competence, comprises organisational competence (including grammatical and textual competences) and pragmatic competence (including illocutionary and sociolinguistic competences). The CEF describes language use, embracing language learning, similarly as comprising the actions performed by persons who as individuals and as social agents develop a range of competences, both general and in particular communicative language competences (2002, p. 9).

A communicative writing construct in the context of such models of communicative language ability will also entail Bachman's (1990) pragmatic, organisational and sociolinguistic competences. Cumming (1998, p. 61), again taking a context-rooted view, reminds us that the construct writing refers not only to text in written script but also to the acts of thinking, composing, and encoding language into such text, these necessarily entailing discourse interactions within a socio-cultural context.


The CEF, echoing this interactive view of the nature of writing, notes (2001, p. 61) that in written production activities the language user as writer produces written text which is received by a readership of one or more readers, and exemplifies writing activities (e.g., completing forms; writing articles, reports, memoranda; making notes and taking messages) which are performed for communicative purposes. The research described in this article analyses test candidate responses to a writing task that was selected for its potential to engage the interest and communicative abilities of a wide range of language learners. The status of the task concerned, in terms of its appropriateness to the communicative writing construct, is explored in Section 5.

2.2. The assessment of communicative proficiency

Hamp-Lyons (1990), whose own participation in the first phase of the CSW project is described below, notes that once the indirect (often multiple-choice) writing tests of the 1960s and 1970s had been chased from the battlefield, direct tests of writing held sway. The communicative writing construct invited, in the interests of test construct (and content) validity, assessment methods that measured communicative proficiency. Such methods were likely to involve the development of direct tests to elicit candidate performance on tasks with a context, purpose, authentic discourse and behavioural outcomes. Milanovic, Saville, and Shen (1992) claim that direct tests of writing have always been standard practice in Cambridge examinations for the assessment of both first (L1) and second language (L2) writing abilities (p. 62). In Cambridge ESOL exams, according to Saville (2003), authenticity of test content and the authenticity of the candidate's interaction with that content are important considerations for the examination developer in achieving high validity (p. 67). But as Hamp-Lyons (1990) notes, direct tests of communicative language ability raise problems to which there are no easy answers: each aspect (task, writer, scoring procedure, and reader) interacts with the others, creating a complex network of effects which to date has eluded our efforts to control (p. 87). Bachman's (1990) framework for test method characteristics, which also emphasises the complexity of language ability assessment, includes among the variables: testing environments, test rubric, test input, expected response and relationship between input and response.

Authenticity also remains a major issue. Bachman (1990, 1991) defines the notion in terms of the appropriacy of language users' response to language as communication. Bachman and Palmer (1996, pp. 23–25) re-analyse this notion into authenticity and interactiveness, the former defined as the degree of correspondence of the characteristics of a given language task to the features of a TLU (target language use) task, the latter as the extent and type of involvement of the test taker's individual characteristics in accomplishing the test task.

The assessment of writing through communicative tasks brings a change in the relationship between reliability and validity. Saville (2003, p. 69) notes a potential tension between them: when high reliability is achieved, for example by narrowing the range of task types or the range of skills tested, this restricts the interpretations that can be placed on performance in the test, and hence its validity for many purposes.


When language tests were more discrete-item and objective, it was easier to obtain stable, consistent results free from bias and random error. With the task-based assessment of communicative writing proficiency, however, validity, in Bachman's (1990, p. 161) sense of maximising the effects of the language abilities we want to measure, often decrees test tasks that are not susceptible to discrete-point marking but require the kind of rating, for example through criteria and band scales, that is more vulnerable to problems of intra- and inter-rater reliability.

Writing tasks set as part of the Cambridge ESOL examinations in Fig. 1 are rated according to criteria incorporated in band descriptors used to place a candidate in terms of what learners can be expected to do (Cambridge ESOL exam Handbooks) and within the proficiency level represented by the exam (s)he has taken. Cambridge ESOL describes the exams as linked to the Common European Framework for Modern Languages, published by the Council of Europe, and as the only certificated exams referred to in the Framework document as specifically linked to it by a long-term research programme (Cambridge Examinations in English, a brief guide). The Cambridge ESOL First Certificate in English (FCE) exam, for example, claims to certify successful candidates at CEF B2 or Vantage level.

A further, related problem with communicative test tasks is that, while they attempt to mirror specific authentic activities, they are also expected to offer generalisability to other task performances and extrapolation to future abilities. Morrow (1979, 1990) sees generalisability potential in the analysis of communicative tasks into their enabling skills or micro-skills (see Munby, 1978), since the assessment of ability in using these skills . . . yields data which are relevant across a broad spectrum of global tasks, and are not limited to a single instance of performance (Morrow, 1979, p. 20). Saville and Hawkey (2004), among others, however, note the difficulty of isolating the particular enabling skills actually used in the performance of tasks. North (2000) notes that there are arguments for and against using the same rating categories for different assessment tasks, but warns that if the categories are based on the task rather than a generic ability model, results from the assessment are less likely to be generalisable (p. 568). North's categories are assessment criteria such as range, accuracy, and interaction. The CEF (2001) suggests that a common framework scale should be context-free in order to accommodate generalisable results from different specific contexts. But the test performance descriptors concerned also need to be context-relevant, relatable or translatable into each and every relevant context . . . (p. 21). Weir (1993) considers rigour in the content coverage of direct performance tasks as one way to increase generalisability. The sample of communicative language ability elicited from test-takers by a test task must be as representative as possible, in accordance with the general descriptive parameters of the intended target situation, particularly with regard to the task setting and task demands (p. 11). Bachman (2002) proposes, in similar vein, assessments based on the planned integration of both tasks and constructs in the way they are designed, developed and used.


For Bachman, a fundamental aim of most language performance assessments is to identify tasks that correspond to tasks in real-world settings and that will engage test-takers in language use or the creation of discourse (p. 471).

In the context of the research project described in this article, the constructs and problems mentioned above (communicative language ability, the socio-cultural context, task and other effects, generalisability, authenticity, and, of course, validity and reliability) are to a greater or lesser extent at issue. The authenticity, and thus the validity, of the particular task used to collect a range of candidate performances must be considered. Inter-rater reliability will have to be established to permit inferences to be made about communicative performance features typical of different levels of proficiency, and the issue of generalisability is, of course, critical to any work related to the development of common scales. Since the analyses in the research described below are based on a wide range of learners performing on a single, common task, the generalisability question will demand particularly convincing answers.

2.3. Bands and scales in the assessment of performance on communicative writing tasks

Alderson's (1990) paper on direct test assessment bands remains an insightful survey of the principles and practices of proficiency level descriptor band scales in the assessment and reporting of direct language test performances. Alderson notes that bands represent a range of scores rather than precisely defined performances, which, he suggests, may help testers avoid a spurious impression of accuracy. He also makes the useful distinction between constructor-oriented, assessor-oriented and user-oriented scales; band descriptions may be used in test development, to rate test performance or to interpret performance for test candidates or receiving organisations. These categories of scale are not, of course, mutually exclusive. The Cambridge ESOL CSW project is intended eventually to inform test constructors, assessors and users.

Alderson's analysis of band scale development raises issues that must receive attention in this study. These include: deciding which assessment criteria to include and how to define them; distinguishing the end of one band or level from the beginning of the next (without, e.g., resorting entirely to indefinite distinctions between always, usually, sometimes, occasionally); avoiding long, over-detailed descriptions (see also CEF, 2001; North, 2000; Porter, 1990); and achieving intra- and inter-rater consistency when bands are used to assess proficiency.

Spolsky (1995) expresses stronger doubts about the viability of band scales. Such scales may be attractive for their easy presentation of language test results, but risk an oversimplification that ultimately misrepresents the nature of language proficiency and leads to necessarily inaccurate, and therefore questionable, statements about individuals placed on such a scale (p. 350).


The message for language testers is not, however, that scales are not feasible, but that their development and use should take account of the underlying complexity of writing and its measurement as part of an open and contextual approach to language proficiency assessment (p. 353).

2.4. Developing and revising rating scales

On the question of how band scales are developed, suggestions from the Common European Framework are examined here in some detail, as they will help categorise the approaches of the research that is the subject of this article. The CEF (2001) suggests the following on scale development methodologies in general:
There are a number of possible ways in which descriptions of language proficiency can be assigned to different levels. The available methods can be categorised in three groups: intuitive methods, qualitative methods and quantitative methods. Most existing scales of language proficiency and other sets of levels have been developed through one of the three intuitive methods in the first group. The best approaches combine all three approaches in a complementary and cumulative process. (p. 207)

Qualitative methods, according to the CEF account (2001, p. 207), require the intuitive preparation and selection of material and the interpretation of results. The use of quantitative methods involves scale developers in quantifying qualitatively pre-tested material, and will require the intuitive interpretation of results. The CEF then provides examples of intuitive, qualitative and quantitative methods.

Intuitive methods are seen as requiring the principled interpretation of experience rather than structured data collection, probably involving the drafting of a scale using existing scales and other relevant source materials, possibly after undertaking a needs analysis of the target group, after which developers may pilot and revise the scale, possibly using informants. This process is seen as being led by an individual, by a committee (e.g., a development team and consultants), or as experiential (the committee approach but over a longer period, developing a house consensus and possibly with piloting and feedback).

The CEF account describes qualitative methods of scale development as involving small workshops with groups of informants and a qualitative rather than statistical interpretation of the information obtained (ibid., p. 209), the use of expert or participant-informant reactions to draft scales, or the analysis of typical writing performances using key features or traits to refine provisional criteria and scales and relate them to proficiency levels.

The CEF then proposes three quantitative methods of developing band scales. Discriminant analysis sees sets of performances rated and subjected to a detailed discourse analysis to identify key features, after which multiple regression is used to determine which of the identified features are significant in determining the rating which the assessors gave.


These features can then be incorporated in the required level descriptors, as in Fulcher (1996); a minimal sketch of this regression step is given at the end of this section. Multi-dimensional scaling, a descriptive technique to identify key features and the relationships between them (p. 210), is used on performance ratings to identify features decisive in determining level, and provides a diagram mapping the proximity or distance of the different categories to each other. Item response theory (IRT) or latent trait analysis, for example using the Rasch model to scale descriptors of communicative proficiency, associates descriptors of communicative performance with proficiency levels. The generalisation advantages of Rasch analysis are significant; such analysis can provide sample-free, scale-free measurement . . . scaling that is independent of the samples (p. 211).

Insights for the study here are also taken from Fulcher (2003, p. 92), who classifies approaches to rating-scale development as either intuitive (including expert, committee or experiential judgement) or empirical (data-based or -driven, empirically derived binary choice boundary definition scales, and the ranking of scaling descriptors by experts). There are some similarities, too, with Upshur and Turner (1995, 1999) and Turner and Upshur (1996), who develop empirically derived, binary choice, boundary definition scales (EBBs). This method rank-orders target language samples, scores them, then identifies features that were decisive in allocating the samples to particular bands or score ranges (Fulcher, 2003, p. 104).

In terms of the CEF categorisation of approaches above, and of the methods suggested by Fulcher, and by Upshur and Turner, the study described in this article, which is but one channel of inquiry in what Cambridge ESOL (see Saville, 2003, p. 64) calls an on-going programme of validation and test revision, may be characterised as follows:

- led by an individual researcher using the principled interpretation of experience, with a co-researcher responsible for the computer linguistic analysis of some data;
- using the intuitive preparation and selection of material and the interpretation of results, but also structured data collection;
- using data-based or -driven approaches;
- using analysis of typical writing performances through key features or traits, but also reference to existing scales and other relevant sources, to develop rating criteria for a draft proficiency scale, to be piloted and revised;
- using some quantitative methods to validate the sorting of data, for example inter-rater reliability statistics on the candidate scripts to be grouped according to level;
- using small workshops with groups of informants and a qualitative rather than statistical interpretation of the information obtained to refine provisional criteria and scales and relate them to proficiency levels; and
- referring interim work regularly to a committee developing a house consensus, namely the Cambridge ESOL Writing Steering Group.

Sections 4–8, which describe the study in some detail, will illustrate these approaches.
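To make the discriminant-analysis route above more concrete, the following minimal sketch regresses invented per-script feature counts on the band ratings assessors are assumed to have given. It is not taken from the CSW project: the feature names, the data and the use of the Python statsmodels library are all illustrative assumptions.

# Illustrative sketch only: multiple regression over invented discourse-feature
# counts to see which features best predict assessors' ratings, as in the
# CEF's discriminant-analysis route to scale development. Feature names,
# data and tooling (statsmodels) are assumptions, not taken from the article.
import numpy as np
import statsmodels.api as sm

# One row per rated script; columns are hypothetical feature counts:
# errors per 100 words, cohesion devices, clauses per sentence.
features = np.array([
    [12.0, 3, 1.2],
    [ 9.5, 4, 1.5],
    [ 8.0, 4, 1.4],
    [ 6.0, 6, 1.9],
    [ 4.2, 5, 2.1],
    [ 3.8, 6, 2.0],
    [ 2.1, 7, 2.4],
    [ 1.5, 8, 2.6],
])
ratings = np.array([2, 2, 3, 3, 4, 4, 5, 5])   # bands assigned by raters

X = sm.add_constant(features)      # add an intercept term
model = sm.OLS(ratings, X).fit()   # ordinary least squares regression

# Coefficients and p-values suggest which features contribute to the rating
# and so are candidates for inclusion in level descriptors.
print(model.params)
print(model.pvalues)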


2.5. Example criteria and bands

The current study towards a common scale for writing is informed by the bands and assessment criteria for writing used in a number of existing band scales. Table 1 shows the CEF illustrative six-level scale for overall written production (A1 Breakthrough to C2 Mastery) (2001, p. 61). It is possible to infer criteria such as the following from this scale: clarity, fluency, complexity; appropriacy and effectiveness of style; logical structure and connections; links, helping the reader to find significant points; range/variety of topics. For its six-level overall written interaction scale, the CEF (2001, p. 84) refers to the following overlapping but not identical criteria: clarity, precision, flexibility, effectiveness; emotional, allusive and joking language; conveying degrees of emotion; highlighting the significance of events and experiences. Such descriptors for writing proficiency, derived as they are according to the CEF's guidelines of positiveness, definiteness, clarity, brevity, and independence (pp. 206–207), inform the descriptors identified later for use in a draft scale proposed in this study.

Table 2 summarises bands and assessment criteria used for Cambridge ESOL Main Suite and other key international tests of writing.
Table 2
Main writing assessment criteria used with Cambridge ESOL Main Suite and other exams

Certificate of Proficiency in English (CPE)
Bands/levels: 6 bands/levels; effect on reader: very positive, positive, achieves desired effect, negative, very negative, nil
Main criteria for assessment: Task realisation: content, organisation, cohesion, range of structures, vocabulary, register and format, target reader. General impression: sophistication and range of language; style, register, format; organisation and coherence; topic development; errors

Certificate in Advanced English (CAE)
Bands/levels: 6 bands/levels; effect on reader: very positive, positive, would achieve required effect, negative, very negative, nil
Main criteria for assessment: Task specific: content; range; organisation and cohesion; register; target reader. General impression: task realisation: coverage, resourcefulness; organisation and cohesion; appropriacy of register; language: control, naturalness, range of vocabulary and structure, errors

First Certificate in English (FCE)
Bands/levels: 6 bands/levels; effect on reader: very positive, positive, would achieve required effect, negative, very negative, nil
Main criteria for assessment: Task specific: content; range; organisation and cohesion; appropriacy of register and format; target reader. General impression: task realisation: full, good, reasonable, not adequate, not at all; coverage of points, relevance, omissions, original output; organisation and links; control of language: range and accuracy; appropriacy of presentation and register

Preliminary English Test (PET)
Bands/levels: 5 marks for task, 5 for language
Main criteria for assessment: Task: coverage, elaboration, organisation. Language: range, variety, complexity, errors

Key English Test (KET)
Bands/levels: 10, 5 and 5 marks for three tasks
Main criteria for assessment: Message: communication, grammatical structure, vocabulary, spelling, punctuation. Content: points, length

Certificates in English Language Skills (CELS)
Bands/levels: 6 bands/levels
Main criteria for assessment: Format and register: appropriacy; organisation: clarity, intent; cohesion: complexity, variety of links; structure and vocabulary range: range, distortion; accuracy: impeding errors; paragraphing, spelling, punctuation

International English Language Testing System (IELTS)
Bands/levels: 9 bands/levels; effect on reader: expert, very good, good, competent, modest, limited, extremely limited, intermittent, non-user
Main criteria for assessment: Task fulfilment: requirements, exploitation, relevance, arguments, ideas, evidence: logic, development, point of view, support, clarity; coherence and cohesion. Communicative quality: impact on reader, fluency, complexity. Vocabulary and sentence structure: range, appropriacy, accuracy, error types


These descriptors and criteria also inform decisions on the draft band descriptors proposed below. Noted, in particular, were the following, used in the descriptions of proficiency levels for one or more Cambridge ESOL exams, and classified here under three headings that appear appropriate superordinates for the descriptors and criteria covered:

Fulfilment of the task set
- task realisation: full, good, reasonable, not adequate, not at all
- coverage of points, relevance, content points, omissions, original output
- task fulfilment: requirements, exploitation
- relevance, arguments, ideas, evidence: logic, development, point of view, support, clarity
- length

Communicative command of the target language
- communicative quality: impact on reader, effect on target reader
- sophistication and range of language
- fluency, complexity, range of vocabulary and structure
- language: control, naturalness, structure and vocabulary range
- style/register and format
- appropriacy of register and format
- appropriacy of presentation and register
- format and register: appropriacy

Organisation of discourse
- organisation and coherence; organisation and links; organisation: clarity, intent
- paragraphing; cohesion: complexity, variety of links
- coherence and cohesion
- topic development

Linguistic errors
- accuracy: impeding, non-impeding errors
- spelling, punctuation errors
- accuracy, error types
- distortion


In the study reported here, it will be necessary to identify the characteristics or key features of the writing of candidates who take exams used to certificate learners at different proficiency levels, and who perform at different proficiency levels across those exams. The descriptors and criteria listed above will be one reference source for decisions on descriptors and criteria for use in the study. The groupings do suggest that, judging from their prominence in the Cambridge ESOL scales sample in Table 2, task realisation, linguistic error (or accuracy?), and the organisation of discourse could figure in the drafting of any new scale. Not, perhaps, belonging under such headings, however, because they are less directly connected with the content of response to the task set, with lexico-grammatical correctness or with the way a text is organised, are descriptors or criteria such as appropriacy, fluency, complexity, sophistication and range of language, included above under the very tentative heading Communicative command of the target language.

Fulcher (2003, p. 96), warning relevantly that the intuitive approach to scale development has led to a certain amount of vagueness and generality in the descriptors used to define bands, reminds us of the need to define and distinguish between key components of assessment:
In language testing, the attention of raters has been drawn to the accuracy of structure and vocabulary in speech as one component of assessment, and the quality and speed of delivery as a separate component. This is an attempt at construct definition: the operational definition of two related but distinct components that make up the construct of speaking. (p. 27)

The same need to attempt construct definition applies, of course, to the development of descriptors and scales for the assessment of writing, especially when samples of candidate writing are analysed for features that might typify their level of proficiency. On linguistic error or accuracy, for example, Fulcher (2003) appears to accept teacher definitions of the errors made by speakers. He notes, in addition, that some of these errors interfere with communication, and others do not. Fulcher also accepts that accuracy in the use of a second language may be associated with students who concentrate on building up grammatical rules and aim for accuracy (p. 26), unlike those who concentrate on communicating fluently, paying little attention to accuracy. Helpfully for the use of accuracy, in particular lexico-grammatical accuracy, as a criterion in the analysis of candidate scripts in this study, Fulcher refers to low and high gravity errors, and to the following types of accuracy error areas: agreement, word order, pronouns and relative clauses, tense, prepositions. All these categories of error are used later in this study.

Accuracy is frequently juxtaposed in language assessment contexts with appropriacy, a construct that will also emerge from the analysis of candidate scripts below.


Appropriacy entails the application of Hymes' (1971) rules of use, without which the rules of grammar would be useless. Appropriacy would seem to belong mainly under Bachman's (1990) pragmatic competence construct, which includes sociolinguistic competences such as sensitivity to dialect, variety, register. The fluency construct also appears in CEF descriptors, for example in the global scales, at C2 level: Can express him/herself fluently and spontaneously . . ., or in the Common Reference Level self-assessment grid for writing, again at C2 level: I can write clear, smoothly flowing text . . .

The development of descriptors, criteria and bands in this study will be informed both by relevant existing descriptors, criteria and bands, and by the intuitive and qualitative analyses of candidate scripts. The study being part of an iterative research process, insights will also come from previous phases in the Cambridge ESOL CSW project.

3. Research methodology

3.1. Lessons from CSW Research Phase 1

The CSW project is phased. Phase 1 is summarised here as background to the more detailed description of Phase 2, the main topic of this paper. The Phase 1 project design called for a two-fold approach to research towards the development of a common scale for writing. A senior Cambridge ESOL examiner, Annette Capel, revisited existing Cambridge exam mark schemes and modified intuitively the descriptors for the levels represented by the main suite of Cambridge exams. The outcome was a five-band draft common scale for writing using criteria such as: operational command of written language; length, complexity and organisation of texts; register and appropriacy; range of structures and vocabulary; and accuracy errors (Saville & Capel, 1995). The scale, shown in Fig. 2, is based on Pass level descriptors for each of the five Cambridge Main Suite exam levels.

As an applied linguist with a particular interest in writing assessment, Liz Hamp-Lyons (1995) was invited to investigate a representative corpus of candidate scripts from PET, FCE, CAE and CPE exams. From this corpus, she proposed can do, can sometimes do, and cannot do statements to characterise the proficiency levels of the scripts, and identified criteria such as task completion; communicative effectiveness; syntactic accuracy and range; lexical appropriacy; chunking, paragraphing and organisation; register control; and personal stance and perspective. As the scripts in the corpus were from candidates taking a range of Cambridge ESOL exams and thus responding to different prompts, Hamp-Lyons noted significant task effects on candidate writing performance. These provided interesting insights into task:performance relationships, but made more difficult the identification of consistent features of writing at different levels.


Fig. 2. Phase 1: draft common scale for writing (Capel, 1995).

Learning from the experience of Phase 1 of the CSW project, the following decisions were made on the approach to Phase 2:

1. Insights from existing scales and criteria would continue to inform work towards the development of a common scale for writing (as they had Capel's work in Phase 1).
2. The tabula rasa analysis of candidate scripts (as performed by Hamp-Lyons) would also be continued.
3. Task effect would be controlled by using a corpus of candidate writing in response to the same communicative task across exam levels.
4. The qualitative analyses of scripts would be carried out by a single researcher, thus filtered through the corpus analyst's own experience and preferences, but taking account (see Sections 4–8) of the analyses of existing scales and of candidate scripts carried out in Phase 1.


5. The intuitive analyses would be backed by some computer corpus analyses of the scripts, to be carried out by a second researcher in close contact with the first.
6. The work would continue to be monitored by the Cambridge ESOL Writing Steering Group.

3.2. Corpus linguistics and computer analyses

The decision to use computer corpus analysis in Phase 2 of the CSW project was partly motivated by advances in this approach since Phase 1. Corpus Linguistics (CL) is the study of language based on examples of real life language use, in this case the written language produced in live and simulated test situations in response to real life tasks (McEnery & Wilson, 1996, p. 1). Using CL techniques in this study would allow certain checks on the assertion that actual patterns of use in natural discourse are often quite different from linguists' perceptions, and many patterns simply go unnoticed until they are uncovered by empirical analysis (Biber, Conrad, & Reppen, 1998, p. 145). The reasons for the growing popularity of CL techniques in language testing include the ease of access to data and the range of research tools available. Computerised analyses can reveal many aspects of language use quickly and accurately, reducing the need for painstaking manual analysis and revealing patterns or facts that may be undetectable to the naked eye. CL techniques are used alongside other methodologies to reveal important facts about language.

There has been relatively little corpus-based research specifically related to learner writing. A notable exception is Granger and Rayson (1998), who investigated features of learner writing by comparing native and non-native texts from two corpora (the International Corpus of Learner English and the Louvain Corpus of Native English Essays). Granger and Rayson found that the non-native speakers overused three categories significantly: determiners, pronouns and adverbs, and also significantly underused three: conjunctions, prepositions and nouns (p. 123). Equally important is Kennedy, Dudley-Evans, and Thorp (2001), which is examined in greater detail below.

Corpora and corpus linguistic techniques are increasingly used in Cambridge ESOL's research projects for analysing candidate performance or developing examination tasks, alongside established methodologies such as qualitative analysis and intuition, and the knowledge and experience of item writers, examiners and subject officers. Although Cambridge ESOL has been developing corpora for over a decade (most notably the Cambridge Learner Corpus, see Ball, 2001), the related analytical techniques have tended to be used for small-scale research or test validation projects such as the comparison of scripts from different versions of the same examination or the analysis of transcripts of Young Learners speaking tests (see, e.g., Ball & Wilson, 2002). CL techniques have also been used by ETS, for example, to study the writing section of the TOEFL computer-based test, and in the development of the new writing and listening tests for TOEFL 2005.


Size is not everything when using corpora, because of the detailed scrutiny that can be applied to a small corpus. Aston and Burnard (1998, p. 21) note that it is striking how many descriptive studies have analysed only small corpora (or small samples of larger ones), often because of the need to inspect and categorise manually. This is relevant to Phase 2 of the CSW study because the total data submitted to computer analysis amounted to 18,000 words in 98 scripts, that is, the four sub-corpora selected after the initial analysis of the 53,000-word, 288-script corpus (see below). Whatever the size of the corpus, Leech (1998) urges researchers to be cautious when drawing general inferences from corpus findings and to stay alert to the influence of hidden variables implicit in the way we collected or sampled the data. In this study, having built our own small corpus, we were able to avoid this potential problem, although we should still bear in mind that a corpus is a finite sample of an (in principle) infinite population: we cannot easily extrapolate from what is found in a corpus to what is true of the language or language variety it supposedly represents. A description of the recent growth in the use of corpora in Cambridge ESOL can be found in a series of articles detailing key projects over the last 2 years (Ball, 2001, 2002; Ball & Wilson, 2002; Boyle & Booth, 2000).

A second reason for applying CL techniques to the data in the study was the use of similar techniques in a contemporary study at the University of Birmingham (UK), where Dr. Chris Kennedy was working with colleagues on the analysis of a corpus of 150 candidates' scripts in response to writing tasks from the International English Language Testing System (IELTS) (Kennedy et al., 2001). The methodology of the Kennedy et al. project involved re-typing candidate scripts into text files, including all errors, performing a manual analysis to note features of interest by band level, performing statistical analyses on essay length, and then applying the Concord and Wordlist tools of the WordSmith Tools program (Scott, 2002). This is a text analysis software package that has three main actions: producing wordlists from texts, converting texts into concordances (contextualised extracts showing where a key word or phrase occurs) and identifying the keywords in a text (by comparing the words in the text with a larger reference list); a rough Python approximation of these operations is sketched at the end of this section.

Interesting points emerging from the Birmingham research and relevant to Phase 2 of the CSW study include features of writing identified at different proficiency levels, for example: longer essays with broader vocabulary range at higher IELTS proficiency levels; more rhetorical questions, interactivity, idioms, colloquial, colourful and metaphorical language at higher levels; and the over-use of explicit (probably rote-learnt) cohesion devices by candidates at lower writing proficiency levels. Some of these aspects of high and low level proficiency in writing are echoed in findings in Phase 2 of the CSW study reported below.
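The three WordSmith Tools operations described above (wordlists, concordances, keywords) can be approximated in a few lines of Python. The sketch below is illustrative only: the example text is assembled from script samples quoted later in Table 6, the reference text is invented, and the keyword score is a simple frequency ratio rather than the statistic a real keyword analysis would use.

# Rough, illustrative approximation (not the WordSmith Tools program) of the
# three operations described above: a frequency wordlist, a simple
# concordance, and keywords relative to a reference text.
import re
from collections import Counter

def tokens(text):
    return re.findall(r"[a-z']+", text.lower())

def wordlist(text):
    return Counter(tokens(text))                       # word -> frequency

def concordance(text, node, span=4):
    toks = tokens(text)
    return [" ".join(toks[max(0, i - span): i + span + 1])
            for i, tok in enumerate(toks) if tok == node]

def keywords(study_text, reference_text, top=5):
    study, ref = wordlist(study_text), wordlist(reference_text)
    n_s, n_r = sum(study.values()), sum(ref.values())
    # Simple relative-frequency ratio; real keyword analysis would use a
    # log-likelihood or chi-square comparison against the reference list.
    score = {w: (study[w] / n_s) / ((ref.get(w, 0) + 1) / n_r) for w in study}
    return sorted(score, key=score.get, reverse=True)[:top]

corpus = ("Just a year ago, it was easy: me and the guys on a Friday night, "
          "hitting rock clubs, drinking cold beer and absorbing loud live music. "
          "But live music has something more really.")
reference = "Recorded music is music you can listen to at home."  # invented

print(wordlist(corpus).most_common(5))
print(concordance(corpus, "music"))
print(keywords(corpus, reference))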


4. CSW project Phase 2 research design

The main research questions for Phase 2 of the Common Scale for Writing project were:

- What are the distinguishing features in the writing performance of ESOL learners or users across three Cambridge English examination levels, addressing a common task?
- How can these be incorporated into a scale of band descriptors?

The research design for an empirical study to answer these questions was as follows:

- A corpus of writing performances, all responding to the same communicative task, was obtained from candidates at three different Cambridge exam levels.
- Each script was graded by more than one experienced and trained rater, using a single assessment scale.
- All scripts were read and graded by the main researcher, who wrote comments on salient features of each.
- Sub-corpora of scripts, selected at four proficiency levels according to the band scores assigned by raters, were identified for closer analysis by the main researcher.
- Analyses of the writing samples in the sub-corpora were carried out to identify, then check through expert consultation, typical features of the writing samples for each level.
- The manual analyses of the sub-corpora were supplemented by computer analyses run by the second researcher to re-examine some of the characteristics identified as typical of each sub-corpus and to explore other features of potential interest.
- The characteristics and criteria identified were rationalised into a draft scale of band descriptions for the proficiency levels specified, this scale to be proposed as a draft common scale for writing.

5. CSW project Phase 2 research: Steps 1 and 2

To obtain scripts for the Phase 2 corpus, a communicative writing task was needed which was suitable for candidates for Cambridge exams at three levels: FCE, CAE and CPE. After consideration of the Principal Examiner's report on the December 1998 FCE exam session, which suggested it as relevant to candidates' real world and interests, and likely to engage test-takers in language use or the creation of discourse (see Bachman, 2002), a task prompt was selected. The task rubric was: Competition: Do you prefer listening to live music or recorded music? Write us an article giving your opinion. The best article will be published in the magazine and the writer will receive £500.


The task served its purpose well in the sense that none of the candidates in the study sample misunderstood the topic. A sample of 108 live test scripts was selected from the December 1998 administration of the exam. Two hundred further international candidates were identified, split between those preparing for the CAE and the CPE exams. Pilot test papers were sent to six Cambridge ESOL test centres, with the instruction that the test, including the writing task already completed by the FCE candidates, should be administered to the candidates for CAE and CPE within a 2-week period. One hundred and eighty pilot CAE and CPE scripts were returned, to be added to the 108 live FCE scripts to form the 288-script (53,000-word) corpus used in Phase 2 of the study.

Ten experienced Cambridge writing test examiners were invited to a marking day and oriented to the mark scheme to be used. The 180 pilot scripts (113 CAE candidates and 67 CPE candidates) were to be marked by examiners, all using the mark scheme already used for the same task in the live FCE exam. Once the 180 pilot CAE/CPE scripts had been marked, the 108 live FCE candidate scripts were re-marked by members of the same team of examiners.

To identify distinguishing features in writing performance on the common task across three proficiency levels, the main researcher then rated all 288 scripts in the corpus, using the same FCE mark scheme as the experienced raters. This rating was carried out without his prior knowledge of which scripts belonged to which level of exam candidates, or of the ratings already assigned by the Cambridge ESOL raters. The overall FCE mark scheme used by all raters included criteria such as task points and own output; range and control of structure and vocabulary; organisation and cohesion; appropriacy of presentation and register; and effectiveness of communication of message on target reader. The task-specific mark scheme included relevance, range of structure and vocabulary, and presentation and register, adjusted for length, spelling, handwriting and irrelevance. The band ratings assigned were on a scale from 0 to 5, with the option of three selections within a band (e.g., 4.3, 4.2 or 4.1).

Agreement between the ratings assigned by the script analyst and those of the experienced UCLES raters was then checked. Table 3 gives rater identifications, numbers of ratings, average ratings assigned (with the FCE bandings from 1 to 5 divided into 15 points, three for each band on the FCE assessment scale, thus 5.3 = 15, 5.2 = 14, 4.3 = 12 and so on) and inter-rater correlations (Pearson r); a short illustrative sketch of this conversion and agreement check follows Table 3. Table 3 shows reasonably high inter-rater correlations across the raters and between the trained raters and the analyst, given that, with correlations of ratings of the same task, Pearson r levels around .8 and above would be desirable (Hatch & Lazaraton, 1991, p. 441). It should also be noted, however, that the experienced raters each rated relatively few of the FCE papers (see Table 3 for the number of ratings assigned by each rater).


Table 3
FCE candidate scripts: ratings, standard deviations and inter-rater correlations between experienced raters and the script analyst

Rater A: 19 ratings; average score assigned 8.72; S.D. 2.24; correlation with analyst ratings (L) .84
Rater B: 12 ratings; average 9.50; S.D. 2.75; correlation .92
Rater H: 14 ratings; average 10.71; S.D. 2.53; correlation .76
Rater I: 32 ratings; average 10.90; S.D. 2.41; correlation .89
Rater K: 21 ratings; average 10.14; S.D. 2.90; correlation .90
Rater L (the analyst): 107 ratings; average 9.82; S.D. 2.68; correlation between analyst ratings and all other raters combined .84

Note: 5 ratings in the original data have no rater ID.
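As an illustration of the agreement check just described, the sketch below converts FCE band codes to the 15-point scale used above (5.3 = 15, 5.2 = 14, 4.3 = 12, and so on, for bands 1 to 5) and computes a Pearson correlation between two raters. The ratings themselves are invented and are not taken from the study data.

# Illustrative sketch of the inter-rater agreement check described above.
# Band codes are mapped onto the 15-point scale (5.3 = 15 ... 1.1 = 1),
# then Pearson r is computed. The example ratings are invented.
from statistics import correlation   # Python 3.10+

def band_to_points(band_code):
    band, step = (int(part) for part in band_code.split("."))
    return (band - 1) * 3 + step      # e.g. 4.3 -> 12, 5.2 -> 14

rater_a = ["5.2", "4.3", "3.1", "4.2", "2.3", "5.1"]
rater_b = ["5.3", "4.1", "3.2", "4.3", "2.1", "4.3"]

points_a = [band_to_points(code) for code in rater_a]
points_b = [band_to_points(code) for code in rater_b]

print(round(correlation(points_a, points_b), 2))   # inter-rater Pearson r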

The scripts provided by candidates at Cambridge ESOL centres who were preparing to take the CAE or the CPE exam were each assigned a band rating by 3 of 10 trained raters, each rater rating an average of 50 scripts, allocated differentially from the total of 180 CAE/CPE scripts. The averages and standard deviations across raters thus differ somewhat, since each rater was rating a different set of scripts, apart from the corpus analyst, who again rated all 180 of the scripts concerned. Table 4 gives the numbers of scripts rated by the raters, standard deviations for raters rating 50 or more scripts, and inter-rater correlations between each of the raters and the analyst.

It will be noticed that the correlations in Table 4 are, on the whole, lower than those for the FCE candidate ratings in Table 3. This may be because this study covers groups of candidates from three different exam levels but, because of the nature of its research questions, each candidate responded to a common task and was rated according to the same (FCE) assessment scale. It is possible that raters found the FCE scale less appropriate for the CAE and CPE than for the FCE candidates. After all, the current FCE and CPE (though not the CAE) Handbooks state that the FCE and CPE mark schemes should be interpreted at FCE and CPE levels respectively, which might suggest that the schemes are potentially level-specific. The inter-rater correlations were, however, considered high enough for the purpose for which they were intended in the context of the study. That was to facilitate the division of the corpus into four different levels of target language proficiency, each of which could be analysed according to characteristics that might be said to be representative of the level concerned.
Table 4
CAE/CPE candidate scripts: ratings, standard deviations and inter-rater correlations with the script analyst

Rater A: 50 ratings; average score assigned 10.36; S.D. 2.84; correlation with analyst ratings (L) .75
Rater B: 69 ratings; average 12.14; S.D. 2.28; correlation .72
Rater D: 71 ratings; average 11.38; S.D. 2.80; correlation .68
Rater E: 48 ratings; average 9.46; S.D. 2.11; correlation .75
Rater F: 62 ratings; average 10.19; S.D. 1.93; correlation .75
Rater G: 70 ratings; average 9.84; S.D. 2.35; correlation .70
Rater H: 51 ratings; average 10.25; S.D. 2.69; correlation .62
Rater K: 72 ratings; average 10.04; S.D. 2.99; correlation .74
Rater L (the analyst): 180 ratings; average 10.35; S.D. 2.56; correlation 1.00



6. CSW project Phase 2 research: Step 3

As well as rating all 288 scripts, the corpus analyst had written brief intuitive comments on the distinguishing performance features of each (see Table 5). The four most common features noted in the analyst's first round of qualitative analysis were impact, fluency, organisation and accuracy.

Impact on the reader is a feature of writing already noted (usually called effect on the reader) in existing Cambridge ESOL rating scales (see Table 2) and inherent perhaps in the interactive nature of the writing construct as espoused by the CEF and others above. Thirty comments using the term impact itself, or reflected in adjectives suggesting the making of impact on the reader (e.g., lively, powerful), were made, all but one applied to scripts rated at Bands 5 and 4. From this a provisional inference could be made that the ability to make an impact through writing presupposes a fairly high level of target language proficiency.
Table 5
Summary of features and occurrences in the first analysis of 288 scripts
(For each feature: descriptors (positive/negative); number of mentions; mentions with Bands 5 and 4 scripts (n = 161); mentions with Bands 3 and 2 scripts (n = 127))

Impact (30 mentions)
- impact, lively, powerful [+impact]: 30 mentions; Bands 5 and 4: 29; Bands 3 and 2: 1
- negative descriptors: NA

Fluency (83 mentions)
- fluent [+fluent]: 40 mentions; Bands 5 and 4: 37; Bands 3 and 2: 3
- awkward, strained, stiff, stilted, disjointed [-fluent]: 43 mentions; Bands 5 and 4: 19; Bands 3 and 2: 24

Organisation (120 mentions)
- well-organised [+organised]: 19 mentions; Bands 5 and 4: 16; Bands 3 and 2: 3
- disorganised, muddled, confused [-organised]: 101 mentions; Bands 5 and 4: 40; Bands 3 and 2: 61

Accuracy (of vocabulary, grammatical structure, etc.) (213 mentions)
- accuracy, accurate [+accurate]: 52 mentions; Bands 5 and 4: 51; Bands 3 and 2: 1
- inaccuracy, inaccurate [+inaccurate]: 161 mentions; Bands 5 and 4: 51; Bands 3 and 2: 110


The impact criterion, though clearly somewhat subjective in nature, was deemed worthy of further investigation in terms of the communicative features which might be associated with it. Impact already seemed a rather broad concept, however, possibly overlapping with fluency, a term which tended to be applied here to scripts that were also seen as making an impact on the reader.

Table 5 shows that fluency (see Section 2.4), referring to ease of use of the target language, was mentioned 83 times in the initial analysis. As a positive feature of writing performance, fluency received very significantly more mentions (37 to 3) with scripts awarded Bands 5 and 4 than with those awarded Bands 3 and 2. Descriptors characterising a lack of fluency, however, were mentioned only somewhat more frequently with reference to Bands 3 and 2 than with the Bands 4 and 5 scripts.

Organisation and cohesion (see also Section 2.4), covering the overall structure and coherence of the writing and its use of links, or cohesion devices, was mentioned 120 times, around 80% of these mentions being negative. Whereas nearly all positive references to organisation referred to Bands 4 and 5 scripts, the negative references were more evenly shared by the higher and lower proficiency scripts, at 40% and 60%, respectively.

Accuracy (of vocabulary, grammatical structure, etc.; see Section 2.4 above, and Section 7.2 below for a detailed analysis) was referred to more than any other feature in the initial comments. This prominence may have been because the chosen writing task invited a newspaper article, a discourse mode particularly liable to lose impact through errors of linguistic accuracy. But it may also be that, even in a language teaching and testing world where communicative approaches hold sway, with emphasis on message rather than form, accuracy plays a key part in the impact of communication on interlocutors. Of the 213 mentions of accuracy, 52 were positive, all but one of these referring to Bands 5 and 4 scripts. Of the more than three times as many references to inaccuracy, more than two-thirds applied to the scripts assigned Bands 2 and 3.

Despite the subjective nature of the analysis, some circularity, and the fact that it was performed by a single individual, the findings were considered by the monitoring committee to be significant enough to inform the next research steps.
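The band-related contrasts reported above can be checked with a simple 2 x 2 test. The sketch below is illustrative rather than part of the original analysis: it treats each positive fluency mention as coming from a different script and applies Fisher's exact test to the 37-to-3 split across the 161 Band 5/4 scripts and the 127 Band 3/2 scripts.

# Illustrative check (not part of the original analysis) of the 37-to-3
# contrast in positive fluency mentions, using Fisher's exact test on a
# 2 x 2 table of scripts with and without such a mention.
from scipy.stats import fisher_exact

table = [[37, 161 - 37],   # Bands 5 and 4: with / without a positive mention
         [ 3, 127 -  3]]   # Bands 3 and 2: with / without a positive mention

odds_ratio, p_value = fisher_exact(table)
print(f"odds ratio = {odds_ratio:.1f}, p = {p_value:.2g}")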

7. CSW project Phase 2 research: Steps 4 and 5

All the ratings of the corpus of 288 scripts, as assigned using the FCE assessment scale, were now used to select four sub-corpora according to proficiency level but still regardless of examination candidacy (CPE, CAE or FCE). The first sub-corpus (n = 29 scripts) consisted of all scripts banded only at 5 on the FCE mark scheme (including scores of 5.3, 5.2 and 5.1) by all raters; the second (n = 18) of scripts banded at 4 by all raters; the third (n = 43) of those banded at 3 by all raters; and the fourth (n = 8 only) of scripts all banded at 2.


Each of the scripts in the four sub-corpora (98 scripts, totalling around 18,000 words) was then submitted to detailed re-examination by the main researcher, involving: a re-reading; a characterisation according to features that made a favourable or less favourable impact; a count and classification of errors; and a selection of excerpts (see Table 6) considered typical of certain of the communicative characteristics of the level concerned, these excerpts to be checked with an expert panel. The emergence from the first reading of the whole corpus of impact on the reader as a key characteristic of some of the higher-scored scripts would now be re-visited, alongside the other criteria that featured in the initial subjective analysis: fluency, organisation, and lexico-grammatical (in)accuracy. The outcome of this re-visit was a more rationalised set of features that could form the basis of a draft scale for writing: sophistication of language, accuracy, and organisation and cohesion.

7.1. Sophistication of language

It may be noted that certain of the criteria specified by Capel in her Phase 1 draft common scale for writing (see Fig. 2) are not among the emerging Phase 2 criteria, for example: operational command of written language; register and appropriacy; range of structures and vocabulary. Also missing from the developing list were Hamp-Lyons' suggested criteria (see Section 3.1): communicative effectiveness, lexical appropriacy, register control and personal stance and perspective. The absence of such criteria from the emerging labels was not, however, because the criteria concerned were not found relevant to a potential common scale for writing. Their absence as explicit, independent categories was caused by decisions on which criteria to specify separately, and which to subsume in others, given the issues of criterion definition and independence and of over-detailed or overlapping descriptions (see Alderson, 1990; CEF, 2001).

A portmanteau criterion of the nature of sophistication of language appeared to be emerging. As Table 2 indicates, a similar criterion, sophisticated use of an extensive range of vocabulary, collocation and expression, entirely appropriate to the task set, already appears in the general mark scheme for the CPE exam. Perhaps this criterion could be taken to subsume features such as fluency, appropriacy and register, for example, or Capel's useful Phase 1 operational command of written language, criteria notoriously difficult to distinguish from range of structures and vocabulary. It is difficult to achieve a balance between a criterion that is narrow, and thus risks too discrete a rating of a communicative task, and one that is too broad, thus risking overlap with other criteria. But sophistication of language seemed to cover language use making an impact on the reader, beyond accuracy and organisation, and to be a criterion with interesting differentiation potential across proficiency levels.

The corpus analyst's own comments and examples from scripts seen as making extra impact may clarify the concept of sophistication of language. Table 6 gives examples from the Bands 5, 4 and 3 sub-corpora, along with the script analyst's comments at the time.


Table 6
Descriptions and samples of sophistication of language

Band 5 scripts
Corpus analyst comment: exceptionally fluent, stylish (meaning that the writer is able to adopt a particular style to increase impact), thus impactful
Script sample: "Just a year ago, it was easy: me and the guys on a Friday night (in fact, every night was Friday night back then), hitting rock clubs, drinking cold beer and absorbing loud live music. But it's different these days . . ."

Corpus analyst comment: fluent, brisk, thus makes impact
Script sample: "But live music has something more really. Maybe it is because of that fusion between the artist and you, the crowd effect that makes you feel things a different way, and enjoy music you would have disliked on the radio for instance."

Corpus analyst comment: makes powerful statements
Script sample: "Listening to live music can be an extraordinary experience or a total fiasco. . . . Actually I can't think of anything better than the excitement of listening to a new CD you have just acquired. Every song is something new and unknown. It's like discovering a whole new world, full of new possibilities."

Corpus analyst comment: very atmospheric, personal and fluent, with native-speaker-like idiom and style
Script sample: ". . . The voice of the fado singer. In it were all the sorrow and longing of a broken heart, in her face the traces of a life lived to the limit. Not a word did I understand but the music moved me as no music had ever done before."

Band 4 scripts
Corpus analyst comment: effective use of rhetorical questions
Script sample: "Have you ever stood on a stage, singing to hundreds of people? Have you ever looked at their faces and seen the pleasure you're giving them? If you have, then you know what live music is."

Corpus analyst comment: fluent, evocative . . .
Script sample: "Close your eyes and feel the sounds of the opera with your ears and your heart; you will not regret having done it. Live music is totally different than recorded music: it's warm, it's real, you can't cheat. . . You have to be yourself in front of the audience."

Band 3 scripts
Corpus analyst comment: fluent, some style
Script sample: "you cannot describe the thrill you feel listening to your favourite group live."

Corpus analyst comment: quite hip
Script sample: "A live concert is really exciting. You can feel a really big rush, the show can finish, but that continues for months and months."


Table 6 gives examples from the Bands 5, 4 and 3 sub-corpora, along with the script analyst's comments at the time. The corpus analyst commented on the following script features for their impact on the reader through sophistication of language: (adopting a) style; use of idiom and colloquial language; use of rhetoric (words used to influence and persuade); rich, lively vocabulary and collocation; humour and irony; using personal experience to enhance an argument and/or strengthen the writer:reader relationship; and variation of sentence and paragraph length.

These features suggest that the sophistication of language criterion may indeed subsume writing assessment criteria such as those referred to by Capel and Hamp-Lyons in Phase 1, for example: fluency, operational command of written language, appropriacy, range of structures and vocabulary, register control, and personal stance and perspective. The sophistication of language criterion is probably similar in its intention and scope to Hamp-Lyons' communicative effectiveness, and indicates features of communication at the upper ranges of proficiency, their absence or partial presence, however, possibly reducing communicative effectiveness at lower levels.

Since these criteria emerged from the analyses of an individual corpus analyst, they were next checked against the views of expert informants. Eight Cambridge ESOL exam chairs, chief examiners and subject officers were thus requested to comment on excerpts taken, neither labelled nor characterised, from some of the highest-graded scripts. The hypothesis was that the excerpts would be seen to display some of the characteristics proposed by the corpus analyst as exemplifying sophistication of language (Table 7).

On the whole, the features identified by the expert participants in this small-scale study fit with those suggested by the corpus analyst, namely: adopting a style, use of idiom and colloquial language, use of rhetoric, rich, lively vocabulary and collocation, and the use of personal experience to enhance an argument and/or strengthen the writer:reader relationship. Use of humour and irony, and variation of sentence and paragraph length are not specifically noted by the group, the latter probably because the participants were dealing with short excerpts rather than complete responses to the writing task.

Further corroboration of the sophistication of language concept is found in the Birmingham University study (see Section 3.2). Kennedy et al. (2001) identify the following characteristics in their IELTS Band 8 (very good user) scripts:
reader awareness, including sophisticated linguistic devices to modify and qualify; more interactive, conversational register; more interpersonal, topical, textual themes; more idiomatic language; more evidence of the range of vocabulary needed to describe wider experience.


Table 7
Expert references to features of a possible sophistication of language criterion in high Band 5 responses to the writing task

Sophisticated language features identified by the Expert Group (number of mentions given for each category):

Explicit reference to sophistication (4 mentions): sophisticated approach to the task, attempt at complex, sophisticated language, fairly sophisticated ideas, sophistication of ideas

Reference to control of language and complexity (15 mentions): controlled use of language (4), skilful control of language, good control of structure, complex language (3), complex structure (2), complex sentences, compound sentence structure, resourceful (2)

Reference to impact on the reader, tone, feeling (29 mentions): positive impression on the reader (2), giving the reader something to think about (2), achieving effect, addresses reader directly, communicates directly with the reader, enables reader to imagine the situation, enthusiastic tone, lively, persuasive, engaging, spontaneous, refreshing, individual, intimate, evocative (3), emotive (4), sensations, atmospheric, ambitious (3), downbeat

Reference to naturalness of language, native-speaker competence (20 mentions): natural flow (3), natural use of words/language (2), very natural, colloquial (2), ease and familiarity, feels real, could be a native speaker, could have been written by a native speaker, almost native like, unnatural (4), awkwardness (3)

Reference to register, genre, rhetoric (30 mentions): good attempt at genre (3), uniform register, register appropriate, suitable style, appropriate to the question, inappropriate, stylistically assured, stylised word order (2), use of appropriate buzz word, use of contrast (3), contrast of emotion with analysis, juxtaposition of the poetic and prosaic, balance, rhetorical devices, figurative use of words, imagery, good phrases, personification, repetition, exaggeration, poetic (2), literary, catchy, cool

Reference to vocabulary (13 mentions): range of vocabulary (4), good use of vocabulary (4), excellent vocabulary, (in)appropriate/suitable vocabulary (3), competent vocabulary


There are significantly fewer examples of such manifestations of sophisticated language at each lower level of proficiency among the 98 scripts. At Band 3, the analyst's comments frequently suggest writing proficiency between a level where a user has only one way of communicating a particular meaning, and a level where (s)he has the competence and confidence to branch out. This recalls the CEF Vantage (B2) level, described as beyond Threshold (B1) level in its refinement of functional and notional categories, with a consequent growth in the available inventory of exponents to convey degrees of emotion and highlight the personal significance of events and experiences (2001, p. 76).


The sophistication of language criterion also resembles quite closely these criteria from the General Mark Scheme of the C2 level CPE exam referred to above and used for the assessment of writing across a range of tasks: sophisticated use of extensive range of vocabulary, collocation and expression, entirely appropriate to the task set; effective use of stylistic devices; register and format wholly appropriate; impressive use of a wide range of structures.

The features shared by the emerging sophistication of language criterion and the other assessment criteria it may subsume begin to suggest some generalisability for the criterion. There may also be tentative implications that the four levels represented by the four sub-corpora of scripts tend to relate to the levels of the CPE, CAE, FCE and PET exams, themselves linked, according to Cambridge ESOL exam information, to CEF Levels C2, C1, B2 and B1, respectively. The inferences are still intuitive at this stage, however.

7.2. Accuracy

The criterion of accuracy was noted as apparently significant across the four sub-corpora in CSW Research Step 3. In Step 4, accuracy errors in the sub-corpora scripts were counted by the corpus analyst, and their frequency calculated in relation to average length of response (see Table 8). The key inference from this table is that response length and error frequency differ across the four sub-corpora. Band 2 candidates appear to write noticeably less than Band 5 candidates. They also make many more errors, as counted (manually, as the WordSmith programme cannot yet fully identify linguistic errors) according to the lexico-grammatical error categories specified by the main researcher, namely: word choice, word order, word form (apart from verb forms), verb forms, prepositions/adverbials, number/quantity, articles, deixis, spelling and punctuation. Table 8 also indicates that the Bands 4 and 3 candidates fit between the two extremes in terms both of task response length and error frequency. Although the error analysis is subjective, it is likely that the counts of errors per script are accurate.
Table 8
Comparisons of task response length and error ranges, averages and frequencies across Bands 5, 4, 3 and 2 sub-corpora

Band   Average length of task response   Range of number of errors   Average errors   Error frequency (per no. of words)
5      210                               1–11                        5                1 per 42
4      194                               6–18                        11               1 per 17.5
3      188                               11–31                       17               1 per 11.1
2      154                               8–39                        24               1 per 6.4
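The error-frequency column in Table 8 is simply the ratio of words written to errors counted, pooled across a sub-corpus. A minimal sketch of that arithmetic is given below; the per-script figures in the example are invented placeholders, not the study's data, and the function name is ours, not part of any published procedure.

```python
# Sketch of the Table 8 arithmetic for one sub-corpus: given per-script word
# counts and manually assigned error counts, derive average response length,
# the error range, mean errors per script and an overall "1 per N words" rate.

def summarise_sub_corpus(scripts):
    """scripts: list of (word_count, error_count) pairs for one band."""
    words = [w for w, _ in scripts]
    errors = [e for _, e in scripts]
    total_errors = sum(errors)
    words_per_error = sum(words) / total_errors if total_errors else float("inf")
    return {
        "average_length": round(sum(words) / len(scripts)),
        "error_range": (min(errors), max(errors)),
        "mean_errors": round(total_errors / len(scripts), 1),
        "error_frequency": f"1 per {words_per_error:.1f} words",
    }

if __name__ == "__main__":
    # Hypothetical scripts from one band, purely for illustration.
    band_example = [(190, 9), (205, 12), (187, 13)]
    print(summarise_sub_corpus(band_example))
```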


More doubtful will be some of the counts across individual error categories, given the occasional difficulty of distinguishing between certain error types, for example, verb form and number (as, for example, in *Another person prefer recorded music). Earlier inferences on the impact of accuracy (see Section 6) are corroborated rather than refuted by the error analyses and counts. There are indications, inviting further investigation, that: very frequent language inaccuracies may significantly reduce the impact of a text, lessening the reader's confidence in the writer's ability to convey the intended meaning, sometimes even obscuring that meaning; the negative impact of certain categories of lexico-grammatical error may be greater than that made by others; and learners with weaker target language competence are more likely to make certain types of accuracy error.

Table 9 below suggests that, whereas the choice of the correct word is a problem at all proficiency levels in the corpus, it is four times more frequently evidenced by Band 2 candidates than Band 5. Errors with verb forms are significantly less of a problem for Bands 4 and 5 candidates, both in terms of frequency of error and in terms of rank order of error type. Analyst reaction that such errors are surprising at Band 5 appears to support a hypothesis that accuracy errors have a negative impact on the reader, some errors more than others.

7.3. Organisation and cohesion

Assessment criteria used across language exams (see Section 2.4 above) suggest that the organisation of a piece of writing affects its impact on the reader. Hence the identification on band scales (see above) of criteria such as organisation and linking devices [FCE], organisation and cohesion [CAE], organisation and coherence [CPE], organisation: clarity, intent [CELS], and coherence and cohesion [IELTS]. In the CSW study, organisation is referred to explicitly in only 7 of the 18 Band 4 sub-corpus rater analyses (five times positively and twice negatively). This suggests a generally satisfactory handling of the organisation of ideas at this level, with little of the sometimes forced and over-explicit linking mentioned in the analyses of the Band 3 scripts. Similarly with the four negative-only references to organisation and links in the eight Band 2 scripts, where the message of the text may have already been lost at the lexico-grammatical level.

The corpus analyst's identification of errors with links at the cohesion level (i.e., within or between sentences) in the four sub-corpora indicated that they are relatively common at Band 5 compared with the accuracy errors analysed in Table 8, although still at an average of only 1.34 per script. This compares interestingly with the Bands 4, 3 and 2 sub-corpora, where links would be only seventh in the accuracy rank order, averaging 1.17, 1.19 and 1.5 per script, respectively.

Table 9
Error types, occurrences and rank orders across bands (for each band: total / mean per script / rank; final column gives the mean rank)

Error type              Band 5 (n = 29)     Band 4 (n = 18)     Band 3 (n = 43)     Band 2 (n = 8)      Mean rank
Word choice             28 / 1 / 1          33 / 1.8 / 1        153 / 3.6 / 1       32 / 4.0 / 1        2
Verb form               14 / 0.5 / 4=       25 / 1.4 / 4        125 / 2.91 / 2      28 / 3.5 / 2        3
Preposition/adverbial   22 / 0.76 / 2       29 / 1.61 / 2       110 / 2.56 / 3      22 / 2.75 / 3       3.5
Number                  8 / 0.28 / 6        17 / 0.94 / 5=      88 / 2.05 / 4       19 / 2.38 / 4=      5.8
Spelling                17 / 0.59 / 3       17 / 0.94 / 5=      69 / 1.6 / 5        19 / 2.38 / 4=      4.3
Article                 14 / 0.5 / 4=       27 / 1.5 / 3        50 / 1.16 / 6       14 / 1.75 / 6       4.8
Deixis                  1 / 0.03 / 9=       5 / 0.28 / 9        22 / 0.51 / 7       8 / 1.0 / 7         9.5
Punctuation             5 / 0.18 / 8        11 / 0.37 / 7       9 / 0.21 / 9        6 / 0.75 / 8        8.3
Word order              7 / 0.24 / 7        7 / 0.61 / 8        12 / 0.28 / 8       4 / 0.5 / 9         9.8
Word form               1 / 0.03 / 9=       3 / 0.17 / 10       7 / 0.16 / 10       2 / 0.25 / 10       9.8


It may be that the Band 5 candidates experiment more with their intra- and inter-sentential links (the term over-ambitious is used to describe parts of seven of the 29 scripts in this sub-corpus). But the Band 5 script analysis also makes 12 positive comments on the macro-organisation (or coherence) of the scripts, compared with only three negatives, yet four positive to six negative comments at the micro- or cohesion level. One of the problems emerging on cohesion (borne out by the Kennedy et al. study) is the tendency for candidates to learn a set of link words or phrases (firstly, therefore, furthermore, etc.) and force them into their writing, sometimes incorrectly or inappropriately, risking a negative impression on the reader. A portmanteau criterion such as organisation and cohesion seems to warrant inclusion in a common scale band set. To separate organisation from links, including the latter among errors of accuracy, seems counter-intuitive.

8. CSW project Phase 2 research – Step 6

At this stage of the study, a draft working scale for writing was emerging using three criteria: sophistication of language, accuracy, and organisation and cohesion. Since the scale was being derived from a close study of written task responses from candidates assessed on multiple ratings as at four different levels of proficiency, the draft scale has the potential width of a common scale (see Fig. 1) although, of course, the responses were all to the same task.

The computerised re-analysis of certain aspects of the sub-corpora was now initiated by the co-researcher (also the co-writer of this article). This re-analysis sought evidence that might support or undermine the script analyst's findings in areas where computer corpus analysis applied, and aimed to add to the coverage of the study where the computer analysis could do things that the manual analysis could not. The methodology was informed by that of Kennedy, Dudley-Evans and Thorp (2001), summarised above.

The 98 scripts in the four sub-corpora were keyed in to Microsoft Word to provide a text file for each candidate's script. The original layout of the script was kept in the typed-up version, although corrections and crossings-out were not included. Because these text files were to be analysed in various ways using WordSmith Tools software, each was re-saved as a corrected version in which spelling errors, where possible, were corrected. This was done to enable WordSmith to produce accurate wordlists based on individual or sets of text files without incorrectly spelt words misrepresenting the vocabulary range of a candidate.

It was envisaged that the WordSmith Tools software would be used to investigate: whole script, sentence and paragraph lengths; title use; vocabulary range; words in concordances and collocations; and errors.

Table 10
Sub-corpora mean sentence and paragraph lengths

Band   Mean sentence length (words)   Mean paragraph length (words)
5      19.5                           61.4
4      19.1                           48.9
3      17.3                           88.6
2      13.3                           33.1
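The means in Table 10 were produced with WordSmith Tools. The sketch below shows one way comparable per-script figures could be derived from plain-text files, assuming one UTF-8 file per candidate script with blank lines between paragraphs; the tokenisation and sentence-splitting rules are deliberately crude assumptions and will not match WordSmith's exactly.

```python
import re
from pathlib import Path

# Rough mean sentence and paragraph lengths (in words) for a band sub-corpus.

def mean_lengths(text):
    """Return (mean sentence length, mean paragraph length) for one script."""
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    mean_sentence = len(words) / len(sentences) if sentences else 0.0
    mean_paragraph = len(words) / len(paragraphs) if paragraphs else 0.0
    return mean_sentence, mean_paragraph

def sub_corpus_means(folder):
    """Average the per-script means across all .txt files in a band folder."""
    per_script = [mean_lengths(p.read_text(encoding="utf-8"))
                  for p in Path(folder).glob("*.txt")]
    if not per_script:
        return 0.0, 0.0
    sent = sum(s for s, _ in per_script) / len(per_script)
    para = sum(p for _, p in per_script) / len(per_script)
    return round(sent, 1), round(para, 1)

if __name__ == "__main__":
    # "band5_scripts" is a hypothetical folder name, one text file per script.
    print(sub_corpus_means("band5_scripts"))
```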

The first analysis corroborated evidence from the manual analysis of a correlation between response length and proficiency level (see Table 8). It then revealed that, while sentences tended to be longer (though not significantly so) in the Bands 5 and 4 sub-corpora, a hypothesis of paragraph length as a distinguishing feature of proficiency level was, contrary to indications in the Birmingham study, not corroborated (see Table 10). Further manual analysis is suggested to check variations of sentence length in sequence, an element of sophisticated language apparently used for effect by some of the CSW Band 5 writers. The WordSmith software cannot yet handle such an analysis.

The use or not of a title in the task response was found to vary according to level (see Table 11), CSW Band 5 candidates being more likely, at 48%, to use a title than Bands 3, 4 or 2 candidates. Given that the task discourse type, a newspaper article, invites the inclusion of a title, its presence or absence is relevant to task fulfilment. It may also be that titling should be considered as an aspect of the organisation criterion (see above).

The computer analysis of the range of vocabulary of each sub-corpus supports inferences from the manual analysis. Table 12 illustrates the vocabulary range in each sub-corpus in terms of the normal and standardised type:token ratios, and the average number of different types found at each level. We would anticipate that higher-level scripts would display a greater range of vocabulary, i.e., a higher type:token ratio, than lower-level scripts.

Table 11
Use of a title in candidate scripts

Band   Title   %    No title   %
5      14      48   15         52
4      4       22   14         78
3      14      33   29         67
2      1       13   7          87


Table 12
Normal and standardised type:token ratios

Band   Tokens   Types   Type:token ratio   Standardised type:token ratio   No. of scripts   Av. types per script
5      6130     1191    19.43              38.97                           29               41
4      3112     619     19.89              32.77                           18               34
3      7999     1116    13.95              32.77                           43               26
2      1206     342     28.36              30.10                           8                43

The type:token ratio column expresses the number of different words in each sub-corpus as a percentage of the total number of words in that sub-corpus. The normal type:token ratios are indeed higher at Bands 4 and 5, indicating that more lexical items are used at these proficiency levels. The normal type:token ratio for the lower-proficiency Band 2 scripts, however, appears high, this anomaly probably accounted for by the small number of Band 2 scripts in this sub-corpus (8). The standardised type:token ratio measure, computed every n words rather than once for the whole text (the default is every 1000 words), permits a comparison across texts of different lengths and is thus a more appropriate measure for analysing the CSW data. This measure confirms an increased vocabulary range as proficiency levels increase, from 30 new words in every 100 at Band 2 level, 33 new words at Bands 3 and 4 and 39 words at Band 5 level. Range of vocabulary is thus possibly a feature distinguishing proficiency levels.

Figures on words occurring only once were obtained by comparing the frequency wordlists produced by WordSmith Tools for each level, as a possible measure of the use of less frequent vocabulary items (a brief computational sketch of these vocabulary measures follows Table 14 below). Table 13 suggests that, while candidates at all levels produce some less usual vocabulary items, candidates at higher levels may produce more one-off words. While this could be an indication of the command of a richer vocabulary, an aspect of our criterion of sophistication of language, the percentage differences in Table 13 are not significant, with the small Band 2 sample appearing once again to contradict the trend.

The lengths of words in each sub-corpus were calculated using the Statistics table in the Wordlists application in WordSmith Tools to ascertain whether word length correlated with proficiency level. The percentage of different word lengths found in each sub-corpus was, however, almost identical (Table 14).
Table 13
Single-occurrence words

Band   No.   % of types
5      655   55
4      317   51
3      575   52
2      198   58

R. Hawkey, F. Barker / Assessing Writing 9 (2004) 122159 Table 14 Percentages of words with 111 letters Word lengths (in letters) 1 Band 5 Cumulative % Band 4 Cumulative % Band 3 Cumulative % Band 2 Cumulative % 5 5 5 4 2 19 24 19 24 19 24 22 26 3 20 44 20 44 19 43 18 44 4 18 62 21 65 20 63 18 62 5 13 75 13 78 13 76 14 76 6 7 82 7 85 7 83 9 85 7 7 89 5 89 6 89 3 88 8 4 93 4 94 4 93 5 93 9 4 97 3 97 3 96 3 96 10 2 99 1 98 1 97 2 98

153

11 1 100 0 0 1 99
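To make the vocabulary measures in Tables 12 and 13 concrete, the sketch below computes a plain type:token ratio, a standardised ratio averaged over successive 1,000-word windows (the WordSmith default mentioned above), and a count of single-occurrence words. It is a simplified approximation for illustration only, not WordSmith's implementation; the tokenisation rule is an assumption.

```python
import re
from collections import Counter

def tokens(text):
    # Crude word tokenisation (lower-cased alphabetic strings); an assumption.
    return re.findall(r"[a-z']+", text.lower())

def type_token_ratio(words):
    """Distinct words as a percentage of running words."""
    return 100 * len(set(words)) / len(words) if words else 0.0

def standardised_ttr(words, window=1000):
    """Mean type:token ratio over successive full windows of `window` tokens."""
    chunks = [words[i:i + window] for i in range(0, len(words), window)]
    full = [c for c in chunks if len(c) == window]
    if not full:                         # text shorter than one window:
        return type_token_ratio(words)   # fall back to the plain ratio
    return sum(type_token_ratio(c) for c in full) / len(full)

def single_occurrence_words(words):
    """Words appearing exactly once in the word list (cf. Table 13)."""
    return [w for w, n in Counter(words).items() if n == 1]

if __name__ == "__main__":
    sample = "live music is real, recorded music is not live"  # toy text only
    ws = tokens(sample)
    print(round(type_token_ratio(ws), 2), len(single_occurrence_words(ws)))
```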

The finding that difference in word lengths across levels is not significant somewhat contradicts the Birmingham study, which notes significantly more words longer than 12 letters in its IELTS Band 8 (i.e., very good user) scripts. These differences may well relate to the difference between the FCE exam prompt used in this study and typical IELTS test tasks. The single task to which all the candidates in our study responded was, it will be recalled, on a relatively straightforward, everyday-view journalistic topic. The writing corpora used by Kennedy et al., however, will have been responses to more academic questions, requiring a more formal language register and inviting, perhaps, the use of some longer words.

In the Kennedy et al. study, the collocates of I and it were identified as possible indications of the organisation of task response, and of impact on the reader. A new analysis procedure was thus trialled on the CSW sub-corpora, using concordances and collocational information to investigate words occurring one place to the right of I (its collocates). Results showed a range of collocates, most of which, unsurprisingly, are verb forms. Some verbs occur at all levels (e.g., prefer and think) whilst others occur at some levels only. The data here appear inconclusive, however, apart, perhaps, from an apparent stronger inclination for candidates writing at higher levels to use the first person I in their responses. This could be an aspect of the use of personal experience to enhance a general argument and/or strengthen the writer:reader relationship, seen as part of sophistication of language in Section 7.1. Collocational analysis certainly has possibilities for future analyses of writing corpora.
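The first-right (R1) collocate analysis described above can be approximated very simply: for each occurrence of I, record the next running word and tally the results by sub-corpus. The sketch below does just that; the folder layout and tokenisation are assumptions, and dedicated concordancing software offers far more refined span and frequency settings.

```python
import re
from collections import Counter
from pathlib import Path

def r1_collocates(text, node="i"):
    """Count the word occurring immediately to the right of each `node`."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(nxt for w, nxt in zip(words, words[1:]) if w == node)

def band_collocates(folder, node="i", top=15):
    """Aggregate R1 collocates of `node` across all scripts in a band folder."""
    counts = Counter()
    for path in Path(folder).glob("*.txt"):   # hypothetical folder layout
        counts += r1_collocates(path.read_text(encoding="utf-8"), node)
    return counts.most_common(top)

if __name__ == "__main__":
    print(r1_collocates("I prefer live music but I think recorded music is easier"))
```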


The final analysis using WordSmith aimed to investigate errors at each level, based on the wordlists produced from the original text files (note that the corrected text files were used for all other analyses). These wordlists were saved in an Excel spreadsheet and the Excel spell-checker was used to identify incorrectly spelt items in these lists. This analysis was intended as a check on the manual error analysis (see Section 7 above). But lexico-grammatical errors could not be easily analysed using the WordSmith Tools, so were not re-checked at this stage in the study. In fact, WordSmith Tools software can be used to identify some lexico-grammatical errors, but only if the texts concerned have been pre-coded for parts of speech.

The production of concordances for specific words or phrases in the corpus was also trialled during the computerised analysis (see the sketch below). This remains an avenue for future research when more detailed analysis of sophisticated language, accuracy or organisation is called for on new corpora of examination scripts.

The computerised corpus analyses of the CSW scripts, which facilitated both cross-checking and original analyses, proved revealing and helpful. On matters where the manual analysis could be replicated by the computer analysis, most of the script analyst's findings were corroborated. Additional features were also indicated as significant by the computer corpus analysis, namely titling, greater word length and vocabulary range across levels.
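As an indication of what such a concordance involves, the following sketch prints simple keyword-in-context (KWIC) lines for a search word within a fixed character window. It assumes plain-text input and is illustrative only, not a substitute for WordSmith's Concord tool.

```python
import re

def kwic(text, keyword, width=30):
    """Yield keyword-in-context lines for each match of `keyword` in `text`."""
    flat = re.sub(r"\s+", " ", text)  # collapse line breaks for display
    for m in re.finditer(r"\b%s\b" % re.escape(keyword), flat, re.IGNORECASE):
        left = flat[max(0, m.start() - width):m.start()]
        right = flat[m.end():m.end() + width]
        yield f"{left:>{width}} [{m.group(0)}] {right}"

if __name__ == "__main__":
    sample = ("Listening to live music can be an extraordinary experience. "
              "Live music is totally different from recorded music.")
    for line in kwic(sample, "live"):
        print(line)
```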

9. CSW project Phase 2 research – Step 7

9.1. The draft common scale for writing

Three criteria, sophistication of language, organisation and cohesion, and accuracy, have thus been identified by multiple ratings of sub-corpora of scripts for four levels. Qualitative analyses of these scripts have identified and exemplified features consistently characterising their proficiency levels.
Words occurring one place to the right of I (first-right collocates), by level:

Level 5: prefer, have, listen, like, think, always, don't, want, enjoy, had, would, can, can't, find, never, went
Level 4: am, prefer, think, like
Level 3: listen, think, like, prefer, can, have, was, don't, feel, would, enjoy, love, want, had, buy, choose, really
Level 2: listen, prefer, think, would, have


Some of these features have been checked with expert informants, and through computer corpus analyses, which had also directed attention to other features, some already suggested by the manual analyses, some new. A scale of level descriptions for four bands of proficiency in writing could now be drafted using insights from the study and from previous related scales. The draft scale would initially, since it had been derived through assessor-oriented use, be worded in negative as well as positive can-do terms (see the CEF distinction between positive and negative formulation, CEF, 2001, p. 205). The draft scale attempts to avoid the complexity and length against which Alderson (1990), Porter (1991) and North (2000) warn (see Section 2.2). The problem of distinguishing one band from the next only by the use of distinctions between always, usually, sometimes and occasionally (also see Alderson, 1990 in Section 2.2) is not entirely solved, but such descriptors are as far as possible combined with other distinguishing criterial features. The refining of the descriptors and broadening of their generalisability is being attempted through their application to candidate writing corpora from other Cambridge ESOL exams in response to a range of tasks at various levels (see conclusions below).

9.2. Number of levels

One of the key considerations in the imminent comparisons between the draft descriptors and other scales must be the number of levels to be covered. The four levels so far identified from the four sub-corpora of scripts might represent a first step towards a matching with CEF Levels A2 to C2, although these, with their aim of avoiding negative references, sometimes appear to indicate a rather higher performance level in some aspects. Fig. 4 gives examples from the CEF analysis of functions, notions, grammar and vocabulary necessary to perform the communicative tasks described on the scales (2001, p. 33) at four levels of proficiency.

Fig. 4. CEF B1 to C2 level description extracts.


Fig. 3. Draft four-level scale for writing.

There are certainly similarities of features here with the draft scale in Fig. 3 above, indicating that the draft scale may indeed have possibilities for refinement into a common scale. But the four corpora derived from our analyses do not include a level representing CEF A2 (Waystage). This is confined to communication such as short, basic descriptions of events and activities (2001, p. 34), and should thus be investigated through the analysis of scripts in response to a briefer, simpler communicative writing task than the one performed by the corpus of 288. Below this A2 level of proficiency would be a level similar to CEF Level A1 (Breakthrough), the lowest level of generative language use (2001, p. 33). It is likely that the common scale will eventually have six levels, but empirical research is still required on the two lower levels.

10. Conclusions and further research


This study has analysed an extensive corpus of scripts written by candidates at three exam levels, FCE, CAE and CPE (offered by Cambridge ESOL for certification at CEF B2, C1 and C2 levels, respectively), in response to a single communicative task. The two-stage qualitative analyses, carried out by one researcher but with some consultation with expert opinion and regular feedback from the Cambridge ESOL Writing Steering Group, have been supported where feasible by computer corpus analyses conducted by the second researcher and co-writer of this article. Features derived from the analyses of the corpus have been incorporated in a draft four-level assessor-oriented scale based on three criteria: sophistication of language, organisation and links, and accuracy.

This draft scale is being used for further research designed to increase its generalisability for potential use in a common scale for writing. The draft scale is being applied to corpora of IELTS scripts with previous ratings from Bands 3 to 9, and to Business English Certificates (BEC) and Certificates in English Language Skills (CELS) exam scripts at Preliminary, Vantage and Higher levels. With each of these three corpora the scripts concerned cover a wide range of writing tasks. Results so far indicate that the draft scale derived from the study described in this paper does have generalisability across exams and writing tasks and will be useful in helping to specify relationships between proficiency levels measured by Cambridge ESOL Main Suite, Business English and IELTS test bands.

On a broader research front, the study would seem to offer useful insights into the writing construct, with the identification and development of criteria which are relevant to the communicative language testing construct, and which should be useful for the assessment of writing beyond Cambridge ESOL exams. Methodologically, the study appears to support the use of learner corpora in the investigation of target language proficiency levels and the use of a combination of qualitative and computer-linguistic analytic approaches, including those starting from tabula rasa and analyst intuition, though checked through expert opinion and reference to existing criteria and scales.

Acknowledgements

We would like to acknowledge the involvement of Nick Saville, Janet Bojan, Annette Capel and Liz Hamp-Lyons in Phase 1 of the CSW Project, and Cambridge ESOL team members, Chris Banks, Neil Jones, Tony Green, Nick Saville, Stuart Shaw, Lynda Taylor and Beth Weighill for their work on Phase 2. Thanks also to Cyril Weir for comments on an early draft of the paper.

References
Alderson, C. (1990). Bands and scores. In: C. Alderson & B. North (Eds.), Language testing in the 1990s (pp. 71–86). London: Modern English Publications and the British Council.
Aston, G., & Burnard, L. (1998). The BNC handbook: Exploring the British National Corpus with SARA. Edinburgh, UK: Edinburgh University Press.
Bachman, L. (1990). Fundamental considerations in language testing. Oxford: Oxford University Press.


Bachman, L. (1991). What does language testing have to offer? TESOL Quarterly, 25(4), 671–704.
Bachman, L. (2002). Some reflections on task-based language performance assessment. Language Testing, 19(4), 453–476.
Bachman, L., & Palmer, A. (1996). Language testing in practice: Designing and developing useful language tests. Oxford: Oxford University Press.
Ball, F. (2001). Using corpora in language testing. In: Research notes (Vol. 6). Cambridge: Cambridge ESOL.
Ball, F. (2002). Developing wordlists for BEC. In: Research notes (Vol. 8). Cambridge: Cambridge ESOL.
Ball, F., & Wilson, J. (2002). Research projects related to YLE speaking tests. In: Research notes (Vol. 7). Cambridge: Cambridge ESOL.
Biber, D., Conrad, S., & Reppen, R. (1998). Corpus linguistics: Investigating language structure and use. Cambridge: Cambridge University Press.
Boyle, A., & Booth, D. (2000). The UCLES/CUP learner corpus. In: Research notes (Vol. 1). Cambridge: Cambridge ESOL.
Council of Europe. (2001). Common European framework of reference for languages: Learning, teaching, assessment. Cambridge: Cambridge University Press.
Cumming, A. (1998). Theoretical perspectives on writing. Annual Review of Applied Linguistics, 18, 61–78.
Fulcher, G. (1996). Testing tasks: Issues in task design and the group oral. Language Testing, 13(1).
Fulcher, G. (2003). Testing second language speaking. London: Pearson Longman.
Granger, S., & Rayson, P. (1998). Automatic profiling of learner texts. In: S. Granger (Ed.), Learner English on computer (pp. 119–131). London: Longman.
Hamp-Lyons, L. (1990). Second language writing: Assessment issues. In: B. Kroll (Ed.), Second language writing assessment issues and options. New York: Macmillan.
Hamp-Lyons, L. (1995). Summary report on Writing Meta-Scale Project (UCLES EFL Internal Report).
Hatch, E., & Lazaraton, A. (1991). The research manual: Design and statistics for applied linguistics. Boston: Heinle and Heinle.
Hymes, D. (1971). On communicative competence. Philadelphia, PA: University of Philadelphia Press.
Kennedy, C., Dudley-Evans, T., & Thorp, D. (2001). Investigation of linguistic output of academic writing Task 2 (British Council-funded IELTS Research Project 1999–2000, Final Report).
Leech, G. (1998). Preface. In: S. Granger (Ed.), Learner English on computer (pp. xiv–xx). London: Addison Wesley Longman Limited.
McEnery, T., & Wilson, A. (1996). Corpus linguistics (2nd ed.). Edinburgh: Edinburgh University Press.
Milanovic, M., Saville, N., & Shen, S. (1992). Studies on direct assessment of writing and speaking (UCLES EFL Internal Report).
Morrow, K. (1979). Communicative language testing: Revolution or evolution? In: C. Brumfit & K. Johnson (Eds.), The communicative approach to language teaching. Oxford: Oxford University Press.
Morrow, K. (1990). Evaluating communicative tests. In: S. Anivan (Ed.), Current developments in language testing. Singapore: SEAMEO Regional Centre.
Munby, J. (1978). Communicative syllabus design. Cambridge: Cambridge University Press.
North, B. (2000). Linking language assessments: An example in the low stakes context. System, 28, 555–577.
Porter, D. (1990). Affective factors in the assessment of oral interaction. In: S. Anivan (Ed.), Current developments in language testing. Singapore: SEAMEO Regional Language Centre.
Saville, N. (2003). The process of test development and revision within UCLES EFL. In: C. J. Weir & M. Milanovic (Eds.), Continuity and innovation: Revising the Cambridge proficiency in English examination 1913–2002. Cambridge: Cambridge University Press.
Saville, N., & Capel, A. (1996). Common scale writing (Interim Project Report: UCLES EFL).


Saville, N., & Hawkey, R. (2004). The IELTS impact study: Investigating washback on teaching materials. In: L. Cheng & Y. Watanabe (Eds.), Washback in language testing: Research contexts and methods. New Jersey: Lawrence Erlbaum Associates Inc.
Scott, M. (2002). WordSmith Tools version 3. Oxford: Oxford University Press. Available at: http://www.lexically.net/wordsmith/version3/index.html.
Spolsky, B. (1995). Measured words. Oxford: Oxford University Press.
Turner, C., & Upshur, J. (1996, August). Scale development factors as factors of test method. Paper presented at the 18th Language Testing Research Colloquium, Tampere, Finland.
Upshur, J., & Turner, C. (1995). Constructing rating scales for second language tests. English Language Teaching Journal, 49(1), 3–12.
Upshur, J., & Turner, C. (1999). Systematic effects in the rating of second language speaking ability: Test method and learner discourse. Language Testing, 16(1), 82–111.
Weir, C. J. (1993). Communicative language testing. New York: Prentice-Hall.
Weir, C. J., & Milanovic, M. (Eds.). (2003). Continuity and innovation: Revising the Cambridge proficiency in English examination 1913–2002. Cambridge: Cambridge University Press.
