
PUBLISHING DETAILS

IELTS RESEARCH REPORTS VOLUME 6, 2006


Published by: IELTS Australia and British Council

Project Managers: Jenny Osborne, IELTS Australia; Uyen Tran, British Council

Editors: Dr Lynda Taylor, University of Cambridge ESOL Examinations; Dr Anthony Green, University of Cambridge ESOL Examinations

Acknowledgements: Petronella McGovern; Dr Steve Walsh

IELTS Australia Pty Limited
ABN 84 008 664 766 (incorporated in the ACT)
GPO Box 2006, Canberra, ACT, 2601, Australia
Tel: 61 2 6285 8222
Fax: 61 2 6285 3233
Email: ielts@idp.com
Web: www.ielts.org
IELTS Australia Pty Limited 2006

British Council
Bridgewater House
58 Whitworth Street
Tel: 44 161 957 7755
Fax: 44 161 957 7762
Email: ielts@britishcouncil.org
Web: www.ielts.org
British Council 2006

This publication is copyright. Apart from any fair dealing for the purposes of private study, research, criticism or review, as permitted under Division 4 of the Copyright Act 1968 and equivalent provisions in the UK Copyright Designs and Patents Act 1988, no part may be reproduced or copied in any form or by any means (graphic, electronic or mechanical, including recording or information retrieval systems) by any process without the written permission of the publishers. Enquiries should be made to the publisher.

The research and opinions expressed in this volume are those of individual researchers and do not represent the views of IELTS Australia Pty Limited or British Council. The publishers do not accept responsibility for any of the claims made in the research.

National Library of Australia, cataloguing-in-publication data
2006 edition, IELTS Research Reports 2006 Volume 6
ISBN 0-9775875-1-7
Copyright 2006


CONTENTS

Foreword
Introduction
Publishing details

1 An investigation of the effectiveness and validity of planning time in Part 2 of the IELTS Speaking Test
Addresses the question of whether the use of planning time for the IELTS Speaking Test assists in candidate performance.
Catherine Elder and Gillian Wigglesworth

2 An examination of the rating process in the revised IELTS Speaking Test
Examines the validity of the analytic rating scales used to assess performance in the revised IELTS Speaking Test, through an analysis of verbal reports produced by IELTS examiners when rating test performances and a subsequent questionnaire.
Annie Brown

3 Candidate discourse in the revised IELTS Speaking Test
Aims to verify the IELTS Speaking Test scale descriptors by providing empirical validity evidence derived from a linguistic analysis of candidate discourse.
Annie Brown

4 The impact on candidate language of examiner deviation from a set interlocutor frame in the IELTS Speaking Test
Shows that the deviations examiners make from the interlocutor frame in the IELTS Speaking Test have little significant impact on the language produced by candidates.
Barry O'Sullivan and Yang Lu

5 Exploring difficulty in Speaking tasks: an intra-task perspective
Looks at how the difficulty of a speaking task is affected by changes to the time offered for planning, the length of response expected and the amount of scaffolding provided.
Cyril Weir, Barry O'Sullivan and Tomoko Horai

6 The interactional organisation of the IELTS Speaking Test
Describes the interactional organisation of the IELTS Speaking Test in terms of turn-taking, sequence and repair.
Paul Seedhouse and Maria Egbert

7 An investigation of the lexical dimension of the IELTS Speaking Test
Investigates vocabulary use by candidates in the IELTS Speaking Test by measuring lexical output, variation and sophistication, and the use of formulaic language.
John Read and Paul Nation


FOREWORD
Welcome to Volume 6 of the IELTS Research Reports. The studies reported in this volume were funded by the IELTS joint-funded research programme, sponsored by British Council and IELTS Australia. The third IELTS partner, Cambridge ESOL, supports the programme by providing assistance to approved researchers.

Since the programme began in 1995, grants have been awarded for nearly 70 studies involving over 120 leading researchers. The results have made a significant contribution to the monitoring, evaluation and development of IELTS. It is now one of the world's most researched English language tests, ensuring that IELTS continues to be the test that sets the standard through its high level of quality, validity, security and overall integrity.

IELTS research activities are co-ordinated as part of a coherent framework for research and validation of the IELTS Test, and the research programme is a major component of this framework. A summary of the impact of the research studies reported in this volume can be found in the Introduction by Cambridge ESOL.

The annual call for research proposals is widely publicised and aims to reflect current issues relating to IELTS as a major international English language proficiency test. A Joint Research Committee of the three IELTS partners agrees on research priorities and oversees the tendering process. Committee members, collaborating with experts in applied linguistics and language testing, assess the research proposals according to the following criteria:
- relevance and benefit of outcomes to IELTS
- clarity and coherence of the proposal's rationale, objectives and methodology
- feasibility of outcomes, timelines and budget
- qualifications and experience of proposed project staff
- potential to be published for both IELTS and an international audience.

Volume 6 is the first of two volumes of IELTS research reports to be published jointly by British Council and IELTS Australia, and it contains reports of research funded by both partners. The main theme of Volume 6 is the IELTS Speaking Test. Volume 7 will focus on a range of topics including the IELTS Writing Test.

Further information about IELTS research and the joint-funded programme is available on the IELTS website: www.ielts.org

Martin Davidson
Deputy Director General
British Council

Anthony Pollock
Chief Executive
IELTS Australia


INTRODUCTION
The British Council/IELTS Australia joint-funded research programme makes a significant contribution to the ongoing development of IELTS. External studies funded by these two IELTS partners complement internal validation and research studies conducted or commissioned by the third IELTS partner, Cambridge ESOL. The funded studies form an integral part of the process of IELTS monitoring, validation and evaluation.

This volume brings together a number of important empirical studies focusing on the IELTS Speaking Test. A major review of the IELTS Speaking Test took place in the late 1990s and a formal project to revise the Speaking module was conducted between 1998 and 2001. The revision project concentrated on several key areas with the aim of achieving greater standardisation of test conduct and improving the reliability of assessment; this included:
- developing a clearer specification of tasks, eg in terms of input and expected candidate output, and the revision of the tasks themselves for some phases of the Test
- introducing an examiner frame to guide examiner language and behaviour, and so increase standardisation of test management
- re-developing the assessment criteria and rating scale to ensure that the rating descriptors matched more closely the output from candidates in relation to the specified tasks
- re-training and re-standardising a community of around 1500 IELTS examiners worldwide using a face-to-face approach, and introducing ongoing quality assurance procedures for this global examiner cadre.

The revised IELTS Speaking Test was introduced in July 2001 and since that time the joint-funded programme has invited research proposals for empirical studies which explore various aspects of the revised test module along the dimensions listed above. Such studies are considered essential to confirm that the revised test is functioning as intended, to identify any issues that may need addressing, and to contribute to the body of evidence in support of the validity arguments underpinning use of the test.

The first study reported in this volume, by Gillian Wigglesworth and Catherine Elder, investigated the relationship between three variables in the IELTS Speaking Test: planning, proficiency and task. Their study aimed to increase our understanding of how these variables interact with one another and how they impact on test-taker performance. The specific focus was the role and use of the one minute of planning time afforded to candidates in Part 2 of the Speaking Test. Part 2 is a long turn task with in-built pre-task planning time. The task design reflects the fact that some speech, especially in academic and professional contexts, is more formal in nature and is often planned prior to delivery (though, as the researchers acknowledge, it is clearly difficult to replicate this condition within the limited time-frame of a speaking test). Early Second Language Acquisition (SLA) research into the effect of pre-task planning, including work by Wigglesworth, suggested that planning time impacted positively on both the content and the quality of L2 oral performance; later research findings, however, proved less conclusive. As this was an innovative feature of the revised IELTS Speaking Test introduced in 2001, the test developers were keen to investigate the effectiveness and validity of the planning time and how test-takers make use of it.


Interestingly, Wigglesworth and Elder's experimental study found no evidence that the availability of planning time advantages or disadvantages candidates' performance, either in terms of the discourse they produce or the scores they receive. Despite this finding, the researchers recommend that one minute of pre-task planning should continue to be included in Part 2 in the interests of fairness and for face validity purposes. An important dimension of this study was that it canvassed the candidates' own perceptions of the planning time available to them; feedback from test-takers suggests they perceive the one minute as adequate and useful. This study therefore offers positive support for the decision by the IELTS test developers to include a small amount of planning time in the revised Speaking Test; it also confirms that there would be no value in increasing it to two minutes, since this would be unlikely to produce any measurable gain. Another useful outcome from this study is the feedback from both researchers and test-takers on possible task factors relating to topic; this type of information is valuable for informing the test writing process.

Volume 6 includes two studies by Annie Brown, who has a long association with the IELTS Speaking Test dating back to the early 1990s. Findings from Brown's studies of the Test as it was then (some of which formed the basis of her doctoral research) were instrumental in shaping the revised Speaking Test introduced in 2001. The first of the two studies in this volume examined the validity of the analytic rating scales used by IELTS examiners to assess test-takers' performance. When the Speaking Test was revised, a major change was the move from a single global assessment scale to a set of four analytic scales that all IELTS examiners worldwide are trained and standardised to administer. The IELTS partners were therefore keen to investigate how examiners are interpreting and applying the new criteria and scales, partly to confirm that they are functioning as intended, and also to highlight any issues that might need addressing in the future.

Brown's study used verbal report methodology to analyse examiners' cognitive processes when applying the scales to performance samples, together with a questionnaire probing the rating process further. The study's findings provided encouraging evidence that the revised assessment approach is a significant improvement over the pre-revision Speaking Test, in which examiners were relatively unconstrained in their language and behaviour and used a single, holistic scale. Firstly, the revised test format has clearly reduced the extent to which an interviewer's language and behaviour are implicated in a test-taker's performance. Secondly, examiners in this study generally found the scales easy to interpret and apply, and they adhered closely to the descriptors when rating; they reported a high degree of comfort in terms of both managing the interaction and awarding scores. The study was instructive in highlighting several aspects that may need further attention, including some potential overlap between certain analytic scales and some difficulty in differentiating across levels; however, Brown suggests that these can relatively easily be addressed through minor revisions to the descriptors and through examiner training.
Brown's second study in this volume is a partner to the first. It too sought empirical evidence to validate the new Speaking Test scale descriptors, but through a discourse analytic study of test-taker performance rather than a focus on examiner attitudes and behaviour. Overall, the study findings confirmed that all the measures relating to each analytic criterion contribute in some way to the assessment on that scale and that no single measure appears to dominate the rating process. As we would wish, a range of performance features contribute to the overall impression of a candidate's proficiency, and the results of this study are therefore encouraging for the IELTS team who developed the revised scales and band descriptors.


This study also highlights the complexities that are involved in assessing speaking proficiency across a broad ability continuum (as is the case in IELTS). Specific aspects of performance may be more or less relevant at certain levels, and so contribute differentially to the scores awarded. Furthermore, even though two candidates may be assessed at the same level on a scale, their respective performances may display subtle differences on different dimensions of that trait. This reminds us that, at the level of the individual, the nature of spoken language performance and what it indicates about proficiency level can be highly idiosyncratic.

Barry O'Sullivan and Yang Lu set out to analyse the way in which the examiner script (or Interlocutor Frame) used in the IELTS Speaking Test impacted on test-takers' performance, specifically in cases where an examiner deviates from the scripted guide provided. An Interlocutor Frame was introduced in the 2001 revision on grounds of fairness, to increase standardisation of the test and to reduce the risk of rater variability; since then, the functioning of the Interlocutor Frame has been the focus of ongoing research and validation work. The study reported here forms part of that research agenda, and aimed to locate specific sources of deviation, the nature of the deviations and their effect on the language of the candidates. Taking a discourse analytic approach, the researchers analysed transcription extracts from over 60 recordings of live speaking tests to investigate the nature and impact of examiner deviations from the interlocutor frame.

Findings from their study suggest that in Parts 1 and 2 of the Speaking Test, the examiners adhere closely to the Frame; any deviations are relatively rare and they occur at natural interactional boundaries with an essentially negligible effect on the language of candidates. Part 3 shows a different pattern of behaviour, with considerable variation across examiners in the paraphrased questions, though even here little impact on candidate language could be detected. It is important to note, however, that some variation is to be expected in this third part of the Test, as it is specifically designed to offer the examiner flexibility in choosing and phrasing their questions, matching them to the level of the test-taker within the context of a test measuring across a broad proficiency continuum. Once again, findings from this study confirm that the revised Speaking Test is functioning largely as the developers originally intended. The study also provides useful insights which will inform procedures for IELTS examiner training and standardisation, and will shape future changes to the Frame.

Cyril Weir, Barry O'Sullivan and Tomoko Horai investigated how the difficulty of speaking tasks can be affected if changes are made to three key task variables: the amount of planning time offered; the length of response expected; and the amount of content scaffolding provided. Their study explored these variables in relation to Part 2 (the individual long turn task) of the IELTS Speaking Test and therefore complements the Wigglesworth and Elder study, which focused on the planning time variable in isolation. Using an experimental design, the researchers collected performance and score data for analysis. They supplemented this with an analysis of questionnaire responses related to test-takers' cognitive processing, based upon a socio-cognitive framework for test validation.
Once again, the findings are encouraging for the IELTS test developers. There is welcome empirical support for the current design of the Part 2 task used in the operational IELTS, both in terms of the quality of candidate performance and the scores awarded, and also in relation to candidate perceptions of task difficulty. Task equivalence is an important issue for IELTS given the large number of tasks which are needed for the operational test, and this study provides useful insights into some of the variables which can affect task difficulty, especially for test candidates at different ability levels.


Paul Seedhouse and Maria Egbert explored the interactional organisation of the IELTS Speaking Test in terms of turn-taking, sequence and repair, drawing their sample for analysis from the large corpus of audio-recordings held by Cambridge ESOL. Since 2002 several thousand recordings of live IELTS Speaking Tests have been collected and these now form a valuable spoken language corpus used by researchers at Cambridge ESOL to investigate various aspects of the Speaking Test. By applying Conversation Analysis (CA) methodology to 137 complete speaking tests, Seedhouse and Egbert were able to highlight key features of the spoken interaction. Like O'Sullivan and Lu, they observed that examiners adhere closely to the scripted guide they are given to ensure standardisation of the test event. Although spoken interaction in the IELTS Speaking Test is somewhat different from ordinary conversation, due to the institutional nature of the test event, the researchers confirm that it does share similarities with interactions in teaching and academic contexts. In addition, the three parts of the Test allow for a variety of task types and patterns of interaction. Seedhouse and Egbert make a number of useful recommendations which will inform aspects of test design as well as examiner training, particularly in relation to the rounding-off questions at the end of Part 2.

In the final report in Volume 6, John Read and Paul Nation investigated the lexical dimension of the IELTS Speaking Test. Allocation of grant funding for this study once again reflected the IELTS partners' concern to undertake validation work following introduction of the revised Speaking Test in 2001. When the holistic or global scale for speaking was replaced with four analytic criteria and scales in July 2001, one of these four was Lexical Resource; this requires examiners to attend to the accuracy and range of a candidate's vocabulary use as one basis for judging their performance. The Read and Nation study therefore set out to measure lexical output, variation and sophistication, as well as the use of formulaic language by candidates. As the researchers point out in their literature review, there was a strong motivation to explore speaking assessment measures from a lexical perspective, given the relative lack of previous research on spoken (rather than written) vocabulary and the growing recognition of the importance of lexis in second language learning.

For this study the researchers created a small corpus of texts derived from transcriptions of Speaking Tests recorded under operational conditions at IELTS test centres worldwide. As for the Seedhouse and Egbert study, they were given access to the corpus of IELTS Speaking Test recordings at Cambridge ESOL, from which they selected a subset of 88 performances for transcription and analysis. The study's findings are broadly encouraging for the IELTS test developers, confirming that the Lexical Resource scale does indeed differentiate between higher and lower proficiency candidates. At the same time, however, the study highlights the complexity of this aspect of spoken performance and the extent to which candidates who receive the same band score sometimes display markedly different qualities in their individual performance. The study also provides useful insights into how different topics in Parts 2 and 3 influence the nature and extent of lexical variation.
Such insights can feed back into the test writing process; they can also inform the training of IELTS examiners, directing their attention to salient distinguishing features of the different bands and so assisting them to rate vocabulary performance reliably as a component separate from the other three rating criteria. The researchers suggest that in the longer term, and following additional research into how this scale operates, there may be a case for some further revision of the rating descriptors.


Revision of the IELTS Speaking Test in 2001 made it possible to address a number of issues relating to the quality and fairness of the Test. Each of the research studies reported in Volume 6 offers important empirical evidence to support claims about the usefulness of the current IELTS Speaking Test as a measure of L2 spoken language proficiency. In addition, they all provide valuable insights which can inform the ongoing development process (task design, examiner training, etc) as well as future revision cycles. All the reports from the funded projects in this volume highlight avenues for further research, and researchers wishing to apply for future grants under the British Council and IELTS Australia joint-funded programme may like to take note of some of the suggestions made.

Dr Lynda Taylor
Assistant Director, Research and Validation Group
University of Cambridge ESOL Examinations
September 2006


1. An investigation of the effectiveness and validity of planning time in Part 2 of the IELTS Speaking Test
Authors
Catherine Elder, The University of Melbourne, Australia
Gillian Wigglesworth, The University of Melbourne, Australia

Grant awarded: Round 9, 2003

This study addresses the question of whether the use of planning time for the IELTS Speaking Test assists in candidate performance.
ABSTRACT

This study investigates the relationship between three variables in the oral IELTS test (planning, proficiency and task) and was designed to enhance our understanding of how or whether these variables interact. The study questioned whether differences in performance resulted from one or two minutes of planning time. It also aimed to identify the most effective strategies used by candidates in their planning. Ninety candidates, in two groups (intermediate and advanced), each undertook three tasks: one with no planning time, one with one minute of planning time and one with two minutes. All tasks were rated by two raters, and the transcripts of the speech samples were subjected to a discourse analysis. Neither the analysis of the scores nor the discourse analysis revealed any significant differences in performance according to the amount of planning time provided. While this suggests that planning time does not positively advantage candidates, we argue that one minute of pre-task planning should continue to be included in Part 2 of the IELTS Speaking Test in the interests of fairness, and to enhance the face validity of the test. The report concludes with a discussion of possible reasons for the null findings and proposes avenues for further research.


AUTHOR BIODATA

GILLIAN WIGGLESWORTH
Gillian Wigglesworth is Associate Professor and Head of the School of Languages and Linguistics at The University of Melbourne. She has a wide range of research interests which broadly include first and second language acquisition, language testing and assessment, and bilingualism. Gillian has several edited book publications, and numerous journal articles and book chapters which reflect her research interests.

CATHERINE ELDER
Catherine Elder is Associate Professor of Applied Linguistics and Director of the Language Testing Research Centre in the School of Languages and Linguistics at The University of Melbourne. Previously, and while undertaking this research study, she was with Monash University. Catherine is co-author of the Dictionary of Language Testing (CUP 1999) and co-editor of Experimenting with Uncertainty: Essays in Honour of Alan Davies (CUP 2001) and of the Handbook of Applied Linguistics (Blackwell 2004).


CONTENTS
1 Background to the research
2 The current study
3 Context for the research
  3.1 Research questions
  3.2 Variables
    3.2.1 Proficiency level
    3.2.2 Amount of planning time
    3.2.3 Task
4 Methodology
  4.1 Participants
  4.2 Study design
  4.3 Data collection procedures
    4.3.1 Interviews
    4.3.2 Post-interview questionnaires
    4.3.3 Focus groups
  4.4 Data compilation and analysis
    4.4.1 Transcription and digitisation of tapes
    4.4.2 Post-performance ratings
    4.4.3 Discourse analysis
    4.4.4 Questionnaire responses
    4.4.5 Focus group responses
5 Results
  5.1 Research question 1 (amount of planning time/scores)
  5.2 Research question 2 (amount of planning time/quality)
  5.3 Research question 3 (candidates' perception of planning time)
    5.3.1 Topic as a factor
    5.3.2 Planning time as a factor
    5.3.3 Planning and topic as a factor
  5.4 Research question 4 (how planning time used)
  5.5 Research question 5 (most effective strategies for planning time)
6 Discussion and Conclusion
References
Appendix 1: Task prompts provided for candidates
Appendix 2: Task administration instructions for interviewer
Appendix 3: Marking sheet
Appendix 4: Focus group interview questions
Appendix 5: Student questionnaire


1 BACKGROUND TO THE RESEARCH

The time variable is critical in information processing theories of speech production, and there is now a substantial body of Second Language Acquisition (SLA) research within this cognitive tradition investigating the effects of pre-task planning time on oral performance. This research has yielded fairly convincing evidence that opportunities for planning before a task impact on both the content of learners' speech and the quality of the language they produce. With regard to the latter, planning is seen as important because of the role it can play in helping learners access their L2 knowledge through controlled processing, promoting selective attention to form and monitoring (Skehan, 1998).

A review of the effects of planning time by Ellis (2005) shows that planning generally enhances the fluency and complexity of L2 learners' spoken performance (eg Foster, 1996; Foster and Skehan, 1996; Skehan and Foster, 1997; Wendel, 1997; Mehnert, 1998; Ortega, 1999; Yuan and Ellis, 2003). Results pertaining to accuracy are less consistent, but some studies (eg Ellis, 1997; Mehnert, 1998) show that planning also reduces the incidence of error in learner speech. This inconsistency has been attributed to a number of variables, including the characteristics of the tasks used to elicit learner speech and the conditions under which these tasks are performed. Performance on structured tasks, for example, has been found to be more responsive to planning than is the case with unstructured tasks (Foster and Skehan, 1996; Skehan and Foster, 1997). The type of planning which learners engage in may also be important, as Sangarun (2005) showed. Finally, the time allowed for planning also has an impact, with some aspects of speech improving after only one minute of planning time and others requiring more sustained rehearsal. In Ortega's (1999) study, for example, fluency improvements were evident only after 10 minutes of pre-task planning.

One of the reasons for the intense interest in planning amongst SLA researchers is that it is believed to foster pushed output (Swain, 1993) and hence to aid acquisition, although firm evidence in support of this belief is yet to emerge. Whether or not this is the case, the different qualities of speech produced under planned and unplanned conditions provide insight into the psycholinguistic constraints on L2 production, and lend support to the distinction made by Ellis (2005) and others between implicit (automated) and explicit (analytic) knowledge. These constructs are regarded by many as central to psycholinguistic theories of second language production.

The justification for researching planning time in language testing contexts, such as the one investigated in this study, is somewhat different. Skehan (1998) invokes test validity, claiming that speaking tests need to sample language produced under planned and unplanned conditions if test scores are to be considered representative of a broad range of real world performances. Such a position begs the question of how much planning time will produce the desired variation in the quality of speech. Elder et al (2002) propose that tests like IELTS and TOEFL, which are used to predict language performance in academic settings, should include planning time for authenticity reasons, given that academic speech is more often than not planned prior to delivery.
There are, however, obvious constraints on how closely test tasks can mirror the requirements of academia, where students may spend several hours or days preparing for an academic presentation. In a testing context, the amount of planning time must be limited to what is practical given the resources available. It should also be acknowledged that the majority of speaking taking place in academic contexts is entirely spontaneous, so it seems logical also to include some tasks with no planning time. This, however, raises fairness issues: a further argument for allowing planning in testing contexts. In the highly stressful test situation, planning time may serve to reduce anxiety, a possible source of construct-irrelevant variance on a test. It may thereby give candidates opportunities to produce their best possible performance (see Swain's (1985) arguments about 'biasing for best' in the test situation). However, what is not clear is whether planning does in fact reduce anxiety, or whether it makes a difference to test performance, as the SLA research would lead us to believe.

The few studies which have been conducted into the effects of planning in language testing contexts have produced less consistent results than is the case with classroom-based SLA research. The first study to be undertaken was that of Wigglesworth (1997), which explored the effects of planning on the oral proficiency component of the access: test (used to screen immigrants for entry to Australia) and found that pre-task planning increased the accuracy of certain grammatical features, such as verb tenses and articles, particularly amongst the higher proficiency candidates when performing cognitively demanding tasks. But while she found significant effects for planning at the discourse level, giving candidates pre-task planning time made no difference to their scores.

Two recent studies have also found that planning time can have a positive impact on performance. The first, by Tavakoli and Skehan (2005), which was conducted in what the authors claim to be a testing environment, found consistent benefits for planning on discourse measures of accuracy, complexity and breakdown fluency. The impact of planning on scores, however, is not reported. Proficiency again interacted with planning time, as in the Wigglesworth study, but this time it was the less proficient learners who gained the most (elementary planners in some cases outperformed the intermediate non-planners). Learners also found task performance easier under the planned condition. The second study, by Xi (2005), which focused on a graph description task from the tape-mediated SPEAK (Speaking Proficiency English Assessment Kit), found that planning time had the effect of increasing holistic scores on some line graph tasks and also served to mitigate the effects of task familiarity on performance. Qualitative analyses revealed that candidates described more line segments and offered more complex information when planning was provided.

However, these findings are at odds with those of other test-based research, namely that of Wigglesworth (2000) and Iwashita et al (2001). In Wigglesworth's study, which focused only on test scores, planning was found to be counterproductive in the case of unstructured tasks and had little impact on learner performance on other task types. Iwashita et al (2001) found that planning before a monologic story-telling task had no impact on either the quality of test discourse or test scores, or indeed on candidates' perceptions of task difficulty. Elder and Iwashita (2005) offered a variety of tentative explanations for the discrepancy between the findings of classroom and language testing research, including the nature of the tasks themselves, of the instructions given to candidates and of the opportunities for on-line planning during task performance (which they speculate may obscure the effects of pre-task preparation). They also suggest that the use of planning time by test-takers may be ineffective. Although some classroom-based research has investigated how learners use their planning time (Wendel, 1997; Ortega, 1999, 2005; Sangarun, 2005), this issue is yet to be explored in a language testing context.

2 THE CURRENT STUDY

The current study was motivated by a desire to probe these issues further in the context of a face-to-face oral interview (the previous studies were conducted with tests requiring tape-based performances). Particular attention was paid to the design features of the study to avoid some weaknesses of previous research efforts in this area. As well as investigating the effect of different levels of planning time on learners at different levels of proficiency, we were interested in investigating the nature and effectiveness of test-taker planning processes, and also in canvassing test-takers' perceptions of planning time (ie its adequacy and usefulness).


3 CONTEXT FOR THE RESEARCH

The study (funded by an IELTS Australia grant awarded in 2003) explored the effects of pre-task planning time on performance on Part 2 of the International English Language Testing System (IELTS) oral interview. The interview offers one minute's preparation time to all candidates and allows them to prepare notes which they can refer to during the actual interview. We will hereafter use Ellis's term 'strategic planning' (2005: 3-5) to make it clear that we are talking about the preparation time given to candidates immediately before performing a test task, rather than pre-task rehearsal (Bygate and Samuda, 2005), in which the candidate actually practises the task prior to performing it. The following questions were investigated in the study.

3.1 Research questions
1. Does the amount of strategic planning time provided make a difference to the scores awarded to candidates in Part 2 of the oral test?
2. Does the amount of strategic planning time make a difference to the quality of candidate discourse in Part 2 of the oral test?
3. How do candidates perceive the usefulness and validity of strategic planning time?
4. How do candidates use their strategic planning time?
5. What are the most effective strategies for the use of strategic planning time?

3.2 Variables
Three variables were manipulated in the study design:
1. Proficiency level
2. Amount of planning time
3. Task.
3.2.1 Proficiency level

There were two groups of candidates at different levels of proficiency. Group A were intermediate-level candidates, as determined by previous scoring on IELTS (band 5.0-5.5) and/or institutional estimates derived from in-house measures used for placement purposes. Group B were advanced candidates (ie previous IELTS band scores of 6.0 or more, or an institutionally determined equivalent). Items from Nation's 3,000-5,000 level Academic Word List were also administered to candidates in each group to confirm the validity of these proficiency groupings. The vocabulary test was used as a surrogate for general language proficiency, which was the basis for the institutional groupings, to confirm that the candidates belonged to two distinct proficiency groupings.
3.2.2 Amount of planning time

The instructions for Part 2 of the IELTS oral test indicate that candidates should be given one to two minutes to prepare. Given previous research which has indicated that as little as one minute can affect performance on some discourse measures (see Mehnert, 1998; Wigglesworth, 1997), this study set out to investigate whether there were any differences according to whether candidates are instructed to perform with a) no planning time, b) one minute of planning time, or c) two minutes of planning time. In each case, 15 seconds was provided for the candidate to read the task.


3.2.3 Task

Three tasks were developed in line with the specifications for the Part 2 task. These were then sent to Cambridge ESOL for feedback. Modifications were made following suggestions by the test developers to ensure that the tasks corresponded very closely to what might be used under operational test conditions (see Appendix 1 for the tasks and accompanying prompts to candidates). The design of the study was set up to control for variations in performance that might occur as a result of differences between the tasks, rather than as a result of the planning or proficiency variables (for details see 4.2 Study design below). This builds on previous research which has suggested that the impact of planning time on performance may be sensitive to relatively small differences in tasks (Foster, 1996; Foster and Skehan, 1996; Skehan and Foster, 1997; Mehnert, 1998; Ortega, 1999; Wigglesworth, 2001).

4 METHODOLOGY

4.1 Participants

Participants for the study were recruited from three different Australian tertiary institutions which offered English for Academic Purposes (EAP) and IELTS training. The candidates were given a small payment in compensation for their time and were also promised feedback on their performance against the various IELTS criteria (although it was explained that the resultant score was only roughly indicative of their IELTS level). The explanatory letter to participants is Appendix 2. The participants were aged between 19 and 36, and came from a range of language backgrounds. Approximately 60% were Chinese speakers (Mandarin vs Cantonese not specified), and the remainder included Korean, Japanese, Thai, Arabic and Vietnamese speakers. Most participants had taken the IELTS test before and all were university-bound, for either undergraduate or postgraduate study. All were intending to take the IELTS test in the near future. This study provided an important opportunity to practise an IELTS-like task and therefore motivation to participate was generally very high. There were 90 candidates in all, equally distributed across advanced and intermediate levels.

4.2 Study design

Each candidate did all three Part 2 task versions. In one task they were allowed no planning time; in another, one minute of planning time; and in the other, two minutes. Tasks, planning time and order were counterbalanced across candidates using a Latin Square design, as indicated in Tables 1 and 2 below. There were 45 candidates in Group A (intermediate) and 45 candidates in Group B (advanced). In each group, the candidates were divided into three subgroups (i, ii and iii), and within each subgroup, the candidates were divided into groups of five to avoid any practice effect. So, for example, in group Bi, all 15 candidates did Task 1 with no planning time, but five did this task first, five did it second and five did it third. Thus each candidate did each of the three tasks, and each experienced one task with no planning time, one task with one minute of planning time and one task with two minutes of planning time.


Planning time    Group Ai    Group Aii    Group Aiii
0 minutes        Task 1      Task 2       Task 3
1 minute         Task 2      Task 3       Task 1
2 minutes        Task 3      Task 1       Task 2

1 minute         Task 2      Task 3       Task 1
2 minutes        Task 3      Task 1       Task 2
0 minutes        Task 1      Task 2       Task 3

2 minutes        Task 3      Task 1       Task 2
0 minutes        Task 1      Task 2       Task 3
1 minute         Task 2      Task 3       Task 1

Table 1: Research design (Group A: intermediate candidates). Each three-row block shows the order of administration for one set of five candidates within each subgroup.


Planning time    Group Bi    Group Bii    Group Biii
0 minutes        Task 1      Task 2       Task 3
1 minute         Task 2      Task 3       Task 1
2 minutes        Task 3      Task 1       Task 2

1 minute         Task 2      Task 3       Task 1
2 minutes        Task 3      Task 1       Task 2
0 minutes        Task 1      Task 2       Task 3

2 minutes        Task 3      Task 1       Task 2
0 minutes        Task 1      Task 2       Task 3
1 minute         Task 2      Task 3       Task 1

Table 2: Research design (Group B: advanced candidates). Each three-row block shows the order of administration for one set of five candidates within each subgroup.
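The rotation in Tables 1 and 2 can also be expressed programmatically. The sketch below is illustrative only and is not part of the original study materials: it reproduces the fixed task/planning-time pairings for each subgroup and the three administration orders used for successive blocks of five candidates.

```python
# Illustrative sketch (not from the original study) of the Latin Square
# counterbalancing shown in Tables 1 and 2. Each subgroup pairs tasks with
# planning times in a fixed rotation; successive blocks of five candidates
# within a subgroup receive the three task/time pairs in rotated orders.

TASKS = ["Task 1", "Task 2", "Task 3"]
PLANNING = ["0 minutes", "1 minute", "2 minutes"]

def subgroup_pairings(offset):
    """Fixed task/planning-time pairing for subgroups i, ii, iii (offset 0, 1, 2)."""
    return [(PLANNING[k], TASKS[(k + offset) % 3]) for k in range(3)]

def administration_orders(pairings):
    """Three administration orders, one per block of five candidates."""
    return [pairings[start:] + pairings[:start] for start in range(3)]

for offset, name in enumerate(["i", "ii", "iii"]):
    print(f"Subgroup {name}:")
    for block, order in enumerate(administration_orders(subgroup_pairings(offset)), start=1):
        print(f"  block {block}: " + "; ".join(f"{time} -> {task}" for time, task in order))
```

Running the sketch prints, for each subgroup, the same three task/time sequences that appear as the three row-blocks of Tables 1 and 2.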

4.3 Data collection procedures

4.3.1 Interviews

A total of eight trained and experienced IELTS interviewers were recruited for the study and were thoroughly briefed on the interview procedures. Candidates within each proficiency grouping (advanced and intermediate) were assigned randomly to interviewers, who were issued with a bundle of pre-prepared student packs for their candidates. These packs contained the task prompts in the order in which they were to be administered, together with instructions for the candidates about the amount of planning time allowed (Appendix 3).


Apart from the differences in planning time, each task was administered under conditions which simulated as closely as possible the operational conditions of the IELTS interview. In the one- and two-minute planning conditions, candidates were given a sheet of paper and a pen to do their planning and were allowed to refer to their notes during task performance (as is normal during the IELTS interview). On completion of each task they were asked to hand the paper to the interviewer, who wrote the amount of planning time on the same sheet, as well as any difficulties or notable features of candidates' behaviour that had been observed. All interviews were tape-recorded so that any breach of the planning time instructions by either candidate or interviewer could be detected, and so that additional retrospective ratings of performance could be arranged.

Standard IELTS analytic criteria were used to rate each task performance separately as soon as the candidate completed each task (see Appendix 4 for the rating sheet). Ratings were assigned concurrently by the interviewer for feedback purposes, but it was decided not to use these ratings for our research investigation given a) informal feedback indicating that interviewers found it difficult to rate one task at a time (under normal operational conditions ratings are completed once only, after the interview is over) and b) our fear that ratings might be contaminated by interviewers' attitudes to planning time. (Interviewers have been found to compensate candidates for what they perceive as a difficult task or interlocutor, and we believed the same might be true for task conditions perceived by raters to pose challenges to candidates.)
4.3.2 Post-interview questionnaires

On completion of the interview, all candidates filled out a questionnaire (see Appendix 6) which canvassed their perceptions of planning time. It asked about any prior strategy training the candidates had experienced (eg in IELTS preparation classes) and asked them to indicate which strategies they used during planning time. The planning strategies adopted for the questionnaire were based on those identified by Rutherford (2001) on the basis of feedback from a focus group of students very similar to the participants in the current study. Both micro-level (language-related) and macro-level (content-related) strategies were included. The questionnaire was administered on completion of the three-task sequence to avoid the risk of a learning effect (ie candidates using some of the strategies included on the questionnaire in subsequent task performance). Candidates then completed the vocabulary test (described above).
4.3.3 Focus groups

Candidates' perceptions regarding the difficulty/fairness of the task under the three different conditions and the utility of planning time were further probed during two focus group interviews, each involving 8-10 participants from the larger study who volunteered to stay on for a further hour after the questionnaire and vocabulary test. The questions which guided the focus groups are given in Appendix 5. These focus group discussions were recorded on tape.

For the purposes of this study, focus group interviews were preferred over individual interviews for two main reasons. Firstly, for entirely practical reasons, the focus group meant that the candidates were not required to wait for a long period of time. Secondly, focus groups allow for a dynamic interaction between the members of the group (Greenbaum, 1998; Bryman, 2001), which was considered to be productive in terms of drawing out candidates' views, particularly given that they were second language learners. We acknowledge, however, that the views expressed by focus group participants are not necessarily representative of the broader sample.


4.4 Data compilation and analysis

4.4.1 Transcription and digitisation of tapes

All 90 tapes were transcribed so that transcripts could be analysed and coded (see further details below). The tapes were then sent to a laboratory for digitisation and a CD-ROM was created of all 90 performances. Instructions from the interviewer and silences for planning were removed from the CD so that raters would be unaware of the conditions under which each task was performed.
4.4.2 Post-performance ratings

Two trained IELTS raters were recruited to rate all three tasks on each of the 90 tapes using the IELTS analytic criteria. They were instructed to take a break at least once an hour to avoid fatigue. The ratings of both assessors were then entered into a database and inter-rater reliability was calculated (using Spearman's correlation coefficient). The data were first analysed using the Facets rating scale model (Linacre, 1990) with rater, task, proficiency and planning time entered as separate facets in the file. Univariate F tests (using SPSS) were then calculated with task and planning time entered as independent variables and the average (across the two raters) scores on each of the analytic rating criteria as the outcome measures. Due to the Latin Square design, whereby candidates were randomly assigned to different planning conditions within each proficiency grouping rather than across groupings, these analyses were conducted separately for high and low proficiency candidates.
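To make this pipeline concrete, the following sketch shows how the inter-rater correlation and the univariate F tests might be reproduced in Python with pandas, SciPy and statsmodels. The file and column names are hypothetical, the original analyses used SPSS and Facets, and the multi-faceted Rasch step is not reproduced here.

```python
import pandas as pd
from scipy import stats
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical data layout: one row per task performance, with both raters'
# scores, the task, the planning condition and the proficiency grouping.
ratings = pd.read_csv("ratings.csv")  # columns: rater1, rater2, task, planning, proficiency

# Inter-rater reliability: Spearman's correlation between the two raters
rho, p = stats.spearmanr(ratings["rater1"], ratings["rater2"])
print(f"Spearman's rho = {rho:.2f} (p = {p:.3f})")

# Average the two raters' scores, then run univariate F tests with task and
# planning time as independent variables, separately per proficiency group
ratings["score"] = ratings[["rater1", "rater2"]].mean(axis=1)
for level, group in ratings.groupby("proficiency"):
    model = smf.ols("score ~ C(task) + C(planning)", data=group).fit()
    print(level)
    print(sm.stats.anova_lm(model, typ=2))
```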
4.4.3 Discourse analysis

A subset of speech samples was selected for further analysis of the discourse. Two candidates from each of the nine cells in Tables 1 and 2 were randomly selected; thus the speech samples of 18 advanced and 18 intermediate candidates were selected. Transcribed speech samples for each candidate were coded for the following categories:

1. Fluency
- fluent versus disfluent speech
- filled and unfilled pauses
- self-repairs

2. Accuracy: global measures in terms of:
- error-free AS-units
- error-free clauses

3. Complexity
- proportion of dependent clauses per AS-unit
- percentage of subordinate clauses to AS-units

Fluency features were coded on the WAV files using the EMU Speech Database System, with the R statistical package used to extract the statistics. The EMU system offers a more accurate means of measuring fluency than the traditional approach based solely on written transcripts. It allows data to be coded in real time on a variety of different levels chosen by the investigator, and the R package allows these to be read once the features have been labelled. In other words, stretches of fluent speech were marked at beginning and end, as were filled and unfilled pauses and self-repairs. Although much more detailed labelling is available (eg syllables can be marked and thus counted), this process was very time-consuming. It was decided not to do this in the first instance, and to undertake a more detailed analysis only in the event of significant differences between groups on the broader categories.

For the measures of accuracy and complexity, the transcripts were coded into AS-units (Foster, Tonkyn and Wigglesworth, 2000) and clauses. Following Foster et al (2000), an AS-unit was defined as an utterance consisting of an independent clause together with any subordinate clause associated with it. An independent clause was defined minimally as a clause which included a finite verb, while a subordinate clause was defined as a clause consisting of a finite or non-finite verbal element with at least one other clausal element such as a subject, object, complement or adverbial (p 365).

Subordinate clauses were divided into two types, which we labelled 'subordinate' (when introduced by a subordinating discourse marker, eg because, before, after) and 'dependent' (consisting of non-finite and other non-independent clauses). Twelve speech samples were coded by the two chief investigators and reliability checks were then conducted. Areas of discrepancy were discussed and modifications were made to the coding system where necessary. The remaining speech samples were coded by a single researcher only.
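For readers who wish to compute the same ratio measures from coded transcripts, a minimal sketch follows. The data structures are hypothetical (the coding itself was done manually by the researchers); the formulas mirror the accuracy and complexity measures listed above.

```python
from dataclasses import dataclass, field

@dataclass
class Clause:
    kind: str            # "independent", "subordinate" or "dependent"
    error_free: bool

@dataclass
class ASUnit:
    clauses: list[Clause] = field(default_factory=list)
    error_free: bool = True

def accuracy_and_complexity(units: list[ASUnit]) -> dict:
    """Ratio measures for one coded speech sample."""
    clauses = [c for u in units for c in u.clauses]
    n_units = len(units)
    return {
        "% error-free AS-units": 100 * sum(u.error_free for u in units) / n_units,
        "% error-free clauses": 100 * sum(c.error_free for c in clauses) / len(clauses),
        "dependent clauses per AS-unit": sum(c.kind == "dependent" for c in clauses) / n_units,
        "% subordinate clauses to AS-units": 100 * sum(c.kind == "subordinate" for c in clauses) / n_units,
    }

# Toy example: one AS-unit containing an independent and a subordinate clause
sample = [ASUnit([Clause("independent", True), Clause("subordinate", False)], error_free=False)]
print(accuracy_and_complexity(sample))
```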
4.4.4 Questionnaire responses

Questionnaire responses were entered into a database and descriptive statistics (frequencies and percentages) were calculated for the various items. T-tests were used to compare the mean number of strategies used under the one- and two-minute planning conditions. The relative frequency of micro- and macro-planning strategies at each proficiency level was also calculated, and these frequencies were compared using the chi-square statistic. Correlations were also computed to determine whether there was a significant relationship between the number of micro- and macro-planning strategies used and test scores. The questionnaires also yielded qualitative data about candidates' attitudes to planning time. These comments were thematically coded and summarised with reference to findings from the focus group interviews (see below).
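The sketch below illustrates these questionnaire analyses with SciPy. All numbers are placeholders rather than the study's data; a paired t-test and a Pearson correlation are assumed here, since the report does not name the exact variants used.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Placeholder questionnaire data for 90 candidates (not the study's data)
strategies_1min = rng.integers(1, 8, size=90)  # strategies reported, 1-minute condition
strategies_2min = rng.integers(1, 8, size=90)  # strategies reported, 2-minute condition
scores = rng.normal(24, 2, size=90)            # averaged rater scores

# Compare the mean number of strategies under the two planning conditions
t, p = stats.ttest_rel(strategies_1min, strategies_2min)
print(f"paired t = {t:.2f}, p = {p:.3f}")

# Micro- vs macro-planning strategy frequencies by proficiency level (2 x 2 table)
freq = [[120, 80],    # intermediate: micro, macro (placeholder counts)
        [95, 110]]    # advanced: micro, macro (placeholder counts)
chi2, p_chi, dof, _ = stats.chi2_contingency(freq)
print(f"chi-square = {chi2:.2f}, df = {dof}, p = {p_chi:.3f}")

# Relationship between the number of planning strategies used and test scores
r, p_r = stats.pearsonr(strategies_1min + strategies_2min, scores)
print(f"r = {r:.2f}, p = {p_r:.3f}")
```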
4.4.5 Focus group responses

Focus group interviews were replayed and coded for keywords based on themes emerging from the data. These themes were exemplified with verbatim quotes where appropriate.

5 RESULTS

The results of the vocabulary test, which we used as a surrogate for proficiency, confirmed that the intermediate and advanced students came from different groups, with the intermediate students averaging 46.15 (standard deviation 13.21) and the advanced students averaging 56.50 (sd 9.43). This difference was significant (t = 4.243, df = 87, p < .0001). An inter-rater reliability check on the two trained IELTS raters was calculated for each of the rating categories and yielded coefficients ranging from .51 (for intelligibility) to .73 (for accuracy). However, it should be pointed out that while the candidates (the majority of whom were from mainland China) were at different levels of proficiency, their speaking skills were not highly variable. This is likely to be a result of their lack of exposure to spoken English in their previous instructional contexts. In future studies it might be useful to pre-test students' oral proficiency, rather than their general proficiency, as a means of forming the different groupings.
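As a consistency check, the reported t value can be approximately reproduced from the published summary statistics alone. The group sizes below are an assumption (df = 87 implies that 89 candidates contributed vocabulary scores):

```python
from scipy import stats

# Approximate check of the reported group separation (t = 4.243, df = 87)
# using the published means and standard deviations; the 45/44 split is assumed.
res = stats.ttest_ind_from_stats(mean1=56.50, std1=9.43, nobs1=45,
                                 mean2=46.15, std2=13.21, nobs2=44)
print(res)  # t is roughly 4.3, p < .0001, consistent with the report
```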
5.1 Research question 1
Does the amount of strategic planning time provided make a difference to the scores awarded to candidates in Part 2 of the oral test?

The first research question addressed the issue of whether strategic planning time made a difference to the scores awarded to the candidates. Mean IELTS scores and standard deviations for the advanced and intermediate groups are presented in Table 3 below. The univariate analysis revealed no significant effects for either task or planning time at either level of proficiency on the global ratings, a null finding that was confirmed in the Facets analysis.


Planning time      none              1 minute          2 minutes
                   mean     SD       mean     SD       mean     SD
Intermediate       23.6     2.2      23.6     2.2      23.8     2.1
Advanced           23.9     2.2      24.0     1.9      24.0     2.0

Table 3: Total IELTS score (N=90)

Similarly, descriptive statistics presented in Tables 4 and 5 below show only minimal mean differences according to planning time on each component of the analytic rating scale at each proficiency level. The univariate F test again confirmed that there were no significant effects for either task or planning time.
Planning time      none              1 minute          2 minutes
                   mean     SD       mean     SD       mean     SD
Fluency            5.8      0.9      5.8      0.9      5.8      0.8
Lexis              6.0      0.7      5.9      0.7      6.0      0.7
Grammar            5.8      0.6      5.8      0.7      5.8      0.7
Pronunciation      6.0      0.3      6.1      0.3      6.1      0.3

Table 4: Analytic measures for intermediate candidates (N=45)

Planning time      none              1 minute          2 minutes
                   mean     SD       mean     SD       mean     SD
Fluency            6.1      0.7      6.1      0.7      6.0      0.7
Lexis              6.0      0.7      6.1      0.5      6.1      0.6
Grammar            5.9      0.7      5.9      0.5      5.9      0.6
Pronunciation      5.9      0.6      6.0      0.5      6.0      0.5

Table 5: Analytic measures for advanced candidates (N=45)
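The univariate analysis referred to above can be illustrated with a short sketch: a two-way ANOVA with task and planning condition as factors, run separately for each rating scale. The data, column names and factor labels below are hypothetical, and statsmodels is simply one convenient way to fit such a model.

    # Sketch (hypothetical data): two-way ANOVA testing for effects of
    # task and planning condition on one analytic rating scale.
    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.formula.api import ols

    scores = pd.DataFrame({
        "fluency": [5.8, 6.0, 5.5, 6.1, 5.9, 6.2, 5.7, 6.0, 5.8, 6.1, 5.6, 5.9],
        "task": ["subject", "book", "event"] * 4,
        "planning": ["none"] * 3 + ["1min"] * 3 + ["2min"] * 3 + ["none"] * 3,
    })

    # Main effects only; in the study neither factor reached significance.
    model = ols("fluency ~ C(task) + C(planning)", data=scores).fit()
    print(sm.stats.anova_lm(model, typ=2))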

5.2 Research question 2
Does the amount of strategic planning time make a difference to the quality of candidate discourse in Part 2 of the oral test?

The discourse analytic measures were used to determine whether planning time made a difference to the quality of the discourse in these tasks. As indicated above, the discourse of a subset of candidates was assessed on measures of fluency, accuracy and complexity. The fluency measures identified the percentage of fluent versus non-fluent speech, filled and unfilled pauses, and the duration of reformulations, repetitions and false starts (self-repairs). The results for the intermediate candidates are given in Table 6, and those for the advanced candidates in Table 7. The univariate analyses yielded no significant differences for either task or planning time across any of these measures.

Planning time                         none               1 minute           2 minutes
                                      mean      SD       mean      SD       mean      SD
% fluent vs non-fluent speech         65.20     10.87    66.83     9.96     65.85     10.73
Unfilled pauses                       25.96     10.86    27.59     12.80    27.76     14.89
Filled pauses                         9.58      4.79     8.07      5.91     8.47      4.19
Reformulations (duration, secs)       2.74      1.94     2.49      1.65     3.48      2.61
Repetitions (duration, secs)          3.64      2.70     3.78      1.60     3.64      2.74
False starts (duration, secs)         3.34      2.14     3.55      1.71     3.09      1.59

Table 6: Fluency measures for intermediate candidates (N=18)


Planning time                         none               1 minute           2 minutes
                                      mean      SD       mean      SD       mean      SD
% fluent vs non-fluent speech         69.49     9.41     68.83     9.26     70.13     7.53
Unfilled pauses                       22.49     9.05     21.62     8.58     21.69     8.61
Filled pauses                         6.79      3.92     7.17      2.58     7.72      3.88
Reformulations (duration, secs)       2.04      1.61     2.09      1.83     2.86      2.37
Repetitions (duration, secs)          3.58      5.42     2.80      2.30     3.47      3.18
False starts (duration, secs)         2.63      2.34     2.51      1.81     3.54      3.33

Table 7: Fluency measures for advanced candidates (N=18)
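As an illustration of how measures like those in Tables 6 and 7 can be derived, the sketch below computes them from a time-aligned annotation of a single (invented) speech sample. The annotation format is an assumption, as is the reading of the pause figures as percentages of speaking time; the study's actual coding procedure may have differed.

    # Sketch (invented annotation): each tuple is a labelled stretch of
    # speech with its duration in seconds.
    annotations = [
        ("fluent", 4.2), ("unfilled_pause", 0.8), ("fluent", 3.1),
        ("filled_pause", 0.4), ("reformulation", 1.2), ("fluent", 5.0),
        ("repetition", 0.9), ("false_start", 0.7), ("fluent", 2.6),
    ]

    total = sum(duration for _, duration in annotations)

    def percentage(label):
        # Share of total speaking time taken up by this label.
        return 100 * sum(d for lab, d in annotations if lab == label) / total

    print(f"% fluent speech:   {percentage('fluent'):.2f}")
    print(f"% unfilled pauses: {percentage('unfilled_pause'):.2f}")
    print(f"% filled pauses:   {percentage('filled_pause'):.2f}")

    # Self-repairs are reported as raw durations in seconds.
    for label in ("reformulation", "repetition", "false_start"):
        seconds = sum(d for lab, d in annotations if lab == label)
        print(f"{label}: {seconds:.1f} s")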

There were two measures of complexity: the proportion of dependent clauses per AS-unit, and the percentage of subordinate clauses per AS-unit. Once again, as shown in Tables 8 and 9, the mean scores for each planning condition were fairly close, although there does appear to be an increase in the number of subordinate clauses per AS-unit in the one minute planning condition for both intermediate and advanced candidates. However, this difference was not large enough to reach statistical significance.

Planning time                     none             1 minute         2 minutes
                                  mean     SD      mean     SD      mean     SD
Dependent clauses / AS-unit       1.4      0.4     1.5      0.5     1.5      0.5
Subordinate clauses / AS-unit     16.5     10.2    26.9     21.3    21.8     15.1

Table 8: Complexity measures for intermediate candidates (N=18)


Planning time                     none             1 minute         2 minutes
                                  mean     SD      mean     SD      mean     SD
Dependent clauses / AS-unit       1.7      0.2     1.7      0.2     1.7      0.4
Subordinate clauses / AS-unit     21.9     12.8    27.1     14.5    20.4     20.6

Table 9: Complexity measures for advanced candidates (N=18)

The global measures for accuracy (error-free AS-units and error-free clauses) are presented in Tables 10 and 11. Statistical analyses again indicated that there were no significant differences according to either task or the amount of planning time provided.
Planning time             none             1 minute         2 minutes
                          mean     SD      mean     SD      mean     SD
% error-free AS-units     26.5     21.7    27.3     23.1    24.8     22.7
% error-free clauses      40.4     9.9     40.1     21.4    39.1     12.0

Table 10: Accuracy measures for intermediate candidates (N=18)


Planning time             none             1 minute         2 minutes
                          mean     SD      mean     SD      mean     SD
% error-free AS-units     26.1     16.8    26.5     16.8    30.0     16.7
% error-free clauses      39.1     16.6    42.0     18.7    40.3     15.8

Table 11: Accuracy measures for advanced candidates (N=18)
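The complexity and accuracy indices in Tables 8 to 11 are simple ratios over the coded AS-units and clauses. Below is a minimal sketch, with invented counts for one speech sample; reading "subordinate clauses per AS-unit" as clauses per 100 AS-units is one plausible operationalisation of the percentages shown above.

    # Sketch (invented counts): complexity and accuracy indices computed
    # from the clause-level coding of one candidate's speech sample.
    from dataclasses import dataclass

    @dataclass
    class CodedSample:
        as_units: int
        dependent_clauses: int
        subordinate_clauses: int
        error_free_as_units: int
        clauses: int
        error_free_clauses: int

    s = CodedSample(as_units=40, dependent_clauses=58, subordinate_clauses=9,
                    error_free_as_units=11, clauses=95, error_free_clauses=38)

    print(s.dependent_clauses / s.as_units)          # dependent clauses per AS-unit
    print(100 * s.subordinate_clauses / s.as_units)  # subordinate clauses per 100 AS-units
    print(100 * s.error_free_as_units / s.as_units)  # % error-free AS-units
    print(100 * s.error_free_clauses / s.clauses)    # % error-free clauses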

To summarise, there were no significant differences in any of the score measures, or in the discourse measures, between groups depending upon whether they had had access to one or two minutes' planning time, or whether they had had no planning time. The implications of these results for continuing to include planning time in Part 2 of the IELTS test are discussed further below.

5.3 Research question 3
How do candidates perceive the usefulness and validity of strategic planning time?

Candidates were asked in the questionnaire whether they felt that planning time helped them, to which 89% responded positively. This was reiterated in the focus group interviews, where most of the students said that they found it easier when planning time was available: "Planning time is important: you can organise your idea and prepare what you want to say." One candidate stated that planning time is useful not only for organising ideas but also for providing time in which to calm nerves in the stressful testing situation. The comment section of the questionnaire provided some interesting insights. The candidates were asked to comment on three aspects of their performance in the questionnaire: a) whether planning time had helped them, and why or why not; b) which task they thought they had performed best on and why; and c) which task they had performed worst on and why. Very generally, the candidate responses can be broken down as follows.
Planning time was used to:          Number    % of candidates
Organise                            21        23.59
Improve ideas/think about topic     18        20.22
Improve speaking                    16        17.98
Structure                           5         5.61
Nervousness                         3         3.37
Other                               16        17.98
Negative                            10        11.24
Total                               89        100%

Table 12: Use of planning time by candidates

Negative responses, indicating that planning time was not useful or was even counterproductive, were few in number, although one candidate at the focus group interview suggested that having to prepare in front of the interviewer made him more anxious than when he spoke without any planning. Typical responses from the major categories are given below.
Organise

"planning time helps organise ideas" (candidate 11)
"planning lead to organise my ideas" (cand 23)
"helped me know what I have to say and what is first, second" (cand 34)
"I can decide on ideas and organise them" (cand 40)
"had time to organise topic and write down my idea" (cand 49)
"it can help to organise my ideas" (cand 54)
"I can prepare and organise my ideas to explain better" (cand 63)
"I can organise my thinking and ideas before speaking" (cand 64)
"helped me organise my idea" (cand 85)

Improve ideas/think about topic

"more time allows you to better use ideas" (cand 13)
"makes me brainstorm" (cand 15)
"think about the topic step-by-step" (cand 33)
"can prepare and think more to say about the topic" (cand 44)
"helps me think about the content of the topic" (cand 45)
"I spent time thinking about how to extend my topic" (cand 53)
"thought about more things to talk about" (cand 68)
"I can describe more about the topic" (cand 80)
Improve speaking

"can make speaking more clearly" (cand 6)
"because I can speak well planning the tasks" (cand 27)
"improve my speaking in English" (cand 43)
"I know what I am going to talk, making me more fluent" (cand 55)
"successful speech smooth" (cand 60)
"I didn't think about the topic, but it helped to speak calmly" (cand 70)
"helps speak clearly" (cand 79)
"I wrote the points and then I was able to speak clearly" (cand 82)
Structure

"organised sentences better" (cand 3)
"can think about how to make sentences correctly then word form" (cand 65)
"tried to write down words relating to my topic" (cand 66)
"better arrangement, grammar structure and fewer awkward sentences" (cand 73)
Perception of worst task

Interestingly, when asked to identify which of the three tasks they did worst on, many commented that they did worst on a particular task because they did not have time to prepare their response properly. (Task 1 was describing a subject they had studied; Task 2 was describing a book or movie; and Task 3 was describing an important event in their lives.)

"The last task. I wasn't able to take notes, so I had to think immediately" (cand 10, task 1, no planning)
"Subject. I had little time to prepare" (cand 13, task 1, no planning)
"The first one. The time was only enough to remember my event" (cand 20, task 2, no planning)
"The second one. No enough time even to read the topic" (cand 28, task 2, no planning)
"The first one due to no enough time" (cand 32, task 3, no planning)
"Event due to no time to plan" (cand 33, task 3, no planning)
"The last task due to not enough time to get ready" (cand 36, task 3, no planning)
"The last one. No time to prepare" (cand 39, task 3, no planning)
"Task 1. I had no time to organise or think" (cand 49, task 1, no planning)
"Subject. I had no time to think about my ideas" (cand 54, task 1, no planning)
"Subject. I had no time to think about the topic" (cand 60, task 1, no planning)
"The first one due to no enough time" (cand 65, task 2, no planning)

"The last one. Without preparation, I kept repeating the same information" (cand 82, task 3, no planning)
"Event. I had neither time nor ideas" (cand 85, task 3, no planning)

As can be seen from the examples above, this was often when the candidates had no planning time available, but this was not always the case, as the following examples show.

"Task 2: time was short and the topic was hard for me" (cand 4, task 2, one minute)
"Task 3. I didn't have time to think" (cand 5, task 3 (event), two minutes)
"The last one. Not enough time" (cand 26, task 3, one minute)
"Task 1. I had no time and didn't know what to say" (cand 78, task 1, one minute)
"Task 1. Not enough time" (cand 84, task 1, one minute)

Topic was another important factor which impacted on the activity. As can be seen from the responses below, the task they found most difficult was often one where they found the topic difficult, and the presence or absence of planning time was unlikely to make much difference.

"2nd. In the middle of that task, I couldn't talk about anything" (cand 5, task 2, one minute)
"Event. I don't have information about it" (cand 7, task 3, two minutes)
"Task 2, subject. I had no idea" (cand 14, task 2, one minute)
"The third one. I seldom watch movies" (cand 15, task 2, one minute)
"The first one. I couldn't think of anything" (cand 16, task 2, no planning)
"Book/movie. I had no idea about it" (cand 21, task 2, no planning)
"Subject. I've never thought about this" (cand 22, task 1, two minutes)
"Book. I was confused" (cand 35, task 2, two minutes)
"Subject. I have no idea about it, even when I use my own language" (cand 4, task 1, one minute)
"Subject. I've never thought about this task" (cand 43, task 1, one minute)
"Subject. I have no idea to describe a subject" (cand 44, task 1, one minute)
"First one. I have never thought of it before" (cand 47, task 1, no planning)
"Book. I had no idea about the book" (cand 51, task 2, one minute)
"The last one. I didn't know about the topic very well" (cand 59, task 2, one minute)
"Subject. I've never done this task before" (cand 71, task 1, two minutes)
"The third one. I had nothing to say" (cand 75, task 3, one minute)
"Movie/book. It was hard to describe a book, especially some Chinese book" (cand 79, task 2, two minutes)
"Event. This topic is too big" (cand 80, task 3, no planning)
5.3.1 Topic as a factor

Topic was also identified as a salient factor in responses to the question about which task they felt they had performed best on, with 56 of the candidates (62.9%) mentioning this, compared to 21 (23.5%) who identified the presence of planning time as the major determinant of their performance. However, 19 of the responses in the latter category indicated that it was the two minutes of planning time which they perceived to have made the difference and, of these, five mentioned both planning time and topic as contributing. Some typical responses are below.

"Maybe the movie because I am interested in it" (cand 3)
"Task 1. The topic was easier than others" (cand 6)
"Event. I have a lot of events in my life, that I can explain very well" (cand 11)
"Book. I just read the book recently, so I can remember" (cand 24)
"Task 2. My memorable event in my life that I never forget" (cand 41)
"Last one. It was a part of my life" (cand 47)
"Movie. There were many ideas in my mind" (cand 64)
"Event. It was the most important event in my life" (cand 72)
"Task 1. I got many points to talk about" (cand 81)
"Task 3. I was familiar with the topic" (cand 83)
"Movie. I'm interested in it" (cand 86)
5.3.2 Planning time as a factor

"Second one. Enough time to think" (cand 8, task 3, two minutes)
"Event. I had more time to prepare" (cand 13, task 3, two minutes)
"Task 3 about subject. I could use two mins to plan" (cand 16)
"The first one. I had time to prepare" (cand 28, task 1, two minutes)
"Task 3. I had more time to prepare" (cand 34, task 3, two minutes)
"Event. I had more time to think about my ideas and how to say them" (cand 54, task 3, two minutes)
"Event. I had more time to prepare. With two mins, you have enough time to think of it" (cand 59)
"The last one. I had time to prepare it" (cand 65, task 1, two minutes)
5.3.3 Planning and topic as a factor

"3: I had time to organise my ideas and the topic was familiar with me" (cand 4)
"Task 3. I had enough time to think and the test name 'talking about movies' was interesting" (cand 14)
"Task 3. I had more time to organise and the topic was easier for me" (cand 49)
"Subject. Enough time to prepare and familiar subject" (cand 30)

It should be pointed out that in the questionnaire almost 50% of the candidates claimed to be familiar with the tasks. Almost 15% of the candidates, in their responses to a subsequent question (Which task do you think you performed best on? Why?), reported that they had previously practised the tasks:

"Subject. I am familiar with it" (cand 21)
"Movie. I have done this topic before. I'm familiar with it" (cand 31)
"I practised it before" (cand 62)
"Subject. I prepared this topic before and am familiar with the vocabulary" (cand 80)

As discussed below, this may be a factor which contributes to the null findings presented in this study. Overall, from the candidates' responses above, it appears that although planning time does not seem to affect scores, or engender differences in the discourse measures investigated above, the majority of the candidates clearly found it useful, and identified difficulties when it was not present. Nevertheless, the topic of the task emerged as the most important factor in how candidates perceive themselves performing on these tasks.

5.4 Research question 4
How do candidates use their planning time?

The analysis of the strategy questionnaires revealed that the candidates used a variety of strategies when they had planning time available. The most common strategies used are given in Table 13; the six most popular strategies are marked with an asterisk.
Strategy                                                                  1 minute planning    2 minutes planning
                                                                          Number    %          Number    %
I tried to decide what topic I would talk about *                         72        80.9       68        76.4
I thought about the content and ideas needed for the task *               58        65.2       57        64.0
I read the task card again *                                              57        64.0       53        59.5
I thought about how to organise my ideas *                                53        59.5       61        68.5
I wrote down vocabulary on paper *                                        42        47.2       51        57.3
I wrote down useful sentences or phrases on paper *                       40        44.9       44        49.4
I thought about grammar (eg verb forms) in my head                        32        35.9       37        41.6
I made notes about grammar on paper                                       11        12.4       17        19.1
I practised useful sentences or phrases in my head                        33        37.1       42        47.2
I made a list of vocabulary in my head                                    26        29.2       31        34.8
I made a list of useful organising and/or linking language in my head     32        35.9       43        48.3
I wrote down useful organising and/or linking language on paper           22        24.7       35        39.3
I practised the task in my head                                           27        30.3       35        39.3
I practised pronunciation in my head                                      15        16.8       21        23.6
I wrote down ideas in my first language and then translated them          10        11.2       13        14.6
I thought about nothing                                                   12        13.5       9         10.1

Table 13: Use of strategies by candidates with one and two minutes planning time (* = six most popular strategies)

While the pattern of strategy use is similar for both the one and two minute planning conditions, there was a significant difference in the number of strategies candidates reported using when more planning time was available (t = 2.575, df = 88, p = 0.012).

5.5 Research question 5
What are the most effective strategies for the use of strategic planning time?

Given the results of the previous analyses, it was anticipated that there would be no significant correlations between the number of planning strategies and either the global or analytic scores given by the raters for each planning condition. This proved to be the case. A further analysis was undertaken which involved identifying the strategies as either macro-strategies (those concerned with topic, content and organisation) or micro-strategies (those concerned with language-level issues such as grammar, structure, vocabulary, etc). The last strategy (I thought about nothing), which attracted very few responses, was omitted. (See Figure 1.)

Macro-strategies
I read the task card again
I practised the task in my head
I practised pronunciation in my head
I tried to decide what topic I would talk about
I thought about how to organise my ideas
I thought about the content and ideas needed for the task
I wrote down ideas in my first language and then translated them

Micro-strategies
I thought about grammar (eg verb forms) in my head
I made notes about grammar on paper
I practised useful sentences or phrases in my head
I wrote down useful sentences or phrases on paper
I made a list of vocabulary in my head
I wrote down vocabulary on paper

Figure 1: Macro and micro strategies

Table 14 summarises strategy use by proficiency level and amount of planning time provided. While it appears that macro-strategies were used more frequently than micro-strategies under the one minute planning condition, and that the reverse was true when two minutes of planning was allowed, a Chi Square analysis revealed no significant differences across any of the groupings.
                            Macro-strategies used    Micro-strategies used
Intermediate, 1 minute      149                      138
Intermediate, 2 minutes     139                      166
Advanced, 1 minute          128                      115
Advanced, 2 minutes         148                      155

Table 14: Use of micro and macro strategies by group
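Using the observed counts from Table 14, the Chi Square comparison reported above can be reproduced along the following lines. The exact contingency layout used in the study is not specified, so the 4 x 2 arrangement here is one plausible choice.

    # Chi-square test on the macro- vs micro-strategy counts in Table 14.
    from scipy.stats import chi2_contingency

    observed = [
        [149, 138],  # Intermediate, 1 minute  (macro, micro)
        [139, 166],  # Intermediate, 2 minutes
        [128, 115],  # Advanced, 1 minute
        [148, 155],  # Advanced, 2 minutes
    ]

    chi2, p, dof, expected = chi2_contingency(observed)
    print(f"chi2 = {chi2:.2f}, df = {dof}, p = {p:.3f}")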

Finally, the results of a t-test analysis indicated that there was no significant difference in the mean level of performance between micro and macro planners (ie candidates who reported using more language-related strategies and those who reported focusing more on content and organisation).

6 DISCUSSION AND CONCLUSION

The null findings in this study mirror those of Iwashita et al (2001) and Wigglesworth (2000). As noted in our earlier literature review, test-based research has produced scant evidence of benefits for strategic planning time on the quality of the subsequent speaking performance. In this study, the lack of any effect for planning time was consistent across all measures used, including the different categories of the IELTS rating scale and the various discourse dimensions. While there was some trend towards greater discourse complexity (as measured by the ratio of subordinate clauses to AS-units) under the one minute planning condition for both intermediate and advanced level candidates, this finding did not prove to be statistically significant. It therefore seems reasonable to conclude that planning time has limited utility for Part 2 of the IELTS oral test, which uses very similar tasks.

Does this mean that the one minute of planning time currently available to prepare performance on Part 2 of the IELTS oral is superfluous? We think not. Candidates' expressed preference for planning time is worth taking notice of, if only for face validity reasons. Providing opportunities for planning may engender greater confidence in the IELTS Speaking Test on the part of candidates and, accordingly, greater acceptance of the scores obtained. However, while candidates' questionnaire and interview responses suggest that removing the currently offered one minute of planning time from IELTS task 2 is likely to be unwelcome, there is surely no point in extending the amount of planning time provided, since the longer (two minute) planning condition yielded no additional benefit on any performance measure. Even for complexity, the marginal gains observed under the one minute condition disappeared completely when two minutes of planning were provided.

As far as strategies are concerned, the results of this study (and indeed of most other studies of planning in a test situation) suggest that, while candidates appreciate being given planning time before speaking, they make poor use of it. There was no evidence that either the number of strategies or the particular type of strategy (macro or micro) used by learners made a significant difference to performance. Interviewer feedback after administering the test indicated that many learners appeared lost during the planning period, or were too anxious to make use of what they had prepared. This is supported by comments made by one of the focus group interviewees, who reported that the presence of the interviewer distracted him from his planning efforts. Another commented that she was unable to read the notes she had made.

Another possibility (also reflected in comments from focus group candidates) is that the benefits of planning are constrained by memory, and that improvements in the fluency, accuracy or complexity of the discourse cannot be sustained beyond the first few utterances of candidate speech. It seems likely that raters are also constrained by memory and that it is the final impression which informs their judgement. This would explain the lack of any impact for planning on scores and on the discourse measures, which are averaged across the whole stretch of performance.

It is also possible that in an unpressured monologic performance such as this one, candidates are able to monitor their speech as they go, and that this produces benefits even in the zero planning condition (see Yuan and Ellis 2005). The effects of strategic planning may therefore be discernible only under highly-pressured performance conditions where on-line planning is not possible. Further investigation of this may be warranted using the approach adopted by Yuan and Ellis (2005), in which on-line planning is sharply differentiated from pre-task planning and no planning by introducing a time limit for both the pre-task and no planning conditions, but providing unlimited time for the on-line planning condition.

Alternatively, it may be that there is a mismatch between the focus of candidate planning and what is valued by the IELTS rater and captured by our discourse measures.
The strategies which candidates reported using most frequently in both the one and two minute planning conditions were those directed to planning the message content, whereas the main focus of the IELTS analytic rating scale categories is on form or, to be more precise, on candidates' accuracy, fluency, pronunciation and the lexical resources they deploy. It might therefore be instructive to devise some means of measuring the propositional complexity of the discourse, to see if planning makes a difference to this dimension of performance (although it is debatable whether propositional complexity is of interest in a language testing context). It might also be useful to examine in more detail those individuals who benefit from planning, to determine what planning strategies these candidates engage in. However, to do so, we would need to devise a more fine-grained taxonomy of strategy use (see Ortega, 2005) and to gather rich think-aloud data (of the kind elicited by Sangarun, 2005). Such a study would be of interest to those involved in teaching test preparation courses and could form the basis for further research on the role of strategy training in boosting performance.

As pointed out above, many of the candidates reported having practised these or similar tasks before. It may be that planning is to no avail when candidates are already familiar with the task, particularly with simple tasks (like those used in this study) which require a description of, or commentary on, past experience. Indeed it may be that on a high-stakes test such as IELTS, some candidates have prepared so well that much of what we are really measuring on this test is pre-rehearsed rather than spontaneous unplanned discourse (although this study provides no direct evidence of such a phenomenon, which should be the subject of further research). On the other hand, we saw comments from a number of test-takers indicating they were unprepared for the topics, and in these instances, as was suggested earlier, planning time may do little to improve their performance.

The current study adds to the weight of evidence suggesting that planning time is not conducive to producing better performance in a testing environment. However, Xi's (2005) recent findings in relation to the graph task on the SPEAK exam nevertheless give some grounds for believing that planning time may interact with task type. Before definitive conclusions are drawn, further research needs to be conducted using more complex and cognitively demanding tasks. In this respect, integrated tasks, in which candidates may be required to integrate specific features of aural and written input in their oral response, may mean that planning is more beneficial than in other types of task. This would mitigate against, for example, the situation found in this study where some candidates found the topic difficult and this overrode the availability or not of planning time. In integrated tasks, where familiarity (or not) with the task is likely to be less of an issue since input material is given, planning would certainly be warranted, not only for reasons of fairness, but also on authenticity grounds.

In summary, the findings of this study offer positive support for the inclusion of a small amount of planning time on oral proficiency tests. However, the null findings on all measures of both rater evaluations and of the discourse suggest that the rationale for this relates more to fairness and face validity than to the ability of candidates to improve their performance as a result of planning time. As already noted, it is clear that further research into the effects of planning time in testing contexts is warranted if we are to fully understand the impact that the provision of planning time may have in oral proficiency tests, and the ways in which it may impact on the test construct.

REFERENCES

Bryman, A, 2001, Social Research Methods, Oxford University Press, Oxford

Bygate, M and Samuda, V, 2005, Integrative planning through the use of task-repetition in Planning and task performance in a second language, ed R Ellis, John Benjamins, Amsterdam and Philadelphia

Crookes, G, 1989, Planning and interlanguage variation, Studies in Second Language Acquisition, vol 11, pp 183-199

Elder, C and Iwashita, N, 2005, Planning for test performance: What difference does it make? in Planning and task performance in a second language, ed R Ellis, John Benjamins, Amsterdam and Philadelphia

Elder, C, Iwashita, N and McNamara, T, 2002, Estimating the difficulty of oral proficiency tasks: what does the test-taker have to offer?, Language Testing, vol 19, no 4, pp 347-368

Ellis, R, 1987, Interlanguage variability in narrative discourse: style shifting in the use of the past tense, Studies in Second Language Acquisition, vol 9, no 1, pp 1-20

Ellis, R, ed, 2005, Planning and task performance in a second language, John Benjamins, Amsterdam and Philadelphia

Ellis, R and Yuan, F, 2005, The effects of careful within-task planning on oral and written task performance in Planning and task performance in a second language, ed R Ellis, John Benjamins, Amsterdam and Philadelphia

Foster, P, 1996, Doing the task better: how planning time influences students' performance in Challenge and change in language teaching, eds J Willis and D Willis, Heinemann, London

Foster, P and Skehan, P, 1996, The influence of planning and task-type on second language performance, Studies in Second Language Acquisition, vol 18, pp 299-323

Foster, P, Tonkyn, A and Wigglesworth, G, 2001, Measuring spoken language: a unit for all reasons, Applied Linguistics, vol 21, no 3, pp 354-375

Greenbaum, TL, 1998, The handbook for focus group research, Sage, Thousand Oaks, California

Iwashita, N, McNamara, T and Elder, C, 2001, Can we predict task difficulty in an oral proficiency test? Exploring the potential of an information processing approach to task design, Language Learning, vol 51, no 3, pp 401-436

Linacre, M, 1990, FACETS: computer program for many-faceted Rasch measurement, Mesa Press, Chicago

Mehnert, U, 1998, The effects of different lengths of time for planning on second language performance, Studies in Second Language Acquisition, vol 20, no 1, pp 83-108

Ortega, L, 1999, Planning and focus on form in L2 oral performance, Studies in Second Language Acquisition, vol 21, pp 109-148

Ortega, L, 2005, What do learners plan? Learner-driven attention to form during pre-task planning in Planning and task performance in a second language, ed R Ellis, John Benjamins, Amsterdam and Philadelphia

Rutherford, K, An investigation into the effects of planning on oral production in a second language, unpublished masters dissertation, University of Auckland, New Zealand
Sangarun, J, 2005, The effects of focusing on meaning and form in strategic planning in Planning and task performance in a second language, ed R Ellis, John Benjamins, Amsterdam and Philadelphia

Skehan, P, 1996, A framework for the implementation of task-based instruction, Applied Linguistics, vol 17, pp 38-62

Skehan, P, 1998, A cognitive approach to language learning, Oxford University Press, Oxford

Skehan, P and Foster, P, 1997, Task type and task processing conditions as influences on foreign language performance, Language Teaching Research, vol 1, no 3, pp 185-211

Skehan, P and Foster, P, 1999, The influence of task structure and processing conditions on narrative retellings, Language Learning, vol 49, no 1, pp 93-120

Swain, M, 1985, Large-scale communicative language testing: A case study in Testing, Pergamon Press, Oxford, pp 35-46

Swain, M, 1993, The output hypothesis: Just speaking and writing aren't enough, The Canadian Modern Language Review, vol 50, pp 158-164

Tavakoli, P and Skehan, P, 2005, Strategic planning, task structure and performance testing in Planning and task performance in a second language, ed R Ellis, John Benjamins, Amsterdam and Philadelphia

Wendel, JN, 1997, Planning and second language narrative production, unpublished doctoral dissertation, Temple University, Japan

Wigglesworth, G, 1997, An investigation of planning time and proficiency level on oral test discourse, Language Testing, vol 14, no 1, pp 101-122

Wigglesworth, G, 2000, Issues in the development of oral tasks for competency-based assessments of second language performance in Studies in immigrant English language assessment, Vol 1, Research series 11, ed G Brindley, National Centre for English Language Teaching and Research, Macquarie University, Sydney, pp 81-124

Wigglesworth, G, 2001, Influences on performance in task-based oral assessments in Task based learning, eds M Bygate, P Skehan and M Swain, Addison Wesley Longman, London, pp 186-209

Xi, X, 2005, Do visual chunks and planning impact performance on the graph description task in the SPEAK exam?, Language Testing, vol 22, no 4, pp 463-508

Yuan, F and Ellis, R, 2003, The effects of pre-task planning and on-line planning on fluency, complexity and accuracy in L2 monologic oral production, Applied Linguistics, vol 24, no 1, pp 1-27

APPENDIX 1: TASK PROMPTS PROVIDED FOR CANDIDATES


TASK 1: SUBJECT

Describe a subject you have studied which has had a great influence on your life. You should say:
  what the subject was
  where you learned the subject
  who your teacher was
and explain how it has influenced your life.

TASK 2: BOOK OR MOVIE

Talk about a book or a movie that you found interesting. You should say:
  what the book or movie was about
  who the main characters were
  what you liked and/or disliked about it
and explain why you found the book or movie interesting.

TASK 3: EVENT

Describe an event in your life (eg holiday or childhood experience) which made a great impression on you. You should say:
  what the event was
  where and when it took place
  who you were with
and explain why it made a great impression on you.

APPENDIX 2: TASK ADMINISTRATION INSTRUCTIONS FOR INTERVIEWER


When there is NO PLANNING TIME you should say the following:

Now, I'm going to give you a topic and I'd like you to talk about it for one to two minutes. I'd like you to start talking straight away. Do you understand? Here's your topic [hand over the relevant task card and give students 15 seconds to read the card]. I'd like you to talk about X (mention the topic of the task). All right? Remember you have one to two minutes for this, so don't worry if I stop you. I'll tell you when the time is up. Can you start speaking now please?

When there is ONE MINUTE OF PLANNING TIME you should say the following:

Now, I'm going to give you a topic and I'd like you to talk about it for one to two minutes. Before you talk, you'll have one minute to think about what you are going to say. You can make some notes if you wish. Do you understand? Here's some paper and a pen for making notes [hand over spare paper and a pencil] and here's your topic [hand over the relevant task card]. I'd like you to talk about X (mention the topic of the task).

Allow up to a minute for preparation, but the candidate can start earlier if he/she wants. When the time is up or the student signals readiness to begin you should say:

All right? Remember you have one to two minutes for this, so don't worry if I stop you. I'll tell you when the time is up. Can you start speaking now please?

When there is TWO MINUTES OF PLANNING TIME you should say the following:

Now, I'm going to give you a topic and I'd like you to talk about it for one to two minutes. Before you talk, you'll have two minutes to think about what you are going to say. You can make some notes if you wish. Do you understand? Here's some paper and a pen for making notes [hand over spare paper and a pencil] and here's your topic [hand over the relevant task card]. I'd like you to talk about X (mention the topic of the task).

Allow up to two minutes for preparation, but the candidate can start earlier if he/she wants. When the time is up or the student signals readiness to begin you should say:

All right? Remember you have one to two minutes for this, so don't worry if I stop you. I'll tell you when the time is up. Can you start speaking now please?

When the student has finished the task you should retrieve the notes he/she has made and attach your own notes (if relevant) to them. Say:
Thank you very much.

APPENDIX 3: MARKING SHEET


Student's number: _____________________________
Interviewer name: _____________________________

Please give 4 ratings for each task, using the normal IELTS criteria, namely:

FC  = Fluency and coherence
LR  = Lexical resources
GRA = Grammatical range and accuracy
P   = Pronunciation

          FC      LR      GRA     P
Task 1
Task 2
Task 3

Tasks are to be rated one at a time in order of performance.

APPENDIX 4: FOCUS GROUP INTERVIEW QUESTIONS


1. Did you think the tasks used for this study were a good measure of your ability to use language in university settings? (Give reasons for your answer.)
2. Did you find planning time made the tasks easier? If no, please explain why. If yes, indicate how you see the benefits of planning time (ie how did it help you?).
3. Which planning activities were most helpful in performing the task?
4. Do you think you used the planning time as well as you could have? Say why/why not.
5. If you took notes during the planning session, did you use these when performing the task? If yes, did having the notes in front of you help you?
6. Have you ever been given instruction/training on how to use pre-task planning time? If yes, how useful was it? If no, do you think it would help to have this kind of training?

APPENDIX 5: STUDENT QUESTIONNAIRE


A Task Feedback

1. Have you practised any of the three tasks you have just done before? (Tick yes or no)

   Talking about a SUBJECT      Yes / No
   Talking about a BOOK/MOVIE   Yes / No
   Talking about an EVENT       Yes / No

2. Have any of your teachers taught you how to plan before speaking?   Yes / No

3. For two of the three tasks you have just performed some planning time was given. Indicate (by ticking all the relevant boxes) which of the following things you did during your planning time before you started speaking.

   TASK NAME                                                               With 1 minute   With 2 minutes
   I read the task card again
   I thought about grammar (eg verb forms) in my head
   I made notes about grammar on paper
   I practised useful sentences or phrases in my head
   I wrote down useful sentences or phrases on paper
   I made a list of vocabulary in my head
   I wrote down vocabulary on paper
   I made a list of useful organising and/or linking language in my head
   I wrote down useful organising and/or linking language on paper
   I practised the task in my head
   I practised pronunciation in my head
   I tried to decide what topic I would talk about
   I thought about how to organise my ideas
   I thought about the content and ideas needed for the task
   I wrote down ideas in my first language and then translated them
   I thought about nothing
   I did other things (please tell us what you did)

Do you think the planning helped you? (Yes / No) Explain why/why not.
Which task do you think you performed best on? Why?
Which of the three tasks do you think you performed worst on? Why?

2. An examination of the rating process in the revised IELTS Speaking Test


Author: Annie Brown, Ministry of Higher Education and Scientific Research, United Arab Emirates

Grant awarded: Round 9, 2003

This study examines the validity of the analytic rating scales used to assess performance in the IELTS Speaking Test, through an analysis of verbal reports produced by IELTS examiners when rating test performances and their responses to a subsequent questionnaire.
ABSTRACT

In 2001 the IELTS interview format and criteria were revised. A major change was the shift from a single global scale to a set of four analytic scales focusing on different aspects of oral proficiency. This study is concerned with the validity of the analytic rating scales. Through a combination of stimulated verbal report data and questionnaire data, this study seeks to analyse how IELTS examiners interpret the scales and how they apply them to samples of candidate performance. This study addresses the following questions: How do examiners interpret the scales and what performance features are salient to their judgements? How easy is it for examiners to differentiate levels of performance in relation to each of the scales? What problems do examiners identify when attempting to make rating decisions?

Experienced IELTS examiners were asked to provide verbal reports after listening to, and rating, a set of interviews. Each examiner also completed a detailed questionnaire about their reactions to the approach to assessment. The data were transcribed, coded and analysed according to the research questions guiding the study.

Findings showed that, in contrast with their use of the earlier holistic scale (Brown, 2000), the examiners adhered closely to the descriptors when rating. In general, the examiners found the scales easy to interpret and apply. Problems that they identified related to overlap between the scales, a lack of clear distinction between levels, and the inference-based nature of some criteria. Examiners reported the most difficulty with the Fluency and Coherence scale, and there were concerns that the Pronunciation scale did not adequately differentiate levels of proficiency.

CONTENTS
1 Rationale for the study
2 Rating behaviour in oral interviews
3 Research questions
4 Methodology
  4.1 Data
  4.2 Score data
  4.3 Coding
5 Results
  5.1 Examiners' interpretation of the scales and levels within the scales
    5.1.1 Fluency and coherence
    5.1.2 Lexical resource
    5.1.3 Grammatical range and accuracy
    5.1.4 Pronunciation
  5.2 The discreteness of the scales
  5.3 Remaining questions
    5.3.1 Additional criteria
    5.3.2 Irrelevant criteria
    5.3.3 Interviewing and rating
6 Discussion
7 Conclusion
References
Appendix 1: Questionnaire

AUTHOR BIODATA: ANNIE BROWN Annie Brown is Head of Educational Assessment in the National Admissions and Placement Office (NAPO) of the Ministry of Higher Education and Scientific Research, United Arab Emirates. Previously, and while undertaking this study, she was Senior Research Fellow and Deputy Director of the Language Testing Research Centre at The University of Melbourne. There, she was involved in research and development for a wide range of language tests and assessment procedures, and in language program evaluation. Annie's research interests focus on the assessment of speaking and writing, and the use of Rasch analysis, discourse analysis and verbal protocol analysis. Her books include Interviewer Variability in Oral Proficiency Interviews (Peter Lang, 2005) and the Language Testing Dictionary (CUP, 1999, co-authored with colleagues at the Language Testing Research Centre). She was winner of the 2004 Jacqueline A Ross award for the best PhD in language testing, and winner of the 2003 ILTA (International Language Testing Association) award for the best article on language testing.

1 RATIONALE FOR THE STUDY

The IELTS Speaking Test was re-designed in 2001 with a change in format and assessment procedure. These changes responded to two major concerns: firstly, that a lack of consistency in interviewer behaviour in the earlier unscripted interview could influence candidate performance and hence ratings outcomes (Taylor, 2000); and secondly, that there was a degree of inconsistency in interpreting and applying the holistic band scales which were being used to judge performance on the interview (Taylor and Jones, 2001).

A number of studies of interview discourse informed the decision to move to a more structured format. These included Lazaraton (1996a, 1996b) and Brown and Hill (1998), which found that despite training, examiners had their own unique styles, and they differed in the degree of support they provided to candidates. Brown and Hill's study, which focused specifically on behaviour in the IELTS interview, indicated that these differences in interviewing technique had the potential to impact on ratings achieved by candidates (see also Brown, 2003, 2004). The revised IELTS interview was designed with a more tightly scripted format (using interlocutor frames) to ensure that there would be less individual difference among examiners in terms of interviewing technique. A study by Brown (2004) conducted one year into the operational use of the revised interview found that generally this was the case.

In terms of rating consistency, a study of examiner behaviour on the original IELTS interview (Brown, 2000) revealed that while examiners demonstrated a general overall orientation to features within the band descriptors, they appeared to interpret the criteria differently and included personal criteria not specified in the band scales (in particular, interactional aspects of performance, and fluency). In addition, it appeared that different criteria were more or less salient to different raters. Together these led to ratings variability. Taylor and Jones (2001) reported that it was felt that a clearer specification of performance features at different proficiency levels might enhance standardisation of assessment (2001: 9).

In the revised interview, the holistic scale was replaced with four analytic scales. This study seeks to validate the new scales through an examination of the examiners' cognitive processes when applying the scales to samples of test performance, and a questionnaire which probes the rating process further.

2 RATING BEHAVIOUR IN ORAL INTERVIEWS

There has been growing interest over the last decade in examining the cognitive processes employed by examiners of second language production through the analysis of verbal reports produced during, or immediately after, performing the rating activity. Most studies have been concerned with the assessment of writing (Cumming, 1990; Vaughan, 1991; Weigle, 1994; Delaruelle, 1997; Lumley, 2000). But more recently, the question of how examiners interpret and apply scales in assessments of speaking has been addressed (Meiron, 1998; Brown, 2000; Brown, Iwashita and McNamara, 2005). These studies have investigated questions such as: how examiners assign a rating to a performance; what aspects of the performance they privilege; whether experienced or novice examiners rate differently; the status of self-generated criteria; and how examiners deal with problematic performances.

In her examination of the functioning of the now-retired IELTS holistic scale, Brown (2000) found that the holistic scale was problematic for a number of reasons. Different criteria appeared to be more or less salient at different levels; for example, comprehensibility and production received greater attention at the lower levels and were typically commented on only where there was a problem. Brown found that different examiners attended to different aspects of performance, privileging certain features over others in their assessments. Also, some examiners were found to be more
performance-oriented, focusing narrowly on the quality of performance in relation to the criteria, while others were reported to be more inference-oriented, drawing conclusions about candidates' ability to cope in other contexts. The most recently trained examiner focused more exclusively on features referred to in the scales and made fewer inferences about candidates. In the present study, of course, the question of weighting should not arise, although examiners may have views on the relative importance of the criteria.

A survey of examiner reactions to the previous IELTS interview and holistic rating procedure (Merrylees and McDowell, 1999) found that most Australian examiners would prefer a profile scale. Another question then, given the greater detail in the revised, analytic scales, is whether examiners find them easier to apply than the previous one, or whether the additional detail and difficulty distinguishing the scales makes the assessment task more problematic.

Another question of concern when validating proficiency scales is the ease with which examiners are able to distinguish levels. While Merrylees and McDowell (1999) found that around half the examiners felt the earlier holistic scale used in the IELTS interview was able to distinguish clearly between proficiency levels, Taylor and Jones reported concern as to how well the existing holistic IELTS rating scale and its descriptors were able to articulate key features of performance at different levels or bands (2001: 9). Again, given the greater detail and narrower focus of the four analytic scales compared with the single holistic one, the question arises of whether this allows examiners to better distinguish levels. A focus in the present study, therefore, is the degree of comfort that examiners report when using the analytic scales to distinguish candidates at different levels of proficiency.

When assessing performance in oral interviews, in addition to a range of linguistic and production-related features, examiners have also been found to attend to less narrowly linguistic aspects of the interaction. For example, in a study of Cambridge Assessment of Spoken English (CASE) examiners' perceptions, Pollitt and Murray (1996) found that in making judgements of candidates' proficiency, examiners took into account perceived maturity and willingness or reluctance to converse. In a later study of examiners' orientations when assessing performances on SPEAK (Meiron, 1998), despite it being a non-interactive test, Meiron found that examiners focused on performance features such as creativity and humour, which she described as reflecting a perspective on the candidate as an interactional partner. Brown's analysis of the IELTS oral interview (2000) also found that examiners focused on a range of performance features, both specified and self-generated, and these included interactional skills, in addition to the more explicitly defined structural, functional and topical skills. Examiners noted candidates' use of interactional moves such as challenging the interviewer, deflecting questions and using asides, and their use of communication strategies such as the ability to self-correct, ask for clarification or use circumlocution. They also assessed candidates' ability to manage a conversation and expand on topics.
Given the use in the revised IELTS interview of a scripted interview and a set of four linguistically focused analytic scales, rather than the more loosely worded and communicatively-oriented holistic one in the earlier format, the question arises of the extent to which examiners still attend to, and assess, communicative or interactional skills, or any other features not included in the scales.

Another factor which has been found to impact on ratings in oral interviews is interviewer behaviour. Brown (2000, 2003, 2004) found that in the earlier unscripted quasi-conversational interviews, examiners took notice of the interviewer and even reported compensating when awarding ratings for what they felt was inappropriate interviewer behaviour or poor technique. This finding supported those of Morton, Wigglesworth and Williams (1997) and McNamara and Lumley (1997), whose analyses of score data combined with examiners' evaluations of interviewer competence also found that examiners compensated in their ratings for less-than-competent interviewers. Pollitt and Murray (1993) found
that examiners made reference to the degree of encouragement interviewers gave candidates. While it is perhaps to be expected that interviewer behaviour might be salient to examiners in interviews which allow interviewers a degree of latitude, the fact that the raters in Morton et al's study, which used a scripted interview (the access: oral interview), took the interviewer into account in their ratings, raises the question of whether this might also be the case in the current IELTS interview, which is also scripted, in those instances where interviews are rated from tape.

3 RESEARCH QUESTIONS

On the basis of previous research, and in the interests of seeking validity evidence for the current oral assessment process, this study focuses on the interpretability and ease of application of the revised, analytic scales, addressing the following sets of questions:

1. What performance features do examiners explicitly identify as evidence of proficiency in relation to each of the four scales? To what extent do these features reflect the criteria (key indicators) described in the training materials? Do examiners attend to all the features and indicators? Do they attend to features which are not included in the scales? How easy do they find it to apply the scales to samples of candidate performance? How easy do they find it to distinguish between the four scales?

2. What is the nature of oral proficiency at different levels of proficiency in relation to the four assessment categories? How easy is it for examiners to distinguish between adjacent levels of proficiency on each of the four scales? Do they believe certain criteria are more or less important at different levels? What problems do they identify in deciding on ratings for the samples used in the study?

3. Do examiners find it easy to follow the assessment method stipulated in the training materials? What problems do they identify?

4 METHODOLOGY

4.1 Data

The research questions were addressed through the analysis of two complementary sets of data: verbal reports produced by IELTS examiners as they rated taped interview performances, and the same examiners' responses to a questionnaire which they completed after they had provided the verbal reports.

The verbal reports were collected using the stimulated recall methodology (Gass and Mackey, 2000). In this approach, the reports are produced retrospectively, immediately after the activity, rather than concurrently, as the online nature of speaking assessment makes this more appropriate. The questionnaire was designed to supplement the verbal report data and to follow up any rating issues relating to the research questions which were not likely to be addressed systematically in the verbal reports. Questions focused on the examiners' interpretations of, application of, and reactions to, the scales. Most questions required descriptive (short answer) responses. The questionnaire is included as Appendix 1.

Twelve IELTS interviews were selected for use in the study: three at each of Bands 5 to 8. (Taped interviews at Band 4 level and below were too difficult to follow due to poor intelligibility and hence, interviews from Band 5 and above only were used.) The interviews were drawn from an existing dataset of taped operational IELTS interviews used in two earlier analyses: one of interviewer behaviour (Brown, 2003) and one of candidate performance (Brown, 2004). Most of the interviews were
conducted in Australia, New Zealand, Indonesia and Thailand in 2001-2, although the original set was supplemented with additional tapes provided by Cambridge ESOL (test centres unknown). Selection for the present study was based on ratings awarded in Browns 2004 study, averaged across three examiners and the four criteria, and rounded to the nearest whole band. Of the 12 interviews selected, seven involved male candidates and five female. The candidates were from the following countries: Bangladesh, Belgium, China (3), Germany, India, Indonesia (2), Israel, Korea and Vietnam. Table 1 shows candidate information and ratings.
Interview   Sex   Country      Averaged ratings
1           M     Belgium      8
2           F     Bangladesh   8
3           M     Germany      8
4           M     India        7
5           F     Israel       7
6           M     Indonesia    7
7           M     Vietnam      6
8           M     China        6
9           F     China        6
10          M     China        5
11          F     Indonesia    5
12          F     Korea        5

Table 1: Interview data
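As an illustration of the selection arithmetic described above, the following Python sketch averages one interview's ratings across three examiners and the four criteria and rounds to the nearest whole band. The ratings shown are hypothetical, not data from the study.

    # A minimal sketch (hypothetical numbers) of the selection procedure:
    # ratings from three examiners on the four criteria are averaged and
    # rounded to the nearest whole band.
    ratings = [          # one row per examiner: [F&C, LR, GRA, P]
        [7, 6, 6, 8],
        [7, 7, 6, 8],
        [6, 7, 6, 8],
    ]
    scores = [band for examiner in ratings for band in examiner]
    selection_band = round(sum(scores) / len(scores))   # 82 / 12 = 6.83 -> 7
    print(selection_band)                               # 7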

Six expert examiners (as identified by the local IELTS administrator) participated in the study. Expertise was defined in terms of having worked with the revised Speaking Test since its inception, and having demonstrated a high level of accuracy in rating. Each examiner provided verbal reports for five interviews (see Table 2); Examiner 4 provided only four. Prior to data collection the examiners were given training and practice in the verbal report methodology. The verbal reports took the following form. First, the examiners listened to the taped interview and referred to the scales in order to make an assessment. When the interview had finished, they stopped the tape and wrote down the score they had awarded for each of the criteria. They then recorded their explanation of why they had awarded these scores. Next they re-played the interview from the beginning, stopping the tape whenever they could comment on some aspect of the candidate's performance. Each examiner completed a practice verbal report before commencing the main study. After finishing the verbal reports, all of the examiners completed the questionnaire.


[Table 2 is a matrix of the interviews (1-12) against the examiners (1-6), with an X marking each of the 29 examiner-interview pairings.]

Table 2: Distribution of interviews

4.2 Score data

There were a total of 29 assessments of the 12 candidates. The mean score and standard deviation across all of the ratings for each of the four scales are shown in Table 3. The mean score was highest on Pronunciation, followed by Fluency and coherence, Lexical resource and, finally, Grammatical range and accuracy. The standard deviation was smaller for Pronunciation than for the other three scales, reflecting the narrower range of band levels used by the examiners: only three Pronunciation ratings were lower than Band 6.
Scale                            Mean   Standard deviation
Fluency and coherence            6.28   1.53
Lexical resource                 6.14   1.60
Grammatical range and accuracy   5.97   1.52
Pronunciation                    6.45   1.30

Table 3: Mean ratings
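The figures in Table 3 are simple descriptive statistics over the ratings awarded on each scale. The following Python sketch shows the computation for hypothetical data; the report does not state whether the population or the sample formula for the standard deviation was used, so the population form is assumed here.

    # A minimal sketch (hypothetical ratings) of the Table 3 statistics:
    # the mean and standard deviation of all band ratings on each scale.
    from statistics import mean, pstdev

    ratings_by_scale = {
        "Fluency and coherence": [5, 6, 6, 7, 8],
        "Pronunciation":         [6, 6, 6, 7, 8],
    }
    for scale, bands in ratings_by_scale.items():
        print(f"{scale}: mean = {mean(bands):.2f}, sd = {pstdev(bands):.2f}")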

4.3 Coding

After transcription, the verbal report data were broken up into units, a unit being a turn: a stretch of talk bounded by replays of the interview. Each transcript consisted of several units, the first being the summary of ratings and the remainder being the talk produced during the stimulated recall. At times, examiners produced an additional turn at the end, where they added information not already covered, or reiterated important points.

Before the data were analysed, the scales and the training materials were reviewed, specifically the key indicators and the commentaries on the student samples included in the examiner training package (UCLES, 2001). A comprehensive description of the aspects of performance that each scale and level addressed was built up from these materials. Next, the verbal report data were coded in relation to the criteria. Two coders, the researcher and a research assistant, undertook the coding, with a proportion of the data being double-coded to ensure inter-coder reliability (over 90% agreement on all scales).

The coding was undertaken in two stages. First, each unit was coded according to which of the four scales the comment addressed: Fluency and coherence, Lexical resource, Grammatical range and accuracy, or Pronunciation. Where more than one was addressed, the unit was double-coded. Additional categories were created, namely:

- Score, where the examiner simply referred to the rating but did not otherwise elaborate on the performance
- Other, where the examiner referred to criteria or performance features not included in the scales or other training materials
- Aside, where the examiner made a relevant comment but one which did not directly address the criteria
- Uncoded, where the examiner made a comment which was irrelevant to the study or was inaudible.

Anomalies were resolved through discussion between the two coders. Once the data had been sorted according to these categories, a second level of coding was carried out for each of the four main assessment categories. Draft sub-coding categories were developed for each scale, based on the analysis of the scale descriptors and examiner training materials. These categories were then applied and refined through a process of trial and error, with frequent discussion of problem cases. Once coded, the data were sorted in various ways and reviewed in order to answer the research questions guiding the study.

Of the 837 comments coded to the four scales, 28% were coded as Fluency and coherence, 26% as Lexical resource, 29% as Grammatical range and accuracy and 17% as Pronunciation. Examiner 1 produced 18% of the comments; Examiner 2, 17%; Examiner 3, 10%; Examiner 4, 14%; and Examiners 5 and 6, 20% each.

The questionnaire data were also transcribed and analysed in relation to the research questions guiding the study. Where appropriate, the reporting of results refers to both sets of data.
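The inter-coder reliability figure reported above is a simple percentage agreement. The Python sketch below illustrates the computation on hypothetical coding labels.

    # A minimal sketch (hypothetical labels) of the percentage-agreement
    # check: two coders' scale assignments are compared unit by unit.
    coder_a = ["F&C", "LR", "GRA", "P", "F&C", "Other", "GRA", "LR", "F&C", "P"]
    coder_b = ["F&C", "LR", "GRA", "P", "LR",  "Other", "GRA", "LR", "F&C", "P"]

    matches = sum(a == b for a, b in zip(coder_a, coder_b))
    print(f"{100 * matches / len(coder_a):.0f}% agreement")   # 90% agreement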

5 RESULTS

5.1 Examiners' interpretation of the scales and levels within the scales

In this section, the analysis of the verbal report data and relevant questionnaire data is drawn upon to illustrate, for each scale, the examiners' interpretations of the criteria and of the levels within them. Subsequent sections focus on the discreteness of the scales and on the remaining questionnaire issues.
5.1.1 Fluency and coherence

5.1.1a Understanding the fluency and coherence scale

The Fluency and coherence scale appeared to be the most complex in that the scale, and examiners' comments, covered a larger number of relatively discrete aspects of performance than the other scales: hesitation, topic development, length of turn, and use of discourse markers. The examiners referred often to the amount of hesitation, repetition and restarts, and (occasionally) the use of fillers. They noted uneven fluency, typically excusing early disfluency as nerves. They also frequently attempted to infer the cause of hesitation, at times attributing it to linguistic limitations (a search for words or the right grammar) and at other times to non-linguistic causes: to candidates' thinking about the content of their response, to their personality (shyness), to their cultural background, or to a lack of interest in the topic (having nothing to say). Often examiners were unsure whether language or content was the cause of disfluency but, because it was relevant to the rating decision (Extract 1), they struggled to decide. In fact, this struggle appeared to be a major problem, as it was commented on several times, both in the verbal reports and in the responses to the questionnaire.
Extract 1

And again with the fluency he's ready, he's willing, there's still some hesitation. And it's a bit like 'guess what I'm thinking'. It annoys me between 7 and 8 here, where it says (I think I alluded to it before) is it content related or is it grammar and vocab or whatever? It says here in 7, 'some hesitation accessing appropriate language'. And I don't know whether it's content or language for this bloke. So you know I went down because I think sometimes it is language, but I really don't know. So I find it difficult to make that call and that's why I gave it a 7, because I called it that way rather than content related, so being true to the descriptor.

In addition to the amount or frequency of hesitation and its possible causes, examiners frequently also considered the impact of too much hesitancy on their understanding of the candidate's talk. Similarly, they noted the frequency of self-correction, repetition and restarts, and their impact on clarity. Examiners distinguished repair of the content of speech ('clarifying the situation', 'withdrawing her generalisation'), which they saw as native-like, even evidence of sophistication, from repair of grammatical or lexical errors. This latter type of repair was at times interpreted as evidence of limitations in language, but at other times was viewed positively as a communication strategy or as evidence of self-monitoring or linguistic awareness. Like repair, repetition could be interpreted in different ways. Typically it was viewed as unhelpful (for example, one examiner described a candidate's repetition of the interviewer's question as 'tedious'), as reducing the clarity of the candidate's speech, or as indicative of limitations in vocabulary, but occasionally it was evaluated positively, as a stylistic feature (Extract 2).
Extract 2

So here I think she tells us it's like she's really got control of how to, not tell a story, but her use of repetition is very good. It's not just simple use; it's kind of drawing you in... I like to do this, I like to do that... it's got a kind of appealing, rhythmic quality to it. It's not just somebody who's repeating words because they can't think of others; she knows how to control repetition for effect, so I put that down for a feature of fluency.

Another aspect of the Fluency and coherence scale that examiners attended to was the use of discourse markers and connectives. They valued the use of a range of discourse markers and connectives, and evaluated negatively their incorrect use and the overuse or repetitive use of only a few basic ones.

Coherence was addressed in terms of (a) the relevance or appropriateness of candidates' responses and (b) topic development and organisation. Examiners referred to candidates being on task or not (answering the question), and to the logic of what they were saying. They commented negatively on poor topic organisation or development, particularly the repetition of ideas (going around in circles) or the introduction of off-topic information (going off on a tangent), and on the impact of this on the coherence or comprehensibility of the response. At times examiners struggled to decide whether poor topic development was a content issue or a language issue. It was also noted that topic development favours more mature candidates.

A final aspect of Fluency and coherence that examiners mentioned was candidates' ability, or willingness, to produce extended turns. They made comments such as 'able to keep going' or 'truncated'. The use of terms such as 'struggling' showed their attention to the amount of effort involved in producing longer turns. They also commented unfavourably on speech which was disjointed or consisted of sentence fragments, on candidates who kept adding phrases to a sentence, and on candidates who ran too many ideas together into one sentence.


5.1.1b Determining levels within the fluency and coherence scale

To determine how examiners coped with the different levels within the Fluency and coherence scale, the verbal report data were analysed for evidence of how the different levels were interpreted. Examiners also commented on problems they had distinguishing levels, and in the questionnaire they were asked whether each scale discriminated effectively across the levels and, if not, why not.

In general, hesitancy and repetition were key features at all levels, with levels being distinguished by the frequency of hesitation and repetition and its impact on the clarity or coherence of speech. At the higher levels (Bands 7-9), examiners used terms like 'relaxed' and 'natural' to refer to fluency; candidates at these levels were referred to as being 'in control'. Examiners appeared uncomfortable about awarding the highest score (Band 9), and spent some time trying to justify their decisions. One examiner reported that the fact that Band 9 was absolute (that is, it required all hesitation to be content-related) was problematic (Extract 3), as was deciding what constituted appropriate hesitation, given that native speakers can be disfluent. Examiners expressed similar difficulties with the difference between Bands 7 and 8, where they reported uncertainty as to the cause of hesitation (whether it was grammar, lexis or content related; see Extract 4).
Extract 3

Now I find in general, judgements about the borderline between 8 and 9 are about the hardest to give and I find that we're quite often asked to give them. And the reason they're so hard to give is that on the one hand, the bands for the 9 are stated in the very absolute sense. 'Any hesitation is to prepare the content of the next utterance' for Fluency and coherence, for example. What've we got? 'All contexts' and 'all times' in lexis and GRA. Now as against that, you look at the very bottom and it says a candidate will be rated on their average performance across all parts of the test. Now balancing those two factors is very hard. You're being asked to say, well, does this person usually never hesitate to find the right word? Now that's a contradiction and I think that's a real problem with the way the bands for 9 are written, given the context that we're talking about average performance.
Extract 4

It annoys me between 7 and 8 here. Where it says (I think I alluded to it before) is it content related or is it grammar and vocab or whatever? It says here in 7: 'Some hesitation accessing appropriate language'. And I don't know whether it's content or language for this bloke. So you know I went down because I think sometimes it is language, but I really don't know. So I find it difficult to make that call and that's why I gave it a 7, because I called it that way rather than content related, so being true to the descriptor.

The examiners appeared to have some difficulty distinguishing Bands 8 and 9 in relation to topic development, which was expected to be good in both cases. At Band 7, examiners reported problems starting to appear in the coherence and/or the extendedness of talk. At Band 6, examiners referred to a lack of directness (Extract 5), poor topic development (Extract 6), candidates going off on a tangent or otherwise getting off the topic, and occasional incoherence. They referred to a lack of confidence, and speech was considered effortful. Repetition and hesitation or pausing were intrusive at this level (Extract 6). As described in the scales, an ability to keep going seemed to distinguish a 6 from a 5 (Extract 7).
Extract 5

And I found that she says a lot but she doesn't actually say anything; it takes so long to get anywhere with her speech.


Extract 6

6 for Fluency and coherence. She was very slow, very hesitant. I felt that her long searches, her long pauses, were searching for the right words. And I felt that there was little topic development; that she wasn't direct.
Extract 7

I ended up giving him a 6 for Fluency and coherence. I wasn't totally convinced but by and large he was able to keep going.

At Band 5, examiners commented on having to work hard to understand the candidate; all of them expressed an inability at times to follow what the candidate was saying. Other comments related to the degree of repetition, hesitation and pausing, the overuse of particular discourse markers, not answering the question, and occasional trouble keeping going, elaborating or taking long turns (Extracts 8-10).
Extract 8

So I've given it 5 for fluency and I guess the deciding factor there was the 4 says 'unable to keep going without noticeable pauses', and she was able to keep going. There were pauses and all that but she did keep going, so I had to give her a 5 there.
Extract 9

'Overuse of certain discourse markers, connectives and other cohesive features.' He was using the same ones again and again and again.
Extract 10

So she got a 5 for Fluency and coherence because she was usually able to keep going, but there was repetition and there were hesitations mid-sentence while she looked for fairly basic words and grammar, and then she would stop as if she had more to say but she couldn't think of the words. I think there is a category for that in the descriptor, but anyway...
Extract 11

It's interesting in this section, the fluency tends to drop away and I don't know whether it's just that he doesn't like the topic of soccer very much, so maybe I'm doing him an injustice, but I'm going to end up marking him down to 7 on Fluency whereas before I was tending more to an 8, but I felt that if he was really fluent he'd be able to sustain it a little bit better.
5.1.1c Confidence in using the fluency and coherence scale

When asked to judge their confidence in understanding and interpreting the scales, no examiner selected lower than the mid-point on a scale of 1 (Not at all confident) to 5 (Very confident) for any of the scales (see Table 4). Examiners were marginally the least confident about Fluency and coherence, and the most confident about Pronunciation. When asked to elaborate on why they felt confident or not confident about the Fluency and coherence scale, several commented, as they had also done in their verbal reports, that the focus on hesitation was problematic because it was necessary, but not always possible, to infer its cause (a search for content or for language) in order to decide whether a rating of 7 or 8 should be given. One examiner commented that there can at times be 'more witchcraft than science' involved in discerning why hesitation occurs. It was also noted that fluency can be affected by familiarity with, or liking of, a topic (Extract 11). Another commented that assessing whether speech is situationally appropriate is problematic given the restricted context of the interview, while another said that the mention of topic development at Band 7 but not Band 8 is problematic. One examiner remarked that the Fluency and coherence descriptors are longer than the others and thus harder to internalise.


Only one examiner reported that the Fluency and coherence scale was easy to apply, commenting that the key indicators for her were whether the candidate could or could not keep going and could or could not elaborate.
Scale                            Mean confidence rating (1-5, across Examiners 1-6)
Fluency and coherence            3.8
Lexical resource                 4.0
Grammatical range and accuracy   4.0
Pronunciation                    4.3

Table 4: Confidence using the scales

When asked in the questionnaire whether the descriptors of the Fluency and coherence scale capture the significant performance qualities at each of the band levels and distinguish adequately between levels, most examiners reported problems. Bands 6 and 7 were considered difficult to distinguish in terms of the frequency or amount of hesitation, repetition and/or self-correction: the terms 'some' (Band 7) and 'at times' (Band 6) were said to be very similar. One examiner said it was difficult to infer intentions ('willingness' and 'readily') in order to discriminate between Bands 6 and 7. Distinguishing Bands 7 and 8 was also considered problematic for two reasons: firstly, because topic development is mentioned at Band 7 but not Band 8, and secondly because, as noted earlier, examiners found it difficult to infer whether disfluency was caused by a search for language (Band 7) or by the candidate thinking about their response (Band 8). One examiner felt that Bands 4 and 5 were particularly difficult to distinguish because hesitation and repetition are the hallmarks of both; another reported problems distinguishing Band 4 and Band 6 ('even 6 versus 4 can produce problems: coherence may be lost at 6 and some breakdowns in coherence at 4'). Finally, it was noted that hesitation and pace may indicate limitations in language but often reflect individual habits of speech.
5.1.2 Lexical resource

5.1.2a Understanding the lexical resource scale

As was the case for the Fluency and coherence scale, examiners tended to refer to the range of features included in the descriptors and the key indicators: lexical errors, range of lexical resource (including stylistic choices and adequacy for different topics), the ability to paraphrase, and the use of collocations. One feature included in the key indicators but not referred to was the ability to convey attitude. Although not referred to in the scales, the examiners took candidates' lack of comprehension of interviewer talk as evidence of limitations in Lexical resource.

As expected, there were numerous references to the sophistication and range of the lexis used by candidates, and to inaccuracies or inappropriate word choice. When they referred to inaccuracies or inappropriateness, examiners commented on their frequency ('occasional errors'), their seriousness ('a small slip'), the type of error ('basic', 'simple', 'non-systematic') and the impact the errors had on comprehensibility. Examiners also commented on the appropriateness or correctness of collocations, and on morphological errors (the use of 'dense' instead of 'density'). They commented unfavourably on candidates' inability to find the right words, a feature which overlapped with assessments of fluency.


While inaccuracies or inappropriate word choice were typically taken as evidence of lexical limitations, it was also recognised that unusual use of lexis may in fact be normal in the candidate's dialect or style. This was particularly the case for candidates from the Indian sub-continent. The evidence suggests, however, that determining whether a particular word or phrase was dialectal or inappropriate was not necessarily straightforward (Extract 12).
Extract 12

That's her use of 'in' here and she does it a lot. I don't know whether it's a dialect or whether it's a systematic error.

As evidence of stylistic control, examiners commented on (a) the use of specific, specialist or technical terms, and (b) the use of idiomatic or colloquial terms. They also evaluated the adequacy of candidates' vocabulary for the type of topic (described in terms such as familiar, unfamiliar or professional). There was some uncertainty as to whether candidates' ability to use specialist terms within their own professional, academic or personal fields of interest was indicative of a broad range of lexis or whether, because the topic was familiar, it was not. Reference was also made to the impact of errors or inappropriate word use on comprehensibility. Finally, although there were not a large number of references to learned expressions or formulae, examiners typically viewed their use as evidence of vocabulary limitations (Extract 13), especially if the use of relatively sophisticated learned phrases contrasted with otherwise unsophisticated usage.
Extract 13

Very predictable and formulaic kind of response: 'It's a big problem and I'm not sure about the solution' kind of style, which again suggests very limited lexis and probably pre-learnt.

Examiners also attended to candidates' ability to paraphrase when needed (Extract 14). They drew attention to specific instances of what they considered to be successful or creative circumlocution: 'my good memory moment' or 'the captain of a company'.
Extract 14

He rarely attempts paraphrase; he sort of stops, can't say it and he doesn't try and paraphrase it; he sort of repeats the bit that he didn't say right.
5.1.2b Determining levels within the lexical resource scale

The verbal report and questionnaire data were next analysed for evidence of how examiners distinguished levels within the Lexical resource scale and what problems they had distinguishing them. Examiners tended to value sophisticated or idiomatic lexical use at the higher end (Extract 15), although they tended to avoid Band 9 because of its absoluteness. Band 8 was awarded if they viewed non-native usages as occasional errors (Extract 16), and Band 9 if they considered them to be dialectal or creative. Precise and specific use of lexical items was also important at the higher levels, as per the descriptors.
Extract 15

and very sophisticated use of common, idiomatic terms. He was clearly 8 in terms of lexical resources.


Extract 16

Then with Lexical resource, occasionally her choice of word was slightly not perfect and that's why she didn't get a 9, but she really does nice things that show that she's got a lot of control of the language. Like at one stage she says that something will end and then she changed it and said it might end, and that sort of indicated that she knew about the subtleties of using... the impact of certain words.

At Band 7 examiners noted style and collocation. They still looked for sophisticated use of lexical items although, in contrast with Band 8, performance was considered uneven or patchy (Extract 17). They also noticed occasional difficulty elaborating or finding the words at Band 7.
Extract 17

So unusual vocabulary there; it's suitable and quite sophisticated to say 'eating voluptuously', so eating for the joy of eating. So this is where my difficulty is in assessing her lexical resource. She'll come out with words like that which are really quite impressive, but then she'll say 'the university wasn't published', which is quite inappropriate and distracting. So yes, at this stage I'm on a 7 for Lexical resource.

Whereas Band 7 required appropriate use of idiomatic language, at Band 6 examiners reported errors in usage (Extract 18). Performance at Band 6 was also characterised by adequate or safe use of common lexis.
Extract 18

Lexical resource was very adequate for what she was doing. She used a few somewhat unusual and idiomatic terms and there were points where therefore I was torn between a 6 and a 7. The reason I erred on the side of the 6 rather than the 7 was because those idiomatic and unusual terms were sometimes themselves not used quite correctly and that was a bit of a giveaway; it just wasn't quite the degree of comfort that I'd have expected with a 7.

A Band 5 was typically described in terms of the range of lexis (simple), the degree of struggle involved in accessing it, and the inability to paraphrase. At this level candidates were seen to struggle for words and there was some lack of clarity in meaning (Extract 19).
Extract 19

It's pretty simple vocabulary and he's struggling for words, at times for the appropriate words, so I'd say 5 on Lexical resource.

Examiners awarded Band 4 when they felt candidates were unable to elaborate, even on familiar topics (Extract 20), and when they were unable to paraphrase (Extract 21). They also noted repetitive use of vocabulary.
Extract 20

So she can tell us enough to tell us that the government can't solve this problem, but she hasn't got enough words to be able to tell us why. So it's like she can make the claims but she can't work on the meaning to build it up, even when she's talking about something fairly familiar.
Extract 21

I did come down to a 4 because 'resource sufficient for familiar topics but really only basic meaning on unfamiliar topics', which is number 4. 'Attempts paraphrase': well, she didn't really; she couldn't do that. So I felt that she fitted a 4 with the Lexical resource.


5.1.2c Confidence in using the lexical resource scale

The examiners reported being slightly more comfortable with the Lexical resource scale than with the Fluency and coherence scale (Table 4). Three of them noted that it was clear, and the bands easily distinguishable. One noted that it was easy to check depth or breadth of lexical knowledge with a quick replay of the taped interview, focusing on the candidate's ability to be specific. When asked to elaborate on what they felt least confident about, examiners commented on:

- the lack of interpretability of terms used in the scales (terms such as 'sufficient', 'familiar' and 'unfamiliar')
- the difficulty they had distinguishing between levels (specifically, the similarity between Band 7 'Resource flexibly used to discuss a variety of topics' and Band 6 'Resource sufficient to discuss at length')
- the difficulty distinguishing between Fluency and coherence and Lexical resource (discussed in more detail later).

In relation to this last point, one examiner remarked that spoken discourse markers and other idiomatic items such as adverbials (possibly, definitely), emphatic terms (you know, in a sense) and intensifiers or diluters (really, somewhat, quite) are relevant to both Lexical resource and Fluency and coherence. One examiner commented that paraphrasing is difficult to assess as not all candidates do it, and that indicators such as repetition and errors are more useful. In contrast, another commented that the paraphrase criterion was useful, particularly across Bands 4 to 7. Another remarked that it is difficult to assess lexical resources in an interview and that the criteria should focus more on the relevance or appropriateness of the lexis to the context.

When asked whether the descriptors of the Lexical resource scale capture the significant performance qualities at each of the band levels and discriminate across the levels effectively, most examiners felt that this was the case. One said the scale developed well from 'basic meaning' at 4, through 'sufficient' at 5, to 'meaning clear' and then higher levels of idiom and collocation. One felt that a clearer distinction was needed in relation to paraphrase for Bands 7 and 8, and another that Bands 5 and 6 were difficult to distinguish because the ability to paraphrase, which seemed to be a key cut-off, was difficult to judge. Another felt that deciding what was a familiar or unfamiliar topic was problematic, particularly across Bands 4 and 5. One examiner did not like the use of the term 'discuss' at Band 5, as this for her implied dealing in depth with an issue, something she felt was unlikely at that level; she suggested the term 'talk about' instead. Another commented that some candidates have sophisticated vocabulary relating to specific areas of work or study yet lack more general breadth.
5.1.3 Grammatical range and accuracy

5.1.3a Understanding the grammatical range and accuracy scale

In general, the examiners were very true to the descriptors, and all aspects of the Grammatical range and accuracy scale were addressed. The main focus was on error frequency and error type on the one hand, and the complexity of sentences and structures on the other; examiners appeared to balance these criteria against each other.

In relation to grammatical errors, examiners referred to density or frequency, including the number of error-free sentences. They also noted the type of error (those viewed as simple, basic or minor included articles, tense, pronouns, subject-verb agreement, word order, plurals, infinitives and participles) and whether errors were systematic or not, as well as the impact of errors on intelligibility.

The examiners commented on the range of structures used, and the flexibility that candidates demonstrated in their use. There was reference, for example, to the repetitive use of a limited range of structures, and to candidates' ability to use, and frequency of use of, complex structures such as passives, the present perfect, conditionals, adverbial constructions and comparatives. Examiners also noted candidates' ability to produce complex sentences, the range of complex sentence types they used, and the frequency and success with which they produced them. Conversely, what they referred to as fragmented or list-like speech, or the inability to produce complete sentences or connect utterances (a feature which also impacted on assessments of coherence), was taken as evidence of limitations in grammatical resources.
5.1.3b Determining levels within the grammatical range and accuracy scale

To determine how examiners coped with the different levels within the Grammatical range and accuracy scale, the verbal report data were analysed for evidence of how the different levels were interpreted. Again, Band 9 was little used. This seemed to be because of its absolute nature; the phrase 'at all times' was used to justify not awarding this band (Extract 22). Examiners did have some problems deciding whether non-native usage was dialectal or error. At Band 8, examiners spoke of the complexity of structures and the flexibility or control the candidates displayed in their use of grammar. At this level errors were expected to be both occasional and non-systematic, and tended to be referred to as inappropriacies or slips, or as minor, small or unusual (for the candidate), or as non-native-like usage.
Extract 22

And again I think I'm stopping often enough for these grammatical slips for it on average, remembering that we are always saying that, for it on average to match the 8 descriptor, which allows for these, rather than the 9 descriptor, which doesn't.

Overall, Band 7 appeared to be a default level: not particularly distinguishable in itself, but a middle ground between 8 and 6 where examiners made a decision based on whether the performance was as good as an 8 or as bad as a 6. Comments tended to be longer, as examiners argued for a 7 and against a 6 and an 8 (Extract 23). At this level inaccuracies were expected but were relatively unobtrusive, and some complex constructions were expected (Extract 24).
Extract 23

I thought that he was a 7 more than a 6. He definitely wasn't an 8, although as I say, at the beginning I thought he might have been. There was 'a range of structures flexibly used'. 'Error-free sentences frequent', although I'm not a hundred per cent sure of that because of pronunciation problems. And he could use simple and complex sentences effectively, certainly with some errors. Now when you compare that to the criteria for 6: 'Though errors frequently occur in complex structures these rarely impede communication'...
Extract 24

For Grammatical range and accuracy, even though there was [sic] certainly errors, there was certainly still errors, but you're allowed that to be a 7. What actually impressed me here: he was good on complex verb constructions with infinitives and participles. He had a few really quite nice constructions of that nature; I mean we're talking about sort of true complex sentences with complex verbs in the one clause, not just subordinate clauses, and I thought they were well handled. His errors certainly weren't that obtrusive even though there were some fairly basic ones, and I think it would be true to say that error-free sentences were frequent there.

At Band 6 the main focus for examiners was the type of errors and whether they impeded communication. While occasional confusion was allowed, if the impact was too great then examiners tended to consider dropping to a 5 (Extract 25). Likewise, an inability to use complex constructions successfully and confidently kept candidates at a 6 rather than a 7 (Extract 26).


Extract 25

A mixture of short sentences, some complex ones; yes, a variety of structures. Some small errors, but certainly not errors that impede communication. But not an advanced range of sentence structures. I'll go for a 6 on the grammar.
Extract 26

Grammatical range and accuracy was also pretty strong: relatively few mistakes, and simple sentences especially were very well controlled. Complex structures. The question was whether errors were frequent enough for this to be a 6; there certainly were errors. There were also a number of quite correct complex structures. I did have misgivings I suppose about whether this was a 6 or a 7 because she was reasonably correct. I suppose I eventually felt the issue of flexible use told against the 7 rather than the 6. There wasn't quite enough comfort with what she was doing with the structures at all times for it to be a 7.

At Band 5 examiners noted frequent and basic errors, even in simple structures, and errors were reported as frequently impeding communication. Where attempts were made at more complex structures, these were viewed as limited and tended to lead to errors (Extract 27). Speech was fragmented at times, and problems with the verb 'to be' or sentences without verbs were noted.
Extract 27

She had basic sentences; she tended to use a lot of simple sentences, but she did also try for some complex sentences, there were some there, and of course the longer her sentences, the more errors there were.

The distinguishing feature of Band 4 appeared to be that basic and systematic errors occurred in most sentences (Extract 28).
Extract 28

Grammatical range and accuracy, I gave her a 4. Even on very familiar phrases, like where she came from, she was missing articles and always missed the word-ending 's'. And the other thing too is that she relied on key words to get meaning across, and some short utterances were error-free, but it was very hard to find even a basic sentence that was well controlled for accuracy.
5.1.3c Confidence in using the grammatical range and accuracy scale

When asked to comment on the ease of application of the Grammatical range and accuracy scale, one examiner remarked that it is easier to notice specific errors than error-free sentences, and another that errors become less important or noticeable if a candidate is fluent. Three examiners found the scale relatively easy to use. Most examiners felt that the descriptors of the scale captured the significant performance qualities at each of the band levels. One examiner said that he distinguished levels primarily in terms of the degree to which errors impeded communication. Another commented that the notion of 'error' in speech can be problematic, as natural (ie native) speech flow is often not in full sentences and is sometimes grammatically inaccurate.

When asked whether the Grammatical range and accuracy scale discriminates across the levels effectively, three agreed and three disagreed. One said that terms such as 'error-free', 'frequently' and 'well controlled' are difficult to interpret ('I ponder on what per cent of utterances were frequently error-free or well controlled'). Another felt that Bands 7 and 8 were difficult to distinguish because he was not sure whether a minor systematic error would drop the candidate to 7, and that Bands 5 and 6 could also be difficult to distinguish. Another felt that the Band 4/5 threshold was problematic because some candidates can produce long turns (Band 5) but are quite inaccurate even in basic sentence forms (Band 4). Finally, one examiner remarked that a candidate who produces lots of structures with a low level of accuracy, even on basic ones, can be hard to place, and suggested that some guidance on 'risk takers' is needed.
5.1.4 Pronunciation

5.1.4a Understanding the pronunciation scale

When evaluating candidates' pronunciation, examiners focused predominantly on the impact of poor pronunciation on intelligibility, in terms of both the frequency of unintelligibility and the amount of strain for the examiner (Extract 29).
Extract 29

I really do rely on that 'occasional strain', compared to 'severe strain'. [The levels] are clearly formed, I reckon.

When they talked about specific aspects of pronunciation, examiners referred most commonly to the production of sounds, that is, vowels and consonants. They did also, at times, mention stress, intonation and rhythm, and while they again tended to focus on errors, there was occasional reference to the use of such features to enhance communication (Extract 30).
Extract 30

And he did use phonological features in a positive way to support his message. One that I wrote down, for example, was 'well nobody was not interested'. And he got the stress exactly right and to express a notion which was, to express a notion exactly. I mean he could have said 'everybody was interested' but he actually got it exactly right, and the reason he got it exactly right, among other things, had to do with his control of the phonological feature.
5.1.4b Determining levels within the pronunciation scale

Next, the verbal report and questionnaire data were analysed for evidence of how the different levels were interpreted and of the problems examiners had distinguishing levels. While they attended to a range of phonological features (vowel and consonant production, but also stress and rhythm), intelligibility, or the level of strain involved in understanding candidates, appeared to be the key feature used to determine level (Extract 31). Because only even-numbered bands could be awarded, it seemed that examiners also took into account the impact that the Pronunciation score might have on overall scores (Extract 32).
Extract 31

I really do rely on that 'occasional strain', compared to 'severe strain'.

Extract 32

So I don't know why we can't give those bands between even numbers. So, just as I wanted to give a 5 to the Indian I want to give a 9 to this guy. Because you see the effect of 9, 9, 8, 8 will be he'll come down to 8 probably, I'm presuming.

At Band 8 examiners tended to pick out isolated instances of irregular pronunciation, relating the impact of these on intelligibility to the descriptors: 'minimal impact' and 'accent present but never impedes communication'. Although the native speaker was referred to as the model, it was recognised that native speakers make occasional pronunciation errors (Extract 33). Occasional pronunciation errors were generally considered less problematic than incorrect or non-native stress and rhythm (Extract 34). One examiner expressed a liking for variety of tone or stress in delivery and noted that she was reluctant to give an 8 to a candidate she felt sounded bored or disengaged.


Extract 33

Because I suppose the truth is, as native speakers, we sometimes use words incorrectly and we sometimes mispronounce them.
Extract 34

It's interesting how she makes errors in pronunciation on words. So she's got 'bif roll' and 'steek' and 'selard', and I don't think there is much of a problem for a native speaker to understand, as if you get the pauses in the wrong place, if you get the rhythm in the wrong place... So that's why I've given her an 8 rather than dropping her down, because it says 'L1 accent may be evident, this has minimal effect on intelligibility', and it does have minimal effect, because it's always in context that she might get a word mispronounced or pronounced in her way, not my way.

Band 6 appeared to be the default level at which examiners elected to start. Examiners seemed particularly reluctant to give a 4; of the 29 ratings, only three were below 6. Bands 4 and 6 were essentially determined with reference to listener strain, with 'severe strain' at Band 4 and 'occasional strain' at Band 6 (Extract 35).
Extract 35

Again with Pronunciation I gave her a 6 because I didn't find patches of speech that caused severe strain; I mean there was 'mispronunciation causes temporary confusion', some occasional strain.

At Band 4 most comments referred to severe strain, or to the fact that examiners were unable to comprehend what the candidate had said (Extract 36).
Extract 36

I actually did mark this person down to Band 4 on Pronunciation because it did cause me severe strain, although I don't know whether that's because of the person I listened to before, or the time of the day, but there were large patches, whole segments of responses, that I just couldn't get through and I had to listen to it a couple of times to try and see if there was any sense in it.
5.1.4c Confidence in using the pronunciation scale

When asked to judge their confidence in understanding and interpreting the scales, the examiners were the most confident about Pronunciation (see Table 4). However, there was a common perception that the scale did not discriminate enough (Extract 37). One examiner remarked that candidates most often came out with a 6, and another that she did not take pronunciation as seriously as the other scales. One examiner felt that experience with specific language groups could bias the assessment of pronunciation (and, in fact, there were a number of comments in the verbal report data where examiners remarked on their familiarity, or lack of familiarity, with particular accents). One was concerned that speakers of other Englishes may be hard to understand and therefore marked down unfairly (Extract 38). Volume and speed were both reported, in the questionnaire data and the verbal report data, as having an impact on intelligibility.
Extract 37

And I would prefer to give a 5 on Pronunciation but it doesn't exist. But to me he's somewhere between 'severe strain', which is the 4, and the 6, which is 'occasional strain'. He caused strain for me nearly 50% of the time, so that's somewhere between occasional and severe. And this is one of the times where I really wish there was a 5 on Pronunciation, because I think 6 is too generous and I think 4 is too harsh.


Extract 38

I think there is an issue judging the pronunciation of candidates who may be very difficult for me to understand, but who are fluent/accurate speakers of recognised second language Englishes (Indian or Filipino English). A broad Scottish accent can affect comprehensibility in the Australian context, and I'm just not sure, therefore, whether an Indian or Filipino accent affecting comprehensibility should be deemed less acceptable.

While Pronunciation was generally considered the easiest scale on which to distinguish band levels because there are fewer levels, four of the six examiners remarked that there was too much distinction between levels, not too little, so that the scale did not discriminate between candidates enough. One examiner commented that as there is really no Band 2, it is a decision between 4, 6 or 8, and that she sees 4 as almost unintelligible. In arguing for more levels they made comments like: 'Many candidates are Band 5 in pronunciation: between severe strain for the listener and occasional. Perhaps mild strain quite frequently, or mild strain in sections of the interview.' One examiner felt a Band 9 was needed (Extract 39).
Extract 39

Levels 1, 3, 5, 7 and 9 are necessary. It seems unfair not to give a well-educated native speaker of English Band 9 for pronunciation when there's nothing wrong with their English: Australian doctors going to the UK.

Examiners commented at times on the fact that they were familiar with the pronunciation of candidates of particular nationalities, although they typically claimed to take this into account when awarding a rating (Extract 40).
Extract 40

I found him quite easy to understand, but I don't know that everybody would, and there's a very strong presence of accent, or features of pronunciation, that are so specifically Vietnamese that they can cause other listeners problems. So I'll go with a 6.

5.2 The discreteness of the scales

In this section, the questionnaire data and, where relevant, the analysis of the verbal report data are drawn upon to address the question of the ease with which examiners were able to distinguish the four analytic scales: Fluency and coherence (F&C), Grammatical range and accuracy (GRA), Lexical resource (LR) and Pronunciation (P). The examiners were asked how much overlap there was between the scales on a range of 1 (Very distinct) to 4 (Almost total overlap); see Table 5. The greatest overlap (mean 2.2) was reported between Fluency and coherence and Grammatical range and accuracy. Overall, Fluency and coherence was considered the least distinct scale and Pronunciation the most distinct.


Scale pair     Mean overlap (1 = Very distinct, 4 = Almost total overlap)
F&C and LR     2.0
F&C and GRA    2.2
F&C and P      1.8
LR and GRA     1.8
LR and P       1.0
GRA and P      1.0

Table 5: Overlap between scales

When asked to describe the nature of the overlap between scales, the examiners responded as follows; comments made during the verbal report sessions supported these responses.
Overlap: Fluency and coherence / Lexical resource

Vocabulary was seen as overlapping with fluency because 'to be fluent and coherent [candidates] need the lexical resources', and because good lexical resources allow candidates to elaborate their responses. Two examiners pointed out that discourse markers (and, one could add, connectives), which are included under Fluency and coherence, are also lexical items. Another examiner commented that the use of synonyms and collocation helps fluency.
Overlap: Fluency and coherence / Grammatical range and accuracy

Grammar was viewed as overlapping with fluency because if a candidate has weak grammar but a steady flow of language, coherence is affected negatively. The use of connectives (so, because) and subordinating conjunctions (when, if) was said to play a part in both sets of criteria. Length of turn in Grammatical range and accuracy was seen as overlapping with the ability to keep going in Fluency and coherence (Extract 41).
Extract 41

Again I note, both with fluency and with grammar, the issue of the length of turns kind of cuts across both of them, and I'm sometimes not sure whether I should be taking into account both of them or, if not, which; but as far as I can judge it from the descriptors, it's relevant to both.

One examiner remarked that fluency can dominate the other criteria, especially grammar (Extract 42).
Extract 42

Well I must admit that I reckon if the candidate is fluent, it does tend to influence the other two scores. If they keep talking you think, oh well, they can speak English. And you have to be really disciplined as an examiner to look at those others (the lexical and the grammar) to really give them an appropriate score, because otherwise you can say, well, you know, they must have enough vocab, I could understand them. But the degree to which you understand them is the important thing. So even as a 4 I said that I think there also needs to be some other sort of general band score. It does make you focus on those descriptors here.


Overlap: Lexical resource / Grammatical range and accuracy

Three examiners wondered whether errors in expressions or phrases (preposition phrases, phrasal verbs, idioms) were lexical or grammatical ('If a candidate says in the moment instead of at the moment, what is s/he penalised under?' and 'I'm one of those lucky persons: is it lexical? Is it expression?'). Another examiner saw the scales as overlapping in relation to skill at paraphrasing.
Overlap: Fluency and coherence / Pronunciation

Two examiners pointed out that if the pronunciation is hard to understand, the coherence will be low. Another felt that slow (disfluent) speech was often more clearly pronounced and comprehensible, although another felt that disfluent speech was less comprehensible if there was a staccato effect. One examiner remarked that if pronunciation is unintelligible it is not possible to assess any of the other areas accurately.

5.3 Remaining questions

5.3.1 Additional criteria

As noted earlier, during the verbal report sessions examiners rarely made reference to features not included in the scales or key criteria. Those they did refer to were:

- the ability to cope with different functional demands
- confidence in using the language
- creative use of language.

In response to a question about the appropriateness of the scale contents, the following additional features were proposed as desirable: voice, engagement, demeanour, and paralinguistic aspects of language use. Three examiners criticised the test for not testing communicative language. One examiner felt there was a need for a holistic rating in addition to the analytic ratings, because global marking was less accurate than profile marking owing to the complexity of the variables involved.
5.3.2 Irrelevant criteria

When asked whether any aspects of the descriptors were inappropriate or irrelevant, one examiner remarked that candidates may not exhibit all aspects of particular band descriptors. Another saw a conflict between the absolute nature of the descriptors for Bands 9 and 1 and the requirement to assess on the basis of average performance across the interview. When asked whether they would prefer the descriptors to be shorter or longer, most examiners said they were fine as they were. Three remarked that if a candidate must fully fit all the descriptors at a particular level, as IELTS instructs, longer descriptors would create more difficulties. One examiner said that the Fluency and coherence descriptors could be shorter and should rely less on discerning the cause of disfluency, whereas another remarked that more precise language was needed in Fluency and coherence Bands 6 and 7. Another referred to the need for more precise language in general. One examiner suggested that key cut-off statements would be useful, and another that an appendix to the criteria giving specific examples would help.
5.3.3 Interviewing and rating

While they acknowledged that it was challenging to conduct the interview and rate the candidate simultaneously, the examiners did not feel it was inappropriately difficult. In part this was because they had to pay less attention to managing the interaction and thinking up questions than they did in the previous interview, and in part because they were able to focus on different criteria in different sections of the interview, while the monologue turn gave them ample time to focus exclusively on rating. When asked whether they attended to specific criteria in specific parts of the interview, some said yes and some no.


They also reported different approaches to arriving at a final rating. The most common approach was to make a tentative assessment in the first part and then confirm this as the interview proceeded (Extract 43). One reported working down from the top level, and another making her assessment after the interview was finished.
Extract 43

By the monologue I have a tentative score and assess if I am very unsure about any of the areas. If I am, I make sure I really focus on that in the monologue. By the end of the monologue, I have a firmer feel for the scores and use the last section to confirm/disconfirm. It is true that the scores do change as a candidate is able to demonstrate the higher level of language in the last section. I do have some difficulties wondering what weight to give to this last section.

When asked if they had other points to make, two examiners remarked that the descriptors could be improved: one wanted a better balance between specific and vague terms, and the other more distinct cut-off points, as in the writing descriptors. Two suggested improvements to the training: the use of video rather than audio recordings of interviews, and the provision of examples attached to the criteria. Another commented that cultural sophistication plays a role in constructing candidates as more proficient, and that the test may therefore be biased towards European students ('some European candidates come across as better speakers, even though they may be mainly utilising simple linguistic structures').

6 DISCUSSION

The study addressed a range of questions pertaining to how trained IELTS examiners interpret and distinguish the scales used to assess performance in the revised IELTS interview, how they distinguish the levels within each scale, and what problems they reported when applying the scales to samples of performance.

In general, the examiners referred closely to the scales when evaluating performances, quoting frequently from the descriptors and using them to guide their attention to specific aspects of performance and to distinguish levels. While there was reference to all aspects of the scales and key criteria, some features were referred to more frequently than others. In general, the more quantifiable features, such as amount of hesitation (Fluency and coherence) or error density and type (Lexical resource and Grammatical range and accuracy), were the most frequently mentioned, although it cannot be assumed that this indicates greater weighting of these criteria over the less commonly mentioned ones (such as connectives or paraphrasing). Moreover, because examiners are required to make four assessments, one for each of the criteria, there seems to be less likelihood than was previously the case with the single holistic scale that examiners will weight these four main criteria differentially.

There were remarkably few instances of examiners referring to aspects of performance not included in the scales, which is in marked contrast to the findings of an examination of the functioning of the earlier holistic scale (Brown, 2000). In that study, Brown reported that while some examiners focused narrowly on the criteria, others were more inference-oriented, drawing more 'conclusions about the candidate's ability to cope in other contexts' (2000: 78). She noted also that this was more the case for more experienced examiners.

The examiners reported finding the scales relatively easy to use, and the criteria and their indicators to be generally appropriate and relevant to test performances, although they noted some overlap between scales and some difficulties distinguishing levels.


It was reported that some features were difficult to notice or interpret. Particularly problematic features included:
- the need to infer the cause of hesitation (Fluency and coherence)
- a lack of certainty about whether inappropriate language was dialectal or error (Lexical resource and Grammatical range and accuracy)
- a lack of confidence in determining whether particular topics were familiar or not, particularly those relating to professional or academic areas (Lexical resource).

Difficulty was also reported in interpreting the meaning of relative terms used in the descriptors, such as 'sufficient' and 'adequate'. There was some discomfort with the absoluteness of the Band 9 descriptors across the scales.

The most problematic scale appeared to be Fluency and coherence. It was the most complex in terms of focus and was also considered to overlap the most with other scales. Overlap resulted from the impact of a lack of lexical or grammatical resources on fluency, and because discourse markers and connectives (referred to in the Fluency and coherence scale) were also lexical items and a feature of complex sentences. Examiners seemed to struggle the most to determine band levels on the Fluency and coherence scale, perhaps because of the broad range of features it covers, and because the cause of hesitancy, a key feature in the scale at the higher levels, is a high-inference criterion.

The Pronunciation scale was considered the easiest to apply; however, the examiners expressed a desire for more levels for Pronunciation. They felt it did not distinguish candidates sufficiently, and the fewer band levels meant the rating decision carried too much weight in the overall (averaged) score.

As was found in earlier studies of examiner behaviour in the previous IELTS interview (Brown, 2000) and in prototype speaking tasks for Next Generation TOEFL (Brown, Iwashita and McNamara, 2005), in addition to observable features such as frequency of error, complexity and accuracy, examiners were influenced in all criteria by the impact of particular features on comprehensibility. Thus they referred frequently to the impact of disfluency, lexical and grammatical errors and non-native pronunciation on their ability to follow the candidate, or the degree of strain it caused them.

A marked difference between the present study and that of Brown (2000) was the relevance of interviewer behaviour to ratings. Brown found that a considerable number of comments were devoted to the interviewer and reported that the examiners were 'constantly aware of the fact that the interviewer is implicated in a candidate's performance' (2000: 74). At times, the examiners even compensated for what they perceived to be unsupportive or less-than-competent interviewer behaviour (see also Brown 2003, 2004). While there were one or two comments on interviewer behaviour in the present study, they did not appear to have any impact on rating decisions. In contrast, however, some of the examiners did report a level of concern that the current interview and assessment criteria focused less on communicative or interactional skills than previously, a result of the use of interlocutor frames.

Finally, although the examiners in this study were rating taped tests conducted by other interviewers, they reported feeling comfortable (and more comfortable than was the case in the earlier unscripted interview) with simultaneously conducting the interview and assessing it, despite the fact that they were required to focus on four scales rather than one. This seemed to be because they no longer have to manage the interview by developing topics on-the-fly, and also have the opportunity during Part 2 (the long turn) to sit back and focus entirely on the candidate's production.


7 CONCLUSION

This study set out to investigate examiners' behaviour and attitudes to the rating task in the IELTS interview. The study was designed as a follow-up to an earlier study (Brown, 2000), which investigated the same issues in relation to the earlier IELTS interview. Two major changes in the current interview are: the use of interlocutor frames to constrain unwanted variation amongst interviewers; and the use of a set of four analytic scales rather than the previous single holistic scale. The study aimed to derive evidence for or against the validity (that is, the interpretability and ease of application) of these revised scales within the context of the revised interview. To do this, the study drew on two sets of data: verbal reports and questionnaire responses provided by six experienced IELTS examiners when rating candidate performances.

On the whole, the evidence suggested that the rating procedure works relatively well. Examiners reported a high degree of comfort using the scales. The evidence suggested there was a higher degree of consistency in examiners' interpretations of the scales than was previously the case, a finding which is perhaps unsurprising given the more detailed guidance that four scales offer in comparison with a single scale. The problems that were identified (perceived overlap amongst scales, and difficulty distinguishing levels) could be addressed in minor revisions to the scales and through examiner training.


REFERENCES

Brown, A, 1993, The role of test-taker feedback in the development of an occupational language proficiency test in Language Testing, vol 10, no 3, pp 277-303
Brown, A, 2000, An investigation of the rating process in the IELTS Speaking Module in Research Reports 1999, vol 3, ed R Tulloh, ELICOS, Sydney, pp 49-85
Brown, A, 2003a, Interviewer variation and the co-construction of speaking proficiency, Language Testing, vol 20, no 1, pp 1-25
Brown, A, 2003b, A cross-sectional and longitudinal study of examiner behaviour in the revised IELTS Speaking Test, report submitted to IELTS Australia, Canberra
Brown, A, 2004, Candidate discourse in the revised IELTS Speaking Test, IELTS Research Reports 2006, vol 6 (the following report in this volume), IELTS Australia, Canberra, pp 71-89
Brown, A, 2005, Interviewer variability in oral proficiency interviews, Peter Lang, Frankfurt
Brown, A and Hill, K, 1998, Interviewer style and candidate performance in the IELTS oral interview in Research Reports 1997, vol 1, ed S Woods, ELICOS, Sydney, pp 1-19
Brown, A, Iwashita, N and McNamara, T, 2005, An examination of rater orientations and test-taker performance on English for Academic Purposes speaking tasks, TOEFL Monograph series MS-29, Educational Testing Service, Princeton, New Jersey
Cumming, A, 1990, Expertise in evaluating second language compositions in Language Testing, vol 7, no 1, pp 31-51
Delaruelle, S, 1997, Text type and rater decision making in the writing module in Access: Issues in English language test design and delivery, eds G Brindley and G Wigglesworth, National Centre for English Language Teaching and Research, Macquarie University, Sydney, pp 215-242
Gass, SM and Mackey, A, 2000, Stimulated recall methodology in second language research, Lawrence Erlbaum, Mahwah, NJ
Green, A, 1998, Verbal protocol analysis in language testing research: A handbook (Studies in Language Testing 5), Cambridge University Press and University of Cambridge Local Examinations Syndicate, Cambridge
Lazaraton, A, 1996a, A qualitative approach to monitoring examiner conduct in the Cambridge assessment of spoken English (CASE) in Performance testing, cognition and assessment: Selected papers from the 15th Language Testing Research Colloquium, eds M Milanovic and N Saville, Cambridge University Press, Cambridge, pp 18-33
Lazaraton, A, 1996b, Interlocutor support in oral proficiency interviews: The case of CASE in Language Testing, vol 13, pp 151-172
Lewkowicz, J, 2000, Authenticity in language testing: some outstanding questions in Language Testing, vol 17, no 1, pp 43-64
Lumley, T and Stoneman, B, 2000, Conflicting perspectives on the role of test preparation in relation to learning in Hong Kong Journal of Applied Linguistics, vol 5, no 1, pp 50-80
Lumley, T, 2000, The process of the assessment of writing performance: the rater's perspective, unpublished doctoral thesis, The University of Melbourne


Lumley, T and Brown, A, 2004, Test-taker response to integrated reading/writing tasks in TOEFL: evidence from writers, texts and raters, unpublished report, The University of Melbourne
McNamara, TF and Lumley, T, 1997, The effect of interlocutor and assessment mode variables in overseas assessments of speaking skills in occupational settings in Language Testing, vol 14, pp 140-156
Meiron, BE, 1998, Rating oral proficiency tests: a triangulated study of rater thought processes, unpublished Master's thesis, University of California, LA
Merrylees, B and McDowell, C, 1999, An investigation of Speaking Test reliability with particular reference to the Speaking Test format and candidate/examiner discourse produced in IELTS Research Reports, vol 2, ed R Tulloh, IELTS Australia, Canberra, pp 1-35
Morton, J, Wigglesworth, G and Williams, D, 1997, Approaches to the evaluation of interviewer performance in oral interaction tests in Access: Issues in English language test design and delivery, eds G Brindley and G Wigglesworth, National Centre for English Language Teaching and Research, Macquarie University, Sydney, pp 175-196
Pollitt, A and Murray, NL, 1996, What raters really pay attention to in Performance testing, cognition and assessment (Studies in Language Testing 3), eds M Milanovic and N Saville, Cambridge University Press, Cambridge, pp 74-91
Taylor, L and Jones, N, 2001, Research Notes 4, University of Cambridge Local Examinations Syndicate, Cambridge, pp 9-11
Taylor, L, 2000, Issues in speaking assessment research (Research Notes 1), University of Cambridge Local Examinations Syndicate, Cambridge, pp 8-9
UCLES, 2001, IELTS examiner training material, University of Cambridge Local Examinations Syndicate, Cambridge
Vaughan, C, 1991, Holistic assessment: What goes on in the rater's mind? in Assessing second language writing in academic contexts, ed L Hamp-Lyons, Ablex, Norwood, New Jersey, pp 111-125
Weigle, SC, 1994, Effects of training on raters of ESL compositions in Language Testing, vol 11, no 2, pp 197-223


APPENDIX 1: QUESTIONNAIRE
A Focus of the criteria

1. Do the four criteria cover features of spoken language that can be readily assessed in the testing situation? Yes / No Please elaborate

2. Do the descriptors relate directly to key indicators of spoken language? Is anything left out? Yes / No Please elaborate

3. Are any aspects of the descriptors inappropriate or irrelevant? Yes / No Please elaborate

B Interpretability of the criteria

4. Are the descriptors easy to understand and interpret? How would you rate your confidence in using each scale, on a scale of 1-5 (1 = not at all confident, 5 = very confident)?

Fluency and coherence          1   2   3   4   5
Lexical resource               1   2   3   4   5
Grammatical range and accuracy 1   2   3   4   5
Pronunciation                  1   2   3   4   5

5. Please elaborate on why you felt confident or not confident about each of the scales: Fluency and coherence

Lexical resource

Grammatical range and accuracy

Pronunciation


6. How much overlap do you find among the scales? (1 = very distinct, 2 = some overlap, 3 = a lot of overlap, 4 = almost total overlap)

F&C and LR     1   2   3   4
F&C and GRA    1   2   3   4
F&C and P      1   2   3   4
LR and GRA     1   2   3   4
LR and P       1   2   3   4
GRA and P      1   2   3   4

7. Could you describe this overlap?

8. Would you prefer the descriptors to be shorter / longer? Please elaborate

C Level distinctions

9. Do the descriptors of each scale capture the significant performance qualities at each of the band levels? Fluency and coherence Yes / No Please elaborate

Lexical resource

Yes / No

Please elaborate

Grammatical range and accuracy

Yes / No

Please elaborate

Pronunciation

Yes / No

Please elaborate

10. Do the scales discriminate across the levels effectively? (If not, for each scale which levels are the most difficult to discriminate, and why?) Fluency and coherence Yes / No Please elaborate

Lexical resource

Yes / No

Please elaborate

Grammatical range and accuracy

Yes / No

Please elaborate

Pronunciation

Yes / No

Please elaborate


11. Is the allocation of bands for pronunciation appropriate? Yes / No Please elaborate

12. How often do you award flat profiles? Please elaborate

D The rating process

13. How difficult is it to interview and rate at the same time? Please elaborate

14. Do you focus on particular criteria in different parts of the interview? Yes / No Please elaborate

15. How is your final rating achieved? How do you work towards it? At what point do you finalise your rating? Please elaborate

Final comment
Is there anything else you think you should have been asked or would like to add?


3. Candidate discourse in the revised IELTS Speaking Test


Author: Annie Brown, Ministry of Higher Education and Scientific Research, United Arab Emirates
Grant awarded: Round 8, 2002

This study aims to verify the IELTS Speaking Test scale descriptors by providing empirical validity evidence derived from a linguistic analysis of candidate discourse.
ABSTRACT

In 2001 the IELTS interview format and criteria were revised. A major change was the shift from a single global scale to a set of four analytic scales focusing on different aspects of oral proficiency. This study is concerned with the validity of the analytic rating scales. It aims to verify the descriptors used to define the score points on the scales by providing empirical evidence for the criteria in terms of their overall focus, and their ability to distinguish levels of performance. The Speaking Test band descriptors and criteria key indicators were analysed in order to identify relevant analytic categories for each of the four band scales: fluency, grammatical range and accuracy, lexical resource and pronunciation. Twenty interviews drawn from operational IELTS administrations in a range of countries, and representing a range of proficiency levels, were analysed with respect to these categories. The analysis found that most of the measures displayed increases in the expected direction over the levels, which appears to confirm the validity of the criteria. However, for all measures the standard deviations tended to be large relative to the differences between levels. This indicates a high level of variation amongst candidates assessed at the same level, and a high degree of overlap between levels, even for those measures which produced significant findings. In addition, for most measures the differences between levels were greater at some boundaries between two bands than at others. Overall, the findings indicate that while all the measures relating to one scale contribute in some way to the assessment on that scale, no one measure drives the rating; rather, a range of performance features contribute to the overall impression of the candidate's proficiency.


CONTENTS
1 Aim of the study
2 Discourse studies of L2 speaking task performance
3 Methodology
  3.1 Data
  3.2 The IELTS Speaking Test
  3.3 Analytic categories
    3.3.1 Fluency and coherence
    3.3.2 Lexical resources
    3.3.3 Grammatical range and accuracy
4 Results
  4.1 Fluency and coherence
    4.1.1 Repair
    4.1.2 Hesitation
    4.1.3 Speech rate
    4.1.4 Response length
    4.1.5 Amount of speech
  4.2 Lexical resources
  4.3 Grammatical range and accuracy
5 Summary of findings
References
Appendix 1: ANOVAs (Analysis of variance)

AUTHOR BIODATA: ANNIE BROWN Annie Brown is Head of Educational Assessment in the National Admissions and Placement Office (NAPO) of the Ministry of Higher Education and Scientific Research, United Arab Emirates. Previously, and while undertaking this study, she was Senior Research Fellow and Deputy Director of the Language Testing Research Centre at The University of Melbourne. There, she was involved in research and development for a wide range of language tests and assessment procedures, and in language program evaluation. Annie's research interests focus on the assessment of speaking and writing, and the use of Rasch analysis, discourse analysis and verbal protocol analysis. Her books include Interviewer Variability in Oral Proficiency Interviews (Peter Lang, 2005) and the Language Testing Dictionary (CUP, 1999, co-authored with colleagues at the Language Testing Research Centre). She was winner of the 2004 Jacqueline A Ross award for the best PhD in language testing, and winner of the 2003 ILTA (International Language Testing Association) award for the best article on language testing.


1 AIM OF THE STUDY

This study comprises an analysis of candidate discourse on the revised IELTS Speaking Test, undertaken as part of the program of validation research funded by IELTS Australia. The overall aim of the study is to verify the descriptors used to define the score points on the scales by providing empirical validity evidence for the criteria, in terms of:
- their overall focus
- their ability to distinguish levels of performance.

The aim is addressed through an analysis of samples of performance at each of several levels of proficiency, using a variety of quantitative and qualitative measures selected to reflect the features of performance relevant to the test construct and defined within the band scales.

2 DISCOURSE STUDIES OF L2 SPEAKING TASK PERFORMANCE

One of the first studies to examine learner discourse in relation to levels of proficiency was that of Mangan (1988), who examined the occurrence of specific grammatical errors in French Oral Proficiency Interviews. He found that while errors decreased as the proficiency level increased, the decrease was not linear. Douglas (1994) found similar results on a semi-direct speaking test for a variety of measures, including grammatical errors, fluency, vocabulary, and rhetorical organisation. He speculates that this could be because raters were attending to features not included in the scales, which raises the question of the validity of the scales used in this context. It may also be, as Douglas and Selinker (1992, 1993) and Brown et al (2005) argue, that holistic ratings do not adequately capture 'jagged profiles', that is, different levels of performance by a candidate across different criteria.

Brown, Iwashita and McNamara (2005) undertook an analysis of candidate performance on speaking tasks to be included in New TOEFL. The tasks had an English for Academic Purposes (EAP) focus and included both independent and integrated tasks (see Lewkowicz, 1997 for a discussion of integrated tasks). As the overall aim of the study was to examine the feasibility of drawing on verbal report data to develop scales, the measures used to examine the actual discourse were selected to reflect the criteria applied by EAP specialists when not provided with specific guidance, rather than those contained within existing scales. The criteria applied by the specialists and used to determine the discourse measures reflected four major categories: linguistic resources (which included grammar and vocabulary), fluency (which included repair phenomena, pausing and speech rate), phonology (which included pronunciation, intonation and rhythm), and content. Brown et al found that for each category only one or two of the measures they used revealed significant differences between levels. In addition, the effect sizes were generally marginal or small, indicating relatively large variability within each score level. This, they surmise, may have been because the score data which formed the basis of the selection of samples was rated holistically rather than analytically. They argue that it may well have been that samples assessed at the same level would reveal very different profiles across the different criteria (the major categories identified by the raters). A similar study carried out by Iwashita and McNamara (2003), using data from the Examination for the Certificate of Competency in English (English Language Institute, 2001), produced similar findings.

Discourse analysis of candidate data has also been used in the empirical development of rating scales. The work of Fulcher (1993, 1996, 2003) on the development of scales for fluency is perhaps the most original and detailed. He drew on data taken from a range of language tests to examine what constituted increasing levels of proficiency in terms of a range of fluency measures. He found strong evidence of progression through the levels on a number of these measures, which led to the

development of descriptors reflecting this progression that, he argued, would not only be more user-friendly but, because of their basis in actual performance, would lead to more valid and reliable ratings.

Other studies that have used various discourse measures to examine differences in candidate performance on speaking tasks include those by Skehan and Foster (1999), Foster and Skehan (1996) and Wigglesworth (1997, 2001), which used measures designed to capture differences in grammatical accuracy and fluency. In these studies the measures were applied not to performances assessed as being at different levels of proficiency, but to performances on different tasks (where the cognitive complexity of the task differed) or on the same task completed under varying conditions. Iwashita, McNamara and Elder (2001) drew on Skehan's (1998) model of cognitive complexity to examine the feasibility of defining levels of ability according to cognitive demand. They manipulated task conditions on a set of narrative tasks and measured performance using measures of accuracy and fluency. However, they found that the differences in performance under the different conditions did not support the development of a continuum of ability based on cognitive demand.

As Brown et al (2005) point out in discussing the difficulty of applying some measures, particularly those pertaining to grammatical analysis, most of the studies cited above do not provide measures of inter-coder agreement; Brown et al's study is exemplary in this respect. Like Foster, Tonkyn and Wigglesworth (2000), they discuss the difficulty of analysing the syntactic quality of spoken second language data using measures developed originally for the analysis of first language written texts. Foster et al consider the usefulness for the analysis of spoken data of several units of analysis commonly used in the analysis of written data. They conclude by proposing a new unit which they term the AS-unit. However, the article itself contains very little guidance on how to apply the analysis. (The AS-unit was considered for this study, but an attempt at its use created too many ambiguities and unexplained issues.)

3 METHODOLOGY

3.1 Data

A set of 30 taped operational IELTS interviews, drawn from testing centres in a range of countries, was rated analytically using the IELTS band descriptors. Ratings were provided for each of the categories:
- fluency and coherence
- lexical resource
- grammatical range and accuracy
- pronunciation.

To select interviews for the study which could be assumed to be soundly at a particular level, each was rated three times. Then, for each criterion, five interviews were selected at each of four levels, 5 to 8, on that specific criterion (totalling 20 interview samples). (The IELTS scale ranges from 0 to 9, with 6, 6.5 and 7 typically being the required levels for entry to tertiary study. This study had intended to include level 4, but the quality of the production of candidates at this level and the poor quality of the operational test recordings were such that their interviews proved impossible to transcribe accurately or adequately.) For example, interviews to be included in the analysis of grammatical accuracy were selected on the basis of the scores awarded in the category Grammatical range and accuracy. Similarly, interviews to be included in the analysis of hesitation were selected on the basis of the scores awarded in the category Fluency and coherence.


For interviews to be selected to reflect a specific level on a specific criterion, the following types of agreement on scores were required:
- all three scores were at the specified level (eg 7 7 7), or
- two scores were at the specified level and one a level above or below (eg 7 7 8), or
- the three scores reflected different levels but averaged to the level (eg 6 7 8).

Prior to analysis, the selected tapes were transcribed in full by a research assistant and checked by the researcher.
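To make the agreement rule concrete, here is a minimal sketch in Python (illustrative only; the function name and data representation are this sketch's own, not the study's) of the three qualifying score patterns:

```python
# Illustrative sketch of the interview selection rule (hypothetical
# code, not the study's tooling): three independent ratings must agree
# on a band level in one of the three ways listed above.

def qualifies(scores: list[int], level: int) -> bool:
    """Return True if three ratings place an interview soundly at `level`."""
    if all(s == level for s in scores):              # eg 7 7 7
        return True
    if scores.count(level) == 2:                     # eg 7 7 8
        other = [s for s in scores if s != level][0]
        return abs(other - level) == 1
    # eg 6 7 8: three different levels whose mean equals the target level
    return len(set(scores)) == 3 and sum(scores) / len(scores) == level

assert qualifies([7, 7, 7], 7)
assert qualifies([7, 7, 8], 7)
assert qualifies([6, 7, 8], 7)
assert not qualifies([7, 8, 8], 7)  # only one score at the target level
```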
3.2 The IELTS Speaking Test

The IELTS Speaking Test consists of a face-to-face interview between an examiner and a single candidate. The interview is divided into three main parts (Figure 1). Each part fulfils a specific function in terms of interaction pattern, task input and candidate output. In Part 1, candidates answer general questions about themselves, their homes/families, their jobs/studies, their interests, and a range of similar familiar topic areas. Three different topics are addressed in Part 1, which lasts between four and five minutes. In Part 2, candidates are given a topic and asked to talk for between one and two minutes, with one minute of preparation time. Examiners may ask one or two follow-up questions. In Part 3, the examiner and candidate engage in a discussion of more abstract issues and concepts which are thematically linked to the topic used in Part 2. The discussion lasts between four and five minutes.

Part 1: Introduction and interview (4-5 minutes)
Examiner introduces him/herself and confirms candidate's identity. Examiner interviews candidate using verbal questions based on familiar topic frames.

Part 2: Individual long turn (3-4 minutes, including 1 minute preparation time)
Examiner asks candidate to speak for 1-2 minutes on a particular topic based on written input in the form of a general instruction and content-focused prompts. Examiner asks one or two questions at the end of the long turn.

Part 3: Two-way discussion (4-5 minutes)
Examiner invites candidate to participate in discussion of a more abstract nature, based on verbal questions thematically linked to the Part 2 prompt.

Figure 1: Interview structure

3.3 Analytic categories

For each assessment category, the aim was to select or develop specific analyses which:
- addressed each of the individual scales and covered the main features referred to in each
- might be expected to show differences between performances scored at levels 5 to 8
- could be applied reliably and meaningfully.

To address the first two criteria, three pieces of documentation were reviewed:
1. the band descriptors (UCLES, 2001)
2. the Speaking Test criteria key indicators, as described in the Examiner Training Materials (UCLES, 2001)
3. the descriptions of the student samples contained in the Examiner Training Materials (UCLES, 2001).

In order to address the last criterion, the literature on the analysis of learner discourse was reviewed to see what it indicated about the usefulness of particular measures: particularly whether they had sound operational definitions, could be applied reliably, and had sound theoretical justifications. While the measures typically used to measure fluency and vocabulary seemed relatively straightforward, there appeared to be a wide range of measures used for the analysis of syntactic quality, but little detailed guidance on how to segment the data or what levels of reliability might realistically be achieved. Phonology proved to be the most problematic: the only reference was that of Brown et al (2005), who analysed the phonological quality of candidate performance in tape-based monologic tasks. However, not only did the phonological analyses used in that study consist of subjective evaluative judgements rather than (relatively) objective measures, but they required the use of specific phonetic software and the involvement of trained phoneticians. Ultimately, it was decided that such analyses were beyond the scope of the present study. Sections 3.3.1 to 3.3.3 describe the analyses selected for the present study.
3.3.1 Fluency and coherence

Key Fluency and coherence features as described within the IELTS documentation include:
- repetition and self-correction
- hesitation / speech rate
- the use of discourse markers, connectives and cohesive features
- the coherence of topic development
- response length.

Following a review of the literature to ascertain how these aspects of fluency and coherence might be operationalised as measures, the following analyses were adopted.

Firstly, repair was measured in terms of the frequency of self-corrections (restarts and repeats) per 100 words. It was calculated over the Part 2 and Part 3 long responses (not including single word answers or repair turns).

Secondly, hesitation was measured in terms of the ratio of pausing (filled and unfilled pauses) to speech, measured in milliseconds. For this analysis the data were entered into the Cool Edit Pro program (Version 2.1, 2001). Hesitation was also measured in terms of the number of pauses (filled, unfilled and filled/unfilled) relative to words. Both of these measures were carried out using speech produced in response to Part 2, the monologue turn.

Thirdly, speech rate was calculated in terms of the number of words per minute. This was also calculated over Part 2, and the analysis was carried out after the data were cleaned (pruned of repairs, repeats, false starts and filled pauses).

Because the interview is divided into three parts, each of which takes a distinct form, response length was measured in a number of ways, as follows.
1. Average length of response in Part 1. Single word answers and repair turns were excluded. The analysis was carried out after the data were cleaned (pruned of repairs, repeats, false starts and filled pauses).
2. Number of words in Part 2. The analysis was also carried out after the data were cleaned.
3. Average length of response in Part 2 follow-up questions (if presented) and Part 3. Single word answers and repair turns were excluded. Again, the analysis was carried out after the data were cleaned.
4. Average length of response in Part 1, Part 2 (follow-up questions only) and Part 3 combined (all the question-answer sections).


Finally, while not explicitly referred to within the assessment documentation, it was anticipated that the total amount of speech produced by candidates might have a strong relationship with assessed level. The total amount of speech was calculated in terms of the number of words produced by the candidate over the whole interview. Again, the analysis was carried out after the data were cleaned. Table 1 summarises the Fluency and coherence analyses.
Assessment feature | Measure | Data
1. Repair | restarts and repeats per 100 words | Parts 2-3
2. Hesitation | ratio of pause time (filled and unfilled pauses) to speech time | Part 2 monologue
2. Hesitation | ratio of filled and unfilled pauses to words | Part 2 monologue
3. Speech rate | words per 60 secs | Part 2 monologue
4. Response length | average length of response | Part 1
4. Response length | total number of words | Part 2 monologue
4. Response length | average length of response | Part 2 follow-up questions and Part 3
4. Response length | average length of response | Part 1, Part 2 follow-up questions and Part 3
5. Total amount of speech | words per interview | Parts 1-3

Table 1: Summary of fluency and coherence measures
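As an illustration of how measures like those in Table 1 can be computed, the following sketch (hypothetical code, not the study's tooling; the study used Cool Edit Pro to obtain the pause and speech timings) derives the repair, hesitation and speech-rate measures from counts and timings taken from a cleaned transcript:

```python
# Illustrative sketch of three of the Table 1 measures (hypothetical
# code). Inputs are counts and timings from a cleaned transcript.

def repair_rate(n_repairs: int, n_words: int) -> float:
    """Measure 1: restarts and repeats per 100 words."""
    return 100 * n_repairs / n_words

def pause_speech_ratio(pause_ms: float, speech_ms: float) -> float:
    """Measure 2: ratio of pause time to speech time."""
    return pause_ms / speech_ms

def speech_rate(n_words: int, duration_secs: float) -> float:
    """Measure 3: words per 60 seconds of the Part 2 monologue."""
    return 60 * n_words / duration_secs

# eg a 232-word monologue delivered in 120 seconds
print(speech_rate(232, 120))  # 116.0 words per minute
```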

3.3.2 Lexical resources

Key Lexical resources features as described within the IELTS documentation are:
- breadth of vocabulary
- accuracy / precision / appropriateness
- idiomatic usage
- effectiveness and amount of paraphrase or circumlocution.

After a review of the literature to ascertain how these aspects of lexical resources might be operationalised as measures, the following analyses were adopted.

Vocabulary breadth was examined using the program VocabProfile (Cobb, 2002), which measures the proportions of low and high frequency vocabulary. The program is based on the Vocabulary Profile (Laufer and Nation, 1995), and performs the analysis using the Academic Word List (AWL) (Coxhead, 2000). VocabProfile calculates the percentage of words in each of five categories: the most frequent 500 words of English; the most frequent 1000 words of English (K1); the second most frequent thousand words of English (1001 to 2000) (K2); words found in the Academic Word List (AWL); and any remaining words not included in any of the first four lists (Offlist).

The vocabulary breadth analysis was carried out on the Part 2 monologue task using cleaned data (after all filled pauses, repeats/restarts and unclear words were removed). Before the analyses were run, the texts were checked for place names and other proper names, and for lexical fillers and discourse markers such as 'okay' or 'yeah'. These were re-coded as high frequency as they would otherwise show up as Offlist.


Another measure of vocabulary sophistication used in earlier studies is average word length (Cumming et al, 2003). The average word length in each Part 2 monologue performance was calculated by dividing the total number of characters by the total number of words, using the cleaned texts. In addition, as VocabProfile calculates the type-token ratio (the lexical density of the spoken text), this is also reported for Part 2. The type-token ratio is the ratio of the number of different lexical words to the total number of lexical words, and has typically been used as a measure of semantic density. Although it has traditionally been used to analyse written texts, it has more recently been used on spoken texts also (eg see O'Loughlin, 1995; Brown et al, 2005).

The three remaining key vocabulary features were more problematic. For the first two (contextualised accuracy, precision or appropriateness of vocabulary use, and idiomatic usage), no measure was found in the literature for measuring them objectively. These, it seemed, could only be assessed judgementally, which would be difficult to define, time-consuming to carry out, and almost certainly of low reliability. These performance features were, therefore, not addressed in the present study because of resource constraints. Perhaps the best way to understand how these evaluative categories are interpreted and applied might be to analyse what raters claim to pay attention to when evaluating these aspects of vocabulary (see Brown et al, 2005).

The last key vocabulary feature, the ability to paraphrase or use circumlocution, is also not objectively measurable, as it is a communication strategy which is not always visible in speech. It is only possible to know it has been employed (successfully or unsuccessfully) in those cases where the speaker overtly attempts to repair a word choice. However, even this is problematic to measure, as in many cases it may not be clear whether a repair or restart is an attempt at lexical repair or grammatical repair. For these reasons, it was decided that the sole measures of vocabulary in this study would be of vocabulary breadth and density. Table 2 summarises the vocabulary measures.
Assessment feature | Measure | Data
1. Word type | proportion of words in most frequent 500 words | Part 2 monologue
1. Word type | proportion of words in K1 | Part 2 monologue
1. Word type | proportion of words in K2 | Part 2 monologue
1. Word type | proportion of words in AWL | Part 2 monologue
1. Word type | proportion of words in Offlist | Part 2 monologue
2. Word length | average no. of characters per word | Part 2 monologue
3. Lexical density | type/token ratio | Part 2 monologue

Table 2: Summary of lexical resources measures
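The following sketch (hypothetical code; the study itself used VocabProfile) illustrates how the Table 2 measures can be computed, given externally supplied K1, K2 and AWL word lists (the 500-word sublist of K1 is omitted here for brevity):

```python
# Illustrative sketch of the Table 2 measures (hypothetical code).
# The frequency lists k1, k2 and awl must be supplied externally.

def type_token_ratio(tokens: list[str]) -> float:
    """Measure 3: number of different words over total words."""
    return len(set(tokens)) / len(tokens)

def avg_word_length(tokens: list[str]) -> float:
    """Measure 2: total characters divided by total words."""
    return sum(len(t) for t in tokens) / len(tokens)

def frequency_profile(tokens: list[str], k1: set, k2: set, awl: set) -> dict:
    """Measure 1: proportion of tokens falling in each frequency category."""
    counts = {"K1": 0, "K2": 0, "AWL": 0, "Offlist": 0}
    for t in tokens:
        if t in k1:
            counts["K1"] += 1
        elif t in k2:
            counts["K2"] += 1
        elif t in awl:
            counts["AWL"] += 1
        else:
            counts["Offlist"] += 1
    return {category: n / len(tokens) for category, n in counts.items()}
```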

3.3.3 Grammatical range and accuracy

Key Grammatical range and accuracy features described within the IELTS documentation are:
- range / variety of structures
- error type (eg basic) and density
- error-free sentences
- impact of errors
- sentence complexity
- length of utterances
- complexity of structures.

Most of the better-known and well-defined measures for the analysis of syntactic complexity and accuracy depend on first dividing the speech into units, typically based on syntax, such as the clause and the t-unit (a t-unit being an independent clause and all attached dependent clauses). However, because of the elliptical nature of speech, and learner speech in particular, it proved very difficult to divide the speech into these units consistently and reliably, and in particular to distinguish elliptical or ill-formed clauses from fragments. Other units which have been proposed for spoken data, such as the c-unit and the AS-unit (Foster et al, 2000), are less widely used and less well defined in the literature, and were therefore equally difficult to apply.

Consequently, an approach to segmentation was developed for the present study intended to be both workable (to achieve high inter-coder agreement) and valid. It rested on the identification of spoken 'sentences' or utterances primarily in terms of syntax, but also took semantic sense into account in identifying unit boundaries. While utterances were defined primarily as t-units, because of the often elliptical syntax produced by many of the learners, the segmentation also took meaning into account, in that the semantic unity of utterances overrode syntactic (in)completeness. Fragments and ill-formed clauses which were semantically integrated into utterances were treated as part of that utterance. Abandoned utterances and unattached sentence fragments were identified as discrete units. Segmentation was carried out on the cleaned Part 2 and 3 data; hesitation and fillers were removed and, where speech was repaired, the data included the repaired speech only. Once the approach to segmentation had been finalised, 75% of the data was segmented by two people. Inter-coder agreement was 91.5%. Disagreements were resolved through discussion.

Once the data had been segmented, each Part 2 utterance was coded for the occurrence of specific basic errors, these being tense, noun-verb agreement, singular/plural, article, preposition, pronoun choice and comparative formation. In addition, each utterance was coded to indicate whether it contained any type of syntactic error at all. Error-free units were those that were free from any grammatical errors, including the specific errors defined above as well as any others (eg relative clause formation), but excluding word order, as it was extremely difficult to reach agreement on this. In addition, each utterance was coded to indicate the number of clauses it contained. Once the data had been coded, the following analyses were undertaken:

Complexity
- mean length of utterance as measured by the number of words
- number of clauses per utterance

Accuracy
- proportion of error-free utterances
- frequency of basic errors: the ratio of specific basic errors to words.
Assessment feature | Measure | Data
1. Complexity #1 | words per utterance | Parts 2-3
2. Complexity #2 | clauses per utterance | Parts 2-3
3. Accuracy #1 | proportion of error-free utterances | Part 2 monologue
4. Accuracy #2 | ratio of specific basic errors to words | Part 2 monologue

Table 3: Summary of grammatical range and accuracy measures
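A minimal sketch of how the Table 3 measures follow from the utterance coding described above (hypothetical code; the dict representation of a coded utterance is this sketch's own, and in the study the coding itself was done by hand on segmented transcripts):

```python
# Illustrative sketch of the four grammatical measures in Table 3.
utterances = [
    {"words": 14, "clauses": 2, "basic_errors": 1, "error_free": False},
    {"words": 9, "clauses": 1, "basic_errors": 0, "error_free": True},
]

n = len(utterances)
total_words = sum(u["words"] for u in utterances)

words_per_utterance = total_words / n                              # Complexity #1
clauses_per_utterance = sum(u["clauses"] for u in utterances) / n  # Complexity #2
prop_error_free = sum(u["error_free"] for u in utterances) / n     # Accuracy #1
basic_errors_per_word = (
    sum(u["basic_errors"] for u in utterances) / total_words       # Accuracy #2
)
```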


4 RESULTS

4.1 Fluency and coherence

The descriptive statistics for the Fluency and coherence analyses are shown in Table 4. The results of the ANOVAs (analysis of variance) are shown in Appendix 1.
4.1.1 Repair

The number of self-corrections (restarts and repeats) was calculated per 100 words over Parts 2 and 3. Column 1 shows that there is a trend over the four levels for the frequency of self-correction to decrease as the band score for Fluency and coherence increases, although Bands 6 and 7 are very similar and the expected direction is reversed for these two levels. There appears to be a considerable amount of individual variation among students assessed at the same level; the standard deviation for each level is rather large. An ANOVA showed that the differences were not significant (F(3, 16) = .824, p = .499).
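The significance tests reported throughout this section are one-way ANOVAs across the four band levels, with five candidates per level (hence the degrees of freedom F(3, 16)). Purely as an illustration, with made-up repair-rate values, such a test can be run as follows:

```python
# One-way ANOVA across the four band groups (sketch with made-up
# repair-rate values, five per band; the study's actual statistics
# are those quoted in the text).
from scipy.stats import f_oneway

band5 = [8.2, 9.1, 4.3, 12.5, 9.1]   # self-corrections per 100 words
band6 = [7.4, 6.0, 8.1, 6.5, 7.0]
band7 = [6.9, 11.2, 3.5, 8.0, 6.1]
band8 = [5.0, 2.1, 9.3, 4.8, 6.2]

f_stat, p_value = f_oneway(band5, band6, band7, band8)
print(f"F(3, 16) = {f_stat:.3f}, p = {p_value:.3f}")  # df = (4 - 1, 20 - 4)
```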
4.1.2 Hesitation

The amount of hesitation was measured in terms of the ratio of pause time (filled and unfilled pauses) to speech time, and the ratio of filled and unfilled pauses to words. Columns 2 and 3 show that on both measures the amount of pausing relative to speech decreased as the proficiency level increased, with the greatest difference being between levels 5 and 6. However, ANOVAs showed that the differences were not significant (F(3, 16) = 2.314, p = .116 and F(3, 16) = 1.454, p = .264).
Score | 1 Repair | 2 Speak time : pause time | 3 Words : pauses | 4 P2 words per 60 secs | 5 P1 average length of turn | 6 Words P2 | 7 P2/3 average length of turn | 8 P1-3 average length of turn | 9 Total words
8 | 5.49 (3.25) | 7.10 (2.75) | 15.40 (6.28) | 125.3 (20.0) | 49.01 (18.84) | 250.6 (109.3) | 61.23 (37.50) | 51.52 (23.86) | 1227 (175.6)
7 | 7.14 (3.45) | 7.06 (3.61) | 18.31 (15.67) | 123.6 (26.0) | 39.03 (13.84) | 232.0 (66.9) | 60.18 (14.62) | 44.74 (11.09) | 1034 (354.2)
6 | 7.01 (1.09) | 5.99 (2.44) | 14.56 (8.58) | 103.5 (24.1) | 37.60 (22.55) | 224.0 (46.7) | 54.15 (16.36) | 42.24 (19.61) | 1007 (113.6)
5 | 8.64 (4.07) | 3.22 (1.51) | 6.37 (1.28) | 87.2 (20.3) | 24.51 (10.54) | 154.0 (44.7) | 28.62 (12.57) | 25.59 (8.63) | 657 (80.4)

Table 4: Fluency and coherence: descriptive statistics. Cells show Mean (StDev); column numbers correspond to those cited in the text.

4.1.3 Speech rate

Speech rate was measured in terms of the number of words per minute, calculated for Part 2, excluding repairs and restarts. Column 4 shows an increase in the speech rate as the band score for Fluency and coherence increases, although Bands 7 and 8 are very similar. Again the standard deviations are rather large. An ANOVA indicated that the differences were close to significance (F (3, 16) = 3.154, p = .054).
4.1.4 Response length

The interview contained two types of speech: responses to questions (Part 1, Part 2 follow-up questions, and Part 3), which could, in theory, be as long as the candidate wished, and the monologue turn (Part 2), which had a maximum time allowance. Column 5 shows that the average length of response in Part 1 increased as the band score for Fluency and coherence increased, with Band 8 responses being, on average, twice as long as Band 5 responses. The biggest increases were from

Band 5 to Band 6, and Band 7 to Band 8. The average length of response in Bands 6 and 7 was very similar. Again, the standard deviations for each level were high, and an ANOVA showed that the differences were not significant (F(3, 16) = 1.736, p = .200).

In the monologue turn, Part 2, there was an increase in the number of words over the levels, with the biggest increase from Band 5 to 6 (Column 6). The standard deviations for each level were high. Again, an ANOVA showed that the differences were not significant (F(3, 16) = 1.733, p = .200).

As was the case for the responses to questions in Part 1, the average length of response to Part 2 follow-up questions and Part 3 questions increased as the band score for Fluency and coherence increased (Column 7). Again, Band 8 responses were, on average, twice as long as Band 5 responses. The biggest increase was from Band 5 to 6, but this time Bands 7 and 8 were very similar. Again, the standard deviations for each level were high, and again an ANOVA showed that the differences were not significant (F(3, 16) = 2.281, p = .118).

When the average length of response for all question responses was calculated, we again found an increase over the levels, with Band 8 responses being twice as long as Band 5 responses, and with the most marked increase being from Band 5 to 6 (Column 8). Again, an ANOVA showed that the differences were not significant (F(3, 16) = 2.074, p = .144).
4.1.5 Amount of speech

Column 9 shows that as the band score for Fluency and coherence increases, the total number of words over the whole interview increases. The most marked increase is from Band 5 to 6; Bands 6 and 7 are very similar. An ANOVA confirmed significant differences (F(3, 16) = 6.412, p = .005).

4.2 Lexical resources

The descriptive statistics for the Lexical resources analyses are shown in Table 5.
Score | 1 500 % | 2 K1 % | 3 K2 % | 4 AWL % | 5 OWL % | 6 Word length | 7 T/T ratio
8 | 83 (5) | 91 (5) | 4 (3) | 1 (1) | 3 (3) | 4.02 (4.44) | 0.47 (0.03)
7 | 83 (4) | 90 (3) | 5 (1) | 3 (2) | 4 (3) | 4.06 (3.72) | 0.44 (0.06)
6 | 86 (4) | 93 (2) | 3 (2) | 2 (2) | 2 (1) | 3.86 (3.59) | 0.49 (0.09)
5 | 90 (2) | 94 (2) | 4 (2) | 1 (1) | 2 (1) | 4.02 (4.05) | 0.44 (0.06)

Cells show Mean (StDev).

Table 5: Lexical resources: descriptive statistics

The word frequency analysis calculated the percentage of words in each of five categories:
1. the first 500 words (500)
2. the first 1000 words (K1)
3. the second 1000 words (K2)
4. the Academic Word List (AWL)
5. Offlist (OWL).


Columns 1 and 2 in Table 5 show that although there is a slight decrease in the proportion of words from the first 500 words and first 1000 words lists as the Lexical resources band score increases, a large proportion of words is in the first 1000 words list at all levels (91%-94%). The average proportion of words from the remaining categories (K2, AWL and OWL) is relatively low for all levels, and there is no linear increase in the proportion of K2 and AWL words (Columns 3 and 4) across the levels. While the percentage of Offlist words increases across the levels (Column 5), this is, in fact, uninterpretable, as Offlist words were found to include mis-formed words on the one hand and low frequency words on the other. The ANOVAs showed that none of the measures exhibited significant differences. (The results of the ANOVAs are shown in Appendix 1.)

The analysis of average word length (Column 6) indicated that the measure was relatively stable across the levels. This is probably due to the high incidence of high frequency words at all levels, something that is typical of spoken language in general. Column 7 indicates that there is no linear increase across the band levels in the average type-token ratio.

4.3 Grammatical range and accuracy

The descriptive statistics for the Grammatical range and accuracy analyses are shown in Table 6. The results of the ANOVAs are shown in Appendix 1.
Score | 1 Utterance length | 2 Clauses per utterance | 3 Proportion of error-free utterances | 4 Ratio of specific errors to words
8 | 12.33 (2.47) | 1.57 (.36) | 6.41 (3.76) | 72.96 (38.98)
7 | 12.32 (2.24) | 1.64 (.46) | 3.00 (1.29) | 35.86 (15.30)
6 | 12.33 (3.22) | 1.51 (.17) | 1.44 (.27) | 17.97 (5.36)
5 | 11.07 (2.54) | 1.31 (.20) | 1.35 (.40) | 14.15 (3.91)

Cells show Mean (StDev).

Table 6: Grammatical range and accuracy: descriptive statistics

The two measures of complexity (utterance length in terms of mean number of words, and mean number of clauses per utterance) showed very little variation across the levels (Columns 1 and 2). For utterance length, Band 5 utterances were shorter than those of the higher levels, while those of Bands 6-8 were almost identical. An ANOVA showed that the differences were not significant (F(3, 15) = .270, p = .886). For the second measure of complexity, the number of clauses per utterance, there was little difference between levels and the progression was not linear: Band 8 utterances were on average less complex than those of Band 7. Again, the ANOVA revealed no significant differences (F(3, 15) = 1.030, p = .407). In terms of accuracy, both measures behaved as expected. The proportion of error-free utterances increased as the level increased (Column 3) and the frequency of basic errors decreased (Column 4). Both ANOVAs revealed significant differences (F(3, 15) = 6.721, p = .004 and F(3, 15) = 7.784, p = .002).


5 SUMMARY OF FINDINGS

Overall, the analyses revealed evidence that features of test-takers' discourse varied according to the assessed proficiency level. While all measures broadly exhibited changes in the expected direction across the levels, for some the differences between two adjacent levels were not always as expected. In addition, for most measures the differences between levels were greater at some boundaries than others: for example, between Band 5 on the one hand and Bands 6 to 8 on the other, or between Band 8 on the one hand and Bands 5 to 7 on the other. This indicates, perhaps, that rather than contributing equally at all levels, specific aspects of performance are relevant at particular levels only. This finding supports the argument of Pollitt and Murray who, on the basis of analyses of raters' orientations rather than analyses of candidate performance, argued that the trait of proficiency is understood in different terms at different levels and that, as a consequence, proficiency should not be assessed as a 'rectangular' set of components (1996: 89).

Figure 2 shows where the greatest differences lie for each of the measures. On all fluency measures, there was a clear difference between Bands 5 and 6, but the size of the differences between the other bands varied across the measures. For the grammar complexity measures, the greatest difference lay between Band 5 on the one hand, and Bands 6 to 8 on the other. For the accuracy measures, however, the greatest difference lay between Bands 7 and 8, with Bands 5 and 6 being very similar. For the lexical resource measures there was little difference between means for any of the measures.

Fluency and coherence: repair/restart; pause to speak time; frequency of pauses; words per minute; P1 length of turn; P2 words; P2/3 length of turn; P1-3 length of turn; total words. On all of these measures there was a substantial difference between Band 5 and Band 6.

Grammatical range and accuracy: utterance length and clauses per utterance (substantial difference between Band 5 and Bands 6-8, which were very similar); error-free utterances and specific errors (Bands 5 and 6 very similar, substantial difference between Bands 7 and 8).

Lexical resource: little difference between means for all measures.

KEY (as used in the original figure): = indicates little difference between means; / indicates some difference between means; // indicates substantial difference between means. [The full band-by-band symbol strings of the original figure are not recoverable from the source; the patterns above are summarised from the text and tables.]

Figure 2: Differences across bands within measures


For all measures the standard deviations tended to be large, relative to the differences between levels, indicating a high level of variation amongst candidates assessed at the same level and a high degree of overlap between levels, even for those measures which produced significant findings. This would appear to indicate that while all the measures contribute in some way, none is an overriding driver of the rating awarded; candidates assessed at one particular level on one scale display subtle differences in performance on the different dimensions of that trait. This is perhaps inevitable where different and potentially conflicting features (such as accuracy and complexity) are combined into the one scale. Brown et al (2005) acknowledge this possibility when they discuss the tension, referred to by raters, between dimensions on all traits grammar, vocabulary, fluency and pronunciation such as accuracy (or nativeness), complexity (or sophistication) and impact. This tension is also acknowledged in the IELTS band scales themselves, with the following statement about grammar: Complex structures are attempted but these are limited in range, nearly always contain errors and may lead to the need for reformulation. Impact, of course, is listener-related and is therefore not something that can be measured objectively, unlike the other measures addressed in this study. The findings are very interesting for a number of reasons. First, they reveal that, for each assessment category, a range of performance features appear to contribute to the overall impression of the candidate. In terms of the relatively low number of measures which revealed significant differences amongst the levels, this may be attributed to the relatively few samples at each level which resulted in large measurement error. While a number of the measures approached significance, the only one to exhibit significant differences across levels was the total amount of speech. This is in many ways surprising, because amount of speech is not specifically referred to in the scales. In addition, it is not closely related to the length of response measures, which showed trends in the expected direction but were not significant. It may be, then, that interviewers close down or otherwise cut short the phases of the interview if they feel that candidates are struggling, which would explain the significance of this finding. It may also be that while the extended responses produced by weaker candidates were not substantially shorter than those of stronger candidates, weaker candidates produced many more single-word responses and clarification requests which resulted in the interviewer dominating the talk more. Second, the conduct of the analysis and review of the results allow us to draw conclusions about the methodology used in the study. Not all of the measures proved to be useful. For example, the relatively high proportion of high frequency vocabulary in all performances meant that the lexical frequency measures proved to be unhelpful in distinguishing the levels. It would appear that a more fine-grained analysis is required here, something that lay outside the scope of the present study. In addition, for some aspects of performance it was not possible to find previously-used valid and reliable measures for example, to measure syntactic sophistication. 
Brown et al (2005), who tried to address this dimension through the identification of specific structures such as passives and conditionals, found so few examples in the spoken texts that the measure failed to reveal differences amongst levels. It may be that raters' impressions of sophistication are driven by one or two particularly salient syntactic (or lexical) features in any one candidate's performance, but that these differ for different candidates. In short, it may prove to be impossible to get at some of the key drivers of assessments through quantification of discourse features.

Other measures appear to be somewhat ambiguous. For example, self-repair might, on the one hand, be taken as evidence of monitoring strategies and therefore a positive feature of performance. On the other, it might draw attention to the fact that errors had been made, or be viewed as affecting the fluency of the candidate's speech, both of which might lead it to be evaluated negatively. Given this, the feature on its own is unlikely to have a strong relationship with assessed levels of proficiency.
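As an illustration of the kind of lexical frequency analysis referred to above, the following sketch (in Python, which is not what the study itself used) mimics the band-by-band profiling performed by tools such as Cobb's Web Vocabulary Profiler: each token is assigned to the first frequency band whose word list contains it, and the profile is reported as percentages. The word lists here are tiny invented stand-ins; the real tool draws on the full first and second 1000-word frequency lists and Coxhead's Academic Word List.

from collections import Counter

# Stand-in word lists; the real profiler uses the K1/K2 frequency lists
# and the Academic Word List (Coxhead 2000).
K1 = {"i", "the", "a", "to", "go", "like", "my", "in", "is", "very"}
K2 = {"festival", "museum"}
AWL = {"research", "academic"}

def lexical_profile(tokens):
    """Assign each token to its first matching band and report percentages."""
    bands = Counter()
    for token in tokens:
        if token in K1:
            bands["K1"] += 1
        elif token in K2:
            bands["K2"] += 1
        elif token in AWL:
            bands["AWL"] += 1
        else:
            bands["offlist"] += 1
    total = sum(bands.values())
    return {band: round(100.0 * n / total, 1) for band, n in bands.items()}

# A high proportion of K1 words, as found in the study, makes this measure
# a blunt instrument for separating proficiency levels.
print(lexical_profile("i like to go to the festival in my town".split()))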


Despite the problems outlined above, and while there were some limitations to the study in terms of size, scope and choice of analyses, the results are in general encouraging for the validity of the IELTS band descriptors. The overall tendency for most of the measures to display increases in the expected direction across the levels appears to confirm the relevance of the criteria they address to the assessment of proficiency in the IELTS interview.


REFERENCES

Brown, A, Iwashita, N and McNamara, T, 2005, An Examination of Rater Orientations and Test-Taker Performance on English-for-Academic-Purposes Speaking Tasks, TOEFL Monograph Series, MS-29, Educational Testing Service, Princeton, NJ
Cobb, T, 2002, The Web Vocabulary Profiler, ver 1.0, computer program, University of Québec, Montréal, retrieved from <http://www.er.uqam.ca/nobel/r21270/textools/web_vp.html>
Coxhead, A, 2000, A new academic word list, TESOL Quarterly, vol 34, no 2, pp 213-238
Cumming, A, Kantor, R, Baba, K, Eouanzaoui, E, Erdosy, U and James, M, 2003, Analysis of discourse features and verification of scoring levels for independent and integrated prototype written tasks for New TOEFL, draft project report, Educational Testing Service, Princeton, NJ
Douglas, D, 1994, Quantity and quality in speaking test performance, Language Testing, vol 11, no 2, pp 125-144
Douglas, D and Selinker, L, 1992, Analysing oral proficiency test performance in general and specific-purpose contexts, System, vol 20, no 3, pp 317-328
Douglas, D and Selinker, L, 1993, Performance on a general versus a field-specific test of speaking proficiency by international teaching assistants, in A New Decade of Language Testing Research, eds D Douglas and C Chapelle, TESOL Publications, Alexandria, VA, pp 235-256
English Language Institute, 2001, Examination for the Certificate of Competency in English, English Language Institute, University of Michigan, Ann Arbor
Foster, P and Skehan, P, 1996, The influence of planning on performance in task-based learning, Studies in Second Language Acquisition, vol 18, no 3, pp 299-324
Foster, P, Tonkyn, A and Wigglesworth, G, 2000, A unit for all measures: analysing spoken discourse, Applied Linguistics, vol 21, no 3, pp 354-375
Fulcher, G, 1993, The construction and validation of rating scales for oral tests in English as a foreign language, unpublished doctoral dissertation, University of Lancaster, UK
Fulcher, G, 1996, Does thick description lead to smart tests? A data-based approach to rating scale construction, Language Testing, vol 13, no 2, pp 208-238
Fulcher, G, 2003, Testing Second Language Speaking, Pearson Education Limited, London
Iwashita, N and McNamara, T, 2003, Task and interviewer factors in assessments of spoken interaction in a second language, unpublished report, Language Testing Research Centre, The University of Melbourne
Iwashita, N, McNamara, T and Elder, C, 2001, Can we predict task difficulty in an oral proficiency test? Exploring the potential of an information processing approach to task design, Language Learning, vol 51, no 3, pp 401-436
Laufer, B and Nation, P, 1995, Vocabulary size and use: lexical richness in L2 written production, Applied Linguistics, vol 16, no 3, pp 307-322
Lewkowicz, J, 1997, The integrated testing of a second language, in Encyclopedia of Language and Education, Vol 7: Language Testing and Assessment, eds C Clapham and D Corson, Kluwer, Dordrecht, The Netherlands, pp 121-130


Magnan, SS, 1988, Grammar and the ACTFL oral proficiency interview: discussion and data, Modern Language Journal, vol 72, pp 266-276
O'Loughlin, K, 1995, Lexical density in candidate output on direct and semi-direct versions of an oral proficiency test, Language Testing, vol 12, no 2, pp 217-237
Paltridge, B, 2000, Making Sense of Discourse Analysis, Antipodean Educational Enterprises, Gold Coast, Queensland
Pollitt, A and Murray, N, 1996, What raters really pay attention to, in Performance Testing, Cognition and Assessment (Studies in Language Testing 3), eds M Milanovic and N Saville, Cambridge University Press, Cambridge, pp 74-91
Schiffrin, D, 1987, Discourse Markers, Cambridge University Press, Cambridge
Skehan, P, 1998, A Cognitive Approach to Language Learning, Oxford University Press, Oxford
Skehan, P and Foster, P, 1999, The influence of task structure and processing conditions on narrative retellings, Language Learning, vol 49, pp 93-120
Syntrillium Software Corporation, 2001, Cool Edit Pro, ver 2.1, computer program, Phoenix, Arizona
UCLES, 2001, IELTS Examiner Training Materials, University of Cambridge Local Examinations Syndicate, Cambridge
Wigglesworth, G, 1997, An investigation of planning time and proficiency level on oral test discourse, Language Testing, vol 14, pp 85-106
Wigglesworth, G, 2001, Influences on performance in task-based oral assessments, in Task-Based Learning, eds M Bygate, P Skehan and M Swain, Addison Wesley Longman, pp 186-209


APPENDIX 1: ANOVAS (ANALYSIS OF VARIANCE)

Fluency and coherence ANOVAs


Measure (* Score)                            Source           Sum of Squares   df   Mean square   F       Sig.
Abandoned words and repeats per 100 words    Between groups   24.849           3    8.283         .824    .499
                                             Within groups    160.753          16   10.047
Ratio of pause time to speak time            Between groups   49.844           3    16.615        2.304   .116
                                             Within groups    115.384          16   7.212
Ratio of pauses to words                     Between groups   392.857          3    130.952       1.454   .264
                                             Within groups    1,440.910        16   90.057
P2 words per 60 secs                         Between groups   4,896.861        3    1,632.287     3.154   .054
                                             Within groups    8,280.294        16   517.518
Words P2 only                                Between groups   26,791.350       3    8,930.450     1.733   .200
                                             Within groups    82,433.200       16   5,152.075
P1 av. length of turn                        Between groups   1,518.518        3    506.173       1.736   .200
                                             Within groups    4,664.718        16   291.545
P2/3 av. length of turn                      Between groups   3,499.907        3    1,166.636     2.281   .118
                                             Within groups    8,182.661        16   511.416
P1-3 av. length of turn                      Between groups   1,790.619        3    596.873       2.074   .144
                                             Within groups    4,605.400        16   287.837
Total words                                  Between groups   844,710.550      3    281,570.183   6.412   .005
                                             Within groups    702,596.400      16   43,912.275


Lexical resources ANOVAs


Measure (* Score)            Source           Sum of Squares   df   Mean square   F       Sig.
500% (first 500 words)       Between groups   147.497          3    49.166        2.636   .085
                             Within groups    298.453          16   18.653
K1% (first 1000 words)       Between groups   55.125           3    18.375        1.564   .237
                             Within groups    187.984          16   11.749
K2% (second 1000 words)      Between groups   8.524            3    2.841         .709    .561
                             Within groups    64.144           16   4.009
AWL% (academic word list)    Between groups   7.873            3    2.624         1.416   .275
                             Within groups    29.659           16   1.854
OWL% (offlist)               Between groups   6.026            3    2.009         .480    .701
                             Within groups    67.011           16   4.188
Word length                  Between groups   .102             3    .034          .587    .632
                             Within groups    .926             16   .058
T/T ratio                    Between groups   .010             3    .003          .817    .503
                             Within groups    .067             16   .004

Grammatical range and accuracy ANOVAs


Measure (* Score)                      Source           Sum of Squares   df   Mean square   F       Sig.
Utterance length                       Between groups   5.768            3    1.923         .270    .846
                                       Within groups    106.973          15   7.132
Clauses per utterance                  Between groups   .296             3    .099          1.030   .407
                                       Within groups    1.436            15   .096
Proportion of error-free utterances    Between groups   84.112           3    28.037        6.721   .004
                                       Within groups    62.574           15   4.172
Ratio of specific errors to words      Between groups   10,830.11        3    3,610.04      7.784   .002
                                       Within groups    6,956.58         15   463.77
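ANOVAs of this form can be reproduced with standard statistical software. The sketch below (Python with SciPy, which is an assumption of this illustration rather than the software used in the study) shows the shape of the computation: the values of one measure are grouped by band score and a one-way ANOVA is run, where F is the ratio of the between-groups to the within-groups mean square. The group data below are invented for illustration only.

from scipy.stats import f_oneway

# Hypothetical 'total words' values for five candidates at each of four
# band levels (four score groups, hence df = 3 between groups, 16 within).
band_5 = [620, 700, 680, 590, 640]
band_6 = [810, 760, 890, 850, 820]
band_7 = [1010, 960, 1100, 980, 1040]
band_8 = [1250, 1190, 1320, 1280, 1230]

f_stat, p_value = f_oneway(band_5, band_6, band_7, band_8)
print(f"F = {f_stat:.3f}, Sig. = {p_value:.3f}")

# The F values printed in the tables can also be recovered directly from the
# mean squares, eg for Total words: 281570.183 / 43912.275 = 6.412 (Sig. = .005).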

4. The impact on candidate language of examiner deviation from a set interlocutor frame in the IELTS Speaking Test
Authors
Barry O'Sullivan, University of Roehampton, UK
Yang Lu, University of Reading, UK

Grant awarded Round 8, 2002

This paper shows that the deviations examiners make from the interlocutor frame in the IELTS Speaking Test have little significant impact on the language produced by candidates.
ABSTRACT

The Interlocutor Frame (IF) was introduced by Cambridge ESOL in the early 1990s to ensure that all test events conform to the original test design, so that all test-takers participate in essentially the same event. While the Frame has been essentially successful, Lazaraton (1992, 2002) demonstrated that examiners sometimes deviate from the IF under test conditions. This study of the IELTS Speaking Test set out to locate specific sources of deviation, the nature of these deviations and their effect on the language of the candidates. Sixty recordings of test events were analysed. The methodology involved the identification of deviations from the IF, followed by the transcription of the candidates' pre- and post-deviation output. The deviations were classified, and the test-takers' pre- and post-deviation oral production compared in terms of elaborating and expanding in discourse, linguistic accuracy and complexity, as well as fluency. Results indicate that the first two parts of the Speaking Test are quite stable in terms of deviations, with relatively few noted, and the impact of these deviations on the language of the candidates was essentially negligible in practical terms. In the final part of the Test, however, there appears to have been a somewhat different pattern of behaviour, particularly in relation to the number of paraphrased questions used by the examiners. The impact on candidate language again appears to have been minimal. One implication of these findings is that it may be possible to allow for some flexibility in the Interlocutor Frame, though this should be limited to allowing for examiner paraphrasing of questions.


AUTHOR BIODATA

BARRY O'SULLIVAN
Barry O'Sullivan has a PhD in language testing, and is particularly interested in issues related to performance testing, test validation, and test-data management and analysis. He has lectured for many years on various aspects of language testing, and is currently Director of the Centre for Language Assessment Research (CLARe) at Roehampton University, London. Barry's publications have appeared in a number of international journals and he has presented his work at international conferences around the world. His book Issues in Business English Testing: the BEC Revision Project was published in 2006 by Cambridge University Press in the Studies in Language Testing series; his next book is due to appear later this year. Barry is very active in language testing around the world and currently works with government ministries, universities and test developers in Europe, Asia, the Middle East and Central America. In addition to his work in the area of language testing, Barry taught in Ireland, England, Peru and Japan before taking up his current post.

YANG LU
Dr Yang Lu has a BA in English and English Literature from Jilin University, China. She obtained both her MA and her doctorate from the University of Reading. Her PhD investigates the nature of EFL test-takers' spoken discourse competence. Dr Yang Lu has 18 years' experience of language teaching and testing. She worked first as a classroom teacher and later as Director of the ESP Faculty and Deputy Coordinator of a British Council project based at Qingdao University, where she also worked as Associate Professor of English. Her academic interests are spoken discourse analysis and its applications in classroom and oral assessment contexts. Dr Yang Lu's publications include papers on EFL learners' interlanguage pragmatics, applications of the Birmingham School approach, the roles of fuzziness in English language oral communication, and task-based grammar teaching. She has presented different aspects of her work at a number of international conferences. Dr Yang Lu was a Spaan Fellow for a validation study on the impact of examiners' conversational styles.


CONTENTS
1 Introduction
2 The Interlocutor Frame
3 Methodology
  3.1 The IELTS Speaking Test
  3.2 Test-takers
  3.3 The examiners
4 The study
  4.1 The coding process
  4.2 Locating deviations
  4.3 Transcribing
5 Analysis
6 Results
  6.1 Overall
    6.1.1 Paraphrasing
    6.1.2 Interrupting
    6.1.3 Improvising
    6.1.4 Commenting
  6.2 Impact on test-takers' language of each deviation type
  6.3 Location of deviations
    6.3.1 Deviations by test part
    6.3.2 Details of the deviations
7 Conclusions
8 References
Acknowledgement
Appendix 1: Profiles of the test-takers included in the study


1 INTRODUCTION

While research into various aspects of speaking tests has become more common and more varied over the past decade, there is still great scope for researchers in the area, as the fractured nature of research to date betrays the lack of a systematic research agenda in the field. O'Sullivan (2000) called for a focus on a more clearly defined socio-cognitive perspective on speaking, and this is reflected in the framework for validating speaking tests outlined by Weir (2005). This is of particular relevance in tests of speaking where candidates are asked to interact either with other candidates and an examiner or, in the case of IELTS, with an examiner only. The co-constructive nature of spoken language means that the role played by the examiner-as-interlocutor in the test event is central to that event.

One source of construct-irrelevant variance in face-to-face speaking tests lies in the potential for examiners to misrepresent the developer's construct by consciously or subconsciously changing the way in which individual candidates are examined. There is considerable anecdotal evidence to suggest that examiners have a tendency to deviate from planned patterns of discourse during face-to-face speaking tests, and to some extent we might want this to happen, for example to allow the interaction to develop in an authentic way. However, examining speaking by using what is sometimes called a 'conversational interview' (Brown 2003:1) is far more likely to result in test events that are essentially unique, though this is something that can be said of any truly free conversation; see also van Lier's (1989) criticism of this type of test, in which he convincingly argues that true conversation is not necessarily reflected in interactions performed under test conditions. These dangers, which include unpredictability in terms of topic, linguistic input and expected output, all of which can have an impact on test-taker performance, have long been noted in the language testing literature (see Wilds 1975; Shohamy 1983; Bachman 1988, 1990; Stansfield 1991; Stansfield & Kenyon 1992; McNamara 1996; Lazaraton 1996a).

There have been a number of studies in which rater linguistic behaviour has been explored in terms of its impact on candidate performance (see Brown & Hill 1998; Brown & Lumley 1997; Young & Milanovic 1992), and others in which the focus was on linguistic behaviour without an overt focus on the impact on candidate performance (Lazaraton 1996a, 1996b; Ross 1992; Ross & Berwick 1992). Other studies have looked at the broader context of examiner behaviour (Brown 1995; Chalhoub-Deville 1995; Halleck 1996; Hasselgren 1997; Lumley 1998; Lumley & O'Sullivan 2000; Thompson 1995; Upshur & Turner 1999). The results of these studies suggest that there is likely to be systematic variation in how examiners behave during speaking test events, in relation both to their language and to their rating. These studies have tended to look either at the scores achieved by candidates or at the identification of specific variations in rater behaviour, and have not focused so much on how the language of the candidates might be affected as a result of particular examiner linguistic behaviour (with the exception perhaps of Brown & Hill 1998). Another limitation of these studies (at least in terms of the study reported here) is the fact that they were almost all conducted on so-called conversational interviews (the work of Lazaraton 2002 being an exception).
Since the 1990s, many tests have moved away from this format to a more tightly controlled model of spoken test using an Interlocutor Frame.

2 THE INTERLOCUTOR FRAME

An Interlocutor Frame (IF) is essentially a script. The idea of using such a device is to ensure that all test events conform to the original test design, so that all test-takers participate in essentially the same event. Of course, the very nature of live interaction means that no two events are ever likely to be exactly the same, but some measure of standardisation is essential if test-takers are to be treated fairly and equitably. Such frames were first introduced by Cambridge ESOL in the early 1990s (Saville & Hargreaves 1999) to increase standardisation of examiner behaviour in the test event, though it was demonstrated by Lazaraton (1992) that there might still be deviations from the Interlocutor Frame even after examiner training. This may have been at least partly a response by the examiners to the extreme rigidity of the early frames, where all responses (verbal, paraverbal and non-verbal) were scripted. Later work by Lazaraton (2002) provided evidence of the effect of examiner language and behaviour on ratings, and contributed to the development of the less rigid Interlocutor Frames used in subsequent speaking tests.

As we have pointed out above, the IF was originally introduced to give the test developer more control of the test event. However, Lazaraton has demonstrated that, when it comes to the actual event itself, examiners still have the potential to deviate from any frame. The questions that emerge from this are:

1. Are there identifiable positions in the IELTS Speaking Test in which examiners tend to deviate from the Interlocutor Frame?
2. Where a deviation occurs, what is the nature of the deviation?
3. Where a deviation occurs, what is the effect on the linguistic performance of the candidate?

To investigate these questions, it was decided to revisit the IELTS Speaking Test following earlier work. Brown & Hill (1998) and Brown (2003) reported a study based on a version of the IELTS Speaking Test which was operational between 1989 and 2001. Findings from this work, together with outcomes from other studies on the IELTS Speaking Test, informed a major revision of the test in the late 1990s; from July 2001 the revised test incorporated an Interlocutor Frame for the first time, to reduce rater variability (see Taylor, in press). (The structure of the current test is described briefly below in 3.1.) Since its introduction, the functioning of the Interlocutor Frame in the IELTS Speaking Test has been the focus of ongoing research and validation work; the study reported here forms part of that agenda and is intended to help shape future changes to the IF and to inform procedures for IELTS examiner training and standardisation.

3 METHODOLOGY

Previous studies into the use by examiners of Interlocutor Frames used time-consuming, and therefore extremely expensive, research methodologies, particularly conversation analysis (see the work of Lazaraton 1992, 1996a, 1996b, 2002). Here, an alternative methodology is applied, in which audio-recorded examination events were first studied (in real time) for deviations from the specified IF. These deviations were then coded, and the discourse around them transcribed and analysed, capturing the test-takers' pre- and post-deviation oral output. A total of approximately 60 recorded live IELTS Speaking Tests, undertaken by a range of different examiners, were analysed. The deviations were classified, and the test-takers' pre- and post-deviation oral production compared in terms of elaborating and expanding in discourse, linguistic accuracy and complexity, as well as fluency.

3.1 The IELTS Speaking Test

The Speaking Test is one of four skills-focused components which make up the IELTS examination administered by the IELTS partners: Cambridge ESOL, British Council and IELTS Australia. The Test consists of a one-to-one, face-to-face oral interview with a single examiner and candidate. All IELTS interviews are audio-taped for purposes of quality assurance and monitoring. The test has three parts (see Figure 1), each of which is designed to elicit different profiles of a candidate's language. This has been shown to be the case in speaking tests for the Cambridge ESOL Main Suite examinations by O'Sullivan, Weir & Saville (2002) and O'Sullivan & Saville (2000) through use of an observation checklist. Brooks (2003) reports how a similar methodology was developed for and applied to IELTS; an internal Cambridge ESOL study (Brooks 2002) demonstrated that the different IELTS test parts were capable of fulfilling a specific function in terms of interaction pattern, task input and candidate output.
Part                                  Nature of interaction                                                Timing
Part 1: Introduction and interview    Examiner introduces him/herself and confirms candidate's            4-5 minutes
                                      identity. Examiner interviews candidate using verbal questions
                                      selected from familiar topic frames.
Part 2: Individual long turn          Examiner asks candidate to speak for 1-2 minutes on a particular    3-4 minutes
                                      topic based on written input in the form of a candidate task        (incl. 1 minute
                                      card and content-focused prompts. Examiner asks one or two          preparation time)
                                      questions to round off the long turn.
Part 3: Two-way discussion            Examiner invites candidate to participate in discussion of a        4-5 minutes
                                      more abstract nature, based on verbal questions thematically
                                      linked to Part 2 topic.

Figure 1: IELTS Speaking Test format

The examiner interacts with the candidate and awards scores on four analytical criteria which contribute to an overall band score for speaking on a nine-point scale (further details of test format and scoring are available on the IELTS website: www.ielts.org). Since this study is concerned with the language of the test event as opposed to the outcome (ie the score awarded), no further discussion of the scoring will be entered into at this point, except to say that the band scores were used to assist the researchers in selecting a range of test events in which candidates of different levels were represented. The test version selected for use in this study is Version 88, a version that was in use after July 2001 but was later retired.

3.2 Test-takers

A total of 85 audio-taped live IELTS Speaking Test events using Test Version 88 were selected from administrations of the test conducted during 2002. Of these, 70 were selected for the study after consideration of the test-takers' nationality and first language. This was done to reflect the composition of the general IELTS candidature worldwide. Band scores awarded to candidates were also examined, to avoid a situation where one nationality might be over-represented at the different overall score levels. However, this was not always successful, as it is clear from the overall patterns of IELTS scores that there are differences in performance levels across the many different nationalities represented in the test-taking population.

After an initial listening, a further eight performances were excluded because of the poor quality of the recordings (previous experience has shown that this makes accurate transcription almost impossible), leaving 62 speaking performances for inclusion in the analysis. There were 21 female test-takers and 41 males. The language and nationality profile is shown in Table 1. From this table we can see that the population represents a wide range of first languages (17) and nationalities (18). This sample allows for some level of generalisation to the main IELTS population. More detailed information about the test-takers can be found in Appendix 1.
Language     Nationality    Number        Language     Nationality    Number
Arabic       Iraq           1             Portuguese   Brazil         1
Arabic       Oman           5             Portuguese   Portugal       1
Arabic       UAE            3             Punjabi      India          3
Bengali      Bangladesh     3             Pushtu       Pakistan       1
Chinese      China          17            Spanish      Colombia       1
Chinese      Taiwan         1             Spanish      Mexico         1
Farsi        Iran           1             Swedish      Sweden         5
German       Switzerland    1             Telugu       India          1
Hindi        India          5             Urdu         Pakistan       4
Japanese     Japan          1             Other        India          1
Korean       S Korea        1             Other        Malawi         1

Table 1: Language and nationality profile

3.3 The examiners

A total of 52 examiners conducted the 62 tests included in the matrix. The intention was to include as large a number of examiners as possible, in order to minimise any impact on the data of non-standard behaviour by particular judges. For this reason, care was also taken to ensure that no one examiner would conduct the test on more than three occasions. As all of the test events used in this study were live (ie recordings of actual examinations), the conditions under which the tests were administered were controlled. This meant that all of the examiners were fully trained and standardised, and had experience working with this test.

4 THE STUDY

4.1 The coding process

The first listening was undertaken to identify the nature and location of the obvious and recurring deviations from the Interlocutor Frame by examiners. The more frequent deviations were first identified, then categorised, and finally coded. Efforts were made to keep the coding consistent with a set of definitions for these deviations which was generated gradually during the listening. As is usual with this kind of work, the definitions were very sketchy at the outset but had become more clearly defined by the time the first careful listening was finished. Table 2 presents the findings of this first listening.


Type of deviation                     Coding   Definition
interrupting question                 itr      question asked that stops the test-taker's answer
hesitated question                    hes      question asked hesitatingly, possibly because of unfamiliarity with the interlocutor frame
paraphrased question                  para     question that is rephrased without the test-taker's request; appears to be based on the examiner's judgement of the candidate's listening comprehension ability
paraphrased and explained question    parax    question that is both paraphrased and explained with an example, with or without the test-taker's request
comments after replies                com      comment made after the test-taker's reply that is more than the acknowledgement or acceptance the examiner is supposed to give; it tends to make the discourse more interactive
improvised question                   imp      question that is not part of the interlocutor frame but is asked based on the test-taker's reply, very often about their personal interests or background
informal chatting                     chat     informal discussion mainly held by an examiner who is interested in the test-taker's experience or background
loud laughing                         la       examiner's loud laughing caused by the test-taker's reply or answer
offer of clues                        cl       examiner's utterance made to offer a hint and/or to facilitate the candidate's reply

Table 2: Development of coding for deviations (Listening 1)

A second careful listening was undertaken to confirm the identification of deviations, to check the coding for each case, and to decide on a final list of the deviations to be examined. As can be seen from Table 2, there were two distinct types of deviation related to paraphrasing. While this appeared at first a useful distinction, it became quite difficult to operationalise, as the study was based on audio tapes, a medium which does not allow the researcher to observe the body language and facial expressions of the parties involved. This made it practically impossible to know whether paraphrasing was performed in response to test-takers' requests (verbal or non-verbal) or volunteered by the examiner. Therefore, the decision was made to collapse the two paraphrasing categories and to report only the single category 'paraphrase'. The resulting occurrences of the deviations are shown in Table 3:

Type of deviation        Coding   Occurrences
interrupting question    itr      34
hesitated question       hes      7
paraphrased question     para     47
comments after reply     com      12
improvised question      imp      28
informal chatting        chat     9
laughing                 la       5
clues                    cl       2

Table 3: Occurrences of deviations
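Once each recording has been annotated with the codes in Table 2, tallies such as those in Table 3 are a simple counting exercise. The sketch below (Python; the annotation tuples are hypothetical, standing in for the judgements made during the two listenings) illustrates the step.

from collections import Counter

# (recording id, deviation code) pairs produced during listening (hypothetical)
annotations = [
    ("test_001", "para"), ("test_001", "itr"), ("test_002", "imp"),
    ("test_002", "para"), ("test_003", "com"), ("test_003", "hes"),
]

# tally occurrences of each coded deviation across all recordings
occurrences = Counter(code for _, code in annotations)
for code in ("itr", "hes", "para", "com", "imp", "chat", "la", "cl"):
    print(f"{code}: {occurrences[code]}")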


Two decisions were made after the second listening.

1. The four types of deviation that were found to be most frequent in the tests were selected for investigation: interrupting question, paraphrased question, comment after replies, and improvised question. We also believe that these four types can be established as deviations because the Instructions to IELTS Examiners (Cambridge ESOL 2001) make the following very clear to examiners:

    The Interlocutor Frame is used for the purpose of standardisation in order that all candidates are treated fairly and equally. Deviations from the script may introduce easier or more difficult language or change the focus of a task. In Part 1 the exact words in the Frame should be used. Reformulating and explaining the questions in the examiner's own words are not allowed. In Part 2 examiners must use the words provided in the Frame to introduce the long turn task. In Part 3 the Frame is less controlled so that the examiner's language can be accommodated to the level of the candidate being examined. In all parts of the test, examiners should refrain from making unscripted comments or asides.

An explanation needs to be given at this point of the rationale for including interrupting questions and paraphrased questions in Part 3 as deviation types. Although, understandably, examiners sometimes cannot help stopping test-takers whose replies in Parts 1 and 3 are lengthy and slow the progress of the Speaking Test, this should be done in a more subtle way, with body language, as suggested in the IELTS Speaking Test FAQs and Feedback document (Cambridge ESOL 2001), or by using more tentative verbal hints. These strategies are suggested so as to limit any potential impact on future candidate linguistic performance. The interrupting questions we have coded as deviations neither occur after lengthy replies by test-takers nor are they made in a non-threatening (ie tentative) manner.

In Part 1, as the Instructions to IELTS Examiners state, examiners should not explain any vocabulary in the frame. Therefore, any reformulating of the questions is regarded here as a deviation and coded as such. However, in Part 3 examiners have more independence and flexibility within the Frame and are even encouraged to 'develop the topic in a variety of directions according to the responses from the candidates' (Cambridge ESOL 2001). The examiners' decisions to reformulate, rephrase, exemplify or paraphrase the questions in Part 3 were noticed in the first listening of the tapes. In most cases this was done without a specific request from the test-takers, and appears to have been based on examiner judgements of the individual test-taker's level of proficiency and ability to discuss the comparatively more abstract topics contained in this section of the Test. It should be noted, however, that this part differs from Parts 1 and 2 in that the prompts are just that: indicative prompts which examiners are to articulate in a way that is appropriate to the level of the candidate, not fully scripted questions to be read off the page as in Parts 1 and 2.
2. The second decision concerned the amount of speech to be transcribed on either side of the deviation. Since it was believed that we needed a significant amount of language for transcription so that realistic observations could be made, and that all language chunks transcribed should be of similar length, we decided that 30 seconds of pre- and post-deviation speech should be transcribed and analysed to provide reliable data for investigation. Details of the transcription conventions used are given below. Pre-deviation sections that were found to overlap with the post-deviation section of a previous question could not be transcribed. As a result, the number of pre- and post-deviation sections of oral production by the candidates in each category was reduced, the final numbers being:

33 paraphrased questions
26 interrupting questions
17 improvised questions
9 comments after replies.

4.2 Locating deviations

The reason for looking at the points of deviation was to identify places in the Interlocutor Frame that might be prone to lead to unintended breakdowns or deviations. It was thought that locating these 'weak points' in the Frame would offer valuable insights into why the breakdowns occurred, and lead to a series of practical recommendations for the improvement of the IF, as well as guidance for examiner training. Two procedures were undertaken for this purpose:

1. Occurrences of each deviation in the three test parts were identified, to highlight where deviations were most likely to occur.
2. Occurrences of the questions where examiners deviated most were counted, in order to discover where certain deviations would be most likely to occur within each test part.

4.3 Transcribing

Transcribing was conducted after the second, more detailed listening. The maximum amount of time for each pre- or post-deviation chunk was 30 seconds. The conventions for the transcriptions are as follows:

er     filled pause
x      one syllable of a non-transcribed word
...    non-transcribed pre- or post-deviation oral production

A total of over 10,000 words were transcribed in the pre- and post-deviation data. This dataset was then divided into nine files:

Part 1. com (comments after replies in Part 1)
Part 2. com (comments after replies in Part 2)
Part 3. com (comments after replies in Part 3)
Part 1. itr (interrupting questions in Part 1)
Part 3. itr (interrupting questions in Part 3)
Part 1. imp (improvised questions in Part 1)
Part 3. imp (improvised questions in Part 3)
Part 1. para (paraphrased questions in Part 1)
Part 3. para (paraphrased questions in Part 3)

5 ANALYSIS

To realise the aim of the study (to compare the quality of the candidates' oral production in the pre- and post-deviation sections), four categories of measure were used; these are presented in Table 4 along with their sub-categories.
Category of measures      Sub-categories of measures
Fluency                   1. filled pauses per AS-unit
                          2. words per second (excluding repetitions, self-corrections and filled pauses)
Grammatical Accuracy      1. number of errors of plural or singular forms per word
                          2. number of errors of subject and verb agreement per word
Linguistic Complexity     average number of clauses per AS-unit
Discoursal Performance    1. number of expanding moves per T-unit
                          2. number of elaborating moves per T-unit
                          3. number of enhancing moves per T-unit

Table 4: Categories of measures used in transcription analysis

The Analysis of Speech Unit, or AS-unit (Foster, Tonkyn & Wigglesworth 2000), was used for calculating filled pauses and investigating linguistic complexity; for comparing the discoursal performance before and after deviations, the T-unit (Hunt 1970) was chosen as the unit in which changes were examined. The rationale for this approach is as follows:

1. According to Foster et al (2000:365), the AS-unit is 'a mainly syntactic unit ... consisting of an independent clause, or sub-clausal unit, together with any subordinate clause(s) associated with either'. This allows us to analyse speech at different clausal units, such as non-finite clauses, so that the complexity of linguistic features can be measured.
2. Since studies of pausing in native-speaker speech have shown that pauses often occur at syntactic unit boundaries, especially at clausal boundaries (Raupach 1980; Garman 1990), the AS-unit was selected as the most appropriate unit for calculating filled pauses.
3. The T-unit is 'the shortest unit into which a piece of discourse can be cut without leaving any sentence fragments as residue' (Hunt 1970:189). The T-unit enables us to include in the analysis all acts, some of which can be coordinate clauses or fragments of clauses. This is beyond the scope of the AS-unit, which regards these structures as separate units.
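To make the measures in Table 4 concrete, the sketch below (Python) computes several of them for a single hypothetical 30-second chunk. The segmentation into AS-units and T-units, and the error and move counts, are taken as given, since in the study they were produced by hand-coding the transcripts.

def chunk_measures(chunk):
    tokens = chunk["transcript"].split()
    filled_pauses = sum(1 for t in tokens if t == "er")  # 'er' marks a filled pause
    words = len(tokens) - filled_pauses                  # filled pauses excluded from word count
    return {
        "filled_pauses_per_as_unit": filled_pauses / chunk["as_units"],
        "words_per_second": words / chunk["seconds"],
        "sv_agreement_errors_per_word": chunk["sv_errors"] / words,
        "clauses_per_as_unit": chunk["clauses"] / chunk["as_units"],
        "expanding_per_t_unit": chunk["expanding_moves"] / chunk["t_units"],
    }

chunk = {  # one hypothetical post-deviation chunk, hand-coded
    "transcript": "er i think er the city is very beautiful and i like living there",
    "seconds": 30, "as_units": 2, "t_units": 2, "clauses": 3,
    "sv_errors": 0, "expanding_moves": 1,
}
print(chunk_measures(chunk))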


6 RESULTS

6.1 Overall

The results are presented in relation to the three research questions posed in Section 1. We will look at the overall evidence of deviation and at any apparent impact of these deviations on test-taker language. In addition, we will look at the location of the deviations for evidence of systematicity which may point to inherent weaknesses in the interlocutor frame method. The overall results are presented so as to reflect the four deviation types identified above as the most common.
6.1.1 Paraphrasing

The results suggest that there is a very limited impact on fluency, while in the other areas there are mixed results. There appears to be a slight drop in plural/singular errors immediately following the deviation, though this is counteracted by a post-deviation increase in subject/verb agreement errors. It is in the area of complexity that the most obvious change occurs, with both the number of AS-units and the number of clauses per AS-unit appearing to drop significantly following the deviation. The discourse indicators also show a mixed reaction. The results are grouped together as Table 5.
Fluency        Filled pauses per T-unit           Words per second
               pre          post                  pre          post
Average        1.021        1.346                 1.77         1.67
Total          31.993       36.933                58.33        55.26

Accuracy       Plural/singular errors per word    Subject/verb agreement errors per word
               pre          post                  pre          post
Average        0.01         0.01                  0.02         0.03
Total          0.47         0.17                  0.64         0.92

Complexity     Clauses per AS-unit
               pre          post
Average        0.01         0.01
Total          0.47         0.17

Discourse      Expanding per T-unit     Elaborating per T-unit     Enhancing per T-unit
               pre          post        pre          post          pre          post
Average        0.43         0.31        0.16         0.22          0.23         0.17
Total          14.28        10.28       5.41         7.12          7.75         5.57

Table 5: The impact of paraphrasing questions on candidate language


6.1.2 Interrupting

In Table 6 we can see that there is quite a large reduction in filled pauses per T-unit, though there is little change in the number of words spoken per second. As with the paraphrasing analysis, there seems to be a drop in plural/singular errors immediately following the deviation, though this is again offset by a post-deviation increase in subject/verb agreement errors. The pattern found for complexity is not repeated here; instead it is seen to be much more inconsistent. The discourse indicators are the most consistent, with a slight drop in the post-deviation position, though this does not appear to be great enough to suggest a significant reaction.
Fluency        Filled pauses per T-unit           Words per second
               pre          post                  pre          post
Average        1.035        0.558                 1.832        1.857
Total          26.919       14.500                47.63        48.28

Accuracy       Plural/singular errors per word    Subject/verb agreement errors per word
               pre          post                  pre          post
Average        0.009        0.005                 0.008        0.016
Total          0.222        0.142                 0.207        0.428

Complexity     Clauses per AS-unit
               pre          post
Average        0.89         1.01
Total          23.05        26.13

Discourse      Expanding per T-unit     Elaborating per T-unit     Enhancing per T-unit
               pre          post        pre          post          pre          post
Average        0.356        0.340       0.118        0.058         0.147        0.125
Total          9.255        8.833       3.060        1.500         3.833        3.250

Table 6: The impact of interrupting questions on candidate language

6.1.3 Improvising

As far as the results for fluency are concerned (Table 7), there seems to be a significant reduction in the number of filled pauses following the deviation, though the corresponding reduction in the number of words spoken per second does not appear great. As for accuracy, there seems to be a very slight increase in the error measures over the two sections, though the numbers are probably too small to draw any definite conclusions. With complexity, the picture is once again mixed, while the discourse indicators also show little reaction apart from the amount of expanding carried out.


Fluency        Filled pauses per T-unit           Words per second
               pre          post                  pre          post
Average        0.666        0.373                 2.159        2.023
Total          11.328       6.333                 36.710       34.390

Accuracy       Plural/singular errors per word    Subject/verb agreement errors per word
               pre          post                  pre          post
Average        0.005        0.008                 0.012        0.026
Total          0.093        0.137                 0.212        0.449

Complexity     Clauses per AS-unit
               pre          post
Average        1.217        1.431
Total          20.692       24.333

Discourse      Expanding per T-unit     Elaborating per T-unit     Enhancing per T-unit
               pre          post        pre          post          pre          post
Average        0.340        0.152       0.156        0.153         0.198        0.229
Total          5.787        2.583       2.660        2.600         3.368        3.892

Table 7: The impact of improvising questions on candidate language

6.1.4 Commenting

In the results from the analysis of the language bordering the deviations identified as unscripted comments made by the examiners, we can see that there is a drop in the number of filled pauses, while there is little significant change in the number of words spoken per second (Table 8). The figures for accuracy are so small that there seems little point in attempting to make any meaningful comment on them, while for complexity there is quite a large increase in the number of clauses per AS-unit. Finally, the discourse indicators seem to show a systematic decrease right across the board.


Fluency        Filled pauses per T-unit           Words per second
               pre          post                  pre          post
Average        0.666        0.473                 2.137        2.353
Total          4.983        4.386                 19.230       21.180

Accuracy       Plural/singular errors per word    Subject/verb agreement errors per word
               pre          post                  pre          post
Average        0.000        0.002                 0.008        0.015
Total          0.000        0.017                 0.069        0.137

Complexity     Clauses per AS-unit
               pre          post
Average        0.609        0.816
Total          5.483        7.343

Discourse      Expanding per T-unit     Elaborating per T-unit     Enhancing per T-unit
               pre          post        pre          post          pre          post
Average        0.372        0.257       0.206        0.083         0.307        0.254
Total          3.345        2.317       1.852        0.750         2.760        2.283

Table 8: The impact of commenting on responses on candidate language

6.2 Impact on test-takers' language of each deviation type

If we then review these results in terms of each of the four language areas, we can see that of the four deviation types, paraphrasing seems to result in relatively little change to the language performance of the candidates, while the other deviation types appear to have a negative impact on fluency (see Table 9). However, the rate of speech does not appear to be affected to any great extent by the deviations. The negative direction of interrupting, improvising and commenting suggested by Table 9 could imply that examiners should avoid all of these, while the direction of the impact of paraphrasing suggests that examiners need not be so concerned about it, since it may even have a positive impact.
Fluency            Filled pauses per T-unit     Words per second
                   pre          post            pre          post
Paraphrasing       1.021        1.346           1.77         1.67
Interrupting       1.035        0.558           1.832        1.857
Improvising        0.666        0.373           2.159        2.023
Commenting         0.554        0.487           2.137        2.353

Table 9: The impact on fluency of each deviation type



In terms of the accuracy of the output, there does not appear to be any significant impact as a result of the deviations recorded here, though the numbers recorded may in any case be too small to make any meaningful difference (see Table 10).
Accuracy           Plural/singular errors per word     Subject/verb agreement errors per word
                   pre          post                   pre          post
Paraphrasing       0.01         0.01                   0.02         0.03
Interrupting       0.009        0.005                  0.008        0.016
Improvising        0.005        0.008                  0.012        0.026
Commenting         0.000        0.002                  0.008        0.015

Table 10: The impact on accuracy of each deviation type

The complexity of the language is affected in different ways (Table 11). If anything, there is a slight increase in the complexity of the language used following each of the deviations with the exception of paraphrasing.
Complexity         Clauses per AS-unit
                   pre          post
Paraphrasing       0.01         0.01
Interrupting       0.89         1.01
Improvising        1.217        1.431
Commenting         0.609        0.816

Table 11: The impact on complexity of each deviation type

Finally, we can see from Table 12 that the amount of expanding undertaken by candidates is systematically reduced following all four deviation types, though the picture for elaborating and enhancing is quite mixed.
Discourse          Expanding per T-unit     Elaborating per T-unit     Enhancing per T-unit
                   pre          post        pre          post          pre          post
Paraphrasing       0.43         0.31        0.16         0.22          0.23         0.17
Interrupting       0.356        0.340       0.118        0.058         0.147        0.125
Improvising        0.340        0.152       0.156        0.153         0.198        0.229
Commenting         0.372        0.257       0.206        0.083         0.307        0.254

Table 12: The impact on discourse of each deviation type


6.3 Location of deviations

The other aim of the research was to investigate where the deviations occur, in order to identify patterns in the situations or conditions under which deviations are likely to arise. Two kinds of deviation location were studied: deviations across the three test parts, and deviations within each test part.
6.3.1 Deviations by test part

Table 13 shows the numbers of occurrences of both the transcribed and the non-transcribed deviations (ie those where the amount of language on either side of the deviation was too small for meaningful inferences to be drawn from the analyses) in the tasks used in the three parts of the test. The non-transcribed deviations are added here to give a more complete picture of the amount of deviation from the IF that actually took place during these test events.
                              Paraphrased questions    Improvised questions    Comments after replies    Interrupting questions
                              P1    P2    P3           P1    P2    P3          P1    P2    P3             P1    P2    P3
Deviations analysed
for this study                4     0     29           8     0     9           2     4     4              14    0     12
Total number of deviations    4     0     43           10    0     18          -     -     -              19    0     15

Table 13: Number of deviations by test part
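The proportions discussed below can be read directly off Table 13. A small sketch (Python) makes the arithmetic explicit; cells whose totals could not be recovered from the source (the comments row) are omitted.

# (deviation type, part) -> (deviations analysed, total deviations), from Table 13
table_13 = {
    ("paraphrased", "P1"): (4, 4),
    ("paraphrased", "P3"): (29, 43),
    ("improvised", "P1"): (8, 10),
    ("improvised", "P3"): (9, 18),
    ("interrupting", "P1"): (14, 19),
    ("interrupting", "P3"): (12, 15),
}

for (dtype, part), (analysed, total) in table_13.items():
    share = 100.0 * analysed / total
    print(f"{dtype:12s} {part}: {analysed:2d}/{total:2d} analysable ({share:.0f}%)")

# eg paraphrased P3: 29/43, so roughly one in three Part 3 paraphrases failed
# to elicit a turn of at least 30 seconds.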

There are a number of clear tendencies implied by Table 13.

Interrupting questions are spread more or less evenly across Part 1 and Part 3. This is possibly due to the two-way nature of these parts, both of which involve questions and answers. When the test-taker gives a longer reply than necessary from the point of view of the examiner, the examiner may ask the next question to stop the candidate's reply to the previous question in the middle of a sentence or even a word. The table also suggests that about 30% of interrupting questions do not result in an extended turn (of at least 30 seconds) from the candidate. This may be because the questions are rhetorical (and do not require a response); or they may be yes/no questions or questions that elicit only very short responses; or it may be that the questions are either not clearly heard or not understood by the candidates (and are either ignored or poorly answered). Since these possibilities can have potentially different impacts on candidate performance, it is clear that this aspect of examiner behaviour deserves more detailed examination.

There are more improvised questions in Part 3 than in Part 1, though the discourse patterns are the same. It is possible that the improvised questions in Part 3 result from the more abstract nature of the questions, and this is most likely related to the way Part 3 is designed from the examiner's perspective (see the discussion above). However, the conditions under which examiners tend to ask questions which are not in the Frame but are spontaneously raised in response to information given by test-takers can only be disclosed by examining the location of deviations within tasks. We can also see that in only half of the instances was there enough language resulting from the improvised question to merit inclusion in this study. This implies that this question type did not tend to result in the elicitation of a meaningful response (in terms of length of utterance) and as such may not always impact on candidate performance, though any lack of response may result in a lowering of the examiner's opinion of the proficiency level of the candidate. Again, more detailed study of this phenomenon is required.

The only type of deviation observed in Part 2 (the individual long turn) was where the examiners made comments following the candidate responses. This is not really surprising when we consider that the nature of the task reduces the potential for paraphrasing and improvising questions. Also, since the candidates are told before they start the task that they will be stopped when time is up, interruptions are not expected to occur.

Comments after test-takers' replies seem to occur most often in the individual long turn task, if we bear in mind that in this part of the test examiners are only required to ask one or two rounding-off questions. Where and when these commenting deviations happen is certainly an interesting revelation, which will be discussed in the next part of this study.

91% of the paraphrased questions occurred in Part 3, the two-way discussion task, where examiners invite the candidates to discuss the abstract aspects of the topic linked to Part 2 using unscripted questions. There is a suggestion here that in this part of the test the test-takers may have more difficulty answering the questions. Because of this, the examiners offered (based on their assessment of the candidates' levels of proficiency and ability to answer abstract questions) to rephrase or explain the questions, in most cases without a request from the examinees. The nature of the questions seems to be the cause, as there are far fewer paraphrased questions in Part 1, where the purpose of the questions is to access factual information. When we compare the overall number of paraphrased questions with those analysed here, we can see that there is no difference for Part 1, suggesting that the paraphrasing was successful in that it always resulted in a long response (at least 30 seconds). The picture in Part 3 is different; here one in three of the paraphrased questions failed to elicit a long enough turn to be included in this analysis. This suggests that the paraphrases failed to enlighten the candidates, perhaps not surprisingly, since the concepts in Part 3 tend to be more abstract, and therefore more difficult to paraphrase, than in Part 1. The implication here is that examiner training, in this particular examination and in other tests in which this approach is used, should focus specifically on developing noticing, questioning and paraphrasing skills. It is also clear that this element of the test should be closely monitored in future administrations to ensure that candidate performances are not significantly affected by features of examiner behaviour that are not relevant to the skill being tested.
6.3.2 Details of the deviations

We will now examine each part of the test separately, in order to identify which of the scripted questions were most likely to lead to or result in deviations from the Interlocutor Frame. In Part 1 we can see that there is an even spread of deviations across the various questions (see Table 14). All of these questions are scripted for the examiner, who decides which ones to ask during the course of the test. It should be mentioned that there are more questions in the Frame than are listed in the table; they are not included here either because they were not asked by the examiners or because there were no deviations associated with them.


PART 1                       Paraphrased    Improvised    Comments         Interrupting    Total
                             questions      questions     after replies    questions       deviations
Introductory                 (not analysed, as this section is not assessed)
Place of origin              0              0             0                3               3
Work/study                   0              0             1                2               3
Accommodation in UK          0              0             0                1               1
Everyday habits              0              1             0                0               1
Likes and personality        0              1             0                1               2
Favourite clothing           0              1             0                1               2
Language & other learning    0              1             0                0               1
Mode of learning             0              1             1                0               2
Cooking                      0              0             0                1               1
New experiences              0              0             0                1               1
Museums & galleries          1              0             0                1               2
Most loved festivals         1              0             0                2               3
Festival games               1              0             0                0               1
Festival general             0              0             0                1               1
Sports                       0              1             0                0               1
Sporting addictions          0              1             0                0               1
Most loved sports            1              1             0                0               2
Total                        4              8             2                14              28

Table 14: Spread of deviations in Part 1

There are a number of observations that can be made at this juncture:
1. One examiner was responsible for five of the interrupting questions, suggesting that this is more of a test monitoring issue than a training issue (if it were a training issue we would expect to find a greater spread of occurrences).
2. The majority of the interrupting questions served to bring a candidate turn to an end, and as such do not appear to impact on candidate performance on the task.
3. We might need to think further about improvised questions. These are unscripted, and represent a real threat to the integrity of the test. It may well be that this type of question can be eliminated to a great extent by training and by the inclusion of a statement on the Frame specifically referring to the problem.
4. There does not appear to be a systematic pattern of deviation in relation to specific questions or question types (direct or slightly more abstract).


PART 2

Question                 Paraphrased   Improvised   Comments        Interrupting   Total
                         Questions     Questions    after Replies   Questions      Deviations
Instructions             0             0            0               0              0
During long turn         0             0            0               0              0
Anyone with job?         0             0            2               0              2
Will you have the job?   0             0            2               0              2
Total                    0             0            4               0              4

Table 15: Spread of deviations in Part 2

Table 15 shows that in Part 2, the Individual long turn, the examiners stayed very close to the Frame, both during the introductory section of the task (when they were giving instructions) and while the candidate was involved in the long turn itself. There were four commenting responses by the examiners out of a total of 10 analysed for Part 2. A further probing of the data shows that they all happened when the examiners were rounding off this part by asking one or two questions. It also seems that at this point they tend to make comments about the candidates' answers to the questions, thus giving more acknowledgement and/or acceptance than required by the IF. This is an interesting finding, in that it suggests that examiners sense some need to backchannel; although the original purpose of the rounding-off questions appears to have been to help examiners form a bridge from Part 2 to Part 3, they still seem to need to say something else. This is yet another area in which further exploration is likely to significantly add to our understanding of the Speaking Test event in general and examiner behaviour in particular.

In Part 3 (Table 16) we can see that the stable patterns observed in the first two parts are not repeated. Instead, there are a far greater number of deviations from the IF, though this is not unexpected, as examiners are offered a choice of prompts from which to select and fashion their questions, depending on how the interaction evolves, and are likely to make unscripted contributions in this final part of the test. As we have seen above, Parts 1 and 3 are somewhat similar in design, with both designed to result in interactive communication. We would therefore expect to see similar patterns of behaviour from the examiners in the two parts. In fact, the patterns are strikingly similar in most areas: there are similar levels of occurrence of improvised questions, comments and interruptions. However, it is clear that there are far more instances of paraphrasing in this last part than in any of the others (in fact there are almost as many paraphrased questions in Part 3 as there are deviations in total for the other two parts). This may well be due to the less rigid nature of this final part, with the examiner offered a broad range of prompts to choose from when continuing the interaction, but is more likely due to the nature of the questions asked.

Even if we take a less rigid view of paraphrasing (where scripted questions are asked using alternative wording or emphasis) and view this final part as being more loosely controlled, there is an issue with the degree of variation here. Examiners must regularly make real-time decisions as to the value or relevance of questions. The fact that they are likely to make changes to the alternatives offered in this part of the test implies that they may not be totally comfortable with those alternatives, at least in terms of language.


PART 3

Question                                 Paraphrased   Improvised   Comments        Interrupting   Total
                                         Questions     Questions    after Replies   Questions      Deviations
Factors for choice of career             3             2            1               3              9
Different factors for men/women          1             1            0               0              2
More important factors                   5             2            0               3              10
Career structure important?              7             1            1               0              9
(…) of job for life and change of jobs   2             1            2               2              7
Future working patterns?                 6             0            0               1              7
Being a boss (…)                         1             1            0               2              4
Qualities of a good employer?            4             0            0               1              5
Future boss/employee relationship?       0             1            0               0              1
Total                                    29            9            4               12             45

Table 16: Spread of deviations in Part 3

We can see from Table 16 that some of the prompts appear to be more likely to result in paraphrasing than others (though the number of times each question was asked varied); it is possible that these prompts place a greater demand on the resources of the candidate in terms of background knowledge and understanding or awareness of European/Western working habits. The inability of candidates to respond to the questions may well account for the greater resort to paraphrasing seen in this part of the test. As with the other findings here, this raises as many questions as it answers, particularly in relation to examiner decision making and to the impact on the overall score awarded of these deviations appearing so late in the test event.

7 CONCLUSIONS

In this study, we set out to explore the way in which IELTS examiners deviated from the relatively new Interlocutor Frame in the revised IELTS Speaking Test introduced in July 2001. We were interested to identify the nature and location of any deviations and to establish evidence of their impact on the language of the candidates who participated in the test events.

Our analyses appear to show that the first two parts of the Speaking Test are quite stable in terms of deviations, with relatively few noted; where these were found they were either associated with a single examiner or were unsystematically spread across the tasks. It was also clear that the examiners seemed to adhere very closely to the IF, and that the deviations that did occur came at natural interactional boundaries, such as at the end of medium or long turns from candidates. The impact of these deviations on the language of the candidates was essentially negligible in practical terms.

In the final part of the Test, there appears to have been a somewhat different pattern of behaviour, particularly in relation to the number of paraphrased questions used by the examiners. While Part 3 mirrors the other interactive task in terms of the number of improvised questions, comments on candidate responses and interrupting questions, there are seven times more paraphrased questions in the final task. The reason for this difference appears to be the alternative format of the task, which offers the examiner greater flexibility than in Parts 1 or 2: while the candidate was basically asked information-based questions in the first part (typically of a personal nature), in the final part the questions asked the candidate to conjecture, offer opinions and reflect on often abstract topics. The other possible explanation is that the question types may have been beyond the typical candidate in terms of cognitive load or of their cultural or background knowledge. Whatever the cause of the deviations, the impact on candidate language appears to have been minimal, though it remains unclear if there was any impact on the final score awarded to candidates.

The use of an Interlocutor Frame is based on the rationale that without a scripted guide, examiners are likely to treat each test event as unique and that candidates risk being unfairly advantaged or disadvantaged as a result. Anecdotal evidence from some stakeholders, principally teachers and examiners, suggests that there is some concern that very tight Interlocutor Frames might cause examiners to become too stilted and unnatural in their language during a test event and that this has a negative impact on the face validity of the test. Test developers therefore have to balance the need to standardise the test event as much as possible (to ensure that all test-takers are examined under the same conditions and that an appropriate sample of language is elicited) against the need to give examiners some degree of flexibility so that they (and the more directly affected stakeholders) feel that the language of the event is natural and free flowing.

The results of our analyses suggest that examiners in the revised IELTS Speaking Test essentially adhere to the Interlocutor Frame they are given. The absence of systematicity in the location of deviations implies that the Frames are working as the test developers intended, and that there are no obvious points in the test at which deviation is likely to occur, particularly for the first two tasks. There is some slight cause for concern with the final part. It may well be that it is not possible to create a Frame that can adequately cope with the requirements of less controlled interaction, though the evidence from this study suggests that the extensive paraphrasing that resulted in the less controlled final section did not seriously impact on candidate performance; indeed, if anything it resulted in slightly improved performance. However, the evidence from this study implies that greater care with the creation of question options may result in a more successful implementation of the Frame.

The most relevant implication of the findings of this study is that it may be possible to allow for some flexibility in the Interlocutor Frame, though this flexibility might be best confined to allowing for examiner paraphrasing of questions. That this might be achieved without negatively impacting on the language of the candidate is of particular interest.

ACKNOWLEDGEMENT

The authors would like to acknowledge the valuable input provided by Dr Lynda Taylor in preparing the report of the study that appears here.


REFERENCES

Bachman, LF, 1988, 'Problems in examining the validity of the ACTFL oral proficiency interview', Studies in Second Language Acquisition, vol 10, pp 149-64

Bachman, LF, 1990, Fundamental considerations in language testing, Oxford University Press, Oxford

Brooks, L, 2002, Report on functions observed in the old IELTS Speaking Test versus those in the revised Speaking Test, internal Cambridge ESOL report, Cambridge

Brooks, L, 2003, 'Converting an observation checklist for use with the IELTS Speaking Test', Research Notes, issue 11, University of Cambridge ESOL Examinations, Cambridge, pp 20-21

Brown, A, 1995, 'The effect of rater variables in the development of an occupation-specific language performance test', Language Testing, vol 12, pp 1-15

Brown, A and Hill, K, 1998, 'Interviewer style and candidate performance in the IELTS oral interview', IELTS Research Reports, vol 1, IELTS Australia, Canberra, pp 1-19

Brown, A, 2003, 'Interviewer variation and the co-construction of speaking proficiency', Language Testing, vol 20, pp 1-25

Brown, A and Lumley, T, 1997, 'Interviewer variability in specific-purpose language performance tests' in Current Developments and Alternatives in Language Assessment, eds A Huhta, V Kohonen, L Kurki-Suonio and S Luoma, University of Jyväskylä and University of Tampere, Jyväskylä, pp 137-150

Cambridge ESOL, 2001, IELTS Speaking Test: FAQs and feedback, Cambridge ESOL, Cambridge

Chalhoub-Deville, M, 1995, 'A contextualized approach to describing oral language proficiency', Language Learning, vol 45, pp 251-281

Foster, P, Tonkyn, A and Wigglesworth, G, 2000, 'Measuring spoken language: a unit for all reasons', Applied Linguistics, vol 21, pp 354-375

Garman, M, 1990, Psycholinguistics, Cambridge University Press, Cambridge

Halleck, G, 1996, 'Interrater reliability of the OPI: using academic trainee raters', Foreign Language Annals, vol 29, pp 223-238

Hasselgren, A, 1997, 'Oral test subskill scores: what they tell us about raters and pupils' in Current Developments and Alternatives in Language Assessment, eds A Huhta, V Kohonen, L Kurki-Suonio and S Luoma, University of Jyväskylä and University of Tampere, Jyväskylä, pp 241-256

Hunt, K, 1970, 'Syntactic maturity in school-children and adults', Monographs of the Society for Research in Child Development

Lazaraton, A, 1992, 'The structural organisation of a language interview: a conversational analytic perspective', System, vol 20, pp 373-386

Lazaraton, A, 1996a, 'Interlocutor support in oral proficiency interviews: the case of CASE', Language Testing, vol 13, pp 151-172


Lazaraton, A, 1996b, 'A qualitative approach to monitoring examiner conduct in the Cambridge Assessment of Spoken English (CASE)' in Performance Testing, Cognition and Assessment: Selected Papers from the 15th Language Testing Research Colloquium, Cambridge and Arnhem, eds M Milanovic and N Saville, UCLES/Cambridge University Press, Cambridge, pp 18-33

Lazaraton, A, 2002, A qualitative approach to the validation of oral language tests, Cambridge University Press, Cambridge

Lumley, T, 1998, 'Perceptions of language-trained raters and occupational experts in a test of occupational English language proficiency', English for Specific Purposes, vol 17, pp 347-367

Lumley, T and O'Sullivan, B, 2000, 'The effect of speaker and topic variables on task performance in a tape-mediated assessment of speaking', paper presented at the 2nd Annual Asian Language Assessment Research Forum, The Hong Kong Polytechnic University, January 2000

McNamara, T, 1996, Measuring second language performance, Addison Wesley Longman, Harlow

O'Sullivan, B, 2000, Towards a model of performance in oral language testing, unpublished PhD dissertation, The University of Reading

O'Sullivan, B and Saville, N, 2000, 'Developing observation checklists for speaking tests', Research Notes, vol 3, pp 6-10

O'Sullivan, B, Weir, C and Saville, N, 2002, 'Using observation checklists to validate speaking-test tasks', Language Testing, vol 19, pp 33-56

Raupach, M, 1980, 'Temporal variables in first and second language production' in Temporal Variables in Speech: Studies in Honour of Frieda Goldman-Eisler, eds HW Dechert and M Raupach, Mouton, The Hague

Ross, S, 1992, 'Accommodative questions in oral proficiency interviews', Language Testing, vol 9, pp 173-186

Ross, S and Berwick, R, 1992, 'The discourse of accommodation in oral proficiency interviews', Studies in Second Language Acquisition, vol 14, pp 159-176

Saville, N and Hargreaves, P, 1999, 'Assessing speaking in the revised FCE', ELT Journal, vol 53, pp 42-51

Shohamy, E, 1983, 'The stability of oral proficiency assessment on the oral interview testing procedures', Language Learning, vol 33, pp 527-40

Stansfield, CW, 1991, 'A comparative analysis of simulated oral proficiency interviews' in Current Developments in Language Testing, ed S Anivan, SEAMEO Regional Language Centre, Singapore, pp 199-209

Stansfield, CW and Kenyon, DM, 1992, 'Research on the comparability of the oral proficiency interview and the simulated oral proficiency interview', System, vol 20, pp 347-64

Taylor, L, in press, 'Introduction' in IELTS Collected Papers: Research in Speaking and Writing Assessment, Studies in Language Testing, volume 19, eds L Taylor and P Falvey, Cambridge ESOL/Cambridge University Press, Cambridge

Thompson, I, 1995, 'A study of interrater reliability of the ACTFL oral proficiency interview in five European languages: data from ESL, French, German, Russian, and Spanish', Foreign Language Annals, vol 28, pp 407-422


Upshur, JA and Turner, C, 1999, 'Systematic effects in the rating of second-language speaking ability: test method and learner discourse', Language Testing, vol 16, pp 82-111

Van Lier, L, 1989, 'Reeling, writhing, drawling, stretching and fainting in coils: oral proficiency interviews as conversations', TESOL Quarterly, vol 23, pp 480-508

Weir, C, 2005, Language testing and validation: an evidence-based approach, Palgrave, Oxford

Wilds, C, 1975, 'The oral interview test' in Testing Language Proficiency, eds RL Jones and B Spolsky, Center for Applied Linguistics, Arlington, VA, pp 29-44

Young, R and Milanovic, M, 1992, 'Discourse variation in oral proficiency interviews', Studies in Second Language Acquisition, vol 14, pp 403-424


APPENDIX 1: PROFILES OF THE TEST-TAKERS INCLUDED IN THE STUDY


Cand. No.   Gender   Score (speaking)   Nationality   L1        Examiner
1188        M        6                  UAE           Arabic    9
0214        M        7                  Jordan        Arabic    23
0105        F        6                  UAE           Arabic    28
0397        M        7                  Iraq          Arabic    22
0385        M        6                  UAE           Arabic    22
0801        M        6                  Oman          Arabic    12
0803        F        9                  Oman          Arabic    48
0810        M        8                  Oman          Arabic    48
0890        M        6                  Oman          Arabic    53
0971        F        4                  Oman          Arabic    50
0190        M        6                  Bangladesh    Bengali   1
0403        M        6                  Bangladesh    Bengali   22
0386        F        8                  Bangladesh    Bengali   38
0931        M        5                  China         Chinese   26
1089        M        6                  China         Chinese   41
1119        M        5                  China         Chinese   35
1383        F        6                  China         Chinese   43
1427        M        5                  China         Chinese   34
1436        F        6                  China         Chinese   41
1487        F        4                  Taiwan        Chinese   27
0437        F        6                  China         Chinese   40
0466        F        4                  China         Chinese   31
0478        M        6                  China         Chinese   40
0439        M        5                  China         Chinese   20
0515        M        7                  China         Chinese   21
0549        M        6                  China         Chinese   17
0702        M        6                  China         Chinese   24
0717        M        5                  China         Chinese   15
0727        M        5                  China         Chinese   51
0752        F        5                  China         Chinese   29
0168        M        6                  China         Chinese   36
1396        M        6                  Iran          Farsi     41


Cand. No.   Gender   Score (speaking)   Nationality   L1           Examiner
0767        M        7                  Switzerland   German       18
3526        M        9                  India         Hindi        37
3527        M        6                  India         Hindi        37
5372        F        8                  India         Hindi        39
5375        M        7                  India         Hindi        39
6060        M        7                  India         Hindi        11
0941        M        8                  Japan         Japanese     32
1015        F        5                  Japan         Japanese     6
0078        F        6                  Japan         Japanese     45
0466        F        4                  S Korea       Korean       30
1002        M        8                  Malawi        Other        44
5371        M        6                  India         Other        39
1423        F        7                  Brazil        Portuguese   9
1494        M        7                  Portugal      Portuguese   34
3880        M        8                  India         Punjabi      33
4292        M        6                  India         Punjabi      3
5415        M        6                  India         Punjabi      4
1235        M        8                  Pakistan      Pushtu       49
1236        F        7                  Colombia      Spanish      32
0354        F        8                  Mexico        Spanish      31
0996        M        8                  Sweden        Swedish      9
0381        F        9                  Sweden        Swedish      31
0128        M        8                  Sweden        Swedish      10
0137        M        8                  Sweden        Swedish      13
0152        F        7                  Sweden        Swedish      14
6351        F        7                  India         Telugu       25
0229        M        7                  Pakistan      Urdu         8
0420        M        6                  Pakistan      Urdu         52
0371        F        8                  Pakistan      Urdu         42
0449        M        5                  Pakistan      Urdu         42


5. Exploring difficulty in Speaking tasks: an intra-task perspective


Authors
Cyril Weir, University of Bedfordshire, UK
Barry O'Sullivan, Roehampton University, UK
Tomoko Horai, Roehampton University, UK

Grant awarded: Round 9, 2003

This study looks at how the difficulty of a speaking task is affected by changes to the time offered for planning, the length of response expected and the amount of scaffolding provided (eg suggestions for content).

ABSTRACT

The oral presentation task has become an established format in high stakes oral testing as examining boards have come to routinely employ it in spoken language tests. This study explores how the difficulty of the Part 2 task (Individual Long Turn) in the IELTS Speaking Test can be manipulated using a framework based on the work of Skehan (1998), while working within the socio-cognitive perspective of test validation. The identification of a set of four equivalent tasks was undertaken in three phases. One of these tasks was left unaltered; the other three were manipulated along three variables: planning time, response time and scaffolded support. In the final phase of the study, 74 language students, at a range of ability levels, performed all four versions of the tasks and completed a brief cognitive processing questionnaire after each performance. The resulting audio files were then rated by two IELTS-trained examiners working independently of each other using the current IELTS Speaking criteria. The questionnaire data were analysed in order to establish any differences in cognitive processing when performing the different task versions. Results from the score data suggest that while the original un-manipulated version tends to result in the highest scores, there are significant differences to be found in the responses of three ability groups to the four tasks, indicating that task difficulty may well be affected differently for test candidates of different ability. These differences were reflected in the findings from the questionnaire analysis. The implications of these findings for teachers, test developers, test validators and researchers are discussed.


AUTHOR BIODATA

CYRIL WEIR
Cyril Weir has a PhD in language testing and has published widely in the fields of testing and evaluation. He is the author of Communicative Language Testing, Understanding and Developing Language Tests and Language Testing and Validation: an evidence-based approach. He is the co-author of Evaluation in ELT, An Empirical Investigation of the Componentiality of L2 Reading in English for Academic Purposes, Empirical Bases for Construct Validation: the College English Test, a case study, and Reading in a Second Language, and co-editor of Continuity and Innovation: Revising the Cambridge Proficiency in English Examination 1913-2002. Cyril Weir has taught short courses, lectured and carried out consultancies in language testing, evaluation and curriculum renewal in over 50 countries worldwide. With Mike Milanovic of UCLES he is the series editor of the Studies in Language Testing series published by CUP, and he is on the editorial board of Language Assessment Quarterly and Reading in a Foreign Language. Cyril Weir is currently Powdrill Professor in English Language Acquisition at the University of Bedfordshire, where he is also the Director of the Centre for Research in English Language Learning and Assessment (CRELLA), which was set up on his arrival in 2005.

BARRY O'SULLIVAN
Barry O'Sullivan has a PhD in language testing, and is particularly interested in issues related to performance testing, test validation and test-data management and analysis. He has lectured for many years on various aspects of language testing, and is currently Director of the Centre for Language Assessment Research (CLARe) at Roehampton University, London. Barry's publications have appeared in a number of international journals and he has presented his work at international conferences around the world. His book Issues in Business English Testing: the BEC Revision Project was published in 2006 by Cambridge University Press in the Studies in Language Testing series, and his next book is due to appear later this year. Barry is very active in language testing around the world and currently works with government ministries, universities and test developers in Europe, Asia, the Middle East and Central America. In addition to his work in the area of language testing, Barry taught in Ireland, England, Peru and Japan before taking up his current post.

TOMOKO HORAI
Tomoko Horai is a PhD student at Roehampton University, UK. She has an MA in Applied Linguistics and an MA in English Language Teaching, in addition to an MEd in TESOL/Applied Linguistics. She also has a number of years of teaching experience in a secondary school in Tokyo. Her current research interests are intra-task comparison and task difficulty in the testing of speaking. Her work has been presented at a number of international conferences including the Language Testing Research Colloquium 2006, the British Association for Applied Linguistics (BAAL) 2006, the International Association of Teachers of English as a Foreign Language (IATEFL) 2005 and 2006, the Language Testing Forum 2005, and the Japan Association for Language Teaching (JALT) 2004 and 2005.


CONTENTS
1 Introduction
2 The oral presentation
3 Task difficulty
4 The study
  4.1 Aims of the study
  4.2 Methodology
    4.2.1 Quantitative analysis
    4.2.2 Qualitative analysis
5 Results
  5.1 Rater agreement
  5.2 Score data analysis
  5.3 Questionnaire data analysis (from the perspective of the task)
6 Conclusions
  6.1 Implications
    6.1.1 Teachers
    6.1.2 Test developers
    6.1.3 Test validators
    6.1.4 Researchers
References
Appendix 1: Task difficulty checklist
Appendix 2: Readability statistics for 9 tasks
Appendix 3: The original set of tasks
Appendix 4: The final set of tasks
Appendix 5: SPSS one-way ANOVA output
Appendix 6: Questionnaire about Task 1
Appendix 7: Questionnaire (unchanged and reduced time versions)
Appendix 8: Questionnaire (no planning version)
Appendix 9: Questionnaire (unscaffolded version)


1 INTRODUCTION

In recent years, a number of studies have looked at variability in performance on spoken tasks from the perspective of language testing. Empirical evidence has been found to suggest significant effects resulting from test-taker-related variables (Berry 1994, 2004; Kunnan 1995; Purpura 1998), interlocutor-related variables (O'Sullivan 1995, 2000a, 2000b; Porter 1991; Porter & Shen 1991) and rater- and examiner-related variables (Brown 1995, 1998; Brown & Lumley 1997; Chalhoub-Deville 1995; Halleck 1996; Hasselgren 1997; Lazaraton 1996a, 1996b; Lumley 1998; Lumley & O'Sullivan 2000, 2001; Ross 1992; Ross & Berwick 1992; Thompson 1995; Upshur & Turner 1999; Young & Milanovic 1992).

Skehan and Foster (1997) have suggested that foreign language performance is affected by task processing conditions (see also Ortega 1999; Shohamy 1983; Skehan 1998). They have attempted to manipulate processing conditions in order to modify or predict difficulty. In line with this, Skehan (1998) and Norris et al (1998) have made serious attempts to identify task qualities which impinge upon task difficulty in spoken language. They proposed that difficulty is a function of code complexity, cognitive complexity and communicative demand. A number of empirical findings have revealed that task difficulty has an effect on performance, as measured in the three areas of accuracy, fluency and complexity (Skehan 1998; Mehnert 1998; Wigglesworth 1997; Skehan & Foster 1997, 1999; Ortega 1999; O'Sullivan, Weir & ffrench 2001).

2 THE ORAL PRESENTATION

Oral presentation is advocated as a valuable elicitation task for assessing speaking ability by a number of prominent authorities in the field (Clark & Swinton 1979; Bygate 1987; Underhill 1987; Weir 1993, 2005; Hughes 1989, 2003; Butler et al 2000; Fulcher 2003; Luoma 2004). Its practical advantages are obvious, not least that it can be delivered in a variety of modes. The telling advantage of this method is that one speaker produces a long turn alone, without interacting with other speakers. As such, it does not suffer from the contaminating effect of the co-construction of discourse in interactive tasks, where one participant's performance will affect the other's, and so is also more suitable for the investigation of intra-task variation, the subject of this study (Iwashita 1997; Luoma 2004; McNamara 1996; Ross & Berwick 1992; Weir 1993, 2005).

Over the past three decades, oral presentation tasks (also known as individual long turn or monologic tasks) have become an established format in high stakes oral testing as examining boards have come to routinely employ them in spoken language tests. The Test of Spoken English (TSE) from Educational Testing Service (ETS) in the USA, the International English Language Testing System (IELTS), the Cambridge ESOL Main Suite examinations, and the College English Test in China (the world's biggest EFL examination) all include an oral presentation task in their tests of speaking. In ETS's TOEFL Academic Speaking Test (TAST) only monologues are used. In the context of the New Generation TOEFL speaking component, Butler et al (2000) advocate testing extended discourse, arguing that this is most relevant to the academic use of language at the university level. Earlier, Clark and Swinton (1979) found that the picture sequence task was one of the most effective techniques in experimental tests which investigated suitable techniques for a speaking component for TOEFL.

Given its importance, it is surprising that over the last 20 years no research articles dedicated to oral presentation speaking tasks per se can be found in the most prominent journal in the field, Language Testing. Similarly, there has been little published research on the long turn elsewhere, even in the non-language testing literature (see Abdul Raof 2002). Certainly, very little empirical investigation has been conducted to find out what contributes to the degree of task difficulty within oral presentation tasks in a speaking test, even though such tasks play an important function in high stakes tests around the world.

3 TASK DIFFICULTY

In recent years, a number of studies have looked at variability in spoken performance from the perspective of task difficulty in language testing. Empirical evidence has been found to suggest significant effects resulting from how interlocutor-related variables impact on difficulty in interaction-based tasks (Porter 1991; Porter & Shen 1991; O'Sullivan 2000a, 2000b, 2002; Berry 1997, 2004; Buckingham 1997; Iwashita 1997).

In terms of the study of test task related variables, a number of studies concerning inter-task comparison have been undertaken. These have adopted both quantitative perspectives (Chalhoub-Deville 1995; Fulcher 1996; Henning 1983; Lumley & O'Sullivan 2000, 2001; O'Loughlin 1995; Norris et al 1998; Robinson 1995; Shohamy 1983; Shohamy, Reves & Bejarano 1986; Skehan 1996; Stansfield & Kenyon 1992; Upshur and Turner 1999; Wigglesworth & O'Loughlin 1993) and qualitative perspectives (Bygate 1999; Kormos 1999; O'Sullivan, Weir & Saville 2002; Shohamy 1994; Young 1995). These studies were conducted to investigate the impact on scores awarded for speakers' performances across the different tasks. O'Sullivan and Weir (2002) report that on the whole, the results of these investigations are mixed, perhaps in part due to the crude nature of such investigations, where many variables are uncontrolled, and tasks and test populations tend to vary with each study.

There is less research available on intra-task comparison, where internal aspects of one task are systematically manipulated. This is perhaps surprising, as this type of study enables the researcher to more closely control and manipulate the variables involved. Skehan and Foster (1997) suggest that foreign language performance is affected by task processing conditions. They propose that difficulty is a function of code complexity, cognitive complexity and communicative stress. This view is largely supported by the literature (see, for example, Foster & Skehan 1996, 1999; Mehnert 1998; Ortega 1999; Skehan 1996, 1998; Skehan and Foster 2001; Wigglesworth 1997; Brown & Yule 1983; Crookes 1989). The most likely sources of intra-task variability appear to lie in the three broad areas outlined by Skehan and Foster (1997) mentioned above and appear to be most clearly observed when the following specific performance conditions are manipulated:
1. Planning time
2. Planning condition
3. Audience
4. Type and amount of input
5. Response time
6. Topic familiarity

Empirical findings have revealed that intra-task variation in terms of these conditions has an effect on performance as measured in the four areas of accuracy, fluency, complexity and lexical range (Ellis 1987; Crookes 1989; Williams 1992; Skehan 1996; Mehnert 1998; Wigglesworth 1997; Foster & Skehan 1996; Skehan & Foster 1997, 1999; Ortega 1999; O'Sullivan, Weir & ffrench 2001).

Weir (2005) argues that it is critical that examination boards are able to furnish validity evidence on their tests and that this should include research-based evidence on intra-task variation, ie how the conditions under which a single task is performed affect candidate performance. Research into intra-task variation is critical for high stakes tests because if we are able to manipulate the difficulty level of tasks we can create parallel forms of tasks at the same level and offer a principled way of establishing versions of tasks across the ability range (elementary to advanced). This is clearly of relevance to examination bodies that offer a suite of examinations, as is the case with Cambridge ESOL.

4 THE STUDY

This study is primarily designed to explore how the difficulty of the IELTS Speaking paper Part 2 task (Individual Long Turn) can be deliberately manipulated using a framework based on the work of Skehan (1998), while working within the socio-cognitive perspective of test validation suggested by OSullivan (2000a) and discussed in detail by Weir (2005). In this research project, the conditions under which tasks are performed are treated as independent variables. We have omitted the variables type and amount of input and topic familiarity from our study as it was decided that it was necessary to limit the scope of the study. These were felt to be adequately controlled for in the task selection process (described in detail below) in which an analysis of the language and topic of each task was undertaken (by considering student responses from the pilot study questionnaire and from the responses of an expert panel who applied the difficulty checklist to all tasks). The variable audience was also controlled for by identifying the same audience for each task variant. The remaining variables are operationalised for the purpose of this study in the following way:
Variable             Unaltered                        Altered
Planning Time        1 minute                         No planning time
Planning Condition   Guided (3 scaffolding points)    No scaffolding
Response Time        2 minutes                        1 minute

Table 1: Task manipulation

The first of the three manipulations is in response to the findings of researchers such as Skehan and Foster (1997, 1999, 2001), Wigglesworth (1997) and Mehnert (1998), who suggest that there is a significant difference in performance where as little as one minute of planning is allowed. Since the findings have shown that this improvement is manifested in increased accuracy, we expect that the scores awarded by raters for this criterion will be most significantly affected.

The second area of manipulation is related to the suggestion (by Foster & Skehan, among others) that the nature of the planning can contribute to its effect. For that reason, students will be given an opportunity to engage in guided planning (by using the scaffolded points) or unguided planning (where these points are removed).

Finally, the notion of response time is addressed. Anecdotal evidence from examiners and researchers who have listened to recordings of timed responses suggests that test-takers (particularly at a low level of proficiency) tend to run out of things to say and either struggle to add to their performance, engage in repetition of points already made, or simply dry up. Any of these situations can lead to a lowering of the scores candidates are awarded by examiners. Since the original version of this task asks test-takers to respond for 1 to 2 minutes, it was felt to be important to investigate what the consequences of allowing this wide variation in performance time might be.


The hypotheses are formulated as follows:
1. Planning time will impact on task performance in terms of the test scores achieved by candidates.
2. Planning condition will impact on task performance in terms of the test scores achieved by candidates.
3. Response time will impact on task performance in terms of the test scores achieved by candidates.
4. Differences in performance in respect of the variables in hypotheses 1 to 3 will vary according to the level of proficiency of test-takers.
5. The manipulations to each task, as represented in hypotheses 1 to 3, will result in significant changes in the internal processing of the participants (ie the theory-based validity of the task will be affected by manipulating elements of the task setting or demands).

4.1 Aims of the study

To establish any differences in candidate linguistic behaviour, as reflected in test scores, arising from language elicitation tasks that have been manipulated along a number of socio-cognitive dimensions

Since all students complete a theory-based validity questionnaire on completion of each of the four tasks they perform (see Appendix 7), analysis of these responses will allow us to make statements regarding the second of our research questions:

To establish any differences in candidate behaviour (cognitive processing) arising from language elicitation tasks that have been manipulated along a number of socio-cognitive dimensions

4.2 Methodology

As mentioned above, this study employs a mixture of quantitative and qualitative methods as appropriate. The study is divided into a number of phases, described below.

Phase 1: In this phase, a number of retired IELTS oral presentation tasks were analysed by the researchers using a checklist based on Skehan (1996). This analysis led to the selection of a series of nine tasks from which it was hoped to identify at least four that were truly equivalent (see Appendix 1 for the checklist). Readability statistics were generated for each of the tasks (see Appendix 2) in order to ascertain that each task was similar in terms of level of input. In addition to these analyses, a qualitative perspective on the task topics was undertaken. The nine tasks are contained in Appendix 3.

Phase 2: A series of pilot administrations was conducted involving overseas university students at a UK institution. These students were on or above the language threshold level for entry into a UK university (ie approximately 6.5 on the IELTS overall band scale). The students were asked to perform a number of tasks and to report verbally to one of the researchers on their experience. From these pilot studies it was noted that the topics of two of the tasks (visiting a museum or art gallery, and entering a contest) were considered by many students to be outside their experience and as such too difficult to talk about for two minutes. For this reason, the former was changed to a sports event and the scaffolding or prompts rewritten, while the latter was dropped from the study. It was decided at this stage that the eight tasks that remained were suitable, and that these should form the basis of the next phase (these are in Appendix 4).

Phase 3: In this phase of the project, a formal trial of the eight selected tasks (A to H) was undertaken.


4.2.1 Quantitative analysis

A group of 54 students was asked to participate in the trial. Each student was asked to complete four tasks, and to fill in a short questionnaire immediately on completing each task. To ensure that an approximately equal number of students responded to each task, the following matrix was devised. This meant that each student was given at random a pack marked Version 1 to Version 8. These packs contained the rubric for each of the tasks in the pack as well as four questionnaires.
            First   Second   Third   Fourth
Version 1   A       B        C       D
Version 2   H       A        B       C
Version 3   G       H        A       B
Version 4   F       G        H       A
Version 5   E       F        G       H
Version 6   D       E        F       G
Version 7   C       D        E       F
Version 8   B       C        D       E

Table 2: Make-up of task batches for the trial
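The rotation underlying Table 2 is simple enough to express programmatically. The following Python sketch is purely illustrative (it is not part of the original study materials); it regenerates the eight packs from the task list:

    # Each version starts one task earlier in the A-H cycle than the previous
    # one: Version 1 starts at A, Version 2 at H, Version 3 at G, and so on.
    TASKS = ["A", "B", "C", "D", "E", "F", "G", "H"]

    def batch(version):
        start = (1 - version) % len(TASKS)
        return [TASKS[(start + i) % len(TASKS)] for i in range(4)]

    for v in range(1, 9):
        print("Version", v, batch(v))

Because every task appears in four consecutive packs, random assignment of packs spreads the 54 students roughly evenly over the eight tasks, as Table 3 confirms.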

The above design resulted in the following numbers of students responding to each task.
Task   Number of Students
A      27
B      26
C      27
D      28
E      26
F      26
G      26
H      26

Table 3: Number of students responding to each task

The students performed the tasks in a multimedia laboratory, speaking directly to a computer. Each students four responses were recorded and saved on the computer as a single file. These files were later edited to remove unwanted elements (such as long breaks following the end of a task performance or unwanted noise that occurred outside of the performance but was inadvertently recorded). The volume of each file was edited to ensure maximum audibility throughout. The performances of each student were then split up into the four constituent tasks and further edited (ie an indicator of student number and task was inserted at the beginning of the task and a bleep inserted to signal to the future rater that the task was now complete). The order of the files was randomised using a random numbers list generated using Microsoft Excel. Finally, eight CDs were created, each of which contained all of the performances for each task.
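The randomisation step was carried out in the study with a random numbers list generated in Microsoft Excel; a minimal Python equivalent of that step might look as follows (the file names here are hypothetical, used only to make the sketch self-contained):

    import random

    # Hypothetical file names: 54 students, each with four recorded performances.
    files = [f"student{s:02d}_perf{p}.mp3" for s in range(1, 55) for p in range(1, 5)]

    random.seed(2006)      # any fixed seed makes the shuffled order reproducible
    random.shuffle(files)  # random presentation order for the raters
    print(files[:5])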


These eight CDs were then duplicated and a set was given to each of two trained and experienced IELTS raters, who rated all tasks over a one-week period. The resulting score data were subjected to multi-faceted Rasch (MFR) analysis using the FACETS program (Linacre 2003) in order to identify a set of at least four tasks where any differences in difficulty could be shown to be statistically insignificant (for recent examples of this statistical procedure in the language testing literature see Lumley & O'Sullivan 2005; Bonk & Ockey 2004). The task measurement report from the FACETS output (Table 4) suggests that Task A is potentially significantly easier than the other seven. In addition, the infit mean square statistic (which indicates that all tasks are within the accepted range) suggests that all of the tasks are working in a predictable way.

Task   Fair-M Average   Measure   Model S.E.   Infit MnSq   Infit ZStd   Outfit MnSq   Outfit ZStd
1 A    5.86             -.71      .11          1.1           0           1.1            0
2 B    5.74             -.27      .11          1.1           0           1.1            1
3 C    5.69             -.11      .11          1.0           0           1.0            0
4 D    5.66             -.02      .11           .8          -2            .8           -2
5 E    5.63              .08      .12           .9          -1            .9           -1
6 F    5.51              .45      .12          1.2           1           1.1            1
7 G    5.56              .29      .11          1.0           0            .9            0
8 H    5.57              .28      .11          1.0           0           1.0            0

Table 4: Task measurement report (summary of FACETS output)
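As a rough illustration of how such a report is read, the sketch below flags any task whose infit mean square falls outside a 0.5 to 1.5 band. That acceptance band is a commonly used rule of thumb in Rasch quality control and is our assumption for the example, not a figure stated in the report:

    # Infit mean squares as read from Table 4.
    infit = {"A": 1.1, "B": 1.1, "C": 1.0, "D": 0.8,
             "E": 0.9, "F": 1.2, "G": 1.0, "H": 1.0}

    # 0.5-1.5 is an assumed acceptance band, used here for illustration only.
    misfits = [task for task, mnsq in infit.items() if not 0.5 <= mnsq <= 1.5]
    print(misfits if misfits else "all tasks within the accepted infit range")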


Follow-up analysis of the scores awarded by the raters indicates that this difference appears to be of statistical significance only in the case of Tasks G and H (see Appendix 5), which appear to be significantly easier than Tasks A and C. The boxplots generated from the SPSS output (Figure 1) suggest that there is a broader spread of scores for Tasks A and C, though in general the mean scores do not appear to be widely spread.

Figure 1: Boxplots comparing task means from SPSS output
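Boxplots of this kind can be reproduced from per-task score lists; the sketch below is purely illustrative and uses invented scores rather than the study data:

    import matplotlib.pyplot as plt

    # Hypothetical band scores per task; the study's data are not reproduced here.
    scores_by_task = {
        "A": [5.0, 5.5, 6.0, 6.5, 7.0],
        "C": [5.0, 5.5, 6.0, 6.0, 7.0],
        "G": [5.5, 5.5, 6.0, 6.0, 6.5],
        "H": [5.5, 6.0, 6.0, 6.0, 6.5],
    }
    plt.boxplot(scores_by_task.values(), labels=scores_by_task.keys())
    plt.ylabel("IELTS band score")
    plt.title("Score distribution by task")
    plt.show()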

The results of these analyses suggest that Tasks A, C, G and H should not be considered for inclusion in the main study, though all of the others are acceptable.
4.2.2 Qualitative analysis

In addition to the quantitative analysis described above, we analysed the responses of all students to a short questionnaire (see Appendix 6) about the students' perceptions of the tasks. For this phase of the study, we focused primarily on their responses to the items related to topic familiarity and the degree of abstractness of the tasks. The data from these questionnaires (each student completed a questionnaire for each task) were entered into SPSS and analysed for instances of extreme views, as it was thought that we should only accept tasks in which the students felt a degree of comfort that the topic was familiar and that the information given was of a concrete nature. From this analysis, we made a preliminary decision to eliminate two of the eight tasks: Tasks G and H (Table 5). It was decided to monitor Task C, as students perceived it as being somewhat difficult in terms of vocabulary and grammar, though the language of the task (see Appendix 4) does not appear to be significantly different from that of the other tasks.


[Table 5 reported, for each of Tasks A to H, the distribution of student questionnaire ratings (1 to 5) on four scales: Topic, Information, Vocabulary and Grammar]

KEY:
Topic: 1 = Familiar ... 5 = Unfamiliar
Information: 1 = Very Concrete ... 5 = Very Abstract
Vocabulary & Grammar: 1 = Easy ... 5 = Difficult

Table 5: Qualitative analysis of the tasks (suggesting that G & H be eliminated)

Based on the two types of analyses, the researchers identified four tasks as being equivalent from the qualitative and quantitative perspectives. These were:
Task B
B. Describe a part-time/holiday job that you have done.
You should say:
How you got the job
What the job involved
How long the job lasted
And explain why you think you did the job well or badly.

Task D
D. Describe an enjoyable event that you experienced when you were at school.
You should say:
What the event was
When it happened
What was good about it
And explain why you particularly remember this event.

Task E
E. Describe a teacher who has influenced you in your education.
You should say:
Where you met them
What subject they taught
What was special about them
And explain why this person influenced you so much.

Task F
F. Describe a film or a TV programme which made a strong impression on you.
You should say:
What kind of film or TV programme it was (eg comedy)
When you saw it
What it was about
And explain why it made such an impression on you.

Figure 2: Four tasks selected for the main study (Phase 5)

In addition to identifying four tasks that can be considered equivalent from as broad a number of perspectives as possible, the early phases of the project also saw the development of a series of theory-based validity questionnaires based on ongoing research at the Centre for Research in Testing, Evaluation and Curriculum (CRTEC) at Roehampton University, London (reported by Akmar Zainal Abidin at the Language Testing Forum, Cambridge, 2003). These questionnaires, which are designed to offer insights into the cognitive processing of the participants before and during test task performance, are based on Weir (2005) and were piloted during Phase 3 (see Appendix 7 for the four versions developed for use in this project). During this piloting, a number of minor amendments were made to the original drafts based on qualitative feedback from participants, primarily for reasons of clarity and where the language proved to be beyond the level of participating learners.

Phase 4: The above phases meant that we were able to identify a set of four oral presentation tasks for which we could claim equivalence from both qualitative and quantitative perspectives; to the best of our knowledge, this has not been attempted before in either language testing or SLA research. In this phase, the resulting tasks were manipulated according to the variables identified in Section 4 above. Table 6 shows that this manipulation resulted in four versions of each of the four tasks: Task B remained unchanged, Task D had no planning time, Task E had no scaffolding and Task F required a response time of one minute (instead of two minutes).
Task   No Change   No Planning time   No Scaffolding   1-minute response
B      x
D                  x
E                                     x
F                                                      x

Table 6: Manipulation of each task

To ensure that there was no order effect, the following matrix was designed (see Table 7). As described above, in this phase of the study, students were asked to perform four tasks, one of which remained unchanged from the original and the others manipulated in the way described in Table 6. In the matrix in Table 7, each version appears on an equal number of occasions and at each level (ie to be performed first, second, etc).
            First   Second   Third   Fourth
Version 1   B       D        E       F
Version 2   D       B        F       E
Version 3   E       F        B       D
Version 4   F       E        D       B

Table 7: Setup for test versions for the main study
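A quick programmatic check (ours, purely for illustration) confirms the balance property claimed for Table 7, namely that every task version occupies every position exactly once:

    # Task orders read from Table 7, one entry per test version.
    orders = {1: "BDEF", 2: "DBFE", 3: "EFBD", 4: "FEDB"}

    for position in range(4):
        tasks_at_position = {order[position] for order in orders.values()}
        assert tasks_at_position == set("BDEF"), f"position {position + 1} unbalanced"
    print("every task appears exactly once in every position")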


The tasks used in the study can be seen in Figure 3 below.


Task B [UNCHANGED]
You will have to talk about the topic for two minutes. You have one minute to think about what you are going to say.
B. Describe a part-time/holiday job that you have done.
You should say:
How you got the job
What the job involved
How long the job lasted
And explain why you think you did the job well or badly.

Task D [NO PLANNING]
You will have to talk about the topic for two minutes. You should start speaking now, without taking time to think about what you are going to say.
D. Describe an enjoyable event that you experienced when you were at school.
You should say:
What the event was
When it happened
What was good about it
And explain why you particularly remember this event.

Task E [NO SCAFFOLDING]
You will have to talk about the topic for two minutes. You have one minute to think about what you are going to say.
E. Describe a teacher who has influenced you in your education.
And explain why this person influenced you so much.

Task F [REDUCED OUTPUT]
You will have to talk about the topic for one minute. You have one minute to think about what you are going to say.
F. Describe a film or a TV programme which made a strong impression on you.
You should say:
What kind of film or TV programme it was (eg comedy)
When you saw it
What it was about
And explain why it made such an impression on you.

Figure 3: Manipulation of the tasks in the main study

Phase 5: In the main part of the study, a total of 74 language students at a range of ability levels performed all four versions of the tasks according to the schedule defined by the matrix in Table 7. The resulting audio files were then edited and saved as individual MP3 files. This was done to avoid any halo effect in the rating process: the four tasks performed by any individual were separated so that raters would not be overly affected by performance on an early task when rating the later tasks. Four CDs were created, each containing a randomised set of performances for one task (B, D, E and F). These were rated by two IELTS-trained examiners working independently of each other using the current rating criteria and scales for the operational IELTS Speaking Test.

5 RESULTS

The scores from these ratings were then analysed using MFR and the resulting data were used for ANOVA and correlational analysis using the programme SPSS, Version 12. The model used in this MFR analysis takes into account the ability of the candidates, the relative harshness of the raters and the difficulty of the tasks to suggest a score called the Fair Average; Fair Average scores have the additional advantage of being true interval in nature. This will allow us to make statements regarding the first aim of the study:

To establish any differences in candidate linguistic behaviour, as reflected in test scores, to language elicitation tasks that have been manipulated along a number of socio-cognitive dimensions
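For reference, the many-facet Rasch model that FACETS implements is commonly written as follows. This is the standard formulation (after Linacre), not an equation reproduced from the report, and the facet notation is ours:

    \[
    \ln\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = B_n - D_i - C_j - F_k
    \]

Here \(P_{nijk}\) is the probability of candidate \(n\) receiving category \(k\) from rater \(j\) on task \(i\); \(B_n\) is candidate ability, \(D_i\) task difficulty, \(C_j\) rater severity and \(F_k\) the difficulty of scale step \(k\). The Fair Average is then the expected band score for a candidate once the estimated task difficulties and rater severities have been adjusted for.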


Since all students complete a theory-based validity questionnaire on completion of each of the four tasks they perform (see Appendix 7), analysis of these responses will allow us to make statements regarding the second of our research questions:

To establish any differences in candidate behaviour (cognitive processing) to language elicitation tasks that have been manipulated along a number of socio-cognitive dimensions

The existence (or not) of observable systematic differences across the four tasks will be interpreted in light of our third aim:

To create a framework for the systematic manipulation of speaking tasks

5.1 Rater agreement

Before analysing the candidate performance data, it is first necessary to explore the area of inter-rater reliability. In this project, a number of measures will be considered in order to gain a broad picture of the extent to which the two raters behaved in a consistent and predictable way. First, correlation analysis was undertaken to explore the degree to which the two raters placed the candidates in a similar order. The results of this analysis (Table 8) indicate a significant level of correlation for all comparisons (the most meaningful comparisons are those between the two raters on the same criterion). The overall agreement, based on the raw data, is 0.75: certainly acceptable, though not as high as we would expect to find in an operational test event (where it is usual to expect correlations above 0.8). It is possible that the unnatural nature of the rating process, where each rater was given a set of four CDs, each containing the performances of all candidates for a particular task, may have affected rating.
                                                        Rater 2
Rater 1                          Fluency &    Lexical     Grammatical         Pronunciation    Overall
                                 coherence    resource    range & accuracy
Fluency & coherence              .685         .662        .668                .651             .715
Lexical resource                 .700         .677        .656                .583             .720
Grammatical range & accuracy     .696         .662        .631                .604             .703
Pronunciation                    .629         .592        .588                .589             .643
Overall                          .738         .694        .679                .640             .750

All correlations significant at the 0.01 level (2-tailed).

Table 8: Correlations between the raters
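The correlations in Table 8 are straightforward to reproduce once the two raters' analytic scores sit side by side. The sketch below assumes a hypothetical file ratings.csv with one row per rated performance and paired columns such as fluency_r1/fluency_r2; the file and column names are assumptions, not part of the study.

    import pandas as pd

    ratings = pd.read_csv("ratings.csv")  # one row per rated performance (assumed layout)
    for crit in ["fluency", "lexis", "grammar", "pronunciation", "overall"]:
        # Pearson correlation between the two raters on the same criterion
        r = ratings[f"{crit}_r1"].corr(ratings[f"{crit}_r2"])
        print(f"{crit:13s} r = {r:.3f}")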

Another estimate of inter-rater agreement is the degree to which the raters agree on scores around the critical boundary. A widely recognised threshold for IELTS is an overall band score of 6.5 (ie the level demanded by most universities for entrance, computed from scores on the four skills modules); although operational scores for IELTS Speaking are reported only at the whole band level, it was decided to use 6.5 in the following analysis. Table 9 shows the level of agreement/disagreement between the two raters. The diagonal cells (shaded in the original table) show where the two raters agreed: they agreed for a total of 78% of the candidates and disagreed on the remaining 22%. The table also suggests that Rater 1 is somewhat harsher than Rater 2. From these two analyses, we can see that the raters were in broad agreement. As both the correlation between the overall scores and the critical boundary agreement indices are acceptable, we can accept that the scores awarded can be used for further analysis.


                   Rater 2 Pass    Rater 2 Fail
Rater 1 Pass       48              20
Rater 1 Fail       45              183

Table 9: Critical boundary agreement (boundary = 6.5)
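A cross-tabulation like Table 9 can be generated directly from the same hypothetical data frame: the 6.5 cut score is applied to each rater's overall score and the proportion of matching decisions reported.

    import pandas as pd

    ratings = pd.read_csv("ratings.csv")       # assumed layout, as above
    r1 = ratings["overall_r1"] >= 6.5          # Rater 1 pass/fail at the boundary
    r2 = ratings["overall_r2"] >= 6.5          # Rater 2 pass/fail at the boundary
    print(pd.crosstab(r1, r2, rownames=["Rater 1 pass"], colnames=["Rater 2 pass"]))
    print(f"Boundary agreement: {(r1 == r2).mean():.0%}")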

5.2 Score data analysis

Following the tests of rater agreement, the first analysis conducted on the task performance score data involved estimating the correlations between the four tasks. Table 10 shows that the correlations were very high and were all significant at the 0.01 level. It is particularly interesting to see that Task B is most highly correlated with Tasks D and F, suggesting that the existence of planning time may not significantly affect task performance. Task D was the same as Task B with the single exception that in Task D there was no planning time available to test candidates. The other interesting suggestion here is that the amount of output expected of the candidate does not appear to have had a significant impact on the score achieved: Task F is the same as Task B except that candidates are expected to talk for two minutes in Task B but for just one minute in Task F.
                Task B    Task D    Task E    Task F
Task B          1         .900      .871      .901
Task D          .900      1         .862      .858
Task E          .871      .862      1         .862
Task F          .901      .858      .862      1

All correlations are significant at the 0.01 level (2-tailed).

Table 10: Correlations between the four tasks
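A correlation matrix of this kind can be produced by pivoting a long-format score file to one column per task; again, the file and column names are assumptions rather than the study's own data.

    import pandas as pd

    scores = pd.read_csv("fair_averages.csv")   # columns: candidate, task, score (assumed)
    wide = scores.pivot(index="candidate", columns="task", values="score")
    print(wide.corr().round(3))                 # Pearson correlations, as in Table 10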

To explore the data more fully from the perspective of variation in performance across the four tasks, it was decided to classify each candidate into one of three groups: those of High ability (setting the critical boundary at 6.5 and including those at and above it); those who could be considered Borderline cases (from 6.0 up to, but not including, 6.5); and those categorised as Low ability candidates (scoring less than 6.0). All three categorisations were based on performance over the four tasks.
Ability Level        N          Task                    N
Pass (High)          19         Original                74
Borderline           27         No Planning             74
Fail (Low)           28         No Support              74
                                Reduced Response        74

Table 11: Descriptive statistics of the main study data
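The three-way classification can be expressed compactly: the sketch below averages each candidate's fair-average scores over the four tasks and applies the boundaries described above (file and column names are assumed).

    import pandas as pd

    scores = pd.read_csv("fair_averages.csv")   # columns: candidate, task, score (assumed)
    means = scores.groupby("candidate")["score"].mean()
    # Low: below 6.0; Borderline: 6.0 up to (not including) 6.5; High: 6.5 and above
    ability = pd.cut(means, bins=[0.0, 6.0, 6.5, 9.5], right=False,
                     labels=["Low", "Borderline", "High"])
    print(ability.value_counts())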


The descriptive statistics (see Table 11) show that the relative ability level of the population was quite low, with the fail group the largest (28 of the 74 candidates) and only 19 candidates (approximately a quarter) clearly achieving 6.5 or above. The results of the ANOVA (Table 12) show that there are significant differences between the four task types and the three ability groups (as we would expect, since the groups were selected based on overall score averages over the four tasks). There does not appear to be any significant interaction between the ability groups and the task type, suggesting the stability of these tasks across ability levels. However, significant differences emerge in respect of task and ability as separate variables.
Source              Type III Sum of Squares    df     Mean Square    F            Sig.
Corrected Model     158.490(a)                 11     14.408         58.714       .000
Intercept           9891.754                   1      9891.754       40309.670    .000
Task                4.287                      3      1.429          5.823        .001
Ability             151.483                    2      75.742         308.653      .000
Task * Ability      2.570                      6      .428           1.745        .110
Error               69.692                     284    .245
Total               10066.500                  296
Corrected Total     228.182                    295

(a) R Squared = .695 (Adjusted R Squared = .683)

Table 12: ANOVA results from the main study
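An equivalent two-way ANOVA can be run outside SPSS. The sketch below uses statsmodels with sum-to-zero contrasts so that the Type III sums of squares match the SPSS convention; the long-format file and its column names are assumptions.

    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    df = pd.read_csv("scores_long.csv")   # columns: score, task, ability (assumed)
    # Sum coding is needed for meaningful Type III tests of the main effects
    model = smf.ols("score ~ C(task, Sum) * C(ability, Sum)", data=df).fit()
    print(sm.stats.anova_lm(model, typ=3))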

The post hoc (Bonferroni) analysis (Table 13) suggests that there are differences in the responses, and that these are significant for comparisons between the original version of the task and the versions with no planning time and with reduced response time. The actual differences in scores achieved for these tasks are approximately one third and one quarter of a band respectively, with the original task proving easier in both cases.
Comparison                           Mean Difference    Sig.     95% CI Lower    95% CI Upper
Original vs No Planning              .32*               .001     .10             .54
Original vs No Support               .15                .378     -.06            .37
Original vs Reduced Response         .26*               .008     .05             .48
No Planning vs No Support            -.17               .234     -.39            .05
No Planning vs Reduced Response      -.06               1.000    -.27            .16
No Support vs Reduced Response       .11                1.000    -.10            .33

Based on observed means. * The mean difference is significant at the .05 level.

Table 13: Multiple post hoc analysis (Bonferroni)
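The logic of the post hoc comparisons can be approximated with pairwise t-tests and a Bonferroni correction, as sketched below. Note that SPSS's Bonferroni procedure uses the pooled error term from the ANOVA rather than separate two-sample tests, so this is an approximation; the data layout is assumed, as before.

    from itertools import combinations
    import pandas as pd
    from scipy import stats

    df = pd.read_csv("scores_long.csv")                 # assumed layout, as above
    pairs = list(combinations(df["task"].unique(), 2))
    for a, b in pairs:
        xa = df.loc[df["task"] == a, "score"]
        xb = df.loc[df["task"] == b, "score"]
        t, p = stats.ttest_ind(xa, xb)
        p_adj = min(p * len(pairs), 1.0)                # Bonferroni adjustment
        print(f"{a} vs {b}: mean diff = {xa.mean() - xb.mean():.2f}, p = {p_adj:.3f}")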

Having completed the main analyses, a set of charts was then generated. These consisted of a set of clustered boxplots and a line diagram, both of which were based on averaged scores for each task but with ability group also taken into account.


In the first of these charts (Figure 4) we can see that there is relatively little difference in the range of mean scores achieved by each group across the four tasks. While there is a clear difference between the three ability groups in the mean scores achieved for the different tasks, the pattern of scores across the four tasks also appears to differ between the High ability (pass) group, the Borderline group and the Low ability (fail) group.

Figure 4: Boxplots comparing task mean score by ability group

In the final chart (Figure 5 see following page) we can now see that the pattern of scoring is relatively similar for the Low and Borderline groups but quite different for the High scoring group. Taken with the significant results found in the ANOVA reported above, this suggests that manipulating tasks may result in more complex effects on difficulty than initially thought. The standard version of the task appears to result in optimum performance for all groups; by contrast, the no-planning version appears to result in systematically lower scores across the three ability groups. The lack of support (or scaffolding) appears to have a greater negative impact on test scores achieved by the High and Borderline groups while at the same time having only a very slight (and certainly non-significant) impact on the Low group who may be at a level of language ability where any changes have little impact on performance. Finally, the reduction in response time appears to have had little impact on the performances of the High and Borderline groups, though it clearly has had a different impact on the Low group, with their mean score at its lowest point.


[Line chart of estimated marginal mean total task scores: one line per ability group (Low, Borderline, High) across the four task versions (Original, No Planning, No Support, Reduced Response). The High group's means lie at around 6.5, the Borderline group's between 5.5 and 6, and the Low group's between 4.5 and 5.]

Figure 5: Line diagram comparing task mean score by ability group

5.3 Questionnaire data analysis (from the perspective of the task)

For reasons of clarity of analysis and presentation, we present the results from the three parts of the questionnaire separately.

In the first part of the questionnaire, all participants were asked to respond to items related to how they dealt with their initial response to each task version. The results are shown in Table 15 below. These results are based on a series of univariate ANOVAs carried out on the data after the questionnaires had been shown to be working as predicted through factor analysis. The factor analysis was carried out to find evidence that the questionnaires were producing consistent results: since the three parts of the instrument had been designed to elicit information on specific aspects of the candidates' behaviour, it was expected that a factor analysis of the responses would identify factors matching the planned design. The results of the analysis of Part 1 indicated a very clear two-factor solution, with the first four items loading on one factor (which we suggest indicates a more general background knowledge of speaking test response), while the latter four items load on a second factor (which appears to reflect more task-specific knowledge).


                                                                              Component 1    Component 2
Goal setting
1. I read the task very carefully to understand what was required.           .104           .702
2. I thought of HOW to deliver my speech in order to respond well to
   the topic.                                                                 .114           .748
3. I thought of HOW to satisfy the audiences and examiners.                   .273           .643
4. I understood the instructions for this speaking test completely.          .182           .657
Generating Ideas
5. I had ENOUGH ideas to speak about this topic.                              .750           .236
6. I felt it was easy to produce enough ideas for the speech from memory.    .813           .185
7. I know A LOT about this type of speech, ie I know how to make a speech
   on this type of topic.                                                     .823           .180
8. I know A LOT about other types of speaking test, eg interview,
   discussion.                                                                .745           .126

Extraction Method: Principal Component Analysis. Rotation Method: Varimax with Kaiser Normalisation. Rotation converged in 3 iterations.

Table 14: Factor analysis of Questionnaire Part 1 (before speaking)
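A principal components solution with varimax rotation of this kind can be obtained with the third-party factor_analyzer package; the sketch below assumes a hypothetical file part1.csv with one column per questionnaire item.

    import pandas as pd
    from factor_analyzer import FactorAnalyzer   # pip install factor-analyzer

    items = pd.read_csv("part1.csv")             # one column per item (assumed layout)
    fa = FactorAnalyzer(n_factors=2, method="principal", rotation="varimax")
    fa.fit(items)
    loadings = pd.DataFrame(fa.loadings_, index=items.columns,
                            columns=["Component 1", "Component 2"])
    print(loadings.round(3))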

When this is taken into account, the analysis of the responses to individual items should reflect this two-factor solution. In the first section, which explores candidates' awareness of how they might go about responding to the task in the initial stages of reading and considering their response, there are a number of significant differences between the tasks and the ability groups (though, as with all responses to the questionnaire items, there is no interaction between the two variables).
1. I read the task very carefully to understand what was required. (Ave. 4.2)
   Task type: less likely for No Planning. Ability group: less likely for the BORDERLINE group.
2. I thought of HOW to deliver my speech in order to respond well to the topic. (Ave. 3.7)
   Task type: less likely for No Planning. Ability group: no meaningful differences.
3. I thought of HOW to satisfy the audiences and examiners. (Ave. 3.3)
   Task type: no meaningful differences. Ability group: no meaningful differences.
4. I understood the instructions for this speaking test completely. (Ave. 4.0)
   Task type: less likely for No Planning. Ability group: more likely for the HIGH group.
5. I had ENOUGH ideas to speak about this topic. (Ave. 3.1)
   Task type: more likely in Original, least for No Planning and No Support. Ability group: less likely for the LOW group.
6. I felt it was easy to produce enough ideas for the speech from memory. (Ave. 3.1)
   Task type: more likely in Original, least for No Planning and No Support. Ability group: less likely for the BORDERLINE group.
7. I know A LOT about this type of speech, ie I know how to make a speech on this type of topic. (Ave. 2.9)
   Task type: no meaningful differences. Ability group: no meaningful differences.
8. I know A LOT about other types of speaking test, eg interview, discussion. (Ave. 3.0)
   Task type: no meaningful differences. Ability group: no meaningful differences.

Note: the Likert scale upon which the Average is calculated runs from 1 to 5. Entries of 'no meaningful differences' indicate that no significant difference was found.

Table 15: Univariate ANOVA results for Questionnaire Part 1 (before speaking)
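The item-by-item results in Table 15 come from a series of univariate ANOVAs; a loop of the following kind reproduces that structure (with an assumed long-format file, one column per questionnaire item).

    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    q1 = pd.read_csv("questionnaire_part1.csv")   # item_1..item_8, task, ability (assumed)
    for item in [c for c in q1.columns if c.startswith("item_")]:
        m = smf.ols(f"{item} ~ C(task, Sum) * C(ability, Sum)", data=q1).fit()
        aov = sm.stats.anova_lm(m, typ=3)
        # p-values for the task and ability main effects on this item
        print(item, aov.loc[["C(task, Sum)", "C(ability, Sum)"], "PR(>F)"].round(3).tolist())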


The mean response levels (in the Ave. column) indicate that the candidates were likely to read the instructions carefully, and that they tended to have no problem understanding the task. However, they were less likely to consider the audience (Item 3) or to give much thought to the generation of ideas prior to speaking (Items 5-8). It is interesting to note that candidates responding to the No Planning version of the tasks were less likely either to read the rubric as carefully as for the other versions or to think about how to respond in the same way as they might for the other versions. However, it should be noted that the lower mean response to the first item appears to have been heavily influenced by the Borderline group. Review of the data indicates that no errors in data entry could have led to this and, in the absence of post-test interview data, the reason for the very low response cannot easily be explained. We can also see that the No Planning task appears to have resulted in candidates failing to fully understand the instructions (not surprising in light of the earlier responses, which indicated they may not have read them carefully), though this was not a problem for the High ability group.

In the second part of the section, which focused on generating ideas in the pre-planning stage, candidates indicated that the manipulation of the task had a significant impact on their ability to produce ideas from their background knowledge. Where the task was altered in terms of planning time or support offered, the candidates reported significantly more difficulty in generating ideas; this was most significant for the Low and Borderline groups. For Items 5 and 6 the pattern of response for the Low group was similar across the four tasks, while both the High and Borderline groups indicated a high likelihood for both the Original task and the Reduced Response version and a low likelihood for the other two versions. Perhaps not surprisingly, in the final pair of items, which link the generating of ideas to what is essentially background knowledge, there are no meaningful differences between the tasks or between the three ability levels.

As with the factor analysis of the first section of the questionnaire, the analysis of the second section suggests that this part of the instrument is also working well (Table 16); note that in this analysis the No Planning task was not included, as candidates attempting that version had been given no time to plan and so were not asked to complete the planning questionnaire. The single exception seems to be Item 7, which loads on two factors, so this item has been removed from the analysis that follows. The six-factor solution reflects the original design.


                                                                 Component
Item (design category)                                           1       2       3       4       5
Time Element
1. I thought of MOST of my ideas for the speech BEFORE
   planning an outline.                                          -.071   -.070   .222    .084    .635
2. During the period allowed for planning, I was conscious
   of the time.                                                  .114    .171    -.067   -.059   .805
Task Specific Planning
3. I followed the 3 short prompts provided in the task when
   I was planning.                                               -.035   .771    .167    -.061   -.107
4. The information in the short prompts provided was
   necessary for me to complete the task.                        -.118   .731    -.001   .042    .156
5. I wrote down the points I wanted to make based on the 3
   short prompts provided in the task.                           -.111   .602    .050    .443    .118
Linguistic Planning
6. I wrote down the words and expressions I needed to fulfil
   the task.                                                     -.110   .002    .152    .730    .050
7. I wrote down the structures I needed to fulfil the task.      .439    .000    .162    .512    .310
Language used when Planning
8. I took notes only in ENGLISH.                                 -.758   .114    -.078   .209    .022
9. I took notes only in my own language.                         .785    -.056   .084    .157    -.001
10. I took notes in both ENGLISH and own language.               .862    -.092   -.016   -.039   .044
Organisation
11. I planned an outline on paper BEFORE starting to speak.      -.057   -.082   .014    -.652   .045
12. I planned an outline in my mind BEFORE starting to speak.    -.232   -.004   -.431   .410    -.200
Generating & Practicing
13. Ideas occurring to me at the beginning tended to be
    COMPLETE.                                                    -.111   .265    .726    .059    .016
14. I was able to put my ideas or content in good order.         .040    .257    .661    .243    -.066
15. I practiced the speech in my mind WHILE I was planning.      .192    -.396   .584    -.015   .246
16. After finishing my planning, I practiced what I was going
    to say in my mind until it was time to start.                .369    -.309   .543    .000    .241

Extraction Method: Principal Component Analysis. Rotation Method: Varimax with Kaiser Normalisation. Rotation converged in 7 iterations.

Table 16: Factor analysis of Questionnaire Part 2 (planning excludes Task 2)

The mean responses in Table 17 show an interesting pattern, particularly the high levels for Items 3, 4 and 5, indicating that candidates tended to rely to a great extent on the bullet-pointed prompts. The high mean for Item 8 (combined with the low means for Items 9 and 10) indicates that planning tends to be done in the target language (though the Low ability group were more likely to use L1). The low means for Items 11 and 12 suggest that little attention is given to planning an outline before speaking. This appears to contradict Item 5, where candidates say they wrote down the points they wanted to make before speaking. It is possible that they interpreted this item as referring to making a full plan or script of what to say, though not necessarily on paper. This needs to be clarified before any future administration of the instrument.

In the first part of the section (labelled Time Element) there is little difference across ability levels, though there appears to be a significant effect for the Reduced Response version of the task on the item referring to awareness of time. Since there are only two significant task effects across all the items related to planning, we can deduce that manipulating tasks in the ways adopted here may have a limited impact on the planning phase. These aspects can be summarised as:

- With reduced response time, candidates may feel they are under less pressure and so are less conscious of time when responding
- Removing support from a task appears to make it more difficult for students to plan their response


- High level candidates are more likely to rely on the supporting points in a task rubric
- Low level candidates are more likely to use either their own language only, or a combination of the target language and their own language, in planning
- Low level students are more likely to practise what they are about to say both during and after planning
1. I thought of MOST of my ideas for the speech BEFORE planning an outline. (Ave. 3.64)
   Task type: no meaningful differences. Ability level: no meaningful differences.
2. During the period allowed for planning, I was conscious of the time. (Ave. 3.31)
   Task type: least likely for Reduced Response. Ability level: no meaningful differences.
3. I followed the 3 short prompts provided in the task when I was planning. (Ave. 3.99)
   Task type: no meaningful differences. Ability level: no meaningful differences.
4. The information in the short prompts provided was necessary for me to complete the task. (Ave. 3.78)
   Task type: no meaningful differences. Ability level: HIGH group more likely to respond positively.
5. I wrote down the points I wanted to make based on the 3 short prompts provided in the task. (Ave. 3.84)
   Task type: no meaningful differences. Ability level: no meaningful differences.
6. I wrote down the words and expressions I needed to fulfil the task. (Ave. 3.35)
   Task type: no meaningful differences. Ability level: no meaningful differences.
7. I wrote down the structures I needed to fulfil the task. (Ave. 2.4)
   Task type: no meaningful differences. Ability level: LOW group more likely to respond positively.
8. I took notes only in ENGLISH. (Ave. 4.05)
   Task type: no meaningful differences. Ability level: no meaningful differences.
9. I took notes only in my own language. (Ave. 1.9)
   Task type: no meaningful differences. Ability level: LOW group more likely to respond positively (but low means).
10. I took notes in both ENGLISH and own language. (Ave. 2.14)
   Task type: no meaningful differences. Ability level: lower levels more likely to respond positively.
11. I planned an outline on paper BEFORE starting to speak. (Ave. 1.25)
   Task type: no meaningful differences. Ability level: no meaningful differences.
12. I planned an outline in my mind BEFORE starting to speak. (Ave. 1.38)
   Task type: no meaningful differences. Ability level: no meaningful differences.
13. Ideas occurring to me at the beginning tended to be COMPLETE. (Ave. 3.12)
   Task type: no meaningful differences. Ability level: no meaningful differences.
14. I was able to put my ideas or content in good order. (Ave. 2.88)
   Task type: less likely for No Support. Ability level: no meaningful differences.
15. I practiced the speech in my mind WHILE I was planning. (Ave. 2.89)
   Task type: no meaningful differences. Ability level: LOW group more likely to respond positively (but low means).
16. After finishing my planning, I practiced what I was going to say in my mind until it was time to start. (Ave. 2.72)
   Task type: no meaningful differences. Ability level: HIGH group less likely to respond positively.

Note: Items 3, 4 and 5 were not included in the No Support version (as they refer to the supporting points). Entries of 'no meaningful differences' indicate that no significant difference was found.

Table 17: Univariate ANOVA results for Questionnaire Part 2 (during planning)


In the final section of the questionnaire, candidates were asked to respond to items related to what they did as they were speaking. The factor analysis reflected the original design, and so this section was considered to have worked as predicted.
                                                                 Component
Item (design category)                                           1       2       3       4
Idea Development (ability)
1. I felt it was easy to put ideas in good order.                .819    .083    .079    -.028
2. I was able to express my ideas using appropriate words.       .705    .203    .134    .015
3. I was able to express my ideas using correct grammar.         .695    .194    .133    .088
6. I was able to put sentences in logical order.                 .736    .226    .086    .040
7. I was able to CONNECT my ideas smoothly in the whole
   speech.                                                       .602    .264    .073    -.136
14. I felt it was easy to complete the task.                     .748    .125    .158    .094
Idea Development (temporal)
4. I thought of MOST of my ideas for the speech WHILE I was
   actually speaking.                                            -.048   .205    .330    .714
5. Some ideas had to be omitted while I was speaking.            .103    -.132   -.326   .759
Time Awareness
8. I was conscious of the time WHILE I was making this
   speech.                                                       .194    .009    .819    -.025
9. I tried NOT to speak more than the required length of
   time in the instructions.                                     .239    .278    .629    .012
Monitoring
10. I was listening and checking the correctness of the
    contents and their order WHILE I was making this speech.     .251    .754    .030    -.017
11. I was listening and checking whether the contents and
    their order fit the topic WHILE I was making this speech.    .195    .786    .049    -.020
12. I was listening and checking the correctness of
    sentences WHILE I was making this speech.                    .215    .783    .090    .016
13. I was listening and checking whether the words fit the
    topic WHILE I was making this speech.                        .170    .744    .221    .107

Extraction Method: Principal Component Analysis. Rotation Method: Varimax with Kaiser Normalisation. Rotation converged in 5 iterations.

Table 18: Factor analysis of Questionnaire Part 3 (during speaking)

The most interesting thing about the mean responses in this section is the lack of variation across the items. In the first part, a 'no view' perspective predominates, suggesting that the candidates were not overly challenged by the tasks. In support of the findings for the previous section, there appears to have been a tendency for candidates to plan while speaking (Item 4), and a slight tendency for them to monitor the content and language of their responses (though the latter seems to have been most likely for the High ability level). In the first part of the section, which related to ease and ability to develop ideas, the suggestion appears to be that the candidates found the Original version of the task the easiest to respond to (though for Item 1 this was shared with the Reduced Response version). Not surprisingly, the High level candidates indicated that they found it easy to express ideas using correct grammar, while the Borderline candidates seemed to struggle with cohesion and coherence. Low level candidates were more likely to omit ideas as they were speaking, though this was reported as less likely with the No Support task version, possibly because candidates considered their ideas to relate primarily to the three bullet-pointed supporting points and, when these were removed, they struggled.


1. I felt it was easy to put ideas in good order. (Ave. 2.9)
   Task type: easier for Original and Reduced Response. Ability level: no meaningful differences.
2. I was able to express my ideas using appropriate words. (Ave. 3.0)
   Task type: no meaningful differences. Ability level: no meaningful differences.
3. I was able to express my ideas using correct grammar. (Ave. 2.8)
   Task type: no meaningful differences. Ability level: more likely with the HIGH group.
6. I was able to put sentences in logical order. (Ave. 3.0)
   Task type: no meaningful differences. Ability level: less likely with the BORDERLINE group.
7. I was able to CONNECT my ideas smoothly in the whole speech. (Ave. 2.8)
   Task type: more likely with Original, especially compared to No Planning. Ability level: less likely with the BORDERLINE group.
14. I felt it was easy to complete the task. (Ave. 2.9)
   Task type: no meaningful differences. Ability level: no meaningful differences.
4. I thought of MOST of my ideas for the speech WHILE I was actually speaking. (Ave. 3.4)
   Task type: no meaningful differences. Ability level: no meaningful differences.
5. Some ideas had to be omitted while I was speaking. (Ave. 3.0)
   Task type: less likely with the No Support version. Ability level: most likely for the LOW group.
8. I was conscious of the time WHILE I was making this speech. (Ave. 3.3)
   Task type: no meaningful differences. Ability level: no meaningful differences.
9. I tried NOT to speak more than the required length of time in the instructions. (Ave. 3.4)
   Task type: no meaningful differences. Ability level: no meaningful differences.
10. I was listening and checking the correctness of the contents and their order WHILE I was making this speech. (Ave. 3.3)
   Task type: no meaningful differences. Ability level: no meaningful differences.
11. I was listening and checking whether the contents and their order fit the topic WHILE I was making this speech. (Ave. 3.3)
   Task type: no meaningful differences. Ability level: less likely with the BORDERLINE group.
12. I was listening and checking the correctness of sentences WHILE I was making this speech. (Ave. 3.3)
   Task type: no meaningful differences. Ability level: more likely with the HIGH group.
13. I was listening and checking whether the words fit the topic WHILE I was making this speech. (Ave. 3.3)
   Task type: no meaningful differences. Ability level: more likely with the HIGH group.

Note: entries of 'no meaningful differences' indicate that no significant difference was found.

Table 19: Univariate ANOVA results for Questionnaire Part 3 (during speaking)

Time did not seem to be particularly important to candidates, and though there was a slight tendency for them to be conscious of time, this does not appear to have varied across ability level or task type attempted. Similarly, though candidates tended to monitor their responses for content, organisation and language, this was not a very strong trend, with the exception of the High ability group, who were significantly more likely to monitor their language (but not content or organisation) than the other groups.

6 CONCLUSIONS

In this research project we set out to establish whether the difficulty of a task could be varied by systematic manipulation along a number of dimensions. In doing this we were interested in whether the scores achieved by a group of test candidates would vary, along with the cognitive processing associated with performance on the various tasks. It was hoped that this would provide the basis for a framework which could be used to manipulate tasks in order to systematically alter their difficulty.


The project called for a set of four equivalent tasks to be identified so that all participants would respond to an unaltered version as well as three versions in which systematic variations had been made (removal of planning time; removal of support; and reduction of expected response time). In order to identify four equivalent tasks, a complex procedure was designed, in which a set of nine tasks was analysed both quantitatively (based on the performances of a group of 54 participants) and qualitatively (using the responses of these same participants to a series of short questionnaires). At this stage, a set of four tasks was identified and manipulated as planned. A group of 74 participants then recorded their responses to the tasks, which were presented to different people in different orders. All respondents also completed questionnaires (one per task, so a total of four per participant) based on Weir's (2005) socio-cognitive framework for test validation for speaking. The resulting data were then analysed using the two datasets.

Results of the analysis of the score data suggest that there are significant differences in the responses of the three ability groups to the four tasks, indicating that task difficulty may well be affected differently for test candidates of different ability. In other words, simply altering a task along a particular dimension may not result in a version that is equally more or less difficult for all test candidates. Instead, there is likely to be a variety of effects as a result of the alteration. For instance, here, mid-level and higher-level participants were not significantly affected by the reduction in response time, while this same alteration to the task resulted in the most serious negative effect for the lower level participants.

The analysis of the questionnaire data further complicates the picture. We can briefly summarise the findings as follows:

- The most significant effects of task manipulation on candidates appear to be at the pre-speaking phase, particularly where no planning time is offered. However, these effects appear to differ depending on the ability level of the candidate.
- The effects on planning are far less obvious. The candidates report essentially the same approach to planning regardless of the task. While there are far more significant differences in the ways in which candidates of different ability levels approach task planning, there appears to be a clear tendency for them not to outline their response before speaking; so even though they take the time to plan, they seem to do much of their planning on-line, ie as they are speaking (though lower level candidates report practising what they plan to say before speaking).
- When speaking, the candidates seemed to feel that the original version of the task offered them the greatest opportunity to perform at their best, though not surprisingly this depended on their ability level (lower levels did not find any particular version easier in any way than the others).
- There was a significant difference in approach to monitoring of own output, with the higher level students more likely to monitor language, though not content or organisation.

6.1 Implications

We believe the study has implications for teachers who prepare students for examinations containing speaking tasks which involve individual long turn responses, for the test developers who design these tasks, for test validators, and for first and second language acquisition researchers.


6.1.1 Teachers

The differences in approach to task performance highlighted here suggest that teachers might focus more explicitly on pre-speaking strategies, such as attending more closely to any bulleted prompts and using the target language for any planning. The lack of impact of task manipulation on approach to planning suggests that students (certainly those involved in this study) have already formed strategies for task performance. However, to improve their understanding of a task, students should be encouraged to read task rubrics more carefully, focus on the language used in the instructions and, perhaps, ask for assistance where things are not clear.
6.1.2 Test developers

The notion of task equivalence is not as straightforward as it seems. The nine tasks initially used here were presumed by their developers to be equivalent, yet the methodology used to establish equivalence demonstrated how difficult it can be to create truly equivalent versions of a task.

The main study also demonstrates how task difficulty can be affected by decisions to include or exclude support (eg in the form of bulleted prompts) or by altering the planning time afforded to candidates. This suggests that any substantive changes to these conditions of task performance need to be empirically tested before they are considered in any test revision (or as alternative choices within a test). This is particularly relevant for the planning variable: scores were significantly lower under the no-planning condition than for the original version of the task (which allows one minute of planning time).

The situation regarding amount of response time seems less conclusive. Apart from a reduced awareness of time in the planning phase (possibly due to the perception that less speaking time meant there was less to worry about), there appears to have been no difference in the approach taken to task response. However, the scores achieved appear to have been significantly lower for this version than for the original version of the task (in the original version candidates spoke for two minutes, as opposed to one minute in the reduced response version).

The rubric appears to be especially important in this type of task. It is clear that a number of candidates (typically at the lower level) had some difficulty understanding what to do. While this is possibly unavoidable in a test which is designed to be used across a broad range of abilities, it is clearly very important for the test developer to ensure measures are in place to prevent poor reading or listening skills from affecting student spoken performance. In live tests this is not so difficult (examiners can be trained to deal systematically with comprehension problems), though it is a potentially serious limitation of any computer-delivered test of this sort.
6.1.3 Test validators

In the same way that test developers need to focus on the area of task equivalence, test validators should also consider this area when establishing evidence of the context validity (see Weir, 2005) of their tests. Consideration should be given to using the methodology developed here in order to establish true equivalence in test tasks, as well as to investigating how tasks are affected when variations are suggested by stakeholders.
6.1.4 Researchers

SLA researchers have argued since the mid-1980s that performing language elicitation tasks in a learning environment supports learning. While O'Sullivan (2000a: 298) argues that '[The] notion of an interlocutor effect on performance does not appear to have been sufficiently addressed in the [SLA] literature', he also argues that the conditions under which tasks are performed should be more rigorously described (O'Sullivan, 2000a: 297). While there has been a recognition in the task-based learning literature that task performance conditions can affect performance (Larsen-Freeman & Long, 1991: 30-33), there is little evidence that this awareness has found its way into SLA or Applied Linguistics research.


The evidence presented in this project suggests that researchers need to understand more clearly the implications of the decisions they make when designing tasks for use as elicitation devices in their studies. Research studies should report task design and equivalence in more detail, and researchers should be explicit about the rationale for task selection and manipulation. In other words, tasks for both testing and research purposes should be specified in an equally systematic and comprehensive fashion, using a model of validation such as that of Weir (2005), to ensure that the results obtained are credible in terms of the validity evidence available.


REFERENCES

Abdul Raof, AH, 2002, The production of a performance rating scale: an alternative methodology, unpublished PhD dissertation, The University of Reading, UK
Berry, V, 1994, Personality characteristics and the assessment of spoken language in an academic context, paper presented at the 16th Language Testing Research Colloquium, Washington, DC
Berry, V, 1997, Gender and personality as factors of interlocutor variability in oral performance tests, paper presented at the 19th Language Testing Research Colloquium, Orlando, Florida
Berry, V, 2004, A study of the interaction between individual personality differences and oral test performance test facets, unpublished PhD dissertation, King's College, The University of London
Bonk, WJ and Ockey, GJ, 2003, A many-facet Rasch analysis of the second language group oral discussion task, Language Testing, vol 20, no 1, pp 89-110
Brown, A, 1995, The effect of rater variables in the development of an occupation-specific language performance test, Language Testing, vol 12, no 1, pp 1-15
Brown, A, 1998, Interviewer style and candidate performance in the IELTS oral interview, paper presented at the 20th Language Testing Research Colloquium, Monterey, CA
Brown, A, and Lumley, T, 1997, Interviewer variability in specific-purpose language performance tests, in Current Developments and Alternatives in Language Assessment, eds A Huhta, V Kohonen, L Kurki-Suonio and S Luoma, University of Jyväskylä and University of Tampere, Jyväskylä, pp 137-150
Brown, G, and Yule, G, 1983, Teaching the spoken language, Cambridge University Press, Cambridge
Buckingham, A, 1997, Oral language testing: do the age, status and gender of the interlocutor make a difference?, unpublished MA dissertation, University of Reading
Butler, FA, Eignor, D, Jones, S, McNamara, T, and Suomi, BK, 2000, TOEFL 2000 Speaking Framework: A Working Paper, TOEFL Monograph Series 20, Educational Testing Service, Princeton, NJ
Bygate, M, 1987, Speaking, Oxford University Press, Oxford
Bygate, M, 1999, Quality of language and purpose of task: patterns of learners' language on two oral communication tasks, Language Teaching Research, vol 3, no 3, pp 185-214
Chalhoub-Deville, M, 1995, Deriving oral assessment scales across different tests and rater groups, Language Testing, vol 12, pp 16-33
Clark, JLD and Swinton, SS, 1979, An exploration of speaking proficiency measures in the TOEFL context, TOEFL Research Report, Educational Testing Service, Princeton, NJ
Crookes, G, 1989, Planning and interlanguage variation, Studies in Second Language Acquisition, vol 11, pp 367-383
Ellis, R, 1987, Interlanguage variability in narrative discourse: style shifting in the use of the past tense, Studies in Second Language Acquisition, vol 9, pp 1-20
Foster, P and Skehan, P, 1996, The influence of planning and task type on second language performance, Studies in Second Language Acquisition, vol 18, pp 299-323

Foster, P and Skehan, P, 1999, The influence of source of planning and focus of planning on task-based performance, Language Teaching Research, vol 3, no 3, pp 215-247
Fulcher, G, 1996, Testing tasks: issues in task design and the group oral, Language Testing, vol 13, no 1, pp 23-51
Fulcher, G, 2003, Testing second language speaking, Longman/Pearson, London
Halleck, G, 1996, Interrater reliability of the OPI: using academic trainee raters, Foreign Language Annals, vol 29, no 2, pp 223-238
Hasselgren, A, 1997, Oral test subskill scores: what they tell us about raters and pupils, in Current Developments and Alternatives in Language Assessment, eds A Huhta, V Kohonen, L Kurki-Suonio and S Luoma, University of Jyväskylä and University of Tampere, Jyväskylä, pp 241-256
Henning, G, 1983, Oral proficiency testing: comparative validities of interview, imitation, and completion methods, Language Learning, vol 33, no 3, pp 315-332
Hughes, A, 1989, Testing for language teachers, Cambridge University Press, Cambridge
Hughes, A, 2003, Testing for language teachers: second edition, Cambridge University Press, Cambridge
Iwashita, N, 1997, The validity of the paired interview format in oral performance testing, paper presented at the 19th Language Testing Research Colloquium, Orlando, Florida
Kormos, J, 1999, Simulating conversations in oral proficiency assessment: a conversation analysis of role plays and non-scripted interviews in language exams, Language Testing, vol 16, no 2, pp 163-188
Kunnan, AJ, 1995, Test-taker characteristics and test performance: a structural modeling approach, UCLES/Cambridge University Press, Cambridge
Larsen-Freeman, D, and Long, MH, 1991, An introduction to second language acquisition research, Longman, London
Lazaraton, A, 1996a, Interlocutor support in oral proficiency interviews: the case of CASE, Language Testing, vol 13, no 2, pp 151-172
Lazaraton, A, 1996b, A qualitative approach to monitoring examiner conduct in the Cambridge Assessment of Spoken English (CASE), in Performance testing, cognition and assessment: selected papers from the 15th Language Testing Research Colloquium, Cambridge and Arnhem, eds M Milanovic and N Saville, UCLES/Cambridge University Press, Cambridge, pp 18-33
Linacre, JM, 2003, FACETS 3.45 computer program, MESA Press, Chicago, IL
Lumley, T, 1998, Perceptions of language-trained raters and occupational experts in a test of occupational English language proficiency, English for Specific Purposes, vol 17, no 4, pp 347-367
Lumley, T and O'Sullivan, B, 2000, The effect of speaker and topic variables on task performance in a tape-mediated assessment of speaking, paper presented at the 2nd Annual Asian Language Assessment Research Forum, The Hong Kong Polytechnic University
Lumley, T and O'Sullivan, B, 2001, The effect of test-taker sex, audience and topic on task performance in tape-mediated assessment of speaking, Melbourne Papers in Language Testing, vol 9, no 1, pp 34-55


Lumley, T and O'Sullivan, B, 2005, The effect of test-taker gender, audience and topic on task performance in tape-mediated assessment of speaking, Language Testing, vol 23, no 4, pp 415-437
Luoma, S, 2004, Assessing speaking, Cambridge University Press, Cambridge
McNamara, T, 1997, 'Interaction' in second language performance assessment: whose performance?, Applied Linguistics, vol 18, pp 446-466
Mehnert, U, 1998, The effects of different lengths of time for planning on second language performance, Studies in Second Language Acquisition, vol 20, pp 83-108
Norris, J, Brown, JD, Hudson, T and Yoshioka, J, 1998, Designing second language performance assessment, Technical Report #18, University of Hawaii Press, Hawaii
O'Loughlin, K, 1995, Lexical density in candidate output on direct and semi-direct versions of an oral proficiency test, Language Testing, vol 12, no 2, pp 217-237
O'Sullivan, B, 1995, Oral language testing: does the age of the interlocutor make a difference?, unpublished MA dissertation, University of Reading
O'Sullivan, B, 2000a, Towards a model of performance in oral language testing, unpublished PhD dissertation, University of Reading
O'Sullivan, B, 2000b, Exploring gender and oral proficiency interview performance, System, vol 28, no 3, pp 373-386
O'Sullivan, B, 2002, Learner acquaintanceship and oral proficiency test pair-task performance, Language Testing, vol 19, no 3, pp 277-295
O'Sullivan, B, and Weir, C, 2002, Research issues in testing spoken language, mimeo: internal research report commissioned by Cambridge ESOL
O'Sullivan, B, Weir, C and ffrench, A, 2001, Task difficulty in testing spoken language: a socio-cognitive perspective, paper presented at the 23rd Language Testing Research Colloquium, St Louis, Miss
O'Sullivan, B, Weir, CJ and Saville, N, 2002, Using observation checklists to validate speaking-test tasks, Language Testing, vol 19, no 1, pp 33-56
Ortega, L, 1999, Planning and focus on form in L2 oral performance, Studies in Second Language Acquisition, vol 21, pp 109-148
Porter, D, 1991, Affective factors in language testing, in Language Testing in the 1990s, eds JC Alderson and B North, Modern English Publications in association with British Council, Macmillan, London, pp 32-40
Porter, D and Shen, SH, 1991, Gender, status and style in the interview, The Dolphin 21, Aarhus University Press, pp 117-128
Purpura, J, 1998, Investigating the effects of strategy use and second language test performance with high- and low-ability test-takers: a structural equation modeling approach, Language Testing, vol 15, no 3, pp 333-379
Robinson, P, 1995, Task complexity and second language narrative discourse, Language Learning, vol 45, no 1, pp 99-140


Ross, S, 1992, Accommodative questions in oral proficiency interviews, Language Testing, vol 9, pp 173-186
Ross, S and Berwick, R, 1992, The discourse of accommodation in oral proficiency interviews, Studies in Second Language Acquisition, vol 14, pp 159-176
Shohamy, E, 1983, The stability of oral language proficiency assessment on the oral interview testing procedure, Language Learning, vol 33, pp 527-540
Shohamy, E, 1994, The validity of direct versus semi-direct oral tests, Language Testing, vol 11, pp 99-123
Shohamy, E, Reves, T and Bejarano, Y, 1986, Introducing a new comprehensive test of oral proficiency, ELT Journal, vol 40, no 3, pp 212-220
Skehan, P, 1996, A framework for the implementation of task-based instruction, Applied Linguistics, vol 17, pp 38-62
Skehan, P, 1998, A cognitive approach to language learning, Oxford University Press, Oxford
Skehan, P and Foster, P, 1997, The influence of planning and post-task activities on accuracy and complexity in task-based learning, Language Teaching Research, vol 1, no 3, pp 185-211
Skehan, P and Foster, P, 1999, The influence of task structure and processing conditions on narrative retellings, Language Learning, vol 49, no 1, pp 93-120
Skehan, P and Foster, P, 2001, Cognition and tasks, in Cognition and second language instruction, ed P Robinson, Cambridge University Press, Cambridge, pp 183-205
Stansfield, CW and Kenyon, DM, 1992, Research on the comparability of the oral proficiency interview and the simulated oral proficiency interview, System, vol 20, pp 347-364
Thompson, I, 1995, A study of interrater reliability of the ACTFL oral proficiency interview in five European languages: data from ESL, French, German, Russian, and Spanish, Foreign Language Annals, vol 28, no 3, pp 407-422
Underhill, N, 1987, Testing spoken language: a handbook of oral testing techniques, Cambridge University Press, Cambridge
Upshur, JA and Turner, C, 1999, Systematic effects in the rating of second-language speaking ability: test method and learner discourse, Language Testing, vol 16, no 1, pp 82-111
Weir, CJ, 1990, Communicative language testing, Prentice Hall International
Weir, CJ, 1993, Understanding and developing language tests, Prentice Hall, London
Weir, CJ, 2005, Language testing and validation: an evidence-based approach, Palgrave, Oxford
Wigglesworth, G, 1997, An investigation of planning time and proficiency level on oral test discourse, Language Testing, vol 14, no 1, pp 85-106
Wigglesworth, G, and O'Loughlin, K, 1993, An investigation into the comparability of direct and semi-direct versions of an oral interaction test in English, Melbourne Papers in Language Testing, vol 2, no 1, pp 56-67


Williams, J, 1992, Planning, discourse marking, and the comprehensibility of international teaching assistants, TESOL Quarterly, vol 26, pp 693-711
Young, R, 1995, Conversational styles in language proficiency interviews, Language Learning, vol 45, no 1, pp 3-42
Young, R, and Milanovic, M, 1992, Discourse variation in oral proficiency interviews, Studies in Second Language Acquisition, vol 14, pp 403-424


APPENDIX 1: TASK DIFFICULTY CHECKLIST (BASED ON SKEHAN, 1998)


For each condition, the gloss explains the scale (the more difficult, the higher the number) and the rater circles one difficulty rating from 1 to 6.

CODE COMPLEXITY
Range of linguistic input: vocabulary and structure as appropriate to ALTE levels 1-5 (beginner to advanced). Difficulty: 1 2 3 4 5 6
Sources of input: number and types of written and spoken input. 1 = one single written or spoken source; 5 = multiple written and spoken sources. Difficulty: 1 2 3 4 5 6
Amount of linguistic input to be processed: quantity of input. 1 = sentence level (single question, prompts); 5 = long text (extended instructions and/or texts). Difficulty: 1 2 3 4 5 6
Availability of input: extent to which information necessary for task completion is readily available to the candidate. 1 = all information provided; 5 = student attempts an open-ended task [student provides all information]. Difficulty: 1 2 3 4 5 6

COGNITIVE COMPLEXITY
Familiarity of information: 1 = the information given and/or required is likely to be within the candidate's experience; 5 = the information given and/or required is likely to be outside the candidate's experience. Difficulty: 1 2 3 4 5 6
Organisation of information required: 1 = almost no organisation required; 5 = extensive organisation required (from a simple answer to a question to a complex response). Difficulty: 1 2 3 4 5 6
As information becomes more abstract: 1 = concrete; 5 = abstract. Difficulty: 1 2 3 4 5 6
Time pressure: 1 = no constraints on time available to complete the task (if the candidate does not complete the task in the time given, he/she is not penalised); 5 = serious constraints on time available to complete the task (if the candidate does not complete the task in the time given, he/she is penalised). Difficulty: 1 2 3 4 5 6
Response level: 1 = more than sufficient time to plan or formulate a response; 5 = no planning time available. Difficulty: 1 2 3 4 5 6
Scale: number of participants in a task, number of relationships involved. 1 = one person; 5 = five or more people. Difficulty: 1 2 3 4 5 6

COMMUNICATIVE DEMAND
Complexity of task outcome: 1 = simple unequivocal outcome; 5 = complex unpredictable outcome. Difficulty: 1 2 3 4 5 6
Referential complexity: 1 = reference to objects and activities which are visible; 5 = reference to external/displaced (not in the here and now) objects and events. Difficulty: 1 2 3 4 5 6
Stakes: 1 = a measure of attainment which is of value only to the candidate; 5 = a measure of attainment which has a high external value. Difficulty: 1 2 3 4 5 6
Degree of reciprocity required: 1 = no requirement of the candidate to initiate, continue or terminate interaction; 5 = task requires each candidate to participate fully in the interaction. Difficulty: 1 2 3 4 5 6
Structured: 1 = task is highly structured/scaffolded; 5 = task is totally unstructured/unscaffolded. Difficulty: 1 2 3 4 5 6
Opportunity for control: 1 = complete autonomy; 5 = no opportunity for control. Difficulty: 1 2 3 4 5 6


APPENDIX 2: READABILITY STATISTICS FOR 9 TASKS


                              Task 1   Task 2   Task 3   Task 4   Task 5   Task 6   Task 7   Task 8   Task 9
Counts
Words                         35       33       36       43       34       35       46       31       38
Characters                    153      142      150      162      169      169      185      146      151
Paragraphs                    1        1        1        1        1        1        1        1        1
Sentences                     6        6        6        6        6        6        6        6        6
Averages
Sentences/Paragraph           6.0      6.0      6.0      6.0      6.0      6.0      6.0      6.0      6.0
Words/Sentence                5.8      5.5      6.0      7.1      5.6      5.8      7.6      5.1      6.3
Characters/Word               4.2      4.0      3.9      3.6      4.7      4.6      3.8      4.5      3.8
Readability
Passive Sentences             0%       0%       0%       0%       0%       0%       0%       0%       0%
Flesch Reading Ease           70.3     80.7     85.5     91.3     59.2     75.2     85.0     65.0     84.6
Flesch-Kincaid Grade Level    4.8      3.3      2.8      2.2      6.4      4.2      3.3      5.4      3.0
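The readability figures above are based on the standard Flesch formulas, which the sketch below applies. Syllable estimation is deliberately left as an input, since tools such as word processors use their own internal syllable heuristics; the example values are invented.

    import re

    def flesch_scores(text, syllables_per_word):
        # Flesch Reading Ease and Flesch-Kincaid Grade Level from the
        # standard published formulas; syllables_per_word must be
        # estimated separately.
        words = re.findall(r"[A-Za-z']+", text)
        sentences = max(1, len(re.findall(r"[.!?]+", text)))
        wps = len(words) / sentences
        fre = 206.835 - 1.015 * wps - 84.6 * syllables_per_word
        fkgl = 0.39 * wps + 11.8 * syllables_per_word - 15.59
        return round(fre, 1), round(fkgl, 1)

    # Example with an assumed rate of 1.3 syllables per word:
    print(flesch_scores("Describe a city you have visited. Say why you liked it.", 1.3))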

APPENDIX 3: THE ORIGINAL SET OF TASKS


You will have to talk about the topic for 2 minutes. You have 1 minute to think about what you are going to say. 1. Describe a city you have visited which has impressed you. You should say: Where it is situated Why you visited it What you liked about it And explain why you prefer it to other cities. 2. Describe a competition (or contest) that you have entered. You should say: When the competition took place What you had to do How well you did it And explain why you entered the competition (or contest). 3. Describe a part-time/holiday job that you have done. You should say: How you got the job What the job involved How long the job lasted And explain why you think you did the job well or badly. 4. Describe a museum, exhibition or art gallery that you have visited. You should say: Where it is What made you decide to go there What you particularly remember about the place And explain why you would or would not recommend it to your friend. 5. Describe an enjoyable event that you experienced when you were at school. You should say: What the event was When it happened What was good about it And explain why you particularly remember this event. 6. Describe a teacher who has influenced you in your education. You should say: Where you met them What subject they taught What was special about them And explain why this person influenced you so much. 7. Describe a film or a TV programme which has made a strong impression on you. You should say: What kind of film or TV programme it was, eg comedy When you saw the film or TV programme What the film or TV programme was about And explain why this film or TV programme made such an impression on you. 8. Describe a memorable event in your life. You should say: When the event took place Where the event took place What happened exactly And why this event was memorable for you. 9. Describe something you own which is very important to you. You should say: Where you got it from How long you have had it What you use it for And explain why it is so important to you.


APPENDIX 4: THE FINAL SET OF TASKS


You will have to talk about the topic for 2 minutes. You have 1 minute to think about what you are going to say.

A. Describe a city you have visited which has impressed you.
You should say:
Where it is situated
Why you visited it
What you liked about it
And explain why you prefer it to other cities.

B. Describe a part-time/holiday job that you have done.
You should say:
How you got the job
What the job involved
How long the job lasted
And explain why you think you did the job well or badly.

C. Describe a sports event that you have been to or seen on TV.
You should say:
What it was
Why you wanted to see it
What was the most exciting or boring part
And explain why it was good or bad.

D. Describe an enjoyable event that you experienced when you were at school.
You should say:
What the event was
When it happened
What was good about it
And explain why you particularly remember this event.

E. Describe a teacher who has influenced you in your education.
You should say:
Where you met them
What subject they taught
What was special about them
And explain why this person influenced you so much.

F. Describe a film or a TV programme which made a strong impression on you.
You should say:
What kind of film or TV programme it was (eg comedy)
When you saw it
What it was about
And explain why it made such an impression on you.

G. Describe a memorable event in your life.
You should say:
When the event took place
Where the event took place
What happened exactly
And why this event was memorable for you.

H. Describe something you own which is very important to you.
You should say:
Where you got it from
How long you have had it
What you use it for
And explain why it is so important to you.


APPENDIX 5: SPSS ONE-WAY ANOVA OUTPUT


Multiple Comparisons
Dependent Variable: TOTAL
Bonferroni

(I) TASK  (J) TASK  Mean Difference (I-J)  Std. Error   Sig.   95% CI Lower  95% CI Upper
Task A    Task B       .3622               .22786      1.000     -.3591        1.0835
Task A    Task C      -.0185               .22570      1.000     -.7330         .6959
Task A    Task D       .3824               .22368      1.000     -.3256        1.0905
Task A    Task E       .4487               .22786      1.000     -.2726        1.1700
Task A    Task F       .6891               .22786       .079     -.0322        1.4104
Task A    Task G       .9103*              .22786       .003      .1890        1.6315
Task A    Task H       .7853*              .22786       .019      .0640        1.5065
Task B    Task A      -.3622               .22786      1.000    -1.0835         .3591
Task B    Task C      -.3807               .22786      1.000    -1.1020         .3406
Task B    Task D       .0203               .22586      1.000     -.6947         .7352
Task B    Task E       .0865               .23000      1.000     -.6415         .8146
Task B    Task F       .3269               .23000      1.000     -.4011        1.0550
Task B    Task G       .5481               .23000       .507     -.1800        1.2761
Task B    Task H       .4231               .23000      1.000     -.3050        1.1511
Task C    Task A       .0185               .22570      1.000     -.6959         .7330
Task C    Task B       .3807               .22786      1.000     -.3406        1.1020
Task C    Task D       .4010               .22368      1.000     -.3071        1.1090
Task C    Task E       .4672               .22786      1.000     -.2540        1.1885
Task C    Task F       .7076               .22786       .061     -.0137        1.4289
Task C    Task G       .9288*              .22786       .002      .2075        1.6501
Task C    Task H       .8038*              .22786       .015      .0825        1.5251
Task D    Task A      -.3824               .22368      1.000    -1.0905         .3256
Task D    Task B      -.0203               .22586      1.000     -.7352         .6947
Task D    Task C      -.4010               .22368      1.000    -1.1090         .3071
Task D    Task E       .0663               .22586      1.000     -.6487         .7812
Task D    Task F       .3067               .22586      1.000     -.4083        1.0216
Task D    Task G       .5278               .22586       .572     -.1871        1.2428
Task D    Task H       .4028               .22586      1.000     -.3121        1.1178
Task E    Task A      -.4487               .22786      1.000    -1.1700         .2726
Task E    Task B      -.0865               .23000      1.000     -.8146         .6415
Task E    Task C      -.4672               .22786      1.000    -1.1885         .2540
Task E    Task D      -.0663               .22586      1.000     -.7812         .6487
Task E    Task F       .2404               .23000      1.000     -.4877         .9684
Task E    Task G       .4615               .23000      1.000     -.2665        1.1896
Task E    Task H       .3365               .23000      1.000     -.3915        1.0646
Task F    Task A      -.6891               .22786       .079    -1.4104         .0322
Task F    Task B      -.3269               .23000      1.000    -1.0550         .4011
Task F    Task C      -.7076               .22786       .061    -1.4289         .0137
Task F    Task D      -.3067               .22586      1.000    -1.0216         .4083
Task F    Task E      -.2404               .23000      1.000     -.9684         .4877
Task F    Task G       .2212               .23000      1.000     -.5069         .9492
Task F    Task H       .0962               .23000      1.000     -.6319         .8242
Task G    Task A      -.9103*              .22786       .003    -1.6315        -.1890
Task G    Task B      -.5481               .23000       .507    -1.2761         .1800
Task G    Task C      -.9288*              .22786       .002    -1.6501        -.2075
Task G    Task D      -.5278               .22586       .572    -1.2428         .1871
Task G    Task E      -.4615               .23000      1.000    -1.1896         .2665
Task G    Task F      -.2212               .23000      1.000     -.9492         .5069
Task G    Task H      -.1250               .23000      1.000     -.8531         .6031
Task H    Task A      -.7853*              .22786       .019    -1.5065        -.0640
Task H    Task B      -.4231               .23000      1.000    -1.1511         .3050
Task H    Task C      -.8038*              .22786       .015    -1.5251        -.0825
Task H    Task D      -.4028               .22586      1.000    -1.1178         .3121
Task H    Task E      -.3365               .23000      1.000    -1.0646         .3915
Task H    Task F      -.0962               .23000      1.000     -.8242         .6319
Task H    Task G       .1250               .23000      1.000     -.6031         .8531

*. The mean difference is significant at the .05 level.
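The analysis above can be approximated outside SPSS. The following is a minimal sketch in Python (NumPy/SciPy), not the authors' procedure: the per-candidate TOTAL scores for Tasks A-H are not reproduced in this report, so the score arrays below are invented placeholders. Bonferroni correction multiplies each raw pairwise p-value by the number of comparisons (8 choose 2 = 28) and caps it at 1, which is why so many Sig. entries above read 1.000; note also that SPSS pools the error term across all eight groups, so its adjusted values will differ slightly from the simple pairwise t-tests used here.

    # Sketch: one-way ANOVA with Bonferroni-adjusted pairwise comparisons,
    # analogous to the SPSS output above. The score arrays are placeholders:
    # per-candidate TOTAL scores for Tasks A-H are not given in this report.
    from itertools import combinations

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    # Hypothetical TOTAL scores for each task (30 candidates per task).
    tasks = {t: rng.normal(loc=5.0, scale=1.0, size=30) for t in "ABCDEFGH"}

    # Omnibus one-way ANOVA across the eight tasks.
    f_stat, p_omnibus = stats.f_oneway(*tasks.values())
    print(f"F = {f_stat:.3f}, p = {p_omnibus:.3f}")

    # Pairwise t-tests with Bonferroni correction: each raw p-value is
    # multiplied by the number of comparisons (C(8,2) = 28) and capped at 1.
    pairs = list(combinations(tasks, 2))
    for a, b in pairs:
        t_stat, p_raw = stats.ttest_ind(tasks[a], tasks[b])
        p_adj = min(p_raw * len(pairs), 1.0)
        flag = "*" if p_adj < 0.05 else ""
        print(f"Task {a} vs Task {b}: "
              f"diff = {tasks[a].mean() - tasks[b].mean():+.4f}, "
              f"adjusted p = {p_adj:.3f}{flag}")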


APPENDIX 6: QUESTIONNAIRE ABOUT TASK 1


For each of the items below, circle the number that REFLECTS YOUR VIEWPOINT on a five point scale.

1. The vocabulary in the task prompts was:
   Very easy  1   2   3   4   5  Very difficult

2. The grammatical structures in the task prompts were:
   Very easy  1   2   3   4   5  Very difficult

3. The topic of the task was:
   Very familiar  1   2   3   4   5  Very unfamiliar

4. The information given in the task was:
   Very concrete  1   2   3   4   5  Very abstract

5. The planning time to complete (prepare for) the task was:
   Too long  1   2   3 (appropriate)   4   5  Too short

6. Time to complete the task was:
   Too long  1   2   3 (appropriate)   4   5  Too short

7. How much information did you use from the 4 short prompts provided in the task?
   1 = I used 100% of the information provided in the task
   2 = I used 75% of the information provided in the task
   3 = I used 50% of the information provided in the task
   4 = I used 25% of the information provided in the task
   5 = I did not use any information in the task at all

8. How did you use notes while you were speaking?
   1 = I read my notes aloud.
   2 = I referred to my notes line by line and looked up to speak.
   3 = I referred to my notes when I needed to.
   4 = I prepared notes, but I did not use them.
   5 = I did not take notes.

Thank you very much for your cooperation.
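Items like those above yield ordinal five-point data, which is usually summarised as per-item means and response distributions. The following is a minimal sketch (Python with pandas assumed; the column names and the response matrix are invented for illustration and are not data from this study):

    # Sketch: summarising five-point questionnaire responses per item.
    # The response matrix is invented for illustration; it is not data
    # from this study, and the column names are hypothetical.
    import pandas as pd

    responses = pd.DataFrame({
        "q1_vocabulary":  [2, 3, 3, 4, 2],
        "q2_grammar":     [3, 3, 2, 4, 3],
        "q3_topic":       [1, 2, 2, 3, 1],
        "q4_information": [2, 2, 3, 3, 2],
    })

    # Mean rating per item (1 = very easy/familiar, 5 = very difficult/abstract).
    print(responses.mean().round(2))

    # Distribution per item: counts of each scale point 1-5.
    for item in responses.columns:
        counts = responses[item].value_counts().reindex(range(1, 6), fill_value=0)
        print(item, counts.to_list())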


APPENDIX 7: QUESTIONNAIRE - UNCHANGED AND REDUCED TIME VERSIONS
For students responding to the unchanged versions and to the reduced response time versions

For each of the items below, circle the number that reflects your viewpoint on the five-point scale.

What I thought of or did before I started
(circle one for each item: 1 = strongly disagree, 2 = disagree, 3 = no view, 4 = agree, 5 = strongly agree)

1. I read the task very carefully to understand what was required.
2. I thought of HOW to deliver my speech in order to respond well to the topic.
3. I thought of HOW to satisfy the audiences and examiners.
4. I understood the instructions for this speaking test completely.
5. I had ENOUGH ideas to speak about this topic.
6. I felt it was easy to produce enough ideas for the speech from memory.
7. I know A LOT about this type of speech, i.e., I know how to make a speech on this type of topic.
8. I know A LOT about other types of speaking test, e.g., interview, discussion.

What I thought of or did in planning stage
(items 1-10 and 13-16: circle one from the same five-point scale; items 11 and 12: circle 1. Yes or 2. No)

1. I thought of MOST of my ideas for the speech BEFORE planning an outline.
2. During the period allowed for planning, I was conscious of the time.
3. I followed the 3 short prompts provided in the task when I was planning.
4. The information in the short prompts provided was necessary for me to complete the task.
5. I wrote down the points I wanted to make based on the 3 short prompts provided in the task.
6. I wrote down the words and expressions I needed to fulfil the task.
7. I wrote down the structures I needed to fulfil the task.
8. I took notes only in ENGLISH.
9. I took notes only in my own language.
10. I took notes in both ENGLISH and my own language.
11. I planned an outline on paper BEFORE starting to speak. (Yes / No)
12. I planned an outline in my mind BEFORE starting to speak. (Yes / No)
13. Ideas occurring to me at the beginning tended to be COMPLETE.
14. I was able to put my ideas or content in good order.
15. I practiced the speech in my mind WHILE I was planning.
16. After finishing my planning, I practiced what I was going to say in my mind until it was time to start.


What I thought of or did while I was speaking
(circle one for each item: 1 = strongly disagree, 2 = disagree, 3 = no view, 4 = agree, 5 = strongly agree)

1. I felt it was easy to put ideas in good order.
2. I was able to express my ideas using suitable words.
3. I was able to express my ideas using correct grammar.
4. I thought of MOST of my ideas for the speech WHILE I was speaking.
5. WHILE I was speaking, I did not use some ideas that I had planned.
6. I was able to put sentences in logical order.
7. I was able to CONNECT my ideas smoothly in the whole speech.
8. I was conscious of the time WHILE I was making this speech.
9. I tried to finish speaking within the time.
10. I was listening and checking the correctness of the contents and their order WHILE I was making this speech.
11. I was listening and checking whether the contents and their order fit the topic WHILE I was making this speech.
12. I was listening and checking the correctness of sentences WHILE I was making this speech.
13. I was listening and checking whether the words fit the topic WHILE I was making this speech.
14. I felt it was easy to complete the task.
15. Comments on the above items:

Thank you for completing this questionnaire


APPENDIX 8: QUESTIONNAIRE - NO PLANNING VERSION
For students responding to the no planning versions

For each of the items below, circle the number that reflects your viewpoint on the five-point scale.

What I thought of or did before I started
(circle one for each item: 1 = strongly disagree, 2 = disagree, 3 = no view, 4 = agree, 5 = strongly agree)

1. I read the task very carefully to understand what was required.
2. I thought of HOW to deliver my speech in order to respond well to the topic.
3. I thought of HOW to satisfy the audiences and examiners.
4. I understood the instructions for this speaking test completely.
5. I had ENOUGH ideas to speak about this topic.
6. I felt it was easy to produce enough ideas for the speech from memory.
7. I know A LOT about this type of speech, i.e., I know how to make a speech on this type of topic.
8. I know A LOT about other types of speaking test, e.g., interview, discussion.

What I thought of or did while I was speaking
(circle one for each item from the same five-point scale)

1. I felt it was easy to put ideas in good order.
2. I was able to express my ideas using suitable words.
3. I was able to express my ideas using correct grammar.
4. I thought of MOST of my ideas for the speech WHILE I was speaking.
5. WHILE I was speaking, I did not use some ideas that I had planned.
6. I was able to put sentences in logical order.
7. I was able to CONNECT my ideas smoothly in the whole speech.
8. I was conscious of the time WHILE I was making this speech.
9. I tried to finish speaking within the time.
10. I was listening and checking the correctness of the contents and their order WHILE I was making this speech.
11. I was listening and checking whether the contents and their order fit the topic WHILE I was making this speech.
12. I was listening and checking the correctness of sentences WHILE I was making this speech.
13. I was listening and checking whether the words fit the topic WHILE I was making this speech.
14. I felt it was easy to complete the task.
15. Comments on the above items:

Thank you for completing this questionnaire

APPENDIX 9: QUESTIONNAIRE - UNSCAFFOLDED VERSIONS
For students responding to the unscaffolded versions

For each of the items below, circle the number that reflects your viewpoint on the five-point scale.


What I thought of or did before I started
(circle one for each item: 1 = strongly disagree, 2 = disagree, 3 = no view, 4 = agree, 5 = strongly agree)

1. I read the task very carefully to understand what was required.
2. I thought of HOW to deliver my speech in order to respond well to the topic.
3. I thought of HOW to satisfy the audiences and examiners.
4. I understood the instructions for this speaking test completely.
5. I had ENOUGH ideas to speak about this topic.
6. I felt it was easy to produce enough ideas for the speech from memory.
7. I know A LOT about this type of speech, i.e., I know how to make a speech on this type of topic.
8. I know A LOT about other types of speaking test, e.g., interview, discussion.

What I thought of or did in planning stage
(items 1-10 and 13-16: circle one from the same five-point scale; items 11 and 12: circle 1. Yes or 2. No)

1. I thought of MOST of my ideas for the speech BEFORE planning an outline.
2. During the period allowed for planning, I was conscious of the time.
3. I followed the 3 short prompts provided in the task when I was planning.
4. The information in the short prompts provided was necessary for me to complete the task.
5. I wrote down the points I wanted to make based on the 3 short prompts provided in the task.
6. I wrote down the words and expressions I needed to fulfil the task.
7. I wrote down the structures I needed to fulfil the task.
8. I took notes only in ENGLISH.
9. I took notes only in my own language.
10. I took notes in both ENGLISH and my own language.
11. I planned an outline on paper BEFORE starting to speak. (Yes / No)
12. I planned an outline in my mind BEFORE starting to speak. (Yes / No)
13. Ideas occurring to me at the beginning tended to be COMPLETE.
14. I was able to put my ideas or content in good order.
15. I practiced the speech in my mind WHILE I was planning.
16. After finishing my planning, I practiced what I was going to say in my mind until it was time to start.


What I thought of or did while I was speaking
(circle one for each item: 1 = strongly disagree, 2 = disagree, 3 = no view, 4 = agree, 5 = strongly agree)

1. I felt it was easy to put ideas in good order.
2. I was able to express my ideas using suitable words.
3. I was able to express my ideas using correct grammar.
4. I thought of MOST of my ideas for the speech WHILE I was speaking.
5. WHILE I was speaking, I did not use some ideas that I had planned.
6. I was able to put sentences in logical order.
7. I was able to CONNECT my ideas smoothly in the whole speech.
8. I was conscious of the time WHILE I was making this speech.
9. I tried to finish speaking within the time.
10. I was listening and checking the correctness of the contents and their order WHILE I was making this speech.
11. I was listening and checking whether the contents and their order fit the topic WHILE I was making this speech.
12. I was listening and checking the correctness of sentences WHILE I was making this speech.
13. I was listening and checking whether the words fit the topic WHILE I was making this speech.
14. I felt it was easy to complete the task.
15. Comments on the above items:

Thank you for completing this questionnaire

6. The interactional organisation of the IELTS Speaking Test


Authors
Paul Seedhouse, University of Newcastle upon Tyne, UK
Maria Egbert, University of Southern Denmark, Denmark

Grant awarded: Round 10, 2004

This report describes the interactional organisation of the IELTS Speaking Test in terms of turn-taking, sequence and repair.

ABSTRACT

This study is based on the analysis of transcripts of 137 audio-recorded tests using a Conversation Analysis (CA) methodology. The institutional aim of standardisation in relation to assessment is shown to be the key principle underlying the organisation of the interaction. Overall, the vast majority of examiners conform to the instructions; in cases where they do not do so, they often give an advantage to some candidates. The overall organisation of the interaction is highly constrained, although there are some differences in the different parts of the test. The organisation of repair has a number of distinctive characteristics in that it is conducted according to strictly specified rules, in which the examiners have been briefed and trained. Speaking Test interaction is an institutional variety of interaction with three sub-varieties. It is very different from ordinary conversation, and has some similarities with some sub-varieties of L2 classroom interaction and with interaction in universities. A number of recommendations are made in relation to examiner training, instructions and test design.


AUTHOR BIODATA

PAUL SEEDHOUSE
Dr Paul Seedhouse is Reader in Educational and Applied Linguistics in the School of Education, Communication and Language Sciences at the University of Newcastle upon Tyne, UK, where he is also Postgraduate Research Director. Following a teaching career in which he taught ESOL, German and French in five different countries, he has published widely in journals of applied linguistics, language teaching and pragmatics. His monograph, The Interactional Architecture of the Language Classroom: A CA Perspective, was published by Blackwell in 2004 and won the 25th annual Kenneth W Mildenberger Prize of the Modern Language Association of America in 2005. He has also edited (with Keith Richards) the collection Applying Conversation Analysis, published by Palgrave Macmillan in 2005.

MARIA EGBERT
Maria Egbert, PhD (University of California, Los Angeles), is Associate Professor at the Institute of Business Communication and Information Science at the University of Southern Denmark. She has taught conversation analysis, applied linguistics and German at the University of Texas at Austin, the University of Oldenburg, the University of Jyväskylä and, most recently, the University of Southern Denmark. Her research focuses on conversational repair, interculturality and affiliation.


CONTENTS
1 Introduction
2 Research design
  2.1 Background information on the IELTS Speaking Test
  2.2 The study
  2.3 Methodology
  2.4 Data
  2.5 Sampling
  2.6 Relationship to existing research literature
3 Data analysis
  3.1 Trouble and repair
    3.1.1 Repair initiation
    3.1.2 Repetition of questions
    3.1.3 Lack of uptake to the prompt
    3.1.4 Vocabulary
  3.2 Turn-taking and sequence
    3.2.1 The introduction section
    3.2.2 Transition between parts of the test and questions
    3.2.3 Evaluation
  3.3 Topic
    3.3.1 Topic disjunction
    3.3.2 Recipient design and rounding-off questions
4 Answers to research questions
5 Conclusion
  5.1 Implications and recommendations: test design / examiner training
  5.2 Suggestions for further research
References
Appendix 1: Transcript conventions
Appendix 2: Low test score of Band 3.0
Appendix 3: High test score of Band 9.0


1 INTRODUCTION

This report presents the results of a qualitative study of the IELTS Speaking Test, which is the most widely used English proficiency test for overseas applicants to British universities. The Speaking Test is designed to assess how effectively candidates can communicate in English. About 4,000 certified examiners administer well over 500,000 IELTS tests annually at over 300 centres in around 120 countries around the world.

Based on a selection of 137 transcribed oral proficiency interviews, this study analyses the internal organisation of this institutional variety of interaction in terms of examiner-candidate talk. In particular, the interactional structures are investigated in the areas of trouble and repair, turn-taking and sequence, and topic development. The analysis also focuses on how examiners put instructions from the training documents into practice, and how institutional constraints may shape learners' speech behaviour. Since the Speaking Test is taken to predict how well candidates will communicate in a university setting, it is important to understand what kind of interaction is generated in the test and its relationship to interaction in the target setting.

In the next section of this report (Part 2), a background description of the Speaking Test is provided, together with a presentation of the research design. The ensuing presentation of the analytic results focuses on brief answers to the research questions (Part 3). A more detailed qualitative data analysis with displays of exemplary transcript excerpts follows in Part 4. The conclusion (Part 5) raises applied issues for test design and examiner training, and develops implications for future research.

2 RESEARCH DESIGN

2.1 Background information on the IELTS Speaking Test

IELTS Speaking Tests are encounters between one candidate and one examiner and are designed to take between 11 and 14 minutes. There are three main parts, each of which fulfils a specific function in terms of interaction pattern, task input and candidate output. These are now described as a backdrop for the analysis.

In Part 1 (Introduction), candidates answer general questions about themselves, their homes/families, their jobs/studies, their interests, and a range of familiar topic areas. Examiners introduce themselves and confirm candidates' identity, then interview candidates using verbal questions selected from familiar topic frames. This part lasts between four and five minutes.

In Part 2 (Individual long turn), the candidate is given a verbal prompt on a card and is asked to talk on a particular topic. The candidate has one minute to prepare before speaking at length, for between one and two minutes. The examiner then asks one or two rounding-off questions.

In Part 3 (Two-way discussion), the examiner and candidate engage in a discussion of more abstract issues and concepts which are thematically linked to the topic prompt in Part 2.

Examiners receive detailed directives in order to maximise test reliability and validity. The most relevant and important instructions to examiners are as follows: "Standardisation plays a crucial role in the successful management of the IELTS Speaking Test." (Instructions to IELTS Examiners, pp11); "The IELTS Speaking Test involves the use of an examiner frame which is a script that must be followed (original emphasis) ... Stick to the rubrics - do not deviate in any way ... If asked to repeat rubrics, do not rephrase in any way ... Do not make any unsolicited comments or offer comments on performance." (IELTS Examiner Training Material 2001, pp5)

The degree of control over the phrasing differs in the three parts of the test as follows:

"The wording of the frame is carefully controlled in Parts 1 and 2 of the Speaking Test to ensure that all candidates receive similar input delivered in the same manner. In Part 3, the frame is less controlled so that the examiner's language can be accommodated to the level of the candidate being examined. In all parts of the test, examiners are asked to follow the frame in delivering the script ... Examiners should refrain from making unscripted comments or asides." (Instructions to IELTS Examiners, pp5)

Research has shown that the speech functions which occur regularly in a candidate's output during the Speaking Test are: providing personal information; expressing a preference; providing non-personal information; comparing; expressing opinions; summarising; explaining; conversation repair; suggesting; contrasting; justifying opinions; narrating and paraphrasing; speculating; and analysing. Other speech functions may emerge during the test, but they are not forced by the test structure (Taylor, 2001a).

Detailed performance descriptors have been developed which describe spoken performance at the nine IELTS bands, based on the following criteria; scores are reported as whole bands only.

Fluency and coherence refers to the ability to talk with normal levels of continuity, rate and effort and to link ideas and language together to form coherent, connected speech. The key indicators of fluency are speech rate and speech continuity. The key indicators of coherence are logical sequencing of sentences, clear marking of stages in a discussion, narration or argument, and the use of cohesive devices (eg connectors, pronouns and conjunctions) within and between sentences.

Lexical resource refers to the range of vocabulary the candidate can use and the precision with which meanings and attitudes can be expressed. The key indicators are the variety of words used, the adequacy and appropriacy of the words used, and the ability to circumlocute (get round a vocabulary gap by using other words) with or without noticeable hesitation.

Grammatical range and accuracy refers to the range and the accurate and appropriate use of the candidate's grammatical resource. The key indicators of grammatical range are the length and complexity of the spoken sentences, the appropriate use of subordinate clauses, the variety of sentence structures, and the ability to move elements around for information focus. The key indicators of grammatical accuracy are the number of grammatical errors in a given amount of speech and the communicative effect of error.

Pronunciation refers to the capacity to produce comprehensible speech in fulfilling the Speaking Test requirements. The key indicators will be the amount of strain caused to the listener, the amount of unintelligible speech and the noticeability of L1 influence. (IELTS Handbook 2005, pp11)

2.2 The study

The overall aim is to uncover the interactional organisation of the IELTS Speaking Test as it is collaboratively produced in its three parts. In this section, we present the research questions, methodology, data, sampling and the relation to existing literature. Sub-questions are as follows:

1. How and why does interactional trouble arise and how is it repaired by the interactants? What types of repair initiation are used by examiners and examinees and how are these responded to? What role does repetition play?
2. What is the organisation of turn-taking and sequence?
3. What is the relationship between Speaking Test interaction and other speech exchange systems such as ordinary conversation, L2 classroom interaction, and interaction in universities?
4. What is the relationship between examiner interaction and candidate performance?
5. To what extent do examiners follow the briefs they have been given?


6. In cases where examiners diverge from briefs, what impact does this have on the interaction?
7. How are tasks implemented? What is the relationship between the intended tasks and the implemented tasks, between the task-as-workplan and the task-in-process?
8. How is the organisation of the interaction related to the institutional goal and participants' orientations?
9. How are the roles of examiner and examinee, the participation framework and the focus of the interaction established?
10. How long do tests last in practice and how much time is given for preparation in Part 2?

Language proficiency interviews in general are intended to assess the language proficiency of non-native speakers and to predict their ability to communicate in future encounters. IELTS is designed to assess the language ability of candidates who need to study or work where English is used as the language of communication (www.ielts.org.handbook.htm). The Speaking Test aims to evaluate how well a language learner might function in a target context, often an academic one. The IELTS Speaking Test is predominantly used to assess and predict whether a candidate has the ability to communicate effectively on programmes in English-speaking universities.

Hypothetically, interaction in oral proficiency interviews could be characterised in a number of ways, including similarities and differences with other speech exchange systems such as ordinary conversation, L2 classroom interaction, task-based interaction, academic interaction, interviews and tests. This project aims to determine the endogenous organisation of the Speaking Test and its relationship to some of these other systems. Because the Speaking Test (with its own interactional organisation) evaluates learners' ability to function in future in other speech exchange systems, each with their own interactional organisation, the proposed research should be of interest to the following parties: fellow researchers in language testing; designers of the IELTS Speaking Test and other similar tests; IELTS examiners; and teachers preparing students for the Speaking Test.

It is argued that making the interactional organisation of the Speaking Test explicit may help to ensure comparability of challenge to candidates from different cultural backgrounds. The question of how and why interactional trouble arises and how it is repaired by the interactants should be of interest to all those taking part, and designers of test items would be interested in how the items are actually implemented in practice. Seedhouse (2004) suggests that the organisation of repair in L2 classrooms is reflexively related to the pedagogical focus. This study will investigate when repair occurs, how it is organised in the Speaking Test and what the relationship is between the organisation of repair and the institutional goal. The research, then, intends to provide empirical insights and raise awareness which can then feed into all areas of test development and training.

2.3 Methodology

The methodology employed is Conversation Analysis (CA) (Drew & Heritage, 1992a; Lazaraton, 2002; Sacks, Schegloff & Jefferson, 1974; Seedhouse, 2004). Studies of institutional interaction have focussed on how the organisation of the interaction is related to the institutional aim and on the ways in which this organisation differs from the benchmark of free conversation.
Heritage (1997) proposes six basic places to probe the institutionality of interaction, namely:
- turn-taking organisation
- overall structural organisation of the interaction
- sequence organisation
- turn design
- lexical choice
- epistemological and other forms of asymmetry.

He also proposes four different kinds of asymmetry in institutional talk:
- asymmetries of participation, eg the professional asking questions to the lay client
- asymmetries of interactional and institutional know-how, eg professionals being used to the type of interaction, agenda and typical course of an interview, in contrast to the lay client
- epistemological caution and asymmetries of knowledge, eg professionals often avoiding taking a firm position
- rights of access to knowledge, particularly professional knowledge.

Interactional asymmetry and roles in LPIs are controversial issues (Taylor, 2001c) and Speaking Test data are examined with the above issues in mind. Perhaps the most important analytical consideration is that institutional talk displays goal orientation and rational organisation. In contrast to conversation, participants in institutional interaction orient to "some core goal, task or identity (or set of them) conventionally associated with the institution in question" (Drew & Heritage, 1992b, pp22). CA institutional discourse methodology attempts to relate not only the overall organisation of the interaction but also individual interactional devices to the core institutional goal. CA attempts, then, to understand the organisation of the interaction as being rationally derived from the core institutional goal. Levinson sees the structural elements of institutional talk as:

"Rationally and functionally adapted to the point or goal of the activity in question, that is the function or functions that members of the society see the activity as having. By taking this perspective it seems that in most cases apparently ad hoc and elaborate arrangements and constraints of very various sorts can be seen to follow from a few basic principles, in particular rational organisation around a dominant goal." (Levinson, 1992, pp71)

Seedhouse (2004) describes the overall interactional organisation of the L2 classroom, identifying the institutional goal as well as the interactional properties which derive directly from the goal. He also identifies the basic sequence organisation of L2 classroom interaction and exemplifies how the institution of the L2 classroom is talked in and out of being by participants. Seedhouse demonstrates that, although L2 classroom interaction is extremely diverse, heterogeneous, fluid and complex, it is nonetheless possible to describe its interactional architecture. In the case of Speaking Test interaction, we will see that there is considerably less diversity and heterogeneity than in L2 classrooms because of the restrictions of the test format and the use of similar tasks for all participants.

Language proficiency interviews (LPIs) differ from other types of institutional interaction in one respect. Normally, the institutional business is achieved via the content of the talk, whereas in the LPI the content of the talk is not central. The responses are required to be accurate and relevant to the questions, but the examiner does not have to employ the responses to further the institutional business; language is used for display rather than communication. (The authors are grateful to G Thompson for this and other comments.)

In this study, we employ Richards and Seedhouse's (2005) model of description leading to informed action in relation to applications of CA. We link the description of the interaction to the institutional goals and provide proposals for informed action based on our analysis of the data.
2.4 Data

The analysis of naturalistic data, one of the basic premises of CA research, allows a direct and authentic examination of the interactants' conduct. Therefore, the primary raw data consist of audio recordings in cassette format of operational IELTS Speaking Tests. All IELTS Speaking Tests are routinely recorded for monitoring and quality assurance purposes; in addition, a selection of these is entered into an IELTS Speaking Test Corpus which is used for research purposes and currently contains several thousand test performances. The data set for this study was drawn from recordings of live tests conducted during 2003.

Secondary data included paper materials relevant to the Speaking Tests recorded on cassette, including examiners' briefs, marking criteria, and examiner induction, training, standardisation and certification packs (Taylor, 2001b). These data were helpful in establishing the institutional goal of the interaction and the institutional orientations of the examiners.

The primary raw data (137 Speaking Tests) were transcribed using CA transcription conventions (Appendix 1) by postgraduate research students at the University of Newcastle, using the existing transcription equipment in the School of Education, Communication and Language Sciences. The resultant transcripts were produced in paper and electronic format and are copyright of Cambridge ESOL, one of the IELTS partners. All personal references have been anonymised.

2.5 Sampling

The IELTS Speaking Test Corpus contains over 2,500 recordings of tests conducted during 2003; the researchers selected an initial sample of 300 cassettes and then transcribed 137 of these. The aim of the sampling was to ensure variety in the transcripts in terms of gender, region of the world, task/topic number and Speaking Test band score. The test centre countries covered by the transcribed tests are: Albania, Brazil, Cameroon, United Kingdom, Greece, Indonesia, India, Iran, Jamaica, Lebanon, Mozambique, Netherlands, Norway, New Zealand, Oman, Pakistan, Syria, Vietnam and Zimbabwe. However, we do not have data on individual candidate nationality and ethnicity, and it should be borne in mind that in, for example, the data from the UK, a wide range of nationalities and ethnic backgrounds are covered. We do not have any data on the first languages of candidates. Overall test scores covered by the transcribed sample range from band 9.0 to band 3.0 on the IELTS Speaking Module. Two tasks among the many used for the test were selected for transcription; this enabled easy location of audio cassettes whilst at the same time ensuring diversity of task.

The way in which sampling was conducted is as follows. Cambridge ESOL has written information on the above variables in relation to their corpus of IELTS Speaking Tests. The researchers first examined the information available in consultation with Cambridge ESOL and then requested a set of 300 cassettes which covered the range of variables, namely gender, region of the world, task/topic number and Speaking Test band score. A certain number of these cassettes were not usable due to poor sound quality or inadequate labelling. From the researchers' perspective, the aim was to produce a description of the interactional architecture of the Speaking Test which was able to account for all of the data, regardless of variables relating to particular candidates. The description will tend to have more credibility if the data sampled cover a wide range of variables.

2.6 Relationship to existing research literature

The research builds on existing research in two areas. Firstly, it builds on existing research done specifically on the IELTS Speaking Test and on language proficiency interviews in general. Secondly, it builds on existing CA research into language proficiency interviews in particular, into institutional talk (Drew & Heritage, 1992a) and into applications of CA (Richards & Seedhouse, 2005). Taylor (2000) identifies the nature of the candidate's spoken discourse and the language and behaviour of the oral examiner as issues of current research interest.
Wigglesworth (2001:206) suggests that "In oral assessments, close attention needs to be paid, not only to possible variables which can be incorporated or not into the task, but also to the role of the interlocutor in ensuring that learners obtain similar input across similar tasks." Brown & Hill (1998) examine the relationship between the interactional style of the interviewer and candidate performance, with "easier" interviewers shifting topics frequently and asking simpler questions, while more "difficult" interviewers used interruption, disagreement and challenging questions. This study builds on this work by examining, through a sizeable dataset, the relationship between the interactional style of the interviewer and candidate performance.

Previous CA-informed work in the area of oral proficiency interviews by Young and He (1998) and Lazaraton (1997) examined the American Language Proficiency Interview (LPI).

Egbert points out that LPIs are implemented "in imitation of natural conversation in order to evaluate a learner's conversational proficiency" (Egbert, 1998:147). Young and He's collection demonstrates, however, a number of clear differences between LPIs and ordinary conversation. Firstly, the systems of turn-taking and repair differ from ordinary conversation. Secondly, LPIs are examples of goal-oriented institutional discourse, in contrast to ordinary conversation. Thirdly, LPIs constitute cross-cultural communication in which the participants may have very different understandings of the nature and purpose of the interaction.

Egbert's (1998) study demonstrates that interviewers explain to students not only the organisation of repair they should use, but also the forms they should use to do so; the suggested forms are cumbersome and differ from those found in ordinary conversation. He's (1998) microanalysis reveals how a student's failure in an LPI is due to interactional as well as linguistic problems. Kasper and Ross (2001:10) point out that their CA analysis of LPIs portrays candidates as "eminently skilful interlocutors", which contrasts with the general SLA view that clarification and confirmation checks are indices of NNS incompetence, while their (2003) paper analyses how repetition can be a source of miscommunication in LPIs. In the context of course placement interviews, Lazaraton (1997) notes that students initiated a particular sequence, namely self-deprecations of their English language ability. She further suggests that a student providing a demonstration of poor English language ability constitutes grounds for acceptance onto courses. Interactional sequences are therefore linked to participant orientations and goals. Lazaraton (2002) presents a CA approach to the validation of LPIs, and her framework should enable findings from this research to feed into future decision-making in relation to the Speaking Test.

3 DATA ANALYSIS

We now move on from the summary answers to examine in more detail a number of themes which emerged from our more detailed qualitative analysis of the data. In particular, we show the interview-specific structures of (1) trouble and repair, including repair initiation and repetition as the repair operation, (2) turn-taking and sequence, with a special focus on the (lack of) transitions between test parts and question sequences, and (3) topic development, with disjunction being related to abrupt sequencing. Other issues arising in the data are addressed in terms of vocabulary, evaluation, answering the question, and introducing the interview (4). Two themes which arise frequently are interactional problems caused by examiners deviating from instructions and problems issuing from the design of the test itself. In this part of the report, excerpts from transcripts serve to exemplify the findings. Please note that two complete transcripts are available in Appendices 2 and 3 for further review.

3.1 Trouble and repair

Repair is the mechanism by which interactants address and resolve trouble in speaking, hearing and understanding (Schegloff, Jefferson & Sacks, 1977). Trouble is anything which the participants display as impeding speech production or intersubjectivity; a repairable item is one which constitutes such trouble for the participants. Any element of talk may in principle be the focus of repair, even an element which is well-formed, propositionally correct and appropriate. Schegloff, Jefferson & Sacks (1977:363) point out that "nothing is, in principle, excludable from the class 'repairable'". Repair, trouble and repairable items are participants' constructs, for use how and when participants find appropriate. Their use may be related to institutional constraints, however. In courtroom cross-examination of a witness by an opposing lawyer, for example, a failure by the witness to answer questions with yes or no may constitute trouble within that institutional setting (Drew, 1992). Such a failure is therefore repairable (for example by the lawyer and/or judge insisting on a yes/no answer) and even sanctionable. So within a particular institutional sub-variety, the constitution of trouble and what is repairable may be related to the particular institutional focus.


We now focus on the connection between repair and test design. By examining how and why interactional problems arise, it may be possible to fine-tune test design and procedures to minimise trouble. As mentioned above, there does appear to be some kind of correlation between test score and occurrence of trouble and repair: in interviews with high test scores, fewer examples of repair are observable. To illustrate this observation, two complete transcripts are reproduced in the Appendices, one with a high score of band 9.0 (Appendix 3) and no occurrence of trouble in hearing or understanding, and one with a low score of band 3.0 (Appendix 2), which gives the impression of great strain in both the candidate's and the examiner's conduct. The candidate's performance is characterised by three instances of other-initiated repair in the first half of Part 1 of the interview. Although she does not initiate any further other-repair, her long delays in uptake, in combination with answers which display partial, wrong or lack of understanding, occur throughout the interview. While there are indications that high scoring and low occurrence of trouble co-occur, our study is furthermore interested in uncovering any instances of trouble which may have been created by the test format or procedures themselves and which may therefore have an impact on test validity and reliability.
3.1.1 Repair initiation

Repair policy and practice vary in the different parts of the test. Examiners have training and written instructions on how to respond to repair initiations by candidates: "When interaction has clearly broken down, or fails to develop initially, the examiner will need to intervene. This ... may involve: repetition of all or part of the rubric (Part 1 or 2); the examiner asking: 'Can you tell me anything more about that?' (Part 2); re-wording a question/prompt or asking a different question (Part 3)." (IELTS Examiner Training Material 2001, pp6)

Candidates initiate repair in relation to examiner questions in a variety of ways. Examiner instructions are to repeat the question once only, but not to paraphrase or alter it. In Part 1, "The exact words in the frame should be used. If a candidate misunderstands the question, it can be repeated once but the examiner cannot reformulate the question in his or her own words. If misunderstanding persists, the examiner should move on to another question in the frame. The examiner should not explain any vocabulary in the frame." (Instructions to IELTS Examiners, pp5). The vast majority of examiners in the data do conform to this guidance; however, they frequently do make prosodic adjustments, as in the example below. (For transcription conventions, the reader is referred to Appendix 1.)
Extract 1
70 E: do people (0.6) <generally prefer watching films at home> (0.2)
71 C: yeah (0.5)
72 E: <or in a cinema> (0.2)
73 C: yeah (2.7)
74 E: so (1.2) do people generally prefer watching films (.) at home (0.3)
75 C: mm hm (0.6)
76 E: or in a (0.3) cinema (0.2)
77 C: I think a cinema (0.4)
78 E: why? (0.6)
79 C: because I think cinema (0.9) is too big (0.2) and (1.2) you can (0.3)
80    you can join in the:: the film (0.7)
(Part 1)

In this case the examiner repeats the question once. Sometimes examiners do not follow the guidelines and modify the question, as in the extract below:


Extract 2
17 E: can we talk about your country (.) which part of China (0.2) do most
18    people live in (0.4)
19 C: uhm in I think in the south of China most people living (0.4)
20 E: yeah (0.3) tell me about the main industries in China (0.8)
21 C: sorry? (0.3)
22 E: the main industries (0.3)
23 C: industries?=
24 E: =like ca:r industry:=
25 C: =o::h (0.4)
26 E: factories where they make th[in]gs (0.3)
27 C:                             [oh]
28 E: what what things does China (.) m[ake ]
29 C:                                  [oh ye]s (0.4) uhm I think mm China:
30    (0.5) the heavy industry (0.2) is uh most uh important in (0.5) China
31    (0.3)
32 E: mm hm (1.2) how easy is it to travel round China (0.9)
(Part 1)

In Part 3, by contrast, "The scripted frame is looser and the examiner uses language appropriate to the level of the candidate being examined. The examiner should use the topic content provided and formulate prompts to which the candidate responds in order to develop the dialogue."
Extract 3
117 E: can you suggest some of the ways life has improved because of
118    technology?
119 C: (0.4) can you repeat that?
120 E: are there some ways that our life has improved because of this technology
121 C: mm (3.0)
122 E: have our lives become easier and more convenient because of new
123    technology
124 C: erm yes (0.3) I think technology helps us a lot
(Part 3)

Here the examiner reformulates the question in line 122 in response to a repair initiation in line 119, which is followed by a hesitation and a three-second pause in line 121; this is within the guidelines for Part 3 of the test. Another example of the examiner modifying questions in Part 3 was found in 0125, lines 290-295. On rare occasions, candidates ask for a question to be explained. Sometimes examiners follow the guidelines and repeat the question, and sometimes this allows the interaction to proceed on track, as in the extract below.
Extract 4
73 E: and what do you think (0.7) erm (2.1) what do you think the role of
74    public transport will be in the future here in Albania? (2.6)
75 C: what do you mean? (0.3)
76 E: what kind of role does it (1.0) will it have in the future? (1.1)
77 C: oh (.) what role? (1.2) well (2.0) the same as now I think (.) the (.)
78    greater part the people mostly erm (0.7) travel or (.) be with the public transport (2.1)
79 E: you don't think that will change? (2.8)
80 C: no (0.4)
(Part 3)


Note that in this case the candidate's repair initiation at line 75 does not claim trouble in understanding the words of the utterance but rather the intended meaning of the prompt. While the examiner does not produce a verbatim repetition in response, she repeats the key words from the original prompt. In the candidate's ensuing uptake (line 77) he displays that his trouble had been "what role", indicating that the trouble lay not only in the meaning of the prompt but also in recognising the word itself. Sometimes the examiner's repetition of the question does not result in the interaction being able to proceed, as shown in the data segment below. After a repair initiation (line 64) and the ensuing repetition (line 65) fail to resolve the candidate's trouble in understanding, his request for reformulation (lines 66-67) is declined implicitly, as per examiner instructions, and the sequence is aborted (line 68).
Extract 5
63 E: what qualifications or certificates do you hope to get? (0.4)
64 C: sorry? (0.4)
65 E: what qualifications or (.) certificates (0.3) do you hope to get (2.2)
66 C: could you ask me in another way (.) I'm not quite sure (.) quite sure
67    about this (1.3)
68 E: it's alright (0.3) thank you (0.5) uh:: can we talk about your childhood?
69    (0.7)
(Part 1)

In the above extract we can see that there is no requirement for the examiner to achieve intersubjectivity or mutual understanding. The institutional aim is for the examiner to assess the candidate in terms of a specific band, and the candidate's inability to answer even after repetition provides the examiner with data for this task; the examiner simply moves on to the next prompt. We should note, however, that this lack of any requirement to achieve intersubjectivity, a product of the test design, creates a major difference between Speaking Test interaction and interaction in university seminars, tutorials and workshops, in which the achievement of intersubjectivity is a major institutional goal.

Sometimes examiners oblige the candidate and explain the question, contrary to instructions. Note that, in a similar way to the previous example, there are two successive repair sequences, each consisting of a repair initiation and a repair operation. In both cases, the examiner first repeats the prompt. In this case, however, the second repair operation consists of examples. It is noteworthy that once intersubjectivity is re-established, the candidate heavily recycles words from the helpful repair operation. It thus seems that the examiner's deviation from the training manual provides an advantage to the candidate.
Extract 6
50 E: what kind of shops do you prefer?
51 C: (1.0) shop? (.) er (0.3) do you explain perhaps for me please?
52 E: erm (2.4) what kind of shops do you like?
53 C: kind of shop?
54 E: big shop? small shop?
55 C: ah ah yeah I understand (0.2) I like er big shop (0.2) I prefer big shop
(Part 1)

Sometimes, candidates may ask for clarification of a question; according to the guidelines, examiners may provide this in Part 3, as in the example below.


Extract 7
79 E: first of all (0.3) er if we could (.) look a little at public and
80    private transport (.) em could you evaluate for me (0.6) please (0.7)
81    the advantages of private and public transport?
82 C: when you mean private and public transport you mean like er (.) private
83    for example a family goes alone or you mean like like private owned?
84 E: no with a car or something like that (1.4)
(Part 3)

Examiners are briefed not to help candidates who are struggling: "Examiners should not prompt candidates who are struggling to find language." (Instructions to IELTS Examiners, pp6). However, there are exceptions to these instructions: in Part 2, "When interaction has clearly broken down, or fails to develop initially, the examiner will need to intervene. This ... may involve: repetition of all or part of the rubric (Part 1 or 2); the examiner asking: 'Can you tell me anything more about that?' (Part 2)." (Examiner Training Material 2001, pp6). In an exceptional case (Extract 8) with a very weak student, we can see an example of the examiner trying to help the candidate in Part 2.
Extract 8
78 C: yes (0.8) er I travelled in er (.) in er (inyana) (eryanas) very (.)
79    very (like) and er (.) I went (.) I went er to: (1.4) I went there for
80    er the job (0.3) and er (5.4) and er (0.8)
81 E: did you enjoy your trip or not (.) how did you go there? (.) you went
82    to (Indiana)
83 C: yes
84 E: how did you travel there? (12.7) did you go by train did you go by plane?
85 C: er (0.3) I went er (1.0) to the bus and erm (.) I went erm to my
86    parents and erm (2.1)
87 E: did you enjoy the trip? (1.0)
88 C: ((name omitted)) er yes I (1.8) I enjoy the er (7.1)
(Part 2)

Above we see the examiner rephrasing the question in line 84 and then simplifying it by offering "train" and "plane" as alternatives. Examiners are instructed not to correct candidate utterances, and instances of correction are indeed very rare. In Extract 9, with a weak candidate, we see an example of correction.
Extract 9
1 E: can you tell me where you come from? (name omitted)
2 C: ahh I come from I I I come from Korea.
3 E: alright um where where in Korea?
4 C: er Korea is um
5 E: no no not where is Korea where in Korea which city?
6 C: it's Asia Asia
7 E: (1.0) I know that
8 C: it's Seoul Seoul
(Part 1)

In the above extract, the examiner initiates repair of the candidate's answer in line 5 before s/he has completed it. Eventually, in line 8, the candidate is able to self-repair successfully. In line 5, the examiner initiates repair on her own prior utterance because the candidate's answer in line 4 displays a wrong understanding of what the examiner said in line 3. Note that the trouble source is "in", which the candidate mistakes for "is". In her third position repair, the examiner places emphasis on "in", yet this is not reflected in the candidate's next response (line 6). After the examiner has rejected that answer (line 7), the candidate finally responds adequately to the prompt.

In this section we have seen that there are slight differences in the interpretation of examiner instructions relating to repair in the different parts of the test. The vast majority of examiners adhere rigidly to these instructions; some examiners do not follow the rules, and in these cases they provide a clear advantage to their candidates.
3.1.2 Repetition of questions

The repetition of questions plays a key role in the Speaking Test and is therefore examined in detail. The instruction manual states for Part 1 of the interview that examiners are to repeat a question only once (in case of trouble) and then move on: "The exact words in the frame should be used. If a candidate misunderstands the question, it can be repeated once but the examiner cannot reformulate the question in his or her own words. If misunderstanding persists, the examiner should move on to another question in the frame." (Instructions to IELTS Examiners, p 5). In the vast majority of cases, examiners adhere to this policy. Occasionally, however, some examiners do not follow these instructions, and we examine some instances of this below; the consequences of repeated repetition vary.
Extract 10
53 E: yes (0.3) was it a good place for children (1.1)
54 C: s- (0.3) beg your pardon ma'am (0.5)
55 E: was it a good place (.) for children (0.3)
56 C: for children (1.2) eh well that's definitely my whole ((inaudible)) (0.5)
57 E: was it a good place (.) for children (0.7)
58 C: good place for children. (0.4) I'm sorry I'm not can you please be a bit
59      more specific I hope if you don't mind so ma'am (0.6)
60 E: mm=
61 C: =I mean like I'm not getting you (0.4)
62 E: okay (0.3)
63 C: yeah exactly (0.4)
64 E: was it a good (0.3)
65 C: oh [was ]
66 E:    [place] it a good place
67 C: I see [see I thought]
68 E:       [for children ]
69 C: that you were saying (.) what it s a good place like wa- (0.3) yeah
70      definitely it was (0.4)
71 E: mm hm (.)
(Part 1)

In the above case the question is repeated three times and the talk becomes a long repair sequence. When comprehension is finally achieved in line 69, only a very simple answer is provided and the candidate does not engage with the topic. In this case, then, repeated repetition does not help the candidate display a high level of proficiency in his/her answer.
Extract 11
102 E: and (0.6) eh where (0.4) did you usually play. (1.7)
103 C: play (0.9)
104 E: play (.)
105 C: yes=
106 E: =where (0.4)
107 C: like eh (0.7) cricket ((inaudible))=
108 E: no where did you usually play (1.6)
109 C: so[rry ]
110 E:   [where] (0.6) where did you usually play (1.6)
111 C: sorry I can't get that (0.4)
112 E: where (0.7)
113 C: where (1.1) I usually play? (1.3)
114 E: mm s- eh (.) when you were a child (0.5)
115 C: yeah (0.5)
116 E: where <did you play> (0.5)
117 C: eh well (.) eh as I told you that as this is divided into portions and
118      particularly the first portion which I (.) don't like,
(Part 1)

In Extract 11, there is a repair initiation in the form of a partial repeat in line 103. The examiner repeats the question no fewer than five times, without obtaining an answer by the end of the cycle. Again, repetition as a repair operation does not always work well.
Extract 12
116 E: I see (.) alright (1.9) do you generally enjoy (.) travelling (1.5)
117 C: sorry (0.6)
118 E: do you generally (.) enjoy (.) travelling (1.3)
119 C: eh (1.5) I think eh I want to eh eh (0.3) drive home in the car (1.0)
120      because eh (.) all the facilities and eh (0.3) mm save time car (0.5)
121      car save you time (0.3) and it give you
122      [much]
123 E: [but ] do you enjoy (.) travelling (2.3)
124 C: eh travelling? (0.5)
125 E: mm hm (0.7)
126 C: yeah I (0.6) eh (0.3) get travelling (0.4) eh (.) in (0.3) trip (0.9)
127 E: do you enjoy (0.4) travelling (1.6)
128 C: yeah I have eh (0.4) fond of travelling eh somewhere (0.7) so because
129      eh (.) travelling it give you some time (0.3) to fresh your mind (0.5)
130      and eh (0.3) eh because eh life (0.3) is now (.) very eh h- eh (.) quick
131      (0.3) indeed eh (.) and we have not much time to travel (0.4) so it give
132      [you some] freshness
(Part 2)

In the above case the examiner repeats the question three times, and this enables the candidate finally to provide a relevant answer after the examiner has stressed the key word in line 127. In this particular case, the examiner has ignored the instructions and this has given a distinct advantage to the candidate. Other examples of excessive repetition in the data are: 0127, lines 60 onwards (repeats three times); 1106, lines 23 onwards (repeats twice) and lines 97 onwards (repeats twice); 0272, lines 38 onwards (repeats four times); and 0836, lines 23 onwards (repeats three times).
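Counts of this kind are straightforward to derive from the transcripts. The sketch below is our illustration only, not part of the study's method: it shows one way repeated examiner questions could be tallied automatically, assuming transcripts stored as plain text with turns in the format used in the extracts above; the 0.85 similarity threshold is an arbitrary choice.

    # A minimal sketch, not from the report: tally near-identical repetitions
    # of examiner questions. Assumes plain-text turns in the form
    # "108 E: no where did you usually play (1.6)"; continuation lines without
    # a speaker label are ignored for simplicity.
    import re
    from difflib import SequenceMatcher

    TURN = re.compile(r"^\s*\d+\s+([EC]):\s*(.*)$")

    def examiner_turns(transcript):
        """Yield examiner utterances with CA timing/overlap notation stripped."""
        for line in transcript.splitlines():
            m = TURN.match(line)
            if m and m.group(1) == "E":
                # Remove pause notation such as (0.4) and (.), plus [ ] = < >.
                text = re.sub(r"\((?:\d+\.\d+|\.)\)|[\[\]=<>]", " ", m.group(2))
                yield " ".join(text.split()).lower()

    def count_repetitions(transcript, threshold=0.85):
        """Count examiner turns that closely repeat an earlier examiner turn."""
        seen, repeats = [], 0
        for turn in examiner_turns(transcript):
            if any(SequenceMatcher(None, turn, s).ratio() >= threshold
                   for s in seen):
                repeats += 1
            seen.append(turn)
        return repeats

    sample = """102 E: and (0.6) eh where (0.4) did you usually play. (1.7)
    103 C: play (0.9)
    108 E: no where did you usually play (1.6)
    110 E: where (0.6) where did you usually play (1.6)"""
    print(count_repetitions(sample))  # 2: lines 108 and 110 repeat the question

A string-similarity measure is used rather than exact matching because, as the extracts show, repeats often carry small additions ("no where did you usually play") while remaining functionally the same question.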
3.1.3 Lack of uptake to the prompt

We now consider what happens if the candidate does not answer the question directly. The instructions for examiners in this area differ across the three parts of the Test, as follows:

"How is the rating affected if the candidate answers the prompt without actually answering the question? This may be an indicator of inadequate lexis, and therefore that the candidate can only deal with familiar topics. It may also indicate a prepared answer." (IELTS Examiner Training Material 2001, p 69, Part 1)

"The candidate misunderstands the task and talks off the topic. Let them go ahead; assessment is still on their ability to talk at length and unassisted for the required time." (IELTS Examiner Training Material 2001, p 70, Part 2)

"The candidate does not seem to answer the questions directly. This may indicate inadequate lexis, or simply be a roundabout way of dealing with a difficult topic. It is a judgement call for the examiner." (IELTS Examiner Training Material 2001, p 73, Part 3)

The vast majority of examiners follow these instructions and do not take action if candidates do not answer the question directly. However, in some exceptional cases, examiners do treat failure to provide a direct answer as trouble. The extracts below demonstrate a variety of behaviours by examiners.
Extract 13
36 E: alright (0.7) now let's move on to talk about some of the activities
37      you enjoy in your freetime (.) right? (1.5)
38 C: alright. Yeah (1.0)
39 E: when do you have free time? (1.5)
40 C: well I love to play on the computer (0.7) I love to travel with my
41      family to my farm (.) because I have a farm (.) and next to (5.6) and here
42      in (Bella renoche) I can say that I have a little stressed life? (1.0) because
43      I don't have time to do my stuff (0.7) well I (0.3) I (0.7) I like to be with
44      my friends (0.8) I like to go out with my friends (.) I like to go to the
45      movies (1.4) I like to be with my girlfriend (1.0) yes (1.2)
46 E: what free activities are most popular among your friends? (1.3)
47 C: most popular? well (0.7) study (0.4) at weekends (0.8) we have to study
48      because our course is=
49 E: =so would you call it free time activities?=
50 C: =no (1.2) not free time activities free time activities we go to parties (0.7)
51      we go to the movies (1.6) and we travel together (1.9)
52 E: alright and how important is free time in people's lives? (0.7)
(Part 1)

In line 49 the examiner asks a supplementary unscripted question which implies that the question has not been answered, and which provides the candidate with an opportunity to self-repair; in this case s/he is able to do so and provides a direct answer.
Extract 14
40 E: okay (0.6) let's talk about public transport (0.5) what kinds of public
41      transport are there (0.3) where you live (2.0)
42 C: it's eh (0.5) I (0.4) as eh (0.4) a (0.3) person of eh (0.4) ka- Karachi, I
43      (1.1) we have many (0.8) public transport problems and (0.7) many eh
44      we use eh (0.4) eh buses (0.4) there are private cars and eh (.) there are
45      some (0.3) eh (0.4) children (0.4) buses (0.8) and eh (1.9) abou- (0.2)
46      about the main problems in is the (0.4) the number one is the over eh
47      speeding (0.5) they are the oh eh (0.5) the roads (0.8) and eh (.) they are
48      [on]
49 E: [I ] didn't ask you about the problems (0.6) my question was (0.6) what
50      kinds of public transport are there (.) where you live (0.7)
51 C: oh s- (.) sorry (0.5) eh I there (.) I live in (0.5) ((inaudible)) (0.4) so I
52      have eh (0.3) eh (0.4) t- we have there eh (0.4) private cars (0.5) and
53      some read
54      about the taxis and eh (0.3) local buses (0.5)
(Part 1)


In line 49 above the examiner explicitly treats the candidate's answer as trouble in that it did not provide a direct answer to his/her question, even though it was on the general topic of public transport. In this instance, the candidate is able to provide a direct answer. In the data, the vast majority of examiners follow the instructions in relation to indirect candidate answers. In some cases, examiners do initiate repair of indirect answers, and this generally results in candidates supplying direct answers.
3.1.4 Vocabulary

"The examiner should not explain any vocabulary in the frame." (Instructions to IELTS Examiners, p 5)
Extract 15
63 E: what qualifications or certificates do you hope to get? (0.4)
64 C: sorry? (0.4)
65 E: what qualifications or (.) certificates (0.3) do you hope to get (2.2)
66 C: could you ask me another way (.) I'm not quite sure (.) quite sure about
67      this (1.3)
68 E: it's alright (0.3) thank you (0.5) uh:: can we talk about your childhood?
69      (0.7)
(Part 1)

In the above extract the examiner follows the instructions exactly: s/he declines the request for clarification and moves on to the next question. In the data, the vast majority of examiners follow the instructions in this way.
Extract 16
40 E: uh so (0.5) how would you improve (0.4) the city you live in (1.8)
41 C: I:: (0.8) how do I pro::ve? (0.2)
42 E: how would you impro:ve (.) the city (0.3)
43 C: sorry I don't know (.)
44 E: improve? (0.3)
45 C: yeah (.)
46 E: how would you make the city better? (0.3)
47 C: o::h yes (0.5)
(Part 1)

In Extract 16 the examiner does not follow the brief, explains the vocabulary item by providing a synonym and thus gives an advantage to the student, who indicates comprehension in line 47.
Extract 17
71 E: okay (0.3) uh:m what d'you think is the most important (0.6) household
72      task? (1.4)
73 C: household task? (0.4)
74 E: mm=
75 C: =uh:m sorry I [can't ]
76 E:               [most importa]nt job (.) in the house (0.8)
77 C: in the house (1.5) uh:m (0.7) I think (0.4) the: most important job is (.)
78      cleaning hh (0.5) because my house is quite big (0.3)
(Part 1)

In a similar instance above, the examiner does not follow the brief, explains the vocabulary item by providing a synonym in line 76 and thus gives an advantage to the student, who is able to provide an answer in line 77.

Extract 18
236 C: hh because you know uh:: uh (0.3) we don't have uh:::m materials here
237      we don't have uh:: fuel or (.) petrol we don't have uh:::: (0.2) hh
238 E: resources (.) natur[al resour- ]
239 C:                    [that's right w]e don't have natural resources
240      ((inaudible)) (0.8) uh: (0.3) we we have to work on on tourism
(Part 3)

In Extract 18 the examiner helps by supplying vocabulary to the candidate in line 238, which s/he subsequently employs in line 239. Although examiners have more flexibility in Part 3, they do not have a brief to supply vocabulary. In this sub-section, we have seen that the vast majority of examiners follow the instructions not to explain vocabulary. In some rare cases, examiners do not follow the instructions and provide an advantage to these candidates, who are generally able to exploit this help.

We can summarise the section on repair as follows. The organisation of repair in the Speaking Test is highly constrained and inflexible, and this is intended to ensure standardisation. In Part 1, the candidate may only initiate repair by requesting a single repetition of the question; no reformulation is permitted. The examiner rarely initiates repair. If a candidate turn is incomprehensible, error-ridden or irrelevant, there is no brief for the examiner to initiate repair in order to achieve intersubjectivity, except for the single repetition as and when requested. This is because candidate turns are produced for evaluation by the examiner. The design of repair in Part 1, then, has been tightly constrained in relation to the institutional goal of standardisation and fairness.

How does repair in the IELTS Speaking Test compare to that in other settings? In general, the organisation of repair in the IELTS Speaking Test differs very significantly from that described as operating in ordinary conversation (Schegloff, Jefferson & Sacks, 1977), in L2 classroom interaction (Seedhouse, 2004) and in university interaction (Benwell, 1996; Benwell & Stokoe, 2002; Stokoe, 2000), the last of which is the target form of interaction for most candidates. The literature on these settings shows that many different forms and trajectories of repair are used in them. The lack of any requirement (in Part 1) to achieve intersubjectivity, a product of the test design, creates a major difference between Speaking Test interaction and interaction in university seminars, tutorials and workshops, in which the achievement of intersubjectivity is a major institutional goal.

The organisation of repair is rationally designed in relation to the institutional attempt to standardise the interaction and thus assure reliability. However, given that the organisation of repair is unusual and cannot be anticipated by candidates, the worry is that some candidates may become confused and their test performance lowered. The evidence for this is that (as we have seen) some candidates request explanations of questions and multiple repetitions. In the IELTS Handbook and website available to students, and in most IELTS preparation books we examined, there was no statement on the organisation of repair; this was detailed, however, in IELTS On Track (Slater, Millen & Tyrie, 2003). It is unclear to what extent candidates are aware of these repair rules. A mock Speaking Test may prepare candidates for them, but it is unclear how many candidates will have taken one. We would recommend that a very brief statement be included in written documentation for students, eg: "When you don't understand a question, you may ask the examiner to repeat it. The examiner will repeat the question only once. No explanations or rephrasing of questions will be provided."
A further recommendation would be that examiners state the rules for repair towards the end of the opening sequence; an example of this practice is described in Egbert (1998).


Overall, the organisation of repair in the Speaking Test has a number of distinctive characteristics. Firstly, it is conducted according to strictly specified rules, in which the examiners have been briefed and trained. Secondly, the vast majority of examiners adhere rigidly to these rules, which are rationally designed to ensure standardisation and reliability; some examiners do not follow the rules, and in these cases they provide a clear advantage to their candidates. Thirdly, the nature and scope of repair is extremely restricted because of this rational design. In particular, exact repetition of the question is used by examiners as the dominant means of responding to repair initiations by candidates. Fourthly, there is no requirement to achieve intersubjectivity in Part 1 of the Test.

3.2 Turn-taking and sequence

The overall organisation of turn-taking and sequence in the Speaking Test closely follows the examiner instructions. Part 1 is a succession of question-answer adjacency pairs. Part 2 is a long turn by the student, started off by a prompt from the examiner and sometimes rounded off with questions. Part 3 is another succession of question-answer adjacency pairs, with slightly less rigid organisation than Part 1. This tight organisation of turn-taking and sequence is achieved in two ways. First, the examiner script specifies this organisation, for example "Now, in this first part, I'd like to ask you some questions about yourself." (Examiner script, January 2003). Secondly, many candidates have undertaken training for the Test, and in some cases this will have included a mock Speaking Test.
3.2.1 The introduction section

"One of the key features of the IELTS Speaking Test is the importance placed on making the candidate feel as relaxed and as much at ease as possible within the confines of an examination." (Instructions to IELTS Examiners, p 3). However, the administrative business in the introduction section sometimes works against this and has the potential to create interactional trouble at the start. In the introduction section, the examiner must create a relaxed atmosphere, but at the same time perform introductions and verify the candidate's identity. Because this administrative business has to take place before the Test as such begins, a switch of identity is involved for both participants, which may tend to work against the intention to create a relaxed atmosphere. When verifying ID, the professional adopts a gatekeeping or administrative identity and a quasi-policing function; the candidate has the identity of person-being-identified. When this business is concluded, the identities switch to examiner and candidate. The policing function is evident in the extract below.
Extract 19
1 E: could you (0.4) tell me your full name please (0.6)
2 C: ((name omitted)) (0.7)
3 E: thank you and (0.4) do you have your identification with you please
4      [that's]
5 C: [yes ] exactly I sure do have the passport! (1.1) and I do have the
6      national I.D. card (0.8)
7 E: I think it's your (0.4) oh (5.0) passport that I need
8 C: (4.0) yes please
9 E: (6.3) is this you? (0.3)
10 C: exactly ma'am I didn't have my moustaches so that's why (0.4) I went for
11      a clean shave (0.7) so that's why I've got a chin (0.4) I'm s- (0.5)
12 E: you look older on that one=
13 C: =yeah exactly (0.6) that's my mummy told me the same thing
14 E: (27.2) right (1.2) thank you (0.5)
(Part 1)


In Extract 19, then, the administrative business works in opposition to the aim of creating a relaxed atmosphere. The examiner challenges the candidate twice in relation to his identity, and the 27.2-second pause before the examiner finally accepts the candidate's identity is by far the longest pause in the data.
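Because timed silences are marked in a standard way in CA transcripts, a claim such as "the longest pause in the data" can be checked mechanically. The following sketch is our illustration only, with an assumed plain-text transcript format; it extracts every timed pause and returns the longest:

    # A minimal sketch, not from the report: CA notation marks timed silences
    # in parentheses, eg (0.4) or (27.2), so the longest pause in a collection
    # of transcripts can be found with a regular expression.
    import re

    PAUSE = re.compile(r"\((\d+\.\d+)\)")

    def longest_pause(transcripts):
        """Return (seconds, transcript_index) for the longest timed silence."""
        best = (0.0, None)
        for i, text in enumerate(transcripts):
            for m in PAUSE.finditer(text):
                seconds = float(m.group(1))
                if seconds > best[0]:
                    best = (seconds, i)
        return best

    data = ["9 E: (6.3) is this you? (0.3)",
            "14 E: (27.2) right (1.2) thank you (0.5)"]
    print(longest_pause(data))  # (27.2, 1)

Note that micro-pauses, transcribed as (.), carry no duration and are deliberately not matched by the pattern.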
Extract 20
1 E: .hh well good evening=my name is ((first name)) ((last name))=
2      can you tell me your full name please.=
3 C: =yes ((first name,)) ((last name.))
4 E: .hh ah: a:n[d,
4b C:            [ghm=
4c E: =can you tell me er, what shall I ca:ll you.
4d          (1.5)
5 C: e:r (1.0) can you repeat the: er the question[(s),?
6 E:                                               [( )
7 E: what do you, (0.2) your first name? do you use [((last name))
7b C:                                                [( ) ((first name))
8 E: ((first name)).
8b C: [((first name))
9 E: (you want me to call you) ((first na[me))
10 C:                                     [yes
10b E: [yes ((first name)) right. ((with forced sound release)) .hh and can I
10c      see your identifi<cation: card please.> (0.5)
11 C: .h[hm-
12 E:    [an ID. .hh er: not a student card=do you have an I[D card?
13 C:                                                        [e::::m no::=in,
13b      (0.2) tch! no. (0.5) tch! er I don't er (0.2) .h I don't have (1.3)
13c      the: (1.0) administration,=er: the day.
14 E: m:: I understa:nd but you erm .h need to ha:ve, a: tch! (0.2) your official,
15 C: yes
16 E: ID card.
17 C: ye:s.
17b      (1.5)
18 E: .hh thank yer .hh erm in this first part I'd like to s=ask some
19      questions about yourself. .hh em >well first of all can you tell me
19b     where you're< from
(Part 1)

The above introduction sequence creates considerable interactional problems, and a full analysis of the test (Appendix 2) suggests that the candidate was thrown by this initial sequence and never recovered. The question "What shall I call you?" created significant problems for the candidate above and, very occasionally, in other cases. Sometimes the question and answer sequence for this question is negotiated smoothly, as in Extract 21.
Extract 21
1 E: could you tell me your full name please (.)
2 C: yes (.) I'm ((name omitted)) (0.6)
3 E: thank you (0.6) and (.) what shall I call you (.) ((name omitted))? or
4      (0.9)
5 C: ((name omitted)) (0.6)
6 E: right (0.7) my name's ((name omitted)) (0.8) em (0.4) can I see your
7      identification please (0.4)
(Part 1)


The examiner asks the question and the candidate provides a nickname without trouble arising. However, the examiner does not actually use the candidate's nickname later on during the course of the interview; it is therefore unclear what the purpose of asking the question is (see also 0394, lines 5-9, for another example). We should also note that, in cases where candidates do have a nickname or pet name which is different to their ID name, they sometimes volunteer this (see 0126, line 1 for another example):
Extract 22
1 E: good afternoon my name is ((name omitted))
2 C: my name is ((name omitted)) oh well you can call me ((name omitted))
3      because I was studying university everybummy (0.3) everybody call me
4      ((name omitted)) so (0.5) everybody (0.7) because this ((name omitted))
5      is quite close to my given name at first ((name omitted)) ((spelling out
6      name)) (.) and ((spelling out nickname)) so (0.7) s-=
7 E: =o[kay]
(Part 1)

As this question can cause problems for candidates, and as candidates sometimes volunteer a nickname if they have one, we recommend that the question be deleted.
3.2.2 Transition between parts of the test and between question sequences

Transitions between sequences are marked more or less explicitly by examiners, in accordance with their written script. An example of the change from Part 2 to Part 3 of the test can be observed in lines 217-220 of the following segment.
Extract 23
216 E: mm h[m ]
217 C:      [and] (2.0) and and and most people I know (1.2)
218 E: alright we've been talking this piece of equipment which you find useful
219      (0.6) and I'd like to discuss with you one or more general questions
220      related to this (0.6) okay? (0.2) comes to the first of all (0.3) attitudes
221      to technology (1.2) can you describe the attitude of all the people (.) in
222      modern technology (0.7)

Although the examiner above does not state explicitly that s/he is moving from Part 2 to Part 3 of the test, the wording implies a transition from a previous focus to a new but related one. We now consider what examiners say on receiving an answer from the candidate, and how they mark the transition to the next question within Part 1 of the test.
Extract 24
25 E: okay so what do you like most (0.3) about your studies (1.7)
26 C: uh the variety (0.4) I think in: medicine especially because no: two
27      patients will present the same way (0.4) and i- it's always a challenge to
28      figure out what the diagnosis is (0.3) and uh ways in which you can (.)
29      confirm the diagnosis basically (0.2)
30 E: okay (0.4) are there any things you don't like about your studies? (2.7)
31 C: well personally the fact tha:t (.) if I read something I have to read it again
32      you know to remember it (.) it's just a lot (.) the volume of work is very
33      very large so it's just (0.2) time management (0.2) and learning to deal
34      with the: (0.2) (volume of work) (0.3)
35 E: okay (0.7) so uh:: what qualifications or certificates (0.8) do you hope to
36      get (1.3)
(Part 1)


In the Test from which the above extract is taken, the examiner says "okay" 21 times at the start of the receipt slot (the point directly after the candidate's answer), with seven of those instances being a double "okay" and the end of the test being marked with a triple "okay". We now consider how examiners signal to the candidate that they want to listen further. The Training Manual lists items which in the CA literature have been termed continuers (eg Goodwin, 1986); these display understanding to the current speaker and indicate that the listener passes the opportunity to take the next turn. "Examiners should keep non-verbal interjections to a minimum. (Eg um, right, uh uh.)" (IELTS Examiner Training Material 2001, p 6). "How do examiners acknowledge something candidate has said? By adopting a listening pose and maintaining eye contact. NOT by commenting or giving too much audible acknowledgement." (IELTS Examiner Training Material 2001, p 69, Part 1). While the audio tapes do not allow us to examine the non-vocal aspects of the interaction, the transcripts indicate that examiners make frequent use of continuers.
Extract 25
15 E: you some questions about yourself (0.7) em (0.3) let's talk about what
16      you do (.) do you work or are you a student (1.0)
17 C: actually: (1.1) I- no (.) I am not a student right now (0.3)
18 E: mm hm (.)
19 C: I did my (.) engineering some (0.3) three years back (0.4)
20 E: mm hm (0.6)
21 C: and then I started working for my father (0.6) and (0.6) family for (0.3)
22 E: mm [hm]
23 C:     [it's] construction business I'm in (.)
24 E: mm hm, (0.7) okay so tell me about your job (1.5)
25 C: right now (0.5) we don't have a job at all (0.5)
26 E: mm hm, (0.4)
(Part 1)

The examiner in the above extract uses "mm hm" to pass on taking the turn, and "okay" to mark that the answer turn is finished and that the examiner will produce another question. Generally in the data, "mm hm" provides a non-committal, non-evaluative display of attention, while "okay" marks receipt of a complete turn and transition to the next question. In neither case does the candidate know the degree of the examiner's understanding. The issue of examiners' use of continuers is of particular importance in relation to Part 2 of the Test. In many transcripts there is no verbalised feedback from the examiner at all during Part 2, for example in transcript 0415.
Extract 26
244 C: so this is a need of (.) this thing (0.7) so (1.1) some people use (.) eh
245      are using (.) these things (.) eh this thing but (0.3) not most of the
246      people (.)
247 E: mm hm=
248 C: =so in my view it is (.) eh (0.9) eh it should be (1.2) the: necessity (.)
249      of our >home town< not my home towns (0.5) all the countryside a-
250      actually all seventy per- eh percent of population is living in the (.)
251      countryside (.)
252 E: mm hm (.)
(Part 2)


In Extract 26, by contrast, the examiner uses "mm hm" more frequently, a total of five times throughout the test. There are arguments for consistent conduct by examiners in the use of markers in the receipt slot and at turn transition relevance spaces (points at which turn change can occur). The use of "okay" and "mm hm" appears suitable: both are designed by examiners, and understood by candidates, to be non-evaluative, and neither generates any instances of trouble in the data. We would therefore recommend, in the interests of consistency and standardisation, that examiner instructions state that "okay" is used in the receipt slot to mark transition to the next question and that "mm hm" is used as a continuer, ie as a signal that the candidate is encouraged to continue talking. This would be particularly useful in Part 2. A more systematic video analysis would be necessary to shed light on the use of body posture, eye contact, head movements, handling of the written materials and similar behaviours in connection with turn transition, signals of understanding and displays of section closings.
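If such an instruction were adopted, adherence could also be monitored across examiners. The sketch below is our illustration only, with an assumed one-turn-per-line transcript format; it counts the two markers in examiner turns:

    # A minimal sketch, not from the report: count the receipt marker "okay"
    # and the continuer "mm hm" in examiner turns of a plain-text transcript.
    import re
    from collections import Counter

    def marker_counts(transcript):
        """Count selected discourse markers in examiner (E) turns."""
        counts = Counter()
        for line in transcript.splitlines():
            m = re.match(r"^\s*\d+\s+E:\s*(.*)$", line)
            if m:
                turn = m.group(1).lower()
                counts["okay"] += len(re.findall(r"\bokay\b", turn))
                counts["mm hm"] += turn.count("mm hm")
        return counts

    sample = """25 E: okay so what do you like most (0.3) about your studies (1.7)
    26 C: uh the variety (0.4)
    30 E: okay (0.4) are there any things you dont like about your studies? (2.7)"""
    print(marker_counts(sample))  # Counter({'okay': 2, 'mm hm': 0})

Applied to whole transcripts, counts of this kind would make it easy to flag examiners whose marker use departs markedly from the recommended pattern.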
3.2.3 Evaluation

The Instructions for Examiners tell examiners to avoid expressing evaluations of candidate responses: "Do not make any unsolicited comments or offer comments on performance." (IELTS Examiner Training Material, 2001, p 5). It is very noticeable in the data that examiners do not verbalise positive or negative evaluations of candidate talk, with some very rare exceptions. In this respect the interaction is rather different from interaction in classrooms of all kinds, in which an evaluation move by the teacher in relation to learner talk is an extremely common finding, in L1 classrooms (eg Mehan, 1979) as well as in L2 classrooms (Westgate et al, 1985). It is also different from interaction in university settings (Benwell, 1996; Benwell & Stokoe, 2002; Stokoe, 2000). Examiners follow these instructions, and we found only a very few aberrant cases. In the following two data excerpts, examiners produced evaluations of candidate talk.
Extract 27
16 C: eh (1.3) actually eh (0.3) it's very interesting job eh (.) it is (0.3)
17      especially in my eh (0.8) department (0.4) that is specialised (.)
18      department that is eh (0.3) microbiology (0.8) in eh eh interesting
19      ((inaudible)) (0.5) [I enjoy it]
20 E:                      [yes yes ] mm yes (0.3) good (0.5) are there any
21      things you don't like about your work (1.1)
22 C: lot of things I like to do (0.5) as a pharmacist because eh (.) pharmacist
23      are (1.1) eh complicated persons in pharmaceuticals so=
24 E: =yes (0.9)
25 C: eh (0.3) but the whole department is (0.4) very interesting for me (0.9)
26      [mm hm]
27 E: [good! ] (1.0) eh (.) do you have any plans to change your job in the
28      future (1.1)
(Part 1)

Extract 28
108 E: ((inaudible)) and have you any plans to change your job? (1.7)
109 C: na::h (0.4) I don't think I will change my job? After I come back to
110      Vietnam (.) because when I came here (0.4) to New Zealand (0.3) I quit
111      my job (.) but my ex boss said that I could return to my office (.) if I
112      wish to (0.4) but I think that it's time for me to set up my own business=
113 E: =very good!=
114 C: =yeah (0.4) I plan to (.) set up my business to ((inaudible)) educational
115      (1.1) I set up my business (0.6)
116 E: very good (1.7)
(Part 1)


It appears to be the case that L2 teachers often provide positive or negative evaluations of learner talk when teaching in class. However, when the same teachers assume the examiner role in a Speaking Test, they generally do not verbalise evaluations of candidate talk. The explanation appears to lie in the rational design of these two different varieties of institutional talk. In the L2 classroom, the institutional goal is that the teacher will teach the learners the L2 (Seedhouse, 2004:183). In this institutional setting, positive or negative evaluations of learner talk are formative and designed to help the learners learn; the instructor's main aim is to teach and to evaluate learner talk, at least in many teaching methods. In the IELTS test, by contrast, the institutional goal is to assess the language ability of candidates (IELTS Handbook, p 2). The Speaking Test is not part of an ongoing programme of study, and a summative evaluation of language ability is provided formally and in writing after the Speaking Test has taken place. The examiner's aim is to provide an assessment, but the result is not given to the candidate immediately. It may be that one way in which examiners talk a formal examination into being is precisely by avoiding the positive or negative evaluations of learner talk typical of the classroom. Examiner behaviour here is a striking example of professional caution and of asymmetry of access to knowledge, ie the evaluation and scoring of learner talk. This lack of positive or negative evaluation of candidate talk is related to the rational design of the institutional setting and is therefore appropriate. However, we should note that it creates a striking difference between Speaking Test talk and both L2 classroom interaction and interaction in universities, the latter being the target destination for most candidates. We therefore recommend that candidates be informed about this aspect of examiners' conduct beforehand.

3.3 Topic

In the Speaking Test, the topic of the talk is pre-determined by the central administration, is written out in advance and is introduced by the examiner. Candidates are evaluated on (among other things) their ability to develop a nominated topic (see IELTS band descriptors). Topic is intended to be developed differently in the different parts of the test: "Can examiners ask a follow-up question from something candidate has said? No." (IELTS Examiner Training Material 2001, p 69, Part 1). "Can the examiner ask an unscripted follow-up question in Part 3? Yes." (IELTS Examiner Training Material 2001, p 71). Usually, candidates follow the examiner's topic nomination wherever possible; however, there are some very rare cases in which the candidate attempts to determine the topic. Note that in the example below (lines 125 ff), the candidate asks whether she can talk about a specific aspect of the prompted topic. This is denied; so even in Part 3 the examiner does not allow the candidate to shift topic.
Extract 29
118 E: let's talk about public and private transport (0.6) can you describe (.) the
119      public transport systems in your country (1.0)
120 C: I used to have eh the main eh (2.0) public transport and th- (0.3) the main
121      transport which are (0.3) which is used by the public are the (0.5) buses
122      (0.7) secondly if eh (0.3) there are some eh urgent eh they use the taxis
123      investment plans the banks are (.) given (0.5) and eh (0.3) the main eh is
124      the (0.5) the (1.8) transport is eh (1.5) is eh bad (0.7) today have eh (0.9)
125      can't I talk about the (0.3) problems (1.1)
126 E: no=
127 C: =no (1.1)
128 E: just describe (.) the public transport systems [in your country ]
129 C:                                                [eh describe the main]
130      transport system which [I ]
131 E:                         [okay] (0.5) now (0.5) I would like you to
132      evaluate (0.8) the advantages of private (.) and public transport (1.5)
133 C: okay (1.7) first eh (.) talking about (0.4) the eh (0.5) private transport eh
(Part 3)


In Extract 30 below we see the issues of topic, interpretation of topic, question repetition and direct answers to questions converging. The examiner appears to engage with the topics the candidate talks about in lines 40, 42, 44, 46, 48, 50 and 52. The candidate's answers are not direct answers to the question, but are clearly related to the overall topic of childhood. Of special interest is the candidate's response in lines 50 and 52, where he provides a completely logical answer which, however, is treated as a misunderstanding since, apparently, the examiner expected a different interpretation.
Extract 30
39 E: hh where did you grow up, (0.8)
40 C: eh (.) in my childhood I was eh very naughty! (.)
41 E: yes, (0.7)
42 C: I p- eh (.) I played with my er friends, (.)
43 E: yes (0.7)
44 C: eh (0.7) t- cricket, (0.6)
45 E: ah yes (1.1)
46 C: eh (.) fly kiting, (0.6)
47 E: yes (0.9)
48 C: and eh othe::r (0.8) things eh (0.6)
49 E: and where (.) where did you grow up ((name omitted)) (1.6)
50 C: em (.) grow eh with my (0.6) parents (.)
51 E: yes=
52 C: =eh (.) m- my (.) especially my dad (0.6)
53 E: very good eh (.) I see (.) and where did you grow up (1.4)
54 C: ((inaudible)) (0.3)
55 E: where (0.5) did you grow up (1.3)
56 C: ((inaudi[ble))]
57 E:         [yeah ] hh (.) okay (.) do you think childhood is different today
58      from when you were a child, (1.4)
(Part 1)

The examiner says "yes" five times in response to the candidate's turns, and this appears to be positive evaluation. At the same time, however, the examiner repeats the question "where did you grow up?" three times, which shows that the examiner is treating the candidate's failure to provide an adequate response as trouble. The examiner's yes-receipts and his question repetitions appear to be mutually contradictory, with one signalling approval and the other signalling trouble. The examiner is deviating from instructions in two regards: by expressing evaluations and by multiple repetitions of the question.
3.3.1 Topic disjunction

In this section we examine instances in which scripted questions generate trouble and topic disjunction (in which the flow of topic is disturbed). We examine firstly the question "Would you like to be in a film?" (Part 1 of the Test), which causes trouble for a striking number of candidates. In the examiner script this follows the questions: "Do you enjoy watching films? How often do you watch films? Do people generally prefer watching films at home or in a cinema?" The interesting point is that in the script there is no indication that the question might be topic disjunctive, as it clearly continues the topic of films. In the flow of interaction, however, eight candidates found it difficult to understand the question, even in cases where the candidate had no problems understanding all of the other questions in the test. This is out of a total of 32 candidates who were asked this question in the data. In the following two examples we see how the trouble and repair sequences typically unfold after this question.


Extract 31
57 C: .hh err I (0.4) watch most films (0.8) usually after work (1.5) er
58      sometimes sometimes I see two (.) film in a week (.) only
59 E: mm hm (1.8) would you like to be in a film (.) yourself? (2.0)
60 C: pardon. (1.1)
61 E: would you like to be in a film. (1.0)
62 C: err:: if I was:: an actor? (.)
63 E: hmm (1.0)
64 C: no I don't. I don't like it. (2.1)
(Part 1)

Extract 32
66 E: alright=
67 C: =yeah (0.2)
68 E: do- do uhm would you like to be (0.3) in a film (0.3)
69 C: oh I like going to the cinema (0.2)
70 E: but would you like to be in (0.3) a film (0.6)
71 C: uh::m (2.3)
72 E: actress (0.8)
73 C: actress (.) actre::ss (0.9)
74 E: would you like to be? (0.3)
75 C: yeah (0.9) I like=
76 E: =why would you like? (0.6)
77 C: uh::m (0.5) because (0.9) I I saw a film (0.4) include uh hero (0.3) and a
78      heroine (0.3) I think the heroine is very very beautiful (0.8) I really like it
(Part 1)

In Extract 32, we see that the examiner deviates from instructions by modifying the question in lines 72 and 74. This may be due to the ambiguity of the prompt. Other examples of trouble with this question can be found in extracts 0099, lines 73 on; 0127, lines 83 on; 0394, lines 161 on; 0144, lines 72 on. We cannot know for certain why the question created so much interactional trouble for so many candidates. However, the explanation appears to involve a shift in perspectives. The previous questions about films involved the candidates in continuing their normal perspective as visitors to cinemas and viewers of films. The problem question involves an unmarked and unmotivated shift in perspective to a fantasy question in which candidates have to imagine they had the opportunity to be a film star. As we can see in the following extracts, some candidates say they have never thought about this and have difficulty with the shift in perspective:
Extract 33
78      (0.4) if I watch a film by video (0.7) it is cheaper than theatre (.) but if
79      I have a family (0.4) I choose (0.6) watching in my home (1.3)
80 E: right (.) right (0.5) would you like to be in a film? (1.1)
81 C: pardon (0.9)
82 E: would would you like to be in a film (0.8) like be an actress=
83 C: =ahhh (0.4) I never think about that! hhh (0.6) of course if I have a chance
84      (.) of course haha huh huh (1.4)
85 E: ha ha huh of course (0.7) right
(Part 1)


Extract 34
66      cinema maybe you: just uh uhm can uh can see once (0.9)
67 E: would you like to be in a film ((name omitted))? (0.9)
68 C: sorry? (0.2)
69 E: would you like to be in a film (2.1)
70 C: I: (0.3)
71 E: yes you (0.5)
72 C: no:: hh heh (0.7)
73 E: okay let's talk about shopping now (.)
(Part 1)

Further examples of questions which cause trouble are now provided. The question below is "Could you speculate on how much of today's technology will still be in use in 50 years' time?"
Extract 35
148 E: thank you (0.6) and could you speculate (.) on how much of today's
149      technology (0.7) w- may still be in use (.) in fifty years time (3.9)
150 C: sorry (0.8)
151 E: could you speculate on how much of <today's technology> (0.9) will
152      still be in use (.) in fifty years time (0.3)
153 C: in fifty years time eh (0.5) there will be more advance ((inaudible)) (0.9)
154      to ((inaudible)) (0.7) more things will be in the market (.) available (0.6)
155      and more easy life (0.3) there will be (0.8)
(Part 3)

For a similar example, see 0338, lines 188-201.

It is unclear whether the trouble is lexical in nature ("speculate") or whether the change in perspective to the imaginary is problematic.
Extract 36
175 E: could you speculate on (.) future developments in the transport
176      system (4.6)
177 C: eh (.) in what sense (0.6)
178 E: well what do you think we're likely to see in the future (.) how will
179      people travel (1.1)
180 C: eh (0.8)
181 E: no (.) any (0.6) further developments (1.0)
182 C: normally eh (.) the development could be made in the (0.7) in cars side
183      of the (0.3) transport (0.6) that eh (0.3) cars in more (.) fuel
184      economised (0.3) and eh (.) pollution aspect can be (0.3)
185 E: mm=
(Part 3)

The question in Extract 36 is slightly different from the preceding one; however, it contains the same lexical item and the same imaginary perspective. In Extract 37, the scripted question is "Can we talk about your childhood? Are you happy to do that?"
Extract 37
63 E: mm hm (0.9) now can we talk about your childhood (0.6) are you happy
64      to do that? (0.8)
65 C: eh (.) happy to repeat that? (.)
66 E: ah [eh]
67 C:    [ha]ppy to remember that=
68 E: =are you happy to talk about your childhood (.)
69 C: eh (0.6) [ee ]
70 E:          [now] where did you grow up (0.4)
71 C: yes (.) not too quite happy (0.4) because it was (0.4) eh actually divided
72      into: eh multiple different portions (0.7) eh like I was born somewhere
73      else (.) not where (0.3) where I am living now=
74 E: =mm so would you prefer to talk about some (0.3) something else? (0.8)
75 C: eh like (0.6) eh no no (.) I-I I mean to say (.) that I [don't ]
76 E:                                                         [you're] happy to
77      talk a[bout ]
78 C:        [yeah]
79 E: it (0.3) so where did you grow up (1.4)
(Part 1)

In the above extract, considerable trouble arises due to confusion as to what exactly "happy" is referencing: the candidate takes it to be referencing the topic of childhood, and starts explaining that some parts of his childhood were happy and others not. In line 74 we see that the examiner takes this reply to mean that the candidate is not happy to discuss his childhood. This appears to be the only frame in which candidates are asked for their consent to discuss the topic; elsewhere they clearly have no choice, and candidates may find this a source of confusion. In this section we have seen that a sequence of questions on a particular topic may appear unproblematic in advance of implementation but may nonetheless cause unforeseen trouble for candidates, especially if an unmotivated and unprepared shift in perspective of any kind is involved. Piloting of questions (if not already undertaken) would therefore be recommended.
3.3.2 Recipient design and rounding-off questions

In a number of instances in the data, trouble arises in relation to specific rounding-off questions in Part 2. Their purpose is stated as follows: "The rounding-off questions at the end of Part 2 provide a short response to the candidate's long turn and closure for Part 2 before moving on to Part 3. However, there may be occasions when these questions are inappropriate or have already been covered by the candidate, in which case they do not have to be used." (Instructions to IELTS Examiners, p 6). These questions are sometimes topically disjunctive in practice, as they may not fit into the flow of interaction and topic which has developed. "Does everyone you know use this piece of equipment?" is a rounding-off question to be used after a Part 2 talk on "a piece of equipment which you find very useful". In a number of cases the question is experienced as disjunctive and problematic by candidates. In the extract below the candidate has described a computer.
Extract 38
202 E: okay (0.3) indispensable (0.4) okay (0.4) does everyone you know use this piece of equipment (1.0)
203 C: pardon? (0.5)
204 E: does does everyone you know use this piece of equipment (0.6)
205 C: you mean my particular one? (0.7)
206 E: uh: not your I- but=
207 C: =a computer
208 E: right (1.0)
209 C: most people I know nowadays
210 E: mm hm
211 C: have access to a computer
212 E: mm hm
213 C: some use it more than others
(Part 2)


The above candidate has spoken fluently throughout the interview without repair, but encounters difficulty with this question, even after repetition. This may well be due to the scripted nature of the question: it is unusual that an object already referred to as "a computer" would later be referred to as "this piece of equipment". A shift in perspective is also evident in the question: previously they had been talking about the equipment which the candidate uses, and the shift is to whether other people s/he knows use the equipment.
Extract 39
92      (0.2) or er (0.2) funny story (.) can make me er (0.3) erm er er (0.4)
93      can make me to relax
94 E: OK thanks (.) alright er does everyone you know er use the computer? (3.0)
95 C: actually er can you repeat please?
96 E: yeah (0.2) does every one (.) you know use the computer (6.3)
97 C: I think er computer is very useful for me (0.8) erm tend to computer (0.2)
98      I can er (2.3) er (2.3) I can er I can improve my language
99 E: uh hum, ok (.) so er do you enjoy using the computer?
100 C: yes I enjoy it very much
(Part 2)

In Extract 39, even after repetition, the candidate still does not understand the question. The examiner then switches to the other additional question, which is successfully answered. In the extract below the candidate (a doctor) has described a stethoscope.
Extract 40
257 C: =so that really convinced me that (.) this is a key instrument for us (0.6)
258      and [I ]
259 E:      [yes]
260 C: think it's really helpful in diagnosing the diseases (0.3)
261 E: right (0.3) thank you (0.7) em (.) eh does everyone you know use this
262      piece of equipment (0.3)
263 C: eh sorry? (0.8)
264 E: does everyone you know (0.5) use this piece of equi[pment]
265 C:                                                     [ah ] yes as I told
266      you that eh we (.) even in dramas and every person have eh
267      supposed to face a doctor som- eh (0.3) at one or the other time (0.6) so I
268      don't think so (.) that this is an instrument eh (0.3) which is not well
269      known by the other people (0.5)
(Part 2)

The candidate is a medical consultant and the piece of equipment he described is a stethoscope. The question is topically disjunctive and the candidate's answer (lines 265 ff) shows a degree of confusion with the function of the question. Clearly, a stethoscope is a piece of medical equipment and it is not possible that everyone he knows uses it. In Extract 41, the candidate is also a medical consultant and the piece of equipment he described is a colonoscope.
Extract 41
223      (0.3) so (.) em (0.3) we had (0.4) scope then (0.5) so it is used to help us
224      ((inaudible)) (0.2)
225 E: okay (.) thank you (0.3) and eh (0.3) does eh (0.6) anyone else you know
226      use this piece of equipment (0.9)
227 C: em (0.6) in eh (0.3) well (.) every eh (0.3) I think all the specialists the
228      (0.3) mm in eh (.) in EST as (0.3) they use them (.) and em (0.9) in our
229      hospital (.) I'm in charge of this (0.6) equipment because I'm the senior
230      doctor (0.4) I teach them to my junior doctors (0.2)
231 E: mm=
232 C: =and the doctors the medical people also use it (0.3) gastro-enterologists
233      (0.3)
(Part 2)

In terms of recipient design, then, the examiner's follow-up question (lines 225-6) seems very odd and disconnected from the previous flow of interaction. The candidate (who obtained a score of 8.0 and speaks extremely fluently elsewhere) shows definite signs of confusion in line 227. Clearly, a colonoscope is a highly specialised piece of equipment, and any question about whether other people use it is likely to sound strange. In this case we should perhaps just be grateful that the examiner did not ask the alternative rounding-off question "Do you enjoy using this piece of equipment?" Other instances of trouble in relation to rounding-off questions may be found in 0304, lines 117-121; 0589, lines 133-138; 0099, lines 120-126. We have seen that these rounding-off questions can appear disjunctive and actually create trouble when they are worded in such a way that they ignore the local context in which they are produced. We now examine three instances of examiners modifying the rounding-off question to provide good recipient design, which maintains the flow of the topic and interaction and avoids interactional trouble. In the extract below the candidate has described a mobile phone.
Extract 42
121      people can contact you. (0.5) anytime (0.7) because you use (.) your own
122      cell phone (0.5) and this is the big (.) advantage of mobile phone (0.4)
123      and that's why (.) I use to prefer it ((inaudible)) (0.8)
124 E: so (0.5) um (1.7) does everyone you know carry a mobile phone now?
125      (2.4)
126 C: just not (.) not much (1.2) mm lot of people (0.3) lot of people are not
127      carrying the mobile phone (0.4) but (0.9) eh what eh (0.3) in now (.) it's
128      eh (0.4) thirty or forty percent (0.8) mm of people who work in offices (.)
129      and who are working in a marketing and (0.3) other places (.) they use
(Part 2)

In Extract 42 the examiner adapts the rounding-off question to the sequential environment, or flow of topic, and this proves effective in smoothly continuing and rounding off the topic, as well as enabling the candidate to understand the question and provide an appropriate answer.
Extract 43
160      writing skills (0.7) and it also helps you i::n improving your intelligence
161      and doing other things (0.6)
162 E: mm hm okay thank you (2.7) does everyone you know (.) in your
163      family enjoy (.) writing? (0.9)
164 C: yes I do my elder sister is: uh: working (0.2) for a newspaper which is
165      called Times of India (0.3)
166 E: mm hm (0.2)
(Part 2)

In Extract 43 the candidate has described a pen. Again the question is adapted to the flow of interaction and to the candidate's circumstances. Hence, the candidate is able to develop the topic very smoothly.
Extract 44
118      the plough is used to (.) its not very simple (.) its not very
119      sophisticated (.) but we call it appropriate technology (.) so it can be
120      used (.) I'm sure it's very widely used in Botswana (.) because it's always
121      pulled by oxen (.) they are pulled by oxen (.) needed to (.)
122 E: does everyone you know use a plough like that to (.) in the village
123      where you live?
124 C: er (.) I could say sixty percent of people use the plough (.) because
125      they can not afford to pay for tractor
(Part 2)

In Extract 44 the candidate has described a plough. The question is adapted to include the specific item of equipment and a specific location, and the candidate is able to provide an answer without trouble. In each of the three examples above, the examiners have used the name of the equipment rather than "piece of equipment" to refer to it, and in two cases the examiners have adapted the question to what they have learnt during the test of the candidate's personal and local circumstances. There is thus a case for training examiners in how to adapt the rounding-off questions slightly to fit seamlessly into the previous flow of the interaction. The training could include some of the examples given above, explain the topic disjunction problems which can arise with unmodified rounding-off questions and provide examples of questions which have been successfully adapted to topic flow. Training should also stress that the questions are optional and that in some instances it might not be possible to adapt them to the flow of the interaction at all.

4 ANSWERS TO RESEARCH QUESTIONS

The main research question is: How is interaction organised in the three parts of the Speaking Test?

The organisation of turn-taking, sequence and repair is tightly and rationally designed in relation to the institutional goal of ensuring valid and reliable assessment of English speaking proficiency. In general, the interaction is organised according to the instructions for examiners. In Part 1, candidates answer general questions about a range of familiar topic areas. In Part 2 (Individual long turn) the candidate is given a verbal prompt on a card and is asked to talk on a particular topic; the examiner may ask one or two rounding-off questions. In Part 3 the examiner and candidate engage in a discussion of more abstract issues and concepts which are thematically linked to the topic prompt in Part 2. The overwhelming majority of tests adhere very closely to examiner instructions. The test is intended to provide variety in terms of task type and patterns of interaction, and in general this is achieved. However, the interaction is very restricted in ways detailed below.

How and why does interactional trouble arise and how is it repaired by the interactants?

There are two basic ways in which interactional trouble may arise: either a speaker has trouble in speaking (self-initiated repair), or something the other co-participant uttered is not heard or understood properly (other-initiated repair). In the interviews analysed, trouble generally arises for candidates when they do not understand questions posed by examiners. In these cases, candidates usually initiate repair by requesting question repetition. Occasionally, they ask for a re-formulation or explanation of the question. Sometimes interactional trouble can be created (even for the best candidates) by questions which are topically disjunctive, and a number of examples of this are provided. Examiners very rarely initiate repair in relation to candidate utterances, even when these contain linguistic errors or appear to be incomprehensible. This is because the institutional brief is not to achieve intersubjectivity, nor to offer formative feedback; it is to assess the candidates' utterances in terms of IELTS bands. Therefore, a poorly-formed, incomprehensible utterance can be assessed and banded in the same fashion as a perfectly-formed, comprehensible utterance. Repair initiation by examiners is not rationally necessary from the institutional perspective in either case. In this way, Speaking Test interaction differs significantly from interaction in classrooms and university settings,

IELTS Research Reports Volume 6

31

6. The interactional organisation of the IELTS Speaking Test Paul Seedhouse + Maria Egbert

in which the achievement of intersubjectivity is highly valued and assumed to be relevant at all times. In those institutional settings, the transmission of knowledge or skills from teacher to learner is one goal, with repair being a mechanism used to ensure that this transmission has taken place.

What types of repair initiation are used by examiners and examinees and how are these responded to?

Repair policy and practice vary in the different parts of the test. Examiners have training and written instructions on how to respond to repair initiations by candidates. The examiner rarely initiates repair. Candidates initiate repair in relation to examiner questions in a variety of ways. In response to a candidate's repair initiation, examiner instructions are to repeat the test question once only, but not to paraphrase or alter it. The vast majority of examiners follow the instructions, but there are exceptions. The organisation of repair in the Speaking Test is highly constrained and inflexible; it is rationally designed in relation to the institutional attempt to standardise the interaction and thus to assure reliability. This results in a much narrower choice of repair options. In general, then, the organisation of repair in the IELTS Speaking Test differs very significantly from that described as operating in ordinary conversation (Schegloff, Jefferson & Sacks, 1977), L2 classroom interaction (Seedhouse, 2004) and university interaction (Benwell, 1996; Benwell & Stokoe, 2002; Stokoe, 2000), the latter being the target form of interaction for most candidates. In the data, the organisation of repair in the IELTS Speaking Test overwhelmingly follows the instructions for IELTS examiners in Part 1, which specify that the question can only be repeated once and may not be explained or reformulated.

What role does repetition play?

In Part 1, examiners are instructed to repeat the question once and then move on. In the vast majority of cases, examiners adhere to this policy. Occasionally, however, some examiners do not follow these instructions, and the consequences of such repeated repetition vary.

What is the organisation of turn-taking and sequence?

The overall organisation of turn-taking and sequence in the Speaking Test closely follows the examiner instructions. Part 1 is a succession of question-answer adjacency pairs. Part 2 is a long turn by the candidate, started off by a prompt from the examiner and sometimes rounded off with questions. Part 3 is another succession of question-answer adjacency pairs. This tight organisation of turn-taking and sequence is achieved in two ways. Firstly, the examiner script specifies this organisation, eg 'Now, in this first part, I'd like to ask you some questions about yourself' (Examiner script, January 2003). Secondly, many candidates have undertaken training for the Test, and in some cases this will have included a mock Speaking Test.

What is the relationship between Speaking Test interaction and other speech exchange systems such as ordinary conversation, L2 classroom interaction and interaction in universities?

Speaking Test interaction is a very clear example of goal-oriented institutional interaction and is very different to ordinary conversation; it should be noted here that the IELTS test developers' primary aim was not to develop a Speaking Test in which the interaction mirrors ordinary conversation. Sacks, Schegloff & Jefferson (1974) speak of a linear array of speech-exchange systems.
Ordinary conversation is one polar type and involves total local management of turn-taking. At the other extreme (which they exemplify by debate and ceremony) there is pre-allocation of all turns. Clearly, Speaking Test interaction demonstrates an extremely high degree of pre-allocation of turns by comparison with other institutional contexts (cf Drew & Heritage, 1992). Not only are the pre-allocated turns given in the format of prompts, but the examiner also reads out scripted prompts (with some flexibility allowed in Part 3). So not only the type of turn but also the precise linguistic formatting of the examiner's turn is pre-allocated for the majority of the test.


The repair mechanism is pre-specified in the examiner instructions; the organisation of turn-taking and sequence is implicit in these. There are also constraints on the extent to which topic can be developed. The interaction also exhibits considerable asymmetry. Only the examiner has the right to ask questions and allocate turns; the candidate has the right to initiate repair, but only in the prescribed format. Access to knowledge is also highly asymmetrical. The examiner knows in advance what the questions are, but the candidate may not. The examiner has to evaluate the candidate's performance and allocate a score, but must not inform the candidate of this evaluation. Overall, the examiner performs a gate-keeping role in relation to the candidate's performance. Restrictions and regulations are institutionally implemented with the intention of maximising fairness and comparability.

There are certain similarities with L2 classroom interaction, in that the tasks in all three parts of the test are ones which could potentially be employed in L2 classrooms. Indeed, task-based assessment and task-based teaching have the potential to be very closely related (Ellis, 2003). There are sequences which occur in some L2 classrooms, for example when teachers have to read out prepared prompts and learners have to produce responses. However, there are many interactional characteristics in the Speaking Test which are very different to L2 classroom interaction. In general, tasks tend to be used in L2 classrooms for learner-learner interaction in pairs or groups, with the teacher acting as a facilitator, rather than for teacher-learner interaction. Another difference between Speaking Test interaction and L2 classroom interaction is that the teacher evaluation moves common in L2 classrooms are generally absent in the Speaking Test. Also, the options for examiners to conduct repair, explain vocabulary, help struggling students or engage with learner topics are very restricted by comparison with those used by teachers in L2 classroom interaction (Seedhouse, 2004).

As far as university contexts (Benwell, 1996; Benwell & Stokoe, 2002; Stokoe, 2000) are concerned, interaction in seminars, workshops and tutorials appears to be considerably less restricted and more unpredictable than that in the Speaking Test. Seminars, tutorials and workshops are intended to allow the exploration of subject matter, topics and ideas and to encourage self-expression. In the Speaking Test, intersubjectivity does not need to be achieved and language is produced for the purpose of assessment. However, there are some similarities. It is very likely that students will be asked questions about their home countries or towns and about their interests when they start tutorials in their universities.

To summarise, Speaking Test interaction is an institutional variety of interaction with three sub-varieties, namely the three parts of the Test. It is very different to ordinary conversation, has some similarities with some sub-varieties of L2 classroom interaction and some similarities with interaction in universities. Speaking Test interaction has some unique interactional features; these may, however, occur in other language proficiency interviews.

What is the relationship between examiner interaction and candidate performance?

The overall impression is that the overwhelming majority of examiners treat candidates fairly and equally.
Where there are exceptions, it is because some examiners do not follow instructions and may thereby give an advantage to some candidates. The data also suggest some kind of correlation between test score and the occurrence of other-initiated repair, ie trouble in hearing or understanding on the part of the candidate: in interviews with high test scores, candidates initiate few or no repairs on the talk of the examiner.

To what extent do examiners follow the briefs they have been given?

The vast majority of examiners follow the briefs and instructions very closely.


In cases where examiners diverge from briefs, what impact does this have on the interaction?

Where examiners do not follow instructions, they often give an advantage to some candidates in terms of their ability to produce an answer. Some examples of examiners aiding candidates in this way are provided above.

How are tasks implemented? What is the relationship between the intended tasks and the implemented tasks, between the task-as-workplan and the task-in-process?

There is an extremely close correspondence between intended and implemented tasks. This is in contrast to the common finding in language teaching that there is often a major difference between task-as-workplan and task-in-process (Seedhouse, 2005). One key difference, however, is that L2 classroom tasks generally involve learner-learner interaction.

How is the organisation of the interaction related to the institutional goal and participants' orientations?

Turn-taking, sequence and repair are logically organised in relation to the institutional goal of ensuring valid and reliable assessment of English speaking proficiency, with standardisation being the key concept in relation to the instructions for examiners. CA work was influential in the design of the revised IELTS Speaking Test, introduced in 2001, and specifically in the standardisation of examiner talk:

'Lazaraton's studies have made use of conversation analytic techniques to highlight the problems of variation in examiner talk across different candidates and the extent to which this can affect the opportunity candidates are given to perform, the language sample they produce and the score they receive. The results of these studies have confirmed the value of using a highly specified interlocutor frame in Speaking Tests which acts as a guide to assessors and provides candidates with the same amount of input and support.' (Taylor, 2000, pp 8-9)

How are the roles of examiner and examinee, the participation framework and the focus of the interaction established?

These are established in the introduction section to the test. The examiner has a script to follow, which includes verifying the candidate's identity, performing introductions and stating the participation framework and focus of the interaction. Once established, the participation framework is sustained throughout the interview and oriented to by both interactants. The examiner is also the one who closes the encounter.

How long do tests last in practice and how much time is given for preparation in Part 2?

The documentation states that tests will last between 11 and 14 minutes. In the sample data, the shortest test lasted 12 minutes 16 seconds (0176) and the longest 17 minutes 1 second (0199). These times include the approximately one-minute preparation time for the long turn. The actual length of long-turn preparation time varied from 41.1 seconds (0678) to 98.2 seconds (0505).

5 CONCLUSION

5.1 Implications and recommendations: test design and examiner training

In this final section, we conclude with implications and recommendations in relation to test design and examiner training, followed by suggestions for further research. We employed Richards and Seedhouse's (2005) model of description leading to informed action in relation to applications of CA. Here we summarise the recommendations for test design and examiner training which have emerged from analysis of the data. The logic of the Speaking Test is to ensure validity by standardisation of examiner talk. Therefore, most of these recommendations serve
to increase standardisation of examiner conduct and, concomitantly, equality of opportunity for candidates. Other suggestions aim to make the interview more similar to everyday conversation where appropriate.

We would recommend that a statement on repair rules be included in documentation for students, eg 'When you don't understand a question, you may ask the examiner to repeat it. The examiner will repeat the question only once. No explanations or rephrasing of questions will be provided.' Examiners might also state these rules during the opening sequence. It may also be helpful for candidates to know that examiners will not express any evaluations of their utterances.

We recommend, in the interests of consistency and standardisation, that examiner instructions should specify that 'okay' is used in the receipt slot to mark transition to the next question and that 'mm hm' is used for back-channelling, particularly in Part 2.

A sequence of questions on a particular topic may appear unproblematic in advance of implementation. However, it may nonetheless be a cause of unforeseen trouble for candidates, especially if an unmotivated and unprepared shift in perspective of any kind is involved. Piloting of questions (if not already undertaken) to check for this is therefore recommended.

There is a case for training examiners in how to adapt the rounding-off questions slightly to fit seamlessly into the previous flow of the interaction. The training could include some of the examples given above, explain the topic disjunction problems which can arise with unmodified rounding-off questions and provide examples of questions which have been successfully adapted to topic flow. Training should also stress that the questions are optional and that in some instances it might not be possible to adapt them to the flow of the interaction at all.

Although the vast majority of examiners follow instructions, some do not, as we have seen above. Examiner training could include examples from the data of examiners failing to follow instructions regarding repair, repetition, explaining vocabulary, assisting candidates and evaluation. These examples would demonstrate how such failures may compromise test validity.

The question 'What shall I call you?' created significant problems, and it is recommended that this question be deleted. The issue of how candidates and examiners address each other is a cultural one and may be adapted to local conventions.

We recommend that the IELTS test developers consider what kind of variation in test and preparation duration is acceptable, since candidates may in some cases derive benefit from disproportionate preparation time. As the training material states: 'Examiners must stick to the correct timing of the test both for standardisation and fairness to candidates and also for the efficient running of tests in centres' (IELTS Examiner Training Material, 2001, p 6).

5.2 Suggestions for further research

This study has not systematically correlated candidate categories in the database (gender, test centre, test score) with patterns of interaction. For the test developers it may be helpful to establish whether particular patterns of communication and evidence of interactional trouble are related to any of the above categories. For example, it may be found that candidates from particular regions of the world repeatedly run into trouble in relation to a particular interactional sequence, topic or question in the Speaking Test.
Comparing the interactional patterns associated with low-scoring candidates with those associated with high-scoring candidates may also be revealing. Furthermore, such research could build on existing IELTS research, such as O'Loughlin's (2000) study of the variable of gender in relation to the oral interview. Relationships between these categories and patterns of communication may form the basis of further research studies.


We tentatively suggest that there appears to be a correlation between test score and the incidence of interactional trouble and repair sequences. This could be researched further.

Current repair policy is that only verbatim repetitions of the question are allowed in Part 1. Further research could examine the consequences of allowing the examiner a greater variety of repair activities.

The Speaking Test is predominantly used to assess and predict whether a candidate has the ability to communicate effectively on programmes in English-speaking universities. A vital area of research is therefore the relationship between the IELTS Speaking Test as a variety of institutional discourse and the varieties to which candidates will be exposed when they commence their university studies. Our study has shown the interactional organisation of the Speaking Test to have certain idiosyncrasies, particularly in the organisation of repair. These idiosyncrasies derive rationally from the principle of ensuring standardisation. The key question arising from this study is how the organisation of interaction in the Speaking Test might be modified to make it more similar to interaction in the university environment while not compromising the principle of standardisation.


REFERENCES

Atkinson, JM and Heritage, JC (eds), 1984, Structures of Social Action: Studies in Conversation Analysis, Cambridge University Press, Cambridge
Benwell, B, 1996, The discourse of university tutorials, unpublished PhD dissertation, University of Nottingham, UK
Benwell, B and Stokoe, EH, 2002, Constructing discussion tasks in university tutorials: shifting dynamics and identities, Discourse Studies, vol 4, pp 429-453
Brown, A and Hill, K, 1998, Interviewer style and candidate performance in the IELTS Oral Interview, International English Language Testing System Research Reports, vol 1, pp 1-19
Drew, P, 1992, Contested evidence in courtroom cross-examination: the case of a trial for rape, in Talk at Work: Interaction in Institutional Settings, eds P Drew and J Heritage, Cambridge University Press, Cambridge, pp 470-520
Drew, P and Heritage, J (eds), 1992a, Talk at Work: Interaction in Institutional Settings, Cambridge University Press, Cambridge
Drew, P and Heritage, J, 1992b, Analyzing talk at work: an introduction, in Talk at Work: Interaction in Institutional Settings, eds P Drew and J Heritage, Cambridge University Press, Cambridge, pp 3-65
Egbert, M, 1998, Miscommunication in language proficiency interviews of first-year German students: a comparison with natural conversation, in Talking and Testing: Discourse Approaches to the Assessment of Oral Proficiency, eds R Young and A He, Benjamins, Amsterdam, pp 147-169
Ellis, R, 2003, Task-based language learning and teaching, Oxford University Press, Oxford
Goodwin, C, 1986, Between and within: alternative sequential treatments of continuers and assessments, Human Studies, vol 9, pp 205-218
He, A, 1998, Answering questions in language proficiency interviews: a case study, in Talking and Testing: Discourse Approaches to the Assessment of Oral Proficiency, eds R Young and A He, Benjamins, Amsterdam, pp 147-169
Heritage, J, 1997, Conversation analysis and institutional talk: analysing data, in Qualitative Research: Theory, Method and Practice, ed D Silverman, Sage, London, pp 161-182
Instructions to IELTS Examiners, 2001, Cambridge ESOL
IELTS Examiner Training Material, 2001, Cambridge ESOL
IELTS Handbook, 2005, Cambridge ESOL
IELTS Speaking Test: Examiner script to accompany tasks, 2003, Cambridge ESOL
Kasper, G and Ross, S, 2001, Is drinking a hobby, I wonder: other-initiated repair in language proficiency interviews, paper presented at the American Association of Applied Linguistics, St Louis, MO
Kasper, G and Ross, S, 2003, Repetition as a source of miscommunication in oral proficiency interviews, in Misunderstanding in Social Life: Discourse Approaches to Problematic Talk, eds J House, G Kasper and S Ross, Longman/Pearson Education, Harlow, UK, pp 82-106


Lazaraton, A, 1997, Preference organisation in oral proficiency interviews: the case of language ability assessments, Research on Language and Social Interaction, vol 30, pp 53-72
Lazaraton, A, 2002, A qualitative approach to the validation of oral language tests, UCLES/Cambridge University Press, Cambridge
Levinson, S, 1992, Activity types and language, in Talk at Work: Interaction in Institutional Settings, eds P Drew and J Heritage, Cambridge University Press, Cambridge, pp 66-100
Mehan, H, 1979, Learning lessons: social organisation in the classroom, Harvard University Press, Cambridge, Mass
Merrylees, B, 1999, An investigation of speaking test reliability, International English Language Testing System Research Reports, vol 2, pp 1-35
O'Loughlin, K, 2000, The impact of gender in the IELTS Oral Interview, International English Language Testing System Research Reports, vol 3, pp 1-28
Richards, K and Seedhouse, P, 2005, Applying conversation analysis, Palgrave Macmillan, Basingstoke
Sacks, H, Schegloff, E and Jefferson, G, 1974, A simplest systematics for the organisation of turn-taking in conversation, Language, vol 50, pp 696-735
Schegloff, EA, Jefferson, G and Sacks, H, 1977, The preference for self-correction in the organisation of repair in conversation, Language, vol 53, pp 361-382
Seedhouse, P, 2004, The interactional architecture of the language classroom: a conversation analysis perspective, Blackwell, Malden, MA
Seedhouse, P, 2005, Task as research construct, Language Learning, vol 55, no 3, pp 533-570
Slater, P, Millen, R and Tyrie, L, 2003, IELTS on track, Language Australia, Sydney
Stokoe, EH, 2000, Constructing topicality in university students' small-group discussion: a conversation analytic approach, Language and Education, vol 14, pp 184-203
Taylor, L, 2000, Issues in speaking assessment research, Research Notes, vol 1, pp 8-9
Taylor, L, 2001a, Revising the IELTS Speaking Test: developments in test format and task design, Research Notes, vol 5, pp 3-5
Taylor, L, 2001b, Revising the IELTS Speaking Test: retraining IELTS examiners worldwide, Research Notes, vol 6, pp 9-11
Taylor, L, 2001c, The paired speaking test format: recent studies, Research Notes, vol 6, pp 15-17
Westgate, D, Batey, J, Brownlee, J and Butler, M, 1985, Some characteristics of interaction in foreign language classrooms, British Educational Research Journal, vol 11, pp 271-281
Wigglesworth, G, 2001, Influences on performance in task-based oral assessments, in Researching Pedagogic Tasks: Second Language Learning, Teaching and Testing, eds M Bygate, P Skehan and M Swain, Pearson, Harlow, pp 186-209
Young, RF and He, A (eds), 1998, Talking and testing: discourse approaches to the assessment of oral proficiency, Benjamins, Amsterdam


APPENDIX 1: TRANSCRIPTION CONVENTIONS


A full discussion of CA transcription notation is available in Atkinson and Heritage (1984). Punctuation marks are used to capture characteristics of speech delivery, not to mark grammatical units.

[                   indicates the point of overlap onset
]                   indicates the point of overlap termination
=                   a) turn continues below, at the next identical symbol; b) if inserted at the end of one speaker's turn and at the beginning of the next speaker's adjacent turn, it indicates that there is no gap at all between the two turns
(3.2)               an interval between utterances (3 seconds and 2 tenths in this case)
(.)                 a very short untimed pause
Word                underlining indicates speaker emphasis
e:r the:::          indicates lengthening of the preceding sound
-                   a single dash indicates an abrupt cut-off
?                   rising intonation, not necessarily a question
!                   an animated or emphatic tone
,                   a comma indicates low-rising intonation, suggesting continuation
.                   a full stop (period) indicates falling (final) intonation
CAPITALS            especially loud sounds relative to surrounding talk
° °                 utterances between degree signs are noticeably quieter than surrounding talk
↑ ↓                 indicate marked shifts into higher or lower pitch in the utterance following the arrow
> <                 indicate that the talk they surround is produced more quickly than neighbouring talk
( )                 a stretch of unclear or unintelligible speech
((inaudible 3.2))   a timed stretch of unintelligible speech
(guess)             indicates transcriber doubt about a word
.hh                 speaker in-breath
hh                  speaker out-breath
hhHA HA heh heh     laughter transcribed as it sounds
→                   arrows in the left margin pick out features of especial interest

Additional symbols

ja ((tr: yes))      non-English words are italicised, and are followed by an English translation in double brackets
[gibee]             in the case of inaccurate pronunciation of an English word, an approximation of the sound is given in square brackets
[ ]                 phonetic transcriptions of sounds are given in square brackets
< >                 indicate that the talk they surround is produced slowly and deliberately (typical of teachers modelling forms)
C:                  Candidate
E:                  Examiner


APPENDIX 2: A LOW SCORE OF BAND 3.0 ON THE IELTS SPEAKING MODULE

Part 1
-6 -5 -4 -3 -2 -1 0 1 2 3 4 4b 4c 4d 5 6 7 7b 8 8b 9 10 10b 10c 11 12 13 13b 13c 13d 14 15 16 17 17b 18 19 19b 20 20b 21 22 23 23b 24 25 25b 26 27 29 30 31 32 32b 32c 33 33b 34 E: ehm (.) this is the speaking module, for the international English language testing system, .h conducted on the twenty eighth of january, ehm two thousand an three,? .h thee ca:ndidate is ((first name,)) ((last name,)) candidate number ((number))= ((number))=((number))=((number.)) .hh a:nd the interviewer is ((first name))= ((last name:.)) (1.0)/((clicking sound probably from tape being switched on and off)) .hh well good evening=my name is ((first name)) ((last name))= can you tell me your full name please.= =yes ((first name,)) ((last name.)) .hh ah: a:n[d, [ghm= =can you tell me er, what shall I ca:ll you. (1.5) C: E: E: C: E: C: E: E: C: E: C: C: C: E: C: E: C: E: e:r (1.0) can you repeat the: er the question[(s),? [( ) what do you, (0.2) your first name? do you use [((last name)) [( ) ((first name)) ((first name)). [((first name)) (you want me to call you) ((first na[me)) [yes ((first name)) [yes right. ((with forced sound release)) .hh and can I see your identifi<cation: card please.> (0.5) .h[hm[an ID. .hh er: not a student card=do you have an I [D card? [e::::m no::=in, (0.2) tch! no. (0.5) tch! er I dont er (0.2) .h I dont have (1.3) the: (1.0) administration,=er: the day. m:: I understa:nd but you erm .h need to ha:ve, a: tch! (0.2) your official, yes ID card. ye:s. (1.5) .hh thank yer .hh erm in this first part Id like to s=ask some questions about yourself. .hh em >well first of all can you tell me where youre< from .h yes er: hh I go: eh: hh e:r I live er to:, .h to Kosa:ni,? (0.2) [(Im) from Kosani. [I am from Kosani. ookay tch! now! .hh uhm can we talk about erm where you live. (0.5) could you describe the city or the town that you live in now. .h er yes Id li- I would like eh .hh I (0.3) eh I very much eh in Thesaloniki,? (0.5) you live in <The[saloniki> eh=ok .hh could you describe where you live? [yes. er yes er (0.5) I would like er (0.2) whe:re you live. >can you describe it please.< ((pitch lowered gradually)) erm (1.2) >where do you live in Thesaloniki:. < ((pitch lowered more)) (0.2) where? erm: tch! in the centre. (0.2) tell me: eh describe where you live.=uh hum,?



((Note: While we did the final check on the transcription, the tape got damaged at this stretch.))








(1.0) erm (1.0) I would live in erm the centre, (.) erm, (0.5) Im: er:, h (0.4) one years er, (.) one years in Thesaloniki, I see .hh what do you li:ke:, about living he:re tch! erm (3.0) ( ) I would like Thesaloniki:, (0.4) er because erm (2.0) because it have eh it has eh (.) er very much er eh people, (0.5) and: eh: and clu:bbing, and er [(1.0) [((sound of paper shuffling)) m hm .h eh is, are there things you dont like about it? (.) ((first name)) wh:at? <are there things you dont like about it?> yes. (0.5) er (1.8) er I guess I I do:: (0.5) I do like er: Thesaloniki,? <uh hum .hh how could you improve (.) the city? > (0.5) hm: (2.5) I improving Thesaloniki: (3.0) umhh (1.0) tch! all right. lets move on to (.) the topic of food and restaurants= what kind of food do you like to eat. er: (0.8) I(wou) like er: (0.8) own food in the restaurant, because er (.) eh what kind of food do you like to eat. er (2.0) tch! erm (.) yes er: I would like eh to eata the restaurant, (0.3) and er:, (0.5) .hh <<is there any food you dont like?>> ((flat intonation)) (0.5) er: (0.8) no there isnt er a (.) a restaurant er (0.6) erm (3.0) m hm (.) um (0.3) <what are some of the advantages and disadvantages of eating in a restaurant.> (0.5) erm (5.0) tch! er the advantage er (0.2) eh of eating er in a restaurant,? (0.4) er because: er : (0.7) uh hm (0.5) uhmhh (0.8) whats the good thing about eating in a restaurant. ((soft voice)) (5.0) n:t. ^lets talk about fi:lms. (.) do you enjoy wa:tching films?= =((first name))? er yes. eh Im like er: (.) er watching film,? eh hm,? er: because eh: (.) erm (.) because watching er to=er=in Thesaloniki, and er, and (0.2) and er, like in Thesaloniki: okay .h how often <do you watch (.) films.> (0.5) umhh (0.2) how often ((whisper voice)) how often (0.2) .hh erm (6.0) I often er (2.0) I often watch er (.) er film, in er (6.0) ((sing song voice all turn)) uh hum all right (0.5) now lets move on to the next part (1.0) a:nd I::, I am going to give you a topic, (0.2) and Id like you to talk about it for one or two minutes (.) before you talk you will have one minute to think: about what you are going to say:. .hh and you can make some notes if you wish. (.) all right?=do you understand? ((high pitch)) yes. okay so heres some paper, and heres a pencil, (.) eh to



make your notes, (0.5) and er heres your topic: (0.5) here we are (.) Id like you to describe a trip: (.) that you once went on:. ((sound of tape recording being switched off))

Part 2 (Counter 119)


75 76 77 78 78b 79 80 81 82 83 84 84b 84c 85 86 86b 86c 86d 87 88 88b E: a:ll right now remember you have >one or two minutes for this so dont worry if I stop you .h and Ill tell you when the time is up,< can you start speaking (now) please,?= =yes (0.2) .hh er I travelled in er (0.5) I travelled in er inyana, .hh iryana is very:, (.) very like, and er (.) I went (.) and I went er to: (0.8) I went there for er the job, (.) and er (3.0) and er (5.0) did you enjoy your trip? or not? how did you go there? you went to Indiana.= =yes.= =how did you travel there? (6.0) did you go by train?=did you go by plane?=how [did you [er, I went er tch! (.) to: the bus (0.2) and erm (.) I: went erm to my parents, (0.2) em (0.2) mhm did you enjoy the trip? (0.2) ((first name?)) (3.0) er yes I:: (2.0) I enjoy the: (5.0)


((Note: While we did the final check on the transcription, the tape got damaged at this stretch))


Part 3 (Counter 143)


89 89b 90 90b 90c 91 91b 92 93 93b 93c 93d 94 95 96 97 97b 98 99 99b 100 100b 101 102 103 104 103 (0836) E: C: E: E: C: E: E: uh hum .h okay: .hh can I have the task (.) card (.) back=then: uh hum (.) wha:t did you like (.) mo:s[t [grh ((clears throat)) about it. (0.7) what was the be:st: thing:. (1.0) um (3.0) m:h::: (0.3) tch! o:kay (0.5) .hh ^erm^ (0.8) tch! weve been talking about a trip that you went on=and Id like to discuss with you: one or two more general questions related to this.= =.hhh lets think about erm (.) travel and transport .hhh whats the most popular way to travel .hh a long distance: (.) <in your country?> (0.7) um: (5.0) er the travel ermh, (3.0) .hh er I would like er to:: (0.5) to transport er (1.0) tch! (1.0) [eh for the: (6.0) [mhm uh hum .hhh how do you think, were going to travel in the future? .hh hh (0.2) erm: yes erm: (0.2) I believe that er: (0.5) er the travel er in the future, mhm,? er but er: but I dont er (1.0) erm (0.8) but I dont eh know: to: .hh to the to:wn:, the [city:, [a:ll right.:: (.) o:kay:: .hh thank ((tape cut off here))




APPENDIX 3: A HIGH SCORE OF BAND 9.0 ON THE IELTS SPEAKING MODULE

Part 1
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 E: C: E: C: E: good afternoon (1.3) uh:m (3.4) can you tell me your full name please? (0.4) ((name omitted)) (0.2) thanks and uh:: (1.5) can you tell me where youre from? (0.3) Im from Trinidad ((inaudible)) (0.3) okay (0.4) can I see your I.D please (4.7) thanks (3.3) thats fine thank you (1.7) now in this first part Id like to ask you some questions about yourself (0.6) uh:: lets talk about (0.4) uh:m (3.1) what you do: (0.3) do you work or are you a student (0.3) Im medical student hh (0.2) Im ((inaudible)) graduate in Ma:y of this year (.) s(hh)o (0.3) o[kay] [not]too long (0.5) okay (0.7) so: uh (1.1) tell me about your studies (1.0) well (.) I originally started in Grenada (0.3) we do: two years of basic sciences (.) anatomy physiology etc (0.4) hh then we do two years of clinical studies either in England or the States or a combination of both (0.5)also <the students are American so they tend to do most of the studies in the States< (0.7) uh::m I chose I originally: (.) scheduled to start in New York but that didnt work out so I actually came to England (0.4) but Im actually glad I did because (.) medical system is a lot dif- it it American system is much different in((inaudible)) (0.5) whereas English system is more compatible so: I: consider its a good move to come to Engla(hh)nd (0.3) okay so what do you like most (0.3) about your studies (1.7) uh the variety (0.4) I think in: medicine especially because no: two patients will present the same way (0.4) and i- its always a challenge to think about what the diagnosis is (0.3) and uh ways in which you can (.) confirm the diagnosis basically (0.2) okay (0.4) are there any things you dont like about your studies? (2.7) well personally the fact tha:t (.) if I read something I have to read it again you know to remember it (.) its just a lot (.) the volume of work is very very large so its just (0.2) time management (0.2) and learning to deal with the: (0.2) (volume of work) (0.3) okay (0.7) so uh:: what qualifications or certificates (0.8) do you hope to get (1.3) well (1.1) after I: (0.5) get my degree in May Im hoping to:: (1.3) uh:m >probably work in England for a while and in order to do that I have to do further exams< hh (0.5) unfortunately bu:t uh:m (1.1) hh then I just hope to: (0.6) progress further i- in my field ((inaudible)) (0.2) okay okay (0.7) lets uh move on to talk about some of the activities you (0.6) enjoy in your free time (0.7) when do you have free time? (1.3) rarely hh heh (0.3) hh uh::m (0.5) I try to pace myself generally (.) in terms of: getting a lot of work done during the week so I ca:n at least relax a bit at the weekends (0.5) I like to:: look at movies go shopping: hh heh (0.5) uhm have a chat with friends and (0.6) okay and uh::m (1.5) what free time activities are most popular where you live? (1.6) probably going to the beach definitely cause its always warm hh (.) mm [hm] [th ]ats what I miss most actually (0.4) uh::m (2.3) I would say thats probably the most po[pular] [ o:]kay (0.4) so how important is free time in peoples lives? (0.6) very ve(hh)ry important (0.7) hh I can (1.8) well personally uh:m (0.7) because I always have so much work to do so much studying to do its always so important for me (.) to be able to relax a bit and then come back refreshed so I can study (.) some more ((inaudible)) (0.2) I think








59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86




its very very important to have free time (0.3) okay okay (0.7) uh::m (0.8) okay can we talk about (.) your childhood are you happy to do that? (0.3) yes? (no worries)= =okay (0.3) so where did you grow up (0.3) I grew up in Tobago (0.2) o:kay (0.5) uh was it a good place for children? (0.4) yes I think so hh HA (.) why? (0.2) uh::m (1.7) I think becau:se the society at ho:me (.) tends to stress a lot of family value (.) >I think thats very very important and looking back at my childhood now I realise just how important that was< (0.8) hh uh::m (0.6)hh I cant say I cant really compare myself to to:: (1.1) children growing up in other parts of the world just because I didnt experience it first hand (.)but I would definitely advocate (0.2) growing up in the West Indies (.) a (great dea(hh)l) (0.4) where did you usually play? (0.7) uh::m (1.7) well if you were at school then you would play at the (.) playground at school o:r (0.7) at home theres always space to run around the yard and things like that (.) or you could play on the beach: (0.3) oh okay okay (0.6) hh uh do you think childhood is different today (.) from when you were a child? (2.2) I think theres uh: uhm many differences yes because (1.6) uh:m (0.4) children nowadays are exposed a lot mo:re (0.8) uh:m (1.5) different influences basically because of the television internet things like that (0.5) so I think that tends to have a bigger impact on a child (0.7) in recent years compared to when I grew up (0.6)

Part 2
87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 E: okay (0.2) all right (cough) (4.3) okay now Im gonna give you a topic (.) and Id like you to talk about it (0.6) for one to two minutes and before you talk (0.3) youll have one minute to think about what youre gonna say (0.8) and you can make some notes if you wish (0.3) dyou understand? (0.2) yes (0.3) o:kay (0.4) so heres paper (.) and pencil (0.8) for making some notes (0.6) an:d (0.7) Id like you to describe a trip (0.6) that you once went on (33.8) okay? (0.2) yep (.) a:ll right (0.6) remember you have one to two minutes for this so dont worry if I stop you Ill tell you when the time is u[p] [o]kay (0.2) can you start speaking now please= =yeah (0.6) I remembe:r at the beginning of my medical school career (0.4) we were taken on uhm a boat trip to one of the smaller islands around Grenada (0.6) hh it was basically: a (0.5) ((inaudible)) trip (0.4) hh uh:m it was part of the orientatio:n (0.4) into: medical school life and into: life in Grenada: obviously (0.4) hh uh::m (1.1) uh I: think it took abou:t half an hour to get the:re (0.3) if I remember correctly hh (.) uh:m (0.5) there were lots of us there lots of the students (0.2) uh::m both (0.5) >students who were just starting medical school as well as those who were further into their medical school career< (0.7) an:d there was uhm lot of foo:d lots of drinks (.) we spent (.) most of the day on the beach) (.) in the sun in the water (0.6hh uh::m (1.5) it was: (.) but I cant say it was: (0.3) a big (0.8) culture change for me because coming from Tobago which is half an hour flying is (.) very very similar (0.3) but I just enjoyed the day out and (0.7) hh uh:m it always brings back good memories because then you remember all the free time that you had (.) hh heh (0.5) uh::m (1.4) I actually: (0.3) repeated the trip about: a year later (.) because that was my (0.4) last opportunity: (0.6) to uhm (0.2) see a bit of Grenada before leaving: (0.3) to start my ((inaudible)) (0.7)






uh::m (1.9) okay (0.2) okay= =hh heh (.) do you generally enjoy travelling? (0.5) yes I do although I havent actually travelled as much as I would like to yet (0.3) but hopefully after I start working (0.3)

Part 3
125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 (0389) E: C: E: o:kay okay (0.4) so weve been talking about a trip you (0.4) went on (0.3) y[eah] [an ]d Id like to discuss with you one or two more general questions related to this (0.7) lets consider first of all uhm (.) public and private transport can you describe (0.6) the public transport systems (0.3) in your country (1.4) well there are public buses (0.4) which tend to be cheaper than taxis (0.2) there is (0.6) uh:m a: taxi association which operates: basically from the airport to anywhere around the island (0.3) but in general if you (.) need a taxi >basically you just stand up on the side of the road and put out your hand< hh heh (0.2) for uhm (0.3) a taxi that comes along (0.5) hh uh::m (1.1) the bus system theres a central depot in the middle of town (0.2) so you can go there to purchase tickets and so (0.5) and there are various othe:r (0.7) stations around the island where you can also purchase tickets (0.4) okay (0.2) okay (0.6) uh:m (0.5) can you uh:: (1.4) speculate on (1.6) future developments (.) in transport systems? (3.8) currently there a:re talks to:: (.) expand (0.2) the public >transport systems to< inclu:de (0.5) what we call maxi taxis not (0.6) hh uh::m (0.7) not exactly public buses they tend to be smalle:r (0.4) they tend to charge more than the public buses (0.4) but uh >they can also hold more people than a taxi obviously so its its more economical in that way< (0.6) hh uh:m (0.8)Im not sure when exactly it will the whole system be put into place but (.)actually (0.7) the: (0.7) the:: (0.4) the plans for the: development of the system in Tobago: (.) can be modelled on ((inaudible)) in Trinidad because there are a lot more maxi taxis in Trinidad (0.2) ((inaudible)) (.) okay okay (0.5) can you uh ((cough)) (1.1) speculate on an- any measures that will be taken to reduce pollution (0.7) in the future? (2.0) theres a lot of debate no:w about pollution especially the waters (0.3) around Trinidad and Tobago (0.2) becau:se theres (0.5) gro- growth in the tourism industry especially (0.5) theres a lot of concern about the hotels (.)hh disposing of their waste properly (0.5) and in recent years in the (0.2) probably about the last ten years or so there have been (0.7) uh::m (0.4)there has been an increase in the amount of pollutio:n (0.3) in the water and the:res (0.6) several (1.3) uh:m societies for example ((inaudible)) Tobago that have been set up to try and combat the problems throu:gh education (0.2) uh:m (1.1) uh:m (0.3) there is other mea(hh)sures ((inaudible)) (.) okay okay (.) okay (0.6) right thank you very much= =yes (.) thats the end of the speaking test= =okay thank you






7. An investigation of the lexical dimension of the IELTS Speaking Test


Authors
John Read, University of Auckland
Paul Nation, Victoria University of Wellington

Grant awarded: Round 8, 2002

This study investigates vocabulary use by candidates in the IELTS Speaking Test by measuring lexical output, variation and sophistication, as well as the use of formulaic language.
ABSTRACT

This is a report of a research project to investigate vocabulary use by candidates in the current (since 2001) version of the IELTS Speaking Test, in which Lexical resource is one of the four criteria applied by examiners to rate candidate performance. For this purpose, a small corpus of texts was created from transcriptions of 88 IELTS Speaking Tests recorded under operational conditions at 21 test centres around the world. The candidates represented a range of proficiency levels from Band 8 down to Band 4 on the nine-band IELTS reporting scale. The data analysis involved two phases: the calculation of various lexical statistics based on the candidates' speech, followed by a more qualitative analysis of the full transcripts to explore, in particular, the use of formulaic language. In the first phase, there were measures of lexical output, lexical variation and lexical sophistication, as well as an analysis of the vocabulary associated with particular topics in Parts 2 and 3 of the test. The results showed that, while the mean values of the statistics declined from Band 8 to Band 4, there was considerable variance within bands, meaning that the lexical statistics did not offer a reliable basis for distinguishing oral proficiency levels. The second phase of the analysis focused on candidates at Bands 8, 6 and 4. It showed that the sophistication in vocabulary use of high-proficiency candidates was characterised by the fluent use of various formulaic expressions, often composed of high-frequency words, rather than by any noticeable quantity of low-frequency words in their speech. Conversely, there was little obvious use of formulaic language among Band 4 candidates. The report concludes with a discussion of the implications of the findings, along with suggestions for further research.


AUTHOR BIODATA

JOHN READ

John Read is an Associate Professor in the Department of Applied Language Studies and Linguistics, University of Auckland, New Zealand. In 2005, while undertaking this research study, he was at Victoria University of Wellington. His research interests are in second language vocabulary assessment and the testing of English for academic and professional purposes. He is the author of Assessing Vocabulary (Cambridge, 2000) and is co-editor of the journal Language Testing.

PAUL NATION

Paul Nation is Professor of Applied Linguistics in the School of Linguistics and Applied Language Studies, Victoria University of Wellington, New Zealand. His research interests are in second language vocabulary teaching and learning, as well as language teaching methodology. He is the author of Learning Vocabulary in Another Language (Cambridge, 2001) and also the author or co-author of widely used research tools such as the Vocabulary Levels Test, the Academic Word List and the Range program.


CONTENTS
1 Introduction
2 Literature review
3 Research questions
4 Method
4.1 The format of the IELTS Speaking Test
4.2 Selection of texts
4.3 Preparation of texts for analysis
5 Statistical analyses
5.1 Analytical procedures
6 Statistical results
6.1 Lexical output
6.2 Lexical variation
6.3 Lexical sophistication
6.4 Key words in the four tasks
7 Qualitative analyses
7.1 Procedures
8 Qualitative results
8.1 Band 8
8.2 Band 6
8.3 Band 4
9 Discussion
10 Conclusion
References


1 INTRODUCTION

The revised Speaking Test for the International English Language Testing System (IELTS), introduced in 2001, involved various changes both in the way that a sample of speech is elicited from the candidates and in the criteria used to rate their performance. From our perspective as vocabulary researchers, a number of issues stimulated our interest in investigating the test from a lexical perspective. An obvious one is that, whereas examiners previously assessed each candidate on a single global scale incorporating various descriptors, the rating is now done more analytically with four separate scales, one of which is Lexical resource. Examiners are required to attend to the accuracy and range of the candidate's vocabulary use as one basis for judging his or her performance. A preliminary study conducted by Cambridge ESOL with a pilot version of the revised test showed a very high correlation with the grammar rating scale, and indeed with the fluency one as well (Taylor and Jones, 2001), suggesting the existence of a halo effect, and perhaps a lack of salience for the examiners of lexical features of the candidates' speech. Thus, there is scope to investigate characteristics of vocabulary use in the Speaking Test, with the possible outcome of guiding examiners in what to consider when rating the lexical resource of candidates at different proficiency levels.

A second innovation in the revised test was the introduction of the Examiner Frame, which largely controls how an examiner conducts the Speaking Test by specifying the structure of the interaction and the wording of the questions. This means that the examiner's speech in the test is quite formulaic in nature, and we were interested to determine whether this might influence what the candidates said. Another possible influence on the formulaic characteristics of the candidates' speech is the growing number of IELTS preparation courses and materials, including at least one book (Catt, 2001) devoted just to the Speaking Test. The occurrence of formulaic language in the test would not in itself be a problem. One needs to distinguish here between the purposeful memorising of lexical phrases specifically to improve test performance, which one might associate with less proficient candidates, and the skilful use of a whole range of formulaic sequences, which authors like Pawley and Syder (1983) see as the basis of fluent, native-like oral proficiency.

More generally, the study offered an opportunity to analyse spoken vocabulary use. As Read (2000: 235-239) noted, research on vocabulary has predominantly focused on the written language because, among other reasons, written texts are easier to obtain and analyse. Although the speaking test interview is rather different from a normal conversation (cf van Lier, 1989), it represents a particular kind of speech event which is routinely audiotaped, in keeping with the operational requirements of the testing program. As a result, a large corpus of learner speech from test centres all around the world is available for lexical and other analyses once a selection of the tapes has been transcribed and edited. Thus, a study of this kind had the potential to shed new light on the use of spoken vocabulary by second language learners at different levels of proficiency.

2 LITERATURE REVIEW

Both first and second language vocabulary research has predominantly been conducted in relation to reading comprehension ability and the written language in general. This reflects the practical difficulties of obtaining and transcribing spoken language data, especially if it is to be natural, ie unscripted and not elicited. The relative proportions of spoken and written texts in major computer corpora such as the Bank of English and the British National Corpus maintain the bias towards the written language, although a number of specialised spoken corpora like CANCODE (Cambridge and Nottingham Corpus of Discourse in English) and MICASE (Michigan Corpus of Academic Spoken English) are now helping to redress the balance.


To analyse the lexical qualities of texts, scholars have long used a range of lexical statistics. Here again, for practical reasons, the statistics have, until recently, been applied mostly to written rather than spoken texts. Nevertheless, they potentially have great value in allowing us to describe key features of spoken vocabulary in a quantitative manner that may provide useful comparisons between test-takers at different proficiency levels. Read (2000: 197-213), in an overview of the statistical procedures, identifies the main qualities which the statistics are designed to measure: lexical density; lexical variation; and lexical sophistication. Lexical density is operationalised as the proportion of content words in a text. It has been used to distinguish the relative denseness of written texts from that of oral ones, which tend to have lower percentages of nouns, verbs and adjectives. In a language testing context, OLoughlin (1995) showed that candidates in a direct speaking test, in which they interacted with an interviewer, produced speech with a lower lexical density than those who took a semi-direct version of the test, which required test-takers to respond on audiotape to pre-recorded stimulus material with no interviewer present. Lexical variation, which has traditionally been calculated as the type-token ratio (TTR), is simply the proportion of different words used in the text. It provides a means of measuring what is often referred to as range of vocabulary. However, a significant weakness of the TTR when it is used to compare texts is the sensitivity of the measure to the variable length of the texts. Various unsatisfactory attempts have been made over the years to correct the problem through algebraic transformations of the ratio. Malvern and Richards (Durn, Malvern, Richards and Chipere, 2004) argue they have found a solution with their measure, D, which involves drawing multiple word samples from the text and plotting the resulting TTRs on a curve that allows the relative lexical diversity of even quite short texts to be determined. In a study which is of some relevance to our research, Malvern and Richards (2002) used D to investigate the extent to which teachers, acting as examiners in a secondary school French oral examination, accommodated their vocabulary use to the ability level of the candidates. Lexical sophistication can be defined operationally as the percentage of low-frequency, or rare, words used in a text. One such measure is Laufer and Nations (1995) Lexical Frequency Profile (LFP), which Laufer (1995) later simplified to a Beyond 2000 measure the percentage of words in a text that are not among the most frequent 2000 in the language. Based on the same principle, Meara and Bell (2001) developed their program called P_Lex to obtain reliable measures of lexical sophistication in short texts. It calculates the value lambda by segmenting the text into 10-word clusters and identifying the number of low-frequency words in each cluster. As yet, there is no published study which has used P_Lex with spoken texts. Apart from the limited number of studies using lexical statistics, recent work on spoken vocabulary has highlighted a number of its distinctive features, as compared to words in written form. One assumption that has been widely accepted is that the number of different words used in informal speech is substantially lower than in written language, especially of the more formal kind. 
That is to say, a language user can communicate effectively through speaking with a rather smaller vocabulary than that required for written expression. There has been very little empirical evidence for this until recently. In their study of the CANCODE corpus, Adolphs and Schmitt (2003) found a vocabulary of 2000 word families could account for 95% of the running words in oral texts, which indicates that learners with this size of vocabulary may still encounter quite a few words they do not know. The authors suggest that the target vocabulary size for second language learners to have a good foundation for speaking English proficiently should be around 3000 word families, which is somewhat larger than previously proposed.
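To make the measures reviewed above concrete, here is a minimal sketch, in Python, of the two simplest statistics: lexical density and the raw type-token ratio. It is purely illustrative; the tiny function-word list is a placeholder standing in for the full closed-class word lists a real analysis would use, and it is not the code of any of the programs described later in this report.

    # Minimal sketch: lexical density and raw type-token ratio (TTR).
    # The FUNCTION_WORDS set is an illustrative placeholder; a real
    # analysis would use a complete closed-class (function word) list.
    FUNCTION_WORDS = {
        "the", "a", "an", "of", "to", "in", "on", "and", "but", "or",
        "i", "you", "he", "she", "it", "we", "they", "is", "are", "was",
    }

    def tokenise(text: str) -> list[str]:
        return text.lower().split()

    def lexical_density(tokens: list[str]) -> float:
        # Proportion of running words that are content (non-function) words.
        content = [t for t in tokens if t not in FUNCTION_WORDS]
        return len(content) / len(tokens)

    def type_token_ratio(tokens: list[str]) -> float:
        # Different words (types) over running words (tokens); note the
        # sensitivity to text length discussed above.
        return len(set(tokens)) / len(tokens)

    tokens = tokenise("well i think the food in the restaurant was really quite good food")
    print(f"density = {lexical_density(tokens):.2f}, TTR = {type_token_ratio(tokens):.2f}")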


But perhaps the most important area in the investigation of spoken vocabulary is the use of multi-word lexical items. This represents a move away from the primary focus on individual word forms and word families in vocabulary research until now. Both in manual and computer analysis, it is simpler to count individual forms than any larger lexical units, although corpus linguists are now developing sophisticated statistical procedures to identify collocational patterns in text. The phenomenon of collocation has long been recognised by linguists and language teaching specialists, going back at least to Harold Palmer (1933, cited in Nation, 2001: 317). What is more recent is the recognition of its psycholinguistic implications. The fact that particular sequences of words occur with much greater than chance probability is not simply an interesting characteristic of written and spoken texts, but also a reflection of the way that humans process natural language.

Sinclair (1991) distinguishes two approaches to text construction: the open-choice principle, by which language structures are generated creatively on the basis of rules; and the idiom principle, which involves the building of text from prefabricated lexical phrases. Mainstream linguistics has tended to overlook or undervalue the significance of the latter approach. Another seminal contribution came from Pawley and Syder (1983), who argued that being able to draw on a large memorised store of lexical phrases was what gave native speakers both their ability to process language fluently and their knack of expressing ideas or speech functions in the appropriate manner. Conversely, learners reveal their non-nativeness in both ways. According to Wray (2002: 206), first language learners focus on large strings of words and decompose them only as much as they need to for communicative purposes, whereas adult second language learners typically store individual words and draw on them, not very successfully, to compose longer expressions as the need arises. This suggests one interesting basis for distinguishing candidates at different levels in a speaking test: investigating the extent to which they are able to respond fluently and appropriately to the interviewer's questions.

Applied linguists are showing increasing interest in the lexical dimension of language acquisition and use. In their research on task-based language learning, Skehan and his associates (Skehan, 1998; Mehnert, 1998; Foster, 2001) have used lexical measures as one means of interpreting the effects of different task variables on learners' oral production. As part of his more theoretical discussion of the research, Skehan (1998) proposes that the objective of good task design is to achieve the optimum balance between promoting acquisition of the rule system (which he calls syntacticisation) and encouraging the fluent use of lexical phrases (or lexicalisation). Wray's (2002) recent book on formulaic language brings together for the first time a broad range of work in various fields and will undoubtedly stimulate further research on multi-word lexical items. In addition, Norbert Schmitt, Zoltán Dörnyei and their associates at the University of Nottingham have recently completed a series of studies on factors influencing the acquisition of multi-word lexical structures by international students at the university (Schmitt, 2004).

Another line of research relevant to the present study is work on the discourse structure of oral interviews.
Studies in this area in the 1990s included Ross and Berwick (1992), Young and Milanovic (1992) and Young and He (1998). Lazaraton, in particular, has carried out such research on an ongoing basis in conjunction with UCLES, including her recent analysis of the new IELTS Speaking Test (Lazaraton, 2000, cited in Taylor, 2001). In one sense, a lexical investigation gives only a limited view of the candidate's performance in the speaking test. It focuses on specific features of the spoken text rather than the kind of broad discourse analysis undertaken by Lazaraton, and appears to relate to just one of the four rating scales employed by examiners in assessing candidates' performance. Nevertheless, the literature cited above gives ample justification for exploring the Speaking Test from a lexical perspective, given the lack of previous research on spoken vocabulary and the growing recognition of the importance of vocabulary in second language learning.

3 RESEARCH QUESTIONS

Based on our reading of the literature, we set out to address the following questions:

1. What can lexical statistics reveal about the vocabulary of a corpus of IELTS Speaking Tests?
2. What are the distinctive characteristics of candidates' vocabulary use at different band score levels?
3. What kinds of formulaic language are used by candidates in the Speaking Test?
4. Does the use of formulaic language vary according to the candidate's band score level?

Formulaic language is used here as a cover term for multi-word lexical items, following Wray (2002: 9), who defines a formulaic sequence as "a sequence, continuous or discontinuous, of words or other elements, which is, or appears to be, prefabricated: that is, stored and retrieved whole from memory at the time of use, rather than being subject to generation or analysis by the language grammar".

4 METHOD

4.1 The format of the IELTS Speaking Test

As indicated in the introduction, the IELTS Speaking Test is an individually administered test conducted by a single examiner and is routinely audiotaped. It takes 11–14 minutes and consists of three parts:

Part 1: Interview (4–5 minutes). The candidate answers questions about himself/herself and other familiar topic areas.

Part 2: Long Turn (3–4 minutes). After some preparation time, the candidate speaks for 1–2 minutes on a topic given by the examiner.

Part 3: Discussion (4–5 minutes). The examiner and candidate discuss more abstract issues and concepts related to the Part 2 topic.

The examiner rates the candidate's performance on four nine-band scales: Fluency and coherence; Lexical resource; Grammatical range and accuracy; and Pronunciation. The four criteria have equal weighting, and the final score for Speaking is the average of the individual ratings, rounded to a whole band score.

4.2 Selection of texts

The corpus of spoken texts for this project was compiled from audiotapes of actual IELTS tests conducted at various test centres around the world in 2002. The tapes had been sent to Cambridge ESOL as part of the routine monitoring process to ensure that adequate standards of reliability are being maintained. The Research and Validation Group of Cambridge ESOL then made a large inventory of nearly 2000 tapes available to approved outside researchers. The list included the following data on each candidate: centre number; candidate number; gender; module (Academic or General Training); Part 2 task number; and band score for Speaking.

The original plan was to select the tapes of 100 candidates for the IELTS Academic Module according to a quota sample. The first sampling criterion was the task (or topic) for Part 2 of the test. We wanted to restrict the number of tasks included in the sample because we were aware that the topic would have quite an influence on the candidates' choice of vocabulary, and we wanted to be able to reveal its effect by working with just a restricted number of tasks. Thus, the sample was limited to candidates who had been given one of four Part 2 tasks: Tasks 70, 78, 79 and 80. The choice of these specific tasks was influenced by the second criterion, which was that the band scores from 4.0 to 8.0 should be evenly represented, to allow for meaningful comparisons of the lexical characteristics of candidate speech at different proficiency levels, and in particular at Bands 4.0, 6.0 and 8.0. Since there are relatively few IELTS candidates who score at Band 4.0 or Band 8.0, compared to the scores in between, it was important to select tasks for which there was an adequate number of tapes across the band score range in the inventory. The four tasks chosen offered the best coverage in this sense.

The score that we used for the selection of candidates was the overall band level for Speaking, rather than the specific rating for Lexical resource (which was also available to us). We decided that, for the purpose of our analyses, it was preferable to classify the candidates according to their speaking proficiency, which was arguably a more reliable and independent measure than the Lexical resource score. In practice, though, the two scores were either the same or no more than one point apart for the vast majority of candidates.

Where there were more candidates available than we required, especially at Bands 5.0, 6.0 and 7.0, an effort was made to preserve a gender balance and to include as many test centres in different countries as possible. However, it was not possible to achieve our ideal selection. Ours was not the first request from outside researchers for the speaking tapes, and thus a number of our selected tapes were no longer available or could not be located. The final sample therefore consisted of 88 recorded Speaking Tests, as set out in Table 1. It included 34 female and 54 male candidates, and the tests had been administered in a wide range of countries: Australia, Cambodia, China, Colombia, Fiji, Hong Kong, India, Ireland, Libya, New Zealand, Pakistan, Peru, Sudan and the United Kingdom. Although the original intention was to select only Academic Module candidates, the sample included eight who were taking the General Training Module. This was not a real problem for the research, because candidates for both modules take the same Speaking Test.
           Task 70   Task 78   Task 79   Task 80   Totals
Band 8        4         4         4*        3        15
Band 7        5         4         6         4        19
Band 6        5         5         5         4        19
Band 5        5         5         5         6        21
Band 4        4         4         2         4        14
Totals       23        22        22        21        88

*One of these tapes turned out to have a different Part 2 task. It was thus excluded from the analyses by task.

Table 1: The final sample of IELTS Speaking Test tapes by band score and Part 2 task


4.3 Preparation of texts for analysis

The transcription of the tapes was undertaken by transcribers employed by the Language in the Workplace Project at Victoria University of Wellington. They had been trained to follow the conventions of the Wellington Archive of New Zealand English transcription system (Vine, Johnson, O'Brien and Robertson, 2002), which is primarily designed for the analysis of workplace discourse. Since the transcribers were mainly Linguistics students employed part-time, the transcribing took nearly nine months to complete.

For the qualitative analyses, the full transcripts were used. To produce text files for the calculation of lexical statistics for the candidates' speech, the transcripts were electronically edited to remove all of the interviewer utterances, as well as other extraneous elements such as pause markings and notes on speech quality which had been inserted into the transcripts in square brackets. The resulting files were saved as plain text files and then manually edited to delete the hesitations um, er and mm; backchannelling utterances such as mm, mhm, yeah, okay and oh; and false starts represented by incompletely articulated words and by short phrases repeated verbatim. In addition, contracted forms were separated (it'll → it 'll, don't → do n't) and multi-word proper nouns were linked as single lexical items (Margaret_Thatcher, Lord_of_the_Rings).
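The editing steps just described can be approximated programmatically. The sketch below is our illustration of that kind of clean-up, not the project's actual scripts (much of the editing was done manually): it assumes the filler and backchannel inventories listed above, bracketed annotations in the square-bracket style mentioned, and simple rules for contraction splitting and proper-noun linking.

    import re

    FILLERS = {"um", "er", "mm", "mhm", "yeah", "okay", "oh"}

    def clean_transcript(text: str) -> str:
        # Remove bracketed annotations such as pause markings and notes on
        # speech quality, eg [laughs] or [unclear].
        text = re.sub(r"\[[^\]]*\]", " ", text)
        # Separate contracted forms, eg it'll -> it 'll, don't -> do n't.
        text = re.sub(r"n't\b", " n't", text)
        text = re.sub(r"'(ll|ve|re|m|s|d)\b", r" '\1", text)
        # Link multi-word proper nouns as single lexical items (in practice
        # a hand-compiled list; two examples from the report shown here).
        for name in ("Margaret Thatcher", "Lord of the Rings"):
            text = text.replace(name, name.replace(" ", "_"))
        # Delete hesitations and backchannelling utterances. (False starts
        # were removed manually and are not handled by this sketch.)
        kept = [t for t in text.split() if t.lower() not in FILLERS]
        return " ".join(kept)

    print(clean_transcript("um I [pause] don't read Lord of the Rings er often"))
    # -> I do n't read Lord_of_the_Rings often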

5 STATISTICAL ANALYSES

5.1 Analytical procedures

To investigate the words used by the candidates, a variety of lexical statistics were calculated, using four different computer programs.

1. WordSmith Tools (Scott, 1998). This is a widely used program for analysing vocabulary in computer corpora. The Wordlist tool was used to identify the most frequently occurring content words, both in the whole corpus and in the texts for each of the four Part 2 tasks. It also provided descriptive statistics on the lexical output of candidates at the five band score levels. A second WordSmith tool, Keyword, allowed us to identify words that were distinctively associated with each of the tasks and with the whole corpus.

2. Range (Nation and Heatley, 1996). This program produces a profile of the vocabulary in a text according to frequency level. It includes three default English vocabulary lists: the first 1000 words and the second 1000 words (both from West, 1953), and the Academic Word List (Coxhead, 2000). The output provides a separate inventory of words from each list, plus words that are not in any of the lists. There are also descriptive statistics which give a summary profile and indicate the relative proportions of high and lower frequency words in the text. The Range program was used to produce profiles not for individual candidates but for each of the five band score levels represented in the corpus.

3. P_Lex (Meara and Bell, 2001). Whereas Range creates a frequency profile, P_Lex yields a single summary measure, lambda, calculated by determining how many non-high-frequency words occur in every 10-word segment throughout the text. A low lambda shows that the text contains predominantly high-frequency words, whereas a higher value indicates the use of more lower-frequency vocabulary (see the sketch following this list).

4. D_Tools (Meara and Miralpeix, 2004). The purpose of this pair of programs is to calculate the value of D, the measure of lexical diversity devised by Malvern and Richards. D values range from a maximum of 90 down to 0, reflecting the number of different words used in a text.
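As a rough indication of what the Range and P_Lex analyses involve, the following sketch computes a frequency profile and a lambda-like statistic for a list of tokens. The word lists here are placeholder sets standing in for the GSL and AWL files that Range loads; and whereas P_Lex proper fits a Poisson distribution to the counts of low-frequency words per 10-word segment, the sketch takes the shortcut of using the mean count, which is the maximum-likelihood estimate of a Poisson rate.

    # Placeholder word lists standing in for the GSL first and second 1000
    # and the Academic Word List that Range reads from file.
    FIRST_1000 = {"the", "i", "think", "people", "like", "good", "time", "work"}
    SECOND_1000 = {"restaurant", "traditional"}
    AWL = {"culture", "media"}

    def frequency_profile(tokens: list[str]) -> dict[str, float]:
        # Range-style profile: proportion of types at each frequency level.
        types = set(tokens)
        levels = {"list1": 0, "list2": 0, "awl": 0, "not_in_lists": 0}
        for t in types:
            if t in FIRST_1000:
                levels["list1"] += 1
            elif t in SECOND_1000:
                levels["list2"] += 1
            elif t in AWL:
                levels["awl"] += 1
            else:
                levels["not_in_lists"] += 1
        return {k: v / len(types) for k, v in levels.items()}

    def plex_lambda(tokens: list[str], segment: int = 10) -> float:
        # P_Lex-style statistic: mean number of words outside the first
        # 1000 per 10-word segment. P_Lex itself fits a Poisson
        # distribution to these counts; the mean is the MLE shortcut.
        segments = [tokens[i:i + segment] for i in range(0, len(tokens), segment)]
        counts = [sum(1 for t in seg if t not in FIRST_1000) for seg in segments]
        return sum(counts) / len(counts)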


6 STATISTICAL RESULTS

6.1 Lexical output

Let us first review some characteristics of the overall production of vocabulary by candidates in the test. In Table 2, candidates have been classified according to their band score level and the figures show descriptively how many word forms were produced at each level.
                   TOTALS               MEANS (standard deviations)
                 Tokens    Types        Tokens            Types
BAND 8 (n=15)    22,366    2374         1491.0 (565.9)    408.1 (106.0)
BAND 7 (n=19)    21,865    2191         1150.7 (186.7)    334.6 (46.0)
BAND 6 (n=19)    18,493    1795          937.3 (261.4)    276.7 (48.2)
BAND 5 (n=21)    15,989    1553          761.4 (146.7)    234.2 (35.5)
BAND 4 (n=14)     6,931     996          475.8 (216.9)    166.6 (48.6)

Table 2: Lexical output of IELTS candidates by band score level (WordSmith analysis)

Since there were different numbers of candidates in the five bands, the mean scores in the third and fourth columns of the table give a more accurate indication of the band score distinctions than the raw totals. There is a clear pattern of declining output from top to bottom, with candidates at the higher band score levels producing a much larger amount of vocabulary on average than those at the lower levels, both in terms of tokens and types. It is reasonable to expect that more proficient candidates would have the lexical resources to speak at greater length than those who were less proficient. However, it should also be noted that all the standard deviations were quite large. That is to say, there was great variation in lexical production within band score levels, which means that the number of words used is not in itself a very reliable index of the quality of a candidate's speech. For example, the length of the edited texts for Band 8 candidates ranged from 728 to 2741 words. Thus, high proficiency learners varied in how talkative they were and in the extent to which the examiner allowed them to speak at length in response to the test questions.

It would be possible to calculate type-token ratios (TTRs) from the figures in Table 2 and, in fact, the WordSmith output includes a standardised TTR. However, as noted above, the TTR is a problematic measure of lexical variation, particularly in a situation like the present one, where candidate texts vary widely in length.

6.2 Lexical variation

To deal with the TTR problem, Malvern and Richards' D was calculated by means of D_Tools. The D values for the texts in our corpus are presented in Table 3. As noted in the table, there may be a small bug in the program, because seven texts yielded a value above 90, which is not supposed to happen. An inspection of the seven texts suggested the possibility that the use of rare or unusually diverse vocabulary by some more proficient candidates may tend to distort the calculation, but this will require further investigation. Leaving aside those anomalous cases, the pattern of the findings for lexical variation is somewhat similar to that for lexical output. The mean values for D decline as we go down the band score scale, but again the standard deviations show a large dispersion in the values at each band level, particularly at Bands 7 and 6. As a general principle, more proficient candidates use a wider range of vocabulary than less proficient ones, but D by itself cannot reliably distinguish candidates by band score.

                  D (LEXICAL DIVERSITY)
                 Mean    SD      Maximum   Minimum
BAND 8 (n=11)*   79.0     4.9     87.5      72.0
BAND 7 (n=17)*   71.8    18.2     89.5      61.2
BAND 6 (n=18)*   67.2    16.0     81.4      57.0
BAND 5 (n=21)    63.4    11.3     86.7      39.5
BAND 4 (n=14)    60.7    11.4     76.1      37.5

* Seven candidates with abnormal D values were excluded

Table 3: Summary output from the D_Tools Program, by band score level
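For readers who wish to experiment with the measure, a D-like value can be estimated from the published logic of the statistic: draw repeated random samples of 35 to 50 tokens, average the TTRs at each sample size, and search for the D that best fits the model curve TTR = (D/N)(sqrt(1 + 2N/D) - 1). The sketch below follows that description rather than the internals of D_Tools itself, so its values will not necessarily match the program's output.

    import math
    import random

    def ttr_curve(n: int, d: float) -> float:
        # Model curve relating mean TTR to sample size N for a given D.
        return (d / n) * (math.sqrt(1 + 2 * n / d) - 1)

    def estimate_d(tokens: list[str], trials: int = 100) -> float:
        # Requires at least 50 tokens. Mean TTR of random samples at each
        # size from 35 to 50 tokens.
        sizes = range(35, 51)
        means = []
        for n in sizes:
            ttrs = [len(set(random.sample(tokens, n))) / n for _ in range(trials)]
            means.append(sum(ttrs) / trials)
        # Least-squares search for the best-fitting D between 1 and 200.
        best_d, best_err = 1.0, float("inf")
        for step in range(10, 2001):
            d = step / 10
            err = sum((ttr_curve(n, d) - m) ** 2 for n, m in zip(sizes, means))
            if err < best_err:
                best_d, best_err = d, err
        return best_d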

6.3 Lexical sophistication

The third kind of quantitative analysis used the Range program to classify the words (in this case, the types) into four categories, as set out in Table 4. Essentially, the figures in the table provide Laufer and Nation's (1995) Lexical Frequency Profile for candidates at the five band score levels represented in our corpus. If we look at the List 1 column, we see that overall at least half of the words used by the candidates were from the 1000 most frequent words in the language, but the percentage rises with decreasing proficiency, so that the high-frequency words accounted for over two-thirds of the types in the speech of Band 4 candidates. Conversely, the figures in the fourth column (Not in Lists) show the reverse pattern. Words that are not in the three lists represent less frequent and more specific vocabulary, and it was to be expected that the percentage of such words would be higher among candidates at Bands 8 and 7. In fact, there is an overall decline in the percentage of words outside the lists, from 21% at Band 8 to about 12% at Band 4.


                              TYPES
                List 1       List 2      List 3      Not in Lists   Total
BAND 8 (n=15)   1270 53.7%   347 14.7%   243 10.3%   504 21.3%      2364 100%
BAND 7 (n=19)   1190 54.6%   329 15.1%   205  9.4%   455 20.9%      2179 100%
BAND 6 (n=19)   1060 59.5%   266 14.9%   179 10.0%   277 15.5%      1782 100%
BAND 5 (n=21)    958 62.1%   222 14.4%   119  7.7%   243 15.8%      1542 100%
BAND 4 (n=14)    677 68.5%   132 13.3%    58  5.9%   122 12.3%       989 100%

KEY: List 1 = first 1000 words of the GSL (West, 1953); List 2 = second 1000 words of the GSL; List 3 = Academic Word List (Coxhead, 2000); Not in Lists = not occurring in any of the above lists.

Table 4: Analysis by the Range program of the relative frequency of words (lemmas) used by candidates at different band score levels

The patterns for the two intermediate columns are less clear-cut. Candidates at the various band levels used a variable proportion of words from the second 1000 list, around an overall figure of 13–15%. In the case of the academic vocabulary in List 3, the speech of candidates at Bands 6–8 contained around 9–10% of these words, with the percentage declining to about 6% for Band 4 candidates. If we take the percentages in the third and fourth columns as representing the use of more sophisticated vocabulary, we can say that higher proficiency candidates used substantially more of those words.

Another perspective on the lexical sophistication of the speaking texts is provided by Meara and Bell's (2001) P_Lex program, which produces a summary measure, lambda, based on this same distinction between high and low-frequency vocabulary use in individual texts. As noted above, a low value of lambda shows that the text contains mostly high-frequency words, whereas a higher value is intended to indicate more sophisticated vocabulary use. In Table 5, the mean values of lambda show the expected decline from Band 8 to Band 4, confirming the pattern in Table 4 that higher proficiency candidates used a greater proportion of lower-frequency vocabulary in their speech. However, the standard deviations and the range figures also demonstrate what was seen in Tables 2 and 3: except to some degree at Band 6, there was a great deal of variation within band score levels.


                          LAMBDA
                 Mean    SD      Maximum   Minimum
BAND 8 (n=15)    1.10    0.22    1.50      0.77
BAND 7 (n=19)    1.05    0.26    1.49      0.60
BAND 6 (n=19)    0.89    0.17    1.17      0.55
BAND 5 (n=21)    0.88    0.24    1.38      0.33
BAND 4 (n=14)    0.83    0.33    1.48      0.40

Table 5: Summary output from the P_Lex program, by band score level

To get some indication of why such variation might occur, it is interesting to look at candidates for whom there was a big mismatch between the band score level and the value of lambda. There were four cases of Band 8 candidates with lambdas between 0.77 and 0.86. An inspection of their transcripts suggests the following tentative explanations:

- Candidate 62 may have been overrated at Band 8, judging by the simple language used and the apparent difficulty in understanding some of the examiner's questions in Part 3 of the test.
- Candidate 19 spoke fluently in idiomatic English composed largely of high-frequency words.
- Three of the four used relatively few technical terms in discussing their employment, their study and the Part 2 task.
- Candidate 76 used quite a lot of technical terminology in talking about his employment history but switched to a much greater proportion of high-frequency vocabulary in the rest of the test.

On the other hand, four Band 4 candidates had lambdas between 1.16 and 1.48. There is an interesting contrast between two Band 4 candidates who said relatively little in the test (their edited texts are both around 300 words) but who had markedly different lambdas. Candidate 78 responded in simple high-frequency vocabulary, which produced a value of 0.44, whereas Candidate 77 used quite a few somewhat lower-frequency words, often ones that were repeated from the examiner's questions (available, transport, celebrating, information, encourage), and thus obtained a lambda of 1.48. The other Band 4 candidates with high lambdas also appeared to produce a good proportion of words outside the high-frequency vocabulary range, relative to their small lexical output. Another factor with some Band 4 candidates was that poor pronunciation reduced their intelligibility on tape, with the result that it was difficult for the transcriber to make a full record of what they said; this may have affected high-frequency function words more than phonologically salient lexical items. Some of these lexical characteristics of performance at the different band score levels are considered further below in the qualitative analysis of the transcripts.


6.4 Key words in the four tasks

To investigate the vocabulary associated with particular topics, the texts were classified according to the four Part 2 tasks represented in the corpus. There were 21–23 texts for each task. Table 6 lists the most frequently occurring content word forms in descending order, according to the WordSmith word lists. The lists have been lemmatised, in the sense that a stem word and its inflected forms (cook, cooking, cooked; book, books) were counted as a single unit, or lemma.
TASK 70: Eating out (n=23)
food 269, think 190, people 187, like (vb) 177, restaurant 151, time 125, good 117, friend 104, eat 96, fast 86, place 79, home 60, work 58, cook 54, know 54, country 50, travel 49, family 47, year 41, nice 39, city 38, spend 38, name 38, traditional 36, talk 31, different 30, find 30, study 28, course 27, prefer 26, enjoy 26, dish 25, prepare 25

TASK 78: Reading a book (n=22)
book 333, read 224, think 195, people 130, like (vb) 126, time 82, friend 81, name 80, good 69, work 61, different 49, child 48, study 46, story 44, life 43, television 43, problem 42, write 38, family 37, find 34, important 32, interesting 32, country 29, help 28, learn 28, city 26, love 25

TASK 79: Language learning (n=21)
English 340, think 229, learn 157, language 148, like (vb) 88, people 87, know 85, ( ) 79, start 63, speak 59, school 55, friend 55, different 45, time 43, country 40, study 39, important 39, good 37, year 36, teach 35, difficult 34, teacher 34, place 34, start 33, class 32, write 31, grammar 31, interesting 31, mean 31, music 30, work 29, listen 29, travel 28, family 27, talk 26, new 26, name 25, word 25, university 25, foreign 25

TASK 80: Describing a person (n=21)
people 315, think 226, know 175, famous 130, good 129, name 105, like (vb) 79, person 79, friend 72, work 67, country 62, time 60, year 57, day 55, life 54, family 53, help 51, study 50, city 48, important 47, public 44, different 42, example 42, way 41, problem 39, history 38, transport 37

Note: Some high-frequency verbs which occurred fairly uniformly across the four tasks have been excluded: get, go, make, say, see, use/used to and want.

Table 6: The most frequent content words used by candidates according to their Part 2 topic (WordSmith Wordlist analysis)
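The lemmatisation applied to these lists can be approximated with a naive suffix-stripping heuristic. The sketch below is a stand-in for the lemma grouping performed with WordSmith; its suffix rules are illustrative assumptions rather than the tool's actual algorithm.

    from collections import Counter

    SUFFIXES = ("ing", "ed", "es", "s")  # naive inflectional endings

    def lemma(word: str, vocabulary: set[str]) -> str:
        # Map an inflected form to a stem already attested in the data,
        # eg cooking/cooked -> cook, books -> book.
        for suffix in SUFFIXES:
            stem = word[: -len(suffix)]
            if word.endswith(suffix) and stem in vocabulary:
                return stem
        return word

    tokens = ["cook", "cooking", "cooked", "book", "books", "read"]
    vocab = set(tokens)
    print(Counter(lemma(t, vocab) for t in tokens))
    # -> Counter({'cook': 3, 'book': 2, 'read': 1})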

The lists represent, in a sense, the default vocabulary for each topic: the mostly high-frequency words one would expect learners to use in talking about the topic. As such, these words will almost certainly not be salient for the examiners in rating the learners' lexical resource, except perhaps in the case of low-proficiency candidates who exhibit uncertain mastery of even this basic vocabulary. It should be remembered that these lists come from the full test for each candidate, not just Parts 2 and 3, where the designated topic was being discussed. This helps to explain why words such as friend, people, family, study and country tend to occur on all four lists: these words were frequent in Part 1 of the test, where candidates talked about themselves and their background, including, in particular, a question about whether they preferred to socialise with family members or friends.

Words were selected for the four lists down to a frequency of 25. It is interesting to note some variation between topics in the number of words above that minimum level. The longest list was generated by Task 79, on language learning. This indicates that, from a lexical point of view, the candidates discussed this topic in similar terms, so that a relatively small number of words, including English, learn, language, study, listen, word and talk, recurred quite frequently. That is to say, their experience of language learning had much in common from a vocabulary perspective. By contrast, for Task 78 the list of frequently repeated words is noticeably shorter, presumably because the books that the candidates chose to discuss had quite varied characteristics. The same would apply to the people that candidates who were assigned Task 80 chose to talk about.
TASK 70: Eating out (n=23)
food 463.1, restaurant 327.8, fast 184.0, eat 104.8, foods 90.0, eating 86.7, go 76.1, cook 74.3, like 58.8, home 57.7, traditional 52.0, restaurants 47.0, dishes 45.3, cooking 45.3, nice 42.2, out 40.0, McDonalds 32.0, meal 31.3, delicious 29.3, shop 26.6, healthy 24.0

TASK 78: Reading a book (n=22)
read 342.8, books 309.2, book 358.9, reading 102.2, story 66.4, children 57.2, internet 38.4, television 38.4, girl 36.8, men 36.8, writer 35.1, boy 29.7, this 28.6, hear 28.5, women 27.4, fiction 24.3

TASK 79: Language learning (n=21)
English 713.1, language 233.6, learn 251.1, speak 99.4, learning 76.8, languages 74.7, school 72.4, class 69.7, grammar 62.2, communicate 56.2, foreign 52.1, started 40.5, words 37.7, speaking 34.9, teacher 33.8, difficult 32.4, communication 29.3, listening 27.5

TASK 80: Describing a person (n=21)
he 346.5, famous 270.4, people 115.2, him 110.6, person 76.0, his 60.2, public 53.0, admire 51.5, who 50.6, known 48.5, media 45.7, become 42.0, she 39.0, chairman 24.2, president 24.2

Table 7: Results of the WordSmith Keyword analysis for the four Part 2 tasks

Another facility offered by WordSmith is a Keyword analysis, which identifies words occurring with high frequency in a particular text or set of texts as compared with their occurrence in a reference corpus. For this purpose, the texts associated with each of the four Part 2 tasks were collectively analysed by reference to the corpus formed by the texts on the other three tasks. The results can be seen in Table 7, which lists the keywords for each of the four tasks, accompanied by a keyness statistic representing the extent of the mismatch in frequency between the words in the texts for a particular task and in the rest of the corpus. The keyword results show more clearly than the previous analysis the semantically salient words associated with each task. From a lexical point of view, it is the vocabulary needed for the Part 2 long turn and the Part 3 discussion which dominates each candidate's Speaking Test.
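A common choice for such a keyness statistic is Dunning's log-likelihood ratio, which WordSmith offers alongside chi-square (we do not claim here which setting produced the figures in Table 7). The sketch below shows the log-likelihood calculation for a single word, with hypothetical figures in the example call.

    import math

    def keyness(freq_study: int, size_study: int, freq_ref: int, size_ref: int) -> float:
        # Dunning's log-likelihood (G2) for one word: how unexpected its
        # frequency in the study texts is, given the reference corpus.
        total_freq = freq_study + freq_ref
        total_size = size_study + size_ref
        # Expected frequencies under the null hypothesis that the word is
        # equally likely in both corpora.
        expected_study = size_study * total_freq / total_size
        expected_ref = size_ref * total_freq / total_size
        g2 = 0.0
        for observed, expected in ((freq_study, expected_study), (freq_ref, expected_ref)):
            if observed > 0:
                g2 += 2 * observed * math.log(observed / expected)
        return g2

    # Hypothetical figures: a word occurring 269 times in 20,000 tokens of
    # task texts but only 40 times in 60,000 reference tokens.
    print(round(keyness(269, 20_000, 40, 60_000), 1))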


7 QUALITATIVE ANALYSES

To complement the statistical analyses, a subset of the test transcripts was selected for a more qualitative examination. There were two aims in this part of the study:

1. to identify lexical features of the candidate speech which might help to distinguish performance at different band score levels
2. to seek evidence of the role that formulaic language might play in the Speaking Test.

7.1 Procedures

The approach to this phase of the analysis was exploratory and inherently subjective in nature. As we and others have previously noted (Wray, 2002; Schmitt and Carter, 2004; Read and Nation, 2004), there is a great deal of uncertainty both about how to define formulaic language in general and about how to identify particular sequences of words as formulaic. Our initial expectations were that formulaic language could potentially take a number of different forms in the IELTS Speaking Test:

1. The examiner's speech in the test is constrained by a frame, which is essentially a script specifying the questions that should be asked, with only limited options to tailor them for an individual candidate. This might give the examiner's speech a formulaic character, which would in turn be reflected in the way that the candidate responded to the questions.
2. In the case of high-proficiency candidates who were fluent speakers of the language, one kind of evidence for their fluency could be the use of a wide range of idiomatic expressions, ie, sequences of words appropriately conveying a meaning which might not be predictable from knowledge of the individual words. This would make their speech seem more native-like than that of candidates at lower band score levels.
3. Conversely, lower-proficiency candidates might attempt such expressions but produce ones that were inaccurate or inappropriate.
4. At a low level, candidates might show evidence of using (or perhaps overusing) a number of fixed expressions that they had consciously memorised in an effort to improve their performance in the test. It could be argued that the widespread availability of IELTS preparation courses and materials might encourage this tendency.

In order to highlight contrasts between score levels, the transcripts at Bands 8, 6 and 4 for each of the four tasks were selected for analysis. Our strategy was to read each of the selected transcripts carefully, marking words, phrases and longer sequences that seemed to be lexically distinctive in the following ways:

- individual words that we judged to be of low frequency, whether or not they were accurately or appropriately used
- words or phrases which had a pragmatic or discourse function within the text
- sequences of words which could in some sense be regarded as formulaic.

At this point, it is useful to make a distinction between formulaic sequences which could be recognised as such on the basis of native speaker intuition, and sequences that were formulaic for the individual learner as a result of being stored and retrieved as whole lexical units, regardless of how idiomatic they might be judged by native speakers. One indication that a sequence was formulaic in the latter sense was that it was produced by the candidate with few if any pauses, hesitations or false starts. Another was that the same sequence, or a similar one, was used by the candidate more than once during the test.


8 QUALITATIVE RESULTS

8.1 Band 8

As noted in the results of the statistical analyses, the candidates at Band 8 produced substantially more words as a group than did those at lower proficiency levels. However, the quality of their vocabulary use was also distinctive. This was reflected partly in their confident use of low frequency vocabulary items, particularly those associated with their employment or their leisure interests. Several of the Band 8 candidates in the sample were medical practitioners and here, for example, is Candidate 01 recounting the daily routine at his hospital:

    and after that um I should er go back to the ward to check patients and check if there's any complication from receiving the drugs er usually after er er giving the drugs er some drugs may cause side effects which need to my intervention

The underlined words are obviously more or less technical terms in medicine and one would expect a doctor to have command of them. Similarly, Candidate 48 described her favourite movie actor in this way:

    he is a very, very versatile actor like he's er he has got his own styles and mannerisms in a very short span er in two decades or two and a half decades he has established himself as a very good actor in the (cine field)

The use of styles and span here may not be entirely native-like, but the candidate was able to give a convincing description of the actor. Thus, high-proficiency candidates have available to them a wide range of low frequency words that allow them to express more specific meanings than can be conveyed with more general vocabulary.

However, it is important to emphasise that such lower-frequency vocabulary does not necessarily occur with high density in the speech of Band 8 candidates. The sophistication of their vocabulary ability may also be reflected in their use of formulaic sequences made up largely or entirely of high-frequency words, which give their speech a native-like quality. Here are some excerpts from the transcripts of Band 8 candidates, with some of the sequences that we consider to be formulaic underlined:

    one of the main reasons [why he became a doctor] was both my parents are doctors so naturally I got into that line but I was also interested in this medicine, as such, of course the money factors come into play (Candidate 54)

    it's quite nice [describing a restaurant] it's er it's er Japanese er all type of food but basically what I like there is the sushi, I love sushi so I just enjoy going there and when you go in they start shouting and stuff, very Japanese culture type of restaurant which is very good (Candidate 19)

    [after visiting a new place] I like to remember everything later on and er I don't know it's a habit I just keep picking up these small things like er + um if I go to the northern areas that is if I go to [place name] or some place like that, I'll be picking up these small pieces and them um on the way back when I look at them I was like God, I cannot explain why I got this, there's just this weird stuff that I've picked up (Candidate 72)

A related feature of the speech of many Band 8 candidates was the use of short words or phrases functioning as pragmatic devices, or what Hasselgren (2002) has termed smallwords. These include you know, I mean, I guess, actually, basically, obviously, like and okay. These tend to be semantically empty and, as such, might be considered outside the scope of a lexical analysis, but nevertheless they need to be included in any account of formulaic language use. Here are some examples from Candidate 47, who possibly overdoes the use of such devices:

    I'm a marine engineer actually so er I work on the ship and er basically we have to go wherever the ship you know goes and so obviously we are on the ship so basically I am taking care of the machinery and that's it so + er well I've travelled quite a lot you know I mean all around the world

Another distinctive user was Candidate 38:

    [I prefer America] er because um to be frank like er um people were nice I mean they were not biased or you know they didn't show any coloured preference or whatever yeah they were more friendly

In most cases, these pragmatic devices did not occur as frequently as in these two excerpts, but they were still a noticeable feature of the speech in many of the Band 8 transcripts that we examined.

Another kind of device was the use of discourse markers to enumerate two or three points that the candidate wanted to make:

    if you compare er my language with er English it's completely different because .. er firstly we write from right to left and in English you write from right to left um another thing the grammar our grammar it's not like English grammar (Candidate 62)

    my name has two meanings there's one um it's actually a Muslim name so there's two meanings to that one is that it means a guardian from heaven and the second meaning it's er second name it was given to a tribe of people that were lofty and known for their arrogance (Candidate 71)

These discourse markers were not so common in candidate speech, which is perhaps a little surprising, particularly in relation to the Part 2 long turn, when the candidates were given a minute or so to prepare what they were going to say.

It is important not to overstate the extent to which the features identified so far can be found in the speech of all the candidates at Band 8. In fact, they varied in the extent to which their speech appeared to be formulaic, in the sense of containing a lot of idiomatic expressions, pragmatic devices and so on. Here is a candidate who expresses her opinion about the importance of English in a relatively plain style:

    er I think the English language is very important now + at first it didn't used to be, actually it has been strong for the last er fifty years but importance was not given to it + now in every organisation in every school in every college, er basically at the university level everything is taught in English basically so you need to understand the language, I think we students are better off because we are studying from a younger age we understand the language but a big problem we have here is that + people don't communicate but now teachers encourage the students to speak in English and um + it is very important (Candidate 83)

Apart from the words actually and basically, plus a phrase such as are better off, there is not much in this excerpt which could be considered formulaic in any overt way. Another example is this candidate talking about the kind of friends he prefers:

    er normally I prefer one or two very close friends so that I can discuss with them if I have any problems or things like that, I can have more contact close contact with them instead of having so many friends, but I have so many friends I make friends as soon as I see I see people for the first time it's like when I came here today I talk to a number of people here but I have I prefer to have just one or two friends who are very close to me (Candidate 64)

On the face of it, these opinions are expressed in very simple vocabulary without any idiomatic expression. It should be noted that the phrase prefer to have one or two very close friends was part of the preceding question asked by the examiner, and thus the opening statement is formulaic in the sense that it echoes what the examiner said. On closer inspection, there are other phrases that could be formulaic, such as or things like that, close contact with them, I make friends and it's like when.

8.2 Band 6

Candidates at Band 6 showed some features similar to those at Band 8, but overall they had a more limited range of vocabulary and used a smaller amount of idiomatic expression. One tendency among Band 6 candidates was either to use an incorrect form of a word or to reveal some uncertainty about what the correct form was. For instance, Candidate 09 said if I go by myself maybe some dangerous or something and it's more e- economy if I travel with other peoples. Similarly, Candidate 69 made statements such as when I was third year old and the differences between health and dirty and, in perhaps an extreme case of uncertainty, then my parents brang bring bringed me branged me here.

One noticeable characteristic of many candidates at this level was the occurrence of a mixture of appropriate and inappropriate expression, both in individual word choice and in the longer word sequences which they produced. Here are some examples:

    I think adventurous books are really good for um pleasure time where you can sit and you can think and read those books and really come into real world (Candidate 36)

    No people rarely do [change their names] especially because er first of they're proud of their names and proud of their tribes if you ever ever er go through the history of those people they would think themselves like a very proudy person and most of the people don't change their name (Candidate 84)

    Mm I think train is better because it's fast and convenient but sometimes when in the weekend there's many people who are travel by train to somewhere else + so I think that time is very busy (Candidate 10)

These examples illustrate how Band 6 candidates were able to communicate their meaning effectively enough, even though they made some errors and did not express themselves in the more idiomatic fashion that Band 8 candidates were capable of. Here is one further example, which includes low-frequency vocabulary such as relaxation, dwelling, cassette recorder and distract, as well as the formulaic expression (it's) (just) a matter of, but in other respects is not very idiomatic:

    I usually listen to music as a relaxation time after duties at my dwelling it's just a matter of relaxation ( ) cassette recorder certain cassettes I have picked ( ) I am travelling from town to town in the recorder of my car I used to put it on just a matter of you can distract ( ) going by your thoughts ( ) cannot sleep ( ) so it's a matter of relaxation ( ) something to distract me ( ) also I enjoy it very much (Candidate 87)

Candidates at Band 6 did not generally use pragmatic devices such as actually, you know and I mean with any frequency. Candidate 69 is a clear exception, but the other transcripts contained few, if any, examples of such devices.


8.3 Band 4

First, it should be noted that there were some practical difficulties for the transcribers in accurately recording what candidates at this level said, both because of the intelligibility of their accents and because their answers to questions might not be very coherent, particularly when the candidate had not properly understood the question.

Although candidates at this level used predominantly high-frequency vocabulary, they often knew some key lower-frequency words related to familiar topics, which they would use without necessarily being able to incorporate them into well-formulated statements, as in this response by Candidate 77 about transport in his city:

    Transport problems locally there is a problem of these ( ) and er rickshaws motor rickshaws a lot of problems of making pollution and er problems

Here is another example, from Candidate 18:

    Er in my case I have a working holiday visa yes (before) I I worked as salesperson in convenience shop

A third example is a description of a local festival by Candidate 73:

    Er is er our locality is very famous we're celebrating [name] festivals and we er too celebrate with our er relatives and there's a big gathering there and we always er make chitchat and we negotiate and deal of our personal characters in such kind of + festi- festivals

There was not a great deal of evidence of formulaic language among the candidates at the Band 4 level. In some respects, the most formulaic section of the test was at the very beginning, as in this exchange with Candidate 65:

    IN: Can you tell me your full name please?
    CM: My full name is [full name]
    IN: Okay and um and what shall I call you?
    CM: Um you can call me [name]
    IN: [Name] okay [name] er can you tell me where you're from [name]?
    CM: Er I'm from [place] in [country]

Of course, this introductory part was formulaic to varying degrees for candidates at all levels of proficiency, because examiners are required to go through the routine at the beginning of every test.

There was one Band 4 candidate who gave an unusually well-formulated response, which seems quite formulaic in the sense of being perhaps a rehearsed explanation for her decision to study medicine:

    I want to be a doctor because I think this is a meaningful job to use my knowledge to help others and also to contribute to the society (Candidate 45)

More typically, the responses by Band 4 candidates to questions that they had understood were not nearly as well-formed as this. For example, Candidate 78 responded thus to a question about English teaching in her country:

    Er in my school is very good I can er I'll er + read there er two years last + nine ten matric than I'll leave the school go to college + and there's no good English in colleges

For the most part, there were only certain limited sequences which we could identify as in any way formulaic in the speech of these low-proficiency candidates. For instance, Candidate 80 used the formula Yes of course six times. Other phrases such as most of the time, in my opinion, first of all, I don't know, I'm not sure and I like music very much occur sporadically in the transcripts we examined.

Particularly in Part 3, which is designed to be the most challenging section of the test, the Band 4 candidates had difficulty in understanding the examiner's questions, let alone composing an adequate answer. However, even here they mostly did not have formulaic expressions to express their difficulty and to request a repetition of the question. Some used pardon, please or (I'm) sorry, or else just struggled to respond as best they could. Exceptions were I do not understand (Candidate 80) and sorry I don't exactly understand what you're ( ) can you repeat please (Candidate 45).

9 DISCUSSION

In this study we used a variety of statistical tools, as well as our own judgement, to explore the lexical characteristics of oral texts produced by IELTS candidates in the Speaking Test. We decided to conduct most of the analyses using the band scores for Speaking which had been assigned to the candidates' performance by the examiners in the operational situation. For research purposes, it might have been desirable to check the reliability of the original scores by having the tapes re-rated by two certificated examiners. On the other hand, the fact that the recording quality of the audiotapes was quite variable, and that rating from tape is a different experience from assessing candidates live, meant that the re-ratings would not necessarily have produced more valid measures of the candidates' speaking ability.

Classifying the candidates by band score, then, we found that the lexical statistics revealed broad patterns in the use of individual word forms which followed one's general expectations:

- Higher proficiency candidates gave more extended responses to the questions and thus produced more vocabulary than lower proficiency candidates.
- Candidates with higher band scores also used a wider range of vocabulary than those on lower band scores.
- The speech of less proficient candidates contained a higher proportion of high-frequency words, particularly the first 1000 most frequent words in the language, reflecting the limitations of their vocabulary knowledge.
- Conversely, higher proficiency candidates used greater percentages of lower frequency words, demonstrating their larger vocabulary size and their ability to use more specific and technical terms as appropriate.

It is important, though, that all of these findings should be seen as tendencies of varying strengths rather than defining characteristics of a particular band score level, because in all cases there was substantial variation within levels. Thus, for instance, some Band 8 candidates gave relatively short responses and used predominantly high-frequency word forms, whereas those at Band 4 often produced quite a few low-frequency words, which could form a substantial proportion of their lexical output.

Another point worth reiterating here is that, following Nation (2001: 13-16), we are defining high-frequency as occurring among the 2000 most frequent words in English and, in the case of the P_Lex analysis, even more narrowly as the first 1000 words. As Nation (p 19) also notes, the distinction between high and low is a somewhat arbitrary one, and many very familiar words are classified as low-frequency by this criterion. However, the division still seems to provide a useful basis for evaluating the lexical quality of these oral texts.


No particular analysis was conducted of technical terms used by these IELTS candidates. The test questions are not really intended to elicit much discussion of the candidate's field of study or employment, particularly since the same test material is used with both Academic and General Training candidates. Within the short time-span of the test, the examiner cannot afford to let the candidate speak at length on any one topic; even the Part 2 long turn is supposed to be restricted to 1–2 minutes. Nevertheless, some more proficient candidates who were well-established professionals in medicine, finance or engineering did give relatively technical accounts of their professional experience and interests in Parts 1 and 2 of the test.

The WordSmith analyses of the four Part 2 tasks clearly showed the influence of the topic that was the focus of Parts 2 and 3 of each candidate's test. The distinctive, frequently occurring content words were mostly those associated with the Part 2 task, which then led to the more demanding follow-up questions in Part 3. One interesting point to emerge from the analysis of the four topics was that they varied in terms of the range of content vocabulary that they elicited. Task 79, which concerned the candidates' experience of learning English, was the most narrowly focused in this regard. In other words, the candidates who talked on this topic tended to draw on the same lexical set related to formal study of the language in a classroom. On the other hand, Tasks 78 (a book) and 80 (a person) required some generic terms, but also more specific vocabulary to talk about the particular characteristics of the book or person.

The qualitative analysis was exploratory in nature and the findings must be regarded as suggestive rather than in any way conclusive. As noted in the literature review, there are no well-established procedures for identifying formulaic language, which indeed can be defined in several different ways. We found it no easier than previous researchers to confidently identify multi-word units as formulaic in nature on the basis of a careful reading of the transcripts. The comparison of transcripts within and across Bands 4, 6 and 8 produced some interesting patterns of lexical distinction between candidates at these widely separated proficiency levels. However, we were also conscious of the amount of individual variation within levels, which of course was one of the findings of the quantitative analysis as well. It should also be pointed out that the candidates whose tapes we were working with comprised a relatively small, non-probabilistic sample of the IELTS candidates worldwide: another reason for caution in drawing any firm conclusions.

The simple fact of working with the transcripts obliged us to shift from focusing on the individual word forms that were the primary units of analysis for the statistical procedures to a consideration of how the forms combined into multi-word lexical units in the candidates' speech. This gave another perspective on the concept of lexical sophistication. In the statistical analyses, sophistication is conceived in terms of the occurrence of low frequency words in the language user's production. The qualitative analysis, particularly of the Band 8 texts, highlighted the point that the lexical superiority of these candidates was shown not only by their use of individual words but also by their mastery of colloquial or idiomatic expressions, which were often composed of relatively high-frequency words.

10 CONCLUSION

In the first instance, this study can be seen as a useful contribution to the analysis of spoken vocabulary in English, an area which is receiving more attention now after a long period of neglect. Within a somewhat specialised context, that of non-native speakers performing in a high-stakes proficiency test, the research offers interesting insights into oral vocabulary use, both at the level of individual words and through multi-word formulaic units. The texts are incomplete in one sense, in that the examiner's speech has been deleted, but of course the primary focus of the assessment is on what the candidate says (and discourse analytic procedures such as those used by Lazaraton (2002) are more appropriate for investigating the interactive nature of the Speaking Test). Although oral texts like these are certainly not as tidy as written ones, it appears that lexical statistics can provide an informative summary of some key aspects of the vocabulary they contain.

From the perspective of IELTS itself, it is important to investigate vocabulary use in the Speaking Test as part of the ongoing validation of the IELTS test, particularly as Lexical resource is one of the criteria on which the candidate's performance is assessed. Our findings suggest that it is not surprising if examiners have some difficulty in reliably rating vocabulary performance as a separate component from the other three rating criteria. Whereas broad distinctions can be identified across band score levels, we found considerable variation in vocabulary use by candidates within levels. Ideally, research of this kind will, in the longer term, inform a revision of the rating descriptors for the Lexical resource scale, so that they direct the examiners' attention to salient distinguishing features of the different bands. However, it would be premature to attempt to identify such features on the basis of the present study.

One fruitful area of further research would be to ask a group of IELTS examiners to listen to a sample of the Speaking Test tapes and discuss the features of each candidate's vocabulary use that were noticeable to them. Their comments could then be compared with the results of the present study to see to what extent there was a match between their subjective perceptions and the various quantitative measures. However, it should also be remembered that, in the operational setting, examiners need to monitor all four rateable components of the candidate's performance, which restricts the amount of attention they can pay to Lexical resource or to any one of the others. It may well be unrealistic to expect them to reliably separate the components. Moreover, the formulaic nature of oral language, as we observed it in our data, particularly among Band 8 candidates, calls into question the whole notion of a clear distinction between vocabulary and grammar. Thus, while as vocabulary researchers we emphasise the importance of the lexical dimension of second language performance, we also recognise that it represents one perspective among several on what determines how effectively a candidate can perform in the IELTS Speaking Test.
From the perspective of IELTS itself, it is important to investigate vocabulary use in the Speaking Test as part of the ongoing validation of the IELTS test, particularly as 'Lexical resource' is one of the criteria on which the candidate's performance is assessed. Our findings suggest that it is not surprising if examiners have some difficulty in reliably rating vocabulary performance as a separate component from the other three rating criteria. Whereas broad distinctions can be identified across band score levels, we found considerable variation in vocabulary use by candidates within levels.

Ideally, research of this kind will, in the longer term, inform a revision of the rating descriptors for the 'Lexical resource' scale, so that they direct the examiners' attention to salient distinguishing features of the different bands. However, it would be premature to attempt to identify such features on the basis of the present study. One fruitful area of further research would be to ask a group of IELTS examiners to listen to a sample of the Speaking Test tapes and discuss the features of each candidate's vocabulary use that were noticeable to them. Their comments could then be compared with the results of the present study to see to what extent there was a match between their subjective perceptions and the various quantitative measures.

However, it should also be remembered that, in the operational setting, examiners need to monitor all four rateable components of the candidate's performance, which restricts the amount of attention they can pay to 'Lexical resource' or any one of the others. It may well be unrealistic to expect them to separate the components reliably. Moreover, the formulaic nature of oral language, as we observed it in our data, particularly among Band 8 candidates, calls into question the whole notion of a clear distinction between vocabulary and grammar. Thus, while as vocabulary researchers we emphasise the importance of the lexical dimension of second language performance, we also recognise that it represents one perspective among several on what determines how effectively a candidate can perform in the IELTS Speaking Test.

REFERENCES
Adolphs, S and Schmitt, N, 2003, Lexical coverage of spoken discourse, Applied Linguistics, vol 24, pp 425-438
Ball, F, 2001, Using corpora in language testing, in Research Notes 6, EFL Division, University of Cambridge Local Examinations Syndicate, Cambridge, pp 6-8
Catt, C, 2001, IELTS Speaking preparation and practice, Catt Publishing, Christchurch
Durán, P, Malvern, D, Richards, B and Chipere, N, 2004, Developmental trends in lexical diversity, Applied Linguistics, vol 25, pp 220-242
Foster, P, 2001, Rules and routines: A consideration of their role in the task-based language production of native and non-native speakers, in Researching pedagogic tasks: Second language learning, teaching and testing, eds M Bygate, P Skehan and M Swain, Longman, Harlow, pp 75-93
Hasselgren, A, 2002, Learner corpora and language testing: Smallwords as markers of oral fluency, in Computer learner corpora, second language acquisition and foreign language teaching, eds S Granger, J Hung and S Petch-Tyson, John Benjamins, Amsterdam, pp 143-173
Hawkey, R, 2001, Towards a common scale to describe L2 writing performance, in Research Notes 5, EFL Division, University of Cambridge Local Examinations Syndicate, Cambridge, pp 9-13
Laufer, B, 1995, Beyond 2000: A measure of productive lexicon in a second language, in The current state of interlanguage, eds L Eubank, L Selinker and M Sharwood Smith, John Benjamins, Amsterdam, pp 265-272
Laufer, B and Nation, P, 1995, Vocabulary size and use: Lexical richness in L2 written production, Applied Linguistics, vol 16, pp 307-322
Lazaraton, A, 2002, A qualitative approach to the validation of oral language tests, Cambridge University Press, Cambridge
Malvern, D and Richards, B, 2002, Investigating accommodation in language proficiency interviews using a new measure of lexical diversity, Language Testing, vol 19, pp 85-104
McCarthy, M, 1990, Vocabulary, Oxford University Press, Oxford
McCarthy, M, 1998, Spoken language and applied linguistics, Cambridge University Press, Cambridge
Meara, P and Bell, H, 2001, P_Lex: A simple and effective way of describing the lexical characteristics of short L2 texts, Prospect, vol 16, pp 5-24
Meara, P and Miralpeix, I, 2004, D_Tools, computer software, Lognostics (Centre for Applied Language Studies, University of Wales Swansea), Swansea
Mehnert, U, 1998, The effects of different lengths of time for planning on second language performance, Studies in Second Language Acquisition, vol 20, pp 83-108
Nation, ISP, 2001, Learning vocabulary in another language, Cambridge University Press, Cambridge
Nation, P and Heatley, A, 1996, Range, computer program, English Language Institute, Victoria University of Wellington, Wellington
Pawley, A and Syder, FH, 1983, Two puzzles for linguistic theory: Native-like selection and native-like fluency, in Language and communication, eds JC Richards and RW Schmidt, Longman, London, pp 191-226
Read, J, 2000, Assessing vocabulary, Cambridge University Press, Cambridge
Read, J and Nation, P, 2004, Measurement of formulaic sequences, in Formulaic sequences: Acquisition, processing and use, ed N Schmitt, John Benjamins, Amsterdam, pp 23-35
Ross, S and Berwick, R, 1992, The discourse of accommodation in oral proficiency interviews, Studies in Second Language Acquisition, vol 14, pp 159-176
Schmitt, N (ed), 2004, Formulaic sequences: Acquisition, processing and use, John Benjamins, Amsterdam
Scott, M, 1998, WordSmith Tools, version 3.0, computer software, Oxford University Press, Oxford
Sinclair, J, 1991, Corpus, concordance, collocation, Oxford University Press, Oxford
Skehan, P, 1998, A cognitive approach to language learning, Oxford University Press, Oxford
Taylor, L, 2001, Revising the IELTS Speaking Test: Developments in test format and task design, in Research Notes 5, EFL Division, University of Cambridge Local Examinations Syndicate, Cambridge, pp 2-5
Taylor, L and Jones, N, 2001, Revising the IELTS Speaking Test, in Research Notes 4, EFL Division, University of Cambridge Local Examinations Syndicate, Cambridge, pp 9-12
van Lier, L, 1989, Reeling, writhing, drawling, stretching and fainting in coils: Oral proficiency interviews as conversation, TESOL Quarterly, vol 23, pp 489-508
Wray, A, 2002, Formulaic language and the lexicon, Cambridge University Press, Cambridge
Young, R and He, AW (eds), 1998, Talking and testing: Discourse approaches to the assessment of oral proficiency, John Benjamins, Amsterdam
Young, R and Milanovic, M, 1992, Discourse variation in oral proficiency interviews, Studies in Second Language Acquisition, vol 14, pp 403-424
