Contents
1.1
1.2
1.3
1.4
2.
2.1
2.2
2.3
2.4
3.
3.1 Interview tasks
3.1.1 Structured interviews
3.1.2 Unstructured interviews
3.2
3.2.1
3.2.2
4. Special issues
4.1 Practicality
4.2
4.3 Marking
Conclusion
Bibliography
Appendix: Sample rating scales
1. Speaking Proficiency English Assessment Kit (SPEAK), Educational Testing Service, USA.
2. Test in English for Educational Purposes (TEEP), Associated Examining Board, England.
3. Negotiated grading scheme, Tokyo Denki University, Japan.
4. Placement rating scale, Nova conversation school, Japan.
Such informal assessment is clearly a central part of language teaching. It is no less important
than the formal testing of achievement, or the testing of employment- and academic-related
proficiency. It follows that all teachers in EFL contexts, whatever their positions and duties, ought
to have a basic understanding of the principles underlying assessment of oral language skills.
The term 'proficiency' refers to the practical use of language as a whole. It is therefore best
assessed directly by eliciting extended samples of interactive language use in realistic contexts.
The indirect assessment of oral language, through controlled response to single test items, has
limited value as an indicator of real-life oral proficiency.
Unfortunately there is no such thing as a definitive test of general oral ability that can be applied
in any situation. The standard of 'native-like' proficiency is only a convenient abstraction - one
that ignores the personal and cultural differences that make communication real and complex. In
EFL contexts, such as Japan, testees are often quite unfamiliar with Western cultural references
and modes of behaviour, and so the design of test items needs to be as culturally neutral as
possible without being too vague.
These kinds of tests are used as high-grade filters that discriminate between learners and rank
them against a sliding scale. The main purpose of assessment here is to identify candidates for
access to limited opportunities such as scholarships and promotions. As such, they are not
well suited to assessing an individual's level of proficiency in detail.
Validity has been described as "the single most critical element in constructing foreign language
tests" (Nakamura 1995, 126). A valid test has a recognizable logic to it that makes the test a
meaningful tool of assessment. The most fundamental kind of validity relates to the underlying
theory of language on which the test is constructed (construct validity). This influences the
sampling of language material and tasks (content validity), which in turn has an effect on the
appearance of the test to the teachers and learners who use it (face validity).
Construct validity requires a set of principles that can adequately describe real-life language use.
In the case of oral language skills this is not such a simple matter. Speaking may seem to be a
general-purpose ability, but it occurs in many contexts and under many conditions, and for many reasons.
Each has its own characteristics and demands, especially when seen as an interactive skill. In the
last few decades a great deal of effort has been made to describe language use as an interactive or
communicative system. Canale and Swain's (1980) model of 'communicative competence' is
certainly the best known example in the literature on applied linguistics.
The first category, grammatical competence, covers vocabulary and the rules of word and
sentence formation. A second category, discourse competence, concerns the way language is conventionally
shaped in different communicative contexts. Describing a suspect during a police interview, for
example, requires more than basic grammatical skills - it involves selecting, organising and
linking elements together to create a structured and coherent whole. Canale and Swain distinguish
a third category called sociocultural competence, which covers the cultural forms of speech
deemed appropriate in a particular community.
Weir (1993), drawing from Bygate, conveniently includes both discourse and sociocultural
aspects of language use under the single heading "routine skills". These are "frequently recurring
ways of structuring speech, such as descriptions, comparisons, instructions, telling stories", and
include the patterns of interactional language use seen in such things as "buying goods in a shop,
or telephone conversations, interviews, meetings, discussions, decision making, etc" (Weir 1993,
32).
Canale and Swain's fourth category is strategic competence, which covers the various techniques
people use to manage and enhance communication. This category is covered by Weir under the
heading "improvisation skills" (1993, 32-4). Communication is a faulty and chaotic process and
speakers need to be able to improvise when their conventional language routines fail. This
includes both the "negotiation of meaning" in various ways to enhance understanding, as well as
the "management of interaction" to establish "who is going to speak next and what the topic is
going to be" (turn taking and topic initiation).
Improvisational skills are useful in every general context. For example, "Excuse me, what did you
say?", or its equivalent, is an essential phrase. In particular contexts, such as business negotiation,
there is a greater need for highly developed improvisational skills. In choosing or designing
specific performance criteria for an oral test it is important to decide which of these categories are
important and to what extent at each level of a candidate's ability. Different criteria will produce
different results. As noted by Brown, "if each group were to develop its own assessment
framework..., they may, in fact, through the inclusion or weighting of specific criteria, produce
schemes which lead to quite different evaluations of candidates' ability" (cited in Turner 1998,
198). The assessment criteria need to be related to the actual purpose of the test. This is sometimes
called systemic validity. It requires close consultation with the relevant educational and
employment bodies to help determine in detail what they intend the assessment instrument to
achieve.
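To make Brown's point concrete, here is a minimal sketch in Python, using entirely hypothetical criteria, weights and band scores (on a 0-5 scale), of how two weighting schemes can reverse the ranking of the same two candidates:

# Hypothetical band scores for two candidates on three criteria (0-5).
candidates = {
    "A": {"accuracy": 5, "fluency": 2, "interaction": 2},
    "B": {"accuracy": 2, "fluency": 4, "interaction": 5},
}

# Two assessment frameworks differing only in how they weight the
# criteria; the weights in each scheme sum to 1.0.
schemes = {
    "accuracy-weighted":    {"accuracy": 0.6, "fluency": 0.2, "interaction": 0.2},
    "interaction-weighted": {"accuracy": 0.2, "fluency": 0.2, "interaction": 0.6},
}

for scheme_name, weights in schemes.items():
    # Weighted composite score for each candidate under this scheme.
    scores = {name: sum(bands[c] * w for c, w in weights.items())
              for name, bands in candidates.items()}
    ranking = sorted(scores, key=scores.get, reverse=True)
    print(scheme_name, scores, "ranking:", ranking)

Under the accuracy-weighted scheme candidate A ranks first (3.8 against 3.0); under the interaction-weighted scheme the order is reversed (2.6 against 4.2).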
Global descriptors are not always brief. The Australian Second Language Proficiency
Ratings (ASLPR) scale, developed by Ingram and Wylie in 1982, uses an A4 page to present each
band descriptor in considerable detail. This allows for increased accuracy of identification, but at
the cost of flexibility of assessment. Detailed global scales effectively dictate what combination of
skills is to be recognized at each level, although in practice the particular features "may not co-occur in actual student performance" (Turner 1998, 200).
Although the general components of oral language use are those discussed above in 2.2, there are
various ways in which this "cake" of abilities can be sliced for assessment. The four analytic
rating scales reproduced in the Appendix illustrate four different ways of dividing oral ability
into assessment categories.
Within each category, different levels of ability need to be distinguished clearly using descriptive
language that can be matched against test results. With clear criteria determined by the overall
purpose of assessment (systemic validity) and founded on a clear theory of language use
(construct validity), it is possible to choose relevant assessment tasks. The choice of relevant tasks
is an important step in itself, for as shown in one study of interview-format discourse (cited in
Turner 1998, 195), "some of the supposed characteristics of intermediate versus advanced
learners represented in the rating scales were not substantiated in the actual performance of
intermediate and advanced learners."
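One way to picture this requirement is to treat an analytic scale as a simple lookup from criterion and band level to descriptive language, against which a rater matches an observed performance. The following sketch is purely illustrative; the 'interaction' criterion and its descriptors are invented, and do not reproduce any of the scales in the Appendix.

# An analytic rating scale as a data structure: each criterion maps
# band levels to the descriptive language a rater matches against an
# observed performance. The descriptors are invented for illustration.
rating_scale = {
    "interaction": {
        1: "Responds only when directly prompted; does not initiate topics.",
        2: "Takes turns when invited; occasionally asks for clarification.",
        3: "Initiates topics and repairs breakdowns, though with some effort.",
        4: "Manages turn taking and topic shifts smoothly and naturally.",
    },
}

def descriptor(criterion: str, band: int) -> str:
    # Return the descriptive language matched to a performance.
    return rating_scale[criterion][band]

print(descriptor("interaction", 3))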
Assessing interactive language means by definition that there is someone else actively taking part.
The person being tested is not only producing language, but is also responding in a
communicative way with another interlocutor. This is quite different from non-interactive
stimulus-response tasks. Techniques that use written or visual prompts to elicit language samples
are very straightforward and time-efficient to administer, and can also help to gauge the general
educational level of the student. The SPEAK test of oral proficiency is one example of a test
composed mostly of non-interactive tasks (Clankie 1995). Unfortunately, such tasks fulfil very few of
the qualities of communicative testing listed above.
There are many kinds of oral assessment task that can be used - one writer lists over sixty
variations (Underhill 1987). In essence there are two general approaches that meet the criteria for
interactive assessment. These are interview and role play.
Interview tasks thus represent a compromise solution to the problem of how to control
something that is inherently unpredictable.
A common interview structure has four stages - 1) a friendly warm up, 2) a level check to
determine the candidate's overall ability in terms of the criteria, 3) challenging probes to find
where performance drops, and 4) a final wind down at a less challenging level (Nagata 1995). In
EFL contexts, such as Japan, the structured interview is readily accepted by test users since it
mirrors the often formal social relationship that exists between teacher and student. This high face
validity makes it a popular method of oral assessment, despite its limitations as a measure of real-life oral ability.
The structured interview allows only a partial assessment of routine and improvisation skills (as
defined in 2.2 above). However keen the candidates may be, they remain passive respondents.
Interactive routine and improvisational skills require greater freedom on the part of the candidate
to direct and initiate the conversational flow.
In the unstructured interview there is ideally a greater use of interactive skills - including the
strategic skills of negotiation of meaning and turn taking.
meaning and turn taking.
Of course, this will depend on there being suitable motivation for conversation, and a positive
atmosphere in which communication can happen.
The major drawback of information gap activities is that they are often no more than mechanical
exercises requiring the production of linguistic forms on cue. There is very little scope for the
purposeful creative use of language, which makes it difficult for students to identify with the role.
Tightly scripted information gap tasks have little predictive validity, since real interactive
language use is much more unpredictable.
Role plays can be designed to test language use in various settings, such as at a hotel, a doctor's
office, a supermarket, or a boardroom. The role play may focus on general language functions (or
purposes), such as asking, checking, describing, complaining, apologizing, or giving advice, to
take only a few examples. Unlike information gap activities, role play instructions do not usually
specify particular language structures to be used, though they may be implied in the way the
instructions are written.
The lack of obvious manipulation of the testees' responses is the main strength of this format.
Well-designed role plays are purposive, interesting, motivating, interactive, unpredictable and
realistic, to use the characteristics of communicative language given by Weir (1988, 82). This
means that there is more scope for higher level testees to display a range of interactional and
improvisational skills.
The advantages of this approach mean increased validity as a test of real-life oral skills, but at the
cost of reliability of measurement, due to the unpredictability of testees' responses. To some extent
this can be balanced out by ensuring there are well-defined procedures of assessment based on
clear criteria. Testees themselves also need to understand the criteria under which their own
language performances will be judged.
4. Special Issues
Some special issues that influence the design and implementation of assessment also need to be
mentioned.
4.1 Practicality
The practicality of a test refers to the degree to which it is cost effective and easy to administer.
The number of testees, the time constraints for testing and marking, and the available human and
physical resources all need to be considered carefully before an assessment scheme is chosen. This is
not only an issue of money, but also of the perceptions of those who will be taking and using the
test. Also, if a test can be administered efficiently by assessors and markers, this increases the
validity and reliability of the results as a whole.
4.3 Marking
Applying descriptive assessment criteria to a candidate's oral performance requires making
subjective (or impressionistic) judgements. This is in contrast to objective marking, in which a
quantitative marking scheme is mechanically applied to structured tasks, such as multiple choice
and sentence completion exercises.
A descriptive scale of oral performance, with clearly defined levels, can be combined with
quantitative grades. Subjective judgements matching performance to such descriptors will then
generate a quantitative grade score useful for ranking candidates. Analytic rating scales, which
describe specific language skills (see 2.5 above), can be graded differently to emphasize the
relative importance of different skills. This is called 'weighting' the assessment criteria, and needs
to be based on a clear understanding of the stages of language development (construct validity)
and the purpose of the assessment instrument (systemic validity). A graded analytic scale can then
be combined with a global scale, for example as shown by McClean (1995) in her description of a
negotiated grading scheme at a Japanese university.
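As a rough sketch of how such a combination might work, the following fragment weights band scores on three analytic criteria (named after the categories in 2.2) and maps the composite onto a global band. The weights, scores and cut-off points are hypothetical; this is not McClean's actual scheme.

# Hypothetical weights for three analytic criteria (summing to 1.0).
weights = {"grammar": 0.3, "routine_skills": 0.4, "improvisation": 0.3}

def composite(bands):
    # Weighted sum of analytic band scores (each 0-5).
    return sum(bands[c] * w for c, w in weights.items())

def global_band(score):
    # Map the weighted composite onto a hypothetical global scale.
    if score >= 4.0:
        return "Advanced"
    if score >= 2.5:
        return "Intermediate"
    return "Elementary"

performance = {"grammar": 4, "routine_skills": 3, "improvisation": 2}
score = composite(performance)    # 4*0.3 + 3*0.4 + 2*0.3 = 3.0
print(score, global_band(score))  # 3.0 Intermediate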
Grading is very much dependent on the purpose of the test and the way this is reflected in the
criteria. An achievement test that is criterion referenced will judge candidates individually on
their achievement of learning outcomes. Score distribution depends solely on learning success,
and it is theoretically possible for all candidates to receive 100%. On the other hand, a test for
selection purposes will need to separate candidates, making fine distinctions between their
performances. This kind of comparative assessment is called norm referenced, and the scores are
ideally distributed on a bell-shaped curve, so that most candidates are placed at the centre of the
distribution.
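A toy calculation makes the contrast clear. In the sketch below the raw scores and the 60% mastery cut-off are invented: criterion referencing judges each score against the fixed cut-off, so that in principle every candidate can pass, while norm referencing converts the same scores into percentile ranks that separate the candidates from one another.

# Hypothetical raw scores out of 100 for five candidates.
scores = {"A": 85, "B": 72, "C": 72, "D": 64, "E": 41}

# Criterion referencing: each candidate is judged against a fixed
# standard, independently of how the others perform.
CUTOFF = 60
passed = {name: s >= CUTOFF for name, s in scores.items()}

def percentile(name):
    # Norm referencing: percentage of the group scoring below this
    # candidate, i.e. a rank within the group rather than a standard.
    below = sum(s < scores[name] for s in scores.values())
    return 100 * below / len(scores)

print(passed)                                    # all but E pass
print({name: percentile(name) for name in scores})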
Conclusion
An effective test of interactive oral skills is not a haphazard selection of tasks chosen at random.
Instead each assessment situation presents a set of practical demands that need to be specifically
addressed. The principles of validity, reliability, practicality and 'bias for best' provide basic
guidelines for evaluating the effectiveness of a test instrument.
A theoretical model of oral skills is also necessary to structure what is fundamentally fleeting and
changeable. At the same time, it needs to be remembered that human skills are highly dependent on a
variety of internal and external factors that are independent of language ability per se. The art of
testing involves minimising the influence of such extraneous factors and creating conditions under
which all candidates can display their genuine abilities.
Bibliography
Canale, M. and M. Swain. 1980. Theoretical bases of communicative approaches to second
language teaching and testing. Applied Linguistics 1 (1): 1-47.
Clankie, S. 1995. The SPEAK test of oral proficiency: A case study of incoming freshmen. In
JALT Applied Materials: Language Testing in Japan. eds. J. D. Brown and S. O. Yamashita,
119-125. Tokyo: The Japan Association for Language Teaching.
Kent, H. 1998. The Australian Oxford Mini Dictionary. 2nd ed. Melbourne: Oxford University
Press.
McClean, J. 1995. Negotiating a spoken-English scheme with Japanese university students. In
JALT Applied Materials: Language Testing in Japan. eds. J. D. Brown and S. O. Yamashita,
119-125. Tokyo: The Japan Association for Language Teaching.
Nagata, H. 1995. Testing oral ability: ILR and ACTFL oral proficiency interviews. In JALT
Applied Materials: Language Testing in Japan. eds. J. D. Brown and S. O. Yamashita,
119-125. Tokyo: The Japan Association for Language Teaching.
Nakamura, Y. 1995. Making speaking tests valid: Practical considerations in a classroom setting.
In JALT Applied Materials: Language Testing in Japan. eds. J. D. Brown and S. O.
Yamashita, 119-125. Tokyo: The Japan Association for Language Teaching.
Turner, J. 1998. Assessing speaking. Annual Review of Applied Linguistics 18: 192-207.
Underhill, N. 1987. Testing Spoken Language: A Handbook of Oral Testing Techniques.
Cambridge: Cambridge University Press.
Weir, C. J. 1988. Communicative Language Testing with Special Reference to English as a
Foreign Language. Exeter: University of Exeter.
Weir, C. J. 1993. Understanding and Developing Language Tests. New York: Prentice Hall.