COMPUTERIZED DICTATION FOR ASSESSING LISTENING PROFICIENCY
David Coniam, Chinese University of Hong Kong
ABSTRACT
This paper describes the construction and small-scale implementation of a computer program which can be used on a self-access basis to assess secondary school students' ESL listening proficiency. The test involves an extended dictation in the form of a dialogue. Subjects both hear and see on the screen (to provide context) the first speaker's utterances, but only hear the second speaker's utterances. After each exchange, subjects have to type in the second speaker's utterance, and the match between their input and the utterance is scored. Results indicate a good correlation with traditional pen-and-paper tests, suggesting that the concept has the potential to assess listening other than by administering a test to a group of subjects via a taped recording at a single sitting.
KEYWORDS
Language testing, listening, dictation, ESL
INTRODUCTION
While computerized testing in the 1980s and 1990s has been receiving attention due to the increased availability and power of computers, computerization has essentially taken place in two areas:
1. the adaptation of traditional pen-and-paper tests to an electronic medium
The ease of marking and immediate delivery of scores via multiple-choice and nth
deletion exact-scoring cloze tests has inevitably resulted in the development of many
such tests in computerized form.
2. the implementation of computer-adaptive language testing (CALT)
The establishing of tests and item banks calibrated by Item Response Theory techniques
to determine subjects' ability through a minimum of test items has been an area where
there has been ample recent research and development (Hambleton 1989; Carlson 1994;
Meunier 1994).

In his 1988 paper (p. 37), Alderson, with regard to point (1) above, describes Computer Based English Language Testing (CBELT) procedures in the 1980s and suggests possible future directions. He makes the important point that one requirement of a CBELT test is that it should be:
...something that could not be done equally by other means (e.g. by pen
and paper).
In a subsequent 1991 paper, Alderson notes, however, that surprisingly little has been done on the development of computer-based tests:
The tremendous opportunities that computers might offer for innovation in test method have not been taken up... (1991, 19)
It might be argued that the innovative possibilities which CBELT offers have been realized in CALT programs, where subjects can be awarded a score on the basis of many fewer items than are necessary with traditional modes of assessment, where subjects work sequentially through an entire test. See Smittle (1991), for example, for a description of a CALT test of reading. Since 1993 it has been possible to take the Graduate Record Examination for entry to graduate school in the US in this way. And in Singapore, where the University of Cambridge Local Examination Syndicate is experimenting with self-access testing by telephone, it is possible to dial in and take a test by modem.
As Meunier rightly observes, however, the focus of CALT tests too often centers around the testing of grammar via the multiple-choice or cloze formats, since open-ended responses cannot be dealt with in CALT tests (1994, 37). The tests described above, for example, focus on reading and general language proficiency via such restricted formats. Further, the testing of the listening skill is an issue which has received much less attention in terms of self-access testing (see, however, Mydlarski and Paramskas 1985; Ariew and Dunkel 1989; Dunkel 1991). One of the reasons for this lack of attention is self-evident: the testing of listening by conventional means necessitates playing an audio tape, usually in a classroom or a school hall. For reasons of security, listening tests cannot be repeated over and over again as tests of reading can.
The focus of the testing of listening in this paper is for Hong Kong secondary schools,
with which the current author has had considerable contact in terms of the assessment
that takes place there. Current testing of listening involves large groups of subjects
taking the same test at one time, a factor which places severe operational constraints on
running listening tests. For reasons of face validity, Hong Kong secondary schools feel
that different tests have to be prepared and administered each time so as to prevent
students gaining an advantage by having heard the test before.

If a bank of test material were established which could be accessed on an individual basis through headphones, the problem of possible advantage could be overcome,
and larger numbers of students could be administered listening tests without
compromising security and the reliability of the results. The experimental listening test
described in this paper could form the basis of such testing in schools.
The use of dictation as a vehicle for testing listening comprehension is not uncontroversial, from the perspectives of what dictation measures and how a dictation should be marked. Although the focus of this paper is essentially on the design and
implementation of a computerized dictation test, theoretical underpinnings need to be
reviewed so that test operation and functionality can be put into perspective.
Dictation as a test of global proficiency/listening proficiency
Oller (1979) suggests that the partial dictation concept (where all of the test material is presented in auditory form, but only part of it is presented in written form) is a valid pragmatic testing measure because it requires subjects to interpret what they hear as part of natural spoken discourse, and hence subjects' global language proficiency can be tapped (1979, 266). While the concept of a single factor to account for
second language proficiency is disputed by certain researchers (e.g., Bachman and
Palmer 1982), the work of Fouly (1985) and Fouly and Cziko (1985) in dictation would,
however, appear to provide evidence supporting dictation as a valid tool for sampling
second language proficiency, and listening proficiency in particular.
The marking of dictation
One of Oller's proposed marking schemes for dictation is that one mark should be deducted for each error, as long as the errors are not simply spelling errors (1979, 282). Examples which Oller provides of misspellings for which no mark is deducted ("poisened" for "poisoned"; "repeate" for "repeat") indicate that a substantial amount of judgement is involved (1979, 289). It has not been possible to incorporate judgement as a factor in the current computerized dictation test: the marking scheme essentially
consists of marking a word as either correct or incorrect on the basis of grammar and
spelling. As is discussed in the Results section below, in the current dictation, very little
misspelling actually occurred, suggesting that the necessity for correctly-spelt words
does not invalidate the testing procedure.
BACKGROUND TO THE HONG KONG SECONDARY SCHOOL SYSTEM

The Hong Kong secondary school system has seven forms, extending from Secondary 1
(age 12) to Secondary 7 (age 19). This equates roughly to grades 6 to 13 in the US
system. Students take the Hong Kong Examination Authority's (HKEA) Hong Kong Certificate of Education (HKCE) examinations at the end of Secondary 5, and the Advanced Supplementary Level (ASL) examinations (the tertiary entrance examinations) at the end of Secondary 7. The English language examinations which students take are the HKCE in English Language and the ASL Use of English.
TEST DESIGN
The current test involves an adaptation of the dictation procedure as a repetition/imitation task (Fouly and Cziko 1985). The test, adapted from an idea in the Hong Kong students' textbook Focus 5 (Richards 1990), consists of a dialogue of authentic native-speaker speech recorded at normal speaking speed. The dialogue is between two participants, Alan and Brian. Brian is a student of English and is being interviewed by Alan about his motivation for learning English. Subjects both hear and see written on the screen what Alan, the first speaker, says, but they only hear (via headphones) what Brian, the second speaker, says. The rationale for this is that the first speaker's questions provide a context (albeit topical rather than grammatical, as is the case with Oller's examples of partial dictation [1979, 285]). The topical context provided by the first speaker's questions does, however, prepare subjects for the test items, the second speaker's answers.
After subjects have heard an utterance, they are given time to type it in. (Utterances need to be typed in exactly, although as it is a listening test, punctuation/capitalization are not scored.) The utterance is then parsed for accuracy. The current author experimented with a number of methods of marking, and the one which produced the most consistent results, and which the program uses in this study, is an adaptation of that proposed by Oller (1979). The marking procedure involves the following:
one mark is awarded for a correct word in the correct place in the utterance;
half a mark is awarded for a correct word somewhere in the utterance;
no mark is awarded for a word which is spelled incorrectly.
A simple calculation then converts the total mark for an item to a percentage in terms of the number of words in an utterance. Fouly and Cziko (1985) recommend the procedure of awarding 1.0 marks for a totally correct answer and zero for an answer which contains any incorrect words. While this is a simpler operation in terms of marking, such a marking scheme leaves little room for variability within subjects' answers. (The HKEA recognized this fact by splitting many of the longer answers in its former Secondary 6 Higher Level Examination of English and by scoring such answers 2-1-0.)
To put the current marking system into perspective, an example will now be given. Consider the second exchange which subjects see and hear (the second speaker's utterance, which is to be typed in, is in italics):
Alan: How long have you been learning English?
Brian: I started learning when I was ten.
With the system for scoring set at 1.0 marks for a correctly-spelt word in the correct place and 0.5 marks for a correct word somewhere in the utterance, Table 1 below presents two subjects' responses.
Subject 1 Input    Score           Subject 2 Input    Score
i                  1.0             i                  1.0
started            1.0             start              0
learning           1.0             at                 0
about              0               learning           0.5
ten                0.5             about              0
                                   i                  0.5
                                   was                0.5
                                   for                0
                                   ten                0.5
Total              3.5/7 (50%)     Total              3/7 (43%)

Table 1. Two subjects' responses for the utterance "I started learning when I was ten"
On the above item, Subject 1 has scored higher than Subject 2. While we would agree that this is correct in that Subject 1 appears to have understood more than Subject 2, the score differential does not really reflect this. To what extent this points up problems with the current testing procedure is discussed below, where correlations between the computerized dictation and a conventional pen-and-paper listening test are presented.
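By way of illustration, a minimal sketch of this marking scheme is given below in Python. The function names are illustrative rather than taken from the original program, the handling of repeated words is a simplification, and the sketch assumes, as described above, that capitalization and punctuation are ignored.

import string

def normalize(utterance):
    # Lower-case and strip punctuation, since capitalization and
    # punctuation are not scored in the listening test.
    table = str.maketrans("", "", string.punctuation)
    return utterance.lower().translate(table).split()

def score_response(response, target):
    # 1.0 for a correctly-spelt word in the correct position,
    # 0.5 for a correct word appearing elsewhere in the utterance,
    # 0   otherwise (including misspelt words).
    # The total is also expressed as a percentage of the target length.
    target_words = normalize(target)
    response_words = normalize(response)
    total = 0.0
    for position, word in enumerate(response_words):
        if position < len(target_words) and word == target_words[position]:
            total += 1.0            # right word, right place
        elif word in target_words:
            total += 0.5            # right word, wrong place
    return total, 100.0 * total / len(target_words)

# A response along the lines of Subject 1's in Table 1:
target = "I started learning when I was ten"
print(score_response("i started learning about ten", target))  # (3.5, 50.0)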
The nine exchanges of the dialogue were constructed so that, in line with the work of
Fouly and Cziko (1985) on the scalability of difficulty of dictation items, earlier test
items were short, with a gradual increase in the number of words tested in each item.
While there is a progression of difficulty through the utterances, this progression was
not mathematically enforced, as this would have made for a rather unnatural-sounding
dialogue.

There is some disagreement as to how long repetition/imitation exercises such as dictation should be. Wanner (1974) suggests that subjects start to lose the exact wording of a sentence after 16 intervening syllables, whereas Perkins, Brutton and Angelis (1986) suggest the threshold for ESL subjects is nearer eight syllables. These two tasks, however, involve the repetition of individual isolated sentences. In their dictation test, Fouly and Cziko (1985) present results for items which appear to be functioning well with 22 syllables in them. Given that the current test involved a dictation in the form of a dialogue where there was considerable supporting context, Fouly and Cziko's maximum of 22 syllables was set as the upper limit for item length.
Utterance           1    2    3    4    5    6    7    8    9
No. of words        4    7    8    6    5    6    9   11   12
No. of syllables    5    9   10    7    6    9   18   19   19

Table 2. Number of words tested in each utterance


It can be seen that, while there is not an absolute increase in the number of words in
each utterance, the tendency is for listening demands to increase through the course of
the dialogue. While word frequency is no guarantee of the accessibility of a text, there is
evidence that a strong relationship exists between control over the most frequent words
and language proficiency (Harlech-Jones 1983; Meara and Jones 1988; Sinclair 1991).
Percentage of all English text accounted
for by the most frequent words             Speaker A    Speaker B

1%-80%     (1-2,145)                       47 (92%)     48 (89%)
80%-90%    (2,146-6,358)                    4 (8%)       6 (11%)
>90%       (>6,358)                         0            0

Table 3. Word frequency in the dialogues

An analysis of the words in the dialogue against the University of Birmingham's Bank of English list of most frequent words puts 91% of the words in the test within the 80% most common words, as can be seen from Table 3 above.1 Both sides of the dialogue are presented in the table, since subjects need to understand the first speaker's questions in order to better interpret the second speaker's responses.
Secondary 7 students would be expected to know all the words in the dialogue, since all the words fall within the 90% most frequent words, with the vast majority occurring within the 2,145 (the top 80%) most frequent. Given this, the test can therefore be viewed, as Oller (1979) argues, as a test of interpreting in context, rather than simply accuracy in terms of spelling. Conversely, if students are not familiar with many of the words, they also have to guess from the limited context, which affects the validity of the exercise. Eight words fell outside the top 80%. These were:
foreigners, mistakes, prefer, hello, studying, classes, learning, courses
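The frequency-band analysis summarized in Table 3 can be approximated with a short routine of the kind sketched below. The rank table and its contents are illustrative placeholders; in practice the ranks would be taken from a frequency-ordered list such as the Bank of English list referred to above.

def frequency_band(rank):
    # Assign a word to one of the three bands used in Table 3, based on
    # its rank in a frequency-ordered word list. The cut-offs (2,145 and
    # 6,358) are the ones cited above.
    if rank is None or rank > 6358:
        return ">90%"
    if rank > 2145:
        return "80%-90%"
    return "1%-80%"

def band_counts(words, ranks):
    # Count how many distinct words fall in each band. `ranks` is an
    # assumed mapping from word to frequency rank.
    counts = {"1%-80%": 0, "80%-90%": 0, ">90%": 0}
    for word in set(w.lower() for w in words):
        counts[frequency_band(ranks.get(word))] += 1
    return counts

# Toy illustration only; real ranks would come from the corpus list.
toy_ranks = {"i": 10, "started": 700, "learning": 2500}
print(band_counts("I started learning".split(), toy_ranks))
# {'1%-80%': 2, '80%-90%': 1, '>90%': 0}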
Allied to the fact that the majority of the words in the dialogue were known entities is the fact that, for Hong Kong students generally, spelling is not a major problem. The stipulation that words must be spelt correctly is therefore an acceptable constraint.2
A time limit of five seconds for each word in an utterance was set, so that the test would terminate if a student was extremely indecisive or slow to respond. With test item 9, for example, which was 12 words long, subjects had 60 seconds in which to enter their answer.
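A simplified sketch of how a single exchange might be administered under this per-word time allowance is given below. The audio routine is a placeholder, and the sketch only checks elapsed time after the response has been entered, whereas the actual program terminates the item when the limit is reached.

import time

SECONDS_PER_WORD = 5  # the per-word time allowance described above

def administer_item(prompt_text, target_utterance, play_utterance):
    # Show the first speaker's line on screen, play the second speaker's
    # line (play_utterance is a placeholder for whatever audio routine is
    # available), then collect the typed response within the time limit.
    time_limit = SECONDS_PER_WORD * len(target_utterance.split())
    print(prompt_text)                # first speaker's line, seen and heard
    play_utterance(target_utterance)  # second speaker's line, heard only
    start = time.time()
    response = input("Type what the second speaker said: ")
    if time.time() - start > time_limit:
        return None                   # treated as having run out of time
    return response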
SUBJECTS
The subjects who took the test were the 28 students in a Secondary 7 class in a local
Hong Kong secondary school. Prior to taking the computer listening test, they
attempted a number of short multiple-choice items which had been calibrated against a
representative sample of the Hong Kong cohort of secondary school students (Coniam
1995). The results of these items placed the general language proficiency of the students
between Secondary 5 and Secondary 6. These test results are purely for the readers' reference, so that the subjects' ability can be placed within the Hong Kong context: the
relationship between listening and general proficiency is not a concern of the current
study.
The subjects constituted a general Secondary 7 class, who were not particularly
computer-literate. To forestall any possible unfamiliarity, they were therefore given a
few minutes to familiarize themselves with the notebook computer and its keyboard
before attempting the test. Subjects took the test toward the end of their time in
Secondary 7 and quite close to when they took the mock examination for the ASL Use of
English Exam, so that comparisons might be made with their performance on the two
types of test.
Subjects took the test via headphones at the back of the class while the teacher
conducted a normal lesson. While these conditions are far from ideal, they underscore the fact that a listening test can be conducted on an unobtrusive, self-access basis.
RESULTS
As mentioned above, subjects were allowed five seconds for each word in a particular
utterance. Subjects did not report any instances, however, of having had insufficient
time, so the time limit would appear to be acceptable.
For a proficiency test, the optimum mean is in the region of 50% (Gronlund 1985, 253). Such a mean suggests, as an initial indicator, that the test is neither too easy nor too difficult for the subjects and roughly matches subjects' level of proficiency.
The results for the test are presented in Table 4 below. The overall mean for the test was 43%, slightly lower than the optimum of 50%, and reflects the subjects' generally lower ability, as indicated by the group's performance on the short items. The standard deviation for the test was 11.1%, which is very similar to that obtained by the HKEA on the Use of English examination, and indicates that the group are fairly homogeneous in terms of ability.
Utterance    No. of words    Utterance Mean    Corr. with test total
1             4              35%               .48, p=.01
2             7              37%               .46, p=.01
3             8              26%               .24, p=.22
4             6              67%               .49, p=.01
5             5              66%               .49, p=.01
6             6              35%               .52, p=.01
7             9              64%               .55, p=.002
8            11              39%               .63, p=.000
9            12              25%               .28, p=.15

Test Mean 43%; SD 11.1%

Table 4. Test results

Items generally correlate highly with the whole test, with the exception of items 3 and 9. Given that items 3 and 9 also emerged as rather too difficult3 and that the test had not been piloted (in which case items 3 and 9 would have been amended or deleted), the results look quite encouraging. It can be seen that the number of words in each utterance is not necessarily an indicator of a test item's difficulty. This does not constitute a problem, however, as the test is intended to measure not grammatical accuracy per se, but the perceiving of words in context. It has to be admitted that this cannot be directly construed as having comprehended the utterances; however, the correlation between subjects' scores on the computer dictation and a traditional pen-and-paper listening test (their mock Use of English examination), which was taken at a similar time, was 0.46 (p=.01). This suggests that the two tests are sampling similar aspects of listening proficiency.
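A sketch of how the descriptive figures in Table 4 might be computed from a matrix of item scores is given below; the layout of item_scores is an assumption, and the p-values reported in the table would require a separate significance test not shown here.

from statistics import mean, pstdev

def pearson(x, y):
    # Plain Pearson correlation between two equal-length lists of scores.
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def test_statistics(item_scores):
    # item_scores: one list per item, each holding every subject's
    # percentage score on that item (an assumed layout). Returns the
    # test mean, standard deviation, and each item's correlation with
    # subjects' overall scores.
    totals = [mean(subject) for subject in zip(*item_scores)]
    return mean(totals), pstdev(totals), [pearson(item, totals) for item in item_scores]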
The issue of dictation being simply a disguised test of spelling was alluded to above.
While there were instances of misspelling (some examples of which are presented in Table 5), an examination of subjects' output provided little evidence of incorrectly-spelt words distorting the scoring.
Subjects' input      Target word
begain               begin
lisen, lesent        listen
foregner             foreigner

Table 5. Incidences of misspelling


In general, there was very little misspelling, with few incidences of very high frequency words (such as "wood" or "woudl" for "would") being misspelled. Interestingly, in the case of the word "foreigners", six of the 28 subjects typed in "stranger" or "strangers", underscoring the fact that the listening test is sampling context as much as straightforward accuracy, and that subjects did in fact comprehend the test and the task.
DISCUSSION AND CONCLUSIONS
The fact that the test can be administered and marked on a self-access basis suggests
that, in terms of a computer listening test as a diagnostic device for schools, the test may
well have potential. With developments in computer technology, and the fact that all
schools now have quite large computer labs, it may well be possible to run the same
test independently on a number of computers.

Subjects' comments on the test and the medium were generally positive, suggesting that the medium of the keyboard as the mode of input does not constitute a major threat to the validity of the concept (see Wise and Plake 1989 for a discussion of the effects of administering tests via computer). Coupled with the fact that spelling does not appear to be a problem, even for lower ability students in Hong Kong, the computer can be assumed to have minimal impact on subjects' performance on such tests.
In this study, subjects have been awarded one mark for a correct word in the correct position in the utterance and 0.5 of a mark for the correct word appearing somewhere in the utterance. One minor problem here is with inflected words, where a subject has missed off an "s" or "ed" ending. For example, the first exchange of the dialogue is as follows:
Alan: How long have you been studying English?
Brian: For about six years.
If subjects enter "for about six year", they score zero for the word "year". One
amendment to the scoring procedure with which the author has been experimenting involves comparing subjects' input, where common inflections may have been omitted, against the Bank of English list of most frequent words, and awarding a score of 0.25 for such input. It appears, however, that such an amendment makes only a minor difference to subjects' overall scores and to overall results such as the test mean and concurrent validity. Further, it may be argued that such an amendment places a tighter focus on word-for-word accuracy and possibly reduces the listening in context which the test is sampling.
At present, to give sufficient surrounding context, subjects see what Speaker A says, but only hear what Speaker B says. Another option with which the current author has been experimenting, and which would improve the validity of the listening test, is for subjects only to hear both speakers, without the support of Speaker A's words on the screen. This is, however, an option which needs to be field-tested for its viability.
For the system to be adopted by schools, it needs to be as user-friendly as possible. The test in the current project has involved the direct recording of sound files onto a computer's hard disk. Although this involves less effort than is necessary with a conventional listening test in terms of recording and duplicating, the construction of listening tests, even on computer, remains time-consuming.
The test is essentially an objective, rather than integrative, test of listening. One of the limitations of computerized testing, as discussed above, is the fact that a word like "strangers", which a number of subjects wrote in place of "foreigners", was marked as incorrect by the program although the word is semantically acceptable. Nonetheless, despite these limitations, the program still requires more of subjects in terms of
listening than simply selecting from a set of multiple-choice options.
Given that the sample of one class of 28 students on which the test has been run is
rather limited, conclusions can only be tentative. However, the fact that the computer
listening test and the Use of English mock listening examination correlated significantly
at 0.46 (p= 0.01) suggests that the concept of the computer listening test is one which is
worth pursuing.
NOTES
1. The figures for the number of words and the amount of English text for which they account were obtained from a frequency list derived in February 1996 from the 211-million-word Cobuild Bank of English corpus.
2. The HKEA recognized this fact in the criterion-referenced marking scheme for the pre-1980 HKCE composition paper, where a maximum of 5 marks only could be deducted for incorrect spelling.
3. 30% is generally regarded as the cutoff point in terms of item difficulty; see Falvey, Holbrook and Coniam 1994, 119ff for an elaboration.

REFERENCES
Alderson, J. (1991). Language Testing in the 1990s: How Far Have We Come? How
Much Further Have We to Go? Current Developments in Language Testing, edited
by S. Anivan. SEAMEO, Regional Language Centre.
______. (1988). Innovations in Language Testing: Can the Micro-computer Help?
Special Report No. 1: Language Testing Update. Lancaster: Lancaster University.
Ariew, R., and P. Dunkel (1989). A Prototype for a Computer-based Listening
Comprehension Proficiency Test. Final Report. Pennsylvania: Dept. of Speech
Communication, Pennsylvania State University.
Bachman, L., and A. Palmer (1982). The Construct Validation of some Components of
Communicative Competence. TESOL Quarterly 16, 449-465.
Carlson, R. (1994). Computer Adaptive Testing: A Shift in the Evaluation Paradigm. Journal of Educational Technology Systems 22, 3, 213-224.
Coniam, D. (1995). Towards a Common Ability Scale for Hong Kong English
Secondary School Forms. Language Testing 12, 2, 184-195.
Dunkel, P. (1991). Computerized Testing of Nonparticipatory L2 Listening Comprehension Proficiency: An ESL Prototype Development Effort. Modern Language Journal 75, 1, 64-73.
Falvey, P., J. Holbrook and D. Coniam (1994). Assessing Students. Hong Kong: Longman.
Fouly, K. (1985). A Multivariate Study of the Nature of Language Proficiency and its
Relationship to Learner Traits: A Confirmatory Approach. Unpublished Ph.D.
thesis. University of Illinois at Urbana-Champaign.
______., and G. Cziko (1985). Determining the Reliability, Validity and Scalability of
the Graduated Dictation Test. Language Learning 35, 5, 555-566.
Gronlund, N. (1985). Measurement and Evaluation in Teaching. New York: Macmillan.
Hambleton, R. (1989). Applications of Item Response Theory. International Journal of
Educational Research 13, 2, 121-220.
Harlech-Jones, B. (1983). ESL Proficiency and a Word Frequency Count. ELT Journal
37, 1, 62-70.
Meara, P., and G. Jones (1988). Vocabulary Size as a Placement Indicator. Applied
Linguistics in Society: Papers from the 20th Annual Meeting of the British Association for
Applied Linguistics, edited by Pamela Grunwell. Nottingham, September 1987.
Meunier, L. (1994). Computer Adaptive Language Tests (CALT) Offer a Great Potential for Functional Testing. Yet, Why Don't They? CALICO Journal 11, 4, 23-39.
Mydlarski, D., and D. Paramskas (1985). Template System for Second Language Aural
Comprehension. CALICO Journal 3, 8-12.
Oller, J. (1979). Language Tests at School: A Pragmatic Approach. London: Longman.
Perkins, K., S. Brutton and P. Angelis (1986). Derivational Complexity and Item
Difficulty in a Sentence Repetition Task. Language Learning 36, 2, 125-141.
Richards, J. (1991). Focus 5. Hong Kong: Oxford University Press.
Sinclair, J. (1991). Corpus Concordance Collocation. Oxford: Oxford University Press.
Smittle, P. (1991). Computerized Adaptive Testing in Reading. Journal of
Developmental Education 15, 2, 2-5.
Wise, S., and B. Plake (1989). Research on the Effects of Administering Tests via
Computers. Educational Measurement: Issues and Practice 8, 3, 5-10.

ACKNOWLEDGMENTS
I would like to thank Cobuild of the University of Birmingham for access to the Bank
of English corpus.
APPENDIX I: DIALOGUE USED IN THE COMPUTER DICTATION EXERCISE
(Brian's responses [in italics] formed the test items)
Alan: Hello Brian, can I ask you a few questions about your English?
Brian: Sure what would you like to know [done as an example]
Alan: How long have you been studying English?
Brian: For about six years.
Alan: When did you begin?
Brian: I started learning when I was ten
Alan: Well your English is pretty good.
Brian: Come on I make quite a lot of mistakes.
Alan: Mm, not really; are you still studying English now?
Brian: Yes I take courses at night
Alan: How often do you have classes?
Brian: Twice a week usually
Alan: And how do you find learning English?
Brian: It's difficult but I enjoy it
Alan: And do you study on your own at all?
Brian: Yes I listen to the radio every evening
Alan: What about English books?
Brian: I read a lot of English books but I prefer speaking
Alan: So who do you speak to in English?
Brian: I try and talk to foreigners on the street when I can
Alan: Good stuff, keep it up!

AUTHOR'S BIODATA
David Coniam is an Associate Professor in the Faculty of Education at the Chinese
University of Hong Kong, where his teaching duties involve working on ESL
methodology with secondary school teachers. His publications and research interests
are in the fields of computational linguistics, language testing and English language
teaching methodology.
AUTHOR'S ADDRESS
Faculty of Education
Chinese University of Hong Kong
Sha Tin, Hong Kong
Phone: 852 2609 6917
Fax: 852 2603 6129
E-mail: coniam@cuhk.edu.hk
