1. Introduction
A great deal of experience in corpus work has been acquired in the past few
decades and a stock of very powerful routines for processing text held in
machine-readable form has now been developed. Some of these routines have
not only become standard operations which any corpus holder will have access
to, but they are also now included in software packages which are readily
available to the public at very modest prices. The most popular and versatile of
these packages is Microconcord, marketed by Oxford University Press. OUP
have so far also released two corpus collections of one million words each and
are planning to release more corpora as part of the British National Corpus
initiative. Working with corpora is therefore becoming a perfectly viable
proposition even at the level of individual researchers.
The corpus analyst's stock-in-trade is the KWIC concordance, KWIC
being an acronym for Key Word In Context. This is a list of all the occurrences
of a specified keyword or expression in the corpus, set in the middle of one line
of context each. The following is a KWIC concordance of Greek from the OUP
Corpus Collection A (British newspaper texts from The Independent and The
Independent on Sunday):
THOMPSON in Athens </bl> <st> <p> A GREEK air force warrant officer, Michalis P MCA_IND4.FOR
ere four sendings-off, three of them Greek, as Bulgaria beat the visitors 4-0 in MCA_IND3.SPO
There are two more sales today. <p> Greek bidders descended on the sale yesterd MCA_IND4.HOM
ed over oil spill </hl> <st> <p> The Greek captain of a ship responsible for Sun MCA_IND1.HOM
ere added to other Christian groups (Greek Catholics, Greek Orthodox, Armenians MCA_IND1.FOR
er </hl> <st> <p> ATHENS (UPI) _ The Greek Chief Justice, Yannis Grivas, was swo MCA_IND3.FOR
s, it fails to persuade us that with Greek drama we are not secular theatre-goer MCA_IND2.ART
1012 </dt> <hl> Storms threaten mild Greek election climate </hl> <bl> From PETE MCA_IND2.FOR
1011 </dt> <hl> THEATRE / A spark of Greek fire: The Trojan Women - Liverpool Ev MCA_IND2.ART
well as agriculture, she has studied Greek, History, Philosophy, Astonomy, Mathe MCA_IND2.ART
eration by sunshine and retsina on a Greek island holiday. (They don't have wint MCA_IND2.ART
r _ alleged to be a rare work of the Greek master Skopas. Archaeologists and art MCA_IND4.FOR
t this place?" asks the disapproving Greek matriarch when her grandson returns f MCA_IND1.ART
of the Titans (1981), his foray into Greek mythology. <p> There must be easier w MCA_IND3.ART
ter Lebanon in 1926, they proposed a Greek Orthodox as president, because he wou MCA_IND1.FOR
r Christian groups (Greek Catholics, Greek Orthodox, Armenians and others), the MCA_IND1.FOR
sarouchis, the doyen of contemporary Greek painters, who died early this year, a MCA_IND4.HOM
depicted wearing bull's horns in the Greek playwright Euripides' Bacchae. <p> Th MCA_IND4.HOM
er this constructive interlude, that Greek politics will avoid reverting to thei MCA_IND2.FOR
CORPORA IN TRANSLATION STUDIES 227
sion _ it is the first time a former Greek prime minister has been indicted by p MCA_IND2.FOR
pline by backing bouzouki players in Greek restaurants, or a plethora of cabaret MCA_IND2.ART
ted by a Wren facade, another by the Greek revival. <p> At best these buildings MCA_IND1.ART
04.) Where can you go to church in a Greek sarcophagus? (St Jude's Church, Blyth MCA_IND1.ART
three sciences and perhaps Latin and Greek. <sect> Home News Page 3 </sect> </st MCA_IND4.HOM
residing over the Everyman's current Greek sequence caused the jolt which sent a MCA_IND4.ART
e sculpture _ the helmeted head of a Greek soldier _ alleged to be a rare work o MCA_IND4.FOR
election campaign, though low-key by Greek standards, is proving rancorous. The MCA_IND4.FOR
venged unnaturally. This may be more Greek than Asian, but it's also compelling MCA_IND2.ART
r of riddles. <p> Why did Alexander 'Greek' Thompson design Eygptian halls in Un MCA_IND1.ART
like Templeton's carpet factory or 'Greek' Thompson's Vincent Street church (is MCA_IND1.ART
eeds to remember is that the 'almost Greek tragedy' line should be restricted to MCA_IND3.ART
sequel. Aspiring to the condition of Greek tragedy, this version makes the dread MCA_IND2.ART
The codes at the end of each line indicate the source of the concordance line
(whether it is from the arts [ART], sports [SPO], home [HOM] or foreign news
[FOR] sections, for instance). The codes in angle brackets indicate typographic
conventions (e.g. <b> for bold and <u> for underline) and other information
(e.g. <hl> for headline and <ct> for caption).
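A minimal sketch of how a KWIC concordance of this kind might be produced. It is not a description of how Microconcord works; the function name `kwic`, the `width` parameter and the sample text are assumptions made purely for illustration.

```python
import re

def kwic(text, keyword, width=30):
    """Return one line per occurrence, with the keyword centred in its context."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b")
    lines = []
    for match in pattern.finditer(text):
        # Up to `width` characters of context on either side of the keyword.
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Pad both sides so that the keyword always starts in the same column.
        lines.append(f"{left:>{width}}{match.group()}{right:<{width}}")
    return lines

# Invented sample text, loosely modelled on the concordance lines above.
sample = ("The Greek captain of a ship was questioned. "
          "She has studied Greek, History and Philosophy. "
          "They spent a week on a Greek island holiday.")

for line in kwic(sample, "Greek", width=20):
    print(line)
```

The search here is case-sensitive, so greek and Greek would be treated as different keywords; a real concordancer would normally let the user choose.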
KWIC concordances can be sorted in a variety of ways (for instance to the
left or right of the keyword) and can be expanded online to reveal more of the
context. Some programmes also allow the user access to sentence- and
paragraph-length concordances. Others, such as Microconcord, offer a collocational
profile of the keyword by listing the most frequent collocates within a given
span, for instance three words to the left or right of the keyword.
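Sorting and collocational profiling can be sketched in a few lines. The following is an illustration under simple assumptions (word-forms are taken to be runs of letters and apostrophes, and the helper names `sort_hits` and `collocates` are hypothetical); it does not reproduce the behaviour of any particular package.

```python
from collections import Counter
import re

def sort_hits(hits, side="right"):
    """Sort (left, keyword, right) tuples by the words nearest the keyword."""
    def key(hit):
        left, _, right = hit
        # Right sort: first word after the keyword; left sort: last word before it.
        words = right.split() if side == "right" else left.split()[::-1]
        return [w.lower() for w in words]
    return sorted(hits, key=key)

def collocates(text, keyword, span=3):
    """Count the word-forms found within `span` words either side of the keyword."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == keyword.lower():
            counts.update(tokens[max(0, i - span):i])      # words to the left
            counts.update(tokens[i + 1:i + 1 + span])      # words to the right
    return counts

# Invented sample data for illustration only.
hits = [("on a", "Greek", "tragedy line"),
        ("of", "Greek", "drama we"),
        ("a", "Greek", "island holiday")]
text = "a Greek island holiday and a Greek island ferry near a Greek tragedy"

print(sort_hits(hits, side="right"))
print(collocates(text, "Greek", span=1).most_common())
```

A span of three words to the left and right, as mentioned above, corresponds to `span=3`.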
Apart from KWIC concordances, most software packages offer facilities
for listing all the word-forms in a corpus, or in a specific text or group of texts
in a corpus, in frequency or alphabetical order. Here is an extract from a
frequency list for a short text quoted in Sinclair (1991a: 141-142):
11  is               2  although     2  to
10  of               2  are          2  very
 8  and              2  but             active
 8  the              2  have            an
 6  activity         2  if              animals
 5  a                2  kinds           anything
 5  communication    2  language        armchair
 5  in               2  library         aspects
 4  it               2  like            attempts
 3  his              2  look            authors
 3  human            2  many            become
 3  only             2  there           boast
 3  we               2  through         brain
                                        can
228 MONA BAKER
Word frequency profiles can also be obtained for the whole corpus or any part
of it. These give statistical information on the total number of word-forms in a
corpus, and the number of words which occur x number of times, expressed
both in raw form and as a percentage of the total number of words. For
example, a frequency profile of the above list might tell us, among other
things, that there are four word-forms which occur three times each in the text
and that, given a total number of 113 words, this represents 10.6 per cent of the
vocabulary used in the text (Sinclair 1991a: 32). (For a more detailed overview
of text processing operations, see Sinclair 1991a.)
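The arithmetic behind such a profile is straightforward: four word-forms occurring three times each account for 12 of the 113 running words, i.e. 12/113, or roughly 10.6 per cent. A minimal sketch of a frequency list and a frequency profile follows; the function names and the tokenisation rule (runs of letters and apostrophes, case ignored) are assumptions for illustration.

```python
from collections import Counter
import re

def frequency_list(text):
    """Count every word-form in the text, ignoring case."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def frequency_profile(freqs):
    """For each frequency x, give the number of word-forms occurring x times
    and the share of all running words they account for, as a percentage."""
    total = sum(freqs.values())
    forms_per_frequency = Counter(freqs.values())
    return {x: (n, round(100 * x * n / total, 1))
            for x, n in sorted(forms_per_frequency.items(), reverse=True)}

# Frequency order, then alphabetical order, for a tiny invented text.
freqs = frequency_list("The cat and the dog and the bird")
print(freqs.most_common())
print(sorted(freqs))
print(frequency_profile(freqs))
```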
All these facilities have immediate and obvious applications in the study
of translation. Let me give three straightforward examples before we move on
to look at corpora specifically designed for research in translation studies.
(i) Shamaa (1978: 168-171) did a manual count on a small corpus of English
translations of Arabic novels and concluded that common words such as
day and say occur with a much higher frequency in the English transla-
tions than they do in original English texts. She suggested that this type of
unusual distribution of vocabulary in translation as opposed to original
writing has a subliminal effect on the way we respond to translated text
and contributes to its identification as a translation. By using the re-
sources and techniques described above, this type of research can be done
on a much larger scale and can yield much more powerful insights into the
nature of translated texts and — at least by implication — the nature of
the processes whereby such texts come into being.
(ii) Corpus techniques may have pedagogical applications too. Consider for
instance the notion of 'structural equivalence' as posited by Vermeer
(1987: 30-31). Vermeer suggests that statistical information is needed to
achieve equivalence at the level of surface structure, for example allitera-
tion, rhythm, statistical distribution of vocabulary, and so on. The exam-
ple he uses is as follows. Let us assume that
(a) the average number of words per sentence in modern German is 12,
(b) Goethe averages 24 words per sentence, and
(c) literary Latin has 24 words per average sentence.
We might then want to suggest that Goethe should be translated into
literary Latin with an average of 48 words per sentence, in order not
only to reflect the deviance from the norm as such but also to
reconstruct the extent of such deviance. Vermeer admits that such statistical
information can only serve as an approximate indicator of what needs
to be done, and also that other factors will be involved in the decision-
making process. But there is no denying that this type of information
should be made available to the translator, for whatever use s/he might
decide to make of it. Needless to say, the types of operation described in
this section provide ready access to statistical information about almost
any kind of textual feature, including the average number of words per
sentence.
(iii) In Baker (1992: 156-57), I discussed the use of typographic and punctua-
tion devices to signal marked information structure (or stress) in written
language. This feature is particularly important for the study of dialogue
in literary translation. The example I gave was from Agatha Christie's
Crooked House (1949), where italics are used to signal stress in state-
ments such as "What did it matter to them? They'd all got loads of money.
He gave it to them" and "I was very fond of him. I was fond of him". I also
suggested that phonological stress is not used to highlight a clause ele-
ment in this way in languages such as French and Chinese. It is possible to
concordance typographic marks such as italics in a corpus of original
English texts and, using an alignment programme of the type described in
3.1 below, study the strategies used to signal stress in a corpus of, say,
French or Chinese translations of these texts.
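The arithmetic in Vermeer's example under (ii) can be made explicit. Goethe's average (24) is twice the modern German norm (12), and applying the same ratio to the literary Latin norm (24) gives 48. The sketch below assumes that the deviance is scaled as a simple ratio, which is one reading of the example; the function names and the crude sentence splitter are illustrative assumptions only.

```python
import re

def mean_words_per_sentence(text):
    """Average sentence length in words, splitting crudely on . ! and ?"""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    return sum(len(s.split()) for s in sentences) / len(sentences)

def target_sentence_length(source_norm, author_mean, target_norm):
    """Scale the target-language norm by the author's deviation from the source norm."""
    deviation_ratio = author_mean / source_norm
    return target_norm * deviation_ratio

# Figures from the example: German norm 12, Goethe 24, literary Latin norm 24.
print(target_sentence_length(source_norm=12, author_mean=24, target_norm=24))
print(mean_words_per_sentence("One two three. Four five. Six!"))
```

In practice `mean_words_per_sentence` would be run over a corpus rather than a single string, and, as noted above, the result is an approximate indicator rather than a rule.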
I would now like to look more closely at types of corpora which are either
being used or need to be set up specifically for translation research. These are
usually lumped together under the all-purpose title of 'parallel corpora'. It is
also generally assumed that translation corpora are not monolingual, i.e. that
any research into translation must involve contrasting corpora in two or more
languages. This, however, is not necessarily the case (see 3.3 below).
The terminology for discussing types of corpora in translation studies is
far from established, essentially because, with a few exceptions such as Laffling
(1991, 1992), serious corpus work has not yet started within the discipline. I
would nevertheless like to propose three main types in anticipation of the surge
of activity which I believe we are about to witness in this area.
(i) Parallel corpora
(ii) Multilingual corpora
(iii) Comparable corpora
texts with their translations, parallel corpora will quickly become indispensable
in materials writing, computer-aided translator training and improving the
performance of machine translation systems. Their most important contribu-
tion to the discipline in general is that they support a shift of emphasis, from
prescription to description. They allow us to establish, objectively, how trans-
lators overcome difficulties of translation in practice, and to use this evidence
to provide realistic models for trainee translators. They also have an important
role to play in exploring norms of translating in specific socio-cultural and
historical contexts (see Toury 1978 and Baker 1993a for an explanation and
examples of norms).
The best known corpus of this type is the Hansard Corpus, which consists
of the proceedings of the Canadian Parliament in English and French. Gale and
Church (1991) and Church and Gale (1991) describe two types of concordance
tools, one sentence-based and one word-based, which they have developed for
aligning parallel texts, using the Hansard Corpus as a test-bed. The sentence
alignment method relies on the assumption that, in most cases, "the length of a
text (in characters) is highly correlated (0.991) with the length of its transla-
tion" (Gale and Church 1991: 1). Once each source sentence has been aligned
with its translation, the next step involves identifying as many word corre-
spondences as possible and using these to align the two texts. The output looks
something like this (Church and Gale 1991: 53):
The drug was simply banned.
Ce dernier a ete simplement interdit.
A particularly successful alignment looks like this (Gale and Church 1991: 4):
. . . finance (mr. wilson) and the governor of the bank of canada
. . . finances (m. wilson) et le gouverneur de la banque du canada
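The intuition behind length-based sentence alignment can be illustrated with a small dynamic-programming sketch. This is not Gale and Church's method: their approach rests on a probabilistic model of length, whereas the cost used here is simply the absolute difference in character length, with an assumed fixed penalty for leaving a sentence unmatched. The function name, the penalty value and the sample strings are all hypothetical.

```python
def align_by_length(src, tgt, skip_penalty=20):
    """Align two lists of sentences using character length alone.

    Allowed groupings are 1-1, 1-0, 0-1, 2-1 and 1-2 sentences. The cost of a
    grouping is the absolute difference in character length, plus a fixed
    penalty when a sentence is left without a counterpart.
    """
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    groupings = [(1, 1), (1, 0), (0, 1), (2, 1), (1, 2)]
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for di, dj in groupings:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                src_len = sum(len(s) for s in src[i:ni])
                tgt_len = sum(len(t) for t in tgt[j:nj])
                step = abs(src_len - tgt_len)
                if di == 0 or dj == 0:
                    step += skip_penalty
                if cost[i][j] + step < cost[ni][nj]:
                    cost[ni][nj] = cost[i][j] + step
                    back[ni][nj] = (i, j)
    # Walk back from the end to recover the cheapest sequence of groupings.
    pairs = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        pairs.append((src[pi:i], tgt[pj:j]))
        i, j = pi, pj
    return pairs[::-1]

# Synthetic "sentences": two short source sentences correspond to one long target.
source = ["a" * 10, "b" * 10, "c" * 20]
target = ["x" * 21, "y" * 19]
for src_group, tgt_group in align_by_length(source, target):
    print(len(src_group), "source ->", len(tgt_group), "target")
```

Word-level correspondences, the second step described above, would require additional evidence such as co-occurrence statistics and are not attempted here.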
Marinai et al. (1991) also describe a system for aligning parallel text
which they have developed using a parallel corpus of Italian and English. This
corpus includes a variety of texts (short stories, computer-science texts,
American scientific textbooks, in-flight magazines, etc.) with translations
ranging from the very literal to the very free. The system allows the user to
retrieve parallel contexts for any word or expression in the corpus. Johansson
and Hofland (1993) similarly outline a recent project which involves setting up
corpora of (a) original English texts and their Norwegian translations and (b)
original Norwegian texts and their English translations, as well as an alignment
method which relies on a combination of sentence length and a set of "anchor
words" (primitive bilingual lexicon).
I will use the term 'multilingual corpora' to refer to sets of two or more mono-
lingual corpora in different languages, built up either in the same or different
institutions on the basis of similar design criteria. Multilingual corpora essen-
tially enable us to study items and linguistic features in their home environ-
ment, rather than as they are used in translated text. They are useful inasmuch
as they can provide access to the natural patterns of the target language in
particular, and they therefore have an important role to play in materials
writing, translator training and improving the performance of machine transla-
tion systems. They do, however, have their limitations, and I will come back to
this shortly.
The best example of research based on a multilingual corpus is the
Council of Europe Multilingual Lexicography Project (Sinclair 1991b; Baker
1993b). This research drew on corpora in seven European languages: English,
German, Swedish, Italian, Spanish, Hungarian and Serbo-Croatian. Prima
facie equivalents of common words such as day, say, world, little and know
were investigated in the various corpora to identify regularities in the textual
environment of each member of an equivalence pair. The idea was that the
computer could be trained to detect equivalences on the basis of identifying
such structural environments. Here is an example of the kind of algorithm
which can be devised, in this case for translating Swedish låna into English, on
the basis of evidence drawn from multilingual corpora (see note 8):
For lack of a better term, I will use 'comparable corpora' to refer to something
which is a cross between parallel and multilingual corpora. To my knowledge,
comparable corpora do not yet exist anywhere. In Baker (1993a), I advocated
setting up corpora of this type and suggested a number of research investigations
which could be pursued if they were made available to translation scholars.
Comparable corpora consist of two separate collections of texts in the
same language: one corpus consists of original texts in the language in ques-
tion and the other consists of translations in that language from a given source
language or languages. The corpus of original texts is therefore an ordinary
monolingual corpus of the type linguists have been using for several decades.
Any existing monolingual corpus can be used, provided it is similar in design
to the translation corpus. Both corpora should cover a similar domain, variety
of language and time span, and be of comparable length. The translation
corpus should be representative in terms of the range of original authors and of
translators.
The most important contribution that comparable corpora can make to the
discipline is to identify patterning which is specific to translated texts, irre-
spective of the source or target languages involved. I will return to this shortly
and will give examples of the kind of patterning that can be revealed. But,
generally speaking, what we would be comparing here is not, for instance,
French originals with their English translations, nor original French texts with
original English texts, but rather substantial amounts of original English text
with substantial amounts of translated English text (whatever the source lan-
guage). Similar studies done in other languages would either support or refute
hypotheses about the process of translation which were made on the basis of
evidence drawn from a comparable corpus of English.
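A comparison of this kind only makes sense if frequencies are normalised for corpus size. The sketch below computes occurrences per thousand running words in an original and a translated collection; the words day and say are taken from the Shamaa example in the introduction, while the function names and the two sample strings are invented for illustration and are not real corpus data.

```python
from collections import Counter
import re

def per_thousand(text, words):
    """Occurrences of each word per 1,000 running words of text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1   # avoid division by zero on empty input
    return {w: 1000 * counts[w] / total for w in words}

def compare(original, translated, words):
    """Pair up the normalised frequencies: word -> (original, translated)."""
    orig = per_thousand(original, words)
    trans = per_thousand(translated, words)
    return {w: (orig[w], trans[w]) for w in words}

# Invented stand-ins for a corpus of original English and one of translated English.
original_english = "one day he left"
translated_english = "day after day they say"
print(compare(original_english, translated_english, ["day", "say"]))
```

With corpora of realistic size a per-million rate and a significance test would be more appropriate than the raw rates shown here.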
The prospect of setting up independent corpora of translated texts, in
various languages and in different institutions, is daunting. There is, for
instance, the problem of establishing who holds the copyright for each transla-
tion and of obtaining permission to hold the material in electronic form and to
publish results of analyses performed on it. In my experience, it is particularly
difficult to get permission to use translated material because there is always a
lot of sensitivity and suspicion wherever translation is involved. People tend to
assume that you want to get hold of their translations in order to criticise them.
This is a natural reaction to the kind of discourse that has so far dominated the
field, and I suppose we have only ourselves to blame for it. But we need do no
more than reflect on a statement by John Sinclair, one of the leading scholars in
corpus linguistics, to realise that corpus-based translation studies is not only
feasible but, if the experience of corpus linguists is anything to go by, it will
sooner or later become the norm:
Thirty years ago when this research started it was considered impossible to
process texts of several million words in length. Twenty years ago it was
considered marginally possible but lunatic. Ten years ago it was considered
quite possible but still lunatic. Today it is very popular. (Sinclair 1991a: 1)
One problem with comparable corpora is that, unlike parallel and multilingual
corpora, they do not have direct applications in the classroom and it is not
immediately obvious how they might contribute to improving the performance
of machine translation systems. The idea of setting up comparable corpora is
therefore unlikely to attract funding of the type offered by the European
Commission for instance. The effort and expense involved in setting up a
comparable corpus in any language can nevertheless be justified if there is a
strong argument that this type of 'pure' research is essential for the survival
and growth of the discipline. And the difficulty of trying to develop this
argument in detail is that, as with any kind of pure research, one can initially do
no more than pose questions. The answers and revelations do not come until
the research is in full swing. Be that as it may, I would now like to give a few
examples of the kind of research that can be carried out on comparable
corpora.
4. Comparable Corpora:
What Can They Tell Us About Translation?
that occurs in this structure less frequently in the original corpus than in the
corpus of translations. This is a local pattern, specific to English, but what we
discover about it may tell us something about the nature of translated text in
general and the nature of the process of translation itself. It may be used, for
instance, to support the explicitation hypothesis, i.e. that translated text exhib-
its a higher level of explicitness, even at the syntactic level, than specific
source texts and original texts in general (Baker 1993a: 243).
Comparable corpora can also help us arrive at more global statements
about the nature of translated text. I will suggest two possible lines of research
to illustrate the kind of global patterning I have in mind.
20) or with an easier text on a similar subject (Larson 1984: 439). Fortunately,
the texts used in both cases are short enough to be analysed manually and I
have therefore been able to calculate their respective lexical densities.
Both examples exhibit a difference in lexical density: the texts perceived
as 'difficult' have a higher lexical density than the ones presented as being
'easier' (43% vs. 33% respectively in the case of Headland, and 52% vs. 44%
for the Larson texts). This is true irrespective of the type of vocabulary used,
the length of each text or the level of detail involved. It is therefore possible to
argue, at least tentatively, that lexical density contributes to information load.
Given the facilities available for the automatic processing of corpora, and
particularly the algorithm devised by Stubbs (1986), this feature can now be
studied on a much wider scale and may well reveal important facts about the
nature and extent of mediation in translated text. If, for instance, we discover
that the overall lexical density of a corpus of English translations is signifi-
cantly lower than that of a comparable corpus of original English, we might
want to argue that translators use this feature, consciously or subconsciously,
to control information load and to make a translated text more accessible to its
new readership.
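A rough sketch of how lexical density might be computed automatically. It assumes that lexical density is taken as the percentage of running words that are lexical rather than grammatical, and it uses a deliberately short, hypothetical list of grammatical words; it is not the algorithm devised by Stubbs (1986), and the figures it yields are not comparable with those quoted above.

```python
import re

# Hypothetical and far from complete list of grammatical (function) words.
FUNCTION_WORDS = {
    "a", "an", "the", "and", "or", "but", "if", "of", "in", "on", "at", "to",
    "for", "with", "by", "from", "is", "are", "was", "were", "be", "been",
    "it", "he", "she", "we", "they", "this", "that", "not", "have", "has",
}

def lexical_density(text, function_words=FUNCTION_WORDS):
    """Percentage of running words that are not in the function-word list."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    lexical = [t for t in tokens if t not in function_words]
    return 100 * len(lexical) / len(tokens)

print(lexical_density("The cat sat on the mat"))
```

The result depends heavily on where the line between lexical and grammatical words is drawn, so the same list must be applied to both corpora being compared.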
5. Conclusion
I have tried to give an overview of the types of corpora that can be used in
translation studies and examples of the kind of corpus-based research that can
be carried out in the discipline. We are still a long way from achieving a
coherent methodology in this area. Corpus-based research offers enormous
potential for translation scholars but the process of setting up the required
corpora and of devising the relevant software is fraught with difficulties.
Hopefully, the examples I have given of the kind of questions that can be
answered on the basis of corpora, and the scale on which they can be answered,
will go some way towards justifying the effort required to establish corpus-
based research as a serious option in translation studies.
Author's address:
Mona Baker, 2 Maple Road West, Brooklands, MANCHESTER M23 9HH,
United Kingdom
Notes
"First, the corpus builder attempts to create a representative corpus. Then this corpus is
used and analysed and its strengths and weaknesses identified and reported. In the light of
this experience and feedback the corpus is enhanced by the addition or deletion of material
and the cycle is repeated continually" (Atkins et al. 1991). Also, in practice, mundane
considerations such as the availability of particular texts and the ease of obtaining copy-
right permission play an important role in the selection of samples. Similarly, the reasons
for the fact that corpora of spoken language, where they are held at all, are much smaller in
size than those of written language, are entirely practical rather than theoretical. They have
to do with the cost of transcription and keyboarding and the fact that surreptitious recording
of face-to-face and telephone conversations is illegal in most countries.
5. I personally believe that corpora which consist of whole texts are, on the whole, far more
useful than those which consist of text fragments, but the discussion of this particular issue
lies outside the scope of this paper. Suffice it to say here that a corpus which consists of text
fragments has obvious limitations in terms of studying larger text patterns, such as patterns
of cohesion across chapters. Any fragmentation of a novel would similarly rule out a study
of character development. And a corpus which consists of a set of sentences will not even
allow a study of more modest patterns, such as paragraphing and inter-sentential cohesion.
And finally, a corpus of complete texts offers a way out of having to address the issue of the
selection and representativeness of text fragments, though the question of the representa-
tiveness of the corpus as a whole remains unresolved.
6. For the importance of this particular dimension, i.e. direction of translation, see Malmkjaer
(1993). Malmkjaer argues, quite convincingly but without the benefit of substantial corpus
research, that experienced translators "might produce higher degrees of equivalence if they
are translating out of their mother tongues than if they are translating into their mother
tongues" (p. 213). She coins the term 'SL mother tongue but TT habitual use translators' to
refer to this group.
7. Another term used in the literature is 'bilingual corpora' (Leech 1991). Hartmann (1980:
37-38) uses the term 'parallel texts' to refer to three distinct types of text collections, one of
which is what I have referred to here as 'multilingual corpora' and in Baker (1993a: 248) as
'comparable corpora'. I apologise for the change in terminology.
8. The English corpus on which this study was based is the original Cobuild corpus of modern
English (c. 20 million words); the Swedish corpus consisted of c. 20 million words of
modern Swedish held at the University of Göteborg, Department of Swedish (contact
Martin Gellerstam). A detailed breakdown of the various structural patterns identified in
the two corpora for words such as låna, lend and borrow can be found in Sinclair 1991b.
One example of this detailed analysis is that the pattern 'A lends B x' (as in 'She lent me a
sweater') occurs 26 times in the English corpus, whereas 'A lånar B x' does not occur at all
in the Swedish corpus, in spite of the fact that it is generally accepted as a perfectly normal
type of construction in Swedish.
9. The rejection of the assumption that you can say anything naturally in any language also
has implications for the development of translation studies as an independent discipline. If
the assumption held, we would perhaps be justified in seeing translation theory as "a
branch of Comparative Linguistics" (Catford 1965: 20) or "a branch of applied linguistics"
(Lindquist 1984: 261). Rejecting this assumption is necessary to the claim that the activity
of translation is qualitatively different from that of text production, and hence that we need
an independent theory to account for it (inasmuch as any theory can be independent from
other theories of course).
10. Studies of this type should be based on very large corpora if distortion of data is to be
avoided.
11. Lexical simplification may be defined as "the process and/or result of making do with less
words" (Blum-Kulka and Levenston 1983: 119).
References
Atkins, Sue, Jeremy Clear and Nicholas Ostler. 1991. "Corpus Design Criteria". Paper
Presented at the Workshop on European Textual Corpora, Pisa, 7-10 January 1991.
Baker, Mona. 1992. In Other Words: A Coursebook on Translation. London and New York:
Routledge.
Baker, Mona. 1993a. "Corpus Linguistics and Translation Studies: Implications and Appli-
cations". Baker et al. 1993: 233-250.
Baker, Mona. 1993b. Multilingual Databases. Birmingham: University of Birmingham.
[Report submitted to the European Commission as a contribution to a European enquiry
into corpus work.]
Baker, Mona, Gill Francis and Elena Tognini-Bonelli, eds. 1993. Text and Technology: In
Honour of John Sinclair. Amsterdam/Philadelphia: John Benjamins.
Bernardo, Aldo S. 1981. "Maximizing Computer Assistance in Literary Translation:
Petrarch's Familiares". Marilyn Gaddis Rose, ed. Translation Spectrum: Essays in
Theory and Practice. State University of New York Press, 1981. 74-80.
Blum-Kulka, Shoshana and Eddie A. Levenston. 1983. "Universals of Lexical Simplification".
Claus Faerch and Gabriele Kasper, eds. Strategies in IL Communication.
Longman, 1983. 119-139.
British National Corpus: Written Corpus Design Specification. 1991. OUP Promotional
Document Dated 2 September.
Catford, J.C. 1965. A Linguistic Theory of Translation: An Essay in Applied Linguistics.
Oxford University Press.
Church, Kenneth and William Gale. 1991. "Concordances for Parallel Text". Paper Pre-
sented at the Seventh Annual Conference of the UW Centre for the New OED and Text
Research. St. Catherine's College, Oxford.
Gale, William and Kenneth Church. 1991. "Identifying Word Correspondences in Parallel
Texts". DARPA SLS Workshop.
Hartmann, R.R.K. 1980. Contrastive Textology: Comparative Discourse Analysis in Ap-
plied Linguistics. Heidelberg: Julius Groos.
Headland, Thomas. 1981. "Information Rate, Information Overload, and Communication
Problems in the Casiguran Dumagat New Testament". Notes on Translation 83. 18-27.
Hofland, K. and S. Johansson. 1982. Word Frequencies in British and American English.
Bergen: The Norwegian Computing Centre for the Humanities.
Johansson, Stig and Knut Hofland. 1993. "Towards an English-Norwegian Parallel Cor-
pus". Udo Fries, Gunnel Tottie and Peter Schneider, eds. Creating and Using English
Language Corpora: Papers from the Fourteenth International Conference on English
Language Research on Computerized Corpora. Zurich, 1993. 25-37.
Appendix