Corpora in Translation Studies An Overview and Some Suggestions For Future Research - 1995

Corpora in Translation Studies:
An Overview and Some Suggestions for Future Research
Mona Baker
UMIST & Middlesex University
Abstract: Corpus-based research has become widely accepted as a factor in
improving the performance of machine translation systems, and corpus-based
terminology compilation is now the norm rather than the exception. Within
translation studies proper, Lindquist (1984) has advocated the use of corpora for
training translators, and Baker (1993a) has argued that theoretical research into
the nature of translation will receive a powerful impetus from corpus-based
studies. It is becoming increasingly important to take stock of what is happening
on this front and to start working towards the development of an explicit and
coherent methodology for corpus-based research in the discipline. This paper
discusses the current and potential use of corpora in translation studies, with
particular reference to theoretical issues.
Résumé: On s'accorde à voir dans la recherche sur corpus un facteur susceptible
d'améliorer les systèmes de traduction automatique; la terminologie basée sur
corpus devient la règle plutôt que l'exception. A propos des recherches sur la
traduction, Lindquist (1984) a prôné le recours aux corpora dans la formation des
traducteurs; selon Baker (1993a), l'étude théorique de la traduction bénéficiera
des recherches fondées sur corpus. Il importe désormais de répertorier les acquis
en ce domaine, afin de mettre au point une méthodologie explicite et cohérente.
L'article qui suit analyse l'usage présent et possible des corpora dans les
recherches sur la traduction, en prêtant une attention particulière aux questions
théoriques.
Target 7:2 (1995), 223-243. DOI 10.1075/target.7.2.03bak

ISSN 0924-1884 / E-ISSN 1569-9986 © John Benjamins Publishing Company
1. Introduction
The potential for using corpora is beginning to take shape in translation
studies. Computerised corpora are becoming increasingly popular in those
areas of the discipline which have close links with the hard sciences. This is
particularly true of terminology and machine translation, where the emphasis
is primarily, if not exclusively, on scientific and technical texts.
Terminology compilation is now firmly corpus-based. The desire to
construct abstract and neat conceptual systems has given way here to the
practical need of addressing what happens in real life. Terms are therefore no
longer extracted from previous lists but are rather drawn from a representative
corpus of authentic texts held in electronic form (Sager 1990: 130).1 A similar
development has taken place in machine translation where it is now widely
accepted that access to computerised corpora may well hold the key to future
success in the field. Again, this reflects a move away from conceptual and
formal representations of language, which have not proved very helpful in the
past, to addressing natural language in all its variety and infiniteness. The
repeated failure of pre-formulated rules and neat semantic analyses to improve
the performance of machine translation systems has led to the gradual
realisation that the knowledge required to improve these systems must come from
natural language in use (Schubert 1992: 87; Laffling 1991, 1992). Corpora are
not only used by linguists to write better rules for the machines to operate on
but also as a direct knowledge source for the machines themselves (ibid).
Modern machine translation systems use the principle of analogy to
extrapolate from the typical examples held in the corpus to texts that have not been
encountered before.
The development of corpus-based techniques in terminology and
machine translation is encouraging. It goes some way towards fulfilling the
growing need for a rigorous descriptive methodology in an attempt to increase
the inter-subjectivity of the applied areas of translation studies, such as
translator training and translation criticism, and of course in the pursuit of a more
satisfying theoretical account of the phenomenon of translation itself. It is the
potential use of corpora in these theoretical and pedagogical areas that I would
now like to address. But before I do so, it is perhaps important to look in some
detail at the way in which the words corpus and corpora have been used in the
literature in order to avoid possible misunderstanding in the discussion which
follows. It might also be useful to give a brief overview of the kind of
information that can be obtained from corpora.
2. Corpora: Definition, Types and Overview of Basic Operations
2.1. What Is a Corpus?
The word corpus originally meant any collection of writings, in a processed or
unprocessed form, usually by a specific author.2 In recent years, and with the
growth of corpus linguistics,3 this definition has changed in three important
ways: (i) corpus now means primarily a collection of texts held in machine-readable
form and capable of being analysed automatically or semi-automatically
in a variety of ways; (ii) a corpus is no longer restricted to 'writings' but
includes spoken as well as written text, and (iii) a corpus may include a large
number of texts from a variety of sources, by many writers and speakers and on
a multitude of topics. What is important is that it is put together for a particular
purpose and according to explicit design criteria in order to ensure that it is
representative of the given area or sample of language it aims to account for.
Some of these criteria are discussed under 2.3 below.4
One important feature that remains variable in modern corpora is the
nature and extent of the texts held. In linguistics, corpora usually consist of
running texts, but these texts are not always held in full. For example, the
Brown and LOB Corpora consist of fragments of texts, each fragment being
approximately 2000 words in length (Hofland and Johansson 1982), selected
on a more or less random basis within specified genres (Sinclair 1991a: 23).
The British National Corpus consists of text samples, generally no longer than
40,000 words each. These samples are taken randomly from the beginning,
middle or end of longer texts, but care is taken to choose a convenient
breakpoint, such as the end of a section or chapter, to begin and end the sample
in order not to fragment high-level discourse units (British National Corpus
1991). Other corpora, for example the Cobuild/Bank of English corpus, consist
of whole texts, irrespective of the size of any individual text held in the
collection.
In machine translation, by contrast, a corpus does not necessarily consist
of running texts; it may be no more than a set of examples (Schubert 1992: 87).
One of the definitions of corpus in this field is therefore "the finite collection
of grammatical sentences that is used as a basis for the descriptive analysis of a
language" (definition given in the 'Glossary of Terms' in Newton 1992: 223).
It is also important to bear in mind that the word corpus has often been used in
translation studies proper to refer to fairly small collections of text which are
not held in electronic form and which are therefore searched manually (Baker
1993a: 241).
In what follows, I intend to use corpus to mean any collection of running
texts (as opposed to examples/sentences), held in electronic form and analysable
automatically or semi-automatically (rather than manually).5
2.2. Basic Text Processing Operations
A great deal of experience in corpus work has been acquired in the past few
decades and a stock of very powerful routines for processing text held in
machine-readable form has now been developed. Some of these routines have
not only become standard operations which any corpus holder will have access
to, but they are also now included in software packages which are readily
available to the public at very modest prices. The most popular and versatile of
these packages is Microconcord, marketed by Oxford University Press. OUP
have so far also released two corpus collections of one million words each and
are planning to release more corpora as part of the British National Corpus
initiative. Working with corpora is therefore becoming a perfectly viable
proposition even at the level of individual researchers.
The corpus analyst's stock-in-trade is the KWIC concordance, KWIC
being an acronym for Key Word In Context. This is a list of all the occurrences
of a specified keyword or expression in the corpus, set in the middle of one line
of context each. The following is a KWIC concordance of Greek from the OUP
Corpus Collection A (British newspaper texts from The Independent and The
Independent on Sunday):
     THOMPSON in Athens </bl> <st> A GREEK air force warrant officer, Michalis P  MCA_IND4.FOR
ere four sendings-off, three of them Greek, as Bulgaria beat the visitors 4-0 in  MCA_IND3.SPO
     There are two more sales today. Greek bidders descended on the sale yesterd  MCA_IND4.HOM
    ed over oil spill </hl> <st> The Greek captain of a ship responsible for Sun  MCA_IND1.HOM
ere added to other Christian groups (Greek Catholics, Greek Orthodox, Armenians   MCA_IND1.FOR
    er </hl> <st> ATHENS (UPI) _ The Greek Chief Justice, Yannis Grivas, was swo  MCA_IND3.FOR
s, it fails to persuade us that with Greek drama we are not secular theatre-goer  MCA_IND2.ART
1012 </dt> <hl> Storms threaten mild Greek election climate </hl> <bl> From PETE  MCA_IND2.FOR
1011 </dt> <hl> THEATRE / A spark of Greek fire: The Trojan Women - Liverpool Ev  MCA_IND2.ART
well as agriculture, she has studied Greek, History, Philosophy, Astonomy, Mathe  MCA_IND2.ART
eration by sunshine and retsina on a Greek island holiday. (They don't have wint  MCA_IND2.ART
r _ alleged to be a rare work of the Greek master Skopas. Archaeologists and art  MCA_IND4.FOR
t this place?" asks the disapproving Greek matriarch when her grandson returns f  MCA_IND1.ART
of the Titans (1981), his foray into Greek mythology. There must be easier w      MCA_IND3.ART
ter Lebanon in 1926, they proposed a Greek Orthodox as president, because he wou  MCA_IND1.FOR
r Christian groups (Greek Catholics, Greek Orthodox, Armenians and others), the   MCA_IND1.FOR
sarouchis, the doyen of contemporary Greek painters, who died early this year, a  MCA_IND4.HOM
depicted wearing bull's horns in the Greek playwright Euripides' Bacchae. Th      MCA_IND4.HOM
er this constructive interlude, that Greek politics will avoid reverting to thei  MCA_IND2.FOR
sion _ it is the first time a former Greek prime minister has been indicted by p  MCA_IND2.FOR
pline by backing bouzouki players in Greek restaurants, or a plethora of cabaret  MCA_IND2.ART
ted by a Wren facade, another by the Greek revival. At best these buildings       MCA_IND1.ART
04.) Where can you go to church in a Greek sarcophagus? (St Jude's Church, Blyth  MCA_IND1.ART
three sciences and perhaps Latin and Greek. <sect> Home News Page 3 </sect> </st  MCA_IND4.HOM
residing over the Everyman's current Greek sequence caused the jolt which sent a  MCA_IND4.ART
e sculpture _ the helmeted head of a Greek soldier _ alleged to be a rare work o  MCA_IND4.FOR
election campaign, though low-key by Greek standards, is proving rancorous. The   MCA_IND4.FOR
venged unnaturally. This may be more Greek than Asian, but it's also compelling   MCA_IND2.ART
    r of riddles. Why did Alexander 'Greek" Thompson design Eygptian halls in Un  MCA_IND1.ART
 like Templeton's carpet factory or 'Greek" Thompson's Vincent Street church (is  MCA_IND1.ART
eeds to remember is that the 'almost Greek tragedy" line should be restricted to  MCA_IND3.ART
sequel. Aspiring to the condition of Greek tragedy, this version makes the dread  MCA_IND2.ART
The codes at the end of each line indicate the source of the concordance
(whether it is from the arts [ART], sports [SPO], home [HOM] or foreign news
[FOR] sections, for instance). The codes in angle brackets indicate typographic
conventions (e.g. bold and underline) and other information
(e.g. <hl> for headline and <ct> for caption).
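By way of illustration, a KWIC display of this kind can be produced with a few lines of Python. What follows is a minimal sketch and not the routine used by Microconcord; the function name `kwic`, the `width` parameter and the sample sentence are assumptions made for the example.

```python
import re

def kwic(text, keyword, width=30):
    """Return one line per hit, with the keyword starting at column `width`."""
    lines = []
    # \b stops the keyword matching inside longer words;
    # IGNORECASE means 'GREEK' is found as well as 'Greek'.
    pattern = r"\b" + re.escape(keyword) + r"\b"
    for m in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so that every keyword lines up.
        lines.append(f"{left:>{width}}{m.group(0)}{right}")
    return lines

# Invented sample text, for illustration only.
sample = ("The Greek captain of a ship was fined. Critics praised the revival of "
          "Greek tragedy, and a GREEK air force officer was arrested.")
for line in kwic(sample, "Greek", width=25):
    print(line)
```

Context is cut by character count rather than at word boundaries, which is why the lines of a concordance such as the one above begin and end in mid-word.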
KWIC concordances can be sorted in a variety of ways (for instance to the
left or right of the keyword) and can be expanded online to reveal more of the
context. Some programmes also allow the user access to sentence and
paragraph length concordances. Others, such as Microconcord, offer a collocational
profile of the keyword by listing the most frequent collocates within a given
span, for instance three words to the left or right of the keyword.
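A collocational profile of this sort can be approximated as follows. This is a rough sketch under simple assumptions: words are lower-cased, punctuation is discarded, and the names `collocates` and `span` are chosen for the example and do not come from any particular package.

```python
import re
from collections import Counter

def collocates(text, keyword, span=3):
    """Count the words occurring within `span` words to the left or right of `keyword`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == keyword.lower():
            # Up to `span` words on each side, excluding the keyword itself.
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(window)
    return counts

# Invented sample text, for illustration only.
sample = "Greek tragedy is old. The Greek tragedy was staged. A Greek island holiday."
for word, n in collocates(sample, "Greek").most_common(5):
    print(n, word)
```

A real profile would normally set aside very frequent grammatical words or weight the counts statistically; the sketch reports raw counts only.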
Apart from KWIC concordances, most software packages offer facilities
for listing all the word-forms in a corpus, or in a specific text or group of texts
in a corpus, in frequency or alphabetical order. Here is an extract from a
frequency list for a short text quoted in Sinclair (1991a: 141-142):
11  is              2  although      2  to
10  of              2  are           2  very
 8  and             2  but              active
 8  the             2  have             an
 6  activity        2  if               animals
 5  a               2  kinds            anything
 5  communication   2  language         armchair
 5  in              2  library          aspects
 4  it              2  like             attempts
 3  his             2  look             authors
 3  human           2  many             become
 3  only            2  there            boast
 3  we              2  through          brain
                                        can
Word frequency profiles can also be obtained for the whole corpus or any part
of it. These give statistical information on the total number of word-forms in a
corpus, and the number of words which occur x number of times, expressed
both in raw form and as a percentage of the total number of words. For
example, a frequency profile of the above list might tell us, among other
things, that there are four word-forms which occur three times each in the text
and that, given a total number of 113 words, this represents 10.6 per cent of the
vocabulary used in the text (Sinclair 1991a: 32). (For a more detailed overview
of text processing operations, see Sinclair 1991a.)
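The two operations just described, a frequency list and a frequency profile, can be sketched in Python as follows. The function names are illustrative assumptions, and the percentage is computed on one reading of the figure quoted above, namely as a share of the running words: four word-forms occurring three times each account for 12 of the 113 words, or about 10.6 per cent.

```python
import re
from collections import Counter

def frequency_list(text):
    """Count every word-form in the text (lower-cased, punctuation ignored)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def frequency_profile(freqs):
    """Map each frequency x to (number of word-forms occurring x times,
    percentage of the running words that these word-forms account for)."""
    total = sum(freqs.values())
    forms_per_freq = Counter(freqs.values())
    return {x: (n, 100 * n * x / total)
            for x, n in sorted(forms_per_freq.items(), reverse=True)}

# Invented sample text, for illustration only.
freqs = frequency_list("the cat and the dog and the bird")
for word, n in freqs.most_common():      # list in frequency order
    print(n, word)
print(frequency_profile(freqs))

# The arithmetic behind the figure cited from Sinclair (1991a: 32):
print(round(100 * 4 * 3 / 113, 1))       # 10.6
```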
All these facilities have immediate and obvious applications in the study
of translation. Let me give three straightforward examples before we move on
to look at corpora specifically designed for research in translation studies.
(i) Shamaa (1978: 168-171) did a manual count on a small corpus of English
translations of Arabic novels and concluded that common words such as
day and say occur with a much higher frequency in the English
translations than they do in original English texts. She suggested that this type of
unusual distribution of vocabulary in translation as opposed to original
writing has a subliminal effect on the way we respond to translated text
and contributes to its identification as a translation. By using the
resources and techniques described above, this type of research can be done
on a much larger scale and can yield much more powerful insights into the
nature of translated texts and — at least by implication — the nature of
the processes whereby such texts come into being.
(ii) Corpus techniques may have pedagogical applications too. Consider for
instance the notion of 'structural equivalence' as posited by Vermeer
(1987: 30-31). Vermeer suggests that statistical information is needed to
achieve equivalence at the level of surface structure, for example
alliteration, rhythm, statistical distribution of vocabulary, and so on. The
example he uses is as follows. Let us assume that
(a) the average number of words per sentence in modern German is 12,
(b) Goethe averages 24 words per sentence, and
(c) literary Latin has 24 words per average sentence.
We might then want to suggest that Goethe should be translated into
literary Latin with an average of 48 words per sentence, in order not only
to reflect the deviance from the norm as such but also to reconstruct the
extent of such deviance. Vermeer admits that such statistical information
can only serve as an approximate indicator of what needs to be done, and
also that other factors will be involved in the decision-making process.
But there is no denying that this type of information
should be made available to the translator, for whatever use s/he might
decide to make of it. Needless to say, the types of operation described in
this section provide ready access to statistical information about almost
any kind of textual feature, including the average number of words per
sentence.
(iii) In Baker (1992: 156-57), I discussed the use of typographic and
punctuation devices to signal marked information structure (or stress) in written
language. This feature is particularly important for the study of dialogue
in literary translation. The example I gave was from Agatha Christie's
Crooked House (1949), where italics are used to signal stress in
statements such as "What did it matter to them? They'd all got loads of money.
He gave it to them" and "I was very fond of him. I was fond of him". I also
suggested that phonological stress is not used to highlight a clause
element in this way in languages such as French and Chinese. It is possible to
concordance typographic marks such as italics in a corpus of original
English texts and, using an alignment programme of the type described in
3.1 below, study the strategies used to signal stress in a corpus of, say,
French or Chinese translations of these texts.
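The proportional reasoning in example (ii) can be made explicit in a short sketch. Goethe's average is twice the German norm (24 against 12), so the same ratio applied to the Latin norm of 24 gives 48. The function names below are illustrative assumptions, and the sentence splitter is deliberately crude: it treats every full stop, question mark or exclamation mark as a sentence boundary.

```python
import re

def mean_sentence_length(text):
    """Average number of words per sentence, splitting on '.', '!' and '?'."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return sum(len(s.split()) for s in sentences) / len(sentences)

def scaled_target_length(author_mean, source_norm, target_norm):
    """Carry the author's deviation from the source-language norm
    over to the target-language norm, as in Vermeer's example."""
    return target_norm * (author_mean / source_norm)

# Figures taken from the example: German norm 12, Goethe 24, literary Latin 24.
print(scaled_target_length(24, 12, 24))  # 48.0

# Invented sample text, for illustration only.
print(mean_sentence_length("One two three. Four five."))
```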
2.3. Types of Corpora
Corpora are generally designed on the basis of a number of selection criteria,
the most important of which are:
(i) general language vs. restricted domain
(ii) written vs. spoken language
(iii) synchronic vs. diachronic
(iv) typicality in terms of range of sources (writers/speakers) and genres
(e.g. newspaper editorials, radio interviews, fiction, journal articles,
court hearings)
(v) geographical limits, e.g. British vs. American English
(vi) monolingual vs. bilingual or multilingual
The classification of corpora along the above dimensions is valid but not
sufficient for the purposes of translation scholars. The details and implications
of the last criterion in particular, which has so far been crudely based on
nothing more than the range of languages involved, need to be developed. I
will deal with this in more detail in section 3 below.
It is also clear that the criteria in general have been developed by
linguists, and on the basis of monolingual corpora only. This is not a criticism but
merely a statement of fact. It is up to translation scholars now to refine these
criteria and adapt them to our needs. Thus the criterion for typicality, for
instance, would need to be refined to take on board, in addition to writers/
speakers, the range of translators represented in the corpus (both how many
and whether they are professional/amateur, staff/freelance, translating into or
out of their mother tongue,6 etc.). Similarly, considerations of genres would
have to take account of the fact that there may be more than one genre involved
for each pair of texts in a bilingual corpus. This is because a translation does
not always belong to the same genre as its original.
3. Corpora for Translation Research and Pedagogy
I would now like to look more closely at types of corpora which are either
being used or need to be set up specifically for translation research. These are
usually lumped together under the all-purpose title of 'parallel corpora'. It is
also generally assumed that translation corpora are not monolingual, i.e. that
any research into translation must involve contrasting corpora in two or more
languages. This, however, is not necessarily the case (see 3.3 below).
The terminology for discussing types of corpora in translation studies is
far from established, essentially because, with a few exceptions such as Laffling
(1991, 1992), serious corpus work has not yet started within the discipline. I
would nevertheless like to propose three main types in anticipation of the surge
of activity which I believe we are about to witness in this area.
(i) Parallel corpora
(ii) Multilingual corpora
(iii) Comparable corpora
3.1 Parallel Corpora7
A parallel corpus consists of original, source-language texts in language A and
their translated versions in language B. This is the type of corpus that one
immediately thinks of in the context of translation studies. Provided that robust
software routines are developed for automatically aligning stretches of source
texts with their translations, parallel corpora will quickly become indispensable
in materials writing, computer-aided translator training and improving the
performance of machine translation systems. Their most important contribution
to the discipline in general is that they support a shift of emphasis, from
prescription to description. They allow us to establish, objectively, how
translators overcome difficulties of translation in practice, and to use this evidence
to provide realistic models for trainee translators. They also have an important
role to play in exploring norms of translating in specific socio-cultural and
historical contexts (see Toury 1978 and Baker 1993a for an explanation and
examples of norms).
The best known corpus of this type is the Hansard Corpus, which consists
of the proceedings of the Canadian Parliament in English and French. Gale and
Church (1991) and Church and Gale (1991) describe two types of concordance
tools, one sentence-based and one word-based, which they have developed for
aligning parallel texts, using the Hansard Corpus as a test-bed. The sentence
alignment method relies on the assumption that, in most cases, "the length of a
text (in characters) is highly correlated (0.991) with the length of its
translation" (Gale and Church 1991: 1). Once each source sentence has been aligned
with its translation, the next step involves identifying as many word
correspondences as possible and using these to align the two texts. The output looks
something like this (Church and Gale 1991: 53):
The drug was simply banned.
Ce dernier a été simplement interdit.
A particularly successful alignment looks like this (Gale and Church 1991: 4):
. . . finance (mr. wilson) and the governor of the bank of canada
. . . finances (m. wilson) et le gouverneur de la banque du canada
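The intuition behind length-based sentence alignment can be illustrated with a small dynamic-programming sketch. It is emphatically not the method of Gale and Church, who use a probabilistic model of length ratios; here the absolute difference in character length stands in for their cost function, and the name `align_by_length`, the set of permitted pairings and the `skip_penalty` value are all assumptions made for the example.

```python
def align_by_length(src, tgt, skip_penalty=50):
    """Pair up two lists of sentences so that the paired stretches differ
    as little as possible in character length.

    Permitted pairings (source sentences, target sentences):
    1-1, 1-0, 0-1, 2-1 and 1-2.
    """
    inf = float("inf")
    n, m = len(src), len(tgt)
    # cost[i][j] is the cheapest way of aligning src[:i] with tgt[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    pairings = [(1, 1), (1, 0), (0, 1), (2, 1), (1, 2)]
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for di, dj in pairings:
                if i + di > n or j + dj > m:
                    continue
                len_s = sum(len(s) for s in src[i:i + di])
                len_t = sum(len(t) for t in tgt[j:j + dj])
                step = abs(len_s - len_t)
                if di == 0 or dj == 0:
                    step += skip_penalty  # discourage unmatched sentences
                if cost[i][j] + step < cost[i + di][j + dj]:
                    cost[i + di][j + dj] = cost[i][j] + step
                    back[i + di][j + dj] = (di, dj)
    # Walk back from the end to recover the chosen pairings.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        di, dj = back[i][j]
        pairs.append((src[i - di:i], tgt[j - dj:j]))
        i, j = i - di, j - dj
    return pairs[::-1]

# Invented sentences, for illustration only.
english = ["The drug was banned.", "It was dangerous.",
           "Nobody objected to the decision."]
french = ["Le médicament a été interdit parce qu'il était dangereux.",
          "Personne ne s'est opposé à la décision."]
for source, target in align_by_length(english, french):
    print(source, "<->", target)
```

Because the sketch relies on length alone, it will happily pair sentences that have nothing to do with each other if their lengths happen to match, which is one reason why later systems add lexical evidence such as the anchor words mentioned below.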
Marinai et al. (1991) also describe a system for aligning parallel text
which they have developed using a parallel corpus of Italian and English. This
corpus includes a variety of texts (short stories, computer-science texts,
American scientific text books, on-flight magazines, etc.) with translations
ranging from the very literal to the very free. The system allows the user to
retrieve parallel contexts for any word or expression in the corpus. Johansson
and Hofland (1993) similarly outline a recent project which involves setting up
corpora of (a) original English texts and their Norwegian translations and (b)
original Norwegian texts and their English translations, as well as an alignment
method which relies on a combination of sentence length and a set of "anchor
words" (primitive bilingual lexicon).
Rettig (1993) reports three further corpora which include substantial
amounts of parallel texts:
(i) The GILLBT Corpus is 80% parallel texts and contains a multitude of
spoken African languages (Vagla, Mo, Konni, Adele, etc.).
Correspondences are set up semi-automatically between morphemes, words, and
sentences.
(ii) The ATR Dialogue Database is 99% parallel texts and is used as a test-
bed for Japanese-English machine translation. Correspondences between
equivalent elements are set up manually.
(iii) The Leiden-Jerusalem Armenian Database is only 5% parallel texts, in
combinations of Armenian with Greek, Arabic and Syriac.
Correspondences between equivalent elements have been set up semi-automatically.
3.2 Multilingual Corpora
I will use the term 'multilingual corpora' to refer to sets of two or more
monolingual corpora in different languages, built up either in the same or different
institutions on the basis of similar design criteria. Multilingual corpora
essentially enable us to study items and linguistic features in their home
environment, rather than as they are used in translated text. They are useful inasmuch
as they can provide access to the natural patterns of the target language in
particular, and they therefore have an important role to play in materials
writing, translator training and improving the performance of machine
translation systems. They do, however, have their limitations, and I will come back to
this shortly.
The best example of research based on a multilingual corpus is the
Council of Europe Multilingual Lexicography Project (Sinclair 1991b; Baker
1993b). This research drew on corpora in seven European languages: English,
German, Swedish, Italian, Spanish, Hungarian and Serbo-Croatian. Prima
facie equivalents of common words such as day, say, world, little and know
were investigated in the various corpora to identify regularities in the textual
environment of each member of an equivalence pair. The idea was that the
computer could be trained to detect equivalences on the basis of identifying
such structural environments. Here is an example of the kind of algorithm
which can be devised, in this case for translating Swedish låna into English, on
the basis of evidence drawn from multilingual corpora:8
låna followed by ut is translated as LEND
låna followed by in is translated as HIRE
låna followed by LOCALITY is translated as USE
låna in all other contexts is translated as BORROW
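Rules of this kind translate directly into code. The following is only a sketch of the four rules as stated: the function name is an assumption, and so is the way LOCALITY is recognised, which is left to a word list supplied by the caller (the single entry used below is a placeholder, not data from the project).

```python
def translate_lana(next_word, localities=frozenset()):
    """Choose an English equivalent for Swedish 'låna' from the word that follows it.

    `localities` is a caller-supplied set of words counted as LOCALITY;
    how such words were identified in the original project is not specified here.
    """
    if next_word == "ut":
        return "LEND"
    if next_word == "in":
        return "HIRE"
    if next_word in localities:
        return "USE"
    return "BORROW"

# Placeholder locality list, for illustration only.
places = {"toaletten"}
for following in ["ut", "in", "toaletten", "boken"]:
    print("låna", following, "->", translate_lana(following, places))
```

The rules are tested in order, so the final line acts as the default for "all other contexts".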
The usefulness of this type of methodology is not restricted to machine
translation. Multilingual corpora, and the kind of insights they provide on the
typical behaviour of so-called 'equivalent' items and structures in various
languages, can be extremely useful in developing teaching materials for
translators and in computer-aided translator training. However, research based on,
or limited to, multilingual corpora cannot provide answers to theoretical issues
which lie at the heart of the discipline and cannot explain the phenomenon of
translation per se. It essentially proceeds from the assumption that there is a
natural way of saying anything in any language, and that all we need to do is to
find out how to say something naturally in language A and in language B. But
the fact is that, as Bible translators have discovered, "there are some things that
simply cannot be said naturally in some languages" (Headland 1981: 18). So,
while the assumption that underlies work on multilingual corpora is essential
for the development of pedagogical material, it cannot be carried over into the
theoretical branch of the discipline because it will result in a serious distortion
of our view of the very phenomenon we should be trying to explicate.9 If
anything and everything could be said naturally in any language, then scholars
such as Catford would be justified in defining translation as "the replacement
of textual material in one language (SL) by equivalent textual material in
another language (TL)" (1965: 20). If, on the other hand, we discount this
assumption and accept that, for every language, there are many things, from
the most banal to the most exotic, which cannot be said naturally, then we have
to look for a different research methodology.
What I am suggesting is that we need to effect a shift in the focus of
theoretical research in the discipline, a shift away from comparing either ST
with TT or language A with language B to comparing text production per se
with translation. In other words, we need to explore how text produced in
relative freedom from an individual script in another language differs from text
produced under the normal conditions which pertain in translation, where a
fully developed and coherent text exists in language A and requires recoding in
language B. This shift in focus may be facilitated by access to comparable
corpora.
3.3 Comparable Corpora
For lack of a better term, I will use 'comparable corpora' to refer to something
which is a cross between parallel and multilingual corpora. To my knowledge,
comparable corpora do not yet exist anywhere. In Baker (1993a), I advocated
setting up corpora of this type and suggested a number of research
investigations which could be pursued if they were made available to translation scholars.
Comparable corpora consist of two separate collections of texts in the
same language: one corpus consists of original texts in the language in
question and the other consists of translations in that language from a given source
language or languages. The corpus of original texts is therefore an ordinary
monolingual corpus of the type linguists have been using for several decades.
Any existing monolingual corpus can be used, provided it is similar in design
to the translation corpus. Both corpora should cover a similar domain, variety
of language and time span, and be of comparable length. The translation
corpus should be representative in terms of the range of original authors and of
translators.
The most important contribution that comparable corpora can make to the
discipline is to identify patterning which is specific to translated texts,
irrespective of the source or target languages involved. I will return to this shortly
and will give examples of the kind of patterning that can be revealed. But,
generally speaking, what we would be comparing here is not, for instance,
French originals with their English translations, nor original French texts with
original English texts, but rather substantial amounts of original English text
with substantial amounts of translated English text (whatever the source
language). Similar studies done in other languages would either support or refute
hypotheses about the process of translation which were made on the basis of
evidence drawn from a comparable corpus of English.
The prospect of setting up independent corpora of translated texts, in
various languages and in different institutions, is daunting. There is, for
instance, the problem of establishing who holds the copyright for each
translation and of obtaining permission to hold the material in electronic form and to
publish results of analyses performed on it. In my experience, it is particularly
difficult to get permission to use translated material because there is always a
lot of sensitivity and suspicion wherever translation is involved. People tend to
assume that you want to get hold of their translations in order to criticise them.
This is a natural reaction to the kind of discourse that has so far dominated the
field, and I suppose we have only ourselves to blame for it. But we need do no
more than reflect on a statement by John Sinclair, one of the leading scholars in
corpus linguistics, to realise that corpus-based translation studies is not only
feasible but, if the experience of corpus linguists is anything to go by, it will
sooner or later become the norm:
Thirty years ago when this research started it was considered impossible to
process texts of several million words in length. Twenty years ago it was
considered marginally possible but lunatic. Ten years ago it was considered
quite possible but still lunatic. Today it is very popular. (Sinclair 1991a: 1)
One problem with comparable corpora is that, unlike parallel and multilingual
corpora, they do not have direct applications in the classroom and it is not
immediately obvious how they might contribute to improving the performance
of machine translation systems. The idea of setting up comparable corpora is
therefore unlikely to attract funding of the type offered by the European
Commission for instance. The effort and expense involved in setting up a
comparable corpus in any language can nevertheless be justified if there is a
strong argument that this type of 'pure' research is essential for the survival
and growth of the discipline. And the difficulty of trying to develop this
argument in detail is that, as with any kind of pure research, one can initially do
no more than pose questions. The answers and revelations do not come until
the research is in full swing. Be that as it may, I would now like to give a few
examples of the kind of research that can be carried out on comparable
corpora.
4. Comparable Corpora:
What Can They Tell Us About Translation?
Access to comparable corpora should allow us to capture patterns which are
either restricted to translated text or which occur with a significantly higher or
lower frequency in translated text than they do in originals (see section 2.2, i,
for an example from Shamaa 1978). These patterns may be quite local, in the
sense that they are specific to a particular linguistic feature in a particular
language. One might discover, for instance, that in English reported structures
of the type "He said that x is the case" or "They claimed x was the case", the
optional that appears with a significantly higher or lower frequency in original
English texts vis-à-vis translated English texts. Let us assume that, as I suspect,
that occurs in this structure less frequently in the original corpus than in the
corpus of translations. This is a local pattern, specific to English, but what we
discover about it may tell us something about the nature of translated text in
general and the nature of the process of translation itself. It may be used, for
instance, to support the explicitation hypothesis, i.e. that translated text exhib-
its a higher level of explicitness, even at the syntactic level, than specific
source texts and original texts in general (Baker 1993a: 243).
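A first pass at the kind of count suggested above might look like the following. This is a minimal sketch, not a worked-out method: the verb list `REPORTING_VERBS` and the function name `that_rate` are hypothetical, and the count is a crude proxy, since it treats every occurrence of a listed verb as a reporting structure and only checks whether the very next word is that.

```python
import re

# Hypothetical, deliberately tiny list of reporting verbs (an assumption);
# a real study would need a fuller list and some way of excluding
# non-reporting uses such as "he said hello".
REPORTING_VERBS = ("said", "claimed")

def that_rate(text):
    """Return (with_that, total) for the given text.

    total     = occurrences of any listed reporting verb
    with_that = those occurrences immediately followed by 'that'
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    total = 0
    with_that = 0
    for i, tok in enumerate(tokens):
        if tok in REPORTING_VERBS:
            total += 1
            if i + 1 < len(tokens) and tokens[i + 1] == "that":
                with_that += 1
    return with_that, total
```

Running `that_rate` separately over a corpus of original English and a corpus of translated English would give two pairs of counts whose proportions can then be compared.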
Comparable corpora can also help us arrive at more global statements
about the nature of translated text. I will suggest two possible lines of research
to illustrate the kind of global patterning I have in mind.
4.1 The Type-Token Ratio
As far as the computer is concerned, any sequence of letters with an orthographic
space on either side is counted as a word or, more precisely, a token.
Thus each occurrence of the word day is counted as an individual token and we
can say that there are x tokens of day in a given corpus. The word-form day
itself is a type, no matter how often it occurs. So we can say that there are x
tokens of the type day in a corpus of y million words.
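The distinction between tokens and types can be illustrated with a short sketch. It assumes that splitting on whitespace is an adequate stand-in for the orthographic space; stripping edge punctuation and lower-casing are further assumptions made here for illustration, and the names `tokens_of` and `type_counts` are hypothetical.

```python
import string
from collections import Counter

def tokens_of(text):
    """Split on orthographic spaces; strip edge punctuation and lower-case.

    Punctuation stripping and lower-casing are assumptions of this sketch.
    """
    words = (w.strip(string.punctuation).lower() for w in text.split())
    return [w for w in words if w]

def type_counts(text):
    """Map each type (word-form) to the number of its tokens."""
    return Counter(tokens_of(text))

sample = "Day after day, the day shift starts at nine."
counts = type_counts(sample)
print(counts["day"])         # tokens of the type 'day': 3
print(sum(counts.values()))  # total tokens: 9
print(len(counts))           # total types: 7
```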
It is useful to explore how the types of a language are distributed in a
given corpus and to use the result as a basis for comparison with other
corpora.10 For example, Krishnamurthy (1992) reports that the overall type/
token ratio for a corpus of BBC World Service broadcasts is approximately
174 and for a corpus of the Times Daily Newspaper approximately 60. Figures of
this size are evidently averages of tokens per type, so a lower figure means
more distinct word-forms for the same amount of text. This suggests that the
Times corpus uses a much more varied vocabulary than the BBC corpus or,
conversely, that the BBC corpus draws on a more restricted set of lexical items
than the Times corpus.
It would be interesting to compare the type/token ratio of a corpus of
original texts and a corpus of translated texts of the same language and in the
same type of domain (for example fiction, media, instruction manuals). The
result may help us capture global patterning that contributes to the identifica-
tion of translations as translations. It may also tell us something about the
nature of mediation. A high type-token ratio in the sense of the figures above
(many tokens per type, hence a less varied vocabulary), for instance, may be interpreted
as a consequence of the process of lexical simplification11 which has been
reported as taking place in a variety of mediated communicative activities,
including translation (Blum-Kulka and Levenston 1983).
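The comparison proposed here could be sketched as follows. Two assumptions are made and should be read as such: the ratio is expressed as tokens per type, which is how the figures of 174 and 60 quoted from Krishnamurthy (1992) are read here, and the two corpora are cut to the same number of tokens before comparison, because the ratio depends on corpus size. The function names are hypothetical.

```python
import string

def tokens_of(text):
    """Whitespace tokens, edge punctuation stripped, lower-cased (assumptions)."""
    words = (w.strip(string.punctuation).lower() for w in text.split())
    return [w for w in words if w]

def tokens_per_type(tokens):
    """Average number of tokens per type; higher means less varied vocabulary."""
    return len(tokens) / len(set(tokens))

def compare_ratios(original_text, translated_text):
    """Return (ratio_original, ratio_translated) on equal-sized samples.

    Both token lists are truncated to the length of the shorter one so that
    the two ratios are computed over the same amount of text.
    """
    a, b = tokens_of(original_text), tokens_of(translated_text)
    n = min(len(a), len(b))
    if n == 0:
        raise ValueError("both texts must contain at least one token")
    return tokens_per_type(a[:n]), tokens_per_type(b[:n])
```

Truncation is the simplest way of equalising size; with the very large corpora that note 10 calls for, averaging the ratio over successive equal-sized chunks would be an alternative.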
4.2 Lexical Density
It is generally accepted that language, any language, consists of a series of
lexical and grammatical words. Lexical words are generally 'about' something
and typically comprise items which belong to categories such as nouns,
adjectives and verbs. Grammatical words belong to closed sets such as deter-
miners and prepositions. Stubbs (1986: 28-33) discusses a series of tests which
can be used to decide whether a given item is a lexical or grammatical word.
He also details an explicit procedure and an algorithm which can be used to
identify lexical and grammatical items automatically in English.
Lexical density is the percentage of lexical as opposed to grammatical
items in a given text or corpus of texts. It is calculated by dividing the number
of lexical items by the total number of words in a text or corpus and multiplying
the result by 100 to arrive at a percentage. Stubbs (ibid.) reports that the
lexical density of written English has been found to be higher than that of
spoken English and interprets the result in a variety of ways. For example,
written text is relatively context-free compared to spoken text, which can rely
on the immediate physical context. Written text is also highly edited and
redrafted to cut out redundancy and repetition, while spoken text has to be
understood in real time, which means that it has to be more predictable. Lexical
words are less predictable than grammatical words because they belong to
open-ended categories: there are thousands and thousands of them. Thus in
pint of milk, pint predicts of much more than of predicts milk — of predicts too
many things, for example bitter, lager, beer, and, if we do not take collo-
cational restrictions into consideration, the complete set of nouns which exist
in English.
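The calculation of lexical density described above can be sketched in a few lines. The sketch does not reproduce the procedure in Stubbs (1986): it assumes a small, hand-picked list of grammatical words (`GRAMMATICAL`, a hypothetical name) and treats every other word as lexical, so the result depends entirely on that list.

```python
import string

# Small illustrative list of closed-class (grammatical) words.
# This is an assumption of the sketch, not the list used by Stubbs (1986).
GRAMMATICAL = frozenset("""
a an the this that these those of in on at by for with from to into
and but or not as if than is are was were be been being am do does did
have has had i you he she it we they me him her us them
my your his its our their will would shall should can could may might must
""".split())

def lexical_density(text):
    """Lexical items as a percentage of all words in the text."""
    words = [w.strip(string.punctuation).lower() for w in text.split()]
    words = [w for w in words if w]
    if not words:
        raise ValueError("text contains no words")
    lexical = sum(1 for w in words if w not in GRAMMATICAL)
    return 100.0 * lexical / len(words)
```

Applied to the simplified text from Headland (1981) given in the Appendix, `lexical_density("You were redeemed, not with money, but with the blood of Christ.")` counts 4 lexical items out of 12 words, about 33%, though a different word list could give a different figure.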
The questions of lexical density and of predictability are directly related
to the notions of information rate and information load, which have been
identified as a problem in translation. Information load has generally been
associated with features such as the use of technical vs. general vocabulary, the
percentage of known vs. unknown information, the overall length of the text
and the amount of detail involved in explaining or describing an event or
entity.
Headland (1981) and Larson (1984) give an example each where they
contrast a text perceived to be 'difficult', in the sense of having a high
information load, with a simplified version of the same text (Headland 1981:
20) or with an easier text on a similar subject (Larson 1984: 439). Fortunately,
the texts used in both cases are short enough to be analysed manually and I
have therefore been able to calculate their respective lexical densities.
Both examples exhibit a difference in lexical density: the texts perceived
as 'difficult' have a higher lexical density than the ones presented as being
'easier' (43% vs. 33% respectively in the case of Headland, and 52% vs. 44%
for the Larson texts). This is true irrespective of the type of vocabulary used,
the length of each text or the level of detail involved. It is therefore possible to
argue, at least tentatively, that lexical density contributes to information load.
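As a worked instance of the calculation, using the counts given in the Appendix for the 'difficult' text from Headland (16 lexical and 21 grammatical items):

```latex
\[
\text{lexical density}
  = \frac{16}{16 + 21} \times 100
  = \frac{16}{37} \times 100
  \approx 43\%
\]
```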
Given the facilities available for the automatic processing of corpora, and
particularly the algorithm devised by Stubbs (1986), this feature can now be
studied on a much wider scale and may well reveal important facts about the
nature and extent of mediation in translated text. If, for instance, we discover
that the overall lexical density of a corpus of English translations is signifi-
cantly lower than that of a comparable corpus of original English, we might
want to argue that translators use this feature, consciously or subconsciously,
to control information load and to make a translated text more accessible to its
new readership.
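How a 'significantly lower' lexical density would be established is not specified here. One simple possibility, offered purely as an assumption, is a two-proportion z-test on the counts of lexical items in the two corpora. The function name `two_proportion_z` is hypothetical, and because words in running text are not independent observations, the result is at best a rough guide.

```python
import math

def two_proportion_z(lex_a, n_a, lex_b, n_b):
    """Two-proportion z-test on lexical-item counts.

    lex_a, lex_b: number of lexical items in corpus A and corpus B
    n_a, n_b:     total number of words in corpus A and corpus B
    Returns (z, p_value), where p_value is two-sided.
    """
    p_a, p_b = lex_a / n_a, lex_b / n_b
    pooled = (lex_a + lex_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("standard error is zero; proportions are degenerate")
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Invented counts, for illustration only: translations (A) vs. originals (B).
z, p = two_proportion_z(440, 1000, 520, 1000)
print(z < 0, p < 0.01)
```

A negative z here would indicate a lower proportion of lexical items in corpus A than in corpus B.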
5. Conclusion
I have tried to give an overview of the types of corpora that can be used in
translation studies and examples of the kind of corpus-based research that can
be carried out in the discipline. We are still a long way from achieving a
coherent methodology in this area. Corpus-based research offers enormous
potential for translation scholars but the process of setting up the required
corpora and of devising the relevant software is fraught with difficulties.
Hopefully, the examples I have given of the kind of questions that can be
answered on the basis of corpora, and the scale on which they can be answered,
will go some way towards justifying the effort required to establish corpus-
based research as a serious option in translation studies.
Author's address:
Mona Baker . 2 Maple Road West. Brooklands . MANCHESTER M23 9HH .
United Kingdom
Notes
1. Sager (1990: 131-132) details various applications of corpus-based analysis in terminology
work. These include, for example:
- supplementing machine-translation lexicons.
- processing texts which are to be translated and comparing them to machine-readable
terminology holdings to identify new items. Items thus identified can be used both to
supplement the holdings and to help terminologists pre-empt translators' queries.
- ensuring that all possible variants of a term are covered.
- demonstrating the linguistic behaviour of terms.
- identifying changes in the frequency or usage of terms by means of statistical analyses of
large corpora.
2. This type of corpus is still popular and some find it useful in translation. A computerised
corpus of Petrarch's Familiares was set up in the early eighties at the SUNY Research
Centre, Binghamton with a view to simplifying the translation process and compiling a
dictionary of Humanistic and Renaissance Latin (Bernardo 1981). The corpus included a
"Latin-English in-context word list" (p. 74), the idea being that this could (a) generate,
automatically, a similar word-list for other prose works by Petrarch and produce a first
draft, an unidiomatic translation, which can then be polished, or (b) produce a printout of
Latin words in context, with their possible translation equivalents, to enable the translator
to dictate an idiomatic translation by selecting the most likely word translations from each
list.
3. Corpus linguistics is a branch of general linguistics which draws on large, computerised
collections of natural language, processed in a variety of ways, to substantiate its findings.
A few years ago, very few people had heard of corpus linguistics. Today, the European
Commission is firmly committed to the development of large reference corpora in all
European languages, and a substantial group of linguists now see the use of corpora as the
most exciting development in modern linguistics. For a good overview of the development
of this method of studying language, see Leech (1991) and Stubbs (1993). For an overview
of techniques used in processing corpora, see Sinclair (1991a).
4. The question of what constitutes a representative corpus remains highly problematic and
has been the subject of much debate over the years. Any corpus, whether of whole texts or
text fragments, is essentially a sample of a particular domain of language, or of the general
core of everyday language. First, it is impossible to delimit in any rigorous way the
population of texts which itself constitutes the domain in question. This makes the
application of methods of statistical sampling irrelevant, given that statistical sampling is
generally done on clearly defined populations. Second, almost every unit of language that
we might want to consider as the basis for collecting a 'sample' raises its own problems of
definition. Even 'text' is not a well defined unit. Similarly, most of the criteria that we have
to use in selecting our sample, for example informal vs. formal, have a high subjective
element to them and further complicate the question of representativeness. And finally, the
sheer size of the population we need to sample (namely language) and, by comparison, the
severe restrictions on available resources (whether physical, financial or human), all mean
that it is virtually impossible to ensure coverage of every feature of the population being
sampled. In theoretical terms then, it is untenable to assume that a perfectly balanced and
representative corpus can ever be achieved. In practical terms, this situation suggests that
'more' representativeness should be sought gradually, through a series of approximations:
"First, the corpus builder attempts to create a representative corpus. Then this corpus is
used and analysed and its strengths and weaknesses identified and reported. In the light of
this experience and feedback the corpus is enhanced by the addition or deletion of material
and the cycle is repeated continually" (Atkins et al. 1991). Also, in practice, mundane
considerations such as the availability of particular texts and the ease of obtaining copy-
right permission play an important role in the selection of samples. Similarly, the reasons
for the fact that corpora of spoken language, where they are held at all, are much smaller in
size than those of written language, are entirely practical rather than theoretical. They have
to do with the cost of transcription and keyboarding and the fact that surreptitious recording
of face-to-face and telephone conversations is illegal in most countries.
5. I personally believe that corpora which consist of whole texts are, on the whole, far more
useful than those which consist of text fragments, but the discussion of this particular issue
lies outside the scope of this paper. Suffice it to say here that a corpus which consists of text
fragments has obvious limitations in terms of studying larger text patterns, such as patterns
of cohesion across chapters. Any fragmentation of a novel would similarly rule out a study
of character development. And a corpus which consists of a set of sentences will not even
allow a study of more modest patterns, such as paragraphing and inter-sentential cohesion.
And finally, a corpus of complete texts offers a way out of having to address the issue of the
selection and representativeness of text fragments, though the question of the representa-
tiveness of the corpus as a whole remains unresolved.
6. For the importance of this particular dimension, i.e. direction of translation, see Malmkjaer
(1993). Malmkjaer argues, quite convincingly but without the benefit of substantial corpus
research, that experienced translators "might produce higher degrees of equivalence if they
are translating out of their mother tongues than if they are translating into their mother
tongues" (p. 213). She coins the term 'SL mother tongue but TT habitual use translators' to
refer to this group.
7. Another term used in the literature is 'bilingual corpora' (Leech 1991). Hartmann (1980:
37-38) uses the term 'parallel texts' to refer to three distinct types of text collections, one of
which is what I have referred to here as 'multilingual corpora' and in Baker (1993a: 248) as
'comparable corpora'. I apologise for the change in terminology.
8. The English corpus on which this study was based is the original Cobuild corpus of modern
English (c. 20 million words); the Swedish corpus consisted of c. 20 million words of
modern Swedish held at the University of Göteborg, Department of Swedish (contact
Martin Gellerstam). A detailed breakdown of the various structural patterns identified in
the two corpora for words such as låna, lend and borrow can be found in Sinclair 1991b.
One example of this detailed analysis is that the pattern 'A lends B x' (as in 'She lent me a
sweater') occurs 26 times in the English corpus, whereas 'A lånar B x' does not occur at all
in the Swedish corpus, in spite of the fact that it is generally accepted as a perfectly normal
type of construction in Swedish.
9. The rejection of the assumption that you can say anything naturally in any language also
has implications for the development of translation studies as an independent discipline. If
the assumption held, we would perhaps be justified in seeing translation theory as "a
branch of Comparative Linguistics" (Catford 1965: 20) or "a branch of applied linguistics"
(Lindquist 1984: 261). Rejecting this assumption is necessary to the claim that the activity
of translation is qualitatively different from that of text production, and hence that we need
an independent theory to account for it (inasmuch as any theory can be independent from
other theories of course).
10. Studies of this type should be based on very large corpora if distortion of data is to be
avoided.
11. Lexical simplification may be defined as "the process and/or result of making do with less
words" (Blum-Kulka and Levenston 1983: 119).
References
Atkins, Sue, Jeremy Clear and Nicholas Ostler. 1991. "Corpus Design Criteria". Paper
Presented at the Workshop on European Textual Corpora, Pisa, 7-10 January 1991.
Baker, Mona. 1992. In Other Words: A Coursebook on Translation. London and New York:
Routledge.
Baker, Mona. 1993a. "Corpus Linguistics and Translation Studies: Implications and Appli-
cations". Baker et al. 1993: 233-250.
Baker, Mona. 1993b. Multilingual Databases. Birmingham: University of Birmingham.
[Report submitted to the European Commission as a contribution to a European enquiry
into corpus work.]
Baker, Mona, Gill Francis and Elena Tognini-Bonelli, eds. 1993. Text and Technology: In
Honour of John Sinclair. Amsterdam/Philadelphia: John Benjamins.
Bernardo, Aldo S. 1981. "Maximizing Computer Assistance in Literary Translation:
Petrarch's Familiares". Marilyn Gaddis Rose, ed. Translation Spectrum: Essays in
Theory and Practice. State University of New York Press, 1981. 74-80.
Blum-Kulka, Shoshana and Eddie A. Levenston. 1983. "Universals of Lexical Simplification".
Claus Faerch and Gabriele Kasper, eds. Strategies in IL Communication.
Longman, 1983. 119-139.
British National Corpus: Written Corpus Design Specification. 1991. OUP Promotional
Document Dated 2 September.
Catford, J.C. 1965. A Linguistic Theory of Translation: An Essay in Applied Linguistics.
Oxford University Press.
Church, Kenneth and William Gale. 1991. "Concordances for Parallel Text". Paper Pre-
sented at the Seventh Annual Conference of the UW Centre for the New OED and Text
Research. St. Catherine's College, Oxford.
Gale, William and Kenneth Church. 1991. "Identifying Word Correspondences in Parallel
Texts". Darpa SLS Workshop.
Hartmann, R.R.K. 1980. Contrastive Textology: Comparative Discourse Analysis in Ap-
plied Linguistics. Heidelberg: Julius Groos.
Headland, Thomas. 1981. "Information Rate, Information Overload, and Communication
Problems in the Casiguran Dumagat New Testament". Notes on Translation 83. 18-27.
Hofland, K. and S. Johansson. 1982. Word Frequencies in British and American English.
Bergen: The Norwegian Computing Centre for the Humanities.
Johansson, Stig and Knut Hofland. 1993. "Towards an English-Norwegian Parallel Cor-
pus". Udo Fries, Gunnel Tottie and Peter Schneider, eds. Creating and Using English
Language Corpora: Papers from the Fourteenth International Conference on English
Language Research on Computerized Corpora. Zurich, 1993. 25-37.
Krishnamurthy, Ramesh. 1992. "Basic Access Software: Word Lists". Birmingham:
Cobuild. [Report submitted to the European Commission as a contribution to NERC
workpackage 5: Access and Management Software Tools.]
Laffling, John. 1991. Towards High-Precision Machine Translation — Based on Contras-
tive Textology. Berlin-New York: Foris Publications.
Laffling, John. 1992. "On Constructing a Transfer Dictionary for Man and Machine".
Target 4:1. 17-31.
Larson, Mildred. 1984. Meaning-Based Translation: A Guide to Cross-Language Equiva-
lence. Lanham, New York and London: University Press of America.
Leech, Geoffrey. 1991. "Corpora". Kirsten Malmkjaer, ed. The Linguistics Encyclopedia.
London and New York: Routledge, 1991. 73-80.
Lindquist, Hans. 1984. "The Use of Corpus-Based Studies in the Preparation of Handbooks
for Translators". Wolfram Wilss and Gisela Thome, eds. Translation Theory and Its
Implementation in the Teaching of Translating and Interpreting. Tübingen: Narr, 1984.
260-270.
Malmkjaer, Kirsten. 1993. "Who Can Make Nice a Better Word than Pretty?: Collocation,
Translation, and Psycholinguistics". Baker et al. 1993: 213-232.
Marinai, E., C. Peters and E. Picchi. 1991. "Bilingual Reference Corpora: A System for
Parallel Text Retrieval". Paper presented at the Seventh Annual Conference of the UW
Centre for the New OED and Text Research, St. Catherine's College, Oxford.
Newton, John, ed. 1992. Computers in Translation: A Practical Appraisal. London and
New York: Routledge.
Rettig, Heike. 1993. Evaluative Report on the Corpus Survey. Institut für Deutsche Sprache:
Mannheim NERC Working Paper 17. [Submitted to the European Commission as a
contribution to a European enquiry into corpus work.]
Sager, Juan. 1990. A Practical Course in Terminology Processing. Amsterdam/Philadel-
phia: John Benjamins.
Schubert, Klaus. 1992. "Esperanto as an Intermediate Language for Machine Translation".
Newton 1992: 78-95.
Shamaa, Najah. 1978. A Linguistic Analysis of Some Problems of Arabic to English
Translation. Oxford University. [Ph.D. Thesis.]
Sinclair, John McHardy. 1991a. Corpus, Concordance, Collocation. Oxford: Oxford Uni-
versity Press.
Sinclair, John McHardy. 1991b. Council of Europe Multilingual Lexicography Project.
[Report submitted to the Council of Europe under contract no. 57/89.]
Stubbs, Michael. 1986. "Lexical Density: A Computational Technique". Talking About
Text. Discourse Analysis Monograph 13. University of Birmingham: English Language
Research, 1986. 27-42.
Stubbs, Michael. 1993. "British Traditions in Text Analysis: From Firth to Sinclair". Baker
et al. 1993: 1-33.
Toury, Gideon. 1978. "The Nature and Role of Norms in Literary Translation". James S
Holmes, José Lambert and Raymond van den Broeck, eds. Literature and Translation:
New Perspectives in Literary Studies. Leuven: ACCO, 1978. 83-100. [A revised version
in: Gideon Toury. Descriptive Translation Studies and beyond. Amsterdam/Philadel-
phia: John Benjamins, 1995. 53-69.]
Vermeer, Hans J. 1987. "What Does It Mean to Translate?". Gideon Toury, ed. Translation
Across Cultures. New Delhi: Bahri Publications, 1987. 25-33.
Appendix
A simplified version of the same text (Headland 1981: 20)

Text A ('difficult')
Ye were not redeemed with corruptible things, as silver and gold, from your vain conversa-
tion received by tradition from your fathers; but with the precious blood of Christ, as of a
lamb without blemish and without spot. (Peter 1:18)
16:21 (lexical/grammatical) = 43% lexical density.
Text B (simplified version)
You were redeemed, not with money, but with the blood of Christ.
Two texts on a similar subject (Larson 1984: 439)

Text A ('difficult')
The steam turbine obtains its motive power from the change of momentum of a jet of steam
flowing over a curved vane. The steam jet, in moving over the curved surface of the blade,
exerts a pressure on the blade owing to its centrifugal force. This centrifugal pressure is
exerted normal to the blade surface and acts along the whole length of the blade. The
resultant combination of these centrifugal pressures, plus the effect of changes of velocity,
is the motive force on the blade. (from: E.H. Lewitt. Thermodynamics Applied to Heat
Engines.)
Text B ('easier')
The principle of the turbine is extremely simple. If the lid of a kettle is wedged down, when
the water boils, a jet of steam will issue from the spout. If this jet is projected against the
blades of a fan, or any sort of wheel shaped like the old-fashioned water-wheel it will,
obviously, drive it round. In the power station, steam is generated in huge boilers, and very
often a temperature as high as 850 degrees Fahrenheit at a pressure of sometimes 1,000 lb.
per sq. in. is built up before the steam is released from the boiler to the turbine jets.
The turbine comprises two parts, the rotor or moving part, and the stator or fixed portion.
Instead of a single nozzle with one jet, there are a large number of nozzles . . . (from: How
and Why it Works, published by Odhams.)
