Corpus

CORPUS LINGUISTICS
Introduction: what is Corpus Linguistics?
 The study of language based on examples of “real life” language
use, collected, stored and processed via computer.
 Corpus linguistics can be described as the study of language based
on text corpora.
 Facilitated by the advent of computer technology (1960s)
 Latin: corpus (body): body of text → any collection
of more than one text, written or spoken
INTRODUCTION TO CORPUS LINGUISTICS
• A corpus is a large collection of machine-readable,
authentic texts which represent
spoken and/or written usage, chosen to
characterize or represent a state or variety of a
language.
• Corpus v. Text archive
• Representativeness
Corpus vs. archive
 Text archive
 Collection of texts in their original format
(Oxford Text Archive: http://ota.ox.ac.uk/)
 Corpus
 texts collected and processed in a unified,
systematic manner
British National Corpus: http://www.natcorp.ox.ac.uk/
BTANT 129 w5
“A corpus can be defined as a collection of texts assumed
to be representative of a given language put together so
that it can be used for linguistic analysis. Usually the
assumption is that the language stored in a corpus is
naturally-occurring, that it is gathered according to explicit
design criteria, with a specific purpose in mind, and with a
claim to represent larger chunks of language selected
according to a specific typology.” (Tognini-Bonelli 2001,
p. 2)
Corpus Linguistics: Theory or Method?
Theory
WHY?
A set of ideas to explain the apparent facts
Methodology
HOW?
An approach to something; set of methods
What is corpus linguistics?
 A new theory of language?

 No. In principle, any theory of language is compatible with
corpus-based research.
 A separate branch of linguistics (in addition to syntax,
semantics…)?
 No. Most aspects of language can be studied using a corpus
(in principle).
 A methodology to study language in all its aspects?
 Yes! The most important principle is that aspects of
language are studied empirically by analysing natural
data using a corpus.
Corpus-based or Corpus-driven approaches?
Corpus-based approaches are used to “expound,

test or exemplify theories and descriptions that
were formulated before large corpora became
available to inform language study” (Tognini-
Bonelli 2001:65).
Therefore, corpus-based linguists are not strictly
committed to corpus data and they would discard
“inconvenient evidence” by insulation,
standardisation and instantiation (i.e. via corpus
annotation).
Corpus-based or Corpus-driven
approaches?
Corpus-driven linguists are “strictly committed to the

integrity of the data as a whole”.
Theoretical statements are fully consistent with, and

reflect directly, the evidence provided by the corpus.
(Tognini-Bonelli 2001:84-85).
Corpus-based or Corpus-driven approaches?
The distinction is overstated: the two approaches are idealized extremes.
4 basic differences between the 2 approaches:

 Types of corpora used
 Attitudes towards theories and intuitions
 Focuses of research
 Paradigmatic claims
C.B. Approaches
 Corpus must be representative and balanced;
 Size is not all-important;
 Minimum frequency is used to exclude non-relevant results;
 In favour of corpus annotation: CB approaches
generally have existing theory as a starting
point and correct and revise such theory in the
light of corpus evidence;
 Distinction between the different levels of
language analysis.
C.D. Approaches
 Corpus will balance itself when it grows to
be big enough (cumulative representativeness);
 Corpus must be very large;
 Corpus evidence is exploited fully, but this
way the number of the combinations is
enormous;
 Against corpus annotation (no
preconceived theories);
 No distinction between lexis, syntax,
pragmatics, etc. There is only 1 level
of language description: the functionally
complete unit of meaning or language
patterning.
We will only refer to
CORPUS-BASED APPROACHES
A few key notions in

Corpus Linguistics…
Representativeness
Essential feature of a corpus.
Balance (the range of genres included in a corpus) and
sampling (how the text chunks for each genre are selected)
ensure representativeness.
Representativeness
A corpus is representative if…
…the findings based on its contents can be generalized to the
said language variety (Leech 1991);
…its samples include the full range of variability in a
population (Biber 1993).
Representativeness
It changes over time (Hunston 2002): if a corpus is not

regularly updated, it rapidly becomes
unrepresentative.
Representativeness
Criteria to select texts for a corpus:
 External criteria (Biber’s situational perspective): defined situationally, e.g.

genres, registers, text types, etc.
 Internal criteria (Biber’s linguistic perspective): defined linguistically, taking into
account the distribution of linguistic features. CIRCULAR – because a corpus is
typically designed to study linguistic distribution, so there is no point in analysing
a corpus where the distribution of linguistic features is predetermined.
Representativeness
2 main types (for the range of text categories represented):
 General corpora – a basis for an overall description of a
language (variety); their r. depends on the sampling from
a broad range of genres.
 Specialized corpora – domain- or genre-specific corpora;
their r. can be measured by the degree of closure or
saturation (lexical features).
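One possible way to make closure or saturation concrete is to cut the corpus into equal segments and count how many previously unseen word types each new segment adds; when that count levels off near zero, the vocabulary is approaching closure. The sketch below assumes this operationalisation (the slide does not spell one out). The function name and the toy token list are hypothetical.

```python
def new_types_per_segment(tokens, segment_size):
    """Return, for each successive segment, the number of word types not seen before."""
    seen = set()
    growth = []
    for start in range(0, len(tokens), segment_size):
        segment_types = set(tokens[start:start + segment_size])
        growth.append(len(segment_types - seen))
        seen |= segment_types
    return growth

# Invented toy "corpus" with a very small vocabulary.
toy_tokens = "a b c d a b c e a b c d a b e d".split()
growth = new_types_per_segment(toy_tokens, segment_size=4)
print(growth)  # new types contributed by segments 1, 2, 3, 4
```

A specialized corpus with a restricted vocabulary should show this curve flattening much sooner than a general corpus.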
Sampling
A corpus is a sample of a given population
A sample is representative if what we find for the

sample holds for the general population
Samples are scaled-down versions of a larger

population
Sampling
Sampling unit: for written text, a s.u. could be a
book, periodical or newspaper.
Population: the assembly of all sampling units; it

can be defined in terms of language
production, reception (demographic, sex, age,
etc.) or language as a product (category, genre
of language data).
Sampling frame: the list of sampling units

Sampling
Sampling techniques:
 Simple random sampling: all sampling units within
the sampling frame are numbered and the sample is
chosen by use of a table of random numbers; rare
features may not be accounted for.
 Stratified random sampling: the population is
divided into relatively homogeneous groups, i.e. the
strata, and these are then sampled at random;
never less representative than simple random sampling.
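The two techniques can be sketched in Python as follows. The sampling frame (95 news texts and 5 poems), the function names and the proportional allocation per stratum are illustrative assumptions, and a pseudo-random generator stands in for the table of random numbers.

```python
import random
from collections import defaultdict

def simple_random_sample(frame, k, seed=0):
    """Draw k sampling units at random from the whole sampling frame."""
    rng = random.Random(seed)
    return rng.sample(frame, k)

def stratified_random_sample(frame, stratum_of, k, seed=0):
    """Group units into strata, then sample each stratum at random,
    roughly in proportion to its size (at least one unit per stratum)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit in frame:
        strata[stratum_of(unit)].append(unit)
    sample = []
    for units in strata.values():
        quota = max(1, round(k * len(units) / len(frame)))
        sample.extend(rng.sample(units, min(quota, len(units))))
    return sample

def genre(unit):
    return unit[0]

# Hypothetical sampling frame: each unit is (genre, text id).
frame = [("news", i) for i in range(95)] + [("poem", i) for i in range(5)]

simple = simple_random_sample(frame, 20)
stratified = stratified_random_sample(frame, genre, 20)
print(sorted({genre(u) for u in simple}))
print(sorted({genre(u) for u in stratified}))
```

A simple random sample may miss the rare genre altogether, whereas the stratified sample is guaranteed to contain at least one poem.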
Historical background of Corpus
Linguistics
• R. Quirk’s Survey of English Usage (SEU)

• Advent of computers
• First corpora
• The Brown Corpus
Best known corpora
• The Birmingham Collection of English Texts

(COBUILD)
• The Bank of English
• The British National Corpus (BNC)
• The Brown Corpus
• The Lancaster-Oslo/Bergen Corpus (LOB)
• The Helsinki Corpus of English Texts: Diachronic
and Dialectal
• The International Corpus of English (ICE)
Best known corpora
• The Lancaster/IBM Spoken English Corpus (SEC)

• The London-Lund Corpus of Spoken English
(LLC)
• International Corpus of Learner English (ICLE)
• LINDSEI (Louvain International Database of
Spoken English Interlanguage)
• Corpus of the Contemporary Lithuanian
Language
Some important corpora
 1960s-1980s
 Brown Corpus (American English) 1 million words
 Lancaster-Oslo/Bergen (LOB) corpus (British English) 1 million words
 These corpora inspired the International Corpora of English (ICE) projects, which are
still continuing: see http://ice-corpora.net/ice/
 1980s-2000
 British National Corpus (100 million words)
 COBUILD corpus > Bank of English http://www.mycobuild.com/about-collins-corpus.aspx
 2000-now
 BYU corpora (see http://corpus.byu.edu): CoCA, CoHA, TIME, Corpus of American
Soap Operas, etc
 SCOTS; Corpus of Modern Scottish Writing (1700-1945) (see
http://www.scottishcorpus.ac.uk)
BYU corpus suite: http://corpus.byu.edu
The scope of corpus linguistics
• Corpus makers or compilers.

• Developers of tools for the analysis of corpora.
• Descriptive linguists.
• Exploiters of corpus-based linguistic
descriptions for use in a variety of applications
such as language learning and teaching, natural
language processing by machine, including
speech recognition and translation.
 A lemma is a lexeme or dictionary headword, which is realized
by word forms; e.g. the lemma TAKE (upper case) is realized by take,
took, taken, takes, taking (lower case).
 A node (the word form, lemma, or other pattern under
investigation) co-occurs with collocates (word forms or
lemmas) within a given span of word forms, e.g. 4:4 (four
words to left and right).
 A collocation is a purely lexical and nondirectional relation: it
is a node-collocate pair which occurs at least once in a
corpus.
 1. untold<N+1 : damage, misery,…; million, riches,…..>
 2. Cause <abstract nouns denoting “unpleasant things”>
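The node, span and collocate notions above can be illustrated with a minimal sketch. It assumes whitespace tokenisation and matching on word forms rather than lemmas; the function name and the toy sentence are invented for illustration.

```python
from collections import Counter

def collocates(tokens, node, span=4):
    """Count the word forms found within `span` tokens to the left
    and right of every occurrence of `node` (e.g. span=4 for 4:4)."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        left = tokens[max(0, i - span):i]
        right = tokens[i + 1:i + 1 + span]
        counts.update(left + right)
    return counts

# Invented example sentence, lower-cased and split on whitespace.
tokens = "the storm caused untold damage and untold misery to the town".split()

n_plus_1 = collocates(tokens, "untold", span=1)   # immediate neighbours only
four_four = collocates(tokens, "untold", span=4)  # the 4:4 span
print(n_plus_1)
print(four_four.most_common(3))
```

Because the relation is nondirectional, left and right contexts are pooled; a directional pattern such as N+1 would keep only the right-hand slice.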
Example: lexical semantics
 Quasi-synonymous lexical items exhibit subtle
differences in context.
 strong
 powerful
 A fine-grained theory of lexical semantics would

benefit from data about these contextual cues to
meaning.
Example continued
 Some differences between strong and powerful (source:
British National Corpus):
 strong: wind, feeling, accent, flavour
 powerful: tool, weapon, punch, engine
 The differences are subtle, but examining their collocates

helps.
Chomsky and Early Corpus
Linguistics
• Empiricism and rationalism.

• Early corpus linguists.
• Corpus research addresses performance, whereas the
linguist’s concern should be competence.
• Performance is a poor mirror of competence.
• A corpus is finite while language is infinite.
• Corpora would always be ‘skewed’.
Criticisms of corpora (I)
 Competence vs. performance:
 To explain language, we need to focus on
competence of an idealised speaker-hearer.
 Competence = internalised, tacit knowledge of
language
 Performance – the language we speak/write – is
not a good mirror of our knowledge
 it depends on situations
 it can be degraded
 it can be influenced by other cognitive factors
beyond linguistic knowledge
Criticisms of corpora (II)
 Early work using corpora assumed that:
 the number of sentences of a language is finite (so we can get
to know everything about language if the sample is large
enough)
 But actually, it is impossible to count the number of sentences in a
language.
 Syntactic rules make the possibilities literally infinite:
the man in the house (NP -> NP + PP)
the man in the house on the beach (PP -> PREP + NP)
the man in the house on the beach by the lake
…
 So what use is a corpus? We’re never going to have an infinite
corpus.
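The recursion argument can be made concrete with a short sketch: a rule that re-applies to its own output yields a longer noun phrase at every depth, so no finite list of sentences is complete. The prepositional phrases come from the slide; cycling through them beyond depth 3 is an assumption made only to keep the example going.

```python
# Prepositional phrases taken from the examples above.
PPS = ["in the house", "on the beach", "by the lake"]

def noun_phrase(depth):
    """Apply NP -> NP + PP recursively `depth` times, starting from 'the man'."""
    if depth == 0:
        return "the man"
    return noun_phrase(depth - 1) + " " + PPS[(depth - 1) % len(PPS)]

for depth in range(5):
    print(depth, noun_phrase(depth))
```

There is no largest `depth`, which is the sense in which the set of possible phrases is infinite while any corpus stays finite.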
Criticisms of corpora (III)
 A corpus is always skewed, i.e. biased in favour of
certain things.
 Certain obvious things are simply never said. E.g.
We probably won’t find a dog is a dog in our
corpus.
 A corpus is always partial: We will only find things in
a corpus if they are frequent enough.
 A corpus is necessarily only a sample.
 Rare things are likely to be omitted from a sample.
Chomsky criticizes Corpus Linguistics
• Frequency tells you about the world rather

than about language (the sentence I live in
New York is fundamentally more likely than
I live in Dayton Ohio).
• Corpus research is slow and limited.
• Corpus leaves out what you don’t say, which
can be more informative than what you say.
• Pseudo-techniques.
Reply to Chomsky’s criticism
• Performance is still an inherently valid object of

study. Entire fields of science and research use
exclusively or almost exclusively observational
data: astronomy, archeology, paleontology,
biology, etc.
• Naturally-occurring data can be collected,
studied, analysed, commented and referred to.
Corpus-based observations are more verifiable
than introspectively based statements.
• The finite-infinite distinction is not a big issue, since in
many other fields we also have an infinite
number of possible examples, but it does
not stop us from studying them.
• A big enough corpus (such as the 100-million-word
British National Corpus) will provide a
lot of the utterances one is likely to encounter in
language.
• Frequency lists compiled objectively from

corpora have shown that human intuition
about language is very specific and far from
being a reliable source.
• Word frequency is also a good reason to use
very large and well-balanced corpora.
• Corpora are now collected in extremely
systematic and controlled ways.
• Corpus analysis will never tell you that

an utterance is impossible. But with a
large enough and well balanced corpus
and sufficient statistical tools, it can tell
you when it is statistically significant for
such an utterance to be absent from the
corpus.
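A rough sketch of the reasoning behind the last point, under a strong simplifying assumption: if occurrences of an item are treated as independent events, the number of hits follows a Poisson distribution, and the chance of seeing no hits at all shrinks quickly as the corpus grows. The rate of 0.05 occurrences per million words is hypothetical; the corpus sizes are those of the Brown Corpus and the BNC.

```python
import math

def prob_of_absence(rate_per_million, corpus_size_millions):
    """Poisson probability of zero occurrences, given an expected rate."""
    expected_hits = rate_per_million * corpus_size_millions
    return math.exp(-expected_hits)

rare_rate = 0.05  # hypothetical: one occurrence per 20 million words

p_small = prob_of_absence(rare_rate, 1)    # 1-million-word corpus
p_large = prob_of_absence(rare_rate, 100)  # 100-million-word corpus
print(f"{p_small:.3f}")  # about 0.951: absence is unremarkable
print(f"{p_large:.4f}")  # about 0.0067: absence would be surprising
```

In the small corpus the item's absence says almost nothing; in the large one, absence is unlikely to be due to chance alone if the assumed rate were correct.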
WHY USE CORPORA?
• Authenticity
• Objectivity
• Verifiability
• Exposure to large amounts of data
• New insights into language
• Enhancement of learner motivation
Authenticity
• Key notion in the field of corpus work.

• “One does not study all of botany by
artificial flowers” (Sinclair 1991:24).
Objectivity
• No prior selection of data.

• “I am above all an observer; I quite simply
cannot help making linguistic observations.
In conversations at home and abroad, in
railway compartments, when passing people
in streets and on roads, I am constantly
noticing oddities of pronunciation, forms
and sentence constructions” (Jespersen
1995: 213).
Verifiability
• “Verifiability is a normal requirement in

scientific research, therefore, the science of
language – linguistics -- (which is often
claimed to be the scientific study of
language) should not be exempt from this
standard mode of research procedure”
(Leech 1991:112).
New insights into language
• Many subtle observations.

• Corpora can help learners discover new
meanings of the words they already know.
• New understanding of meaning in Corpus
Linguistics.
Collocation
 Collocation is among the linguistic concepts which have

benefited most from advances in corpus linguistics
 What is collocation?
 strong tea, powerful car (Halliday 1976)
 “collocations of a given word are statements of the habitual or customary
places of that word…the company that words keep” (Firth 1968:181-2)
 “One of the meanings of night is its collocability with dark” (Firth 1957:196)
 “a frequent co-occurrence of two lexical items in the language”
(Greenbaum 1974:82)
 expel a school child vs. cashier an army officer
 “I propose to bring forward as a technical term, meaning by
collocation, and apply the test of collocability” (Firth 1957: 194)
Meaning by collocation
 “There is frequently so high a degree of

interdependence between lexemes which tend to
occur in texts in collocation with one another that their
potentiality for collocation is reasonably described as
being part of their meaning” (Lyons 1977: 613)
 Complete description of the meaning of a word
would have to include the other word or words that
collocate with it
 “You shall know a word by the company it keeps!”
(Firth 1968:179)
 Collocation is part of the word meaning
Two types of collocation
 Coherence collocation vs. neighbourhood

(horizontal) collocation (Scott 1998)
 Coherence collocation
 Collocates associated with a word (e.g. letter – stamp, post
office)
 Neighbourhood collocation
 Words which do actually co-occur with the word (letter - my,
this, a, etc)
Coherence collocation
 “A cover term for the cohesion that results from the co-
occurrence of lexical items that are in some way or
other typically associated with one another, because
they tend to occur in similar environments.” (Halliday &
Hasan 1976:287)
 candle – flame – flicker
 hair – comb – curl – wave
 sky – sunshine – cloud – rain
 Difficult to measure using a statistical formula

Neighbourhood collocation
 Collocation in corpus linguistics

 Structure of collocation – collocation window
 “We may use the term node to refer to an item whose
collocations we are studying, and we may then define a
span as the number of lexical items on each side of a node
that we consider relevant to that node. Items in the
environment set by the span we will call collocates.” (Sinclair
1966:415)
 Casual vs. significant collocation
 Significant collocation: collocation that occurs more
frequently than would be expected (in a statistical sense)
on the basis of the individual items
 n.b. Neighbourhood (horizontal) collocations can
include some coherence collocations
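One way to operationalise "more frequently than would be expected" is to compare the observed number of co-occurrences with the number expected by chance from the individual frequencies. The sketch below uses the mutual information (MI) score, which is only one of several association measures and is not named on the slide; all counts are hypothetical.

```python
import math

def expected_cooccurrences(f_node, f_collocate, corpus_size, span=4):
    """Co-occurrences expected by chance within `span` words
    on each side of the node (window of 2 * span positions)."""
    return f_node * f_collocate * (2 * span) / corpus_size

def mi_score(observed, f_node, f_collocate, corpus_size, span=4):
    """Mutual information: log2 of observed over expected co-occurrences."""
    expected = expected_cooccurrences(f_node, f_collocate, corpus_size, span)
    return math.log2(observed / expected)

# Hypothetical counts in a 1-million-word corpus, span 4:4.
expected = expected_cooccurrences(f_node=1000, f_collocate=500, corpus_size=1_000_000)
score = mi_score(observed=64, f_node=1000, f_collocate=500, corpus_size=1_000_000)
print(expected)  # 4.0 co-occurrences expected by chance
print(score)     # 4.0, i.e. observed 16 times more often than expected
```

A casual collocation would score near zero; where to draw the cut-off for "significant" is a decision for the analyst.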
Collocation is syntagmatic
Langue (language system): paradigmatic axis
Parole (utterance): syntagmatic axis
Concordance lines for “the stroke of”:
     famous boots. On the stroke of full time the
    Stoke the lead on the stroke of half-time with a goal
  Smith sin-binned on the stroke of half-time, added a
clinched their win on the stroke of lunch after resuming
chase by declaring on the stroke of lunch. <p> With a lead
  expectant crowd, on the stroke of midday. The bird
  hour began not upon the stroke of midnight but upon the
 of midnight but upon the stroke of noon. There was,
booked in advance. On the stroke of seven, a gong summons
          Promptly on the stroke of six 'clock, the chooks
    from Edinburgh on the stroke of the Millennium.
…to see the selected collocate
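A key-word-in-context (KWIC) display like the one above can be produced with a few lines of code. This is a minimal sketch assuming whitespace tokenisation; the function name is hypothetical and the two input lines are taken from the concordance above.

```python
def kwic(tokens, phrase, width=4):
    """Yield (left context, node, right context) for each occurrence of `phrase`."""
    n = len(phrase)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == phrase:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + n:i + n + width])
            yield left, " ".join(phrase), right

lines = [
    "expectant crowd, on the stroke of midday. The bird",
    "from Edinburgh on the stroke of the Millennium.",
]
node = ["the", "stroke", "of"]
hits = [hit for line in lines for hit in kwic(line.split(), node)]

# Right-align the left context so that the node forms a vertical column.
for left, key, right in hits:
    print(f"{left:>22}  {key}  {right}")
```

Each line is read horizontally, while recurrent collocates show up when the column to the right of the node is scanned vertically.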
Enhancement of learner motivation
• “Corpus as an information source fits in very well

with the dominant trend in university teaching
philosophy over the past 20 years, which is the
trend from teaching as imparting knowledge to
teaching as mediated learning”(Leech 1997:2).
• There is no longer a gulf between research and
teaching, since the student is placed in a position
similar to that of a researcher, investigating and
imaginatively making sense of the data available
through observation of the corpus.
• McCarthy (1998: 67-68) argues that the

traditional ‘Three Ps’ methodology
Presentation – Practice – Production should
be supplemented by the ‘Three Is’ method:
Illustration – Interaction – Induction.
• Students “discover” language.
• The potential value in foreign language

teaching is considerable for at least 2
reasons:
• The first is the Hawthorne effect – a well-
known principle according to which any new
tool or method tends to stimulate the actors
of a pedagogic act and to improve the
results more than the mere continuance of
trite procedures.
• The second is connected with the Laws of

memory: memory is conditioned by an
active cognition of the past.
• Recognizing and recalling a word are in the
long run much easier if the mind, at the very
moment of the input, has actively associated
the fragment with circumstances of that
input.
Exposure to huge amounts of data
• Nurtures a “feel of language”, develops an

understanding of what is natural in a
language.
• The computer is “a tireless native-speaker
informant, with rather greater potential
knowledge of the language than the average
native speaker” (Barnbrook 1996: 140).
Hazards and disadvantages
of using corpora
• A corpus is not an infallible source of all
linguistic information about language.
• Overdependence and overreliance upon
corpora can be an inhibiting dogma.
• An attempt to replace a laborious hands-on
analysis by a rapid automatic processing.
CORPUS CREATION
• The issues in corpus design and compilation are
directly related to the validity and reliability of
the research based on a particular corpus
(Kennedy 1998: 60).
• Sinclair (1991: 13) claimed that “the decisions
that are taken about what is to be in the
corpus, and how the selection is to be
organized, control almost everything that
happens subsequently. The results are only as
good as the corpus”.
Corpus creation
• Getting permissions
• Discussion and research points.
• Research the copyright laws of Lithuania and find
out what restrictions govern the production of an
electronic copy of copyrighted material for
research purposes. Contact one or more publishers
to find out about their policy and practice in
assisting researchers to build corpora.
• Further reading
• McEnery et al. 2006: 77-79
Corpus creation
• The design of a corpus is dependent upon

the type of a corpus and purpose for which
the corpus is to be used.
• Types of corpora (sample, monitor, general,
spoken, written, learner, translation, parallel,
comparable, etc).
SAMPLE CORPORA
• A sample corpus is a static collection of texts

(samples of texts) selected according to
some strict criteria and intended to be
typical of the whole language or an aspect of
the language at a particular period of time.
• The Brown and LOB corpora each consist of a large
number (500) of short extracts (2000 words each,
i.e. about 1 million words in all),
randomly selected from within 15 genres of
printed texts.
MONITOR CORPORA
• Monitor corpora are text corpora that

represent a dynamic, changing picture of a
language. Such a dynamic collection of texts
is constantly growing and changing with the
addition of new text samples.
GENERAL CORPORA
• They are assembled to serve as a reference
base for unspecified linguistic research
(Kennedy 1998:19).
• The size of a corpus: as a general rule, the
bigger a corpus is the richer and more
interesting the output from a concordancing
program will be, and the more likely to
represent accurately features of the
language.
General corpora: size
• A collection of machine-readable text does not

make a corpus.
• All very large collections of texts have been in
the medium of written language.
• “Technology advances quickly, while human
institutions evolve slowly” (Leech 1991: 11).
• “Hardware technology advances by leaps and
bounds, software technology lags like a
crawling snail behind it” (Leech 1991: 12).
General corpora:
Spoken and written language
• The spoken form of the language is a better
guide to the fundamental organization of the
language than the written form.
• Spoken language is primary and all the changes
start there.
• Spoken language is not that well researched.
• Spoken language can also prove valuable for the
studies of differences between speech and
writing.
Guidelines for compilers
of general corpora:
• Texts should be authentic.

• Use contemporary texts.
• Beware of dialects.
• Stick to prose.
• Include highly technical material only in very
small doses.
Corpora and learner language
• Learner corpora are defined as electronic

collections of authentic texts produced by foreign
or second language learners (Granger 2003).
• The first computerised learner corpora were
collected in the 1990s when several learner
corpora projects were launched: the Longman
Learners’ Corpus, the Cambridge Learner Corpus,
the Hong Kong University Learner Corpus and the
International Corpus of Learner English (ICLE).
Learner corpora
• The Longman Learners’ Corpus contains ten

million words of text written by learners of
English of different levels of proficiency and
from twenty different L1 backgrounds.
• The Cambridge Learner Corpus is a large
collection of written texts from learners of
English all over the world.
ICLE
• The International Corpus of Learner English

(ICLE) is the best-known learner corpus
which provides a collection of essays written
by advanced learners of English (third and
fourth year university students) from
different native language backgrounds.
• The project was launched in 1990 by S. Granger
at the University of Louvain in Belgium.
ICLE
• ICLE (Version 2) contains 3.7 million words of
EFL writing from learners representing
16 mother tongue backgrounds (Bulgarian,
Chinese, Czech, Dutch, Finnish, French,
German, Italian, Japanese, Norwegian,
Polish, Russian, Spanish, Swedish, Turkish
and Tswana).
ICLE
• The main aim of the project was to collect a corpus of
objective data for the description of learner language.
• The primary goal of ICLE was to investigate the
interlanguage of the foreign language learner.
• The research goals of the ICLE project were twofold. On the
one hand, the project sought to collect reliable data on
learners’ errors and to compare them cross-linguistically in
order to decide whether they are universal or language
specific. On the other hand, ICLE aimed to research aspects
of foreign-soundingness in non-native essays which are
revealed through the overuse or underuse of words or
structures with respect to the target language norm.
Spoken learner corpora
• NICT JLE (Japanese Learner English) Corpus

(Izumi et al. 2004)
• The Giessen-Long Beach Chaplin Corpus
(Muller 2005)
• The PAROLE corpus (Hilton et al. 2004)
• The Louvain International Database of
Spoken English Interlanguage (LINDSEI,
Gilquin et al. 2010)
Learner corpora and Second
Language Acquisition
• Language acquisition is a mental process,
which we can observe only through its
product, i.e. the data the learner produces.
• Learner corpora can provide a wider
empirical basis on which many hypotheses
can be tested and the principles that govern
the process of learning a foreign language
uncovered.
Learner corpora and
language teaching
• The introduction of corpora in the classroom
might mean a tough job of changing attitudes of
teachers and learners.
• Educating teachers and spreading the word
about corpora.
• Using corpora in the classroom changes the
student’s role.
• “The distinction between teaching and research
becomes blurred and irrelevant” (Knowles 1990).
Corpora in Translation Studies
• The use of corpora in translation studies is

relatively new - it was first advocated by Mona
Baker in 1993.
• Linguists viewed translations with suspicion,
assumed them to be ontologically different from
non-translated texts and referred to them as
‘interlanguage’ (Selinker 1972), ‘third language’
(Duff 1981), ‘third code’ (Frawley 1984), or
‘translationese’ (e.g. Gellerstam 1986, Doherty
1998, Mauranen 1999, Tirkkonen-Condit 2002).
Parallel corpora
• A parallel corpus is a corpus composed of

source texts and their translations in one or
more different languages; parallel corpora
can be aligned at a word, phrase or sentence
level thus establishing correspondences
between units of bilingual or multilingual
texts.
Parallel corpora
• Parallel corpora are important resources for translation
studies. As Aijmer and Altenberg (1996:12) noted, they
can provide new insights into the languages compared,
insights that cannot be obtained in studies of monolingual
corpora; they can be used for different
comparative purposes and enhance our understanding
of language-specific, typological and cultural differences
as well as universal features; they can highlight
differences between source texts and translations; and they
can be used for a number of practical applications in
translation teaching.
Parallel corpora
• Aligned parallel corpora can provide illuminating insights
into the nature of translation and help to devise
tools to aid translation; probabilistic machine translation
systems can also be trained on such corpora. Parallel corpora
can be unidirectional (e.g. from English into Lithuanian
or from Lithuanian into English), bidirectional (e.g.
containing both English source texts with their Lithuanian
translations and Lithuanian source texts with
their English translations), or multidirectional (e.g. the
same text with its English, German, French, Russian,
Spanish, Italian, etc. versions).
Comparable corpora
• Comparable corpora consist of comparable original texts in
two or more languages; they are monolingual
corpora designed using the same sampling
techniques, e.g. the Aarhus corpus of contract law
(McEnery 2006: 47).
• A monolingual comparable corpus is particularly useful
in studying intrinsic features of translations and in
improving the translator’s understanding of the
subject domain, terminology and idiomatic
expressions in the specific field.
Translation research
• Corpus-based translation studies focuses on
both the process and the product of
translation and contributes to the debates
going on in the discipline.
• One of the most important debates in
intellectual domains is connected with
research of the universals of translation.
Universals of translation
• Baker (1993, 1996) argues that these

features are characteristic of any translated
text and they do not vary across cultures,
unlike norms of translation, which are
considered to be social, cultural and
historical.
Universals of translation
• The translation universals are represented

by explicitation, i.e. translations tend to be
more explicit on different levels than the
originals, simplification – when the content
or form is simplified compared with non-
translated texts and normalization, i.e. the
language used in translations is more
conventional and normalized than that of
the originals (Olohan 2006:37).
Explicitation
• Baker refers to explicitation as “an overall

tendency to spell things out rather than
leave them implicit in translation” (Baker
1996:180)
• Translations tend to be longer than their
source texts. This can be tested by using
parallel corpora, comparing lengths of texts
and text segments and analyzing the
differences.
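The length comparison can be sketched as follows, assuming a sentence-aligned parallel corpus stored as (source, translation) pairs and whitespace-delimited words as the unit of length. The two English-German pairs are invented toy data, not corpus findings, and the function names are hypothetical.

```python
def segment_ratios(pairs):
    """Translation length divided by source length (in words), per aligned segment."""
    return [len(target.split()) / len(source.split()) for source, target in pairs]

def overall_ratio(pairs):
    """Total translation words divided by total source words."""
    source_words = sum(len(source.split()) for source, _ in pairs)
    target_words = sum(len(target.split()) for _, target in pairs)
    return target_words / source_words

# Invented sentence-aligned pairs (English source, German translation).
aligned = [
    ("He left.", "Er ist weggegangen."),
    ("It rained all day.", "Es hat den ganzen Tag geregnet."),
]

per_segment = segment_ratios(aligned)
overall = overall_ratio(aligned)
print(per_segment)  # [1.5, 1.5]
print(overall)      # 1.5
```

A ratio above 1 means the translation is longer, but word counts also reflect structural differences between languages (articles, compounding, inflection), so length alone is weak evidence of explicitation.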
Explicitation
• Syntactic and lexical explicitation can be

investigated by using comparable corpora and
looking into the frequency of explanatory
vocabulary and conjunctions, e.g. cause,
reason, due to, lead to, because, therefore,
consequently (Baker 1996:181) in order to find
out whether they were more frequently used in
translations to make the relations between
propositions more explicit.
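A comparison of this kind can be sketched by normalising marker frequencies per 1,000 words, so that subcorpora of different sizes are comparable. The sketch covers only the single-word items from the list above (multi-word items such as due to and lead to would need phrase matching), and both toy texts are invented for illustration.

```python
# Single-word explicitation markers from the list above.
MARKERS = {"cause", "reason", "because", "therefore", "consequently"}

def per_thousand(tokens, markers=MARKERS):
    """Frequency of marker words per 1,000 running words."""
    hits = sum(1 for token in tokens if token in markers)
    return 1000 * hits / len(tokens)

# Invented stand-ins for a translated and a non-translated subcorpus.
translated = "he stayed home because it rained and therefore missed it".split()
non_translated = "it rained so he stayed at home and missed the meeting".split()

freq_translated = per_thousand(translated)
freq_non_translated = per_thousand(non_translated)
print(freq_translated, freq_non_translated)
```

With real comparable corpora, a consistently higher normalised frequency in the translated texts would be in line with the explicitation hypothesis.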
Normalization or conventionalization
• Baker (1996:176-7) defines normalization as

“tendency to conform to patterns and practices
which are typical of the target language, even
to the point of exaggerating them”.
• The discussion of normalization focuses on
typical collocational patterns, clichés,
grammatical structures and punctuation. The
issue of what is typical in a language can be
best answered if based on corpora.
Simplification
• Simplification is reflected in various

strategies such as the breaking up of long
sentences, omissions of redundant or
repeated information, shortening of complex
collocations, etc., which are aimed at
adhering to target language norms and
conventions.
Discussion and research points
• Read Chapter 7 ‘Features of translation’ in

M. Olohan. 2006. Introducing Corpora in
Translation Studies. Routledge.
• Discuss the universal features of
translations, focusing on the findings of the
studies and examples given. Provide your
own examples.
Corpora in translator training
• Corpora may be integrated into translator

training and may meet various needs of
translator trainers.
• Parallel corpora are especially useful as they
can be used to retrieve terminology, explore
collocations, phrasal patterns, lexical
polysemy, translation of collocations and
idioms, etc. (Botley et al. 2000).
• The students can also be encouraged to

compile their own specific corpora that can
be very useful for content information,
terminology, phraseology in some specific
domains or topics.
• A corpus compilation experiment can be
carried out as a real-life translation
assignment.
• Comparable corpora can also be helpful in

translator training as they can be used to
check terminology and collocates, identify
text-type-specific formulations, validate
intuitions and provide explanations for
appropriateness of certain solutions to
problems (Pearson 2003).
Corpora in translation practice
• Corpora can be very useful in the translator’s
profession: specialized corpora can be used to
familiarize translators with concepts and terms
from a specific domain; translators can study
corpus output to understand text-type
conventions; and literary translators can resort to
corpus data to study an author’s style, to find
literary devices, etc.
• Carry out a real-life translation assignment.

• Choose a specific topic and try to foresee
difficulties you may face.
• Identify text types to include and search
strategies for finding them online.
• Find text online.
CORPORA AND LANGUAGE RESEARCH.
UNDERSTANDING OF MEANING IN CORPUS
LINGUISTICS
• How does the language create meaning?

• What are the means by which language
creates meaning?
• Where do we look for meaning?
• How can Corpus linguistics contribute to the
understanding of language?
Corpus Linguistics and
the understanding of meaning
• The role of context is crucial:

it disambiguates.
• Meaning is the product of context.
Sinclair’s understanding of meaning
• The methodological steps proposed by Sinclair to
identify what he calls the “extended unit of meaning” are:
• identify collocational profile (lexical realizations)
• identify colligational patterns (lexico-grammatical
realizations)
• consider common semantic field (semantic
preference)
• consider pragmatic realisations (semantic prosody)
Extended unit of meaning
• Collocation is the co-occurrence of words with
no more than four intervening words.
• Colligation is the co-occurrence of
grammatical phenomena, and on the
syntagmatic axis our descriptive techniques
at present confine us to the co-occurrence of
a member of a grammatical class – say a
word class – with a word or phrase.
• Semantic preference is the restriction of
regular co-occurrence to items which share a
semantic feature, for example that they are
all about, say, sport or suffering. Semantic
preference is the semantic field to which a
word’s collocates predominantly belong.
• Semantic prosody is attitudinal, and on the pragmatic side of the
semantics/pragmatics continuum. Semantic prosody describes the
way in which certain seemingly neutral words can be perceived
with positive or negative associations through frequent
co-occurrence with particular collocates. Thus, such words as set in
(rot, decay, ill-will, decadence, infection, prejudice, etc.), cause
(cancer, crisis, accident, delay, death, damage, trouble, etc.),
commit (crime, offences, foul, etc.) and rife (crime, diseases, misery,
corruption, speculation, etc.) often have negative semantic
prosody, while such words as impressive, which occur with lexical
items such as dignity, talent, gains, achievement, etc., have
positive prosody.
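A collocational profile of the kind described above can be sketched in a few lines of Python. This is only an illustrative sketch, not Sinclair's own procedure: the function name collocates, the parameter max_intervening and the toy sentence are assumptions made for the example, and tokenization is reduced to splitting on whitespace.

```python
from collections import Counter

def collocates(tokens, node, max_intervening=4):
    """Count words that co-occur with `node` with at most
    `max_intervening` words between them (hypothetical helper)."""
    reach = max_intervening + 1  # how many positions to look left and right
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        lo = max(0, i - reach)
        hi = min(len(tokens), i + reach + 1)
        for j in range(lo, hi):
            if j != i:  # skip the node occurrence itself
                counts[tokens[j]] += 1
    return counts

# Toy data, invented for illustration only.
text = "the rot set in early and decay set in soon after the first frost"
tokens = text.lower().split()
profile = collocates(tokens, "set")
print(profile.most_common(5))
```

On real corpus data the raw counts would normally be compared with the overall frequency of each collocate before drawing conclusions about semantic preference or prosody.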
• Study the article “Corpus Classroom Currency” by E.
Tognini Bonelli (2000: 205-243).
• Study the example of the analysis of the phrase the
naked eye (Sinclair, J. 1996. The Search for Units of
Meaning. Textus, vol. IX, no. 1, pp. 75-106).
• Study the example of the analysis of the word budge
(Sinclair, J. 1997. The Lexical Item. In Contrastive
Lexical Semantics. Weigand, E. (ed.).
Amsterdam/Philadelphia: J. Benjamins. 1-25).
Research points: mini-projects
• In groups of 2-6, choose a group of
synonymous words and carry out a research
project using J. Sinclair’s understanding of
meaning.
Corpora in teaching and learning
• Investigate variation in the verb form used
with collective nouns: aristocracy, army,
audience, cast, committee, community,
company, council, crew, data, family,
government, group, jury, media, navy,
nobility, opposition, press, public, staff,
team.
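One rough way to approach the collective-noun task is to count which verb forms directly follow the noun. The sketch below is only a first approximation under stated assumptions: the verb lists, the function name agreement_counts and the sample sentences are invented for illustration, and only the word immediately after the noun is checked, so intervening adverbs or relative clauses are missed.

```python
import re
from collections import Counter

# Small, deliberately incomplete verb lists (assumption for the example).
SINGULAR = {"is", "was", "has"}
PLURAL = {"are", "were", "have"}

def agreement_counts(text, noun):
    """Count singular vs. plural verb forms immediately after `noun`."""
    pattern = re.compile(r"\b" + re.escape(noun) + r"\s+(\w+)", re.IGNORECASE)
    counts = Counter()
    for match in pattern.finditer(text):
        verb = match.group(1).lower()
        if verb in SINGULAR:
            counts["singular"] += 1
        elif verb in PLURAL:
            counts["plural"] += 1
    return counts

# Invented sample text, standing in for concordance lines from a corpus.
sample = ("The committee has decided. The committee were divided. "
          "The committee is meeting. The committee members left.")
counts = agreement_counts(sample, "committee")
print(counts)
```

In practice the same counts would be gathered from BNC concordance lines and checked by hand, since a simple pattern like this cannot tell a main verb from other words that happen to follow the noun.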
• Conventional collective noun phrases. Using
the BNC, complete the following:
• Words: bouquet, brood, bunch, bundle,
chain, clump, cluster, colony, covey, drove,
flight, flock, gang, gaggle, group, heap, herd,
litter, nest, pack, pair, pile, range, series,
shoal, school, suit, swarm.
Countable v. Uncountable nouns
Definite v. Zero Article
• There are in English a number of
countable/uncountable pairs of words. Study
the words language, society and literature, and try
to work out the difference in meaning
between the noun as countable and as
uncountable.
• Phrasal Verbs
• Using data from the BNC, choose a group of phrasal
verbs:
 back away/down/off/out/up
 break away/down/in/into/off/out/through/up/with
 put about/across/around/away/down/forward/off/on/out/through/together/up
 set about/apart/aside/back/down/forth/in/off/on/to/up
 step aside/back/down/in/on/up
• Prepositions
• Study the concordances of above and over
and work out the similarities and differences
between them.
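A concordance of the kind used in this task lists every occurrence of a node word with a few words of context on each side (key word in context, KWIC). A minimal sketch follows; the function name kwic, the width parameter and the sample sentence are assumptions for illustration and do not reproduce the output format of any particular BNC interface.

```python
def kwic(tokens, node, width=3):
    """Return simple key-word-in-context lines for `node`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines

# Invented sample sentence standing in for corpus text.
tokens = "the plane flew over the town and the bird hovered above the roof".split()
over_lines = kwic(tokens, "over", width=2)
above_lines = kwic(tokens, "above", width=2)
for line in over_lines + above_lines:
    print(line)
```

Sorting such lines by the word to the left or right of the node makes recurrent patterns around above and over easier to compare.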
Collocations
• The term was first used by Firth (1957).
• “Collocations of a given word are statements
of the habitual or customary places of that
word” (Firth 1968: 181).
• Quantitative approach to collocations.
Collocations
• “Collocations are not absolute or deterministic,
but are probabilistic events, resulting from
repeated combinations used and encountered
by the speakers of any language” (O’Keeffe et al.
2007: 59).
• Sinclair (1991) argues that there are two
fundamental principles at work in the creation
of meaning: the ‘idiom principle’ and the ‘open
choice principle’.
Collocations
• Biber et al. (1999) refer to lexical bundles as
recurrent strings of words, delimited by
establishing frequency cut-off points, for
example, that a string must occur at least 10
times per million words of text and must be
distributed over a number of different texts.
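The two cut-off criteria for lexical bundles (a minimum normalized frequency and a minimum spread across texts) can be sketched as follows. This is an illustrative sketch under assumptions: the function name lexical_bundles, the bundle length n=4 and the default min_texts=5 are not taken from the source, which specifies only the example threshold of 10 occurrences per million words and "a number of different texts".

```python
from collections import Counter, defaultdict

def lexical_bundles(texts, n=4, min_per_million=10, min_texts=5):
    """Return n-word strings meeting a frequency cut-off (per million
    words) and a dispersion cut-off (number of different texts)."""
    freq = Counter()
    spread = defaultdict(set)  # n-gram -> ids of texts it occurs in
    total_words = 0
    for text_id, text in enumerate(texts):
        tokens = text.lower().split()
        total_words += len(tokens)
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            freq[gram] += 1
            spread[gram].add(text_id)
    bundles = {}
    for gram, count in freq.items():
        per_million = count * 1_000_000 / total_words
        if per_million >= min_per_million and len(spread[gram]) >= min_texts:
            bundles[" ".join(gram)] = per_million
    return bundles

# Toy texts, invented for illustration; a real study needs a large corpus.
texts = [
    "i do not know what you mean",
    "i do not know what to say",
    "well i do not know",
]
bundles = lexical_bundles(texts, n=4, min_texts=3)
print(bundles)
```

With a toy sample this small every string exceeds 10 per million, so only the dispersion criterion does any filtering here; on a corpus of millions of words both thresholds become meaningful.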
Collocations
• Research points:
• Use BNCWeb to analyse the collocations of
the words of your choice.
• Further reading:
• McEnery et al. 2006: 80-85, 52-58, 208-226.
Idiomaticity
• Different terminology: ‘lexical phrases’
(Nattinger and DeCarrico 1992), ‘prefabricated
patterns’ (Hakuta 1974), ‘routine formulae’
(Coulmas 1979), ‘formulaic sequences’ (Wray
2002; Schmitt 2004), ‘lexicalized stems’ (Pawley
and Syder 1983), ‘chunks’ (De Cock 2000) as
well as the more conventionally understood
labels such as ‘(restricted) collocations’, ‘fixed
expressions’, ‘multi-word units/expressions’,
‘idioms’, etc.
Idioms
• “Strings of more than one word whose
syntactic, lexical and phonological form is to
a greater or lesser degree fixed and whose
semantics and pragmatic functions are
opaque and specialised, also to a greater or
lesser degree” (O’Keeffe et al. 2007: 80).
• ‘Idiom-prone’ words: body parts, money,
light, colour and other basic notions.
Idioms
• ‘Paradox’ of idiomaticity: the very thing which for
native speakers promotes ease of processing and
fluent production seems to present non-native users
with an insurmountable obstacle.
• Idioms are difficult to get right.
• Idioms can sound strange on the lips of non-native
users.
• Idioms do not just ‘pop up’ in native speech; rather
they occur as part of a more extended phenomenon
that generates subtle webs of semantic, pragmatic and
discourse prosodies.
Idioms
• Research points:
• Use the BNC and the Corpus of
Contemporary Lithuanian to analyse idioms
contrastively.
Lexical difficulties
Use the BNC to study the differences between the following pairs of words:
• Adverse, averse
• Acute, chronic
• Among, amid
• Amoral, immoral
• Between, among
• Biannual, biennial
• Bimonthly, biweekly
• Broach, brooch
• Cement, concrete
• Cession, session
• Compare to, compare with
• Complement, compliment
• Continual, continuous
• Convince, persuade
• Creole, pidgin
• Definite, definitive
• Different from, to, than
• Disinterested, uninterested
• Disposal, disposition
• Distinct, distinctive
• Each other, one another
• Economic, economical
• Elicit, illicit
• Fewer, less
• Flammable, inflammable
• Ingenious, ingenuous
• Lay, lie
False friends
Use the BNC and dictionaries to study the following:
• Actual (topical, current)
• Alley (avenue)
• Costume (suit)
• Fabric (factory)
• Faction (fraction)
• Fantasy (imagination)
• Formula (form)
• Fraction (decimal fraction)
• Human (humane)
• Isolate (insulate)
• Marmalade (jam)
• Massive (solid)
• Novel (novella)
• Pathetic (emotional)
• Patron (cartridge)
• Physician (physicist)
• Preservative (condom)
• Programme (TV channel)
• Public (audience)
• Receipt (recipe)
• Smoking (tuxedo, dinner jacket)
• Theme (topic, subject)
• to conserve (to preserve)
• to control (to check, monitor)
• to realise (to implement)
• to send (to broadcast)
• to dislocate
Synonyms
Use the BNC to study the following:
• Ambivalent, ambiguous
• Abdicate, abrogate, abjure, adjure, arrogate, derogate
• Allay, alleviate, assuage, relieve
• Arbitrate, mediate
• Assume, presume
• Avenge, revenge
• Barbaric, barbarous
• Between, among
• Born, borne
• Contrary, converse, opposite, reverse
• Empathy, sympathy, compassion, pity, commiseration
• Fickle, flexible
• Factious, factitious, fractious
• Healthy, healthful, salutary
• Imply, infer, insinuate
• Sparing, frugal, thrifty, economical
• Concise, terse, succinct, laconic, pithy
• Conclusive, decisive, determinative, definitive
• Dominant, predominant, paramount, preponderant
• Doubtful, dubious, problematic, questionable
• Effective, effectual, efficient, efficacious
• Apparent, illusionary, seeming, ostensible

