
What is a Corpus?

The word "corpus", derived from the Latin word meaning "body", may be used to refer to
any text in written or spoken form. However, in modern Linguistics this term is used to
refer to large collections of texts which represent a sample of a particular variety or use of
language(s) that are presented in machine readable form. Other definitions, broader or
stricter, exist. See, for example, the definition in the book "Corpus Linguistics" by Tony
McEnery and Andrew Wilson or read more about different kinds of corpora in the
Systematic Dictionary of Corpus Linguistics.
Computer-readable corpora can consist of raw text only, i.e. plain text with no additional
information. Many corpora have been provided with some kind of linguistic information,
here called mark-up or annotation.

Types of corpora
There are many different kinds of corpora. They can contain written or spoken
(transcribed) language, modern or old texts, texts from one language or several
languages. The texts can be whole books, newspapers, journals, speeches etc, or consist
of extracts of varying length. The kind of texts included and the combination of different
texts vary between different corpora and corpus types.
'General corpora' consist of general texts, texts that do not belong to a single text type,
subject field, or register. An example of a general corpus is the British National Corpus.
Some corpora contain texts that are sampled (chosen from) a particular variety of a
language, for example, from a particular dialect or from a particular subject area. These
corpora are sometimes called 'Sublanguage Corpora'.
Corpora can consist of texts in one language (or language variety) only or of texts in
more than one language. If the texts are the same in all languages, i.e. translations, the
corpus is called a Parallel Corpus. A Comparable Corpus is a collection of "similar" texts
in more than one language or variety.

What is Corpus Linguistics?


Corpus Linguistics is now seen as the study of linguistic phenomena through large
collections of machine-readable texts: corpora. These are used within a number of
research areas, ranging from the descriptive study of the syntax of a language to prosody
or language learning, to mention but a few. An overview of some of the areas where
corpora have been used can be found on the Research areas page.
The use of real texts in the study of language is not new in the history of linguistics.
However, Corpus Linguistics has developed considerably in recent decades due to the
great possibilities offered by the processing of natural language with computers. The
availability of computers and machine-readable text has made it possible to obtain data
quickly and easily and to have this data presented in a format suitable for analysis.
Corpus linguistics is, however, not simply a matter of obtaining language data through
the use of computers. Corpus linguistics is the study and analysis of data obtained from a
corpus. The main task of the corpus linguist is not to find the data but to analyse it.
Computers are useful, and sometimes indispensable, tools in this process.

Learn more
If you want to learn more about corpora and corpus linguistics you can use the links
below. On the Background page you can follow the development of corpus linguistics
through presentations of some central corpora/kinds of corpora. On the Working with
Corpora page you will find information about things to think about when you want to use
corpora for language learning or research. Use the Tutorial to learn about how to make
corpus searches and analyse the result or go straight to the Search Engine to make online
searches in a number of corpora.

Background
The use of collections of text in language study is not a new idea. In the Middle Ages
work began on making lists of all the words in a particular text, together with their
contexts - what we today call concordancing. Other scholars counted word frequencies
from single texts or from collections of texts and produced lists of the most frequent
words. Areas where corpora were used include language acquisition, syntax, semantics,
and comparative linguistics, among others. Even if the term 'corpus linguistics' was not
used, much of the work was similar to the kind of corpus-based research we do today,
with one great exception - they did not use computers.
You can learn more about early corpus linguistics HERE (external link). We will move
on to look at some important stages in the development of corpus linguistics by focusing
on some central corpora. The presentation below is not an extensive account of all
corpora or every stage, but merely meant to help you get familiar with some key corpora
and concepts.

The first generation


Today, corpus linguistics is closely connected to the use of computers; so closely,
actually, that the term 'Corpus Linguistics' for many scholars today means 'the use of
collections of COMPUTER-READABLE text for language study'.

The Brown Corpus - worthy of imitation


The first modern, electronically readable corpus was the Brown Corpus of Standard
American English. The corpus consists of one million words of American English texts
printed in 1961. To make the corpus a good standard reference, the texts were sampled in
different proportions from 15 different text categories: Press (reportage, editorial,
reviews), Skills and Hobbies, Religious, Learned/scientific, Fiction (various
subcategories), etc.
Today, this corpus is considered small and slightly dated. The corpus is, however, still
used. Much of its usefulness lies in the fact that the Brown corpus layout has been
copied by other corpus compilers. The LOB (Lancaster-Oslo-Bergen) corpus of British
English and the Kolhapur Corpus of Indian English are two examples of corpora made
to match the Brown corpus. They both consist of 1 million words of written language
(500 texts of 2,000 words each) sampled in the same 15 categories as the Brown Corpus.
The availability of corpora which are so similar in structure is a valuable resource for, for
example, researchers interested in comparing different language varieties.
For a long time, the Brown and LOB corpora were the only easily available computer
readable corpora. Much research within the field of corpus linguistics has therefore been
based on these corpora.

The London-Lund Corpus of Spoken British English


Another important "small" corpus is the London-Lund Corpus of Spoken British
English (LLC). The corpus was the first computer-readable corpus of spoken language,
and it consists of 100 spoken texts of approximately 5,000 words each. The texts are
classified into different categories, such as spontaneous conversation, spontaneous
commentary, spontaneous and prepared oration, etc. The texts are orthographically
transcribed and have been provided with detailed prosodic marking.

Big is beautiful?

BoE and BNC


The first generation corpora, of 500,000 and 1 million words, proved to be very useful in
many ways and have been used for a number of research tasks (links to be added here). It
soon turned out, however, that for certain tasks, larger collections of text were needed.
Dictionary makers, for example, wanted large, up-to-date collections of text where it
would be possible to find not only rare words but also new words entering the language.
In 1980, COBUILD started to collect a corpus of texts on computer for dictionary making
and language study (learn more here ). The compilers of the Collins Cobuild English
Language Dictionary (1987) had daily access to a corpus of approximately 20 million
words. New texts were added to the corpus, and in 1991 it was launched as the Bank of
English (BoE). More and more data has been added to the BoE, and the latest release
(1996) contains some 320 million words! New material is constantly added to the corpus
to make it "reflect[s] the mainstream of current English today". A corpus of this kind,
which by the new additions 'monitors' changes in the language, is called a monitor
corpus. Some people prefer not to use the term corpus for text collections that are not
finite but constantly changing/growing.

In 1995 another large corpus was released; the British National Corpus (BNC). This
corpus consists of some 100 million words. Like the BoE it contains both written and
spoken material, but unlike the BoE, the BNC is finite - no more texts are added to it after
its completion. The BNC texts were selected according to carefully pre-defined selection
criteria with targets set for the amount of text to be included from different text types
(learn more HERE). The texts have been encoded with mark-up providing information
about the texts, authors and speakers.

Specialized corpora

Historical corpora
The use of collections of text in the study of language is, as we have seen, not a new
invention. Among those involved in historical linguistics were some who soon saw the
potential usefulness of computerised historical corpora. A diachronic corpus with English
texts from different periods was compiled at the University of Helsinki. The Helsinki
Corpus of English Texts contains texts from the Old, Middle and Early Modern English
periods, 1.5 million words in total.
Another historical corpus is the recently released Lampeter Corpus of Early Modern
English Tracts. This collection consists of "[P]amphlets and tracts published in the
century between 1640 and 1740" from six different domains. The Lampeter Corpus can
be seen as one example of a corpus covering a more specialized area.

Corpora for Special Purposes


The corpora described above are general collections of text, collected to be used for
research in various fields. There is a large, and growing, number of highly specialized
corpora created for special purposes. Many of these are used for work on spoken
language systems. Examples include the Air Traffic Control Corpus (ATC0), created to
be used "in the area of robust speech recognition in domains similar to air traffic
control", and the TRAINS Spoken Dialogue Corpus, collected as part of a project set up
to create "a conversationally proficient planning assistant" (railroad freight system).
A number of highly specialized corpora are held at the Centre for Spoken Language
Understanding, CSLU, in Oregon. These corpora are specialized in a different way to the
ones mentioned above. They are not restricted to use within a particular subject field,
but are called specialized because of their content. Many of the corpora/databases consist
of recordings of people asked to perform a particular task over the telephone, such as
saying and spelling their name or repeating certain words/phrases/numbers/letters (read
more HERE).

International/multilingual Corpora
As we have seen above, there is a great variety of corpora in English. So far much corpus
work has indeed concerned the English language, for various reasons. There are,
however, a growing number of corpora available in other languages as well. Some of
them are monolingual corpora - collections of text from one language. Here the Oslo
Corpus of Bosnian text and the Contemporary Portuguese Corpus can be mentioned as
two examples.
A number of multilingual corpora also exist. Many of these are parallel corpora; corpora
with the same text in several languages. These corpora are often used in the field of
Machine Translation. The English-Norwegian Parallel Corpus is one example, the
English Turkish Aligned Parallel Corpora another.
The Linguistic Data Consortium (LDC) holds a collection of telephone conversations in
various languages: CALLFRIEND and CALLHOME.

Other
The increased availability and use of the Internet have made it possible to find great
amounts of texts readily available in electronic format. Apart from all the web-pages
containing information of different kinds, it is also possible to find whole collections of
text. Among these collections can be mentioned all the on-line newspapers and journals
(example), and sites where whole books can be found on-line (example). Yet other
examples include dictionaries and word-lists of various kinds.
Although these collections may not be considered corpora for one reason or another (see
definition of corpus), they can be analysed with corpus linguistic tools and methods. This
is an area which has not yet been explored in detail, although some attempts have been
made at using the Internet as one big corpus.
Further information about collections of text available on the Internet can be found on the
Related Sites page.

Ongoing projects

ICE: the International Corpus of English


In twenty centres around the world, compilers are busy collecting material for the ICE
corpora. Each ICE corpus will consist of 1 million words (written and spoken) of a
national variety of English. The first ICE corpus to be completed is the British
component, ICE-GB. On its own, each ICE corpus will be a small but valuable
resource to exploit in order to learn about different varieties of English. Taken together,
the 20 corpora will be useful for variational studies of various kinds. You can learn more
about the ICE project at the ICE-GB site.

ICLE: the International Corpus of Learner English


Like ICE (see above) ICLE is an international project involving several countries. Unlike
ICE, however, the ICLE corpora do not consist of native speaker language. Instead they
are corpora of English language produced by learners in the different countries. This will
constitute a valuable resource for research on second language acquisition.
You can read about some of the areas where the ICLE corpora are used HERE (external
link) or in the book Learner English on Computer.

Others
The number and diversity of corpus-related research projects and groups are great. Below
is a small sample to give you an understanding of the scope and variety. You can find
more information by following the links on the Related Sites page.
AMALGAM Automatic Mapping Among Lexico-Grammatical Annotation
"an attempt to create a set of mapping algorithms to map between the main
tagsets and phrase structure grammar schemes used in various research corpora"
(home page)
The Canterbury Tales Project
"aims to make available ... full transcripts of the ... Canterbury Tales" (home
page).
CSLU: The Center for Spoken Language Understanding
"a multidisciplinary center for research in the area of spoken language
understanding" (home page).
ETAP : Creating and annotating a parallel corpus for the recognition of
translation equivalents
This project, run at the University of Uppsala, Sweden, aims to develop a
computerized multilingual corpus based on Swedish source text with translations
into Dutch, English, Finnish, French, German, Italian and Spanish. (home page)
TELRI
TELRI is an initiative, funded by the European Commission, meant to facilitate
work in the field of Natural Language Processing (NLP) by, among other things,
supplying various language resources. Read more on the home page.

What next?
Interest in computerised corpora and corpus linguistics is growing. More and more
universities offer courses in corpus linguistics and/or use corpora in their teaching and
research. The number and diversity of corpora being compiled are great, and corpora are
used in many projects. It is not possible to go into detail and present all the corpora, all
the courses and all the projects here. This has been meant as a brief introduction. More
information can be found by browsing the net and reading journals and books. The
electronic mailing list Corpora can be a good starting point for someone who wishes to
learn about what goes on within the field of corpus linguistics at the moment.

Using corpora

To be able to use corpora successfully in linguistic study or research, there are some areas
that you may want to look into.
The corpus
The kind of corpus you use and the kind of texts included in it are factors
that affect what results you get. Read more about
o choosing a corpus and
o knowing your corpus
The tools
There are a number of tools (computer programs) available to use with
corpora. The basic functions are usually to search the corpus and display
the hits, with different options for what kinds of searches are possible to
make and of how the hits can be displayed. For a presentation of what
corpus handling tools can do, click HERE (link to be added). For a list of
software to use with corpora, use this link.
The know-how
It is not difficult to search a corpus and find examples of various kinds
once you know how to use your tool. In the tutorial you are introduced to
the W3-Corpora search engine and shown how you can use it on various
corpora for a number of research tasks. The illustrations and comments
will provide you with examples of the kinds of questions that are useful to
ask when you are working with corpora.
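As an illustration of what such a search tool does at its simplest, here is a minimal sketch (plain Python, written for this page rather than taken from any of the tools listed above) of a KWIC, "key word in context", display; the corpus file name is hypothetical:

import re

def kwic(path, keyword, width=30):
    # Print every hit for `keyword` with `width` characters of context on each side.
    with open(path, encoding="utf-8") as handle:
        text = handle.read()
    for match in re.finditer(r"\b%s\b" % re.escape(keyword), text, re.IGNORECASE):
        left = text[max(0, match.start() - width):match.start()].replace("\n", " ")
        right = text[match.end():match.end() + width].replace("\n", " ")
        print(f"{left:>{width}} [{match.group()}] {right}")

# kwic("my_corpus.txt", "corpus")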

Using corpora
Which corpus should I choose?
The choice of corpus is very important for the kind of results you will get and what the
results can tell you. When deciding which corpus to use, there are certain points that are
good to consider.

What kind of material do I want?


How much data do I want?
What is available?

What kind of material do I want?


What kind of material you want will vary with the kind of study you intend to perform.
Some primary points to consider can be:
medium (written, spoken or both)
text type (fiction, non-fiction, scientific writing, children's books, spoken
conversation, radio broadcasts, etc.)
time (produced in the 20th century, in the 1990s, in Middle English, etc.)

In the List of Corpora you will find corpora of various kinds under certain sub-headings
(spoken, historical, etc.)

How much data do I want?


How much data you want depends on your study. If you want to make extensive claims
about the language as a whole, you will want large amounts of (representative) data.
Similarly if you want to make statistical calculations you will probably also need large
amounts of data. If you are interested in finding an example or two of how a particular
word/phrase can be used, you do not need much data at all, as long as you can find your
example in it. There are no set definitions of how large a corpus you have to use or how
many examples of something you have to find for studies of this kind or another.
Generally speaking, it is important to have 'enough' data, and what counts as 'enough'
has to be decided in connection with each study.

Big or small? Which do I choose?


The bigger the corpus, the more data. However, it is important to remember that not even
a very big corpus can include all varieties of a language. On the other hand, a small
corpus only contains a small sample of the language as a whole. But maybe it is the kind
of sample you need?
A point that can be easy to forget is that when using a big corpus you can get too much
data. If you want to study modal verbs and use the BNC, you might be overwhelmed to
find that there are about 250,000 occurrences of the modal 'will' alone. If you want to
study a phenomenon in detail it might be better to use a small corpus, or a subcorpus created
from a large corpus. A small corpus can be more convenient to use, but then it is
important to keep in mind that it might be a restricted sample, a sample from only a
subset of the language, or a small, not necessarily representative sample of the language
as a whole.

What is available?
A very important question to consider when setting out to make a corpus-based study is
'what is available?'. There are a number of corpora, but not all of them are
publicly available
readily available
Publicly available corpora are those which anyone can use for free. Most corpora are
not publicly available. Some are available to anyone who buys a copy or a licence
to use them, which may vary in cost from a few pounds (to cover administrative costs) to
several hundred pounds. Some corpora are not available to anyone but their owners, and
are therefore not possible to obtain.
By readily available we here mean corpora which are ready to be used at once. What is
readily available varies between different institutions. Some have corpora installed on
their network, or stored on CD-ROMs. These are then available to anyone who has access
to that network/CD-ROM and knows how to use the corpus. Other institutions do not have
access to any corpora, or not to the corpora that are needed for the particular task/study.
When this is the case, the options are to try to get access to the corpus, or to use some
other data or method.
Getting a corpus usually means acquiring it (buying, downloading, compiling), installing
it, and finding the right tools to use with it. This can be a time-consuming, complicated
and costly procedure. Some corpora can be accessed online, freely or at a cost. You will
find a list of such corpora here.

Tools
There are a number of different programs and search engines available for use with
corpora, and some are presented on the 'tools' page (to be added).

Using corpora
Knowing your corpus
Something about corpus compilation
Combining texts into a corpus is called compiling a corpus. There are various ways of
doing this, depending on what kind of corpus you want to create and on what resources
(time, money, knowledge) you have at your disposal.
Even if you are not compiling your own corpus, it is important to know something about
corpus compilation when you use a corpus. Using a corpus is using a selection of texts to
represent the language. How the corpus has been compiled is of utmost importance for
the results you get when using it. What texts are included, how these are marked up, the
proportions of different text types, the size of the various texts, how the texts have been
selected, etc. are all important issues.

Illustration: the language as a newspaper


Let us imagine that you have a newspaper - a collection of texts of different kinds
(editorials, reportage on different topics, reviews, cartoons, letters to the editor, sports
commentaries, lists of shares, etc) written by different people. You then cut the paper into
small pieces with one word on each. You put all the pieces/words into a bowl and pick a
sample of ten at random. Obviously there would be several words that you know exist in
the newspaper that are not found in your sample. If you were to pick another ten pieces of
paper you would not expect the two sets of ten words to be exactly the same. If you
picked two sets of 100 words each, you would probably find that some words, especially
frequent words like function words, can be found in both samples, if not in exactly the
same numbers. You would also find that many words are found in only one of the
samples. If you took two very large samples you would find that the frequent words
would occur to a similar extent. Words that occur only once in the newspaper would be
found in only one of the samples (at most). Words that occur infrequently would not
necessarily be evenly distributed across the two samples.
Now imagine that you divide the newspaper into sections (or classify its content into
categories/text types) before cutting it up, and then put the cuttings in different bowls. By
picking your paper slips from the different bowls you can influence the composition of
your sample. You can choose to take slips from only one bowl or from several, in equal or
different proportions. If there is a difference in the language in the bowls, there will be a
difference in the language on the slips and that will affect your sample correspondingly.
You can easily see that if you were to take 100 slips of paper from the 'sports' bowl and
100 slips from the 'editorial' bowl, you would probably find a larger number of the word
football in the sample taken from the 'sports' bowl than from the 'editorial'.
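The bowl experiment is easy to simulate. The sketch below (a toy illustration in Python, with invented two-sentence "bowls") draws random word-slips from one bowl or from several and counts what comes out; running it a few times shows how both chance and the choice of bowls shape the sample:

import random
from collections import Counter

bowls = {
    "sports":    "the home side won the match after a late goal in the match".split(),
    "editorial": "the government must explain the reasons for the decision it has taken".split(),
}

def draw(bowl_names, n):
    # Pool the slips from the named bowls and draw n of them at random.
    pool = [word for name in bowl_names for word in bowls[name]]
    return Counter(random.sample(pool, min(n, len(pool))))

print(draw(["sports"], 10))               # a sample from one bowl only
print(draw(["sports", "editorial"], 10))  # a broader, but thinner, sample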

Corpus compilation
We can use the image above to give a (simplified) description of how a corpus can be
created. (We will not go into any practical issues here - this is merely intended to give
you an understanding of why it is important to know the corpus you use). If we imagine
the language as a whole as the newspaper, we can say that the words on the slips of
papers are texts (bits of spoken or written language). You create (compile) a corpus by
selecting texts from the language. The composition of the corpus depends on the kind of
texts you use, how you have selected them, and in what proportions. If you have divided
your paper into sections you can decide to use more texts from one section, to use texts
from one section only, to use a set proportion of texts from each section, to use a set
number of texts from each section, etc. What kind of bowls you use will also make a
difference - will you have bowls for various text types (reviews, editorials, news
reportage, etc), or sort the cuttings according to author specification (age of author, sex,
education, etc)? Perhaps sorted according to time when they were written, intended
reader, or colour of print? How do you classify the texts? If you look at the slips before you
select/discard them, the composition of your sample/corpus will reflect the choices you
made (for example, you may choose to select texts which contain some particular
feature/construction/vocabulary item, irrespective of which section they come from).

Disclaimer
The image of the language as a newspaper may give the impression that 'the
language as a whole' is a well-defined and generally agreed upon notion, something that
is concrete and possible to quantify. This is far from the case. We should not forget
that language is not a confined, closed entity but a very difficult notion to
define, quantitatively or qualitatively. Try to decide, for example, how much language
you use in a day. Do you then count only the language you produce or all the language
you get in contact with? What are the proportions of written and spoken language?
Should the spoken language you hear on the radio (actively listening or just overhearing)
be counted differently from the spoken language directed to you? Does it make a
difference if you talk/write to several people or just to one? What is language spoken to a
dictaphone or answering machine? Would a shopping list be counted as language? What
about a diagram you make/see as an illustration to a text (spoken or written)? etc.

When compiling a corpus, you do not only have to take into account how you define
language - you also have to decide what proportions of different varieties of language you
want to include in your corpus. Once that is settled, you have to acquire the texts.
Articles from newspapers and books can be easy enough to get hold of,
and transcripts/scripts of certain radio and TV programs as well. How do you get the
more personal writings like letters and diaries, though? And records of personal
conversations, confessions, information given in confidence, etc.? Moreover, as many
corpus compilers can testify, much time and effort has to be spent on legal issues such as
obtaining permission to use the texts and making sure that no copyright agreement is
broken.

Summary
When you think of what we have described above, it is easy to understand why it is
important to know something about how a corpus is compiled and what kinds of text
samples are included. Among the issues that have to be considered, then, by both corpus
compilers and corpus users are:
the language sampled (what kind of newspaper has been used?)
the size of the corpus (how many pieces of paper were taken from the newspaper
bowl?)
the kinds of texts included (from which bowls was the sample taken?)
the proportions of different text types (how many slips of paper from each bowl?)
If the corpus consists of samples from a particular variety of language (from the 'sports'
bowl, for example) you will find that it may be very different from another sample taken
from another bowl. Moreover, it is important to know about the size of the corpus and the
size/number of samples making up the corpus. If you have a big corpus (a large
proportion of the newspaper) you may be able to find even rare words. In a small sample
you have a bigger chance of missing something (think of all the words you don't get if
you take only ten slips from the newspaper bowl, for example). If the corpus consists of a
large part of one particular bowl, you get a good picture of that particular bowl. It may or
may not be different from a sample from another bowl. If you have a corpus of the same size
but consisting of several small samples from different bowls, you will have a broader
corpus (from more areas). The samples from each bowl are still small, however, so you
may not be able to say much about the language in any one bowl.
Among the practical matters that have to be solved by the compiler are:

how can the texts be obtained? Where do they exist? (in books, on the WWW, etc)
do you need permission to use the texts?
do you need to process the material to include it (transcribe, code, convert files,
etc)?
how can the texts be converted to the format you want them in (made
electronically readable by scanning, keying-in, converting files to right format,
etc)?

Though the user of the corpus does not have to make decisions about these practical
matters, there are other issues that are important for the user to be aware of. Among those
are, for example:
permission to use the corpus.
Some corpora are only available to licence holders or for particular purposes
(such as non-commercial academic research, teaching, personal use, etc)
permission to reproduce text.
You may be permitted to use the texts as long as you do not quote them or publish
them.
format of the texts.
Some texts may be available only in particular formats that cannot be read by a
usual word processor, for example.
software.
A number of programs and search engines have been developed for use on
corpora in general or on specific corpora. A basic knowledge of, and access to,
some of these tools may be necessary in order to make use of the corpus.

Annotated Corpora
Apart from the pure text, a corpus can also be provided with additional linguistic
information, called 'annotation'. This information can be of different kinds, such as
prosodic, semantic or historical annotation. The most common form of annotated corpus
is the grammatically tagged one. In a grammatically tagged corpus, the words have been
assigned a word class label (part-of-speech tag). The Brown Corpus, the LOB Corpus and
the British National Corpus (BNC) are examples of grammatically annotated corpora.
The LLC Corpus has been prosodically annotated. The Susanne Corpus is an example of
a parsed corpus, a corpus that has been syntactically analysed and annotated.
Annotated corpora constitute a very useful tool for research. In the Tutorial you can find
examples of how to make use of the annotation when searching a corpus.
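To give a flavour of what "making use of the annotation" can mean in practice, here is a minimal sketch (plain Python, not taken from the Tutorial) of a search over a corpus stored as word-tag pairs; the tag codes follow the style of the tagged example shown later on this page, and the tiny word list is invented for illustration:

tagged = [("a", "AT0"), ("pair", "NN0"), ("of", "PRF"),
          ("clean", "AJ0"), ("breeches", "NN2")]

def adjective_noun_pairs(pairs):
    # Return every adjective immediately followed by a noun, using the tags rather than the word forms.
    hits = []
    for (word1, tag1), (word2, tag2) in zip(pairs, pairs[1:]):
        if tag1.startswith("AJ") and tag2.startswith("NN"):
            hits.append((word1, word2))
    return hits

print(adjective_noun_pairs(tagged))   # [('clean', 'breeches')]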
Further information about corpus annotation and annotated corpora can be found, for
example, in the book Corpus Annotation: Linguistic Information from Computer Text
Corpora (external link), or by using the following links:

Types of annotation (*)


UCREL Corpus Annotation Pages
Parsing (*)
Part-of-speech Annotation (*)
* Links to web-pages made to supplement the book "Corpus Linguistics" by Tony
McEnery and Andrew Wilson.

Types of annotation

Certain kinds of linguistic annotation, which involve the attachment of special codes to
words in order to indicate particular features, are often known as "tagging" rather than
annotation, and the codes which are assigned to features are known as "tags". These
terms will be used in the sections which follow:
Part of Speech annotation
Lemmatisation
Parsing
Semantics
Discoursal and text linguistic annotation
Phonetic transcription
Prosody
Problem-oriented tagging

Part-of-speech Annotation.
This is the most basic type of linguistic corpus annotation - the aim being to assign to
each lexical unit in the text a code indicating its part of speech. Part-of-speech annotation
is useful because it increases the specificity of data retrieval from corpora, and also forms
an essential foundation for further forms of analysis (such as syntactic parsing and
semantic field annotation). Part-of-speech annotation also allows us to distinguish
between homographs.
Click here for an example of part-of-speech annotation.
Part-of-speech annotation was one of the first types of annotation to be performed on
corpora and is the most common today. One reason for this is that it is a task that can
be carried out to a high degree of accuracy by a computer. Greene and Rubin (1971)
achieved a 71% accuracy rate of correctly tagged words with their early part-of-speech
tagging program (TAGGIT). In the early 1980s the UCREL team at Lancaster University
reported a success rate of 95% using their program CLAWS.
Read about idiomatic tags and the tagging of contracted forms in Corpus Linguistics,
chapter 2, pages 40-42.
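For readers who want to try automatic tagging themselves, the sketch below uses the freely available NLTK toolkit for Python (an assumption of this page, not one of the programs mentioned above); note how the homograph "saw" receives a different tag in each of its two uses:

import nltk
# One-off model downloads may be needed, e.g.:
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

tokens = nltk.word_tokenize("I saw the saw on the bench.")
print(nltk.pos_tag(tokens))
# e.g. [('I', 'PRP'), ('saw', 'VBD'), ('the', 'DT'), ('saw', 'NN'), ...]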

Part-of-speech Annotation: An Example.


This example is taken from the Spoken English Corpus and uses the C5 tagset:
Perdita&NN1-NP0; ,&PUN; covering&VVG; the&AT0; bottom&NN1; of&PRF;
the&AT0; lorries&NN2; with&PRP; straw&NN1; to&TO0; protect&VVI; the&AT0;
ponies&NN2; '&POS; feet&NN2; ,&PUN; suddenly&AV0; heard&VVD-VVN;
Alejandro&NN1-NP0; shouting&VVG; that&CJT; she&PNP; better&AV0; dig&VVB;
out&AVP; a&AT0; pair&NN0; of&PRF; clean&AJ0; breeches&NN2; and&CJC;
polish&VVB; her&DPS; boots&NN2; ,&PUN; as&CJS; she&PNP; 'd&VM0; be&VBI;
playing&VVG; in&PRP; the&AT0; match&NN1; that&DT0; afternoon&NN1; .&PUN;

The codes used are:


AJ0: general adjective
AT0: article, neutral for number
AV0: general adverb
AVP: prepositional adverb
CJC: co-ordinating conjunction
CJS: subordinating conjunction
CJT: that conjunction
DPS: possessive determiner
DT0: singular determiner
NN0: common noun, neutral for number
NN1: singular common noun
NN2: plural common noun
NP0: proper noun
POS: genitive marker
PNP: pronoun
PRF: of
PRP: preposition
PUN: punctuation
TO0: infinitive to
VBI: be
VM0: modal auxiliary
VVB: base form of lexical verb
VVD: past tense form of lexical verb
VVG: -ing form of lexical verb
VVI: infinitive form of lexical verb
VVN: past participle form of lexical verb

Points of interest

All the tags here contain three characters.


Tags have been attached to words by the use of TEI entity references delimited by
& and ;.
Some of the words (such as heard) have two tags assigned to them. These are
known as portmanteau tags and have been assigned to help the end user in cases
where there is a strong chance that the computer might otherwise have selected
the wrong part of speech from the choices available to it (this corpus has not been
corrected by hand).
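A tagged text in the word&TAG; format shown above is easy to process by machine. The sketch below (plain Python, assuming only that the text follows exactly this pattern) recovers the word-tag pairs and counts how often each tag occurs:

import re
from collections import Counter

sample = ("Perdita&NN1-NP0; ,&PUN; covering&VVG; the&AT0; bottom&NN1; of&PRF; "
          "the&AT0; lorries&NN2; with&PRP; straw&NN1;")

# Each token is written as word&TAG; so a single regular expression recovers both parts.
pairs = re.findall(r"(\S+?)&([A-Z0-9-]+);", sample)
print(pairs[:3])    # [('Perdita', 'NN1-NP0'), (',', 'PUN'), ('covering', 'VVG')]
print(Counter(tag for _, tag in pairs))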

Lemmatisation
Lemmatisation is closely allied to the identification of parts-of-speech and involves the
reduction of the words in a corpus to their respective lexemes. Lemmatisation allows the
researcher to extract and examine all the variants of a particular lexeme without having to

input all the possible variants, and to produce frequency and distribution information for
the lexeme. Although accurate software has been developed for this purpose (Beale
1987), lemmatisation has not been applied to many of the more widely available corpora.
However, the SUSANNE corpus does contain lemmatised forms of the corpus words,
along with other information. See the example below - the fourth column contains the
lemmatised words:
N12:0510g   PPHS1m   He        he
N12:0510h   VVDv     studied   study
N12:0510i   AT       the       the
N12:0510j   NN1c     problem   problem
N12:0510k   IF       for       for
N12:0510m   DD221    a         a
N12:0510n   DD222    few       few
N12:0510p   NNT2     seconds   second
N12:0520a   CC       and       and
N12:0520b   VVDv     thought   think
N12:0520c   IO       of        of
N12:0520d   AT1      a         a
N12:0520e   NNc      means     means
N12:0520f   IIb      by        by
N12:0520g   DDQr     which     which
N12:0520h   PPH1     it        it
N12:0520i   VMd      might     may
N12:0520j   VB0      be        be
N12:0520k   VVNt     solved    solve
N12:0520m   YF       +.        -
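As with tagging, lemmatisation can nowadays be tried with off-the-shelf software. The sketch below uses NLTK's WordNet lemmatiser for Python (an assumption of this page, not the scheme used in SUSANNE); inflected forms are reduced to their lexemes much as in the fourth column above:

from nltk.stem import WordNetLemmatizer
# A one-off download of the WordNet data may be needed: nltk.download("wordnet")

lemmatiser = WordNetLemmatizer()
print(lemmatiser.lemmatize("seconds"))            # second
print(lemmatiser.lemmatize("studied", pos="v"))   # study
print(lemmatiser.lemmatize("thought", pos="v"))   # think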

Parsing
Parsing involves the procedure of bringing basic morphosyntactic categories into
high-level syntactic relationships with one another. This is probably the most commonly
encountered form of corpus annotation after part-of-speech tagging. Parsed corpora are
sometimes known as treebanks. This term alludes to the tree diagrams or "phrase
markers" used in parsing. For example, the sentence "Claudia sat on a stool" (BNC)
might be represented by a tree diagram (not reproduced here), using the following labels:

(S=sentence, NP=noun phrase, VP=verb phrase, PP=prepositional phrase, N=noun,
V=verb, AT=article, P=preposition.)
Such visual diagrams are rarely encountered in corpus annotation - more often the
identical information is represented using sets of labelled brackets. Thus, for example, the
above parsed sentence might appear in a treebank in a form something like this:
[S[NP Claudia_NP1 NP][VP sat_VVD [PP on_II [NP a_AT1 stool_NN1 NP] PP]
VP] S]

Morphosyntactic information is attached to the words by underscore characters ( _ ) in
the form of part-of-speech tags, whereas the constituents are indicated by opening and
closing square brackets annotated at the beginning and end with the phrase type, e.g.
[S ...... S].
Sometimes these bracket-based annotations are displayed with indentations so that they
resemble the properties of a tree diagram (a system used by the Penn Treebank project).
For instance:
[S
    [NP Claudia NP]
    [VP sat
        [PP on
            [NP a stool NP]
        PP]
    VP]
S]
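Bracketed parses of this kind can be read and manipulated by software. The sketch below uses NLTK's Tree class for Python (an assumption of this page); since that class expects the round-bracket notation of the Penn Treebank, the example sentence is rewritten in that notation here:

from nltk import Tree

parse = Tree.fromstring(
    "(S (NP (NP1 Claudia)) (VP (VVD sat) (PP (II on) (NP (AT1 a) (NN1 stool)))))")
parse.pretty_print()   # draws an ASCII version of the tree diagram
print([" ".join(np.leaves()) for np in parse.subtrees(lambda t: t.label() == "NP")])
# ['Claudia', 'a stool']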

In depth: You might want to read about full parsing, skeleton parsing, and constraint
grammar by following this link.
Because automatic parsing (via computer programs) has a lower success rate than
part-of-speech annotation, it is often either post-edited by human analysts or carried out
by hand (although possibly with the help of parsing software). The disadvantage of manual
parsing, however, is inconsistency, especially where more than one person is parsing or
editing the corpus, which can often be the case on large projects. The solution is more
detailed guidelines, but even then ambiguities can occur where more than one
interpretation is possible.

Parsing: in depth
Not all parsing systems are the same. The two main differences are:
The number of constituent types which a system employs.
The way in which constituent types are allowed to combine with each other.
However, despite these differences, the majority of parsing schemes are based on a form
of context-free phrase structure grammar. Within this system an important distinction
must be made between full parsing and skeleton parsing.
Full parsing aims to provide as detailed an analysis of the sentence structure as possible,
while skeleton parsing is a less detailed approach which tends to use a less finely
distinguished set of syntactic constituent types and ignores, for example, the internal
structure of certain constituent types. The two examples below show the differences.

Full parsing:
[S[Ncs another_DT new_JJ style_NN feature_NN Ncs] [Vzb is_BEZ Vzb] [Ns
the_AT1 [NN/JJ& wine-glass_NN [JJ+ or_CC flared_JJ JJ+]NN/JJ&]
heel_NN ,_, [Fr[Nq which_WDT Nq] [Vzp was_BEDZ shown_VBN Vzp] [Tn[Vn
teamed_VBN Vn] [R up_RP R] [P with_INW [Np[JJ/JJ/NN& pointed_JJ ,_,
[JJ- squared_JJ JJ-] ,_, [NN+ and_CC chisel_NN NN+]JJ/JJ/NN&] toes_NNS
Np]P]Tn]Fr]Ns] ._. S]

This example was taken from the Lancaster-Leeds treebank


The syntactic constituent structure is indicated by nested pairs of labelled square
brackets, and the words have part-of-speech tags attached to them. The syntactic
constituent labels used are:
& whole coordination
+ subordinate conjunct, introduced
- subordinate conjunct, not introduced

Fr relative phrase
JJ adjective phrase
Ncs noun phrase, count noun singular
Np noun phrase, plural
Nq noun phrase, wh-word
Ns noun phrase, singular
P prepositional phrase
R adverbial phrase
S sentence
Tn past participial phrase
Vn verb phrase, past participle
Vzb verb phrase, third person singular to be
Vzp verb phrase, passive third person singular

Skeleton Parsing
[S& [P For_IF [N the_AT members_NN2 [P of_IO [N this_DD1 university_NNL1
N]P]N]P] [N this_DD1 charter_NN1 N] [V enshrines_VVZ [N a_AT1
victorious_JJ principle_NN1 N]V]S&] ;_; and_CC [S+[N the_AT fruits_NN2
[P of_IO [N that_DD1 victory_NN1 N]P]N] [V can_VM immediately_RR be_VB0
seen_VVN [P in_II [N the_AT international_JJ community_NNJ [P of_IO [N
scholars_NN2 N]P] [Fr that_CST [V has_VHZ graduated_VVN here_RL today_RT
V]Fr]N]P]V]S+] ._.

This example was taken from the Spoken English Corpus.


The two examples are similar, but in the example of skeleton parsing all noun phrases are
simply labelled with the letter N, whereas in the example of full parsing there are several
types of noun phrase which are distinguished according to features such as plurality. The
only constituent labels used in the skeleton parsing example are:
Fr relative clause
N noun phrase
P prepositional phrase
S& 1st main conjunct of a compound sentence
S+ 2nd main conjunct of a compound sentence
V verb phrase

Constraint grammar
It is not always the case that a corpus is parsed using context-free phrase structure
grammar. For example, the Birmingham Bank of English has been part-of-speech tagged
and parsed using a form of dependency grammar known as constraint grammar
(Karlsson et al. 1995).
Constraint grammar marks the grammatical functions of words within a sentence and the
interdependencies between them, rather than identifying hierarchies of constituent phrase
types. For example, a code with a forward pointing arrowhead (e.g. AN> ) indicates a
premodifying word, in this case an adjective, while a code with a backward pointing
arrowhead (e.g. <NOM-OF ) indicates a postmodifying word, in this case "of". The
example below shows parsing using the Helsinki constraint grammar for English:
It has maintained its independance and present boundaries intact since 1815.

"<It>"
    "it" PRON NOM SG3 SUBJ @SUBJ
"<has>"
    "have" V PRES SG3 VFIN @+FAUXV
"<maintained>"
    "maintain" PCP2 @-FMAINV
"<its>"
    "it" PRON GEN SG3 @GN>
"<independance>"
    "independence" <-Indef> N NOM SG @OBJ @NN>
"<and>"
    "and" CC @CC
"<present>"
    "present" V INF @-FMAINV
    "present" A ABS @AN>
"<boundaries>"
    "boundary" N NOM PL @OBJ
"<intact>"
    "intact" A ABS @PCOMPL-O
On the line next to each word are three (or sometimes more) pieces of information. The
first item in double quotes is the lemma of that word; following that is a part-of-speech
code (which can include more than one string, e.g. N NOM PL); and at the right-hand end
of the line is a tag indicating the grammatical function of the word. These begin with a @
and stand for:

@+FMAINV    finite main predicator
@-FMAINV    non-finite main predicator
@AN>        premodifying adjective
@CC         coordinator
@DN>        determiner
@GN>        premodifying genitive
@INFMARK>   infinitive marker
@NN>        premodifying noun
@OBJ        object
@PCOMPL-O   object complement
@PCOMPL-S   subject complement
@QN>        premodifying quantifier
@SUBJ       subject

Semantics
Two types of semantic annotation can be identified:
1. The marking of semantic relationships between items in the text, for example
the agents or patients of particular actions. This has scarcely begun to be
widely applied at the time of writing, although some forms of parsing capture
much of its import.
2. The marking of semantic features of words in the text, essentially the
annotation of word senses in one form or another. This has quite a long history,
dating back to the 1960s.
There is no universal agreement about which semantic features ought to be annotated - in
fact in the past much of the annotation was motivated by social scientific theories of, for
instance, social interaction. However, Sedelow and Sedelow (1969) made use of Roget's
Thesaurus, in which words are organised into general semantic categories.
The example below (Wilson, forthcoming) is intended to give the reader an idea of the
types of categories used in semantic tagging:
And         00000000
the         00000000
soldiers    23241000
platted     21072000
a           00000000
crown       21110400
of          00000000
thorns      13010000
and         00000000
put         21072000
it          00000000
on          00000000
his         00000000
head        21030000
and         00000000
they        00000000
put         21072000
on          00000000
him         00000000
a           00000000
purple      31241100
robe        21110321

The numeric codes stand for:

00000000    Low content word (and, the, a, of, on, his, they etc.)
13010000    Plant life in general
21030000    Body and body parts
21072000    Object-oriented physical activity (e.g. put)
21110321    Men's clothing: outer clothing
21110400    Headgear
23231000    War and conflict: general
31241100    Colour

The semantic categories are represented by 8-digit numbers - the one above is based on
that used by Schmidt (1993) and has a hierarchical structure, in that it is made up of three
top level categories, which are themselves subdivided, and so on.
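At its simplest, semantic tagging of this kind can be approximated by dictionary lookup. The sketch below (a toy Python illustration, reusing the word-code pairings from the example above; a real tagger would also need to disambiguate senses in context) assigns the low-content code to anything not in its small lexicon:

LEXICON = {
    "soldiers": "23241000",
    "crown": "21110400",     # headgear
    "thorns": "13010000",    # plant life in general
    "put": "21072000",       # object-oriented physical activity
    "head": "21030000",      # body and body parts
    "purple": "31241100",    # colour
    "robe": "21110321",      # men's clothing: outer clothing
}

def semantic_tag(words):
    # Unknown words fall back to the low-content code 00000000.
    return [(w, LEXICON.get(w.lower(), "00000000")) for w in words]

print(semantic_tag("and put it on his head".split()))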

Discoursal and text linguistic annotation.


Annotation of language at the levels of text and discourse is one of the least frequently
encountered kinds of corpus annotation. However, such annotations are occasionally applied.

Discourse tags
Stenström (1984) annotated the London-Lund spoken corpus with 16 "discourse tags".
They included categories such as:
"apologies" e.g. sorry, excuse me
"greetings" e.g. hello
"hedges" e.g. kind of, sort of thing
"politeness" e.g. please
"responses" e.g. really, that's right
Despite their potential role in the analysis of discourse, these kinds of annotation have
never become widely used, possibly because the linguistic categories are
context-dependent and their identification in texts is a greater source of dispute than
other forms of linguistic phenomena.
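A very crude approximation of such tagging can be made by matching fixed phrases. The sketch below (plain Python; the phrase lists simply reuse the examples given above and are not Stenström's actual scheme) attaches discourse tags to an utterance:

DISCOURSE_TAGS = {
    "apology": ["sorry", "excuse me"],
    "greeting": ["hello"],
    "hedge": ["kind of", "sort of thing"],
    "politeness": ["please"],
    "response": ["really", "that's right"],
}

def discourse_tags(utterance):
    # Return every discourse category whose trigger phrase occurs in the utterance.
    lowered = utterance.lower()
    return [tag for tag, phrases in DISCOURSE_TAGS.items()
            if any(phrase in lowered for phrase in phrases)]

print(discourse_tags("Sorry, could you pass the salt please?"))  # ['apology', 'politeness']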

Anaphoric annotation
Cohesion is the vehicle by which elements in text are linked together, through the use of
pronouns, repetition, substitution and other devices. Halliday and Hasan's "Cohesion in
English" (1976) was considered to be a turning point in linguistics, as it was the most
influential account of cohesion. Anaphoric annotation is the marking of pronoun
reference - our pronoun system can only be realised and understood by reference to large
amounts of empirical data, in other words, corpora.
Anaphoric annotation can at present only be carried out by human analysts; indeed, one of
the aims of the annotation is to provide data with which computer programs can be
trained to carry out the task. There are
only a few instances of corpora which have been anaphorically annotated; one of these is
the Lancaster/IBM anaphoric treebank, an example of which is given below:
A039 1 v (1 [N Local_JJ atheists_NN2 N] 1) [V want_VV0 (2 [N the_AT (9
Charlotte_N1 9) Police_NN2 Department_NNJ N] 2) [Ti to_TO get_VV0
rid_VVN of_IO [N 3 <REF=2 its_APP$ chaplain 3) ,_, [N {{3 the_AT
Rev._NNSB1 Dennis_NP1 Whitaker_NP1 3} ,_, 38_MC N]N]Ti]V] ._.

The above text has been part-of-speech tagged and skeleton parsed, as well as
anaphorically annotated. The following codes explain the annotation:

(1 1) etc. - noun phrase which enters into a relationship with anaphoric elements
in the text
<REF=2 - referential anaphor; the number indicates the noun phrase which it
refers to - here it refers to noun phrase number 2, the Charlotte Police
Department
{{3 3}} - noun phrase entering into an equivalence relationship with preceding noun
phrase; here the Rev Dennis Whitaker is identified as being the same referent as
noun phrase number 3, its chaplain

Phonetic transcription
Spoken language corpora can also be transcribed using a form of phonetic transcription.
Not many examples of publicly available phonetically transcribed corpora exist at the
time of writing. This is possibly because phonetic transcription is a form of annotation
which needs to be carried out by humans rather than computers, and these annotators
have to be well skilled in the perception and transcription of speech sounds. Phonetic
transcription is therefore a very time-consuming task.
Another problem is that phonetic transcription works on the assumption that the speech
signal can be divided into single, clearly demarcated "sounds", while in fact these
"sounds" do not have such clear boundaries; what phonetic transcription takes to be the
same sound might therefore be realised differently according to context.
Nevertheless, phonetically transcribed corpora are extremely useful to the linguist who
lacks the technological tools and expertise for the laboratory analysis of recorded speech.
One such example is the MARSEC corpus, which is derived from the Lancaster/IBM
Spoken English Corpus and has been worked on by the Universities of Lancaster and
Leeds. The MARSEC corpus will include a phonetic transcription.

Prosody
Prosody refers to all aspects of the sound system above the level of segmental sounds e.g.
stress, intonation and rhythm. The annotations in prosodically annotated corpora typically
follow widely accepted descriptive frameworks for prosody such as that of O'Connor and
Arnold (1961). Usually, only the most prominent intonations are annotated, rather than
the intonation of every syllable. The example below is taken from the London-Lund
corpus:
1 8 14 1470 1 1 A 11 ^what a_bout a cigar\ette# . /
1 8 15 1480 1 1 A 20 *((4 sylls))* /
1 8 14 1490 1 1 B 11 *I ^w\on't have one th/anks#* - - /
1 8 14 1500 1 1 A 11 ^aren't you .going to sit d/own# /
1 8 14 1510 1 1 B 11 ^[/\m]# /
1 8 14 1520 1 1 A 11 ^have my _coffee in p=eace# - - - /
1 8 14 1530 1 1 B 11 ^quite a nice .room to !s\it in ((actually))# /
1 8 14 1540 1 1 B 11 *^\isn't* it# /
1 5 15 1550 1 1 A 11 *^y/\es#* - -

The codes used in this example are:


# end of tone group
^ onset
/ rising nuclear tone
\ falling nuclear tone
/\ rise-fall nuclear tone
_ level nuclear tone
[] enclose partial words and phonetic symbols
. normal stress
! booster: higher pitch than preceding prominent syllable
= booster: continuance
(( )) unclear
* * simultaneous speech
- pause of one stress unit

Problems of Prosodic Corpora


1. Judgements are inherently of an impressionistic nature. For example, the level
of a tone movement is a difficult matter to agree upon. Some listeners may
perceive a fall in pitch, while others may perceive a slight rise after the fall. This
leads to our second point:
2. Consistency is difficult to maintain, especially if more than one person
annotates the corpus. (This can be alleviated to some degree by having two people
both annotate a small part of the corpus.)
3. Recoverability is difficult (see Leech's 1st Maxim) since prosodic features are
carried by syllables rather than whole words - annotations appear within the
words themselves making it difficult for software to retrieve the raw corpus.
4. Sometimes special graphics characters are used to indicate prosodic phenomena.
However, not all computers and printers can handle such characters. TEI
guidelines for text encoding will hopefully alleviate these difficulties.

Problem-oriented tagging
Problem-oriented tagging (as described by de Haan (1984)) is the phenomenon whereby
users will take a corpus, either already annotated, or unannotated, and add to it their
own form of annotation, oriented particularly towards their own research goal. This
differs in two ways from the other types of annotation we have examined in this session.
1. It is not exhaustive. Not every word (or sentence) is tagged - only those which are
directly relevant to the research. This is something which problem-oriented
tagging has in common with anaphoric annotation.
2. Annotation schemes are selected not for broad coverage and theory-neutrality,
but for the relevance of the distinctions which they make to the specific questions
that the researcher wishes to ask of his/her data.
Although it is difficult to generalise further about this form of corpus annotation, it is an
important type to keep in mind in the context of practical research using corpora.
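As a concrete (and invented) illustration of the idea, the sketch below tags only the items relevant to one research question - modal verbs - and leaves everything else unannotated, using a tag set made up for that question alone:

MODALS = {"can", "could", "may", "might", "must", "shall", "should", "will", "would"}

def tag_modals(tokens):
    # Attach a research-specific MODAL tag to modal verbs only; leave the rest untagged.
    return [(w, "MODAL") if w.lower() in MODALS else (w, None) for w in tokens]

print(tag_modals("She said she would be playing in the match that afternoon".split()))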
