
The Text Encoding Initiative: an overview
Laurent Romary
Inria
Overview
Part 1: The Text Encoding Initiative: history
and organisation
Part 2: Components of a TEI document
Part 3: The TEI architecture and ODD
Part 4: Scientific and technical information use
cases
NPL, back-office usage, terminology
A QUICK INTRODUCTION TO XML
But before anything serious starts
A quick historical overview
1960s GML (Generalized Markup Language) by IBM
1970s & 1980s ANSI initiates project to develop a Standard text-
description language based on GML
1983 SGML became an industry standard
1986 SGML (Standard Generalized Markup Language) becomes
an ISO standard: ISO 8879:1986
1987 TEI (Text Encoding Initiative)
1990 HTML 1.0 (HyperText Markup Language)
1992 TEI edition P3 (Michael Sperberg-McQueen and Lou
Burnard, eds)
1997/1998 XML 1.0 (eXtensible Markup Language) (Tim Bray,
Jean Paoli and Michael Sperberg-McQueen, eds)
All you need to know about XML
XML as a serialization language
<gramGrp>
<gen>f</gen>
<num>p</num>
</gramGrp>
XML as a data model
Issues
Specifying structures: schemas
Providing semantics: documentation
Attributes, namespaces and you're nearly set
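As a small illustration of attributes and namespaces, the grammatical group shown earlier can be placed in a namespaced entry. This is only a sketch: the headword and the xml:id value are invented for the example, and the default namespace is the TEI one.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- The default namespace declared on the root applies to every descendant -->
<entry xmlns="http://www.tei-c.org/ns/1.0" xml:id="e1" xml:lang="fr">
  <!-- Illustrative headword (feminine plural), not taken from the slides -->
  <form><orth>chaises</orth></form>
  <gramGrp>
    <gen>f</gen>
    <num>p</num>
  </gramGrp>
</entry>
```

The xml:id and xml:lang attributes belong to the reserved xml namespace and need no declaration of their own.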

THE TEXT ENCODING INITIATIVE:
HISTORY AND ORGANIZATION

Part 1
29.05.2014 Seite 7
In the beginning
1. November 1987:
Vassar College,
Poughkeepsie
Lou Burnard
Humanities
Text archives
Standards
SGML
TEI as a community endeavor
A trend towards digital curatorship
Describing digital sources: meta-data
Understanding and representing the structure of
digital sources: content
Enriching (annotations, links), versioning,
disseminating
A wide user community
From individual scholars to large digitization
projects

The standard scenario?
Digitizing source documents
Further work on documents
TEI in a nutshell
TEI namespace:
xmlns="http://www.tei-c.org/ns/1.0"
TEI documentation:
http://www.tei-c.org/release/doc/tei-p5-doc/en/html/
TEI processor, Roma:
http://www.tei-c.org/Roma/
TEI document model
Read: http://www.tei-c.org/release/doc/tei-p5-doc/en/html/DS.html
TEI architecture: modules, classes
TEI vocabulary: more than 500 elements
Read: http://www.tei-c.org/release/doc/tei-p5-doc/en/html/CO.html

TEI core principles (1)
The TEI document as a digital surrogate of a physical
source
A TEI document is always part of a digital library workflow
Source → surrogate → enrichment → publication
Recorded in the header; encoded in the content
Born-digital documents may also go through a
succession of changes/versions
The TEI document as an autonomous object in a DL
workflow
Embedded meta-data + content
Multiple hands: annotation

TEI core principles (2)
Favoring the semantics rather than the layout
(Almost) no presentational constructs
Publication requires a transformation stage (XSLT;
ePub, pdf, HTML, etc.)
Document structure
Macro-structure: front-body-back
Meso-structure: divisions,
paragraphs/lists/figures/etc.
Micro-structure: in-line annotation mechanisms
Dates, names, notes, references, foreign expressions, etc.
All you can encode
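The transformation stage mentioned above could look like the following XSLT 1.0 sketch, which turns a few TEI elements into HTML. The choice of HTML elements (section, h2, i) is an assumption made for the example, not a rendering prescribed by the TEI.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:tei="http://www.tei-c.org/ns/1.0"
    exclude-result-prefixes="tei">
  <xsl:output method="html" encoding="UTF-8"/>

  <!-- The header carries metadata and is not rendered here -->
  <xsl:template match="tei:teiHeader"/>

  <!-- Meso-structure: divisions, headings and paragraphs -->
  <xsl:template match="tei:div">
    <section><xsl:apply-templates/></section>
  </xsl:template>
  <xsl:template match="tei:head">
    <h2><xsl:apply-templates/></h2>
  </xsl:template>
  <xsl:template match="tei:p">
    <p><xsl:apply-templates/></p>
  </xsl:template>

  <!-- Micro-structure: a semantic element only receives a layout at this stage -->
  <xsl:template match="tei:foreign">
    <i><xsl:apply-templates/></i>
  </xsl:template>
</xsl:stylesheet>
```

Because the layout decision lives in the stylesheet, the same TEI source can be sent through a different stylesheet to produce ePub or PDF.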
Examples
Simple encoded text
The Little Riding Hood

Scholarly paper
Towards Higher Ground
Dear H. Everybody
is O.K. Mrs. Butler
from across the
street died last
night. Too bad is
not it? Goodbye
S. W.
How do you manage this?
TEI as a standardization body (1)
Consensus building
Community based decision process
Maintenance
Two releases per year
Publication
All TEI content is available under a dual licence:
CC BY and BSD 2-Clause.
TEI as a standardization body (2)
Organization
Consortium of institutional and individual members
Organized around some core partners
Brown University, Univ. of Virginia, Univ. of Oxford, TGE
Adonis
Conference, journal
The TEI at work
Board: administrative aspects
Technical council: coordinates the evolution of the TEI
guidelines
Standardization work
Community based workflow
Mailing list
SourceForge bugs and features
Recording all issues and decisions
Cf. ODD as a specification platform
Deliverables
Documentation: TEI guidelines (more than 500 elements)
Schemas: DTD, RelaxNG, W3C
Additional resources
Tools
Online customization: Roma
Online processing: OxGarage
Examples: TEI by Example
Special Interest Groups (SIGs)
Computer-Mediated Communication (Michael Beißwenger)
Correspondence (Peter Stadler and Joachim Veit)
Education (TBA)
Libraries (Stefanie Gehrke and Kevin Hawkins)
Manuscripts (Dot Porter and Gerrit Brüning)
Music (Raffaele Viglianti)
Ontologies (Oyvind Eide and Christian-Emil Ore)
Scholarly Publishing (Daniel O'Donnell)
TEI for Linguists (Piotr Bański and Andreas Witt)
Text and Graphics (John Walsh and Martin de la Iglesia)
Tools (Serge Heiden)
The TEI guidelines
Online documentation
Prose description organized in chapters
Specific documentation for each element
Access to all examples from the guidelines
Schema(s)
RelaxNG, W3C Schema (and DTD)
Available online from the Roma interface
Delivered as packages (Ubuntu, Oxygen)
The TEI guidelines as specifications
Documentation and schemas are generated from one
single specification file
Expressed in a TEI sub-language: ODD (One Document Does it all)

What can we do with this?
Conformance to the TEI guidelines
Well-formedness criterion
Cf. "TEI conformable" formats (e.g. JSON, JavaScript Object Notation)
Validation constraint
schema file derived from the published TEI Guidelines
Conformance to the TEI abstract model
Core concepts and general organization of a TEI document
Semantic constraints
E.g. <l> vs. <item> vs. <lb/>
Mandatory components of a TEI document
<teiHeader>
TEI Namespace
Documentation constraint
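The semantic constraint can be illustrated by three constructs that may all end up on separate lines when displayed but mean different things. A sketch only: the verse lines are from Shakespeare's Sonnet 18, and the other content is reused from examples elsewhere in these slides.

```xml
<body xmlns="http://www.tei-c.org/ns/1.0">
  <!-- <l>: a line of verse, a structural unit of the text itself -->
  <lg>
    <l>Shall I compare thee to a summer's day?</l>
    <l>Thou art more lovely and more temperate:</l>
  </lg>

  <!-- <item>: one member of a list -->
  <list>
    <item>two french hens</item>
    <item>a partridge in a pear tree</item>
  </list>

  <!-- <lb/>: records where a line happens to break in the source -->
  <p>Dear H. Everybody<lb/>is O.K.</p>
</body>
```

A document that used <l> for list items would still validate in many schemas, which is why conformance to the abstract model is stated separately from validation.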
Varieties of TEI Conformance
Pure TEI-all subset
Most TEI projects
TEI subset with extensions
Cf. TBX in TEI
Non TEI document with TEI constructs (defined as
an ODD)
EAG extensions in the EU Cendari project
Non TEI document defined by means of an ODD
document
E.g. ISO 24616:2012 Language resources management
-- Multilingual information framework
The central role of customization
Each TEI project starts with the definition of a
customisation
Module selection
Sub-setting elements
Reducing possible values or content models
Adding, when necessary, new descriptive objects
ODD as the technical platform for
customization
Consequences
Family of formats
Comparison of two TEI-based projects through their ODDs
Support for third-party projects
(cf. PDM) in-house maintenance of customization and
documentation
Does not prevent one from knowing the TEI
components
Most projects can live with just a subset of the TEI ontology
With a real possibility of influencing the guidelines themselves
E.g. <abstract>
COMPONENTS OF A TEI DOCUMENT
Part 2
TEI document architecture
<TEI>
<teiHeader>
{additional
components}
<text>
<front>
<body>
<back>
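Put together, the architecture above corresponds to a skeleton such as the following. This is a minimal sketch with placeholder text; the three children of fileDesc shown here are the mandatory ones.

```xml
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>A minimal TEI document</title>
      </titleStmt>
      <publicationStmt>
        <p>Unpublished illustrative sample.</p>
      </publicationStmt>
      <sourceDesc>
        <p>Born digital; no earlier source exists.</p>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <!-- Additional components (e.g. facsimiles) may appear between header and text -->
  <text>
    <front>
      <div type="preface"><p>Front matter.</p></div>
    </front>
    <body>
      <p>Main content.</p>
    </body>
    <back>
      <div type="index"><p>Back matter.</p></div>
    </back>
  </text>
</TEI>
```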
An essential component: the header
Every TEI-conformant document comprises a
header usually followed by a text
the header contains:
mandatory file description (<fileDesc>)
optional encoding (<encodingDesc>), profile
(<profileDesc>) and revision (<revisionDesc>)
descriptions
the header is essential for:
bibliographic control and identification
resource documentation and processing
The components of a TEI header
<teiHeader>
<fileDesc>: the title page of the document
<encodingDesc>: constraints and editorial choices
<profileDesc>: general characteristics of the content
<revisionDesc>: digital history of the document
The TEI header: <fileDesc>
Bibliographic description of the whole
document
Main descriptors comprise:
Title and responsibilities associated with the
document (<titleStmt>)
Details regarding the publication of the document
(<publicationStmt>)
E.g. licensing
Information about the source(s) of the document
(<sourceDesc>)
<fileDesc>: example
<teiHeader>
<fileDesc>
<titleStmt>
<title>Thomas Paine: Common sense, a machine-readable transcript</title>
<respStmt>
<resp>compiled by</resp>
<name>Jon K Adams</name>
</respStmt>
</titleStmt>
<publicationStmt>
<distributor>Oxford Text Archive</distributor>
</publicationStmt>
<sourceDesc>
<bibl>The complete writings of Thomas Paine, collected and edited by Phillip S. Foner
(New York, Citadel Press, 1945)</bibl>
</sourceDesc>
</fileDesc>
</teiHeader>
A quick overview of the other header
components
<encodingDesc>
Characters, elements, geo-references, editorial
choices (sampling)
<profileDesc>
Index terms, abstract, classification codes, languages
of the document, participants (who speaks, who
writes, who annotates)
<revisionDesc>
<revisionDesc status="embargoed">
<change when="1991-11-11" who="#LB"> deleted chapter 10</change>
</revisionDesc>
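The three optional components might be filled in as follows. This is a sketch under stated assumptions: the editorial statement, the keyword and the change entry are invented, and #LR is a hypothetical pointer to a person declared elsewhere.

```xml
<teiHeader xmlns="http://www.tei-c.org/ns/1.0">
  <!-- mandatory <fileDesc> omitted here for brevity -->
  <encodingDesc>
    <editorialDecl>
      <p>Spelling is left as in the source; original line breaks are
         recorded with lb elements.</p>
    </editorialDecl>
  </encodingDesc>
  <profileDesc>
    <langUsage>
      <language ident="en">English</language>
    </langUsage>
    <textClass>
      <keywords>
        <term>political pamphlet</term>
      </keywords>
    </textClass>
  </profileDesc>
  <revisionDesc>
    <change when="2014-05-29" who="#LR">hypothetical entry: header completed</change>
  </revisionDesc>
</teiHeader>
```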
Basic structure of a TEI text
A TEI text has a little more structure: it
contains
optional front matter (<front>)
Title page, table of contents, preface
a body (<body>)
Main informational content of a document
optional back matter (<back>)
Index, bibliographic references
The body of a text usually has divisions
<div>s may be nested within one another

Systematic usage of global attributes
the type attribute labels a particular level e.g. as
"part" or "chapter"
the n attribute gives a particular division a name
or number
the xml:id attribute gives a particular division a
unique identifier

For example...
<text>
<front>
<!-- titlepage, etc here -->
</front>
<body>
<div type="book" n="I" xml:id="JA0100">
<head>Book I.</head>
<div type="chapter" n="1" xml:id="JA0101">
<head>Of writing lives in general...</head>
<!-- remainder of chapter 1 here -->
</div>
<div type="chapter" n="2" xml:id="JA0102">
<!-- chapter 2 here -->
</div>
<!-- remainder of book 1 here -->
</div>
<div type="book" n="II" xml:id="JA0200">
<!-- book 2 here -->
</div>
<!-- remaining books here -->
</body>
</text>
Text components
Paragraph level components:
Paragraphs: <p>
Tables: <table>
Figures: <figure>
Groups of lines: <lg>/<l>
Lists of items: <list>/<item>
Speech turns: <sp>

These may be mixed, and may also appear
directly within undivided texts
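A sketch of how these components can be mixed inside a body that has no divisions. The speaker identifier #hamlet is a hypothetical pointer, and the table content is invented for the example.

```xml
<body xmlns="http://www.tei-c.org/ns/1.0">
  <p>A paragraph placed directly in an undivided body.</p>

  <!-- A speech turn containing a line of verse -->
  <sp who="#hamlet">
    <speaker>Hamlet</speaker>
    <l>To be, or not to be, that is the question:</l>
  </sp>

  <!-- A simple two-column table with a label row -->
  <table>
    <row role="label">
      <cell>Element</cell>
      <cell>Use</cell>
    </row>
    <row>
      <cell>lg</cell>
      <cell>group of lines</cell>
    </row>
  </table>
</body>
```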
Example: Encoding a list
<list type="xmas">
<label>For my true love</label>
<item>
<list type="bullets">
<item>three calling birds</item>
<item>two french hens</item>
<item>a partridge in a pear tree</item>
</list>
</item>
<label>For Uncle Joe</label>
<item>socks as usual</item>
</list>

Example: Encoding illustrations
<figure>
<head>Mr Fezziwig's
Ball</head>
<figDesc>A Cruikshank
engraving showing Mr
Fezziwig leading a group
of revellers.</figDesc>
<graphic url="fezz.gif"/>
</figure>
Inline markup
phrases that are conventionally
typographically distinct
Highlighting (<hi>, <emph>, <mentioned>)
data-like (names, numbers, dates, times,
addresses)
Cf. chapter "Names, Dates, People, and Places"
editorial interventions (corrections,
regularizations, additions, omissions ...)
cross references and links
Example: Inline annotations
<head>Of writing lives in general, and particularly of
Pamela, with a word by the bye of <name
ref="#CIBC03">Colley Cibber</name> and
others.</head>

<p>It is a trite but true observation, that <q>examples
work more forcibly on the mind than precepts</q>
</p>

<p> <name ref="#JA">Mr. Joseph Andrews</name>,
<rs ref="#JA">the hero of our ensuing history</rs>,
was esteemed to be ...</p>
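The other kinds of inline markup listed earlier (data-like content, editorial interventions, cross references) can be sketched in the same way. The sentence is invented for the example; the target #JA0101 reuses an identifier from the division example.

```xml
<p xmlns="http://www.tei-c.org/ns/1.0">
  The meeting took place at <placeName>Vassar College</placeName> in
  <!-- data-like content: the normalised value goes in @when -->
  <date when="1987-11">November 1987</date>; the
  <!-- editorial intervention: source reading and correction kept side by side -->
  <choice>
    <sic>guidlines</sic>
    <corr>guidelines</corr>
  </choice>
  followed later
  <!-- cross reference to an element carrying xml:id="JA0101" -->
  (see <ref target="#JA0101">chapter 1</ref>).
</p>
```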

Foreign language phrases
The xml:lang attribute may be attached to any element
<foreign> to mark up specific sequences in another
language
Use ISO 639 language codes, combined according to BCP 47, to identify the language
Have you read <title xml:lang="de">Die
Dreigroschenoper</title>?

<mentioned xml:lang="fr">Savoir-faire</mentioned> is
French for know-how.

John has real <foreign xml:lang="fr">savoir-faire</foreign>.

Dealing with variety: bibliographies
Three main bibliographical objects
<bibl>
<biblStruct>
Covers a whole range of use cases (cf. PDM)
<biblFull>
A common descriptive vocabulary
model.imprintPart
<biblScope>, <distributor>, <pubPlace>, <publisher>
model.respLike
<author>, <editor>, <funder>, <meeting>, <principal>, <respStmt>,
<sponsor>
<citedRange>, <edition>, <extent>, <listRelation>,
<msIdentifier>, <relatedItem>, <relationGrp>, <series>,
<textLang>
The <biblStruct> element
<biblStruct>
<analytic>
<monogr> <imprint>
<series>
Encoding a book using <biblStruct>
<biblStruct>
<monogr>
<author>Blain, Virginia</author>
<author>Clements, Patricia</author>
<author>Grundy, Isobel</author>
<title>The Feminist Companion to Literature in English: women writers
from the middle ages to the present</title>
<edition>first edition</edition>
<imprint>
<publisher>Yale University Press</publisher>
<pubPlace>New Haven and London</pubPlace>
<date>1990</date>
</imprint>
</monogr>
</biblStruct>
<biblStruct> - example
<biblStruct type="incollection">
<analytic>
<author>
<forename>Pliny Earle</forename>
<surname>Goddard</surname>
</author>
<title type="main" level="a">Athapascan (Hupa)</title>
</analytic>
<monogr>
<editor>Boas, Franz</editor>
<title type="main" level="m">Handbook of American Indian Languages</title>
<imprint>
<pubPlace>Washington, D. C.</pubPlace>
<publisher>Government Printing Office</publisher>
<date>1911</date>
<biblScope type="pp">85-158</biblScope>
</imprint>
</monogr>
</biblStruct>
Source: http://wals.info
Relation to existing standards (1)
Default policy
Compliance to W3C and ISO standards
W3C: XML compliance, XML attributes (xml:id,
xml:lang, xml:base)
ISO: ISO 10646/Unicode, ISO 639/BCP 47
(languages), ISO 3166 (countries), ISO 8601 (times
and dates)
Reuse of existing vocabulary
E.g. SVG, MathML
Relation to existing standards (2)
Departing when necessary
Cf. chapter 5: Non-standard Characters and Glyphs
<char xml:id="ydotacute">
<charName>LATIN SMALL LETTER Y WITH DOT ABOVE AND ACUTE</charName>
<charProp>
<localName>entity</localName>
<value>ydotacute</value>
</charProp>
<mapping type="composed">&#x0079;&#x0307;&#x0301;</mapping>
<mapping type="PUA">U+E0A4</mapping>
</char>
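Once declared, the character can be referenced from the text with the <g> element. A minimal sketch: the surrounding sentence is invented, and the content of <g> is a fallback rendering chosen for the example.

```xml
<p xmlns="http://www.tei-c.org/ns/1.0">
  <!-- @ref points to the <char> declared with xml:id="ydotacute" -->
  The scribe writes <g ref="#ydotacute">y</g> at this point.
</p>
```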
Combination of proprietary and RelaxNG vocabulary in
ODD
Customization mechanisms
Towards a full ODD
THE TEI ARCHITECTURE
Part 3
Mechanisms for change
General organization of the TEI guidelines
Mechanisms for defining a customization
Why customize the TEI?
Possible scenarios
Experimenting
Using a large coverage schema in a specific domain
Testing samples to explore possible encodings
E.g. first attempt at encoding an existing dictionary
Large scale editing
Using a constrained schema used by many different editors
Very little variation is allowed from one editor to another
E.g. starting a new dictionary project
Baseline encoding
Generic schema allowing little variation as to the possible
encodings
Building a homogeneous corpus from heterogeneous sources
e.g. Monk project
General components of the TEI
architecture
[Diagram: modules, model classes, attribute classes and elements, connected by "belong to" and "use" links]
MODULES
Modules
Highest organization level of TEI elements
A TEI element belongs to one and only one module
Modules have no intrinsic behavior in the TEI
architecture
They are used to include groups of elements when
defining a schema

A strong historical background
Modules are usually seen as too big and sometimes
intractable
Modules and elements
Drama
<speaker>
<stage>
<sp>
Dictionaries
<entry>
<sense>
<gramGrp>
<form>
Feature structures
<fs>
<f>
<binary>
<symbol>
Various types of modules
Core modules to be used
in all TEI documents
core
header
textStructure
Modules related to
specific genre or text
types
dictionaries
drama
spoken
verse
Modules adding features or
functionalities to the TEI
architecture
figures
namesdates
msdescription
linking
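In an ODD customization, modules are selected with <moduleRef>. The sketch below assumes a dictionary project; the schema identifier myDictionary is hypothetical. The module keys are written as they appear in the TEI source (textstructure in lower case), and the tei module supplies the class infrastructure.

```xml
<schemaSpec xmlns="http://www.tei-c.org/ns/1.0"
            ident="myDictionary" start="TEI">
  <!-- infrastructure and core modules -->
  <moduleRef key="tei"/>
  <moduleRef key="core"/>
  <moduleRef key="header"/>
  <moduleRef key="textstructure"/>
  <!-- genre-specific module -->
  <moduleRef key="dictionaries"/>
  <!-- only two elements are taken from this module -->
  <moduleRef key="namesdates" include="persName placeName"/>
</schemaSpec>
```

The include attribute addresses the complaint that modules are too big: only the named elements are pulled in.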
THE TEI CLASS SYSTEM
The TEI distinguishes over 500 elements. Having these organised into
classes aids comprehension, modularity, and modification.
Introduction
Grouping mechanisms
Simplification of the customization process
Model Classes
Groups together all elements with the same role in the TEI architecture
Same syntactic behaviour
The elements in the class will appear in the same content models
Semantic similarity
The class defines a group of elements belonging to the same family of concepts
Principle:
elements declare themselves as belonging to a class
Classes are named according to the kind of relationship they express
[model.xxxLike]
Groups elements behaving in a similar way to xxx
E.g. model.biblLike
[model.xxxPart]
Groups together elements central to the content model of xxx
E.g. model.biblPart
The standard paradigm
for content models
Element a
child elements are listed
as part of the content model
of the parent element
Element x
Element y
Element z
Element w
Difficulties with the standard model
Pure top down view
Contents are always defined according to a parent
element
An element cannot declare that it belongs to a
content model
Lack of factorization of content models
Groups of elements with the same behavior cannot be
reused
Lack of flexibility
No customization possibilities of content models
The TEI paradigm for content models
with classes
child classes are listed as
part of the content model of the
parent element

classes are populated
independently from their usage
Element x
Element y
Element z
Element w
Model class
Element a
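The two paradigms can be contrasted in ODD, with content models written as RelaxNG fragments. Everything here is hypothetical and only mirrors the diagram labels: the elements aEnumerated, aWithClass and x, and the class model.aPart, do not exist in the TEI.

```xml
<specGrp xmlns="http://www.tei-c.org/ns/1.0"
         xmlns:rng="http://relaxng.org/ns/structure/1.0">

  <!-- Standard paradigm: the parent enumerates each child itself -->
  <elementSpec ident="aEnumerated" mode="add">
    <content>
      <rng:zeroOrMore>
        <rng:choice>
          <rng:ref name="x"/>
          <rng:ref name="y"/>
          <rng:ref name="z"/>
          <rng:ref name="w"/>
        </rng:choice>
      </rng:zeroOrMore>
    </content>
  </elementSpec>

  <!-- TEI paradigm: the parent refers to a class only -->
  <classSpec ident="model.aPart" type="model" mode="add"/>
  <elementSpec ident="aWithClass" mode="add">
    <content>
      <rng:zeroOrMore>
        <rng:ref name="model.aPart"/>
      </rng:zeroOrMore>
    </content>
  </elementSpec>

  <!-- The class is populated independently: x declares its own membership -->
  <elementSpec ident="x" mode="add">
    <classes>
      <memberOf key="model.aPart"/>
    </classes>
    <content>
      <rng:text/>
    </content>
  </elementSpec>
</specGrp>
```

Adding a fifth child to aWithClass would require no change to the parent: the new element only has to declare membership of model.aPart.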
Example: a partial graph of elements
and classes
edition
model.biblPart
bibl
model.respLike model.imprintPart
extent series
author
editor
respStmt
imprint
distributor
publisher
biblScope
pubPlace
model.dateLike
model.biblLike
<imprint>
<pubPlace>Oxford</pubPlace>
<publisher>Clarendon Press</publisher>
<date>1987</date>
</imprint>
See <author>
See <imprint>
See <bibl>
Consequences
This mechanism decouples
the semantic expression of a content model, from
the actual syntactic implementation for a given subset
of the TEI guidelines
Thus facilitating the customization of the TEI
guidelines
Adding an element to complement a content model
Deselecting an element from a class to make a schema
more terse
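Deselecting an element from a class can be sketched as follows in ODD. The choice of <series> is only an example; it assumes a project that keeps the element elsewhere but no longer wants it inside <bibl>.

```xml
<elementSpec xmlns="http://www.tei-c.org/ns/1.0"
             ident="series" module="core" mode="change">
  <classes mode="change">
    <!-- series leaves model.biblPart; other memberships are untouched -->
    <memberOf key="model.biblPart" mode="delete"/>
  </classes>
</elementSpec>
```

No content model is edited: every element that refers to model.biblPart picks up the change when the schema is regenerated.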
ATTRIBUTE CLASSES
Attribute classes
Groups together attributes representing the
same kind of behaviour within the TEI
architecture
Examples:
Pointing attributes (@target)
Timing attributes (@start, @end)
Typing attributes (@type, @subtype)
Naming tradition in the TEI
att.xxx, where xxx qualifies the type of intended
behaviour (e.g. att.pointing, att.timed, att.typed)
Consequences
Avoid duplication of attribute definitions
@sortKey is only defined once in the TEI
guidelines, and used for bibliographic entries,
dictionary entries, persons, places, etc.
Facilitates the addition of some behavior to an
element
e.g. linking mechanisms, typing, etc.
Allows one to improve the attributes related
to a certain behavior
Example: shaping the attributes of
<bibl>
bibl
att.global
@xml:id, @n, @xml:lang,
att.global.facs
@facs
att.typed
@type, @subtype
att.sortable
@sortKey

<bibl type="book"><author></author>,
<title></title>,<date></date></bibl>
See <bibl>
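A customization can then constrain an attribute that <bibl> receives from a class. The sketch below restricts @type to a closed list; the three values are assumptions chosen for the example, not values defined by the TEI.

```xml
<elementSpec xmlns="http://www.tei-c.org/ns/1.0"
             ident="bibl" module="core" mode="change">
  <attList>
    <!-- @type comes from att.typed; here it is narrowed for bibl only -->
    <attDef ident="type" mode="change">
      <valList type="closed" mode="replace">
        <valItem ident="book"/>
        <valItem ident="article"/>
        <valItem ident="report"/>
      </valList>
    </attDef>
  </attList>
</elementSpec>
```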
ELEMENTS
Elements
Main components of the specification of an element
Identifier (the name of the element, or GI: generic
identifier)
Module to which the element belongs
Definition
Give the semantics, in plain text, of the element
Classes to which the element belongs
Model classes
Attribute classes
Content model of the element, as a RelaxNG fragment
Examples
Additional remarks
Derived information from the
specification of an element
Full documentation (HTML, pdf, etc.)
Example specification of <imprint>
Contribution to a compiled schema
DTD, RelaxNG or W3C schema
E.g. see the behaviour of <imprint> in Oxygen
Allowed children
Available attributes
Pop-up documentation
TEI as a library of text concepts
Main conceptual objects
Modules
Classes
Elements
Creating a TEI conformant schema
Choosing from the above
Possibly adding your own components
A language to express all these
ODD: one document does it all, also expressed in TEI
ODD AND CUSTOMIZATION
Main concepts
Literate programming (Donald Knuth)
Integration of documentation and program within one
single specification
Knuth, Donald E. (1992). Literate Programming. California:
Stanford University Center for the Study of Language and
Information.
TEI: One Document Does it all
Schema specification
User oriented documentation
TEI customization
User-defined objects
<myElement>
Drama
Dictionary
Names, dates
Customization
Selection of modules
Modification/deletion of
elements and attributes
Additions
Linking in new elements
Global behavior
Local declaration
<gen>


<caseFrame>
model.gramPart
<gramGrp>
<gramGrp>
<pos>verb</pos>
<subc>intransitive</subc>
<caseFrame>X344</caseFrame>
</gramGrp>
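The <caseFrame> addition shown above could be declared in ODD roughly as follows. This is a sketch: the namespace URI is a placeholder, and the plain-text content model is an assumption made for the example.

```xml
<elementSpec xmlns="http://www.tei-c.org/ns/1.0"
             xmlns:rng="http://relaxng.org/ns/structure/1.0"
             ident="caseFrame" ns="http://www.example.org/ns/mydict"
             mode="add">
  <desc>Hypothetical user-defined element holding a case frame code.</desc>
  <classes>
    <!-- membership makes caseFrame available wherever gramGrp accepts model.gramPart -->
    <memberOf key="model.gramPart"/>
    <memberOf key="att.global"/>
  </classes>
  <content>
    <rng:text/>
  </content>
</elementSpec>
```

Placing the new element in its own namespace keeps it distinguishable from the TEI vocabulary in the resulting documents.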
Roma
Roma: web front-end to express and compile TEI
customizations

Online demo and exploration of an ODD file

The people behind Roma are:
Arno Mittelbach
Initial programming
Sebastian Rahtz
Maintenance and frequent improvements
Ioan Bernevig
A 'Sanity Checker' addition
Discussion points
Conformance (cont.): First risk of divergence across
applications
Two application profiles are not necessarily compatible
with one another
Still, they share the same semantics for the same element
For additional elements, the documentation in ODD
facilitates negotiated interchange
TEI all
My TEI
schema
SCIENTIFIC AND TECHNICAL
INFORMATION USE CASES

Part 4
Three exemplary usages of the TEI
The EU Peer project
The TEI guidelines as an interoperability hub
The HAL publication repository
A three-level bibliographic exchange format
TBX goes TEI
Providing the TEI guidelines with an ISO compliant
terminological extension

INTERFACING NON-PATENT LITERATURE
IN THE EU PEER PROJECT
The PEER project
Initiated by the EU commission (DG INFSO)
Objective: study the impact of systematically
archiving stage-two outputs in institutional
repositories (cf. Romary & Armbruster 2010)
on journals and business models
on wider ecology of scientific research
Consortium
STM, European Science Foundation (ESF), Goettingen
State and University Library (UGOE), Max Planck
Gesellschaft (MPG), INRIA
Describing a large scale ingestion task


PEER Publishing and the Ecology of European Research 83 www.peerproject.eu
Content submission - publishers
Eligible Journals / Articles
Publishers
PEER Depot Authors
Select
100 % Metadata 50 % Manuscripts
Publishers
Transfer
50 % Manuscripts
Publishers
Deposit
Publishers
Inform

Content submission to repositories & LTP archive
PEER Depot
Transfer
Authors
Deposit
Transfer
Long-Term Preservation;
LTP Depot
(e-Depot, KB)
Publicly Available PEER Repositories









UGOE
HAL
ULD
TDC
MPG
SSOAR
KTU
Publishers
Deposit
Why is it so difficult?
Great heterogeneity of format within publishers
Meta data (and full-text)
Proprietary, ScholarOne, NLM 2.0, NLM 3.0,
Various issues
Affiliations
Publication date information
ISO 3166 codes (countries)
Bibliographical references
Proprietary metadata fields
The information chaos
Article title
article-title/title | ArticleTitle | article-title | ce:title |
art_title | article_title | nihms-submit/title |
ArticleTitle/Title | ChapterTitle
Journal title
j-title | JournalTitle | full_journal_title | jrn_title | journal-title
ISSN (print)
JournalPrintISSN | issn[@issn_type='print'] | issn[@pub-type='ppub'] |
PrintISSN | issn-paper
First page of a paper
spn | FirstPage | ArticleFirstPage | fpage | first-page
Sorting this out
Defining a coherent infrastructure to facilitate
The long-term management of scholarly content in
research institutions
Smooth interaction between publishers and research
institutions
Better understanding of what each of us can provide
A standards-based approach
All meta-data records transformed into a TEI
<biblStruct> representation
Meta-data records and files uploaded to publication
repositories using a SWORD interface
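A target <biblStruct> record for the four fields listed under "The information chaos" might look like this. All values are bracketed placeholders or dummy data; the mapping comments repeat field names from that slide. The unit and from attributes follow current TEI practice, whereas the earlier example in these slides uses type="pp".

```xml
<biblStruct xmlns="http://www.tei-c.org/ns/1.0">
  <analytic>
    <!-- from article-title, ArticleTitle, ce:title, art_title, ... -->
    <title level="a">[article title]</title>
  </analytic>
  <monogr>
    <!-- from j-title, JournalTitle, full_journal_title, jrn_title, ... -->
    <title level="j">[journal title]</title>
    <!-- from JournalPrintISSN, PrintISSN, issn-paper, ...; dummy value -->
    <idno type="ISSN">0000-0000</idno>
    <imprint>
      <!-- from spn, FirstPage, ArticleFirstPage, fpage, first-page; dummy value -->
      <biblScope unit="page" from="1">1</biblScope>
      <date>[publication date]</date>
    </imprint>
  </monogr>
</biblStruct>
```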
The PEER deposit workflow
[Diagram: publishers deliver proprietary formats to the PEER Depot, which passes TEI records on to the repositories (HAL, SUB-Gt, MPS) and to preservation (KB)]
Conclusions and next steps
Towards a global vision for the representation of scientific and technical
information
Non patent literature
Scholarly papers
Books
Other textual and non-textual sources (e.g. recordings)
Patent literature
Coherence with the PDM
Relying on an interoperability hub
Large coverage TEI based model
Maintenance; well-documented and stable initiative
Departing from proprietary formats (even NLM JATS)
Anticipating future usages
Licensing (subscriptions, publication)
Multiple granularity
Long term archiving (version of record)
Contribution to the evolution of scholarly communication
New publication models (overlay journals, open commentaries)
Research data
BACK-OFFICE INTERCHANGE
FORMAT IN HAL
Origins and aims
Challenges in scientific information management
Serials crisis, open access movement
New platforms (arXiv), new models (PLOS)
HAL: a publication repository for the French academic
community
All domains: from physics to human sciences
Articles, reports, PhD theses
Nearly 280 000 full-text documents
A tool for researchers and institutions
Used as a pre-print server by some communities (mirrored
onto arXiv)
Reporting tool and deposit mandates (e.g. Inria)
Document workflows in HAL
[Diagram: HAL sits between import (upload) and export (visualise) channels. Labels shown: online visualization, researchers' home pages, reporting lists, queries; OAI-PMH, arXiv; BIB2HAL; SWORD interface, arXiv]

A three-level data management
Workflow management: responsibilities, timing, queries
HAL surrogate: affiliations, depositor, versions, rights
Printed or digital sources: publication information, publisher, journal

Implementation in TEI
Workflow management: TEI document
HAL surrogate: list of biblFull objects
Printed or digital sources: biblStruct
Example
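The three levels can be sketched as nested TEI structures. This is an illustration of the layering only, not the actual HAL export format; every bracketed value is a placeholder.

```xml
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <!-- Level 1, workflow management: the TEI document and its header -->
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>[export or query description]</title>
      </titleStmt>
      <publicationStmt>
        <p>[responsibilities and timing]</p>
      </publicationStmt>
      <sourceDesc>
        <p>Generated from repository records.</p>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <listBibl>
        <!-- Level 2, repository surrogate: one biblFull per deposit -->
        <biblFull>
          <titleStmt>
            <title>[title of the deposited document]</title>
          </titleStmt>
          <editionStmt>
            <edition n="v1">[version]</edition>
          </editionStmt>
          <publicationStmt>
            <distributor>[repository]</distributor>
            <availability>
              <licence>[rights]</licence>
            </availability>
          </publicationStmt>
          <sourceDesc>
            <!-- Level 3, printed or digital source: a biblStruct -->
            <biblStruct>
              <monogr>
                <title level="j">[journal]</title>
                <imprint>
                  <publisher>[publisher]</publisher>
                </imprint>
              </monogr>
            </biblStruct>
          </sourceDesc>
        </biblFull>
      </listBibl>
    </body>
  </text>
</TEI>
```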
Conclusion
Complex document models for complex
document architectures
Cf. PDM integrating the document work flow
Families
Stages in the examination process
Examiners' annotations
INTEGRATING TBX ENTRIES IN THE
TEI FRAMEWORK
Terminology management
An essential task in several language related activities
Translation; machine translation
Technical writing
Data mining; indexing tasks
Concept-based approach
Concept to term representation (onomasiological)
Cf. early works of Eugen Wüster
A whole range of related standards in ISO committee TC
37/SC 5
ISO 704
ISO 1087
Simple case: glossaries
Complex Termbanks - Termsciences
Scientific and technical terminology:
maintained at INIST-CNRS
multi-institutional
around 500 000 terms
Complex Termbanks - IATE
Inter-Active Terminology for Europe (http://iate.europa.eu)
EU's inter-institutional terminology database
Cf. EURODICAUTOM (Commission), EUTERPE (Parliament), TIS (Council)
in place since 2004 (public 2007)
8.4 million terms (540 000 abbreviations and 130 000 phrases)
covers all 24 official EU languages
Standards for the digital
representation of terminologies
ISO 6156:1987 (Mater) format for representing terminological
information on magnetic tapes; followed by an adaptation for
microcomputers (MicroMater; see Melby, 1991);
Chapter in the TEI guidelines; SGML-based representation;
remained there until the P4 edition
ISO 12200 (Martif), published in 1999; improves the TEI proposal
Strongly inspired from the TEI (e.g. the header-text organisation;
entries embedded within a <text> and <body> hierarchy)
Reaching out to the translation and localisation industry
ISO 12620:1999, set of reference descriptors (or data categories)
ISO 16642:2003 (TMF) Terminological Markup Framework
TBX (TermBase eXchange) published in 2007 by LISA (Localisation
Industry Standards Association) as a follower to Martif
TBX: ISO standard 30042 in 2008
Modeling terminological entries with
ISO 16642 and ISO 12620:1999
ISO 16642:2003 Computer applications in
terminology -- Terminological markup
framework
Provides a meta-model for the description of
terminological databases
ISO 12620:1999 Computer applications in
terminology -- Data categories
Provides a reference set of descriptors for
building-up terminological data models
Building up a terminological model
Terminological
entry
Language
section
Term section Term section
Language
section
Term section
subjectField
definition+source
note
term
source
TBX serialisation
<termEntry xmlns="http://www.tbx.org">
<descrip type="subjectField" xml:lang="fr">Industrie mécanique</descrip>
<langSet xml:lang="de">
<descripGrp>
<descrip type="definition">endloser Riemen mit trapezförmigem Querschnitt, der auf zwei Riemenscheiben mit
Eindrehungen läuft</descrip>
<admin type="source">De Coster, Wörterbuch, Kraftfahrzeugtechnik, SAUR, München, 1982</admin>
</descripGrp>
<note>wird zum Antrieb der Lichtmaschine, des Ventilators und der Wasserpumpe benutzt</note>
<tig>
<term>Keilriemen</term>
<admin type="source">De Coster, </admin>
</tig>
</langSet>
<langSet xml:lang="fr">
<tig>
<term>courroie trapézoïdale</term>

</tig>
</langSet>
</termEntry>
ODD work in a nutshell
Keeping the TEI document architecture
Inserting TBX entries wherever there were
dictionary entries before
Using the class system to describe TBX entries
in ODD
Re-using TEI attributes and elements when
appropriate
att.global, att.typed, att.pointing
<term>, <ref>, <note>
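A sketch of how a TBX entry might be declared in ODD along these lines. It assumes that termEntry joins model.entryLike so that it can appear where dictionary entries do, and that model.auxInfo and langSet are declared elsewhere in the same ODD; the actual TBX-in-TEI specification may differ.

```xml
<elementSpec xmlns="http://www.tei-c.org/ns/1.0"
             xmlns:rng="http://relaxng.org/ns/structure/1.0"
             ident="termEntry" ns="http://www.tbx.org" mode="add">
  <desc>Terminological entry: one concept with its language sections.</desc>
  <classes>
    <!-- allowed wherever TEI dictionary entries are allowed -->
    <memberOf key="model.entryLike"/>
    <!-- re-use of TEI global attributes (xml:id, xml:lang, n) -->
    <memberOf key="att.global"/>
  </classes>
  <content>
    <!-- concept-level information: descrip, admin, note, ... -->
    <rng:zeroOrMore>
      <rng:ref name="model.auxInfo"/>
    </rng:zeroOrMore>
    <!-- one language section per language -->
    <rng:oneOrMore>
      <rng:ref name="langSet"/>
    </rng:oneOrMore>
  </content>
</elementSpec>
```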


TBX in ODD architecture
termEntry
langSet
tig
model.auxInfo
model.auxInfo
model.auxInfo tei:term termNote
model.auxInfo:
admin
descrip
descripGrp
transacGrp
(TEI) model.ptrLike
(TEI) model.noteLike
Demo
Oxygen
RelaxNG schema
Putting in entries
Showing the ODD
TBX goes TEI: Perspectives
Future component in the TEI framework
Framework for the future developments
around TBX (TBX Basic, TBX min)
Basis for terminology management at EPO?
With further customization work
