
Comparative and Functional Genomics

Comp Funct Genom 2004; 5: 509–520.


Published online in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/cfg.435
Conference Paper

Obol: integrating language and meaning in bio-ontologies
Christopher J. Mungall*
HHMI, Department of Molecular and Cellular Biology, Life Sciences Addition, University of California, Berkeley, CA 94720-3200, USA

*Correspondence to: Christopher J. Mungall, University of California Berkeley, Dept. of Molecular and Cell Biology, 142 Life Sciences Addition #3200, Berkeley, CA 94720-3200, USA.
E-mail: cjm@fruitfly.org

Received: 1 November 2004
Revised: 2 November 2004
Accepted: 3 November 2004

Abstract

Ontologies are intended to capture and formalize a domain of knowledge. The ontologies comprising the Open Biological Ontologies (OBO) project, which includes the Gene Ontology (GO), are formalizations of various domains of biological knowledge. Ontologies within OBO typically lack computable definitions that serve to differentiate a term from other similar terms. The computer is unable to determine the meaning of a term, which presents problems for tools such as automated reasoners. Reasoners can be of enormous benefit in managing a complex ontology. OBO term names frequently implicitly encode the kind of definitions that can be used by computational tools, such as automated reasoners. The definitions encoded in the names are not easily amenable to computation, because the names are ostensibly natural language phrases designed for human users. These names are highly regular in their grammar, and can thus be treated as valid sentences in some formal or computable language. With a description of the rules underlying this formal language, term names can be parsed to derive computable definitions, which can then be reasoned over. This paper describes the effort to elucidate that language, called Obol, and the attempts to reason over the resulting definitions. The current implementation finds unique non-trivial definitions for around half of the terms in the GO, and has been used to find 223 missing relationships, which have since been added to the ontology. Obol has utility as an ontology maintenance tool, and as a means of generating computable definitions for a whole ontology.

The software is available under an open-source license from: http://www.fruitfly.org/~cjm/obol. Supplementary material for this article can be found at: http://www.interscience.wiley.com/jpages/1531-6912/suppmat. Copyright © 2004 John Wiley & Sons, Ltd.

Introduction

The Gene Ontology (GO) is a collection of three ontologies, partitioned into orthogonal domains: molecular function, biological process and cellular component [1,2]. GO is part of the Open Bio-Ontologies (OBO) project [3], which includes ontologies for other biologically relevant domains, such as anatomy, cell type [4], chemical compounds and phenotypic descriptors [5,6].

Ontologies in OBO consist of terms, which are used to describe biological data, such as gene products. Each term must have a name, which is a concise phrase capturing the meaning of the term; e.g. ‘negative regulation of interleukin-2 biosynthesis’, ‘cardioblast differentiation’ and ‘small ribosomal subunit’. Terms may also have one or more synonyms.

Terms are interconnected via typed binary relationships, such as ‘interleukin is a cytokine’ or ‘small ribosomal subunit part of mitochondrial ribosome’.

Text definitions precisely state the exact meaning of a term. Not all terms have text definitions. Text definitions are interpreted by users, not computers. An OBO term has little in the way of computable definitions, which makes it difficult to perform certain kinds of reasoning over the terms in the ontology [7,8].

It has been noted [7,9–11] that many term names are compositional, and indicate implicit relationships to other terms, possibly in different ontologies; e.g. ‘cardioblast differentiation’ in the biological process ontology has an implicit relationship to the term ‘cardioblast’ in the cell ontology, yet this relationship is absent from both ontologies; currently there are no explicit relationships between different ontologies in OBO. A computable definition of ‘cardioblast differentiation’ would by necessity reference ‘cardioblast’.

The existence of composite terms leads to redundancy in both text definitions and relationships [12]. For example, the term ‘cytokine’ has redundant definitions embedded in the text definitions for ‘cytokine metabolism’ and ‘cytokine biosynthesis’. The redundancy in relationships manifests itself as ‘cytokine metabolism’ and ‘cytokine biosynthesis’ being related via is_a to both ‘protein metabolism’ and ‘protein biosynthesis’.

The compositional nature of many terms leads to an increase in the number of relationships and a consequent increase in the complexity of the ontology. This is particularly true of the GO; the term ‘positive regulation of T-helper 2 cell differentiation’ has 114 distinct paths through the relationships in the ontology to the root term. This complexity can have a negative impact on both users searching the ontologies and curators maintaining them.

We could integrate OBO by providing computable definitions for all existing terms. This would bring benefits, such as the ability to reason over the ontology and automatically derive certain relationships. However, adding and maintaining these computable definitions is a significant undertaking, and it is unclear whether the benefits would outweigh the costs.

An alternative approach is to use implicit knowledge encoded in a term name to derive the intended meaning of the term. OBO term names exhibit a high degree of regularity in phrase structure, and in how that structure relates to meaning. In fact, OBO term names can be viewed as phrases written in a formal language, described using logical rules. The formal language of OBO term phrases is a subset of natural language, and it thus serves the dual purpose of specifying meaning for both human users and computers.

This paper describes the attempt:

• To elucidate the formal language underlying OBO, called Obol.
• To use this language to derive the meaning of existing OBO terms.
• To use these derived meanings to help manage the redundancy and complexity within OBO.

Materials and methods

Source ontologies

A subset of OBO was used for constructing and testing Obol. The system described here was constructed using the three GO ontologies, the biochemical ontology and the cell ontology [4]. These ontologies were downloaded on 24 July 2004. Currently there is neither a generic (non-species-specific) anatomy ontology nor a protein family ontology in OBO.

Tokenization

Each term in OBO has exactly one primary name, which is a string of characters containing a phrase or sequence of words, indicated here by W*. The first step in determining the implicit meaning of a term is tokenizing the term name string.

Tokenization breaks a term name phrase-string into an ordered sequence of word-strings. Each of these word-strings is treated as an atomic (non-decomposable) token. The phrase-string is split on white-space characters and non-alphanumeric characters. White-space characters are discarded, but non-alphanumeric characters (e.g. the hyphen character ‘-’) are preserved and treated as special word tokens. For example:

W*_GO:0045085 = (negative, regulation, of, interleukin, ‘-’, 2, biosynthesis)
W*_GO:0006412 = (protein, biosynthesis)

Atomic vocabularies (AVs)

Each word token is matched against an atomic vocabulary (AV) of words. The AV contains words, not phrases. It thus overlaps with, but is distinct from, the OBO-controlled vocabulary of phrases, where each phrase consists of one or more words.

The atomic vocabulary assigns each word a domain and a lexical category (also known as a part of speech). The domains typically correspond to the domains of existing ontologies within OBO (e.g. molecular function, cellular component, cell, biochemical) and also to domains that are not presently covered by OBO (environment, protein, generic anatomy). Table 1 shows a list of domains defined by Obol.

Table 1. The 13 domains used to partition the Atomic Vocabularies. Some ontologies have been split into more than one domain; this was necessary if the ontology did not contain complete is_a parentage to a suitable upper-level term

Domain | Corresponding OBO ontology | Notes
General | none | Mostly prepositions
Anatomy | Various species-specific | Generic anatomical structures
Function | GO molecular function |
Process | GO biological process |
Component | GO cellular component | Cell parts
Cell | Cell | Cell types
Biochemical | Biochemical | Chemical compounds
Protein | In progress | e.g. Keratin, actin, interleukin-2
Environment | none | e.g. Taste, touch, light
Behaviour | GO biological process | Behaviour-specific processes
Enzyme | GO molecular function | e.g. Amylase, deaminase
Organism | none | e.g. Viral/virus, bacteria, parasite
Sequence | Sequence | e.g. Transcript, five, prime

The lexical category of a word defines the role that word plays in a phrase. For example, in the phrase ‘negative regulation’, the word ‘negative’ plays the role of adjective. For simplicity we assume that each word in the AV has exactly one lexical category. The lexical categories used include typical linguistic categories, such as nouns, adjectives, prepositions and relational adjectives. Relational adjectives are useful for linking adjectives to the noun form of that adjective. They are treated as binary relations between words; e.g. relational_adj(epidermal, epidermis). There is also an additional category for numbers, roman numerals, words designating Greek symbols and alphanumeric characters used as ‘type designators’; e.g. ‘myosin II’, ‘interferon-alpha’ and ‘interleukin-2’.

OBO terms typically do not contain verbs, so there is no need for such a category. Nominalized verbs (e.g. ‘regulation’, derived from ‘regulate’) are treated no differently from other nouns. It is rare for OBO term names to include the definite or indefinite article (‘the’ or ‘a’), or other lexical categories commonly used in natural language, such as pronouns (‘it’, ‘they’, ‘he’, ‘she’). It is also rare for there to be different variants of the same word, so there is no requirement for reducing words to stem forms.

The phrase ‘negative regulation of interleukin-2 biosynthesis’ can be tokenized into words that are categorized into the following lexical categories and domains:

• negative (adjective, general).
• regulation (noun, biological process).
• of (preposition, general).
• interleukin (noun, protein family).
• - (hyphen) (special token).
• 2 (type designator, general).
• biosynthesis (noun, biological process).

Constructing the atomic vocabulary

All term names in the source ontology were tokenized; this resulted in a set of word tokens, which comprised the initial AV. Domains and lexical categories were assigned both manually and semi-automatically, using the OBO grammar (see next section).

As OBO changes, the AVs must also change. However, the atomic vocabularies need not be completely up-to-date, because the system exhibits graceful degradation with incomplete or incorrectly categorized AVs; performance is impacted, but not severely.

OBO term grammar

A computational (or formal) grammar is a way to describe a formal language, analogous to the concept of grammars for natural languages [13,14]. A formal language is a set of sequences (e.g. sentences) over a finite alphabet (e.g. words).
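
The words of this alphabet come from the tokenization and atomic-vocabulary steps described above. The following is a minimal sketch of those two steps in Python, not the Prolog used by the Obol system. The names `tokenize`, `categorize` and `ATOMIC_VOCABULARY` are illustrative assumptions, and the vocabulary holds only the six words from the worked example rather than a real AV.

```python
import re

# Hypothetical miniature atomic vocabulary: word -> (lexical category, domain).
# Entries are copied from the worked example in the text; a real AV is far larger.
ATOMIC_VOCABULARY = {
    "negative": ("adjective", "general"),
    "regulation": ("noun", "biological process"),
    "of": ("preposition", "general"),
    "interleukin": ("noun", "protein family"),
    "2": ("type designator", "general"),
    "biosynthesis": ("noun", "biological process"),
}


def tokenize(term_name):
    """Split a term name into word tokens.

    White space is discarded; every other non-alphanumeric character
    (such as the hyphen) is kept as a token of its own.
    """
    return re.findall(r"[A-Za-z0-9]+|[^A-Za-z0-9\s]", term_name)


def categorize(tokens):
    """Look each token up in the vocabulary, returning (token, category, domain)."""
    result = []
    for token in tokens:
        if token in ATOMIC_VOCABULARY:
            category, domain = ATOMIC_VOCABULARY[token]
            result.append((token, category, domain))
        elif not token.isalnum():
            result.append((token, "special token", None))
        else:
            # Word missing from the vocabulary: flagged rather than rejected.
            result.append((token, "unknown", None))
    return result


tokens = tokenize("negative regulation of interleukin-2 biosynthesis")
print(tokens)
# ['negative', 'regulation', 'of', 'interleukin', '-', '2', 'biosynthesis']
for row in categorize(tokens):
    print(row)
```

Reporting unknown words instead of failing is one simple way to let later stages continue with an incomplete vocabulary; whether Obol handles missing words this way is not stated in the text.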


A computational grammar G consists of:

• A finite set Σ of terminal symbols.
• A finite set N of non-terminal symbols, disjoint from Σ.
• A finite set P of production rules, where a rule is of the form:
  - Some string in (Σ ∪ N)* → some string in (Σ ∪ N)*.
  - [where (Σ ∪ N)* indicates zero or more occurrences of terminal and non-terminal symbols].
• A symbol S in N that is indicated as the start symbol.

The language of a formal grammar G = (N, Σ, P, S), denoted as L(G), is defined as all those strings over Σ that can be generated by starting with the start symbol S and then applying the production rules P until no more non-terminal symbols are present.

A grammar can be used for either generating or parsing sequences of tokens. Parsing a sequence of tokens with a grammar will produce a parse tree, which can be used to elucidate the structure of the sequence. A sequence of tokens may have zero or more parse trees.

OBO term names are both natural language phrases and well-formed phrases conforming to a formal grammar.

The formal grammar G_OBO contains terminal symbols Σ_OBO equivalent to the word tokens from the AVs. The set of non-terminal symbols N_OBO is the union of the set of lexical categories (nouns, adjectives, etc.) and the set of phrase types used to construct a term. Examples of phrase types are noun phrases and prepositional phrases.

The start symbol S_OBO refers to a complete OBO term name.

The production rules P_OBO specify how larger phrases are recursively constructed from smaller phrases and from words. A simplified subset of the production rules is included below; the entire grammar can be viewed by downloading the system. Non-terminal symbols are indicated with a leading upper-case character, terminal symbols in lower-case:

• S_OBO → NounPhrase
  (an OBO term name is a noun phrase; e.g. the noun phrase ‘negative regulation of protein biosynthesis’)
• NounPhrase → Noun
  (a noun phrase can be a single noun; e.g. ‘regulation’ or ‘biosynthesis’)
• NounPhrase → Adjective NounPhrase
  (a noun phrase can be an adjective immediately followed by a noun phrase; e.g. ‘negative regulation’ or ‘smooth muscle contraction’)
• NounPhrase → NounPhrase ‘-’ Token
  (a noun phrase can be a noun phrase immediately followed by a type designator; e.g. ‘myosin II’ or ‘interleukin-2’ or ‘interferon-alpha’)
• NounPhrase → NounPhrase NounPhrase
  (a noun phrase can be a stem noun phrase immediately preceded by a modifier noun phrase; e.g. ‘interleukin-2 biosynthesis’ or ‘muscle contraction’). Note that this rule is left-recursive, which can cause problems with some computational systems.
• PrepPhrase → Prep NounPhrase
  (a prepositional phrase can be a noun phrase immediately preceded by a preposition; e.g. ‘of interleukin-2 biosynthesis’ or ‘by pheromones’)
• NounPhrase → NounPhrase PrepPhrase
  (a noun phrase can be a noun phrase immediately followed by a prepositional phrase; e.g. ‘negative regulation of interleukin-2 biosynthesis’)
• Noun → abscission | absorption | accumulation | acetylation | ...
  Adjective → apical | basal | early | endocytic | ...
  Prep → by | of | in | as | during | via | with | using | ...
  Token → 1 | 2 | ... | alpha | beta | ... | A | B | ... | I | II | III | IV | ...
  (lexical categories map to words in the AVs. Note the use of the pipe symbol ‘|’ to indicate ‘or’)

Not shown here are rules for dealing with relational adjectives (e.g. ‘cytosolic ribosome’) and for Boolean connectors (e.g. ‘recognition and cleavage’).

G_OBO production rules typically have one symbol on the left-hand side and one or two symbols on the right-hand side. When a rule has two symbols on the right-hand side, one symbol acts as the stem phrase, the other as the supplementary phrase. In the example rules given above, supplementary phrases are indicated with italics.

An example syntax parse of the GO term with name ‘negative regulation of interleukin-2 biosynthesis’ is shown in Figure 1. Stem phrases are indicated with the bold lines.
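
To make the simplified rule subset concrete, here is a minimal Python sketch that counts the parse trees of a token sequence under those rules. It is an illustration under stated assumptions, not the Obol implementation: it uses memoized span counting rather than a Prolog grammar, `LEXICON` is a hypothetical ten-entry stand-in for the AVs, and `count_parses` is an illustrative name. Because it works over spans, the left-recursive NounPhrase rule causes no difficulty.

```python
from functools import lru_cache

# Hypothetical stand-in for the atomic vocabularies: token -> lexical category.
LEXICON = {
    "negative": "Adjective", "smooth": "Adjective",
    "regulation": "Noun", "interleukin": "Noun", "biosynthesis": "Noun",
    "muscle": "Noun", "contraction": "Noun",
    "of": "Prep", "2": "Token", "-": "Hyphen",
}


def count_parses(tokens):
    """Count NounPhrase parse trees for `tokens` under the simplified rules."""
    cats = [LEXICON.get(t) for t in tokens]

    @lru_cache(maxsize=None)
    def noun_phrase(i, j):
        """Number of NounPhrase parses of tokens[i:j]."""
        if j - i == 1:
            # NounPhrase -> Noun
            return 1 if cats[i] == "Noun" else 0
        total = 0
        # NounPhrase -> Adjective NounPhrase
        if cats[i] == "Adjective":
            total += noun_phrase(i + 1, j)
        # NounPhrase -> NounPhrase '-' Token
        if j - i >= 3 and cats[j - 2] == "Hyphen" and cats[j - 1] == "Token":
            total += noun_phrase(i, j - 2)
        # NounPhrase -> NounPhrase NounPhrase | NounPhrase PrepPhrase
        for k in range(i + 1, j):
            left = noun_phrase(i, k)
            if left:
                total += left * (noun_phrase(k, j) + prep_phrase(k, j))
        return total

    @lru_cache(maxsize=None)
    def prep_phrase(i, j):
        """Number of PrepPhrase parses of tokens[i:j] (PrepPhrase -> Prep NounPhrase)."""
        if j - i >= 2 and cats[i] == "Prep":
            return noun_phrase(i + 1, j)
        return 0

    return noun_phrase(0, len(tokens)) if tokens else 0


example = ["negative", "regulation", "of", "interleukin", "-", "2", "biosynthesis"]
print(count_parses(["muscle", "contraction"]))  # 1
print(count_parses(["of", "regulation"]))       # 0: not a noun phrase
print(count_parses(example))                    # more than one: the phrase is ambiguous
```

A sequence belongs to the language of this toy grammar exactly when its count is non-zero. The toy subset has none of the additional rules or precedence constraints of the full grammar, so its counts should not be read as the counts Obol itself reports.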


Figure 1. An example parse of a GO term. Stem phrases are indicated by bold text

The syntax parse can also be shown as a bracketed expression. There are actually two possible parses under G_OBO:

1. (NP (NP negative_adj regulation_noun) (PP of_prep (NP (NP interleukin_noun - 2_token) biosynthesis_noun)))
2. (NP negative_adj (NP regulation_noun (PP of_prep (NP (NP interleukin_noun - 2_token) biosynthesis_noun))))

P_OBO is augmented with precedence rules (not shown) to favour the first parse.

P_OBO was constructed manually. Not all GO term names are in L(G_OBO), which is to say that not all term names can be parsed. Some term names have more than one possible parse tree, indicating possible ambiguity in the syntactic structure (and thus in the interpreted meaning of the term). L(G_OBO) contains an infinite number of potential term names, a finite subset of which correspond to meaningful biological phrases. The next step is to derive these meaningful phrases.

Deriving class definitions

The grammar G_OBO specifies a syntactic parse of OBO terms; this elucidates the compositional structure of term names. The next step is to derive a semantic parse; in other words, to derive the meaning encoded in a term name. The meaning of a term can be formally specified as an Obol definition, modelled after Aristotelian definitions [15].

An Obol definition consists of a genus and differentiae. The genus is a broad category (as distinct from genus, sensu phylogeny). Obol considers each word in the AV to be a distinct genus. The differentiae are a set of necessary and sufficient conditions that distinguish a term from other terms of the same genus. The differentiae are similar to standard relationships between OBO terms. However, they are different in that these standard relationships are not always sufficient to distinguish one term from other terms of the same genus. For example, the term ‘interleukin-2 biosynthesis’ is defined as having genus ‘biosynthesis’ and the differentia ‘forms interleukin-2’. The property of creating interleukin-2 proteins is sufficient to discriminate this term from all other kinds of biosynthesis.

Definitions can be nested. The definition for ‘interleukin-2’ would have genus ‘interleukin’ and differentia ‘type token 2’. We use the notation <genus> <rel-type> = (<definition>) to specify a definition, and simply <genus> for a primitive definition. Figure 2 shows the definitional structure of the GO term ‘negative regulation of interleukin-2 biosynthesis’.

(regulation qualifier = (negative)
            affects = (biosynthesis forms = (interleukin type_token = (2))))

Figure 2. Class definition for the term ‘negative regulation of interleukin-2 biosynthesis’

Genus–differentiae definitions are useful for automated reasoning. For example, we can prove that ‘interleukin-2 biosynthesis’ is_a ‘interleukin biosynthesis’ is_a ‘cytokine biosynthesis’ from a single relationship ‘interleukin’ is_a ‘cytokine’, given some ontology of protein families.

It is possible to derive candidate definitions for phrases using a grammar, G, a set of is_a relationships, R, ranging over the genus categories in an AV, and a set of biological relationship types, Q. OBO has a minimal set of logical relations defined in the relations ontology [16]. This was manually augmented to get Q_OBO. Here, a relationship type belonging to Q is defined by a set of domain, range and grammatical context triples. The domain and range specify to which genus or genera a relationship type pertains. The grammatical context specifies how a relationship is linguistically manifested in a phrase. For example, consider the relationship type affects in Q_OBO. This type can have the domain ‘regulation’ and the range ‘process’ occurring with the preposition ‘of’ in some phrase. Because ‘biosynthesis’ is a ‘process’, the phrase ‘regulation of biosynthesis’ is equivalent to the definition regulation affects = (biosynthesis). Table 2 shows some example relationships and the contexts in which they are used.

Table 2. Examples of how grammatical contexts can be used to derive which relationship forms the differentia in a definition. Example terms are provided, with the genus shown in bold and the differentia term shown in italics. Note that this is just a subset of the contexts; the relationship ‘part of’ can be part of a set of differentiae in a variety of contexts. Currently the only relationship type above defined by GO is ‘part of’; Obol has to extend this basic set in order to derive definitions

Relationship type | Domain | Range | Grammatical context | Example (genus differentia)
Qualifier | Regulation | −ve OR +ve | Adjective prefix | Negative regulation
Affects | Process | Process | Prep phrase with ‘of’ | Regulation of xxx biosynthesis
Forms | Biosynthesis | Substance | Noun phrase prefix | Interleukin-2 biosynthesis
Type token | Any | Token | Type token | Interleukin-2
Part of | Component | Component | Relational adjective | Cytoplasmic chromosome

Table 3. Some results from applying semantic parses on typical OBO terms. Recursive definitions are shown as successive parses of each differentia term. Note that there is no protein ontology in OBO at present, so Obol creates a temporary ID for terms such as ‘interleukin-2’. Note also that there is no GO term for the genus ‘regulation’

OBO ID | Term name | Genus | Differentiae
GO:0045085 | Negative regulation of interleukin-2 biosynthesis | Regulation | Qualifier = negative; affects = GO:0042094
GO:0042094 | Interleukin-2 biosynthesis | Biosynthesis (GO:0009058) | Forms = <tempID:1>
<tempID:1> | Interleukin-2 | Interleukin | Type token = 2
GO:0000229 | Cytoplasmic chromosome | Chromosome (GO:0005694) | Part of = GO:0005737 (cytoplasm)

OBO term names are remarkably consistent in mapping grammatical structure to meaning. A term named ‘regulation of transcription’ would never use a different grammatical context (except as a synonym). An example of a different grammatical context is a relational adjective modifier, as in ‘transcriptional regulation’.

Class definition derivation method

Given a set of relationship types Q, as defined above, we can attempt to infer a class definition for any syntax parse tree.

Starting at the root of the parse tree, S, we recursively build the class definition. The production rules, P, are constructed such that each node in the parse tree should have one or two children. If it has one child, then the class definition at the parent node is equal to the class definition at the child node. If the node has two children (one stem node n1 and one supplementary node n2), then the class definition will be the combination of the class definition D1 at n1 with a relationship, r, to the class definition D2 at n2, where the possible relationship types are constrained by the grammatical type of the current node, n, together with the domain and range types matched to D1 and D2, according to Q.

When we reach the leaf nodes of the parse tree, we assign a primitive definition, which is simply a genus lacking differentiae.
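
The recursive derivation just described can be sketched in Python as follows. This is a simplified illustration, not the Obol class builder: the `Definition` class, the `KINDS` table (a hypothetical stand-in for is_a relationships over genera), the context labels and the tuple encoding of parse trees are all assumptions made for the example. The relationship table follows the first four rows of Table 2.

```python
from dataclasses import dataclass, field


@dataclass
class Definition:
    """A genus plus zero or more (relationship, Definition) differentiae."""
    genus: str
    differentiae: list = field(default_factory=list)

    def render(self):
        parts = [self.genus]
        parts += [f"{rel} = ({d.render()})" for rel, d in self.differentiae]
        return " ".join(parts)


# Hypothetical: the broad kinds each genus word belongs to.
KINDS = {
    "regulation": {"regulation", "process"},
    "biosynthesis": {"biosynthesis", "process"},
    "interleukin": {"substance"},
    "negative": {"polarity"},
    "2": {"token"},
}

# (relationship, domain, range, grammatical context); domain None means 'any'.
RELATIONSHIP_TYPES = [
    ("qualifier", "regulation", "polarity", "adjective_prefix"),
    ("affects", "process", "process", "prep_phrase_of"),
    ("forms", "biosynthesis", "substance", "noun_phrase_prefix"),
    ("type_token", None, "token", "type_token"),
]


def choose_relationship(context, stem, supplement):
    """Pick the first relationship whose context, domain and range all match."""
    for rel, domain, range_, ctx in RELATIONSHIP_TYPES:
        if ctx != context:
            continue
        if domain is not None and domain not in KINDS.get(stem.genus, set()):
            continue
        if range_ in KINDS.get(supplement.genus, set()):
            return rel
    return None


def derive(node):
    """Leaf: a word. One child: (child,). Two children: (context, stem, supplement)."""
    if isinstance(node, str):
        return Definition(node)          # primitive definition: genus only
    if len(node) == 1:
        return derive(node[0])           # parent definition equals child definition
    context, stem_node, supplement_node = node
    stem, supplement = derive(stem_node), derive(supplement_node)
    rel = choose_relationship(context, stem, supplement)
    if rel is None:
        raise ValueError(f"no relationship type fits context {context!r}")
    return Definition(stem.genus, stem.differentiae + [(rel, supplement)])


# First parse of 'negative regulation of interleukin-2 biosynthesis'.
tree = ("prep_phrase_of",
        ("adjective_prefix", "regulation", "negative"),
        ("noun_phrase_prefix", "biosynthesis", ("type_token", "interleukin", "2")))
print(derive(tree).render())
# regulation qualifier = (negative) affects = (biosynthesis forms = (interleukin type_token = (2)))
```

Raising an error when no relationship type fits is only one possible policy; a parse that cannot be given a relationship could equally be discarded so that other parses of the same term are tried.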


Deriving a class definition is also known as T by following a chain of one or more is a rela-
semantic parsing. This extension to a standard tionships.
grammar is called a semantic grammar. A semantic
grammar is reversible, allowing for the generation T is a+ T if (T is a X and X is a+ T )or(T is a T )
of phrases from definitions.
Table 3 gives an example of the semantic There are rules for inferring contingent inclusion
parse of the term name ‘negative regulation of is a relationships [16] that are based on natural is a
interleukin-2 biosynthesis’. relationships using genus–differentiae definitions
If we augment our definition of a grammar G (where G = genus, and the differentia is Q = (D),
with relationship types and a set of is a rela- and D is itself a definition). Here is a subset of the
tionships over genera, giving a semantic grammar Obol rule-set:
 = (N , , P , S , Q, RIS A ), then the language of
1: G,Q = (D) is a G,Q = (D ) if D is a D
, denoted as L(), is defined as all those mean-
2: G,Q = (D) is a G , Q = (D) if G is a G
ingful strings over  belonging to L(G). If OBO is
3: G,Q = (D) is a G if ¬(∃D such that D is a D )
defined perfectly, then L(OBO ) should correspond
to the universe of meaningful biological terms. We Using rule 1, and the semantic grammar, the
refer colloquially to L(OBO ) as Obol. Obol system infers an is a relationship between
Definitions can be exported using either the obo ‘chromoplast membrane’ (membrane surrounds
flat-file format (http://www.geneontology.org/ = chromoplast) and ‘plastid membrane’ (mem-
GO.format.html#oboflat) or using a description brane surrounds = plastid) based on the is a rela-
logic (DL) format, such as Ontology Web Lan- tionship between ‘chromoplast’ and ‘plastid’.
guage (OWL) [17]. An example of applying rule 2 is to dis-
The candidate class definitions can be considered cover that ‘vitamin E biosynthesis’ (biosynthesis
an end in and of themselves, or they can be used forms = vitaminE) is a ‘vitamin E metabolism’
for reasoning. (metabolism forms = vitaminE), based on the
is a relationship between ‘biosynthesis’ and ‘meta-
bolism’.
Reasoning An example of applying rule 3 is to discover that
Class definitions can be used for reasoning over ‘primary septum’ (septum type = primary) is a
the ontologies. This can be done using an external ‘septum’, based on the fact that there is no OBO
third-party reasoner, such as FaCT [18] or RACER term ‘primary’ and thus no parent for that term.
[19]. There is also an advantage to integrating some Rules such as these can be used to find errors
simple reasoning facility into the grammar system. of omission in an ontology, to automatically add
Some of the things we may wish to do with a relationships for newly created terms, to suggest
reasoner and some class definitions include: intermediate terms, and to determine which rela-
tionships in the ontology are true by contingency.
• Check for inconsistent or missing relationships, Table 4 has the complete set of all relationships
or missing terms. that can be derived for the term named ‘negative
• Automatically assign relationships for new regulation of interleukin-2 biosynthesis’. Note that
terms. all these derived relationships are currently present
• Extract implicit ontologies from GO. in GO due to curator diligence.

The reasoner can also be used to feed back Implementation


information to the class builder, to help filter The grammar, class builder and reasoner, described
ambiguous parses. above, are all specified and implemented in Prolog.
The Obol reasoner uses a rule-based system for The grammar is implemented as a Definite Clause
performing these tasks. An example of a simple Grammar (DCG) [20]. Prolog allows for direct
Obol rule is one involving subsumption (subtyp- specification of DCGs as part of the language.
ing). Here the notation is a+ means transitive is a, Prolog is a high-level declarative logic program-
and T is a+ T means that T can be reached from ming language, and is considerably different from

Copyright  2004 John Wiley & Sons, Ltd. Comp Funct Genom 2004; 5: 509–520.
516 C. J. Mungall

Table 4. Examples of relationships that can be automatically derived by semantically parsing the term ‘negative regulation
of interleukin-2 biosynthesis’. Note that all 16 of these relationships are currently present in GO, so Obol tells us nothing
new here. However, manual curation of these derivable relationships is an onerous and error-prone task. Obol can help
maintain these. Note also that in order to make all of these derivations Obol requires an ontology of proteins, such as
interleukins. No such ontology currently exists within OBO — an example of one was generated for this particular test.
This highlights the importance of OBO for the GO project
Relationship
Child term type Parent term Notes

Negative regulation of Is a Regulation of interleukin-2 ’Negative’ has no parents


interleukin-2 biosynthesis biosynthesis
Is a Negative regulation of cytokine From: interleukin-2 is a cytokine
biosynthesis
Regulation of Is a Regulation of cytokine biosynthesis From: interleukin-2 is a cytokine
interleukin-2 biosynthesis
Regulates Interleukin-2 biosynthesis Via inferred definition
Negative regulation of Is a Regulation of cytokine biosynthesis ’Negative’ has no parents
cytokine biosynthesis
is a Negative regulation of protein From: cytokine is a protein
biosynthesis
Regulation of cytokine Is a Regulation of protein biosynthesis From: cytokine is a protein
biosynthesis
Regulates Cytokine biosynthesis Via inferred definition
Negative regulation of Is a Negative regulation of biosynthesis ’Protein’ is root term in protein
protein biosynthesis ontology
Is a Regulation of protein biosynthesis ’Negative’ has no parents
Regulation of protein Is a Regulation of biosynthesis ’Protein’ is root term in protein
biosynthesis ontology
Regulates Proteinbiosynthesis Via inferred definition
Regulation of Regulates Biosynthesis Via inferred definition
biosynthesis
Interleukin-2 Is a Cytokine biosynthesis From: interleukin-2 is a cytokine
biosynthesis
Cytokine biosynthesis Is a Protein biosynthesis From: cytokine is a protein
Protein biosynthesis Is a Biosynthesis ’Protein’ is root term

Table 5. An illustrative subset of the 400 purported missing relationships derived by the system. Some of these derived
missing relationships were incorrect, based on erroneous definition parses. The final example illustrates a case where the
formal definition of a term cannot be derived from the term phrase alone
Subject Relationship type Object Notes

Nucleolar chromatin Part of Nucleolus Derived directly from inferred definition


Clathrin-coated vesicle Has part Clathrin coat Correct, but GO uses part of
Chromoplast membrane Is a Plastid membrane Derived via ‘chromoplast is a plastid’
Nuclear microtubule Part of Nucleus Derived directly from inferred definition
Vitamin E biosynthesis Is a Vitamin E From: biosynthesis is a metabolism
metabolism
Negative regulation of lipid biosynthesis Is a Negative From: biosynthesis is a metabolism
regulation of lipid
metabolism
Dense nuclear body Is a Nuclear body INCORRECT

Copyright  2004 John Wiley & Sons, Ltd. Comp Funct Genom 2004; 5: 509–520.
Obol: integrating language and meaning in bio-ontologies 517

languages typically used in bioinformatics, such


syntactic parses
as perl, C and java. It is particularly well suited semantic parses
to implementing logical rules, the foundations of

3000
the Obol system. We use XSB Prolog [21], which
provides tabling in addition to a standard prolog
system. This allows grammar production rules to

biological−process−terms
be left-recursive, as is the case for the POBO . The

2000
go-perl library [22] is used to convert OBO files
to prolog fact files. Interaction with the system is
either via prolog interpreter or via a UNIX make-
file. The system has been tested on both Linux and

1000
Mac OS X, and is available under an open source
licence from http://www.fruitfly.org/∼cjm/obol.

0
0 1 2 3 4 5 6 7
Results and discussion number of parses

Figure 3. The distribution of the number of parses,


Parsing both syntactic and semantic, for the biological process
GO ontology
Any OBO term can have zero, one or multiple ambiguous syntactic or semantic parses. We tested the system on all terms in GO (but not the rest of OBO). Ideally, one semantic parse (class definition) would be generated for each GO term, but in practice this is currently only achieved for about half of all terms. Ideally, the semantic parse would reflect the actual meaning of the term, but in some cases inaccurate meanings are derived.

Typically, term names with lots of words generate multiple ambiguous parses. In particular, term names that contain long sequences of nouns result in an exponential increase in syntactic parses; this is due to the rule NounPhrase → NounPhrase NounPhrase. Note that it is still possible to do reasoning with ambiguous parses.

The histograms in Figures 3, 4 and 5 show the distribution of the number of parses, both syntactic and semantic, for the three GO ontologies. Single-word terms have been excluded, since these are trivially parsed to the corresponding genus. The histograms have been truncated at five parses; one particular molecular function term (‘receptor signaling protein tyrosine kinase signaling protein activity’) has 132 syntactic parses (of which none derive a definition).

Figure 4. The distribution of the number of parses, both syntactic and semantic, for the cellular component GO ontology (x-axis: number of parses; y-axis: number of cellular component terms)

Note that the molecular function ontology has the highest number of terms that cannot be parsed. One reason is that the tokenizer and grammar originally did not deal adequately with term names containing chemical notation, such as ‘5(S)-hydroxyperoxy-6E,8Z,11Z,14Z-icosatetraenoic acid binding’. This has recently been fixed by adding tokenizer rules, such that compound identifiers are treated as single tokens.

Improving the AVs will reduce the number of incorrect syntax parses (and thus reduce the incorrect semantic parses). Many words are incorrectly categorized or are not yet present in the AVs. This can be improved by manual and semi-automatic AV curation.
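
The growth in syntactic parses for long noun sequences can be made concrete with a short Python sketch. It counts the parse trees that the rule NounPhrase → NounPhrase NounPhrase allows for a run of n nouns, assuming in addition a base rule NounPhrase → Noun (an assumption made here for illustration; the function name is also hypothetical). The counts are the Catalan numbers, which grow exponentially with n.

```python
from functools import lru_cache


def count_noun_phrase_parses(n_nouns):
    """Count parse trees for n consecutive nouns under
    NounPhrase -> NounPhrase NounPhrase, with the assumed base rule
    NounPhrase -> Noun."""
    if n_nouns < 1:
        raise ValueError("need at least one noun")

    @lru_cache(maxsize=None)
    def count(length):
        if length == 1:
            return 1  # a single noun has exactly one parse
        # Split the sequence at every possible point and combine the
        # number of parses of the left and right parts.
        return sum(count(k) * count(length - k) for k in range(1, length))

    return count(n_nouns)


for n in range(1, 9):
    print(n, count_noun_phrase_parses(n))
```

Memoising the count per sequence length keeps the computation cheap even though the number of distinct trees grows quickly.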


Refining the set of relationship types will improve semantic parsing. Currently only very general relationship types are used. More specific relationship types could help filter out ambiguous parses, and provide more specific meanings.

Figure 5. The distribution of the number of parses, both syntactic and semantic, for the molecular function GO ontology (x-axis: number of parses; y-axis: number of molecular function terms)

Exporting definitions

The derived definitions can be represented as an OWL file. An export of all of the source ontologies is available as Online Supplementary Material at: http://www.interscience.wiley.com/jpages/1531-6912/suppmat. This file was generated automatically. An OWL definition is provided only if there is exactly one semantic parse for a term.

Missing relationships

We found 400 candidate missing relationships over two iterations. The first run was performed on the March 2004 version of GO, with an earlier version of the code; the second run was performed on the July 2004 version of GO. Of these 400, at least 223 have since been added to GO. Each term was parsed to derive a class definition. Reasoning rules were applied over the class definitions to infer missing relationships. Table 5 shows some examples of the missing relationships detected.

Deriving existing relationships

We examined all 23 306 relationships in GO to determine which could be automatically derived. We compared two methods. The first method uses the semantic grammar OBO and relationship derivation rules, as above. The second method is similar to that described in [10], in which a relationship is derivable if all the words in the parent term are an ordered subset of words in the child term. Table 6 shows the numbers of derivable relationships in GO. This gives a rough indication of what proportion of GO relationships reflect biological knowledge, and what proportion are true by their logical definition.

Use of a semantic grammar has lower sensitivity than a subset- or substring-based approach. This is because many terms cannot be parsed by OBO. The sensitivity will improve as OBO improves. It is difficult to measure specificity, because GO relationships may be incomplete.

Generating implicit ontologies

The GO biological process ontology contains an implicit generic anatomy ontology, encoded in term names such as ‘haltere disc metamorphosis’ and corresponding relationships. This implicit anatomical ontology has been extracted, and could be used as the basis for a cross-species generic anatomy.

Limitations

Not all composite term names reflect the exact meaning of the term. For example, ‘dense nuclear body’ refers to something more specific than what is suggested by the individual words. This can lead to errors in reasoning, such as ‘dense nuclear body’ is a ‘nuclear body’. This can be overcome by manually creating formal definitions that override

Table 6. Derivable relationships in GO. Two methods are compared, one using Obol, and the other using a subset relation between words in the two term phrases

                        Function      Process       Component
Total relationships     8002          13 613        1691
Obol-D                  479 (6%)      3055 (22%)    346 (20%)
Subset-D                2135 (27%)    4299 (32%)    534 (32%)
Obol-D, Subset-ND       0             1089 (8%)     63 (3%)
Obol-ND, Subset-D       1656 (21%)    2333 (17%)    251 (15%)

D, derivable; ND, non-derivable.
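
The subset criterion compared against Obol in Table 6 can be illustrated with a short Python sketch. It assumes that 'ordered subset' means that the parent's words occur in the child's name in the same order, not necessarily adjacent, and that names are compared case-insensitively as whitespace-separated words. The function name is hypothetical, and the details of the method in [10] may differ.

```python
def is_ordered_subset(parent_name, child_name):
    """True if every word of parent_name appears in child_name, in order."""
    child_words = iter(child_name.lower().split())
    # 'word in iterator' consumes the iterator up to the first match, so
    # each later parent word must be found after the previous one.
    return all(word in child_words for word in parent_name.lower().split())


pairs = [
    ("lipid metabolism", "negative regulation of lipid metabolism"),
    ("nuclear body", "dense nuclear body"),
    ("plastid membrane", "chromoplast membrane"),
    ("vitamin E metabolism", "vitamin E biosynthesis"),
]
for parent, child in pairs:
    print(f"{child!r} -> {parent!r}: {is_ordered_subset(parent, child)}")
```

The last two pairs are not derivable by word matching alone, because relating them requires knowing that a chromoplast is a plastid and that biosynthesis is a metabolism.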


the derived definition. Similarly, there are many words that Obol incorrectly treats as primitive definitions lacking differentiae. Examples include ‘chemotaxis’ and ‘neurogenesis’.

Future directions

OBO terms occasionally use concise colloquial term names rather than term names that explicitly indicate the meaning according to OBO. These terms typically have an exact synonym. Future implementations will also parse exact synonyms. Another simple extension would allow for automatic generation of synonyms for terms.

Text definitions also conform to a regular structure, and so it should be possible to decompose and generate these. However, these are less concise than term names, and have a more complex grammar. A different approach is required for parsing text definitions.

The main benefit of using Obol is as an aid to the curation process, in that it assists with deriving relationships. In this mode, the end-user of the ontology would not use the grammar in any way. However, it is in theory possible to use Obol to enhance searching and navigation. For example, if a user searches for genes involved in ‘transcriptional regulation’, Obol could map this onto the correct term name ‘regulation of transcription’. Other end-users who may benefit from the richer definitions that come from parsing include bioinformatics users who use measures of semantic similarity [23].

While it is important to stress that parsing a highly restrictive language (such as the language of OBO terms) is much simpler than parsing natural language, it may be possible to use parts of OBO as an aid to natural language processing. An example application is deriving the meaning of sentences in Medline abstracts. Another possibility is alternative grammars and lexical mappings to aid automatic translation of OBO term names to languages other than English.

The software described here was designed with the needs of the OBO community in mind. However, much of it could be applied to ontologies that are not part of OBO, biological or otherwise. This would only be worthwhile if the target ontology contained a large number of compound terms, with the implicit knowledge encoded in highly regular term names not reflected in the term definitions. Different domains may require alterations to the grammar production rules, as well as their own AVs and relationship types; however, the core parts of the Obol software could be used unaltered.

One of the most intriguing possibilities offered by Obol is to assist in the permanent manual transition of OBO from the existing representation to a more formal representation, such as a genus–differentiae paradigm as outlined here, or a description logic [7] paradigm. Such a transition would be extremely daunting without computational assistance. This would mark a significant change in OBO curation, but could have significant benefits for curators in terms of time spent creating and maintaining compound terms.

Acknowledgements

This work was supported by the Howard Hughes Medical Institute, and the Gene Ontology Consortium P41 Grant from the National Human Genome Research Institute (NHGRI), Grant No. HG002273. I am grateful to many colleagues for their help with this research, in particular to Midori Harris, Jennifer Clark, Amelia Ireland and Jane Lomax for evaluating the initial results, and to Suzanna Lewis, John Day-Richter, David Hill and Michael Ashburner for their insightful comments, ideas, inspiration and encouragement.

References

1. Harris MA, et al. 2004. The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res 32(database issue): D258–261.
2. Ashburner M, et al. 2000. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nature Genet 25(1): 25–29.
3. Open Biological Ontologies: http://obo.sourceforge.net/
4. Bard J, Reed SY, Ashburner M. 2004. An ontology for cell types. Genome Biology (manuscript submitted).
5. Drysdale R. 2001. Phenotypic data in FlyBase. Brief Bioinform 2(1): 68–80.
6. Gkoutos GV, et al. 2004. Building mouse phenotype ontologies. Pac Symp Biocomput 178–189.
7. Wroe CJ, et al. 2003. A methodology to migrate the gene ontology to a description logic environment using DAML+OIL. Pac Symp Biocomput 624–635.
8. Smith B, Williams J, Schulze-Kremer S. 2003. The ontology of the gene ontology. AMIA Annu Symp Proc 609–613.


9. Hill DP, et al. 2002. Extension and integration of the gene ontology (GO): combining GO vocabularies with external vocabularies. Genome Res 12(12): 1982–1991.
10. Ogren PV, et al. 2004. The compositional structure of Gene Ontology terms. Pac Symp Biocomput 214–225.
11. Yeh I, et al. 2003. Knowledge acquisition, consistency checking and concurrency control for Gene Ontology (GO). Bioinformatics 19(2): 241–248.
12. Smith B, Köhler J, Kumar A. 2004. On the application of formal principles to life science data: a case study in the Gene Ontology. In Data Integration in the Life Sciences (DILS): 79–94.
13. Chomsky N. 1959. On certain formal properties of grammars. Inform Control 2: 137–167.
14. Formal grammar. In Wikipedia: The Free Encyclopedia. Vol. 8, Sept. 2004.
15. Cohen SM. 2003. Aristotle’s metaphysics. In The Stanford Encyclopedia of Philosophy, Zalta EN (ed.).
16. Smith B, WC, Köhler J, Kumar A, et al. 2004. Relations in biological ontologies. Genome Biology, submitted.
17. OWL Web Ontology Language: http://www.w3.org/TR/2004/REC-owl-ref-20040210/
18. Horrocks I. 1998. The FaCT system. In Automated Reasoning with Analytic Tableaux and Related Methods: International Conference Tableaux ’98 307–312.
19. Volker Haarslev RM. 2003. Racer: a core inference engine for the semantic web. In Proceedings of the 2nd International Workshop on Evaluation of Ontology-based Tools (EON2003). Sanibel Island, FL, USA.
20. Clocksin WF, Mellish CS. 1981. Programming in Prolog. Springer-Verlag: New York.
21. XSB Prolog: http://xsb.sourceforge.net
22. go-perl library: http://www.godatabase.org/dev/go-perl/doc/go-perl-doc.html
23. Lord PW, et al. 2003. Investigating semantic similarity measures across the Gene Ontology: the relationship between sequence and annotation. Bioinformatics 19(10): 1275–1283.
