definitions, which makes it difficult to perform certain kinds of reasoning over the terms in the ontology [7,8].

It has been noted [7,9–11] that many term names are compositional, and indicate implicit relationships to terms, possibly in different ontologies; e.g. ‘cardioblast differentiation’ in the biological process ontology has an implicit relationship to the term ‘cardioblast’ in the cell ontology, yet this relationship is absent from both ontologies; currently there are no explicit relationships between different ontologies in OBO. A computable definition of ‘cardioblast differentiation’ would by necessity reference ‘cardioblast’.

The existence of composite terms leads to redundancy in both text definitions and relationships [12]. For example, the term ‘cytokine’ has redundant definitions embedded in the text definitions for ‘cytokine metabolism’ and ‘cytokine biosynthesis’. The redundancy in relationships manifests itself as ‘cytokine metabolism’ and ‘cytokine biosynthesis’ being related via is_a to both ‘protein metabolism’ and ‘protein biosynthesis’.

The compositional nature of many terms leads to an increase in the number of relationships and a consequent increase in the complexity of the ontology. This is particularly true of the GO; the term ‘positive regulation of T-helper 2 cell differentiation’ has 114 distinct paths through the relationships in the ontology to the root term. This complexity can have a negative impact on both users and curators, searching and maintaining the ontologies.

We could integrate OBO by providing computable definitions for all existing terms. This would bring benefits, such as the ability to reason over the ontology and automatically derive certain relationships. However, adding and maintaining these computable definitions is a significant undertaking, and it is unclear whether the benefits would outweigh the costs.

An alternative approach is to use implicit knowledge encoded in a term name to derive the intended meaning of the term. OBO term names exhibit a high degree of regularity in phrase structure, and in how that structure relates to meaning. In fact, OBO term names can be viewed as phrases written in a formal language, described using logical rules. The formal language of OBO term phrases is a subset of natural language, and it thus serves the dual purpose of specifying meaning for both human users and computers.

This paper describes the attempt:

• To elucidate the formal language underlying OBO, called Obol.
• To use this language to derive the meaning of existing OBO terms.
• To use these derived meanings to help manage the redundancy and complexity within OBO.

Materials and methods

Source ontologies

A subset of OBO was used for constructing and testing Obol. The system described here was constructed using the three GO ontologies, the biochemical ontology and the cell ontology [4]. These ontologies were downloaded on 24 July 2004. Currently there is neither a generic (non-species-specific) anatomy ontology nor a protein family ontology in OBO.

Tokenization

Each term in OBO has exactly one primary name, which is a string of characters containing a phrase or sequence of words, indicated here by W*. The first step in determining the implicit meaning of a term is tokenizing the term name string.

Tokenization breaks a term name phrase-string into an ordered sequence of word-strings. Each of these word-strings is treated as an atomic (non-decomposable) token. The phrase-string is split on white-space characters and non-alphanumeric characters. White-space characters are discarded, but non-alphanumeric characters (e.g. the hyphen character ‘-’) are preserved and treated as special word tokens. For example:

W*_GO:0045085 = (negative, regulation, of, interleukin, ‘-’, 2, biosynthesis)
W*_GO:0006412 = (protein, biosynthesis)

Atomic vocabularies (AVs)

Each word token is matched against an atomic vocabulary (AV) of words. The AV contains words, not phrases. It thus overlaps and is distinct from the
Copyright 2004 John Wiley & Sons, Ltd. Comp Funct Genom 2004; 5: 509–520.
Obol: integrating language and meaning in bio-ontologies 511
OBO-controlled vocabulary of phrases, where each phrase consists of one or more words.

The atomic vocabulary assigns each word a domain, and a lexical category (also known as a part of speech). The domains typically correspond to the domains of existing ontologies within OBO (e.g. molecular function, cellular component, cell, biochemical) and also to domains that are not presently covered by OBO (environment, protein, generic anatomy). Table 1 shows a list of domains defined by Obol.

The lexical category of a word defines the role that word plays in a phrase. For example, in the phrase ‘negative regulation’, the word ‘negative’ plays the role of adjective. For simplicity we assume that each word in the AV has exactly one lexical category. The lexical categories used include typical linguistic categories, such as nouns, adjectives, prepositions and relational adjectives. Relational adjectives are useful for linking adjectives to the noun form of that adjective. They are treated as binary relations between words; e.g. relational_adj(epidermal, epidermis). There is also an additional category for numbers, roman numerals, words designating Greek symbols and alphanumeric characters used as ‘type designators’; e.g. ‘myosin II’, ‘interferon-alpha’ and ‘interleukin-2’.

OBO terms typically do not contain verbs, so there is no need for such a category. Inflected verbs (e.g. ‘regulation’, the inflected form of ‘regulate’) are treated no differently from other nouns. It is rare for OBO term names to include the definite or indefinite article (‘the’ or ‘a’), and other lexical categories commonly used in natural language (‘it’, ‘they’, ‘he’, ‘she’). It is also rare for there to be different variants of the same word, so there is no requirement for reducing words to stem forms.

The phrase ‘negative regulation of interleukin-2 biosynthesis’ can be tokenized into words that are categorized into the following lexical categories and domains:

• negative (adjective, general).
• regulation (noun, biological process).
• of (preposition, general).
• interleukin (noun, protein family).
• - (hyphen) (special token).
• 2 (type designator, general).
• biosynthesis (noun, biological process).

Table 1. The 13 domains used to partition the Atomic Vocabularies. Some ontologies have been split into more than one domain; this was necessary if the ontology did not contain complete is_a parentage to a suitable upper-level term

Domain       Corresponding OBO ontology   Notes
General      -                            Mostly prepositions
Anatomy      Various species-specific     Generic anatomical structures
Function     GO molecular function
Process      GO biological process
Component    GO cellular component        Cell parts
Cell         Cell                         Cell types
Biochemical  Biochemical                  Chemical compounds
Protein      In progress                  e.g. Keratin, actin, interleukin-2
Environment  -                            e.g. Taste, touch, light
Behaviour    GO biological process        Behaviour-specific processes
Enzyme       GO molecular function        e.g. Amylase, deaminase
Organism     -                            e.g. Viral/virus, bacteria, parasite
Sequence     Sequence                     e.g. Transcript, five, prime

Constructing the atomic vocabulary

All term names in the source ontology were tokenized; this resulted in a set of word tokens, which comprised the initial AV. Domains and lexical categories were assigned both manually and semi-automatically, using the OBO grammar (see next section).

As OBO changes, the AVs must also change. However, the atomic vocabularies need not be completely up-to-date, because the system exhibits graceful degradation with incomplete or incorrectly categorized AVs; performance is impacted, but not severely.

OBO term grammar

A computational (or formal) grammar is a way to describe a formal language, analogous to the concept of grammars for natural languages [13,14]. A formal language is a set of sequences (e.g. sentences) over a finite alphabet (e.g. words).
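The tokenization and atomic vocabulary lookup described in the preceding sections can be illustrated with a minimal Python sketch. This is not the Obol implementation (which is written in Prolog); the names `tokenize`, `categorize` and `ATOMIC_VOCABULARY` are hypothetical, and the vocabulary holds only the handful of entries from the worked example in the text.

```python
import re

# Hypothetical miniature atomic vocabulary: word -> (lexical category, domain).
# Entries are taken from the worked example; a real AV would be far larger.
ATOMIC_VOCABULARY = {
    "negative": ("adjective", "general"),
    "regulation": ("noun", "biological process"),
    "of": ("preposition", "general"),
    "interleukin": ("noun", "protein family"),
    "2": ("type designator", "general"),
    "biosynthesis": ("noun", "biological process"),
}

# A token is either a run of alphanumerics or a single non-alphanumeric,
# non-white-space character. White space is discarded.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+|[^A-Za-z0-9\s]")


def tokenize(term_name):
    """Split a term name into an ordered list of atomic word tokens."""
    return TOKEN_PATTERN.findall(term_name)


def categorize(tokens):
    """Attach (lexical category, domain) to each token.

    Non-alphanumeric tokens such as '-' are labelled as special tokens.
    Words missing from the vocabulary are labelled 'unknown' rather than
    raising an error, mirroring the graceful degradation described above.
    """
    result = []
    for token in tokens:
        if not token.isalnum():
            result.append((token, "special token", None))
        else:
            category, domain = ATOMIC_VOCABULARY.get(
                token.lower(), ("unknown", None)
            )
            result.append((token, category, domain))
    return result


if __name__ == "__main__":
    words = tokenize("negative regulation of interleukin-2 biosynthesis")
    for entry in categorize(words):
        print(entry)
```

Keeping the hyphen as its own token, rather than discarding it, lets a later grammar stage decide whether ‘interleukin’, ‘-’ and ‘2’ form a single type-designated name.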
512 C. J. Mungall
Figure 2. Class definition for the term ‘negative regulation of interleukin-2 biosynthesis’
Table 2. Examples of how grammatical contexts can be used to derive which relationship forms the differentia in a definition. Example terms are provided, with the genus shown in bold and the differentia term shown in italics. Note that this is just a subset of the contexts; the relationship ‘part of’ can be part of a set of differentiae in a variety of contexts. Currently the only relationship type above defined by GO is ‘part of’; Obol has to extend this basic set in order to derive definitions

Relationship type   Domain   Range   Grammatical context   Example (genus, differentia)
Table 3. Some results from applying semantic parses on typical OBO terms. Recursive definitions are shown as successive
parses of each differentia term. Note that there is no protein ontology in OBO at present, so Obol creates a temporary
ID for terms such as ‘interleukin-2’. Note also that there is no GO term for the genus ‘regulation’
OBO ID Term name Genus Differentiae
augmented to get Q_OBO. Here, a relationship type belonging to Q is defined by a set of domain, range and grammatical context triples. The domain and range specify to which genus or genera a relationship type pertains. The grammatical context specifies how a relationship is linguistically manifested in a phrase. For example, consider the relationship type affects in Q_OBO. This type can have the domain ‘regulation’ and the range ‘process’ occurring with the preposition ‘of’ in some phrase. Because ‘biosynthesis’ is a ‘process’, the phrase ‘regulation of biosynthesis’ is equivalent to the definition regulation affects = (biosynthesis). Table 2 shows some example relationships and the contexts in which they are used.

OBO term names are remarkably consistent in mapping grammatical structure to meaning. A term with a name such as ‘regulation of transcription’ would never use a different grammatical context (except as a synonym). An example of a different grammatical context is a relational adjective modifier, as in ‘transcriptional regulation’.

Class definition derivation method

Given a set of relationship types Q, as defined above, we can attempt to infer a class definition for any syntax parse tree.

Starting at the root of the parse tree, S, we recursively build the class definition. The production rules, P, are constructed such that each node in the parse tree should have one or two children. If it has one child, then the class definition at the parent node is equal to the class definition at the child node. If the node has two children (one stem node n1 and another supplementary node n2), then the class definition will be the combination of the class definition D1 at n1 with a relationship, r, to the class definition D2 at n2, where the possible relationship types are constrained by the grammatical type of the current node, n, together with the domain and range types matched to D1 and D2, according to Q.

When we reach the leaf nodes of the parse tree, we assign a primitive definition, which is simply a genus lacking differentiae.
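The recursive derivation just described can be sketched in Python as follows. This is an illustrative reading of the method, not the Obol code: the `Node` and `Definition` classes, the `kind` labels, the context name `prep_of` and the single-entry table `RELATIONSHIP_TYPES` are all assumptions introduced for the example. Only the ‘affects’ relationship between ‘regulation’ and ‘process’ in the context of the preposition ‘of’ comes from the text.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Definition:
    genus: str
    kind: str  # type used to match the domain/range of a relationship
    differentiae: Tuple[Tuple[str, "Definition"], ...] = ()


@dataclass(frozen=True)
class Node:
    context: Optional[str] = None   # grammatical context of an inner node
    stem: Optional["Node"] = None   # stem child n1
    supplement: Optional["Node"] = None  # supplementary child n2, if any
    word: Optional[str] = None      # set only on leaf nodes
    kind: Optional[str] = None      # set only on leaf nodes


# Hypothetical fragment of Q: (context, domain, range) -> relationship type.
RELATIONSHIP_TYPES = {("prep_of", "regulation", "process"): "affects"}


def derive(node):
    """Recursively build a class definition from a parse tree node."""
    if node.word is not None:
        # Leaf: a primitive definition, i.e. a genus lacking differentiae.
        return Definition(node.word, node.kind)
    d1 = derive(node.stem)
    if node.supplement is None:
        # One child: the parent definition equals the child definition.
        return d1
    d2 = derive(node.supplement)
    rel = RELATIONSHIP_TYPES.get((node.context, d1.kind, d2.kind))
    if rel is None:
        raise ValueError("no relationship type matches this context")
    return Definition(d1.genus, d1.kind, d1.differentiae + ((rel, d2),))


def render(definition):
    """Format a definition in the 'genus relationship = (differentia)' style."""
    parts = [definition.genus]
    for rel, sub in definition.differentiae:
        parts.append(f"{rel} = ({render(sub)})")
    return " ".join(parts)


if __name__ == "__main__":
    tree = Node(
        context="prep_of",
        stem=Node(word="regulation", kind="regulation"),
        supplement=Node(word="biosynthesis", kind="process"),
    )
    print(render(derive(tree)))
```

In this sketch a parse with no matching relationship type raises an error; a system that tolerates failed semantic parses would instead record that the syntactic parse derives no definition.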
Deriving a class definition is also known as semantic parsing. This extension to a standard grammar is called a semantic grammar. A semantic grammar is reversible, allowing for the generation of phrases from definitions.

Table 3 gives an example of the semantic parse of the term name ‘negative regulation of interleukin-2 biosynthesis’.

If we augment our definition of a grammar G with relationship types and a set of is_a relationships over genera, giving a semantic grammar $\Gamma = (N, \Sigma, P, S, Q, R_{IS\_A})$, then the language of $\Gamma$, denoted as $L(\Gamma)$, is defined as all those meaningful strings over $\Sigma$ belonging to $L(G)$. If $\Gamma_{OBO}$ is defined perfectly, then $L(\Gamma_{OBO})$ should correspond to the universe of meaningful biological terms. We refer colloquially to $L(\Gamma_{OBO})$ as Obol.

Definitions can be exported using either the obo flat-file format (http://www.geneontology.org/GO.format.html#oboflat) or using a description logic (DL) format, such as the Web Ontology Language (OWL) [17].

The candidate class definitions can be considered an end in and of themselves, or they can be used for reasoning.

Reasoning

Class definitions can be used for reasoning over the ontologies. This can be done using an external third-party reasoner, such as FaCT [18] or RACER [19]. There is also an advantage to integrating some simple reasoning facility into the grammar system. Some of the things we may wish to do with a reasoner and some class definitions include:

• Check for inconsistent or missing relationships, or missing terms.
• Automatically assign relationships for new terms.
• Extract implicit ontologies from GO.

T by following a chain of one or more is_a relationships.

\[ T \ \mathrm{is\_a}^{+}\ T' \quad \text{if} \quad (T \ \mathrm{is\_a}\ X \ \text{and}\ X \ \mathrm{is\_a}^{+}\ T') \ \text{or}\ (T \ \mathrm{is\_a}\ T') \]

There are rules for inferring contingent inclusion is_a relationships [16] that are based on natural is_a relationships using genus–differentiae definitions (where G = genus, and the differentia is Q = (D), and D is itself a definition). Here is a subset of the Obol rule-set:

1: $G, Q = (D)$ is_a $G, Q = (D')$ if $D$ is_a $D'$
2: $G, Q = (D)$ is_a $G', Q = (D)$ if $G$ is_a $G'$
3: $G, Q = (D)$ is_a $G$ if $\neg(\exists D'$ such that $D$ is_a $D')$

Using rule 1, and the semantic grammar, the Obol system infers an is_a relationship between ‘chromoplast membrane’ (membrane surrounds = chromoplast) and ‘plastid membrane’ (membrane surrounds = plastid) based on the is_a relationship between ‘chromoplast’ and ‘plastid’.

An example of applying rule 2 is to discover that ‘vitamin E biosynthesis’ (biosynthesis forms = vitaminE) is_a ‘vitamin E metabolism’ (metabolism forms = vitaminE), based on the is_a relationship between ‘biosynthesis’ and ‘metabolism’.

An example of applying rule 3 is to discover that ‘primary septum’ (septum type = primary) is_a ‘septum’, based on the fact that there is no OBO term ‘primary’ and thus no parent for that term.

Rules such as these can be used to find errors of omission in an ontology, to automatically add relationships for newly created terms, to suggest intermediate terms, and to determine which relationships in the ontology are true by contingency.

Table 4 has the complete set of all relationships that can be derived for the term named ‘negative regulation of interleukin-2 biosynthesis’. Note that all these derived relationships are currently present in GO due to curator diligence.
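The three inference rules and the transitive is_a+ relation can be sketched in Python using the examples from the text. This is a simplified illustration, not the Obol rule engine: each definition is flattened to a single (genus, relationship, differentia) triple, the function names and the `IS_A` fact set are hypothetical, and the sketch assumes that rules 1 and 2 may use the transitive is_a+ rather than only direct is_a links.

```python
from collections import namedtuple

# Simplified definition: one genus plus one differentia 'relationship = (D)'.
Definition = namedtuple("Definition", ["genus", "relationship", "differentia"])

# Hypothetical set of asserted is_a facts as (child, parent) pairs.
IS_A = {("chromoplast", "plastid"), ("biosynthesis", "metabolism")}


def is_a_plus(term, ancestor, facts=IS_A):
    """True if `ancestor` is reachable from `term` via one or more is_a links."""
    frontier, seen = [term], set()
    while frontier:
        current = frontier.pop()
        for child, parent in facts:
            if child == current and parent not in seen:
                if parent == ancestor:
                    return True
                seen.add(parent)
                frontier.append(parent)
    return False


def subsumption_rule(child, parent, facts=IS_A):
    """Return 1 or 2 if that rule lets us infer child is_a parent, else None."""
    if child.relationship != parent.relationship:
        return None
    # Rule 1: same genus, differentia of the child is_a differentia of parent.
    if child.genus == parent.genus and is_a_plus(
        child.differentia, parent.differentia, facts
    ):
        return 1
    # Rule 2: same differentia, genus of the child is_a genus of the parent.
    if child.differentia == parent.differentia and is_a_plus(
        child.genus, parent.genus, facts
    ):
        return 2
    return None


def rule3_parent(definition, facts=IS_A):
    """Rule 3: if the differentia has no parent, the term is_a its bare genus."""
    has_parent = any(child == definition.differentia for child, _ in facts)
    return None if has_parent else definition.genus


if __name__ == "__main__":
    print(subsumption_rule(
        Definition("membrane", "surrounds", "chromoplast"),
        Definition("membrane", "surrounds", "plastid"),
    ))
    print(rule3_parent(Definition("septum", "type", "primary")))
```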
Table 4. Examples of relationships that can be automatically derived by semantically parsing the term ‘negative regulation
of interleukin-2 biosynthesis’. Note that all 16 of these relationships are currently present in GO, so Obol tells us nothing
new here. However, manual curation of these derivable relationships is an onerous and error-prone task. Obol can help
maintain these. Note also that in order to make all of these derivations Obol requires an ontology of proteins, such as
interleukins. No such ontology currently exists within OBO — an example of one was generated for this particular test.
This highlights the importance of OBO for the GO project
Child term   Relationship type   Parent term   Notes
Table 5. An illustrative subset of the 400 purported missing relationships derived by the system. Some of these derived
missing relationships were incorrect, based on erroneous definition parses. The final example illustrates a case where the
formal definition of a term cannot be derived from the term phrase alone
Subject Relationship type Object Notes
the Obol system. We use XSB Prolog [21], which provides tabling in addition to a standard Prolog system. This allows grammar production rules to be left-recursive, as is the case for P_OBO. The go-perl library [22] is used to convert OBO files to Prolog fact files. Interaction with the system is either via the Prolog interpreter or via a UNIX makefile. The system has been tested on both Linux and Mac OS X, and is available under an open source licence from http://www.fruitfly.org/~cjm/obol.

Results and discussion

[Figure residue: plot with y-axis ‘biological-process-terms’ (0 to 3000) and x-axis ‘number of parses’ (0 to 7)]

Figure 4. The distribution of the number of parses, both syntactic and semantic, for the cellular component GO ontology

distribution of the number of parses, both syntactic and semantic, for the three GO ontologies. Single-word terms have been excluded, since these are trivially parsed to the corresponding genus. The histograms have been truncated at five parses; one particular molecular function term (‘receptor signaling protein tyrosine kinase signaling protein activity’) has 132 syntactic parses (of which none derive a definition).

Note that the molecular function ontology has the highest number of terms that cannot be parsed. One reason is that the tokenizer and grammar originally did not deal adequately with term names containing chemical notation, such as ‘5(S)-hydroxyperoxy-6E,8Z,11Z,14Z-icosatetraenoic acid binding’. This has recently been fixed by adding tokenizer rules, such that compound identifiers are treated as single tokens.

Improving the AVs will reduce the number of incorrect syntax parses (and thus reduce the incorrect semantic parses). Many words are incorrectly categorized or are not yet present in the AVs. This can be improved by manual and semi-automatic AV curation.
[Figure residue: plot with y-axis ‘molecular-function-terms’ (up to 3000); legend: syntactic parses, semantic parses]

We examined all 23 306 relationships in GO to determine which could be automatically derived. We compared two methods. The first method
the derived definition. Similarly, there are many words that Obol incorrectly treats as primitive definitions lacking differentiae. Examples include ‘chemotaxis’ and ‘neurogenesis’.

Future directions

OBO terms occasionally use concise colloquial term names rather than term names that explicitly indicate the meaning according to OBO. These terms typically have an exact synonym. Future implementations will also parse exact synonyms. Another simple extension would allow for automatic generation of synonyms for terms.

Text definitions also conform to a regular structure, and so it should be possible to decompose and generate these. However, these are less concise than term names, and have a more complex grammar. A different approach is required for parsing text definitions.

The main benefit of using Obol is as an aid to the curation process, in that it assists with deriving relationships. In this mode, the end-user of the ontology would not use the grammar in any way. However, it is in theory possible to use Obol to enhance searching and navigation. For example, if a user searches for genes involved in ‘transcriptional regulation’, Obol could map this onto the correct term name ‘regulation of transcription’. Other end-users who may benefit from the richer definitions that come from parsing include bioinformatics users who use measures of semantic similarity [23].

While it is important to stress that parsing a highly restrictive language (such as the language of OBO terms) is much simpler than parsing natural language, it may be possible to use parts of OBO as an aid to natural language processing. An example application is deriving the meaning of sentences in Medline abstracts. Another possibility is alternative grammars and lexical mappings to aid automatic translation of OBO term names to languages other than English.

The software described here was designed with the needs of the OBO community in mind. However, much of it could be applied to ontologies that are not part of OBO, biological or otherwise. This would only be worthwhile if the target ontology contained a large number of compound terms, with the implicit knowledge encoded in highly regular term names not reflected in the term definitions. Different domains may require alterations to the grammar production rules, as well as their own AVs and relationship types; however, the core parts of the Obol software could be used unaltered.

One of the most intriguing possibilities offered by Obol is to assist in the permanent manual transition of OBO from the existing representation to a more formal representation, such as a genus–differentiae paradigm as outlined here, or a description logic [7] paradigm. Such a transition would be extremely daunting without computational assistance. This would mark a significant change in OBO curation, but could have significant benefits for curators in terms of time spent creating and maintaining compound terms.

Acknowledgements

This work was supported by the Howard Hughes Medical Institute, and the Gene Ontology Consortium P41 Grant from the National Human Genome Research Institute (NHGRI), Grant No. HG002273. I am grateful to many colleagues for their help with this research, in particular to Midori Harris, Jennifer Clark, Amelia Ireland and Jane Lomax for evaluating the initial results, and to Suzanna Lewis, John Day-Richter, David Hill and Michael Ashburner for their insightful comments, ideas, inspiration and encouragement.

References

1. Harris MA, et al. 2004. The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res 32 (database issue): D258–261.
2. Ashburner M, et al. 2000. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nature Genet 25(1): 25–29.
3. Open Biological Ontologies: http://obo.sourceforge.net/
4. Bard J, Reed SY, Ashburner M. 2004. An ontology for cell types. Genome Biology (manuscript submitted).
5. Drysdale R. 2001. Phenotypic data in FlyBase. Brief Bioinform 2(1): 68–80.
6. Gkoutos GV, et al. 2004. Building mouse phenotype ontologies. Pac Symp Biocomput 178–189.
7. Wroe CJ, et al. 2003. A methodology to migrate the gene ontology to a description logic environment using DAML + OIL. Pac Symp Biocomput 624–635.
8. Smith B, Williams J, Schulze-Kremer S. 2003. The ontology of the gene ontology. AMIA Annu Symp Proc 609–613.
9. Hill DP, et al. 2002. Extension and integration of the gene ontology (GO): combining GO vocabularies with external vocabularies. Genome Res 12(12): 1982–1991.
10. Ogren PV, et al. 2004. The compositional structure of Gene Ontology terms. Pac Symp Biocomput 214–225.
11. Yeh I, et al. 2003. Knowledge acquisition, consistency checking and concurrency control for Gene Ontology (GO). Bioinformatics 19(2): 241–248.
12. Smith B, Köhler J, Kumar A. 2004. On the application of formal principles to life science data: a case study in the Gene Ontology. In Data Integration in the Life Sciences (DILS): 79–94.
13. Chomsky N. 1959. On certain formal properties of grammars. Inform Control 2: 137–167.
14. Formal grammar. In Wikipedia: The Free Encyclopedia. Vol. 8, Sept. 2004.
15. Cohen SM. 2003. Aristotle’s metaphysics. In The Stanford Encyclopedia of Philosophy, Zalta EN (ed.).
16. Smith B, WC, Köhler J, Kumar A, et al. 2004. Relations in biological ontologies. Genome Biology, submitted.
17. OWL Web Ontology Language: http://www.w3.org/TR/2004/REC-owl-ref-20040210/
18. Horrocks I. 1998. The FaCT system. In Automated Reasoning with Analytic Tableaux and Related Methods: International Conference Tableaux 98 307–312.
19. Volker Haarslev RM. 2003. Racer: a core inference engine for the semantic web. In Proceedings of the 2nd International Workshop on Evaluation of Ontology-based Tools (EON2003). Sanibel Island, FL, USA.
20. Clocksin WF, Mellish CS. 1981. Programming in Prolog. Springer-Verlag: New York.
21. XSB Prolog: http://xsb.sourceforge.net
22. go-perl library: http://www.godatabase.org/dev/go-perl/doc/go-perl-doc.html
23. Lord PW, et al. 2003. Investigating semantic similarity measures across the Gene Ontology: the relationship between sequence and annotation. Bioinformatics 19(10): 1275–1283.