
THEME ISSUE PAPER

Evaluation of semantic retrieval systems on the semantic web
Jorge Luis Morato, Sonia Sanchez-Cuadrado and Christos Dimou
Computer Science Department, University Carlos III of Madrid, Leganes, Spain
Divakar Yadav
Computer Science and Engineering, Jaypee Institute of Information Technology,
Noida, India, and
Vicente Palacios
Computer Science Department, University Carlos III of Madrid, Leganes, Spain
Abstract
Purpose: This paper seeks to analyze and evaluate different types of semantic web retrieval
systems with respect to their ability to manage and retrieve semantic documents.
Design/methodology/approach: The authors provide a brief overview of knowledge modeling
and semantic retrieval systems in order to identify their major problems. They classify a set of
characteristics to evaluate the management of semantic documents. To this end, the authors
select 12 retrieval systems, classified according to these features. The evaluation methodology followed
in this work is the one used in the DESMET project for the evaluation of qualitative
characteristics.
Findings: A review of the literature has shown deficiencies in the ability of the current semantic web
to cope with known problems. Additionally, semantic retrieval systems show discrepancies
in their implementation. The authors analyze the presence of a set of
functionalities in different types of semantic retrieval systems and find a low degree of implementation
of important specifications and of the criteria to evaluate them. The results of this evaluation indicate
that, at the moment, the semantic web is characterized by a lack of usability that derives from
problems related to the management of semantic documents.
Originality/value: This proposal shows a simple way to compare requirements of semantic
retrieval systems qualitatively, based on the DESMET methodology. The functionalities chosen to test the
methodology are based on the problems and relevant criteria discussed in the literature. This
work provides functionalities to design semantic retrieval systems in different scenarios.
Keywords: Semantic web, Semantic search engines, Problems in the semantic web,
Qualitative evaluation, Requirements analysis
Paper type: Research paper
The current issue and full text archive of this journal is available at
www.emeraldinsight.com/0737-8831.htm
MINCIN funded this research project HAR2011-27540 and TIN2011-27244.
An earlier version of this paper was presented at the LOV symposium, held in Madrid, Spain,
on 18 June, 2012.
Received 4 March 2013
Revised 22 August 2013
Accepted 3 September 2013
Library Hi Tech (LHT)
Vol. 31 No. 4, 2013
pp. 638-656
© Emerald Group Publishing Limited
0737-8831
DOI 10.1108/LHT-03-2013-0026
Introduction
The concept of the semantic web has emerged out of the need to facilitate access,
management and retrieval of knowledge. Although different authors interpret the
semantic web in different ways (Bizer et al., 2009), the underlying idea is a web of
machine-readable, semantically related data, which means the navigation can be
achieved by means of semantic relationships among concepts instead of hyperlinks
between documents. Therefore, the semantic web requires semantic documents and
resources that permit knowledge representation and inferences.
Semantic documents, in some application domains, are referred to as composite
information resources composed of uniquely identified, semantically annotated, and
semantically interlinked document data units of different granularity (Nesic, 2010). In
others, such as the Swoogle search engine, an SWD (semantic web document)
is defined as an RDF document. Considering an SWD to be just an RDF document is
problematic: as Pastor-Sanchez et al. (2012) point out, the use of RDF in
DataHub documents is frequently scarce and sometimes anecdotal.
Vocabularies of metadata, ontologies and schemas are also considered semantic
description resources, because all of them are key elements for providing
interoperability in the context of the linked data cloud and semantic web
implementations. Vocabularies are defined as the concepts and relationships used
to describe and represent an area of concern. They are used to classify the terms that
can be used in a particular application, characterize possible relationships, and define
possible constraints on using those terms (W3C, 2013). Vocabularies are more
useful not only when they can be retrieved, accessed and used easily, but also when they are
formalized in a more precise way. They can be formalized at various levels, from XML
(eXtensible Markup Language) schemas or RDF to ontologies. Their use can also vary
from a simple resource description (e.g. in HTML code) to an application profile that
combines different vocabularies.
In the following sections, we proceed with a review of literature related to the
infrastructure of the semantic web. Next, we identify characteristics and relevant
problems in other studies about semantic retrieval evaluation. Finally, we propose a
qualitative evaluation of different functionalities for distinct types of semantic search
systems in the context of the Web.
1. Discrepancies of the semantic web
Many projects on the semantic web have distinct approaches to represent knowledge in
a formalized way. For example, ontologies are described with other formal languages,
such as KIF, RIF or Common Logic. Moreover, some initiatives focus on the inclusion
of metadata in HTML, through, for instance, the projects RDFa, microformats or
dataspaces (Bizer et al., 2009). The ISO Topic Maps standard defines a data
model expressed with XML syntax and the TCML language. There are other similar
initiatives based on XML Schema, for example XMI for UML, MPEG-7 or TEI (Bikakis
et al., 2013).
Each representation model has different syntax and semantics that can produce
conflicts when we try to fuse them together. In the semantic web the prevailing
tendency uses RDF as a standard data model and a serialization format based on XML
syntax (Allemang and Hendler, 2011), where data resources and vocabulary concepts
can be accessed by URIs. Despite this tendency, there is a need to establish
transformations from other languages and models such as XML Schema, KIF, etc.
The W3C has proposed some recommendations for reducing the gap between the
different languages of ontologies, annotation and XML Schema (W3C, 2007, 2012). It is
important to note that the semantics in XML Schema are informal, in contrast to RDF
where semantics are more formalized. Another important difference is that, unlike RDF,
querying XML Schema relies on a closed world assumption (Bikakis et al., 2013).
One of the most successful initiatives is the Linked Data project (Bizer et al., 2009).
The main proposal of this project is based on four principles:
(1) identity for each entity (e.g. URIs);
(2) accessibility to each object (e.g. HTTP URI);
(3) structure information in a formalized way with standards (e.g. RDF and
SPARQL); and
(4) integration of entities through relationships between them (e.g. application
profiles, metadata vocabularies, and so on).
Linked data enables integrating distributed data, and leveraging the generation of
repositories for semantic resources and datasets (DBpedia, GeoNames, UMBEL). The
success of the linked data initiative is determined by the ability of practitioners to
identify, reuse, or link to other available sources of linked data (W3C, 2011).
It must be noted that although in the linked data project the files that describe
resources are usually RDF documents, in practice we observe other formats as well, for
example Xlink (W3C, 2010), a required format for topic maps until 2006, or JSON linked
data (Lanthaler and Gutl, 2012). There have been many initiatives that return to the
simplicity of the Web 1.0 to overcome problems with b-nodes and RDF molecules, for
instance hypernotation (Milicic, 2011). The heterogeneity of document formats in the
semantic web explains why, for example, when we submit a search query to the
DataHub repository, we retrieve a large variety of CSV, XML or HTML files, instead of
only RDF files.
In conclusion, the specifications of a retrieval system for semantic web resources
must not be restricted to RDF documents only; they should also include other
types of resources and languages.
1.1 Knowledge modeling on the web
The main consequence of modeling knowledge in the semantic web context is the
need to select metadata vocabularies for elaborating semantic documents.
However, selecting the appropriate metadata vocabulary involves several difficulties
and challenges. This process includes:
(1) Selecting a vocabulary from the linked open data (LOD) Cloud.
(2) Selecting the appropriate approach for describing the subject.
(3) Defining queries and conceptual and semantic navigation.
We have encountered some difficulties in carrying out these tasks.
1.1.1 Selecting a metadata vocabulary from the LOD cloud. Selecting a metadata
vocabulary from the Cloud involves several difficulties. One problem is a lack of
criteria to select an adequate metadata vocabulary. According to the resource
"Datasets in the next LOD Cloud", by the Research Group Data and Web Science at
the University of Mannheim (2013), there are 25.2 billion RDF triples in W3C linking
open data (aka LOD); datasets like Yago alone have 19 million triples. According to
Pastor-Sanchez et al. (2012), 347 million triples are associated with controlled
vocabularies in the Data Hub. For example, there are many vocabularies that can be
used to describe resources, persons and institutions. However, there is a lack of criteria
to select the best candidate. Palacios (2010) has suggested some criteria such as degree
of standardization, stability, number of elements, and usage statistics and popularity.
Other problems are lack of consensus between similar metadata element sets due to
overlapping definitions, lack of descriptions, formalization problems and different
conceptualizations. Specifically, there exist various vocabularies for the description of
persons and organizations. Many of them overlap to a large degree, for
example FOAF, PIM, MADS (xsd), LID, DOAC, VCard, hcard, BIO or XFN. Similarly,
there are many alternative vocabularies for semantic relationships, such as
SKOS, VDEX, BS 8723, ISO 25964, Zthes, PSI TopicMaps (Morato et al., 2007). This
overlap should not be a problem, provided that the definitions are coherent when
they describe the same concepts. On the contrary, it could prove beneficial for LOD. A
rigorous analysis, however, reveals typical problems in the determination of exact
correspondences. For example, one of the obvious elements would be birthday; if we
look, however, at its description in two widely used vocabularies, Foaf and
Vcard, we observe that the correspondence between elements is not obvious.
• Foaf:birthday, definition: the birthday of this agent, represented in mm-dd
string form, e.g. 12-31.
• Foaf:agent, definition: used to describe any agent related to bibliographic items.
Such agents can be persons, organizations or groups of any kind.
• Vcard:bday, definition: date of birth of the individual associated with the vCard.
• Vcard:agent, definition: information about another person who will act on behalf of the
vCard object.
On the one hand, the meaning of the term "agent" differs between the two
vocabularies. On the other hand, one of the vocabularies specifies a date format,
while the other one does not (although it is usually associated with the xsd datatype
of date).
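The birthday mismatch can be made concrete with a small sketch. The Python snippet below is illustrative only: the helper names (`parse_foaf_birthday`, `parse_vcard_bday`, `same_birthday`) are hypothetical, the sample values are invented, and it assumes that the vCard value arrives in the xsd date form yyyy-mm-dd, which the vocabulary definition quoted above does not itself mandate.

```python
from datetime import date
from typing import Tuple


def parse_foaf_birthday(value: str) -> Tuple[int, int]:
    """Parse an mm-dd string (e.g. "12-31") into (month, day)."""
    month_text, day_text = value.split("-")
    month, day = int(month_text), int(day_text)
    # Validate against a leap year so that "02-29" is accepted.
    date(2000, month, day)
    return month, day


def parse_vcard_bday(value: str) -> date:
    """Parse a yyyy-mm-dd string; this format is an assumption."""
    return date.fromisoformat(value)


def same_birthday(foaf_value: str, vcard_value: str) -> bool:
    """Compare only month and day, because the mm-dd form carries no year."""
    born = parse_vcard_bday(vcard_value)
    return parse_foaf_birthday(foaf_value) == (born.month, born.day)


# Invented sample values, used only to exercise the comparison.
print(same_birthday("12-31", "1980-12-31"))
print(same_birthday("06-08", "1980-12-31"))
```

Under these assumptions the mapping is lossy in one direction: a full date can be reduced to mm-dd, but an mm-dd string cannot be expanded into a full date without external information.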
In addition to the above problem, there are other resources that present metadata
without definitions or updates, or with nonsensical elements (e.g. PIM), as well as synonyms and
deprecated elements such as foaf:lastName, foaf:familyName or
foaf:family_name. It must be noted that, according to the ISO 11179 data
model, a metadata registry can include different vocabularies and their semantics.
However, metadata handling is implemented by the original system, not in the registry.
In other words, in order to create a new registry and be able to alter the definitions, we
should define a new local resource.
1.1.2 Selecting the appropriate approach for describing the subject. Selecting the
appropriate approach for describing the subject raises problems with URIs, such as the
ever-changing nature of the internet. The absence of URIs for some elements is
another problem. The absence of these URIs is usually caused either by higher order
relationships or by RDF molecules. Typically, many of these problems are solved by
blank nodes (bnodes). A similar approach is widely found in knowledge management
under the name of reification. These bnodes lack URIs to identify them, impeding the
linking of triples. This problem is also similar to the absence of URIs for RDF molecules
(Ding et al., 2005). RDF molecules are atomic units that can be larger than an RDF triple
and whose division would imply significant loss of semantics. For example,
if we have a person whose name is Tim and whose surname is Berners-Lee, we need
a higher level of granularity to group name and surname together.
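The grouping role of a blank node can be sketched with plain tuples. The snippet below is a simplified illustration, not the molecule decomposition of Ding et al. (2005): the `ex:` property names, the `_:b0` label and the helper `molecule_for` are hypothetical, and a "molecule" is approximated here as the set of triples that mention the same blank node.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]

# Hypothetical data: the "ex:" names and the "_:b0" label are illustrative only.
triples: List[Triple] = [
    ("ex:doc1", "ex:author", "_:b0"),
    ("_:b0", "ex:givenName", "Tim"),
    ("_:b0", "ex:surname", "Berners-Lee"),
    ("ex:doc1", "ex:title", "Some report"),
]


def is_blank(term: str) -> bool:
    """Blank nodes are written here with the conventional "_:" prefix."""
    return term.startswith("_:")


def molecule_for(node: str, graph: List[Triple]) -> Set[Triple]:
    """Collect every triple that mentions the given blank node."""
    return {t for t in graph if node in (t[0], t[2])}


blank_nodes = {term for s, _, o in triples for term in (s, o) if is_blank(term)}
for node in sorted(blank_nodes):
    # These triples must travel together: splitting them loses the grouping.
    print(node, sorted(molecule_for(node, triples)))
```

Because `_:b0` has no URI, a second dataset cannot point at this name and surname pair directly, which is the linking obstacle described above.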
Finally, many nodes provide literal text as their value, which causes many
problems during their semi-automatic linking. Various works (such as Waitelonis and
Sack, 2009 and Kobilarov et al., 2009) that link profiles with DBpedia and FOAF report
a success ratio of 12 percent for persons and 18 percent for organizations. These
authors, however, suggest that DBpedia could not guarantee correct and exhaustive
results. In the case of Kobilarov et al. (2009), the results for persons and organizations
were at 30 percent. These results agree with the work of Howarth (2000) and Buscaldi
et al. (2003).
1.1.3 Queries and conceptual navigation. In addition to the problems that we have
already mentioned, queries are formalized in the SPARQL language (Jain et al., 2010). In
order to write a query, it is necessary not only to know this language, but also to
know in detail the concepts that are present in each resource and the properties with
which the concepts have been analyzed. In other words, it is improbable that a generic
query on various triples of different datasets would give the desired results.
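Why a generic query rarely transfers between datasets can be shown without a SPARQL engine. The following sketch uses a minimal in-memory triple pattern matcher; the function `match`, the `ex:p1` identifier and both toy datasets are hypothetical, and the SPARQL text in the comment is only the rough equivalent of the pattern being matched.

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]

# Two hypothetical datasets describing the same person with different properties.
dataset_a: List[Triple] = [("ex:p1", "foaf:birthday", "12-31")]
dataset_b: List[Triple] = [("ex:p1", "vcard:bday", "1980-12-31")]


def match(graph: List[Triple], subject: Optional[str],
          predicate: Optional[str]) -> List[str]:
    """Return objects of triples matching the pattern; None acts as a variable."""
    return [
        o for s, p, o in graph
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
    ]


# Roughly: SELECT ?o WHERE { ex:p1 foaf:birthday ?o }
# The pattern finds a value in dataset A and silently finds nothing in B.
print(match(dataset_a, "ex:p1", "foaf:birthday"))
print(match(dataset_b, "ex:p1", "foaf:birthday"))

# Only a query author who knows that B uses vcard:bday retrieves the value.
print(match(dataset_b, "ex:p1", "vcard:bday"))
```

The empty result on the second dataset is not an error, so nothing warns the user that the property name, rather than the data, is what is missing.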
Semantic navigation is defined as a traversal through relationships between
concepts. However, many linked vocabularies have mistakes in their hierarchy
(e.g. cycles), disjoint properties, different classification criteria, granularity, and so on
(Palacios, 2010; Fuentes and Mejía, 2013). Thus, for the working example about optical
materials, we can find in DBpedia that:
Given that it is impossible for the same concept to be both more specific and more generic than another
concept at the same time, conceptual navigation is rendered impossible. There exists
an abundance of paradoxical examples in the work of Fuentes and Mejía (2013).
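Hierarchy errors of this kind can be detected mechanically. The sketch below runs a depth-first search over a map from each concept to its broader concepts and reports one cycle if any exists; the function `find_cycle` and the concept names `ex:A`, `ex:B` and `ex:C` are hypothetical and are not taken from DBpedia.

```python
from typing import Dict, List, Optional


def find_cycle(broader: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one cycle in a concept -> broader-concepts map, or None."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    state: Dict[str, int] = {}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        state[node] = GREY
        path.append(node)
        for parent in broader.get(node, []):
            if state.get(parent, WHITE) == GREY:
                # The parent is already on the current path: a cycle closes here.
                return path[path.index(parent):] + [parent]
            if state.get(parent, WHITE) == WHITE:
                found = visit(parent)
                if found:
                    return found
        path.pop()
        state[node] = BLACK
        return None

    for concept in broader:
        if state.get(concept, WHITE) == WHITE:
            found = visit(concept)
            if found:
                return found
    return None


# Hypothetical hierarchy in which two concepts are each other's broader term.
hierarchy = {
    "ex:A": ["ex:B"],
    "ex:B": ["ex:A"],
    "ex:C": ["ex:A"],
}
print(find_cycle(hierarchy))
```

A retrieval system could run such a check before offering conceptual navigation, since any reported cycle marks a point where the choice between broader and narrower is undefined.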
A similar problem occurs if we search in Wikipedia for the URI that contains
Transparent_Materials: we are forwarded to the article titled Transparency and
translucency. This last term is itself a composite term that implies a hierarchy, since
in optics translucency is a generic term of transparency. Additionally, it is
noticeable that the concepts are related under dc:subject, where instances are
mixed with properties.
A summary of other problems identified on the semantic web is shown in Table I.
| Problems | Authors | Description |
|---|---|---|
| Vocabulary quality | Bizer et al. (2009), Palacios (2010), Fuentes and Mejía (2013) | Relevance of resources and trust. Many vocabularies have mistakes in their hierarchy, hampering conceptual navigation |
| Quality describing documents | Pastor-Sanchez et al. (2012), Jain et al. (2010) | Lack of conceptual description of datasets; RDF employment in the DataHub documents is frequently scarce and sometimes anecdotal. Descriptions in the LoD cloud present shallow expressivity |
| Licensing and open initiatives | Bizer et al. (2009), Strasunskas and Tomassen (2010) | The semantic web community prefers open standards, like OWL or RDFS, to alternatives with proprietary encoding formats or results of open academic experiments |
| Semantic linking | Jain et al. (2010), Palacios (2010) | The high number of description resources causes scalability problems (linking architecture problems due to one-by-one mappings) and overlapping in vocabularies. The LoD Cloud datasets lack schema-level mappings between concepts of different datasets |
| Obsolescence | Milicic (2011), Bizer et al. (2009) | Updating or removing semantic resources from the web. Link maintenance is poor |
| Trustworthiness | Morato et al. (2007), Bechhofer et al. (2010), Bizer et al. (2009) | Publishing requirements (authoring information, quality, credit, attribution) are scarcely implemented. Additional problems related to advertisements and semantic spam in vocabulary building and metadata description |
| Formalization | Bikakis et al. (2013), Bizer et al. (2009), Lanthaler and Gutl (2012), Milicic (2011) | Variety of technologies to represent knowledge (e.g. RDF, XML, UML, TEI, topic maps, TCML, microformats, KIF, common logic) and heterogeneity of linking formats in the semantic web: RDF, JSON linked data, Xlink, Hypernotation. Semantics in XML Schema is informal and assumes a closed world |
| URI problems | Milicic (2011) | Problems assigning URIs to b-nodes and RDF molecules |
| Difficulties querying SPARQL | Jain et al. (2010) | Users have to specify the details of the structure of the graph and be familiar with multiple datasets |
| Privacy problems | Bizer et al. (2009) | Privacy problems caused by integrating data from distinct sources |
| Vocabulary suitability and adaptation | Palacios (2010), Mangold (2007) | The suitability of a vocabulary is defined on the basis of low or tight coupling. There is a lack of statistical data to help the selection of vocabularies |
| Usability | Morato et al. (2007), Uren et al. (2007) | Usability of current systems in the semantic web |

Table I. Problems identified on the semantic web according to the literature
1.2 Semantic retrieval
1.2.1 Semantic search. Following Wei et al. (2008), in this work semantic search
refers to the retrieval of resources described for knowledge modeling and the usage of
logic-based knowledge representation languages for automated machine processing.
The term semantic search on the web is currently a buzzword with different
interpretations (Batzios and Mitkas, 2012). Traditionally, it includes techniques that
address the improvement of search accuracy (Fazzinga and Lukasiewicz, 2010;
Girit et al., 2012; Guha et al., 2003): disambiguation and contextualization of queries,
questions to semantic and semantically annotated documents, faceted search,
question-answering, query formalization or searches by similarity. In general, the main
feature of semantic search engines is the ability to solve complex queries by giving an
answer to a query rather than offering a set of documents where we could find that answer.
1.2.2 Semantic retrieval systems. In the context of the semantic web, the concept of
information retrieval systems is rather generic and vague. It encompasses different
criteria. Scheir et al. (2007) propose the following classification:
• the system operates on the semantic web with machine-interpretable data;
• the system is based on technology for the semantic web and ontology-driven
information retrieval approaches; and
• the system performs information retrieval and not data retrieval based on query
languages such as SPARQL.
Among the first search engines to appear were SHOE (Mangold, 2007) and On2broker.
On2broker (Fensel et al., 1999) had the objective of retrieving XML and RDF
documents, as well as vocabularies like MPEG-7 or Dublin Core. Since then, a number of
search engines have been presented: WebOWL (Batzios and Mitkas, 2012), Swoogle
(Ding et al., 2005a), XSearch (Amer-Yahia and Lalmas, 2006), SWSE (Hogan et al., 2011),
Sindice, SemSearch or Watson. Most of these search engines are based on RDF or OWL
documents, for example Swoogle, Falcons and Watson. The semantic results are
RDF documents (for example ontologies and ontology instances).
Some web retrieval systems extend the document typology even further; SWSE
transforms XML and HTML documents to RDF for subsequent indexing. Sindice, in
addition to RDF, includes microformats, RDFa and Microdata. Watson focuses on RDF
and OWL, but it includes other ontology languages like DAML+OIL.
The ranking of results is usually based on solutions similar to
Google's (e.g. Swoogle, WebOWL and SWSE), but others, like Falcons, use
variations of TF-IDF and popularity. XSearch is based on XML data, returning the part
of the XML tree structure that matches the search. A review of XML-based
systems can be found in Amer-Yahia and Lalmas (2006).
Directories are another type of information retrieval system that searches semantic web
resources such as metadata vocabularies. They are considered simple retrieval
systems because searching is carried out by navigating a tree
hierarchy that contains the resources. These directories are often not included in
studies about semantic search; however, we consider them a relevant resource in the
semantic web environment.
1.2.3 Previous works in evaluation of semantic retrieval systems. Initial works on
evaluation of semantic search engines were mainly focused on query performance
(Tumer et al., 2009; Andago et al., 2010). Many of these studies compare general
purpose search engines to those that extract semantic knowledge from natural
language texts (for instance, Hakia) by means of a knowledge organization system.
These studies identified and analyzed common elements for the comparison of the two
categories of search engines (general purpose and semantic search). The results show
an advantage for general purpose search engines. These results are different when
structured and formalized documents, such as RDF, are taken into account.
The criteria to evaluate these semantic search engines are not based just on
query performance (Strasunskas and Tomassen, 2010). These authors state that a
rigorous comparison must take into account factors such as query and ontology
quality, user interaction, semantic indexing criteria, query expansion, filtering,
ranking methods and presentation of results. Therefore, they propose a
classification framework (Table II) that comprises seven categories. All of them
are based on previous works to classify semantic search engines (Esmaili and
Abolhassani, 2006; Mangold, 2007).
Mangold (2007) carries out a classification of semantic search approaches. This
study analyzes ten systems according to seven criteria, as shown in Table II. Some of
Mangold's criteria are dependent on each other. Although the author recognizes that
there are other possible characteristics, they are not included in that study because the
purpose of that work was to focus on characteristics that most authors regard as
relevant. For the ontology structure, three types of properties are analyzed: anonymous properties
(the only aspect represented is a shared context); standard properties, that is, the common
thesaurus relationships (synonym, hypernym, meronym, instance), in addition to
negation; and domain-specific properties. In the case of Uren et al. (2007), the authors
identify four characteristics for classifying retrieval systems, none of which is related
to ontology quality criteria.
| Mangold (2007) | Strasunskas and Tomassen (2010) | Uren et al. (2007) |
|---|---|---|
| Architecture | Architecture | Search environment: large scale, heterogeneity and portability |
| User context (user's information needs) | Search goal (question answering, ontologies, data) | Query types |
| Query modification | Search phase | Iterative and exploratory dimensions: refinement, recommendation and reuse |
| Transparency (transparent/interactive) | User input (keywords, natural language, graphics, formal query or interactive) | Intrinsic problems: understanding, result ranking and matching |
| Ontology structure | Knowledge richness (taxonomy, thesaurus, ontology) | |
| Ontology technology | Ontology encoding (RDFS, OWL, ...) | |
| Coupling (ontology-documents tight/low) | Scope (Web, desktop) | |

Table II. Criteria for evaluating semantic retrieval systems
As can be observed, some of the evaluation criteria, such as user context or
search goal, take into account the type of semantic search. Different works distinguish
different types of semantic search. Wei et al. (2008) classify semantic search
research with respect to objectives, methodologies and functionalities:
document-oriented search; entity and knowledge-oriented search; multimedia
information search; relation-centered search; semantic analytics; and mining-based
search. Fazzinga and Lukasiewicz (2010) point out that the evaluation of the
accuracy of a system must depend on its search capacity. The proposals of
Uren et al. (2007) and Strasunskas and Tomassen (2010) reduce the typology
proposed by Wei et al. (2008). Strasunskas and Tomassen (2010) state
that standard IR metrics such as recall and precision are not enough to measure user
satisfaction because of the complexity and the effort needed to use semantic search
tools. Therefore, these authors suggest a holistic evaluation that includes system
quality, ontology quality, query quality, topic complexity and user interaction.
Table III lists the types of semantic search, as presented in the above
publications.
Hence, there is a need to establish criteria to evaluate semantic search engines.
Many of the earlier studies just describe the functionalities of these search engines, but
there is still a need to provide a mechanism to facilitate the comparison in a similar
way to query performance metrics in classical retrieval.
2. Evaluation method
A summary of some problems identified on the semantic web is shown in Table I. As
we observe, all characteristics are qualitative and therefore difficult to measure with
classical information retrieval evaluation methods. In this section, we propose a
method to deal with some criteria scarcely analyzed in previous studies. We have
selected the DESMET method (Kitchenham, 1996) in order to analyze and evaluate
different types of semantic web retrieval systems (directories and search engines) with
respect to their ability to manage and retrieve semantic documents. The goal is to
clarify whether these semantic system types implement the requirements that are
discussed in prior studies and whether they deal with the current problems found in the
semantic web.
| Wei et al. (2008) | Uren et al. (2007) | Strasunskas and Tomassen (2010) | Fazzinga and Lukasiewicz (2010) |
|---|---|---|---|
| Document-oriented search | Entity search | Information search | Structured languages |
| Entity and knowledge-oriented search | Relation search | Data search | Keyword-based |
| Multimedia information search | Parameterized (faceted) search | Question answering | Natural languages |
| Relation-centered search | | Ontology retrieval | |
| Semantic analytics | | | |
| Mining-based search | | | |

Table III. Types of semantic search
DESMET is a comparative method for performing simple, reliable and impartial
evaluations in software engineering, such as requirement analysis. This method is
intended to help an evaluator carry out an evaluation exercise that is unbiased and reliable
(e.g. maximizes the chance of identifying the best method/tool). The DESMET method
is context-dependent, which means that we do not expect a specific tool to be the best in
all circumstances. Thus, in this work we do not intend to determine the best retrieval
system but to offer a way to select one semantic system type or another according to
the context. We consider the method adequate because the main evaluation
criteria are functionalities that are difficult to measure in the way classical retrieval
systems are measured. Besides, these web retrieval systems are always evolving, so we favor
methods that can be adapted to changes in functionality. This method enables a
qualitative evaluation of the level of support that various systems provide to the
organization and the retrieval of semantic elements.
Following the steps of the DESMET method, first we have identified the specific
circumstances of a context in which ontologies and metadata vocabularies about a
specific subject are retrieved. Second, we have performed a feature analysis, which essentially is an
evaluation based on the identification of requirements and their correspondence to the
characteristics that these specifications support. Finally, we have defined the retrieval
systems to be evaluated and the criteria to evaluate them, and assigned the values and
prioritization degree according to the DESMET method.
2.1 Selecting retrieval systems of semantic documents
We have collected 12 semantic retrieval systems. We have found that retrieval systems
differ in kind and functionality. Consequently, we have classified
retrieval systems into four types of semantic search engines, in order to provide a
comparison framework where we can analyze the results by groups. We propose the
following classification by types of retrieval systems and types of document that they
search:
• Ontology search engines. These applications crawl the web discovering semantic
web documents. The search engine indexes the ontologies in order to retrieve and
rank the results. Examples are Swoogle (http://swoogle.umbc.edu/), Sindice
(http://sindice.com/), or Watson (http://watson.kmi.open.ac.uk/WatsonWUI/).
• Search engines for metadata. Search engines aimed at retrieving metadata, for
example the Linked Open Vocabulary (LOV) (http://lov.okfn.org/dataset/lov/index.html)
and the DataHub (http://datahub.io/).
• Ontology directories. Ontology catalogues collected by hand. Examples: DAML
Ontology Library (www.daml.org/ontologies/) and Protege Ontologies
(http://protegewiki.stanford.edu/wiki/Protege_Ontology_Library).
• Metadata directories. Metadata catalogues, such as UKOLN metadata resource
(www.ukoln.ac.uk/metadata/resources/), Topic Maps PSIs
(http://psi.mchapman.com/vl/index), RDA vocabulary (http://rdvocab.info/) and the Open
Metadata Registry (http://metadataregistry.org/vocabulary/list.html).
We have avoided some kinds of search engines such as question-answering and
chatbots due to the fact that their technology is based on information extraction
instead of metadata description and their KOSs are not public. Although they interact
with a human user, they do not necessarily retrieve semantic documents, but instead
they utilize semantic resources as a natural language processing technique for user
interaction purposes.
2.2 Evaluation criteria for retrieval systems of semantic documents
Tables IV-VI present the set of criteria that we have defined for evaluating the
resources. These characteristics have been selected and refined from the previous
literature and classified into three types of criteria:
(1) Schema management. The related criteria are: interoperability, formalization,
interactivity and semantic framework (Table IV).
(2) Semantic management. Related to the meaning of concepts and their
management; related criteria are: disambiguation, multilingualism, synonyms,
scope, extensibility, reusability, modifiability, and language (Table V).
(3) Queries. Concerning the query process and the management of the obtained
results. This category covers sense specification, conceptual queries,
contextual queries and document retrieval (Table VI).
Following the DESMET method, we establish two types of features: simple and
compound. The simple characteristics are those that can be present or absent and can
be assessed using a Boolean scale. The compound characteristics are assessed by the
degree to which they are supported, quantified on an ordinal scale. The characteristics
are identified and prioritized, and we establish the system for the assessment of the
characteristics, with respect to their type and importance:
• Simple types: No (0) and Yes (5).
• Compound types: None (0), Low (1), Medium (3), High/Fundamental (5).
• Importance: Optional (3), Desirable (6) and Obligatory (10).
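The assessment scales listed above can be written down as a small lookup structure. The following is a minimal sketch, not the authors' tooling: the class names, the `SCALES` dictionary and the `assess` helper are hypothetical names introduced only for illustration, while the numeric values are the ones stated in the text.

```python
from enum import IntEnum


class SimpleScale(IntEnum):
    """Boolean scale for simple characteristics."""
    NO = 0
    YES = 5


class CompoundScale(IntEnum):
    """Ordinal scale for compound characteristics."""
    NONE = 0
    LOW = 1
    MEDIUM = 3
    HIGH = 5  # "High/Fundamental" in the text


class Importance(IntEnum):
    """Importance factor of a characteristic."""
    OPTIONAL = 3
    DESIRABLE = 6
    OBLIGATORY = 10


# Hypothetical lookup from characteristic type to its scale.
SCALES = {"simple": SimpleScale, "compound": CompoundScale}


def assess(kind: str, level: str) -> int:
    """Return the numeric value of `level` on the scale for `kind`.

    Raises KeyError if the level does not belong to that scale,
    e.g. assess("simple", "medium").
    """
    return int(SCALES[kind][level.upper()])
```

Keeping the two scales separate makes an invalid assessment, such as a "Medium" value for a simple characteristic, fail immediately instead of silently entering the totals.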
3. Results
In the evaluation process of the different methods for the retrieval of semantic
documents, we have assigned one value to each of the characteristics. None of the
retrieval systems supports the characteristic of formalization; they use the schema as
it is defined by the entity that is responsible for its creation and maintenance. This
aspect also means that none of the resources copes with the ambiguity that exists in
the syntactic and semantic representation, that is, none disambiguates each concept
and property among the candidates that are included in the schema. Nor have we found
the characteristics of multilingualism or sense specification.
Table IV. Characteristics of schema management for the evaluation of systems for semantic retrieval
Characteristic | Importance/type | Description
Interoperability | Obligatory/simple | Possibility to establish relationships between concepts of different schemas
Formalization | Obligatory/simple | Possibility to realize or improve the formalization of a schema, regarding the management process
Interactivity | Desirable/compound | Possibility that the user participates actively in the schema management, in accordance with the Web 2.0 guidelines
LHT
31,4
648
In this study, we have obtained criteria to be considered in a semantic retrieval system
rather than determining which system obtains the best results, because in this context
systems are constantly evolving.
3.1 Results of schema management
With respect to interoperability, we have observed that metadata directories do not
support this characteristic. The metadata registries do incorporate one-to-one
relationships between schemas. Some special cases of ontology engines, such as
Watson, analyze relationships between concepts.

Table V. Characteristics of semantic management for the evaluation of systems for semantic retrieval
Characteristic | Importance/type | Description
Disambiguation | Obligatory/simple | Ability to eliminate structural and semantic ambiguity of concepts, in order to facilitate the conceptual retrieval
Semantic framework | Obligatory/compound | The scope in which the semantic and the conceptual retrieval of concepts is managed. The possible values are: None (0); Local, in the schema (1); Local, with relationships between schemas (3); Global, between schemas that use a shared resource (e.g. an ontology) (5)
Multilingualism | Desirable/simple | Possibility to support multiple languages
Synonymy | Obligatory/compound | Possibility to solve problems that arise from different concepts with the same meaning
Scope | Obligatory/simple | The domain in which the semantics of the schemas to be managed are defined. It can be either homogeneous or heterogeneous
Extensibility | Desirable/compound | Possibility to expand the representation of the schema semantics
Reusability | Desirable/compound | Possibility to reuse the representation of the schema semantics
Modifiability | Desirable/compound | Possibility to modify the representation of the schema semantics
Language | Optional/compound | Possibility to represent the language that is used in the formalization of the semantics

Table VI. Characteristics of queries for the evaluation of systems for semantic retrieval
Characteristic | Importance/type | Description
Sense specification | Obligatory/simple | Possibility to express the concrete meaning of a concept in the query process
Conceptual query | Obligatory/compound | Possibility to perform queries according to the meaning of the concepts
Contextual query | Obligatory/compound | Possibility to obtain results that derive from the existing relationships between concepts
Document retrieval | Optional/simple | Possibility to obtain semantic documents that derive from schemas, as well as the schemas themselves
Metadata registries and ontology directories often provide extra functionality to the
users so that they can incorporate new resources to the system. Metadata directories,
similar to the ontology engines, are usually closed to user interventions, except for the
query processes.
3.2 Results of semantic management
With respect to the semantic framework, metadata directories do not use the semantics
associated with the concept; rather they only use the description tokens. In contrast to the
metadata engines, ontology engines and ontology directories utilize the semantics that
are local to the schema, including relationships with other schemas. Likewise, only these
categories present the characteristics of language and modifiability. The schema
definition language that they use is either XML or RDF. On the other hand, the
correspondence between schemas and the semantic representation model is a
one-to-one relationship, which implies the revision and update of all correspondences.
With respect to synonymy, we have not detected it in metadata directories.
However, we consider it partially covered in the rest of typologies, because they
support the definition of one-to-one correspondences between concepts.
The scope of retrieval systems is wide and heterogeneous. As an example in the
case of metadata engines, LOV works with 322 vocabulary spaces. This resource
includes statistics such as LOV distribution, LOV popularity and LOD popularity. The
DataHub also includes ratings, but they are scarcely implemented.
The reusability, defined as the ability to reuse the representation of the schema
semantics, is applied only by the ontology directories through the publication of
one-to-one alignments for their possible reuse.
3.3 Queries
Concerning the query process and the management of the obtained results, we analyze
features such as sense specification, conceptual and contextual queries, and document
retrieval. From the point of view of semantic retrieval, which requires differentiating
between polysemic meanings, we have not detected in any of the categories the
possibility of searching for a concrete meaning.
Regarding conceptual queries, metadata directories base the retrieval process on the
syntactic search of the labels and attributes. In contrast, metadata engines, ontology
engines and ontology directories extend the search by including meanings and
relationships between concepts in a generic environment, while also permitting
the establishment of concrete semantics for the concept to be retrieved.
The possibility to extend queries with related concepts is present in metadata
engines, ontology engines and directories. However, metadata directories do not extend
the results of concepts through their relationships.
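Extending a query through the relationships of a concept can be illustrated with a short breadth-first expansion. This is a minimal sketch under stated assumptions: relationships are held in a plain dictionary, and the `RELATIONS` data and the `expand_query` function are hypothetical and not taken from any of the evaluated systems.

```python
from collections import deque

# Hypothetical concept relationships (illustrative data only).
RELATIONS = {
    "author": ["creator", "person"],
    "creator": ["agent"],
}


def expand_query(concept, relations, max_hops=1):
    """Return the concept plus every concept reachable within `max_hops` relationships."""
    seen = {concept}
    frontier = deque([(concept, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue  # do not follow relationships beyond the hop limit
        for neighbour in relations.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                frontier.append((neighbour, depth + 1))
    return seen
```

A system without contextual queries corresponds to `max_hops=0`, which returns only the queried concept; raising the limit widens the result set at the cost of precision.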
Finally, with respect to document retrieval, schema, metadata and ontology
directories only permit schema retrieval, while the corresponding engines permit the
retrieval of documents that are instances of these schemas.
For each characteristic, we have obtained the product of the assigned value by the
factor of importance. Once the weighted values of each system are calculated, we
calculate the aggregate percentages for each category, in order to facilitate their
interpretation. More specifically, for each of the defined categories (Schema
management, Semantic management and Query), we have summed the value of
their characteristics and we have calculated the percentage of the above-mentioned
sum over the maximum possible value, which would correspond to 100 percent. In
Figure 1, we present the results that correspond to the evaluation of each method in
percentage and grouped by category.
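The aggregation just described can be sketched in a few lines. This is an illustrative reading of the procedure rather than the authors' implementation: the function name and the example assessment are hypothetical, and the maximum value of 5 is taken from the top of both assessment scales.

```python
MAX_VALUE = 5  # top of both the simple (Yes) and compound (High) scales


def category_percentage(assessments):
    """Return the weighted score of a category as a percentage of its maximum.

    `assessments` is a list of (assigned_value, importance_factor) pairs,
    one pair per characteristic in the category.
    """
    obtained = sum(value * importance for value, importance in assessments)
    maximum = sum(MAX_VALUE * importance for _, importance in assessments)
    return 100.0 * obtained / maximum


# Hypothetical assessment of one system on three characteristics
# (illustrative values only, not results from this study).
example = [
    (5, 10),  # simple, obligatory: present
    (0, 10),  # simple, obligatory: absent
    (3, 6),   # compound, desirable: medium
]
print(round(category_percentage(example), 1))  # prints 52.3
```

Here the obtained sum is 50 + 0 + 18 = 68 and the maximum is 50 + 50 + 30 = 130, so obligatory characteristics dominate the percentage, as the importance factors intend.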
In the schema management category, the metadata search engines and the ontology
search engines and directories obtain the highest results (43.1). The ontology
directories obtain this result mainly because they promote the participation of the user
and support the definition of relationships between schema elements. The ontology
search engines are positioned just below them due to their lesser ability of interactivity
with the user. The rest of the methods obtain noticeably lower values, as a result of the
lack of support to the management of correspondences between elements, as well as a
lesser degree of interactivity with the user.
For the semantic management category (Figure 1), the ontology directories and the
ontology search engines obtain the best results (53.4). In this case, they stand out for
the management of relationships between concepts; their application scope, heterogeneous
with respect to the knowledge domain; the modifiability of the solution and the
semantic representation language employed. The decrease of the values for metadata
search engines (43.3) is caused by the fact that these engines deal with a more restricted
scope, as well as the use of languages with less semantic expressivity for
representation. The metadata directories obtain the lowest value (6.1), a fact that can be
attributed mainly to the restricted nature of the application environment.
In the Query category (Figure 1), the ontology search engines and the metadata search
engines obtain the best results (57.6), mainly due to their ability to perform contextualized
conceptual queries, as well as the possibility to obtain semantic documents. The next
value corresponds to the ontology directories (36.4). The decrease of their score is due to
the impossibility to obtain documents that are associated to the schemas. The decrease of
the rest of the values is caused by the absence of contextualization of the results and the
local use of the schema semantics. As a result, the previous points cause the schema
directories and the metadata directories to get lower values.
The overall results of the evaluation of the methods (Figure 1) show that the
ontology directories achieve a good score (46.8), resulting from a positive evaluation
regarding the schema management and semantic management categories. Proceeding
in descending order, the ontology search engines (52.4) owe this result to the positive
evaluation of the query and semantic management methods. In the case of metadata
search engines (47.0), the obtained assessment arises from the tradeoff between the
semantic management and a good schema and query management. In the last place, we
find the metadata directories (7.1), due to shortcomings in the support of all the
evaluated categories.

Figure 1. Results of the evaluation of each system grouped by category
4. Discussion
In this work, the definition of a semantic document is extended to other schemas and
codifications that contain a semantic description of document content. Standards like
topic maps or OWL can be represented with XML Schema and without the use of RDF.
Rigorous studies of this field must not be limited to retrieval, maintenance and storage
of RDF documents only. Our main motivation is that this type of semantic document
constitutes a key issue for the semantic description of other resources. Since these
vocabularies are considered as semantic documents, they must be retrievable by a
semantic search engine. Nevertheless, it is true that the semantic web community
prefers open standards, like OWL or RDFS, to alternatives with a proprietary
encoding format or results of open academic experiments (Strasunskas and Tomassen,
2010).
Wei et al. (2008) have stressed the need to develop a formalized semantic search
framework. We believe that a desirable characteristic of semantic search is to
integrate this framework in the semantic web evaluation procedures. As an example
of the challenges that arise on the semantic web, we can take a closer look at the
linked data proposals and their element sets and value vocabularies. The first problem
that arises is the linking architecture approach. There are scalability problems when
connecting resources using one-to-one mappings between vocabularies. In fact, if we
compute the potential number of alignments in the whole set of vocabularies taking
2 at a time, we will have $\binom{n}{2} = \frac{n!}{2!\,(n-2)!}$ alignments, where n is
the number of vocabularies. The W3C Library Linked Data Incubator Group has undertaken
a great effort of collecting and classifying value vocabularies and metadata element sets to
decrease these possibilities, but it is a long-term project. Another solution is a
unique central resource that would connect to the rest of the vocabularies, which would
result in n-1 mappings between all possible concepts. But updating problems will
still remain: what effects do updates in a vocabulary's hierarchy, or corrections in
descriptions due to ambiguities, have? Besides, we have taken into account that values for
the elements can be drawn not just from value vocabularies but even from free
text. Finally, there are difficulties in identifying and adapting vocabularies: there is
a need to select the most appropriate among the candidates, leaving open the possibility
of adapting it to the resource without modifying the original vocabulary, thus reducing
possible ambiguities.
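The difference between the two linking architectures discussed above can be made concrete with a short calculation. This is a minimal sketch: the function names are hypothetical, and the figure of 322 vocabularies is borrowed from the LOV example in the results section purely as an illustration.

```python
from math import comb


def pairwise_alignments(n: int) -> int:
    """Number of one-to-one alignments when every pair of n vocabularies is mapped."""
    return comb(n, 2)  # n! / (2! * (n - 2)!)


def hub_alignments(n: int) -> int:
    """Number of mappings when one central resource links to the other n - 1 vocabularies."""
    return max(n - 1, 0)


print(pairwise_alignments(322))  # prints 51681
print(hub_alignments(322))       # prints 321
```

The pairwise count grows quadratically with the number of vocabularies while the central-resource count grows linearly, which is the scalability problem the paragraph describes.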
Improving the process of identifying the essential functionalities, such as usability,
in the implementation of a semantic retrieval system is a critical point for popularizing
these resources (Morato et al., 2012). Interaction with a larger set of users is an essential
element that will help the semantic web and linked data technologies to achieve an
even greater degree of potential.
5. Conclusions
We have performed an evaluation of methods for semantic document retrieval. The
results of the evaluation indicate that, at the moment, many of the resources of
semantic document retrieval lack the minimum of functionality in order to popularize
their use. Some of the difficulties can be identified in ambiguity and lack of
formalization of resource descriptions, difficulties in usability and operation, isolation
of datasets that impede more exhaustive searches and the difficulty to carry out
conceptual searches and navigation.
As the analysis has shown, only the ontology directories score barely over 50
percent in the evaluation. In our judgment, the current methods need to cover
characteristics that are essential for the management of semantic documents. Such
characteristics may include the formalization of the documents, their disambiguation,
multiple language support and the semantic coverage of queries.
There are many problems that are difficult to measure. If we observe metadata
vocabularies, we realize that selecting the right vocabularies is a tough task due to the
large number of vocabularies in the cloud. The absence of URIs, the low usability and
the lack of consensus between overlapping vocabularies are difficulties that we have
to overcome to facilitate the access of users to semantic web resources.
In this study we have proposed a mechanism to facilitate the comparison in a
similar way to query performance metrics in classical retrieval. Previous studies have
emphasized a descriptive approach to evaluate semantic search engines. We propose
an approach that gives weight to each evaluation criterion, facilitating the comparison in
the future.
As a work in progress, we are studying how to identify criteria related to
trustworthiness and link quality. Although it is noticeable that some search engines
have included some statistics to guide the user in the selection of a vocabulary, there is
a lack of studies showing the real importance of these data in user behavior.
References
Allemang, D. and Hendler, J. (2011), RDF-The basis of the Semantic Web, Semantic Web for the
Working Ontologist, 2nd ed., Morgan Kaufmann, Burlington, MA, pp. 27-50.
Amer-yahia, S. and Lalmas, M. (2006), XML Search: languages, Inex and Scoring, SIGMOD
Rec., Vol. 36 No. 7, pp. 16-23.
Andago, M.O., Phoebe, T. and Thanoun, B.A.M. (2010), Evaluation of a semantic search engine
against a keyword search engine using first 20 precision, Intern. Journal for the Advanc.
of Science & Arts, Vol. 1 No. 2, pp. 55-63.
Batzios, A. and Mitkas, P.A. (2012), WebOWL: A semantic web search engine development
experiment, Expert Systems with Applications, Vol. 39 No. 5, pp. 5052-5060.
Bechhofer, S., Ainsworth, J., Bhagat, J., Buchan, I., Couch, P., Cruickshank, D., Delderfield, M.,
Dunlop, I., Gamble, M., Goble, C., Michaelides, D., Missier, P., Owen, S., Newman, D.,
De Roure, D. and Su, S. (2010), Why linked data is not enough for scientists,
IEEE International Conference on eScience, IEEE Sixth International Conference on
e-Science, pp. 300-307.
Bikakis, N., Tsinaraki, C., Gioldasis, N., Stavrakantonakis, I. and Christodoulakis, S. (2013),
The XML and semantic web worlds: technologies, interoperability and integration: a
survey of the state of the art, in Anagnostopoulos, I.E. (Ed.), Semantic Hyper/Multimedia
Adaptation, Vol. 418, Springer, Berlin, pp. 319-360.
Bizer, C., Heath, T. and Berners-Lee, T. (2009), Linked data - the story so far, International
Journal on Semantic Web and Information Systems, Vol. 5 No. 3, pp. 1-22.
Buscaldi, D., Guerrini, G., Mesiti, M. and Rosso, P. (2003), Tag semantics for the retrieval of
XML documents, Proceedings of the 1st International Symposium on Information and
Communication Technologies, 24-26 September, Dublin, Ireland, Trinity College Dublin,
Dublin, pp. 273-278.
Ding, L., Finin, T., Peng, Y., Pinheiro da Silva, P. and McGuinness, D. (2005a), Tracking RDF
Graph Provenance using RDF Molecules, report TR-CS-05-06, Computer Science and
Electrical Engineering, University of Maryland, Baltimore County, April 30.
Ding, L., Pan, R., Finin, T., Joshi, A., Peng, Y. and Kolari, P. (2005), Finding and ranking
knowledge on the semantic web, Proceedings of the 4th International Semantic Web
Conference.
Esmaili, K.S. and Abolhassani, H. (2006), A categorization scheme for semantic web search
engine, AICCSA 06 Proceedings of the IEEE International Conference on Computer
Systems and Applications, IEEE Computer Society, Washington DC, pp. 171-178.
Fazzinga, B. and Lukasiewicz, T. (2010), Semantic search on the web, Semantic Web, Vol. 1,
pp. 89-96.
Fensel, D., Angele, J., Decker, S., Erdmann, M., Schnurr, H.-P., Staab, S., Studer, R. and Witt, A.
(1999), On2broker: Semantic-Based Access to Information Sources at the WWW,
Proc.WWW Conf., Internet (WebNet 99), Honolulu, 25-30 October, pp. 25-30.
Fuentes, D. and Mejía, A. (2013), Cycle management in semantic similarity among Wikipedia's
concepts, degree thesis, University Carlos III.
Girit, H., Eberhard, R., Michelberger, B. and Mutschler, B. (2012), On the precision of search
engines: results from a controlled experiment, 15th Int. Conf. on Business Information
Systems (BIS 2012), pp. 260-271.
Guha, R., McCool, R. and Miller, E. (2003), Semantic search, WWW2003, Budapest, pp. 700-709.
Hogan, A., Harth, A., Umbrich, J., Kinsella, S., Polleres, A. and Decker, S. (2011), Searching and
browsing linked data with SWSE: the semantic web search engine, Web Semantics:
Science, Services and Agents on the World Wide Web, Vol. 9 No. 4, pp. 365-401.
Howarth, L.C. (2000), Creating a metadata-enabled framework for resource, available at:
www.cais-acsi.ca/proceedings/2000/howarth_2000.pdf (accessed January 2013).
Jain, P., Hitzler, P., Yeh, P.Z., Verma, K. and Sheth, A.P. (2010), Linked data is merely more
data, in Brickley, D., Chaudhri, V.K., Halpin, H. and McGuinness, D. (Eds), Linked Data
Meets Artificial Intelligence. Tech. Rep. SS-10-07, AAAI Press, Menlo Park, CA, pp. 82-86.
Kitchenham, B. (1996), DESMET: a method for evaluating software engineering methods and
tools, available at: www.osel.co.uk/desmet.pdf (accessed June 2012).
Kobilarov, G., Scott, T., Raimond, Y., Oliver, S., Sizemore, C., Smethurst, M. and Lee, R. (2009),
Media meets semantic web - how the BBC uses DBpedia and linked data to make
connections, European Semantic Web Conf. Semantic Web in Use Track, Crete.
Lanthaler, M. and Gutl, C. (2012), On using JSON-LD to create evolvable RESTful services,
Proc.3rd Internat. Workshop on RESTful Design (WS-REST 2012) at WWW2012,
ACM Press, Lyon, pp. 25-32.
Mangold, C. (2007), A survey and classification of semantic search approaches, Int. J. Metadata
Semant. Ontologies, Vol. 2 No. 1, pp. 23-34.
Milicic, V. (2011), Introducing hypernotation, an alternative to linked data, available at:
http://milicicvuk.com/blog/category/hypernotation/ (accessed August 2013).
Morato, J., Sanchez-Cuadrado, S., Fraga, A. and Moreno Pelayo, V. (2007), Towards a social
semantic web environment, El profesional de la Informacion, Vol. 17 No. 1, pp. 78-85.
Morato, J., Fraga, A., Andreadakis, Y. and Sanchez-Cuadrado, S. (2012), Eight steps towards the
socialisation of the semantic web, International Journal of Social and Humanistic
Computing, Vol. 1 No. 4, pp. 347-362.
Nesic, S. (2010), Semantic document architecture for desktop data integration and
management, PhD thesis, Univ. Svizzera Italiana.
Palacios, V. (2010), Sistema de recuperacion conceptual mediante niveles semanticos en la
representacion de esquemas de metadatos [Conceptual retrieval system using semantic
levels in the representation of metadata schemas], PhD thesis, Carlos III University, available at:
http://hdl.handle.net/10016/9332 (accessed June 2012).
Pastor-Sanchez, J.A., Martínez-Mendez, F. and Rodríguez-Munoz, J.V. (2012), SKOS application
for interoperability of controlled vocabularies in the field of linked open data,
El Profesional de la Informacion, Vol. 21 No. 3, pp. 245-253.
Research Group Data and Web Science (University of Mannheim) (n.d.), Datasets in the next
LOD Cloud, available at: http://wifo5-03.informatik.uni-mannheim.de/lodcloud/ (accessed
January 2013).
Strasunskas, D. and Tomassen, S.L. (2010), On variety of semantic search systems and their
evaluation methods, Proceedings of the International Conference on Information
Management and Evaluation, Academic Conferences Publishing, Cape Town, pp. 380-387.
Scheir, P., Pammer, V. and Lindstaedt, S.N. (2007), Information retrieval on the semantic web -
does it exist?, Proceedings of Lernen-Wissen-Adaption (LWA) 2007, pp. 252-257.
Tumer, D., Shah, M.A. and Bitirim, Y. (2009), An empirical evaluation on semantic search
performance of keyword-based and semantic search engines: Google, Yahoo, Msn and
Hakia, 2009, 4th International Conference on Internet Monitoring and Protection (ICIMP 09),
available at: http://doi.ieeecomputersociety.org/10.1109/ICIMP.2009.16 (accessed
August 2013).
Uren, V., Lei, Y., Lopez, V., Liu, H., Motta, E. and Giordanino, M. (2007), The usability of
semantic search tools: a review, The Knowledge Engineering Review, Vol. 22 No. 4,
pp. 361-377.
W3C (2007), Semantic Annotations for WSDL and XML Schema, Farrell, J., Lausen, H. W3C
Recommendation 28 August 2007, available at: www.w3.org/TR/sawsdl/ (accessed
January 2013).
W3C (2010), XML Linking Language (XLink) Version 1.1. DeRose, S., Maler, E., Orchard,
D., Walsh, N. W3C Recommendation 06 May 2010, available at: www.w3.org/TR/xlink11/
(accessed January 2013).
W3C (2011), Library Linked Data Incubator Group Final Report, Baker, T. et al. W3C Incubator
Group Report 25 October 2011, available at:
www.w3.org/2005/Incubator/lld/XGR-lld-20111025/ (accessed January 2013).
W3C (2012), OWL 2 Web Ontology Language XML Serialization (second edition), Motik,
B., Parsia, B. and Patel-Schneider, P.F. W3C Recommendation 11 December 2012, available
at: www.w3.org/TR/owl-xml-serialization (accessed January 2013).
W3C (2013), Ontologies available at: www.w3.org/standards/semanticweb/ontology/ (accessed
January 2013).
Waitelonis, J. and Sack, H. (2009), Augmenting video search with linked open data,
Proceedings of International Conference on Semantic Systems 2009 (i-semantics 2009),
2-4 September, Graz, Verlag der TU Graz, Austria.
Wei, W., Barnaghi, P.M. and Bargiela, A. (2008), Search with meanings: an overview of semantic
search systems, Int. J. Commun. SIWN, No. 3, pp. 76-82.
About the authors
Jorge Luis Morato is currently a Professor of Information Science in the Department of
Informatics at the Carlos III University of Madrid (Spain). In 1999, he received his PhD in Library
Science from Carlos III University. Jorge Luis Morato is the corresponding author and can be
contacted at: jorge.morato@gmail.com
Sonia Sanchez-Cuadrado works as an Assistant Professor in the Department of Informatics at
Carlos III University of Madrid. In 2007, she received her PhD in Library Science and Digital
Environment, designing a methodology for the automatic construction of knowledge
organization systems and NLP.
Christos Dimou is a Visiting Lecturer at the Department of Informatics, at the Carlos III
University of Madrid. In 2010, he obtained his PhD in Electrical and Computer Engineering,
Aristotle University of Thessaloniki, Greece, defining a framework for the performance
evaluation of software agents. His research interests include requirements engineering, software
agents and information retrieval.
Divakar Yadav has been an Assistant Professor in the Department of Computer Science and
Engineering at Jaypee Institute of Information Technology, Noida and Carlos III University for
the last 12 years. His areas of interest include information retrieval, soft-computing, and
operating systems. He has participated in, reviewed for and organized many international and
national conferences. He received his PhD in Computer Sc. and Engineering in 2010.
Vicente Palacios is currently working as Systems Engineer at the Carlos III University of
Madrid, where he is also a lecturer of Software Processes and Advanced Software Design.