

SPIDER: A System for Scalable, Parallel / Distributed
Evaluation of Large-scale RDF Data

Hyunsik Choi, Jihoon Son, YongHyun Cho, Min Kyoung Sung, Yon Dohn Chung∗
Department of Computer Science and Engineering
College of Information and Communication
Korea University, South Korea
{hyunsikchoi, jihoonson, yonghyun, mj999, ydchung}@korea.ac.kr

∗ the corresponding author

ABSTRACT

RDF is a data model for representing labeled directed graphs, and it is an important building block of the semantic web. Owing to its flexibility and applicability, RDF has been used in a broad range of applications, such as bioinformatics and social networks, where large-scale graph datasets are common. However, existing techniques do not manage such datasets effectively. In this paper, we present a query processing system for RDF data, named SPIDER, based on the well-known parallel/distributed computing framework Hadoop. SPIDER consists of two major modules: (1) the graph data loader and (2) the graph query processor. The loader analyzes and dissects the RDF data and places the parts over multiple servers. The query processor parses the user query and distributes subqueries to the cluster nodes; the results of the subqueries are gathered from the servers (and re-evaluated if necessary) and delivered to the user. Both modules utilize the MapReduce framework of Hadoop. In addition, our system supports the main features of the SPARQL query language. This prototype will be a foundation for developing real applications in the many domains where large-scale graph data are common.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, to post on servers or to redistribute to lists, requires a fee and/or special permission from the publisher, ACM.
VLDB '09, August 24-28, 2009, Lyon, France
Copyright 2009 VLDB Endowment, ACM 000-0-00000-000-0/00/00.

1. INTRODUCTION

As a data model for representing labeled, directed graphs, RDF is an important building block of the semantic web. RDF can also be extended easily to ontology languages such as RDFS and OWL, which provide a means to define domain-specific vocabularies, schemas, and relations between vocabulary elements. Recently, owing to its flexibility and applicability, RDF has been popularly used in a variety of applications, such as the semantic web, bioinformatics[3], and social networks[6].

In these applications, the datasets are usually huge. For example, in the semantic web, large-scale graph datasets are required for complex inferencing; in the life sciences, many projects[13, 3, 4, 8] are accumulating DNA sequence and protein interaction network datasets; and the social networks area needs massive sample datasets for complex analysis of relations between people (or groups). The volume of graph data generated by these applications reaches tens of petabytes. However, existing systems[7, 19, 18] are not sufficient to deal with such large-scale RDF graph datasets, for the following reasons:

• Existing systems are based on a single machine, so they cannot keep up with the dramatically increasing volume of RDF data. Current storage technology (e.g., the SAS SCSI and SATA2 HDD interfaces[14]) has a theoretical maximum transfer rate of 3 Gb/s; at that rate, scanning one petabyte on a recent HDD requires at least 33 days (a rough check of this figure follows the list).

• The subgraph query is the core RDF query type, but existing systems cannot efficiently process subgraph queries on big graph datasets, because subgraph matching is NP-complete.
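As a rough check of the 33-day figure (our annotation; the paper does not show the derivation), 3 Gb/s is 3.75 x 10^8 bytes per second, so

$t = \frac{10^{15}\ \mathrm{bytes}}{3.75 \times 10^{8}\ \mathrm{bytes/s}} \approx 2.7 \times 10^{6}\ \mathrm{s} \approx 31\ \mathrm{days}$

and since sustained drive throughput falls below the interface's theoretical maximum, "at least 33 days" is a plausible lower bound.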
As a practical approach to overcoming these problems, we take the server-cluster approach: the large RDF dataset is partitioned and distributed over multiple servers, and the user's subgraph query is dissected into multiple subqueries that are delivered to and processed on multiple servers in a distributed, parallel manner.

In this demonstration, we present an RDF subgraph processing system that enables Scalable, Parallel / Distributed Evaluation of large-scale RDF data (SPIDER). It aims to store and process petabyte-scale RDF graph datasets. To do so, we use a distributed file system and employ the MapReduce framework[10] provided by Hadoop[2]. A Hadoop cluster can be composed of anywhere from several machines to tens of thousands, and due to this scalability Hadoop is regarded as a representative framework for cloud computing environments.

The main contributions are as follows:

• Our system can store and process large-scale RDF graph datasets on a cluster of commodity servers.

• When more storage and computation power are required, our system can be easily expanded by adding more cluster nodes.

• Our system allows users to retrieve subgraphs matched to user-defined graph patterns.

• Our system supports the main features of the SPARQL query language.
[Figure 1: A System Overview of SPIDER. The Graph Data Loader (left) imports XML, TURTLE, and N-TRIPLE data through an input formatter, deploys graphs via a graph data partitioner, a graph data contractor (local and global contractors), and a graph data allocator, and builds indexes (tree, bitmap, hash table, list), all over the Hadoop Distributed File System. The Graph Query Processor (right) takes query input through a pre-processor (query parser, query dissemination), performs partial matching in the Map process and full matching in the Reduce process, and delivers results through an output formatter; users and applications interact via a Web UI.]

The rest of this paper is organized as follows. In Section 2, we give an overview of the proposed system and describe its two main components, the graph data loader and the graph query processor. In Section 3, we describe the demonstration system we developed to store and process large-scale graph data. In Section 4, we discuss the applicability of our system and some directions for further improvement.

2. SYSTEM OVERVIEW

This section gives an overview of SPIDER. Our system is developed on Hadoop[2], a distributed system framework. Using Hadoop, SPIDER stores RDF data on the distributed file system (DFS) and processes subgraph queries in a parallel/distributed manner with the MapReduce technique. For these tasks, the system has two main components, (1) the graph data loader and (2) the subgraph query processor, which we discuss in the following subsections. This design allows our system to deal with graph data whose volume ranges from several hundred megabytes to several petabytes. Fig. 1 shows the overall architecture of the SPIDER system: the left-hand side is the Graph Data Loader module, the right-hand side is the Graph Query Processor module, and dashed lines indicate modules that are not yet implemented.

2.1 Underlying Architecture

In this section, we briefly describe the underlying architecture on which SPIDER is based. As mentioned, SPIDER is developed under Hadoop[2], a distributed system framework. Hadoop is designed to build parallel/distributed computing environments composed of up to tens of thousands of commodity servers, and it is known to be very scalable. Thus, it is popularly used as a representative framework in cloud computing environments, for example at Amazon (EC2)[1] and Yahoo!.

Hadoop consists of a distributed file system, the Hadoop Distributed File System (HDFS)[5], and the MapReduce framework[10]. HDFS is inspired by the Google File System (GFS)[11]. To store huge amounts of data, HDFS divides the data into small blocks and disseminates them across the cluster nodes as uniformly as possible. Data blocks are replicated for reliability.
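Block size and replication are ordinary HDFS settings. For concreteness, a minimal hdfs-site.xml might look as follows (our illustration, with property names as used in Hadoop releases of that era and values that were common defaults, not settings taken from the paper):

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>            <!-- keep three copies of every block -->
  </property>
  <property>
    <name>dfs.block.size</name>
    <value>67108864</value>     <!-- 64 MB blocks -->
  </property>
</configuration>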
the graph query processor. In Section 3, we describe our MapReduce is a programming model and a framework
demonstration system that we developed to store and pro- based on shared-nothing architecture[17]. In shared-nothing
cess large-scale graph data. In Section 4, we discuss the architecture each node does not communicate one another
applicability of our system and some directions for further during computation. Therefore, it does not incur overheads
improvement. caused by managing distributed processing. Thus, the shared-
nothing architecture provides the ability to have linear scal-
2. SYSTEM OVERVIEW ability. Based on shared-nothing, MapReduce provides a
means to divide a big task into many small tasks. The
This section gives an overview of SPIDER. Our system MapReduce computation basically has two steps, Map and
SPIDER is developed on Hadoop[2] distributed system frame- Reduce. In each step, each cluster node is assigned to either
work. Using Hadoop, the SPIDER system stores the RDF mapper or reducer. A mapper performs map function on
data on the distributed file system (DFS), and processes input key-value pairs extracted from raw input data, and
subgraph queries in a parallel/distributed manner with the outputs a new pair consisting of a key and a set of values
MapReduce technique. For these, our system has two main corresponding to the key. Then, reducers aggregate input
components, which are (1) the graph loader and (2) the sub- pairs passed from mappers and output final key-value pairs.
graph query processor. We will disscuss them in the follow-
ing subsections. This design allows our system to deal with
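To make the map/reduce contract concrete, here is a minimal word-count sketch against Hadoop's Java MapReduce API (our illustration; the paper itself contains no code):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map step: emit one (token, 1) pair per token of each input line.
class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  @Override
  protected void map(LongWritable offset, Text line, Context ctx)
      throws IOException, InterruptedException {
    for (String token : line.toString().split("\\s+")) {
      if (!token.isEmpty()) ctx.write(new Text(token), ONE);
    }
  }
}

// Reduce step: the framework groups values by key; sum them per key.
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> ones, Context ctx)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable one : ones) sum += one.get();
    ctx.write(key, new IntWritable(sum));
  }
}

The same shape carries over to SPIDER's query processor (Section 2.4), where the mapper's output key is a query identifier rather than a token.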
2.2 RDF Data Model

Before we discuss the main components, we describe the data model on which our system is based. RDF (Resource Description Framework)[15] is a graph data model for representing labeled directed graphs, and it has several instance document types, such as XML, TURTLE, N3, and N-Triples. Although RDF was originally designed to represent semantic web data, its flexibility and applicability have led to wide use in other areas, such as social networks[6, 9] and the life sciences[3].

In the RDF data model[15], a graph is a set of triples, each of which consists of a subject, a predicate, and an object, usually written < S, P, O >. A subject can be either a URI (Uniform Resource Identifier) or a blank node (an anonymous node that groups one or more RDF statements). A predicate must be a URI, and an object can be a URI or a literal string. In RDF, a URI serves as a unique identifier. A subject (or an object) corresponds to a vertex in the graph and a predicate corresponds to an edge, so a triple < S, P, O > is a directed arc in the graph.
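As a concrete (if simplified) rendering of this model, a triple and a parser for one line of N-Triples-style input might look like the following; this is our sketch, not code from the paper, and the regex ignores N-Triples edge cases such as escaped quotes:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Triple {
  final String subject;   // URI or blank node id
  final String predicate; // always a URI
  final String object;    // URI or literal string
  Triple(String s, String p, String o) { subject = s; predicate = p; object = o; }

  // Parse one simplified N-Triples line: "<s> <p> <o or literal> ."
  private static final Pattern LINE =
      Pattern.compile("(\\S+)\\s+(\\S+)\\s+(.+?)\\s*\\.\\s*$");

  static Triple parse(String line) {
    Matcher m = LINE.matcher(line);
    if (!m.matches()) throw new IllegalArgumentException("not a triple: " + line);
    return new Triple(m.group(1), m.group(2), m.group(3));
  }
}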
2.3 Graph Data Loader

The graph data loader (GDL) reads RDF graph data and stores it into the DFS. GDL can load RDF graph data either from a local file system or from the DFS; when it loads RDF data from the DFS, loading can be carried out in a distributed, parallel manner using MapReduce.

Initially, GDL disseminates split information to the cluster nodes. While loading, all cluster nodes read the RDF data on their local storage and save it into the DFS. Because this loading process is performed in a distributed manner, it is appropriate for huge RDF datasets.

When storing triples to the DFS, GDL writes each triple into a fixed-length record; we call this storage method the triple record. It is a simple and efficient way to manage large numbers of triples in the DFS because it fits well with the mechanisms that HDFS provides. We expect the graph data loader to read and place petabyte-scale RDF graphs.
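A sketch of the fixed-length record idea follows; the field widths and encoding are our assumptions, since the paper only states that records have a fixed length:

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class TripleRecord {
  static final int FIELD = 256;       // bytes per field (assumed width)
  static final int SIZE  = 3 * FIELD; // fixed record length

  // Encode a triple as one fixed-length record: each field is padded
  // (or truncated) to FIELD bytes, so record i starts at offset i*SIZE.
  static byte[] encode(String s, String p, String o) {
    ByteBuffer buf = ByteBuffer.allocate(SIZE);
    for (String field : new String[] {s, p, o}) {
      byte[] bytes = field.getBytes(StandardCharsets.UTF_8);
      byte[] padded = new byte[FIELD];
      System.arraycopy(bytes, 0, padded, 0, Math.min(bytes.length, FIELD));
      buf.put(padded);
    }
    return buf.array();
  }
}

Because every record has the same length, a task handed an arbitrary byte range of a file can locate the next record boundary by simple offset arithmetic, which is presumably the HDFS-friendly property the authors have in mind.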
2.4 Graph Query Processor

The graph query processor (GQP) reads graph data and searches for subgraphs matched to the query graphs produced by the query pre-processor. Query processing is carried out in two phases, the partial matching phase and the full matching phase.

• Partial matching phase. Each node reads its locally stored graph data sequentially and tries to find triples matched to any condition of the given queries.

• Full matching phase. Each node aggregates partially matched triples into connected graphs and produces final results that are fully matched with the given queries.

We implement GQP using the MapReduce framework. The partial matching phase is performed in the map step: each mapper finds the triples matched with any query and passes the results to the reducers. The reducers perform the full matching phase: each reducer is assigned an individual query graph, accumulates the triples passed by the mappers, and joins them on their identifiers (i.e., URIs). Eventually, the joined triples are organized into connected graphs, from which the reducers retrieve the subgraphs fully matched to the query graphs.
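The following sketch reconstructs this pipeline from the prose above; it is our illustration, not the authors' code. QueryPattern, QueryRegistry, and GraphJoiner are hypothetical helpers, and how registered queries reach each mapper (here, via the job configuration) is an assumption:

import java.io.IOException;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

interface QueryPattern {            // hypothetical: one registered query
  String id();
  boolean matchesAnyTriplePattern(String triple);
}

class PartialMatchMapper extends Mapper<LongWritable, Text, Text, Text> {
  private List<QueryPattern> queries;

  @Override
  protected void setup(Context ctx) {
    // Hypothetical: deserialize the installed queries from the job
    // configuration; the paper does not say how queries are distributed.
    queries = QueryRegistry.fromConfiguration(ctx.getConfiguration());
  }

  // Partial matching: emit each local triple that matches any pattern of
  // any query, keyed by query id so one reducer sees one query's triples.
  @Override
  protected void map(LongWritable offset, Text triple, Context ctx)
      throws IOException, InterruptedException {
    for (QueryPattern q : queries) {
      if (q.matchesAnyTriplePattern(triple.toString())) {
        ctx.write(new Text(q.id()), triple);
      }
    }
  }
}

class FullMatchReducer extends Reducer<Text, Text, Text, Text> {
  // Full matching: join the candidate triples on shared URIs into
  // connected graphs; emit only subgraphs that match the whole query.
  @Override
  protected void reduce(Text queryId, Iterable<Text> triples, Context ctx)
      throws IOException, InterruptedException {
    GraphJoiner joiner = new GraphJoiner();  // hypothetical join helper
    for (Text t : triples) joiner.add(t.toString());
    for (String subgraph : joiner.subgraphsMatching(queryId.toString())) {
      ctx.write(queryId, new Text(subgraph));
    }
  }
}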
[Figure 2: Query Processing carried out with MapReduce. In the map process, nodes N1, N2, and N3 each run partial matching of queries Q1, Q2, and Q3 against their local storage; in the reduce process, each node runs complete matching for one query over the partially matched graphs.]

Fig. 2 shows an example of query processing with MapReduce. In this example, a cluster consists of three nodes, N1, N2, and N3. When three queries Q1, Q2, and Q3 are put into the GQP, map operations start the partial matching phase: mappers find triples partially matched to any of the queries and pass them to the reducers. Nodes N1, N2, and N3 then perform reduce operations for queries Q1, Q2, and Q3, respectively, each connecting partially matched triples and validating them against its query.

Under the shared-nothing architecture, our system is highly scalable. In each phase of query processing there is no communication between mappers and reducers, so the performance of our system increases linearly with the number of cluster nodes.

3. DEMONSTRATION SCENARIO

SPIDER is developed under Hadoop, a distributed system framework. The Hadoop cluster that our demonstration runs on is composed of 10 nodes connected by a 100 Mbps LAN.

SPIDER provides a web-based user interface (UI) with two parts, the graph loader UI and the graph query UI. The graph loader UI enables users to import an RDF dataset from a local file system or the DFS and to give an instance name to the loaded graph data; when users later describe a graph query, they choose which instance is queried. When a user specifies both the import directory and the instance name, the graph loader UI passes these parameters to the graph data loader, which decides whether to run with MapReduce depending on whether the import directory is local or on the distributed file system. The graph query UI allows users to submit subgraph patterns defined in the SPARQL query language[16], a W3C recommendation. When a user submits a query, the UI passes it to the graph query processor, which installs the query across all cluster nodes. Query processing then starts with a map operation on each node: each mapper finds triples matched to any of the given query patterns, and the retrieved triples are passed to a reducer. The reducer joins the triples passed from the mappers and then checks the joined triples (i.e., a subgraph matched to the given query).
If the subgraph is matched, the query processor outputs the retrieved subgraph. Finally, the graph query UI presents the result of the given query; the user can choose to see the result as a visualized graph or to download it as a text file. This UI is shown in Fig. 3. For the graph visualization we employ prefuse[12], a toolkit for interactive information visualization.

[Figure 3: Web-based User Interface enables users to submit SPARQL and shows visualized results.]

As mentioned above, our system supports some features of SPARQL. SPARQL provides a means to define various subgraph patterns to be retrieved. Among these, our system provides the Basic Graph Pattern and the Group Graph Pattern. Basic graph patterns are filters that find the set of triples matching a pattern; group graph patterns are filters that find the set of graphs matched by all patterns in the group. For instance, a group graph pattern looks as follows:

select ?x
where {
  ?x type Book .
  ?x hasAuthor "Donald E. Knuth" .
  ?x year 1998 .
}

In this SPARQL example, the graph patterns are defined in the 'where' clause. The query finds the subjects that satisfy all of the following conditions: the type of the subject is Book, the subject has the author "Donald E. Knuth", and it was published in the year 1998. These two types of graph patterns are frequently used in many applications.
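For contrast, a basic graph pattern can also stand alone. The following query (our illustrative example, written in the same abbreviated, prefix-free style as the paper's) matches a single triple pattern and returns every subject together with its author:

select ?x ?y
where {
  ?x hasAuthor ?y .
}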
4. DISCUSSION

SPIDER focuses on scalability for storing and querying large-scale RDF graph datasets. Our system achieves this goal by making use of the DFS and the MapReduce framework provided by Hadoop; hence, it can be easily extended by adding more cluster nodes when more storage and computation power are needed. In addition, SPIDER takes advantage of the shared-nothing architecture and thus scales linearly as cluster nodes are added. Being scalable, SPIDER can be widely used in many areas where large-scale graph processing is required.

There still remain many research issues for improving SPIDER:

• Our system is still insufficient for on-line, user-facing services, because Hadoop takes a long time to initialize tasks. To mitigate the long response time, we will need caching techniques for subgraph queries.

• Index structures for various queries, such as path queries and subgraph queries, are required. These index structures have to reflect the distributed file system and the MapReduce framework.

• A programmable graph query language (PGQL) for analyzing graph datasets is required, to provide both an abstraction over the distributed system and expressive power. In addition, a PGQL program composed of a sequence of graph operations will give the system more opportunities to optimize complex queries, because the system can see the whole sequence of operations in advance.

5. REFERENCES
[1] Amazon. Amazon Elastic Compute Cloud. aws.amazon.com/ec2.
[2] Apache Software Foundation. Hadoop, 2006. http://hadoop.apache.org/core.
[3] A. Bairoch, R. Apweiler, C. Wu, W. Barker, B. Boeckmann, S. Ferro, E. Gasteiger, H. Huang, R. Lopez, M. Magrane, et al. The Universal Protein Resource (UniProt). Nucleic Acids Research, 33:D154, 2005.
[4] H. M. Berman, J. D. Westbrook, Z. Feng, G. Gilliland, T. N. Bhat, H. Weissig, I. N. Shindyalov, and P. E. Bourne. The Protein Data Bank. Nucleic Acids Research, 28(1):235–242, 2000.
[5] D. Borthakur. The Hadoop Distributed File System: Architecture and Design. Hadoop Wiki, 2008.
[6] D. Brickley and L. Miller. FOAF Vocabulary Specification. Namespace Document, 3, 2005.
[7] E. I. Chong, S. Das, G. Eadon, and J. Srinivasan. An efficient SQL-based RDF querying scheme. In VLDB, pages 1216–1227, 2005.
[8] G. Cochrane, R. Akhtar, J. Bonfield, L. Bower, F. Demiralp, N. Faruque, R. Gibson, G. Hoad, T. Hubbard, C. Hunter, et al. Petabyte-scale innovations at the European Nucleotide Archive. Nucleic Acids Research.
[9] M. P. Consens. Managing linked data on the web: The LinkedMDB showcase. In LA-WEB, pages 1–2, 2008.
[10] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI, pages 137–150, 2004.
[11] S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google File System. In SOSP, pages 29–43, 2003.
[12] J. Heer, S. Card, and J. Landay. Prefuse: A toolkit for interactive information visualization. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 421–430. ACM, New York, NY, USA, 2005.
[13] M. Kanehisa and S. Goto. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Research, 28(1):27–30, 2000.
[14] M. Kawamoto. HDD interface technologies. Fujitsu Scientific and Technical Journal, 42(1):78–92, 2006.
[15] O. Lassila, R. Swick, et al. Resource Description Framework (RDF) Model and Syntax Specification. 1999.
[16] E. Prud'hommeaux, A. Seaborne, et al. SPARQL Query Language for RDF. W3C Working Draft, 4, 2006.
[17] M. Stonebraker. The case for shared nothing. IEEE Database Eng. Bull., 9(1):4–9, 1986.
[18] Y. Tian, J. M. Patel, V. Nair, S. Martini, and M. Kretzler. Periscope/GQ: A graph querying toolkit. PVLDB, 1(2):1404–1407, 2008.
[19] K. Wilkinson, C. Sayers, H. A. Kuno, and D. Reynolds. Efficient RDF storage and retrieval in Jena2. In SWDB, pages 131–150, 2003.
