
Multi-source Heterogeneous Hakka Culture Heritage Data Management based on MongoDB

Abstract: Hakka culture is an important part of the culture of southern China. Hakka culture data
collected through digital and information technology have diverse types and formats, are largely
unstructured, and are huge in volume, which makes them difficult to manage and use. NoSQL
technology offers high availability and high scalability, and provides new methods for storing and
managing unstructured Hakka culture data. In this paper, MongoDB, a document-oriented NoSQL
database, is introduced into the management of unstructured Hakka culture data. The study
focuses on the management strategy and data service method for multi-source heterogeneous
Hakka culture data. A storage strategy for multi-source heterogeneous Hakka culture data based
on MongoDB is proposed and established. The sharding, indexing and query mechanisms for the
Hakka culture database are explored. Taking Western Fujian Hakka culture data as an example, a
prototype Hakka culture data management system was constructed.

I. INTRODUCTION

As an important part of the culture of southern China, Hakka culture [1] is a valuable cultural asset.
However, due to a series of unavoidable natural and human factors, such as environmental change,
social development and a single mode of inheritance, Hakka culture has suffered varying degrees
of damage or permanent loss, which seriously affects the long-term retention and inheritance of
this cultural heritage.

Digital and information technology provides new ideas for the protection of Hakka cultural
heritage. An effective way to retain and pass on cultural heritage is to digitize it using
three-dimensional ground laser scanning, 360 panoramic cameras, satellite and low-altitude remote
sensing, GPS, digital scanning devices, and digital audio and video equipment, and to manage the
resulting data in a database.

However, Hakka culture heritage data, including video, audio, image, point cloud, DEM,
three-dimensional model, vector and raster data, come in a variety of types and formats, are
unstructured, and are massive in volume. How to manage and exploit these massive multi-source
heterogeneous Hakka cultural heritage data has become an important problem to be solved.

Relational databases are mainly used to manage highly structured data, and it is difficult for them
to manage and use massive heterogeneous Hakka culture data efficiently. NoSQL technology [2]
drops the normal-form constraints of the relational model, can store data of different types, and is
highly scalable. MongoDB is a cross-platform document-oriented NoSQL database. It eschews the
traditional table-based relational database structure in favor of JSON-like documents with dynamic
schemas (stored in a format called BSON), which makes the integration of data easier and faster in
certain types of applications. Therefore, MongoDB was introduced to explore organization and
management strategies for Hakka culture data and to establish a data service system.

The rest of this paper is organized as follows. Section 2 describes related work on culture data
management and NoSQL techniques, and Section 3 presents the datasets and their classification.
Section 4 describes the storage strategy for Hakka culture heritage data. Section 5 explains the
sharding, indexing, spatial indexing and query mechanisms. Section 6 presents the implementation
of the Hakka culture heritage data management system. The discussion and results are given in
Section 7.

II. RELATED WORKS

A. Culture data management

Cultural heritage is a valuable asset for every country and nation. In order to strengthen the
protection of cultural heritage, databases for specific cultural areas have been established.
Qi et al. (2014) [3] constructed a Zhuang costume culture database to manage the cultural
profile, patterns, accessories and production process of Zhuang costumes. He et al. (2010) [4]
constructed an academy culture database for managing the basic information, character
information and literature of academy culture. Fang et al. (2009) [5] established a Guqin music
database for managing Guqin theory resources, music resources and associated cultural resources.
Many scholars have proposed different solutions for different types of culture data management.
Tong (2014) [6] proposed a uniform data exchange interface to manage heterogeneous Sibo
language and culture data. Li et al. (2013) [7] constructed an XML data storage model for the
digitized representation, content retrieval and cultural calculation of paper-cut patterns. These
cultural data management systems mainly focus on building a culture database, and pay less
attention to the efficient management and integration of multi-source heterogeneous data.

B. MongoDB: a document-oriented database

From the data model point of view, NoSQL databases can be divided into the following categories:
key-value, column-oriented, document-oriented and graph. The logical structure of a
document-oriented database is a hierarchy that consists of three parts: documents, collections
and databases. A document is equivalent to a row in a relational database. A collection is
composed of multiple documents and is equivalent to a table. Multiple collections, grouped
together logically, form a database.

MongoDB is a typical representative of document-oriented storage systems. MongoDB makes use
of BSON, a JSON-like data format, to store data dynamically. It supports a variety of basic types,
such as number, string, date and array, and supports nested documents. Some of the features of
MongoDB include [8]:

(1) Indexing and querying. Any field, including within arrays and embedded documents, can be
indexed. MongoDB supports field, range queries and regular expression searches. Queries can
return specific fields of documents and also include user-defined JavaScript functions.

(2) Replication. MongoDB provides high availability with replica sets. Each replica set consists
of two or more copies of the data. Each copy may act in the role of primary or secondary replica at
any time. The primary replica performs all reads and writes by default. Secondary replicas maintain
a copy of the data of the primary using built-in replication.

(3) Load balancing. MongoDB can run over multiple servers, balancing the load and/or duplicating
data. MongoDB scales horizontally using sharding. A shard is a master with one or more slaves.
The data is split into ranges based on the shard key and distributed across multiple shards. The
user chooses a shard key, which determines how the data in a collection will be distributed.

(4) File storage (GridFS). MongoDB can be used as a file system, called the Grid File System.
MongoDB exposes functions for file manipulation and content to developers. Instead of storing a
file in a single document, GridFS divides a file into parts, or chunks, and stores each of those
chunks as a separate document.

In a multi-machine MongoDB system, files can be distributed and copied multiple times between
machines transparently, which effectively creates a load-balanced and fault-tolerant system.
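
The chunking idea behind GridFS can be illustrated with a short, simplified Python sketch. It is not the driver's implementation: the helper names `split_into_chunks` and `reassemble` and the sample file identifier are hypothetical. The chunk document fields `files_id`, `n` and `data` and the 255 KiB default chunk size follow the GridFS convention.

```python
# Simplified illustration of GridFS-style chunking (not the real driver code).
DEFAULT_CHUNK_SIZE = 255 * 1024  # GridFS default chunk size in bytes


def split_into_chunks(file_id, data, chunk_size=DEFAULT_CHUNK_SIZE):
    """Return one chunk document per slice of `data`."""
    chunks = []
    for n, start in enumerate(range(0, len(data), chunk_size)):
        chunks.append({
            "files_id": file_id,  # links the chunk to its file metadata document
            "n": n,               # position of the chunk within the file
            "data": data[start:start + chunk_size],
        })
    return chunks


def reassemble(chunks):
    """Rebuild the original bytes by concatenating chunks in order of `n`."""
    ordered = sorted(chunks, key=lambda chunk: chunk["n"])
    return b"".join(chunk["data"] for chunk in ordered)


payload = bytes(range(256)) * 10  # 2560 bytes of sample data
parts = split_into_chunks("tulou_scan_01", payload, chunk_size=1024)
```

Because every chunk is an ordinary document, chunks can be distributed and replicated like any other data in the cluster.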

III. DATASETS AND CLASSIFICATION

Hakka culture is the sum of the tangible and spiritual culture created by the Hakka people,
including language, theater, music, dance, crafts, folklore, architecture, food and so on. Hakka
culture data are digitized Hakka culture resources collected through a variety of digital
technologies. We primarily use ground laser scanning, 360 panoramic cameras, geographical
information systems, satellite and low-altitude remote sensing, and digital media technology to
collect western Fujian Hakka culture data. The collected data mainly include:

(1) Hakka cultural map data, mainly including Hakka culture heritage distribution data (such as
the distribution of Tulou, cultural relics and cultural heritage villages), administrative division
data, natural geographic data of western Fujian, aerial data for the Tulou cultural folk village,
Hakka culture thematic data and Hakka migration route data.

(2) Hakka culture multimedia data, mainly including Hakka cultural data represented as text,
images, audio and video: images of intangible cultural heritage shot from multiple angles, with
accompanying text, audio and video; tangible cultural heritage data with text, audio and video;
and Hakka culture text and statistical data.

(3) Hakka culture point cloud data, mainly including point cloud data of tangible cultural heritage
acquired through ground laser radar scanning. These data contain more cultural details, which
facilitates an all-round display of Hakka culture.

To address the diverse types and complex structure of Hakka culture data, we investigated and
developed a data classification system, shown in Table 1. According to the content the data
express, Hakka culture data are divided into two categories: Hakka cultural heritage data and
infrastructure data. Hakka cultural heritage data are further divided into tangible cultural
heritage data and intangible cultural heritage data.

Hakka tangible cultural heritage data cover Hakka houses (Tulou), ancient bridges, temples,
ancestral temples, cultural relics and so on. The data types include text descriptions, pictures,
video, audio, 360 panoramic documents, three-dimensional models and location information.

Intangible cultural heritage covers historical events, religion and folk custom activities,
industrial and agricultural production, language, culture, art and food culture. Intangible
cultural heritage data types include text, images, video, audio, spatial location and so on.

Hakka cultural infrastructure data contain satellite imagery, Hakka cultural maps (for example,
Hakka population distribution and administrative division maps), raster data, and vector map data
for visitor centers, parking lots, bathrooms, hotels, restaurants, shops, travel agencies, gas
stations, bus stations, public transport routes, medical facilities (hospitals, clinics, pharmacies),
public safety agencies, emergency shelters, native place of business (cellar), entertainment,
country fairs and markets.

IV. STORAGE STRATEGY

According to the above analysis, MongoDB is a schema-free database, and each document within a
collection may have a different structure. Unlike in relational databases, the structure of the data
can be modified easily. Nevertheless, in order to improve the operability of the database, a
structure is usually pre-defined when designing a MongoDB database.

Before designing the storage strategy, the design schema of the Hakka culture database was
pre-defined based on MongoDB to ensure the rationality and reliability of the storage strategy.
This article borrows terms from relational database schema design to describe the design
principles of the storage mode. Since MongoDB supports arrays and embedded documents, the
design principles of the Hakka culture data storage schema based on MongoDB are as follows:

(1) Data items with multi-valued dependencies are stored in an array.
(2) For data items with partial functional dependencies, the fully functionally dependent data
should be extracted and placed in embedded documents in order to eliminate the partial
functional dependency. For example, consider a data item (X, Y, Z, K) with the functional
dependencies X → Z and (X, Y) → K.
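
One possible reading of principle (2) can be sketched in Python. This is an illustrative assumption, not the paper's schema: rows of (X, Y, Z, K) are grouped by X, Z is stored once per X because it depends on X alone, and each (Y, K) pair, which depends on the full key, becomes an embedded document in an array. The function name `nest_by_partial_key` and the field name `items` are hypothetical.

```python
def nest_by_partial_key(rows):
    """Group flat (X, Y, Z, K) rows into one document per X value.

    Z depends only on X, so it is stored once at the top level.
    K depends on (X, Y), so each (Y, K) pair is an embedded document.
    """
    documents = {}
    for x, y, z, k in rows:
        doc = documents.setdefault(x, {"X": x, "Z": z, "items": []})
        doc["items"].append({"Y": y, "K": k})
    return list(documents.values())


flat_rows = [
    ("x1", "y1", "z1", "k11"),
    ("x1", "y2", "z1", "k12"),  # z1 is repeated in the flat form
    ("x2", "y1", "z2", "k21"),
]
nested = nest_by_partial_key(flat_rows)
```

In the nested form each Z value is stored once per X, which removes the redundancy caused by the partial dependency.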

According to the Hakka culture data classification system, the storage strategy is developed from
three aspects: tangible cultural heritage data, intangible cultural heritage data and
infrastructure data.

A. Hakka tangible culture heritage data structure

The data structure of Hakka tangible culture heritage data includes the unique identifier, name,
category, introduction, formation time, address, spatial location, picture, video, audio, 3D model
and 360 panorama (shown in Figure 1). Key descriptive fields, including the name, built time,
category and address of the tangible cultural heritage, are extracted and stored in the document
database, where they are used to retrieve the associated heritage item. The other fields, including
picture, video, audio, 3D model and 360 panorama, are stored together in GridFS, and the
document only records the corresponding file names. All data for the same tangible cultural
heritage item are placed in one document, so the multi-table queries of relational databases are
avoided; the absence of cross-table queries is convenient for distributed data storage.
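
A tangible heritage document along these lines might look like the following Python sketch. All field names and sample values are hypothetical placeholders chosen for illustration; only the list of fields and the idea of recording file names for media kept in GridFS come from the text.

```python
MEDIA_KINDS = ("picture", "video", "audio", "model_3d", "panorama_360")


def make_tangible_document(doc_id, name, category, introduction, built_time,
                           address, lon, lat, media_files):
    """Build one document holding every field of a tangible heritage item."""
    document = {
        "_id": doc_id,
        "name": name,
        "category": category,
        "introduction": introduction,
        "built_time": built_time,
        "address": address,
        # GeoJSON point: longitude first, then latitude
        "location": {"type": "Point", "coordinates": [lon, lat]},
    }
    # Media files live in GridFS; the document stores only their file names.
    for kind in MEDIA_KINDS:
        document[kind] = list(media_files.get(kind, []))
    return document


example = make_tangible_document(
    doc_id="tch-0001", name="Example Tulou", category="tulou",
    introduction="Placeholder description.", built_time="unknown",
    address="Placeholder address", lon=117.01, lat=24.66,
    media_files={"picture": ["front.jpg", "yard.jpg"], "video": ["tour.mp4"]},
)
```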

B. Hakka intangible culture heritage data management

Intangible cultural heritage shares many fields with tangible cultural heritage, such as name,
category, spatial location, time, address, picture, video and audio, and has additional typical
fields, such as historical event trend data and statistical data (shown in Figure 2).

Trend data express the occurrence, development and results of historical events. This field is an
array of embedded documents, with each action stored as one document. An action document
contains the action name, time of occurrence, starting point, ending point, starting point
coordinates and ending point coordinates.

Statistical data in the intangible cultural heritage dataset are also stored as an embedded
document, which contains the name of the statistical data, the unit and the data items. The data
items field is an array of embedded documents, each of which has a name and data content.
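
The nesting described above can be sketched in Python as follows. Field names such as `trend`, `statistics`, `items` and `content`, and all sample values, are hypothetical placeholders; the sketch only mirrors the described structure of arrays of embedded documents.

```python
def make_action(name, time, start, end, start_xy, end_xy):
    """One step of a historical event, stored as an embedded document."""
    return {
        "name": name,
        "time": time,
        "start": start,
        "end": end,
        "start_coordinates": list(start_xy),
        "end_coordinates": list(end_xy),
    }


def make_statistics(name, unit, items):
    """Statistical data: a name, a unit and an array of embedded data items."""
    return {
        "name": name,
        "unit": unit,
        "items": [{"name": item_name, "content": value} for item_name, value in items],
    }


def make_intangible_document(doc_id, name, category, trend, statistics):
    """Intangible heritage document with a trend array and embedded statistics."""
    return {
        "_id": doc_id,
        "name": name,
        "category": category,
        "trend": list(trend),
        "statistics": statistics,
    }


record = make_intangible_document(
    doc_id="ich-0001", name="Example event", category="historical event",
    trend=[
        make_action("stage 1", "t1", "place A", "place B", (0.0, 0.0), (1.0, 1.0)),
        make_action("stage 2", "t2", "place B", "place C", (1.0, 1.0), (2.0, 0.5)),
    ],
    statistics=make_statistics("example count", "persons", [("group 1", 10), ("group 2", 20)]),
)
```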

C. Hakka Culture Infrastructure Data

Hakka cultural infrastructure data include two kinds of datasets: raster datasets and vector
datasets. The Hakka culture infrastructure data structure is shown in Figure 3.

Raster infrastructure data include satellite images and the raster data of Hakka culture thematic
maps. Each raster dataset is stored in two parts: the raster data collection and the documents in
it. Each document is composed of a tile's unique identifier, tile level and tile binary data.

For a vector dataset, the data of each geographic feature type are stored in a collection. Each
document is composed of a unique identifier and the attribute data and spatial data of a feature.
The attribute data are organized as an embedded document.
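
The two document shapes can be sketched in Python as follows. The field names (`level`, `data`, `attributes`, `geometry`) and the sample values are hypothetical; the sketch only reflects the described composition of tile and feature documents.

```python
def make_tile_document(tile_id, level, data):
    """Raster tile: unique identifier, tile level and the tile's binary content."""
    return {"_id": tile_id, "level": level, "data": bytes(data)}


def make_feature_document(feature_id, attributes, geometry):
    """Vector feature: identifier, embedded attribute document and spatial data."""
    return {
        "_id": feature_id,
        "attributes": dict(attributes),  # embedded document
        "geometry": geometry,            # GeoJSON geometry
    }


tile = make_tile_document("L3_R2_C5", 3, b"\x89PNG placeholder bytes")
parking = make_feature_document(
    "parking-001",
    {"name": "Example parking lot", "capacity": 40},
    {"type": "Point", "coordinates": [117.01, 24.66]},
)
```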

D. Storage instance of Hakka culture heritage data

Figure 4 illustrates the Hakka cultural heritage dataset. As shown in the left part of Figure 4,
there are three collections, which correspond to the Hakka tangible, intangible and infrastructure
datasets. The right part is a data instance of tangible cultural heritage; each record is a
document in JSON format.

V. DATABASE SHARD, INDEX AND QUERY

A. Database Shard and Distributed Database Architecture

Sharding means that data can be split across multiple database instances, which makes each data
collection smaller, which in turn makes the indexes smaller and the Create, Read, Update, Delete
(CRUD) operations faster. According to the characteristics of MongoDB and the actual situation of
Hakka culture data, a cluster of shards and replicas was used as the distributed database
architecture. The architecture is shown in Figure 5.

There are three kinds of servers: data servers, configuration servers and routing servers. The
Hakka culture database uses master-slave backup. Hakka culture data are stored on data servers
as Shard1, Shard2 and Shard3. The configuration servers store the configuration information of
the server cluster and back each other up. Through the routing server and the configuration
information from the configuration servers, users can find which server holds the data. Users
simply send a command to the routing server to operate on the data shard servers.
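
The role of the routing server can be illustrated with a toy Python sketch. It is a conceptual simplification, not MongoDB's routing code: the chunk table, its string boundaries and the function `route` are hypothetical. The configuration metadata is modeled as a sorted list of chunk lower bounds, and the router finds the shard whose range contains a given shard-key value.

```python
from bisect import bisect_right

# Hypothetical chunk table, as a configuration server might describe it:
# (inclusive lower bound of the shard-key range, shard name), sorted by bound.
CHUNK_TABLE = [
    ("", "shard1"),
    ("h", "shard2"),
    ("q", "shard3"),
]


def route(key, chunk_table=CHUNK_TABLE):
    """Return the shard whose key range contains `key`."""
    bounds = [lower for lower, _ in chunk_table]
    index = bisect_right(bounds, key) - 1  # last chunk whose lower bound <= key
    return chunk_table[index][1]


targets = {key: route(key) for key in ("bridge", "h", "temple")}
```

The client never needs the chunk table itself; it sends the operation to the router, which forwards it to the shard found by the lookup.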

B. Index, Spatial Index and query


Indexes are used to speed up queries, but too many indexes in a database reduce management
efficiency. In addition, the Hakka cultural database stores large amounts of spatial data, so
spatial indexes are discussed in order to respond quickly to spatial queries submitted by users.
Sharding determines the data distribution in the cluster, and the shard key also exists as an
index. Thus, frequently queried fields are usually chosen as the shard key.

Most queries in the Hakka culture database are by the category and name of cultural heritage. If
we take the category of Hakka culture data as the shard key ({category: 1}), the data chunks are
split according to the type of data. A problem may then arise: the number of categories of Hakka
cultural heritage is limited and will not increase substantially, which means that the shard key has
only N distinct values and the database can be divided into at most N chunks. As a consequence,
when the data of certain categories increase dramatically, it is difficult to improve service
efficiency by splitting the data and adding servers.

Every cultural heritage item has a different name. If we take the name of Hakka culture data as the
shard key ({name: 1}), the data chunks can be split arbitrarily and there is no concern about
running out of split points. However, MongoDB uses a memory mapping mechanism that keeps
frequently used data blocks in memory to reduce I/O operations. With this key, the data within a
block are unrelated by category, which means that data of the same category may be spread over
different blocks and thus may not be in memory at the same time. When a user queries by
category, a number of different blocks need to be accessed and read from disk into memory, which
reduces operational efficiency.

Based on the above analysis, using either the category or the name of Hakka culture data alone as
the shard key has pros and cons. In order to split the data properly while keeping data of the same
category in the same block, we use the combination of category and name, {category: 1, name: 1},
as the shard key.
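
The trade-off can be checked with a small Python sketch on synthetic records. The helper `max_chunks` and the sample data are hypothetical; the sketch relies on the fact that a chunk cannot be split below a single distinct shard-key value, so the number of distinct key values bounds the number of chunks.

```python
def max_chunks(records, shard_key_fields):
    """Upper bound on the number of chunks: the count of distinct shard-key values."""
    return len({tuple(record[field] for field in shard_key_fields) for record in records})


# Synthetic data: 3 categories with 4 uniquely named items each.
records = [
    {"category": category, "name": f"{category}-{i}"}
    for category in ("bridge", "temple", "tulou")
    for i in range(4)
]

by_category = max_chunks(records, ["category"])          # limited by the category count
by_compound = max_chunks(records, ["category", "name"])  # one value per item

# Ordering by the compound key keeps items of one category next to each other.
ordered = sorted(records, key=lambda record: (record["category"], record["name"]))
category_sequence = [record["category"] for record in ordered]
```

With the compound key the data can keep splitting as a category grows, while items of the same category remain contiguous in key order.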

For example, chunk splitting is applied to the tangible cultural heritage dataset (TCHeritage) of the
Hakka culture database (CHeritage) by sharding that collection on the key {category: 1, name: 1}.
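
A minimal sketch of the corresponding administrative commands, written as Python dictionaries, is given below. `enableSharding` and `shardCollection` are MongoDB administrative commands; the helper `sharding_commands` is hypothetical, and sending the commands to a cluster (for instance with pymongo's `client.admin.command(...)` against a routing server) is assumed rather than shown.

```python
def sharding_commands(database, collection, key):
    """Build the command documents that shard `database.collection` on `key`."""
    namespace = f"{database}.{collection}"
    return [
        {"enableSharding": database},
        # Field order in `key` matters: category first, then name.
        {"shardCollection": namespace, "key": dict(key)},
    ]


commands = sharding_commands("CHeritage", "TCHeritage", {"category": 1, "name": 1})

# Assumed usage with pymongo and a running router (not executed here):
#   for command in commands:
#       client.admin.command(command)
```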

Because the shard key {category: 1, name: 1} has been defined, an index on {category: 1, name: 1}
exists, which also serves queries on {category: 1} alone. This facilitates queries that give a
category, or a category and a name. However, it does not help queries that give only a name. To
speed up name-only queries, a separate index on the name field was established for the tangible
cultural heritage dataset.

After the name field index is created, querying by name becomes very simple. The user can execute
a database operation by giving the dataset and the query criteria, for example a query on the
tangible cultural heritage dataset for an item with a given name.
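
A name query of this kind can be sketched in Python as below. The index specification and filters are plain data in the form pymongo accepts; the sample documents, the values searched for and the tiny `matches` helper are hypothetical and exist only so the filters can be exercised without a server.

```python
# Single-field ascending index on `name`, in pymongo's list-of-pairs form.
NAME_INDEX = [("name", 1)]


def name_query(name):
    """Filter for a name-only lookup (served by the name index)."""
    return {"name": name}


def category_and_name_query(category, name):
    """Filter matching the shard key, so a router can target a single shard."""
    return {"category": category, "name": name}


def matches(document, query):
    """Equality-only matcher used here in place of a database."""
    return all(document.get(field) == value for field, value in query.items())


sample = [
    {"category": "tulou", "name": "Example Tulou"},
    {"category": "bridge", "name": "Example Bridge"},
]
found = [doc for doc in sample if matches(doc, name_query("Example Bridge"))]

# Assumed usage with pymongo (not executed here):
#   db.TCHeritage.create_index(NAME_INDEX)
#   db.TCHeritage.find(name_query("Example Bridge"))
```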

C. Spatial index and query

MongoDB 2D geospatial indexes are used to create the spatial index. When a geospatial index is
created on legacy coordinate pairs, MongoDB computes geohash values for the coordinate pairs
within the specified location range and then indexes the geohash values. To calculate a geohash
value, the two-dimensional map is recursively divided into quadrants, and each quadrant is
assigned a two-bit value. Figure 6 illustrates the geohash coding model.

These two-bit values (00, 01, 10 and 11) represent each of the quadrants and all points within each
quadrant. To provide additional precision, each quadrant is further divided into sub-quadrants.
Each sub-quadrant has the geohash value of the containing quadrant concatenated with the value
of the sub-quadrant. The geohash for the upper-right quadrant is 11, and the geohashes for its
sub-quadrants are (clockwise from the top left) 1101, 1111, 1110 and 1100, respectively.
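
The recursive subdivision can be sketched in Python. This is a simplified illustration rather than MongoDB's internal implementation; the function name `quadrant_geohash` and the default range of -180 to 180 on both axes are assumptions. At each level the first bit records whether the point lies in the right half and the second whether it lies in the upper half, which reproduces the quadrant codes given above.

```python
def quadrant_geohash(x, y, levels, lower=-180.0, upper=180.0):
    """Return a bit string with two bits per subdivision level."""
    x_lo, x_hi = lower, upper
    y_lo, y_hi = lower, upper
    bits = []
    for _ in range(levels):
        x_mid = (x_lo + x_hi) / 2
        y_mid = (y_lo + y_hi) / 2
        # First bit: 1 if the point is in the right half of the current cell.
        if x >= x_mid:
            bits.append("1")
            x_lo = x_mid
        else:
            bits.append("0")
            x_hi = x_mid
        # Second bit: 1 if the point is in the upper half of the current cell.
        if y >= y_mid:
            bits.append("1")
            y_lo = y_mid
        else:
            bits.append("0")
            y_hi = y_mid
    return "".join(bits)


upper_right = quadrant_geohash(100, 100, levels=1)
sub_quadrants = [
    quadrant_geohash(10, 100, levels=2),   # top left of the upper-right quadrant
    quadrant_geohash(100, 100, levels=2),  # top right
    quadrant_geohash(100, 10, levels=2),   # bottom right
    quadrant_geohash(10, 10, levels=2),    # bottom left
]
```

Points that share a long common prefix lie in the same small cell, which is what lets an ordinary ordered index answer spatial lookups.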

The location field of the Hakka cultural heritage dataset is used to create a "2dsphere" spatial
index, given the minimum and maximum index range and the number of divisions.

Spatial queries include proximity queries and range queries. A proximity query can be done by
using the geospatial query operator "$near" together with the geospatial index, for example to find
the 5 residences nearest to the point [117.01, 24.66].
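
Such a proximity query can be sketched as a filter document built in Python. The `$near`, `$geometry` and `$maxDistance` operators are MongoDB query operators; the field name `location`, the helper `near_query` and the pymongo call in the comment are assumptions, and the result limit of 5 is applied on the cursor rather than in the filter.

```python
RESULT_LIMIT = 5  # number of nearest items to return


def near_query(lon, lat, field="location", max_distance_m=None):
    """Build a $near filter around a GeoJSON point (longitude, latitude)."""
    near = {"$geometry": {"type": "Point", "coordinates": [lon, lat]}}
    if max_distance_m is not None:
        near["$maxDistance"] = max_distance_m  # meters, for GeoJSON points
    return {field: {"$near": near}}


query = near_query(117.01, 24.66)

# Assumed usage with pymongo and a geospatial index on `location` (not executed here):
#   collection.find(query).limit(RESULT_LIMIT)
```

Results of `$near` are returned from nearest to farthest, so limiting the cursor to 5 yields the 5 closest items.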

Range queries can be done by using the geospatial query operator "$geoWithin" to query the data
within a given polygon. The polygon needs to be constructed in GeoJSON format in the query.
Range queries can be specified not only by a polygon, but also by a rectangle or a circle. For
example, a range query can return the data within the polygon formed by the four points [0,0],
[0,4], [4,4] and [4,0].
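
A range query of this kind can be sketched as a filter document built in Python. `$geoWithin` and `$geometry` are MongoDB query operators; the field name `location` and the helper `within_polygon_query` are assumptions. GeoJSON requires the polygon ring to be closed, so the helper repeats the first point at the end when needed.

```python
def within_polygon_query(points, field="location"):
    """Build a $geoWithin filter for the polygon through `points`."""
    ring = [list(point) for point in points]
    if ring[0] != ring[-1]:
        ring.append(list(ring[0]))  # close the ring as GeoJSON requires
    polygon = {"type": "Polygon", "coordinates": [ring]}
    return {field: {"$geoWithin": {"$geometry": polygon}}}


range_query = within_polygon_query([[0, 0], [0, 4], [4, 4], [4, 0]])

# Assumed usage with pymongo (not executed here):
#   collection.find(range_query)
```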

VI. HAKKA CULTURE DATA MANAGEMENT SYSTEM

A Hakka culture data management system based on the browser/server architecture was designed
and developed. It combines MongoDB database storage management with the Tianditu API for
JavaScript. The main functions include Hakka culture data management, map services and query
services. The system includes three layers: the database layer, the server layer and the client.

1) Database layer: the Hakka culture database is constructed with the method and architecture
described above.

2) Server layer: its main functions include adding and modifying data, spatial queries, category
queries and name queries of Hakka culture data, and transferring the query results to the client.

3) Client side: it mainly implements the interaction between the user and the system. Its core
function is the browsing, querying and result display of Hakka culture data.

Figure 7 illustrates the management system interface. The administrator has the right to log in to
the system and has data management functions, mainly including adding, modifying and deleting
Hakka culture data.

VII. CONCLUSION AND DISCUSSION

The author analyzes MongoDB and its implementation techniques, and explores MongoDB-based
storage strategies for Hakka tangible cultural heritage data, intangible cultural heritage data and
infrastructure data. Sharding, indexes, spatial indexes and queries were constructed for the Hakka
culture database. Finally, a prototype Hakka culture data management system was implemented.
NoSQL databases present a new method to improve the flexibility and efficiency of managing
multi-source heterogeneous and unstructured data.

However, for raster-type Hakka culture data, the author managed each raster file as a whole, which
made the data difficult to read and parse. In a follow-up study, image tiles and a pyramid model
should be used before the data are stored.
