
1 Introduction

With the advances in Internet and Web technologies along with the increased availability of
tools and contents, we are seeing an exponential growth of online resources. However, a
variety of obstacles (such as dispersion over the Web and lack of metadata) hamper
discovery of such materials and hinder their widespread use. Digital libraries (DLs) address
these problems by providing an infrastructure for publishing and managing content so it is
discovered easily and effectively. Digital Libraries are viewed as a means for
dissolving inequities in access to scientific information for researchers and students
alike. A number of digital libraries exist today, both in the commercial world and in
the government and educational domains. Some examples from the commercial world are:
American Physical Society [Aps], IEEE Digital Library [Iee], ACM Digital Library [Acm]; and a
few examples from the other domains are: Los Alamos Digital Library [Lan], PubMed [Pub],
and CERN [Cer].
Despite the creation of many digital libraries, there exists considerable untapped content
with individuals in diverse communities. One major hindrance is the centralized frameworks
under which digital libraries are organized and deployed, thus limiting their accessibility,
particularly in publishing. The centralized framework, by its very nature, requires an
organization in a domain/community to take the lead not only in providing the hardware and
software infrastructure to support a digital library, but also in establishing processes to
develop and maintain the content. A number of organizations and communities are struggling to define
business models that would financially sustain these digital libraries. Another problem, which
is faced by today's domain-specific digital libraries, is the evolving nature of a community's
interests. Over time a community's interests shift to different topics, yet the processes and
interfaces that have been put in place are rarely flexible enough to accommodate this
change. Finally, there is the problem of
building a community where members are distributed, do not know of other members, do not
know of available interest areas, yet still want to come together in a community of common
interest. Our solution moves away from traditional Digital Libraries, which rely on a
central organization to develop, support and maintain collections and services. In our vision
there are on the order of a million universal clients, each client being able to do all the
activities: searching, contributing, maintaining collections.
Here we address these issues by proposing a digital library framework that is based on
peer-to-peer (or P2P) networks and which leverages existing work in the area of Digital Libraries
such as OAI and Kepler projects [Oai, Kep]. A P2P network is a distributed network with no
central control and consists of nodes running identical software. These networks do not
require a central server, which is typically expensive and requires technical personnel to
maintain it. Gnutella [Gnu], Napster [Nap], and Kazaa [Kaz] are examples of file sharing
applications using a P2P network. There has been interest in using P2P networks for
building digital libraries [Mal01, Baw03]. The Kepler project [Kep] has some components of a
P2P network, but is not a pure P2P network. It does require a central service to locate target
nodes, and is closer to Napster, which is a broker based P2P network. In [Baw03], the
authors propose a topic-segmented network based on a pure P2P network, which has the
nice property of clustering nodes with similar content while at the same time reducing the query
processing cost. However, this approach has an inherent problem in the handling of flexible
and evolving communities. In particular, it becomes difficult to identify communities that are
multidisciplinary in nature and are hard to categorize by a specific discipline subject. Also, in
the topic-segmented networks it is not straightforward to handle new communities in future
or communities with changing interests.

The intellectual merit of the proposed research is to develop new models and investigate
research issues in building digital libraries that are self-sustainable and support evolving
communities with diverse interests. The proposed research builds upon the existing work in
the area of Open Archive Initiative (OAI), Kepler, and Peer-to-Peer and Social Networks. The
key in our vision of supporting evolving communities is to build the network based on
access pattern. We use concepts from SETS [Baw03] and Symphony [Man03] to build and
ensure that the network exhibits the small world property [Wat98], which in turn leads to
efficient search. The major challenges in building such libraries are: (a) Keeping the network
connected in the presence of frequent joining, re-joining, and leaving of participants; (b)
Maintaining a low barrier for building communities with diverse interests; and (c) Providing
support for new and evolving communities with changing interests.
The main objective of the proposed research is to investigate an alternate model for digital
libraries (we shall refer to it as Freelib) that is scalable and supports communities and
domains evolving from the bottom up. As part of this project we plan to: (i) Research system
design issues in building software to support such libraries that grow and evolve with time;
(ii) Build a universal client that every member node will run; (iii) Build a test bed (faculty, staff
and students at four departments at ODU) to demonstrate the viability of our approach; and
(iv) Evaluate the effectiveness of the proposed digital library in terms of end user services,
evolving domains/communities support, scalability, and sustainability. If Freelib proves
indeed to have the desired sustainability property then the project will have potentially a
broad impact on the education community, as there are many target communities that can
benefit from such a system. For example:
- High school teachers nation-wide can build their own community to publish and
exchange school project ideas, class syllabi, classroom materials, etc.
- Faculty members of universities nation-wide can use such a system to publish and make
available their own publications, project reports, and course material, and to search for
publications of interest to them.
- Students can create collections of material related to their courses and also to their
private lives.
Old Dominion University (ODU) is actively involved in the OAI as a member of the technical
working group and alpha tester. We developed the first OAI-compliant service provider Arc
[Arc]. Besides Arc, ODU is working with Phillips Air Force Research Laboratory (AFRL), Los
Alamos National Laboratory, and NASA Langley Research Center in building a Technical
Report Interchange (TRI) federation of report collections [Tri] available at the three
organizations in their native digital libraries. Recently, ODU demonstrated how OAI could be
used for building Kepler [Kep], a framework for individual publishers; Archon [Arch], an NSDL
project, and a new OAI-based NCSTRL [Ncs]. ODU will leverage its experience in building
OAI-compliant data and service providers for this project. As the nodes of Freelib are going
to be OAI compliant, they can be integrated with any OAI service provider. In particular, we
will integrate Freelib with the NSDL architecture through collaboration with Archon (an NSDL
funded project), broadening the exposure of content available in Freelib.
2 Background

2.1 Digital Libraries

Today digital resources exist that include a large variety of objects: pre-prints, including
technical reports; tutorials, posters, and demonstrations from conferences; student project
reports, theses, and dissertations; working papers; courseware; and tool descriptions.
However, a variety of obstacles (such as dispersion over the Web and lack of metadata)
hamper discovery of such materials and hinder their widespread use. Digital libraries (DLs)
address these problems by providing an infrastructure for publishing and managing content
so it is discovered easily and effectively. Researchers have been addressing a variety of
issues, ranging from theoretical models to system building, for example: [Rag03, Dlit,
Fox02, Eli, Inf, Lev, Dml, Dlib]. There is also interest in tracking user accesses to support
high level services, for example recommendation systems [Bol02, Sar01, Ale, Vtd]. There are
approaches for resource discovery based either on harvesting, as in OAI (described
below), or on distributed real-time searches such as [Shi02, Pae00].
OAI: In the last decade literally thousands of digital libraries have emerged. One of the
biggest obstacles for dissemination of information to a user community is that many digital
libraries use different, proprietary technologies that inhibit interoperability. Building
interoperable digital libraries allows communities to share information that cuts across artificial
institutional and geographic borders. The Open Archive Initiative [Oai, Lag01, Liu01] is one
major effort to address technical interoperability among distributed archives. The OAI
framework defines two roles, the data provider (archive) and the service provider. The data
provider exposes its metadata and makes it available to service providers through the use of
the Open Archive Initiative-Protocol for Metadata Harvesting (OAI-PMH).
Kepler: Kepler [Mal01] is a system for providing digital library services to the individual.
Kepler provides the individual user with a client (called an Archivelet) that contains an
OAI 2.0-compliant data provider and a user interface to enter and edit metadata. Kepler
supports community building by providing a group package. The group harvests metadata
from the individual archivelets (and optionally caches full text documents) into a centralized
service provider where any user can search the whole collection. The Kepler framework is
similar to a broker based P2P network such as Napster. The system supports two types of
users: individual publishers using the archivelet publishing tool and general users interested
in retrieving published documents. The individual publishers interact with the publishing tool
and the general users interact with a service provider and archivelets using a browser.
2.2 Peer-to-Peer

A peer-to-peer network is a distributed network with no central control and consists of nodes
running identical software. There are many applications of the P2P model with file sharing
as one of the most popular applications. It enables users to share and exchange files, for
example, Napster [Nap], Gnutella [Gnu], Freenet [Fre, Cla00], and Kazaa [Kaz]. Some
examples of other applications supported by the P2P model are Groove Networks [Gro],
SETI@home [Set], OpenCola [Ope], and JXTA Search [Jxt]. One of the problems most of
the P2P networks face is that of increased search latencies as the network grows, and of
network degradation in the presence of frequent leaves and re-joins [Pan01]. Recently, there
has been interest in using these networks for building digital libraries [Mal01, Kep, Baw03].
The Kepler project has some components of P2P networks but is not a pure P2P network. It
does require a central service to locate the right node, and is closer to Napster, which is a
broker based P2P network. In [Baw03], the authors propose a topic-segmented network based
on pure P2P networks, which has the nice property of clustering nodes with similar content
while at the same time reducing the query processing cost. There is also interest in building
P2P networks that exhibit the small-world property [Man03]. A small-world network is one
that has a small bounded diameter and high clustering coefficient [Wat98].
3 Approach and Objectives

The objective of the proposed work is to research the key issues in the design,
implementation, deployment, and evaluation of a sustainable digital library that supports
dynamic evolution of communities. We propose a digital library that builds upon the existing
work in the area of Open Archive Initiative (OAI), Kepler, Peer to Peer, and Social Networks.
Our solution moves away from the traditional Digital Libraries that rely on a central
organization to develop, support and maintain collections and services. In our vision there
are on the order of a million universal clients, each client being able to do all the activities:
searching, contributing, maintaining collections. We distribute the cost of evolving the code
through open-source participation and the cost of running the DL across millions of users,
admittedly incurring a tradeoff in less efficient search methods. Whereas in the OAI world data
providers and service providers are always drawn on the opposite side of a line, in Freelib
each node is both. We also envision that the current NSDL framework will be able to
integrate with Freelib and thus have access to its content. Note that all nodes of Freelib will
be OAI compliant. The three key features of the proposed library are sustainability, dynamic
evolution of communities, and support of diverse communities.
Sustainability. The underlying architecture of the proposed library is based on P2P, which is
decentralized and does not rely on any centralized expensive hardware in sustaining the
library. In addition, there is no need for any centralized administrative control in maintaining
and enforcing policies.
Dynamic Evolution of Communities. The key to our approach here is to characterize
communities based on users' access patterns and to build the network topology to reflect
this structure. A cluster of nodes formed by a common access pattern identifies a
community. Freelib allows communities to form, grow, mature, dwindle, and disappear as
users' interests change.
Support of Diverse Communities. We want to architect our universal client in such a way that
different requirements for different communities, typically in terms of the metadata and
interfaces, can be integrated with the core client using plug-ins. In our current Kepler project,
we are addressing some of these issues and we plan to leverage that work for this proposal.
As part of our objectives, we plan to demonstrate that the Freelib approach will produce a
viable DL that is of benefit to NSDL and its community. We should emphasize that this is a
targeted research proposal and as such, we will only demonstrate in a rather small test bed
that Freelib has potential. The specific objectives we plan to address are:

- Develop an alternate model of building digital libraries that is sustainable and supports
communities and domains with diverse interests.
- Research how a P2P network model can be utilized for building such digital libraries, in
particular investigate the effectiveness of user-access-driven P2P network topologies
and develop a framework.
- Build a test bed (faculty, staff and students at four departments at ODU) that
demonstrates the network of DLs and their interactions.
- Evaluate the effectiveness of the proposed digital library in terms of end user services,
evolving domains/communities support, scalability, and sustainability.
- Research system design issues in building software to support the proposed library as it
grows and evolves with time.
- Analyze the community patterns that develop in the test bed.
4 Proposed Work

In the Approach section we described the specific objectives of the overall vision of a
scalable, sustainable DL that supports evolving communities. In this section we have
organized the work for achieving the objectives into network architecture, universal client
architecture, and evaluation.


4.1 The Network Architecture

A node in Freelib is an OAI service provider as well as an OAI data provider. Recall that a
service provider in the OAI framework is the one that harvests metadata and provides end
user services such as indexing and searching. On the other hand, an OAI data provider is
typically an archive that holds the published records. We will need to endow a node with
additional methods beyond the OAI protocol to support the P2P infrastructure. The network
architecture of Freelib consists of two overlaying networks: access network and support
network. The access network topology is characterized by the user access pattern, and the
support network topology is determined by an adaptation of the Symphony protocol [Man03].
The Symphony protocol preserves the connectivity and small world properties of the network
for supporting efficient search. A node linked to another node in this support network is called
a contact, either short- or long-distance depending on the distance between the nodes. For
the sake of clarity, in our discussion we will view the two networks separately although in the
implementation both protocols reside in the universal client. In the access network, a node
is connected to interacting nodes. Interaction is defined both ways; that is, a node
can search for objects and, upon discovering them, access them, or a node's objects
can be discovered and accessed by other nodes. Interacting nodes are linked to each
other by friend-links in the P2P terminology. The network contains only active nodes that
are on-line. A node is not connected to all interacting nodes, only to a subset as described
later. For completeness, we briefly describe the Symphony protocol and how we are
adapting it in our context, for details on the Symphony protocol please refer to [Man03].
Symphony is a protocol developed to maintain a distributed hash table of identities in such a
way that no node has global knowledge yet can discover any object in a network. It does so
by having each node maintain k (a system parameter) friend links besides knowing its
predecessor and successor. The latter two relations are obtained by organizing all nodes on
a virtual, directed ring. The ring is of unit perimeter and object IDs are real numbers in
the interval [0, 1); each node x manages the sub-interval of the ring between its own ID and
that of its clockwise predecessor. Nodes can join the network by acquiring an ID (drawn
from a uniform distribution) and adjusting and creating links, and can also leave (in which case
friend links are adjusted). The reachability property is maintained by drawing k links to
distant objects from a harmonic probability distribution, with the constraint that no node can
have more than 2k incoming links. Though we are not using the protocol for maintaining a
distributed hash table, we still maintain the concept of a manager of the sub-interval, which
helps us in inserting a new node in the network as explained later. We also expand the
concept of short-range friends for a node, which in the original protocol are defined as links to
the two adjacent nodes in the ring. As these are links in the support network additional to
those defined by the original Symphony protocol, we still maintain the network
characteristics that are ensured by the Symphony protocol.
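As a minimal sketch, the two Symphony mechanisms we rely on, sub-interval management and harmonic long links, can be expressed as follows; the function names and the sorted-list view of the ring are our own illustrative choices, not code from Symphony:

```python
import bisect
import math
import random

def manager_of(node_ids, point):
    """Each node manages the sub-interval between its clockwise
    predecessor's ID and its own ID, so the manager of a point is the
    first node ID at or after it, wrapping around the unit ring."""
    ids = sorted(node_ids)
    i = bisect.bisect_left(ids, point)
    return ids[i % len(ids)]

def draw_long_link(own_id, n):
    """Draw one long-range contact at ring distance d sampled from the
    harmonic distribution p(d) proportional to 1/(d ln n), d in [1/n, 1)."""
    d = math.exp(random.uniform(math.log(1.0 / n), 0.0))
    return (own_id + d) % 1.0
```

The manager function is what lets a join request be routed to the node responsible for a freshly drawn ID, as used in the joining rule below.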
4.1.1 Freelib Network Protocol

Rule for creating friend links (Zf). These links are in the access network, see Figure 1,
and are identified by the list of friends[1] (interacting nodes). The nodes in the friend list are
ranked and links exist to only the first Zf nodes of this ranked list; it is this list that will be
used to search. In the access pattern information repository of a node, see Figure 3, we
maintain a list of all friends a node has ever had (up to a set limit), tagged as active or
off-line. Consider m entries in the friend list of a node. The rank, Ri, i = 1 to m, of a friend is
given by

[1] Unless otherwise stated, friends are the active friends; see also the discussion on Re-joining
the Network and Dropping Friends.


Ri = α (ni / N) + (1 − α) (pi / P)        (1)

where
N: the total number of outgoing accesses to all nodes during a time interval t.
ni: the total number of outgoing accesses to the ith friend in the list.
P: the total number of incoming accesses from all nodes during a time interval t.
pi: the total number of incoming accesses from the ith friend in the list.
α: the weighting factor in the range 0 to 1, where α = 1 ignores incoming accesses in the
ranking calculation and α = 0 ignores outgoing accesses in the ranking calculation.
The optimal value of α is an open question that requires further research and experiment, as
it affects the formation of communities. Note that Ri forms a probability distribution over the
friend list, so the sum of the rankings over all friends is one. Ri indicates the probability
with which a friend will be accessed. Note that access here refers to access of the digital
object, typically a full text document. Access to metadata is not counted in the above rank
calculation. The number of friend links, Zf, is a system parameter and is bounded by the
following constraint:

Zf + Zs + Zl ≤ Z        (2)

Here, Z is the bound on the total number of links a node can have; Zs short-range contact
links and Zl long-range contact links are defined below. The optimal values for these
parameters will be determined in part by an analytical model, simulation, and experiments.
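Equation (1) can be computed directly from the per-friend access counters; as a sketch, with a dictionary-based interface and default α that are our own illustrative choices:

```python
def rank_friends(outgoing, incoming, alpha=0.5):
    """Rank R_i = alpha * (n_i / N) + (1 - alpha) * (p_i / P) per
    equation (1). `outgoing` maps friend -> n_i and `incoming` maps
    friend -> p_i, both counted over the current time interval t."""
    N = sum(outgoing.values()) or 1  # guard against empty intervals
    P = sum(incoming.values()) or 1
    friends = set(outgoing) | set(incoming)
    return {f: alpha * outgoing.get(f, 0) / N
               + (1 - alpha) * incoming.get(f, 0) / P
            for f in friends}
```

Since each term is a normalized distribution, the ranks sum to one, matching the observation that Ri is a probability distribution over the friend list; the top Zf entries of this ranking become the friend links.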
Rules for creating long-range contact links (Zl). These links are in the support network,
see Figure 1, and are created according to the Symphony protocol. These links are key in
maintaining a low diameter of the network. The number of long-range contact links, Zl, is a
system parameter to be determined by the simulation and experiment, and is also bounded
by constraint (2). The long-range contact links are utilized for executing search as described
later in the search section.
Rules for creating short-range contact links (Zs). These links are in the support network
and are created by considering Zs nodes close on the ring (support network) and adding
links to those nodes. These links are defined when the node, ci, joins the network or when
the node migrates. In either case, the node ci identifies all nodes within a distance d to form
short-range contacts. The distance metric:
δi,j = |ci − cj|        (3)
is the absolute value of the distance between the two nodes on the unit perimeter ring. The
threshold d is a system parameter we shall determine through experiments and simulation.
Note that we have extended the Symphony protocol here, which only defines two links as
short-range links; see our earlier discussion on the Symphony protocol.
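A sketch of selecting short-range contacts with the metric of equation (3); the function names and the list view of the other nodes' ring positions are illustrative, and sorting by distance (rather than the random selection used on join) is our own choice here:

```python
def ring_distance(ci, cj):
    """Equation (3): absolute difference of two node positions on the
    unit-perimeter ring."""
    return abs(ci - cj)

def short_range_contacts(own_id, node_ids, d, zs):
    """Nodes within threshold d of own_id, from which the first Z_s are
    taken as short-range contacts (nearest first in this sketch)."""
    nearby = [c for c in node_ids
              if c != own_id and ring_distance(own_id, c) <= d]
    return sorted(nearby, key=lambda c: ring_distance(own_id, c))[:zs]
```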
Rule for when to migrate. The decision of when to migrate is not a black and white one but
rather a fuzzy one, as is the definition of a community. Equation (2) provides the relation of
the number of links towards the steady state ideal of Z links per node. At the beginning of a
node joining there will be few friends and almost all links will be synthetic links to ensure that
it will be found during searches nevertheless. As the number of friends increases we can
reduce the number of short contacts while maintaining the number of long contacts. Using
equations (2) and (3) we develop heuristics that relate the locations of the friend links to
those of the short-range contacts to determine whether or not a node is in the wrong
interval on the support ring.


Rule for how to migrate. At some time, as described above, a node may decide to migrate
to a new position on the support network. When a node makes the decision to migrate, it
simply picks its top ranked friend and sends a migrate request to it. This target node
behaves as if this is a new node that is trying to join and inserts it between itself and its
clockwise predecessor on the support network. The migrating node must notify its own
neighbors before it leaves so that they update their links. After moving to the new position,
the migrating node updates its short-range and long-range contacts on the support network.
The friend list in the access network does not change for a node.
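One possible form of the migrate heuristic is sketched below; the majority threshold is an assumption of ours, not fixed by the proposal, which leaves the exact heuristic to be developed:

```python
def should_migrate(own_id, friend_ids, d):
    """Heuristic sketch relating friend-link locations (using the metric
    of equation (3)) to the short-range interval (own_id - d, own_id + d):
    migrate when most friends fall outside that interval."""
    if not friend_ids:
        return False
    outside = sum(1 for f in friend_ids if abs(f - own_id) > d)
    return outside > len(friend_ids) / 2

def migration_target(ranks):
    """Per the migrate rule, the node sends its migrate request to its
    top-ranked friend, which inserts the node as its clockwise
    predecessor on the support ring."""
    return max(ranks, key=ranks.get)
```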

Figure 1. Network architecture, showing the access and the support network
Rules for joining and leaving the network. To join the network, a new node simply needs
to know the address (IP) of any existing node on the network. This is usually done by some
offline means, e.g., a user getting the IP of a friend or getting some IP from a website. The
joining node sends a joining request to the existing node. The existing node picks an id
randomly from the sub-range it is managing, and returns the id along with the short-range
contacts (based on the picked id) to the joining node. The short-range contacts are built
using the distance metric (3) on the support network. In contrast to the Symphony protocol,
we select Zs nodes at random from the interval (id − d, id + d). Note that if the joining node
happens to belong to the community of interest of the existing node, this approach of joining
will put the new node in the right cluster. The Zl long distance links are chosen in accordance
with the Symphony protocol. The new node can now start building its list of friends by
accessing and interacting with other nodes. When a node leaves the network, it informs its
contacts so that they can update their links accordingly. (Note that a link in the support
network is bi-directional, as opposed to the unidirectional links in the access network; in
Figure 1 we only show the unidirectional links to avoid crowding the picture.)
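The existing node's side of the join rule can be sketched as follows; the tuple representation of the managed sub-interval and the return shape are illustrative assumptions:

```python
import random

def handle_join(managed_interval, all_ids, d, zs):
    """An existing node handles a join request: it picks an id uniformly
    at random from the sub-interval it manages, then returns that id
    together with Z_s short-range contacts drawn at random from the
    nodes within distance d of the new id."""
    lo, hi = managed_interval
    new_id = random.uniform(lo, hi)
    nearby = [c for c in all_ids if abs(c - new_id) <= d]
    contacts = random.sample(nearby, min(zs, len(nearby)))
    return {'id': new_id, 'short_range_contacts': contacts}
```

The joining node then opens connections to the returned contacts and begins accumulating friend links from its access statistics.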
Rules for re-joining. When a node leaves the network temporarily, such as when the client runs
on a home computer with an intermittent Internet connection, one would not want to destroy
all the access information the node has built up during its existence. On the other hand, the
support network needs to delete all the links relating to the leaving node so as to keep the
network connected and not to violate the bound on the diameter. We will maintain the links
for a node in the access pattern information repository with a flag indicating that it is off-line
(there will be a time threshold for removing a node permanently). Any node in either the
access or support network periodically pings its friends or contacts to determine whether
they are active, and if they are not, the links are adjusted accordingly. Upon re-joining, the
friend list is restored from the repository and the contacts are created as if the node were
doing a regular joining.
Rules for dropping friends. As a node interacts with other nodes by accessing their objects
and is being accessed by other nodes, the friend list may grow beyond Zf. In that case we
drop the lowest ranked (active) friends, see equation (1), from the list but maintain the
information in the access information repository in case the list needs filling up again.
That may happen through friends leaving the network (either temporarily or permanently).
Rules for searching. There are various search protocols used in the area of P2P, mostly
variations of breadth-first and depth-first search [Gnu, Cla00, Das03]. We will adapt
these searches in our context, utilizing access friends, short-range contacts, and long-range
contacts. On one end of the spectrum we may only utilize friend links and on the other end
we use all types of links: friends and contacts. We need to limit the number of links the
search method uses, as too many may flood the network and also lead to a low-precision result
set. The latency of searches will vary depending on various parameters such as TTL (time to
live) and the fraction of total links used. For example, a friend-only search with a TTL of 1 would
be fast and have good precision and recall. This is the fastest search mode and after the
initial startup, when the node has enough access statistics, and access friends are ranked
based on these statistics, this search mode would return a large portion of the results the user
might be interested in. We expect this to be the dominating search mode, especially after the
initial startup. In the local community search mode, any node that receives the request does
not route it to any of its long-range contacts; it routes it only to short-range contacts
and access friends. This is the second fastest search mode and is expected to return
more search results than the previous one. In the Global search mode, there is no restriction
on which nodes to forward the request to. This mode is the default and should be used in the
beginning after a node joins until it discovers its access friends in the community.
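The three modes differ only in which link types a request is forwarded on; this sketch uses illustrative field names for the link lists:

```python
def forwarding_links(node, mode):
    """Links a request is forwarded on in each search mode: friend-only
    uses access friends; local community adds short-range contacts;
    global search uses every link type."""
    friends = list(node['friends'])
    if mode == 'friend-only':
        return friends
    if mode == 'local-community':
        return friends + list(node['short_contacts'])
    if mode == 'global':
        return (friends + list(node['short_contacts'])
                + list(node['long_contacts']))
    raise ValueError('unknown search mode: ' + mode)
```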
Rules for Replication. Figure 2 shows the structure and contents of the repositories of the
nodes. Each node has the standard repository that contains the objects and their metadata
records the owner has published. It also shows supernodes (to be discussed in the next
section) that will store aggregated metadata of friends (to improve search) and replication of
metadata records and objects that the node owner had accessed. The replication of the
content is done at the time the node owner accesses an object and keeps it (maintaining
provenance) so it can serve up the metadata (and the object) should an appropriate query
reach it. The intuitive idea behind this approach to replication is that an object that is in great
demand will be replicated at many places and will be retrieved early on in the search. On
the other hand, an object that is scarcely retrieved will not be replicated as often. This is in
spirit similar to Freenet's replication strategy, though Freenet keeps a copy of the document
at intermediate nodes as well. When a node replicates an object, there is a need to notify the
original owner of the document (we skip some details of notification, particularly when the
owner node is not active) for the node owner to adjust its ranking calculations. The number
ni in the rank calculation equation includes these indirect accesses. On the other hand, it is
not clear whether the servicing node (keeping the replica) should consider the indirect
access for its own rank calculation. We suspect it may be desirable to consider it, treated
as a fractional access. As part of this project we plan to evaluate this theoretically and by
simulation, and to observe it experimentally.
Duplicate Detection. The search methods all involve processing the query locally for hits
and, if the algorithm dictates, forwarding the query to other nodes. Each node therefore has
to merge results from the local repository and the returns from the other nodes before it
sends these merged results back to the node that originated the query (or the user if the
query came directly from the application). Because of the replication mechanism and the
supernode (see below) mechanism, it is quite likely that duplicates will be present in the lists
to be merged. In the Archon [Arch] and TRI [Tri] projects we have developed methods for
detecting duplicates based on a number of heuristics which we will import for this purpose.

Figure 2. Collection structure


4.1.2 Network Topology and Communities

Recall from the previous section that in Figure 1 a physical node appears twice, once in the
access network and a second time in the support network. In general, nodes that are clustered
in the access network may not be clustered in the support network. However, our protocol
for migrating nodes tries to form clusters on the support network corresponding to the
communities in the access network. The main objective of maintaining this correspondence
of communities and clusters is to help a node when it joins the network to find contacts,
which are most likely going to be its friends as well.
Super Nodes. A node has the option to become a supernode at any time. Once a node
becomes a supernode, it harvests all metadata using OAI-PMH from all its friends. It also
detects and removes duplicates and normalizes and indexes the metadata. In other words, a
supernode can now act as an indexer (search engine) for the community. The user of the
node has an incentive to become a supernode because it will improve her search
performance. (It is not clear whether awareness of supernodes is an advantage; at this
time we will keep its existence known only to the owner.) At the discretion of the
user, a supernode may decide to keep a copy of the full text document as well. In this case,
it would add alternate URLs in the metadata records pointing to the copy of the full text.
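A minimal sketch of the supernode's harvest cycle, assuming friends expose standard OAI-PMH base URLs. The helper names, the error handling, and the example URL are illustrative assumptions.

```python
# Illustrative harvest loop for a supernode: issue a standard OAI-PMH
# ListRecords request to each friend and collect the raw XML responses.
from urllib.request import urlopen
from urllib.parse import urlencode

def list_records_url(base_url, prefix="oai_dc"):
    """Build a standard OAI-PMH ListRecords request URL."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": prefix})
    return f"{base_url}?{query}"

def harvest_all(friend_urls):
    """Fetch metadata XML from every friend; skip unreachable nodes."""
    responses = {}
    for url in friend_urls:
        try:
            with urlopen(list_records_url(url)) as resp:
                responses[url] = resp.read().decode("utf-8")
        except OSError:
            continue  # friend offline; retry on the next harvest cycle
    return responses
```

The harvested responses would then feed the duplicate detection and normalization/indexing steps described above.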
4.1.3 OAI extensions
It is essential that the nodes in our network are OAI compliant, both to give them the
flexibility to work with centralized service providers and to help us in


integrating Freelib with NSDL. For this reason we add methods to OAI-PMH that are
required for supporting the access and support networks and for performing search. We
understand that there are trade-offs in doing so, as some of the network operations may not
be efficient over HTTP, as OAI requires. The proposed extensions, like the base OAI
methods, are submitted using either the HTTP GET or POST method. We now briefly
describe a few of the extensions to illustrate how we plan to develop them.
Joining the network: submitted to an existing node by a new node that wants to join the
network. An example of this method is:
http://somenode.odu.edu/?verb=join&id=0.524&ip=128.82.8.76
The existing node routes the request to the node responsible for the given id, the target node.
The target node replies by telling the joining node about its short-range contacts. The response
is encoded in XML, just like OAI-PMH responses. For lack of space, we skip
the details. The joining node then establishes short-range connections to these nodes and
starts accessing the network. It also starts building its friend list based on the access
information.
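The join exchange could be sketched as below. The XML element names in the reply are a hypothetical placeholder, since the response details are deliberately left unspecified above; only the request URL format comes from the text.

```python
# Sketch of the join exchange: build the extended-OAI join request and parse
# short-range contacts out of an assumed XML reply. The reply schema
# (<reply><contact id=... ip=.../></reply>) is an illustrative assumption.
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

def join_url(bootstrap, node_id, node_ip):
    """Build the extended-OAI join request shown in the text."""
    query = urlencode({"verb": "join", "id": node_id, "ip": node_ip})
    return f"{bootstrap}/?{query}"

def parse_contacts(xml_text):
    """Extract (id, ip) pairs for short-range contacts from the reply."""
    root = ET.fromstring(xml_text)
    return [(float(c.get("id")), c.get("ip")) for c in root.iter("contact")]
```

The joining node would then open short-range connections to each returned contact, as described above.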
Leaving the network: submitted by an existing node to its contacts before it leaves the
network. Note that this operation is not always explicitly initiated by the user (for example,
the machine of the node owner may crash). An example of this method is:
http://somenode.odu.edu/?verb=leave&id=0.524
On receiving this request, the contacts adjust their links as defined by the protocol.
Migration: A node, on making a decision to migrate, first leaves the network and then
submits a migration request to the first member in its friend list. An example of a migration
request is:
http://somenode.odu.edu/?verb=migrate
When the target node receives the migrate request, it allocates an id for the requesting node
that is close to its own id and replies with information similar to the join request.
Search: sent or forwarded by a node to some other node requesting a search. Two sample
requests for a breadth-first search six levels deep, where the TTL received was 4:
http://somenode.odu.edu/?verb=listIdentifiers&metadataPrefix=oai_dc&author=maly&keyword=network&search=BF6&ttl=3
http://somenode.odu.edu/?verb=listRecords&metadataPrefix=oai_dc&author=maly&subject=digital+library&search=BF6&ttl=3
Upon receiving a request, the node searches its collection, returns any matches found, and
forwards the request to its neighbors in P2P fashion, updating the TTL parameter. The
additional search parameters can be made optional. When the XML records return, the node
processes them as described in the section concerning duplication.
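The search-and-forward rule can be sketched as follows. The `Node` class and its `local_search` method are illustrative assumptions; a real node would also apply the duplicate detection described earlier, since flooding can reach the same node more than once.

```python
# Minimal sketch of the forwarding rule: search locally, then pass the query
# on with a decremented TTL until it reaches zero.
class Node:
    def __init__(self, docs, neighbors=None):
        self.docs = docs                  # local collection (metadata stubs)
        self.neighbors = neighbors or []  # short-range contacts and friends

    def local_search(self, query):
        """Naive substring match over the local collection."""
        return [d for d in self.docs if query in d]

def handle_search(node, query, ttl):
    """Return local hits plus hits gathered from neighbors while TTL lasts."""
    hits = list(node.local_search(query))
    if ttl > 0:
        for neighbor in node.neighbors:
            hits.extend(handle_search(neighbor, query, ttl - 1))
    return hits
```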
4.1.4 Formative Simulation Study
The objective of the simulation is to (i) test and validate the protocol, (ii) obtain ranges for
different system parameters, (iii) evaluate the effectiveness of the support network, and (iv)
evaluate trade-offs between different variations of the protocol rules. In (i), we will check
whether there are any bottlenecks and whether there are special boundary conditions we need
to adjust. We need to avoid any infinite loops of operations; for example, migration may have
a tendency to create infinite loops. In (ii), we will obtain optimal system parameter ranges by
conducting various experiments. These system parameters have been identified throughout
the network protocol sections. For (iii), we will simulate the access network first and see how
it evolves for a given set of access patterns. Next we add the support network along with
contacts and see how it evolves for the same access pattern. In (iv), we want to decide on
protocol options where we are not yet clear on the right strategy; for example, how should a
new join be done?


Here, the first preferred option, as described in the rules, is to insert the node into the support
network and use all nodes within a short distance as its short-range contacts. These
contacts are initially used for executing searches for this node; over time the node
develops its own friends. The alternative option is for the node to take a copy of the
short-range contacts of the initial contact node.
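A skeleton of the kind of simulation harness we have in mind: build a random overlay, run flooding searches with varying TTL, and record message counts. The topology model and the parameter choices are placeholder assumptions used only to illustrate how parameter ranges could be explored, not the protocol itself.

```python
# Illustrative simulation harness: random overlay plus a message counter
# for TTL-bounded flooding.
import random

def random_overlay(n, degree, seed=0):
    """Each node links to `degree` random peers (self-links excluded)."""
    rng = random.Random(seed)
    return {u: rng.sample([v for v in range(n) if v != u], degree)
            for u in range(n)}

def messages_per_search(links, start, ttl):
    """Count query messages sent by flooding with a TTL from one start node."""
    frontier, messages = [start], 0
    for _ in range(ttl):
        next_frontier = []
        for u in frontier:
            for v in links[u]:
                messages += 1
                next_frontier.append(v)
        frontier = next_frontier
    return messages
```

Sweeping `degree` and `ttl` over such overlays is one way to obtain the parameter ranges sought in objective (ii); message counts grow roughly as degree to the power of the TTL, which is the cost side of the search trade-off.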
4.2 Universal Client Architecture
In the preceding section we presented our model of the network and how nodes interact,
mostly from the network perspective. In this section we present the architecture of the
universal client as seen from the perspective of the users. To start, the user downloads the
code for the universal client, which is maintained as open source.
Customization. At installation time a user will be able to configure various
components. In Figure 3 we show two configurable components: which
metadata schemas should be available for publishing documents (objects), and, within each
schema, which of the fields are mandatory or optional (UI schema). The default is that
Dublin Core is chosen, with the fields shown in Figure 4 mandatory (we have taken the
screenshots from Kepler, where such configurability is a key component). The idea behind
this customization is to allow users to be part of several communities, each preferring its
own metadata schema. The search interface presented to the user will similarly be
configurable to a chosen metadata schema. Finally, we allow a node to decide whether or
not it wants to serve as a supernode (periodically aggregating the metadata records of friends).
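One possible shape for this customization data is sketched below. The structure and the particular field lists are illustrative assumptions (the field names are Dublin Core element names), not Kepler's actual configuration format.

```python
# Hypothetical client configuration: enabled metadata schemas, and which
# fields each UI schema marks mandatory versus optional.
DEFAULT_CONFIG = {
    "supernode": False,  # whether this node aggregates friends' metadata
    "schemas": {
        "oai_dc": {
            "mandatory": ["title", "creator", "date"],
            "optional": ["subject", "description", "identifier"],
        },
    },
}

def validate_record(config, schema, record):
    """Return the mandatory fields the record is still missing."""
    ui = config["schemas"][schema]
    return [f for f in ui["mandatory"] if not record.get(f)]
```

The publication interface would refuse to deposit a record until `validate_record` returns an empty list for the chosen schema.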
Search. The core services of the network protocol, shown in the right part of Figure 3,
provide a set of basic search mechanisms described in the search section of the network
architecture. One of the key questions we have to answer is whether, and how, to give the
user some control over the search methods used by the system. In traditional DLs the user
has no control over what algorithms the system uses to satisfy a request; the only control the
user has is to provide more or fewer details about the constraints of the search. However, in
our environment search efficiency can vary dramatically depending on what kind of search is
used. We shall investigate how the system can interact with the user to optimize the search.
Specifically, we will design an interface that lets the user express sentiments such as:

- Needs to broaden the search: the current results are not satisfactory.
- Wants more friends in a community: wants to see an explicit buddy list for browsing.
- Wants to search all members of a community only: is happy with the community and, for a
specific search, wants the fastest results but not necessarily all of them.
- Needs to broaden out: knows very little about another community but wants to find out more.
- Is interested in more than one community: has friends in two or more communities but
wants to search only one particular community; needs ways to group friend links.
- Wants to browse by communities: needs support for accessing other communities and
getting to supernodes.

This user input is taken by the application-level search module at the left of Figure 3. This
module interacts with the core search service module, which in turn sends extended OAI
requests to the network and obtains OAI-compliant XML records back. The core search
module then sends the search results back to the application search module and also
updates the access-pattern information repository. Access to documents is also routed
through these two modules in order to update the two repositories with regard to the
supernode attribute, replication of objects, and the access pattern. The application-level
search module also handles the periodic supernode harvesting and the incoming
OAI requests for information about the local collection repository.
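A sketch of how the application-level search module might translate the user sentiments listed above into parameters for the core search service. The strategy names and the depth/TTL values are illustrative assumptions, not part of the protocol.

```python
# Hypothetical mapping from user sentiments to search parameters.
def choose_strategy(sentiment, current_depth=3):
    """Translate a user preference into (search_type, depth) parameters."""
    if sentiment == "broaden":
        return ("BF", current_depth + 2)       # widen breadth-first search
    if sentiment == "community_only":
        return ("BF", 1)                       # friends only, fastest results
    if sentiment == "browse_communities":
        return ("supernode_browse", current_depth)  # route via supernodes
    return ("BF", current_depth)               # default strategy unchanged
```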


Figure 3. Universal client architecture

Figure 4. Publication interface


In addition to the interface issues, we need to implement a number of application-level
algorithms that use the routing protocol to implement, say, the browsing feature in the above
list. The basic idea is to do a breadth-first search of degree j, where j is the
community diameter (either kept in each node or estimated through the support
network), to locate all supernodes, perform a browse on each, and merge the results.
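The browse algorithm just described can be sketched as a bounded breadth-first search. The node attributes (`is_supernode`, `neighbors`) are illustrative assumptions.

```python
# Sketch: collect all supernodes within j hops (the community diameter),
# over which a browse would then run, with results merged afterwards.
from collections import deque

class Node:
    def __init__(self, is_supernode=False):
        self.is_supernode = is_supernode
        self.neighbors = []

def find_supernodes(start, j):
    """Collect all supernodes within j hops of the starting node."""
    seen, found = {start}, []
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if node.is_supernode:
            found.append(node)
        if depth < j:
            for nb in node.neighbors:
                if nb not in seen:
                    seen.add(nb)
                    queue.append((nb, depth + 1))
    return found
```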
Assessment. For the evaluation study described in Section 4.3 we will need assessment
tools that provide statistics about the network and its performance. As all nodes support
OAI, we can use OAI to harvest logs and


analyze them. For this purpose we shall develop a special harvester that can locate all the
nodes in the network and harvest their logs, statistics about the nodes' collections
(number of objects, size), and access-information repositories (searches: number, type;
query results: number of hits, community distribution, wall time to obtain the query; accesses
by other nodes: who, what, how often).
Publishing. This service is probably the least contentious problem. From our various past DL
projects we have tremendous experience in building customizable publication
services for a variety of digital libraries. The one most appropriate for our environment is the
Kepler [Mal01] publication service, because it provides software that lets the user select
which metadata schema should govern the publication process and, further, how to
enforce parts of the metadata fields; e.g., should the creator field in DC be mandatory? For a
sample of the services available in our Kepler project see [Kep]. Figure 4 shows
a typical screenshot of such a service. It will interact with the various metadata handlers the
user chose during installation. Each metadata handler will interact with the repository
through a standard API to deposit or retrieve an object. The repository contains metadata
records and objects of the owner, friends, and contacts as described in Figure 2.
Network maintenance. The module shown in the core service part in Figure 3 handles all
operations that maintain the connectivity of the node to the other nodes in the network.
These operations have been described in section 4.1.
Connection Obstacles. There are a number of well-known obstacles to establishing and
maintaining connections within a P2P network. Firewalls, NAT, PAT, SOCKS, and proxies all
may make it difficult either for someone to connect to the client or for the client to connect to
the outside world. This is a difficult subject that occupies many researchers in its own right;
we will employ a simple solution from the Kepler project [Kep] that works for proxies, NAT,
and limited firewalls.
4.3 Evaluation
In the previous three sections we described the architecture and services of the proposed
DL and identified the major open research issues we will have to resolve. In this section we
describe both the process of validating the correctness of our approach and design, and
that of evaluating, or rather laying the groundwork for estimating, the performance of
any such system. The approach we shall take is to implement a prototype universal client
and deploy it in a test bed here at ODU. We will then perform formative and summative
evaluations of the prototype with an eye towards developing models with predictive
capabilities. We will also use the test bed to analyze the behavior of the network with regard
to community formation and evolution.
Performance estimation. For evaluation purposes we plan to build a test bed at ODU
consisting of three categories of users (faculty, staff, and students) in four departments:
Physics, Computer Science, Biology, and Teacher Education. We have contacts in these
departments and have collaborated with them on various projects in the past. We now
describe some of the experiments we will run to see whether we meet our objectives. Note
that to support collecting data for these experiments we will provide hooks in the universal
client architecture.
Discovery of communities. This experiment assumes that FreeLib has been deployed and
has been running for some time, a week or so. We conduct it by inserting a new node, a
faculty member from the Physics department, into the FreeLib network at a position close to
the Biology nodes in the support network (see the Join protocol description in Section 4.1). We


will monitor the movement of the node's position over time as the new node starts publishing
and searching. We will observe the impact of the number of accesses on the speed at which
the node moves to the right community, in this case the Physics community. We will repeat
this experiment for different insertion points, selecting new nodes from different communities.
How communities evolve. We conduct this experiment by asking a few faculty in Computer
Science and Biology to start activity, in terms of searching and publishing papers, in the
area of Bioinformatics (the assumption here is that initially these faculty members have been
asked to be active only in their respective fields). We will then observe the relation between
the number of accesses and the time it takes for these faculty members to come together
and form a new community. The proximity of the members will be measured on the support
network.
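The proximity measure itself is not spelled out above. Since node ids appear to live on a unit ring (the join example uses id=0.524), circular distance is a natural candidate; the following is an illustrative assumption, not the protocol's defined metric.

```python
# Hypothetical proximity measure for node ids on the [0, 1) identifier ring.
def ring_distance(a, b):
    """Shortest distance between two ids on the unit ring."""
    d = abs(a - b) % 1.0
    return min(d, 1.0 - d)

def community_spread(ids):
    """Maximum pairwise ring distance: a small spread means the members
    have clustered tightly on the support network."""
    return max((ring_distance(x, y) for x in ids for y in ids), default=0.0)
```

Tracking `community_spread` of the participating faculty members' ids over time would quantify how quickly the new community forms.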
Network Characteristics. With this experiment we will observe the impact of frequent
leaving and re-joining on the characteristics of the network. In particular, we will observe
whether the network remains connected at all times and whether the diameter of the network
is as predicted for small-world networks (our sample may be too small to be conclusive).
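Both properties for this experiment can be measured from a snapshot of the link graph with repeated breadth-first search; the adjacency-dict snapshot format is an assumption.

```python
# Sketch of the connectivity and diameter measurements over a link snapshot.
from collections import deque

def eccentricity(links, start):
    """Longest shortest-path distance from start; None if graph disconnected."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in links[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    if len(dist) < len(links):
        return None  # some node unreachable: network not connected
    return max(dist.values())

def diameter(links):
    """Maximum eccentricity over all nodes, or None if disconnected."""
    eccs = [eccentricity(links, s) for s in links]
    return None if None in eccs else max(eccs)
```

A small-world network would show a diameter growing roughly logarithmically in the number of nodes; a `None` result flags a partition event during churn.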
Search Performance. The objective of this experiment is to evaluate the performance of
search in terms of latency, precision, and recall for the different search strategies outlined
in Sections 4.1 and 4.2. For this experiment, we will develop a standard set of queries
covering all the domains in the test bed. These queries will be executed at different
specified points in the network. The results for the different search strategies will be
collected and evaluated.
Evolution towards what community. In the test bed section we described the participants
of the evaluation study and emphasized that they come from four different departments. We
are interested in how these participants will evolve into communities. Will it be along subject
lines, as given by the departmental organization, or will it be something entirely different?
It may be that communities evolve along the lines of organizational groupings such as staff,
faculty, and students, or that grouping by interests such as music, literature, and sports will
dominate. Or it may be that age or gender will be the community-driving factor. The sample
is small but should provide some insight into what the system is capable of and whether it
will serve the needs of the participants: creating resources they are interested in and finding
resources they need. The study will be summative in the sense that we will gather statistics
as the system is deployed, periodically take snapshots, and produce reports on the
composition of communities and on the nature of the evolution: whether it is continuous,
oscillatory, and/or converging to a stable state.
5 Impact and Dissemination
Potentially the impact of this project could be very large if we are completely successful in
demonstrating that the Freelib concept is not only feasible but also performs comparably to
traditional digital libraries, while providing a service that is not available under the current
model: one that scales to all the members of a community and is sustainable. The impact is
only potential because as part of this project we will not supervise the actual deployment of
the universal client in the K-12 and Higher Education environment, but only in the test bed.
We will most definitely complete the project, place the resulting source code in open source,
and encourage the development of additional service features that will make the client
acceptable to the community and encourage adoption by many. We will also pursue the
traditional means of dissemination: publications, education of graduate and undergraduate
students through seminars, and talks at other universities. One aspect that makes this
proposal stand out is that its low-cost instantiation should make it amenable to wide
dissemination among groups that are disadvantaged in the current education system, and
we will attempt to bear this out by encouraging


participation in the test bed by students at traditionally black universities in the Hampton
Roads area and high schools in Norfolk that have a predominant population of African
Americans.
6 Schedule and Deliverables
In the table below we have grouped the activities of the proposed work in terms of
architecture design and code development, test bed development and deployment, and
evaluation. Whereas the Proposed Work section was organized in terms of required
functionality, the table is presented as a flow of tasks.
Tasks | Schedule: Year 1 (months 12, 15), Year 2 (months 18, 21, 24)

Network Architecture
Protocol Design
Protocol Simulation and Validation
Universal Client Architecture
Server component with HTTP support
OAI Harvester and Normalization Component
Duplicate detection support
Protocol Implementation
Client Implementation with DC plug-in
Develop client plug-ins for two communities
Test Bed
Deploy the test bed at Computer Science
Test, refine, and debug the architecture
Include Physics and test the system
Include Biology, and Teacher Education & system test
Integrate harvesting of Freelib from Archon
Evaluation
Experiments for discovering communities
Experiments for evolving communities
Experiments for network evaluation, scalability prediction
Experiments for search performance evaluation

Results from Prior NSF Support

The PIs of this proposal have had a number of NSF grants in the area of digital libraries.
Since this proposal describes collaboration with the Archon project, we chose it as the
representative grant for this section: NSF 0121656, An OAI-Compliant Federated Physics
Digital Library for the NSDL. This ongoing NSDL project is nearing completion, with results
better than expected in some cases and worse than expected in others. Archon is a
collaborative project between Old Dominion University, the American Physical Society, and
Los Alamos National Laboratory. The project is building an Open Archives Initiative
compliant federated digital library with an emphasis on physics for the National Science
Digital Library. NSDL, funded by the National Science Foundation, is the comprehensive
source for science, technology, engineering, and mathematics education. This physics
digital library federates holdings from the physics e-print server arXiv, Physical Review D
from the American Physical Society, CERN, and a number of smaller holdings. We have
developed high-level services such as cross-reference linking, which is based on OpenURL
and leverages the Citebase research at Southampton. A number of unique services for the
physics community have been developed, equation-based search being one of them; others
are author and subject similarity. We have also developed a strategy that relies on simple
search followed by result-set processing (recursive search), which addresses the well-known
aversion of users to advanced search. Finally, all these services have been integrated with
OAI-based dynamic harvesting of the contributing collections.