
P2P Data Management

Kunal Kataria IIIT Delhi 20008030

Vikas Sangwan IIIT Delhi 2008058

Peer-to-Peer (P2P) networks have seen growing demand from users and are now accepted as a standard way of distributing information. Transient peer populations, large scale, storage, security, and privacy are some of the issues that make data management in P2P systems challenging. In this report, we describe the different types of P2P system models, outline the challenges and problems in P2P data management, and briefly describe some methods that can be used to address these problems.


Peer-to-Peer systems, superpeers, overlay networks, clustering, data management, schema, free riding.

The growth of P2P over the past decade indicates that it has considerable untapped potential and is here to stay. As P2P has evolved and its applications have advanced, new challenges and issues have emerged. New applications have semantically richer data, which gives rise to data management issues. Moreover, because the data resides on different peers that are themselves transient and autonomous, managing it is challenging. In this report, we describe different P2P overlay networks and representative systems. We then discuss the key issues in data management, particularly those relating to P2P architecture, as well as the free riding problem and the incentive mechanisms used to encourage peer cooperation.




Peer-to-Peer (P2P) is a computing paradigm in which a collection of distributed nodes (peers) share computer resources in a decentralized manner. It contrasts with the traditional client/server model of the web, in which a dedicated server is responsible for providing a particular service to clients. In a P2P system there is no such division: each peer acts as both client and server and contributes its own resources, whereas in the client/server model all the load falls on the server. Apart from decentralization and resource sharing, P2P systems have other desirable features such as anonymity, scalability, and self-organization. Because of these benefits, there is an ongoing shift from the client/server model to the P2P model, which is a low-cost, flexible alternative.

Peer-to-peer systems became famous through file sharing applications such as Napster, LimeWire, Gnutella, and several related systems, primarily because they offered a way for people to get music for free. However, P2P technology has many other applications, such as data distribution and Internet telephony (Skype). In fact, P2P has potential applications wherever users can benefit from sharing information among themselves in a decentralized manner, and this concept has inspired new structures and philosophies in many areas of human interaction. For example, in navigation systems, P2P location-based clustering mechanisms and load balancing can increase efficiency. With P2P it is even possible to build a completely decentralized, collaborative variant of a Google-style web search engine.

In "pure" P2P systems, every node acts as a server and client and they share resources without any centralized control. However, most P2P applications have some degree of centralization. These are called "hybrid" P2P networks and they centralize at least the list of users. This is how instant messengers or file sharing programs work - the system keeps a list of users with their IP addresses. We can broadly categorize the architecture of P2P systems according to the presence or absence of centralized components.

2. Types of P2P System Models


2.1 Purely Decentralized

In purely decentralized systems, all peers perform the same tasks, and there is no central coordination of their activities.

2.1.1 Gnutella
In Gnutella there is no central coordination of sharing activities in the network, and file downloads are executed directly between two peers. Queries are distributed by flooding: a peer forwards each message it receives to all of its connected peers. A peer announces itself when it joins the network by sending a special message (Ping) to its connected peers. Those peers send back a message (Pong) identifying themselves and also forward the Ping to their own connected peers. Whenever a peer wants to download a file, it sends a Query message to its connected peers, which in turn flood the Query to other peers. When the requested file is located at a peer, the requester receives a QueryHit message from that peer, which results in the establishment of a direct connection between them for the file download. The search mechanism of Gnutella is illustrated in Figure 1.


2.3 Partially Centralized

In partially centralized systems, some of the peers, called superpeers, act as local central servers maintaining indices for the files shared by local peers.

2.3.1 FastTrack
FastTrack uses supernodes (superpeers) to improve scalability. An ordinary node sends a request for a file to its nearest supernode. The supernode holds metadata about the files stored at its local peers and responds to the query with the location of the file. If the supernode cannot find the requested file, it forwards the query to other supernodes. The search mechanism of FastTrack is illustrated in Figure 3.
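The supernode lookup described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the FastTrack protocol itself: the `SuperNode` class, its method names, and the peer labels are all hypothetical, and the forwarding strategy (try each neighbouring supernode in turn) is a simplification.

```python
class SuperNode:
    """Hypothetical supernode holding an index of its local peers' files."""

    def __init__(self, name):
        self.name = name
        self.index = {}       # file name -> set of local peers holding it
        self.neighbors = []   # other supernodes this one can forward to

    def register(self, peer, shared_files):
        """Record the files that an ordinary local peer shares."""
        for file_name in shared_files:
            self.index.setdefault(file_name, set()).add(peer)

    def lookup(self, file_name, _visited=None):
        """Answer from the local index, else forward to other supernodes."""
        visited = _visited if _visited is not None else set()
        visited.add(self.name)
        if file_name in self.index:
            return sorted(self.index[file_name])
        for other in self.neighbors:
            if other.name not in visited:   # avoid asking a supernode twice
                found = other.lookup(file_name, visited)
                if found:
                    return found
        return []


s1, s2, s3 = SuperNode("s1"), SuperNode("s2"), SuperNode("s3")
s1.neighbors = [s2]
s2.neighbors = [s1, s3]
s3.register("p9", ["a.mp3"])
print(s1.lookup("a.mp3"))   # the query travels s1 -> s2 -> s3
```

Only supernodes take part in the search here, which is the point of the design: ordinary peers are never queried directly.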
Figure 1: Search mechanism of Gnutella

A Time To Live (TTL) value, initialized to a small integer, is associated with each Gnutella message to prevent it from circulating in the network forever. This value is decremented at each hop, and when it reaches 0 the message is dropped.
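TTL-limited flooding can be sketched as a breadth-first traversal of the overlay. The code below is a simplified model, not the Gnutella wire protocol: the function name `flood_query`, the dictionary-based overlay, and the message counting are assumptions made for illustration.

```python
from collections import deque


def flood_query(graph, files, start, wanted, ttl):
    """Flood a query from `start`; return (peers holding the file, messages sent)."""
    seen = {start}            # peers that have already handled this query
    hits = []                 # peers that would answer with a QueryHit
    messages = 0
    queue = deque([(start, ttl)])
    while queue:
        peer, remaining = queue.popleft()
        if peer != start and wanted in files.get(peer, set()):
            hits.append(peer)
        if remaining == 0:
            continue          # TTL exhausted: the message is not forwarded
        for neighbor in graph.get(peer, []):
            messages += 1     # one Query message per forwarded copy
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, remaining - 1))
    return hits, messages


overlay = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}
shared = {"D": {"song.mp3"}, "E": {"song.mp3"}}
print(flood_query(overlay, shared, "A", "song.mp3", ttl=2))
print(flood_query(overlay, shared, "A", "song.mp3", ttl=3))
```

With a TTL of 2 the query never reaches peer E, which is three hops away. Raising the TTL finds it, at the cost of more messages.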


2.2 Hybrid Decentralized

In hybrid decentralized systems, a central server manages the list and location of the files stored at peers. The central server processes search requests: it identifies the peers that have the requested file and sends the location of the file to the requester. The file exchange then takes place directly between two peers.

2.2.1 Napster
A central Napster server maintains a list of the music files shared by the peers currently connected to the Napster network. It supports keyword-based searches. The requester sends a query to the Napster server to get the location of the file it wants to download. After receiving the response from the server, the requester establishes a connection with the peer that has the requested file. The Napster server plays no role in the download itself, which is done directly between the peers. The shared file list is updated automatically as peers connect to or disconnect from the network. The search mechanism of Napster is illustrated in Figure 2.
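The role of the central server can be sketched as a small in-memory index. This is an illustrative sketch, not Napster's actual implementation: the `CentralIndex` class, its method names, and the sample addresses are assumptions.

```python
class CentralIndex:
    """Hypothetical central directory: it stores locations, never file contents."""

    def __init__(self):
        self._files = {}  # file name -> set of peer addresses sharing it

    def connect(self, peer, shared_files):
        """Add a peer's shared files to the list when it joins."""
        for name in shared_files:
            self._files.setdefault(name, set()).add(peer)

    def disconnect(self, peer):
        """Remove a departing peer, dropping files that nobody shares any more."""
        for name in list(self._files):
            self._files[name].discard(peer)
            if not self._files[name]:
                del self._files[name]

    def search(self, keyword):
        """Keyword search over file names; returns file name -> peer locations."""
        keyword = keyword.lower()
        return {
            name: sorted(peers)
            for name, peers in self._files.items()
            if keyword in name.lower()
        }


index = CentralIndex()
index.connect("10.0.0.1", ["Blue Train.mp3"])
index.connect("10.0.0.2", ["So What.mp3"])
print(index.search("blue"))   # the requester then contacts 10.0.0.1 directly
```

The download itself is outside this class, mirroring the fact that the server only answers location queries.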
Figure 2: Search mechanism of Napster

Figure 3: Search mechanism of FastTrack

P2P systems can also be distinguished by their structure. In unstructured networks, the placement of files on peers does not follow specific rules; the system is created in an ad hoc manner as peers and files are added. In structured networks, by contrast, files are placed at precisely specified peers, and the mapping of files to peers is typically achieved with Distributed Hash Tables (DHTs). In a structured network, queries submitted to locate files can be routed efficiently to the peers holding the desired file. Unstructured networks are appropriate for highly transient peer populations, in which peers join and leave at a high rate, but they face scalability problems. Structured networks handle scalability well, but the maintenance cost of structured overlays is high when the peer population is transient.

Examples of unstructured P2P systems include Gnutella and Napster; examples of structured systems include CAN and Chord.
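The idea of mapping files to precisely specified peers can be sketched with a hash ring, in the spirit of Chord's successor rule. This is a simplified illustration, not the Chord or CAN algorithm: there are no finger tables and no routing, and the names `ring_id`, `HashRing`, and the 32-bit identifier space are assumptions chosen for the example.

```python
import hashlib
from bisect import bisect_left

RING_BITS = 32  # assumed size of the identifier space for this sketch


def ring_id(name):
    """Hash a peer name or file name onto the identifier ring."""
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** RING_BITS)


class HashRing:
    """Maps each file to the first peer at or after the file's ring position."""

    def __init__(self, peers):
        self._entries = sorted((ring_id(p), p) for p in peers)
        self._ids = [identifier for identifier, _ in self._entries]

    def responsible_peer(self, file_name):
        position = bisect_left(self._ids, ring_id(file_name))
        # Wrap around to the first peer when the key is past the last peer.
        return self._entries[position % len(self._entries)][1]


ring = HashRing(["peer-a", "peer-b", "peer-c", "peer-d"])
print(ring.responsible_peer("song.mp3"))
```

Because placement depends only on the hash values, any peer can compute where a file belongs without flooding. When a peer other than the responsible one leaves, the file's owner does not change, though every join or leave still requires some index maintenance.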

3. Issues in P2P Data Management

3.1 Indexing

In conventional distributed systems, centralized or distributed indices are used to locate data while processing queries. P2P systems have highly transient peer populations and an enormous number of participating peers. Their indexing structures must therefore handle frequent updates and must be scalable. There are three basic types of P2P indices: local, centralized, and distributed.

In local indexing, each peer indexes only its own content. P2P systems with a purely local data index usually search for data by flooding. Issue: flooding, which scales poorly as the network grows.

In the case of a centralized index, a single server manages the location information for the data stored on all the peers in the system. Napster is a good example of a P2P system with a centralized index. Issue: reliability, because the central server is a single point of failure.

In P2P systems with a distributed index, the distribution of the index depends on whether the underlying overlay network is structured or unstructured.

Describing the required data with more accuracy and detail requires a correspondingly more expressive query language. Key lookups for searching text documents and simple keyword queries for searching structured data are not expressive enough; more sophisticated querying approaches are required.

Query propagation also demands attention. Queries submitted in a P2P system need to be routed to the peers responsible for maintaining the location of the needed data. The routing schemes used for this purpose fall into two main categories: blind search and informed search. These are also known as the recursive method and the direct method.

With a blind search method, no information about data placement is stored. When a query reaches a node, the node reissues the query and waits for a response from all the nodes that it queried. Example: flooding.

With informed search methods, some data placement information is maintained at each peer. Queries are routed to those peers that have some information about the location of the requested data. Routing is therefore more effective than with blind search, and the number of messages needed to locate data is reduced. Example: Query Routing Protocol (QRP).
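Informed search can be illustrated with a toy routing table in which each peer keeps, for every neighbour, a summary of the keywords reachable through it. This is a hedged sketch only: it does not reproduce QRP's message formats or hashing, it does no backtracking, and the function name `informed_route` and the data layout are hypothetical.

```python
def informed_route(routing_hints, files, start, keyword):
    """Follow per-neighbour keyword summaries; return (path taken, found?)."""
    path = [start]
    current = start
    visited = {start}
    while keyword not in files.get(current, set()):
        # Forward only to neighbours whose summary advertises the keyword.
        next_hops = [
            neighbor
            for neighbor, summary in routing_hints.get(current, {}).items()
            if keyword in summary and neighbor not in visited
        ]
        if not next_hops:
            return path, False
        current = sorted(next_hops)[0]   # deterministic choice for the sketch
        visited.add(current)
        path.append(current)
    return path, True


hints = {
    "A": {"B": {"jazz"}, "C": {"rock"}},
    "B": {"D": {"jazz"}},
}
stored = {"C": {"rock"}, "D": {"jazz"}}
print(informed_route(hints, stored, "A", "jazz"))
```

In this example the query for "jazz" is sent along a single path rather than to every neighbour, which is where the saving in messages comes from.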



3.2 Data Integration

Data sharing is one of the primary objectives of P2P systems. When the shared data located at different peers is related, semantic issues arise: two peers may well hold the same file under different names. As a result, a requested file may be present in the network and still fail to show up in search results. In other words, the same data residing at different places can have different semantics, so some way of communicating between schemas is needed. To handle this issue, the heterogeneity among the data sources should be removed so that there is a uniform querying environment. A common data sharing approach proposed for traditional distributed systems is to provide a global mediated schema, but this is not directly applicable to P2P environments, given the peer autonomy, volatility, and scalability aspects of P2P systems. Instead, coordination formulas or schema mappings are used, which allow peers to express how the data at one peer relates to the data at another peer.
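A pairwise schema mapping can be sketched as a simple attribute renaming applied to a query before it is sent to another peer. This is a deliberately minimal illustration: real coordination formulas are considerably richer, and the attribute names, the mapping, and the functions `translate_query` and `search` are assumptions made for the example.

```python
def translate_query(query, mapping):
    """Rewrite attribute names in `query` using a peer-to-peer schema mapping."""
    return {mapping.get(attr, attr): value for attr, value in query.items()}


def search(records, query):
    """Return the records whose attributes match every condition in `query`."""
    return [r for r in records if all(r.get(a) == v for a, v in query.items())]


# Peer A describes music with (title, artist); peer B uses (name, performer).
peer_b_records = [
    {"name": "Blue Train", "performer": "John Coltrane"},
    {"name": "So What", "performer": "Miles Davis"},
]
a_to_b = {"title": "name", "artist": "performer"}

query_from_a = {"artist": "Miles Davis"}
print(search(peer_b_records, query_from_a))                           # no match
print(search(peer_b_records, translate_query(query_from_a, a_to_b)))  # match
```

Each mapping relates only two peers, so no global mediated schema is needed.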

In a distributed system, data items with common properties can be grouped together to form data clusters. Clustering gives better search results because related data sits in nearby locations, so queries have less distance to travel. Besides clustering the data, it is also possible to cluster the peers based on their interests or the data they hold, which also helps to increase search quality. However, several issues make clustering in a P2P environment difficult. Autonomy is violated, since peers are forced to store specific data. In addition, because of the dynamic nature of P2P, peers often connect and disconnect, so the clusters formed in a P2P system need to adapt dynamically to frequent changes in peer statistics. The lack of global knowledge of data and peer interests is a further serious difficulty in forming clusters in P2P systems.

4. Challenges and Incentive Based Mechanisms

Much of the promise of P2P systems originates from their independence from dedicated infrastructure and centralized control. However, these very properties also expose P2P systems to some unique challenges not faced by other types of distributed systems. In general, current P2P applications suffer from a number of limitations:


Query Processing

To support various types of applications, the required data must be described with more accuracy and detail.

Quality of service: A certain level of quality cannot be guaranteed because applications interfere with one another uncontrollably, which often results in poor quality for P2P-based applications such as live video streaming or telephony.

Inefficient use of network resources, and consequently poor performance.

Design based around selfish user behavior and free-riding prevention mechanisms, rather than on well-thought-out resource scheduling that maximizes the performance of the overall system.

Security and trust: The lack of security and control, which follows from the absence of a central authority and the self-organizing nature of P2P systems, makes it impossible to guarantee the integrity, security, and authenticity of content, which limits the quality and diversity of available content.

4.1.2 Reciprocity Based

In this approach, each peer decides how to react to another peer's service request based on that peer's past behaviour toward its own requests. A Tit-for-Tat mechanism decides to which peers a file will be uploaded and at what bandwidth: a peer uploads to the peers that give it a good download rate, and the other peers are not allowed to download.
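The Tit-for-Tat decision can be sketched as choosing the peers that have recently provided the best download rates. This is a simplified model under stated assumptions: the function name `choose_uploads`, the fixed number of upload slots, and the rate figures are hypothetical, and refinements used by deployed protocols are omitted.

```python
def choose_uploads(download_rates, slots):
    """Return the peers to upload to: the `slots` best recent uploaders to us."""
    # Peers that gave us nothing are never served under this rule.
    contributors = [(rate, peer) for peer, rate in download_rates.items() if rate > 0]
    # Highest rate first; peer name breaks ties so the result is deterministic.
    contributors.sort(key=lambda item: (-item[0], item[1]))
    return [peer for _, peer in contributors[:slots]]


# Download rates (for example in kB/s) recently observed from each peer.
observed = {"p1": 120.0, "p2": 0.0, "p3": 300.0, "p4": 45.0}
print(choose_uploads(observed, slots=2))
```

Peer `p2`, which contributed nothing, is excluded however many slots are free.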

4.1.3 Reputation Based

A P2P reputation system produces a reputation rating for each peer. The contribution of each peer to the system is monitored to determine its rating, and peers with a high reputation are offered better services.
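One possible way to turn monitored contributions into a rating is sketched below. The scoring rule (the share of a peer's traffic that is upload) is an assumption chosen purely for illustration, as are the function names `reputation` and `rank_peers`; the text above does not prescribe a particular formula.

```python
def reputation(uploaded, downloaded):
    """Assumed rating: fraction of a peer's total traffic that it contributed."""
    total = uploaded + downloaded
    return uploaded / total if total else 0.0


def rank_peers(stats):
    """Order peers from highest to lowest reputation (name breaks ties)."""
    return sorted(stats, key=lambda peer: (-reputation(*stats[peer]), peer))


# Monitored (uploaded, downloaded) volumes per peer, in arbitrary units.
monitored = {"p1": (500, 500), "p2": (0, 900), "p3": (800, 200)}
print(rank_peers(monitored))   # peers earlier in the list get better service
```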

4.1 Free Riding

A free rider is a peer that exploits P2P network resources but does not contribute to the network at an acceptable level. When a majority of peers are free riders, only a small number of peers serve a large population, which means that the capability of the P2P architecture is not being fully utilized. Free riding degrades the performance of the system and makes it more vulnerable. If it is not dealt with appropriately, free riding therefore poses a serious threat to the widespread use and efficient operation of P2P systems. In a study performed on the Gnutella network, it was observed that 85% of peers do not share any files at all [5]. Moreover, the top 1% of sharing peers provide 50% of all query hits, and the top 25% provide 98%. Most P2P systems in use lack effective mechanisms against free riding and therefore suffer from it. To address this, several approaches have been proposed that incorporate incentives into existing protocols to encourage cooperation among peers. These approaches can be categorized into three main groups:



P2P has a lot to offer, but it is limited by the fact that it is unsecured, cannot guarantee a minimum quality of service and, more importantly, is uncontrolled. We have outlined the key research issues associated with P2P data management. More research is required to explore a broad range of P2P issues such as peer-node identity, naming, resource discovery, and security.


[1] Mika Ylianttila, Erkki Harjula, Timo Koskela and Jaakko Sauvola. Analytical Model for Mobile P2P Data Management System. [2008]
[2] Rodrigo Rodrigues and Peter Druschel. Peer-To-Peer System. [2010]
[3] Thomas Locher, Patrick Moor. Free Riding in BitTorrent is Cheap. [2006]
[4] Francis Otto, Drake Patrick Mirembe. A Model for Data Management in Peer-to-Peer Systems. [2007]
[5] Özgür Ulusoy. Research Issues in Peer-to-Peer Data Management. [2009]
[6] A. Crespo, H. Garcia-Molina. Semantic Overlay Networks for P2P Systems. [2011]
[7] Albena Roshelova. A Peer-to-Peer Database Management System. [2004]

4.1.1 Micropayment Based

This approach requires peers to pay for the services they receive or the resources they consume with a virtual currency distributed by a central authority, which also ensures honest transactions. It aims to encourage peer cooperation within P2P systems by providing an efficient and secure pricing mechanism. The problem with this approach is that the requirement for a centralized authority conflicts with the decentralized nature of the P2P paradigm.
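The micropayment idea can be sketched as a central ledger that issues virtual currency and validates each transfer. This is an illustrative sketch only: the `Bank` class, its method names, and the initial grant are assumptions, and a real scheme would also need authentication and protection against double spending.

```python
class Bank:
    """Hypothetical central authority that issues and transfers virtual currency."""

    def __init__(self, initial_grant):
        self._grant = initial_grant
        self._balances = {}

    def join(self, peer):
        """Give a newly joined peer its starting balance (once only)."""
        self._balances.setdefault(peer, self._grant)

    def balance(self, peer):
        return self._balances[peer]

    def pay(self, payer, payee, amount):
        """Move `amount` from payer to payee; refuse if the payer cannot afford it."""
        if amount <= 0 or self._balances[payer] < amount:
            return False
        self._balances[payer] -= amount
        self._balances[payee] += amount
        return True


bank = Bank(initial_grant=10)
bank.join("downloader")
bank.join("uploader")
print(bank.pay("downloader", "uploader", 4), bank.balance("downloader"))
```

Every transfer passes through the single `Bank` object, which makes the conflict with decentralization concrete: the ledger is a central point of control and of failure.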