CHAPTER 1
INTRODUCTION
1.1 CLOUD COMPUTING
A key differentiating element of successful information technology (IT) is its ability to become a true, valuable, and economical contributor to cyber infrastructure. Cloud computing embraces cyber infrastructure and builds upon decades of research in virtualization, distributed computing, grid computing, utility computing and, more recently, networking, web and software services. It implies a service-oriented architecture, reduced information technology overhead for the end user, greater flexibility, reduced total cost of ownership, on-demand services and many other things.
Cloud computing is the delivery of computing and storage capacity as a service to a community of end recipients. The name comes from the use of a cloud-shaped symbol as an abstraction for the complex infrastructure it contains in system diagrams. Cloud computing entrusts services with a user's data, software and computation over a network. Cloud computing is the internet-based development and use of computer technology.
A cloud service is an independent piece of software that can be used in conjunction with other services to achieve interoperable machine-to-machine interaction over the network. Typical cloud computing services provide common business applications online that are accessed from a web browser, while the software and data are stored on the servers. Cloud computing is a large-scale distributed computing paradigm in which a pool of computing resources is available to users via the internet; computing resources, e.g., processing power, storage, software and network bandwidth, are presented to cloud consumers as accessible public utility services.
These services are broadly divided into three categories:
a) Software as a Service (SaaS)
b) Platform as a Service (PaaS)
c) Infrastructure as a Service (IaaS)
a) Software as a Service (SaaS)
Software as a service features a complete application offered as a
service on demand. A single instance of the software runs on the cloud and
services multiple end users or client organizations. Software as a Service
(SaaS) is a software distribution model in which applications are hosted by a
vendor or service provider.
SaaS is becoming an increasingly prevalent delivery model as
underlying technologies that support web services and service-oriented
architecture (SOA) mature and new developmental approaches such as Ajax
become popular. Meanwhile, broadband service has become increasingly
available to support user access from more areas around the world. SaaS is
closely related to the ASP (Application Service Provider) and on demand
computing software delivery models. IDC identifies two slightly different
delivery models for SaaS.
b) Platform as a Service (PaaS)
Platform as a service encapsulates a layer of software and provides it as a service that can be used to build higher level services. There are at least two perspectives on PaaS, depending on whether one is the producer or the consumer of the services. Someone producing PaaS might produce a platform by integrating an OS, middleware, application software and even a development environment that is then provided to a customer as a service.
c) Infrastructure as a Service (IaaS)
Infrastructure as a service delivers basic storage and compute
capabilities as standardized services over the network. Servers, storage
systems, switches, routers and other systems are pooled and made available to
handle workloads that range from application components to high
performance computing applications. Commercial examples of IaaS include
Joyent, whose main product is a line of virtualized servers that provide a
highly available on demand infrastructure. The name cloud computing was
inspired by the cloud symbol that is often used to represent the internet in
flow charts and diagrams.
1.1.1 Cloud Platform
A cloud platform is a kind of platform that lets developers write applications that run in the cloud, use services provided from the cloud, or both. Different names are used for this kind of platform today, including on-demand platform and platform as a service (PaaS). It is a method of managing data (files, photos, music, videos, etc.) through one or more web-based solutions. Rather than keeping data primarily on hard drives that are tethered to computers or other devices, the data is kept in the cloud, where it may be accessible from any number of devices. Cloud infrastructure is the concept of providing 'hardware as a service', i.e., shared/reusable hardware for a specific time of service. Examples include virtualization, grid computing and para-virtualization. This service helps reduce the maintenance and usage costs that come with infrastructure management and upgrades.
1.1.2 Cloud Concepts
A powerful underlying and enabling concept is computing through Service Oriented Architectures (SOA): the delivery of an integrated and orchestrated suite of functions to an end user through the composition of both loosely and tightly coupled functions, or services, which are often network based. Related concepts are component-based system engineering, orchestration of different services through workflows, and virtualization.
1.1.3 Service Oriented Architecture
SOA is not a new concept, although it has again been receiving considerable attention in recent years. Examples of some of the first network-based service-oriented architectures are Remote Procedure Calls (RPC) and object request brokers based on the CORBA specification. More recent examples are the so-called grid computing architectures and solutions.
In an SOA environment, end-users request an IT service at the desired
functional, quality and capacity level and receive it either at the time
requested or at a specified later time. Service discovery, brokering and
reliability are important and services are usually designed to interoperate, as
are the composites made of these services. It is expected that in the next 10
years, service-based solutions will be a major vehicle for delivery of
information and other IT-assisted functions at both individual and
organizational levels, e.g., software applications, web-based services,
personal and business desktop computing, high-performance computing.
1.1.4 Cloud Components
The key to a SOA framework that supports workflows is
componentization of its services, an ability to support a range of couplings
among workflow building blocks, fault tolerance in its data and process aware
service-based delivery, and an ability to audit processes, data and results, i.e.,
collect and use provenance information.
A component-based approach is characterized by reusability, substitutability, extensibility, scalability, customizability and computability. Important considerations also include the reliability and availability of the components and services, the cost of the services, security, total cost of ownership and economy of scale. In the context of cloud computing, users distinguish many categories of components: from differentiated and undifferentiated hardware, to general purpose and specialized software and applications, to real and virtual images, to environments, to no-root differentiated resources, to workflow-based environments and collections of services, and so on.
1.1.5 Cyber Infrastructure
Cyber infrastructure makes applications dramatically easier to develop and deploy, thus expanding the feasible scope of applications possible within budget and organizational constraints, and shifting scientists' and engineers' effort away from information technology development and concentrating it on scientific and engineering research.
Cyber infrastructure also increases efficiency, quality, and reliability by
capturing commonalities among application needs, and facilitates the efficient
sharing of equipment and services.
Today, almost any business or major activity uses, or relies in some form on, IT and IT services. These services need to be enabling and appliance-like, and there must be an economy of scale for the total cost of ownership to be better than it would be without cyber infrastructure. Technology needs to improve end-user productivity and reduce technology-driven overhead. For example, unless IT is the primary business of an organization, less than 20% of its effort not directly connected to its primary business should have to do with IT overhead, even though 80% of its business might be conducted using electronic means.
CHAPTER 2
LITERATURE REVIEW
2.1 MINIMUM COST BENCHMARKING FOR INTERMEDIATE
DATASET STORAGE IN SCIENTIFIC CLOUD
Dong Yuan et al (2007) have proposed a minimum cost benchmark for scientific applications. Scientific applications are usually complex and data intensive. In many fields, such as astronomy, high-energy physics and bioinformatics, scientists need to analyse terabytes of data either from existing data resources or collected from physical devices. The scientific analyses are usually computation intensive and hence take a long time to execute. Workflow technologies can be used to automate these scientific applications. Accordingly, scientific workflows are typically very complex: they usually have a large number of tasks and need a long time for execution.
During the execution, a large volume of new intermediate datasets is generated. They can be even larger than the original dataset(s) and contain important intermediate results. After the execution of a scientific workflow, some intermediate datasets may need to be stored for future use: scientists may need to re-analyse the results or apply new analyses to the intermediate datasets, and for collaboration the intermediate results may need to be shared among scientists from different institutions and reused. Storing valuable intermediate datasets can save their regeneration cost when they are reused, not to mention the waiting time saved by avoiding regeneration. Given the large sizes of the datasets, running scientific workflow applications usually requires not only high performance computing resources but also massive storage.
Nowadays, popular scientific workflows are often deployed in grid
systems because they have high performance and massive storage. However,
building a grid system is extremely expensive and it is normally not an option
for scientists all over the world. The emergence of cloud computing
technologies offers a new way to develop scientific workflow systems, in
which one research topic is cost-effective strategies for storing intermediate
datasets.
In late 2007, the concept of cloud computing was proposed and it is
deemed the next generation of IT platforms that can deliver computing as a
kind of utility. Foster et al. made a comprehensive comparison of grid
computing and cloud computing. Cloud computing systems provide high
performance and massive storage required for scientific applications in the
same way as grid systems, but with a lower infrastructure construction cost
among many other features, because cloud computing systems are composed
of data centres which can be clusters of commodity hardware. Research into
doing science and data-intensive applications on the cloud has already
commenced, such as early experiences like the Nimbus and Cumulus projects.
The work by Deelman et al. shows that cloud computing offers a cost
effective solution for data-intensive applications, such as scientific
workflows. Furthermore, cloud computing systems offer a new model:
namely, that scientists from all over the world can collaborate and conduct
their research together. Cloud computing systems are based on the Internet,
and so are the scientific workflow systems deployed in the cloud. Scientists
can upload their data and launch their applications on the scientific cloud
workflow systems from everywhere in the world via the Internet, and they
only need to pay for the resources that they use for their applications. As all
the data are managed in the cloud, it is easy to share data among scientists.
Scientific cloud workflows are deployed in a cloud computing environment, where the use of all resources needs to be paid for. For a scientific cloud workflow system, storing all the intermediate datasets generated during workflow executions may cause a high storage cost. In contrast, if users delete all the intermediate datasets and regenerate them every time they are needed, the computation cost of the system may well be very high too. The goal of an intermediate dataset storage strategy is to reduce the total cost of the whole system; the best way is to find a balance that selectively stores some popular datasets and regenerates the rest when needed.
This system proposes a novel algorithm that can calculate the minimum cost for intermediate dataset storage in scientific cloud workflow systems. The intermediate datasets in scientific cloud workflows often have dependencies. These generation relationships are a kind of data provenance. Based on the data provenance, an Intermediate data Dependency Graph (IDG) is created, which records the information of all the intermediate datasets that have ever existed in the cloud workflow system, no matter whether they are stored or deleted. With the IDG, users know how the intermediate datasets are generated and can further calculate their generation cost.
Given an intermediate dataset, its generation cost is divided by its usage rate, so that this cost can be compared with its storage cost per time unit, where a dataset's usage rate is the time between successive usages of the dataset. Users can then decide whether an intermediate dataset should be stored or deleted in order to reduce the system cost. Given the historic usages of the datasets in an IDG, the authors propose a Cost Transitive Tournament Shortest Path (CTT-SP) based algorithm that can find the minimum cost storage strategy of the intermediate datasets on demand in scientific cloud workflow systems. This minimum cost can be used as a benchmark to evaluate the cost effectiveness of other intermediate dataset storage strategies.
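A minimal sketch of this store-or-delete decision rule follows. The field names (generation_cost, usage_interval, storage_cost_rate) are assumptions introduced for illustration; the CTT-SP algorithm itself, which searches over whole storage strategies, is not reproduced here.

```python
# Illustrative sketch of the per-dataset decision rule described above.
# Field names are hypothetical; this is not the paper's CTT-SP algorithm.
from dataclasses import dataclass

@dataclass
class IntermediateDataset:
    name: str
    generation_cost: float    # cost to regenerate the dataset once
    usage_interval: float     # time between successive usages
    storage_cost_rate: float  # storage cost per time unit

def should_store(ds: IntermediateDataset) -> bool:
    # Amortised regeneration cost per time unit vs. storage cost per time unit.
    return ds.generation_cost / ds.usage_interval > ds.storage_cost_rate

datasets = [
    IntermediateDataset("d1", generation_cost=100.0, usage_interval=10.0, storage_cost_rate=5.0),
    IntermediateDataset("d2", generation_cost=20.0, usage_interval=50.0, storage_cost_rate=2.0),
]
for ds in datasets:
    print(ds.name, "store" if should_store(ds) else "delete")
```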
2.2 PRIVACY PRESERVING MULTI-KEYWORD RANKED
SEARCH OVER ENCRYPTED CLOUD DATA
Ning Cao et al (2007) have proposed that cloud computing is the long-dreamed vision of computing as a utility, where cloud customers can remotely store their data in the cloud so as to enjoy the on-demand high quality applications and services from a shared pool of configurable computing resources. To protect data privacy and combat unsolicited accesses in the cloud and beyond, sensitive data, e.g., emails, personal health records, photo albums, tax documents, financial transactions, etc., may have to be encrypted by data owners before outsourcing to the commercial public cloud; this, however, obsoletes the traditional data utilization service based on plaintext keyword search.
The trivial solution of downloading all the data and decrypting it locally is clearly impractical, due to the huge bandwidth cost in cloud-scale systems. Moreover, aside from eliminating local storage management, storing data in the cloud serves no purpose unless it can be easily searched and utilized. Thus, exploring a privacy-preserving and effective search service over encrypted cloud data is of paramount importance. Considering the potentially large number of on-demand data users and the huge amount of outsourced data documents in the cloud, this problem is particularly challenging, as it is extremely difficult to also meet the requirements of performance, system usability and scalability.
On the one hand, to meet the effective data retrieval need, the large number of documents demands that the cloud server perform result relevance ranking instead of returning undifferentiated results. Such a ranked search system enables data users to find the most relevant information quickly, rather than burdensomely sorting through every match in the content collection.
Ranked search can also elegantly eliminate unnecessary network traffic by sending back only the most relevant data, which is highly desirable in the pay-as-you-use cloud paradigm. For privacy protection, such a ranking operation, however, should not leak any keyword-related information. On the other hand, to improve search result accuracy as well as to enhance the user searching experience, it is also crucial for such a ranking system to support multiple-keyword search, as single-keyword search often yields far too coarse results.
As a common practice indicated by today's web search engines, data
users may tend to provide a set of keywords instead of only one as the
indicator of their search interest to retrieve the most relevant data. And each
keyword in the search request is able to help narrow down the search result
further. Coordinate matching, i.e., as many matches as possible, is an
efficient principle among such multi-keyword semantics to refine the result
relevance, and has been widely used in the plaintext Information Retrieval
(IR) community. However, how to apply it in the encrypted cloud data search
system remains a very challenging task because of inherent security and
privacy obstacles, including various strict requirements like data privacy,
index privacy, keyword privacy, and many others.
In the literature, searchable encryption is a helpful technique that treats encrypted data as documents and allows a user to securely search over them with a single keyword and retrieve documents of interest. However, direct application of these approaches to deploy a secure large-scale cloud data utilization system would not necessarily be suitable, as they are developed as crypto primitives and cannot accommodate high service-level requirements such as system usability, user searching experience and easy information discovery.
Although some recent designs have been proposed to support boolean keyword search as an attempt to enrich search flexibility, they are still not adequate to provide users with acceptable result ranking functionality. The authors' earlier work has been aware of this problem, and solves secure ranked search over encrypted data with support for only single-keyword queries. But how to design an efficient encrypted data search mechanism that supports multi-keyword semantics without privacy breaches still remains a challenging open problem.
This system defines and solves the problem of multi-keyword ranked search over encrypted cloud data while preserving strict system-wise privacy in the cloud computing paradigm. Among various multi-keyword semantics, the authors choose the efficient principle of coordinate matching, i.e., as many matches as possible, to capture the similarity between the search query and the data documents. Specifically, they use inner product similarity, i.e., the number of query keywords appearing in a document, to quantitatively evaluate the similarity of that document to the search query under the coordinate matching principle.
During index construction, each document is associated with a binary vector as a sub-index, where each bit represents whether the corresponding keyword is contained in the document. The search query is also described as a binary vector where each bit indicates whether the corresponding keyword appears in the search request, so the similarity can be exactly measured by the inner product of the query vector with the data vector. To meet the challenge of supporting such multi-keyword semantics without privacy breaches, a basic scheme using secure inner product computation is proposed, which is adapted from a secure k-nearest neighbour technique and then improved step by step to achieve various privacy requirements under two levels of threat models.
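As a plain illustration of coordinate matching via inner products of binary keyword vectors (the keyword dictionary and documents below are assumptions; the scheme's secure inner product computation and encryption steps are not shown):

```python
# Coordinate matching: rank documents by the number of matched query keywords.
# Plaintext illustration only; the secure kNN-based computation is omitted.
dictionary = ["cloud", "privacy", "search", "encryption", "ranking"]

def to_vector(keywords):
    # Binary sub-index: 1 if the dictionary keyword is present, 0 otherwise.
    present = set(keywords)
    return [1 if w in present else 0 for w in dictionary]

def inner_product(query_vec, doc_vec):
    # Equals the number of query keywords appearing in the document.
    return sum(q * d for q, d in zip(query_vec, doc_vec))

documents = {
    "doc1": ["cloud", "privacy", "search"],
    "doc2": ["cloud", "ranking"],
}
query = to_vector(["privacy", "search", "ranking"])
scores = {name: inner_product(query, to_vector(kws)) for name, kws in documents.items()}
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(name, score)   # doc1 matches 2 query keywords, doc2 matches 1
```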
2.3 AUTHORIZED PRIVATE KEYWORD SEARCH OVER
ENCRYPTED PERSONAL HEALTH RECORDS IN CLOUD
Ming Li et al (2008) have proposed authorized keyword search over encrypted Personal Health Records (PHR). The PHR has emerged as a patient-centric model of health information exchange. It has never been easier for one to create and manage her own Personal Health Information (PHI) in one place and share that information with others. It enables a patient to merge potentially separate health records from multiple geographically dispersed health providers into one centralized profile over passages of time. This greatly facilitates multiple other users, such as medical practitioners and researchers, in gaining access to and utilizing one's PHR on demand according to their professional needs, thereby making healthcare processes much more efficient and accurate.
As a matter of fact, PHRs are usually untethered, i.e., provided by a third-party service provider, in contrast to electronic medical records, which are usually tethered, i.e., kept by each patient's own healthcare provider. Untethered PHRs are among the best ways to empower patients to manage their health and wellbeing. The most popular examples of PHR systems include Google Health and Microsoft HealthVault, which are hosted by cloud computing platforms. It is a vision dreamed by many to enable anyone to access a PHR service from anywhere, at any time.
Despite enthusiasm around the idea of patient-centric PHR systems, their promises cannot be fulfilled until the serious security and privacy concerns patients have about these systems are addressed, which are the main impediments standing in the way of their wide adoption. In fact, people remain dubious about the level of privacy protection of their health data when it is stored on a server owned by a third-party cloud service provider.
Most people do not fully entrust their sensitive PHR data to third-party service providers because there is no governance about how this information can be used by them and whether the patients actually control their information. On the other hand, even if patients choose to trust those service providers, PHR data could be exposed if an insider in the service provider's company misbehaves, or the server is broken into. To cope with these tough trust issues and to ensure patients' control over their own privacy, applying data encryption to patients' PHR documents before outsourcing has been proposed as a promising solution.
With encrypted PHR data, one of the key functionalities of a PHR system, keyword search, becomes an especially challenging issue. First, frequently used complex query types need to be supported in practice while preserving the privacy of the query keywords. This class of boolean formulas features conjunctions among different keyword fields and is referred to as multi-dimensional multi-keyword search in this paper. To hide the query keywords from the server, it is clearly inefficient for a user to download the whole database and try to decrypt the records one by one.
Searchable encryption has been proposed as a better solution;
informally speaking, a user submits a capability encoding her query to the
server, who searches through the encrypted keyword indexes created by the
owners, and returns a subset of encrypted documents that satisfy the
underlying query without ever knowing the keywords in the query and the
index. However, existing solutions of searchable encryption are still far from
practical for PHR applications in cloud environments. First and foremost, they
are limited both in the type of applications and system scalability. Recently,
Benaloh et al and Narayan et al proposed several solutions for securing
encrypted electronic health records. In their schemes for encrypted search,
each owner issues search capabilities to individual users upon request.
The main advantage is that the owner herself can exert fine-grained control over users' search access to her PHR documents. Yet such a framework is limited to small-scale access and search applications, where the best use case is for users who are personally known by the patient, such as family members, friends or primary doctors. Such a user set is called the personal domain. In contrast, there is the public domain, which contains a large number of users who may come from various avenues, such as fellow patients, medical researchers in a medical school, staff in public health agencies, etc. Their corresponding applications are patient matching, medical research and public health monitoring, respectively. The user set of one's PHR is potentially large, usually unknown to the PHR owner, and their access/search requests are basically unpredictable.
Under existing solutions, supporting those important kinds of applications will incur an intrinsic non-scalability in key management: an owner will need to be always online, dealing with many search requests, while a user must obtain search capabilities one by one from every owner. This system focuses on PHR applications in the public domain, where there are multiple owners who can contribute encrypted PHR data while multiple users shall be able to search over all those records with high scalability.
Second, in many existing searchable encryption schemes, a user is often given a private key that endows her with unlimited capability to generate any query of her choice, which is essentially a 0-or-1 authorization. However, fine-grained search authorization is an indispensable component of a secure system. Although access to the actual documents can be controlled by separate cryptographically enforced access control techniques such as attribute-based encryption, 0-1 search authorization may still lead to leakage of patients' sensitive health information.
For example, if Alice is the only one with a rare disease in the PHR
database, by designing the query in a clever way, from the results Bob will be
certain that Alice has that disease. Thus, it is desirable that a user is only
allowed to search for some specific sets of keywords; in particular, the
authorization shall be based on a user's attributes. For instance, in a patient matching application in healthcare social networks, a patient should only be matched to patients having symptoms similar to hers, while learning no information about those who do not.
On the other hand, requiring every user to obtain restricted search capabilities from a central Trusted Authority (TA) does not achieve high scalability either. If the TA assumes the responsibility of authorization at the same time, it must always be online, dealing with a large workload and facing the threat of a single point of failure. In addition, since the global TA does not directly possess the necessary information to check the attributes of users from different local domains, additional infrastructure needs to be employed. It is therefore desirable for the users to be authorized locally.
To realize such a framework, the authors make novel use of a recent cryptographic primitive, Hierarchical Predicate Encryption (HPE), which features delegation of search capabilities. The first solution enhances search efficiency, especially for subset and a class of simple range queries, while the second enhances query privacy with the help of proxy servers. Both schemes support multi-dimensional multi-keyword searches and allow delegation and revocation of search capabilities. Finally, the scheme is implemented on a modern workstation and an extensive performance evaluation is carried out. The experimental results demonstrate that the scheme is suitable for a wide range of delay-tolerant PHR applications. To the best of the authors' knowledge, this work is the first to address authorized private search over encrypted PHRs within the public domain.
2.4 SILVERLINE: TOWARD DATA CONFIDENTIALITY IN
STORAGE-INTENSIVE CLOUD APPLICATIONS
Jinjun Chen et al (2008) have proposed that third-party computing clouds, such as Amazon's EC2 and Microsoft's Azure, provide support for computation, data management in database instances, and internet services. By allowing organizations to efficiently outsource computation and data management, they greatly simplify the deployment and management of Internet applications. Examples of success stories on EC2 include Nimbus Health, which manages and distributes patient medical records, and ShareThis, a social content-sharing network that has shared 430 million items across 30,000 websites.
Unfortunately, these game-changing advantages come with a significant risk to data confidentiality. Using a multi-tenant model, clouds co-locate applications from multiple organizations on a single managed infrastructure. This means application data is vulnerable not only to operator errors and software bugs in the cloud, but also to the risks of sharing infrastructure with other tenants. With unencrypted data exposed on disk, in memory, or on the network, it is not surprising that organizations cite data confidentiality as their biggest concern in adopting cloud computing. In fact, researchers recently showed that attackers could effectively target and observe information from specific cloud instances on third-party clouds. As a result, many recommend that cloud providers should never be given access to unencrypted data.
Organizations can achieve strong data confidentiality by encrypting data before it reaches the cloud, but naively encrypting data severely restricts how the data can be used: the cloud cannot perform computation on any data it cannot access in plaintext. For applications that want more than just pure storage, there are efforts to perform specific operations on encrypted data, such as searches.
A recent proposal of a fully homomorphic cryptosystem even supports arbitrary computations on encrypted data. However, these techniques are either too costly or support only very limited functionality. Thus, users that need real application support from today's clouds must choose between the benefits of clouds and strong confidentiality of their data. This system takes a first step towards improving data confidentiality in cloud applications and proposes a new approach to balance confidentiality and computation on the cloud. The key observation is this: in the applications that can benefit the most from the cloud model, the majority of computations handle data in an opaque way, i.e., without interpretation. Data that is never interpreted by the application is referred to as functionally encryptable, i.e., encrypting it does not limit the application's functionality.
Leveraging the observation that certain data is never interpreted by the cloud, the key step is to split the application data into two subsets: functionally encryptable data, and data that must remain in plaintext to support computations on the cloud. A majority of the data in many applications is functionally encryptable. Such data is encrypted by users before uploading it to the cloud and decrypted by users after receiving it from the cloud. While this idea sounds conceptually simple, realizing it requires addressing three significant challenges: identifying functionally encryptable data in cloud applications, assigning encryption keys to data while minimizing key management complexity and the risks due to key compromise, and providing secure data access at the user device.
Identifying functionally encryptable data: The first challenge is to identify data that can be functionally encrypted without breaking application functionality. To this end, the authors present an automated technique that marks data objects using tags and tracks their usage and dependencies through dynamic program analysis.
Functionally encryptable data is identified by discarding all data that is involved in any computation on the cloud. Naturally, the size of this subset of data depends on the type of service. For example, for programs that compute values based on all data objects, the technique will not find any data suitable for encryption. In practice, however, the results show that for many applications, including social networks and message boards, a large fraction of the data can be encrypted.
Encryption key assignment: Once the data to be encrypted has been identified, one must choose how many keys to use for encryption and the granularity of encryption. In the simplest case, all such data can be encrypted using a single key that is shared with all users of the service. Unfortunately, this has the problem that a malicious or compromised cloud could obtain access to the encryption key, e.g., by posing as a legitimate user, or by compromising or colluding with an existing user. In these cases, confidentiality of the entire dataset would be compromised. At the other extreme, each data object could be encrypted with a different key. This increases robustness to key compromise but drastically increases key management complexity.
The goal is to automatically infer the granularity of data encryption that provides the best tradeoff between robustness and management complexity. To this end, the data is partitioned into subsets, where each data subset is accessed by the same group of users. Each data subset is then encrypted with a different key, and keys are distributed to the groups of users that should have access. Thus, a malicious or buggy cloud that compromises a key can only access the data encrypted with that key, minimizing its negative impact. A dynamic access analysis technique identifies the user groups that can access different objects in the data set.
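A minimal sketch of this grouping idea, assuming a hypothetical mapping from data objects to the users allowed to access them (Silverline's actual access analysis and key management are not reproduced here):

```python
# Group data objects by their exact set of authorized users and assign one key
# per group; each user then receives the keys of every group she belongs to.
# Object names and the access map are placeholders for illustration.
import secrets
from collections import defaultdict

access_map = {
    "profile_alice": {"alice"},
    "message_1": {"alice", "bob"},
    "message_2": {"alice", "bob"},
    "board_post": {"alice", "bob", "carol"},
}

# Objects with an identical access group share one encryption key.
groups = defaultdict(list)
for obj, users in access_map.items():
    groups[frozenset(users)].append(obj)

group_keys = {group: secrets.token_bytes(16) for group in groups}

# Key distribution: every user gets the key of each group she is a member of.
user_keys = defaultdict(list)
for group, key in group_keys.items():
    for user in group:
        user_keys[user].append(key)

for group, objs in groups.items():
    print(sorted(group), "->", sorted(objs))
```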
In addition, the authors describe a key management system that leverages this information to assign to each user all the keys she needs to properly access her data. Since key assignment is based on user access patterns, an assignment can be obtained that uses the minimal number of encryption keys necessary to cover all data subsets with distinct access groups, while minimizing the damage from key compromise. Key management is handled by the organization. Mechanisms are also developed to manage keys when users or objects are dynamically added to or removed from the application or service.
Secure and transparent user data access: Client devices, e.g., browsers, are given decryption keys by the organization to provide users with transparent data access. Of course, these devices must protect the keys from compromise. To ward off such attacks, the authors propose a client-side component that allows users to access cloud services transparently while preventing key compromise. As a result, the solution works without any browser modifications and can be easily deployed today.
Prototype and evaluation: These techniques are implemented as part of Silverline, a prototype set of software tools designed to simplify the process of securely transitioning applications to the cloud. The prototype takes as input an application and its data. First, it automatically identifies data that is functionally encryptable. Then, it partitions this data into subsets that are accessible to different sets of users. Each group is assigned a different key, and every user obtains a key for each group she belongs to. This allows the application to run on the cloud while all data not used for computation is encrypted. In addition, the authors find that a large majority of the data can be encrypted in each of the tested applications.

2.5 SEDIC: PRIVACY-AWARE DATA INTENSIVE COMPUTING
ON HYBRID CLOUDS
With the rapid growth of information within organizations, ranging from hundreds of gigabytes of satellite images to terabytes of commercial transaction data, the demands for processing such data are on the rise. Meeting such demands requires an enormous amount of low-cost computing resources, which can only be supplied by today's commercial cloud computing systems. This newfound capability, however, cannot be fully exploited without addressing the privacy risks it brings: on one hand, organizational data contains sensitive information and therefore cannot be shared with the cloud provider without proper protection; on the other hand, today's commercial clouds do not offer high security assurance and tend to avoid any liability, a concern that has been significantly aggravated by the recent incidents of Amazon outages and the Sony PlayStation Network data breach.
As a result, attempts to outsource computations involving sensitive data are often discouraged. A natural solution to this problem is cryptographic techniques for secure computation outsourcing, which have been studied for a decade. Secure hybrid-cloud computing: oftentimes, a data-intensive computation involves both public and sensitive data. For example, a simple grep across an organizational file system encounters advertising slogans as well as lines of commercial secrets. Also, many data analysis tasks, such as intrusion detection, targeted advertising, etc., need to make use of information from public sources, sanitized network traces and social-network data. If the computation on the public data can be separated from that on the sensitive data, the former can be comfortably delegated to public commercial clouds and the latter, whose scale can be much smaller than the original task, becomes much easier to handle within the organization.
Such a split of computation is an effective first step to securely outsource computations and can be naturally incorporated into today's cloud infrastructure, in which a public cloud typically receives the computation overflow from an organization's internal system when it runs out of computing resources. This way of computing is called hybrid cloud computing. The hybrid cloud has already been adopted by most organizational cloud users and is still undergoing rapid development, with new techniques mushrooming to enable smoother inter-cloud coordination. It also presents a new opportunity that makes practical, secure outsourcing of computation tasks possible.
However, today's cloud-based computing frameworks, such as MapReduce, are not ready for secure hybrid-cloud computing: they are designed to work on a single cloud and are not aware of the presence of data with different security levels, which forces cloud users to manually split and re-arrange each computation job across the public/private clouds. This lack of framework-level support also hampers the reuse of existing data-processing code and therefore significantly increases cloud users' programming burden. Given that privacy concerns have already become the major hurdle to broader adoption of the cloud computing paradigm, there is an urgent need to develop practical techniques that facilitate secure data-intensive computing over hybrid clouds.
To answer this urgent call, a new, generic secure computing
framework needs to be built to support automatic splitting of a data-intensive
computing job and scheduling of it across the public and private clouds in
such a way that data privacy is preserved and computational and
communication overheads are minimized. Also desired here is
accommodation of legacy data-processing code, which is expected to run
directly within the framework without the user's manual intervention.
The authors present a suite of new techniques that make this happen. Their system, called Sedic, includes a privacy-aware execution framework that automatically partitions a computing job according to the security levels of the data it involves and distributes the computation between the public and private clouds. Sedic is based on MapReduce, which includes a map step and a reduce step: the map step divides input data into lists of key-value pairs and assigns them to a group of concurrently running mappers; the reduce step receives the outputs of these mappers, which are intermediate key-value pairs, and runs a reducer to transform them into the final outputs.
This way of computation is characterized by its simple structure,
particularly the map operations that are performed independently and
concurrently on different data records. This feature is leveraged by the execution framework to automatically decompose a computation on a mixture of public
and sensitive data, which is actually difficult in general. More specifically,
Sedic transparently processes individual data blocks, sanitizes those carrying
sensitive information along the line set by the smallest data unit a map
operation works on, and replicates these sanitized copies to the public cloud.
Over those data blocks, map tasks are assigned to work solely on the public or
sensitive data within the blocks.
These tasks are carefully scheduled and executed to ensure the correctness of the computing outcomes and minimal impact on performance. In this way, the workload of map operations is distributed to the
public/private clouds according to their available computing resources and the
portion of sensitive data in the original dataset. A significant technical
challenge here is that reduction usually cannot be done on private nodes and
public nodes separately and only private nodes are suitable for such a task in
order to preserve privacy. This implies that the intermediate outputs of
computing nodes on the cloud need to be sent back to the private cloud.
To reduce such inter-cloud data transfer as well as to move part of the reduce computation to the public cloud, the authors developed a new technique that automatically analyzes and transforms reducers to make them suitable for running on the hybrid cloud. Their approach extracts a combiner from the original reducer for preprocessing the intermediate key-value pairs produced by the public cloud, so as to compress the volume of the data to be delivered to the private cloud. This was achieved, again, by leveraging the special features of MapReduce: its reducer needs to perform a folding operation on a list, which can be automatically identified and extracted by a program analyzer embedded in Sedic.
If the operation turns out to be associative or even commutative, as happens in the vast majority of cases, the combiner can be built upon it and deployed to the public cloud to process the map outcomes. In their research, the authors implemented Sedic on Hadoop and evaluated it over FutureGrid, a large-scale, cross-country cloud testbed. The experimental results show that the techniques effectively protect confidential user data and minimize the workload of the private cloud at a small overall cost.
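A minimal MapReduce-style sketch of the combiner idea for an associative fold, using word counting as an assumed example (this is not Sedic's Hadoop implementation or its automated combiner extraction):

```python
# Word count expressed as map / combine / reduce over an associative fold.
# The combiner performs partial aggregation that could run on the public cloud,
# shrinking the intermediate data shipped back to the private cloud.
from collections import defaultdict

def map_phase(records):
    # Mapper: emit (word, 1) pairs.
    for line in records:
        for word in line.split():
            yield word, 1

def combine(pairs):
    # Associative partial fold, safe to run where the mappers ran.
    partial = defaultdict(int)
    for key, value in pairs:
        partial[key] += value
    return list(partial.items())

def reduce_phase(list_of_partials):
    # Final fold, performed on the private cloud.
    totals = defaultdict(int)
    for partial in list_of_partials:
        for key, value in partial:
            totals[key] += value
    return dict(totals)

public_split = ["cloud data cloud", "public data"]
private_split = ["sensitive data cloud"]
result = reduce_phase([combine(map_phase(public_split)),
                       combine(map_phase(private_split))])
print(result)   # {'cloud': 3, 'data': 3, 'public': 1, 'sensitive': 1}
```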
Sedic is designed to protect data privacy during map-reduce
operations, when the data involved contains both public and private records.
This protection is achieved by ensuring that the sensitive information within
the input data, intermediate outputs and final results will never be exposed to
untrusted nodes during the computation. Another important concern in data
intensive computing is integrity, i.e., whether the public cloud honestly
performs a computing task and delivers the right results back to the private cloud. The authors choose to address the confidentiality issue first, as it has already
impeded the extensive use of the computing resources offered by the public
cloud. By comparison, many cloud users today live with the risk that their
computing jobs may not be done correctly on the public cloud.
2.6 ENABLING PRIVACY IN PROVENANCE AWARE
WORKFLOW SYSTEMS
Daniel Warneke et al (2009) have proposed that a new paradigm for creating and correcting scientific analyses is emerging, that of provenance-aware workflow systems. In such systems, repositories of workflow
specifications and of provenance graphs that represent their executions will be
made available as part of scientific information sharing. This will allow users
to search and query both workflow specifications and their provenance
graphs: Scientists who wish to perform new analyses may search workflow
repositories to find specifications of interest to reuse or modify. They may
also search provenance information to understand the meaning of a workflow,
or to debug a specification.
Upon finding erroneous or suspect data, a user may then ask provenance queries to determine what downstream data might have been affected, or to understand how the process that created the data failed. With the
increased amount of available provenance information, there is a need to
efficiently search and query scientific workflows and their executions.
However, workflow authors or owners may wish to keep some information in
the repository confidential.
Although users with the appropriate access level may be allowed to see
such confidential data, making it available to all users, even for scientific
purposes, is an unacceptable breach of privacy. Beyond data privacy, a
module itself may be proprietary, and hiding its description may not be
enough: users without the appropriate access level should not be able to infer
its behavior if they are allowed to see the inputs and outputs of the module.
Finally, details of how certain modules in the workflow are connected may be
proprietary, and so showing how data is passed between modules may reveal
too much of the structure of the workflow.
Scientific workflows are gaining widespread use in life sciences applications, a domain in which privacy concerns are particularly acute. The authors illustrate three types of privacy using an example from this domain. Consider a personalized disease susceptibility workflow. Information such as an individual's genetic make-up and family history of disorders, which this workflow takes as input, is highly sensitive and should not be revealed to an unauthorized user, placing stringent requirements on data privacy. Further, a workflow module may compare an individual's genetic makeup to profiles of other patients and controls. The manner in which such historical data is aggregated and the comparison is made is highly sensitive, pointing to the need for module privacy.
As recently noted, 'you are better off designing in security and privacy from the start, rather than trying to add them later'. The authors apply this principle by proposing that privacy guarantees be integrated into the design of the search and query engines that access provenance-aware workflow repositories. Indeed, the alternative would be to create multiple repositories corresponding to different levels of access, which would lead to inconsistencies, inefficiency, and a lack of flexibility, affecting the desired techniques.
This system focuses on privacy-preserving management of provenance-aware workflow systems. It considers the formalization of privacy concerns as well as query processing in this context. Specifically, it addresses issues associated with keyword-based search and with querying such repositories for structural patterns. To give some background on provenance-aware workflow systems, the authors first describe the common model for workflow specifications and their executions; they then enumerate privacy concerns, consider their effect on query processing, and discuss the challenges.
2.7 SCALABLE AND SECURE SHARING OF HEALTH RECORDS
IN CLOUD USING ATTRIBUTE BASED ENCRYPTION
Christian Vecchiola et al (2009) have proposed this technique for the Personal Health Record (PHR), which has emerged as a patient-centric model of health information exchange. A PHR service allows a patient to create, manage, and control her personal health data in one place through the web, which has made the storage, retrieval, and sharing of medical information more efficient. In particular, each patient is promised full control of her medical records and can share her health data with a wide range of users, including healthcare providers, family members or friends. Due to the high cost of building and maintaining specialized data centers, many PHR services are outsourced to or provided by third-party service providers.
Recently, architectures of storing PHRs in cloud computing have been
proposed. While it is exciting to have convenient PHR services for everyone,
there are many security and privacy risks which could impede its wide
adoption. The main concern is about whether the patients could actually
control the sharing of their sensitive personal health information (PHI),
especially when they are stored on a third-party server which people may not
fully trust. On the one hand, although there exist healthcare regulations such as HIPAA, which was recently amended to incorporate business associates, cloud providers are usually not covered entities.
On the other hand, due to the high value of the sensitive Personal
Health Information (PHI), the third-party storage servers are often the targets
of various malicious behaviors which may lead to exposure of the PHI. In a famous incident, a Department of Veterans Affairs database containing sensitive PHI of 26.5 million military veterans, including their social security numbers and health problems, was stolen by an employee who took the data home without authorization.
To ensure patient-centric privacy control over their own PHRs, it is
essential to have fine-grained data access control mechanisms that work with
semi-trusted servers. A feasible and promising approach would be to encrypt
the data before outsourcing. Basically, the PHR owner herself should decide
how to encrypt her files and which set of users to allow to obtain access to each file. A PHR file should only be available to the users who are given the corresponding decryption key, while remaining confidential to the rest of the users. Furthermore, the patient shall always retain the right not only to grant, but also to revoke, access privileges when she feels it is necessary.
However, the goal of patient-centric privacy is often in conflict with
scalability in a PHR system. The authorized users may need to access the PHR either for personal use or for professional purposes; these two categories of users are referred to as personal and professional users, respectively. The latter is potentially of large scale: should each owner herself be directly responsible for managing all the professional users, she will easily be overwhelmed by the key management overhead. In addition, since those users' access requests are
generally unpredictable, it is difficult for an owner to determine a list of them.
On the other hand, different from the single data owner scenario considered in most of the existing works, in a PHR system there are multiple owners who may encrypt in their own ways, possibly using different sets of cryptographic keys. An alternative is to employ a Central Authority (CA) to do the key management on behalf of all PHR owners, but this requires too much trust in a single authority. This system endeavors to study patient-centric, secure sharing of PHRs stored on semi-trusted servers, and focuses on addressing the complicated and challenging key management issues. In order to protect the personal health data stored on a semi-trusted server, Attribute Based Encryption (ABE) is adopted as the main encryption primitive.
2.8 ENABLING SECURE AND EFFICIENT RANKED KEYWORD
SEARCH OVER OUTSOURCED CLOUD DATA
Wanchun Dou et al (2010) have proposed this ranked keyword search. Cloud computing is the long-dreamed vision of computing as a utility, where cloud customers can remotely store their data in the cloud so as to enjoy the on-demand high-quality applications and services from a shared pool of configurable computing resources. The benefits brought by this new computing model include, but are not limited to: relief of the burden of storage management, universal data access from independent geographical locations, and avoidance of capital expenditure on hardware, software, and personnel maintenance.
As cloud computing becomes prevalent, more and more sensitive information is being centralized into the cloud, such as e-mails, personal health records, company finance data, and government documents. The fact that data owners and the cloud server are no longer in the same trusted domain may put the outsourced unencrypted data at risk: the cloud server may leak data to unauthorized entities or even be hacked. It follows that sensitive data have to be encrypted prior to outsourcing for data privacy and to combat unsolicited accesses.
Besides, in cloud computing, data owners may share their outsourced
data with a large number of users, who might want to only retrieve certain
specific data files they are interested in during a given session. One of the
most popular ways to do so is through keyword-based search. Such a keyword search technique allows users to selectively retrieve files of interest and has been widely applied in plaintext search scenarios. Unfortunately, data encryption, which restricts users' ability to perform keyword search and further demands the protection of keyword privacy, makes the traditional plaintext search methods fail for encrypted cloud data.
On the one hand, for each search request, users without prior knowledge of the encrypted cloud data have to go through every retrieved file in order to find the ones most matching their interest, which demands a possibly large amount of post-processing overhead. On the other hand, invariably sending back all files solely based on the presence or absence of the keyword further incurs large, unnecessary network traffic, which is absolutely undesirable in today's pay-as-you-use cloud paradigm.
In short, the lack of effective mechanisms to ensure file retrieval accuracy is a significant drawback of existing searchable encryption schemes in the context of cloud computing. Nonetheless, the state of the art in the Information Retrieval (IR) community has already been utilizing various scoring mechanisms to quantify and rank-order the relevance of files in response to any given search query. Therefore, how to enable a searchable encryption system with support for secure ranked search is the problem tackled in this system. This work is among the first few to explore ranked search over encrypted data in cloud computing.
To achieve the design goals of both system security and usability, the authors propose to bring together advances from both the cryptography and IR communities to design a Ranked Searchable Symmetric Encryption (RSSE) scheme, in the spirit of an as-strong-as-possible security guarantee. Specifically, they explore statistical measures from IR and text mining to embed the weight information of each file during the establishment of the searchable index, before outsourcing the encrypted file collection. As directly outsourcing relevance scores would leak a great deal of sensitive frequency information and undermine keyword privacy, a one-to-many order-preserving mapping technique is developed to protect the sensitive weight information while providing efficient ranked search functionality.
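A minimal sketch of relevance scoring and ranking on plaintext files (the specific TF-IDF formula, file names and contents below are assumptions for illustration; the encrypted index and the one-to-many order-preserving mapping are not shown):

```python
# Rank files for a single keyword using a generic TF-IDF relevance score.
# Plaintext illustration only; the RSSE scheme's protected scores are omitted.
import math

files = {
    "f1": "cloud storage makes cloud data available anywhere",
    "f2": "encrypted search over cloud data",
    "f3": "keyword privacy in outsourced storage",
}

def relevance(keyword, text, corpus):
    words = text.lower().split()
    tf = words.count(keyword) / len(words)                 # term frequency
    containing = sum(1 for doc in corpus.values()
                     if keyword in doc.lower().split())
    idf = math.log(len(corpus) / (1 + containing)) + 1.0   # inverse document frequency
    return tf * idf

keyword = "cloud"
ranking = sorted(files, key=lambda name: relevance(keyword, files[name], files),
                 reverse=True)
for name in ranking:
    print(name, round(relevance(keyword, files[name], files), 3))
```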
2.9 A SECURE ERASURE CODE-BASED CLOUD STORAGE
SYSTEM WITH SECURE DATA FORWARDING
Jianfeng Zhan et al (2010) have proposed an erasure code-based cloud storage system with secure data forwarding. As high-speed networks and
ubiquitous internet access become available in recent years, many services are
provided on the internet such that users can use them from anywhere at any
time. For example, the email service is probably the most popular one. Cloud
computing is a concept that treats the resources on the Internet as a unified
entity, a cloud. Users just use services without being concerned about how
computation is done and storage is managed.
This system focuses on designing a cloud storage system for robustness, confidentiality, and functionality. A cloud storage system is considered a large-scale distributed storage system that consists of many
independent storage servers. Data robustness is a major requirement for
storage systems. There have been many proposals of storing data over storage
servers. One way to provide data robustness is to replicate a message such
that each storage server stores a copy of the message. It is very robust because
the message can be retrieved as long as one storage server survives. Another
way is to encode a message of k symbols into a codeword of n symbols by
erasure coding.
To store a message, each of its codeword symbols is stored in a
different storage server. A storage server failure corresponds to an erasure
error of the codeword symbol. As long as the number of failed servers is under the tolerance threshold of the erasure code, the message can be recovered from the codeword symbols stored in the available storage servers by the decoding process. This provides a tradeoff between the storage size and the tolerance threshold for failed servers. Thus, the encoding process for a
message can be split into n parallel tasks of generating codeword symbols.
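A toy sketch of the erasure coding idea with k = 2 data symbols and n = 3 stored symbols, using a simple XOR parity (assumed here purely for illustration; the paper uses a decentralized erasure code, not this scheme):

```python
# Toy (k=2, n=3) erasure code: two data symbols plus one XOR parity symbol,
# stored on three different servers; any single server failure is tolerated.
def encode(a: int, b: int):
    return [a, b, a ^ b]            # codeword symbols for servers 1, 2, 3

def decode(symbols):
    # symbols[i] is None if server i has failed (at most one failure here).
    a, b, p = symbols
    if a is None:
        a = b ^ p
    if b is None:
        b = a ^ p
    return a, b

codeword = encode(0x5A, 0x3C)
codeword[1] = None                  # simulate the failure of server 2
print(decode(codeword) == (0x5A, 0x3C))   # True: the message is recovered
```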
A decentralized erasure code is suitable for use in a distributed storage
system. After the message symbols are sent to storage servers, each storage
server independently computes a codeword symbol for the received message
symbols and stores it. This finishes the encoding and storing process. The
recovery process is the same. Storing data in a third party's cloud system causes serious concern about data confidentiality. In order to provide strong
confidentiality for messages in storage servers, a user can encrypt messages
by a cryptographic method before applying an erasure code method to encode
and store messages. When he wants to use a message, he needs to retrieve the
codeword symbols from storage servers, decode them, and then decrypt them
by using cryptographic keys.
There are three problems in the above straight forward integration of
encryption and encoding. First, the user has to do most computation and the
communication traffic between the user and storage servers is high. Second,
the user has to manage his cryptographic keys. If the user's device storing
the keys is lost or compromised, the security is broken. Finally, besides data
storing and retrieving, it is hard for storage servers to directly support other
functions. For example, storage servers cannot directly forward a user's
messages to another user. The owner of the messages has to retrieve, decode,
decrypt and then forward them to another user.
This paper addresses the problem of having storage servers forward data
to another user directly under the command of the data owner. The system
model considered consists of distributed storage servers and
key servers. Since storing cryptographic keys in a single device is risky, a
user distributes his cryptographic key to key servers that shall perform
cryptographic functions on behalf of the user. These key servers are highly
protected by security mechanisms.
32

2.10 HARNESSING THE CLOUD FOR SECURELY
OUTSOURCING LARGE-SCALE SYSTEMS OF LINEAR
EQUATIONS
Kui Ren et al (2010) have proposed this approach for securely outsourcing
large-scale systems of linear equations to the cloud. In cloud computing,
customers with computationally weak devices are no longer limited by slow
processing speed, memory, and other hardware constraints, but can enjoy
practically unlimited computing resources in the cloud in a convenient yet
flexible pay-per-use manner. Despite the tremendous benefits, the fact that customers and cloud
are not necessarily in the same trusted domain brings many security concerns
and challenges toward this promising computation outsourcing model.
First, customers' data that are processed and generated during the
computation in cloud are often sensitive in nature, such as business financial
records, proprietary research data, and personally identifiable health
information, etc. While applying ordinary encryption techniques to this
sensitive information before outsourcing could be one way to combat the
security concern, it also makes the task of computation over encrypted data in
general a very difficult problem. Second, since the operational details inside
the cloud are not transparent enough to customers, no guarantee is provided
on the quality of the computed results from the cloud.
For example, for computations demanding a large amount of resources,
there are huge financial incentives for the Cloud Server (CS) to be lazy if
the customer cannot tell the correctness of the answer. Besides, possible
software/hardware malfunctions and/or outsider attacks might also affect the
quality of the computed results. Thus, the authors argue that the cloud is
intrinsically not secure from the viewpoint of customers unless a mechanism
for secure computation outsourcing is provided.
33

Focusing on the engineering and scientific computing problems, this
paper investigates secure outsourcing for widely applicable large-scale
systems of Linear Equations (LE), which are among the most popular
algorithmic and computational tools in various engineering disciplines that
analyze and optimize real-world systems. For example, by applying Newton's
method, solving a system modeled by nonlinear equations is converted into
solving a sequence of systems of linear equations. Also, by interior point methods,
system optimization problems can be converted to a system of nonlinear
equations, which is then solved as a sequence of systems of linear equations
as mentioned above.
Because the execution time of a computer program depends not only
on the number of operations it must execute, but also on the location of the
data in the memory hierarchy, solving such large-scale problems on customers'
weak computing devices can be practically impossible, due to the inevitably
huge I/O cost involved. Thus, resorting to the cloud for such computation
intensive tasks can be arguably the only choice for customers with weak
computing power, especially when the solution is demanded in a timely
fashion.
It is worth noting that in the literature, several cryptographic protocols
for solving various core problems in linear algebra, including the systems of
linear equations have already been proposed from the Secure Multiparty
Computation (SMC) community. However, these approaches are in general ill
suited to the computation outsourcing model with large problem sizes. First,
all these works developed under the SMC model do not address the asymmetry
between the computational power possessed by the cloud and the customer,
i.e., they impose comparable computation burdens on each involved party,
which the design in this paper specifically intends to avoid.
34

Second, the framework of SMC usually does not directly consider the
computation result verification as an indispensable security requirement, due
to the assumption that each involved party is semi-honest. This assumption no
longer holds in this model, where any unfaithful behavior by the cloud during
the computation should be strictly forbidden. Last but not least, almost all
these solutions focus on the traditional direct method for jointly solving the
LE, like the joint Gaussian elimination method or the secure matrix inversion
method. While working well for small-size problems, these approaches in
general do not yield practically acceptable solution times for large-scale LE,
due to the expensive cubic-time computational burden of matrix-matrix
operations and the huge I/O cost on customers' weak devices.
The analysis of existing approaches and computational practicality
motivates the design of a secure mechanism for outsourcing LE via a
completely different approach, the iterative method, where the solution is
extracted by finding successive approximations until the required accuracy is
obtained. Compared to the direct method, the iterative method only demands
relatively simple matrix-vector operations with O(n^2) computational cost,
which is much easier to implement in practice and is widely adopted for
large-scale LE.
For a linear system with an n × n coefficient matrix, the proposed
mechanism is based on a one-time amortizable setup with O(n^2) cost. Then, in
each execution of the iterative algorithm, the proposed mechanism only incurs
O(n) local computational burden on the customer and asymptotically eliminates
the expensive I/O cost, i.e., there are no unrealistic memory demands. To
ensure computation result integrity, a very efficient cheating detection
mechanism is also proposed to verify, in one batch and with high probability,
all the computation results returned by the cloud server from previous
algorithm iterations. Both designs ensure computational savings for the
customer.
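For concreteness, a plain (unsecured) Jacobi iteration is sketched below in Java to show the O(n^2) matrix-vector work performed per iteration; the secure outsourcing protocol adds problem transformation and result verification on top of such an iteration and is not reproduced here, so this is only an illustrative baseline.

// Plain Jacobi iteration for Ax = b: each sweep costs one O(n^2) matrix-vector pass.
public class JacobiSolver {
    public static double[] solve(double[][] a, double[] b, int iterations) {
        int n = b.length;
        double[] x = new double[n];       // start from the zero vector
        double[] next = new double[n];
        for (int k = 0; k < iterations; k++) {
            for (int i = 0; i < n; i++) {
                double sum = 0.0;
                for (int j = 0; j < n; j++) {
                    if (j != i) {
                        sum += a[i][j] * x[j];
                    }
                }
                next[i] = (b[i] - sum) / a[i][i];
            }
            System.arraycopy(next, 0, x, 0, n);
        }
        return x;
    }

    public static void main(String[] args) {
        double[][] a = {{2, 1}, {1, 2}};   // diagonally dominant, so Jacobi converges
        double[] b = {3, 3};
        double[] x = solve(a, b, 50);
        System.out.println("x = [" + x[0] + ", " + x[1] + "]");  // approximately [1.0, 1.0]
    }
}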
35

2.11 AN UPPER BOUND CONTROL APPROACH FOR
PRESERVING DATASET PRIVACY IN CLOUD
Xuyun Zhang et al (2012) have proposed this upper bound control
approach for preserving dataset privacy in the cloud. As more and more data
intensive applications are migrated into cloud environments, valuable
intermediate datasets are commonly stored in order to avoid the high cost of
recomputing them. However, this poses a risk to data privacy protection
because malicious parties may deduce private information about the parent or
original dataset by analyzing some of those stored intermediate datasets.
The traditional way of addressing this issue is to encrypt all of those
stored datasets so that they are hidden. The authors argue that this is neither
efficient nor cost effective, because it is not necessary to encrypt all of those
datasets and encrypting such large amounts of data can be very costly. This
paper proposes a new approach to identify which stored datasets need to be
encrypted and which do not. Through intensive information-theoretic analysis,
the approach derives an upper bound on a privacy measure.
As long as the overall mixed information amount of some stored
datasets is no more than that upper bound, those datasets do not need to be
encrypted while privacy can still be protected. A tree model is leveraged to
analyze privacy disclosure of datasets and privacy requirements are
decomposed and satisfied layer by layer. With a heuristic implementation of
this approach, evaluation results demonstrate that the cost of encrypting
intermediate datasets decreases significantly compared with the traditional
approach, while the privacy protection of the parent or original dataset is still
guaranteed. Technically, cloud computing can be regarded as a combination of
a series of developed or developing ideas and technologies, establishing a
novel business model.
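A highly simplified Java sketch of the upper bound check described above is given below; treating joint leakage as a simple sum of per-dataset leakage estimates is an assumption made purely for brevity, whereas the actual approach bounds joint leakage through information-theoretic analysis over a tree of datasets.

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified upper-bound check: datasets stay unencrypted only while the
// accumulated leakage estimate remains within the allowed threshold.
public class UpperBoundCheck {
    public static List<String> selectUnencrypted(Map<String, Double> leakage, double threshold) {
        List<String> unencrypted = new ArrayList<String>();
        double accumulated = 0.0;
        for (Map.Entry<String, Double> e : leakage.entrySet()) {
            if (accumulated + e.getValue() <= threshold) {
                accumulated += e.getValue();
                unencrypted.add(e.getKey());   // safe to keep in plain form
            }
            // otherwise this dataset would exceed the bound, so it must be encrypted
        }
        return unencrypted;
    }

    public static void main(String[] args) {
        Map<String, Double> leakage = new LinkedHashMap<String, Double>();
        leakage.put("d1", 0.10);
        leakage.put("d2", 0.25);
        leakage.put("d3", 0.40);
        System.out.println("kept unencrypted: " + selectUnencrypted(leakage, 0.5)); // [d1, d2]
    }
}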
36




All the participants in cloud computing business chains can benefit
from this novel business model, as they can reduce their costs and
concentrate on their own core business. Therefore, many companies and
individuals have moved their business into cloud computing environments.
Given the pay-as-you-go feature of cloud computing, computation resources
are economically equivalent to storage resources. Cloud users can therefore
selectively store some intermediate data and final results when processing
raw data on a cloud, especially in data intensive applications like medical
diagnosis and bioinformatics.
Since such data may be accessed by multiple users, storing these
intermediate datasets curtails the overall cost by eliminating the frequently
repeated computation needed to obtain them. However, intermediate data
within an execution may contain sensitive information, such as a social
security number, a medical record or financial information about an
individual. These scenarios are quite common because data users often
reanalyze results, perform new analyses on intermediate data or share some
intermediate results for collaboration. The occurrence of intermediate dataset
storage enlarges the attack surface, so the privacy of the original data is at
risk of being compromised.
Stored intermediate datasets might be out of the control of the
original data owner and can be accessed and shared by other applications,
enabling an adversary to collect them and threaten the private information
of the original dataset, further leading to considerable economic loss or
severe damage to social reputation. This new paradigm allows compute
resources to be allocated dynamically and only for the time they are required
in the processing workflow. With this upper bound control approach for
storing intermediate datasets, users can therefore upload their data to the
online cloud with greater security.
37

CHAPTER 3
SYSTEM ANALYSIS
3.1 EXISTING SYSTEM
Technically, cloud computing is regarded as an ingenious combination
of a series of technologies, establishing a novel business model by offering IT
services and using economies of scale. Participants in the business chain of
cloud computing can benefit from this novel model. Cloud customers can
save huge capital investment of IT infrastructure, and concentrate on their
own core business. Therefore, many companies or organizations have been
migrating or building their business into the cloud. However, numerous
potential customers are still hesitant to take advantage of the cloud due to
security and privacy concerns.
The privacy concerns caused by retaining intermediate data sets in the
cloud are important, but they have received little attention. Storage and computation
services in cloud are equivalent from an economical perspective because they
are charged in proportion to their usage. Thus, cloud users can store valuable
intermediate data sets selectively when processing original data sets in data
intensive applications like medical diagnosis, in order to curtail the overall
expenses by avoiding frequent recomputation to obtain these data sets.
Such scenarios are quite common because data users often reanalyze
results, conduct new analysis on intermediate data sets, or share some
intermediate results with others for collaboration. Without loss of generality,
the notion of intermediate data set herein refers to intermediate and resultant
data sets. However, the storage of intermediate data enlarges attack surfaces
so that privacy requirements of data holders are at risk of being violated.
38

Usually, intermediate data sets in cloud are accessed and processed by
multiple parties, but rarely controlled by original data set holders. This
enables an adversary to collect intermediate data sets together and extract
privacy-sensitive information from them, bringing considerable economic loss
or severe social reputation impairment to data owners. But, little attention has
been paid to such a cloud-specific privacy issue.
3.1.1 Drawbacks of Existing System
The following drawbacks are identified in the existing system.
Static privacy preserving model
Privacy preserving data scheduling is not addressed
Storage and computational aspects are not considered
Load balancing is not considered
3.2 PROPOSED SYSTEM
Shared data values are maintained under third party cloud data centers.
Data values are processed and stored in different cloud nodes. Privacy leakage
upper bound constraint model is used to protect the intermediate data values.
Dynamic privacy management and scheduling mechanisms are integrated to
improve data sharing with security.
Multiple intermediate data set privacy models are integrated with the data
scheduling mechanism. Privacy preservation is ensured with dynamic data
size and access frequency values. Storage space and computational
requirements are optimally utilized in the privacy preservation process. Data
distribution complexity is handled in the scheduling process.
39

3.2.1 Advantages of Proposed System
Privacy preserving cost is reduced
Resource consumption is controlled
Data delivery overhead is reduced
Dynamic privacy preservation model
Encryption cost is reduced
3.3 ACHIEVING MINIMUM STORAGE COST WITH
PRIVACY PRESERVING INTERMEDIATE DATASET IN THE
CLOUD
This system uses an approach to identify which intermediate data
sets need to be encrypted and which do not, so that privacy-preserving cost
can be saved while the privacy requirements of data holders are still satisfied.
For preserving the privacy of datasets, it is promising to anonymize all
datasets first and then encrypt them before storing or sharing them in the
cloud. However, the volume of intermediate datasets is usually huge, so
encrypting all of them would lead to high overhead and low efficiency when
they are frequently accessed or processed.
A tree structure is modeled from the generation relationships of
intermediate datasets to analyze the privacy propagation of datasets, and a
practical heuristic algorithm is designed accordingly to identify the datasets
that need to be encrypted.
40

A Directed Acyclic Graph (DAG) is exploited to capture the topological
structure of generation relationships among these datasets. The possibility of
satisfying privacy leakage requirements without encrypting all intermediate
datasets, when encryption is incorporated with anonymization to preserve
privacy, is formally demonstrated. With this model, the privacy of dataset
holders is preserved and the sensitive intermediate datasets are protected from
intruders.
3.4 PROJECT DESCRIPTION
Cloud computing services provide common business applications
online that are accessed from a web browser, while the software and data are
stored on the servers. The massive computation power and storage capacity of
cloud computing systems allow scientists to deploy computation and data
intensive applications without infrastructure investment. Since the usage of
cloud computing has been spreading widely, a large number of transactions
are carried out in the cloud, and the intermediate data produced during these
transactions may contain sensitive information.
3.4.1 Problem Definition
Existing technical approaches for preserving the privacy of data sets
stored in cloud mainly include encryption and anonymization. On one hand,
encrypting all data sets, a straightforward and effective approach, is widely
adopted in current research. However, processing on encrypted data sets
efficiently is quite a challenging task, because most existing applications only
run on unencrypted data sets. Although recent progress has been made in
homomorphic encryption, which theoretically allows performing computation
on encrypted data sets, applying current algorithms is rather expensive due
to their inefficiency. On the other hand, partial information of data sets, e.g.,
41

aggregate information, needs to be exposed to data users in most cloud
applications like data mining and analytics.
In such cases, data sets are anonymized rather than encrypted to ensure
both data utility and privacy preservation. Current privacy-preserving
techniques like generalization can withstand most privacy attacks on one
single data set, while preserving privacy for multiple data sets is still a
challenging problem. Thus, for preserving privacy of multiple data sets, it is
promising to anonymize all data sets first and then encrypt them before
storing or sharing them in cloud. Usually, the volume of intermediate data sets
is huge.
Hence, it is argued that encrypting all intermediate data sets will lead to high
overhead and low efficiency when they are frequently accessed or processed.
As such, it is proposed to encrypt part of the intermediate data sets rather than
all of them, in order to reduce privacy-preserving cost. This system proposes a
novel approach to identify which intermediate data sets need to be encrypted
while others do not, in order to satisfy the privacy requirements given by data
holders. A tree structure is modeled from the generation relationships of
intermediate data sets to analyze the privacy propagation of data sets.
As quantifying the joint privacy leakage of multiple data sets efficiently is
challenging, an upper bound constraint is exploited to confine privacy
disclosure. Based on such a constraint, the problem of saving privacy-preserving
cost is modeled as a constrained optimization problem. Experimental results on
real-world and extensive data sets demonstrate that the privacy-preserving cost
of intermediate data sets can be significantly reduced with this approach
compared with existing ones where all data sets are encrypted.
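Under notation assumed here for illustration (and not taken verbatim from the source), the constrained optimization just described can be sketched in LaTeX as:

\min_{E \subseteq D} \sum_{d \in E} C_{enc}(d) \quad \text{subject to} \quad PL(D \setminus E) \le \epsilon

where D is the set of intermediate data sets, E is the subset selected for encryption, C_{enc}(d) is the encryption and decryption cost attributed to data set d, PL(\cdot) is the joint privacy leakage of the data sets left unencrypted, and \epsilon is the privacy leakage threshold given by the data holder.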
The major contributions of this research are threefold. First, the
possibility of ensuring privacy leakage requirements without
42

encrypting all intermediate data sets, when encryption is incorporated with
anonymization to preserve privacy, is formally demonstrated. Second, a
practical heuristic algorithm is designed to identify which data sets need to be
encrypted for preserving privacy while the rest do not. Third, experimental
results demonstrate that this approach can significantly reduce
privacy-preserving cost over existing approaches, which is quite beneficial for
cloud users who utilize cloud services in a pay-as-you-go fashion.
3.4.2 Overview of the Project
Providing security for intermediate data is carried out by encryption,
which is very costly. Encrypting all intermediate data sets is neither efficient
nor cost-effective because it is very time consuming and costly for
data-intensive applications to encrypt/decrypt data sets frequently while
performing any operation on them.
Encrypting all intermediate data sets leads to high overhead and low
efficiency when they are frequently accessed or processed. To preserve the
privacy of the intermediate datasets, a novel upper bound privacy leakage
constraint based approach is used to identify which intermediate data sets
need to be encrypted and which do not, which minimizes the amount of
encryption required. The purpose of the scheme is to implement a cost
effective system for preserving the privacy of intermediate data. A user who
has not registered with the online service provider is not able to upload
information to the cloud.


43

3.4.3 System Architecture
In this online service provider, users upload information regarding their
business. Users who want to upload must already be registered members under
the online service provider norms; only registered users are able to upload
their data to the online service provider. The possibility of ensuring privacy
leakage requirements without encrypting all intermediate data sets, when
encryption is incorporated with anonymization to preserve privacy, is formally
demonstrated.



Figure 3.1 System Architecture

(Figure components: Cloud User, Cloud Application, Register, Login, Data Upload, Graph-based Cost Effective Encryption, Security, Cloud)
44

3.5 MODULE DESCRIPTION
Cloud data sharing system provides security for original and
intermediate data values. Data sensitivity is considered in the intermediate
data security process. Resource requirement levels are monitored and
controlled by the security operations. The system is divided into five major
modules. They are data center, data provider, intermediate data privacy,
security analysis and data scheduling.
The data center maintains the encrypted data values for the providers.
The shared data uploading process is managed by the data provider module.
Intermediate data privacy module is designed to protect intermediate results.
Security analysis module is designed to estimate the resource and access
levels. Original data and intermediate data distribution is planned under the
data scheduling module.
Data Center
Database transactions are shared in the data centers. Data center
maintains the shared data values in encrypted form. Homomorphic encryption
scheme is used for encryption process. Key values are also provided by the
data center.
Data Provider
Data provider uploads the database tables to the data center. Database
schema is also shared by the provider. Encryption process is performed under
the data provider environment. Access control tasks are managed by the
providers. Homomorphic encryption can be used to encrypt the data stored in
the cloud so that operations can be carried out on the encrypted data at
processing time in the cloud itself.
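To illustrate why a homomorphic property is useful in this setting, the Java sketch below uses textbook RSA, which is multiplicatively homomorphic: the product of two ciphertexts decrypts to the product of the plaintexts, so the cloud can combine encrypted values without seeing them. This is only an illustration of the property and is not claimed to be the exact encryption scheme deployed in the system.

import java.math.BigInteger;

// Multiplicative homomorphism of textbook RSA: Dec(Enc(a) * Enc(b) mod n) = a * b mod n.
public class HomomorphicDemo {
    public static void main(String[] args) {
        BigInteger p = BigInteger.valueOf(61), q = BigInteger.valueOf(53);
        BigInteger n = p.multiply(q);                                          // modulus 3233
        BigInteger phi = p.subtract(BigInteger.ONE).multiply(q.subtract(BigInteger.ONE));
        BigInteger e = BigInteger.valueOf(17);
        BigInteger d = e.modInverse(phi);

        BigInteger a = BigInteger.valueOf(12), b = BigInteger.valueOf(7);
        BigInteger ca = a.modPow(e, n);               // Enc(a)
        BigInteger cb = b.modPow(e, n);               // Enc(b)
        BigInteger combined = ca.multiply(cb).mod(n); // combined by the cloud, no decryption

        System.out.println(combined.modPow(d, n));    // prints 84, i.e. 12 * 7
    }
}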
45

Intermediate Data Privacy
Intermediate data values are generated by processing the original data
values. Intermediate data values are stored under the data center or provider
environment. Encryption process is carried out on the intermediate data
values. Sensitivity information is used for the intermediate data security
process.
Security Analysis
Joint privacy leakage model is used for the security process. Storage
requirements are analyzed in the intermediate data analysis. Computational
resource requirements are also analyzed in the security analysis. Intermediate
data encryption decisions are made with reference to the storage and
computational resource requirements.
Data Scheduling
Data scheduling is used to plan the data distribution process.
Computational tasks are combined in the scheduling process. Scheduling is
applied to select suitable provider for data delivery process. Request levels are
considered in the data scheduling process.
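A minimal Java sketch of the provider selection step is given below; the load metric and its weighting are assumptions made for illustration and are not the exact scheduling policy of the system.

import java.util.Arrays;
import java.util.List;

// Illustrative load-balancing selection: deliver through the provider with the lowest load.
public class DataScheduler {
    static class Provider {
        String name;
        int pendingRequests;     // assumed request-level metric
        double storageLoad;      // assumed storage utilization between 0.0 and 1.0
        Provider(String name, int pendingRequests, double storageLoad) {
            this.name = name;
            this.pendingRequests = pendingRequests;
            this.storageLoad = storageLoad;
        }
        double load() { return pendingRequests + 10.0 * storageLoad; }
    }

    static Provider select(List<Provider> providers) {
        Provider best = providers.get(0);
        for (Provider p : providers) {
            if (p.load() < best.load()) {
                best = p;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<Provider> providers = Arrays.asList(
                new Provider("provider-1", 4, 0.70),
                new Provider("provider-2", 2, 0.30));
        System.out.println("deliver via " + select(providers).name);  // provider-2
    }
}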





46

3.6 SYSTEM SPECIFICATION
3.6.1 Hardware Requirements
Processor : Intel Dual Core 2.5GHz
RAM : 1GB
Hard Disk : 80 GB
Floppy Disk Drive : Sony 1.44 MB
DVD-ROM : LG 52X MAX
Keyboard : TVS Gold 104 Keys
Mouse : Tech-Com SSD Optical Mouse
Ethernet Card : Realtek 1110 10mbps
3.6.2 Software Requirements
Platform : Windows XP
Language : Java
Backend : Oracle
Simulation Tool : CloudSim





47

3.7 SOFTWARE DESCRIPTION
Windows XP
Windows XP offers many new, exciting features, in addition to
improvements to many features from earlier versions of Windows.
Windows XP Professional makes sharing a computer easier than ever by
storing personalized settings and preferences for each user.
Windows XP Features
XP RAP project members review individual features in Windows XP,
including:
Remote Desktop and Remote Assistance
Power management
Windows application compatibility
System tools: device driver rollback, last known good configuration,
and system restore
Multi-language toolkit
Personal firewall
Automatic unzip feature: There is no need for expander tools such as
WinZip or Aladdin Expander with Windows XP. Zipped files are
automatically unzipped by Windows and placed in folders.
Managing a myriad of network and Internet connections can be
confusing, so Windows XP provides tools for managing network and Internet
connections for local and remote users. Windows XP is loaded with new tools
and programs that ensure the privacy and security of data and help the
computer operate at peak performance.

48

Java
Java is a general purpose, object oriented programming language
developed by Sun Microsystems of USA in 1991. The most striking feature of
the language is that it is a platform neutral language. Java can be called a
revolutionary technology because it has brought in a fundamental shift in how
programs are developed and used. The internet helped catapult Java to the
forefront of programming. It can be used to develop both application and
applet programs. Java is mainly adopted for two reasons.
Security
Portability
These two features are available in Java because of bytecode. Bytecode
is a highly optimized set of instructions to be executed by the Java runtime
system called the Java Virtual Machine (JVM). A Java program is under the
control of the JVM; the JVM can contain the program and prevent it from
generating side effects outside the system. Thus safety is included in the Java
language. Some of the features of Java which are adopted for this system are
Multithreading
Socket programming
Swing
Multithreading
Users perceive that their world is full of multiple events all happening
at once and want their computers to do the same. Unfortunately, writing
programs that deal with many things at once can be much more difficult than
writing conventional single threaded programs in C or C++. Thread safety in
49

multithreading means that a given library function can be safely used by
concurrent threads of execution. A minimal example of creating threads in
Java is shown below.
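The following sketch creates two worker threads and waits for them with join(); the class and thread names are arbitrary choices for illustration.

// Minimal multithreading example: two threads run the same task concurrently.
public class ThreadDemo {
    public static void main(String[] args) throws InterruptedException {
        Runnable task = new Runnable() {
            public void run() {
                System.out.println(Thread.currentThread().getName() + " is running");
            }
        };
        Thread t1 = new Thread(task, "worker-1");
        Thread t2 = new Thread(task, "worker-2");
        t1.start();
        t2.start();
        t1.join();   // wait for both workers to finish
        t2.join();
    }
}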
Socket programming
A socket is one end-point of a two-way communication link between
two programs running on the network. Socket classes are used to represent the
connection between a client program and a server program. The java.net
package provides two classes:
Socket
Server Socket
These two classes implement the client and server side of the
connection respectively. The beauty of Java sockets is that no knowledge
whatsoever of the details of TCP is required. TCP stands for Transmission
Control Protocol and is a standard protocol for data transmission with
confirmation of data reception. A minimal example follows the list below.
Sockets are highly useful in at least three communication contexts:
Client/server models
Peer-to-Peer scenarios, such as chat applications
Making Remote Procedure Calls (RPC) by having the
receiving application interpret a message as a function call.
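A minimal echo server built from these two classes is sketched below; the port number and message format are arbitrary choices for illustration. A client would connect with new Socket("localhost", 5000), write one line, and read the echoed reply.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;

// Minimal echo server using java.net.ServerSocket and java.net.Socket.
public class EchoServer {
    public static void main(String[] args) throws Exception {
        ServerSocket server = new ServerSocket(5000);      // listen on an arbitrary port
        Socket client = server.accept();                    // block until a client connects
        BufferedReader in = new BufferedReader(new InputStreamReader(client.getInputStream()));
        PrintWriter out = new PrintWriter(client.getOutputStream(), true);
        String line = in.readLine();                        // read one line from the client
        out.println("echo: " + line);                       // send it back
        client.close();
        server.close();
    }
}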
Swing
Swing refers to the new library of GUI controls (buttons, sliders,
checkboxes etc) that replaces the somewhat weak and inflexible AWT
controls. Swing is a rapid GUI development tool that is part of the standard
Java development kit. It was primarily developed due to the shortcomings of
50

the Abstract Window Toolkit (AWT). Swing is a set of classes that provides more
powerful and flexible components than AWT. Swing components are not
implemented by platform specific code. Instead they are written in Java and
therefore are platform independent. The term lightweight is used to describe
such elements. In addition, all Swing components support assistive
technologies.
Remote Method Invocation (RMI)
This is a brief introduction to Java Remote Method Invocation (RMI).
Java RMI is a mechanism that allows one to invoke a method on an object
that exists in another address space. The other address space could be on the
same machine or a different one. The RMI mechanism is basically an object
oriented RPC mechanism. CORBA is another object-oriented RPC
mechanism. CORBA differs from Java RMI in a number of ways.
CORBA is a language-independent standard. CORBA includes many
other mechanisms in its standard none of which are part of Java RMI. There is
also no notion of an "object request broker" in Java RMI. Java RMI has
recently been evolving toward becoming more compatible with CORBA. In
particular, there is now a form of RMI called RMI/IIOP ("RMI over IIOP")
that uses the Internet Inter-ORB Protocol (IIOP) of CORBA as the underlying
protocol for RMI communication. This section attempts to show the essence
of RMI without discussing any extraneous features. Sun's guide includes a lot
of material that is not relevant to RMI itself. For example, it discusses how to
incorporate RMI into an Applet, how to use packages and how to place
compiled classes in a different directory than the source code. All of these are
interesting in themselves, but they have nothing at all to do with RMI. As a
result, Sun's guide is unnecessarily confusing. Moreover, Sun's guide and
examples omit a number of details that are important for RMI. The Client is
the process that is invoking a method on a remote object.
51

The server is the process that owns the remote object. The remote
object is an ordinary object in the address space of the server process. The
Object Registry is a name server that relates objects with names. Objects are
registered with the Object Registry. Once an object has been registered, one
can use the Object Registry to obtain access to a remote object using the name
of the object.
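The following compact sketch, run inside a single JVM for simplicity, shows the registry, binding, and lookup flow just described; the service name, port, and interface are illustrative choices rather than part of this project's code.

import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.registry.LocateRegistry;
import java.rmi.registry.Registry;
import java.rmi.server.UnicastRemoteObject;

// Remote interface: every remote method must declare RemoteException.
interface Hello extends Remote {
    String greet(String name) throws RemoteException;
}

// Remote object owned by the server process.
class HelloImpl extends UnicastRemoteObject implements Hello {
    protected HelloImpl() throws RemoteException { super(); }
    public String greet(String name) throws RemoteException {
        return "Hello, " + name;
    }
}

public class RmiSketch {
    public static void main(String[] args) throws Exception {
        Registry registry = LocateRegistry.createRegistry(1099);  // the Object Registry
        registry.rebind("HelloService", new HelloImpl());          // server side: register the object
        Hello stub = (Hello) registry.lookup("HelloService");      // client side: obtain a reference
        System.out.println(stub.greet("cloud user"));              // invoke the remote method
    }
}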
Relational Database Management System (RDBMS)
Over the past several years, relational database management systems
have become the most widely accepted way to manage data. Relational
systems offer benefits such as:
Easy access of data
Flexibility in data modeling
Reduced data storage and redundancy
Independence of physical storage and logical data design
A high-level data manipulation language (SQL)
The phenomenal growth of relational technology has led to more
demand for RDBMSs, from personal computers to large, highly secure CPUs.
The Oracle Corporation was the first company to offer a true RDBMS
commercially that is portable, compatible and connectable. This has produced
a set of powerful user-level add-on tools to cater to ad hoc requests.
Oracle RDBMS
Although ORACLE demands greater expertise on the part of the
application developer, an application developed on ORACLE will be able to
keep pace with growth and change.
52

Oracle gives security and control
Disaster recovery can be extremely problematic, but ORACLE has several
features that ensure the integrity of the database. If an interruption occurs in
processing, a ROLL BACK can reset the database to the previous transaction
point before the disaster. If a restore is necessary, ORACLE has a ROLL
FORWARD command for recreating the database to its most recent
SAVEPOINT.
ORACLE provides users with several functions for security. GRANT
and REVOKE commands limit access to information down to the column and
row levels, and there are many other ways to control access to a database. One
part of the kernel is the query optimizer. The query optimizer examines
alternate access paths to the data to find the optimal path to resolve a given
query; in order to do so, it considers which indexes will be most helpful.
At the heart of the ORACLE RDBMS is the Structured Query
Language (SQL). SQL was developed and defined by IBM research, and has
been accredited by the ANSI as the standard query language for RDBMS. It is
an English-like language that is used for most database activities. SQL is
simple enough to allow novice users to access data easily and quickly, yet it is
powerful enough to offer programmers all the capability and flexibility they
require.
SQL statements are designed to work with relational database data. The
SQL data language presumes that database data is found in tables. Each table
is defined by a table name and a set of columns.
Each column has a column name and a datatype. Columns can contain null
values if they have not been declared NOT NULL. Columns are sometimes
called fields or attributes.
53

CloudSim
Cloud computing emerged as the leading technology for delivering
reliable, secure, fault-tolerant, sustainable, and scalable computational
services, which are presented as Software, Infrastructure, or Platform as
services. Moreover, these services may be offered in private data centers,
may be commercially offered for clients, or yet it is possible that both
public and private clouds are combined in hybrid clouds.
This already wide ecosystem of cloud architectures, along with the
increasing demand for energy-efficient IT technologies, demands timely,
repeatable, and controllable methodologies for evaluation of algorithms,
applications, and policies before actual development of cloud products.
Because utilization of real testbeds limits the experiments to the scale of the
testbed and makes the reproduction of results an extremely difficult
undertaking, alternative approaches for testing and experimentation
leverage development of new Cloud technologies.
A suitable alternative is the utilization of simulation tools, which
open the possibility of evaluating a hypothesis prior to software
development in an environment where tests can be reproduced. Specifically
in the case of Cloud computing, where access to the infrastructure incurs
payments in real currency, simulation-based approaches offer significant
benefits, as they allow Cloud customers to test their services in a repeatable
and controllable environment free of cost, and to tune the performance
bottlenecks before deploying on real Clouds. At the provider side,
simulation environments allow evaluation of different kinds of resource
leasing scenarios under varying load and pricing distributions. Such studies
could aid the providers in optimizing the resource access cost with focus on
improving profits. In the absence of such simulation platforms, Cloud
customers and providers have to rely either on theoretical and imprecise
54

evaluations, or on try-and-error approaches that lead to inefficient service
performance and revenue generation.
The primary objective of this project is to provide a generalized and
extensible simulation framework that enables seamless modeling,
simulation, and experimentation of emerging Cloud computing
infrastructures and application services. By using CloudSim, researchers
and industry-based developers can focus on specific system design issues
that they want to investigate, without getting concerned about the low level
details related to cloud-based infrastructures and services.
3.8 SYSTEM IMPLEMENTATION
System implementation is the creation and installation of the method,
following engineering principles to remove part of the human element from
the equation. Implementation is the process of realizing the design as a
program. Although the data footprint results presented might unveil the
relative performance of different classification techniques (given that the
memory system is generally deemed the bottleneck), the computation steps
and the mechanisms involved in dealing with the data structures are equally
important and have to be taken into consideration.
To arrive at a more accurate evaluation, the classification and
throughput performance is measured, with the results for HaRP and HC
listed. Values in the table are all scaled relative to one-thread HyperCuts
performance, which is shown as a reference of one in the second column for a
clear and system configuration-independent comparison.
System implementation is a practice of creating or modifying a system
to create a new business process or replace an existing business process.
Implementation of software refers to the final installation of the packages in
55

its real environment, to the satisfaction of the intended users and the operation
of the systems. The users may not initially be sure that the software is meant
to make their jobs easier, so:
The active user must be aware of the benefits of using the system
Their confidence in the software must be built up
Proper guidance must be imparted to the user so that he or she is comfortable
in using the application
The second to the fourth columns include the single core results. The
fifth to the seventh columns contain the speedup relative to one-thread
HyperCuts performance, with the rest of the results for four cores. When the
number of threads rises from one to two and then to four, HC shows nearly
linear or superlinear scalability (in terms of raw classification rates) with
respect to the number of cores. The superlinear phenomenon may be due to
the fact that one thread can benefit from a cache warmed up by another
thread, or to marginal errors that occurred during statistics collection.
The system implementation phase consists of the following steps:
Testing the developed software with sample data
Correction of any errors if identified
Creating the files of the system with actual data
Making necessary changes to the system to find out errors
Training of user personnel
The system has been tested with sample data, changes have been made
according to the user requirements, and it has been run in parallel with the
existing system to find any discrepancies.
56

The user has also been apprised of how to run the system during the training
period.
3.8.1 Implementation Plan
Implementation is a crucial stage in the life cycle of the newly designed
system. Implementation means converting a new or revised system design
into an operational one. The mechanisms involved in dealing with the data
structures are equally important and have to be taken into consideration. This
is the stage of the project where the theoretical design is turned into a working
system. In this project, implementation includes all those activities that take
place to convert from the old system to the new one. The most important
phase of the implementation plan is the changeover.
The implementation phases of construction, installation and operation
lie at the heart of the new system. This is the most crucial and important stage
in achieving a successful new system and in giving the user confidence that
the new system will work efficiently and effectively.
There are several activities involved while implementing the project:
Careful planning
Investigation of the current system and its constraints on implementation
Design of methods to achieve the changeover
Training of the staff in the changeover procedure and evaluation of the
changeover method
Implementation is the final stage and an important phase. It involves
individual program and system testing, user training and the operational
running of the developed proposed system that constitutes the
57

application subsystems. One major task of preparing for implementation is
the education of users, which should really have taken place much earlier in
the project when they were involved in the investigation and design work. The
implementation phase of software development is concerned with translating
design specifications into source code. The users test the developed system
and changes are made according to their needs.
3.8.2 Changeover
The implementation is to be done step by step, since testing with
dummy data will not always reveal the faults. The system will first be given to
the employees to work with. If any error or failure is found, the system can be
corrected before it is implemented at full stretch. The trial should continue
until the system is confirmed to function without any failures or errors.
Precautions should be taken so that any error that occurs does not bring the
process to a complete halt. The system can be fully established once it does
not produce any errors during the testing period.
3.8.3 Education and User Training
Well-designed and technically elegant systems can succeed or fail
because of the way they are operated and used. Therefore the quality of the
training received by the personnel involved with the system can help or
hinder, and may even prevent, the successful completion of the system. An
analysis of user training focuses on user capabilities and the nature of the
system being installed. Users vary in type and nature: some of them may not
have any knowledge about computers while others may be experienced. The
requirements of the system also range from simple to complex tasks. So the
training has to be tailored to the specific user based on his or her capabilities
and the system's complexity. The users test the developed system and changes
are made according to their needs.
58

User training must instruct individuals in troubleshooting the system
and in determining whether a problem that arises is caused by hardware or
software. The implementation phase of software development is concerned
with translating design specifications into source code. The users test the
developed system and changes are made according to their needs. Good
documentation, which instructs the user on how to start the system and
explains the functions and meanings of the various codes, must be prepared,
as it will help the user to understand the system better.
3.9 SYSTEM TESTING
3.9.1 Introduction
System testing is the stage of implementation, which is aimed at
ensuring that the system works accurately and efficiently before live operation
commences. Testing is vital to the success of the system. Elaborate test data is
prepared and the system is tested using this data; errors noted during testing
are corrected. The users are trained to operate the developed system. Both
hardware and software safeguards are put in place so that the developed
system runs successfully in the future.

59

3.9.2 Testing Definition
Software development has several levels of testing:
Unit Testing
Integration Testing
Validation Testing
Output Testing
User Acceptance Testing
Unit Testing
Unit testing focuses verification efforts on the smallest unit of software
design, the module. This is also known as module testing. The modules
are tested separately. This testing is carried out during programming stage
itself. In this testing step each Module is found to be working satisfactorily as
regard to the expected output from the module.
Integration Testing
Integration testing is a systematic technique for constructing tests to
uncover errors associated within the interface. In this project, all the modules
are combined, and then the entire Program is tested as a whole. Thus in the
integration testing step, all the errors uncovered are corrected for the next
testing steps.
Validation Testing
Validation testing is the testing where the requirements established as
part of software requirement analysis are validated against the software that
has been constructed. This test provides the final assurance that the software
meets all functional, behavioral and performance requirements. The errors,
60

which are uncovered during integration testing, are corrected during this
phase.
Output Testing
After performing the validation testing, the next step is output testing
of the proposed system since no system could be useful if it does not produce
the required output in the specific format. The output generated or displayed
by the system under consideration is tested by asking the users about the
format required by them. Here, the output is considered in two ways: one is on
the screen and the other is in a printed format.
The output format on the screen is found to be correct, as the format
was designed according to the users' needs. For the hard copy also, the output
comes out as specified by the user. Hence output testing does not result in any
correction to the system.
User Acceptance Testing
The testing of the software began along with coding. Since the design
was fully object-oriented, first the interfaces were developed and tested. Then
unit testing was done for every module in the software for various inputs,
such that each line of code was executed at least once. User acceptance of a
system is the key factor for the success of any system. The system under
consideration is tested for user acceptance by constantly keeping in touch
with the prospective system users at the time of developing and making
changes to the program.


61

CHAPTER 4
CONCLUSION AND FUTURE ENHANCEMENT
4.1 CONCLUSION
Shared data values are maintained under third party cloud data centers.
Data values are processed and stored in different cloud nodes. Privacy leakage
upper bound constraint model is used to protect the intermediate data values.
Dynamic privacy management and scheduling mechanisms are integrated to
improve data sharing with security. Privacy preserving cost is reduced by
the joint verification mechanism. Resource consumption is controlled by the
support of sensitive data information graph. Data delivery overhead is
reduced by the load balancing based scheduling mechanism. Dynamic privacy
preservation model is supported by the system.
4.2 FUTURE ENHANCEMENT
The system can be enhanced to support commercial cloud operations.
The system can be improved to handle intrusion detection process.
The system can be enhanced to share data under wireless data centers
based cloud model.
The system can be improved with data cache and data replica schemes
to ensure fast data delivery process.



62

APPENDIX I (SAMPLE CODE)
Datacenter Management
import java.text.DecimalFormat;
import java.util.ArrayList;
import java.util.Calendar;
import java.util.LinkedList;
import java.util.List;
import org.cloudbus.cloudsim.Cloudlet;
import org.cloudbus.cloudsim.CloudletSchedulerTimeShared;
import org.cloudbus.cloudsim.Datacenter;
import org.cloudbus.cloudsim.DatacenterBroker;
import org.cloudbus.cloudsim.DatacenterCharacteristics;
import org.cloudbus.cloudsim.Host;
import org.cloudbus.cloudsim.Log;
import org.cloudbus.cloudsim.NetworkTopology;
import org.cloudbus.cloudsim.Pe;
import org.cloudbus.cloudsim.Storage;
import org.cloudbus.cloudsim.UtilizationModel;
import org.cloudbus.cloudsim.UtilizationModelFull;
import org.cloudbus.cloudsim.Vm;
import org.cloudbus.cloudsim.VmAllocationPolicySimple;
import org.cloudbus.cloudsim.VmSchedulerTimeShared;
import org.cloudbus.cloudsim.core.CloudSim;
import org.cloudbus.cloudsim.provisioners.BwProvisionerSimple;
import org.cloudbus.cloudsim.provisioners.PeProvisionerSimple;
import org.cloudbus.cloudsim.provisioners.RamProvisionerSimple;
public class DataCenterMgmt
{
private static List<Cloudlet> cloudletList;
63

private static List<Vm> vmlist;
public static void construct()
{
Log.printLine("Starting Cloud Data Center Management...");
try {
int num_user = 1;
Calendar calendar = Calendar.getInstance();
boolean trace_flag = false;
CloudSim.init(num_user, calendar, trace_flag);
Datacenter datacenter0 = createDatacenter("Datacenter_0");
DatacenterBroker broker = createBroker();
int brokerId = broker.getId();
vmlist = new ArrayList<Vm>();
int vmid = 0;
int mips = 250;
long size = 10000;
int ram = 512;
long bw = 1000;
int pesNumber = 1;
String vmm = "Xen";

Vm vm1 = new Vm(vmid, brokerId, mips, pesNumber, ram,
bw, size, vmm, new CloudletSchedulerTimeShared());
vmlist.add(vm1);
broker.submitVmList(vmlist);
cloudletList = new ArrayList<Cloudlet>();
int id = 0;
long length = 40000;
long fileSize = 300;
long outputSize = 300;
64

UtilizationModel utilizationModel = new UtilizationModelFull();
Cloudlet cloudlet1 = new Cloudlet(id, length, pesNumber,
fileSize, outputSize, utilizationModel, utilizationModel,
utilizationModel);
cloudlet1.setUserId(brokerId);
cloudletList.add(cloudlet1);
broker.submitCloudletList(cloudletList);
NetworkTopology.buildNetworkTopology("topology.brite");
int briteNode=0;
NetworkTopology.mapNode(datacenter0.getId(),briteNode);
briteNode=3;
NetworkTopology.mapNode(broker.getId(),briteNode);
CloudSim.startSimulation();
List<Cloudlet> newList = broker.getCloudletReceivedList();
CloudSim.stopSimulation();
printCloudletList(newList);
Log.printLine("Cloud Data Center Constructed!");
}
catch (Exception e) {
e.printStackTrace();
Log.printLine("simulation has terminated due to an unexpected
error");
}
}
private static Datacenter createDatacenter(String name)
{
List<Host> hostList = new ArrayList<Host>();
List<Pe> peList = new ArrayList<Pe>();
int mips = 1000;
65

peList.add(new Pe(0, new PeProvisionerSimple(mips)));
int hostId=0;
int ram = 2048;
long storage = 1000000;
int bw = 10000;
hostList.add(
new Host( hostId, new RamProvisionerSimple(ram),
new BwProvisionerSimple(bw), storage, peList,
new VmSchedulerTimeShared(peList)
)
);
String arch = "x86";
String os = "Linux";
String vmm = "Xen";
double time_zone = 10.0;
double cost = 3.0;
double costPerMem = 0.05;
double costPerStorage = 0.001;
double costPerBw = 0.0;
LinkedList<Storage> storageList = new LinkedList<Storage>();
DatacenterCharacteristics characteristics = new
DatacenterCharacteristics(
arch, os, vmm, hostList, time_zone, cost, costPerMem,
costPerStorage, costPerBw);
Datacenter datacenter = null;
try {
datacenter = new Datacenter(name, characteristics, new
VmAllocationPolicySimple(hostList), storageList, 0);
} catch (Exception e) {
e.printStackTrace();
66

}
return datacenter;
}
private static DatacenterBroker createBroker()
{
DatacenterBroker broker = null;
try {
broker = new DatacenterBroker("Broker");
} catch (Exception e) {
e.printStackTrace();
return null;
}
return broker;
}
private static void printCloudletList(List<Cloudlet> list)
{
int size = list.size();
Cloudlet cloudlet;
String indent = " ";
Log.printLine();
Log.printLine("========== OUTPUT ==========");
Log.printLine("Cloudlet ID" + indent + "STATUS" + indent +
"Data center ID" + indent + "VM ID" + indent + "Time"
+ indent + "Start Time" + indent + "Finish Time");
for (int i = 0; i < size; i++) {
cloudlet = list.get(i);
Log.print(indent + cloudlet.getCloudletId() + indent + indent);
if (cloudlet.getCloudletStatus() == Cloudlet.SUCCESS){
Log.print("SUCCESS");
DecimalFormat dft = new DecimalFormat("###.##");
67

Log.printLine( indent + indent + cloudlet.getResourceId()
+ indent + indent + indent + cloudlet.getVmId() +
indent + indent +
dft.format(cloudlet.getActualCPUTime()) + indent + indent +
dft.format(cloudlet.getExecStartTime())+
indent + indent +
dft.format(cloudlet.getFinishTime()));
}
}
}
}

RSA Key Generation
import java.math.BigInteger;
import java.util.Random;
class RSAKeyGen
{
BigInteger n,e,d;
String owner;
Random ran;
int certainty = 32,radix=10,bits;
BigInteger one = new BigInteger("1");
BigInteger fn,p,q;
RSAKeyGen(String iowner)
{
bits = 92;
owner = iowner.replace(' ','_');
ran = new Random();
p = new BigInteger(bits/2,certainty,ran);
q = new BigInteger((bits+1)/2,certainty,ran);
fn = fi(p,q);
n = p.multiply(q);
68

e = chooseprimeto(fn);
d = e.modInverse(fn);
}
RSAKeyGen(long bp,long bq)
{
p = getPrime(bp);
System.out.println("P value : "+p);
q = getPrime(bq);
System.out.println("Q value : "+q);
ran = new Random();
fn = fi(p,q);
n = p.multiply(q);
e = chooseprimeto(fn);
d = e.modInverse(fn);
}
BigInteger fi(BigInteger prime1,BigInteger prime2)
{
return prime1.subtract(one).multiply(prime2.subtract(one));
}
BigInteger BI(String s)
{
return new BigInteger(s);
}
BigInteger chooseprimeto(BigInteger f)
{
BigInteger num;
do
{
num=new BigInteger(16,ran);
}while(!f.gcd(num).equals(one));
69

return num;
}
String printBI(BigInteger b)
{
return b.toString(radix);
}
BigInteger readBI(String s)
{
return new BigInteger(s,radix);
}
String getPublicKey()
{
/// return owner+" " +printBI(n)+" "+printBI(e);
return printBI(e)+" "+printBI(n);
}
String getPrivateKey()
{
/// return owner+" " +printBI(n)+" "+printBI(e)+" "+printBI(d);
return printBI(d)+" "+printBI(n);
}
String getKeys()
{
return owner+" " +printBI(n)+" "+printBI(e)+" "+printBI(d);
}
public BigInteger getPrime(long pval)
{
BigInteger tval = new BigInteger(""+pval);
while(true)
{
if(tval.isProbablePrime(8))
70

return tval;
tval = tval.add(one);
}
}
public String getP()
{
return printBI(p);
}
public String getQ()
{
return printBI(q);
}
}
class test
{
public static void main(String args[])
{
long bp = Long.parseLong(args[0]);
long bq = Long.parseLong(args[1]);
RSAKeyGen rkgen = new RSAKeyGen(bp,bq);
System.out.println("PKey : "+rkgen.getPrivateKey());
}
}




71

APPENDIX II (SCREEN SHOTS)


Figure A-II.1 Intermediate Data Security Process



72




Figure A-II.2 Data Import Process



73




Figure A-II.3 Credit Card Details


74




Figure A-II.4 Data Privacy & Security Process


75




Figure A-II.5 Anonymization Process



76




Figure A-II.6 Anonymization Process



77




Figure A-II.7 Data Scheduling Process



78






Figure A-II.8 Support Estimation Process

79






Figure A-II.9 Rule Mining Process

80





Figure A-II.10 Data Scheduling Process

81

REFERENCES
[1] Cao N., Wang C., Li M. and Lou (2007), "Privacy-Preserving Multi-Keyword Ranked Search over Encrypted Cloud Data", IEEE Transactions on Parallel and Distributed Systems, Vol. 24, No. 4, pp. 829-837.
[2] Ciriani V., Foresti S., Jajodia and Samarati (2012), "Privacy-Preserving Data Publishing: A Survey of Recent Developments", IEEE Transactions on Parallel and Distributed Systems, Vol. 13, No. 3, pp. 133.
[3] Cong Wang and Qian Wang (2010), "Harnessing the Cloud for Securely Outsourcing Large-Scale Systems of Equations", IEEE Transactions on Parallel and Distributed Systems, Vol. 24, No. 6, pp. 200-370.
[4] Crane Wang, Kui and Wenjing (2010), "Enabling Secure and Efficient Ranked Keyword Search over Outsourced Data", IEEE Transactions on Parallel and Distributed Systems, Vol. 23, No. 8, pp. 400-476.
[5] Davidson, Khanna, Tannen and Stoyanovich (2009), "Enabling Privacy in Provenance-Aware Workflow Systems", IEEE Transactions on Parallel and Distributed Systems, Vol. 24, No. 3, pp. 215-218.
[6] Dong Yuan, Yun Yang, Wenhao Li and Jinjun Chen (2011), "A Highly Practical Approach Toward Achieving Minimum Data Sets Storage Cost in the Cloud", IEEE Transactions on Parallel and Distributed Systems, Vol. 24, No. 6, pp. 600-760.
[7] Duan, Yang, Liu X. and Chen J. (2007), "On-Demand Minimum Cost Benchmarking for Intermediate Data Set Storage in Scientific Cloud Workflow Systems", IEEE Transactions on Parallel and Distributed Systems, Vol. 71, No. 2, pp. 316-332.
[8] Dyauan, Yang, Liu and Chen J. (2007), "Benchmarking Approach for Intermediate Data Set Storage in Cloud Systems", IEEE Transactions on Parallel and Distributed Systems, Vol. 71, No. 2, pp. 306-332.
[9] Hsiao-Ying Lin and Wenguey Tzeng (2010), "A Secure Code-Based Cloud Storage System with Secure Data Forwarding", IEEE Transactions on Parallel and Distributed Systems, Vol. 23, No. 6, pp. 260-300.
[10] Ko S., Hoque, Bho C. and Gupta (2010), "Security and Privacy Challenges in Cloud Computing Environments", IEEE Transactions on Parallel and Distributed Systems, Vol. 24, No. 5, pp. 181-192.
82

[11] Lei Wang, Jianfeng Zhan, Weisong Shi and Yi Liang (2012), "Scientific Communities Benefit from the Economies of Scale", IEEE Transactions on Parallel and Distributed Systems, Vol. 23, No. 2, pp. 300-450.
[12] Li M., Yu S., Cao N. and Lou W. (2008), "Authorized Private Keyword Search over Encrypted Data in Cloud Computing", IEEE Transactions on Parallel and Distributed Systems, Vol. 35, No. 5, pp. 383-392.
[13] Mauii, Cao and Lou W. (2008), "Silverline: Toward Data Confidentiality in Storage-Intensive Cloud Applications", IEEE Transactions on Parallel and Distributed Systems, Vol. 24, No. 9, pp. 383-392.
[14] Ming Li, Shucheng Yu and Wenjing Lou (2012), "Scalable and Secure Sharing of Personal Health Records in Cloud Computing Using Attribute-Based Encryption", IEEE Transactions on Parallel and Distributed Systems, Vol. 24, No. 8, pp. 383-392.
[15] Puttaswamy, Kruegel and Zhao (2011), "Cloud Computing and Emerging IT Platforms: Vision", IEEE Transactions on Parallel and Distributed Systems, Vol. 24, No. 3, pp. 600-678.
[16] Wang and Yi Liang (2012), "Exploiting Dynamic Resource Allocation for Efficient Parallel Data Processing in the Cloud", IEEE Transactions on Parallel and Distributed Systems, Vol. 23, No. 2, pp. 488-577.
[17] Xuyun Zhang and Surya Nepal (2013), "A Privacy Leakage Upper Bound Constraint-Based Approach for Cost Effective Privacy Preserving of Intermediate Data Sets in Cloud", IEEE Transactions on Parallel and Distributed Systems, Vol. 24, No. 6, pp. 699-700.
[18] Xuyun Zhang, Suraj Pandey and Jinjun (2013), "Privacy: Integrating Background Knowledge in Privacy Quantification", IEEE Transactions on Parallel and Distributed Systems, Vol. 24, No. 6, pp. 250-499.
[19] Yuan, Xiao and Jinjun Chen (2011), "An Effective Strategy for Intermediate Data Storage in Cloud Workflow Systems", IEEE Transactions on Parallel and Distributed Systems, Vol. 23, No. 6, pp. 300-400.
[20] Zhang, Zhou, Chen, Wang and Ruan (2008), "Sedic: Privacy Aware Data Intensive Computing on Hybrid Clouds", IEEE Transactions on Parallel and Distributed Systems, Vol. 36, No. 4, pp. 515-526.

83

PUBLICATIONS
1. P. Suganya, R. Mahendra Kumar, "Achieving Minimum Storage Cost with Privacy Preserving Intermediate Datasets in the Cloud", IJRIT, Vol. 01, No. 12, pp. 251-263, Dec 2013.
2. P. Suganya, R. Mahendra Kumar, "Achieving Cost Effective Privacy Preserving Intermediate Datasets in the Cloud", International Conference on Competency Building Strategies in Business and Technology for Sustainable Development 2014, Sri Ganesh School of Business Management, Salem, Feb 2013.
