
Privacy Preserving Data Mining
Supervised By: Prof. Dr. Safaa O. AL-Mamory

Prepared By: Ali Hussain Mohammed Ali


Objective of the Research
• The problem of privacy-preserving data mining has become
more important in recent years due to the increasing ability
to store personal data about users, and the corresponding
increase in the sophistication of data mining algorithms that
leverage this information.

• This has resulted in concerns that private data may be
misused for a variety of purposes. In order to alleviate these
concerns, a number of techniques have recently been proposed
to perform data mining tasks in a privacy-preserving way.

• These techniques for performing privacy-preserving data
mining are drawn from a wide array of related topics such as
data mining, cryptography and information hiding.

• The research is intended to provide a good overview of
some of the important topics in this domain.

Outline of the Research

• Introduction to Privacy-Preserving Data Mining

• Privacy-Preserving Data Mining Models and Algorithms

• Applications of Privacy-Preserving Data Mining


Introduction
• The problem of privacy-preserving data mining has become
more important in recent years due to the increasing ability
to store personal data about users, and the corresponding
increase in the sophistication of data mining algorithms that
leverage this information.

• The problem has been discussed in multiple communities,
such as the database community, the statistical disclosure
control community and the cryptography community.

• This research will try to explore different topics from the
perspective of different communities and give a fused idea of
the work in these communities.

• The key directions in the field of privacy-preserving data mining are:

– Privacy-preserving data publishing.

– Changing the results of data mining applications to preserve privacy.

– Query auditing.

– Cryptographic methods for distributed privacy.

– Theoretical challenges in high dimensionality.


Introduction
• Privacy-preserving data publishing
These techniques study different transformation methods
associated with privacy, e.g., randomization, k-anonymity and
l-diversity, and also address questions such as how perturbed
data can be used in conjunction with association rule mining
approaches.

• Changing the results of data mining applications
to preserve privacy
The results of data mining algorithms such as association
rule mining are modified in order to preserve the privacy of
the data. Example: association rule hiding.

• Query auditing
The results of queries are either modified or restricted.
Example: output perturbation and query restriction (see the
output-perturbation sketch at the end of this list).

• Cryptographic methods for distributed privacy
In many cases, the data may be distributed across multiple
sites, and the owners of the data may wish to compute a
common function. A variety of cryptographic protocols may be
used for communication between the sites, so that secure
function computation is possible.

• Theoretical challenges in high dimensionality
Real data sets are high-dimensional, which makes the process
of privacy preservation difficult from both a computational
and an effectiveness point of view.
Example: optimal k-anonymization is NP-hard.
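
To make the query-auditing direction concrete, the following is a
minimal sketch of output perturbation, assuming a simple count query
answered with added Laplace noise; the data, predicate and noise
scale are illustrative assumptions, not part of the original material.

import random

def true_count(records, predicate):
    """Exact answer to a count query over the private records."""
    return sum(1 for r in records if predicate(r))

def perturbed_count(records, predicate, scale=1.0):
    """Output perturbation: true count plus Laplace(0, scale) noise.
    A larger scale gives stronger privacy but less accurate answers.
    The difference of two exponential variables is Laplace-distributed."""
    noise = random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)
    return true_count(records, predicate) + noise

# Illustrative toy data: (age, diagnosis) pairs.
records = [(34, "flu"), (29, "cold"), (41, "flu"), (52, "flu")]
print(perturbed_count(records, lambda r: r[1] == "flu"))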
Privacy Preserving Data Mining Models and
Algorithms

• Privacy computations use some form of transformation
on the data in order to preserve privacy. Typically, such
methods reduce the granularity of representation
(i.e., the representation becomes coarser).

• The granularity reduction results in some loss of
effectiveness of the data management or mining
algorithm. There is a natural trade-off between information
loss and data privacy.

• Some examples of privacy-preserving data mining models:

– Randomization method

– k-anonymity and l-diversity models

– Distributed privacy preservation

– Downgrading application effectiveness


Randomization Method
• In the randomization method, noise is added to the data in order
to mask the attribute values of records.
• The added noise is sufficiently large that individual record values
cannot be recovered.
• Techniques are designed to derive aggregate distributions from the
perturbed records.
• Subsequently, data mining techniques are developed in order to
work with these aggregate distributions, as sketched below.
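
Below is a minimal sketch of the randomization method, assuming
additive Gaussian noise and a simple moment-based reconstruction of
aggregate statistics; the data and the noise level are illustrative
assumptions.

import random
import statistics

NOISE_STD = 5.0  # standard deviation of the masking noise (public)

def perturb(values):
    """Mask each value by adding independent Gaussian noise."""
    return [v + random.gauss(0.0, NOISE_STD) for v in values]

def reconstruct_aggregates(perturbed):
    """Estimate the original mean and variance from the perturbed data.
    The perturbed mean is an unbiased estimate of the true mean, and
    subtracting the known noise variance recovers the data variance."""
    mean = statistics.fmean(perturbed)
    variance = max(statistics.variance(perturbed) - NOISE_STD ** 2, 0.0)
    return mean, variance

ages = [23, 31, 45, 27, 52, 38, 41, 29, 35, 48]
masked = perturb(ages)                 # individual values are hidden
print(reconstruct_aggregates(masked))  # aggregates remain estimable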

k-Anonymity and l-Diversity Models


• The k-anonymity model was developed to prevent the possibility of
indirect identification of records from public databases, since
combinations of record attributes can be used to exactly identify
individual records.

• In the k-anonymity method, we reduce the granularity of data
representation with the use of techniques such as generalization and
suppression. The granularity is reduced sufficiently that any
given record maps onto at least k other records in the data.

• The l-diversity model was designed to handle some weaknesses in the
k-anonymity model. Protecting identities at the level of k individuals
is not the same as protecting the corresponding sensitive values,
especially when there is homogeneity of sensitive values within a
group.

• To handle such requirements, the concept of intra-group diversity of
sensitive values is promoted within the anonymization scheme; both
definitions are illustrated in the sketch below.
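
To make the two definitions concrete, the sketch below groups a
generalized table by its quasi-identifiers and checks whether every
group contains at least k records (k-anonymity) and at least l
distinct sensitive values (l-diversity); the table, attribute names
and parameters are illustrative assumptions.

from collections import defaultdict

def equivalence_classes(records, quasi_identifiers):
    """Partition records into groups sharing the same quasi-identifier values."""
    groups = defaultdict(list)
    for rec in records:
        groups[tuple(rec[a] for a in quasi_identifiers)].append(rec)
    return groups

def is_k_anonymous(records, quasi_identifiers, k):
    """k-anonymity: every equivalence class has at least k records."""
    return all(len(g) >= k
               for g in equivalence_classes(records, quasi_identifiers).values())

def is_l_diverse(records, quasi_identifiers, sensitive, l):
    """l-diversity: every class has at least l distinct sensitive values."""
    return all(len({r[sensitive] for r in g}) >= l
               for g in equivalence_classes(records, quasi_identifiers).values())

# Generalized table: ages coarsened to ranges, zip codes partly suppressed.
table = [
    {"age": "30-39", "zip": "470**", "disease": "flu"},
    {"age": "30-39", "zip": "470**", "disease": "cancer"},
    {"age": "40-49", "zip": "471**", "disease": "flu"},
    {"age": "40-49", "zip": "471**", "disease": "flu"},
]
print(is_k_anonymous(table, ["age", "zip"], k=2))           # True
print(is_l_diverse(table, ["age", "zip"], "disease", l=2))  # False: one group is homogeneous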
Distributed Privacy Preservation
• In many situations, individual entities may wish to derive
aggregate results from data sets which are partitioned across these
entities.

• Such partitioning may be horizontal (when the records are
distributed across multiple entities) or vertical (when the attributes are
distributed across multiple entities). While the individual entities may
not desire to share their entire data sets, they may consent to limited
information sharing with the use of a variety of protocols.

• The overall effect of such methods is to maintain privacy for each
individual entity while deriving aggregate results over the entire
data, as in the secure-sum sketch below.
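
One classic building block for such protocols is the secure-sum
computation. The sketch below simulates it under simplifying
assumptions (honest-but-curious parties, no collusion, and all shares
routed through a single function for brevity): each site splits its
private value into random additive shares, so the total can be
recovered while no single share reveals any input.

import random

MODULUS = 2 ** 32  # public modulus; all share arithmetic is done mod this

def secure_sum(private_values):
    """Toy secure-sum: each party splits its value into random additive
    shares, one per participant; summing everything each participant
    receives (mod MODULUS) recovers exactly the total and nothing else."""
    n = len(private_values)
    received = [0] * n
    for value in private_values:
        shares = [random.randrange(MODULUS) for _ in range(n - 1)]
        shares.append((value - sum(shares)) % MODULUS)  # shares sum to value
        for i, share in enumerate(shares):
            received[i] = (received[i] + share) % MODULUS
    return sum(received) % MODULUS

# Three hospitals compute a combined patient count without revealing
# their individual counts to one another.
print(secure_sum([120, 87, 45]))  # 252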

Downgrading Application Effectiveness


• In many cases, although the data itself may not be available, the
output of applications such as association rule mining, classification or
query processing may lead to violations of privacy.

• This has led to research on downgrading the effectiveness of
applications by either data or application modification.

• Some examples of such techniques include association rule hiding,
classifier downgrading, and query auditing; a toy example of
association rule hiding follows.
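
As a toy illustration of association rule hiding, the sketch below
pushes the support of a sensitive itemset below the mining threshold
by deleting one of its items from supporting transactions; the data,
threshold and greedy sanitization strategy are illustrative
assumptions, not a published algorithm.

def support(transactions, itemset):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def hide_itemset(transactions, sensitive, min_support):
    """Greedily sanitize transactions until the sensitive itemset's
    support falls below min_support, by removing one chosen item
    from transactions that support the itemset."""
    victim = next(iter(sensitive))             # item chosen for deletion
    sanitized = [set(t) for t in transactions]
    for t in sanitized:
        if support(sanitized, sensitive) < min_support:
            break
        if sensitive <= t:
            t.discard(victim)
    return sanitized

transactions = [{"bread", "milk"}, {"bread", "milk", "eggs"},
                {"bread", "milk"}, {"eggs"}]
sensitive = {"bread", "milk"}             # itemset behind the rule to hide
print(support(transactions, sensitive))   # 0.75 before sanitization
cleaned = hide_itemset(transactions, sensitive, min_support=0.5)
print(support(cleaned, sensitive))        # below 0.5 after sanitization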
Applications of Privacy-Preserving Data Mining
• The problem of privacy-preserving data mining has numerous
applications such as:
– Medical Databases
– Bioterrorism Applications
– Genomic privacy
– Homeland Security Applications
• Credential validation problem
• Identity theft
• Web camera surveillance
• Video-surveillance
• The watch-list problem

Medical Databases: The Scrub and Datafly Systems


– The Scrub system was designed for de-identification of clinical notes and
letters, which usually occur in the form of textual data. Clinical notes and
letters typically contain references to patients, family members, addresses,
phone numbers or providers. Traditional techniques simply use a global
search-and-replace procedure in order to provide privacy. However, clinical
notes often contain cryptic references in the form of abbreviations which may
only be understood by other providers or members of the same institution.
Therefore, traditional methods can identify no more than 30-60% of the
identifying information in the data. The Scrub system uses numerous detection
algorithms which compete in parallel to determine when a block of text
corresponds to a name, address or phone number. The Scrub system uses local
knowledge sources which compete with one another based on the certainty of
their findings. It has been shown that such a system is able to remove more
than 90% of the identifying information from the data.

– The Datafly system was one of the earliest practical applications of
privacy-preserving transformations. This method was designed to prevent
identification of the subjects of medical records, which may be stored in
multi-dimensional format. The multi-dimensional information may include
directly identifying information such as the Social Security number, or
indirectly identifying information such as age, sex, or zip code. The system
was designed in response to the concern that removing only directly
identifying attributes such as Social Security numbers is not sufficient to
guarantee privacy. While the work has a similar motivation as the k-anonymity
approach of preventing record identification, it does not formally use the
k-anonymity model in order to prevent identification through linkage attacks.
Bioterrorism Applications
– Often a biological agent such as anthrax produces symptoms which are
similar to common respiratory diseases such as the cough, cold and flu.
In the absence of prior knowledge of such an attack, health-care providers
may diagnose a patient suffering from an anthrax attack as having
symptoms of one or more common respiratory diseases. The key is to
quickly distinguish a real anthrax attack from a normal outbreak of a
common disease. In many cases, an unusual number of such cases in a
given locality may indicate a bio-terrorism attack. In order to
identify such attacks, it is necessary to track incidences of these common
diseases as well. Therefore, the corresponding data would need to be
reported to public health agencies. However, the common respiratory
diseases are not reportable diseases by law. The solution is to have
“selective revelation”, which initially allows only limited access to
the data. However, in the event of suspicious activity, it allows
drill-down into the underlying data, providing more identifiable
information in accordance with public health law.

Genomic Privacy
– Recent years have seen tremendous advances in the science of DNA
sequencing and forensic analysis with the use of DNA. As a result,
the databases of collected DNA are growing very fast in both the
medical and law enforcement communities. DNA data is considered
extremely sensitive, since it contains almost uniquely identifying
information about an individual. As in the case of multi-dimensional data,
simple removal of directly identifying data such as the Social Security
number is not sufficient to prevent re-identification. It has been shown
that a software called CleanGene can determine the identifiability of DNA
entries independent of any other demographic or identifiable information.
The software relies on publicly available medical data and knowledge of
particular diseases in order to assign identifications to DNA entries. It
has been shown that 98-100% of the individuals are identifiable using this
approach. The identification is performed by taking the DNA sequence of an
individual and then constructing a genetic profile corresponding to the
sex, genetic diseases, the location where the DNA was collected, etc. One
way to protect the anonymity of such sequences is the use of generalization
lattices which are constructed in such a way that an entry in the modified
database cannot be distinguished from at least (k-1) other entries. Another
approach constructs synthetic data which maintains the aggregate
characteristics of the original data, but preserves the privacy of the
original records.
Homeland Security Applications
– Credential validation problem: in the credential validation approach, an
attempt is made to exploit the semantics associated with the Social
Security number to determine whether the person presenting the SSN
credential truly owns it.
– Identity theft: the Identity Angel system crawls through cyberspace
and determines people who are at risk of identity theft. This information
is used to notify appropriate parties.
– Web camera surveillance: one possible method for surveillance is
with the use of publicly available webcams, which can be used to
detect unusual activity. The approach can be made more privacy-sensitive
by extracting only facial count information from the images and using
it to detect unusual activity. It has been hypothesized that unusual
activity can be detected in terms of facial counts alone, rather than
using more specific information about particular individuals.
– Video surveillance: in the context of sharing video-surveillance data, a
major threat is the use of facial recognition software, which can
match the facial images in videos to the facial images in a driver's
license database. A balanced approach is to use selective downgrading
of the facial information, so that it scientifically limits the ability
of biometric identification software to reliably identify faces, while
maintaining facial details in images. The algorithm is referred to as
k-Same, and the key is to identify faces which are somewhat similar,
and then construct new faces which are combinations of features from
these similar faces. Thus, the identity of the underlying individual is
anonymized to a certain extent, but the video remains useful.
– Watch list problem: the government typically has a list of known
terrorists or suspected entities which it wishes to track. The aim is
to examine transactional data such as store purchases, hospital
admissions, airplane manifests, and hotel registrations in order to
identify or track these entities. This is a difficult problem, since the
transactional data is private, and the privacy of subjects who do not
appear on the watch list has to be protected.
