
Privacy-Preserving Distributed Mining of Association Rules on Horizontally Partitioned Data

Murat Kantarcıoǧlu and Chris Clifton, Senior Member, IEEE

Abstract— Data mining can extract important knowledge from large data collections – but sometimes these collections are split among various parties. Privacy concerns may prevent the parties from directly sharing the data, and some types of information about the data. This paper addresses secure mining of association rules over horizontally partitioned data. The methods incorporate cryptographic techniques to minimize the information shared, while adding little overhead to the mining task.

Index Terms— Data Mining, Security, Privacy

The authors are with the Department of Computer Sciences, Purdue University, 250 N. University St, W. Lafayette, IN 47907. E-mail: kanmurat@cs.purdue.edu, clifton@cs.purdue.edu.
Manuscript received 29 Jan. 2003; revised 11 Jun. 2003; accepted 1 Jul. 2003.
© 2003 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.

I. INTRODUCTION

Data mining technology has emerged as a means of identifying patterns and trends from large quantities of data. Data mining and data warehousing go hand-in-hand: most tools operate by gathering all data into a central site, then running an algorithm against that data. However, privacy concerns can prevent building a centralized warehouse – data may be distributed among several custodians, none of which are allowed to transfer their data to another site.

This paper addresses the problem of computing association rules within such a scenario. We assume homogeneous databases: All sites have the same schema, but each site has information on different entities. The goal is to produce association rules that hold globally, while limiting the information shared about each site.

Computing association rules without disclosing individual transactions is straightforward. We can compute the global support and confidence of an association rule AB ⇒ C knowing only the local supports of AB and ABC, and the size of each database:

    support_{AB⇒C} = Σ_{i=1}^{sites} support_count_{ABC}(i) / Σ_{i=1}^{sites} database_size(i)

    support_{AB} = Σ_{i=1}^{sites} support_count_{AB}(i) / Σ_{i=1}^{sites} database_size(i)

    confidence_{AB⇒C} = support_{AB⇒C} / support_{AB}

Note that this doesn't require sharing any individual transactions. We can easily extend an algorithm such as a-priori [1] to the distributed case using the following lemma: If a rule has support > k% globally, it must have support > k% on at least one of the individual sites. A distributed algorithm for this would work as follows: Request that each site send all rules with support at least k. For each rule returned, request that all sites send the count of their transactions that support the rule, and the total count of all transactions at the site. From this, we can compute the global support of each rule, and (from the lemma) be certain that all rules with support at least k have been found. More thorough studies of distributed association rule mining can be found in [2], [3].

The above approach protects individual data privacy, but it does require that each site disclose what rules it supports, and how much it supports each potential global rule. What if this information is sensitive? For example, suppose the Centers for Disease Control (CDC), a public agency, would like to mine health records to try to find ways to reduce the proliferation of antibiotic resistant bacteria. Insurance companies have data on patient diseases and prescriptions. Mining this data would allow the discovery of rules such as Augmentin & Summer ⇒ Infection & Fall, i.e., people taking Augmentin in the summer seem to have recurring infections.

The problem is that insurance companies will be concerned about sharing this data. Not only must the privacy of patient records be maintained, but insurers will be unwilling to release rules pertaining only to them. Imagine a rule indicating a high rate of complications with a particular medical procedure. If this rule doesn't hold globally, the insurer would like to know this – they can then try to pinpoint the problem with their policies and improve patient care. If the fact that the insurer's data supports this rule is revealed (say, under a Freedom of Information Act request to the CDC), the insurer could be exposed to significant public relations or liability problems. This potential risk could exceed their own perception of the benefit of participating in the CDC study.

This paper presents a solution that preserves such secrets – the parties learn (almost) nothing beyond the global results. The solution is efficient: The additional cost relative to previous non-secure techniques is O(number of candidate itemsets × sites) encryptions, and a constant increase in the number of messages.

The method presented in this paper assumes three or more parties. In the two-party case, knowing a rule is supported globally and not supported at one's own site reveals that the other site supports the rule. Thus, much of the knowledge we try to protect is revealed even with a completely secure method for computing the global results. We discuss the two-party case further in Section V. By the same argument, we assume no collusion, as colluding parties can reduce this to the two-party case.

Fig. 1. Determining global candidate itemsets

Fig. 2. Determining if itemset support exceeds 5% threshold
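The global support computation from local counts can be sketched as follows; the per-site counts and the 5% threshold are taken from the example of Fig. 2, and all variable names are illustrative. This is exactly the non-private computation whose inputs (the local counts) the protocols of this paper avoid revealing.

```python
# Global support of itemset ABC from per-site local counts, as in the
# introduction's formulas: sum the local support counts, divide by the
# total database size. Per-site numbers mirror the example of Fig. 2.

sites = [
    {"support_count_ABC": 5,  "db_size": 100},  # Site 1
    {"support_count_ABC": 6,  "db_size": 200},  # Site 2
    {"support_count_ABC": 20, "db_size": 300},  # Site 3
]

total_count = sum(s["support_count_ABC"] for s in sites)  # 31
total_size = sum(s["db_size"] for s in sites)             # 600

support_ABC = total_count / total_size                    # about 5.17%
print(support_ABC >= 0.05)  # True: ABC is globally supported
```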

A. Private Association Rule Mining Overview

Our method follows the basic approach outlined on Page 2 except that values are passed between the local data mining sites rather than to a centralized combiner. The two phases are discovering candidate itemsets (those that are frequent on one or more sites), and determining which of the candidate itemsets meet the global support/confidence thresholds.

The first phase (Figure 1) uses commutative encryption. Each party encrypts its own frequent itemsets (e.g., Site 1 encrypts itemset C). The encrypted itemsets are then passed to other parties, until all parties have encrypted all itemsets. These are passed to a common party to eliminate duplicates, and to begin decryption. (In the figure, the full set of itemsets are shown to the left of Site 1, after Site 1 decrypts.) This set is then passed to each party, and each party decrypts each itemset. The final result is the common itemsets (C and D in the figure).

In the second phase (Figure 2), each of the locally supported itemsets is tested to see if it is supported globally. In the figure, the itemset ABC is known to be supported at one or more sites, and each site computes its local support. The first site chooses a random value R, and adds to R the amount by which its support for ABC exceeds the minimum support threshold. This value is passed to site 2, which adds the amount by which its support exceeds the threshold (note that this may be negative, as shown in the figure). This is passed to site 3, which again adds its excess support. The resulting value (18) is tested using a secure comparison to see if it exceeds the random value (17). If so, itemset ABC is supported globally.

This gives a brief, oversimplified idea of how the method works. Section III gives full details. Before going into the details, we give background and definitions of relevant data mining and security techniques.

II. BACKGROUND AND RELATED WORK

There are several fields where related work is occurring. We first describe other work in privacy-preserving data mining, then go into detail on specific background work on which this paper builds.

Previous work in privacy-preserving data mining has addressed two issues. In one, the aim is preserving customer privacy by distorting the data values [4]. The idea is that the distorted data does not reveal private information, and thus is "safe" to use for mining. The key result is that the distorted data, and information on the distribution of the random data used to distort the data, can be used to generate an approximation to the original data distribution, without revealing the original data values. The distribution is used to improve mining results over mining the distorted data directly, primarily through selection of split points to "bin" continuous data. Later refinement of this approach tightened the bounds on what private information is disclosed, by showing that the ability to reconstruct the distribution can be used to tighten estimates of original values based on the distorted data [5].

More recently, the data distortion approach has been applied to boolean association rules [6], [7]. Again, the idea is to modify data values such that reconstruction of the values for any individual transaction is difficult, but the rules learned on the distorted data are still valid. One interesting feature of this work is a flexible definition of privacy; e.g., the ability to correctly guess a value of '1' from the distorted data can be considered a greater threat to privacy than correctly learning a '0'.

The data distortion approach addresses a different problem from our work. The assumption with distortion is that the values must be kept private from whoever is doing the mining. We instead assume that some parties are allowed to see some of the data, just that no one is allowed to see all the data. In return, we are able to get exact, rather than approximate, results.

The other approach uses cryptographic tools to build decision trees [8]. In this work, the goal is to securely build an ID3 decision tree where the training set is distributed between two parties. The basic idea is that finding the attribute that maximizes information gain is equivalent to finding the attribute that minimizes the conditional entropy.
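The second-phase circulation of Fig. 2 can be sketched as follows. This is a minimal, non-secure illustration using the figure's counts, its 5% threshold, and its random value R = 17; in the actual protocol the final test is a secure comparison, not a plaintext one.

```python
# Sketch of the second phase (Fig. 2): each site adds its excess support
# (local count minus threshold * local DB size) to a running total that
# the first site seeds with a random value R. Integer percent arithmetic
# keeps the example exact.

MIN_SUPPORT_PCT = 5
sites = [(5, 100), (6, 200), (20, 300)]  # (local count of ABC, |DBi|)

R = 17          # chosen at random by the first site; Fig. 2 uses 17
total = R
for count, db_size in sites:
    # excess support may be negative, as for the 200-transaction site
    total += count - MIN_SUPPORT_PCT * db_size // 100

# In the protocol the comparison below is done with a secure comparison;
# here it is in the clear for illustration.
print(total, total >= R)  # 18 True, matching the running totals in Fig. 2
```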

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, TO APPEAR

The conditional entropy for an attribute for two parties can be written as a sum of expressions of the form (v1 + v2) × log(v1 + v2). The authors give a way to securely calculate the expression (v1 + v2) × log(v1 + v2) and show how to use this function for building the ID3 securely. This approach treats privacy-preserving data mining as a special case of secure multi-party computation [9] and not only aims at preserving individual privacy but also tries to prevent leakage of any information other than the final result. We follow this approach, but address a different problem (association rules), and emphasize the efficiency of the resulting algorithms. A particular difference is that we recognize that some kinds of information can be exchanged without violating security policies; secure multi-party computation forbids leakage of any information other than the final result. The ability to share non-sensitive data enables highly efficient solutions.

The problem of privately computing association rules in vertically partitioned distributed data has also been addressed [10]. The vertically partitioned problem occurs when each transaction is split across multiple sites, with each site having a different set of attributes for the entire set of transactions. With horizontal partitioning each site has a set of complete transactions. In relational terms, with horizontal partitioning the relation to be mined is the union of the relations at the sites. In vertical partitioning, the relations at the individual sites must be joined to get the relation to be mined. The change in the way the data is distributed makes this a much different problem from the one we address here, resulting in a very different solution.

A. Mining of Association Rules

The association rule mining problem can be defined as follows [1]: Let I = {i1, i2, ..., in} be a set of items. Let DB be a set of transactions, where each transaction T is an itemset such that T ⊆ I. Given an itemset X ⊆ I, a transaction T contains X if and only if X ⊆ T. An association rule is an implication of the form X ⇒ Y where X ⊆ I, Y ⊆ I and X ∩ Y = ∅. The rule X ⇒ Y has support s in the transaction database DB if s% of the transactions in DB contain X ∪ Y. The association rule holds in the transaction database DB with confidence c if c% of the transactions in DB that contain X also contain Y. An itemset X with k items is called a k-itemset. The problem of mining association rules is to find all rules whose support and confidence are higher than certain user-specified minimum support and confidence.

In this simplified definition of association rules, missing items, negatives and quantities are not considered. In this respect, the transaction database DB can be seen as a 0/1 matrix where each column is an item and each row is a transaction. In this paper, we use this view of association rules.

1) Distributed Mining of Association Rules: The above problem of mining association rules can be extended to distributed environments. Let us assume that a transaction database DB is horizontally partitioned among n sites (namely S1, S2, ..., Sn) where DB = DB1 ∪ DB2 ∪ ... ∪ DBn and DBi resides at site Si (1 ≤ i ≤ n). The itemset X has local support count X.supi at site Si if X.supi of the transactions at Si contain X. The global support count of X is given as X.sup = Σ_{i=1}^{n} X.supi. An itemset X is globally supported if X.sup ≥ s × (Σ_{i=1}^{n} |DBi|). Global confidence of a rule X ⇒ Y can be given as {X ∪ Y}.sup / X.sup.

The set of large itemsets L(k) consists of all k-itemsets that are globally supported. The set of locally large itemsets LLi(k) consists of all k-itemsets supported locally at site Si. GLi(k) = L(k) ∩ LLi(k) is the set of globally large k-itemsets locally supported at site Si. The aim of distributed association rule mining is to find the sets L(k) for all k > 1 and the support counts for these itemsets, and from this compute association rules with the specified minimum support and confidence.

A fast algorithm for distributed association rule mining is given in Cheung et al. [2]. Their procedure for fast distributed mining of association rules (FDM) is summarized below.
1) Candidate Sets Generation: Generate candidate sets CGi(k) based on GLi(k−1), the itemsets supported by Si at the (k−1)-th iteration, using the classic a-priori candidate generation algorithm. Each site generates candidates based on the intersection of globally large (k−1)-itemsets and locally large (k−1)-itemsets.
2) Local Pruning: For each X ∈ CGi(k), scan the database DBi at Si to compute X.supi. If X is locally large at Si, it is included in the LLi(k) set. It is clear that if X is supported globally, it will be supported locally at at least one site.
3) Support Count Exchange: The LLi(k) are broadcast, and each site computes the local support for the items in ∪i LLi(k).
4) Broadcast Mining Results: Each site broadcasts the local support for itemsets in ∪i LLi(k). From this, each site is able to compute L(k).
The details of the above algorithm can be found in [2].

B. Secure Multi-party Computation

Substantial work has been done on secure multi-party computation. The key result is that a wide class of computations can be computed securely under reasonable assumptions. We give a brief overview of this work, concentrating on material that is used later in the paper. The definitions given here are from Goldreich [9]. For simplicity, we concentrate on the two-party case. Extending the definitions to the multi-party case is straightforward.

1) Security in the semi-honest model: A semi-honest party follows the rules of the protocol using its correct input, but is free to later use what it sees during execution of the protocol to compromise security. This is somewhat realistic in the real world because parties who want to mine data for their mutual benefit will follow the protocol to get correct results. Also, a protocol that is buried in large, complex software cannot be easily altered.

A formal definition of private two-party computation in the semi-honest model is given below. Computing a function privately is equivalent to computing it securely. The formal proof of this can be found in Goldreich [9].

Definition 2.1 (privacy w.r.t. semi-honest behavior) [9]: Let f : {0,1}* × {0,1}* → {0,1}* × {0,1}* be a probabilistic, polynomial-time functionality, where f1(x, y) (resp., f2(x, y)) denotes the first (resp., second) element of f(x, y), and let Π be a two-party protocol for computing f.

Let the view of the first (resp., second) party during an execution of Π on (x, y), denoted view1_Π(x, y) (resp., view2_Π(x, y)), be (x, r1, m1, ..., mt) (resp., (y, r2, m1, ..., mt)), where r1 (resp., r2) represents the outcome of the first (resp., second) party's internal coin tosses, and mi represents the i-th message it has received.

The output of the first (resp., second) party during an execution of Π on (x, y) is denoted output1_Π(x, y) (resp., output2_Π(x, y)) and is implicit in the party's view of the execution.

Π privately computes f if there exist probabilistic polynomial-time algorithms, denoted S1, S2, such that

    {(S1(x, f1(x, y)), f2(x, y))}_{x,y ∈ {0,1}*} ≡_C {(view1_Π(x, y), output2_Π(x, y))}_{x,y ∈ {0,1}*}    (1)

    {(f1(x, y), S2(y, f2(x, y)))}_{x,y ∈ {0,1}*} ≡_C {(output1_Π(x, y), view2_Π(x, y))}_{x,y ∈ {0,1}*}    (2)

where ≡_C denotes computational indistinguishability.

The above definition says that a computation is secure if the view of each party during the execution of the protocol can be effectively simulated by the input and the output of the party. This is not quite the same as saying that private information is protected. For example, if two parties use a secure protocol to mine distributed association rules, the secure protocol still reveals that if a particular rule is not supported by a particular site but appears in the globally supported rule set, then it must be supported by the other site. A site can deduce this information by solely looking at its locally supported rules and the globally supported rules. On the other hand, there is no way to deduce the exact support count of some itemset by looking at the globally supported rules. With three or more parties, knowing a rule holds globally reveals that at least one site supports it, but no site knows which site (other than, obviously, itself). In summary, a secure multi-party protocol will not reveal more information to a particular party than the information that can be induced by looking at that party's input and the output.

2) Yao's general two-party secure function evaluation: Yao's general secure two-party evaluation is based on expressing the function f(x, y) as a circuit and encrypting the gates for secure evaluation [11]. With this protocol, any two-party function can be evaluated securely in the semi-honest model. To be efficiently evaluated, however, the functions must have a small circuit representation. We will not give details of this generic method; however, we do use this generic result for securely finding whether a ≥ b (Yao's millionaires' problem). For comparing any two integers securely, Yao's generic method is one of the most efficient methods known, although other asymptotically equivalent but practically more efficient algorithms could be used as well [12].

C. Commutative Encryption

Commutative encryption is an important tool that can be used in many privacy-preserving protocols. An encryption algorithm is commutative if the following two equations hold for any given feasible encryption keys K1, ..., Kn ∈ K, any m in the item domain M, and any two permutations i, j of the key order:

    E_{Ki1}(... E_{Kin}(M) ...) = E_{Kj1}(... E_{Kjn}(M) ...)    (3)

and, ∀ M1, M2 ∈ M such that M1 ≠ M2, and for a given k and ε < 1/2^k,

    Pr(E_{Ki1}(... E_{Kin}(M1) ...) = E_{Kj1}(... E_{Kjn}(M2) ...)) < ε    (4)

These properties of commutative encryption can be used to check whether two items are equal without revealing them. For example, assume that party A has item iA and party B has item iB. To check if the items are equal, each party encrypts its item and sends it to the other party: Party A sends E_{KA}(iA) to B, and party B sends E_{KB}(iB) to A. Each party encrypts the received item with its own key, giving party A E_{KA}(E_{KB}(iB)) and party B E_{KB}(E_{KA}(iA)). At this point, they can compare the encrypted data. If the original items are the same, Equation 3 ensures that they have the same encrypted value. If they are different, Equation 4 ensures that with high probability they do not have the same encrypted value. During this comparison, each site sees only the other site's values in encrypted form.

In addition to meeting the above requirements, we require that the encryption be secure. Specifically, the encrypted values of a set of items should reveal no information about the items themselves. Consider the following experiment. For any two sets of items, we encrypt each item of one randomly chosen set with the same key and present the resulting encrypted set and the initial two sets to a polynomial-time adversary. Loosely speaking, our security assumption implies that this polynomial-time adversary will not be able to predict which of the two sets was encrypted with probability better than a random guess. Under this security assumption, it can be shown that the resulting encrypted set is indistinguishable by a polynomial adversary from a set of items randomly chosen from the domain of the encryption; this fact is used in the proof of the privacy-preserving properties of our protocol. The formal definition of multiple-message semantic security can be found in [13].

There are several examples of commutative encryption, perhaps the most famous being RSA [14] (if keys are not shared). The appendix describes how Pohlig-Hellman encryption [15] can be used to fulfill our requirements, as well as further discussion of relevant cryptographic details. The remainder of this paper is based on the definitions given above, and does not require knowledge of the cryptographic discussion in the appendix.

III. SECURE ASSOCIATION RULE MINING

We will now use the tools described above to construct a distributed association rule mining algorithm that preserves the privacy of individual site results. The algorithm given is for three or more parties – the difficulty with the two-party case is discussed in Section V.
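The equality test enabled by Equations 3 and 4 can be illustrated with exponentiation modulo a prime, the idea behind the Pohlig-Hellman cipher discussed in the appendix. The prime and keys below are toy values chosen for readability, not secure parameters:

```python
# Toy commutative cipher: E_K(m) = m^K mod p. Since
# (m^K1)^K2 = (m^K2)^K1 (mod p), Equation 3 holds, so two parties can
# compare items under double encryption without revealing them.
# p and the keys are toy values, far too small to be secure.

p = 1019  # prime modulus

def E(key, m):
    return pow(m, key, p)

K_A, K_B = 7, 13   # each key coprime to p - 1, so encryption is invertible
i_A, i_B = 42, 42  # the two parties' items (equal in this run)

# Each party encrypts its own item, exchanges ciphertexts, then encrypts
# the received ciphertext with its own key.
double_A = E(K_A, E(K_B, i_B))  # held by party A
double_B = E(K_B, E(K_A, i_A))  # held by party B

print(double_A == double_B)  # True: the items match
```

With unequal items the doubly encrypted values differ, here with certainty rather than merely high probability, since x ↦ x^(K_A·K_B) mod p is a bijection when the keys are coprime to p − 1; Equation 4 only requires that a collision be improbable.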


A. Problem Definition Clearly, Protocol 1 finds the union without revealing which

Let i ≥ 3 be the number of sites. Each site has a private itemset belongs to which site. It is not, however, secure under

transaction database DBi . We are given support threshold the definitions of secure multi-party computation. It reveals

s and confidence c as percentages. The goal is to discover the number of itemsets having common support between sites,

all association rules satisfying the thresholds, as defined in e.g., sites 3, 5, and 9 all support some itemset. It does not

Section II-A.1. We further desire that disclosure be limited: reveal which itemsets these are, but a truly secure computation

No site should be able to learn contents of a transaction at (as good as giving all input to a “trusted party”) could

any other site, what rules are supported by any other site, not reveal even this count. Allowing innocuous information

or the specific value of support/confidence for any rule at any leakage (the number of itemsets having common support)

other site, unless that information is revealed by knowledge of allows an algorithm that is sufficiently secure with much lower

one’s own data and the final result. E.g., if a rule is supported cost than a fully secure approach.

globally but not at one’s own site, we can deduce that at least If we deem leakage of the number of commonly supported

one other site support the rule. Here we assume no collusion itemsets as acceptable, we can prove that this method is secure

(this is discussed further in Section IV.) under the definitions of secure multi-party computation. The

idea behind the proof is to show that given the result, the

B. Method leaked information, and a site’s own input, a site can simulate

everything else seen during the protocol. Since the simulation

Our method follows the general approach of the FDM

generates everything seen during execution of the protocol, the

algorithm [2], with special protocols replacing the broadcasts

site clearly learns nothing new from the protocol beyond the

of LLi(k) and the support count of items in LL(k) . We first

input provided to the simulator. One key is that the simulator

give a method for finding the union of locally supported

does not need to generate exactly what is seen in any particular

itemsets without revealing the originator of the particular

run of the protocol. The exact content of messages passed

itemset. We then provide a method for securely testing if the

during the protocol is dependent on the random choice of keys;

support count exceeds the threshold.

the simulator must generate an equivalent distribution, based

1) Secure union of locally large itemsets: In the FDM

on random choices made by the simulator, to the distribution

algorithm (Section II-A.1), step 3 reveals the large itemsets

of messages seen in real executions of the protocol. A formal

supported by each site. To accomplish this without revealing

proof that this proof technique shows that a protocol preserves

what each site supports, we instead exchange locally large

privacy can be found in [9]. We use this approach to prove

itemsets in a way that obscures the source of each itemset.

that Protocol 1 reveals only the union of locally large itemsets

We assume a secure commutative encryption algorithm with

and a clearly bounded set of innocuous information.

negligible collision probability (Section II-C).

Theorem 3.1: Protocol 1 privately computes the union of

The main idea is that each site encrypts the locally supported

the locally large itemsets assuming no collusion, revealing at

itemsets, along with enough “fake” itemsets to hide the actual

most the result ∪N i=1 LLi(k) and:

number supported. Each site then encrypts the itemsets from

other sites. In Phases 2 and 3, the sets of encrypted itemsets 1) Size of intersection of locally supported itemsets be-

are merged. Since Equation 3 holds, duplicates in the locally tween any subset of odd numbered sites,

supported itemsets will be duplicates in the encrypted itemsets, 2) Size of intersection of locally supported itemsets be-

and can be deleted. The reason this occurs in two phases is tween any subset of even numbered sites, and

that if a site knows which fully encrypted itemsets come from 3) Number of itemsets supported by at least one odd and

which sites, it can compute the size of the intersection between one even site.

any set of sites. While generally innocuous, if it has this Proof: Phase 0: Since no communication occurs in Phase

information for itself, it can guess at the itemsets supported 0, each site can simulate its view by running the algorithm on

by other sites. Permuting the order after encryption in Phase its own input.

1 prevents knowing exactly which itemsets match, however Phase 1: At the first step, each site sees LLei−1(k) . The size

separately merging itemsets from odd and even sites in Phase of this set is the size of the global candidate set CG(k) , which

2 prevents any site from knowing the fully encrypted values is known to each site. Assuming the security of encryption,

of its own itemsets.1 Phase 4 decrypts the merged frequent each item in this set is computationally indistinguishable from

itemsets. Commutativity of encryption allows us to decrypt a number chosen from a uniform distribution. A site can

all itemsets in the same order regardless of the order they therefore simulate the set using a uniform random number

were encrypted in, preventing sites from tracking the source generator. This same argument holds for each subsequent

of each itemset. round.

The detailed algorithm is given in Protocol 1. In the protocol Phase 2: In Phase 2, site 0 gets the fully encrypted sets of

F represents the data that can be used as fake itemsets. itemsets from the other even sites. Assuming that each site

|LLei(k) | represents the set of the encrypted k itemsets at site knows the source of a received message, site 0 will know

i. Ei is the encryption and Di is the decryption by site i. which fully encrypted set LLe(k) contains encrypted itemsets

from which (odd) site. Equal itemsets will now be equal in

1 An alternative would be to use an anonymizing protocol [16] to send

encrypted form. Thus, site 0 learns if any odd sites had locally

all fully encrypted itemsets to Site 0, thus preventing Site 0 from knowing

which were it’s own itemsets. The separate odd/even merging is lower cost supported itemsets in common. We can still build a simulator

and achieves sufficient security for practical purposes. for this view, using the information in point 1 above. If there

KANTARCIOǦLU AND CLIFTON 7

If there are k itemsets known to be common among all ⌊N/2⌋ odd sites (from point 1), generate k random numbers and put them into the simulated LLei(k). Repeat for each subset of ⌊N/2⌋ − 1 odd sites, etc., down to the 2-subsets of the odd sites. Then fill each LLei(k) with randomly chosen values until it reaches size |CGi(k)|. The generated sets will have exactly the same combinations of common items as the real sets, and since the values of the items in the real sets are computationally indistinguishable from a uniform distribution, their simulation matches the real values.

The same argument holds for site 1, using the information from point 2 to generate the simulator.

Phase 3: Site 1 eliminates duplicates from the LLei(k) to generate RuleSet1. We now demonstrate that Site 0 can simulate RuleSet1. First, the size of RuleSet1 can be simulated knowing point 2. There may be itemsets in common between RuleSet0 and RuleSet1. These can be simulated using point 3: if there are k items in common between the even and odd sites, site 0 selects k random items from RuleSet0 and inserts them into RuleSet1. RuleSet1 is then filled with randomly generated values. Since the encryption guarantees that the values are computationally indistinguishable from a uniform distribution, and the set sizes |RuleSet0|, |RuleSet1|, and |RuleSet0 ∩ RuleSet1| (and thus |RuleSet|) are identical in the simulation and the real execution, this phase is secure.

Phase 4: Each site sees only the encrypted items after decryption by the preceding site. Some of these may be identical to items seen in Phase 2, but since all items must be in the union, this reveals nothing. The simulator for site i is built as follows: take the values generated in Phase 2, step N − 1 − i, and place them in the RuleSet. Then insert random values into RuleSet up to the proper size (calculated as in the simulator for Phase 3). The values we have not seen before are computationally indistinguishable from data drawn from a uniform distribution, and the simulator includes the values we have seen (and knew would be there), so the simulated view is computationally indistinguishable from the real values.

The simulator for site N − 1 is different, since it learns RuleSet(k). To simulate what it sees in Phase 4, site N − 1 takes each item in RuleSet(k), the final result, and encrypts it with EN−1. These are placed in RuleSet. RuleSet is then filled with items chosen from F, also encrypted with EN−1. Since the choice of items from F is random in both the real and the simulated execution, and the real items match exactly between the real execution and the simulation, the RuleSet that site N − 1 receives in Phase 4 is computationally indistinguishable from the real execution.

Therefore, we can conclude that the above protocol is privacy-preserving in the semi-honest model with the stated assumptions.

The information disclosed by points 1-3 could be relaxed to the number of itemsets supported by 1 site, 2 sites, ..., N sites if we assume anonymous message transmission. The number of jointly supported itemsets can also be masked by allowing sites to inject itemsets that are not really supported locally. These fake itemsets will simply fail to be globally supported, and will be filtered from the final result when global support is calculated as shown in the next section. The jointly supported itemsets "leak" then becomes an upper bound rather than an exact value, at an increased cost in the number of candidates that must be checked for global support. While not truly zero-knowledge, this reduces the confidence (and usefulness) of the leaked knowledge of the number of jointly supported itemsets. In practical terms, revealing the size (but not the content) of intersections between sites is likely to be of little concern.

8 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, TO APPEAR

2) Testing support threshold without revealing support count: Protocol 1 gives the full set of locally large itemsets LL(k). We still need to determine which of these itemsets are supported globally. Step 4 of the FDM algorithm forces each site to reveal its own support count for every itemset in LL(k). All we need to know, for each itemset X ∈ LL(k), is: is X.sup ≥ s% × |DB|? The following allows us to reduce this to a comparison against a sum of local values (the excess support at each site):

X.sup ≥ s ∗ |DB| = s ∗ (Σ_{i=1}^{n} |DBi|)
Σ_{i=1}^{n} X.supi ≥ s ∗ (Σ_{i=1}^{n} |DBi|)
Σ_{i=1}^{n} (X.supi − s ∗ |DBi|) ≥ 0

Therefore, checking for support is equivalent to checking whether Σ_{i=1}^{n} (X.supi − s ∗ |DBi|) ≥ 0. The challenge is to do this without revealing X.supi or |DBi|. An algorithm for this is given in Protocol 2.

The first site generates a random number xr for each itemset X, adds that number to its (X.supi − s ∗ |DBi|), and sends the result to the next site. (All arithmetic is mod m ≥ 2 ∗ |DB|, for reasons that will become apparent later.) The random number masks the actual excess support, so the second site learns nothing about the first site's actual database size or support. The second site adds its excess support and sends the value on. The random value now hides both support counts. The last site in the chain ends up with Σ_{i=1}^{n} (X.supi − s ∗ |DBi|) + xr (mod m). Since the total database size |DB| ≤ m/2, a negative sum is mapped to a number greater than or equal to m/2 (−k = m − k mod m). The last site needs to test whether this sum minus xr (mod m) is less than m/2. This can be done securely using Yao's generic method [11]. Clearly this algorithm is secure as long as there is no collusion, as no site can distinguish what it receives from a random number. Alternatively, the first site can simply send xr to the last site. The last site then learns the actual excess support, but not the support values of any single site. In addition, if we consider the excess support to be a valid part of the global result, this method is still secure.
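The chained addition just described can be sketched as a plain simulation; the function and variable names below are ours, not the paper's, and the final Yao-style secure comparison is replaced by an ordinary local test so that the modular arithmetic can be checked:

```python
import random

def protocol2_round(local_supports, local_db_sizes, s, m):
    """Simulate the addition chain of Protocol 2 for one itemset.

    local_supports[i] = X.sup_i, local_db_sizes[i] = |DB_i|,
    s = support threshold as a fraction, m >= 2 * |DB|.
    Returns True iff sum_i (X.sup_i - s * |DB_i|) >= 0.
    """
    xr = random.randrange(m)          # random mask chosen by the first site
    running = xr
    for sup, db in zip(local_supports, local_db_sizes):
        # each site adds its excess support mod m; the value passed on
        # looks uniformly random to the receiving site
        running = (running + sup - int(s * db)) % m
    # In the real protocol this final test is a secure two-party
    # comparison between the last site and the first (Yao [11]).
    total_excess = (running - xr) % m
    return total_excess < m // 2      # negative sums map to values >= m/2

# Example: three sites, 5% threshold, |DB| = 300, so m = 600
print(protocol2_round([10, 2, 9], [100, 100, 100], 0.05, 600))
```

The result is independent of the random mask xr, which is the point: each intermediate value individually reveals nothing.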


Theorem 3.2: Protocol 2 privately computes globally supported itemsets in the semi-honest model.

Proof: To show that Protocol 2 is secure under the semi-honest model, we have to show that a polynomial-time simulator can simulate the view of the parties during the execution of the protocol, based on their local inputs and the global result. We also use the general composition theorem for semi-honest computation [9]. The theorem says that if g securely reduces to f, and f is computed securely, then the computation of f(g) is secure. In our context, f is the secure comparison of two integers, and g is Protocol 2. First, we show that the view of any site during the addition phase can be efficiently simulated given the input of that site and the global output. Site i uniformly chooses a random integer sr, 0 ≤ sr < m. Next, we show that the view and the output of the simulator are computationally indistinguishable by showing that the probability of seeing a given x in both is equal. In the following equations, xr is the random number added at the beginning of Protocol 2, 0 ≤ xr < m. The arithmetic is assumed to be mod m. Also note that X.supi is fixed for each site.

Pr[VIEW_i^{Protocol 2} = x] = Pr[xr = x − Σ_{k=1}^{i−1} (X.supk − s ∗ |DBk|)]
= 1/m
= Pr[sr = x]
= Pr[Simulator_i = x]

Therefore, what each site sees during the addition phase is indistinguishable from the output of a random number generator. During the comparison phase we can use the generic secure method, so from the composition theorem we conclude that Protocol 2 is secure in the semi-honest model.

To determine whether the confidence of a rule X ⇒ Y is higher than a given confidence threshold c, we have to check whether {X ∪ Y}.sup / X.sup ≥ c. Protocol 2 only reveals whether an itemset is supported; it does not reveal the support count. The following equations show how to securely compute whether confidence exceeds a threshold using Protocol 2. The support {X ∪ Y}.supi is denoted XY.supi.

{X ∪ Y}.sup / X.sup ≥ c ⇒ (Σ_{i=1}^{n} XY.supi) / (Σ_{i=1}^{n} X.supi) ≥ c
⇒ Σ_{i=1}^{n} XY.supi ≥ c ∗ (Σ_{i=1}^{n} X.supi)
⇒ Σ_{i=1}^{n} (XY.supi − c ∗ X.supi) ≥ 0

Since each site knows XY.supi and X.supi, we can easily use Protocol 2 to securely check the confidence of a rule.
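The reduction can be sanity-checked numerically; the helper and the example numbers below are ours, chosen only for illustration:

```python
def confidence_exceeds(xy_sups, x_sups, c):
    """Check sum_i(XY.sup_i) / sum_i(X.sup_i) >= c via the rewritten
    per-site form sum_i(XY.sup_i - c * X.sup_i) >= 0, which is the
    quantity each site can feed into Protocol 2 locally."""
    return sum(xy - c * x for xy, x in zip(xy_sups, x_sups)) >= 0

# two sites; global confidence is (20 + 5) / (40 + 10) = 0.5
xy_sups, x_sups = [20, 5], [40, 10]
assert confidence_exceeds(xy_sups, x_sups, 0.5) == (sum(xy_sups) / sum(x_sups) >= 0.5)
assert confidence_exceeds(xy_sups, x_sups, 0.6) == (sum(xy_sups) / sum(x_sups) >= 0.6)
```

The per-site form never requires any site to disclose XY.supi or X.supi directly.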

IV. SECURITY AGAINST COLLUSION

Collusion in Protocol 1 could allow a site to know its own frequent itemsets after encryption by all parties. Using this, it can learn the size of the intersection between its own itemsets and those of another party. Specifically, if site i colludes with site i − 1, it can learn the size of its intersection with site i + 1. Collusion between sites 0 and 1 exacerbates the problem, as they know the encrypted values of the itemsets of all odd (even) sites. This may reveal the actual itemsets: if |LLi(k) ∩ LLi+1(k)| = |LLi(k)|, then site i has learned a subset of the itemsets at site i + 1.

Collusion can also be a problem for our second protocol, because site i + 1 and site i − 1 can collude to reveal site i's excess support value. The protocol can be made resilient against collusion using a straightforward technique from the cryptographic community. The basic idea is that each party divides its input into n parts and sends n − 1 of the pieces to different sites; to reveal any party's input, n − 1 parties must collude. The following is a brief summary of the protocol; details can be found in [17]. (A slightly more efficient version can be found in [18].)

1) Each site i randomly chooses n elements z_{i,1}, ..., z_{i,n} such that x_i = Σ_{j=1}^{n} z_{i,j} mod m, where x_i is the input of site i. Site i sends z_{i,j} to site j.
2) Every site i computes w_i = Σ_{j=1}^{n} z_{j,i} mod m and sends w_i to site n.
3) Site n computes the final result Σ_{i=1}^{n} w_i mod m.

The above protocol can easily be used to improve our second protocol. Assume site 0 is the starting site in our protocol and site N − 1 is the last site. Choose m such that 2 ∗ |DB| ≤ m. Set x_1 = X.sup1 − s ∗ d1 + xr mod m and x_i = X.supi − s ∗ di mod m for i ≠ 1. After this point, the above protocol can be used to find Σ_{i=1}^{n} (X.supi − s ∗ di) + xr mod m. At the end, one secure addition and comparison is done as in Protocol 2 to check whether itemset X is globally supported.
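The three steps above can be sketched in a few lines; the function names are ours, and only the mod-m additive arithmetic is essential:

```python
import random

def share(x, n, m):
    """Step 1: split input x into n additive shares mod m."""
    shares = [random.randrange(m) for _ in range(n - 1)]
    shares.append((x - sum(shares)) % m)   # shares sum to x (mod m)
    return shares

def secure_sum(inputs, m):
    n = len(inputs)
    # step 1: site i sends share z[i][j] to site j
    z = [share(x, n, m) for x in inputs]
    # step 2: site i computes w_i = sum_j z[j][i] mod m
    w = [sum(z[j][i] for j in range(n)) % m for i in range(n)]
    # step 3: the last site adds the w_i to obtain the total
    return sum(w) % m

inputs, m = [7, 12, 3], 1000
assert secure_sum(inputs, m) == sum(inputs) % m
```

Any single share (or any n − 2 of them) is uniformly random, so reconstructing a site's input requires all other n − 1 parties to collude.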

V. THE TWO-PARTY CASE

The two-party case is problematic. First, globally supported itemsets that are not supported at one site are known to be supported at the other site – this is an artifact of the result. Protocol 1 is worse yet, as itemsets that are supported at one site but not supported globally will become known to the other site. To retain any privacy, we must dispense with local pruning entirely (steps 1 and 2 of the FDM algorithm) and compute support for all candidates in CG(k) (as computed from L(k−1)). Second, the secure comparison phase at the end of Protocol 2 cannot be removed, as otherwise the support of one site is disclosed to the other. It is difficult to improve on this, as evidenced by the following theorem.

Theorem 5.1: For itemset X, whether (X.sup1 + X.sup2)/(d1 + d2) ≥ k can be securely computed if and only if Yao's millionaire problem can be securely solved for arbitrary a and b.

Proof: Checking (X.sup1 + X.sup2)/(d1 + d2) ≥ k is equivalent to checking (X.sup1 − k ∗ d1) ≥ (k ∗ d2 − X.sup2). If we set a = X.sup1 − k ∗ d1 and b = k ∗ d2 − X.sup2, we have an instance of Yao's millionaire problem for a and b. Conversely, assume we have a secure protocol that computes whether X is supported globally for arbitrary X.sup1, X.sup2, d1, d2 and k. Take X.sup1 = 3a, d1 = 4a, X.sup2 = b, d2 = 4 ∗ b and k = 0.5. This is equivalent to checking whether a ≥ b.

The above theorem implies that if we develop a method that can securely check whether an itemset is globally supported in the two-party case in the semi-honest model, it is equivalent to finding a new solution to Yao's millionaire problem. This problem is well studied in cryptography and, to our knowledge, there is no significantly faster way for arbitrary a and b than the generic circuit evaluation solution.

It is worth noting that eliminating local pruning and using Protocol 2 to compute the global support of all candidates in CG(k) is secure under the definitions of secure multi-party computation, for two or more parties. The problem with the two-party case is that knowing a rule is supported globally but not supported at one's own site reveals that the other site supports that rule. This is true no matter how secure the computation; it is an artifact of the result. Thus, extending to secure computation in the two-party case is unlikely to be of use.

VI. COMMUNICATION AND COMPUTATION COSTS

We now give cost estimates for association rule mining using the method we have presented. The number of sites is N. Let the total number of locally large candidate itemsets be |CGi(k)|, and let the number of candidates that can be directly generated from the globally large (k−1)-itemsets be |CG(k)| (= apriori_gen(L(k−1))). The excess support X.supi − s ∗ |DBi| of an itemset X can be represented in m = ⌈log2(2 ∗ |DB|)⌉ bits. Let t be the number of bits in the output of the encryption of an itemset. A lower bound on t is log2(|CG(k)|); based on current encryption standards, t = 512 is a more appropriate value.²

The total bit-communication cost for Protocol 1 is O(t ∗ |CG(k)| ∗ N²); however, as much of this happens in parallel, we can divide by N to get an estimate of the communication time. For comparison, the FDM algorithm requires O(t ∗ |∪i LLi(k)| ∗ N) for the corresponding steps, with effectively the same reduction in time due to parallelism (achieved through broadcast as opposed to simultaneous point-to-point transmissions). The added cost of Protocol 1 is due to padding LLei(k) to hide the actual number of local itemsets supported, and to the increase in bits required to represent encrypted itemsets. The worst-case value for |CG(k)| is C(item domain size, k); however, the optimizations that make the a-priori algorithm effective in practice would fail for such a large |CG(k)|. In practice, only in the first round (k = 1) will this padding pose a high cost; |CG(1)| = the size of the domain of items. In later iterations, the size of |CG(k)| will be much closer to |LLei(k)|. The computation cost increase due to encryption is O(t³ ∗ |CG(k)| ∗ N²), where t is the number of bits in the encryption key. Here t³ represents the bit-wise cost of modular exponentiation.

²The worst-case bound on |CG(k)| is C(item domain size, k). t = 512 can represent such worst-case itemsets for 50 million possible items and k = 20, adequate for most practical cases.

Protocol 2 requires O(m ∗ |∪i LLi(k)| ∗ (N + t)) bits of communication. The t factor is for the secure circuit evaluations between sites N − 1 and 0 required to determine whether each itemset is supported. FDM actually requires an additional factor of N due to the broadcast of local support instead of point-to-point communication. However, the broadcast takes a single round instead of the N rounds of our method. The final secure comparison requires a computation cost of O(|∪i LLi(k)| ∗ m ∗ t³).

As discussed in Section V, using only Protocol 2 directly on CG(k) is fully secure, assuming the desired result includes all globally large itemsets. The communication cost becomes O(m ∗ |CG(k)| ∗ N), but because the communication in Protocol 2 is sequential, the communication time is roughly the same as for the full protocol. The encryption portion of the computation cost becomes O(|CG(k)| ∗ m ∗ t³) for the secure comparison at the end of the protocol. However, there is a substantial added cost in computing the support, as we must compute support for all |CG(k)| itemsets. This is generally much greater than the |CGi(k) ∪ (∪i LLi(k))| required under the full algorithm (or FDM), as shown in [3]. It is reasonable to expect that this cost will dominate the other costs, as it is linear in |DB|.

A. Optimizations and Further Discussion

The cost of "padding" LLei(k) from F to avoid disclosing the number of locally supported itemsets can add significantly to the communication and encryption costs. In practice, for k > 1, |CG(k)| is likely to be of reasonable size. However, |CG(1)| could be very large, as it depends only on the size of the domain of items and is not limited by already discovered frequent itemsets. If the participants can agree, without inspecting the data, on an upper bound on the number of frequent items supported at any one site that is tighter than "every item may be frequent," we can achieve a corresponding decrease in the costs with no loss of security. This is likely to be feasible in practice; the very success of the a-priori algorithm is based on the assumption that relatively few items are frequent. Alternatively, if we are willing to leak an upper bound on the number of itemsets supported at each site, each site can set its own upper bound and pad only to that bound. This can be done for every round, not just k = 1. As a practical matter, such an approach would achieve acceptable security and would change the |CG(k)| factor in the communication and encryption costs of Protocol 1 to O(|∪i LLi(k)|), equivalent to FDM.

Another way to limit the encryption cost of padding is to pad randomly from the domain of the encryption output rather than encrypting items from F. Assuming |domain of Ei| >> |domain of itemsets|, the probability of padding with a value that decrypts to a real itemset is small, and even if this occurs it will only result in an additional itemset being tested for support in Protocol 2. When the support count is tested, such "false hits" will be filtered out, and the final result will be correct.

The comparison phase at the end of Protocol 2 can also be removed, eliminating the O(m ∗ |∪i LLi(k)| ∗ t) bits of communication and the O(|∪i LLi(k)| ∗ m ∗ t³) encryption cost. This reveals the excess support for each itemset. Practical applications may demand this count as part of the result for globally supported itemsets, so the only information leaked is the support counts for itemsets in ∪i LLi(k) − L(k). As these cannot be traced to an individual site, this will generally be acceptable in practice.
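As a rough illustration, the cost expressions above can be collected into a small calculator. The function and its names are ours, and the outputs are order-of-magnitude counts under the stated t³ modular-exponentiation model, not measured values:

```python
import math

def protocol_costs(N, db_size, cg_k, union_ll_k, t=512):
    """Rough bit/operation counts from the cost model above.
    N sites, |DB| = db_size, |CG(k)| = cg_k, |union_i LL_i(k)| = union_ll_k,
    t = bits per encrypted itemset."""
    m = math.ceil(math.log2(2 * db_size))   # bits to hold an excess support
    return {
        "protocol1_comm_bits": t * cg_k * N * N,        # O(t * |CG(k)| * N^2)
        "protocol1_enc_ops":  (t ** 3) * cg_k * N * N,  # bit-wise modular exponentiation
        "protocol2_comm_bits": m * union_ll_k * (N + t),
        "final_compare_ops":  union_ll_k * m * (t ** 3),
    }

# parameters loosely matching the experiments discussed below
costs = protocol_costs(N=3, db_size=500_000, cg_k=100_000, union_ll_k=2000)
print(costs["protocol2_comm_bits"])
```

Dividing the Protocol 1 figure by N approximates the parallel communication time, as noted above.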

The cost estimates are based on the assumption that all frequent itemsets (even 1-itemsets) are part of the result. If exposing the globally frequent 1-itemsets is a problem, the algorithm could easily begin with 2-itemsets (or larger). While the worst-case cost would be unchanged, there would be an impact in practical terms. Eliminating the pruning of globally infrequent 1-itemsets would increase the size of CGi(2) and thus LLi(2); however, local pruning of infrequent 1-itemsets should make the sizes manageable. More critical is the impact on |CG(2)|, and thus the cost of padding to hide the number of locally large itemsets. In practice, the size of CG(2) will rarely be the theoretical limit of C(item domain size, 2), but this worst-case bound would have to be used if the algorithm began by finding 2-itemsets (the problem is worse for k > 2). A practical solution would again be to have the sites agree on a reasonable upper bound for the number of locally supported k-itemsets for the initial k, revealing some information to substantially decrease the amount of padding needed.

B. Practical Cost of Encryption

While achieving privacy comes at a reasonable increase in communication cost, what about the cost of encryption? As a test of this, we implemented Pohlig-Hellman, described in the Appendix. The encryption time per itemset represented with t = 512 bits was 0.00428 seconds on a 700MHz Pentium 3 under Linux. Using this, and the results reported in [3], we can estimate the cost of privacy-preserving association rule mining on the tests in [3].

The first set of experiments described in [3] contains sufficient detail for us to estimate the cost of encryption. These experiments used three sites, an item domain size of 1000, and a total database size of 500k transactions.

The encryption cost for the initial round (k = 1) would be 4.28 seconds at each site, as the padding need only be to the domain size of 1000. While finding two-itemsets could potentially be much worse (C(1000, 2) = 499500), in practice |CG(2)| is much smaller. The experiment in [3] reports a total number of candidate sets (Σ_{k>1} |CG(k)|) of just over 100,000 at 1% support. This gives a total encryption cost of around 430 seconds per site, with all sites encrypting simultaneously. This assumes none of the optimizations of Section VI-A; if the encryption cost at each site could be cut to |LLi(k)| items by eliminating the cost of encrypting the padding, the encryption cost would be cut to 5% to 35% of the above on the datasets used in [3].

There is also the encryption cost of the secure comparison at the end of Protocol 2. Although the data reported in [3] does not give us the exact size of ∪i LLi(k), it appears to be on the order of 2000. Based on this, the cost of the secure comparison, O(|∪i LLi(k)| ∗ m ∗ t³), would be about 170 seconds.

The total execution time for the experiment reported in [3] was approximately 800 seconds. Similar numbers hold at different support levels; the added cost of encryption would at worst increase the total run time by roughly 75%.
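The arithmetic behind these estimates can be reproduced directly; the variable names are ours, and the constants are the figures quoted above from [3] and our timing measurement:

```python
per_item_enc = 0.00428   # seconds per 512-bit Pohlig-Hellman encryption (measured)
domain_size = 1000       # item domain in the experiments of [3]

round1 = per_item_enc * domain_size    # k = 1 round: pad only to the domain size
later = per_item_enc * 100_000         # sum of |CG(k)| for k > 1, at 1% support
compare = 170.0                        # secure-comparison estimate from Section VI
baseline = 800.0                       # total FDM run time reported in [3]

overhead = (round1 + later + compare) / baseline
print(f"{round1:.2f}s first round, {later:.0f}s later rounds, "
      f"~{overhead:.0%} added run time")
```

The later-round figure of about 428 seconds matches the "around 430 seconds per site" above, and the overall overhead comes out at roughly 75%.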

VII. CONCLUSIONS AND FURTHER WORK

Cryptographic tools can enable data mining that would otherwise be prevented due to security concerns. We have given procedures to mine distributed association rules on horizontally partitioned data. We have shown that distributed association rule mining can be done efficiently under reasonable security assumptions.

We believe the need for mining of data where access is restricted by privacy concerns will increase. Examples include knowledge discovery among intelligence services of different countries and collaboration among corporations without revealing trade secrets. Even within a single multi-national company, privacy laws in different jurisdictions may prevent sharing individual data. Many more examples can be imagined.

We would like to see secure algorithms for classification, clustering, etc. Another possibility is secure approximate data mining algorithms. Allowing error in the results may enable more efficient algorithms that maintain the desired level of security.

The secure multi-party computation definitions from the cryptography domain may be too restrictive for our purposes. A specific example of the need for more flexible definitions can be seen in Protocol 1. The "padding" set F is defined to be infinite, so that the probability of collision among these items is 0. This is impractical, and intuitively, allowing collisions among the padded itemsets would seem more secure, as the information leaked (itemsets supported in common by subsets of the sites) would become an upper bound rather than an exact value. However, unless we know in advance the probability of collision among real itemsets (or, more specifically, can set the size of F so that the ratio of the collision probabilities among F and among real itemsets is constant), the protocol is less secure under secure multi-party computation definitions. The problem is that knowing the probability of collision among items chosen from F enables us to predict (although generally with low accuracy) which fully encrypted itemsets are real and which are fake. This allows a probabilistic upper-bound estimate of the number of itemsets supported at each site. It also allows a probabilistic estimate of the number of itemsets supported in common by subsets of the sites that is tighter than the number of collisions found in the RuleSet. Definitions that allow us to trade off such estimates, and techniques to prove protocols relative to those definitions, would allow us to prove the privacy of protocols that are practically superior to protocols meeting the strict secure multi-party computation definitions.

More suitable security definitions that allow parties to choose their desired level of security are needed, allowing efficient solutions that maintain the desired security. Some suggested directions for research in this area are given in [19]. One line of research is to predict the value of information for a particular organization, allowing tradeoffs between disclosure cost, computation cost, and benefit from the result. We believe some ideas from game theory and economics may be relevant.

In summary, it is possible to mine globally valid results from distributed data without revealing information that compromises the privacy of the individual sources. Such privacy-preserving data mining can be done with a reasonable increase in cost over methods that do not maintain privacy. Continued research will expand the scope of privacy-preserving data mining, enabling most or all data mining methods to be applied in situations where privacy concerns would appear to restrict such mining.

APPENDIX
CRYPTOGRAPHIC NOTES ON COMMUTATIVE ENCRYPTION

The Pohlig-Hellman encryption scheme [15] can be used as a commutative encryption scheme meeting the requirements of Section II-C. Pohlig-Hellman works as follows. Given a large prime p with no small factors of p − 1, each party chooses a random pair (e, d) such that e ∗ d = 1 (mod p − 1). The encryption of a given message M is M^e (mod p). Decryption of a given ciphertext C is done by evaluating C^d (mod p): C^d = M^{ed} (mod p), and by Fermat's little theorem, M^{ed} = M^{1+k(p−1)} = M (mod p).

It is easy to see that Pohlig-Hellman with shared p satisfies Equation 3. Let us assume that there are n different encryption and decryption pairs ((e1, d1), ..., (en, dn)). For any permutations i, j of 1..n, and E = e1 ∗ e2 ∗ ... ∗ en = ei1 ∗ ei2 ∗ ... ∗ ein = ej1 ∗ ej2 ∗ ... ∗ ejn (mod p − 1):

Eei1(...(Eein(M))...) = (...((M^{ein} mod p)^{ein−1} mod p)...)^{ei1} mod p
= M^{ein ∗ ein−1 ∗ ... ∗ ei1} (mod p)
= M^E (mod p)
= M^{ejn ∗ ejn−1 ∗ ... ∗ ej1} (mod p)
= Eej1(...(Eejn(M))...)

Equation 4 is also satisfied by the Pohlig-Hellman encryption scheme. Let M1, M2 ∈ GF(p) such that M1 ≠ M2. Any order of encryption by all parties is equal to raising the plaintext to the E-th power mod p. Let us assume that after encryption M1 and M2 are mapped to the same value. This implies that M1^E = M2^E (mod p). By exponentiating both sides with D = d1 ∗ d2 ∗ ... ∗ dn (mod p − 1), we get M1 = M2 (mod p), a contradiction. (Note that E ∗ D = e1 ∗ e2 ∗ ... ∗ en ∗ d1 ∗ d2 ∗ ... ∗ dn = e1 ∗ d1 ∗ ... ∗ en ∗ dn = 1 (mod p − 1).) Therefore, the probability that two different elements map to the same value is zero.

Direct implementation of Pohlig-Hellman is not secure. Consider the following example, encrypting two values a and b, where b = a²: Ee(b) = Ee(a²) = (a²)^e (mod p) = (a^e)² (mod p) = (Ee(a))² (mod p). This shows that given two encrypted values, it is possible to determine whether one is the square of the other (even though the base values are not revealed). This violates the security requirement of Section II-C.

Huberman et al. provide a solution [20]. Rather than encrypting items directly, a hash of each item is encrypted. The hash occurs only at the originating site; the second and later encryptions of items can use Pohlig-Hellman directly. The hash breaks the relationship revealed by the encryption (e.g., a = b²). After decryption, the hashed values must be mapped back to the original values. This can be done by hashing the candidate itemsets in CG(k) to build a lookup table; anything not in this table is a fake "padding" itemset and can be discarded.³

Any secure encryption scheme that satisfies Equations 3 and 4 can be used in our protocols. The above approach is used to generate the cost estimates in Section VI-B. Other approaches, and further definitions and discussion of their security, can be found in [21]–[24].

³A hash collision, resulting in a padding itemset mapping to an itemset in the table, could result in an extra itemset appearing in the union. This would be filtered out by Protocol 2; the final results would be correct.
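The commutativity (Equation 3), the decryption identity, and the squaring leak can all be observed with a toy implementation; the prime below is far too small to be secure, and the hashing fix of [20] is omitted:

```python
# Toy Pohlig-Hellman demo: p is far too small for real use.
p = 1019                         # prime; exponents act mod p - 1
e1, e2 = 7, 11                   # encryption exponents, gcd(e, p - 1) = 1
d1 = pow(e1, -1, p - 1)          # decryption exponents: e * d = 1 (mod p - 1)
d2 = pow(e2, -1, p - 1)

enc = lambda M, e: pow(M, e, p)

M = 123
# commutativity (Equation 3): the order of encryption does not matter
assert enc(enc(M, e1), e2) == enc(enc(M, e2), e1)
# decrypting with both keys, in any order, recovers M
assert enc(enc(enc(enc(M, e1), e2), d1), d2) == M
# the squaring leak: E(a^2) = E(a)^2, so relationships survive encryption
a = 45
assert enc(a * a % p, e1) == enc(a, e1) ** 2 % p
```

Hashing before the first encryption, as in [20], destroys the algebraic relationship exploited in the last assertion.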

ACKNOWLEDGMENT

We wish to acknowledge the contributions of Mike Atallah and Jaideep Vaidya. Discussions with them have helped to tighten the proofs, giving clear bounds on the information released.

REFERENCES

[1] R. Agrawal and R. Srikant, "Fast algorithms for mining association rules," in Proceedings of the 20th International Conference on Very Large Data Bases. Santiago, Chile: VLDB, Sept. 12-15 1994, pp. 487–499. [Online]. Available: http://www.vldb.org/dblp/db/conf/vldb/vldb94-487.html
[2] D. W.-L. Cheung, J. Han, V. Ng, A. W.-C. Fu, and Y. Fu, "A fast distributed algorithm for mining association rules," in Proceedings of the 1996 International Conference on Parallel and Distributed Information Systems (PDIS'96). Miami Beach, Florida, USA: IEEE, Dec. 1996, pp. 31–42.
[3] D. W.-L. Cheung, V. Ng, A. W.-C. Fu, and Y. Fu, "Efficient mining of association rules in distributed databases," IEEE Trans. Knowledge Data Eng., vol. 8, no. 6, pp. 911–922, Dec. 1996.
[4] R. Agrawal and R. Srikant, "Privacy-preserving data mining," in Proceedings of the 2000 ACM SIGMOD Conference on Management of Data. Dallas, TX: ACM, May 14-19 2000, pp. 439–450. [Online]. Available: http://doi.acm.org/10.1145/342009.335438
[5] D. Agrawal and C. C. Aggarwal, "On the design and quantification of privacy preserving data mining algorithms," in Proceedings of the Twentieth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems. Santa Barbara, California, USA: ACM, May 21-23 2001, pp. 247–255. [Online]. Available: http://doi.acm.org/10.1145/375551.375602
[6] A. Evfimievski, R. Srikant, R. Agrawal, and J. Gehrke, "Privacy preserving mining of association rules," in The Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, Alberta, Canada, July 23-26 2002, pp. 217–228. [Online]. Available: http://doi.acm.org/10.1145/775047.775080
[7] S. J. Rizvi and J. R. Haritsa, "Maintaining data privacy in association rule mining," in Proceedings of 28th International Conference on Very Large Data Bases. Hong Kong: VLDB, Aug. 20-23 2002, pp. 682–693. [Online]. Available: http://www.vldb.org/conf/2002/S19P03.pdf
[8] Y. Lindell and B. Pinkas, "Privacy preserving data mining," in Advances in Cryptology – CRYPTO 2000. Springer-Verlag, Aug. 20-24 2000, pp. 36–54.
[9] O. Goldreich, "Secure multi-party computation," Sept. 1998, (working draft). [Online]. Available: http://www.wisdom.weizmann.ac.il/~oded/pp.html
[10] J. Vaidya and C. Clifton, "Privacy preserving association rule mining in vertically partitioned data," in The Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, Alberta, Canada, July 23-26 2002, pp. 639–644. [Online]. Available: http://doi.acm.org/10.1145/775047.775142
[11] A. C. Yao, "How to generate and exchange secrets," in Proceedings of the 27th IEEE Symposium on Foundations of Computer Science. IEEE, 1986, pp. 162–167.
[12] I. Ioannidis and A. Grama, "An efficient protocol for Yao's millionaires' problem," in Hawaii International Conference on System Sciences (HICSS-36), Waikoloa Village, Hawaii, Jan. 6-9 2003.
[13] O. Goldreich, "Encryption schemes," Mar. 2003, (working draft). [Online]. Available: http://www.wisdom.weizmann.ac.il/~oded/PSBookFrag/enc.ps
[14] R. L. Rivest, A. Shamir, and L. Adleman, "A method for obtaining digital signatures and public-key cryptosystems," Communications of the ACM, vol. 21, no. 2, pp. 120–126, 1978.

12 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, TO APPEAR

Protocol 1 Finding secure union of large itemsets of size k

Require: N ≥ 3 sites numbered 0..N − 1, set F of non-itemsets.

Phase 0: Encryption of all the rules by all sites
for each site i do
  generate LLi(k) as in steps 1 and 2 of the FDM algorithm
  LLei(k) = ∅
  for each X ∈ LLi(k) do
    LLei(k) = LLei(k) ∪ {Ei(X)}
  end for
  for j = |LLei(k)| + 1 to |CG(k)| do
    LLei(k) = LLei(k) ∪ {Ei(random selection from F)}
  end for
end for

Phase 1: Encryption by all sites
for Round j = 0 to N − 1 do
  if Round j = 0 then
    Each site i sends permuted LLei(k) to site (i + 1) mod N
  else
    Each site i encrypts all items in LLe((i−j) mod N)(k) with Ei, permutes, and sends it to site (i + 1) mod N
  end if
end for {At the end of Phase 1, site i has the itemsets of site (i + 1) mod N encrypted by every site}

Phase 2: Merge odd/even itemsets
Each site i sends LLe((i+1) mod N)(k) to site i mod 2
Site 0 sets RuleSet1 = ∪_{j=1..⌈(N−1)/2⌉} LLe(2j−1)(k)
Site 1 sets RuleSet0 = ∪_{j=0..⌊(N−1)/2⌋} LLe(2j)(k)

Phase 3: Merge all itemsets
Site 1 sends permuted RuleSet0 to site 0
Site 0 sets RuleSet = RuleSet0 ∪ RuleSet1

Phase 4: Decryption
for i = 0 to N − 2 do
  Site i decrypts items in RuleSet using Di
  Site i sends permuted RuleSet to site (i + 1) mod N
end for
Site N − 1 decrypts items in RuleSet using DN−1
RuleSet(k) = RuleSet − F
Site N − 1 broadcasts RuleSet(k) to sites 0..N − 2

Murat Kantarcıoǧlu is a Ph.D. candidate at Purdue University. He has a Master's degree in Computer Science from Purdue University and a Bachelor's degree in Computer Engineering from Middle East Technical University, Ankara, Turkey. His research interests include data mining, database security and information security. He is a student member of the ACM.

Chris Clifton is an Associate Professor of Computer Science at Purdue University. He has a Ph.D. from Princeton University, and Bachelor's and Master's degrees from the Massachusetts Institute of Technology. Prior to joining Purdue he held positions at The MITRE Corporation and Northwestern University. His research interests include data mining, database support for text, and database security. He is a senior member of the IEEE and a member of the IEEE Computer Society and the ACM.
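The commutative property that Phases 1 and 4 of Protocol 1 rely on — Ei(Ej(x)) = Ej(Ei(x)) — can be illustrated with exponentiation modulo a prime, in the style of Pohlig–Hellman encryption. The sketch below is a toy, single-process model (the prime, the integer encoding of itemsets, and the key-generation loop are illustrative assumptions, not the paper's implementation); it shows why duplicate itemsets collapse in the union without revealing which site contributed them.

```python
import random
from math import gcd

P = (1 << 127) - 1  # Mersenne prime; exponents compose modulo P - 1

def keygen():
    # pick an exponent invertible mod P - 1 so that decryption exists
    while True:
        e = random.randrange(3, P - 1)
        if gcd(e, P - 1) == 1:
            return e

def enc(key, x):
    # commutative: enc(a, enc(b, x)) == enc(b, enc(a, x))
    return pow(x, key, P)

def dec(key, c):
    return pow(c, pow(key, -1, P - 1), P)

# Three sites, each with its locally large itemsets encoded as integers.
keys = [keygen() for _ in range(3)]
local = [{11, 12}, {12, 13}, {11, 14}]

# Phases 0-1: every site's key is applied to every itemset; because the
# cipher commutes, the order of application does not matter.
encrypted = []
for s in local:
    for k in keys:
        s = {enc(k, x) for x in s}
    encrypted.append(s)

# Phases 2-3: duplicates merge in the union (itemset 12 appears at two
# sites but encrypts to the same value under the same composite key).
union = set().union(*encrypted)

# Phase 4: strip each site's encryption in turn.
for k in keys:
    union = {dec(k, c) for c in union}

print(sorted(union))  # [11, 12, 13, 14] -- the union of locally large itemsets
```

The security argument in the paper is exactly that each site sees only values encrypted under keys it does not hold, plus the final union.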

Protocol 2 Testing support threshold without revealing support count

Require: N ≥ 3 sites numbered 0..N − 1, m ≥ 2 ∗ |DB|
rule set = ∅
at site 0:
for each r ∈ candidate set do
  choose random integer xr from a uniform distribution over 0..m − 1;
  t = r.sup0 − s ∗ |DB0| + xr (mod m);
  rule set = rule set ∪ {(r, t)};
end for
send rule set to site 1;
for i = 1 to N − 2 do
  for each (r, t) ∈ rule set do
    t̄ = r.supi − s ∗ |DBi| + t (mod m);
    rule set = rule set − {(r, t)} ∪ {(r, t̄)};
  end for
  send rule set to site i + 1;
end for
at site N − 1:
for each (r, t) ∈ rule set do
  t̄ = r.supN−1 − s ∗ |DBN−1| + t (mod m);
  securely compute whether (t̄ − xr) (mod m) < m/2 with site 0; {site 0 knows xr}
  if (t̄ − xr) (mod m) < m/2 then
    multi-cast r as a globally large itemset.
  end if
end for
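Protocol 2 is a randomized secure sum: site 0 masks its excess support with xr, each subsequent site adds its own excess modulo m, and the final half-range test reveals only whether Σ(r.supi − s ∗ |DBi|) ≥ 0, i.e., whether the rule is globally supported. A minimal single-process sketch, assuming each per-site threshold s ∗ |DBi| is an integer passed in precomputed (the function name `exceeds_threshold` is illustrative, not from the paper):

```python
import random

def exceeds_threshold(sups, mins, m):
    """Return True iff sum(sups) >= sum(mins), revealing only that one bit.

    sups[i] is site i's support count; mins[i] is s * |DB_i| (assumed
    integer); m >= 2 * |DB| so the masked sum cannot wrap ambiguously.
    """
    x = random.randrange(m)          # site 0's random mask x_r
    t = (x + sups[0] - mins[0]) % m  # site 0 starts the chain
    for sup, mn in zip(sups[1:], mins[1:]):
        t = (t + sup - mn) % m       # each site adds its excess support
    # Sites N-1 and 0 jointly test the half-range condition: the true excess
    # lies in [-|DB|, |DB|], so (t - x) mod m < m/2 exactly when it is >= 0.
    return (t - x) % m < m / 2

# Three sites with |DB_i| = 100 each and global threshold s = 30%:
m = 601                                                  # m >= 2 * |DB| = 600
print(exceeds_threshold([50, 30, 20], [30, 30, 30], m))  # True  (excess +10)
print(exceeds_threshold([10, 10, 10], [30, 30, 30], m))  # False (excess -60)
```

Because every intermediate value t is uniformly random given the sites' own inputs, no site learns another site's support count — only the final threshold bit is disclosed.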

