Sei sulla pagina 1di 11

Mondrian Multidimensional K-Anonymity

Kristen LeFevre David J. DeWitt Raghu Ramakrishnan


University of Wisconsin, Madison

Abstract Voter Registration Data


Name Age Sex Zipcode
K-Anonymity has been proposed as a mechanism for pro- Ahmed 25 Male 53711
tecting privacy in microdata publishing, and numerous re- Brooke 28 Female 55410
coding “models” have been considered for achieving k- Claire 31 Female 90210
anonymity. This paper proposes a new multidimensional Dave 19 Male 02174
model, which provides an additional degree of flexibility not Evelyn 40 Female 02237
seen in previous (single-dimensional) approaches. Often
this flexibility leads to higher-quality anonymizations, as Patient Data
measured both by general-purpose metrics and more spe- Age Sex Zipcode Disease
cific notions of query answerability. 25 Male 53711 Flu
25 Female 53712 Hepatitis
Optimal multidimensional anonymization is NP-hard
26 Male 53711 Brochitis
(like previous optimal k-anonymity problems). However, 27 Male 53710 Broken Arm
we introduce a simple greedy approximation algorithm, 27 Female 53712 AIDS
and experimental results show that this greedy algorithm 28 Male 53711 Hang Nail
frequently leads to more desirable anonymizations than
exhaustive optimal algorithms for two single-dimensional Figure 1. Tables vulnerable to a joining attack
models. rithm for k-anonymization, an approach with several impor-
tant advantages:1
• The greedy algorithm is substantially more efficient
1. Introduction than proposed optimal k-anonymization algorithms for
A number of organizations publish microdata for purposes single-dimensional models [2, 9, 12]. The time com-
such as demographic and public health research. In order plexity of the greedy algorithm is O(nlogn), while the
to protect individual privacy, known identifiers (e.g., Name optimal algorithms are exponential in the worst case.
and Social Security Number) must be removed. In addi- • The greedy multidimensional algorithm often pro-
tion, this process must account for the possibility of com- duces higher-quality results than optimal single-
bining certain other attributes with external data to uniquely dimensional algorithms (as well as the many existing
identify individuals [15]. For example, an individual might single-dimensional heuristic [6, 14, 16] and stochastic
be “re-identified” by joining the released data with another search [8, 18] algorithms).
(public) database on Age, Sex, and Zipcode. Figure 1 shows
such an attack, where Ahmed’s medical information is de- 1.1. Basic Definitions
termined by joining the released patient data with a public Quasi-Identifier Attribute Set A quasi-identifer is a min-
voter registration list. imal set of attributes X1 , ..., Xd in table T that can be joined
K-anonymity has been proposed to reduce the risk of with external information to re-identify individual records.
this type of attack [12, 13, 15]. The primary goal of k- We assume that the quasi-identifier is understood based on
anonymization is to protect the privacy of the individuals to specific knowledge of the domain.
whom the data pertains. However, subject to this constraint, Equivalence Class A table T consists of a multiset of tu-
it is important that the released data remain as “useful” as ples. An equivalence class for T with respect to attributes
possible. Numerous recoding models have been proposed X1 , ..., Xd is the set of all tuples in T containing identi-
in the literature for k-anonymization [8, 9, 13, 17, 10], and cal values (x1 , ..., xd ) for X1 , ..., Xd . In SQL, this is like a
often the “quality” of the published data is dictated by the GROUP BY query on X1 , ..., Xd .
model that is used. The main contributions of this paper are 1 The visual representation of such recodings reminded us of the work

a new multidimensional recoding model and a greedy algo- of artist Piet Mondrian (1872-1944).

Proceedings of the 22nd International Conference on Data Engineering (ICDE’06)


8-7695-2570-9/06 $20.00 © 2006 IEEE
Authorized licensed use limited to: National University of Singapore. Downloaded on March 05,2010 at 23:24:40 EST from IEEE Xplore. Restrictions apply.
K-Anonymity Property Table T is k-anonymous with sophisticated notion of quality measurement, based on a
respect to attributes X1 , ..., Xd if every unique tuple workload of aggregate queries (Section 5).
(x1 , ..., xd ) in the (multiset) projection of T on X1 , ..., Xd Using general-purpose metrics and a simple query work-
occurs at least k times. That is, the size of each equivalence load, our experimental evaluation (Section 6) indicates that
class in T with respect to X1 , ..., Xd is at least k. the quality of the anonymizations obtained by our greedy al-
gorithm are often superior to those obtained by exhaustive
K-Anonymization A view V of relation T is said to be a
optimal algorithms for two single-dimensional models.
k-anonymization if the view modifies or generalizes the data
The paper concludes with discussions of related and fu-
of T according to some model such that V is k-anonymous
ture work (Sections 7 and 8).
with respect to the quasi-identifier.
1.2. General-Purpose Quality Metrics 2. Multidimensional Global Recoding
There are a number of notions of microdata quality [2, 6, In a relational database, each attribute has some domain of
8, 10, 12, 13, 14, 15, 16], but intuitively, the anonymiza- values. We use the notation DX to denote the domain of at-
tion process should generalize or perturb the original data as tribute X. A global recoding achieves anonymity by map-
little as is necessary to satisfy the k-anonymity constraint. ping the domains of the quasi-identifier attributes to gener-
Here we consider some simple general-purpose quality met- alized or altered values [17].
rics, but a more targeted approach to quality measurement Global recoding can be further broken down into two
based on query answerability is described in Section 5. sub-classes [9]. A single-dimensional global recoding is
The simplest kind of quality measure is based on the size defined by a function φi : DXi → D for each attribute Xi
of the equivalence classes E in V . Intuitively, the discern- of the quasi-identifier. An anonymization V is obtained by
ability metric (CDM ), described in [2], assigns to each tu- applying each φi to the values of Xi in each tuple of T .
ple t in V a penalty, which is determined by the size of the Alternatively, a multidimensional global recoding is de-
equivalence class containing t. fined by a single function φ : DX1 × ... × DXn → D ,
 which is used to recode the domain of value vectors asso-
CDM = EquivClasses E |E|2 ciated with the set of quasi-identifier attributes. Under this
model, V is obtained by applying φ to the vector of quasi-
As an alternative, we also propose the normalized aver- identifier values in each tuple of T .
age equivalence class size metric (CAV G ). Multidimensional recoding can be applied to categor-
CAV G = ( total ical data (in the presence of user-defined generalization
equiv classes )/(k)
total records
hierarchies) or to numeric data. For numeric data, and
other totally-ordered domains, (single-dimensional) “par-
1.3. Paper Overview titioning” models have been proposed [2, 8]. A single-
The first contribution of this paper is a new multidimen- dimensional interval is defined by a pair of endpoints p, v ∈
sional model for k-anonymization (Section 2). Like previ- DXi such that p ≤ v. (The endpoints of such an interval
ous optimal k-anonymity problems [1, 10], optimal multi- may be open or closed to handle continuous domains.)
dimensional k-anonymization is NP-hard. However, for nu-
meric data, we find that under reasonable assumptions the Single-dimensional Partitioning Assume there is a total
worst-case maximum size of equivalence classes is O(k) in order associated with the domain of each quasi-identifier
the multidimensional case, while in the single-dimensional attribute Xi . A single-dimensional partitioning defines, for
model, this bound can grow linearly with the total number each Xi , a set of non-overlapping single-dimensional in-
of records. For a simple variation of the multidimensional tervals that cover DXi . φi maps each x ∈ DXi to some
model, this bound is 2k (Section 3). summary statistic for the interval in which it is contained.
Using the multidimensional recoding model, we intro- The released data will include simple statistics that sum-
duce a simple and efficient greedy algorithm that can be ap- marize the intervals they replace. For now, we assume that
plied to both categorical and numeric data (Section 4). For these summary statistics are min-max ranges, but we dis-
numeric data, the results are a constant-factor approxima- cuss some other possibilities in Section 5.
tion of optimal, as measured by the general-purpose quality This partitioning model is easily extended to multidi-
metrics described in the previous section. mensional recoding. Again, assume a total order for each
General-purpose quality metrics are a good starting point DXi . A multidimensional region is defined by a pair of d-
when the ultimate use of the published data is unknown. tuples (p1 , ..., pd ), (v1 , ..., vd ) ∈ DX1 × ... × DXd such that
However, in some cases, quality might be more appropri- ∀i, pi ≤ vi . Conceptually, each region is bounded by a d-
ately measured by the application consuming the published dimensional rectangular box, and each edge and vertex of
data. The second main contribution of this paper is a more this box may be either open or closed.

Proceedings of the 22nd International Conference on Data Engineering (ICDE’06)


8-7695-2570-9/06 $20.00 © 2006 IEEE
Authorized licensed use limited to: National University of Singapore. Downloaded on March 05,2010 at 23:24:40 EST from IEEE Xplore. Restrictions apply.
Age Sex Zipcode Disease 2.1. Spatial Representation
[25-28] Male [53710-53711] Flu
[25-28] Female 53712 Hepatitis Throughout the paper, we use a convenient spatial repre-
[25-28] Male [53710-53711] Brochitis sentation for quasi-identifiers. Consider table T with quasi-
[25-28] Male [53710-53711] Broken Arm identifier attributes X1 , ..., Xd , and assume that there is a
[25-28] Female 53712 AIDS total ordering for each domain. The (multiset) projections
[25-28] Male [53710-53711] Hang Nail of T on X1 , ..., Xd can then be represented as a multiset
Figure 2. Single-dimensional anonymization of points in d-dimensional space. For example, Figure 4(a)
shows the two-dimensional representation of Patients from
Age Sex Zipcode Disease Figure 1, for quasi-identifier attributes Age and Zipcode.
[25-26] Male 53711 Flu Similar models have been considered for rectangular par-
[25-27] Female 53712 Hepatitis titioning in 2 dimensions [11]. In this context, the single-
[25-26] Male 53711 Brochitis dimensional and multidimensional partitioning models are
[27-28] Male [53710-53711] Broken Arm analogous to the “p × p” and “arbitrary” classes of tilings,
[25-27] Female 53712 AIDS respectively. However, to the best of our knowledge, none
[27-28] Male [53710-53711] Hang Nail of the previous optimal tiling problems have included con-
Figure 3. Multidimensional anonymization straints requiring minimum occupancy.
2.2. Hardness Result
Strict Multidimensional Partitioning A strict multidi-
mensional partitioning defines a set of non-overlapping There have been previous hardness results for optimal k-
multidimensional regions that cover DX1 × ... × DXd . φ anonymization under the attribute- and cell-suppression
maps each tuple (x1 , ..., xd ) ∈ DX1 × ... × DXd to a sum- models [1, 10]. The problem of optimal strict k-anonymous
mary statistic for the region in which it is contained. multidimensional partitioning (finding the k-anonymous
When φ is applied to table T (assuming each region is strict multidimensional partitioning with the smallest CDM
mapped to a unique vector of summary statistics), the tuple or CAV G ) is also NP-hard, but this result does not follow
set in each non-empty region forms an equivalence class in directly from the previous results.
V . For simplicity, we again assume that these summary We formulate the following decision problem using
statistics are ranges, and further discussion is provided in CAV G .2 Here, a multiset of points P is equivalently rep-
Section 5. resented as a set of distinct (point, count) pairs.
Sample 2-anonymizations of Patients, using single- Decisional K-Anonymous Multidimensional Partitioning
dimensional and strict multidimensional partitioning, are Given a set P of unique (point, count) pairs, with points
shown in Figures 2 and 3. Notice that the anonymization ob- in d-dimensional space, is there a strict multidimen-
tained using the multidimensional model is not permissible sional partitioning for P such
under the single-dimensional model because the domains of  that for every resulting
multidimensional
 region R i , p∈Ri count(p) ≥ k or
Age and Zipcode are not recoded to a single set of intervals count(p) = 0, and C ≤ positive constant c?
p∈Ri AV G
(e.g., Age 25 is mapped to either [25-26] or [25-27], de-
pending on the values of Zipcode and Sex). However, the Theorem 1 Decisional k-anonymous multidimensional
single-dimensional recoding is also valid under the multidi- partitioning is NP-complete.
mensional model.
Proof The proof is by reduction from Partition:
Proposition 1 Every single-dimensional partitioning for Partition Consider a set A of n positive integers
quasi-identifier attributes X1 , ..., Xd can be expressed as a {a1 , ..., an }. Is there some A ⊆ A, such that
strict multidimensional partitioning. However, when d ≥  
2 and ∀i, |DXi | ≥ 2, there exists a strict multidimen- ai ∈A ai = aj ∈A−A aj ?
sional partitioning that cannot be expressed as a single-
For each ai ∈ A, construct a (point, count) pair. Let the
dimensional partitioning.
point be defined by (01 , ..., 0, 1i , 0, ..., 0d ) (the ith coordi-
It is intuitive that the optimal strict multidimensional nate is 1, and all other coordinates are 0), which resides in
partitioning must be at least as good as the optimal a d-dimensional unit-hypercube, and let the count equal ai .
single-dimensional partitioning. However, the optimal k- Let P be the union of all such pairs.
anonymous multidimensional partitioning problem is NP- We claim that the partition problem  for A can be re-
hard (Section 2.2). For this reason, we consider the duced to the following: Let k = 2ai . Is there a k-
worst-case upper bounds on equivalence class size for anonymous strict multidimensional partitioning for P such
single-dimensional and multidimensional partitioning (Sec-
tion 2.3).
2 Though stated for CAV G , the result is similar for CDM .

Proceedings of the 22nd International Conference on Data Engineering (ICDE’06)


8-7695-2570-9/06 $20.00 © 2006 IEEE
Authorized licensed use limited to: National University of Singapore. Downloaded on March 05,2010 at 23:24:40 EST from IEEE Xplore. Restrictions apply.
53710 53711 53712 53710 53711 53712 53710 53711 53712

25 25 25

26 26 26

27 27 27

28 28 28

(a) Patients (b) Single-Dimensional (c) Strict Multidimensional

Figure 4. Spatial representation of Patients and partitionings (quasi-identifiers Zipcode and Age)

that CAV G ≤ 1? To prove this claim, we show that there is the maximum number of duplicate copies of a single point
a solution to the k-anonymous multidimensional partition- (Theorem 2). This is in contrast to the second result (The-
ing problem for P if and only if there is a solution to the orem 3), which indicates that for single-dimensional parti-
partition problem for A. tioning, this bound may grow linearly with the total number
Suppose there exists a k-anonymous multidimen- of points.
sional partitioning for P . This partitioning must de- In order to state these results, we first define some termi-
fine two multidimensional regions, R1 and R2 ,  such that nology. A multidimensional cut for a multiset of points is
  ai
p∈R1 count(p) = p∈R2 count(p) = k =
an axis-parallel binary cut producing two disjoint multisets
2 , and
possibly some number of empty regions. By the strictness of points. Intuitively, such a cut is allowable if it does not
property, these regions must not overlap. Thus, the sum of cause a violation of k-anonymity.
counts for the two non-empty regions constitute the sum of
Allowable Multidimensional Cut Consider multiset P of
integers in two disjoint complementary subsets of A, and
points in d-dimensional space. A cut perpendicular to axis
we have an equal partitioning of A.
Xi at xi is allowable if and only if Count(P.Xi > xi ) ≥ k
In the other direction, suppose there is a solution to and Count(P.Xi ≤ xi ) ≥ k.
the partition problem for A. For each binary partition-
A single-dimensional cut is also axis-parallel, but con-
ing of A into disjoint complementary subsets A1 and
siders all regions in the space to determine allowability.
A2 , there is a multidimensional
 partitioning of P into re-
gions
 R 1 , ..., Rn such
 that p∈R 1
count(p) = ai ∈A1 ai , Allowable Single-Dimensional Cut Consider a multiset
p∈R2 count(p) = ai ∈A2 ai , and all other Ri are empty: P of points in d-dimensional space, and suppose we have
R1 is defined by two points, the origin and the point p hav- already made S single-dimensional cuts, thereby separat-
ing ith coordinate 1 when ai ∈ A1 and 0 otherwise. The ing the space into disjoint regions R1 , ..., Rm . A single-
bounding box for R1 is closed at all edges and vertices. R2 dimensional cut perpendicular to Xi at xi is allowable,
is defined by the origin and the point p having ith coordi- given S, if ∀Rj overlapping line Xi = xi , Count(Rj .Xi >
nate = 1 when ai ∈ A2 , and 0 otherwise. The bounding xi ) ≥ k and Count(Rj .Xi ≤ xi ) ≥ k.
box for R2 is open at the origin, but closed on all other Notice that recursive allowable multidimensional cuts
edges and vertices. CAV G is the average sum of counts for will result in a k-anonymous strict multidimensional parti-
the non-empty regions, divided by k. In this construction, tioning for P (although not all strict multidimensional par-
CAV G = 1, and R1 , ..., Rn is a k-anonymous multidimen- titionings can be obtained in this way), and a k-anonymous
sional partitioning of P . single-dimensional partitioning for P is obtained through
Finally, a given solution to the decisional k-anonymous successive allowable single-dimensional cuts.
multidimensional partitioning problem can be verified in For example, in Figures 4(b) and (c), the first cut oc-
polynomial time by scanning the input set of (point, count) curs on the Zipcode dimension at 53711. In the multidi-
pairs, and maintaining a sum for each region. mensional case, the left-hand side is cut again on the Age
dimension, which is allowable because it does not produce
a region containing fewer than k points. In the single-
2.3. Bounds on Partition Size
dimensional case, however, once the first cut is made, there
It is also interesting to consider worst-case upper bounds on are no remaining allowable single-dimensional cuts. (Any
the size of partitions resulting from single-dimensional and cut perpendicular to the Age axis would result in a region
multidimensional partitioning. This section presents two re- on the right containing fewer than k points.)
sults, the first of which indicates that for a constant-sized Intuitively, a partitioning is considered minimal when
quasi-identifier, this upper bound depends only on k and there are no remaining allowable cuts.

Proceedings of the 22nd International Conference on Data Engineering (ICDE’06)


8-7695-2570-9/06 $20.00 © 2006 IEEE
Authorized licensed use limited to: National University of Singapore. Downloaded on March 05,2010 at 23:24:40 EST from IEEE Xplore. Restrictions apply.
a1 a2 a3
For example, Figure 5 shows P in 2 dimensions. By
b1 k-1 addition, |P | = 2d(k − 1) + m, and by projecting P onto
any Xi we obtain the following point counts:
b2 k-1 m k-1 ⎧

⎪ k − 1, Xi = x i − 1

m + 2(d − 1)(k − 1), Xi = x i
b3 k-1 Count(Xi ) =

⎪ k − 1, X i = i + 1
x

(a) A set of points for which there is no allowable cut 0, otherwise
a1 a2 a3
Based on these counts, it is clear that any binary cut per-
b1 k-1 pendicular to axis Xi would result in some partition con-
taining fewer than k points.
b2 k-1 m k-1 Second, we show that for any multiset of points P in d-
dimensional space such that |P | > 2d(k − 1) + m, there
b3 k-1 1 exists an allowable multidimensional cut for P .
Consider some P in d-dimensional space, such that
(b) Adding a single point produces an allowable cut |P | = 2d(k − 1) + m + 1, and let x i denote the median
value of P projected on axis Xi . If there is no allowable cut
Figure 5. 2-Dimensional example of equiva- for P , we claim that there are at least m + 1 copies of point
lence class size bound ( d ) in P , contradicting the definition of m.
x1 , ..., x
For every dimension i = 1, ..., d, if there is no allow-
Minimal Strict Multidimensional Partitioning Let able cut perpendicular to axis Xi , then (because x i is the
R1 , ..., Rn denote a set of regions induced by a strict mul- median) Count(Xi < x i ) ≤ k − 1 and Count(Xi >
tidimensional partitioning, and let each region Ri contain i ) ≤ k − 1. This means that Count(Xi = x
x i ) ≥
multiset Pi of points. This multidimensional partitioning 2(d − 1)(k − 1) + m + 1. Thus, over d dimensions, we
is minimal if ∀i, |Pi | ≥ k and there exists no allowable find Count(X1 = x 1 ∧ ... ∧ Xd = xd ) ≥ m + 1. 
multidimensional cut for Pi .
Minimal Single-Dimensional Partitioning A set S of al- Theorem 3 The maximum number of points contained in
lowable single-dimensional cuts is a minimal single- any region R resulting from a minimal single-dimensional
dimensional partitioning for multiset P of points if there partitioning of a multiset of points P in d-dimensional
does not exist an allowable single-dimensional cut for P space (d ≥ 2) is O(|P |).
given S.
Proof We construct a multiset of points P , and a minimal
The following two theorems give upper-bounds on single-dimensional partitioning for P , such that the greatest
partition size for minimal multidimensional and single- number of points in a resulting region is O(|P |).
dimensional partitioning, respectively. Consider a quasi-identifier attribute X with domain DX ,
and a finite set VX ⊆ DX with a point x  ∈ VX .
Theorem 2 If R1 , ..., Rn denote the set of regions induced Initially, let P contain precisely 2k − 1 points p having
by a minimal strict multidimensional partitioning for multi- p.X = x . Then add to P an arbitrarily large number of
set of points P , the maximum number of points contained in points q, each with q.X ∈ VX , but q.X = x , and such that
any Ri is 2d(k − 1) + m, where m is the maximum number for each v ∈ VX there are at least k points in the resulting
of copies of any distinct point in P . set P having X = v.
By construction, if |Vx | = r, there are r − 1 allowable
Proof The proof has two parts. First, we show that there
single-dimensional cuts for P perpendicular to X (at each
exists a multiset P of points in d-dimensional space such
point in VX ), and we denote this set of cuts S. However,
that |P | = 2d(k−1)+m and that there is no allowable mul-
there are no allowable single-dimensional cuts for P given
tidimensional cut for P . Let xi denote some value on axis
S (perpendicular to any other axis). Thus, S is a minimal
Xi such that x i −1 and x
i +1 are also values on axis Xi , and
single-dimensional partitioning, and the size of the largest
let P initially contain m copies of the point ( 2 , ..., x
x1 , x d ).
resulting region is O(|P |). 
Add to P k − 1 copies of each of the following points:
x1 − 1, x
( 2 , ..., x
d ), (
x1 + 1, x 2 , ..., x
d ),
( 2 − 1, ..., x
x1 , x d ), ( 2 + 1, ..., x
x1 , x d ),
3. Multidimensional Local Recoding
... In contrast to global recoding, local recoding models map
( 2 , ..., x
x1 , x d − 1), ( 2 , ..., x
x1 , x d + 1) (non-distinct) individual data items to generalized values

Proceedings of the 22nd International Conference on Data Engineering (ICDE’06)


8-7695-2570-9/06 $20.00 © 2006 IEEE
Authorized licensed use limited to: National University of Singapore. Downloaded on March 05,2010 at 23:24:40 EST from IEEE Xplore. Restrictions apply.
[17]. Formally, a local recoding function, which we will Anonymize(partition)
denote φ∗ to distinguish it from global recoding functions, if (no allowable multidimensional cut for partition)
maps each (non-distinct) tuple t ∈ T to some recoded tuple return φ : partition → summary
else
t . V is obtained by replacing each tuple t ∈ T with φ∗ (t).
dim ← choose dimension()
Several local recoding models have been considered in the
f s ← frequency set(partition, dim)
literature, some of which are outlined in [9]. In this section, splitV al ← find median(f s)
we describe one such model that relaxes the requirements lhs ← {t ∈ partition : t.dim ≤ splitV al}
of strict multidimensional partitioning. rhs ← {t ∈ partition : t.dim > splitV al}
return Anonymize(rhs) ∪ Anonymize(lhs)
Relaxed Multidimensional Partitioning A relaxed multi-
dimensional partitioning for relation T defines a set of Figure 6. Top-down greedy algorithm for
(potentially overlapping) distinct multidimensional regions strict multidimensional partitioning
that cover DX1 × ... × DXd . Local recoding function φ∗
maps each tuple (x1 , ..., xd ) ∈ T to a summary statistic for strategy used for obtaining uniform occupancy was median-
one of the regions in which it is contained. partitioning [5]. In Figure 6, the split value is the median of
partition projected on dim. Like kd-tree construction, the
This relaxation offers an extra degree of flexibility. For
time complexity is O(nlogn), where n = |T |.
example, consider generating a 3-anonymization of Pa-
If there exists an allowable multidimensional cut for par-
tients, and suppose Zipcode is the single quasi-identifier at-
tition P perpendicular to some axis Xi , then the cut per-
tribute. Using the strict model, we would need to recode
pendicular to Xi at the median is allowable. By Theorem 2,
the Zipcode value in each tuple to [53710-53712]. Un-
the greedy (strict) median-partitioning algorithm results in a
der the relaxed model, this recoding can be performed on
set of multidimensional regions, each containing between k
a tuple-by-tuple basis, mapping two occurrences of 53711
and 2d(k −1)+m points, where m is the maximum number
to [53710 − 53711] and one occurrence to [53711 − 53712].
of copies of any distinct point.
Proposition 2 Every strict multidimensional partitioning We have some flexibility in choosing the dimension on
can be expressed as a relaxed multidimensional partition- which to partition. As long as we make an allowable cut
ing. However, if there are at least two tuples in table T when one exists, this choice does not affect the partition-
having the same vector of quasi-identifier values, there ex- size upper-bound. One heuristic, used in our implementa-
ists a relaxed multidimensional partitioning that cannot be tion, chooses the dimension with the widest (normalized)
expressed as a strict multidimensional partitioning. range of values [5]. Alternatively, it may be possible to
choose a dimension based on an anticipated workload.
Under the relaxed model, a partitioning is not necessar- The partitioning algorithm in Figure 6 is easily adapted
ily defined by binary cuts. Instead, a set of points is parti- for relaxed partitioning. Specifically, the points falling at
tioned by defining two (possibly overlapping) multidimen- the median (where t.dim = splitV al) are divided evenly
sional regions P1 and P2 , and then mapping each point to between lhs child and rhs child such that |lhs child| =
either P1 or P2 (but not both). In this case, the upper-bound |rhs child| (+1 when |partition| is odd). In this case,
on the size of a minimal partition (one that cannot be di- there is a 2k − 1 upper-bound on partition size.
vided without violating k-anonymity) is 2k − 1. Finally, a similar greedy multidimensional partitioning
strategy can be used for categorical attributes in the pres-
4. A Greedy Partitioning Algorithm ence of user-defined generalization hierarchies. However,
Using multidimensional partitioning, a k-anonymization is our quality upper-bounds do not hold in this case.
generated in two steps. In the first step, multidimensional
regions are defined that cover the domain space, and in the
4.1. Bounds on Quality
second step, recoding functions are constructed using sum- Using our upper bounds on partition size, it is easy to com-
mary statistics from each region. In the previous sections, pute bounds for the general-purpose metrics described in
we alluded to a recursive algorithm for the first step. In this Section 1.2 and totally-ordered attributes. By definition,
section we outline a simple scalable algorithm, reminiscent k-anonymity requires that every equivalence class contain
of those used to construct kd-trees [5], that can be adapted at least k records. For this reason, the optimal achievable
to either strict or relaxed partitioning. The second step is value of CDM (denoted CDM ∗) ≥ k ∗ total records, and
described in more detail in Section 5 CAV G ∗ ≥ 1.
The strict partitioning algorithm is shown in Figure 6. For strict multidimensional partitioning, assume that the
Each iteration must choose the dimension and value about points in each distinct partition are mapped to a unique
which to partition. In the literature about kd-trees, one vector of summary statistics. We showed that under the

Proceedings of the 22nd International Conference on Data Engineering (ICDE’06)


8-7695-2570-9/06 $20.00 © 2006 IEEE
Authorized licensed use limited to: National University of Singapore. Downloaded on March 05,2010 at 23:24:40 EST from IEEE Xplore. Restrictions apply.
greedy algorithm, for each equivalence class E, |E| ≤ cases, the publisher may want to consider an anticipated
2d(k − 1) + m, where m is the maximum number of copies workload, such as building a data mining model [6, 8, 16],
of any distinct quasi-identifier tuple. or answering a set of aggregate queries. This section intro-
CDM is maximized when the tuples are divided into the duces the latter problem, including examples where multi-
largest possible equivalence classes, so CDM ≤ (2d(k − dimensional recoding provides needed flexibility.
1) + m) ∗ total records. Thus, Consider a set of queries with selection predicates
(equality or range) of the form attribute <oper>
CDM 2d(k − 1) + m
≤ constant and an aggregate function (COUNT, SUM,
CDM ∗ k AVG, MIN, and MAX). Our ability to answer this type of
Similarly, CAV G is maximized when the tuples are queries from anonymized data depends on two main factors:
divided into the largest possible equivalence classes, so the type of summary statistic(s) released for each attribute,
CAV G ≤ (2d(k − 1) + m)/k. and the degree to which the selection predicates in the work-
load match the range boundaries in the anonymous data.
CAV G 2d(k − 1) + m
≤ The choice of summary statistics influences our ability
CAV G ∗ k to compute various aggregate functions.3 In this paper, we
Observe that for constant d and m, the strict greedy al- consider releasing two summary statistics for each attribute
gorithm is a constant-factor approximation, as measured by A and equivalence class E:
CDM and CAV G . If d varies, but m/k is constant, the ap- • Range statistic (R) So far, all of our examples have in-
proximation is O(d). volved a single summary statistic defined by the range
For relaxed multidimensional partitioning, the greedy of values for A appearing in E, which allows for easy
algorithm produces a 2-approximation because CDM ≤ computation of MIN and MAX aggregates.
2k ∗ total records, and CAV G ≤ 2.
• Mean Statistic (M) We also consider a summary sta-
4.2. Scalability tistic defined by the mean value of A appearing in E,
When the table T to be anonymized is larger than the avail- which allows for the computation of AVG and SUM.
able memory, the main scalability issue to be addressed is When choosing summary statistics, it is important to
finding the median value of a selected attribute within a consider potential avenues for inference. Notice that in
given partition. some cases simply releasing the minimum-maximum range
We propose a solution to this problem based on the idea allows for some inferences about the distribution of values
of a frequency set. The frequency set of attribute A for par- within an equivalence class. For example, consider an at-
tition P is the set of unique values of A in P , each paired tribute A, and let k = 2. Suppose that an equivalence class
with an integer indicating the number of times it appears in of the released anonymization contains two tuples, and A is
P . Given the frequency set of A for P , the median value is summarized by the range [0 − 1]. It is easy to infer that in
found using a standard median-finding algorithm. one of the original tuples A = 0, and in the other A = 1.
Because individual frequency sets contain just one entry This type of inference about distribution (which may also
per value in the domain of a particular attribute, and are arise in single-dimensional recoding) is not likely to pose
much smaller than the size of the data itself, it is reasonable a problem in preventing joining attacks because, without
to assume that a single frequency set will fit in memory. For background knowledge, it is still impossible for an adver-
this reason, in the worst case, we must sequentially scan sary to distinguish the tuples within an equivalence class
the database at most twice, and write once, per level of the from one another.
recursive partitioning “tree.” The data is first scanned once The second factor influencing our ability to answer ag-
to find the median, and then scanned and written once to gregate queries is the degree to which the selection predi-
re-partition the data into two “runs” (lhs and rhs) on disk. cates in the given workload “match” the boundaries of the
It is worth noting that this scheme may be further opti- range statistics in the released anonymization. In many
mized to take advantage of available memory because, in ways, this is analogous to matching indices and selection
practice, the frequency sets for multiple attributes may fit predicates in traditional query processing.
in memory. One approach would be similar to the scalable
algorithms used for decision tree construction in [7]. Predicate-Range Matching A selection predicate P red
conceptually divides the original table T into two sets of tu-
5. Workload-Driven Quality Measurement ples, TPTred and TPFred (those that satisfy the predicate and
The general-purpose quality metrics in Section 1.2 are a 3 Certain types of aggregate functions (e.g., MEDIAN) are ill-suited to
good place to start when the application that ultimately con- this type of computation. We do not know of any way to compute such
sumes the anonymized data is unknown. However, in some functions from these summary statistics.

Proceedings of the 22nd International Conference on Data Engineering (ICDE’06)


8-7695-2570-9/06 $20.00 © 2006 IEEE
Authorized licensed use limited to: National University of Singapore. Downloaded on March 05,2010 at 23:24:40 EST from IEEE Xplore. Restrictions apply.
those that do not). When range statistics are published, we Age(R) Age(M) Sex(R) Zipcode(R) Disease
say that an anonymization V matches a boolean predicate [25 − 26] 25.5 Male 53711 Flu
P red if every tuple t ∈ TPTred is mapped to an equivalence [25 − 27] 26 Female 53712 Hepatitis
[25 − 26] 25.5 Male 53711 Brochitis
class in V containing no tuples from TPFred .
[27 − 28] 27.5 Male [53710 − 53711] Broken Arm
To illustrate these ideas, consider a workload containing [25 − 27] 26 Female 53712 AIDS
two queries: [27 − 28] 27.5 Male [53710 − 53711] Hang Nail
SELECT COUNT(*)
SELECT AVG(Age) Figure 7. A 2-anonymization with multiple
FROM Patients
FROM Patients summary statistics
WHERE Sex = ‘Male’
WHERE Sex = ‘Male’
AND Age ≤ 26 Distribution (Discrete Uniform, Discrete Normal)
A strict multidimensional anonymization of Patients is Attributes Total quasi-identifier attributes
given in Figure 7, including two summary statistics (range Cardinality Distinct values per attribute
and mean) for Age. Notice that the mean allows us to an- Tuples Total tuples in table
swer the first query precisely and accurately. The second Std. Dev. (σ) With respect to standard normal (Normal only)
query can also be answered precisely because the predi- Mean (μ) (Normal only)
cate matches a single equivalence class in the anonymiza- Figure 8. Parameters of synthetic generator
tion. Comparing this with the single-dimensional recod-
ing shown in Figure 2, notice that it would be impossi- comparing these results with those produced by optimal al-
ble to answer the second query precisely using the single- gorithms for two other models: full-domain generalization
dimensional recoding. [9, 12], and single-dimensional partitioning [2, 8]. The spe-
When a workload consists of many queries, even a mul- cific algorithms used in the comparison (Incognito [9] and
tidimensional anonymization might not match every selec- K-Optimize [2]) were chosen for efficiency, but any exhaus-
tion predicate. An exhaustive discussion of query process- tive algorithm for these models would yield the same result.
ing over imprecise data is beyond the scope of this paper. It is important to note that the exhaustive algorithms are ex-
However, when no additional distribution information is ponential in the worst case, and they run many times slower
available, a simple approach assumes a uniform distribution than our greedy algorithm. Nonetheless, the quality of the
of values for each attribute within each equivalence class. results obtained by the latter is often superior.
The effects of multidimensional versus single-dimensional For these experiments, we used both synthetic and real-
recoding, with respect to a specific query workload, are ex- world data. We compared quality, using general-purpose
plored empirically in Section 6.3. metrics, and also with respect to a simple query workload.
Our work on workload-driven anonymization is prelimi-
nary, and in this paper, the workload is primarily used as an 6.1. Experimental Data
evaluation tool. One of the most important future directions For some experiments, we used a synthetic data generator,
is directly integrating knowledge of an anticipated work- which produced two discrete joint distributions: discrete
load into the anonymization algorithm. Formally, a query uniform and discrete normal. We limited the evaluation to
workload can be expressed as a set of (multdimensional re- discrete domains so that the exhaustive algorithms would
gion, aggregate, weight) triples, where the boundaries of be tractable without pre-generalizing the data. To generate
each region are determined by the selection predicates in the discrete normal distribution, we first generated the mul-
the workload. Each query is also assigned a weight indicat- tivariate normal distribution, and then discretized the values
ing its importance with respect to the rest of the workload. of each attribute into equal-width ranges. The parameters
When a selection predicate in the workload does not exactly are described in Figure 8.
match the boundaries of one or more equivalence classes, In addition to synthetic data, we also used the Adults
evaluating this query over the anonymized data will incur database from the UC Irvine Machine Learning Repository
some error. This error can be defined as the normalized [3], which contains census data, and has become a de facto
difference between the result of evaluating the query on the benchmark for k-anonymity. We configured this data set as
anonymous data, and the result on the original data. Intu- it was configured for the experiments reported in [2], using
itively, the task of a workload-driven algorithm is to gener- eight regular attributes, and removing tuples with missing
ate an anonymization that minimizes the weighted sum of values. The resulting database contained 30,162 records.
such errors. For the partitioning experiments, we imposed an intuitive
6. Experimental Evaluation ordering on each attribute, but unlike [2], we eliminated
all hierarchical constraints for both models. For the full-
Our experiments evaluated the quality of anonymizations domain experiments, we used the same generalization hier-
produced by our greedy multidimensional algorithm by archies that were used in [9].

Proceedings of the 22nd International Conference on Data Engineering (ICDE’06)


8-7695-2570-9/06 $20.00 © 2006 IEEE
Authorized licensed use limited to: National University of Singapore. Downloaded on March 05,2010 at 23:24:40 EST from IEEE Xplore. Restrictions apply.
1e+07 1e+07
Full-Domain Full-Domain
Single-Dimensional Single-Dimensional
Multidimensional Multidimensional
Discernability Penalty

Discernability Penalty
1e+06 1e+06

100000 100000

10000 10000
2 5 10 25 50 100 2 5 10 25 50 100
k k

(a) Uniform distribution (5 attributes) (b) Normal distribution (5 attributes, σ = .2)


1e+07 1e+07
Full-Domain Full-Domain
Single-Dimensional Single-Dimensional
Multidimensional Multidimensional
Discernability Penalty

Discernability Penalty
1e+06 1e+06

100000 100000
0.1 0.2 0.3 0.4 0.5 2 3 4 5 6 7 8
Standard Deviation Attributes

(c) Normal distribution (5 attributes, k = 10) (d) Normal distribution (k = 10, σ = .2)

Figure 9. Quality comparisons for synthetic data using the discernability metric
1e+09
6.2. General-Purpose Metrics Full-Domain
Single-Dimensional
Multidimensional
We first performed some experiments comparing quality us- 1e+08
Discernability Penalty

ing general-purpose metrics. Here we report results for the


discernability penalty (Section 1.2), but the comparisons are
1e+07
similar for the average equivalence class size.
The first experiment compared the three models for var-
ied k. We fixed the number of tuples at 10,000, the per- 1e+06

attribute cardinality at 8, and the number of attributes at 5.


For the full-domain generalization model, we constructed 100000
2 5 10 25 50 100 250 500 1000
generalization hierarchies using binary trees. The results k
for the uniform distribution are shown in Figure 9(a). Re- Figure 10. Quality comparison for Adults
sults for the discrete normal distribution (μ = 3.5, σ = .2) database using discernability metric
are given in Figure 9(b). We found that greedy multidimen-
sional partitioning produced “better” generalizations than The next experiment measured quality for varied quasi-
the other algorithms in both cases. However, the magnitude identifier size, with σ = .2 and k = 10. As the number
of this difference was much more pronounced for the non- of attributes increased, the observed discernability penalty
uniform distribution. decreased for each of the three models (Figure 9(d)). At
Following this observation, the second experiment com- first glance, this result is counter-intuitive. However, this
pared quality using the same three models, but varied the decrease is due to the sparsity of the original data, which
standard deviation (σ) of the synthetic data. (Small values contains fewer duplicate tuples as the number of attributes
of σ indicate a high degree of non-uniformity.) The number increases.
of attributes was again fixed at 5, and k was fixed at 10. The In addition to the synthetic data, we compared the three
results (Figure 9(c)) show that the difference in quality is algorithms using the Adults database (Figure 10). Again,
most pronounced for non-uniform distributions. we found that greedy multidimensional partitioning gener-

Proceedings of the 22nd International Conference on Data Engineering (ICDE’06)


8-7695-2570-9/06 $20.00 © 2006 IEEE
Authorized licensed use limited to: National University of Singapore. Downloaded on March 05,2010 at 23:24:40 EST from IEEE Xplore. Restrictions apply.
(k = 50) (k = 25) (k = 10)
50 50 50

40 40 40

30 30 30

20 20 20

10 10 10

0 0 0
0 10 20 30 40 50 0 10 20 30 40 50 0 10 20 30 40 50
(a) Optimal single-dimensional partitioning

(k = 50) (k = 25) (k = 10)


50 50 50

40 40 40

30 30 30

20 20 20

10 10 10

0 0 0
0 10 20 30 40 50 0 10 20 30 40 50 0 10 20 30 40 50
(b) Greedy strict multidimensional partitioning

Figure 11. Anonymizations for two attributes with a discrete normal distribution (μ = 25, σ = .2).

Predicate on X trast, we observed that for non-uniform data and small k,


k Model Mean Error Std. Dev.
single-dimensional partitioning tends to reflect the distrib-
10 Single 7.73 5.94
10 Multi 4.66 3.26
ution of just one attribute. However, the optimal single-
25 Single 12.68 7.17 dimensional anonymization is quite sensitive to the under-
25 Multi 5.69 3.86 lying data, and a small change to the synthetic data set often
50 Single 7.73 5.94 dramatically changes the resulting anonymization.
50 Multi 7.94 5.87 This tendency to “linearize” attributes has an impact on
Predicate on Y query processing over the anonymized data. Consider a
k Model Mean Error Std. Dev. simple workload for this two-attribute data set, consisting of
10 Single 3.18 2.56 queries of the form “SELECT COUNT(*) WHERE {X, Y }
10 Multi 4.03 3.44 = value”, where X and Y are the quasi-identifier attributes,
25 Single 5.06 4.17 and value is an integer between 0 and 49. (In Figures 11(a)
25 Multi 5.67 3.80
and 11(b), X and Y are displayed on the horizontal and ver-
50 Single 8.25 6.15
tical axes.) We evaluated the set of queries of this form over
50 Multi 8.06 5.58
each anonymization and the original data set. When a pred-
Figure 12. Error for count queries with single- icate did not match any partition, we assumed a uniform
attribute selection predicates distribution within each partition.
For each anonymization, we computed the mean and
ally produced the best results. standard deviation of the absolute error over the set of
queries in the workload. These results are presented in Fig-
6.3. Workload-Based Quality ure 12. As is apparent from Figures 11(a) and 11(b), and
We also compared the optimal single-dimensional and from the error measurements, queries with predicates on Y
greedy multidimensional partitioning algorithms with re- are more accurately answered from the single-dimensional
spect to a simple query workload, using a synthetic data anonymization than are queries with predicates on X. The
set containing 1000 tuples, with two quasi-identifier at- observed error is more consistent across queries using the
tributes (discrete normal, each with cardinality 50, μ = 25, multidimensional anonymization.
σ = .2). Visual representations of the resulting partition-
ings are given in Figures 11(b) and 11(a).
7. Related Work
Multidimensional partitioning does an excellent job at Many recoding models have been proposed in the litera-
capturing the underlying multivariate distribution. In con- ture for guaranteeing k-anonymity. The majority of these

Proceedings of the 22nd International Conference on Data Engineering (ICDE’06)


8-7695-2570-9/06 $20.00 © 2006 IEEE
Authorized licensed use limited to: National University of Singapore. Downloaded on March 05,2010 at 23:24:40 EST from IEEE Xplore. Restrictions apply.
models have involved user-defined value generalization hi- This work was supported by an IBM Ph.D. Fellowship
erarchies [6, 8, 9, 12, 14, 16]. Recently, partitioning models and NSF Grant IIS-0524671.
have been proposed to automatically produce generalization
hierarchies for totally-ordered domains [2, 8]. However, References
these recoding techniques have all been single-dimensional. [1] G. Aggarwal, T. Feder, K. Kenthapadi, R. Motwani, R. Pan-
In addition to global recoding models, simpler local re- igrahy, D. Thomas, and A. Zhu. Anonymizing tables. In
coding models have also been considered, including sup- ICDT, 2005.
pressing individual data cells. Several approximation algo- [2] R. Bayardo and R. Agrawal. Data privacy through optimal
rithms have been proposed for the problem of finding the k-anonymization. In ICDE, 2005.
[3] C. Blake and C. Merz. UCI repository of machine learning
k-anonymization that suppresses the fewest cells [1, 10].
databases, 1998.
In another related direction, Chawla et al. [4] propose a [4] S. Chawla, C. Dwork, F. McSherry, A. Smith, and H. Wee.
theoretical framework for privacy in data publishing based Toward privacy in public databases. In 2nd Theory of Cryp-
on private histograms. This work describes a recursive san- tography Conference, 2005.
itization algorithm in multidimensional space. However, in [5] J. Friedman, J. Bentley, and R. Finkel. An algorithm for
their problem formulation, minimum partition-occupancy is finding best matches in logarithmic time. ACM Trans. on
not considered to be an absolute constraint. Mathematical Software, 3(3), 1977.
Finally, several papers have considered evaluating the re- [6] B. Fung, K. Wang, and P. Yu. Top-down specialization for
sults of k-anonymization algorithms based on a particular information and privacy preservation. In ICDE, 2005.
[7] J. Gehrke, R. Ramakrishnan, and V. Ganti. RainForest:
data mining task, such as building a decision tree [6, 8, 16],
A framework for fast decision tree construction of large
but quality evaluation based on a query workload has not datasets. In VLDB, 1998.
previously been explored. [8] V. Iyengar. Transforming data to satisfy privacy constraints.
In ACM SIGKDD, 2002.
8. Conclusion and Future Work [9] K. LeFevre, D.DeWitt, and R. Ramakrishnan. Incognito:
In this paper, we introduced a multidimensional recoding Efficient full-domain k-anonymity. In ACM SIGMOD, 2005.
[10] A. Meyerson and R. Williams. On the complexity of optimal
model for k-anonymity. Although optimal multidimen-
k-anonymity. In PODS, 2004.
sional partitioning is NP-hard, we provide a simple and ef- [11] S. Muthakrishnan, V. Poosala, and T. Suel. On rectangu-
ficient greedy approximation algorithm for several general- lar partitionings in two dimensions: Algorithms, complex-
purpose quality metrics. An experimental evaluation indi- ity, and applications. In ICDT, 1998.
cates that often the results of this algorithm are actually bet- [12] P. Samarati. Protecting respondents’ identities in microdata
ter than those produced by more expensive optimal algo- release. IEEE Trans. on Knowledge and Data Engineering,
rithms using other recoding models. 13(6), 2001.
The second main contribution of this paper is a more tar- [13] P. Samarati and L. Sweeney. Protecting privacy when
geted notion of quality measurement, based on a workload disclosing information: k-anonymity and its enforcement
through generalization and suppression. Technical Report
of aggregate queries. The second part of our experimen-
SRI-CSL-98-04, SRI Computer Science Laboratory, 1998.
tal evaluation indicated that, for workloads involving predi- [14] L. Sweeney. Achieving k-anonymity privacy protection
cates on multiple attributes, the multidimensional recoding using generalization and suppression. Int’l Journal on
model often produces more desirable results. Uncertainty, Fuzziness, and Knowledge-based Systems,
There are a number of promising areas for future work. 10(5):571–588, 2002.
In particular, as mentioned in Section 5, we are consider- [15] L. Sweeney. K-anonymity: A model for protecting privacy.
ing ways of integrating an anticipated query workload di- Int’l Journal on Uncertainty, Fuzziness, and Knowledge-
rectly into the anonymization algorithms. Also, we suspect based Systems, 10(5):557–570, 2002.
that multidimensional recoding would lend itself to creat- [16] K. Wang, P. Yu, and S. Chakraborty. Bottom-up generaliza-
tion: A data mining solution to privacy protection. In ICDM,
ing anonymizations that are useful for building data mining
2004.
models since the partitioning pattern more faithfully reflects [17] L. Willenborg and T. deWaal. Elements of Statistical Dis-
the multivariate distribution of the original data. closure Control. Springer Verlag, 2000.
[18] W. Winkler. Using simulated annealing for k-anonymity.
9. Acknowledgements Research Report 2002-07, US Census Bureau Statistical Re-
Our thanks to Vladlen Koltun for pointing us toward the search Division, 2002.
median-partitioning strategy for kd-trees, to Roberto Ba-
yardo for providing the unconstrained single-dimensional
quality results for the Adults database, to Miron Livny for
his ideas about the synthetic data generator, and to three
anonymous reviewers for their helpful feedback.

Proceedings of the 22nd International Conference on Data Engineering (ICDE’06)


8-7695-2570-9/06 $20.00 © 2006 IEEE
Authorized licensed use limited to: National University of Singapore. Downloaded on March 05,2010 at 23:24:40 EST from IEEE Xplore. Restrictions apply.

Potrebbero piacerti anche