19 June 2014
Journal of Biomedical Informatics xxx (2014) xxxxxx
Methodological Review
Article info
Article history:
Received 1 October 2013
Accepted 5 June 2014
Available online xxxx
Keywords:
Privacy
Electronic health records
Anonymization
Algorithms
Survey
Abstract
The dissemination of Electronic Health Records (EHRs) can be highly beneficial for a range of medical studies, spanning from clinical trials to epidemic control studies, but it must be performed in a way that preserves patients' privacy. This is not straightforward, because the disseminated data need to be protected against several privacy threats, while remaining useful for subsequent analysis tasks. In this work, we present a survey of algorithms that have been proposed for publishing structured patient data in a privacy-preserving way. We review more than 45 algorithms, derive insights on their operation, and highlight their advantages and disadvantages. We also provide a discussion of some promising directions for future research in this area.
© 2014 Published by Elsevier Inc.
1. Introduction
1.1. Motivation
Corresponding author.
E-mail addresses: arisdiva@ie.ibm.com, arisgd@gmail.com (A. Gkoulalas-Divanis).
http://dx.doi.org/10.1016/j.jbi.2014.06.002
1532-0464/© 2014 Published by Elsevier Inc.
Please cite this article in press as: Gkoulalas-Divanis A et al. Publishing data from electronic health records while preserving privacy: A survey of algorithms. J Biomed Inform (2014), http://dx.doi.org/10.1016/j.jbi.2014.06.002
YJBIN 2187
However, it is important to note that they do not provide any computational guarantees for thwarting identity disclosure, nor do they aim at preserving the usefulness of disseminated data in analytic tasks. To address re-identification, as well as other privacy threats, the computer science and health informatics communities have developed various techniques. Most of these techniques aim at publishing a dataset of patient records, while satisfying certain privacy and data usefulness objectives. Typically, privacy objectives are formulated using privacy models, and enforced by algorithms that transform a given dataset (to facilitate privacy protection) to the minimum necessary extent. The majority of the proposed algorithms are applicable to data containing demographics or diagnosis codes,1 focus on preventing the threats of identity, attribute, and/or membership disclosure (to be defined in subsequent sections), and operate by transforming the data using generalization and/or suppression techniques.
1.2. Contributions
1.2.1. Organization
The remainder of this work is organized as follows. Section 2 presents the privacy threats and the models that have been proposed for preventing them. Section 3 discusses the two scenarios for privacy-preserving data sharing. Section 4 surveys algorithms for publishing data in the non-interactive scenario. Section 5 discusses other classes of related techniques. Section 6 presents possible directions for future research, and Section 7 concludes the paper.
Identity disclosure (or re-identification) [112,128]: This is arguably the most notorious threat in publishing medical data. It occurs when an attacker can associate a patient with their record in a published dataset. For example, an attacker may re-identify Maria in Table 1, even if the table is published deprived of the direct identifiers (i.e., Name and Phone Number). This is because Maria is the only person in the table who was born on 17.01.1982 and also lives in zip code 55332.
Membership disclosure [100]: This threat occurs when an attacker can infer with high probability that an individual's record is contained in the published data. For example, consider a dataset which contains information on only HIV-positive patients. The fact that a patient's record is contained in the dataset allows inferring that the patient is HIV-positive, and thus poses a threat to privacy. Note that membership disclosure may occur even when the data are protected from identity disclosure, and that there are several real-world scenarios where protection against membership disclosure is required. Such interesting scenarios were discussed in detail in [100,101].
Attribute disclosure (or sensitive information disclosure) [88]: This threat occurs when an individual is associated with information about their sensitive attributes. This information can be, for example, the individual's value for the sensitive attribute (e.g., the value in DNA in Table 1), or a range of values which contains an individual's sensitive value (e.g., if the sensitive attribute is Hospitalization Cost, then knowledge that a patient's value in this attribute lies in a narrow range, say [5400, 5500], may be considered as sensitive, as it provides a near-accurate estimate of the actual cost incurred, which may be considered to be high, rare, etc.).
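The linkage reasoning behind identity disclosure can be illustrated with a short Python sketch over the toy records of Table 1; the field names (dob, zip, gender, dna) are invented for illustration:

```python
# A minimal sketch of an identity-disclosure (linkage) attack on the toy
# records of Table 1. The attacker holds background knowledge about a
# target's quasi-identifiers and matches it against the "de-identified"
# release, which lacks the direct identifiers (Name, Phone number).

published = [
    {"dob": "11.02.1980", "zip": "55432", "gender": "Male",   "dna": "AT...G"},
    {"dob": "17.01.1982", "zip": "55454", "gender": "Female", "dna": "CG...A"},
    {"dob": "17.01.1982", "zip": "55332", "gender": "Female", "dna": "TG...C"},
    {"dob": "10.07.1977", "zip": "55454", "gender": "Female", "dna": "AA...G"},
    {"dob": "15.04.1984", "zip": "55332", "gender": "Male",   "dna": "GC...T"},
]

def link(background):
    """Return all published records matching the attacker's background."""
    return [r for r in published
            if all(r[k] == v for k, v in background.items())]

# The attacker knows Maria's birth date and zip code (e.g. from a public list).
matches = link({"dob": "17.01.1982", "zip": "55332"})
print(len(matches))        # 1 -- a unique match: Maria is re-identified
print(matches[0]["dna"])   # TG...C -- her sensitive value is disclosed
```

Note that birth date alone would not suffice here (Johanna shares it); it is the combination of quasi-identifiers that isolates a single record.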
Table 1
Name and Phone number are direct identifiers; Date of birth, Zip code, and Gender are quasi-identifiers; DNA is the sensitive attribute.

Name             Phone number   Date of birth   Zip code   Gender   DNA
Tom Green        6152541261     11.02.1980      55432      Male     AT...G
Johanna Marer    6152532126     17.01.1982      55454      Female   CG...A
Maria Durhame    6151531562     17.01.1982      55332      Female   TG...C
Helen Tulid      6153553230     10.07.1977      55454      Female   AA...G
Tim Lee          6155837612     15.04.1984      55332      Male     GC...T
In this section, we present some well-established privacy models that guard against the aforementioned threats. These privacy models: (i) model what leads to one or more privacy threats and (ii) describe a computational strategy to enforce protection against the threat. The privacy models are subsequently categorized according to the privacy threats they protect against, as also presented in Table 2.
Table 2
Privacy models to guard against different attacks.

Attack type             Demographics                       Diagnosis codes
Identity disclosure     k-anonymity [112]                  Complete k-anonymity [52]
                        k-map [34]                         k^m-anonymity [115]
                        (1,k)-anonymity [45]               Privacy-constrained anonymity [76]
                        (k,1)-anonymity [45]
                        (k,k)-anonymity [45]
Membership disclosure   δ-presence [100]
                        c-confident δ-presence [103]
Attribute disclosure    ℓ-diversity [88,69]                ρ-uncertainty [16]
                        (α,k)-anonymity [126]              (h,k,p)-coherence [130]
                        p-sensitive k-anonymity [118]      PS-rule based anonymity [80]
                        t-closeness [69]
                        Range-based [81,60]
                        Variance-based [64]
                        Worst group protection [84]
Another privacy model that has been proposed for demographics is k-map [113]. This model is similar to k-anonymity but considers that the linking is performed based on larger datasets (called population tables), from which the published dataset has been derived. Thus, k-map is less restrictive than k-anonymity, typically allowing the publishing of more detailed patient information, which helps data utility preservation. On the negative side, however, the k-map privacy model is weaker (in terms of offered privacy protection) than k-anonymity, because it assumes that: (i) attackers do not know whether a record is included in the published dataset and (ii) data publishers have access to the population table.
El Emam et al. [34] provide a discussion of the k-anonymity and k-map models and propose risk-based measures, which approximate k-map and are more applicable in certain re-identification scenarios. Three privacy models, called (1,k)-anonymity, (k,1)-anonymity, and (k,k)-anonymity, which follow a similar concept to k-map and are relaxations of k-anonymity, have been proposed by Gionis et al. [45]. These models differ in their assumptions about the capabilities of attackers and can offer higher data utility but weaker privacy than k-anonymity.
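As a minimal illustration of the baseline notion these relaxations start from, a k-anonymity check simply verifies that every combination of quasi-identifier values that appears in the data appears at least k times; the generalized values below are invented for illustration:

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """A dataset is k-anonymous if every quasi-identifier value
    combination that appears, appears in at least k records."""
    groups = Counter(tuple(r[a] for a in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())

# Toy records after generalizing date of birth to a decade and zip to a prefix.
rows = [
    {"dob": "1980-1989", "zip": "554**"},
    {"dob": "1980-1989", "zip": "554**"},
    {"dob": "1970-1979", "zip": "553**"},
    {"dob": "1970-1979", "zip": "553**"},
]
print(is_k_anonymous(rows, ["dob", "zip"], 2))  # True
print(is_k_anonymous(rows, ["dob", "zip"], 3))  # False
```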
for diagnosis codes. In what follows, we describe some representative privacy models from each group.
identity and sensitive information disclosure. Similarly to association rules [7], PS-rules consist of two sets of diagnosis codes, the antecedent and the consequent, which contain diagnosis codes that may be used in identity and sensitive information disclosure attacks, respectively. Given a PS-rule A → B, where A and B are the antecedent and consequent of the rule, respectively, PS-rule based anonymity requires that the set of diagnosis codes in A appears in at least k records of the published dataset, while at most c·100% of the records that contain the diagnosis codes in A also contain the diagnosis codes in B. Thus, it protects against attackers who know whether a patient's record is contained in the published dataset. The parameter c is specified by data publishers, takes values between 0 and 1, and is analogous to the confidence threshold in association rule mining [7]. The PS-rule based anonymity model offers three significant benefits compared to the previously discussed models for diagnosis codes: (i) it protects against both identity and sensitive information disclosure, (ii) it allows data publishers to specify detailed privacy requirements and (iii) it is more general than these models (i.e., the models in [115,130,52] are special cases of PS-rule based anonymity).
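The two conditions of a PS-rule can be sketched directly: support of the antecedent at least k, and confidence of the rule at most c. The diagnosis codes and thresholds below are invented for illustration, not taken from any evaluated dataset:

```python
def satisfies_ps_rule(dataset, antecedent, consequent, k, c):
    """Check one PS-rule A -> B on a set-valued dataset:
    support(A) >= k  and  support(A ∪ B) / support(A) <= c."""
    sup_a = sum(1 for rec in dataset if antecedent <= rec)
    if sup_a < k:                       # identity-disclosure condition fails
        return False
    sup_ab = sum(1 for rec in dataset if (antecedent | consequent) <= rec)
    return sup_ab / sup_a <= c          # sensitive-disclosure condition

# Toy set-valued records of (hypothetical) ICD-9-style diagnosis codes.
records = [
    {"401.0", "250.0"}, {"401.0", "250.0"}, {"401.0", "296.2"},
    {"401.0"}, {"493.9"},
]
# A = {401.0} (potentially known), B = {296.2} (sensitive), k = 3, c = 0.5:
# A appears in 4 >= 3 records, and 1/4 = 0.25 <= 0.5, so the rule holds.
print(satisfies_ps_rule(records, {"401.0"}, {"296.2"}, 3, 0.5))  # True
```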
3. Privacy scenarios
4. Privacy techniques
[Figure: the interactive scenario (researchers send a data request to a protected data repository and receive a privacy-aware result) versus the non-interactive scenario (data owners anonymize the original data and disseminate the released data), with the pros and cons of each.]
Table 3
Algorithms to prevent different attacks.

Attack type            Demographics                       Diagnosis codes
Identity disclosure    k-minimal generalization [109]     UGACLIP [76]
                       OLA [32]                           CBA [87]
                       Incognito [65]                     UAR [86]
                       Genetic [58]                       Apriori [115]
                       Mondrian [66,67]                   LRA [116]
                       TDS [40]                           VPA [116]
                       NNG [28]                           mHgHs [74]
                       Greedy [129]                       Recursive partition [52]
                       k-Member [15]
                       KACA [68]
                       Agglomerative [45]
                       (k,k)-anonymizer [45]
                       Hilb [44]
                       iDist [44]
                       MDAV [25]
                       CBFS [62]
Attribute disclosure                                      Greedy [130]
                                                          SuppressControl [16]
                                                          TDControl [16]
                                                          RBAT [79]
                                                          Tree-based [80]
                                                          Sample-based [80]
are interpreted as any (non-empty) subset of diagnosis codes contained in the set. Last, suppression involves the deletion of specific QID values from the data.
Although each technique has its benefits, generalization is typically preferred over microaggregation and suppression. This is because microaggregation may harm data truthfulness (i.e., the centroid may not appear in the data), while suppression incurs high information loss. Interestingly, there are techniques that employ more than one of these operations. For example, the work of [76] employs suppression when it is not possible to apply generalization while satisfying some utility requirements.
et al. [85] proposed a model for expressing data utility requirements and an algorithm for anonymizing data based on this model. Utility requirements can be expressed at an attribute level (e.g., imposing the length of range, or the size of set, that anonymized groups may have in a given quasi-identifier attribute), or at a value level (e.g., imposing ranges or sets allowed for specified values). The approach of [85] can be applied to patient demographics but not to diagnosis codes.
Anonymizing diagnosis codes in a way that satisfies data utility requirements has been considered in [76]. The proposed approach models data utility requirements using sets of diagnosis codes, referred to as utility constraints. A utility constraint represents the ways the codes contained in it can be generalized in order to preserve data utility. Thus, utility constraints specify the information that the anonymized data should retain in order to be useful in intended medical analysis tasks. For example, assume that the disseminated data must support a study that requires counting the number of patients with Diabetes. To achieve this, a utility constraint for Diabetes, which is comprised of all different types of diabetes, must be specified. By anonymizing data according to this utility constraint, we can ensure that the number of patients with Diabetes in the anonymized data will be the same as in the original data. Thus, the anonymized dataset will be as useful as the original one, for the medical study on diabetes.
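The diabetes example can be sketched as follows, assuming hypothetical ICD-9-style codes for the utility constraint; generalizing diabetes codes only up to the constraint's top term leaves the diabetes patient count exact:

```python
# Utility constraint: the set of (assumed, illustrative) diabetes codes.
DIABETES = {"250.00", "250.01", "250.02"}

def anonymize(record):
    """Generalize every diabetes code to the constraint term 'Diabetes'."""
    return {"Diabetes" if code in DIABETES else code for code in record}

original = [{"250.00", "401.0"}, {"250.02"}, {"493.9"}]
anonymized = [anonymize(r) for r in original]

# The study counts patients with any diabetes diagnosis.
before = sum(1 for r in original if r & DIABETES)
after = sum(1 for r in anonymized if "Diabetes" in r)
print(before == after)  # True: the count the study needs is preserved
```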
4.1.4.1. Algorithms for demographics. Table 4 presents a classification of algorithms for demographics. As can be seen, these algorithms employ various data transformation and heuristic strategies, and aim at satisfying different utility objectives. All algorithms adopt k-anonymity, with the exception of (k,k)-anonymizer [45], which adopts the (k,k)-anonymity model discussed in Section 2.2. The fact that (k,k)-anonymity is a relaxation of k-anonymity allows the algorithm in [45] to preserve more data utility than the Agglomerative algorithm, which is also proposed in [45]. Furthermore, most algorithms use generalization to anonymize data, except (i) the algorithms in [109,65], which use suppression in addition to generalization in order to deal with a
Table 4
Algorithms for preventing identity disclosure based on demographics. [Columns: Algorithm, Privacy model, Transformation, Utility objective, Heuristic strategy. Only the privacy-model column is recoverable here: every listed algorithm adopts k-anonymity, except (k,k)-anonymizer, which adopts (k,k)-anonymity.]
typically small number of values that would incur excessive information loss if generalized and (ii) the algorithms in [25,62], which use microaggregation.
In addition, it can be observed that the majority of algorithms aim at minimizing information loss and that no algorithm takes into account specific utility requirements, such as limiting the set of allowable ways for generalizing a value in a quasi-identifier. At the same time, the Genetic [58], Infogain Mondrian [67], and TDS [40] algorithms aim at releasing data in a way that allows for building accurate classifiers. These algorithms were compared in terms of how well they can support classification tasks, using publicly available demographic datasets [53,119]. The results are reported in [67,40], and demonstrate that Infogain Mondrian outperforms TDS which, in turn, outperforms the Genetic algorithm. The LSD Mondrian [67] algorithm is similar to Infogain Mondrian but uses a different utility objective measure, as its goal is to preserve the ability of using the released data for linear regression.
It is also interesting to observe that several algorithms implement data partitioning heuristic strategies. Specifically, the algorithms proposed in [66,67] follow a top-down partitioning strategy inspired by kd-trees [38], while the TDS algorithm [40] employs a different strategy that takes into account the partition size and data utility in terms of classification accuracy. Interestingly, the partitioning strategy of NNG [28] is based on the distance of values and allows creating k-anonymous datasets whose utility is no more than 6q times worse than that of the optimal solution, where q is the number of quasi-identifiers in the dataset. On the other hand, the algorithms that employ clustering [129,15,98,45] follow a similar greedy, bottom-up procedure, which aims at building clusters of at least k records by iteratively merging together smaller clusters (of one or more records), in a way that helps data utility preservation. A detailed discussion and evaluation of clustering-based algorithms that employ generalization has been reported in [82], while the authors of [25,26] provide a rigorous analysis of clustering-based algorithms for microaggregation.
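The greedy, bottom-up clustering procedure can be sketched on a single numeric quasi-identifier; this is a toy illustration of the general idea, not any specific published algorithm:

```python
def greedy_cluster(points, k):
    """Grow each cluster by absorbing the record nearest to the cluster's
    current centroid until the cluster holds at least k members."""
    remaining = sorted(points)
    clusters = []
    while remaining:
        cluster = [remaining.pop(0)]            # seed a new cluster
        while len(cluster) < k and remaining:
            centroid = sum(cluster) / len(cluster)
            nearest = min(remaining, key=lambda p: abs(p - centroid))
            remaining.remove(nearest)
            cluster.append(nearest)
        clusters.append(cluster)
    # A trailing cluster smaller than k is merged into its neighbor.
    if len(clusters) > 1 and len(clusters[-1]) < k:
        clusters[-2].extend(clusters.pop())
    return clusters

ages = [21, 23, 25, 40, 41, 44, 70]
for cluster in greedy_cluster(ages, 3):
    print(cluster)  # every emitted cluster has at least 3 members
```

Each resulting cluster would then be generalized (e.g., to an age range) or microaggregated (replaced by its centroid) to enforce k-anonymity.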
The use of space mapping techniques in the algorithms iHilb and iDist, both of which were proposed in [44], enables them to preserve data utility equally well or even better than the Mondrian algorithm [66] and to anonymize data more efficiently. To map the space of quasi-identifiers, iHilb uses the Hilbert curve, which can preserve the locality of points (i.e., values in quasi-identifiers) fairly well [97]. The intuition behind using this curve is that, with high probability, two records with similar values in quasi-identifiers will also be similar with respect to the rank that is produced for them based on the curve. The iDist algorithm employs iDistance [131], a technique that measures similarity based on sampling and clustering of points, and is shown to be slightly inferior to iHilb in terms of data utility. Last, the algorithms in [109,65,32] use lattice-search strategies. An experimental evaluation using a publicly available dataset containing demographics [53], as well as 5 hospital discharge summaries, shows that the OLA algorithm [32]
Table 5
Algorithms for preventing identity disclosure based on diagnosis codes. [Columns: Algorithm, Privacy model, Transformation, Utility objective, Heuristic strategy; only partially recoverable.]

Algorithm                  Privacy model                   Utility objective
UGACLIP [76]               Privacy-constrained anonymity   Utility requirements
CBA [87]                   Privacy-constrained anonymity   Utility requirements
UAR [86]                   Privacy-constrained anonymity   Utility requirements
Apriori [115]              k^m-anonymity                   Min. inf. loss
LRA [116]                  k^m-anonymity                   Min. inf. loss
VPA [116]                  k^m-anonymity
mHgHs [74]                 k^m-anonymity
Recursive partition [52]   Complete k-anonymity
Table 6
Algorithms for preventing membership disclosure.

Algorithm     Data type      Privacy model            Transformation
SPALM [100]   Demographics   δ-presence               Generalization
MPALM [100]   Demographics   δ-presence               Generalization
SFALM [101]   Demographics   c-confident δ-presence   Generalization
Table 7
Algorithms for preventing attribute disclosure based on demographics. [Columns: Algorithm, Privacy model, Transformation, Approach, Heuristic strategy; only partially recoverable. The named algorithms are Incognito with ℓ-diversity [88], Incognito with t-closeness [69], Incognito with (α,k)-anonymity [126], and p-Sens k-anon [118], which use generalization and suppression (or generalization) under a protection-constrained approach with an Apriori-like lattice search. The remaining rows cover the models ℓ-diversity, t-closeness, (α,k)-anonymity, p-sensitive k-anonymity, and tuple diversity; transformations include generalization, generalization and suppression, and bucketization (quasi-identifiers released intact); approaches are protection constrained, with one trade-off constrained entry; heuristic strategies include data partitioning, data clustering, and space mapping.]
Table 8
Algorithms for preventing attribute disclosure based on diagnosis codes.

Algorithm              Privacy model             Transformation                   Approach                 Heuristic strategy
Greedy [130]           (h,k,p)-coherence         Suppression                      Protection constrained   Greedy search
SuppressControl [16]   ρ-uncertainty             Suppression                      Protection constrained   Greedy search
TDControl [16]         ρ-uncertainty             Generalization and suppression   Protection constrained   Top-down space partitioning
RBAT [79]              PS-rule based anonymity   Generalization                   Protection constrained   Top-down space partitioning
Tree-based [80]        PS-rule based anonymity   Generalization                   Protection constrained   Top-down space partitioning
Sample-based [80]      PS-rule based anonymity   Generalization                   Protection constrained   Top-down and bottom-up space partitioning
the records that have the same value in the sensitive attribute and then constructing groups with at least l different values in the sensitive attribute. The construction of groups is performed by selecting records from appropriate buckets, in a round-robin fashion. Interestingly, the Anatomize algorithm requires an amount of memory that is linear in the number of distinct values of the sensitive attribute and creates anonymized data with bounded reconstruction error, which quantifies how well correlations among values in quasi-identifiers and the sensitive attribute are preserved. In fact, the authors of [127] demonstrated experimentally that the Anatomize algorithm outperforms an adaptation of the Mondrian algorithm that enforces l-diversity, in terms of preserving data utility. Moreover, it is worth noting that the algorithms in [118,126,81], which employ p-sensitive k-anonymity, (α,k)-anonymity, or tuple diversity, are applied to both quasi-identifiers and sensitive attributes and provide protection from identity and attribute disclosure together. On the other hand, the Anatomize algorithm does not provide protection guarantees against identity disclosure, as all values in quasi-identifiers are released intact.
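The bucketization idea can be sketched as follows: records are bucketed by sensitive value and groups are filled round-robin from the l currently largest buckets, so each group holds l distinct sensitive values. This is a toy sketch in the spirit of Anatomize, not the exact algorithm of [127]:

```python
from collections import defaultdict

def anatomize(records, sensitive, l):
    """Group records so that each group has l distinct sensitive values."""
    buckets = defaultdict(list)
    for r in records:
        buckets[r[sensitive]].append(r)
    groups = []
    while True:
        nonempty = sorted((v for v in buckets if buckets[v]),
                          key=lambda v: len(buckets[v]), reverse=True)
        if len(nonempty) < l:
            break  # leftover records would be suppressed or handled separately
        # draw one record from each of the l largest buckets
        groups.append([buckets[v].pop() for v in nonempty[:l]])
    return groups

data = [{"zip": z, "disease": d} for z, d in
        [("55432", "flu"), ("55454", "flu"), ("55332", "hiv"),
         ("55454", "asthma"), ("55332", "asthma"), ("55432", "hiv")]]
groups = anatomize(data, "disease", l=2)
print(all(len({r["disease"] for r in g}) >= 2 for g in groups))  # True
```

Drawing from the largest buckets first keeps the bucket sizes balanced, which is what lets this round-robin strategy assign (almost) all records.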
4.2.1.2. Algorithms for diagnosis codes. Algorithms for anonymizing diagnosis codes against attribute disclosure are summarized in Table 8. As can be seen, the algorithms adopt different privacy models, namely (h,k,p)-coherence, ρ-uncertainty, or PS-rule based anonymity, and they use suppression, generalization, or a combination of suppression and generalization. Specifically, the authors in [16] propose an algorithm, called TDControl, which applies suppression when generalization alone cannot enforce ρ-uncertainty,
5. Relevant techniques
This section provides a discussion of privacy-preserving techniques that are relevant, but not directly related, to those surveyed
in this paper. These techniques are applied to different types of
medical data, or aim at privately releasing aggregate information
about the data.
Pr[K(D1) ∈ S] ≤ e^ε · Pr[K(D2) ∈ S]
sets of diagnosis codes. Interestingly, both techniques apply partitioning strategies similar in principle to those used by the TDS algorithm [40] and the Recursive partition algorithm [52], respectively. In addition, systems that allow the differentially private release of aggregate information from electronic health records are emerging. For instance, SHARE [42] is a recently proposed system for releasing multidimensional histograms and longitudinal patterns.
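Such differentially private releases rest on noise-adding mechanisms; a minimal sketch of the Laplace mechanism for a single count query (toy data and names, not the SHARE system) is the following:

```python
import math
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) noise by inverse transform sampling."""
    u = rng.random() - 0.5                       # uniform on [-0.5, 0.5)
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_count(records, predicate, epsilon, rng):
    """A count query has sensitivity 1, so adding Laplace(1/epsilon)
    noise yields epsilon-differential privacy for this single query."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)

diagnoses = ["hiv", "flu", "hiv", "asthma"]
noisy = private_count(diagnoses, lambda d: d == "hiv",
                      epsilon=1.0, rng=random.Random(0))
print(-20 < noisy < 20)  # True: the noisy answer stays near the true count of 2
```

Smaller ε means a larger noise scale and stronger privacy; answering many queries would require splitting the privacy budget across them.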
7. Conclusions
Acknowledgements
References
[1]
[2]
[3]
[4]
[5]
[6]
[7]
[8]
[9]
[10]
[11]
[12]
[13] Bhagwan V, Grandison T, Maltzahn C. Recommendation-based de-identification: a practical systems approach towards de-identification of unstructured text in healthcare. In: Proceedings of the 2012 IEEE eighth world congress on services, SERVICES '12; 2012. p. 155–62.
[14] Burges CJC. A tutorial on support vector machines for pattern recognition. Data Min Knowl Discov 1998;2(2):121–67.
[15] Byun J, Kamra A, Bertino E, Li N. Efficient k-anonymization using clustering techniques. In: DASFAA; 2007. p. 188–200.
[16] Cao J, Karras P, Raïssi C, Tan K. rho-uncertainty: inference-proof transaction anonymization. PVLDB 2010;3(1):1033–44.
[17] Cassa CA, Schmidt B, Kohane IS, Mandl KD. My sister's keeper? Genomic research and the identifiability of siblings. BMC Med Genom 2008;1:32.
[18] Chakaravarthy VT, Gupta H, Roy P, Mohania MK. Efficient techniques for document sanitization. In: Proceedings of the 17th ACM conference on information and knowledge management; 2008. p. 843–52.
[19] Chen B, Ramakrishnan R, LeFevre K. Privacy skyline: privacy with multidimensional adversarial knowledge. In: VLDB; 2007. p. 770–81.
[20] Chen R, Mohammed N, Fung BCM, Desai BC, Xiong L. Publishing set-valued data via differential privacy. PVLDB 2011;4(11):1087–98.
[21] Cormode G. Personal privacy vs population privacy: learning to attack anonymization. In: KDD; 2011. p. 1253–61.
[22] Dean BB, Lam J, Natoli JL, Butler Q, Aguilar D, Nordyke RJ. Use of electronic medical records for health outcomes research: a literature review. Med Care Res Rev 2010;66(6):611–38.
[23] De Capitani di Vimercati S, Foresti S, Livraga G, Samarati P. Data privacy: definitions and techniques. Int J Uncertainty, Fuzz Knowl-Based Syst 2012;20(6):793–818.
[24] Domingo-Ferrer J, Mateo-Sanz JM. Practical data-oriented microaggregation for statistical disclosure control. IEEE Trans Knowl Data Eng 2002;14(1):189–201.
[25] Domingo-Ferrer J, Torra V. Ordinal, continuous and heterogeneous k-anonymity through microaggregation. Data Min Knowl Discov 2005;11(2):195–212.
[26] Domingo-Ferrer J, Martínez-Ballesté A, Mateo-Sanz J, Sebé F. Efficient multivariate data-oriented microaggregation. VLDB J 2006;15(4):355–69.
[27] Du W, Teng Z, Zhu Z. Privacy-maxent: integrating background knowledge in privacy quantification. In: SIGMOD; 2008. p. 459–72.
[28] Du Y, Xia T, Tao Y, Zhang D, Zhu F. On multidimensional k-anonymity with local recoding generalization. In: ICDE '07; 2007. p. 1422–4.
[29] Dwork C. Differential privacy. In: ICALP; 2006. p. 1–12.
[30] Dwork C. Differential privacy: a survey of results. In: TAMC; 2008. p. 1–19.
[31] Dwork C, Kenthapadi K, McSherry F, Mironov I, Naor M. Our data, ourselves: privacy via distributed noise generation. In: Proceedings of the 24th annual international conference on the theory and applications of cryptographic techniques, EUROCRYPT '06; 2006. p. 486–503.
[32] El Emam K, Dankar F, Issa R, Jonker E, et al. A globally optimal k-anonymity method for the de-identification of health data. J Am Med Informat Assoc 2009;16(5):670–82.
[33] El Emam K, Jonker E, Arbuckle L, Malin B. A systematic review of re-identification attacks on health data. PLoS ONE 2011;6(12):e28071.
[34] El Emam K, Dankar FK. Protecting privacy using k-anonymity. J Am Med Informat Assoc 2008;15(5):627–37.
[35] El Emam K, Buckeridge D, Tamblyn R, Neisa A, Jonker E, Verma A. The re-identification risk of Canadians from longitudinal demographics. BMC Med Informat Dec Mak 2011;11(46).
[36] Fernández-Alemán J, Señor I, Oliver Lozoya PA, Toval A. Security and privacy in electronic health records: a systematic literature review. J Biomed Informat 2013;46(3):541–62.
[37] Fienberg SE, Slavkovic A, Uhler C. Privacy preserving GWAS data sharing. In: IEEE ICDM workshops; 2011. p. 628–35.
[38] Filho YV. Optimal choice of discriminators in a balanced k-d binary search tree. Inf Process Lett 1981;13(2):67–70.
[39] Fung BCM, Wang K, Chen R, Yu PS. Privacy-preserving data publishing: a survey on recent developments. ACM Comput Surv 2010;42.
[40] Fung BCM, Wang K, Yu PS. Top-down specialization for information and privacy preservation. In: ICDE; 2005. p. 205–16.
[41] Fung BCM, Wang K, Wang L, Hung PCK. Privacy-preserving data publishing for cluster analysis. Data Knowl Eng 2009;68(6):552–75.
[42] Gardner JJ, Xiong L, Xiao Y, Gao J, Post AR, Jiang X, et al. Share: system design and case studies for statistical health information release. JAMIA 2013;20(1):109–16.
[43] Gertz M, Jajodia S, editors. Handbook of database security: applications and trends. Springer; 2008.
[44] Ghinita G, Karras P, Kalnis P, Mamoulis N. Fast data anonymization with low information loss. In: Proceedings of the 33rd international conference on very large data bases, VLDB '07; 2007. p. 758–69.
[45] Gionis A, Mazza A, Tassa T. k-Anonymization revisited. In: ICDE; 2008. p. 744–53.
[46] Gkoulalas-Divanis A, Loukides G. PCTA: privacy-constrained clustering-based transaction data anonymization. In: EDBT PAIS; 2011. p. 5.
[47] Gkoulalas-Divanis A, Loukides G. Revisiting sequential pattern hiding to enhance utility. In: KDD; 2011. p. 1316–24.
[48] Gkoulalas-Divanis A, Verykios VS. Hiding sensitive knowledge without side effects. Knowl Inf Syst 2009;20(3):263–99.
[49] Guha S, Rastogi R, Shim K. CURE: an efficient clustering algorithm for large databases. In: SIGMOD; 1998. p. 73–84.
[122] Van Rijsbergen CJ. Information retrieval. 2nd ed. Newton, MA,
USA: Butterworth-Heinemann; 1979.
[123] Wang R, Li YF, Wang X, Tang H, Zhou X. Learning your identity and disease from research papers: information leaks in genome wide association study. In: CCS; 2009. p. 534–44.
[124] Wang R, Wang X, Li Z, Tang H, Reiter MK, Dong Z. Privacy-preserving genomic computation through program specialization. In: Proceedings of the 16th ACM conference on computer and communications security, CCS '09; 2009. p. 338–47.
[125] Wong RC, Fu A, Wang K, Pei J. Minimality attack in privacy preserving data publishing. In: VLDB; 2007. p. 543–54.