Documenti di Didattica
Documenti di Professioni
Documenti di Cultura
[VLDB ’17]
Problem definition and applications
Diversity maximization
Objective:
For a given dataset, determine the most diverse subset of given
(small) size k
⇒
⇒
Applications
← News/document aggregators
↑ e-commerce ↑
← Facility location
Diversity maximization: formal definition
Given:
1. Set S of points in a metric space ∆
2. Distance function d : ∆ × ∆ → R + ∪ {0}
3. (Distance-based) diversity function div : 2∆ → R + ∪ {0}
4. Integer k > 1
Return
MapReduce Streaming
I Multiple processors I Single processor
I Limited memory I Limited memory
Performance measures
I number of rounds/passes
I space required by the processors
Previous work
β αseq β · αseq
Remote-edge 3 2 6
Remote-clique 6+ 2 12 +
Remote-star 12 2 24
Remote-bipartition 18 3 54
Remote-tree 4 4 16
Remote-cycle 3 3 9
Our setting
Metric spaces of Bounded Doubling Dimension: ∃D = O(1) s.t.
any ball of radius r is covered by ≤ 2D balls of radius r /2
r r/2
I Euclidean spaces
I Shortest-path distances of mildly expanding topologies
Summary of results
Streaming MapReduce
p
r-edge/cycle O(k/D ) O k|S|/D
p
other div’s O(k 2 /D ) O k |S|/D
I Previous
p works obtained constant approximations 1 with
O k|S| space usage
Summary of results
Streaming MapReduce
p
all div’s O(k/D ) O k|S|/D
Our approach
Core-set construction: algorithm
Input dataset: S
k = 3, τ = 8
Core-set construction: algorithm
⇒
Core-set construction: algorithm
⇒
MapReduce implementation
p
I Two rounds with O(k |S|/D ) space
I Approximation guarantee: αseq +
Streaming implementation
Implementation
I Compute a (1 + )-core-set of size using a variant of the
8-approximate τ -center algorithm of [Charikar et al.’04]
I Run sequential approximation on the coreset
Performace
1 pass, O k 2 /D space (no dependence on |S|!)
I
Datasets:
I Synthetic data: Euclidean spaces
I Real data: musiXmatch dataset
I ≈ 250K songs.
I bag-of-words model, cosine distance
Platform:
I 16-node cluster (Intel i7)
I 18GB-RAM/256GB-SSD per node
I 10Gbps ethernet
I Apache Spark (open source code:
github.com/Cecca/diversity-maximization)
Experiments: effectiveness of approach
105
k
8
32
104
128
103
k 2k 4k 8k
τ
Experiments: scalability