Matteo Scalperf 2017

MapReduce and Streaming Algorithms for
Diversity Maximization in Metric Spaces of

Bounded Doubling Dimension
Matteo Ceccarello
University of Padova
Joint work with:

A. Pietracaprina, G. Pucci (U. Padova), and E. Upfal (Brown U.)
[VLDB ’17]
Problem definition and applications
Diversity maximization
Objective:
For a given dataset, determine the most diverse subset of given
(small) size k
⇒
⇒
Applications
← News/document aggregators
↑ e-commerce ↑
← Facility location
Diversity maximization: formal definition
Given:
1. Set S of points in a metric space ∆
2. Distance function d : ∆ × ∆ → R + ∪ {0}
3. (Distance-based) diversity function div : 2∆ → R + ∪ {0}
4. Integer k > 1
Return
S ∗ ⊂ S, |S ∗ | = k s.t. S ∗ = argmaxS 0 ⊆S,|S 0 |=k div(S 0 )

Diversity measures studied in this work
remote-edge remote-clique remote-star
remote-bipartition remote-tree remote-cycle
All measures are NP-Hard to optimize

Background
MapReduce and Streaming frameworks
MapReduce Streaming
I Multiple processors I Single processor
I Limited memory I Limited memory
Performance measures
I number of rounds/passes
I space required by the processors
Previous work
β-core-set [Agarwal et al.’95]

I A small subset T (core-set) of input S s.t.
divk (T ) ≥ (1/β) divk (S)
I Compute final solution on T .
β-composable core-set [Indyk et al.’14]

I Partitioned input S = S1 ∪ S2 ∪ · · · ∪ S`
S
I β-composable core-sets Ti ⊂ Si ⇒ Ti is a β-core-set
Previous work
Known β-composable core-sets for diversity maximization

([Indyk et al.’14,Aghamolaei et al.’15])
β αseq β · αseq
Remote-edge 3 2 6
Remote-clique 6+ 2 12 +
Remote-star 12 2 24
Remote-bipartition 18 3 54
Remote-tree 4 4 16
Remote-cycle 3 3 9
I αseq = best sequential approximation ratio

I Core-set size: k
Summary of Results
Summary of results
Our setting
Metric spaces of Bounded Doubling Dimension: ∃D = O(1) s.t.
any ball of radius r is covered by ≤ 2D balls of radius r /2
r r/2
I Euclidean spaces
I Shortest-path distances of mildly expanding topologies
Summary of results
Our Results (all div’s):

I Improved β-(composable) core-sets: β = 1 +
I Overall approximation: αseq +
I 1-pass Streaming and 2-round MapReduce algorithms using
space:
Streaming MapReduce
p
r-edge/cycle O(k/D ) O k|S|/D
p
other div’s O(k 2 /D ) O k |S|/D
I Previous
p works obtained constant approximations 1 with
O k|S| space usage
Summary of results
Our Results (all div’s):

I Improved β-(composable) core-sets: β = 1 +
I Overall approximation: αseq +
I One extra pass/round brings space bounds for other div0 s
down to those for r-edge/cycle
Streaming MapReduce
p
all div’s O(k/D ) O k|S|/D
Our approach
Core-set construction: algorithm
Input dataset: S
Optimal solution O ⊂ S, with |O| = k
MAIN IDEA: Compute core-set T such that each o ∈ O has a

close (distinct) proxy p(o) ∈ T
1. Partition S into τ ≥ k clusters of small radius

I Use τ -center clustering, which minimizes the radius
2. Suitably select points from each cluster to build the coreset
k = 3, τ = 8
Compute τ -center clustering

remote-edge and remote-cycle

T = {cluster centers} (|T | = τ )
⇒
Other diversity measures

T = {k points per cluster} (|T | ≤ kτ )
⇒
MapReduce implementation
p
I Two rounds with O(k |S|/D ) space
I Approximation guarantee: αseq +
Streaming implementation
Implementation
I Compute a (1 + )-core-set of size using a variant of the
8-approximate τ -center algorithm of [Charikar et al.’04]
I Run sequential approximation on the coreset
Performace
1 pass, O k 2 /D space (no dependence on |S|!)

I
I Approximation guarantee: αseq +

Experiments
Experiments: setup
Datasets:
I Synthetic data: Euclidean spaces
I Real data: musiXmatch dataset
I ≈ 250K songs.
I bag-of-words model, cosine distance
Platform:
I 16-node cluster (Intel i7)
I 18GB-RAM/256GB-SSD per node
I 10Gbps ethernet
I Apache Spark (open source code:
github.com/Cecca/diversity-maximization)
Experiments: effectiveness of approach
Streaming algorithm on musiXmatch dataset

Experiments: comparison with state-of-art
Our algorithm (CCPU) vs. [Aghamolaei et al.’15] (AFZ)

I MapReduce using 16 machines
I remote-clique measure
I 4M points in R3 (max feasible for AFZ)
I Our algorithm: τ = 128
approximation time (s)

k AFZ CPPU AFZ CPPU
4 1.023 1.012 807.79 1.19
6 1.052 1.018 1,052.39 1.29
8 1.029 1.028 4,625.46 1.12
Experiments: throughput
Throughput on the musiXmatch dataset

throughput (points/s)
105
k
8
32
104
128
103
k 2k 4k 8k
τ
Experiments: scalability
I Synthetic data: points from R3

I 1 processor ≡ streaming algorithm
I τ = 2048
Conclusions
Conclusions
I (1 + )-(composable) core-set construction on metric spaces

of constant doubling dimension
I Space savings with additional rounds/passes
I Experiments on real and synthetic data demonstrate

effectiveness, efficiency and scalability of our approach
I Remark: knowledge of space dimension D not needed!

Increase τ as much as space/time limits allow
Future Work
I Better space/round/accuracy tradeoffs in MapReduce
I Extension to broader classes of metric spaces
I Incorporate other constraints:

I Density ⇒ outlier detection
I Matroid constraints

Matteo Scalperf 2017

Caricato da

Informazioni sul documento

Copyright

Formati disponibili

Condividi questo documento

Condividi o incorpora il documento

Opzioni di condivisione

Hai trovato utile questo documento?

Questo contenuto è inappropriato?

Copyright:

Formati disponibili

Matteo Scalperf 2017

Caricato da

Copyright:

Formati disponibili

MapReduce and Streaming Algorithms for

Diversity Maximization in Metric Spaces of

Joint work with:

S ∗ ⊂ S, |S ∗ | = k s.t. S ∗ = argmaxS 0 ⊆S,|S 0 |=k div(S 0 )

remote-edge remote-clique remote-star

remote-bipartition remote-tree remote-cycle

All measures are NP-Hard to optimize

β-core-set [Agarwal et al.’95]

I Compute final solution on T .

β-composable core-set [Indyk et al.’14]

Known β-composable core-sets for diversity maximization

I αseq = best sequential approximation ratio

Our Results (all div’s):

Our Results (all div’s):

Optimal solution O ⊂ S, with |O| = k

MAIN IDEA: Compute core-set T such that each o ∈ O has a

1. Partition S into τ ≥ k clusters of small radius

Compute τ -center clustering

remote-edge and remote-cycle

Other diversity measures

I Approximation guarantee: αseq + 

Streaming algorithm on musiXmatch dataset

Our algorithm (CCPU) vs. [Aghamolaei et al.’15] (AFZ)

approximation time (s)

Throughput on the musiXmatch dataset

I Synthetic data: points from R3

I (1 + )-(composable) core-set construction on metric spaces

I Space savings with additional rounds/passes

I Experiments on real and synthetic data demonstrate

I Remark: knowledge of space dimension D not needed!

I Better space/round/accuracy tradeoffs in MapReduce

I Extension to broader classes of metric spaces

I Incorporate other constraints:

Potrebbero piacerti anche

I Approximation guarantee: αseq +

I (1 + )-(composable) core-set construction on metric spaces