
Pattern Recognition Letters 36 (2014) 46–53


Optimal local community detection in social networks based on density drop of subgraphs

Xingqin Qi a, Wenliang Tang b, Yezhou Wu b, Guodong Guo c, Eddie Fuller b, Cun-Quan Zhang b,⇑,1

a School of Mathematics and Statistics, Shandong University (Weihai), Weihai 264209, China
b Department of Mathematics, West Virginia University, Morgantown, WV 26506, USA
c Department of Computer Science and Electrical Engineering, West Virginia University, Morgantown, WV 26506, USA
Article info

Article history:
Received 12 April 2013
Available online 20 September 2013

Communicated by M.A. Girolami

Keywords:
Hierarchical clustering
Community selection
Graph density
Density drop

Abstract

The determination of community structures within social networks is a significant problem in the area of data mining. A proper community is usually defined as a subgraph with a higher internal density and a lower crossing density with other subgraphs. Hierarchical clustering algorithms produce a set of nested clusters, sometimes called dense subgraphs, organized as a hierarchical system, and the output is usually referred to as a dendrogram. However, determining which of the clusters in the dendrogram will be selected to form communities in the final output is a difficult problem. Most implementations of data mining algorithms require expert guidance in order to establish the appropriate selection of such communities, and ultimately the output may not be optimized, as with fixed height tree-cutting algorithms. In this paper, a novel algorithm for community selection is proposed. The intuition of our approach is based on drops of densities between each pair of parent and child nodes on the dendrogram: the higher the drop in density, the higher the probability that the child should form an independent community. Based on the Max-Flow Min-Cut theorem, we propose a novel algorithm which can output an optimal set of local communities automatically. In addition, a faster algorithm running in linear time is also presented for the case that the dendrogram is a tree. Finally, we validate this approach through a variety of data sets ranging from synthetic graphs to real world benchmark data sets.

© 2013 Elsevier B.V. All rights reserved.

1. Introduction

Networks can be used to describe the pairwise relationships between nodes. Thinking of these nodes as vertices, we can in turn view such networks as graphs where the edges are defined by these same pairwise relationships. Sociologists use networks to describe the relationship among n persons in terms of their connection strength, reflecting how much of a connection exists between each pair, common behaviors, or the level of collaboration. A subgraph with denser connections inside and sparser connections to other subgraphs can provide invaluable insight into the structure of the whole network or into data visualization. Detecting such communities, or clusters of closely related objects, remains one of the most interesting problems in the fields of bioinformatics, social networks, epidemiology and data mining. Many clustering algorithms have been proposed in the literature (Girvan and Newman, 2001; Newman, 2004, 2006; Hastie et al., 2001; Kaufman and Rousseeuw, 1990; Scott, 2000).

Hierarchical clustering, see Scott (2000), is one of the most popular approaches to clustering problems, and its output is called a dendrogram. A dendrogram is a diagram frequently used to illustrate the arrangement of the clusters produced by hierarchical clustering (see Fig. 1(a)).

The dendrogram has a root, say v0, at the top of the diagram, representing the whole network, and leaves at the bottom, each representing an individual. Every node on the internal levels represents a subgraph of the original network. A node on the lower level is called a child, and one on the higher level is called a parent. Each edge connecting two nodes in the dendrogram forms a parent–child relationship. The subgraphs induced by these parent–child relationships, specifically the hierarchy of sets of nodes in the original graph, are called clusters and form the candidate pool of communities for output. However, which of those clusters in the dendrogram should be selected to form the communities in the final output?

"There are no completely satisfactory algorithms that can be used for determining the number of population clusters for many type of cluster analysis", as stated in the SAS/STAT 9.2 User's Guide. It is much harder to find an optimal community partition than to determine the number of communities, and therefore it remains one of the most challenging problems in current data mining research.

⇑ Corresponding author. Fax: +1 304 293 3982.
E-mail addresses: cqzhang@math.wvu.edu, cqzhang@mail.wvu.edu (C.-Q. Zhang).
1 This research is partially supported by an NSA Grant H98230-12-1-0233 and an NSF Grant DMS-126480.

0167-8655/$ - see front matter © 2013 Elsevier B.V. All rights reserved.
http://dx.doi.org/10.1016/j.patrec.2013.09.008
[Figure 1. Panel (a): dendrogram with root 1–130, internal node 101–130, and clusters 1–100, 101–110, 111–120 and 121–130. Panel (b): network with the four dense subgraphs A, B, C and D.]

Fig. 1. An example of a dendrogram obtained by Quasi Clique Merger. (a) A dendrogram obtained by Quasi Clique Merger on the example network. (b) A network consisting of 4 strongly connected dense subgraphs. Four clusters are formed by picking up all nodes immediately adjacent to the edge cut, colored orange. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

A subgraph with denser connections indicates that its members are more similar to each other than they are to portions of the graph outside the subgraph. If we use the "density" of a subgraph, defined to be the average number of edges between nodes in the vertex set of the subgraph, to describe its global connection strength (see Section 2.2 for a more detailed definition), a good community detection algorithm should detect a partition of the input network into subgraphs satisfying:

1. Higher internal connection density.
2. Lower external connection density.

The traditional way of identifying communities from a dendrogram is referred to as tree cutting, branch cutting or branch pruning. One kind of tree cutting algorithm needs the number of communities as an input beforehand, but the problem of determining the number of clusters is itself hard in most cases. Another widely used tree cut algorithm is called fixed height cutting: the user chooses a fixed height on the dendrogram, and all nodes in the branches immediately below the height of the cut form the family of communities. Fixed height tree cutting is simple and rather naive, and its output sometimes does not make sense, especially in complicated cases. The following example reveals the downside of fixed height cutting.

Let G be a network consisting of 4 giant clusters A–D (see Fig. 1(b)). The subgraph induced by A is a complete subgraph of order 100 in which each edge has weight 3, and the subgraphs induced by B–D are also complete, each of order 10. Each edge in those three clusters has weight 4, and all remaining crossing edges have weight 1. Fig. 1(a) is the dendrogram generated by a density driven clustering algorithm. By applying the traditional fixed height cut algorithm, one may produce an output consisting of two communities of orders 100 and 30, respectively (see Fig. 1(a)), while the output of 4 communities consisting of A–D respectively should be the true clustering result.

Fixed height cutting is a simple and naive technique with many desirable properties, but it is unreliable when the dendrogram is large and complicated, as the above example shows. Another community selection technique, called "dynamic tree cut", was discussed in Carlson et al. (2006), Dong and Horvath (2007), Ghazalpour et al. (2006) and Gargalovic et al. (2006). It is mainly based on the shape of the branches of the dendrogram, and the inner structure of each node is not reflected in the community detection process.

In this paper, we propose a new community detection algorithm (Algorithm 1), different from the traditional algorithms (the fixed height cutting algorithm, where all edges to be cut are on the same height level, or the pre-defined community number algorithm, where the number of communities must be supplied beforehand). In our algorithm the community result is output automatically, and the edges to be cut may be located on any level of the dendrogram. To be more specific, our algorithm finds an edge cut of a given dendrogram, separating the root from all leaves, where the edges in the edge cut may be located on any level. The family of all nodes (children) immediately below the edge cut is the output of our algorithm and forms all desired communities automatically.

The basic idea of our algorithm is as follows. Each node v in the dendrogram is a candidate for an optimal local community, and the subgraph $G_v$ induced by its members has relatively higher internal connection than external connection. Those arcs in the dendrogram with a larger density drop indicate improper agglomeration and hence are candidates for edge cutting. Based on those observations, we assign weights to the arcs of T based on the density drop between child and parent. Our community detection algorithm can catch all arcs with a larger density drop and automatically generate a proper community partition. When we test our algorithm on the above example (see Fig. 1(b)), on which the traditional community detection algorithm fails, 4 communities consisting of A–D respectively are obtained, as expected.

The outline of this paper is as follows. In Section 2 we describe classical algorithms and review the Quasi Clique Merger (QCM) algorithm, whose output dendrogram is the starting point of our new community detection algorithm. In Section 3, the new algorithm (Algorithm 1) is described in detail. In addition, a faster algorithm (Algorithm 2) running in linear time is presented in Section 3 for the special case when the dendrogram is a tree. In Section 4 we apply our algorithm to some classic social networks and compare its results with the known clusters, which verifies our new algorithm's utility.
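As a numerical check of the example network with clusters A–D, the following minimal sketch computes the density of each candidate cluster, using the density $d(C)$ defined in Section 2.2. It assumes that every pair of vertices in different clusters is joined by a crossing edge of weight 1 (the text states only that the remaining crossing edges have weight 1); the function and variable names are illustrative and do not come from the paper.

```python
from itertools import combinations

# Vertex sets of the example in Fig. 1(b): A has 100 vertices, B-D have 10 each.
A = set(range(1, 101))
B = set(range(101, 111))
C = set(range(111, 121))
D = set(range(121, 131))
CLUSTERS = [(A, 3), (B, 4), (C, 4), (D, 4)]  # (vertex set, internal edge weight)


def edge_weight(u, v):
    """Weight of edge uv: 3 inside A, 4 inside B, C or D, and 1 otherwise.

    Assumption: every pair of vertices in different clusters is adjacent.
    """
    for members, w in CLUSTERS:
        if u in members and v in members:
            return w
    return 1


def density(vertices):
    """d(C) = 2 * (total edge weight inside C) / (|C| * (|C| - 1))."""
    n = len(vertices)
    total = sum(edge_weight(u, v) for u, v in combinations(sorted(vertices), 2))
    return 2 * total / (n * (n - 1))


if __name__ == "__main__":
    print("d(A)         =", density(A))
    print("d(B)         =", density(B))
    print("d(B u C u D) =", round(density(B | C | D), 3))
    print("d(G)         =", round(density(A | B | C | D), 3))
```

Under this assumption $d(B) = 4$ while $d(B \cup C \cup D) \approx 1.93$, so merging B, C and D roughly halves the density. This is the kind of drop between a child and its parent that the proposed method is designed to detect.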
2. Hierarchical clustering

Clustering is a process of grouping all members of the whole network into a family of subgraphs, called communities. Communities are of interest because they often correspond to functional units for a particular research purpose. A cluster consists of a number of objects with similar characteristics. Any cluster is a subset (or superset) of some community and hence a candidate for inclusion into any containing community of interest. The goal of clustering is to find an optimal output consisting of a set of communities.

2.1. Traditional approach

K-means clustering (Hartigan, 1975) is a well-known partition algorithm which aims to partition the whole network into exactly k communities, in which each member belongs to the community with the nearest mean or center. The main drawback of this algorithm is that the number of communities has to be pre-assigned at the beginning of the process.

Hierarchical clustering is another kind of clustering analysis, which describes a partitioning of a graph varying either from a single cluster containing all members of the whole network to n clusters each containing a single member, or from individual members to the single whole-graph cluster (Defays, 1977; Sibson, 1973). As a result, strategies for hierarchical clustering generally fall into two types: agglomerative algorithms, which proceed by a series of fusions, and divisive algorithms, which proceed by a series of partitions. The output of a hierarchical clustering algorithm can be described by a dendrogram such as Fig. 1(a). In either case, the desired family of clusters is heavily dependent on the horizontal cut line with constant height, which exhibits suboptimal or even awkward performance for complicated dendrograms.

2.2. Quasi Clique Merger (QCM) approach

The Quasi Clique Merger (QCM) algorithm (Ou and Zhang, 2007) is a hierarchical clustering algorithm used to detect dense subgraphs (clusters) of weighted graphs. In this paper, we use the output of QCM as the starting point of our new community detection algorithm. This is because the combination of the QCM algorithm and our new community detection algorithm (Algorithm 1) meets the main requirement of an optimal community detection: higher internal cluster density and lower inter-cluster density. Furthermore, QCM has two other major features: (1) its output dendrogram is smaller, which clearly highlights meaningful clusters, while most existing algorithms produce a larger binary hierarchical tree (Fiedler, 1973; Girvan and Newman, 2001; Pothen et al., 1990); (2) it allows overlapping clustering, or multi-membership, a concept that has recently received increased attention (Palla et al., 2005).

A subgraph H in an un-weighted graph is defined as a clique if every pair of members of H is joined by one edge. It is well known that the search for cliques with the maximum number of vertices in a graph is an NP-complete problem. For a subgraph C in a graph with weight $w$ on its edges, we define the density of C by

$$d(C) = \frac{2 \sum_{e \in E(C)} w(e)}{|V(C)|\,(|V(C)| - 1)},$$

where $E(C)$ is the set of edges connecting members in C. Thus, if C is a subgraph of an un-weighted graph G (or of a weighted graph with $w = 1$ for all edges), then $d(C) = 1$ implies that C induces a clique in G. For a weighted graph, a subgraph C is called a $\Delta$-quasi-clique if $d(C) \ge \Delta$ for some positive real number $\Delta$.

The core of the QCM algorithm focuses on deciding whether or not to add a member to an already selected dense subgraph C. For a member $v \notin V(C)$, we define

$$c(v, C) = \frac{\sum_{u \in V(C)} w(uv)}{|V(C)|}.$$

A member $v$ is added to C when $c(v, C)$ is large enough relative to $\alpha_n d(C)$, where $n = |V(C)|$ and $\alpha_n = 1 - […]$ depends on a user specified parameter $\lambda$ (> 1), which serves as a coefficient that controls the density during the growing of a cluster C.

The QCM algorithm consists of three main steps: Growing, Merging and Contracting. See Ou and Zhang (2007) and Zhao and Zhang (2011) for more details of the algorithm.

By Ou and Zhang (2007), the complexity of constructing the hierarchically nested system is $O(|V|^3 \log |V|)$ in the worst case and $O(|V|^2 \log^2 |V|)$ on average, while the number of levels of the dendrogram is $O(\log |V|)$ instead of the worst case O[…].

As noted previously, an important feature of the output generated by the QCM algorithm is overlapping clustering, or multi-membership, which means that one object may belong to more than one cluster, while traditional algorithms force each object to belong to exactly one cluster. Hence the dendrograms generated by traditional algorithms are trees, while circuits may exist in the dendrogram obtained by the QCM algorithm because of multi-membership.

3. Detecting optimal communities

In the dendrogram T, each node represents a candidate community for selection, which is a dense subgraph (possibly consisting of only one vertex) in the original input network G.

Our new community detection algorithm starts with the output dendrogram T of the QCM algorithm. Let $j_0$ be a non-leaf node in the dendrogram T with children $j_1, \ldots, j_t$, and let $G_{j_i}$ ($i = 0, 1, \ldots, t$) be the corresponding subgraphs in the input network G. The hierarchical density of $G_{j_0}$ is calculated as follows. Construct a graph $H_{j_0}$ with vertex set $\{j_1, \ldots, j_t\}$, in which the weight of the edge between $j_a$ and $j_b$ is

$$w_T(j_a, j_b) = \frac{\min\left\{ \sum_{e \in E_a} w(e),\ \sum_{e \in E_b} w(e) \right\}}{|V(G_{j_a}) - V(G_{j_b})|\,|V(G_{j_b}) - V(G_{j_a})|},$$

where $E_a$ is the set of edges between $V(G_{j_a}) - V(G_{j_b})$ and $V(G_{j_b})$, $E_b$ is defined symmetrically, and $|V(G_{j_a}) - V(G_{j_b})|$ is the number of vertices in $G_{j_a}$ and not in $G_{j_b}$. The hierarchical density $w_T$ of the parent $j_0$ is

$$w_T(G_{j_0}) = \frac{\sum_{a, b \in \{1, \ldots, t\}} w_T(j_a, j_b)}{\binom{t}{2}}.$$

For example, let $G_{j_1} = K_4$ and $G_{j_2} = K_4$ be two children of some node $j_0$, with $G_{j_1} \cap G_{j_2} = K_2$. Then $H_{j_0}$ is a $K_2$ whose edge has weight $4/(2 \cdot 2) = 1$.

The node h in Fig. 1(a), obtained by merging the three clusters B–D, has a corresponding hierarchical subgraph $H_h = K_3$, a weighted triangle in which each edge is assigned 2/(30 · 30) = 2/900.

This hierarchical density is used to measure the external connection among clusters. There is a large hierarchical density drop when the internal density and the external density differ significantly. Thus the information carried by the inner structure of any cluster is detected through the hierarchical density. In the following we propose a novel algorithm (Algorithm 1) to find a set of edges based on the change of hierarchical density from level to level. Algorithm 2 is modified from Algorithm 1 for the special case when the dendrogram is a tree.
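The hierarchical density can be illustrated with a minimal sketch that reproduces the $K_4$/$K_4$ example, reading the definition as the smaller of the two crossing edge weights divided by the product of the two set-difference sizes. The helper names (`complete_graph`, `cross_weight`, `hierarchical_edge_weight`, `hierarchical_density`) and the dictionary-based graph representation are assumptions made for illustration; the sketch also assumes that neither child's vertex set contains the other's, so the denominator is nonzero.

```python
from itertools import combinations


def complete_graph(vertices, w=1.0):
    """Edge-weight dictionary of a complete graph on the given vertices."""
    return {frozenset(pair): w for pair in combinations(sorted(vertices), 2)}


def cross_weight(src, dst, weight):
    """Total weight of edges with one end in src and the other in dst.

    src and dst are assumed disjoint, so no edge is counted twice.
    """
    return sum(weight.get(frozenset((u, v)), 0.0) for u in src for v in dst)


def hierarchical_edge_weight(va, vb, weight):
    """w_T(j_a, j_b) = min(w(E_a), w(E_b)) / (|V_a - V_b| * |V_b - V_a|)."""
    only_a, only_b = va - vb, vb - va
    e_a = cross_weight(only_a, vb, weight)  # edges between V_a - V_b and V_b
    e_b = cross_weight(only_b, va, weight)  # edges between V_b - V_a and V_a
    return min(e_a, e_b) / (len(only_a) * len(only_b))


def hierarchical_density(children, weight):
    """Average of w_T(j_a, j_b) over all pairs of children of a node."""
    pairs = list(combinations(children, 2))
    total = sum(hierarchical_edge_weight(a, b, weight) for a, b in pairs)
    return total / len(pairs)


# Two copies of K4 sharing a K2 (vertices 3 and 4), as in the example.
g1, g2 = {1, 2, 3, 4}, {3, 4, 5, 6}
weights = {**complete_graph(g1), **complete_graph(g2)}

if __name__ == "__main__":
    print(hierarchical_edge_weight(g1, g2, weights))
    print(hierarchical_density([g1, g2], weights))
```

Each child has two private vertices, and each private vertex reaches the other child only through the two shared vertices, so both crossing weights are 4 and the edge weight is 4/(2 · 2) = 1, matching the value stated in the text.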
Algorithm 1 presented here finds an optimal edge cut in a dendrogram that will generate clusters with multi-membership.

Instance: Let T be the dendrogram, with the unique vertex $v_0$ on the highest level. All arcs of T are oriented downward, from vertices on the higher level to vertices on the lower level. Each $v \in T$ corresponds to a "dense" subgroup $G_v$ with hierarchical density $d_T(G_v)$.
Task: Find an edge-cut $E'$ of T, and thus obtain optimal communities by removing all edges in $E'$ from the dendrogram.

Algorithm 1.
BEGIN
Step 1. Construct $T'$ from T: delete all leaves in the lowest level.
Step 2. Construct an auxiliary network N from $T'$ with the root $v_0$ as the source, and add a new vertex t as the sink. Add arcs $\vec{vt}$ for every leaf $v$ of $T'$, and assign weights to the arcs, with the weight for the arc $\vec{v'v''}$ given by
$$c(\vec{v'v''}) = \frac{|V(G_{v''})|}{\log\left(d_T(v'') / d_T(v')\right)},$$
where $d_T$ is the hierarchical density with respect to the dendrogram T.
Step 3. Find the max-flow of N with min-cut $[S, U]$. Simultaneously, the vertices of the auxiliary network N are split into two parts S and U with $v_0 \in S$ and $t \in U$.
Step 4. Construct a collection W of dense subgraphs of G as follows:
Initially, $W \leftarrow \emptyset$;
$W \leftarrow W + \{G_v\}$, where $G_v$ is a dense subgraph represented by a vertex $v$ in N such that
(1) $v \in U$;
(2) $N^-(v) = \{u : uv \text{ is an arc of } N\} \subseteq S$.
Output. W is the collection of selected communities.
END.

The Ford–Fulkerson algorithm (Ford and Fulkerson, 1956) is widely used to compute maximum flows in networks. The idea behind the algorithm is very simple. As long as there is a path from the source to the sink with available capacity on all edges in the path, we send flow along one of those paths. Then we find another path, and so on. A path with available capacity is called an augmenting path. When no more augmenting paths can be found, the source is no longer able to reach the sink. Let S be the set of all objects reachable from the source at that point; then the total capacity in the original network of the edges from S to the complement of S is the minimum cut as desired. For more details about the Max-Flow Min-Cut theorem and realizations of the algorithm, please refer to (Bondy and Murty, 1976; Corman et al., 2009; Goldberg and Tarjan, 1988).

The main cost of the algorithm is in Step 3, finding a max-flow of N, which runs in $O(nm^2)$, where n is the number of objects in the network and m is the number of edges in the network. The cost of the remaining steps is $O(n + m)$. Therefore the total runtime of our algorithm is $O(nm^2)$. When the weights are integers, the run-time of finding the max-flow is bounded by $O(mf)$, where f is the maximum flow in the network.

The push-relabel algorithm is one of the most efficient algorithms to compute a maximum flow. The general algorithm has $O(n^2 m)$ time complexity, while the implementation with the First In First Out vertex selection rule has $O(n^3)$ running time. The highest active vertex selection rule provides $O(n^2 \sqrt{m})$ time complexity, and the implementation with Sleator's and Tarjan's dynamic tree data structure runs in $O(nm \log(n^2/m))$ time. But the Ford–Fulkerson algorithm is often faster in practice for sparse networks.

When the dendrogram is a tree, a faster algorithm (Algorithm 2) can be designed. The new algorithm does not go through the time-consuming max-flow min-cut algorithm.

Let $l = 0, \ldots, t$ and let $L_l(T)$ be the set of all nodes of T at the $l$th level (where $L_0(T) = \{v_0\}$ is the root level, and the lowest level $L_t(T)$ consists of all leaves). Denote by $N^+(v)$ the set of children of node $v$ in T.

Instance: Let $T'$ be obtained from the dendrogram T by deleting all leaves $L_t(T)$, and let $c : E(T) \to \mathbb{R}^+$ be defined in the same way as in Step 2 of Algorithm 1.
Task: Detect an edge-cut $E'$ with $c(E') = \sum_{e \in E'} c(e)$ minimum, and thus obtain communities W for G.

Algorithm 2.
BEGIN
Step 1. For every arc $\vec{uv} \in E(T')$, let $S(\vec{uv}) = \{v\}$, which will be the output for communities after the processing, and let $\partial(\vec{uv}) = c(\vec{uv})$, which is used to calculate the minimum cut. Let $l =$ […].
Step 2. For any non-leaf $v \in L_l(T')$, let $\vec{uv}$ be the arc entering $v$ in $T'$. If $\partial(\vec{uv}) > \sum_{z \in N^+(v)} \partial(\vec{vz})$, then $\partial(\vec{uv}) \leftarrow \sum_{z \in N^+(v)} \partial(\vec{vz})$ and $S(\vec{uv}) \leftarrow \bigcup_{z \in N^+(v)} S(\vec{vz})$; otherwise, there is no change to $\partial(\vec{uv})$ and $S(\vec{uv})$.
Step 3. If $l > 1$ then $l \leftarrow l - 1$ and go back to Step 2. Otherwise, $W = \{G_v : v \in S[…]\}$ (where $v$ corresponds to the subgraph $G_v$ of G as a community in the output).
END.

Algorithm 2 checks every arc to obtain the expected minimum edge-cut, and therefore the total running time is $O(|E(T)|) = O(|V(T)|)$.

4. Experiments

In summary, our procedure consists of the following steps:

Step 1. Use QCM of Section 2.2 to get a dendrogram T;
Step 2. Use the optimal community detection algorithm (Algorithm 1 or Algorithm 2) of Section 3 to get the final community output.

In this section we present a couple of experiments with our approach on computer-generated graphs and on real-world networks where the community structure is already known, and compare the results with those of the preceding algorithms.

4.1. Synthetic graphs

We employed a large number of randomly generated graphs to test the performance of our algorithm. Following the benchmark data sets proposed in Girvan and Newman (2001), each graph was created with 128 vertices and consists mainly of four communities of 32 vertices each. Edges connecting vertices from the same community were randomly generated with probability $P_{in}$. Edges connecting vertices from different communities were randomly generated with probability $P_{out}$, where $P_{in} \ge P_{out}$. The probabilities were chosen so as to keep the average degree of the output random graph almost equal to 16. Those graphs have known community structure, but are essentially random in other respects.

Applying our algorithm to those random graphs, we calculated the fraction of vertices that were correctly classified by our algorithm. This fraction depends on $z_{out}$, the average number of neighbors a vertex has in communities other than its own. The performance of our algorithm is shown in Fig. 2. Each point in the figure represents an average over 15 examples. The curve has a deep decrease at $z_{out} = 6$, where each member has many connections to members in different groups, which presents clustering difficulties.
[Figure 2. Plot titled "Computer-generated graphs". Horizontal axis: average number of inter-community edges per vertex (1 to 6). Vertical axis: fraction of vertices classified correctly (80% to 100%).]

Fig. 2. Numbers in dark green are the average out-degree, i.e., the number of inter-community edges, over 15 random samples. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Fig. 3. The network of friendships in the Karate club study of Zachary, as described in the text. The administrator and instructor are represented by nodes 1 and 33, respectively. Nodes associated with the club administrator's faction are drawn as green squares, those associated with the instructor's faction are drawn as red circles. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
As the figure shows, our algorithm performs nearly perfectly when $z_{out} < 6$, classifying 92% or more of all vertices correctly, and 97% or more if $z_{out} < 5$. In fact, when $z_{out}$, the number of inter-community edges per vertex, becomes larger relative to the average degree 16, no method can make the decision reliably. Thus our algorithm performs very well as long as each vertex has more connections within its community than connections to other communities. Our conclusions match the result obtained by Newman by calculating betweenness (Girvan and Newman, 2001).

4.2. Zachary's Karate club study

We now turn to applications on real world network data. The first real-world social network for which the community structure is already known from other sources is the well known "Karate club" of Zachary (1977). In this study, Zachary observed 34 members of a karate club over a period of two years at an American university. During the course of observation, a disagreement developed between the administrator of the club and the club's instructor, which ultimately resulted in the instructor leaving and starting a new club, taking about half of the original club's members with him.

Zachary constructed a simple un-weighted graph to show the friendships between members of the club: each member of the club is represented by a node, and an edge is drawn if the two members are friends outside the club activities. Fig. 3 shows the network, with the administrator and instructor represented by nodes 1 and 34, respectively. Green squares represent individuals associated with the administrator and red circles represent those associated with the instructor.

Fig. 4. Hierarchical tree showing the complete community structure for the network calculated by using Girvan and Newman's algorithm. Only the leftmost object 3 was wrongly classified.

[Figure 5. Dendrogram with leaf order: 25 26 10 32 29 24 28 27 30 34 33 23 21 19 16 15 31 9 12 1 6 7 5 11 17 20 22 13 2 3 4 8 14 18.]

Fig. 5. Hierarchical tree of the Karate club calculated by using our algorithm. Multi-membership can be observed in the first agglomerative step.

Fig. 6. The dolphin social network of Lusseau et al. The dashed curve represents the division into two equally sized parts found by a standard spectral partitioning calculation. The solid curve represents the division found by the modularity-based algorithm of this section.

Among the many algorithms suggested for this data set, two have dominated the literature: the spectral bisection algorithm (Fiedler, 1973; Pothen et al., 1990), which is based on the eigenvectors of the graph Laplacian, and the algorithm of Girvan and Newman (2001). Both have been used for detecting the communities in this network. Fig. 4 shows the hierarchical tree in Girvan and Newman (2001).

Fig. 5 illustrates the dendrogram derived by applying our algorithm to this network, with output consisting of two communities at level one, which is perfectly consistent with the actual factions observed by Zachary.

4.3. Davis southern club women

These data were collected by Davis et al. in the 1930s. They represent observed attendance at 14 social events by 18 Southern women. The result is a person-by-event matrix: cell $(i, j)$ is 1 if person i attended social event j, and 0 otherwise. The goal of this study is to determine the social structure of this club according to the attendance at all social events. The first reported result on this data set was given by Homans (1950) as follows:

Group 1: 1, 2, 7, 8, 14, 15, 16;
Group 2: 11, 12, 13, 17, 18;
[Figure 7. Left: dendrogram over the vertices 1 to 12. Right: weighted network with node densities 0.22, 0.33 and 1, and arc weights 17.26 and 3.64.]

Fig. 7. The left figure is the dendrogram of the clustering analysis. The right figure is the weighted network constructed by our algorithm, with hierarchical densities and weights.

Members 3, 4, 5, 6, 9, 10 are not clearly clustered into either group. By feeding the club data to our algorithm, two main clusters are obtained as follows:

Group 1: 1, 2, 7, 8, 9, 10, 14, 15, 16;
Group 2: 3, 4, 11, 12, 13, 17, 18;

Note that members 5 and 6 are still not properly grouped into either Group 1 or Group 2. Those two members attended only two of the 14 events and are hard to classify into either group according to their attendance records alone. Member 9 attended three times and each time could meet members 8 and 14. Member 10 has similar preferences to the majority of members of Group 1. Hence it is reasonable to place members 9 and 10 in the first group. The same information can be dug out from the attendance record to support our result. This result improves an earlier result obtained by Zhao and Zhang (2011), which uses a fixed height cut and adds member 4 to Group 1 and member 10 to Group 2. Their result is also identical with the one obtained by Ronald Breiger using duality analysis (Breiger, 1974).

4.4. Political books network

This is a network of books about US politics published around the time of the 2004 presidential election and sold by the online bookseller Amazon.com. The network, compiled by Krebs, consists of 105 nodes and 441 edges. Edges between books represent frequent co-purchasing of books by the same buyers, as indicated by the feature "customers who bought this book also bought these other books" on Amazon. The nodes have been labeled as l, n or c to indicate whether they are liberal, neutral or conservative.

Our result is listed as follows and compared with those obtained by Newman (http://www-personal.umich.edu/mejn/net-data/) and Zhao and Zhang (2011).

True clusters   Zhao and Zhang   Newman's group   Our group
13              4/13 (9)         2/7 (11)         4/10 (9)
49              45/52 (4)        46/53 (3)        46/53 (3)
43              38/40 (5)        40/45 (3)        39/42 (4)
Total           87/105 (18)      88/105 (17)      89/105 (16)

In the above table, the cell value "4/13 (9)" means that 4 out of 13 members of the first group in Zhao's result are correct. The numbers in parentheses are the numbers of books wrongly grouped.

4.5. Bottlenose dolphins network

Fig. 6 represents the social network of a community of 62 bottlenose dolphins living in Doubtful Sound, New Zealand. The network was compiled by Lusseau et al. (2005) from seven years of field studies of the dolphins, with ties between dolphin pairs being established by observation of statistically significant frequent association. This network is of interest because, during the course of the study, the dolphin group split into two smaller subgroups, represented by the circles and squares, following the departure of a key member of the population, represented by the triangle in the figure.

Our algorithm gives the perfect division when processing the whole network of 62 nodes. As shown, one community consists of all squares, and the other consists of all circles and the triangle. If we remove the key member represented by the yellow triangle in the figure, then our algorithm outputs almost the same result, except that one member is wrongly grouped. In fact, this member has only two neighbors in the network: one is the key member and the other belongs to the group consisting of all red squares in the figure.

The application of our algorithm to the dolphins data set also demonstrates, to some extent, the robustness of our algorithm to node removal.

4.6. Example with overlapping

All examples listed so far give a perfect partition of the original network. The phenomenon of multi-membership has been observed recently, especially in larger and more complex networks. The following example was constructed by our group to validate our algorithm on a network with this feature. The network N consists of 4 communities $Q_1, \ldots, Q_4$, each of which is a clique $K_4$, with $V(Q_1) = \{1, 2, 3, 4\}$, $V(Q_2) = \{4, 5, 6, 7\}$, $V(Q_3) = \{7, 8, 9, 10\}$ and $V(Q_4) = \{10, 11, 12, 1\}$. Note that any two adjacent communities share exactly one member ($|V(Q_i) \cap V(Q_{i+1})| = 1$ for each $i = 1, \ldots, 4$ (mod 4)).

Fig. 7 contains all information obtained by our algorithm. The density of each node and the weight on each edge are also labeled. The minimum edge-cut is represented by the dotted line. Cutting through the dotted lines gives four sub-networks, where members 1, 4, 7, 10 have the feature of multi-membership.

5. Conclusion

In this paper, we have investigated the problem of detecting optimal community structure in social networks. Traditional results are always obtained from a dendrogram by cutting at some fixed height. This methodology is observed to output unexpected results in many cases (see the example in Section 1). A novel algorithm
was introduced here by constructing a weighted graph from the dendrogram. Then we can apply max-flow and min-cut theory to the newly obtained weighted graph. We have validated our idea on a variety of data sets, ranging from synthetic graphs to real-world benchmark data sets.

The weight function in our algorithm plays the key role in the clustering analysis. The choice of weight function depends on how we define the social network. This aspect of our algorithm requires that node associations be reflected in the user-defined weight function. A larger weight must be assigned when a parent and child pair have closer densities.

References

Bondy, J.A., Murty, U.S.R., 1976. Graph Theory with Applications. Macmillan, London.
Breiger, R.L., 1974. The duality of persons and groups. Social Forces 53 (2), 181–190.
Carlson, M., Zhang, B., Fang, Z., Horvath, S., Mishel, P., Nelson, S., 2006. Gene connectivity function and sequence conservation: predictions from modular yeast co-expression networks. BMC Genomics 7, 40.
Cormen, T., Leiserson, C., Rivest, R., Stein, C., 2009. Introduction to Algorithms, third ed. MIT.
Defays, D., 1977. An efficient algorithm for a complete link algorithm. The Computer Journal (British Computer Society) 20 (4), 364–366.
Dong, J., Horvath, S., 2007. Understanding network concepts in modules. BMC Systems Biology 1 (1), 24.
Fiedler, M., 1973. Algebraic connectivity of graph. Czechoslovak Mathematical Journal 23, 298–305.
Ford, L.R., Fulkerson, D.R., 1956. Maximal flow through a network. Canadian Journal of Mathematics 8, 399–404.
Gargalovic, P., Imura, M., Zhang, B., Gharavi, N., Clark, M., Pagnon, J., Yang, W., He, A., Truong, A., Patel, S., Nelson, S., Horvath, S., Berliner, J., Kirchgessner, T., Lusis, A., 2006. Identification of inflammatory gene modules based on variations of human endothelial cell responses to oxidized lipids. PNAS 103 (34), 12741–12746.
Ghazalpour, A., Doss, S., Zhang, B., Plaisier, C., Wang, S., Schadt, E., Thomas, A., Drake, T., Lusis, A., Horvath, S., 2006. Integrating genetic and network analysis to characterize genes related to mouse weight. PLoS Genetics 2 (8), e130.
Girvan, M., Newman, M., 2001. Community structure in social and biological networks. PNAS 98, 404–409.
Goldberg, A.V., Tarjan, R.E., 1988. A new approach to the maximum-flow problem. Journal of ACM 35, 921–940.
Hartigan, J.A., 1975. Clustering Algorithm. John Wiley & Sons, Inc.
Hastie, T., Tibshirani, R., Friedman, J., 2001. The Elements of Statistical Learning. Springer, New York.
Homans, G., 1950. The Human Group. Harcourt, Brace and World, New York.
Kaufman, L., Rousseeuw, P., 1990. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, Inc., New York.
Lusseau, D., Schneider, K., Boisseau, O.J., Haase, P., Slooten, E., Dawson, S.M., 2005. The bottlenose dolphin community of Doubtful Sound features a large proportion of long-lasting associations. Can geographic isolation explain this unique trait? Behavioral Ecology and Sociobiology 54, 396–405.
Newman, M., 2004. Detecting community structure in networks. European Physical Journal B 38, 321–330.
Newman, M., 2006. Finding community structure in networks using the eigenvectors of matrices. Physical Review E 74, 036104.
Ou, Y., Zhang, C.-Q., 2007. A new multi-membership clustering algorithm. Journal of Industrial and Management Optimization 3 (4), 619–624.
Palla, G., Derényi, I., Farkas, I., Vicsek, T., 2005. Uncovering the overlapping community structure of complex networks in nature and society. Nature 435, 814–818.
Pothen, A., Simon, H., Liou, K., 1990. Partitioning sparse matrices with eigenvectors of graphs. SIAM Journal on Matrix Analysis and Applications 11, 430–452.
Scott, J., 2000. Social Network Analysis: A Handbook, second ed. Sage, London.
Sibson, R., 1973. SLINK: an optimally efficient algorithm for the single-link cluster algorithm. The Computer Journal (British Computer Society) 16 (1), 30–34.
Zhao, P., Zhang, C.-Q., 2011. A new clustering algorithm and its application in social networks. Pattern Recognition Letters 32, 2109–2118.
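
As a rough illustration of the overlapping example in Section 4.6, the following sketch rebuilds the network N from the four vertex sets given there and checks the stated properties: adjacent cliques share exactly one member, and members 1, 4, 7 and 10 belong to more than one community. It is not the algorithm of this paper. The helper names (`clique_edges`, `density`) are hypothetical, and the density used here, the number of internal edges divided by the number of possible vertex pairs, is an assumption that may differ from the definition used in the paper.

```python
from itertools import combinations

# Vertex sets of the four communities, as listed in Section 4.6.
communities = {
    "Q1": {1, 2, 3, 4},
    "Q2": {4, 5, 6, 7},
    "Q3": {7, 8, 9, 10},
    "Q4": {10, 11, 12, 1},
}


def clique_edges(vertices):
    """Return all unordered vertex pairs, i.e. the edge set of a clique."""
    return {frozenset(pair) for pair in combinations(sorted(vertices), 2)}


def density(vertices, edges):
    """Assumed density: internal edges divided by possible vertex pairs."""
    n = len(vertices)
    internal = sum(1 for edge in edges if edge <= vertices)
    return internal / (n * (n - 1) / 2)


# The network N is the union of the four K4 cliques.
edges = set().union(*(clique_edges(v) for v in communities.values()))
all_vertices = set().union(*communities.values())

# Sizes of the intersections of cyclically adjacent communities.
names = list(communities)
shared_sizes = [
    len(communities[names[i]] & communities[names[(i + 1) % 4]])
    for i in range(4)
]

# Members that belong to more than one community (multi-membership).
overlapping = sorted(
    v for v in all_vertices
    if sum(v in members for members in communities.values()) > 1
)

clique_densities = {name: density(v, edges) for name, v in communities.items()}
network_density = density(all_vertices, edges)

print("shared sizes:", shared_sizes)
print("multi-membership vertices:", overlapping)
print("clique densities:", clique_densities)
print("network density:", round(network_density, 3))
```

Under the assumed density, each clique has density 1 while the whole network (24 edges on 12 vertices) has density 24/66, which is the kind of density drop between a parent and its children that the approach relies on.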