
journal homepage: www.elsevier.com/locate/patrec

subgraphs

Xingqin Qi (a), Wenliang Tang (b), Yezhou Wu (b), Guodong Guo (c), Eddie Fuller (b), Cun-Quan Zhang (b,⇑,1)

(a) School of Mathematics and Statistics, Shandong University (Weihai), Weihai 264209, China

(b) Department of Mathematics, West Virginia University, Morgantown, WV 26506, USA

(c) Department of Computer Science and Electrical Engineering, West Virginia University, Morgantown, WV 26506, USA

Article info

Article history: Received 12 April 2013. Available online 20 September 2013.

Communicated by M.A. Girolami

Keywords: Hierarchical clustering; Community selection; Graph density; Density drop

Abstract

The determination of community structures within social networks is a significant problem in the area of data mining. A proper community is usually defined as a subgraph with a higher internal density and a lower crossing density with other subgraphs. Hierarchical clustering algorithms produce a set of nested clusters, sometimes called dense subgraphs, organized as a hierarchical system, and the output is usually referred to as a dendrogram. However, determining which of the clusters in the dendrogram will be selected to form communities in the final output is a difficult problem. Most implementations of data mining algorithms require expert guidance in order to establish the appropriate selection of such communities, and ultimately the output may not be optimized, as with fixed height tree-cutting algorithms. In this paper, a novel algorithm for community selection is proposed. The intuition of our approach is based on drops of densities between each pair of parent and child nodes on the dendrogram: the higher the drop in density, the higher the probability that the child should form an independent community. Based on the Max-Flow Min-Cut theorem, we propose a novel algorithm which can output an optimal set of local communities automatically. In addition, a faster algorithm running in linear time is also presented for the case that the dendrogram is a tree. Finally, we validate this approach through a variety of data sets ranging from synthetic graphs to real world benchmark data sets.

1. Introduction

Networks can be used to describe the pairwise relationships between nodes. Thinking of these nodes as vertices, we can in turn view such networks as graphs where the edges are defined by these same pairwise relationships. Sociologists use networks to describe the relationships among n persons in terms of their connection strength, reflecting how strong a connection exists between a pair, common behaviors, or the level of collaboration. A subgraph with denser connections inside and sparser connections to other subgraphs can provide invaluable insight into the structure of the whole network or into data visualization. Detecting such communities or clusters of closely related objects remains one of the most interesting problems in the fields of bioinformatics, social networks, epidemiology and data mining. Many clustering algorithms have been proposed in the literature (Girvan and Newman, 2001; Newman, 2004, 2006; Hastie et al., 2001; Kaufman and Rousseeuw, 1990; Scott, 2000).

Hierarchical clustering, see Scott (2000), is one of the most popular approaches to clustering problems, and its output is called a dendrogram. A dendrogram is a diagram frequently used to illustrate the arrangement of the clusters produced by hierarchical clustering (see Fig. 1(a)).

The dendrogram has a root, say v0, on the top of the diagram representing the whole network, and leaves on the bottom representing each individual. Every node among the internal levels represents a subgraph of the original network. The node on the lower level is called the child, and the one on the higher level is called the parent. Each edge connecting two nodes in the dendrogram forms a parent–child relationship. The subgraphs induced by these parent–child relationships, specifically the hierarchy of sets of nodes in the original graph, are called clusters and form the candidate pool of communities for output. However, which of those clusters in the dendrogram will be selected to form communities in the final output?

"There are no completely satisfactory algorithms that can be used for determining the number of population clusters for many type of cluster analysis", as stated in the SAS/STAT 9.2 User's Guide. It is much harder to find an optimal community partition than to determine the number of communities, and therefore it remains one of the most challenging problems in current data mining research.

⇑ Corresponding author. Fax: +1 304 293 3982. E-mail addresses: cqzhang@math.wvu.edu, cqzhang@mail.wvu.edu (C.-Q. Zhang).

1 This research is partially supported by an NSA Grant H98230-12-1-0233 and an NSF Grant DMS-126480.

0167-8655/$ - see front matter 2013 Elsevier B.V. All rights reserved.

http://dx.doi.org/10.1016/j.patrec.2013.09.008

X. Qi et al. / Pattern Recognition Letters 36 (2014) 46–53 47

[Figure 1: panel (a), dendrogram over vertices 1–130 with leaf groups 1–100, 101–110, 111–120 and 121–130; panel (b), network with clusters A, B, C and D.]

Fig. 1. An example of dendrogram obtained by Quasi Clique Merger on left example strongly connected dense subgraphs. Four clusters are formed by picking up all nodes immediately adjacent to edge cut, colored with orange. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

A subgraph with denser connections indicates that its members are more similar to each other than they are to portions of the graph outside the subgraph. If we use the "density" of a subgraph, defined to be the average number of edges between nodes in the vertex set of the subgraph, to describe its global connection strength (see Section 2.2 for a more detailed definition), a good community detection algorithm should detect a partition of the input network into subgraphs satisfying:

1. Higher internal connection density.
2. Lower external connection density.

The traditional algorithm for identifying communities in a dendrogram is referred to as tree cutting, branch cutting or branch pruning. One kind of tree cutting algorithm needs the number of communities as an input beforehand, but the problem of determining the number of clusters is itself hard in most cases. Another widely used tree cut algorithm is called fixed height cutting: the user chooses a fixed height on the dendrogram, and all nodes in the branches immediately below the height of the cut form the family of communities. Fixed height tree cutting is simple and rather naive, and its output sometimes does not make sense, especially for complicated cases. The following example reveals the downside of fixed height cutting.

Let G be a network consisting of 4 giant clusters A–D (see Fig. 1(b)). The subgraph induced by A is a complete subgraph of order 100 in which each edge is weighted by 3, and the ones induced by B–D are also complete and of order 10. Each edge in those three clusters is assigned weight 4, and all remaining crossing edges are weighted by 1. Fig. 1(a) is the dendrogram generated by a density driven clustering algorithm. By applying the traditional fixed height cut algorithm, one may produce an output consisting of two communities of orders 100 and 30, respectively (see Fig. 1(a)), while the output of 4 communities consisting of A–D respectively should be the true clustering result.

Fixed height cutting is a simple and naive technique with many desirable properties, but it is unreliable when the dendrogram is large and complicated, as we have seen from the above example. Another community selection technique, called "dynamic tree cut", was discussed in Carlson et al. (2006), Dong and Horvath (2007), Ghazalpour et al. (2006) and Gargalovic et al. (2006). It is mainly based on the shape of the branches of the dendrogram, and the inner structure of each node is not reflected in the process of community detection.

In this paper, we propose a new community detection algorithm (Algorithm 1). It differs from traditional algorithms (the fixed height cutting algorithm, where all edges to be cut are at the same height level, or the pre-defined community number algorithm, where the number of communities must be supplied beforehand) in that the community result is output automatically and the edges to be cut may be located at any level of the dendrogram. To be more specific, our algorithm finds an edge cut of a given dendrogram, separating the root and all leaves, where the edges in the edge cut may be located at any level. The family of all nodes (children) immediately below the edge cut is the output of our algorithm and forms all desired communities automatically.

The basic idea of our algorithm is as follows. Each node v in the dendrogram is a candidate for an optimal local community, and the subgraph Gv induced by its members has relatively higher intra-connection than extra-connection. Those arcs in the dendrogram with a larger density drop indicate improper agglomeration and hence form candidates for edge cutting. Based on those observations, we assign weights to the arcs of T based on the density drop between child and parent. Our community detection algorithm catches all arcs with a larger density drop and automatically generates a proper community partition. When we test our algorithm on the above example (see Fig. 1(b)), on which the traditional community detection algorithm fails, 4 communities consisting of A–D respectively are obtained, as expected.

The outline of this paper is as follows. In Section 2 we describe classical algorithms and review the Quasi Clique Merge (QCM) algorithm, whose output dendrogram is the starting point of our new community detection algorithm. In Section 3, the new algorithm (Algorithm 1) is described in detail. In addition, a faster algorithm (Algorithm 2) running in linear time is presented in Section 3 for the special case when the dendrogram is a tree. In Section 4 we apply our algorithm to some classic social networks and compare its results with the known clusters, which verifies the utility of our new algorithm.
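As a rough illustration of the density drop in the A–D example, the following worked computation applies the weighted density of Section 2.2, d(C) = 2 Σ w(e) / (|V(C)|(|V(C)| − 1)), to the clusters of Fig. 1(b). It assumes that every pair of vertices from two different clusters is joined by a crossing edge of weight 1, which is one reading of the example, and it uses the plain density rather than the hierarchical density on which the algorithm itself relies.

```latex
\begin{align*}
d(A) &= \frac{2 \cdot 3 \cdot \binom{100}{2}}{100 \cdot 99} = 3,
\qquad
d(B) = d(C) = d(D) = \frac{2 \cdot 4 \cdot \binom{10}{2}}{10 \cdot 9} = 4, \\
d(B \cup C \cup D) &= \frac{2\,(3 \cdot 45 \cdot 4 + 3 \cdot 100 \cdot 1)}{30 \cdot 29}
  = \frac{1680}{870} \approx 1.93, \\
d(G) &= \frac{2\,(4950 \cdot 3 + 540 + 300 + 100 \cdot 30 \cdot 1)}{130 \cdot 129}
  = \frac{37380}{16770} \approx 2.23 .
\end{align*}
```

Under these assumptions the density falls from 4 to about 1.93 when B, C and D are merged, which is the kind of drop between child and parent that marks an arc as a candidate for cutting.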


2. Hierarchical clustering

Clustering is a process of grouping all members of a whole network into a family of subgraphs, called communities. Communities are of interest because they often correspond to functional units for a particular research purpose. A cluster consists of a number of objects with similar characteristics. Any cluster is a subset (or superset) of some community and hence a candidate for inclusion into any containing community of interest. The goal of clustering is to find an optimal output consisting of a set of communities.

2.1. Traditional approach

K-means clustering (Hartigan, 1975) is a well-known partition algorithm which aims to partition the whole network into exactly k communities in which each member belongs to the community with the nearest mean or center. The main drawback of this algorithm is that the number of communities has to be pre-assigned at the beginning of the process.

Hierarchical clustering is another kind of clustering analysis, which describes a partitioning of a graph varying either from a single cluster containing all members of the whole network to n clusters each containing a single member, or from individual members to the single whole graph cluster (Defays, 1977; Sibson, 1973). As a result, strategies for hierarchical clustering generally fall into two types: agglomerative algorithms, which proceed by a series of fusions, and divisive algorithms, which proceed by a series of partitions. The output of a hierarchical clustering algorithm can be described by Fig. 1(b). In either case, the desired family of clusters is heavily dependent on the horizontal cut line with constant height, which exhibits suboptimal or even awkward performance for complicated dendrograms.

The Quasi Clique Merger (QCM) algorithm (Ou and Zhang, 2007) is a hierarchical clustering algorithm used to detect dense subgraphs (clusters) of weighted graphs. In this paper, we use the output of QCM as the starting point of our new community detection algorithm. This is because the combination of the QCM algorithm and our new community detection algorithm (Algorithm 1) meets the main requirement of an optimal community detection: higher internal cluster density and lower inter-cluster density. Furthermore, QCM has two other major features: (1) its output dendrogram is smaller, which clearly highlights meaningful clusters, while most existing algorithms produce a larger binary hierarchical tree (Fiedler, 1973; Girvan and Newman, 2001; Pothen et al., 1990); (2) it allows overlapping clustering or multi-membership, a concept that has recently received increased attention (Palla et al., 2005).

A subgraph H in an un-weighted graph is defined as a clique if every pair of members of H is joined by one edge. It is well known that the search for cliques with the maximum number of vertices in graphs is an NP-complete problem. For a subgraph C in a graph with weight $w$ on its edges, we can define the density of C by

$d(C) = \dfrac{2 \sum_{e \in E(C)} w(e)}{|V(C)|\,(|V(C)| - 1)}$,

where $E(C)$ is the set of edges connecting members in C. As seen above, if C is a subgraph of an un-weighted graph G (or of a weighted graph with $w = 1$ for all edges), then $d(C) = 1$ implies that C induces a clique in G. For a weighted graph, a subgraph C is called a $\Delta$-quasi-clique if $d(C) \ge \Delta$ for some positive real number $\Delta$.

The core of the QCM algorithm focuses on deciding whether or not to add a member to an already selected dense subgraph C. For a member $v \notin V(C)$, we define the contribution of v to C by

$c(v, C) = \dfrac{\sum_{u \in V(C)} w(uv)}{|V(C)|}$.

A member v is added to C when its contribution is large enough compared with $\alpha_n d(C)$, where $n = |V(C)|$ and $\alpha_n$ is determined by n and a user specified parameter $\lambda$ ($> 1$), and serves as a coefficient that controls the density during the growing of a cluster C.

The QCM algorithm consists of three main steps: Growing, Merging and Contracting. See Ou and Zhang (2007) and Zhao and Zhang (2011) for more details of the algorithm. By Ou and Zhang (2007), the complexity of constructing the hierarchically nested system is $O(|V|^3 \log |V|)$ in the worst case and $O(|V|^2 \log^2 |V|)$ on average, while the number of levels of the dendrogram is $O(\log |V|)$ on average instead of its worst-case value.

As noted previously, an important feature of the output generated by the QCM algorithm is overlapping clustering or multi-membership, which means that one object may belong to more than one cluster, while traditional algorithms force one object to belong to exactly one cluster. Hence the dendrograms generated by traditional algorithms are trees, while circuits may exist in the dendrogram obtained by the QCM algorithm because of multi-membership.

3. Detecting optimal communities

In the dendrogram T, each node represents a candidate community for selection, which is a dense subgraph (possibly consisting of only one vertex) in the original input network G.

Our new community detection algorithm starts with the output dendrogram T of the QCM algorithm. Let $j_0$ be a non-leaf node in the dendrogram T with children $j_1, \ldots, j_t$, and let $G_{j_i}$ ($i = 0, 1, \ldots, t$) be the corresponding subgraphs in the input network G. The hierarchical density of $G_{j_0}$ is calculated as follows. Construct a graph $H_{j_0}$ with vertex set $\{j_1, \ldots, j_t\}$. The weight $w_T(j_a, j_b)$ of the edge between $j_a$ and $j_b$ is the minimum of two terms built from $\sum_{e \in E_a} w(e)$ and $\sum_{e \in E_b} w(e)$, where $E_a$ is the set of edges between $V(G_{j_a}) - V(G_{j_b})$ and $V(G_{j_b})$, and $a$ is the number of vertices in $G_{j_a}$ and not in $G_{j_b}$ ($E_b$ and $b$ are defined in the same way with the roles of $j_a$ and $j_b$ exchanged). The hierarchical density $w_T(G_{j_0})$ of the parent $j_0$ is then obtained from the sum of $w_T(j_a, j_b)$ over $a, b \in \{1, \ldots, t\}$.

For example, let $G_{j_1} = K_4$ and $G_{j_2} = K_4$ be two children of some node $j_0$ with $G_{j_1} \cap G_{j_2} = K_2$. Then $H_{j_0}$ will be $K_2$, weighted by $4/(2 \times 2) = 1$.

The node h in Fig. 1(a), obtained by merging the three clusters B–D, has a corresponding hierarchical subgraph $H_h = K_3$, a weighted triangle in which each edge is assigned $2/(30 \times 30) = 2/900$.

This hierarchical density is used to measure the extra connection among clusters. There will be a large hierarchical density drop when the intra density and the extra density differ significantly. Thus the information carried by the inner structure of any cluster is detected through the hierarchical density. In the following we propose a novel algorithm (Algorithm 1) to find a set of edges based on the change of hierarchical density from level to level. Algorithm 2 is modified from Algorithm 1 for the special case when the dendrogram is a tree.

Algorithm 1 presented here finds an optimal edge cut in a dendrogram that will generate clusters with multi-membership.

Instance: Let T be the dendrogram with the unique vertex $v_0$ on the highest level. All arcs of T are oriented downward, from vertices on the higher level to vertices on the lower level.
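The density d(C), the contribution c(v, C) and the Δ-quasi-clique test of Section 2.2 can be sketched in a few lines of Python. This is only an illustration under assumptions of our own: the edge weights are stored in a dictionary keyed by `frozenset` pairs, the function names are hypothetical, and a single vertex is given density 0 by convention. It is not the QCM implementation of Ou and Zhang (2007).

```python
from itertools import combinations


def density(vertices, weight):
    """Weighted density d(C) = 2 * sum_{e in E(C)} w(e) / (|V(C)| * (|V(C)| - 1))."""
    vs = list(vertices)
    n = len(vs)
    if n < 2:
        return 0.0  # convention chosen for this sketch only
    total = sum(weight.get(frozenset((u, v)), 0.0) for u, v in combinations(vs, 2))
    return 2.0 * total / (n * (n - 1))


def contribution(v, vertices, weight):
    """Contribution c(v, C) = sum_{u in V(C)} w(uv) / |V(C)| of an outside vertex v."""
    vs = list(vertices)
    return sum(weight.get(frozenset((u, v)), 0.0) for u in vs) / len(vs)


def is_quasi_clique(vertices, weight, delta):
    """A subgraph C is a delta-quasi-clique if d(C) >= delta."""
    return density(vertices, weight) >= delta


# Unweighted K4 on {1, 2, 3, 4}: every pair is an edge of weight 1.
k4 = {frozenset(p): 1.0 for p in combinations([1, 2, 3, 4], 2)}
k4_density = density([1, 2, 3, 4], k4)

# A fifth vertex adjacent to two of the four clique members.
k4[frozenset((5, 1))] = 1.0
k4[frozenset((5, 2))] = 1.0
c5 = contribution(5, [1, 2, 3, 4], k4)
grown_density = density([1, 2, 3, 4, 5], k4)
```

Here the clique has density 1, vertex 5 contributes 2/4 to it, and adding vertex 5 would lower the density to 0.8, which is the trade-off the growing step has to weigh.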


Each $v \in T$ corresponds to a "dense" subgroup $G_v$ with hierarchical density $d_T(G_v)$.

Task: Find an edge-cut $E_0$ of T, thus obtaining optimal communities by removing all edges in $E_0$ from the dendrogram.

Algorithm 1.

Step 1. Construct $T'$ from T: delete all leaves in the lowest level.

Step 2. Construct an auxiliary network N from $T'$ with the root $v_0$ as the source, and add a new vertex t as the sink. Add an arc $vt$ for every leaf v and assign weights to the arcs. The weight $c(v'v'')$ of an arc $v'v''$ is defined through $\log\big(d_T(v'')/d_T(v')\big)$, where $d_T$ is the hierarchical density with respect to the dendrogram T.

Step 3. Find the max-flow of N with min-cut $[S, U]$. Simultaneously, the vertices of the auxiliary network N are split into two parts S and U with $v_0 \in S$ and $t \in U$.

Step 4. Construct a collection W of dense subgraphs of H as follows. Initially, $W \leftarrow \emptyset$; then $W \leftarrow W + \{G_v\}$, where $G_v$ is a dense subgraph represented by a vertex v in N such that
(1) $v \in U$;
(2) $N^-(v) = \{u : uv \text{ is an arc of } N\}$ … $S$.

Output. W is the collection of selected communities.

END.

The Ford–Fulkerson algorithm (Ford and Fulkerson, 1956), named after Ford and Fulkerson, is widely used to compute maximum flows in networks. The idea behind the algorithm is very simple. As long as there is a path from the source to the sink with available capacity on all edges in the path, we can send flow along one of those paths. Then we find another path, and so on. A path with available capacity is called an augmenting path. When no more augmenting paths can be found, the source is no longer able to reach the sink. Let S be the set of all objects reachable from the source at that point; then the total capacity in the original network of the edges from S to the remainder of the network is the minimum cut, as desired. For more details about the Max-Flow Min-Cut theorem and realizations of the algorithm, please refer to (Bondy and Murty, 1976; Corman et al., 2009; Goldberg and Tarjan, 1988).

The main cost of the algorithm is the max-flow computation, which runs in $O(nm^2)$, where n is the number of objects in the network and m is the number of edges in the network. The cost of the remaining steps is $O(n + m)$. Therefore the total runtime of our algorithm is $O(nm^2)$. If the weights are integers, the run-time of finding the max-flow is bounded by $O(mf)$, where f is the maximum flow in the network.

The push-relabel algorithm is one of the most efficient algorithms to compute a maximum flow. The general algorithm has $O(n^2 m)$ time complexity, while the implementation with the First In, First Out vertex selection rule has $O(n^3)$ running time. The highest active vertex selection rule provides $O(n^2 \sqrt{m})$ time complexity, and the implementation with Sleator's and Tarjan's dynamic tree data structure runs in $O(nm \log(n^2/m))$ time. But the Ford–Fulkerson algorithm is often faster in practice for sparse networks.

When the dendrogram is a tree, a faster algorithm (Algorithm 2) can be designed. The new algorithm does not go through the time-consuming max-flow min-cut algorithm.

Let $l = 0, \ldots, t$ and let $L_l(T)$ be the set of all nodes of T at the l-th level (where $L_0(T) = \{v_0\}$ is the root level, and the lowest level $L_t(T)$ consists of all leaves). Denote by $N^+(v)$ the set of children of node v in T.

Instance: Let $T'$ be obtained from the dendrogram T by deleting all leaves $L_t(T)$, and let $c : E(T) \to \mathbb{R}^+$ be defined in the same way as in Step 2 of Algorithm 1.

Task: Detect an edge-cut $E_0$ with $c(E_0) = \sum_{e \in E_0} c(e)$ minimum, thus obtaining the communities of G.

Algorithm 2.

Step 1. For every arc $uv \in E(T')$, let $S(uv) = \{v\}$, which will be the output for communities after the processing, and let $\partial(uv) = c(uv)$, which is used to calculate the minimum cut. Set the level counter l to start from the bottom of $T'$.

Step 2. For any non-leaf $v \in L_l(T')$ with parent u in $T'$: if $\partial(uv) > \sum_{z \in N^+(v)} \partial(vz)$, then $\partial(uv) \leftarrow \sum_{z \in N^+(v)} \partial(vz)$ and $S(uv) \leftarrow \bigcup_{z \in N^+(v)} S(vz)$. Otherwise, there is no change to $\partial(uv)$ and $S(uv)$.

Step 3. If $l > 1$ then $l \leftarrow l - 1$ and go back to Step 2. Otherwise, $W = \{G_v : v \in \bigcup_{z \in N^+(v_0)} S(v_0 z)\}$ (where each v corresponds to a community in the output).

END.

Algorithm 2 checks every arc to obtain the expected minimum edge-cut, and therefore the total running time is $O(|E(T)|) = O(|V(T)|)$.

4. Experiments

In summary, our procedure consists of the following steps:

Step 1. Use the QCM algorithm to get a dendrogram T;
Step 2. Use the optimal community detection algorithm (Algorithm 1 or Algorithm 2) of Section 3 to get the final community output.

In this section we present a couple of experiments with our approach on computer-generated graphs and on real-world networks where the community structure is already known, and compare the results with those of preceding algorithms.

4.1. Synthetic graphs

We employed a large number of randomly generated graphs to test the performance of our algorithm. Following the benchmark data sets proposed in Girvan and Newman (2001), each graph was created with 128 vertices and consists of four communities of 32 vertices each. Edges connecting vertices from the same community were randomly generated with probability $P_{in}$. Edges connecting vertices from different communities were randomly generated with probability $P_{out}$, where $P_{in} \ge P_{out}$. The probabilities were chosen so as to keep the average degree of the output random graph almost equal to 16. These graphs have a known community structure, but are essentially random in other respects.

Applying our algorithm to those random graphs, we calculated the fraction of vertices that were correctly classified by our algorithm. This fraction depends on $z_{out}$, the average number of edges connecting a vertex to vertices from communities other than its own community. The performance is shown in Fig. 2. Each point in the figure is an average over 15 examples. The curve has a deep decrease at $z_{out} = 6$, where each member has many connections to members in different groups, which presents clustering difficulties.
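Step 3 of Algorithm 1 and the augmenting-path idea described in Section 3 can be illustrated with a small pure-Python max-flow routine. The sketch below uses the breadth-first variant of Ford–Fulkerson (Edmonds–Karp). The toy dendrogram, its node names and its arc capacities are hypothetical and are not the weights defined in the paper, and the final selection rule (a vertex on the sink side with an in-neighbour on the source side) is one possible reading of Step 4.

```python
from collections import defaultdict, deque


def min_cut(capacity, source, sink):
    """Return (max-flow value, source side S of a minimum cut).

    capacity maps an arc (u, v) to its non-negative capacity.
    """
    residual = defaultdict(float)
    neighbours = defaultdict(set)
    for (u, v), c in capacity.items():
        residual[(u, v)] += c
        neighbours[u].add(v)
        neighbours[v].add(u)  # reverse arc of the residual network

    def reachable():
        """Breadth-first search along arcs that still have residual capacity."""
        parent = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in neighbours[u]:
                if v not in parent and residual[(u, v)] > 1e-12:
                    parent[v] = u
                    queue.append(v)
        return parent

    flow = 0.0
    while True:
        parent = reachable()
        if sink not in parent:
            return flow, set(parent)  # no augmenting path is left
        # Walk back from the sink to collect the augmenting path.
        path, v = [], sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(residual[arc] for arc in path)
        for u, v in path:
            residual[(u, v)] -= push
            residual[(v, u)] += push
        flow += push


# Hypothetical auxiliary network: root v0, internal node h, leaves a, b, c, d, sink t.
INF = float("inf")
capacity = {
    ("v0", "a"): 1.0, ("v0", "h"): 5.0,
    ("h", "b"): 1.0, ("h", "c"): 1.0, ("h", "d"): 1.0,
    ("a", "t"): INF, ("b", "t"): INF, ("c", "t"): INF, ("d", "t"): INF,
}
flow_value, source_side = min_cut(capacity, "v0", "t")

# Nodes directly below the cut: on the sink side with an in-neighbour in S.
communities = {v for (u, v) in capacity
               if u in source_side and v not in source_side and v != "t"}
```

In this toy network it is cheaper to cut the three arcs below h (total 3) than the arc from v0 to h (5), so the cut selects a, b, c and d as four separate communities, mirroring the outcome the paper reports for the A–D example.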

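For the tree case, the level-by-level comparison in Algorithm 2 amounts to choosing, for each arc, the cheaper of cutting that arc or cutting everything below it. A minimal sketch follows, written recursively from the root rather than bottom-up by level, which should give the same cut on a tree. The dictionaries `children` and `cost`, the function name and the toy weights are assumptions made for illustration only.

```python
def tree_min_cut(children, cost, root):
    """Minimum-weight arc set separating the root from every leaf of a tree.

    children maps a node to the list of its children (leaves may be absent);
    cost maps an arc (parent, child) to its non-negative weight.
    Returns (total weight, set of nodes directly below the chosen arcs).
    """
    def best(u, v):
        own = cost[(u, v)]            # option 1: cut the arc u -> v itself
        kids = children.get(v, [])
        if not kids:
            return own, {v}
        below_cost, below_nodes = 0.0, set()
        for z in kids:                # option 2: keep u -> v and cut below v
            c, nodes = best(v, z)
            below_cost += c
            below_nodes |= nodes
        if own > below_cost:
            return below_cost, below_nodes
        return own, {v}

    total, selected = 0.0, set()
    for z in children.get(root, []):
        c, nodes = best(root, z)
        total += c
        selected |= nodes
    return total, selected


# Hypothetical tree T': root v0 with children a and h; h has children b, c, d.
children = {"v0": ["a", "h"], "h": ["b", "c", "d"]}
cost = {("v0", "a"): 1.0, ("v0", "h"): 5.0,
        ("h", "b"): 1.0, ("h", "c"): 1.0, ("h", "d"): 1.0}
cut_weight, selected = tree_min_cut(children, cost, "v0")
```

With these weights the arc from v0 to h costs more than the three arcs below it, so the cut moves down one level and selects a, b, c and d. Lowering the weight of that arc to 2 would instead select a and h.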

[Figure 2: plot titled "Computer_Generated Graphs"; vertical axis "Fraction of vertices classified correctly" (80–100%); horizontal axis from 1 to 6.]

Fig. 2. Numbers in dark green are the average out-degree, i.e., the number of inter-community edges over 15 random samples. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
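The benchmark graphs of Section 4.1 can be generated along the following lines. This is a sketch under stated assumptions: the text does not give the exact probabilities, so they are derived here from a target out-degree `z_out` and a target average degree of 16, and the function and parameter names are ours.

```python
import random


def planted_partition(z_out, n_groups=4, group_size=32, avg_degree=16, seed=0):
    """Random graph with equal-size planted communities.

    Returns (edges, group), where group[v] is the community index of vertex v.
    """
    rng = random.Random(seed)
    n = n_groups * group_size
    # Expected within-community degree is avg_degree - z_out.
    p_in = (avg_degree - z_out) / (group_size - 1)
    # Expected degree towards the other communities is z_out.
    p_out = z_out / (n - group_size)
    group = [v // group_size for v in range(n)]
    edges = []
    for u in range(n):
        for v in range(u + 1, n):
            p = p_in if group[u] == group[v] else p_out
            if rng.random() < p:
                edges.append((u, v))
    return edges, group


def mean_out_degree(edges, group):
    """Average number of inter-community edges per vertex."""
    crossing = sum(1 for u, v in edges if group[u] != group[v])
    return 2.0 * crossing / len(group)


edges, group = planted_partition(z_out=3.0, seed=1)
```

Sweeping `z_out` from 1 to 6 and averaging over several seeds would reproduce the kind of experiment summarized in Fig. 2; the observed out-degree of each sample varies around the target, which may explain why the figure reports averages such as 0.96 or 1.89 rather than whole numbers.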

Fig. 3. The network of friendships in the Karate club study in Zachary as described in the text. The administrator and instructor are represented by nodes 1 and 33, respectively. Nodes associated with the club administrator's faction are drawn as green squares, those associated with the instructor's faction are drawn as red circles. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

As the figure shows, our algorithm performs nearly perfectly when $z_{out} < 6$, classifying 92% or more of the vertices correctly, and 97% or more if $z_{out} < 5$. In fact, when $z_{out}$, the number of inter-community edges, is larger than this, the decision becomes unreliable for any method, given that the average degree is 16.

Thus our algorithm performs very well as long as each vertex has more connections within its community than connections to other communities. Our conclusions match the result obtained by Newman by calculating betweenness (Girvan and Newman, 2001).

4.2. Zachary's Karate club study

We now turn to applications on real-world network data. The first real-world social network for which the community structure is already known from other sources is the well-known "Karate club" of Zachary (1977). In this study, Zachary observed 34 members of a karate club over a period of two years at an American university. During the course of observation, a disagreement developed between the administrator of the club and the club's instructor, which ultimately resulted in the instructor's leaving and starting a new club, taking about half of the original club's members with him.

Zachary constructed a simple un-weighted graph to show the friendships between members of the club: each member of the club is represented by a node, and an edge is drawn if the two members are friends outside the club activities. Fig. 3 shows the network, with the administrator and instructor represented by nodes 1 and 34, respectively. Green squares represent individuals associated with the administrator and red circles represent those associated with the instructor.


Fig. 4. Hierarchical tree showing the complete community structure for the network calculated by using Girvan and Newman's algorithm. Only the leftmost object 3 was wrongly classified.

[Figure 5 leaf order: 25 26 10 32 29 24 28 27 30 34 33 23 21 19 16 15 31 9 12 1 6 7 5 11 17 20 22 13 2 3 4 8 14 18]

Fig. 5. Hierarchical tree of Karate club calculated by using our algorithm. Multi-membership can be observed in the first agglomerative step.

Fig. 6. The dolphin social network of Lusseau et al. The dashed curve represents the division into two equally sized parts found by a standard spectral partitioning calculation. The solid curve represents the division found by the modularity-based algorithm of this section.

Among many algorithms suggested for this data set, two of them have dominated the literature: the spectral bisection algorithm (Fiedler, 1973; Pothen et al., 1990), which is based on the … (2001). Both have been used for detecting the communities in this network. Fig. 4 shows the hierarchical tree in Girvan and Newman (2001).

Fig. 5 illustrates the dendrogram derived by applying our algorithm to this network, with the output consisting of two communities at level one, which is perfectly consistent with the actual factions observed by Zachary.

These data were collected by Davis et al. in the 1930s. They represent observed attendance at 14 social events by 18 Southern women. The result is a person-by-event matrix: cell $(i, j)$ is 1 if person i attended social event j, and 0 otherwise. The goal of this study is to determine the social structure of this club according to attendance at all social events. The first reported result on this data set was given by Homans (1950) as follows:

Group 1: 1, 2, 7, 8, 14, 15, 16;
Group 2: 11, 12, 13, 17, 18;


[Figure 7: left panel, dendrogram over leaves 1–12; right panel, weighted network labeled with hierarchical densities and arc weights.]

Fig. 7. The left figure is the dendrogram of clustering analysis. Right figure is the weighted network constructed by our algorithm with hierarchical densities and weights.
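The twelve-vertex network behind Fig. 7, described in Section 4.6 as four K4 cliques arranged in a ring, can be written down directly. The sketch below only builds the vertex sets and edges and lists the vertices with multi-membership. Unit edge weights are an assumption, since the text does not state the weights used.

```python
from itertools import combinations

# Vertex sets of the four communities as given in Section 4.6.
cliques = {
    "Q1": {1, 2, 3, 4},
    "Q2": {4, 5, 6, 7},
    "Q3": {7, 8, 9, 10},
    "Q4": {10, 11, 12, 1},
}

# Every pair inside a community is joined by an edge (each community is a K4).
edges = {frozenset(pair)
         for members in cliques.values()
         for pair in combinations(sorted(members), 2)}

# Record, for each vertex, the communities it belongs to.
membership = {}
for name, members in cliques.items():
    for v in members:
        membership.setdefault(v, []).append(name)

# Vertices that belong to more than one community.
overlapping = sorted(v for v, names in membership.items() if len(names) > 1)
```

The list `overlapping` contains exactly the members 1, 4, 7 and 10 that the text identifies as having multi-membership, and each pair of adjacent communities shares one of them.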

Members 3, 4, 5, 6, 9, 10 are not clearly clustered to either group. By 4.5. Bottlenose dolphins network

feeding the club data to our algorithm, two main clusters are obtained as

follows: Fig. 6 represents the social network of a community of 62 bot-tlenose

dolphins living in Doubtful Sound, New Zealand. The net-work was compiled

Group 1: 1, 2, 7, 8, 9, 10, 14, 15, 16; by Lusseau et al. (2005) from seven years of field studies of the dolphins,

Group 2: 3, 4, 11, 12, 13, 17, 18; with ties between dolphin pairs being established by observation of

statistically significant frequent asso-ciation. This network is interest because,

Note that members 5 and 6 are still not properly grouped either into Group during the course of the study, the dolphin group splits into two smaller

1 or Group 2. Those two members only attended two out of 14 conferences subgroups, repre-sented by the circles and squares, following the departure of

and hard to classify into either group only according to their attendance a key member of the population, represented by triangle in the figure.

records. Member 9 attended three times and each time he could meet member

8 and 14. Member 10 has the similar preference with the majority of members Our algorithm gives the perfect division when processing the whole

from Group 1. Hence it is reasonable to assemble members 9 and 10 into first network of 62 nodes. As shown, a community consists of all squares, and

group. Same information can be dig out from the attendance record to support another one consists of all circles and one triangle. If we remove the key

our result. This result improved an earlier result obtained by Zhao and Zhang member represented by yellow triangle in above figure, then our algorithm

(2011), which uses fixed height cut and adds member 4 into Group 1 and 10 outputs almost same result except one member was wrongly grouped. In fact,

to Group 2. Their result is also identical with one obtained by Ronald Breiger this member has only two neighbors in the network, one is the key member

using duality analysis (Breiger, 1974).

4.4. Political books network

The nodes of this network represent books about US politics published around the time of the 2004 presidential election and sold by the online bookseller Amazon.com. The network, compiled by Krebs, consists of 105 nodes and 441 edges. An edge joins two books that were frequently co-purchased by the same buyers, as indicated by the feature "customers who bought this book also bought these other books" on Amazon. The nodes have been labeled l, n or c to indicate whether they are liberal, neutral or conservative.

Our results are listed below and compared with those obtained by Newman (http://www-personal.umich.edu/~mejn/netdata/) and by Zhao and Zhang (2011).

True clusters   Zhao and Zhang   Newman's group   Our group
13              4/13 (9)         2/7 (11)         4/10 (9)
49              45/52 (4)        46/53 (3)        46/53 (3)
43              38/40 (5)        40/45 (3)        39/42 (4)
Total           87/105 (18)      88/105 (17)      89/105 (16)

In the above table, the cell value "4/13 (9)" means that 4 of the 13 members of the first group in Zhao and Zhang's result are correct. The numbers in parentheses are the numbers of books wrongly grouped.

and the other one belongs to the group consisting of all red squares in the figure.

The application of our algorithm to the dolphins data set also demonstrates, to some extent, the robustness of our algorithm to node removal.

4.6. Example with overlapping

All examples listed so far yield a perfect partition of the original network. The phenomenon of multi-membership has been observed recently, especially in larger and more complex networks. The following example, constructed by our group, is used to validate our algorithm on networks with this feature. The network N consists of 4 communities $Q_1, \ldots, Q_4$, each of which is a clique $K_4$, with $V(Q_1) = \{1, 2, 3, 4\}$, $V(Q_2) = \{4, 5, 6, 7\}$, $V(Q_3) = \{7, 8, 9, 10\}$ and $V(Q_4) = \{10, 11, 12, 1\}$. Note that any two adjacent communities share exactly one member ($|V(Q_i) \cap V(Q_{i+1})| = 1$ for each $i = 1, \ldots, 4$, with indices taken mod 4).

Fig. 7 contains all information obtained by our algorithm. The density of each node and the weight on each edge are also labeled. The minimum edge-cut is represented by dotted lines. Cutting through the dotted lines gives us four sub-networks, where members 1, 4, 7 and 10 have the feature of multi-membership.

5. Conclusion

In this paper, we have investigated the problem of detecting optimal community structure in social networks. Traditional results are always obtained from a dendrogram by cutting at some fixed height. This methodology is observed to output unexpected results in many cases (see the example in Section 1). A novel algorithm

X. Qi et al. / Pattern Recognition Letters 36 (2014) 46–53 53

was introduced here by constructing a weighted graph from the dendrogram. We can then apply max-flow and min-cut theory to the newly obtained weighted graph. We have validated our idea on a variety of data sets, ranging from synthetic graphs to real-world benchmark data sets.

The weight function in our algorithm plays the key role in the clustering analysis. The choice of weight function depends on how we define the social network. This aspect of our algorithm requires that node associations be reflected in the user-defined weight function. A larger weight must be assigned when a parent and child pair have closer densities.

References

Bondy, J.A., Murty, U.S.R., 1976. Graph Theory with Applications. Macmillan, London.

Breiger, R.L., 1974. The duality of persons and groups. Social Forces 53 (2), 181–190.

Carlson, M., Zhang, B., Fang, Z., Horvath, S., Mishel, P., Nelson, S., 2006. Gene connectivity function and sequence conservation: predictions from modular yeast co-expression networks. BMC Genomics 7, 40.

Cormen, T., Leiserson, C., Rivest, R., Stein, C., 2009. Introduction to Algorithms, third ed. MIT.

Defays, D., 1977. An efficient algorithm for a complete link method. The Computer Journal (British Computer Society) 20 (4), 364–366.

Dong, J., Horvath, S., 2007. Understanding network concepts in modules. BMC Systems Biology 1 (1), 24.

Fiedler, M., 1973. Algebraic connectivity of graphs. Czechoslovak Mathematical Journal 23, 298–305.

Ford, L.R., Fulkerson, D.R., 1956. Maximal flow through a network. Canadian Journal of Mathematics 8, 399–404.

Gargalovic, P., Imura, M., Zhang, B., Gharavi, N., Clark, M., Pagnon, J., Yang, W., He, A., Truong, A., Patel, S., Nelson, S., Horvath, S., Berliner, J., Kirchgessner, T., Lusis, A., 2006. Identification of inflammatory gene modules based on variations of human endothelial cell responses to oxidized lipids. PNAS 103 (34), 12741–12746.

Ghazalpour, A., Doss, S., Zhang, B., Plaisier, C., Wang, S., Schadt, E., Thomas, A., Drake, T., Lusis, A., Horvath, S., 2006. Integrating genetic and network analysis to characterize genes related to mouse weight. PLoS Genetics 2 (8), e130.

Girvan, M., Newman, M., 2001. Community structure in social and biological networks. PNAS 98, 404–409.

Goldberg, A.V., Tarjan, R.E., 1988. A new approach to the maximum-flow problem. Journal of the ACM 35, 921–940.

Hartigan, J.A., 1975. Clustering Algorithms. John Wiley & Sons, Inc.

Hastie, T., Tibshirani, R., Friedman, J., 2001. The Elements of Statistical Learning. Springer, New York.

Homans, G., 1950. The Human Group. Harcourt, Brace and World, New York.

Kaufman, L., Rousseeuw, P., 1990. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, Inc., New York.

Lusseau, D., Schneider, K., Boisseau, O.J., Haase, P., Slooten, E., Dawson, S.M., 2005. The bottlenose dolphin community of Doubtful Sound features a large proportion of long-lasting associations. Can geographic isolation explain this unique trait? Behavioral Ecology and Sociobiology 54, 396–405.

Newman, M., 2004. Detecting community structure in networks. European Physical Journal B 38, 321–330.

Newman, M., 2006. Finding community structure in networks using the eigenvectors of matrices. Physical Review E 74, 036104.

Ou, Y., Zhang, C.-Q., 2007. A new multi-membership clustering algorithm. Journal of Industrial and Management Optimization 3 (4), 619–624.

Palla, G., Derényi, I., Farkas, I., Vicsek, T., 2005. Uncovering the overlapping community structure of complex networks in nature and society. Nature 435, 814–818.

Pothen, A., Simon, H., Liou, K., 1990. Partitioning sparse matrices with eigenvectors of graphs. SIAM Journal on Matrix Analysis and Applications 11, 430–452.

Scott, J., 2000. Social Network Analysis: A Handbook, second ed. Sage, London.

Sibson, R., 1973. SLINK: an optimally efficient algorithm for the single-link cluster method. The Computer Journal (British Computer Society) 16 (1), 30–34.

Zhao, P., Zhang, C.-Q., 2011. A new clustering algorithm and its application in social networks. Pattern Recognition Letters 32, 2109–2118.
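To make the overlapping example of Section 4.6 concrete, the following is a minimal Python sketch. It only builds the network N from the four vertex sets given in the text and checks the stated overlap property. It does not implement the community selection algorithm of this paper, and it does not compute the node densities or edge weights shown in Fig. 7. The helper names (clique_edges, build_network, multi_members) are illustrative assumptions and do not come from the original work.

```python
from itertools import combinations

# Vertex sets of the four communities, as given in Section 4.6.
COMMUNITIES = {
    "Q1": {1, 2, 3, 4},
    "Q2": {4, 5, 6, 7},
    "Q3": {7, 8, 9, 10},
    "Q4": {10, 11, 12, 1},
}


def clique_edges(vertices):
    """Return all edges of the complete graph on the given vertices."""
    return {frozenset(pair) for pair in combinations(sorted(vertices), 2)}


def build_network(communities):
    """Take the union of the cliques and return (vertex set, edge set)."""
    vertices = set().union(*communities.values())
    edges = set().union(*(clique_edges(v) for v in communities.values()))
    return vertices, edges


def multi_members(communities):
    """Return the sorted vertices that belong to more than one community."""
    counts = {}
    for members in communities.values():
        for v in members:
            counts[v] = counts.get(v, 0) + 1
    return sorted(v for v, c in counts.items() if c > 1)


vertices, edges = build_network(COMMUNITIES)

# Size of the intersection of each community with the next one (cyclically).
names = list(COMMUNITIES)
adjacent_overlaps = [
    len(COMMUNITIES[names[i]] & COMMUNITIES[names[(i + 1) % len(names)]])
    for i in range(len(names))
]

print(len(vertices), len(edges))   # 12 24
print(adjacent_overlaps)           # [1, 1, 1, 1]
print(multi_members(COMMUNITIES))  # [1, 4, 7, 10]
```

Because two distinct cliques share at most one vertex, no edge is counted twice, so N has 4 × 6 = 24 edges on 12 vertices. The multi-membership vertices reported by the sketch are exactly the members 1, 4, 7 and 10 named in Section 4.6.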
