
GRAPH MINING


Graphs model sophisticated structures and their interactions. Application areas include:

Chemical Informatics

Bioinformatics

Computer Vision

Video Indexing

Text Retrieval

Web Analysis

Social Networks

Mining frequent subgraph patterns supports characterization, discrimination, classification and cluster analysis, building graph indices, and similarity search

Mining Frequent Subgraphs

Graph g

Vertex Set V(g)

Edge set E(g)

Label function maps a vertex / edge to a label

Graph g is a subgraph of another graph g' if there exists a subgraph isomorphism from g to g'

Support(g) or frequency(g): the number of graphs in D = {G1, G2, ..., Gn} in which g is a subgraph

Frequent graph: a graph whose support is no less than min_sup
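The definitions above can be sketched in a few lines of Python. This is a minimal brute-force illustration, not an efficient miner: the graph representation (a dict of vertex labels plus a dict of edge labels), the helper names `make_graph`, `is_subgraph`, and `support`, and the toy database are all assumptions made for the example.

```python
from itertools import permutations


def make_graph(vertex_labels, edges):
    """Build a labeled undirected graph from {vertex: label} and (u, v, label) triples."""
    return vertex_labels, {frozenset((u, v)): lab for u, v, lab in edges}


def is_subgraph(pattern, graph):
    """True if some injective, label-preserving vertex mapping embeds pattern in graph."""
    p_vl, p_el = pattern
    g_vl, g_el = graph
    p_vertices = list(p_vl)
    # Try every injective assignment of pattern vertices to graph vertices.
    for image in permutations(g_vl, len(p_vertices)):
        mapping = dict(zip(p_vertices, image))
        if any(p_vl[v] != g_vl[mapping[v]] for v in p_vertices):
            continue
        # Every pattern edge must exist in the graph with the same label.
        if all(
            g_el.get(frozenset(mapping[v] for v in edge)) == label
            for edge, label in p_el.items()
        ):
            return True
    return False


def support(pattern, database):
    """Number of graphs in the database that contain the pattern."""
    return sum(is_subgraph(pattern, g) for g in database)


# Hypothetical toy database D = {G1, G2, G3}.
database = [
    make_graph({0: "C", 1: "C", 2: "O"}, [(0, 1, "-"), (1, 2, "-"), (0, 2, "-")]),
    make_graph({0: "C", 1: "C", 2: "N"}, [(0, 1, "-"), (1, 2, "-")]),
    make_graph({0: "C", 1: "O"}, [(0, 1, "=")]),
]
cc_bond = make_graph({0: "C", 1: "C"}, [(0, 1, "-")])
co_bond = make_graph({0: "C", 1: "O"}, [(0, 1, "-")])

min_sup = 2
print(support(cc_bond, database) >= min_sup)  # frequent under this min_sup?
print(support(co_bond, database) >= min_sup)
```

The exhaustive search over vertex mappings is what makes the frequency check expensive; practical miners avoid repeating it from scratch for every candidate.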

Discovery of Frequent Substructures

Step 1: Generate frequent sub-structure candidates

Step 2: Check the frequency of each candidate. This involves a subgraph isomorphism test, which is computationally expensive (NP-complete)

Approaches

Apriori based approach

Pattern Growth approach

Apriori-based Approach

Start with graphs of small size and generate candidates with an extra vertex, edge, or path

AGM (Apriori-based Graph Mining)

Vertex-based candidate generation: increases the substructure size by one vertex at each step

Two frequent size-k graphs are joined only if they share the same size-(k-1) subgraph (size = number of vertices)

The new candidate has the size-(k-1) common component and the two additional vertices

Two different substructures can be formed, depending on whether the two additional vertices are connected by an edge

FSG (Frequent Sub-graph mining)

Edge-based candidate generation: increases the substructure size by one edge at a time

Two size-k patterns are merged if and only if they share the same subgraph with k-1 edges (the core)

The new candidate has the core and the two additional edges

Edge disjoint path method

Classify graphs by number of disjoint paths they have

Two paths are edge-disjoint if they do not share any common edge

A substructure pattern with k+1 disjoint paths is generated by joining sub-structures with k disjoint paths
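As a small illustration of the edge-disjointness test used by this method, the sketch below represents a path as a sequence of vertices. The function names and the sample paths are assumptions for the example; the join step itself is not shown.

```python
def path_edges(path):
    """Undirected edges of a path given as a vertex sequence."""
    return {frozenset(pair) for pair in zip(path, path[1:])}


def are_edge_disjoint(path_a, path_b):
    """Two paths are edge-disjoint if they share no common edge."""
    return path_edges(path_a).isdisjoint(path_edges(path_b))


# Hypothetical paths: p1 and p2 share vertices a and c but no edge;
# p1 and p3 share the edge b-c.
p1 = ["a", "b", "c"]
p2 = ["a", "d", "c"]
p3 = ["b", "c", "e"]
print(are_edge_disjoint(p1, p2), are_edge_disjoint(p1, p3))
```

Sharing vertices is allowed; only shared edges break disjointness.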

Disadvantage of Apriori Approaches

Overhead when joining two sub-structures

Uses a BFS strategy: level-wise candidate generation

To check whether a size-(k+1) graph is frequent, all of its size-k subgraphs must be checked

May consume more memory

Pattern-Growth Approach

Can use BFS as well as DFS search

A graph g can be extended by adding a new edge e. The newly formed graph is denoted by g ◇x e.

Edge e may or may not introduce a new vertex to g.

If e introduces a new vertex, the new graph is denoted by g ◇xf e; otherwise, g ◇xb e, where f or b indicates that the extension is in a forward or backward direction.
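The forward/backward distinction can be sketched as follows. The representation (a vertex set plus a set of undirected edges) and the function name `extend` are assumptions for illustration; labels are omitted to keep the sketch short.

```python
def extend(vertices, edges, u, v):
    """Add edge (u, v) to a simple graph and report the kind of extension.

    u must already belong to the graph. The extension is "forward" if v is a
    new vertex and "backward" if v is already present.
    """
    if u not in vertices:
        raise ValueError("extension must start from an existing vertex")
    edge = frozenset((u, v))
    if u == v or edge in edges:
        raise ValueError("self-loops and duplicate edges are not allowed here")
    kind = "backward" if v in vertices else "forward"
    return vertices | {v}, edges | {edge}, kind


# Hypothetical path graph 0 - 1 - 2.
V = {0, 1, 2}
E = {frozenset((0, 1)), frozenset((1, 2))}

V_b, E_b, kind_b = extend(V, E, 2, 0)  # closes a cycle: no new vertex
V_f, E_f, kind_f = extend(V, E, 2, 3)  # introduces vertex 3
print(kind_b, kind_f)
```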

Pattern Growth Approach

For each discovered graph g, extensions are performed recursively until all frequent graphs with g embedded are found

Simple but inefficient

The same graph may be discovered multiple times (duplicate graphs)

Pattern Growth in gSpan Algorithm

Reduces generation of duplicate graphs

Does not extend duplicate graphs

Uses Depth First Order

A graph may have several DFS trees

The visiting order of vertices forms a linear order, recorded as subscripts (i < j means vi is visited before vj)

In a DFS tree, the starting vertex is the root and the last visited vertex is the right-most vertex

The path from v0 to vn is the right-most path

gSpan Algorithm

gSpan restricts the extension method

A new edge e can be added

between the right-most vertex and another vertex on the right-most path (backward extension);

or it can introduce a new vertex and connect to a vertex on the right-most path (forward extension)

Right-most extension, denoted by G ◇r e

Chooses any one DFS tree as the base subscripting and extends it

Each subscripted graph is transformed into an edge sequence, called a DFS code

Select the subscripting that generates the minimum sequence (the minimum DFS code)

Edge order: maps edges in a subscripted graph into a sequence

Sequence order: builds an order among edge sequences

DFS code tree:

Root: the empty code

Each node is a DFS code encoding a graph

Each edge is a right-most extension from a (k-1)-length DFS code to a k-length DFS code

If codes s and s' encode the same graph and s < s', the search space under s' can be safely pruned
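A minimal sketch of the right-most path and the extension positions it allows. It assumes a DFS code is a list of 5-tuples (i, j, l_i, l_ij, l_j) in which forward edges have i < j and backward edges have i > j. The sample code and the function names are illustrative assumptions; the minimum-DFS-code test used for pruning is not shown.

```python
def rightmost_path(dfs_code):
    """Vertex subscripts on the right-most path, root first."""
    # Forward edges (i < j) define the DFS tree: j was discovered from i.
    parent = {j: i for i, j, *_ in dfs_code if i < j}
    if not parent:
        return [0]
    vertex = max(parent)  # right-most vertex = last discovered vertex
    path = [vertex]
    while vertex in parent:
        vertex = parent[vertex]
        path.append(vertex)
    return path[::-1]


def rightmost_extensions(dfs_code):
    """(from, to) subscript pairs permitted by right-most extension."""
    path = rightmost_path(dfs_code)
    rm = path[-1]
    existing = {frozenset((i, j)) for i, j, *_ in dfs_code}
    # Backward: from the right-most vertex to another vertex on the path.
    backward = [(rm, v) for v in path[:-1] if frozenset((rm, v)) not in existing]
    # Forward: from any vertex on the path to a brand-new vertex.
    new_vertex = rm + 1
    forward = [(v, new_vertex) for v in reversed(path)]
    return backward, forward


# Hypothetical DFS code of a 5-vertex labeled graph.
code = [
    (0, 1, "X", "a", "Y"),
    (1, 2, "Y", "b", "X"),
    (2, 0, "X", "a", "X"),
    (2, 3, "X", "c", "Z"),
    (3, 1, "Z", "b", "Y"),
    (1, 4, "Y", "d", "Z"),
]
backward, forward = rightmost_extensions(code)
print(rightmost_path(code), backward, forward)
```

Restricting growth to these positions is what keeps many duplicate graphs from being generated in the first place.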

Mining Closed Frequent Substructures

Helps to overcome the problem of pattern explosion

A frequent graph G is closed if and only if there is no proper supergraph G' that has the same support as G.

CloseGraph algorithm

A frequent pattern G is maximal if and only if there is no frequent super-pattern of G.

The maximal pattern set is a subset of the closed pattern set.

But it cannot be used to reconstruct the entire set of frequent patterns with their supports: the support of a proper sub-pattern of a maximal pattern is lost when it differs from that of the maximal pattern
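The two definitions can be compared on a toy example. To avoid a subgraph isomorphism test, the sketch assumes relational graphs with unique vertex labels, so a pattern is just a set of edges and "supergraph" reduces to "superset". The patterns, supports, and function names are made up for illustration.

```python
def closed_patterns(freq):
    """Patterns with no proper super-pattern of the same support."""
    return {
        p for p, s in freq.items()
        if not any(p < q and freq[q] == s for q in freq)
    }


def maximal_patterns(freq):
    """Patterns with no frequent proper super-pattern at all."""
    return {p for p in freq if not any(p < q for q in freq)}


# Hypothetical frequent patterns (edge sets) with their supports.
ab, bc, cd = ("a", "b"), ("b", "c"), ("c", "d")
freq = {
    frozenset([ab]): 3,
    frozenset([bc]): 3,
    frozenset([cd]): 2,
    frozenset([ab, bc]): 3,
    frozenset([bc, cd]): 2,
    frozenset([ab, bc, cd]): 2,
}
closed = closed_patterns(freq)
maximal = maximal_patterns(freq)
print(len(freq), len(closed), len(maximal))
```

Here the single maximal pattern has support 2, so the support of 3 for the pattern {ab, bc} cannot be recovered from it, whereas the closed set keeps that information.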

Mining Alternative Substructure Patterns

Mining unlabeled or partially labeled graphs

A new empty label is assigned to vertices and edges that do not have labels

Mining non-simple graphs

A non-simple graph may have self-loops and multiple edges

Growing order: backward edges, then self-loops, then forward edges

To handle multiple edges: allow sharing of the same vertices in two neighboring edges in a DFS code

Mining directed graphs

Each edge is represented by a 6-tuple (i, j, d, l_i, l_(i,j), l_j), where d = +1 or -1 indicates the edge direction

Mining disconnected graphs

The graph or the pattern may be disconnected

Disconnected graph: add a virtual vertex to connect the disconnected components

Disconnected graph pattern: treated as a set of connected graphs

Mining frequent subtrees

Tree: a degenerate graph

Constraint based Mining of Substructure Patterns

Element, set, or subgraph containment constraint

The user requires that the mined patterns contain a particular set of subgraphs: a succinct constraint

Geometric constraint

A geometric constraint can be that the angle between each pair of connected edges must be within a range: an anti-monotonic constraint

Value-sum constraint

The sum of (positive) weights on the edges must be within a range between low and high: sum > low is monotonic, sum < high is anti-monotonic

Multiple categories of constraints may also be enforced
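The behaviour of the two value-sum bounds as a pattern grows can be traced with a short sketch. The weights, the bounds, and the function name `constraint_trace` are assumptions for illustration; each step adds one edge with a positive weight.

```python
from itertools import accumulate


def constraint_trace(edge_weights, low, high):
    """For each growth step, return (sum, sum > low, sum < high)."""
    return [
        (total, total > low, total < high)
        for total in accumulate(edge_weights)
    ]


# Hypothetical pattern grown one edge at a time with weights 2, 3, 4.
trace = constraint_trace([2, 3, 4], low=4, high=8)
for total, above_low, below_high in trace:
    print(total, above_low, below_high)
```

With positive weights the sum only grows. Once sum < high fails it fails for every super-pattern, so the branch can be pruned (anti-monotonic). Once sum > low holds it keeps holding, so it need not be rechecked (monotonic).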

Mining Approximate Frequent Substructures

Approximate frequent substructures allow slight structural variations

Several slightly different frequent substructures can be represented using one approximate substructure

SUBDUE: a substructure discovery system

based on the Minimum Description Length (MDL) principle

adopts a constrained beam search

SUBDUE performs approximate matching

Mining Coherent and Dense Substructures

A frequent substructure G is a coherent subgraph if the mutual information between G and each of its own subgraphs is above some threshold

Reduces the number of patterns mined

Application: coherent substructure mining selects a small subset of features that have high distinguishing power between protein classes

Relational graph: each label is used only once per graph

Frequent highly connected or dense subgraph mining

People with strong associations in online social networks (OSNs)

Set of genes within the same functional module

Density cannot be judged from average degree or minimum degree alone

Connectedness must be ensured

Example: a graph with average degree 3.25 and minimum degree 3 may still not be highly connected
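The figure for this example is not part of the text. One graph that reproduces the same numbers, used here purely as an assumed illustration, is two 4-vertex cliques joined by a single edge: every vertex has degree at least 3, yet removing that one edge splits the graph.

```python
from itertools import combinations


def two_cliques_with_bridge():
    """Edges of two K4 cliques (vertices 0-3 and 4-7) joined by edge (3, 4)."""
    left = list(combinations(range(4), 2))
    right = list(combinations(range(4, 8), 2))
    return left + right + [(3, 4)]


def degrees(edges):
    """Map each vertex to its degree."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return deg


deg = degrees(two_cliques_with_bridge())
average_degree = sum(deg.values()) / len(deg)
minimum_degree = min(deg.values())
print(average_degree, minimum_degree)
```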

Mining Dense Substructures

Dense graphs defined in terms of Edge Connectivity

Given a graph G, an edge cut is a set of edges Ec such that E(G) - Ec is disconnected.

A minimum cut is the smallest set in all edge cuts.

The edge connectivity of G is the size of a minimum cut.

A graph is dense if its edge connectivity is no less than a specified minimum cut threshold
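Edge connectivity can be computed by brute force on very small graphs, which is enough to illustrate the definition. The sketch below tries edge cuts of increasing size until the graph becomes disconnected. The function names, the threshold value, and the test graphs are assumptions, and the approach is exponential, so it is only suitable for toy inputs.

```python
from itertools import combinations


def is_connected(vertices, edges):
    """True if the undirected graph (vertices, edges) is connected."""
    vertices = set(vertices)
    if not vertices:
        return True
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    start = next(iter(vertices))
    seen, stack = {start}, [start]
    while stack:
        for w in adj[stack.pop()]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen == vertices


def edge_connectivity(vertices, edges):
    """Size of a minimum edge cut, found by trying cuts of growing size."""
    edges = list(edges)
    for k in range(len(edges) + 1):
        for cut in combinations(edges, k):
            remaining = [e for e in edges if e not in cut]
            if not is_connected(vertices, remaining):
                return k
    return len(edges)  # a single-vertex graph cannot be disconnected


def is_dense(vertices, edges, min_cut_threshold):
    """Dense if edge connectivity is no less than the threshold."""
    return edge_connectivity(vertices, edges) >= min_cut_threshold


k4 = list(combinations(range(4), 2))
bridged = k4 + list(combinations(range(4, 8), 2)) + [(3, 4)]
print(edge_connectivity(range(4), k4), edge_connectivity(range(8), bridged))
```

Under a threshold of 2, the single clique counts as dense, while the two cliques joined by one edge do not, even though both have minimum degree 3.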

Mining Dense Substructures

Pattern-growth approach called CloseCut (scalable)

Starts with a small frequent candidate graph and extends it until it finds the largest supergraph with the same support

Pattern-reduction approach called Splat (high performance)

Directly intersects relational graphs to obtain highly connected graphs

A pattern g discovered in a set is progressively intersected with subsequent components to give g’

Some edges in g may be removed

The size of candidate graphs is reduced by intersection and decomposition operations.
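The intersection and decomposition operations can be illustrated on relational graphs, where labels are unique and a graph can be treated as a set of edges. This is only a sketch of the two operations under that assumption, not the Splat algorithm itself; the function names and the three sample graphs are hypothetical.

```python
def intersect(graphs):
    """Edges common to every graph (each graph is a set of edges)."""
    graphs = iter(graphs)
    common = set(next(graphs))
    for g in graphs:
        common &= set(g)
    return common


def components(edges):
    """Decompose an edge set into connected components (as edge sets)."""
    remaining = set(edges)
    result = []
    while remaining:
        first = remaining.pop()
        comp, verts = {first}, set(first)
        grew = True
        while grew:
            grew = False
            for e in list(remaining):
                if verts & e:  # edge touches the component built so far
                    remaining.discard(e)
                    comp.add(e)
                    verts |= e
                    grew = True
        result.append(comp)
    return result


def edge(u, v):
    return frozenset((u, v))


# Hypothetical relational graphs over vertices a..f.
g1 = {edge("a", "b"), edge("b", "c"), edge("c", "a"), edge("d", "e")}
g2 = g1 | {edge("e", "f")}
g3 = g1 | {edge("c", "d")}

common = intersect([g1, g2, g3])
parts = components(common)
print(len(common), sorted(len(p) for p in parts))
```

Intersection can only remove edges, so candidates shrink at every step, and decomposition then splits what is left into separately checkable pieces.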

Applications: Graph Indexing

Indexing is essential for efficient search and query processing

Traditional approaches are not feasible for graphs

Indexing based on nodes / edges / sub-graphs

Path based Indexing approach

Enumerate all the paths in a database up to maxL length and index them

Index is used to identify all graphs with the paths in query

Not suitable for complex graph queries

Structural information is lost when a query graph is broken apart

Many false positives may be returned
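A minimal sketch of path-based indexing and of how a false positive arises. Graphs are assumed to be stored as a pair of dicts (vertex labels, adjacency lists); the function names, the graph ids, and the toy database are illustrative assumptions, and `max_len` plays the role of maxL.

```python
def label_paths(vertex_labels, adjacency, max_len):
    """Label sequences of all simple paths with at most max_len edges."""
    found = set()

    def walk(path):
        found.add(tuple(vertex_labels[v] for v in path))
        if len(path) - 1 < max_len:
            for w in adjacency[path[-1]]:
                if w not in path:
                    walk(path + [w])

    for v in vertex_labels:
        walk([v])
    return found


def build_index(database, max_len):
    """Map each label path to the ids of the graphs that contain it."""
    index = {}
    for gid, (labels, adj) in database.items():
        for p in label_paths(labels, adj, max_len):
            index.setdefault(p, set()).add(gid)
    return index


def candidates(index, query, max_len, all_ids):
    """Graphs containing every path of the query (may include false positives)."""
    labels, adj = query
    result = set(all_ids)
    for p in label_paths(labels, adj, max_len):
        result &= index.get(p, set())
    return result


# Hypothetical database: a chain A-A-A, a triangle of A's, and an edge A-B.
database = {
    "g1": ({0: "A", 1: "A", 2: "A"}, {0: [1], 1: [0, 2], 2: [1]}),
    "g2": ({0: "A", 1: "A", 2: "A"}, {0: [1, 2], 1: [0, 2], 2: [0, 1]}),
    "g3": ({0: "A", 1: "B"}, {0: [1], 1: [0]}),
}
index = build_index(database, max_len=2)
triangle_query = database["g2"]
result = candidates(index, triangle_query, max_len=2, all_ids=database)
print(sorted(result))
```

The chain g1 survives the filter because it contains every label path of the triangle query, although it does not contain the triangle itself. A subgraph isomorphism check is still needed to verify the candidates.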

gIndex considers frequent and discriminative substructures as index features

A frequent substructure is discriminative if its support cannot be well approximated by the intersection of the graph sets that contain one of its subgraphs

Achieves good query performance at a lower index cost than path-based indexing

Substructure Similarity Search

Bioinformatics and cheminformatics applications involve query-based search in massive, complex structural data

Grafil (Graph Similarity Filtering)

Feature based structural filtering

Models each query graph as a set of features

Edge deletions in the query lead to feature misses

Too many features reduce performance

Multi-filter composition strategy

Feature set: a group of similar features

Classification and Cluster Analysis using Graph Patterns

Graph Classification

Mine frequent graph patterns

Features that are frequent in one class but less frequent in another: discriminative features

Model construction

Can adjust frequency and connectivity thresholds

Classifiers such as SVM, naive Bayes (NBM), etc. are used

Cluster Analysis

Cluster similar graphs based on graph connectivity (minimal cuts)

Hierarchical clusters based on support threshold

Outliers can also be detected

Graph pattern mining and classification/clustering form an inter-related process