
Data Mining: Concepts and Techniques

December 11, 2015

What is Cluster Analysis?

Cluster: a collection of data objects
  Similar to one another within the same cluster
  Dissimilar to the objects in other clusters
Cluster analysis
  Grouping a set of data objects into clusters
Clustering is unsupervised classification: no predefined classes
Typical applications
  As a stand-alone tool to get insight into data distribution
  As a preprocessing step for other algorithms

General Applications of Clustering

Pattern Recognition
Spatial Data Analysis
  Create thematic maps in GIS by clustering feature spaces
  Detect spatial clusters and explain them in spatial data mining
Image Processing
Economic Science (especially market research)
WWW
  Document classification
  Cluster Weblog data to discover groups of similar access patterns

Examples of Clustering Applications

Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs

Land use: Identification of areas of similar land use in an earth observation database

Insurance: Identifying groups of motor insurance policy holders with a high average claim cost

City-planning: Identifying groups of houses according to their house type, value, and geographical location

Earthquake studies: Observed earthquake epicenters should be clustered along continental faults

What Is Good Clustering?

A good clustering method will produce high quality clusters with
  high intra-class similarity
  low inter-class similarity

Requirements of Clustering in Data Mining

Scalability

Ability to deal with different types of attributes

Ability to deal with noise and outliers

Data Structures

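Data matrix (two modes)

\[
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
\]

Dissimilarity matrix (one mode)

\[
\begin{bmatrix}
0      &        &        &        &   \\
d(2,1) & 0      &        &        &   \\
d(3,1) & d(3,2) & 0      &        &   \\
\vdots & \vdots & \vdots &        &   \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
\]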
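
The relationship between the two structures can be sketched in a few lines of Python. This is only an illustration: it assumes Euclidean distance as d(i, j) and uses a made-up data matrix with n = 3 objects and p = 2 variables; the function name `dissimilarity_matrix` is hypothetical.

```python
import math

def dissimilarity_matrix(data):
    """Build the lower-triangular dissimilarity matrix of an n x p data matrix."""
    matrix = []
    for i in range(len(data)):
        # Row i holds d(i, 0) ... d(i, i-1), followed by d(i, i) = 0.
        row = [math.dist(data[i], data[j]) for j in range(i)] + [0.0]
        matrix.append(row)
    return matrix

# Hypothetical data matrix: 3 objects (rows) described by 2 variables (columns).
data = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
for row in dissimilarity_matrix(data):
    print(row)
```

Only the lower triangle is stored because d(i, j) = d(j, i) and d(i, i) = 0.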

Types of variables in clustering analysis

Interval-scaled variables

Binary variables

Nominal, ordinal variables

Variables of mixed types

Mean absolute deviations: Real/Interval-valued variables (continuous measurements)

If each variable has its own disparate scale, then we can standardize each of the variables to a mean of zero and a variability of one.

Standardizing data

Calculate the mean absolute deviation for variable f:

\[ s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right) \]

where

\[ m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right) \]

Calculate the standardized measurement (z-score):

\[ z_{if} = \frac{x_{if} - m_f}{s_f} \]

Then use distances/similarities based on standardized scores
Examples: longitude and latitude coordinates (when you cluster houses), weights, heights and weather temperatures
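
A minimal Python sketch of this standardization follows. The function name `standardize` and the height values are assumptions made for illustration; the code assumes the variable is not constant, since a mean absolute deviation of zero would make the division undefined.

```python
def standardize(values):
    """Standardize one variable using its mean absolute deviation."""
    n = len(values)
    m_f = sum(values) / n                        # mean of variable f
    s_f = sum(abs(x - m_f) for x in values) / n  # mean absolute deviation
    return [(x - m_f) / s_f for x in values]     # z-score of every object

# Hypothetical heights of four people, in centimeters and in inches.
heights_cm = [150.0, 160.0, 170.0, 180.0]
heights_in = [h / 2.54 for h in heights_cm]

print(standardize(heights_cm))  # [-1.5, -0.5, 0.5, 1.5]
print(standardize(heights_in))  # the same scores, up to rounding error
```

The z-scores do not depend on the unit of measurement, which is the point of standardizing before computing distances.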

Real/Interval-valued variables
(continuous measurements)

Changing the measurement unit, for example from meters to inches for height or from kilograms to pounds for weight, may lead to a different clustering structure. Hence standardizing the data before unsupervised learning is essential. Data for a variable can be standardized based on its mean absolute deviation. After standardizing, the distances (or similarities) between objects should be computed.

Similarity and Dissimilarity Between Objects

Distances are normally used to measure the similarity or dissimilarity between two data objects

Some popular ones include: Minkowski distance:

\[ d(i,j) = \sqrt[q]{|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q} \]

Manhattan distance

\[ d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}| \]

Similarity and Dissimilarity Between Objects (Cont.)

Euclidean distance:

\[ d(i,j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2} \]

Also, one can use weighted distance, parametric Pearson product moment correlation, or other dissimilarity measures
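
The Minkowski, Manhattan and Euclidean distances can be computed with one short function, since the Manhattan and Euclidean distances are the Minkowski distance with q = 1 and q = 2. The sketch below is illustrative; the function names and the two sample objects are assumptions.

```python
def minkowski(x, y, q):
    """Minkowski distance of order q between two p-dimensional objects."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

def manhattan(x, y):
    return minkowski(x, y, 1)  # q = 1

def euclidean(x, y):
    return minkowski(x, y, 2)  # q = 2

# Two hypothetical objects described by p = 2 variables.
obj_i = [1.0, 2.0]
obj_j = [4.0, 6.0]
print(manhattan(obj_i, obj_j))  # 7.0
print(euclidean(obj_i, obj_j))  # 5.0
```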

Binary/Nominal (Categorical) Variables

A nominal variable is a generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green

Categorical: class A, B, C, etc.

Ordinal: excellent, good, fair, etc.

Ratio-scaled: 34, 234, 123, etc.

Ordinal Variables

An ordinal variable can be discrete or continuous

Order is important, e.g., rank (sports medals, professors' ranks, loan credit rankings)

Major Clustering Approaches (current research)

Partitioning algorithms: Construct various partitions and then evaluate them by some criterion

Hierarchical algorithms: Create a hierarchical decomposition of the set of data (or objects) using some criterion

Density-based: based on connectivity and density functions

Grid-based: based on a multiple-level granularity structure

The K-Means Algorithm


1. Choose a value for K, the total number of clusters.
2. Randomly choose K points as cluster centers.
3. Assign the remaining objects to their closest cluster center (based on the mean value of the objects in the cluster).
4. Calculate a new cluster center for each cluster.
5. Repeat steps 3 and 4 until the cluster centers do not change.
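
The steps above can be sketched in plain Python as follows. This is a minimal illustration, not a production implementation: it assumes Euclidean distance, picks the initial centers from the data points, reassigns every object (including the initial centers) in each pass, and keeps the old center when a cluster becomes empty. The function name `k_means`, its parameters, and the sample points are all hypothetical.

```python
import math
import random

def k_means(points, k, seed=0, max_iter=100):
    """Cluster a list of equal-length tuples into k groups."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)            # step 2: random initial centers
    clusters = []
    for _ in range(max_iter):
        # Step 3: assign every object to its closest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[nearest].append(p)
        # Step 4: recompute each center as the mean of its members.
        new_centers = []
        for c, members in enumerate(clusters):
            if members:
                mean = tuple(sum(col) / len(members) for col in zip(*members))
                new_centers.append(mean)
            else:
                new_centers.append(centers[c])  # empty cluster: keep old center
        # Step 5: stop once the centers no longer change.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

# Hypothetical 2-D data with two visibly separated groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centers, clusters = k_means(points, k=2)
print(centers)
print(clusters)
```

Because the initial centers are random, different seeds can produce different results on less well-separated data.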

The K-Means Clustering Method

Example

[Figure: K-means with K = 2 on a scatter plot with both axes running from 0 to 10. Panels: arbitrarily choose K objects as initial cluster centers; assign each object to the most similar center; update the cluster means; reassign; update the cluster means; reassign.]

Comments on the K-Means Method

Strength: Relatively efficient:

Weakness
  Applicable only when the mean is defined; what about categorical data?
  Need to specify k, the number of clusters, in advance
  Unable to handle noisy data and outliers

Used in market and customer segmentation

Hierarchical Clustering

Use distance matrix as clustering criteria. This method does not require the number of clusters k as an input, but needs a termination condition

[Figure: agglomerative (AGNES) clustering proceeds from Step 0 to Step 4, merging single objects into ab, de, cde, and finally abcde; divisive (DIANA) clustering runs in the opposite direction, from Step 0 to Step 4, splitting abcde back into single objects.]

A Dendrogram Shows How the Clusters are Merged Hierarchically

Decompose data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram.
A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.
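
Cutting the dendrogram at a given level is equivalent to running agglomerative merging until the closest pair of clusters is farther apart than that level. The sketch below illustrates this under stated assumptions: it measures the distance between clusters as the shortest Euclidean distance between their members, and the names `agglomerate` and `cut_distance` as well as the sample points are hypothetical.

```python
import math

def cluster_distance(a, b):
    """Shortest distance between any member of a and any member of b."""
    return min(math.dist(p, q) for p in a for q in b)

def agglomerate(points, cut_distance):
    """Merge the two closest clusters until they are farther apart than cut_distance."""
    clusters = [[p] for p in points]           # start: every object is a cluster
    while len(clusters) > 1:
        candidates = [
            (cluster_distance(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        ]
        distance, i, j = min(candidates)       # closest pair of clusters
        if distance > cut_distance:            # termination condition
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]                        # j > i, so index i stays valid
    return clusters

# Hypothetical one-dimensional objects.
points = [(0.0,), (1.0,), (5.0,), (6.0,), (20.0,)]
print(len(agglomerate(points, cut_distance=2.0)))  # 3
print(len(agglomerate(points, cut_distance=4.0)))  # 2
```

A lower cut level yields more, smaller clusters; a higher one yields fewer, larger clusters.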

Distance Between Two Clusters


Single-link clustering (also called the connectedness or minimum method): we consider the distance between one cluster and another cluster to be equal to the shortest distance from any member of one cluster to any member of the other cluster. If the data consist of similarities, we consider the similarity between one cluster and another cluster to be equal to the greatest similarity from any member of one cluster to any member of the other cluster.

Complete-link clustering (also called the diameter or maximum method): we consider the distance between one cluster and another cluster to be equal to the longest distance from any member of one cluster to any member of the other cluster.

Average-link clustering: we consider the distance between one cluster and another cluster to be equal to the average distance from any member of one cluster to any member of the other cluster.

[Figure: three ways to measure the distance between two clusters. Min distance: single-link method / nearest neighbor. Max distance: complete-link / furthest neighbor. Average distance: distance between their centroids, or the average of all cross-cluster pairs.]
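
The three definitions differ only in how the member-to-member distances are aggregated, as the following sketch shows. It assumes Euclidean distance between members and takes the average over all cross-cluster pairs; the function names and the two sample clusters are illustrative.

```python
import math
from itertools import product

def single_link(a, b):
    """Shortest distance between a member of a and a member of b."""
    return min(math.dist(p, q) for p, q in product(a, b))

def complete_link(a, b):
    """Longest distance between a member of a and a member of b."""
    return max(math.dist(p, q) for p, q in product(a, b))

def average_link(a, b):
    """Average distance over all cross-cluster pairs."""
    return sum(math.dist(p, q) for p, q in product(a, b)) / (len(a) * len(b))

# Two hypothetical clusters on a line.
cluster_a = [(0.0, 0.0), (1.0, 0.0)]
cluster_b = [(4.0, 0.0), (6.0, 0.0)]
print(single_link(cluster_a, cluster_b))    # 3.0
print(complete_link(cluster_a, cluster_b))  # 6.0
print(average_link(cluster_a, cluster_b))   # 4.5
```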

Calculate distance between two records

Record 1
Name: carla
Prediction: yes
Age: 21
Balance: $2300
Income: high
Eyes: blue
Gender: F

Record 2
Name: carl
Prediction: no
Age: 27
Balance: $5400
Income: high
Eyes: brown
Gender: M

Example

Computing the distance for age and income is straightforward; eye color is the problem.
Mismatched colors: distance = 1
Exactly matching colors: distance = 0
Distances: age (6) + balance (3100) + income (high = 3, medium = 2, low = 1; here the same, so 0) + eyes (1) + gender (1) = 6 + 3100 + 0 + 1 + 1 = 3108.
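
The same calculation can be written as a short Python sketch. The dictionary layout, the `INCOME_RANK` table and the function name are assumptions made for illustration; the attribute values and the scoring rules come from the two records and the example above.

```python
# Ordinal coding of income, as in the example: high = 3, medium = 2, low = 1.
INCOME_RANK = {"low": 1, "medium": 2, "high": 3}

def record_distance(r1, r2):
    """Sum of per-attribute distances between two customer records."""
    total = abs(r1["age"] - r2["age"])                    # numeric
    total += abs(r1["balance"] - r2["balance"])           # numeric
    total += abs(INCOME_RANK[r1["income"]] - INCOME_RANK[r2["income"]])  # ordinal
    total += 0 if r1["eyes"] == r2["eyes"] else 1         # nominal: match = 0, mismatch = 1
    total += 0 if r1["gender"] == r2["gender"] else 1     # nominal: match = 0, mismatch = 1
    return total

carla = {"age": 21, "balance": 2300, "income": "high", "eyes": "blue", "gender": "F"}
carl = {"age": 27, "balance": 5400, "income": "high", "eyes": "brown", "gender": "M"}
print(record_distance(carla, carl))  # 3108
```

Note that the balance contributes 3100 of the 3108, so without standardization it dominates every other attribute.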

CHURN ANALYSIS

Road Map for Minimizing Churn Rate

The Problem
Churning means switching. In an increasingly competitive environment, customer retention has surfaced as one of the key problems faced by mobile service providers.

Business Objective: To minimise churn rate

CHURN ANALYSIS
Data Mining Goal
Identify customers with a delinquent nature.

Scope
Assign a churn score to all customers in order to identify those who are most likely to churn (per quarter, etc.).

Determine the most relevant parameters that influence the inclination to churn.

Clearly define segments that are strongly divided by their churn-related behavior.

CHURN ANALYSIS

Basic Understanding

There are mainly two types of churn

1. Customer Request
2. Forced Churn (Defaulters)

CHURN ANALYSIS

Information Sources
Call Statistics (CDR)
Credit History
Billing History
Revenue History
Payment History
Survey Data
Demographic data
Complaint information

CHURN ANALYSIS

Suggested Analysis
Pareto analysis

Also called 80/20 analysis. It has been observed that 80% of the revenue or profit comes from 20% of the customers. The key business improvement is to identify those 20% and serve them better.

Techniques/Reports/Algorithms

Characterization and summarization, Top 10 report, List, Cross tab reports, Graph charts, etc.
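
As a rough illustration of a Pareto report, the sketch below ranks customers by revenue and reports the share of total revenue contributed by the top 20%. The function name `pareto_share`, the customer identifiers and the revenue figures are all made up for the example.

```python
def pareto_share(revenue_by_customer, top_fraction=0.2):
    """Return the top customers by revenue and their share of total revenue."""
    ranked = sorted(revenue_by_customer.items(), key=lambda kv: kv[1], reverse=True)
    top_n = max(1, round(len(ranked) * top_fraction))  # at least one customer
    top = ranked[:top_n]
    total = sum(revenue_by_customer.values())
    share = sum(amount for _, amount in top) / total
    return [customer for customer, _ in top], share

# Hypothetical revenue per customer.
revenue = {
    "c01": 500, "c02": 300, "c03": 40, "c04": 35, "c05": 30,
    "c06": 25, "c07": 25, "c08": 20, "c09": 15, "c10": 10,
}
top_customers, share = pareto_share(revenue)
print(top_customers)           # ['c01', 'c02']
print(f"{share:.0%}")          # 80%
```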

CHURN ANALYSIS

Suggested Analysis
Loyalty Analysis

A loyal customer is worth as much as a new customer, so it pays to identify the loyal customers and increase their number. A loyal customer is defined as one who has been with the company for the last six months. This analysis gives insight into the complete details of the various customer bases.

Techniques/Reports/Algorithms
Characterization and summarization, Top 10 report, List, Cross tab reports, Graph charts, etc.

CHURN ANALYSIS

Suggested Analysis
Customer Profit Analysis

Identify winning and losing customers. A winning customer is one who generates increasing revenue month after month; a losing customer is the reverse. Identify their characteristics and the reasons behind them to support better decisions.

Techniques/Reports/Algorithms
Characterization and summarization, Top 10 report, List, Cross tab reports, Graph charts, etc.
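
One simple way to operationalize this definition is to check whether a customer's monthly revenue is strictly increasing or strictly decreasing. The sketch below is an assumption-laden illustration: the function name, the "neutral" label for mixed trends, and the revenue series are hypothetical and not part of the original analysis.

```python
def classify_customer(monthly_revenue):
    """Label a customer from a chronological list of monthly revenue figures."""
    changes = list(zip(monthly_revenue, monthly_revenue[1:]))
    if changes and all(later > earlier for earlier, later in changes):
        return "winning"   # revenue rises month after month
    if changes and all(later < earlier for earlier, later in changes):
        return "losing"    # revenue falls month after month
    return "neutral"       # mixed trend or too little history (hypothetical label)

print(classify_customer([100, 120, 150]))  # winning
print(classify_customer([150, 120, 100]))  # losing
print(classify_customer([100, 90, 120]))   # neutral
```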

CHURN ANALYSIS

Suggested Analysis
Trend Analysis

This is a visualization technique. It uses parallel coordinates from the database to show the trend of various measures over different time periods.

Techniques/Reports/Algorithms

Parallel coordinate graph

CHURN ANALYSIS

Suggested Analysis
Customer profiling

Segments include inactive accounts, light users, risky customers, active accounts, and loss-making and profit-making accounts. This segmentation helps in mapping against the predictive segments.

Techniques/Reports/Algorithms
List, Cross tab, Clustering, Graph charts

CHURN ANALYSIS

Suggested Analysis
LTV Analysis

Also called Lifetime Value (LTV) analysis. Revenue is projected over 25 years, together with the projected churn loss and churn rate.

Techniques/Reports/Algorithms

List report, Line graphs, Graph charts

CHURN ANALYSIS

Suggested Analysis
Churn Modeling

The cost of acquiring a new customer is higher than the cost of retaining one. Classification of churn models. Assign churn scores and build predictive models. (use classification

Techniques/Reports/Algorithms

Scoring, Decision tree, Neural network, Clustering.

CHURN ANALYSIS

Suggested Analysis
Survival Analysis
This predicts how long a customer will continue with the existing service, and what measures can be taken. One popular technique is K. Hazard analysis.

Techniques/Reports/Algorithms
K. Hazard technique

CHURN ANALYSIS
Suggested Approach

Derive customer segments and map them against their probability of churning and expected profits. Each segment has significantly different usage characteristics, demographics and channel selection.

Assign churn scores and create churn models

Thank You for your time
