
DSECL ZC415

Data Mining
Introduction
Revision 1.0

BITS Pilani Prof Vineet Garg


Work Integrated Learning Programmes Bangalore Professional Development Center
What is Data Mining?

 During the past few decades, the advancement in information technology, particularly in storage systems, has created an abundance of data.
 Several times we are in a situation where we are data rich but knowledge poor.
 Data mining is primarily Knowledge Discovery from Data (KDD).
 What is knowledge then?
2

BITS Pilani, WILP


Scenario-1
   #     Sepal Length (cm)   Sepal Width (cm)   Petal Length (cm)   Petal Width (cm)   Species
   ...   ...                 ...                ...                 ...                ...
   48    4.6                 3.2                1.4                 0.2                setosa
   51    7.0                 3.2                4.7                 1.4                versicolor
   52    6.4                 3.2                4.5                 1.5                versicolor
   ...   ...                 ...                ...                 ...                ...
   107   4.9                 2.5                4.5                 1.7                virginica
   111   6.5                 3.2                5.1                 2.0                virginica
   ...   ...                 ...                ...                 ...                ...

 There is a data set of 150 Iris flowers described by their sepal and petal characteristics.
 The task is to predict the species of a flower based on its given characteristics.
 When petal width is plotted against petal length, it gives an interesting observation.
 Most of the flowers in the dataset can be classified through this plot.
 The Setosa species is well separated, while there is a little overlap between Versicolor and Virginica.
 We are predicting the species of a flower.
 Will sepal characteristics also tell the same?
 Does the data set speak for itself?
 Is it knowledge?

(Figure: scatter plot of petal width vs. petal length for setosa, versicolor and virginica.)

BITS Pilani, WILP


Scenario-2
   Transaction ID   Items
   1                Bread, Butter, Diaper, Milk
   2                Coffee, Cookies, Sugar, Bread
   3                Eggs, Butter, Tea, Sugar
   4                Milk, Diaper, Sugar, Coffee
   5                Tea, Sugar, Spices, Fruits
   6                Milk, Diaper, Vegetables, Fruits
   7                Tea, Eggs, Bread
   8                Cookies, Coffee, Sugar
   9                Diaper, Milk, Fruits
   10               Milk, Diaper, Fish

 Point-of-sale data collected at the check-out counters of a grocery store.
 Is there any correlation among the items that tells which items are bought together by the customers?
 The following few associations can be established:
   {Diaper} -> {Milk}
   {Tea or Coffee} -> {Sugar}
 Can this knowledge discovery be used to keep these items together in a store?
 Can this knowledge discovery be used to sell other related items (cross-selling or up-selling): baby soap, toys, cookies, candies etc.?
 Is it knowledge?
 Does the data set speak for itself?
 Visual inspection reveals a few associations. If the data set is huge, will we miss a few?
BITS Pilani, WILP
Scenario-3

   Article ID   Word: Frequency in the Article
   1            Machinery: 5, Loan: 4, Labour: 3
   2            Medicine: 4, Doctors: 3, Nurses: 2
   3            Cancer: 13, Therapy: 8, Cigarettes: 8
   4            Cost: 18, Material: 16, Market: 11, Loan: 3
   5            Pharmacy: 11, Drug: 8, Clinical: 7, Nurses: 6
   6            Domestic: 12, Jobs: 10, Employment: 8, Labour: 4
   7            Olympics: 16, Health: 12, Encourage: 6, Diet: 2
   8            Tournament: 5, Medals: 4, Shoes: 3, Teams: 2
   9            Shoes: 12, Leather: 7, Tanning: 4, Employment: 3, Loan: 2
   10           Schools: 10, Baseball: 5, Teams: 3

 An on-line library is grouping articles based on the frequently occurring words in them.
 How many clusters of articles are possible from the given data?
   I.   Industry / Economy
   II.  Healthcare
   III. Sports
 Will it help in searching the relevant articles faster?
 Is it knowledge?

Project Idea: Think of creating classifier software that groups articles based on a dynamically growing knowledge base.
5

BITS Pilani, WILP


Scenario-4

Over the years, PeppiPizza accumulated customer data that can be clustered under the following categories representing the volume of the customers. Will this knowledge help in a "targeted campaign"?

Is it knowledge?

*Read the Pizza Hut story.

BITS Pilani, WILP


Scenario-5

 A bank monitors the usage of credit cards issued to its customers.
 For a few customers, the plot shows the average credit card purchase per week versus the purchase this week.
 The data shows a possible fraud, detected as an anomaly (the unusual data point in the plot).
 The bank can take steps, alerting the customer or blocking the card.
 Is it knowledge?
 An unusual data point is also called an outlier.

BITS Pilani, WILP


Scenario-6
 A social networking website suggests new friends to you.
 Chemical structures are analysed to identify new, stable and inexpensive drugs with similar or better properties.
 Is it knowledge?

BITS Pilani, WILP


Scenario-7

 A natural hot-water geyser in Yellowstone National Park, Wyoming, US erupts every few minutes after waiting for a certain duration.
 Why is it called "Faithful"?
 Is there any correlation between the duration it remains active and the duration it waits before the next eruption?
 There could be more such important
points to be noticed and studied to
make important geophysical
discoveries.
 Is it knowledge?

BITS Pilani, WILP


Scenario-8
 A jeweller receives the data from a
gemmologist about several varieties of
precious stones.
 This data has several attributes:
o Caratage
o Hardness
o Colour
o Crystal System
o Refractive Index
o Lustre
o Fusibility
o ......
 How would the jeweller price the stones based
on some reasoning and logic?
 Is it possible to group these stones?
 Is it possible to establish similarity and
dissimilarity and establish pricing?
10
 Does this process yield knowledge?
BITS Pilani, WILP
Scenario-9
Congressional Voting Records

Congressional Voting Records is a dataset of 435 transactions having 32 items covering 16 key issues, on which Republican and Democratic party members voted Yes/No. The issues are listed below:
1. handicapped-infants: 2 (y,n)
2. water-project-cost-sharing: 2 (y,n)
3. adoption-of-the-budget-resolution: 2 (y,n)
4. physician-fee-freeze: 2 (y,n)
5. el-salvador-aid: 2 (y,n)
6. religious-groups-in-schools: 2 (y,n)
7. anti-satellite-test-ban: 2 (y,n)
8. aid-to-nicaraguan-contras: 2 (y,n)
9. mx-missile: 2 (y,n)
10. immigration: 2 (y,n)
11. synfuels-corporation-cutback: 2 (y,n)
12. education-spending: 2 (y,n)
13. superfund-right-to-sue: 2 (y,n)
14. crime: 2 (y,n)
15. duty-free-exports: 2 (y,n)
16. export-administration-act-south-africa: 2 (y,n)

Which are the issues that would get focus when a specific party wins?

BITS Pilani, WILP


Scenario-10
German Credit Data Set

 This dataset classifies people described by a set of attributes


as good or bad credit risks.
 It has several attributes:
o Status of existing checking account
o Loan duration in month
o Credit history
o Purpose of loan
o Loan amount
o Present employment
o Personal status – gender, married etc.
o Age
o Housing – own, rented
o Few more .....
 It is worse to class a customer as good when they are bad,
than it is to class a customer as bad when they are good.
 How is Data Science involved here?
 How do we proceed? 12

BITS Pilani, WILP


Scenario-11

   Operator   Trip ID   Fare     Passenger Rating   Driver Rating
   Uber       U12       234.00   4                  3
   Spot       S9        34.00    3                  4
   Ola        O1        12.00    3                  3
   Zen        Z24       987.00   5                  2
   Uber       U13       123.00   2                  2
   Ola        O23       72.00    2                  3
   Zen        Z54       23.00    3                  3
   Uber       U65       45.00    2                  2
   Spot       S11       43.00    3                  4
   Ola        O34       345.00   5                  3
   Zen        Z24       234.00   5                  4
   Uber       U76       90.00    3                  4
   ...        ...       ...      ...                ...

 Some cab data is collected for a specific period in a city for a few cab operators.
 What does this data tell?
 Can we establish that passengers are happier in longer trips but drivers are not?
 If we establish it, is it knowledge that we can use for finding out the reasons?
 How do we do that?
13

BITS Pilani, WILP


How is knowledge used?

What is your opinion?

 We did not talk about Machine Learning – what is it?
 Where will Business Intelligence sit?
 Then what is Business Analytics?
14

BITS Pilani, WILP


Exercise

In the Iris data classification, the types of flowers were known at the beginning; however, in the article clustering analysis the categories evolved based on the words encountered. In several references, classification analysis is also called "supervised learning" and clustering is called "unsupervised learning". Do you agree? Search and find out more about these keywords.

15

BITS Pilani, WILP


Data Mining vs. Information Retrieval

Looking up individual records or searching the data are important tasks that come under Information Retrieval and are not considered Data Mining.

Example:

Finding out the average stock price of a particular company for a particular year.
Vs.
Predicting the future stock price for that company by analysing the historical data.

16

BITS Pilani, WILP


Exercise

Discuss whether or not each of the following activities is a data


mining task:
A. Dividing the customers of a company according to their gender.
B. Dividing the customers of a company according to their profitability.
C. Computing the total sales of a company.
D. Sorting a student database based on student identification numbers.
E. Predicting the outcomes of tossing a (fair) pair of dice.
F. Predicting the future stock price of a company using historical
records.
G. Monitoring the heart rate of a patient for abnormalities.
H. Monitoring seismic waves for earthquake activities.
I. Extracting the frequencies of a sound wave.
17

BITS Pilani, WILP


The Process of Knowledge Discovery
Data Mining - a step; though often referred to as the whole process!

Knowledge Presentation

Patterns Evaluation

Data Mining

Data Selection & Transformation

Data Cleaning & Integration


18

BITS Pilani, WILP


Life Cycle of a Typical “Data” Project
 Define the Goal: Decision making -
for example reduce the bank’s losses
due to bad loans.
 Data Collection & Management:
What data is available, is it enough,
good quality, errors in it, will it help?
Most time consuming process.
 Build the Model: Statistics and
Machine Learning play their roles
here to extract useful information
from the data for decision making.
We will review few models in this
course.
 Evaluate the Model: Run test sets
and use metrics to establish the
accuracy of the models.
 Presentation, Deployment &
Maintenance: Leverage the model to
solve business problems. Adapt the
model based on changing scenarios
and new datasets.
BITS Pilani, WILP
Scope of this Course

In line with the previously discussed scenarios, the scope


of this course will primarily focus on the following areas:
Data Concepts
Classification Techniques
Association Analysis
Cluster Analysis
Anomaly Detection
Few Advanced Concepts – Web, Text mining etc.
Experimentation of the different techniques using Python
Applications
20

BITS Pilani, WILP


Text Books

Text Book-1 (T1) by Pang-Ning Tan, Michael Text Book-2 (T2) by Jiawei Han, M
Steinbach and Vipin Kumar, 1st edition Kamber and Jian Pei, 3rd edition

Refer to the course handout for other reference books.


21

BITS Pilani, WILP


Evaluation Components (EC)

EC     Name                 Type         Timing           Weight
EC-1   Quiz-I               MCQ          Pre mid-sem      5%
       Quiz-II              MCQ          Post mid-sem     5%
       Assignment           Modelling    Check Calendar   10%
EC-2   Mid-Semester Exam    Closed Book  1.5 hours        30%
EC-3   Comprehensive Exam   Open Book    2.5 hours        50%

Refer to the semester calendar for exact dates and timing.

23

BITS Pilani, WILP


Access to the Course Material - CANVAS
https://bits-pilani.instructure.com/login

24

BITS Pilani, WILP


Kinds of Data that can be Mined

 Relational Databases
 Data Warehouses
 Transactional
 Time series – stock exchange, ocean tides etc.
 Biological – data used to map DNA
 Data Streams - camera surveillance
 Spatial (maps)
 Web
 Text
 Multimedia
 NoSQL, Distributed data
.............
25

BITS Pilani, WILP


Broad Data Mining Tasks

Predictive Tasks:
Predict the value of an attribute (target or dependent
variable) based on the other attributes (independent
variables).

Descriptive Tasks:
Derive patterns – correlations, trends, clusters,
anomalies etc. Basically summarize underlying
relationship in data.

32

BITS Pilani, WILP


Data Mining Tasks
Pictorial View

33

BITS Pilani, WILP


Motivating Challenges

Traditional data analysis techniques have often encountered practical difficulties in meeting the challenges of new data sets. The following are a few specific challenges:
 Scalability: availability of humongous data.
 High Dimension: Very high number of attributes and
complex relationship among them.
 Heterogeneous Data: Several different types of data
from the same source. E.g. Text, hyperlinks,
multimedia etc. on a web page.
34

BITS Pilani, WILP


The Origins of Data Mining

Traditional data analysis techniques fall short in meeting the new challenges. Data Mining is a relatively new discipline, deploying and investigating many non-traditional approaches in an inter-disciplinary way.

Data Mining: Confluence of Multiple Disciplines – Database Technology, Statistics, Machine Learning, Visualization, Pattern Recognition, Algorithms and other disciplines.

BITS Pilani, WILP


Exercise

Data for 20 million galaxies is available. A few of the available attributes are image features, characteristics of light waves, distance from earth etc. A newly found galaxy is to be categorised into one of the categories – Early, Intermediate or Aged. Which data mining task will be helpful to do so?

Early

Intermediate

Aged

36

BITS Pilani, WILP


Thank You

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


DSECL ZC415
Data Mining
Data Exploration
Revision -1.0

BITS Pilani Prof Vineet Garg


Work Integrated Learning Programmes Bangalore Professional Development Center
Data Objects

 A Data Object represents an entity and is described by its Attributes.
 Data Objects are also referred to as samples, instances, data
points, tuples, records etc.
 In a database table, a data object represents a row and the
attributes represent the columns.
 In the example below the row for Emp-1 is a data object,
while all the columns represent its properties and are called
attributes.
ID Name Age Gender Salary Department Last Rating

Emp-1 Mary John 29 F 15000 IT Good


2

BITS Pilani, WILP


Attributes

 An attribute is a data field representing a


characteristic or feature of a data object.
 Other names for attributes are: dimensions, features,
variables etc.
 Observed values for a given attribute are called observations or just values.
 A set of attributes for a data object is also called an
attribute vector.

BITS Pilani, WILP


Attribute Types
Do not get confused with programming language data types!

Nominal: categories, states, or “names of things”


– eye_color = {black, brown, blue}
– marital status, occupation, ID numbers, pin codes

Ordinal
– Values have a meaningful order (ranking) but magnitude between
successive values is not known.
– Size = {small, medium, large}, grades, army rankings
– The word “ordinal” suggests “an order”.

Nominal and Ordinal attributes together are called Categorical attributes.

Binary
– Nominal attribute with only 2 states (0/1, True/False, Yes/No etc.)
– Symmetric binary: both outcomes equally important
e.g. gender
– Asymmetric binary: outcomes not equally important.
e.g. medical test (positive vs. negative)
4

BITS Pilani, WILP


Numeric Attributes
Represent quantities – integer or real-valued.
Interval
• Measured on a scale of equal-sized units
• Differences between values are meaningful, but not multiplication or division (ratios)
• Values have an order, e.g. temperature in °C or °F, calendar dates and pH value
• A true zero-point does not exist, e.g. we cannot say the temperature does not exist
• pH 7 (water) is neutral, pH 0 is highly acidic and pH 14 is highly alkaline – which is the true 0?
Ratio
• Differences as well as ratios are meaningful
• There is a true zero-point
• E.g. age, years-of-experience, count-of-words, duration, monetary quantities
• Temperature in K (why?)

BITS Pilani, WILP


Discrete vs. Continuous Attributes
Based on the values they can take

Discrete Attribute
– Has only a finite or countably infinite set of values, depending on the context:
• E.g. pin codes, professions, or the set of words in a collection of documents.
• Binary attributes are a special case of discrete attributes.
Continuous Attribute
– Has real numbers as attribute values, e.g. temperature, height, or weight.
– In practice, real values can only be measured and represented using a finite number of digits.
– Continuous attributes are typically represented as floating-point variables.
6

BITS Pilani, WILP


Basic Statistical Description of Data

 It is necessary to have an overall picture of the data. If a few integers are given as 25, 32, 24, 28, 26 and 27, how will you characterize them? Can we predict the next number?
 Basic statistical descriptions can be used to identify the data properties and highlight which data values should be treated as outliers or noise.
 Three basic areas of statistical description:
1. Central Tendency
2. Dispersion
3. Visual Inspection or Visualization
7

BITS Pilani, WILP


Central Tendency
Mean

 The most common and effective numeric measure of the center of a data set is the mean (or arithmetic mean).
 Let the N values be x_1, x_2, ..., x_N. The mean of this data set is given by:

   \bar{x} = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{x_1 + x_2 + \dots + x_N}{N}

 Sometimes each value x_i may be associated with a weight w_i. This weight reflects the significance, importance, or occurrence frequency. In that case the (weighted) mean is:

   \bar{x} = \frac{\sum_{i=1}^{N} w_i x_i}{\sum_{i=1}^{N} w_i} = \frac{w_1 x_1 + w_2 x_2 + \dots + w_N x_N}{w_1 + w_2 + \dots + w_N}
BITS Pilani, WILP
Exercise

1. For a group of employees the salary data in thousands of rupees is 30, 36,
47, 50, 52, 52, 56, 60, 63, 70, 70 and 110.
i. Calculate the mean salary.
ii. Calculate the mean salary using weights
(Answer: Rs. 58,000 for both sub-parts)

2. A learning programme comprises four subjects of 5, 4, 4 and 3 units (credits) respectively, offered over two semesters: subjects 1 and 2 in the first semester and subjects 3 and 4 in the second semester.
i. In the first semester, a participant received A and A- grades in the first two subjects. Calculate her CGPA if A and A- are equivalent to 10 and 9 respectively in numeric terms.
(Answer = 9.56)
ii. In the second semester, the same participant received B and B- in the remaining two subjects. Calculate her CGPA after the end of the second semester if B and B- are equivalent to 8 and 7 respectively in numeric terms. Note that CGPA is calculated cumulatively for all the completed courses.
(Answer = 8.69)

BITS Pilani, WILP


Central Tendency
Median

 A small number of extreme values can corrupt the central tendency.
 These extremes are called outliers. E.g. senior management salaries may be higher than the others and cannot represent the true center.
 Chopping off some small % of the low and high data extremes gives the trimmed mean.
 Chopping off too much may lose important data, so it is not preferred.
 For skewed data, a better measure of the center of the data is the median. It is the middle value in the set of ordered data values.
 So the median is the middle value if an odd number of values is present; otherwise, it is the average of the middle two values.
10

BITS Pilani, WILP


Exercise

For a group of employees the salary data in thousands


of rupees is 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70
and 110.
i. Calculate the median salary.
(Answer: Rs. 54,000)
ii. Calculate the median salary ignoring the last salary
value of Rs. 110,000.
(Answer: Rs. 52,000)

11

BITS Pilani, WILP


Central Tendency
Mode and Midrange

 Value that occurs most frequently is called the mode.


 It is possible that the highest frequency of a value is
same for several different values. So the data sets
with one, two and three modes are called uni-modal,
bi-modal and tri-modal data sets.
 The following approximate relationship exists for moderately skewed data:

   \text{mean} - \text{mode} \approx 3 \times (\text{mean} - \text{median})
 Midrange is average of largest and smallest values in
the data set.
12

BITS Pilani, WILP


Exercise

For a group of employees the salary data in thousands


of rupees is 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70
and 110.
i. Calculate the mode.
(Answer: Rs. 52,000 and Rs. 70,000 )
ii. What modal dataset is this?
(Answer: Bi-modal)
iii. Calculate the midrange.
(Answer: Rs. 70,000)

13

BITS Pilani, WILP


Relationship
Mean, Median and Mode

(Figure: frequency curves for symmetric, positively skewed and negatively skewed data.)
 Symmetric data: mean = median = mode.
 Positively skewed data: mode ≤ median ≤ mean (tail towards larger values).
 Negatively skewed data: mean ≤ median ≤ mode (tail towards smaller values).
14

BITS Pilani, WILP


Exercise

Fill the table for the cells having x and justify the
relationship among mean, mode and median as shown
in the previous graphs:
Data Type   Val-1   Val-2   Val-3   Val-4   Val-5   Val-6   Val-7   Val-8   Mean   Median   Mode
x           20      25      30      35      35      40      45      50      x      x        x
x           10      11      11      13      13      14      30      35      x      x        x
x           25      30      35      40      40      45      10      12      x      x        x

Answer
Data Type            Mean    Median   Mode
Symmetric            35      35       35
Positively Skewed    17.13   13.00    11, 13
Negatively Skewed    29.63   32.5     40

BITS Pilani, WILP


Similarity and Dissimilarity

Similarity
– Numerical measure of how alike two data objects are
– Value is higher when objects are more alike
Dissimilarity
– Numerical measure of how different two data objects are
– Lower when objects are more alike
– Minimum dissimilarity is often 0 or close to it.
Proximity refers to a similarity or dissimilarity

An important concept used in cluster analysis and anomaly


detection.
16

BITS Pilani, WILP


Data Matrix
 There are n objects described by p attributes.
 x_ij is the value of object i for the j-th attribute, e.g. x_ik is the value of object i for attribute k in the matrix shown below.
 Each row corresponds to an object, while each column corresponds to a particular attribute for all the objects.

   [ x_11  ...  x_1k  ...  x_1p ]
   [ ...   ...  ...   ...  ...  ]
   [ x_i1  ...  x_ik  ...  x_ip ]
   [ ...   ...  ...   ...  ...  ]
   [ x_n1  ...  x_nk  ...  x_np ]
17

BITS Pilani, WILP


Dissimilarity Matrix
 An element of the dissimilarity matrix, d(i, j), is the difference between the attribute values of objects i and j.
 d(i, j) is a non-negative number that is close to 0 when objects i and j are very similar, and larger otherwise. For normalized measures the maximum value is 1.
 The difference between an object and itself is 0, i.e. d(i, i) = 0; that also means all diagonal elements of the dissimilarity matrix are 0.
 d(i, j) = d(j, i), which means the matrix is symmetric. Normally the d(j, i) entries are not shown.
 Unlike the data matrix, the dissimilarity matrix shows only one kind of entity (dissimilarity). Attribute values are not shown.

   [ 0                            ]
   [ d(2,1)   0                   ]
   [ ...      ...      ...        ]
   [ d(n,1)   d(n,2)   ...    0   ]
18

BITS Pilani, WILP


Nominal Attributes
Dissimilarity Measure

 The values of nominal attributes are just different names; they provide only enough information to distinguish one object from another. Examples: PIN codes, Employee IDs, Gender etc.
 The element of the dissimilarity matrix d(i, j) = (p − m)/p, where p is the count of nominal attributes and m is the number of matches.
 In the table below, Test-1 shows nominal attribute values for four objects (code-A etc.). Here p = 1.
 The dissimilarity matrix for the Test-1 attribute is shown below; it suggests that objects 1 and 4 are similar.

   Object ID   Test-1
   1           code-A
   2           code-B
   3           code-C
   4           code-A

   [ 0              ]
   [ 1   0          ]
   [ 1   1   0      ]
   [ 0   1   1   0  ]

BITS Pilani, WILP


Exercise

1. Based on the nominal values provided for four


objects in the previous slide, calculate the similarity
matrix.
(Clue: element of a similarity matrix s(i, j) = 1-{(p-m)/p} = m/p)

2. Two nominal attribute values (A1 and A2) for four objects are shown below. Calculate the dissimilarity matrix.

   Objects   A1       A2
   1         560012   Black
   2         560013   Brown
   3         560012   Black
   4         560012   Brown

   Answer:
   [ 0                     ]
   [ 1     0               ]
   [ 0     1     0         ]
   [ 0.5   0.5   0.5   0   ]

BITS Pilani, WILP


Binary Attributes
Dissimilarity Measure

 Binary attributes take only two values: 0 (absent) or 1 (present).
 For two objects i and j let:
   q = the count of attributes that equal 1 for both objects
   r = the count of attributes that equal 1 for object i but 0 for object j
   s = the count of attributes that equal 0 for object i but 1 for object j
   t = the count of attributes that equal 0 for both objects
 Dissimilarity measure between objects i and j (symmetric binary attributes):

   d(i, j) = \frac{r + s}{q + r + s + t}

 In some scenarios, matches where both attributes equal 0 are considered unimportant; such attributes are called asymmetric binary attributes, e.g. disease tests. In that case t is ignored:

   d(i, j) = \frac{r + s}{q + r + s}

BITS Pilani, WILP


Example
Dissimilarity in Binary Attributes

The medical test data is provided for three patients. Find out which two patients
are unlikely to have the same disease. Except name and gender, all others are
asymmetric attributes that need to be counted for the calculation (M: Male, F:
Female, Y: Yes, N: Negative/No, P: Positive).
Name Gender Fever Cough Test-1 Test-2 Test-3 Test-4
Jack M Y N P N N N
Jim M Y Y N N N N
Mary F Y N P N P N

d (Jack, Jim) = (1+1) / (1+1+1) = 0.67


d (Jack, Mary) = (0+1) / (2+0+1) = 0.33
d (Jim, Mary) = (1+2) / (1+1+2) = 0.75

This suggests that Jim and Mary are unlikely to have the same disease – highest value of
d. While Jack and Mary are likely to have the same disease – smallest value of d.

22
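A short Python sketch of the asymmetric binary measure for the three patients (Y/P encoded as 1, N as 0; name and gender excluded); the function name is illustrative.

    def asymmetric_binary_d(a, b):
        """d(i, j) = (r + s) / (q + r + s); t (the 0-0 matches) is ignored."""
        q = sum(x == 1 and y == 1 for x, y in zip(a, b))
        r = sum(x == 1 and y == 0 for x, y in zip(a, b))
        s = sum(x == 0 and y == 1 for x, y in zip(a, b))
        return (r + s) / (q + r + s)

    jack = [1, 0, 1, 0, 0, 0]   # Fever, Cough, Test-1, Test-2, Test-3, Test-4
    jim  = [1, 1, 0, 0, 0, 0]
    mary = [1, 0, 1, 0, 1, 0]
    print(round(asymmetric_binary_d(jack, jim), 2))    # 0.67
    print(round(asymmetric_binary_d(jack, mary), 2))   # 0.33
    print(round(asymmetric_binary_d(jim, mary), 2))    # 0.75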

BITS Pilani, WILP


Jaccard Similarity Coefficient

For asymmetric binary attributes of two objects i and j, the similarity measure is called the Jaccard Similarity Coefficient (J) and is measured as:

   J = s(i, j) = 1 - \frac{r + s}{q + r + s} = \frac{q}{q + r + s}

23

BITS Pilani, WILP


Exercise

Zoologists found three new types of mosquitoes in the


forests of Havana. They are trying to categorize them.
Which of the two species are unlikely to be kept in the
same category? All attributes are asymmetric binary in
the table below:
Species Two Compound Thick Sharp
Forewings Eyes Thorax Proboscis
Zena Yes No Yes No
Prota No Yes Yes Yes
Nixa No No No Yes

(Answer: Zena and Nixa)


24

BITS Pilani, WILP


Minkowski Distance
For Numeric Attributes

 Let i and j be two objects with p numeric attributes, such that:

   i = (x_{i1}, x_{i2}, ..., x_{ip}) and j = (x_{j1}, x_{j2}, ..., x_{jp})

 The Minkowski Distance d(i, j) is given by:

   d(i, j) = \left( |x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \dots + |x_{ip} - x_{jp}|^h \right)^{1/h}

   where h is a real number >= 1.

 This is also called the L_h norm in some literature.

25

BITS Pilani, WILP


Manhattan (or City Block or Taxi Cab) Distance
For NumericAttributes

If we take h = 1 in the Minkowski Distance, it is called the Manhattan Distance.

For two objects i = (x_{i1}, x_{i2}, ..., x_{ip}) and j = (x_{j1}, x_{j2}, ..., x_{jp}):

   d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \dots + |x_{ip} - x_{jp}|

This is also called the L1 norm in some literature.

26

BITS Pilani, WILP


Euclidean Distance
For Numeric Attributes

If we take h = 2 in the Minkowski Distance, it is called the Euclidean Distance.

For two objects i = (x_{i1}, x_{i2}, ..., x_{ip}) and j = (x_{j1}, x_{j2}, ..., x_{jp}):

   d(i, j) = \sqrt{(x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \dots + (x_{ip} - x_{jp})^2}

This is also called the L2 norm in some literature.

27

BITS Pilani, WILP


Weighted Euclidean Distance
For Numeric Attributes

If each attribute is assigned a weight (w_1, w_2, ..., w_p), then the Weighted Euclidean Distance is calculated as below.

For two objects i = (x_{i1}, x_{i2}, ..., x_{ip}) and j = (x_{j1}, x_{j2}, ..., x_{jp}):

   d_w(i, j) = \sqrt{w_1 (x_{i1} - x_{j1})^2 + w_2 (x_{i2} - x_{j2})^2 + \dots + w_p (x_{ip} - x_{jp})^2}

The weight is the perceived importance of an attribute.

28

BITS Pilani, WILP


Supremum Distance

For two objects i = (x_{i1}, x_{i2}, ..., x_{ip}) and j = (x_{j1}, x_{j2}, ..., x_{jp}):

   d(i, j) = \max_{f} |x_{if} - x_{jf}|, \quad f = 1 \dots p

This is also called the Lmax or L∞ norm in some literature.

To calculate the Supremum distance, find the attribute f that gives the maximum absolute difference.

29

BITS Pilani, WILP


King’s Moves

The numeral inside each square represents the minimum number of moves the king has to make to reach that square. Which distance is this?
30

BITS Pilani, WILP


Distance Measurement: Properties

1. Non-negativity: d(i, j) >= 0


2. Identity: d(i, i) = 0
3. Symmetry: d(i, j) = d(j, i)
4. Triangle Inequality: d (i, j) <= d (i, k) + d (k, j)

31

BITS Pilani, WILP


Example
Distances

Two points: X1(1, 2) and X2(3, 5)

Manhattan Distance = |3 − 1| + |5 − 2| = 5
Euclidean Distance = √((3 − 1)² + (5 − 2)²) = √13
Supremum Distance = the Y attribute gives the maximum difference = 5 − 2 = 3

32
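A quick numpy check of the three distances for X1 and X2; scipy.spatial.distance provides cityblock, euclidean and chebyshev equivalents.

    import numpy as np

    x1 = np.array([1, 2])
    x2 = np.array([3, 5])
    diff = np.abs(x1 - x2)

    print(diff.sum())                   # Manhattan (L1) distance  -> 5
    print(np.sqrt((diff ** 2).sum()))   # Euclidean (L2) distance  -> sqrt(13) ≈ 3.61
    print(diff.max())                   # Supremum (L∞) distance   -> 3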

BITS Pilani, WILP


Normalization
For [0.0, 1.0]

Let us say we have a few values of an attribute x. They need to be scaled (normalized) to [0.0, 1.0] for easier comparison, without losing the insight of their dissimilarity.

It is done using the formula:

   z_i = \frac{x_i - \min(x)}{\max(x) - \min(x)}

   x    z
   8    0.86
   9    1.00
   6    0.57
   4    0.29
   3    0.14
   2    0.00
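The same [0.0, 1.0] scaling in a couple of lines of numpy:

    import numpy as np

    x = np.array([8, 9, 6, 4, 3, 2], dtype=float)
    z = (x - x.min()) / (x.max() - x.min())
    print(z.round(2))   # [0.86 1.   0.57 0.29 0.14 0.  ]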

BITS Pilani, WILP


Ordinal Attributes
Dissimilarity Measure

 The values of an ordinal attribute have a meaningful order or ranking. For example: small, medium, large; or poor, fair, good, excellent.
 For measuring the dissimilarity, the following steps are followed:
 Step-1: Let f be an attribute of the i-th object with value x_if. Replace x_if with its rank r_if ∈ {1, ..., M_f}, where M_f is the number of ranks.
 Step-2: Map the range of each attribute onto [0.0, 1.0] for normalization as z_if = (r_if − 1)/(M_f − 1).
 Step-3: Calculate the dissimilarity using a numeric distance on z_if.
34

BITS Pilani, WILP


Example
Dissimilarity in Ordinal Attributes

Four objects scored grades out of {Fair, Good, Excellent} as shown below. Calculate the dissimilarity.

   Object ID   Test-2      Mf = 3, Rank (r_if)   Normalized Rank (z_if)
   1           Excellent   3                     1.0
   2           Fair        1                     0.0
   3           Good        2                     0.5
   4           Excellent   3                     1.0

Step-1: Each value is replaced by its rank (r_if).
Step-2: Normalize each value onto [0.0, 1.0] using z_if = (r_if − 1)/(M_f − 1).
Step-3: Calculate the Euclidean Distance in matrix form from z_if:

   [ 0.0                    ]
   [ 1.0   0.0              ]
   [ 0.5   0.5   0.0        ]
   [ 0.0   1.0   0.5   0.0  ]

Objects 1-2 and 2-4 are the most dissimilar!

BITS Pilani, WILP


Dissimilarity in Numeric Attributes
Converting them in the range of [0.0, 1.0]

Four objects scored Test-3 scores in numbers as shown below. Calculate the normalized dissimilarity.

   Object ID   Test-3
   1           45
   2           22
   3           64
   4           28

   d_{ij}^{(f)} = \frac{|x_{if} - x_{jf}|}{\max(x) - \min(x)}

max x = 64 and min x = 22

d(2, 1) = |22 − 45| / (64 − 22) = 0.55
d(4, 3) = |28 − 64| / (64 − 22) = 0.86
and so on...

   [ 0.0                      ]
   [ 0.55   0.0               ]
   [ 0.45   1.0    0.0        ]
   [ 0.40   0.14   0.86   0.0 ]

BITS Pilani, WILP


Mixed Attributes
Dissimilarity Measure

Dissimilarity measure with data that contains p attributes of mixed types. The dissimilarity is calculated as:

   d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} \, d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}

 d_{ij}^{(f)} is the dissimilarity measure for objects i and j for attribute f.
 The value of \delta_{ij}^{(f)} is calculated as follows:
   = 0, if there is no measurement (x_if or x_jf is missing) for object i or j
   = 0, if x_if = x_jf = 0 and the attribute f is an asymmetric binary attribute
   = 1, otherwise

BITS Pilani, WILP


Mixed Attributes
Dissimilarity Measure…..continuing...

   Object ID   Test-1 (Nominal)   Test-2 (Ordinal)   Test-3 (Numeric)
   1           code-A             Excellent          45
   2           code-B             Fair               22
   3           code-C             Good               64
   4           code-A             Excellent          28

Test-1:                  Test-2:                     Test-3:
[ 0            ]         [ 0.0                ]      [ 0.0                      ]
[ 1   0        ]         [ 1.0  0.0           ]      [ 0.55   0.0               ]
[ 1   1   0    ]         [ 0.5  0.5  0.0      ]      [ 0.45   1.0    0.0        ]
[ 0   1   1  0 ]         [ 0.0  1.0  0.5  0.0 ]      [ 0.40   0.14   0.86   0.0 ]

The above dissimilarity matrices were calculated in the previous examples.

38

BITS Pilani, WILP


Mixed Attributes
Dissimilarity Measure….concluded.

0  0.0   0.0 
1 0  1.0 0.0  0.55 0.0 
     
1 1 0  0.5 0.5 0.0  0.45 1.0 0.0 
     
0 1 1 0  0.0 1.0 0.5 0.0  0.40 0.14 0.86 0.0 
Dissimilarity matrix for the mixed attributes is calculated as:
p

ij .dij
f 1
f f
 0.0
0.85 0.0


d( i, j )  p
 
0.65 0.83 0.0 
ij
f 1
f

0 .13 0.71 0.79 0.0


d(3, 1) = (1x1 +1x0.5+1x0.45) / (1+1+1) = 0.65 Objects 1 and 2 are most
d(3, 2) = (1x1 +1x0.5+1x1) / (1+1+1) = 0.83 dissimilar while 1and 4 are
d(4, 3) = (1x1 +1x0.5+1x0.86) / (1+1+1) = 0.79 most similar.
39
and so on….
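A numpy sketch that combines the three per-attribute matrices exactly as above; every δ is 1 here because no value is missing and no attribute is asymmetric binary.

    import numpy as np

    # Per-attribute dissimilarity matrices from the previous slides
    d_nominal = np.array([[0, 1, 1, 0],
                          [1, 0, 1, 1],
                          [1, 1, 0, 1],
                          [0, 1, 1, 0]], dtype=float)
    d_ordinal = np.array([[0.0, 1.0, 0.5, 0.0],
                          [1.0, 0.0, 0.5, 1.0],
                          [0.5, 0.5, 0.0, 0.5],
                          [0.0, 1.0, 0.5, 0.0]])
    d_numeric = np.array([[0.00, 0.55, 0.45, 0.40],
                          [0.55, 0.00, 1.00, 0.14],
                          [0.45, 1.00, 0.00, 0.86],
                          [0.40, 0.14, 0.86, 0.00]])

    d_mixed = (d_nominal + d_ordinal + d_numeric) / 3   # all deltas = 1, p = 3
    print(d_mixed.round(2))   # d(2,1)=0.85, d(3,1)=0.65, d(3,2)=0.83, d(4,1)=0.13, d(4,3)=0.79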
BITS Pilani, WILP
Exercise

Three types of diamond varieties have the following attributes, as shown in the table below. Calculate the individual and combined dissimilarity matrices.

   Variety   Price in Rs. per carat   Durability   Readily Available?   Resale Value?
   1         24000                    Excellent    Yes                  No
   2         12000                    Fair         Yes                  Yes
   3         15000                    Good         No                   Yes

Answer:
Price:        d(1, 2) = 1.0,   d(1, 3) = 0.75,   d(2, 3) = 0.25
Durability:   d(1, 2) = 1.0,   d(1, 3) = 0.50,   d(2, 3) = 0.50
Availability: d(1, 2) = 0,     d(1, 3) = 1.0,    d(2, 3) = 1.0
Resale:       d(1, 2) = 1.0,   d(1, 3) = 1.0,    d(2, 3) = 0
Calculate the combined matrix now.
BITS Pilani, WILP
Vectors: Basic Concepts
 Physical quantities are divided into two parts: scalar and vector.
 Scalar quantities are those which have only magnitude and no direction – mass, density, volume, temperature etc.
 Vector quantities are those which have a direction associated with the magnitude – velocity, weight, force etc.
 A vector PQ→ is determined by two points P and Q. The modulus or magnitude of the vector is the length of the line PQ and its direction is from P to Q. P is called the initial point and Q the terminal point or tip. Vectors are normally denoted as a→.
 The magnitude or modulus of a vector PQ→ is denoted by |PQ→|.
 A vector with coincidental initial and terminal points is called a null vector.
 A vector with unit magnitude is called a unit vector. A unit vector is denoted as a^ and called a-cap.
 A vector in three-dimensional space is denoted as x.i^ + y.j^ + z.k^ and its magnitude is √(x² + y² + z²).
 If (x.i^ + y.j^ + z.k^) and (p.i^ + q.j^ + r.k^) are two vectors, then their dot product is defined as (xp + yq + zr), which is a scalar quantity.
 If θ is the angle between two vectors X and Y, then the dot product (X·Y) of these two vectors is also defined as X·Y = |X|·|Y|·cos θ (cos 0° = 1 and cos 90° = 0).
BITS Pilani, WILP
Document Comparison
Example Scenario

 Let us say there are a few documents recording the frequency of words as shown in the table below, so that each document is a term-frequency vector.
 Two term-frequency vectors may have many 0s in common, and thus the resulting matrix can be very sparse in nature.
 Having many 0s in common does not make the documents similar. Here the distance techniques for numeric attributes will not work.
 We need a measure that focuses on the words that are common to the two documents and ignores 0-0 matches.
 Cosine Similarity is a measure that can be used to compare the documents.

Document team coach hockey baseball soccer penalty score win loss season
ID-1 5 0 3 0 2 0 0 2 0 0
ID-2 3 0 2 0 1 1 0 1 0 1
ID-3 0 7 0 2 1 0 0 3 0 0

BITS Pilani, WILP


Cosine Similarity

   Document   team   coach   hockey   baseball   soccer   penalty   score   win   loss   season
   ID-1       5      0       3        0          2        0         0       2     0      0
   ID-2       3      0       2        0          1        1         0       1     0      1
   ID-3       0      7       0        2          1        0         0       3     0      0

 Let x and y be two vectors for comparison.
 The cosine similarity measure is calculated as:

   \mathrm{sim}(x, y) = \frac{x \cdot y}{|x| \, |y|}

   where |x| and |y| are the magnitudes/moduli of the vectors and x·y is the dot product of the vectors.

 If the cosine value comes to 0, it means the vectors are at 90° (cos 90° = 0) and there is not much in common, because they are perpendicular or orthogonal. A value close to 1 (cos 0° = 1) suggests that the documents are similar.
 For the example documents ID-1 and ID-2:
   x·y = 5×3 + 0×0 + 3×2 + 0×0 + 2×1 + 0×1 + 0×0 + 2×1 + 0×0 + 0×1 = 25
   |x| = (5² + 0² + 3² + 0² + 2² + 0² + 0² + 2² + 0² + 0²)^1/2 = 6.48
   |y| = (3² + 0² + 2² + 0² + 1² + 1² + 0² + 1² + 0² + 1²)^1/2 = 4.12
   sim(x, y) = 25 / (6.48 × 4.12) = 0.94
   This value close to 1 suggests that documents ID-1 and ID-2 are similar. Documents ID-1 and ID-3 can also be compared for similarity in the same way.
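A small numpy sketch of the cosine similarity for the documents above; ID-1 vs. ID-3 is computed too, for contrast.

    import numpy as np

    id1 = np.array([5, 0, 3, 0, 2, 0, 0, 2, 0, 0])
    id2 = np.array([3, 0, 2, 0, 1, 1, 0, 1, 0, 1])
    id3 = np.array([0, 7, 0, 2, 1, 0, 0, 3, 0, 0])

    def cosine_sim(x, y):
        return x.dot(y) / (np.linalg.norm(x) * np.linalg.norm(y))

    print(round(cosine_sim(id1, id2), 2))   # 0.94 -> similar documents
    print(round(cosine_sim(id1, id3), 2))   # 0.16 -> not very similar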

BITS Pilani, WILP


Exercise

Using cosine similarity find out which two phrases are


most similar. (ignore case, punctuation symbols and plurality, if there is any):

i. Two for tea and tea for two


ii. Tea for me and tea for you
iii. You for me and me for you

44

BITS Pilani, WILP


Dispersion of the Data

 Dispersion or the spread of the data, as the name


suggests, provide a numerical view of the “spread”.
 The different techniques used to describe the dispersion
of the data are following:
o Range: the difference between the largest and the smallest
values. What is mid-range then?
o Quantile (quartile, median, percentile, inter-quartile range)
o Five-Number Summary
o Standard Deviation and Variance
o z-Score Normalization
o Covariance and Correlation 45

BITS Pilani, WILP


Quantiles
 Quantiles are points taken at regular intervals of a data distribution dividing it into equal size of
consecutive sets.
 The quantile is a generic term. 2-quantile: data is divided into two sets, 4-quantile (quartile): 4 sets,
100-quantile (percentile): 100 sets and so on.
 In the diagram below, there are 3 quantile values (Q1, Q2 and Q3) that divide the data into 4 equal
parts. This specific quantile is called 4-quantile or simply quartile.
 Generally speaking, the kth q-quantile value for a given data distribution is the value x such that at
most k/q of the data values are less than x and at most (q-k)/q values are more than x. Where, k is
an integer such that 0 < k < q so that there are (q-1) quantile values.
 In quartile, the difference between the 1st and the 3rd quantile values is called Interquartile Range
(IQR) = Q3-Q1.
 Median divides the data distribution into two equal parts with one quantile value (Q2).
 Percentile divides the data into one hundred equal parts.

(Figure: a frequency distribution divided into four equal parts by the quartile values Q1, Q2 and Q3.)
BITS Pilani, WILP
Example This is one possible way!

For a group of employees the salary data in thousands of


rupees is 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70 and 110.
Divide the data into quartiles and find out the IQR.
Step-1: Order the values in increasing (or non-decreasing) order.
Step-2: Identify min and max values.
Step-3: Identify median (Q2).
Step-4: Identify Q1 and Q3. This is similar as identification of median of the leftover
elements on the left and the right side of identified Q2 of step-3.
Step-5: identify IQR as Q3-Q1. It will be required for outliers.

   30   36   47   50   52   52   |   56   60   63   70   70   110
   Min = 30     Q1 = 48.5     Median (Q2) = 54     Q3 = 66.5     Max = 110

IQR = 66.5 − 48.5 = 18
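A short numpy sketch of the median-of-halves method used above; note that np.percentile's default linear interpolation would give slightly different Q1/Q3 values.

    import numpy as np

    salaries = np.sort(np.array([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]))

    q2 = np.median(salaries)        # 54.0
    q1 = np.median(salaries[:6])    # 48.5 (median of the lower half)
    q3 = np.median(salaries[6:])    # 66.5 (median of the upper half)
    print(q1, q2, q3, q3 - q1)      # IQR = 18.0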

BITS Pilani, WILP


Five-Number Summary & Box Plots
 The Five-Number Summary consists of: Minimum, Q1, Median (Q2), Q3, Maximum.
 A BoxPlot is one of the most popular ways to summarize the data.
 Values falling more than (1.5 × IQR) below Q1 or above Q3 can be treated as outliers or anomalies.
 The Five-Number Summary is shown using box plots: the box spans Q1 to Q3 with the median marked inside it, whiskers are drawn up to the minimum/maximum values within 1.5 × IQR, and outliers are shown separately as dots. Multiple data sets can be plotted side by side for comparison.

BITS Pilani, WILP


Exercise

Data analyst of a bank noticed the following credit card


transactions of a customer for a year: January to December
respectively. All the values are in thousand ₹: 7, 13, 8, 32, 12,
10, 15, 5, 35, 17, 19 and 11. Which two months should he
suspect as to have the fraudulent transactions?
Answer: Q3 = 18 and Q1 = 9. So, IQR = 9. Data points above Q3 with a margin of 1.5xIQR (= 13.5) are 32
and 35. These months are April and September. There are no data points below Q1 with a margin of
1.5xIQR (= 13.5).

Important note: In these questions, do not just trust your intuition when looking at the data. For example, you might say the 32 and 35 thousand values are obvious outliers, but in data science inferences need to be numerically justified. In this exercise the outliers are identified with statistics-based logic.

Also pay attention to the question: the values are in thousands and the question asks for the months (not the values).
49

BITS Pilani, WILP


Remember!

 In statistics there is a concept of sample vs. population. The values of many parameters like variance, standard deviation etc. may differ based on this assumption. In this course, all datasets are considered populations.
 The mean/average is represented as μ or x̄.
 When using scientific calculators, R, MS-Excel or any other tools, participants are expected to know how the tool computes these – based on a sample or on the population.

50

BITS Pilani, WILP


Standard Deviation and Variance

 Variance (σ²) of N observations (x_1, x_2, ..., x_N) is calculated as:

   \sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2

   where \bar{x} is the mean of the dataset values x_i.

 The square root of the variance is called the Standard Deviation (σ).
 A low standard deviation means the data observations are close to the mean; otherwise the data is spread out over a large range of values. A zero standard deviation means all the observations have the same value.
 The variance is usually quoted without a unit (its unit is the square of the data unit), while the standard deviation has the same unit as the data observations.
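A two-line numpy check on the salary data of the exercises; numpy's var and std use the population formula (ddof = 0) by default, as this course assumes.

    import numpy as np

    salaries = np.array([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110])
    print(round(salaries.var(), 2))   # 379.17 (variance)
    print(round(salaries.std(), 2))   # 19.47 thousand Rs. (standard deviation)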
BITS Pilani, WILP
Exercise

1. For a group of employees the salary data in thousands of


rupees is 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70 and 110.
Calculate the variance and standard deviation.
(Answer: 379.17, 19.47 thousands)

2. There are two datasets X={1, 2, 3, 4} and Y={1, 3, 5, 7}. In


which dataset observations are closer to the mean?
(Answer: standard deviation of X = 1.12 and Y = 2.24 that indicates dataset X is
more packed near the mean.)

52

BITS Pilani, WILP


z-Score Normalization
 A z-score is the number of standard deviations a data point is away from the mean.
 When the dataset mean is μ and the standard deviation is σ, then for the i-th element x_i the z-score (z_i) is calculated as:

   z_i = \frac{x_i - \mu}{\sigma}

 The sum of the z-score normalized data points will be 0.
 If z_i is
   = 1, that means x_i is (μ + σ)
   = 2, that means x_i is (μ + 2σ)
 So, essentially, the z-normalized score tells how far a data point is from the mean, in multiples of the standard deviation.
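A minimal numpy sketch for the exercise dataset X = {1, 2, 3, 4}:

    import numpy as np

    x = np.array([1, 2, 3, 4], dtype=float)
    z = (x - x.mean()) / x.std()    # population standard deviation (ddof = 0)
    print(z.round(2))               # [-1.34 -0.45  0.45  1.34]
    print(round(z.sum(), 10))       # 0.0 (z-scores always sum to 0)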
BITS Pilani, WILP
Normal Distribution Curve
μ: Mean, σ: Standard Deviation

For the Normal Distribution (symmetrical) curve, where the mean is μ and the standard deviation is σ:
o From (μ−σ) to (μ+σ): contains about 68% of the data
o From (μ−2σ) to (μ+2σ): contains about 95% of the data
o From (μ−3σ) to (μ+3σ): contains about 99.7% of the data

(Figure: bell-shaped frequency curve over values from −3σ to +3σ around the mean μ.)

BITS Pilani, WILP


Exercise

1. There are two datasets X= {1, 2, 3, 4} and Y={1, 3, 5, 7}.


Find out the z-score normalized values for both.
Answer: zx = {-1.34, -0.45, 0.45, 1.34}, zY = {-1.34, -0.45, 0.45, 1.34}

2. Why are these scores the same for both the sets?

55

BITS Pilani, WILP


Covariance

The covariance (S_XY) of two datasets X and Y measures how the two are linearly related. A positive covariance indicates a positive linear relationship between the sets, a negative covariance indicates the opposite, and a covariance of 0 indicates that they are uncorrelated.

   S_{XY} = \frac{1}{N} \sum_{k=1}^{N} (x_k - \bar{x})(y_k - \bar{y})

   where \bar{x}, \bar{y} are the means of the X and Y datasets respectively.

56

BITS Pilani, WILP


Examples

X = {-3, 6, 0, 3, -6} and Y = {1, -2, 0, -1, 2}


Mean of X = 0, mean of Y = 0
SXY = {(-3-0)(1-0) + (6-0)(-2-0) + (0-0)(0-0) + (3-0)(-1-0) + (-6-0)(2-0)} / 5
= (-3-12+0-3-12)/5 = (-30)/5 = -6
(Negative sign of covariance; X decreases ⇒ Y increases)

X = {3, 6, 0, 3, 6} and Y = {1, 2, 0, 1, 2}


Mean of X = 3.6, mean of Y = 1.2
SXY = {(3-3.6)(1-1.2) + (6-3.6)(2-1.2) + (0-3.6)(0-1.2) + (3-3.6)(1-1.2) + (6-
3.6)(2-1.2)} / 5
= (0.12+1.92+4.32+0.12+1.92)/5 = 8.4/5 = 1.68
(Positive sign of covariance; X increases ⇒ Y increases)

57

BITS Pilani, WILP


Exercise

Find out the covariance of the vectors X = {-3, -2, -1, 0,


1, 2, 3} and Y = {9, 4, 1, 0, 1, 4, 9} and compare the
answer with their approximate scatter plot.
(Answer = 0, Parabolic relationship)

(Scatter plot: Y against X traces a parabola, yet the covariance is 0.)

58

BITS Pilani, WILP


Correlation

The correlation of two datasets X and Y equals their covariance divided by the product of their individual standard deviations. It is a normalized measurement, between −1 and 1, of how the two are linearly related.

   \mathrm{Correlation}(X, Y) = \frac{S_{XY}}{\sigma_X \cdot \sigma_Y}

A correlation of 1 represents a perfect positive relationship and a correlation of −1 represents a perfect negative relationship.

59

BITS Pilani, WILP


Example

 X = {3, 6, 0, 3, 6} and Y = {1, 2, 0, 1, 2}

Mean of X = 3.6, mean of Y = 1.2
SXY = {(3−3.6)(1−1.2) + (6−3.6)(2−1.2) + (0−3.6)(0−1.2) + (3−3.6)(1−1.2) + (6−3.6)(2−1.2)} / 5
    = (0.12 + 1.92 + 4.32 + 0.12 + 1.92) / 5 = 8.4 / 5
    = 1.68
σX = 2.24 (population)
σY = 0.75 (population)
Correlation (X, Y) = 1.68 / (2.24 × 0.75) = 1.0 (Y is exactly X/3, a perfect positive linear relationship)
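A numpy sketch of both measures for this example; np.corrcoef confirms the perfect positive relationship.

    import numpy as np

    x = np.array([3, 6, 0, 3, 6], dtype=float)
    y = np.array([1, 2, 0, 1, 2], dtype=float)

    cov_xy = ((x - x.mean()) * (y - y.mean())).mean()   # population covariance -> 1.68
    corr   = cov_xy / (x.std() * y.std())               # population standard deviations
    print(round(cov_xy, 2), round(corr, 2))             # 1.68, 1.0
    print(np.corrcoef(x, y)[0, 1])                      # 1.0 as well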

60

BITS Pilani, WILP


Appendix

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Median of a Grouped Distribution
It is cumbersome to calculate the median for a large number of observations that can be grouped. In these scenarios, an approximate median is calculated by:

   \text{Median} = l + \frac{N/2 - F}{f} \times h

   where,
   the median class is the class whose cumulative frequency is just above N/2
   l = lower limit of the median class
   f = frequency of the median class
   h = width of the median class
   F = cumulative frequency of the class preceding the median class
   N = sum of frequencies, the total observations
BITS Pilani, WILP
Example
Median of a Grouped Distribution

   Class        5-10   10-15   15-20   20-25   25-30   30-35   35-40   40-45
   Frequency     5      6       15      10      5       4       2       2
   Cumulative    5      11      26      36      41      45      47      49 (= N)

 N = sum of frequencies = 49
 N/2 = 24.5, so the cumulative frequency just above 24.5 is 26
 The corresponding (median) class is 15-20
 l = 15, f = 15, h = 5, F = 11
 So, median = 15 + ((24.5 − 11) / 15) × 5 = 19.5
63
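A short numpy sketch of the grouped-median formula applied to the table above (variable names are illustrative):

    import numpy as np

    lower = np.array([5, 10, 15, 20, 25, 30, 35, 40])   # lower class limits
    freq  = np.array([5,  6, 15, 10,  5,  4,  2,  2])   # class frequencies
    h     = 5                                           # class width
    cum   = np.cumsum(freq)
    N     = freq.sum()

    k = np.searchsorted(cum, N / 2)     # index of the median class (cum. freq. just above N/2)
    l, f = lower[k], freq[k]
    F = cum[k - 1] if k > 0 else 0      # cumulative frequency of the preceding class
    print(l + (N / 2 - F) / f * h)      # 19.5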

BITS Pilani, WILP


Thank You

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


DSECL ZC415
Data Mining
Data Pre-processing
Revision -4.0

BITS Pilani Prof Vineet Garg


Work Integrated Learning Programmes Bangalore Professional Development Center
Introduction
Real world data is:
 Noisy with errors
 With missing values
 Inconsistent
Possibly because of:
 Huge size – loss, data corruption
 Multiple heterogeneous sources
This will lead to:
 Low quality knowledge discovery (data mining)
The situation necessitates pre-processing of data before mining is performed. This may involve one or more of:
 Cleaning – e.g. treating missing values and noise
 Integration
 Reduction
 Transformation
(This module identifies the problems and does not necessarily solve them; it suggests different possibilities.)

BITS Pilani, WILP


Cleaning: Missing Values:
Elimination
 A dataset has few records with missing values of few attributes.
Is it wise to eliminate those records (e.g. Record-2)?
 A dataset has few attributes which are missing for records. Is it
wise to eliminate those attributes itself (e.g. Attribute-3)?
 If a dataset is meant for classification model training, and the
label or class itself is missing for few records, it is of no use and
normally eliminated (e.g. Record-1).
Class
Attribute-1 Attribute-2 Attribute-3 Attribute-3 Attribute-4
Attribute
Record-1 Missing
Record-2 Missing Missing Missing
Record-3 Missing
Record-4
Record-5 Missing
.... .... .... .... .... .... .... 3

BITS Pilani, WILP


Cleaning: Missing Values: Estimated
Supply
 For an attribute, values are missing for few records. Values which are
present for this attribute in other records are:
o Symmetric: can mean be supplied for the missing?
o Skewed: can median/mode be supplied for the missing?
o Categorical: can most commonly occurring value be supplied for the missing?
 Other techniques: Regression, Decision Tree Induction, Bayesian
Formalism etc.
Attribute-2 Attribute-3
Attribute-1 Attribute-3 Attribute-4 Attribute-5
(Numeric) (Categorical)
Record-1 23 Black
Record-2 Missing Black
Record-3 Missing Missing
Record-4 41 Black
Record-5 22 Brown
.... .... .... .... .... .... ....
4
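A hedged pandas sketch of the "estimated supply" ideas above, on a hypothetical toy frame loosely based on the table: the mean fills a roughly symmetric numeric attribute, the most common value fills a categorical one.

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        "attr2_numeric": [23, np.nan, np.nan, 41, 22],                   # numeric attribute
        "attr3_colour":  ["Black", "Black", np.nan, "Black", "Brown"],   # categorical attribute
    })

    df["attr2_numeric"] = df["attr2_numeric"].fillna(df["attr2_numeric"].mean())
    df["attr3_colour"]  = df["attr3_colour"].fillna(df["attr3_colour"].mode()[0])
    print(df)   # missing numeric cells get 28.67, missing colour gets "Black"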

BITS Pilani, WILP


Cleaning: Noisy Data: Binning
Noise is a random error or variance in a measured numeric attribute. Is it possible to smooth out the
data to remove the noise? Binning (or bucketing) is one such technique. E.g.:

Attribute values: 21, 24, 21, 15, 28, 25, 34, 4, 8
After sorting: 4, 8, 15, 21, 21, 24, 25, 28, 34

Three bins (or buckets) of equal size:
Bin-1: 4, 8, 15 (mean = 9)
Bin-2: 21, 21, 24 (mean = 22)
Bin-3: 25, 28, 34 (mean = 29)

Method-1: Smoothing by bin means (replace each value by the mean of its bin):
Bin-1: 9, 9, 9
Bin-2: 22, 22, 22
Bin-3: 29, 29, 29

Method-2: Smoothing by bin boundaries (replace each value by the min or max of its bin, whichever is closer):
Bin-1: 4, 4, 15
Bin-2: 21, 21, 24
Bin-3: 25, 25, 34

A useful technique for concept hierarchies, e.g. when the interest is only in identifying inexpensive, moderately priced and expensive values.
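A small numpy sketch of both smoothing methods for the nine values above (three equal-frequency bins):

    import numpy as np

    values = np.sort(np.array([21, 24, 21, 15, 28, 25, 34, 4, 8]))
    bins = values.reshape(3, 3)                    # [[4 8 15] [21 21 24] [25 28 34]]

    # Method-1: smoothing by bin means (9, 22 and 29 repeated within each bin)
    by_means = np.repeat(bins.mean(axis=1), 3).reshape(3, 3)
    print(by_means)

    # Method-2: smoothing by bin boundaries (closer of the bin's min/max)
    lo, hi = bins[:, [0]], bins[:, [-1]]
    by_bounds = np.where(bins - lo <= hi - bins, lo, hi)
    print(by_bounds)                               # [[ 4  4 15] [21 21 24] [25 25 34]]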

BITS Pilani, WILP


Cleaning: Noisy Data: Regression

 Regression is a technique that fits attribute values to a function. It can also be used for smoothing.
 Linear Regression involves finding the “best line” to
fit two attributes. One attribute can be used to
predict the other.
 Multiple Linear Regression involves more than two
attributes that fit the data to a “multidimensional
surface”.
 Linear Regression Technique is explained in the
Appendix-A. It is not in the syllabus of this course.
6

BITS Pilani, WILP


Cleaning: Noisy Data: Outlier
Detection
 Outliers can be found out to remove the anomalies.
 One such technique BoxPlot is discussed.
 More outlier detection techniques will be discussed
in the program.

BITS Pilani, WILP


Cleaning: How Do We Start?
A Process – possible few initial steps.....

Some of the following steps detect the inconsistencies in the data and assist in
cleaning:
 Relevance: Is collected data relevant?
o Road accident data - but driver’s age missing.
 Have Metadata (data about data): Prior knowledge of data helps in cleaning:
o Date format is DD/MM/YYYY or MM/DD/YYYY?
o Pin codes cannot be negative.
o What is the skewness of the numeric data?
o Address does not fall within a city.
o Product codes need to be exactly 8 characters long.
 Unique Rule Checking: prior knowledge under these rules help in identifying
the inconsistencies:
o Unique Rule: All values of an attribute are expected to be unique.
o Consecutive Rule: There cannot be a missing value in a range, e.g. youth data missing.
o Null Rule: Any special character (question mark, exclamatory sign, blank space etc)
represents a null. E.g. surveyors used different notations to fill null.
 Interactive Data Scrubbing and Auditing tools: are available to perform the
above (or even more) steps. E.g.
o UCB’s Potter’s Wheel A-B-C. 8

BITS Pilani, WILP


Integration
The merging of data from multiple data sources and store it in the coherent data store like Data
Warehouse. It may cause redundancies and inconsistencies. Some of the issues are following:
 Entity Identification: Schema heterogeneity poses great challenges. E.g. Customer_ID in one database and Customer_Number in another, but both mean the same thing.
 Redundancy: Correlation and covariance indicate how strongly one numeric attribute implies the other. Do we need to keep both attributes when integrating? Similarly, the chi-square (χ²) correlation test is performed for measuring the correlation relationship between two nominal attributes.
 Duplication: Two or more records with exactly the same values of the most attributes –
duplicate records or different records?
Purchaser Name Address Date of Purchase Quantity
Jacob John 234 Camac Street 23-12-2018 23
Jacob John 126 Green Park 23-12-2018 23
 Data Value Conflict: Currency is in $ or €, CGPA is out of 10 or out of 4, hotel room plan is
MAP or EP, weight is in pounds or in grams etc.
 Abstraction Level: An electronics brand has several showrooms across India. In few states
the database stores Total_Sales branch wise while in other it is the complete state wise.
When integrating all this data, does this need to be considered? 12

BITS Pilani, WILP


Exercise
A nationwide flagship hospital has two clinics in a city. In the past, when they computerized
their day-to-day working, their IT department never imagined a day when it would also need to
integrate the data. The result is different schema at two different places while creating the
patient database. One clinic stored the year of birth of a patient while another stored the age.
The age attribute auto increments in the system every year (an additional overhead!). There are
several patients who got the treatment at both the clinics. As a data scientist, how would you
identify this issue while helping the hospital integrate the data for a data warehouse?

   Clinic-1                                    Clinic-2
   Patient Unique ID       Age in 2019         Patient Unique ID       Year of Birth
   (Government Issued)                         (Government Issued)
   ABC123                  48                  ABC123                  1971
   MNP234                  43                  AXP123                  1972
   LQR876                  21                  LQR876                  1998
   XYZ871                  23                  XYZ871                  1996
   MAR979                  27                  MAR979                  1992
   UPG134                  40                  LMN302                  1980
   LMN642                  36                  LMN642                  1983

Hint: Correlation of the common records is -1. 13

BITS Pilani, WILP


Chi-Square (χ2) Test
Slide : 1/2

 Chi-square statistics test the hypothesis that two nominal attributes are independent.
 Out of 962 people (men and women), the holiday preference is shown in the Observed table below. We have to find out from the given data whether the nominal attributes Gender and Holiday Preference are correlated, at significance level (P) 0.05.

   Observed   Beach   Cruise   Total
   Men        209     280      489
   Women      225     248      473
   Total      434     528      962

 Step-1: Calculate the value of chi-square (χ²) using:

   \chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}

   where the Expected count for a cell is:

   \text{Expected} = \frac{\text{Count}(Attr_1) \times \text{Count}(Attr_2)}{\text{Total Count}}

   E.g. Expected(Men, Beach) = (489 × 434) / 962 = 220.61

   Expected   Beach    Cruise
   Men        220.61   268.39
   Women      213.39   259.61

   χ² = (209 − 220.61)²/220.61 + (225 − 213.39)²/213.39 + (280 − 268.39)²/268.39 + (248 − 259.61)²/259.61 = 2.264
BITS Pilani, WILP
Chi-Square (χ2) Test
Slide : 2/2

 Step-2: Identify the Degrees of Freedom (DF), i.e. the count of attributes that are free to vary. If the observed data table has (Row × Column) cells, the DF is (Row − 1) × (Column − 1). For the given data it is (2 − 1) × (2 − 1) = 1.
 Step-3: For the given significance level (0.05 in this example) and DF (1 in this example), the chi-square value is read from the χ² distribution table. From the table it is 3.841.
 Since the calculated chi-square value (2.264) is less than the table value, the attributes Gender and Holiday Preference are independent. The hypothesis holds.
 If the calculated chi-square value is equal to or bigger than the corresponding table value, the attributes are considered correlated.

(Figure: χ² distribution curve – the larger the shaded tail area beyond the calculated value, the more strongly the independence hypothesis is supported.)
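A hedged scipy sketch of the same test; correction=False switches off Yates' continuity correction so that the result matches the hand calculation above.

    import numpy as np
    from scipy.stats import chi2_contingency

    observed = np.array([[209, 280],
                         [225, 248]])
    chi2, p, dof, expected = chi2_contingency(observed, correction=False)
    print(round(chi2, 3), dof, round(p, 3))   # 2.264, 1, p ≈ 0.132 (> 0.05, so independent)
    print(expected.round(2))                  # matches the Expected table above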

BITS Pilani, WILP


Exercise

The table shows the observed data of Genders and


their Pet Preference. If the significance level is 0.05.
calculate the chi-square measure and testify the
hypotheses.
Cat Dog
Men 207 282
Women 231 242
Observed

Answer: χ2 = 4.102. Attributes are correlated.


16

BITS Pilani, WILP


Reduction
Data reduction techniques can be applied to obtain reduced representation of the data that is
much smaller in volume, yet closely maintains the integrity and essence of the original data
set. The key idea is: Lesser volume - More efficient to work on - Same analytical results can be
obtained. There are different strategies for data reduction:
 Dimensionality Reduction: A general process of reducing variables or attributes (the
columns). Popular methods: Wavelet Transformations (WT) and Principal Component
Analysis (PCA). These two strategies project the original data into smaller space. They do
not remove attributes; rather transform the essence. Details of these strategies are not in
the syllabus but will be reviewed in other courses in the program. Attribute (Feature)
Subset Selection is a separate strategy where weak, redundant or irrelevant attributes are
detected and removed (the result is fewer columns).
 Numerosity Reduction: Data is stored in alternate smaller representation. Parametric
Methods: data parameters are stored in place of actual data. E.g. Regression and Log
Linear models. Non-parametric Methods: Reduced representation. E.g. Histograms,
Clustering, Sampling, Data-Cube aggregation. Outlier may be stored separately.
 Data Compression: Data is compressed. It could be lossy, where after decompression
some portion of the original data is lost, or lossless where after decompression that
original data is retrieved without any loss. Huffman Coding is one such lossless technique.
BITS Pilani, WILP
Attribute (Feature) Subset Selection
These methods are greedy: they attempt to make the optimal choice at each step. There are 3 basic methods. Let us take the initial set as {A1, A2, A3, A4, A5, A6}.

1. Forward Selection: Start with the NULL set and then keep adding:
{}
⇒ {A1}
⇒ {A1, A4}
⇒ {A1, A4, A6} the reduced set
(Attributes to add or to eliminate are decided – assuming they are independent – using their statistical significance.)

2. Backward Elimination: Start with the entire set and then eliminate:
{A1, A2, A3, A4, A5, A6}
⇒ {A1, A3, A4, A6}
⇒ {A1, A4, A6} the reduced set

3. A combination of the above two is also possible.

4. Decision Tree Induction: When fewer attributes capture all the classes, why do we need all the attributes? This will be discussed in the Classification module.

(Figure: a decision tree splitting on A4, then A6 and A1, with leaves Class-1 / Class-2.)
BITS Pilani, WILP
Histograms
 Usually shows the distribution of values of a single variable.
 Divide the values into bins or buckets and show a bar plot of the number of objects in each bin.
 The height of each bar indicates the number of objects in the bin.
 Example: An electronics store presented the data in the shown table (Unit Price in $ and the Quantity Sold at that price). How to visually analyze the count of items sold in different price ranges?
 Possibilities: Equal-Width (as shown) or Equal-Frequency (equal number of objects in each bin).

(Table: Unit Price ($) and the corresponding Quantity Sold for each item, with unit prices ranging from about $1 to $100.)

This numerosity reduction might make decision making simpler, e.g. can this store plan to remove the items whose unit price is less than $90?
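A minimal NumPy sketch of equal-width binning (the price list below is an illustrative subset, not the full table; weighting by Quantity Sold would give the number of units per range):

import numpy as np

prices = [9, 15, 19, 25, 27, 28, 29, 31, 32, 40, 45, 46, 48,
          51, 65, 74, 78, 79, 81, 83, 85, 86, 87, 91, 92, 94, 95, 96, 99, 100]

counts, edges = np.histogram(prices, bins=10, range=(0, 100))   # equal-width bins of $10
for lo, hi, c in zip(edges[:-1], edges[1:], counts):
    print(f"{lo:5.0f} - {hi:5.0f} : {'#' * c}")

# Equal-frequency bin edges can be obtained from quantiles instead:
eq_freq_edges = np.quantile(prices, np.linspace(0, 1, 6))        # 5 bins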


BITS Pilani, WILP
Clustering

Cluster representations like the diameter (the maximum distance between two objects in the same cluster) or the centroid (the center point of a cluster) are used to represent the data in place of all the data points. This reduces the numerosity. This will be discussed in the clustering techniques later in the course.

(Figure: a 2-D scatter plot of points A to G, which form clusters that can be represented by their centroids or diameters.)

20

BITS Pilani, WILP


Sampling

The process of obtaining a small sample S to represent the whole data set
N.
Simple Random Sampling:
 There is an equal probability of selecting any particular item
Sampling Without Replacement:
 Once an object is selected, it is removed from the population
Sampling With Replacement:
 A selected object is not removed from the population
Stratified Sampling:
 Partition the data set, and draw samples from each partition
(proportionally, i.e., approximately the same percentage of the
data)
 Used in conjunction with skewed data
21
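A small NumPy sketch contrasting the variants above (the population here is just an illustrative range of record IDs):

import numpy as np

rng = np.random.default_rng(seed=42)
population = np.arange(100)                                      # the data set N

srswor = rng.choice(population, size=10, replace=False)          # without replacement
srswr  = rng.choice(population, size=10, replace=True)           # with replacement (duplicates possible)
print(srswor)
print(srswr)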

BITS Pilani, WILP


Simple Random Sampling
Illustration

Raw Data 22

BITS Pilani, WILP


Stratified Sampling
Illustration

Record ID Attribute
R101 Child
R303 Child
R404 Child Record ID Attribute
R110 Child R101 Child
R440 Child R404 Child
R606 Man R606 Man
R707 Man R707 Man
R808 Man R330 Man
R909 Man R202 Woman
R330 Man
R202 Woman
The strata are child vs. adult (major), and within the adults, the gender. Samples are drawn from each stratum proportionally rather than by purely random sampling over the whole data set.
R220 Woman

23

BITS Pilani, WILP


Exercise
Data Cube aggregation is another numerosity reduction technique. Review the Data Cube from earlier slides of this deck and attempt the following question: Indian cities reported a few diseases from the public hospitals. Create a data cube that shows the data aggregation (roll-up?) under the following dimensions: North Indian and South Indian cities, Mosquito- and Water-borne diseases, before 2010 and after 2010.

Year 2000 2001 2011 2012 Year 2000 2001 2011 2012
Dengue 150 123 200 225 Dengue 123 65 145 150
Chikungunya 165 200 250 275 Chikungunya 180 100 125 150
Typhoid 1400 2000 2500 1240 Typhoid 600 700 600 750
Dysentery 1600 1800 2000 1250 Dysentery 700 650 700 800
New Delhi Jaipur

Year 2000 2001 2011 2012 Year 2000 2001 2011 2012
Dengue 100 96 12 35 Dengue 300 250 120 200
Chikungunya 50 34 125 100 Chikungunya 400 350 250 260
Typhoid 400 200 430 340 Typhoid 300 30 50 75
Dysentery 300 400 340 340 Dysentery 500 40 70 85
Hyderabad Chennai

BITS Pilani, WILP


Transformation

In Transformation pre-processing, the original data set is transformed so that the mining process becomes more efficient and the patterns can be identified more easily.

A few strategies for transformation are listed below, some of which were also reviewed in the previous pre-processing techniques:

 Smoothing
 Attribute Construction
 Aggregation
 Normalization
 Discretization
 Concept Hierarchy Generation for Nominal Attributes 25

BITS Pilani, WILP


Normalization
By z-score normalization using MAD

 Selecting a small measurement unit might lead to large values of the


attributes, so many times it is a good practice to normalize the data set.
 We have reviewed min/max and z-score normalization in the Data Exploration module. A few more varieties are listed below:
 Z-score normalization using Mean Absolute Deviation (MAD): the MAD, SA, of an attribute A with n values (v1, v2, ..., vn) and mean Ā is defined as:

   SA = (1/n) ( |v1 - Ā| + |v2 - Ā| + ... + |vn - Ā| )

   The z-score normalization of the i-th element using MAD is then:

   vi' = (vi - Ā) / SA
26

BITS Pilani, WILP


Example

Sl. No.   Attribute A values (vi)   Absolute |vi - Mean|   z-score using MAD
1         14                        21                     -1.40
2         17                        18                     -1.20
3         23                        12                     -0.80
4         26                         9                     -0.60
5         36                         1                      0.07
6         47                        12                      0.80
7         57                        22                      1.47
8         60                        25                      1.67
Total     280                       120
Mean = 280/8 = 35,  MAD = 120/8 = 15
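A short Python sketch that reproduces the table above:

values = [14, 17, 23, 26, 36, 47, 57, 60]

n = len(values)
mean = sum(values) / n                          # 35.0
mad = sum(abs(v - mean) for v in values) / n    # 15.0 (mean absolute deviation)
z_mad = [round((v - mean) / mad, 2) for v in values]
print(mean, mad, z_mad)                         # [-1.4, -1.2, -0.8, -0.6, 0.07, 0.8, 1.47, 1.67]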

27

BITS Pilani, WILP


Normalization
by Decimal Scaling

Normalization by decimal scaling moves the decimal point of


values of an attribute A. The number of decimal points moved
depends on the maximum absolute value of A. A value vi of
attribute A is normalized to vi’ by computing:
   vi' = vi / 10^j ,  where j is the smallest integer such that max |vi'| < 1

Example: The values of attribute A range from -986 to 917, so the maximum absolute value of A is 986. What will be the value of j? If j = 1, max |vi'| = 98.6; if j = 2, max |vi'| = 9.86; if j = 3, max |vi'| = 0.986 (less than 1). Larger values of j also keep max |vi'| below 1, but the smallest such j is 3. All values of A are then normalized using the above formula with j = 3. 28
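A minimal Python sketch of decimal scaling; it searches for the smallest j such that max|vi|/10^j is less than 1 (the value list is illustrative):

def decimal_scale(values):
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / (10 ** j) >= 1:             # smallest j with max |vi'| < 1
        j += 1
    return j, [v / (10 ** j) for v in values]

j, scaled = decimal_scale([-986, 120, 917])
print(j, scaled)                                # 3, [-0.986, 0.12, 0.917]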

BITS Pilani, WILP


Exercise
1. In Data Exploration, we reviewed [0.0, 1.0] normalization. If we need to change the scale of normalization to a different min/max, how would the formula change?
2. An attribute has the following values: {-0.004, -0.016, 0.20, 0.23, 1.87, 25.64}. Normalize using:
   i. Min-max normalization on the scale of [0.0, 1.0]
   ii. Z-score normalization
   iii. Z-score normalization using MAD
   iv. Normalization using decimal scaling
   v. Why are all values negative except the last under z-score normalization?

29

BITS Pilani, WILP


Discretization

It divides the range of a continuous attribute into intervals:


 Interval labels can then be used to replace actual data
values
 Reduce data size by discretization
Typical methods:
 Binning
 Histogram analysis
 Clustering analysis
 Decision-tree analysis
 Correlation analysis
30

BITS Pilani, WILP


Concept Hierarchy Generation
 Concept hierarchy organizes concepts (i.e., attribute values)
hierarchically and is usually associated with each dimension in
a data warehouse
 Concept hierarchies facilitate drilling and rolling in data
warehouses to view data in multiple granularity
 Concept hierarchy formation: Recursively reduce the data by
collecting and replacing low level concepts (such as numeric
values for age) by higher level concepts (such as youth, adult,
or senior)
 Concept hierarchies can be explicitly specified by domain
experts and/or data warehouse designers
 Concept hierarchy can be automatically formed for both
numeric and nominal data. 31

BITS Pilani, WILP


Concept Hierarchy
For Nominal Data

1. Specification of a partial/total ordering of attributes explicitly


at the schema level by users or experts
street < city < state < country
2. Specification of a hierarchy for a set of values by explicit data
grouping – manual definition is required.
{Bangalore, Dharwad, Mangalore} < Karnataka
3. Specification of only a partial set of attributes
E.g., only street < city, not others
4. Automatic generation of hierarchies (or attribute levels) by
the analysis of the number of distinct values
E.g., for a set of attributes: {street, city, state, country}
32

BITS Pilani, WILP


Concept Hierarchy
Automatic Concept Hierarchy Generation

Some hierarchies can be automatically generated based on the


analysis of the number of distinct values per attribute in the data set
 The attribute with the most distinct values is placed at the
lowest level of the hierarchy
 There could be exceptions. E.g. weekday, month, quarter, year

country 15 distinct values

state 365 distinct values

city 3567 distinct values

street 674,339 distinct values

33

BITS Pilani, WILP


Exercise

An automobile company manufactures cars. A car can be seen as a group of machinery and aesthetic parts. Then there are several sub-parts or ancillary products, e.g. engine, electrical parts, wheels, infotainment system, seating etc. Create a database schema for a few individual items and draw a concept hierarchy.

34

BITS Pilani, WILP


Thank You

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Appendix-A

36

BITS Pilani, WILP


General Equation of a Line

 The equation of a line is given by:


ax + by + k = 0
or, y = (-a/b)x + (-k/b)
Where (-a/b) is the slope of the line
and (-k/b) is the Y-axis intercept of the line.
 a, b and k are constants. Any coordinate on this line will satisfy the equation.
   (Figure: a line crossing the X and Y axes, showing its X-axis and Y-axis intercepts and the angle θ it makes with the X-axis.)
 The above equation can be written simply as:
   y = mx + c
   where m is the slope and c is the intercept on the Y-axis.
37

BITS Pilani, WILP


Trend in the Data
 The demand data for the six years is provided from
the year 2012 to 2017 in the given table.
 The forecast for the year 2018 is to be found out.
 A casual observation of the data itself reveals that there is an increasing trend.
 Therefore, a simple arithmetic average will not provide the forecast for the year 2018, because it will give a value between 26 and 35.
 Also weighted average, k-moving simple average and exponential smoothing will not work for this kind of data.
 There is a requirement to build a different kind of model that is capable of capturing the trend that the given data set exhibits.
 The increasing trend can also be observed by plotting the points.
 To come up with a model, a line can be drawn which represents the trend by being as close as possible to each of these points.

   Year   Demand (Dt)
   2012   26
   2013   28
   2014   29
   2015   31
   2016   32
   2017   35
   2018   ?
38
BITS Pilani, WILP
Linear Regression
 Let the equation of the best representative line
be: Y = a + bt, where a and b are constants that need to be found out. Y represents the Y-axis and t represents the X-axis.
 Observe that some of the data points will be
falling above or below the line, where their
displacement would be given by: Yt - a – bt;
because displacement is only on the Y-axis and
X-coordinate would be same.
 The Sum-of-Squared-Error (SSE) can be
established as: et2 = (Yt - a – bt)2
 The best line would be having the least
aggregate SSE.
 So the objective would be to find out the values
of a and b for which total SSE is minimum i.e. to Points above/below/on the representative line
minimize the relationship:
∑ et2 = ∑ (Yt - a – bt)2 39

BITS Pilani, WILP


Illustration
Linear Regression

∑ et2 = ∑ (Yt - a – bt)2


Partially differentiating the equation w.r.t. a and b
respectively and equating the resulting expressions to 0:
-2∑ (Yt - a – bt) = 0 or, ∑Yt = na + b.∑t
-2∑ t.(Yt - a – bt) = 0 or, ∑t.Yt = a.∑t + b.∑t2
Year Demand (Dt) t Yt t.Yt t2 Substituting the values:
2012 26 1 26 26 1 181 = 6a + 21b --------- (i)
2013 28 2 28 56 4 663 = 21a + 91b ---------(ii)
2014 29 3 29 87 9 Or,
2015 31 4 31 124 16 a = 24.25
2016 32 5 32 160 25 b = 1.69
2017 35 6 35 210 36 Forecast for the 7th year =
2018 ? ∑ 21 181 663 91 using Y = a+bt,
24.25+ 1.69*7 = 36.08
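The normal-equation solution above can be cross-checked with NumPy's least-squares polynomial fit (a sketch, not part of the original slide; small rounding differences are expected):

import numpy as np

t = np.array([1, 2, 3, 4, 5, 6])            # periods for 2012..2017
y = np.array([26, 28, 29, 31, 32, 35])      # demand

b, a = np.polyfit(t, y, deg=1)              # slope b and intercept a
print(round(a, 2), round(b, 2))             # ~24.27 and ~1.69
print(round(a + b * 7, 2))                  # forecast for 2018, ~36.07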
BITS Pilani, WILP
DSECL ZC415
Data Mining
Classification & Prediction
Revision -3.0

BITS Pilani Prof Vineet Garg


Work Integrated Learning Programmes Bangalore Professional Development Center
Introduction
 Let us say there is a collection of records (tuples) where each record contains a set of attributes
including one of the attributes that is called the class or label.
 It is required to find out a model for the class attribute as a function of other attributes. The
function is called the target function and the model is called the Classification Model or Classifier.
 Example: The animal data provided below establishes a class relationship based on other attributes.

Body Gives
Name Skin Aquatic Aerial Legs Hibernates Class
Temperature Birth
Human Warm Hair Yes No No Yes No Mammal
Python Cold Scales No No No No Yes Reptile
Pigeon Warm Feathers No No Yes Yes No Bird
..... ..... ..... ..... ..... ..... ..... ..... .....

 The model can serve for descriptive modeling. It explains different features of different classes of
animals.
 The model can also help in predictive modeling. For example the record below can be tested
against the available data set to find out the class of a new animal.
Body Gives
Name Skin Aquatic Aerial Legs Hibernates Class
Temperature Birth
Gila Cold Scales No No No Yes Yes ??
BITS Pilani, WILP
Classification: General Approach
1. Set of records provided where class is known - The Training Set:

Marital Taxable Risky


ID Status Income Loan? Classification
1 Single 125K No Algorithm
2 Married 100K No
3 Single 70K No
2. Model is built using a
4 Married 120K No
classification algorithm.
5 Divorced 95K Yes
6 Married 60K No Learn
7 Divorced 220K No Model
8 Single 85K Yes
9 Married 75K No
10 Single 90K Yes
Model

Apply E.g. Decision Tree or Rules

Marital Taxable Risky


Model
ID Status Income Loan?
11 Single 135K ? 3. Model is applied to the new
12 Married 180K ? data records ( Test Data Set) to
identify the class.

BITS Pilani, WILP


Decision Tree
A Classification Technique

Body Gives
Name Skin Aquatic Aerial Legs Hibernates Class
Temperature Birth
Strange Warm Scales No No No Yes Yes ??

Body Root
Temperature
Internal Cold
Node Warm Assign Non-Mammal as class.
Gives Non
Birth Mammal

Yes No
 All attributes do not play a
role?
Non
 How do we decide? Mammal
Mammal
 Why Body Temperature
first? What is the order of
selecting attributes – root Leaf Nodes
and further down?

BITS Pilani, WILP


Binary Attributes
Attribute Test Conditions and Splitting

The partition is simple, because there are only two


outcomes.

Body Medical Test


Temperature Report
Warm Cold Positive Negative

Taxable
Income?
Yes No

BITS Pilani, WILP


Nominal Attributes
Attribute Test Conditions and Splitting

 Those attributes which uniquely identify a record do


not need splitting and can be ignored. E.g. Employee
ID. Not used for building a decision tree.
 Other nominal attributes can be grouped or split in a multiway fashion.
Marital Marital
Status Status
Single Divorced {Single} {Married, Divorced}

Married
and other combinations....

BITS Pilani, WILP


Ordinal Attributes
Attribute Test Conditions and Splitting

Ordinal attributes can be grouped or split in a multiway fashion as long as they do not violate the order property.

Exam Exam
Grades Grades
Excellent Poor Poor {Good, Excellent}

Good

Exam
Grades
Probably a wrong split!
Good {Poor, Excellent} Why?

BITS Pilani, WILP


Numeric Continuous Attributes
Attribute Test Conditions and Splitting

 Test conditions for numeric attributes can be expressed


as comparison test with binary outcomes or in a form of
range query.
 In the first case, the comparison case becomes intensive
because it may be required to test each value for binary
outcome.
 For the second case, all possible ranges need to be
checked.
Annual Annual
Income Income
> 80K < 10K > 80K
Yes No
(10K, 25K}

BITS Pilani, WILP


Measure for the Best Split

 Best split is based on the degree of impurity of the child nodes.
 The smaller the degree of impurity, the more skewed the class distribution.
 Let p(i|t) be the fraction of records belonging to class i at a given node t. The following measures are used to measure the impurity:

   Entropy(t) = Info(t) = - Σ (i = 0 to c-1) p(i|t) . log2 p(i|t)
   Gini(t) = 1 - Σ (i = 0 to c-1) [p(i|t)]²
   Classification Error(t) = 1 - max over i of [p(i|t)]

   where c = number of classes.   (Review the Logic Formulation from the Appendix.)
10

BITS Pilani, WILP


Example

6 Records 6 Records 6 Records


Class-1 0 Class-1 1 Class-1 3
Class-2 6 Class-2 5 Class-2 3

Entropy = - { (0/6).log (0/6) + (6/6).log Entropy = - { (1/6).log (1/6) + Entropy = - { (3/6).log (3/6) +
(6/6) } = 0 (5/6).log (5/6) } = 0.65 (3/6).log (3/6) } = 1
Gini = 1- {(0/6)2 + (6/6)2} = 0 Gini = 1- {(1/6)2 + (5/6)2} = 0.28 Gini = 1- {(3/6)2 + (3/6)2} = 0.50
Classification Error = 1 – max {0/6, Classification Error = 1 – max Classification Error = 1 – max
6/6} = 0 {1/6, 5/6} = 0.17 {3/6, 3/6} = 0.50

c 1
Entropy( t )  Info( t )   p( i / t )log 2 p( i / t )
i 0
c 1 Best split is based on the
Gini( t )  1   [ p( i / t )] 2 lesser degree of impurity.
i 0

Classification Error( t )  1  max[ p( i / t )] Review basic logarithm from the


i
appendix.
where, c  number of classes
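The three impurity measures can be coded in a few lines. This sketch reproduces the three columns above (class counts [0, 6], [1, 5] and [3, 3]):

import math

def impurities(counts):
    total = sum(counts)
    probs = [c / total for c in counts]
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    gini = 1 - sum(p * p for p in probs)
    error = 1 - max(probs)
    return entropy, gini, error

for counts in ([0, 6], [1, 5], [3, 3]):
    print(counts, [round(x, 2) for x in impurities(counts)])
# [0, 6] -> [0.0, 0.0, 0.0]; [1, 5] -> [0.65, 0.28, 0.17]; [3, 3] -> [1.0, 0.5, 0.5]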

BITS Pilani, WILP


Observation

The possible values of different measures are shown


below where p is fraction of records.

12

BITS Pilani, WILP


Goodness of Split

 There are algorithms which work on the measures discussed earlier (Gini and
Entropy) to decide how the split of a branch will happen taking which
attributes; one at a time.
 CART (Classification and Regression Tree) works on Gini.
 ID3 (Iterative Dichotomiser-3) works on Entropy (Info).
 C4.5 extension of ID3. It leverages Entropy (Info) in calculating the Gain Ratio.
 The general principle works in the following manner:
o Degree of impurity before splitting is compared with the degree of impurity of
attributes (which are considered for splitting). The larger the difference, the
better the suitability to use that attribute for splitting. In other words, attributes
with lower values of impurity are preferred first.
o If CART algorithm is used, the difference is called the reduction in impurity.
o If entropy based ID3 algorithm is used, the difference is called the information
gain or simply the gain.
o In C4.5, the attribute that yields maximum gain ratio, is selected first.
BITS Pilani, WILP
Please Note

Now this slide deck will elaborate on the following:


 How measures of best split is calculated individually for
Binary, Nominal and Numeric attributes. Gini will be taken in
examples.
 The concept will be used to build a complete decision tree
taking all the attributes into account for a data set.

14

BITS Pilani, WILP


Splitting of Binary Attributes
Example continuing...

In the shown table the two binary Attribute A Attribute B Attribute x Attribute y
Yes Yes
Class
C0
attributes A and B are considered. Based Yes No C0
Yes No C0
on their combinations along with other Yes No C0
attributes describe a class (C0 or C1). No No C0
No No C0
Gini (Parent) before splitting = 1 – [(6/12)2+(6/12)2] = 0.5 Yes Yes C1
Yes Yes C1
Yes Yes C1
Parent
No Yes C1
C0 6
No No C1
C1 6
No No C1
Gini = 0.5

If attribute A is chosen to split


Yes No
Gini (Yes) = 1 – [(4/7)2+(3/7)2] = 0.49
C0 4 2 A Gini (No) = 1 – [(2/5)2+(3/5)2] = 0.48
C1 3 3 Yes No
Gini 0.49 0.48
Weighted Avg Gini = 0.49 Weighted Average
= (7/12)*0.49 + (5/12)*0.48 = 0.49

BITS Pilani, WILP


Splitting of Binary Attributes
Example concluded.

Attribute A Attribute B Attribute x Attribute y Class


If attribute B is chosen to split Yes Yes C0
Yes No C0
Yes No Yes No C0
C0 1 5 B Yes No C0
C1 4 2 No No C0
Yes No
Gini 0.32 0.41 No No C0
Weighted Avg Gini = 0.37 Yes Yes C1
Yes Yes C1
Yes Yes C1
No Yes C1
Gini (Yes) = 1 – [(1/5)2+(4/5)2] = 0.32 No No C1
No No C1
Gini (No) = 1 – [(5/7)2+(2/7)2] = 0.41

Weighted Average
= (5/12)*0.32 + (7/12)*0.41 = 0.37
Conclusion:
Gini before splitting is = 0.50
Gini for attribute A is = 0.49
Gini for attribute B is = 0.37
The reduction in impurity is maximized when B is chosen, as (0.50 - 0.37) > (0.50 - 0.49).
Therefore, B is preferred over A for splitting.
16
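A short Python sketch that reproduces the comparison of attributes A and B above by computing the weighted Gini of the child nodes:

def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def weighted_gini(children):
    # children: one [C0 count, C1 count] pair per child node
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * gini(c) for c in children)

print(round(weighted_gini([[4, 3], [2, 3]]), 2))   # attribute A: ~0.49
print(round(weighted_gini([[1, 4], [5, 2]]), 2))   # attribute B: ~0.37 -> preferred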
BITS Pilani, WILP
Splitting of Nominal Attributes
Example continuing...
In the shown table a nominal attribute Car Type is Car Type Attribute x Attribute y Class
being considered. Based on its combinations along with Sports C0
Sports C0
other attributes describe a class (C0 or C1). Let us say,
Sports C0
the splitting is done using two groups: {Sports, Luxury} Sports C0
and {Family}. Sports C0
Sports C0
Sports C0
Sports C0
Car Type Car Type Luxury C0
{Sports, Luxury} {Family} Family C0
C0 9 1 Family C1
C1 7 3 Family C1
{Sports, Luxury} {Family} Gini 0.49 0.38 Family C1
Weighted Avg Gini 0.47 Luxury C1
Luxury C1
Luxury C1
Luxury C1
Gini ({Sports, Luxury}) = 1 – [(9/16)2+(7/16)2] = 0.49 Luxury C1
Gini ({Family}) = 1 – [(1/4)2+(3/4)2] = 0.38 Luxury C1
Luxury C1

Weighted Average
= (16/20)*0.49 + (4/20)*0.38 = 0.47
17

BITS Pilani, WILP


Splitting of Nominal Attributes
Example continuing...
Car Type Attribute x Attribute y Class
If the splitting is done using two Sports C0
Sports C0
groups: {Family, Luxury} and {Sports}: Sports C0
Sports C0
Sports C0
Sports C0
Sports C0
Sports C0
Car Type Car Type Luxury C0
{Family, Luxury} {Sports} Family C0
C0 2 8 Family C1
C1 10 0 Family C1
{Family, Luxury} {Sports} Gini 0.28 0 Family C1
Weighted Avg Gini 0.17 Luxury C1
Luxury C1
Luxury C1
Luxury C1
Gini ({Family, Luxury}) = 1 – [(2/12)2+(10/12)2] = 0.28 Luxury C1
Gini ({Sports}) = 1 – [(8/8)2+(0/8)2] = 0 Luxury C1
Luxury C1

Weighted Average
= (12/20)*0.28 + (8/20)*0 = 0.17
18

BITS Pilani, WILP


Splitting of Nominal Attributes
Example continuing...
Car Type Attribute x Attribute y Class
If the splitting is done using two Sports C0
Sports C0
groups: {Family, Sports} and {Luxury}: Sports C0
Sports C0
Sports C0
Sports C0
Sports C0
Sports C0
Car Type Car Type Luxury C0
{Family, Sports} {Luxury} Family C0
C0 9 1 Family C1
C1 3 7 Family C1
{Family, Sports} {Luxury} Gini 0.38 0.21 Family C1
Weighted Avg Gini 0.31 Luxury C1
Luxury C1
Luxury C1
Luxury C1
Gini ({Family, Sports}) = 1 – [(9/12)2+(3/12)2] = 0.38 Luxury C1
Gini ({Luxury}) = 1 – [(1/8)2+(7/8)2] = 0.21 Luxury C1
Luxury C1

Weighted Average
= (12/20)*0.38 + (8/20)*0.21= 0.31
19

BITS Pilani, WILP


Splitting of Nominal Attributes
Example concluded.
Car Type Attribute x Attribute y Class
If the splitting is done using Sports
Sports
C0
C0
multiway split: Sports
Sports
C0
C0
Car Type Sports C0
{Family} {Luxury} {Sports} Sports C0
Car Type C0 1 1 8 Sports C0
C1 3 7 0 Sports C0
Gini 0.38 0.21 0 Luxury C0
Weighted Avg Gini 0.16 Family C0
{Family} {Sports}
Family C1
{Luxury} Family C1
Family C1
Gini ({Family}) = 1 – [(1/4)2+(3/4)2] = 0.38 Luxury C1
Gini ({Luxury}) = 1 – [(1/8)2+(7/8)2] = 0.21 Luxury C1
Luxury C1
Gini ({Sports}) = 1 – [(8/8)2+(0/8)2] = 0 Luxury C1
Luxury C1
Luxury C1
Weighted Average Luxury C1
= (4/20)* 0.38 + (8/20)*0.21 + (8/20)*0 = 0.16
Conclusion: The lowest value of Gini index suggests that multiway
split is the best option for this example, because it will maximize the
reduction in impurity.
BITS Pilani, WILP
Splitting of Continuous Numeric
Attributes
 Records are to be split based on the taxable income. ID
Marital Taxable
Status Income
Risky
Loan?
 Sort the data set on taxable income. 1 Single 125K No
2 Married 100K No
 Split positions are identified taking the mid-point of two 3 Single 70K No
adjacent values. 4 Married 120K No
5 Divorced 95K Yes
 Linearly scan all records (not shown in the figure), each time 6 Married 60K No
updating the count matrix and computing Gini index. 7 Divorced 220K No
8 Single 85K Yes
 Choose the split position that has the least Gini index (97K in 9 Married 75K No
the example) 10 Single 90K Yes

Risk? No No No Yes Yes Yes No No No No


Taxable Income
60 70 75 85 90 95 100 120 125 220
Sorted Values 65 72 80 87 92 97 110 122 172
Split Positions <= > <= > <= > <= > <= > <= > <= > <= > <= >
Yes 0 3 0 3 0 3 1 2 2 1 3 0 3 0 3 0 3 0

No 1 6 2 5 3 4 3 4 3 4 3 4 4 3 5 2 6 1

Gini 0.400 0.375 0.343 0.417 0.400 0.300 0.343 0.375 0.400
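A hedged Python sketch of the scan described above: sort the records by income, evaluate the weighted Gini at every mid-point and keep the best split:

def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts) if n else 0.0

incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]          # taxable income (K)
risky   = ['No', 'No', 'No', 'No', 'Yes', 'No', 'No', 'Yes', 'No', 'Yes']

data = sorted(zip(incomes, risky))
best = None
for i in range(len(data) - 1):
    split = (data[i][0] + data[i + 1][0]) / 2                   # candidate mid-point
    left  = [c for v, c in data if v <= split]
    right = [c for v, c in data if v > split]
    w = (len(left) * gini([left.count('Yes'), left.count('No')]) +
         len(right) * gini([right.count('Yes'), right.count('No')])) / len(data)
    if best is None or w < best[1]:
        best = (split, w)

print(best)    # (97.5, 0.3) -- the 97K split position with the least weighted Gini 0.30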

BITS Pilani, WILP


Sketchy Illustration
Now let us use the concept to build a Name Age
Car Crash

quick decision tree before we take a Type Risk


Ben 30-40 Family Low
bigger example. The following dataset is
Paul 20-30 Sports High
available as training data with a car
Bill 40-50 Sports High
insurance company. Draw a complete James 30-40 Family Low
decision tree using a suitable metric. John 20-30 Family High
The Crash Risk is the class of interest. Steven 30-40 Sports High

Age Car Type


20-30 30-40 40-50 Sports Family Car Type
Low 0 2 0 Low 0 2
High 2 1 1 High 3 1 Sports Family
Gini 0 0.44 0 Gini 0 0.44
Age
Weighted Avg Gini 0.22 Weighted Avg Gini 0.22
20-30 or
High 30-40
40-50

Same Gini value! High Low


The tree can be built taking Age at the root also.

BITS Pilani, WILP


Example-1 (Courtesy – Tom M Mitchell)
Based on CART Algorithm
Page-1/12 - First Root Level Split: Gini for Outlook

Gini Index at the root level (9 Yes and 5 No from total 14


records):
1-{(9/14)2+(5/14)2} Day Outlook Temperature Humidity Wind
Play
Tennis?
= 1 – (0.41+0.13)
 Impurity before splitting
1 Sunny Hot High Weak No
= 0.46 2 Sunny Hot High Strong No
Gini Index for Outlook: 3 Overcast Hot High Weak Yes
4 Rain Mild High Weak Yes
{Sunny}, {Overcast}, {Rain} {Sunny, Rain}, {Overcast} 5 Rain Cool Normal Weak Yes
6 Rain Cool Normal Strong No
Outlook Outlook
7 Overcast Cool Normal Strong Yes
Sunny Overcast Rain Sunny, Rain Overcast
8 Sunny Mild High Weak No
Yes 2 4 3 Yes 5 4
9 Sunny Cool Normal Weak Yes
No 3 0 2 No 5 0
10 Rain Mild Normal Weak Yes
Gini 0.48 0 0.48 Gini 0.50 0
11 Sunny Mild Normal Strong Yes
Weighted Avg Gini 0.34 Weighted Avg Gini 0.36
12 Overcast Mild High Strong Yes
13 Overcast Hot Normal Weak Yes
{Sunny, Overcast}, {Rain} {Sunny}, {Rain, Overcast}
14 Rain Mild High Strong No
Outlook Outlook
Sunny, Overcast Rain Sunny Rain, Overcast
Yes 6 3 Yes 2 7 Multiway-split yields
No 3 2 No 3 2
Gini 0.44 0.48 Gini 0.48 0.35
the lowest Gini!
Weighted Avg Gini 0.45 Weighted Avg Gini 0.40

BITS Pilani, WILP


Page-2/12 - First Root Level Split: Gini for Temperature

Gini Index for Temperature:


Play
Day Outlook Temperature Humidity Wind
{Hot}, {Mild}, {Cool} {Hot, Cool}, {Mild} Tennis?

Temperature Temperature 1 Sunny Hot High Weak No


Hot Mild Cool Hot, Cool Mild 2 Sunny Hot High Strong No
Yes 2 4 3 Yes 5 4 3 Overcast Hot High Weak Yes
No 2 2 1 No 3 2 4 Rain Mild High Weak Yes
Gini 0.50 0.44 0.38 Gini 0.47 0.44 5 Rain Cool Normal Weak Yes
Weighted Avg Gini 0.44 Weighted Avg Gini 0.46 6 Rain Cool Normal Strong No
7 Overcast Cool Normal Strong Yes
{Hot, Mild}, {Cool} {Hot}, {Mild, Cool} 8 Sunny Mild High Weak No
9 Sunny Cool Normal Weak Yes
Temperature Temperature
10 Rain Mild Normal Weak Yes
Hot, Mild Cool Hot Mild, Cool
11 Sunny Mild Normal Strong Yes
Yes 6 3 Yes 2 7
12 Overcast Mild High Strong Yes
No 4 1 No 2 3
13 Overcast Hot Normal Weak Yes
Gini 0.48 0.38 Gini 0.50 0.42
14 Rain Mild High Strong No
Weighted Avg Gini 0.45 Weighted Avg Gini 0.44

0.44 is the minimum value of Gini for Temperature but it is not as low
as it was for Outlook (0.34)! 24

BITS Pilani, WILP


Page-3/12 - First Root Level Split: Gini for Humidity

Gini Index for Humidity:


Play
Humidity Day Outlook Temperature Humidity Wind
Tennis?
High Normal
1 Sunny Hot High Weak No
Yes 3 6
2 Sunny Hot High Strong No
No 4 1
3 Overcast Hot High Weak Yes
Gini 0.49 0.24
4 Rain Mild High Weak Yes
Weighted Avg Gini 0.37
5 Rain Cool Normal Weak Yes
6 Rain Cool Normal Strong No
7 Overcast Cool Normal Strong Yes
8 Sunny Mild High Weak No
9 Sunny Cool Normal Weak Yes
10 Rain Mild Normal Weak Yes
11 Sunny Mild Normal Strong Yes
12 Overcast Mild High Strong Yes
13 Overcast Hot Normal Weak Yes
14 Rain Mild High Strong No

0.37 is the value of Gini for Humidity but it is not as low as it was for
Outlook (0.34)! 25

BITS Pilani, WILP


Page-4/12 - First Root Level Split: Gini for Wind

Gini Index for Wind:


Play
Wind Day Outlook Temperature Humidity Wind
Tennis?
Weak Strong
1 Sunny Hot High Weak No
Yes 6 3
2 Sunny Hot High Strong No
No 2 3
3 Overcast Hot High Weak Yes
Gini 0.38 0.50
4 Rain Mild High Weak Yes
Weighted Avg Gini 0.43
5 Rain Cool Normal Weak Yes
6 Rain Cool Normal Strong No
7 Overcast Cool Normal Strong Yes
8 Sunny Mild High Weak No
9 Sunny Cool Normal Weak Yes
10 Rain Mild Normal Weak Yes
11 Sunny Mild Normal Strong Yes
12 Overcast Mild High Strong Yes
13 Overcast Hot Normal Weak Yes
14 Rain Mild High Strong No

0.43 is the value of Gini for Wind but it is not as low as it was for
Outlook (0.34)! 26

BITS Pilani, WILP


Page-5/12 - First Root Level Split

Feature Gini index Day Outlook Temperature Humidity Wind


Play
Tennis?
Outlook 0.34 1 Sunny Hot High Weak No
2 Sunny Hot High Strong No
Temperature 0.44 3 Overcast Hot High Weak Yes

Humidity 0.37 4
5
Rain
Rain
Mild
Cool
High
Normal
Weak
Weak
Yes
Yes
Wind 0.43 6 Rain Cool Normal Strong No
7 Overcast Cool Normal Strong Yes
8 Sunny Mild High Weak No
9 Sunny Cool Normal Weak Yes
10 Rain Mild Normal Weak Yes
11 Sunny Mild Normal Strong Yes
12 Overcast Mild High Strong Yes
Outlook 13 Overcast Hot Normal Weak Yes
14 Rain Mild High Strong No

Sunny Overcast Rain

Let us go deeper for


Sunny Outlook first! 27

BITS Pilani, WILP


Page-6/12 - Sunny Outlook: Second Level Split - Temperature

Gini Index for Temperature:


Play
Day Outlook Temperature Humidity Wind
Tennis?
{Hot}, {Mild}, {Cool} {Hot, Cool}, {Mild} 1 Sunny Hot High Weak No
2 Sunny Hot High Strong No
Temperature Temperature
8 Sunny Mild High Weak No
Hot Mild Cool Hot, Cool Mild
9 Sunny Cool Normal Weak Yes
Yes 0 1 1 Yes 1 1
11 Sunny Mild Normal Strong Yes
No 2 1 0 No 2 1
Gini 0 0.50 0 Gini 0.44 0.50 While calculating weighted average note
Weighted Avg Gini 0.20 Weighted Avg Gini 0.46 that there are only 5 records now.
{Hot, Mild}, {Cool} {Hot}, {Mild, Cool}
Temperature Temperature
Hot, Mild Cool Hot Mild, Cool
Yes 1 1 Yes 0 2
No 3 0 No 2 1
Gini 0.38 0 Gini 0 0.44
Weighted Avg Gini 0.30 Weighted Avg Gini 0.26

Multiway-split yields
the lowest Gini!
28

BITS Pilani, WILP


Page-7/12 - Sunny Outlook: Second Level Split - Humidity

Gini Index for Humidity:


Play
Day Outlook Temperature Humidity Wind
Tennis?
Humidity 1 Sunny Hot High Weak No
High Normal 2 Sunny Hot High Strong No
Yes 0 2 8 Sunny Mild High Weak No
No 3 0 9 Sunny Cool Normal Weak Yes
Gini 0 0 11 Sunny Mild Normal Strong Yes
Weighted Avg Gini 0

29

BITS Pilani, WILP


Page-8/12 - Sunny Outlook: Second Level Split - Wind

Gini Index for Wind:


Play
Day Outlook Temperature Humidity Wind
Tennis?
Wind 1 Sunny Hot High Weak No
Weak Strong 2 Sunny Hot High Strong No
Yes 1 1 8 Sunny Mild High Weak No
No 2 1 9 Sunny Cool Normal Weak Yes
Gini 0.44 0.50 11 Sunny Mild Normal Strong Yes
Weighted Avg Gini 0.46

30

BITS Pilani, WILP


Page-9/12 - Sunny Outlook: Second Level Split

Feature Gini index Play


Day Outlook Temperature Humidity Wind
Temperature 0.20 Tennis?
1 Sunny Hot High Weak No
Humidity 0.0 2 Sunny Hot High Strong No

Wind 0.46 8
9
Sunny
Sunny
Mild
Cool
High
Normal
Weak
Weak
No
Yes
11 Sunny Mild Normal Strong Yes

Outlook
Sunny Rain
Overcast

Humidity
High Normal

No Yes Classes are assigned directly. Verify


mathematically.

BITS Pilani, WILP


Page-10/12 - Overcast Outlook: Second Level Split

Play
Day Outlook Temperature Humidity Wind
Tennis?
Outlook
3 Overcast Hot High Weak Yes
Sunny Rain 7 Overcast Cool Normal Strong Yes
Overcast 12 Overcast Mild High Strong Yes
13 Overcast Hot Normal Weak Yes
Yes
Humidity

High Normal

No Yes No other option!


Why?
Check mathematically also!

32

BITS Pilani, WILP


Page-11/12 - Rain Outlook: Second Level Split

Gini Index for Temperature:


Play
Day Outlook Temperature Humidity Wind
Temperature Tennis?
Mild Cool 4 Rain Mild High Weak Yes
Yes 2 1 5 Rain Cool Normal Weak Yes
No 1 1 6 Rain Cool Normal Strong No
Gini 0.44 0.50 10 Rain Mild Normal Weak Yes
Weighted Avg Gini 0.46 14 Rain Mild High Strong No

While calculating weighted average note


that there are only 5 records now.
Gini Index for Humidity:
Humidity
High Normal
Yes 1 2 Gini Index for Wind:
No 1 1
Gini 0.50 0.44 Wind
Weighted Avg Gini 0.46 Weak Strong
Yes 3 1
No 0 1
Gini 0.0 0.50
Weighted Avg Gini 0.20

33

BITS Pilani, WILP


Page-12/12 - Rain Outlook: Second Level Split

Feature Gini index Day Outlook Temperature Humidity Wind


Play
Tennis?
Temperature 0.46 4 Rain Mild High Weak Yes
5 Rain Cool Normal Weak Yes
Humidity 0.46 6 Rain Cool Normal Strong No

Wind 0.20 10
14
Rain
Rain
Mild
Mild
Normal
High
Weak
Strong
Yes
No

Outlook
Sunny Rain
Overcast
Yes
Humidity Wind
High Normal Weak Strong

No Yes Yes No

Final Decision Tree Classes are assigned directly. Verify mathematically.

BITS Pilani, WILP


Example-2 Wine may be spoilt during
vinification if Temperature,
Pressure and Potassium Sorbet
Based on ID3 Algorithm are not managed well.

i. What is the entropy of the given data set? High High


Measurement Potassium Sorbet
Given that, 5 Yes and 4 No classes in the data, Instance
Pressure Temperature
Index (PSI)
Wine Spoilt?
(HP) (HT)
Entropy = - ∑ p(i/t).log 2p(i/t)
1 TRUE TRUE 1.00 No
= - { (5/9).log 2 (5/9) + (4/9).log 2 (4/9)} 2 TRUE TRUE 6.00 No
= - {- 0.47 - 0.52} = 0.99 3 TRUE FALSE 5.00 Yes
4 FALSE FALSE 4.00 No
5 FALSE TRUE 7.00 Yes
ii. What is information gain w.r.t. HP?
6 FALSE TRUE 3.00 Yes
HP 7 FALSE FALSE 8.00 Yes
TRUE FALSE 8 TRUE FALSE 7.00 No
Yes 1 4 9 FALSE TRUE 5.00 Yes
No 3 1
Entropy 0.81 0.72
Weighted Avg Entropy 0.76
iv. What is information gain w.r.t. PSI?
PSI 1.00 3.00 4.00 5.00 6.00 7.00 8.00
So, information gain = 0.99 - 0.76 = 0.23 Split 2.00 3.50 4.50 5.50 6.50 7.50
iii. What is information gain w.r.t. HT? <= > <= > <= > <= > <= > <= >
Yes 0 5 1 4 1 4 3 2 3 2 4 1
HT No 1 3 1 3 2 2 2 2 3 1 4 0
TRUE FALSE Entropy 0.84 0.99 0.92 0.98 0.97 0.89
Yes 3 2
No 2 2 So, information gain = 0.99 - 0.84 = 0.15
Entropy 0.97 0.99
Weighted Avg Entropy 0.98 Since information gain is highest w.r.t. HP (lowest
entropy), root level split should happen with HP.
So, information gain = 0.99 - 0.98 = 0.01
BITS Pilani, WILP
Example-3
C4.5: Gain Ratio

 Let us say in a dataset, there is an attribute that is ID Age Income Student?


Credit Class: Buys
unique for all records (e.g. BITS ID, Employee ID etc.). History Computers?
1 Youth High No Fair No
 This attribute would yield 0 Entropy and Gini (verify as 2 Youth High No Excellent No
an exercise!). 3 Middle High No Fair Yes
 Does it mean the information gain or reduction in 4 Senior Medium No Fair Yes
5 Senior Low Yes Fair Yes
impurity is highest for this attribute?
6 Senior Low Yes Excellent No
 The answer is no. Such attribute is useless for 7 Middle Low Yes Excellent Yes
classification and not used. 8 Youth Medium No Fair No
9 Youth Low Yes Fair Yes
 To avoid such bias, C4.5 algorithm introduced the
10 Senior Medium Yes Fair Yes
concept of Gain Ratio. The algorithm is considered as a 11 Youth Medium Yes Excellent Yes
successor of ID3. The attribute with higher gain ratio 12 Middle Medium No Excellent Yes
will be selected for splitting first. 13 Middle High Yes Fair Yes
14 Senior Medium No Excellent No
 In the given table, Low, Medium and High income
records are 4, 6 and 4 respectively out of 14 records.
So:
   SplitInfo(Income) = -(4/14).log2(4/14) - (6/14).log2(6/14) - (4/14).log2(4/14) = 1.557
   InformationGain(Income) = 0.029   (calculate the Entropy and then find it out)
   GainRatio(Income) = 0.029 / 1.557 = 0.019
   Notice the method of calculating SplitInfo!

BITS Pilani, WILP


Exercise
Credit Class: Buys
ID Age Income Student?
History Computers?
1 Youth High No Fair No
2 Youth High No Excellent No
3 Middle High No Fair Yes
4 Senior Medium No Fair Yes
5 Senior Low Yes Fair Yes
This dataset will be used
6 Senior Low Yes Excellent No for the discussion of next
7 Middle Low Yes Excellent Yes few topics.
8 Youth Medium No Fair No
9 Youth Low Yes Fair Yes
10 Senior Medium Yes Fair Yes
11 Youth Medium Yes Excellent Yes
12 Middle Medium No Excellent Yes
13 Middle High Yes Fair Yes
14 Senior Medium No Excellent No

1. Model the decision tree for the give records using CART and ID3 algorithms.
2. For the given attribute ID in the table above, calculate GainRatio and infer its value.

37

BITS Pilani, WILP


Noise in the Training Dataset
Scenario-1
Training Dataset
Body
Body Gives Four
Name Temperature Birth Legged Class
Temperature
Porcupine Warm Yes Yes Mammal Warm Cold
Cat Warm Yes Yes Mammal
Bat Warm Yes No Non-Mammal
Gives Non
Whale Warm Yes No Non-Mammal
Salamander Cold No Yes Non-Mammal
Birth Mammal
Dragon Cold No Yes Non-Mammal
Python Cold No No Non-Mammal Yes No
Salmon Cold No No Non-Mammal
Four
Eagle Warm No No Non-Mammal Non
Guppy Cold Yes No Non-Mammal
Legged
Mammal
Test Dataset Yes No
Body Gives Four Class
Name Temperature Birth Legged Prediction?
Non
Human Warm Yes No Non-Mammal Mammal
Pigeon Warm No No Non-Mammal
Mammal
Elephant Warm Yes Yes Mammal
Shark Cold Yes No Non-Mammal Corresponding Decision Tree: No Error
Turtle Cold No Yes Non-Mammal
Penguin Cold No No Non-Mammal
Eel Cold No No Non-Mammal
 There was error (noise) in the training dataset.
Dolphin Warm Yes No Non-Mammal  But decision tree is perfect as per the training data.
Gila Cold No Yes Non-Mammal  When test data is run, it predicts 2 records wrong.
Strange Cold Yes No Non-Mammal

BITS Pilani, WILP


Noise in the Training Dataset
Scenario-2
Training Dataset
Body Gives Four
Name Temperature Birth Legged Class Body
Porcupine Warm Yes Yes Mammal Temperature
Cat Warm Yes Yes Mammal
Warm Cold
Bat Warm Yes No Non-Mammal
Whale Warm Yes No Non-Mammal
Salamander Cold No Yes Non-Mammal Gives Non
Dragon Cold No Yes Non-Mammal Birth Mammal
Python Cold No No Non-Mammal
Salmon Cold No No Non-Mammal Yes No
Eagle Warm No No Non-Mammal
Guppy Cold Yes No Non-Mammal Mammal Non
Test Dataset Mammal
Body Gives Four Class
Name Temperature Birth Legged Prediction?
Corresponding Decision Tree: With Errors
Human Warm Yes No Mammal
Pigeon Warm No No Non-Mammal
Elephant Warm Yes Yes Mammal  There was error (noise) in the training dataset.
Shark Cold Yes No Non-Mammal  Decision tree is also imperfect as per the training data.
Turtle Cold No Yes Non-Mammal
 When test data is run, it predicts all records correct.
Penguin Cold No No Non-Mammal
Eel Cold No No Non-Mammal  The scenario-1 (previous slide – scenario-1) is said to be
Dolphin Warm Yes No Mammal overfitting Four Legged attribute because of noise.
Gila Cold No Yes Non-Mammal
Strange Cold Yes No Non-Mammal

BITS Pilani, WILP


Insufficient Training Data
Training Dataset Body
Body Four Temperature
Name Temperature Hibernates Legged Class
Salamander Cold No Yes Non-Mammal Warm Cold
Guppy Cold Yes No Non-Mammal
Eagle Warm No No Non-Mammal Non
Hibernates
Poorwill Warm Yes No Non-Mammal Mammal
Cat Warm Yes Yes Mammal
Yes No
Test Dataset
Four
Body Four Class Legged Non
Name Temperature Hibernates Legged Prediction? Mammal
Human Warm No No Non-Mammal
Pigeon Warm No No Non-Mammal
Yes No
Elephant Warm No Yes Non-Mammal
Shark Cold No No Non-Mammal Non
Mammal
Turtle Cold No Yes Non-Mammal Mammal
Penguin Cold No No Non-Mammal
Eel Cold No No Non-Mammal Corresponding Decision Tree: No Errors
Dolphin Warm No No Non-Mammal
 Decision tree is perfect as per the training data.
Gila Cold Yes Yes Non-Mammal
 When test data is run, it predicts 3 records wrong.
Strange Cold Yes No Non-Mammal
 Lack of sufficient training data.
 When a decision tree does not capture all the scenarios present in the data, the model is said to be underfitting.

BITS Pilani, WILP


Exercise

Refer to the text book and answer the following questions:


1. What is majority voting while building a decision tree?
2. What are these terms?
i. Pre-pruning
ii. Post-pruning
iii. Pessimistic Pruning
iv. The problems of repetition and replication

41

BITS Pilani, WILP


Rule-Based Classifier
Introduction

 In rule based classifier, the learned model is represented as a set of IF-


THEN rules. For example the ith rule in the model:
Ri: IF condition THEN conclusion
 The examples below show more concise form of writing these rules:
R1: (Age = Youth) ∧ (Student = Yes) ⇨ (Buys Computers = Yes)
R2: (Gives Birth = No) ∧ (Aerial = Yes) ⇨ (Class= Bird)
 If there are n rules in a model R. The model can be represented as:
R = (R1 ∨ R2 V R3 ∨..........Rn)
 Attribute value pairs have operators in the middle (=, ≠, >, <, ≤, ≥). Two
IF conditions are joined by ∧ (AND/Conjunction). Rules form a set in
the model with ∨ (OR/Disjunction) symbol.
 Left hand side of the rule is called the antecedent or precondition and
right side of the rule is called the consequent. 42

BITS Pilani, WILP


Rule Coverage and Accuracy

 If the condition (all the attribute tests) in a rule antecedent


holds TRUE for a given record, the rule is said to be satisfied
and the rule covers the record.
 In the dataset D, let Ncovers is the count of records covered by a
rule R, Ncorrect is the count of records correctly classified by R,
|D| is the count of records in D, the following relationships
can be expressed:
   Coverage(R) = Ncovers / |D|
   Accuracy(R) = Ncorrect / Ncovers
43

BITS Pilani, WILP


Example
Credit Class: Buys
ID Age Income Student
History Computers?
1 Youth High No Fair No
2 Youth High No Excellent No
3 Middle High No Fair Yes
4 Senior Medium No Fair Yes
5 Senior Low Yes Fair Yes
6 Senior Low Yes Excellent No
7 Middle Low Yes Excellent Yes
8 Youth Medium No Fair No
9 Youth Low Yes Fair Yes
10 Senior Medium Yes Fair Yes
11 Youth Medium Yes Excellent Yes
12 Middle Medium No Excellent Yes
13 Middle High Yes Fair Yes
14 Senior Medium No Excellent No

R1: (Age = Youth) ∧ (Student = Yes) ⇨ (Buys Computers = Yes)


The rule R1 covers 2 out of 14 records and it can correctly classify both.
So, Coverage (R1) = 2/14 (=14.29%) and Accuracy (R1) = 2/2 (=100%) 44
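A small pandas sketch (only the relevant columns of the table above are included) that computes the coverage and accuracy of R1:

import pandas as pd

df = pd.DataFrame({
    "Age":     ["Youth", "Youth", "Middle", "Senior", "Senior", "Senior", "Middle",
                "Youth", "Youth", "Senior", "Youth", "Middle", "Middle", "Senior"],
    "Student": ["No", "No", "No", "No", "Yes", "Yes", "Yes",
                "No", "Yes", "Yes", "Yes", "No", "Yes", "No"],
    "Buys":    ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
                "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"],
})

covered  = df[(df["Age"] == "Youth") & (df["Student"] == "Yes")]   # antecedent of R1
coverage = len(covered) / len(df)                                  # 2/14 = 14.29%
accuracy = (covered["Buys"] == "Yes").mean()                       # 2/2  = 100%
print(round(coverage, 4), accuracy)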

BITS Pilani, WILP


Learning the Rules
To build the model

 The rules can be extracted directly from the training set


using sequential covering algorithm.
 The algorithm name came from the notion that the rules
are learned sequentially, where each rule for a given
class will ideally cover many of the class records and
desirably none of the records of the other class.
 Popular variations of the algorithms are AQ, CN2 and
RIPPER (details are beyond the scope of this course).
 In addition to the sequential covering algorithm, decision
trees are also used to learn the rules, which is preferable
at times to maintain the classification model.
45

BITS Pilani, WILP


Sequential Covering Algorithm
General Approach

seqCoveringAlgo (D, attrVals, classes)
// D is the data set, attrVals is the set of attribute-value pairs, classes is the set of class labels.
{
    ruleSet = {ф};                              // initially the rule set is empty
    for each class c in classes
    {
        rule = learnOneRule (D, attrVals, c);   // learn a rule that covers records of class c
        remove records covered by this rule from D;
        ruleSet = ruleSet + rule;
    }
    return ruleSet;
}

BITS Pilani, WILP


Learn-One-Rule
General to Specific Approach

Loan  It starts with the empty antecedent set. It has poor


Loan Term Credit Rating
Name Age (A) Income (I) Decision quality because it covers all records.
(T) (C)
(D)  Greedy and Depth-First approach is adopted to
Sandy Youth Low Long Poor Risky
improve the rule quality and an antecedent is added.
Bill Middle Medium Long Excellent Safe
Rick Aged High Short Excellent Safe  Other attributes are added in the antecedent if quality
Smith Youth High Long Good Doubt of the rule further improves.
Ted Youth Low Long Good Risky  The procedure is repeated until all the classes and
Joe Aged Medium Short Poor Risky records are covered in the training set.
John Aged High Long Excellent Safe
 In the given example, attributes are added at each stage
Ruby Middle High Long Excellent Doubt
if the class is Safe for Loan Decision.
 The process is computationally very expensive.

IF (Empty) THEN (D = Safe)

IF (C = Excellent) THEN (D =
------ ------
Safe)

IF (C = Excellent AND I =
High) THEN (D = Safe)

IF (C = Excellent AND I = High


AND A = Aged) THEN (D = Safe)

BITS Pilani, WILP


Learn-One-Rule
Specific to General Approach

Loan Term Credit Rating


Loan  It starts with a specific positive record. In the
Name Age (A) Income (I) Decision
(T) (C)
(D) antecedent, all attribute value pairs are
Sandy Youth Low Long Poor Risky specified for that particular record. Rule
Bill Middle Medium Long Excellent Safe coverage is poor because it is only one record.
Rick Aged High Short Excellent Safe
Smith Youth High Long Good Doubt  Rule is generalized, dropping one or more of
Ted Youth Low Long Good Risky the conjuncts in the antecedent to cover more
Joe Aged Medium Short Poor Risky positive cases.
John Aged High Long Excellent Safe
Ruby Middle High Long Excellent Doubt  The procedure is repeated until all classes and
records are covered in the training set.
 Rules obtained finally may not be exactly the
same as obtained from general to specific
heuristic.
IF (A = Aged AND I = High AND T =
Long AND C = Excellent ) THEN (D =
Safe)

IF (C = Excellent AND I = High AND A = Aged)


THEN (D = Safe)

BITS Pilani, WILP


Rules: Conflict Resolution Strategy
 There is a record whose class is to be predicted using a rule-based classifier:
X = (Age = Youth, Income = Medium, Student = Yes, Credit History = Fair)

 Let there are two rules in the model:


R1: (Age = Youth) ∧ (Student = Yes) ⇨ (Buys Computers = Yes)
R2: (Age = Youth) ∧ (Student = Yes) ∧ (Income = Medium) ⇨ (Buys Computers = Yes)

 Both the rules are triggered and they specify the same class. But the question is which rule
will be fired and return the class? There are few methods to decide on that:
1. Size Ordering: The rule with higher antecedent size will be fired. In this case R2 will be
fired and return the class.
2. Rule Ordering: The rule having higher priority will be fired. In this scenario rule priority is
decided on some criteria like accuracy, coverage etc.
3. Class Ordering: This priority is decided on some basis like rules for the most prevalent
classes first and so on. This ordering is useful when two rules are triggered but they
specify different classes.
 If none of the rules is satisfied by X, it falls back to a default rule.

BITS Pilani, WILP


Coverage and Accuracy
Effectiveness in measuring the rule quality.

(Figure: a data set of 50 records plotted as + and - points. Rule R1 covers a large region containing mostly + records; rule R2 covers a small region containing only + records.)

 There are total 44 ‘+’ class records and 6 ‘-’ class records in a data set of 50 records. ‘-’
records are those which are not ‘+’ class.
 Rule R1 covers 40 records and correctly classifies 38 as +. Accuracy (R1) = 38/40 = 95%
 Rule R2 covers 2 records and correctly classifies all 2 as +. Accuracy (R2) = 2/2 = 100%
 R1 has more coverage but may classify wrong. R2 is more accurate but its coverage is
poor.
 Therefore, alternative measures are required for evaluating the rule quality! 50

BITS Pilani, WILP


Likelihood Ratio Statistic
   R = 2 . Σ (over all classes i) fi . log2(fi / ei)
   where R = Likelihood Ratio Statistic, fi = observed frequency of class i, ei = expected frequency of class i.

In the previous example: there are in total 44 + class records and 6 - class records. Rule R1 covers 40 + records and correctly classifies 38. Rule R2 covers 2 + records and correctly classifies all 2.
For R1, + records For R1, - records Likelihood Ratio for R1
fi = 38 fi = 2 R(R1) = 2x[38 x log2(38/35.2) + 2 x log2(2/4.8)]
ei = 40 x 44/50 = 35.2 ei = 40 x 6/50 = 4.8 = 2x[4.18 – 2.5 ] = 3.36

For R2, + records For R2, - records Likelihood Ratio for R2


fi = 2 fi = 0 R(R2) = 2x[2 x log2(2/1.76) + 0 x log2(0/0.24)]
ei = 2 x 44/50 = 1.76 ei = 2 x 6/50 = 0.24 = 2x[0.38 – 0 ] = 0.76
Likelihood ratio suggests R1 is better!
BITS Pilani, WILP
First Order Inductive Learner Gain
(FOIL Gain)

 Let us say, R0: {} ⇨ class (empty rule) and R1: {+} ⇨ class
 FOIL Gain(R0, Rx) = px. [ log2(px/(px+nx)) – log 2(p0/(p0 + n0)) ]
Where,
p0: number of + instances covered by R0
n0: number of - instances covered by R0
px: number of + instances covered by Rx
nx: number of - instances covered by Rx
In the previous example:
There are total 44 + class records and 6 - class records. Rule R1 covers 40 + records and correctly
classifies 38. Rule R2 covers 2 + records and correctly classifies all 2. Foil gain of R1 and R2
respectively over R0 (the null rule):

FOIL Gain(R0, R1) = px. [ log2(px/(px+nx)) – log2(p0/(p0 + n0)) ]


= 38. [log2(38/(38+2)) – log2(44/(44 + 6)) ]
= 38. [-0.074 + 0.18] = 4.0

FOIL Gain(R0, R2) = px. [ log2(px/(px+nx)) – log2(p0/(p0 + n0)) ]


= 2. [log2(2/(2+0)) – log2(44/(44 + 6)) ]
= 2. [0 + 0.18] = 0.36
FOIL Gain suggests R1 is better! 52

BITS Pilani, WILP


Classifier Evaluation Metrics
Terminology

 We have reviewed two classifier techniques – Decision Trees and Rules. These
classifiers are used over the test data. How can we compare which one is better?
 Let us say for a classification model, Positive Records (P) are those which belong to
the main class of interest (E.g. Buys Computers = YES) and all other records are
Negative Records (N).
 True Positives (TP): Count of positive records correctly labelled by the classifier as
YES.
 True Negatives (TN): Count of negative records correctly labelled by the classifier as
NO.
 False Positives (FP): Count of records incorrectly labelled by the classifier as YES. E.g.
Class in actual is NO.
 False Negatives (FN): Count of records incorrectly labelled by the classifier as NO.
Class in actual is YES.
53
 We will review few metrics which are based on the accuracy of the classifier.
BITS Pilani, WILP
Classifier Evaluation Metrics
   Accuracy = Recognition Rate = (TP + TN) / (P + N)
   Error Rate = (FP + FN) / (P + N)
   Sensitivity (Recall) = TP / P
   Specificity = TN / N
   Precision = TP / (TP + FP)
   F-Measure (F1 or F-Score) = (2 x Precision x Recall) / (Precision + Recall)
   Fβ = ((1 + β²) x Precision x Recall) / (β² x Precision + Recall), where β is a non-negative real number.

 F / F1 / F-Score is the harmonic mean of Precision and Recall.
 Fβ is the weighted harmonic mean of Precision and Recall, where β² is the weight assigned to Recall and 1 is the weight assigned to Precision.
BITS Pilani, WILP
Confusion Matrix
An effective way to capture classifier result on the test data-set

Predicted Class Accuracy /


Actual Class Total
Buys = Yes Buys = No Recognition

Buys = Yes 6954 46 7000 99.34%


Buys = No 412 2588 3000 86.27%
Total 7366 2634 10000 95.42%

From the matrix: P = 7000, N = 3000, TP = 6954, FP = 412, TN = 2588, FN = 46

   Accuracy = Recognition Rate = (TP + TN) / (P + N) = (6954 + 2588) / 10000 = 95.42%
   Error Rate = (FP + FN) / (P + N) = (412 + 46) / 10000 = 4.58%
   Sensitivity (Recall) = TP / P = 6954 / 7000 = 99.34%
   Specificity = TN / N = 2588 / 3000 = 86.27%
   Precision = TP / (TP + FP) = 6954 / (6954 + 412) = 94.41%
55
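A compact Python sketch that derives the same metrics from the confusion-matrix counts:

TP, FN = 6954, 46        # actual Buys = Yes (P = 7000)
FP, TN = 412, 2588       # actual Buys = No  (N = 3000)
P, N = TP + FN, FP + TN

accuracy    = (TP + TN) / (P + N)          # 0.9542
error_rate  = (FP + FN) / (P + N)          # 0.0458
recall      = TP / P                       # 0.9934 (sensitivity)
specificity = TN / N                       # 0.8627
precision   = TP / (TP + FP)               # 0.9441
f1          = 2 * precision * recall / (precision + recall)
print(accuracy, error_rate, recall, specificity, precision, round(f1, 4))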

BITS Pilani, WILP


More on Precision and Recall
 The generic form of confusion matrix (two classes) can be drawn as:
                     Predicted Class
   Actual Class      Yes      No
   Yes               TP       FN
   No                FP       TN

   Precision = TP / (TP + FP),   Sensitivity (Recall) = TP / (TP + FN) = TP / P.   The class of interest is YES.

 If we notice, Precision talks only about the column of Yes records in the matrix (exactness)
and Recall only about the row (completeness) of Yes records.
 There is an alternative measure (F measure) that combines Precision and Recall both using
harmonic mean. Fβ is is a corresponding weighted measure (e.g. F β = 2 which weights Recall
twice as much as Precision and F β = 0.5 which weights Precision twice as much as Recall.)

   F-Measure (F1 or F-Score) = (2 x Precision x Recall) / (Precision + Recall)
   Fβ = ((1 + β²) x Precision x Recall) / (β² x Precision + Recall), where β is a non-negative real number.
BITS Pilani, WILP
Exercise
i. Refer to the text book and answer the following questions:
i. When class distribution is balanced, which metric is effective?
ii. When the main class of interest is rare (class imbalance problem) in the dataset, will
accuracy be a effective metric? If not, then which one?
iii. Formulate an expression for accuracy in terms of sensitivity and specificity.
iv. What is re-substitution error? How is it related to the error rate metric?
ii. For the shown table, calculate different metrics and answer the following
questions:
i. Which metric would indicate that there are more negative records in the dataset?
ii. Is accuracy an effective metric in this dataset?

Predicted Class
Actual Class Cancer = Yes Cancer = No
Records Records
Cancer = Yes 90 210
Cancer = No 140 9560 57

BITS Pilani, WILP


Classifier Evaluation
Other Aspects

The metrics which are reviewed so far are based on the


accuracy of the classifier. There are other aspects which
can also be used to compare the classifier:
 Speed (computational complexity)
 Robustness (performance in the presence of noise)
 Scalability (performance when data volume
increases)
 Interpretability (simple or complex; required to
manage and update the classifier)

58

BITS Pilani, WILP


Classifier Accuracy Assessment
Holdout and Cross Validation Methods

We have a dataset how do we decide training and test data and assess the accuracy?
Holdout Method
 Given data is randomly partitioned into two independent sets:
o Training set (e.g., 2/3) for model construction.
o Test set (e.g., 1/3) for accuracy estimation.
 Random sampling: a variation of Holdout:
o Repeat holdout k times, accuracy = average of the accuracies obtained
Cross-validation (k-fold, where k = 10 is most popular)
 Randomly partition the data into k mutually exclusive subsets or folds (Di, where i
= 1 to k), each approximately of equal size.
 At i-th iteration, use Di as test set and others collectively as training set.
 Each fold serves only once as test dataset and equal number of times as part of
the training set as other folds.
 In the end, accuracy = total correct classifications / count of records.
 Two variations:
o Leave-one-out: k is selected as count of records. For small datasets.
o Stratified cross-validation: folds are stratified so that the class distribution in each fold is approximately the same as that in the initial data. 59
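A minimal sketch of stratified 10-fold cross-validation using scikit-learn (the data set and classifier are placeholders, assuming scikit-learn is available):

from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)                      # placeholder data set
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

correct, total = 0, 0
for train_idx, test_idx in skf.split(X, y):            # each fold serves once as the test set
    model = DecisionTreeClassifier().fit(X[train_idx], y[train_idx])
    correct += (model.predict(X[test_idx]) == y[test_idx]).sum()
    total += len(test_idx)

print(correct / total)   # accuracy = total correct classifications / count of records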

BITS Pilani, WILP


Classifier Accuracy Assessment
0.632 Bootstrap Method

 Let us say there are N records, which are sampled N times to create a training
set. Each time, when a sample is selected it is also put back for the next
sampling.
 The probability that a record is selected = 1/N and not selected = (1-1/N).
 So, the probability that a record is never chosen during N samplings = (1-1/N)N.
For large value of N, this is equal to e-1 = 0.368 (e = Euler’s number = 2.718)
 These not selected records will form the test set. Other records will be the
training set.
 The method is preferred for small data sets.
 If this sampling procedure is repeated k times, then accuracy for the model (M)
is found out as below (note that accuracy is calculated separately on the training set and the test set in each iteration):

   Acc(M) = (1/k) . Σ (i = 1 to k) { 0.632 x Acc(Mi)TestSet + 0.368 x Acc(Mi)TrainSet }
BITS Pilani, WILP
Thank You

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Appendix

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Properties of Basic Logarithms
If we write Z = X^Y, what is the value of Y?
Y = logX(Z)
In English: Y is the log of Z to base X, i.e. the exponent of X that gives Z.

Basic logarithms have the following properties:

i.   logx(1) = 0                              (log of 1 on any base is 0)
ii.  logx(x) = 1                              (log of any number on its own base is 1)
iii. logx(y.z) = logx(y) + logx(z)            (product rule)
     Example: log10(6) = log10(2x3) = log10(2) + log10(3) = 0.3010 + 0.4771 = 0.7781
iv.  logx(m^n) = n x logx(m)                  (exponent rule)
     Example: log10(8) = log10(2^3) = 3 x log10(2) = 3 x 0.3010 = 0.9030
v.   logx(m/n) = logx(m) - logx(n)            (quotient rule)
     Example: log10(8/2) = log10(8) - log10(2) = 0.9030 - 0.3010 = 0.6020
vi.  logY(m) = logX(m) / logX(Y)              (base-change rule; take any value of X, 10 is usual)
     Example: log2(8) = log10(8) / log10(2) = 0.9030 / 0.3010 = 3
63

BITS Pilani, WILP


Gini: Logic Formulation
Let us say we have two baskets. In the first basket there are fruits and in the second basket there are stickers
where names of the fruits are written. Count of fruits and stickers are same and are shown in the table below
and for each fruit there is one corresponding sticker.
Fruit Count Sticker Count Probability of choosing each at random
Pineapple 5 5 Fruit Sticker
Pear 2 2 Pineapple 0.5 0.5
Apple 3 3 Pear 0.2 0.2
Total 10 10 Apple 0.3 0.3

The Gini impurity tells us the probability that we select a fruit at random and a sticker at random and it is an
incorrect match. The contingency table below captures the data. Shaded cells are for the incorrect match
(impurity). Diagonal cells are for the correct match.
Probability Table
Pineapple Sticker Pear Sticker Apple Sticker
Pineapple Fruit 0.25 0.10 0.15
Pear Fruit 0.10 0.04 0.06
Apple Fruit 0.15 0.06 0.09

Notice that the sum of all cells in the probability table is 1.0. If we subtract the sum of the correct-match (diagonal) cells from 1.0, we get the sum of the incorrect-match cells; that sum is the Gini impurity. In general, if fi is the fraction of items carrying label i (so the probability of a correct match on label i is fi × fi = fi^2), where i ranges over m labels:

Gini = 1 − Σ_{i=1..m} fi^2

BITS Pilani, WILP


Entropy (Info): Logic Formulation
 Entropy, in physical systems, is the part of a system's total energy that is unavailable to do useful work.
 The idea of entropy in information theory was introduced by Claude Shannon (1916-2001) in the mid-20th century. In information theory, entropy is a measure of disorder.
 Let us say there is a set of N items. These items fall into two categories: x items have label-1 and y items have label-2. Remember that x + y = N.
 Two ratios are defined as:

p = x/N and q = y/N = 1 − p

 The entropy of this scenario is given by:

Entropy = −p·log2(p) − q·log2(q)

 Note that if p = 0 or q = 0, the entropy is 0 in both cases. That means if all items fall into only one category, there is no disorder. Otherwise the entropy value ranges from 0 to 1 (see the observation graph in earlier slides).
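A small Python sketch of the two-class entropy (illustrative); the 0·log2(0) = 0 convention is handled explicitly:

import math

def entropy(x, y):
    # x items of label-1, y items of label-2
    n = x + y
    p, q = x / n, y / n
    result = 0.0
    for frac in (p, q):
        if frac > 0:                      # by convention 0 * log2(0) = 0
            result -= frac * math.log2(frac)
    return result

print(entropy(5, 5))   # maximum disorder -> 1.0
print(entropy(10, 0))  # single category  -> 0.0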
BITS Pilani, WILP
Classification Error: Logic
Formulation
 A basket has 5 apples. The probability of randomly picking up a fruit and it being an apple = 5/5 = 1. Error = 1 − 1 = 0.
 A basket has 5 fruits: 1 apple, 1 pear, 1 pineapple, 1 banana and 1 orange. The probability of randomly picking up a fruit and it being an apple = 1/5 = 0.2. Error = 1 − 0.2 = 0.8.
 In general, classification error = 1 − max_i(p_i), where p_i is the fraction of items carrying label i.

66

BITS Pilani, WILP


DSECL ZC415
Data Mining
Association Analysis
Revision -2.0

BITS Pilani Prof Vineet Garg


Work Integrated Learning Programmes Bangalore Professional Development Center
Association Analysis

 Given a set of transactions, the objective is to find out the rules that will
predict the occurrence of an item based on the occurrences of other items
in the transaction.
 This module presents a methodology for discovering interesting
relationships hidden in the data sets. The uncovered relationships can be
represented in the forms of Association Rules.
Market-Basket Transactions:
TID   Items
1     Bread, Milk
2     Bread, Diaper, Beer, Eggs
3     Milk, Diaper, Beer, Coke
4     Bread, Milk, Diaper, Beer
5     Bread, Milk, Diaper, Coke

Example of Association Rules:
{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}

Are all equally convincing? 2

BITS Pilani, WILP


Terminology
 Let I = {i1, i2, ....id} be the set of all items in the data set and T = {t1, t2,
.....tn} be the set of all transactions. Each transaction, ti contains a
subset of items chosen from I. E.g. tx is a receipt having i1, i3, i5 items.
 In Association Analysis, a collection of zero or more items is termed as
itemset. E.g. {ф}, {i1}, {i1, i3}, {i1, i3, i5 } are itemsets from tx.
 If an itemset contains k items, it is called k-itemset. The empty (null)
itemset does not contain any item. E.g. {Beer, Diapers, Milk} is a 3-
itemset.
 The count of items (without null) present in a transaction is defined as
the width of the transaction. E.g. the width of tx is 3.
 Support count is an important property which refers to the count of
transactions that contain a particular itemset. E.g. in the Market-
Basket transactions the support count of {Beer, Diapers, Milk} is 2,
because there are only two transactions that contain all the three
items. BITS Pilani, WILP
Association Rules
What are those? Which of them are relevant?
 An association rule is of the form X → Y, meaning that whenever X (the antecedent) is observed in a transaction, Y (the consequent) is also observed. X and Y are itemsets.
 There is a table of transactions (below). Let us say there are two rules (R1 & R2):
R1: {Bread, Milk} → {Diapers}
R2: {Cola} → {Beer}
 There are two measures for the evaluation of association rules:
support(X → Y) = P(X ∪ Y); the probability that X and Y are both present (support is a joint probability).
confidence(X → Y) = P(Y | X) = support count(X ∪ Y) / support count(X) (confidence is a conditional probability).

Trans ID   Items
1          Bread, Milk
2          Bread, Diapers, Beer, Eggs
3          Milk, Diapers, Beer, Cola
4          Bread, Milk, Diapers, Beer
5          Bread, Milk, Diapers, Cola

 Evaluation measure examples:
For R1: support = 0.40, confidence = 0.67
For R2: support = 0.20, confidence = 0.50
 In practical conditions, there would be thresholds for support and confidence for the rules that
will make them relevant for the consideration. E.g. if threshold for support is 40% and
confidence is 50%, then rule R1 is useful for consideration.
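A small Python sketch (illustrative) that reproduces the support and confidence values of R1 and R2 from the transaction table above:

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]

def support_count(itemset):
    # number of transactions that contain every item of the itemset
    return sum(1 for t in transactions if itemset <= t)

def rule_measures(X, Y):
    both = support_count(X | Y)
    support = both / len(transactions)
    confidence = both / support_count(X)
    return support, confidence

print(rule_measures({"Bread", "Milk"}, {"Diapers"}))  # R1: (0.4, 0.67)
print(rule_measures({"Cola"}, {"Beer"}))              # R2: (0.2, 0.5)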

BITS Pilani, WILP


Support & Confidence: Why Both?

 A low support rule is likely to be uninteresting from a


business perspective. E.g. there is no point in promoting
products which customers seldom buy together.
Therefore support helps in eliminating uninteresting
rules.
 On the other hand, for a given rule X → Y, the higher the confidence, the more likely it is that Y is present in the transactions that contain X. This helps in cross-selling.
 Association rules do not necessarily imply causality (cause and effect). E.g. "ozone depletion causes global warming" is a causal relationship; such a cause-and-effect relationship may not exist behind an association rule.
5

BITS Pilani, WILP


Remember!

TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke

 The count of items present in a transaction is defined as the width of the


transaction. Width of TID-1 = 2 and for all others = 4.
 Support Count for {Beer, Diapers, Milk} = 2
 If there is a rule {Beer, Milk} → {Diapers}, its
o support = 2/5 = 0.4 (or 40%)
o confidence = 2/2 = 1.0 (or 100%) 6

BITS Pilani, WILP


Discover Useful Association Rules

 Given a set of transactions T, the goal of association


rule mining is to find all rules having:
o support (s) ≥ the threshold support value (minsup)
o confidence (c) ≥ the threshold confidence value (minconf)
 Brute-force approach:
o List all possible association rules.
o Compute the support and confidence for each rule.
o Prune rules that fail the minsup and minconf thresholds.
o Computationally prohibitive! Likely that a lot of
computations will be wasted.
o So what is the solution?
7

BITS Pilani, WILP


Discover Association Rules

TID   Items
1     Bread, Milk
2     Bread, Diaper, Beer, Eggs
3     Milk, Diaper, Beer, Coke
4     Bread, Milk, Diaper, Beer
5     Bread, Milk, Diaper, Coke

Example of Rules:
{Milk, Diaper} → {Beer}    (s=0.4, c=0.67)
{Milk, Beer} → {Diaper}    (s=0.4, c=1.0)
{Diaper, Beer} → {Milk}    (s=0.4, c=0.67)
{Beer} → {Milk, Diaper}    (s=0.4, c=0.67)
{Diaper} → {Milk, Beer}    (s=0.4, c=0.5)
{Milk} → {Diaper, Beer}    (s=0.4, c=0.5)
Observations:
 All the above rules are binary partitions of the same itemset -
{Milk, Diaper, Beer}.
 Rules originating from the same itemset have identical support but can have different
confidence.
 So, let us stop taking support and confidence together. Address one at a time. Decouple them.
 If support threshold is more than 0.4, we can say any itemset from {Milk, Diaper, Beer} is
infrequent and all the above six rules can be pruned (discarded) without calculating their
confidence values.
 So, there are two major sub-tasks: (i) Frequent Itemset Generation (ii) Rule Generation.
Confidence will be relevant in the second.
BITS Pilani, WILP
Frequent Itemset Generation

 If there are k items in a dataset, there could be up to 2^k − 1 itemsets, ignoring the null itemset. If k is large, the search space is exponentially large.
 Application of Apriori Principle is an effective way to
eliminate some of the candidate itemsets without counting
their support values. According to it – “if an item set is
frequent, then all of its subsets must also be frequent.”
 Support of an itemset never exceeds the support of its
subsets. It can be same though. Apriori principle holds true
because of this property which is also known as anti-
monotone property.
 Frequent Itemsets are also called Frequent Patterns (FP). 9
BITS Pilani, WILP
Candidate Generation
Let us first generate all possible itemsets

 This slide talks about an important merging method to enumerate possible itemsets for
calculations. E.g. if {a, b, c} and {a, b, d} are 3 item itemsets. What could be a possible 4-
item itemset?
 Let A = {a1, a2, .....ak-1} and B = {b1, b2, .....bk-1} are two (k-1) itemsets. How a k itemset can
be generated after combining them?
 Brute-Force Method: Enumerate all combinations. Computationally prohibitive!
 Ck = Lk-1 x Lk-1 Method: Let A = {a1, a2, .....ak-1} and B = {b1, b2, .....bk-1}. A and B can be
merged to generate a k itemset if:
ai = bi (for i = 1, 2, .....k-2) and ak-1 ≠ bk-1
 The merged itemset will be {a1, a2, .....ak-1, bk-1}
Example: if A = {1, 2, 3} and B = {1, 2, 4} then a possible 4 itemset is {1, 2, 3, 4}.

Example: if A = {1, 2, 3} and B = {4, 5, 6} then merged 4 itemset is not possible.

Example: if A = { {I1, I2}, {I1, I3}, {I1, I5}, {I2, I3}, {I2, I4}, {I2, I5} } find its own merged 3 itemsets:

= { {I1, I2}, {I1, I3}, {I1, I5}, {I2, I3}, {I2, I4}, {I2, I5} } x { {I1, I2}, {I1, I3}, {I1, I5}, {I2, I3}, {I2, I4}, {I2, I5} }

= { {I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5} {I2, I4, I5} }
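A small Python sketch of this merge step (illustrative); it assumes each (k-1)-itemset is kept in sorted (lexicographic) order, as the slides recommend:

def merge_candidates(L_prev):
    # L(k-1) x L(k-1): merge two sorted (k-1)-itemsets only when they agree on the
    # first k-2 items and differ in the last one.
    candidates = set()
    L_prev = [tuple(sorted(s)) for s in L_prev]
    for a in L_prev:
        for b in L_prev:
            if a[:-1] == b[:-1] and a[-1] < b[-1]:
                candidates.add(a + (b[-1],))
    return candidates

L2 = [("I1", "I2"), ("I1", "I3"), ("I1", "I5"), ("I2", "I3"), ("I2", "I4"), ("I2", "I5")]
print(sorted(merge_candidates(L2)))
# -> (I1,I2,I3), (I1,I2,I5), (I1,I3,I5), (I2,I3,I4), (I2,I3,I5), (I2,I4,I5)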

BITS Pilani, WILP


Apriori Algorithm: Frequent Itemset
Generation: Through Illustration
Transaction Table
 Transactional data is available for 9 transactions (table below). Let us assume minimum support count = 2. All possible frequent itemsets are to be generated from the given data.

Transaction Table:
Trans ID   Items
T100       I1, I2, I5
T101       I2, I4
T102       I2, I3
T103       I1, I2, I4
T104       I1, I3
T105       I2, I3
T106       I1, I3
T107       I1, I2, I3, I5
T108       I1, I2, I3

 Note - items within a transaction are arranged in increasing lexicographic order (I1, I2, I5 and not I2, I1, I5). This helps in arranging the combinations. (If a transaction has Plane, Car and Ship items, rearrange them as C, P and S.)
 Candidates for 1-itemsets (C1) are generated and their support counts are scanned from the transaction table:
C1: {I1}: 6, {I2}: 7, {I3}: 6, {I4}: 2, {I5}: 2
 None of the C1 itemsets is rejected, because all of them meet the minimum support count.
 The 1-item frequent itemsets (L1) are the same as C1 because no itemset is pruned:
L1: {I1}, {I2}, {I3}, {I4}, {I5}
 2-itemset candidates (C2) are generated from L1 and their support counts are scanned from the transaction table:
C2: {I1, I2}: 4, {I1, I3}: 4, {I1, I4}: 1, {I1, I5}: 2, {I2, I3}: 4, {I2, I4}: 2, {I2, I5}: 2, {I3, I4}: 0, {I3, I5}: 1, {I4, I5}: 0
 2-item frequent itemsets (L2) are generated from C2 considering the minimum support count. The itemsets {I1, I4}, {I3, I4}, {I3, I5} and {I4, I5} are pruned because their support count is less than 2:
L2: {I1, I2}, {I1, I3}, {I1, I5}, {I2, I3}, {I2, I4}, {I2, I5}

BITS Pilani, WILP


Apriori Algorithm
Illustration Concluded.
 3-itemset candidates C3 are generated from L2:
C3 = L2 × L2
   = { {I1, I2}, {I1, I3}, {I1, I5}, {I2, I3}, {I2, I4}, {I2, I5} } × { {I1, I2}, {I1, I3}, {I1, I5}, {I2, I3}, {I2, I4}, {I2, I5} }
   = { {I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5} }
 C3 is then pruned using the Apriori Principle (all subsets of a frequent itemset must also be frequent!). The pruned itemsets are:
o {I1, I3, I5}: because {I3, I5} is not in L2
o {I2, I3, I4}: because {I3, I4} is not in L2
o {I2, I3, I5}: because {I3, I5} is not in L2
o {I2, I4, I5}: because {I4, I5} is not in L2
 The transaction table is scanned only for the support counts of the remaining C3 itemsets: {I1, I2, I3}: 2 and {I1, I2, I5}: 2. The effort of scanning several itemsets for their support count is saved!
 The 3-item frequent itemsets L3 are generated considering the support count: L3 = { {I1, I2, I3}, {I1, I2, I5} }.
 The 4-itemset candidate C4 generated from L3 would be {I1, I2, I3, I5}, but C4 = {ф} because of the Apriori Principle. E.g. {I3, I5} is not frequent, so {I1, I2, I3, I5} cannot be.
 Therefore L4 is also {ф} and C5 cannot be generated.
 The algorithm terminates with all the frequent itemsets identified: 1-item, 2-
item and 3-item frequent itemsets. There is no frequent 4-item itemset.
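A compact, self-contained Python sketch of the whole Apriori loop (illustrative; candidate generation here is done by unions of frequent (k-1)-itemsets plus subset pruning, which yields the same candidates as the L(k-1) x L(k-1) merge). Run on the nine transactions above with minimum support count 2, it reproduces L1, L2 and L3:

from itertools import combinations

transactions = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"},
    {"I1", "I3"}, {"I2", "I3"}, {"I1", "I3"}, {"I1", "I2", "I3", "I5"},
    {"I1", "I2", "I3"},
]
minsup = 2

def support_count(itemset):
    return sum(1 for t in transactions if itemset <= t)

def apriori(transactions, minsup):
    items = sorted({i for t in transactions for i in t})
    Lk = {frozenset([i]) for i in items if support_count(frozenset([i])) >= minsup}
    all_frequent = {}
    k = 1
    while Lk:
        all_frequent.update({fs: support_count(fs) for fs in Lk})
        k += 1
        # candidate generation: union of two frequent (k-1)-itemsets of size k
        candidates = {a | b for a in Lk for b in Lk if len(a | b) == k}
        # Apriori pruning: every (k-1)-subset of a candidate must be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
        Lk = {c for c in candidates if support_count(c) >= minsup}
    return all_frequent

for itemset, count in sorted(apriori(transactions, minsup).items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)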
BITS Pilani, WILP
Exercise

1. In the previous Apriori Algorithm illustration:


a) L2 table generation from C2 needs a scan of all itemsets in C2 for
their support count and Apriori Algorithm is not useful here. Why?
b) Why C4 = {ф}?

2. In the Apriori Algorithm, Ck is generated using Lk-1. In this step, using the Apriori principle, a few itemsets are pruned from becoming part of Ck. Are all Ck itemsets found using this method guaranteed to be part of Lk?
(Answer: No. The Apriori principle says that if a set is frequent, all of its subsets must also be frequent. But the reverse is not true: it is possible that every subset is frequent while the set itself is not. So the support counts of the Ck itemsets are still scanned to find Lk. However, the scan is now required for fewer itemsets, which is what the Apriori principle helped to achieve.)
14

BITS Pilani, WILP


Compact Representation
Of Frequent Itemsets

 The count of frequent itemsets produced from a


transactional data can be very large.
 There are two methods to find out the compact
representation of the frequent itemsets from which
all other frequent itemsets can be derived:
i. Maximal Frequent Itemset
ii. Closed Frequent Itemset

15

BITS Pilani, WILP


Maximal and Closed Frequent
Itemsets – through lattice diagram
4 items, 5 transactions and the null itemset; min support count = 40% = 2

TID   Items
1     A, B, C
2     A, B, C, D
3     B, C
4     A, C, D
5     D

Lattice of itemsets (each itemset is shown with the IDs of the transactions that contain it):
A: {1, 2, 4}    B: {1, 2, 3}    C: {1, 2, 3, 4}    D: {2, 4, 5}
A,B: {1, 2}    A,C: {1, 2, 4}    A,D: {2, 4}    B,C: {1, 2, 3}    B,D: {2}    C,D: {2, 4}
A,B,C: {1, 2}    A,B,D: {2}    A,C,D: {2, 4}    B,C,D: {2}
A,B,C,D: {2}
 Immediate superset of {X} and {Y} is prepared after merging {X} and {Y} where the widths of {X} and {Y} are same. E.g. Immediate
superset of {A, B} and {A, C} is {A, B, C}. In general, superset of {X} is the set where {X} is one of the elements.
 A frequent itemset having none of its immediate supersets frequent is called Maximal Frequent Itemset. E.g. {A, B, C} and {A, C, D} are
Maximal Frequent Itemsets.
 An itemset is closed if none of its immediate supersets has exactly the same support count as it has. E.g. {C}, {D}, {A, C}, {B, C}, {A, B, C}
and {A, C, D} are closed itemsets. They are Closed Frequent Itemsets also because they are meeting the minsup criteria.
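A short Python sketch (illustrative) that derives the maximal and closed frequent itemsets from a dictionary mapping frequent itemsets to support counts, filled in here with the lattice values above. Only frequent immediate supersets need to be checked for the closed test, because an infrequent superset can never have the same support as a frequent itemset:

def maximal_and_closed(frequent):
    maximal, closed = [], []
    for itemset, count in frequent.items():
        # immediate supersets that are themselves frequent
        supersets = [s for s in frequent if itemset < s and len(s) == len(itemset) + 1]
        if not supersets:
            maximal.append(itemset)
        if all(frequent[s] != count for s in supersets):
            closed.append(itemset)
    return maximal, closed

frequent = {frozenset(s): c for s, c in [
    ("A", 3), ("B", 3), ("C", 4), ("D", 3),
    ("AB", 2), ("AC", 3), ("AD", 2), ("BC", 3), ("CD", 2),
    ("ABC", 2), ("ACD", 2),
]}
maximal, closed = maximal_and_closed(frequent)
print([sorted(m) for m in maximal])   # [['A','B','C'], ['A','C','D']]
print([sorted(c) for c in closed])    # C, D, AC, BC, ABC, ACD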

BITS Pilani, WILP


Exercise

1. A store has the following transactions, shown in the table. Using the Apriori Algorithm, generate the frequent itemset(s) with their counts. Threshold support count = 3.

Trans ID   Items
1          Shampoo, Phenyl
2          Shampoo, Soap, Detergent, Tissues
3          Phenyl, Soap, Detergent, Sanitizer
4          Shampoo, Phenyl, Soap, Detergent
5          Shampoo, Phenyl, Soap, Sanitizer

Answer:
L1: {Phenyl}, {Detergent}, {Shampoo}, {Soap}
L2: {Phenyl, Shampoo}, {Phenyl, Soap}, {Detergent, Soap}, {Shampoo, Soap}
L3: {ф}

2. Fiery Burger recorded the sale of the following items through four transactions during a specified time. Generate all frequent itemsets considering the support threshold as 50%.

Receipt ID   Items
T1           Wedges, Burger, Pizza
T2           Fries, Burger, Nuggets
T3           Wedges, Fries, Burger, Nuggets
T4           Fries, Nuggets

Answer:
L1: {Wedges}, {Fries}, {Nuggets}, {Burger}
L2: {Burger, Wedges}, {Fries, Nuggets}, {Burger, Fries}, {Burger, Nuggets}
L3: {Burger, Fries, Nuggets}
L4: {ф}

BITS Pilani, WILP


Exercise

Verify the following properties from the maximal and closed frequent itemset illustration:
1. The relationship shown in the diagram exists among the Frequent, Closed Frequent and Maximal Frequent Itemsets: Maximal Frequent Itemsets ⊆ Closed Frequent Itemsets ⊆ Frequent Itemsets. List down all the itemsets.

2. Anti-monotone property.
3. Apriori Principle.

18

BITS Pilani, WILP


Quick Recap
 Association Analysis – finding out rules of interest.
 Two measures of usefulness – support and confidence.
 Frequent itemsets – that meet the support threshold from the given
transactions.
 Anti-monotone property - support of an itemset never exceeds the support of
its subsets.
 Apriori Principle - if an itemset is frequent, then all of its subsets must also be
frequent.
 An itemset may not be frequent even if all of its subsets are frequent.
 Maximal Frequent Itemset - a frequent itemset having none of its immediate
supersets frequent.
 Closed Frequent Itemset - a frequent itemset having none of its immediate
supersets with exactly the same support count as it has.
 We have reviewed how frequent, closed and maximal itemsets are generated
and now we are approaching towards rule generation.
BITS Pilani, WILP
Association Rule Generation
 An association rule is depicted in the form X → Y, where X is called the antecedent and Y is called the consequent. X and Y are drawn from a frequent itemset.
 The objective is to generate association rules efficiently from a given frequent itemset.
 For a frequent k-itemset, 2^k − 2 rules can be produced, ignoring rules with a null antecedent or consequent and using all the elements of the itemset. Let X = {1, 2, 3} be a frequent itemset; then the following rules are possible:
{1, 2} → {3}
{1, 3} → {2}
{2, 3} → {1}
{1} → {2, 3}
{2} → {1, 3}
{3} → {1, 2}

 Each of these rules has support equal to the support of X in the transactions.
 Once a frequent itemset is identified and rules are generated from it, calculating the confidence does not require an additional scan of the transaction table. E.g.
o For the rule {1, 2} → {3}, the confidence = support count(1, 2, 3) / support count(1, 2).
o Since {1, 2, 3} is frequent, {1, 2} is also frequent because of the Apriori Principle, and the support counts of these two itemsets were already found during the iterations of frequent itemset generation.

 The question here is: are all such produced rules of interest? How does confidence play a role? A small sketch of confidence-based rule generation follows.
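A small Python sketch of this rule-generation step (illustrative), reusing the nine-transaction table of the earlier Apriori illustration; it enumerates the 2^k − 2 candidate rules of one frequent itemset and keeps those meeting a confidence threshold:

from itertools import combinations

transactions = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"}, {"I1", "I3"},
    {"I2", "I3"}, {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]

def support_count(itemset):
    return sum(1 for t in transactions if itemset <= t)

def rules_from_itemset(A, minconf):
    A = frozenset(A)
    kept = []
    for r in range(1, len(A)):                       # antecedent sizes 1 .. k-1
        for antecedent in combinations(A, r):
            B = frozenset(antecedent)
            # in Apriori these two counts are already available from the itemset phase
            conf = support_count(A) / support_count(B)
            if conf >= minconf:
                kept.append((sorted(B), sorted(A - B), round(conf, 2)))
    return kept

print(rules_from_itemset({"I1", "I2", "I5"}, minconf=0.67))
# keeps {I2,I5}->{I1}, {I1,I5}->{I2} and {I5}->{I1,I2}, each with confidence 1.0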
BITS Pilani, WILP
Confidence Based Rule Pruning
Trans ID Items L1 L2 L3
L4 Itemsets
T100 I1, I2, I5 Itemsets Itemsets Itemsets
T101 I2, I4 {I1} {I1, I2} {I1, I2, I3} {ф}
T102 I2, I3 {I2} {I1, I3} {I1, I2, I5}
T103 I1, I2, I4 {I3} {I1, I5}
T104 I1, I3 {I4} {I2, I3}
T105 I2, I3 {I5} {I2, I4}
T106 I1, I3 {I2, I5}
T107 I1, I2, I3, I5
T108 I1, I2, I3

Transaction Table Frequent Itemsets – Output of Apriori Algorithm

 The Apriori Principle does not hold for the confidence measure of a rule. That means the confidence of a rule X → Y can be bigger, smaller or the same as that of X' → Y', where X' ⊆ X and Y' ⊆ Y.
 Theorem (Confidence Based Rule Pruning): For a frequent itemset A where B ⊂ A, if a rule B → (A − B) does not satisfy the confidence threshold, then any rule from this frequent itemset A of the form B' → (A − B'), where B' ⊂ B, will not satisfy the confidence threshold either.
BITS Pilani, WILP
Illustration – continuing...
Confidence Based Rule Pruning

Trans ID Items L3
T100 I1, I2, I5 Itemsets Let us say, Confidence Threshold = 0.67
T101 I2, I4 {I1, I2, I3}
T102 I2, I3  Rules are being generated from the frequent itemset X, taking one item at a
T103 I1, I2, I4 time from antecedent to the consequent side starting from {X} {ф}.
T104 I1, I3
 Order for the movement can be lexicographic or reverse lexicographic .
T105 I2, I3
T106 I1, I3
T107 I1, I2, I3, I5 {I1, I2, I3}  {ф}
T108 I1, I2, I3

Confidence = 2/ 4 = 0.50
Confidence = 2/ 4 = 0.50 Confidence = 2/ 4 = 0.50

{I2, I3}  {I1} {I1, I3}  {I2} {I1, I2}  {I3}

No need to calculate
{I3}  {I1, I2} {I2}  {I1, I3} {I1}  {I2, I3} confidence. Can be
discarded using
Confidence = 2/ 6 = 0.33 Confidence = 2/ 7 = 0.29 Confidence = 2/ 6 = 0.33 confidence based rule
pruning theorem.
No rule is of interest for the given confidence threshold for {I1, I2, I3}.
BITS Pilani, WILP
Illustration – continuing...
Confidence Based Rule Pruning

Trans ID Items L3
T100 I1, I2, I5 Itemsets
T101 I2, I4 {I1, I2, I5} Let us say, Confidence Threshold = 0.67
T102 I2, I3
T103 I1, I2, I4
T104 I1, I3
T105 I2, I3
T106 I1, I3
T107 I1, I2, I3, I5
T108 I1, I2, I3 {I1, I2, I5}  {ф}

Confidence = 2/ 4 = 0.50
Confidence = 2/ 2 = 1.00
Confidence = 2/ 2 = 1.00

{I2, I5}  {I1} {I1, I5}  {I2} {I1, I2}  {I5}

{I5}  {I1, I2} {I2}  {I1, I5} {I1}  {I2, I5}


Confidence = 2/ 2 = 1.00 Confidence = 2/ 7 = 0.29 Confidence = 2/ 6 = 0.33

23

BITS Pilani, WILP


Illustration – continuing...
Confidence Based Rule Pruning

Trans ID
T100
Items
I1, I2, I5
L2 {I1, I2}  {ф} Confidence Threshold = 0.67
Itemsets
T101 I2, I4
{I1, I2}
T102
T103
I2, I3
I1, I2, I4
{I1, I3} {I1}  {I2} {I2}  {I1}
{I1, I5}
T104 I1, I3
{I2, I3} Confidence = 4/ 6 = 0.67 Confidence = 4/ 7 = 0.57 {I1, I3}  {ф}
T105 I2, I3
{I2, I4}
T106 I1, I3
{I2, I5}
T107 I1, I2, I3, I5 {I1}  {I3} {I3}  {I1}
T108 I1, I2, I3
Confidence = 4/ 6 = 0.67 Confidence = 4/ 6 = 0.67

{I1, I5}  {ф} {I2, I3}  {ф}

{I1}  {I5} {I5}  {I1} {I2}  {I3} {I3}  {I2}


Confidence = 2/ 6 = 0.33 Confidence = 2/ 2 = 1.00
Confidence = 4/ 7 = 0.57 Confidence = 4/ 6= 0.67

{I2, I4}  {ф} {I2, I5}  {ф}

{I2}  {I4} {I4}  {I2} {I2}  {I5} {I5}  {I2}


Confidence = 2/ 7 = 0.29 Confidence = 2/ 2= 1.00
Confidence = 2/ 7 = 0.29 Confidence = 2/ 2= 1.00

BITS Pilani, WILP


Illustration – concluded.
Confidence Based Rule Pruning

Using the confidence threshold and the Rule Pruning Theorem, the final set of Association Rules is the following (marked in green in the previous illustration slides):

i.    {I2, I5} → {I1}
ii.   {I1, I5} → {I2}
iii.  {I5} → {I1, I2}
iv.   {I1} → {I2}
v.    {I1} → {I3}
vi.   {I3} → {I1}
vii.  {I5} → {I1}
viii. {I3} → {I2}
ix.   {I4} → {I2}
x.    {I5} → {I2}

(Why are these rules marked in two different colours?)
25

BITS Pilani, WILP


Support and Confidence
Misleading? Do we need something else as well?

 In an electronics store, out of 10,000 transactions, 6000 include mobile


phones, 7500 include VCD players and 4000 include both.
 Let us take minsup = 30% and minconf = 60%.
 If an association rule is discovered as: R: {Mobile Phones} → {VCD Player}
support (R) = 4000/10000 = 40%
confidence (R) = 4000/6000 = 67%
 Based on the thresholds, it is a strong association rule!
 However, the probability of purchasing VCD player is 7500/10000 = 75% which
is more than the confidence.
 Purchase of mobile phones actually reduces the likelihood of purchasing the
VCD players. They are negatively associated.
 This may lead to wrong business decisions.
 Is there any other measure to identify issues like this?
26

BITS Pilani, WILP


Lift
Lift is a measure to calculate the correlation between
the itemsets of a transaction. It is defined for two
itemsets A and B as:

lift(A, B) = P(A ∪ B) / ( P(A) · P(B) )

where P(A ∪ B) is the probability that a transaction contains both A and B.

The value of lift, if:


< 1: The occurrence of A is negatively correlated with the occurrence of B.
> 1: The occurrence of A is positively correlated with the occurrence of B.
= 1: There is no correlation between A’s and B’s occurrences.

What is the lift value of previous electronics store example? 27


(Answer: 0.89; a negative correlation)
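A one-line Python check of that answer (illustrative):

def lift(p_both, p_a, p_b):
    return p_both / (p_a * p_b)

# electronics store: 4000 joint sales, 6000 mobile phones, 7500 VCD players out of 10,000
print(round(lift(4000 / 10000, 6000 / 10000, 7500 / 10000), 2))   # 0.89 -> negative correlation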
BITS Pilani, WILP
Exercise

A consumer good’s company marketing manager was demoted for over


emphasizing the sale of breakfast cereals to the basketball players based on
the following contingency table. Do you agree with the management’s
decision for demoting him? Assume support = 40%, confidence = 67%
thresholds.
Basketball No Basketball
Cereal 2000 1750
No Cereal 1000 250

Answer:
{Basketball} → {Cereal} (support = 40%, confidence = 67%)
Meeting the thresholds, but cereals are eaten by 75% of the people - more than the confidence metric.
That is also evident because lift(Basketball → Cereal) = 0.89
Negative correlation between Basketball and Cereals.
Agreed with the management’s decision to demote the marketing manager. 28

BITS Pilani, WILP


Exercise

In the previous two scenarios of lift calculation (Mobile


Phones/VCD Player and Basketball/Cereals) calculate
the Chi-Square value and verify its accordance with the
lift measurement.
Chi-Square value = 0: no correlation.
Chi-Square value > 0: there is correlation; if the expected count of the top-left cell is more than the observed count, the correlation is negative, otherwise it is positive.

29

BITS Pilani, WILP


Illustration
Rules Pruning using Lift

(Transaction table: the same nine transactions T100-T108 as in the earlier illustrations.)

i.    {I2, I5} → {I1},   lift = (2/9) / ((2/9) × (6/9)) = 1.50
ii.   {I1, I5} → {I2},   lift = (2/9) / ((2/9) × (7/9)) = 1.29
iii.  {I5} → {I1, I2},   lift = (2/9) / ((2/9) × (4/9)) = 2.25
iv.   {I1} → {I2},       lift = (4/9) / ((6/9) × (7/9)) = 0.86
v.    {I1} → {I3},       lift = (4/9) / ((6/9) × (6/9)) = 1.00
vi.   {I3} → {I1},       lift = (4/9) / ((6/9) × (6/9)) = 1.00
vii.  {I5} → {I1},       lift = (2/9) / ((2/9) × (6/9)) = 1.50
viii. {I3} → {I2},       lift = (4/9) / ((6/9) × (7/9)) = 0.86
ix.   {I4} → {I2},       lift = (2/9) / ((2/9) × (7/9)) = 1.29
x.    {I5} → {I2},       lift = (2/9) / ((2/9) × (7/9)) = 1.29

All the pink colour rules are pruned because they are not meeting the lift criteria!

30

BITS Pilani, WILP


Redundant Association Rules

An association rule a → b is redundant if there exists a rule X → Y, where a ⊆ X, b ⊆ Y and the support count and confidence of both rules are the same.

Example (using the same nine-transaction table T100-T108 as before):

R1: {I5} → {I1, I2}    Support = 2/9, Confidence = 2/2
R2: {I5} → {I1}        Support = 2/9, Confidence = 2/2
R3: {I5} → {I2}        Support = 2/9, Confidence = 2/2

Therefore, R2 and R3 are redundant.

31

BITS Pilani, WILP


Illustration
Compact Frequent Itemsets

Trans ID Items L1 Support Support Support


L2 L3 L4 Itemsets
T100 I1, I2, I5 Itemsets Count Count Count
Itemsets Itemsets
T101 I2, I4 {I1} 6 {ф}
T102 I2, I3 {I2} 7 {I1, I2} 4 {I1, I2, I3} 2
T103 I1, I2, I4 {I3} 6 {I1, I3} 4 {I1, I2, I5} 2
T104 I1, I3 {I4} 2 {I1, I5} 2
T105 I2, I3 {I5} 2 {I2, I3} 4
T106 I1, I3 {I2, I4} 2
T107 I1, I2, I3, I5 {I2, I5} 2 Support Count Threshold = 2
T108 I1, I2, I3
Frequent Itemsets (from Apriori) with Support Count
Transaction Table

 Maximal Frequent Itemset (a frequent itemset having none of its


immediate supersets frequent) = {I2, I4}, {I1, I2,I3}, {I1, I2, I5}.
 Closed Frequent Itemset (a frequent itemset having none of its
immediate supersets with the same support count as it has) = {I1}, {I2}, {I3},
{I1,I2}, {I1, I3}, {I2, I3}, {I2, I4}, {I1, I2, I3}, {I1, I2, I5}.
 Optimum rules are already generated using lift. How these
compact frequent itemsets help now?
BITS Pilani, WILP
Illustration
Rules from Closed Frequent vs. Frequent Itemsets

Rules generated from the Closed Frequent Itemsets (using the same steps, including lift, as used for rule generation from the frequent itemsets):
i.   {I2, I5} → {I1}
ii.  {I1, I5} → {I2}
iii. {I5} → {I1, I2}
iv.  {I4} → {I2}

Rules generated from the Frequent Itemsets:
i.   {I2, I5} → {I1}
ii.  {I1, I5} → {I2}
iii. {I5} → {I1, I2}
iv.  {I4} → {I2}
v.   {I5} → {I1}
vi.  {I5} → {I2}
(Rules v and vi are two additional rules which are actually redundant given rule iii: {I5} → {I1, I2}.)

 Closed frequent itemsets help to identify the rules faster, eliminating the redundant rules!
 A few tools (like R) may discard rule (iii) and retain (v) and (vi) instead.

BITS Pilani, WILP


Points to Ponder!

1) Closed Frequent Itemsets can be used to calculate


the support count of non-closed frequent itemsets.
Use the lattice diagram discussed earlier and show
how?
2) Why Maximal Frequent Itemsets can generate
redundant rules but not the Closed Frequent
Itemsets?

34

BITS Pilani, WILP


Frequent Pattern (FP) Tree

 FP-Tree is a compact representation of the transaction


dataset.
 FP-Tree is a linked list based structure.
 As different transactions can have several items in
common, their paths may overlap.
 The more the overlap, the more the compression.
 If FP-Tree is completely fit into the main memory,
repeated disk accesses are saved.
 FP-Growth Algorithm is a frequent itemset generation
algorithm that explores the FP-Tree for frequent itemsets
in the bottom-up fashion. 35
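A minimal Python sketch of FP-Tree construction only (illustrative; the mining step of FP-Growth is illustrated on the following slides). Items are reordered by decreasing global support before insertion, so transactions sharing a prefix share nodes:

from collections import Counter

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, minsup):
    # global item frequencies; infrequent items are dropped before insertion
    freq = Counter(item for t in transactions for item in t)
    freq = {i: c for i, c in freq.items() if c >= minsup}
    root = Node(None, None)
    for t in transactions:
        # keep frequent items only, most frequent first (ties broken alphabetically)
        ordered = sorted((i for i in t if i in freq), key=lambda i: (-freq[i], i))
        node = root
        for item in ordered:
            node = node.children.setdefault(item, Node(item, node))
            node.count += 1
    return root, freq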

BITS Pilani, WILP


Example

i. {A, B}
Null
ii. {B, C, D}
iii. {A, C, D, E} B:2
A:8
iv. {A, D, E}
v. {A, B, C}
B:5 C:2
vi. {A, B, C, D} C:1 D:1
vii. {A} D:1
C:3 E:1
viii. {A, B, C} D:1 E:1
D:1
ix. {A, B, D}
E:1
x. {B, C, E} D:1
FP Tree: Pointers are not shown
36

BITS Pilani, WILP


Example
Prefix paths and Conditional Trees for the itemsets ending with E
Null Null

A:8 B:2 A:8 B:2

B:5 C:2 C:2


C:1 D:1 C:1 D:1

D:1
C:3 E:1 E:1
D:1 E:1 D:1 E:1
D:1

D:1
E:1 E:1
Prefix Path Ending with E
FP Tree

Null Null Null

C:1
A:2 A:2 A:2

C:1 D:1 C:1 D:1

D:1 D:1

Conditional FP Tree for E Prefix path ending with D, E Conditional FP Tree for D, E /
(minsup = 2) (minsup = 2) Prefix path A, D, E
BITS Pilani, WILP
Example
Continued...
Null
Null

A:8 B:2 C:1


A:2

B:5 C:2
C:1 D:1
C:1 D:1

D:1
C:3 E:1
D:1 E:1
D:1 D:1

E:1
D:1
FP Tree Conditional FP Tree for E
(minsup = 2)
Null

C:1
A:1

Null
C:1 Ф
A:2

Prefix Path ending with C, E Conditional FP Tree for C, E Prefix Path / Conditional FP Tree
(minsup = 2) ending with A, E

Similarly, prefix paths and Conditional Trees for the itemsets ending with D, C, B and A are identified.
BITS Pilani, WILP
FP Tree
Illustration
C1 Support
Trans ID Items Minsup=2 Itemsets Count
T100 I1, I2, I5 Null {I1} 6
T101 I2, I4 {I2} 7
T102 I2, I3 {I3} 6
T103 I1, I2, I4 {I4} 2
T104 I1, I3 {I5} 2
T105 I2, I3 I1:6 I2: 3
T106 I1, I3
T107 I1, I2, I3, I5
T108 I1, I2, I3
I3:2 I2:4 I3: 2 I4: 1

Header Table
Item Pointer
I1 I3:2 I4:1 I5:1
I2
I3
I4 I5: 1
I5
39

BITS Pilani, WILP


FP Growth Algorithm - illustration
Iteration-1(Frequent itemsets ending with I5)
Trans ID Items
T100 I1, I2, I5 Null
Null Null Null Null
T101 I2, I4
T102 I2, I3
T103 I1, I2, I4
T104 I1, I3 I1:6 I2: 3 I1:6 I1:2 I1:2 I1:2
T105 I2, I3
T106 I1, I3
T107 I1, I2, I3, I5 Conditional-FP Prefix path for
T108 I1, I2, I3 I3:2 I2:4 I3: 2 I4: 1 I2:4 I2:2 tree for {I2, I5} / {I1, I5}
/ prefix path for
Conditional-FP {I1, I2, I5}
tree for I5 / prefix
path for {I2, I5}
I3:2 I4:1 I5:1 I3:2 I5:1

Prefix path for I5.

I5: 1 I5: 1

 Retain only those nodes which are having I5 in their path starting from NULL. This is called prefix path of I5. {I5} is meeting the
support threshold, so it is a frequent itemset.
 Next, conditional-FP tree for I5 is prepared. This conditional tree will be used to identify the frequent itemset ending in I3, I5 and I2,
I5 and I1, I5.
 Since in this tree, I2 is meeting the support count, {I2, I5} is a frequent itemset.
 Hide {I2} and update support counts if required. Conditional-FP tree for {I2, I5}. Since in this tree, I1 is meeting the support count, {I1,
I2, I5} is a frequent itemset.
 From the conditional tree of I5, prefix path for {I1, I5} is prepared. Since I1 is meeting the support the criteria, {I1, I5 } is a frequent
itemset.
 Frequent itemsets at this stage = {I5}, {I1, I5}, {I2, I5}, {I1, I2, I5}.
BITS Pilani, WILP
FP Growth Algorithm - illustration
Iteration-2 (Frequent itemsets ending with I4)
Trans ID Items
T100 I1, I2, I5 Null Null Null Null
T101 I2, I4
T102 I2, I3
T103 I1, I2, I4
T104 I1, I3 I1:6 I2: 3 I1:6 I2: 3 I1:1 I2: 1 I1:1
T105 I2, I3
T106 I1, I3
T107 I1, I2, I3, I5 Prefix path for {I1, I4}
T108 I1, I2, I3 I3:2 I2:4 I3: 2 I4: 1 I2:4 I4: 1 I2:1

Conditional-FP Tree
I4:1 for I4 / prefix path
for {I2, I4}
I3:2 I4:1 I5:1
Prefix path for I4.

I5: 1

 Retain only those nodes which are having I4 in their path starting from NULL. This is called the prefix path for I4. {I4}
is meeting the support threshold, so it is a frequent itemset.
 Hide {I4} and eliminate nodes that do not meet support criteria. Conditional-FP tree for I4. This conditional tree will
be used to identify the frequent itemset ending in I1, I4 and I2, I4 and I3, I4.
 Since in this tree, I2 is meeting the support criteria, {I2, I4} is a frequent itemset.
 Hide {I2} and update support counts if required. This is called the conditional-FP tree of {I2, I4}. This tree is NULL.
 From the conditional tree of I4, prefix path for {I1, I4} is prepared. Since I1 is not meeting the support the criteria,
the conditional tree for {I1, I4} will be also NULL.
 Frequent itemsets at this stage = {I4}, {I2, I4}.
BITS Pilani, WILP
FP Growth Algorithm - illustration
Iteration-3 (Frequent itemsets ending with I3)
Trans ID Items
T100 I1, I2, I5 Null Null Null Null
T101 I2, I4
T102 I2, I3
T103 I1, I2, I4
T104 I1, I3 I1:6 I2: 3 I1:6 I2: 3 I1:4 I2: 2 I1:4
T105 I2, I3
T106 I1, I3
Conditional-FP Tree
T107 I1, I2, I3, I5
for {I2, I3}
T108 I1, I2, I3 I3:2 I2:4 I3: 2 I4: 1 I3:2 I2:4 I3: 2 I2:2

Null
Conditional-FP Tree
for I3 / prefix path
for {I2, I3}
I3:2 I4:1 I5:1 I3:2
I1:4

Prefix path for I3.


Prefix path for {I1, I3}
I5: 1

 Retain only those nodes which are having I3 in their path starting from NULL. This is called the prefix path for I3. {I3} is meeting the
support threshold, so it is a frequent itemset.
 Hide {I3} and eliminate nodes that do not meet support criteria. Conditional-FP tree for I3. This conditional tree will be used to identify
the frequent itemset ending in I1, I3 and I2, I3.
 Since in this tree, I2 is meeting the support criteria, {I2, I3} is a frequent itemset.
 Hide {I2} and update support counts if required. Conditional-FP tree for {I2, I3}. Since in this tree, I1 is meeting the support count, {I1, I2,
I3} is a frequent itemset.
 From the conditional FP tree of I3, the prefix path for {I1, I3} is prepared. Since I1 meets the support criteria, {I1, I3} is a frequent
itemset.
 Frequent itemsets at this stage = {I3}, {I1, I3}, {I2, I3}, {I1, I2, I3}.
BITS Pilani, WILP
FP Growth Algorithm - illustration
Iteration-4 (Frequent itemsets ending with I2)
Trans ID Items
T100 I1, I2, I5 Null Null Null
T101 I2, I4
T102 I2, I3
T103 I1, I2, I4
T104 I1, I3 I1:6 I2: 3 I1:6 I2: 3 I1:4
T105 I2, I3
T106 I1, I3
T107 I1, I2, I3, I5 Conditional-FP Tree
T108 I1, I2, I3 I3:2 I2:4 I3: 2 I4: 1 I2:4 for I2

Prefix path for I2.

I3:2 I4:1 I5:1

I5: 1

 Retain only those nodes which are having I2 in their path starting from NULL. This is called the
prefix path for I2. {I2} is meeting the support threshold, so it is a frequent itemset.
 Hide {I2} and update support counts if required. Conditional-FP tree for I2. This conditional tree
will be used to identify the frequent itemset ending in I1, I2.
 {I1} is meeting the support threshold. So {I1, I2} is a frequent itemset.
 Frequent itemsets at this stage = {I2}, {I1, I2}

BITS Pilani, WILP


FP Growth Algorithm - illustration
Iteration-5 (Frequent itemsets ending with I1)
Trans ID Items
T100 I1, I2, I5 Null Null
T101 I2, I4
T102 I2, I3
T103 I1, I2, I4
T104 I1, I3 I1:6 I2: 3 I1:6
T105 I2, I3
T106 I1, I3
T107 I1, I2, I3, I5
T108 I1, I2, I3 I3:2 I2:4 I3: 2 I4: 1

I3:2 I4:1 I5:1

I5: 1

 Retain only those nodes which are having I1 in their path


starting from NULL. This is called the prefix path for I1. {I1} is
meeting the support threshold, so it is a frequent itemset.
 At the end of iteration-5, all frequent itemsets are identified.
BITS Pilani, WILP
Exercise

Using FP-Tree approach, identify all frequent itemsets


from the following transactions assuming the minimum
support count as 2.
i. {A, B}
ii. {B, C, D}
iii. {A, C, D, E}
iv. {A, D, E}
v. {A, B, C}
vi. {A, B, C, D}
vii. {A}
viii. {A, B, C}
ix. {A, B, D}
x. {B, C, E}

45

BITS Pilani, WILP


Vertical Data Format
Introduction

 Normally the transaction ID and the items included in that particular transaction are provided in the data set, as shown in Table-1. This is called Horizontal Data Format.
 The dataset can also be presented in an item-to-transaction-ID-set form, which is called Vertical Data Format, as shown in Table-2.

Table-1 (Horizontal Data Format):
Trans ID   Items
T100       I1, I2, I5
T200       I2, I4
T300       I2, I3
T400       I1, I2, I4
T500       I1, I3
T600       I2, I3
T700       I1, I3
T800       I1, I2, I3, I5
T900       I1, I2, I3

Table-2 (Vertical Data Format):
Itemsets   Trans ID Set
{I1}       {T100, T400, T500, T700, T800, T900}
{I2}       {T100, T200, T300, T400, T600, T800, T900}
{I3}       {T300, T500, T600, T700, T800, T900}
{I4}       {T200, T400}
{I5}       {T100, T800} 46

BITS Pilani, WILP


Vertical Data Format
Mining Frequent Itemsets

 Support count is the length of Transactions Set for a


given itemset.
 Merging is performed by intersection. That means
identifying the common transactions. E.g.
Itemsets Trans ID Set
{I1} {T100, T400, T500, T700, T800, T900}
{T100, T200, T300, T400, T600, T800,
{I2} T900}
{I1, I2} {T100, T400, T800, T900}

47

BITS Pilani, WILP


Vertical Data Format
Example: Mining Frequent Itemsets

Threshold Support Count = 2

2-Itemsets in Vertical Data Format:
Itemsets    Trans ID Set
{I1, I2}    {T100, T400, T800, T900}
{I1, I3}    {T500, T700, T800, T900}
{I1, I5}    {T100, T800}
{I2, I3}    {T300, T600, T800, T900}
{I2, I4}    {T200, T400}
{I2, I5}    {T100, T800}

Why are {I1, I4} and {I3, I5} not frequent itemsets?

3-Itemsets in Vertical Data Format:
Itemsets        Trans ID Set
{I1, I2, I3}    {T800, T900}
{I1, I2, I5}    {T100, T800}

Why is {I1, I2, I3, I5} not a frequent itemset?
Further advantage over Apriori - no need to scan the transactions to get the support count; see the sketch below.
48
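A small Python sketch of this idea (illustrative): with the vertical format, the support count of a candidate itemset is simply the size of the intersection of the TID-sets, so the original transactions are never rescanned. It reproduces the 2-itemset table above:

from itertools import combinations

vertical = {
    "I1": {"T100", "T400", "T500", "T700", "T800", "T900"},
    "I2": {"T100", "T200", "T300", "T400", "T600", "T800", "T900"},
    "I3": {"T300", "T500", "T600", "T700", "T800", "T900"},
    "I4": {"T200", "T400"},
    "I5": {"T100", "T800"},
}
minsup = 2

pairs = {}
for a, b in combinations(sorted(vertical), 2):
    tids = vertical[a] & vertical[b]          # intersection = support set of the pair
    if len(tids) >= minsup:
        pairs[(a, b)] = tids

for itemset, tids in sorted(pairs.items()):
    print(itemset, sorted(tids))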

BITS Pilani, WILP


Exercise

1. Repeat the Vertical Data Format mining exercise to find out the frequent
itemsets where only the output of diff function is stored. Diff function is
defined between (k+1)th itemset and corresponding kth itemset as the
difference in the transactions they include. E.g.
Itemsets Trans ID Set
{I1} {T100, T400, T500, T700, T800, T900}
{I2} {T100, T200, T300, T400, T600, T800, T900}
{I1, I2} {T100, T400, T800, T900}
diff ( {I1, I2}, {I1} ) {T500, T700}

2. What is the advantage of doing so?

49

BITS Pilani, WILP


Other Measures

For association rules of the form A → B, there are more measures which help to evaluate interesting patterns, and they can cut down the number of uninteresting rules:

AllConf(A,B) = min { P(A|B), P(B|A) }
MaxConf(A,B) = max { P(A|B), P(B|A) }
Kulc(A,B) = (1/2) · { P(A|B) + P(B|A) }      (named after the Polish mathematician S. Kulczynski)
Cosine(A,B) = sqrt( P(A|B) × P(B|A) )
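A small Python sketch of these four measures (illustrative), computed from a 2x2 contingency table given as counts; m'c' is deliberately not an input, which is exactly why the measures are null-invariant (see the datasets on the next slide):

import math

def null_invariant_measures(both, only_a, only_b):
    # both = count(A and B), only_a = count(A without B), only_b = count(B without A)
    p_b_given_a = both / (both + only_a)      # P(B|A)
    p_a_given_b = both / (both + only_b)      # P(A|B)
    return {
        "AllConf": min(p_a_given_b, p_b_given_a),
        "MaxConf": max(p_a_given_b, p_b_given_a),
        "Kulc":    0.5 * (p_a_given_b + p_b_given_a),
        "Cosine":  math.sqrt(p_a_given_b * p_b_given_a),
    }

# Dataset D1 from the next slide: mc = 10,000, mc' = 1,000, m'c = 1,000
print(null_invariant_measures(10_000, 1_000, 1_000))   # all four come out to ~0.91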

50

BITS Pilani, WILP


Significance of Other Measures

Milk Milk' Data Sets mc m'c mc' m'c' χ2 lift AllConf MaxConf Kulc Cosine
Coffee mc m'c D1 10,000 1000 1000 1,00,000 90557 9.26 0.91 0.91 0.91 0.91
Coffee' mc' m'c' D2 10,000 1000 1000 100 0 1 0.91 0.91 0.91 0.91
D3 100 1000 1000 1,00,000 670 8.44 0.09 0.09 0.09 0.09
D4 1000 1000 1000 1,00,000 24740 25.75 0.50 0.50 0.50 0.50
Datasets D1 and D2:
 m and c are positively associated. People who bought milk, bought coffee also and vice-versa because
confidence (m  c) = confidence (c  m) = 10,000 / 11,000 = 0.91. This is reflected in the last 4 measures
consistently. But lift and χ2 generate different values.
Dataset D3:
 confidence (m  c) = confidence (c  m) = 100 / 1100 = 0.09. It is very low, but lift and χ2 contradict.
Dataset D4:
 confidence (m  c) = confidence (c  m) = 1000 / 2000 = 0.50. It shows neutrality, but lift and χ2 show
positive association.

Lift and χ2 are not appropriate for the above datasets because they depend on m'c'. Transactions that contain neither of the itemsets of interest (here m'c') are called Null Transactions. They are very likely in real-world situations, i.e. there could be many transactions which do not include any itemset of interest.

The last four measures are Null-Invariant measures because they are not impacted by the null transactions. But are they applicable to all scenarios? Let us see.
BITS Pilani, WILP
Imbalance Ratio (IR)

IR(A,B) = | support(A) − support(B) | / ( support(A) + support(B) − support(A ∪ B) )
Milk Milk' Data Sets mc m'c mc' m'c' χ2 lift AllConf MaxConf Kulc Cosine
Coffee mc m'c D4 1000 1000 1000 1,00,000 24740 25.75 0.50 0.50 0.50 0.50
Coffee' mc' m'c' D5 1000 100 10,000 1,00,000 8173 9.18 0.09 0.91 0.50 0.29
D6 1000 10 1,00,000 1,00,000 965 1.97 0.01 0.99 0.50 0.10

Dataset D5:
 confidence (m  c) = 1000/11000 = 0.09 and confidence (c  m) = 1000 / 1100 = 0.91.
Dataset D6:
 confidence (m  c) = 1000/1010 = 0.99 and confidence (c  m) = 1000 / 101000 = 0.01.

For D5 and D6, AllConf shows low association but MaxConf shows positive association. Kulc is neutral for both and Cosine shows low association for both.
The Kulc measure, along with the Imbalance Ratio (IR), presents a clearer picture.
D4 D5 D6
IR 0 0.89 0.99

Perfectly
Type Imbalanced Skewed
Balanced

Kulc 0.50 0.50 0.50

BITS Pilani, WILP


Case Study
Congressional Voting Records

Congressional Voting Records (1984) is a dataset of 435 transactions having 32


items on 16 key issues where Republican and Democratic parties are supporting
the Yes/No combinations. Issues are listed below:
1. handicapped-infants: 2 (y,n) Data set looks like:
2. water-project-cost-sharing: 2 (y,n) republican,n,y,n,y,y,y,n,n,n,y,?,y,y,y,n,y
3. adoption-of-the-budget-resolution: 2 (y,n) .....
4. physician-fee-freeze: 2 (y,n) democrat,y,y,y,n,n,n,y,y,y,n,n,n,n,n,y,y
5. el-salvador-aid: 2 (y,n) ......
6. religious-groups-in-schools: 2 (y,n)
7. anti-satellite-test-ban: 2 (y,n) Taking minsup = 30% and minconf = 90%
8. aid-to-nicaraguan-contras: 2 (y,n) and applying Apriori Principle; the high
9. mx-missile: 2 (y,n) confidence rules can be extracted.
10. immigration: 2 (y,n)
11. synfuels-corporation-cutback: 2 (y,n)
12. education-spending: 2 (y,n) Study the detailed example from the book
13. superfund-right-to-sue: 2 (y,n) (Tan, Steinbach & Kumar) Chapter:6,
14. crime: 2 (y,n) Section: 6.3.3.
15. duty-free-exports: 2 (y,n)
16. export-administration-act-south-africa: 2 (y,n) 53

BITS Pilani, WILP


Project Idea!
Facebook is rolling out AI-based suicide prevention effort (27-Nov-
2017).

Design how Association Analysis can be leveraged in these types of


scenarios in social networking websites for performing sentiment
analysis to prevent self-harm.

 Capture of posts/texts (time bound) to refresh the training transactions.


 Sample Transactions:
T1: Drugs, Depressed, Sucks, Gloom
T2: Cheer, Cake, Joy, Heaven
T2: Knife, Sleeping Pills, Breakup
T3: Ignore, Liquor, Kill
...........................
 Mine strong Association Rules and offer help. 54

BITS Pilani, WILP


Thank You

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


DSECL ZC 415
Data Mining
Basic Visualization Techniques
Revision 1.0

BITS Pilani Prof Vineet Garg


Work Integrated Learning Programmes Bangalore Professional Development Center
Visualization

 Data Visualization is the display of information in a


graphic or tabular format so that the characteristics of
the data and the relationships among data items or
attributes can be analyzed or reported.
 Visualization of data is one of the most powerful and
appealing techniques for data exploration.
o Humans have a well developed ability to analyze large
amounts of information that is presented visually.
o Can detect general patterns and trends.
o Can detect outliers and unusual patterns.
2

BITS Pilani, WILP


Motivation for Visualization
 Human ability to analyze large amounts of information that is presented visually.
 A very large amount of data can be captured and interpreted in almost no time.
 It makes use of the domain knowledge locked up in one's head. E.g. a doctor can inspect and infer the results better if the data is presented visually.
 The picture below summarises the Sea Surface Temperature (SST) in °C for July 1982. There are 2,50,000 data points, but the picture can be interpreted in no time.

BITS Pilani, WILP


Visualization
General Concepts

1. Representation: Data objects, their attributes, and the relationships among data objects are translated into graphical elements such as points, lines, shapes, and colors.
2. Arrangement: Placement of visual elements within a display. It can make a large difference in how easy it is to understand the data.
3. Selection: To avoid too much crowdedness, the elimination or the de-emphasis of certain objects and attributes.

Nine objects with six attributes:
#   A  B  C  D  E  F
1   0  1  0  1  1  0
2   1  0  1  0  0  1
3   0  1  0  1  1  0
4   1  0  1  0  0  1
5   0  1  0  1  1  0
6   1  0  1  0  0  1
7   0  1  0  1  1  0
8   1  0  1  0  0  1
9   0  1  0  1  1  0

The same objects with rows and columns rearranged (the arrangement in this second table conveys a possible pattern):
#   F  A  C  B  E  D
4   1  1  1  0  0  0
2   1  1  1  0  0  0
6   1  1  1  0  0  0
8   1  1  1  0  0  0
5   0  0  0  1  1  1
3   0  0  0  1  1  1
9   0  0  0  1  1  1
1   0  0  0  1  1  1
7   0  0  0  1  1  1
BITS Pilani, WILP
Histograms
Unit Price Quantity
 Usually shows the ($) Sold
9 15
distribution of values of 12 6
14 9
a single variable. 21 4
19 1
 Divide the values into 23
25
6
5
bins and show a bar plot 27 5
28 5
of the number of objects 29 6
31 5
in each bin. 32 6
40 2
 The height of each bar 45
48
4
12
indicates the number of 46 8
51 12
object. 65 8
74 23
 Example: A store of 78
79
12
2
Mega Electronics 81
83
8
12
presented the data in 85 19
86 12 A possible decision to make - can this store
the shown table. How to 87 2
plan to remove the items which are
91 3
visually analyze the 92 21 less than 90$ per unit?
94 40
count of items sold in 95 54
96 12
different price ranges? 99 89
100 200
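A short matplotlib sketch (illustrative) of the histogram suggested here, assuming each table row is (unit price, quantity sold) and binning the prices into 10-dollar ranges weighted by quantity:

import matplotlib.pyplot as plt

unit_price = [9, 12, 14, 21, 19, 23, 25, 27, 28, 29, 31, 32, 40, 45, 48, 46, 51,
              65, 74, 78, 79, 81, 83, 85, 86, 87, 91, 92, 94, 95, 96, 99, 100]
quantity   = [15, 6, 9, 4, 1, 6, 5, 5, 5, 6, 5, 6, 2, 4, 12, 8, 12,
              8, 23, 12, 2, 8, 12, 19, 12, 2, 3, 21, 40, 54, 12, 89, 200]

# one bin per 10-dollar price range; bar height = total items sold in that range
plt.hist(unit_price, bins=range(0, 110, 10), weights=quantity, edgecolor="black")
plt.xlabel("Unit price ($)")
plt.ylabel("Quantity sold")
plt.title("Items sold per price range")
plt.show()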

BITS Pilani, WILP


2D Histograms
 Show the joint distribution of the values of two attributes.
 Visually more complicated to comprehend because few columns
could be hidden.
 Example: From the Iris dataset, count of species are shown in the
2D histogram based on their petal length and petal width. Petal
length and petal width are divided into three bins, so total nine
columns are present.

BITS Pilani, WILP


Quantile Plots

 A quantile plot (q-plot) is a simple and effective way to have a first look at the
univariate data distribution.
 Let us say x1, x2, x3.....xN are N observations of an attribute, arranged in an order
where x1 is the smallest and xN is the largest observation.
 Associated with each observation (xi) there is a percentage term (fi) that is calculated as fi = (i − 0.5)/N.
 fi on the X-axis and xi on the Y-axis are plotted. The resulting plot is a quantile plot.
 If the distributions of two attributes are plotted against each other, it is called a quantile-quantile (q-q) plot.

(Figure: a quantile plot, with the quartiles Q1, Q2 (median) and Q3 marked at f = 0.25, 0.5 and 0.75.)
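A short Python/matplotlib sketch of a quantile plot (illustrative; the ten observation values are arbitrary sample data):

import numpy as np
import matplotlib.pyplot as plt

values = np.sort(np.array([33, 45, 12, 58, 27, 61, 49, 38, 70, 24]))  # any univariate data
N = len(values)
f = (np.arange(1, N + 1) - 0.5) / N          # fi = (i - 0.5) / N

plt.plot(f, values, marker="o")
plt.axvline(0.25, ls="--"); plt.axvline(0.5, ls="--"); plt.axvline(0.75, ls="--")  # Q1, median, Q3
plt.xlabel("f-value")
plt.ylabel("sorted observations x_i")
plt.title("Quantile plot")
plt.show()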
BITS Pilani, WILP
Thank You

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


DSECL ZC415
Data Mining
Cluster Analysis
Revision -3.0

BITS Pilani Prof Vineet Garg


Work Integrated Learning Programmes Bangalore Professional Development Center
Problem Description
 Director of customer relationships of an electronics company wants to group
the customers into 5 separate groups and assign one business manager to each
one of them to manage the relationships in the best possible way.
 Customers in each group are to be as similar as possible. Moreover two given
customers having very different buying patterns should not be placed in the
same group.
 Unlike classification, the class label of each customer is unknown. The
groupings are to be discovered.
 What kind of data science modelling technique is helpful to accomplish this
task?
 Supervised learning vs. unsupervised learning.
 Clustering Techniques are to be used in these types of scenarios.
 Cluster Analysis - finding groups of objects such that the objects in a group will
be similar (or related) to one another and different from (or unrelated to) the
objects in other groups. 2

BITS Pilani, WILP


What may not be Cluster Analysis?
From Data Science Perspective

Supervised Classification
– Have class label information.
– E.g. characteristics of reptiles are known. Identify if a new found specie is
a reptile.
– Do you think an initial unsupervised classification may become a
supervised classification eventually?

Simple Segmentation
– Dividing students into different registration groups alphabetically, by last
name.

Results of a Query
– Groupings are a result of an external specification.
– E.g. employees completed 10 years in the job.

Graph Partitioning
– Some mutual relevance.
– E.g. reach of airways or train system to specific cities in a graph of cities. 3

BITS Pilani, WILP


Ambiguity
The notion of a cluster is not well defined. The best
definition depends on the nature of the data and the
desired results.

How many clusters? Two Clusters

Four Clusters Six Clusters 4

BITS Pilani, WILP


Well Separated Clusters

A cluster is a set of objects in which each object is


closer (or more similar) to every other object in the
cluster than to any object not in the cluster.

Idealistic situation when the data contains the natural


clusters.

Each object is closer to every


other object of its cluster than
to any object not in that cluster.

Two well-separated clusters


5

BITS Pilani, WILP


Prototype Based Clusters

A cluster is a set of objects in which each object is


closer (or more similar) to the prototype that defines
the cluster than to the prototype of any other cluster.

Prototype of continuous attributes: centroid


Prototype of categorical attributes: medoid (the most
representative point of a cluster)

Each object is closer to the


center of its cluster than to the
center of other cluster.

Centroid based clusters


6

BITS Pilani, WILP


Density Based Clusters

A cluster is a dense region of points, which is separated


by low-density regions, from other regions of high
density. These clusters are used when the clusters are
irregular or intertwined, and when noise and outliers
are present.

BITS Pilani, WILP


Conceptual Clusters

Clusters that share some common property or


represent a particular concept.

Temperature points from a heat source

BITS Pilani, WILP


Clustering Methods

Clustering
Methods

Partitioning Methods: division of the dataset into non-overlapping subsets. Example: K-Means Algorithm.
Hierarchical Methods: the dataset is organized as a set of nested clusters arranged as a tree. Example: Agglomerative Hierarchical Clustering.
Density Based Methods: a cluster is a dense region of objects surrounded by a region of low density. Example: DBSCAN.
Grid Based Methods #: attribute values are split to create a grid that forms the clusters. Example: STING, CLIQUE.

#: Not in the syllabus 9

BITS Pilani, WILP


K-Means Algorithm

 K-Means is a prototype based partitioning clustering technique.


 K is a user specified parameter. This is the count of clusters
desired.
 K initial centroids are selected. They could be few actual data
points in the given datasets.
 Each data point in the dataset is assigned to the closest centroid.
 The collection of data points assigned to a centroid form a
cluster.
 The procedure is repeated until the movement of data points has stabilized (data points no longer hop from one cluster to another).
10

BITS Pilani, WILP


Illustration
K-Means Algorithm continuing...

Points X Y
 Seven data points (A to G) are A 1.00 1.00
given with their coordinates. B 1.50 2.00
Two clusters (K = 2) are to be C 3.00 4.00
D 5.00 7.00
identified among them. E 3.50 5.00
 Two centroids are chosen F 4.50 5.00
G 3.50 4.50
randomly from them as A
Cluster-1 Cluster-2
(1.0, 1.0) and D (5.0, 7.0). Step Mean
Euclidean
Mean
Euclidean
Individual Distance Individual Distance
 Other points are taken one at 1 A
Centroid
(1.0, 1.0) 0 D
Centroid
(5.0, 7.0) 0
a time. Their distances from 2 B 1.12 B 6.10
the centroids are measured 3 A, B (1.25, 1.5) D (5.0, 7.0)
4 C 3.05 C 3.61
as Euclidean distance. Lesser 5 A, B, C (1.83, 2.33) D (5.0, 7.0)
distance means proximity and 6 E 3.15 E 2.50
7 A, B, C (1.83, 2.33) D, E (4.25, 6.0)
the points are added to the 8 F 3.78 F 1.03
corresponding cluster. Mean 9 A, B, C (1.83, 2.33) D, E. F (4.33, 5.67)
centroid coordinates are also 10 G 2.74 G 1.43
11 A, B, C (1.83, 2.33) D, E, F, G (4.13, 5.38)
updated.
BITS Pilani, WILP
Illustration
K-Means Algorithm concluded.

1. So, the clusters are {A, B, C} and {D, E, F, G}.


2. Mean centroid are: Cluster-1 (1.83, 2.33) and Cluster-2 (4.13, 5.38).
3. Now the distance of each point is measured against the mean centroid of its cluster for fine
tuning.
4. Point-C seems to be confusing, because it is closer now from the cluster-2 centroid than the
cluster-1 centroid. So it is moved to cluster-2.
5. Final clusters are {A, B} and {C, D, E, F, G}.
6. The mean centroid is updated and the procedure is repeated from #2 above.
7. The algorithm stops when there is no movement of points from one cluster to another.
8.00
Points X Y Distance Cluster-1 Cluster-2 D
7.00
A 1.00 1.00 A 1.57 5.38 6.00
B 1.50 2.00 E F
B 0.47 4.28 5.00
C 3.00 4.00 C
C 2.04 1.78
4.00
D 5.00 7.00 D 5.64 1.84 G
3.00
E 3.50 5.00 E 3.15 0.74 B
2.00
F 4.50 5.00 F 3.78 0.53 A
1.00
G 3.50 4.50 G 2.74 1.08
0.00
0.00 1.00 2.00 3.00 4.00 5.00 6.00
Original Points The distances from the mean centroids

BITS Pilani, WILP


K-Means Algorithm
More Details

 In the illustration, the centroid was updated after each point was added to the cluster. A variation that reduces this overhead is to update the centroid only at the end of an iteration, once every remaining point has been assigned to a cluster (recommended).
 If K is the count of clusters and x is a data point that belongs to cluster Ci with centroid ci, the objective is to minimize the Euclidean-distance-based Sum of Squared Errors (SSE), defined as:

SSE = Σ_{i=1..K} Σ_{x ∈ Ci} dist(ci, x)^2

 The centroids that minimize the SSE are the means of the data
points that belong to that cluster.
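A short Python sketch (illustrative) clustering the seven points A-G of the earlier illustration with K = 2, using scikit-learn's KMeans (assumed to be installed); its inertia_ attribute is exactly the SSE defined above, so the same call inside a loop over K also gives the elbow plot of the next slide:

import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0], [5.0, 7.0],
                   [3.5, 5.0], [4.5, 5.0], [3.5, 4.5]])          # A .. G
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

print(km.labels_)           # A and B end up in one cluster; C..G in the other
print(km.cluster_centers_)  # ~ (1.25, 1.50) and (3.90, 5.10)
print(km.inertia_)          # the SSE of the final clustering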
13

BITS Pilani, WILP


How to find K?
Elbow Method

 How many clusters (K) data points have hidden in them?


 The elbow method is used to identify that. SSE is plotted for values of K from 1 up to some reasonably large number.
 If the value of K on the X-axis is kept increasing the SSE will eventually be 0, that
means each data point is the centroid for itself and forms a cluster.
 The optimum value of K is the point of inflection in the graph that resembles an elbow.
For example K= 3 is an optimal choice in plot below.

(Figure: SSE on the Y-axis plotted against K on the X-axis, with the elbow at K = 3.)
BITS Pilani, WILP
Document Clustering
Using K-Means Algorithm

Doc ID Team Coach Hockey Baseball Soccer Penalty Score Win Loss Season
ID-1 5 0 3 0 2 0 0 2 0 0
ID-2 3 0 2 0 1 1 0 1 0 1
ID-3 0 7 0 2 1 0 0 3 0 0
... ... ... ... ... ... ... ... ... ... ...

 There are few documents for which term-frequency-vector data is available.


 The documents are to be clustered using K-means algorithm. The measure for similarity
is to be used is cosine similarity (review the module Exploring Data).
 K initial document centroids are selected.
 Cosine similarity for other documents is measured from these centroid documents.
 The maximum similarity between a document and a centroid document (closer to value
1, cos 0o = 1) identifies that the document belongs to that cluster.
 The centroid is updated by taking the mean for the values of the documents term-
frequency-vectors.
 The procedure is repeated for all the remaining documents. 15

BITS Pilani, WILP


Issues with K-Means Clustering

Is it what we expect from K-Means Clustering?


16

BITS Pilani, WILP


Hierarchical Clustering

How many clusters?


One cluster with two
sub-clusters

A clustering may be required in a hierarchy.

Three level of 17
hierarchical Clusters.
BITS Pilani, WILP
Types of Hierarchical Clustering

1. Agglomerative: Start with the points as individual


clusters and at each step, merge the closest pair of
clusters. This requires defining a notion of cluster
proximity: MIN, MAX, Group Average.
2. Divisive: Start with one all-inclusive cluster and at
each step split a cluster until only clusters of
individual points remain. This requires the splitting
criteria to be defined.

18

BITS Pilani, WILP


Illustration
MIN (Single Link) Hierarchical Clustering continuing...

0.60
Point X Y
0.50
P1 0.40 0.53 P1
0.40
P2 0.22 0.38 P5
0.30 P2
P3 0.38 0.32 P3 P6
P4 0.26 0.19 0.20
P4
P5 0.08 0.41 0.10
P6 0.45 0.30 0.00
Six points with their coordinates are 0.00 0.10 0.20 0.30 0.40 0.50
provided In the beginning each point is considered
a singleton cluster
0.60
P1 P2 P3 P4 P5 P6
0.50
P1 0.00 P1
0.40
P2 0.23 0.00 P5
0.30 P2
P3 0.21 0.17 0.00 P3 P6
0.20
P4 0.37 0.19 0.18 0.00 P4
P5 0.34 0.14 0.31 0.28 0.00 0.10

P6 0.24 0.24 0.07 0.22 0.39 0.00 0.00


0.00 0.10 0.20 0.30 0.40 0.50
The Euclidean distance between two A minimum distance pairs is {P3, P6} . It forms a
point pairs are tabulated two-point cluster.

BITS Pilani, WILP


Illustration
MIN (Single Link) Hierarchical Clustering continuing...
0.60
P1 P2 P3 P4 P5 P6
0.50
P1 0.00 P1
0.40
P2 0.23 0.00 P5
0.30 P2
P3 0.21 0.17 0.00 P3 P6
0.20
P4 0.37 0.19 0.18 0.00 P4
P5 0.34 0.14 0.31 0.28 0.00 0.10

P6 0.24 0.24 0.07 0.22 0.39 0.00 0.00


0.00 0.10 0.20 0.30 0.40 0.50
The Euclidean distance between two Two Point Clusters
point pairs are tabulated
The distance between this two-point cluster and other points:
(P3, P6) and P1 = min {dist(P3, P1), dist(P6, P1)} = min {0.21, 0.24} = 0.21
(P3, P6) and P2 = min {dist(P3, P2), dist(P6, P2)} = min {0.17, 0.24} = 0.17
(P3, P6) and P4 = min {dist(P3, P4), dist(P6, P4)} = min {0.18, 0.22} = 0.18
(P3, P6) and P5 = min {dist(P3, P5), dist(P6, P5)} = min {0.31, 0.39} = 0.31

Since the next minimum distance among other minimum distances is


between P2 and P5. That forms the next cluster.

BITS Pilani, WILP


Illustration
MIN (Single Link) Hierarchical Clustering continuing...
(The Euclidean distance matrix between the point pairs is repeated from the earlier slide.)
[Scatter plot: {(P3, P6), (P2, P5)} merge to form a cluster.]

(P3, P6) and P1 = min {dist(P3, P1), dist(P6, P1)} = min {0.21, 0.24} = 0.21
(P3, P6) and P4 = min {dist(P3, P4), dist(P6, P4)} = min {0.18, 0.22} = 0.18

The distance between (P2, P5) cluster and other points:


(P2, P5) and (P3, P6) = min {dist(P2, P3), dist(P2, P6), dist(P5, P3), dist (P5, P6)}
= min {0.17, 0.24, 0.31, 0.39} = 0.17
(P2, P5) and P1 = min {dist(P2, P1), dist(P5, P1)} = min {0.23, 0.34} = 0.23
(P2, P5) and P4 = min {dist(P2, P4), dist(P5, P4)} = min {0.19, 0.28} = 0.19
Since the next minimum distance among these is 0.17, between (P2, P5) and (P3, P6), these two clusters merge to form the next cluster.
BITS Pilani, WILP
Illustration
MIN (Single Link) Hierarchical Clustering continuing...
(The Euclidean distance matrix between the point pairs is repeated from the earlier slide.)
[Scatter plot: P4 joins {(P3, P6), (P2, P5)} to form the cluster [P4, {(P3, P6), (P2, P5)}].]

The distance between the new cluster and P4:
= min {dist(P5, P4), dist(P2, P4), dist(P3, P4), dist(P6, P4)}
= min (0.28, 0.19, 0.18, 0.22) = 0.18

The distance between the new cluster and P1:
= min {dist(P5, P1), dist(P2, P1), dist(P3, P1), dist(P6, P1)}
= min (0.34, 0.23, 0.21, 0.24) = 0.21

Since 0.18 is the minimum distance, the next higher level cluster is [P4, {(P3, P6), (P2, P5)}].

BITS Pilani, WILP


Illustration
MIN (Single Link) Hierarchical Clustering concluded.

[Scatter plot: the final MIN (single link) hierarchical clustering of P1-P6, showing the nested clusters {P3, P6} and {P2, P5}, their merge, then P4 and finally P1 joining.]

Final MIN Hierarchical Clustering

BITS Pilani, WILP


Hierarchical Clustering
MIN (Single Link) Dendrogram Representation

A dendrogram is a tree representation showing the taxonomic relationship of a dataset. It can be used to show a hierarchical clustering.

[MIN dendrogram for the illustration: leaves ordered P3, P6, P2, P5, P4, P1; {P3, P6} merge at 0.07, {P2, P5} at 0.14, the two pairs at 0.17, P4 joins at 0.18 and P1 at 0.21.]

MIN Dendrogram Representation
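A minimal sketch of producing this dendrogram with SciPy, assuming scipy and matplotlib are installed; the coordinates are the six illustration points.

```python
# Sketch: MIN (single-link) hierarchical clustering of the six illustration points.
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, dendrogram

pts = np.array([[0.40, 0.53],   # P1
                [0.22, 0.38],   # P2
                [0.38, 0.32],   # P3
                [0.26, 0.19],   # P4
                [0.08, 0.41],   # P5
                [0.45, 0.30]])  # P6

Z = linkage(pdist(pts), method='single')    # MIN / single link
print(np.round(Z, 2))                       # merge heights: 0.07, 0.14, 0.17, 0.18, 0.21
dendrogram(Z, labels=['P1', 'P2', 'P3', 'P4', 'P5', 'P6'])
plt.show()
```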
24

BITS Pilani, WILP


Types of Hierarchical Clustering
 For the MIN clustering, the distance between two clusters is represented by the distance of the closest pair of data objects belonging to different clusters.
[Figure] MIN (Single Link) Hierarchical Clustering
 For the MAX clustering, the distance between two clusters is represented by the distance of the farthest pair of data objects belonging to different clusters.
[Figure] MAX (Complete Link or CLIQUE) Hierarchical Clustering
 For the Group Average clustering, the distance between two clusters is represented by the average distance over all pairs of data objects belonging to different clusters.
[Figure] Group Average Hierarchical Clustering
BITS Pilani, WILP
Illustration
MAX (Complete Link/CLIQUE) Hierarchical Clustering. Continuing....
(The Euclidean distance matrix between the point pairs is repeated from the earlier slide.)
[Scatter plot: the initial two-point clusters.]

 Initially all the points are singleton clusters.
 To start with, the two closest points are identified. So (P3, P6) forms a cluster.
 Now the maximum distance of the other points from this cluster is calculated:
o (P3, P6) and P1 = max {dist(P3, P1), dist(P6, P1)} = max (0.21, 0.24) = 0.24
o (P3, P6) and P2 = max {dist(P3, P2), dist(P6, P2)} = max (0.17, 0.24) = 0.24
o (P3, P6) and P4 = max {dist(P3, P4), dist(P6, P4)} = max (0.18, 0.22) = 0.22 (the minimum of these is 0.22)
o (P3, P6) and P5 = max {dist(P3, P5), dist(P6, P5)} = max (0.31, 0.39) = 0.39
 But are there two more singleton points that can form a cluster?
 Yes. The minimum distance among the remaining singletons is between P2 and P5 (0.14). So {P2, P5} forms the next cluster.
BITS Pilani, WILP
Illustration
MAX (Complete Link/CLIQUE) Hierarchical Clustering. Continuing....
(The Euclidean distance matrix between the point pairs is repeated from the earlier slide.)
[Scatter plot: P4 forms a cluster with (P3, P6).]

(P3, P6) and P1 = max {dist(P3, P1), dist(P6, P1)} = max (0.21, 0.24) = 0.24
(P3, P6) and P4 = max {dist(P3, P4), dist(P6, P4)} = max (0.18, 0.22) = 0.22
(P2, P5) and P1 = max {dist(P2, P1), dist(P5, P1)} = max (0.23, 0.34) = 0.34
(P2, P5) and P4 = max {dist(P2, P4), dist(P5, P4)} = max (0.19, 0.28) = 0.28
(P2, P5) and (P3, P6) = max {dist(P2, P3), dist(P5, P3), dist(P2, P6), dist(P5, P6)}
= max (0.17, 0.31, 0.24, 0.39) = 0.39
P4 will form a cluster with (P3, P6) because its distance (0.22) is the minimum.

BITS Pilani, WILP


Illustration
MAX (Complete Link/CLIQUE) Hierarchical Clustering. Continuing....
(The Euclidean distance matrix between the point pairs is repeated from the earlier slide.)
[Scatter plot: P1 forms a cluster with (P2, P5).]

(P2, P5) and P1 = max {dist(P2, P1), dist(P5, P1)} = max (0.23, 0.34) = 0.34
{(P3, P6), P4} and P1 = max {dist(P3, P1), dist(P6, P1), dist (P4, P1)}
= max (0.21, 0.24, 0.37) = 0.37
{(P3, P6), P4} and (P2, P5) = max {dist(P3, P2), dist(P6, P2), dist (P4, P2),
dist(P3, P5), dist(P6, P5), dist (P4, P5)}
= max (0.17, 0.24, 0.19, 0.31, 0.39, 0.28) = 0.39
P1 will form a cluster with (P2, P5) because its distance is minimum.

BITS Pilani, WILP


Illustration
MAX (Complete Link/CLIQUE) Hierarchical Clustering. Concluded.

[Scatter plot: the final MAX (complete link) hierarchical clustering of P1-P6, with {P3, P6, P4} and {P2, P5, P1} as the two main clusters.]

Final MAX Hierarchical Clustering

29

BITS Pilani, WILP


Illustration
Group Average Hierarchical Clustering
(The Euclidean distance matrix between the point pairs is repeated from the earlier slide.)
[Scatter plot: the two-point clusters (P3, P6) and (P2, P5).]

 Initially all the points are singleton clusters.


 (P3, P6) and (P2, P5) form the initial two point clusters as in the previous slides.
 Now the average distance between the following pairs is calculated:
o (P3, P6) and (P2, P5) = (0.17 + 0.31 + 0.24 + 0.39) / (2 x 2) = 0.28
o (P3, P6) and P1 = (0.21 + 0.24) / (2 x 1) = 0.23
o (P3, P6) and P4 = (0.18 + 0.22) / (2 x 1) = 0.20
o (P2, P5) and P1 = (0.23 + 0.34) / (2 x 1) = 0.29
o (P2, P5) and P4 = (0.19 + 0.28) / (2 x 1) = 0.24
 Minimum distance among them is between (P3, P6) and P4. So it forms the next level
cluster.
 The procedure continues to complete the clustering.
BITS Pilani, WILP
MIN, MAX and Average Hierarchical
Clustering comparison:
 The MIN technique is sensitive to outliers.
 The MAX technique is less sensitive to outliers. It favours elliptical shapes but tends to break large clusters.
 Group Average is an intermediate approach between MIN and MAX.
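A minimal sketch comparing the three linkages on the same six illustration points, assuming SciPy is available; the printed merge heights show how MIN, MAX and Group Average combine clusters at different distances.

```python
# Sketch: MIN vs MAX vs Group Average merge heights on the six points (assumes SciPy).
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage

pts = np.array([[0.40, 0.53], [0.22, 0.38], [0.38, 0.32],
                [0.26, 0.19], [0.08, 0.41], [0.45, 0.30]])   # P1..P6
D = pdist(pts)
for method in ('single', 'complete', 'average'):             # MIN, MAX, Group Average
    Z = linkage(D, method=method)
    print(method, np.round(Z[:, 2], 2))                      # merge distances per step
```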

31

BITS Pilani, WILP


Exercise

Complete the Group Average hierarchical clustering and


draw the dendrogram trees for the illustrations of MAX
and Group Average hierarchical clustering.

32

BITS Pilani, WILP


Balanced Iterative Reducing and
Clustering using Hierarchies (BIRCH)
 BIRCH is designed for clustering an extremely large amount of data, which cannot be read into memory.
 It overcomes two difficulties of agglomerative clustering methods: scalability and inability to undo the
previous step.
 It uses a small set of summary statistics to represent a larger set of data points. These summary statistics are called the Clustering Feature (CF), which is the following triple of attributes for a single cluster, <n, LS, SS>:
i. Count (n): the count of data points in the cluster.
ii. Linear Sum (LS): the sum of individual data point coordinates. ∑xi (for i = 1 to n, xi is the coordinates of ith
point)
iii. Squared Sum (SS): the square sum of individual data point coordinates. ∑xi2 (for i = 1 to n, xi is the
coordinates of ith point).
 Using the CF, other useful statistical measures are defined as follows:

Centroid: x0 = ( ∑ i=1..n xi ) / n = LS / n

Radius: R = SQRT{ ∑ i=1..n (xi − x0)² / n } = SQRT{ (SS/n) − (LS/n)² }

The Radius (R) reflects the tightness of the cluster around the centroid.
Note that if just the Clustering Feature is available, the Centroid and Radius can be calculated; the coordinates of the individual data points are not needed.

BITS Pilani, WILP


Clustering Feature: Additive Property

For two disjoint clusters C1 and C2 with the clustering features


CF1 = <n1, LS1, SS1> and CF2 = <n2, LS2, SS2> respectively, the
clustering feature for the merged cluster would be = <n1 + n2,
LS1 + LS2, SS1 + SS2 >

Example:
In a cluster C1, there are three points (2, 5), (3, 2) and (4, 3).
So, CF1 = <3, (9, 10), (29, 38)>

In another cluster C2 the three points are (1, 2), (2, 6) and (3, 9).
So, CF2 = <3, (6, 17), (14, 121)>

CF for the merged cluster from C1 and C2 will be = <3+3, (9+6, 10+17), (29+14,
38+121)> = <6, (15, 27), (43, 159)>
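A minimal sketch of the CF triple, its additive property and the centroid/radius derived from it, using only NumPy and the two example clusters above.

```python
# Sketch: BIRCH Clustering Features <n, LS, SS>, their additivity, and the derived
# centroid and radius (per-dimension sums of squares are added together for SS).
import numpy as np

def cf(points):
    pts = np.asarray(points, dtype=float)
    return len(pts), pts.sum(axis=0), (pts ** 2).sum(axis=0)      # <n, LS, SS>

def merge(cf1, cf2):
    return cf1[0] + cf2[0], cf1[1] + cf2[1], cf1[2] + cf2[2]      # additive property

def centroid_radius(cf_):
    n, LS, SS = cf_
    centroid = LS / n
    radius = np.sqrt(max(SS.sum() / n - np.sum((LS / n) ** 2), 0.0))
    return centroid, radius

cf1 = cf([(2, 5), (3, 2), (4, 3)])     # -> (3, [9, 10], [29, 38])
cf2 = cf([(1, 2), (2, 6), (3, 9)])     # -> (3, [6, 17], [14, 121])
print(merge(cf1, cf2))                 # -> (6, [15, 27], [43, 159])
print(centroid_radius(cf1))            # centroid and radius from the CF alone
```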

34

BITS Pilani, WILP


Clustering Feature (CF) Tree

• A CF-Tree is a compressed form of the data and consists


of CFs.
• A CF-Tree has the following parameters:
• Branching Factor (B): the maximum number of children allowed for a non-leaf node.
• Threshold (T): the upper limit on the Radius (R) of a cluster in a leaf node.

35

BITS Pilani, WILP


CF Tree Illustration
Slide-1/6

 There are seven points in one-dimensional space as: x1 = 0.50, x2 = 0.25, x3 = 0, x4 = 0.65, x5 = 1.0, x6 =
1.4, x7 = 1.1
 CF-Tree parameters are given as: Branching Factor (B) = 2, Threshold (T) = 0.15.
 The dataset is scanned and the first data (x1) point is read. A root as node-0 and a leaf as leaf-1 is
created with the CF value of this point. Since this is the first point, radius (R) of the leaf-1 is 0.
 Now the second point (x2) is read. Its radius w.r.t. the leaf-1 is calculated. The value of R comes as 0.13
that is less than the threshold (T). So, it is assigned to leaf-1 and the value of CF1 in the root is
updated.

Tree after x1 — Root, Node-0: CF1 = <1, 0.50, 0.25>; Leaf-1 (R = 0): x1 = 0.50
Tree after x2 — Root, Node-0: CF1 = <2, 0.75, 0.31>; Leaf-1 (R = 0.13): x1 = 0.50, x2 = 0.25

Workings:
CF1 = <1, 0.50, 0.25>, CFx2 = <1, 0.25, 0.06>; combined CF = <2, 0.75, 0.31>
SS/n = 0.31/2, LS/n = 0.75/2, R = SQRT{ (SS/n) − (LS/n)² } = 0.13

BITS Pilani, WILP


CF Tree Illustration
B = 2, T = 0.15
Slide-2/6 x1 = 0.50, x2 = 0.25, x3 = 0, x4 = 0.65, x5 = 1.0, x6 = 1.4, x7 = 1.1

 The third point (x3) is read now. Its radius w.r.t. the leaf-1 is calculated. The value of R comes
as 0.21 that is more than the threshold (T). So, it is not assigned to leaf-1 and a new leaf-2 is
created that contains only x3. Workings:
 In this new leaf, Radius (R) = 0 and root is updated for CF. CF1 = <2, 0.75, 0.31>
CFx3 = <1, 0, 0>
Combined CF = <3, 0.75, 0.31>
SS/n = 0.31/3
LS/n = 0.75/3
R = SQRT{ (SS/n) – (LS/n)2 } = 0.21

Tree before x3 — Root, Node-0: CF1 = <2, 0.75, 0.31>; Leaf-1 (R = 0.13): x1 = 0.50, x2 = 0.25
Tree after x3 — Root, Node-0: CF1 = <2, 0.75, 0.31>, CF2 = <1, 0, 0>
  Leaf-1 (R = 0.13): x1 = 0.50, x2 = 0.25
  Leaf-2 (R = 0): x3 = 0

BITS Pilani, WILP


CF Tree Illustration
B = 2, T = 0.15
Slide-3/6 x1 = 0.50, x2 = 0.25, x3 = 0, x4 = 0.65, x5 = 1.0, x6 = 1.4, x7 = 1.1

 The fourth point (x4) is read now. Its position under CF1 or CF2 is decided based on their respective centroids. The centroid of CF1 is 0.75/2 = 0.375 and of CF2 is 0/1 = 0. Therefore x4 is closer to CF1.
 The radius (R) of x4 combined with CF1 comes out as 0.16, which is > threshold (T). This means a new leaf node has to be created.
 A new leaf under the root is not possible, because the branching factor (B) = 2. So the root node is split into Node-1 and Node-2, where Node-1 is the old Node-0 and Node-2 has a separate Leaf-3 containing only x4.
Tree before x4 — Root, Node-0: CF1 = <2, 0.75, 0.31>, CF2 = <1, 0, 0>
  Leaf-1 (R = 0.13): x1 = 0.50, x2 = 0.25;  Leaf-2 (R = 0): x3 = 0

Tree after x4 — Root, Node-0: CF1-2 = <3, 0.75, 0.31>, CF3 = <1, 0.65, 0.42>
  Node-1: CF1 = <2, 0.75, 0.31>, CF2 = <1, 0, 0>
    Leaf-1 (R = 0.13): x1 = 0.50, x2 = 0.25;  Leaf-2 (R = 0): x3 = 0
  Node-2: CF3 = <1, 0.65, 0.42>
    Leaf-3 (R = 0): x4 = 0.65

 Note that the sum of the children's CFs equals the CF of their parent, and only the leaves contain the actual data points.
BITS Pilani, WILP
CF Tree Illustration
B = 2, T = 0.15
Slide-4/6 x1 = 0.50, x2 = 0.25, x3 = 0, x4 = 0.65, x5 = 1.0, x6 = 1.4, x7 = 1.1

 Now, the fifth point (x5) is read. Its position in CF1-2 or CF3 is to be decided based on their respective
centroids. The centroid of CF1-2 is 0.75/3 = 0.25 and for CF3 is 0.65/1 = 0.65. Therefore x5 is closer to
CF3.
 The radius (R) of x5 combined with CF3 comes out as 0.18, which is > threshold (T). This means a new leaf node has to be created in Node-2.
 The details of Node-2 and Node-0 are updated for their CFs.

Tree after x5 — Root, Node-0: CF1-2 = <3, 0.75, 0.31>, CF3-4 = <2, 1.65, 1.42>
  Node-1: CF1 = <2, 0.75, 0.31>, CF2 = <1, 0, 0>
    Leaf-1 (R = 0.13): x1 = 0.50, x2 = 0.25;  Leaf-2 (R = 0): x3 = 0
  Node-2: CF3 = <1, 0.65, 0.42>, CF4 = <1, 1.0, 1.0>
    Leaf-3 (R = 0): x4 = 0.65;  Leaf-4 (R = 0): x5 = 1.0

BITS Pilani, WILP


CF Tree Illustration
B = 2, T = 0.15
Slide-5/6 x1 = 0.50, x2 = 0.25, x3 = 0, x4 = 0.65, x5 = 1.0, x6 = 1.4, x7 = 1.1

 Now, the sixth point (x6) is read. Its position in CF1-2 or CF3-4 is to be decided based on their
respective centroids. The centroid of CF1-2 is 0.75/3 = 0.25 and for CF3-4 is 1.65/2 = 0.83. Therefore x6
is closer to CF3-4.
 Now the centroid of CF3 is 0.65/1 = 0.65 and CF4 is 1.0/1 = 1.0. Therefore, x6 is closer to CF4.
 The radius of x6 combined with CF4 is 0.20, which is greater than the threshold (T). So Node-2 has to be split into two nodes.
Tree after x6 — Root, Node-0: CF1-2 = <3, 0.75, 0.31>, CF3-4-5 = <3, 3.05, 3.38>
  Node-1: CF1 = <2, 0.75, 0.31>, CF2 = <1, 0, 0>
    Leaf-1 (R = 0.13): x1 = 0.50, x2 = 0.25;  Leaf-2 (R = 0): x3 = 0
  Node-2: CF3-4 = <2, 1.65, 1.42>, CF5 = <1, 1.4, 1.96>
    Node-2.1: CF3 = <1, 0.65, 0.42>, CF4 = <1, 1.0, 1.0>
      Leaf-3 (R = 0): x4 = 0.65;  Leaf-4 (R = 0): x5 = 1.0
    Node-2.2: CF5 = <1, 1.4, 1.96>
      Leaf-5 (R = 0): x6 = 1.4

 The details of Node-2 and Node-0 are updated for their CFs.
BITS Pilani, WILP
CF Tree Illustration
B = 2, T = 0.15
Slide-6/6 x1 = 0.50, x2 = 0.25, x3 = 0, x4 = 0.65, x5 = 1.0, x6 = 1.4, x7 = 1.1

 The seventh (and last) point (x7) is read. Its position under CF1-2 or CF3-4-5 is decided based on their respective centroids. The centroid of CF1-2 is 0.75/3 = 0.25 and of CF3-4-5 is 3.05/3 ≈ 1.02. Therefore x7 is closer to CF3-4-5.
 Now the centroid of CF3-4 is 1.65/2 = 0.83 and of CF5 is 1.4/1 = 1.4. Therefore x7 is closer to CF3-4. Similarly, x7 is closer to CF4 than to CF3. The radius of Leaf-4 after adding x7 is 0.05, which is within the threshold (T). So x7 is assigned to Leaf-4.
Tree after x7 — Root, Node-0: CF1-2 = <3, 0.75, 0.31>, CF3-4-5 = <4, 4.15, 4.59>
  Node-1: CF1 = <2, 0.75, 0.31>, CF2 = <1, 0, 0>
    Leaf-1 (R = 0.13): x1 = 0.50, x2 = 0.25;  Leaf-2 (R = 0): x3 = 0
  Node-2: CF3-4 = <3, 2.75, 2.63>, CF5 = <1, 1.4, 1.96>
    Node-2.1: CF3 = <1, 0.65, 0.42>, CF4 = <2, 2.1, 2.21>
      Leaf-3 (R = 0): x4 = 0.65;  Leaf-4 (R = 0.05): x5 = 1.0, x7 = 1.1
    Node-2.2: CF5 = <1, 1.4, 1.96>
      Leaf-5 (R = 0): x6 = 1.4

 The details of Node-2 and Node-0 are updated for their CFs.

BITS Pilani, WILP


BIRCH Phases

i. Phase-1: BIRCH scans the dataset to build an initial CF-Tree. The


tree can be viewed as the compressed form of the data points
preserving the clustering structure in the dataset. E.g. in the
previous illustration, BIRCH identifies the following inherent
clusters in the dataset.
x3 x2 x1 x4 x5 x7 x6

0 0.25 0.50 0.65 1.0 1.10 1.40

ii. Phase-2: If required, any selected clustering algorithm is


applied to further cluster the leaf nodes data points
individually. This helps to identify the outliers and combine the
dense or separate the sparse regions as required.
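A minimal usage sketch, assuming scikit-learn is available. sklearn's Birch exposes threshold and branching_factor in the roles of T and B above; its internal insertion details differ slightly from the hand-worked illustration, so the exact leaf layout may not match, but the coarse grouping of the seven points should.

```python
# Sketch: BIRCH via scikit-learn on the seven one-dimensional points above.
import numpy as np
from sklearn.cluster import Birch

x = np.array([0.50, 0.25, 0.0, 0.65, 1.0, 1.4, 1.1]).reshape(-1, 1)
model = Birch(threshold=0.15, branching_factor=2, n_clusters=2)
labels = model.fit_predict(x)
print(labels)      # roughly separates the small values from the larger ones
```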
42

BITS Pilani, WILP


Density Based Spatial Clustering of
Applications with Noise (DBSCAN)
 Density based clustering locates specific regions of high density that are separated from one another by regions of low density.
 DBSCAN is one such density based algorithm.
 Density is estimated for a particular point in the data set by counting the number of points within a specified radius, including the point itself. The radius is called Eps (or ε) of the point.
 Core Points: A point is a core point if the number of points within a given region defined by ε meets the threshold called MinPts. For example, in the figure the yellow point is a core point for ε = 1 cm and MinPts = 5.
 Border Points: It is not a core point but falls within the neighbourhood of a core point.
 Noise Points: A point which is neither a core nor a border point.
[Figure: how to classify points — core, border and noise points for ε = 1 cm, MinPts = 5.]
BITS Pilani, WILP
DBSCAN Points
Illustration

[Figure: three points, each with its ε-neighbourhood drawn.]
 For the given ε and MinPts = 7, the yellow point is a core point.
 For the same ε and MinPts = 7, the blue point is a border point.
 For the same ε and MinPts = 7, the red points are noise points.

44

BITS Pilani, WILP


DBSCAN: The Algorithm
 Arbitrarily select a point p.
 Retrieve all points density-reachable from p with respect to ε and MinPts to identify the points of interest.
 If p meets the requirement of MinPts within ε, it is a core point.
 If p is not a core point, but exists in the neighborhood of a core point, then it is a border point.
 If p is neither a core point nor a border point, then it is a noise point.
 Connected Core Points: all the core points that can be reached from a previous core point by travelling a distance <= ε.
 Mark each group of connected core points as a separate cluster.
 The process is continued until all of the points have been processed (see the sketch below).
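A minimal usage sketch, assuming scikit-learn is available; eps and min_samples correspond to the ε and MinPts above, and the data here is hypothetical.

```python
# Sketch: DBSCAN via scikit-learn; label -1 marks noise points.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0, 0], 0.2, (40, 2)),
               rng.normal([3, 3], 0.2, (40, 2)),
               rng.uniform(-2, 5, (10, 2))])       # two dense blobs + scattered points

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print(sorted(set(labels)))                         # e.g. [-1, 0, 1]: noise plus two clusters
```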

BITS Pilani, WILP


DBSCAN: Parameter Selection

 Let the distance of a point from its kth nearest neighbour be k-dist.
 For points that belong to the same cluster, the value of k-dist will be small if k is not larger than the cluster size.
 Across all the points there will be some variation in k-dist, but it will not be huge unless the cluster densities are radically different.
 For noise points, k-dist will be relatively large.
 If k-dist is calculated for all the data for some value of k and sorted in increasing order, a sharp change in the value of k-dist suggests the ε, and the chosen k represents MinPts (see the sketch below).
 The points for which k-dist is <= ε are core points. Other points are border or noise points.
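A minimal sketch of the sorted k-distance plot used to pick ε, assuming scikit-learn and matplotlib are available.

```python
# Sketch: choosing ε from a sorted k-distance plot (MinPts = k).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

def k_dist_plot(X, k=4):
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1 because each point returns itself
    dist, _ = nn.kneighbors(X)
    kdist = np.sort(dist[:, -1])                      # distance to the k-th neighbour
    plt.plot(kdist)
    plt.xlabel('points sorted by k-dist')
    plt.ylabel(f'{k}-dist')
    plt.show()                                        # the knee of this curve suggests ε
```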
46

BITS Pilani, WILP


Illustration
DBSCAN Parameters

3000 Data Points

For MinPts (k) = 4, at ε = 10 there


is a sharp change in the curve
47

BITS Pilani, WILP


Illustration
DBSCAN Algorithm

Core Border Noise/Outliers

Result of DBSCAN Algorithm on 3000 data points with k = 4


and ε = 10
48

BITS Pilani, WILP


Exercise

For the scatter plot shown below, identify the core,


border and noise points. ε = 2 units, MinPts = 3 and
Manhattan distance is the proximity measure.

49

BITS Pilani, WILP


Ordering Points to Identify the
Cluster Structure (OPTICS)
 DBSCAN suffers from the same problem as most other clustering algorithms: how to find the critical parameters of the clusters.
 These parameters are found empirically and are difficult to determine for real world high dimensional, high volume data.
 Moreover, DBSCAN works with global values of MinPts and ε. It therefore has a major weakness in detecting meaningful clusters in data of varying densities.
 For example, in the data points shown, there are 5 perceivable density regions. It is difficult to determine a single set of DBSCAN parameters that isolates these regions.
 To overcome this difficulty, OPTICS, a cluster analysis method, is proposed.
 OPTICS does not explicitly produce a clustering of the data set. Instead it yields a cluster ordering.
 This ordering is a linear list of all data points and represents the density based clustering structure.
BITS Pilani, WILP
OPTICS Important Parameters
OPTICS maintains two important pieces of information per data point:
i. core-distance: the minimum distance (ε’) that makes a point a core point. If the point is
not a core point with respect to MinPts and ε, then core-distance is undefined for that
point.
ii. reachability-distance: if p is a core point with respect to MinPts and ε then reachability-
distance for a point q from p is defined as max {core-distance (p), dist(p, q)}, where
dist() is the distance measure used in the context (e.g. L1 or L2 norm etc.). Else
undefined.
Example:
For the figure shown, MinPts = 5 and ε = 6 mm.
For ε’ = 3 mm, the point p becomes a core point. So for p the core-distance = 3 mm.

For point q1:


reachability-distance = max(core-distance(p),
dist(p, q1))
= ε’ = 3 mm
For point q2:
reachability-distance = max(core-distance(p),
dist(p, q2))
= dist(p, q2)

BITS Pilani, WILP


OPTICS Procedure

 The core-distance of each point in the dataset is computed, and its reachability-distance is initialised to undefined.
 OPTICS maintains a list called OrderSeeds. Points in OrderSeeds are arranged in increasing order of their reachability-distance from their respective closest core points.
 OPTICS starts with an arbitrary point from the dataset as the current point p.
 p is marked processed and written to the output cluster ordering with its reachability-distance as undefined (or ∞).
 If p is a core point, then for each point q (not yet processed) in the ε-neighborhood of p, the reachability-distance of q from p is updated and q is inserted into OrderSeeds. When q is later taken from OrderSeeds it is marked processed and its reachability-distance is written to the output.
 If p is not a core point, OPTICS moves to the next point in OrderSeeds, or in the original dataset if OrderSeeds is empty.
 The procedure is repeated until the dataset and OrderSeeds are fully consumed (see the sketch below).
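A minimal usage sketch, assuming scikit-learn is available; reading reachability_ in the ordering_ sequence gives the reachability plot whose valleys are the clusters.

```python
# Sketch: OPTICS via scikit-learn on hypothetical data of two different densities.
import numpy as np
from sklearn.cluster import OPTICS

rng = np.random.default_rng(2)
X = np.vstack([rng.normal([0, 0], 0.2, (50, 2)),     # dense region
               rng.normal([4, 4], 1.0, (50, 2))])    # sparser region

opt = OPTICS(min_samples=5).fit(X)
reach_in_order = opt.reachability_[opt.ordering_]    # cluster-ordering reachability values
print(np.round(reach_in_order[:10], 2))              # first entry is inf (undefined)
```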
BITS Pilani, WILP
OPTICS Cluster Ordering

[Reachability plot: reachability-distance (y-axis) vs. cluster ordering (x-axis).]

The valleys C1 to C5 in the reachability plot represent five clusters of different densities present in the data set.

53

BITS Pilani, WILP


OPTICS Illustration
Slide-1/7

 16 data points are given in a dataset. The plot shows their


approximate distribution on X-Y axes (not to the scale).
 The value of MinPts = 3 and ε = 44.
 The point A is arbitrarily selected, marked processed and printed
for cluster ordering.
 Points B and I make point-A a core point. Both of these two points
are at a distance of 40 from the point A. These points are added
to OrderSeeds with their reachability-distance.

OrderSeeds: (B, 40), (I, 40)

BITS Pilani, WILP


OPTICS Illustration
Slide-2/7

 OrderSeeds is not empty, so its first element point B is selected next.


 B is marked processed and its reachability distance is printed.
 Points A and C make point-B a core point. Both of these two points are at a distance of
40 from the point B. A was already processed. So point C is added to the OrderSeeds
with its reachability-distance.

OrderSeeds: (I, 40), (C, 40)

BITS Pilani, WILP


OPTICS Illustration
Slide-3/7

 OrderSeeds is not empty, so its first element point I is selected next.


 I is marked processed and its reachability distance is printed.
 Points J and K make point-I a core point. Both of these two points are at a distance of 20
from the point I. In addition, points A, L, M and R are also in the ε-neighborhood of I. A was
already processed. Points J, K, L, M and R are added to OrderSeeds in the increasing order
of their reachability distance. Note that point C was already present before processing I, but
now its position in OrderSeeds is changed because of re-ordering.

OrderSeeds: (J, 20), (K, 20), (L, 31), (C, 40), (M, 40),(R, 43)

BITS Pilani, WILP


OPTICS Illustration
Slide-4/7
 OrderSeeds is not empty, so its first element point J is selected next.
 J is marked processed and its reachability distance is printed.
 Points L and K make point-J a core point. These two points are at a distance of 19 and 20
from the point J respectively. In addition, points I, R, M and P are also in the ε-
neighborhood of J. I was already processed. Points L, K, R, M and P are added to OrderSeeds
in the increasing order of their reachability distance. Note that few points were already
present before processing J, but now their position in OrderSeeds are changed because of
re-ordering.

OrderSeeds: (L, 19), (K, 20),(R, 21), (M, 30),(P, 31), (C, 40)

BITS Pilani, WILP


OPTICS Illustration
Slide-5/7
 OrderSeeds is not empty, so its first element point L is selected next.
 L is marked processed and its reachability distance is printed.
 Points M and K make point-L a core point. These two points are at a distance of 18
from the point L. In addition, points I, J, R, P and N are also in the ε-neighborhood of L.
I and J were already processed. Points M, K, R, P and N are updated in the OrderSeeds
in the increasing order of their reachability distance. Note that points are rearranged
in OrderSeeds in the increasing order of reachability-distance.

OrderSeeds: (M, 18), (K, 18), (R, 20), (P, 21), (N, 35), (C, 40)

BITS Pilani, WILP


OPTICS Illustration
Slide-6/7

The process continues in the same way until all the points are processed, i.e. until there are no more points in the dataset or in OrderSeeds.

BITS Pilani, WILP


OPTICS Illustration
Slide-7/7

The valleys in the cluster ordering plot show two density regions.

BITS Pilani, WILP


Evaluation of Clustering

 How to evaluate whether clustering results are good?


 The major tasks of clustering evaluation include:
i. Assessing cluster tendency: assess if non-uniform
(non-random) structure exists in the data set or if the
data points are random and uniform with no clusters.
ii. Determining number of clusters: how many possible
clusters exists? Algorithms like K-Means need this
information to decide the count of initial centroids.
Example: Elbow method reviewed with K-Means.
iii. Measuring cluster quality: how good the resulting
clusters are?
61

BITS Pilani, WILP


Assessing Clustering Tendency
Hopkins Statistic

 The data may be uniformly distributed (random structure) and applying a clustering technique might create
meaningless clusters.
 Cluster analysis is meaningful only when there are well separated points and there are natural clusters in the
data (non-random structure).
 Hopkins Statistic is one measure that helps to assesses the nature of the data: random or non-random structure.
o n data points p1, p2,.....pn are uniformly sampled (equal probability of getting selected, no bias) from the data set D.
For each point pi (1<= i <= n), the nearest neighbor is found out in D. Let xi be the distance between pi and its
nearest neighbor in D. So xi = min (dist (pi, v)), v ∈ D.
o n data points q1, q2,.....qn are uniformly sampled (equal probability of getting selected, no bias) from the data sets
D. For each point qi (1<= i <= n), the nearest neighbor is found out in (D-qi ) and qi is not added back in D. Let yi be
the distance between qi and its nearest neighbor in (D-qi ). So yi = min (dist (qi, v)), v ∈ (D-qi).
o Hopkins Statistic (H) is defined as:

H = ( ∑ i=1..n yi ) / ( ∑ i=1..n xi + ∑ i=1..n yi )

 If the data points are uniformly distributed, the values of xi and yi will be close to each other and the value of H will be close to 0.5 (or more). There will be no meaningful clusters (Homogeneous Hypothesis).
 If the data points are non-randomly distributed, the values yi will be significantly smaller than xi and the value of H will be close to 0. There will be meaningful clusters (Alternative Hypothesis).

BITS Pilani, WILP


Hopkins Statistic
Working Procedure

D is a data set = {0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5}
 Randomly sampled 5 points (pi)= {0.5, 1.5, 2.0, 2.5, 4.0}
 Nearest neighbors are to be found from D
 ∑xi = 0.5+0.5+0.5+0.5+0.5 = 2.5
 The second set of sampling, where the nearest neighbours are to be found from the shrinking set (each qi is removed and not added back):
q1 = 1.0   D1 = D − {q1} = {0, 0.5, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5}   y1 = 0.5
q2 = 1.5   D2 = D1 − {q2} = {0, 0.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5}   y2 = 0.5
q3 = 2.5   D3 = D2 − {q3} = {0, 0.5, 2.0, 3.0, 3.5, 4.0, 4.5}   y3 = 0.5
q4 = 3.0   D4 = D3 − {q4} = {0, 0.5, 2.0, 3.5, 4.0, 4.5}   y4 = 0.5
q5 = 3.5   D5 = D4 − {q5} = {0, 0.5, 2.0, 4.0, 4.5}   y5 = 0.5
 ∑yi = 0.5+0.5+0.5+0.5+0.5 = 2.5
 H = 2.5 / (2.5 + 2.5) = 0.5, which means there are no meaningful clusters; the alternative hypothesis is rejected.
 This illustration is shown only from calculation perspective. The sample collection and iterations have to be
repeated several times and average of H is to be taken for any inference.
 Also review another variant of the Hopkins statistic from the other text book.
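A minimal sketch of the procedure described on this slide (both samples drawn from D, the second sample removed one point at a time and never added back), using only NumPy. A single run is noisy; in practice the sampling is repeated and H is averaged.

```python
# Sketch of the Hopkins statistic as defined on this slide.
import numpy as np

def hopkins(D, n_samples, seed=0):
    rng = np.random.default_rng(seed)
    D = np.asarray(D, dtype=float).reshape(len(D), -1)

    # First sample: nearest neighbour inside D, excluding the point itself.
    x = 0.0
    for i in rng.choice(len(D), n_samples, replace=False):
        d = np.linalg.norm(D - D[i], axis=1)
        x += d[d > 0].min()

    # Second sample: each q_i is removed from D and not added back.
    y = 0.0
    remaining = D.copy()
    for i in sorted(rng.choice(len(D), n_samples, replace=False), reverse=True):
        q = remaining[i]
        remaining = np.delete(remaining, i, axis=0)
        y += np.linalg.norm(remaining - q, axis=1).min()

    return y / (x + y)

print(round(hopkins([0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5], 5), 2))  # ~0.5 for uniform data
```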
BITS Pilani, WILP
Measuring/Validation of Cluster Quality
Silhouette Coefficient

 A dataset D of n points, partitioned into k clusters C1, C2, .....Ck.


 For each point p ∈ D:
 a(p) = the average distance between p (p ∈ Ci, 1 <= i <= k) and all the other points in the cluster that p belongs to (Ci). It reflects the compactness of the cluster:

a(p) = [ ∑ q ∈ Ci, q ≠ p dist(p, q) ] / ( |Ci| − 1 )

 b(p) = the minimum average distance from p to the clusters to which p does not belong:

b(p) = min over Cj (1 <= j <= k, j ≠ i) of [ ∑ q ∈ Cj dist(p, q) / |Cj| ]

 The Silhouette Coefficient is defined as:

s(p) = ( b(p) − a(p) ) / max { a(p), b(p) }

 The value of the Silhouette Coefficient ranges from −1 to 1.
 A value near 1 means p is far away from the other clusters. A value near −1 means p is closer to the points in other clusters than to the points in its own cluster.
 The average s for a cluster (over all its points) and the average s for the clustering (over all clusters in the data set) can be calculated.
BITS Pilani, WILP
Silhouette Coefficient
Working Procedure

 Four points P1 to P4 and their distance matrix are given:

        P1    P2    P3    P4
  P1   0
  P2   0.50  0
  P3   1.75  2.00  0
  P4   2.00  3.00  0.75  0

 P1 and P2 are in cluster C1, whereas P3 and P4 are in cluster C2.

a(P1) = a(P2) = 0.50 / 1 = 0.50 a(P3) = a(P4) = 0.75 / 1 = 0.75


b(P1) = (1.75 + 2.00) / 2 = 1.88 b(P3) = (1.75+2.00) / 2 = 1.88
b(P2) = (2.00 + 3.00) / 2 = 2.50 b(P4) = (2.00+3.00) / 2 = 2.50

s(P1) = {b(P1) – a(P1)} / max {a(P1), b(P1)} s(P3) = {b(P3) – a(P3)} / max {a(P3), b(P3)}
= (1.88 – 0.50) / max (0.50, 1.88) = (1.88 – 0.75) / max (0.75, 1.88)
= 1.38 / 1.88 = 0.73 = 1.13/ 1.88 = 0.60

s(P2) = {b(P2) – a(P2)} / max {a(P2), b(P2)} s(P4) = {b(P4) – a(P4)} / max {a(P4), b(P4)}
= (2.50 – 0.50) / max (0.50, 2.50) = (2.50 – 0.75) / max (0.75, 2.50)
= 2.00 / 2.50 = 0.80 = 1.75 / 2. 50 = 0.70

s(C1) = (0.73+0.80) / 2 = 0.77 s(C2) = (0.60+0.70) / 2 = 0.65 Avg s = (0.77+0.65)/2 = 0.71
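A minimal sketch verifying the worked example with scikit-learn on the precomputed distance matrix (assuming scikit-learn is available); the overall mean over all points also comes out to about 0.71.

```python
# Sketch: silhouette values for the four-point example via scikit-learn.
import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

dist = np.array([[0.00, 0.50, 1.75, 2.00],
                 [0.50, 0.00, 2.00, 3.00],
                 [1.75, 2.00, 0.00, 0.75],
                 [2.00, 3.00, 0.75, 0.00]])
labels = [0, 0, 1, 1]                   # P1, P2 in C1; P3, P4 in C2

print(np.round(silhouette_samples(dist, labels, metric='precomputed'), 2))  # ~[0.73 0.80 0.60 0.70]
print(round(silhouette_score(dist, labels, metric='precomputed'), 2))       # ~0.71
```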

BITS Pilani, WILP


Thank You

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Appendix

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Haversine Distance
Calculates the aerial distance between
two points whose GPS coordinates are
given in latitudes and longitudes.

Ф represents latitude in radian, λ


longitude in radian, r the radius of the
earth (6371 km), and d the aerial
distance between the two points. Arcsin
is sin inverse.
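The equation itself did not survive the slide extraction; the standard haversine formula, consistent with the symbols defined above, is:

$d = 2r \,\arcsin\!\left(\sqrt{\sin^2\!\frac{\phi_2-\phi_1}{2} + \cos\phi_1\,\cos\phi_2\,\sin^2\!\frac{\lambda_2-\lambda_1}{2}}\right)$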

Haversine distance is a useful measure when GPS coordinates are given and distances have to be calculated for clustering. Verification can be done using an online calculator from any of several sources.

BITS Pilani, WILP


DSECL ZC415
Data Mining
Outlier (Anomaly) Detection
Revision -1.0

BITS Pilani Prof Vineet Garg


Work Integrated Learning Programmes Bangalore Professional Development Center
Introduction

 Outlier Detection (or Anomaly Detection) is the process of finding


data objects or points with behaviors that are very different from
the rest. Such objects are called outliers or anomalies.
 Example: credit card usage during a particular time period is much higher than usual. The usage observation is an outlier that signals unusual behaviour (theft etc.).
 Outlier Detection could prove very useful in many real world
applications – medical, public safety, surveillance, intrusion
detection etc.
 Outlier Detection and Cluster Analysis are related tasks, though
they serve different purposes. Clustering finds majority patterns
and Outlier Detection finds exceptions that deviate from majority
patterns.
BITS Pilani, WILP
Noise vs. Outliers
 Noise is a random error or variance in a measured variable, and is not useful for data analysis.
 An occasional, genuine high value purchase using a credit card is noise. It will be annoying to the customer if the card is blocked.
 But if the GPS coordinates of the card usage change over a very short span of time (e.g. from India to the US within 3 hours) and there are also hefty purchases, these are almost certainly fraudulent transactions and thus outliers. Alerting the customer is required.
 Outliers are interesting because they are suspected of not being
generated by the same mechanism as rest of the data.
 Outlier Detection is also related to Novelty Detection in evolving data
sets. E.g. new topics of the discussion in social networking websites.
Initially they may be suspected as outliers but when their presence is
confirmed, their arrival is treated as normal. 3

BITS Pilani, WILP


Types of Outliers
Global Outliers:
 A data object is a global outlier if it deviates significantly from the rest of the data set.
 These objects are also called Point Anomalies.
 Most outlier detection are aimed to find global outliers.
 Challenge is to find an appropriate measurement of deviation.
Contextual Outliers:
 25°C in Delhi in the month of December is an exception, but not in May. Detection depends on background information consisting of two types of attributes:
 Contextual Attributes: attributes that define the object's context, e.g. month and location in the Delhi example above.
 Behavioral Attributes: the object's characteristics, e.g. temperature.
Collective Outliers:
 A subset of data objects collectively deviates significantly from the whole data set, even if the individual data objects may not be outliers.
 E.g. a warehouse receives around 2000 shipments every day between 9 am and 12 noon. A delay of one, two or even a few more is not unusual, but 275 shipments arrived late today.
 E.g. no machine is ping-able on the 3rd floor of a multistorey office.
BITS Pilani, WILP
Examples
Approximate Ideas without Details

Global Outliers
Contextual Outliers

Collective Outliers
BITS Pilani, WILP
Challenges of Outlier Detection

 Modeling: building a comprehensive model that encompasses data normality is very challenging because it is difficult to enumerate all possible normal behaviors in an application. There can be grey areas; therefore some models instead give each object a score called outlier-ness.
 Application Specificity: the thresholds of the dissimilarity/distance measures used to identify outliers are application specific, e.g. the medical vs. the marketing world.
 Presence of Noise: availability of clean data may be a challenge. Noise may disguise itself as outliers and vice-versa.
 Understandability: not just identification, but reasoning about why an object is an outlier. Justification is required for the detection methods.

BITS Pilani, WILP


Outlier Detection Methods
The First Approach

Is sample data labeled by a domain expert available as a reference?

Supervised:
 Outlier detection works as a classifier.
 Classification with two labels: normal and outliers.
 Challenges: class imbalance, lack of representative outliers.
 Sensitivity (Recall) of outlier detection is an important measure (TP/P).

Semi-supervised:
 Only a few labels are available.
 A classification model is prepared using the available labeled data. Then the unlabeled data is labeled using the model.
 The final model is used to detect outliers.

Unsupervised:
 Labels are not available, so classifiers cannot be built.
 Clustering can be used to detect outliers, as normal objects tend to form clusters.
 Expensive, so not appealing.
 Difficult to isolate noise from outliers.

BITS Pilani, WILP


Outlier Detection Methods
The Second Approach

Based on the assumptions about outliers vs. the rest of the data:
 Statistical Approaches: Parametric, Non-parametric
 Proximity Based: Distance Based, Density Based, Grid Based #
 Clustering Based #
 Classification Based #

# Details not in the syllabus

BITS Pilani, WILP


Parametric Statistical Approaches
Univariate Normal Distribution – Maximum Likelihood

Statistical Approaches are model based approaches:


 A model (or distribution) is created for the data.
 Objects are evaluated with respect to how well they fit the model.
 For univariate normal distribution, the Gaussian (normal) distribution is used to identify the outlier.
 The normal distribution N(μ, σ) has two parameters, mean (μ) and standard deviation (σ):
o From (μ–σ) to (μ+σ): contains about 68% of the data
o From (μ–2σ) to (μ+2σ): contains about 95% of the data
o From (μ–3σ) to (μ+3σ): contains about 99.7% of the data
o Anything beyond that can be considered an outlier.
[Figure: standard normal curve with the ±1σ, ±2σ, ±3σ bands around μ.]
BITS Pilani, WILP
Example
Maximum Likelihood

Ten sample temperatures values are given as: 24.0, 28.9, 28.9, 29.0, 29.1, 29.1,
29.2, 29.2, 29.3, 29.4 in oC.
 Mean (μ) = 28.61 oC
 Standard Deviation (σ) = 1.51
 The sample 24.0 °C is 4.61 °C below the mean, i.e. 4.61/1.51 = 3.05 σ steps below the mean. So 24.0 °C can be considered an outlier because it is below (μ–3σ).
 The z-score of 24.0 = (24 − 28.61)/1.51 = −3.05
 Looking into the z-Table the probability for -3.05 z-score is 0.0011 or 0.11%
(low probability indicates it is more unlikely that 24.0 is generated by the normal
distribution)
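A minimal z-score sketch for the temperatures above, using only NumPy and the population standard deviation; small rounding differences from the slide are possible, but 24.0 clearly stands apart from the rest.

```python
# Sketch: flagging values far from the mean via z-scores.
import numpy as np

temps = np.array([24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4])
z = (temps - temps.mean()) / temps.std()     # population standard deviation (ddof=0)
print(np.round(z, 2))                        # 24.0 has |z| close to 3; all others are below 0.6
print(temps[np.abs(z) > 2.5])                # -> [24.]; the threshold is application specific
```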

10

BITS Pilani, WILP


Parametric Statistical Approaches
Univariate Normal Distribution - Box Plots

Outlier detection using IQR mechanism was reviewed in


the Data Exploration and Description Module.

11

BITS Pilani, WILP


Parametric Statistical Approaches
Univariate Normal Distribution - Grubb’s Test

Grubb’s Test (also known as the Maximum Normed Residual Test) calculates, for each object xi in the data set (with mean μ and standard deviation σ), a value gi as:

gi = | xi − μ | / σ

This object is an outlier if:

gi >= ( (N − 1) / sqrt(N) ) · sqrt( tα² / (N − 2 + tα²) )

where,
N = count of objects in the dataset
α = significance level
tα = the value taken by the t-distribution at a level of (α/2N) with (N − 2) degrees of freedom
BITS Pilani, WILP
Example
Grubb’s Test
Ten sample temperatures values are given as: 24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4 in oC.
 Mean (μ) = 28.61 oC
 Standard Deviation (σ) = 1.51
 N = 10
 α = 0.05 t-distribution Table
 gi for 24.0 = |24.0 − 28.61| / 1.51 = 3.05
 tα = 3.833 (the value of the t-distribution at 0.05/(2×10) = 0.0025 with 10 − 2 = 8 degrees of freedom)
 Since gi is greater than the critical value calculated below, 24.0 is an outlier.

( (N − 1)/sqrt(N) ) · sqrt( tα² / (N − 2 + tα²) ) = (9/sqrt(10)) · sqrt( 3.833² / (10 − 2 + 3.833²) )
= (9/3.16) · sqrt(14.70/22.70) ≈ 2.29
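A minimal sketch reproducing these numbers with SciPy's t-distribution quantile function (assuming SciPy is available); minor rounding differences from the slide are possible.

```python
# Sketch: Grubbs' test critical value via scipy.stats.t.ppf.
import numpy as np
from scipy.stats import t

x = np.array([24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4])
N, alpha = len(x), 0.05
g = np.abs(x - x.mean()) / x.std()                        # g_i for every point
t_a = t.ppf(1 - alpha / (2 * N), df=N - 2)                # ~3.833
g_crit = (N - 1) / np.sqrt(N) * np.sqrt(t_a**2 / (N - 2 + t_a**2))   # ~2.29
print(np.round(g, 2), round(g_crit, 2))
print(x[g > g_crit])                                      # 24.0 is flagged as an outlier
```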

BITS Pilani, WILP


Multivariate Normal Distribution
Probability Distribution

 x1 and x2 are the variables in the bivariate normal distribution.


 How this probability distribution is identified?
 How outliers can be identified in multivariate normal distribution in
14
general?
BITS Pilani, WILP
Outlier in Multivariate Normal
Distribution using Mahalanobis Distance
 For the univariate dataset, the outlier detection approach is probability density function
drawn from μ and σ assuming the points are in normal distribution.
 The question is how to adopt a similar approach for multivariate normal distribution. The
answer is to take the similar approach and thus the covariance comes into picture.
 When attributes are in multivariate normal distribution, the concept of Mahalanobis
Distance comes into picture. It uses the covariance in calculating the distance.
 It is formalized by P.C. Mahalanobis, the famous Indian statistician who is remembered as
the founder of the Indian Statistical Institute, Kolkata and a member of the first planning
commission of India.

Mahalanobis(X,X)=[X-X].S -1 .[X-X] T
where,X is the mean of X

S -1is the inverse ofcovariance matrix


P.C. Mahalanobis
T 1893-1972
[X-X] is the transpose of matrix[X-X]

Note: X is a vector of coordinates for a point 15

BITS Pilani, WILP


Covariance Matrix & Inverse
If there are two attributes (X, Y) then the covariance matrix is defined as:

S = | SXX  SXY |
    | SYX  SYY |

The inverse of a 2x2 matrix is calculated as shown below. The inverse of the above covariance matrix (S⁻¹) can be calculated similarly:

| a  b |⁻¹  =  1/(ad − bc) · |  d  −b |
| c  d |                     | −c   a |

Example Link: Inverse of a 3x3 matrix.
BITS Pilani, WILP
Example
Mahalanobis Distance

Given 15 points (A to O) an outlier needs to be found out using


Mahalanobis Distance.
# X Y (X-X') (Y-Y') Mahalanobis Dist
A 2 2 -2.7 -2.6 4.00
B 2 5 -2.7 0.4 2.06
C 6 5 1.3 0.4 0.44
D 7 3 2.3 -1.6 3.32
E 4 7 -0.7 2.4 3.30
F 6 4 1.3 -0.6 0.77
G 5 3 0.3 -1.6 1.41
H 4 6 -0.7 1.4 1.27
I 2 5 -2.7 0.4 2.06
J 1 3 -3.7 -1.6 3.66
K 6 5 1.3 0.4 0.44
L 7 4 2.3 -0.6 1.80
M 8 7 3.3 2.4 4.31
N 5 6 0.3 1.4 0.94
O 5 4 0.3 -0.6 0.24
Mean (X', Y') 4.7 4.6

Inverse Covariance Matrix


0.25 -0.09
-0.09 0.51

In this example, Mahalanobis distance of > 4.0 is considered to declare a point an outlier.
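A minimal NumPy sketch reproducing the tabulated distances; the slide's inverse covariance matrix corresponds to the population covariance (ddof = 0).

```python
# Sketch: squared Mahalanobis distances for the 15 points A-O.
import numpy as np

X = np.array([[2, 2], [2, 5], [6, 5], [7, 3], [4, 7], [6, 4], [5, 3], [4, 6],
              [2, 5], [1, 3], [6, 5], [7, 4], [8, 7], [5, 6], [5, 4]], dtype=float)
mean = X.mean(axis=0)
S_inv = np.linalg.inv(np.cov(X, rowvar=False, ddof=0))   # population covariance, then invert
diff = X - mean
md = np.einsum('ij,jk,ik->i', diff, S_inv, diff)         # (x - mean) S^-1 (x - mean)^T per row
print(np.round(md, 2))                                   # A and M exceed 4.0, as in the table
```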

BITS Pilani, WILP


Outlier in Multivariate Normal
Distribution using Chi-Square Test
If the χ2 is large the object can be considered as an outlier.

n 2
(o - E )
χ i 2 = i i

i=1 Ei
where,
oi =the value of object o in the i th dimesnsion
Ei = mean of the i th dimension of all objects
n = dimensionality

18

BITS Pilani, WILP


Outliers in the Mixture of Parametric
Distribution
 In few situations, the assumption that the dataset is generated by a single normal distribution is an over
simplification.
 To overcome this problem, it is assumed that the data is generated by multiple normal distributions. For
example, if a data set is generated by two normal distributions ϴ1 (μ1, σ1) and ϴ2 (μ2, σ2), the probability
that a data point is generated by the mixture of these two is given by: f(ϴ1) + f(ϴ2), where f() is the
normal probability distribution function. That means if the point does not belong to any cluster (e.g. C1
or C2), it is an outlier. Data points in the small cluster C3 will also be identified as outliers.

1 1
f ( )  exp( .( x   )2 /  2 )
2 . 2 2 19

BITS Pilani, WILP


Non-parametric Methods
Outlier Detection using Histograms

 The model of normal data is learned from


the input data without any a priori
structure.
 Often makes fewer assumptions about the
data, and thus can be applicable in more
scenarios.
 Figure shows the histogram of purchase amounts in transactions. A
transaction of the amount of $7,500 is a potential outlier, since only
0.2% transactions have an amount higher than $5,000.
 Problem: it is hard to choose an appropriate bin size for the histogram. If the bin size is too small, normal objects fall into empty or rare bins and are flagged as outliers (false positives). If the bin size is too big, outliers hide inside frequent bins (false negatives). Advanced methods like kernel density estimation are used to overcome some of these issues.
BITS Pilani, WILP
Proximity Based Approach
Distance Based (DB) Outlier Detection Global Outliers

 For a dataset D having n objects a user can specify the distance


threshold r to define a reasonable neighborhood of an object.
 For an object (p), the other objects (q) in the r-neighborhood can
be examined and the following formulation can be used to
declare an object an outlier for a fraction π (0 < π <= 1)

An object p is a DB(r, π) outlier if:

( count of objects q in the r-neighborhood of p, with dist(p, q) <= r ) / n  <=  π

 That means, if count of objects in the r-neighborhood for an


object > ⌈π.n⌉, the object is not an outlier. (Note the ceiling operator).
 The computational complexity of the algorithm is O(n2) but for
many practical cases it comes out to be linear. Why?
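A minimal sketch of this check, using only NumPy and following the ⌈π·n⌉ rule above (Manhattan/L1 distance by default, matching the exercise on the next slide); the function and its names are illustrative, not from the slides.

```python
# Sketch: DB(r, pi) outlier detection.
import math
import numpy as np

def db_outliers(X, r, pi, metric=lambda a, b: np.abs(a - b).sum()):
    X = np.asarray(X, dtype=float)
    n = len(X)
    threshold = math.ceil(pi * n)                 # ceil(pi * n)
    flags = []
    for i, p in enumerate(X):
        neighbours = sum(1 for j, q in enumerate(X) if j != i and metric(p, q) <= r)
        flags.append(neighbours <= threshold)     # few neighbours -> outlier
    return flags
```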
BITS Pilani, WILP
Example
DB (r, π) Outlier Detection

For the shown dataset, find out the outliers if r = 2 units, π = 1/3 and the L1 norm is used as the distance measure.

⌈π·n⌉ = ⌈(1/3) × 12⌉ = 4

[Scatter plot of the 12 points A-L.]

Point   r-Neighborhood        Outlier
A       B, C, E               Yes
B       A, C, D, E, F         No
C       A, B, D, E, F         No
D       B, C, E, F, G, K      No
E       A, B, C, D, F, L      No
F       B, C, D, E, K         No
G       D, H, I               Yes
H       G, I, J               Yes
I       G, H, J               Yes
J       H, I                  Yes
K       D, F, L               Yes
L       E, K                  Yes

BITS Pilani, WILP


Density Based Outlier Detection
For Local Proximity Based Outliers

 DBSCAN and DB (r, π) identify outliers with a global view of the data set, the Global Outliers.
 In practice, datasets could demonstrate a more complex structure where objects may be considered
outliers with respect to their local neighbourhood. (a data set with different densities).
 In the shown data distribution, there are two clusters C1 and C2.
 Object O3 can be declared as distance based outlier because it is far from the majority of the
objects.
 What about objects O1 and O2?
 The distance of O1 and O2 from the objects of cluster C1 is
smaller than the average distance of an object from its
nearest neighbour in the cluster C2.
 O1 and O2 are not distance based outliers. But they are
outliers with respect to the cluster C1 because they
deviate significantly from other objects of C1.
 Similarly, the distance between O4 and its nearest neighbour in C2 is higher than the distance between O1 or O2 and their nearest neighbours in C1; still, O4 may not be an outlier because C2 is sparse. Distance based detection does not capture such local outliers, so a different approach is needed.
BITS Pilani, WILP
k-distance and its Neighborhood
Local Proximity Based Outliers

 To identify local outliers, there is a need to establish few news measures. The k-distance and Nk(x) are
first few of them.
 The k-distance of an object x in the dataset D denoted by distk(x) is defined as the distance dist(x, p)
between x and p where p is also ∈ D, such that:
 There are at least k objects y ∈ (D - x), such that dist (x, y) <= dist (x, p); excluding same distance points
 There are at most (k-1) objects z ∈ (D - x), such that dist (x, z) < dist (x, p); excluding same distance points
 In the other words, distk(x) is the distance between x and its k-nearest neighbor. It can be understood
from the following examples.
 Nk(x) is the set of all such points that lie within distk(x) of x (its k-neighborhood). There can be more than k points in Nk(x) because multiple points can be at the same distance from x.

[Figure: three k = 3 examples. In each, dist3(x) = dist(x, p): at least 3 objects y have dist(x, y) <= dist(x, p) and at most 2 objects z have dist(x, z) < dist(x, p).
Left: exactly 3 such points, so |N3(x)| = 3.
Middle: one extra point at the same distance as p, so |N3(x)| = 4.
Right: all six points are equidistant from x, so |N3(x)| = 6.]

BITS Pilani, WILP


Reachability Distance
Local Proximity Based Outliers

 The next measure is Reachability Distance for the point


y, from point x. It is denoted by reachdistk(y←x). It is
defined as following:
reachdistk(y ← x) = max { distk(x), dist(x, y) }

[Figure, k = 3: the red points show the k-neighborhood of x and dist3(x) = dist(x, p).]
reachdist3(y1 ← x) = dist3(x) = dist(x, p)   (y1 lies inside the k-neighborhood)
reachdist3(y2 ← x) = dist(x, y2)             (y2 lies outside it)

25

BITS Pilani, WILP


Density Based Local Outliers
Other measures are following:
 Local Reachability Density lrd (x): Density of object x for its k nearest neighbours.
It is defined as the reciprocal of the average reachability distance of k nearest
neighbours from x. It can be written as follows:

lrdk(x) = |Nk(x)| / ∑ y ∈ Nk(x) reachdistk(y ← x)

 Local Outlier Factor LOFk(x) or outlier score: It can be formulated as follows:


y N k (x)
lrd k (y) / lrd k (x)
LOFk (x)=
|N k (x)|

BITS Pilani, WILP


Example
Density Based Local Outliers (k = 3)

Objects and coordinates:
  A (1.00, 2.00)  B (2.00, 1.50)  C (1.00, 1.50)  D (2.00, 2.75)  E (7.00, 2.25)
  F (7.00, 2.50)  G (7.00, 2.00)  H (7.50, 2.25)  I (6.00, 2.50)

Pairwise L1 distances:
  A-B 1.50  A-C 0.50  A-D 1.75  B-C 1.00  B-D 1.25  C-D 2.25
  E-F 0.25  E-G 0.25  E-H 0.50  E-I 1.25  F-G 0.50  F-H 0.75
  F-I 1.00  G-H 0.75  G-I 1.50  H-I 1.75

Object  k=3 nearest neighbours  reachdistk (from x)   lrd3(x)  LOF3(x) (outlier score)
A       B, C, D                 1.75, 1.75, 1.75      0.57     0.91
B       A, C, D                 1.50, 1.50, 1.50      0.67     0.72
C       A, B, D                 2.25, 2.25, 2.25      0.44     1.27
D       A, B, C                 2.25, 2.25, 2.25      0.44     1.27
E       F, G, H                 0.50, 0.50, 0.50      2.00     0.67
F       E, G, H                 0.75, 0.75, 0.75      1.33     1.17
G       E, F, H                 0.75, 0.75, 0.75      1.33     1.17
H       E, F, G                 0.75, 0.75, 0.75      1.33     1.17
I       E, F, G                 1.50, 1.50, 1.50      0.67     2.32

 LOF can be used to declare local density outliers


w.r.t. their neighbourhood.
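A minimal sketch running LOF on these nine points via scikit-learn (assuming scikit-learn is available). sklearn uses the standard reachability-distance convention, so the scores can differ slightly from the table above, but the isolated point I still receives the largest LOF.

```python
# Sketch: Local Outlier Factor via scikit-learn (k = 3, L1 distance).
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

pts = np.array([[1.0, 2.0], [2.0, 1.5], [1.0, 1.5], [2.0, 2.75],                 # A-D
                [7.0, 2.25], [7.0, 2.5], [7.0, 2.0], [7.5, 2.25], [6.0, 2.5]])   # E-I
lof = LocalOutlierFactor(n_neighbors=3, metric='manhattan')
lof.fit(pts)
print(np.round(-lof.negative_outlier_factor_, 2))   # LOF per point; higher = more outlying
```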

BITS Pilani, WILP


Thank You

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Appendix

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Probability Density Function
Multivariate Normal Distribution

If S is the covariance matrix for the multivariate data (m-dimensions).


Then the probability density function for a data point x is given by:
P(x) = 1 / ( (2π)^(m/2) · |S|^(1/2) ) · exp( −½ · (x − X̄) · S⁻¹ · (x − X̄)ᵀ )

Note that the exponent is a factor of the Mahalanobis Distance.

If the natural log is taken of this probability, the value comes out proportional to the magnitude of the Mahalanobis distance (because ln(e^(−x)) = −x).

So in one way, it is sufficient to use the Mahalanobis distance to


find out the outliers instead of calculating the actual probability.
BITS Pilani, WILP
DSECL ZC415
Data Mining
Data Mining on Unstructured Data
Revision 1.0

BITS Pilani Prof Vineet Garg


Work Integrated Learning Programmes Bangalore Professional Development Center
Text Mining

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Introduction
 Most previous studies of data mining have focused on structured data, such
as relational, transactional, and data warehouse data.
 However, in reality, a substantial portion of the available information is also
stored in text databases (or document databases), which consist of large
collections of documents from various sources, such as news articles,
research papers, books, digital libraries, e-mail messages, and Web pages.
 Data stored in most text databases are semi-structured data. For example, a
document may contain a few structured fields, such as title, authors,
publication date, category, and so on, but also contain some largely
unstructured text components such as abstract and contents.
 Without knowing what could be in the documents, it is difficult to formulate
effective queries for analyzing and extracting useful information from the
data.
 Text Mining deals with the different approaches to compare different
documents, rank the importance and relevance of the documents, or find
3
patterns and trends across multiple documents.
BITS Pilani, WILP
Text Mining: Coverage

This module covers the following in the area of Text


Mining:
 Text / Information Retrieval and metrics to measure the quality of retrieval
 Document Selection and Text Retrieval methods
o Boolean Retrieval
o Text Indexing
 Document Ranking
 Pre-processing Steps: Tokenization, Stemming, Stop-List – lemmatization
 Vector Space Model based on Cornell SMART Systems
 Text Mining Approaches (basic introduction):
o Keyword-Based Association Analysis
o Document Classification Analysis
o Document Clustering Analysis
4

BITS Pilani, WILP


Text Data Analysis and Information
(Text) Retrieval
 Unlike the field of database systems, which has focused on query and transaction processing of structured data, Information Retrieval (IR) is concerned with the organization and retrieval of information from a large number of text-based documents. Examples: online library catalogue systems and Web search engines.
 A typical information retrieval problem is to locate relevant documents in a document collection based
on a user’s query.
 There are two basic measures for assessing the quality of retrieval:
o Precision: This is the percentage of retrieved documents that are in fact relevant to the query (i.e.
correct responses).
o Recall: This is the percentage of documents that are relevant to the query and were, in fact,
retrieved.
 To trade-off recall for precision or vice versa, F-score measure, which is defined as the harmonic mean
of recall and precision, can be used.

precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}|

recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|

[Venn diagram over all documents: Relevant, Retrieved, and their intersection (Relevant & Retrieved).]
BITS Pilani, WILP


Text Retrieval Methods
Retrieval methods fall into two broad categories:
i. Document Selection:
 The query is regarded as the specifying constraints for selecting the relevant
documents.
 Boolean Retrieval Model is a typical model in which a document is represented
by a set of keywords and a user provides a boolean expression of keywords.
 Only works well when the user knows a lot about the document collection and
can formulate a good query.
 Example: “car and repair shops,” “tea or coffee”, “database systems but not
Oracle” etc.
ii. Document Ranking:
 Most modern information retrieval systems present a ranked list of documents
in response to a user’s keyword query.
 The goal is to approximate the degree of relevance based on some measures.

Vector Space Model that will be discussed in detail in this module can be used for text retrieval.
It is also called term-frequency model.
BITS Pilani, WILP
Vector Space Model
Basic Idea

 A document and a query both are represented as vectors in a


high-dimensional space corresponding to all the keywords.
 Using an appropriate similarity measure, the similarity between
the query vector and the document vector is calculated.
 The similarity values can then be used for ranking the documents.
 Tokenization: Identification of relevant keywords during pre-
processing. The procedure also maintains a stop-list of words
which are deemed irrelevant – a, the, for, with, of etc.
 Word Stem: Words with small syntactic variations – drug,
drugged, drugs are all considered occurrences of the same word.

BITS Pilani, WILP


Vector Space Model
Formulation

 With the set of documents d and with a set of t terms, each document can be considered as a vector v
in a t dimensional space Rt.
 Term frequency is the number of occurrences of term t in the document d. It is denoted by freq(d, t).
 Term Frequency Matrix TF(d, t) elements are defined as 0 if the document does not contain the term,
and nonzero otherwise. There are several ways to define or weight the elements of TF (d, t). The Cornell
SMART system uses the following formula to compute the term frequency:

TF(d, t) = 0, if freq(d, t) = 0
TF(d, t) = 1 + log(1 + log(freq(d, t))), otherwise

 If a term t occurs in many documents, its importance will be scaled down due to its reduced discriminative power. So there is another important measure, called Inverse Document Frequency (IDF):

IDF(t) = log( (1 + |d|) / |dt| )

where |d| is the total number of documents and dt is the set of documents that contain the term t.
(The recorded lecture uses slightly different formulae. Both versions are accepted.)
 In a complete Vector Space Model, TF-IDF Measure is defined as: TF(d, t) x IDF(t)
 All logs are in base-10.
BITS Pilani, WILP
Example
Vector Space Model

Document/Term t1 t2 t3 t4 t5 t6 t7
d1 0 4 10 8 0 5 0
d2 5 19 7 16 0 0 32
d3 15 0 0 4 9 0 17
d4 22 3 12 0 5 15 0
d5 0 7 0 9 2 4 12
 The table shows a term frequency matrix TF(d, t) where each row represents a
document vector, each column represents a term and each entry registers freq(di, tj).
 For the 6th term t6, in the 4th document (d4):
TF (d4, t6) = 1+log(1+log(15)) = 1.34
IDF (t6) = log ((1+5)/3) = 0.30
So, TF-IDF (d4, t6) = 1.34 x 0.30 = 0.40
 TF-IDF is a numerical statistic that is intended to reflect how important a term is to a
document in a collection of documents.
 If a user is interested in the documents that contain a specific term, then TF-IDF can be
used as ranking measure and documents can be listed in the decreasing order of TF-IDF.
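A minimal sketch of the Cornell SMART weighting from this slide, verifying TF-IDF(d4, t6) ≈ 0.40 with NumPy and base-10 logs.

```python
# Sketch: TF, IDF and TF-IDF for the 5x7 term-frequency matrix above.
import numpy as np

freq = np.array([[ 0,  4, 10,  8, 0,  5,  0],
                 [ 5, 19,  7, 16, 0,  0, 32],
                 [15,  0,  0,  4, 9,  0, 17],
                 [22,  3, 12,  0, 5, 15,  0],
                 [ 0,  7,  0,  9, 2,  4, 12]], dtype=float)

safe = np.where(freq > 0, freq, 1)                              # avoid log10(0)
TF = np.where(freq > 0, 1 + np.log10(1 + np.log10(safe)), 0.0)  # Cornell SMART TF
IDF = np.log10((1 + freq.shape[0]) / (freq > 0).sum(axis=0))    # log((1 + |d|) / |d_t|)
TFIDF = TF * IDF
print(round(TF[3, 5], 2), round(IDF[5], 2), round(TFIDF[3, 5], 2))   # ~1.34 0.30 0.40
```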

BITS Pilani, WILP


Document Similarity

Cosine Similarity reviewed in the Data Exploration


module can be used to find out how similar or
dissimilar two documents are.

10

BITS Pilani, WILP


Text Indexing Techniques

 Text indexing techniques are used for text retrieval from unstructured text.
 One such technique is the inverted index, an index structure that maintains two hash-indexed or B+-tree-indexed tables:
o Document Table: a set of document records, each containing two fields: doc_id and posting list, where the posting list is a list of terms (or pointers to terms) that occur in the document, sorted according to some relevance measure.
o Term Table: a set of term records, each containing two fields: term_id and posting list, where the posting list specifies a list of document identifiers in which the term appears.
 It facilitates queries like: "Find all of the documents associated with a given set of terms" or "Find all of the terms associated with a given set of documents" (a minimal sketch of this structure follows this list).
 A signature file for a document stores the term-related information created after the pre-processing steps: tokenization, stemming and applying the stop-list. Since there is limited space to store a signature file with each document, there are techniques to encode and compress the signature file.
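A minimal Python sketch of an inverted index (illustrative only; plain dictionaries stand in for the hash-indexed document and term tables, and the documents are made up):

from collections import defaultdict

def build_inverted_index(docs):
    """docs: mapping doc_id -> list of (already tokenized and stemmed) terms."""
    document_table = {}             # doc_id -> posting list of terms
    term_table = defaultdict(set)   # term   -> posting list of doc_ids
    for doc_id, terms in docs.items():
        document_table[doc_id] = sorted(set(terms))
        for term in terms:
            term_table[term].add(doc_id)
    return document_table, term_table

docs = {
    "d1": ["drug", "therapy", "cancer"],
    "d2": ["drug", "market", "loan"],
    "d3": ["loan", "labour", "market"],
}
doc_table, term_table = build_inverted_index(docs)
# Documents associated with a given set of terms:
print(term_table["drug"] & term_table["market"])     # -> {'d2'}
# Terms associated with a given set of documents:
print(set(doc_table["d1"]) | set(doc_table["d3"]))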

BITS Pilani, WILP


Keyword-Based Association Analysis

Various text mining tasks can be performed on the extracted keywords, tags, or semantic information. Keyword-based Association Analysis collects sets of keywords or terms that occur frequently together and then finds the association or correlation relationships among them. The process involves the following steps (a small sketch follows this list):
o Preprocess the text data by parsing, stemming, removing stop words, etc.
o Invoke association mining algorithms:
• Consider each document as a transaction.
• View the set of keywords in the document as the set of items in the transaction.
o Applications:
• Find associations between pairs of keywords or terms from a given set of keywords or phrases, or find the maximal set of terms occurring together.
• E.g., whenever the term Switzerland appears, the term Mt. Titlis often appears as well.
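As a sketch of the document-as-transaction idea (the support threshold and the toy transactions are assumptions for illustration), frequently co-occurring keyword pairs can be counted directly:

from itertools import combinations
from collections import Counter

# Each document is treated as a transaction of keywords
transactions = [
    {"switzerland", "mt. titlis", "travel"},
    {"switzerland", "mt. titlis", "snow"},
    {"switzerland", "zurich", "bank"},
    {"travel", "snow", "skiing"},
]

min_support = 2  # assumed threshold for this toy example
pair_counts = Counter()
for keywords in transactions:
    for pair in combinations(sorted(keywords), 2):
        pair_counts[pair] += 1

frequent_pairs = {p: c for p, c in pair_counts.items() if c >= min_support}
print(frequent_pairs)   # e.g. {('mt. titlis', 'switzerland'): 2}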
BITS Pilani, WILP
Document Classification Analysis

Motivation
o Automatic classification for the large number of on-line text
documents (Web pages, e-mails, online library, etc.)
Classification Process
o Data preprocessing
o Definition of training set and test sets
o Creation of the classification model using the selected classification
algorithm
o Classification model validation
o Classification of new/unknown text documents (a minimal sketch of this pipeline follows)
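A minimal sketch of this process using scikit-learn (assuming the library is available; the documents, labels and split parameters are made up for illustration):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

docs = ["loan market labour", "drug therapy clinical", "tournament medals teams",
        "employment jobs labour", "pharmacy nurses doctors", "baseball teams schools"]
labels = ["economy", "health", "sports", "economy", "health", "sports"]

# Data preprocessing: convert raw text to TF-IDF feature vectors
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# Definition of training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.33, random_state=0)

# Creation and validation of the classification model
model = MultinomialNB().fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))

# Classification of a new/unknown text document
print(model.predict(vectorizer.transform(["clinical nurses"])))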

13

BITS Pilani, WILP


Document Clustering Analysis

Motivation
o Automatically group related documents based on their contents.
o No predetermined training sets or taxonomies – unsupervised.
Clustering Process
o Data preprocessing: remove stop words, stem, feature extraction
etc.
o Even after pre-processing, the curse of dimensionality remains intimidating. So dimensionality reduction is applied first, followed by traditional clustering techniques, or by spectral clustering, mixture-model clustering, clustering using Latent Semantic Indexing, or clustering using Locality Preserving Indexing (a small sketch follows this list). Several of these areas are covered under Natural Language Processing (NLP).
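A minimal sketch of this idea with scikit-learn (assumed available; the documents are made up): TF-IDF features are reduced with truncated SVD, an LSI-style reduction, and then clustered with k-means.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

docs = ["loan market labour jobs", "drug therapy clinical nurses",
        "tournament medals teams olympics", "employment jobs labour loan",
        "pharmacy nurses doctors medicine", "baseball teams schools medals"]

# Pre-processing and feature extraction
X = TfidfVectorizer(stop_words="english").fit_transform(docs)

# Dimensionality reduction (LSI-style), then a traditional clustering technique
X_reduced = TruncatedSVD(n_components=3, random_state=0).fit_transform(X)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_reduced)
print(labels)   # e.g. groups the economy, healthcare and sports documents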

BITS Pilani, WILP


References

Chapter 10 of J. Han and M. Kamber, Data Mining: Concepts and Techniques, 2nd Edition.

15

BITS Pilani, WILP


Search Engines and PageRank

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Introduction

 The term PageRank comes from Larry Page, who along with Sergey Brin co-founded Google and developed the idea together with Prof. Rajeev Motwani (IIT-K and UC Berkeley alumnus, and faculty at Stanford).
 PageRank is a function that assigns a real number to each page in the Web (or the portion of the Web that has been crawled and whose links have been discovered).
 The intent is that the higher the PageRank of a page, the more important it is.
 The value of PageRank also depends on the hyperlinks on other pages that refer to it (the in-links of the page).
[Photos: Rajeev Motwani (1962-2009), Larry Page, Sergey Brin]
17

BITS Pilani, WILP


Elaboration
What is PageRank actually? How do in-links play a role in deciding it?
 Pages that attract a large number of surfers are considered more important than pages that are rarely visited.
 PageRank simulates the importance of a page using the following criterion – a page will have a higher PageRank if web surfers tend to congregate on it when they repeatedly follow randomly chosen out-links from wherever they are currently located.
 This criterion implies that it is not only the page where surfers congregate that matters, but also the neighbourhood of that page, which eventually leads the surfers to land on it.
 So essentially, if a Web page is pointed to or referred to by several pages having out-links to it, there is a higher chance that surfers will eventually congregate on this page.
 Could a crook create a page, make several fake pages point to it, and thus increase the PageRank of his Web page in this approach?
 The answer is NO. A crook can create several fake pages and make them point to his page, but other independent, genuine pages will not have links to his fake pages.
So, from the PageRank perspective, what does the Web look like?

BITS Pilani, WILP


PageRank Score
[Figure: example Web graph with PageRank scores – A: 3.3, B: 38.4, C: 34.3, D: 3.9, E: 8.1, F: 3.9, and five further small pages with a PageRank of 1.6 each.]
 For depiction, the bigger the circle, the larger the PageRank.
 Page B is pointed to by several pages and therefore has the highest PageRank.
 Page C is not pointed to by several pages, but it is pointed to by an important page (B) having a high PageRank. Therefore it has a higher PageRank than several others.
 The same logic applies elsewhere.

BITS Pilani, WILP


PageRank: Simple Formulation Example

[Figure: page i has 3 out-links, one of which points to m (contributing ri/3); page k has 4 out-links, one of which points to m (contributing rk/4); page m itself has 3 out-links, each carrying rm/3.]
 Users of the Web “vote” through their web pages. On their pages, they keep links to those pages which they think are good.
 Let the PageRank of page m be rm.
 If page m has n out-links, each link will get rm/n votes.
 The PageRank of page m itself will be the sum of the votes on its in-links.
 So, rm = ri/3 + rk/4.
 In the shown figure, the out-links of page m each carry rm/3 votes because there are only 3 out-links from page m.
20

BITS Pilani, WILP


Transition Matrix
 In a tiny version of the Web, let us say there are only 4 Web pages – A, B, C and D.
[Figure: Web graph over four pages – A links to B, C and D; B links to A and D; C links to A; D links to B and C.]
 A random surfer at A can go to B, C or D with equal probability 1/3, and with probability 0 to A because there is no self loop.
 Similarly, a random surfer at B can go to A or D with probability 1/2.
 In general, what happens to the random surfers after one step can be captured in the transition matrix M.
 Each element mpq of M has the value 1/x if page q has x out-links and one of those out-links is to page p; otherwise mpq = 0.

         A    B    C    D
   A  [  0   1/2   1    0  ]
   B  [ 1/3   0    0   1/2 ]
   C  [ 1/3   0    0   1/2 ]
   D  [ 1/3  1/2   0    0  ]
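A small sketch (not from the slides) that builds this transition matrix from the out-link structure of the four-page example; the link lists are inferred from the matrix above.

from fractions import Fraction

# Out-links inferred from the transition matrix above
out_links = {
    "A": ["B", "C", "D"],
    "B": ["A", "D"],
    "C": ["A"],
    "D": ["B", "C"],
}
pages = ["A", "B", "C", "D"]

# M[p][q] = 1/x if page q has x out-links and one of them points to p, else 0
M = [[Fraction(1, len(out_links[q])) if p in out_links[q] else Fraction(0)
      for q in pages] for p in pages]

for p, row in zip(pages, M):
    print(p, [str(x) for x in row])
# A ['0', '1/2', '1', '0'], B ['1/3', '0', '0', '1/2'], ...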

BITS Pilani, WILP


Page Rank Vector Formulation
 In the shown four-page Web graph (pages A, B, C, D with the same transition matrix M as before), let us assume a random surfer can start surfing from any page with equal probability. So the initial vector is v0 = [1/4 1/4 1/4 1/4]T.
 The probability that the surfer, having started from a randomly chosen page, lands at page A after one step is (1/4 x 0) + (1/4 x 1/2) + (1/4 x 1) + (1/4 x 0) = 9/24.
 Similarly, the probabilities of landing at page B, C or D can be calculated. The updated vector after iteration-1 becomes v1 = [9/24 5/24 5/24 5/24]T.
 Using v1, the probability of landing at page A after the next step is (9/24 x 0) + (5/24 x 1/2) + (5/24 x 1) + (5/24 x 0) = 15/48.
 Following the same procedure, v2 can be calculated as v2 = [15/48 11/48 11/48 11/48]T.
 We have reviewed that a page has a higher PageRank if surfers starting from randomly selected pages tend to congregate at that specific page.
 So, if the vector v is calculated over multiple iterations and it stabilizes without significant change over the next iteration, the value of v at that stage is called the PageRank vector, representing the PageRank of each page.
BITS Pilani, WILP
PageRank Vector
Using Matrix Multiplication Method

 Let M be the transition matrix for the given graph and vt be the PageRank vector at iteration t.
 Then the PageRank vector at iteration t+1 is given by: vt+1 = M.vt
 Two matrices A and B can be multiplied if the number of columns in A equals the number of rows in B. Therefore:
M . v0 = v1:
  [  0   1/2   1    0  ]   [ 1/4 ]   [ 9/24 ]
  [ 1/3   0    0   1/2 ]   [ 1/4 ]   [ 5/24 ]
  [ 1/3   0    0   1/2 ] x [ 1/4 ] = [ 5/24 ]
  [ 1/3  1/2   0    0  ]   [ 1/4 ]   [ 5/24 ]

M . v1 = v2:
  [  0   1/2   1    0  ]   [ 9/24 ]   [ 15/48 ]
  [ 1/3   0    0   1/2 ]   [ 5/24 ]   [ 11/48 ]
  [ 1/3   0    0   1/2 ] x [ 5/24 ] = [ 11/48 ]
  [ 1/3  1/2   0    0  ]   [ 5/24 ]   [ 11/48 ]

and so on...
BITS Pilani, WILP
Exercise

Find out the transition matrix and the stabilized PageRank vector for the shown Web graph.

[Figure: Web graph over the three pages X1, X2 and X3.]

Answer:
              X1   X2   X3
       X1  [ 1/2  1/2   0 ]
  M =  X2  [ 1/2   0    1 ]
       X3  [  0   1/2   0 ]

  v1 = [ 1/3  1/2  1/6 ]T   - - - - - -   vn = [ 6/15  6/15  3/15 ]T
24

BITS Pilani, WILP


References

Chapter 5 (Link Analysis) of J. Leskovec, A. Rajaraman, J. Ullman, Mining of Massive Datasets, Cambridge University Press.

25

BITS Pilani, WILP


Thank You

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


DSECL ZC415
Data Mining
Data Mining Applications
Revision 1.0

BITS Pilani Prof Vineet Garg


Work Integrated Learning Programmes Bangalore Professional Development Center
Recommendation Systems

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Introduction

 In a very general way, Recommendation (or Recommender) Systems are algorithms aimed at suggesting relevant items to users. Examples: movies to watch, text to read, products to buy, or anything else depending on the industry.
 Recommendation systems use a number of different technologies. We can classify
these systems into two broad groups:
 Content Based Systems: examine properties of the items recommended. For instance, if a Netflix user has watched many Salman Khan movies, then recommend a movie classified in the database as having Salman Khan in the cast.
 Collaborative Filtering: systems recommend items based on similarity measures
between users and/or items. The items recommended to a user are those preferred
by similar users. This sort of recommendation system can use similarity measures
and clustering. However, these technologies by themselves are not sufficient, and
there are some new algorithms that have proven effective for recommendation
systems.
BITS Pilani, WILP
Applications

 Product Recommendations: The most important use of recommendation systems is at on-line retailers. Amazon or
similar on-line vendors strive to present each returning user
with some suggestions of products that they might like to
buy. These suggestions are not random, but are based on
the purchasing decisions made by similar customers or on
other techniques.
 Movie Recommendations: Netflix offers its customers
recommendations of movies they might like. These
recommendations are based on ratings provided by users.
 News Articles: News services have attempted to identify
articles of interest to readers, based on the articles that
they have read in the past. E.g. Google News 4

BITS Pilani, WILP


Long Tail Phenomenon
Physical vs. Online Stores

 A physical bookstore may have several thousand books on its shelves, but Amazon offers millions of
books.
 Physical newspaper can print several dozen articles per day, while on-line news services offer thousands
per day.
 Recommendation in the physical world is fairly simple, since it is not possible to tailor the store to each individual customer.
 The distinction between the physical and on-line worlds has been called the long tail phenomenon, and
it is captured in figure below. The vertical axis represents popularity. The items are ordered on the
horizontal axis according to their popularity.
 Physical institutions provide only the most popular items to the left of the vertical line, while the
corresponding on-line institutions provide the entire range of items: the tail as well as the popular
items.
[Figure: the long-tail curve – popularity on the vertical axis, items ordered by popularity on the horizontal axis.]
5
BITS Pilani, WILP
Model for Recommendation
Systems: The Utility Matrix
 There are two classes of entities, referred to as users and items. Users have preferences for certain items
and these preferences must be extracted out of the data.
 The data itself is represented as a utility matrix, giving for each user-item pair, a value that represents what
is known about the degree of preference of that user for that item.
 Values of the utility matrix come from an ordered set (e.g. star ratings of 1 to 5). Non-available values are
left blank, so matrix is sparse.
 The example table shown below captures the ratings of users (A, B, C, D) to the Harry Potter (HP), Twilight
(TW) and Star Wars (SW) movies.
 The goal of a recommendation system is to predict the blanks in the utility matrix. For example, will user A
like SW2?
 There is little information from the matrix to predict whether user A would like SW2. So the
recommendation system can be designed taking into account the properties of movies, such as their
producer, director, stars, or even the similarity of their names.
 If SW1 and SW2 are similar, then it can be concluded that since A did not like SW1, A is unlikely to enjoy SW2 either. It is not necessary to predict every blank entry in a utility matrix; most of the time, the goal is to suggest a few items that the user would value highly.
       HP1  HP2  HP3   TW  SW1  SW2  SW3
  A     4               5    1
  B     5    5    4
  C                     2    4    5
  D               3                   3
6
BITS Pilani, WILP
Content Based Systems
Item Profiles

 Item Profiles: construct for each item a profile, which is a record or collection of records representing important characteristics of that item. Example: features of a movie – cast, director, black & white or coloured, genre (e.g. IMDB assigned), etc.
 Discovering Features of Documents: for documents, it is not immediately apparent what the values of the features should be.
o The words with the highest TF-IDF scores could characterize the document.
o To measure the similarity between two documents, the Jaccard Coefficient between their sets of characterizing words, or the Cosine Similarity between their word vectors, can be used.
 Obtaining Item Features From Tags: They are particularly useful
for pictures and web pages. Many applications invite users to
label or tag the pictures or write some description. Once they are
tagged, their features can be extracted and examined for profiles.
Example: Delicious Bookmarking.
7

BITS Pilani, WILP


Content Based Systems
Representing Item Profiles

There can be several approaches:


 For documents, a vector of 0s and 1s, where a 1 represents the occurrence of a high-TF-IDF word in the document.
 For movies, for each actor, with 1 if the actor is in the
movie, and 0 if not. Likewise, a component for each
possible director, and each possible genre etc.
 Average ratings for a movie: a real numbered value.
 A vector of item profiles can be created with the mix of
boolean or real numbered values.
 These vectors can be used for similarity calculations.
8

BITS Pilani, WILP


Content Based Systems
Representing User Profiles

Similar to Item Profiles, there can be several approaches:


 Let us say the value 1 in the utility matrix represents that a user has watched a movie, and 0 that the user has not. If 20% of the movies that user A likes have Salman Khan as one of the actors, then the user profile for A will have 0.2 in the component for Salman Khan.
 Now let us say the utility matrix has the ratings from 1 to 5 that users give to movies. User A has given an average rating of 3 to the movies that he has watched, and ratings of 3, 4 and 5 to the Salman Khan movies in particular. So in his profile, the average of 3-3, 4-3 and 5-3, that is 1, will be stored in the component for Salman Khan.
 Similarly, another user B has an average rating of 4 and has rated the Salman Khan movies 2, 3 and 5. So in his profile, the average of 2-4, 3-4 and 5-4, that is -2/3, will be stored in the component for Salman Khan.
9

BITS Pilani, WILP


Content Based Systems
Recommending Items to Users Based on Content

 With profile vectors for both users and items, we can estimate the degree to which a user would prefer an item by computing the cosine distance between the user’s and the item’s vectors.

10

BITS Pilani, WILP


Content Based Systems
Example

Three computers A, B and C and their numerical features are listed below:
Features/Computers           A      B      C
Processor Speed (GHz)      3.06   2.68   2.92
Disk Size (GB)              500    320    640
Main Memory (GB)              6      4      6
 Item Profile for computer A is the vector [3.06, 500, 6].
 User X has rated these computers with ratings A:4, B:2 and C:5 (Average = 11/3).
 User ratings can be normalized with the average (A:1/3, B:-5/3, C:4/3).
 User profile vector can be created as [0.45, 486.67, 3.33]:
o Processor Speed: (3.06 *1/3) + (2.68*-5/3) + (2.92*4/3) = 0.45
o Disk Size: (500*1/3) + (320*-5/3) + (640*4/3) = 486.67
o Main Memory = (6*1/3) + (4*-5/3) + (6*4/3) = 3.33
 This user profile can be used to recommend a new type of computer (say D) based on
the cosine similarity between computer D feature vector and the user profile vector.
 Scaling may be required, because in the present form the Disk Size will dominate.
Scaling factor 1 for Processor Speed, α for the Disk Size, and β for the Main Memory
can be taken with suitable values of α and β.
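A small Python sketch of this example (the hypothetical computer D and its features are assumptions for illustration; in practice the scaling factors 1, α, β mentioned above should be applied first):

import numpy as np

items = {"A": [3.06, 500, 6], "B": [2.68, 320, 4], "C": [2.92, 640, 6]}
ratings = {"A": 4, "B": 2, "C": 5}

# Normalize the ratings by the user's average rating, then form the profile
avg = sum(ratings.values()) / len(ratings)
user_profile = sum((r - avg) * np.array(items[i]) for i, r in ratings.items())
print(np.round(user_profile, 2))          # -> [  0.45 486.67   3.33]

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical new computer D; without scaling, Disk Size dominates the score
computer_d = np.array([3.2, 512, 8])
print(round(cosine(user_profile, computer_d), 3))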
BITS Pilani, WILP
Collaborative Filtering
Introduction and Similarity Measures

 The process of identifying similar users and recommending what similar users like is called
collaborative filtering.
 It is a different approach from Content Based Systems. Instead of using features of items to
determine their similarity, we focus on the similarity of the user ratings for the two items.
 The challenge is how to measure the similarity of users or items from their rows or columns in the utility matrix. It can be understood from the illustration below:

       HP1  HP2  HP3   TW  SW1  SW2  SW3
  A     4               5    1
  B     5    5    4
  C                     2    4    5
  D               3                   3

 A and C have two movies in common, but their liking is different.
 A and B have just one movie in common, but their liking is the same.

 Jaccard Distance: A and B have an intersection of size 1 and a union of size 5. Thus, their Jaccard similarity is 1/5, and their Jaccard distance is 4/5; that is, they are very far apart. In comparison, A and C have a Jaccard similarity of 2/4 = 1/2, so their Jaccard distance is also 1/2. Thus, A appears closer to C than to B. Yet that conclusion seems intuitively wrong: A and C disagree on the two movies they both watched (different ratings), while A and B both seem to have liked the one movie they watched in common.
 Cosine Similarity: between A and B it is 0.38, and between A and C it is 0.32. This measure tells us that A is slightly closer to B than to C (a small sketch of these calculations follows).
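A short sketch that reproduces these numbers from the utility matrix (blanks are treated as 0 for the cosine and as absent for Jaccard):

import math

ratings = {   # user -> {movie: rating}
    "A": {"HP1": 4, "TW": 5, "SW1": 1},
    "B": {"HP1": 5, "HP2": 5, "HP3": 4},
    "C": {"TW": 2, "SW1": 4, "SW2": 5},
}

def jaccard_distance(u, v):
    su, sv = set(ratings[u]), set(ratings[v])
    return 1 - len(su & sv) / len(su | sv)

def cosine_similarity(u, v):
    common = set(ratings[u]) & set(ratings[v])
    dot = sum(ratings[u][m] * ratings[v][m] for m in common)
    nu = math.sqrt(sum(r * r for r in ratings[u].values()))
    nv = math.sqrt(sum(r * r for r in ratings[v].values()))
    return dot / (nu * nv)

print(jaccard_distance("A", "B"), jaccard_distance("A", "C"))   # 0.8, 0.5
print(round(cosine_similarity("A", "B"), 2))                    # 0.38
print(round(cosine_similarity("A", "C"), 2))                    # 0.32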
BITS Pilani, WILP
Collaborative Filtering
Rounding and Normalizing Ratings

Rounding the Ratings:


 Consider ratings of 3, 4, and 5 as a “1” and consider ratings 1 and 2 as unrated. The utility matrix
would then look as below:
       HP1  HP2  HP3   TW  SW1  SW2  SW3
  A     1               1
  B     1    1    1
  C                          1    1
  D               1                   1
 Now, the Jaccard distance between A and B is 3/4, while between A and C it is 1. That is C appears
further from A than B does, which is intuitively correct. Applying cosine distance also yields the
same conclusion.
Normalizing Ratings:
 If we normalize ratings by subtracting from each rating the average rating of that user, we turn low ratings into negative numbers and high ratings into positive numbers, as shown in the table below:

       HP1   HP2   HP3    TW   SW1   SW2   SW3
  A    2/3                5/3  -7/3
  B    1/3   1/3  -2/3
  C                      -5/3   1/3   4/3
  D                 0                         0

 Cosine similarity between A and B is 0.092, and between A and C it is -0.559.
 A and C are much further apart than A and B, and neither pair is very close. Both these observations make intuitive sense.
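A small sketch (the same toy ratings as in the previous sketch, redefined here so the snippet is self-contained) that normalizes each user's ratings by the user's average and recomputes the cosines:

import math

ratings = {
    "A": {"HP1": 4, "TW": 5, "SW1": 1},
    "B": {"HP1": 5, "HP2": 5, "HP3": 4},
    "C": {"TW": 2, "SW1": 4, "SW2": 5},
}

def normalize(user):
    avg = sum(ratings[user].values()) / len(ratings[user])
    return {m: r - avg for m, r in ratings[user].items()}

def cosine(u, v):
    common = set(u) & set(v)
    dot = sum(u[m] * v[m] for m in common)
    return dot / (math.sqrt(sum(x * x for x in u.values())) *
                  math.sqrt(sum(x * x for x in v.values())))

nA, nB, nC = normalize("A"), normalize("B"), normalize("C")
print(round(cosine(nA, nB), 3))   # ->  0.092
print(round(cosine(nA, nC), 3))   # -> -0.559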
BITS Pilani, WILP
Collaborative Filtering
Recommending Items to Users

 The goal of Collaborative Filtering is to find similar users and, based on their ratings, recommend to a user what similar users have liked.
 So once the utility matrix is transformed (as discussed on the previous slides), the process can proceed as follows (a small sketch follows this list):
 The value of the utility matrix entry for user U and item I is to be estimated.
 The n users (for some predetermined n) most similar to U are found, and their average rating for item I is computed. Only those among the n similar users who have actually rated I are counted in the average.
 This average rating is an indication that user U would rate item I with a similar magnitude.
 Users can also be clustered using Jaccard Distance or Cosine Similarity, and a new user can be placed in the appropriate cluster.
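A minimal sketch of this user-based prediction step (the choice of similarity function, the value of n and the toy ratings are assumptions for illustration):

import math

def cosine(u, v):
    common = set(u) & set(v)
    dot = sum(u[m] * v[m] for m in common)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def predict(ratings, user, item, n=2):
    """Average rating for `item` among the n users most similar to `user`."""
    others = [(cosine(ratings[user], ratings[v]), v)
              for v in ratings if v != user]
    top_n = [v for _, v in sorted(others, reverse=True)[:n]]
    rated = [ratings[v][item] for v in top_n if item in ratings[v]]
    return sum(rated) / len(rated) if rated else None

ratings = {
    "U": {"HP1": 4, "TW": 5},
    "V": {"HP1": 5, "TW": 4, "SW1": 2},
    "W": {"HP1": 4, "SW1": 1},
    "X": {"TW": 1, "SW1": 5},
}
print(predict(ratings, "U", "SW1"))   # estimate of U's rating for SW1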

BITS Pilani, WILP


Exercise

A B C D E F G H
X 4 5 5 1 3 2
Y 3 4 3 1 2 1
Z 2 1 3 4 5 3

Utility matrix of 8 items (A-H) and 3 users (X-Z) is shown above.


i. Treating the utility matrix as boolean, compute the Jaccard distance
between each pair of users.
ii. Repeat (i) using Cosine Similarity.
iii. Treat ratings of 3, 4, and 5 as 1 and 1, 2, and blank as 0. Compute the
Jaccard distance between each pair of users.
iv. Repeat (iii) using Cosine Similarity.
v. Normalize the matrix by subtracting from each nonblank entry the
average value for its user.
vi. Calculate Cosine Similarity after (v). 15

BITS Pilani, WILP


Cold Start Issue

 The cold-start problem happens when the system does not have any data on new users or on new items.
 Content Based Systems treat each user independently. If there are new items, they recommend items that are similar to those in the user’s profile, so the recommendations stay around familiar items. For a new user, creating a user profile is a challenge.
 Collaborative Filtering finds similar users and, based on their ratings, recommends to a user what similar users like. The cold-start issue is more serious here: new users have not given enough ratings, and new items have not yet been bought by many users.
16

BITS Pilani, WILP


References

Chapter 9 (Recommendation Systems) of J. Leskovec, A. Rajaraman, J. Ullman, Mining of Massive Datasets, Cambridge University Press.

17

BITS Pilani, WILP


Thank You

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
