
CE6027 Information Modelling and Retrieval: LEARNING FROM DATA

Emmanuel Tumwesigye

University College Cork (Ireland), Department of Civil and Environmental Engineering


Objectives

- Describe the concept of learning from data
- Understand what can be learnt from data
- Understand the need for analysis of large, complex, information-rich data sets
- Identify the goals and primary tasks of learning from data
- Recognise the role of the expert in learning
- Survey data mining methods, with demonstrations and examples

Learning from Data: What is it about? The Learning Process

- Determining an unknown mapping between inputs and outputs
- From a finite number of observations
- Using a learning machine that can approximate the mapping

(Diagram: a System maps input x to output y; the Learning Machine observes the data D = {x, y} and produces an estimate of y; the difference between the observed and estimated outputs is used to assess learning.)

Example: input x = {temperature, humidity, diffuser size and rate}; output y = {occupant comfort}.

Learning from Data: What is it about? The Learning Process (continued)

- Determining the unknown mapping between inputs and outputs from a finite number of observations
- Determining how well the machine has learnt: an error function compares the machine's output with the observed output

Therefore:
- The learning process is a search of a HYPOTHESIS space for the hypothesis that best fits the data
- This search is usually supported by expert knowledge
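To make the mapping-plus-error-function picture concrete, here is a minimal sketch (illustrative only: the data, the variable names and the choice of a linear hypothesis are assumptions, not taken from the slides):

```python
import numpy as np

# Synthetic observations D = {x, y}: one input (e.g. temperature) and a
# noisy "comfort" response. Purely illustrative data.
rng = np.random.default_rng(0)
x = rng.uniform(18, 26, size=50)            # e.g. room temperature in deg C
y = 0.8 * x - 12 + rng.normal(0, 0.5, 50)   # unknown system + noise

# Hypothesis space: straight lines y_hat = a*x + b.
# "Learning" = searching this space for the best-fitting line (least squares).
A = np.column_stack([x, np.ones_like(x)])
(a, b), *_ = np.linalg.lstsq(A, y, rcond=None)
y_hat = a * x + b

# Error function: how well has the machine learnt?
mse = np.mean((y - y_hat) ** 2)
print(f"learned mapping: y_hat = {a:.2f}*x + {b:.2f}, MSE = {mse:.3f}")
```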

LEARNING FROM DATA is NOT (conventional) STATISTICS

STATISTICS:
- Assumes a known functional form, e.g. Gaussian
- A (naive) Bayes classifier assumes independence between attributes
- Some statistical problems are ill-posed, e.g. PDF estimation (a solution may not exist, may not be unique, or may not be stable under small perturbations)

MACHINE LEARNING ASSUMES STATISTICS!

Learning from Data: A Necessary Digression

What can be learnt from data?
- Information, leading to knowledge
- Ultimately, understanding

What is data?

Perspective

According to Russell Ackoff, a systems theorist:
1. Data: symbols
2. Information: data that are processed to be useful; provides answers to "who", "what", "where" and "when" questions
3. Knowledge: application of data and information; answers "how" questions
4. Understanding: appreciation of "why"
5. Wisdom: evaluated understanding

Perspective

1. Data: exists with no significance beyond its existence
2. Information: data that has been given meaning by way of relational connection; the meaning does not have to be useful
3. Knowledge: a collection of information intended to be USEFUL
4. Understanding: synthesising new knowledge from currently held knowledge; it is cognitive and analytical
5. Wisdom: an extrapolative, non-deterministic, non-probabilistic process

Perspective

1. Data example: the temperature is 5, it was 14 yesterday, and it is raining
2. Information: the temperature dropped and it rained
3. Knowledge: if the humidity is high and the temperature drops substantially, the atmosphere is unlikely to hold the moisture, so it rains
4. Understanding: it rains because it rains
5. Wisdom: ?

Learning from Data

What is:
- Data Mining?
- Knowledge Discovery in Databases?
- Machine Learning?
- Data-Driven Modelling?

How do they learn from data?

Learning from Data: What is Data Mining?

Data mining:
- Looks for hidden patterns and trends in data that are not immediately apparent from summarising the data
- Not by using statistical methods, but by applying an interestingness criterion

Data Mining

Data + interestingness criteria = hidden patterns

Data Mining

Types of interestingness:
- Frequency
- Rarity
- Correlation
- Length of occurrence (for sequence and temporal data)
- Consistency
- Repetition / periodicity
- Abnormal behaviour (anomaly detection)
- Other patterns of interestingness

Learning from Data: What is Data Mining?

Data mining (knowledge discovery in databases): the extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information from data in large databases.

It represents a paradigm shift from classical (physics-based) modelling and analysis based on first principles towards developing models, and the corresponding analyses, directly from data.

Learning from Data: What is Data Mining?

KDD may involve:
1. On-Line Transaction Processing (OLTP): a model for enterprise data processing with an emphasis on transactions involving the input, update and retrieval of data.
2. On-Line Analytical Processing (OLAP): applications that query the database to collate, summarise and analyse its contents.
3. Data mining: augments the OLAP process by applying artificial intelligence and machine learning techniques to find previously unknown or undiscovered relationships in the data.

Learning from Data: What is Data Mining? KDD Process


Learning from Data: What is Data Mining?

Learning settings:
- Supervised: a set of instances with known targets
- Unsupervised: target outputs are unknown

Learning tasks:
- Classification
- Regression
- Control
- Prediction
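A minimal sketch of the two settings, assuming the scikit-learn library is available (the data and model choices here are invented for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier   # supervised: targets known
from sklearn.cluster import KMeans                # unsupervised: no targets

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(20, 1, (25, 2)),        # e.g. two operating regimes
               rng.normal(25, 1, (25, 2))])
y = np.array([0] * 25 + [1] * 25)                 # known labels (supervised)

# Supervised learning: fit a classifier against the known targets.
clf = DecisionTreeClassifier(max_depth=2).fit(X, y)
print("classification accuracy:", clf.score(X, y))

# Unsupervised learning: group the same data with no labels at all.
km = KMeans(n_clusters=2, n_init=10).fit(X)
print("cluster sizes:", np.bincount(km.labels_))
```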

Learning from Data: Data Mining in Engineering

- Temporal data
- Spatial data

Learning from Data: Data Mining in Engineering

Data are collected and stored at enormous speeds (GB/hour), e.g. from:
- remote sensors on satellites
- telescopes scanning the skies
- microarrays generating gene-expression data
- scientific simulations generating terabytes of data

- Traditional techniques are infeasible for such raw data
- Data mining may help scientists in classifying and segmenting data, and in hypothesis formation

Learning from Data: Data Mining in Engineering

Temporal data:
- Univariate
- Multivariate

Spatial data

Learning from Data: Data Mining in Engineering

Time-series mining tasks (illustrated in the figure on the next slide):
- Indexing: find the time series in a database most similar to a given query time series.
- Clustering: find groups of time series such that series in the same group are similar to each other, whereas series from different groups are dissimilar.
- Classification: assign a given time series to a predefined group to which it is more similar than to the time series of other groups.
- Novelty (anomaly) detection: find all sections of a time series whose behaviour differs from that expected under some base model.
- Motif discovery: detect previously unknown repeated patterns in a time-series database.
- Rule discovery: infer rules from one or more time series describing the most probable behaviour at a specific time point (or interval).

Learning from Data: Data Mining in Engineering

(Figure: examples of the time-series mining tasks listed above - clustering, classification, motif discovery, visualisation, rule discovery, query by content, and novelty detection.)

Learning from Data: Data Mining in Engineering

Temporal data:
- A concise definition of an anomaly in multivariate data
- Behaviour envelopes
- Feature extraction

What is difficult?
- Differing data formats
- Differing sampling rates
- Noise, missing values, etc.

Learning from Data: Data Mining in Engineering

Temporal data: two notions of similarity
- Similar shape
- Similar structure

Learning from Data: Data Mining in Engineering

Temporal data: dealing with raw data and pre-processing before shape-based comparison
- Offset translation
- Amplitude scaling
- Linear trend
- Noise

Learning from Data: Data Mining in Engineering

Temporal data: pre-processing - offset translation

Before computing the distance D(Q, C) between a query series Q and a candidate series C, remove their offsets:

Q = Q - mean(Q)
C = C - mean(C)
then compute D(Q, C)

(Figure: two series with different offsets; after subtracting the means, the distance reflects shape rather than level.)

Learning from Data: Data Mining in Engineering

Temporal data: pre-processing - amplitude scaling

Normalise each series before computing the distance:

Q = (Q - mean(Q)) / std(Q)
C = (C - mean(C)) / std(C)
then compute D(Q, C)

(Figure: two series with different amplitudes; after normalisation their shapes can be compared directly.)
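A minimal sketch of the offset-translation and amplitude-scaling steps (z-normalisation) before computing D(Q, C); the series used here are synthetic and purely illustrative:

```python
import numpy as np

def znorm(series):
    """Offset translation + amplitude scaling (z-normalisation)."""
    series = np.asarray(series, dtype=float)
    return (series - series.mean()) / series.std()

def distance(q, c):
    """Euclidean distance D(Q, C) between two equal-length series."""
    return np.sqrt(np.sum((znorm(q) - znorm(c)) ** 2))

# Two series with the same shape but different offset and amplitude.
t = np.linspace(0, 4 * np.pi, 200)
q = 5 + 2 * np.sin(t)
c = 20 + 0.5 * np.sin(t)
print(distance(q, c))   # close to 0 once both series are normalised
```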

Learning from Data: Data Mining in Engineering

Temporal data: pre-processing - linear trend (fit the best straight line to the series and subtract it)

(Figure: a series with a linear trend before and after detrending.)

Learning from Data: Data Mining in Engineering

Temporal data: pre-processing - noise (e.g. smooth each series before comparison)

(Figure: a noisy series before and after smoothing.)

Learning from Data: Data Mining in Engineering

(Figure, A. Hryshchenko: measured open-office and immunology-lab temperature and humidity, 30/08/2009 to 29/09/2009.)

Learning from Data: Data Mining in Engineering

(Figure, A. Hryshchenko: measured open-office and immunology-lab temperature and humidity, 02/09/2009 to 10/09/2009.)

Learning from Data: Data Mining in Engineering

Features of spatial data:
- spatial relationships among the variables
- spatial structure of errors
- mixed distributions, as opposed to the commonly assumed normal distributions
- observations that are not independent
- spatial autocorrelation among the features
- non-linear interaction in feature space

Learning from Data: Data Mining in Engineering

Spatial data mining tasks:
- Location prediction
- Spatial outlier detection
- Co-location rules
- Spatial clustering and complete spatial randomness

Learning from Data: Data Mining in Engineering

Location prediction as a classification problem

(Figure: maps of observed nest locations together with candidate explanatory layers - distance to open water, vegetation durability and water depth.)

Learning from Data: Data Mining in Engineering

Co-location: which spatial feature types tend to occur together?

(Figures: example point data sets used to illustrate co-located feature types.)

Learning from Data: Data Mining in Engineering

- Spatial databases do not store spatial relations explicitly; additional functionality is required to compute them
- Three types of spatial relations are specified by the OGC reference model:
  - Distance relations: e.g. the Euclidean distance between two spatial features
  - Direction relations: the ordering of spatial features in space
  - Topological relations: characterise the type of intersection between spatial features

Learning from Data: Data Mining in Engineering - Distance Relations

If dist is a distance function and c is some real number, the distance relations between features A and B are:
1. dist(A, B) > c
2. dist(A, B) < c
3. dist(A, B) = c

Learning from Data: Data Mining in Engineering - Direction Relations

If the directions of B and C are required with respect to A:
1. Define a representative point, rep(A)
2. rep(A) defines the origin of a virtual coordinate system
3. The quadrants and half-planes of this system define the direction relations
4. B can then take two values, {northeast, east}
5. The exact direction relation of B is northeast (and C lies to the north)

(Figure: objects A, B and C with rep(A) at the origin; B to the northeast, C to the north.)
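A small sketch of the distance and direction relations using representative points (the helper names and coordinates are invented for illustration):

```python
import math

def dist(a, b):
    """Euclidean distance between two representative points (x, y)."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

def distance_relation(a, b, c):
    """Classify dist(A, B) against a threshold c."""
    d = dist(a, b)
    return ">" if d > c else ("<" if d < c else "=")

def direction_relation(rep_a, rep_b):
    """Direction of B with respect to A, using rep(A) as the origin."""
    dx, dy = rep_b[0] - rep_a[0], rep_b[1] - rep_a[1]
    ns = "north" if dy > 0 else "south" if dy < 0 else ""
    ew = "east" if dx > 0 else "west" if dx < 0 else ""
    return (ns + ew) or "same location"

rep_a, rep_b, rep_c = (0, 0), (4, 2), (0, 5)
print(distance_relation(rep_a, rep_b, 3.0))   # '>'
print(direction_relation(rep_a, rep_b))       # 'northeast'
print(direction_relation(rep_a, rep_c))       # 'north'
```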

Learning from Data: Data Mining in Engineering

- Topological relations describe how geometries intersect spatially
- Simple geometry types:
  - Point (0-dimensional)
  - Line (1-dimensional)
  - Polygon (2-dimensional)
- Each geometry is represented in terms of its:
  - boundary (B): the geometry of the next lower dimension
  - interior (I): the points of the geometry when the boundary is removed
  - exterior (E): the points in neither the interior nor the boundary
- Examples for simple geometries:
  - For a point: I = {the point}, B = {} and E = {points not in I or B}
  - For a line: I = {all points except the boundary points}, B = {the two end points} and E = {points not in I or B}
  - For a polygon: I = {points within the boundary}, B = {the boundary} and E = {points not in I or B}

Learning from Data: Data Mining in Engineering

- Topological relations are defined using one of the following models:
  - 4IM, the four-intersection model (only B and I considered)
  - 9IM, the nine-intersection model (B, I and E)
  - DE-9IM, the dimensionally extended nine-intersection model
- DE-9IM is the OGC-compliant model
- dim is the dimension function used to fill the DE-9IM matrix

Learning from Data: Data Mining in Engineering

Consider two overlapping polygons (WKT):
- A: POLYGON ((10 10, 15 0, 25 0, 30 10, 25 20, 15 20, 10 10))
- B: POLYGON ((20 10, 30 0, 40 10, 30 20, 20 10))

Learning from Data: Data Mining in Engineering

The 9-intersection matrix of the example geometries records, for each pair drawn from {I(A), B(A), E(A)} and {I(B), B(B), E(B)}, whether the intersection is empty or non-empty.

(Figure: the 3 x 3 intersection matrix for polygons A and B.)

Learning from Data: Data Mining in Engineering

DE-9IM for the example geometries (entries are the dimensions of the intersections):

          I(B)   B(B)   E(B)
  I(A)     2      1      2
  B(A)     1      0      1
  E(A)     2      1      2

Learning from Data: Data Mining in Engineering

- Different pairs of geometries may give rise to different numbers in the DE-9IM
- For a specific type of relationship (e.g. overlaps) we are only interested in the values at certain positions; that is, we are interested in patterns in the matrix rather than in the actual values
- The remaining entries are replaced by wild cards; the pattern for overlaps is:

          I(B)   B(B)   E(B)
  I(A)     T      *      T
  B(A)     *      *      *
  E(A)     T      *      *

- Pattern symbols:
  - T: value is "true" - non-empty - any dimension >= 0
  - F: value is "false" - empty - dimension < 0
  - *: don't care what the value is
  - 0: value is exactly zero
  - 1: value is exactly one
  - 2: value is exactly two
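As an illustration, the DE-9IM string and the overlaps pattern above can be checked programmatically. This sketch assumes the shapely package (the slides do not prescribe any particular library); the expected relate string follows from the matrix shown on the previous slide:

```python
from shapely.geometry import Polygon

a = Polygon([(10, 10), (15, 0), (25, 0), (30, 10), (25, 20), (15, 20)])
b = Polygon([(20, 10), (30, 0), (40, 10), (30, 20)])

de9im = a.relate(b)           # row-major string: I/B/E of A against I/B/E of B
print(de9im)                  # expected '212101212' for these two polygons

def matches(de9im, pattern):
    """Check a DE-9IM string against a wildcard pattern such as 'T*T***T**'."""
    for value, want in zip(de9im, pattern):
        if want == "*":
            continue
        if want == "T" and value not in "012":   # any non-empty intersection
            return False
        if want == "F" and value != "F":
            return False
        if want in "012" and value != want:      # exact dimension required
            return False
    return True

print(matches(de9im, "T*T***T**"))   # True: the polygons overlap
print(a.overlaps(b))                 # shapely's built-in predicate agrees
```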

Learning from Data: Data Mining Approaches

Learning approaches:
- Frequent pattern mining
- Classifiers
- Clustering
- Neural networks
- Fuzzy logic
- Genetic algorithms

Learning from Data: Mining Frequent Patterns

- What is frequent pattern mining?
- Frequent pattern mining algorithms: Apriori and its variations
- Recent progress on efficient mining methods: mining frequent patterns without candidate generation

Learning from Data: What Is Frequent Pattern Mining?

What is a frequent pattern?
- A pattern (a set of items, a sequence, etc.) that occurs frequently in a database

A frequent pattern is an important form of regularity, e.g.:
- What products are often purchased together? Beer and diapers!
- What are the consequences of a hurricane?
- What is the next purchase after buying a PC?

Learning from Data: What Is Frequent Pattern Mining?

In general, given source data S, an association rule A1, A2, ..., An => B indicates that the events A1, A2, ..., An will most likely be associated with the event B. The support and confidence of this association are:

Support = Record_Count(A1, A2, ..., An, B) / Record_Count(S) * 100%

Confidence = Record_Count(A1, A2, ..., An, B) / Record_Count(A1, A2, ..., An) * 100%
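A small sketch of these two measures over a list of transactions (the helper function is illustrative; the data anticipate the example on the next slide):

```python
def support_confidence(transactions, antecedent, consequent):
    """Support and confidence (in %) of the rule antecedent => consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    both = sum(1 for t in transactions if antecedent | consequent <= set(t))
    ante = sum(1 for t in transactions if antecedent <= set(t))
    support = 100.0 * both / len(transactions)
    confidence = 100.0 * both / ante if ante else 0.0
    return support, confidence

# Transactions from the slide that follows (IDs 2000, 1000, 4000, 5000).
transactions = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
print(support_confidence(transactions, {"A"}, {"C"}))   # (50.0, 66.66...)
print(support_confidence(transactions, {"C"}, {"A"}))   # (50.0, 100.0)
```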

Learning from Data: What Is Frequent Pattern Mining? Rule Measures: Support and Confidence

(Figure: Venn diagram - customers who buy beer, customers who buy diapers, and customers who buy both.)

Find all rules X & Y => Z with minimum support and confidence, where:
- support s = the probability that a transaction contains {X, Y, Z}
- confidence c = the conditional probability that a transaction containing {X, Y} also contains Z

Transaction ID   Items bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F

With minimum support 50% and minimum confidence 50%, we obtain:
- A => C (support 50%, confidence 66.6%)
- C => A (support 50%, confidence 100%)

Learning from Data: What Is Frequent Pattern Mining? The Apriori Algorithm

The Apriori method:
- Proposed by Agrawal & Srikant (1994); a similar level-wise algorithm by Mannila et al. (1994)

Major idea:
- Any subset of a frequent itemset must itself be frequent
- E.g. if {beer, diaper, nuts} is frequent, {beer, diaper} must be; if any subset is infrequent, its superset cannot be frequent!
- This is a powerful, scalable candidate-set pruning technique: it reduces the number of candidate k-itemsets dramatically (for k > 2)

Learning from Data: What Is Frequent Pattern Mining? Mining Association Rules: Example

Transaction ID   Items bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F

Minimum support 50%, minimum confidence 50%.

Frequent itemset   Support
{A}                75%
{B}                50%
{C}                50%
{A, C}             50%

For the rule A => C:
- support = support({A, C}) = 50%
- confidence = support({A, C}) / support({A}) = 66.6%

The Apriori principle: any subset of a frequent itemset must be frequent.

Learning from Data: What Is Frequent Pattern Mining? Apriori Pseudocode

Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

L1 = {frequent 1-itemsets};
for (k = 1; Lk != empty; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in the database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with count >= min_support;
end
return the union of all Lk;
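A compact, runnable sketch of the same level-wise procedure (an illustrative implementation, not the original authors' code); with the database of the worked example that follows it reproduces L1, L2 and L3:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support count} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]
    items = {frozenset([i]) for t in transactions for i in t}

    def count(candidates):
        return {c: sum(1 for t in transactions if c <= t) for c in candidates}

    level = {c for c, n in count(items).items() if n >= min_support}
    frequent = {}
    while level:
        frequent.update(count(level))
        # Join step: build (k+1)-item candidates from frequent k-itemsets,
        # pruning any candidate with an infrequent k-subset (Apriori property).
        candidates = set()
        for a in level:
            for b in level:
                union = a | b
                if len(union) == len(a) + 1 and all(
                        frozenset(s) in level
                        for s in combinations(union, len(a))):
                    candidates.add(union)
        level = {c for c, n in count(candidates).items() if n >= min_support}
    return frequent

# Database D from the worked example below, minimum support count = 2.
data = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
for itemset, sup in sorted(apriori(data, 2).items(),
                           key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)
```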

Learning from Data: What Is Frequent Pattern Mining? The Apriori Algorithm: Example

Database D (minimum support count = 2):
TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5

Scan D for C1 (candidate 1-itemsets with counts): {1}:2, {2}:3, {3}:3, {4}:1, {5}:3
L1 (frequent 1-itemsets): {1}:2, {2}:3, {3}:3, {5}:3

C2 (candidates from L1): {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}
Scan D for counts: {1 2}:1, {1 3}:2, {1 5}:1, {2 3}:2, {2 5}:3, {3 5}:2
L2 (frequent 2-itemsets): {1 3}:2, {2 3}:2, {2 5}:3, {3 5}:2

C3 (candidates from L2): {2 3 5}
Scan D: {2 3 5}:2
L3 (frequent 3-itemsets): {2 3 5}:2

Learning from Data: What Is Frequent Pattern Mining? Mining without Candidate Generation

The Apriori candidate generate-and-test method suffers from the following costs:
- It may need to generate a huge number of candidate sets
- It may need to repeatedly scan the database and check a large set of candidates by pattern matching

Learning from Data: Frequent-Pattern Growth (FP-growth)

- Divide-and-conquer:
  - compress the database by representing its frequent items as a frequent-pattern (FP) tree
- Mining the FP-tree:
  - starts from each frequent length-1 pattern
  - builds its conditional pattern base
  - then constructs its conditional FP-tree

Learning from Data: The Frequent-Pattern Tree Algorithm

Steps:
1. Scan the database (as in Apriori) and derive the frequent 1-itemsets into a list L, ordered by descending support count
2. Create the root of the tree (NULL); scan the database again, process each transaction in L order, and add a branch for each transaction
3. Mine the FP-tree

Equivalently: build a table of candidate items in descending order of support; build the frequent-pattern tree transaction by transaction; then link the table entries to the corresponding nodes of the tree.
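A sketch of steps 1-3 (the Node class and header-table names are illustrative, not from the slides), using the transactional data of the example that follows:

```python
from collections import defaultdict

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count, self.children = 0, {}

def build_fp_tree(transactions, min_support):
    # Step 1: count items and order the frequent ones by descending support
    # (ties broken by item name, matching the slide's order I2, I1, I3, I4, I5).
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[item] += 1
    order = [i for i, c in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
             if c >= min_support]
    # Step 2: insert each transaction, in that order, as a branch of the tree.
    root, header = Node(None, None), defaultdict(list)
    for t in transactions:
        node = root
        for item in [i for i in order if i in t]:
            child = node.children.get(item)
            if child is None:
                child = node.children[item] = Node(item, node)
                header[item].append(child)   # step 3: link the item table to nodes
            child.count += 1
            node = child
    return root, header, order

def show(node, depth=0):
    for child in node.children.values():
        print("  " * depth + f"{child.item}:{child.count}")
        show(child, depth + 1)

# Transactions T100..T900 from the example below.
data = [["I1", "I2", "I5"], ["I2", "I4"], ["I2", "I3"], ["I1", "I2", "I4"],
        ["I1", "I3"], ["I2", "I3"], ["I1", "I3"], ["I1", "I2", "I3", "I5"],
        ["I1", "I2", "I3"]]
root, header, order = build_fp_tree(data, min_support=2)
print("item order:", order)   # ['I2', 'I1', 'I3', 'I4', 'I5']
show(root)
```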

Learning from Data: What Is Frequent Pattern Mining? Transactional Data Example

TID    Items
T100   I1, I2, I5
T200   I2, I4
T300   I2, I3
T400   I1, I2, I4
T500   I1, I3
T600   I2, I3
T700   I1, I3
T800   I1, I2, I3, I5
T900   I1, I2, I3

Step 1: with a required support count of 2, the frequent items in descending order are:
I2: 7,  I1: 6,  I3: 6,  I4: 2,  I5: 2

Learning from Data: What Is Frequent Pattern Mining?

The FP-tree is grown one transaction at a time, inserting the items of each transaction in L order:
- Step 2: T100 = I2, I1, I5
- Step 3: T200 = I2, I4
- Step 4: T300 = I2, I3
- Step 5: T400 = I1, I2, I4
- Step 6: T500 = I1, I3
- Step 7: T600 = I2, I3
- Step 8: T700 = I1, I3
- Step 9: T800 = I1, I2, I3, I5
- Step 10: T900 = I1, I2, I3
- Step 11: link the item table with the tree

(Figures: the FP-tree after each insertion.)

Learning from Data: FP-Tree

An FP-tree registers compressed, frequent-pattern information:

Item   Conditional pattern base              Conditional FP-tree       Frequent patterns generated
I5     {(I2 I1: 1), (I2 I1 I3: 1)}           (I2:2, I1:2)              I2 I5:2, I1 I5:2, I2 I1 I5:2
I4     {(I2 I1: 1), (I2: 1)}                 (I2:2)                    I2 I4:2
I3     {(I2 I1: 2), (I2: 2), (I1: 2)}        (I2:4, I1:2), (I1:2)      I2 I3:4, I1 I3:2, I2 I1 I3:2
I1     {(I2: 4)}                             (I2:4)                    I2 I1:4

Learning from Data: Example

Outlook    Temperature   Humidity   Windy   Play
Sunny      Hot           High       False   No
Sunny      Hot           High       True    No
Overcast   Hot           High       False   Yes
Rainy      Mild          High       False   Yes
Rainy      Cool          Normal     False   Yes
Rainy      Cool          Normal     True    No
Overcast   Cool          Normal     True    Yes
Sunny      Mild          High       False   No
Sunny      Cool          Normal     False   Yes
Rainy      Mild          Normal     False   Yes
Sunny      Mild          Normal     True    Yes
Overcast   Mild          High       True    Yes
Overcast   Hot           Normal     False   Yes
Rainy      Mild          High       True    No

In this table there are four attributes - outlook, temperature, humidity and windy - and the outcome is whether to play or not.
(a) Show the possible association rules that can determine the outcome, without support and confidence levels.
(b) Show the support level and confidence level of the association rule: if temperature = cool then humidity = normal.

Genetic Algorithms (GA)

Learning from Data: Data Classification: Rules - Weather Data

ID   Outlook    Temperature   Humidity   Windy   Play?
A    sunny      hot           high       false   No
B    sunny      hot           high       true    No
C    overcast   hot           high       false   Yes
D    rain       mild          high       false   Yes
E    rain       cool          normal     false   Yes
F    rain       cool          normal     true    No
G    overcast   cool          normal     true    Yes
H    sunny      mild          high       false   No
I    sunny      cool          normal     false   Yes
J    rain       mild          normal     false   Yes
K    sunny      mild          normal     true    Yes
L    overcast   mild          high       true    Yes
M    overcast   hot           normal     false   Yes
N    rain       mild          high       true    No

Learning from Data: Data Classification: Rules

1. IF outlook = sunny and humidity = high THEN play = no
2. IF outlook = rainy and windy = true THEN play = no
3. IF outlook = overcast THEN play = yes
4. IF humidity = normal THEN play = yes
5. IF none of the above THEN play = yes
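Applied as an ordered list in which the first matching rule fires, the rule set above can be sketched as follows (illustrative code; the table abbreviates "rainy" as "rain"):

```python
def classify(outlook, humidity, windy):
    """Ordered rule list from the slide; the first matching rule fires."""
    if outlook == "sunny" and humidity == "high":
        return "no"                       # rule 1
    if outlook in ("rain", "rainy") and windy:
        return "no"                       # rule 2
    if outlook == "overcast":
        return "yes"                      # rule 3
    if humidity == "normal":
        return "yes"                      # rule 4
    return "yes"                          # rule 5 (default)

# Instances A and F from the weather table above.
print(classify("sunny", "high", False))   # 'no'  (rule 1)
print(classify("rain", "normal", True))   # 'no'  (rule 2)
```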

Learning from Data: Data Classification: Rules - Weather Data (numeric)

ID   Outlook    Temperature   Humidity   Windy   Play?
A    sunny      85            85         false   No
B    sunny      80            90         true    No
C    overcast   83            86         false   Yes
D    rain       70            96         false   Yes
E    rain       68            80         false   Yes
F    rain       65            70         true    No
G    overcast   64            65         true    Yes
H    sunny      72            95         false   No
I    sunny      69            70         false   Yes
J    rain       75            80         false   Yes
K    sunny      75            70         true    Yes
L    overcast   72            90         true    Yes
M    overcast   81            75         false   Yes
N    rain       71            91         true    No

Learning from Data: Data Classification: Numeric Data

1. IF outlook = sunny and humidity > 83 THEN play = no
2. IF outlook = rainy and windy = true THEN play = no
3. IF outlook = overcast THEN play = yes
4. IF humidity = normal THEN play = yes
5. IF none of the above THEN play = yes

Learning from Data: Data Classification: Trees

- Use divide and conquer on a set of independent attribute values
- The idea is to ask a series of questions that lead you to a leaf node
- Start at the root and create leaves; leaves become nodes for further leaves
- A leaf node provides a classification

Learning from Data: Decision Trees - Data Classification

- A decision tree is a diagramming method used to select among attributes
- Main assumption: the data can be effectively modelled via decision splits on attributes
- Hypothesis space: variable size (non-parametric), so it can model any function
- Uses the induction principle: given a collection of pairs <x, f(x)>, return a hypothesis h(x) that approximates f(x)

Learning from Data: Data Classification

Decision tree procedure:
- An internal node is a test on an attribute
- A branch represents an outcome of the test
- A leaf node represents a class label
- At each node, one attribute is chosen to split the training examples into classes that are as distinct as possible
- A new case is classified by following the matching path to a leaf node

Example: Weather Data - Play or not Play?

(The same nominal weather table shown above under "Data Classification: Rules". Note: Outlook is the forecast.)

Example tree for Play?

Outlook?
  sunny    -> Humidity?
                high   -> No
                normal -> Yes
  overcast -> Yes
  rain     -> Windy?
                true   -> No
                false  -> Yes

Building the Decision Tree

- Top-down tree construction: at the start, all training examples are at the root; partition the examples recursively by choosing one attribute at a time
- Bottom-up tree pruning: remove sub-trees or branches, in a bottom-up manner, to improve the estimated accuracy on new cases

Choosing the Splitting Attribute

- At each node, the available attributes are evaluated on the basis of how well they separate the classes of the training examples
- A goodness function is used for this purpose; typical goodness functions are:
  - information gain (ID3/C4.5)
  - information gain ratio
  - Gini index

Which attribute to select?

(Figure: splitting the weather data on Outlook; the sunny branch contains 2 yes / 3 no, the overcast branch 4 yes / 0 no, and the rainy branch 3 yes / 2 no.)

A Criterion for Attribute Selection

- Which is the best attribute? The one that will result in the smallest tree
- Heuristic: choose the attribute that produces the "purest" nodes
- A popular impurity criterion is information gain: information gain increases with the average purity of the subsets that an attribute produces
- Strategy: choose the attribute that results in the greatest information gain

Computing Information

- Information is measured in bits
- Given a probability distribution, the information required to predict an event is the distribution's entropy; entropy gives the required information in bits (this can involve fractions of bits!)
- Given probabilities p1, p2, ..., pn whose sum is 1, entropy is defined as:

  entropy(p1, p2, ..., pn) = -p1*log2(p1) - p2*log2(p2) - ... - pn*log2(pn)

  i.e. I(p1, p2, ..., pn) = -sum_i pi*log2(pi)

- Information gain = (information before the split) - (information after the split)
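A short sketch (illustrative code) that reproduces the Outlook numbers computed on the next slides; 0*log2(0) is treated as zero, as noted there:

```python
import math

def entropy(counts):
    """Entropy in bits of a class distribution, with 0*log2(0) taken as 0."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

def info_gain(parent_counts, subsets):
    """(info before split) - (weighted info after splitting into subsets)."""
    total = sum(parent_counts)
    after = sum(sum(s) / total * entropy(s) for s in subsets)
    return entropy(parent_counts) - after

# Weather data: 9 yes / 5 no overall; Outlook splits it into
# sunny [2 yes, 3 no], overcast [4 yes, 0 no], rainy [3 yes, 2 no].
print(round(entropy([2, 3]), 3))                               # 0.971
print(round(entropy([4, 0]), 3))                               # 0.0
print(round(entropy([9, 5]), 3))                               # 0.940
print(round(info_gain([9, 5], [[2, 3], [4, 0], [3, 2]]), 3))   # 0.247
```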

Example: the Outlook Attribute

Note: log(0) is not defined, but we evaluate 0*log(0) as zero.

- Outlook = Sunny:
  info([2,3]) = entropy(2/5, 3/5) = -2/5*log2(2/5) - 3/5*log2(3/5) = 0.971 bits
- Outlook = Overcast:
  info([4,0]) = entropy(1, 0) = -1*log2(1) - 0*log2(0) = 0 bits
- Outlook = Rainy:
  info([3,2]) = entropy(3/5, 2/5) = -3/5*log2(3/5) - 2/5*log2(2/5) = 0.971 bits
- Expected information for the Outlook attribute:
  info([2,3], [4,0], [3,2]) = (5/14)*0.971 + (4/14)*0 + (5/14)*0.971 = 0.693 bits
- Information before the split:
  info([9,5]) = entropy(9/14, 5/14) = -9/14*log2(9/14) - 5/14*log2(5/14) = 0.940 bits

Computing the Information Gain

Information gain = (information before split) - (information after split):

gain(Outlook) = info([9,5]) - info([2,3], [4,0], [3,2]) = 0.940 - 0.693 = 0.247 bits

Similarly, the information gains for the remaining attributes of the weather data are:
- gain(Temperature) = 0.029 bits
- gain(Humidity) = 0.152 bits
- gain(Windy) = 0.048 bits

Therefore we pick Outlook for the root; of its branches, sunny and rainy still need further splitting. Say we select the sunny branch.

Continuing to Split

Within the sunny branch, the information gains for the remaining attributes are:
- gain(Temperature) = 0.571 bits
- gain(Humidity) = 0.971 bits
- gain(Windy) = 0.020 bits

So Humidity is chosen next; after that we cannot expand any further.

The Final Decision Tree

(Figure: the completed tree - Outlook at the root, Humidity under the sunny branch, Windy under the rainy branch.)

- Note: not all leaves need to be pure; sometimes identical instances have different classes
- Splitting stops when the data cannot be split any further

*Wish List for a Purity Measure

Properties we require from a purity measure:
- When a node is pure, the measure should be zero
- When impurity is maximal (i.e. all classes are equally likely), the measure should be maximal
- The measure should obey the multistage property (i.e. decisions can be made in several stages):

  entropy(p, q, r) = entropy(p, q + r) + (q + r) * entropy(q / (q + r), r / (q + r))

Entropy is a function that satisfies all three properties.

*Properties of the Entropy

The multistage property:

  info([2,3,4]) = entropy(2/9, 3/9, 4/9)
                = entropy(2/9, 7/9) + 7/9 * entropy(3/7, 4/7)
                = -2/9*log2(2/9) - 7/9*log2(7/9) + 7/9 * [-3/7*log2(3/7) - 4/7*log2(4/7)]

Simplification of the computation:

  info([2,3,4]) = entropy(2/9, 3/9, 4/9)
                = -2/9*log2(2/9) - 3/9*log2(3/9) - 4/9*log2(4/9)
                = [-2*log2(2) - 3*log2(3) - 4*log2(4) + 9*log2(9)] / 9
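A quick numerical check of the multistage property for the same distribution (illustrative code):

```python
import math

def entropy(*probs):
    """Entropy in bits; empty (zero-probability) terms are skipped."""
    return -sum(p * math.log2(p) for p in probs if p)

p, q, r = 2 / 9, 3 / 9, 4 / 9
lhs = entropy(p, q, r)
rhs = entropy(p, q + r) + (q + r) * entropy(q / (q + r), r / (q + r))
print(round(lhs, 6), round(rhs, 6))   # both approximately 1.530
```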
