Emmanuel Tumwesigye
University College Cork (Ireland) Department of Civil and Environmental Engineering page 1
Objectives
- Describe the concept of learning from data
- Understand what can be learnt from data
- Understand the need for analyses of large, complex, information-rich data sets
- Goals and primary tasks of learning from data
- The role of the expert in learning
- Data mining methods and demonstration with examples
[Figure: System and Learning Machine. Example output: y {occupant comfort}.]
Therefore
Learning process = search in the HYPOTHESIS space for the hypothesis that fits the data.
This search is usually supported by expert knowledge.
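The idea of learning as a search can be sketched in a few lines. This is only an illustration under stated assumptions: the hypothesis space (straight lines y = a*x + b on a coarse grid), the data points, and the squared-error criterion are all made up for the example and are not taken from the slides.

```python
from itertools import product

# Made-up observations (x, y) for illustration; roughly y = 2x + 1.
data = [(0.0, 1.1), (1.0, 2.9), (2.0, 5.2), (3.0, 6.8)]

def squared_error(a, b, points):
    """Sum of squared errors of the hypothesis y = a*x + b."""
    return sum((y - (a * x + b)) ** 2 for x, y in points)

# Hypothesis space (an assumption): every (slope, intercept) pair on a grid.
slopes = [i * 0.5 for i in range(0, 9)]        # 0.0, 0.5, ..., 4.0
intercepts = [i * 0.5 for i in range(-4, 5)]   # -2.0, -1.5, ..., 2.0
hypotheses = list(product(slopes, intercepts))

# Learning = searching the space for the hypothesis that fits the data best.
best_a, best_b = min(hypotheses, key=lambda h: squared_error(h[0], h[1], data))
print(f"best hypothesis: y = {best_a}*x + {best_b}")
```

Expert knowledge enters through the choice of the hypothesis space itself: here, the decision to consider only straight lines with slopes between 0 and 4.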
What is Data?
Perspective
1. Data: symbols.
2. Information: data that are processed to be useful; provides answers to "who", "what", "where", and "when" questions.
3. Knowledge: application of data and information; answers "how" questions.
4. Understanding: appreciation of "why".
5. Wisdom: evaluated understanding.
Perspective
1. Data: exists with no significance beyond its existence.
2. Information: data that has been given meaning by way of relational connection. The meaning does not have to be useful.
3. Knowledge: a collection of information intended to be USEFUL.
4. Understanding: synthesizing new knowledge using currently held knowledge. It is cognitive and analytical.
5. Wisdom: an extrapolative, non-deterministic, non-probabilistic process.
Perspective
1. Data example: the temperature is 5, it was 14 yesterday, and it's raining.
2. Information: the temperature dropped and it rained.
3. Knowledge: if the humidity is high and the temperature drops substantially, the atmosphere is unlikely to hold moisture, so it rains.
4. Understanding: It rains because it rains.
5. Wisdom: ?
What is Data Mining, Knowledge Discovery in Databases, Machine Learning, Data Driven Modelling?
Data mining:
Looking for hidden patterns and trends in data that are not immediately apparent from summarizing the data.
Not using statistical methods.
Data Mining
Data + Interestingness criteria = Hidden patterns
Data Mining
Types of interestingness
- Frequency
- Rarity
- Correlation
- Length of occurrence (for sequence and temporal data)
- Consistency
- Repeating / periodicity
- Abnormal behavior (anomaly detection)
- Other patterns of interestingness
Data mining (knowledge discovery in databases): extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information from data in large databases.
A paradigm shift from classical (physics-based) modelling and analyses based on first principles to developing models and the corresponding analyses directly from data.
KDD may be
1. On-Line Transaction Processing (OLTP) is a model for enterprise data processing with emphasis on transactions involving the input, update, and retrieval of data.
2. On-Line Analytical Processing (OLAP) applications query the database to collate, summarize, and analyze its contents.
3. Data mining augments the OLAP process by applying artificial intelligence and machine learning techniques to find previously unknown or undiscovered relationships in the data.
Learning Tasks
- Supervised: a set of instances with known targets
- Unsupervised: target outputs are unknown

Learning tasks
- Classification
- Regression
- Control
- Prediction
Temporal data
Spatial Data
Traditional techniques are infeasible for raw data.
Data mining may help scientists:
- in classifying and segmenting data
- in hypothesis formation
Temporal data
- Univariate
- Multivariate
Spatial Data
- Indexing: find the time series in a database that is most similar to a given query time series.
- Clustering: find groups of time series in a database such that time series in the same group are similar to each other, whereas time series from different groups are dissimilar.
- Classification: assign a given time series to a predefined group such that it is more similar to the other time series of that group than to time series from other groups.
- Novelty (anomaly) detection: find all sections of a time series that behave differently from what is expected with respect to some base model.
- Motif discovery: detect previously unknown repeated patterns in a time series database.
- Rule discovery: infer rules from one or more time series describing the most likely behaviour that they might present at a specific time point (or interval).
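The indexing task above can be illustrated with a nearest-neighbour search. This is a minimal sketch, assuming Euclidean distance as the similarity measure and equal-length series; the slides do not fix a distance function, and the small database below is invented for the example.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def most_similar(query, database):
    """Indexing: return the name of the stored series closest to the query."""
    return min(database, key=lambda name: euclidean(query, database[name]))

# Hypothetical database of short series, for illustration only.
database = {
    "rising":  [1, 2, 3, 4, 5],
    "falling": [5, 4, 3, 2, 1],
    "flat":    [3, 3, 3, 3, 3],
}
query = [1.2, 2.1, 2.9, 4.2, 4.8]
nearest = most_similar(query, database)
print(nearest)
```

If the stored names are treated as class labels, the same function acts as a one-nearest-neighbour classifier, which links the indexing and classification tasks.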
[Figure: panels illustrating Clustering, Classification, Motif Discovery, and Visualization]
Temporal data
- Concise definition of an anomaly in multivariate
- Behavior envelopes
- Feature extraction

What is difficult?
- Differing data formats
- Differing sampling rates
- Noise, missing values, etc.
Temporal data
Similar Shape
Similar Structure
Temporal data
Dealing with Raw Data & Pre-processing
Similar Shape
Similar Structure
Temporal data
Dealing with Raw Data & Pre-processing
- Offset translation
- Amplitude scaling
- Linear trend
- Noise
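The four distortions listed above each have a common remedy. The sketch below is one possible set of choices and not necessarily the one used in the lecture: mean subtraction for offset translation, division by the standard deviation for amplitude scaling, subtraction of a least-squares line for linear trend, and a moving average for noise. All function names are hypothetical.

```python
from statistics import mean, pstdev

def remove_offset(series):
    """Offset translation: subtract the mean so the series is centred on 0."""
    m = mean(series)
    return [x - m for x in series]

def scale_amplitude(series):
    """Amplitude scaling: centre, then divide by the standard deviation.
    Assumes the series is not constant (standard deviation > 0)."""
    m, s = mean(series), pstdev(series)
    return [(x - m) / s for x in series]

def remove_linear_trend(series):
    """Linear trend: subtract the least-squares straight line."""
    n = len(series)
    t_mean, x_mean = (n - 1) / 2, mean(series)
    slope = (sum((t - t_mean) * (x - x_mean) for t, x in enumerate(series))
             / sum((t - t_mean) ** 2 for t in range(n)))
    return [x - (x_mean + slope * (t - t_mean)) for t, x in enumerate(series)]

def smooth(series, window=3):
    """Noise: centred moving average (the window shrinks at the edges)."""
    half = window // 2
    return [mean(series[max(0, i - half): i + half + 1])
            for i in range(len(series))]

raw = [10, 12, 14, 16, 18]
print(remove_offset(raw))
print(remove_linear_trend(raw))
print(smooth([0, 3, 0, 3, 0]))
```

Applying these steps before comparing two series means that the distance measure reflects shape rather than level, scale, or drift.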
[Figure: time-series plot; date axis starting 29/09/2009]
A. Hryshchenko
[Figure: time-series plot; date axis starting 10/09/2009]
Spatial data
- Location Prediction
- Spatial Outlier Detection
- Co-location Rules
- Spatial Clustering & complete Randomness
Learning from Data: Data Mining in Engineering
Location Prediction as a classification problem
[Figure: maps of nest locations, distance to open water, vegetation durability, and water depth]
Co-Location
Direction relations
Ordering of spatial features in space
Topological relations
Characterise the type of intersection between spatial features
Distance Relations
If dist is a distance function and c is some real number, the possible relations between features A and B are:
1. dist(A,B) > c
2. dist(A,B) < c
3. dist(A,B) = c
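The three distance relations can be checked directly in code. This is a minimal sketch that assumes A and B are points and that dist is the Euclidean distance; the slides leave the distance function open, and the function name is hypothetical.

```python
import math

def distance_relation(a, b, c):
    """Classify dist(A, B) against the threshold c."""
    d = math.dist(a, b)  # Euclidean distance between two points
    if math.isclose(d, c):
        return "dist(A,B) = c"
    return "dist(A,B) > c" if d > c else "dist(A,B) < c"

A, B = (0.0, 0.0), (3.0, 4.0)   # made-up coordinates; dist(A, B) is 5
for threshold in (4.0, 5.0, 6.0):
    print(threshold, distance_relation(A, B, threshold))
```

The equality case uses a tolerance because exact equality of floating-point distances is rarely meaningful in practice.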
Direction Relations
If the directions of B and C are required with respect to A:
1. Define a representative point, rep(A).
2. rep(A) defines the origin of a virtual coordinate system.
3. The quadrants and half planes define the direction relations.
4. B can have two values {northeast, east}.
5. The exact direction relation is northeast.
[Figure: C lies north of A and B lies northeast of A; rep(A) marks the origin]
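The direction procedure can be sketched as follows. Two assumptions are made that the slides do not state: the representative point is taken to be the mean of a feature's vertices, and the "exact" direction of a feature is taken to be the direction of its own representative point. The coordinates are invented so that the result matches the slide's example.

```python
def rep(points):
    """Representative point: the mean of the vertices (an assumption)."""
    xs, ys = zip(*points)
    return (sum(xs) / len(xs), sum(ys) / len(ys))

def direction(origin, p):
    """Direction of point p in the coordinate system centred on origin."""
    dx, dy = p[0] - origin[0], p[1] - origin[1]
    ns = "north" if dy > 0 else "south" if dy < 0 else ""
    ew = "east" if dx > 0 else "west" if dx < 0 else ""
    return (ns + ew) or "same position"

# Made-up rectangles for features A, B, and C.
A = [(0, 0), (2, 0), (2, 2), (0, 2)]
B = [(3, 1), (5, 1), (5, 3), (3, 3)]
C = [(0, 4), (2, 4), (2, 5), (0, 5)]

origin = rep(A)
b_values = {direction(origin, p) for p in B}  # every direction B touches
b_exact = direction(origin, rep(B))           # direction of rep(B)
c_exact = direction(origin, rep(C))
print(b_values, b_exact, c_exact)
```

With these coordinates B spans two direction values, east and northeast, while its representative point falls in the northeast quadrant.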
Topological relations describe how geometries intersect spatially.
Simple geometry types:
- Point, 0-dimension
- Line, 1-dimension
- Polygon, 2-dimension
Topological relations are defined using any one of the following models, where I, B, and E denote the interior, boundary, and exterior of a feature:
- 4IM, four-intersection model (only I and B considered)
- 9IM, nine-intersection model (B, I, and E)
- DE-9IM, dimensionally extended nine-intersection model
        I(A)  B(A)  E(A)
B(B)     1     0     1
E(B)     2     1     2

        I(B)  B(B)  E(B)
I(A)     T     *     T
B(A)     *     *     *
E(A)     T     *     *

That is, we are interested in patterns in the matrix rather than the actual values.
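Matching a DE-9IM matrix against a pattern such as T*T***T** is a simple character-by-character test. The sketch below follows the usual DE-9IM conventions (T means a non-empty intersection, F an empty one, * any value). The example matrix 212101212 is an assumption: its first row is not legible in the slide and is taken here to be 2 1 2.

```python
def matches(matrix, pattern):
    """Check a DE-9IM matrix string against a pattern string.

    matrix  : 9 characters from {'F', '0', '1', '2'}, written row by row
    pattern : 9 characters from {'T', 'F', '*', '0', '1', '2'}
    """
    if len(matrix) != 9 or len(pattern) != 9:
        raise ValueError("both strings must have 9 characters")
    for m, p in zip(matrix, pattern):
        if p == "*":
            continue                  # any value is acceptable
        if p == "T" and m == "F":
            return False              # T requires a non-empty intersection
        if p != "T" and m != p:
            return False              # F, 0, 1, 2 must match exactly
    return True

overlap_pattern = "T*T***T**"
print(matches("212101212", overlap_pattern))   # assumed matrix, see lead-in
print(matches("FF2FF1212", overlap_pattern))   # two disjoint polygons
```

Because only the pattern matters, the same test accepts any pair of geometries whose interiors intersect and whose interiors each reach the other's exterior, whatever the dimensions of those intersections.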
Learning Approaches
- Frequent Pattern Mining
- Classifiers
- Clustering
- Neural Networks
- Fuzzy Logic
- Genetic algorithms
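Learning from Data: What Is Frequent Pattern Mining?
In general, given a count of source data S, an association rule indicates that the events A1, A2, ..., An will most likely associate with the event B.
S = A1 + A2 + ... + An + B + other events
A1, A2, ..., An => B
The Support and Confidence level of this association is:
$\mathrm{support} = P(A_1, A_2, \ldots, A_n, B)$
$\mathrm{confidence} = P(B \mid A_1, A_2, \ldots, A_n) = \mathrm{support}(\{A_1, \ldots, A_n, B\}) \,/\, \mathrm{support}(\{A_1, \ldots, A_n\})$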
Rule Measures: Support and Confidence
[Figure: Venn diagram of customers who buy diaper and customers who buy both]
Find all the rules X & Y => Z with minimum confidence and support:
- support, s: probability that a transaction contains {X, Y, Z}
- confidence, c: conditional probability that a transaction having {X, Y} also contains Z
With minimum support 50% and minimum confidence 50%, we have:
- A => C (50%, 66.6%)
- C => A (50%, 100%)
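Both measures are easy to compute from a list of transactions. The slide's transaction table did not survive, so the four transactions below are an assumption, chosen so that the rules A => C and C => A reproduce the quoted percentages. The function names are hypothetical.

```python
# Assumed illustrative transactions (not recoverable from the slide).
transactions = [
    {"A", "B", "C"},
    {"A", "C"},
    {"A", "D"},
    {"B", "E", "F"},
]

def support(itemset, db):
    """Fraction of transactions that contain every item of itemset."""
    return sum(1 for t in db if set(itemset) <= t) / len(db)

def confidence(lhs, rhs, db):
    """Conditional probability that a transaction with lhs also has rhs."""
    return support(set(lhs) | set(rhs), db) / support(lhs, db)

s_ac = support({"A", "C"}, transactions)
c_a_to_c = confidence({"A"}, {"C"}, transactions)
c_c_to_a = confidence({"C"}, {"A"}, transactions)
print(f"A => C: support {s_ac:.1%}, confidence {c_a_to_c:.1%}")
print(f"C => A: support {s_ac:.1%}, confidence {c_c_to_a:.1%}")
```

Support is symmetric in the two sides of a rule, while confidence is not, which is why the two rules share a support value but differ in confidence.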
o3/3/2008
Major idea:
A subset of a frequent itemset must be frequent
E.g., if {beer, diaper, nuts} is frequent, {beer, diaper} must be. If any is infrequent, its superset cannot be frequent!
Mining Association Rules Example
For the rule A => C:
support = support({A, C}) = 50%
confidence = support({A, C}) / support({A}) = 66.6%
Any subset of a frequent itemset must be frequent.
Ck: candidate itemset of size k
Lk: frequent itemset of size k

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t
    Lk+1 = candidates in Ck+1 with minimum support
end
return ∪k Lk;
The Apriori Algorithm Example

Database D
TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5

Scan D -> L1
itemset   sup.
{1}       2
{2}       3
{3}       3
{5}       3

C2 (then scan D)
itemset
{1 2}
{1 3}
{1 5}
{2 3}
{2 5}
{3 5}

C3 (then scan D)
itemset
{2 3 5}

L3
itemset   sup.
{2 3 5}   2
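The worked example can be reproduced with a short implementation. This is a sketch of the level-wise Apriori idea, not an optimized version; it assumes a minimum support count of 2, which is consistent with item 4 being dropped from L1 and with {2 3 5} having support 2.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {frequent itemset (frozenset): support count}."""
    transactions = [frozenset(t) for t in transactions]
    items = {item for t in transactions for item in t}
    candidates = {frozenset([item]) for item in items}
    frequent = {}
    k = 1
    while candidates:
        # Scan the database and count every candidate of size k.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)
        # Join frequent k-itemsets into (k+1)-candidates, then prune any
        # candidate that has an infrequent k-subset (the Apriori property).
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        candidates = {c for c in joined
                      if all(frozenset(s) in level
                             for s in combinations(c, k - 1))}
    return frequent

D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
result = apriori(D, min_count=2)
for itemset, count in sorted(result.items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

The pruning step is what keeps {1 2 3} and {1 3 5} out of C3: each has a 2-subset that is not frequent, so neither needs to be counted.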
Mining without Candidate Generation
The Apriori candidate generate-and-test method suffers from the following costs:
It
It may need to repeatedly scan the database and check a large set of candidates by pattern matching
Divide-and-conquer
Compress the database: represent the frequent items as a frequent-pattern tree (FP-tree)
Steps:
1. Scan the database as in Apriori and derive the frequent 1-itemsets into a list L.
2. Create the root of the tree (NULL). Scan the database, process the items of each transaction in L order (by support count), and create a branch for each transaction.
3. Mine the FP tree.

Step 1: Create a table of candidate data items in descending order.
Step 2: Build the Frequent Pattern Tree according to each event of the candidate data items.
Step 3: Link the table with the tree.
Transactional data Example

TID    List of item IDs
T100   I1, I2, I5
T200   I2, I4
T300   I2, I3
T400   I1, I2, I4
T500   I1, I3
T600   I2, I3
T700   I1, I3
T800   I1, I2, I3, I5
T900   I1, I2, I3

Step 1: Get the frequent item set in descending order, with a user-required support level of 2.

Item   Support count
I2     7
I1     6
I3     6
I4     2
I5     2
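Step 1 amounts to counting items and sorting by count. The sketch below reproduces the table above; breaking ties alphabetically is an assumption that happens to give the same order as the slide.

```python
from collections import Counter

transactions = {
    "T100": ["I1", "I2", "I5"],
    "T200": ["I2", "I4"],
    "T300": ["I2", "I3"],
    "T400": ["I1", "I2", "I4"],
    "T500": ["I1", "I3"],
    "T600": ["I2", "I3"],
    "T700": ["I1", "I3"],
    "T800": ["I1", "I2", "I3", "I5"],
    "T900": ["I1", "I2", "I3"],
}
min_support = 2

# Count how many transactions contain each item.
counts = Counter(item for items in transactions.values() for item in items)

# Keep frequent items; sort by descending count, ties broken by item name.
frequent_items = sorted(
    ((item, n) for item, n in counts.items() if n >= min_support),
    key=lambda pair: (-pair[1], pair[0]),
)
print(frequent_items)
```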
Step 3: T200 = I2, I4
Step 4: T300 = I2, I3
Step 6: T500 = I1, I3
Step 7: T600 = I2, I3
Step 9: T800 = I1, I2, I3, I5
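The tree-building steps insert one transaction at a time, with each transaction's items sorted in the order of the frequency table. The sketch below is a minimal FP-tree construction under that reading; the class and function names are hypothetical, and the header-table links and the mining phase are omitted.

```python
from collections import Counter

class Node:
    """One FP-tree node: an item, its count, and its children by item."""
    def __init__(self, item):
        self.item, self.count, self.children = item, 0, {}

def build_fp_tree(transactions, min_support):
    counts = Counter(i for t in transactions for i in t)
    # Rank items by descending count, ties broken by name (an assumption).
    order = sorted(counts, key=lambda i: (-counts[i], i))
    rank = {item: r for r, item in enumerate(order)}
    root = Node(None)
    for t in transactions:
        node = root
        kept = sorted((i for i in t if counts[i] >= min_support), key=rank.get)
        for item in kept:
            if item not in node.children:
                node.children[item] = Node(item)   # start a new branch
            node = node.children[item]
            node.count += 1                        # shared prefix: just count
    return root

def paths(node, prefix=()):
    """Yield (path from root, count) for every node below `node`."""
    for child in node.children.values():
        path = prefix + (child.item,)
        yield path, child.count
        yield from paths(child, path)

db = [["I1", "I2", "I5"], ["I2", "I4"], ["I2", "I3"], ["I1", "I2", "I4"],
      ["I1", "I3"], ["I2", "I3"], ["I1", "I3"], ["I1", "I2", "I3", "I5"],
      ["I1", "I2", "I3"]]
tree = dict(paths(build_fp_tree(db, min_support=2)))
for path, count in sorted(tree.items()):
    print(" -> ".join(path), count)
```

Transactions that share a prefix in the sorted order share a path in the tree, which is where the compression of the database comes from.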
In this table, there are four attributes: outlook, temperature, humidity and wind; and the outcome is whether to play or not.
(a) Show the possible Association Rules that can determine the outcome without support and confidence level.
(b) Show the Support level and Confidence level of the following association rule: If temperature = cool then humidity = normal.
GA
Use divide and conquer.
A set of independent values.
The idea is to ask a lot of questions that lead you to a node:
- Start at the root.
- Create leaves.
- Leaves become nodes for other leaves.
- A leaf node provides a classification.

Uses the induction principle: given a collection of pairs <x, f(x)>, return a hypothesis h(x) that approximates f(x).
Outlook    Temperature  Humidity  Windy  Play?
sunny      hot          high      false  No
sunny      hot          high      true   No
overcast   hot          high      false  Yes
rain       mild         high      false  Yes
rain       cool         normal    false  Yes
rain       cool         normal    true   No
overcast   cool         normal    true   Yes
sunny      mild         high      false  No
sunny      cool         normal    false  Yes
rain       mild         normal    false  Yes
sunny      mild         normal    true   Yes
overcast   mild         high      true   Yes
overcast   hot          normal    false  Yes
rain       mild         high      true   No
Outlook
  sunny    -> Humidity
                high   -> No
                normal -> Yes
  overcast -> Yes
  rain     -> Windy
                true   -> No
                false  -> Yes
Top-down tree construction: at the start, all training examples are at the root. Partition the examples recursively by choosing one attribute each time.
Bottom-up tree pruning: remove sub-trees or branches, in a bottom-up manner, to improve the estimated accuracy on new cases.
At each node, the available attributes are evaluated on the basis of how well they separate the classes of the training examples. A goodness function is used for this purpose.
Typical goodness functions:
- information gain (ID3/C4.5)
- information gain ratio
- gini index
Computing Information
Given a probability distribution, the information required to predict an event is the distribution's entropy.
Entropy gives the information required in bits (this can involve fractions of bits!).
Given probabilities $p_1, p_2, \ldots, p_n$ whose sum is 1, entropy is defined as:
$\mathrm{entropy}(p_1, p_2, \ldots, p_n) = -p_1 \log_2 p_1 - p_2 \log_2 p_2 - \cdots - p_n \log_2 p_n$
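Outlook = Sunny:
$\mathrm{info}([2,3]) = \mathrm{entropy}(2/5, 3/5) = 0.971$ bits
Outlook = Overcast:
$\mathrm{info}([4,0]) = \mathrm{entropy}(1, 0) = 0$ bits
Outlook = Rainy:
$\mathrm{info}([3,2]) = \mathrm{entropy}(3/5, 2/5) = 0.971$ bits
Information gain:
$\mathrm{gain}(\text{Outlook}) = \mathrm{info}([9,5]) - \mathrm{info}([2,3],[4,0],[3,2]) = 0.940 - 0.693 = 0.247$ bits
Thus, gain("Outlook") = 0.247 bits. Similarly, the information gain for the rest of the attributes from the weather data is as follows:
gain("Temperature") = 0.029 bits
gain("Humidity") = 0.152 bits
gain("Windy") = 0.048 bits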
Therefore, we pick Outlook, the attribute with the highest gain, as the root. We then continue splitting on one of its branches, Sunny or Rainy; say we select Sunny.
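The gain values can be recomputed from the weather table. The sketch below encodes the 14 rows of that table and evaluates the information gain of each attribute; the function and variable names are hypothetical.

```python
from collections import Counter
from math import log2

ATTRIBUTES = ["outlook", "temperature", "humidity", "windy"]
# (outlook, temperature, humidity, windy, play) for the 14 examples.
rows = [
    ("sunny", "hot", "high", "false", "No"),
    ("sunny", "hot", "high", "true", "No"),
    ("overcast", "hot", "high", "false", "Yes"),
    ("rain", "mild", "high", "false", "Yes"),
    ("rain", "cool", "normal", "false", "Yes"),
    ("rain", "cool", "normal", "true", "No"),
    ("overcast", "cool", "normal", "true", "Yes"),
    ("sunny", "mild", "high", "false", "No"),
    ("sunny", "cool", "normal", "false", "Yes"),
    ("rain", "mild", "normal", "false", "Yes"),
    ("sunny", "mild", "normal", "true", "Yes"),
    ("overcast", "mild", "high", "true", "Yes"),
    ("overcast", "hot", "normal", "false", "Yes"),
    ("rain", "mild", "high", "true", "No"),
]

def entropy(labels):
    """Entropy in bits of the class distribution of a list of labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def gain(rows, index):
    """Information gain of splitting the rows on the attribute at index."""
    remainder = 0.0
    for value in {r[index] for r in rows}:
        subset = [r[-1] for r in rows if r[index] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy([r[-1] for r in rows]) - remainder

gains = {name: gain(rows, i) for i, name in enumerate(ATTRIBUTES)}
for name, g in gains.items():
    print(f"gain({name}) = {g:.3f} bits")
best = max(gains, key=gains.get)
```

Calling the same gain function on only the rows of one branch gives the values needed for the next level of the tree.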
Continuing to split
Similarly, compute the information gain for the remaining attributes (temperature, windy, and humidity) within the selected branch.
Note: not all leaves need to be pure; sometimes identical instances have different classes.
Splitting stops when the data can't be split any further.
Entropy is a function that satisfies all three properties.
The multistage property:
$\mathrm{entropy}(p, q, r) = \mathrm{entropy}(p, q + r) + (q + r)\,\mathrm{entropy}\!\left(\frac{q}{q+r}, \frac{r}{q+r}\right)$
$\mathrm{info}([2,3,4]) = \mathrm{entropy}(2/9, 3/9, 4/9)$
$= \mathrm{entropy}(2/9, 7/9) + 7/9 \cdot \mathrm{entropy}(3/7, 4/7)$
$= -2/9 \log(2/9) - 7/9 \log(7/9) + 7/9\,[-3/7 \log(3/7) - 4/7 \log(4/7)]$

Simplification of computation:
$\mathrm{info}([2,3,4]) = \mathrm{entropy}(2/9, 3/9, 4/9)$
$= -2/9 \log(2/9) - 3/9 \log(3/9) - 4/9 \log(4/9)$
$= [-2 \log 2 - 3 \log 3 - 4 \log 4 + 9 \log 9]\,/\,9$
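A quick numerical check confirms that the direct, multistage, and simplified forms of info([2,3,4]) agree. The sketch assumes base-2 logarithms, so the result is in bits; the helper names are hypothetical.

```python
from math import log2, isclose

def entropy(*probs):
    """Entropy in bits; zero probabilities contribute nothing."""
    return -sum(p * log2(p) for p in probs if p > 0)

def info(counts):
    """Entropy of the class distribution given by raw counts."""
    total = sum(counts)
    return entropy(*(c / total for c in counts))

direct = info([2, 3, 4])
multistage = entropy(2 / 9, 7 / 9) + 7 / 9 * entropy(3 / 7, 4 / 7)
simplified = (-2 * log2(2) - 3 * log2(3) - 4 * log2(4) + 9 * log2(9)) / 9

print(f"{direct:.4f} {multistage:.4f} {simplified:.4f}")
```

The simplified form works on the raw counts and needs only one division, which is why it is convenient for hand calculation.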