
Basic Data Mining Techniques

Overview

• Data & Types of Data
• Fuzzy Sets
• Information Retrieval
• Machine Learning
• Statistics & Estimation Techniques
• Similarity Measures
• Decision Trees

What is Data?

• Collection of data objects and their attributes
• An attribute is a property or characteristic of an object
  – Examples: eye color of a person, temperature, etc.
  – An attribute is also known as a variable, field, characteristic, or feature
• A collection of attributes describes an object
  – An object is also known as a record, point, case, sample, entity, or instance

Tid | Refund | Marital Status | Taxable Income | Cheat
 1  | Yes    | Single         | 125K           | No
 2  | No     | Married        | 100K           | No
 3  | No     | Single         |  70K           | No
 4  | Yes    | Married        | 120K           | No
 5  | No     | Divorced       |  95K           | Yes
 6  | No     | Married        |  60K           | No
 7  | Yes    | Divorced       | 220K           | No
 8  | No     | Single         |  85K           | Yes
 9  | No     | Married        |  75K           | No
10  | No     | Single         |  90K           | Yes

(The rows are the objects; the columns Refund, Marital Status, Taxable Income, and Cheat are the attributes.)

Attribute Values

• Attribute values are numbers or symbols assigned to an attribute
• Distinction between attributes and attribute values
  – The same attribute can be mapped to different attribute values
    • Example: height can be measured in feet or meters
  – Different attributes can be mapped to the same set of values
    • Example: attribute values for ID and age are integers
    • But the properties of the attribute values can differ: ID has no limit, while age has a maximum and minimum value

Types of Attributes

• There are different types of attributes
  – Nominal
    • Examples: ID numbers, eye color, zip codes
  – Ordinal
    • Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short}
  – Interval
    • Examples: calendar dates, temperatures in Celsius or Fahrenheit
  – Ratio
    • Examples: temperature in Kelvin, length, time, counts

Properties of Attribute Values

• The type of an attribute depends on which of the following properties it possesses:
  – Distinctness: =, ≠
  – Order: <, >
  – Addition: +, -
  – Multiplication: *, /

• Nominal attribute: distinctness
• Ordinal attribute: distinctness & order
• Interval attribute: distinctness, order & addition
• Ratio attribute: all 4 properties

Attribute Type | Description | Examples | Operations
Nominal  | The values of a nominal attribute are just different names, i.e., nominal attributes provide only enough information to distinguish one object from another. (=, ≠) | zip codes, employee ID numbers, eye color, sex: {male, female} | mode, entropy, contingency correlation, χ² test
Ordinal  | The values of an ordinal attribute provide enough information to order objects. (<, >) | hardness of minerals, {good, better, best}, grades, street numbers | median, percentiles, rank correlation, run tests, sign tests
Interval | For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists. (+, -) | calendar dates, temperature in Celsius or Fahrenheit | mean, standard deviation, Pearson's correlation, t and F tests
Ratio    | For ratio variables, both differences and ratios are meaningful. (*, /) | temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current | geometric mean, harmonic mean, percent variation

Attribute Level | Transformation | Comments
Nominal  | Any permutation of values | If all employee ID numbers were reassigned, would it make any difference?
Ordinal  | An order-preserving change of values, i.e., new_value = f(old_value), where f is a monotonic function | An attribute encompassing the notion of good, better, best can be represented equally well by the values {1, 2, 3} or by {0.5, 1, 10}.
Interval | new_value = a * old_value + b, where a and b are constants | Thus, the Fahrenheit and Celsius temperature scales differ in where their zero value is and in the size of a unit (degree).
Ratio    | new_value = a * old_value | Length can be measured in meters or feet.

Discrete and Continuous Attributes

• Discrete Attribute
  – Has only a finite or countably infinite set of values
  – Examples: zip codes, counts, or the set of words in a collection of documents
  – Often represented as integer variables
  – Note: binary attributes are a special case of discrete attributes

• Continuous Attribute
  – Has real numbers as attribute values
  – Examples: temperature, height, or weight
  – Practically, real values can only be measured and represented using a finite number of digits
  – Continuous attributes are typically represented as floating-point variables

Types of data sets

• Record
  – Data Matrix
  – Document Data
  – Transaction Data
• Graph
  – World Wide Web
  – Molecular Structures
• Ordered
  – Spatial Data
  – Temporal Data
  – Sequential Data
  – Genetic Sequence Data

Characteristics of Structured Data

• Dimensionality
  – Curse of Dimensionality
• Sparsity
  – Only presence counts
• Resolution
  – Patterns depend on the scale

Record Data

• Data that consists of a collection of records, each of which consists of a fixed set of attributes (see the Tid / Refund / Marital Status / Taxable Income / Cheat table shown earlier)

Data Matrix

• If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute

• Such a data set can be represented by an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute

Projection of x Load | Projection of y Load | Distance | Load | Thickness
10.23 | 5.27 | 15.22 | 2.7 | 1.2
12.65 | 6.25 | 16.22 | 2.2 | 1.1

Document Data

• Each document becomes a 'term' vector,
  – each term is a component (attribute) of the vector,
  – the value of each component is the number of times the corresponding term occurs in the document.

           | team | coach | play | ball | score | game | win | lost | timeout | season
Document 1 |  3   |  0    |  5   |  0   |  2    |  6   |  0  |  2   |  0      |  2
Document 2 |  0   |  7    |  0   |  2   |  1    |  0   |  0  |  3   |  0      |  0
Document 3 |  0   |  1    |  0   |  0   |  1    |  2   |  2  |  0   |  3      |  0
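As an illustration (added here, not part of the original slides), a minimal Python sketch of how a document-term vector like the rows above can be built; the vocabulary and document are hypothetical:

    from collections import Counter

    def term_vector(document, vocabulary):
        """Count how often each vocabulary term occurs in the document."""
        counts = Counter(document.lower().split())
        return [counts[term] for term in vocabulary]

    # Hypothetical vocabulary and document, for illustration only.
    vocabulary = ["team", "coach", "play", "ball", "score",
                  "game", "win", "lost", "timeout", "season"]
    doc = "the team hopes to win the game before the season ends"
    print(term_vector(doc, vocabulary))  # [1, 0, 0, 0, 0, 1, 1, 0, 0, 1]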

Transaction Data

• A special type of record data, where
  – each record (transaction) involves a set of items.
  – For example, consider a grocery store. The set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products that were purchased are the items.

TID | Items
 1  | Bread, Coke, Milk
 2  | Beer, Bread
 3  | Beer, Coke, Diaper, Milk
 4  | Beer, Bread, Diaper, Milk
 5  | Coke, Diaper, Milk

Graph Data

• Examples: a generic graph (figure omitted: a small graph with labeled edges) and HTML links:

<a href="papers/papers.html#bbbb">
Data Mining </a>
<li>
<a href="papers/papers.html#aaaa">
Graph Partitioning </a>
<li>
<a href="papers/papers.html#aaaa">
Parallel Solution of Sparse Linear System of Equations </a>
<li>
<a href="papers/papers.html#ffff">
N-Body Computation and Dense Linear System Solvers

Chemical Data

Benzene molecule: C6H6 (figure of the molecular structure omitted)

Ordered Data

Sequences of transactions (figure omitted: each element of the sequence is a set of items/events)

Ordered Data

Genomic sequence data:

GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG

Ordered Data

Spatio-temporal data: average monthly temperature of land and ocean (figure omitted)

Data Quality

• What kinds of data quality problems are there?
• How can we detect problems with the data?
• What can we do about these problems?

• Examples of data quality problems:
  – noise and outliers
  – missing values
  – duplicate data

Noise

• Noise refers to the modification of original values
  – Examples: distortion of a person's voice when talking on a poor phone connection, or "snow" on a television screen

(Figure omitted: two sine waves, and the same two sine waves with noise added)

Outliers

• Outliers are data objects with characteristics that are considerably different from those of most of the other data objects in the data set

Missing Values

• Reasons for missing values
  – Information is not collected (e.g., people decline to give their age and weight)
  – Attributes may not be applicable to all cases (e.g., annual income is not applicable to children)

• Handling missing values
  – Eliminate data objects
  – Estimate missing values
  – Ignore the missing value during analysis
  – Replace with all possible values (weighted by their probabilities)

Duplicate Data

• A data set may include data objects that are duplicates, or almost duplicates, of one another
  – A major issue when merging data from heterogeneous sources

• Example:
  – The same person with multiple email addresses

• Data cleaning
  – The process of dealing with duplicate data issues

Data Preprocessing

• Aggregation
• Sampling
• Dimensionality Reduction
• Feature subset selection
• Feature creation
• Discretization and Binarization
• Attribute Transformation

Aggregation

• Combining two or more attributes (or objects) into a single attribute (or object)

• Purpose
  – Data reduction
    • Reduce the number of attributes or objects
  – Change of scale
    • Cities aggregated into regions, states, countries, etc.
  – More "stable" data
    • Aggregated data tends to have less variability

• Example: variation of precipitation in Australia (figures omitted: standard deviation of average monthly precipitation vs. standard deviation of average yearly precipitation)

Sampling

• Sampling is the main technique employed for data selection.
  – It is often used for both the preliminary investigation of the data and the final data analysis.

• Statisticians sample because obtaining the entire set of data of interest is too expensive or time consuming.

• Sampling is used in data mining because processing the entire set of data of interest is too expensive or time consuming.

Sampling ...

• The key principle for effective sampling is the following:
  – Using a sample will work almost as well as using the entire data set, if the sample is representative.
  – A sample is representative if it has approximately the same property (of interest) as the original set of data.

Types of Sampling

• Simple random sampling
  – There is an equal probability of selecting any particular item

• Sampling without replacement
  – As each item is selected, it is removed from the population

• Sampling with replacement
  – Objects are not removed from the population as they are selected for the sample
  – In sampling with replacement, the same object can therefore be picked more than once

• Stratified sampling
  – Split the data into several partitions; then draw random samples from each partition

(A code sketch of these schemes appears below, after the next slide.)

Sample Size

(Figure omitted: the same point set drawn with 8000 points, 2000 points, and 500 points)
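As promised above, a minimal Python sketch (added, not from the original slides) of the three sampling schemes on a hypothetical population:

    import random

    data = list(range(100))  # hypothetical population, for illustration only

    # Sampling without replacement: each item can appear at most once.
    without_repl = random.sample(data, k=10)

    # Sampling with replacement: the same item can be picked more than once.
    with_repl = [random.choice(data) for _ in range(10)]

    # Stratified sampling: partition the data, then sample from each partition.
    strata = [data[:50], data[50:]]  # two hypothetical partitions (strata)
    stratified = [x for s in strata for x in random.sample(s, k=5)]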

Curse of Dimensionality

• When dimensionality increases, data becomes increasingly sparse in the space that it occupies

• Definitions of density and distance between points, which are critical for clustering and outlier detection, become less meaningful

(Figure omitted: randomly generate 500 points and compute the difference between the maximum and minimum distance between any pair of points, plotted against the number of dimensions)

Dimensionality Reduction

• Purpose:
  – Avoid the curse of dimensionality
  – Reduce the amount of time and memory required by data mining algorithms
  – Allow data to be more easily visualized
  – May help to eliminate irrelevant features or reduce noise

• Techniques
  – Principal Component Analysis
  – Singular Value Decomposition
  – Others: supervised and non-linear techniques

Dimensionality Reduction: PCA

• The goal is to find a projection that captures the largest amount of variation in the data
• Find the eigenvectors of the covariance matrix
• The eigenvectors define the new space

(Figure omitted: points in the (x1, x2) plane with the principal eigenvector e drawn along the direction of greatest variance)
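A minimal sketch (added, not from the original slides) of PCA via the eigenvectors of the covariance matrix, using NumPy; the 2-D data set is hypothetical:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])  # hypothetical data

    X_centered = X - X.mean(axis=0)          # center the data
    cov = np.cov(X_centered, rowvar=False)   # 2x2 covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvectors of the covariance matrix

    # Sort by decreasing eigenvalue and project onto the top component.
    order = np.argsort(eigvals)[::-1]
    top = eigvecs[:, order[:1]]
    X_reduced = X_centered @ top             # data expressed in the new 1-D space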

Fuzzy Sets and Logic

• Fuzzy set: a set whose membership function is a real-valued function with output in the range [0,1].
  – f(x): probability that x is in F.
  – 1 - f(x): probability that x is not in F.

• Example
  – T = {x | x is a person and x is tall}
  – Let f(x) be the probability that x is tall
  – Here f is the membership function

• DM: prediction and classification are often fuzzy.

Fuzzy Sets

(Figure omitted: membership functions for Short, Medium, and Tall as a function of height. With crisp sets, membership jumps between 0 and 1; with fuzzy sets, membership varies smoothly within [0,1].)
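As an illustration (added, not from the original slides), a small Python sketch of a fuzzy membership function for "tall"; the breakpoints are hypothetical:

    def tall(height_cm):
        """Fuzzy membership in 'Tall': 0 below 160 cm, 1 above 190 cm,
        and a linear ramp in between (breakpoints are hypothetical)."""
        if height_cm <= 160:
            return 0.0
        if height_cm >= 190:
            return 1.0
        return (height_cm - 160) / 30.0

    print(tall(155), tall(175), tall(195))  # 0.0 0.5 1.0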

Classification/Prediction is Fuzzy

(Figure omitted: loan approval as a function of salary and loan amount. A 0-1 decision gives a hard Accept/Reject boundary; a fuzzy decision lets the Accept and Reject regions shade into one another.)

Information Retrieval

• Information Retrieval (IR): retrieving desired information from textual data
  – Library science
  – Digital libraries
  – Web search engines
  – Traditionally has been keyword based
  – Sample query:
    • Find all documents about "data mining".

• DM: similarity measures; mine text or Web data

Information Retrieval (cont'd)

• Similarity: a measure of how close a query is to a document.
• Documents that are "close enough" are retrieved.
• Metrics:
  – Precision = |Relevant and Retrieved| / |Retrieved|
  – Recall = |Relevant and Retrieved| / |Relevant|

IR Query Result Measures and Classification

IR:
             | Retrieved | Not Retrieved
Relevant     |    20     |     10
Not Relevant |    45     |     25

Classification (the analogous 2x2, with the same counts):
         | Classified Tall | Classified Not Tall
Tall     |       20        |        10
Not Tall |       45        |        25
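A minimal Python sketch (added, not from the original slides) of precision and recall computed from the 2x2 counts above:

    def precision_recall(relevant_retrieved, retrieved, relevant):
        """Precision = |Relevant and Retrieved| / |Retrieved|;
        Recall = |Relevant and Retrieved| / |Relevant|."""
        return relevant_retrieved / retrieved, relevant_retrieved / relevant

    # From the table: 20 relevant retrieved, 65 retrieved in total,
    # 30 relevant in total.
    p, r = precision_recall(20, 20 + 45, 20 + 10)
    print(p, r)  # ~0.308, ~0.667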

Machine Learning

• Machine Learning (ML): the area of AI that examines how to devise algorithms that can learn.
• Techniques from ML are often used in classification and prediction.
• Supervised learning: learns by example.
• Unsupervised learning: learns without knowledge of correct answers.
• Machine learning often deals with small or static datasets, whereas data mining is targeted at business users.

• DM: uses many machine learning techniques.

Statistics

• Usually creates simple descriptive models.
• Statistical inference: generalizing a model created from a sample of the data to the entire dataset.
• Exploratory Data Analysis:
  – The data can actually drive the creation of the model.
  – The opposite of the traditional statistical view.

• DM: many data mining methods are based on statistical techniques.

Point Estimation

• Point estimate: an estimate of a population parameter.
• May be made by calculating the parameter for a sample.
• May be used to predict values for missing data.

• Example:
  – R contains 100 employees
  – 99 have salary information
  – The mean salary of these is $50,000
  – Use $50,000 as the value of the remaining employee's salary. Is this a good idea?

Estimation Error

• Bias: the difference between the expected value and the actual value:

  Bias = E[θ̂] − θ

• Mean Squared Error (MSE): the expected value of the squared difference between the estimate and the actual value:

  MSE(θ̂) = E[(θ̂ − θ)²]

• Why square? So that errors do not cancel, and larger deviations are penalized more.
• Root Mean Square Error (RMSE): the square root of the MSE.
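A small Python sketch (added, not from the original slides) of bias, MSE, and RMSE for a set of point estimates of a known actual value; the numbers are hypothetical:

    import math

    def mse(estimates, actual):
        """Mean squared error of a list of estimates against the actual value."""
        return sum((e - actual) ** 2 for e in estimates) / len(estimates)

    estimates = [49.0, 52.0, 50.5, 47.5]   # hypothetical point estimates
    actual = 50.0
    bias = sum(estimates) / len(estimates) - actual
    print(bias, mse(estimates, actual), math.sqrt(mse(estimates, actual)))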

Jackknife Estimate

• Jackknife estimate: an estimate of a parameter obtained by omitting one value from the set of observed values.
• Example: the jackknife estimate of the mean for X = {x1, ..., xn}, omitting xi, is (sketched in code after the next slide):

  θ̂(i) = (1 / (n − 1)) Σ_{j ≠ i} xj

Maximum Likelihood Estimate (MLE)

• Obtain parameter estimates that maximize the probability that the sample data occurs for the specific model.
• Obtain the joint probability of observing the sample data by multiplying the individual probabilities. Likelihood function:

  L(Θ | x1, ..., xn) = f(x1 | Θ) · f(x2 | Θ) · ... · f(xn | Θ)

• Maximize L.
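An illustrative Python sketch (added, not from the slides) of the leave-one-out means that underlie the jackknife estimate of the mean; the data are hypothetical:

    def jackknife_means(xs):
        """Leave-one-out means: the i-th entry omits xs[i]."""
        n = len(xs)
        total = sum(xs)
        return [(total - x) / (n - 1) for x in xs]

    print(jackknife_means([2.0, 4.0, 6.0, 8.0]))  # [6.0, ~5.33, ~4.67, 4.0]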

MLE Example

• Coin toss five times: {H, H, H, H, T}
• Assuming a perfect coin with H and T equally likely, the likelihood of this sequence is:

  L(0.5 | HHHHT) = (0.5)⁵ = 0.03125

• However, if the probability of an H is 0.8, then:

  L(0.8 | HHHHT) = (0.8)⁴ (0.2) = 0.08192

MLE Example (cont'd)

• General likelihood formula, with xi = 1 for heads and xi = 0 for tails:

  L(p | x1, ..., xn) = Π_{i=1..n} p^xi (1 − p)^(1 − xi)

• The maximizing estimate for p is then the sample proportion of heads: p̂ = (Σ xi) / n = 4/5 = 0.8
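An added Python sketch (not from the slides) that evaluates this likelihood over a grid to confirm that p = 0.8 maximizes it for {H,H,H,H,T}:

    def likelihood(p, tosses):
        """L(p | tosses) for Bernoulli data; tosses is a string of 'H'/'T'."""
        L = 1.0
        for t in tosses:
            L *= p if t == "H" else (1 - p)
        return L

    grid = [i / 100 for i in range(1, 100)]
    best = max(grid, key=lambda p: likelihood(p, "HHHHT"))
    print(best, likelihood(best, "HHHHT"))  # 0.8 0.08192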

Expectation-Maximization (EM)

• Solves estimation problems with incomplete data.

Expectation-Maximization Algorithm

• Obtain initial estimates for the parameters.
• Iteratively use the current estimates to fill in the missing data (expectation), then refine the parameter estimates (maximization), until convergence.
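A hedged Python sketch (added, not from the slides) of EM in its simplest form: estimating a mean when some observations are missing, by repeatedly imputing the missing values with the current estimate; the data are hypothetical:

    def em_mean(observed, n_missing, tol=1e-9):
        """Estimate the mean of a sample in which n_missing values are unobserved.
        E-step: impute each missing value with the current mean estimate.
        M-step: recompute the mean over observed + imputed values."""
        n = len(observed) + n_missing
        mean = sum(observed) / len(observed)          # initial estimate
        while True:
            new_mean = (sum(observed) + n_missing * mean) / n
            if abs(new_mean - mean) < tol:
                return new_mean
            mean = new_mean

    print(em_mean([1.0, 5.0, 10.0, 4.0], n_missing=2))  # converges to 5.0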

Expectation-Maximization Example

(Figure omitted: successive EM estimates of the parameter converging over the iterations)

Models Based on Summarization

• Visualization: frequency distribution, mean, variance, median, mode, etc.
• Box plot (figure omitted): a graphical five-number summary of a distribution.

Scatter Diagram

(Figure omitted: scatter diagram of paired data values)

Bayes Theorem

• Posterior probability: P(h1 | xi)
• Prior probability: P(h1)
• Bayes theorem:

  P(h1 | xi) = P(xi | h1) P(h1) / P(xi)

• Assigns probabilities to hypotheses given a data value.

Bayes Theorem Example

• Credit authorizations (hypotheses):
  – h1 = authorize purchase
  – h2 = authorize after further identification
  – h3 = do not authorize
  – h4 = do not authorize but contact police

• Assign twelve data values for all combinations of credit and income:

Credit \ Income |  1  |  2  |  3  |  4
Excellent       | x1  | x2  | x3  | x4
Good            | x5  | x6  | x7  | x8
Bad             | x9  | x10 | x11 | x12

• From training data: P(h1) = 60%; P(h2) = 20%; P(h3) = 10%; P(h4) = 10%.

Bayes Example (cont'd)

Training data:

ID | Income | Credit    | Class | xi
 1 |   4    | Excellent |  h1   | x4
 2 |   3    | Good      |  h1   | x7
 3 |   2    | Excellent |  h1   | x2
 4 |   3    | Good      |  h1   | x7
 5 |   4    | Good      |  h1   | x8
 6 |   2    | Excellent |  h1   | x2
 7 |   3    | Bad       |  h2   | x11
 8 |   2    | Bad       |  h2   | x10
 9 |   3    | Bad       |  h3   | x11
10 |   1    | Bad       |  h4   | x9

Bayes Example (cont'd)

• Calculate P(xi|hj) and P(xi)
• Example: P(x7|h1) = 2/6; P(x4|h1) = 1/6; P(x2|h1) = 2/6; P(x8|h1) = 1/6; and P(xi|h1) = 0 for all other xi.
• Predict the class for x4:
  – Calculate P(hj|x4) for all hj.
  – Place x4 in the class with the largest value.
  – Example (reproduced in the code sketch after the next slide):
    • P(h1|x4) = (P(x4|h1) P(h1)) / P(x4) = (1/6)(0.6)/0.1 = 1.
    • So x4 is placed in class h1.

Hypothesis Testing

• Find a model to explain behavior by creating and then testing a hypothesis about the data.
• The exact opposite of the usual DM approach.
• H0 – null hypothesis: the hypothesis to be tested.
• H1 – alternative hypothesis.
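As promised above, an added Python sketch (not from the slides) reproducing the P(h1|x4) computation with Bayes theorem:

    def posterior(likelihood, prior, evidence):
        """Bayes theorem: P(h | x) = P(x | h) * P(h) / P(x)."""
        return likelihood * prior / evidence

    # From the example: P(x4|h1) = 1/6, P(h1) = 0.6, P(x4) = 1/10.
    print(posterior(1 / 6, 0.6, 0.1))  # 1.0, so x4 is assigned to class h1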

Chi Squared Statistic

• O – observed value
• E – expected value based on the hypothesis

  χ² = Σ (O − E)² / E

• Example (see the code sketch after the next slide):
  – O = {50, 93, 67, 78, 87}
  – E = 75
  – χ² = 15.55, and therefore significant

Regression

• Predict future values based on past values
• Linear regression assumes that a linear relationship exists:

  y = c0 + c1 x1 + ... + cn xn

• Find the ci values that best fit the data
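An added Python sketch (not from the slides) that reproduces the χ² computation in the example above:

    def chi_squared(observed, expected):
        """Chi-squared statistic: sum of (O - E)^2 / E over all cells."""
        return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

    O = [50, 93, 67, 78, 87]
    E = [75] * len(O)
    print(round(chi_squared(O, E), 2))  # 15.55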

Correlation

• Examines the degree to which the values of two variables behave similarly.
• Correlation coefficient r:

  r = Σ (xi − x̄)(yi − ȳ) / √( Σ (xi − x̄)² · Σ (yi − ȳ)² )

  • 1 = perfect correlation
  • -1 = perfect but opposite correlation
  • 0 = no correlation

Similarity Measures

• Determine the similarity between two objects.
• Characteristics of a good similarity measure sim(x, y): typically sim(x, x) = 1, and values fall in [0, 1].

• Alternatively, distance measures indicate how unlike or dissimilar objects are.

Commonly Used Similarity Measures

(Table of formulas omitted)

Distance Measures

• Measure the dissimilarity between objects (formulas omitted)
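An illustrative Python sketch (added, not from the slides) of two widely used measures: cosine similarity and Euclidean distance; the vectors are hypothetical:

    import math

    def cosine_similarity(x, y):
        """Cosine of the angle between two vectors: 1 = same direction."""
        dot = sum(a * b for a, b in zip(x, y))
        norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
        return dot / norm

    def euclidean_distance(x, y):
        """Straight-line distance: larger means more dissimilar."""
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

    x, y = [3, 0, 5, 0, 2], [0, 7, 0, 2, 1]   # hypothetical term vectors
    print(cosine_similarity(x, y), euclidean_distance(x, y))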

Twenty Questions Game

(Figure omitted: the twenty questions guessing game, an everyday analogue of a decision tree)

Decision Trees

• Decision Tree (DT):
  – A tree where the root and each internal node are labeled with a question.
  – The arcs represent each possible answer to the associated question.
  – Each leaf node represents a prediction of a solution to the problem.

• A popular technique for classification; leaf nodes indicate the classes to which the corresponding tuples belong.

Decision Tree Example

(Figure omitted: an example decision tree)

Decision Trees

• A decision tree model is a computational model consisting of three parts:
  – The decision tree itself
  – An algorithm to create the tree
  – An algorithm that applies the tree to data
• Creating the tree is the most difficult part.
• Processing is basically a search similar to that in a binary search tree (although a DT need not be binary).
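An added Python sketch (not from the slides) of the third component, an algorithm that applies a tree to data, walking a small hypothetical tree from the root to a leaf:

    # Each internal node holds a question (an attribute test); arcs are answers;
    # leaves are class predictions. This tree is hypothetical.
    tree = ("height < 1.6m?",
            {"yes": "short",
             "no": ("height > 1.9m?",
                    {"yes": "tall", "no": "medium"})})

    def classify(node, answers):
        """Follow the answers down the tree until a leaf (class label) is reached."""
        while isinstance(node, tuple):
            question, branches = node
            node = branches[answers[question]]
        return node

    print(classify(tree, {"height < 1.6m?": "no", "height > 1.9m?": "no"}))  # medium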

Decision Tree Algorithm

(Algorithm figure omitted: the tree is built top-down by repeatedly choosing a splitting question and partitioning the training data until the partitions are sufficiently pure)

Decision Trees: Advantages & Disadvantages

• Advantages:
  – Easy to understand.
  – Easy to generate rules from.

• Disadvantages:
  – May suffer from overfitting.
  – Classify by rectangular partitioning of the attribute space.
  – Do not easily handle nonnumeric data.
  – Can be quite large, so pruning is often necessary.

