
Basic Data Mining Techniques

Overview

• Data & Types of Data
• Fuzzy Sets
• Information Retrieval
• Machine Learning
• Statistics & Estimation Techniques
• Similarity Measures
• Decision Trees

What is Data?

• Collection of data objects and their attributes
• An attribute is a property or characteristic of an object
  – Examples: eye color of a person, temperature, etc.
  – An attribute is also known as a variable, field, characteristic, or feature
• A collection of attributes describes an object
  – An object is also known as a record, point, case, sample, entity, or instance

Tid | Refund | Marital Status | Taxable Income | Cheat
 1  | Yes    | Single         | 125K           | No
 2  | No     | Married        | 100K           | No
 3  | No     | Single         |  70K           | No
 4  | Yes    | Married        | 120K           | No
 5  | No     | Divorced       |  95K           | Yes
 6  | No     | Married        |  60K           | No
 7  | Yes    | Divorced       | 220K           | No
 8  | No     | Single         |  85K           | Yes
 9  | No     | Married        |  75K           | No
10  | No     | Single         |  90K           | Yes

(The rows are the objects; the columns Refund, Marital Status, Taxable Income, and Cheat are the attributes.)

Attribute Values

• Attribute values are numbers or symbols assigned to an attribute
• Distinction between attributes and attribute values
  – The same attribute can be mapped to different attribute values
    • Example: height can be measured in feet or meters
  – Different attributes can be mapped to the same set of values
    • Example: attribute values for ID and age are integers
    • But the properties of the attribute values can differ: ID has no limit, while age has a maximum and minimum value

Types of Attributes

• There are different types of attributes
  – Nominal
    • Examples: ID numbers, eye color, zip codes
  – Ordinal
    • Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short}
  – Interval
    • Examples: calendar dates, temperatures in Celsius or Fahrenheit
  – Ratio
    • Examples: temperature in Kelvin, length, time, counts

Properties of Attribute Values

• The type of an attribute depends on which of the following properties it possesses:
  – Distinctness: =, ≠
  – Order: <, >
  – Addition: +, -
  – Multiplication: *, /

• Nominal attribute: distinctness
• Ordinal attribute: distinctness & order
• Interval attribute: distinctness, order & addition
• Ratio attribute: all 4 properties

Attribute Type | Description | Examples | Operations
Nominal  | The values of a nominal attribute are just different names, i.e., nominal attributes provide only enough information to distinguish one object from another. (=, ≠) | zip codes, employee ID numbers, eye color, sex: {male, female} | mode, entropy, contingency correlation, χ² test
Ordinal  | The values of an ordinal attribute provide enough information to order objects. (<, >) | hardness of minerals, {good, better, best}, grades, street numbers | median, percentiles, rank correlation, run tests, sign tests
Interval | For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists. (+, -) | calendar dates, temperature in Celsius or Fahrenheit | mean, standard deviation, Pearson's correlation, t and F tests
Ratio    | For ratio variables, both differences and ratios are meaningful. (*, /) | temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current | geometric mean, harmonic mean, percent variation

Attribute Level | Transformation | Comments
Nominal  | Any permutation of values | If all employee ID numbers were reassigned, would it make any difference?
Ordinal  | An order-preserving change of values, i.e., new_value = f(old_value), where f is a monotonic function | An attribute encompassing the notion of good, better, best can be represented equally well by the values {1, 2, 3} or by {0.5, 1, 10}.
Interval | new_value = a * old_value + b, where a and b are constants | Thus, the Fahrenheit and Celsius temperature scales differ in where their zero value is and in the size of a unit (degree).
Ratio    | new_value = a * old_value | Length can be measured in meters or feet.

Discrete and Continuous Attributes

• Discrete Attribute
  – Has only a finite or countably infinite set of values
  – Examples: zip codes, counts, or the set of words in a collection of documents
  – Often represented as integer variables
  – Note: binary attributes are a special case of discrete attributes

• Continuous Attribute
  – Has real numbers as attribute values
  – Examples: temperature, height, or weight
  – Practically, real values can only be measured and represented using a finite number of digits
  – Continuous attributes are typically represented as floating-point variables

Types of data sets

• Record
  – Data Matrix
  – Document Data
  – Transaction Data
• Graph
  – World Wide Web
  – Molecular Structures
• Ordered
  – Spatial Data
  – Temporal Data
  – Sequential Data
  – Genetic Sequence Data

Characteristics of Structured Data

• Dimensionality
  – Curse of Dimensionality
• Sparsity
  – Only presence counts
• Resolution
  – Patterns depend on the scale

Record Data

• Data that consists of a collection of records, each of which consists of a fixed set of attributes (see the Tid / Refund / Marital Status / Taxable Income / Cheat table shown earlier)

Data Matrix

• If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute

• Such a data set can be represented by an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute

Projection of x Load | Projection of y Load | Distance | Load | Thickness
10.23 | 5.27 | 15.22 | 2.7 | 1.2
12.65 | 6.25 | 16.22 | 2.2 | 1.1

Document Data

• Each document becomes a 'term' vector,
  – each term is a component (attribute) of the vector,
  – the value of each component is the number of times the corresponding term occurs in the document.

           | team | coach | play | ball | score | game | win | lost | timeout | season
Document 1 |  3   |  0    |  5   |  0   |  2    |  6   |  0  |  2   |  0      |  2
Document 2 |  0   |  7    |  0   |  2   |  1    |  0   |  0  |  3   |  0      |  0
Document 3 |  0   |  1    |  0   |  0   |  1    |  2   |  2  |  0   |  3      |  0
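As an illustration (added here, not part of the original slides), a minimal Python sketch of how a document-term vector like the rows above can be built; the vocabulary and document are hypothetical:

    from collections import Counter

    def term_vector(document, vocabulary):
        """Count how often each vocabulary term occurs in the document."""
        counts = Counter(document.lower().split())
        return [counts[term] for term in vocabulary]

    # Hypothetical vocabulary and document, for illustration only.
    vocabulary = ["team", "coach", "play", "ball", "score",
                  "game", "win", "lost", "timeout", "season"]
    doc = "the team hopes to win the game before the season ends"
    print(term_vector(doc, vocabulary))  # [1, 0, 0, 0, 0, 1, 1, 0, 0, 1]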

Transaction Data

• A special type of record data, where
  – each record (transaction) involves a set of items.
  – For example, consider a grocery store. The set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products that were purchased are the items.

TID | Items
 1  | Bread, Coke, Milk
 2  | Beer, Bread
 3  | Beer, Coke, Diaper, Milk
 4  | Beer, Bread, Diaper, Milk
 5  | Coke, Diaper, Milk

Graph Data

• Examples: a generic graph (figure omitted: a small graph with labeled edges) and HTML links:

<a href="papers/papers.html#bbbb">
Data Mining </a>
<li>
<a href="papers/papers.html#aaaa">
Graph Partitioning </a>
<li>
<a href="papers/papers.html#aaaa">
Parallel Solution of Sparse Linear System of Equations </a>
<li>
<a href="papers/papers.html#ffff">
N-Body Computation and Dense Linear System Solvers

Chemical Data

Benzene molecule: C6H6 (figure of the molecular structure omitted)

Ordered Data

Sequences of transactions (figure omitted: each element of the sequence is a set of items/events)

Ordered Data

Genomic sequence data:

GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG

Ordered Data

Spatio-temporal data: average monthly temperature of land and ocean (figure omitted)

Data Quality

• What kinds of data quality problems are there?
• How can we detect problems with the data?
• What can we do about these problems?

• Examples of data quality problems:
  – noise and outliers
  – missing values
  – duplicate data

Noise

• Noise refers to the modification of original values
  – Examples: distortion of a person's voice when talking on a poor phone connection, or "snow" on a television screen

(Figure omitted: two sine waves, and the same two sine waves with noise added)

Outliers

• Outliers are data objects with characteristics that are considerably different from those of most of the other data objects in the data set

Missing Values

• Reasons for missing values
  – Information is not collected (e.g., people decline to give their age and weight)
  – Attributes may not be applicable to all cases (e.g., annual income is not applicable to children)

• Handling missing values
  – Eliminate data objects
  – Estimate missing values
  – Ignore the missing value during analysis
  – Replace with all possible values (weighted by their probabilities)

Duplicate Data

• A data set may include data objects that are duplicates, or almost duplicates, of one another
  – A major issue when merging data from heterogeneous sources

• Example:
  – The same person with multiple email addresses

• Data cleaning
  – The process of dealing with duplicate data issues

Data Preprocessing

• Aggregation
• Sampling
• Dimensionality Reduction
• Feature subset selection
• Feature creation
• Discretization and Binarization
• Attribute Transformation

Aggregation

• Combining two or more attributes (or objects) into a single attribute (or object)

• Purpose
  – Data reduction
    • Reduce the number of attributes or objects
  – Change of scale
    • Cities aggregated into regions, states, countries, etc.
  – More "stable" data
    • Aggregated data tends to have less variability

• Example: variation of precipitation in Australia (figures omitted: standard deviation of average monthly precipitation vs. standard deviation of average yearly precipitation)

Sampling

• Sampling is the main technique employed for data selection.
  – It is often used for both the preliminary investigation of the data and the final data analysis.

• Statisticians sample because obtaining the entire set of data of interest is too expensive or time consuming.

• Sampling is used in data mining because processing the entire set of data of interest is too expensive or time consuming.

Sampling ...

• The key principle for effective sampling is the following:
  – Using a sample will work almost as well as using the entire data set, if the sample is representative.
  – A sample is representative if it has approximately the same property (of interest) as the original set of data.

Types of Sampling

• Simple random sampling
  – There is an equal probability of selecting any particular item

• Sampling without replacement
  – As each item is selected, it is removed from the population

• Sampling with replacement
  – Objects are not removed from the population as they are selected for the sample
  – In sampling with replacement, the same object can therefore be picked more than once

• Stratified sampling
  – Split the data into several partitions; then draw random samples from each partition

(A code sketch of these schemes appears below, after the next slide.)

Sample Size

(Figure omitted: the same point set drawn with 8000 points, 2000 points, and 500 points)
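As promised above, a minimal Python sketch (added, not from the original slides) of the three sampling schemes on a hypothetical population:

    import random

    data = list(range(100))  # hypothetical population, for illustration only

    # Sampling without replacement: each item can appear at most once.
    without_repl = random.sample(data, k=10)

    # Sampling with replacement: the same item can be picked more than once.
    with_repl = [random.choice(data) for _ in range(10)]

    # Stratified sampling: partition the data, then sample from each partition.
    strata = [data[:50], data[50:]]  # two hypothetical partitions (strata)
    stratified = [x for s in strata for x in random.sample(s, k=5)]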

Curse of Dimensionality

• When dimensionality increases, data becomes increasingly sparse in the space that it occupies

• Definitions of density and distance between points, which are critical for clustering and outlier detection, become less meaningful

(Figure omitted: randomly generate 500 points and compute the difference between the maximum and minimum distance between any pair of points, plotted against the number of dimensions)

Dimensionality Reduction

• Purpose:
  – Avoid the curse of dimensionality
  – Reduce the amount of time and memory required by data mining algorithms
  – Allow data to be more easily visualized
  – May help to eliminate irrelevant features or reduce noise

• Techniques
  – Principal Component Analysis
  – Singular Value Decomposition
  – Others: supervised and non-linear techniques

Dimensionality Reduction: PCA

• The goal is to find a projection that captures the largest amount of variation in the data
• Find the eigenvectors of the covariance matrix
• The eigenvectors define the new space

(Figure omitted: points in the (x1, x2) plane with the principal eigenvector e drawn along the direction of greatest variance)
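A minimal sketch (added, not from the original slides) of PCA via the eigenvectors of the covariance matrix, using NumPy; the 2-D data set is hypothetical:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])  # hypothetical data

    X_centered = X - X.mean(axis=0)          # center the data
    cov = np.cov(X_centered, rowvar=False)   # 2x2 covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvectors of the covariance matrix

    # Sort by decreasing eigenvalue and project onto the top component.
    order = np.argsort(eigvals)[::-1]
    top = eigvecs[:, order[:1]]
    X_reduced = X_centered @ top             # data expressed in the new 1-D space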

Fuzzy Sets and Logic

• Fuzzy set: a set whose membership function is a real-valued function with output in the range [0,1].
  – f(x): probability that x is in F.
  – 1 - f(x): probability that x is not in F.

• Example
  – T = {x | x is a person and x is tall}
  – Let f(x) be the probability that x is tall
  – Here f is the membership function

• DM: prediction and classification are often fuzzy.

Fuzzy Sets

(Figure omitted: membership functions for Short, Medium, and Tall as a function of height. With crisp sets, membership jumps between 0 and 1; with fuzzy sets, membership varies smoothly within [0,1].)
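As an illustration (added, not from the original slides), a small Python sketch of a fuzzy membership function for "tall"; the breakpoints are hypothetical:

    def tall(height_cm):
        """Fuzzy membership in 'Tall': 0 below 160 cm, 1 above 190 cm,
        and a linear ramp in between (breakpoints are hypothetical)."""
        if height_cm <= 160:
            return 0.0
        if height_cm >= 190:
            return 1.0
        return (height_cm - 160) / 30.0

    print(tall(155), tall(175), tall(195))  # 0.0 0.5 1.0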

Classification/Prediction is Fuzzy

(Figure omitted: loan approval as a function of salary and loan amount. A 0-1 decision gives a hard Accept/Reject boundary; a fuzzy decision lets the Accept and Reject regions shade into one another.)

Information Retrieval

• Information Retrieval (IR): retrieving desired information from textual data
  – Library science
  – Digital libraries
  – Web search engines
  – Traditionally has been keyword based
  – Sample query:
    • Find all documents about "data mining".

• DM: similarity measures; mine text or Web data

Information Retrieval (cont'd)

• Similarity: a measure of how close a query is to a document.
• Documents that are "close enough" are retrieved.
• Metrics:
  – Precision = |Relevant and Retrieved| / |Retrieved|
  – Recall = |Relevant and Retrieved| / |Relevant|

IR Query Result Measures and Classification

IR:
             | Retrieved | Not Retrieved
Relevant     |    20     |     10
Not Relevant |    45     |     25

Classification (the analogous 2x2, with the same counts):
         | Classified Tall | Classified Not Tall
Tall     |       20        |        10
Not Tall |       45        |        25
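A minimal Python sketch (added, not from the original slides) of precision and recall computed from the 2x2 counts above:

    def precision_recall(relevant_retrieved, retrieved, relevant):
        """Precision = |Relevant and Retrieved| / |Retrieved|;
        Recall = |Relevant and Retrieved| / |Relevant|."""
        return relevant_retrieved / retrieved, relevant_retrieved / relevant

    # From the table: 20 relevant retrieved, 65 retrieved in total,
    # 30 relevant in total.
    p, r = precision_recall(20, 20 + 45, 20 + 10)
    print(p, r)  # ~0.308, ~0.667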

Machine Learning

• Machine Learning (ML): the area of AI that examines how to devise algorithms that can learn.
• Techniques from ML are often used in classification and prediction.
• Supervised learning: learns by example.
• Unsupervised learning: learns without knowledge of correct answers.
• Machine learning often deals with small or static datasets, whereas data mining is targeted at business users.

• DM: uses many machine learning techniques.

Statistics

• Usually creates simple descriptive models.
• Statistical inference: generalizing a model created from a sample of the data to the entire dataset.
• Exploratory Data Analysis:
  – The data can actually drive the creation of the model.
  – The opposite of the traditional statistical view.

• DM: many data mining methods are based on statistical techniques.

Point Estimation

• Point estimate: an estimate of a population parameter.
• May be made by calculating the parameter for a sample.
• May be used to predict values for missing data.

• Example:
  – R contains 100 employees
  – 99 have salary information
  – The mean salary of these is $50,000
  – Use $50,000 as the value of the remaining employee's salary. Is this a good idea?

Estimation Error

• Bias: the difference between the expected value and the actual value:

  Bias = E[θ̂] − θ

• Mean Squared Error (MSE): the expected value of the squared difference between the estimate and the actual value:

  MSE(θ̂) = E[(θ̂ − θ)²]

• Why square? So that errors do not cancel, and larger deviations are penalized more.
• Root Mean Square Error (RMSE): the square root of the MSE.
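A small Python sketch (added, not from the original slides) of bias, MSE, and RMSE for a set of point estimates of a known actual value; the numbers are hypothetical:

    import math

    def mse(estimates, actual):
        """Mean squared error of a list of estimates against the actual value."""
        return sum((e - actual) ** 2 for e in estimates) / len(estimates)

    estimates = [49.0, 52.0, 50.5, 47.5]   # hypothetical point estimates
    actual = 50.0
    bias = sum(estimates) / len(estimates) - actual
    print(bias, mse(estimates, actual), math.sqrt(mse(estimates, actual)))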

Jackknife Estimate

• Jackknife estimate: an estimate of a parameter obtained by omitting one value from the set of observed values.
• Example: the jackknife estimate of the mean for X = {x1, ..., xn}, omitting xi, is (sketched in code after the next slide):

  θ̂(i) = (1 / (n − 1)) Σ_{j ≠ i} xj

Maximum Likelihood Estimate (MLE)

• Obtain parameter estimates that maximize the probability that the sample data occurs for the specific model.
• Obtain the joint probability of observing the sample data by multiplying the individual probabilities. Likelihood function:

  L(Θ | x1, ..., xn) = f(x1 | Θ) · f(x2 | Θ) · ... · f(xn | Θ)

• Maximize L.
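An illustrative Python sketch (added, not from the slides) of the leave-one-out means that underlie the jackknife estimate of the mean; the data are hypothetical:

    def jackknife_means(xs):
        """Leave-one-out means: the i-th entry omits xs[i]."""
        n = len(xs)
        total = sum(xs)
        return [(total - x) / (n - 1) for x in xs]

    print(jackknife_means([2.0, 4.0, 6.0, 8.0]))  # [6.0, ~5.33, ~4.67, 4.0]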

MLE Example

• Coin toss five times: {H, H, H, H, T}
• Assuming a perfect coin with H and T equally likely, the likelihood of this sequence is:

  L(0.5 | HHHHT) = (0.5)⁵ = 0.03125

• However, if the probability of an H is 0.8, then:

  L(0.8 | HHHHT) = (0.8)⁴ (0.2) = 0.08192

MLE Example (cont'd)

• General likelihood formula, with xi = 1 for heads and xi = 0 for tails:

  L(p | x1, ..., xn) = Π_{i=1..n} p^xi (1 − p)^(1 − xi)

• The maximizing estimate for p is then the sample proportion of heads: p̂ = (Σ xi) / n = 4/5 = 0.8
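An added Python sketch (not from the slides) that evaluates this likelihood over a grid to confirm that p = 0.8 maximizes it for {H,H,H,H,T}:

    def likelihood(p, tosses):
        """L(p | tosses) for Bernoulli data; tosses is a string of 'H'/'T'."""
        L = 1.0
        for t in tosses:
            L *= p if t == "H" else (1 - p)
        return L

    grid = [i / 100 for i in range(1, 100)]
    best = max(grid, key=lambda p: likelihood(p, "HHHHT"))
    print(best, likelihood(best, "HHHHT"))  # 0.8 0.08192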

Expectation-Maximization (EM)

• Solves estimation problems with incomplete data.

Expectation-Maximization Algorithm

• Obtain initial estimates for the parameters.
• Iteratively use the current estimates to fill in the missing data (expectation), then refine the parameter estimates (maximization), until convergence.
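A hedged Python sketch (added, not from the slides) of EM in its simplest form: estimating a mean when some observations are missing, by repeatedly imputing the missing values with the current estimate; the data are hypothetical:

    def em_mean(observed, n_missing, tol=1e-9):
        """Estimate the mean of a sample in which n_missing values are unobserved.
        E-step: impute each missing value with the current mean estimate.
        M-step: recompute the mean over observed + imputed values."""
        n = len(observed) + n_missing
        mean = sum(observed) / len(observed)          # initial estimate
        while True:
            new_mean = (sum(observed) + n_missing * mean) / n
            if abs(new_mean - mean) < tol:
                return new_mean
            mean = new_mean

    print(em_mean([1.0, 5.0, 10.0, 4.0], n_missing=2))  # converges to 5.0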

Expectation-Maximization Example

(Figure omitted: successive EM estimates of the parameter converging over the iterations)

Models Based on Summarization

• Visualization: frequency distribution, mean, variance, median, mode, etc.
• Box plot (figure omitted): a graphical five-number summary of a distribution.

Scatter Diagram

(Figure omitted: scatter diagram of paired data values)

Bayes Theorem

• Posterior probability: P(h1 | xi)
• Prior probability: P(h1)
• Bayes theorem:

  P(h1 | xi) = P(xi | h1) P(h1) / P(xi)

• Assigns probabilities to hypotheses given a data value.

Bayes Theorem Example

• Credit authorizations (hypotheses):
  – h1 = authorize purchase
  – h2 = authorize after further identification
  – h3 = do not authorize
  – h4 = do not authorize but contact police

• Assign twelve data values for all combinations of credit and income:

Credit \ Income |  1  |  2  |  3  |  4
Excellent       | x1  | x2  | x3  | x4
Good            | x5  | x6  | x7  | x8
Bad             | x9  | x10 | x11 | x12

• From training data: P(h1) = 60%; P(h2) = 20%; P(h3) = 10%; P(h4) = 10%.

Bayes Example (cont'd)

Training data:

ID | Income | Credit    | Class | xi
 1 |   4    | Excellent |  h1   | x4
 2 |   3    | Good      |  h1   | x7
 3 |   2    | Excellent |  h1   | x2
 4 |   3    | Good      |  h1   | x7
 5 |   4    | Good      |  h1   | x8
 6 |   2    | Excellent |  h1   | x2
 7 |   3    | Bad       |  h2   | x11
 8 |   2    | Bad       |  h2   | x10
 9 |   3    | Bad       |  h3   | x11
10 |   1    | Bad       |  h4   | x9

Bayes Example (cont'd)

• Calculate P(xi|hj) and P(xi)
• Example: P(x7|h1) = 2/6; P(x4|h1) = 1/6; P(x2|h1) = 2/6; P(x8|h1) = 1/6; and P(xi|h1) = 0 for all other xi.
• Predict the class for x4:
  – Calculate P(hj|x4) for all hj.
  – Place x4 in the class with the largest value.
  – Example (reproduced in the code sketch after the next slide):
    • P(h1|x4) = (P(x4|h1) P(h1)) / P(x4) = (1/6)(0.6)/0.1 = 1.
    • So x4 is placed in class h1.

Hypothesis Testing

• Find a model to explain behavior by creating and then testing a hypothesis about the data.
• The exact opposite of the usual DM approach.
• H0 – null hypothesis: the hypothesis to be tested.
• H1 – alternative hypothesis.
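As promised above, an added Python sketch (not from the slides) reproducing the P(h1|x4) computation with Bayes theorem:

    def posterior(likelihood, prior, evidence):
        """Bayes theorem: P(h | x) = P(x | h) * P(h) / P(x)."""
        return likelihood * prior / evidence

    # From the example: P(x4|h1) = 1/6, P(h1) = 0.6, P(x4) = 1/10.
    print(posterior(1 / 6, 0.6, 0.1))  # 1.0, so x4 is assigned to class h1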

Chi Squared Statistic

• O – observed value
• E – expected value based on the hypothesis

  χ² = Σ (O − E)² / E

• Example (see the code sketch after the next slide):
  – O = {50, 93, 67, 78, 87}
  – E = 75
  – χ² = 15.55, and therefore significant

Regression

• Predict future values based on past values
• Linear regression assumes that a linear relationship exists:

  y = c0 + c1 x1 + ... + cn xn

• Find the ci values that best fit the data
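An added Python sketch (not from the slides) that reproduces the χ² computation in the example above:

    def chi_squared(observed, expected):
        """Chi-squared statistic: sum of (O - E)^2 / E over all cells."""
        return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

    O = [50, 93, 67, 78, 87]
    E = [75] * len(O)
    print(round(chi_squared(O, E), 2))  # 15.55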

Correlation

• Examines the degree to which the values of two variables behave similarly.
• Correlation coefficient r:

  r = Σ (xi − x̄)(yi − ȳ) / √( Σ (xi − x̄)² · Σ (yi − ȳ)² )

  • 1 = perfect correlation
  • -1 = perfect but opposite correlation
  • 0 = no correlation

Similarity Measures

• Determine the similarity between two objects.
• Characteristics of a good similarity measure sim(x, y): typically sim(x, x) = 1, and values fall in [0, 1].

• Alternatively, distance measures indicate how unlike or dissimilar objects are.

Commonly Used Similarity Measures

(Table of formulas omitted)

Distance Measures

• Measure the dissimilarity between objects (formulas omitted)
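An illustrative Python sketch (added, not from the slides) of two widely used measures: cosine similarity and Euclidean distance; the vectors are hypothetical:

    import math

    def cosine_similarity(x, y):
        """Cosine of the angle between two vectors: 1 = same direction."""
        dot = sum(a * b for a, b in zip(x, y))
        norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
        return dot / norm

    def euclidean_distance(x, y):
        """Straight-line distance: larger means more dissimilar."""
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

    x, y = [3, 0, 5, 0, 2], [0, 7, 0, 2, 1]   # hypothetical term vectors
    print(cosine_similarity(x, y), euclidean_distance(x, y))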

Twenty Questions Game

(Figure omitted: the twenty questions guessing game, an everyday analogue of a decision tree)

Decision Trees

• Decision Tree (DT):
  – A tree where the root and each internal node are labeled with a question.
  – The arcs represent each possible answer to the associated question.
  – Each leaf node represents a prediction of a solution to the problem.

• A popular technique for classification; leaf nodes indicate the classes to which the corresponding tuples belong.

Decision Tree Example

(Figure omitted: an example decision tree)

Decision Trees

• A decision tree model is a computational model consisting of three parts:
  – The decision tree itself
  – An algorithm to create the tree
  – An algorithm that applies the tree to data
• Creating the tree is the most difficult part.
• Processing is basically a search similar to that in a binary search tree (although a DT need not be binary).
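An added Python sketch (not from the slides) of the third component, an algorithm that applies a tree to data, walking a small hypothetical tree from the root to a leaf:

    # Each internal node holds a question (an attribute test); arcs are answers;
    # leaves are class predictions. This tree is hypothetical.
    tree = ("height < 1.6m?",
            {"yes": "short",
             "no": ("height > 1.9m?",
                    {"yes": "tall", "no": "medium"})})

    def classify(node, answers):
        """Follow the answers down the tree until a leaf (class label) is reached."""
        while isinstance(node, tuple):
            question, branches = node
            node = branches[answers[question]]
        return node

    print(classify(tree, {"height < 1.6m?": "no", "height > 1.9m?": "no"}))  # medium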

Decision Tree Algorithm

(Algorithm figure omitted: the tree is built top-down by repeatedly choosing a splitting question and partitioning the training data until the partitions are sufficiently pure)

Decision Trees: Advantages & Disadvantages

• Advantages:
  – Easy to understand.
  – Easy to generate rules from.

• Disadvantages:
  – May suffer from overfitting.
  – Classify by rectangular partitioning of the attribute space.
  – Do not easily handle nonnumeric data.
  – Can be quite large, so pruning is often necessary.

