Documenti di Didattica
Documenti di Professioni
Documenti di Cultura
Nominal The values of a nominal attribute are zip codes, employee mode, entropy,
just different names, i.e., nominal ID numbers, eye color, contingency
attributes provide only enough sex: {male, female} correlation, 2 test
information to distinguish one object
from another. (=, )
Ratio For ratio variables, both differences temperature in Kelvin, geometric mean,
and ratios are meaningful. (*, /) monetary quantities, harmonic mean,
counts, age, mass, percent variation
length, electrical
current
Attribute Transformation Comments
Level
l Discrete Attribute
– Has only a finite or countably infinite set of values
– Examples: zip codes, counts, or the set of words in a collection of
documents
– Often represented as integer variables.
– Note: binary attributes are a special case of discrete attributes
l Continuous Attribute
– Has real numbers as attribute values
– Examples: temperature, height, or weight.
– Practically, real values can only be measured and represented
using a finite number of digits.
– Continuous attributes are typically represented as floating-point
variables.
Graph
– World Wide Web
– Molecular Structures
Ordered
– Spatial Data
– Temporal Data
– Sequential
© Tan,Steinbach, Kumar Data Introduction to Data Mining 4/18/2004
Important Characteristics of Structured Data
Before providing details of specific kinds of data sets, we discuss three characteristics
that apply to many data sets and have a significant impact on the data mining
techniques.
Dimensionality
– The more dimensions (attributes) there are the harder dealing with the
dataset becomes. [called Curse of Dimensionality]
Sparsity
– Sometimes only the presence of a value for an attribute is what is
important [e.g. Document Data].
– Sparse datasets have more zeroes than non-zero values, whereas
dense datasets have only few zero values.
– Sparsity affects the choice of the data mining method.
Resolution
– Very high resolution: Difficult to see patterns due to many unnecessary
details.
– Very low resolution: Patterns may not be present in the dataset!
– E.g.: Record air pressure every second? Every minute? Every day?
If data objects have the same fixed set of numeric attributes, then the
data objects can be thought of as points (vectors) in a multi-dimensional
space, where each dimension represents a distinct attribute.
Such data set can be represented by an m by n matrix, where there are
m rows, one for each object, and n columns, one for each attribute
This matrix is called a data matrix or a pattern matrix. A data matrix is
a variation of record data, but because it consists of numeric attributes,
standard matrix operation can be applied to transform and manipulate
the data. Therefore, the data matrix is the standard data format for most
statistical data.
timeout
season
coach
game
score
team
ball
lost
pla
wi
n
y
Document 1 3 0 5 0 2 6 0 2 0 2
Document 2 0 7 0 2 1 0 0 3 0 0
Document 3 0 1 0 0 1 2 2 0 3 0
TI Items PI Symptoms
D D
1 Biscuits, Milk, Bread 1 Fever, Headache
2 Milk, Bread, Diapers 2 Stomachache,
Diarrhea
3 Cucumber
3 Fever, Nausea
4 Water, Orange, Milk
4 Fever, HBP
5 Bread, Tomatoes
5 Fatigue, Fever, Cough
l Sequences of transactions
Items/Events
An element of
the sequence
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004
Ordered Data
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
l Spatio-Temporal Data
Average Monthly
Temperature of
land and ocean
l Examples:
– Same person with multiple email addresses
l Data cleaning
– Process of dealing with duplicate data issues
l Aggregation
l Sampling
l Dimensionality Reduction
l Feature subset selection
l Feature creation
l Discretization and Binarization
l Attribute Transformation
Purpose
– Data reduction
Reduce the number of attributes or objects
– Change of scale
Cities aggregated into regions, states, countries, etc
– More “stable” data
Aggregated data tends to have less variability
l Stratified sampling
– Split the data into several partitions; then draw random samples
from each partition
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004
Sample Size
l When dimensionality
increases, data becomes
increasingly sparse in the
space that it occupies
l Purpose:
– Avoid curse of dimensionality
– Reduce amount of time and memory required by data
mining algorithms
– Allow data to be more easily visualized
– May help to eliminate irrelevant features or reduce
noise
l Techniques
– Principle Component Analysis
– Singular Value Decomposition
– Others: supervised and non-linear techniques
Dimensions = 206
Dimensions
Dimensions==160
120
10
40
80
l Redundant features
– duplicate much or all of the information contained in
one or more other attributes
– Example: purchase price of a product and the amount
of sales tax paid
l Irrelevant features
– contain no information that is useful for the data
mining task at hand
– Example: students' ID is often irrelevant to the task of
predicting students' GPA
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004
Feature Creation
l Similarity
– Numerical measure of how alike two data objects are.
– Is higher when objects are more alike.
– Often falls in the range [0,1]
l Dissimilarity
– Numerical measure of how different are two data
objects
– Lower when objects are more alike
– Minimum dissimilarity is often 0
– Upper limit varies
l Proximity refers to a similarity or dissimilarity
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004
Similarity/Dissimilarity for Simple Attributes
l Euclidean Distance
n
dist ( pk qk ) 2
k 1
p1
point x y
2
p1 0 2
p3 p4
1
p2 2 0
p2 p3 3 1
0 p4 5 1
0 1 2 3 4 5 6
p1 p2 p3 p4
p1 0 2.828 3.162 5.099
p2 2.828 0 1.414 3.162
p3 3.162 1.414 0 2
p4 5.099 3.162 2 0
Distance Matrix
correlation( p, q) p q
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004
Visually Evaluating Correlation
Scatter plots
showing the
similarity from
–1 to 1.
l Examples:
– Euclidean density
Euclidean density = number of points per unit volume
– Probability density
– Graph-based density