DATA MINING
Objects
An object is also known as a record, point, case, sample, entity, or instance.
[Fragment of an example record table; rows such as "5 No Divorced 95K Yes" pair each object with its attribute values.]
Nominal: The values of a nominal attribute are just different names, i.e., nominal attributes provide only enough information to distinguish one object from another. (=, ≠)
  Examples: zip codes, employee ID numbers, eye color, sex: {male, female}
  Statistics: mode, entropy, contingency correlation, χ² test
Interval: For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists. (+, −)
  Examples: calendar dates, temperature in Celsius or Fahrenheit
  Statistics: mean, standard deviation, Pearson's correlation, t and F tests
Ratio: For ratio variables, both differences and ratios are meaningful. (*, /)
  Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current
  Statistics: geometric mean, harmonic mean, percent variation
Continuous Attribute
Has real numbers as attribute values
Examples: temperature, height, or weight.
Practically, real values can only be measured and represented using a
finite number of digits.
Continuous attributes are typically represented as floating-point
variables.
Lesson 2: Data Preprocessing 9
INST 766: Data Mining
What is Data(6)?
Types of data sets:
Record
Data Matrix
Document Data
Transaction Data
Graph
World Wide Web
Molecular Structures
Ordered
Spatial Data
Temporal Data
Sequential Data
Genetic Sequence Data
What is Data(7)?
Sparsity
Only presence counts
Resolution
Patterns depend on the scale
A document-term matrix (each document is a vector of term counts):

             team  coach  play  ball  score  game  win  lost  timeout  season
Document 1      3      0     5     0      2     6    0     2        0       2
Document 2      0      7     0     2      1     0    0     3        0       0
Document 3      0      1     0     0      1     2    2     0        3       0
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
[Figure: the World Wide Web as graph data — HTML pages such as "Data Mining", "Graph Partitioning", "Parallel Solution of Sparse Linear System of Equations", and "N-Body Computation and Dense Linear System Solvers" (papers/papers.html) connected by hyperlinks.]
[Figure: Average Monthly Temperature of land and ocean — each monthly value is an element of the sequence.]
See www.crisp-dm.org for more information.
[Figure: the CRISP-DM process cycle, including the Data Preparation and Monitoring phases.]
Data Preparation is estimated to take 70-80% of the time and effort.
Ignore the tuple: usually done when the class label is missing (assuming the task is classification). Not effective when the percentage of missing values per attribute varies considerably.
Fill in the missing value manually: tedious + infeasible?
Use a global constant to fill in the missing value: e.g., “unknown”, a new class?!
Use the attribute mean to fill in the missing value.
Use the attribute mean for all samples belonging to the same class to fill in the missing value: smarter.
Use the most probable value to fill in the missing value: inference-based, such as a Bayesian formula or decision tree.
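The mean-based strategies above can be sketched as follows. This is a minimal illustration with hypothetical data, not a production imputation routine:

```python
# Sketch of two imputation strategies from the list above:
# global attribute mean, and class-conditional attribute mean.

def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def impute_class_mean(values, labels):
    """Replace None entries with the mean of observed values in the same class."""
    means = {}
    for label in set(labels):
        obs = [v for v, l in zip(values, labels) if l == label and v is not None]
        means[label] = sum(obs) / len(obs)
    return [means[l] if v is None else v for v, l in zip(values, labels)]

incomes = [95, None, 60, 220, None]           # hypothetical attribute values
classes = ["yes", "yes", "no", "no", "no"]    # hypothetical class labels
print(impute_mean(incomes))
print(impute_class_mean(incomes, classes))
```

The class-conditional version is the "smarter" variant above: it borrows values only from records that share the class label.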
Noisy Data
Noise: random error or variance in a measured variable
Incorrect attribute values may be due to
faulty data collection instruments
data entry problems
data transmission problems
technology limitation
inconsistency in naming convention
Other data problems which require data cleaning
duplicate records
incomplete data
inconsistent data
Clustering
detect and remove outliers
Regression
smooth by fitting the data into regression functions
if A and B are the lowest and highest values of the attribute, the
width of intervals will be: W = (B-A)/N.
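The equal-width rule W = (B−A)/N can be sketched directly:

```python
# Minimal sketch of equal-width binning: interval width W = (B - A) / N,
# where A and B are the lowest and highest attribute values.

def equal_width_bins(values, n_bins):
    """Assign each value a bin index 0..n_bins-1 using equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    bins = []
    for v in values:
        i = int((v - lo) / width)
        bins.append(min(i, n_bins - 1))  # clamp the maximum value into the last bin
    return bins

data = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
print(equal_width_bins(data, 3))  # → [0,0,0,0,0,1,1,1,1,1,2,2,2,2]
```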
[Figure: regression smoothing — a fitted line y = x + 1 maps an observed value y1 at x1 to the smoothed value y1'.]
r_{A,B} = Σ(A − Ā)(B − B̄) / ((n − 1) σ_A σ_B)

where n is the number of tuples, Ā and B̄ are the respective mean values of A and B, and σ_A and σ_B are the respective standard deviations of A and B.
Handling Redundant Data in Data Integration
If the resulting value is 0 (zero), then A and B are independent and there is no correlation between them.
If the resulting value is less than 0 (zero), i.e., negative, A and B are negatively correlated: as the value of one attribute increases, the other decreases. Each attribute discourages the other.
Careful integration of the data from multiple sources may help reduce/avoid
redundancies and inconsistencies and improve mining speed and quality
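The correlation coefficient used for redundancy detection can be computed as a short sketch (sample standard deviation, n−1 denominator):

```python
import math

def correlation(a, b):
    """Pearson correlation r(A,B) = sum((a - mean_A)(b - mean_B)) / ((n-1) * sd_A * sd_B)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    sd_a = math.sqrt(sum((x - mean_a) ** 2 for x in a) / (n - 1))
    sd_b = math.sqrt(sum((x - mean_b) ** 2 for x in b) / (n - 1))
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    return cov / ((n - 1) * sd_a * sd_b)

print(correlation([1, 2, 3, 4], [2, 4, 6, 8]))   # perfectly positively correlated
print(correlation([1, 2, 3, 4], [8, 6, 4, 2]))   # perfectly negatively correlated
```

An attribute pair with |r| close to 1 is redundant: one of the two can be dropped before mining.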
Data reduction
Obtains a reduced representation of the data set that is much smaller in
volume but yet produces the same (or almost the same) analytical results
e.g., if we are interested in annual sales (i.e., the total per year) rather than the total per quarter.
This can be difficult and time consuming when the behavior of the data is
unknown.
Adding irrelevant attributes to the data set slows down the mining process.
Reducing the number of attributes yields fewer discovered patterns, which are easier to understand.
Selection
Step-wise forward selection
Decision-tree induction
Audio/video compression
Typically lossy compression, with progressive refinement
Sometimes small fragments of signal can be reconstructed
without reconstructing the whole
[Figure: Original Data reduced by lossy compression to an Approximated version.]
Method:
1) The length, L, of the input data vector must be an integer power of 2. This condition can be met by padding the data vector with zeros as necessary.
2) Each transform involves applying two functions. The first applies some data smoothing, such as a sum or weighted average. The second performs a weighted difference, which acts to bring out the detailed features of the data.
3) The two functions are applied to pairs of the input data, resulting in two sets of data of length L/2. In general, these represent a smoothed or low-frequency version of the input data and its high-frequency content, respectively.
4) The two functions are recursively applied to the sets of data obtained in the previous loop, until the resulting data sets are of length 2.
5) A selection of values from the data sets obtained in the above iterations is designated the wavelet coefficients of the transformed data.
Note: The matrix must be orthonormal, so that the matrix inverse is just its
transpose.
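The recursive smooth/difference procedure above can be sketched with the unnormalised Haar wavelet, using pairwise averages as the smoothing function and pairwise half-differences as the detail function (a sketch only; practical DWT implementations use normalised, matrix-based transforms):

```python
# Haar wavelet transform sketch: recursively replace the data with a
# half-length smoothed version (pairwise averages) while collecting the
# detail (pairwise half-difference) coefficients. Input length must be a
# power of 2, padded with zeros if necessary.

def haar_transform(data):
    coeffs = []
    current = list(data)
    while len(current) > 1:
        smooth = [(current[i] + current[i + 1]) / 2 for i in range(0, len(current), 2)]
        detail = [(current[i] - current[i + 1]) / 2 for i in range(0, len(current), 2)]
        coeffs = detail + coeffs       # details from this resolution level
        current = smooth               # recurse on the smoothed half-length data
    return current + coeffs            # overall average followed by the details

print(haar_transform([2, 2, 0, 2, 3, 5, 4, 4]))
# → [2.75, -1.25, 0.5, 0, 0, -1, -1, 0]
```

The first coefficient is the overall average; the rest are the high-frequency details at successively finer resolutions, which is what makes truncating small coefficients an effective lossy reduction.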
Non-parametric methods
Do not assume models
Major families: histograms, clustering, sampling
[Figure: histograms. (a) Equal-width histogram of the temperature values 64 65 68 69 70 71 72 72 75 75 80 81 83 85, with bin counts 4, 2, 2, 2, 0, 2, 2. (b) Equal-width histogram of salary in a corporation, with bins [0 – 200,000) … [1,800,000 – 2,000,000]. (c) Equal-depth histogram of the same temperature values, with bin counts 4, 4, 4, 2.]
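The two histogram styles can be contrasted with a short sketch on the temperature values above:

```python
# Sketch contrasting equal-width and equal-depth histograms.

def equal_width_counts(values, n_bins):
    """Count values falling into n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        counts[min(int((v - lo) / width), n_bins - 1)] += 1
    return counts

def equal_depth_partitions(values, n_bins):
    """Split the sorted values into n_bins partitions of (roughly) equal size."""
    ordered = sorted(values)
    size = -(-len(ordered) // n_bins)   # ceiling division
    return [ordered[i * size:(i + 1) * size] for i in range(n_bins)]

temps = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
print(equal_width_counts(temps, 7))       # bucket heights vary with the data
print(equal_depth_partitions(temps, 4))   # partitions of depth 4, 4, 4, 2
```

Equal-width buckets can be empty where the data is sparse; equal-depth buckets adapt their boundaries so every bucket holds about the same number of values.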
Partition data set into clusters, and one can store cluster
representation only
Can be very effective if data is clustered but not if data is
“smeared”
Can have hierarchical clustering and be stored in multi-dimensional index tree structures
There are many choices of clustering definitions and
clustering algorithms.
Sampling
Allow a mining algorithm to run in complexity that is
potentially sub-linear to the size of the data
Choose a representative subset of the data
Simple random sampling may have very poor performance
in the presence of skew
Develop adaptive sampling methods
Stratified sampling:
Approximate the percentage of each class (or
subpopulation of interest) in the overall database
Used in conjunction with skewed data
Sampling may not reduce database I/Os (page at a time).
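Stratified sampling as described above can be sketched as follows (a minimal illustration with hypothetical data and a hypothetical `fraction` parameter):

```python
import random

def stratified_sample(records, labels, fraction, seed=0):
    """Draw roughly `fraction` of the records from every class (stratum),
    so that minority classes stay represented even under skew."""
    rng = random.Random(seed)
    by_class = {}
    for rec, lab in zip(records, labels):
        by_class.setdefault(lab, []).append(rec)
    sample = []
    for lab, recs in by_class.items():
        k = max(1, round(len(recs) * fraction))   # keep at least one per stratum
        sample.extend(rng.sample(recs, k))        # SRS without replacement per stratum
    return sample

records = list(range(100))
labels = ["common"] * 90 + ["rare"] * 10          # skewed class distribution
s = stratified_sample(records, labels, 0.1)
print(len(s))   # 10% sample with the rare class still represented
```

A plain simple random sample of the same size could easily miss the rare class entirely; sampling each stratum separately avoids that.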
[Figure: Raw Data reduced by SRSWOR (simple random sampling without replacement) and SRSWR (simple random sampling with replacement).]
Discretization
reduce the number of values for a given continuous attribute by dividing the
range of the attribute into intervals. Interval labels can then be used to replace
actual data values.
Concept hierarchies
reduce the data by collecting and replacing low level concepts (such as
numeric values for the attribute age) by higher level concepts (such as young,
middle-aged, or senior).
Entropy-based discretization
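Entropy-based discretization chooses an interval boundary T on the attribute that minimises the class-label entropy of the two resulting intervals (the same information criterion used in decision-tree induction). A minimal single-split sketch with hypothetical data:

```python
import math

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    counts = {}
    for l in labels:
        counts[l] = counts.get(l, 0) + 1
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def best_split(values, labels):
    """Return the boundary T minimising the size-weighted entropy of the two halves."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_t, best_e = None, float("inf")
    for i in range(1, n):
        left = [l for _, l in pairs[:i]]
        right = [l for _, l in pairs[i:]]
        e = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
        if e < best_e:
            best_t = (pairs[i - 1][0] + pairs[i][0]) / 2   # midpoint boundary
            best_e = e
    return best_t

values = [60, 62, 65, 70, 80, 82, 85, 90]
labels = ["no", "no", "no", "no", "yes", "yes", "yes", "yes"]
print(best_split(values, labels))   # → 75.0, the boundary between the pure classes
```

The full method applies this split recursively to each interval until a stopping criterion (e.g., a minimum information gain) is met.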