— Chapter 1 —
Why Data Mining?
Evolution of Sciences
◼ Before 1600, empirical science
◼ 1600-1950s, theoretical science
◼ Each discipline has grown a theoretical component. Theoretical models often
motivate experiments and generalize our understanding.
◼ 1950s-1990s, computational science
◼ Over the last 50 years, most disciplines have grown a third, computational branch
(e.g. empirical, theoretical, and computational ecology, or physics, or linguistics.)
◼ Computational Science traditionally meant simulation. It grew out of our inability to
find closed-form solutions for complex mathematical models.
◼ 1990-now, data science
◼ The flood of data from new scientific instruments and simulations
◼ The ability to economically store and manage petabytes of data online
◼ The Internet and computing Grid that make all these archives universally accessible
◼ Scientific info. management, acquisition, organization, query, and visualization tasks
scale almost linearly with data volumes. Data mining is a major new challenge!
◼ Jim Gray and Alex Szalay, The World Wide Telescope: An Archetype for Online Science,
Comm. ACM, 45(11): 50-54, Nov. 2002
Evolution of Database Technology
◼ 1960s:
◼ Data collection, database creation, IMS and network DBMS
◼ 1970s:
◼ Relational data model, relational DBMS implementation
◼ 1980s:
◼ RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
◼ Application-oriented DBMS (spatial, scientific, engineering, etc.)
◼ 1990s:
◼ Data mining, data warehousing, multimedia databases, and Web
databases
◼ 2000s:
◼ Stream data management and mining
◼ Data mining and its applications
◼ Web technology (XML, data integration) and global information systems
What Is Data Mining?
Knowledge Discovery (KDD) Process
◼ This is a view from typical database systems and data warehousing communities
◼ Data mining plays an essential role in the knowledge discovery process
[Figure: the KDD pipeline: Databases → Data Cleaning → Data Integration → Task-relevant Data → Data Mining → Pattern Evaluation]
Example: A Web Mining Framework
Steps
[Figure: pyramid of steps with increasing potential to support business decisions, from Data Exploration (statistical summary, querying, reporting) at the base up to Decision Making by the End User at the top]
KDD Process: A Typical View from ML and
Statistics
Example: Medical Data Mining
Chapter 1. Introduction
◼ Why Data Mining?
◼ Summary
Multi-Dimensional View of Data Mining
◼ Data to be mined
◼ Database data (extended-relational, object-oriented, heterogeneous, etc.), data warehouse, transactional data, streams, text, Web, etc.
◼ Techniques utilized
◼ Data-intensive, data warehouse (OLAP), machine learning, statistics, pattern recognition, visualization, etc.
Data Mining: On What Kinds of Data?
◼ Database-oriented data sets and applications
◼ Relational database, data warehouse, transactional database
◼ Advanced data sets and advanced applications
◼ Data streams and sensor data
◼ Time-series data, temporal data, sequence data (incl. bio-sequences)
◼ Structure data, graphs, social networks and multi-linked data
◼ Object-relational databases
◼ Heterogeneous databases and legacy databases
◼ Spatial data and spatiotemporal data
◼ Multimedia database
◼ Text databases
◼ The World-Wide Web
Data Mining Function: (1) Generalization
Data Mining Function: (2) Association and
Correlation Analysis
◼ Frequent patterns (or frequent itemsets)
◼ What items are frequently purchased together in your
Walmart?
◼ Association, correlation vs. causality
◼ A typical association rule (illustrated in the sketch below)
◼ Diaper → Beer [0.5%, 75%] (support, confidence)
◼ Are strongly associated items also strongly correlated?
◼ How to mine such patterns and rules efficiently in large
datasets?
◼ How to use such patterns for classification, clustering,
and other applications?
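To make the support and confidence numbers concrete, here is a minimal sketch over a made-up set of transactions (the item names and the tiny database are purely illustrative):

```python
# Hypothetical market-basket transactions.
transactions = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"milk", "bread"},
    {"diaper", "bread"},
]

def support(itemset, db):
    # Fraction of transactions containing every item in the itemset.
    return sum(itemset <= t for t in db) / len(db)

def confidence(lhs, rhs, db):
    # P(rhs | lhs): support of the union divided by support of the antecedent.
    return support(lhs | rhs, db) / support(lhs, db)

print(support({"diaper", "beer"}, transactions))       # 0.5
print(confidence({"diaper"}, {"beer"}, transactions))  # ~0.667
```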
Data Mining Function: (3) Classification
Data Mining Function: (4) Cluster Analysis
Data Mining Function: (5) Outlier Analysis
◼ Outlier analysis
◼ Outlier: A data object that does not comply with the general
behavior of the data
◼ Noise or exception? ― One person’s garbage could be another
person’s treasure
◼ Methods: by-product of clustering or regression analysis, …
◼ Useful in fraud detection, rare events analysis
Time and Ordering: Sequential Pattern,
Trend and Evolution Analysis
◼ Sequence, trend and evolution analysis
◼ Trend, time-series, and deviation analysis, e.g., regression and value prediction
◼ Sequential pattern mining, e.g., first buy a digital camera, then buy large SD memory cards
◼ Periodicity analysis
◼ Similarity-based analysis
Structure and Network Analysis
◼ Graph mining
◼ Finding frequent subgraphs (e.g., chemical compounds), trees (XML), substructures (web fragments)
◼ Information network analysis: a person may belong to multiple networks: friends, family, classmates, …
◼ Links carry a lot of semantic information: Link mining
◼ Web mining
◼ Web is a big information network: from PageRank to Google
Evaluation of Knowledge
◼ Is all mined knowledge interesting?
◼ One can mine a tremendous amount of “patterns” and knowledge
◼ Some may fit only certain dimension space (time, location, …)
◼ Some may not be representative, may be transient, …
◼ Evaluation of mined knowledge → directly mine only
interesting knowledge?
◼ Descriptive vs. predictive
◼ Coverage
◼ Typicality vs. novelty
◼ Accuracy
◼ Timeliness
◼ …
Chapter 1. Introduction
◼ Why Data Mining?
◼ Summary
Data Mining: Confluence of Multiple Disciplines
Why Confluence of Multiple Disciplines?
Applications of Data Mining
◼ Web page analysis: from web page classification, clustering to
PageRank algorithms
◼ Collaborative analysis & recommender systems
◼ Basket data analysis to targeted marketing
◼ Biological and medical data analysis: classification, cluster analysis
(microarray data analysis), biological sequence analysis, biological
network analysis
◼ Data mining and software engineering (e.g., IEEE Computer, Aug.
2009 issue)
◼ From major dedicated data mining systems/tools (e.g., SAS, MS SQL-
Server Analysis Manager, Oracle Data Mining Tools) to invisible data
mining
Chapter 1. Introduction
◼ Why Data Mining?
◼ Summary
Major Issues in Data Mining (1)
◼ Mining Methodology
◼ Mining various and new kinds of knowledge
◼ Mining knowledge in multi-dimensional space
◼ Data mining: An interdisciplinary effort
◼ Boosting the power of discovery in a networked environment
◼ Handling noise, uncertainty, and incompleteness of data
◼ Pattern evaluation and pattern- or constraint-guided mining
◼ User Interaction
◼ Interactive mining
◼ Incorporation of background knowledge
◼ Presentation and visualization of data mining results
Major Issues in Data Mining (2)
Chapter 1. Introduction
◼ Why Data Mining?
◼ Summary
A Brief History of Data Mining Society
Where to Find References? DBLP, CiteSeer, Google
Summary
◼ Data mining: Discovering interesting patterns and knowledge from massive amounts of data
◼ A natural evolution of database technology, in great demand, with
wide applications
◼ A KDD process includes data cleaning, data integration, data
selection, transformation, data mining, pattern evaluation, and
knowledge presentation
◼ Mining can be performed on a variety of data
◼ Data mining functionalities: characterization, discrimination,
association, classification, clustering, outlier and trend analysis, etc.
◼ Data mining technologies and applications
◼ Major issues in data mining
Chapter 2: Getting to Know Your
Data
Types of Data Sets
◼ Record
◼ Relational records
◼ Data matrix, e.g., numerical matrix, crosstabs
◼ Document data: text documents represented as term-frequency vectors, e.g.:

             team  coach  play  ball  score  game  win  lost  timeout  season
Document 1     3     0     5     0      2     6     0     2      0       2
Document 2     0     7     0     2      1     0     0     3      0       0

◼ Transaction data
◼ Graph and network
◼ Dimensionality
◼ Curse of dimensionality
◼ Sparsity
◼ Only presence counts
◼ Resolution
◼ Patterns depend on the scale
◼ Distribution
◼ Centrality and dispersion
Data Objects
Attribute Types
◼ Nominal: categories, states, or “names of things”; essentially enumerations
◼ Hair_color = {auburn, black, blond, brown, grey, red, white}
◼ marital status, occupation, ID numbers, zip codes
◼ Binary
◼ Nominal attribute with only 2 states (0 and 1)
◼ Symmetric binary: both outcomes equally important
◼ e.g., gender
◼ Asymmetric binary: outcomes not equally important.
◼ e.g., medical test (positive vs. negative)
◼ Convention: assign 1 to most important outcome (e.g., HIV
positive)
◼ Ordinal
◼ Values have a meaningful order (ranking) but magnitude between
successive values is not known.
◼ E.g., Size = {small, medium, large}, grades, army rankings
Numeric Attribute Types
◼ Measurable Quantity (integer or real-valued)
◼ Interval-Scaled
◼ Measured on a scale of equal-sized units
◼ Values have order
◼ E.g., temperature in C˚ or F˚, calendar dates
◼ No true zero-point
◼ Ratio
◼ Inherent zero-point, value being multiple of
another value
◼ We can speak of one value as being a multiple of another relative to the unit of measurement (10 K˚ is twice as high as 5 K˚)
◼ e.g., temperature in Kelvin, length, counts,
monetary quantities
Discrete vs. Continuous
Attributes
◼ Discrete Attribute
◼ Has only a finite or countably infinite set of values
◼ E.g., zip codes, or the set of words in a collection of documents
◼ Sometimes represented as integer variables
◼ Continuous Attribute
◼ Has real numbers as attribute values, typically represented as floating-point variables
Basic Statistical Descriptions of Data
◼ Motivation
◼ To better understand the data: central tendency,
variation and spread
◼ Data dispersion characteristics
◼ median, max, min, quantiles, outliers, variance, etc.
◼ Numerical dimensions correspond to sorted intervals
◼ Data dispersion: analyzed with multiple granularities
of precision
◼ Boxplot or quantile analysis on sorted intervals
◼ Dispersion analysis on computed measures
◼ Folding measures into numerical dimensions
◼ Boxplot or quantile analysis on the transformed cube
Measuring the Central Tendency
◼ Mean (algebraic measure), sample vs. population (n is the sample size, N the population size):
  $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$        $\mu = \frac{\sum x}{N}$
◼ Weighted arithmetic mean:
  $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
◼ Trimmed mean: chopping extreme values before averaging
◼ Median: middle value if an odd number of values; average of the two middle values otherwise
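A short sketch of these measures in Python (the data values are hypothetical; the trimmed mean here drops the k smallest and k largest values before averaging):

```python
import numpy as np

data = np.array([30, 36, 47, 50, 52, 52, 56, 60, 70, 70, 110])  # hypothetical values

mean = data.mean()                        # arithmetic mean
w = np.ones_like(data, dtype=float)       # weights for a weighted mean
weighted = np.average(data, weights=w)
median = np.median(data)

k = 1                                     # trimmed mean: drop k values from each end
trimmed = np.sort(data)[k:-k].mean()
print(mean, weighted, median, trimmed)
```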
Symmetric vs. Skewed
Data
Measuring the Dispersion of Data
◼ Quartiles, outliers and boxplots
◼ Quartiles: Q1 (25th percentile), Q3 (75th percentile)
◼ Inter-quartile range: IQR = Q3 – Q1
◼ Five number summary: min, Q1, median, Q3, max
◼ Boxplot: ends of the box are the quartiles; median is marked; add
whiskers, and plot outliers individually
◼ Outlier: usually, a value more than 1.5 × IQR below Q1 or above Q3
◼ Variance and standard deviation (sample: s, population: σ)
◼ Variance: (algebraic, scalable computation)
  Sample: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]$
  Population: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$
◼ Standard deviation s (or σ) is the square root of the variance
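A quick sketch computing the five-number summary, the 1.5 × IQR outlier fence, and the sample variance (the data is hypothetical):

```python
import numpy as np

x = np.array([6, 7, 15, 36, 39, 40, 41, 42, 43, 47, 49, 120])  # hypothetical sample

q1, med, q3 = np.percentile(x, [25, 50, 75])
iqr = q3 - q1
five_number = (x.min(), q1, med, q3, x.max())

# Flag values beyond the boxplot whiskers (1.5 * IQR below Q1 or above Q3).
outliers = x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)]

print(five_number, iqr, x.var(ddof=1), outliers)  # ddof=1 -> sample variance
```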
Boxplot Analysis
Visualization of Data Dispersion: 3-D
Boxplots
Graphic Displays of Basic Statistical
Descriptions
Histograms Often Tell More than Boxplots
Scatter plot
◼ Provides a first look at bivariate data to see clusters of
points, outliers, etc.
◼ Each pair of values is treated as a pair of coordinates and
plotted as points in the plane
Positively and Negatively Correlated Data
Uncorrelated Data
Summary
◼ Data attribute types: nominal, binary, ordinal, interval-scaled, ratio-
scaled
◼ Many types of data sets, e.g., numerical, text, graph, Web, image.
◼ Basic statistical descriptions: measures of central tendency and dispersion of data
Chapter 3: Data Preprocessing
◼ Data Quality
◼ Data Cleaning
◼ Data Integration
◼ Data Reduction
◼ Summary
Data Quality: Why Preprocess the Data?
Major Tasks in Data Preprocessing
◼ Data cleaning
◼ Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
◼ Data integration
◼ Integration of multiple databases, data cubes, or files
◼ Data reduction
◼ Dimensionality reduction
◼ Numerosity reduction
◼ Data compression
◼ Data transformation and data discretization
◼ Normalization
◼ Concept hierarchy generation
Forms of Data Preprocessing
Chapter 3: Data Preprocessing
◼ Data Quality
◼ Data Cleaning
◼ Data Integration
◼ Data Reduction
◼ Summary
Data Cleaning
◼ Data in the real world is dirty: lots of potentially incorrect data, e.g., faulty instruments, human or computer error, transmission errors
◼ incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
◼ e.g., Occupation=“ ” (missing data)
◼ noisy: containing noise, errors, or outliers
◼ e.g., Salary=“−10” (an error)
◼ inconsistent: containing discrepancies in codes or names, e.g.,
◼ Age=“42”, Birthday=“03/07/2010”
◼ Was rating “1, 2, 3”, now rating “A, B, C”
◼ discrepancy between duplicate records
◼ Intentional (e.g., disguised missing data)
◼ Jan. 1 as everyone’s birthday?
Incomplete (Missing) Data
◼ Missing values may arise for many reasons, e.g.:
◼ technology limitation
◼ incomplete data
◼ inconsistent data
How to Handle Noisy Data?
◼ Binning
◼ First sort data and partition it into (equal-frequency) bins
◼ Then smooth by bin means, bin medians, or bin boundaries (see the sketch below)
◼ Clustering
◼ Detect and remove outliers
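A minimal sketch of smoothing by bin means with equal-frequency bins (the price values are illustrative):

```python
import numpy as np

prices = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34])  # hypothetical sorted data

n_bins = 3
bins = np.array_split(prices, n_bins)  # equal-frequency partition

# Smoothing by bin means: replace every value by the mean of its bin.
smoothed = np.concatenate([np.full(len(b), b.mean()) for b in bins])
print(smoothed)  # [ 9.  9.  9. 22. 22. 22. 29. 29. 29.]
```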
Binning and Clustering Samples
Data Cleaning as a Process
◼ Data discrepancy detection
◼ Use metadata (e.g., domain, range, dependency, distribution)
Chapter 3: Data Preprocessing
◼ Data Quality
◼ Data Cleaning
◼ Data Integration
◼ Data Reduction
◼ Summary
Data Integration
◼ Data integration:
◼ Combines data from multiple sources into a coherent store
◼ Schema integration: e.g., A.cust-id ≡ B.cust-#
◼ Integrate metadata from different sources
◼ Entity identification problem:
◼ Identify real world entities from multiple data sources, e.g., Bill
Clinton = William Clinton
◼ Detecting and resolving data value conflicts
◼ For the same real world entity, attribute values from different
sources are different
◼ Possible reasons: different representations, different scales, e.g.,
metric vs. British units
Handling Redundancy in Data Integration
94
Correlation Analysis (Numeric Data)
◼ Correlation coefficient (Pearson’s product-moment coefficient):
  $r_{A,B} = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum_{i=1}^{n} a_i b_i - n\bar{A}\bar{B}}{(n-1)\,\sigma_A \sigma_B}$
  where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective means, and $\sigma_A$, $\sigma_B$ the respective standard deviations of A and B
◼ $r_{A,B} > 0$: positively correlated; $r_{A,B} = 0$: uncorrelated; $r_{A,B} < 0$: negatively correlated
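A small sketch of this formula on hypothetical paired data, checked against NumPy’s built-in:

```python
import numpy as np

# Hypothetical paired observations of two attributes A and B.
a = np.array([2.0, 3.0, 5.0, 4.0, 6.0])
b = np.array([5.0, 8.0, 10.0, 11.0, 14.0])

n = len(a)
# Sample correlation, matching the formula above (n-1 in the denominator,
# sample standard deviations via ddof=1).
r = ((a - a.mean()) @ (b - b.mean())) / ((n - 1) * a.std(ddof=1) * b.std(ddof=1))
print(r, np.corrcoef(a, b)[0, 1])  # the two values agree
```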
Covariance (Numeric Data)
◼ Covariance is similar to correlation:
  $\mathrm{Cov}(A,B) = E[(A - \bar{A})(B - \bar{B})] = \frac{1}{n}\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})$
◼ Correlation coefficient:
  $r_{A,B} = \frac{\mathrm{Cov}(A,B)}{\sigma_A \sigma_B}$
[Figure: scatter plots showing correlation values ranging from –1 to 1]
Correlation (viewed as linear
relationship)
◼ Correlation measures the linear relationship
between objects
◼ To compute correlation, we standardize data
objects, A and B, and then take their dot product
Co-Variance: An Example
◼ Suppose two stocks A and B have the following values in one week:
(2, 5), (3, 8), (5, 10), (4, 11), (6, 14).
◼ Question: If the stocks are affected by the same industry trends, will
their prices rise or fall together?
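Working the numbers as a check: E(A) = (2+3+5+4+6)/5 = 4, E(B) = (5+8+10+11+14)/5 = 9.6, and Cov(A,B) = (2·5 + 3·8 + 5·10 + 4·11 + 6·14)/5 − 4 × 9.6 = 42.4 − 38.4 = 4 > 0, so the two stocks do tend to rise together. The same computation as a sketch:

```python
import numpy as np

a = np.array([2, 3, 5, 4, 6])      # stock A prices
b = np.array([5, 8, 10, 11, 14])   # stock B prices

cov = (a * b).mean() - a.mean() * b.mean()  # E(AB) - E(A)E(B)
print(cov)                                   # 4.0 -> positive: they move together
print(np.cov(a, b, ddof=0)[0, 1])            # same value via NumPy
```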
Chapter 3: Data Preprocessing
◼ Data Quality
◼ Data Cleaning
◼ Data Integration
◼ Data Reduction
◼ Summary
Data Reduction Strategies
◼ Data reduction: Obtain a reduced representation of the data set that
is much smaller in volume but yet produces the same (or almost the
same) analytical results
◼ Why data reduction? — A database/data warehouse may store
terabytes of data. Complex data analysis may take a very long time to
run on the complete data set.
◼ Data reduction strategies
◼ Dimensionality reduction, e.g., remove unimportant attributes
◼ Wavelet transforms, principal components analysis, attribute subset selection
◼ Numerosity reduction, e.g., regression, histograms, clustering, sampling
◼ Data compression
Data Reduction 1: Dimensionality
Reduction
◼ Curse of dimensionality
◼ When dimensionality increases, data becomes increasingly sparse
◼ Density and distance between points, which are critical to clustering and outlier analysis, become less meaningful
◼ The possible combinations of subspaces will grow exponentially
◼ Dimensionality reduction
◼ Avoid the curse of dimensionality
◼ Help eliminate irrelevant features and reduce noise
◼ Reduce time and space required in data mining
◼ Allow easier visualization
◼ Dimensionality reduction techniques
◼ Wavelet transforms
◼ Principal Component Analysis
◼ Supervised and nonlinear techniques (e.g., feature selection)
Mapping Data to a New Space
◼ Fourier transform
◼ Wavelet transform
What Is Wavelet Transform?
◼ Decomposes a signal into
different frequency subbands
◼ Applicable to n-
dimensional signals
◼ Data are transformed to
preserve relative distance
between objects at different
levels of resolution
◼ Allow natural clusters to
become more distinguishable
◼ Used for image compression
Wavelet Transformation
◼ Discrete wavelet transform (DWT) for linear signal
processing, multi-resolution analysis
◼ Compressed approximation: store only a small fraction of the strongest wavelet coefficients
◼ Similar to discrete Fourier transform (DFT), but better
lossy compression, localized in space
◼ Pyramid algorithm: halves the data at each iteration, resulting in fast computation
◼ Common filters: Haar-2, Daubechies-4
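A minimal sketch of one flavor of the pyramid algorithm, using the unnormalized Haar filter (pairwise averages and differences); the signal is made up:

```python
import numpy as np

def haar_step(x):
    """One level of the (unnormalized) Haar wavelet transform:
    pairwise averages (smooth part) and pairwise differences (detail part)."""
    x = np.asarray(x, dtype=float)
    avg = (x[0::2] + x[1::2]) / 2
    det = (x[0::2] - x[1::2]) / 2
    return avg, det

# Pyramid algorithm: recurse on the averages, halving the data each time.
signal = np.array([2, 2, 0, 2, 3, 5, 4, 4])  # hypothetical length-8 signal
details = []
s = signal
while len(s) > 1:
    s, d = haar_step(s)
    details.append(d)
print(s, details)  # overall average plus detail coefficients per level
```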
Wavelet Decomposition
Why Wavelet Transform?
◼ Use hat-shape filters
◼ Emphasize region where points cluster
◼ Multi-resolution
◼ Detect arbitrary shaped clusters at different scales
◼ Efficient
◼ Complexity O(N)
Principal Component Analysis (PCA)
◼ Find a projection that captures the largest amount of variation in data
◼ The original data are projected onto a much smaller space, resulting
in dimensionality reduction. We find the eigenvectors of the
covariance matrix, and these eigenvectors define the new space
◼ Applied to ordered and unordered attributes; can handle sparse and skewed data. Used as input to multiple regression and cluster analysis
[Figure: data points in the (x1, x2) plane with the principal axes overlaid]
Principal Component Analysis (Steps)
◼ Given N data vectors from n-dimensions, find k ≤ n orthogonal vectors
(principal components) that can be best used to represent data
◼ Normalize input data: Each attribute falls within the same range
◼ Compute k orthonormal (unit) vectors, i.e., principal components
◼ Each input data (vector) is a linear combination of the k principal
component vectors
◼ The principal components are sorted in order of decreasing
“significance” or strength
◼ Since the components are sorted, the size of the data can be
reduced by eliminating the weak components, i.e., those with low
variance (i.e., using the strongest principal components, it is
possible to reconstruct a good approximation of the original data)
◼ Works for numeric data only
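A compact sketch of these steps with NumPy (the data here is random and purely illustrative):

```python
import numpy as np

def pca(X, k):
    """Project n-dimensional data X (rows = samples) onto its
    k strongest principal components."""
    Xc = X - X.mean(axis=0)                 # normalize: zero-mean each attribute
    cov = np.cov(Xc, rowvar=False)          # covariance matrix
    vals, vecs = np.linalg.eigh(cov)        # eigenpairs (ascending eigenvalues)
    order = np.argsort(vals)[::-1]          # sort by decreasing "significance"
    components = vecs[:, order[:k]]         # keep the k strongest components
    return Xc @ components                  # reduced representation

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))               # hypothetical numeric data
print(pca(X, 2).shape)                      # (100, 2)
```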
Attribute Subset Selection
Method for stopping criteria
Example
Attribute Creation (Feature
Generation)
◼ Create new attributes (features) that can capture the
important information in a data set more effectively than
the original ones
◼ Three general methodologies
◼ Attribute extraction: domain-specific
◼ Mapping data to a new space, e.g., Fourier or wavelet transforms
◼ Attribute construction: combining features, data discretization
Data Reduction 2: Numerosity
Reduction
◼ Reduce data volume by choosing alternative, smaller
forms of data representation
◼ Parametric methods (e.g., regression)
◼ Assume the data fits some model, estimate model
parameters, store only the parameters, and discard
the data (except possible outliers)
◼ Ex.: Log-linear models—obtain value at a point in m-
D space as the product on appropriate marginal
subspaces
◼ Non-parametric methods
◼ Do not assume models
◼ Major families: histograms, clustering, sampling, …
Parametric Data Reduction:
Regression and Log-Linear Models
◼ Linear regression
◼ Data modeled to fit a straight line
◼ Multiple regression
◼ Allows a response variable Y to be modeled as a
linear function of multidimensional feature vector
◼ Log-linear model
◼ Approximates discrete multidimensional probability
distributions
Regression Analysis
[Figure: scatter of data points with a fitted regression line]
Regression Analysis and Log-Linear Models
◼ Linear regression: Y = w X + b
◼ Two regression coefficients, w and b, specify the line and are to be
estimated by using the data at hand
◼ Estimated using the least-squares criterion on the known values of Y1, Y2, …, X1, X2, …
◼ Multiple regression: Y = b0 + b1 X1 + b2 X2
◼ Many nonlinear functions can be transformed into the above
◼ Log-linear models:
◼ Approximate discrete multidimensional probability distributions
◼ Estimate the probability of each point (tuple) in a multi-dimensional
space for a set of discretized attributes, based on a smaller subset
of dimensional combinations
◼ Useful for dimensionality reduction and data smoothing
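A minimal sketch of least-squares linear regression on hypothetical (X, Y) pairs:

```python
import numpy as np

# Hypothetical paired data; fit Y = w*X + b by least squares.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

w, b = np.polyfit(X, Y, deg=1)   # degree-1 fit = linear regression
print(w, b)                      # slope and intercept
print(w * 6.0 + b)               # predict Y for a new X
```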
Histogram Analysis
◼ Use binning to approximate data distributions
◼ Divide data into disjoint subsets (buckets or bins) and store the average (or sum) for each bucket
◼ A bucket representing a single attribute-value pair is a singleton bucket
◼ Partitioning rules (see the sketch below):
◼ Equal-width: all buckets have the same range (width)
◼ Equal-frequency (equal-depth): each bucket holds roughly the same number of samples
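A short sketch of equal-width bucketing, storing one summary value per bucket (hypothetical data):

```python
import numpy as np

data = np.array([5, 7, 8, 12, 15, 18, 21, 22, 24, 28])  # hypothetical values

# Equal-width histogram: 3 buckets of identical range.
counts, edges = np.histogram(data, bins=3)
print(counts, edges)

# Store one summary value (the mean) per bucket, as in histogram-based reduction.
which = np.clip(np.digitize(data, edges) - 1, 0, len(counts) - 1)
means = [data[which == i].mean() for i in range(len(counts))]
print(means)
```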
Clustering
◼ Partition data set into clusters based on similarity, and
store cluster representation (e.g., centroid and diameter)
only
◼ Can be very effective if data is clustered but not if data
is “smeared”
◼ Can have hierarchical clustering and be stored in multi-
dimensional index tree structures
◼ There are many choices of clustering definitions and
clustering algorithms
Sampling
Sampling: With or without Replacement
[Figure: drawing samples from the raw data, with (SRSWR) or without (SRSWOR) replacement]
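A minimal sketch of the sampling schemes named on this slide and the next (SRSWOR, SRSWR, and stratified sampling); the population and strata are made up:

```python
import numpy as np

rng = np.random.default_rng(42)
data = np.arange(100)  # hypothetical population of 100 records

srswor = rng.choice(data, size=10, replace=False)  # simple random sample w/o replacement
srswr = rng.choice(data, size=10, replace=True)    # with replacement (duplicates possible)

# Stratified sampling sketch: sample proportionally from each stratum.
strata = {"young": data[:30], "adult": data[30:80], "senior": data[80:]}
stratified = np.concatenate(
    [rng.choice(s, size=max(1, len(s) // 10), replace=False) for s in strata.values()]
)
print(srswor, srswr, stratified)
```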
Sampling: Cluster or Stratified
Sampling
Data Cube Aggregation
Data Compression
[Figure: original data reduced by lossless compression, or approximated by lossy compression]
Chapter 3: Data Preprocessing
◼ Data Quality
◼ Data Cleaning
◼ Data Integration
◼ Data Reduction
◼ Summary
Data Transformation
◼ A function that maps the entire set of values of a given attribute to a
new set of replacement values s.t. each old value can be identified
with one of the new values
◼ Methods
◼ Smoothing: Remove noise from data
◼ Attribute/feature construction
◼ New attributes constructed from the given ones
◼ Aggregation: Summarization, data cube construction
◼ Normalization: Scaled to fall within a smaller, specified range
◼ min-max normalization, z-score normalization, normalization
by decimal scaling
◼ Discretization: Concept hierarchy climbing
◼ Concept hierarchy generalization for nominal data
Normalization
◼ Min-max normalization: to [new_minA, new_maxA]
  $v' = \frac{v - min_A}{max_A - min_A}(new\_max_A - new\_min_A) + new\_min_A$
◼ Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to $\frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}(1.0 - 0) + 0 = 0.716$
◼ Z-score normalization (μ: mean, σ: standard deviation):
  $v' = \frac{v - \mu_A}{\sigma_A}$
◼ Ex. Let μ = 54,000, σ = 16,000. Then $\frac{73{,}600 - 54{,}000}{16{,}000} = 1.225$
◼ Normalization by decimal scaling:
  $v' = \frac{v}{10^j}$, where j is the smallest integer such that Max(|v'|) < 1
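A short sketch of all three methods (the income values echo the examples above; the decimal-scaling exponent j is derived from the data):

```python
import numpy as np

income = np.array([12_000, 54_000, 73_600, 98_000], dtype=float)  # hypothetical

# Min-max normalization to [0, 1].
minmax = (income - income.min()) / (income.max() - income.min())

# Z-score normalization (using the slide's mu and sigma for $73,600).
z = (73_600 - 54_000) / 16_000           # -> 1.225

# Decimal scaling: divide by 10^j so that max(|v'|) < 1.
j = int(np.ceil(np.log10(np.abs(income).max() + 1)))
decimal = income / 10**j

print(minmax, z, decimal)                # minmax[2] -> 0.716...
```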
Discretization
◼ Three types of attributes
◼ Nominal—values from an unordered set, e.g., color, profession
◼ Ordinal—values from an ordered set, e.g., military or academic
rank
◼ Numeric—integer or real numbers
◼ Discretization: Divide the range of a continuous attribute into intervals
◼ Interval labels can then be used to replace actual data values
◼ Reduce data size by discretization
◼ Supervised vs. unsupervised
◼ Split (top-down) vs. merge (bottom-up)
◼ Discretization can be performed recursively on an attribute
◼ Prepare for further analysis, e.g., classification
Data Discretization Methods
◼ Typical methods: All the methods can be applied recursively
◼ Binning
◼ Top-down split, unsupervised
◼ Histogram analysis
◼ Top-down split, unsupervised
◼ Clustering analysis (unsupervised, top-down split or
bottom-up merge)
◼ Decision-tree analysis (supervised, top-down split)
◼ Correlation (e.g., χ²) analysis (unsupervised, bottom-up merge)
Simple Discretization: Binning
Labels (Binning vs. Clustering)
Discretization by Classification &
Correlation Analysis
◼ Classification (e.g., decision tree analysis)
◼ Supervised: Given class labels, e.g., cancerous vs. benign
◼ Using entropy to determine split point (discretization point)
◼ Top-down, recursive split
Concept Hierarchy Generation
◼ Concept hierarchy organizes concepts (i.e., attribute values)
hierarchically and is usually associated with each dimension in a data
warehouse
◼ Concept hierarchies facilitate drilling and rolling in data warehouses to
view data in multiple granularity
◼ Concept hierarchy formation: Recursively reduce the data by collecting
and replacing low level concepts (such as numeric values for age) by
higher level concepts (such as youth, adult, or senior)
◼ Concept hierarchies can be explicitly specified by domain experts
and/or data warehouse designers
◼ Concept hierarchy can be automatically formed for both numeric and
nominal data. For numeric data, use discretization methods.
Concept Hierarchy Generation
for Nominal Data
◼ Specification of a partial/total ordering of attributes
explicitly at the schema level by users or experts
◼ street < city < state < country
◼ Specification of a hierarchy for a set of values by explicit
data grouping
◼ {Urbana, Champaign, Chicago} < Illinois
◼ Specification of only a partial set of attributes
◼ E.g., only street < city, not others
◼ Automatic generation of hierarchies (or attribute levels) by
the analysis of the number of distinct values
◼ E.g., for a set of attributes: {street, city, state, country}
Automatic Concept Hierarchy Generation
◼ Some hierarchies can be automatically generated based on
the analysis of the number of distinct values per attribute in
the data set
◼ The attribute with the most distinct values is placed at
the lowest level of the hierarchy
◼ Exceptions, e.g., weekday, month, quarter, year
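A small sketch of this heuristic: count the distinct values per attribute and order attributes from most distinct (bottom of the hierarchy) to fewest (top); the location records are made up:

```python
records = [
    ("Main St", "Urbana", "IL", "USA"),
    ("Oak Ave", "Urbana", "IL", "USA"),
    ("1st St", "Chicago", "IL", "USA"),
    ("Pine Rd", "Vancouver", "BC", "Canada"),
    ("Elm St", "Seattle", "WA", "USA"),
]
names = ["street", "city", "state", "country"]

# Distinct values per attribute: street 5, city 4, state 3, country 2.
distinct = {n: len({r[i] for r in records}) for i, n in enumerate(names)}

# More distinct values -> lower level of the hierarchy.
hierarchy = sorted(names, key=lambda n: distinct[n], reverse=True)
print(" < ".join(hierarchy))  # street < city < state < country
```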
Chapter 3: Data Preprocessing
◼ Data Quality
◼ Data Cleaning
◼ Data Integration
◼ Data Reduction
◼ Summary
Summary
◼ Data quality: accuracy, completeness, consistency, timeliness,
believability, interpretability
◼ Data cleaning: e.g., missing/noisy values, outliers
◼ Data integration from multiple sources:
◼ Entity identification problem
◼ Remove redundancies
◼ Detect inconsistencies
◼ Data reduction
◼ Dimensionality reduction
◼ Numerosity reduction
◼ Data compression