UNIT-I
Syllabus:
Introduction to Data Mining: Motivation for Data Mining, Data Mining-Definition &
Functionalities, Classification of DM systems, DM task primitives, Integration of a Data Mining
system with a Database or a Data Warehouse, Major issues in Data Mining.
Data Warehousing (Overview Only): Overview of concepts like star schema, fact and
dimension tables, OLAP operations, From OLAP to Data Mining.
Data mining refers to extracting or "mining" knowledge from large amounts of data.
There are many other terms related to data mining, such as knowledge mining, knowledge
extraction, data/pattern analysis, data archaeology, and data dredging.
Many people treat data mining as a synonym for another popularly used term, "Knowledge
Discovery in Databases", or KDD.
How is a data warehouse different from a database? How are they similar?
Differences between a data warehouse and a database: A database (operational/OLTP system) supports the day-to-day transactions of an organization on current, detailed data, whereas a data warehouse (OLAP system) stores subject-oriented, integrated, historical data to support analysis and decision making.
Similarities between a data warehouse and a database: Both are repositories of information,
storing huge amounts of persistent data.
1.1.3 Data mining functionalities/Data mining tasks: what kinds of patterns can be
mined?
Data mining functionalities are used to specify the kind of patterns to be found in data
mining tasks. In general, data mining tasks can be classified into two categories:
• Descriptive
• Predictive
Descriptive mining tasks characterize the general properties of the data in the database.
Predictive mining tasks perform inference on the current data in order to make predictions.
Q) Describe data mining functionalities, and the kinds of patterns they can discover
(or)
Q) Define each of the following data mining functionalities: characterization,
discrimination, association and correlation analysis, classification, prediction, clustering,
and evolution analysis. Give examples of each data mining functionality, using a real-life
database that you are familiar with.
Example
The general features of students with high GPAs may be compared with the general
features of students with low GPAs. The resulting description could be a general comparative
profile of the students, such as: 75% of the students with high GPAs are fourth-year computing
science students, while 65% of the students with low GPAs are not.
Discrimination descriptions expressed in rule form are referred to as discriminant rules.
Correlation analysis
Correlation analysis is a technique used to measure the association between two variables.
Correlation is the degree or type of relationship between two or more quantities (variables).
A correlation coefficient (r) is a statistic used for measuring the strength of a supposed
linear association between two variables. Correlations range from -1.0 to +1.0 in value.
A correlation coefficient of +1.0 indicates a perfect positive relationship in which the two
variables fluctuate together (as one increases or decreases, the other does the same).
A correlation coefficient of 0.0 indicates no relationship between the two variables. That is,
one cannot use the scores on one variable to tell anything about the scores on the second
variable.
A correlation coefficient of -1.0 indicates a perfect negative relationship, in which high
values of one variable are associated with low values of the other (as one increases, the other decreases).
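A minimal sketch of computing r for two numeric attributes is shown below; the attribute names and values are hypothetical, not taken from the text.
```python
import numpy as np

# Hypothetical example: hours studied per week vs. GPA for a small sample of students.
hours_studied = np.array([2, 4, 6, 8, 10], dtype=float)
gpa           = np.array([2.1, 2.8, 3.0, 3.4, 3.9])

# Pearson's r = cov(X, Y) / (std(X) * std(Y)); np.corrcoef returns the 2x2 correlation matrix.
r = np.corrcoef(hours_studied, gpa)[0, 1]
print(f"correlation coefficient r = {r:.2f}")   # close to +1.0 => strong positive association
```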
Prediction:
Prediction refers to finding missing or unavailable numerical data values rather than class
labels.
Although prediction may refer to both numerical prediction and class label prediction, it is
usually confined to numerical data value prediction and thus is distinct from classification.
Prediction also encompasses the identification of distribution trends based on the available
data.
Regression analysis is a statistical methodology that is most often used for numerical
prediction.
Example:
Predicting flooding is a difficult problem. One approach uses monitors placed at various
points in the river. These monitors collect data relevant to flood prediction: water level, rain
amount, time, humidity, etc. The water level at a potential flooding point in the river can then be
predicted based on the data collected by the sensors upriver from that point. The prediction must
be made with respect to the time the data were collected.
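As a minimal sketch of numerical prediction by regression, assume hypothetical sensor readings (rainfall upriver) paired with later water levels at the flooding point; a simple least-squares line is fitted and used to predict a new value.
```python
import numpy as np

# Hypothetical readings from an upriver sensor (rainfall in mm) paired with the
# water level (in metres) observed later at the potential flooding point.
rain_mm     = np.array([ 5, 12, 20, 35, 50], dtype=float)
water_level = np.array([1.1, 1.6, 2.2, 3.0, 3.9])

# Fit a simple linear regression: water_level ~ a * rain_mm + b
a, b = np.polyfit(rain_mm, water_level, deg=1)

# Predict the (numeric) water level for a new rainfall reading of 40 mm.
predicted = a * 40 + b
print(f"predicted water level: {predicted:.2f} m")
```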
4. Clustering analysis
The objects are clustered or grouped based on the principle of maximizing the intra-class
similarity and minimizing the interclass similarity.
Clusters of objects are formed so that objects within a cluster have high similarity to one
another, but are very dissimilar to objects in other clusters.
Clustering can also facilitate taxonomy formation, that is, the organization of observations into
a hierarchy of classes that group similar events together as shown below:
A 2-D plot of customer data with respect to customer locations in a city, showing three data
clusters.
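A minimal clustering sketch in the spirit of the figure, assuming scikit-learn is available; the customer coordinates below are made up for illustration.
```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical (x, y) customer locations in a city, loosely forming three groups.
locations = np.array([
    [1.0, 1.2], [1.3, 0.9], [0.8, 1.1],      # group near (1, 1)
    [5.1, 5.0], [4.8, 5.3], [5.2, 4.9],      # group near (5, 5)
    [9.0, 1.1], [8.7, 0.8], [9.2, 1.3],      # group near (9, 1)
])

# Group the objects so that intra-cluster similarity is high and
# inter-cluster similarity is low.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(locations)
print(kmeans.labels_)           # cluster id assigned to each customer
print(kmeans.cluster_centers_)  # one centre per discovered cluster
```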
Classification vs. Clustering
In general, in classification you have a set of predefined classes and want to know which
class a new object belongs to.
Clustering tries to group a set of objects and find whether there is some relationship between
the objects.
In the context of machine learning, classification is supervised learning and clustering is
unsupervised learning.
5. Outlier analysis:
A database may contain data objects that do not comply with the general behavior or model
of the data.
These data objects are outliers. In other words, data objects that do not fall within any
cluster are called outliers.
Noisy or exceptional data are also called outlier data. The analysis of outlier data is
referred to as outlier mining.
Example
Outlier analysis may uncover fraudulent usage of credit cards by detecting purchases of
extremely large amounts for a given account number in comparison to regular charges incurred
by the same account. Outlier values may also be detected with respect to the location and type
of purchase, or the purchase frequency.
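A minimal sketch of one possible outlier check on a single account, using a simple z-score rule on hypothetical purchase amounts (an illustrative method, not one prescribed by the text).
```python
import numpy as np

# Hypothetical purchase amounts charged to one credit card account.
amounts = np.array([470, 480, 490, 495, 500, 505, 510, 515, 520, 525, 530, 540, 9800],
                   dtype=float)

# Flag purchases lying more than 3 standard deviations from the account's mean charge.
z_scores = (amounts - amounts.mean()) / amounts.std()
outliers = amounts[np.abs(z_scores) > 3]
print(outliers)   # the unusually large charge is flagged as an outlier
```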
1.1.4 Which Technologies Are Used? (Or) Classification of Data Mining Systems
Statistics
A statistical model is a set of mathematical functions that describe the behavior of the
objects in a target class in terms of random variables and their associated probability
distributions. Statistical models are widely used to model data and data classes.
For example, in data mining tasks like data characterization and classification, statistical
models of target classes can be built. In other words, such statistical models can be the outcome
of a data mining task.
Alternatively, data mining tasks can be built on top of statistical models. For example,
we can use statistics to model noise and missing data values. Then, when mining patterns in a
large data set, the data mining process can use the model to help identify and handle noisy or
missing values in the data.
Statistics research develops tools for prediction and forecasting using data and statistical
models. Statistical methods can be used to summarize or describe a collection of data.
Statistics is useful for mining various patterns from data as well as for understanding the
underlying mechanisms generating and affecting the patterns.
Inferential statistics (or predictive statistics) models data in a way that accounts for
randomness and uncertainty in the observations and is used to draw inferences about the process
or population under investigation.
Statistical methods can also be used to verify data mining results. For example, after a
classification or prediction model is mined, the model should be verified by statistical
hypothesis testing. A statistical hypothesis test (sometimes called confirmatory data analysis)
makes statistical decisions using experimental data. A result is called statistically significant if it
is unlikely to have occurred by chance. If the classification or prediction model holds true, then
the descriptive statistics of the model increase confidence in its soundness.
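A minimal sketch of such a verification step, assuming SciPy is available; the accuracy scores below are made-up values for a mined model and a baseline, compared with a two-sample t-test.
```python
import numpy as np
from scipy import stats

# Hypothetical accuracy scores of a mined classification model on 10 random test
# splits, compared against a simple baseline model on the same splits.
model_acc    = np.array([0.86, 0.88, 0.85, 0.87, 0.89, 0.84, 0.88, 0.86, 0.87, 0.85])
baseline_acc = np.array([0.80, 0.79, 0.81, 0.78, 0.82, 0.80, 0.79, 0.81, 0.80, 0.78])

# Two-sample t-test: is the difference in mean accuracy statistically significant?
t_stat, p_value = stats.ttest_ind(model_acc, baseline_acc)
if p_value < 0.05:
    print(f"difference is statistically significant (p = {p_value:.4f})")
else:
    print(f"difference could plausibly have occurred by chance (p = {p_value:.4f})")
```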
Machine Learning
Machine learning investigates how computers can learn (or improve their performance)
based on data. A main research area is for computer programs to automatically learn to
recognize complex patterns and make intelligent decisions based on data. For example, a typical
machine learning problem is to program a computer so that it can automatically recognize
handwritten postal codes on mail after learning from a set of examples.
Machine learning is a fast-growing discipline. Here, we illustrate classic problems in
machine learning that are highly related to data mining.
Supervised learning is basically a synonym for classification. The supervision in the learning
comes from the labeled examples in the training data set. For example, in the postal code
recognition problem, a set of handwritten postal code images and their corresponding machine-
readable translations are used as the training examples, which supervise the learning of the
classification model.
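A minimal sketch of supervised learning, using scikit-learn's bundled handwritten-digits data (an assumption of this example, loosely analogous to the postal-code problem) and a decision tree classifier.
```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Labeled examples: 8x8 images of handwritten digits with their true labels.
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)

# The labels in the training set "supervise" the learning of the classification model.
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print(f"accuracy on unseen examples: {clf.score(X_test, y_test):.2f}")
```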
Semi-supervised learning
Semi-supervised learning is a class of machine learning techniques that make use of both labeled
and unlabeled examples when learning a model. The labeled examples are used to learn class
models, while the unlabeled examples help refine the decision boundary between classes.
Moreover, labeled examples that fall deep inside the region of the opposite class can be detected
as likely noise or outliers.
Active learning is a machine learning approach that lets users play an active role in the learning
process. An active learning approach can ask a user (e.g., a domain expert) to label an example,
which may be from a set of unlabeled examples or synthesized by the learning program. The
goal is to optimize the model quality by actively acquiring knowledge from human users, given
a constraint on how many examples they can be asked to label.
Information Retrieval
The typical approaches in information retrieval adopt probabilistic models. For example,
a text document can be regarded as a bag of words, that is, a multi-set of words appearing in the
document. The document’s language model is the probability density function that generates the
bag of words in the document. The similarity between two documents can be measured by the
similarity between their corresponding language models.
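A simplified sketch of the bag-of-words idea, comparing two documents with cosine similarity over their word counts rather than with full probabilistic language models; the example documents are made up.
```python
from collections import Counter
import math

def bag_of_words(text):
    """Represent a document as a multiset (bag) of its words."""
    return Counter(text.lower().split())

def cosine_similarity(bag_a, bag_b):
    """Similarity between two documents via their word-count vectors."""
    dot = sum(bag_a[w] * bag_b[w] for w in set(bag_a) & set(bag_b))
    norm_a = math.sqrt(sum(c * c for c in bag_a.values()))
    norm_b = math.sqrt(sum(c * c for c in bag_b.values()))
    return dot / (norm_a * norm_b)

doc1 = "data mining extracts knowledge from large amounts of data"
doc2 = "knowledge discovery mines interesting patterns from large data sets"
print(f"similarity: {cosine_similarity(bag_of_words(doc1), bag_of_words(doc2)):.2f}")
```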
1.1.5 Data Mining Task Primitives
A data mining query is defined in terms of data mining task primitives. These primitives
allow the user to interactively communicate with the data mining system during discovery in
order to direct the mining process, or to examine the findings from different angles or depths.
The set of task-relevant data to be mined: This specifies the portions of the database or the set of
data in which the user is interested. This includes the database attributes or data warehouse
dimensions of interest (referred to as the relevant attributes or dimensions).
The kind of knowledge to be mined: This specifies the data mining functions to be performed,
such as characterization, discrimination, association or correlation analysis, classification,
prediction, clustering, outlier analysis, or evolution analysis.
The background knowledge to be used in the discovery process: This knowledge about the
domain to be mined is useful for guiding the knowledge discovery process and for evaluating
the patterns found. Concept hierarchies are a popular form of background knowledge, which
allow data to be mined at multiple levels of abstraction.
The interestingness measures and thresholds for pattern evaluation: They may be used to guide
the mining process or, after discovery, to evaluate the discovered patterns. Different kinds of
knowledge may have different interestingness measures. For example, interestingness measures
for association rules include support and confidence.
Rules whose support and confidence values are below user-specified thresholds are considered
uninteresting.
The expected representation for visualizing the discovered patterns: This refers to the form in
which discovered patterns are to be displayed, which may include rules, tables, charts, graphs,
decision trees, and cubes.
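As an illustrative sketch (not an actual data mining query language), the five primitives above could be bundled into a single query specification; the relation, attribute, and threshold values below are hypothetical.
```python
# A sketch of how the five task primitives might be collected into one
# mining-query specification (illustrative only).
mining_query = {
    "task_relevant_data": {
        "relation": "sales",                               # hypothetical table
        "relevant_attributes": ["item", "branch", "amount", "month"],
    },
    "kind_of_knowledge": "association",                    # or characterization, classification, ...
    "background_knowledge": {
        "concept_hierarchies": {"location": ["street", "city", "state", "country"]},
    },
    "interestingness_measures": {"min_support": 0.05, "min_confidence": 0.7},
    "presentation": ["rules", "tables", "charts"],
}
```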
1.1.6 Integration of Data Mining System with a Database or Data warehouse System
Good system architecture will enable the data mining system to make the best use of the
software environment, and to accomplish data mining tasks in an efficient and timely manner in
cooperation with other information systems.
A data mining system should therefore be designed to integrate or couple with a database (DB)
system and/or a data warehouse (DW) system. The possible coupling schemes are: no coupling,
loose coupling, semi-tight coupling, and tight coupling.
2. Loose coupling:
Loose coupling means that a DM system will use some facilities of a DB or DW
system, such as
Fetching data from a data repository managed by these systems,
Performing data mining,
And then storing the mining results either in a file or in a designated place in a
database or data warehouse.
Loose coupling is better than no coupling because it can fetch any portion of data stored in
databases or data warehouses by using query processing, indexing, and other system
facilities.
4. Tight coupling
Tight coupling means that a DM system is smoothly integrated into the DB/DW system.
The data mining subsystem is treated as one functional component of the information
system.
Data mining queries and functions are optimized based on mining query analysis, data
structures, indexing schemes, and query processing methods of the DB or DW system.
2. User Interaction:
Time-variant: Data are stored to provide information from a historical perspective (e.g., the
past 5–10 years). Every key structure in the data warehouse contains, either implicitly or
explicitly, an element of time.
Nonvolatile: A data warehouse is always a physically separate store of data transformed
from the application data found in the operational environment. Due to this separation, a
data warehouse does not require transaction processing, recovery, and concurrency control
mechanisms. It usually requires only two operations in data accessing: initial loading of data and
access of data.
Tier-1:
The bottom tier is a warehouse database server that is almost always a relational database
system. Back-end tools and utilities are used to feed data into the bottom tier from operational
databases or other external sources (such as customer profile information provided by external
consultants). These tools and utilities perform data extraction, cleaning, and transformation
(e.g., to merge similar data from different sources into a unified format), as well as load and
refresh functions to update the data warehouse. The data are extracted using application
program interfaces known as gateways. A gateway is supported by the underlying DBMS and
allows client programs to generate SQL code to be executed at a server.
Examples of gateways include ODBC (Open Database Connectivity) and OLE DB (Object
Linking and Embedding, Database) by Microsoft, and JDBC (Java Database Connectivity).
This tier also contains a metadata repository, which stores information about the data warehouse
and its contents.
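A minimal sketch of the extract step through an ODBC gateway, assuming the pyodbc package and a configured ODBC data source; the DSN, credentials, and table name are placeholders, not real values.
```python
import pyodbc  # assumes pyodbc and an ODBC driver/DSN are installed and configured

# Hypothetical connection string for an operational source database.
conn = pyodbc.connect("DSN=operational_db;UID=etl_user;PWD=secret")
cursor = conn.cursor()

# Extract step: pull raw customer rows through the gateway so that back-end tools
# can clean, transform, and load them into the warehouse's bottom tier.
cursor.execute("SELECT customer_id, name, city FROM customers")
rows = cursor.fetchall()
conn.close()
print(f"extracted {len(rows)} rows for cleaning and transformation")
```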
Tier-3:
The top tier is a front-end client layer, which contains query and reporting tools, analysis tools,
and/or data mining tools (e.g., trend analysis, prediction, and so on).
Databases generally involve transactions. OLTP (online transaction processing) is a class of
software programs capable of supporting transaction-oriented applications on the Internet.
Typically, OLTP systems are used for order entry, financial transactions, customer relationship
management (CRM) and retail sales.
Data warehouses generally follow OLAP. OLAP (Online Analytical Processing) is the
technology behind many Business Intelligence (BI) applications. OLAP is a powerful
technology for data discovery, including capabilities for limitless report viewing, complex
analytical calculations, and predictive “what if” scenario (budget, forecast) planning.
A 3-D data cube representation of the data in the above table, according to time, item, and
location.
1.2.4 Star, Snowflakes, and Fact Constellations: Schemas for Multidimensional Data Models
Star schema:
The most common modeling paradigm is the star schema, in which the data warehouse
contains (1) a large central table (fact table) containing the bulk of the data, with no
redundancy, and (2) a set of smaller attendant tables (dimension tables), one for each
dimension.
The schema graph resembles a starburst, with the dimension tables displayed in a radial
pattern around the central fact table.
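A tiny star schema sketched as pandas DataFrames, with a hypothetical sales fact table and one of its dimension tables; the table and column names are illustrative only.
```python
import pandas as pd

# Hypothetical dimension table: one row per item.
dim_item = pd.DataFrame({
    "item_key":  [1, 2],
    "item_name": ["home theater", "laptop"],
    "brand":     ["AcmeSound", "AcmeComp"],
})

# Hypothetical fact table: the bulk of the data, keyed by the dimensions
# and carrying the numeric measures.
fact_sales = pd.DataFrame({
    "item_key":     [1, 1, 2, 2],
    "time_key":     [101, 102, 101, 102],
    "location_key": [7, 8, 7, 8],
    "dollars_sold": [1200.0, 900.0, 2500.0, 1800.0],
})

# Resolving a dimension key against its dimension table, as a query would do.
print(fact_sales.merge(dim_item, on="item_key"))
```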
Snowflake schema:
The snowflake schema is a variant of the star schema model, where some dimension tables
are normalized, thereby further splitting the data into additional tables.
The resulting schema graph forms a shape similar to a snowflake.
The major difference between the snowflake and star schema models is that the dimension
tables of the snowflake model may be kept in normalized form to reduce redundancies.
Fact constellation:
Sophisticated applications may require multiple fact tables to share dimension tables.
This kind of schema can be viewed as a collection of stars, and hence is called a galaxy
schema or a fact constellation.
Ex 1:
Hierarchical and lattice structures of attributes in warehouse dimensions: (a) a hierarchy for
Location and (b) a lattice for time.
Concept hierarchies may also be defined by discretizing or grouping values for a given
dimension or attribute, resulting in a set-grouping hierarchy.
In the multidimensional model, data are organized into multiple dimensions, and each
dimension contains multiple levels of abstraction defined by concept hierarchies.
This organization provides users with the flexibility to view data from different
perspectives.
A number of OLAP data cube operations exist to materialize these different views,
allowing interactive querying and analysis of the data at hand.
Hence, OLAP provides a user-friendly environment for interactive data analysis.
Roll-up:
The roll-up operation (also called the drill-up operation by some vendors) performs
aggregation on a data cube, either by climbing up a concept hierarchy for a dimension or by
dimension reduction.
Example: The result of a roll-up operation performed on the central cube by climbing up the
concept hierarchy for location given in Figure.
This hierarchy was defined as the total order “street < city < province or state < country.”
The roll-up operation shown aggregates the data by ascending the location hierarchy from
the level of city to the level of country.
Drill-down:
Drill-down is the reverse of roll-up. It navigates from less detailed data to more detailed
data.
Drill-down can be realized by either stepping down a concept hierarchy for a dimension or
introducing additional dimensions.
Following Figure shows the result of a drill-down operation performed on the central cube
by stepping down a concept hierarchy for time defined as “day < month < quarter < year.”
Drill-down occurs by descending the time hierarchy from the level of quarter to the more
detailed level of month.
Slice:
The slice operation performs a selection on one dimension of the given cube, resulting in
a subcube.
Following Figure shows a slice operation where the sales data are selected from the
central cube for the dimension time using the criterion time = “Q1.”
Dice:
The dice operation defines a subcube by performing a selection on two or more dimensions.
For example, a dice operation on the central cube could select on both the time and location
dimensions simultaneously (e.g., time = “Q1” or “Q2” and location = “Toronto” or “Vancouver”).
Pivot (rotate):
Pivot (also called rotate) is a visualization operation that rotates the data axes in view to
provide an alternative data presentation.
Following Figure shows a pivot operation where the item and location axes in a 2-D slice
are rotated. Other examples include rotating the axes in a 3-D cube, or transforming a 3-D
cube into a series of 2-D planes.
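A compact sketch of these operations on hypothetical tabular sales data using pandas; the country, city, item, and quarter values are made up, and the code mimics roll-up, slice, dice, and pivot rather than using a real OLAP server.
```python
import pandas as pd

# Hypothetical sales data at the (city, item, quarter) level of detail.
sales = pd.DataFrame({
    "country": ["Canada", "Canada", "USA", "USA", "USA", "Canada"],
    "city":    ["Toronto", "Vancouver", "New York", "Chicago", "New York", "Toronto"],
    "item":    ["phone", "phone", "laptop", "phone", "laptop", "laptop"],
    "quarter": ["Q1", "Q1", "Q1", "Q2", "Q2", "Q2"],
    "sales":   [400, 300, 900, 350, 1100, 700],
})

# Roll-up: climb the location hierarchy from city to country and aggregate.
roll_up = sales.groupby(["country", "item", "quarter"])["sales"].sum()

# Slice: select a single value on one dimension (time = "Q1"), giving a subcube.
slice_q1 = sales[sales["quarter"] == "Q1"]

# Dice: select on two or more dimensions at once.
dice = sales[(sales["quarter"] == "Q1") & (sales["country"] == "USA")]

# Pivot (rotate): view items as rows and quarters as columns.
pivot = sales.pivot_table(index="item", columns="quarter", values="sales", aggfunc="sum")
print(roll_up, slice_q1, dice, pivot, sep="\n\n")
```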
From OLAP to Data Mining
Multidimensional data mining (also known as exploratory multidimensional data mining, online
analytical mining, or OLAM) integrates OLAP with data mining to uncover knowledge in
multidimensional databases.
An OLAM server performs analytical mining on data cubes in a similar manner as an OLAP
server performs OLAP. An integrated OLAM and OLAP architecture is shown in the following
figure, where the OLAM and OLAP servers both accept user online queries via a graphical user
interface API and work with the data cube in the data analysis via a cube API. A metadata
directory is used to guide access to the data cube. The data cube can be constructed by accessing
and/or integrating multiple databases via a database API that may support OLE DB or ODBC
connections.
Since an OLAM server may perform multiple data mining tasks, such as concept description,
association, classification, prediction, clustering, time-series analysis, and so on, it usually
consists of multiple integrated data mining modules and is more sophisticated than an OLAP
server.